Big Data Hadoop and Spark Developer Certification

Master the various components of the Hadoop ecosystem, including Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark. Get hands-on practice on CloudLab by implementing real-life projects in the domains of banking, telecommunication, social media, insurance, and e-commerce. The course is aligned with the Cloudera CCA175 certification and is best suited for IT, data management, and analytics professionals looking to gain expertise in Big Data.



Ronald van Loon

Top 10 Big Data & Data Science Influencer, Director - Adversitement

Named by Onalytica as one of the three most influential people in Big Data, Ronald is also an author for a number of leading Big Data and Data Science websites, including Datafloq, Data Science Central, and The Guardian, and he regularly speaks at renowned events.


Course Description

  • What is this course about?

    The Big Data Hadoop and Spark Developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in CloudLab.

    Mastering Hadoop and related tools: The course provides you with an in-depth understanding of the Hadoop framework including HDFS, YARN, and MapReduce. You will learn to use Pig, Hive, and Impala to process and analyze large datasets stored in the HDFS, and use Sqoop and Flume for data ingestion.

    Mastering real-time data processing using Spark: You will learn to do functional programming in Spark, implement Spark applications, understand parallel processing in Spark, and use Spark RDD optimization techniques. You will also learn the various interactive algorithms in Spark and use Spark SQL for creating, transforming, and querying DataFrames.

    As part of the course, you will be required to execute real-life, industry-based projects using CloudLab. The projects are in the domains of banking, telecommunication, social media, insurance, and e-commerce. This Big Data course also prepares you for the Cloudera CCA175 certification.
  • What are the course objectives?

    • Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
    • Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
    • Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
    • Get an overview of Sqoop and Flume and describe how to ingest data using them
    • Create databases and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
    • Understand different types of file formats, Avro schemas, using Avro with Hive and Sqoop, and schema evolution
    • Understand Flume, the Flume architecture, sources, Flume sinks, channels, and Flume configurations
    • Understand HBase, its architecture, and its data storage, learn to work with HBase, and understand the difference between HBase and an RDBMS
    • Gain a working knowledge of Pig and its components
    • Do functional programming in Spark
    • Understand resilient distributed datasets (RDDs) in detail
    • Implement and build Spark applications
    • Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
    • Understand the common use-cases of Spark and the various interactive algorithms
    • Learn Spark SQL, including creating, transforming, and querying DataFrames
    • Prepare for Cloudera Big Data CCA175 certification

  • Who should take this course?

    Big Data career opportunities are on the rise, and Hadoop is quickly becoming a must-know technology for the following professionals:
    • Software Developers and Architects
    • Analytics Professionals
    • Senior IT professionals
    • Testing and Mainframe professionals
    • Data Management Professionals
    • Business Intelligence Professionals
    • Project Managers
    • Aspiring Data Scientists
    • Graduates looking to build a career in Big Data Analytics

    Prerequisite:
    • As knowledge of Java is necessary for this course, we provide complimentary access to the “Java Essentials for Hadoop” course
    • For Spark, we use Python and Scala; an e-book is provided to help you with both
    • Knowledge of an operating system like Linux is useful for the course

  • What is CloudLab?

    CloudLab is a cloud-based Hadoop and Spark lab environment that Simplilearn offers with the course to ensure hassle-free execution of the hands-on projects you need to complete in the Hadoop and Spark Developer course.

    With CloudLab, you do not need to install and maintain Hadoop or Spark on a virtual machine. Instead, you’ll be able to access a preconfigured environment on CloudLab via your browser. This closely resembles the environments companies use today to improve the scalability and availability of their Hadoop installations.

    You’ll have access to CloudLab from the Simplilearn LMS (Learning Management System) for the duration of the course. You can learn more about CloudLab by viewing our CloudLab video.

  • What projects are included in this course?

    The course includes 5 real-life, industry-based projects. CloudLab has been provided for a hassle-free execution of these projects. Successful evaluation of one of the following 2 projects is a part of the certification eligibility criteria.

    Project 1
    Domain- Banking
    Description- A Portuguese banking institution ran a marketing campaign to convince potential customers to invest in a bank term deposit. The marketing campaign was based on phone calls. Often, the same customer was contacted by phone more than once to assess whether they would subscribe to the bank term deposit. You have to analyze the data collected through the marketing campaign.

    Project 2
    Domain- Telecommunication
    Description- A mobile phone service provider has introduced a new Open Network campaign. The company has invited users to raise a complaint about the towers in their locality if they face issues with their mobile network, and it has collected a dataset of the users who raised complaints. The fourth and fifth fields of the dataset contain the latitude and longitude of each user, which is important information for the company. You have to extract the latitude and longitude from the available dataset and create three clusters of users with the k-means algorithm.
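
    As a rough illustration of the clustering step described in Project 2, the sketch below reads the fourth and fifth fields of each record and groups the users into three clusters with a basic k-means loop. It is written in plain Python rather than Spark so that it runs anywhere, and it is not the project solution. The sample records, the comma separator, and the helper names (parse_coordinates, nearest, kmeans) are all assumptions made for this example. It also treats latitude and longitude as flat coordinates, which is only an approximation over small areas.

```python
import math


def parse_coordinates(line, sep=","):
    """Read latitude and longitude from the fourth and fifth fields."""
    fields = line.split(sep)
    return float(fields[3]), float(fields[4])


def nearest(point, centroids):
    """Return the index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k=3, iterations=20):
    """Basic k-means with deterministic farthest-point seeding."""
    # Seeding: start from the first point, then repeatedly add the point
    # that lies farthest from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    # Final assignment so the labels match the returned centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical complaint records; only the position of the latitude and
# longitude fields (fourth and fifth) comes from the project description.
records = [
    "1,u01,2017-01-05,12.97,77.59,call drop",
    "2,u02,2017-01-05,12.98,77.60,no signal",
    "3,u03,2017-01-06,19.07,72.87,call drop",
    "4,u04,2017-01-06,19.08,72.88,slow data",
    "5,u05,2017-01-07,28.61,77.20,no signal",
    "6,u06,2017-01-07,28.62,77.21,call drop",
]

points = [parse_coordinates(record) for record in records]
centroids, labels = kmeans(points, k=3)
print(labels)
```

    The farthest-point seeding is a design choice for this sketch: it keeps the result repeatable, whereas production implementations usually rely on randomized seeding and run on the full dataset in a distributed engine.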

    For further practice, we have three more projects to help you start your Hadoop and Spark journey.

    Project 3
    Domain- Social Media
    Description- As part of a recruiting exercise, a major social media company asked candidates to analyze a data set from Stack Exchange. You will use the data set to arrive at certain key insights.

    Project 4
    Domain- Website providing movie-related information
    Description- IMDb is an online database of movie-related information. IMDb users rate the movies and provide reviews. They rate the movies on a scale of 1 to 5, with 1 being the worst and 5 being the best. The data set also has additional information, such as the release year of the movie. You have to analyze the data collected.

    Project 5
    Domain- Insurance
    Description- A US-based insurance provider has decided to launch a new medical insurance program targeting various customers. To help a customer understand the current realities and the market better, you have to perform a series of data analyses using Hadoop.


Course Benefits

  • 40 hours of instructor-led training

  • 5 real-life industry projects in the banking, telecom, social media, insurance, and e-commerce domains

  • Includes training on YARN, MapReduce, Pig, Hive, Impala, HBase, and Apache Spark

  • 24 hours of self-paced video

  • Hands-on practice with CloudLab

  • Aligned to Cloudera CCA175 certification exam


Course Content

Introduction to Big Data and Hadoop


  • Lesson 00 - Course Introduction

  • Lesson 01 - Introduction to Big Data and Hadoop Ecosystem

  • Lesson 02 - HDFS and YARN

  • Lesson 03 - MapReduce and Sqoop

  • Lesson 04 - Basics of Hive and Impala

  • Lesson 05 - Working with Hive and Impala

  • Lesson 06 - Types of Data Formats

  • Lesson 07 - Advanced Hive Concepts and Data File Partitioning

  • Lesson 08 - Apache Flume and HBase

  • Lesson 09 - Pig

  • Lesson 10 - Basics of Apache Spark

  • Lesson 11 - RDDs in Spark

  • Lesson 12 - Implementation of Spark Applications

  • Lesson 13 - Spark Parallel Processing

  • Lesson 14 - Spark RDD Optimization Techniques

  • Lesson 15 - Spark Algorithm

  • What's next?

  • Lesson 16 - Spark SQL

  • Projects

  • Simulation Test Paper Instructions

  • Course Feedback

FREE COURSE - Java Essentials for Hadoop


  • Lesson 01 - Essentials of Java for Hadoop

  • Lesson 02 - Java Constructors

  • Lesson 03 - Essential Classes and Exceptions in Java

Exam and Certification

What do I need to do to unlock my certificate?


  • Complete 85% of the course.

  • Complete 1 project and 1 simulation test with a minimum score of 80%.


What are the System Requirements?

To do the projects, just log in to CloudLab from your LMS.



Self-Paced Learning

180 days of access to high-quality, self-paced learning content designed by industry experts.
