Introduction to Big Data with Spark and Hadoop

This course is part of multiple programs.

Instructors: Aije Egwaikhide +2 more

What you'll learn

  •   Explain the impact of big data, including use cases, tools, and processing methods.
  •   Describe Apache Hadoop architecture, ecosystem, practices, and user-related applications, including Hive, HDFS, HBase, Spark, and MapReduce.
  •   Apply Spark programming basics, including parallel programming with DataFrames, data sets, and Spark SQL.
  •   Use Spark’s RDDs and data sets, optimize Spark SQL using Catalyst and Tungsten, and use Spark’s development and runtime environment options.

Skills you'll gain

  •   Distributed Computing
  •   Docker (Software)
  •   Apache Hadoop
  •   Apache Spark
  •   Apache Hive
  •   Big Data
  •   Data Processing
  •   PySpark
  •   Scalability
  •   IBM Cloud
  •   Kubernetes
  •   Data Transformation
  •   Performance Tuning
  •   Debugging

There are 7 modules in this course

    Bernard Marr defines big data as the digital trace that we generate in this digital era. You will start the course by learning what big data is and exploring how insights from big data can be harnessed for a variety of use cases. You’ll also explore how big data processing relies on techniques such as parallel processing, scaling, and data parallelism.

    Next, you will learn about Hadoop, an open-source framework for the distributed processing of large data sets, and its ecosystem. You will discover important components and applications that go hand in hand with Hadoop, such as the Hadoop Distributed File System (HDFS), MapReduce, and HBase. You will become familiar with Hive, data warehouse software that provides an SQL-like interface for efficiently querying and manipulating large data sets.

    You’ll then gain insights into Apache Spark, an open-source processing engine that provides users with new ways to store and use big data. You will discover how to leverage Spark to deliver reliable insights. The course provides an overview of the platform and the components that make up Apache Spark. You’ll learn about DataFrames, perform basic DataFrame operations, and work with Spark SQL. You’ll also explore how Spark processes and monitors the requests your application submits and how you can track work using the Spark Application UI.

    This course has several hands-on labs to help you apply and practice the concepts you learn. You will complete Hadoop and Spark labs using various tools and technologies, including Docker, Kubernetes, Python, and Jupyter Notebooks.

    Introduction to the Hadoop Ecosystem

    Apache Spark

    DataFrames and Spark SQL

    Development and Runtime Environment Options

    Monitoring and Tuning

    Final Project and Assessment

    ©2025  ementorhub.com. All rights reserved