Introduction to Spark as a Distributed Computing Framework

3 Scenarios

3 Hours

Industry

e-commerce

Skills

approach

distributed-processing

Tools

spark

Learning Objectives

Understand the advantages of Spark's in-memory processing and its speed compared to Hadoop MapReduce.

Learn how Spark supports multiple programming languages.

Learn to use the PySpark DataFrame API for structured data processing.

Overview

On Amazon's Data team, Python has traditionally played a crucial role in extracting insights by identifying loyal customers and gauging the effectiveness of marketing campaigns across various product categories.

However, as the scope of data analysis widened to encompass more product categories, Python alone began to struggle: larger data files meant longer ingestion times. This highlights a problem common to many data-heavy industries: the need for a more robust way to manage large, complex data sets efficiently.

Prerequisites

Data Wrangling using Python