Introduction to Spark as a Distributed Computing Framework
3 Scenarios 
3 Hours
Industry
e-commerce
Skills
approach
distributed-processing
Tools
spark
Learning Objectives
Understand the advantages of Spark's in-memory processing and its speed compared to Hadoop MapReduce.
Learn how Spark supports multiple programming languages.
Learn to use the PySpark DataFrame API for structured data processing.
Overview
On Amazon's Data team, Python has traditionally played a crucial role in extracting insights, such as identifying loyal customers and gauging the effectiveness of marketing campaigns across product categories.
However, as the analysis widened to cover more product categories, Python alone began to struggle, with ingestion times growing as the data files got larger. This reflects a problem common to many data-heavy industries: the need for a more robust way to manage large, complex data sets efficiently.
Prerequisites
- Data Wrangling using Python