
As a complete beginner in AWS Data Engineering, I set out with a simple but ambitious goal: build a data pipeline on AWS that could ingest, store, and query real retail data. I had a retail dataset with customer information, product catalogs, address data, and daily inventory records across 5 warehouses - perfect for learning data engineering concepts.
My vision was to:
Upload my data to AWS S3
Set up proper data cataloging
Query the data using SQL through AWS Athena
Learn AWS services hands-on, step by step
I started with a dataset containing:
3 CSV files: customers.csv (customer data), products.csv (product catalog), addresses.csv (customer addresses)
150 JSON files: Daily inventory data for 5 warehouses over 30 days
Total size: ~112 MB of real-world retail data
Perfect for learning - complex enough to be realistic, small enough to manage costs.
My first step was setting up AWS CLI and configuring credentials. This seemed straightforward until I hit my first challenge - permissions.
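The basic setup itself was just a couple of commands (a sketch; access keys, region, and output format are entered interactively at the prompts):
aws configure                  # prompts for access key, secret key, default region, and output format
aws sts get-caller-identity    # sanity check that the configured credentials actually work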
Challenge #1: AWS Permissions
Initially, my temp-user account couldn't create S3 buckets. I learned about IAM policies the hard way:
Started with basic S3 permissions
Escalated to AmazonS3FullAccess
Eventually needed AmazonAthenaFullAccess and Glue permissions
Finally settled on AdministratorAccess for learning
Learning: AWS permissions are granular and service-specific. Start broad for learning, then narrow down in production.
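For reference, attaching those managed policies to temp-user from the CLI looks like this (a sketch; the ARNs are the standard AWS-managed policies):
aws iam attach-user-policy --user-name temp-user --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-user-policy --user-name temp-user --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess
aws iam attach-user-policy --user-name temp-user --policy-arn arn:aws:iam::aws:policy/AdministratorAccess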
I decided to organize everything under a single S3 bucket: amit-choudhary-765432892721
s3://amit-choudhary-765432892721/
├── raw-data/ # Source data
├── athena-results/ # Query results
├── processed-data/ # Future processed files
└── glue-scripts/ # ETL scripts
This taught me the importance of data lake organization from day one.
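Creating the bucket itself is a single command (a sketch; the region shown is an assumption):
aws s3 mb s3://amit-choudhary-765432892721 --region ap-south-1   # region is illustrative
The prefixes don't need to be created up front - in S3 they appear as soon as the first object is uploaded under them.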
Uploading the 3 CSV files was smooth:
customers.csv (1.8MB) ✅
products.csv (100MB) ✅
addresses.csv (3MB) ✅
Then came the 150 JSON inventory files. I learned about the aws s3 sync command:
aws s3 sync "local-inventory-folder" s3://amit-choudhary-765432892721/raw-data/inventory/
Watching 150 files upload in parallel was satisfying! Total uploaded: 153 files, ~112MB.
Learning: AWS CLI's sync command is powerful for batch uploads. Much better than uploading files one by one.
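To confirm nothing was missed, a recursive listing with a summary shows the object count and total size (a sketch):
aws s3 ls s3://amit-choudhary-765432892721/raw-data/ --recursive --summarize
The output ends with Total Objects and Total Size lines, which should match the 153 files and ~112MB.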
Setting up AWS Athena to query my data taught me about service roles. Unlike user permissions, AWS services need their own identities to access resources.
I had to create a Glue Service Role for the data catalog:
aws iam create-role --role-name AWSGlueServiceRole-retail --assume-role-policy-document '{...}'
aws iam attach-role-policy --role-name AWSGlueServiceRole-retail --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
Learning: Services like Glue crawlers need service roles, not user permissions. This is a key AWS security concept.
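The trust policy I abbreviated as '{...}' above is the standard Glue trust relationship; written out, the create-role call looks roughly like this:
aws iam create-role --role-name AWSGlueServiceRole-retail --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "glue.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}'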
I used AWS Glue to automatically catalog my CSV files:
Created Glue Database: retail_db
Created Glue Crawler: Scanned my S3 data and inferred schemas
Generated Tables: Automatically created table definitions
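For anyone following along, the CLI equivalents of those three steps look roughly like this (the crawler name is illustrative; the role is the one created earlier):
aws glue create-database --database-input '{"Name": "retail_db"}'
aws glue create-crawler --name retail-csv-crawler --role AWSGlueServiceRole-retail --database-name retail_db --targets '{"S3Targets": [{"Path": "s3://amit-choudhary-765432892721/raw-data/csv/"}]}'
aws glue start-crawler --name retail-csv-crawler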
The crawler found my data and created tables, but then I hit my biggest challenge...
My first Athena query returned 798,842 rows for "customers" but the results looked like this:
customerid: FSJ-186996
firstname: magnetic pin set of 2 pairs
lastname: accessories
The Problem: I had put all CSV files in one folder (raw-data/csv/), so Athena was reading customers, products, AND addresses as one mixed table!
The Solution: Proper data organization. I restructured everything:
raw-data/csv/
├── customers/customers.csv # Only customer data
├── products/products.csv # Only product data
└── addresses/addresses.csv # Only address data
Then created separate Athena tables for each entity. This was a fundamental data modeling lesson.
Learning: In data lakes, folder structure = table structure. One entity type per folder!
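The reshuffle itself was just a matter of moving objects within the bucket and re-running the crawler (a sketch; only the customers move is shown, and the crawler name matches the earlier sketch):
aws s3 mv s3://amit-choudhary-765432892721/raw-data/csv/customers.csv s3://amit-choudhary-765432892721/raw-data/csv/customers/customers.csv
# Repeat for products.csv and addresses.csv, then re-run the crawler so it picks up one table per folder:
aws glue start-crawler --name retail-csv-crawler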
After reorganization, my queries finally returned clean results:
Customers table: 20,000 clean customer records ✅
Products table: 720,109 product records ✅
Addresses table: Wait... 58,733 records instead of 30,000? 🤔
My addresses.csv should have had 30,000 records, but Athena reported 58,733. Looking at the query results, I saw corrupted data:
addressid: Haldia_060162"
city: "Billing
state: _Shipping"
Root Cause Investigation: I downloaded the raw CSV and discovered the issue - multi-line addresses with embedded commas and quotes:
OR-00001,CUST-01686,"61/96,_Chauhan_Road
Haldia_060162",Haldia,West_Bengal,60162.0,Billing
Athena was treating each line as a separate record!
I wrote a Python script to clean the problematic CSV:
import pandas as pd

def clean_addresses_csv(input_file, output_file):
    # quotechar lets pandas reassemble multi-line quoted fields into single records
    df = pd.read_csv(input_file, quotechar='"', on_bad_lines='skip')
    # Clean up embedded quotes, underscores, and newlines
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].str.replace(r'["\']', '', regex=True)
            df[col] = df[col].str.replace('_', ' ', regex=False)
            df[col] = df[col].str.replace('\n', ' ', regex=False)
    df.to_csv(output_file, index=False)
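With the cleaned file in hand, it simply replaces the old object in its entity folder (a sketch; the local filename is illustrative):
aws s3 cp addresses_cleaned.csv s3://amit-choudhary-765432892721/raw-data/csv/addresses/addresses.csv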
Results:
✅ Successfully processed and cleaned 30,000 address records
✅ Fixed multi-line formatting issues
✅ Normalized text fields
✅ Athena now returns correct count: 30,000
Learning: Real-world data is messy! Data quality issues are common and require preprocessing before analysis.
After this journey, I successfully built a complete data pipeline:
S3 Data Lake: Organized storage for all data types
Glue Data Catalog: Automated schema discovery and metadata management
Athena Query Engine: SQL interface for data analysis
IAM Security: Proper roles and permissions for services
The pipeline now manages four datasets:
Customers: 20,000 records - demographics, contact info, registration dates
Products: 720,109 records - product catalog with prices and categories
Addresses: 30,000 records - customer addresses (cleaned and normalized)
Inventory: 150 JSON files - daily stock levels across 5 warehouses
I can now run SQL queries like:
-- Customer distribution by state
SELECT state, COUNT(*) as customer_count
FROM retail_db.addresses
GROUP BY state
ORDER BY customer_count DESC;
-- Product categories analysis
SELECT category, COUNT(*) as product_count, AVG(price) as avg_price
FROM retail_db.products
GROUP BY category;
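The same queries can also be run from the CLI, with results landing in the athena-results/ prefix set up earlier (a sketch; the execution ID comes from the first command's output):
aws athena start-query-execution --query-string "SELECT state, COUNT(*) AS customer_count FROM retail_db.addresses GROUP BY state ORDER BY customer_count DESC" --result-configuration OutputLocation=s3://amit-choudhary-765432892721/athena-results/
aws athena get-query-results --query-execution-id <query-execution-id>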
Key learnings from this project:
AWS Services Integration: S3, Glue, Athena, and IAM work together as an ecosystem
Data Lake Design: Folder structure matters - organize by entity type
Service Roles: AWS services need their own identities, separate from user permissions
Data Quality: Real data is messy and requires cleaning before analysis
Schema Evolution: Automated schema discovery is powerful but needs validation