


As a complete beginner in AWS Data Engineering, I set out with a simple but ambitious goal: build a data pipeline on AWS that could ingest, store, and query real retail data. I had a retail dataset with customer information, product catalogs, address data, and daily inventory records across 5 warehouses - perfect for learning data engineering concepts.
My vision was to:
Upload my data to AWS S3
Set up proper data cataloging
Query the data using SQL through AWS Athena
Learn AWS services hands-on, step by step
I started with a dataset containing:
3 CSV files: customers.csv (customer data), products.csv (product catalog), addresses.csv (customer addresses)
150 JSON files: Daily inventory data for 5 warehouses over 30 days
Total size: ~112 MB of real-world retail data
Perfect for learning - complex enough to be realistic, small enough to manage costs.
My first step was setting up AWS CLI and configuring credentials. This seemed straightforward until I hit my first challenge - permissions.
Challenge #1: AWS Permissions
Initially, my temp-user account couldn't create S3 buckets. I learned about IAM policies the hard way:
Started with basic S3 permissions
Escalated to AmazonS3FullAccess
Eventually needed AthenaFullAccess and Glue permissions
Finally settled on AdministratorAccess for learning
Learning: AWS permissions are granular and service-specific. Start broad for learning, then narrow down in production.
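For reference, attaching a managed policy to a user can also be scripted; here's a minimal boto3 sketch using one of the policies I cycled through (the user name matches my temp-user account):
import boto3

iam = boto3.client("iam")

# Attach a broad managed policy to the learning user.
# Swap the ARN for AmazonAthenaFullAccess, AWSGlueConsoleFullAccess, etc. as needed.
iam.attach_user_policy(
    UserName="temp-user",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)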
I decided to organize everything under a single S3 bucket: amit-choudhary-765432892721
s3://amit-choudhary-765432892721/
├── raw-data/ # Source data
├── athena-results/ # Query results
├── processed-data/ # Future processed files
└── glue-scripts/ # ETL scripts
This taught me the importance of data lake organization from day one.
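If you prefer scripting the bucket setup, a rough boto3 sketch looks like this (the region is an assumption; drop the location constraint if you use us-east-1):
import boto3

bucket = "amit-choudhary-765432892721"
s3 = boto3.client("s3", region_name="ap-south-1")  # assumed region

# Create the bucket, then stake out the top-level prefixes with empty marker objects
# (S3 has no real folders; prefixes appear automatically once objects are uploaded).
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
)
for prefix in ("raw-data/", "athena-results/", "processed-data/", "glue-scripts/"):
    s3.put_object(Bucket=bucket, Key=prefix)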
Uploading the 3 CSV files was smooth:
customers.csv (1.8MB) ✅
products.csv (100MB) ✅
addresses.csv (3MB) ✅
Then came the 150 JSON inventory files. I learned about the aws s3 sync command:
aws s3 sync "local-inventory-folder" s3://amit-choudhary-765432892721/raw-data/inventory/
Watching 150 files upload in parallel was satisfying! Total uploaded: 153 files, ~112 MB.
Learning: AWS CLI's sync command is powerful for batch uploads. Much better than uploading files one by one.
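aws s3 sync has no single boto3 equivalent, but a rough batch-upload sketch of the same step looks like this (unlike sync, it re-uploads everything rather than skipping unchanged files):
from pathlib import Path
import boto3

s3 = boto3.client("s3")
bucket = "amit-choudhary-765432892721"
local_dir = Path("local-inventory-folder")

# Walk the local folder and push every JSON file under the inventory prefix.
for path in local_dir.rglob("*.json"):
    key = f"raw-data/inventory/{path.relative_to(local_dir).as_posix()}"
    s3.upload_file(str(path), bucket, key)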
Setting up AWS Athena to query my data taught me about service roles. Unlike user permissions, AWS services need their own identities to access resources.
I had to create a Glue Service Role for the data catalog:
aws iam create-role --role-name AWSGlueServiceRole-retail --assume-role-policy-document '{...}'
aws iam attach-role-policy --role-name AWSGlueServiceRole-retail --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
Learning: Services like Glue crawlers need service roles, not user permissions. This is a key AWS security concept.
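For anyone scripting this instead of using the CLI, here's a rough boto3 equivalent of the two commands above (the trust policy is the standard one for Glue; the role also needs read access to the data bucket):
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRole-retail",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="AWSGlueServiceRole-retail",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)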
I used AWS Glue to automatically catalog my CSV files:
Created Glue Database: retail_db
Created Glue Crawler: Scanned my S3 data and inferred schemas
Generated Tables: Automatically created table definitions
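Scripted with boto3, those three steps come down to a database, a crawler, and a run (the crawler name here is illustrative):
import boto3

glue = boto3.client("glue")

# 1. Database for the catalog entries
glue.create_database(DatabaseInput={"Name": "retail_db"})

# 2. Crawler that scans the CSV prefix and infers schemas
glue.create_crawler(
    Name="retail-csv-crawler",  # illustrative name
    Role="AWSGlueServiceRole-retail",
    DatabaseName="retail_db",
    Targets={"S3Targets": [{"Path": "s3://amit-choudhary-765432892721/raw-data/csv/"}]},
)

# 3. Run it to generate the table definitions
glue.start_crawler(Name="retail-csv-crawler")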
The crawler found my data and created tables, but then I hit my biggest challenge...
My first Athena query returned 798,842 rows for "customers" but the results looked like this:
customerid: FSJ-186996
firstname: magnetic pin set of 2 pairs
lastname: accessories
The Problem: I had put all CSV files in one folder (raw-data/csv/), so Athena was reading customers, products, AND addresses as one mixed table!
The Solution: Proper data organization. I restructured everything:
raw-data/csv/
├── customers/customers.csv # Only customer data
├── products/products.csv # Only product data
└── addresses/addresses.csv # Only address data
Then I created separate Athena tables for each entity. This was a fundamental data modeling lesson.
Learning: In data lakes, folder structure = table structure. One entity type per folder!
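The reshuffle itself can be scripted too. A sketch of moving each CSV into its own entity prefix, assuming the originals sat directly under raw-data/csv/:
import boto3

s3 = boto3.client("s3")
bucket = "amit-choudhary-765432892721"

# S3 has no move operation, so copy each file into its entity folder and delete
# the original, then re-run the crawler (or create the tables) against the new prefixes.
for name in ("customers", "products", "addresses"):
    src = f"raw-data/csv/{name}.csv"
    dst = f"raw-data/csv/{name}/{name}.csv"
    s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": src}, Key=dst)
    s3.delete_object(Bucket=bucket, Key=src)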
After reorganization, my queries finally returned clean results:
Customers table: 20,000 clean customer records ✅
Products table: 720,109 product records ✅
Addresses table: Wait... 58,733 records instead of 30,000? 🤔
My addresses.csv should have had 30,000 records, but Athena reported 58,733. Looking at the query results, I saw corrupted data:
addressid: Haldia_060162"
city: "Billing
state: _Shipping"
Root Cause Investigation: I downloaded the raw CSV and discovered the issue - multi-line addresses with embedded commas and quotes:
OR-00001,CUST-01686,"61/96,_Chauhan_Road
Haldia_060162",Haldia,West_Bengal,60162.0,Billing
Athena was treating each line as a separate record!
I wrote a Python script to clean the problematic CSV:
import pandas as pd

def clean_addresses_csv(input_file, output_file):
    # Let pandas parse the quoted multi-line fields; skip rows it can't handle
    df = pd.read_csv(input_file, quotechar='"', on_bad_lines='skip')
    # Clean up embedded quotes, underscores, and newlines in text columns
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].str.replace(r'["\']', '', regex=True)
            df[col] = df[col].str.replace('_', ' ')
            df[col] = df[col].str.replace('\n', ' ')
    df.to_csv(output_file, index=False)
Results:
✅ Successfully processed and cleaned 30,000 address records
✅ Fixed multi-line formatting issues
✅ Normalized text fields
✅ Athena now returns correct count: 30,000
Learning: Real-world data is messy! Data quality issues are common and require preprocessing before analysis.
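One step worth making explicit: the cleaned file has to replace the raw one in S3 before Athena sees the fix. A minimal sketch (the local output filename is illustrative):
import boto3

s3 = boto3.client("s3")

# Overwrite the raw addresses object with the cleaned version;
# the next Athena query reads the corrected records automatically.
s3.upload_file(
    "addresses_clean.csv",  # output of clean_addresses_csv, name assumed
    "amit-choudhary-765432892721",
    "raw-data/csv/addresses/addresses.csv",
)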
After this journey, I successfully built a complete data pipeline:
S3 Data Lake: Organized storage for all data types
Glue Data Catalog: Automated schema discovery and metadata management
Athena Query Engine: SQL interface for data analysis
IAM Security: Proper roles and permissions for services
The data behind it:
Customers: 20,000 records - demographics, contact info, registration dates
Products: 720,109 records - product catalog with prices and categories
Addresses: 30,000 records - customer addresses (cleaned and normalized)
Inventory: 150 JSON files - daily stock levels across 5 warehouses
I can now run SQL queries like:
-- Customer distribution by state
SELECT state, COUNT(*) as customer_count
FROM retail_db.addresses
GROUP BY state
ORDER BY customer_count DESC;
-- Product categories analysis
SELECT category, COUNT(*) as product_count, AVG(price) as avg_price
FROM retail_db.products
GROUP BY category;
Key learnings from this project:
AWS Services Integration: S3, Glue, Athena, and IAM work together as an ecosystem
Data Lake Design: Folder structure matters - organize by entity type
Service Roles: AWS services need their own identities, separate from user permissions
Data Quality: Real data is messy and requires cleaning before analysis
Schema Evolution: Automated schema discovery is powerful but needs validation
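For anyone who wants to run these queries outside the console, here's a rough boto3 sketch that executes the first query and writes results to the athena-results/ prefix from the bucket layout above:
import boto3

athena = boto3.client("athena")

# Kick off the query; Athena writes the result CSV to the output location.
response = athena.start_query_execution(
    QueryString="""
        SELECT state, COUNT(*) AS customer_count
        FROM retail_db.addresses
        GROUP BY state
        ORDER BY customer_count DESC
    """,
    QueryExecutionContext={"Database": "retail_db"},
    ResultConfiguration={
        "OutputLocation": "s3://amit-choudhary-765432892721/athena-results/"
    },
)
print(response["QueryExecutionId"])  # poll get_query_execution for completion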
