


After successfully cleaning and analyzing the city demographics data, I moved to the core of the business problem: understanding retail sales patterns across GlobalBev's distribution channels.
My objectives were to:
Explore the sales transaction data
Understand which channels are performing best
Identify patterns that would inform the ML model for channel optimization
Uncover insights about product-channel relationships
I began by loading the retail sales dataset and examining its basic structure:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load retail sales data
df_sales = pd.read_csv('retailsalesdistribution.csv')
# Display first 10 rows
print(df_sales.head(10))
# Check the shape
print(f"\nDataset shape: {df_sales.shape}")The dataset contained 334,300 transactions with 6 columns: Date, City_ID, SKU_ID, Channel, Units_Sold, and Sales.
Looking at the data, I noticed SKU codes like "SKU1002", "SKU1004", etc. I had explored cities in Milestone 1, but now I needed to understand what these product SKUs represented.
I loaded the products dataset:
# Load products data
df_products = pd.read_csv('products.csv')
# Display all products (only 7 SKUs)
print(df_products)
This revealed 7 products in GlobalBev's portfolio - all Minute Maid fruit juices with different flavors and pack sizes.
I had to pause and understand: What exactly is an SKU? Is it just the flavor? Or does pack size matter?
After examining the products data, I learned that an SKU (Stock Keeping Unit) is uniquely defined by:
Product + Flavor + Pack Size = Unique SKU
For example:
Mixed Fruit flavor has 3 different SKUs (1L, 135ml Pouch, 250ml Bottle)
Orange flavor has 2 different SKUs (1L, 250ml Bottle)
Launch date is just metadata, not part of the SKU definition.
# Check how many SKUs exist per flavor
print(df_products.groupby('Flavor Variant')['SKU Identification Number'].count())
Output confirmed:
Apple: 1 SKU
Guava: 1 SKU
Mixed Fruit: 3 SKUs
Orange: 2 SKUs
Total: 7 SKUs
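To confirm that flavor alone doesn't define an SKU, I could also check that each flavor and pack-size combination maps to exactly one SKU (a quick sketch, assuming the column names shown above):
# Verify that Flavor Variant + Pack Size uniquely identifies each SKU
combo_counts = df_products.groupby(['Flavor Variant', 'Pack Size (ml/L)'])['SKU Identification Number'].nunique()
print(combo_counts)
print(f"\nEvery flavor + pack size maps to exactly one SKU: {combo_counts.max() == 1}")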
I also noticed SKU1005 was missing from the numbering sequence, which I verified in both datasets:
# Check which SKUs appear in sales data
print("SKUs in sales data:")
print(sorted(df_sales['SKU_ID'].unique()))
print(f"\nTotal unique SKUs in sales: {df_sales['SKU_ID'].nunique()}")
# Compare with products list
print("\nSKUs in products data:")
print(sorted(df_products['SKU Identification Number'].values))
SKU1005 didn't exist in either dataset - it was likely never created, or the numbering simply skipped it.
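To make the gap explicit, the numbering sequence itself can be checked programmatically (a small sketch, assuming every ID follows the 'SKU<number>' pattern):
# Extract the numeric part of each SKU ID and look for gaps between the lowest and highest number
sku_numbers = sorted(int(s.replace('SKU', '')) for s in df_products['SKU Identification Number'])
expected = set(range(sku_numbers[0], sku_numbers[-1] + 1))
missing = sorted(expected - set(sku_numbers))
print(f"Missing SKU numbers in the sequence: {missing}")  # should surface 1005, per the finding above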
Before diving into analysis, I checked for basic data quality issues:
# Check data types
print("Data types:")
print(df_sales.dtypes)
print("\n" + "="*50)
# Check for missing values
print("\nMissing values:")
print(df_sales.isnull().sum())
Good news: No missing values in any column.
However, I noticed the Date column was stored as object (string) instead of datetime. This needed fixing for time-based analysis:
# Convert Date to datetime
df_sales['Date'] = pd.to_datetime(df_sales['Date'])
# Verify conversion
print("After conversion:")
print(df_sales.dtypes)
# Check date range
print(f"\nDate range: {df_sales['Date'].min()} to {df_sales['Date'].max()}")The data covered 2 years: January 1, 2023 to December 31, 2024.
Now to the core question: What channels exist and how are they performing?
# Check unique channels
print("Unique Channels:")
print(df_sales['Channel'].unique())
print(f"\nTotal unique channels: {df_sales['Channel'].nunique()}")
# Count records per channel
print("\nRecords per channel:")
print(df_sales['Channel'].value_counts())
5 distribution channels emerged:
General Trade: 101,360 transactions (most)
E Commerce: 73,100 transactions
Modern Trade: 72,120 transactions
HoReCa: 58,480 transactions
Q Commerce: 29,240 transactions (least)
But transaction count isn't the full story...
I calculated total revenue and units sold per channel:
# Calculate total revenue and units per channel
channel_performance = df_sales.groupby('Channel').agg({
    'Sales': 'sum',
    'Units_Sold': 'sum',
    'Date': 'count'
}).round(2)
channel_performance.columns = ['Total_Revenue', 'Total_Units', 'Transaction_Count']
channel_performance = channel_performance.sort_values('Total_Revenue', ascending=False)
print(channel_performance)
Key Finding: General Trade dominated across all metrics:
Revenue: ₹1.58 billion
Units: 92M
Transactions: 101,360
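Since matplotlib and seaborn were imported at the start but not yet used, a quick bar chart makes the revenue gap easy to see (a sketch using the channel_performance frame built above):
# Bar chart of total revenue per channel
plt.figure(figsize=(8, 4))
sns.barplot(x=channel_performance.index, y=channel_performance['Total_Revenue'])
plt.title('Total Revenue by Channel')
plt.ylabel('Total Revenue (INR)')
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()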
But something interesting: Q Commerce had fewer transactions than HoReCa (29,240 vs 58,480) yet generated higher revenue (₹236M vs ₹202M).
This revenue-to-transaction mismatch suggested different channels had different transaction values. I calculated the average:
# Calculate average transaction value per channel
channel_performance['Avg_Revenue_Per_Transaction'] = (
    channel_performance['Total_Revenue'] / channel_performance['Transaction_Count']
).round(2)
print(channel_performance[['Total_Revenue', 'Transaction_Count', 'Avg_Revenue_Per_Transaction']])
Shocking discovery: General Trade's average transaction (₹15,633) was 194x larger than Q Commerce's (₹81)!
This suggested fundamentally different business models:
General Trade: Likely bulk B2B orders to retailers
Q Commerce: Small individual consumer orders
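Averages alone can hide skew, so one way to pressure-test the bulk-vs-individual interpretation is to look at the spread of order sizes per channel (an optional sketch on the raw transactions, not part of the original analysis):
# Spread of units per transaction for each channel (quartiles expose bulk vs single-serve ordering)
print(df_sales.groupby('Channel')['Units_Sold'].describe(percentiles=[0.25, 0.5, 0.75, 0.95]))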
I verified by checking units per transaction:
# Calculate average units per transaction
channel_performance['Avg_Units_Per_Transaction'] = (
    channel_performance['Total_Units'] / channel_performance['Transaction_Count']
).round(2)
print(channel_performance[['Avg_Revenue_Per_Transaction', 'Avg_Units_Per_Transaction']])
Interesting gap: Units were only 3.4x different (910 vs 269) but revenue was 194x different.
This massive gap meant there had to be huge price differences.
I calculated average price per unit:
# Calculate price per unit
channel_performance['Avg_Price_Per_Unit'] = (
    channel_performance['Total_Revenue'] / channel_performance['Total_Units']
).round(2)
print(channel_performance[['Avg_Units_Per_Transaction', 'Avg_Revenue_Per_Transaction', 'Avg_Price_Per_Unit']])
The anomaly: General Trade's price per unit (₹17.19) was 57x higher than Q Commerce's (₹0.30).
Wait - this didn't make sense. Q Commerce should charge premium prices for convenience, not have the lowest prices!
I suspected different channels might be selling different products. Time to investigate:
# Check SKU distribution across channels
sku_channel = df_sales.groupby(['Channel', 'SKU_ID']).agg({
    'Sales': 'sum',
    'Units_Sold': 'sum'
}).reset_index()
# For each channel, show top SKUs
for channel in df_sales['Channel'].unique():
    print(f"\n{channel}:")
    channel_data = sku_channel[sku_channel['Channel'] == channel].sort_values('Sales', ascending=False)
    print(channel_data)
Critical finding:
Q Commerce only sells 2 SKUs (SKU1002, SKU1004)
General Trade sells all 7 SKUs
This product mix difference could explain the pricing gap. I needed to see which pack sizes each channel was selling:
# Merge sales with products to see pack sizes
sales_with_product = df_sales.merge(
    df_products[['SKU Identification Number', 'Pack Size (ml/L)', 'Flavor Variant']],
    left_on='SKU_ID',
    right_on='SKU Identification Number',
    how='left'
)
# Check pack sizes sold per channel
print("Pack sizes sold by channel:")
for channel in sales_with_product['Channel'].unique():
    print(f"\n{channel}:")
    pack_dist = sales_with_product[sales_with_product['Channel'] == channel]['Pack Size (ml/L)'].value_counts()
    print(pack_dist)
Another puzzle: Q Commerce only sold 1L bottles, while General Trade sold all sizes including small 135ml pouches.
But this deepened the mystery: If Q Commerce only sells large bottles, why is its price per unit so much lower?
I needed to isolate the channel effect by looking at the same SKU across channels:
# Check price per unit for SKU1002 (sold in both Q Commerce and General Trade)
sku1002_data = df_sales[df_sales['SKU_ID'] == 'SKU1002'].groupby('Channel').agg({
    'Sales': 'sum',
    'Units_Sold': 'sum'
})
sku1002_data['Price_Per_Unit'] = (sku1002_data['Sales'] / sku1002_data['Units_Sold']).round(2)
print("SKU1002 (same product, different channels):")
print(sku1002_data)
Resolution: All channels sold SKU1002 at the exact same price per unit (₹0.30).
This meant the huge channel-level price differences weren't due to pricing strategies, but due to product mix - which SKUs each channel chose to sell.
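To make the mix effect concrete, a share-of-units view by channel and SKU shows exactly which products each channel's average is built from (a sketch reusing df_sales; rows sum to 100%):
# Share of each channel's units contributed by each SKU
unit_mix = pd.crosstab(df_sales['Channel'], df_sales['SKU_ID'],
                       values=df_sales['Units_Sold'], aggfunc='sum', normalize='index')
print((unit_mix * 100).round(1))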
Now I needed to see how prices varied across all SKUs:
# Calculate price per unit for each SKU
sku_pricing = df_sales.groupby('SKU_ID').agg({
    'Sales': 'sum',
    'Units_Sold': 'sum'
})
sku_pricing['Price_Per_Unit'] = (sku_pricing['Sales'] / sku_pricing['Units_Sold']).round(2)
# Merge with product info
sku_pricing_with_info = sku_pricing.merge(
    df_products[['SKU Identification Number', 'Pack Size (ml/L)', 'Flavor Variant']],
    left_index=True,
    right_on='SKU Identification Number'
)
print(sku_pricing_with_info[['SKU Identification Number', 'Pack Size (ml/L)', 'Price_Per_Unit']].sort_values('Price_Per_Unit'))
Strange pattern:
1L bottles: ₹0.30 per unit
135ml pouch: ₹40.50 per unit
250ml bottles: ₹75.00 per unit
The small packs appeared to be massively more expensive. But this felt off - was this a unit definition issue?
To normalize across pack sizes, I calculated price per ml:
# Calculate price per ml for each SKU
def extract_ml(pack_size):
    if 'L' in pack_size and 'ml' not in pack_size:
        return 1000  # 1L = 1000ml
    else:
        return int(''.join(filter(str.isdigit, pack_size)))
sku_pricing_with_info['Volume_ml'] = sku_pricing_with_info['Pack Size (ml/L)'].apply(extract_ml)
sku_pricing_with_info['Price_Per_ml'] = (sku_pricing_with_info['Price_Per_Unit'] / sku_pricing_with_info['Volume_ml']).round(4)
print(sku_pricing_with_info[['SKU Identification Number', 'Pack Size (ml/L)', 'Price_Per_Unit', 'Volume_ml', 'Price_Per_ml']])
Critical data quality issue discovered:
1L bottles: ₹0.0003 per ml
Small packs (135ml, 250ml): ₹0.30 per ml
Small packs were 1000x more expensive per ml than large bottles!
I paused to evaluate this against real-world CPG dynamics. Small convenience packs do typically carry a per-ml premium, but nowhere near this scale:
Real world: 2-5x more expensive per ml
My data: 1000x more expensive per ml
This contradicted reality. For context:
Real example: 250ml Coke ≈ ₹20 (₹0.08/ml) vs 2L bottle ≈ ₹80 (₹0.04/ml) = 2x difference
My data: 1000x difference
Moreover, my finding that Q-Commerce only sells 1L bottles contradicted actual Q-Commerce business models:
Blinkit, Zepto, Swiggy Instamart specialize in immediate consumption
They thrive on small, single-serve packs (250ml, 135ml)
Customers order when they want a drink NOW, not bulk storage
Small convenience packs are their highest margin items
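One way to flag this kind of anomaly programmatically before any modeling is to compare every SKU's per-ml price against the cheapest SKU and flag anything beyond a plausible premium band (a sketch on the sku_pricing_with_info frame built earlier; the 5x cutoff is my assumption, not a business rule):
# Flag SKUs whose per-ml price exceeds a plausible premium over the cheapest SKU
baseline = sku_pricing_with_info['Price_Per_ml'].min()
sku_pricing_with_info['Premium_vs_Cheapest'] = (sku_pricing_with_info['Price_Per_ml'] / baseline).round(1)
suspect = sku_pricing_with_info[sku_pricing_with_info['Premium_vs_Cheapest'] > 5]  # 5x = assumed sanity bound
print(suspect[['SKU Identification Number', 'Pack Size (ml/L)', 'Price_Per_ml', 'Premium_vs_Cheapest']])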
# Document findings
print("="*80)
print("DATA QUALITY ISSUE IDENTIFIED")
print("="*80)
print("\nFinding: Small pack pricing shows 1000x premium over large packs")
print("Expected: Real-world CPG pricing typically shows 2-5x premium")
print("\nPossible causes:")
print("1. Unit definition inconsistency across pack sizes")
print("2. Decimal point errors in data entry")
print("3. Different measurement units (cases vs individual units)")
print("\nImpact: Channel-level price analysis may be misleading")
print("Recommendation: Validate with business stakeholders before model training")Technical Skills:
Data merging across multiple datasets
Calculating derived metrics (averages, ratios)
Multi-level groupby aggregations
Cross-validation of findings across different cuts of data
Business Understanding:
SKU definitions in CPG/retail context
Channel business models (B2B bulk vs B2C individual)
Product mix impact on performance metrics
Real-world CPG pricing dynamics
Critical Thinking:
Identified data quality issues through domain knowledge validation
Understood when numbers don't match reality
Chose documentation over data fabrication
Recognized that finding issues is as valuable as finding insights
Insights Despite Data Quality Issues:
General Trade dominates in revenue, volume, and transactions
Q Commerce has different economics - fewer, smaller transactions
Product mix varies by channel - not all channels sell all SKUs
Same products are priced consistently across channels (when sold)
Data quality issues exist that need stakeholder validation before ML modeling
