


After successfully cleaning and analyzing the city demographics data, I moved to the core of the business problem: understanding retail sales patterns across GlobalBev's distribution channels.
My objectives were to:
Explore the sales transaction data
Understand which channels are performing best
Identify patterns that would inform the ML model for channel optimization
Uncover insights about product-channel relationships
I began by loading the retail sales dataset and examining its basic structure:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load retail sales data
df_sales = pd.read_csv('retailsalesdistribution.csv')
# Display first 10 rows
print(df_sales.head(10))
# Check the shape
print(f"\nDataset shape: {df_sales.shape}")The dataset contained 334,300 transactions with 6 columns: Date, City_ID, SKU_ID, Channel, Units_Sold, and Sales.
Looking at the data, I noticed SKU codes like "SKU1002", "SKU1004", etc. I had explored cities in Milestone 1, but now I needed to understand what these product SKUs represented.
I loaded the products dataset:
# Load products data
df_products = pd.read_csv('products.csv')
# Display all products (only 7 SKUs)
print(df_products)
This revealed 7 products in GlobalBev's portfolio - all Minute Maid fruit juices with different flavors and pack sizes.
I had to pause and understand: What exactly is an SKU? Is it just the flavor? Or does pack size matter?
After examining the products data, I learned that an SKU (Stock Keeping Unit) is uniquely defined by:
Product + Flavor + Pack Size = Unique SKU
For example:
Mixed Fruit flavor has 3 different SKUs (1L, 135ml Pouch, 250ml Bottle)
Orange flavor has 2 different SKUs (1L, 250ml Bottle)
Launch date is just metadata, not part of the SKU definition.
# Check how many SKUs exist per flavor
print(df_products.groupby('Flavor Variant')['SKU Identification Number'].count())
Output confirmed:
Apple: 1 SKU
Guava: 1 SKU
Mixed Fruit: 3 SKUs
Orange: 2 SKUs
Total: 7 SKUs
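To confirm that flavor alone doesn't define an SKU, I could also check that each flavor and pack-size combination maps to exactly one SKU (a quick sketch, assuming the column names shown above):
# Verify that Flavor Variant + Pack Size uniquely identifies each SKU
combo_counts = df_products.groupby(['Flavor Variant', 'Pack Size (ml/L)'])['SKU Identification Number'].nunique()
print(combo_counts)
print(f"\nEvery flavor + pack size maps to exactly one SKU: {combo_counts.max() == 1}")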
I also noticed SKU1005 was missing from the numbering sequence, which I verified in both datasets:
# Check which SKUs appear in sales data
print("SKUs in sales data:")
print(sorted(df_sales['SKU_ID'].unique()))
print(f"\nTotal unique SKUs in sales: {df_sales['SKU_ID'].nunique()}")
# Compare with products list
print("\nSKUs in products data:")
print(sorted(df_products['SKU Identification Number'].values))
SKU1005 didn't exist in either dataset - it was likely never created, or the numbering simply skipped it.
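To make the gap explicit, the numbering sequence itself can be checked programmatically (a small sketch, assuming every ID follows the 'SKU<number>' pattern):
# Extract the numeric part of each SKU ID and look for gaps between the lowest and highest number
sku_numbers = sorted(int(s.replace('SKU', '')) for s in df_products['SKU Identification Number'])
expected = set(range(sku_numbers[0], sku_numbers[-1] + 1))
missing = sorted(expected - set(sku_numbers))
print(f"Missing SKU numbers in the sequence: {missing}")  # should surface 1005, per the finding above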
Before diving into analysis, I checked for basic data quality issues:
# Check data types
print("Data types:")
print(df_sales.dtypes)
print("\n" + "="*50)
# Check for missing values
print("\nMissing values:")
print(df_sales.isnull().sum())
Good news: No missing values in any column.
However, I noticed the Date column was stored as object (string) instead of datetime. This needed fixing for time-based analysis:
# Convert Date to datetime
df_sales['Date'] = pd.to_datetime(df_sales['Date'])
# Verify conversion
print("After conversion:")
print(df_sales.dtypes)
# Check date range
print(f"\nDate range: {df_sales['Date'].min()} to {df_sales['Date'].max()}")The data covered 2 years: January 1, 2023 to December 31, 2024.
Now to the core question: What channels exist and how are they performing?
# Check unique channels
print("Unique Channels:")
print(df_sales['Channel'].unique())
print(f"\nTotal unique channels: {df_sales['Channel'].nunique()}")
# Count records per channel
print("\nRecords per channel:")
print(df_sales['Channel'].value_counts())
5 distribution channels emerged:
General Trade: 101,360 transactions (most)
E Commerce: 73,100 transactions
Modern Trade: 72,120 transactions
HoReCa: 58,480 transactions
Q Commerce: 29,240 transactions (least)
But transaction count isn't the full story...
I calculated total revenue and units sold per channel:
# Calculate total revenue and units per channel
channel_performance = df_sales.groupby('Channel').agg({
    'Sales': 'sum',
    'Units_Sold': 'sum',
    'Date': 'count'
}).round(2)
channel_performance.columns = ['Total_Revenue', 'Total_Units', 'Transaction_Count']
channel_performance = channel_performance.sort_values('Total_Revenue', ascending=False)
print(channel_performance)
Key Finding: General Trade dominated across all metrics:
Revenue: ₹1.58 billion
Units: 92M
Transactions: 101,360
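Since matplotlib and seaborn were imported at the start but not yet used, a quick bar chart makes the revenue gap easy to see (a sketch using the channel_performance frame built above):
# Bar chart of total revenue per channel
plt.figure(figsize=(8, 4))
sns.barplot(x=channel_performance.index, y=channel_performance['Total_Revenue'])
plt.title('Total Revenue by Channel')
plt.ylabel('Total Revenue (INR)')
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()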
But something interesting: Q Commerce had fewer transactions than HoReCa (29,240 vs 58,480) yet generated higher revenue (₹236M vs ₹202M).
This revenue-to-transaction mismatch suggested different channels had different transaction values. I calculated the average:
# Calculate average transaction value per channel
channel_performance['Avg_Revenue_Per_Transaction'] = (
    channel_performance['Total_Revenue'] / channel_performance['Transaction_Count']
).round(2)
print(channel_performance[['Total_Revenue', 'Transaction_Count', 'Avg_Revenue_Per_Transaction']])
Shocking discovery: General Trade's average transaction (₹15,633) was 194x larger than Q Commerce's (₹81)!
This suggested fundamentally different business models:
General Trade: Likely bulk B2B orders to retailers
Q Commerce: Small individual consumer orders
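Averages alone can hide skew, so one way to pressure-test the bulk-vs-individual interpretation is to look at the spread of order sizes per channel (an optional sketch on the raw transactions, not part of the original analysis):
# Spread of units per transaction for each channel (quartiles expose bulk vs single-serve ordering)
print(df_sales.groupby('Channel')['Units_Sold'].describe(percentiles=[0.25, 0.5, 0.75, 0.95]))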
I verified by checking units per transaction:
# Calculate average units per transaction
channel_performance['Avg_Units_Per_Transaction'] = (
    channel_performance['Total_Units'] / channel_performance['Transaction_Count']
).round(2)
print(channel_performance[['Avg_Revenue_Per_Transaction', 'Avg_Units_Per_Transaction']])
Interesting gap: Units were only 3.4x different (910 vs 269) but revenue was 194x different.
This massive gap meant there had to be huge price differences.
I calculated average price per unit:
# Calculate price per unit
channel_performance['Avg_Price_Per_Unit'] = (
    channel_performance['Total_Revenue'] / channel_performance['Total_Units']
).round(2)
print(channel_performance[['Avg_Units_Per_Transaction', 'Avg_Revenue_Per_Transaction', 'Avg_Price_Per_Unit']])
The anomaly: General Trade's price per unit (₹17.19) was 57x higher than Q Commerce's (₹0.30).
Wait - this didn't make sense. Q Commerce should charge premium prices for convenience, not have the lowest prices!
I suspected different channels might be selling different products. Time to investigate:
# Check SKU distribution across channels
sku_channel = df_sales.groupby(['Channel', 'SKU_ID']).agg({
    'Sales': 'sum',
    'Units_Sold': 'sum'
}).reset_index()
# For each channel, show top SKUs
for channel in df_sales['Channel'].unique():
    print(f"\n{channel}:")
    channel_data = sku_channel[sku_channel['Channel'] == channel].sort_values('Sales', ascending=False)
    print(channel_data)
Critical finding:
Q Commerce only sells 2 SKUs (SKU1002, SKU1004)
General Trade sells all 7 SKUs
This product mix difference could explain the pricing gap. I needed to see which pack sizes each channel was selling:
# Merge sales with products to see pack sizes
sales_with_product = df_sales.merge(
    df_products[['SKU Identification Number', 'Pack Size (ml/L)', 'Flavor Variant']],
    left_on='SKU_ID',
    right_on='SKU Identification Number',
    how='left'
)
# Check pack sizes sold per channel
print("Pack sizes sold by channel:")
for channel in sales_with_product['Channel'].unique():
    print(f"\n{channel}:")
    pack_dist = sales_with_product[sales_with_product['Channel'] == channel]['Pack Size (ml/L)'].value_counts()
    print(pack_dist)
Another puzzle: Q Commerce only sold 1L bottles, while General Trade sold all sizes including small 135ml pouches.
But this deepened the mystery: If Q Commerce only sells large bottles, why is its price per unit so much lower?
I needed to isolate the channel effect by looking at the same SKU across channels:
# Check price per unit for SKU1002 (sold in both Q Commerce and General Trade)
sku1002_data = df_sales[df_sales['SKU_ID'] == 'SKU1002'].groupby('Channel').agg({
    'Sales': 'sum',
    'Units_Sold': 'sum'
})
sku1002_data['Price_Per_Unit'] = (sku1002_data['Sales'] / sku1002_data['Units_Sold']).round(2)
print("SKU1002 (same product, different channels):")
print(sku1002_data)
Resolution: All channels sold SKU1002 at the exact same price per unit (₹0.30).
This meant the huge channel-level price differences weren't due to pricing strategies, but due to product mix - which SKUs each channel chose to sell.
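To make the mix effect concrete, a share-of-units view by channel and SKU shows exactly which products each channel's average is built from (a sketch reusing df_sales; rows sum to 100%):
# Share of each channel's units contributed by each SKU
unit_mix = pd.crosstab(df_sales['Channel'], df_sales['SKU_ID'],
                       values=df_sales['Units_Sold'], aggfunc='sum', normalize='index')
print((unit_mix * 100).round(1))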
Now I needed to see how prices varied across all SKUs:
# Calculate price per unit for each SKU
sku_pricing = df_sales.groupby('SKU_ID').agg({
    'Sales': 'sum',
    'Units_Sold': 'sum'
})
sku_pricing['Price_Per_Unit'] = (sku_pricing['Sales'] / sku_pricing['Units_Sold']).round(2)
# Merge with product info
sku_pricing_with_info = sku_pricing.merge(
    df_products[['SKU Identification Number', 'Pack Size (ml/L)', 'Flavor Variant']],
    left_index=True,
    right_on='SKU Identification Number'
)
print(sku_pricing_with_info[['SKU Identification Number', 'Pack Size (ml/L)', 'Price_Per_Unit']].sort_values('Price_Per_Unit'))
Strange pattern:
1L bottles: ₹0.30 per unit
135ml pouch: ₹40.50 per unit
250ml bottles: ₹75.00 per unit
The small packs appeared to be massively more expensive. But this felt off - was this a unit definition issue?
To normalize across pack sizes, I calculated price per ml:
# Calculate price per ml for each SKU
def extract_ml(pack_size):
    if 'L' in pack_size and 'ml' not in pack_size:
        return 1000  # 1L = 1000ml
    else:
        return int(''.join(filter(str.isdigit, pack_size)))
sku_pricing_with_info['Volume_ml'] = sku_pricing_with_info['Pack Size (ml/L)'].apply(extract_ml)
sku_pricing_with_info['Price_Per_ml'] = (sku_pricing_with_info['Price_Per_Unit'] / sku_pricing_with_info['Volume_ml']).round(4)
print(sku_pricing_with_info[['SKU Identification Number', 'Pack Size (ml/L)', 'Price_Per_Unit', 'Volume_ml', 'Price_Per_ml']])
Critical data quality issue discovered:
1L bottles: ₹0.0003 per ml
Small packs (135ml, 250ml): ₹0.30 per ml
Small packs were 1000x more expensive per ml than large bottles!
I paused to evaluate this against real-world CPG dynamics. Small convenience packs do typically carry a per-ml premium, but nowhere near this scale:
Real world: 2-5x more expensive per ml
My data: 1000x more expensive per ml
This contradicted reality. For context:
Real example: 250ml Coke ≈ ₹20 (₹0.08/ml) vs 2L bottle ≈ ₹80 (₹0.04/ml) = 2x difference
My data: 1000x difference
Moreover, my finding that Q-Commerce only sells 1L bottles contradicted actual Q-Commerce business models:
Blinkit, Zepto, Swiggy Instamart specialize in immediate consumption
They thrive on small, single-serve packs (250ml, 135ml)
Customers order when they want a drink NOW, not bulk storage
Small convenience packs are their highest margin items
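One way to flag this kind of anomaly programmatically before any modeling is to compare every SKU's per-ml price against the cheapest SKU and flag anything beyond a plausible premium band (a sketch on the sku_pricing_with_info frame built earlier; the 5x cutoff is my assumption, not a business rule):
# Flag SKUs whose per-ml price exceeds a plausible premium over the cheapest SKU
baseline = sku_pricing_with_info['Price_Per_ml'].min()
sku_pricing_with_info['Premium_vs_Cheapest'] = (sku_pricing_with_info['Price_Per_ml'] / baseline).round(1)
suspect = sku_pricing_with_info[sku_pricing_with_info['Premium_vs_Cheapest'] > 5]  # 5x = assumed sanity bound
print(suspect[['SKU Identification Number', 'Pack Size (ml/L)', 'Price_Per_ml', 'Premium_vs_Cheapest']])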
# Document findings
print("="*80)
print("DATA QUALITY ISSUE IDENTIFIED")
print("="*80)
print("\nFinding: Small pack pricing shows 1000x premium over large packs")
print("Expected: Real-world CPG pricing typically shows 2-5x premium")
print("\nPossible causes:")
print("1. Unit definition inconsistency across pack sizes")
print("2. Decimal point errors in data entry")
print("3. Different measurement units (cases vs individual units)")
print("\nImpact: Channel-level price analysis may be misleading")
print("Recommendation: Validate with business stakeholders before model training")Technical Skills:
Data merging across multiple datasets
Calculating derived metrics (averages, ratios)
Multi-level groupby aggregations
Cross-validation of findings across different cuts of data
Business Understanding:
SKU definitions in CPG/retail context
Channel business models (B2B bulk vs B2C individual)
Product mix impact on performance metrics
Real-world CPG pricing dynamics
Critical Thinking:
Identified data quality issues through domain knowledge validation
Understood when numbers don't match reality
Chose documentation over data fabrication
Recognized that finding issues is as valuable as finding insights
Insights Despite Data Quality Issues:
General Trade dominates in revenue, volume, and transactions
Q Commerce has different economics - fewer, smaller transactions
Product mix varies by channel - not all channels sell all SKUs
Same products are priced consistently across channels (when sold)
Data quality issues exist that need stakeholder validation before ML modeling
