Before we get started on core analysis, we need to sanity check the data - check for missing values, measurement problems, and joining issues - and take a peek at how the variables are distributed.
This preparatory work will help us catch and fix data problems, interpret our results correctly, and guide the rest of the analysis.
The Foodmart data appears to have been drawn from a data warehouse: there is a central fact table of transactions, and dimension tables for products, product families, and promotions.
Some quick context is in order: in data warehouse "star schemas", fact tables store the measurements collected during particular business process events - in this case, barcode scans during supermarket checkout.
from IPython.display import Image
Image(filename='measurement_event.png')
Source: The Data Warehouse Toolkit by Ralph Kimball
Each fact consists of a physical, real-valued measurement - the sales price of the product being purchased, in this case - and a set of foreign keys referring to dimension tables. Dimension tables capture the context around the measurement - the who, what, where, and when. (For the curious, data warehouses use star schemas because they are more intuitive for analysts and performant for aggregate queries.)
Generally, if an organization has gone through the trouble of designing a warehouse and ETL'ing data from dozens of operational systems into it, they've also made sure that events are being measured correctly and reliably, and that the fact tables satisfy referential integrity - that every foreign key in the fact table can be matched to a primary key in the dimension table. In other words, that no measurements are missing context. The data cleaning we had to do raises some red flags, however, so we'd better kick the tires.
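As a toy illustration of what a referential integrity check looks like (these are hypothetical fact and dimension DataFrames, not the Foodmart tables we load below):
import pandas as pd
# Tiny fact table (measurements + a foreign key) and dimension table (primary key + context).
fact = pd.DataFrame({'sales': [3.50, 1.25, 7.00], 'product_key': [1, 2, 99]})
dim = pd.DataFrame({'product_key': [1, 2, 3], 'product_name': ['apples', 'bread', 'milk']})
# Referential integrity holds if every foreign key in the fact table appears in the dimension table.
orphaned = ~fact['product_key'].isin(dim['product_key'])
fact[orphaned]  # the row with product_key 99 is a measurement with missing context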
Since transactions are the heart of everything, we'll look at them first. For each table, we'll count the number of records, overall and grouped by variable.
from os import environ
import numpy as np
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Load cleaned datasets into DataFrame from Make through env variables.
dataset_paths = ['../' + dataset for dataset in environ['datasets'].split()]
product, product_class, promotion, transactions = [pd.read_pickle(path) for path in dataset_paths]
How many transactions are there?
num_transactions = transactions.shape[0]
num_transactions
Are any of the variables missing values?
transactions.isnull().sum()
Lookin' good! None of our columns have missing values. We're not missing any sales price, cost, or unit measurements, and none of our foreign keys are missing!
Right?
Wait... what if some of our numerical values are just using "0" to represent null values? Let's count the number of 0's for all of our numerical variables:
dtypes = transactions.dtypes
numerical_cols = dtypes[(dtypes == 'float64') | (dtypes == 'int64')].index
num_0_values_by_col = (transactions[numerical_cols] == 0).sum()
num_0_values_by_col
Uh oh, looks like both promotions and customers have 63,627 missing values! What percentage of the transactions does this represent?
num_0_values_by_col['customer_id']/num_transactions
Over a third of the dataset! How can this be?
For promotions, 0 probably represents "null" - it makes sense that one in three transactions did not occur through a promotion. For customers, however, either a third of transactions are missing their customer, or 1 in 3 transactions were performed by a single customer, which doesn't make sense.
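If we wanted to make the "0 means null" convention explicit, one option - just a sketch, we won't rely on it below - is to recode the sentinel zeros as proper missing values:
# Recode sentinel 0s to NaN in a copy, so isnull() reflects the missing promotions/customers.
recoded = transactions.copy()
recoded[['promotion_id', 'customer_id']] = recoded[['promotion_id', 'customer_id']].replace(0, np.nan)
recoded[['promotion_id', 'customer_id']].isnull().sum()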
It's striking, however, that the promotion_id and customer_id columns have the same number of 0 values. Maybe null promotions are associated with "0" customer_ids? This could point to a data collection problem.
zero_customer = transactions[transactions.customer_id == 0].index
zero_promotion = transactions[transactions.promotion_id == 0].index
set(zero_customer) & set(zero_promotion)
Nope! There is not a single transaction where promotion_id and customer_id are both 0. This is very strange, considering they have the exact same number of 0 values.
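To see the (lack of) overlap directly, we can cross-tabulate the two zero flags - the (True, True) cell should be 0:
# Rows: customer_id == 0, columns: promotion_id == 0.
pd.crosstab(transactions.customer_id == 0, transactions.promotion_id == 0)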
Let's make a note of this and move on.
How many unique customers are there, excluding the duds?
transactions_with_customers = transactions[~(transactions.customer_id == 0)]
unique_customers = transactions_with_customers.customer_id.nunique()
unique_customers
How many transactions are there per customer?
# Count transactions per customer, clipping at 80 so the bar chart below stays readable.
transactions_per_customer = transactions_with_customers.groupby('customer_id').size().clip(upper=80)
plt.figure(figsize=(16,5))
transactions_per_customer.value_counts().sort_index().plot(kind='bar')
plt.xlabel('Transactions per customer')
plt.ylabel('Number of customers')
transactions_per_customer.describe()
Half of customers purchased 11 or fewer products in 1997. Like many count distributions bounded at zero, the distribution of transactions per customer is right-skewed, with a very long tail.
Although a median of 11 items a year seems low, the shape of the distribution makes sense: lots of people buy a couple of products a few times a year, or drop into a store for the first time and buy 8 things. The quarter of customers who shop there every week buy some variable number of products each visit, which is why the tail is so long - their yearly totals can land almost anywhere.
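To put rough numbers on that tail, here are some upper quantiles of the unclipped counts (the series above was clipped at 80 purely for plotting):
# Transactions per customer without the clip used for the chart.
unclipped_counts = transactions_with_customers.groupby('customer_id').size()
unclipped_counts.quantile([0.5, 0.75, 0.9, 0.99])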
We'll be analyzing promotions heavily in question two, so let's take a peek at them.
How many promotion names are there?
promotion.promotion_name.nunique()
Which of these promotion names have the most promotions associated with them?
plt.figure(figsize=(7,10))
promotion.groupby('promotion_name').size().sort_values().plot(kind='barh')
How many unique products were sold under each promotion name?
trans_with_promo = pd.merge(transactions, promotion, on='promotion_id')
plt.figure(figsize=(7,10))
trans_with_promo.groupby('promotion_name')['product_id'].nunique().sort_values().plot(kind='barh')
How many media types are there?
promotion.media_type.nunique()
Which media types show up most often?
promotion.media_type.value_counts()
To be able to meaningfully summarize trends in transactions, we need to join them to products and product classes. When working with new datasets, it's always important to check joins - it's possible to leave records on the table if you're not careful.
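One general-purpose way to audit a join - an aside; below we'll stick with right joins and count() - is pandas' indicator flag, which labels each output row by which side(s) it came from:
# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
join_audit = pd.merge(transactions, product, on='product_id', how='outer', indicator=True)
join_audit['_merge'].value_counts()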
transactions.shape[0]
Recall that we have 173,602 transactions. If we right join transactions to products on product_id, we would hope that the columns from the transactions table contain 173,602 non-null values. That would mean every transaction matched a product in the product table.
trans_with_prod = pd.merge(transactions, product, on='product_id', how='right')
trans_with_prod.count()
Indeed, the transaction columns - customer_id to fact_count - have 173,602 non-null values. The columns from the product table - product_class_id to shelf_depth - have one additional value. It appears that one of the products in the product table did not match to any transaction.
No big deal, this won't really affect things. We can safely do an inner join during our analysis.
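As a quick sanity check that an inner join really does keep every transaction:
# Every transaction matched a product, so the inner join should still have 173,602 rows.
pd.merge(transactions, product, on='product_id', how='inner').shape[0]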
Now let's join transactions-product to product_class:
trans_with_prod_class = pd.merge(trans_with_prod, product_class, on='product_class_id', how='right')
trans_with_prod_class.count()
It looks like all transaction-product records match to a product class record. Around 8 product classes did not match to any product. Again, this just means the product_class table contains a few records that are irrelevant to our transactions; it won't affect our analysis.
Let's join to promotions next. We're joining on promotion_id, which comes from the transactions table, so if every transaction matches to a promotion we'd expect to see 173,602 non-null values for all transaction-product-product_class columns in the resulting table. If the promotion columns have more than 173,602 non-null values, it's because some promotions did not match to any transactions.
trans_prod_promo = pd.merge(trans_with_prod_class, promotion, on='promotion_id', how='right')
trans_prod_promo.count()
unmatched_promotions = trans_prod_promo[trans_prod_promo.product_id.isnull()]
num_unmatched_promotions = unmatched_promotions['promotion_id'].nunique()
num_unmatched_promotions
Looks like all transactions matched to a promotion. But 1,702 promotions failed to match to a transaction!
num_unmatched_promotions / promotion.promotion_id.nunique()
That's 91% of promotions. What's going on?
unmatched_promotions_by_year = unmatched_promotions.start_date.dt.year.value_counts()
unmatched_promotions_by_year
(unmatched_promotions_by_year[1996] + unmatched_promotions_by_year[1998]) / unmatched_promotions_by_year.sum()
71% of these promotions started in either 1996 or 1998, whereas all of our transaction data is from 1997:
transactions.the_year.value_counts()
It's a little worrisome that 488 campaigns from 1997 can't be matched to a single transaction. It's unlikely that so many promotions actually spurred zero product sales, so this is likely a data quality issue. Either transaction records are missing, transaction foreign keys are missing or inaccurate, or the promotions table contains dirty promotion records. We'll need to note this issue in any conclusions pertaining to promotions.
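One possibility we can check directly is whether any transaction carries a promotion_id that simply doesn't exist in the promotion table (it shouldn't, given the join above, but the check makes it explicit):
# promotion_ids referenced by transactions that have no matching row in the promotion table.
missing_promo_keys = set(transactions.promotion_id.unique()) - set(promotion.promotion_id.unique())
len(missing_promo_keys)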
Let's circle back to transactions and take a look at the revenue and cost associated with each one. Questions one and two both focus on gross profits, which are derived from costs and revenues, so we should do our due diligence here.
How much total revenue did Foodmart stores make in 1997?
total_revenue = transactions.store_sales.sum()
total_revenue
Just over a million dollars. This seems super low: according to this supermarket industry profile, the median yearly revenue for a single large supermarket was $14 million in 2005. And there are multiple stores represented in this dataset. Something's up - we likely don't have every transaction from 1997 in our dataset.
Let's continue our revenue evaluation and then investigate.
total_costs = transactions.store_cost.sum()
total_costs
$450K in costs, for a gross profit of...
gross_profit = total_revenue - total_costs
gross_profit
gross_margin = gross_profit / total_revenue
gross_margin
$680K, which is a 60% gross profit margin.
This gross profit margin - which is gross profit as a percentage of sales revenue, before deducting overhead, payroll, taxes, and so on - seems reasonable, given the high overhead costs of operating a grocery store. The net profit margin is likely much lower.
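To make the distinction concrete, here's a purely hypothetical example - the overhead figure below is invented for illustration, not drawn from the data:
# Hypothetical overhead (payroll, rent, utilities, taxes) to show how net margin shrinks relative to gross margin.
hypothetical_overhead = 500000
net_margin = (gross_profit - hypothetical_overhead) / total_revenue
net_margin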
In other words, the total revenue, cost, and profit figures look reasonable in percentage terms, but not in absolute terms. Let's figure out why that is. We'll look at revenue per store first:
transactions.store_id.nunique()
These transactions took place in over 5,000 stores! Let's look at the revenue per store:
total_revenue_per_store = transactions.groupby('store_id')['store_sales'].sum()
# Plot distribution of total revenue by store, removing outliers so chart looks ok.
sns.distplot(total_revenue_per_store[total_revenue_per_store < 800])
total_revenue_per_store.describe()
The median store only brought in $65 during the whole year! These numbers are way too low.
# Plot number of transactions per store, removing outliers so chart looks ok.
transactions_per_store = transactions.groupby('store_id').size()
sns.distplot(transactions_per_store[transactions_per_store < 100])
transactions_per_store.describe()
And here's the culprit: we have fewer than 20 transactions for 75% of stores. The distribution looks reasonable, however: most stores have relatively few transactions, with a long tail of stores that sell more. That suggests that transactions.csv could be a random sample of the original fact table.
It could be that this sample isn't random throughout 1997, and instead oversamples from certain parts of the year. Let's find a way to look at the transactions over time.
# Creates a time series of the number of "unmatched promotions" - promotions we couldn't join
# to any of the transactions - that started in a given week.
promotions_by_week = unmatched_promotions.set_index(unmatched_promotions.start_date)['promotion_id'] \
    .resample('W').count()
# Plot the time series.
plt.figure(figsize=(15,7))
promotions_by_week.plot()
plt.title('Number of non-matched promotion starts, by week')
plt.xlabel('Time')
plt.ylabel('Number of promotions starting that week')
Remember those promotions that we couldn't match to any transactions? Let's call them "unmatched promotions."
The chart above shows the number of unmatched promotions that started on any given week in 1996, 1997, and 1998.
The first thing to note is that promotions tend to start every two weeks:
promotions_by_week.tail(10)
And usually 23-24 promotions kick off during those weeks - at least in 1996 and 1998. In other words, none of the promotions during these years were matched to the transactions, which makes sense because all transactions are from 1997.
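We can double-check that cadence against the full promotion table, not just the unmatched promotions - a sketch assuming start_date is a datetime column, as it appears to be above:
# Number of promotions (matched or not) starting each week.
all_promo_starts_by_week = promotion.set_index('start_date')['promotion_id'].resample('W').count()
all_promo_starts_by_week.tail(10)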
For 1997, however, the number of unmatched promotions starting during "on" weeks ranges between 16 and 20. This means that about 25% of promotions were matched to at least one transaction, and 75% weren't. If we assume that transactions.csv is a small sample of a much larger dataset, it's perfectly possible for the sample to end up without every promotion being represented. That reinforces the random sample hypothesis, as does the fact that the number of unmatched 1997 promotions stays fairly constant from month to month.
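As a rough check on that last claim:
# Unmatched 1997 promotions, grouped by the month their promotion started.
unmatched_1997 = unmatched_promotions[unmatched_promotions.start_date.dt.year == 1997]
unmatched_1997.start_date.dt.month.value_counts().sort_index()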
If our transaction sample size is 173,602, how big is the population, and what percentage of it are we sampling?
# Median yearly revenue per store in our sample.
median_store_revenue_sample = total_revenue_per_store.describe()['50%']
# Rough prior for the true median yearly revenue per store, based on the industry report above.
median_store_revenue_population = 6000000
# Share of the typical store's revenue (and, roughly, of its transactions) that our sample captures, as a percentage.
pct_of_pop_sampled = (median_store_revenue_sample / median_store_revenue_population) * 100
pct_of_pop_sampled
If we use the supermarket industry report above as a rough prior and assume that the median total revenue per store is around $6,000,000 per year, then we are probably sampling between .01% and .001% of all 1997 transactions.
transactions.shape[0] / pct_of_pop_sampled
That would suggest that the population size is on the order of 160 million yearly transactions, which sounds much more reasonable for a large supermarket chain's data warehouse.