Exploring your data
A deeper dive into customer segmentation
Review Project 2 datasets
What do we mean when we say explore your data?
\(\rightarrow\) Gain familiarity with the contents of the data
\(\rightarrow\) Identify questions that can be answered with the data
\(\rightarrow\) Generate summary statistics (average, variance, min, max)
\(\rightarrow\) Plot distributions and correlations between variables
A good data source will provide information about a dataset in the form of a data dictionary.
A data dictionary is a comprehensive guide that provides detailed information to the data analyst about the dataset being analyzed.
A data dictionary can include:
Variable names
Data types
Descriptions
Possible values
Missing value indication
Source
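When a data dictionary is missing or incomplete, a quick profile of each column can recover some of the same information. The sketch below is one possible approach using only the Python standard library. The sample rows and every column name other than gtin are made up for illustration. The type guess is deliberately crude.

```python
import csv
import io

# Hypothetical sample standing in for a real file; the rows and most column
# names are assumptions made for illustration.
SAMPLE_CSV = """shopper_id,store_id,gtin,unit_price,unit_quantity
1001,S01,00012345678905,2.49,1
1002,S01,00098765432109,,2
1001,S02,00012345678905,2.59,
"""

def infer_type(values):
    """Guess a simple type label from the non-missing string values."""
    if not values:
        return "unknown"
    for label, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return label
        except ValueError:
            continue
    return "str"

def profile(rows):
    """Summarize each column: inferred type, missing count, unique count."""
    summary = {}
    for name in rows[0].keys():
        present = [r[name] for r in rows if r[name] != ""]
        summary[name] = {
            "type": infer_type(present),
            "missing": len(rows) - len(present),
            "unique": len(set(present)),
        }
    return summary

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
data_profile = profile(rows)
for name, info in data_profile.items():
    print(f"{name:15s} {info['type']:6s} "
          f"missing={info['missing']} unique={info['unique']}")
```

Note that an identifier such as gtin parses as a number here, but it is usually safer to keep it as text so that leading zeros are preserved.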
Which stores have the greatest number of unique shoppers?
What is the average length of time a shopper remains in a store?
How far do shoppers travel to a store during a weekday?
What time of day is the busiest time to shop on a Saturday and a Sunday?
Summary statistics provide an overview of the data’s main characteristics.
They are important for understanding the data’s scope and identifying possible outliers.
What summary statistics would we want to report if we were to answer the following question: How far do shoppers travel to a store during a weekday?
Average distance (mean): Calculate the average distance traveled by all shoppers, conditioning on weekdays.
Range of distances (min and max): Find the shortest and longest distances traveled by shoppers, conditioning on weekdays.
Variation of distance traveled (standard deviation): This measures the amount of dispersion in the distances. A high standard deviation means that the distances vary widely from the mean, while a low standard deviation means that the distances are clustered closely around the mean.
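The three statistics above can be sketched in a few lines. This is a minimal example with made-up trips. The field layout (a visit date plus a distance in miles) is an assumption, not the structure of the project files.

```python
import statistics
from datetime import date

# Hypothetical trips: (visit date, distance traveled in miles). Values are made up.
trips = [
    (date(2023, 7, 3), 2.0),   # Monday
    (date(2023, 7, 4), 4.0),   # Tuesday
    (date(2023, 7, 5), 6.0),   # Wednesday
    (date(2023, 7, 8), 30.0),  # Saturday, excluded below
    (date(2023, 7, 9), 12.0),  # Sunday, excluded below
]

# Condition on weekdays: date.weekday() returns 0 (Monday) through 6 (Sunday).
weekday_distances = [dist for day, dist in trips if day.weekday() < 5]

summary = {
    "mean": statistics.mean(weekday_distances),
    "min": min(weekday_distances),
    "max": max(weekday_distances),
    # Sample standard deviation (divides by n - 1).
    "std": statistics.stdev(weekday_distances),
}
print(summary)
```

Filtering before summarizing matters here: the long Saturday trip would otherwise pull the mean and the maximum upward.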
Plotting data allows you to see relationships in the data that cannot be easily identified with summary statistics.
We plot distributions for a single variable:
Allows you to see patterns, anomalies, and spot outliers in the data
The shape of the distribution informs whether you will need to transform the data (e.g., take the log)
We plot correlations between two variables:
Discover relationships between two variables and use this to inform a hypothesis
Highly correlated variables can be problematic in later modeling, and plotting correlations helps detect these relationships
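To make the ideas of distribution shape, log transformation, and correlation concrete, here is a small sketch using only the standard library. The basket totals and item counts are invented. The bin counts stand in for a histogram, which in practice would usually be drawn with a plotting library.

```python
import math
import statistics

def bin_counts(values, bins=5):
    """Count values in equal-width bins (a text stand-in for a histogram)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # put the max in the last bin
        counts[idx] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical, right-skewed basket totals in dollars.
basket_totals = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
raw_counts = bin_counts(basket_totals)
log_counts = bin_counts([math.log(v) for v in basket_totals])
print("raw bins:", raw_counts)  # most mass in the first bin, one far outlier
print("log bins:", log_counts)  # spread more evenly after the log transform

# Hypothetical items per basket and spend per basket.
items = [1, 2, 3, 4, 5]
spend = [2.5, 5.0, 6.5, 9.5, 11.0]
r = pearson(items, spend)
print("correlation:", round(r, 3))
```

A mean far above the median, as in the raw basket totals, is another quick signal that a distribution is right-skewed.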
Uses cluster analysis on census demographic data (+ others) to define groups
We live in a noisy world and crowded marketplace
People are getting better at ignoring marketing messages, and there are more niche markets
Increase the chances of reaching the right customers with the right message at the right time
Increase sales, revenue, and hopefully profit
Measuring Key Performance Indicators (KPIs)
Conversion rates - online click through
Customer loyalty - rates of repeat purchases
Identify customer personas
Customer stage - leads, prospects, existing customers
Customer demographics - age, gender, income, location, occupation
Customer behaviors - purchase history, web browsing activity
A common application of cluster analysis is to use sales data to divide customers into groups to improve target marketing.
Let’s look at a case study...
Use the data provided to:
Perform an exploratory data analysis
Assemble summary statistics
Analyze spatial patterns, correlations, etc.
Construct visualizations
You must:
Conduct market segmentation analysis via clustering
Label clusters based on characteristics
Develop a simple marketing strategy to target clusters
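As a rough illustration of the clustering and labeling steps, the sketch below runs a plain k-means (Lloyd's algorithm) on made-up shopper features. The features (visits per month, average spend), the data, the 6-visit labeling threshold, and the choice of starting centroids are all assumptions. A real analysis would typically standardize features first and use a library implementation such as scikit-learn's KMeans.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; for simplicity the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical shoppers: (visits per month, average spend per visit in dollars).
shoppers = [(1, 5.0), (12, 6.0), (2, 4.0), (11, 7.0), (1, 6.0), (13, 5.0)]
labels, centroids = kmeans(shoppers, k=2)

# Label each cluster from its centroid; the 6-visit threshold is arbitrary.
cluster_names = {
    c: "frequent" if centroid[0] > 6 else "occasional"
    for c, centroid in enumerate(centroids)
}
for shopper, lab in zip(shoppers, labels):
    print(shopper, "->", cluster_names[lab])
```

Labeling from the centroids keeps the names tied to measurable characteristics, which makes it easier to justify a marketing strategy for each group.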
Overview: Transaction-level data from convenience stores across the U.S. during July of 2023
A transaction is defined as a purchase incident made by a shopper. (Think: receipt)
Transaction-level data includes information about the store where the purchase was made, the price and quantity of each product purchased, and details about the items themselves.
“shopper_info.csv”
“gtin.csv”
You will link this file with “shopper_info” based on the variable gtin.
GTIN stands for “Global Trade Item Number” and is similar to an SKU or UPC (i.e., barcode).
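Linking the two files on gtin might look like the sketch below. It uses small in-memory samples rather than the real files, and every column name other than gtin is hypothetical. The join keeps every transaction and flags those without a matching product record.

```python
import csv
import io

# Hypothetical stand-ins for "shopper_info.csv" and "gtin.csv".
# Only the gtin column name comes from the project description.
SHOPPER_INFO = """shopper_id,store_id,gtin,unit_price
1001,S01,00012345678905,2.49
1002,S01,00098765432109,1.99
1003,S02,00000000000000,0.99
"""
GTIN = """gtin,category
00012345678905,Soda
00098765432109,Candy
"""

transactions = list(csv.DictReader(io.StringIO(SHOPPER_INFO)))
products = list(csv.DictReader(io.StringIO(GTIN)))

# Build a lookup keyed on gtin. Keys stay as strings to preserve leading zeros.
product_by_gtin = {row["gtin"]: row for row in products}

# Left join: keep every transaction; category is None when no product matches.
merged = []
for row in transactions:
    match = product_by_gtin.get(row["gtin"])
    merged.append({**row, "category": match["category"] if match else None})

unmatched = [row for row in merged if row["category"] is None]
print(f"{len(merged)} transactions, {len(unmatched)} without product details")
```

Counting unmatched rows after a join is a useful check, since a large number can indicate a key formatting problem such as dropped leading zeros.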
“store_info.csv”
By store:
Which stores are similar by mix of products sold?
Which stores are similar by sales?
By product:
By shopper:
Which shoppers are similar based on where they shop?
Which shoppers are similar based on what they purchase?
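One way to make "similar based on what they purchase" measurable is to turn each shopper's purchases into category shares and compare the share vectors. The sketch below uses cosine similarity on invented data. The shopper labels, the categories, and the choice of similarity measure are all assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical purchases: (shopper, product category), one tuple per item bought.
purchases = [
    ("A", "Soda"), ("A", "Soda"), ("A", "Candy"),
    ("B", "Soda"), ("B", "Soda"), ("B", "Candy"),
    ("C", "Fuel"), ("C", "Fuel"),
]

# Count purchases per category for each shopper.
counts = defaultdict(Counter)
for shopper, category in purchases:
    counts[shopper][category] += 1

# Convert counts to shares so that heavy and light buyers are comparable.
shares = {
    shopper: {cat: n / sum(c.values()) for cat, n in c.items()}
    for shopper, c in counts.items()
}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

sim_ab = cosine(shares["A"], shares["B"])  # same purchase mix
sim_ac = cosine(shares["A"], shares["C"])  # no categories in common
print("A vs B:", round(sim_ab, 3))
print("A vs C:", round(sim_ac, 3))
```

The same share vectors could serve as input features for clustering, and the same construction applies to stores by replacing the shopper key with a store key.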
Explore your data: understanding your data, identifying questions, descriptive statistics, plotting distributions, plotting correlations