Week 10:
Cross-Sectional Data Analysis

Agenda

Exploring your data

A deeper dive into customer segmentation

Review Project 2 datasets

Exploring your data

Exploratory data analysis

What do we mean when we say explore your data?

→ Gain familiarity with the contents of the data

→ Identify questions that can be answered with the data

→ Generate summary statistics (average, variance, min, max)

→ Plot distributions and correlations between variables

Let’s look at an example: Grocery visitation data

Gain familiarity with your data

A good data source will provide information about a dataset in the form of a data dictionary.

A data dictionary is a comprehensive guide that provides detailed information to the data analyst about the dataset being analyzed.

Can include:

  • Variable names

  • Data types

  • Descriptions

  • Possible values

  • Missing value indication

  • Source

Identify questions that could be asked

  • Which stores have the greatest number of unique shoppers?

  • What is the average length of time a shopper remains in a store?

  • How far do shoppers travel to a store during a weekday?

  • What time of day is the busiest time to shop on a Saturday and a Sunday?

Summary statistics

Summary statistics provide an overview of the data’s main characteristics.

They are important for understanding the data’s scope and identifying possible outliers.

What summary statistics would we want to report if we were to answer the following question: How far do shoppers travel to a store during a weekday?

  1. Average distance (mean): Calculate the average distance traveled by all shoppers, conditioning on weekdays.

  2. Range of distances (min and max): Find the shortest and longest distances traveled by shoppers, conditioning on weekdays.

  3. Variation of distance traveled (standard deviation): This measures the amount of dispersion in the distances. A high standard deviation means that the distances vary widely around the mean, while a low standard deviation means that the distances traveled are clustered closely around the mean.
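The three statistics above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the visit records and the function name weekday_distance_summary are made up for the example and are not part of the grocery visitation data.

```python
import statistics
from datetime import date

# Hypothetical visit records: (visit date, distance traveled in miles).
visits = [
    (date(2023, 7, 3), 2.0),   # Monday
    (date(2023, 7, 4), 4.0),   # Tuesday
    (date(2023, 7, 5), 6.0),   # Wednesday
    (date(2023, 7, 8), 15.0),  # Saturday, excluded
    (date(2023, 7, 9), 1.0),   # Sunday, excluded
]

def weekday_distance_summary(records):
    """Summarize distance for visits that fall on Monday through Friday."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    distances = [dist for day, dist in records if day.weekday() < 5]
    return {
        "n": len(distances),
        "mean": statistics.mean(distances),
        "min": min(distances),
        "max": max(distances),
        "std": statistics.stdev(distances),  # sample standard deviation
    }

summary = weekday_distance_summary(visits)
print(summary)
```

Note that the weekend trips are dropped before any statistic is computed, which is what "conditioning on weekdays" means in practice.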

Plotting distributions and correlations

Plotting data allows you to see relationships in the data that cannot be easily identified with summary statistics.

We plot distributions for a single variable:

  • Allows you to see patterns, anomalies, and spot outliers in the data

  • The shape of the distribution informs whether you will need to transform the data (e.g., take the log)

We plot correlations between two variables:

  • Discover relationships between two variables and use this to inform a hypothesis

  • Highly correlated variables can be problematic (for example, when both are used as predictors in the same model), and plotting correlations helps detect them
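As a rough illustration of both ideas, the sketch below prints a crude text histogram for one variable, applies a log transform, and computes a Pearson correlation between two variables. The data values and the helper names pearson and text_histogram are assumptions made for the example; in practice a plotting library would be used for the actual charts.

```python
import math
import statistics

# Hypothetical per-shopper values: distance traveled (miles) and minutes in store.
distance = [1.0, 2.0, 2.0, 3.0, 4.0, 40.0]
dwell = [5.0, 8.0, 7.0, 10.0, 12.0, 30.0]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def text_histogram(values, bin_width):
    """Print a crude histogram: one row per bin, one '#' per observation."""
    counts = {}
    for v in values:
        start = math.floor(v / bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    for start in sorted(counts):
        print(f"{start:6.1f} | {'#' * counts[start]}")
    return counts

# A mean well above the median suggests a long right tail.
print("mean:", statistics.mean(distance), "median:", statistics.median(distance))
counts = text_histogram(distance, bin_width=10)

# Taking logs compresses the long tail.
log_distance = [math.log(d) for d in distance]
print("log range:", max(log_distance) - min(log_distance))

print("correlation:", pearson(distance, dwell))
```

Here a single 40-mile trip dominates the raw scale, which is the kind of outlier a distribution plot makes visible and a log transform tames.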

Customer segmentation

ESRI Tapestry Segmentation

Uses cluster analysis on census demographic data (plus other sources) to define groups

Why segment your customers?

We live in a noisy world and crowded marketplace

People are getting better at ignoring marketing messages, and there are more niche markets

Increase the chances of reaching the right customers with the right message at the right time

Increase sales, revenue, and hopefully profit

How do you know if you are achieving your goal?

Measuring Key Performance Indicators (KPIs)

Conversion rates - online click-through

Customer loyalty - rates of repeat purchases
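Both KPIs are simple ratios. The sketch below shows one plausible way to compute them; exact definitions vary by organization, and the function names and numbers here are illustrative assumptions.

```python
def conversion_rate(clicks, impressions):
    """Share of impressions that led to a click-through."""
    return clicks / impressions if impressions else 0.0

def repeat_purchase_rate(purchases_by_customer):
    """Share of purchasing customers who bought more than once."""
    buyers = [n for n in purchases_by_customer.values() if n >= 1]
    if not buyers:
        return 0.0
    return sum(1 for n in buyers if n >= 2) / len(buyers)

# Hypothetical campaign numbers and per-customer purchase counts.
ctr = conversion_rate(clicks=50, impressions=2000)
purchases = {"a": 1, "b": 3, "c": 2, "d": 1}
repeat = repeat_purchase_rate(purchases)
print(ctr, repeat)
```

Tracking these ratios separately for each segment is what lets you tell whether a targeted message is working better than an untargeted one.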

Dimensions of customer segmentation

Identify customer personas

Customer stage - leads, prospects, existing customers

Customer demographics - age, gender, income, location, occupation

Customer behaviors - purchase history, web browsing activity

Customer segmentation analysis

A common application of cluster analysis is to use sales data to divide customers into groups to improve target marketing.
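To make the idea concrete, here is a minimal k-means sketch in plain Python. K-means is only one of several clustering methods and is used here as an assumption, not as the method prescribed by the course. The customer features and starting centers are invented, and real data would normally be scaled first so that no single variable dominates the distance.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Basic k-means: alternate between assigning points and moving centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    return labels, centers

# Hypothetical customers: (visits per month, average basket in dollars).
customers = [(1, 10), (2, 12), (1, 11), (9, 50), (10, 55), (8, 52)]
labels, centers = kmeans(customers, centers=[customers[0], customers[3]])
print(labels)
print(centers)
```

The resulting groups could then be given descriptive labels, such as occasional small-basket shoppers versus frequent large-basket shoppers, based on each cluster's center.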

Let’s look at a case study...

MetLife case study

Project 2

Project 2: Overview

Use the data provided to:

  • Perform an exploratory data analysis

  • Assemble summary statistics

  • Analyze spatial patterns, correlations, etc.

  • Construct visualizations

You must:

  • Conduct market segmentation analysis via clustering

  • Label clusters based on characteristics

  • Develop a simple marketing strategy to target clusters

Project 2: Data

Overview: Transaction-level data from convenience stores across the U.S. during July of 2023

A transaction is defined as a purchase incident made by a shopper. (Think: receipt)

Transaction-level data includes information about the store where the purchase was made, the price and quantity of each product purchased, and details about the items themselves.

Project 2: Data

“shopper_info.csv”

  • This is the core file that contains information about the shopper and each transaction they made during the month of July 2023.

“gtin.csv”

  • You will link this file with “shopper_info” based on the variable gtin.

  • GTIN stands for “Global Trade Item Number” and is similar to an SKU or UPC (i.e., barcode).

“store_info.csv”

  • This file contains the store details and can be linked with “shopper_info” using the variable store_id.
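One way to link the three files is sketched below with Python's csv module. The in-memory strings stand in for the real files so the example runs on its own. Only gtin and store_id come from the project description; every other column name and value is an assumption.

```python
import csv
import io

# Tiny stand-ins for the three files. Only gtin and store_id come from the
# project description; every other column name here is an assumption.
shopper_info = io.StringIO(
    "shopper_id,store_id,gtin,quantity\n"
    "s1,100,0001,2\n"
    "s2,101,0002,1\n"
    "s1,100,9999,1\n"
)
gtin_file = io.StringIO("gtin,product_name\n0001,Cola\n0002,Chips\n")
store_info = io.StringIO("store_id,city\n100,Austin\n101,Denver\n")

def index_by(handle, key):
    """Read a CSV into a dict keyed on one column for fast lookups."""
    return {row[key]: row for row in csv.DictReader(handle)}

products = index_by(gtin_file, "gtin")
stores = index_by(store_info, "store_id")

linked = []
for row in csv.DictReader(shopper_info):
    # Left join: keep every transaction, even if a lookup finds no match.
    product = products.get(row["gtin"], {})
    store = stores.get(row["store_id"], {})
    linked.append({**row, **product, **store})

for row in linked:
    print(row)
```

With the real data, each io.StringIO object would be replaced by open("shopper_info.csv", newline="") and so on. The csv module reads every field as text, which preserves any leading zeros in gtin; it is worth checking for transactions whose gtin or store_id has no match after linking.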

Project 2: Types of segmentations

By store:

  • Which stores are similar by mix of products sold?

  • Which stores are similar by sales?

By product:

  • Which products are similar?

By shopper:

  • Which shoppers are similar based on where they shop?

  • Which shoppers are similar based on what they purchase?
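Each type of segmentation starts by summarizing transactions into one row of features per unit (store, product, or shopper). As an example for the store-by-product-mix question, the sketch below turns transactions into each store's share of units sold by category. The category field, the values, and the function name product_mix are hypothetical.

```python
from collections import defaultdict

# Hypothetical linked transactions: (store_id, product category, quantity).
transactions = [
    ("100", "beverages", 6), ("100", "snacks", 2),
    ("101", "beverages", 1), ("101", "snacks", 3),
    ("102", "beverages", 3), ("102", "snacks", 1),
]

def product_mix(rows):
    """Return {store_id: {category: share of units sold}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for store_id, category, qty in rows:
        totals[store_id][category] += qty
    mix = {}
    for store_id, by_cat in totals.items():
        store_total = sum(by_cat.values())
        mix[store_id] = {cat: q / store_total for cat, q in by_cat.items()}
    return mix

mix = product_mix(transactions)
for store_id, shares in mix.items():
    print(store_id, shares)
```

Using shares rather than raw counts means stores 100 and 102 look alike even though one sells twice as many units. Segmenting by sales instead would keep the raw totals as the feature.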

Summary

Explore your data: understanding your data, identifying questions, descriptive statistics, plotting distributions, plotting correlations