Project 2 – Problem Set 1: Cross-Sectional Data Analysis

Problem Set Overview

This problem set is designed to help you start your Project 2 analysis by gaining familiarity with the convenience store datasets provided. You will perform exploratory data analysis (EDA) and begin initial clustering to identify patterns in your data.

Data for this Problem Set

You will use the following datasets:

shopper_info.csv: This file contains transaction-level data, including information about the shopper, the store where the purchase was made, the products purchased (identified by GTIN), and the quantities and prices.

gtin.csv: This file provides additional details about the products, which can be linked to the shopper_info.csv file using the GTIN (Global Trade Item Number) variable.

store_info.csv: This file includes information about the stores, such as location, size, and other characteristics, which can be linked to the shopper_info.csv file using the store_id variable.

See the Canvas assignment for a downloadable codebook describing the data files.

Part 1: Specify Your Research Question

  1. Review Project 2 Instructions. Then, clearly outline the segmentation and question you will analyze:
  • Specify one segmentation area you have chosen for the analysis (customers, stores, or products).

  • Clearly articulate your research question.

  • Explain why answering this question is valuable and specify who will benefit from the insights (e.g., business executives, store managers, marketers).

Three Options for This Project

Customer Segmentation:

  • Who are the highest-spending customers?

  • Are there differences in what products customers purchase across stores?

Store Segmentation:

  • Which stores have the highest total sales?

  • Which regions or cities have the most stores?

Product Segmentation:

  • What are the top-selling products?

  • Are there clear differences in product popularity by region or store type?
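As a rough illustration of how questions like these translate into R, the sketch below answers "Which stores have the highest total sales?" and "What are the top-selling products?" on a small made-up table. Only store_id and GTIN are names taken from this problem set; quantity, price, and sales are hypothetical column names, so substitute the names listed in the codebook.

```r
# Toy transactions standing in for shopper_info.csv.
# `quantity` and `price` are hypothetical column names.
transactions <- data.frame(
  store_id = c(10, 10, 11, 12, 12, 12),
  GTIN     = c("A", "B", "A", "C", "A", "B"),
  quantity = c(2, 1, 4, 3, 1, 2),
  price    = c(1.50, 3.00, 1.50, 2.25, 1.50, 3.00)
)

# Dollar value of each transaction line.
transactions$sales <- transactions$quantity * transactions$price

# Which stores have the highest total sales?
store_sales <- aggregate(sales ~ store_id, data = transactions, FUN = sum)
store_sales <- store_sales[order(-store_sales$sales), ]
print(store_sales)

# What are the top-selling products (by units sold)?
product_units <- aggregate(quantity ~ GTIN, data = transactions, FUN = sum)
product_units <- product_units[order(-product_units$quantity), ]
print(product_units)
```

Sorting the aggregated table is usually enough for a first look; the same pattern works for customers by swapping the grouping variable.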

Part 2: R

Complete the following steps using the provided datasets (shopper_info.csv, gtin.csv, and store_info.csv):

  2. Depending on the segment you have chosen, merge the relevant datasets (two or all three of them) and clean the merged dataset. Clearly document and justify all processing steps, such as:
  • Handling missing data

  • Removing outliers

  • Dropping observations

  • Transforming or aggregating variables

  • Creating new variables
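The processing steps above can be sketched in base R as follows. This is a minimal example on made-up data, not a required approach. GTIN and store_id are the join keys named in this problem set; every other column name (shopper_id, quantity, price, category, city) is a hypothetical stand-in, and the outlier cutoff of 100 units is an arbitrary illustration. Check the codebook for the real variable names and justify your own rules.

```r
# Toy stand-ins for shopper_info.csv, gtin.csv, and store_info.csv.
# With the real files you would start from read.csv("shopper_info.csv"), etc.
shopper_info <- data.frame(
  shopper_id = c(1, 1, 2, 3, 3, 4),                    # hypothetical column
  store_id   = c(10, 10, 11, 10, 12, 11),
  GTIN       = c("A", "B", "A", "C", "B", "A"),
  quantity   = c(2, 1, NA, 3, 1, 500),                 # hypothetical; NA is missing, 500 looks suspicious
  price      = c(1.50, 3.00, 1.50, 2.25, 3.00, 1.50)   # hypothetical column
)
gtin <- data.frame(
  GTIN     = c("A", "B", "C"),
  category = c("Snacks", "Beverages", "Candy")         # hypothetical column
)
store_info <- data.frame(
  store_id = c(10, 11, 12),
  city     = c("Springfield", "Shelbyville", "Springfield")  # hypothetical column
)

# Left joins keep every transaction, even if product or store details are missing.
merged <- merge(shopper_info, gtin, by = "GTIN", all.x = TRUE)
merged <- merge(merged, store_info, by = "store_id", all.x = TRUE)

# Handle missing data: drop rows with no recorded quantity.
n_before <- nrow(merged)
cleaned <- merged[!is.na(merged$quantity), ]

# Remove outliers: illustrative rule, drop quantities above 100 units.
cleaned <- cleaned[cleaned$quantity <= 100, ]

# Create a new variable: amount spent on each transaction line.
cleaned$spend <- cleaned$quantity * cleaned$price

# Aggregate to the customer level (useful for a customer segmentation).
customers <- aggregate(cbind(spend, quantity) ~ shopper_id,
                       data = cleaned, FUN = sum)

cat("Rows dropped during cleaning:", n_before - nrow(cleaned), "\n")
print(customers)
```

Recording how many rows each step removes, as the cat() call does, gives you the numbers you need when documenting and justifying your cleaning choices.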

  3. Using your merged and cleaned dataset, generate a table of summary statistics. The table should include at least: variable names, mean (average), standard deviation, min, and max.
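One possible way to build such a table in base R is sketched below. The data frame and its column names (total_spend, n_visits, avg_basket) are made up for illustration; replace them with your own merged and cleaned dataset.

```r
# Toy customer-level data; the column names are hypothetical placeholders.
customers <- data.frame(
  total_spend = c(12.5, 40.0, 7.25, 88.0, 23.5),
  n_visits    = c(2, 6, 1, 9, 4),
  avg_basket  = c(6.25, 6.67, 7.25, 9.78, 5.88)
)

# Keep numeric columns only, then compute each statistic column by column.
num_vars <- customers[sapply(customers, is.numeric)]
summary_table <- data.frame(
  variable = names(num_vars),
  mean     = sapply(num_vars, mean, na.rm = TRUE),
  sd       = sapply(num_vars, sd,   na.rm = TRUE),
  min      = sapply(num_vars, min,  na.rm = TRUE),
  max      = sapply(num_vars, max,  na.rm = TRUE),
  row.names = NULL
)

print(summary_table, digits = 3)
```

Setting na.rm = TRUE keeps a single missing value from turning a whole statistic into NA; if you rely on it, say so in your narrative.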

  4. Write a narrative explaining the steps you took to merge and clean your data, and interpret your data. Describe your data so that readers understand what is measured.

  5. Conduct an initial K-means clustering analysis:

  • Select relevant numeric variables based on your segmentation choice. These variables may be ones that you had to create in answering question #2.

  • Perform cluster analysis using different numbers of clusters (e.g., k = 3, 4, and 5) and clearly explain which number you choose and why.

Note: You will come back to this R script in problem set 2.
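The clustering step could look roughly like the sketch below, which uses kmeans() from base R on simulated data. The variable names (total_spend, n_visits) are hypothetical, and standardizing with scale() and using nstart = 25 are common choices rather than requirements of this assignment.

```r
set.seed(123)  # k-means starts from random centers; fix the seed for reproducibility

# Simulated customer-level data with hypothetical variable names.
customers <- data.frame(
  total_spend = c(rnorm(20, 20, 5), rnorm(20, 60, 8), rnorm(20, 120, 10)),
  n_visits    = c(rnorm(20, 2, 0.5), rnorm(20, 6, 1), rnorm(20, 12, 2))
)

# Standardize so variables measured on large scales do not dominate the distances.
X <- scale(customers)

# Fit k-means for k = 3, 4, 5; nstart = 25 tries several random starts per k.
ks <- 3:5
fits <- lapply(ks, function(k) kmeans(X, centers = k, nstart = 25))
names(fits) <- paste0("k", ks)

# Compare the total within-cluster sum of squares across k.
wss <- sapply(fits, function(f) f$tot.withinss)
print(round(wss, 2))

# Inspect one solution: cluster sizes and cluster means in the original units.
fit3 <- fits[["k3"]]
print(table(fit3$cluster))
print(aggregate(customers, by = list(cluster = fit3$cluster), FUN = mean))
```

Reporting cluster means in the original units, rather than the standardized ones, tends to make the clusters easier to describe in your narrative.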

  6. Write a narrative explaining your initial clustering results, including what the clusters indicate about the dataset.

How to Submit

Create a new page on your Google Site titled Project 2 Problem Set 1. Remember: Add this as a new page to your existing website. Do not create a new website.

This assignment has three components:

  1. Your Research Question
  • On your webpage for this assignment, describe the segmentation you have chosen, the research question, and who will benefit from the insights.
  2. R code
  • Comment each step of your code, including summary statistics, cluster results, and any relevant visualizations (e.g., the elbow and silhouette plots).
  3. Your Narrative
  • On your webpage for this assignment, write a narrative that corresponds with questions 4 and 6.
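For the elbow and silhouette plots mentioned under the R code component, one possible sketch is shown below. It assumes the cluster package is installed (it provides silhouette()), and it runs on simulated data with hypothetical variable names; the range k = 2 to 6 is only an illustration.

```r
library(cluster)  # provides silhouette(); assumed to be installed

set.seed(123)  # make the simulated data and random starts reproducible

# Simulated, standardized data with hypothetical variable names.
X <- scale(data.frame(
  total_spend = c(rnorm(20, 20, 5), rnorm(20, 60, 8), rnorm(20, 120, 10)),
  n_visits    = c(rnorm(20, 2, 0.5), rnorm(20, 6, 1), rnorm(20, 12, 2))
))

ks <- 2:6
d <- dist(X)  # pairwise Euclidean distances, reused for every k

wss <- numeric(length(ks))      # total within-cluster sum of squares
avg_sil <- numeric(length(ks))  # average silhouette width
for (i in seq_along(ks)) {
  fit <- kmeans(X, centers = ks[i], nstart = 25)
  wss[i] <- fit$tot.withinss
  avg_sil[i] <- mean(silhouette(fit$cluster, d)[, "sil_width"])
}

# Elbow plot: look for the k where the curve stops dropping steeply.
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot")

# Silhouette plot: a higher average width suggests better-separated clusters.
plot(ks, avg_sil, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Average silhouette width",
     main = "Average silhouette width by k")
```

The two plots do not always agree on a single k, so treat them as evidence to weigh in your explanation rather than as an automatic answer.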

Submit the link to your webpage on Canvas.

Reminder: How to Use “Compile Report” Option in R

Going forward, please use the “Compile Report” process to submit your R code. We will no longer use “sink-source-sink”.

  • Under the File menu, select Compile Report…
  • The default under “Report output format” is HTML.
  • Select Compile
  • Along the tool bar, select Open in Browser
  • Hover your cursor over white space and right click.
  • Select View page source.
  • A new screen will appear with HTML code. (Don’t worry, this might look like gibberish, but it’s not!)
  • Using your keyboard, press CTRL+A. This will highlight all of the HTML code.
  • While the HTML code is highlighted, right click and select Copy.
  • Go to your Google website.
  • As you did when embedding your Tableau visualization, select Embed under Insert.
  • Then, select Embed code.
  • Paste your embed code from your clipboard by either right-clicking and selecting Paste or CTRL+V.
  • Select Next.
  • Resize the image of your R script by expanding the window to fit the entire script.

Following these steps will display your entire code and output without truncation.