Week 09 Lab: Intro to Analyzing Cross-Sectional Data

This Lab Contributes to Course Objectives: 1, 3, 4, 5, 7, 8

Learning Objectives

  • Conduct exploratory data analysis

  • Conduct cluster analysis

R

This lab uses a cross-sectional dataset of Arizona grocery store visitation to introduce cluster analysis using R. First, we will explore the data to become familiar with it and to remove outliers.

Exploratory Data Analysis

We will use a dataset of grocery store locations in Arizona that contains information about the number of visitors, the distance (meters) from visitors’ homes, and the median time (minutes) visitors spend in the store. The dataset is provided at https://csu-arec-330.github.io/materials/unit_02/inputs/arizona_grocery_foot_traffic.csv. First, set your working directory and load the following packages: tidyverse, ggplot2, modelsummary, GGally, factoextra.

# This script explores Arizona grocery store foot traffic data and performs k-means clustering

# load packages
library(pacman)
p_load(tidyverse, ggplot2, modelsummary, GGally, factoextra)

# read in dataset
raw <- read_csv("https://csu-arec-330.github.io/materials/unit_02/inputs/arizona_grocery_foot_traffic.csv")

The package modelsummary provides some handy utilities for easily generating summary statistics. We will use datasummary_skim() to get a quick overview of the data. First, we will look at the numeric data by specifying type="numeric". What do you notice? Do the same for categorical data by specifying type="categorical".

datasummary_skim(raw,type = "numeric")
datasummary_skim(raw,type = "categorical")

You can customize this summary statistics table and output the result to a docx file (see the other options in the documentation). In the following example, we create a table with raw_visitor_counts, distance_from_home, and median_dwell as the rows and Mean, SD, Min, and Max as the columns. The character string passed to the output argument sets the file name in the current working directory.

datasummary(raw_visitor_counts + distance_from_home + median_dwell ~ Mean + SD + Min + Max,
            data=raw,
            output = "sumstats.docx")

The package GGally provides helpful EDA visualizations built on ggplot2. Use ggpairs() to visualize density plots and scatter plots and to better understand the correlations among the variables. What do you notice?

ggpairs(raw,columns = c("raw_visitor_counts","distance_from_home","median_dwell")) 

The data are heavily skewed, and every variable has a small number of very large values (fat tails). This suggests that we may want to log-transform the data. Let’s take a peek at what the pairs plot would look like if the data were logged. Note that we can use the dplyr utility across() to perform the same operation on multiple columns at once.

raw %>%
  select(raw_visitor_counts, distance_from_home, median_dwell) %>% # subset only our variables of interest
  mutate(across(everything(), log)) %>% # log-transform each variable
  ggpairs() # plot the transformed variables

The log transform corrected the skew but not the fat tails in distance_from_home and median_dwell. The very large values of distance_from_home may reflect out-of-town visitors with low visitation rates, and the very long dwell times may be employees. Let’s simply remove these very large values for this analysis.

raw %>%
  filter(distance_from_home < 48000, median_dwell < 90) %>% # keep only stores with median distance from home < 48 km and median dwell < 90 minutes
  select(raw_visitor_counts, distance_from_home, median_dwell) %>% # subset only our variables of interest
  mutate(across(everything(), log)) %>% # log-transform each variable
  ggpairs() # plot the transformed variables

This looks better. Let’s keep only those stores labeled as grocery stores, and apply the filter criteria from above.

# select relevant columns and scrub outliers
data <- raw %>%
  dplyr::filter(top_category == "Grocery Stores", # focus on stores classified as grocery stores
                distance_from_home < 48000, median_dwell < 90) %>% # apply the outlier filter from above
  select(placekey, raw_visitor_counts, distance_from_home, median_dwell) %>%
  mutate(across(c(raw_visitor_counts, distance_from_home, median_dwell), log)) %>% # log-transform the numeric variables
  drop_na() # drop any observations with missing values

Cluster Analysis

We will cluster the observations based on distance_from_home. Our first step is to standardize the data so that the units of the variables do not affect the clustering. Use the function scale() to transform each variable by first subtracting its mean and then dividing by its standard deviation.

# scale data
data_scaled <- data %>%
  select(distance_from_home) %>% 
  scale()
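As a quick sanity check, you can confirm the scaling worked: the scaled column should have a mean of roughly zero and a standard deviation of one.

# confirm the scaled variable has mean ~0 and standard deviation ~1
colMeans(data_scaled)
apply(data_scaled, 2, sd)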

Now we can run the k-means clustering algorithm with 3 clusters.

# perform k-means clustering with k=3
set.seed(123) # for reproducibility
kmeans_fit <- kmeans(data_scaled, 
                     centers=3,  #the number of clusters
                     nstart = 25) #the number of random starts
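Before interpreting the results, it can help to inspect the fitted object. The kmeans() return value includes, among other things, the cluster centers (in standardized units) and the number of observations assigned to each cluster.

# inspect the fit: cluster centers (standardized units) and cluster sizes
kmeans_fit$centers
kmeans_fit$size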

The k-means algorithm requires you to specify the number of clusters, and there are some helpful visuals that can guide this choice. One is the silhouette plot (see an explanation of the silhouette coefficient here). The other is the elbow plot, which involves calculating the “within sum of squares” for different numbers of clusters. As you increase the number of clusters, the within sum of squares declines; stop when an increase in the number of clusters only marginally decreases it.

# Create silhouette plot
fviz_nbclust(data_scaled, kmeans, method = "silhouette")

# Create elbow plot (within sum of squares)
fviz_nbclust(data_scaled, kmeans, method = "wss")
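If you want to see what the elbow plot is doing under the hood, here is a minimal sketch of the calculation: run kmeans() over a range of cluster counts and record the total within-cluster sum of squares each time (the range of 1 to 10 clusters is just an illustrative choice).

# manual elbow calculation: total within sum of squares for k = 1, ..., 10
set.seed(123) # for reproducibility
wss <- sapply(1:10, function(k) {
  kmeans(data_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within sum of squares")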

Adjust your number of clusters if necessary and rerun kmeans.
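For example, if your plots suggested two clusters (a purely hypothetical choice here; use whatever number your silhouette and elbow plots indicate), you would refit like so:

# refit with the chosen number of clusters (k = 2 is illustrative only)
set.seed(123)
kmeans_fit <- kmeans(data_scaled, centers = 2, nstart = 25)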

Assign the clusters back to the data and merge (join) the result with other store-level attributes.

# create a dataframe with the store-level attributes not included in the clustering
location_info <- raw %>%
  select(placekey,location_name,top_category,sub_category,latitude,longitude,city,region)

# add cluster labels to dataset and join to location info
data_clustered <- data %>% 
  mutate(cluster = kmeans_fit$cluster) %>%
  inner_join(location_info,by="placekey")

Do your clusters based on travel distance seem reasonable? What are the cluster means of the other variables?
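One way to answer these questions is to summarize the other variables by cluster. The sketch below back-transforms the logged variables with exp() so that the means are reported in the original units.

# cluster sizes and means of the back-transformed variables, by cluster
data_clustered %>%
  group_by(cluster) %>%
  summarize(n = n(),
            across(c(raw_visitor_counts, distance_from_home, median_dwell),
                   ~ mean(exp(.x))))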