# This script loads the required packages and reads in the dataset
# load packages
library(pacman)
p_load(tidyverse,ggplot2,modelsummary,GGally,factoextra)
# read in dataset
raw <- read_csv("https://csu-arec-330.github.io/materials/unit_02/inputs/arizona_grocery_foot_traffic.csv")
Week 09 Lab: Intro to Analyzing Cross-Sectional Data
This Lab Contributes to Course Objectives: 1, 3, 4, 5, 7, 8
Learning Objectives
Conduct exploratory data analysis
Conduct cluster analysis
This lab will use a cross-sectional dataset of Arizona grocery store visitation to introduce cluster analysis in R. First, we will explore these data to become familiar with them and remove outliers.
Exploratory Data Analysis
We will use a dataset of grocery store locations in the United States that contains information about the number of visitors, the median distance (meters) from visitors’ homes, and the median time (minutes) visitors spend in the store. The dataset is provided at https://csu-arec-330.github.io/materials/unit_02/inputs/arizona_grocery_foot_traffic.csv. First, set your working directory and load the following packages: tidyverse, ggplot2, modelsummary, GGally, factoextra
The package modelsummary provides some handy utilities for easily generating summary statistics. We will use datasummary_skim() to get a quick overview of the data. First, we will look at the numeric data by specifying type="numeric". What do you notice? Do the same for categorical data by specifying type="categorical".
datasummary_skim(raw,type = "numeric")
datasummary_skim(raw,type = "categorical")
You can customize this summary statistics table and output the result to a docx file (see the other options in the documentation). In the following example, we are creating a table with raw_visitor_counts, distance_from_home, and median_dwell as the rows and Mean, SD, Min, and Max as the columns. The character string passed to the output argument defines the file name in the current working directory.
datasummary(raw_visitor_counts + distance_from_home + median_dwell ~ Mean + SD + Min + Max,
data=raw,
output = "sumstats.docx")
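You are not limited to these four statistics or to Word output. As an illustrative sketch (check the modelsummary documentation for the statistic shortcuts and output formats your installed version supports), the following adds the number of observations and the median, and writes an html file instead:
# add more statistics and write to an html file instead of docx
datasummary(raw_visitor_counts + distance_from_home + median_dwell ~ N + Mean + SD + Median + Min + Max,
            data = raw,
            output = "sumstats.html")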
The package GGally provides helpful EDA visualizations using ggplot2. Use ggpairs() to visualize density plots and scatter plots to better understand correlation. What do you notice?
ggpairs(raw,columns = c("raw_visitor_counts","distance_from_home","median_dwell"))
The data are heavily skewed, and all of the variables have a small number of very large values (called fat tails). This suggests that we may want to log the data. Let’s take a peek at what the pairs plot would look like if the data were logged. Note that we can use a utility called across() from dplyr to perform the same operation on multiple columns at once.
raw %>%
  select(raw_visitor_counts,distance_from_home,median_dwell) %>% #subset only our variables of interest
  mutate(across(everything(),log)) %>% #log transform each of the variables
  ggpairs() #plot the transformed variables
The log transform helped correct the skew but not the fat tails in distance_from_home and median_dwell. The very large values in distance_from_home may be visitors to the area with low visitation rates, and the very long dwell times may be employees. Let’s simply remove these very large values for this analysis.
raw %>%
  filter(distance_from_home<48000,median_dwell<90) %>% #keep only stores with median distance from home <48km and median dwell less than 90 minutes
  select(raw_visitor_counts,distance_from_home,median_dwell) %>% #subset only our variables of interest
  mutate(across(everything(),log)) %>% #log transform each of the variables
  ggpairs() #plot the transformed variables
This looks better. Let’s keep only those stores labeled as grocery stores, and apply the filter criteria from above.
# select relevant columns and scrub outliers
data <- raw %>%
  dplyr::filter(top_category=="Grocery Stores", #focus on those classified as grocery stores
                distance_from_home<48000,median_dwell<90) %>% #apply outlier filter
  select(placekey,raw_visitor_counts,distance_from_home,median_dwell) %>%
  mutate(across(c(2:4),log)) %>% #log transform
  drop_na() #drop any observations with missing values
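Before moving on, a quick optional sanity check is to compare row counts before and after cleaning and to rerun the numeric summary on the cleaned data:
# quick check: how many stores survive the filters, and do the summaries look reasonable?
nrow(raw)
nrow(data)
datasummary_skim(data,type = "numeric")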
Clustering Analysis
We will cluster the observations based on distance_from_home. Our first step will be to standardize the data so the units of the variables do not affect the clustering. Use the function scale() to transform each variable by first subtracting the mean, then dividing by the standard deviation.
# scale data
data_scaled <- data %>%
  select(distance_from_home) %>%
  scale()
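If you want to confirm what scale() is doing, a minimal optional check is to standardize distance_from_home by hand and compare it to the scaled column:
# scale() should match subtracting the mean and dividing by the standard deviation
manual <- (data$distance_from_home - mean(data$distance_from_home)) / sd(data$distance_from_home)
all.equal(as.numeric(data_scaled[,1]), manual) # should return TRUE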
Now, we can run the kmeans clustering algorithm with 3 clusters.
# perform k-means clustering with k=3
set.seed(123) # for reproducibility
kmeans_fit <- kmeans(data_scaled,
                     centers=3, #the number of clusters
                     nstart = 25) #the number of random starts
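The fitted object stores several useful pieces, including the number of observations assigned to each cluster and the cluster centers (in scaled units). A quick way to inspect them:
# inspect the fit: cluster sizes and (scaled) cluster centers
kmeans_fit$size
kmeans_fit$centers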
The kmeans algorithm needs you to specify the number of clusters. There are some helpful visuals that can guide your choice of the number of clusters. One is known as the silhouette plot; see an explanation of the silhouette coefficient here. The other is the elbow plot, which involves calculating the “within sum of squares” for different numbers of clusters. As you increase the number of clusters, the within sum of squares declines. Stop when an increase in the number of clusters only marginally decreases the within sum of squares.
# Create silhouette plot
fviz_nbclust(data_scaled, kmeans, method = "silhouette")
fviz_nbclust(data_scaled, kmeans, method = "wss")
Adjust your number of clusters if necessary and rerun kmeans.
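For example, if the plots pointed to four clusters, the rerun might look like this (the 4 below is only a placeholder; substitute whatever number the plots support):
# rerun k-means with the number of clusters suggested by the plots (4 is a placeholder)
set.seed(123)
kmeans_fit <- kmeans(data_scaled, centers = 4, nstart = 25)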
Assign the clusters back to the data and merge (join) the data back with other store level attributes.
#create a dataframe with the store level attribute data not included in the clustering
location_info <- raw %>%
  select(placekey,location_name,top_category,sub_category,latitude,longitude,city,region)

# add cluster labels to dataset and join to location info
data_clustered <- data %>%
  mutate(cluster = kmeans_fit$cluster) %>%
  inner_join(location_info,by="placekey")
Do your clusters based on travel distance seem reasonable? What are the cluster means of another variable?
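One way to answer the second question is to compute cluster means of the other variables with group_by() and summarise(); remember these columns are on the log scale, so exponentiate if you want values in the original units:
# cluster means of the other (logged) variables
data_clustered %>%
  group_by(cluster) %>%
  summarise(mean_log_visitors = mean(raw_visitor_counts),
            mean_log_dwell = mean(median_dwell),
            n_stores = n())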