Week 13 Lab: Introduction to Panel Data Analysis

This Lab Contributes to Course Objectives: 1, 3, 4, 5, 7, 8

Learning Objectives

  • Read in and manipulate spatial data in R

  • Join spatial data


Spatial data is a form of cross-sectional data that contains geographic information.1 Examples include the location of a store or park, the boundary of a parcel of land, or the location of a weather measurement. In many cases, we need to process that data in some way or join it to other (potentially spatial) data for analysis.

Orientation and terms

Two common types of spatial data in R are vector and raster data. Vector data represent geographic features as points, lines, and polygons, while raster data represent geographic features as a grid of cells with values assigned to each cell. Vector data is often used to represent discrete features such as the boundaries of a city or the location of a specific point of interest. Raster data, on the other hand, is often used to represent phenomena such as elevation or temperature that are continuous across the landscape.
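To make the vector model concrete, the sf package (introduced below) can build a single point feature directly from a pair of coordinates. A minimal sketch, with made-up coordinates:

# a minimal vector-data example: one point feature built with sf
library(sf)
pt <- st_sfc(st_point(c(-105.08, 40.57)), crs = 4326)  # illustrative lon/lat pair
pt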

Working with spatial data in R involves using specialized packages and functions to read, manipulate, and visualize the data. Some popular packages for working with spatial data in R include sf and raster. With these tools, it’s possible to perform a wide range of spatial analyses, such as overlaying different layers of data to find areas of overlap or proximity, extracting data for specific regions of interest, and creating custom maps and visualizations. Taro Mieno has written a useful reference on using R for GIS.

Task overview

It is common to find data with specific location information (e.g., latitude and longitude coordinates). You may have separate data with other information that you want to associate with your location data (e.g., census information on median income). This lab walks you through the process of joining one spatial dataset with another. Specifically, we will join store locations, defined by longitude and latitude, with county boundaries in order to associate county-level census data.

Setup

Our first step is to open an R script and load packages. Then we read in the convenience store data set we have been using.

# This script will demonstrate how to work with spatial data and conduct some basic operations (e.g., intersection)

# load packages
library(pacman)
p_load(tidyverse,janitor,sf,tigris,mapview,tidycensus) # dplyr loads with tidyverse

# read in convenience store location dataset
store_raw <- read_csv("https://csu-arec-330.github.io/materials/unit_03/inputs/store_info.csv.gz")

Converting to a spatial format

The fields of geography and geographic information systems (GIS) specialize in understanding and working with geospatial data. Locations are meaningful because of a commonly understood coordinate reference system (CRS). Think about a graph with \(x\) and \(y\) dimensions: a point on that graph has a set of coordinates that tell you its location. A CRS in GIS can be more complicated because we are thinking about points that exist on a globe (this would be easier if the earth were flat).

The store_raw dataframe has latitude and longitude coordinates (variables in the dataframe). At this point, R treats these numbers like any other numeric variable. There are packages built to understand and work with geospatial data; sf, which stands for simple features, is a commonly used one. We will use it to convert the store_raw data into a spatially aware data frame.

# convert data frame to a points simple feature object
store_geo <- store_raw %>%
  select(store_id,latitude,longitude) %>%  #Keep only unique identifier and coordinates - we can always join back to the other information
  st_as_sf(coords=c("longitude","latitude"),crs=st_crs(4326))

Notice that we first subset the variables store_id and the coordinates. The next line tells R that these are essentially \(x\) and \(y\) coordinates in a certain projection (represented by the code 4326; see https://epsg.io/). store_geo is a different kind of object than we have worked with before, but it looks and usually behaves like a familiar dataframe. You can mutate new variables, arrange, and even summarize, but note that summarize also performs geospatial operations because you are changing the unit of analysis.
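To confirm the conversion worked, inspect the new object; the geometry column and the CRS now travel with the data:

# inspect the spatially aware data frame
st_crs(store_geo)  # reports the EPSG:4326 coordinate reference system
head(store_geo)    # prints like a tibble, plus a geometry column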

It is often helpful to look at your data, especially when mapping. R can leverage very powerful mapping tools; here, we will just use them to make sure we are doing what we think we are doing. We can use mapview() to look at the data.

# set the option to make mapview render the points in preview
mapviewOptions(fgb = FALSE)

# map the grocery stores
mapview(store_geo)

After you plot these points, what do you observe about the locations of the stores? What might you want to do with your data to remove outliers? Consider these answers as we move into the next section.
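One possible cleanup, sketched below, drops points with clearly invalid coordinates; it assumes that stores in the US should have negative longitudes. The next section takes a different approach and filters by state.

# a sketch of one cleanup option: drop points with positive longitudes,
# which cannot be in the US (an assumption for illustration only)
store_geo_clean <- store_geo %>%
  filter(st_coordinates(.)[, "X"] < 0)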

Getting Census boundaries

The dataframe store_raw contains a lot of metadata, but not the county. Boundaries for commonly used geographies (such as county boundaries) are available from the US Census and accessible through an API to the Census repository of these boundaries (called TIGER). The R package tigris provides a very convenient interface for accessing them.

# get county polygons from tigris (Census)
us_co <- tigris::counties(cb=T,class="sf") %>%
  janitor::clean_names() %>%
  st_transform(4326)

This command extracts the county boundaries for the US and reads them into an sf object called us_co. If, for some reason, there is an error, you can also load the county boundaries from the course website:

us_co <- st_read("https://csu-arec-330.github.io/materials/unit_03/inputs/us_co.gpkg")

The column names are capitalized, so we clean them with janitor::clean_names(), and we use st_transform() to ensure the counties are in the same coordinate reference system as our store locations. Notice that the spatial type of these data is MULTIPOLYGON rather than POINT.
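Before joining the two layers, it is worth verifying that they share a CRS:

# sanity check: the store points and county polygons should share a CRS
st_crs(store_geo) == st_crs(us_co)  # should print TRUE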

You Do It

Use mapview to plot the county layer.

Code
# use mapview to plot the county layer
mapview(us_co)

Note that the sf object us_co contains information about the counties in the US, including county name, state, the area covered by land and water, and some numeric codes. Since counties in different states can have the same name, counties are uniquely identified by a 5-digit Federal Information Processing Standards (FIPS) code. This code is labeled geoid in the us_co object. The first two digits of the FIPS code identify the state and the last three identify the county within that state. Larimer County in Colorado has the FIPS code 08069. This FIPS code is used to join county-level data.
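A quick sketch to see the FIPS structure for yourself, splitting geoid into its state and county parts:

# illustrate the FIPS structure: state (first 2 digits) + county (last 3)
us_co %>%
  st_set_geometry(NULL) %>%
  select(geoid, name, stusps) %>%
  mutate(state_fips = str_sub(geoid, 1, 2),
         county_fips = str_sub(geoid, 3, 5)) %>%
  head()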

Removing spatial outliers

When analyzing spatial data, for tractability’s sake, we might only want to look at data for the contiguous United States. So, it’s often useful to remove territories and states that are not part of the “lower 48.” States like Alaska and Hawaii, as well as territories such as American Samoa, Guam, and Puerto Rico, differ significantly in their geographical context and may introduce anomalies or skew the results when evaluating nationwide trends or metrics. Excluding them helps to maintain a focused analysis on the contiguous states where geographical and demographic characteristics are more uniform. This simplification can be particularly beneficial in analyses that involve spatial relationships or distance calculations, where the vast distances to these non-contiguous areas can distort results.

# Step 1: Look at the distinct state list by statefp code
unique_statefp <- us_co %>%
  st_set_geometry(NULL) %>%  # Remove geometry column
  select(stusps, statefp) %>%  # Select the columns of interest
  distinct() %>%  # Remove duplicates, keeping only unique rows
  arrange(stusps)  # Arrange alphabetically by stusps

# Step 2: Filter out counties from American Samoa (60), Guam (66), Saipan Municipality (69), Puerto Rico (72), Virgin Islands (78), Alaska (02), and Hawaii (15)
us_co_filtered <- us_co %>%
  filter(!statefp %in% c("60", "66", "72", "02", "15", "69", "78"))

# Step 3: Use mapview to plot the county layer, excluding the specified states
mapview(us_co_filtered)
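A quick check that the filter behaved as expected:

# verify the filter: the lower 48 states plus DC should remain
us_co_filtered %>%
  st_set_geometry(NULL) %>%
  distinct(stusps) %>%
  nrow()  # expect 49 (48 states + the District of Columbia)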

Intersecting points with polygons

Spatial data processing tools can use the shared location of points and polygons to associate data from one dataset with another. Our objective is to associate a county name and identifier with each convenience store.

# join county to store_geo
store_co_geo <- st_join(store_geo,us_co_filtered,join=st_intersects)

This operation may take some time depending on the machine you are working on. If working on the server, remember that it is a shared resource, so just be patient.

Notice that store_co_geo has all of the observations from store_geo with the corresponding variables from us_co_filtered attached. Now we have county information associated with each store location. We can join the convenience store data to census data or other county-level data via the 5-digit FIPS code. We can also aggregate convenience store information up to the county level using group_by() %>% summarize().
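By default, st_join() performs a left join, so every store is kept and stores that fall outside the retained counties simply get NA county fields. A quick check:

# count stores that did not match any county (NA geoid after the left join)
store_co_geo %>%
  filter(is.na(geoid)) %>%
  nrow()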

You Do It

Aggregate convenience store information up to the county level using group_by() %>% summarize() and plot the number of stores per county using mapview.

Code
# Aggregate store count by county
store_count_by_county <- store_co_geo %>%
  group_by(geoid) %>%
  summarize(store_count = n(), .groups = 'drop')  # Drop groups to prevent regrouping

# Join aggregated data back with county geometries for mapping
county_store_map <- st_join(us_co_filtered,store_count_by_county,join=st_intersects)

# Visualize the result with mapview (showing number of stores per county)
mapview(county_store_map, zcol = "store_count")

Accessing US Census data

The US Census contains a wealth of data that can be used in analysis. We will access median household income from the American Community Survey, an annual survey conducted to supplement the decennial census, using the R package tidycensus, which conveniently wraps the Census API (https://walker-data.com/tidycensus/). The Census stores many datasets, so we need to identify the one we want to use to represent median household income. The tidycensus package provides a utility to find the table we want:

v22 <- load_variables(2022, "acs5", cache = TRUE)
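If the call errors, you may need to register a (free) Census API key first; you can also search the variable descriptions in code rather than scrolling through the viewer. Both are sketched below; the key string is a placeholder.

# one-time setup if the API rejects your requests (key string is a placeholder)
# census_api_key("YOUR_KEY_HERE", install = TRUE)

# search the variable descriptions in code
v22 %>%
  filter(str_detect(label, "Median household income")) %>%
  select(name, label)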

Use the filter feature in the RStudio viewer, export the data as a CSV that you can open in Excel and search the descriptions, or search in code as sketched above. As you can see, there are many other demographic tables available in the Census data. In this case, the median household income code is B19013_001 (the table is B19013 and the variable we want is 001). We can access the data using the function get_acs().

# download data from Census API
census_hhi <- get_acs(geography = "county",  #we want county-level data
                      survey = "acs5", #get data from the 5-year American Community Survey
                      variables = c(medincome = "B19013_001"), #we want median household income from table B19013 
                      state = NULL, #leave this null for the whole US but it could be used to subset states 
                      year = 2022)

# make names lower snake case and keep only the geoid and the estimate (renamed to hhi)
hhi <- census_hhi %>%
  clean_names() %>%
  select(geoid,hhi=estimate)

This call to the API downloads county names and identifiers along with the variable estimate and margin of error.

Joining household income to convenience stores

We can join median household income to the stores by the common variable, geoid.

# joining on county identifier
store_hhi <- store_co_geo %>%
  st_set_geometry(NULL) %>%  #convert sf object back to dataframe by simply stripping out the geometry
  inner_join(.,hhi,by="geoid")

We have added median household income to our store-level data. Export this data and visualize it in Tableau.
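A minimal export sketch; the file name is just an example:

# write the joined data to a CSV for Tableau (the path is illustrative)
write_csv(store_hhi, "store_hhi.csv")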

Aggregating by geography

Now we have the information to count the number of stores in each county. We can use the group_by() and summarize() combination to calculate aggregate measures by county. Moreover, we don’t need the spatial information anymore since everything will be at the county level. We use st_set_geometry(NULL) to convert the sf object back to a data frame (without spatial information), and then count the number of stores in each county.

# aggregate stores by county
county_stores <- store_co_geo %>%
  st_set_geometry(NULL) %>%  #convert sf object back to dataframe by simply stripping out the geometry
  group_by(geoid,county=name,state=stusps) %>%
  summarize(total_stores = n()) %>% #the function n() counts the number of occurrences
  ungroup() #safer; otherwise, the grouping stays with the dataframe

The result, county_stores, contains the number of convenience stores in each county.
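Because county_stores keeps the geoid, one way to map the counts is an ordinary attribute join back to the county polygons, an alternative to the spatial join used in the earlier You Do It:

# attribute-join the counts back to the polygons, then map them
county_stores_geo <- us_co_filtered %>%
  left_join(county_stores, by = "geoid")

mapview(county_stores_geo, zcol = "total_stores")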

Creating a panel

We may want to join our data to other data that varies over time. Suppose we are interested in how total sales at a store correlate with high temperatures. Weather data varies over time and space; store locations, however, are fixed and do not change over time. We can use our existing data store_co_geo, which has county information associated with each store, and join it to shopper_info, which contains the transaction-level information. We need to aggregate the transaction-level data up to sales per store-day.

# read in the transaction level data
shopper_info <- read_csv("https://csu-arec-330.github.io/materials/unit_03/inputs/shopper_info.csv.gz")

# read in the weather data
weather_raw <- read_csv("https://csu-arec-330.github.io/materials/unit_03/inputs/gridmet_county_daily_2023-2023.csv.gz")

shopper_info contains store_id, which we can use to join to store_co_geo, and date_time. First, we extract the date component, as we are not concerned with the time of day. Then we group by store_id and measure_date and sum sales (unit_price*unit_quantity).

# aggregate transactions to sales per store-day
store_sales <- shopper_info %>%
  mutate(measure_date=as_date(date_time)) %>%
  group_by(store_id,measure_date) %>%
  summarize(sales=sum(unit_price*unit_quantity,na.rm=TRUE)) %>%
  ungroup()

store_sales contains the total sales by date for each store in the dataset. We can now join store_sales with county information associated with each store.

# attach the county identifier to store-day sales
store_sales_co <- st_set_geometry(store_co_geo,NULL) %>%
  select(store_id,geoid) %>% #selecting on store_id and geoid only because we need the store id to join and the county id for weather
  inner_join(store_sales,.,by="store_id") #inner_join here because we only want to consider stores for which we have county information

We have county-day measurements of weather. However, the data is in long format (as opposed to wide): the unit of observation of weather_raw is county-day-weather_measurement, but we need it to be county-day with the different measurements as columns. We can reshape this data from long to wide using the function pivot_wider().

The next step is to join the weather data with store_sales_co. The columns we need to join on have different names, so we can either rename them or tell R how to map the names. Note that the order of the variable names matters: the name on the left should come from the dataframe listed first. Also note that the weather data covers the continental US, so store locations outside of it will be dropped by the inner_join().

# reshaping weather data from long to wide
weather_wide <- weather_raw %>%
  pivot_wider(id_cols = c(county,date),names_from = variable,values_from = value)

# joining weather to store_sales_co
store_sales_weather <- store_sales_co %>%
  inner_join(.,weather_wide, by = c("measure_date"="date","geoid"="county"))
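A quick structural check that the result really is a store-day panel:

# each store-date pair should appear exactly once in the panel
store_sales_weather %>%
  count(store_id, measure_date) %>%
  filter(n > 1)  # zero rows expected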

Summary

This lab demonstrates how to join spatial point data (convenience store locations) with spatial polygon data (county boundaries). Standard political boundaries like counties can be used as a common identifier to join US Census data. The lab also covered joining data that varies over space and time.

Footnotes

  1. Spatial data can also have temporal dimensions, and the concepts we cover here extend to those data too.