Week 9:
Cross-Sectional Data

Agenda

What is cross-sectional data?

Common types of cross-sectional data and vizzes

Analyzing cross-sectional data

Project 2 overview and example

Data Types

What is cross-sectional data?

  • Cross-sectional data is collected by observing a study population at a single point in time, or over a period of time with the information aggregated to a single observation per subject.

  • We call it cross-sectional data because we are observing information for a slice, snapshot, or cross-section of a group of subjects.

  • This differs from time series data, which follows the same subject repeatedly over time. Here each subject contributes only one observation.
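As an illustration of "aggregating information to a single observation per subject", here is a minimal sketch that collapses a purchase log into one row per shopper. The field names (`shopper_id`, `day`, `amount`) and the values are hypothetical and are not taken from the course data.

```python
from collections import defaultdict

# Hypothetical purchase log: several rows per shopper over a one-month period.
purchases = [
    {"shopper_id": "A", "day": 1, "amount": 12.50},
    {"shopper_id": "A", "day": 9, "amount": 7.25},
    {"shopper_id": "B", "day": 3, "amount": 30.00},
    {"shopper_id": "C", "day": 4, "amount": 5.00},
    {"shopper_id": "C", "day": 20, "amount": 15.00},
    {"shopper_id": "C", "day": 28, "amount": 10.00},
]


def to_cross_section(rows):
    """Collapse repeated observations into one summary record per subject."""
    totals = defaultdict(lambda: {"visits": 0, "total_spent": 0.0})
    for row in rows:
        record = totals[row["shopper_id"]]
        record["visits"] += 1
        record["total_spent"] += row["amount"]
    # Add a per-visit average so each shopper is described by a few summary numbers.
    return {
        shopper: {**rec, "mean_spent": rec["total_spent"] / rec["visits"]}
        for shopper, rec in totals.items()
    }


cross_section = to_cross_section(purchases)
for shopper, rec in sorted(cross_section.items()):
    print(shopper, rec)
```

The time dimension (`day`) disappears in the result: what remains is one record per shopper, which is the shape that the analyses in this deck expect.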

Common types of cross-sectional data

  • Individual-level data

  • Business or point of interest-level data

  • Country-level data

  • Region-level data

  • Spatial data

Analyzing Cross-Sectional Data

  • Cross-sectional data analysis typically involves studying:

    • similarities or differences among subjects

    • trends and associations in the population

Analyzing Cross-Sectional Data cont.

  • Questions about frequency:

    • how common is the outcome?

    • how many people are impacted?
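A frequency question usually reduces to a count and a share. The sketch below assumes a hypothetical survey with one record per person and a yes/no outcome field named `has_outcome`; both the field name and the values are made up for illustration.

```python
# Hypothetical cross-sectional survey: one record per respondent.
respondents = [
    {"id": 1, "has_outcome": True},
    {"id": 2, "has_outcome": False},
    {"id": 3, "has_outcome": False},
    {"id": 4, "has_outcome": True},
    {"id": 5, "has_outcome": False},
    {"id": 6, "has_outcome": False},
    {"id": 7, "has_outcome": True},
    {"id": 8, "has_outcome": False},
]


def prevalence(records, key="has_outcome"):
    """Return (number of subjects with the outcome, share of all subjects)."""
    count = sum(1 for r in records if r[key])
    return count, count / len(records)


count, share = prevalence(respondents)
print(f"{count} of {len(respondents)} respondents ({share:.1%}) have the outcome")
```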

Analyzing Cross-Sectional Data cont.

  • Questions about associations:

    • what factors are associated with the outcome (age, gender, income, pollution, health, etc.)?
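One simple way to quantify an association between two numeric variables is the Pearson correlation coefficient. The sketch below computes it by hand on hypothetical values for household income and weekly spending; the variable names and numbers are assumptions. Because the data is a single snapshot, a strong correlation indicates association only and does not by itself show that one factor causes the other.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of x and y around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, one pair per household.
income = [20, 35, 50, 65, 80]   # thousands of dollars per year
spend = [5, 9, 11, 16, 19]      # dollars per week at the store

r = pearson_r(income, spend)
print(f"correlation between income and spend: {r:.3f}")
```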

Analyzing Cross-Sectional Data cont.

  • Questions about similarity or dissimilarity among units:

    • Cluster analysis - a systematic method for grouping units so that members of a group are more similar to each other than to members of other groups

Customer segmentation analysis

Use sales data to divide your customers into groups to better target marketing

How does cluster analysis work?

K-means clustering

  1. Choose the number of clusters (k) that you want to identify in the data.

  2. Randomly initialize the k cluster centroids (points in space) within the data range.

  3. Assign each data point to the nearest centroid, based on the Euclidean distance between the point and each centroid.

  4. Calculate the mean (centroid) of each cluster based on the data points assigned to it.

  5. Update the cluster centroids to be the means of the data points assigned to them.

  6. Repeat steps 3-5 until convergence (when the cluster assignments no longer change or a maximum number of iterations is reached).

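Usually, we repeat this whole process a number of times with different random starting centroids and choose the group assignment that minimizes the total within-cluster variance (the sum of squared distances from each point to its cluster centroid).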
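The steps above, including the repeated restarts, can be sketched in plain Python as follows. This is an illustrative sketch and not a reference implementation: the function names `kmeans` and `best_of`, the toy data, and the choice to leave an empty cluster's centroid where it is are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, rng=None):
    """One k-means run. Returns (labels, centroids, within-cluster sum of squares)."""
    rng = rng or random.Random()
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    # Step 2: random centroids somewhere within the range of the data.
    centroids = [
        tuple(rng.uniform(lows[d], highs[d]) for d in range(dims)) for _ in range(k)
    ]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # Step 6: assignments stopped changing
        labels = new_labels
        # Steps 4-5: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wcss


def best_of(points, k, n_init=10, seed=0):
    """Repeat k-means from different random starts and keep the lowest-variance run."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng=rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


# Hypothetical customers described by two attributes each.
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]
labels, centroids, wcss = best_of(points, k=2)
print(labels, round(wcss, 3))
```

A single run can end in a poor solution if the starting centroids are unlucky, for example when one centroid starts closer to every point than the other, which is why the restarts matter. Because the assignment step uses Euclidean distance, attributes measured on very different scales are often standardized before clustering so that no single attribute dominates.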

After clustering, summarize attributes of the clusters to understand the groups.
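Summarizing clusters typically means computing the size of each group and the average of each attribute within it. The sketch below assumes the cluster labels have already been attached to each record; the field names (`cluster`, `visits_per_month`, `avg_basket`) and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customers with a cluster label from an earlier clustering step.
customers = [
    {"cluster": 0, "visits_per_month": 2, "avg_basket": 45.0},
    {"cluster": 0, "visits_per_month": 3, "avg_basket": 55.0},
    {"cluster": 1, "visits_per_month": 12, "avg_basket": 8.0},
    {"cluster": 1, "visits_per_month": 15, "avg_basket": 6.0},
    {"cluster": 1, "visits_per_month": 9, "avg_basket": 10.0},
]


def summarize_clusters(rows, attributes):
    """Return the size and per-attribute mean for each cluster label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row)
    return {
        label: {
            "size": len(members),
            **{a: mean(m[a] for m in members) for a in attributes},
        }
        for label, members in sorted(groups.items())
    }


summary = summarize_clusters(customers, ["visits_per_month", "avg_basket"])
for label, stats in summary.items():
    print(label, stats)
```

In this made-up example, cluster 0 looks like occasional shoppers with large baskets and cluster 1 like frequent shoppers with small baskets. Descriptive names of that kind are what make the groups usable for targeting.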

Analyzing Cross-Sectional Data

How might you use cross-sectional data to decide where to open a new grocery store?

Take a minute to write out what information you would need in a dataset to apply cluster analysis.

Project 2: Overview

Form groups of 2 (THIS WEEK)

We will provide convenience store data on shoppers, stores, and purchases

Perform an exploratory data analysis and assemble summary statistics; analyze spatial patterns, correlations, etc.; and construct visualizations to answer your research question.

Conduct market segmentation analysis via clustering

Label clusters based on characteristics

Develop a simple marketing strategy to target clusters

Project 2: Example

Research Objective:

  • Develop wildfire mitigation recommendations for at-risk communities

Important background details:

  • Wildfire mitigation: community fuel treatments, grant programs to help offset costs, regulations and enforcement

  • Some communities already do wildfire risk mitigation. Does one size fit all, or are there certain strategies that work better in certain communities?

Project 2: Example

  • Collect data on community characteristics (socioeconomic and demographics from census) and wildfire risk

  • Use clustering to identify which communities are similar to each other

  • Name the clusters

  • Characterize the wildfire risk mitigation approaches that worked within clusters

Project 2: Example