Week 5 Lab: Time Series Exploratory Data Analysis (EDA)

This Lab Contributes to Course Objectives: 1, 2, 3, 4, 7, 8

Learning Objectives

Learning Objectives in R

  • Read in time series data
  • Work with dates in R
  • Download data via API

Learning Objectives in Tableau

  • Understand and use date functions
  • Create calculated fields using date functions
  • Visualize time series data
  • Create and interpret trend lines
  • Identify and visualize structural breaks
  • Use panes to visualize multiple time series

Download Lab Handout Here

Before you begin

  1. Create a new sub-folder called lab_05 under your arec_330 folder.
  • This is the folder where you will save all of the files we will download during this lab.
  1. Open a new R script and save it as lab_05_script.R.

  2. Add a description of the script as a comment.

  3. Set your working directory as arec_330/lab_05.

  4. Load (install if necessary) needed packages.

Time series with R

The overall objective of this lab is to acquaint you with time series data.

Time series data is widely used but lacks a standardized format. Importantly, dates and times are not inherently understood by machines—they are stored as numbers relative to a specific starting point (epoch).1

  • R Epoch: January 1, 1970
  • Excel Epoch: January 1, 1900

You can confirm this by running:

#epoch
as.numeric(as.Date("1970-01-01"))
[1] 0
#add one day to see how dates are tracked
as.numeric(as.Date("1970-01-02"))
[1] 1

Importing time series data

Importing time series data is similar to other types of data, but special attention is needed for date formatting.

We will use retail egg price data from the St. Louis FED (FRED).

You can download the data manually or import it directly via read_csv():

#Load necessary libraries
library(readr)
Warning: package 'readr' was built under R version 4.4.2
library(dplyr)
Warning: package 'dplyr' was built under R version 4.4.2

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
#Read in the data
egg_raw <- read_csv("https://csu-arec-330.github.io/materials/unit_01/inputs/APU0000708111.csv")
Rows: 517 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): APU0000708111
date (1): DATE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The read_csv() function will try to guess the data type. In this case, it correctly identifies the variable DATE as a date.

Review the data type of the variable directly by looking at the dataframe:

#Confirm that R read in the date field as a date
glimpse(egg_raw)
Rows: 517
Columns: 2
$ DATE          <date> 1980-01-01, 1980-02-01, 1980-03-01, 1980-04-01, 1980-05…
$ APU0000708111 <dbl> 0.879, 0.774, 0.812, 0.797, 0.737, 0.731, 0.776, 0.907, …

Handling dates in R

Sometimes, date fields are stored as text and need to be converted. In these cases, you need to give R some guidance.

To see this in practice, read in a modified version of the egg prices data.

Assign the dataframe to an object called egg_raw_c to avoid overwriting the existing data:

#Read in the data
egg_raw_c <- read_csv("https://csu-arec-330.github.io/materials/unit_01/inputs/APU0000708111_cdate.csv")
Rows: 517 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): CHAR_DATE
dbl (1): APU0000708111

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Confirm that R read in the date field as text and *not* a date
glimpse(egg_raw_c)
Rows: 517
Columns: 2
$ APU0000708111 <dbl> 0.879, 0.774, 0.812, 0.797, 0.737, 0.731, 0.776, 0.907, …
$ CHAR_DATE     <chr> "01/01/1980", "02/01/1980", "03/01/1980", "04/01/1980", …

Notice that the date is not recognized as a date type variable. You can verify this because CHAR_DATE is listed as chr (1).

Next, we need to tell R how to interpret that date. We will use functions from the tidyverse package, specifically the lubridate library, designed to help work with dates.

# The easiest way to get lubridate is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just lubridate:
install.packages("lubridate")

#Load the necessary libraries
library(lubridate)

#Use the mutate() function along with the mdy() function to convert the text date to a date value
egg_raw_c <- mutate(egg_raw_c,measure_date=mdy(CHAR_DATE))

#Confirm that you have successfully converted the variable to a date field
glimpse(egg_raw_c)

Alternatively, you can chain commands:

#Chaining commands to produce the same data frame
egg_raw_c %>%
  mutate(measure_date=mdy(CHAR_DATE)) %>%
  glimpse()

Video: Reading in Time Series Data

Accessing data via API

API stands for Application Programming Interface and they facilitate much of the communication across the modern web.

API schematic. Source: https://www.altexsoft.com/blog/engineering/what-is-api-definition-types-specifications-documentation/

We use APIs to query data. FRED maintains an API that can be accessed using the library tidyquant. This library also has a large array of finance analysis tools. Right now, we will just use it to access FRED data.

Retrieving data via an API

The API we will use retrieves data from the FRED. Specifically, we will retrieve price series data, available here.

Before you begin

Before you begin, explore the FRED website.

  1. Go to https://fred.stlouisfed.org/categories/32455. What do you see?

  2. Select Commodities.

  3. Use the Filter by navigation bar to filter concepts. Select Retail. What changes?

  4. The default list of tables is sorted by popularity. Currently, the average price of eggs is the most popular. Select that table.

Now, let’s deconstruct the image you see.

  1. This is the unique identifier from FRED that is specific to the data series.

  2. This is the date range underlying the data series.

  3. This is the actual data series, not seasonally adjusted.

Now that we are familiar with this layout, let’s use an API to retreive the data and save it as a data frame in R.

Step 1: Install and load libraries

Install tidyquant on your machine (e.g., install.packages("tidyquant")) and load the library.

#Install the package if you have not already done so. Comment this out after you have installed it once on your machine.
install.packages("tidyquant")

#Load the library. The library needs to be loaded every time you begin a new instance of R.
library(tidyquant, quietly = TRUE) #Note: The argument quietly = T in the library() function suppresses messages that would normally be printed when loading a package.

Step 2: Retrieve time series data via tq_get()

For this exercise, we will use the function tq_get() to access the FRED API.

The tq_get() function is used to retrieve time series data from sources such as FRED, stock prices, and exchange rates.

data <- tq_get("symbol", 
               get = "data_source", 
               from = "YYYY-MM-DD", 
               to = "YYYY-MM-DD")
  • “symbol” \(\rightarrow\) The unique identifier for the data series (e.g., “AAPL” for Apple stock, “GDP” for U.S. GDP from FRED).
  • get \(\rightarrow\) The data source, such as:
    • “economic.data” for macroeconomic data from FRED.
    • “stock.prices” for historical stock prices.
    • “dividends” for dividend data.
    • “exchange.rates” for currency exchange rates.
  • from and to \(\rightarrow\) Defines the date range for the retrieved data in YYYY-MM-DD format.
  • data \(\rightarrow\) The resulting dataframe containing the retrieved time series.

This function can do many things so we need to be specific with some parameters:

  • First, we need to find the unique identifier from FRED that points to the data we want. In this case, we are going to grab retail egg prices (the same data that we downloaded earlier). The unique identifier is "APU0000708111". Remember the quotes.

  • Second, the documentation tells us that to access data from FRED we need to set the get parameter to "economic.data".

  • Third, we can specify the date range with the from and to parameters. Remember to use the YYYY-MM-DD format.

#Use the tq_get() function to retrieve the time series data
egg_raw_tq <- tq_get("APU0000708111", # This is the symbol for retail egg prices
                     get = "economic.data", # This is the data source
                     from="1990-01-01", # This is the start of our time range
                     to="2024-02-28") # This is the end of our time range

glimpse(egg_raw_tq)

Step 3: Retrieve multiple time series data using tq_get()

You can download multiple time series at once by using a vector of identifiers.

Let’s download the producer price index to compare to the retail price of a dozen eggs. The identifier is WPU017107. Note that I am going to overwrite the existing object egg_raw_tq because I am going to download the retail price series as well as the producer price index.

#Use the tq_get() function to retrieve two time series datasets
egg_raw_tq <- tq_get(c("APU0000708111","WPU017107"),
                     get = "economic.data",
                     from="1990-01-01",
                     to="2024-02-28")

glimpse(egg_raw_tq)
Using c() to Retrieve Multiple Series

When retrieving multiple time series, use c() to pass multiple symbols into tq_get(). What this code does is:

  • c("APU0000708111", "WPU017107") \(\rightarrow\) A vector containing multiple FRED series.
  • This retrieves both retail egg prices (APU0000708111) and the producer price index for eggs (WPU017107).
  • The resulting dataframe will contain stacked time series, with an identifier column (symbol) to distinguish them.

Step 4: Rename variables for clarity

Notice that there are 3 columns in egg_raw_tq but we downloaded two time series. Why aren’t there 4 columns?

Answer: The new data is stacked. We can confirm this a few ways:

  1. Use the unique() function to see what values for symbol are in the dataset
#Display the unique values of the variable symbol
unique(egg_raw_tq$symbol)
  1. Use table() to see the counts by each value
#Count the number of observations associated with each value of symbol
table(egg_raw_tq$symbol)

Why does the retail series APU0000708111 have more observations?

  1. Use case_when() to replacing values based on a criteria

The time series are currently identified by the variable symbol. These codes are not informative. Let’s replace them with better labels.

We will use the case_when() function in the dplyr package to replace the symbol with a more descriptive name.

#Use the FRED API to retrieve two data series and rename them the series with a more descriptive name
egg_out <- egg_raw_tq %>%
  mutate(description=case_when(
    symbol == "APU0000708111" ~ "Egg Retail Price", #if symbol equals APU0000708111, then replace value with Egg Retail Price
    symbol == "WPU017107" ~ "Egg Producer Price" #if symbol equals WPU017107, then replace value with Egg Producer Price
  ))
  1. Use write_csv() to write the dataset to a csv file.

Now the dataset is ready for export in .csv format so you can import it into Tableau.

#Write the dataset to a csv file
write_csv(egg_out, "egg_prices.csv")
You do it
  1. Download the following data series using tq_get() as separate data frames:
  • Producer Price Index: Carrots, symbol = “WPU01130212”
  • Producer Price Index: Tomatoes, symbol = “WPU01130217”
  • Producer Price Index: Corn, symbol = “WPS012202”
  1. Rename the symbols to a more descriptive name.
  • Carrot Producer Price
  • Tomato Producer Price
  • Corn Producer Price
  1. Write each data frame as separate csv files:
  • carrot_prices.csv
  • tomato_prices.csv
  • corn_prices.csv

Video: Collecting Data via API

# This is the script for lab 05.

#Load necessary libraries
library(readr)
library(dplyr)

#epoch
as.numeric(as.Date("1970-01-01"))

#add one day to see how dates are tracked
as.numeric(as.Date("1970-01-02"))

#Read in the data
egg_raw <- read_csv("https://csu-arec-330.github.io/materials/unit_01/inputs/APU0000708111.csv")

#Confirm that R read in the date field as a date
glimpse(egg_raw)

#Read in the data
egg_raw_c <- read_csv("https://csu-arec-330.github.io/materials/unit_01/inputs/APU0000708111_cdate.csv")

#Confirm that R read in the date field as text and *not* a date
glimpse(egg_raw_c)

# The easiest way to get lubridate is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just lubridate:
install.packages("lubridate")

#Load the necessary libraries
library(lubridate)

#Use the mutate() function along with the mdy() function to convert the text date to a date value
egg_raw_c <- mutate(egg_raw_c,measure_date=mdy(CHAR_DATE))

#Confirm that you have successfully converted the variable to a date field
glimpse(egg_raw_c)

egg_raw_c %>%
  mutate(measure_date=mdy(CHAR_DATE)) %>%
  glimpse()

#Install the package if you have not already done so. Comment this out after you have installed it once on your machine.
install.packages("tidyquant")

#Load the library. The library needs to be loaded every time you begin a new instance of R.
library(tidyquant, quietly = TRUE) #Note: The argument quietly = T in the library() function suppresses messages that would normally be printed when loading a package.

#Use the tq_get() function to retrieve the time series data
egg_raw_tq <- tq_get("APU0000708111",
                     get = "economic.data",
                     from="1990-01-01",
                     to="2024-02-28")

glimpse(egg_raw_tq)

#Use the tq_get() function to retrieve two time series datasets
egg_raw_tq <- tq_get(c("APU0000708111","WPU017107"),
                     get = "economic.data",
                     from="1990-01-01",
                     to="2024-02-28")

glimpse(egg_raw_tq)

#Display the unique values of the variable symbol
unique(egg_raw_tq$symbol)

#Count the number of observations associated with each value of symbol
table(egg_raw_tq$symbol)

#Use the FRED API to retrieve two data series and rename them the series with a more descriptive name
egg_out <- egg_raw_tq %>%
  mutate(description=case_when(
    symbol == "APU0000708111" ~ "Egg Retail Price", #if symbol equals APU0000708111, then replace value with Egg Retail Price
    symbol == "WPU017107" ~ "Egg Producer Price" #if symbol equals WPU017107, then replace value with Egg Producer Price
  ))

#Write the dataset to a csv file
write_csv(egg_out, "egg_prices.csv")

Tableau

A key attribute of time series data is that we observe values of variables across time. Effective analysis of time series data in Tableau thus begins with understanding date variables and functions. Next, we want to understand what visualizations are most appropriate for time series analysis. Finally we want to understand how to make these visualizations as effective as possible.

Before you begin

Connect to the folder where you have saved carrot_prices.csv, and the three other commodity price series data.

1. Date functions

Date functions are Tableau functions that use a date or a date-time field in some way, whether it’s part of the input to perform the calculation or whether the output is a date-time field. Let’s learn about different date functions, their expressions, and their use in Tableau.

You can explore these functions by going to Create calculated field (drop down menu in the data field pane). You can filter on date functions by choosing Date from the library of functions. This will list only the functions that are classified as Date functions.

You can look through these on your own, but I will highlight a few here.

  • DATEPART (date_part, date, [start_of_week]) returns part of a given date as a number.

    For example, DATEPART (‘month’, #January 23, 2021#) = 1. Note the hashtags

  • DATENAME (date_part, date, [start_of_week]) is similar to DATEPART, but it returns part of a given date as a string.

    For example, DATENAME (‘month’, #January 23, 2021#) = January

  • DATEADD (date_part, increment, date) returns a date in which the specified increment has been added to the given date.

    For example, DATEADD (‘day’, 7, #January 23, 2021#) = January 30, 2021

  • DATEDIFF (date_part, start_date, end_date, [start_of_week]) returns the difference between the two dates using the specified date part.

    For example, DATEDIFF (‘day’, #January 23, 2021#, #February 24, 2021#) = 32 and DATEDIFF (‘month’, #January 23, 2021#, #February 24, 2021#) = 1

  • DATETRUNC (date_part, date, [start_of_week]) truncates or “rounds down” the given date to the level of the specified date_part. This function returns a new date.

    For example, when you truncate a date that is in the middle of the month at the month level, this function returns the first day of the month. So DATETRUNC (‘month’, #January 23, 2021#) returns January 1, 2021.

  • MIN (expression) returns the earliest date across all records, and MIN (expression 1, expression 2) returns the earlier of the two dates for each record.

    For example, MIN ([Order Date]) returns the earliest date in the Order Date field.

  • MAX (expression) returns the latest date across all records, and MAX (expression 1, expression 2) returns the earlier of the two dates for each record.

    For example, MAX ([Order Date]) returns the latest date in the Order Date field.

Here are some handy definitions for date_time arguments:

You do it
  1. Create a calculated field named months_ago that determines the number of months between the commodity price data and today.

  2. Create a calculated field named first_date that determines the earliest date in the data.

  3. Create a calculated field named last_date that determines the last date that the carrot price index was less than 135.

  4. Create a new variable called month that only includes the month.

  5. Create a new variable called year that only includes the year.

  6. Create a new date variable called my_date that assigns a new date using your month and year variables and sets the day as 15.

2. Visualize time series data

What types of visualizations are most appropriate for time series data?

  • Line graphs

  • Bar charts

  • Dot plots

  • Others?

Let’s experiment with creating these visualizations for time series:

  1. Create a line graph that shows the value of carrots over time (by month)

  2. Add markers to your line graph

  3. Change your line graph to a bar graph dot chart

  4. Change your bar graph to a dot plot

  5. Change the size, colors, and opacity of the dots in your dot plot (use the Size and Color icons)

  6. Change the markers in your chart from dots to a different shape (use the drop down menu on the Marks card, then use the Shape icons)

Modifying Data Display in Charts

Now let’s experiment with changing what data is shown in your chart:

  1. Change your x-axis to annual

  2. Create one line that shows the average value of the commodity each year, one line that shows the maximum value, and one that shows the minimum value. Can you get all three of these lines in a single pane?

  3. Restrict your date ranges so that your visualization only shows the last 20 years of data (use the Filters card)

3. Creating and interpreting trend lines

Now we are going to add a trend line to our visualization. I am sure you are all familiar with trend lines, but do you know how they are created? Let’s start with a linear trend line…

  • What does a linear trend line tell you about your time series data?
  • How is this trend line estimated? (conceptually… we will go through some equations in lecture)

Trend Lines in Tableau

  1. Open a new worksheet and create a line chart showing the monthly values of the commodity from 1990 through 2022.

  2. Add a trend line to your chart.

  3. When you click on the trend line, what information does Tableau give you about it? What does this information mean?

  4. When you click on describe trend model what information does Tableau give you about it?

  5. Change your x-axis to annual. Did your trend line change? How do you know? Why did or didn’t the trend line change when you changed the date aggregation?

4. Visualizing structural breaks

Usually, in long time series data, we can see distinct points where trends change. These are called structural breaks. There are ways to test for structural breaks, but for now, let’s visually inspect them.

  1. Looking at your visualization of monthly carrot prices and the linear trend line, does your trend line accurately reflect trends in the data for the entire period from 1990 to 2022?

  2. Identify a month-year where it appears that trends in your data changed.

  3. Add a reference line at this month-year. (Right-click on the x-axis and select Reference Line) Does anything happen to your trend line?

Methods for Highlighting Structural Breaks

Method 1: Separate Your Time Series by Adding Color

  1. Create a new calculated field named before_after_break. Use the IF and THEN functions and an appropriate Date function to populate this variable with “after” if the date is after the break and “before” if the date is before the break.

  2. Drag this variable to the Color card.

  3. Add trend lines to your visualization. What happens?

Method 2: Separate Your Time Series by Segmenting the X-Axis

  1. Drag your new variable from the Marks card to the Columns shelf.

  2. Sort your variable so that the “after” time period is shown following the “before” time period.

  3. In the x-axis Edit Axis menu, select independent axis ranges for each row or column (right-click on the x-axis to access this menu).

  4. Note: Typically, Tableau automatically adjusts these panes to be the same size, but you can set them to be different sizes based on the windows you have set by toggling back and forth in your date variable aggregation.

5. Creating panes to visualize mutiple time series

Sometimes we might want to view multiple time series in separate figures. We can do this with panes in our visualizations.

Let’s start by bringing in some price data for a few new commodities.

  1. On the Data Source page, connect to the following data sources: corn_prices.csv and tomatoes_prices.csv.

  2. We are going to create a relationship between these data sources and our current working file.

  3. With the carrot data in the canvas, drag one of the other data sources next to it in the canvas. Do the same with the other data source.

  4. Now go to a new worksheet, what do you see? What did we just accomplish?

Now that we have all these data sources in a single file and they are structured in a “wide” format, we are going to construct time series plots that let us compare trends in these commodity prices across time.

Video: Using Tableau to Visualize Time Series Data

Footnotes

  1. https://knowledge.broadcom.com/external/article/57052/how-to-convert-unix-epoch-time-values-in.html#:~:text=Epoch%20%2F%20UNIX%20time%20is%20the,on%20the%2070%20years%20difference.↩︎