Week 5 Lab: Time Series Exploratory Data Analysis (EDA)

This Lab Contributes to Course Objectives: 1, 2, 3, 4, 7, 8

Learning Objectives

R:

  • Read in time series data
  • Work with dates in R
  • Download data via API

Tableau:

  • Understand and use date functions
  • Create calculated fields using date functions
  • Visualize time series data
  • Create and interpret trend lines
  • Identify and visualize structural breaks
  • Use panes to visualize multiple time series

Time series with R

The overall objective of this lab is to acquaint you with time series data.

Time series data is widely used but lacks a standardized format. Importantly, dates and times are not inherently understood by machines—they are stored as numbers relative to a specific starting point (epoch).1

  • R Epoch: January 1, 1970
  • Excel Epoch: January 1, 1900

You can confirm this by running:

#epoch
as.numeric(as.Date("1970-01-01"))
[1] 0
#add one day to see how dates are tracked
as.numeric(as.Date("1970-01-02"))
[1] 1

Importing time series data

Importing time series data is similar to other types of data, but special attention is needed for date formatting.

We will use retail egg price data from the St. Louis FED (FRED). You can download the data manually or import it directly via read_csv():

Important

Don’t forget the steps to starting an R script!

  1. Description of the script (comment)
  2. Load (install if necessary) needed packages
  3. Set working directory
#Load necessary libraries
library(readr)
Warning: package 'readr' was built under R version 4.4.2
library(dplyr)
Warning: package 'dplyr' was built under R version 4.4.2

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
#Read in the data
egg_raw <- read_csv("https://csu-arec-330.github.io/materials/unit_01/inputs/APU0000708111.csv")
Rows: 517 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): APU0000708111
date (1): DATE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The read_csv() function will try to guess the data type. In this case, it correctly identifies the variable DATE as a date. Check the data type of the variable by looking at the dataframe:

#Confirm that R read in the date field as a date
glimpse(egg_raw)
Rows: 517
Columns: 2
$ DATE          <date> 1980-01-01, 1980-02-01, 1980-03-01, 1980-04-01, 1980-05…
$ APU0000708111 <dbl> 0.879, 0.774, 0.812, 0.797, 0.737, 0.731, 0.776, 0.907, …

Handling dates in R

Sometimes, date fields are stored as text and need to be converted. In these cases, you need to give R some guidance.

To see this in practice, read in a modified version of the egg prices data.

Assign the dataframe to an object called egg_raw_c to avoid overwriting the existing data:

#Read in the data
egg_raw_c <- read_csv("https://csu-arec-330.github.io/materials/unit_01/inputs/APU0000708111_cdate.csv")
Rows: 517 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): CHAR_DATE
dbl (1): APU0000708111

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Confirm that R read in the date field as text and *not* a date
glimpse(egg_raw_c)
Rows: 517
Columns: 2
$ APU0000708111 <dbl> 0.879, 0.774, 0.812, 0.797, 0.737, 0.731, 0.776, 0.907, …
$ CHAR_DATE     <chr> "01/01/1980", "02/01/1980", "03/01/1980", "04/01/1980", …

Notice that the date is not recognized as a date type variable. You can verify this because CHAR_DATE is listed as chr (1).

Next, we need to tell R how to interpret that date. We will use functions from the tidyverse package, specifically the lubridate library, designed to help work with dates.

# The easiest way to get lubridate is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just lubridate:
install.packages("lubridate")

#Load the necessary libraries
library(lubridate)

#Use the mutate() function along with the mdy() function to convert the text date to a date value
egg_raw_c <- mutate(egg_raw_c,measure_date=mdy(CHAR_DATE))

#Confirm that you have successfully converted the variable to a date field
glimpse(egg_raw_c)

Alternatively, you can chain commands:

egg_raw_c %>%
  mutate(measure_date=mdy(CHAR_DATE)) %>%
  glimpse()

Video: Reading in Time Series Data

Accessing data via API

API stands for Application Programming Interface and they facilitate much of the communication across the modern web.

API schematic. Source: https://www.altexsoft.com/blog/engineering/what-is-api-definition-types-specifications-documentation/

We use APIs to query data. FRED maintains an API that can be accessed using the library tidyquant. This library also has a large array of finance analysis tools. Right now, we will just use it to access FRED data.

Retrieving data via an API

Step 1: Install and load libraries

Install tidyquant on your machine (e.g., install.packages("tidyquant")) and load the library.

#Install the package if you have not already done so. Comment this out after you have installed it once on your machine.
install.packages("tidyquant")

#Load the library. The library needs to be loaded every time you begin a new instance of R.
library(tidyquant, quietly = TRUE) #Note: The argument quietly = T in the library() function suppresses messages that would normally be printed when loading a package.

Step 2: Retrieve time series data via tq_get()

For this exercise, we will use the function tq_get() to access the FRED API.

This function can do many things so we need to be specific with some parameters:

  • First, we need to find the unique identifier from FRED that points to the data we want In this case, we are going to grab retail egg prices (the same data that we downloaded earlier). The unique identifier is "APU0000708111". Remember the quotes.

  • Second, the documentation tells us that to access data from FRED we need to set the get parameter to "economic.data".

  • Third, we can specify the date range with the from and to parameters. Remember to use the YYYY-MM-DD format.

#Use the tq_get() function to retrieve the time series data
egg_raw_tq <- tq_get(c("APU0000708111"),
                     get = "economic.data",
                     from="1990-01-01",
                     to="2024-02-28")

glimpse(egg_raw_tq)

Step 3: Retrieve multiple time series data using tq_get()

You can download multiple time series at once by using a vector of identifiers.

Let’s download the producer price index to compare to the retail price of a dozen eggs. The identifier is WPU017107. Note that I am going to overwrite the existing object egg_raw_tq because I am going to download the retail price series as well as the producer price index.

#Use the tq_get() function to retrieve two time series datasets
egg_raw_tq <- tq_get(c("APU0000708111","WPU017107"),
                     get = "economic.data",
                     from="1990-01-01",
                     to="2024-02-28")

glimpse(egg_raw_tq)

Step 4: Rename variables for clarity

Notice that there are 3 columns in egg_raw_tq but we downloaded two time series. Why aren’t there 4 columns?

Answer: The new data is stacked. We can confirm this a few ways:

  1. Use the unique() function to see what values for symbol are in the dataset
unique(egg_raw_tq$symbol)
  1. Use table() to see the counts by each value
table(egg_raw_tq$symbol)

Why does the retail series APU0000708111 have more observations?

  1. Use case_when() to replacing values based on a criteria

The time series are currently identified by the variable symbol. These codes are not informative. Let’s replace them with better labels.

We will use the case_when() function in the dplyr package to replace the symbol with a more descriptive name.

egg_out <- egg_raw_tq %>%
  mutate(description=case_when(
    symbol == "APU0000708111" ~ "Egg Retail Price", #if symbol equals APU0000708111, then replace value with Egg Retail Price
    symbol == "WPU017107" ~ "Egg Producer Price" #if symbol equals WPU017107, then replace value with Egg Producer Price
  ))
  1. Use write_csv() to write the dataset to a csv file.

Now the dataset is ready for export in .csv format so you can import it into Tableau.

write_csv(egg_out, "egg_out.csv")

Video: Collecting Data via API

Tableau

A key attribute of time series data is that we observe values of variables across time. Effective analysis of time series data in Tableau thus begins with understanding date variables and functions. Next, we want to understand what visualizations are most appropriate for time series analysis. Finally we want to understand how to make these visualizations as effective as possible.

1. Date functions

Connect to the eggs_out.csv data.

Date functions are Tableau functions that use a date or a date-time field in some way, whether it’s part of the input to perform the calculation or whether the output is a date-time field. Let’s learn about different date functions, their expressions, and their use in Tableau.

You can explore these functions by going to Create calculated field (drop down menu in the data field pane). You can filter on date functions by choosing Date from the library of functions. This will list only the functions that are classified as Date functions.

You can look through these on your own, but I will highlight a few here.

  • DATEPART (date_part, date, [start_of_week]) returns part of a given date as a number.

    For example, DATEPART (‘month’, #January 23, 2021#) = 1. Note the hashtags

  • DATENAME (date_part, date, [start_of_week]) is similar to DATEPART, but it returns part of a given date as a string.

    For example, DATENAME (‘month’, #January 23, 2021#) = January

  • DATEADD (date_part, increment, date) returns a date in which the specified increment has been added to the given date.

    For example, DATEADD (‘day’, 7, #January 23, 2021#) = January 30, 2021

  • DATEDIFF (date_part, start_date, end_date, [start_of_week]) returns the difference between the two dates using the specified date part.

    For example, DATEDIFF (‘day’, #January 23, 2021#, #February 24, 2021#) = 32 and DATEDIFF (‘month’, #January 23, 2021#, #February 24, 2021#) = 1

  • DATETRUNC (date_part, date, [start_of_week]) truncates or “rounds down” the given date to the level of the specified date_part. This function returns a new date.

    For example, when you truncate a date that is in the middle of the month at the month level, this function returns the first day of the month. So DATETRUNC (‘month’, #January 23, 2021#) returns January 1, 2021.

  • MIN (expression) returns the earliest date across all records, and MIN (expression 1, expression 2) returns the earlier of the two dates for each record.

    For example, MIN ([Order Date]) returns the earliest date in the Order Date field.

  • MAX (expression) returns the latest date across all records, and MAX (expression 1, expression 2) returns the earlier of the two dates for each record.

    For example, MAX ([Order Date]) returns the latest date in the Order Date field.

Here are some handy definitions for date_time arguments:

2. Visualize time series data

Connect to the carrots_prices.csv data and use the expressions above to do the following:

What types of visualizations are most appropriate for time series data?

  • Line graphs

  • Bar charts

  • Dot plots

  • Others?

Let’s experiment with creating these visualizations for time series:

  1. Create a line graph that shows the value of carrots over time (by month)

  2. Add markers to your line graph

  3. Change your line graph to a bar graph dot chart

  4. Change your bar graph to a dot plot

  5. Change the size, colors, and opacity of the dots in your dot plot (use the Size and Color icons)

  6. Change the markers in your chart from dots to a different shape (use the drop down menu on the Marks card, then use the Shape icons)

Now let’s experiment with changing what data is shown in your chart:

  1. Change your x-axis to annual

  2. Create one line that shows the average value of the commodity each year, one line that shows the maximum value, and one that shows the minimum value. Can you get all three of these lines in a single pane?

  3. Restrict your date ranges so that your visualization only shows the last 20 years of data (use the Filters card)

3. Creating panes to visualize mutiple time series

Sometimes we might want to view multiple time series in separate figures. There are two ways to do this depending on the structure of your data. Let’s consider wide data first:

Let’s start by bringing in some price data for a few new commodities.

  1. On the Data Source page connect to the following data sources corn_prices.csv and tomatoes_prices.csv

  2. We are going to create a relationship between these data sources and our current working file

  3. With the carrot data in the canvas, drag one of the other data sources next to it in the canvas. Do the same with the other data source.

  4. Now go to a new worksheet, what do you see? What did we just accomplish?

Now that we have all these data sources in a single file and they are structured in a “wide” format, we are going to construct time series plots that let us compare trends in these commodity prices across time.

Open a new worksheet and create a line plot of the prices over time for each commodity

Many functions and tools in Tableau and R are both designed to work with long data, where each row specifies the most granular unit in the data and the variables or fields describe the data elements (e.g., price).

  1. Connect to the egg data that we assembled in R.

  2. Create a time series of the price. How many lines are there? What is the default calculation doing?

  3. Differentiate the price series by dragging Description (a dimension) to the color under Marks. You should see two lines but the scales are off. What’s wrong?

  4. Rescale one of the series to match the other. You can go back to R to do this or do it in Tableau. In either case, you will want to use if/then logic.

Video: Using Tableau to Visualize Time Series Data

Footnotes

  1. https://knowledge.broadcom.com/external/article/57052/how-to-convert-unix-epoch-time-values-in.html#:~:text=Epoch%20%2F%20UNIX%20time%20is%20the,on%20the%2070%20years%20difference.↩︎