#epoch
as.numeric(as.Date("1970-01-01"))
[1] 0
#add one day to see how dates are tracked
as.numeric(as.Date("1970-01-02"))
[1] 1
This Lab Contributes to Course Objectives: 1, 2, 3, 4, 7, 8
The overall objective of this lab is to acquaint you with time series data.
Time series data is widely used but lacks a standardized format. Importantly, dates and times are not inherently understood by machines—they are stored as numbers relative to a specific starting point (epoch).1
You can confirm this by running:
#epoch
as.numeric(as.Date("1970-01-01"))
[1] 0
#add one day to see how dates are tracked
as.numeric(as.Date("1970-01-02"))
[1] 1
Importing time series data is similar to other types of data, but special attention is needed for date formatting.
We will use retail egg price data from the St. Louis FED (FRED). You can download the data manually or import it directly via read_csv()
:
Don’t forget the steps to starting an R script!
#Load necessary libraries
library(readr)
Warning: package 'readr' was built under R version 4.4.2
library(dplyr)
Warning: package 'dplyr' was built under R version 4.4.2
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
#Read in the data
<- read_csv("https://csu-arec-330.github.io/materials/unit_01/inputs/APU0000708111.csv") egg_raw
Rows: 517 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): APU0000708111
date (1): DATE
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The read_csv()
function will try to guess the data type. In this case, it correctly identifies the variable DATE
as a date. Check the data type of the variable by looking at the dataframe:
#Confirm that R read in the date field as a date
glimpse(egg_raw)
Rows: 517
Columns: 2
$ DATE <date> 1980-01-01, 1980-02-01, 1980-03-01, 1980-04-01, 1980-05…
$ APU0000708111 <dbl> 0.879, 0.774, 0.812, 0.797, 0.737, 0.731, 0.776, 0.907, …
Sometimes, date fields are stored as text and need to be converted. In these cases, you need to give R some guidance.
To see this in practice, read in a modified version of the egg prices data.
Assign the dataframe to an object called egg_raw_c
to avoid overwriting the existing data:
#Read in the data
<- read_csv("https://csu-arec-330.github.io/materials/unit_01/inputs/APU0000708111_cdate.csv") egg_raw_c
Rows: 517 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): CHAR_DATE
dbl (1): APU0000708111
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Confirm that R read in the date field as text and *not* a date
glimpse(egg_raw_c)
Rows: 517
Columns: 2
$ APU0000708111 <dbl> 0.879, 0.774, 0.812, 0.797, 0.737, 0.731, 0.776, 0.907, …
$ CHAR_DATE <chr> "01/01/1980", "02/01/1980", "03/01/1980", "04/01/1980", …
Notice that the date is not recognized as a date type variable. You can verify this because CHAR_DATE
is listed as chr (1)
.
Next, we need to tell R how to interpret that date. We will use functions from the tidyverse
package, specifically the lubridate
library, designed to help work with dates.
# The easiest way to get lubridate is to install the whole tidyverse:
install.packages("tidyverse")
# Alternatively, install just lubridate:
install.packages("lubridate")
#Load the necessary libraries
library(lubridate)
#Use the mutate() function along with the mdy() function to convert the text date to a date value
<- mutate(egg_raw_c,measure_date=mdy(CHAR_DATE))
egg_raw_c
#Confirm that you have successfully converted the variable to a date field
glimpse(egg_raw_c)
Alternatively, you can chain commands:
%>%
egg_raw_c mutate(measure_date=mdy(CHAR_DATE)) %>%
glimpse()
API stands for Application Programming Interface and they facilitate much of the communication across the modern web.
We use APIs to query data. FRED maintains an API that can be accessed using the library tidyquant
. This library also has a large array of finance analysis tools. Right now, we will just use it to access FRED data.
Install tidyquant
on your machine (e.g., install.packages("tidyquant")
) and load the library.
#Install the package if you have not already done so. Comment this out after you have installed it once on your machine.
install.packages("tidyquant")
#Load the library. The library needs to be loaded every time you begin a new instance of R.
library(tidyquant, quietly = TRUE) #Note: The argument quietly = T in the library() function suppresses messages that would normally be printed when loading a package.
tq_get()
For this exercise, we will use the function tq_get()
to access the FRED API.
This function can do many things so we need to be specific with some parameters:
First, we need to find the unique identifier from FRED that points to the data we want In this case, we are going to grab retail egg prices (the same data that we downloaded earlier). The unique identifier is "APU0000708111"
. Remember the quotes.
Second, the documentation tells us that to access data from FRED we need to set the get
parameter to "economic.data"
.
Third, we can specify the date range with the from
and to
parameters. Remember to use the YYYY-MM-DD format.
#Use the tq_get() function to retrieve the time series data
<- tq_get(c("APU0000708111"),
egg_raw_tq get = "economic.data",
from="1990-01-01",
to="2024-02-28")
glimpse(egg_raw_tq)
tq_get()
You can download multiple time series at once by using a vector of identifiers.
Let’s download the producer price index to compare to the retail price of a dozen eggs. The identifier is WPU017107
. Note that I am going to overwrite the existing object egg_raw_tq
because I am going to download the retail price series as well as the producer price index.
#Use the tq_get() function to retrieve two time series datasets
<- tq_get(c("APU0000708111","WPU017107"),
egg_raw_tq get = "economic.data",
from="1990-01-01",
to="2024-02-28")
glimpse(egg_raw_tq)
Notice that there are 3 columns in egg_raw_tq but we downloaded two time series. Why aren’t there 4 columns?
Answer: The new data is stacked. We can confirm this a few ways:
unique()
function to see what values for symbol are in the datasetunique(egg_raw_tq$symbol)
table()
to see the counts by each valuetable(egg_raw_tq$symbol)
Why does the retail series
APU0000708111
have more observations?
case_when()
to replacing values based on a criteriaThe time series are currently identified by the variable symbol
. These codes are not informative. Let’s replace them with better labels.
We will use the case_when()
function in the dplyr
package to replace the symbol with a more descriptive name.
<- egg_raw_tq %>%
egg_out mutate(description=case_when(
== "APU0000708111" ~ "Egg Retail Price", #if symbol equals APU0000708111, then replace value with Egg Retail Price
symbol == "WPU017107" ~ "Egg Producer Price" #if symbol equals WPU017107, then replace value with Egg Producer Price
symbol ))
write_csv()
to write the dataset to a csv file.Now the dataset is ready for export in .csv format so you can import it into Tableau.
write_csv(egg_out, "egg_out.csv")
A key attribute of time series data is that we observe values of variables across time. Effective analysis of time series data in Tableau thus begins with understanding date variables and functions. Next, we want to understand what visualizations are most appropriate for time series analysis. Finally we want to understand how to make these visualizations as effective as possible.
Connect to the eggs_out.csv
data.
Date functions are Tableau functions that use a date or a date-time field in some way, whether it’s part of the input to perform the calculation or whether the output is a date-time field. Let’s learn about different date functions, their expressions, and their use in Tableau.
You can explore these functions by going to Create calculated field
(drop down menu in the data field pane). You can filter on date functions by choosing Date
from the library of functions. This will list only the functions that are classified as Date functions.
You can look through these on your own, but I will highlight a few here.
DATEPART (date_part, date, [start_of_week]) returns part of a given date as a number.
For example, DATEPART (‘month’, #January 23, 2021#) = 1. Note the hashtags
DATENAME (date_part, date, [start_of_week]) is similar to DATEPART, but it returns part of a given date as a string.
For example, DATENAME (‘month’, #January 23, 2021#) = January
DATEADD (date_part, increment, date) returns a date in which the specified increment has been added to the given date.
For example, DATEADD (‘day’, 7, #January 23, 2021#) = January 30, 2021
DATEDIFF (date_part, start_date, end_date, [start_of_week]) returns the difference between the two dates using the specified date part.
For example, DATEDIFF (‘day’, #January 23, 2021#, #February 24, 2021#) = 32 and DATEDIFF (‘month’, #January 23, 2021#, #February 24, 2021#) = 1
DATETRUNC (date_part, date, [start_of_week]) truncates or “rounds down” the given date to the level of the specified date_part. This function returns a new date.
For example, when you truncate a date that is in the middle of the month at the month level, this function returns the first day of the month. So DATETRUNC (‘month’, #January 23, 2021#) returns January 1, 2021.
MIN (expression) returns the earliest date across all records, and MIN (expression 1, expression 2) returns the earlier of the two dates for each record.
For example, MIN ([Order Date]) returns the earliest date in the Order Date field.
MAX (expression) returns the latest date across all records, and MAX (expression 1, expression 2) returns the earlier of the two dates for each record.
For example, MAX ([Order Date]) returns the latest date in the Order Date field.
Here are some handy definitions for date_time arguments:
Connect to the carrots_prices.csv
data and use the expressions above to do the following:
What types of visualizations are most appropriate for time series data?
Line graphs
Bar charts
Dot plots
Others?
Let’s experiment with creating these visualizations for time series:
Create a line graph that shows the value of carrots over time (by month)
Add markers to your line graph
Change your line graph to a bar graph dot chart
Change your bar graph to a dot plot
Change the size, colors, and opacity of the dots in your dot plot (use the Size
and Color
icons)
Change the markers in your chart from dots to a different shape (use the drop down menu on the Marks
card, then use the Shape
icons)
Now let’s experiment with changing what data is shown in your chart:
Change your x-axis to annual
Create one line that shows the average value of the commodity each year, one line that shows the maximum value, and one that shows the minimum value. Can you get all three of these lines in a single pane?
Restrict your date ranges so that your visualization only shows the last 20 years of data (use the Filters
card)
Sometimes we might want to view multiple time series in separate figures. There are two ways to do this depending on the structure of your data. Let’s consider wide data first:
Let’s start by bringing in some price data for a few new commodities.
On the Data Source
page connect to the following data sources corn_prices.csv
and tomatoes_prices.csv
We are going to create a relationship
between these data sources and our current working file
With the carrot data in the canvas
, drag one of the other data sources next to it in the canvas. Do the same with the other data source.
Now go to a new worksheet, what do you see? What did we just accomplish?
Now that we have all these data sources in a single file and they are structured in a “wide” format, we are going to construct time series plots that let us compare trends in these commodity prices across time.
Open a new worksheet and create a line plot of the prices over time for each commodity
Many functions and tools in Tableau and R are both designed to work with long data, where each row specifies the most granular unit in the data and the variables or fields describe the data elements (e.g., price).
Connect to the egg data that we assembled in R.
Create a time series of the price. How many lines are there? What is the default calculation doing?
Differentiate the price series by dragging Description
(a dimension) to the color under Marks. You should see two lines but the scales are off. What’s wrong?
Rescale one of the series to match the other. You can go back to R to do this or do it in Tableau. In either case, you will want to use if/then logic.
https://knowledge.broadcom.com/external/article/57052/how-to-convert-unix-epoch-time-values-in.html#:~:text=Epoch%20%2F%20UNIX%20time%20is%20the,on%20the%2070%20years%20difference.↩︎