#This script corresponds to the lab notes for week 3.
Week 3 Lab: Data Processing with R
This Lab Contributes to Course Objectives 2, 3, 4, 5, 6, and 8
Learning Objectives R
Understand how to navigate directories on your file system
Use the
summarize
andgroup_by
functionsUnderstand how to chain commands
Bind dataframes together
Read in data and evaluate datatypes
Calculate new columns based on existing columns
Subset or filter rows of the data
Subset or select columns of the data
Data processing with R
Data processing is an important step in analysis. It can be complicated depending on the number of data sources, quality of the data, and structure of the data. We will start with an already clean dataset to demonstrate how to use R to manipulate the dataset. We will continue to use these commands in increasingly complex combinations, so it will pay dividends to invest the time to understand them now.
Starting a task or project in R
Open RStudio either on your own computer or access it via the DARE server: http://darecompute-01.aggie.colostate.edu:8787/.1
Open up a new script.
Write a brief comment in the first line describing what the script will do.
Load any libraries/packages used in the script.2 In this lab, we will use 3 packages:
readr,dplyr,janitor
.Set the working directory to a known location. This may be a new directory where you will store any external data and save the code file.
Insert the following code below where you load packages. These commands print information about the system and version of R you are running. The information will be displayed when you source the script and is for your instructors.
versionprint(.packages())
Read the data into R
Download
supermarket_sales.csv
to the project directory.Use the function
read_csv()
to readsupermarket_sales.csv
into a dataframe calledsupermarket_raw
.Take a look at the dataframe. There are a few ways to look at the dataframe:
Type the name of the dataframe in the console. This shows the data in the first several rows of the first several columns.
Use the function
View()
with the dataframe name as the argument. Alternatively, you can click on the name of the object in the environment pane.Use the function
glimpse()
with the dataframe name as the argument.
Variable names: You’ll notice that some of the variable names have spaces. This is allowed in R, but they are awkward because you will be using these names in your code. We will cover renaming variables later today, but there is a function to clean (i.e., modify) the names. The function called clean_names()
is part of the janitor
package.
Try it on your own: Use the
clean_names()
function to convert all of the column names to snake case. Reference the documentation in the help tab of the lower right panel. The default conversion case is “snake” but you can also specify it in the function explicitly.
Code
<- clean_names(supermarket_raw,"snake") supermarket_raw
Check the datatypes of the variables in the dataframe. Do they make sense?
Modifying the dataframe
You will often need to modify the data you read in. The package dplyr
contains a set of utilities that help you modify dataframes.3 dplyr
argues that most data modification tasks can be broken into a set of tasks that can be accomplished with these functions named after verbs (https://dplyr.tidyverse.org/):
mutate()
creates new variables (vector) by transforming or operating on one or more other variablesselect()
picks variables based on their names.filter()
picks cases based on their values.arrange()
changes the ordering of the rows.summarise()
creates a new variable of different dimension by performing an operation that summarizes another variable(s)group_by()
an intermediate function that can alter howmutate
andsummarize
behave.
The following exercises will introduce you to these core functions while exploring the supermarket_sales data. These functions will all return a dataframe as the output.
1. Mutate
mutate()
is the function to create or add new variables. Let’s recreate the column total
by multiplying the columns unit_price
and quantity
. Note that we create a new variable called total_calc
and assign this dataframe to a new object called my_super_sales
.
<- mutate(supermarket_raw,total_calc=unit_price*quantity) my_super_sales
Is the result
total_calc
equal to the existing columntotal
? Why or why not?
You can perform most mathematical operations on numeric columns and other types of operations on string or character data types. The key point to remember is that mutate operations must output a vector that is the same length as the dataframe.
2. Select (and rename)
select()
is a function to subset columns from the dataframe. Suppose you want to create a new dataframe with only the invoice id, the city, the reported total, and the calculated total.
<- select(my_super_sales,invoice_id,city,total,total_calc) select_super_sales
rename()
is a related function used to rename columns rather than subset them. Suppose that you didn’t like the name city
and wanted to call it market
.
<- rename(select_super_sales,market=city) select_super_sales
You can combine the operations using the rename syntax to rename a column inside of a select statement.
<- select(my_super_sales,invoice_id,market=city,total,total_calc) select_super_sales
3. Filter
filter()
is a function to subset the rows of a dataframe based on a condition that returns a TRUE
or FALSE
. This is a very general concept used in all sorts of programming applications. In this context, we will write a statement that is true for some elements of the vector city
(renamed to market). To filter the dataframe to only data from the market “Yangon”, we write
<- filter(select_super_sales,market=="Yangon") yangon_super_sales
The statement market=="Yangon"
can be evaluated on every row as either TRUE
or FALSE
. This command keeps only the rows where the market name equals the exact word “Yangon.” This is case sensitive because the statement is comparing the market name in each row. You can also create statements using the greater than >
or less than <
comparing numeric values. You can find more information on logical operators here: https://www.statmethods.net/management/operators.html
4. Arrange (or sort)
arrange()
is a function that modifies the order of the rows in the dataframe based on the values of a column or set of columns. R can sort numeric and character strings (alphabetical). Let’s start by sorting the dataframe in ascending order based on the column total
.
<- arrange(yangon_super_sales,total) yangon_super_sales_sorted
You can sort on multiple columns. The order determines how the data are sorted.
Let’s try something more complex. Use the dataset with all of the markets (select_super_sales
) and sort based on market, then by total in descending order to see the top sales in each market.
<- arrange(select_super_sales,market,desc(total)) super_sales_top
The function desc()
is short for descending and indicates that the total should be sorted in descending order.
Remember to comment in your code so you can refer back to see what you did.
Now you know the four dplyr
verbs generally used to manipulate data. We covered mutate, filter, select,
and arrange
. Next, we will introduce summarize
and a helpful companion function group_by
.
5. Summarize
The summarize
function is used to collapse multiple rows of data into a single row, by computing summary statistics for each group. This function can be used to summarize and aggregate data by computing statistics such as mean, sum, count, minimum, maximum, etc. for each group defined in the data.
We can use summarize()
to calculate the total number of units sold in the dataset.
summarize(supermarket_raw,quantity_summed=sum(quantity))
Notice that the command above does not assign the output to an object. It can be useful to quickly check the output of a command before assigning an object. Is the variable quantity_summed
accessible to you? In other words, could you call that value within R and perform other operations on it?
Try it on your own: Modify the previous command to assign the output to an object called
total_sold
.
Code
<- summarize(supermarket_raw,quantity_summed=sum(quantity)) total_sold
Next, calculate the following: (1) Calculate the average of the
quantity
variable. In R, the function to calculate average ismean()
. (2) Calculate the maximumunit_price
. The function to calculate the maximum ismax()
.
Code
#Mean quantity
summarize(supermarket_raw,quantity_mean=mean(quantity))
#Max price
summarize(supermarket_raw,max_price=max(unit_price))
6. Group by
Often times, you need to perform some operation on a group of data. You can think about this as a sequence of operations: part 1) subset the group that you want to focus on, and part 2) perform the operation. For example, you might want to know the quantity of goods sold by branch.
Step 1: Filter the data in each branch (let’s just do branch A for example)
<- filter(supermarket_raw,branch=="A") branch_A_sales
You should see a new dataframe called branch_A_sales
with 340 observations. All of the variables from supermarket_raw
are there including quantity
.
Step 2: Sum the quantity of goods sold
<- summarize(branch_A_sales,total_quantity=sum(quantity)) branch_A_total_quantity
Try it on your own: Do the same thing for branch B and C.
Code
<- filter(supermarket_raw,branch=="B")
branch_B_sales <- summarize(branch_B_sales,total_quantity=sum(quantity))
branch_B_total_quantity
<- filter(supermarket_raw,branch=="C")
branch_C_sales <- summarize(branch_C_sales,total_quantity=sum(quantity)) branch_C_total_quantity
This is pretty tedious. If operations in R or other programming languages seems tedious or repetitive, there are probably much better ways to do it. In this case, we can use a function group_by()
. group_by()
tells R to perform some operation like sum
on each of the subgroups defined by another variable. Let’s sum the quantity by branch using group_by()
and summarize()
. We will do this in two steps:
Step 1: Use group_by()
to group the data by branch
#This will create a signal for summarize but not perform any calculation.
<- group_by(supermarket_raw,branch) super_sales_grouped_branch
Step 2: Use summarize()
to calculate the total quantity (by branch)
#This will create a signal for summarize but not perform any calculation.
<- summarize(super_sales_grouped_branch,total_quantity=sum(quantity)) total_quantity_by_branch
Notice that the result has the quantity sum for each branch. Do they match your manual calculations?
Try it on your own: Calculate the average
unit_price
for eachproduct_line
in the data.
Chaining commands (or piping %>%
or |>
)
A script is a set of instructions for R that get executed in order. In some cases, you do not care about or need to save the intermediate output. The group_by
and summarize
sequence is a good example; you don’t need to store the dataframe super_sales_grouped_branch
unless there is a clear use for it later.
The pipe symbol, represented as %>%
or |>
, plays a crucial role in streamlining code. It allows for a more readable way of writing sequences of operations.
What It Does: The pipe symbol takes the result of the expression on its left and passes it as the first argument to the function on its right. In simpler terms, it “pipes” the output of one function directly into the next function.
How to Read It: When you see x %>% f()
, interpret it as “take x and then apply f() to it.”
Code Structure: Often, you’ll find the pipe at the end of a line, with the next operation starting on a new, often indented, line. This format enhances readability, especially in complex sequences of data manipulation steps.
Here is a brief example:
%>%
data arrange(condition) %>%
summarize(result = mean(column))
This code snippet takes called data
, arranges it according to some condition, and then calculates the mean of column, summarizing the result. Each step is clearly separated by the pipe symbol, making the sequence of operations easy to follow.
Video tutorial here: https://youtu.be/e_SQnJpS5fA
Let’s return to our group_by()
and summarize()
sequence and use the pipe technique rather than assign the intermediate step to a dataframe. Try it!
Code
#Using the pipe to chain two statements
<- group_by(supermarket_raw,branch) %>% #group by branch and pass the output to summarize
total_quantity_by_branch_pipe summarize(.,total_quantity=sum(quantity)) #Sum the quantity by branch
Appending data
In many cases, you will need to combine parts of datasets into a single larger dataset.
Let’s return to the example of breaking up the dataset by branch. If you have not created dataframes for each branch above, do it using these commands:
<- filter(supermarket_raw,branch=="A")
branch_A_sales <- filter(supermarket_raw,branch=="B")
branch_B_sales <- filter(supermarket_raw,branch=="C") branch_C_sales
Notice that each of these dataframes contains only the sales from the branch indicated in the dataframe name.
Suppose you were given the data as three separate files and you read them into these dataframes. However, you want to profile company sales across the branches. You need to append them together. The dplyr
function to append rows is bind_rows()
.4
<- bind_rows(branch_A_sales,branch_B_sales,branch_C_sales) super_reconstruct
Note that bind_rows()
looks for shared column names in each dataframe. If one dataframe is missing a column, it will be filled with NA
. NA
in R stands for “Not Available” or “Not Applicable.” `NA’ is used to indicate that a value is missing or undefined in a data set. This can happen when data is not available for a particular observation, or when a value cannot be calculated for some reason. When R encounters an “NA” value, it treats it differently from other values, and special functions are often used to handle missing data.
Log file (revisited)
To output the log from sourcing an R script to a text file, you can use the sink()
function. The sink()
function redirects R output to a file or connection.
For example, to redirect the output of source("script.R",echo = TRUE)
to a file named “log.txt”, you can use the following command. Remember to run this set of commands from another script or directly in the console. Do not include it in the script you are sourcing (i.e., script.R
).
sink("log.txt")
source("script.R",echo = TRUE)
sink()
This will run the script “script.R” and redirect all the output to the file “log.txt”. echo=TRUE
ensures that the output of the script is just like you see in the console when you run each line individually. Once the script is finished, you can use the sink()
function again to stop redirecting the output and return it to the console. Note that the sink command must be run outside of “script.R” because the command source()
is equivalent to running every line of “script.R.”
Try it on your own: Open a new script and call it
lab_03_log.R
. Update the code above so it will run the scriptlab_03_script.R
. Note: You will have to comment out any places where you installed packages, and used the commandssetwd()
,getwd()
, anddir()
.
Code
sink("lab_03_log.txt")
source("lab_03_script.R",echo = TRUE)
sink()
Footnotes
Remember that you need to be signed on to the VPN if you are off campus. If you are on campus, the link should take you to the RStudio sign on page.↩︎
Install if necessary.↩︎
The package
dplyr
is part of a larger set of packages known as thetidyverse
https://www.tidyverse.org/. The term tidy refers to a structure of data. There is a large and growing community that develops R packages that adhere to the tidy principles, so there is extensive documentation and support available. Most of what I teach in this class will be part of the tidyverse or adherant to its principles. Video tutorial here: https://youtu.be/bUM3wX4YZDc↩︎bind_rows()
is a more efficient version of a function in base R calledrbind()
.↩︎