Week 3 Lab: Data Processing with R

This Lab Contributes to Course Objectives 2, 3, 4, 5, 6, and 8

Learning Objectives R

  • Understand how to navigate directories on your file system

  • Use the summarize and group_by functions

  • Understand how to chain commands

  • Bind dataframes together

  • Read in data and evaluate datatypes

  • Calculate new columns based on existing columns

  • Subset or filter rows of the data

  • Subset or select columns of the data

Data processing with R

Data processing is an important step in analysis. It can be complicated depending on the number of data sources, quality of the data, and structure of the data. We will start with an already clean dataset to demonstrate how to use R to manipulate the dataset. We will continue to use these commands in increasingly complex combinations, so it will pay dividends to invest the time to understand them now.

Starting a task or project in R

  1. Open RStudio either on your own computer or access it via the DARE server: http://darecompute-01.aggie.colostate.edu:8787/.1

  2. Open up a new script.

  3. Write a brief comment in the first line describing what the script will do.

#This script corresponds to the lab notes for week 3.
  1. Load any libraries/packages used in the script.2 In this lab, we will use 3 packages: readr,dplyr,janitor.

  2. Set the working directory to a known location. This may be a new directory where you will store any external data and save the code file.

  3. Insert the following code below where you load packages. These commands print information about the system and version of R you are running. The information will be displayed when you source the script and is for your instructors.

version
print(.packages())

Read the data into R

  1. Download supermarket_sales.csv to the project directory.

  2. Use the function read_csv() to read supermarket_sales.csv into a dataframe called supermarket_raw.

  3. Take a look at the dataframe. There are a few ways to look at the dataframe:

  • Type the name of the dataframe in the console. This shows the data in the first several rows of the first several columns.

  • Use the function View() with the dataframe name as the argument. Alternatively, you can click on the name of the object in the environment pane.

  • Use the function glimpse() with the dataframe name as the argument.

Variable names: You’ll notice that some of the variable names have spaces. This is allowed in R, but they are awkward because you will be using these names in your code. We will cover renaming variables later today, but there is a function to clean (i.e., modify) the names. The function called clean_names() is part of the janitor package.

Try it on your own: Use the clean_names() function to convert all of the column names to snake case. Reference the documentation in the help tab of the lower right panel. The default conversion case is “snake” but you can also specify it in the function explicitly.

Code
supermarket_raw <- clean_names(supermarket_raw,"snake")

Check the datatypes of the variables in the dataframe. Do they make sense?

Modifying the dataframe

You will often need to modify the data you read in. The package dplyr contains a set of utilities that help you modify dataframes.3 dplyr argues that most data modification tasks can be broken into a set of tasks that can be accomplished with these functions named after verbs (https://dplyr.tidyverse.org/):

  1. mutate() creates new variables (vector) by transforming or operating on one or more other variables

  2. select() picks variables based on their names.

  3. filter() picks cases based on their values.

  4. arrange() changes the ordering of the rows.

  5. summarise() creates a new variable of different dimension by performing an operation that summarizes another variable(s)

  6. group_by() an intermediate function that can alter how mutate and summarize behave.

The following exercises will introduce you to these core functions while exploring the supermarket_sales data. These functions will all return a dataframe as the output.

1. Mutate

mutate() is the function to create or add new variables. Let’s recreate the column total by multiplying the columns unit_price and quantity. Note that we create a new variable called total_calc and assign this dataframe to a new object called my_super_sales.

my_super_sales <- mutate(supermarket_raw,total_calc=unit_price*quantity)

Is the result total_calc equal to the existing column total? Why or why not?

You can perform most mathematical operations on numeric columns and other types of operations on string or character data types. The key point to remember is that mutate operations must output a vector that is the same length as the dataframe.

2. Select (and rename)

select() is a function to subset columns from the dataframe. Suppose you want to create a new dataframe with only the invoice id, the city, the reported total, and the calculated total.

select_super_sales <- select(my_super_sales,invoice_id,city,total,total_calc)

rename() is a related function used to rename columns rather than subset them. Suppose that you didn’t like the name city and wanted to call it market.

select_super_sales <- rename(select_super_sales,market=city)

You can combine the operations using the rename syntax to rename a column inside of a select statement.

select_super_sales <- select(my_super_sales,invoice_id,market=city,total,total_calc)

3. Filter

filter() is a function to subset the rows of a dataframe based on a condition that returns a TRUE or FALSE. This is a very general concept used in all sorts of programming applications. In this context, we will write a statement that is true for some elements of the vector city (renamed to market). To filter the dataframe to only data from the market “Yangon”, we write

yangon_super_sales <- filter(select_super_sales,market=="Yangon")

The statement market=="Yangon" can be evaluated on every row as either TRUE or FALSE. This command keeps only the rows where the market name equals the exact word “Yangon.” This is case sensitive because the statement is comparing the market name in each row. You can also create statements using the greater than > or less than < comparing numeric values. You can find more information on logical operators here: https://www.statmethods.net/management/operators.html

4. Arrange (or sort)

arrange() is a function that modifies the order of the rows in the dataframe based on the values of a column or set of columns. R can sort numeric and character strings (alphabetical). Let’s start by sorting the dataframe in ascending order based on the column total.

yangon_super_sales_sorted <- arrange(yangon_super_sales,total)

You can sort on multiple columns. The order determines how the data are sorted.

Let’s try something more complex. Use the dataset with all of the markets (select_super_sales) and sort based on market, then by total in descending order to see the top sales in each market.

super_sales_top <- arrange(select_super_sales,market,desc(total))

The function desc() is short for descending and indicates that the total should be sorted in descending order.

Remember to comment in your code so you can refer back to see what you did.

Now you know the four dplyr verbs generally used to manipulate data. We covered mutate, filter, select, and arrange. Next, we will introduce summarize and a helpful companion function group_by.

5. Summarize

The summarize function is used to collapse multiple rows of data into a single row, by computing summary statistics for each group. This function can be used to summarize and aggregate data by computing statistics such as mean, sum, count, minimum, maximum, etc. for each group defined in the data.

We can use summarize() to calculate the total number of units sold in the dataset.

summarize(supermarket_raw,quantity_summed=sum(quantity)) 

Notice that the command above does not assign the output to an object. It can be useful to quickly check the output of a command before assigning an object. Is the variable quantity_summed accessible to you? In other words, could you call that value within R and perform other operations on it?

Try it on your own: Modify the previous command to assign the output to an object called total_sold.

Code
total_sold <- summarize(supermarket_raw,quantity_summed=sum(quantity)) 

Next, calculate the following: (1) Calculate the average of the quantity variable. In R, the function to calculate average is mean(). (2) Calculate the maximum unit_price. The function to calculate the maximum is max().

Code
#Mean quantity
summarize(supermarket_raw,quantity_mean=mean(quantity)) 

#Max price
summarize(supermarket_raw,max_price=max(unit_price)) 

6. Group by

Often times, you need to perform some operation on a group of data. You can think about this as a sequence of operations: part 1) subset the group that you want to focus on, and part 2) perform the operation. For example, you might want to know the quantity of goods sold by branch.

Step 1: Filter the data in each branch (let’s just do branch A for example)

branch_A_sales <- filter(supermarket_raw,branch=="A")

You should see a new dataframe called branch_A_sales with 340 observations. All of the variables from supermarket_raw are there including quantity.

Step 2: Sum the quantity of goods sold

branch_A_total_quantity <- summarize(branch_A_sales,total_quantity=sum(quantity))

Try it on your own: Do the same thing for branch B and C.

Code
branch_B_sales <- filter(supermarket_raw,branch=="B")
branch_B_total_quantity <- summarize(branch_B_sales,total_quantity=sum(quantity))

branch_C_sales <- filter(supermarket_raw,branch=="C")
branch_C_total_quantity <- summarize(branch_C_sales,total_quantity=sum(quantity))

This is pretty tedious. If operations in R or other programming languages seems tedious or repetitive, there are probably much better ways to do it. In this case, we can use a function group_by(). group_by() tells R to perform some operation like sum on each of the subgroups defined by another variable. Let’s sum the quantity by branch using group_by() and summarize(). We will do this in two steps:

Step 1: Use group_by() to group the data by branch

#This will create a signal for summarize but not perform any calculation.  
super_sales_grouped_branch <- group_by(supermarket_raw,branch)

Step 2: Use summarize() to calculate the total quantity (by branch)

#This will create a signal for summarize but not perform any calculation.  
total_quantity_by_branch <- summarize(super_sales_grouped_branch,total_quantity=sum(quantity))

Notice that the result has the quantity sum for each branch. Do they match your manual calculations?

Try it on your own: Calculate the average unit_price for each product_line in the data.

Chaining commands (or piping %>% or |>)

A script is a set of instructions for R that get executed in order. In some cases, you do not care about or need to save the intermediate output. The group_by and summarize sequence is a good example; you don’t need to store the dataframe super_sales_grouped_branch unless there is a clear use for it later.

The pipe symbol, represented as %>% or |>, plays a crucial role in streamlining code. It allows for a more readable way of writing sequences of operations.

What It Does: The pipe symbol takes the result of the expression on its left and passes it as the first argument to the function on its right. In simpler terms, it “pipes” the output of one function directly into the next function.

How to Read It: When you see x %>% f(), interpret it as “take x and then apply f() to it.”

Code Structure: Often, you’ll find the pipe at the end of a line, with the next operation starting on a new, often indented, line. This format enhances readability, especially in complex sequences of data manipulation steps.

Here is a brief example:

data %>%
  arrange(condition) %>%
  summarize(result = mean(column))

This code snippet takes called data, arranges it according to some condition, and then calculates the mean of column, summarizing the result. Each step is clearly separated by the pipe symbol, making the sequence of operations easy to follow.

Video tutorial here: https://youtu.be/e_SQnJpS5fA

Let’s return to our group_by() and summarize() sequence and use the pipe technique rather than assign the intermediate step to a dataframe. Try it!

Code
#Using the pipe to chain two statements 
total_quantity_by_branch_pipe <- group_by(supermarket_raw,branch) %>% #group by branch and pass the output to summarize
  summarize(.,total_quantity=sum(quantity)) #Sum the quantity by branch

Appending data

In many cases, you will need to combine parts of datasets into a single larger dataset.

Let’s return to the example of breaking up the dataset by branch. If you have not created dataframes for each branch above, do it using these commands:

branch_A_sales <- filter(supermarket_raw,branch=="A")
branch_B_sales <- filter(supermarket_raw,branch=="B")
branch_C_sales <- filter(supermarket_raw,branch=="C")

Notice that each of these dataframes contains only the sales from the branch indicated in the dataframe name.

Suppose you were given the data as three separate files and you read them into these dataframes. However, you want to profile company sales across the branches. You need to append them together. The dplyr function to append rows is bind_rows().4

super_reconstruct <- bind_rows(branch_A_sales,branch_B_sales,branch_C_sales) 

Note that bind_rows() looks for shared column names in each dataframe. If one dataframe is missing a column, it will be filled with NA. NA in R stands for “Not Available” or “Not Applicable.” `NA’ is used to indicate that a value is missing or undefined in a data set. This can happen when data is not available for a particular observation, or when a value cannot be calculated for some reason. When R encounters an “NA” value, it treats it differently from other values, and special functions are often used to handle missing data.

Log file (revisited)

To output the log from sourcing an R script to a text file, you can use the sink() function. The sink() function redirects R output to a file or connection.

For example, to redirect the output of source("script.R",echo = TRUE) to a file named “log.txt”, you can use the following command. Remember to run this set of commands from another script or directly in the console. Do not include it in the script you are sourcing (i.e., script.R).

sink("log.txt")
source("script.R",echo = TRUE)
sink()

This will run the script “script.R” and redirect all the output to the file “log.txt”. echo=TRUE ensures that the output of the script is just like you see in the console when you run each line individually. Once the script is finished, you can use the sink() function again to stop redirecting the output and return it to the console. Note that the sink command must be run outside of “script.R” because the command source() is equivalent to running every line of “script.R.”

Try it on your own: Open a new script and call it lab_03_log.R. Update the code above so it will run the script lab_03_script.R. Note: You will have to comment out any places where you installed packages, and used the commands setwd(), getwd(), and dir().

Code
sink("lab_03_log.txt")
source("lab_03_script.R",echo = TRUE)
sink()

Footnotes

  1. Remember that you need to be signed on to the VPN if you are off campus. If you are on campus, the link should take you to the RStudio sign on page.↩︎

  2. Install if necessary.↩︎

  3. The package dplyr is part of a larger set of packages known as the tidyverse https://www.tidyverse.org/. The term tidy refers to a structure of data. There is a large and growing community that develops R packages that adhere to the tidy principles, so there is extensive documentation and support available. Most of what I teach in this class will be part of the tidyverse or adherant to its principles. Video tutorial here: https://youtu.be/bUM3wX4YZDc↩︎

  4. bind_rows() is a more efficient version of a function in base R called rbind().↩︎