Week 3 Lab: Data Processing with R

This Lab Contributes to Course Objectives 2, 3, 4, 5, 6, and 8

Learning Objectives R

Understand how to navigate directories on your file system
Use the summarize and group_by functions
Understand how to chain commands
Bind dataframes together
Read in data and evaluate datatypes
Calculate new columns based on existing columns
Subset or filter rows of the data
Subset or select columns of the data

Download Lab Handout Here

Data processing with R

Data processing is an important step in analysis. It can be complicated depending on the number of data sources, quality of the data, and structure of the data. We will start with an already clean dataset to demonstrate how to use R to manipulate the dataset. We will continue to use these commands in increasingly complex combinations, so it will pay dividends to invest the time to understand them now.

Starting a task or project in R

Open RStudio either on your computer.
Open up a new script.
- Create a new subfolder within your Documents/arec330 directory called lab_03.
- Save your new script as lab_03.R inside this folder.
Write a brief comment in the first line of the script describing what the script will do.

#This script corresponds to the lab notes for week 3.

Set the working directory to the new lab_03 folder you created. This ensures that your script can access any data files you store there.

# Set working directory (modify the path if needed)
setwd("~/Documents/arec330/lab_03")

# Confirm working directory
getwd()

Check that your working directory is set correctly

To check that your working directory is set correctly, look at the console window in RStudio. Click the curly arrow in the Console pane toolbar to refresh the working directory display.

Load any libraries/packages. If you have not installed them before, you may need to install them first. In this lab, we will use three packages: readr, dplyr, and janitor.

# Install packages if not already installed
# Comment out these lines after the first time you run them
install.packages("readr")
install.packages("dplyr")
install.packages("janitor")

# Load necessary libraries
library(readr)
library(dplyr)
library(janitor)

When to add comments with install.packages()

If you have already installed these packages, you should comment out (add # before) the install.packages() lines before running your script again. This prevents R from unnecessarily reinstalling the packages every time you run your script, which can slow down execution.

What Are Packages and Why Do We Use Them?

R packages are collections of functions, data, and documentation that extend the basic functionality of R. The base version of R includes many useful functions, but specialized tasks—such as reading different file formats, cleaning data, or creating visualizations—often require additional tools.

For example, in this lab, we will use three packages:

readr – for reading CSV files efficiently
dplyr – for data manipulation
janitor – for cleaning variable names and handling messy data

Why Do We Need to Install Packages?

When you first install R, it does not include all available packages. If you want to use a package, you need to install it first using the install.packages(“package_name”) function. Installation only needs to be done once on your computer. If you try to use a function from a package without loading it first, R will return an error saying the function is not found.

Why Do We Use the `library()` Function?

Even though a package is installed, R does not automatically load it when you start a new session. To use functions from an installed package, you must load it using the library() function. You only need to install a package once, but you must load it with library() every time you start R and want to use it.

Insert the following code below where you load packages. These commands print information about the system and version of R you are running. This information will be displayed when you source the script and is useful for troubleshooting.

version
print(.packages())

Read the data into R

Download supermarket_sales.csv to your working directory.
Use the function read_csv() to read supermarket_sales.csv into a dataframe called supermarket_raw.

# Read the data
supermarket_raw <- read_csv("supermarket_sales.csv")

# Check the first few rows
head(supermarket_raw)

Difference Between read.csv() and read_csv()

Both read.csv() and read_csv() are used to read CSV files into R, but they come from different packages and have some key differences.

Function	Package	Speed	Output Type	Default Behavior	Best Use Case
`read.csv()`	Base R	Slower	Data frame	Converts character columns to factors	Basic CSV reading in traditional R scripts
`read_csv()`	`readr` (tidyverse)	Faster	Tibble (modern data frame)	Keeps character columns as characters	Handles large datasets more efficiently, works better with modern workflows

Why use `read_csv()` instead of `read.csv()`?

Going forward, because we will be reading data from websites, read_csv() is the better option.

read_csv() supports reading from URLs directly without extra steps.

supermarket_raw <- read_csv("https://csu-arec-330.github.io/materials/unit_00/inputs/supermarket_sales.csv")

read.csv() requires downloading the file first or additional steps to read from a URL, just as we have been doing. You can see how cumbersome this is!
read_csv() is faster and more efficient, especially for larger datasets.
read_csv() treats text as text (character) instead of automatically converting it to factor, preventing issues when working with non-categorical text data.
read_csv() returns a tibble, which has better printing behavior and is more compatible with the tidyverse.

Tibbles vs. Data Frames

In R, both tibbles and data frames store tabular data, but tibbles are a more modern version with improved behavior.

Key Differences

Feature	Data Frame (Base R)	Tibble (tidyverse)
Character data	Converted to factors by default	Stays as character
Row names	Present by default	No row names
Printing	Shows all columns, can be cluttered	Shows first 10 rows, fits screen
Best for?	Base R functions	Tidyverse (`dplyr`, `ggplot2`)

Why Use Tibbles?

Tibbles are easier to read and work better with the tidyverse. Since read_csv() returns tibbles, we will use them in this course.

Examples

Code

# Create a data frame
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
str(df)  # Notice 'name' is a factor

# Create a tibble
library(tibble)
tb <- tibble(name = c("Alice", "Bob"), age = c(25, 30))
str(tb)  # 'name' stays as character

# You can convert between them if needed
tibble_version <- as_tibble(df)  # Convert data frame to tibble
df_version <- as.data.frame(tb)  # Convert tibble to data frame

We will cover more on tibbles in future labs.

Take a look at the dataframe. There are a few ways to inspect it:

Type the name of the dataframe in the console. This shows the data in the first several rows of the first several columns.
Use the function View() with the dataframe name as the argument. Notice View() has a capital ‘V’ unlike many other functions in R that are all lower case. Alternatively, you can click on the name of the object in the environment pane.

View(supermarket_raw)

Use the glimpse() function to get a compact summary of the dataframe.

glimpse(supermarket_raw)

Cleaning variable names

Sometimes variable names have spaces. We generally avoid spaces in coding because they make variable names harder to reference and require special syntax to use them in functions or calculations. This can make your code more difficult to read and maintain. Spaces are allowed in R, but they are awkward because you will be using these names in your code. We will cover renaming variables later today, but there is a function to clean (i.e., modify) the names. The function called clean_names() is part of the janitor package.

Try it on your own: Use the clean_names() function to convert all of the column names to snake case. Reference the documentation in the help tab of the lower right panel. The default conversion case is “snake” but you can also specify it in the function explicitly.

supermarket_raw <- clean_names(supermarket_raw, "snake")

Check the datatypes of the variables in the dataframe. Do they make sense?

# Check data types
str(supermarket_raw)

Modifying the Dataframe

You will often need to modify the data you read in. The dplyr package provides a set of functions that make data manipulation easier.¹

dplyr simplifies data modification by breaking tasks into a set of core functions that follow a verb-based structure (https://dplyr.tidyverse.org/):

mutate() – Creates new variables by transforming or operating on existing ones.
select() – Chooses specific columns based on their names.
filter() – Selects rows based on conditions.
arrange() – Reorders rows based on values.
summarise() – Aggregates values into a summary statistic, often changing the dimensions of the dataframe.
group_by() – Alters how mutate() and summarise() behave by grouping data before applying operations.

Note: The function summarise() (British spelling) is the same as summarize() (American spelling). You may see both versions in documentation, but we will cover its use later in the lab.

The following exercises will introduce you to these core functions while exploring the supermarket_sales dataset. Each function returns a modified dataframe as output.

1. Mutate

The mutate() function creates or modifies variables within a dataframe.

Reading mutate() Syntax

The mutate() function can be written generically as follows:

new_dataframe <- mutate(original_dataframe, new_column = operation, modified_column = transformation)

original_dataframe \(\rightarrow\) The dataframe where new or modified columns will be created.
new_column = operation \(\rightarrow\) Defines a new column by performing an operation.
modified_column = transformation \(\rightarrow\) Modifies an existing column using a transformation.
new_dataframe \(\rightarrow\) The resulting dataframe with the changes.

Important points about `mutate()`

mutate() adds or modifies a column while keeping the dataframe the same size.
It works with numeric columns (e.g., multiplication, addition) and character columns (e.g., string operations like combining text).
Important: The result of a mutate() operation must be a vector of the same length as the dataframe. If the lengths don’t match, R will return an error.

Let’s create the column subtotal by multiplying unit_price and quantity.

Here, we create a new variable called subtotal and store the modified dataframe in a new object called my_super_sales:

my_super_sales <- mutate(supermarket_raw, subtotal = unit_price * quantity)

Check: Is the calculated subtotal column exactly the same as the existing total column in supermarket_raw? Why or why not?

2. Select (and Rename)

The select() function allows you to choose specific columns from a dataframe.

Reading select() Syntax

The select() function can be written generically as follows:

new_dataframe <- select(original_dataframe, column1, column2, column3)

original_dataframe \(\rightarrow\) The dataframe you are selecting from.
column1, column2, column3 \(\rightarrow\) The columns you want to keep.
new_dataframe \(\rightarrow\) The resulting dataframe with only the selected columns.

For example, suppose you want to create a new dataframe containing only:
- Invoice ID (invoice_id)
- City (city)
- Reported total (total)
- Calculated total (subtotal)

You can use select() like this:

select_super_sales <- select(my_super_sales, invoice_id, city, total, subtotal)

Renaming columns

If you want to rename a column instead of selecting a subset, use rename().

Reading rename() Syntax

The rename() function can be written generically as follows:

new_dataframe <- rename(original_dataframe, new_name1 = old_name1, new_name2 = old_name2)

original_dataframe \(\rightarrow\) The dataframe you are renaming columns in.
`new_name1 = old_name1` \(\rightarrow\) The new column name is on the left, the old column name is on the right.
new_dataframe \(\rightarrow\) The resulting dataframe with renamed columns.

For example, if you prefer market instead of city:

select_super_sales <- rename(select_super_sales, market = city)

Combining `select()` and `rename()`

You can rename columns directly within select(), avoiding the need for a separate rename() step:

select_super_sales <- select(my_super_sales, invoice_id, market = city, total, subtotal)

3. Filter

The filter() function is used to keep only rows that meet a condition. It evaluates a logical statement for each row and retains rows where the result is TRUE. This is a common concept in programming and allows us to extract only the data that meets specific criteria.

Reading filter() Syntax

The filter() function can be written generically as follows:

new_dataframe <- filter(original_dataframe, condition)

original_dataframe \(\rightarrow\) The dataframe you are filtering.
condition \(\rightarrow\) A logical statement that evaluates to TRUE or FALSE for each row.
new_dataframe \(\rightarrow\) The resulting dataframe, containing only the rows where condition is TRUE.

Important points about `filter()`

filter() keeps only rows where the condition is TRUE.
Works with text (strings) and numbers using comparison operators.
Case-sensitive when filtering strings.
Multiple conditions can be combined using & (AND) or | (OR), which we will use later.

In this example, we filter the dataframe to include only rows where market (renamed from city) is “Yangon”:

yangon_super_sales <- filter(select_super_sales, market == "Yangon")

How this works:

The condition market == "Yangon" is checked for every row in the dataframe.
If the condition is TRUE, the row is kept; if FALSE, the row is removed.
This is case-sensitive, meaning “Yangon” must match exactly (uppercase and lowercase matter).

Other ways to filter data:

You can also use comparison operators to filter numeric values:

# Filter for rows where unit_price is greater than 10
high_price_sales <- filter(my_super_sales, unit_price > 10)

# Filter for rows where quantity is less than 5
low_quantity_sales <- filter(my_super_sales, quantity < 5)

For more on logical operators, check out this resource: Logical operators in R.

4. Arrange (Sorting Data)

The arrange() function reorders rows in a dataframe based on the values in one or more columns.

R can sort both numeric values (smallest to largest) and character strings (alphabetical order).
By default, sorting is ascending (smallest first), but you can use desc() to sort in descending order.

Reading arrange() Syntax

The arrange() function can be written generically as follows:

sorted_dataframe <- arrange(original_dataframe, column1, column2)

original_dataframe \(\rightarrow\) The dataframe you are sorting.
column1 \(\rightarrow\) The primary column used for sorting (default: ascending).
column2 (optional) \(\rightarrow\) If values in column1 are the same, use column2 for sorting.
sorted_dataframe \(\rightarrow\) The resulting dataframe, now sorted.

Important points about `arrange()`

arrange() modifies the order of rows without changing data values.
Default sorting is ascending; use desc() for descending order.
Sorting by multiple columns helps order data when the first column has duplicates.

Sorting by One Column

Let’s start by sorting the Yangon sales dataset (yangon_super_sales) in ascending order based on total:

yangon_super_sales_sorted <- arrange(yangon_super_sales, total)

Result: Reorders rows so that smaller total values appear first.

Sorting by Multiple Columns

You can sort by multiple columns. The order in which you list columns matters—R sorts by the first column first, then resolves ties using the second column, and so on.

Let’s use the full dataset (select_super_sales) and:

Sort by market (alphabetical order).
Within each market, sort total in descending order (highest sales first).

super_sales_top <- arrange(select_super_sales, market, desc(total))

Result: First, sorts by market alphabetically. Second, within each market, sorts total from highest to lowest.

Understanding `desc()`

The function desc() means “descending order”. Without it, arrange() sorts smallest to largest by default.

Examples

Code

arrange(supermarket_raw, total)       # Sorts total in ascending order (default)
arrange(supermarket_raw, desc(total)) # Sorts total in descending order

Reminder: Comment Your Code

Always add comments in your code to help you remember what each step does! This makes it easier to understand your work when you revisit it later.

5. Summarize

The summarize() function collapses multiple rows into a single row by computing summary statistics, such as:
- Sum: sum()
- Mean: mean()
- Count: n()
- Minimum: min()
- Maximum: max()

These statistics help aggregate data based on groups or the entire dataset.

Calculating total quantity sold

We can use summarize() to calculate the total number of units sold across all transactions:

summarize(supermarket_raw, quantity_summed = sum(quantity))

Result: This outputs a single row with the total quantity sold. Notice: This command does not assign the result to an object. This is useful for a quick check before saving the output.

Is quantity_summed accessible?

Notice how in the code above we created a new variable called quantity_summed. Can you use it elsewhere in R?

The answer is No. Because we didn’t store it in an object, the value exists only in the output window and cannot be used later.

Let’s modify the command to store the result in an object called total_sold:

total_sold <- summarize(supermarket_raw, quantity_summed = sum(quantity))

Result: Now, the result is saved in total_sold and can be used in other operations.

Try it Yourself

Write code that calculates the following:

The average (mean) of the quantity variable.
The maximum (max) unit_price.

# Calculate the mean of quantity
summarize(supermarket_raw, quantity_mean = mean(quantity))  

# Find the maximum unit price
summarize(supermarket_raw, max_price = max(unit_price))

Important points about summarize()

summarize() reduces multiple rows into a single row with summary statistics.
If you don’t assign the output to an object, you can’t use it later.
Common summary functions: sum(), mean(), max(), min(), and n().

6. Group By

Often, we need to perform calculations on subgroups of data rather than the entire dataset. You can think of this as a two-step process:

Subset the group you want to analyze.
Perform the operation (e.g., sum, mean, count) on that group.

For example, let’s calculate the total quantity of goods sold by branch.

Grouping Data Manually

First, we filter for only Branch A:

branch_A_sales <- filter(supermarket_raw, branch == "A")

Result: The new dataframe branch_A_sales contains only sales from Branch A. It has 340 observations and includes all original columns.

Now, we sum the quantity column:

branch_A_total_quantity <- summarize(branch_A_sales, total_quantity = sum(quantity))

Result: This gives us the total quantity sold for Branch A.

Try it Yourself

Repeat the same process for Branch B and Branch C:

# Branch B
branch_B_sales <- filter(supermarket_raw, branch == "B")
branch_B_total_quantity <- summarize(branch_B_sales, total_quantity = sum(quantity))

# Branch C
branch_C_sales <- filter(supermarket_raw, branch == "C")
branch_C_total_quantity <- summarize(branch_C_sales, total_quantity = sum(quantity))

A Better Way to Group Data: Using `group_by()`

Manually filtering each branch is tedious. In programming, if something feels repetitive, there’s usually a better way to do it!

The group_by() function automatically groups data based on a variable (e.g., branch) so we can summarize each group in a single step.

Reading group_by() Syntax

The function group_by() can be written generically as follows:

grouped_data <- group_by(original_dataframe, grouping_variable)

original_dataframe → The dataframe you want to group.
grouping_variable → The column that defines the groups.
grouped_data → The resulting dataframe, where rows are grouped but unchanged.

Important points about `group_by()`

group_by() does not change the data—it only signals R to group rows.
Use summarize() after group_by() to calculate statistics for each group.
summarize() performs calculations on each group separately.
You can group by one or multiple columns.
This is much more efficient than filtering each group manually.

Step 1: Group Data by Branch

# Group the data by branch
super_sales_grouped_branch <- group_by(supermarket_raw, branch)

Result: This does not perform any calculations yet. It creates a signal that tells R to treat each branch as a separate group.

Step 2: Summarize within Each Group

# Summarize total quantity sold by branch
total_quantity_by_branch <- summarize(super_sales_grouped_branch, total_quantity = sum(quantity))

Result: Now, we get the total quantity sold for each branch in a single step! Check: Do these results match your manual calculations?

Try it Yourself

Use group_by() and summarize() to calculate the average unit_price for each product_line in the dataset.

# Your turn: Find the average unit price by product line
avg_price_by_product <- summarize(group_by(supermarket_raw, product_line), avg_unit_price = mean(unit_price))

Chaining Commands (Piping with `%>%` or `|>`)

A script is a set of instructions for R that execute in order. Often, you do not need to store intermediate steps unless they serve a specific purpose later.

For example, in our group_by() and summarize() sequence, storing super_sales_grouped_branch is unnecessary if we only need the final result. Instead of creating temporary objects, we can chain commands using the pipe operator (%>% in dplyr, |> in base R).

What Does the Pipe Do?

The pipe passes the output of one function directly into the next function, eliminating the need for intermediate variables.

x %>% f() means “take x and apply f() to it.”
The pipe improves readability, especially when applying multiple functions in sequence.

How to Read Piping Syntax

data %>%  # Starts with the data
  arrange(condition) %>%   # Arranges the data based on a condition
  summarize(result = mean(column))  # Passes the arranged data to summarize() to compute the mean of column

Each step in this code chunk is clearly separated by the pipe symbol %>%, making it easier to follow than a series of nested function calls:

Step 1: Starts with data.
Step 2: Arranges data based on condition.
Step 3: Pass the arranged data to summarize() to compute the mean of column.

Video tutorial here

Using Piping with `group_by()` and `summarize()`

Let’s apply piping to our group_by() and summarize() sequence:

# Using the pipe to group and summarize in one step
total_quantity_by_branch_pipe <- supermarket_raw %>%  # Start with supermarket_raw
  group_by(branch) %>%  # Group by branch
  summarize(total_quantity = sum(quantity))  # Summarize total quantity by branch

Each step in this code chunk is clearly separated by the pipe symbol %>%:

Step 1: Start with supermarket_raw.
Step 2: Group by branch.
Step 3: Pass the grouped data to summarize() to calculate total_quantity.

Why Use Piping?

Reduces unnecessary variables – no need to store temporary objects.
Improves readability – each step is clearly separated.
Easier debugging – you can run each part separately if needed.

Appending data

In many cases, you will need to combine multiple datasets into a single, larger dataset. This process is called appending data.

Breaking the Dataset into Branches

Let’s revisit the example where we split the dataset by branch. If you haven’t already created separate dataframes for each branch, you can do so with the following commands:

branch_A_sales <- filter(supermarket_raw, branch == "A")
branch_B_sales <- filter(supermarket_raw, branch == "B")
branch_C_sales <- filter(supermarket_raw, branch == "C")

Each of these dataframes contains sales records for only one branch.

Why Append Data?

Suppose you were originally given the data as three separate files—one per branch. If you want to analyze company-wide sales, you need to append these datasets back together into a single dataframe.

Using bind_rows() to append data

The bind_rows() function is used to append (stack) multiple dataframes by rows, even if they do not have identical column names. It is a flexible and efficient alternative to rbind() from base R.

super_reconstruct <- bind_rows(branch_A_sales, branch_B_sales, branch_C_sales)

How bind_rows() Works

bind_rows() looks for shared column names in each dataframe.
If one dataframe is missing a column, the missing values will be automatically filled with NA.

Reading bind_rows() Syntax

The function bind_rows() can be written generically as follows:

new_dataframe <- bind_rows(dataframe1, dataframe2, dataframe3)

dataframe1, dataframe2, dataframe3 → Dataframes to be combined.
new_dataframe \(\rightarrow\) The resulting dataframe with rows from all input dataframes.

Important points about bind_rows()

Automatically handles missing columns: If a column exists in one dataframe but not another, bind_rows() fills the missing values with NA.
Can combine more than two dataframes at once: Unlike rbind(), bind_rows() easily stacks multiple dataframes in a single function call.
Works well with tibbles: bind_rows() is part of the tidyverse and integrates smoothly with tibbles.

Difference Between bind_rows() and rbind()

Both functions append dataframes by rows, but they differ in flexibility.

bind_rows() (dplyr)

Handles missing columns: Fills with NA if a column is missing.
Supports multiple dataframes: Can bind more than two at once.
Works with tibbles: Preserves tibble structure.

rbind() (Base R)

Requires identical column names: Will fail if columns do not match.
Only two at a time: Must manually chain for more.
Converts tibbles to data frames.

Comparison Table

Feature	`bind_rows()` (`dplyr`)	`rbind()` (Base R)
Missing columns	Fills with `NA`	Must match
Multiple inputs	Yes	No, only two
Tibble support	Yes	Converts to df
Performance	Faster	Slower for large data

Examples

library(dplyr)

df1 <- tibble(id = 1:3, value = c(10, 20, 30))
df2 <- tibble(id = 4:5, category = c("A", "B"))  # No 'value' column

# Using bind_rows()
bind_rows(df1, df2)  # Adds 'category' column, fills missing values with NA

# Using rbind() (will fail)
rbind(df1, df2)  # Error: columns must match

Handling Missing Values (`NA`) in R

In R, NA stands for “Not Available” or “Not Applicable”. It indicates missing or undefined values in a dataset. This can happen when:

Data is not available for a specific observation.
A calculation cannot be performed for some reason.

When R encounters an NA value, it treats it differently from regular values. Special functions, such as is.na(), are often used to handle missing data appropriately in analysis.

Now that you’ve reconstructed the full dataset, you can proceed with company-wide sales analysis!

Lab Script for Week 3: Data Processing with R

# LAB 03: Working with Data in R
# This script follows the steps outlined in Lab 03 to import, clean, 
# modify, and export data using R and the `tidyverse` package.

# ---------------------------------------------
# 1. SETUP: Load required packages
# ---------------------------------------------

# Install required packages if not already installed (comment out after first run)
# install.packages(c("readr", "dplyr", "janitor"))

# Load necessary libraries
library(readr)
library(dplyr)
library(janitor)

version
print(.packages())

# ---------------------------------------------
# 2. SET WORKING DIRECTORY (if needed)
# ---------------------------------------------

# Set working directory (modify the path if needed)
setwd("~/Documents/arec330/lab_03")

# Confirm working directory
getwd()

# ---------------------------------------------
# 3. IMPORT DATA
# ---------------------------------------------

# Read the CSV file into a dataframe
supermarket_raw <- read_csv("supermarket_sales.csv")

# View the first few rows of the dataframe
# Option 1
head(supermarket_raw)

# Option 2
View(supermarket_raw)

# Option 3
glimpse(supermarket_raw)

# ---------------------------------------------
# 4. CLEAN COLUMN NAMES
# ---------------------------------------------

# Clean column names using `janitor::clean_names()`
supermarket_clean <- clean_names(supermarket_raw)

# Check column names after cleaning
colnames(supermarket_clean)

# Check data types
str(supermarket_raw)

# ---------------------------------------------
# 5. ADDING A NEW COLUMN USING `MUTATE()`
# ---------------------------------------------

# Create a new column `total_calc` to calculate total sale per transaction
my_super_sales <- mutate(supermarket_raw, subtotal = unit_price * quantity)

# Verify if `total_calc` matches the existing `total` column
identical(my_super_sales$total_calc, my_super_sales$total)

# Rename the city variable to market
select_super_sales <- rename(select_super_sales, market = city)

# Combining select() and rename()
select_super_sales <- select(my_super_sales, invoice_id, market = city, total, subtotal)

# ---------------------------------------------
# 6. SUBSETTING DATA USING `FILTER()`
# ---------------------------------------------

# Filter to keep only transactions from "Yangon"
yangon_super_sales <- filter(select_super_sales, market == "Yangon")

# Filter for rows where unit_price is greater than 10
high_price_sales <- filter(my_super_sales, unit_price > 10)

# Filter for rows where quantity is less than 5
low_quantity_sales <- filter(my_super_sales, quantity < 5)

# ---------------------------------------------
# 7. SORTING DATA USING `ARRANGE()`
# ---------------------------------------------

# Arrange rows by `total` in ascending order
yangon_super_sales_sorted <- arrange(yangon_super_sales, total)

# Sorting by multiple columns: market in ascending order, then total in descending order
super_sales_top <- arrange(select_super_sales, market, desc(total))

# ---------------------------------------------
# 8. SUMMARIZING DATA USING `SUMMARIZE()`
# ---------------------------------------------

# Calculate total quantity sold across all transactions
summarize(supermarket_raw, quantity_summed = sum(quantity))

# Now, store it as a data frame
total_sold <- summarize(supermarket_raw, quantity_summed = sum(quantity))

# Calculate the mean of quantity
summarize(supermarket_raw, quantity_mean = mean(quantity))  

# Find the maximum unit price
summarize(supermarket_raw, max_price = max(unit_price))

# ---------------------------------------------
# 9. GROUPING DATA USING `GROUP_BY()`
# ---------------------------------------------

# Group the data by branch
super_sales_grouped_branch <- group_by(supermarket_raw, branch)

# Summarize total quantity sold by branch
total_quantity_by_branch <- summarize(super_sales_grouped_branch, total_quantity = sum(quantity))

# Your turn: Find the average unit price by product line
avg_price_by_product <- summarize(group_by(supermarket_raw, product_line), avg_unit_price = mean(unit_price))

# ---------------------------------------------
# 10. PIPING COMMANDS TOGETHER
# ---------------------------------------------

# Using the pipe to group and summarize in one step
total_quantity_by_branch_pipe <- supermarket_raw %>%  # Start with supermarket_raw
  group_by(branch) %>%  # Group by branch
  summarize(total_quantity = sum(quantity))  # Summarize total quantity by branch

# ---------------------------------------------
# 11. APPENDING DATA
# ---------------------------------------------

branch_A_sales <- filter(supermarket_raw, branch == "A")
branch_B_sales <- filter(supermarket_raw, branch == "B")
branch_C_sales <- filter(supermarket_raw, branch == "C")

super_reconstruct <- bind_rows(branch_A_sales, branch_B_sales, branch_C_sales)

# ---------------------------------------------
# 12. EXPORTING DATA
# ---------------------------------------------

# Save cleaned dataset to CSV
write_csv(supermarket_clean, "supermarket_cleaned.csv")

# ---------------------------------------------
# END OF SCRIPT
# ---------------------------------------------

Saving the Log File for `lab_03.R`

To capture the output of sourcing an R script in a text file, use the sink() function. This redirects R output to a specified file. The source() function runs an external R script as if you typed each line in the console, allowing automation and reproducibility.

Example: Saving the Log for lab_03.R

Open a new script.
Save it as lab_03_log_script.R.
Run the following commands. Do not include them in lab_03.R itself.

sink("lab_03_log.txt")
source("lab_03.R", echo = TRUE)
sink()

This will execute lab_03.R and store its output in lab_03_log.txt. The argument echo = TRUE ensures the log captures the script output as seen in the console.

Why Use .log or .txt for output?

When capturing script output using sink(), you can save the log as either a .log or .txt file.

Functionally, both file types generate the same output, and R does not treat them differently. The choice between .log and .txt is mostly a matter of convention:

.log is often used for logging outputs from scripts, making it clear that the file contains execution details.
.txt is a more general format, making it easier to open with standard text editors or share with others.

In short: Either extension works!

Important notes about running your log script

Before running the log script, in your lab_03.R script, comment out any occurrences of:

install.packages()
setwd()
getwd()
dir()

Also, save your lab_03.R script before executing the log script.

Here is a video that walks you through how to generate a log file using sink() and source() in RStudio. You can also access the video through this link.

Footnotes

The package dplyr is part of the tidyverse https://www.tidyverse.org/—a collection of R packages designed for data science. Tidy refers to a structured way of organizing data. The tidyverse has extensive documentation and a large community for support. Most of what we cover in this class follows tidyverse principles. Video tutorial: https://youtu.be/bUM3wX4YZDc ↩︎

Learning Objectives R

Download Lab Handout Here

Data processing with R

Starting a task or project in R

What Are Packages and Why Do We Use Them?

Why Do We Need to Install Packages?

Why Do We Use the library() Function?

Read the data into R

Why use read_csv() instead of read.csv()?

Key Differences

Why Use Tibbles?

Modifying the Dataframe

1. Mutate

Important points about mutate()

2. Select (and Rename)

Renaming columns

Combining select() and rename()

3. Filter

Important points about filter()

4. Arrange (Sorting Data)

Important points about arrange()

Sorting by One Column

Sorting by Multiple Columns

Understanding desc()

5. Summarize

6. Group By

Grouping Data Manually

A Better Way to Group Data: Using group_by()

Important points about group_by()

Chaining Commands (Piping with %>% or |>)

What Does the Pipe Do?

Using Piping with group_by() and summarize()

Appending data

Handling Missing Values (NA) in R

Saving the Log File for lab_03.R

Footnotes

Why Do We Use the `library()` Function?

Why use `read_csv()` instead of `read.csv()`?

Important points about `mutate()`

Combining `select()` and `rename()`

Important points about `filter()`

Important points about `arrange()`

Understanding `desc()`

A Better Way to Group Data: Using `group_by()`

Important points about `group_by()`

Chaining Commands (Piping with `%>%` or `|>`)

Using Piping with `group_by()` and `summarize()`

Handling Missing Values (`NA`) in R

Saving the Log File for `lab_03.R`