# This script corresponds to the lab notes for week 3.
Week 3 Lab: Data Processing with R
This Lab Contributes to Course Objectives 2, 3, 4, 5, 6, and 8
Learning Objectives
- Understand how to navigate directories on your file system
- Use the summarize and group_by functions
- Understand how to chain commands
- Bind dataframes together
- Read in data and evaluate datatypes
- Calculate new columns based on existing columns
- Subset or filter rows of the data
- Subset or select columns of the data
Download Lab Handout Here
Data processing with R
Data processing is an important step in analysis. It can be complicated depending on the number of data sources, quality of the data, and structure of the data. We will start with an already clean dataset to demonstrate how to use R to manipulate the dataset. We will continue to use these commands in increasingly complex combinations, so it will pay dividends to invest the time to understand them now.
Starting a task or project in R
Open RStudio on your computer.
Open up a new script.
- Create a new subfolder within your Documents/arec330 directory called lab_03.
- Save your new script as lab_03.R inside this folder.
- Write a brief comment in the first line of the script describing what the script will do.
- Set the working directory to the new lab_03 folder you created. This ensures that your script can access any data files you store there.
# Set working directory (modify the path if needed)
setwd("~/Documents/arec330/lab_03")
# Confirm working directory
getwd()
To check that your working directory is set correctly, look at the console window in RStudio. Click the curly arrow in the Console pane toolbar to refresh the working directory display.
- Load any libraries/packages. If you have not installed them before, you may need to install them first. In this lab, we will use three packages: readr, dplyr, and janitor.
# Install packages if not already installed
# Comment out these lines after the first time you run them
install.packages("readr")
install.packages("dplyr")
install.packages("janitor")
# Load necessary libraries
library(readr)
library(dplyr)
library(janitor)
install.packages()
If you have already installed these packages, you should comment out (add # before) the install.packages() lines before running your script again. This prevents R from unnecessarily reinstalling the packages every time you run your script, which can slow down execution.
What Are Packages and Why Do We Use Them?
R packages are collections of functions, data, and documentation that extend the basic functionality of R. The base version of R includes many useful functions, but specialized tasks—such as reading different file formats, cleaning data, or creating visualizations—often require additional tools.
For example, in this lab, we will use three packages:
- readr – for reading CSV files efficiently
- dplyr – for data manipulation
- janitor – for cleaning variable names and handling messy data
Why Do We Need to Install Packages?
When you first install R, it does not include all available packages. If you want to use a package, you need to install it first using the install.packages("package_name") function. Installation only needs to be done once on your computer. If you try to use a function from a package without loading it first, R will return an error saying the function is not found.
Why Do We Use the library() Function?
Even though a package is installed, R does not automatically load it when you start a new session. To use functions from an installed package, you must load it using the library() function. You only need to install a package once, but you must load it with library() every time you start R and want to use it.
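A quick way to see the install-once versus load-every-session distinction is to check whether a package is already installed without loading it. This is a small sketch using base R's requireNamespace(); the package names are the ones used in this lab.

```r
# Check (without attaching) whether each lab package is installed.
# requireNamespace() returns TRUE/FALSE instead of raising an error.
pkgs <- c("readr", "dplyr", "janitor")
installed <- sapply(pkgs, requireNamespace, quietly = TRUE)
print(installed)  # named logical vector: TRUE where the package is available

# install.packages() is needed once per machine for any FALSE entries;
# library() is still needed once per session before you use the functions.
```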
- Insert the following code below where you load packages. These commands print information about the system and version of R you are running. This information will be displayed when you source the script and is useful for troubleshooting.
version
print(.packages())
Read the data into R
- Download supermarket_sales.csv to your working directory.
- Use the function read_csv() to read supermarket_sales.csv into a dataframe called supermarket_raw.
# Read the data
supermarket_raw <- read_csv("supermarket_sales.csv")
# Check the first few rows
head(supermarket_raw)
read.csv() and read_csv()
Both read.csv() and read_csv() are used to read CSV files into R, but they come from different packages and have some key differences.
| Function | Package | Speed | Output Type | Default Behavior | Best Use Case |
|---|---|---|---|---|---|
| read.csv() | Base R | Slower | Data frame | Converted character columns to factors (before R 4.0) | Basic CSV reading in traditional R scripts |
| read_csv() | readr (tidyverse) | Faster | Tibble (modern data frame) | Keeps character columns as characters | Handles large datasets more efficiently, works better with modern workflows |
Why use read_csv() instead of read.csv()?
Going forward, because we will be reading data from websites, read_csv() is the better option.
- read_csv() supports reading from URLs directly without extra steps:

supermarket_raw <- read_csv("https://csu-arec-330.github.io/materials/unit_00/inputs/supermarket_sales.csv")

- With read.csv(), we have so far been downloading the file first. You can see how cumbersome that extra step is!
- read_csv() is faster and more efficient, especially for larger datasets.
- read_csv() treats text as text (character) instead of automatically converting it to factor, preventing issues when working with non-categorical text data.
- read_csv() returns a tibble, which has better printing behavior and is more compatible with the tidyverse.
In R, both tibbles and data frames store tabular data, but tibbles are a more modern version with improved behavior.
Key Differences
| Feature | Data Frame (Base R) | Tibble (tidyverse) |
|---|---|---|
| Character data | Converted to factors by default (before R 4.0) | Stays as character |
| Row names | Present by default | No row names |
| Printing | Shows all columns, can be cluttered | Shows first 10 rows, fits screen |
| Best for? | Base R functions | Tidyverse (dplyr, ggplot2) |
Why Use Tibbles?
Tibbles are easier to read and work better with the tidyverse. Since read_csv() returns tibbles, we will use them in this course.
Examples
# Create a data frame
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
str(df)  # Since R 4.0, 'name' stays character unless stringsAsFactors = TRUE

# Create a tibble
library(tibble)
tb <- tibble(name = c("Alice", "Bob"), age = c(25, 30))
str(tb)  # 'name' stays as character

# You can convert between them if needed
tibble_version <- as_tibble(df)   # Convert data frame to tibble
df_version <- as.data.frame(tb)   # Convert tibble to data frame
We will cover more on tibbles in future labs.
- Take a look at the dataframe. There are a few ways to inspect it:
Type the name of the dataframe in the console. This shows the data in the first several rows of the first several columns.
Use the function View() with the dataframe name as the argument. Notice View() has a capital ‘V’, unlike many other functions in R that are all lower case. Alternatively, you can click on the name of the object in the Environment pane.
View(supermarket_raw)
- Use the glimpse() function to get a compact summary of the dataframe.
glimpse(supermarket_raw)
Sometimes variable names have spaces. We generally avoid spaces in coding because they make variable names harder to reference and require special syntax in functions or calculations, which makes your code more difficult to read and maintain. Spaces are allowed in R, but they are awkward because you will be using these names in your code. We will cover renaming variables later today, but there is also a function to clean (i.e., modify) the names: clean_names(), which is part of the janitor package.
Try it on your own: Use the clean_names() function to convert all of the column names to snake case. Reference the documentation in the Help tab of the lower-right panel. The default conversion case is “snake”, but you can also specify it in the function explicitly.
supermarket_raw <- clean_names(supermarket_raw, "snake")
- Check the datatypes of the variables in the dataframe. Do they make sense?
# Check data types
str(supermarket_raw)
Modifying the Dataframe
You will often need to modify the data you read in. The dplyr package provides a set of functions that make data manipulation easier.1
dplyr simplifies data modification by breaking tasks into a set of core functions that follow a verb-based structure (https://dplyr.tidyverse.org/):
- mutate() – Creates new variables by transforming or operating on existing ones.
- select() – Chooses specific columns based on their names.
- filter() – Selects rows based on conditions.
- arrange() – Reorders rows based on values.
- summarise() – Aggregates values into a summary statistic, often changing the dimensions of the dataframe.
- group_by() – Alters how mutate() and summarise() behave by grouping data before applying operations.
Note: The function summarise() (British spelling) is the same as summarize() (American spelling). You may see both versions in documentation, but we will cover its use later in the lab.
The following exercises will introduce you to these core functions while exploring the supermarket_sales dataset. Each function returns a modified dataframe as output.
1. Mutate
The mutate() function creates or modifies variables within a dataframe.
mutate() Syntax
The mutate() function can be written generically as follows:
new_dataframe <- mutate(original_dataframe, new_column = operation, modified_column = transformation)
- original_dataframe \(\rightarrow\) The dataframe where new or modified columns will be created.
- new_column = operation \(\rightarrow\) Defines a new column by performing an operation.
- modified_column = transformation \(\rightarrow\) Modifies an existing column using a transformation.
- new_dataframe \(\rightarrow\) The resulting dataframe with the changes.
Important points about mutate()
- mutate() adds or modifies a column while keeping the dataframe the same size.
- It works with numeric columns (e.g., multiplication, addition) and character columns (e.g., string operations like combining text).
- Important: The result of a mutate() operation must be a vector of the same length as the dataframe. If the lengths don’t match, R will return an error.
Let’s create the column subtotal by multiplying unit_price and quantity. Here, we create a new variable called subtotal and store the modified dataframe in a new object called my_super_sales:
my_super_sales <- mutate(supermarket_raw, subtotal = unit_price * quantity)
Check: Is the calculated subtotal column exactly the same as the existing total column in supermarket_raw? Why or why not?
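One way to run that check: identical() demands a bit-for-bit match, while all.equal() tolerates the tiny floating-point differences that often arise in calculated columns. A minimal base-R illustration (toy vectors, not the supermarket data):

```r
a <- c(0.1 + 0.2, 1)  # floating-point arithmetic
b <- c(0.3, 1)

identical(a, b)          # FALSE: the stored bits differ slightly
isTRUE(all.equal(a, b))  # TRUE: equal within numerical tolerance
```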
2. Select (and Rename)
The select() function allows you to choose specific columns from a dataframe.
select() Syntax
The select() function can be written generically as follows:
new_dataframe <- select(original_dataframe, column1, column2, column3)
- original_dataframe \(\rightarrow\) The dataframe you are selecting from.
- column1, column2, column3 \(\rightarrow\) The columns you want to keep.
- new_dataframe \(\rightarrow\) The resulting dataframe with only the selected columns.
For example, suppose you want to create a new dataframe containing only:
- Invoice ID (invoice_id)
- City (city)
- Reported total (total)
- Calculated total (subtotal)
You can use select() like this:
select_super_sales <- select(my_super_sales, invoice_id, city, total, subtotal)
Renaming columns
If you want to rename a column instead of selecting a subset, use rename().
rename() Syntax
The rename() function can be written generically as follows:
new_dataframe <- rename(original_dataframe, new_name1 = old_name1, new_name2 = old_name2)
- original_dataframe \(\rightarrow\) The dataframe you are renaming columns in.
- new_name1 = old_name1 \(\rightarrow\) The new column name is on the left, the old column name is on the right.
- new_dataframe \(\rightarrow\) The resulting dataframe with renamed columns.
For example, if you prefer market instead of city:
select_super_sales <- rename(select_super_sales, market = city)
Combining select() and rename()
You can rename columns directly within select(), avoiding the need for a separate rename() step:
select_super_sales <- select(my_super_sales, invoice_id, market = city, total, subtotal)
3. Filter
The filter() function is used to keep only rows that meet a condition. It evaluates a logical statement for each row and retains rows where the result is TRUE. This is a common concept in programming and allows us to extract only the data that meets specific criteria.
filter() Syntax
The filter() function can be written generically as follows:
new_dataframe <- filter(original_dataframe, condition)
- original_dataframe \(\rightarrow\) The dataframe you are filtering.
- condition \(\rightarrow\) A logical statement that evaluates to TRUE or FALSE for each row.
- new_dataframe \(\rightarrow\) The resulting dataframe, containing only the rows where condition is TRUE.
Important points about filter()
- filter() keeps only rows where the condition is TRUE.
- Works with text (strings) and numbers using comparison operators.
- Case-sensitive when filtering strings.
- Multiple conditions can be combined using & (AND) or | (OR), which we will use later.
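Although combined conditions come up later in the course, a small sketch on toy data (assuming dplyr is installed; the column names mirror the supermarket data) shows how & and | behave:

```r
library(dplyr)

toy_sales <- tibble(
  unit_price = c(5, 12, 20, 8),
  quantity   = c(3, 2, 7, 1)
)

# AND: both conditions must be TRUE for a row to be kept
filter(toy_sales, unit_price > 10 & quantity < 5)  # keeps only the 12 / 2 row

# OR: a row is kept if either condition is TRUE
filter(toy_sales, unit_price > 10 | quantity < 5)  # keeps every row here
```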
In this example, we filter the dataframe to include only rows where market (renamed from city) is “Yangon”:
yangon_super_sales <- filter(select_super_sales, market == "Yangon")
How this works:
- The condition market == "Yangon" is checked for every row in the dataframe.
- If the condition is TRUE, the row is kept; if FALSE, the row is removed.
- This is case-sensitive, meaning “Yangon” must match exactly (uppercase and lowercase matter).
Other ways to filter data:
You can also use comparison operators to filter numeric values:
# Filter for rows where unit_price is greater than 10
high_price_sales <- filter(my_super_sales, unit_price > 10)

# Filter for rows where quantity is less than 5
low_quantity_sales <- filter(my_super_sales, quantity < 5)
For more on logical operators, check out this resource: Logical operators in R.
4. Arrange (Sorting Data)
The arrange() function reorders rows in a dataframe based on the values in one or more columns.
- R can sort both numeric values (smallest to largest) and character strings (alphabetical order).
- By default, sorting is ascending (smallest first), but you can use desc() to sort in descending order.
arrange() Syntax
The arrange() function can be written generically as follows:
sorted_dataframe <- arrange(original_dataframe, column1, column2)
- original_dataframe \(\rightarrow\) The dataframe you are sorting.
- column1 \(\rightarrow\) The primary column used for sorting (default: ascending).
- column2 (optional) \(\rightarrow\) If values in column1 are the same, use column2 for sorting.
- sorted_dataframe \(\rightarrow\) The resulting dataframe, now sorted.
Important points about arrange()
- arrange() modifies the order of rows without changing data values.
- Default sorting is ascending; use desc() for descending order.
- Sorting by multiple columns helps order data when the first column has duplicates.
Sorting by One Column
Let’s start by sorting the Yangon sales dataset (yangon_super_sales) in ascending order based on total:
yangon_super_sales_sorted <- arrange(yangon_super_sales, total)
Result: Reorders rows so that smaller total values appear first.
Sorting by Multiple Columns
You can sort by multiple columns. The order in which you list columns matters—R sorts by the first column first, then resolves ties using the second column, and so on.
Let’s use the full dataset (select_super_sales) and:
- Sort by market (alphabetical order).
- Within each market, sort total in descending order (highest sales first).
super_sales_top <- arrange(select_super_sales, market, desc(total))
Result: First, sorts by market alphabetically. Second, within each market, sorts total from highest to lowest.
Understanding desc()
The function desc() means “descending order”. Without it, arrange() sorts smallest to largest by default.
Examples
arrange(supermarket_raw, total) # Sorts total in ascending order (default)
arrange(supermarket_raw, desc(total)) # Sorts total in descending order
Always add comments in your code to help you remember what each step does! This makes it easier to understand your work when you revisit it later.
5. Summarize
The summarize() function collapses multiple rows into a single row by computing summary statistics, such as:
- Sum: sum()
- Mean: mean()
- Count: n()
- Minimum: min()
- Maximum: max()
These statistics help aggregate data based on groups or the entire dataset.
Calculating total quantity sold
We can use summarize() to calculate the total number of units sold across all transactions:
summarize(supermarket_raw, quantity_summed = sum(quantity))
Result: This outputs a single row with the total quantity sold. Notice: This command does not assign the result to an object. This is useful for a quick check before saving the output.
Is quantity_summed accessible?
Notice how in the code above we created a new variable called quantity_summed. Can you use it elsewhere in R?
The answer is No. Because we didn’t store it in an object, the value exists only in the output window and cannot be used later.
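The same idea holds for any R expression, not just summarize(). A quick base-R sketch:

```r
x <- c(2, 4, 6)

mean(x)         # printed to the console, then gone: nothing is stored
avg <- mean(x)  # stored in `avg`, so it can be reused
avg * 2         # 8: the stored value is available for later operations
```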
Let’s modify the command to store the result in an object called total_sold:
total_sold <- summarize(supermarket_raw, quantity_summed = sum(quantity))
Result: Now, the result is saved in total_sold and can be used in other operations.
Try it Yourself
Write code that calculates the following:
- The average (mean) of the quantity variable.
- The maximum (max) unit_price.
# Calculate the mean of quantity
summarize(supermarket_raw, quantity_mean = mean(quantity))
# Find the maximum unit price
summarize(supermarket_raw, max_price = max(unit_price))
summarize()
- summarize() reduces multiple rows into a single row with summary statistics.
- If you don’t assign the output to an object, you can’t use it later.
- Common summary functions: sum(), mean(), max(), min(), and n().
6. Group By
Often, we need to perform calculations on subgroups of data rather than the entire dataset. You can think of this as a two-step process:
- Subset the group you want to analyze.
- Perform the operation (e.g., sum, mean, count) on that group.
For example, let’s calculate the total quantity of goods sold by branch.
Grouping Data Manually
First, we filter for only Branch A:
branch_A_sales <- filter(supermarket_raw, branch == "A")
Result: The new dataframe branch_A_sales contains only sales from Branch A. It has 340 observations and includes all original columns.
Now, we sum the quantity column:
branch_A_total_quantity <- summarize(branch_A_sales, total_quantity = sum(quantity))
Result: This gives us the total quantity sold for Branch A.
Try it Yourself
Repeat the same process for Branch B and Branch C:
# Branch B
branch_B_sales <- filter(supermarket_raw, branch == "B")
branch_B_total_quantity <- summarize(branch_B_sales, total_quantity = sum(quantity))

# Branch C
branch_C_sales <- filter(supermarket_raw, branch == "C")
branch_C_total_quantity <- summarize(branch_C_sales, total_quantity = sum(quantity))
A Better Way to Group Data: Using group_by()
Manually filtering each branch is tedious. In programming, if something feels repetitive, there’s usually a better way to do it!
The group_by() function automatically groups data based on a variable (e.g., branch) so we can summarize each group in a single step.
group_by() Syntax
The function group_by() can be written generically as follows:
grouped_data <- group_by(original_dataframe, grouping_variable)
- original_dataframe → The dataframe you want to group.
- grouping_variable → The column that defines the groups.
- grouped_data → The resulting dataframe, where rows are grouped but unchanged.
Important points about group_by()
- group_by() does not change the data; it only signals R to group rows.
- Use summarize() after group_by() to calculate statistics for each group.
- summarize() performs calculations on each group separately.
- You can group by one or multiple columns.
- This is much more efficient than filtering each group manually.
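The point about grouping by multiple columns deserves a quick illustration. In this sketch (toy data, assuming dplyr is installed), grouping by two columns yields one summary row per branch-and-product combination; `.groups = "drop"` simply removes the grouping afterward:

```r
library(dplyr)

toy <- tibble(
  branch  = c("A", "A", "B", "B"),
  product = c("Food", "Drink", "Food", "Drink"),
  qty     = c(2, 3, 4, 1)
)

# One summary row per unique (branch, product) pair
by_branch_product <- toy %>%
  group_by(branch, product) %>%
  summarize(total = sum(qty), .groups = "drop")

by_branch_product  # 4 rows, one per combination
```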
Step 1: Group Data by Branch
# Group the data by branch
super_sales_grouped_branch <- group_by(supermarket_raw, branch)
Result: This does not perform any calculations yet. It creates a signal that tells R to treat each branch as a separate group.
Step 2: Summarize within Each Group
# Summarize total quantity sold by branch
total_quantity_by_branch <- summarize(super_sales_grouped_branch, total_quantity = sum(quantity))
Result: Now, we get the total quantity sold for each branch in a single step! Check: Do these results match your manual calculations?
Try it Yourself
Use group_by() and summarize() to calculate the average unit_price for each product_line in the dataset.
# Your turn: Find the average unit price by product line
avg_price_by_product <- summarize(group_by(supermarket_raw, product_line), avg_unit_price = mean(unit_price))
Chaining Commands (Piping with %>% or |>)
A script is a set of instructions for R that execute in order. Often, you do not need to store intermediate steps unless they serve a specific purpose later.
For example, in our group_by() and summarize() sequence, storing super_sales_grouped_branch is unnecessary if we only need the final result. Instead of creating temporary objects, we can chain commands using the pipe operator (%>% in dplyr, |> in base R).
What Does the Pipe Do?
The pipe passes the output of one function directly into the next function, eliminating the need for intermediate variables.
- x %>% f() means “take x and apply f() to it.”
- The pipe improves readability, especially when applying multiple functions in sequence.
data %>%                             # Starts with the data
  arrange(condition) %>%             # Arranges the data based on a condition
  summarize(result = mean(column))   # Passes the arranged data to summarize() to compute the mean of column
Each step in this code chunk is clearly separated by the pipe symbol %>%, making it easier to follow than a series of nested function calls:
- Step 1: Starts with data.
- Step 2: Arranges data based on condition.
- Step 3: Pass the arranged data to summarize() to compute the mean of column.
Using Piping with group_by() and summarize()
Let’s apply piping to our group_by() and summarize() sequence:
# Using the pipe to group and summarize in one step
total_quantity_by_branch_pipe <- supermarket_raw %>%  # Start with supermarket_raw
  group_by(branch) %>%                                # Group by branch
  summarize(total_quantity = sum(quantity))           # Summarize total quantity by branch
Each step in this code chunk is clearly separated by the pipe symbol %>%:
- Step 1: Start with supermarket_raw.
- Step 2: Group by branch.
- Step 3: Pass the grouped data to summarize() to calculate total_quantity.
Benefits of piping:
- Reduces unnecessary variables – no need to store temporary objects.
- Improves readability – each step is clearly separated.
- Easier debugging – you can run each part separately if needed.
Appending data
In many cases, you will need to combine multiple datasets into a single, larger dataset. This process is called appending data.
Breaking the Dataset into Branches
Let’s revisit the example where we split the dataset by branch. If you haven’t already created separate dataframes for each branch, you can do so with the following commands:
branch_A_sales <- filter(supermarket_raw, branch == "A")
branch_B_sales <- filter(supermarket_raw, branch == "B")
branch_C_sales <- filter(supermarket_raw, branch == "C")
Each of these dataframes contains sales records for only one branch.
Why Append Data?
Suppose you were originally given the data as three separate files—one per branch. If you want to analyze company-wide sales, you need to append these datasets back together into a single dataframe.
Using bind_rows() to append data
The bind_rows() function is used to append (stack) multiple dataframes by rows, even if they do not have identical column names. It is a flexible and efficient alternative to rbind() from base R.
super_reconstruct <- bind_rows(branch_A_sales, branch_B_sales, branch_C_sales)
How bind_rows() Works
- bind_rows() looks for shared column names in each dataframe.
- If one dataframe is missing a column, the missing values will be automatically filled with NA.
bind_rows() Syntax
The function bind_rows() can be written generically as follows:
new_dataframe <- bind_rows(dataframe1, dataframe2, dataframe3)
- dataframe1, dataframe2, dataframe3 \(\rightarrow\) Dataframes to be combined.
- new_dataframe \(\rightarrow\) The resulting dataframe with rows from all input dataframes.
Important points about bind_rows()
- Automatically handles missing columns: If a column exists in one dataframe but not another, bind_rows() fills the missing values with NA.
- Can combine more than two dataframes at once: Unlike rbind(), bind_rows() easily stacks multiple dataframes in a single function call.
- Works well with tibbles: bind_rows() is part of the tidyverse and integrates smoothly with tibbles.
bind_rows() and rbind()
Both functions append dataframes by rows, but they differ in flexibility.
bind_rows() (dplyr)
- Handles missing columns: Fills with NA if a column is missing.
- Supports multiple dataframes: Can bind more than two at once.
- Works with tibbles: Preserves tibble structure.
rbind() (Base R)
- Requires identical column names: Will fail if columns do not match.
- Only two at a time: Must manually chain for more.
- Converts tibbles to data frames.
Comparison Table
| Feature | bind_rows() (dplyr) | rbind() (Base R) |
|---|---|---|
| Missing columns | Fills with NA | Must match |
| Multiple inputs | Yes | No, only two |
| Tibble support | Yes | Converts to df |
| Performance | Faster | Slower for large data |
Examples
library(dplyr)
df1 <- tibble(id = 1:3, value = c(10, 20, 30))
df2 <- tibble(id = 4:5, category = c("A", "B"))  # No 'value' column

# Using bind_rows()
bind_rows(df1, df2)  # Adds 'category' column, fills missing values with NA

# Using rbind() (will fail)
rbind(df1, df2)  # Error: columns must match
Handling Missing Values (NA) in R
In R, NA stands for “Not Available” or “Not Applicable”. It indicates missing or undefined values in a dataset. This can happen when:
- Data is not available for a specific observation.
- A calculation cannot be performed for some reason.
When R encounters an NA value, it treats it differently from regular values. Special functions, such as is.na(), are often used to handle missing data appropriately in analysis.
Now that you’ve reconstructed the full dataset, you can proceed with company-wide sales analysis!
# LAB 03: Working with Data in R
# This script follows the steps outlined in Lab 03 to import, clean,
# modify, and export data using R and the `tidyverse` package.
# ---------------------------------------------
# 1. SETUP: Load required packages
# ---------------------------------------------
# Install required packages if not already installed (comment out after first run)
# install.packages(c("readr", "dplyr", "janitor"))
# Load necessary libraries
library(readr)
library(dplyr)
library(janitor)
version
print(.packages())
# ---------------------------------------------
# 2. SET WORKING DIRECTORY (if needed)
# ---------------------------------------------
# Set working directory (modify the path if needed)
setwd("~/Documents/arec330/lab_03")
# Confirm working directory
getwd()
# ---------------------------------------------
# 3. IMPORT DATA
# ---------------------------------------------
# Read the CSV file into a dataframe
supermarket_raw <- read_csv("supermarket_sales.csv")
# View the first few rows of the dataframe
# Option 1
head(supermarket_raw)
# Option 2
View(supermarket_raw)
# Option 3
glimpse(supermarket_raw)
# ---------------------------------------------
# 4. CLEAN COLUMN NAMES
# ---------------------------------------------
# Clean column names using `janitor::clean_names()`
supermarket_clean <- clean_names(supermarket_raw)
# Check column names after cleaning
colnames(supermarket_clean)
# Check data types
str(supermarket_raw)
# ---------------------------------------------
# 5. ADDING A NEW COLUMN USING `MUTATE()`
# ---------------------------------------------
# Create a new column `subtotal` to calculate total sale per transaction
my_super_sales <- mutate(supermarket_raw, subtotal = unit_price * quantity)

# Verify if `subtotal` matches the existing `total` column
identical(my_super_sales$subtotal, my_super_sales$total)
# Select specific columns
select_super_sales <- select(my_super_sales, invoice_id, city, total, subtotal)

# Rename the city variable to market
select_super_sales <- rename(select_super_sales, market = city)

# Combining select() and rename() in one step
select_super_sales <- select(my_super_sales, invoice_id, market = city, total, subtotal)
# ---------------------------------------------
# 6. SUBSETTING DATA USING `FILTER()`
# ---------------------------------------------
# Filter to keep only transactions from "Yangon"
yangon_super_sales <- filter(select_super_sales, market == "Yangon")

# Filter for rows where unit_price is greater than 10
high_price_sales <- filter(my_super_sales, unit_price > 10)

# Filter for rows where quantity is less than 5
low_quantity_sales <- filter(my_super_sales, quantity < 5)
# ---------------------------------------------
# 7. SORTING DATA USING `ARRANGE()`
# ---------------------------------------------
# Arrange rows by `total` in ascending order
yangon_super_sales_sorted <- arrange(yangon_super_sales, total)

# Sorting by multiple columns: market in ascending order, then total in descending order
super_sales_top <- arrange(select_super_sales, market, desc(total))
# ---------------------------------------------
# 8. SUMMARIZING DATA USING `SUMMARIZE()`
# ---------------------------------------------
# Calculate total quantity sold across all transactions
summarize(supermarket_raw, quantity_summed = sum(quantity))
# Now, store it as a data frame
total_sold <- summarize(supermarket_raw, quantity_summed = sum(quantity))
# Calculate the mean of quantity
summarize(supermarket_raw, quantity_mean = mean(quantity))
# Find the maximum unit price
summarize(supermarket_raw, max_price = max(unit_price))
# ---------------------------------------------
# 9. GROUPING DATA USING `GROUP_BY()`
# ---------------------------------------------
# Group the data by branch
super_sales_grouped_branch <- group_by(supermarket_raw, branch)

# Summarize total quantity sold by branch
total_quantity_by_branch <- summarize(super_sales_grouped_branch, total_quantity = sum(quantity))

# Your turn: Find the average unit price by product line
avg_price_by_product <- summarize(group_by(supermarket_raw, product_line), avg_unit_price = mean(unit_price))
# ---------------------------------------------
# 10. PIPING COMMANDS TOGETHER
# ---------------------------------------------
# Using the pipe to group and summarize in one step
total_quantity_by_branch_pipe <- supermarket_raw %>%  # Start with supermarket_raw
  group_by(branch) %>%                                # Group by branch
  summarize(total_quantity = sum(quantity))           # Summarize total quantity by branch
# ---------------------------------------------
# 11. APPENDING DATA
# ---------------------------------------------
branch_A_sales <- filter(supermarket_raw, branch == "A")
branch_B_sales <- filter(supermarket_raw, branch == "B")
branch_C_sales <- filter(supermarket_raw, branch == "C")

super_reconstruct <- bind_rows(branch_A_sales, branch_B_sales, branch_C_sales)
# ---------------------------------------------
# 12. EXPORTING DATA
# ---------------------------------------------
# Save cleaned dataset to CSV
write_csv(supermarket_clean, "supermarket_cleaned.csv")
# ---------------------------------------------
# END OF SCRIPT
# ---------------------------------------------
Saving the Log File for lab_03.R
To capture the output of sourcing an R script in a text file, use the sink() function. This redirects R output to a specified file. The source() function runs an external R script as if you typed each line in the console, allowing automation and reproducibility.
Example: Saving the Log for lab_03.R
- Open a new script.
- Save it as lab_03_log_script.R.
- Run the following commands. Do not include them in lab_03.R itself.
sink("lab_03_log.txt")
source("lab_03.R", echo = TRUE)
sink()
This will execute lab_03.R and store its output in lab_03_log.txt. The argument echo = TRUE ensures the log captures the script output as seen in the console.
Why Use .log or .txt for output?
When capturing script output using sink(), you can save the log as either a .log or .txt file.
Functionally, both file types generate the same output, and R does not treat them differently. The choice between .log and .txt is mostly a matter of convention:
- .log is often used for logging outputs from scripts, making it clear that the file contains execution details.
- .txt is a more general format, making it easier to open with standard text editors or share with others.
In short: Either extension works!
Before running the log script, in your lab_03.R script, comment out any occurrences of:
- install.packages()
- setwd()
- getwd()
- dir()
Also, save your lab_03.R script before executing the log script.
Here is a video that walks you through how to generate a log file using sink() and source() in RStudio. You can also access the video through this link.
Footnotes
The package dplyr is part of the tidyverse (https://www.tidyverse.org/), a collection of R packages designed for data science. Tidy refers to a structured way of organizing data. The tidyverse has extensive documentation and a large community for support. Most of what we cover in this class follows tidyverse principles. Video tutorial: https://youtu.be/bUM3wX4YZDc ↩︎