In today’s class, you’ve learned about various tidyverse functions
like select(), filter(),
mutate(), rename(), and
arrange(). Now, it’s your turn to apply these functions to
clean a dataset.
The Uncleaned Dataset Here’s your dataset, representing sales transactions from an online store:
# Creating the uncleaned dataset
unclean_data <- data.frame(
ID = c(1, 2, 3, 4, 5, 6),
Item_sold = c("Laptop", "Mobile", "Laptop", "TV", "Mobile", "TV"),
units = c(5, 3, 4, 2, 1, 3),
Price_per_unit = c(500, 300, 499, 450, 310, 460),
date = c("2022-12-01", "2022-11-25", "2022-11-30", "2022-11-29", "2022-12-03", "2022-11-30"),
isActive = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
Customer_email = c("a@", "b@", "c@", "d@", "e@", "f@")
)
head(unclean_data)
## ID Item_sold units Price_per_unit date isActive Customer_email
## 1 1 Laptop 5 500 2022-12-01 TRUE a@
## 2 2 Mobile 3 300 2022-11-25 TRUE b@
## 3 3 Laptop 4 499 2022-11-30 FALSE c@
## 4 4 TV 2 450 2022-11-29 FALSE d@
## 5 5 Mobile 1 310 2022-12-03 TRUE e@
## 6 6 TV 3 460 2022-11-30 TRUE f@
Your task is to follow the steps below to clean the data. Note: To verify that you’ve obtained the desired results, consider using functions like head(YOUR_DATA) or others that you find appropriate.
Please remember to install and/or load the appropriate libraries first.
Step 1: Select Necessary Columns
The isActive and Customer_email columns are
not relevant for the current analysis. Exclude these from the dataset.
Create a new dataframe called dta_select to save your
results. Use head() to verify your answer.
dta_select <- unclean_data %>%
select(-isActive, -Customer_email)
head(dta_select)
## ID Item_sold units Price_per_unit date
## 1 1 Laptop 5 500 2022-12-01
## 2 2 Mobile 3 300 2022-11-25
## 3 3 Laptop 4 499 2022-11-30
## 4 4 TV 2 450 2022-11-29
## 5 5 Mobile 1 310 2022-12-03
## 6 6 TV 3 460 2022-11-30
Step 2: Filter the Data
Start from dta_select. Retain only the rows where the
item sold is either “Laptop” or “TV”. Save the results to
dta_filter. Use head() to verify your
answer.
dta_filter <- dta_select %>%
filter(Item_sold %in% c("Laptop", "TV"))
head(dta_filter)
## ID Item_sold units Price_per_unit date
## 1 1 Laptop 5 500 2022-12-01
## 2 3 Laptop 4 499 2022-11-30
## 3 4 TV 2 450 2022-11-29
## 4 6 TV 3 460 2022-11-30
Step 3: Add a New Column
Start from dta_filter. Calculate the total price for
each transaction (units multiplied by Price_per_unit) and create a new
column named Total_price. Save the results to dta_mutate.
Use head() to verify your answer.
dta_mutate <- dta_filter %>%
mutate(Total_price = units * Price_per_unit)
head(dta_mutate)
## ID Item_sold units Price_per_unit date Total_price
## 1 1 Laptop 5 500 2022-12-01 2500
## 2 3 Laptop 4 499 2022-11-30 1996
## 3 4 TV 2 450 2022-11-29 900
## 4 6 TV 3 460 2022-11-30 1380
Step 4: Rename Columns
Start from dta_mutate. Rename the columns so that they
start with a capital letter and are written in camel case. For instance,
Item_sold should be renamed to ItemSold. Save
the results to dta_rename. Use head() to
verify your answer.
dta_rename <- dta_mutate %>%
rename(
ID = ID,
ItemSold = Item_sold,
Units = units,
PricePerUnit = Price_per_unit,
Date = date,
TotalPrice = Total_price
)
head(dta_rename)
## ID ItemSold Units PricePerUnit Date TotalPrice
## 1 1 Laptop 5 500 2022-12-01 2500
## 2 3 Laptop 4 499 2022-11-30 1996
## 3 4 TV 2 450 2022-11-29 900
## 4 6 TV 3 460 2022-11-30 1380
Step 5: Sort the Data
Start from dta_rename. Sort the data by Date in
ascending order and then by TotalPrice in descending
order.Save the results to dta_arrange. Use
head() to verify your answer.
dta_arrange <- dta_rename %>%
arrange(Date, desc(TotalPrice))
head(dta_arrange)
## ID ItemSold Units PricePerUnit Date TotalPrice
## 1 4 TV 2 450 2022-11-29 900
## 2 3 Laptop 4 499 2022-11-30 1996
## 3 6 TV 3 460 2022-11-30 1380
## 4 1 Laptop 5 500 2022-12-01 2500
In this activity, you’ll practice transforming data between long and
wide formats using pivot_longer() and
pivot_wider() functions from the tidyverse package. We have
provided two datasets for this purpose—one in long format and one in
wide format.
**Note: Please use head() or other funcitons as you see
fit to examine whether you have obtained the correct results.
Dataset in Long Format
# The long-format dataset
long_data <- data.frame(
StudentID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
Subject = c("Math", "Science", "History", "Math", "Science", "History", "Math", "Science", "History"),
Score = c(85, 90, 80, 88, 92, 84, 76, 81, 78)
)
Dataset in Wide Format
# The wide-format dataset
wide_data <- data.frame(
StudentID = c(1, 2, 3),
Math = c(85, 88, 76),
Science = c(90, 92, 81),
History = c(80, 84, 78)
)
Task 1: Transform the Long Format Data to Wide Format
Use the pivot_wider() function to transform the
long_data dataset into a wide format. In the new dataset, each row
should represent a unique StudentID, and there should be
separate columns for Math, Science, and
History scores.
long_to_wide <- long_data %>%
pivot_wider(names_from = Subject, values_from = Score)
head(long_to_wide)
## # A tibble: 3 × 4
## StudentID Math Science History
## <dbl> <dbl> <dbl> <dbl>
## 1 1 85 90 80
## 2 2 88 92 84
## 3 3 76 81 78
Task 2: Transform the Wide Format Data to Long Format
Use the pivot_longer() function to transform the
wide_data dataset into a long format. The new dataset should have three
columns: StudentID, Subject, and
Score.
wide_to_long <- wide_data %>%
pivot_longer(cols = -StudentID, names_to = "Subject", values_to = "Score")
head(wide_to_long)
## # A tibble: 6 × 3
## StudentID Subject Score
## <dbl> <chr> <dbl>
## 1 1 Math 85
## 2 1 Science 90
## 3 1 History 80
## 4 2 Math 88
## 5 2 Science 92
## 6 2 History 84
This activity aims to help you practice different types of data
joining techniques using the tidyverse package. We will be
working with two datasets: students and grades.
Your Datasets
# Students dataset
students <- data.frame(
StudentID = c(1, 2, 3, 4, 5),
FirstName = c("Alice", "Bob", "Charlie", "David", "Eva"),
LastName = c("Smith", "Johnson", "Brown", "Williams", "Davis")
)
# Grades dataset
grades <- data.frame(
StudentID = c(1, 2, 3, 6, 7),
Math = c(90, 85, 87, 88, 92),
Science = c(80, 89, 91, 78, 85)
)
Task 1: Left Join Use the appropriate join function
to join the students dataset with the grades dataset based on the
StudentID column. This should keep all rows from students
and match the corresponding grades if available.
left_joined <- left_join(students, grades, by = "StudentID")
head(left_joined)
## StudentID FirstName LastName Math Science
## 1 1 Alice Smith 90 80
## 2 2 Bob Johnson 85 89
## 3 3 Charlie Brown 87 91
## 4 4 David Williams NA NA
## 5 5 Eva Davis NA NA
Task 2 Use the appropriate join function to combine students and grades datasets. This will only include the students that exist in both tables.
inner_joined <- inner_join(students, grades, by = "StudentID")
head(inner_joined)
## StudentID FirstName LastName Math Science
## 1 1 Alice Smith 90 80
## 2 2 Bob Johnson 85 89
## 3 3 Charlie Brown 87 91
Task 3 Perform a join function between students and grades. This will include all unique keys (Student IDs) from both tables.
full_joined <- full_join(students, grades, by = "StudentID")
head(full_joined)
## StudentID FirstName LastName Math Science
## 1 1 Alice Smith 90 80
## 2 2 Bob Johnson 85 89
## 3 3 Charlie Brown 87 91
## 4 4 David Williams NA NA
## 5 5 Eva Davis NA NA
## 6 6 <NA> <NA> 88 78
Task 4 Use the appropriate join function to find the rows in students that do not have corresponding grades in the grades table.
missing_grades <- anti_join(students, grades, by = "StudentID")
head(missing_grades)
## StudentID FirstName LastName
## 1 4 David Williams
## 2 5 Eva Davis
You have a dataset where the score column contains numerical values
stored as character strings. Your task is to filter rows where the score
is greater than 4 using dplyr.
# Load the data (example)
data_example <- tibble(id = 1:6,
score = c("5", "3", "6", "2", "7", "not taken"))
Try filtering rows where score > 4 and examine the results.
data_example %>%
filter(score > 4)
## # A tibble: 4 × 2
## id score
## <int> <chr>
## 1 1 5
## 2 3 6
## 3 5 7
## 4 6 not taken
Did you observe something? What may cause this problem?
Now please replace the value “not taken” to NA and
covert the variable score to a numeric variable, this
should be added as a new column to data_example and named
score_cleaned. Filter the rows where score > 4 and
examine the results.
data_example <- tibble(
id = 1:6,
score = c("5", "3", "6", "2", "7", "not taken")
)
data_example <- data_example %>%
mutate(score_cleaned = as.numeric(ifelse(score == "not taken", NA, score)))
filtered <- data_example %>%
filter(score_cleaned > 4)
Please be aware that this bonus exercise is optional and designed for students who feel confident in their current abilities and wish to enhance their skills in data cleaning and processing. It is an opportunity for self-directed advancement and is not recommended for everyone. Only consider this exercise if you are comfortable dedicating extra time to it.
Note that to successfully complete this exercise, you will need a good understanding of regular expressions, a technique to identify and extract string patterns.
In this exercise, I would like to introduce you to practical skills in data cleaning and manipulation using R. The dataset in question stems from a rater study aimed at evaluating the speech performance of various speakers. These speakers are categorized into three groups based on prior assessments by certified raters: Flat, Pronunciation Better, and Lexical Grammar Better.
Participants in this study will assume the role of raters. They are tasked with evaluating five distinct aspects of speech: pronunciation, lexical grammar, rhetorical organization, topical development, and the holistic speaking skill of the speakers.
Each rater will also record their confidence level regarding their ratings, but only for the four analytical aspects. Consequently, there will be nine separate ratings for each participant provided by a single rater:
The current dataset (Data_EyeTracking.csv) is in a wide
format, which students are expected to convert into a long format
suitable for analysis. An example entry in the dataset might look like
this: 007_1_F#1_1 (the fourth column in the dataset), where
007 is the Speaker ID, 1 indicates the scale (with 1 representing
pronunciation, 2 for lexical grammar, etc., up to 5 for holistic), F
signifies the Flat group, #1 denotes the Rating (not the confidence
rating), and the final _1 is redundant.
You are asked to create a dataset with the following column structure:
speaker (Speaker ID) rater (Rater ID)
speakerType (F for Flat, P for Pronunciation Better, L for
Lexical Grammar Better) scale (1 for pronunciation, 2 for
lexical grammar, 3 for rhetorical organization, 4 for topical
development, 5 for holistic) ratingType (1 for rating, 2
for confidence rating) value (the actual rating values)
The dataset after transformation should look like
Data_EyeTracking_long.csv.