Part 1: Data cleaning in R

In today’s class, you’ve learned about various tidyverse functions like select(), filter(), mutate(), rename(), and arrange(). Now, it’s your turn to apply these functions to clean a dataset.

The Uncleaned Dataset

Here’s your dataset, representing sales transactions from an online store:

# Creating the uncleaned dataset
unclean_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  Item_sold = c("Laptop", "Mobile", "Laptop", "TV", "Mobile", "TV"),
  units = c(5, 3, 4, 2, 1, 3),
  Price_per_unit = c(500, 300, 499, 450, 310, 460),
  date = c("2022-12-01", "2022-11-25", "2022-11-30", "2022-11-29", "2022-12-03", "2022-11-30"),
  isActive = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
  Customer_email = c("a@", "b@", "c@", "d@", "e@", "f@")
)

head(unclean_data)
##   ID Item_sold units Price_per_unit       date isActive Customer_email
## 1  1    Laptop     5            500 2022-12-01     TRUE             a@
## 2  2    Mobile     3            300 2022-11-25     TRUE             b@
## 3  3    Laptop     4            499 2022-11-30    FALSE             c@
## 4  4        TV     2            450 2022-11-29    FALSE             d@
## 5  5    Mobile     1            310 2022-12-03     TRUE             e@
## 6  6        TV     3            460 2022-11-30     TRUE             f@

Your task is to follow the steps below to clean the data. Note: To verify that you’ve obtained the desired results, consider using functions like head(YOUR_DATA) or others that you find appropriate.

Please remember to install and/or load the appropriate libraries first.
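
For example, the functions used below come from dplyr and tidyr, both of which are loaded together with the tidyverse:

# Install once if needed, then load the package
# install.packages("tidyverse")
library(tidyverse)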

Step 1: Select Necessary Columns

The isActive and Customer_email columns are not relevant for the current analysis. Exclude these from the dataset. Create a new dataframe called dta_select to save your results. Use head() to verify your answer.

dta_select <- unclean_data %>%
  select(-isActive, -Customer_email)

head(dta_select)
##   ID Item_sold units Price_per_unit       date
## 1  1    Laptop     5            500 2022-12-01
## 2  2    Mobile     3            300 2022-11-25
## 3  3    Laptop     4            499 2022-11-30
## 4  4        TV     2            450 2022-11-29
## 5  5    Mobile     1            310 2022-12-03
## 6  6        TV     3            460 2022-11-30

Step 2: Filter the Data

Start from dta_select. Retain only the rows where the item sold is either “Laptop” or “TV”. Save the results to dta_filter. Use head() to verify your answer.

dta_filter <- dta_select %>%
  filter(Item_sold %in% c("Laptop", "TV"))

head(dta_filter)
##   ID Item_sold units Price_per_unit       date
## 1  1    Laptop     5            500 2022-12-01
## 2  3    Laptop     4            499 2022-11-30
## 3  4        TV     2            450 2022-11-29
## 4  6        TV     3            460 2022-11-30

Step 3: Add a New Column

Start from dta_filter. Calculate the total price for each transaction (units multiplied by Price_per_unit) and create a new column named Total_price. Save the results to dta_mutate. Use head() to verify your answer.

dta_mutate <- dta_filter %>%
  mutate(Total_price = units * Price_per_unit)

head(dta_mutate)
##   ID Item_sold units Price_per_unit       date Total_price
## 1  1    Laptop     5            500 2022-12-01        2500
## 2  3    Laptop     4            499 2022-11-30        1996
## 3  4        TV     2            450 2022-11-29         900
## 4  6        TV     3            460 2022-11-30        1380

Step 4: Rename Columns

Start from dta_mutate. Rename the columns so that they start with a capital letter and are written in camel case. For instance, Item_sold should be renamed to ItemSold. Save the results to dta_rename. Use head() to verify your answer.

dta_rename <- dta_mutate %>%
  rename(
    ID = ID,
    ItemSold = Item_sold,
    Units = units,
    PricePerUnit = Price_per_unit,
    Date = date,
    TotalPrice = Total_price
  )

head(dta_rename)
##   ID ItemSold Units PricePerUnit       Date TotalPrice
## 1  1   Laptop     5          500 2022-12-01       2500
## 2  3   Laptop     4          499 2022-11-30       1996
## 3  4       TV     2          450 2022-11-29        900
## 4  6       TV     3          460 2022-11-30       1380

Step 5: Sort the Data

Start from dta_rename. Sort the data by Date in ascending order and then by TotalPrice in descending order. Save the results to dta_arrange. Use head() to verify your answer.

dta_arrange <- dta_rename %>%
  arrange(Date, desc(TotalPrice))

head(dta_arrange)
##   ID ItemSold Units PricePerUnit       Date TotalPrice
## 1  4       TV     2          450 2022-11-29        900
## 2  3   Laptop     4          499 2022-11-30       1996
## 3  6       TV     3          460 2022-11-30       1380
## 4  1   Laptop     5          500 2022-12-01       2500

Part 2: Data transformation in R

In this activity, you’ll practice transforming data between long and wide formats using the pivot_longer() and pivot_wider() functions from the tidyr package (part of the tidyverse). We have provided two datasets for this purpose—one in long format and one in wide format.

Note: Please use head() or other functions as you see fit to examine whether you have obtained the correct results.

Dataset in Long Format

# The long-format dataset
long_data <- data.frame(
  StudentID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Subject = c("Math", "Science", "History", "Math", "Science", "History", "Math", "Science", "History"),
  Score = c(85, 90, 80, 88, 92, 84, 76, 81, 78)
)

Dataset in Wide Format

# The wide-format dataset
wide_data <- data.frame(
  StudentID = c(1, 2, 3),
  Math = c(85, 88, 76),
  Science = c(90, 92, 81),
  History = c(80, 84, 78)
)

Task 1: Transform the Long Format Data to Wide Format

Use the pivot_wider() function to transform the long_data dataset into a wide format. In the new dataset, each row should represent a unique StudentID, and there should be separate columns for Math, Science, and History scores.

long_to_wide <- long_data %>%
  pivot_wider(names_from = Subject, values_from = Score)

head(long_to_wide)
## # A tibble: 3 × 4
##   StudentID  Math Science History
##       <dbl> <dbl>   <dbl>   <dbl>
## 1         1    85      90      80
## 2         2    88      92      84
## 3         3    76      81      78

Task 2: Transform the Wide Format Data to Long Format

Use the pivot_longer() function to transform the wide_data dataset into a long format. The new dataset should have three columns: StudentID, Subject, and Score.

wide_to_long <- wide_data %>%
  pivot_longer(cols = -StudentID, names_to = "Subject", values_to = "Score")

head(wide_to_long)
## # A tibble: 6 × 3
##   StudentID Subject Score
##       <dbl> <chr>   <dbl>
## 1         1 Math       85
## 2         1 Science    90
## 3         1 History    80
## 4         2 Math       88
## 5         2 Science    92
## 6         2 History    84

Part 3: Joining datasets in R & Filtering with a Type Mismatch

This activity aims to help you practice different types of joins using the tidyverse package. We will be working with two datasets: students and grades.

Your Datasets

# Students dataset
students <- data.frame(
  StudentID = c(1, 2, 3, 4, 5),
  FirstName = c("Alice", "Bob", "Charlie", "David", "Eva"),
  LastName = c("Smith", "Johnson", "Brown", "Williams", "Davis")
)

# Grades dataset
grades <- data.frame(
  StudentID = c(1, 2, 3, 6, 7),
  Math = c(90, 85, 87, 88, 92),
  Science = c(80, 89, 91, 78, 85)
)

Task 1: Left Join

Use the appropriate join function to join the students dataset with the grades dataset based on the StudentID column. This should keep all rows from students and match the corresponding grades if available.

left_joined <- left_join(students, grades, by = "StudentID")

head(left_joined)
##   StudentID FirstName LastName Math Science
## 1         1     Alice    Smith   90      80
## 2         2       Bob  Johnson   85      89
## 3         3   Charlie    Brown   87      91
## 4         4     David Williams   NA      NA
## 5         5       Eva    Davis   NA      NA

Task 2: Inner Join

Use the appropriate join function to combine the students and grades datasets. The result should only include the students that exist in both tables.

inner_joined <- inner_join(students, grades, by = "StudentID")

head(inner_joined)
##   StudentID FirstName LastName Math Science
## 1         1     Alice    Smith   90      80
## 2         2       Bob  Johnson   85      89
## 3         3   Charlie    Brown   87      91

Task 3: Full Join

Use the appropriate join function to combine students and grades. The result should include all unique keys (Student IDs) from both tables.

full_joined <- full_join(students, grades, by = "StudentID")

head(full_joined)
##   StudentID FirstName LastName Math Science
## 1         1     Alice    Smith   90      80
## 2         2       Bob  Johnson   85      89
## 3         3   Charlie    Brown   87      91
## 4         4     David Williams   NA      NA
## 5         5       Eva    Davis   NA      NA
## 6         6      <NA>     <NA>   88      78

Task 4: Anti Join

Use the appropriate join function to find the rows in students that do not have corresponding grades in the grades table.

missing_grades <- anti_join(students, grades, by = "StudentID")

head(missing_grades)
##   StudentID FirstName LastName
## 1         4     David Williams
## 2         5       Eva    Davis

Filtering with a Type Mismatch

You have a dataset where the score column contains numerical values stored as character strings. Your task is to filter rows where the score is greater than 4 using dplyr.

# Load the data (example)
data_example <- tibble(id = 1:6, 
                       score = c("5", "3", "6", "2", "7", "not taken"))

Try filtering rows where score > 4 and examine the results.

data_example %>%
  filter(score > 4)
## # A tibble: 4 × 2
##      id score    
##   <int> <chr>    
## 1     1 5        
## 2     3 6        
## 3     5 7        
## 4     6 not taken

Did you observe something? What may cause this problem?
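
Because score is a character column, R coerces 4 to "4" and compares the values as strings, character by character, rather than as numbers:

# Lexicographic (character-by-character) comparison, not numeric comparison
"5" > "4"          # TRUE
"3" > "4"          # FALSE
"not taken" > "4"  # TRUE, because "n" sorts after "4"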

Now, replace the value “not taken” with NA and convert the variable score to numeric. Add the result to data_example as a new column named score_cleaned, then filter the rows where score_cleaned > 4 and examine the results.

data_example <- data_example %>%
  mutate(score_cleaned = as.numeric(ifelse(score == "not taken", NA, score)))
filtered <- data_example %>%
  filter(score_cleaned > 4)
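
As in the earlier steps, head() confirms that only the three genuinely numeric scores above 4 remain (expected output shown below):

head(filtered)
## # A tibble: 3 × 3
##      id score score_cleaned
##   <int> <chr>         <dbl>
## 1     1 5                 5
## 2     3 6                 6
## 3     5 7                 7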

Bonus Exercise (Optional: extra 5 points added to the total score)

Please be aware that this bonus exercise is optional and designed for students who feel confident in their current abilities and wish to enhance their skills in data cleaning and processing. It is an opportunity for self-directed advancement and is not recommended for everyone. Only consider this exercise if you are comfortable dedicating extra time to it.

Note that to successfully complete this exercise, you will need a good understanding of regular expressions, a technique to identify and extract string patterns.

In this exercise, I would like to introduce you to practical skills in data cleaning and manipulation using R. The dataset in question stems from a rater study aimed at evaluating the speech performance of various speakers. These speakers are categorized into three groups based on prior assessments by certified raters: Flat, Pronunciation Better, and Lexical Grammar Better.

Participants in this study will assume the role of raters. They are tasked with evaluating five distinct aspects of speech: pronunciation, lexical grammar, rhetorical organization, topical development, and the holistic speaking skill of the speakers.

Each rater will also record their confidence level for their ratings, but only for the four analytical aspects. Consequently, a single rater provides nine separate ratings for each speaker:

  • pronunciation_Rating
  • pronunciation_ConfidenceRating
  • lexicalgrammar_Rating
  • lexicalgrammar_ConfidenceRating
  • rhetoricalorganization_Rating
  • rhetoricalorganization_ConfidenceRating
  • topicaldevelopment_Rating
  • topicaldevelopment_ConfidenceRating
  • holistic_Rating

The current dataset (Data_EyeTracking.csv) is in a wide format, which students are expected to convert into a long format suitable for analysis. An example entry in the dataset might look like this: 007_1_F#1_1 (the fourth column in the dataset), where 007 is the Speaker ID, 1 indicates the scale (with 1 representing pronunciation, 2 for lexical grammar, etc., up to 5 for holistic), F signifies the Flat group, #1 denotes the Rating (not the confidence rating), and the final _1 is redundant.
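
As an illustration of the regular-expression work involved (not a required part of the solution), str_match() from the stringr package can pull those pieces out of one such column name:

library(stringr)

# Returns a matrix: the full match followed by "007", "1", "F", "1";
# the trailing _1 is matched but not captured
str_match("007_1_F#1_1", "^(\\d+)_(\\d)_([FPL])#(\\d)_\\d$")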

You are asked to create a dataset with the following column structure:

  • speaker (Speaker ID)
  • rater (Rater ID)
  • speakerType (F for Flat, P for Pronunciation Better, L for Lexical Grammar Better)
  • scale (1 for pronunciation, 2 for lexical grammar, 3 for rhetorical organization, 4 for topical development, 5 for holistic)
  • ratingType (1 for rating, 2 for confidence rating)
  • value (the actual rating values)

The dataset after transformation should look like Data_EyeTracking_long.csv.
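
If you would like a starting point, here is a minimal sketch rather than a complete solution. It assumes the raw file has one row per rater, an existing rater ID column (called Rater below purely as a placeholder), and rating columns that all follow the naming pattern described above; adapt the column selection and names to the actual file.

library(tidyverse)

raw <- read_csv("Data_EyeTracking.csv")

eye_long <- raw %>%
  pivot_longer(
    cols = matches("^\\d+_\\d_[FPL]#\\d_\\d$"),          # only the encoded rating columns
    names_pattern = "^(\\d+)_(\\d)_([FPL])#(\\d)_\\d$",  # speaker_scale_group#ratingType_redundant
    names_to = c("speaker", "scale", "speakerType", "ratingType"),
    values_to = "value"
  ) %>%
  mutate(scale = as.integer(scale), ratingType = as.integer(ratingType)) %>%
  rename(rater = Rater)   # placeholder: replace with the actual rater ID column

The regular expression in names_pattern mirrors the structure described above; the extracted pieces come back as character, which is why scale and ratingType are converted to integers.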