This document outlines the systematic approach to completing Project 2: Data Tidying and Transformation. The project involves selecting three independent wide-format datasets and transforming them into tidy formats suitable for analysis using R’s tidyr and dplyr packages.
Dataset 1: Airline Delays Analysis
Data Source
This dataset comes from Numbersense by Kaiser Fung (McGraw Hill, 2013) and examines arrival delay patterns for two airlines (Alaska Airlines and AM West) across five U.S. destination cities: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. The data captures the count of on-time and delayed flights for each airline-city combination.
Data Structure Before Tidying
Raw Wide-Format Data
The original dataset is presented in a wide format with one row per airline-status combination and one column per destination city:
Code
# Load raw airline delays dataairline_raw <-read.csv("airline_delays_raw.csv")# Display the raw structureknitr::kable(airline_raw, caption ="Table 1: Raw Wide-Format Airline Delay Data")
In the original table, airline names might appear only once per pair of rows (a common spreadsheet pattern). We’ll simulate and handle this:
Code
# Create a version with missing airline names to simulate the original formatairline_na <- airline_rawairline_na$Airline[c(2, 4)] <-NAcat("Data WITH missing values (simulating merged cells):\n")
Data WITH missing values (simulating merged cells):
Code
print(airline_na)
Airline Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 <NA> delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 <NA> delayed 117 415 65 129 61
Code
# Fill down the missing airline namesairline_filled <- airline_na %>%fill(Airline, .direction ="down")cat("\nData AFTER filling missing values:\n")
Data AFTER filling missing values:
Code
print(airline_filled)
Airline Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 AM WEST delayed 117 415 65 129 61
Code
cat("\nRows affected: 2 (rows with NA airline names)")
Rows affected: 2 (rows with NA airline names)
Step 2: Reshape from Wide to Long Format
Convert from wide (cities as columns) to long (cities as rows):
Code
# Pivot longer: convert city columns to rowsairline_long <- airline_filled %>%pivot_longer(cols =c("Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle"),names_to ="city",values_to ="count" ) %>%mutate(city =gsub("_", " ", city))knitr::kable(head(airline_long, 10), caption ="Table 2: Airline Data After Pivoting to Long Format")
Table 2: Airline Data After Pivoting to Long Format
# Normalize data types and structureairline_tidy <- airline_long %>%mutate(count =as.numeric(count),airline =tolower(Airline),status =tolower(Status),city =tolower(city) ) %>%select(airline, status, city, count) %>%arrange(airline, status, city)knitr::kable(head(airline_tidy, 12),caption ="Table 3: Airline Data After Normalization (snake_case)")
Table 3: Airline Data After Normalization (snake_case)
airline
status
city
count
alaska
delayed
los angeles
62
alaska
delayed
phoenix
12
alaska
delayed
san diego
20
alaska
delayed
san francisco
102
alaska
delayed
seattle
305
alaska
on time
los angeles
497
alaska
on time
phoenix
221
alaska
on time
san diego
212
alaska
on time
san francisco
503
alaska
on time
seattle
1841
am west
delayed
los angeles
117
am west
delayed
phoenix
415
Code
cat("Data types:\n")
Data types:
Code
print(sapply(airline_tidy, class))
airline status city count
"character" "character" "character" "numeric"
Step 4: Verify Data Integrity
Code
cat("Summary of tidy airline data:\n")
Summary of tidy airline data:
Code
print(summary(airline_tidy))
airline status city count
Length:20 Length:20 Length:20 Min. : 12.00
Class :character Class :character Class :character 1st Qu.: 92.75
Mode :character Mode :character Mode :character Median : 216.50
Mean : 550.00
3rd Qu.: 435.50
Max. :4840.00
Code
cat("\nNo missing values:\n")
No missing values:
Code
cat("Rows with NA: ", nrow(airline_tidy[!complete.cases(airline_tidy),]), "\n")
ggplot(airline_overall, aes(x = airline, y = delay_pct, fill = airline)) +geom_col(width =0.5) +geom_text(aes(label =paste0(delay_pct, "%")), vjust =-0.5, size =5, fontface ="bold") +scale_fill_manual(values =c("alaska"="#0072B2", "am west"="#E69F00")) +labs(title ="Overall Delay Rate by Airline",x ="Airline", y ="Delay Rate (%)",caption ="Source: Numbersense by Kaiser Fung (2013)" ) +theme_minimal(base_size =12) +theme(legend.position ="none",plot.title =element_text(hjust =0.5, face ="bold", size =14) ) +ylim(0, 20)
Delay Rate by Airline and City
Code
ggplot(airline_by_city, aes(x = city, y = delay_pct, fill = airline)) +geom_col(position ="dodge") +geom_text(aes(label =paste0(delay_pct, "%")),position =position_dodge(width =0.9), vjust =-0.4, size =3) +scale_fill_manual(values =c("alaska"="#0072B2", "am west"="#E69F00")) +labs(title ="Delay Rate by Airline and City",x ="Destination City", y ="Delay Rate (%)",fill ="Airline",caption ="Source: Numbersense by Kaiser Fung (2013)" ) +theme_minimal(base_size =11) +theme(axis.text.x =element_text(angle =30, hjust =1),plot.title =element_text(hjust =0.5, face ="bold", size =14) )
Dataset 2: Hotel Guest Satisfaction Analysis
Data Source
This dataset represents a hypothetical hotel chain customer satisfaction survey capturing guest feedback across three brands and five regions in the United States. The data records the number of survey responses in each satisfaction category (Satisfied, Neutral, Dissatisfied) by brand and region.
Data Structure Before Tidying
Raw Wide-Format Data
The original dataset uses a wide format with one row per brand-satisfaction combination and one column per region:
Code
# Load raw hotel satisfaction datahotel_raw <-read.csv("hotel_satisfaction_raw.csv")# Display the raw structureknitr::kable(hotel_raw, caption ="Table 6: Raw Wide-Format Hotel Satisfaction Data")
cat("\nRows affected: 6 (rows with NA brand names)")
Rows affected: 6 (rows with NA brand names)
Step 2: Reshape from Wide to Long Format
Convert from wide (regions as columns) to long (regions as rows):
Code
# Pivot longer: convert region columns to rowshotel_long <- hotel_filled %>%pivot_longer(cols =c("Northeast", "Southeast", "Midwest", "Southwest", "West"),names_to ="region",values_to ="guest_count" )knitr::kable(head(hotel_long, 12), caption ="Table 7: Hotel Data After Pivoting to Long Format")
# Normalize data types and structurehotel_tidy <- hotel_long %>%mutate(guest_count =as.numeric(guest_count),brand =tolower(Brand),satisfaction_level =tolower(Satisfaction_Level),region =tolower(region) ) %>%select(brand, satisfaction_level, region, guest_count) %>%arrange(brand, satisfaction_level, region)knitr::kable(head(hotel_tidy, 15),caption ="Table 8: Hotel Data After Normalization")
Table 8: Hotel Data After Normalization
brand
satisfaction_level
region
guest_count
budget stay
dissatisfied
midwest
88
budget stay
dissatisfied
northeast
73
budget stay
dissatisfied
southeast
81
budget stay
dissatisfied
southwest
94
budget stay
dissatisfied
west
96
budget stay
neutral
midwest
95
budget stay
neutral
northeast
89
budget stay
neutral
southeast
98
budget stay
neutral
southwest
102
budget stay
neutral
west
107
budget stay
satisfied
midwest
212
budget stay
satisfied
northeast
198
budget stay
satisfied
southeast
187
budget stay
satisfied
southwest
189
budget stay
satisfied
west
203
Code
cat("Data types:\n")
Data types:
Code
print(sapply(hotel_tidy, class))
brand satisfaction_level region guest_count
"character" "character" "character" "numeric"
Step 4: Verify Data Integrity
Code
cat("Summary of tidy hotel data:\n")
Summary of tidy hotel data:
Code
print(summary(hotel_tidy))
brand satisfaction_level region guest_count
Length:45 Length:45 Length:45 Min. : 15.0
Class :character Class :character Class :character 1st Qu.: 51.0
Mode :character Mode :character Mode :character Median : 88.0
Mean :137.7
3rd Qu.:198.0
Max. :421.0
Code
cat("\nNo missing values:\n")
No missing values:
Code
cat("Rows with NA: ", nrow(hotel_tidy[!complete.cases(hotel_tidy),]), "\n")
This dataset represents transaction records from a hypothetical online retailer capturing product returns across four merchandise categories and five sales channels. The data records the count of returned and non-returned items by product category and sales channel.
Data Structure Before Tidying
Raw Wide-Format Data
The original dataset uses a wide format with one row per category-return status combination and one column per sales channel:
Code
# Load raw e-commerce returns dataecommerce_raw <-read.csv("ecommerce_returns_raw.csv")# Display the raw structureknitr::kable(ecommerce_raw, caption ="Table 11: Raw Wide-Format E-commerce Returns Data")
Fill missing category names (simulating merged cells):
Code
# Create a version with missing category namesecommerce_na <- ecommerce_rawecommerce_na$Product_Category[c(2, 4, 6, 8)] <-NAcat("Data WITH missing values:\n")
Data WITH missing values:
Code
print(ecommerce_na)
Product_Category Return_Status Website Mobile_App Amazon eBay Physical_Store
1 Electronics Not Returned 1245 892 1567 412 234
2 <NA> Returned 156 134 198 87 31
3 Clothing Not Returned 2103 1876 1654 623 512
4 <NA> Returned 287 312 289 145 98
5 Home & Garden Not Returned 834 456 789 234 167
6 <NA> Returned 92 78 134 45 38
7 Sports Equipment Not Returned 567 401 612 178 145
8 <NA> Returned 78 68 95 32 21
Code
# Fill down the missing category namesecommerce_filled <- ecommerce_na %>%fill(Product_Category, .direction ="down")cat("\nData AFTER filling missing values:\n")
Data AFTER filling missing values:
Code
print(ecommerce_filled)
Product_Category Return_Status Website Mobile_App Amazon eBay Physical_Store
1 Electronics Not Returned 1245 892 1567 412 234
2 Electronics Returned 156 134 198 87 31
3 Clothing Not Returned 2103 1876 1654 623 512
4 Clothing Returned 287 312 289 145 98
5 Home & Garden Not Returned 834 456 789 234 167
6 Home & Garden Returned 92 78 134 45 38
7 Sports Equipment Not Returned 567 401 612 178 145
8 Sports Equipment Returned 78 68 95 32 21
Code
cat("\nRows affected: 4 (rows with NA category names)")
Rows affected: 4 (rows with NA category names)
Step 2: Reshape from Wide to Long Format
Convert from wide (channels as columns) to long (channels as rows):
Code
# Pivot longer: convert channel columns to rowsecommerce_long <- ecommerce_filled %>%pivot_longer(cols =c("Website", "Mobile_App", "Amazon", "eBay", "Physical_Store"),names_to ="sales_channel",values_to ="transaction_count" )knitr::kable(head(ecommerce_long, 12), caption ="Table 12: E-commerce Data After Pivoting to Long Format")
Table 12: E-commerce Data After Pivoting to Long Format