Project 2 — Data Tidying and Transformation

Author

Nana Kwasi Danquah

Published

March 8, 2026


Overview

This document outlines the systematic approach to completing Project 2: Data Tidying and Transformation. The project involves selecting three independent wide-format datasets and transforming them into tidy formats suitable for analysis using R’s tidyr and dplyr packages.

Dataset 1: Airline Delays Analysis

Data Source

This dataset comes from Numbersense by Kaiser Fung (McGraw Hill, 2013) and examines arrival delay patterns for two airlines (Alaska Airlines and AM West) across five U.S. destination cities: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. The data captures the count of on-time and delayed flights for each airline-city combination.

Data Structure Before Tidying

Raw Wide-Format Data

The original dataset is presented in a wide format with one row per airline-status combination and one column per destination city:

Code
# Load raw airline delays data
airline_raw <- read.csv("airline_delays_raw.csv")

# Display the raw structure
knitr::kable(airline_raw, caption = "Table 1: Raw Wide-Format Airline Delay Data")
Table 1: Raw Wide-Format Airline Delay Data
Airline Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61
Code
# Report dimensions
cat("Dimensions: ", nrow(airline_raw), " rows × ", ncol(airline_raw), " columns\n")
Dimensions:  4  rows ×  7  columns
Code
cat("Column names: ", paste(names(airline_raw), collapse = ", "), "\n")
Column names:  Airline, Status, Los_Angeles, Phoenix, San_Diego, San_Francisco, Seattle 

Transformation Steps

Step 1: Handle Missing Values (Merged-Cell Pattern)

In the original table, airline names might appear only once per pair of rows (a common spreadsheet pattern). We’ll simulate and handle this:

Code
# Create a version with missing airline names to simulate the original format
airline_na <- airline_raw
airline_na$Airline[c(2, 4)] <- NA

cat("Data WITH missing values (simulating merged cells):\n")
Data WITH missing values (simulating merged cells):
Code
print(airline_na)
  Airline  Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2    <NA> delayed          62      12        20           102     305
3 AM WEST on time         694    4840       383           320     201
4    <NA> delayed         117     415        65           129      61
Code
# Fill down the missing airline names
airline_filled <- airline_na %>%
  fill(Airline, .direction = "down")

cat("\nData AFTER filling missing values:\n")

Data AFTER filling missing values:
Code
print(airline_filled)
  Airline  Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2  ALASKA delayed          62      12        20           102     305
3 AM WEST on time         694    4840       383           320     201
4 AM WEST delayed         117     415        65           129      61
Code
cat("\nRows affected: 2 (rows with NA airline names)")

Rows affected: 2 (rows with NA airline names)

Step 2: Reshape from Wide to Long Format

Convert from wide (cities as columns) to long (cities as rows):

Code
# Pivot longer: convert city columns to rows
airline_long <- airline_filled %>%
  pivot_longer(
    cols = c("Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle"),
    names_to = "city",
    values_to = "count"
  ) %>%
  mutate(city = gsub("_", " ", city))

knitr::kable(head(airline_long, 10), 
             caption = "Table 2: Airline Data After Pivoting to Long Format")
Table 2: Airline Data After Pivoting to Long Format
Airline Status city count
ALASKA on time Los Angeles 497
ALASKA on time Phoenix 221
ALASKA on time San Diego 212
ALASKA on time San Francisco 503
ALASKA on time Seattle 1841
ALASKA delayed Los Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San Diego 20
ALASKA delayed San Francisco 102
ALASKA delayed Seattle 305
Code
cat("Dimensions after pivot: ", nrow(airline_long), " rows × ", ncol(airline_long), " columns\n")
Dimensions after pivot:  20  rows ×  4  columns
Code
cat("Original: 4 rows × 7 cols → Tidy: ", nrow(airline_long), " rows × ", ncol(airline_long), " cols\n")
Original: 4 rows × 7 cols → Tidy:  20  rows ×  4  cols

Step 3: Normalize Variable Structure

Convert data types and standardize values:

Code
# Normalize data types and structure
airline_tidy <- airline_long %>%
  mutate(
    count = as.numeric(count),
    airline = tolower(Airline),
    status = tolower(Status),
    city = tolower(city)
  ) %>%
  select(airline, status, city, count) %>%
  arrange(airline, status, city)

knitr::kable(head(airline_tidy, 12),
             caption = "Table 3: Airline Data After Normalization (snake_case)")
Table 3: Airline Data After Normalization (snake_case)
airline status city count
alaska delayed los angeles 62
alaska delayed phoenix 12
alaska delayed san diego 20
alaska delayed san francisco 102
alaska delayed seattle 305
alaska on time los angeles 497
alaska on time phoenix 221
alaska on time san diego 212
alaska on time san francisco 503
alaska on time seattle 1841
am west delayed los angeles 117
am west delayed phoenix 415
Code
cat("Data types:\n")
Data types:
Code
print(sapply(airline_tidy, class))
    airline      status        city       count 
"character" "character" "character"   "numeric" 

Step 4: Verify Data Integrity

Code
cat("Summary of tidy airline data:\n")
Summary of tidy airline data:
Code
print(summary(airline_tidy))
   airline             status              city               count        
 Length:20          Length:20          Length:20          Min.   :  12.00  
 Class :character   Class :character   Class :character   1st Qu.:  92.75  
 Mode  :character   Mode  :character   Mode  :character   Median : 216.50  
                                                          Mean   : 550.00  
                                                          3rd Qu.: 435.50  
                                                          Max.   :4840.00  
Code
cat("\nNo missing values:\n")

No missing values:
Code
cat("Rows with NA: ", nrow(airline_tidy[!complete.cases(airline_tidy),]), "\n")
Rows with NA:  0 
Code
cat("\nUnique values:\n")

Unique values:
Code
cat("Airlines: ", paste(unique(airline_tidy$airline), collapse = ", "), "\n")
Airlines:  alaska, am west 
Code
cat("Statuses: ", paste(unique(airline_tidy$status), collapse = ", "), "\n")
Statuses:  delayed, on time 
Code
cat("Cities: ", paste(unique(airline_tidy$city), collapse = ", "), "\n")
Cities:  los angeles, phoenix, san diego, san francisco, seattle 
Code
cat("\nTotal flights: ", sum(airline_tidy$count), "\n")

Total flights:  11000 

Analytical Methods

Overall Delay Rate Comparison

I calculate the percentage of delayed flights for each airline across all cities combined:

Code
airline_overall <- airline_tidy %>%
  group_by(airline, status) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  group_by(airline) %>%
  mutate(
    grand_total = sum(total),
    pct = round(100 * total / grand_total, 2)
  ) %>%
  filter(status == "delayed") %>%
  select(airline, delayed_flights = total, total_flights = grand_total, delay_pct = pct)

knitr::kable(airline_overall,
             caption = "Table 4: Overall Delay Rate by Airline")
Table 4: Overall Delay Rate by Airline
airline delayed_flights total_flights delay_pct
alaska 501 3775 13.27
am west 787 7225 10.89

City-by-City Delay Rate Analysis

I break down delay rates for each airline within each city:

Code
airline_by_city <- airline_tidy %>%
  group_by(airline, city) %>%
  mutate(city_total = sum(count)) %>%
  filter(status == "delayed") %>%
  mutate(delay_pct = round(100 * count / city_total, 2)) %>%
  select(airline, city, delayed = count, city_total, delay_pct) %>%
  arrange(city, airline)

knitr::kable(airline_by_city,
             caption = "Table 5: Delay Rate by Airline and City")
Table 5: Delay Rate by Airline and City
airline city delayed city_total delay_pct
alaska los angeles 62 559 11.09
am west los angeles 117 811 14.43
alaska phoenix 12 233 5.15
am west phoenix 415 5255 7.90
alaska san diego 20 232 8.62
am west san diego 65 448 14.51
alaska san francisco 102 605 16.86
am west san francisco 129 449 28.73
alaska seattle 305 2146 14.21
am west seattle 61 262 23.28

Visualizations

Overall Delay Rate by Airline

Code
ggplot(airline_overall, aes(x = airline, y = delay_pct, fill = airline)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = paste0(delay_pct, "%")), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("alaska" = "#0072B2", "am west" = "#E69F00")) +
  labs(
    title = "Overall Delay Rate by Airline",
    x = "Airline", 
    y = "Delay Rate (%)",
    caption = "Source: Numbersense by Kaiser Fung (2013)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  ) +
  ylim(0, 20)

Delay Rate by Airline and City

Code
ggplot(airline_by_city, aes(x = city, y = delay_pct, fill = airline)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(delay_pct, "%")),
            position = position_dodge(width = 0.9), vjust = -0.4, size = 3) +
  scale_fill_manual(values = c("alaska" = "#0072B2", "am west" = "#E69F00")) +
  labs(
    title = "Delay Rate by Airline and City",
    x = "Destination City", 
    y = "Delay Rate (%)",
    fill = "Airline",
    caption = "Source: Numbersense by Kaiser Fung (2013)"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  )

Dataset 2: Hotel Guest Satisfaction Analysis

Data Source

This dataset represents a hypothetical hotel chain customer satisfaction survey capturing guest feedback across three brands and five regions in the United States. The data records the number of survey responses in each satisfaction category (Satisfied, Neutral, Dissatisfied) by brand and region.

Data Structure Before Tidying

Raw Wide-Format Data

The original dataset uses a wide format with one row per brand-satisfaction combination and one column per region:

Code
# Load raw hotel satisfaction data
hotel_raw <- read.csv("hotel_satisfaction_raw.csv")

# Display the raw structure
knitr::kable(hotel_raw, caption = "Table 6: Raw Wide-Format Hotel Satisfaction Data")
Table 6: Raw Wide-Format Hotel Satisfaction Data
Brand Satisfaction_Level Northeast Southeast Midwest Southwest West
Luxury Inn Satisfied 285 312 198 227 356
Luxury Inn Neutral 56 48 42 51 73
Luxury Inn Dissatisfied 19 28 15 22 31
Premier Hotels Satisfied 412 398 367 345 421
Premier Hotels Neutral 78 82 71 69 85
Premier Hotels Dissatisfied 35 42 29 38 51
Budget Stay Satisfied 198 187 212 189 203
Budget Stay Neutral 89 98 95 102 107
Budget Stay Dissatisfied 73 81 88 94 96
Code
# Report dimensions
cat("Dimensions: ", nrow(hotel_raw), " rows × ", ncol(hotel_raw), " columns\n")
Dimensions:  9  rows ×  7  columns
Code
cat("Column names: ", paste(names(hotel_raw), collapse = ", "), "\n")
Column names:  Brand, Satisfaction_Level, Northeast, Southeast, Midwest, Southwest, West 

Transformation Steps

Step 1: Handle Missing Values

Fill missing brand names (simulating merged cells in source data):

Code
# Create a version with missing brand names
hotel_na <- hotel_raw
hotel_na$Brand[c(2, 3, 5, 6, 8, 9)] <- NA

cat("Data WITH missing values:\n")
Data WITH missing values:
Code
print(hotel_na)
           Brand Satisfaction_Level Northeast Southeast Midwest Southwest West
1     Luxury Inn          Satisfied       285       312     198       227  356
2           <NA>            Neutral        56        48      42        51   73
3           <NA>       Dissatisfied        19        28      15        22   31
4 Premier Hotels          Satisfied       412       398     367       345  421
5           <NA>            Neutral        78        82      71        69   85
6           <NA>       Dissatisfied        35        42      29        38   51
7    Budget Stay          Satisfied       198       187     212       189  203
8           <NA>            Neutral        89        98      95       102  107
9           <NA>       Dissatisfied        73        81      88        94   96
Code
# Fill down the missing brand names
hotel_filled <- hotel_na %>%
  fill(Brand, .direction = "down")

cat("\nData AFTER filling missing values:\n")

Data AFTER filling missing values:
Code
print(hotel_filled)
           Brand Satisfaction_Level Northeast Southeast Midwest Southwest West
1     Luxury Inn          Satisfied       285       312     198       227  356
2     Luxury Inn            Neutral        56        48      42        51   73
3     Luxury Inn       Dissatisfied        19        28      15        22   31
4 Premier Hotels          Satisfied       412       398     367       345  421
5 Premier Hotels            Neutral        78        82      71        69   85
6 Premier Hotels       Dissatisfied        35        42      29        38   51
7    Budget Stay          Satisfied       198       187     212       189  203
8    Budget Stay            Neutral        89        98      95       102  107
9    Budget Stay       Dissatisfied        73        81      88        94   96
Code
cat("\nRows affected: 6 (rows with NA brand names)")

Rows affected: 6 (rows with NA brand names)

Step 2: Reshape from Wide to Long Format

Convert from wide (regions as columns) to long (regions as rows):

Code
# Pivot longer: convert region columns to rows
hotel_long <- hotel_filled %>%
  pivot_longer(
    cols = c("Northeast", "Southeast", "Midwest", "Southwest", "West"),
    names_to = "region",
    values_to = "guest_count"
  )

knitr::kable(head(hotel_long, 12), 
             caption = "Table 7: Hotel Data After Pivoting to Long Format")
Table 7: Hotel Data After Pivoting to Long Format
Brand Satisfaction_Level region guest_count
Luxury Inn Satisfied Northeast 285
Luxury Inn Satisfied Southeast 312
Luxury Inn Satisfied Midwest 198
Luxury Inn Satisfied Southwest 227
Luxury Inn Satisfied West 356
Luxury Inn Neutral Northeast 56
Luxury Inn Neutral Southeast 48
Luxury Inn Neutral Midwest 42
Luxury Inn Neutral Southwest 51
Luxury Inn Neutral West 73
Luxury Inn Dissatisfied Northeast 19
Luxury Inn Dissatisfied Southeast 28
Code
cat("Dimensions after pivot: ", nrow(hotel_long), " rows × ", ncol(hotel_long), " columns\n")
Dimensions after pivot:  45  rows ×  4  columns
Code
cat("Original: 9 rows × 7 cols → Tidy: ", nrow(hotel_long), " rows × ", ncol(hotel_long), " cols\n")
Original: 9 rows × 7 cols → Tidy:  45  rows ×  4  cols

Step 3: Normalize Variable Structure

Convert data types and standardize naming:

Code
# Normalize data types and structure
hotel_tidy <- hotel_long %>%
  mutate(
    guest_count = as.numeric(guest_count),
    brand = tolower(Brand),
    satisfaction_level = tolower(Satisfaction_Level),
    region = tolower(region)
  ) %>%
  select(brand, satisfaction_level, region, guest_count) %>%
  arrange(brand, satisfaction_level, region)

knitr::kable(head(hotel_tidy, 15),
             caption = "Table 8: Hotel Data After Normalization")
Table 8: Hotel Data After Normalization
brand satisfaction_level region guest_count
budget stay dissatisfied midwest 88
budget stay dissatisfied northeast 73
budget stay dissatisfied southeast 81
budget stay dissatisfied southwest 94
budget stay dissatisfied west 96
budget stay neutral midwest 95
budget stay neutral northeast 89
budget stay neutral southeast 98
budget stay neutral southwest 102
budget stay neutral west 107
budget stay satisfied midwest 212
budget stay satisfied northeast 198
budget stay satisfied southeast 187
budget stay satisfied southwest 189
budget stay satisfied west 203
Code
cat("Data types:\n")
Data types:
Code
print(sapply(hotel_tidy, class))
             brand satisfaction_level             region        guest_count 
       "character"        "character"        "character"          "numeric" 

Step 4: Verify Data Integrity

Code
cat("Summary of tidy hotel data:\n")
Summary of tidy hotel data:
Code
print(summary(hotel_tidy))
    brand           satisfaction_level    region           guest_count   
 Length:45          Length:45          Length:45          Min.   : 15.0  
 Class :character   Class :character   Class :character   1st Qu.: 51.0  
 Mode  :character   Mode  :character   Mode  :character   Median : 88.0  
                                                          Mean   :137.7  
                                                          3rd Qu.:198.0  
                                                          Max.   :421.0  
Code
cat("\nNo missing values:\n")

No missing values:
Code
cat("Rows with NA: ", nrow(hotel_tidy[!complete.cases(hotel_tidy),]), "\n")
Rows with NA:  0 
Code
cat("\nUnique values:\n")

Unique values:
Code
cat("Brands: ", paste(unique(hotel_tidy$brand), collapse = ", "), "\n")
Brands:  budget stay, luxury inn, premier hotels 
Code
cat("Satisfaction levels: ", paste(unique(hotel_tidy$satisfaction_level), collapse = ", "), "\n")
Satisfaction levels:  dissatisfied, neutral, satisfied 
Code
cat("Regions: ", paste(unique(hotel_tidy$region), collapse = ", "), "\n")
Regions:  midwest, northeast, southeast, southwest, west 
Code
cat("\nTotal survey responses: ", sum(hotel_tidy$guest_count), "\n")

Total survey responses:  6198 

Analytical Methods

Overall Satisfaction Rate by Brand

I calculate the percentage of satisfied responses for each hotel brand across all regions:

Code
hotel_overall <- hotel_tidy %>%
  group_by(brand, satisfaction_level) %>%
  summarise(total = sum(guest_count), .groups = "drop") %>%
  group_by(brand) %>%
  mutate(
    grand_total = sum(total),
    pct = round(100 * total / grand_total, 2)
  ) %>%
  filter(satisfaction_level == "satisfied") %>%
  select(brand, satisfied_count = total, total_responses = grand_total, satisfaction_pct = pct)

knitr::kable(hotel_overall,
             caption = "Table 9: Overall Satisfaction Rate by Hotel Brand")
Table 9: Overall Satisfaction Rate by Hotel Brand
brand satisfied_count total_responses satisfaction_pct
budget stay 989 1912 51.73
luxury inn 1378 1763 78.16
premier hotels 1943 2523 77.01

Satisfaction Rate by Brand and Region

Code
hotel_by_region <- hotel_tidy %>%
  group_by(brand, region) %>%
  mutate(region_total = sum(guest_count)) %>%
  filter(satisfaction_level == "satisfied") %>%
  mutate(satisfaction_pct = round(100 * guest_count / region_total, 2)) %>%
  select(brand, region, satisfied = guest_count, region_total, satisfaction_pct) %>%
  arrange(region, brand)

knitr::kable(hotel_by_region,
             caption = "Table 10: Satisfaction Rate by Brand and Region")
Table 10: Satisfaction Rate by Brand and Region
brand region satisfied region_total satisfaction_pct
budget stay midwest 212 395 53.67
luxury inn midwest 198 255 77.65
premier hotels midwest 367 467 78.59
budget stay northeast 198 360 55.00
luxury inn northeast 285 360 79.17
premier hotels northeast 412 525 78.48
budget stay southeast 187 366 51.09
luxury inn southeast 312 388 80.41
premier hotels southeast 398 522 76.25
budget stay southwest 189 385 49.09
luxury inn southwest 227 300 75.67
premier hotels southwest 345 452 76.33
budget stay west 203 406 50.00
luxury inn west 356 460 77.39
premier hotels west 421 557 75.58

Visualizations

Overall Satisfaction Rate by Brand

Code
ggplot(hotel_overall, aes(x = brand, y = satisfaction_pct, fill = brand)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = paste0(satisfaction_pct, "%")), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("budget stay" = "#E41A1C", "luxury inn" = "#377EB8", "premier hotels" = "#4DAF4A")) +
  labs(
    title = "Overall Customer Satisfaction Rate by Hotel Brand",
    x = "Hotel Brand", 
    y = "Satisfaction Rate (%)",
    caption = "Source: Hotel chain satisfaction survey"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  ) +
  ylim(0, 100)

Satisfaction Rate by Brand and Region

Code
ggplot(hotel_by_region, aes(x = region, y = satisfaction_pct, fill = brand)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(satisfaction_pct, "%")),
            position = position_dodge(width = 0.9), vjust = -0.4, size = 3) +
  scale_fill_manual(values = c("budget stay" = "#E41A1C", "luxury inn" = "#377EB8", "premier hotels" = "#4DAF4A")) +
  labs(
    title = "Guest Satisfaction Rate by Hotel Brand and Region",
    x = "Region", 
    y = "Satisfaction Rate (%)",
    fill = "Brand",
    caption = "Source: Hotel chain satisfaction survey"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  )

Dataset 3: E-commerce Product Return Rates Analysis

Data Source

This dataset represents transaction records from a hypothetical online retailer capturing product returns across four merchandise categories and five sales channels. The data records the count of returned and non-returned items by product category and sales channel.

Data Structure Before Tidying

Raw Wide-Format Data

The original dataset uses a wide format with one row per category-return status combination and one column per sales channel:

Code
# Load raw e-commerce returns data
ecommerce_raw <- read.csv("ecommerce_returns_raw.csv")

# Display the raw structure
knitr::kable(ecommerce_raw, caption = "Table 11: Raw Wide-Format E-commerce Returns Data")
Table 11: Raw Wide-Format E-commerce Returns Data
Product_Category Return_Status Website Mobile_App Amazon eBay Physical_Store
Electronics Not Returned 1245 892 1567 412 234
Electronics Returned 156 134 198 87 31
Clothing Not Returned 2103 1876 1654 623 512
Clothing Returned 287 312 289 145 98
Home & Garden Not Returned 834 456 789 234 167
Home & Garden Returned 92 78 134 45 38
Sports Equipment Not Returned 567 401 612 178 145
Sports Equipment Returned 78 68 95 32 21
Code
# Report dimensions
cat("Dimensions: ", nrow(ecommerce_raw), " rows × ", ncol(ecommerce_raw), " columns\n")
Dimensions:  8  rows ×  7  columns
Code
cat("Column names: ", paste(names(ecommerce_raw), collapse = ", "), "\n")
Column names:  Product_Category, Return_Status, Website, Mobile_App, Amazon, eBay, Physical_Store 

Transformation Steps

Step 1: Handle Missing Values

Fill missing category names (simulating merged cells):

Code
# Create a version with missing category names
ecommerce_na <- ecommerce_raw
ecommerce_na$Product_Category[c(2, 4, 6, 8)] <- NA

cat("Data WITH missing values:\n")
Data WITH missing values:
Code
print(ecommerce_na)
  Product_Category Return_Status Website Mobile_App Amazon eBay Physical_Store
1      Electronics  Not Returned    1245        892   1567  412            234
2             <NA>      Returned     156        134    198   87             31
3         Clothing  Not Returned    2103       1876   1654  623            512
4             <NA>      Returned     287        312    289  145             98
5    Home & Garden  Not Returned     834        456    789  234            167
6             <NA>      Returned      92         78    134   45             38
7 Sports Equipment  Not Returned     567        401    612  178            145
8             <NA>      Returned      78         68     95   32             21
Code
# Fill down the missing category names
ecommerce_filled <- ecommerce_na %>%
  fill(Product_Category, .direction = "down")

cat("\nData AFTER filling missing values:\n")

Data AFTER filling missing values:
Code
print(ecommerce_filled)
  Product_Category Return_Status Website Mobile_App Amazon eBay Physical_Store
1      Electronics  Not Returned    1245        892   1567  412            234
2      Electronics      Returned     156        134    198   87             31
3         Clothing  Not Returned    2103       1876   1654  623            512
4         Clothing      Returned     287        312    289  145             98
5    Home & Garden  Not Returned     834        456    789  234            167
6    Home & Garden      Returned      92         78    134   45             38
7 Sports Equipment  Not Returned     567        401    612  178            145
8 Sports Equipment      Returned      78         68     95   32             21
Code
cat("\nRows affected: 4 (rows with NA category names)")

Rows affected: 4 (rows with NA category names)

Step 2: Reshape from Wide to Long Format

Convert from wide (channels as columns) to long (channels as rows):

Code
# Pivot longer: convert channel columns to rows
ecommerce_long <- ecommerce_filled %>%
  pivot_longer(
    cols = c("Website", "Mobile_App", "Amazon", "eBay", "Physical_Store"),
    names_to = "sales_channel",
    values_to = "transaction_count"
  )

knitr::kable(head(ecommerce_long, 12), 
             caption = "Table 12: E-commerce Data After Pivoting to Long Format")
Table 12: E-commerce Data After Pivoting to Long Format
Product_Category Return_Status sales_channel transaction_count
Electronics Not Returned Website 1245
Electronics Not Returned Mobile_App 892
Electronics Not Returned Amazon 1567
Electronics Not Returned eBay 412
Electronics Not Returned Physical_Store 234
Electronics Returned Website 156
Electronics Returned Mobile_App 134
Electronics Returned Amazon 198
Electronics Returned eBay 87
Electronics Returned Physical_Store 31
Clothing Not Returned Website 2103
Clothing Not Returned Mobile_App 1876
Code
cat("Dimensions after pivot: ", nrow(ecommerce_long), " rows × ", ncol(ecommerce_long), " columns\n")
Dimensions after pivot:  40  rows ×  4  columns
Code
cat("Original: 8 rows × 7 cols → Tidy: ", nrow(ecommerce_long), " rows × ", ncol(ecommerce_long), " cols\n")
Original: 8 rows × 7 cols → Tidy:  40  rows ×  4  cols

Step 3: Normalize Variable Structure

Convert data types and standardize naming:

Code
# Normalize data types and structure
ecommerce_tidy <- ecommerce_long %>%
  mutate(
    transaction_count = as.numeric(transaction_count),
    product_category = tolower(Product_Category),
    return_status = tolower(Return_Status),
    sales_channel = tolower(gsub("_", " ", sales_channel))
  ) %>%
  select(product_category, return_status, sales_channel, transaction_count) %>%
  arrange(product_category, return_status, sales_channel)

knitr::kable(head(ecommerce_tidy, 15),
             caption = "Table 13: E-commerce Data After Normalization")
Table 13: E-commerce Data After Normalization
product_category return_status sales_channel transaction_count
clothing not returned amazon 1654
clothing not returned ebay 623
clothing not returned mobile app 1876
clothing not returned physical store 512
clothing not returned website 2103
clothing returned amazon 289
clothing returned ebay 145
clothing returned mobile app 312
clothing returned physical store 98
clothing returned website 287
electronics not returned amazon 1567
electronics not returned ebay 412
electronics not returned mobile app 892
electronics not returned physical store 234
electronics not returned website 1245
Code
cat("Data types:\n")
Data types:
Code
print(sapply(ecommerce_tidy, class))
 product_category     return_status     sales_channel transaction_count 
      "character"       "character"       "character"         "numeric" 

Step 4: Verify Data Integrity

Code
cat("Summary of tidy e-commerce data:\n")
Summary of tidy e-commerce data:
Code
print(summary(ecommerce_tidy))
 product_category   return_status      sales_channel      transaction_count
 Length:40          Length:40          Length:40          Min.   :  21.00  
 Class :character   Class :character   Class :character   1st Qu.:  94.25  
 Mode  :character   Mode  :character   Mode  :character   Median : 216.00  
                                                          Mean   : 447.98  
                                                          3rd Qu.: 578.25  
                                                          Max.   :2103.00  
Code
cat("\nNo missing values:\n")

No missing values:
Code
cat("Rows with NA: ", nrow(ecommerce_tidy[!complete.cases(ecommerce_tidy),]), "\n")
Rows with NA:  0 
Code
cat("\nUnique values:\n")

Unique values:
Code
cat("Product categories: ", paste(unique(ecommerce_tidy$product_category), collapse = ", "), "\n")
Product categories:  clothing, electronics, home & garden, sports equipment 
Code
cat("Return statuses: ", paste(unique(ecommerce_tidy$return_status), collapse = ", "), "\n")
Return statuses:  not returned, returned 
Code
cat("Sales channels: ", paste(unique(ecommerce_tidy$sales_channel), collapse = ", "), "\n")
Sales channels:  amazon, ebay, mobile app, physical store, website 
Code
cat("\nTotal transactions: ", sum(ecommerce_tidy$transaction_count), "\n")

Total transactions:  17919 

Analytical Methods

Overall Return Rate by Product Category

I calculate the percentage of returned items for each product category across all sales channels:

Code
ecommerce_overall <- ecommerce_tidy %>%
  group_by(product_category, return_status) %>%
  summarise(total = sum(transaction_count), .groups = "drop") %>%
  group_by(product_category) %>%
  mutate(
    grand_total = sum(total),
    pct = round(100 * total / grand_total, 2)
  ) %>%
  filter(return_status == "returned") %>%
  select(product_category, returned_count = total, total_transactions = grand_total, return_pct = pct)

knitr::kable(ecommerce_overall,
             caption = "Table 14: Overall Return Rate by Product Category")
Table 14: Overall Return Rate by Product Category
product_category returned_count total_transactions return_pct
clothing 1131 7899 14.32
electronics 606 4956 12.23
home & garden 387 2867 13.50
sports equipment 294 2197 13.38

Return Rate by Product Category and Sales Channel

Code
ecommerce_by_channel <- ecommerce_tidy %>%
  group_by(product_category, sales_channel) %>%
  mutate(channel_total = sum(transaction_count)) %>%
  filter(return_status == "returned") %>%
  mutate(return_pct = round(100 * transaction_count / channel_total, 2)) %>%
  select(product_category, sales_channel, returned = transaction_count, channel_total, return_pct) %>%
  arrange(product_category, sales_channel)

knitr::kable(ecommerce_by_channel,
             caption = "Table 15: Return Rate by Product Category and Sales Channel")
Table 15: Return Rate by Product Category and Sales Channel
product_category sales_channel returned channel_total return_pct
clothing amazon 289 1943 14.87
clothing ebay 145 768 18.88
clothing mobile app 312 2188 14.26
clothing physical store 98 610 16.07
clothing website 287 2390 12.01
electronics amazon 198 1765 11.22
electronics ebay 87 499 17.43
electronics mobile app 134 1026 13.06
electronics physical store 31 265 11.70
electronics website 156 1401 11.13
home & garden amazon 134 923 14.52
home & garden ebay 45 279 16.13
home & garden mobile app 78 534 14.61
home & garden physical store 38 205 18.54
home & garden website 92 926 9.94
sports equipment amazon 95 707 13.44
sports equipment ebay 32 210 15.24
sports equipment mobile app 68 469 14.50
sports equipment physical store 21 166 12.65
sports equipment website 78 645 12.09

Visualizations

Overall Return Rate by Product Category

Code
ggplot(ecommerce_overall, aes(x = product_category, y = return_pct, fill = product_category)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(return_pct, "%")), vjust = -0.5, size = 4, fontface = "bold") +
  scale_fill_manual(values = c(
    "clothing" = "#FF7F00",
    "electronics" = "#377EB8",
    "home & garden" = "#4DAF4A",
    "sports equipment" = "#E41A1C"
  )) +
  labs(
    title = "Overall Return Rate by Product Category",
    x = "Product Category", 
    y = "Return Rate (%)",
    caption = "Source: E-commerce transaction data"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  ) +
  ylim(0, 15)

Return Rate by Product Category and Sales Channel

Code
ggplot(ecommerce_by_channel, aes(x = sales_channel, y = return_pct, fill = product_category)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(return_pct, "%")),
            position = position_dodge(width = 0.9), vjust = -0.3, size = 2.5) +
  scale_fill_manual(values = c(
    "clothing" = "#FF7F00",
    "electronics" = "#377EB8",
    "home & garden" = "#4DAF4A",
    "sports equipment" = "#E41A1C"
  )) +
  labs(
    title = "Return Rate by Product Category and Sales Channel",
    x = "Sales Channel", 
    y = "Return Rate (%)",
    fill = "Category",
    caption = "Source: E-commerce transaction data"
  ) +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  )