Project 2 — Data Tidying and Transformation

Author

Nana Kwasi Danquah

Published

March 8, 2026

Overview

This document outlines the systematic approach to completing Project 2: Data Tidying and Transformation. The project involves selecting three independent wide-format datasets and transforming them into tidy formats suitable for analysis using R’s tidyr and dplyr packages.

Dataset 1: Airline Delays Analysis

Data Source

This dataset comes from Numbersense by Kaiser Fung (McGraw Hill, 2013) and examines arrival delay patterns for two airlines (Alaska Airlines and AM West) across five U.S. destination cities: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. The data captures the count of on-time and delayed flights for each airline-city combination.

Data Structure Before Tidying

Raw Wide-Format Data

The original dataset is presented in a wide format with one row per airline-status combination and one column per destination city:

Code

# Load raw airline delays data
airline_raw <- read.csv("airline_delays_raw.csv")

# Display the raw structure
knitr::kable(airline_raw, caption = "Table 1: Raw Wide-Format Airline Delay Data")

Table 1: Raw Wide-Format Airline Delay Data
Airline	Status	Los_Angeles	Phoenix	San_Diego	San_Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Code

# Report dimensions
cat("Dimensions: ", nrow(airline_raw), " rows × ", ncol(airline_raw), " columns\n")

Dimensions:  4  rows ×  7  columns

Code

cat("Column names: ", paste(names(airline_raw), collapse = ", "), "\n")

Column names:  Airline, Status, Los_Angeles, Phoenix, San_Diego, San_Francisco, Seattle

Transformation Steps

Step 1: Handle Missing Values (Merged-Cell Pattern)

In the original table, airline names might appear only once per pair of rows (a common spreadsheet pattern). We’ll simulate and handle this:

Code

# Create a version with missing airline names to simulate the original format
airline_na <- airline_raw
airline_na$Airline[c(2, 4)] <- NA

cat("Data WITH missing values (simulating merged cells):\n")

Data WITH missing values (simulating merged cells):

Code

print(airline_na)

  Airline  Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2    <NA> delayed          62      12        20           102     305
3 AM WEST on time         694    4840       383           320     201
4    <NA> delayed         117     415        65           129      61

Code

# Fill down the missing airline names
airline_filled <- airline_na %>%
  fill(Airline, .direction = "down")

cat("\nData AFTER filling missing values:\n")


Data AFTER filling missing values:

Code

print(airline_filled)

  Airline  Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2  ALASKA delayed          62      12        20           102     305
3 AM WEST on time         694    4840       383           320     201
4 AM WEST delayed         117     415        65           129      61

Code

cat("\nRows affected: 2 (rows with NA airline names)")


Rows affected: 2 (rows with NA airline names)

Step 2: Reshape from Wide to Long Format

Convert from wide (cities as columns) to long (cities as rows):

Code

# Pivot longer: convert city columns to rows
airline_long <- airline_filled %>%
  pivot_longer(
    cols = c("Los_Angeles", "Phoenix", "San_Diego", "San_Francisco", "Seattle"),
    names_to = "city",
    values_to = "count"
  ) %>%
  mutate(city = gsub("_", " ", city))

knitr::kable(head(airline_long, 10), 
             caption = "Table 2: Airline Data After Pivoting to Long Format")

Table 2: Airline Data After Pivoting to Long Format
Airline	Status	city	count
ALASKA	on time	Los Angeles	497
ALASKA	on time	Phoenix	221
ALASKA	on time	San Diego	212
ALASKA	on time	San Francisco	503
ALASKA	on time	Seattle	1841
ALASKA	delayed	Los Angeles	62
ALASKA	delayed	Phoenix	12
ALASKA	delayed	San Diego	20
ALASKA	delayed	San Francisco	102
ALASKA	delayed	Seattle	305

Code

cat("Dimensions after pivot: ", nrow(airline_long), " rows × ", ncol(airline_long), " columns\n")

Dimensions after pivot:  20  rows ×  4  columns

Code

cat("Original: 4 rows × 7 cols → Tidy: ", nrow(airline_long), " rows × ", ncol(airline_long), " cols\n")

Original: 4 rows × 7 cols → Tidy:  20  rows ×  4  cols

Step 3: Normalize Variable Structure

Convert data types and standardize values:

Code

# Normalize data types and structure
airline_tidy <- airline_long %>%
  mutate(
    count = as.numeric(count),
    airline = tolower(Airline),
    status = tolower(Status),
    city = tolower(city)
  ) %>%
  select(airline, status, city, count) %>%
  arrange(airline, status, city)

knitr::kable(head(airline_tidy, 12),
             caption = "Table 3: Airline Data After Normalization (snake_case)")

Table 3: Airline Data After Normalization (snake_case)
airline	status	city	count
alaska	delayed	los angeles	62
alaska	delayed	phoenix	12
alaska	delayed	san diego	20
alaska	delayed	san francisco	102
alaska	delayed	seattle	305
alaska	on time	los angeles	497
alaska	on time	phoenix	221
alaska	on time	san diego	212
alaska	on time	san francisco	503
alaska	on time	seattle	1841
am west	delayed	los angeles	117
am west	delayed	phoenix	415

Code

cat("Data types:\n")

Data types:

Code

print(sapply(airline_tidy, class))

    airline      status        city       count 
"character" "character" "character"   "numeric"

Step 4: Verify Data Integrity

Code

cat("Summary of tidy airline data:\n")

Summary of tidy airline data:

Code

print(summary(airline_tidy))

   airline             status              city               count        
 Length:20          Length:20          Length:20          Min.   :  12.00  
 Class :character   Class :character   Class :character   1st Qu.:  92.75  
 Mode  :character   Mode  :character   Mode  :character   Median : 216.50  
                                                          Mean   : 550.00  
                                                          3rd Qu.: 435.50  
                                                          Max.   :4840.00

Code

cat("\nNo missing values:\n")


No missing values:

Code

cat("Rows with NA: ", nrow(airline_tidy[!complete.cases(airline_tidy),]), "\n")

Rows with NA:  0

Code

cat("\nUnique values:\n")


Unique values:

Code

cat("Airlines: ", paste(unique(airline_tidy$airline), collapse = ", "), "\n")

Airlines:  alaska, am west

Code

cat("Statuses: ", paste(unique(airline_tidy$status), collapse = ", "), "\n")

Statuses:  delayed, on time

Code

cat("Cities: ", paste(unique(airline_tidy$city), collapse = ", "), "\n")

Cities:  los angeles, phoenix, san diego, san francisco, seattle

Code

cat("\nTotal flights: ", sum(airline_tidy$count), "\n")


Total flights:  11000

Analytical Methods

Overall Delay Rate Comparison

I calculate the percentage of delayed flights for each airline across all cities combined:

Code

airline_overall <- airline_tidy %>%
  group_by(airline, status) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  group_by(airline) %>%
  mutate(
    grand_total = sum(total),
    pct = round(100 * total / grand_total, 2)
  ) %>%
  filter(status == "delayed") %>%
  select(airline, delayed_flights = total, total_flights = grand_total, delay_pct = pct)

knitr::kable(airline_overall,
             caption = "Table 4: Overall Delay Rate by Airline")

Table 4: Overall Delay Rate by Airline
airline	delayed_flights	total_flights	delay_pct
alaska	501	3775	13.27
am west	787	7225	10.89

City-by-City Delay Rate Analysis

I break down delay rates for each airline within each city:

Code

airline_by_city <- airline_tidy %>%
  group_by(airline, city) %>%
  mutate(city_total = sum(count)) %>%
  filter(status == "delayed") %>%
  mutate(delay_pct = round(100 * count / city_total, 2)) %>%
  select(airline, city, delayed = count, city_total, delay_pct) %>%
  arrange(city, airline)

knitr::kable(airline_by_city,
             caption = "Table 5: Delay Rate by Airline and City")

Table 5: Delay Rate by Airline and City
airline	city	delayed	city_total	delay_pct
alaska	los angeles	62	559	11.09
am west	los angeles	117	811	14.43
alaska	phoenix	12	233	5.15
am west	phoenix	415	5255	7.90
alaska	san diego	20	232	8.62
am west	san diego	65	448	14.51
alaska	san francisco	102	605	16.86
am west	san francisco	129	449	28.73
alaska	seattle	305	2146	14.21
am west	seattle	61	262	23.28

Visualizations

Overall Delay Rate by Airline

Code

ggplot(airline_overall, aes(x = airline, y = delay_pct, fill = airline)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = paste0(delay_pct, "%")), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("alaska" = "#0072B2", "am west" = "#E69F00")) +
  labs(
    title = "Overall Delay Rate by Airline",
    x = "Airline", 
    y = "Delay Rate (%)",
    caption = "Source: Numbersense by Kaiser Fung (2013)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  ) +
  ylim(0, 20)

Delay Rate by Airline and City

Code

ggplot(airline_by_city, aes(x = city, y = delay_pct, fill = airline)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(delay_pct, "%")),
            position = position_dodge(width = 0.9), vjust = -0.4, size = 3) +
  scale_fill_manual(values = c("alaska" = "#0072B2", "am west" = "#E69F00")) +
  labs(
    title = "Delay Rate by Airline and City",
    x = "Destination City", 
    y = "Delay Rate (%)",
    fill = "Airline",
    caption = "Source: Numbersense by Kaiser Fung (2013)"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  )

Dataset 2: Hotel Guest Satisfaction Analysis

Data Source

This dataset represents a hypothetical hotel chain customer satisfaction survey capturing guest feedback across three brands and five regions in the United States. The data records the number of survey responses in each satisfaction category (Satisfied, Neutral, Dissatisfied) by brand and region.

Data Structure Before Tidying

Raw Wide-Format Data

The original dataset uses a wide format with one row per brand-satisfaction combination and one column per region:

Code

# Load raw hotel satisfaction data
hotel_raw <- read.csv("hotel_satisfaction_raw.csv")

# Display the raw structure
knitr::kable(hotel_raw, caption = "Table 6: Raw Wide-Format Hotel Satisfaction Data")

Table 6: Raw Wide-Format Hotel Satisfaction Data
Brand	Satisfaction_Level	Northeast	Southeast	Midwest	Southwest	West
Luxury Inn	Satisfied	285	312	198	227	356
Luxury Inn	Neutral	56	48	42	51	73
Luxury Inn	Dissatisfied	19	28	15	22	31
Premier Hotels	Satisfied	412	398	367	345	421
Premier Hotels	Neutral	78	82	71	69	85
Premier Hotels	Dissatisfied	35	42	29	38	51
Budget Stay	Satisfied	198	187	212	189	203
Budget Stay	Neutral	89	98	95	102	107
Budget Stay	Dissatisfied	73	81	88	94	96

Code

# Report dimensions
cat("Dimensions: ", nrow(hotel_raw), " rows × ", ncol(hotel_raw), " columns\n")

Dimensions:  9  rows ×  7  columns

Code

cat("Column names: ", paste(names(hotel_raw), collapse = ", "), "\n")

Column names:  Brand, Satisfaction_Level, Northeast, Southeast, Midwest, Southwest, West

Transformation Steps

Step 1: Handle Missing Values

Fill missing brand names (simulating merged cells in source data):

Code

# Create a version with missing brand names
hotel_na <- hotel_raw
hotel_na$Brand[c(2, 3, 5, 6, 8, 9)] <- NA

cat("Data WITH missing values:\n")

Data WITH missing values:

Code

print(hotel_na)

           Brand Satisfaction_Level Northeast Southeast Midwest Southwest West
1     Luxury Inn          Satisfied       285       312     198       227  356
2           <NA>            Neutral        56        48      42        51   73
3           <NA>       Dissatisfied        19        28      15        22   31
4 Premier Hotels          Satisfied       412       398     367       345  421
5           <NA>            Neutral        78        82      71        69   85
6           <NA>       Dissatisfied        35        42      29        38   51
7    Budget Stay          Satisfied       198       187     212       189  203
8           <NA>            Neutral        89        98      95       102  107
9           <NA>       Dissatisfied        73        81      88        94   96

Code

# Fill down the missing brand names
hotel_filled <- hotel_na %>%
  fill(Brand, .direction = "down")

cat("\nData AFTER filling missing values:\n")


Data AFTER filling missing values:

Code

print(hotel_filled)

           Brand Satisfaction_Level Northeast Southeast Midwest Southwest West
1     Luxury Inn          Satisfied       285       312     198       227  356
2     Luxury Inn            Neutral        56        48      42        51   73
3     Luxury Inn       Dissatisfied        19        28      15        22   31
4 Premier Hotels          Satisfied       412       398     367       345  421
5 Premier Hotels            Neutral        78        82      71        69   85
6 Premier Hotels       Dissatisfied        35        42      29        38   51
7    Budget Stay          Satisfied       198       187     212       189  203
8    Budget Stay            Neutral        89        98      95       102  107
9    Budget Stay       Dissatisfied        73        81      88        94   96

Code

cat("\nRows affected: 6 (rows with NA brand names)")


Rows affected: 6 (rows with NA brand names)

Step 2: Reshape from Wide to Long Format

Convert from wide (regions as columns) to long (regions as rows):

Code

# Pivot longer: convert region columns to rows
hotel_long <- hotel_filled %>%
  pivot_longer(
    cols = c("Northeast", "Southeast", "Midwest", "Southwest", "West"),
    names_to = "region",
    values_to = "guest_count"
  )

knitr::kable(head(hotel_long, 12), 
             caption = "Table 7: Hotel Data After Pivoting to Long Format")

Table 7: Hotel Data After Pivoting to Long Format
Brand	Satisfaction_Level	region	guest_count
Luxury Inn	Satisfied	Northeast	285
Luxury Inn	Satisfied	Southeast	312
Luxury Inn	Satisfied	Midwest	198
Luxury Inn	Satisfied	Southwest	227
Luxury Inn	Satisfied	West	356
Luxury Inn	Neutral	Northeast	56
Luxury Inn	Neutral	Southeast	48
Luxury Inn	Neutral	Midwest	42
Luxury Inn	Neutral	Southwest	51
Luxury Inn	Neutral	West	73
Luxury Inn	Dissatisfied	Northeast	19
Luxury Inn	Dissatisfied	Southeast	28

Code

cat("Dimensions after pivot: ", nrow(hotel_long), " rows × ", ncol(hotel_long), " columns\n")

Dimensions after pivot:  45  rows ×  4  columns

Code

cat("Original: 9 rows × 7 cols → Tidy: ", nrow(hotel_long), " rows × ", ncol(hotel_long), " cols\n")

Original: 9 rows × 7 cols → Tidy:  45  rows ×  4  cols

Step 3: Normalize Variable Structure

Convert data types and standardize naming:

Code

# Normalize data types and structure
hotel_tidy <- hotel_long %>%
  mutate(
    guest_count = as.numeric(guest_count),
    brand = tolower(Brand),
    satisfaction_level = tolower(Satisfaction_Level),
    region = tolower(region)
  ) %>%
  select(brand, satisfaction_level, region, guest_count) %>%
  arrange(brand, satisfaction_level, region)

knitr::kable(head(hotel_tidy, 15),
             caption = "Table 8: Hotel Data After Normalization")

Table 8: Hotel Data After Normalization
brand	satisfaction_level	region	guest_count
budget stay	dissatisfied	midwest	88
budget stay	dissatisfied	northeast	73
budget stay	dissatisfied	southeast	81
budget stay	dissatisfied	southwest	94
budget stay	dissatisfied	west	96
budget stay	neutral	midwest	95
budget stay	neutral	northeast	89
budget stay	neutral	southeast	98
budget stay	neutral	southwest	102
budget stay	neutral	west	107
budget stay	satisfied	midwest	212
budget stay	satisfied	northeast	198
budget stay	satisfied	southeast	187
budget stay	satisfied	southwest	189
budget stay	satisfied	west	203

Code

cat("Data types:\n")

Data types:

Code

print(sapply(hotel_tidy, class))

             brand satisfaction_level             region        guest_count 
       "character"        "character"        "character"          "numeric"

Step 4: Verify Data Integrity

Code

cat("Summary of tidy hotel data:\n")

Summary of tidy hotel data:

Code

print(summary(hotel_tidy))

    brand           satisfaction_level    region           guest_count   
 Length:45          Length:45          Length:45          Min.   : 15.0  
 Class :character   Class :character   Class :character   1st Qu.: 51.0  
 Mode  :character   Mode  :character   Mode  :character   Median : 88.0  
                                                          Mean   :137.7  
                                                          3rd Qu.:198.0  
                                                          Max.   :421.0

Code

cat("\nNo missing values:\n")


No missing values:

Code

cat("Rows with NA: ", nrow(hotel_tidy[!complete.cases(hotel_tidy),]), "\n")

Rows with NA:  0

Code

cat("\nUnique values:\n")


Unique values:

Code

cat("Brands: ", paste(unique(hotel_tidy$brand), collapse = ", "), "\n")

Brands:  budget stay, luxury inn, premier hotels

Code

cat("Satisfaction levels: ", paste(unique(hotel_tidy$satisfaction_level), collapse = ", "), "\n")

Satisfaction levels:  dissatisfied, neutral, satisfied

Code

cat("Regions: ", paste(unique(hotel_tidy$region), collapse = ", "), "\n")

Regions:  midwest, northeast, southeast, southwest, west

Code

cat("\nTotal survey responses: ", sum(hotel_tidy$guest_count), "\n")


Total survey responses:  6198

Analytical Methods

Overall Satisfaction Rate by Brand

I calculate the percentage of satisfied responses for each hotel brand across all regions:

Code

hotel_overall <- hotel_tidy %>%
  group_by(brand, satisfaction_level) %>%
  summarise(total = sum(guest_count), .groups = "drop") %>%
  group_by(brand) %>%
  mutate(
    grand_total = sum(total),
    pct = round(100 * total / grand_total, 2)
  ) %>%
  filter(satisfaction_level == "satisfied") %>%
  select(brand, satisfied_count = total, total_responses = grand_total, satisfaction_pct = pct)

knitr::kable(hotel_overall,
             caption = "Table 9: Overall Satisfaction Rate by Hotel Brand")

Table 9: Overall Satisfaction Rate by Hotel Brand
brand	satisfied_count	total_responses	satisfaction_pct
budget stay	989	1912	51.73
luxury inn	1378	1763	78.16
premier hotels	1943	2523	77.01

Satisfaction Rate by Brand and Region

Code

hotel_by_region <- hotel_tidy %>%
  group_by(brand, region) %>%
  mutate(region_total = sum(guest_count)) %>%
  filter(satisfaction_level == "satisfied") %>%
  mutate(satisfaction_pct = round(100 * guest_count / region_total, 2)) %>%
  select(brand, region, satisfied = guest_count, region_total, satisfaction_pct) %>%
  arrange(region, brand)

knitr::kable(hotel_by_region,
             caption = "Table 10: Satisfaction Rate by Brand and Region")

Table 10: Satisfaction Rate by Brand and Region
brand	region	satisfied	region_total	satisfaction_pct
budget stay	midwest	212	395	53.67
luxury inn	midwest	198	255	77.65
premier hotels	midwest	367	467	78.59
budget stay	northeast	198	360	55.00
luxury inn	northeast	285	360	79.17
premier hotels	northeast	412	525	78.48
budget stay	southeast	187	366	51.09
luxury inn	southeast	312	388	80.41
premier hotels	southeast	398	522	76.25
budget stay	southwest	189	385	49.09
luxury inn	southwest	227	300	75.67
premier hotels	southwest	345	452	76.33
budget stay	west	203	406	50.00
luxury inn	west	356	460	77.39
premier hotels	west	421	557	75.58

Visualizations

Overall Satisfaction Rate by Brand

Code

ggplot(hotel_overall, aes(x = brand, y = satisfaction_pct, fill = brand)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = paste0(satisfaction_pct, "%")), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("budget stay" = "#E41A1C", "luxury inn" = "#377EB8", "premier hotels" = "#4DAF4A")) +
  labs(
    title = "Overall Customer Satisfaction Rate by Hotel Brand",
    x = "Hotel Brand", 
    y = "Satisfaction Rate (%)",
    caption = "Source: Hotel chain satisfaction survey"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  ) +
  ylim(0, 100)

Satisfaction Rate by Brand and Region

Code

ggplot(hotel_by_region, aes(x = region, y = satisfaction_pct, fill = brand)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(satisfaction_pct, "%")),
            position = position_dodge(width = 0.9), vjust = -0.4, size = 3) +
  scale_fill_manual(values = c("budget stay" = "#E41A1C", "luxury inn" = "#377EB8", "premier hotels" = "#4DAF4A")) +
  labs(
    title = "Guest Satisfaction Rate by Hotel Brand and Region",
    x = "Region", 
    y = "Satisfaction Rate (%)",
    fill = "Brand",
    caption = "Source: Hotel chain satisfaction survey"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  )

Dataset 3: E-commerce Product Return Rates Analysis

Data Source

This dataset represents transaction records from a hypothetical online retailer capturing product returns across four merchandise categories and five sales channels. The data records the count of returned and non-returned items by product category and sales channel.

Data Structure Before Tidying

Raw Wide-Format Data

The original dataset uses a wide format with one row per category-return status combination and one column per sales channel:

Code

# Load raw e-commerce returns data
ecommerce_raw <- read.csv("ecommerce_returns_raw.csv")

# Display the raw structure
knitr::kable(ecommerce_raw, caption = "Table 11: Raw Wide-Format E-commerce Returns Data")

Table 11: Raw Wide-Format E-commerce Returns Data
Product_Category	Return_Status	Website	Mobile_App	Amazon	eBay	Physical_Store
Electronics	Not Returned	1245	892	1567	412	234
Electronics	Returned	156	134	198	87	31
Clothing	Not Returned	2103	1876	1654	623	512
Clothing	Returned	287	312	289	145	98
Home & Garden	Not Returned	834	456	789	234	167
Home & Garden	Returned	92	78	134	45	38
Sports Equipment	Not Returned	567	401	612	178	145
Sports Equipment	Returned	78	68	95	32	21

Code

# Report dimensions
cat("Dimensions: ", nrow(ecommerce_raw), " rows × ", ncol(ecommerce_raw), " columns\n")

Dimensions:  8  rows ×  7  columns

Code

cat("Column names: ", paste(names(ecommerce_raw), collapse = ", "), "\n")

Column names:  Product_Category, Return_Status, Website, Mobile_App, Amazon, eBay, Physical_Store

Transformation Steps

Step 1: Handle Missing Values

Fill missing category names (simulating merged cells):

Code

# Create a version with missing category names
ecommerce_na <- ecommerce_raw
ecommerce_na$Product_Category[c(2, 4, 6, 8)] <- NA

cat("Data WITH missing values:\n")

Data WITH missing values:

Code

print(ecommerce_na)

  Product_Category Return_Status Website Mobile_App Amazon eBay Physical_Store
1      Electronics  Not Returned    1245        892   1567  412            234
2             <NA>      Returned     156        134    198   87             31
3         Clothing  Not Returned    2103       1876   1654  623            512
4             <NA>      Returned     287        312    289  145             98
5    Home & Garden  Not Returned     834        456    789  234            167
6             <NA>      Returned      92         78    134   45             38
7 Sports Equipment  Not Returned     567        401    612  178            145
8             <NA>      Returned      78         68     95   32             21

Code

# Fill down the missing category names
ecommerce_filled <- ecommerce_na %>%
  fill(Product_Category, .direction = "down")

cat("\nData AFTER filling missing values:\n")


Data AFTER filling missing values:

Code

print(ecommerce_filled)

  Product_Category Return_Status Website Mobile_App Amazon eBay Physical_Store
1      Electronics  Not Returned    1245        892   1567  412            234
2      Electronics      Returned     156        134    198   87             31
3         Clothing  Not Returned    2103       1876   1654  623            512
4         Clothing      Returned     287        312    289  145             98
5    Home & Garden  Not Returned     834        456    789  234            167
6    Home & Garden      Returned      92         78    134   45             38
7 Sports Equipment  Not Returned     567        401    612  178            145
8 Sports Equipment      Returned      78         68     95   32             21

Code

cat("\nRows affected: 4 (rows with NA category names)")


Rows affected: 4 (rows with NA category names)

Step 2: Reshape from Wide to Long Format

Convert from wide (channels as columns) to long (channels as rows):

Code

# Pivot longer: convert channel columns to rows
ecommerce_long <- ecommerce_filled %>%
  pivot_longer(
    cols = c("Website", "Mobile_App", "Amazon", "eBay", "Physical_Store"),
    names_to = "sales_channel",
    values_to = "transaction_count"
  )

knitr::kable(head(ecommerce_long, 12), 
             caption = "Table 12: E-commerce Data After Pivoting to Long Format")

Table 12: E-commerce Data After Pivoting to Long Format
Product_Category	Return_Status	sales_channel	transaction_count
Electronics	Not Returned	Website	1245
Electronics	Not Returned	Mobile_App	892
Electronics	Not Returned	Amazon	1567
Electronics	Not Returned	eBay	412
Electronics	Not Returned	Physical_Store	234
Electronics	Returned	Website	156
Electronics	Returned	Mobile_App	134
Electronics	Returned	Amazon	198
Electronics	Returned	eBay	87
Electronics	Returned	Physical_Store	31
Clothing	Not Returned	Website	2103
Clothing	Not Returned	Mobile_App	1876

Code

cat("Dimensions after pivot: ", nrow(ecommerce_long), " rows × ", ncol(ecommerce_long), " columns\n")

Dimensions after pivot:  40  rows ×  4  columns

Code

cat("Original: 8 rows × 7 cols → Tidy: ", nrow(ecommerce_long), " rows × ", ncol(ecommerce_long), " cols\n")

Original: 8 rows × 7 cols → Tidy:  40  rows ×  4  cols

Step 3: Normalize Variable Structure

Convert data types and standardize naming:

Code

# Normalize data types and structure
ecommerce_tidy <- ecommerce_long %>%
  mutate(
    transaction_count = as.numeric(transaction_count),
    product_category = tolower(Product_Category),
    return_status = tolower(Return_Status),
    sales_channel = tolower(gsub("_", " ", sales_channel))
  ) %>%
  select(product_category, return_status, sales_channel, transaction_count) %>%
  arrange(product_category, return_status, sales_channel)

knitr::kable(head(ecommerce_tidy, 15),
             caption = "Table 13: E-commerce Data After Normalization")

Table 13: E-commerce Data After Normalization
product_category	return_status	sales_channel	transaction_count
clothing	not returned	amazon	1654
clothing	not returned	ebay	623
clothing	not returned	mobile app	1876
clothing	not returned	physical store	512
clothing	not returned	website	2103
clothing	returned	amazon	289
clothing	returned	ebay	145
clothing	returned	mobile app	312
clothing	returned	physical store	98
clothing	returned	website	287
electronics	not returned	amazon	1567
electronics	not returned	ebay	412
electronics	not returned	mobile app	892
electronics	not returned	physical store	234
electronics	not returned	website	1245

Code

cat("Data types:\n")

Data types:

Code

print(sapply(ecommerce_tidy, class))

 product_category     return_status     sales_channel transaction_count 
      "character"       "character"       "character"         "numeric"

Step 4: Verify Data Integrity

Code

cat("Summary of tidy e-commerce data:\n")

Summary of tidy e-commerce data:

Code

print(summary(ecommerce_tidy))

 product_category   return_status      sales_channel      transaction_count
 Length:40          Length:40          Length:40          Min.   :  21.00  
 Class :character   Class :character   Class :character   1st Qu.:  94.25  
 Mode  :character   Mode  :character   Mode  :character   Median : 216.00  
                                                          Mean   : 447.98  
                                                          3rd Qu.: 578.25  
                                                          Max.   :2103.00

Code

cat("\nNo missing values:\n")


No missing values:

Code

cat("Rows with NA: ", nrow(ecommerce_tidy[!complete.cases(ecommerce_tidy),]), "\n")

Rows with NA:  0

Code

cat("\nUnique values:\n")


Unique values:

Code

cat("Product categories: ", paste(unique(ecommerce_tidy$product_category), collapse = ", "), "\n")

Product categories:  clothing, electronics, home & garden, sports equipment

Code

cat("Return statuses: ", paste(unique(ecommerce_tidy$return_status), collapse = ", "), "\n")

Return statuses:  not returned, returned

Code

cat("Sales channels: ", paste(unique(ecommerce_tidy$sales_channel), collapse = ", "), "\n")

Sales channels:  amazon, ebay, mobile app, physical store, website

Code

cat("\nTotal transactions: ", sum(ecommerce_tidy$transaction_count), "\n")


Total transactions:  17919

Analytical Methods

Overall Return Rate by Product Category

I calculate the percentage of returned items for each product category across all sales channels:

Code

ecommerce_overall <- ecommerce_tidy %>%
  group_by(product_category, return_status) %>%
  summarise(total = sum(transaction_count), .groups = "drop") %>%
  group_by(product_category) %>%
  mutate(
    grand_total = sum(total),
    pct = round(100 * total / grand_total, 2)
  ) %>%
  filter(return_status == "returned") %>%
  select(product_category, returned_count = total, total_transactions = grand_total, return_pct = pct)

knitr::kable(ecommerce_overall,
             caption = "Table 14: Overall Return Rate by Product Category")

Table 14: Overall Return Rate by Product Category
product_category	returned_count	total_transactions	return_pct
clothing	1131	7899	14.32
electronics	606	4956	12.23
home & garden	387	2867	13.50
sports equipment	294	2197	13.38

Return Rate by Product Category and Sales Channel

Code

ecommerce_by_channel <- ecommerce_tidy %>%
  group_by(product_category, sales_channel) %>%
  mutate(channel_total = sum(transaction_count)) %>%
  filter(return_status == "returned") %>%
  mutate(return_pct = round(100 * transaction_count / channel_total, 2)) %>%
  select(product_category, sales_channel, returned = transaction_count, channel_total, return_pct) %>%
  arrange(product_category, sales_channel)

knitr::kable(ecommerce_by_channel,
             caption = "Table 15: Return Rate by Product Category and Sales Channel")

Table 15: Return Rate by Product Category and Sales Channel
product_category	sales_channel	returned	channel_total	return_pct
clothing	amazon	289	1943	14.87
clothing	ebay	145	768	18.88
clothing	mobile app	312	2188	14.26
clothing	physical store	98	610	16.07
clothing	website	287	2390	12.01
electronics	amazon	198	1765	11.22
electronics	ebay	87	499	17.43
electronics	mobile app	134	1026	13.06
electronics	physical store	31	265	11.70
electronics	website	156	1401	11.13
home & garden	amazon	134	923	14.52
home & garden	ebay	45	279	16.13
home & garden	mobile app	78	534	14.61
home & garden	physical store	38	205	18.54
home & garden	website	92	926	9.94
sports equipment	amazon	95	707	13.44
sports equipment	ebay	32	210	15.24
sports equipment	mobile app	68	469	14.50
sports equipment	physical store	21	166	12.65
sports equipment	website	78	645	12.09

Visualizations

Overall Return Rate by Product Category

Code

ggplot(ecommerce_overall, aes(x = product_category, y = return_pct, fill = product_category)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(return_pct, "%")), vjust = -0.5, size = 4, fontface = "bold") +
  scale_fill_manual(values = c(
    "clothing" = "#FF7F00",
    "electronics" = "#377EB8",
    "home & garden" = "#4DAF4A",
    "sports equipment" = "#E41A1C"
  )) +
  labs(
    title = "Overall Return Rate by Product Category",
    x = "Product Category", 
    y = "Return Rate (%)",
    caption = "Source: E-commerce transaction data"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  ) +
  ylim(0, 15)

Return Rate by Product Category and Sales Channel

Code

ggplot(ecommerce_by_channel, aes(x = sales_channel, y = return_pct, fill = product_category)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(return_pct, "%")),
            position = position_dodge(width = 0.9), vjust = -0.3, size = 2.5) +
  scale_fill_manual(values = c(
    "clothing" = "#FF7F00",
    "electronics" = "#377EB8",
    "home & garden" = "#4DAF4A",
    "sports equipment" = "#E41A1C"
  )) +
  labs(
    title = "Return Rate by Product Category and Sales Channel",
    x = "Sales Channel", 
    y = "Return Rate (%)",
    fill = "Category",
    caption = "Source: E-commerce transaction data"
  ) +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
  )