library(tidyverse)
library(janitor)
library(lubridate)
theme_set(theme_minimal(base_size = 12))

Instructions

Use tidyverse (dplyr, tidyr, ggplot2). No Python.
Export required CSVs and PNGs to the project root.
Keep code readable; use pipes and comments.
If you use AI, note the prompt and what it changed, especially for visualization improvements.

Task A - Flight Delays

Wrangling

flights_raw <- read_csv("flights_dirty.csv")

## Rows: 140 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): airline_code, tailnum, destination
## dbl (4): flight, dep_delay, arr_delay, distance
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

airlines_raw<- read_csv("airlines_lookup.csv")

## Rows: 6 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): airline_code, airline_name
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

flights <- readr::read_csv("flights_dirty.csv") %>%
  clean_names() %>%
  mutate(
    dep_delay = as.numeric(trimws(as.character(dep_delay))),
    arr_delay = as.numeric(trimws(as.character(arr_delay)))
  ) %>%
  filter(
    between(dep_delay, -100, 600),
    between(arr_delay, -100, 600)
  ) %>%
  mutate(
    airline_code = toupper(str_trim(airline_code))
  ) %>%
  filter(nchar(airline_code) == 2)

## Rows: 140 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): airline_code, tailnum, destination
## dbl (4): flight, dep_delay, arr_delay, distance
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

airlines <- readr::read_csv("airlines_lookup.csv")

## Rows: 6 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): airline_code, airline_name
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

flights_cleaned <- flights %>%
  left_join(airlines, by = "airline_code") %>%
  select(
    flight,
    airline_code,
    airline_name,
    destination,
    distance,
    dep_delay,
    arr_delay
  )
readr::write_csv(flights_cleaned, "flights_cleaned.csv")

flights_cleaned <- read_csv("flights_cleaned.csv")

## Rows: 132 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): airline_code, airline_name, destination
## dbl (4): flight, distance, dep_delay, arr_delay
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(flights_cleaned)

## Rows: 132
## Columns: 7
## $ flight       <dbl> 9716, 5076, 3595, 8120, 488, 7493, 2711, 1813, 5102, 5645…
## $ airline_code <chr> "WN", "AA", "B6", "B6", "DL", "DL", "DL", "UA", "DL", "AA…
## $ airline_name <chr> "Southwest", "American Airlines", "JetBlue", "JetBlue", "…
## $ destination  <chr> "MCO", "LAX", "ORD", "MCO", "CLT", "ATL", "ATL", "CLT", "…
## $ distance     <dbl> 1546, 1352, 1529, 877, 1836, 925, 799, 2011, 458, 482, 22…
## $ dep_delay    <dbl> 46, -2, 2, 15, 31, 20, -19, 8, -1, -6, 46, -1, -11, 22, 1…
## $ arr_delay    <dbl> 44, -23, -1, 12, 28, 27, -18, 1, -16, -6, 55, 18, 1, -4, …

Visualizations

A1: Departure delay distribution by airline

flights_cleaned %>%
  filter(!is.na(airline_name), !is.na(dep_delay)) %>%
  ggplot(aes(x = airline_name, y = dep_delay)) +
  geom_boxplot(outlier.alpha = 0.3) +
  coord_flip() +
  labs(
    title = "Departure Delay by Airline",
    x = NULL,
    y = "Departure delay (min)"
  )

ggsave("fig_flights_delay_by_airline.png", width = 8, height = 5, dpi = 150)

A2: Mean arrival delay by destination

flights_dest <- flights_cleaned %>%
group_by(destination) %>%
summarise(
mean_arr = mean(arr_delay, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(mean_arr))

ggplot(flights_dest, aes(x = reorder(destination, mean_arr), y = mean_arr)) +
geom_col() +
coord_flip() +
labs(
title = "Mean Arrival Delay by Destination",
x = "Destination",
y = "Mean arrival delay (min)"
)

ggsave("fig_flights_arr_delay_by_dest.png", width = 8, height = 5, dpi = 150)

A3: Relationship between distance and arrival delay

flights_cleaned %>%
filter(!is.na(distance), !is.na(arr_delay)) %>%
ggplot(aes(x = distance, y = arr_delay)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Distance vs Arrival Delay",
x = "Distance (miles)",
y = "Arrival delay (min)"
)

## `geom_smooth()` using formula = 'y ~ x'

ggsave("fig_flights_distance_vs_arr.png", width = 8, height = 5, dpi = 150)

## `geom_smooth()` using formula = 'y ~ x'

Insights

IA1: Which airlines show the highest variability in departure delays? From the Departure Delay by Airline boxplot, Southwest and JetBlue exhibit the widest interquartile ranges, meaning their departure delays vary the most.
IA2: Which destination(s) have the worst arrival delays? Any plausible reasons? The Mean Arrival Delay by Destination chart shows that MCO (Orlando) and LAX (Los Angeles) have the highest mean arrival delays, followed by LAS (Las Vegas) and SFO (San Francisco). These airports are major ports so they would see a greater influx of people/flights.
IA3: Does longer distance tend to reduce or increase arrival delays in this sample? The Distance vs. Arrival Delay scatterplot shows some long flights still arrive early, while short routes sometimes face heavy delays. This suggests that distance alone doesn’t strongly determine arrival delay
IA4: AI usage to improve a plot (if any): describe the prompt and the change. AI was not used.

Task B - Retail Sales

Wrangling

# ---- Read and clean sales data ----
sales <- readr::read_csv("sales_dirty.csv") %>%
  clean_names() %>%
  mutate(
    item = str_to_title(str_trim(item))      # clean item names
  ) %>%
  distinct(order_id, item, .keep_all = TRUE) %>%  # remove duplicates
  mutate(
    price_usd = if_else(currency == "EUR", price * 1.1, price)  # convert EUR to USD
  ) %>%
  filter(price_usd > 0) %>%
  mutate(
    date_parsed = suppressWarnings(parse_date_time(date, orders = c("mdY", "m/d/y", "Y-m-d")))
  ) %>%
  mutate(
    year = year(date_parsed),
    month = month(date_parsed)
  )

## Rows: 212 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): item, category, currency, date
## dbl (2): order_id, price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# ---- Summaries ----
sales_monthly <- sales %>%
  group_by(year, month) %>%
  summarise(
    monthly_revenue_usd = sum(price_usd, na.rm = TRUE),
    .groups = "drop"
  )

sales_by_cat <- sales %>%
  group_by(year, month, category) %>%
  summarise(
    revenue_usd = sum(price_usd, na.rm = TRUE),
    .groups = "drop"
  )


readr::write_csv(sales, "sales_cleaned.csv")
readr::write_csv(sales_monthly, "sales_monthly_summary.csv")
readr::write_csv(sales_by_cat, "sales_by_category_summary.csv")

Visualizations

# B1: Monthly revenue trend
ggplot(sales_monthly, aes(x = as.Date(paste(year, month, 1, sep = "-")), y = monthly_revenue_usd)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue") +
  labs(
    title = "Monthly Revenue (USD)",
    x = "Month",
    y = "Revenue (USD)"
  )

ggsave("fig_sales_monthly_trend.png", width = 8, height = 4.5, dpi = 150)

# B2: Revenue by category over time (stacked area)
sales_by_cat$date <- as.Date(paste(sales_by_cat$year, sales_by_cat$month, 1, sep = "-"))

ggplot(sales_by_cat, aes(x = date, y = revenue_usd, fill = category)) +
  geom_area() +
  labs(
    title = "Revenue by Category Over Time",
    x = "Month",
    y = "Revenue (USD)",
    fill = "Category"
  )

ggsave("fig_sales_by_category.png", width = 8, height = 5, dpi = 150)

Insights

IB1: Which months are unusually high/low? Hypothesize why (data or business). The Monthly Revenue line chart indicates that sales remained low and stable for most of 2023, fluctuating slightly between January and September. However, there was a sharp revenue spike in October then an immediate drop the following month. The spike could be due to a promotion the company was doing like a fall sales campaign, the fact that it is towards the holidays, or a new product was implemented that became popular.
IB2: Which category drives most revenue? Is its share stable over time? Accessories drives the most revenue. It was consistent throughout the year, then it had a significant spike with the fall/winter season. Gadgets and Widgets maintained relatively small and consistent contributions throughout the year.
IB3: AI usage to improve a plot (if any): AI was not use for plot improvement. # Submission Checklist
flights_cleaned.csv
fig_flights_delay_by_airline.png
fig_flights_arr_delay_by_dest.png
sales_cleaned.csv
sales_monthly_summary.csv
fig_sales_monthly_trend.png
fig_sales_by_category.png
(Optional) fig_flights_distance_vs_arr.png, fig_sales_top_items.png
Insights answered under each task ```

Midterm — Data Wrangling and Visualizations (R only)

Instructions

Task A - Flight Delays

Wrangling

Visualizations

A1: Departure delay distribution by airline

Insights

Task B - Retail Sales

Wrangling

Visualizations

Insights