Academic Honesty Statement (fill your name in the blank)

I, Boyang Liu, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.

Load packages

# load required packages here
library(tidyverse)
library(nycflights13)

1. The mpg data set

After loading tidyverse library, a data set named mpg should be ready to explore. The following questions are based on this data set.

a) Create a new variable mpg_overall which is the average of city and highway fuel consumption in miles per gallon. Then create a histogram of this new variable with each group covering values of 20-22, 22-24 etc.
mpg2 <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2)

ggplot(mpg2, aes(x = mpg_overall)) + 
  geom_histogram(binwidth = 2, boundary = 20, fill = "skyblue", color = "black") +
  labs(
    title = "Histogram of Overall MPG",
    x = "Overall MPG",
    y = "Count"
  ) + 
  theme(plot.title = element_text(hjust = 0.5))

b) Create a graph to study the relationship between drive train types and mpg_overall.
ggplot(mpg2, aes(x = drv, y = mpg_overall, fill = drv)) +
  geom_boxplot() +
  labs(
    title = "Relationship Between Drive Train and Overall MPG",
    x = "Drive Train Type",
    y = "Overall MPG"
  ) + 
  theme(plot.title = element_text(hjust = 0.5))

Answer:
Overall MPG is different across drive train types. In general, front-wheel drive cars tend to have higher fuel efficiency, while four-wheel drive cars tend to have lower fuel efficiency.

c) Create a table to find out which car class has the highest mean mpg_overall.
mpg_class_mean <- mpg2 %>%
  group_by(class) %>%
  summarise(mean_mpg_overall = mean(mpg_overall), groups = "drop") %>%
  arrange(desc(mean_mpg_overall))

mpg_class_mean

Answer:
The car class with the highest mean overall mpg is subcompact. This means that class has the best average fuel efficiency among all classes.

d) Create a proper graph to study the composite effect of year and cyl to mpg_overall. You shall treat year and cyl as categorical variables in your graph.
ggplot(mpg2, aes(x = factor(cyl), y = mpg_overall, fill = factor(year))) +
  geom_boxplot() +
  labs(
    title = "Composite Effect of Year and Cylinders",
    x = "Numbers",
    y = "Overall MPG"
  ) + 
  theme(plot.title = element_text(hjust = 0.5))

Answer:
Overall MPG decreases as the number of cylinders increeases. Cars from 2008 generally have slightly higher fuel efficiency than cars from 1999 within the same cylinder group.

2. The flights data set

For the following tasks, use data set flights of the nycflights13 package.

a) For JFK airport, which day in November 2013 has the biggest average arrival delay? Create a table to answer the question.
jfk_nov_delay <- flights %>%
  filter(origin == "JFK", year == 2013, month ==11, !is.na(arr_delay)) %>%
  group_by(day) %>%
  summarise(avg_arr_delay = mean(arr_delay), .groups ="drop") %>%
  arrange(desc(avg_arr_delay))

jfk_nov_delay

Answer:
The day in November 2013 with the biggest average arrival delay for JFK is 27th.

b) Create a new variable cancel_flight which is Cancelled if the departure time or arrival time is NA, otherwise Not Cancelled.
flights2 <- flights %>%
  mutate(
    cancel_flight = if_else(
      is.na(dep_time) | is.na(arr_time),
      "Cancelled",
      "Not Cancelled"
    )
  )
flights2

Answer:
A new variable is created. If either departure time and arrival time is missing, the flight is labeled as Cancelled; otherwise, it is labeled as Not Cancelled.

c) Create a density graph that compares the distribution of distance between cancelled flights and non-cancelled flights.
ggplot(flights2, aes(x = distance, fill = cancel_flight)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Distribution of Distance by Flight Cancellaion",
    x = "Distance",
    y = "Density"
  ) + 
  theme(plot.title = element_text(hjust = 0.5))

d) How many unique flight routes are there in the data set? That is, each unique combination of an origin airport and a destination airport (such as from EWR to ORD) is considered as a route. Create a table to answer the question.
route_table <- flights %>%
  distinct(origin, dest) %>%
  arrange(origin, dest)

route_table
route_table %>%
  summarise(num_routes = n())

Answer:
There are 224 unique flight routes in the data set.

e) Add distance as a column to the table you created in d).

Hint: You should go back to the original flights data set and reconstruct the table with distance included. Create a histogram of distance for the route table.

route_table2 <- flights %>%
  distinct(origin, dest, distance) %>%
  arrange(origin, dest)
route_table2
ggplot(route_table2, aes(x = distance)) +
  geom_histogram(binwidth = 100, fill = "orange") + 
  labs(
    title = "Histogram of Route Distance",
    x = "Distance",
    y = "Count"
  ) + 
  theme(plot.title = element_text(hjust = 0.5))

f) Which route has the highest rate of flight cancellation? Create a table to answer the question.
route_cancel_rate <- flights2 %>%
  group_by(origin, dest) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "Cancelled"),
    cancellation_rate = cancelled_flights / total_flights,
    .groups = "drop"
  ) %>%
  arrange(desc(cancellation_rate))

route_cancel_rate

Answer:
The route with the highest cancellation rate is EWR to LCA, with a cancellation rate of 1.00.

Bonus Question for flights data set

The following questions are also from flights data set. Each question is worth 5% bonus points if answered correctly.


a) Create a proper graph to show the rate of cancellation flights for each airline. Answer which airline has the lowest rate of cancellation.
airline_cancel_rate <- flights2 %>%
  group_by(carrier) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "Cancelled"),
    cancellation_rate = cancelled_flights / total_flights,
    .groups = "drop"
  ) %>%
  arrange(cancellation_rate)
airline_cancel_rate
ggplot(airline_cancel_rate, aes(x = reorder(carrier, cancellation_rate), y = cancellation_rate, fill = carrier)) +
  geom_col() +
  labs(
    title = "Cancellation Rate by Airline",
    x = "Airline",
    y = "Cancellation Rate"
  ) + 
  theme(plot.title = element_text(hjust = 0.5))

Answer:
The airline with the lowest cancellation rate is HA.

b) If multiple airlines run the same route, they can be considered as competitors. Which route is most competitive (has the most number of carriers)? List all of them in a table.
route_comp <- flights %>%
  group_by(origin, dest) %>%
  summarise(
    num_carriers = n_distinct(carrier),
    .groups = "drop"
  ) %>%
  arrange(desc(num_carriers))
route_comp
max_carriers <- max(route_comp$num_carriers)
route_comp %>%
  filter(num_carriers == max_carriers)

Answer:
The most competitive routes are the ones operated by the highest number of airlines. In this data set, the maximum number of carriers on a route is 5.