Academic Honesty Statement (fill your name in the blank)

I, Jerry Chan, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.

Load packages

library(tidyverse)
library(nycflights13)
library(openintro)

1. The mpg data set

After loading the tidyverse library, a data set named mpg should be ready to explore. The following questions are based on this data set.

a) Create a new variable mpg_overall, which is the average of city and highway fuel economy in miles per gallon. Then create a histogram of this new variable with each bin covering values of 20-22, 22-24, etc.
mpg%>%
  mutate(mpg_overall=(cty+hwy)/2)%>%
  ggplot(aes(x=mpg_overall))+
  geom_histogram(binwidth = 2, boundary=20, fill = "steelblue", color = "black")+
  labs(
    title = "overall mpg",
    x = "average mpg",
    y = "count"
  )
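As a quick check on the binning: binwidth = 2 together with boundary = 20 places the bin edges on even numbers, so the bins are 20-22, 22-24 and so on. The sketch below mimics those edges with base R's cut() on a few hypothetical values (they are not taken from mpg, and ggplot2 itself is not called here).

```r
# Hypothetical averages, not taken from the mpg data
mpg_overall <- c(20.5, 21.0, 22.0, 22.5, 23.5, 25.0)

# Edges every 2 units, aligned so that 20 is an edge
# (the same alignment as binwidth = 2, boundary = 20)
edges <- seq(20, 26, by = 2)

# right = TRUE mirrors geom_histogram's default of right-closed bins,
# so a value of exactly 22 falls into the 20-22 bin
bin_counts <- table(cut(mpg_overall, breaks = edges, right = TRUE))
print(bin_counts)
```

With these toy values, 22.0 is counted in the 20-22 bin rather than in 22-24, which is worth keeping in mind when reading the histogram.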

b) Create a graph to study the relationship between drive train types and mpg_overall.
mpg%>%
  mutate(mpg_overall=(cty+hwy)/2)%>%
  ggplot(aes(x=drv, y=mpg_overall))+
  geom_boxplot()+
  labs(
    title="relationship between drive train types and overall mpg",
    x="drive train",
    y="overall mpg"
  )

Answer: Front-wheel drive (‘f’) has the highest median overall mpg, ahead of four-wheel drive (‘4’) and rear-wheel drive (‘r’).

c) Create a table to find out which car class has the highest mean mpg_overall.
mpg%>%
  mutate(mpg_overall=(cty+hwy)/2)%>%
  group_by(class)%>%
  summarise(mean_mpg_overall=mean(mpg_overall))%>%
  arrange(desc(mean_mpg_overall))

Answer: The subcompact class has the highest mean overall mpg, followed closely by the compact class.

d) Create a proper graph to study the combined effect of year and cyl on mpg_overall. Treat year and cyl as categorical variables in your graph.
mpg%>%
  mutate(mpg_overall=((cty+hwy)/2), year=factor(year), cyl=factor(cyl))%>%
  ggplot(aes(x=cyl, y=mpg_overall, fill=year))+
  geom_boxplot()+
  labs(title="overall mpg by cylinders and years", x="cylinder", y="overall mpg")

Answer: Cars with fewer cylinders have higher overall mpg, and the 2008 models show minor improvements over the 1999 models.

2. The flights data set

For the following tasks, use the flights data set from the nycflights13 package.

a) For JFK airport, which day in November 2013 has the biggest average arrival delay? Create a table to answer the question.
flights%>%
  filter(origin=="JFK", month==11, year==2013)%>%
  group_by(day)%>%
  summarise(mean_arr_delay=mean(arr_delay, na.rm=TRUE))%>%
  arrange(desc(mean_arr_delay))

Answer: November 27, 2013 has the highest mean arrival delay, 21.334405 minutes. A possible reason is the Thanksgiving holiday, which increases the number of flights.

b) Create a new variable cancel_flight which is Cancelled if the departure time or arrival time is NA, otherwise Not Cancelled.
cancelled_flights <- flights%>%
  mutate(cancel_flight=if_else(
    is.na(dep_time)|is.na(arr_time),
    "Cancelled",
    "Not Cancelled"
  ))
cancelled_flights%>%
  count(cancel_flight)

Answer: There are 8713 cancelled flights.

c) Create a density graph that compares the distribution of distance between cancelled flights and non-cancelled flights.
ggplot(cancelled_flights, aes(x=distance, fill=cancel_flight))+
  geom_density(alpha=0.5)+
  labs(
    title="distribution of distance between cancelled flights and non-cancelled flights",
    x="distance (in miles)",
    y="density"
  )

d) How many unique flight routes are there in the data set? That is, each unique combination of an origin airport and a destination airport (such as from EWR to ORD) is considered as a route. Create a table to answer the question.
flights%>%
  distinct(origin, dest)%>%
  count()

Answer: There are 224 unique flight routes.

e) Add distance as a column to the table you created in d).

Hint: You should go back to the original flights data set and reconstruct the table with distance included. Create a histogram of distance for the route table.

flights%>%
  distinct(origin, dest, distance)%>%
  arrange(origin, dest)%>%
  ggplot(aes(x=distance))+
  geom_histogram(binwidth=200, fill="steelblue", color="black")+
  labs(
    title="distribution of route distance",
    x="distance",
    y="number of routes"
  )
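Note that distinct(origin, dest, distance) returns one row per route only if every route is recorded with a single distance. A minimal sketch of how that could be checked, on a hypothetical toy table (toy_flights and its values are made up for illustration, not drawn from flights):

```r
library(dplyr)

# Hypothetical toy data: route A-B is recorded with two slightly different distances
toy_flights <- data.frame(
  origin   = c("A", "A", "A", "C"),
  dest     = c("B", "B", "B", "D"),
  distance = c(100, 100, 101, 250)
)

# Count the distinct distances per route; a value above 1 means the route
# would appear more than once in distinct(origin, dest, distance)
distance_check <- toy_flights %>%
  group_by(origin, dest) %>%
  summarise(n_distances = n_distinct(distance), .groups = "drop")

print(distance_check)
```

If the same check on the real data showed any route with more than one distance, the route table would need one summarised distance per route (for example the mean) before plotting.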

f) Which route has the highest rate of flight cancellation? Create a table to answer the question.
flights%>%
  mutate(cancelled=if_else(is.na(dep_time)|is.na(arr_time), "cancelled", "not cancelled"))%>%
  group_by(origin, dest)%>%
  summarise(all_flights=n(),
            cancelled_count=sum(cancelled=="cancelled"),
            cancel_rate=cancelled_count/all_flights,
            .groups="drop")%>%
  arrange(desc(cancel_rate))

Answer: EWR to LGA has the highest cancellation rate, but only because the table contains a single flight on that route. LGA to MHT follows, with enough flights to measure the cancellation rate more reliably.
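One way to keep single-flight routes from dominating the ranking is to require a minimum number of flights before comparing rates. The sketch below uses a hypothetical toy table and a hypothetical cutoff (min_flights); neither comes from the exam, and the right cutoff for the real data is a judgment call.

```r
library(dplyr)

# Hypothetical toy data: route A-B has one flight (cancelled),
# route C-D has four flights (one cancelled)
toy_flights <- data.frame(
  origin       = c("A", rep("C", 4)),
  dest         = c("B", rep("D", 4)),
  is_cancelled = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

min_flights <- 2  # hypothetical cutoff, chosen only for this illustration

route_rates <- toy_flights %>%
  group_by(origin, dest) %>%
  summarise(all_flights = n(),
            cancel_rate = mean(is_cancelled),
            .groups = "drop") %>%
  filter(all_flights >= min_flights) %>%  # drop routes with too few flights
  arrange(desc(cancel_rate))

print(route_rates)
```

Here the single-flight route A-B is dropped, so the ranking reflects only routes with enough flights to make a rate meaningful.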

Bonus Question for flights data set

The following questions also use the flights data set. Each question is worth 5% bonus points if answered correctly.


a) Create a proper graph to show the rate of cancelled flights for each airline. Answer which airline has the lowest rate of cancellation.
flights%>%
  mutate(cancelled=if_else(is.na(dep_time)|is.na(arr_time),
                           "cancelled", "not cancelled"))%>%
  group_by(carrier)%>%
  summarise(cancel_rate=mean(cancelled=="cancelled"))%>%
  ggplot(aes(x=reorder(carrier, cancel_rate), y=cancel_rate))+
  geom_col()+
  labs(
    title="cancellation rate by carrier",
    x="carrier",
    y="cancel rate"
  )

Answer: HA has the lowest cancellation rate, with 0 cancellations.

b) If multiple airlines run the same route, they can be considered competitors. Which route is the most competitive (has the largest number of carriers)? List all of them in a table.

Answer:
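No answer was recorded for this part. One possible approach, sketched on a hypothetical toy table rather than on flights, is to count the distinct carriers per route and keep the route or routes with the maximum count. The names toy_flights and n_carriers are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: route A-B is flown by three carriers, route C-D by one
toy_flights <- data.frame(
  origin  = c("A", "A", "A", "C", "C"),
  dest    = c("B", "B", "B", "D", "D"),
  carrier = c("X", "Y", "Z", "X", "X")
)

route_carriers <- toy_flights %>%
  group_by(origin, dest) %>%
  summarise(n_carriers = n_distinct(carrier), .groups = "drop") %>%
  # keep every route tied for the largest number of carriers
  filter(n_carriers == max(n_carriers))

print(route_carriers)
```

Filtering on max(n_carriers) instead of taking only the first row keeps ties, which matters because the question asks to list all of the most competitive routes.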