DAS522/DAS241 MIDTERM EXAM PART2

Academic Honesty Statement (fill your name in the blank)

I, tianlu sui, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.

Load packages

# load required packages here
library(tidyverse)     # attaches dplyr and ggplot2, among others
library(openintro)
library(nycflights13)  # provides the flights data set

1. The `mpg` data set

After loading the tidyverse library, a data set named `mpg` is available to explore. The following questions are based on this data set.

a) Create a new variable `mpg_overall` which is the average of city and highway fuel consumption in miles per gallon. Then create a histogram of this new variable with each group covering values of 20-22, 22-24 etc.

?mpg

## starting httpd help server ... done

# Enter code here.
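mpg %>%
  mutate(mpg_overall = (hwy + cty) / 2) %>%  # average of highway and city mpg
  ggplot(aes(x = mpg_overall)) +
  # boundary = 0 puts the bin edges on even values, so bins cover 20-22, 22-24, etc.
  geom_histogram(binwidth = 2, boundary = 0)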
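
As a cross-check on the bin layout, the sketch below tabulates the same 2-mpg intervals without plotting. It assumes bin edges on even values and uses `cut_width()` from ggplot2, whose intervals are closed on the right by default; the names `bin_counts` and `n_cars` are introduced here for illustration only.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set and cut_width()

# Assign each car to a 2-mpg interval whose edges sit on even values
bin_counts <- mpg %>%
  mutate(
    mpg_overall = (hwy + cty) / 2,
    bin = cut_width(mpg_overall, width = 2, boundary = 0)
  ) %>%
  count(bin, name = "n_cars")  # one row per non-empty interval

bin_counts
```

Each row of `bin_counts` should correspond to one bar of a histogram drawn with `binwidth = 2` and `boundary = 0`.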

b) Create a graph to study the relationship between drive train types and `mpg_overall`.

# Enter code here.
mpg %>%
  mutate(mpg_overall = (hwy + cty) / 2) %>%
  ggplot(aes(x = drv, y = mpg_overall)) +
  geom_boxplot()

Answer: Front-wheel-drive vehicles achieve the highest overall miles per gallon. Rear-wheel-drive and four-wheel-drive vehicles are comparable, although the median for four-wheel drive is lower than for rear-wheel drive.
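
The comparison read off the boxplot can also be backed by numbers. This is a minimal sketch rather than part of the required answer; the names `drv_summary`, `n_cars`, `median_mpg` and `mean_mpg` are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Median and mean overall mpg for each drive train type (4, f, r)
drv_summary <- mpg %>%
  mutate(mpg_overall = (hwy + cty) / 2) %>%
  group_by(drv) %>%
  summarise(
    n_cars = n(),
    median_mpg = median(mpg_overall),
    mean_mpg = mean(mpg_overall)
  ) %>%
  arrange(desc(median_mpg))

drv_summary
```

The `median_mpg` column corresponds to the centre line of each box, so the ordering of the rows should match the ordering of the medians in the plot.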

c) Create a table to find out which car class has the highest mean `mpg_overall`.

# Enter code here.
mpg %>%
  mutate(mpg_overall = (hwy + cty) / 2) %>%
  # group by class only; grouping by mpg_overall as well would average each value with itself
  group_by(class) %>%
  summarise(avg_mpg_overall = mean(mpg_overall, na.rm = TRUE)) %>%
  arrange(desc(avg_mpg_overall))

Answer: subcompact

d) Create a proper graph to study the composite effect of `year` and `cyl` to `mpg_overall`. You shall treat `year` and `cyl` as categorical variables in your graph.

table(mpg$cyl)

## 
##  4  5  6  8 
## 81  4 79 70

# Enter code here.
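mpg %>%
  mutate(mpg_overall = (hwy + cty) / 2) %>%
  ggplot(aes(x = mpg_overall)) +
  # an explicit binwidth avoids the default of 30 bins
  geom_histogram(binwidth = 2) +
  # faceting treats cyl (rows) and year (columns) as categorical
  facet_grid(cyl ~ year)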

Answer: Five-cylinder vehicles appear only in 2008, and only four of them are in the data set. Four-cylinder models mostly achieve 20-30 miles per gallon, six-cylinder models 15-25, and eight-cylinder models 10-20, so fuel economy falls as the number of cylinders rises. Comparing years, vehicles with the same number of cylinders tend to be more fuel-efficient in 2008 than in 1999.
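
The ranges quoted in the answer are read from the faceted histograms. A summary table makes the same comparison easier to check; the sketch below is one way to build it, with `cyl_year_summary`, `n_cars` and `mean_mpg` as names introduced here for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Mean overall mpg for every cylinder/year combination present in the data
cyl_year_summary <- mpg %>%
  mutate(
    mpg_overall = (hwy + cty) / 2,
    cyl = factor(cyl),    # treat both variables as categorical
    year = factor(year)
  ) %>%
  group_by(cyl, year) %>%
  summarise(
    n_cars = n(),
    mean_mpg = mean(mpg_overall),
    .groups = "drop"
  )

cyl_year_summary
```

Reading down the table compares cylinder counts within a year; comparing the two rows for the same `cyl` compares 1999 with 2008.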

2. The `flights` data set

For the following tasks, use data set flights of the nycflights13 package.

a) For JFK airport, which day in November 2013 has the biggest average arrival delay? Create a table to answer the question.

flights

# Enter code here.
flights %>%
  filter(origin == "JFK", month == 11, !is.na(arr_delay)) %>%
  group_by(day) %>%
  summarise(
    total_arr_delay = sum(arr_delay),
    flights = n(),
    avg_arr_delay = total_arr_delay / flights
  ) %>%
  arrange(desc(avg_arr_delay))

Answer: November 27

b) Create a new variable `cancel_flight` which is `Cancelled` if the departure time or arrival time is `NA`, otherwise `Not Cancelled`.

# Enter code here.
my_flights <- flights %>%
  mutate(
    # a flight counts as cancelled when it has no departure time or no arrival time
    cancel_flight = ifelse(is.na(dep_time) | is.na(arr_time), "cancelled", "non-cancelled")
  )
my_flights

Answer:

c) Create a density graph that compares the distribution of `distance` between cancelled flights and non-cancelled flights.

# Enter code here.
my_flights %>%
  ggplot(aes(x = distance, fill = cancel_flight)) +
  geom_density(adjust = 2, alpha = 0.5)

d) How many unique flight routes are there in the data set? That is, each unique combination of an origin airport and a destination airport (such as from EWR to ORD) is considered as a route. Create a table to answer the question.

# Enter code here.
flights %>%
  distinct(origin, dest) %>%
  arrange(origin, dest)

Answer: 224 routes
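
The route count is the number of rows in the table of distinct origin and destination pairs. A minimal sketch that returns it as a single number, with `n_routes` as a name introduced here for illustration:

```r
library(dplyr)
library(nycflights13)  # provides the flights data set

# Count the unique origin/destination pairs
n_routes <- flights %>%
  distinct(origin, dest) %>%
  nrow()

n_routes
```

The printed value is the figure to compare against the answer above.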

e) Add `distance` as a column to the table you created in d).

Hint: You should go back to the original flights data set and reconstruct the table with distance included. Create a histogram of distance for the route table.

# Enter code here.
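flights_route <- flights %>%
  # keep exactly one row per route; .keep_all retains the first distance recorded for it
  distinct(origin, dest, .keep_all = TRUE) %>%
  select(origin, dest, distance)

ggplot(flights_route, aes(x = distance)) +
  geom_histogram(binwidth = 200)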

f) Which route has the highest rate of flight cancellation? Create a table to answer the question.

# Enter code here.
my_flights %>%
  group_by(origin, dest) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "cancelled"),
    cancel_rate = cancelled_flights / total_flights,
    .groups = "drop"  # return an ungrouped table
  ) %>%
  arrange(desc(cancel_rate))

Answer: EWR–LGA
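
A cancellation rate computed from only a handful of flights can be extreme, so it may help to look at the ranking again after dropping rarely flown routes. The sketch below assumes a cut-off of 100 flights per route; `min_flights`, `route_cancel` and `cancelled` are names introduced here and are not part of the exam question.

```r
library(dplyr)
library(nycflights13)  # provides the flights data set

min_flights <- 100  # hypothetical cut-off, chosen only for illustration

route_cancel <- flights %>%
  # same cancellation rule as in part b): no departure time or no arrival time
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>%
  group_by(origin, dest) %>%
  summarise(
    total_flights = n(),
    cancel_rate = mean(cancelled),  # share of cancelled flights on the route
    .groups = "drop"
  ) %>%
  filter(total_flights >= min_flights) %>%
  arrange(desc(cancel_rate))

route_cancel
```

Keeping `total_flights` in the table shows how much data stands behind each rate.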

Bonus Question for `flights` data set

The following questions are also from flights data set. Each question is worth 5% bonus points if answered correctly.

a) Create a proper graph to show the rate of cancellation flights for each airline. Answer which airline has the lowest rate of cancellation.

# Enter code here.
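carrier_cancel <- my_flights %>%
  group_by(carrier) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "cancelled"),
    cancel_rate = cancelled_flights / total_flights
  )

# order the bars by rate so the lowest and highest carriers are easy to read off
ggplot(carrier_cancel, aes(x = reorder(carrier, cancel_rate), y = cancel_rate)) +
  geom_col() +
  labs(x = "carrier")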

Answer: HA
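
The lowest bar can be hard to pick out by eye, so the sketch below selects the carrier with the smallest rate directly and attaches the airline name from the `airlines` table in nycflights13. The names `carrier_rates`, `lowest` and `cancelled` are introduced here for illustration.

```r
library(dplyr)
library(nycflights13)  # provides the flights and airlines data sets

carrier_rates <- flights %>%
  # same cancellation rule as in part b): no departure time or no arrival time
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>%
  group_by(carrier) %>%
  summarise(
    total_flights = n(),
    cancel_rate = mean(cancelled)
  ) %>%
  left_join(airlines, by = "carrier")  # adds the full airline name

# Keep the carrier(s) with the lowest rate, including any ties
lowest <- carrier_rates %>%
  slice_min(cancel_rate, n = 1, with_ties = TRUE)

lowest
```

With `with_ties = TRUE`, more than one row is returned if several carriers share the lowest rate.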

b) If multiple airlines run the same route, they can be considered as competitors. Which route is most competitive (has the most number of carriers)? List all of them in a table.

# Enter code here.
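flights %>%
  group_by(origin, dest) %>%
  # n_distinct(carrier) counts airlines on the route; n() would count flights instead
  summarise(total_carrier = n_distinct(carrier), .groups = "drop") %>%
  arrange(desc(total_carrier))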

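Answer: JFK–LAX has the most flights of any route; whether it also has the most carriers depends on counting distinct `carrier` values per route rather than rows.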
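
The question also asks for the competing carriers to be listed. One possible sketch is shown below: it counts the distinct carriers on each route, keeps the route or routes with the highest count, and writes out the carrier codes. The names `route_carriers`, `most_competitive`, `n_carriers` and `carriers` are introduced here for illustration.

```r
library(dplyr)
library(nycflights13)  # provides the flights data set

route_carriers <- flights %>%
  distinct(origin, dest, carrier) %>%  # one row per carrier per route
  group_by(origin, dest) %>%
  summarise(
    n_carriers = n(),                                  # carriers serving the route
    carriers = paste(sort(carrier), collapse = ", "),  # their codes as one string
    .groups = "drop"
  )

# Keep the route(s) with the most carriers, including any ties
most_competitive <- route_carriers %>%
  slice_max(n_carriers, n = 1, with_ties = TRUE)

most_competitive
```

If several routes tie for the highest carrier count, all of them appear in `most_competitive`.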

DAS522/DAS241 MIDTERM EXAM PART2 - STUDENT TEMPLATE

[ENTER NAME HERE]

Mar 11 2026
