DAS522/DAS241 MIDTERM EXAM PART2

1. The `mpg` data set

After loading the tidyverse library, a data set named `mpg` is ready to explore. The following questions are based on this data set.

a) Create a new variable `mpg_overall` which is the average of city and highway fuel consumption in miles per gallon. Then create a histogram of this new variable with each group covering values of 20-22, 22-24 etc.

# Enter code here.
mpg <- mutate(mpg, mpg_overall = (cty + hwy) / 2)

ggplot(data = mpg, mapping = aes(x = mpg_overall)) +
  geom_histogram(binwidth = 2,
                 boundary = 0,
                 fill = "steelblue",
                 color = "white"
  )
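As a quick check on the binning, the sketch below rebuilds the same histogram and reads back the bin edges with `layer_data()`. With `binwidth = 2` and `boundary = 0`, each bin should be 2 units wide and start on an even number, so the bins cover 20-22, 22-24, and so on. The names `mpg_demo`, `p`, and `bins` are introduced only for this illustration.

```r
library(dplyr)
library(ggplot2)  # also provides the mpg data set

# Work on a copy so the original mpg object is left untouched.
mpg_demo <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2)

p <- ggplot(mpg_demo, aes(x = mpg_overall)) +
  geom_histogram(binwidth = 2, boundary = 0)

# layer_data() returns the data ggplot2 computed for the first layer,
# including the lower and upper edge of every bin.
bins <- layer_data(p)[, c("xmin", "xmax", "count")]
bins
```

If the edges were odd numbers instead, changing `boundary` would be the setting to revisit.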

b) Create a graph to study the relationship between drive train types and `mpg_overall`.

# Enter code here.
ggplot(data = mpg, mapping = aes(x = drv, y = mpg_overall, fill = drv)) +
  geom_boxplot() +
  labs(
    x = "Drive Train (f = front, r = rear, 4 = 4wd)",
    y = "Overall MPG"
  )

Answer:
The distributions for 4 (four-wheel drive) and r (rear-wheel drive) are similar, whereas f (front-wheel drive) is noticeably higher in overall mpg. The f group also has the most outliers, which are cars that save even more fuel.

c) Create a table to find out which car class has the highest mean `mpg_overall`.

# Enter code here.
mpg_sum <- mpg %>%
  group_by(class) %>%
  summarize(mean_mpg = mean(mpg_overall)) %>%
  arrange(desc(mean_mpg))
mpg_sum

Answer:
The table shows that compact and subcompact cars have the highest mean overall mpg and therefore save the most fuel, followed by midsize and 2seater.

d) Create a proper graph to study the composite effect of `year` and `cyl` on `mpg_overall`. You shall treat `year` and `cyl` as categorical variables in your graph.

# Enter code here.
ggplot(data = mpg, mapping = aes(x = factor(cyl), y = mpg_overall, fill = factor(year))) +
  geom_boxplot()
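A numeric summary can sit alongside the grouped box plot. The sketch below, which is an addition and not part of the original submission, tabulates the group size and median `mpg_overall` for every combination of `cyl` and `year`. The name `cyl_year_summary` is introduced only for this illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

cyl_year_summary <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2) %>%
  group_by(cyl, year) %>%
  summarize(
    n = n(),                          # cars in this cyl/year group
    median_mpg = median(mpg_overall), # centre line of the matching box
    .groups = "drop"
  )

cyl_year_summary
```

Reading across the rows for one `cyl` value shows the year effect within that cylinder count; reading down shows the cylinder effect. Groups with a small `n` deserve caution.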

Answer:
Cars with more cylinders draw more power, which leads to higher fuel consumption, i.e. a lower `mpg_overall`.

2. The `flights` data set

For the following tasks, use the `flights` data set from the nycflights13 package.

a) For JFK airport, which day in November 2013 has the biggest average arrival delay? Create a table to answer the question.

# Enter code here.
nov_delays <- flights %>%
  filter(origin == "JFK", month == 11) %>%
  group_by(day) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay))

nov_delays

Answer:
According to the table, November 27 has the largest average arrival delay.

b) Create a new variable `cancel_flight` which is Cancelled if the departure time or arrival time is NA, otherwise Not Cancelled.

# Enter code here.
flights <- flights %>%
  mutate(cancel_flight = if_else(is.na(dep_time) | is.na(arr_time), 
                                 "Cancelled", 
                                 "Not Cancelled"))
flights %>% 
  select(dep_time, arr_time, cancel_flight)

Answer:

c) Create a density graph that compares the distribution of `distance` between cancelled flights and non-cancelled flights.

# Enter code here.
ggplot(data = flights, mapping = aes(x = distance, fill = cancel_flight)) +
  geom_density(alpha = 0.5) + 
  labs(
    title = "Cancelled vs. Non-Cancelled Flights",
    x = "Distance (miles)",
    y = "Density",
    fill = "Flight Status"
  )

d) How many unique flight routes are there in the data set? That is, each unique combination of an origin airport and a destination airport (such as from EWR to ORD) is considered as a route. Create a table to answer the question.

# Enter code here.
unique_routes <- flights %>%
  distinct(origin, dest)
number_of_routes <- nrow(unique_routes)

number_of_routes

## [1] 224

Answer:
There are 224 unique routes departing from New York.

e) Add distance as a column to the table you created in d).

Hint: You should go back to the original flights data set and reconstruct the table with distance included. Create a histogram of distance for the route table.

# Enter code here.
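No code was submitted for this part. One possible approach, shown here as a hedged sketch only, is to group by route and keep one distance per route, then plot the result. To stay self-contained the example runs on a small made-up data frame, `flights_demo`, that borrows the column names of `flights`; its values are illustrative. Applying the same pipeline to `flights` would give the route table the question asks for. The names `routes_with_distance` and `route_hist` are assumptions of this sketch.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for flights (illustrative values only).
flights_demo <- data.frame(
  origin   = c("EWR", "EWR", "JFK", "JFK", "LGA"),
  dest     = c("ORD", "ORD", "LAX", "LAX", "ORD"),
  distance = c(719, 719, 2475, 2475, 733)
)

# One row per route. Taking the mean keeps the table at one row per
# route even if a route were recorded with more than one distance.
routes_with_distance <- flights_demo %>%
  group_by(origin, dest) %>%
  summarize(distance = mean(distance), .groups = "drop")

routes_with_distance

# Histogram of distance over routes (not over individual flights).
route_hist <- ggplot(routes_with_distance, aes(x = distance)) +
  geom_histogram(bins = 30)
route_hist
```

Summarizing per route rather than calling `distinct(origin, dest, distance)` is a design choice: it guarantees the row count still matches the number of routes found in d).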

f) Which route has the highest rate of flight cancellation? Create a table to answer the question.

# Enter code here.
cancel_rates <- flights %>%
  group_by(origin, dest) %>%
  summarize(
    num_flights = n(),
    cancel_rate = mean(cancel_flight == "Cancelled"),
    .groups = "drop"
  ) %>%
  filter(num_flights > 10) %>%
  arrange(desc(cancel_rate))

cancel_rates
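The `filter(num_flights > 10)` step presumably guards against routes with very few flights, where a single cancellation produces an extreme rate. The toy example below illustrates that effect with made-up data; the data frame `toy` and the airport codes A, B, and C are hypothetical.

```r
library(dplyr)

# Route A-B: 1 flight, cancelled. Route A-C: 20 flights, 2 cancelled.
toy <- data.frame(
  origin = "A",
  dest = c("B", rep("C", 20)),
  cancel_flight = c("Cancelled",
                    rep("Cancelled", 2), rep("Not Cancelled", 18))
)

toy_rates <- toy %>%
  group_by(origin, dest) %>%
  summarize(
    num_flights = n(),
    cancel_rate = mean(cancel_flight == "Cancelled"),
    .groups = "drop"
  ) %>%
  arrange(desc(cancel_rate))

# Without the filter, the single-flight route A-B ranks first.
toy_rates

# With the filter, only the well-sampled route A-C remains.
toy_filtered <- filter(toy_rates, num_flights > 10)
toy_filtered
```

The threshold of 10 is taken from the code above; a different cut-off could change which route ranks first.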

Answer:
The route with the highest cancellation rate is from LaGuardia (LGA) to MHT airport, at approximately 24%.

DAS522/DAS241 MIDTERM EXAM PART2 - STUDENT TEMPLATE

[Duc Vinh Hoang]

Mar 13 2026

Academic Honesty Statement (fill your name in the blank)

Load packages
