I, Boyang Liu, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
# load required packages here
library(tidyverse)
library(nycflights13)
mpg data setAfter loading tidyverse library, a data set named
mpg should be ready to explore. The following questions are
based on this data set.
mpg_overall which is the
average of city and highway fuel consumption in miles per gallon. Then
create a histogram of this new variable with each group covering values
of 20-22, 22-24 etc.mpg2 <- mpg %>%
mutate(mpg_overall = (cty + hwy) / 2)
ggplot(mpg2, aes(x = mpg_overall)) +
geom_histogram(binwidth = 2, boundary = 20, fill = "skyblue", color = "black") +
labs(
title = "Histogram of Overall MPG",
x = "Overall MPG",
y = "Count"
) +
theme(plot.title = element_text(hjust = 0.5))
mpg_overall.ggplot(mpg2, aes(x = drv, y = mpg_overall, fill = drv)) +
geom_boxplot() +
labs(
title = "Relationship Between Drive Train and Overall MPG",
x = "Drive Train Type",
y = "Overall MPG"
) +
theme(plot.title = element_text(hjust = 0.5))
Answer:
Overall MPG is different across drive
train types. In general, front-wheel drive cars tend to have higher fuel
efficiency, while four-wheel drive cars tend to have lower fuel
efficiency.
mpg_overall.mpg_class_mean <- mpg2 %>%
group_by(class) %>%
summarise(mean_mpg_overall = mean(mpg_overall), groups = "drop") %>%
arrange(desc(mean_mpg_overall))
mpg_class_mean
Answer:
The car class with the highest mean
overall mpg is subcompact. This means that class has the best average
fuel efficiency among all classes.
year and cyl to mpg_overall. You
shall treat year and cyl as categorical
variables in your graph.ggplot(mpg2, aes(x = factor(cyl), y = mpg_overall, fill = factor(year))) +
geom_boxplot() +
labs(
title = "Composite Effect of Year and Cylinders",
x = "Numbers",
y = "Overall MPG"
) +
theme(plot.title = element_text(hjust = 0.5))
Answer:
Overall MPG decreases as the number of
cylinders increeases. Cars from 2008 generally have slightly higher fuel
efficiency than cars from 1999 within the same cylinder group.
flights data setFor the following tasks, use data set flights of the
nycflights13 package.
jfk_nov_delay <- flights %>%
filter(origin == "JFK", year == 2013, month ==11, !is.na(arr_delay)) %>%
group_by(day) %>%
summarise(avg_arr_delay = mean(arr_delay), .groups ="drop") %>%
arrange(desc(avg_arr_delay))
jfk_nov_delay
Answer:
The day in November 2013 with the
biggest average arrival delay for JFK is 27th.
cancel_flight which is
Cancelled if the departure time or arrival time is
NA, otherwise Not Cancelled.flights2 <- flights %>%
mutate(
cancel_flight = if_else(
is.na(dep_time) | is.na(arr_time),
"Cancelled",
"Not Cancelled"
)
)
flights2
Answer:
A new variable is created. If either
departure time and arrival time is missing, the flight is labeled as
Cancelled; otherwise, it is labeled as Not Cancelled.
distance between cancelled flights and non-cancelled
flights.ggplot(flights2, aes(x = distance, fill = cancel_flight)) +
geom_density(alpha = 0.5) +
labs(
title = "Distribution of Distance by Flight Cancellaion",
x = "Distance",
y = "Density"
) +
theme(plot.title = element_text(hjust = 0.5))
route_table <- flights %>%
distinct(origin, dest) %>%
arrange(origin, dest)
route_table
route_table %>%
summarise(num_routes = n())
Answer:
There are 224 unique flight routes in
the data set.
distance as a column to the table you created in
d).Hint: You should go back to the original flights data
set and reconstruct the table with distance included. Create a histogram
of distance for the route table.
route_table2 <- flights %>%
distinct(origin, dest, distance) %>%
arrange(origin, dest)
route_table2
ggplot(route_table2, aes(x = distance)) +
geom_histogram(binwidth = 100, fill = "orange") +
labs(
title = "Histogram of Route Distance",
x = "Distance",
y = "Count"
) +
theme(plot.title = element_text(hjust = 0.5))
route_cancel_rate <- flights2 %>%
group_by(origin, dest) %>%
summarise(
total_flights = n(),
cancelled_flights = sum(cancel_flight == "Cancelled"),
cancellation_rate = cancelled_flights / total_flights,
.groups = "drop"
) %>%
arrange(desc(cancellation_rate))
route_cancel_rate
Answer:
The route with the highest cancellation
rate is EWR to LCA, with a cancellation rate of 1.00.
flights data setThe following questions are also from flights data set.
Each question is worth 5% bonus points if answered correctly.
airline_cancel_rate <- flights2 %>%
group_by(carrier) %>%
summarise(
total_flights = n(),
cancelled_flights = sum(cancel_flight == "Cancelled"),
cancellation_rate = cancelled_flights / total_flights,
.groups = "drop"
) %>%
arrange(cancellation_rate)
airline_cancel_rate
ggplot(airline_cancel_rate, aes(x = reorder(carrier, cancellation_rate), y = cancellation_rate, fill = carrier)) +
geom_col() +
labs(
title = "Cancellation Rate by Airline",
x = "Airline",
y = "Cancellation Rate"
) +
theme(plot.title = element_text(hjust = 0.5))
Answer:
The airline with the lowest
cancellation rate is HA.
route_comp <- flights %>%
group_by(origin, dest) %>%
summarise(
num_carriers = n_distinct(carrier),
.groups = "drop"
) %>%
arrange(desc(num_carriers))
route_comp
max_carriers <- max(route_comp$num_carriers)
route_comp %>%
filter(num_carriers == max_carriers)
Answer:
The most competitive routes are the
ones operated by the highest number of airlines. In this data set, the
maximum number of carriers on a route is 5.