I, Yi Tao Wang, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
library(tidyverse)
library(openintro)
library(nycflights13)
mpg data set
After loading the tidyverse library, a data set named
mpg should be ready to explore. The following questions are
based on this data set.
Create a new variable mpg_overall which is the
average of city and highway fuel consumption in miles per gallon. Then
create a histogram of this new variable with each group covering values
of 20-22, 22-24, etc.
mpg <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2)
ggplot(mpg, aes(x = mpg_overall)) +
  geom_histogram(binwidth = 2, boundary = 20, fill = "steelblue", color = "white")
mpg_overall.
ggplot(mpg, aes(x = drv, y = mpg_overall, fill = drv)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Relationship: Drive Train vs. Overall MPG", x = "Drive Train", y = "Overall MPG")
Answer: Front-wheel drive (f) vehicles
have the highest overall MPG, that is, the best combined fuel economy, followed by
rear-wheel drive (r). Four-wheel drive (4)
vehicles have the lowest overall MPG.
mpg_overall.
mpg %>%
  group_by(class) %>%
  summarize(mean_mpg = mean(mpg_overall)) %>%
  arrange(desc(mean_mpg))
Answer: The subcompact class has the highest average overall MPG (about 24.3), closely followed by the compact class.
year and cyl to mpg_overall. You
shall treat year and cyl as categorical
variables in your graph.
ggplot(mpg, aes(x = factor(cyl), y = mpg_overall, fill = factor(year))) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  labs(title = "Effect of Engine Cylinders and Year on Overall MPG", x = "Number of Cylinders", y = "Overall MPG", fill = "Year")
Answer: Overall MPG decreases as the number of cylinders increases.
flights data set
For the following tasks, use the flights data set from the
nycflights13 package.
flights %>%
  filter(origin == "JFK", year == 2013, month == 11) %>%
  group_by(day) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay)) %>%
  head(1)
Answer: November 27 had the largest average arrival delay for flights originating from JFK, approximately 21.33 minutes.
Create a new variable cancel_flight which is
Cancelled if the departure time or arrival time is
NA, otherwise Not Cancelled.
cancel_flight <- flights %>%
  mutate(cancel_flight = ifelse(is.na(dep_time) | is.na(arr_time), "Cancelled", "Not Cancelled"))
cancel_flight %>%
  count(cancel_flight)
Answer: 8713 cancelled flights.
distance between cancelled flights and non-cancelled
flights.
ggplot(cancel_flight, aes(x = distance, fill = cancel_flight)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Distance Distribution: Cancelled vs. Not Cancelled Flights",
    x = "Flight Distance (miles)",
    y = "Density",
    fill = "Flight Status"
  )
route_table <- flights %>%
  distinct(origin, dest)
nrow(route_table)
## [1] 224
Answer: 224 unique flight routes.
Add distance as a column to the table you created in
d).
Hint: You should go back to the original flights data
set and reconstruct the table with distance included. Create a histogram
of distance for the route table.
route_distance_table <- flights %>%
  distinct(origin, dest, distance)
ggplot(route_distance_table, aes(x = distance)) +
  geom_histogram(fill = "darkgreen", color = "black") +
  labs(
    title = "Distribution of Distances for Unique Flight Routes",
    x = "Distance (miles)",
    y = "Number of Unique Routes"
  )
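The route table with distance relies on each origin and destination pair having exactly one recorded distance; if a route appears with two slightly different distances, distinct(origin, dest, distance) returns more than one row for it. As a minimal sketch of how that assumption could be checked, the helper routes_with_multiple_distances below is hypothetical (it is not part of the exam or of any package), and the toy data is made up for illustration rather than taken from flights.

```r
library(dplyr)

# Hypothetical helper: list routes that appear with more than one distance value.
routes_with_multiple_distances <- function(df) {
  df %>%
    distinct(origin, dest, distance) %>%              # one row per route/distance combination
    count(origin, dest, name = "n_distances") %>%     # number of distinct distances per route
    filter(n_distances > 1)                           # keep only routes with conflicting distances
}

# Toy data (made up for illustration, not taken from flights):
# route AAA -> XXX is recorded with two different distances.
toy <- tibble(
  origin   = c("AAA", "AAA", "AAA", "BBB"),
  dest     = c("XXX", "XXX", "YYY", "XXX"),
  distance = c(100, 101, 250, 300)
)

dupes <- routes_with_multiple_distances(toy)
print(dupes)
```

Calling routes_with_multiple_distances(flights) would show whether route_distance_table can contain more rows than the number of routes counted in d).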
flights %>%
  mutate(cancel_flight = ifelse(is.na(dep_time) | is.na(arr_time), "Cancelled", "Not Cancelled")) %>%
  group_by(origin, dest) %>%
  summarize(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "Cancelled"),
    cancellation_rate = cancelled_flights / total_flights,
    .groups = 'drop'
  ) %>%
  arrange(desc(cancellation_rate)) %>%
  head(5)
Answer: Based on the table, the route from EWR to LGA has the highest cancellation rate.
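A ranking by cancellation rate can be dominated by routes with very few flights, since a single cancelled flight on a one-flight route already gives a rate of 1. The sketch below shows one possible way to make the ranking more robust by requiring a minimum number of flights per route. The helper route_cancel_rates, its min_flights parameter, and the default threshold of 100 are all assumptions introduced here for illustration, and the toy data is made up.

```r
library(dplyr)

# Hypothetical helper: cancellation rate per route, keeping only routes
# with at least `min_flights` flights (the default of 100 is an arbitrary assumption).
route_cancel_rates <- function(df, min_flights = 100) {
  df %>%
    group_by(origin, dest) %>%
    summarize(
      total_flights = n(),
      # A flight counts as cancelled if departure or arrival time is missing
      cancelled_flights = sum(is.na(dep_time) | is.na(arr_time)),
      cancellation_rate = cancelled_flights / total_flights,
      .groups = "drop"
    ) %>%
    filter(total_flights >= min_flights) %>%
    arrange(desc(cancellation_rate))
}

# Toy data (made up for illustration): one single-flight route that was
# cancelled, and one four-flight route with one cancellation.
toy <- tibble(
  origin   = c("AAA", rep("BBB", 4)),
  dest     = c("XXX", rep("YYY", 4)),
  dep_time = c(NA, 900, 915, NA, 1000),
  arr_time = c(NA, 1100, 1130, NA, 1215)
)

unfiltered <- route_cancel_rates(toy, min_flights = 1)  # single-flight route ranks first
filtered   <- route_cancel_rates(toy, min_flights = 2)  # single-flight route is dropped
print(unfiltered)
print(filtered)
```

With the flights data, comparing route_cancel_rates(flights, min_flights = 1) against a higher threshold would show how much the top of the ranking depends on rarely flown routes.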
flights data set
The following questions are also based on the flights data set.
Each question is worth 5% bonus points if answered correctly.
carrier_cancellations <- flights %>%
  group_by(carrier) %>%
  summarize(cancel_rate = mean(is.na(dep_time) | is.na(arr_time))) %>%
  arrange(cancel_rate)
ggplot(carrier_cancellations, aes(x = reorder(carrier, cancel_rate), y = cancel_rate)) +
  geom_col(fill = "coral") +
  coord_flip() +
  labs(
    title = "Cancellation Rate by Airline",
    x = "Airline (Carrier Code)",
    y = "Cancellation Rate"
  )
Answer: HA has the lowest cancellation rate.
# Enter code here.
Answer: