I, Franklin Li, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
library(tidyverse)
library(openintro)
library(nycflights13)
mpg data set
After loading the tidyverse library, a data set named
mpg should be ready to explore. The following questions are
based on this data set.
Create a new variable mpg_overall, which is the
average of city and highway fuel consumption in miles per gallon. Then
create a histogram of this new variable with each group covering values
of 20-22, 22-24, etc.
mpg_mpg <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2)
ggplot(mpg_mpg, aes(x = mpg_overall)) +
  geom_histogram(binwidth = 2, boundary = 20, color = "black", fill = "skyblue") +
  labs(
    title = "Histogram of Overall MPG",
    x = "Overall MPG",
    y = "Count"
  ) +
  theme_minimal()
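The bin edges in the histogram can be cross-checked by tabulating the same 2-mpg groups directly. This is a minimal sketch, assuming ggplot2 (which ships the mpg data set) and dplyr are installed; the names breaks and bin_counts are introduced here only for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Average of city and highway mileage, as in the answer above.
mpg_overall <- (mpg$cty + mpg$hwy) / 2

# Even-numbered break points that span the observed range (..., 20, 22, 24, ...).
breaks <- seq(floor(min(mpg_overall) / 2) * 2,
              ceiling(max(mpg_overall) / 2) * 2,
              by = 2)

# Count the cars falling in each 2-mpg interval.
bin_counts <- data.frame(mpg_overall = mpg_overall) %>%
  mutate(bin = cut(mpg_overall, breaks = breaks, include.lowest = TRUE)) %>%
  count(bin)

bin_counts
```

The counts per interval should line up with the bar heights of the histogram drawn with binwidth = 2 and boundary = 20.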
mpg_overall.
ggplot(mpg_mpg, aes(x = drv, y = mpg_overall, fill = drv)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Overall MPG by Drive Train Type",
    x = "Drive Train",
    y = "Overall MPG"
  ) +
  theme_minimal()
Answer:
Front-wheel-drive cars tend to have the highest
overall mpg, while four-wheel-drive cars tend to have the worst fuel
economy of the three types.
mpg_overall.
class_mpg <- mpg_mpg %>%
  group_by(class) %>%
  summarise(mean_mpg_overall = mean(mpg_overall), .groups = "drop") %>%
  arrange(desc(mean_mpg_overall))
class_mpg
Answer:
The car class with the highest overall mpg
is subcompact, followed closely by compact and midsize.
year and cyl to mpg_overall. You
shall treat year and cyl as categorical
variables in your graph.
ggplot(
  mpg_mpg,
  aes(x = factor(cyl), y = mpg_overall, fill = factor(year))
) +
  geom_boxplot() +
  labs(
    title = "Effect of Year and Cylinders on Overall MPG",
    x = "Cylinders",
    y = "Overall MPG",
    fill = "Year"
  ) +
  theme_minimal()
Answer:
Cars with more cylinders have worse
fuel economy than those with fewer cylinders. Cars made in the
more recent years tend to have higher fuel efficiency.
flights data set
For the following tasks, use the flights data set from the
nycflights13 package.
jfk_nov_delay <- flights %>%
  filter(origin == "JFK", month == 11) %>%
  group_by(day) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(avg_arr_delay))
jfk_nov_delay
Answer:
The day in November with the longest average arrival delay
was the 27th, with an average delay of 21 minutes (arr_delay is recorded in minutes).
Create a new variable cancel_flight, which is
Cancelled if the departure time or arrival time is
NA, otherwise Not Cancelled.
flights_can <- flights %>%
  mutate(
    cancel_flight = if_else(is.na(dep_time) | is.na(arr_time),
                            "Cancelled",
                            "Not Cancelled")
  )
flights_can %>%
  count(cancel_flight)
Answer:
There were 8,713 cancelled flights and
328,063 flights not cancelled.
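As a sanity check on these counts, the two groups should add up to the number of rows in flights. The sketch below also adds each group's share of all flights. It assumes dplyr and nycflights13 are installed; cancel_summary and prop are names introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Same cancellation rule as above: a missing departure or arrival time.
cancel_summary <- flights %>%
  mutate(
    cancel_flight = if_else(is.na(dep_time) | is.na(arr_time),
                            "Cancelled",
                            "Not Cancelled")
  ) %>%
  count(cancel_flight) %>%
  mutate(prop = n / sum(n))  # share of all flights in each group

cancel_summary

# The two counts together must cover every row of flights.
sum(cancel_summary$n) == nrow(flights)
```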
distance between cancelled flights and non-cancelled
flights.
ggplot(flights_can, aes(x = distance, fill = cancel_flight)) +
  geom_density(alpha = 0.4) +
  labs(
    title = "Distribution of Flight Distance by Cancellation Status",
    x = "Distance",
    y = "Density",
    fill = "Flight Status"
  ) +
  theme_minimal()
route_table <- flights %>%
  distinct(origin, dest)
route_table %>%
  summarise(num_unique_routes = n())
Answer:
There are 224 unique flight routes.
Add distance as a column to the table you created in
d). Hint: You should go back to the original flights data
set and reconstruct the table with distance included. Create a histogram
of distance for the route table.
route_distance_table <- flights %>%
  distinct(origin, dest, distance)
route_distance_table
ggplot(route_distance_table, aes(x = distance)) +
  geom_histogram(binwidth = 100, color = "black", fill = "orange") +
  labs(
    title = "Histogram of Route Distances",
    x = "Distance",
    y = "Count"
  ) +
  theme_minimal()
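Because distinct(origin, dest, distance) keeps one row per unique combination of all three columns, a route recorded with more than one distance would appear more than once in the table. The sketch below checks for that and shows one possible way to force a single row per route. It assumes dplyr and nycflights13 are installed; multi_distance_routes and route_distance_unique are illustrative names, and taking the median distance is an arbitrary choice.

```r
library(dplyr)
library(nycflights13)

# One row per unique (origin, dest, distance) combination.
route_distance_table <- flights %>%
  distinct(origin, dest, distance)

# Routes that show up with more than one recorded distance, if any.
multi_distance_routes <- route_distance_table %>%
  count(origin, dest, name = "n_distances") %>%
  filter(n_distances > 1)

multi_distance_routes

# If the check returns rows, summarising the distance keeps exactly
# one row per route (median used here purely for illustration).
route_distance_unique <- flights %>%
  group_by(origin, dest) %>%
  summarise(distance = median(distance), .groups = "drop")

nrow(route_distance_unique)
```

If multi_distance_routes is empty, the two tables are equivalent and the original approach needs no change.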
route_cancel_rate <- flights_can %>%
  group_by(origin, dest) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "Cancelled"),
    cancel_rate = cancelled_flights / total_flights,
    .groups = "drop"
  ) %>%
  arrange(desc(cancel_rate), desc(cancelled_flights))
route_cancel_rate
Answer:
Technically, EWR to LGA has the highest
cancel rate at 100% (1 cancelled out of 1 flight). The route with the
second-highest cancel rate is LGA to MHT: 34 of its 142 flights were
cancelled, a cancel rate of 23.9%.
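Since a route with a single flight can dominate the ranking, one option is to require a minimum number of flights before comparing cancel rates. This is only a sketch, assuming dplyr and nycflights13 are installed. The cutoff min_flights = 30 is a hypothetical value chosen for illustration, not part of the exam question, and route_cancel_rate_min is an illustrative name.

```r
library(dplyr)
library(nycflights13)

min_flights <- 30  # hypothetical cutoff, chosen only for illustration

route_cancel_rate_min <- flights %>%
  # Same cancellation rule as above: missing departure or arrival time.
  mutate(cancelled = is.na(dep_time) | is.na(arr_time)) %>%
  group_by(origin, dest) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancelled),
    cancel_rate = mean(cancelled),
    .groups = "drop"
  ) %>%
  # Drop routes with too few flights for the rate to be meaningful.
  filter(total_flights >= min_flights) %>%
  arrange(desc(cancel_rate))

head(route_cancel_rate_min)
```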
flights data set
The following questions are also from the flights data set.
Each question is worth 5% bonus points if answered correctly.
airline_cancel <- flights_can %>%
  group_by(carrier) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "Cancelled"),
    cancel_rate = cancelled_flights / total_flights,
    .groups = "drop"
  ) %>%
  left_join(airlines, by = "carrier") %>%
  arrange(cancel_rate)
airline_cancel
ggplot(airline_cancel, aes(x = reorder(name, cancel_rate), y = cancel_rate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Cancellation Rate by Airline",
    x = "Airline",
    y = "Cancellation Rate"
  ) +
  theme_minimal()
Answer:
Hawaiian Airlines (HA) had the lowest
cancel rate, with a total of 342 flights.
most_competitive_routes <- flights %>%
  distinct(origin, dest, carrier) %>%
  group_by(origin, dest) %>%
  summarise(num_carriers = n(), .groups = "drop") %>%
  filter(num_carriers == max(num_carriers)) %>%
  arrange(origin, dest)
most_competitive_routes
Answer:
Eight routes are tied for
the most carriers, with 5 carriers each.