I, Sang Dao, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
# load required packages here
library(tidyverse)
library(openintro)
library(nycflights13)
mpg data setAfter loading tidyverse library, a data set named
mpg should be ready to explore. The following questions are
based on this data set.
mpg_overall which is the
average of city and highway fuel consumption in miles per gallon. Then
create a histogram of this new variable with each group covering values
of 20-22, 22-24 etc.mpg["mpg_overall"] = (mpg["cty"] + mpg["hwy"]) / 2
ggplot(mpg, aes(x = mpg_overall)) + geom_histogram(binwidth = 2 , boundary = 0)
mpg_overall.ggplot(mpg, aes(x = drv, y = mpg_overall, fill = drv)) +
geom_boxplot()
Answer: “f” (front-wheel drive) definitely has the highest fuel scores, “4” (4wd) gets the worst gas. The fuel scores for “f” are more spread out => has a lot good outline and a bad one
mpg_overall.mpg %>%
group_by(class) %>%
summarise(mean_mpg_overall = mean(mpg_overall)) %>%
arrange(desc(mean_mpg_overall))
Answer: Subcompact has the highest mean mpg_overall.
year and cyl to mpg_overall. You
shall treat year and cyl as categorical
variables in your graph.ggplot(mpg, aes(x = as.factor(cyl), y = mpg_overall, fill = as.factor(year))) +
geom_boxplot()
Answer: A car with a higher number of cylinders will
have a lower mpg_overall score and with the same number of
cylinders the new cars are more saving fuel but still have a little
outline that the some old cars have the highest saving fuel score in
this graph.
flights data setFor the following tasks, use data set flights of the
nycflights13 package.
flights %>%
filter(origin == "JFK", year == 2013, month == 11) %>%
group_by(day) %>%
summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_arr_delay))
Answer: Day 27 has the biggest average arrival delay.
cancel_flight which is
Cancelled if the departure time or arrival time is
NA, otherwise Not Cancelled.flights_new <- flights %>%
mutate(cancel_flight = ifelse(is.na(dep_time) | is.na(arr_time), "Cancelled", "Not Cancelled"))
distance between cancelled flights and non-cancelled
flights.ggplot(flights_new, aes(x = distance, fill = cancel_flight)) +
geom_density(alpha = 0.5)
Answer: More flights in the short trip. And most cancelled flights is the short trip, and it is really hard to see the cancelled flights from more than 3000 distance.
flights %>%
summarise(total_unique_routes = n_distinct(origin, dest))
Answer: has 224 unique flight
distance as a column to the table you created in
d).Hint: You should go back to the original flights data
set and reconstruct the table with distance included. Create a histogram
of distance for the route table.
route_dist_table <- flights %>%
count(origin, dest, distance)
ggplot(route_dist_table, aes(x = distance)) +
geom_histogram(binwidth = 100)
flights_new %>%
group_by(origin, dest) %>%
summarise(
total_flights = n(), # This counts the flights in the group [1]
cancel_rate = mean(cancel_flight == "Cancelled") # This calculates the specific rate you need
) %>%
arrange(desc(cancel_rate))
Answer: The route from EWR to LGA has the highest rate of flight cancellation cause it only has 1 flight was cancelled. If not count and choose more objective sample we have LGA to MHT.
flights data setThe following questions are also from flights data set.
Each question is worth 5% bonus points if answered correctly.
carrier_cancel_rates <- flights %>%
group_by(carrier) %>%
summarise(
cancel_rate = mean(is.na(dep_time) | is.na(arr_time))
) %>%
arrange(cancel_rate)
head(carrier_cancel_rates)
ggplot(carrier_cancel_rates, aes(x = reorder(carrier, cancel_rate), y = cancel_rate)) +
geom_col()
Answer: HA has the lowest rate of cancellation.
competitive_routes <- flights %>%
group_by(origin, dest) %>%
summarise(num_carriers = n_distinct(carrier)) %>%
filter(num_carriers == max(num_carriers))
competitive_routes
Answer: There are the most competitive route with 5 number of carriers run on that.