I, Huu Hai Long Ngo (Jonathan), hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
library(tidyverse)
library(openintro)
library(nycflights13)
mpg data set
After loading the tidyverse library, a data set named
mpg should be ready to explore. The following questions are
based on this data set.
Create a new variable mpg_overall, which is the
average of city and highway fuel consumption in miles per gallon. Then
create a histogram of this new variable with each group covering values
of 20-22, 22-24, etc.
mpg_a <- mutate(mpg, mpg_overall = (cty + hwy) / 2)
ggplot(data = mpg_a) +
# boundary = 20 aligns the bin edges to even values (20-22, 22-24, ...)
geom_histogram(mapping = aes(x = mpg_overall), binwidth = 2, boundary = 20) +
labs(title = "Fuel Economy for Vehicles Made in 1999 and 2008", x = "Overall MPG", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
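As a cross-check on the bin layout, the counts behind the histogram can be tabulated directly. The sketch below is not part of the submitted answer. It assumes the intended bins have edges on even values, and bin_counts is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set and cut_width()

# Tabulate mpg_overall in 2-mpg-wide bins.
# boundary = 0 places the bin edges on even numbers (..., 20, 22, 24, ...).
bin_counts <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2) %>%
  count(bin = cut_width(mpg_overall, width = 2, boundary = 0))

bin_counts
```

Each row of bin_counts should correspond to one bar of a histogram drawn with the same width and boundary.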
mpg_overall.
ggplot(data = mpg_a, mapping = aes(x = drv, y = mpg_overall)) +
stat_boxplot(geom = "errorbar", width = 0.5) + geom_boxplot() +
labs(title = "Relationship between drive train types vs MPG", x = "Type of drive train", y = "Overall MPG") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: Based on this data set for car models
released between 1999 and 2008, front-wheel-drive vehicles have better overall
mpg, while rear-wheel-drive and 4-wheel-drive vehicles are less fuel efficient.
4-wheel-drive vehicles in particular can have a very low overall mpg.
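The reading of the boxplot can be backed up with a numeric summary per drive train. This is a minimal sketch, assuming the median and minimum are the statistics of interest; mpg_by_drv and its column names are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Summarise overall mpg for each drive train type (f, r, 4).
mpg_by_drv <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2) %>%
  group_by(drv) %>%
  summarise(
    n = n(),
    median_mpg_overall = median(mpg_overall),
    min_mpg_overall = min(mpg_overall)
  ) %>%
  arrange(desc(median_mpg_overall))

mpg_by_drv
```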
mpg_overall.
mpg_c <- mpg_a %>%
filter(!is.na(mpg_overall)) %>%
group_by(class) %>%
summarise(mean_mpg_overall = mean(mpg_overall, na.rm = TRUE)) %>%
arrange(desc(mean_mpg_overall))
mpg_c
Answer: Based on this data set for car models
released between 1999 and 2008, subcompact cars have the highest mean
overall mpg.
year and cyl to mpg_overall. You
shall treat year and cyl as categorical
variables in your graph.
ggplot(data = mpg_a) +
geom_point(mapping = aes(x = mpg_overall, y = factor(year))) +
facet_wrap(~ cyl, nrow = 2) +
labs(title = "Composite effect of year & cylinder numbers to MPG", x = "Overall MPG", y = "Year of Manufacture") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: Based on this data set for car models
released between 1999 and 2008, generally the higher the number of
cylinders, the worse the MPG becomes, while the year of manufacture does
not significantly change the overall MPG.
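A grouped summary is one way to check this reading of the graph against the numbers. The sketch below assumes the mean is an adequate summary for each cylinder and year combination; mpg_by_cyl_year is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Mean overall mpg for every combination of cylinder count and model year.
mpg_by_cyl_year <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2) %>%
  group_by(cyl, year) %>%
  summarise(
    n = n(),
    mean_mpg_overall = mean(mpg_overall),
    .groups = "drop"
  ) %>%
  arrange(cyl, year)

mpg_by_cyl_year
```

Comparing rows within the same cyl value shows the year effect, and comparing across cyl values shows the cylinder effect.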
flights data set
For the following tasks, use the flights data set of the
nycflights13 package.
flights_a <- flights %>%
filter(!is.na(arr_delay), month == 11, origin == "JFK") %>%
group_by(day) %>%
summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(mean_arr_delay))
flights_a
Answer: Based on this data set, for JFK airport,
November 27th 2013 has the largest average arrival delay.
cancel_flight which is
Cancelled if the departure time or arrival time is
NA, otherwise Not Cancelled.
Answer:
flights_b <- mutate(flights,
cancel_flight = ifelse(is.na(dep_time) | is.na(arr_time), "Cancelled",
"Not Cancelled"))
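A quick count of the two categories helps confirm that the new variable behaves as intended. This is a sketch only: it uses dplyr's if_else() in place of ifelse(), and cancel_counts is an illustrative name.

```r
library(dplyr)
library(nycflights13)

# Label each flight, then count how many fall into each category.
cancel_counts <- flights %>%
  mutate(
    cancel_flight = if_else(
      is.na(dep_time) | is.na(arr_time),
      "Cancelled",
      "Not Cancelled"
    )
  ) %>%
  count(cancel_flight)

cancel_counts
```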
distance between cancelled flights and non-cancelled
flights.
ggplot(flights_b, aes(x = distance, fill = cancel_flight)) +
geom_density(adjust = 2, alpha = 0.5) +
labs(title = "Density distribution of distance between
cancelled vs non-cancelled flights", x = "Distance", y = "Density") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
flights_d <- flights %>%
group_by(origin, dest) %>%
summarise(destinations = n_distinct(dest))
flights_d
Answer: There are 224 unique flight routes in the
data set.
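The same route table can also be built without a summary column. The sketch below uses distinct() as an alternative to the group_by() and summarise() approach above; routes is an illustrative name.

```r
library(dplyr)
library(nycflights13)

# One row per unique origin-destination pair.
routes <- flights %>%
  distinct(origin, dest)

# The number of rows is the number of unique routes.
nrow(routes)
```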
distance as a column to the table you created in
d).
Hint: You should go back to the original flights data
set and reconstruct the table with distance included. Create a histogram
of distance for the route table.
flights_e <- flights %>%
group_by(origin, dest, distance) %>%
summarise(destinations = n_distinct(dest)) %>%
select(-destinations)
flights_e
ggplot(data = flights_e) +
geom_histogram(mapping = aes(x = distance)) +
labs(title = "Flight Routes Distance", x = "Distance", y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
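Because distance is used as a grouping variable here, the table has one row per route only if every route has a single recorded distance. The sketch below is one way to check that assumption; multi_distance and n_distances are illustrative names.

```r
library(dplyr)
library(nycflights13)

# List routes that appear with more than one distinct distance value.
multi_distance <- flights %>%
  distinct(origin, dest, distance) %>%
  count(origin, dest, name = "n_distances") %>%
  filter(n_distances > 1)

multi_distance
```

If this returns any rows, the route table with distance will contain more rows than the table from d).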
Answer:
flights data set
The following questions are also from the flights data set.
Each question is worth 5% bonus points if answered correctly.
# Enter code here.
Answer:
# Enter code here.
Answer: