I, Jerry Chan, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
library(tidyverse)
library(nycflights13)
library(openintro)
mpg data set
After loading the tidyverse library, a data set named
mpg should be ready to explore. The following questions are
based on this data set.
mpg_overall which is the
average of city and highway fuel consumption in miles per gallon. Then
create a histogram of this new variable with each group covering values
of 20-22, 22-24 etc.
mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2) %>%
  ggplot(aes(x = mpg_overall)) +
  geom_histogram(binwidth = 2, boundary = 20, fill = "steelblue", color = "black") +
  labs(
    title = "overall mpg",
    x = "average mpg",
    y = "count"
  )
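As a quick check on the bin edges, the counts behind the histogram can be tabulated directly. This is only a sketch, assuming ggplot2's cut_width() helper with the same width and boundary as the plot; the names bin and bin_counts are illustrative and not part of the exam.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set and cut_width()

# Count how many cars fall in each 2-mpg-wide bin aligned at 20,
# mirroring binwidth = 2 and boundary = 20 in the histogram
bin_counts <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2) %>%
  count(bin = cut_width(mpg_overall, width = 2, boundary = 20))

bin_counts
```

If the bin labels follow the 20-22, 22-24 pattern, the histogram groups match what the question asks for.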
mpg_overall.
mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2) %>%
  ggplot(aes(x = drv, y = mpg_overall)) +
  geom_boxplot() +
  labs(
    title = "relationship between drive train types and overall mpg",
    x = "drive train",
    y = "overall mpg"
  )
Answer: drive train ‘f’ yields the highest overall mpg
compared with the ‘4’ and ‘r’ drive train types.
mpg_overall.
mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2) %>%
  group_by(class) %>%
  summarise(mean_mpg_overall = mean(mpg_overall)) %>%
  arrange(desc(mean_mpg_overall))
Answer: the subcompact class has the highest mean overall mpg,
followed closely by the compact class.
year and cyl to mpg_overall. You
shall treat year and cyl as categorical
variables in your graph.
mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2, year = factor(year), cyl = factor(cyl)) %>%
  ggplot(aes(x = cyl, y = mpg_overall, fill = year)) +
  geom_boxplot() +
  labs(title = "overall mpg by cylinders and years", x = "cylinder", y = "overall mpg")
Answer: cars with fewer cylinders have higher overall mpg, and 2008
models show minor improvements over 1999 models.
flights data set
For the following tasks, use the flights data set from the
nycflights13 package.
flights %>%
  filter(origin == "JFK", month == 11, year == 2013) %>%
  group_by(day) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_arr_delay))
Answer: November 27, 2013 has the highest mean arrival delay, about
21.33 minutes. This may be because of the Thanksgiving holiday on
November 28, which increases the number of flights.
cancel_flight which is
Cancelled if the departure time or arrival time is
NA, otherwise Not Cancelled.
# Label a flight "Cancelled" when its departure or arrival time is missing
cancelled_flights <- flights %>%
  mutate(cancel_flight = if_else(
    is.na(dep_time) | is.na(arr_time),
    "Cancelled",
    "Not Cancelled"
  ))
cancelled_flights %>%
  count(cancel_flight)
Answer: there are 8713 cancelled flights.
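The count above depends on how the two NA conditions overlap. A minimal sketch for inspecting that, assuming the same definition of a cancelled flight; the column names dep_na, arr_na and either_na are illustrative only.

```r
library(dplyr)
library(nycflights13)

# Count missing departure times, missing arrival times,
# and flights where at least one of the two is missing
na_check <- flights %>%
  summarise(
    dep_na    = sum(is.na(dep_time)),
    arr_na    = sum(is.na(arr_time)),
    either_na = sum(is.na(dep_time) | is.na(arr_time))
  )

na_check
```

Here either_na is the number reported as cancelled flights; comparing it with dep_na and arr_na shows which condition contributes most of the cancellations.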
distance between cancelled flights and non-cancelled
flights.
ggplot(cancelled_flights, aes(x = distance, fill = cancel_flight)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "distribution of distance between cancelled flights and non-cancelled flights",
    x = "distance (in miles)",
    y = "density",
    fill = "cancel status"
  )
flights %>%
  distinct(origin, dest) %>%
  count()
Answer: there are 224 unique flight routes.
distance as a column to the table you created in
d).
Hint: You should go back to the original flights data
set and reconstruct the table with distance included. Create a histogram
of distance for the route table.
flights %>%
  distinct(origin, dest, distance) %>%
  arrange(origin, dest) %>%
  ggplot(aes(x = distance)) +
  geom_histogram(binwidth = 200, fill = "steelblue", color = "black") +
  labs(
    title = "distribution of route distance",
    x = "distance",
    y = "number of routes"
  )
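One thing worth checking is whether adding distance to distinct() keeps one row per route. The sketch below assumes a route could be recorded with more than one distance value; whether any such route exists in the data is not stated above, and the names routes_multi and n_distances are illustrative.

```r
library(dplyr)
library(nycflights13)

# List routes that appear with more than one distinct distance value;
# an empty result means the route table has exactly one row per route
routes_multi <- flights %>%
  distinct(origin, dest, distance) %>%
  count(origin, dest, name = "n_distances") %>%
  filter(n_distances > 1)

routes_multi
```

If routes_multi is not empty, the histogram counts those routes more than once, and summarising to one distance per route (for example the mean) would be an alternative.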
flights %>%
  mutate(cancelled = if_else(is.na(dep_time) | is.na(arr_time), "cancelled", "not cancelled")) %>%
  group_by(origin, dest) %>%
  summarise(
    all_flights = n(),
    cancelled_count = sum(cancelled == "cancelled"),
    cancel_rate = cancelled_count / all_flights,
    .groups = "drop"
  ) %>%
  arrange(desc(cancel_rate))
Answer: EWR to LGA has the highest cancellation rate, but only because
the route has a single flight in the table. LGA to MHT follows, and it has
more flights, so its cancellation rate is a more reliable measure.
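To keep single-flight routes from dominating the ranking, the table can be restricted to routes with a minimum number of flights. This is a sketch only: the cutoff min_flights = 100 is a hypothetical value chosen for illustration, not something the exam specifies.

```r
library(dplyr)
library(nycflights13)

min_flights <- 100  # hypothetical cutoff, for illustration only

# Cancellation rate per route, keeping only routes with enough flights
reliable_rates <- flights %>%
  group_by(origin, dest) %>%
  summarise(
    all_flights = n(),
    cancelled_count = sum(is.na(dep_time) | is.na(arr_time)),
    cancel_rate = cancelled_count / all_flights,
    .groups = "drop"
  ) %>%
  filter(all_flights >= min_flights) %>%
  arrange(desc(cancel_rate))

reliable_rates
```

A different cutoff changes which routes remain, so the top route should be read together with its all_flights value.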
flights data set
The following questions are also based on the flights data set.
Each question is worth 5% bonus points if answered correctly.
flights %>%
  mutate(cancelled = if_else(is.na(dep_time) | is.na(arr_time),
                             "cancelled", "not cancelled")) %>%
  group_by(carrier) %>%
  summarise(cancel_rate = mean(cancelled == "cancelled")) %>%
  ggplot(aes(x = reorder(carrier, cancel_rate), y = cancel_rate)) +
  geom_col() +
  labs(
    title = "cancellation rate by carrier",
    x = "carrier",
    y = "cancel rate"
  )
Answer: HA has the lowest cancellation rate, with 0
cancelled flights.
Answer: