I, Lisbeth Liu, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
# load required packages here
library(tidyverse)
library(openintro)
library(nycflights13)
mpg data set

After loading the tidyverse library, a data set named mpg should be ready to explore. The following questions are based on this data set.
Create a new variable mpg_overall, which is the average of city and highway fuel consumption in miles per gallon. Then create a histogram of this new variable, with each bin covering a range of two units (20-22, 22-24, etc.).
# Enter code here.
mpg2 <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2)

ggplot(mpg2, aes(x = mpg_overall)) +
  geom_histogram(binwidth = 2, boundary = 10, color = 'purple', fill = 'pink') +
  labs(title = "Distribution of Overall MPG",
       x = "Overall MPG",
       y = "Number of Cars") +
  theme(plot.title = element_text(hjust = 0.5, size = 12))
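As a quick check on the binning, the counts per bin can be tabulated directly. This is only a sketch: it assumes the same bin edges as the histogram (width 2, aligned at 10), and the names `bin_edges` and `bin_counts` are illustrative rather than part of the exam.

```r
library(ggplot2)  # provides the mpg data set

# Recompute the averaged fuel economy without relying on earlier objects
mpg_overall <- (mpg$cty + mpg$hwy) / 2

# Assumed bin edges: width 2, aligned at 10, matching binwidth = 2, boundary = 10
bin_edges <- seq(10, 40, by = 2)

# cut() uses right-closed intervals by default, as geom_histogram() does
bin_counts <- table(cut(mpg_overall, breaks = bin_edges))
print(bin_counts)
```

Each entry should correspond to one bar of the histogram, so the table offers a way to confirm that the bars cover 20-22, 22-24, and so on.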
mpg_overall.
# Enter code here.
ggplot(mpg2, aes(x = drv, y = mpg_overall)) +
  geom_boxplot(color = 'navy') +
  labs(title = "MPG by Drive Train Type",
       x = "Drive Train Type",
       y = "Overall MPG") +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
Answer: Cars with front-wheel drive (f) tend to have higher overall MPG compared with four-wheel drive (4) and rear-wheel drive (r) vehicles.
mpg_overall.
# Enter code here.
mpg2 %>%
  group_by(class) %>%
  summarise(mean_mpg = mean(mpg_overall)) %>%
  arrange(desc(mean_mpg))
Answer: The subcompact car class has the highest mean overall MPG, indicating that subcompact cars tend to be the most fuel-efficient.
year and cyl to mpg_overall. You shall treat year and cyl as categorical variables in your graph.
# Enter code here.
ggplot(mpg2, aes(x = factor(cyl), y = mpg_overall)) +
  geom_point() +
  facet_wrap(~year) +
  labs(title = 'Effect of Year and Cylinders on Overall MPG',
       x = 'Number of Cylinders',
       y = 'Overall MPG') +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
Answer: Cars with fewer cylinders generally have higher MPG. This pattern appears in both 1999 and 2008. Additionally, cars in 2008 show slightly higher MPG overall, suggesting that fuel efficiency may have improved slightly over time.
flights data set

For the following tasks, use the flights data set from the nycflights13 package.
# Enter code here.
flights %>%
  filter(origin == 'JFK', month == 11) %>%
  group_by(day) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay))
Answer: Among flights departing from JFK in November 2013, November 27 has the largest average arrival delay.
Create a new variable cancel_flight, which is Cancelled if the departure time or arrival time is NA, and Not Cancelled otherwise.
# Enter code here.
flights2 <- flights %>%
  mutate(cancel_flight = ifelse(is.na(dep_time) | is.na(arr_time),
                                'Cancelled',
                                'Not Cancelled'))
glimpse(flights2)
## Rows: 336,776
## Columns: 20
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ cancel_flight <chr> "Not Cancelled", "Not Cancelled", "Not Cancelled", "Not…
Answer: Flights are labeled “Cancelled” when either the departure time or arrival time is missing (NA). Otherwise, they are labeled “Not Cancelled”.
distance between cancelled flights and non-cancelled flights.
# Enter code here.
ggplot(flights2, aes(x = distance, fill = cancel_flight)) +
  geom_density(alpha = 0.4) +
  labs(title = "Distance Distribution of Cancelled vs Non-Cancelled Flights",
       x = "Distance",
       y = "Density") +
  theme(plot.title = element_text(hjust = 0.5))
# Enter code here.
routes <- flights %>%
  distinct(origin, dest)
routes
nrow(routes)
## [1] 224
Answer: The table lists all unique routes, and there are 224 unique routes in the dataset.
Add distance as a column to the table you created in d).
Hint: You should go back to the original flights data set and reconstruct the table with distance included. Create a histogram of distance for the route table.
# Enter code here.
routes2 <- flights %>%
  distinct(origin, dest, distance)

ggplot(routes2, aes(x = distance)) +
  geom_histogram(binwidth = 20) +
  labs(title = 'Distribution of Route Distances',
       x = 'Distance',
       y = 'Number of Routes') +
  theme(plot.title = element_text(hjust = 0.5, size = 16))
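Because `distinct(origin, dest, distance)` keeps one row per combination of all three columns, a route recorded with more than one distance value would appear more than once in the table. The sketch below shows one way to check for that. It uses a small made-up data frame (`toy_flights`, with invented airports and distances) rather than the real flights data, and the name `dup_routes` is illustrative.

```r
library(dplyr)

# Toy stand-in for flights: route A-B is recorded with two different distances
toy_flights <- data.frame(
  origin   = c("A", "A", "A", "C"),
  dest     = c("B", "B", "B", "D"),
  distance = c(100, 100, 101, 500)
)

toy_routes <- toy_flights %>%
  distinct(origin, dest, distance)

# Routes that appear more than once have more than one recorded distance
dup_routes <- toy_routes %>%
  count(origin, dest, name = "n_distances") %>%
  filter(n_distances > 1)

print(dup_routes)
```

Running the same `count()` and `filter()` steps on `routes2` would show whether its row count can differ from the number of unique origin and destination pairs.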
# Enter code here.
# Cancellation rate = cancelled flights / total flights
flights2 %>%
  group_by(origin, dest) %>%
  summarise(total_flights = n(),
            cancelled = sum(cancel_flight == 'Cancelled'),
            cancel_rate = cancelled / total_flights,
            .groups = 'drop') %>%
  arrange(desc(cancel_rate))
Answer: The route from EWR to LGA has the highest flight cancellation rate. However, this result is not very informative, because there is just one flight from EWR to LGA. Setting that case aside, the route with the highest cancellation rate is the one from LGA to MHT.
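One way to keep routes with very few flights from dominating the ranking is to require a minimum number of flights before comparing rates. The following is a minimal sketch on made-up data: `toy_flights` and the cutoff `min_flights` are assumptions chosen for illustration, not values from the exam.

```r
library(dplyr)

# Toy stand-in for flights2: route A-B has a single, cancelled flight
toy_flights <- data.frame(
  origin = c("A", "C", "C", "C", "C"),
  dest   = c("B", "D", "D", "D", "D"),
  cancel_flight = c("Cancelled", "Cancelled", "Not Cancelled",
                    "Not Cancelled", "Not Cancelled")
)

min_flights <- 2  # hypothetical cutoff; pick a value suited to the data

route_rates <- toy_flights %>%
  group_by(origin, dest) %>%
  summarise(total_flights = n(),
            cancel_rate = mean(cancel_flight == "Cancelled"),
            .groups = "drop") %>%
  filter(total_flights >= min_flights) %>%  # drop routes with too few flights
  arrange(desc(cancel_rate))

print(route_rates)
```

In the toy data, route A-B has a 100% cancellation rate from one flight and is filtered out, leaving C-D as the top route.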
flights data set

The following questions are also from the flights data set. Each question is worth 5% bonus points if answered correctly.
# Enter code here.
flights2 <- flights %>%
  mutate(cancel_flight = ifelse(is.na(dep_time) | is.na(arr_time),
                                1,
                                0))

cancel_rate_airline <- flights2 %>%
  group_by(carrier) %>%
  summarise(cancel_rate = mean(cancel_flight))

ggplot(cancel_rate_airline, aes(x = carrier, y = cancel_rate)) +
  geom_col(fill = 'pink') +
  labs(title = 'Cancellation Rate by Airline',
       x = 'Airline (Carrier)',
       y = 'Cancellation Rate') +
  theme(plot.title = element_text(hjust = 0.5, size = 12))
Answer: The HA airline has the lowest cancellation rate.
# Enter code here.
flights %>%
  group_by(origin, dest) %>%
  summarise(num_carriers = n_distinct(carrier), .groups = 'drop') %>%
  arrange(desc(num_carriers))
Answer: The most competitive routes, meaning those served by the largest number of distinct carriers, are EWR to DTW, EWR to MSP, JFK to LAX, JFK to SFO, JFK to TPA, LGA to ATL, LGA to CLE, and LGA to CLT.
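Since several routes tie for the largest number of carriers, the tied rows can be extracted directly instead of being read off the top of the sorted table. This sketch uses a small made-up data frame (`toy_flights`, with invented airports and carriers) and the illustrative name `top_routes`.

```r
library(dplyr)

# Toy stand-in for flights: routes A-B and C-D are each served by two carriers
toy_flights <- data.frame(
  origin  = c("A", "A", "A", "C", "C", "E"),
  dest    = c("B", "B", "B", "D", "D", "F"),
  carrier = c("X", "Y", "Y", "X", "Y", "X")
)

top_routes <- toy_flights %>%
  group_by(origin, dest) %>%
  summarise(num_carriers = n_distinct(carrier), .groups = "drop") %>%
  filter(num_carriers == max(num_carriers))  # keep every route tied for the maximum

print(top_routes)
```

Dropping the grouping before `filter()` matters here: with the groups still in place, `max()` would be computed within each group rather than across all routes.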