I, keyan, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
library(tidyverse)
library(openintro)
library(nycflights13)
mpg data set

After loading the tidyverse library, a data set named mpg should be ready to explore. The following questions are based on this data set.
Create a new variable mpg_overall which is the average of city and highway fuel consumption in miles per gallon. Then create a histogram of this new variable with each group covering values of 20-22, 22-24, etc.
mpg2 <- mpg %>%
  mutate(mpg_overall = (cty + hwy) / 2)
ggplot(mpg2, aes(x = mpg_overall)) +
  geom_histogram(binwidth = 2, boundary = 20, fill = "skyblue", color = "black") +
  labs(
    title = "Histogram of Overall MPG",
    x = "Overall MPG",
    y = "Count"
  )
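To see how binwidth = 2 and boundary = 20 produce groups such as 20-22 and 22-24, the bin edges can be reproduced by hand. The following is a minimal sketch on made-up values: the vector x is hypothetical and not taken from mpg. It assumes the geom_histogram() default of right-closed bins, which cut() mimics with right = TRUE.

```r
# Hypothetical overall-mpg values, for illustration only (not from mpg)
x <- c(14.5, 19, 20, 21.5, 22, 23.5, 26)

binwidth <- 2
boundary <- 20

# Bin edges sit at boundary + k * binwidth, extended to cover the data range
lower <- boundary + floor((min(x) - boundary) / binwidth) * binwidth
upper <- boundary + ceiling((max(x) - boundary) / binwidth) * binwidth
edges <- seq(lower, upper, by = binwidth)

# Right-closed intervals (a, b]; include.lowest keeps a value equal to the first edge
bins <- cut(x, breaks = edges, right = TRUE, include.lowest = TRUE)
bin_counts <- table(bins)
print(bin_counts)
```

Under this assumption a value of exactly 22 is counted in the 20-22 group rather than the 22-24 group.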
mpg_overall.
ggplot(mpg2, aes(x = drv, y = mpg_overall)) +
  geom_boxplot(fill = "pink") +
  labs(
    title = "Overall MPG by Drive Train Type",
    x = "Drive Train Type",
    y = "Overall MPG"
  )
Answer:
The boxplot shows the distribution of
overall mpg for different drive train types. Front-wheel drive vehicles
generally have a higher median MPG, while rear-wheel and four-wheel
drive vehicles tend to have lower fuel efficiency.
mpg_overall.
class_table <- mpg2 %>%
  group_by(class) %>%
  summarise(mean_mpg_overall = mean(mpg_overall, na.rm = TRUE)) %>%
  arrange(desc(mean_mpg_overall))
class_table
Answer:
The table is sorted by mean_mpg_overall in descending order, so the car class with the highest mean overall mpg is the one in the first row.
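Rather than reading the top row off a sorted table by eye, the row with the largest mean can be extracted directly. The following is a minimal sketch using dplyr's slice_max(); the table class_means and its values are hypothetical stand-ins, not the real mpg results.

```r
library(dplyr)

# Hypothetical summary table, for illustration only (not the real mpg results)
class_means <- tibble(
  class = c("a", "b", "c"),
  mean_mpg_overall = c(18.2, 24.6, 21.0)
)

# Keep only the row with the largest mean
top_class <- class_means %>%
  slice_max(mean_mpg_overall, n = 1)

print(top_class)
```

Applied to class_table, the same call would return the class that the answer refers to.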
year and cyl to mpg_overall. You shall treat year and cyl as categorical variables in your graph.
ggplot(mpg2, aes(x = factor(cyl), y = mpg_overall, fill = factor(year))) +
  geom_boxplot() +
  labs(
    title = "Overall MPG by Cylinder and Year",
    x = "Number of Cylinders",
    y = "Overall MPG",
    fill = "Year"
  )
Answer:
This graph shows how overall mpg varies across cylinder categories and model years. Vehicles with fewer cylinders generally have higher fuel efficiency, and there are also differences between model years within each cylinder category.
flights data set

For the following tasks, use the flights data set from the nycflights13 package.
jfk_nov_delay <- flights %>%
  filter(origin == "JFK", year == 2013, month == 11) %>%
  group_by(day) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay))
jfk_nov_delay
Answer:
The table is sorted by avg_arr_delay in descending order, so the day with the biggest average arrival delay at JFK in November 2013 is the day in the first row.
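The na.rm = TRUE argument matters in this kind of summary because a single missing arr_delay would otherwise turn a whole day's average into NA. The following is a minimal sketch on a hypothetical toy table (the values are invented for illustration):

```r
library(dplyr)

# Hypothetical delays in minutes; NA stands for a flight with no recorded arrival delay
toy <- tibble(
  day = c(1, 1, 1, 2, 2),
  arr_delay = c(10, NA, 30, 5, 15)
)

daily <- toy %>%
  group_by(day) %>%
  summarise(
    avg_with_na = mean(arr_delay),                 # NA whenever the group has a missing value
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)  # averages only the recorded delays
  )

print(daily)
```

In this toy table, day 1 has an undefined average without na.rm and an average of 20 with it.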
Create a new variable cancel_flight which is Cancelled if the departure time or arrival time is NA, otherwise Not Cancelled.
flights2 <- flights %>%
  mutate(cancel_flight = if_else(is.na(dep_time) | is.na(arr_time),
                                 "Cancelled",
                                 "Not Cancelled"))
head(flights2)
Answer:
A flight is labeled Cancelled if either
departure time or arrival time is missing; otherwise it is labeled Not
Cancelled.
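The labeling rule can be checked on a small example. The following is a minimal sketch with a hypothetical four-row table (the times are invented), applying the same if_else() condition as above:

```r
library(dplyr)

# Hypothetical flights; NA means the time was not recorded
toy <- tibble(
  dep_time = c(517, NA, 554, 600),
  arr_time = c(830, NA, NA, 745)
)

toy_labeled <- toy %>%
  mutate(cancel_flight = if_else(is.na(dep_time) | is.na(arr_time),
                                 "Cancelled",
                                 "Not Cancelled"))

print(toy_labeled)
```

Note that the third row, which has a departure time but no arrival time, is also labeled Cancelled under this rule.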
distance between cancelled flights and non-cancelled flights.
ggplot(flights2, aes(x = distance, fill = cancel_flight)) +
  geom_density(alpha = 0.4) +
  labs(
    title = "Distance Distribution by Flight Cancellation Status",
    x = "Distance",
    y = "Density",
    fill = "Flight Status"
  )

route_table <- flights2 %>%
  distinct(origin, dest)
route_table
nrow(route_table)
## [1] 224
Answer:
The dataset contains 224 unique flight
routes.
Add distance as a column to the table you created in d).
Hint: You should go back to the original flights data set and reconstruct the table with distance included. Create a histogram of distance for the route table.
route_distance_table <- flights2 %>%
  distinct(origin, dest, distance)
route_distance_table
ggplot(route_distance_table, aes(x = distance)) +
  geom_histogram(binwidth = 100, fill = "lightblue", color = "black") +
  labs(
    title = "Histogram of Route Distance",
    x = "Distance",
    y = "Count"
  )
route_cancel_table <- flights2 %>%
  group_by(origin, dest) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "Cancelled"),
    cancel_rate = cancelled_flights / total_flights,
    .groups = "drop"  # drop the grouping; also silences summarise()'s regrouping message
  ) %>%
  arrange(desc(cancel_rate))
route_cancel_table
Answer:
The table is sorted by cancel_rate in descending order, so the route with the highest cancellation rate is the one in the first row.
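The cancellation rate per route is the share of flights on that route labeled Cancelled. As a minimal sketch on a hypothetical toy table (the airport codes and labels are invented), the same rate can be computed in one step as the mean of a logical vector:

```r
library(dplyr)

# Hypothetical routes and labels, for illustration only
toy <- tibble(
  origin = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  dest = c("XXX", "XXX", "XXX", "YYY", "YYY"),
  cancel_flight = c("Cancelled", "Not Cancelled", "Not Cancelled",
                    "Not Cancelled", "Not Cancelled")
)

toy_rates <- toy %>%
  group_by(origin, dest) %>%
  summarise(
    total_flights = n(),
    cancel_rate = mean(cancel_flight == "Cancelled"),  # TRUE counts as 1, FALSE as 0
    .groups = "drop"
  ) %>%
  arrange(desc(cancel_rate))

print(toy_rates)
```

Because a route with very few flights can reach a high rate from a single cancellation, it may be worth reading total_flights alongside cancel_rate when interpreting the top rows.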
flights data set

The following questions are also from the flights data set. Each question is worth 5% bonus points if answered correctly.
airline_cancel_table <- flights2 %>%
  group_by(carrier) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancel_flight == "Cancelled"),
    cancel_rate = cancelled_flights / total_flights
  ) %>%
  arrange(cancel_rate)
airline_cancel_table
ggplot(airline_cancel_table, aes(x = reorder(carrier, cancel_rate), y = cancel_rate)) +
  geom_col(fill = "pink") +
  labs(
    title = "Cancellation Rate by Airline",
    x = "Airline",
    y = "Cancellation Rate"
  )
Answer:
Based on the graph, HA has the lowest
cancellation rate.
route_competition <- flights2 %>%
  distinct(origin, dest, carrier) %>%  # one row per route-carrier pair
  group_by(origin, dest) %>%
  summarise(num_carriers = n(), .groups = "drop") %>%
  arrange(desc(num_carriers))
route_competition
Answer:
The most competitive routes are those
served by the largest number of carriers. In this data set, the maximum
number of carriers on a route is 5. The routes with 5 competing carriers
are EWR–DTW, EWR–MSP, JFK–LAX, JFK–SFO, JFK–TPA, LGA–ATL, LGA–CLE, and
LGA–CLT.
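Counting carriers per route can also be done without the intermediate distinct() step by using n_distinct() inside summarise(). The following is a minimal sketch on a hypothetical toy table; the airport and carrier codes are invented for illustration.

```r
library(dplyr)

# Hypothetical flights, for illustration only
toy <- tibble(
  origin = c("AAA", "AAA", "AAA", "AAA", "BBB"),
  dest = c("XXX", "XXX", "XXX", "YYY", "YYY"),
  carrier = c("C1", "C1", "C2", "C1", "C3")
)

toy_competition <- toy %>%
  group_by(origin, dest) %>%
  summarise(num_carriers = n_distinct(carrier), .groups = "drop") %>%
  arrange(desc(num_carriers))

print(toy_competition)
```

Here the route AAA to XXX is served by two carriers even though it has three flights, since repeated carriers are counted once.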