I, Duc Vinh Hoang, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
# load required packages here
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
library(openintro)
library(nycflights13)
### 1. The mpg data set
After loading the tidyverse library, a data set named
mpg should be ready to explore. The following questions are
based on this data set.
##### a) Create a new variable mpg_overall which is the
average of city and highway fuel consumption in miles per gallon. Then
create a histogram of this new variable with each group covering values
of 20-22, 22-24 etc.
# Enter code here.
mpg <- mutate(mpg, mpg_overall = (cty + hwy) / 2)
ggplot(data = mpg, mapping = aes(x = mpg_overall)) +
  geom_histogram(
    binwidth = 2,
    boundary = 0,
    fill = "steelblue",
    color = "white"
  )
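To see which bin a given value lands in with `binwidth = 2` and `boundary = 0`, the bin edges can be reproduced by hand. The sketch below uses made-up values rather than the real mpg data, and it assumes ggplot2's default of right-closed bins, which base R's `cut()` shares. Under that assumption a value of exactly 22 is counted in the 20-22 group.

```r
# Hypothetical values standing in for mpg_overall (not taken from mpg)
x <- c(20.0, 21.5, 22.0, 23.9, 24.0)

# Bin edges every 2 units starting from 0, mirroring binwidth = 2, boundary = 0
breaks <- seq(0, 26, by = 2)

# cut() closes intervals on the right by default: (18,20], (20,22], ...
bins <- cut(x, breaks = breaks)

# Count how many values fall into each non-empty bin
table(droplevels(bins))
```

If left-closed groups such as [20, 22) are wanted instead, `geom_histogram()` accepts `closed = "left"`.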
mpg_overall.
# Enter code here.
ggplot(data = mpg, mapping = aes(x = drv, y = mpg_overall, fill = drv)) +
  geom_boxplot() +
  labs(
    x = "Drive train (f = front, r = rear, 4 = 4wd)",
    y = "Overall MPG"
  )
Answer:
The distributions for 4 and r are
similar, whereas f is noticeably higher in overall mpg. f also has the
most outliers, which correspond to cars that save even more fuel.
##### c) Create a table to find out which car class has the highest mean
mpg_overall.
# Enter code here.
mpg_sum <- mpg %>%
  group_by(class) %>%
  summarize(mean_mpg = mean(mpg_overall)) %>%
  arrange(desc(mean_mpg))
mpg_sum
Answer:
The table shows that the classes with the highest mean mpg_overall,
and therefore the best fuel savings, are compact and subcompact, followed by
midsize and 2seater.
##### d) Create a proper graph to study the
composite effect of year and cyl on
mpg_overall. You shall treat year and
cyl as categorical variables in your graph.
# Enter code here.
ggplot(data = mpg, mapping = aes(x = factor(cyl), y = mpg_overall, fill = factor(year))) +
  geom_boxplot()
Answer:
The more cylinders a car has, the more power it draws,
which leads to higher gas consumption and therefore lower overall mpg.
### 2. The flights data set
For the following tasks, use the flights data set from the
nycflights13 package.
# Enter code here.
nov_delays <- flights %>%
  filter(origin == "JFK", month == 11) %>%
  group_by(day) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay))
nov_delays
Answer:
Based on the table, November 27 is the worst day: it has the highest
average arrival delay among JFK departures that month.
##### b) Create a new variable
cancel_flight which is Cancelled if the
departure time or arrival time is NA, otherwise
Not Cancelled.
# Enter code here.
flights <- flights %>%
  mutate(cancel_flight = if_else(is.na(dep_time) | is.na(arr_time),
                                 "Cancelled",
                                 "Not Cancelled"))
flights %>%
  select(dep_time, arr_time, cancel_flight)
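A quick way to sanity-check the new column is to tally each status. The sketch below runs the same `if_else()` rule on a small made-up data frame (the times are hypothetical, not rows from flights), so the expected label of each row can be checked by eye.

```r
library(dplyr)

# Hypothetical toy rows: one complete flight, one with both times missing,
# and one that departed but has no recorded arrival
toy <- data.frame(
  dep_time = c(517L, NA, 533L),
  arr_time = c(830L, NA, NA)
)

# Same rule as above: a missing departure OR arrival time means "Cancelled"
toy <- toy %>%
  mutate(cancel_flight = if_else(is.na(dep_time) | is.na(arr_time),
                                 "Cancelled",
                                 "Not Cancelled"))

# Tally each status to confirm both labels appear as expected
count(toy, cancel_flight)
```

Note that under this rule a flight that departed but has no arrival time is also labelled Cancelled, as the third toy row illustrates.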
Answer:
distance between cancelled flights and non-cancelled
flights.
# Enter code here.
ggplot(data = flights, mapping = aes(x = distance, fill = cancel_flight)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Cancelled vs. Non-Cancelled Flights",
    x = "Distance (miles)",
    y = "Density",
    fill = "Flight Status"
  )
# Enter code here.
unique_routes <- flights %>%
distinct(origin, dest)
number_of_routes <- nrow(unique_routes)
number_of_routes
## [1] 224
Answer:
There are 224 unique routes departing from
New York.
##### e) Add distance as a column to the table you
created in d).
Hint: You should go back to the original flights data
set and reconstruct the table with distance included. Create a histogram
of distance for the route table.
# Enter code here.
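One possible way to follow the hint is sketched below; it is an illustration, not the submitted answer. It assumes that adding `distance` to the `distinct()` key is an acceptable way to rebuild the route table, and the name `routes_distance` is hypothetical. If any route is recorded with more than one distance, that route will appear on more than one row, so the row count is worth comparing with the 224 routes found in d).

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Rebuild the route table from the original data, keeping distance
# (routes_distance is a hypothetical name used only for this sketch)
routes_distance <- nycflights13::flights %>%
  distinct(origin, dest, distance)

# Compare with the route count from d) to spot routes with several distances
nrow(routes_distance)

# Histogram of distance with one observation per row of the route table
ggplot(data = routes_distance, mapping = aes(x = distance)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(x = "Distance (miles)", y = "Number of routes")
```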
# Enter code here.
cancel_rates <- flights %>%
  group_by(origin, dest) %>%
  summarize(
    num_flights = n(),
    cancel_rate = mean(cancel_flight == "Cancelled"),
    .groups = "drop"
  ) %>%
  filter(num_flights > 10) %>%
  arrange(desc(cancel_rate))
cancel_rates
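The cancellation rate above relies on taking the mean of a logical vector. The short sketch below, using four made-up flight statuses, shows why that gives a proportion.

```r
# Hypothetical statuses for four flights on a single route
status <- c("Cancelled", "Not Cancelled", "Not Cancelled", "Cancelled")

# The comparison yields TRUE/FALSE; mean() treats TRUE as 1 and FALSE as 0,
# so the result is the share of flights that were cancelled
is_cancelled <- status == "Cancelled"
cancel_rate <- mean(is_cancelled)
cancel_rate
```

Here two of the four flights are cancelled, so the rate is 0.5.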
Answer:
Among routes with more than 10 flights, the route with the highest
cancellation rate is LaGuardia (LGA) to MHT airport, at approximately
24%.