In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013.
We will generate simple graphical and numerical summaries of data on these flights and explore delay times. You can find the instructions for this lab here
# 1. load the library "tidyverse"
library(tidyverse)
# 2. use the read_csv() function to read the dataset
nycflights <- read_csv("data/nycflights.csv")
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <dbl> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <dbl> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <dbl> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <dbl> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <dbl> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
## No documentation for 'nycflights' in specified packages and libraries:
## you could try '??nycflights'
nycflights |>
  mutate(carrier = fct_infreq(carrier)) |>
  mutate(carrier = fct_rev(carrier)) |>
  ggplot(aes(x = carrier)) +
  geom_bar()
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 19936 rows containing non-finite values (`stat_bin()`).
Question: Experiment with different binwidths to
adjust your histogram in a way that will show its important features.
You may also want to try using the + scale_x_log10(). What
features are revealed now that were obscured in the original
histogram?
Answer: [Replace this with your answer]
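The warnings above about NaNs and removed rows are what a log scale produces when a variable contains zero or negative values, as delay times often do. A minimal base R sketch, using made-up delay values rather than the lab data, shows which values cannot be placed on a log10 axis:

```r
# Hypothetical delay values in minutes (not taken from nycflights)
toy_delay <- c(-3, 0, 15, 100)

# log10() returns NaN for negative input and -Inf for zero
log_delay <- suppressWarnings(log10(toy_delay))

# Count the values a log scale would have to drop as non-finite
n_dropped <- sum(!is.finite(log_delay))
n_dropped
```

Early and exactly on-time flights would therefore vanish from a log-scaled histogram, so it only describes the flights that were actually delayed.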
Question: Create a new data frame that includes
flights headed to SFO in February, and save this data frame as
sfo_feb_flights. How many flights meet these criteria?
Answer: [Replace this with your answer]
In February, there are 68 flights to SFO.
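One way to build such a data frame is a single `filter()` call with both conditions. The sketch below runs on a small made-up table (`toy_flights` is a hypothetical stand-in for `nycflights`), so the row count it reports is not the lab's answer:

```r
library(dplyr)

# Hypothetical stand-in for nycflights so the sketch runs on its own
toy_flights <- tibble(
  dest      = c("SFO", "SFO", "LAX", "SFO"),
  month     = c(2, 3, 2, 2),
  arr_delay = c(-10, 5, 20, 48)
)

# Keep only flights headed to SFO in February;
# conditions separated by commas are combined with AND
toy_sfo_feb <- toy_flights |>
  filter(dest == "SFO", month == 2)

# Number of flights that meet both criteria
nrow(toy_sfo_feb)
```

Applied to the real data, the same pattern would be `sfo_feb_flights <- nycflights |> filter(dest == "SFO", month == 2)`.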
Question: Describe the distribution of the arrival delays of flights headed to SFO in February, using an appropriate histogram and summary statistics.
# Insert code for Exercise 3 here
sfo_feb_flights |>
  ggplot(aes(x = arr_delay)) +
  geom_histogram(binwidth = 10)
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 49 rows containing non-finite values (`stat_bin()`).
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
Answer: [Replace this with your answer]
Question: Calculate the median and interquartile
range for arr_delay of flights in the
sfo_feb_flights data frame, grouped by carrier. Which
carrier has the most variable arrival delays?
# Insert code for the Exercise here
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
Answer: [Replace this with your answer]
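For the carrier question, the summary has to use `arr_delay` rather than `dep_delay`. The following is a minimal sketch on made-up data (`toy_sfo_feb` is a hypothetical stand-in for `sfo_feb_flights`, and the column names `median_ad` and `iqr_ad` are chosen here for illustration):

```r
library(dplyr)

# Hypothetical stand-in for sfo_feb_flights
toy_sfo_feb <- tibble(
  carrier   = c("AA", "AA", "AA", "UA", "UA", "UA"),
  arr_delay = c(-10, 0, 10, -40, 0, 40)
)

# Median and IQR of arrival delay per carrier,
# sorted so the most variable carrier comes first
carrier_summary <- toy_sfo_feb |>
  group_by(carrier) |>
  summarise(
    median_ad = median(arr_delay),
    iqr_ad    = IQR(arr_delay),
    n_flights = n()
  ) |>
  arrange(desc(iqr_ad))

carrier_summary
```

Sorting by the IQR in descending order puts the carrier with the widest middle 50% of arrival delays in the first row, which is one reasonable reading of "most variable".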
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
# On-time departure rate for each origin airport, highest first
nycflights %>%
  group_by(origin) %>%
  summarise(on_time_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(on_time_dep_rate))
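# Stacked counts (the default position)
nycflights |>
  ggplot(aes(x = origin, fill = dep_type)) +
  geom_bar()
# Segments rescaled to proportions within each airport
nycflights |>
  ggplot(aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill")
# Side-by-side counts; the y-axis here is a count, not a proportion
nycflights |>
  ggplot(aes(x = origin, fill = dep_type)) +
  geom_bar(position = "dodge") +
  labs(fill = "dep type", y = "count")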
Question: Create a list of origin
airports and their rate of on-time-departure. Then visualize the
distribution of on-time-departure rate across the three airports using a
segmented bar plot (see below). If you could select an airport based on
on time departure percentage, which NYC airport would you choose to fly
out of? Hint: For the segmented bar plot, you will need to
map the aesthetic arguments as follows:
x = origin, fill = dep_type and a geom_bar()
layer. Create three plots, one with geom_bar() layer, one
with geom_bar(position = "fill") and the third with
geom_bar(position = "dodge"). Explain the difference
between the three results.
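To see what the three positions draw, it can help to compute the bar heights by hand. The sketch below uses made-up rows (`toy_flights` and the `prop` column are hypothetical, not part of the lab data): the `n` column corresponds to the heights in the default stacked plot and in the `"dodge"` plot, while `prop` corresponds to the segment heights under `position = "fill"`.

```r
library(dplyr)

# Hypothetical stand-in for nycflights with dep_type already added
toy_flights <- tibble(
  origin   = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK"),
  dep_type = c("on time", "on time", "on time", "delayed", "on time", "delayed")
)

type_share <- toy_flights |>
  count(origin, dep_type) |>     # raw counts per airport and departure type
  group_by(origin) |>
  mutate(prop = n / sum(n)) |>   # share of each type within an airport
  ungroup()

type_share
```

Because `"fill"` rescales every airport to a total of 1, it is the view that makes on-time rates comparable across airports with different numbers of flights.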
# Insert code for the Exercise here
# Average speed in mph: distance (miles) divided by air time (minutes) scaled to hours
nycflights <- nycflights |>
  mutate(avg_speed = distance * 60 / air_time)
# Equivalent calculation with an explicit air_time_hours column
nycflights <- nycflights |>
  mutate(air_time_hours = air_time / 60) |>
  mutate(avg_speed = distance / air_time_hours)
nycflights %>%
  ggplot(aes(x = distance, y = avg_speed)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Question: Mutate the data frame so that it includes
a new variable that contains the average speed, avg_speed
traveled by the plane for each flight (in mph, or if you are brave, in
km/h). Now make a scatter plot of distance
vs. avg_speed. Think carefully which of the two variables
is the predictor (on the x-axis) and which is the outcome
variable (on the y-axis) and explain why you made this
choice. Describe the relationship between average speed and
distance.
Answer: [Replace this with your answer]
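For the km/h variant mentioned in the question, the mph value only needs to be multiplied by the number of kilometres in a mile (1.609344). A small sketch on made-up flights follows; `toy_flights`, `avg_speed_mph`, `avg_speed_kmh` and `km_per_mile` are names introduced here for illustration:

```r
library(dplyr)

# Hypothetical flights: distance in miles, air_time in minutes
toy_flights <- tibble(
  distance = c(300, 1000),
  air_time = c(60, 120)
)

km_per_mile <- 1.609344  # exact length of the international mile in km

toy_speed <- toy_flights |>
  mutate(
    avg_speed_mph = distance / (air_time / 60),  # miles per hour
    avg_speed_kmh = avg_speed_mph * km_per_mile  # kilometres per hour
  )

toy_speed
```

The first toy flight covers 300 miles in one hour, so its speed is 300 mph, which makes the conversion easy to check by hand.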
Question: Replicate the following plot and determine what is the cutoff point for the latest departure delay where you can still have a chance to arrive at your destination on time.
# Insert code for the Exercise here
nycflights |>
  filter(carrier %in% c("AA", "DL", "UA")) |>
  filter(arr_delay < 0) |>
  ggplot(aes(dep_delay, arr_delay, colour = carrier)) +
  geom_point() +
  scale_x_continuous(breaks = seq(-25, 100, 2))
Answer: [Replace this with your answer]
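The cutoff can also be read off numerically instead of from the plot: it is the largest departure delay among flights that still arrived on time. The sketch below uses made-up rows (`toy_flights` and `latest_dep_delay` are hypothetical names) and assumes that an arrival delay of zero or less counts as on time:

```r
library(dplyr)

# Hypothetical flights for the three carriers in the plot
toy_flights <- tibble(
  carrier   = c("AA", "AA", "DL", "DL", "UA"),
  dep_delay = c(-5, 12, 30, 3, 20),
  arr_delay = c(-20, -1, 15, -8, 0)
)

# Largest departure delay among flights that arrived on time
# (assumption: arr_delay <= 0 means "on time")
cutoff <- toy_flights |>
  filter(arr_delay <= 0) |>
  summarise(latest_dep_delay = max(dep_delay))

cutoff
```

On the real data the same two steps would follow the `filter(carrier %in% c("AA", "DL", "UA"))` line used for the plot.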