library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data(nycflights)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
Each chart buckets the data differently. One feature that stands out in a single chart is the number of flights with a negative departure delay, meaning they departed earlier than scheduled; the other charts obscure it.
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO" | origin == "SFO", month == 2)
sfo_feb_flights
68 flights meet these criteria.
Describe the Distribution in a histogram & summary statistics
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth = 15)
sfo_feb_flights %>%
group_by(dest) %>%
summarise(mean = mean(arr_delay), median = median(arr_delay), standard_deviation = sd(arr_delay), n_flights = n())
These arrival delays follow a somewhat bell-shaped distribution, with a few outliers (which make sense given the nature of flights and weather).
sfo_feb_flights %>%
group_by(carrier) %>%
summarise(median_dd = median(arr_delay), iqr_dd = IQR(arr_delay), n_flights = n())
The IQR measures the spread of the middle 50% of the data; higher IQR values indicate a wider distribution. Using the strictest definition of an arrival delay as arriving after the targeted time, the carrier with the highest variability in arrival delays would be UA, as it not only has the same IQR as DL but also a higher median arrival delay.
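To make the IQR idea concrete, here is a minimal sketch using two made-up delay vectors (hypothetical numbers, not taken from nycflights). Both have the same median, but the second is more spread out, so its IQR is larger.

```r
# Hypothetical arrival delays in minutes for two made-up carriers
delays_a <- c(-10, -5, 0, 5, 10)
delays_b <- c(-40, -20, 0, 20, 40)

# IQR() is the distance between the 75th and 25th percentiles
iqr_a <- IQR(delays_a)  # 10
iqr_b <- IQR(delays_b)  # 40

# The same value computed by hand from quantile()
iqr_a_manual <- unname(quantile(delays_a, 0.75) - quantile(delays_a, 0.25))

# Both medians are 0, so the median alone says nothing about spread
median(delays_a)
median(delays_b)
```

The median describes where the middle of the data sits, while the IQR describes how wide that middle half is.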
nycflights %>%
group_by(month) %>%
summarise(mean_dd = mean(dep_delay), median_dd = median(dep_delay)) %>%
arrange(desc(mean_dd))
In practice there are two main choices: if you are going to be delayed, do you want the delay to most likely be short, or are you all right with the chance of a catastrophic one? The mean is the average of the entire set of departure delays, so it takes catastrophic delays into account. The median, on the other hand, is far less affected by extreme cases on either end, which makes it closer to what one should normally expect.
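A small sketch of that difference, using invented delay values rather than the real data: adding one very long delay moves the mean a lot and the median only slightly.

```r
# Hypothetical departure delays in minutes (made up for illustration)
typical <- c(-3, -1, 0, 2, 4)
with_outlier <- c(typical, 300)  # one catastrophic delay added

mean_typical   <- mean(typical)         # 0.4
median_typical <- median(typical)       # 0

mean_outlier   <- mean(with_outlier)    # about 50.3
median_outlier <- median(with_outlier)  # 1
```

This is why ranking months by mean delay and by median delay can give different answers.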
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
If all I cared about was on-time departure rates, I would depart from LGA, as it has the highest rate. The other option would be JFK, as its on-time departure rate is remarkably close and I believe it serves more destinations. I would avoid EWR, as its rate is significantly lower than the other two.
nycflights <- nycflights %>%
mutate(avg_speed = (distance/(air_time/60)))
head(nycflights)
As you can see, dividing distance (in miles) by air time converted from minutes to hours (air_time / 60) gives average speed in miles per hour. These numbers roughly align with a jet at cruising speed.
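As a quick check of the units, here is a sketch with round, made-up numbers (not a real flight from the data). It assumes distance is in miles and air time is in minutes, and it shows why the parentheses around the conversion matter.

```r
# Hypothetical flight: 2500 miles flown in 300 minutes
distance <- 2500
air_time <- 300

# Convert minutes to hours first, then divide distance by hours
hours <- air_time / 60          # 5
avg_speed <- distance / hours   # 500 miles per hour

# Without the parentheses the division runs left to right,
# which gives a much smaller, meaningless number
wrong_speed <- distance / air_time / 60
```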
ggplot(data = nycflights, aes(x = distance, y = avg_speed)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
As distance increases, avg_speed increases until it levels off. This makes sense, as some shorter hops are flown by prop planes (with a lower top speed), whereas longer-distance flights use jets with higher top speeds.
nycflights_short <- nycflights %>%
filter(carrier == "DL" | carrier == "AA" | carrier == "UA")
head(nycflights_short)
ggplot(data = nycflights_short, aes(y = arr_delay, x = dep_delay, color = carrier)) +
geom_point()