library(tidyverse)
library(openintro)
data(nycflights)
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
Comparing the three histograms, When the width of the bars is default (first histogram), it is hard to see the early departure (the negative times). While in the second histogram -where the width of the bars is 15- it is clear that the number of flights that depart early is little over 2500 flights. The last histogram shows only three categories: one is early, on time and a few minutes late. Second is around 64 minutes late to 240 minutes late, then last category is from 240 minutes to 375 minutes late departure.
<- nycflights %>%
lax_flights filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
%>%
lax_flights summarise(mean_dd = mean(dep_delay),
median_dd = median(dep_delay),
n = n())
## # A tibble: 1 × 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 9.78 -1 1583
<- nycflights %>%
sfo_feb_flights filter(dest == "SFO", month == 2)
<- nycflights %>%
sfo_feb_flights filter(dest == "SFO", month == 2)
glimpse(sfo_feb_flights)
## Rows: 68
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ day <int> 18, 3, 15, 18, 24, 25, 7, 15, 13, 8, 11, 13, 25, 20, 12, 27,…
## $ dep_time <int> 1527, 613, 955, 1928, 1340, 1415, 1032, 1805, 1056, 656, 191…
## $ dep_delay <dbl> 57, 14, -5, 15, 2, -10, 1, 20, -4, -4, 40, -2, -1, -6, -7, 2…
## $ arr_time <int> 1903, 1008, 1313, 2239, 1644, 1737, 1352, 2122, 1412, 1039, …
## $ arr_delay <dbl> 48, 38, -28, -6, -21, -13, -10, 2, -13, -6, 2, -5, -30, -22,…
## $ carrier <chr> "DL", "UA", "DL", "UA", "UA", "UA", "B6", "AA", "UA", "DL", …
## $ tailnum <chr> "N711ZX", "N502UA", "N717TW", "N24212", "N76269", "N532UA", …
## $ flight <int> 1322, 691, 1765, 1214, 1111, 394, 641, 177, 642, 1865, 272, …
## $ origin <chr> "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "JFK", "JFK", "JFK…
## $ dest <chr> "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO…
## $ air_time <dbl> 358, 367, 338, 353, 341, 355, 359, 338, 347, 361, 332, 351, …
## $ distance <dbl> 2586, 2586, 2586, 2565, 2565, 2586, 2586, 2586, 2586, 2586, …
## $ hour <dbl> 15, 6, 9, 19, 13, 14, 10, 18, 10, 6, 19, 8, 10, 18, 7, 17, 1…
## $ minute <dbl> 27, 13, 55, 28, 40, 15, 32, 5, 56, 56, 10, 33, 48, 49, 23, 2…
There are 68 flights that headed to SFO in February.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
%>%
sfo_feb_flights summarise(mean_ad = mean(arr_delay),
median_ad = median(arr_delay),
n = n())
## # A tibble: 1 × 3
## mean_ad median_ad n
## <dbl> <dbl> <int>
## 1 -4.5 -11 68
Based on the histogram and the mean and the median, there are many flights whose destination to SFO arrived earlier than their times
%>%
sfo_feb_flights group_by(origin) %>%
summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
## # A tibble: 2 × 4
## origin median_dd iqr_dd n_flights
## <chr> <dbl> <dbl> <int>
## 1 EWR 0.5 5.75 8
## 2 JFK -2.5 15.2 60
The flights that departed from JFK arrived earlier than the flights that were coming from EWR airport.
%>%
sfo_feb_flights group_by(carrier) %>%
summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
## # A tibble: 5 × 4
## carrier median_dd iqr_dd n_flights
## <chr> <dbl> <dbl> <int>
## 1 AA 13 32.8 10
## 2 B6 -2 3.5 6
## 3 DL -3 6.5 19
## 4 UA -2 13 21
## 5 VX -3.5 16.8 12
The carrier AA has the most variable arrival delays.
%>%
nycflights group_by(month) %>%
summarise(mean_dd = mean(dep_delay)) %>%
arrange(desc(mean_dd))
## # A tibble: 12 × 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
%>%
nycflights group_by(month) %>%
summarise(median_dd = median(dep_delay)) %>%
arrange(desc(median_dd))
## # A tibble: 12 × 2
## month median_dd
## <int> <dbl>
## 1 12 1
## 2 6 0
## 3 7 0
## 4 3 -1
## 5 5 -1
## 6 8 -1
## 7 1 -2
## 8 2 -2
## 9 4 -2
## 10 11 -2
## 11 9 -3
## 12 10 -3
The mean gives the best overview of the central tendency or the distribution of the data even though it is sensitive to outliers. The median focuses only on the middle point but it is not sensitive to outliers. So, to choose the month that minimizes the potential of departure delays, I will choose the month that has both; lowest mean and median. based on the two above summaries, January has the lowest mean and median.
<- nycflights %>%
nycflights mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
%>%
nycflights group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
Based on on time departure percentage, I would fly out from LGA airport because there is around 73% for the flights to be on time.
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
geom_bar()
<- nycflights %>%
nycflights mutate(hour_air_time = air_time / 60)
<- nycflights %>%
nycflights mutate (Avg_speed = nycflights$distance / hour_air_time)
ggplot(nycflights, aes(x = Avg_speed, y = distance, col= carrier)) +
geom_point()
cor(nycflights$Avg_speed, nycflights$distance)
## [1] 0.6831483
There is a strong correlation between the average speed and the distance flown by a flight.
<- nycflights[nycflights$carrier %in% c("AA", "DL", "UA"),] us_carrier
ggplot(us_carrier, aes(x = dep_delay, y = arr_delay, colour = carrier)) +
geom_point()
ggplot(us_carrier, aes(x = dep_delay, y = arr_delay, colour = carrier)) +
geom_point() +
facet_wrap(~ carrier)
The cutoff point is for departure delays where I can still expect to get to my destination on time is (200,150)