library(tidyverse)
library(openintro)
data(nycflights)
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
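Changing the binwidth changes which features are visible. The second histogram (binwidth = 15) is the best representation of the data: its bins are narrow enough to show how the delays are distributed, whereas binwidth = 150 lumps most flights into one or two bars and hides that detail.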
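To see why the bin width matters, here is a minimal sketch on toy data. The `delays` vector and the `bin_counts()` helper are hypothetical and not part of the lab, and base R's `hist()` aligns its bins differently from `geom_histogram()`, so this only illustrates the idea.

```r
# Toy delays in minutes (hypothetical values, not taken from nycflights)
delays <- c(-5, -3, -2, 0, 1, 4, 8, 12, 20, 35, 90, 240)

# Hypothetical helper: count observations per bin for a given bin width.
# The breaks are multiples of `width` that cover the full range of x.
bin_counts <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width,
                by = width)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

narrow <- bin_counts(delays, 15)   # many bins: the shape near zero is visible
wide   <- bin_counts(delays, 150)  # few bins: most values share one bar

print(narrow)
print(wide)
```

Both calls count the same 12 observations; only the grouping changes, which is why a very wide bin can hide skew and gaps that a narrower bin reveals.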
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
68 flights meet these criteria.
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)
count(sfo_feb_flights)
## # A tibble: 1 × 1
## n
## <int>
## 1 68
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth = 15)
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(
    min_ad = min(arr_delay),
    max_ad = max(arr_delay),
    median_ad = median(arr_delay),
    iqr_ad = IQR(arr_delay),
    n_flights = n()
  )
## # A tibble: 2 × 6
## origin min_ad max_ad median_ad iqr_ad n_flights
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 EWR -35 7 -15.5 17.5 8
## 2 JFK -66 196 -10.5 22.8 60
Calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
Delta (DL) and United Airlines (UA) tie for the largest IQR (22 minutes each), so by this measure they have the most variable arrival delays.
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(
    min_ad = min(arr_delay),
    max_ad = max(arr_delay),
    median_ad = median(arr_delay),
    iqr_ad = IQR(arr_delay),
    n_flights = n()
  )
## # A tibble: 5 × 6
## carrier min_ad max_ad median_ad iqr_ad n_flights
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 AA -26 76 5 17.5 10
## 2 B6 -18 11 -10.5 12.2 6
## 3 DL -48 48 -15 22 19
## 4 UA -35 196 -10 22 21
## 5 VX -66 99 -22.5 21.2 12
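As a reminder of what the `iqr_ad` column measures, the sketch below computes an IQR by hand on toy data and compares it with `IQR()`. The values in `arr_delay_toy` are hypothetical and are not drawn from sfo_feb_flights.

```r
# Toy arrival delays in minutes (hypothetical values)
arr_delay_toy <- c(-30, -20, -15, -10, -5, 0, 10, 25, 60)

# The IQR is the distance between the first and third quartiles
q <- quantile(arr_delay_toy, probs = c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

# IQR() relies on the same default quantile definition, so the two agree
print(iqr_by_hand)
print(all.equal(iqr_by_hand, IQR(arr_delay_toy)))
```

Because the IQR only looks at the middle half of the data, a single extreme flight (such as the 196-minute maximum for UA above) does not inflate it the way it would inflate the range.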
Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
The mean gives the average departure delay, while the median gives the observation that sits in the middle of the sorted data.
The pro of using the mean is that it reflects every flight, including extreme delays, so it gives a better idea of the delay to expect on average. The con is that a few very long delays can pull it upward.
The pro of the median is that it is not affected by extreme cases; the con is that it says nothing about how bad the worst delays are. In this case, the mean is the better option for decision making.
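October is the month with the lowest average departure delay (about 5.88 minutes); the table below is sorted from highest to lowest mean delay, so it appears in the last row.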
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
## # A tibble: 12 × 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
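The trade-off between the mean and the median discussed above can be seen on a small toy example. The delay values here are hypothetical and are not drawn from nycflights.

```r
# Toy departure delays in minutes (hypothetical values)
typical <- c(-4, -3, -1, 0, 2, 5, 8)

# Same flights plus one extreme five-hour delay
with_outlier <- c(typical, 300)

mean_typical   <- mean(typical)         # 1
median_typical <- median(typical)       # 0
mean_outlier   <- mean(with_outlier)    # 38.375
median_outlier <- median(with_outlier)  # 1

print(c(mean_typical, median_typical, mean_outlier, median_outlier))
```

One extreme flight moves the mean from 1 to about 38 minutes, while the median moves only from 0 to 1. A month ranked by its mean can therefore look bad because of a handful of very long delays, and a month ranked by its median can hide them.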
If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
LaGuardia (LGA) airport.
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
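The on-time rate above is a count divided by a group size. A minimal sketch on toy data (the values in `dep_delay_toy` are hypothetical) shows the same calculation outside a pipeline, along with the equivalent shortcut of taking the mean of a logical vector.

```r
# Toy departure delays in minutes (hypothetical values)
dep_delay_toy <- c(-3, 0, 4, 5, 12, 45, -1, 2)

# Same rule as above: a delay under 5 minutes counts as "on time"
dep_type_toy <- ifelse(dep_delay_toy < 5, "on time", "delayed")

# Count of on-time flights divided by the number of flights
rate_sum <- sum(dep_type_toy == "on time") / length(dep_type_toy)

# Equivalent: TRUE counts as 1 and FALSE as 0, so the mean is the proportion
rate_mean <- mean(dep_type_toy == "on time")

print(c(rate_sum, rate_mean))
```

Note that a delay of exactly 5 minutes is classified as "delayed" under this rule, because the comparison is strict.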
Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))
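As a quick check of the unit conversion, the same formula can be worked by hand for the first flight shown in the glimpse() output (JFK to LAX, distance 2475 miles, air_time 313 minutes). The variable names below are only for illustration.

```r
# Values from the first row of the glimpse() output above
distance_mi  <- 2475  # miles
air_time_min <- 313   # minutes

# Convert minutes to hours, then divide distance by hours
hours <- air_time_min / 60
avg_speed_mph <- distance_mi / hours

print(round(avg_speed_mph, 1))  # about 474.4 mph
```

Dividing by `air_time` directly, without the conversion to hours, would give miles per minute instead (about 7.9 for this flight).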
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
Average speed increases with distance up to roughly 1,000 to 1,500 miles and then levels off.
ggplot(data = nycflights, mapping = aes(x = distance, y = avg_speed)) +
geom_point()
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
c_delay <- nycflights %>%
  filter(carrier == 'AA' | carrier == 'DL' | carrier == 'UA')
ggplot(c_delay, aes(dep_delay, arr_delay, color = carrier)) + geom_point()
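The carrier filter above chains three equality tests with `|`. A more compact way to express the same condition is `%in%`, sketched here on a toy vector (`carrier_toy` is hypothetical) to show that both forms select the same elements.

```r
# Toy carrier codes (hypothetical sample)
carrier_toy <- c("AA", "B6", "DL", "UA", "VX", "AA")

# Chained comparisons, as in the filter() call above
keep_or <- carrier_toy == "AA" | carrier_toy == "DL" | carrier_toy == "UA"

# Set membership: TRUE where the code is one of the three carriers
keep_in <- carrier_toy %in% c("AA", "DL", "UA")

print(identical(keep_or, keep_in))
print(carrier_toy[keep_in])
```

Inside `filter()`, the equivalent call would be `filter(carrier %in% c("AA", "DL", "UA"))`, which is easier to extend when more carriers are added.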
Based on the plot below, the cutoff point is approximately 20 minutes. Beyond this point, arrival delays climb along with departure delays.
ggplot(c_delay, aes(dep_delay, arr_delay, color = carrier)) +
xlim(-10, 60) +
ylim(-10, 60) +
geom_point()
## Warning: Removed 7108 rows containing missing values (`geom_point()`).