In this lab, we will explore and visualize the data using the tidyverse suite of packages. The data can be found in the companion package for OpenIntro labs, openintro.
Let’s load the packages.
library(tidyverse)
library(openintro)
data(nycflights)
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
Let’s start by examing the distribution of departure delays of all flights with a histogram.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
Looking through the three histograms you can compare the shape of the charts. In the three histograms you can see each one has a different bin width. When you look into the binwidth and decrease the n you can see the distribution is center around 0-15. Even when binwidth is higher to around 30 or so. The distrubtion is around 0-15. Each histogram show data is skewing to the left where the shape is towards the left side.
lax_flights <- nycflights %>%
filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
geom_histogram()
lax_flights %>%
summarise(mean_dd = mean(dep_delay),
median_dd = median(dep_delay),
n = n())
## # A tibble: 1 × 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 9.78 -1 1583
You can also filter based on multiple criteria. Suppose you are interested in flights headed to San Francisco (SFO) in February:
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
glimpse(sfo_feb_flights)
## Rows: 68
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ day <int> 18, 3, 15, 18, 24, 25, 7, 15, 13, 8, 11, 13, 25, 20, 12, 27,…
## $ dep_time <int> 1527, 613, 955, 1928, 1340, 1415, 1032, 1805, 1056, 656, 191…
## $ dep_delay <dbl> 57, 14, -5, 15, 2, -10, 1, 20, -4, -4, 40, -2, -1, -6, -7, 2…
## $ arr_time <int> 1903, 1008, 1313, 2239, 1644, 1737, 1352, 2122, 1412, 1039, …
## $ arr_delay <dbl> 48, 38, -28, -6, -21, -13, -10, 2, -13, -6, 2, -5, -30, -22,…
## $ carrier <chr> "DL", "UA", "DL", "UA", "UA", "UA", "B6", "AA", "UA", "DL", …
## $ tailnum <chr> "N711ZX", "N502UA", "N717TW", "N24212", "N76269", "N532UA", …
## $ flight <int> 1322, 691, 1765, 1214, 1111, 394, 641, 177, 642, 1865, 272, …
## $ origin <chr> "JFK", "JFK", "JFK", "EWR", "EWR", "JFK", "JFK", "JFK", "JFK…
## $ dest <chr> "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO", "SFO…
## $ air_time <dbl> 358, 367, 338, 353, 341, 355, 359, 338, 347, 361, 332, 351, …
## $ distance <dbl> 2586, 2586, 2586, 2565, 2565, 2586, 2586, 2586, 2586, 2586, …
## $ hour <dbl> 15, 6, 9, 19, 13, 14, 10, 18, 10, 6, 19, 8, 10, 18, 7, 17, 1…
## $ minute <dbl> 27, 13, 55, 28, 40, 15, 32, 5, 56, 56, 10, 33, 48, 49, 23, 2…
sfo_feb_flights. How
many flights meet these criteria?There are 68 flights that meet these criteria.
3.Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
Another useful technique is quickly calculating summary statistics
for various groups in your data frame. For example, we can modify the
above command using the group_by function to get the same
summary stats for each origin airport:
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth=10)
sfo_feb_flights %>%
summarise(mean_ad = mean(arr_delay), median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
## # A tibble: 1 × 4
## mean_ad median_ad iqr_ad n_flights
## <dbl> <dbl> <dbl> <int>
## 1 -4.5 -11 23.2 68
arr_delays of flights inin the sfo_feb_flights
data frame, grouped by carrier. Which carrier has the most variable
arrival delays?The carriers DL and UA are tied for having the most variable arrival delays because their interquartile ranges are tied for the highest around 22. This can suggest that there is a greatest variation around the delays for 50% fo their data.
nycflights %>%
group_by(month) %>%
summarise(mean_dd = mean(dep_delay)) %>%
arrange(desc(mean_dd))
## # A tibble: 12 × 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
sfo_feb_flights %>%
group_by(carrier) %>%
summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
## # A tibble: 5 × 4
## carrier median_ad iqr_ad n_flights
## <chr> <dbl> <dbl> <int>
## 1 AA 5 17.5 10
## 2 B6 -10.5 12.2 6
## 3 DL -15 22 19
## 4 UA -10 22 21
## 5 VX -22.5 21.2 12
The option to choose the month with the lowest mean departure delay would be shows how the data is distributed and that some of the skewed data is outliers.
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
LGA has the best time departure with a percentage of 72.8%.
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
geom_bar()
avg_speed traveled by the plane
for each flight (in mph). Hint: Average speed can be
calculated as distance divided by number of hours of travel, and note
that air_time is given in minutes.nycflights <- nycflights %>%
mutate(avg_speed = 60*(distance / air_time))
glimpse(nycflights)
## Rows: 32,735
## Columns: 18
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
## $ dep_type <chr> "delayed", "on time", "on time", "on time", "on time", "on t…
## $ avg_speed <dbl> 474.4409, 443.8889, 394.9468, 446.6667, 355.2000, 318.6957, …
avg_speed
vs. distance. Describe the relationship between average
speed and distance. Hint: Use
geom_point().ggplot(data = nycflights, aes(x = distance, y = avg_speed)) + geom_point()
colored by
carrier. Once you replicate the plot, determine (roughly)
what the cutoff point is for departure delays where you can still expect
to get to your destination on time.