Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information – the data. In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times. Since this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.
In this lab, we will explore and visualize the data using the tidyverse suite of packages. The data can be found in the companion package for OpenIntro labs, openintro.
Let’s load the packages.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.1
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openintro)
## Warning: package 'openintro' was built under R version 4.1.1
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data(nycflights)
To view the names of the variables, type the command
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
Remember that you can use glimpse to take a quick peek at your data to understand its contents better.
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ~
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10~
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, ~
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940~
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, ~
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, ~
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, ~
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", ~
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", ~
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, ~
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA~
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA~
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,~
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,~
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6~
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24~
The nycflights data frame is a massive trove of information. Let’s think about some questions we might want to answer with these data:
Let’s start by examining the distribution of departure delays of all flights with a histogram.
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
  geom_histogram(binwidth = 150)
Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
The three histograms show how the choice of bin width affects the appearance of the data. The third histogram has very wide bars, which makes it difficult to read detail along the horizontal axis. The third has the largest bin width (150), followed by the first (the default of 30 bins), then the second (15).
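To make the comparison concrete, the default of 30 bins can be translated into an approximate bin width by dividing the range of the data by the number of bins. The sketch below uses a small made-up vector, `toy_delays`, rather than the real `dep_delay` column, and the exact width ggplot2 chooses may differ slightly from this rough calculation.

```r
# Hypothetical departure delays in minutes (illustrative values, not from nycflights)
toy_delays <- c(-20, -5, 0, 15, 130, 1300)

# Rough bin width implied by asking for 30 bins across the observed range
n_bins <- 30
approx_width <- diff(range(toy_delays)) / n_bins

approx_width
```

If the real delays span a similar range, a width of this size would fall between the 15-minute and 150-minute bin widths used in the other two histograms.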
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
sfo_feb_flights <- nycflights %>% filter(dest == "SFO", month == 2)
nrow(sfo_feb_flights)
## [1] 68
The number of flights that meet these criteria is 68.
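As a sketch of how this kind of filtering works, the example below applies the same two conditions to a tiny made-up data frame, `toy_flights` (hypothetical, not part of openintro). Conditions separated by commas in `filter()` are combined with a logical AND, so the two calls should select the same rows.

```r
library(dplyr)

# Hypothetical miniature version of the flights data
toy_flights <- tibble(
  dest  = c("SFO", "SFO", "LAX", "SFO"),
  month = c(2, 3, 2, 2)
)

# Comma-separated conditions: both must hold for a row to be kept
with_comma <- toy_flights %>% filter(dest == "SFO", month == 2)

# Equivalent form with an explicit logical AND
with_and <- toy_flights %>% filter(dest == "SFO" & month == 2)

nrow(with_comma)
```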
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) + geom_histogram(binwidth = 10)
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
sfo_feb_flights %>% summarise(mean_ad = mean(arr_delay), median_ad = median(arr_delay), IQR_ad = IQR(arr_delay), n_flights = n())
Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
sfo_feb_flights %>% group_by(carrier) %>% summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay), n_flights = n())
The carriers with the most variable arrival delays are DL and UA, because they have the highest IQR.
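The same pattern of grouping, summarising, and then sorting by IQR can be tried on a small example. The data frame `toy_flights` and its values are invented for illustration; sorting by `desc(iqr_ad)` puts the most variable group first.

```r
library(dplyr)

# Hypothetical arrival delays (minutes) for two carriers
toy_flights <- tibble(
  carrier   = c("AA", "AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  arr_delay = c(-5, 0, 5, 10, -40, -10, 20, 60)
)

# Median and IQR per carrier, most variable carrier first
toy_summary <- toy_flights %>%
  group_by(carrier) %>%
  summarise(median_ad = median(arr_delay), iqr_ad = IQR(arr_delay)) %>%
  arrange(desc(iqr_ad))

toy_summary
```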
Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))
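The trade-off between the mean and the median can be seen in a small example. The values in `toy_month` below are invented for illustration: most flights leave close to schedule, but one is delayed by several hours.

```r
# Hypothetical departure delays (minutes) for one month;
# the last value is a single extreme delay
toy_month <- c(-3, -2, 0, 1, 4, 300)

mean(toy_month)    # pulled upward by the single extreme delay
median(toy_month)  # stays close to the typical flight
```

The mean accounts for the risk of rare, very long delays but can be dominated by them; the median describes the typical delay but says nothing about how bad the worst cases are.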
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
You can also visualize the distribution of on-time departure rates across the three airports using a segmented bar plot.
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
  geom_bar()
From the graph displayed, the airport with the highest on-time departure rate is JFK.
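One caveat when reading this plot: `geom_bar()` stacks raw counts by default, so bar heights reflect the number of flights at each airport rather than rates. A minimal sketch of an alternative, assuming a small made-up data frame `toy_flights` with hypothetical airport codes, uses `position = "fill"` so that every bar has height 1 and the segments show proportions.

```r
library(ggplot2)

# Hypothetical data: two airports with different numbers of flights
toy_flights <- data.frame(
  origin   = c(rep("AAA", 10), rep("BBB", 4)),
  dep_type = c(rep("on time", 6), rep("delayed", 4),
               rep("on time", 3), rep("delayed", 1))
)

# On-time rate per airport, computed directly for comparison with the plot
ot_rate <- tapply(toy_flights$dep_type == "on time", toy_flights$origin, mean)

# position = "fill" rescales each bar to height 1, so segments show proportions
p <- ggplot(data = toy_flights, aes(x = origin, fill = dep_type)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")

ot_rate
```

In this toy example the airport with more on-time flights in absolute terms has the lower on-time rate, which is the kind of difference a stacked count plot can hide.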
Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed, traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
nycflights <- nycflights %>% mutate(avg_speed = distance/(air_time/60))
head(nycflights %>% select(distance, air_time, avg_speed))
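As a quick check on the formula, the calculation can be worked by hand for a single flight. The values below are the first entries of `distance` and `air_time` shown in the `glimpse()` output (2475 miles and 313 minutes); the variable names are chosen here for illustration.

```r
# First flight in the glimpse() output: 2475 miles flown in 313 minutes
distance_miles   <- 2475
air_time_minutes <- 313

# Convert minutes to hours, then divide distance by time to get miles per hour
air_time_hours <- air_time_minutes / 60
avg_speed_mph  <- distance_miles / air_time_hours

round(avg_speed_mph, 1)  # about 474.4 mph
```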
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
ggplot(data = nycflights, aes(x = distance, y = avg_speed)) + geom_point() + theme_bw() + labs(title = "avg_speed vs distance")
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
dl_aa_ua <- nycflights %>%
  filter(carrier == "AA" | carrier == "DL" | carrier == "UA")
ggplot(data = dl_aa_ua, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point() +
  xlim(-20, 200) +
  ylim(-20, 200)
## Warning: Removed 3271 rows containing missing values (geom_point).
The cutoff point for departure delay is approximately 25 minutes. Beyond this point, arrival delays increase, so flights can no longer be expected to arrive on time.
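The three-carrier filter used to build `dl_aa_ua` can also be written as a membership test. The sketch below, on a made-up data frame `toy_flights` (hypothetical values), compares the chained `|` form with `%in%`; both should keep the same rows.

```r
library(dplyr)

# Hypothetical carrier codes for illustration
toy_flights <- tibble(carrier = c("AA", "B6", "DL", "UA", "WN", "AA"))

# Chained | comparisons, as in the filter used for dl_aa_ua
with_or <- toy_flights %>%
  filter(carrier == "AA" | carrier == "DL" | carrier == "UA")

# Equivalent membership test with %in%
with_in <- toy_flights %>%
  filter(carrier %in% c("AA", "DL", "UA"))

nrow(with_in)
```

The `%in%` form tends to be easier to extend when more carriers need to be added to the comparison.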