The data
The Bureau of Transportation Statistics (BTS) is a statistical agency that is part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects transportation data and makes it available, such as the flights data we will be working with in this lab.
First, we’ll view the nycflights data frame. Type the following in your console to load the data:
data(nycflights)
To view the names of the variables, type the command
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
The codebook (description of the variables) can be accessed by pulling up the help file:
?nycflights
## starting httpd help server ... done
Use glimpse() to take a quick peek at your data and get a better sense of its contents.
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
Departure delays
Let’s start by examining the distribution of departure delays of all flights with a histogram.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This function says to plot the dep_delay variable from the nycflights
data frame on the x-axis. It also defines a geom (short for geometric
object), which describes the type of plot you will produce.
Histograms are generally a very good way to see the shape of a single distribution of numerical data, but that shape can change depending on how the data is split between the different bins. You can easily define the binwidth you want to use:
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
EXERCISE 1 Look carefully at these three histograms.
Question: How do they compare? Answer: All three histograms display the same data, and all are right-skewed. However, the second histogram (binwidth = 15) shows a sharp rise and fall in the distribution.
Question: Are features revealed in one that are obscured in another? Answer: Yes. Because the bin widths differ, the third histogram (binwidth = 150) lumps the flights into a few wide bars and obscures detail that the other two reveal.
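To see why bin width changes the picture, here is a minimal sketch using a small made-up vector of delays (the values are hypothetical, not taken from nycflights). It counts observations per bin for two bin widths with base R's hist(), without drawing a plot.

```r
# Hypothetical departure delays in minutes (made up for illustration).
delays <- c(-5, -3, -2, 0, 1, 4, 8, 12, 20, 35, 60, 140)

# Count observations per bin without drawing the plot.
narrow <- hist(delays, breaks = seq(-15, 150, by = 15), plot = FALSE)
wide   <- hist(delays, breaks = seq(-150, 150, by = 150), plot = FALSE)

narrow$counts  # eleven narrow bins: the long right tail stays visible
wide$counts    # two wide bins: almost all of the shape is lost
```

Both calls bin the same twelve values; only the break points differ, which is the same effect that binwidth has in geom_histogram().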
If you want to visualize only the delays of flights headed to Los Angeles, you first need to filter the data for flights with that destination (dest == "LAX") and then make a histogram of the departure delays of only those flights.
lax_flights <- nycflights %>%
filter(dest == "LAX")
ggplot(data = lax_flights, aes(x = dep_delay)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can also obtain numerical summaries for these flights:
lax_flights %>%
summarise(mean_dd = mean(dep_delay),
median_dd = median(dep_delay),
n = n())
## # A tibble: 1 × 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 9.78 -1 1583
You can also filter based on multiple criteria. Suppose you are interested in flights headed to San Francisco (SFO) in February:
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
EXERCISE 2 Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights.
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
sfo_feb_flights
## # A tibble: 68 × 16
## year month day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 2 18 1527 57 1903 48 DL N711ZX 1322
## 2 2013 2 3 613 14 1008 38 UA N502UA 691
## 3 2013 2 15 955 -5 1313 -28 DL N717TW 1765
## 4 2013 2 18 1928 15 2239 -6 UA N24212 1214
## 5 2013 2 24 1340 2 1644 -21 UA N76269 1111
## 6 2013 2 25 1415 -10 1737 -13 UA N532UA 394
## 7 2013 2 7 1032 1 1352 -10 B6 N627JB 641
## 8 2013 2 15 1805 20 2122 2 AA N335AA 177
## 9 2013 2 13 1056 -4 1412 -13 UA N532UA 642
## 10 2013 2 8 656 -4 1039 -6 DL N710TW 1865
## # … with 58 more rows, 6 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
## # variable name ¹arr_delay
QUESTION: How many flights meet these criteria? ANSWER: 68 flights.
sfo_feb_flights %>% summarise( n_flights = n())
## # A tibble: 1 × 1
## n_flights
## <int>
## 1 68
EXERCISE 3 Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
sfo_feb_flights %>%
group_by(origin) %>%
summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay), n_flights = n())
## # A tibble: 2 × 4
## origin median_dd iqr_dd n_flights
## <chr> <dbl> <dbl> <int>
## 1 EWR 0.5 5.75 8
## 2 JFK -2.5 15.2 60
ggplot(data = sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth = 15)
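The exercise asks about arrival delays, whereas the summary above is computed on dep_delay by origin. A minimal sketch of the same kind of summary on arr_delay follows. To stay self-contained it uses only the first ten arr_delay values printed in the sfo_feb_flights output above, stored in a hypothetical data frame named first_ten, so the numbers it returns do not describe all 68 flights. In the lab itself, sfo_feb_flights would be piped in instead.

```r
library(dplyr)

# First ten arr_delay values from the printed sfo_feb_flights rows
# (a small excerpt, not the full data frame).
first_ten <- data.frame(
  arr_delay = c(48, 38, -28, -6, -21, -13, -10, 2, -13, -6)
)

# Median and IQR are robust choices when the distribution is skewed.
arr_summary <- first_ten %>%
  summarise(median_ad = median(arr_delay),
            iqr_ad    = IQR(arr_delay),
            n_flights = n())
arr_summary
```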
EXERCISE 4
QUESTION: Which month would you expect to have the highest average delay departing from an NYC airport? ANSWER: July (month 7), closely followed by June (month 6).
nycflights %>%
group_by(month) %>%
summarise(mean_dd = mean(dep_delay)) %>%
arrange(desc(mean_dd))
## # A tibble: 12 × 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
QUESTION: Calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
ANSWER: DL and UA have the most variable arrival delays; both have an IQR of 22 minutes.
sfo_feb_flights %>%
group_by(carrier) %>%
summarise(median = median(arr_delay), IQR = IQR(arr_delay)) %>%
arrange(desc(IQR))
## # A tibble: 5 × 3
## carrier median IQR
## <chr> <dbl> <dbl>
## 1 DL -15 22
## 2 UA -10 22
## 3 VX -22.5 21.2
## 4 AA 5 17.5
## 5 B6 -10.5 12.2
EXERCISE 5 QUESTION: Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
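ANSWER: The mean uses every flight in the data set, so it reflects how large the delays are, but in a right-skewed distribution like this one a few extreme delays pull it upward. The median is the middle value of the data set, so it is robust to those extreme delays and better describes a typical flight, but it says nothing about how bad the worst delays can be.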
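A small sketch of that trade-off, using made-up delays (hypothetical values, not drawn from nycflights): adding one very long delay moves the mean a great deal and the median hardly at all.

```r
# Hypothetical delays in minutes for one month (made up for illustration).
typical <- c(-4, -3, -1, 0, 2, 5, 7)

# The same month with a single five-hour delay added.
with_outlier <- c(typical, 300)

mean(typical)
median(typical)

mean(with_outlier)    # pulled far upward by the one extreme flight
median(with_outlier)  # barely changes
```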
On time departure rate for NYC airports
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
EXERCISE 6 QUESTION: If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of?
ANSWER: I would choose to fly out of LGA, because LGA has the highest on-time departure rate (about 72.8%).
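As a sketch of how the on-time rate is computed, the example below applies the same under-5-minutes rule to a tiny made-up data frame (the name toy and its values are hypothetical). It also shows that dividing the count of on-time flights by n() gives the same result as taking the mean of the logical condition.

```r
library(dplyr)

# Hypothetical flights (values made up for illustration).
toy <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "JFK", "JFK", "LGA"),
  dep_delay = c(-2, 30, 0, 4, 12, -5)
)

toy_rates <- toy %>%
  # Same rule as in the lab: under 5 minutes late counts as on time.
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed")) %>%
  group_by(origin) %>%
  summarise(rate_sum  = sum(dep_type == "on time") / n(),
            rate_mean = mean(dep_type == "on time"))
toy_rates
```

The two columns agree because TRUE counts as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values.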
You can also visualize the distribution of on-time departure rates across the three airports using a segmented bar plot.
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
geom_bar()
EXERCISE 7 Mutate the data frame so that it includes a new variable, avg_speed, that contains the average speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
nycflights <- nycflights %>%
mutate(avg_speed = distance / (air_time / 60))
nycflights
## # A tibble: 32,735 × 18
## year month day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## 7 2013 12 9 1259 14 1617 22 WN N218WN 1428
## 8 2013 8 13 1920 85 2032 71 B6 N284JB 1407
## 9 2013 9 26 725 -10 1027 -8 AA N3FSAA 2279
## 10 2013 4 30 1323 62 1549 60 EV N12163 4162
## # … with 32,725 more rows, 8 more variables: origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, dep_type <chr>,
## # avg_speed <dbl>, and abbreviated variable name ¹arr_delay
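As a worked check of the avg_speed formula, the sketch below repeats the calculation by hand for the first flight shown in the glimpse() output (distance 2475 miles, air_time 313 minutes). It assumes, as the hint states, that air_time is in minutes and distance is in miles.

```r
# First flight in the glimpse() output: 2475 miles in 313 minutes.
distance <- 2475
air_time <- 313

hours <- air_time / 60         # convert minutes to hours
avg_speed <- distance / hours  # miles per hour
avg_speed
```

The result is roughly 474 mph, a plausible cruising speed for a flight from JFK to LAX.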
EXERCISE 8 Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
ggplot(data = nycflights, aes(x = distance, y = avg_speed, color= carrier)) + geom_point()+
labs(title = "AVERAGE SPEED VS DISTANCE", x = "DISTANCE", y = "AVG. SPEED")
EXERCISE 9 QUESTION: Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
ANSWER: The cutoff point for departure delays where you can still expect to get to your destination on time seems to be somewhere between 50 and 60 minutes.
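# Keep only American (AA), Delta (DL), and United (UA) flights.
nycflightsF <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA"))
# Plot arrival delay against departure delay, colored by carrier.
ggplot(data = nycflightsF, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point() +
  labs(title = "ARRIVAL DELAY VS DEPARTURE DELAY", x = "DEPARTURE DELAY", y = "ARRIVAL DELAY")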