In this exploratory analysis I’ll be using data from nycflights13, a package that contains information about all flights that departed from NYC (EWR, JFK, and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013. The package is available to anyone on CRAN at https://cran.r-project.org/.
To install it, type install.packages("nycflights13") in the R console.
This package provides the following data tables.
- flights: all flights that departed from NYC in 2013.
- weather: hourly meteorological data for each airport.
- planes: construction information about each plane.
- airports: airport names and locations.
- airlines: translation between two-letter carrier codes and names.
To load the packages, type:
library(nycflights13)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
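With both packages loaded, a quick way to get oriented is dplyr::glimpse(), which prints each table’s dimensions and column types; a minimal sketch:

# Column names, types, and dimensions of each table in the package
glimpse(flights)
glimpse(weather)
glimpse(planes)
glimpse(airports)
glimpse(airlines)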
I’ll be using nycflights13::flights, a data frame that contains the 336,776 flights that departed from New York City in 2013.
First, I’ll explore the dataset’s structure:
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
n_distinct(flights$carrier)
## [1] 16
n_distinct(flights$origin)
## [1] 3
n_distinct(flights$dest)
## [1] 105
# Number of cancelled departures (missing dep_time)
sum(is.na(flights$dep_time))
## [1] 8255
From this basic inspection we can see that 16 different carriers fly out of NYC’s 3 airports to 105 different destinations. 8,255 departures were cancelled; these show up in the data as missing (NA) dep_time values.
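To put that number in context, here is a minimal sketch of the cancellation rate, assuming (as above) that a missing dep_time marks a cancelled departure:

# Proportion of scheduled departures that were cancelled (8,255 / 336,776, roughly 2.5%)
mean(is.na(flights$dep_time))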
Now let’s examine the relationship between distance and the average arrival delay for each destination.
new_flights <- flights %>% group_by(dest) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
new_flights
Let’s explore the output visually.
ggplot(new_flights, aes(dist,delay)) + geom_point(aes(size = count),alpha = 0.5) +
geom_smooth() + geom_jitter()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
## Removed 1 rows containing missing values (geom_point).
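The warnings report one removed row: one destination has an undefined average delay, because every one of its arr_delay values is missing and mean() over an empty set returns NaN. A minimal sketch to identify it:

# Destination(s) whose average arrival delay could not be computed
new_flights %>% filter(is.na(delay))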
Flights over short-to-medium distances tend to have larger average delays; long-distance flights are delayed less on average.
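One way to double-check this pattern is to drop destinations with very few flights (whose averages are noisy) and the single very-long-haul outlier, Honolulu (HNL), before re-plotting; the 20-flight cutoff here is only an illustrative choice:

# Re-plot after removing low-count destinations and the Honolulu outlier
new_flights %>%
  filter(count > 20, dest != "HNL") %>%
  ggplot(aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 0.5) +
  geom_smooth(se = FALSE)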
Now let’s look at the planes with the highest average delays.
(I opted to exclude flights with missing delay values rather than aggregate over them.)
# Remove flights with missing arrival or departure delays (cancelled flights)
not_cancelled <- flights %>%
filter(!is.na(arr_delay), !is.na(dep_delay))
not_cancelled
planes_with_delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
count = n(),
avg_delay = mean(arr_delay)
)
planes_with_delays
I’ll use geom_freqpoly() to display the counts with lines instead of bars; geom_histogram() would also be a safe option here.
(It is much easier to understand overlapping lines than bars.)
ggplot(planes_with_delays, aes(avg_delay)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
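For comparison, the same distribution drawn with geom_histogram(); the 10-minute binwidth is an arbitrary but explicit choice that also silences the default-bins message:

# Histogram alternative: bars instead of a frequency polygon
ggplot(planes_with_delays, aes(avg_delay)) + geom_histogram(binwidth = 10)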
Let’s find the planes with the largest average delays.
max_delay <- planes_with_delays %>% arrange(-avg_delay)
max_delay
Some planes have an average delay of around 300 minutes, i.e. 5 hours. Let’s explore further.
ggplot(planes_with_delays, aes(count, avg_delay)) + geom_point()
With a small number of flights there is much greater variation in the average delay.
Let’s filter out the planes with few flights, where the variation is most extreme.
# The cutoff of 20 flights is an educated guess
planes_with_delays %>%
filter(count > 20) %>%
ggplot(aes(count, avg_delay)) + geom_point()
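Rather than guessing, one could inspect the distribution of flights per plane to motivate a cutoff; a minimal sketch:

# How many flights does a typical plane have?
summary(planes_with_delays$count)
quantile(planes_with_delays$count, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))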
As we can see, the variation decreases as the sample size increases. I believe the Law of Large Numbers explains this: the standard error of a sample mean shrinks as the number of observations grows, so planes with many flights have average delays that sit closer to the overall mean.
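To illustrate the point with the data itself, here is a minimal sketch that repeatedly samples delays at increasing sample sizes and measures the spread of the resulting sample means; the sample sizes and the 500 repetitions are arbitrary choices:

set.seed(42)  # for reproducibility
delays <- not_cancelled$arr_delay

# Standard deviation of 500 sample means at each sample size:
# the spread should shrink as the sample size grows
sample_sizes <- c(5, 20, 100, 500)
sapply(sample_sizes, function(n) {
  sd(replicate(500, mean(sample(delays, n))))
})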