Pulling in the data from my github
library(RCurl)
my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/BridgeR/main/datasets/hw3/flights.csv")
flights <- read.csv(text = my_git_url, quote = "")
#flights <- read.csv("/Users/aelsaeyed/BridgeR/datasets/hw3/flights.csv", quote = "" )
#airports <- read.csv("/Users/aelsaeyed/BridgeR/datasets/hw3/airports.csv", quote = "")
The set contains data pertaining to arrival times and delays, so I wanted to rank the carriers by efficiency. I thought there were two ways to do this- to look at the amount of time the delays took, and also the number of flights that were delayed. From an initial glance, it looks like the carrier F9 has both the longest average delay, and also the most frequent delays. Further research could go into exploring if there are any other factors that affect the delay time or frequency, such as time of year/day or destination airport. I also question whether or not arrival delays are affected by departure delays. I derive a percentage column by dividing the number of delayed flights by the number of total flights per carrier.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
flights_summ = summary(flights)
flights_summ
## X.. X.year. X.month. X.day.
## Length:336776 Min. :2013 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
## Mode :character Median :2013 Median : 7.000 Median :16.00
## Mean :2013 Mean : 6.549 Mean :15.71
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :2013 Max. :12.000 Max. :31.00
##
## X.dep_time. X.sched_dep_time. X.dep_delay. X.arr_time.
## Min. : 1 Min. : 106 Min. : -43.00 Min. : 1
## 1st Qu.: 907 1st Qu.: 906 1st Qu.: -5.00 1st Qu.:1104
## Median :1401 Median :1359 Median : -2.00 Median :1535
## Mean :1349 Mean :1344 Mean : 12.64 Mean :1502
## 3rd Qu.:1744 3rd Qu.:1729 3rd Qu.: 11.00 3rd Qu.:1940
## Max. :2400 Max. :2359 Max. :1301.00 Max. :2400
## NA's :8255 NA's :8255 NA's :8713
## X.sched_arr_time. X.arr_delay. X.carrier. X.flight.
## Min. : 1 Min. : -86.000 Length:336776 Min. : 1
## 1st Qu.:1124 1st Qu.: -17.000 Class :character 1st Qu.: 553
## Median :1556 Median : -5.000 Mode :character Median :1496
## Mean :1536 Mean : 6.895 Mean :1972
## 3rd Qu.:1945 3rd Qu.: 14.000 3rd Qu.:3465
## Max. :2359 Max. :1272.000 Max. :8500
## NA's :9430
## X.tailnum. X.origin. X.dest. X.air_time.
## Length:336776 Length:336776 Length:336776 Min. : 20.0
## Class :character Class :character Class :character 1st Qu.: 82.0
## Mode :character Mode :character Mode :character Median :129.0
## Mean :150.7
## 3rd Qu.:192.0
## Max. :695.0
## NA's :9430
## X.distance. X.hour. X.minute. X.time_hour.
## Min. : 17 Min. : 1.00 Min. : 0.00 Length:336776
## 1st Qu.: 502 1st Qu.: 9.00 1st Qu.: 8.00 Class :character
## Median : 872 Median :13.00 Median :29.00 Mode :character
## Mean :1040 Mean :13.18 Mean :26.23
## 3rd Qu.:1389 3rd Qu.:17.00 3rd Qu.:44.00
## Max. :4983 Max. :23.00 Max. :59.00
##
#airports_summ = summary(airports)
#airports_summ
#Average departure delay from JFK
mean_dep_delay= mean(flights$X.dep_delay., na.rm=TRUE)
mean_dep_delay
## [1] 12.63907
#Average departure by carrier (group by carrier and mean dep_delay)
delays_by_carrier = flights %>%
group_by(X.carrier.) %>%
summarize(avg_delay=mean(X.dep_delay., na.rm=TRUE)) %>%
arrange(desc(avg_delay))
delays_by_carrier
## # A tibble: 16 × 2
## X.carrier. avg_delay
## <chr> <dbl>
## 1 "\"F9\"" 20.2
## 2 "\"EV\"" 20.0
## 3 "\"YV\"" 19.0
## 4 "\"FL\"" 18.7
## 5 "\"WN\"" 17.7
## 6 "\"9E\"" 16.7
## 7 "\"B6\"" 13.0
## 8 "\"VX\"" 12.9
## 9 "\"OO\"" 12.6
## 10 "\"UA\"" 12.1
## 11 "\"MQ\"" 10.6
## 12 "\"DL\"" 9.26
## 13 "\"AA\"" 8.59
## 14 "\"AS\"" 5.80
## 15 "\"HA\"" 4.90
## 16 "\"US\"" 3.78
#Carrier with the most number of delays > 5 minutes
frequency_of_delays = flights %>%
group_by(X.carrier.) %>%
summarize(total_flights = n(), num_delays = sum(X.dep_delay.>= 5, na.rm=TRUE), percent_delays = (num_delays/total_flights) * 100) %>%
arrange(desc(percent_delays))
sum
## function (..., na.rm = FALSE) .Primitive("sum")
frequency_of_delays
## # A tibble: 16 × 4
## X.carrier. total_flights num_delays percent_delays
## <chr> <int> <int> <dbl>
## 1 "\"F9\"" 685 290 42.3
## 2 "\"WN\"" 12275 5055 41.2
## 3 "\"FL\"" 3260 1304 40
## 4 "\"EV\"" 54173 20397 37.7
## 5 "\"UA\"" 58665 20189 34.4
## 6 "\"YV\"" 601 203 33.8
## 7 "\"9E\"" 18460 6072 32.9
## 8 "\"B6\"" 54635 17961 32.9
## 9 "\"VX\"" 5162 1571 30.4
## 10 "\"MQ\"" 26397 7106 26.9
## 11 "\"DL\"" 48110 12096 25.1
## 12 "\"AA\"" 32729 7678 23.5
## 13 "\"AS\"" 714 161 22.5
## 14 "\"OO\"" 32 7 21.9
## 15 "\"US\"" 20536 3765 18.3
## 16 "\"HA\"" 342 37 10.8
I decided to cut down the data to something more manageable by removing the columns I wouldn’t be using, such as tailnum, flight ID, airtime, and distance. I also fixed up the column names to remove the annoying X., and used subset to get the top 5 flights with the largest departure delay.
#insurance_subset = subset(hinsurance_csv, hinsurance_csv$X.age > 35 & hinsurance_csv$X.family >= 2)
#insurance_new_cols = insurance_subset[, c("X.age.", "X.family.", "X.ethnicity.", "X.region.")]
#summary(insurance_new_cols)
flights_new_columns = flights[, c("X.month.", "X.day.", "X.dep_delay.", "X.arr_delay.", "X.air_time.", "X.carrier.", "X.dest.", "X.hour.", "X.minute.")]
names(flights_new_columns) <- c('month', 'day', 'dep_delay', 'arr_delay', 'air_time', 'carrier', 'dest', 'hour', 'minute')
head(flights_new_columns)
## month day dep_delay arr_delay air_time carrier dest hour minute
## 1 1 1 2 11 227 "UA" "IAH" 5 15
## 2 1 1 4 20 227 "UA" "IAH" 5 29
## 3 1 1 2 33 160 "AA" "MIA" 5 40
## 4 1 1 -1 -18 183 "B6" "BQN" 5 45
## 5 1 1 -6 -25 116 "DL" "ATL" 6 0
## 6 1 1 -4 12 150 "UA" "ORD" 5 58
delays_by_carrier
## # A tibble: 16 × 2
## X.carrier. avg_delay
## <chr> <dbl>
## 1 "\"F9\"" 20.2
## 2 "\"EV\"" 20.0
## 3 "\"YV\"" 19.0
## 4 "\"FL\"" 18.7
## 5 "\"WN\"" 17.7
## 6 "\"9E\"" 16.7
## 7 "\"B6\"" 13.0
## 8 "\"VX\"" 12.9
## 9 "\"OO\"" 12.6
## 10 "\"UA\"" 12.1
## 11 "\"MQ\"" 10.6
## 12 "\"DL\"" 9.26
## 13 "\"AA\"" 8.59
## 14 "\"AS\"" 5.80
## 15 "\"HA\"" 4.90
## 16 "\"US\"" 3.78
latest_flights = as.list(delays_by_carrier$X.carrier. [1:5])
head(latest_flights)
## [[1]]
## [1] "\"F9\""
##
## [[2]]
## [1] "\"EV\""
##
## [[3]]
## [1] "\"YV\""
##
## [[4]]
## [1] "\"FL\""
##
## [[5]]
## [1] "\"WN\""
flights_subset = subset(flights_new_columns, flights_new_columns$carrier %in% latest_flights)
head(flights_subset)
## month day dep_delay arr_delay air_time carrier dest hour minute
## 8 1 1 -3 -14 53 "EV" "IAD" 6 0
## 34 1 1 -6 29 190 "EV" "MSP" 6 30
## 40 1 1 -1 -19 40 "WN" "BWI" 6 30
## 42 1 1 24 12 52 "EV" "IAD" 6 8
## 75 1 1 -3 10 134 "FL" "MKE" 7 20
## 103 1 1 -1 33 279 "WN" "DEN" 7 55
I’m doing a little bit of data wrangling to group the data by number of days, and getting the number of flights per day and comparing that to how many of those were late per day (using arrival this time). I’m displaying this as a histogram to very quickly see visually if a pattern arises across the year. I also focused on the carrier with the most and longest delays, “F9”.
delays_subset = subset(flights_subset, flights_subset$carrier == '"F9"' & flights_subset$arr_delay >=10 & flights_subset$arr_delay <=40)
scatter = plot(delays_subset$arr_delay, delays_subset$air_time)
scatter
## NULL
histo = hist(delays_subset$hour, las=2)
histo
## $breaks
## [1] 8 9 10 11 12 13 14 15 16 17
##
## $counts
## [1] 45 1 0 0 36 0 0 0 96
##
## $density
## [1] 0.252808989 0.005617978 0.000000000 0.000000000 0.202247191 0.000000000
## [7] 0.000000000 0.000000000 0.539325843
##
## $mids
## [1] 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5
##
## $xname
## [1] "delays_subset$hour"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
histo2 = hist(delays_subset$month, las=2)
histo2
## $breaks
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
##
## $counts
## [1] 36 11 12 16 12 10 16 11 16 21 17
##
## $density
## [1] 0.20224719 0.06179775 0.06741573 0.08988764 0.06741573 0.05617978
## [7] 0.08988764 0.06179775 0.08988764 0.11797753 0.09550562
##
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5
##
## $xname
## [1] "delays_subset$month"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
box = boxplot(delays_subset$dep_delay)
box
## $stats
## [,1]
## [1,] -12
## [2,] -3
## [3,] 3
## [4,] 21
## [5,] 52
##
## $n
## [1] 178
##
## $conf
## [,1]
## [1,] 0.1577758
## [2,] 5.8422242
##
## $out
## [1] 61 71 58 59 64
##
## $group
## [1] 1 1 1 1 1
##
## $names
## [1] "1"
box2 = boxplot(delays_subset$arr_delay)
box2
## $stats
## [,1]
## [1,] 10
## [2,] 15
## [3,] 21
## [4,] 30
## [5,] 40
##
## $n
## [1] 178
##
## $conf
## [,1]
## [1,] 19.22361
## [2,] 22.77639
##
## $out
## numeric(0)
##
## $group
## numeric(0)
##
## $names
## [1] "1"
scatter2 = plot(delays_subset$arr_delay, delays_subset$dep_delay)
scatter2
## NULL
My question originally was whether or not the time of departure made a difference with regards to the arrival lateness. The first histogram above shows that to some degree, flights that are later in the day, after 8pm, tended to be later than those earlier in the day. The second histogram shows that flights in January tend to arrive later than flights during the rest of the year, and by a lot.
For the scatter plot I graphed air time against arrival delay to see if they would correlate in any way and it’s not very conclusive. Lastly, the box plots show that the delays in arrival center around the 15-30 range without many outliers, while the delays in departure more frequently occur in the 0-20 range, with a few gross outliers in the 60+ range. This was an attempt to explore whether or not departure delays made a difference when it came to arrival delays, however a box plot is probably not the best way to do that- the scatterplot would be the visual aid of choice.
That’s why I decided to make one more scatterplot to see the relationship between arrival delay and departure delay. It looks like there is a slight correlation, with arrival delays being concentrated on the lower end the lower the departure delays got.