Pulling in the data from my github

library(RCurl)
my_git_url <- getURL("https://raw.githubusercontent.com/aelsaeyed/BridgeR/main/datasets/hw3/flights.csv")
flights <- read.csv(text = my_git_url, quote = "")

#flights <- read.csv("/Users/aelsaeyed/BridgeR/datasets/hw3/flights.csv", quote = "" )
#airports <- read.csv("/Users/aelsaeyed/BridgeR/datasets/hw3/airports.csv", quote = "")

Data Exploration

The set contains data pertaining to arrival times and delays, so I wanted to rank the carriers by efficiency. I thought there were two ways to do this- to look at the amount of time the delays took, and also the number of flights that were delayed. From an initial glance, it looks like the carrier F9 has both the longest average delay, and also the most frequent delays. Further research could go into exploring if there are any other factors that affect the delay time or frequency, such as time of year/day or destination airport. I also question whether or not arrival delays are affected by departure delays. I derive a percentage column by dividing the number of delayed flights by the number of total flights per carrier.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
flights_summ = summary(flights)
flights_summ
##      X..               X.year.        X.month.          X.day.     
##  Length:336776      Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  Mode  :character   Median :2013   Median : 7.000   Median :16.00  
##                     Mean   :2013   Mean   : 6.549   Mean   :15.71  
##                     3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00  
##                     Max.   :2013   Max.   :12.000   Max.   :31.00  
##                                                                    
##   X.dep_time.   X.sched_dep_time.  X.dep_delay.      X.arr_time.  
##  Min.   :   1   Min.   : 106      Min.   : -43.00   Min.   :   1  
##  1st Qu.: 907   1st Qu.: 906      1st Qu.:  -5.00   1st Qu.:1104  
##  Median :1401   Median :1359      Median :  -2.00   Median :1535  
##  Mean   :1349   Mean   :1344      Mean   :  12.64   Mean   :1502  
##  3rd Qu.:1744   3rd Qu.:1729      3rd Qu.:  11.00   3rd Qu.:1940  
##  Max.   :2400   Max.   :2359      Max.   :1301.00   Max.   :2400  
##  NA's   :8255                     NA's   :8255      NA's   :8713  
##  X.sched_arr_time.  X.arr_delay.       X.carrier.          X.flight.   
##  Min.   :   1      Min.   : -86.000   Length:336776      Min.   :   1  
##  1st Qu.:1124      1st Qu.: -17.000   Class :character   1st Qu.: 553  
##  Median :1556      Median :  -5.000   Mode  :character   Median :1496  
##  Mean   :1536      Mean   :   6.895                      Mean   :1972  
##  3rd Qu.:1945      3rd Qu.:  14.000                      3rd Qu.:3465  
##  Max.   :2359      Max.   :1272.000                      Max.   :8500  
##                    NA's   :9430                                        
##   X.tailnum.         X.origin.           X.dest.           X.air_time.   
##  Length:336776      Length:336776      Length:336776      Min.   : 20.0  
##  Class :character   Class :character   Class :character   1st Qu.: 82.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :129.0  
##                                                           Mean   :150.7  
##                                                           3rd Qu.:192.0  
##                                                           Max.   :695.0  
##                                                           NA's   :9430   
##   X.distance.      X.hour.        X.minute.     X.time_hour.      
##  Min.   :  17   Min.   : 1.00   Min.   : 0.00   Length:336776     
##  1st Qu.: 502   1st Qu.: 9.00   1st Qu.: 8.00   Class :character  
##  Median : 872   Median :13.00   Median :29.00   Mode  :character  
##  Mean   :1040   Mean   :13.18   Mean   :26.23                     
##  3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:44.00                     
##  Max.   :4983   Max.   :23.00   Max.   :59.00                     
## 
#airports_summ = summary(airports)
#airports_summ

#Average departure delay from JFK 
mean_dep_delay= mean(flights$X.dep_delay., na.rm=TRUE)
mean_dep_delay
## [1] 12.63907
#Average departure by carrier (group by carrier and mean dep_delay)
delays_by_carrier = flights %>% 
  group_by(X.carrier.) %>% 
  summarize(avg_delay=mean(X.dep_delay., na.rm=TRUE)) %>%
  arrange(desc(avg_delay))
delays_by_carrier
## # A tibble: 16 × 2
##    X.carrier. avg_delay
##    <chr>          <dbl>
##  1 "\"F9\""       20.2 
##  2 "\"EV\""       20.0 
##  3 "\"YV\""       19.0 
##  4 "\"FL\""       18.7 
##  5 "\"WN\""       17.7 
##  6 "\"9E\""       16.7 
##  7 "\"B6\""       13.0 
##  8 "\"VX\""       12.9 
##  9 "\"OO\""       12.6 
## 10 "\"UA\""       12.1 
## 11 "\"MQ\""       10.6 
## 12 "\"DL\""        9.26
## 13 "\"AA\""        8.59
## 14 "\"AS\""        5.80
## 15 "\"HA\""        4.90
## 16 "\"US\""        3.78
#Carrier with the most number of delays > 5 minutes 
frequency_of_delays = flights %>%
  group_by(X.carrier.) %>% 
  summarize(total_flights = n(), num_delays = sum(X.dep_delay.>= 5, na.rm=TRUE), percent_delays = (num_delays/total_flights)  * 100) %>%
  arrange(desc(percent_delays))
sum
## function (..., na.rm = FALSE)  .Primitive("sum")
frequency_of_delays
## # A tibble: 16 × 4
##    X.carrier. total_flights num_delays percent_delays
##    <chr>              <int>      <int>          <dbl>
##  1 "\"F9\""             685        290           42.3
##  2 "\"WN\""           12275       5055           41.2
##  3 "\"FL\""            3260       1304           40  
##  4 "\"EV\""           54173      20397           37.7
##  5 "\"UA\""           58665      20189           34.4
##  6 "\"YV\""             601        203           33.8
##  7 "\"9E\""           18460       6072           32.9
##  8 "\"B6\""           54635      17961           32.9
##  9 "\"VX\""            5162       1571           30.4
## 10 "\"MQ\""           26397       7106           26.9
## 11 "\"DL\""           48110      12096           25.1
## 12 "\"AA\""           32729       7678           23.5
## 13 "\"AS\""             714        161           22.5
## 14 "\"OO\""              32          7           21.9
## 15 "\"US\""           20536       3765           18.3
## 16 "\"HA\""             342         37           10.8

Question 2

I decided to cut down the data to something more manageable by removing the columns I wouldn’t be using, such as tailnum, flight ID, airtime, and distance. I also fixed up the column names to remove the annoying X., and used subset to get the top 5 flights with the largest departure delay.

#insurance_subset = subset(hinsurance_csv, hinsurance_csv$X.age > 35 & hinsurance_csv$X.family >= 2)
#insurance_new_cols = insurance_subset[, c("X.age.", "X.family.", "X.ethnicity.", "X.region.")]
#summary(insurance_new_cols)

flights_new_columns = flights[, c("X.month.", "X.day.", "X.dep_delay.", "X.arr_delay.", "X.air_time.", "X.carrier.", "X.dest.", "X.hour.", "X.minute.")]
names(flights_new_columns) <- c('month', 'day', 'dep_delay', 'arr_delay', 'air_time', 'carrier', 'dest', 'hour', 'minute')
head(flights_new_columns)
##   month day dep_delay arr_delay air_time carrier  dest hour minute
## 1     1   1         2        11      227    "UA" "IAH"    5     15
## 2     1   1         4        20      227    "UA" "IAH"    5     29
## 3     1   1         2        33      160    "AA" "MIA"    5     40
## 4     1   1        -1       -18      183    "B6" "BQN"    5     45
## 5     1   1        -6       -25      116    "DL" "ATL"    6      0
## 6     1   1        -4        12      150    "UA" "ORD"    5     58
delays_by_carrier
## # A tibble: 16 × 2
##    X.carrier. avg_delay
##    <chr>          <dbl>
##  1 "\"F9\""       20.2 
##  2 "\"EV\""       20.0 
##  3 "\"YV\""       19.0 
##  4 "\"FL\""       18.7 
##  5 "\"WN\""       17.7 
##  6 "\"9E\""       16.7 
##  7 "\"B6\""       13.0 
##  8 "\"VX\""       12.9 
##  9 "\"OO\""       12.6 
## 10 "\"UA\""       12.1 
## 11 "\"MQ\""       10.6 
## 12 "\"DL\""        9.26
## 13 "\"AA\""        8.59
## 14 "\"AS\""        5.80
## 15 "\"HA\""        4.90
## 16 "\"US\""        3.78
latest_flights = as.list(delays_by_carrier$X.carrier. [1:5])
head(latest_flights)
## [[1]]
## [1] "\"F9\""
## 
## [[2]]
## [1] "\"EV\""
## 
## [[3]]
## [1] "\"YV\""
## 
## [[4]]
## [1] "\"FL\""
## 
## [[5]]
## [1] "\"WN\""
flights_subset = subset(flights_new_columns, flights_new_columns$carrier %in% latest_flights)
head(flights_subset)
##     month day dep_delay arr_delay air_time carrier  dest hour minute
## 8       1   1        -3       -14       53    "EV" "IAD"    6      0
## 34      1   1        -6        29      190    "EV" "MSP"    6     30
## 40      1   1        -1       -19       40    "WN" "BWI"    6     30
## 42      1   1        24        12       52    "EV" "IAD"    6      8
## 75      1   1        -3        10      134    "FL" "MKE"    7     20
## 103     1   1        -1        33      279    "WN" "DEN"    7     55

Question 3

I’m doing a little bit of data wrangling to group the data by number of days, and getting the number of flights per day and comparing that to how many of those were late per day (using arrival this time). I’m displaying this as a histogram to very quickly see visually if a pattern arises across the year. I also focused on the carrier with the most and longest delays, “F9”.

delays_subset = subset(flights_subset, flights_subset$carrier == '"F9"' & flights_subset$arr_delay >=10 & flights_subset$arr_delay <=40)


scatter = plot(delays_subset$arr_delay, delays_subset$air_time)

scatter
## NULL
histo = hist(delays_subset$hour, las=2)

histo
## $breaks
##  [1]  8  9 10 11 12 13 14 15 16 17
## 
## $counts
## [1] 45  1  0  0 36  0  0  0 96
## 
## $density
## [1] 0.252808989 0.005617978 0.000000000 0.000000000 0.202247191 0.000000000
## [7] 0.000000000 0.000000000 0.539325843
## 
## $mids
## [1]  8.5  9.5 10.5 11.5 12.5 13.5 14.5 15.5 16.5
## 
## $xname
## [1] "delays_subset$hour"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
histo2 = hist(delays_subset$month, las=2)

histo2
## $breaks
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
## 
## $counts
##  [1] 36 11 12 16 12 10 16 11 16 21 17
## 
## $density
##  [1] 0.20224719 0.06179775 0.06741573 0.08988764 0.06741573 0.05617978
##  [7] 0.08988764 0.06179775 0.08988764 0.11797753 0.09550562
## 
## $mids
##  [1]  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5
## 
## $xname
## [1] "delays_subset$month"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
box = boxplot(delays_subset$dep_delay)

box
## $stats
##      [,1]
## [1,]  -12
## [2,]   -3
## [3,]    3
## [4,]   21
## [5,]   52
## 
## $n
## [1] 178
## 
## $conf
##           [,1]
## [1,] 0.1577758
## [2,] 5.8422242
## 
## $out
## [1] 61 71 58 59 64
## 
## $group
## [1] 1 1 1 1 1
## 
## $names
## [1] "1"
box2 = boxplot(delays_subset$arr_delay)

box2
## $stats
##      [,1]
## [1,]   10
## [2,]   15
## [3,]   21
## [4,]   30
## [5,]   40
## 
## $n
## [1] 178
## 
## $conf
##          [,1]
## [1,] 19.22361
## [2,] 22.77639
## 
## $out
## numeric(0)
## 
## $group
## numeric(0)
## 
## $names
## [1] "1"
scatter2 = plot(delays_subset$arr_delay, delays_subset$dep_delay)

scatter2
## NULL

Question 4

My question originally was whether or not the time of departure made a difference with regards to the arrival lateness. The first histogram above shows that to some degree, flights that are later in the day, after 8pm, tended to be later than those earlier in the day. The second histogram shows that flights in January tend to arrive later than flights during the rest of the year, and by a lot.

For the scatter plot I graphed air time against arrival delay to see if they would correlate in any way and it’s not very conclusive. Lastly, the box plots show that the delays in arrival center around the 15-30 range without many outliers, while the delays in departure more frequently occur in the 0-20 range, with a few gross outliers in the 60+ range. This was an attempt to explore whether or not departure delays made a difference when it came to arrival delays, however a box plot is probably not the best way to do that- the scatterplot would be the visual aid of choice.

That’s why I decided to make one more scatterplot to see the relationship between arrival delay and departure delay. It looks like there is a slight correlation, with arrival delays being concentrated on the lower end the lower the departure delays got.