Adding dplyr to Rmd

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Question 1

Create variables and data frames for each csv.

This reads each csv and put them into their own variables

flights <- read.csv("Downloads/CVG_Flights.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
airports <- read.csv("Downloads/airports.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
airlines <- read.csv("Downloads/airlines.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")

Each variable is put into a data frame.

flights.df <- data.frame(flights)
airports.df <- data.frame(airports)
airlines.df <- data.frame(airlines)

This shows that the data frames were properly created.

Question 2

How many rows are there in each data frame?

The amount of rows in each data frame are counted using nrow.

nrow(flights.df)
## [1] 7763
nrow(airports.df)
## [1] 322
nrow(airlines.df)
## [1] 14

There are 7763 rows in flights.df, 322 rows in airports.df and 14 rows in airlines.df overall.

Question 3

How many columns are there in each data frame?

The amount of columns in each data frame are counted using ncol.

ncol(flights.df)
## [1] 27
ncol(airports.df)
## [1] 7
ncol(airlines.df)
## [1] 2

There are 27 columns in flights.df, 7 columns in airports.df and 2 columns in airlines.df overall.

Question 4

Merge all three data frames into one data frame according to the IATA code of airlines and airports.

The flights.df and the airlines.df are joined together by the “AIRLINE” and “IATA_CODE” into first_step_data.

first_step_data <- left_join(flights.df, 
                             airlines.df, 
                             by=c("AIRLINE"="IATA_CODE"))

The first_step_data and airports.df are joined together by the “ORIGIN_AIRPORT” and “IATA_CODE” into combined_dataframe.

combined_dataframe <- left_join(first_step_data, 
                                airports.df, 
                                by=c("ORIGIN_AIRPORT"="IATA_CODE"))

The end result shows me that I have successfully merged the data frames.

Question 5

Print the first 6 rows.

The first 6 rows of the data are printed.

head(combined_dataframe, 6)
##   FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1    1/1/2015      MQ          3355            CVG                 MIA
## 2    1/1/2015      MQ          3597            CVG                 DFW
## 3    1/1/2015      OO          4811            BNA                 CVG
## 4    1/1/2015      DL          2315            CVG                 ATL
## 5    1/1/2015      F9          1308            IAD                 CVG
## 6    1/1/2015      DL          1502            CVG                 TPA
##   SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1          6:00:00 AM     5:56:00 AM              -4       15 6:11:00 AM
## 2          6:00:00 AM           <NA>              NA       NA       <NA>
## 3          6:45:00 AM     6:37:00 AM              -8       16 6:53:00 AM
## 4          7:10:00 AM     7:08:00 AM              -2       36 7:44:00 AM
## 5          7:15:00 AM     7:05:00 AM             -10       11 7:16:00 AM
## 6          7:20:00 AM     7:19:00 AM              -1        9 7:28:00 AM
##   SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE  WHEELS_ON TAXI_IN
## 1            153          150      123      948 8:14:00 AM      12
## 2            165           NA       NA      812       <NA>      NA
## 3             72           66       45      230 8:38:00 AM       5
## 4            102          101       62      373 8:46:00 AM       3
## 5             95           91       75      388 8:31:00 AM       5
## 6            141          116      105      773 9:13:00 AM       2
##   SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1        8:33:00 AM   8:26:00 AM            -7        0         0
## 2        7:45:00 AM         <NA>            NA        0         1
## 3        8:57:00 AM   8:43:00 AM           -14        0         0
## 4        8:52:00 AM   8:49:00 AM            -3        0         0
## 5        8:50:00 AM   8:36:00 AM           -14        0         0
## 6        9:41:00 AM   9:15:00 AM           -26        0         0
##   CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1                <NA>               NA             NA            NA
## 2                   B               NA             NA            NA
## 3                <NA>               NA             NA            NA
## 4                <NA>               NA             NA            NA
## 5                <NA>               NA             NA            NA
## 6                <NA>               NA             NA            NA
##   LATE_AIRCRAFT_DELAY WEATHER_DELAY                    AIRLINE.y
## 1                  NA            NA American Eagle Airlines Inc.
## 2                  NA            NA American Eagle Airlines Inc.
## 3                  NA            NA        Skywest Airlines Inc.
## 4                  NA            NA         Delta Air Lines Inc.
## 5                  NA            NA       Frontier Airlines Inc.
## 6                  NA            NA         Delta Air Lines Inc.
##                                              AIRPORT      CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 2 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 3                    Nashville International Airport Nashville    TN     USA
## 4 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 5            Washington Dulles International Airport Chantilly    VA     USA
## 6 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
##   LATITUDE LONGITUDE
## 1 39.04614 -84.66217
## 2 39.04614 -84.66217
## 3 36.12448 -86.67818
## 4 39.04614 -84.66217
## 5 38.94453 -77.45581
## 6 39.04614 -84.66217

The last 6 rows of the data are printed.

tail(combined_dataframe, 6)
##      FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 7758   3/10/2015      DL           803            CVG                 MCO
## 7759   3/10/2015      DL          2011            ATL                 CVG
## 7760   3/10/2015      EV          4638            EWR                 CVG
## 7761   3/10/2015      DL          1705            CVG                 LGA
## 7762   3/10/2015      F9           734            DEN                 CVG
## 7763   3/10/2015      OO          5643            DEN                 CVG
##      SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT  WHEELS_OFF
## 7758          9:40:00 AM     9:36:00 AM              -4       12  9:48:00 AM
## 7759          9:40:00 AM     9:48:00 AM               8       10  9:58:00 AM
## 7760          9:41:00 AM     9:31:00 AM             -10       15  9:46:00 AM
## 7761          9:45:00 AM     9:45:00 AM               0       13  9:58:00 AM
## 7762         10:00:00 AM    10:09:00 AM               9       16 10:25:00 AM
## 7763         10:00:00 AM    10:02:00 AM               2       16 10:18:00 AM
##      SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE  WHEELS_ON TAXI_IN
## 7758            129          127      110      757 1:38:00 AM       5
## 7759             85           74       59      373 1:57:00 AM       5
## 7760            136          125      103      569 1:29:00 AM       7
## 7761            121          100       73      585 1:11:00 AM      14
## 7762            152          156      132     1069 1:37:00 AM       8
## 7763            159          159      134     1069 1:32:00 AM       9
##      SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 7758       11:49:00 AM  11:43:00 AM            -6        0         0
## 7759       11:05:00 AM  11:02:00 AM            -3        0         0
## 7760       11:57:00 AM  11:36:00 AM           -21        0         0
## 7761       11:46:00 AM  11:25:00 AM           -21        0         0
## 7762        2:32:00 PM   2:45:00 PM            13        0         0
## 7763        2:39:00 PM   2:41:00 PM             2        0         0
##      CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 7758                <NA>               NA             NA            NA
## 7759                <NA>               NA             NA            NA
## 7760                <NA>               NA             NA            NA
## 7761                <NA>               NA             NA            NA
## 7762                <NA>               NA             NA            NA
## 7763                <NA>               NA             NA            NA
##      LATE_AIRCRAFT_DELAY WEATHER_DELAY                   AIRLINE.y
## 7758                  NA            NA        Delta Air Lines Inc.
## 7759                  NA            NA        Delta Air Lines Inc.
## 7760                  NA            NA Atlantic Southeast Airlines
## 7761                  NA            NA        Delta Air Lines Inc.
## 7762                  NA            NA      Frontier Airlines Inc.
## 7763                  NA            NA       Skywest Airlines Inc.
##                                                 AIRPORT      CITY STATE COUNTRY
## 7758 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 7759   Hartsfield-Jackson Atlanta International Airport   Atlanta    GA     USA
## 7760               Newark Liberty International Airport    Newark    NJ     USA
## 7761 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 7762                       Denver International Airport    Denver    CO     USA
## 7763                       Denver International Airport    Denver    CO     USA
##      LATITUDE  LONGITUDE
## 7758 39.04614  -84.66217
## 7759 33.64044  -84.42694
## 7760 40.69250  -74.16866
## 7761 39.04614  -84.66217
## 7762 39.85841 -104.66700
## 7763 39.85841 -104.66700

Question 6

Are there any missing values and in which variables?

Using a combination of colSums and is.na, missing numbers within the combined_dataframe are found

colnames(combined_dataframe)[colSums(is.na(combined_dataframe))>0]
##  [1] "DEPARTURE_TIME"      "DEPARTURE_DELAY"     "TAXI_OUT"           
##  [4] "WHEELS_OFF"          "ELAPSED_TIME"        "AIR_TIME"           
##  [7] "WHEELS_ON"           "TAXI_IN"             "ARRIVAL_TIME"       
## [10] "ARRIVAL_DELAY"       "CANCELLATION_REASON" "AIR_SYSTEM_DELAY"   
## [13] "SECURITY_DELAY"      "AIRLINE_DELAY"       "LATE_AIRCRAFT_DELAY"
## [16] "WEATHER_DELAY"

Missing values were found in columns Departure_Time, Departure_Delay, TAXI_OUT, WHEELS_OFF, ELAPSED_TIME, AIR_TIME, ARRIVAL_DELAY, CANCELLATION_REASON, AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, and WEATHER_DELAY. This was due to there being no delays/cancellations for certain flights and some flights being canceled in general.

Question 7

What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there? What are they?

This gives us the number of flights not canceled and then the number of flights canceled.

combined_dataframe %>% count(CANCELLED, sort = TRUE)
##   CANCELLED    n
## 1         0 7385
## 2         1  378

7385 flights were not canceled and 378 flights were actually canceled.

This gives the proportion of canceled flights to all flights in total.

378/(7385+378)
## [1] 0.04869252

This cancellation proportion is 0.04869252 or 4.869252%.

This finds the reasons as to why flights are canceled.

combined_dataframe %>% count(CANCELLATION_REASON, sort = TRUE)
##   CANCELLATION_REASON    n
## 1                <NA> 7385
## 2                   B  180
## 3                   C  127
## 4                   A   71

There are three different reasons for flights being canceled. Those include reasons “B”, “C” and “A”.

Question 8

For DEPARTURE_TIME, are there missing values? Do we know why these values are missing?

By using the count of DEPARTURE_TIME, we are able to see if there are missing values or not.

combined_dataframe %>% count(is.na(DEPARTURE_TIME), sort = TRUE)
##   is.na(DEPARTURE_TIME)    n
## 1                 FALSE 7393
## 2                  TRUE  370

There are 370 missing values. These values are missing due to canceled flights.

Question 9

Create a new variable as the time difference between the SCHEDULED_TIME and the ELAPSED_TIME. Print the first six elements of the new variable.

The new variable TIME_DIFFERENCE is created by subtracting ELAPSED_TIME from SCHEDULED_TIME.

combined_dataframe$TIME_DIFFERENCE <- combined_dataframe$SCHEDULED_TIME - combined_dataframe$ELAPSED_TIME

The first 6 elements of the variable are printed.

head(combined_dataframe$TIME_DIFFERENCE, 6)
## [1]  3 NA  6  1  4 25

The first 6 elements for the newly created TIME_DIFFERENCE variable are 3, NA, 6, 1, 4, and 25

Question 10

Extract the observations with AIRLINE of Delta and ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport and make a new data frame. Print the first six flight numbers of the new data frame.

Observations were found from filtering ORIGIN_AIRPORT and AIRLINE from the combined_dataframe. These observations were then put into a new data frame.

Delta_CVG <- filter(combined_dataframe, ORIGIN_AIRPORT == "CVG", AIRLINE == "DL")

The first 6 elements of the variable are printed.

head(Delta_CVG, 6)
##   FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1    1/1/2015      DL          2315            CVG                 ATL
## 2    1/1/2015      DL          1502            CVG                 TPA
## 3    1/1/2015      DL          2359            CVG                 RSW
## 4    1/1/2015      DL          1741            CVG                 SLC
## 5    1/1/2015      DL          1221            CVG                 SEA
## 6    1/1/2015      DL          2367            CVG                 FLL
##   SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1          7:10:00 AM     7:08:00 AM              -2       36 7:44:00 AM
## 2          7:20:00 AM     7:19:00 AM              -1        9 7:28:00 AM
## 3          7:31:00 AM     7:22:00 AM              -9       27 7:49:00 AM
## 4          8:25:00 AM     8:20:00 AM              -5       20 8:40:00 AM
## 5          8:40:00 AM     8:39:00 AM              -1       10 8:49:00 AM
## 6          9:30:00 AM     9:27:00 AM              -3       19 9:46:00 AM
##   SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE  WHEELS_ON TAXI_IN
## 1            102          101       62      373 8:46:00 AM       3
## 2            141          116      105      773 9:13:00 AM       2
## 3            153          150      117      879 9:46:00 AM       6
## 4            244          238      214     1450 1:14:00 AM       4
## 5            306          289      270     1965 1:19:00 AM       9
## 6            165          144      122      932 1:48:00 AM       3
##   SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1        8:52:00 AM   8:49:00 AM            -3        0         0
## 2        9:41:00 AM   9:15:00 AM           -26        0         0
## 3       10:04:00 AM   9:52:00 AM           -12        0         0
## 4       10:29:00 AM  10:18:00 AM           -11        0         0
## 5       10:46:00 AM  10:28:00 AM           -18        0         0
## 6       12:15:00 PM  11:51:00 AM           -24        0         0
##   CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1                <NA>               NA             NA            NA
## 2                <NA>               NA             NA            NA
## 3                <NA>               NA             NA            NA
## 4                <NA>               NA             NA            NA
## 5                <NA>               NA             NA            NA
## 6                <NA>               NA             NA            NA
##   LATE_AIRCRAFT_DELAY WEATHER_DELAY            AIRLINE.y
## 1                  NA            NA Delta Air Lines Inc.
## 2                  NA            NA Delta Air Lines Inc.
## 3                  NA            NA Delta Air Lines Inc.
## 4                  NA            NA Delta Air Lines Inc.
## 5                  NA            NA Delta Air Lines Inc.
## 6                  NA            NA Delta Air Lines Inc.
##                                              AIRPORT      CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 2 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 3 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 4 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 5 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 6 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
##   LATITUDE LONGITUDE TIME_DIFFERENCE
## 1 39.04614 -84.66217               1
## 2 39.04614 -84.66217              25
## 3 39.04614 -84.66217               3
## 4 39.04614 -84.66217               6
## 5 39.04614 -84.66217              17
## 6 39.04614 -84.66217              21

The new data frame was created and the first six elements were printed.

Question 11

Extract the observations with AIRLINE of Delta, ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and DEPARTURE_DELAY time larger than 30 minutes, and make a new data frame. Print the first six flight numbers of the new data frame.

Observations were found from filtering DEPARTURE_DELAY > 30 from the Delta_CVG. These observations were put into a new data frame.

Delayed_Delta_CVG <- filter(Delta_CVG, DEPARTURE_DELAY > 30)

The first 6 elements of the variable are printed.

head(Delayed_Delta_CVG, 6)
##   FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1    1/3/2015      DL          1741            CVG                 SLC
## 2    1/3/2015      DL          1833            CVG                 LAX
## 3    1/4/2015      DL          1854            CVG                 RSW
## 4    1/9/2015      DL          1503            CVG                 MCO
## 5    1/9/2015      DL            63            CVG                 BOS
## 6   1/10/2015      DL          1934            CVG                 ATL
##   SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT  WHEELS_OFF
## 1          8:25:00 AM    11:41:00 AM             196       11 11:52:00 AM
## 2          4:46:00 PM     5:43:00 PM              57        7  5:50:00 PM
## 3          9:35:00 AM    10:20:00 AM              45       15 10:35:00 AM
## 4          1:55:00 PM     2:39:00 PM              44       15  2:54:00 PM
## 5          4:15:00 PM     5:32:00 PM              77        9  5:41:00 PM
## 6          2:19:00 PM     3:58:00 PM              99        8  4:06:00 PM
##   SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE  WHEELS_ON TAXI_IN
## 1            245          222      208     1450 1:20:00 AM       3
## 2            294          269      250     1900 1:00:00 AM      12
## 3            153          156      139      879 1:54:00 AM       2
## 4            134          131      109      757 1:43:00 AM       7
## 5            121          104       90      752 1:11:00 AM       5
## 6             99           74       60      373 1:06:00 AM       6
##   SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1       10:30:00 AM   1:23:00 PM           173        0         0
## 2        6:40:00 PM   7:12:00 PM            32        0         0
## 3       12:08:00 PM  12:56:00 PM            48        0         0
## 4        4:09:00 PM   4:50:00 PM            41        0         0
## 5        6:16:00 PM   7:16:00 PM            60        0         0
## 6        3:58:00 PM   5:12:00 PM            74        0         0
##   CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1                <NA>                0              0           173
## 2                <NA>                0              0             9
## 3                <NA>               48              0             0
## 4                <NA>                0              0            41
## 5                <NA>                0              0             0
## 6                <NA>                0              0            74
##   LATE_AIRCRAFT_DELAY WEATHER_DELAY            AIRLINE.y
## 1                   0             0 Delta Air Lines Inc.
## 2                  23             0 Delta Air Lines Inc.
## 3                   0             0 Delta Air Lines Inc.
## 4                   0             0 Delta Air Lines Inc.
## 5                  60             0 Delta Air Lines Inc.
## 6                   0             0 Delta Air Lines Inc.
##                                              AIRPORT      CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 2 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 3 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 4 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 5 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 6 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
##   LATITUDE LONGITUDE TIME_DIFFERENCE
## 1 39.04614 -84.66217              23
## 2 39.04614 -84.66217              25
## 3 39.04614 -84.66217              -3
## 4 39.04614 -84.66217               3
## 5 39.04614 -84.66217              17
## 6 39.04614 -84.66217              25

The new data frame was created and the first six elements were printed.

Question 12

Use group_by() and summarize() to compute the average departure delay time for different airlines. Which airline has the longest and shortest average department delay?

This groups the combined_dataframe by AIRLINE into a new data frame called airlines_differences.

airlines_differences <- group_by(combined_dataframe, AIRLINE)

This step creates a data frame that contains the average departure time for each different airline.

avg_airlines_differences <- summarize(airlines_differences, mean_airline = mean(DEPARTURE_DELAY, na.rm= TRUE))

The average delay times for each airlines were DL=6.09, EV=10.4, F9=23.1, MQ=19.1, and OO=11.1.

Lists out the data frame.

list(avg_airlines_differences)
## [[1]]
## # A tibble: 5 x 2
##   AIRLINE mean_airline
## * <chr>          <dbl>
## 1 DL              6.09
## 2 EV             10.4 
## 3 F9             23.9 
## 4 MQ             19.1 
## 5 OO             11.1

The largest delay time average is F9 with 23.9 minutes and the smallest is DL with 6.09 minutes.

Question 13

Use group_by() and summarize() to compute the average departure delay time for different ORIGIN_AIRPORT. Which ORIGIN_AIRPORT has the longest and shortest average department delay?

This groups the combined_dataframe by ORIGIN_AIRPORT into a new data frame called airport_differences.

airport_differences <- group_by(combined_dataframe, ORIGIN_AIRPORT)

This step creates a data frame that contains the average departure time for each different airport. See table for all of the values.

avg_airport_differences <- summarize(airport_differences, mean_airport = mean(DEPARTURE_DELAY, na.rm= TRUE))

Lists out the data frame.

list(avg_airport_differences)
## [[1]]
## # A tibble: 38 x 2
##    ORIGIN_AIRPORT mean_airport
##  * <chr>                 <dbl>
##  1 ATL                    7.12
##  2 BDL                   -4   
##  3 BNA                   13.3 
##  4 BOS                   21.1 
##  5 CLT                   10.0 
##  6 CVG                   11.6 
##  7 DEN                   15.1 
##  8 DFW                   16.9 
##  9 DTW                   21.3 
## 10 EWR                    7.29
## # … with 28 more rows

The largest delay time average is TTN with 74.33 minutes and the smallest is STL with -4.41 minutes (4.41 minutes ahead of schedule).