Adding dplyr to Rmd
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Create variables and data frames for each csv.
This reads each csv and put them into their own variables
flights <- read.csv("Downloads/CVG_Flights.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
airports <- read.csv("Downloads/airports.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
airlines <- read.csv("Downloads/airlines.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
Each variable is put into a data frame.
flights.df <- data.frame(flights)
airports.df <- data.frame(airports)
airlines.df <- data.frame(airlines)
This shows that the data frames were properly created.
How many rows are there in each data frame?
The amount of rows in each data frame are counted using nrow.
nrow(flights.df)
## [1] 7763
nrow(airports.df)
## [1] 322
nrow(airlines.df)
## [1] 14
There are 7763 rows in flights.df, 322 rows in airports.df and 14 rows in airlines.df overall.
How many columns are there in each data frame?
The amount of columns in each data frame are counted using ncol.
ncol(flights.df)
## [1] 27
ncol(airports.df)
## [1] 7
ncol(airlines.df)
## [1] 2
There are 27 columns in flights.df, 7 columns in airports.df and 2 columns in airlines.df overall.
Merge all three data frames into one data frame according to the IATA code of airlines and airports.
The flights.df and the airlines.df are joined together by the “AIRLINE” and “IATA_CODE” into first_step_data.
first_step_data <- left_join(flights.df,
airlines.df,
by=c("AIRLINE"="IATA_CODE"))
The first_step_data and airports.df are joined together by the “ORIGIN_AIRPORT” and “IATA_CODE” into combined_dataframe.
combined_dataframe <- left_join(first_step_data,
airports.df,
by=c("ORIGIN_AIRPORT"="IATA_CODE"))
The end result shows me that I have successfully merged the data frames.
Print the first 6 rows.
The first 6 rows of the data are printed.
head(combined_dataframe, 6)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1 1/1/2015 MQ 3355 CVG MIA
## 2 1/1/2015 MQ 3597 CVG DFW
## 3 1/1/2015 OO 4811 BNA CVG
## 4 1/1/2015 DL 2315 CVG ATL
## 5 1/1/2015 F9 1308 IAD CVG
## 6 1/1/2015 DL 1502 CVG TPA
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1 6:00:00 AM 5:56:00 AM -4 15 6:11:00 AM
## 2 6:00:00 AM <NA> NA NA <NA>
## 3 6:45:00 AM 6:37:00 AM -8 16 6:53:00 AM
## 4 7:10:00 AM 7:08:00 AM -2 36 7:44:00 AM
## 5 7:15:00 AM 7:05:00 AM -10 11 7:16:00 AM
## 6 7:20:00 AM 7:19:00 AM -1 9 7:28:00 AM
## SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 1 153 150 123 948 8:14:00 AM 12
## 2 165 NA NA 812 <NA> NA
## 3 72 66 45 230 8:38:00 AM 5
## 4 102 101 62 373 8:46:00 AM 3
## 5 95 91 75 388 8:31:00 AM 5
## 6 141 116 105 773 9:13:00 AM 2
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1 8:33:00 AM 8:26:00 AM -7 0 0
## 2 7:45:00 AM <NA> NA 0 1
## 3 8:57:00 AM 8:43:00 AM -14 0 0
## 4 8:52:00 AM 8:49:00 AM -3 0 0
## 5 8:50:00 AM 8:36:00 AM -14 0 0
## 6 9:41:00 AM 9:15:00 AM -26 0 0
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1 <NA> NA NA NA
## 2 B NA NA NA
## 3 <NA> NA NA NA
## 4 <NA> NA NA NA
## 5 <NA> NA NA NA
## 6 <NA> NA NA NA
## LATE_AIRCRAFT_DELAY WEATHER_DELAY AIRLINE.y
## 1 NA NA American Eagle Airlines Inc.
## 2 NA NA American Eagle Airlines Inc.
## 3 NA NA Skywest Airlines Inc.
## 4 NA NA Delta Air Lines Inc.
## 5 NA NA Frontier Airlines Inc.
## 6 NA NA Delta Air Lines Inc.
## AIRPORT CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 2 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 3 Nashville International Airport Nashville TN USA
## 4 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 5 Washington Dulles International Airport Chantilly VA USA
## 6 Cincinnati/Northern Kentucky International Airport Covington KY USA
## LATITUDE LONGITUDE
## 1 39.04614 -84.66217
## 2 39.04614 -84.66217
## 3 36.12448 -86.67818
## 4 39.04614 -84.66217
## 5 38.94453 -77.45581
## 6 39.04614 -84.66217
The last 6 rows of the data are printed.
tail(combined_dataframe, 6)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 7758 3/10/2015 DL 803 CVG MCO
## 7759 3/10/2015 DL 2011 ATL CVG
## 7760 3/10/2015 EV 4638 EWR CVG
## 7761 3/10/2015 DL 1705 CVG LGA
## 7762 3/10/2015 F9 734 DEN CVG
## 7763 3/10/2015 OO 5643 DEN CVG
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 7758 9:40:00 AM 9:36:00 AM -4 12 9:48:00 AM
## 7759 9:40:00 AM 9:48:00 AM 8 10 9:58:00 AM
## 7760 9:41:00 AM 9:31:00 AM -10 15 9:46:00 AM
## 7761 9:45:00 AM 9:45:00 AM 0 13 9:58:00 AM
## 7762 10:00:00 AM 10:09:00 AM 9 16 10:25:00 AM
## 7763 10:00:00 AM 10:02:00 AM 2 16 10:18:00 AM
## SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 7758 129 127 110 757 1:38:00 AM 5
## 7759 85 74 59 373 1:57:00 AM 5
## 7760 136 125 103 569 1:29:00 AM 7
## 7761 121 100 73 585 1:11:00 AM 14
## 7762 152 156 132 1069 1:37:00 AM 8
## 7763 159 159 134 1069 1:32:00 AM 9
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 7758 11:49:00 AM 11:43:00 AM -6 0 0
## 7759 11:05:00 AM 11:02:00 AM -3 0 0
## 7760 11:57:00 AM 11:36:00 AM -21 0 0
## 7761 11:46:00 AM 11:25:00 AM -21 0 0
## 7762 2:32:00 PM 2:45:00 PM 13 0 0
## 7763 2:39:00 PM 2:41:00 PM 2 0 0
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 7758 <NA> NA NA NA
## 7759 <NA> NA NA NA
## 7760 <NA> NA NA NA
## 7761 <NA> NA NA NA
## 7762 <NA> NA NA NA
## 7763 <NA> NA NA NA
## LATE_AIRCRAFT_DELAY WEATHER_DELAY AIRLINE.y
## 7758 NA NA Delta Air Lines Inc.
## 7759 NA NA Delta Air Lines Inc.
## 7760 NA NA Atlantic Southeast Airlines
## 7761 NA NA Delta Air Lines Inc.
## 7762 NA NA Frontier Airlines Inc.
## 7763 NA NA Skywest Airlines Inc.
## AIRPORT CITY STATE COUNTRY
## 7758 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 7759 Hartsfield-Jackson Atlanta International Airport Atlanta GA USA
## 7760 Newark Liberty International Airport Newark NJ USA
## 7761 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 7762 Denver International Airport Denver CO USA
## 7763 Denver International Airport Denver CO USA
## LATITUDE LONGITUDE
## 7758 39.04614 -84.66217
## 7759 33.64044 -84.42694
## 7760 40.69250 -74.16866
## 7761 39.04614 -84.66217
## 7762 39.85841 -104.66700
## 7763 39.85841 -104.66700
Are there any missing values and in which variables?
Using a combination of colSums and is.na, missing numbers within the combined_dataframe are found
colnames(combined_dataframe)[colSums(is.na(combined_dataframe))>0]
## [1] "DEPARTURE_TIME" "DEPARTURE_DELAY" "TAXI_OUT"
## [4] "WHEELS_OFF" "ELAPSED_TIME" "AIR_TIME"
## [7] "WHEELS_ON" "TAXI_IN" "ARRIVAL_TIME"
## [10] "ARRIVAL_DELAY" "CANCELLATION_REASON" "AIR_SYSTEM_DELAY"
## [13] "SECURITY_DELAY" "AIRLINE_DELAY" "LATE_AIRCRAFT_DELAY"
## [16] "WEATHER_DELAY"
Missing values were found in columns Departure_Time, Departure_Delay, TAXI_OUT, WHEELS_OFF, ELAPSED_TIME, AIR_TIME, ARRIVAL_DELAY, CANCELLATION_REASON, AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, and WEATHER_DELAY. This was due to there being no delays/cancellations for certain flights and some flights being canceled in general.
What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there? What are they?
This gives us the number of flights not canceled and then the number of flights canceled.
combined_dataframe %>% count(CANCELLED, sort = TRUE)
## CANCELLED n
## 1 0 7385
## 2 1 378
7385 flights were not canceled and 378 flights were actually canceled.
This gives the proportion of canceled flights to all flights in total.
378/(7385+378)
## [1] 0.04869252
This cancellation proportion is 0.04869252 or 4.869252%.
This finds the reasons as to why flights are canceled.
combined_dataframe %>% count(CANCELLATION_REASON, sort = TRUE)
## CANCELLATION_REASON n
## 1 <NA> 7385
## 2 B 180
## 3 C 127
## 4 A 71
There are three different reasons for flights being canceled. Those include reasons “B”, “C” and “A”.
For DEPARTURE_TIME, are there missing values? Do we know why these values are missing?
By using the count of DEPARTURE_TIME, we are able to see if there are missing values or not.
combined_dataframe %>% count(is.na(DEPARTURE_TIME), sort = TRUE)
## is.na(DEPARTURE_TIME) n
## 1 FALSE 7393
## 2 TRUE 370
There are 370 missing values. These values are missing due to canceled flights.
Create a new variable as the time difference between the SCHEDULED_TIME and the ELAPSED_TIME. Print the first six elements of the new variable.
The new variable TIME_DIFFERENCE is created by subtracting ELAPSED_TIME from SCHEDULED_TIME.
combined_dataframe$TIME_DIFFERENCE <- combined_dataframe$SCHEDULED_TIME - combined_dataframe$ELAPSED_TIME
The first 6 elements of the variable are printed.
head(combined_dataframe$TIME_DIFFERENCE, 6)
## [1] 3 NA 6 1 4 25
The first 6 elements for the newly created TIME_DIFFERENCE variable are 3, NA, 6, 1, 4, and 25
Extract the observations with AIRLINE of Delta and ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport and make a new data frame. Print the first six flight numbers of the new data frame.
Observations were found from filtering ORIGIN_AIRPORT and AIRLINE from the combined_dataframe. These observations were then put into a new data frame.
Delta_CVG <- filter(combined_dataframe, ORIGIN_AIRPORT == "CVG", AIRLINE == "DL")
The first 6 elements of the variable are printed.
head(Delta_CVG, 6)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1 1/1/2015 DL 2315 CVG ATL
## 2 1/1/2015 DL 1502 CVG TPA
## 3 1/1/2015 DL 2359 CVG RSW
## 4 1/1/2015 DL 1741 CVG SLC
## 5 1/1/2015 DL 1221 CVG SEA
## 6 1/1/2015 DL 2367 CVG FLL
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1 7:10:00 AM 7:08:00 AM -2 36 7:44:00 AM
## 2 7:20:00 AM 7:19:00 AM -1 9 7:28:00 AM
## 3 7:31:00 AM 7:22:00 AM -9 27 7:49:00 AM
## 4 8:25:00 AM 8:20:00 AM -5 20 8:40:00 AM
## 5 8:40:00 AM 8:39:00 AM -1 10 8:49:00 AM
## 6 9:30:00 AM 9:27:00 AM -3 19 9:46:00 AM
## SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 1 102 101 62 373 8:46:00 AM 3
## 2 141 116 105 773 9:13:00 AM 2
## 3 153 150 117 879 9:46:00 AM 6
## 4 244 238 214 1450 1:14:00 AM 4
## 5 306 289 270 1965 1:19:00 AM 9
## 6 165 144 122 932 1:48:00 AM 3
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1 8:52:00 AM 8:49:00 AM -3 0 0
## 2 9:41:00 AM 9:15:00 AM -26 0 0
## 3 10:04:00 AM 9:52:00 AM -12 0 0
## 4 10:29:00 AM 10:18:00 AM -11 0 0
## 5 10:46:00 AM 10:28:00 AM -18 0 0
## 6 12:15:00 PM 11:51:00 AM -24 0 0
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1 <NA> NA NA NA
## 2 <NA> NA NA NA
## 3 <NA> NA NA NA
## 4 <NA> NA NA NA
## 5 <NA> NA NA NA
## 6 <NA> NA NA NA
## LATE_AIRCRAFT_DELAY WEATHER_DELAY AIRLINE.y
## 1 NA NA Delta Air Lines Inc.
## 2 NA NA Delta Air Lines Inc.
## 3 NA NA Delta Air Lines Inc.
## 4 NA NA Delta Air Lines Inc.
## 5 NA NA Delta Air Lines Inc.
## 6 NA NA Delta Air Lines Inc.
## AIRPORT CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 2 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 3 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 4 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 5 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 6 Cincinnati/Northern Kentucky International Airport Covington KY USA
## LATITUDE LONGITUDE TIME_DIFFERENCE
## 1 39.04614 -84.66217 1
## 2 39.04614 -84.66217 25
## 3 39.04614 -84.66217 3
## 4 39.04614 -84.66217 6
## 5 39.04614 -84.66217 17
## 6 39.04614 -84.66217 21
The new data frame was created and the first six elements were printed.
Extract the observations with AIRLINE of Delta, ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and DEPARTURE_DELAY time larger than 30 minutes, and make a new data frame. Print the first six flight numbers of the new data frame.
Observations were found from filtering DEPARTURE_DELAY > 30 from the Delta_CVG. These observations were put into a new data frame.
Delayed_Delta_CVG <- filter(Delta_CVG, DEPARTURE_DELAY > 30)
The first 6 elements of the variable are printed.
head(Delayed_Delta_CVG, 6)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1 1/3/2015 DL 1741 CVG SLC
## 2 1/3/2015 DL 1833 CVG LAX
## 3 1/4/2015 DL 1854 CVG RSW
## 4 1/9/2015 DL 1503 CVG MCO
## 5 1/9/2015 DL 63 CVG BOS
## 6 1/10/2015 DL 1934 CVG ATL
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1 8:25:00 AM 11:41:00 AM 196 11 11:52:00 AM
## 2 4:46:00 PM 5:43:00 PM 57 7 5:50:00 PM
## 3 9:35:00 AM 10:20:00 AM 45 15 10:35:00 AM
## 4 1:55:00 PM 2:39:00 PM 44 15 2:54:00 PM
## 5 4:15:00 PM 5:32:00 PM 77 9 5:41:00 PM
## 6 2:19:00 PM 3:58:00 PM 99 8 4:06:00 PM
## SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 1 245 222 208 1450 1:20:00 AM 3
## 2 294 269 250 1900 1:00:00 AM 12
## 3 153 156 139 879 1:54:00 AM 2
## 4 134 131 109 757 1:43:00 AM 7
## 5 121 104 90 752 1:11:00 AM 5
## 6 99 74 60 373 1:06:00 AM 6
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1 10:30:00 AM 1:23:00 PM 173 0 0
## 2 6:40:00 PM 7:12:00 PM 32 0 0
## 3 12:08:00 PM 12:56:00 PM 48 0 0
## 4 4:09:00 PM 4:50:00 PM 41 0 0
## 5 6:16:00 PM 7:16:00 PM 60 0 0
## 6 3:58:00 PM 5:12:00 PM 74 0 0
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1 <NA> 0 0 173
## 2 <NA> 0 0 9
## 3 <NA> 48 0 0
## 4 <NA> 0 0 41
## 5 <NA> 0 0 0
## 6 <NA> 0 0 74
## LATE_AIRCRAFT_DELAY WEATHER_DELAY AIRLINE.y
## 1 0 0 Delta Air Lines Inc.
## 2 23 0 Delta Air Lines Inc.
## 3 0 0 Delta Air Lines Inc.
## 4 0 0 Delta Air Lines Inc.
## 5 60 0 Delta Air Lines Inc.
## 6 0 0 Delta Air Lines Inc.
## AIRPORT CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 2 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 3 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 4 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 5 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 6 Cincinnati/Northern Kentucky International Airport Covington KY USA
## LATITUDE LONGITUDE TIME_DIFFERENCE
## 1 39.04614 -84.66217 23
## 2 39.04614 -84.66217 25
## 3 39.04614 -84.66217 -3
## 4 39.04614 -84.66217 3
## 5 39.04614 -84.66217 17
## 6 39.04614 -84.66217 25
The new data frame was created and the first six elements were printed.
Use group_by() and summarize() to compute the average departure delay time for different airlines. Which airline has the longest and shortest average department delay?
This groups the combined_dataframe by AIRLINE into a new data frame called airlines_differences.
airlines_differences <- group_by(combined_dataframe, AIRLINE)
This step creates a data frame that contains the average departure time for each different airline.
avg_airlines_differences <- summarize(airlines_differences, mean_airline = mean(DEPARTURE_DELAY, na.rm= TRUE))
The average delay times for each airlines were DL=6.09, EV=10.4, F9=23.1, MQ=19.1, and OO=11.1.
Lists out the data frame.
list(avg_airlines_differences)
## [[1]]
## # A tibble: 5 x 2
## AIRLINE mean_airline
## * <chr> <dbl>
## 1 DL 6.09
## 2 EV 10.4
## 3 F9 23.9
## 4 MQ 19.1
## 5 OO 11.1
The largest delay time average is F9 with 23.9 minutes and the smallest is DL with 6.09 minutes.
Use group_by() and summarize() to compute the average departure delay time for different ORIGIN_AIRPORT. Which ORIGIN_AIRPORT has the longest and shortest average department delay?
This groups the combined_dataframe by ORIGIN_AIRPORT into a new data frame called airport_differences.
airport_differences <- group_by(combined_dataframe, ORIGIN_AIRPORT)
This step creates a data frame that contains the average departure time for each different airport. See table for all of the values.
avg_airport_differences <- summarize(airport_differences, mean_airport = mean(DEPARTURE_DELAY, na.rm= TRUE))
Lists out the data frame.
list(avg_airport_differences)
## [[1]]
## # A tibble: 38 x 2
## ORIGIN_AIRPORT mean_airport
## * <chr> <dbl>
## 1 ATL 7.12
## 2 BDL -4
## 3 BNA 13.3
## 4 BOS 21.1
## 5 CLT 10.0
## 6 CVG 11.6
## 7 DEN 15.1
## 8 DFW 16.9
## 9 DTW 21.3
## 10 EWR 7.29
## # … with 28 more rows
The largest delay time average is TTN with 74.33 minutes and the smallest is STL with -4.41 minutes (4.41 minutes ahead of schedule).