R Markdown

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
library(maps)
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(ggalt)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggmap)
## ℹ Google's Terms of Service: ]8;;https://mapsplatform.google.com<https://mapsplatform.google.com>]8;;
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
CVG.Flights <- read_csv("CVG_Flights.csv")
## Rows: 7763 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): FLIGHT_DATE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT, CANCEL...
## dbl  (16): FLIGHT_NUMBER, DEPARTURE_DELAY, TAXI_OUT, SCHEDULED_TIME, ELAPSED...
## time  (6): SCHEDULED_DEPARTURE, DEPARTURE_TIME, WHEELS_OFF, WHEELS_ON, SCHED...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airlines <- read_csv("airlines.csv")
## Rows: 14 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): IATA_CODE, AIRLINE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airports <- read_csv("airports.csv")
## Rows: 322 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): IATA_CODE, AIRPORT, CITY, STATE, COUNTRY
## dbl (2): LATITUDE, LONGITUDE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 1:

I will use dim() to find the number of variables and observations. Then I will use is.na() to find if there are any missing variables. To show the missing variables, I will use colSums and is.na() to show the amount of NA’s for each variable.

dim(CVG.Flights)
## [1] 7763   27
sum(is.na(CVG.Flights))
## [1] 40270
colSums(is.na(CVG.Flights))
##         FLIGHT_DATE             AIRLINE       FLIGHT_NUMBER      ORIGIN_AIRPORT 
##                   0                   0                   0                   0 
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE      DEPARTURE_TIME     DEPARTURE_DELAY 
##                   0                   0                 370                 370 
##            TAXI_OUT          WHEELS_OFF      SCHEDULED_TIME        ELAPSED_TIME 
##                 378                 378                   0                 392 
##            AIR_TIME            DISTANCE           WHEELS_ON             TAXI_IN 
##                 392                   0                 381                 381 
##   SCHEDULED_ARRIVAL        ARRIVAL_TIME       ARRIVAL_DELAY            DIVERTED 
##                   0                 381                 392                   0 
##           CANCELLED CANCELLATION_REASON    AIR_SYSTEM_DELAY      SECURITY_DELAY 
##                   0                7385                5814                5814 
##       AIRLINE_DELAY LATE_AIRCRAFT_DELAY       WEATHER_DELAY 
##                5814                5814                5814

Answer:

There are 7763 observations, 27 variables, and 40270 missing variables.

Question 2:

I am going to start this question by gathering a summary of all of the data.

summary(CVG.Flights)
##  FLIGHT_DATE          AIRLINE          FLIGHT_NUMBER  ORIGIN_AIRPORT    
##  Length:7763        Length:7763        Min.   :  62   Length:7763       
##  Class :character   Class :character   1st Qu.:1908   Class :character  
##  Mode  :character   Mode  :character   Median :3246   Mode  :character  
##                                        Mean   :3194                     
##                                        3rd Qu.:4536                     
##                                        Max.   :6297                     
##                                                                         
##  DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME    DEPARTURE_DELAY 
##  Length:7763         Length:7763         Length:7763       Min.   :-24.00  
##  Class :character    Class1:hms          Class1:hms        1st Qu.: -5.00  
##  Mode  :character    Class2:difftime     Class2:difftime   Median : -1.00  
##                      Mode  :numeric      Mode  :numeric    Mean   : 12.35  
##                                                            3rd Qu.: 12.00  
##                                                            Max.   :708.00  
##                                                            NA's   :370     
##     TAXI_OUT       WHEELS_OFF       SCHEDULED_TIME   ELAPSED_TIME  
##  Min.   :  1.00   Length:7763       Min.   : 60.0   Min.   : 49.0  
##  1st Qu.: 12.00   Class1:hms        1st Qu.: 85.0   1st Qu.: 83.0  
##  Median : 15.00   Class2:difftime   Median :126.0   Median :123.0  
##  Mean   : 18.52   Mode  :numeric    Mean   :128.1   Mean   :127.3  
##  3rd Qu.: 21.00                     3rd Qu.:156.0   3rd Qu.:152.0  
##  Max.   :122.00                     Max.   :313.0   Max.   :340.0  
##  NA's   :378                                        NA's   :392    
##     AIR_TIME        DISTANCE       WHEELS_ON           TAXI_IN       
##  Min.   : 36.0   Min.   : 229.0   Length:7763       Min.   :  1.000  
##  1st Qu.: 56.0   1st Qu.: 308.0   Class1:hms        1st Qu.:  5.000  
##  Median : 99.0   Median : 589.0   Class2:difftime   Median :  6.000  
##  Mean   :100.4   Mean   : 675.9   Mode  :numeric    Mean   :  8.448  
##  3rd Qu.:125.0   3rd Qu.: 871.0                     3rd Qu.:  9.000  
##  Max.   :319.0   Max.   :2036.0                     Max.   :128.000  
##  NA's   :392                                        NA's   :381      
##  SCHEDULED_ARRIVAL ARRIVAL_TIME      ARRIVAL_DELAY       DIVERTED       
##  Length:7763       Length:7763       Min.   :-56.00   Min.   :0.000000  
##  Class1:hms        Class1:hms        1st Qu.:-12.00   1st Qu.:0.000000  
##  Class2:difftime   Class2:difftime   Median : -2.00   Median :0.000000  
##  Mode  :numeric    Mode  :numeric    Mean   : 10.65   Mean   :0.001803  
##                                      3rd Qu.: 16.00   3rd Qu.:0.000000  
##                                      Max.   :716.00   Max.   :1.000000  
##                                      NA's   :392                        
##    CANCELLED       CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY
##  Min.   :0.00000   Length:7763         Min.   :  0.00   Min.   :0.00  
##  1st Qu.:0.00000   Class :character    1st Qu.:  0.00   1st Qu.:0.00  
##  Median :0.00000   Mode  :character    Median : 10.00   Median :0.00  
##  Mean   :0.04869                       Mean   : 16.35   Mean   :0.01  
##  3rd Qu.:0.00000                       3rd Qu.: 23.00   3rd Qu.:0.00  
##  Max.   :1.00000                       Max.   :258.00   Max.   :9.00  
##                                        NA's   :5814     NA's   :5814  
##  AIRLINE_DELAY    LATE_AIRCRAFT_DELAY WEATHER_DELAY    
##  Min.   :  0.00   Min.   :  0.00      Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00      1st Qu.:  0.000  
##  Median :  0.00   Median :  2.00      Median :  0.000  
##  Mean   : 17.06   Mean   : 23.24      Mean   :  3.873  
##  3rd Qu.: 18.00   3rd Qu.: 30.00      3rd Qu.:  0.000  
##  Max.   :435.00   Max.   :435.00      Max.   :638.000  
##  NA's   :5814     NA's   :5814        NA's   :5814

1:

For flight data, I can see that data was only collected for January, February, and March of 2015. The majority of the flights were taken on January 2nd, 2015.

2:

For airline data, we can see that DL has the most flights with 2289 and F9 has the least with 515.

3:

For flight number data, the highest was 6297 showcasing the amount of flight numbers used in the data set.

4:

For origin airport data, there are 38 different origin airports in the data set, with CVG being the highest at 3886.

5:

For destination airport data, there are also 38, with CVG being the highest at 3877.

6:

For scheduled departure data, we can see that 4:15 PM is the most common scheduled departure time.

7:

For departure time data, we can see that the most common time is 9:28 AM.

8:

For departure delay data, we can see that they average delay throughout all airports in the data set in 12.35 minutes.

9:

For taxi out data, we can see that the average time it takes for the aircraft to taxi is 18.52 minutes. The longest it takes is 122 minutes and the shortest is 1 minute.

10:

for wheels off data, we can see that the most common time that planes take off the ground is at 9:41 AM.

11:

For scheduled time data, we can see that none of them are less than 60 minutes. The most amount is 313 minutes. The median time is 126 minutes.

12:

For elapsed time data, we can see that the minumum is 49 minutes, the maximum is 340 minutes, and the median is 123 minutes. This variable also has 392 missing variables.

13:

For air time data, we can see that the least amount of time the aircrafts from these airports are in the air is 36 minutes. This means there is no flights that are less than 36 minutes long.

14:

For distance data, we can see that the farthest any of these planes fly is 2036 miles. However, the median is 589 miles.

16:

For wheels on data, we can see that the majority of these planes have their wheels back on the ground by 1:13 AM.

17:

For taxi in data, we can see that the median time for planes to taxi back into the airport is 6 minutes. This variable also has 381 missing values.

18:

For scheduled arrival data, we can see that the most common schedule arrival time is 7:24 PM.

19:

For arrival time data, we can see that the most common arrival time is 7:23 PM.

20:

or arrival delay data, we can see that the largest arrival delay time that this data set has experienced in 716 minutes. There are also 392 missing values in this variable.

21:

For diverted data, we can see that this is a dummy variable. The zeros mean that a plane was not divered and the one means it was. This does not tell us specifics, but it shows us that there has been instances where a plane was diverted.

22:

For cancelled data, we can see that this is also a dummy variable. This tells us that there has been instances where a flight was cancelled.

23:

For cancellation reason data, we can see that cancellation reason B is the highest with 180. This variable also has 7385 missing values which is the most out of all the variables.

24:

For air system delay data, we can see that this causes an average delay of 16.35 minutes.

25:

For security delay data, we can see that this causes at most a delay of 9 minutes. This one does not seem to happen very often.

26:

For airline delay data, we can see that this causes an average delay of 17.06 minutes.

27:

For late aircraft data, we can see that this causes an average delay of 23.24 minutes. This one is the most common reason for delay.

28:

For weather delay data, we can see that this causes an average delay of 3.873 minutes.

Question 3:

For this question, I am going to plot the relationship between airline and origin airport. I am going to filter it to only flights on January 1st, 2015 in order to see which airlines flew out of which airports on that day.

flight2 <- filter(CVG.Flights, FLIGHT_DATE == "1/1/2015")
ggplot(flight2, aes(x = ORIGIN_AIRPORT, fill = AIRLINE)) +
  geom_bar() +
  ggtitle("Count of each airline out of each airport in January 1st, 2015") +
  labs(x = "Origin Airport",
       y = "Count of Airlines")

# Interpretation: This graph shows us that MQ appears to have been flown the most that day, with DL trailing right behind. We also see that majority of flights were flying out of CVG. Also, every single airline flew out of CVG that specific day.

Question 4:

For this plot, I am going to graph the relationship between distance and air time based on airline. I aim to see which airlines travel the furthest and spend the most time in the air.

ggplot(CVG.Flights, aes(x = DISTANCE, y = AIR_TIME, color = AIRLINE)) +
  geom_boxplot() +
  facet_wrap(~AIRLINE) +
  ggtitle("Travel Distance and Air Time Per Airline") +
  labs(x = "Distance",
       y = "Air Time")
## Warning: Removed 392 rows containing non-finite values (stat_boxplot).

Interprtation:

This graph shows me that DL flies the furthest distances and MQ flies the shortest distances. However, the median air time for MQ and DL are the same at 100. We can see that F9 does not travel any less than around 500 miles. Lastly, OO has the smallest median air time.

Question 5:

airlinesmerged <- left_join(CVG.Flights, airlines, by = c("AIRLINE" = "IATA_CODE"))
airportsmerged <- left_join(airlinesmerged, airports, by = c("ORIGIN_AIRPORT" = "IATA_CODE"))
flightsmerged <- left_join(airportsmerged, airports, by = c("DESTINATION_AIRPORT" = "IATA_CODE"))

Question 6:

1: Which state experiences the most weather delays?

weather <- filter(flightsmerged, WEATHER_DELAY > 0)
weather_by_origin <- aggregate(weather$WEATHER_DELAY ~ weather$STATE.x, FUN = length)
weather_by_origin
##    weather$STATE.x weather$WEATHER_DELAY
## 1               CO                     2
## 2               FL                     1
## 3               GA                     6
## 4               IL                    88
## 5               KY                    70
## 6               MA                     5
## 7               MN                     1
## 8               NJ                     3
## 9               NV                     1
## 10              NY                     4
## 11              TN                     1
## 12              TX                    29
## 13              WI                     4
ggplot(weather_by_origin, aes(`weather$STATE.x`, `weather$WEATHER_DELAY`)) +
  geom_col(color = "blue") +
ggtitle("Total Amount of Weather Delays Per State") +
  xlab("State") +
  ylab("Count of Weather Delays")

Interpretation:

Here we xan see that Illinois experiences the highest amount of weather delays, which is no surprise. We can see that Florida experiences the least. I am very surprised to see Colorado so low due to the amount of snow they get. But, that must be due to the fact that it is dry all the other times of the year.

2: Which city has the most flights?

library(RColorBrewer)
library(leaflet)
longitude <- unique(flightsmerged$LONGITUDE.x) 
latitude <- unique(flightsmerged$LATITUDE.x)
df = data.frame(Lat = latitude, Long = longitude)
leaflet(df) %>% addTiles() %>% addCircleMarkers()

Going further:

Then, to complete this question I will create a bar chart to better see which city has the most.

airportspercity <- sort(table(flightsmerged$CITY.x  ), decreasing = TRUE )
barplot(airportspercity[1:5], col = "blue", ylim = c(0,3500))

Interpretation:

This shows me that Covington has the most amount of airports and Newark has the least.

3: Which airline appears to be flown the most and in which area?

USA <- get_map("USA", zoom = 4, source = "google", maptype = "terrain")
## ℹ <]8;;https://maps.googleapis.com/maps/api/staticmap?center=USA&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxxhttps://maps.googleapis.com/maps/api/staticmap?center=USA&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx]8;;>
## ℹ <]8;;https://maps.googleapis.com/maps/api/geocode/json?address=USA&key=xxxhttps://maps.googleapis.com/maps/api/geocode/json?address=USA&key=xxx]8;;>
ggmap(USA) +
  geom_point(data = flightsmerged %>%
               filter(AIRLINE %in% c("MQ", "OO", "DL", "EV", "F9")),
             aes(x = LONGITUDE.x, y = LATITUDE.x,
             color = AIRLINE))

Interpretation:

This map shows us that Delta is the airline that appears to be flown the most. We see them show up most often in the West, as well as in Florida.