library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(maps)
##
## Attaching package: 'maps'
##
## The following object is masked from 'package:purrr':
##
## map
library(ggalt)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(ggmap)
## ℹ Google's Terms of Service: ]8;;https://mapsplatform.google.com<https://mapsplatform.google.com>]8;;
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
CVG.Flights <- read_csv("CVG_Flights.csv")
## Rows: 7763 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): FLIGHT_DATE, AIRLINE, ORIGIN_AIRPORT, DESTINATION_AIRPORT, CANCEL...
## dbl (16): FLIGHT_NUMBER, DEPARTURE_DELAY, TAXI_OUT, SCHEDULED_TIME, ELAPSED...
## time (6): SCHEDULED_DEPARTURE, DEPARTURE_TIME, WHEELS_OFF, WHEELS_ON, SCHED...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airlines <- read_csv("airlines.csv")
## Rows: 14 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): IATA_CODE, AIRLINE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airports <- read_csv("airports.csv")
## Rows: 322 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): IATA_CODE, AIRPORT, CITY, STATE, COUNTRY
## dbl (2): LATITUDE, LONGITUDE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I will use dim() to find the number of variables and observations. Then I will use is.na() to find if there are any missing variables. To show the missing variables, I will use colSums and is.na() to show the amount of NA’s for each variable.
dim(CVG.Flights)
## [1] 7763 27
sum(is.na(CVG.Flights))
## [1] 40270
colSums(is.na(CVG.Flights))
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 0 0 0 0
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## 0 0 370 370
## TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME
## 378 378 0 392
## AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 392 0 381 381
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED
## 0 381 392 0
## CANCELLED CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY
## 0 7385 5814 5814
## AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY
## 5814 5814 5814
There are 7763 observations, 27 variables, and 40270 missing variables.
I am going to start this question by gathering a summary of all of the data.
summary(CVG.Flights)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## Length:7763 Length:7763 Min. : 62 Length:7763
## Class :character Class :character 1st Qu.:1908 Class :character
## Mode :character Mode :character Median :3246 Mode :character
## Mean :3194
## 3rd Qu.:4536
## Max. :6297
##
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## Length:7763 Length:7763 Length:7763 Min. :-24.00
## Class :character Class1:hms Class1:hms 1st Qu.: -5.00
## Mode :character Class2:difftime Class2:difftime Median : -1.00
## Mode :numeric Mode :numeric Mean : 12.35
## 3rd Qu.: 12.00
## Max. :708.00
## NA's :370
## TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME
## Min. : 1.00 Length:7763 Min. : 60.0 Min. : 49.0
## 1st Qu.: 12.00 Class1:hms 1st Qu.: 85.0 1st Qu.: 83.0
## Median : 15.00 Class2:difftime Median :126.0 Median :123.0
## Mean : 18.52 Mode :numeric Mean :128.1 Mean :127.3
## 3rd Qu.: 21.00 3rd Qu.:156.0 3rd Qu.:152.0
## Max. :122.00 Max. :313.0 Max. :340.0
## NA's :378 NA's :392
## AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## Min. : 36.0 Min. : 229.0 Length:7763 Min. : 1.000
## 1st Qu.: 56.0 1st Qu.: 308.0 Class1:hms 1st Qu.: 5.000
## Median : 99.0 Median : 589.0 Class2:difftime Median : 6.000
## Mean :100.4 Mean : 675.9 Mode :numeric Mean : 8.448
## 3rd Qu.:125.0 3rd Qu.: 871.0 3rd Qu.: 9.000
## Max. :319.0 Max. :2036.0 Max. :128.000
## NA's :392 NA's :381
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED
## Length:7763 Length:7763 Min. :-56.00 Min. :0.000000
## Class1:hms Class1:hms 1st Qu.:-12.00 1st Qu.:0.000000
## Class2:difftime Class2:difftime Median : -2.00 Median :0.000000
## Mode :numeric Mode :numeric Mean : 10.65 Mean :0.001803
## 3rd Qu.: 16.00 3rd Qu.:0.000000
## Max. :716.00 Max. :1.000000
## NA's :392
## CANCELLED CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY
## Min. :0.00000 Length:7763 Min. : 0.00 Min. :0.00
## 1st Qu.:0.00000 Class :character 1st Qu.: 0.00 1st Qu.:0.00
## Median :0.00000 Mode :character Median : 10.00 Median :0.00
## Mean :0.04869 Mean : 16.35 Mean :0.01
## 3rd Qu.:0.00000 3rd Qu.: 23.00 3rd Qu.:0.00
## Max. :1.00000 Max. :258.00 Max. :9.00
## NA's :5814 NA's :5814
## AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY
## Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 2.00 Median : 0.000
## Mean : 17.06 Mean : 23.24 Mean : 3.873
## 3rd Qu.: 18.00 3rd Qu.: 30.00 3rd Qu.: 0.000
## Max. :435.00 Max. :435.00 Max. :638.000
## NA's :5814 NA's :5814 NA's :5814
For flight data, I can see that data was only collected for January, February, and March of 2015. The majority of the flights were taken on January 2nd, 2015.
For airline data, we can see that DL has the most flights with 2289 and F9 has the least with 515.
For flight number data, the highest was 6297 showcasing the amount of flight numbers used in the data set.
For origin airport data, there are 38 different origin airports in the data set, with CVG being the highest at 3886.
For destination airport data, there are also 38, with CVG being the highest at 3877.
For scheduled departure data, we can see that 4:15 PM is the most common scheduled departure time.
For departure time data, we can see that the most common time is 9:28 AM.
For departure delay data, we can see that they average delay throughout all airports in the data set in 12.35 minutes.
For taxi out data, we can see that the average time it takes for the aircraft to taxi is 18.52 minutes. The longest it takes is 122 minutes and the shortest is 1 minute.
for wheels off data, we can see that the most common time that planes take off the ground is at 9:41 AM.
For scheduled time data, we can see that none of them are less than 60 minutes. The most amount is 313 minutes. The median time is 126 minutes.
For elapsed time data, we can see that the minumum is 49 minutes, the maximum is 340 minutes, and the median is 123 minutes. This variable also has 392 missing variables.
For air time data, we can see that the least amount of time the aircrafts from these airports are in the air is 36 minutes. This means there is no flights that are less than 36 minutes long.
For distance data, we can see that the farthest any of these planes fly is 2036 miles. However, the median is 589 miles.
For wheels on data, we can see that the majority of these planes have their wheels back on the ground by 1:13 AM.
For taxi in data, we can see that the median time for planes to taxi back into the airport is 6 minutes. This variable also has 381 missing values.
For scheduled arrival data, we can see that the most common schedule arrival time is 7:24 PM.
For arrival time data, we can see that the most common arrival time is 7:23 PM.
or arrival delay data, we can see that the largest arrival delay time that this data set has experienced in 716 minutes. There are also 392 missing values in this variable.
For diverted data, we can see that this is a dummy variable. The zeros mean that a plane was not divered and the one means it was. This does not tell us specifics, but it shows us that there has been instances where a plane was diverted.
For cancelled data, we can see that this is also a dummy variable. This tells us that there has been instances where a flight was cancelled.
For cancellation reason data, we can see that cancellation reason B is the highest with 180. This variable also has 7385 missing values which is the most out of all the variables.
For air system delay data, we can see that this causes an average delay of 16.35 minutes.
For security delay data, we can see that this causes at most a delay of 9 minutes. This one does not seem to happen very often.
For airline delay data, we can see that this causes an average delay of 17.06 minutes.
For late aircraft data, we can see that this causes an average delay of 23.24 minutes. This one is the most common reason for delay.
For weather delay data, we can see that this causes an average delay of 3.873 minutes.
For this question, I am going to plot the relationship between airline and origin airport. I am going to filter it to only flights on January 1st, 2015 in order to see which airlines flew out of which airports on that day.
flight2 <- filter(CVG.Flights, FLIGHT_DATE == "1/1/2015")
ggplot(flight2, aes(x = ORIGIN_AIRPORT, fill = AIRLINE)) +
geom_bar() +
ggtitle("Count of each airline out of each airport in January 1st, 2015") +
labs(x = "Origin Airport",
y = "Count of Airlines")
# Interpretation: This graph shows us that MQ appears to have been flown
the most that day, with DL trailing right behind. We also see that
majority of flights were flying out of CVG. Also, every single airline
flew out of CVG that specific day.
For this plot, I am going to graph the relationship between distance and air time based on airline. I aim to see which airlines travel the furthest and spend the most time in the air.
ggplot(CVG.Flights, aes(x = DISTANCE, y = AIR_TIME, color = AIRLINE)) +
geom_boxplot() +
facet_wrap(~AIRLINE) +
ggtitle("Travel Distance and Air Time Per Airline") +
labs(x = "Distance",
y = "Air Time")
## Warning: Removed 392 rows containing non-finite values (stat_boxplot).
This graph shows me that DL flies the furthest distances and MQ flies the shortest distances. However, the median air time for MQ and DL are the same at 100. We can see that F9 does not travel any less than around 500 miles. Lastly, OO has the smallest median air time.
airlinesmerged <- left_join(CVG.Flights, airlines, by = c("AIRLINE" = "IATA_CODE"))
airportsmerged <- left_join(airlinesmerged, airports, by = c("ORIGIN_AIRPORT" = "IATA_CODE"))
flightsmerged <- left_join(airportsmerged, airports, by = c("DESTINATION_AIRPORT" = "IATA_CODE"))
weather <- filter(flightsmerged, WEATHER_DELAY > 0)
weather_by_origin <- aggregate(weather$WEATHER_DELAY ~ weather$STATE.x, FUN = length)
weather_by_origin
## weather$STATE.x weather$WEATHER_DELAY
## 1 CO 2
## 2 FL 1
## 3 GA 6
## 4 IL 88
## 5 KY 70
## 6 MA 5
## 7 MN 1
## 8 NJ 3
## 9 NV 1
## 10 NY 4
## 11 TN 1
## 12 TX 29
## 13 WI 4
ggplot(weather_by_origin, aes(`weather$STATE.x`, `weather$WEATHER_DELAY`)) +
geom_col(color = "blue") +
ggtitle("Total Amount of Weather Delays Per State") +
xlab("State") +
ylab("Count of Weather Delays")
Here we xan see that Illinois experiences the highest amount of weather delays, which is no surprise. We can see that Florida experiences the least. I am very surprised to see Colorado so low due to the amount of snow they get. But, that must be due to the fact that it is dry all the other times of the year.
library(RColorBrewer)
library(leaflet)
longitude <- unique(flightsmerged$LONGITUDE.x)
latitude <- unique(flightsmerged$LATITUDE.x)
df = data.frame(Lat = latitude, Long = longitude)
leaflet(df) %>% addTiles() %>% addCircleMarkers()
Then, to complete this question I will create a bar chart to better see which city has the most.
airportspercity <- sort(table(flightsmerged$CITY.x ), decreasing = TRUE )
barplot(airportspercity[1:5], col = "blue", ylim = c(0,3500))
This shows me that Covington has the most amount of airports and Newark has the least.
USA <- get_map("USA", zoom = 4, source = "google", maptype = "terrain")
## ℹ <]8;;https://maps.googleapis.com/maps/api/staticmap?center=USA&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxxhttps://maps.googleapis.com/maps/api/staticmap?center=USA&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx]8;;>
## ℹ <]8;;https://maps.googleapis.com/maps/api/geocode/json?address=USA&key=xxxhttps://maps.googleapis.com/maps/api/geocode/json?address=USA&key=xxx]8;;>
ggmap(USA) +
geom_point(data = flightsmerged %>%
filter(AIRLINE %in% c("MQ", "OO", "DL", "EV", "F9")),
aes(x = LONGITUDE.x, y = LATITUDE.x,
color = AIRLINE))
This map shows us that Delta is the airline that appears to be flown the most. We see them show up most often in the West, as well as in Florida.