People tend to like applications where they can get information which is relevant to themselves and which they can use to improve their decisions. As Wang, Park and Fesenmaier (2011) explain, people have with the recent boom in social media accessibility, found pleasure in reading about the experiences of travelers, and ways to improve their journey to wherever they’re going (Cole 2015). People find it extremely frustrating when they have to wait for something, especially when they do not know the reason for the delays.
We saw the opportunity to provide people with information to improve their journeys, by suggesting various possibilities to avoid delayed flights. Furthermore we could show the most common reasons why flights get delayed.We will focus on the delays related to flights.
This can include:
• Flight routes with the most delays
• Flight routes with the least delays
• Airports with high delays to stay clear of
• Airports with the least delays
• The average delay for each type of delay, for example weather, security, carrier, late aircraft and NAS (National Aviation System) delay.
The dataset is available on the Bureau of Transportation Statistics (BTS) website which is a Federal multi organizational entity which focuses on all forms of U.S transportation data collection & analysis. The Bureau of Transportation Statistics (BTS), as part of the United States Department of Transportation, compiles, analyzes, and makes accessible information on the nation’s transportation systems; collects information on intermodal transportation and other areas as needed; and improves the quality and effectiveness of DOT’s statistical programs through research, development of guidelines, and promotion of improvements in data acquisition and use. BTS is part of RITA and is a principal agency of the U.S. Federal Statistical System.
The BTS dataset contains information about the domestic flights and the attributes contributing/calculating their delay in the U.S. The data consist of the following characteristics over the period of 10 years:
• Flight details
• Origin airport
• Destination airport
• Airline
• Delay time
• Cause for delays
The dataset contains 469968 records for all commercial domestic airlines for the month of December 2015. We decided to analyze the data for the month of December because it is a holiday season and majority of the people around the world travel during this time and there is a huge possibility of delay. This might be the most ideal month of a year in order to analyze flight delays. The data consists of flight arrival and departure details for all commercial flights within the USA.
These observations provide the details associated with each individual departure flight from all domestic airports throughout the United States. The data set is specific to identify multiple attributes about the location of the flight, airline carrier and which specific airplane made the journey and the duration of the flight. All these observations provide the reader with insight as to whether the flight was ahead of schedule, on-time or delayed.
To help understand what causes delays, we look closer at the reported 38 variables. We look at the structure of the flights dataset using the following command.
str(flights)
The flights data frame has 469968 rows and 38 columns. Each row represents a unique flight details, flying from source to destination and representing arrival or departure delays.
Each column in the dataset is a variable containing some information about the flight. The table below summarizes some of these variables: Variable descriptions
| Column | Description | Units |
|---|---|---|
YEAR |
year of flight | Integer |
MONTH |
month of flight | Integer |
DAY_OF_MONTH |
day number of month of flight (1-31) | Integer |
DAY_OF_WEEK |
day number of week of flight (1-7) | Integer |
FL_DATE |
flight date | datetime |
UNIQUE_CARRIER |
unique carrier code | Character |
AIRLINE_ID |
unique ID of airline | Integer |
CARRIER |
flight operator | Character |
FL_NUM |
flight number | Integer |
ORIGIN_AIRPORT_ID |
ID of the airport from where the flight originates | Integer |
ORIGIN |
originating airport abbreviation | Character |
ORIGIN_CITY_NAME |
city of the originating airport | Character |
ORIGIN_STATE_ABR |
abbreviation of the state of originating airport city | Character |
ORIGIN_STATE_NM |
state name of originating airport city | Character |
DEST_AIRPORT_ID |
ID of the destination airport | Integer |
DEST |
destination airport abbreviation | Character |
DEST_CITY_NAME |
city of the destination airport | Character |
DEST_STATE_ABR |
abbreviation of the state of destination city | Character |
ORIGIN_STATE_NM |
state name of destination airport city | Character |
CRS_DEP_TIME |
computerized reservations systems departure time | Integer |
DEP_TIME |
actual departure time | Integer |
DEP_DELAY |
departure delay (includes negative delays: flights taking off before scheduled time), in minutes | Integer |
DEP_DELAY_NEW |
only positive departure delays, in minutes | Integer |
CRS_ARR_TIME |
computerized reservations systems arrival time | Integer |
ARR_TIME |
Actual arrival time | Integer |
ARR_DELAY |
arrival delay (includes negative delays: flights arriving before scheduled time), in minutes | Integer |
ARR_DELAY_NEW |
only positive arrival delays, in minutes | Integer |
CARRIER_DELAY |
delays due to carrier, in minutes | Integer |
WEATHER_DELAY |
delays due to weather, in minutes | Integer |
NAS_DELAY |
delays due to national air system, in minutes | Integer |
SECURITY_DELAY |
delays due to security, in minutes | Integer |
LATE_AIRCRAFT_DELAY |
delays due to late aircraft, in minutes | Integer |
Using the flight dataset, we are trying to answer some of the questions mentioned below in our analysis.
The dataset preparation process involves the importing libraries, importing dataset, cleaning dataset and creating a data frame.
Installing Packages and libraries
We will install and load the dplyr package that contains additional functions for data manipulation using data frames. It allows us to order rows, select rows, select variables, modify variables and summarize variables. We will be using functions like distinct, filter, group_by from this package.
Neo4j, a graph database, allows users to store their data as a property graph. A graph consists of nodes that are connected by relationships; both nodes and relationships can have properties, or key-value pairs. RNeo4j is Neo4j’s R driver. It allows users to read and write data from and to Neo4j directly from their R environment by exposing an interface for interacting with nodes, relationships, paths, and more. Most notably, it allows users to retrieve Cypher query results as R data frames, where Cypher is Neo4j’s graph query language.
Routines for simple graphs and network analysis. It can handle large graphs very well and provides functions for generating random and regular graphs, graph visualization, centrality methods and much more.
Provides a mechanism for chaining commands with a new forward-pipe operator, %> %. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions. For more information, see package vignette.
We will be taking the flights data frame and be creating a new data frame with edges between the flights source and destinations airport and then adding a count variable which will represent the total number of edges between the source and destination airports.
The edges.df will give us data frame with variables such as ORIGIN, DEST and count. The data frame will have source and destination airport code and the number of flights which have operated between these airports in the month of December 2015 in USA.
The below command setwd sets the working directory to the current folder on our system and reads the csv file from that folder.
library(ggplot2)
setwd("/Users/Harshit/Desktop/INFO7374 - David Oury/Assignment2/")
flights <- read.csv("850493877_T_ONTIME.csv", sep = ",", header = TRUE)
Some of the variables need to be converted to factors. Below is the code doing so.
flights$YEAR <- factor(flights$YEAR)
flights$MONTH <- factor(flights$MONTH)
flights$DAY_OF_MONTH <- factor(flights$DAY_OF_MONTH)
flights$ORIGIN_AIRPORT_ID <- factor(flights$ORIGIN_AIRPORT_ID)
flights$DEST_AIRPORT_ID <- factor(flights$DEST_AIRPORT_ID)
flights$DAY_OF_WEEK <- factor(flights$DAY_OF_WEEK)
We will be taking the flights dataframe and create a new dataframe with edges between the flights source and destination airport and then adding a count variable which will represent the total number of edges(flights) between the source and destination airports.
RNeo4j is Neo4j’s R driver. It allows you to read and write data from / to Neo4j directly from your R environment.
Below, we take the flights dataframe, group it by ORIGIN, DEST, CARRIER, and FL_DATE and then summmarize to get the number of flights between each origin and destination. We also find the average of departure delay, arrival delay, security delay, carrier delay and other delays. We have used the summarize function to calculate the averages.
flights %>%
select(CARRIER,ORIGIN,DEST,WEATHER_DELAY,SECURITY_DELAY,
NAS_DELAY,LATE_AIRCRAFT_DELAY,CARRIER_DELAY,DEP_DELAY_NEW,ARR_DELAY_NEW,
FL_DATE) %>%
na.omit()%>%
group_by(CARRIER,ORIGIN,DEST,FL_DATE) %>%
summarize(nflights = n(),
Arr_delay = mean(ARR_DELAY_NEW, na.rm=TRUE),
Dep_delay = mean(DEP_DELAY_NEW, na.rm=TRUE),
We_delay = mean(WEATHER_DELAY, na.rm=TRUE),
Sec_delay = mean(SECURITY_DELAY, na.rm=TRUE),
Nas_delay = mean(NAS_DELAY, na.rm=TRUE),
Late_air_delay = mean(LATE_AIRCRAFT_DELAY, na.rm=TRUE),
Car_delay = mean(CARRIER_DELAY, na.rm=TRUE)) %>%
group_by() %>%
{.} -> edges.df
So now that we have the edges.df dataframe, lets take alook at its structure.
str(edges.df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 62221 obs. of 12 variables:
## $ CARRIER : Factor w/ 14 levels "AA","AS","B6",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ORIGIN : Factor w/ 312 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ DEST : Factor w/ 312 levels "ABE","ABI","ABQ",..: 83 83 83 83 83 83 83 83 83 83 ...
## $ FL_DATE : Factor w/ 31 levels "2015-01-01","2015-01-02",..: 1 2 3 5 7 9 12 13 14 15 ...
## $ nflights : int 2 3 1 2 2 1 1 2 2 2 ...
## $ Arr_delay : num 26.5 34.3 29 21.5 271.5 ...
## $ Dep_delay : num 7 17.3 0 0 249.5 ...
## $ We_delay : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sec_delay : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Nas_delay : num 19.5 17 29 21.5 22 5 5 28.5 32.5 17 ...
## $ Late_air_delay: num 0 17.3 0 0 0 ...
## $ Car_delay : num 7 0 0 0 250 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:374017] 1 2 3 4 5 7 8 9 10 11 ...
## .. ..- attr(*, "names")= chr [1:374017] "1" "2" "3" "4" ...
## - attr(*, "vars")=List of 3
## ..$ : symbol CARRIER
## ..$ : symbol ORIGIN
## ..$ : symbol DEST
## - attr(*, "drop")= logi TRUE
As we can see here, the columns for the delays are not integer. So we convert them to integer using as.integer function. First we create a variable with the column indices which are to be converted to integer. Then we pass that variable to lapply function which applies as.integer function over all te mentioned columns.
cols <- c(6:12)
edges.df[cols] <- lapply(edges.df[cols], as.integer)
Now lets make another column in the dataframe which will tell us whether the maximum delay for a particular flight was a security delay or a weather delay or other types of delays.
new_df = edges.df[,c("We_delay","Sec_delay","Nas_delay","Late_air_delay","Car_delay")]
Cause_Of_delay = colnames(new_df)[apply(new_df,1,which.max)]
edges.df = cbind(edges.df,Cause_Of_delay)
Now lets again take a look at the structure of the edges.df dataframe.
str(edges.df)
## 'data.frame': 62221 obs. of 13 variables:
## $ CARRIER : Factor w/ 14 levels "AA","AS","B6",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ORIGIN : Factor w/ 312 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ DEST : Factor w/ 312 levels "ABE","ABI","ABQ",..: 83 83 83 83 83 83 83 83 83 83 ...
## $ FL_DATE : Factor w/ 31 levels "2015-01-01","2015-01-02",..: 1 2 3 5 7 9 12 13 14 15 ...
## $ nflights : int 2 3 1 2 2 1 1 2 2 2 ...
## $ Arr_delay : int 26 34 29 21 271 58 55 46 32 45 ...
## $ Dep_delay : int 7 17 0 0 249 53 50 17 1 28 ...
## $ We_delay : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Sec_delay : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Nas_delay : int 19 17 29 21 22 5 5 28 32 17 ...
## $ Late_air_delay: int 0 17 0 0 0 0 0 0 0 27 ...
## $ Car_delay : int 7 0 0 0 249 53 50 17 0 1 ...
## $ Cause_Of_delay: Factor w/ 5 levels "Car_delay","Late_air_delay",..: 3 3 3 3 1 1 1 3 3 2 ...
We have successfully converted the relevant columns to integer and added a new column which categorizes different types of delays.
Start the Neo4j server
Now we start the graph using the startGraph function. This function takes URL to the neo4j database, username and password as its arguments. To keep our password protected, we have used echo=FALSE.
Before we move forward, lets clear all the previous graphs, if any, from the server.
clear(graph, input = FALSE)
Neo4j can help keep your data clean. It does so using constraints. Constraints allow you to specify the rules for what your data should look like. Any changes that break these rules will be denied. We add 1 constraint, Airport.
addConstraint(graph, "Airport", "name")
Now we create nodes for all the airports from edges.df. We do so by using the function createNode.
# Create a node in Neo4j for each unique airport
lapply(unique(c(as.character(edges.df$ORIGIN),
as.character(edges.df$DEST))),
function(a.airport) {
createNode(graph,
"Airport",
name=a.airport)
}
)
Finally, we write the edges.df to a csv file.
write.csv(edges.df,
"/Users/Harshit/Documents/Neo4j/default.graphdb/import/edges.csv")
cypher is used to query the Neo4j database. In the below query, we are loading the edges.csv in Neo4j and matching the source and destination airports. We are creating the connection between them to see which airport is connected to which one.
cypher(graph,
"LOAD CSV WITH HEADERS FROM 'file:/edges.csv' AS row
MATCH (src:Airport {name:row.ORIGIN}),
(tgt:Airport {name:row.DEST})
CREATE (src)-[:flight {carrier:row.CARRIER,
number:row.nflights,
arr_delay:row.Arr_delay,
cause:row.Cause_Of_delay,
dep_delay:row.Dep_delay}]->(tgt)")
Now we do some analysis to find answers to our questions. First, we find out which airports have the highest average departure delay. We find 15 such airports.
mostDepDelay15 = cypher(graph,
"MATCH (src)-[r:flight]->(tgt)
WITH src.name as Origin,
AVG(toInt(r.dep_delay)) AS Avg_dep_delay_time,
count(tgt) as numberOfDest
ORDER BY Avg_dep_delay_time desc
RETURN Origin, numberOfDest,
Avg_dep_delay_time LIMIT 15")
mostDepDelay15
## Origin numberOfDest Avg_dep_delay_time
## 1 ESC 6 183.66667
## 2 PIH 7 169.57143
## 3 ABR 7 159.28571
## 4 VEL 5 132.00000
## 5 SMX 6 130.00000
## 6 EAU 11 125.27273
## 7 APN 14 110.07143
## 8 PAH 7 106.14286
## 9 OTZ 6 105.33333
## 10 IMT 9 97.77778
## 11 HOB 3 97.00000
## 12 DIK 15 93.40000
## 13 CEC 14 92.42857
## 14 BJI 6 90.50000
## 15 SCE 18 90.05556
Finding: The ESC (Escanaba, Michigan) airport had the highest average departure delay time, specifically 183.66 minutes, in the month of december 2015. Also, EAU (Eau Claire, Wisconsin), SMX (Santa Maria, California), VEL (Vernal, Utah) and couple other airports have mroe than 120 minutes (2 hours) of delay. So, if you have a flight from any of these airports during december, expect an average delay of 2 hours for your departure.
Lets investigate deeper. We select ESC, PIH, ABR, VEL, SMX and EAU airports to see any trends or find any insights.
investigation_1 = cypher(graph,
"MATCH (src)-[r:flight]->(tgt)
WITH src.name AS Origin, r.cause as Cause,
tgt.name as Destination,
AVG(toInt(r.dep_delay)) AS Avg_dep_delay_time
ORDER BY Avg_dep_delay_time desc
WHERE src.name IN ['ESC','PIH','ABR','VEL','SMX','EAU']
RETURN Origin, Destination, Cause, Avg_dep_delay_time")
investigation_1
## Origin Destination Cause Avg_dep_delay_time
## 1 ABR MSP Car_delay 443.50000
## 2 PIH SLC Car_delay 319.00000
## 3 SMX LAX Car_delay 317.50000
## 4 VEL SLC Late_air_delay 269.00000
## 5 PIH SLC We_delay 268.00000
## 6 EAU ORD Car_delay 267.33333
## 7 ESC DTW Car_delay 262.00000
## 8 ESC DTW Late_air_delay 105.33333
## 9 VEL SLC Car_delay 96.00000
## 10 EAU ORD Late_air_delay 82.83333
## 11 ABR MSP Late_air_delay 53.75000
## 12 SMX LAX Late_air_delay 47.33333
## 13 EAU ORD Nas_delay 39.50000
## 14 VEL SLC We_delay 26.00000
## 15 ABR MSP Nas_delay 13.00000
## 16 PIH SLC Late_air_delay 13.00000
## 17 SMX LAX Nas_delay 3.00000
## 18 PIH SLC Nas_delay 0.00000
## 19 VEL SLC Nas_delay 0.00000
Finding: We found that the top delays are carrier related. It means, its not just about the airports, but if you are travelling from some specific carrier, you might have to wait even more before that flight of yours takes off.
Now lets find out which carriers are culprit.
culpritCarriers = cypher(graph,
"MATCH (src)-[r:flight]->(tgt)
WITH src.name AS Origin, r.cause as Cause,
tgt.name as Destination, r.carrier as Carrier,
AVG(toInt(r.dep_delay)) AS Avg_dep_delay_time
ORDER BY Avg_dep_delay_time desc
WHERE src.name IN ['ESC','PIH','ABR','VEL','SMX','EAU']
AND r.cause = 'Car_delay'
RETURN Origin, Destination, Cause, Avg_dep_delay_time, Carrier")
culpritCarriers
## Origin Destination Cause Avg_dep_delay_time Carrier
## 1 ABR MSP Car_delay 443.5000 OO
## 2 PIH SLC Car_delay 319.0000 OO
## 3 SMX LAX Car_delay 317.5000 OO
## 4 EAU ORD Car_delay 267.3333 OO
## 5 ESC DTW Car_delay 262.0000 OO
## 6 VEL SLC Car_delay 96.0000 OO
Finding: Shockingly, or not, the culprit of high carrier delays is OO (SkyWest Airlines). All the high carrier delays are associated with SkyWesy Airlines. So, if you are flying from the selected 6 airports in the month of december using the services of SkyWest Airlines, you are going to have to wait for eternity!
Now lets do another analysis. Lets try to find out which airports have the worst connectivity.
airportsLeastConnected = cypher(graph,
"MATCH (src)-[r:flight]->(tgt)
WITH tgt.name as Destination,
count(distinct src) as flightsFromAirport
ORDER BY flightsFromAirport asc
RETURN Destination, flightsFromAirport")
Finding: 76 airports have incoming flights from just 1 airport. It means that there is only one way you can go these 76 airports. It might be fair to say that these airports don’t get enough passengers to improve connectivity. Some of these airports are DVL (Devil’s Lake, North Dakota), BTM (Butte, Montana) and GUM (Guam). Note: Guam is not on mainland, but is an island territory in the western pacific ocean owned by US.
Now, lets pick the 3 airports we mentioned above and find out how you can reach there. We find the source airports which have flights to these 3 airports and number of flights for different airlines.
HowToReachThere = cypher(graph,
"MATCH (src)-[r:flight]->(tgt)
WITH src.name as Origin,
tgt.name as Destination,
r.carrier as Carrier,
sum(toInt(r.number)) as numberOfFlights
WHERE tgt.name IN ['DVL','BTM','GUM']
RETURN Origin, Destination, Carrier, numberOfFlights")
HowToReachThere
## Origin Destination Carrier numberOfFlights
## 1 SLC BTM OO 5
## 2 JMS DVL OO 17
## 3 HNL GUM UA 12
Finding: If you want to reach Guam, you have to go to HNL (Honolulu, Island of Oahu). If you are planning to go to Butte, Montana, then you need to go via SLC (Salt Lake City, Utah)
What if you want to get to Guam? Since you can only go to Gua via Honolulu, lets find out if Honolulu is well connected or not.
ToHonolulu = cypher(graph,
"MATCH (src)-[r:flight]->(tgt)
WITH tgt.name as Destination,
sum(toInt(r.number)) as numberOfFlights,
count(distinct src) as numberOfOrigins
WHERE tgt.name = 'HNL'
RETURN Destination, numberOfOrigins, numberOfFlights")
ToHonolulu
## Destination numberOfOrigins numberOfFlights
## 1 HNL 27 754
Finding: Well, you are in luck! There are 754 flights to Honolulu in december and you can reach Honolulu via 27 different airport and cities. Thats a sigh of relief!
igraph is a library collection for creating and manipulating graphs and analyzing networks. igraph can be used to generate graphs, compute centrality measures and path length based properties as well as graph components and graph motifs.
Below we are creating a graph dataframe which has source airport and destination airport as vertices.
edges.df %>%
select(ORIGIN, DEST) %>%
distinct() -> edges.df.igraph
ig = graph.data.frame(edges.df.igraph)
Below, we are using the degree function to get the total number of edges to and from a vertex which means we will get the count of airports each airport is connected to. From this value we can determine the busiest airport in the US.
First we will find out the in-degree of each node. This will tell us how many airports have flights to that particular airport.
ig_degree_in = sort(degree(ig, mode = "in"),decreasing = TRUE)
ig_degree_in
## ATL ORD DFW DEN IAH MSP DTW EWR LAX PHX SLC MCO SFO LAS BWI MDW CLT SEA
## 161 154 147 137 109 96 92 77 76 76 73 72 72 69 64 62 60 60
## IAD JFK TPA FLL LGA BOS MIA BNA HOU DCA STL SAN PDX PHL MCI AUS DAL RSW
## 59 57 57 56 55 53 46 44 44 43 42 39 38 36 35 34 34 32
## MSY CLE RDU OAK CVG HNL SAT PIT IND MKE SJC CMH SMF ABQ PBI SJU BDL JAX
## 31 30 30 28 28 27 27 26 25 25 24 23 23 21 21 21 20 20
## ANC OGG OKC SNA MEM OMA TUL BUF CHS TTN BHM TUS GRR ORF RIC SDF ONT ELP
## 20 18 18 18 17 16 16 16 16 16 15 15 15 14 14 14 13 12
## LIT ROC BOI DAY DSM LIH PSP STT BUR PVD ALB GSP ICT JAC RNO LGB SAV CAK
## 12 12 12 11 11 11 11 11 11 11 11 11 10 10 10 10 10 10
## MSN COS EGE HDN KOA GEG BZN GSO MHT MTJ HPN SYR TYS XNA ASE FAT BTV PWM
## 10 9 9 9 9 9 9 9 9 8 8 8 8 8 8 7 7 7
## CAE FNT HSV PNS JNU SRQ FAR LEX MDT BMI CID LBB MAF MLI SBA ACY FSD MYR
## 7 7 7 7 6 6 6 6 6 6 6 6 6 6 6 6 5 5
## AMA GJT PIA RAP SBN ISP KTN BQN ATW BIS BTR CHA CRW ECP EVV GRB JAN LFT
## 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4
## MOB SGF SHV AZO FWA LAN LNK MBS LBE BFL EUG FCA RDM GUC STX BLI FAI ABE
## 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3
## AVL BIL CHO EYW GPT MSO PSC ROA TLH VPS AEX AVP CRP DRO ELM GRK HRL ISN
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## MLU PHF RST SAF ITO GCC MFR MRY RKS SBP SUN MFE BRW CDV OTZ PSG SCC SIT
## 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2
## WRG YAK PSE SWF DLH GNV BRO CLL DIK ERI FSM LCH LRD MEI MGM MHK MOT TVC
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## TYR ILG UST CMI COU LSE SPI COD CPR GTF HIB HLN IDA IMT INL JMS MMH RHI
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## SGU ADQ BET OME ORH AGS DAB FAY ILM MLB TRI ABY BPT BQK CSG CWA DHN EWN
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## GGG GTR HOB LAW OAJ PIB SCE SJT SPS VLD PPG ABI ACT ALO DBQ GCK GRI JLN
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## MQT ROW SUX TOL TXK IAG PBG ABR ACV APN BJI BRD CDC CEC CIU CLD CMX DVL
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## EAU EKO ESC FLG GFK HYS LAR LWS MKG OTH PAH PIH PLN PUB RDD SMX STC TWF
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## VEL YUM GUM BTM CNY ADK
## 1 1 1 1 1 0
Finding: ATL, ORD, DFW, DEN and IAH have more than 100 in-degree. It means flights from more than 100 airports end at these airports. Also, ADK has 0 in-degree. That means this airport has no incoming flight. Weird!
Next, we find the out-degree of each node.
ig_degree_out = sort(degree(ig, mode = "out"),decreasing = TRUE)
ig_degree_out
## ATL ORD DFW DEN IAH MSP DTW SLC EWR LAX PHX SFO LAS MCO BWI IAD MDW CLT
## 157 153 146 137 109 99 95 79 78 76 76 74 71 71 64 63 63 60
## SEA FLL JFK LGA TPA BOS MIA BNA DCA HOU STL SAN PDX PHL MCI AUS DAL MSY
## 59 57 57 57 56 53 46 44 44 44 42 41 37 37 35 34 34 33
## RSW CLE RDU OAK CVG PIT SAT HNL MKE SJC IND CMH SMF ABQ BDL PBI SJU ANC
## 32 29 29 28 27 26 26 25 25 24 23 22 22 21 21 20 20 19
## JAX OKC SNA OGG MEM OMA BUF CHS TTN BHM RIC TUS ONT ORF SDF GRR JAC ELP
## 18 18 18 17 16 16 16 16 16 15 15 15 14 14 14 14 13 12
## LIT TUL ROC BOI DAY KOA PSP STT BUR PVD GSP EGE ICT LGB SAV ALB CAK MSN
## 12 12 12 12 11 11 11 11 11 11 11 10 10 10 10 10 10 10
## COS DSM HDN LIH MTJ RNO GEG GSO MHT HPN BZN TYS XNA ASE FAT BTV PWM SYR
## 9 9 9 9 9 9 9 9 9 8 8 8 8 8 7 7 7 7
## FNT HSV PNS JNU SRQ CAE FAR LEX MDT BMI CID MAF MLI SBA ACY FSD MYR AMA
## 7 7 7 6 6 6 6 6 6 6 6 6 6 6 6 5 5 5
## GJT LBB PIA RAP SBN ISP GUC BLI KTN BQN ATW BIS CHA CRW ECP EVV GRB JAN
## 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4
## LFT MOB MSO SGF SHV AZO FWA LAN LBE BFL EUG RDM STX BRW ABE AVL BIL BTR
## 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3
## CHO EYW GPT PSC ROA TLH VPS AEX AVP CRP DRO ELM GRK HRL ISN LNK MBS MLU
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## RST SAF SCE ITO FCA GCC MFR MRY RKS SBP SUN MFE CDV FAI OME OTZ PSG SCC
## 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2
## SIT WRG YAK ORH PSE SWF DLH GNV BRO CLL DIK ERI FSM LCH LRD MEI MGM MHK
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## MOT PHF PIB TVC TYR ILG UST CMI COU SPI COD DVL GTF HIB HLN IDA IMT INL
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## JMS MMH RHI SGU ADK ADQ BET AGS DAB FAY ILM MLB TRI ABY BPT BQK CSG CWA
## 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## DHN EWN GGG GTR HOB LAW OAJ SJT SPS VLD PPG ABI ACT ALO DBQ GCK GRI JLN
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## LSE MQT ROW SUX TOL TXK IAG PBG ABR ACV APN BJI BRD CDC CEC CIU CLD CMX
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## CPR EAU EKO ESC FLG GFK HYS LAR LWS MKG OTH PAH PIH PLN PUB RDD SMX STC
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## TWF VEL YUM GUM BTM CNY
## 1 1 1 1 0 0
Finding: Not so surprising, but ATL, ORD, DFW, DEN and IAH have outbound flights to more than 100 destinations. If we take a look at ADK, it has flight to 1 destination, even though it doesn’t have any incoming flights.
List of airports connected to the busiest airport.
We are naming all the neighboring vertices of a vertex v. We are considering ATL as our vertex and we are trying to find all the airports which has a flight connection to ATL.
V(ig)[neighbors(ig, v=V(ig)['ATL'])]
## + 157/312 vertices, named:
## [1] ABQ AUS BDL BHM BNA BOS BWI CLE CLT CMH COS DAY DCA DEN DFW DSM DTW
## [18] EGE ELP EWR FLL HDN HNL HOU IAD IAH ICT IND JAX JFK LAS LAX LGA LIT
## [35] MCI MCO MEM MIA MKE MSP MSY OKC OMA ORD ORF PBI PDX PHL PHX PIT RDU
## [52] RIC RSW SAN SAT SDF SEA SFO SJC SJU SLC SMF SNA STL STT TPA TUL TUS
## [69] OAK BTV BUF CHS HPN PVD PWM ROC SAV SRQ SYR ABE AGS ALB ATW AVL BTR
## [86] BZN CAE CAK CHA CHO CRW CVG DAB DAL ECP EVV EYW FAR FAY FNT FSD GNV
## [103] GPT GRB GRR GSO GSP HSV ILM JAN LEX LFT MDT MDW MHT MLB MOB MSN MYR
## [120] PNS ROA SGF SHV TLH TRI TYS VPS XNA ABY AEX AVP AZO BMI BQK CID CSG
## [137] DHN ELM EWN FSM FWA GRK GTR LAN LNK MBS MGM MLI MLU OAJ PHF PIA RST
## [154] SBN VLD TTN ASE
E(ig)[incident(ig, v=V(ig)['ATL'])]
## + 318/4046 edges (vertex names):
## [1] ATL->ABQ ATL->AUS ATL->BDL ATL->BHM ATL->BNA ATL->BOS ATL->BWI
## [8] ATL->CLE ATL->CLT ATL->CMH ATL->COS ATL->DAY ATL->DCA ATL->DEN
## [15] ATL->DFW ATL->DSM ATL->DTW ATL->EGE ATL->ELP ATL->EWR ATL->FLL
## [22] ATL->HDN ATL->HNL ATL->HOU ATL->IAD ATL->IAH ATL->ICT ATL->IND
## [29] ATL->JAX ATL->JFK ATL->LAS ATL->LAX ATL->LGA ATL->LIT ATL->MCI
## [36] ATL->MCO ATL->MEM ATL->MIA ATL->MKE ATL->MSP ATL->MSY ATL->OKC
## [43] ATL->OMA ATL->ORD ATL->ORF ATL->PBI ATL->PDX ATL->PHL ATL->PHX
## [50] ATL->PIT ATL->RDU ATL->RIC ATL->RSW ATL->SAN ATL->SAT ATL->SDF
## [57] ATL->SEA ATL->SFO ATL->SJC ATL->SJU ATL->SLC ATL->SMF ATL->SNA
## [64] ATL->STL ATL->STT ATL->TPA ATL->TUL ATL->TUS ATL->OAK ATL->BTV
## + ... omitted several edges
We also want to find out the most and least connected airport on the basis of the betweeness of a graph. The below code calculates the betweenness among all the airports and then they are sorted in decreasing order.
ig_betweeness = sort(betweenness(ig,directed = TRUE), decreasing = TRUE)
ig_betweeness[1:10]
## ATL ORD DFW DEN MSP SLC IAH
## 17645.593 16858.076 16790.878 13601.721 8245.909 7158.786 5712.870
## ANC DTW SFO
## 5143.909 5063.938 4709.846
ig_betweeness_0 <- ig_betweeness[which(ig_betweeness == 0.00)]
length(ig_betweeness_0)
## [1] 184
Finding: The airport with most betweenness is ATL, with a betweenness of 17645.593. There are 184 airports with 0 betweenness, which means that they don’t act as a connecting airport to any other airport. Out of these 184, we chose 3 random airports, which are COS (Colorado Springs, Colorado), FAT (Fresno, California) and GUC (Gunnison, Colorado).
The below code gives the shortest path between the two least busiest airports, COS and GUC.
ig_shortest_path = sort (shortest.paths(graph = ig)['COS','GUC'],decreasing = TRUE)
ig_shortest_path
## [1] 2
Finding: Here in our case we are calculating the distance between COS and GUC which is 2. It means there is no direct flight from Colorado Springs to Gunnison, Colorado.
Closeness centrality is defined as the total graph-theoretic distance to all other nodes in the network. That means it gives us the reciprocal of the sum of distance from the vertex to all other vertices. When a node has a high closeness score (i.e., is highly central), it tends to receive anything flowing through the network very quickly.
ig_closeness = sort(closeness(ig),decreasing = TRUE)
head(ig_closeness,5)
## ORD ATL DFW DEN IAH
## 0.001272265 0.001267427 0.001251564 0.001248439 0.001194743
tail(ig_closeness,5)
## SIT YAK WRG BTM CNY
## 6.775068e-04 6.775068e-04 6.734007e-04 1.030588e-05 1.030588e-05
Finding: We can see that ORD, ATL and DFW are the 3 airports with the highest closeness that means these airports are more frequently encountered on our network of airports and flights.
Transitivity measures the probability that the adjacent vertices of a vertex are connected. The global transitivity of an undirected graph (directed graphs are considered as undirected ones as well). This is simply the ratio of the triangles and the connected triples in the graph. For directed graph the direction of the edges is ignored.
ig_transitivity = transitivity(ig)
ig_transitivity
## [1] 0.3164311
A property very important in networks, is transitivity. It refers to the extent to which the relation that relates two nodes in a network that are connected by an edge is transitive. The graph created has a transitivity of 0.3164311.
We have used the coreness funtion to get the coreness value of each airport and then sorted in decreasing order. If we see the top airports, we will get to know that ATL is the airport with highest coreness value.
ig_coreness = sort(coreness(ig),decreasing = TRUE)
ig_coreness
## ATL AUS BNA BOS BWI CLE CLT DCA DEN DFW DTW EWR FLL HOU IAD IAH JFK LAS
## 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
## LAX LGA MCI MCO MIA MSP MSY ORD PHL PHX PIT RDU SAN SEA SFO SLC STL TPA
## 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
## MDW SAT MKE CVG IND RSW CMH DAL PDX ABQ JAX OAK BDL SJU HNL OKC SJC SMF
## 50 49 47 47 46 46 45 44 43 40 38 38 37 37 36 36 36 36
## SNA MEM PBI OMA BUF CHS BHM TUS TTN GRR ORF RIC SDF TUL ONT OGG ELP LIT
## 36 33 33 32 32 32 30 30 30 29 28 28 28 28 27 25 24 24
## ROC JAC DAY PSP STT BUR PVD BOI GSP ALB DSM ICT KOA LIH ANC LGB SAV CAK
## 24 23 22 22 22 22 22 22 22 21 20 20 20 20 20 20 20 20
## MSN EGE RNO COS HDN GEG GSO MHT MTJ BZN HPN TYS XNA ASE SYR FAT BTV PWM
## 20 19 19 18 18 18 18 18 17 17 16 16 16 16 15 14 14 14
## FNT HSV PNS CAE SRQ FAR LEX MDT BMI CID MAF MLI SBA LBB FSD MYR AMA GJT
## 14 14 14 13 12 12 12 12 12 12 12 12 12 11 10 10 10 10
## PIA RAP SBN ACY ISP BQN ATW BIS CHA CRW ECP EVV GRB JAN LFT MOB SGF SHV
## 10 10 10 10 10 8 8 8 8 8 8 8 8 8 8 8 8 8
## AZO FWA LAN LBE BFL EUG RDM GUC BLI BTR MSO LNK MBS FCA STX ABE AVL BIL
## 8 8 8 8 8 8 8 7 7 7 7 7 7 7 6 6 6 6
## CHO EYW GPT PSC ROA TLH VPS AEX AVP CRP DRO ELM GRK HRL ISN MLU RST SAF
## 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## ITO GCC MFR MRY RKS SBP SUN PHF MFE BRW CDV FAI JNU KTN PSG SCC SIT WRG
## 6 6 6 6 6 6 6 5 4 4 4 4 4 4 4 4 4 4
## YAK PSE SWF DLH GNV BRO CLL DIK ERI FSM LCH LRD MGM MHK MOT SCE TVC TYR
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## ILG UST CMI COU SPI COD GTF HIB HLN IDA IMT INL MMH RHI SGU OME OTZ ORH
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3
## MEI PIB LSE CPR DVL JMS ADQ BET AGS DAB FAY ILM MLB TRI ABY BPT BQK CSG
## 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2
## CWA DHN EWN GGG GTR HOB LAW OAJ SJT SPS VLD PPG ABI ACT ALO DBQ GCK GRI
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## JLN MQT ROW SUX TOL TXK IAG PBG ABR ACV APN BJI BRD CDC CEC CIU CLD CMX
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## EAU EKO ESC FLG GFK HYS LAR LWS MKG OTH PAH PIH PLN PUB RDD SMX STC TWF
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## VEL YUM GUM ADK BTM CNY
## 2 2 2 1 1 1
We are making a data frame to plot the busiest airports. Below code creates a data frame edges.df.new.
flights %>%
select(CARRIER,ORIGIN,DEST,DEP_DELAY_NEW) %>%
na.omit()%>%
group_by(ORIGIN,DEST) %>%
summarize(nflights = n(),
Dep_delay = mean(DEP_DELAY_NEW, na.rm=TRUE)) %>%
group_by() %>%
{.} -> edges.df.new
edges.df.new %>%
arrange(., desc(nflights)) %>%
mutate(avgDailyFlights = ceiling(nflights / 31)) -> edges.df.new.1
Lets take a look at the summary of the dataframe.
summary(edges.df.new.1)
## ORIGIN DEST nflights Dep_delay
## ATL : 164 ATL : 164 Min. : 1.0 Min. : 0.000
## ORD : 154 ORD : 154 1st Qu.: 31.0 1st Qu.: 7.773
## DFW : 146 DFW : 147 Median : 65.0 Median : 11.657
## DEN : 137 DEN : 138 Mean : 109.9 Mean : 13.693
## IAH : 110 IAH : 110 3rd Qu.: 144.0 3rd Qu.: 16.743
## MSP : 106 MSP : 106 Max. :1085.0 Max. :206.000
## (Other):3352 (Other):3350
## avgDailyFlights
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 3.000
## Mean : 3.962
## 3rd Qu.: 5.000
## Max. :35.000
##
Finding: We can see that the maximum flights between a pair of airports is 1085, in December 2015. That’s an average of 35 flights between 2 top airports.
Now let’s take a look at the distribution of average daily flights. We take help of the ggplot2 package here and use functions like ggplot, geom_histogram and geom_density.
p <- ggplot(edges.df.new.1, aes(x=avgDailyFlights))
p + geom_histogram(binwidth=1,
colour="black", fill="white") +
geom_vline(aes(xintercept=median(avgDailyFlights, na.rm=T)),
color="red", linetype="dashed", size=1)
Finding: Well, clearly the histogram is right skewed. The maximum average daily flights are 35, but for half the pairs of airports, there are less than 3 flights daily. This calls for more investigation. The red line is the median line.
Now lets take a look at the boxplot of the average daily flights.
p1 <- ggplot(edges.df.new.1, aes(x=avgDailyFlights ,y=avgDailyFlights))
p1 + geom_boxplot(outlier.colour="blue",
outlier.shape=1,
outlier.size=2,
notch=FALSE)
Finding: There are at least 18 pairs of airports which are outliers. They don’t fall under the normal distribution of the average daily flights. From the boxplot, we can see that if the average daily flights is more than 11, then the pair is an outlier. The blue dots are outliers.
So, we make a graph of all the airport pairs which have more than 11 flights between them everyday.
edges.df.new.1 %>%
filter(avgDailyFlights > 11) %>%
graph.data.frame() -> airportPairsGreaterThan11
The below plot shows the airport pairs with more than 11 flights between them, on a given day.
plot(airportPairsGreaterThan11,
layout = layout.kamada.kawai,
vertex.color="blue", vertex.size=7,
vertex.frame.color="gray", vertex.label.color="black",
vertex.label.cex=0.5, vertex.label.dist=0.5,
edge.curved=0.2, edge.arrow.size=0.4,
edge.color="light blue")
Finding: Airports such as LAX, ATL, ORD, SFO and others have more than 11 flights between them everyday. This says that these cities are very well connected. You cannot get stuck in these cities due to lack of flights.
We are making a data frame to plot the airports having the most delay and then plot the graph. From the graph, we can find the busiest airports and also how well the airports are connected. The flight direction is given by an arrow.
edges.df.new %>%
arrange(.,desc(Dep_delay)) -> edges.df.new.2
Lets take a look at the summary of this dataframe.
summary(edges.df.new.2)
## ORIGIN DEST nflights Dep_delay
## ATL : 164 ATL : 164 Min. : 1.0 Min. : 0.000
## ORD : 154 ORD : 154 1st Qu.: 31.0 1st Qu.: 7.773
## DFW : 146 DFW : 147 Median : 65.0 Median : 11.657
## DEN : 137 DEN : 138 Mean : 109.9 Mean : 13.693
## IAH : 110 IAH : 110 3rd Qu.: 144.0 3rd Qu.: 16.743
## MSP : 106 MSP : 106 Max. :1085.0 Max. :206.000
## (Other):3352 (Other):3350
Finding: We can see that the maximum departure delay between a pair of airports is 206 minutes, in December 2015. But 3/4th of the flights have departure delay less than 16.743 minutes, which is nice.
Now let’s take a look at the distribution of average departure delay.
p <- ggplot(edges.df.new.2, aes(x=Dep_delay))
p + geom_histogram(binwidth=1,
colour="black", fill="white") +
geom_vline(aes(xintercept=median(Dep_delay, na.rm=T)),
color="red", linetype="dashed", size=1)
Finding: Well, clearly the histogram is right skewed. The maximum average departure delay is greater than 200 minutes, but for half the pairs of airports, there are less than 12 minutes of departure delay. The red line is the median line.
Now lets take a look at the boxplot of the average departure delay.
p1 <- ggplot(edges.df.new.2, aes(x=Dep_delay ,y=Dep_delay))
p1 + geom_boxplot(outlier.colour="blue",
outlier.shape=1,
outlier.size=2,
notch=FALSE)
Finding: There are many pairs of airports which are outliers. They don’t fall under the normal distribution of the average departure delay. From the boxplot, we can see that if the average departure delay is more than 30 minutes, then the pair is an outlier. The blue dots are outliers.
So, we make a graph of all the airport pairs which have more than 30 minutes of departure delay between them everyday.
edges.df.new.2 %>%
filter(Dep_delay > 30) %>%
graph.data.frame() -> airportPairsGreaterThan30
The below plot shows the airport pairs which have more than 30 minutes of departure delay between them everyday.
plot(airportPairsGreaterThan30,
layout = layout.auto,
vertex.color="blue", vertex.size=7,
vertex.frame.color="gray", vertex.label.color="black",
vertex.label.cex=0.5, vertex.label.dist=0.5,
edge.curved=0.2, edge.arrow.size=0.4,
edge.color="light blue")
Finding: Flights coming into airports such as DFW, JFK, ORD, SFO and others have more than 30 minutes of departure delay.
The below plot shows the top 20 airport pairs with maximum delay. The graph shows the origin and destination airports and the direction of the flight which has the most delay.
edges.df.new.2 %>%
filter(Dep_delay > 120) %>%
graph.data.frame() -> airportPairsGreaterThan120
plot(airportPairsGreaterThan120,
layout = layout.auto,
vertex.color="blue", vertex.size=7,
vertex.frame.color="gray", vertex.label.color="black",
vertex.label.cex=0.5, vertex.label.dist=0.5,
edge.curved=0.2, edge.arrow.size=0.4,
edge.color="light blue")
Finding: There are 6 pairs of airports which have more than 2 hours (120 minutes) of departure delay.
From our analysis, we have come to the following conclusions.
ATL (Hartsfield-Jackson Atlanta International Airport) is the busiest airport in the US. It also means that you can virtually reach anywhere in US if you plan your trip via ATL.
GUM (Guam), which is not on mainland US, but an island territory in western pacific, is only connected to US via Honolulu, another island territory. But, Honolulu has good connectivity, because of tourist attraction, reaching to Guam from Honolulu is pretty easy.
ATL(Hartsfield-Jackson Atlanta International Airport), DFW(Dallas/Fort Worth International Airport) and ORD (O’Hare International Airport) are the airports which have most flights departing from and arriving to.
OO (SkyWest Airlines) is the carrier with the most delays.
ESC (Escanaba, Michigan) airport had the highest average departure delay time, specifically 183.66 minutes, in the month of december 2015.
EAU (Eau Claire, Wisconsin), SMX (Santa Maria, California), VEL (Vernal, Utah) have an average of 2 hour delay in the month of December 2015
COS (Colorado Springs, Colorado), FAT (Fresno, California) and GUC (Gunnison, Colorado) are some of the least busiest airports.
Between Delta Airlines and United airlines, Delta Airlines has twice as much delay for more than 80% of the days in the month of December 2015.
Carrier delays contribute to most of the flight delays.