true

Introduction

People tend to like applications where they can get information which is relevant to themselves and which they can use to improve their decisions. As Wang, Park and Fesenmaier (2011) explain, people have with the recent boom in social media accessibility, found pleasure in reading about the experiences of travelers, and ways to improve their journey to wherever they’re going (Cole 2015). People find it extremely frustrating when they have to wait for something, especially when they do not know the reason for the delays.

We saw the opportunity to provide people with information to improve their journeys, by suggesting various possibilities to avoid delayed flights. Furthermore we could show the most common reasons why flights get delayed.We will focus on the delays related to flights.

This can include:
• Flight routes with the most delays
• Flight routes with the least delays
• Airports with high delays to stay clear of
• Airports with the least delays
• The average delay for each type of delay, for example weather, security, carrier, late aircraft and NAS (National Aviation System) delay.

The dataset is available on the Bureau of Transportation Statistics (BTS) website which is a Federal multi organizational entity which focuses on all forms of U.S transportation data collection & analysis. The Bureau of Transportation Statistics (BTS), as part of the United States Department of Transportation, compiles, analyzes, and makes accessible information on the nation’s transportation systems; collects information on intermodal transportation and other areas as needed; and improves the quality and effectiveness of DOT’s statistical programs through research, development of guidelines, and promotion of improvements in data acquisition and use. BTS is part of RITA and is a principal agency of the U.S. Federal Statistical System.

About the data

The BTS dataset contains information about the domestic flights and the attributes contributing/calculating their delay in the U.S. The data consist of the following characteristics over the period of 10 years:

• Flight details
• Origin airport
• Destination airport
• Airline
• Delay time
• Cause for delays

Data description

The dataset contains 469968 records for all commercial domestic airlines for the month of December 2015. We decided to analyze the data for the month of December because it is a holiday season and majority of the people around the world travel during this time and there is a huge possibility of delay. This might be the most ideal month of a year in order to analyze flight delays. The data consists of flight arrival and departure details for all commercial flights within the USA.

These observations provide the details associated with each individual departure flight from all domestic airports throughout the United States. The data set is specific to identify multiple attributes about the location of the flight, airline carrier and which specific airplane made the journey and the duration of the flight. All these observations provide the reader with insight as to whether the flight was ahead of schedule, on-time or delayed.

Description of variables

To help understand what causes delays, we look closer at the reported 38 variables. We look at the structure of the flights dataset using the following command.

str(flights)

The flights data frame has 469968 rows and 38 columns. Each row represents a unique flight details, flying from source to destination and representing arrival or departure delays.

Each column in the dataset is a variable containing some information about the flight. The table below summarizes some of these variables: Variable descriptions

Column Description Units
YEAR year of flight Integer
MONTH month of flight Integer
DAY_OF_MONTH day number of month of flight (1-31) Integer
DAY_OF_WEEK day number of week of flight (1-7) Integer
FL_DATE flight date datetime
UNIQUE_CARRIER unique carrier code Character
AIRLINE_ID unique ID of airline Integer
CARRIER flight operator Character
FL_NUM flight number Integer
ORIGIN_AIRPORT_ID ID of the airport from where the flight originates Integer
ORIGIN originating airport abbreviation Character
ORIGIN_CITY_NAME city of the originating airport Character
ORIGIN_STATE_ABR abbreviation of the state of originating airport city Character
ORIGIN_STATE_NM state name of originating airport city Character
DEST_AIRPORT_ID ID of the destination airport Integer
DEST destination airport abbreviation Character
DEST_CITY_NAME city of the destination airport Character
DEST_STATE_ABR abbreviation of the state of destination city Character
ORIGIN_STATE_NM state name of destination airport city Character
CRS_DEP_TIME computerized reservations systems departure time Integer
DEP_TIME actual departure time Integer
DEP_DELAY departure delay (includes negative delays: flights taking off before scheduled time), in minutes Integer
DEP_DELAY_NEW only positive departure delays, in minutes Integer
CRS_ARR_TIME computerized reservations systems arrival time Integer
ARR_TIME Actual arrival time Integer
ARR_DELAY arrival delay (includes negative delays: flights arriving before scheduled time), in minutes Integer
ARR_DELAY_NEW only positive arrival delays, in minutes Integer
CARRIER_DELAY delays due to carrier, in minutes Integer
WEATHER_DELAY delays due to weather, in minutes Integer
NAS_DELAY delays due to national air system, in minutes Integer
SECURITY_DELAY delays due to security, in minutes Integer
LATE_AIRCRAFT_DELAY delays due to late aircraft, in minutes Integer

Analysis Goals

Using the flight dataset, we are trying to answer some of the questions mentioned below in our analysis.

  1. Which is the busiest airport in Dec 2015?
  2. Which airport has maximum delays (minutes)?
  3. Categorizing which delay reason has caused the maximum delay (minutes).
  4. Analyzing the top 20 well connected airports from the busiest airport.
  5. Flight routes with the most delays.
  6. Flight routes with the least delays.
  7. Airports with high delays to stay clear of.
  8. Airports with the least delays.

Dataset preparation

The dataset preparation process involves the importing libraries, importing dataset, cleaning dataset and creating a data frame.

Installing Packages and libraries

dplyr package

We will install and load the dplyr package that contains additional functions for data manipulation using data frames. It allows us to order rows, select rows, select variables, modify variables and summarize variables. We will be using functions like distinct, filter, group_by from this package.

RNeo4j package

Neo4j, a graph database, allows users to store their data as a property graph. A graph consists of nodes that are connected by relationships; both nodes and relationships can have properties, or key-value pairs. RNeo4j is Neo4j’s R driver. It allows users to read and write data from and to Neo4j directly from their R environment by exposing an interface for interacting with nodes, relationships, paths, and more. Most notably, it allows users to retrieve Cypher query results as R data frames, where Cypher is Neo4j’s graph query language.

igraph package

Routines for simple graphs and network analysis. It can handle large graphs very well and provides functions for generating random and regular graphs, graph visualization, centrality methods and much more.

magrittr package

Provides a mechanism for chaining commands with a new forward-pipe operator, %> %. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions. For more information, see package vignette.

Defining edges between source and destination

We will be taking the flights data frame and be creating a new data frame with edges between the flights source and destinations airport and then adding a count variable which will represent the total number of edges between the source and destination airports.

The edges.df will give us data frame with variables such as ORIGIN, DEST and count. The data frame will have source and destination airport code and the number of flights which have operated between these airports in the month of December 2015 in USA.

Reading data from csv

The below command setwd sets the working directory to the current folder on our system and reads the csv file from that folder.

library(ggplot2)
setwd("/Users/Harshit/Desktop/INFO7374 - David Oury/Assignment2/")
flights <- read.csv("850493877_T_ONTIME.csv", sep = ",", header = TRUE)

Convert variables to factors

Some of the variables need to be converted to factors. Below is the code doing so.

flights$YEAR <- factor(flights$YEAR)
flights$MONTH <- factor(flights$MONTH)
flights$DAY_OF_MONTH <- factor(flights$DAY_OF_MONTH)
flights$ORIGIN_AIRPORT_ID <- factor(flights$ORIGIN_AIRPORT_ID)
flights$DEST_AIRPORT_ID <- factor(flights$DEST_AIRPORT_ID)
flights$DAY_OF_WEEK <- factor(flights$DAY_OF_WEEK)

Network Analysis

We will be taking the flights dataframe and create a new dataframe with edges between the flights source and destination airport and then adding a count variable which will represent the total number of edges(flights) between the source and destination airports.

Rneo4j

RNeo4j is Neo4j’s R driver. It allows you to read and write data from / to Neo4j directly from your R environment.

Below, we take the flights dataframe, group it by ORIGIN, DEST, CARRIER, and FL_DATE and then summmarize to get the number of flights between each origin and destination. We also find the average of departure delay, arrival delay, security delay, carrier delay and other delays. We have used the summarize function to calculate the averages.

flights %>%
  select(CARRIER,ORIGIN,DEST,WEATHER_DELAY,SECURITY_DELAY,
         NAS_DELAY,LATE_AIRCRAFT_DELAY,CARRIER_DELAY,DEP_DELAY_NEW,ARR_DELAY_NEW,
         FL_DATE) %>%
         na.omit()%>%
         group_by(CARRIER,ORIGIN,DEST,FL_DATE) %>%
         summarize(nflights = n(),
            Arr_delay = mean(ARR_DELAY_NEW, na.rm=TRUE),
            Dep_delay =  mean(DEP_DELAY_NEW, na.rm=TRUE),
            We_delay = mean(WEATHER_DELAY, na.rm=TRUE),
            Sec_delay = mean(SECURITY_DELAY, na.rm=TRUE),
            Nas_delay = mean(NAS_DELAY, na.rm=TRUE),
            Late_air_delay = mean(LATE_AIRCRAFT_DELAY, na.rm=TRUE),
            Car_delay = mean(CARRIER_DELAY, na.rm=TRUE))  %>%
         group_by() %>% 
         {.} -> edges.df

So now that we have the edges.df dataframe, lets take alook at its structure.

str(edges.df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    62221 obs. of  12 variables:
##  $ CARRIER       : Factor w/ 14 levels "AA","AS","B6",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ORIGIN        : Factor w/ 312 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ DEST          : Factor w/ 312 levels "ABE","ABI","ABQ",..: 83 83 83 83 83 83 83 83 83 83 ...
##  $ FL_DATE       : Factor w/ 31 levels "2015-01-01","2015-01-02",..: 1 2 3 5 7 9 12 13 14 15 ...
##  $ nflights      : int  2 3 1 2 2 1 1 2 2 2 ...
##  $ Arr_delay     : num  26.5 34.3 29 21.5 271.5 ...
##  $ Dep_delay     : num  7 17.3 0 0 249.5 ...
##  $ We_delay      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sec_delay     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Nas_delay     : num  19.5 17 29 21.5 22 5 5 28.5 32.5 17 ...
##  $ Late_air_delay: num  0 17.3 0 0 0 ...
##  $ Car_delay     : num  7 0 0 0 250 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:374017] 1 2 3 4 5 7 8 9 10 11 ...
##   .. ..- attr(*, "names")= chr [1:374017] "1" "2" "3" "4" ...
##  - attr(*, "vars")=List of 3
##   ..$ : symbol CARRIER
##   ..$ : symbol ORIGIN
##   ..$ : symbol DEST
##  - attr(*, "drop")= logi TRUE

As we can see here, the columns for the delays are not integer. So we convert them to integer using as.integer function. First we create a variable with the column indices which are to be converted to integer. Then we pass that variable to lapply function which applies as.integer function over all te mentioned columns.

cols <- c(6:12)
edges.df[cols] <- lapply(edges.df[cols], as.integer)

Now lets make another column in the dataframe which will tell us whether the maximum delay for a particular flight was a security delay or a weather delay or other types of delays.

new_df = edges.df[,c("We_delay","Sec_delay","Nas_delay","Late_air_delay","Car_delay")]
Cause_Of_delay = colnames(new_df)[apply(new_df,1,which.max)]
edges.df = cbind(edges.df,Cause_Of_delay)

Now lets again take a look at the structure of the edges.df dataframe.

str(edges.df)
## 'data.frame':    62221 obs. of  13 variables:
##  $ CARRIER       : Factor w/ 14 levels "AA","AS","B6",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ORIGIN        : Factor w/ 312 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ DEST          : Factor w/ 312 levels "ABE","ABI","ABQ",..: 83 83 83 83 83 83 83 83 83 83 ...
##  $ FL_DATE       : Factor w/ 31 levels "2015-01-01","2015-01-02",..: 1 2 3 5 7 9 12 13 14 15 ...
##  $ nflights      : int  2 3 1 2 2 1 1 2 2 2 ...
##  $ Arr_delay     : int  26 34 29 21 271 58 55 46 32 45 ...
##  $ Dep_delay     : int  7 17 0 0 249 53 50 17 1 28 ...
##  $ We_delay      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sec_delay     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Nas_delay     : int  19 17 29 21 22 5 5 28 32 17 ...
##  $ Late_air_delay: int  0 17 0 0 0 0 0 0 0 27 ...
##  $ Car_delay     : int  7 0 0 0 249 53 50 17 0 1 ...
##  $ Cause_Of_delay: Factor w/ 5 levels "Car_delay","Late_air_delay",..: 3 3 3 3 1 1 1 3 3 2 ...

We have successfully converted the relevant columns to integer and added a new column which categorizes different types of delays.

Start the Neo4j server
Now we start the graph using the startGraph function. This function takes URL to the neo4j database, username and password as its arguments. To keep our password protected, we have used echo=FALSE.

Before we move forward, lets clear all the previous graphs, if any, from the server.

clear(graph, input = FALSE)

Neo4j can help keep your data clean. It does so using constraints. Constraints allow you to specify the rules for what your data should look like. Any changes that break these rules will be denied. We add 1 constraint, Airport.

addConstraint(graph, "Airport", "name")

Now we create nodes for all the airports from edges.df. We do so by using the function createNode.

# Create a node in Neo4j for each unique airport
lapply(unique(c(as.character(edges.df$ORIGIN),
                as.character(edges.df$DEST))), 
       function(a.airport) { 
         createNode(graph, 
                    "Airport", 
                    name=a.airport) 
       }
)

Finally, we write the edges.df to a csv file.

write.csv(edges.df,
          "/Users/Harshit/Documents/Neo4j/default.graphdb/import/edges.csv")

cypher is used to query the Neo4j database. In the below query, we are loading the edges.csv in Neo4j and matching the source and destination airports. We are creating the connection between them to see which airport is connected to which one.

cypher(graph,
       "LOAD CSV WITH HEADERS FROM 'file:/edges.csv' AS row
       MATCH (src:Airport {name:row.ORIGIN}), 
       (tgt:Airport {name:row.DEST})
       
       CREATE (src)-[:flight {carrier:row.CARRIER, 
       number:row.nflights,
       arr_delay:row.Arr_delay, 
       cause:row.Cause_Of_delay, 
       dep_delay:row.Dep_delay}]->(tgt)")

Now we do some analysis to find answers to our questions. First, we find out which airports have the highest average departure delay. We find 15 such airports.

  mostDepDelay15 = cypher(graph,
                    "MATCH (src)-[r:flight]->(tgt)
                    WITH src.name as Origin, 
                    AVG(toInt(r.dep_delay)) AS Avg_dep_delay_time,
                    count(tgt) as numberOfDest
                    ORDER BY Avg_dep_delay_time desc
                    RETURN Origin, numberOfDest,
                    Avg_dep_delay_time LIMIT 15")

mostDepDelay15
##    Origin numberOfDest Avg_dep_delay_time
## 1     ESC            6          183.66667
## 2     PIH            7          169.57143
## 3     ABR            7          159.28571
## 4     VEL            5          132.00000
## 5     SMX            6          130.00000
## 6     EAU           11          125.27273
## 7     APN           14          110.07143
## 8     PAH            7          106.14286
## 9     OTZ            6          105.33333
## 10    IMT            9           97.77778
## 11    HOB            3           97.00000
## 12    DIK           15           93.40000
## 13    CEC           14           92.42857
## 14    BJI            6           90.50000
## 15    SCE           18           90.05556

Finding: The ESC (Escanaba, Michigan) airport had the highest average departure delay time, specifically 183.66 minutes, in the month of december 2015. Also, EAU (Eau Claire, Wisconsin), SMX (Santa Maria, California), VEL (Vernal, Utah) and couple other airports have mroe than 120 minutes (2 hours) of delay. So, if you have a flight from any of these airports during december, expect an average delay of 2 hours for your departure.

Lets investigate deeper. We select ESC, PIH, ABR, VEL, SMX and EAU airports to see any trends or find any insights.

  investigation_1 = cypher(graph,
                    "MATCH (src)-[r:flight]->(tgt)
                    WITH src.name AS Origin, r.cause as Cause,
                    tgt.name as Destination,
                    AVG(toInt(r.dep_delay)) AS Avg_dep_delay_time
                    ORDER BY Avg_dep_delay_time desc
                    WHERE src.name IN ['ESC','PIH','ABR','VEL','SMX','EAU']
                    RETURN Origin, Destination, Cause, Avg_dep_delay_time") 

investigation_1
##    Origin Destination          Cause Avg_dep_delay_time
## 1     ABR         MSP      Car_delay          443.50000
## 2     PIH         SLC      Car_delay          319.00000
## 3     SMX         LAX      Car_delay          317.50000
## 4     VEL         SLC Late_air_delay          269.00000
## 5     PIH         SLC       We_delay          268.00000
## 6     EAU         ORD      Car_delay          267.33333
## 7     ESC         DTW      Car_delay          262.00000
## 8     ESC         DTW Late_air_delay          105.33333
## 9     VEL         SLC      Car_delay           96.00000
## 10    EAU         ORD Late_air_delay           82.83333
## 11    ABR         MSP Late_air_delay           53.75000
## 12    SMX         LAX Late_air_delay           47.33333
## 13    EAU         ORD      Nas_delay           39.50000
## 14    VEL         SLC       We_delay           26.00000
## 15    ABR         MSP      Nas_delay           13.00000
## 16    PIH         SLC Late_air_delay           13.00000
## 17    SMX         LAX      Nas_delay            3.00000
## 18    PIH         SLC      Nas_delay            0.00000
## 19    VEL         SLC      Nas_delay            0.00000

Finding: We found that the top delays are carrier related. It means, its not just about the airports, but if you are travelling from some specific carrier, you might have to wait even more before that flight of yours takes off.

Now lets find out which carriers are culprit.

  culpritCarriers = cypher(graph,
                    "MATCH (src)-[r:flight]->(tgt)
                    WITH src.name AS Origin, r.cause as Cause, 
                    tgt.name as Destination, r.carrier as Carrier,
                    AVG(toInt(r.dep_delay)) AS Avg_dep_delay_time
                    ORDER BY Avg_dep_delay_time desc
                    WHERE src.name IN ['ESC','PIH','ABR','VEL','SMX','EAU']
                    AND r.cause = 'Car_delay'
                    RETURN Origin, Destination, Cause, Avg_dep_delay_time, Carrier") 

culpritCarriers
##   Origin Destination     Cause Avg_dep_delay_time Carrier
## 1    ABR         MSP Car_delay           443.5000      OO
## 2    PIH         SLC Car_delay           319.0000      OO
## 3    SMX         LAX Car_delay           317.5000      OO
## 4    EAU         ORD Car_delay           267.3333      OO
## 5    ESC         DTW Car_delay           262.0000      OO
## 6    VEL         SLC Car_delay            96.0000      OO

Finding: Shockingly, or not, the culprit of high carrier delays is OO (SkyWest Airlines). All the high carrier delays are associated with SkyWesy Airlines. So, if you are flying from the selected 6 airports in the month of december using the services of SkyWest Airlines, you are going to have to wait for eternity!

Now lets do another analysis. Lets try to find out which airports have the worst connectivity.

  airportsLeastConnected = cypher(graph,
                    "MATCH (src)-[r:flight]->(tgt)
                    WITH tgt.name as Destination,
                    count(distinct src) as flightsFromAirport
                    ORDER BY flightsFromAirport asc
                    RETURN Destination, flightsFromAirport") 

Finding: 76 airports have incoming flights from just 1 airport. It means that there is only one way you can go these 76 airports. It might be fair to say that these airports don’t get enough passengers to improve connectivity. Some of these airports are DVL (Devil’s Lake, North Dakota), BTM (Butte, Montana) and GUM (Guam). Note: Guam is not on mainland, but is an island territory in the western pacific ocean owned by US.

Now, lets pick the 3 airports we mentioned above and find out how you can reach there. We find the source airports which have flights to these 3 airports and number of flights for different airlines.

  HowToReachThere = cypher(graph,
                    "MATCH (src)-[r:flight]->(tgt)
                    WITH src.name as Origin,
                    tgt.name as Destination,
                    r.carrier as Carrier,
                    sum(toInt(r.number)) as numberOfFlights
                    WHERE tgt.name IN ['DVL','BTM','GUM']
                    RETURN Origin, Destination, Carrier, numberOfFlights")

HowToReachThere
##   Origin Destination Carrier numberOfFlights
## 1    SLC         BTM      OO               5
## 2    JMS         DVL      OO              17
## 3    HNL         GUM      UA              12

Finding: If you want to reach Guam, you have to go to HNL (Honolulu, Island of Oahu). If you are planning to go to Butte, Montana, then you need to go via SLC (Salt Lake City, Utah)

What if you want to get to Guam? Since you can only go to Gua via Honolulu, lets find out if Honolulu is well connected or not.

  ToHonolulu = cypher(graph,
                    "MATCH (src)-[r:flight]->(tgt)
                    WITH tgt.name as Destination,
                    sum(toInt(r.number)) as numberOfFlights,
                    count(distinct src) as numberOfOrigins
                    WHERE tgt.name = 'HNL'
                    RETURN Destination, numberOfOrigins, numberOfFlights")

ToHonolulu
##   Destination numberOfOrigins numberOfFlights
## 1         HNL              27             754

Finding: Well, you are in luck! There are 754 flights to Honolulu in december and you can reach Honolulu via 27 different airport and cities. Thats a sigh of relief!

igraph

igraph is a library collection for creating and manipulating graphs and analyzing networks. igraph can be used to generate graphs, compute centrality measures and path length based properties as well as graph components and graph motifs.

Below we are creating a graph dataframe which has source airport and destination airport as vertices.

edges.df %>%
  select(ORIGIN, DEST) %>%
  distinct() -> edges.df.igraph

ig = graph.data.frame(edges.df.igraph)

Below, we are using the degree function to get the total number of edges to and from a vertex which means we will get the count of airports each airport is connected to. From this value we can determine the busiest airport in the US.

First we will find out the in-degree of each node. This will tell us how many airports have flights to that particular airport.

ig_degree_in = sort(degree(ig, mode = "in"),decreasing = TRUE)
ig_degree_in 
## ATL ORD DFW DEN IAH MSP DTW EWR LAX PHX SLC MCO SFO LAS BWI MDW CLT SEA 
## 161 154 147 137 109  96  92  77  76  76  73  72  72  69  64  62  60  60 
## IAD JFK TPA FLL LGA BOS MIA BNA HOU DCA STL SAN PDX PHL MCI AUS DAL RSW 
##  59  57  57  56  55  53  46  44  44  43  42  39  38  36  35  34  34  32 
## MSY CLE RDU OAK CVG HNL SAT PIT IND MKE SJC CMH SMF ABQ PBI SJU BDL JAX 
##  31  30  30  28  28  27  27  26  25  25  24  23  23  21  21  21  20  20 
## ANC OGG OKC SNA MEM OMA TUL BUF CHS TTN BHM TUS GRR ORF RIC SDF ONT ELP 
##  20  18  18  18  17  16  16  16  16  16  15  15  15  14  14  14  13  12 
## LIT ROC BOI DAY DSM LIH PSP STT BUR PVD ALB GSP ICT JAC RNO LGB SAV CAK 
##  12  12  12  11  11  11  11  11  11  11  11  11  10  10  10  10  10  10 
## MSN COS EGE HDN KOA GEG BZN GSO MHT MTJ HPN SYR TYS XNA ASE FAT BTV PWM 
##  10   9   9   9   9   9   9   9   9   8   8   8   8   8   8   7   7   7 
## CAE FNT HSV PNS JNU SRQ FAR LEX MDT BMI CID LBB MAF MLI SBA ACY FSD MYR 
##   7   7   7   7   6   6   6   6   6   6   6   6   6   6   6   6   5   5 
## AMA GJT PIA RAP SBN ISP KTN BQN ATW BIS BTR CHA CRW ECP EVV GRB JAN LFT 
##   5   5   5   5   5   5   4   4   4   4   4   4   4   4   4   4   4   4 
## MOB SGF SHV AZO FWA LAN LNK MBS LBE BFL EUG FCA RDM GUC STX BLI FAI ABE 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   3   3   3   3   3 
## AVL BIL CHO EYW GPT MSO PSC ROA TLH VPS AEX AVP CRP DRO ELM GRK HRL ISN 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## MLU PHF RST SAF ITO GCC MFR MRY RKS SBP SUN MFE BRW CDV OTZ PSG SCC SIT 
##   3   3   3   3   3   3   3   3   3   3   3   2   2   2   2   2   2   2 
## WRG YAK PSE SWF DLH GNV BRO CLL DIK ERI FSM LCH LRD MEI MGM MHK MOT TVC 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## TYR ILG UST CMI COU LSE SPI COD CPR GTF HIB HLN IDA IMT INL JMS MMH RHI 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## SGU ADQ BET OME ORH AGS DAB FAY ILM MLB TRI ABY BPT BQK CSG CWA DHN EWN 
##   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## GGG GTR HOB LAW OAJ PIB SCE SJT SPS VLD PPG ABI ACT ALO DBQ GCK GRI JLN 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## MQT ROW SUX TOL TXK IAG PBG ABR ACV APN BJI BRD CDC CEC CIU CLD CMX DVL 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## EAU EKO ESC FLG GFK HYS LAR LWS MKG OTH PAH PIH PLN PUB RDD SMX STC TWF 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## VEL YUM GUM BTM CNY ADK 
##   1   1   1   1   1   0

Finding: ATL, ORD, DFW, DEN and IAH have more than 100 in-degree. It means flights from more than 100 airports end at these airports. Also, ADK has 0 in-degree. That means this airport has no incoming flight. Weird!

Next, we find the out-degree of each node.

ig_degree_out = sort(degree(ig, mode = "out"),decreasing = TRUE)
ig_degree_out
## ATL ORD DFW DEN IAH MSP DTW SLC EWR LAX PHX SFO LAS MCO BWI IAD MDW CLT 
## 157 153 146 137 109  99  95  79  78  76  76  74  71  71  64  63  63  60 
## SEA FLL JFK LGA TPA BOS MIA BNA DCA HOU STL SAN PDX PHL MCI AUS DAL MSY 
##  59  57  57  57  56  53  46  44  44  44  42  41  37  37  35  34  34  33 
## RSW CLE RDU OAK CVG PIT SAT HNL MKE SJC IND CMH SMF ABQ BDL PBI SJU ANC 
##  32  29  29  28  27  26  26  25  25  24  23  22  22  21  21  20  20  19 
## JAX OKC SNA OGG MEM OMA BUF CHS TTN BHM RIC TUS ONT ORF SDF GRR JAC ELP 
##  18  18  18  17  16  16  16  16  16  15  15  15  14  14  14  14  13  12 
## LIT TUL ROC BOI DAY KOA PSP STT BUR PVD GSP EGE ICT LGB SAV ALB CAK MSN 
##  12  12  12  12  11  11  11  11  11  11  11  10  10  10  10  10  10  10 
## COS DSM HDN LIH MTJ RNO GEG GSO MHT HPN BZN TYS XNA ASE FAT BTV PWM SYR 
##   9   9   9   9   9   9   9   9   9   8   8   8   8   8   7   7   7   7 
## FNT HSV PNS JNU SRQ CAE FAR LEX MDT BMI CID MAF MLI SBA ACY FSD MYR AMA 
##   7   7   7   6   6   6   6   6   6   6   6   6   6   6   6   5   5   5 
## GJT LBB PIA RAP SBN ISP GUC BLI KTN BQN ATW BIS CHA CRW ECP EVV GRB JAN 
##   5   5   5   5   5   5   4   4   4   4   4   4   4   4   4   4   4   4 
## LFT MOB MSO SGF SHV AZO FWA LAN LBE BFL EUG RDM STX BRW ABE AVL BIL BTR 
##   4   4   4   4   4   4   4   4   4   4   4   4   3   3   3   3   3   3 
## CHO EYW GPT PSC ROA TLH VPS AEX AVP CRP DRO ELM GRK HRL ISN LNK MBS MLU 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## RST SAF SCE ITO FCA GCC MFR MRY RKS SBP SUN MFE CDV FAI OME OTZ PSG SCC 
##   3   3   3   3   3   3   3   3   3   3   3   2   2   2   2   2   2   2 
## SIT WRG YAK ORH PSE SWF DLH GNV BRO CLL DIK ERI FSM LCH LRD MEI MGM MHK 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## MOT PHF PIB TVC TYR ILG UST CMI COU SPI COD DVL GTF HIB HLN IDA IMT INL 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## JMS MMH RHI SGU ADK ADQ BET AGS DAB FAY ILM MLB TRI ABY BPT BQK CSG CWA 
##   2   2   2   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## DHN EWN GGG GTR HOB LAW OAJ SJT SPS VLD PPG ABI ACT ALO DBQ GCK GRI JLN 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## LSE MQT ROW SUX TOL TXK IAG PBG ABR ACV APN BJI BRD CDC CEC CIU CLD CMX 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## CPR EAU EKO ESC FLG GFK HYS LAR LWS MKG OTH PAH PIH PLN PUB RDD SMX STC 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## TWF VEL YUM GUM BTM CNY 
##   1   1   1   1   0   0

Finding: Not so surprising, but ATL, ORD, DFW, DEN and IAH have outbound flights to more than 100 destinations. If we take a look at ADK, it has flight to 1 destination, even though it doesn’t have any incoming flights.

List of airports connected to the busiest airport.

We are naming all the neighboring vertices of a vertex v. We are considering ATL as our vertex and we are trying to find all the airports which has a flight connection to ATL.

V(ig)[neighbors(ig, v=V(ig)['ATL'])]
## + 157/312 vertices, named:
##   [1] ABQ AUS BDL BHM BNA BOS BWI CLE CLT CMH COS DAY DCA DEN DFW DSM DTW
##  [18] EGE ELP EWR FLL HDN HNL HOU IAD IAH ICT IND JAX JFK LAS LAX LGA LIT
##  [35] MCI MCO MEM MIA MKE MSP MSY OKC OMA ORD ORF PBI PDX PHL PHX PIT RDU
##  [52] RIC RSW SAN SAT SDF SEA SFO SJC SJU SLC SMF SNA STL STT TPA TUL TUS
##  [69] OAK BTV BUF CHS HPN PVD PWM ROC SAV SRQ SYR ABE AGS ALB ATW AVL BTR
##  [86] BZN CAE CAK CHA CHO CRW CVG DAB DAL ECP EVV EYW FAR FAY FNT FSD GNV
## [103] GPT GRB GRR GSO GSP HSV ILM JAN LEX LFT MDT MDW MHT MLB MOB MSN MYR
## [120] PNS ROA SGF SHV TLH TRI TYS VPS XNA ABY AEX AVP AZO BMI BQK CID CSG
## [137] DHN ELM EWN FSM FWA GRK GTR LAN LNK MBS MGM MLI MLU OAJ PHF PIA RST
## [154] SBN VLD TTN ASE
E(ig)[incident(ig, v=V(ig)['ATL'])]
## + 318/4046 edges (vertex names):
##  [1] ATL->ABQ ATL->AUS ATL->BDL ATL->BHM ATL->BNA ATL->BOS ATL->BWI
##  [8] ATL->CLE ATL->CLT ATL->CMH ATL->COS ATL->DAY ATL->DCA ATL->DEN
## [15] ATL->DFW ATL->DSM ATL->DTW ATL->EGE ATL->ELP ATL->EWR ATL->FLL
## [22] ATL->HDN ATL->HNL ATL->HOU ATL->IAD ATL->IAH ATL->ICT ATL->IND
## [29] ATL->JAX ATL->JFK ATL->LAS ATL->LAX ATL->LGA ATL->LIT ATL->MCI
## [36] ATL->MCO ATL->MEM ATL->MIA ATL->MKE ATL->MSP ATL->MSY ATL->OKC
## [43] ATL->OMA ATL->ORD ATL->ORF ATL->PBI ATL->PDX ATL->PHL ATL->PHX
## [50] ATL->PIT ATL->RDU ATL->RIC ATL->RSW ATL->SAN ATL->SAT ATL->SDF
## [57] ATL->SEA ATL->SFO ATL->SJC ATL->SJU ATL->SLC ATL->SMF ATL->SNA
## [64] ATL->STL ATL->STT ATL->TPA ATL->TUL ATL->TUS ATL->OAK ATL->BTV
## + ... omitted several edges

We also want to find out the most and least connected airport on the basis of the betweeness of a graph. The below code calculates the betweenness among all the airports and then they are sorted in decreasing order.

ig_betweeness = sort(betweenness(ig,directed = TRUE), decreasing = TRUE)
ig_betweeness[1:10]
##       ATL       ORD       DFW       DEN       MSP       SLC       IAH 
## 17645.593 16858.076 16790.878 13601.721  8245.909  7158.786  5712.870 
##       ANC       DTW       SFO 
##  5143.909  5063.938  4709.846
ig_betweeness_0 <- ig_betweeness[which(ig_betweeness == 0.00)]
length(ig_betweeness_0)
## [1] 184

Finding: The airport with most betweenness is ATL, with a betweenness of 17645.593. There are 184 airports with 0 betweenness, which means that they don’t act as a connecting airport to any other airport. Out of these 184, we chose 3 random airports, which are COS (Colorado Springs, Colorado), FAT (Fresno, California) and GUC (Gunnison, Colorado).

The below code gives the shortest path between the two least busiest airports, COS and GUC.

ig_shortest_path = sort (shortest.paths(graph = ig)['COS','GUC'],decreasing = TRUE)
ig_shortest_path
## [1] 2

Finding: Here in our case we are calculating the distance between COS and GUC which is 2. It means there is no direct flight from Colorado Springs to Gunnison, Colorado.

Closeness centrality is defined as the total graph-theoretic distance to all other nodes in the network. That means it gives us the reciprocal of the sum of distance from the vertex to all other vertices. When a node has a high closeness score (i.e., is highly central), it tends to receive anything flowing through the network very quickly.

ig_closeness = sort(closeness(ig),decreasing = TRUE)
head(ig_closeness,5)
##         ORD         ATL         DFW         DEN         IAH 
## 0.001272265 0.001267427 0.001251564 0.001248439 0.001194743
tail(ig_closeness,5)
##          SIT          YAK          WRG          BTM          CNY 
## 6.775068e-04 6.775068e-04 6.734007e-04 1.030588e-05 1.030588e-05

Finding: We can see that ORD, ATL and DFW are the 3 airports with the highest closeness that means these airports are more frequently encountered on our network of airports and flights.

Transitivity measures the probability that the adjacent vertices of a vertex are connected. The global transitivity of an undirected graph (directed graphs are considered as undirected ones as well). This is simply the ratio of the triangles and the connected triples in the graph. For directed graph the direction of the edges is ignored.

ig_transitivity = transitivity(ig)
ig_transitivity
## [1] 0.3164311

A property very important in networks, is transitivity. It refers to the extent to which the relation that relates two nodes in a network that are connected by an edge is transitive. The graph created has a transitivity of 0.3164311.

We have used the coreness funtion to get the coreness value of each airport and then sorted in decreasing order. If we see the top airports, we will get to know that ATL is the airport with highest coreness value.

ig_coreness = sort(coreness(ig),decreasing = TRUE)
ig_coreness
## ATL AUS BNA BOS BWI CLE CLT DCA DEN DFW DTW EWR FLL HOU IAD IAH JFK LAS 
##  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50 
## LAX LGA MCI MCO MIA MSP MSY ORD PHL PHX PIT RDU SAN SEA SFO SLC STL TPA 
##  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50 
## MDW SAT MKE CVG IND RSW CMH DAL PDX ABQ JAX OAK BDL SJU HNL OKC SJC SMF 
##  50  49  47  47  46  46  45  44  43  40  38  38  37  37  36  36  36  36 
## SNA MEM PBI OMA BUF CHS BHM TUS TTN GRR ORF RIC SDF TUL ONT OGG ELP LIT 
##  36  33  33  32  32  32  30  30  30  29  28  28  28  28  27  25  24  24 
## ROC JAC DAY PSP STT BUR PVD BOI GSP ALB DSM ICT KOA LIH ANC LGB SAV CAK 
##  24  23  22  22  22  22  22  22  22  21  20  20  20  20  20  20  20  20 
## MSN EGE RNO COS HDN GEG GSO MHT MTJ BZN HPN TYS XNA ASE SYR FAT BTV PWM 
##  20  19  19  18  18  18  18  18  17  17  16  16  16  16  15  14  14  14 
## FNT HSV PNS CAE SRQ FAR LEX MDT BMI CID MAF MLI SBA LBB FSD MYR AMA GJT 
##  14  14  14  13  12  12  12  12  12  12  12  12  12  11  10  10  10  10 
## PIA RAP SBN ACY ISP BQN ATW BIS CHA CRW ECP EVV GRB JAN LFT MOB SGF SHV 
##  10  10  10  10  10   8   8   8   8   8   8   8   8   8   8   8   8   8 
## AZO FWA LAN LBE BFL EUG RDM GUC BLI BTR MSO LNK MBS FCA STX ABE AVL BIL 
##   8   8   8   8   8   8   8   7   7   7   7   7   7   7   6   6   6   6 
## CHO EYW GPT PSC ROA TLH VPS AEX AVP CRP DRO ELM GRK HRL ISN MLU RST SAF 
##   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6 
## ITO GCC MFR MRY RKS SBP SUN PHF MFE BRW CDV FAI JNU KTN PSG SCC SIT WRG 
##   6   6   6   6   6   6   6   5   4   4   4   4   4   4   4   4   4   4 
## YAK PSE SWF DLH GNV BRO CLL DIK ERI FSM LCH LRD MGM MHK MOT SCE TVC TYR 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
## ILG UST CMI COU SPI COD GTF HIB HLN IDA IMT INL MMH RHI SGU OME OTZ ORH 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   3   3   3 
## MEI PIB LSE CPR DVL JMS ADQ BET AGS DAB FAY ILM MLB TRI ABY BPT BQK CSG 
##   3   3   3   3   3   3   2   2   2   2   2   2   2   2   2   2   2   2 
## CWA DHN EWN GGG GTR HOB LAW OAJ SJT SPS VLD PPG ABI ACT ALO DBQ GCK GRI 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## JLN MQT ROW SUX TOL TXK IAG PBG ABR ACV APN BJI BRD CDC CEC CIU CLD CMX 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## EAU EKO ESC FLG GFK HYS LAR LWS MKG OTH PAH PIH PLN PUB RDD SMX STC TWF 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## VEL YUM GUM ADK BTM CNY 
##   2   2   2   1   1   1

Visualization

We are making a data frame to plot the busiest airports. Below code creates a data frame edges.df.new.

flights %>%
  select(CARRIER,ORIGIN,DEST,DEP_DELAY_NEW) %>%
         na.omit()%>%
         group_by(ORIGIN,DEST) %>%
         summarize(nflights = n(),
            Dep_delay =  mean(DEP_DELAY_NEW, na.rm=TRUE))  %>%
         group_by() %>% 
         {.} -> edges.df.new
edges.df.new %>%
  arrange(., desc(nflights)) %>%
  mutate(avgDailyFlights = ceiling(nflights / 31)) -> edges.df.new.1

Lets take a look at the summary of the dataframe.

summary(edges.df.new.1)
##      ORIGIN          DEST         nflights        Dep_delay      
##  ATL    : 164   ATL    : 164   Min.   :   1.0   Min.   :  0.000  
##  ORD    : 154   ORD    : 154   1st Qu.:  31.0   1st Qu.:  7.773  
##  DFW    : 146   DFW    : 147   Median :  65.0   Median : 11.657  
##  DEN    : 137   DEN    : 138   Mean   : 109.9   Mean   : 13.693  
##  IAH    : 110   IAH    : 110   3rd Qu.: 144.0   3rd Qu.: 16.743  
##  MSP    : 106   MSP    : 106   Max.   :1085.0   Max.   :206.000  
##  (Other):3352   (Other):3350                                     
##  avgDailyFlights 
##  Min.   : 1.000  
##  1st Qu.: 1.000  
##  Median : 3.000  
##  Mean   : 3.962  
##  3rd Qu.: 5.000  
##  Max.   :35.000  
## 

Finding: We can see that the maximum flights between a pair of airports is 1085, in December 2015. That’s an average of 35 flights between 2 top airports.

Now let’s take a look at the distribution of average daily flights. We take help of the ggplot2 package here and use functions like ggplot, geom_histogram and geom_density.

p <- ggplot(edges.df.new.1, aes(x=avgDailyFlights))
p + geom_histogram(binwidth=1,
                   colour="black", fill="white") +
    geom_vline(aes(xintercept=median(avgDailyFlights, na.rm=T)), 
               color="red", linetype="dashed", size=1)

Finding: Well, clearly the histogram is right skewed. The maximum average daily flights are 35, but for half the pairs of airports, there are less than 3 flights daily. This calls for more investigation. The red line is the median line.

Now lets take a look at the boxplot of the average daily flights.

p1 <- ggplot(edges.df.new.1, aes(x=avgDailyFlights ,y=avgDailyFlights))
p1 + geom_boxplot(outlier.colour="blue",
               outlier.shape=1,
               outlier.size=2,
               notch=FALSE)

Finding: There are at least 18 pairs of airports which are outliers. They don’t fall under the normal distribution of the average daily flights. From the boxplot, we can see that if the average daily flights is more than 11, then the pair is an outlier. The blue dots are outliers.

So, we make a graph of all the airport pairs which have more than 11 flights between them everyday.

edges.df.new.1 %>%
  filter(avgDailyFlights > 11) %>%
  graph.data.frame() -> airportPairsGreaterThan11

The below plot shows the airport pairs with more than 11 flights between them, on a given day.

plot(airportPairsGreaterThan11,
     layout = layout.kamada.kawai,
     vertex.color="blue", vertex.size=7,
     vertex.frame.color="gray", vertex.label.color="black",
     vertex.label.cex=0.5, vertex.label.dist=0.5,
     edge.curved=0.2, edge.arrow.size=0.4,
     edge.color="light blue")

Finding: Airports such as LAX, ATL, ORD, SFO and others have more than 11 flights between them everyday. This says that these cities are very well connected. You cannot get stuck in these cities due to lack of flights.

We are making a data frame to plot the airports having the most delay and then plot the graph. From the graph, we can find the busiest airports and also how well the airports are connected. The flight direction is given by an arrow.

edges.df.new %>%
  arrange(.,desc(Dep_delay)) -> edges.df.new.2

Lets take a look at the summary of this dataframe.

summary(edges.df.new.2)
##      ORIGIN          DEST         nflights        Dep_delay      
##  ATL    : 164   ATL    : 164   Min.   :   1.0   Min.   :  0.000  
##  ORD    : 154   ORD    : 154   1st Qu.:  31.0   1st Qu.:  7.773  
##  DFW    : 146   DFW    : 147   Median :  65.0   Median : 11.657  
##  DEN    : 137   DEN    : 138   Mean   : 109.9   Mean   : 13.693  
##  IAH    : 110   IAH    : 110   3rd Qu.: 144.0   3rd Qu.: 16.743  
##  MSP    : 106   MSP    : 106   Max.   :1085.0   Max.   :206.000  
##  (Other):3352   (Other):3350

Finding: We can see that the maximum departure delay between a pair of airports is 206 minutes, in December 2015. But 3/4th of the flights have departure delay less than 16.743 minutes, which is nice.

Now let’s take a look at the distribution of average departure delay.

p <- ggplot(edges.df.new.2, aes(x=Dep_delay))
p + geom_histogram(binwidth=1,
                   colour="black", fill="white") +
    geom_vline(aes(xintercept=median(Dep_delay, na.rm=T)), 
               color="red", linetype="dashed", size=1)

Finding: Well, clearly the histogram is right skewed. The maximum average departure delay is greater than 200 minutes, but for half the pairs of airports, there are less than 12 minutes of departure delay. The red line is the median line.

Now lets take a look at the boxplot of the average departure delay.

p1 <- ggplot(edges.df.new.2, aes(x=Dep_delay ,y=Dep_delay))
p1 + geom_boxplot(outlier.colour="blue",
               outlier.shape=1,
               outlier.size=2,
               notch=FALSE)

Finding: There are many pairs of airports which are outliers. They don’t fall under the normal distribution of the average departure delay. From the boxplot, we can see that if the average departure delay is more than 30 minutes, then the pair is an outlier. The blue dots are outliers.

So, we make a graph of all the airport pairs which have more than 30 minutes of departure delay between them everyday.

edges.df.new.2 %>%
  filter(Dep_delay > 30) %>%
  graph.data.frame() -> airportPairsGreaterThan30

The below plot shows the airport pairs which have more than 30 minutes of departure delay between them everyday.

plot(airportPairsGreaterThan30,
     layout = layout.auto,
     vertex.color="blue", vertex.size=7,
     vertex.frame.color="gray", vertex.label.color="black",
     vertex.label.cex=0.5, vertex.label.dist=0.5,
     edge.curved=0.2, edge.arrow.size=0.4,
     edge.color="light blue")

Finding: Flights coming into airports such as DFW, JFK, ORD, SFO and others have more than 30 minutes of departure delay.

The below plot shows the top 20 airport pairs with maximum delay. The graph shows the origin and destination airports and the direction of the flight which has the most delay.

edges.df.new.2 %>%
  filter(Dep_delay > 120) %>%
  graph.data.frame() -> airportPairsGreaterThan120

plot(airportPairsGreaterThan120,
     layout = layout.auto,
     vertex.color="blue", vertex.size=7,
     vertex.frame.color="gray", vertex.label.color="black",
     vertex.label.cex=0.5, vertex.label.dist=0.5,
     edge.curved=0.2, edge.arrow.size=0.4,
     edge.color="light blue")

Finding: There are 6 pairs of airports which have more than 2 hours (120 minutes) of departure delay.

Conclusion

From our analysis, we have come to the following conclusions.