Load Libraries:

# load Libraries
library(hflights)
library(ggplot2)

Data Preparation

head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

Research question

In this project I analyze a dataset containing all flights departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The dataset is already available as an R package (hflights), so I decided to use it. The data originally came from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0

We will try to establish whether departure and arrival delays are related to the season.

We will also check whether departure and arrival delays are related to the day of the week (weekday versus weekend).

Cases

dim(hflights)
## [1] 227496     21

Each row represents a single flight departing from one of the Houston airports, IAH (George Bush Intercontinental) or HOU (Houston Hobby).

Data collection

The data are obtained from the hflights package in R. The variable definitions, taken from the package documentation, are:

Year, Month, DayofMonth: date of departure

DayOfWeek: day of week of departure (useful for removing weekend effects)

DepTime, ArrTime: departure and arrival times (in local time, hhmm)

UniqueCarrier: unique abbreviation for a carrier

FlightNum: flight number

TailNum: airplane tail number

ActualElapsedTime: elapsed time of flight, in minutes

AirTime: flight time, in minutes

ArrDelay, DepDelay: arrival and departure delays, in minutes

Origin, Dest: origin and destination airport codes

Distance: distance of flight, in miles

TaxiIn, TaxiOut: taxi in and out times in minutes

Cancelled: cancelled indicator: 1 = Yes, 0 = No

CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security

Diverted: diverted indicator: 1 = Yes, 0 = No

Type of study

Observational

Data Source

Data is collected via the hflights library in R.

Dependent Variable

DepDelay: quantitative, in minutes

ArrDelay: quantitative, in minutes

Independent Variable

Season: qualitative

DayOfWeek: qualitative (stored as integer codes 1 to 7)

Relevant summary statistics


summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##                                                                 
##     DepTime        ArrTime     UniqueCarrier        FlightNum   
##  Min.   :   1   Min.   :   1   Length:227496      Min.   :   1  
##  1st Qu.:1021   1st Qu.:1215   Class :character   1st Qu.: 855  
##  Median :1416   Median :1617   Mode  :character   Median :1696  
##  Mean   :1396   Mean   :1578                      Mean   :1962  
##  3rd Qu.:1801   3rd Qu.:1953                      3rd Qu.:2755  
##  Max.   :2400   Max.   :2400                      Max.   :7290  
##  NA's   :2905   NA's   :3066                                    
##    TailNum          ActualElapsedTime    AirTime         ArrDelay      
##  Length:227496      Min.   : 34.0     Min.   : 11.0   Min.   :-70.000  
##  Class :character   1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000  
##  Mode  :character   Median :128.0     Median :107.0   Median :  0.000  
##                     Mean   :129.3     Mean   :108.1   Mean   :  7.094  
##                     3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000  
##                     Max.   :575.0     Max.   :549.0   Max.   :978.000  
##                     NA's   :3622      NA's   :3622    NA's   :3622     
##     DepDelay          Origin              Dest              Distance     
##  Min.   :-33.000   Length:227496      Length:227496      Min.   :  79.0  
##  1st Qu.: -3.000   Class :character   Class :character   1st Qu.: 376.0  
##  Median :  0.000   Mode  :character   Mode  :character   Median : 809.0  
##  Mean   :  9.445                                         Mean   : 787.8  
##  3rd Qu.:  9.000                                         3rd Qu.:1042.0  
##  Max.   :981.000                                         Max.   :3904.0  
##  NA's   :2905                                                            
##      TaxiIn           TaxiOut         Cancelled       CancellationCode  
##  Min.   :  1.000   Min.   :  1.00   Min.   :0.00000   Length:227496     
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.00000   Class :character  
##  Median :  5.000   Median : 14.00   Median :0.00000   Mode  :character  
##  Mean   :  6.099   Mean   : 15.09   Mean   :0.01307                     
##  3rd Qu.:  7.000   3rd Qu.: 18.00   3rd Qu.:0.00000                     
##  Max.   :165.000   Max.   :163.00   Max.   :1.00000                     
##  NA's   :3066      NA's   :2947                                         
##     Diverted       
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.000000  
##  Mean   :0.002853  
##  3rd Qu.:0.000000  
##  Max.   :1.000000  
## 
#Departure delay boxplot
ggplot(hflights,aes(x=Origin,y=DepDelay)) + geom_boxplot()
## Warning: Removed 2905 rows containing non-finite values (stat_boxplot).

#Arrival delay boxplot
ggplot(hflights,aes(x=Origin,y=ArrDelay)) + geom_boxplot()
## Warning: Removed 3622 rows containing non-finite values (stat_boxplot).

Judging from the boxplots, IAH appears to have higher departure and arrival delays than HOU.
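The visual comparison of the two airports can be backed up with numbers. The following is a minimal sketch, not part of the original analysis: `flights` is a small made-up data frame standing in for `hflights` (its values are hypothetical), and the same `tapply()` calls could be applied to the real `DepDelay` and `Origin` columns. Passing `na.rm = TRUE` drops missing delays, similar to the rows ggplot2 reports as removed in the warnings above.

```r
# Toy stand-in for hflights (hypothetical values, NOT the real data)
flights <- data.frame(
  Origin   = c("IAH", "IAH", "IAH", "HOU", "HOU", "HOU"),
  DepDelay = c(5, 20, NA, -2, 4, 10)
)

# Median and mean departure delay per origin airport;
# na.rm = TRUE is forwarded to median() / mean() and ignores missing delays
dep_median <- tapply(flights$DepDelay, flights$Origin, median, na.rm = TRUE)
dep_mean   <- tapply(flights$DepDelay, flights$Origin, mean,   na.rm = TRUE)

print(dep_median)
print(dep_mean)
```

Because the delay distributions are strongly right-skewed (the maximum delays in the summary are far above the third quartile), the median is likely a more robust basis for comparison than the mean.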

First, we need to distinguish weekdays from weekends. I will create a function called ‘is.weekend’ that indicates whether a flight took place on a weekend (1) or not (0).

# Returns 1 for a weekend day (6 or 7), 0 for a weekday (1 to 5),
# and NA for any value outside 1 to 7 (including 0)
is.weekend <- function (c) {
  output <- NA
  if (c >= 1 & c <= 5) {output <- 0}
  if (c == 6 | c == 7) {output <- 1}
  return (output)}

Next, I will add a column ‘isweekend’.

hflights$isweekend <- NA  # a real missing value, not the string 'NA'

Loop over the rows of the data frame and apply the function ‘is.weekend’ to each DayOfWeek value, storing the result in the new column ‘isweekend’.

for (i in 1:nrow(hflights)) {
  c <- hflights$DayOfWeek [i]
  isweekend.i <- is.weekend(c)
  hflights$isweekend [i] <- isweekend.i 
}
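The row-by-row loop above works, but it is slow on more than 200,000 rows. As a hedged alternative (not the approach used in this report), the same flag can be computed in one vectorized step. The sketch below uses a hypothetical vector `day_of_week` in place of `hflights$DayOfWeek` and assumes the coding 1 = Monday to 7 = Sunday, which is consistent with the `head()` output where 1 January 2011, a Saturday, has DayOfWeek 6.

```r
# Hypothetical day-of-week codes (assumed coding: 1 = Monday ... 7 = Sunday)
day_of_week <- c(6, 7, 1, 2, 3, 4, 5)

# Vectorized weekend flag:
#   1 for Saturday/Sunday, 0 for Monday to Friday, NA for codes outside 1 to 7
is_weekend <- ifelse(day_of_week %in% 1:7,
                     as.integer(day_of_week %in% c(6, 7)),
                     NA_integer_)

print(is_weekend)
```

On the real data this would correspond to assigning the result of the same expression, evaluated on `hflights$DayOfWeek`, to `hflights$isweekend`.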

#Checking How it looks

head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      isweekend
## 5424         1
## 5425         1
## 5426         0
## 5427         0
## 5428         0
## 5429         0

Let's add a Season column to the data frame. I create a new column called ‘Season’ and copy the values of the column ‘Month’ into it; the month numbers will then be decoded into season names.

hflights$Season <- hflights$Month
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      isweekend Season
## 5424         1      1
## 5425         1      1
## 5426         0      1
## 5427         0      1
## 5428         0      1
## 5429         0      1

Now I will decode the month numbers into season names.

#Assigning the spring months
hflights$Season [hflights$Season == 3] <- 'Spring'
hflights$Season [hflights$Season == 4] <- 'Spring'
hflights$Season [hflights$Season == 5] <- 'Spring'

#Assigning the summer months
hflights$Season [hflights$Season == 6] <- 'Summer'
hflights$Season [hflights$Season == 7] <- 'Summer'
hflights$Season [hflights$Season == 8] <- 'Summer'

#Assigning the fall months 
hflights$Season [hflights$Season == 9] <- 'Fall'
hflights$Season [hflights$Season == 10] <- 'Fall'
hflights$Season [hflights$Season == 11] <- 'Fall'

#Assigning the winter months
hflights$Season [hflights$Season == 12] <- 'Winter'
hflights$Season [hflights$Season == 1] <- 'Winter'
hflights$Season [hflights$Season == 2] <- 'Winter'

head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      isweekend Season
## 5424         1 Winter
## 5425         1 Winter
## 5426         0 Winter
## 5427         0 Winter
## 5428         0 Winter
## 5429         0 Winter
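The twelve separate assignments can also be expressed as a single lookup. This is a sketch under the same month-to-season mapping used above, shown on a hypothetical vector `month` rather than on `hflights$Month`; the name `season_by_month` is introduced here for illustration only.

```r
# Hypothetical month numbers standing in for hflights$Month
month <- c(1, 3, 6, 9, 12)

# Lookup vector: position i holds the season of month i
# (Dec-Feb = Winter, Mar-May = Spring, Jun-Aug = Summer, Sep-Nov = Fall)
season_by_month <- c("Winter", "Winter",
                     "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Fall",   "Fall",   "Fall",
                     "Winter")

# Indexing the lookup vector by month number decodes all values at once
season <- season_by_month[month]

print(season)
```

A lookup keeps the mapping in one place and never compares a column that is partly numeric and partly text, which the step-by-step replacement relies on implicitly.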

This opens up many possibilities for analysis, and I will explore more of them during the project.

There are plenty of variables I can analyze and relate to one another.

Part 1 - Introduction

In this project we analyze a dataset containing all flights departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The dataset is already available as an R package (hflights), so we decided to use it. The data originally came from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0

We will check whether departure and arrival delays are related to the day of the week (weekday versus weekend).

We will try to establish whether departure and arrival delays are related to the season.

Part 2 - Data

The original dataset has 227496 rows and 21 columns. After adding the columns ‘isweekend’ and ‘Season’, the data frame has 23 columns. The dimensions and column names are:

## [1] 227496     23
##  [1] "Year"              "Month"             "DayofMonth"       
##  [4] "DayOfWeek"         "DepTime"           "ArrTime"          
##  [7] "UniqueCarrier"     "FlightNum"         "TailNum"          
## [10] "ActualElapsedTime" "AirTime"           "ArrDelay"         
## [13] "DepDelay"          "Origin"            "Dest"             
## [16] "Distance"          "TaxiIn"            "TaxiOut"          
## [19] "Cancelled"         "CancellationCode"  "Diverted"         
## [22] "isweekend"         "Season"

The data are obtained from the hflights package in R. The variable definitions, taken from the package documentation, are:

Year, Month, DayofMonth: date of departure

DayOfWeek: day of week of departure (useful for removing weekend effects)

DepTime, ArrTime: departure and arrival times (in local time, hhmm)

UniqueCarrier: unique abbreviation for a carrier

FlightNum: flight number

TailNum: airplane tail number

ActualElapsedTime: elapsed time of flight, in minutes

AirTime: flight time, in minutes

ArrDelay, DepDelay: arrival and departure delays, in minutes

Origin, Dest: origin and destination airport codes

Distance: distance of flight, in miles

TaxiIn, TaxiOut: taxi in and out times in minutes

Cancelled: cancelled indicator: 1 = Yes, 0 = No

CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security

Diverted: diverted indicator: 1 = Yes, 0 = No
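Since DepTime and ArrTime are stored as hhmm integers, they are not directly usable as a continuous time scale (for example, 1359 and 1400 are one minute apart, not 41 units). The following sketch shows one way to split such values into hours and minutes; the vector `dep_time` holds hypothetical values, and the derived variable names are introduced here for illustration.

```r
# Hypothetical hhmm departure times, including a missing value
dep_time <- c(1400, 1352, 5, 2400, NA)

# Integer division and remainder separate the hour and minute parts
dep_hour   <- dep_time %/% 100
dep_minute <- dep_time %% 100

# Minutes elapsed since midnight, a scale on which differences are meaningful
minutes_since_midnight <- dep_hour * 60 + dep_minute

print(minutes_since_midnight)
```

Note that the summary output reports a maximum of 2400 for both time columns, so a value of 1440 minutes (midnight at the end of the day) can occur.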

Part 3 - Exploratory data analysis

We will try to establish whether departure and arrival delays are related to the season and to the day of the week.

First, we need to distinguish weekdays from weekend days. We therefore create a function called ‘isweekend’ that indicates whether a flight took place on a weekend (1) or not (0).

Next, we create a new, empty column called ‘weekend’, which will later indicate whether a flight took place on a weekend or not.

Now, we loop over the rows of the data frame and apply the function ‘isweekend’ to fill the ‘weekend’ column.

# load Libraries
library(hflights)
library(ggplot2)

head(hflights)

dim(hflights)
summary(hflights)
#Departure delay boxplot
ggplot(hflights,aes(x=Origin,y=DepDelay)) + geom_boxplot()
#Arrival delay boxplot
ggplot(hflights,aes(x=Origin,y=ArrDelay)) + geom_boxplot()
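# Returns 1 for a weekend day (6 or 7), 0 for a weekday (1 to 5),
# and NA for any value outside 1 to 7 (including 0)
is.weekend <- function (c) {
  output <- NA
  if (c >= 1 & c <= 5) {output <- 0}
  if (c == 6 | c == 7) {output <- 1}
  return (output)}
hflights$isweekend <- NA  # a real missing value, not the string 'NA'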
for (i in 1:nrow(hflights)) {
  c <- hflights$DayOfWeek [i]
  isweekend.i <- is.weekend(c)
  hflights$isweekend [i] <- isweekend.i 
}

#Checking How it looks

head(hflights)
hflights$Season <- hflights$Month
head(hflights)
hflights$Season [hflights$Season == 3] <- 'Spring'
hflights$Season [hflights$Season == 4] <- 'Spring'
hflights$Season [hflights$Season == 5] <- 'Spring'

#Assigning the summer months
hflights$Season [hflights$Season == 6] <- 'Summer'
hflights$Season [hflights$Season == 7] <- 'Summer'
hflights$Season [hflights$Season == 8] <- 'Summer'

#Assigning the fall months 
hflights$Season [hflights$Season == 9] <- 'Fall'
hflights$Season [hflights$Season == 10] <- 'Fall'
hflights$Season [hflights$Season == 11] <- 'Fall'

#Assigning the winter months
hflights$Season [hflights$Season == 12] <- 'Winter'
hflights$Season [hflights$Season == 1] <- 'Winter'
hflights$Season [hflights$Season == 2] <- 'Winter'

head(hflights)
dim(hflights)
names(hflights)
# Returns 1 for a weekend day (6 or 7), 0 for a weekday (1 to 5),
# and NA for any value outside 1 to 7 (including 0)
isweekend <- function (x) {
  output <- NA
  if (x >= 1 & x <= 5) {output <- 0}
  if (x == 6 | x == 7) {output <- 1}
  return (output)}
hflights$weekend <- NA
#new column serves as container for new values
for (i in 1:nrow(hflights)) {
  x <- hflights$DayOfWeek [i]
  weekend.i <- isweekend(x)
  hflights$weekend [i] <- weekend.i 
}
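Once the ‘weekend’ and ‘Season’ columns exist, the research questions can be explored by comparing delays across groups. The following is a minimal sketch of such a comparison, not a result from this report: `flights` is a made-up data frame with hypothetical values, and the column names mirror those created above.

```r
# Toy stand-in for the prepared hflights data (hypothetical values)
flights <- data.frame(
  weekend  = c(0, 0, 0, 1, 1, 1),
  Season   = c("Winter", "Summer", "Summer", "Winter", "Winter", "Summer"),
  DepDelay = c(10, 0, NA, 4, 6, 20)
)

# Mean departure delay for weekdays (0) versus weekends (1)
mean_by_weekend <- tapply(flights$DepDelay, flights$weekend, mean, na.rm = TRUE)

# Mean departure delay per season
mean_by_season <- tapply(flights$DepDelay, flights$Season, mean, na.rm = TRUE)

print(mean_by_weekend)
print(mean_by_season)
```

The same calls with `ArrDelay` in place of `DepDelay` would give the arrival-delay comparison. Group means only describe the sample; whether the differences are meaningful would still need a formal test.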




# Appendix