Load Libraries:
# load Libraries
library(hflights)
library(ggplot2)
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
In this project I analyze a dataset containing all flights departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The dataset is already available as the R package hflights, so I use it directly. The data originally came from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0
We will try to establish whether departure and arrival delays are related to the season.
We will also check whether departure and arrival delays are related to the day of the week (weekday versus weekend).
dim(hflights)
## [1] 227496 21
Each row represents a single flight departing from one of the Houston airports, IAH (George Bush Intercontinental) or HOU (Houston Hobby).
The data are accessed through the hflights package in R. The variable definitions, taken from the package documentation, are:
Year, Month, DayofMonth: date of departure
DayOfWeek: day of week of departure (useful for removing weekend effects)
DepTime, ArrTime: departure and arrival times (in local time, hhmm)
UniqueCarrier: unique abbreviation for a carrier
FlightNum: flight number
TailNum: airplane tail number
ActualElapsedTime: elapsed time of flight, in minutes
AirTime: flight time, in minutes
ArrDelay, DepDelay: arrival and departure delays, in minutes
Origin, Dest: origin and destination airport codes
Distance: distance of flight, in miles
TaxiIn, TaxiOut: taxi in and out times in minutes
Cancelled: cancelled indicator: 1 = Yes, 0 = No
CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security
Diverted: diverted indicator: 1 = Yes, 0 = No
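Because DepTime and ArrTime are stored as hhmm integers, arithmetic on them directly is misleading: 1400 minus 1352 gives 48, although the two times are only 8 minutes apart. Below is a minimal sketch of splitting such values into hours and minutes, using a few DepTime values from the head() output. The variable names are illustrative and are not part of hflights.

```r
# A few DepTime values in hhmm form, taken from the head() output,
# plus one NA to show how missing times behave
dep_time <- c(1400, 1401, 1352, NA)

# Integer division and remainder split hhmm into hour and minute
dep_hour   <- dep_time %/% 100
dep_minute <- dep_time %% 100

# Minutes after midnight, which can safely be subtracted or averaged
dep_minutes_after_midnight <- dep_hour * 60 + dep_minute
```

Missing times stay NA throughout, so cancelled flights are not silently turned into a time.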
This is an observational study.
The data are accessed through the hflights package in R.
DepDelay: quantitative, in minutes
ArrDelay: quantitative, in minutes
Season: qualitative
DayOfWeek: qualitative (ordinal), stored as the integers 1 to 7
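Since DayOfWeek is stored as a number but behaves like a category, it can help to convert it to a labelled factor before plotting or tabulating. The sketch below assumes that 1 means Monday and 7 means Sunday. This matches the head() output, where 1 January 2011 (a Saturday) is coded 6, but the day labels themselves are my own and do not come from hflights.

```r
# DayOfWeek codes for the first four rows of the head() output
day_of_week <- c(6, 7, 1, 2)

# Assumed coding: 1 = Monday ... 7 = Sunday
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# An ordered factor keeps the days in calendar order in plots and tables
day_factor <- factor(day_of_week, levels = 1:7, labels = day_labels,
                     ordered = TRUE)
```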
Summary statistics for each variable are shown below, followed by boxplots of departure and arrival delay by origin airport.
summary(hflights)
## Year Month DayofMonth DayOfWeek
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2011 Median : 7.000 Median :16.00 Median :4.000
## Mean :2011 Mean : 6.514 Mean :15.74 Mean :3.948
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000
##
## DepTime ArrTime UniqueCarrier FlightNum
## Min. : 1 Min. : 1 Length:227496 Min. : 1
## 1st Qu.:1021 1st Qu.:1215 Class :character 1st Qu.: 855
## Median :1416 Median :1617 Mode :character Median :1696
## Mean :1396 Mean :1578 Mean :1962
## 3rd Qu.:1801 3rd Qu.:1953 3rd Qu.:2755
## Max. :2400 Max. :2400 Max. :7290
## NA's :2905 NA's :3066
## TailNum ActualElapsedTime AirTime ArrDelay
## Length:227496 Min. : 34.0 Min. : 11.0 Min. :-70.000
## Class :character 1st Qu.: 77.0 1st Qu.: 58.0 1st Qu.: -8.000
## Mode :character Median :128.0 Median :107.0 Median : 0.000
## Mean :129.3 Mean :108.1 Mean : 7.094
## 3rd Qu.:165.0 3rd Qu.:141.0 3rd Qu.: 11.000
## Max. :575.0 Max. :549.0 Max. :978.000
## NA's :3622 NA's :3622 NA's :3622
## DepDelay Origin Dest Distance
## Min. :-33.000 Length:227496 Length:227496 Min. : 79.0
## 1st Qu.: -3.000 Class :character Class :character 1st Qu.: 376.0
## Median : 0.000 Mode :character Mode :character Median : 809.0
## Mean : 9.445 Mean : 787.8
## 3rd Qu.: 9.000 3rd Qu.:1042.0
## Max. :981.000 Max. :3904.0
## NA's :2905
## TaxiIn TaxiOut Cancelled CancellationCode
## Min. : 1.000 Min. : 1.00 Min. :0.00000 Length:227496
## 1st Qu.: 4.000 1st Qu.: 10.00 1st Qu.:0.00000 Class :character
## Median : 5.000 Median : 14.00 Median :0.00000 Mode :character
## Mean : 6.099 Mean : 15.09 Mean :0.01307
## 3rd Qu.: 7.000 3rd Qu.: 18.00 3rd Qu.:0.00000
## Max. :165.000 Max. :163.00 Max. :1.00000
## NA's :3066 NA's :2947
## Diverted
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.002853
## 3rd Qu.:0.000000
## Max. :1.000000
##
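The NA counts in the summary (2905 for DepDelay, 3622 for ArrDelay) mark flights with no recorded delay. These are likely cancelled or diverted flights, though that is not verified here. Below is a minimal sketch of counting missing values per column and keeping only rows where both delays are present. The small data frame is a made-up stand-in for hflights.

```r
# Made-up stand-in for hflights with some missing delays
flights <- data.frame(
  DepDelay = c(0, 1, NA, 3),
  ArrDelay = c(-10, NA, NA, 3),
  Distance = c(224, 224, 224, 224)
)

# Number of missing values in each column
na_counts <- colSums(is.na(flights))

# Rows where both delays were recorded
complete_delays <- flights[!is.na(flights$DepDelay) & !is.na(flights$ArrDelay), ]
```

The same two lines should work on hflights itself once the package is loaded.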
#Departure delay boxplot
ggplot(hflights,aes(x=Origin,y=DepDelay)) + geom_boxplot()
## Warning: Removed 2905 rows containing non-finite values (stat_boxplot).
#Arrival delay boxplot
ggplot(hflights,aes(x=Origin,y=ArrDelay)) + geom_boxplot()
## Warning: Removed 3622 rows containing non-finite values (stat_boxplot).
The boxplots suggest that IAH has higher departure and arrival delays than HOU.
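With delays this skewed (the median is 0 while the maximum is above 900 minutes), the boxplots are hard to compare by eye, so a numeric summary per airport is a useful check. The sketch below uses tapply() on a made-up data frame. The numbers are for illustration only and say nothing about the real airports.

```r
# Made-up stand-in for hflights; the delays are invented
flights <- data.frame(
  Origin   = c("IAH", "IAH", "IAH", "HOU", "HOU", "HOU"),
  DepDelay = c(10, NA, 4, 0, 2, -2)
)

# Median and mean departure delay per origin, ignoring missing values
dep_median <- tapply(flights$DepDelay, flights$Origin, median, na.rm = TRUE)
dep_mean   <- tapply(flights$DepDelay, flights$Origin, mean, na.rm = TRUE)
```

Replacing flights with hflights and DepDelay with ArrDelay would give the corresponding arrival figures.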
First, we need to differentiate between weekdays and weekends. I will create a function called ‘is.weekend’ that indicates whether a flight took place on a weekend (1) or not (0).
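is.weekend <- function (day) {
  # DayOfWeek: 1-5 = Monday-Friday, 6 = Saturday, 7 = Sunday
  # Anything missing or outside 1-7 (including 0) gives a real NA, not the string 'NA'
  if (is.na(day) || day < 1 || day > 7) return (NA)
  if (day >= 6) return (1)
  return (0)
}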
Next I will add a column ‘isweekend’, initially filled with missing values.
hflights$isweekend <- NA
Then I loop over the rows and apply the function ‘is.weekend’ to fill the new column ‘isweekend’ with an indicator of whether each flight took place on a weekend.
# seq_len() is safe even if the data frame has zero rows
for (i in seq_len(nrow(hflights))) {
  day <- hflights$DayOfWeek [i]
  hflights$isweekend [i] <- is.weekend(day)
}
#Checking How it looks
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## isweekend
## 5424 1
## 5425 1
## 5426 0
## 5427 0
## 5428 0
## 5429 0
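The row-by-row loop above works, but it is slow on 227496 rows. The same indicator can be sketched as one vectorized expression. The input vector here is illustrative; the first six values are the DayOfWeek codes from the head() output, with an NA added to show how missing days are handled.

```r
# DayOfWeek codes from the head() output, plus one missing value
day_of_week <- c(6, 7, 1, 2, 3, 4, NA)

# 1 for Saturday/Sunday (6, 7), 0 for Monday-Friday (1-5),
# NA for anything missing or outside 1-7
is_weekend <- ifelse(day_of_week %in% 1:7,
                     as.integer(day_of_week >= 6),
                     NA_integer_)
```

Applied to the real data, this would read `hflights$isweekend <- ifelse(hflights$DayOfWeek %in% 1:7, as.integer(hflights$DayOfWeek >= 6), NA_integer_)`.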
Let's add a Season column to the data frame. I create a new column called ‘Season’ and copy the values of the column ‘Month’ into it; the month numbers are then recoded to season names.
hflights$Season <- hflights$Month
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## isweekend Season
## 5424 1 1
## 5425 1 1
## 5426 0 1
## 5427 0 1
## 5428 0 1
## 5429 0 1
Now I will recode the month numbers to season names.
#Assigning the spring months
hflights$Season [hflights$Season == 3] <- 'Spring'
hflights$Season [hflights$Season == 4] <- 'Spring'
hflights$Season [hflights$Season == 5] <- 'Spring'
#Assigning the summer months
hflights$Season [hflights$Season == 6] <- 'Summer'
hflights$Season [hflights$Season == 7] <- 'Summer'
hflights$Season [hflights$Season == 8] <- 'Summer'
#Assigning the fall months
hflights$Season [hflights$Season == 9] <- 'Fall'
hflights$Season [hflights$Season == 10] <- 'Fall'
hflights$Season [hflights$Season == 11] <- 'Fall'
#Assigning the winter months
hflights$Season [hflights$Season == 12] <- 'Winter'
hflights$Season [hflights$Season == 1] <- 'Winter'
hflights$Season [hflights$Season == 2] <- 'Winter'
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## isweekend Season
## 5424 1 Winter
## 5425 1 Winter
## 5426 0 Winter
## 5427 0 Winter
## 5428 0 Winter
## 5429 0 Winter
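The twelve assignments above can also be sketched as a single lookup: a vector of twelve season names indexed by month number. It uses the same month-to-season mapping as the code above; the variable names are my own.

```r
# Season name for each month number 1..12 (same mapping as above)
season_by_month <- c("Winter", "Winter",            # Jan, Feb
                     "Spring", "Spring", "Spring",  # Mar-May
                     "Summer", "Summer", "Summer",  # Jun-Aug
                     "Fall", "Fall", "Fall",        # Sep-Nov
                     "Winter")                      # Dec

# Illustrative month numbers; indexing by month returns the season
month  <- c(1, 4, 7, 10, 12)
season <- season_by_month[month]
```

On the real data this would be `hflights$Season <- season_by_month[hflights$Month]`, which also avoids overwriting a numeric column with text one value at a time.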
This opens up many possibilities for analysis, and I will explore more of them during the project.
There are plenty of other variables whose relationship with delays could be examined.
In this project we analyze a dataset containing all flights departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The dataset is already available as the R package hflights, so we use it directly. The data originally came from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0
We will check whether departure and arrival delays are related to the day of the week (weekday versus weekend).
We will also try to establish whether departure and arrival delays are related to the season.
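A first look at both questions could be a table of mean delay for each combination of season and weekend indicator. The sketch below uses aggregate() on a made-up data frame that borrows the column names Season, isweekend and DepDelay. The values are invented and are not results from hflights.

```r
# Made-up stand-in for hflights; the delays are invented
flights <- data.frame(
  Season    = c("Winter", "Winter", "Summer", "Summer", "Summer", "Fall"),
  isweekend = c(0, 1, 0, 0, 1, 1),
  DepDelay  = c(5, 15, 20, 10, NA, 0)
)

# Mean departure delay per Season x isweekend group.
# The formula interface drops rows with a missing DepDelay,
# so a group with no recorded delays does not appear at all.
mean_dep <- aggregate(DepDelay ~ Season + isweekend, data = flights, FUN = mean)
```

Because the delays are heavily right-skewed, swapping `mean` for `median` may give a more robust comparison.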
The original dataset has 227496 rows and 21 columns. After adding the ‘isweekend’ and ‘Season’ columns it has 23 columns. The dimensions and column names are:
## [1] 227496 23
## [1] "Year" "Month" "DayofMonth"
## [4] "DayOfWeek" "DepTime" "ArrTime"
## [7] "UniqueCarrier" "FlightNum" "TailNum"
## [10] "ActualElapsedTime" "AirTime" "ArrDelay"
## [13] "DepDelay" "Origin" "Dest"
## [16] "Distance" "TaxiIn" "TaxiOut"
## [19] "Cancelled" "CancellationCode" "Diverted"
## [22] "isweekend" "Season"
The data are accessed through the hflights package in R. The variable definitions, taken from the package documentation, are:
Year, Month, DayofMonth: date of departure
DayOfWeek: day of week of departure (useful for removing weekend effects)
DepTime, ArrTime: departure and arrival times (in local time, hhmm)
UniqueCarrier: unique abbreviation for a carrier
FlightNum: flight number
TailNum: airplane tail number
ActualElapsedTime: elapsed time of flight, in minutes
AirTime: flight time, in minutes
ArrDelay, DepDelay: arrival and departure delays, in minutes
Origin, Dest: origin and destination airport codes
Distance: distance of flight, in miles
TaxiIn, TaxiOut: taxi in and out times in minutes
Cancelled: cancelled indicator: 1 = Yes, 0 = No
CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security
Diverted: diverted indicator: 1 = Yes, 0 = No
First, we need to differentiate between weekdays and weekends. We therefore create a function called ‘isweekend’ that indicates whether a flight took place on a weekend (1) or not (0).
Next, we create a new, empty column called ‘weekend’, which will later indicate whether a flight took place on a weekend or not.
Now, we loop over the rows and apply the function ‘isweekend’ to fill the new column ‘weekend’.
isweekend <- function (x) {
  # DayOfWeek: 1-5 = Monday-Friday, 6 = Saturday, 7 = Sunday
  # Anything missing or outside 1-7 (including 0) gives a real NA, not the string 'NA'
  if (is.na(x) || x < 1 || x > 7) return (NA)
  if (x >= 6) return (1)
  return (0)
}
#new column serves as container for new values
hflights$weekend <- NA
for (i in seq_len(nrow(hflights))) {
  x <- hflights$DayOfWeek [i]
  hflights$weekend [i] <- isweekend(x)
}
# Appendix