Data set description

My dataset covers flights departing from Houston's two airports in 2011. At first I wanted to use your suggested link, but then I saw that there’s a package called “hflights” which contains the dataset, so I downloaded that one from CRAN. Its author is Hadley Wickham, and he got the data from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics. Sorry that I didn’t use your link, but I honestly didn’t understand what some of the columns meant, so I researched a little and found this package, where the author gives detailed information about the meaning of the columns.

#Loading the required package (require() only attaches it; install.packages("hflights") would install it first):
require("hflights")

## Loading required package: hflights

## Warning: package 'hflights' was built under R version 3.2.3

#library() attaches the package as well (and stops with an error if it is not installed):
library("hflights")

nrow(hflights) #How many rows does my dataset have?

## [1] 227496

ncol(hflights) #How many columns does the dataset have?

## [1] 21

The chosen dataset has 227,496 rows, 21 columns and the columns are called:

names(hflights) # What are the names of the columns?

##  [1] "Year"              "Month"             "DayofMonth"       
##  [4] "DayOfWeek"         "DepTime"           "ArrTime"          
##  [7] "UniqueCarrier"     "FlightNum"         "TailNum"          
## [10] "ActualElapsedTime" "AirTime"           "ArrDelay"         
## [13] "DepDelay"          "Origin"            "Dest"             
## [16] "Distance"          "TaxiIn"            "TaxiOut"          
## [19] "Cancelled"         "CancellationCode"  "Diverted"

Here is what each column means:

Year: the year of departure

Month: the month of departure

DayofMonth: the day of the month of departure

DayOfWeek: the day of the week of departure

DepTime: departure time in local time

ArrTime: arrival time in local time

UniqueCarrier: unique abbreviation for a carrier

FlightNum: flight number

TailNum: airplane tail number

ActualElapsedTime: elapsed time of flight, in minutes

AirTime: flight time, in minutes

ArrDelay: arrival delays in minutes

DepDelay: departure delays in minutes

Origin: origin airport code

Dest: destination airport code

Distance: distance of flight, in miles

TaxiIn: taxi in time in minutes

TaxiOut: taxi out time in minutes

Cancelled: cancelled indicator: 1 = Yes, 0 = No

CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security

Diverted: diverted indicator: 1 = Yes, 0 = No

Questions

The leading manager of the Houston Airport is a big R fan and wants to convince all his workers and passengers to use that amazing program as well. For that purpose he gives them valuable information about the Houston Airport and answers urgent questions:

Question 1: Is my plane more likely to depart late in winter (January, February, March) than in summer (June, July, August)? If there is a difference, is it significant?

Question 2: Can you predict the arrival delays on Sundays using Month, AirTime, DepDelay and Distance as predictors? Create a scatterplot to compare the actual values with the predicted ones.

Question 3: Because the manager of the Houston Airport has an old friend who is the leading manager at Atlanta Airport, he wants to do him a favor by showing him the actual elapsed time of flights going to Atlanta (ATL) in a histogram. He also adds reference lines showing the mean and median.

Question 4: The manager of the Houston Airport wants an easy way for passengers to see the reason for their cancellation when they type in the letter of the cancellation reason. Therefore he creates a fancy function that helps passengers answer that question.

Question 5: For his weekly presentation the manager of the airport is curious about some relations between the distance of a flight and other variables. First he examines whether the average distance of flights differs according to the day of the week. Next he is interested in whether the distance of a flight is somehow correlated with the taxi out time, because the big flights might get better or worse parking positions at the airport. To show his results he wants to present them in a nice manner.

Question 6: The befriended manager of the Atlanta Airport was so happy about the histogram that he would like to get some more descriptive statistics about flights from Houston to Atlanta. He is particularly interested in the average delay time for flights from Houston to Atlanta in comparison to other destinations. Of course our manager is happy to help him.

Question 7: Although the Houston Airport has such a skilled leading manager, passengers don’t like it that much, especially because of the long taxi in and out times. Because every airport worker worries a lot about losing his or her job, they need a good advertisement campaign to attract more passengers. Our manager has the brilliant idea to cheat a little on the taxi data and subtracts 5 minutes from every taxi in and out time if it is larger than 5 minutes.

Question 8: To make his monthly presentations a bit easier, our manager looks for a quick way to see how many flights per month depart from Houston Airport (HOU) and how many from George Bush Intercontinental Airport (IAH), the second large airport close to Houston. To make sure he follows the holy rule D.R.Y. (“Don’t repeat yourself.”), he decides to create a loop so that he does not have to type in the same stuff for every month.

Question 9: To end his convincing statement for using R more often, the manager of the Houston Airport wants to show his colleagues and passengers a beautiful scatterplot with two groups in different colours: he compares flights starting at Houston Airport (red dots) and flights starting at George Bush Intercontinental Airport (green squares) with regard to the distance of a flight and the taxi in time.

Analyses

Question 1: Is my plane more likely to depart late in winter (January, February, March) than in summer (June, July, August)? If there is a difference, is it significant?

First I will create two subsets of the data by separating the winter flights from the summer flights:

winter.flights <- subset(hflights, subset = (Month == 1 | Month == 2 | Month ==3))
summer.flights <- subset(hflights, subset = (Month == 6 | Month == 7 | Month ==8))
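The same kind of subsetting can be written more compactly with %in%. The following is only a minimal sketch on a made-up toy data frame; toy.flights, toy.winter, toy.summer and all values in them are hypothetical and not taken from hflights:

```r
# Toy data standing in for hflights (values are invented for illustration)
toy.flights <- data.frame(Month    = c(1, 2, 3, 6, 7, 8, 12),
                          DepDelay = c(5, NA, -2, 10, 0, 30, 4))

# %in% tests membership in a set of months, replacing the chain of == and |
toy.winter <- subset(toy.flights, Month %in% 1:3)
toy.summer <- subset(toy.flights, Month %in% 6:8)

nrow(toy.winter)  # number of winter rows in the toy data
nrow(toy.summer)  # number of summer rows in the toy data
```

With this form, changing the definition of a season only means changing the vector of months.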

Now I will calculate some basic statistics on the departure delay: the mean, the standard deviation and the median.

#For that task I can either take the fast way of using the summary function or do it separately by using sd(), mean() and median(). Because we want to know the statistics of the departure delay I will use indexing.

#TASK 2

round(mean(winter.flights$DepDelay, na.rm = T), 2) # Including na.rm = T prevents me from getting an NA result, because R then ignores the rows where DepDelay is NA. Moreover I included the round function to make my results a bit easier to read (it rounds the mean to 2 decimal places).

## [1] 8.97

round(mean(summer.flights$DepDelay, na.rm = T), 2)

## [1] 10.76

#Now I am calculating the median departure delay for both subsets:
round(median(winter.flights$DepDelay, na.rm =T),2)

## [1] 0

round(median(summer.flights$DepDelay, na.rm = T),2)

## [1] 1

#And the standard deviation:
round(sd(winter.flights$DepDelay, na.rm = T),2)

## [1] 26.73

round(sd(summer.flights$DepDelay, na.rm =T),2)

## [1] 28.9

#An easier and faster way is using the summary function, where we get most of this information (everything except the standard deviation) with one call:

round(summary(winter.flights$DepDelay),2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -23.00   -3.00    0.00    8.97   10.00  780.00    1459

round(summary(summer.flights$DepDelay),2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -19.00   -2.00    1.00   10.76   11.00  981.00     496

On average, planes depart later in summer (mean delay 10.76 minutes) than in winter (mean delay 8.97 minutes). The median is 0 minutes for winter flights and 1 minute for summer flights, so in both seasons roughly half of the flights leave at most one minute late. The standard deviation is 26.73 minutes for winter flights and 28.9 minutes for summer flights. The maximum delay is 780 minutes in winter and even 981 minutes in summer (poor passengers!).

But is this difference significant? To see whether there is a significant difference in departure delay between summer and winter flights I am using a t-test.

#TASK 3

t.test(winter.flights$DepDelay, summer.flights$DepDelay, mu = 0, paired = F, var.equal = F, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  winter.flights$DepDelay and summer.flights$DepDelay
## t = -10.845, df = 113810, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.111109 -1.464860
## sample estimates:
## mean of x mean of y 
##  8.969102 10.757087

#I am using the t.test function to calculate a two sample t-test where I compare whether the mean of departure delays in winter flights is significantly different from the mean of departure delays in summer flights.

There is a significant difference between summer flights and winter flights: summer flights depart on average about 1.8 minutes later than winter flights (95% confidence interval for the difference: 1.46 to 2.11 minutes), t(113810) = -10.845, p < 2.2e-16.
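To make clear what the Welch test computes, the statistic and its degrees of freedom can be reproduced by hand. This is only a sketch on two small invented samples (x, y and the derived names are hypothetical, not the flight data); it compares the manual result with what t.test() returns:

```r
# Toy samples (invented values, not from hflights)
x <- c(2, 5, -1, 12, 0, 7)
y <- c(10, 3, 15, 8, 20, 1, 9)

# Welch statistic: difference in means divided by its standard error
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t.manual <- (mean(x) - mean(y)) / se

# Welch-Satterthwaite approximation for the degrees of freedom
df.manual <- se^4 / ((var(x) / length(x))^2 / (length(x) - 1) +
                     (var(y) / length(y))^2 / (length(y) - 1))

# Compare with the built-in test (var.equal = FALSE selects the Welch variant)
welch <- t.test(x, y, var.equal = FALSE)
all.equal(t.manual, unname(welch$statistic))
all.equal(df.manual, unname(welch$parameter))
```

Because the two variances are not pooled, the degrees of freedom are usually not a whole number, which is why t.test() reports a rounded value.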

Question 2: Can you predict the arrival delays on Sundays using Month, AirTime, DepDelay and Distance as predictors? Create a scatterplot to compare the actual values with the predicted ones.

#Task 5

#I am calculating a linear regression using the lm() function, including all the named variables as predictors and ArrDelay as the predicted value. I also include the subset option to only get the results for Sundays (DayOfWeek == 7).

lm.arrival.delay <- lm(formula = ArrDelay ~ Month + AirTime + DepDelay + Distance ,data = hflights, subset = DayOfWeek == 7 )

#To see the results I am using the summary() function.

summary(lm.arrival.delay)

## 
## Call:
## lm(formula = ArrDelay ~ Month + AirTime + DepDelay + Distance, 
##     data = hflights, subset = DayOfWeek == 7)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.827  -5.682  -0.638   4.898 137.065 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.7216443  0.1674173  -34.18  < 2e-16 ***
## Month       -0.1006011  0.0163303   -6.16 7.34e-10 ***
## AirTime      0.5053775  0.0057372   88.09  < 2e-16 ***
## DepDelay     0.9962699  0.0019866  501.50  < 2e-16 ***
## Distance    -0.0643287  0.0007147  -90.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.962 on 31657 degrees of freedom
##   (396 observations deleted due to missingness)
## Multiple R-squared:  0.8907, Adjusted R-squared:  0.8907 
## F-statistic: 6.45e+04 on 4 and 31657 DF,  p-value: < 2.2e-16

According to the model, all included independent variables are highly significant predictors of arrival delay: Month (t(31657) = -6.16, p = 7.34e-10), AirTime (t(31657) = 88.09, p < 2e-16), DepDelay (t(31657) = 501.50, p < 2e-16) and Distance (t(31657) = -90.01, p < 2e-16). Holding the other predictors constant, the arrival delay is larger the earlier in the year the flight takes place, the longer the time in the air, the more delayed the departure and the shorter the distance. The whole model is highly significant (F(4, 31657) = 6.45e+04, p < 2.2e-16) and explains 89.07% of the variance.

Let’s see whether the scatterplot also shows how well the model predicts arrival delay.

#If I run the command names(), R shows me which components our linear model includes. There should be one called fitted.values which shows us what the model predicted the values should be.

names(lm.arrival.delay)

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"

#Indeed, there is the fitted.values component.
#Now we can compare the fitted values with the actual values in a scatterplot.

#The model frame (lm.arrival.delay$model) holds exactly the rows used for fitting, so x and y always have the same length:
plot(x = lm.arrival.delay$model$ArrDelay,
y = lm.arrival.delay$fitted.values,
xlab = "True Arrival Delay in Minutes",
ylab = "Model Arrival Delay in Minutes",
main = "True versus Predicted Arrival Delays on Sundays",
pch = 16, col = "green", bg = gray(0.9,0.1), cex = 0.3
)
#Now I am adding a regression line to the plot by using the function abline (y ~ x, matching the axes of the plot):
abline(lm(lm.arrival.delay$fitted.values ~ lm.arrival.delay$model$ArrDelay), col = "red")
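A numeric companion to the scatterplot: for a linear model with an intercept, the squared correlation between the actual and the fitted values equals the model's R-squared. The following is a minimal sketch on simulated toy data (toy, toy.model and all values are invented for illustration). It also shows that the model frame stored in $model contains only the rows that were actually used for fitting, so it always lines up with the fitted values:

```r
set.seed(1)  # make the simulated toy data reproducible

# Toy data: arrival delay is departure delay plus noise, with two missing values
toy <- data.frame(DepDelay = rnorm(50, mean = 10, sd = 20))
toy$ArrDelay <- toy$DepDelay + rnorm(50, sd = 5)
toy$ArrDelay[c(3, 17)] <- NA

toy.model <- lm(ArrDelay ~ DepDelay, data = toy)

# Rows with missing values were dropped, so both vectors have the same length
actual      <- toy.model$model$ArrDelay
fitted.vals <- toy.model$fitted.values
length(actual) == length(fitted.vals)

# Squared correlation of actual and fitted values versus the reported R-squared
all.equal(cor(actual, fitted.vals)^2, summary(toy.model)$r.squared)
```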

Question 3: Because the manager of the Houston Airport has an old friend who is the leading manager at Atlanta Airport, he wants to do him a favor by showing him the actual elapsed time of flights going to Atlanta (ATL) in a histogram. He also adds reference lines showing the mean and median.

# TASK 7

hist(hflights$ActualElapsedTime[hflights$Dest == "ATL"], breaks = 100, col = "green", border = "black", main = "Histogram of elapsed time for flights to Atlanta", xlab = "elapsed time of flight, in minutes", ylab = "number of flights")

#Add the lines

#For the mean (red line):
abline(v = mean(hflights$ActualElapsedTime[hflights$Dest == "ATL"], na.rm = T), col = "red", lwd = 2)
#For the median (blue line):
abline(v = median(hflights$ActualElapsedTime[hflights$Dest == "ATL"], na.rm = T), col = "blue", lwd = 2)

As you can see from the histogram, most of the flights have an elapsed time of about 115 minutes; the mean is 120.05 minutes and the median is 118 minutes.

Question 4: The manager of the Houston Airport wants an easy way for passengers to see the reason for their cancellation when they type in the letter of the cancellation reason.

#TASK 10

#I will create a new function that prints out a sentence that tells the passenger why his/her flight was cancelled.

flight.cancellation.reason <- function(x) {
  #Default message, so that output is also defined for an unknown letter:
  output <- "Unknown cancellation code, please type in A, B, C or D."
  if(x == "A") {output <- "Unfortunately your flight was cancelled because of the carrier."}
  if(x == "B") {output <- "Due to bad weather we can't fly, we apologize for that but safety comes first."}
  if(x == "C") {output <- "The national air system doesn't allow us to fly, we apologize for cancelling your flight."}
  if(x == "D") {output <- "Due to security reasons we aren't able to fly so your flight was cancelled. We apologize for that but your safety and life are more important to us."}
  return(output)}

#Test of the function

flight.cancellation.reason("A")

## [1] "Unfortunately your flight was cancelled because of the carrier."

flight.cancellation.reason("B")

## [1] "Due to bad weather we can't fly, we apologize for that but safety comes first."

flight.cancellation.reason("C")

## [1] "The national air system doesn't allow us to fly, we apologize for cancelling your flight."

flight.cancellation.reason("D")

## [1] "Due to security reasons we aren't able to fly so your flight was cancelled. We apologize for that but your safety and life are more important to us."

#Seems like it works.
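A chain of if() statements works, but the same lookup can also be sketched with switch(), which additionally allows a fallback for letters that are not a valid code. The function name cancellation.reason.switch and the short messages are hypothetical and only meant as an illustration:

```r
# Hypothetical alternative to the if() chain: switch() picks the matching entry
cancellation.reason.switch <- function(x) {
  switch(x,
         A = "carrier",
         B = "weather",
         C = "national air system",
         D = "security",
         "unknown code")  # the unnamed last entry is the fallback
}

cancellation.reason.switch("B")
cancellation.reason.switch("Z")
```

The letter-to-reason mapping follows the description of the CancellationCode column given earlier.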

Question 5: For his weekly presentation the manager of the airport is curious about some relations between the distance of a flight and other variables. First he examines whether the average distance of flights differs according to the day of the week. Next he is interested in whether the distance of a flight is somehow correlated with the taxi out time, because the big flights might get better or worse parking positions at the airport. To show his results he wants to present them in a nice manner.

#Task 8

#For the first relationship the dependent variable is Distance, which is numeric, and the independent variable is DayOfWeek, which is categorical. For this relationship a boxplot is a nice way to show the results.

boxplot(Distance ~ DayOfWeek, data = hflights, main = "Distance according to the day of the week", xlab = "Day of the week", ylab = "Distance in miles", col="gold")

The boxplot shows him that there is no obvious difference in the distance between the weekdays. Probably the airport is large enough to offer the same flights every day.
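A boxplot displays medians and quartiles rather than means, so the averages per day of the week could additionally be checked with tapply(). This is a minimal sketch on invented toy data (toy.flights and its values are hypothetical), not a result for hflights:

```r
# Toy data (invented): two flights each on days 1, 2 and 7
toy.flights <- data.frame(DayOfWeek = c(1, 1, 2, 2, 7, 7),
                          Distance  = c(200, 400, 300, 300, 100, 900))

# tapply() splits Distance by DayOfWeek and applies the function to each group
mean.dist   <- tapply(toy.flights$Distance, toy.flights$DayOfWeek, mean)
median.dist <- tapply(toy.flights$Distance, toy.flights$DayOfWeek, median)

mean.dist
median.dist
```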

Relation between Distance of the flight and Taxi out time.

#For the next relationship both variables (Distance and TaxiOut) are numeric, so I will show a correlation between the two.

cor(x= hflights$Distance, y= hflights$TaxiOut, use = "na.or.complete", method = "pearson" )

## [1] 0.1582346

The correlation is 0.1582346, but of course we also want to know whether this is significant or not. For that I am using the cor.test function.

#Task 4

cor.test(x= hflights$Distance, y=hflights$TaxiOut , alternative = "two.sided", method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  hflights$Distance and hflights$TaxiOut
## t = 75.938, df = 224550, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1541994 0.1622646
## sample estimates:
##       cor 
## 0.1582346

According to the correlation test the correlation is significant, with t(224550) = 75.938 and a p-value < 2.2e-16. This means that the longer the distance of a flight, the longer the aircraft tends to taxi from the gate to the runway, although with r = 0.16 the relationship is weak. For his presentation the manager also shows the correlation in a plot:

plot(x = hflights$Distance, y=hflights$TaxiOut, type = "p", main = "Relationship between Distance and Taxi time", xlab = "Distance of the flight in miles", ylab = "Taxi time in minutes")
abline(lm(hflights$TaxiOut ~ hflights$Distance ), col = "red")
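The t value reported by cor.test() follows directly from the correlation and the degrees of freedom via t = r * sqrt(df / (1 - r^2)). The short sketch below plugs in the two numbers from the output above (the variable names are only illustrative) and also computes r squared, which comes out at roughly 0.025, so Distance accounts for only about 2.5 percent of the variance in TaxiOut despite the tiny p-value:

```r
r  <- 0.1582346  # correlation reported by cor.test()
df <- 224550     # degrees of freedom reported by cor.test()

# Test statistic of the Pearson correlation test
t.from.r <- r * sqrt(df / (1 - r^2))
round(t.from.r, 2)

# Share of variance in one variable that is linearly related to the other
r.squared <- r^2
round(r.squared, 3)
```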

Question 6: The befriended manager of the Atlanta Airport was so happy about the histogram that he would like to get some more descriptive statistics about flights from Houston to Atlanta. He is particularly interested in the average delay time for flights from Houston to Atlanta in comparison to other destinations. Of course our manager is happy to help him.

#TASK 9

#For that task I am using the aggregate function because I want to see the mean overall delay time (departure delay plus arrival delay) for every destination.
#The formula is passed as the first, unnamed argument so that the call does not depend on how the R version names it.

mean.delay.dest <- aggregate(DepDelay + ArrDelay ~ Dest, FUN = mean, na.rm = TRUE, data = hflights)

#To get information for Atlanta I am using indexing: 

mean.delay.dest$`DepDelay + ArrDelay`[mean.delay.dest$Dest == "ATL"]

## [1] 18.37644

The mean delay time (departure plus arrival delay) for a flight from Houston to Atlanta is 18.38 minutes. To see what this means in comparison to other destinations I am using the summary function to get the minimum and maximum of all the average delay times:

summary(mean.delay.dest)

##      Dest           DepDelay + ArrDelay
##  Length:116         Min.   :-20.09     
##  Class :character   1st Qu.: 12.55     
##  Mode  :character   Median : 15.59     
##                     Mean   : 15.94     
##                     3rd Qu.: 19.73     
##                     Max.   : 51.26

In comparison to other destinations Atlanta is a bit above the average, which is 15.94 minutes, and in the “worse” half of all destinations because the median is 15.59 minutes. But what is the worst destination to go to from Houston?

#Here I am using the function which.max to get the index of the largest delay time.
which.max(mean.delay.dest$`DepDelay + ArrDelay`)

## [1] 5

#It returns the index 5 so let's use indexing to get the Destination of the fifth row:
mean.delay.dest$Dest[5]

## [1] "ANC"

The highest average delay time, 51.26 minutes, is for flights from Houston to “ANC”, which is Anchorage in Alaska. This might be because of snow or bad weather…
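Instead of looking up only the single worst destination, the aggregated table could also be sorted so that the complete ranking is visible. This is a minimal sketch on an invented toy table (toy.delays, its destination codes and values are hypothetical):

```r
# Toy table of average delays per destination (invented values)
toy.delays <- data.frame(Dest  = c("AAA", "BBB", "CCC"),
                         delay = c(12.5, 51.3, -3.0),
                         stringsAsFactors = FALSE)

# order() returns the row indices that sort the delays from largest to smallest
ranked <- toy.delays[order(toy.delays$delay, decreasing = TRUE), ]

ranked$Dest[1]               # destination with the largest average delay
which(ranked$Dest == "AAA")  # rank of one particular destination
```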

Question 7: Although the Houston Airport has such a skilled leading manager, passengers don’t like it that much, especially because of the long taxi in and out times. Because every airport worker worries a lot about losing his or her job, they need a good advertisement campaign to attract more passengers. Our manager has the brilliant idea to cheat a little bit on the taxi data and subtracts 5 minutes from every taxi in and out time if it is larger than 5 minutes.

#TASK 1

#I am recoding the values of the taxi in time by subtracting 5 from every value. 

hflights$TaxiIn.new <- hflights$TaxiIn -5

#To make it less obvious that he cheated, I now index all values below 1 (zeros and negative numbers) and set them to 1 minute:

hflights$TaxiIn.new[hflights$TaxiIn.new < 1] <- 1

#Now I am doing the same for the taxi out time:

hflights$TaxiOut.new <- hflights$TaxiOut -5

hflights$TaxiOut.new[hflights$TaxiOut.new < 1] <- 1

head(hflights)

##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      TaxiIn.new TaxiOut.new
## 5424          2           8
## 5425          1           4
## 5426          1          12
## 5427          4          17
## 5428          4           4
## 5429          1           8

It worked: we now have two new columns that can be used for the advertisement campaign.
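The recoding above subtracts 5 from every value and then raises everything below 1 to 1, which is not quite the same as the rule in the question (subtract 5 only if the time is larger than 5 minutes). The difference can be sketched on a small invented vector; taxi, taxi.rule and taxi.floor are hypothetical names used only here:

```r
# Toy taxi times in minutes (invented), including a missing value
taxi <- c(3, 5, 6, 13, NA)

# Rule as stated in the question: subtract 5 only where the time is larger than 5
taxi.rule <- ifelse(taxi > 5, taxi - 5, taxi)

# Approach used above: subtract 5 everywhere, then raise values below 1 to 1
taxi.floor <- pmax(taxi - 5, 1)

taxi.rule
taxi.floor
```

The two versions agree for times above 5 minutes and differ only for short taxi times, which stay unchanged under the stated rule but become 1 minute under the floor approach.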

Question 8: To make his big monthly presentations a bit easier, our manager looks for a quick way to see how many flights per month depart from Houston Airport (HOU) and how many from George Bush Intercontinental Airport (IAH), the second large airport close to Houston. To make sure he follows the holy rule D.R.Y. (“Don’t repeat yourself.”), he decides to create a loop so that he does not have to type in the same stuff for every month.

for (i in 1:12) { #Here I want my loop to repeat its calculation for every month from 1 (January) to 12 (December).
  monthly.table <- table(hflights$Origin[hflights$Month == i]) #The function table should be repeated for every month.
  print(monthly.table) #To see the results I need to add the function print().
}

## 
##   HOU   IAH 
##  4270 14640 
## 
##   HOU   IAH 
##  3884 13244 
## 
##   HOU   IAH 
##  4544 14926 
## 
##   HOU   IAH 
##  4420 14173 
## 
##   HOU   IAH 
##  4533 14639 
## 
##   HOU   IAH 
##  4499 15101 
## 
##   HOU   IAH 
##  4519 16029 
## 
##   HOU   IAH 
##  4505 15671 
## 
##   HOU   IAH 
##  4186 13879 
## 
##   HOU   IAH 
##  4405 14291 
## 
##   HOU   IAH 
##  4212 13809 
## 
##   HOU   IAH 
##  4322 14795
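As an alternative to the loop, table() can cross-tabulate two variables at once, which gives all months in a single table. This is a minimal sketch on invented toy data (toy.flights and monthly.counts are hypothetical names, and the counts are not those of hflights):

```r
# Toy data (invented): six flights in three months from two origins
toy.flights <- data.frame(Month  = c(1, 1, 1, 2, 2, 3),
                          Origin = c("HOU", "IAH", "IAH", "HOU", "IAH", "IAH"))

# Rows are months, columns are origin airports, cells are flight counts
monthly.counts <- table(toy.flights$Month, toy.flights$Origin)
monthly.counts

# Individual cells can be read by month and airport code
monthly.counts["1", "IAH"]
```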

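Question 9: To end his convincing statement for using R more often, the manager of the Houston Airport wants to show his colleagues and passengers a beautiful scatterplot with two groups in different colours: he compares flights starting at Houston Airport (red dots) and flights starting at George Bush Intercontinental Airport (green squares) with regard to the distance of a flight and the taxi in time.

#TASK 6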

# First I am subsetting the data

flights.hou <- subset(hflights, Origin == "HOU")
flights.iah <- subset(hflights, Origin == "IAH")

# Now I am creating a blank plot
plot(x = 1,
xlab = "Distance in miles",
ylab = "Taxi in time in minutes",
type = "n", 
main = "Taxi In time by Distance and Origin",
xlim = c(0, 2000),
ylim = c(0, 40))

#I am using type = "n" here because it means no plotting: we want to use low-level plotting functions afterwards to add our dots and squares

# Now I am adding red dots for flights from Houston Airport:
points(x = (flights.hou$Distance),
y = (flights.hou$TaxiIn),
pch = 16,
col = "red")

# Here I am adding green squares for flights from George Bush Intercontinental Airport:
points(x = flights.iah$Distance,
y = flights.iah$TaxiIn,
pch = 22,
col = "lawngreen")

#Finally I am adding two regression lines, a black one for HOU and a blue one for IAH:

abline(lm(flights.hou$TaxiIn ~ flights.hou$Distance), col = "black") 
abline(lm(flights.iah$TaxiIn ~ flights.iah$Distance), col = "blue")

According to the plot there are more long-distance flights from IAH, and the regression lines show that the taxi in time is slightly higher there as well. On the other hand, the taxi in time at IAH does not change much with the distance of a flight, whereas it does at Houston Airport (HOU).
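What the two regression lines show can also be put into numbers by fitting one model per origin and comparing the slopes. The following is only a sketch on invented toy data in which TaxiIn is an exact linear function of Distance; toy.flights, slopes and all values are hypothetical:

```r
# Toy data (invented): TaxiIn rises faster with Distance at HOU than at IAH
toy.flights <- data.frame(
  Origin   = rep(c("HOU", "IAH"), each = 4),
  Distance = c(100, 200, 300, 400, 100, 200, 300, 400),
  TaxiIn   = c(3, 4, 5, 6, 7, 7.5, 8, 8.5)
)

# split() makes one data frame per origin; each one gets its own linear model,
# from which only the slope for Distance is kept
slopes <- sapply(split(toy.flights, toy.flights$Origin),
                 function(d) unname(coef(lm(TaxiIn ~ Distance, data = d))["Distance"]))

slopes  # minutes of taxi in time per additional mile, by origin
```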

Conclusion

First, we can see that in terms of delay it is better to fly in winter than in summer, because the departure delay in summer is on average significantly higher than in winter. Moreover, flying from Houston to Anchorage in Alaska is not recommended, because it has the highest of all average delay times (departure plus arrival delay) with 51.26 minutes.

A good model to predict the arrival delay of your flight is a linear model using the month, time in the air, departure delay and distance as predictors. At least the model works pretty well on Sundays, where it explains about 89% of the variance.

According to our boxplots there is no obvious difference in the distances flown between the days of the week.

The correlation test showed us that there is a significant but weak positive relation (r = 0.16) between the distance of a flight and the taxi out time in minutes, which means that long-distance flights tend to spend a little more time taxiing from the gate to the runway, perhaps because they are parked further away from it.

If you want to fly from Houston to Atlanta, you have to expect an average delay time of 18.38 minutes, which is a bit worse than the average delay time over all destinations from Houston (15.94 minutes). Most of the Houston-Atlanta flights have an elapsed time of about 115 minutes, as you can see from the histogram.

The long-distance flights tend to have longer taxi in times and start more often from George Bush Intercontinental Airport.

Data Analysis and Visualization with R : Final Paper

Ulrike Schmiedel

Winter 2015/2016
