My dataset is about flights departing from Houston in 2011. At first I wanted to use your suggested link, but then I saw that there is a package called “hflights” which contains the dataset, so I downloaded that one from CRAN. Its author is Hadley Wickham, and he got the data from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics. Sorry that I didn’t use your link, but I honestly didn’t understand what some of the columns meant, so I researched a little and found this package, where the author gives detailed information about the meaning of the columns.
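#Loading the required package (require() does not install anything; the package has to be installed once beforehand with install.packages("hflights")):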
require("hflights")
## Loading required package: hflights
## Warning: package 'hflights' was built under R version 3.2.3
#Now I am attaching it with library(), which stops with an error if the package is missing:
library("hflights")
nrow(hflights) #How many rows has my dataset?
## [1] 227496
ncol(hflights) #How many columns has the dataset?
## [1] 21
The chosen dataset has 227,496 rows, 21 columns and the columns are called:
names(hflights) # What are the names of the columns?
## [1] "Year" "Month" "DayofMonth"
## [4] "DayOfWeek" "DepTime" "ArrTime"
## [7] "UniqueCarrier" "FlightNum" "TailNum"
## [10] "ActualElapsedTime" "AirTime" "ArrDelay"
## [13] "DepDelay" "Origin" "Dest"
## [16] "Distance" "TaxiIn" "TaxiOut"
## [19] "Cancelled" "CancellationCode" "Diverted"
Here is what each column means:
Year: the year of departure
Month: the month of departure
DayofMonth: the day of the month of departure
DayOfWeek: the day of week of departure
DepTime: departure time in local time
ArrTime: arrival time in local time
UniqueCarrier: unique abbreviation for a carrier
FlightNum: flight number
TailNum: airplane tail number
ActualElapsedTime: elapsed time of flight, in minutes
AirTime: flight time, in minutes
ArrDelay: arrival delays in minutes
DepDelay: departure delays in minutes
Origin: origin airport code
Dest: destination airport code
Distance: distance of flight, in miles
TaxiIn: taxi in time in minutes
TaxiOut: taxi out time in minutes
Cancelled: cancelled indicator: 1 = Yes, 0 = No
CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security
Diverted: diverted indicator: 1 = Yes, 0 = No
The leading manager of the Houston Airport is a big R fan and wants to convince all his workers and passengers to use that amazing program as well. For that purpose he gives them valuable information about the Houston Airport and answers urgent questions:
Question 1: Is the risk of a delayed departure higher in winter (January, February, March) than in summer (June, July, August)? If there is a difference, is it significant?
Question 2: Can you predict the arrival delays on Sundays using Month, AirTime, DepDelay and Distance as predictors? Create a scatterplot to compare the actual values with the predicted ones.
Question 3: Because the manager of the Houston Airport has an old friend who is the leading manager at Atlanta Airport, he wants to do him a favor by showing him the actual elapsed time of flights going to Atlanta (ATL) in a histogram. He also adds additional reference lines showing the mean and median.
Question 4: The manager of the Houston Airport wants an easy way for passengers to see the reason for their cancellation when they type in the letter of the cancellation reason. Therefore he creates a fancy function that helps passengers answer that question.
Question 5: For his weekly presentation the manager of the airport is curious about some relations between the distance of a flight and other variables. First he examines whether the average distance of flights differs according to the day of the week. Next he is interested in whether the distance of a flight is somehow correlated with the taxi out time, because the big flights might get better or worse parking positions at the airport. He wants to present his results in a nice manner.
Question 6: The manager of the Atlanta Airport was so happy about the histogram that he would like to get some more descriptive statistics about flights from Houston to Atlanta. He is particularly interested in the average delay time for flights from Houston to Atlanta in comparison to other destinations. Of course our manager is happy to help him.
Question 7: Although the Houston Airport has such a skilled leading manager, passengers don’t like it that much, especially because of the long taxi in and out times. Because every airport worker worries a lot about losing his or her job, they need a good advertisement campaign to attract more passengers. Our manager has the brilliant idea to cheat a little on the taxi data and subtracts 5 minutes from every taxi in and out time if it is larger than 5 minutes.
Question 8: To make his monthly presentations a bit easier, our manager looks for a quick way to see how many flights per month depart from Houston Airport (HOU) and how many from George Bush Intercontinental Airport (IAH) (the second large airport close to Houston). To make sure he follows the holy rule D.R.Y. (“Don’t repeat yourself.”), he decides to create a loop so that he does not have to type in the same stuff for every month.
Question 9: To end his convincing statement for using R more often, the manager of the Houston Airport wants to show his colleagues and passengers a beautiful scatterplot with two groups in different colours: he compares flights starting at Houston Airport (red dots) and flights starting at George Bush Intercontinental Airport (green squares) with respect to the distance of a flight and the taxi in time.
Question 1: Is the risk of a delayed departure higher in winter (January, February, March) than in summer (June, July, August)? If there is a difference, is it significant?
First I will create two subsets of the data by separating the winter from the summer data:
winter.flights <- subset(hflights, subset = (Month == 1 | Month == 2 | Month ==3))
summer.flights <- subset(hflights, subset = (Month == 6 | Month == 7 | Month ==8))
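The same kind of subset can be written more compactly with %in%, which checks whether each Month is one of the listed values. The sketch below uses a small made-up data frame (toy.flights, a hypothetical stand-in for hflights) so that it runs without the package:

```r
# Toy data standing in for hflights (values are made up for illustration)
toy.flights <- data.frame(Month    = c(1, 2, 3, 6, 7, 8, 10),
                          DepDelay = c(5, 0, NA, 12, -2, 30, 4))

# %in% is TRUE for every row whose Month is one of the listed values
winter.toy <- subset(toy.flights, Month %in% c(1, 2, 3))
summer.toy <- subset(toy.flights, Month %in% 6:8)

nrow(winter.toy) # number of winter rows in the toy data
nrow(summer.toy) # number of summer rows in the toy data
```

With the real data the condition would simply be Month %in% c(1, 2, 3) on hflights.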
Now I will calculate some basic statistics on the departure delay like the mean, the standard deviation and the median.
#For that task I can either take the fast way of using the summary function or do it separately by using sd(), mean() and median(). Because we want to know the statistics of the departure delay I will use indexing.
#TASK 2
#Including na.rm = T makes R ignore the rows where DepDelay is NA; otherwise the result would be NA. I also wrap the call in round() so that the mean is rounded to 2 decimal places, which is easier to read.
round(mean(winter.flights$DepDelay, na.rm = T), 2)
## [1] 8.97
round(mean(summer.flights$DepDelay, na.rm = T), 2)
## [1] 10.76
#Now I am calculating the median departure delay for both subsets:
round(median(winter.flights$DepDelay, na.rm =T),2)
## [1] 0
round(median(summer.flights$DepDelay, na.rm = T),2)
## [1] 1
#And the standard deviation:
round(sd(winter.flights$DepDelay, na.rm = T),2)
## [1] 26.73
round(sd(summer.flights$DepDelay, na.rm =T),2)
## [1] 28.9
#An easier and faster way is using the summary function, where we get most of this information (everything except the standard deviation) with one call:
round(summary(winter.flights$DepDelay),2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -23.00 -3.00 0.00 8.97 10.00 780.00 1459
round(summary(summer.flights$DepDelay),2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -19.00 -2.00 1.00 10.76 11.00 981.00 496
It seems like your plane is more at risk of departing late in summer with a mean of 10.76 minutes than in winter, where the mean is 8.97 minutes. The median for winter flights is 0 minutes and for summer flights it’s 1 minute. The standard deviation for winter flights is 26.73 minutes and for summer flights it’s 28.9 minutes. The maximum for winter flights is 780 minutes of delay and in summer even 981 minutes (poor passengers!).
But is this difference significant? To see whether there is a significant difference in departure delay between summer and winter flights I am using a t-test.
#TASK 3
t.test(winter.flights$DepDelay, summer.flights$DepDelay, mu = 0, paired = F, var.equal = F, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: winter.flights$DepDelay and summer.flights$DepDelay
## t = -10.845, df = 113810, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.111109 -1.464860
## sample estimates:
## mean of x mean of y
## 8.969102 10.757087
#I am using the t-test function to calculate a two sample t-test where I compare whether the mean of departure delays in winter flights is significantly different to the mean of departure delays in summer flights.
There is a significant difference between summer flights and winter flights as summer flights have significantly more minutes in departure delay than winter flights, t(113810) = -10.845, p < 2.2e-16.
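With this many flights even a small difference becomes significant, so it helps to look at the size of the effect as well. The following is a rough sketch of Cohen's d computed from the rounded means and standard deviations reported above. The helper cohens.d.from.summary is a hypothetical name, and using the simple average of the two variances (instead of weighting by the group sizes) is a simplifying assumption.

```r
# Hypothetical helper: Cohen's d from summary statistics.
# Assumption: the pooled SD is the square root of the plain average of both variances.
cohens.d.from.summary <- function(mean.x, mean.y, sd.x, sd.y) {
  pooled.sd <- sqrt((sd.x^2 + sd.y^2) / 2)
  (mean.y - mean.x) / pooled.sd
}

# Rounded values taken from the output above (x = winter, y = summer)
d <- cohens.d.from.summary(mean.x = 8.97, mean.y = 10.76,
                           sd.x = 26.73, sd.y = 28.9)
round(d, 2)
## [1] 0.06
```

By the usual rules of thumb a d of about 0.06 is a very small effect: the difference of roughly 1.8 minutes is statistically significant but small compared to a spread of almost 30 minutes.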
Question 2: Can you predict the arrival delays on Sundays using Month, AirTime, DepDelay and Distance as predictors? Create a scatterplot to compare the actual values with the predicted ones.
#Task 5
#I am calculating a linear regression using the lm() function, including all the named variables as predictors and ArrDelay as the predicted value. I also include the subset option to only get the results for Sundays (DayOfWeek == 7).
lm.arrival.delay <- lm(formula = ArrDelay ~ Month + AirTime + DepDelay + Distance ,data = hflights, subset = DayOfWeek == 7 )
#To see the results I am using the summary() function.
summary(lm.arrival.delay)
##
## Call:
## lm(formula = ArrDelay ~ Month + AirTime + DepDelay + Distance,
## data = hflights, subset = DayOfWeek == 7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.827 -5.682 -0.638 4.898 137.065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.7216443 0.1674173 -34.18 < 2e-16 ***
## Month -0.1006011 0.0163303 -6.16 7.34e-10 ***
## AirTime 0.5053775 0.0057372 88.09 < 2e-16 ***
## DepDelay 0.9962699 0.0019866 501.50 < 2e-16 ***
## Distance -0.0643287 0.0007147 -90.01 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.962 on 31657 degrees of freedom
## (396 observations deleted due to missingness)
## Multiple R-squared: 0.8907, Adjusted R-squared: 0.8907
## F-statistic: 6.45e+04 on 4 and 31657 DF, p-value: < 2.2e-16
According to the model, all included independent variables are highly significant predictors of arrival delay: Month (t(31657) = -6.16, p = 7.34e-10), AirTime (t(31657) = 88.09, p < 2e-16), DepDelay (t(31657) = 501.50, p < 2e-16) and Distance (t(31657) = -90.01, p < 2e-16). Holding the other predictors constant, the arrival delay is larger the earlier in the year the flight takes place, the longer the time in the air, the more delayed the departure and the shorter the distance. The whole model is highly significant (F(4, 31657) = 6.45e+04, p < 2.2e-16) and explains 89.07% of the variance.
Let’s see whether the scatterplot also shows how well the model predicts arrival delay.
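To make the coefficients concrete, the fitted equation can be applied by hand to a single flight. The sketch below copies the estimates from the summary output and uses a Sunday flight with Month = 1, AirTime = 45, DepDelay = 1 and Distance = 224; the names coefs and new.flight are made up for this illustration.

```r
# Estimates copied from the summary() output of the model
coefs <- c(Intercept = -5.7216443, Month = -0.1006011, AirTime = 0.5053775,
           DepDelay = 0.9962699, Distance = -0.0643287)

# One example flight; the 1 for the intercept multiplies the constant term
new.flight <- c(Intercept = 1, Month = 1, AirTime = 45, DepDelay = 1, Distance = 224)

# Predicted arrival delay = sum of coefficient * predictor value
predicted.delay <- sum(coefs * new.flight[names(coefs)])
round(predicted.delay, 2)
## [1] 3.51
```

With the fitted model object itself, predict(lm.arrival.delay, newdata = ...) with a data frame of predictor values performs the same calculation.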
#If I run the command names() R shows me which attributes our linear model also includes. There should be one called fitted.values which shows us what the model predicted the values should be.
names(lm.arrival.delay)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "na.action" "xlevels" "call" "terms"
## [13] "model"
#Indeed, there is the fitted.values information.
#Now we can compare the fitted values with the actual values in a scatterplot.
plot(x = lm.arrival.delay$model$ArrDelay, #the observed values of exactly those rows the model was fitted on, so x and y always have the same length
y = lm.arrival.delay$fitted.values,
xlab = "True Arrival Delay in Minutes",
ylab = "Model Arrival Delay in Minutes",
main = "True versus Predicted Arrival Delays on Sundays",
pch = 16, col = "green", bg = gray(0.9,0.1), cex = 0.3
)
#Now I am adding a regression line to the plot by using the function abline. The fitted values are on the y-axis, so they have to be on the left-hand side of the formula:
abline(lm(lm.arrival.delay$fitted.values ~ lm.arrival.delay$model$ArrDelay), col = "red")
Question 3: Because the manager of the Houston Airport has an old friend who is the leading manager at Atlanta Airport, he wants to do him a favor by showing him the actual elapsed time of flights going to Atlanta (ATL) in a histogram. He also adds additional reference lines showing the mean and median.
# TASK 7
hist(hflights$ActualElapsedTime[hflights$Dest == "ATL"], breaks = 100, col = "green", border = "black", main = "Histogram of elapsed time for flights to Atlanta", xlab = "elapsed time of flight, in minutes", ylab = "number of flights")
#Add the lines
#For the mean (red line):
abline(v = mean(hflights$ActualElapsedTime[hflights$Dest == "ATL"], na.rm = T), col = "red", lwd = 2)
#For the median (blue line):
abline(v = median(hflights$ActualElapsedTime[hflights$Dest == "ATL"], na.rm = T), col = "blue", lwd = 2)
As you can see from the histogram most of the flights have an elapsed time of about 115 minutes and the mean is at 120.05 minutes and the median at 118 minutes.
Question 4: The manager of the Houston Airport wants an easy way for passengers to see the reason for their cancellation when they type in the letter of the cancellation reason.
#TASK 10
#I will create a new function that prints out a sentence that tells the passenger why his/her flight was cancelled.
flight.cancellation.reason <- function(x) {
output <- "Unknown cancellation code." #Fallback so that the function also returns something for other inputs
if(x == "A") {output <- "Unfortunately your flight was cancelled because of the carrier."}
if(x == "B") {output <- "Due to bad weather we can't fly, we apologize for that but safety comes first."}
if(x == "C") {output <- "The national air system doesn't allow us to fly, we apologize for cancelling your flight."}
if(x == "D") {output <- "Due to security reasons we aren't able to fly so your flight was cancelled. We apologize for that but your safety and life is more important to us."}
return(output)}
#Test of the function
flight.cancellation.reason("A")
## [1] "Unfortunately your flight was cancelled because of the carrier."
flight.cancellation.reason("B")
## [1] "Due to bad weather we can't fly, we apologize for that but safety comes first."
flight.cancellation.reason("C")
## [1] "The national air system doesn't allow us to fly, we apologize for cancelling your flight."
flight.cancellation.reason("D")
## [1] "Due to security reasons we aren't able to fly so your flight was cancelled. We apologize for that but your safety and life is more important to us."
#Seems like it works.
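A chain of if statements only handles one letter at a time. As an alternative sketch, a named character vector can serve as a lookup table and translate a whole column of codes at once. The names cancellation.reasons and lookup.cancellation.reason are made up here, and the short reason texts follow the column description above.

```r
# Lookup table: names are the codes, values are the reasons
cancellation.reasons <- c(A = "carrier", B = "weather",
                          C = "national air system", D = "security")

# Hypothetical helper: translates a vector of codes in one step
lookup.cancellation.reason <- function(codes) {
  reasons <- unname(cancellation.reasons[codes]) # unknown codes give NA
  reasons[is.na(reasons)] <- "unknown code"      # replace NA by a readable text
  reasons
}

lookup.cancellation.reason(c("B", "D", "X"))
## [1] "weather"      "security"     "unknown code"
```

Indexing by name is vectorised, so the same call also works on a whole column of codes.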
Question 5: For his weekly presentation the manager of the airport is curious about some relations between the distance of a flight and other variables. First he examines whether the average distance of flights differs according to the day of the week. Next he is interested in whether the distance of a flight is somehow correlated with the taxi out time, because the big flights might get better or worse parking positions at the airport. He wants to present his results in a nice manner.
#Task 8
#For the first relationship the dependent variable is Distance, which is numeric, and the independent variable is DayOfWeek, which is categorical. For this relationship a boxplot is a nice way to show the results.
boxplot(Distance ~ DayOfWeek, data = hflights, main = "Distance according to the day of the week", xlab = "Day of the week", ylab = "Distance in miles", col="gold")
The boxplot shows him that there is no obvious difference in the distance between the weekdays. Probably the airport is large enough to offer the same flights every day.
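A boxplot shows medians and quartiles, while the question asks about the average distance. A minimal sketch of how the mean per weekday could be computed with tapply() follows; it uses a made-up toy data frame (toy.flights is a hypothetical stand-in for hflights) so that it runs on its own.

```r
# Toy data standing in for hflights (values are made up for illustration)
toy.flights <- data.frame(DayOfWeek = c(1, 1, 2, 2, 7, 7),
                          Distance  = c(200, 400, 300, 500, 1000, 600))

# tapply() splits Distance by DayOfWeek and applies mean() to every group
mean.distance <- tapply(toy.flights$Distance, toy.flights$DayOfWeek, mean)
mean.distance
##   1   2   7 
## 300 400 800
```

With the real data the call would be tapply(hflights$Distance, hflights$DayOfWeek, mean).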
Relation between Distance of the flight and Taxi out time.
#For the next relationship both variables (Distance and TaxiOut) are numeric, so I will show a correlation between the two.
cor(x= hflights$Distance, y= hflights$TaxiOut, use = "na.or.complete", method = "pearson" )
## [1] 0.1582346
The correlation is 0.1582346, but of course we also want to know whether this is significant or not. For that I am using the cor.test function.
#Task 4
cor.test(x= hflights$Distance, y=hflights$TaxiOut , alternative = "two.sided", method = "pearson")
##
## Pearson's product-moment correlation
##
## data: hflights$Distance and hflights$TaxiOut
## t = 75.938, df = 224550, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1541994 0.1622646
## sample estimates:
## cor
## 0.1582346
According to the correlation test the correlation is significant, with t(224550) = 75.938 and a p-value < 2.2e-16. This means that longer flights tend to have somewhat longer taxi out times, i.e. the plane needs more time to get from the gate to takeoff. With r = 0.16, however, the relationship is weak. For his presentation the manager also shows the correlation in a plot:
plot(x = hflights$Distance, y=hflights$TaxiOut, type = "p", main = "Relationship between Distance and Taxi time", xlab = "Distance of the flight in miles", ylab = "Taxi time in minutes")
abline(lm(hflights$TaxiOut ~ hflights$Distance ), col = "red")
Question 6: The manager of the Atlanta Airport was so happy about the histogram that he would like to get some more descriptive statistics about flights from Houston to Atlanta. He is particularly interested in the average delay time for flights from Houston to Atlanta in comparison to other destinations. Of course our manager is happy to help him.
#TASK 9
#For that task I am using the aggregate function because I want to see the mean total delay (departure delay plus arrival delay) for every destination. The formula is passed unnamed as the first argument.
mean.delay.dest <- aggregate(DepDelay + ArrDelay ~ Dest, data = hflights, FUN = mean, na.rm = TRUE)
#To get information for Atlanta I am using indexing:
mean.delay.dest$'DepDelay + ArrDelay'[mean.delay.dest$Dest== "ATL"]
## [1] 18.37644
The mean total delay (departure delay plus arrival delay) for a flight from Houston to Atlanta is 18.38 minutes. To see what this means in comparison to other destinations I am using the summary function to get the maximum and minimum of all the average delay times:
summary(mean.delay.dest)
## Dest DepDelay + ArrDelay
## Length:116 Min. :-20.09
## Class :character 1st Qu.: 12.55
## Mode :character Median : 15.59
## Mean : 15.94
## 3rd Qu.: 19.73
## Max. : 51.26
In comparison to other destinations Atlanta is a bit above the average which is 15.94 minutes and in the “worse” half of all destinations because the median is at 15.59 minutes. But what is the worst destination to go to from Houston?
#Here I am using the function which.max to get the index of the greatest average delay.
which.max(mean.delay.dest$`DepDelay + ArrDelay`)
## [1] 5
#It returns the index 5 so let's use indexing to get the Destination of the fifth row:
mean.delay.dest$Dest[5]
## [1] "ANC"
The highest average total delay, 51.26 minutes, is for flights from Houston to “ANC”, which is Anchorage in Alaska. This might be because of snow or bad weather…
Question 7: Although the Houston Airport has such a skilled leading manager, passengers don’t like it that much, especially because of the long taxi in and out times. Because every airport worker worries a lot about losing his or her job, they need a good advertisement campaign to attract more passengers. Our manager has the brilliant idea to cheat a little bit on the taxi data and subtracts 5 minutes from every taxi in and out time if it is larger than 5 minutes.
#TASK 1
#I am recoding the values of the taxi in time by subtracting 5 from every value.
hflights$TaxiIn.new <- hflights$TaxiIn - 5
#To make it less obvious that he cheated I now have to index all values below 1 (zero and negative numbers) and set them to 1 minute. Note that this also turns original times of 5 minutes or less into 1 minute:
hflights$TaxiIn.new[hflights$TaxiIn.new < 1] <- 1
#Now I am doing the same for the taxi out time:
hflights$TaxiOut.new <- hflights$TaxiOut - 5
hflights$TaxiOut.new[hflights$TaxiOut.new < 1] <- 1
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## TaxiIn.new TaxiOut.new
## 5424 2 8
## 5425 1 4
## 5426 1 12
## 5427 4 17
## 5428 4 4
## 5429 1 8
It worked and we have two new columns that they can use for their advertisement campaign.
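The question describes a slightly different rule than the recoding used here: subtract 5 minutes only if the time is larger than 5 minutes and leave all other times untouched. As a sketch on a small made-up vector (taxi.in is a hypothetical example, not a column of the dataset), the two rules compare like this:

```r
# Made-up taxi in times, including a missing value
taxi.in <- c(3, 5, 6, 9, NA)

# Rule from the question: subtract 5 only where the time is larger than 5
taxi.in.strict <- ifelse(taxi.in > 5, taxi.in - 5, taxi.in)
taxi.in.strict
## [1]  3  5  1  4 NA

# Rule used above: subtract 5 everywhere, then raise everything below 1 to 1
taxi.in.floor <- pmax(taxi.in - 5, 1)
taxi.in.floor
## [1]  1  1  1  4 NA
```

The two versions only differ for original times of 5 minutes or less, which the second rule shortens to 1 minute.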
Question 8: To make his big monthly presentations a bit easier, our manager looks for a quick way to see how many flights per month depart from Houston Airport (HOU) and how many from George Bush Intercontinental Airport (IAH) (the second large airport close to Houston). To make sure he follows the holy rule D.R.Y. (“Don’t repeat yourself.”), he decides to create a loop so that he does not have to type in the same stuff for every month.
for (i in 1:12) { #The loop repeats its calculation for every month from 1 (January) to 12 (December).
monthly.table <- table(hflights$Origin[hflights$Month == i]) #table() counts the flights per origin airport for month i.
print(monthly.table) #Inside a loop the result is only shown if I call print().
}
##
## HOU IAH
## 4270 14640
##
## HOU IAH
## 3884 13244
##
## HOU IAH
## 4544 14926
##
## HOU IAH
## 4420 14173
##
## HOU IAH
## 4533 14639
##
## HOU IAH
## 4499 15101
##
## HOU IAH
## 4519 16029
##
## HOU IAH
## 4505 15671
##
## HOU IAH
## 4186 13879
##
## HOU IAH
## 4405 14291
##
## HOU IAH
## 4212 13809
##
## HOU IAH
## 4322 14795
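The loop prints twelve separate tables without month labels. As an alternative sketch, table() can cross-tabulate two variables in a single call, giving one table with a row per month. The toy data frame below is made up (toy.flights is a hypothetical stand-in for hflights) so that the example runs on its own.

```r
# Toy data standing in for hflights (values are made up for illustration)
toy.flights <- data.frame(Month  = c(1, 1, 1, 2, 2, 3),
                          Origin = c("HOU", "IAH", "IAH", "HOU", "IAH", "IAH"))

# Rows are months, columns are origin airports; dnn sets the dimension labels
flights.per.month <- table(toy.flights$Month, toy.flights$Origin,
                           dnn = c("Month", "Origin"))
flights.per.month
```

With the real data the call would be table(hflights$Month, hflights$Origin), which avoids the loop entirely.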
Question 9: To end his convincing statement for using R more often, the manager of the Houston Airport wants to show his colleagues and passengers a beautiful scatterplot with two groups in different colours: he compares flights starting at Houston Airport (red dots) and flights starting at George Bush Intercontinental Airport (green squares) with respect to the distance of a flight and the taxi in time.
#TASK 6
# First I am subsetting the data
flights.hou <- subset(hflights, Origin == "HOU")
flights.iah <- subset(hflights, Origin == "IAH")
# Now I am creating a blank plot
plot(x = 1,
xlab = "Distance in miles",
ylab = "Taxi in time in minutes",
type = "n",
main = "Taxi In time by Distance and Origin",
xlim = c(0, 2000),
ylim = c(0, 40))
#I am typing type = n here because this means no plotting -> we want to use low level plotting afterwards to set our dots and squares
# Now I am adding red dots for flights from Houston Airport:
points(x = (flights.hou$Distance),
y = (flights.hou$TaxiIn),
pch = 16,
col = "red")
# Here I am adding green squares for flights from George Bush Intercontinental Airport:
points(x = flights.iah$Distance,
y = flights.iah$TaxiIn,
pch = 22,
col = "lawngreen")
#Finally I am adding two regression lines, a black one for HOU and a blue one for IAH:
abline(lm(flights.hou$TaxiIn ~ flights.hou$Distance), col = "black")
abline(lm(flights.iah$TaxiIn ~ flights.iah$Distance), col = "blue")
According to the plot, there are more long-distance flights from IAH, and the regression lines show that the taxi in time is slightly higher there as well. On the other hand, at IAH it does not change much with the distance of the flight, whereas at HOU it does.
First, we can see that in terms of delay it is better to fly in winter than in summer, because the departure delay in summer is on average significantly higher than in winter, although the difference is less than two minutes. Moreover, it is not recommendable to fly from Houston to Anchorage in Alaska because it has the highest of all average delay times (departure plus arrival delay) with 51.26 minutes.
A good model to predict the arrival delay of your flight is a linear model using the month, time in the air, departure delay and distance as predictors. At least the model works pretty well on Sundays.
According to our boxplots there is no obvious difference in the distances flown between the days of the week.
The correlation test showed us that there is a significant but weak positive relation (r = 0.16) between the distance of a flight and the taxi out time in minutes. This could mean that long-distance flights tend to be parked further away from the runway, because the plane needs more time to taxi from the gate to takeoff.
If you want to fly from Houston to Atlanta you have to expect on average a total delay (departure plus arrival delay) of 18.38 minutes, which is a bit worse than the average over all destinations from Houston (15.94 minutes). Most of the Houston - Atlanta flights have an elapsed time of about 115 minutes, as you can see from the histogram.
The long-distance flights tend to have longer taxi in times and start more often from George Bush Intercontinental Airport.