Section 1: Dataset Description

Flights <- read.table(file = "http://nathanieldphillips.com/wp-content/uploads/2015/04/Flights.txt",
                       header = T,
                       sep = "\t", 
                       stringsAsFactors = F 
                       )

Question 1: How did you obtain the dataset?

Question 2: How were the data originally collected?

Question 3: How many rows and columns are in the dataset?

nrow(Flights)
## [1] 227496
ncol(Flights)
## [1] 14

Question 4: What are the columns in the dataset? For each column, give the variable name and a brief description of what it represents.

head(Flights)
##                  date hour minute  dep  arr dep_delay arr_delay carrier
## 1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA
## 2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA
## 3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA
## 4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA
## 5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA
## 6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA
##   flight dest  plane cancelled time dist
## 1    428  DFW N576AA         0   40  224
## 2    428  DFW N557AA         0   45  224
## 3    428  DFW N541AA         0   48  224
## 4    428  DFW N403AA         0   39  224
## 5    428  DFW N492AA         0   44  224
## 6    428  DFW N262AA         0   45  224
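
One way to approach the column question systematically is to list each column's name next to its storage type. The following is only a minimal sketch on a hypothetical three-row stand-in for `Flights`: the values are copied from the `head()` output above, while the object names `toy` and `column.overview` are assumptions made for illustration.

```r
# Hypothetical stand-in with a few of the columns shown by head(Flights)
toy <- data.frame(
  date      = c("2011-01-01 12:00:00", "2011-01-02 12:00:00", "2011-01-03 12:00:00"),
  hour      = c(14L, 14L, 13L),
  dep_delay = c(0L, 1L, -8L),
  carrier   = c("AA", "AA", "AA"),
  stringsAsFactors = FALSE
)

# Name and class of every column, one row per variable
column.overview <- data.frame(
  variable  = names(toy),
  class     = sapply(toy, class),
  row.names = NULL,
  stringsAsFactors = FALSE
)
column.overview
```

Applied to the full data frame, the same two calls (`names()` and `sapply(..., class)`) would give a starting table to which the verbal descriptions can be added.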

Section 2: Questions & Analyses

Question 1: What are the mean, the median and the standard deviation of the flight time for flights with a distance of more than 600 miles?
Find missing values in the dataset and recode them as NA.

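# read.table() already codes missing entries as NA; is.na() flags them cell by cell
# (is.null() would only test whether the whole object is NULL)
missing.per.column <- colSums(is.na(Flights))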

flight600 <- subset(x= Flights,
                    subset = (Flights$dist > 600),
                    select = c("time")
                    )
summary(flight600$time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    73.0   110.0   129.0   140.5   166.0   549.0    1811
mean(flight600$time, na.rm =T)
## [1] 140.5317
median(flight600$time, na.rm =T)
## [1] 129
sd(flight600$time, na.rm =T)
## [1] 44.33558
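
If a dataset marks missing values with sentinel codes instead of `NA`, they have to be recoded before summary statistics are computed. The sketch below assumes two hypothetical codes, `-999` for numbers and an empty string for text; neither is taken from the Flights data, and the data frame `toy` is invented for illustration.

```r
# Hypothetical data with sentinel codes for missing values (assumed, not from Flights)
toy <- data.frame(
  time = c(40, -999, 48, 39),
  dest = c("DFW", "", "DFW", "DFW"),
  stringsAsFactors = FALSE
)

# Recode each sentinel to NA, column by column
toy$time[toy$time == -999] <- NA
toy$dest[toy$dest == ""]   <- NA

# Count the missing values per column
colSums(is.na(toy))

# Statistics then need na.rm = TRUE, as in the analysis above
mean(toy$time, na.rm = TRUE)
```

Recoding column by column keeps a numeric sentinel from accidentally matching values in unrelated columns.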

Question 2: What are the mean, the standard deviation and the maximum of the arrival delay for each carrier?
Use dplyr to calculate the descriptive statistics of ‘arrival delay’ grouped by ‘carrier’.

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Flights %>%
  group_by(carrier) %>% 
  summarise(a.mean = mean(arr_delay, na.rm = T),  
            b.sd = sd(arr_delay, na.rm = T), 
            c.max = max(arr_delay, na.rm = T) 
            )
## Source: local data frame [15 x 4]
## 
##    carrier     a.mean     b.sd c.max
## 1       AA  0.8917558 37.39939   978
## 2       AS  3.1923077 25.45696   183
## 3       B6  9.8588410 47.64176   335
## 4       CO  6.0986983 28.38512   957
## 5       DL  6.0841374 41.44595   701
## 6       EV  7.2569543 43.26771   469
## 7       F9  7.6682692 24.49275   277
## 8       FL  1.8536239 33.74713   500
## 9       MQ  7.1529751 47.01261   918
## 10      OO  8.6934922 30.40658   380
## 11      UA 10.4628628 47.72488   861
## 12      US -0.6307692 25.20307   433
## 13      WN  7.5871430 30.54575   499
## 14      XE  8.1865242 29.81871   634
## 15      YV  4.0128205 18.82972    72
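
The same kind of grouped summary can also be sketched in base R with `aggregate()`, which may be handy when dplyr is not available. The example runs on a small hypothetical data frame (`toy`, invented for illustration) and additionally reports the group size, since a mean based on few flights is less stable.

```r
# Hypothetical stand-in for Flights with only the two columns needed here
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "AS", "AS", "AS"),
  arr_delay = c(-10, 5, NA, 3, 9, 0),
  stringsAsFactors = FALSE
)

# The formula interface drops rows with NA before applying FUN to each carrier
stats <- aggregate(
  arr_delay ~ carrier,
  data = toy,
  FUN  = function(x) c(mean = mean(x), sd = sd(x), max = max(x), n = length(x))
)

# aggregate() returns the statistics as one matrix column; flatten it into separate columns
stats <- do.call(data.frame, stats)
stats
```

The flattened columns are named `arr_delay.mean`, `arr_delay.sd`, `arr_delay.max` and `arr_delay.n`.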

Question 3: Does the flight time correlate with the distance? Does the distance predict the flight time (regression)?
Plot the results in a scatterplot with an added regression line.
In addition, plot the regressions of flight time on distance, arrival delay and departure delay next to each other.

test.result <- cor.test(x= Flights$time, 
         y= Flights$dist)
test.result
## 
##  Pearson's product-moment correlation
## 
## data:  Flights$time and Flights$dist
## t = 2676.3, df = 223870, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9846032 0.9848543
## sample estimates:
##       cor 
## 0.9847293
dist.lm <- lm(time ~ dist,
              data = Flights)
dist.lm
## 
## Call:
## lm(formula = time ~ dist, data = Flights)
## 
## Coefficients:
## (Intercept)         dist  
##     11.2503       0.1227
dep_delay.lm <- lm(time ~ dep_delay,
                   data = Flights)

arr_delay.lm <- lm(time ~ arr_delay,
                   data = Flights)
dep_delay.lm
## 
## Call:
## lm(formula = time ~ dep_delay, data = Flights)
## 
## Coefficients:
## (Intercept)    dep_delay  
##   107.63291      0.05411
arr_delay.lm
## 
## Call:
## lm(formula = time ~ arr_delay, data = Flights)
## 
## Coefficients:
## (Intercept)    arr_delay  
##   107.85684      0.04024
par(mfrow = c(1,3))

plot(x= Flights$dist, 
     y= Flights$time,
     main = "Time and Distance",
     xlab = "Distance in miles",
     ylab = "Time in minutes",
     xlim = c(0, 4000),  
     ylim = c(0, 600), 
     col = "pink3", 
     pch = 8, 
     cex = 0.3
     )
abline(lm(Flights$time ~ Flights$dist), col = "black", lty= 1)

plot(x= Flights$arr_delay, 
     y= Flights$time,
     main = "Time and Arrival Delay",
     xlab = "Arrival Delay in Minutes",
     ylab = "Time in minutes",
     xlim = c(-70, 1000),  
     ylim = c(0, 600), 
     col = "lightcoral", 
     pch = 4, 
     cex = 0.2
     )
abline(lm(Flights$time ~ Flights$arr_delay), col = "black", lty= 1)

plot(x= Flights$dep_delay, 
     y= Flights$time,
     main = "Time and Departure Delay",
     xlab = "Departure Delay in Minutes",
     ylab = "Time in minutes",
     xlim = c(-50, 1000),  
     ylim = c(0, 600), 
     col = "violetred4", 
     pch = 4, 
     cex = 0.2
     )
abline(lm(Flights$time ~ Flights$dep_delay), col = "black", lty= 1)
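
The fitted coefficients of the distance model can be read as a prediction rule: predicted flight time = intercept + slope × distance. The sketch below uses only the numbers printed above; the helper name `predicted_time` is hypothetical and not part of the original analysis.

```r
# Coefficients as printed for lm(time ~ dist) above
intercept <- 11.2503
slope     <- 0.1227

# Hypothetical helper: predicted flight time in minutes for a distance in miles
predicted_time <- function(dist) intercept + slope * dist

predicted_time(224)   # the 224-mile route to DFW shown in head(Flights)
## [1] 38.7351

# In a simple regression, the squared correlation is the share of variance explained
r <- 0.9847293
r^2
```

The prediction of roughly 39 minutes sits at the lower end of the 39 to 48 minutes observed in the first six rows, and the squared correlation of about 0.97 indicates that distance accounts for nearly all of the variation in flight time.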

Question 4: Create a histogram that shows how many flights departed at each hour, with the mean and the median marked.

hist(x= Flights$hour,
      main="flights per hour over one year", 
      breaks= seq(0,24, by=1),
      right= FALSE,  # bins are [h, h+1), so hours 0 and 1 get separate bars
      xlab= "hour", 
      ylab= "number of flights",
      xlim= c(0,24),
      ylim= c(0, 20000),
      col= "skyblue2",
      border= "deepskyblue4"
      )

abline(v= median(Flights$hour, na.rm=T), lwd = 3, col = "royalblue4", lty= 1)
text(x= 12, y= 20000, labels= "median", col= "royalblue4", font = 2)
abline(v= mean(Flights$hour, na.rm=T), lwd = 2, col = "violetred2", lty= 6)
text(x= 12, y= 18500, labels= "mean", col= "violetred2", font = 2)

median(Flights$hour, na.rm=T)
## [1] 14
mean(Flights$hour, na.rm=T)
## [1] 13.66337
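
When the hours are whole numbers and the breaks are whole numbers too, it matters which side of each bin is closed. By default `hist()` uses right-closed bins, so an hour `h` is counted in the bar ending at `h`. A minimal sketch with hypothetical departure hours (the vector `hours` is invented for illustration):

```r
hours  <- c(0, 1, 1, 2, 23)          # hypothetical departure hours
breaks <- seq(0, 24, by = 1)

# Default (right = TRUE): bins are (h-1, h], so hours 0 and 1 share the first bin
counts.right <- hist(hours, breaks = breaks, plot = FALSE)$counts

# right = FALSE: bins are [h, h+1), so bin k holds hour k-1
counts.left <- hist(hours, breaks = breaks, right = FALSE, plot = FALSE)$counts

counts.right[1:3]
## [1] 3 1 0
counts.left[1:3]
## [1] 1 2 1

# table() counts each distinct hour directly, without any binning
table(hours)
```

For integer-valued data, `table()` or `barplot(table(...))` sidesteps the question of bin edges entirely.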

Question 5: Do the departure delays of the two carriers “AA” and “AS” differ significantly from each other?
Conduct a t-test.
Create a custom function that reports the average departure delay of the carriers “AA” and “AS”. To get information about the carriers WN and CO, create a scatterplot that shows the arrival delay against the distance for these two carriers.

AA.delay <- subset(Flights, subset = carrier == "AA")$dep_delay
AS.delay <- subset(Flights, subset = carrier == "AS")$dep_delay
test.result <- t.test(x = AA.delay, 
                      y = AS.delay,
                      alternative = "two.sided")
test.result
## 
##  Welch Two Sample t-test
## 
## data:  AA.delay and AS.delay
## t = 2.1717, df = 652.24, p-value = 0.03024
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2566127 5.0990185
## sample estimates:
## mean of x mean of y 
##  6.390144  3.712329
AA.dep_delay <- subset(x = Flights$dep_delay, 
                       subset = (Flights$carrier == "AA"))
# Build a sentence reporting the average departure delay b (in minutes) of carrier a
madlib <- function(a, b) {
  output <- paste("If you take the carrier", a, "the departure delay will be", b, "minutes on average.")
  return(output)
}
madlib("AA", mean(AA.dep_delay, na.rm=T))
## [1] "If you take the carrier AA the departure delay will be 6.39014438166981 minutes on average."
data2 <- subset(x = Flights$dep_delay, 
               subset = (Flights$carrier == "AS"))
madlib("AS", mean(data2, na.rm=T))
## [1] "If you take the carrier AS the departure delay will be 3.71232876712329 minutes on average."
WN.data <- subset(Flights, carrier == "WN" ) 
CO.data <- subset(Flights, carrier == "CO" )


plot(x = WN.data$dist, 
     y = WN.data$arr_delay, 
     col = "darkolivegreen", 
     pch = 16,
     type = "p",
     cex = 0.3,
     xlab = "distance in miles",
     ylab = "arrival delay in minutes",
     main = "distance and arrival delay of flights"
)

points(x = CO.data$dist, 
       y = CO.data$arr_delay,
       pch = 16, 
     col = "darkolivegreen3",
     cex = 0.2)

legend("topright", 
       legend = c("carrier WN", "carrier CO"), 
       pch = c(16, 16),
       col = c("darkolivegreen", "darkolivegreen3")
       )
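
The sentence-building function could also look up the delays itself, so that only the carrier code has to be passed in. This is a sketch under stated assumptions: the helper name `delay_sentence`, its `digits` argument and the data frame `toy` are hypothetical and only illustrate the idea.

```r
# Hypothetical stand-in for Flights with only the two columns needed here
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "AS", "AS"),
  dep_delay = c(0, 12, NA, 3, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: select the carrier's delays, average them, build the sentence
delay_sentence <- function(df, code, digits = 1) {
  delays <- df$dep_delay[df$carrier == code]
  avg    <- round(mean(delays, na.rm = TRUE), digits)
  paste("If you take the carrier", code,
        "the departure delay will be", avg, "minutes on average.")
}

delay_sentence(toy, "AA")
## [1] "If you take the carrier AA the departure delay will be 6 minutes on average."

# One call per carrier code
sapply(c("AA", "AS"), delay_sentence, df = toy)
```

Rounding inside the function keeps the sentence readable, and passing the data frame as an argument avoids relying on a global object.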


Section 3: Conclusion

Our analysis showed that flights covering more than 600 miles have a mean flight time of about 140.5 minutes, a median of 129 minutes and a standard deviation of about 44 minutes. Furthermore, we computed descriptive statistics of the arrival delay for each carrier. These can inform future choices about which carrier to take.

We were interested in the relationship between distance and flight time. Longer flights should logically take longer, although the time needed for take-off and for the landing approach also affects the relationship. We found a nearly perfect positive correlation between flight time and distance (r = 0.98), and the regression estimates about 0.12 additional minutes of flight time per mile. We also regressed flight time on arrival delay and on departure delay; both slopes were small (about 0.04 and 0.05 minutes of flight time per minute of delay).

To show how many flights departed at each hour of the day, we created a histogram. It shows the number of flights per departure hour over the whole of 2011. Most flights departed during the day, presumably because night flights are restricted at many airports. The median departure hour was 14 and the mean was 13.66, which corresponds to roughly 13:40.

In our last analysis we wanted to see whether the carriers AA and AS have different departure delays. The difference is statistically significant (Welch t-test, t = 2.17, p = 0.03), with AS having the lower mean departure delay (3.7 versus 6.4 minutes). We additionally looked at two other carriers, WN and CO, and plotted their arrival delays against distance.