Flights <- read.table(file = "http://nathanieldphillips.com/wp-content/uploads/2015/04/Flights.txt",
                       header = T,
                       sep = "\t", 
                       stringsAsFactors = F 
                       )

SECTION 1: DATASET DESCRIPTION

1. How did you obtain the dataset? We obtained the dataset from Dr. Nathaniel D. Phillips's website, http://nathanieldphillips.com/r-course/. The dataset Flights contains data on all flights leaving the Houston airport in 2011.

2. How were the data originally collected? The data come from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics and were collected by the Office of Airline Information, Bureau of Transportation Statistics.

3. How many rows and columns are in the dataset?

nrow(Flights)
## [1] 227496
ncol(Flights)
## [1] 14

The dataset contains 227496 rows and 14 columns.

4. What are the columns in the dataset?

head(Flights)
##                  date hour minute  dep  arr dep_delay arr_delay carrier
## 1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA
## 2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA
## 3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA
## 4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA
## 5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA
## 6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA
##   flight dest  plane cancelled time dist
## 1    428  DFW N576AA         0   40  224
## 2    428  DFW N557AA         0   45  224
## 3    428  DFW N541AA         0   48  224
## 4    428  DFW N403AA         0   39  224
## 5    428  DFW N492AA         0   44  224
## 6    428  DFW N262AA         0   45  224

SECTION 2: QUESTIONS

  1. Find missing values in the dataset and recode them into NA. What was the mean, median and the standard deviation of the flight time for flights with a distance of more than 600 miles?

  2. What is the mean, the standard deviation and the maximum of the arrival delay for each carrier? Use dplyr to calculate descriptive statistics across ‘carrier’ and ‘arrival delay’.

  3. Does the flight time correlate with the distance? Does the distance predict the flight time (regression)? Plot the results in a scatterplot and add the regression lines. In addition, plot the regression graphs for distance, departure delay and arrival delay and put the plots next to each other.

  4. Create a histogram that shows how many flights departed each hour over the whole year.

  5. Are the departure delay times of the two carriers AA and AS significantly different? Conduct a t-test. Create a custom function that tells you what the departure delay of the carrier AA is. To get information about the carriers WN and CO, create a scatterplot that tells you the arrival delay and the distance of these two carriers.

SECTION 3: ANALYSES

1. Find missing values in the dataset and recode them into NA. What was the mean, median and the standard deviation of the flight time for flights with a distance of more than 600 miles?

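# is.null() on a data frame is a single FALSE, so it flags nothing; recode empty strings instead.
Flights[!is.na(Flights) & Flights == ""] <- NA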

flight600 <- subset(x = Flights,
                    subset = (dist > 600),
                    select = c("time")
                    )


mean(flight600$time, 
     na.rm = T
     )
## [1] 140.5317
sd(flight600$time, 
   na.rm = T
   )
## [1] 44.33558
median(flight600$time, 
       na.rm = T
       )
## [1] 129
summary(flight600$time, 
        na.rm = T
        )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    73.0   110.0   129.0   140.5   166.0   549.0    1811

The mean flight time for flights with a distance of more than 600 miles is about 140.5 minutes, the median is 129 minutes and the standard deviation is about 44 minutes. For 1811 of these flights the flight time is missing.
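As a minimal sketch of how missing values can be found and counted, the following uses a small toy data frame (the object `toy` and its values are hypothetical, only the column names follow Flights). It illustrates why `is.null()` cannot flag missing cells and how `colSums(is.na())` counts them per column.

```r
# Toy data frame; values are made up for illustration only.
toy <- data.frame(time  = c(40, NA, 48),
                  dist  = c(224, 700, 224),
                  plane = c("N576AA", "", "N541AA"),
                  stringsAsFactors = FALSE)

# is.null() tests whether the whole object is NULL, so it returns one FALSE.
whole.object.null <- is.null(toy)

# Recode empty strings as NA, leaving cells that are already NA untouched.
toy[!is.na(toy) & toy == ""] <- NA

# Count the missing values in every column.
na.counts <- colSums(is.na(toy))
na.counts
```

Applied to the real data, `colSums(is.na(Flights))` would give the same kind of per-column count.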

2. What is the mean, the standard deviation and the maximum of the arrival delay for each carrier? Use dplyr to calculate descriptive statistics across ‘carrier’ and ‘arrival delay’.

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Flights %>% 
  group_by(carrier) %>%
  summarise(
    a.mean = mean(arr_delay, na.rm = T),
    b.sd = sd(arr_delay, na.rm = T),
    c.max = max(arr_delay, na.rm = T)
    )
## Source: local data frame [15 x 4]
## 
##    carrier     a.mean     b.sd c.max
## 1       AA  0.8917558 37.39939   978
## 2       AS  3.1923077 25.45696   183
## 3       B6  9.8588410 47.64176   335
## 4       CO  6.0986983 28.38512   957
## 5       DL  6.0841374 41.44595   701
## 6       EV  7.2569543 43.26771   469
## 7       F9  7.6682692 24.49275   277
## 8       FL  1.8536239 33.74713   500
## 9       MQ  7.1529751 47.01261   918
## 10      OO  8.6934922 30.40658   380
## 11      UA 10.4628628 47.72488   861
## 12      US -0.6307692 25.20307   433
## 13      WN  7.5871430 30.54575   499
## 14      XE  8.1865242 29.81871   634
## 15      YV  4.0128205 18.82972    72

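The table lists the mean, the standard deviation and the maximum of the arrival delay (in minutes) for each carrier. UA has the highest mean arrival delay (10.46 minutes) and US the lowest (-0.63 minutes, i.e. slightly early on average). The largest single arrival delay, 978 minutes, belongs to AA.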
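A grouped summary like the one above can be cross-checked without dplyr. The sketch below uses base R on a toy data frame (the object `toy` and its values are hypothetical) and is only meant to show the split-and-summarise idea, not to replace the dplyr solution the question asks for.

```r
# Toy data; values are made up for illustration only.
toy <- data.frame(carrier   = c("AA", "AA", "AS", "AS", "AS"),
                  arr_delay = c(-10, 30, 5, NA, 15),
                  stringsAsFactors = FALSE)

# Split the delays by carrier, then compute the three statistics per group.
by.carrier <- split(toy$arr_delay, toy$carrier)
carrier.stats <- t(sapply(by.carrier, function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
}))
carrier.stats
```

The result is a matrix with one row per carrier and the columns `mean`, `sd` and `max`.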

3. Does the flight time correlate with the distance? Does the distance predict the flight time (regression)? Plot the results in a scatterplot and add the regression lines. In addition, plot the regression graphs for distance, departure delay and arrival delay and put the plots next to each other.

test.result <- cor.test(x = Flights$time, 
                        y = Flights$dist
                        )
test.result
## 
##  Pearson's product-moment correlation
## 
## data:  Flights$time and Flights$dist
## t = 2676.311, df = 223872, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9846032 0.9848543
## sample estimates:
##       cor 
## 0.9847293

Yes, flight time is strongly positively correlated with distance, r = .98, t(223872) = 2676.31, p < .001. A near-perfect correlation is expected for these two variables, because longer routes take longer to fly.
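To make the size of this correlation more concrete, the following sketch squares the reported coefficient to get the share of variance in flight time that is linearly associated with distance, and recomputes the t statistic from r and the degrees of freedom. The numbers are copied from the cor.test() output above; the object names are hypothetical.

```r
# Values copied from the cor.test() output above.
r  <- 0.9847293
df <- 223872

# Squared correlation: share of variance shared by time and distance.
r.squared <- r^2
round(r.squared, 4)   # 0.9697

# t statistic of a Pearson correlation: r * sqrt(df) / sqrt(1 - r^2).
# This should come out close to the reported t = 2676.311.
t.value <- r * sqrt(df) / sqrt(1 - r^2)
t.value
```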

dist.lm <- lm(time ~ dist ,
                 data = Flights
                  )
dist.lm
## 
## Call:
## lm(formula = time ~ dist, data = Flights)
## 
## Coefficients:
## (Intercept)         dist  
##     11.2503       0.1227

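Yes, distance predicts flight time. The fitted regression equation is time = 0.1227 * dist + 11.2503, so each additional mile adds about 0.12 minutes of flight time on top of an intercept of about 11 minutes.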
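As a worked example of the regression equation, the sketch below plugs a distance into the fitted line. The coefficients are copied from the lm() output above; the helper name `time.from.dist` is hypothetical. The 224-mile route to DFW from the first rows of the dataset serves as the test case.

```r
# Coefficients copied from the lm(time ~ dist) output above.
time.from.dist <- function(dist.miles, intercept = 11.2503, slope = 0.1227) {
  intercept + slope * dist.miles
}

# Predicted flight time in minutes for the 224-mile route to DFW.
dfw.time <- time.from.dist(224)
dfw.time   # about 38.7 minutes
```

The first six DFW flights shown by head(Flights) took between 39 and 48 minutes, so the prediction lies just below the observed values. With the fitted model object, `predict(dist.lm, newdata = data.frame(dist = 224))` would give the corresponding prediction without copying coefficients by hand.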

depdelay.lm <- lm(time ~ dep_delay ,
                 data = Flights
                  )
depdelay.lm
## 
## Call:
## lm(formula = time ~ dep_delay, data = Flights)
## 
## Coefficients:
## (Intercept)    dep_delay  
##   107.63291      0.05411
arrdelay.lm <- lm(time ~ arr_delay ,
                 data = Flights
                  )
arrdelay.lm
## 
## Call:
## lm(formula = time ~ arr_delay, data = Flights)
## 
## Coefficients:
## (Intercept)    arr_delay  
##   107.85684      0.04024
#Put the plots next to each other
par(mfrow = c(1, 3))

#First Plot - Distance
plot(x = Flights$dist, 
     y = Flights$time,
     main = "Time and Distance",
     xlab = "Distance in miles",
     ylab = "Time in minutes",
     xlim = c(0, 4000),  
     ylim = c(0, 600), 
     col = "lightcoral",
     pch = 18, 
     type = "p",
     cex = 0.2
     )

abline(lm(Flights$time ~ Flights$dist), 
       col = "black", 
       lty = 1
       )

# Second Plot - dep_delay
plot(x = Flights$dep_delay, 
     y = Flights$time,
     main = "Time and Departure Delay",
     xlab = "Departure delay in minutes",
     ylab = "Time in minutes",
     xlim = c(-33, 1000),  
     ylim = c(0, 600), 
     col = "skyblue",
     pch = 11, 
     type = "p",
     cex = 0.5
     )

abline(lm(Flights$time ~ Flights$dep_delay), 
       col = "black", 
       lty = 1
       )


# Third Plot - arr_delay
plot(x = Flights$arr_delay, 
     y = Flights$time,
     main = "Time and Arrival Delay",
     xlab = "Arrival delay in minutes",
     ylab = "Time in minutes",
     xlim = c(-70, 1000),  
     ylim = c(0, 600), 
     col = "cyan3",
     pch = 11, 
     type = "p",
     cex = 0.5
     )

abline(lm(Flights$time ~ Flights$arr_delay), 
       col = "black", 
       lty = 1
       )

4. Create a histogram that shows how many flights departed each hour over the whole year.

hist(x = Flights$hour,
     main = "Flights per hour over the year", 
     xlab = "Hour",
     ylab = "Number of flights",
     col = "aliceblue",
     border = "black",
     xlim = c(0,24),
     ylim = c(0,20000),
     breaks = seq(0,24, by = 1)
     )

abline(v = median(Flights$hour, na.rm = T), 
       col = "black",
       lwd = 3,
       lty = 1
       )

text(x = 16, 
     y = 20000, 
     labels = "Median",
     lwd = 2)


abline(v = mean(Flights$hour, na.rm = T), 
       col = "indianred4",
       lwd = 3,
       lty = 1
       )

text(x = 12, 
     y = 20000, 
     labels = "Mean",
     lwd = 2,
     col = "indianred4"
     )

median(Flights$hour, 
       na.rm = T
       )
## [1] 14
mean(Flights$hour, 
     na.rm = T
     )
## [1] 13.66337

The histogram shows that departures centre on the early afternoon: the median departure hour is 14 (2 pm). The mean of the hour column is 13.66, which corresponds to about 13:40 (1:40 pm). The histogram also shows that there is less air traffic during the night than during the day.
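One detail worth checking when a histogram is used for whole-hour values is how hist() assigns values that sit exactly on a break. The sketch below uses a toy vector of hours (hypothetical values) to show that the default right-closed bins put hours 0 and 1 into the same bar, and that either `right = FALSE` or a plain frequency table gives one count per hour.

```r
# Toy departure hours; values are made up for illustration only.
hours <- c(0, 1, 1, 14, 14, 14, 23)

# Default bins are right-closed, (0,1], (1,2], ..., and 0 joins the first bin.
h.default <- hist(hours, breaks = seq(0, 24, by = 1), plot = FALSE)
h.default$counts[1]   # hours 0 and 1 are counted together

# Left-closed bins put hour h into [h, h + 1).
h.left <- hist(hours, breaks = seq(0, 24, by = 1), right = FALSE, plot = FALSE)
h.left$counts[1:2]

# A frequency table counts each hour directly, without any binning.
hour.counts <- table(factor(hours, levels = 0:23))
hour.counts[["14"]]
```

Passing `hour.counts` to barplot() would draw one bar per hour.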

5. Are the departure delay times of the two carriers AA and AS significantly different? Conduct a t-test. Create a custom function that tells you what the departure delay of the carrier AA is. To get information about the carriers WN and CO, create a scatterplot that tells you the arrival delay and the distance of these two carriers.

AA.delay <- subset(Flights, subset = carrier== "AA")$dep_delay 

AS.delay <- subset(Flights, subset = carrier == "AS")$dep_delay

mean(AA.delay, na.rm = T)
## [1] 6.390144
mean(AS.delay, na.rm = T)
## [1] 3.712329
test.result <- t.test(x = AA.delay, 
                      y = AS.delay, 
                      alternative = "two.sided" 
                      )

test.result
## 
##  Welch Two Sample t-test
## 
## data:  AA.delay and AS.delay
## t = 2.1717, df = 652.243, p-value = 0.03024
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2566127 5.0990185
## sample estimates:
## mean of x mean of y 
##  6.390144  3.712329

The Welch two-sample t-test shows that the mean departure delays of the two carriers differ significantly, t(652.24) = 2.17, p = .030, 95% CI of the difference [0.26, 5.10] minutes. AA flights were delayed longer on average (6.39 minutes) than AS flights (3.71 minutes).
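The reported values can be pulled directly from the object that t.test() returns instead of being copied by hand. The following sketch does this for two simulated samples (the data and object names are hypothetical and unrelated to the Flights results).

```r
# Simulated delays, for illustration only.
set.seed(1)
x <- rnorm(50, mean = 6, sd = 30)
y <- rnorm(40, mean = 4, sd = 25)

# Welch two-sample t-test, the default of t.test().
res <- t.test(x = x, y = y, alternative = "two.sided")

# Build the report string from the components of the result.
report <- sprintf("t(%.2f) = %.2f, p = %.3f, 95%% CI [%.2f, %.2f]",
                  res$parameter, res$statistic, res$p.value,
                  res$conf.int[1], res$conf.int[2])
report
```

Formatting the interval with sprintf() rounds the bounds rather than truncating them.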

data <- subset(x = Flights$dep_delay, 
               subset = (Flights$carrier == "AA"))
#data
mean(data, na.rm=T)
## [1] 6.390144
# Build the sentence from the mean computed above instead of a hard-coded value.
madlib <- function(a, b) {
  output <- paste("If you take the carrier", a, "the departure delay will be", b, "minutes on average.")
  return(output)
}

madlib("AA", round(mean(data, na.rm = T), 2))
## [1] "If you take the carrier AA the departure delay will be 6.39 minutes on average."
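A custom function can also look up the carrier's delays itself, so that the carrier code is the only thing the caller has to supply. The sketch below is one possible version on a toy data frame; the function name `carrier.delay`, the object `toy` and its values are hypothetical.

```r
# Toy data with the same column names as Flights; values are made up.
toy <- data.frame(carrier   = c("AA", "AA", "AA", "AS"),
                  dep_delay = c(2, 4, NA, 1),
                  stringsAsFactors = FALSE)

# Compute the mean departure delay of one carrier and report it as a sentence.
carrier.delay <- function(df, code) {
  delays <- df$dep_delay[df$carrier == code]
  avg <- round(mean(delays, na.rm = TRUE), 2)
  paste("If you take the carrier", code,
        "the departure delay will be", avg, "minutes on average.")
}

msg <- carrier.delay(toy, "AA")
msg
```

Called as `carrier.delay(Flights, "AA")`, the same function would use the real data.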
WN.data <- subset(Flights, 
                  carrier == "WN" 
                  ) 

CO.data <- subset(Flights, 
                  carrier == "CO" 
                  )

plot(x = WN.data$dist, 
     y = WN.data$arr_delay, 
     col = "darkolivegreen", 
     pch = 16,
     type = "p",
     cex = 0.3,
     xlab = "Distance in miles",
     ylab = "Arrival delay in minutes",
     main = "Distance and arrival delay of flights"
     )

points(x = CO.data$dist, 
       y = CO.data$arr_delay,
       pch = 16, 
       col = "darkolivegreen3",
       cex = 0.2
       )

legend("topright", 
       legend = c("Carrier WN", "Carrier CO"), 
       pch = c(16, 16),
       col = c("darkolivegreen", "darkolivegreen3")
       )

The scatterplot shows arrival delay against distance for the two carriers WN and CO, which differ in both. Most arrival delays appear to be below 100 minutes, with a few outliers.

SECTION 4: CONCLUSION

Our analysis showed that flights with a distance of more than 600 miles take about 140.5 minutes on average. The median is 129 minutes and the standard deviation about 44 minutes. Furthermore, we computed descriptive statistics of the arrival delay for each carrier. These can inform future choices about which carrier to take.

We were interested in the relationship between distance and flight time. Longer flights should logically take longer, although the time needed for take-off and the landing approach adds a component that does not depend on distance. We found a nearly perfect correlation between flight time and distance (r = .98). Additionally, we regressed flight time on departure delay and on arrival delay; the fitted slopes were small (0.054 and 0.040 minutes of flight time per minute of delay).

To show how many flights departed in each hour of the day, we created a histogram. It shows the frequency of flights per hour over the whole year 2011. Most flights departed during the day, possibly because night flights are often restricted. The mean departure hour corresponds to about 13:40 (1:40 pm).

In our last analysis we wanted to see whether the carriers AA and AS have different departure delay times. There is a significant difference between these two carriers, with AS having the shorter mean departure delay (3.71 versus 6.39 minutes). We additionally looked at two other carriers, WN and CO, and plotted their arrival delay times against distance.