Flights <- read.table(file = "http://nathanieldphillips.com/wp-content/uploads/2015/04/Flights.txt",
                      header = TRUE,
                      sep = "\t",
                      stringsAsFactors = FALSE
)
SECTION 1: DATASET DESCRIPTION
1. How did you obtain the dataset? We obtained the dataset from Dr. Nathaniel D. Phillips's website, http://nathanieldphillips.com/r-course/. The dataset Flights contains data on all flights leaving the Houston airport in 2011.
2. How were the data originally collected? The data come from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics and were collected by the Office of Airline Information, Bureau of Transportation Statistics.
3. How many rows and columns are in the dataset?
nrow(Flights)
## [1] 227496
ncol(Flights)
## [1] 14
The dataset contains 227496 rows and 14 columns.
4. What are the columns in the dataset?
head(Flights)
##                  date hour minute  dep  arr dep_delay arr_delay carrier
## 1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA
## 2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA
## 3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA
## 4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA
## 5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA
## 6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA
##   flight dest  plane cancelled time dist
## 1    428  DFW N576AA         0   40  224
## 2    428  DFW N557AA         0   45  224
## 3    428  DFW N541AA         0   39  224
## 4    428  DFW N403AA         0   39  224
## 5    428  DFW N492AA         0   44  224
## 6    428  DFW N262AA         0   45  224
SECTION 2: QUESTIONS
Find missing values in the dataset and recode them into NA. What was the mean, median and the standard deviation of the flight time for flights with a distance of more than 600 miles?
What is the mean, the standard deviation and the maximum of the arrival delay for each carrier? Use dplyr to calculate descriptive statistics across ‘carrier’ and ‘arrival delay’.
Does the flight time correlate with the distance? Does the distance predict the flight time (regression)? Plot the results in a scatterplot and add the regression lines. In addition, plot the regression graphs for distance, departure delay and arrival delay and put the plots next to each other.
Create a histogram that shows how many flights departed each hour over the whole year.
Are the departure delay times of the two carriers AA and AS significantly different? Conduct a t-test. Create a custom function that tells you what the departure delay of the carrier AA is. To get information about the carriers WN and CO, create a scatterplot that tells you the arrival delay and the distance of these two carriers.
SECTION 3: ANALYSES
1. Find missing values in the dataset and recode them into NA. What was the mean, median and the standard deviation of the flight time for flights with a distance of more than 600 miles?
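# is.null(Flights) is a single FALSE for a data frame, so it cannot flag missing cells.
# read.table() already reads "NA" strings and empty numeric fields as NA.
# Assumption: any remaining blank strings also mark missing values, so recode them.
Flights[Flights == ""] <- NA
flight600 <- subset(x = Flights,
subset = (dist > 600),
select = c("time")
)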
mean(flight600$time,
na.rm = T
)
## [1] 140.5317
sd(flight600$time,
na.rm = T
)
## [1] 44.33558
median(flight600$time,
na.rm = T
)
## [1] 129
# summary() ignores na.rm and reports the number of NAs itself
summary(flight600$time)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 73.0 110.0 129.0 140.5 166.0 549.0 1811
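For flights with a distance of more than 600 miles, the mean flight time is about 141 minutes (140.5), the median is 129 minutes and the standard deviation is about 44 minutes. The flight time is missing for 1811 of these flights, which are excluded from the statistics.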
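The recoding and summary steps can be sketched on a small made-up data frame. The object `toy` and its values are hypothetical, and treating blank strings as the missing-value code is an assumption; how Flights.txt actually marks missing entries is not documented in this report.

```r
# Hypothetical toy data: one blank plane id and one missing flight time
toy <- data.frame(
  plane = c("N576AA", "", "N541AA", "N403AA"),
  dist  = c(224, 700, 850, 1200),
  time  = c(40, 95, NA, 160),
  stringsAsFactors = FALSE
)

# is.null() tests the whole object, so it returns a single FALSE
# and cannot be used to locate missing cells
is.null(toy)

# Recode blank strings to NA (assumed missing-value code),
# then count the NAs in each column
toy[toy == ""] <- NA
na.counts <- colSums(is.na(toy))
na.counts

# Mean, median and standard deviation of time for flights over 600 miles
long.time <- toy$time[toy$dist > 600]
long.stats <- c(mean   = mean(long.time, na.rm = TRUE),
                median = median(long.time, na.rm = TRUE),
                sd     = sd(long.time, na.rm = TRUE))
long.stats
```

Counting the NAs per column before summarising makes it visible how many observations each statistic is actually based on.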
2. What is the mean, the standard deviation and the maximum of the arrival delay for each carrier? Use dplyr to calculate descriptive statistics across ‘carrier’ and ‘arrival delay’.
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Flights %>%
  group_by(carrier) %>%
  summarise(
    a.mean = mean(arr_delay, na.rm = TRUE),
    b.sd = sd(arr_delay, na.rm = TRUE),
    c.max = max(arr_delay, na.rm = TRUE)
  )
## Source: local data frame [15 x 4]
##
## carrier a.mean b.sd c.max
## 1 AA 0.8917558 37.39939 978
## 2 AS 3.1923077 25.45696 183
## 3 B6 9.8588410 47.64176 335
## 4 CO 6.0986983 28.38512 957
## 5 DL 6.0841374 41.44595 701
## 6 EV 7.2569543 43.26771 469
## 7 F9 7.6682692 24.49275 277
## 8 FL 1.8536239 33.74713 500
## 9 MQ 7.1529751 47.01261 918
## 10 OO 8.6934922 30.40658 380
## 11 UA 10.4628628 47.72488 861
## 12 US -0.6307692 25.20307 433
## 13 WN 7.5871430 30.54575 499
## 14 XE 8.1865242 29.81871 634
## 15 YV 4.0128205 18.82972 72
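The table shows the mean, the standard deviation and the maximum of the arrival delay (in minutes) for each carrier. UA has the highest mean arrival delay (10.5 minutes) and US the lowest (-0.6 minutes, i.e. slightly early on average). The single longest arrival delay, 978 minutes, belongs to AA.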
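As a cross-check, the same kind of grouped summary can be sketched in base R with aggregate(). The data frame `toy` below is made up for illustration and is not part of the Flights data.

```r
# Hypothetical toy data: two carriers, one missing arrival delay
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "AS", "AS", "AS"),
  arr_delay = c(-10, 5, NA, 0, 12, 30)
)

# With the formula interface, aggregate() omits rows with NA by default
# (na.action = na.omit), which here has the same effect as na.rm = TRUE
delay.stats <- aggregate(
  arr_delay ~ carrier,
  data = toy,
  FUN  = function(x) c(mean = mean(x), sd = sd(x), max = max(x))
)
delay.stats
```

If both approaches return the same numbers for the real data, the dplyr pipeline is unlikely to contain a grouping mistake.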
3. Does the flight time correlate with the distance? Does the distance predict the flight time (regression)? Plot the results in a scatterplot and add the regression lines. In addition, plot the regression graphs for distance, departure delay and arrival delay and put the plots next to each other.
test.result <- cor.test(x = Flights$time,
y = Flights$dist
)
test.result
##
## Pearson's product-moment correlation
##
## data: Flights$time and Flights$dist
## t = 2676.311, df = 223872, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9846032 0.9848543
## sample estimates:
## cor
## 0.9847293
Yes, flight time is strongly and positively correlated with distance, r = .98, t(223872) = 2676.31, p < .001. A nearly perfect correlation is expected for these two variables, because longer routes necessarily take longer to fly.
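One way to read a correlation of this size: in a simple linear regression with one predictor, the squared Pearson correlation equals R², so distance accounts for roughly 97% of the variance in flight time (0.9847293² is about 0.97). The sketch below checks this identity on simulated data; the slope, intercept and noise level are made up and only loosely resemble the flight data.

```r
# Simulated data (hypothetical values, not the Flights data)
set.seed(1)
dist <- runif(200, min = 100, max = 3000)
time <- 11 + 0.12 * dist + rnorm(200, sd = 10)

# Squared correlation and R-squared of the one-predictor regression
r <- cor(time, dist)
r.squared <- summary(lm(time ~ dist))$r.squared
all.equal(r^2, r.squared)

# Applied to the correlation reported above
0.9847293^2
```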
dist.lm <- lm(time ~ dist ,
data = Flights
)
dist.lm
##
## Call:
## lm(formula = time ~ dist, data = Flights)
##
## Coefficients:
## (Intercept) dist
## 11.2503 0.1227
Yes, distance predicts flight time. The regression equation is time = 11.2503 + 0.1227 × dist: each additional mile adds about 0.12 minutes of flight time, on top of a fixed 11 minutes.
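To illustrate how the fitted line is used, the sketch below plugs a distance into the equation by hand. The coefficients are copied from the lm() output above; the helper name `predicted.minutes` is hypothetical.

```r
# Coefficients copied from the lm(time ~ dist) output
intercept <- 11.2503
slope     <- 0.1227

# Hypothetical helper: predicted flight time in minutes for a distance in miles
predicted.minutes <- function(dist) {
  intercept + slope * dist
}

# Houston to DFW is listed as 224 miles in the head() output
predicted.minutes(224)
```

The prediction for 224 miles is about 38.7 minutes, close to the 39 to 48 minutes observed for the six DFW flights shown by head(Flights). With the model object available, predict(dist.lm, newdata = data.frame(dist = 224)) should give the same value up to rounding of the coefficients.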
depdelay.lm <- lm(time ~ dep_delay ,
data = Flights
)
depdelay.lm
##
## Call:
## lm(formula = time ~ dep_delay, data = Flights)
##
## Coefficients:
## (Intercept) dep_delay
## 107.63291 0.05411
arrdelay.lm <- lm(time ~ arr_delay ,
data = Flights
)
arrdelay.lm
##
## Call:
## lm(formula = time ~ arr_delay, data = Flights)
##
## Coefficients:
## (Intercept) arr_delay
## 107.85684 0.04024
#Put the plots next to each other
par(mfrow = c(1, 3))
#First Plot - Distance
plot(x = Flights$dist,
y = Flights$time,
main = "Time and Distance",
xlab = "Distance in miles",
ylab = "Time in minutes",
xlim = c(0, 4000),
ylim = c(0, 600),
col = "lightcoral",
pch = 18,
type = "p",
cex = 0.2
)
abline(lm(Flights$time ~ Flights$dist),
col = "black",
lty = 1
)
# Second Plot - dep_delay
plot(x = Flights$dep_delay,
y = Flights$time,
main = "Time and Departure Delay",
xlab = "Departure delay in minutes",
ylab = "Time in minutes",
xlim = c(-33, 1000),
ylim = c(0, 600),
col = "skyblue",
pch = 11,
type = "p",
cex = 0.5
)
abline(lm(Flights$time ~ Flights$dep_delay),
col = "black",
lty = 1
)
# Third Plot - arr_delay
plot(x = Flights$arr_delay,
y = Flights$time,
main = "Time and Arrival Delay",
xlab = "Arrival delay in minutes",
ylab = "Time in minutes",
xlim = c(-70, 1000),
ylim = c(0, 600),
col = "cyan3",
pch = 11,
type = "p",
cex = 0.5
)
abline(lm(Flights$time ~ Flights$arr_delay),
col = "black",
lty = 1
)
4. Create a histogram that shows how many flights departed each hour over the whole year.
hist(x = Flights$hour,
main = "Flights per hour over the year",
xlab = "Hour",
ylab = "Number of flights",
col = "aliceblue",
border = "black",
xlim = c(0,24),
ylim = c(0,20000),
breaks = seq(0,24, by = 1)
)
abline(v = median(Flights$hour, na.rm = T),
col = "black",
lwd = 3,
lty = 1
)
text(x = 16,
y = 20000,
labels = "Median",
lwd = 2)
abline(v = mean(Flights$hour, na.rm = T),
col = "indianred4",
lwd = 3,
lty = 1
)
text(x = 12,
y = 20000,
labels = "Mean",
lwd = 2,
col = "indianred4"
)
median(Flights$hour,
na.rm = T
)
## [1] 14
mean(Flights$hour,
na.rm = T
)
## [1] 13.66337
The histogram shows how departures are distributed over the day. The median departure hour is 14 (2 pm). The mean departure hour is 13.66, which corresponds to roughly 13:40; the minute column is not included in this average. The histogram also shows that there is less air traffic during the night than during the day.
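Two details of this step can be checked on a small made-up example. By default hist() uses right-closed bins, so with integer hours and breaks at 0, 1, ..., 24 the values 0 and 1 fall into the same first bar and the bar for hour 14 covers the interval from 13 to 14. The mean departure time can also include the minute column. The data frame `toy` is hypothetical.

```r
# Hypothetical toy departures with hour and minute columns as in Flights
toy <- data.frame(hour   = c(0, 1, 13, 14, 14, 14),
                  minute = c(30, 5, 52, 0, 1, 59))

# table() counts each hour value exactly
hour.counts <- table(toy$hour)
hour.counts

# Default bins are (h - 1, h], with the first bin closed on both sides: [0, 1]
h.default <- hist(toy$hour, breaks = seq(0, 24, by = 1), plot = FALSE)
h.default$counts[1:2]

# right = FALSE gives bins [h, h + 1), so each bar starts at its own hour
h.left <- hist(toy$hour, breaks = seq(0, 24, by = 1), right = FALSE, plot = FALSE)
h.left$counts[1:2]

# Mean departure time in decimal hours, including the minutes
mean.dep <- mean(toy$hour + toy$minute / 60)
mean.dep
```

For a plot of flights per hour, barplot(table(Flights$hour)) is an alternative that avoids the bin-boundary question altogether.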
5. Are the departure delay times of the two carriers AA and AS significantly different? Conduct a t-test. Create a custom function that tells you what the departure delay of the carrier AA is. To get information about the carriers WN and CO, create a scatterplot that tells you the arrival delay and the distance of these two carriers.
AA.delay <- subset(Flights, subset = carrier== "AA")$dep_delay
AS.delay <- subset(Flights, subset = carrier == "AS")$dep_delay
mean(AA.delay, na.rm = T)
## [1] 6.390144
mean(AS.delay, na.rm = T)
## [1] 3.712329
test.result <- t.test(x = AA.delay,
y = AS.delay,
alternative = "two.sided"
)
test.result
##
## Welch Two Sample t-test
##
## data: AA.delay and AS.delay
## t = 2.1717, df = 652.243, p-value = 0.03024
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2566127 5.0990185
## sample estimates:
## mean of x mean of y
## 6.390144 3.712329
The Welch two-sample t-test shows that the mean departure delays of the two carriers differ significantly, t(652.24) = 2.17, p = .030, 95% CI of the difference = [0.26, 5.10] minutes. AA flights departed on average about 2.7 minutes later than AS flights.
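The fractional degrees of freedom arise because t.test() runs Welch's test by default, which does not assume equal variances. A minimal sketch on made-up delays shows how the statistic and the Welch-Satterthwaite degrees of freedom can be reproduced by hand; the vectors `x` and `y` are hypothetical.

```r
# Made-up departure delays in minutes for two hypothetical carriers
x <- c(5, 12, -3, 20, 7, 0, 15)
y <- c(2, -1, 4, 6, 0, 3)

# Squared standard errors of the two sample means
se2.x <- var(x) / length(x)
se2.y <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t.stat   <- (mean(x) - mean(y)) / sqrt(se2.x + se2.y)
welch.df <- (se2.x + se2.y)^2 /
  (se2.x^2 / (length(x) - 1) + se2.y^2 / (length(y) - 1))

# Compare with t.test(), which uses var.equal = FALSE by default
res <- t.test(x, y)
c(by.hand = t.stat, t.test = unname(res$statistic))
c(by.hand = welch.df, t.test = unname(res$parameter))
```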
# Departure delays of carrier AA and their mean, rounded to two decimals
AA.dep.delay <- subset(x = Flights$dep_delay,
subset = (Flights$carrier == "AA"))
AA.mean <- round(mean(AA.dep.delay, na.rm = TRUE), 2)
AA.mean
## [1] 6.39
# Custom function: a = carrier code, b = mean departure delay in minutes
madlib <- function(a, b) {
output <- paste("If you take the carrier", a, "the departure delay will be", b, "minutes on average.")
return(output)
}
madlib("AA", AA.mean)
## [1] "If you take the carrier AA the departure delay will be 6.39 minutes on average."
WN.data <- subset(Flights,
carrier == "WN"
)
CO.data <- subset(Flights,
carrier == "CO"
)
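plot(x = WN.data$dist,
y = WN.data$arr_delay,
# Axis limits cover both carriers so the CO points added below are not clipped
xlim = range(c(WN.data$dist, CO.data$dist), na.rm = TRUE),
ylim = range(c(WN.data$arr_delay, CO.data$arr_delay), na.rm = TRUE),
col = "darkolivegreen",
pch = 16,
type = "p",
cex = 0.3,
xlab = "Distance in miles",
ylab = "Arrival delay in minutes",
main = "Distance and arrival delay of flights"
)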
points(x = CO.data$dist,
y = CO.data$arr_delay,
pch = 16,
col = "darkolivegreen3",
cex = 0.2
)
legend("topright",
legend = c("Carrier WN", "Carrier CO"),
pch = c(16, 16),
col = c("darkolivegreen", "darkolivegreen3")
)
The scatterplot shows the arrival delays of the two carriers WN and CO across the distances they fly. Most arrival delays are below 100 minutes, with a few outliers; the longest arrival delay is 957 minutes for CO and 499 minutes for WN.
SECTION 4: CONCLUSION
Our analysis showed that flights with a distance of more than 600 miles take about 141 minutes on average, with a median of 129 minutes and a standard deviation of 44 minutes. Furthermore, we computed descriptive statistics of the arrival delay for each carrier, which can inform future choices about which carrier to take.
We were interested in the relationship between distance and flight time. Longer flights should logically take longer, although the time needed for takeoff and for the landing approach also affects the relationship. We found a nearly perfect correlation between flight time and distance (r = .98). We also regressed flight time on departure delay and on arrival delay; both slopes were small (0.054 and 0.040 minutes of flight time per minute of delay).
To show how many flights departed at each hour of the day, we created a histogram of the departure hours over the whole year 2011. Most flights departed during the day, possibly because night flights are restricted in many cases. The mean departure hour was 13.66, i.e. around 13:40.
In our last analysis we wanted to see whether the carriers AA and AS have different departure delay times. There is a significant difference between the two carriers, with AS having the shorter mean departure delay (3.7 vs. 6.4 minutes). We additionally looked at two other carriers, WN and CO, and plotted their arrival delays against distance.