Data Analysis and Visualization with R Final Paper
The dataset (“Flights”) was provided by Dr. Nathaniel Phillips, available at the following link: http://nathanieldphillips.com/wp-content/uploads/2015/04/Flights.txt. This dataset contains data from all flights leaving the Houston airport in one year.
Flights <- read.table(file = "http://nathanieldphillips.com/wp-content/uploads/2015/04/Flights.txt",
header = T,
sep = "\t",
stringsAsFactors = F)
There were 227496 rows and 14 columns in the dataset.
nrow(Flights)
## [1] 227496
ncol(Flights)
## [1] 14
The names of the columns were
names(Flights)
## [1] "date" "hour" "minute" "dep" "arr"
## [6] "dep_delay" "arr_delay" "carrier" "flight" "dest"
## [11] "plane" "cancelled" "time" "dist"
Here is what each column means:
1. date = date of departure
2. hour = hour of departure
3. minute = minute of departure
4. dep = time of departure
5. arr = time of arrival
6. dep_delay = the flight departure delay in minutes
7. arr_delay = the flight arrival delay in minutes
8. carrier = the code of the carrier
9. flight = the flight number
10. dest = the flight destination
11. plane = plane number
12. cancelled = cancelled flights
13. time = flight time in minutes
14. dist = distance of flight
The analysis addresses the following questions:
1. Which planes were used for which destinations?
2. What are the departure delays of flights at every hour? Also, for every hour, what are the flight distances?
3. For each destination with a flight time of 30 minutes or less, what is the distance?
4. What are the mean, median and standard deviation of the departure delay for each destination?
5. What is the correlation between arrival delay and distance?
6. What is the relationship between flight dates and departure delays?
7. What are the mean, median and standard deviation of the distances for each carrier?
8. How does the arrival delay of flights differ between two of the carriers, for example DL and F9?
9. What does the histogram of all departure delays look like, and what does it look like for each carrier?
destinations <- Flights[,c("dest")]
length(unique(destinations))
## [1] 116
planes <- Flights[,c("plane")]
length(unique(planes))
## [1] 3320
I found 116 destinations and 3320 planes.
I plotted planes against destinations to see which planes go to which of the destinations in the list.
Because the dataset is large and contains empty values, and because the destination and plane columns are non-numeric, I cross-tabulated the two variables with table() and plotted the result inside with(). For a table, plot() draws a mosaic plot rather than a scatterplot.
### Task 6
with(Flights, plot( table(dest, plane), main = "Correlation between planes and flights", xlab="Destinations", ylab="Planes"))
We can see that some planes have more flights, as well as more flights towards certain destinations.
To see the relationships in more detail, I plotted the relationship between planes and only three destinations: DFW, ABQ and LFT.
destinations <- subset(Flights, dest == "DFW" | dest == "ABQ" | dest == "LFT")$dest
planes <- subset(Flights, dest == "DFW" | dest == "ABQ" | dest == "LFT")$plane
with(Flights, plot(table(destinations,planes), main = "Correlation between planes and flights", xlab="Destinations",ylab="Planes"))
This shows in more detail how the planes were used for the three destinations ABQ, DFW and LFT. We can observe, for example, that some planes had many more flights to LFT than to ABQ or DFW, while others had only a few flights, which might be a consequence of flights being cancelled, reducing later flights to that destination.
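The counts behind such a plot can also be inspected directly. The following is a minimal sketch on a small made-up data frame (`toy` and all of its values are hypothetical, not taken from Flights); it tabulates flights per destination and plane and lists the most-used plane for each destination.

```r
# Hypothetical toy data standing in for Flights (values are invented)
toy <- data.frame(
  dest  = c("DFW", "DFW", "ABQ", "LFT", "LFT", "LFT"),
  plane = c("N1", "N2", "N1", "N3", "N3", "N2"),
  stringsAsFactors = FALSE
)

# Contingency table: rows are destinations, columns are planes
counts <- table(toy$dest, toy$plane)
print(counts)

# For each destination, the plane with the most flights
# (ties are resolved in favour of the first plane in column order)
busiest <- apply(counts, 1, function(row) names(row)[which.max(row)])
print(busiest)
```

On the real data the same two calls would use Flights$dest and Flights$plane; with 3320 planes the table is wide, so restricting it to a few destinations first keeps the output readable.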
I created three vectors holding the departure hours of all flights, of the flights that departed on time or earlier than scheduled, and of the flights that departed exactly as scheduled.
flights <- Flights[,c("hour")]
flights_nodelay <-subset(Flights,subset = (Flights$dep_delay <= 0))$hour
flights_ontime <-subset(Flights,subset = (Flights$dep_delay == 0))$hour
To see what these vectors look like, I constructed a histogram with mean and median lines.
# # # TASK 2
med <- median(flights,na.rm=T)
mn <- mean(flights,na.rm=T)
# # # TASK 7
hist(flights, main = "Number of flights per hour", xlab = "Hour", ylab = "Amount of flights", xlim = c(0,24), ylim = c(0, 20000), col = "lightskyblue3")
hist(flights_nodelay, xlim = c(0,24), ylim = c(0, 20000), col = "deepskyblue3",add=T)
hist(flights_ontime, xlim = c(0,24), ylim = c(0, 20000), col = "darkblue",add=T)
box()
legend('topleft',c('Flights','Flights with no delay', 'Flights as scheduled'),
fill = c("lightskyblue3", "deepskyblue3","darkblue"), bty = 'n',
border = NA)
abline(v = med, col = "red", lwd = 3)
text(x = 16, y = 18000,labels="median", col = "red")
abline(v = mn, col = "violet", lwd = 3)
text(x = 12, y = 18000, labels="mean", col = "violet")
The histogram shows that almost half of the flights had delays, most of them around midday, and that few flights departed exactly on time. Almost all flights before 05:00 departed without any delay, and almost all flights after 22:00 had delays.
We can also see that most of the flights departed between 12:00 and 15:00, which agrees with the mean departure hour of 13.66 (just before 14:00), whereas there are few or no flights before 04:00 and after 22:00.
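The proportions read off the histogram can also be computed. The sketch below uses a small invented data frame (`toy` is hypothetical, with the same column names as Flights) and counts a flight as delayed when dep_delay is greater than 0.

```r
# Hypothetical toy data with the same column names as Flights
toy <- data.frame(
  hour      = c(6, 6, 14, 14, 14, 23),
  dep_delay = c(-2, 0, 5, 12, -1, 30)
)

# Proportion of flights with a positive departure delay, per departure hour
delayed_share <- tapply(toy$dep_delay > 0, toy$hour, mean, na.rm = TRUE)
print(delayed_share)

# Overall proportion of delayed flights
overall_share <- mean(toy$dep_delay > 0, na.rm = TRUE)
print(overall_share)
```

Taking the mean of a logical vector gives the share of TRUE values, which is why no explicit counting is needed.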
I built a boxplot to see the flight distances for each departure hour.
# # # TASK 8
boxplot(Flights$dist~Flights$hour, xlab="Hours", ylab="Distances", main = "Distances of flights at every hour")
The plot shows that the flights with the longer distances are in the morning or the first half of the day and the distances of the flights decrease towards the end of the day.
I selected the destinations and distances of all flights with a flight time of 30 minutes or less.
Destinations.dest <- subset(Flights, subset = (Flights$time <= 30))$dest
Destinations.dist <- subset(Flights, subset = (Flights$time <= 30))$dist
I constructed a boxplot of the distances of these flights for each destination.
# # # TASK 8
boxplot(Destinations.dist~Destinations.dest,xlab="Destinations", ylab="Distances",main = "Distances of flights of 30 minutes or less for every destination")
Apart from a few destinations, most destinations have very few or no flights of this length and therefore no distances to show. The destinations that were served lie mostly at relatively small distances between 120 km and 150 km, with a few flights at around 180 km and 200 km.
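The same selection can be summarised numerically. This is a minimal sketch on invented data (`toy`, the destination codes and all values are hypothetical); it keeps flights of 30 minutes or less and reports the number of flights and the distance range per destination.

```r
# Hypothetical toy data with the same column names as Flights
toy <- data.frame(
  dest = c("AAA", "AAA", "BBB", "CCC"),
  time = c(28, 30, 29, 45),
  dist = c(140, 140, 191, 224),
  stringsAsFactors = FALSE
)

# Keep only flights with a flight time of 30 minutes or less
short <- subset(toy, time <= 30)

# Number of short flights per destination
n_flights <- table(short$dest)
print(n_flights)

# Smallest and largest distance per destination
dist_range <- aggregate(dist ~ dest, data = short, FUN = range)
print(dist_range)
```

Destinations without any short flight simply do not appear in the result, which avoids the empty boxes a boxplot would show for them.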
I used the dplyr package, which needs to be loaded beforehand.
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# # # TASK 2
Flights%>%
group_by(dest) %>%
summarise (a.mean = mean(dep_delay,na.rm =T),
b.median = median(dep_delay, na.rm =T),
c.sd = sd(dep_delay,na.rm =T)
)
## Source: local data frame [116 x 4]
##
## dest a.mean b.median c.sd
## (chr) (dbl) (dbl) (dbl)
## 1 ABQ 8.478651 0 24.59470
## 2 AEX 6.379213 -2 26.81923
## 3 AGS 10.000000 10 NaN
## 4 AMA 6.703557 -2 27.95402
## 5 ANC 24.952000 7 45.18097
## 6 ASE 15.983333 2 42.75511
## 7 ATL 10.286341 -1 37.75687
## 8 AUS 8.419335 1 23.16978
## 9 AVL 8.899135 -1 34.10081
## 10 BFL 5.057654 -2 22.91824
## .. ... ... ... ...
I found that, for example, for the destination ANC the mean departure delay is 24.95, the median is 7 and the standard deviation is 45.18.
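I recoded the missing values as NA so that they share one flag value. read.table() already reads blank numeric fields as NA, so the recoding targets the character columns, assuming missing entries there were stored as empty strings.
# # # TASK 1
# is.null() is FALSE for a whole data frame, so test each cell for an empty string instead
Flights[!is.na(Flights) & Flights == ""] <- NA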
Firstly, I computed the correlation between distance and arrival delay.
# # # TASK 4
date_delay <- cor.test( x = Flights$dist, y = Flights$arr_delay)
date_delay
##
## Pearson's product-moment correlation
##
## data: Flights$dist and Flights$arr_delay
## t = -2.0981, df = 223870, p-value = 0.0359
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0085764454 -0.0002919103
## sample estimates:
## cor
## -0.004434254
Arrival delay and distance show a negative correlation (r = -0.004, t(223870) = -2.10, p = .036): as the distance increases, the arrival delay tends to decrease. The result reaches significance at the .05 level only because the sample is very large. Since the value of Pearson’s correlation is so close to 0 (-0.0044), there is almost no linear relationship between the two variables, distance and arrival delay.
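The size of this effect can be checked by hand. The sketch below copies r and the degrees of freedom from the cor.test() output above and recomputes the share of explained variance, the t statistic and the two-sided p-value with base R only.

```r
# Values copied from the cor.test() output above
r  <- -0.004434254
df <- 223870

# Share of variance in arrival delay linearly associated with distance
r_squared <- r^2
print(r_squared)

# The t statistic can be recovered from r and the degrees of freedom
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 4))

# Two-sided p-value for that t statistic
p_value <- 2 * pt(-abs(t_stat), df)
print(round(p_value, 4))
```

Because t grows with the square root of the degrees of freedom, even a correlation this close to zero produces a t statistic beyond 2 when there are more than 200000 observations.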
Next I computed the regression of arrival delay on distance.
# # # TASK 5
lm(Flights$arr_delay ~ Flights$dist)
##
## Call:
## lm(formula = Flights$arr_delay ~ Flights$dist)
##
## Coefficients:
## (Intercept) Flights$dist
## 7.3312 -0.0003
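The regression equation is: arr_delay = 7.3312 - 0.0003 × dist. Each additional unit of distance is associated with a decrease of only 0.0003 minutes in the predicted arrival delay. The regression therefore shows a very weak relationship: distance explains almost none of the variation in arrival delay (r² ≈ 0.00002) and is of little practical use for predicting it.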
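To make the size of the fitted slope concrete, the printed coefficients can be plugged into the line arr_delay = intercept + slope × dist. This is only a sketch: `predict_delay` is a hypothetical helper, and the coefficients are the rounded values from the lm() output above, so the predictions are approximate.

```r
# Coefficients as printed by lm() above (rounded in that output)
intercept <- 7.3312
slope     <- -0.0003

# Hypothetical helper: predicted arrival delay in minutes for a given distance
predict_delay <- function(dist) intercept + slope * dist

# Predicted delays for a few example distances
example_dist <- c(0, 500, 1000, 2000)
print(data.frame(dist = example_dist, predicted = predict_delay(example_dist)))
```

Across the whole range from 0 to 2000 distance units the predicted delay changes by well under one minute, which is small compared with the spread of the observed delays.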
Then I plotted the results.
# # # TASK 6
plot(Flights$dist,Flights$arr_delay,main = "Arrival delay and distance" , xlab = "Distance", ylab = "Arrival delay", col = "lightskyblue3")
The plot is consistent with the previous results: arrival delay shows at most a very slight decrease as distance increases. It can also be observed that the flight with the largest arrival delay covers a relatively small distance.
I ordered the flights by date and plotted all departure delays to see whether they follow a certain pattern.
# # # TASK 6
new<- Flights[order(Flights$date),]
new_f<- new [!(is.na(new$dep_delay)),]$dep_delay
f <- new [!(is.na(new$dep_delay)),]$date
with(Flights,plot(table(new_f,f), main = "Departure delays over time", xlab = "Departure delay", ylab = "Index time", xlim = c(0,1000)))
## Warning in mosaicplot.default(x, xlab = xlab, ylab = ylab, ...): extra
## argument 'xlim' will be disregarded
max<- max(new_f)
max
## [1] 981
min <- min(new_f)
min
## [1] -33
In “new” I ordered the entries of the dataset by the date column. The vector new_f stores the departure delays of these ordered flights, leaving out the missing values, and the vector “f” holds the corresponding dates, so both vectors follow the date order.
I then asked for a departure-delay axis running from 0 to 1000, because a plot with its origin at -33 is rather confusing, especially since the time axis also starts at 0. As the warning shows, plot() draws a mosaic plot for a table and disregards xlim.
The plot shows that most of the departure delays are under 400 minutes, but they tend to increase at certain dates in the year, for example around the 130th day or the 290th day.
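An alternative way to look for a pattern over time is to average the delays per day and draw them as a line. The following is a sketch on invented data (`toy` is hypothetical); it assumes the dates are stored as text in YYYY-MM-DD form, which may not match the format of the real date column.

```r
# Hypothetical toy data; the date format is an assumption
toy <- data.frame(
  date      = c("2011-01-01", "2011-01-01", "2011-01-02", "2011-01-03", "2011-01-03"),
  dep_delay = c(5, NA, 20, -3, 7),
  stringsAsFactors = FALSE
)

# Mean departure delay per day; the formula interface drops rows with NA delays
daily <- aggregate(dep_delay ~ date, data = toy, FUN = mean)

# Convert the text dates so that the x-axis is a real time axis
daily$date <- as.Date(daily$date)
print(daily)

plot(daily$date, daily$dep_delay, type = "l",
     main = "Mean departure delay per day",
     xlab = "Date", ylab = "Mean departure delay (minutes)")
```

A line of daily means has one point per day instead of one cell per flight, so peaks at particular dates are easier to locate, and arguments such as ylim are honoured because the input is no longer a table.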
I computed the mean, median and standard deviation of the distances for each of the 15 carriers.
# # # TASK 2 and 9
aggregate(dist ~ carrier,Flights, mean)
## carrier dist
## 1 AA 483.8212
## 2 AS 1874.0000
## 3 B6 1428.0000
## 4 CO 1098.0549
## 5 DL 723.2832
## 6 EV 775.6815
## 7 F9 882.7411
## 8 FL 685.4063
## 9 MQ 650.5310
## 10 OO 819.7279
## 11 UA 1177.8388
## 12 US 981.4677
## 13 WN 606.6218
## 14 XE 589.0326
## 15 YV 938.6709
aggregate(dist ~ carrier,Flights, median)
## carrier dist
## 1 AA 224
## 2 AS 1874
## 3 B6 1428
## 4 CO 1190
## 5 DL 689
## 6 EV 696
## 7 F9 883
## 8 FL 696
## 9 MQ 247
## 10 OO 809
## 11 UA 925
## 12 US 913
## 13 WN 453
## 14 XE 562
## 15 YV 913
aggregate(dist ~ carrier,Flights, sd)
## carrier dist
## 1 AA 353.269167
## 2 AS 0.000000
## 3 B6 0.000000
## 4 CO 505.204266
## 5 DL 103.886047
## 6 EV 259.664313
## 7 F9 7.496141
## 8 FL 45.508796
## 9 MQ 447.617384
## 10 OO 299.852214
## 11 UA 326.355580
## 12 US 110.098250
## 13 WN 399.144719
## 14 XE 280.514799
## 15 YV 79.603621
I also computed the same statistics with a custom function. The following function calculates the mean, median and standard deviation of the distances for one carrier, and is later called once for each of the carriers. “data” contains the distances of the flights of the given carrier.
# # # TASK 10
fnct <- function(x){
data <- subset(Flights, subset = (Flights$carrier == x))$dist
r1 <- mean(data, na.rm = T)
r2 <- median(data, na.rm = T)
r3 <- sd(data, na.rm = T)
result <- paste("Carrier",x, "has mean",r1, "median",r2, "and standard deviation",r3)
return(result)
}
# # # TASK 11
for (i in unique(Flights$carrier)) { print(fnct(i))}
## [1] "Carrier AA has mean 483.82120838471 median 224 and standard deviation 353.269167386022"
## [1] "Carrier AS has mean 1874 median 1874 and standard deviation 0"
## [1] "Carrier B6 has mean 1428 median 1428 and standard deviation 0"
## [1] "Carrier CO has mean 1098.05494631026 median 1190 and standard deviation 505.20426636437"
## [1] "Carrier DL has mean 723.283226050738 median 689 and standard deviation 103.886046520742"
## [1] "Carrier OO has mean 819.727850071602 median 809 and standard deviation 299.852214266527"
## [1] "Carrier UA has mean 1177.8388030888 median 925 and standard deviation 326.355579614994"
## [1] "Carrier US has mean 981.467662910338 median 913 and standard deviation 110.098250024878"
## [1] "Carrier WN has mean 606.621793882187 median 453 and standard deviation 399.144718671193"
## [1] "Carrier EV has mean 775.681488203267 median 696 and standard deviation 259.664312788219"
## [1] "Carrier F9 has mean 882.741050119332 median 883 and standard deviation 7.49614061401011"
## [1] "Carrier FL has mean 685.406264609631 median 696 and standard deviation 45.508795786583"
## [1] "Carrier MQ has mean 650.530981067126 median 247 and standard deviation 447.617384017244"
## [1] "Carrier XE has mean 589.032565397725 median 562 and standard deviation 280.514798691695"
## [1] "Carrier YV has mean 938.670886075949 median 913 and standard deviation 79.6036208297387"
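The three aggregate() calls and the loop can also be combined into one call that returns all three statistics per carrier. This is a minimal sketch on invented data (`toy` and `dist_summary` are hypothetical names, and the values are not from Flights).

```r
# Hypothetical toy data with the same column names as Flights
toy <- data.frame(
  carrier = c("AA", "AA", "AA", "DL", "DL"),
  dist    = c(200, 224, 1000, 650, 700),
  stringsAsFactors = FALSE
)

# Hypothetical helper returning the three statistics as a named vector
dist_summary <- function(d) {
  c(mean   = mean(d, na.rm = TRUE),
    median = median(d, na.rm = TRUE),
    sd     = sd(d, na.rm = TRUE))
}

# One row per carrier; the dist column holds a matrix with one column per statistic
stats <- aggregate(dist ~ carrier, data = toy, FUN = dist_summary)
print(stats)
```

Returning numbers rather than a pasted sentence keeps the results available for further computation, for example sorting the carriers by median distance.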
arrival_delay_DL <- subset(Flights, subset = (Flights$carrier == "DL"))$arr_delay
arrival_delay_F9 <- subset(Flights, subset = (Flights$carrier == "F9"))$arr_delay
# # # TASK 3
t_test <- t.test( arrival_delay_DL, arrival_delay_F9)
t_test
##
## Welch Two Sample t-test
##
## data: arrival_delay_DL and arrival_delay_F9
## t = -1.3466, df = 2408.4, p-value = 0.1783
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.8910633 0.7227997
## sample estimates:
## mean of x mean of y
## 6.084137 7.668269
The Welch t-test shows no significant difference in mean arrival delay between the carriers DL and F9, t(2408.4) = -1.35, p = .178.
# # # TASK 7
hist(Flights$dep_delay, main = "Distribution of flight departure delays",xlab = "Departure delays", ylab = "Frequency")
median(Flights$dep_delay,na.rm=T)
## [1] 0
abline(v = median(Flights$dep_delay,na.rm=T), col = "red", lwd = 3)
text(x = 16, y = 18000,labels="median", col = "red")
abline(v = mean(Flights$dep_delay,na.rm=T), col = "violet", lwd = 3)
text(x = 12, y = 18000, labels="mean", col = "violet")
It can be observed that most departure delays are under 200 minutes and that the median departure delay is 0.
Then I created a histogram of the departure delays for all of the carriers.
# # # TASK 7 and 11
par(mfrow = c(3, 5))
for (carrier.i in unique(Flights$carrier)) {
  # Departure delays of the current carrier only
  delays.i <- Flights$dep_delay[Flights$carrier == carrier.i]
  hist(x = delays.i,
       xlab = "Departure delays",
       main = paste("Departure delays for \ncarrier ", carrier.i, sep = ""))
  # Mark this carrier's own median and mean rather than the overall values
  abline(v = median(delays.i, na.rm = TRUE), col = "red", lwd = 3)
  abline(v = mean(delays.i, na.rm = TRUE), col = "violet", lwd = 3)
  legend("topright", legend = c("median", "mean"),
         col = c("red", "violet"), lwd = 3, bty = "n")
}
From the histograms it can be seen that the departure delays are concentrated near 0 for every carrier, but some carriers have a wider range of departure delays, the highest departure delay belonging to carrier AA.
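The per-carrier medians and maxima can also be computed rather than read off the panels. The sketch below uses invented data (`toy` is hypothetical, with the same column names as Flights).

```r
# Hypothetical toy data with the same column names as Flights
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "DL", "DL", "DL"),
  dep_delay = c(-1, 0, 40, 2, 3, NA),
  stringsAsFactors = FALSE
)

# Median departure delay for each carrier, ignoring missing values
carrier_median <- tapply(toy$dep_delay, toy$carrier, median, na.rm = TRUE)
print(carrier_median)

# Largest departure delay for each carrier, and the carrier it belongs to
carrier_max <- tapply(toy$dep_delay, toy$carrier, max, na.rm = TRUE)
worst_carrier <- names(which.max(carrier_max))
print(worst_carrier)
```

Run on Flights, the first result would show directly whether every carrier has a median of 0, and the second which carrier holds the largest delay.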
The answers to the questions above let us study the relationships between different variables. Firstly, studying the relationship between hours, distances and departure delays, we observed that almost all of the flights before 05:00 had no departure delay, that almost all flights after 22:00 were delayed, and that most of the flights departed around 14:00. Also, flights with longer distances departed in the first half of the day. Secondly, analyzing the arrival delay and distance, it can be observed that they have almost no linear relationship: the correlation is negative but very close to 0.
Studying the departure delays over the whole year, most of the departure delays are under 400 minutes, and higher departure delays tend to occur at certain dates, for example around the 130th day. The highest departure delay occurs towards the end of the year, possibly around Christmas time, perhaps due to weather conditions or busy airlines.