Exploratory Analysis of Flights Data
Part 1: Flights Data Visualization
1. How does average departure delay vary by carrier across different months of the year?
There are many carriers to work with here, so we will focus on only a few of the more notable ones: OO (high delay), HA (low delay), and FL (middle-ish delay).
OO seems inactive for much of the year, appearing as the carrier for only 29 flights in the entire dataset. For these observations, the airline may have had a smaller presence at the target airports, may have served a specialized purpose, may have flown smaller flights, etc.
HA had significantly more activity, however. Its mean flight delay is relatively stable throughout the year, with somewhat higher variance in February (though this may not be an actual trend).
FL is more interesting, and perhaps more indicative of the general trajectory through the year. Peak activity during the summer and near the December holidays is expected, and the boxplots reflect this.
Overall, it is safe to say that while there are certainly trends to be seen, many of the carriers have different fluctuation patterns throughout the year due to carrier purpose, number of flights, size, etc.
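The per-carrier, per-month averages behind this kind of comparison can be computed in base R with tapply. The sketch below uses a small made-up data frame (toy) in place of flights.c, so the numbers are purely illustrative; only the column names carrier, month, and dep_delay are assumed to match the real data.

```r
# Hypothetical toy data standing in for flights.c (values are made up).
toy <- data.frame(
  carrier   = c("HA", "HA", "FL", "FL", "FL", "OO"),
  month     = c(1, 1, 1, 7, 7, 7),
  dep_delay = c(-2, 4, 10, 30, NA, 60)
)

# Mean departure delay for every carrier-month combination.
# Combinations with no flights at all come back as NA.
delay_by_month <- tapply(toy$dep_delay, list(toy$carrier, toy$month),
                         mean, na.rm = TRUE)
print(delay_by_month)

# Flights per carrier, which helps flag sparse carriers before comparing means.
flights_per_carrier <- table(toy$carrier)
print(flights_per_carrier)
```

Checking the flight counts alongside the means is a cheap way to avoid over-reading a carrier that has only a handful of observations.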
2. What is the relationship between flight distance and air time for flights departing from each origin airport?
As expected, we can see a roughly linear relationship with a clear positive correlation. This is not surprising because, generally, the farther the flight, the more time is needed to fly it.
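To put a number on "roughly linear", one option is to fit a simple regression of air time on distance for each origin and look at the slope and correlation. This is only a sketch on made-up values (toy); the column names origin, distance, and air_time are assumed to match flights.c.

```r
# Hypothetical toy data standing in for flights.c (values are made up).
toy <- data.frame(
  origin   = c("EWR", "EWR", "EWR", "JFK", "JFK", "JFK"),
  distance = c(200, 1000, 2400, 300, 1100, 2500),
  air_time = c(45, 150, 330, 60, 165, 340)
)

# For each origin, fit air_time ~ distance and keep the slope
# (minutes of air time per extra mile) and the Pearson correlation.
by_origin <- lapply(split(toy, toy$origin), function(d) {
  fit <- lm(air_time ~ distance, data = d)
  c(slope = unname(coef(fit)["distance"]),
    correlation = cor(d$distance, d$air_time))
})
print(by_origin)
```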
3. Which carriers (carrier) have the highest average arrival delay for flights exceeding a certain distance threshold (e.g., 1000 miles)?
The R script below calculates this:
carriers <- unique(flights.c$carrier)
aves <- numeric(length(carriers))
for (i in seq_along(carriers)) {
  # Mean arrival delay for this carrier over flights longer than 1000 miles
  aves[i] <- mean(flights.c[flights.c$carrier == carriers[i] & flights.c$distance > 1000, ]$arr_delay, na.rm = TRUE)
}
# Position of the carrier with the largest mean arrival delay
index <- which(aves == max(aves, na.rm = TRUE))
unique(flights.c$carrier)[index]
[1] "F9"
aves[index]
[1] 21.9207
This suggests that, for flights over 1,000 miles in distance, F9 has the highest average arrival delay, at ~22 minutes. For a different threshold (like 2,000 miles), we can run the code again:
[1] "WN"
[1] 14.01647
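Since the threshold changes between runs, it may be convenient to wrap the calculation in a function. The sketch below does this on a made-up data frame (toy); the helper name worst_carrier is hypothetical and not part of the original analysis. Note that which.max returns only the first maximum, whereas the which(aves == max(...)) approach would return all ties.

```r
# Hypothetical toy data standing in for flights.c (values are made up).
toy <- data.frame(
  carrier   = c("F9", "F9", "WN", "WN", "WN", "HA"),
  distance  = c(1600, 1600, 1200, 2100, 2100, 4900),
  arr_delay = c(25, 15, 5, 20, 10, -5)
)

# Hypothetical helper: carrier with the highest mean arrival delay among
# flights strictly longer than `threshold` miles.
worst_carrier <- function(flights, threshold) {
  long <- flights[flights$distance > threshold, ]
  aves <- tapply(long$arr_delay, long$carrier, mean, na.rm = TRUE)
  aves[which.max(aves)]  # named value: name is the carrier, value is the mean
}

print(worst_carrier(toy, 1000))
print(worst_carrier(toy, 2000))
```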
4. How does the proportion of delayed flights differ by hour of scheduled departure across origins?
We can see a clear trend: as it gets later in the day, the proportion of delayed flights grows from around 20% to around 60%, a substantial jump. Also, EWR has the highest proportion of delayed flights, but not by much.
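The proportions themselves can be tabulated by taking the mean of a logical "delayed" flag within each hour and origin. The sketch below assumes a flight counts as delayed when dep_delay is greater than 0, which is one possible definition and may not be the one used for the plot. The data frame toy is made up.

```r
# Hypothetical toy data standing in for flights.c (values are made up).
toy <- data.frame(
  origin    = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK", "JFK", "JFK"),
  hour      = c(6, 6, 20, 20, 6, 6, 20, 20),
  dep_delay = c(-3, 5, 12, 40, -1, -2, 0, 25)
)

# Assumption: "delayed" means a strictly positive departure delay.
toy$delayed <- toy$dep_delay > 0

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of delayed flights per hour (rows) and origin (columns).
prop_delayed <- tapply(toy$delayed, list(toy$hour, toy$origin),
                       mean, na.rm = TRUE)
print(prop_delayed)
```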
5. What is the trend of average departure delay over days of the month for each carrier?
6. How does the difference between scheduled and actual departure time vary by destination for a specific carrier?
We’ll use carrier FL as our example carrier. There are a very large number of destinations in the dataset, so we’ll show the average difference for just 10 randomly sampled destinations. This can be computed with the code below:
# Difference between actual and scheduled departure time for every flight
# Note: flights.c is not subset by carrier in this snippet
flights.c$diff <- flights.c$dep_time - flights.c$sched_dep_time
# Print the mean difference for 10 randomly sampled destinations
dests <- sample(unique(flights.c$dest), 10)
for (dest in dests) {
  print(paste0(dest, ": ", mean(flights.c[flights.c$dest == dest, ]$diff, na.rm = TRUE)))
}
[1] "CRW: 35.2985074626866"
[1] "TUL: 56.7925170068027"
[1] "BOS: -4.31094394887498"
[1] "BUF: -0.189496717724289"
[1] "LGB: 20.7776096822995"
[1] "SJU: -43.5730122986316"
[1] "MVY: 14.8904761904762"
[1] "GRR: 28.1208791208791"
[1] "ORF: 26.255230125523"
[1] "SBN: 37.1"
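One caveat worth checking: if dep_time and sched_dep_time are stored as HHMM clock integers (as they are in the nycflights13 flights table, which is an assumption about flights.c here), subtracting them directly does not give minutes and can produce large negative values when a flight slips past midnight. A minimal sketch of a conversion, with a hypothetical helper hhmm_to_minutes and made-up times:

```r
# Hypothetical helper: convert an HHMM clock value (e.g. 1755 for 5:55 PM)
# to minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

# Made-up scheduled and actual departure times in HHMM form.
sched  <- c(1755, 2350, 600)
actual <- c(1805,   10, 555)

# Direct subtraction treats HHMM values as plain numbers.
raw_diff <- actual - sched

# Converting first gives true minute differences.
diff_min <- hhmm_to_minutes(actual) - hhmm_to_minutes(sched)

# Assumption: a difference below -12 hours means the flight left after midnight,
# so one day (1440 minutes) is added back.
diff_min <- ifelse(diff_min < -720, diff_min + 1440, diff_min)

print(raw_diff)
print(diff_min)
```

If the columns are already in minutes, none of this applies and the direct subtraction is fine.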
7. What is the relationship between flight distance and arrival delay for flights departing during peak hours (6-9 or 17-20)?
The plots above are a bit hard to read, so I added a regression line. It shows that as flight distance increases, arrival delay typically stays about the same.
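The flat regression line can be checked numerically by looking at the fitted slope. The sketch below filters to peak hours and fits arr_delay on distance using made-up values (toy). It assumes the ranges 6-9 and 17-20 are inclusive and refer to an hour column, which may differ from how the plots were built.

```r
# Hypothetical toy data standing in for flights.c (values are made up).
toy <- data.frame(
  hour      = c(6, 7, 8, 12, 17, 18, 19, 22),
  distance  = c(200, 800, 1500, 1000, 2400, 500, 1200, 900),
  arr_delay = c(10, 12, 9, 50, 11, 10, 12, -20)
)

# Assumption: peak hours are the inclusive ranges 6-9 and 17-20.
peak <- toy[toy$hour %in% c(6:9, 17:20), ]

# Slope of arrival delay (minutes) per mile of distance; a value near zero
# means delay barely changes with distance.
fit <- lm(arr_delay ~ distance, data = peak)
slope <- unname(coef(fit)["distance"])
print(slope)
```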
8. Which origin-destination pairs have the highest average air time for flights in a given month?
Let’s use March (my birth month) as our given month.
origins <- unique(flights.c$origin)
destinations <- unique(flights.c$dest)
# Rows are origins, columns are destinations; each cell holds the mean March air time
aves <- matrix(NA_real_, nrow = length(origins), ncol = length(destinations))
for (i in seq_along(origins)) {
  for (j in seq_along(destinations)) {
    aves[i, j] <- mean(flights.c[flights.c$origin == origins[i] & flights.c$dest == destinations[j] & flights.c$month == 3, ]$air_time, na.rm = TRUE)
  }
}
max(aves, na.rm = TRUE) # at row 3, column 47
[1] 648.4194
unique(flights.c$origin)[3]
[1] "JFK"
unique(flights.c$dest)[47]
[1] "HNL"
During March, JFK to HNL flights have the highest average air time at ~648.4 minutes.
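Rather than reading the row and column of the maximum off by hand, the position can be recovered with which(..., arr.ind = TRUE). The sketch below builds the origin-by-destination table with tapply on made-up data (toy), so the values are illustrative only.

```r
# Hypothetical toy data standing in for flights.c (values are made up).
toy <- data.frame(
  origin   = c("JFK", "JFK", "JFK", "EWR", "EWR", "LGA"),
  dest     = c("HNL", "HNL", "BOS", "HNL", "BOS", "BOS"),
  month    = c(3, 3, 3, 4, 3, 3),
  air_time = c(650, 640, 40, 600, 45, NA)
)

# Keep March only, then average air time per origin (rows) and destination (columns).
march <- toy[toy$month == 3, ]
aves <- tapply(march$air_time, list(march$origin, march$dest),
               mean, na.rm = TRUE)

# Row and column index of the largest mean, translated back to airport codes.
pos <- which(aves == max(aves, na.rm = TRUE), arr.ind = TRUE)
best_origin <- rownames(aves)[pos[1, "row"]]
best_dest   <- colnames(aves)[pos[1, "col"]]
print(c(best_origin, best_dest))
```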
9. How does the average speed in miles per hour vary by carrier across different flight distances?
Some simple plots can show this off effectively. We do have to calculate average MPH ourselves, and there are 16 carriers so I will only show the most interesting 6:
We can see that the trend is not quite linear, but it is increasing (which makes sense). There is a surprising amount of scatter in the plots, however, likely due to differently sized airplanes or maybe cargo loads (if applicable).
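For reference, average speed follows from distance in miles and air time in minutes as mph = distance / (air_time / 60). The sketch below computes it on made-up rows (toy) and averages it within distance bins per carrier; the bin edges are arbitrary choices for illustration, not the ones used for the plots.

```r
# Hypothetical toy data standing in for flights.c (values are made up).
toy <- data.frame(
  carrier  = c("HA", "HA", "B6", "B6"),
  distance = c(4983, 4983, 187, 1069),
  air_time = c(640, 600, 40, 150)
)

# Speed in miles per hour: miles divided by hours in the air.
toy$mph <- toy$distance / (toy$air_time / 60)

# Arbitrary illustrative bins for flight distance (in miles).
toy$dist_bin <- cut(toy$distance, breaks = c(0, 500, 1500, Inf),
                    labels = c("short", "medium", "long"))

# Mean speed per carrier (rows) and distance bin (columns); empty cells are NA.
mph_by_bin <- tapply(toy$mph, list(toy$carrier, toy$dist_bin),
                     mean, na.rm = TRUE)
print(round(mph_by_bin, 1))
```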
10. What is the distribution of arrival delays for flights with departure times after midnight by carrier?
Once more, there are many carriers, so I will only show the first 6. However, the code can easily be expanded to include more histograms:
Generally, the proportion of flights spikes as departure time nears 6:00 AM, with very few flights before then. Note that this is carrier-dependent: some carriers, such as B6, fly more frequently in the late hours (12:00 AM to 1:00 AM).
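A numeric companion to the histograms is to count early departures per carrier and summarize their arrival delays. The sketch below assumes "after midnight" means a dep_time before 6:00 AM stored as an HHMM integer; both the cutoff and the storage format are assumptions, and the data frame toy is made up.

```r
# Hypothetical toy data standing in for flights.c (values are made up).
toy <- data.frame(
  carrier   = c("B6", "B6", "B6", "UA", "UA", "AA"),
  dep_time  = c(15, 45, 555, 530, 1200, 2359),
  arr_delay = c(90, 120, -5, -10, 3, 40)
)

# Assumption: the after-midnight window is dep_time strictly below 600 (HHMM).
early <- toy[!is.na(toy$dep_time) & toy$dep_time < 600, ]

# Number of early departures and median arrival delay per carrier.
n_early      <- table(early$carrier)
median_delay <- tapply(early$arr_delay, early$carrier, median, na.rm = TRUE)
print(n_early)
print(median_delay)
```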
Part 2: Imbalanced Logistic Regression
Part 2 returns to project 3, in which we used the hotel_booking dataset to predict hotel reservation cancellation based on a variety of numerical and categorical features. This time, we consider a logistic regression trained on the dataset both before and after rebalancing the is_canceled class.
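The rebalancing method is not spelled out here, so the following is only a sketch of one common option, random undersampling of the majority class, and not necessarily what was used. The data frame toy and its lead_time column are hypothetical stand-ins for the hotel_booking data.

```r
set.seed(1)  # make the random draw reproducible

# Hypothetical toy data: 8 non-cancelled (0) and 3 cancelled (1) bookings.
toy <- data.frame(
  is_canceled = c(rep(0, 8), rep(1, 3)),
  lead_time   = c(5, 12, 30, 7, 45, 3, 60, 22, 90, 120, 75)
)

# Size of the smaller class; every class is cut down to this many rows.
n_minor <- min(table(toy$is_canceled))

# For each class, randomly pick n_minor row indices without replacement.
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$is_canceled),
                      function(idx) idx[sample.int(length(idx), n_minor)]))

balanced <- toy[keep, ]
print(table(balanced$is_canceled))
```

Undersampling throws away majority-class rows; oversampling the minority class or reweighting the loss are alternatives with different trade-offs.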
The data cleaning will be identical. We’ll remove NA values, convert appropriate features to numeric, decompose the reservation_status_date into generalizable features, scale numerical data, etc.
We’ll train the imbalanced logistic regression in exactly the same way as in project 3. Here are the performance metrics, with the logistic regression trained on the training data:
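# Metrics for the logistic regression trained on the original (imbalanced) data
log_accu <- 0.8133428
log_recall <- 0.6169227
log_precis <- 0.8366784
log_F1 <- 0.7101892
# Metrics for the logistic regression trained after rebalancing
log_post_accu <- 0.7741662
log_post_recall <- 0.7282645
log_post_precis <- 0.8039483
log_post_F1 <- 0.7642372
# Collect both models' metrics into one table (one row per model)
metrics <- data.frame(
  accu = c(log_accu, log_post_accu),
  recall = c(log_recall, log_post_recall),
  precis = c(log_precis, log_post_precis),
  F1 = c(log_F1, log_post_F1)
)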
rownames(metrics) <- c("logistic", "logistic (rebalanced)")
colnames(metrics) <- c("accuracy", "recall", "precision", "F1")
print(metrics)
                       accuracy    recall precision        F1
logistic              0.8133428 0.6169227 0.8366784 0.7101892
logistic (rebalanced) 0.7741662 0.7282645 0.8039483 0.7642372
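As a quick consistency check, F1 is the harmonic mean of precision and recall, F1 = 2 * precision * recall / (precision + recall), so the F1 column can be recomputed from the other two. The helper name f1_score below is hypothetical; the inputs are the precision and recall values from the table above.

```r
# Hypothetical helper: F1 as the harmonic mean of precision and recall.
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Precision and recall as reported in the metrics table.
f1_logistic   <- f1_score(precision = 0.8366784, recall = 0.6169227)
f1_rebalanced <- f1_score(precision = 0.8039483, recall = 0.7282645)

print(round(c(logistic = f1_logistic, rebalanced = f1_rebalanced), 4))
```

Both recomputed values agree with the reported F1 scores to four decimal places.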