library(rlang)
require(ggthemes)
library(tidyverse)
library(magrittr)
library(TTR)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(fpp2)
library(forecast)
library(caTools)
library(reshape2)
library(psych)
library(ggthemes)
library(readr)
library(readxl)
#install.packages("skimr")
library(corrplot)
library(skimr)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
#read in excel data
data <- read_excel("C:/teams/Desktop/Aero/Project Barracuda - Financial Analysis_v01.xlsx",sheet="GEG_Raw Data")
#check data structure
print('data structure')
## [1] "data structure"
skim(data )
Name | data |
Number of rows | 10785 |
Number of columns | 46 |
_______________________ | |
Column type frequency: | |
character | 40 |
numeric | 4 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
IATA | 10178 | 0.06 | 2 | 2 | 0 | 10 | 0 |
ICAO | 9044 | 0.16 | 3 | 3 | 0 | 60 | 0 |
Flight | 0 | 1.00 | 3 | 7 | 0 | 3358 | 0 |
Aircraft Type | 598 | 0.94 | 2 | 4 | 0 | 226 | 0 |
Aircraft Category | 598 | 0.94 | 3 | 10 | 0 | 5 | 0 |
Tail | 362 | 0.97 | 2 | 7 | 0 | 3024 | 0 |
Based (Y/N) | 0 | 1.00 | 2 | 6 | 0 | 10 | 0 |
From/To | 0 | 1.00 | 1 | 9 | 0 | 629 | 0 |
Operator | 861 | 0.92 | 6 | 43 | 0 | 1853 | 0 |
Operator (dba) | 804 | 0.93 | 6 | 43 | 0 | 1727 | 0 |
PH Name | 3729 | 0.65 | 6 | 40 | 0 | 1133 | 0 |
PH Address1 | 3729 | 0.65 | 7 | 40 | 0 | 694 | 0 |
PH Address2 | 3799 | 0.65 | 1 | 40 | 0 | 763 | 0 |
PH City | 3731 | 0.65 | 3 | 15 | 0 | 401 | 0 |
PH Con. 1 FN | 5743 | 0.47 | 2 | 13 | 0 | 267 | 0 |
PH Con. 1 LN | 5743 | 0.47 | 3 | 14 | 0 | 519 | 0 |
PH Con. 1 Title | 5752 | 0.47 | 5 | 20 | 0 | 90 | 0 |
PH Con. 1 Email | 6196 | 0.43 | 10 | 38 | 0 | 416 | 0 |
PH Con. 2 FN | 9035 | 0.16 | 2 | 13 | 0 | 126 | 0 |
PH Con. 2 LN | 9035 | 0.16 | 1 | 10 | 0 | 185 | 0 |
PH Con. 2 Title | 9035 | 0.16 | 3 | 20 | 0 | 60 | 0 |
PH Con. 2 Email | 9227 | 0.14 | 13 | 32 | 0 | 157 | 0 |
Apcode | 4115 | 0.62 | 3 | 4 | 0 | 403 | 0 |
Operator Name | 10051 | 0.07 | 6 | 40 | 0 | 136 | 0 |
Operator Address1 | 10076 | 0.07 | 8 | 40 | 0 | 122 | 0 |
Operator Address2 | 10394 | 0.04 | 7 | 40 | 0 | 59 | 0 |
Operator City | 10078 | 0.07 | 4 | 15 | 0 | 96 | 0 |
Operator Con. 1 FN | 10076 | 0.07 | 3 | 14 | 0 | 98 | 0 |
Operator Con. 1 LN | 10076 | 0.07 | 3 | 14 | 0 | 115 | 0 |
Operator Con. 1 Title | 10076 | 0.07 | 5 | 20 | 0 | 54 | 0 |
Operator Con. 1 Email | 10350 | 0.04 | 13 | 35 | 0 | 64 | 0 |
Operator Con. 2 FN | 10294 | 0.05 | 3 | 13 | 0 | 62 | 0 |
Operator Con. 2 LN | 10294 | 0.05 | 3 | 14 | 0 | 71 | 0 |
Operator Con. 2 Title | 10294 | 0.05 | 5 | 20 | 0 | 47 | 0 |
Operator Con. 2 Email | 10616 | 0.02 | 13 | 32 | 0 | 36 | 0 |
Reg. Owner Country | 862 | 0.92 | 2 | 14 | 0 | 9 | 0 |
Serial Number | 861 | 0.92 | 1 | 18 | 0 | 2291 | 0 |
Manufacturer Name | 861 | 0.92 | 4 | 30 | 0 | 115 | 0 |
Model | 861 | 0.92 | 1 | 20 | 0 | 432 | 0 |
Operational Category | 2820 | 0.74 | 7 | 15 | 0 | 15 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Fuel Uplift | 2876 | 0.73 | 327.82 | 439.95 | 30 | 100 | 200 | 450.00 | 6300 | ▇▁▁▁▁ |
Past Visits | 5182 | 0.52 | 69.55 | 103.07 | 1 | 4 | 28 | 75.00 | 471 | ▇▁▁▁▁ |
Company Visits | 5127 | 0.52 | 84.74 | 121.95 | 1 | 7 | 45 | 80.75 | 594 | ▇▁▁▁▁ |
MFR Year | 1051 | 0.90 | 2003.85 | 12.89 | 1942 | 2000 | 2006 | 2014.00 | 2023 | ▁▁▂▆▇ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date and Time | 0 | 1 | 2019-01-01 00:00:00 | 2023-07-18 00:00:00 | 2021-02-17 00:00:00 | 1634 |
Departed | 0 | 1 | 1899-12-31 00:00:21 | 1899-12-31 23:59:28 | 1899-12-31 15:12:05 | 4253 |
Its a time series data set with a datetime column and 39 other ‘factors’; some of which could be utilised as predictors
Total number of 40 variables
Total number of rows 10785
Quite a number of variables contain null or blanks making it not useful for prediction purposes
Only 4 variables are numerical: Fuel Uplift, Past Visits and Company Visits, MFR Year
Assumption is: Flights without “origination/destination” are aircrafts that were grounded for some reasons
Going forward in this analysis, we will only be looking at flights that actually took place as designated by a “Origin” followed by “Destination
#flights that actually flew from Spokane to another destination
# Specify the strings you want to exclude
strings_to_exclude <- c("KGEG/", "/","GEG")
# Use filter() to exclude rows
actual_flts_data <- data %>%
filter(!(`From/To` %in% strings_to_exclude))
actual_flts <- actual_flts_data %>%
rename(act_flts=`From/To`) %>%
mutate(Category='From/To') %>%
select(Category,act_flts) %>%
group_by(Category) %>%
summarise(no_aircrafts=n())
#flight that did not take flights at all
no_flts_data <- data %>%
filter((`From/To` %in% strings_to_exclude))
no_flts <- no_flts_data %>%
rename(no_flts=`From/To`) %>%
mutate(Category='From/To') %>%
select(Category,no_flts) %>%
group_by(Category) %>%
summarise(no_aircrafts=n())
# Combine data frames with a source column
df1_combined <- actual_flts %>% mutate(Source = "actual_flts")
df2_combined <- no_flts %>% mutate(Source = "no_flts")
# Combine into a single data frame
combined_df <- bind_rows(df1_combined, df2_combined)
# Create a side-by-side barplot
p1 <- ggplot(combined_df, aes(x = Category, y = no_aircrafts, fill = Source)) +
geom_bar(stat = "identity", position = position_dodge(width = 1)) +
labs(title = "Actual Flts vs. Grounded Flts", x = "Category", y = "No. of Flights") +
geom_text(aes(label = no_aircrafts), vjust = -.25,size = 4)+
scale_fill_manual(values = c("no_flts" = "blue", "actual_flts" = "red"))
plotly_plot <- ggplotly(p1)
plotly_plot
flight_data<- actual_flts_data %>%
group_by(Year = year(DateTime) )%>%
summarise(Flights=n())
# Create a bar chart
p2 <- ggplot(flight_data, aes(x = Year, y = Flights)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(
title = "Number of Flights Over 5 Years",
x = "Year",
y = "Number of Flights"
)
plotly_plot2 <- ggplotly(p2)
plotly_plot2
# Break down by year to see if there is seasonality and/or trends in actual flights
flight_data_season<- actual_flts_data %>%
group_by(Month = floor_date(DateTime, unit = "month")) %>%
summarise(Flights=n())
# Plot the line chart
p3 <- ggplot(flight_data_season, aes(x = Month, y = Flights)) +
geom_line() +
labs(title = "Total Flights over 5 years by Months", x = "Month", y = "No. of Flights") +
scale_x_datetime(date_labels = "%Y-%m", date_breaks = "1 month") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plotly_plot3 <- ggplotly(p3)
plotly_plot3
arima_model <- auto.arima(flight_data_season[,2])
print(summary(arima_model))
## Series: flight_data_season[, 2]
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.6768 164.9958
## s.e. 0.0990 12.1577
##
## sigma^2 = 946.5: log likelihood = -265.78
## AIC=537.56 AICc=538.03 BIC=543.59
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.3759777 30.20128 24.22793 -5.544682 17.84585 0.9412288
## ACF1
## Training set 0.03238102
fcast <- forecast(arima_model,h=12)
print(fcast)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 56 127.0989 87.67086 166.5269 66.79893 187.3988
## 57 139.3478 91.73890 186.9568 66.53624 212.1595
## 58 147.6377 96.71866 198.5568 69.76373 225.5117
## 59 153.2482 100.88278 205.6136 73.16221 233.3341
## 60 157.0452 104.03053 210.0599 75.96625 238.1242
## 61 159.6150 106.30554 212.9244 78.08523 241.1448
## 62 161.3542 107.91025 214.7981 79.61875 243.0896
## 63 162.5312 109.02582 216.0366 80.70177 244.3606
## 64 163.3278 109.79428 216.8613 81.45534 245.2003
## 65 163.8669 110.32052 217.4133 81.97477 245.7591
## 66 164.2318 110.67949 217.7841 82.33061 246.1330
## 67 164.4787 110.92373 218.0337 82.57342 246.3841
plot(fcast)
The pandemic year of 20’ was the low point of total overall flights
Here’s an interesting observation, overall flights in 22’ came roaring back after the pre-pandemic year of 20’ almost on par with 19’ which saw the highest travelling season in the last 5 years
Peeling deeper, at the height of pandemic lock-down on March-04 2020, there were only 50 actual flights that took place from Spokane!
There’s a trend here for any given year: One can discern from the chart that august is the busiest travelling month of the year (ex year 20’) for all years (Please hover over the month of interest on chart to see the actual numbers)
Lastly, an automated ARIMA forecast was utilized just to see how the next 12 periods of flight traffic would look like. This was a simple illustration, in reality, a more robust process could be implemented with other algorithms like Holts-Winter or more sophisticated driver-based models such as lasso or stepwise regression
Note: 23’ is not a full year’s worth of flights, so its excluded from comparision
flight_data_boxplot <- actual_flts_data %>%
mutate(Year = year(DateTime),Month = floor_date(DateTime, unit = "month")) %>%
group_by(Year, Month, Aircraft_Category) %>%
summarise(Flights=n(),Fuel_Uplift=mean(Fuel_Uplift, na.rm = T)) %>%
filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")
p4 <- flight_data_boxplot %>%select(c(Aircraft_Category, Flights)) %>%
ggplot(aes(x = Aircraft_Category, y = Flights, fill = Aircraft_Category)) +
geom_boxplot() +
labs(title = "Boxplots of No. of Flights", x = "Category", y = "No. of Flights") +
theme_minimal()
plotly_plot4 <- ggplotly(p4)
plotly_plot4
p5 <- flight_data_boxplot %>%select(c(Aircraft_Category, Fuel_Uplift)) %>%
ggplot(aes(x = Aircraft_Category, y = Fuel_Uplift, fill = Aircraft_Category)) +
geom_boxplot() +
labs(title = "Boxplots of Fuel Mgmt Efficiency", x = "Category", y = "Amount of Fuel_uplift") +
theme_minimal()
plotly_plot5 <- ggplotly(p5)
plotly_plot5
By aircraft types, Jets have the highest median in terms of flight counts and also fuel uplift
It also had the biggest spread. This is in-line with jets typically having longer routes forcing them to carry more fuel which due to the higher risks, requires higher safety standards than smaller Piston or Turboprops engines.
It’s really hard to tell which aircraft types or even airline operators for that matter had better fuel management in place due to a variety of factors such as safety, flight routes etc without knowing all the relevant travelling specifics
fuel_uplift_stats <- actual_flts_data %>%
group_by(Aircraft_Category) %>%
summarise(
Mean_Fuel_uplift = mean(Fuel_Uplift,na.rm = TRUE),
Median_Fuel_uplift = median(Fuel_Uplift,na.rm = TRUE),
Min_Fuel_uplift = min(Fuel_Uplift,na.rm = TRUE),
Max_Fuel_uplift = max(Fuel_Uplift,na.rm = TRUE),
SD_Fuel_uplift = sd(Fuel_Uplift,na.rm = TRUE)) %>%
filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")
fuel_uplift_stats
## # A tibble: 3 × 6
## Aircraft_Category Mean_Fuel_uplift Median_Fuel_uplift Min_Fu…¹ Max_F…² SD_Fu…³
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Jet 455. 250 50 5800 486.
## 2 Piston 82.6 50 30 650 66.8
## 3 Turboprop 106. 100 50 1000 82.9
## # … with abbreviated variable names ¹Min_Fuel_uplift, ²Max_Fuel_uplift,
## # ³SD_Fuel_uplift
Max and min seem extreme and not indicative of the sample size
Mean and SD both look similar: Mean was chosen as a benchmark metric going forward
p4 <- fuel_uplift_stats %>%
select(Aircraft_Category, Mean_Fuel_uplift, Min_Fuel_uplift,Max_Fuel_uplift,SD_Fuel_uplift) %>%
pivot_longer(cols=-Aircraft_Category, names_to ="Stats", values_to = 'Value') %>%
ggplot(aes(x = Stats, y = Value, fill = Stats)) +
geom_bar(stat = "identity") +
labs(title = "Aircraft Category Stats Plot", x = "Statistics", y = "Value") +
theme_minimal()
plotly_plot4 <- ggplotly(p4)
plotly_plot4
fuel_uplift_stats_year <- actual_flts_data %>% mutate(Year = year(DateTime) ) %>%
group_by(Year,Aircraft_Category) %>%
summarise(
Mean_Fuel_uplift = mean(Fuel_Uplift,na.rm = TRUE),
Median_Fuel_uplift = median(Fuel_Uplift,na.rm = TRUE),
Min_Fuel_uplift = min(Fuel_Uplift,na.rm = TRUE),
Max_Fuel_uplift = max(Fuel_Uplift,na.rm = TRUE),
SD_Fuel_uplift = sd(Fuel_Uplift,na.rm = TRUE)) %>%
filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")
fuel_uplift_stats_year
## # A tibble: 15 × 7
## # Groups: Year [5]
## Year Aircraft_Category Mean_Fuel_uplift Median_Fue…¹ Min_F…² Max_F…³ SD_Fu…⁴
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019 Jet 411. 250 50 3900 406.
## 2 2019 Piston 87.4 50 40 550 74.6
## 3 2019 Turboprop 112. 100 50 650 87.5
## 4 2020 Jet 431. 250 50 5800 484.
## 5 2020 Piston 81.9 50 30 650 70.7
## 6 2020 Turboprop 108. 100 50 540 77.8
## 7 2021 Jet 483. 300 50 4800 511.
## 8 2021 Piston 76.1 50 40 450 68.0
## 9 2021 Turboprop 100. 50 50 1000 84.3
## 10 2022 Jet 483. 350 50 5000 524.
## 11 2022 Piston 86.7 50 40 300 59.7
## 12 2022 Turboprop 107. 100 50 540 81.9
## 13 2023 Jet 483. 300 50 4700 530.
## 14 2023 Piston 74.8 50 40 200 47.7
## 15 2023 Turboprop 100. 50 50 540 81.2
## # … with abbreviated variable names ¹Median_Fuel_uplift, ²Min_Fuel_uplift,
## # ³Max_Fuel_uplift, ⁴SD_Fuel_uplift
p5 <- fuel_uplift_stats_year %>%
select(Year, Aircraft_Category, Mean_Fuel_uplift)%>%
ggplot(aes(x = Year, y = Mean_Fuel_uplift, group = Aircraft_Category, color = Aircraft_Category)) +
geom_line() +
facet_wrap(~Aircraft_Category, scales = "free_y") +
labs(title = "Yearly Line Chart of fuel uplift by Aircraft Types", x = "Time", y = "Mean fuel uplift") +
theme_minimal()
plotly_plot5 <- ggplotly(p5)
plotly_plot5
Seem like jets had huge rate of increase in fuel uplift from 19’ to 21’ and then flattened out from 22’ thru 23’
Meanwhile, piston and turboprops followed similar patterns; dropping in 21’ and then increasing again in 22’ almost in a see-saw pattern
Does Manufacturer Name has any impact?
To see if past visits/company visit have any relationship with each other when group by Manufacturers
#Note: I assumed visits means repairs?
filtered_data <- actual_flts_data %>%filter(Past_Visit == Company_Visits) %>%
group_by(Manufacturer_Name,Past_Visit,Operational_Category) %>%
summarise(Operational_Category=n()) %>%
na.omit()
filtered_data
## # A tibble: 1,052 × 3
## # Groups: Manufacturer_Name, Past_Visit [952]
## Manufacturer_Name Past_Visit Operational_Category
## <chr> <dbl> <int>
## 1 ASTRA 2 1
## 2 ASTRA 13 1
## 3 ASTRA 14 1
## 4 ASTRA 16 1
## 5 BAE SYSTEMS PLC 2 2
## 6 BEECH 1 4
## 7 BEECH 2 6
## 8 BEECH 3 1
## 9 BEECH 4 3
## 10 BEECH 6 2
## # … with 1,042 more rows
Bar charts at first glance, gave a visual impression that indicated Operational Category has a positive relationship with Past visits
Meaning as the number of vists increases/decreases, each operational category numbers followed suit. This mistakenly would have misled an analyst to make inferences between the number of operational categories and the number of visits
p6 <- filtered_data%>%
pivot_longer(cols = starts_with(c("Past_Visit","Operational_Category")), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Manufacturer_Name, y = Value, fill = Variable)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~Variable, scales = "free_y") +
labs(title = "Bar Charts by Operational Category and Past Visits", x = "Manufacturer_Name", y = "Counts") +theme(axis.text.x = element_text(angle = -90,hjust = 1))+
coord_flip()+
theme_minimal()
plotly_plot6 <- ggplotly(p6)
plotly_plot6
Interestingly enough, a correlation calculation showed its in fact slightly negative; meaning number of operational categories move in opposite direction to number of visits.
Lessons learned its better to robustly test than simply rely on visuals!!
# Calculate the correlation matrix
correlation_matrix <- cor(filtered_data[,2:3])
# Print the correlation matrix
print(correlation_matrix)
## Past_Visit Operational_Category
## Past_Visit 1.0000000 -0.1241611
## Operational_Category -0.1241611 1.0000000
#install.packages("corrplot")
corrplot(correlation_matrix, method = 'number',type='upper',bg="lightblue")
- The biggest takeaway for me was the data actually showed that in 20', a pandemic year, total flights plummeted and this was in-line with all the headline news that demand decreased sharply due to lockdowns and airlines retrenched
- It also showed that the rebound in demand was just as good pre-pandemic
- At a high level, it was hard to get more insights into fuel management efficacy by aircraft types and there was no way to infer (at least statistically) the direction of visits relative to operational categories
- Given more time and more numerical predictors, I would like to create a predictive model based on lasso regression or stepwise regression to statstically remove some of the less relevant independent variables and get a better and more streamlined data set to make better flight forecasts
- Lastly, given more time, 2 particular scenarios I would be interested in would be fuel management efficient w.r.t to MFR year and model types.