Libraries

library(rlang)
require(ggthemes)
library(tidyverse)
library(magrittr)
library(TTR)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(fpp2)   
library(forecast)
library(caTools)
library(reshape2)
library(psych) 
library(ggthemes)
library(readr)
library(readxl) 
#install.packages("skimr")
library(corrplot)
library(skimr)

Looking at all the variables of the data set and it’s structure

knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
#read in excel data
data <- read_excel("C:/teams/Desktop/Aero/Project Barracuda - Financial Analysis_v01.xlsx",sheet="GEG_Raw Data")
#check data structure
print('data structure')
## [1] "data structure"
skim(data )
Data summary
Name data
Number of rows 10785
Number of columns 46
_______________________
Column type frequency:
character 40
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
IATA 10178 0.06 2 2 0 10 0
ICAO 9044 0.16 3 3 0 60 0
Flight 0 1.00 3 7 0 3358 0
Aircraft Type 598 0.94 2 4 0 226 0
Aircraft Category 598 0.94 3 10 0 5 0
Tail 362 0.97 2 7 0 3024 0
Based (Y/N) 0 1.00 2 6 0 10 0
From/To 0 1.00 1 9 0 629 0
Operator 861 0.92 6 43 0 1853 0
Operator (dba) 804 0.93 6 43 0 1727 0
PH Name 3729 0.65 6 40 0 1133 0
PH Address1 3729 0.65 7 40 0 694 0
PH Address2 3799 0.65 1 40 0 763 0
PH City 3731 0.65 3 15 0 401 0
PH Con. 1 FN 5743 0.47 2 13 0 267 0
PH Con. 1 LN 5743 0.47 3 14 0 519 0
PH Con. 1 Title 5752 0.47 5 20 0 90 0
PH Con. 1 Email 6196 0.43 10 38 0 416 0
PH Con. 2 FN 9035 0.16 2 13 0 126 0
PH Con. 2 LN 9035 0.16 1 10 0 185 0
PH Con. 2 Title 9035 0.16 3 20 0 60 0
PH Con. 2 Email 9227 0.14 13 32 0 157 0
Apcode 4115 0.62 3 4 0 403 0
Operator Name 10051 0.07 6 40 0 136 0
Operator Address1 10076 0.07 8 40 0 122 0
Operator Address2 10394 0.04 7 40 0 59 0
Operator City 10078 0.07 4 15 0 96 0
Operator Con. 1 FN 10076 0.07 3 14 0 98 0
Operator Con. 1 LN 10076 0.07 3 14 0 115 0
Operator Con. 1 Title 10076 0.07 5 20 0 54 0
Operator Con. 1 Email 10350 0.04 13 35 0 64 0
Operator Con. 2 FN 10294 0.05 3 13 0 62 0
Operator Con. 2 LN 10294 0.05 3 14 0 71 0
Operator Con. 2 Title 10294 0.05 5 20 0 47 0
Operator Con. 2 Email 10616 0.02 13 32 0 36 0
Reg. Owner Country 862 0.92 2 14 0 9 0
Serial Number 861 0.92 1 18 0 2291 0
Manufacturer Name 861 0.92 4 30 0 115 0
Model 861 0.92 1 20 0 432 0
Operational Category 2820 0.74 7 15 0 15 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Fuel Uplift 2876 0.73 327.82 439.95 30 100 200 450.00 6300 ▇▁▁▁▁
Past Visits 5182 0.52 69.55 103.07 1 4 28 75.00 471 ▇▁▁▁▁
Company Visits 5127 0.52 84.74 121.95 1 7 45 80.75 594 ▇▁▁▁▁
MFR Year 1051 0.90 2003.85 12.89 1942 2000 2006 2014.00 2023 ▁▁▂▆▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
Date and Time 0 1 2019-01-01 00:00:00 2023-07-18 00:00:00 2021-02-17 00:00:00 1634
Departed 0 1 1899-12-31 00:00:21 1899-12-31 23:59:28 1899-12-31 15:12:05 4253

High level summary of data set:

  • Its a time series data set with a datetime column and 39 other ‘factors’; some of which could be utilised as predictors

  • Total number of 40 variables

  • Total number of rows 10785

  • Quite a number of variables contain null or blanks making it not useful for prediction purposes

  • Only 4 variables are numerical: Fuel Uplift, Past Visits and Company Visits, MFR Year

Data cleansing

  • Renaming certain column headers for better readability

Wrangling original data to filter out flights that have origination/destination only

#flights that actually flew from Spokane to another destination
# Specify the strings you want to exclude
strings_to_exclude <- c("KGEG/", "/","GEG")
# Use filter() to exclude rows
actual_flts_data <- data %>%
  filter(!(`From/To` %in% strings_to_exclude))

actual_flts <- actual_flts_data %>% 
  rename(act_flts=`From/To`) %>% 
  mutate(Category='From/To') %>% 
  select(Category,act_flts) %>% 
  group_by(Category) %>% 
  summarise(no_aircrafts=n())

#flight that did not take flights at all
no_flts_data <- data %>%
  filter((`From/To` %in% strings_to_exclude))

no_flts <- no_flts_data %>% 
  rename(no_flts=`From/To`) %>% 
  mutate(Category='From/To') %>% 
  select(Category,no_flts) %>% 
  group_by(Category) %>% 
  summarise(no_aircrafts=n())

# Combine data frames with a source column
df1_combined <- actual_flts %>% mutate(Source = "actual_flts")
df2_combined <- no_flts %>% mutate(Source = "no_flts")

# Combine into a single data frame
combined_df <- bind_rows(df1_combined, df2_combined)

# Create a side-by-side barplot
p1 <- ggplot(combined_df, aes(x = Category, y = no_aircrafts, fill = Source)) +
  geom_bar(stat = "identity", position = position_dodge(width = 1)) +
  labs(title = "Actual Flts vs. Grounded Flts", x = "Category", y = "No. of Flights") +
  geom_text(aes(label = no_aircrafts), vjust = -.25,size = 4)+
  scale_fill_manual(values = c("no_flts" = "blue", "actual_flts" = "red"))

plotly_plot <- ggplotly(p1)
plotly_plot

Observation 1:

  • As shown in the bar chart, 1639 events when aircraft did not take off while 9146 actually did using the above “assumptions” which is about 15% of total flights

Performed statistical analysis on flights that actually took off

total flights overall by years only

flight_data<- actual_flts_data %>% 
  group_by(Year = year(DateTime) )%>% 
             summarise(Flights=n())

# Create a bar chart
p2 <- ggplot(flight_data, aes(x = Year, y = Flights)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(
    title = "Number of Flights Over 5 Years",
    x = "Year",
    y = "Number of Flights"
  )
plotly_plot2 <- ggplotly(p2)
plotly_plot2

Total flights overall over 5 yrs by months

# Break down by year to see if there is seasonality and/or trends in actual flights
flight_data_season<- actual_flts_data %>% 
  group_by(Month = floor_date(DateTime, unit = "month")) %>% 
             summarise(Flights=n())

# Plot the line chart
p3 <- ggplot(flight_data_season, aes(x = Month, y = Flights)) +
  geom_line() +
  labs(title = "Total Flights over 5 years by Months", x = "Month", y = "No. of Flights") +
  scale_x_datetime(date_labels = "%Y-%m", date_breaks = "1 month") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plotly_plot3 <- ggplotly(p3)
plotly_plot3

Time series Forecasting with ARIMA Auto-Parameter select mode

arima_model <- auto.arima(flight_data_season[,2])
print(summary(arima_model))
## Series: flight_data_season[, 2] 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1      mean
##       0.6768  164.9958
## s.e.  0.0990   12.1577
## 
## sigma^2 = 946.5:  log likelihood = -265.78
## AIC=537.56   AICc=538.03   BIC=543.59
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.3759777 30.20128 24.22793 -5.544682 17.84585 0.9412288
##                    ACF1
## Training set 0.03238102
fcast <- forecast(arima_model,h=12)
print(fcast)
##    Point Forecast     Lo 80    Hi 80    Lo 95    Hi 95
## 56       127.0989  87.67086 166.5269 66.79893 187.3988
## 57       139.3478  91.73890 186.9568 66.53624 212.1595
## 58       147.6377  96.71866 198.5568 69.76373 225.5117
## 59       153.2482 100.88278 205.6136 73.16221 233.3341
## 60       157.0452 104.03053 210.0599 75.96625 238.1242
## 61       159.6150 106.30554 212.9244 78.08523 241.1448
## 62       161.3542 107.91025 214.7981 79.61875 243.0896
## 63       162.5312 109.02582 216.0366 80.70177 244.3606
## 64       163.3278 109.79428 216.8613 81.45534 245.2003
## 65       163.8669 110.32052 217.4133 81.97477 245.7591
## 66       164.2318 110.67949 217.7841 82.33061 246.1330
## 67       164.4787 110.92373 218.0337 82.57342 246.3841
plot(fcast)

Observation 2:

  • The pandemic year of 20’ was the low point of total overall flights

  • Here’s an interesting observation, overall flights in 22’ came roaring back after the pre-pandemic year of 20’ almost on par with 19’ which saw the highest travelling season in the last 5 years

  • Peeling deeper, at the height of pandemic lock-down on March-04 2020, there were only 50 actual flights that took place from Spokane!

  • There’s a trend here for any given year: One can discern from the chart that august is the busiest travelling month of the year (ex year 20’) for all years (Please hover over the month of interest on chart to see the actual numbers)

  • Lastly, an automated ARIMA forecast was utilized just to see how the next 12 periods of flight traffic would look like. This was a simple illustration, in reality, a more robust process could be implemented with other algorithms like Holts-Winter or more sophisticated driver-based models such as lasso or stepwise regression

  • Note: 23’ is not a full year’s worth of flights, so its excluded from comparision

Boxplots of flights

flight_data_boxplot <- actual_flts_data %>% 
  mutate(Year = year(DateTime),Month = floor_date(DateTime, unit = "month")) %>% 
  group_by(Year, Month, Aircraft_Category) %>% 
  summarise(Flights=n(),Fuel_Uplift=mean(Fuel_Uplift, na.rm = T)) %>% 
  filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")

p4 <- flight_data_boxplot %>%select(c(Aircraft_Category, Flights)) %>% 
  ggplot(aes(x = Aircraft_Category, y = Flights, fill = Aircraft_Category)) +
  geom_boxplot() +
  labs(title = "Boxplots of No. of Flights", x = "Category", y = "No. of Flights") +
  theme_minimal()

plotly_plot4 <- ggplotly(p4)
plotly_plot4

Finally, Boxplots of fuel uplift

p5 <- flight_data_boxplot %>%select(c(Aircraft_Category, Fuel_Uplift)) %>% 
  ggplot(aes(x = Aircraft_Category, y = Fuel_Uplift, fill = Aircraft_Category)) +
  geom_boxplot() +
  labs(title = "Boxplots of Fuel Mgmt Efficiency", x = "Category", y = "Amount of Fuel_uplift") +
  theme_minimal()

plotly_plot5 <- ggplotly(p5)
plotly_plot5

Observation 3:

  • By aircraft types, Jets have the highest median in terms of flight counts and also fuel uplift

  • It also had the biggest spread. This is in-line with jets typically having longer routes forcing them to carry more fuel which due to the higher risks, requires higher safety standards than smaller Piston or Turboprops engines.

  • It’s really hard to tell which aircraft types or even airline operators for that matter had better fuel management in place due to a variety of factors such as safety, flight routes etc without knowing all the relevant travelling specifics

Fuel Management statistics

fuel_uplift_stats <- actual_flts_data %>% 
  group_by(Aircraft_Category) %>% 
   summarise(
    Mean_Fuel_uplift = mean(Fuel_Uplift,na.rm = TRUE),
    Median_Fuel_uplift = median(Fuel_Uplift,na.rm = TRUE),
    Min_Fuel_uplift = min(Fuel_Uplift,na.rm = TRUE),
    Max_Fuel_uplift = max(Fuel_Uplift,na.rm = TRUE),
    SD_Fuel_uplift = sd(Fuel_Uplift,na.rm = TRUE)) %>% 
  filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")

fuel_uplift_stats
## # A tibble: 3 × 6
##   Aircraft_Category Mean_Fuel_uplift Median_Fuel_uplift Min_Fu…¹ Max_F…² SD_Fu…³
##   <chr>                        <dbl>              <dbl>    <dbl>   <dbl>   <dbl>
## 1 Jet                          455.                 250       50    5800   486. 
## 2 Piston                        82.6                 50       30     650    66.8
## 3 Turboprop                    106.                 100       50    1000    82.9
## # … with abbreviated variable names ¹​Min_Fuel_uplift, ²​Max_Fuel_uplift,
## #   ³​SD_Fuel_uplift

Trying to figure which stats metric is most relevant to use?

p4 <- fuel_uplift_stats %>%
  select(Aircraft_Category, Mean_Fuel_uplift, Min_Fuel_uplift,Max_Fuel_uplift,SD_Fuel_uplift) %>%
  pivot_longer(cols=-Aircraft_Category, names_to ="Stats", values_to = 'Value') %>% 
  ggplot(aes(x = Stats, y = Value, fill = Stats)) +
  geom_bar(stat = "identity") +
  labs(title = "Aircraft Category Stats Plot", x = "Statistics", y = "Value") +
  theme_minimal()

plotly_plot4 <- ggplotly(p4)
plotly_plot4

Observation 4:

Looks like Mean is the way to go!

fuel_uplift_stats_year <- actual_flts_data %>% mutate(Year = year(DateTime) ) %>% 
  group_by(Year,Aircraft_Category) %>% 
   summarise(
    Mean_Fuel_uplift = mean(Fuel_Uplift,na.rm = TRUE),
    Median_Fuel_uplift = median(Fuel_Uplift,na.rm = TRUE),
    Min_Fuel_uplift = min(Fuel_Uplift,na.rm = TRUE),
    Max_Fuel_uplift = max(Fuel_Uplift,na.rm = TRUE),
    SD_Fuel_uplift = sd(Fuel_Uplift,na.rm = TRUE)) %>% 
  filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")
fuel_uplift_stats_year
## # A tibble: 15 × 7
## # Groups:   Year [5]
##     Year Aircraft_Category Mean_Fuel_uplift Median_Fue…¹ Min_F…² Max_F…³ SD_Fu…⁴
##    <dbl> <chr>                        <dbl>        <dbl>   <dbl>   <dbl>   <dbl>
##  1  2019 Jet                          411.           250      50    3900   406. 
##  2  2019 Piston                        87.4           50      40     550    74.6
##  3  2019 Turboprop                    112.           100      50     650    87.5
##  4  2020 Jet                          431.           250      50    5800   484. 
##  5  2020 Piston                        81.9           50      30     650    70.7
##  6  2020 Turboprop                    108.           100      50     540    77.8
##  7  2021 Jet                          483.           300      50    4800   511. 
##  8  2021 Piston                        76.1           50      40     450    68.0
##  9  2021 Turboprop                    100.            50      50    1000    84.3
## 10  2022 Jet                          483.           350      50    5000   524. 
## 11  2022 Piston                        86.7           50      40     300    59.7
## 12  2022 Turboprop                    107.           100      50     540    81.9
## 13  2023 Jet                          483.           300      50    4700   530. 
## 14  2023 Piston                        74.8           50      40     200    47.7
## 15  2023 Turboprop                    100.            50      50     540    81.2
## # … with abbreviated variable names ¹​Median_Fuel_uplift, ²​Min_Fuel_uplift,
## #   ³​Max_Fuel_uplift, ⁴​SD_Fuel_uplift

Separated fuel uplift stats (using mean values)

p5 <-  fuel_uplift_stats_year %>%
  select(Year, Aircraft_Category, Mean_Fuel_uplift)%>%
  ggplot(aes(x = Year, y = Mean_Fuel_uplift, group = Aircraft_Category, color = Aircraft_Category)) +
  geom_line() +
  facet_wrap(~Aircraft_Category, scales = "free_y") +
  labs(title = "Yearly Line Chart of fuel uplift by Aircraft Types", x = "Time", y = "Mean fuel uplift") +
  theme_minimal()
plotly_plot5 <- ggplotly(p5)
plotly_plot5

Observation 5:

  • Seem like jets had huge rate of increase in fuel uplift from 19’ to 21’ and then flattened out from 22’ thru 23’

  • Meanwhile, piston and turboprops followed similar patterns; dropping in 21’ and then increasing again in 22’ almost in a see-saw pattern

Other Misc items of interest:

  • Does Manufacturer Name has any impact?

  • To see if past visits/company visit have any relationship with each other when group by Manufacturers

#Note: I assumed visits means repairs?

filtered_data <- actual_flts_data %>%filter(Past_Visit == Company_Visits) %>% 
  group_by(Manufacturer_Name,Past_Visit,Operational_Category) %>% 
  summarise(Operational_Category=n()) %>% 
  na.omit()
filtered_data
## # A tibble: 1,052 × 3
## # Groups:   Manufacturer_Name, Past_Visit [952]
##    Manufacturer_Name Past_Visit Operational_Category
##    <chr>                  <dbl>                <int>
##  1 ASTRA                      2                    1
##  2 ASTRA                     13                    1
##  3 ASTRA                     14                    1
##  4 ASTRA                     16                    1
##  5 BAE SYSTEMS PLC            2                    2
##  6 BEECH                      1                    4
##  7 BEECH                      2                    6
##  8 BEECH                      3                    1
##  9 BEECH                      4                    3
## 10 BEECH                      6                    2
## # … with 1,042 more rows

correlational study between Past Vist and Operational Category

Observation 6a:

  • Bar charts at first glance, gave a visual impression that indicated Operational Category has a positive relationship with Past visits

  • Meaning as the number of vists increases/decreases, each operational category numbers followed suit. This mistakenly would have misled an analyst to make inferences between the number of operational categories and the number of visits

p6 <- filtered_data%>%
  pivot_longer(cols = starts_with(c("Past_Visit","Operational_Category")), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Manufacturer_Name, y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~Variable, scales = "free_y") +
  labs(title = "Bar Charts by Operational Category and Past Visits", x = "Manufacturer_Name", y = "Counts") +theme(axis.text.x = element_text(angle = -90,hjust = 1))+
  coord_flip()+
  theme_minimal()
plotly_plot6 <- ggplotly(p6)
plotly_plot6

Observation 6b:

  • Interestingly enough, a correlation calculation showed its in fact slightly negative; meaning number of operational categories move in opposite direction to number of visits.

  • Lessons learned its better to robustly test than simply rely on visuals!!

# Calculate the correlation matrix 
correlation_matrix <- cor(filtered_data[,2:3])

# Print the correlation matrix
print(correlation_matrix)
##                      Past_Visit Operational_Category
## Past_Visit            1.0000000           -0.1241611
## Operational_Category -0.1241611            1.0000000
#install.packages("corrplot")

corrplot(correlation_matrix, method = 'number',type='upper',bg="lightblue")

Summary and Conclusions:


- The biggest takeaway for me was the data actually showed that in 20', a pandemic year, total flights plummeted and this was in-line with all the headline news that demand decreased sharply due to lockdowns and airlines retrenched

- It also showed that the rebound in demand was just as good pre-pandemic 

- At a high level, it was hard to get more insights into fuel management efficacy by aircraft types and there was no way to infer (at least statistically) the direction of visits relative to operational categories

- Given more time and more numerical predictors, I would like to create a predictive model based on lasso regression or stepwise regression to statstically remove some of the less relevant independent variables and get a better and more streamlined data set to make better flight forecasts

- Lastly, given more time, 2 particular scenarios I would be interested in would be fuel management efficient w.r.t to MFR year and model types.