Libraries

Loading all the pertinent libraries

library(rlang)
require(ggthemes)
library(tidyverse)
library(magrittr)
library(TTR)
library(tidyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(fpp2)   
library(forecast)
library(caTools)
library(reshape2)
library(psych) 
library(ggthemes)
library(readr)
library(readxl) 
#install.packages("skimr")
library(corrplot)
library(skimr)

Looking at all the variables of the data set and it’s structure

knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
#read in excel data
data <- read_excel("C:/teams/Desktop/Aero/Project Barracuda - Financial Analysis_v01.xlsx",sheet="GEG_Raw Data")
#check data structure
print('data structure')

## [1] "data structure"

skim(data )

Data summary
Name	data
Number of rows	10785
Number of columns	46
_______________________
Column type frequency:
character	40
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
IATA	10178	0.06	2	2	10
ICAO	9044	0.16	3	3	60
Flight	0	1.00	3	7	3358
Aircraft Type	598	0.94	2	4	226
Aircraft Category	598	0.94	3	10	5
Tail	362	0.97	2	7	3024
Based (Y/N)	0	1.00	2	6	10
From/To	0	1.00	1	9	629
Operator	861	0.92	6	43	1853
Operator (dba)	804	0.93	6	43	1727
PH Name	3729	0.65	6	40	1133
PH Address1	3729	0.65	7	40	694
PH Address2	3799	0.65	1	40	763
PH City	3731	0.65	3	15	401
PH Con. 1 FN	5743	0.47	2	13	267
PH Con. 1 LN	5743	0.47	3	14	519
PH Con. 1 Title	5752	0.47	5	20	90
PH Con. 1 Email	6196	0.43	10	38	416
PH Con. 2 FN	9035	0.16	2	13	126
PH Con. 2 LN	9035	0.16	1	10	185
PH Con. 2 Title	9035	0.16	3	20	60
PH Con. 2 Email	9227	0.14	13	32	157
Apcode	4115	0.62	3	4	403
Operator Name	10051	0.07	6	40	136
Operator Address1	10076	0.07	8	40	122
Operator Address2	10394	0.04	7	40	59
Operator City	10078	0.07	4	15	96
Operator Con. 1 FN	10076	0.07	3	14	98
Operator Con. 1 LN	10076	0.07	3	14	115
Operator Con. 1 Title	10076	0.07	5	20	54
Operator Con. 1 Email	10350	0.04	13	35	64
Operator Con. 2 FN	10294	0.05	3	13	62
Operator Con. 2 LN	10294	0.05	3	14	71
Operator Con. 2 Title	10294	0.05	5	20	47
Operator Con. 2 Email	10616	0.02	13	32	36
Reg. Owner Country	862	0.92	2	14	9
Serial Number	861	0.92	1	18	2291
Manufacturer Name	861	0.92	4	30	115
Model	861	0.92	1	20	432
Operational Category	2820	0.74	7	15	15

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Fuel Uplift	2876	0.73	327.82	439.95	30	100	200	450.00	6300	▇▁▁▁▁
Past Visits	5182	0.52	69.55	103.07	1	4	28	75.00	471	▇▁▁▁▁
Company Visits	5127	0.52	84.74	121.95	1	7	45	80.75	594	▇▁▁▁▁
MFR Year	1051	0.90	2003.85	12.89	1942	2000	2006	2014.00	2023	▁▁▂▆▇

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
Date and Time	0	1	2019-01-01 00:00:00	2023-07-18 00:00:00	2021-02-17 00:00:00	1634
Departed	0	1	1899-12-31 00:00:21	1899-12-31 23:59:28	1899-12-31 15:12:05	4253

High level summary of data set:

Its a time series data set with a datetime column and 39 other ‘factors’; some of which could be utilised as predictors
Total number of 40 variables
Total number of rows 10785
Quite a number of variables contain null or blanks making it not useful for prediction purposes
Only 4 variables are numerical: Fuel Uplift, Past Visits and Company Visits, MFR Year

Data cleansing

Renaming certain column headers for better readability

Wrangling original data to filter out flights that have origination/destination only

Assumption is: Flights without “origination/destination” are aircrafts that were grounded for some reasons
Going forward in this analysis, we will only be looking at flights that actually took place as designated by a “Origin” followed by “Destination

#flights that actually flew from Spokane to another destination
# Specify the strings you want to exclude
strings_to_exclude <- c("KGEG/", "/","GEG")
# Use filter() to exclude rows
actual_flts_data <- data %>%
  filter(!(`From/To` %in% strings_to_exclude))

actual_flts <- actual_flts_data %>% 
  rename(act_flts=`From/To`) %>% 
  mutate(Category='From/To') %>% 
  select(Category,act_flts) %>% 
  group_by(Category) %>% 
  summarise(no_aircrafts=n())

#flight that did not take flights at all
no_flts_data <- data %>%
  filter((`From/To` %in% strings_to_exclude))

no_flts <- no_flts_data %>% 
  rename(no_flts=`From/To`) %>% 
  mutate(Category='From/To') %>% 
  select(Category,no_flts) %>% 
  group_by(Category) %>% 
  summarise(no_aircrafts=n())

# Combine data frames with a source column
df1_combined <- actual_flts %>% mutate(Source = "actual_flts")
df2_combined <- no_flts %>% mutate(Source = "no_flts")

# Combine into a single data frame
combined_df <- bind_rows(df1_combined, df2_combined)

# Create a side-by-side barplot
p1 <- ggplot(combined_df, aes(x = Category, y = no_aircrafts, fill = Source)) +
  geom_bar(stat = "identity", position = position_dodge(width = 1)) +
  labs(title = "Actual Flts vs. Grounded Flts", x = "Category", y = "No. of Flights") +
  geom_text(aes(label = no_aircrafts), vjust = -.25,size = 4)+
  scale_fill_manual(values = c("no_flts" = "blue", "actual_flts" = "red"))

plotly_plot <- ggplotly(p1)
plotly_plot

Observation 1:

As shown in the bar chart, 1639 events when aircraft did not take off while 9146 actually did using the above “assumptions” which is about 15% of total flights

Performed statistical analysis on flights that actually took off

total flights overall by years only

flight_data<- actual_flts_data %>% 
  group_by(Year = year(DateTime) )%>% 
             summarise(Flights=n())

# Create a bar chart
p2 <- ggplot(flight_data, aes(x = Year, y = Flights)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(
    title = "Number of Flights Over 5 Years",
    x = "Year",
    y = "Number of Flights"
  )
plotly_plot2 <- ggplotly(p2)
plotly_plot2

Total flights overall over 5 yrs by months

# Break down by year to see if there is seasonality and/or trends in actual flights
flight_data_season<- actual_flts_data %>% 
  group_by(Month = floor_date(DateTime, unit = "month")) %>% 
             summarise(Flights=n())

# Plot the line chart
p3 <- ggplot(flight_data_season, aes(x = Month, y = Flights)) +
  geom_line() +
  labs(title = "Total Flights over 5 years by Months", x = "Month", y = "No. of Flights") +
  scale_x_datetime(date_labels = "%Y-%m", date_breaks = "1 month") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plotly_plot3 <- ggplotly(p3)
plotly_plot3

Time series Forecasting with ARIMA Auto-Parameter select mode

arima_model <- auto.arima(flight_data_season[,2])
print(summary(arima_model))

## Series: flight_data_season[, 2] 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1      mean
##       0.6768  164.9958
## s.e.  0.0990   12.1577
## 
## sigma^2 = 946.5:  log likelihood = -265.78
## AIC=537.56   AICc=538.03   BIC=543.59
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.3759777 30.20128 24.22793 -5.544682 17.84585 0.9412288
##                    ACF1
## Training set 0.03238102

fcast <- forecast(arima_model,h=12)
print(fcast)

##    Point Forecast     Lo 80    Hi 80    Lo 95    Hi 95
## 56       127.0989  87.67086 166.5269 66.79893 187.3988
## 57       139.3478  91.73890 186.9568 66.53624 212.1595
## 58       147.6377  96.71866 198.5568 69.76373 225.5117
## 59       153.2482 100.88278 205.6136 73.16221 233.3341
## 60       157.0452 104.03053 210.0599 75.96625 238.1242
## 61       159.6150 106.30554 212.9244 78.08523 241.1448
## 62       161.3542 107.91025 214.7981 79.61875 243.0896
## 63       162.5312 109.02582 216.0366 80.70177 244.3606
## 64       163.3278 109.79428 216.8613 81.45534 245.2003
## 65       163.8669 110.32052 217.4133 81.97477 245.7591
## 66       164.2318 110.67949 217.7841 82.33061 246.1330
## 67       164.4787 110.92373 218.0337 82.57342 246.3841

plot(fcast)

Observation 2:

The pandemic year of 20’ was the low point of total overall flights
Here’s an interesting observation, overall flights in 22’ came roaring back after the pre-pandemic year of 20’ almost on par with 19’ which saw the highest travelling season in the last 5 years
Peeling deeper, at the height of pandemic lock-down on March-04 2020, there were only 50 actual flights that took place from Spokane!
There’s a trend here for any given year: One can discern from the chart that august is the busiest travelling month of the year (ex year 20’) for all years (Please hover over the month of interest on chart to see the actual numbers)
Lastly, an automated ARIMA forecast was utilized just to see how the next 12 periods of flight traffic would look like. This was a simple illustration, in reality, a more robust process could be implemented with other algorithms like Holts-Winter or more sophisticated driver-based models such as lasso or stepwise regression
Note: 23’ is not a full year’s worth of flights, so its excluded from comparision

Boxplots of flights

flight_data_boxplot <- actual_flts_data %>% 
  mutate(Year = year(DateTime),Month = floor_date(DateTime, unit = "month")) %>% 
  group_by(Year, Month, Aircraft_Category) %>% 
  summarise(Flights=n(),Fuel_Uplift=mean(Fuel_Uplift, na.rm = T)) %>% 
  filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")

p4 <- flight_data_boxplot %>%select(c(Aircraft_Category, Flights)) %>% 
  ggplot(aes(x = Aircraft_Category, y = Flights, fill = Aircraft_Category)) +
  geom_boxplot() +
  labs(title = "Boxplots of No. of Flights", x = "Category", y = "No. of Flights") +
  theme_minimal()

plotly_plot4 <- ggplotly(p4)
plotly_plot4

Finally, Boxplots of fuel uplift

p5 <- flight_data_boxplot %>%select(c(Aircraft_Category, Fuel_Uplift)) %>% 
  ggplot(aes(x = Aircraft_Category, y = Fuel_Uplift, fill = Aircraft_Category)) +
  geom_boxplot() +
  labs(title = "Boxplots of Fuel Mgmt Efficiency", x = "Category", y = "Amount of Fuel_uplift") +
  theme_minimal()

plotly_plot5 <- ggplotly(p5)
plotly_plot5

Observation 3:

By aircraft types, Jets have the highest median in terms of flight counts and also fuel uplift
It also had the biggest spread. This is in-line with jets typically having longer routes forcing them to carry more fuel which due to the higher risks, requires higher safety standards than smaller Piston or Turboprops engines.
It’s really hard to tell which aircraft types or even airline operators for that matter had better fuel management in place due to a variety of factors such as safety, flight routes etc without knowing all the relevant travelling specifics

Fuel Management statistics

fuel_uplift_stats <- actual_flts_data %>% 
  group_by(Aircraft_Category) %>% 
   summarise(
    Mean_Fuel_uplift = mean(Fuel_Uplift,na.rm = TRUE),
    Median_Fuel_uplift = median(Fuel_Uplift,na.rm = TRUE),
    Min_Fuel_uplift = min(Fuel_Uplift,na.rm = TRUE),
    Max_Fuel_uplift = max(Fuel_Uplift,na.rm = TRUE),
    SD_Fuel_uplift = sd(Fuel_Uplift,na.rm = TRUE)) %>% 
  filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")

fuel_uplift_stats

## # A tibble: 3 × 6
##   Aircraft_Category Mean_Fuel_uplift Median_Fuel_uplift Min_Fu…¹ Max_F…² SD_Fu…³
##   <chr>                        <dbl>              <dbl>    <dbl>   <dbl>   <dbl>
## 1 Jet                          455.                 250       50    5800   486. 
## 2 Piston                        82.6                 50       30     650    66.8
## 3 Turboprop                    106.                 100       50    1000    82.9
## # … with abbreviated variable names ¹Min_Fuel_uplift, ²Max_Fuel_uplift,
## #   ³SD_Fuel_uplift

Trying to figure which stats metric is most relevant to use?

Max and min seem extreme and not indicative of the sample size
Mean and SD both look similar: Mean was chosen as a benchmark metric going forward

p4 <- fuel_uplift_stats %>%
  select(Aircraft_Category, Mean_Fuel_uplift, Min_Fuel_uplift,Max_Fuel_uplift,SD_Fuel_uplift) %>%
  pivot_longer(cols=-Aircraft_Category, names_to ="Stats", values_to = 'Value') %>% 
  ggplot(aes(x = Stats, y = Value, fill = Stats)) +
  geom_bar(stat = "identity") +
  labs(title = "Aircraft Category Stats Plot", x = "Statistics", y = "Value") +
  theme_minimal()

plotly_plot4 <- ggplotly(p4)
plotly_plot4

Observation 4:

Looks like Mean is the way to go!

Bucketed stats by year and aircraft types

fuel_uplift_stats_year <- actual_flts_data %>% mutate(Year = year(DateTime) ) %>% 
  group_by(Year,Aircraft_Category) %>% 
   summarise(
    Mean_Fuel_uplift = mean(Fuel_Uplift,na.rm = TRUE),
    Median_Fuel_uplift = median(Fuel_Uplift,na.rm = TRUE),
    Min_Fuel_uplift = min(Fuel_Uplift,na.rm = TRUE),
    Max_Fuel_uplift = max(Fuel_Uplift,na.rm = TRUE),
    SD_Fuel_uplift = sd(Fuel_Uplift,na.rm = TRUE)) %>% 
  filter(complete.cases(Aircraft_Category) & Aircraft_Category != "Unknown")
fuel_uplift_stats_year

## # A tibble: 15 × 7
## # Groups:   Year [5]
##     Year Aircraft_Category Mean_Fuel_uplift Median_Fue…¹ Min_F…² Max_F…³ SD_Fu…⁴
##    <dbl> <chr>                        <dbl>        <dbl>   <dbl>   <dbl>   <dbl>
##  1  2019 Jet                          411.           250      50    3900   406. 
##  2  2019 Piston                        87.4           50      40     550    74.6
##  3  2019 Turboprop                    112.           100      50     650    87.5
##  4  2020 Jet                          431.           250      50    5800   484. 
##  5  2020 Piston                        81.9           50      30     650    70.7
##  6  2020 Turboprop                    108.           100      50     540    77.8
##  7  2021 Jet                          483.           300      50    4800   511. 
##  8  2021 Piston                        76.1           50      40     450    68.0
##  9  2021 Turboprop                    100.            50      50    1000    84.3
## 10  2022 Jet                          483.           350      50    5000   524. 
## 11  2022 Piston                        86.7           50      40     300    59.7
## 12  2022 Turboprop                    107.           100      50     540    81.9
## 13  2023 Jet                          483.           300      50    4700   530. 
## 14  2023 Piston                        74.8           50      40     200    47.7
## 15  2023 Turboprop                    100.            50      50     540    81.2
## # … with abbreviated variable names ¹Median_Fuel_uplift, ²Min_Fuel_uplift,
## #   ³Max_Fuel_uplift, ⁴SD_Fuel_uplift

Separated fuel uplift stats (using mean values)

bucketed by year and aircraft types

p5 <-  fuel_uplift_stats_year %>%
  select(Year, Aircraft_Category, Mean_Fuel_uplift)%>%
  ggplot(aes(x = Year, y = Mean_Fuel_uplift, group = Aircraft_Category, color = Aircraft_Category)) +
  geom_line() +
  facet_wrap(~Aircraft_Category, scales = "free_y") +
  labs(title = "Yearly Line Chart of fuel uplift by Aircraft Types", x = "Time", y = "Mean fuel uplift") +
  theme_minimal()
plotly_plot5 <- ggplotly(p5)
plotly_plot5

Observation 5:

Seem like jets had huge rate of increase in fuel uplift from 19’ to 21’ and then flattened out from 22’ thru 23’
Meanwhile, piston and turboprops followed similar patterns; dropping in 21’ and then increasing again in 22’ almost in a see-saw pattern

Other Misc items of interest:

Does Manufacturer Name has any impact?
To see if past visits/company visit have any relationship with each other when group by Manufacturers

#Note: I assumed visits means repairs?

filtered_data <- actual_flts_data %>%filter(Past_Visit == Company_Visits) %>% 
  group_by(Manufacturer_Name,Past_Visit,Operational_Category) %>% 
  summarise(Operational_Category=n()) %>% 
  na.omit()
filtered_data

## # A tibble: 1,052 × 3
## # Groups:   Manufacturer_Name, Past_Visit [952]
##    Manufacturer_Name Past_Visit Operational_Category
##    <chr>                  <dbl>                <int>
##  1 ASTRA                      2                    1
##  2 ASTRA                     13                    1
##  3 ASTRA                     14                    1
##  4 ASTRA                     16                    1
##  5 BAE SYSTEMS PLC            2                    2
##  6 BEECH                      1                    4
##  7 BEECH                      2                    6
##  8 BEECH                      3                    1
##  9 BEECH                      4                    3
## 10 BEECH                      6                    2
## # … with 1,042 more rows

correlational study between Past Vist and Operational Category

Observation 6a:

Bar charts at first glance, gave a visual impression that indicated Operational Category has a positive relationship with Past visits
Meaning as the number of vists increases/decreases, each operational category numbers followed suit. This mistakenly would have misled an analyst to make inferences between the number of operational categories and the number of visits

p6 <- filtered_data%>%
  pivot_longer(cols = starts_with(c("Past_Visit","Operational_Category")), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Manufacturer_Name, y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~Variable, scales = "free_y") +
  labs(title = "Bar Charts by Operational Category and Past Visits", x = "Manufacturer_Name", y = "Counts") +theme(axis.text.x = element_text(angle = -90,hjust = 1))+
  coord_flip()+
  theme_minimal()
plotly_plot6 <- ggplotly(p6)
plotly_plot6

Observation 6b:

Interestingly enough, a correlation calculation showed its in fact slightly negative; meaning number of operational categories move in opposite direction to number of visits.
Lessons learned its better to robustly test than simply rely on visuals!!

# Calculate the correlation matrix 
correlation_matrix <- cor(filtered_data[,2:3])

# Print the correlation matrix
print(correlation_matrix)

##                      Past_Visit Operational_Category
## Past_Visit            1.0000000           -0.1241611
## Operational_Category -0.1241611            1.0000000

#install.packages("corrplot")

corrplot(correlation_matrix, method = 'number',type='upper',bg="lightblue")

Summary and Conclusions:


- The biggest takeaway for me was the data actually showed that in 20', a pandemic year, total flights plummeted and this was in-line with all the headline news that demand decreased sharply due to lockdowns and airlines retrenched

- It also showed that the rebound in demand was just as good pre-pandemic 

- At a high level, it was hard to get more insights into fuel management efficacy by aircraft types and there was no way to infer (at least statistically) the direction of visits relative to operational categories

- Given more time and more numerical predictors, I would like to create a predictive model based on lasso regression or stepwise regression to statstically remove some of the less relevant independent variables and get a better and more streamlined data set to make better flight forecasts

- Lastly, given more time, 2 particular scenarios I would be interested in would be fuel management efficient w.r.t to MFR year and model types.

Aero - Data Analysis

sufian

2023-11-09

Libraries

Looking at all the variables of the data set and it’s structure

High level summary of data set:

Data cleansing

Wrangling original data to filter out flights that have origination/destination only

Observation 1:

Performed statistical analysis on flights that actually took off

total flights overall by years only

Total flights overall over 5 yrs by months

Time series Forecasting with ARIMA Auto-Parameter select mode

Observation 2:

Boxplots of flights

Finally, Boxplots of fuel uplift

Observation 3:

Fuel Management statistics

Trying to figure which stats metric is most relevant to use?

Observation 4:

Looks like Mean is the way to go!

Separated fuel uplift stats (using mean values)

Observation 5:

Other Misc items of interest:

correlational study between Past Vist and Operational Category

Observation 6a:

Observation 6b:

Summary and Conclusions: