library(tidyverse)
library(dplyr)
library(ggplot2)
library(readr)

airline <- read_csv("international_airline.csv")

Executive Summary

The aim of this report is to address potential clients by formulating and answering questions from the dataset which pertains to flights and saleable seats operated by scheduled international passenger airlines operating to and from Australia. Using the statistical programming language R, I was able to manipulate and analyse the data to conclusions and report on my findings. The question I addressed in this report is whether there is a linear relationship between each year and maximum fittable seats/ total flights for Qantas Airways. The result of this study proved that there is indeed a positive linear relationship between examined variables. From the information gathered, I would recommend to the client that Qantas Airways could potentially be a lucrative investment for clients, as an increase in amount of seats and flights each year would suggest an increase of revenue due to a higher volume of passengers.

Initial Data Analysis (IDA)

Within the data, there were no missing variables and no duplicates of the data. The software was able to accurately classify each of the variables in the dataset, so no changes had to be made to the datatypes. The data was filtered by the airline Qantas Airways, the data then was grouped by years and the mean was calculated for the maximum number of seats and total flights for each year.

The data was obtained from data.gov.au, an online repository for Australian government related datasets. The dataset selected is valid and reliable as it was originally sourced from the Bureau of Infrastructure and Transport Research Economics an official Australian government organisation which provides economic analysis, research and statistics on infrastructure, transport and cities.

Regarding issues and limitations, as the dataset has been updated to the present year COVID-19 had greatly influenced the dependent variables for years 2020-2021 since border restrictions had caused many airlines to temporarily close down. In response, the two years were removed from analysis to create more accurate results.

str(airline)
## spec_tbl_df [106,312 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Month             : chr [1:106312] "Sep-03" "Sep-03" "Sep-03" "Sep-03" ...
##  $ In_Out            : chr [1:106312] "I" "I" "I" "I" ...
##  $ Australian_City   : chr [1:106312] "Adelaide" "Adelaide" "Adelaide" "Adelaide" ...
##  $ International_City: chr [1:106312] "Denpasar" "Hong Kong" "Kuala Lumpur" "Singapore" ...
##  $ Airline           : chr [1:106312] "Garuda Indonesia" "Cathay Pacific Airways" "Malaysia Airlines" "Qantas Airways" ...
##  $ Route             : chr [1:106312] "DPS-ADL-MEL" "HKG-ADL-MEL" "KUL-ADL" "SIN-DRW-ADL-MEL" ...
##  $ Port_Country      : chr [1:106312] "Indonesia" "Hong Kong (SAR)" "Malaysia" "Singapore" ...
##  $ Port_Region       : chr [1:106312] "SE Asia" "NE Asia" "SE Asia" "SE Asia" ...
##  $ Service_Country   : chr [1:106312] "Indonesia" "Hong Kong (SAR)" "Malaysia" "Singapore" ...
##  $ Service_Region    : chr [1:106312] "SE Asia" "NE Asia" "SE Asia" "SE Asia" ...
##  $ Stops             : num [1:106312] 0 0 0 1 1 0 0 0 0 0 ...
##  $ All_Flights       : num [1:106312] 13 8 17 4 9 12 36 18 8 14 ...
##  $ Max_Seats         : num [1:106312] 3809 2008 4726 908 2038 ...
##  $ Year              : num [1:106312] 2003 2003 2003 2003 2003 ...
##  $ Month_num         : num [1:106312] 9 9 9 9 9 9 9 9 9 9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Month = col_character(),
##   ..   In_Out = col_character(),
##   ..   Australian_City = col_character(),
##   ..   International_City = col_character(),
##   ..   Airline = col_character(),
##   ..   Route = col_character(),
##   ..   Port_Country = col_character(),
##   ..   Port_Region = col_character(),
##   ..   Service_Country = col_character(),
##   ..   Service_Region = col_character(),
##   ..   Stops = col_double(),
##   ..   All_Flights = col_double(),
##   ..   Max_Seats = col_double(),
##   ..   Year = col_double(),
##   ..   Month_num = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Is there a linear relationship between each year and available seats/total flight for Qantas Airways?

# Make a table for Year and the mean of Max_Seats
airline_quantas <- group_by(airline, Year)
airline_quantas <- filter(airline_quantas, Airline == "Qantas Airways")
airline_quantas <- summarize(airline_quantas, mean_seats = mean(Max_Seats))
airline_quantas
## # A tibble: 19 × 2
##     Year mean_seats
##    <dbl>      <dbl>
##  1  2003      4845.
##  2  2004      5699.
##  3  2005      5578.
##  4  2006      5125.
##  5  2007      5460.
##  6  2008      5651.
##  7  2009      5868.
##  8  2010      5894.
##  9  2011      6463.
## 10  2012      7161.
## 11  2013      7752.
## 12  2014      8534.
## 13  2015      8649.
## 14  2016      8526.
## 15  2017      8923.
## 16  2018      8784.
## 17  2019      8517.
## 18  2020      6501.
## 19  2021      3249.
# Plot a linear model of year and mean of Max_seats
airline_lm <- 
  ggplot(airline_quantas, aes(Year, mean_seats)) +
  geom_point()+
  geom_smooth(method = 'lm')+
  xlab("Year") +
  ylab("Max Seats") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Amount of Max Seats Over Time")

airline_lm

# Correlation Value of Year and mean of Max_Seats

cor(airline_quantas$Year, airline_quantas$mean_seats)
## [1] 0.4705629

For the first model, data was analysed to determine whether there is a linear relationship between each year and maximum fittable seats. Observing the scatterplot, the data correlation suggested overall a medium strength positive linear correlation (0.47) between each year and the maximum seats, also containing possible outliers for years 2020-2021 due to COVID-19.

# 

fit <- lm(log(mean_seats) ~ Year,airline_quantas)

par(mfrow = c(1, 2)) 
plot(fit, which = c(2, 3))

Assumptions of the linear model were checked visually by observing the residual diagnostic plots. The QQ-plot showed significant negative skewing for one datapoints on the bottom end of plot and for the scale-location plot an outlier was also discovered. As a result, due to the presence of an outlier the model did not meet the assumptions of normality, thus, changes had to be made to the to check if the model should be rejected.

fit <- 
  airline_quantas %>%
  filter(Year < 2020) %>%
  lm(mean_seats ~ Year, .)
par(mfrow = c(1, 2)) # make diagnostic plots more compact (optional)
plot(fit, which = c(2, 3))

To fix assumptions of normality the dataset was constrained, and the outliers were removed for years 2020-2021. Observing the updated residual plot with the outliers removed the model’s assumptions of normality are now satisfied.

airline_quantas %>%
  filter(Year < 2020) %>%
  ggplot(aes(Year, mean_seats)) +
  geom_line(aes(y = predict(fit))) +
  geom_point() +
  theme_bw() +
  xlab("Year") +
  ylab("Max Seats")

summary(fit)
## 
## Call:
## lm(formula = mean_seats ~ Year, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -733.45 -414.57  -27.44  331.88  784.50 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -557341.6    49864.2  -11.18 1.13e-08 ***
## Year            280.6       24.8   11.32 9.60e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 500.8 on 15 degrees of freedom
## Multiple R-squared:  0.8951, Adjusted R-squared:  0.8881 
## F-statistic:   128 on 1 and 15 DF,  p-value: 9.602e-09

Summarizing the data of the model indicates a statistically significant positive linear relationship between Year and Max Seats (R^2* = 0.89; p < 0.01). This result is supported by research from Bhadra, D. (2009) who similarly reported in San Diego the average aircraft size grew by roughly 15 seats since 2006.

airline_quantas <- group_by(airline, Year)
airline_quantas <- filter(airline_quantas, Airline == "Qantas Airways")
airline_quantas <- summarize(airline_quantas, mean_flights = mean(All_Flights))
airline_quantas
## # A tibble: 19 × 2
##     Year mean_flights
##    <dbl>        <dbl>
##  1  2003         16.2
##  2  2004         19.2
##  3  2005         18.7
##  4  2006         17.5
##  5  2007         19.1
##  6  2008         20.0
##  7  2009         21.2
##  8  2010         22.1
##  9  2011         23.8
## 10  2012         26.0
## 11  2013         27.6
## 12  2014         30.3
## 13  2015         30.9
## 14  2016         30.9
## 15  2017         32.2
## 16  2018         32.2
## 17  2019         32.4
## 18  2020         25.2
## 19  2021         15.2
airline_lm <- 
  ggplot(airline_quantas, aes(Year, mean_flights)) +
  geom_point()+
  geom_smooth(method = 'lm')+
  xlab("Year") +
  ylab("All Flights") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Total Amount of Flights Over Time")
airline_lm

cor(airline_quantas$Year, airline_quantas$mean_flights)
## [1] 0.6321142

The next model assessed whether there is a linear relationship between each year and total flights. From the scatterplot the data correlation suggested overall a moderate positive linear correlation (0.63) between each year and the total flights with some outliers for years 2020-2021 due to COVID-19.

fit <- lm(log(mean_flights) ~ Year,airline_quantas)

par(mfrow = c(1, 2)) 
plot(fit, which = c(2, 3))

Assumptions of the linear model were checked visually by observing the residual diagnostic plots. The QQ-plot showed significant negative skewing for one datapoints on the bottom end of plot and for the scale-location plot an outlier was also discovered. As a result, due to the presence of outliers the model did not meet the assumptions of normality, thus, changes had to be made to the to check if the model should be rejected.

fit <- 
  airline_quantas %>%
  filter(Year < 2020) %>%
  lm(mean_flights ~ Year, .)
par(mfrow = c(1, 2)) # make diagnostic plots more compact (optional)
plot(fit, which = c(2, 3))

To fix assumptions of normality the dataset was constrained, and the outliers were removed for years 2020-2021. Observing the updated residual plot with the outliers removed the model’s assumptions of normality were now satisfied.

airline_quantas %>%
  filter(Year < 2020) %>%
  ggplot(aes(Year, mean_flights)) +
  geom_line(aes(y = predict(fit))) +
  geom_point() +
  theme_bw() +
  xlab("Year") +
  ylab("All Flights")

summary(fit)
## 
## Call:
## lm(formula = mean_flights ~ Year, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5683 -1.2749  0.1084  0.6337  2.4453 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.258e+03  1.353e+02  -16.68 4.29e-11 ***
## Year         1.135e+00  6.729e-02   16.86 3.67e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.359 on 15 degrees of freedom
## Multiple R-squared:  0.9499, Adjusted R-squared:  0.9466 
## F-statistic: 284.4 on 1 and 15 DF,  p-value: 3.673e-11

Summarising the data of the model indicates a statistically significant positive linear relationship between Year and All_Flights (R^2* = 0.95; p < 0.001).

In conclusion, after removing the outliers for each model and testing the relationship between Year and Max_Seats /All_Flights in Qantas Airways, both showed a positive linear relationship. From the information gathered, I would recommend to the client that Qantas Airways could potentially be a lucrative investment for clients, as an increase in amount of seats and flights each year would suggest an increase of revenue due to a higher volume of passengers.