library(tidyverse)
library(dplyr)
library(ggplot2)
library(readr)
airline <- read_csv("international_airline.csv")
The aim of this report is to address potential clients by formulating and answering questions from the dataset which pertains to flights and saleable seats operated by scheduled international passenger airlines operating to and from Australia. Using the statistical programming language R, I was able to manipulate and analyse the data to conclusions and report on my findings. The question I addressed in this report is whether there is a linear relationship between each year and maximum fittable seats/ total flights for Qantas Airways. The result of this study proved that there is indeed a positive linear relationship between examined variables. From the information gathered, I would recommend to the client that Qantas Airways could potentially be a lucrative investment for clients, as an increase in amount of seats and flights each year would suggest an increase of revenue due to a higher volume of passengers.
Within the data, there were no missing variables and no duplicates of the data. The software was able to accurately classify each of the variables in the dataset, so no changes had to be made to the datatypes. The data was filtered by the airline Qantas Airways, the data then was grouped by years and the mean was calculated for the maximum number of seats and total flights for each year.
The data was obtained from data.gov.au, an online repository for Australian government related datasets. The dataset selected is valid and reliable as it was originally sourced from the Bureau of Infrastructure and Transport Research Economics an official Australian government organisation which provides economic analysis, research and statistics on infrastructure, transport and cities.
Regarding issues and limitations, as the dataset has been updated to the present year COVID-19 had greatly influenced the dependent variables for years 2020-2021 since border restrictions had caused many airlines to temporarily close down. In response, the two years were removed from analysis to create more accurate results.
str(airline)
## spec_tbl_df [106,312 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Month : chr [1:106312] "Sep-03" "Sep-03" "Sep-03" "Sep-03" ...
## $ In_Out : chr [1:106312] "I" "I" "I" "I" ...
## $ Australian_City : chr [1:106312] "Adelaide" "Adelaide" "Adelaide" "Adelaide" ...
## $ International_City: chr [1:106312] "Denpasar" "Hong Kong" "Kuala Lumpur" "Singapore" ...
## $ Airline : chr [1:106312] "Garuda Indonesia" "Cathay Pacific Airways" "Malaysia Airlines" "Qantas Airways" ...
## $ Route : chr [1:106312] "DPS-ADL-MEL" "HKG-ADL-MEL" "KUL-ADL" "SIN-DRW-ADL-MEL" ...
## $ Port_Country : chr [1:106312] "Indonesia" "Hong Kong (SAR)" "Malaysia" "Singapore" ...
## $ Port_Region : chr [1:106312] "SE Asia" "NE Asia" "SE Asia" "SE Asia" ...
## $ Service_Country : chr [1:106312] "Indonesia" "Hong Kong (SAR)" "Malaysia" "Singapore" ...
## $ Service_Region : chr [1:106312] "SE Asia" "NE Asia" "SE Asia" "SE Asia" ...
## $ Stops : num [1:106312] 0 0 0 1 1 0 0 0 0 0 ...
## $ All_Flights : num [1:106312] 13 8 17 4 9 12 36 18 8 14 ...
## $ Max_Seats : num [1:106312] 3809 2008 4726 908 2038 ...
## $ Year : num [1:106312] 2003 2003 2003 2003 2003 ...
## $ Month_num : num [1:106312] 9 9 9 9 9 9 9 9 9 9 ...
## - attr(*, "spec")=
## .. cols(
## .. Month = col_character(),
## .. In_Out = col_character(),
## .. Australian_City = col_character(),
## .. International_City = col_character(),
## .. Airline = col_character(),
## .. Route = col_character(),
## .. Port_Country = col_character(),
## .. Port_Region = col_character(),
## .. Service_Country = col_character(),
## .. Service_Region = col_character(),
## .. Stops = col_double(),
## .. All_Flights = col_double(),
## .. Max_Seats = col_double(),
## .. Year = col_double(),
## .. Month_num = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Make a table for Year and the mean of Max_Seats
airline_quantas <- group_by(airline, Year)
airline_quantas <- filter(airline_quantas, Airline == "Qantas Airways")
airline_quantas <- summarize(airline_quantas, mean_seats = mean(Max_Seats))
airline_quantas
## # A tibble: 19 × 2
## Year mean_seats
## <dbl> <dbl>
## 1 2003 4845.
## 2 2004 5699.
## 3 2005 5578.
## 4 2006 5125.
## 5 2007 5460.
## 6 2008 5651.
## 7 2009 5868.
## 8 2010 5894.
## 9 2011 6463.
## 10 2012 7161.
## 11 2013 7752.
## 12 2014 8534.
## 13 2015 8649.
## 14 2016 8526.
## 15 2017 8923.
## 16 2018 8784.
## 17 2019 8517.
## 18 2020 6501.
## 19 2021 3249.
# Plot a linear model of year and mean of Max_seats
airline_lm <-
ggplot(airline_quantas, aes(Year, mean_seats)) +
geom_point()+
geom_smooth(method = 'lm')+
xlab("Year") +
ylab("Max Seats") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Amount of Max Seats Over Time")
airline_lm
# Correlation Value of Year and mean of Max_Seats
cor(airline_quantas$Year, airline_quantas$mean_seats)
## [1] 0.4705629
For the first model, data was analysed to determine whether there is a linear relationship between each year and maximum fittable seats. Observing the scatterplot, the data correlation suggested overall a medium strength positive linear correlation (0.47) between each year and the maximum seats, also containing possible outliers for years 2020-2021 due to COVID-19.
#
fit <- lm(log(mean_seats) ~ Year,airline_quantas)
par(mfrow = c(1, 2))
plot(fit, which = c(2, 3))
Assumptions of the linear model were checked visually by observing the residual diagnostic plots. The QQ-plot showed significant negative skewing for one datapoints on the bottom end of plot and for the scale-location plot an outlier was also discovered. As a result, due to the presence of an outlier the model did not meet the assumptions of normality, thus, changes had to be made to the to check if the model should be rejected.
fit <-
airline_quantas %>%
filter(Year < 2020) %>%
lm(mean_seats ~ Year, .)
par(mfrow = c(1, 2)) # make diagnostic plots more compact (optional)
plot(fit, which = c(2, 3))
To fix assumptions of normality the dataset was constrained, and the outliers were removed for years 2020-2021. Observing the updated residual plot with the outliers removed the model’s assumptions of normality are now satisfied.
airline_quantas %>%
filter(Year < 2020) %>%
ggplot(aes(Year, mean_seats)) +
geom_line(aes(y = predict(fit))) +
geom_point() +
theme_bw() +
xlab("Year") +
ylab("Max Seats")
summary(fit)
##
## Call:
## lm(formula = mean_seats ~ Year, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -733.45 -414.57 -27.44 331.88 784.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -557341.6 49864.2 -11.18 1.13e-08 ***
## Year 280.6 24.8 11.32 9.60e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 500.8 on 15 degrees of freedom
## Multiple R-squared: 0.8951, Adjusted R-squared: 0.8881
## F-statistic: 128 on 1 and 15 DF, p-value: 9.602e-09
Summarizing the data of the model indicates a statistically significant positive linear relationship between Year and Max Seats (R^2* = 0.89; p < 0.01). This result is supported by research from Bhadra, D. (2009) who similarly reported in San Diego the average aircraft size grew by roughly 15 seats since 2006.
airline_quantas <- group_by(airline, Year)
airline_quantas <- filter(airline_quantas, Airline == "Qantas Airways")
airline_quantas <- summarize(airline_quantas, mean_flights = mean(All_Flights))
airline_quantas
## # A tibble: 19 × 2
## Year mean_flights
## <dbl> <dbl>
## 1 2003 16.2
## 2 2004 19.2
## 3 2005 18.7
## 4 2006 17.5
## 5 2007 19.1
## 6 2008 20.0
## 7 2009 21.2
## 8 2010 22.1
## 9 2011 23.8
## 10 2012 26.0
## 11 2013 27.6
## 12 2014 30.3
## 13 2015 30.9
## 14 2016 30.9
## 15 2017 32.2
## 16 2018 32.2
## 17 2019 32.4
## 18 2020 25.2
## 19 2021 15.2
airline_lm <-
ggplot(airline_quantas, aes(Year, mean_flights)) +
geom_point()+
geom_smooth(method = 'lm')+
xlab("Year") +
ylab("All Flights") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Total Amount of Flights Over Time")
airline_lm
cor(airline_quantas$Year, airline_quantas$mean_flights)
## [1] 0.6321142
The next model assessed whether there is a linear relationship between each year and total flights. From the scatterplot the data correlation suggested overall a moderate positive linear correlation (0.63) between each year and the total flights with some outliers for years 2020-2021 due to COVID-19.
fit <- lm(log(mean_flights) ~ Year,airline_quantas)
par(mfrow = c(1, 2))
plot(fit, which = c(2, 3))
Assumptions of the linear model were checked visually by observing the residual diagnostic plots. The QQ-plot showed significant negative skewing for one datapoints on the bottom end of plot and for the scale-location plot an outlier was also discovered. As a result, due to the presence of outliers the model did not meet the assumptions of normality, thus, changes had to be made to the to check if the model should be rejected.
fit <-
airline_quantas %>%
filter(Year < 2020) %>%
lm(mean_flights ~ Year, .)
par(mfrow = c(1, 2)) # make diagnostic plots more compact (optional)
plot(fit, which = c(2, 3))
To fix assumptions of normality the dataset was constrained, and the outliers were removed for years 2020-2021. Observing the updated residual plot with the outliers removed the model’s assumptions of normality were now satisfied.
airline_quantas %>%
filter(Year < 2020) %>%
ggplot(aes(Year, mean_flights)) +
geom_line(aes(y = predict(fit))) +
geom_point() +
theme_bw() +
xlab("Year") +
ylab("All Flights")
summary(fit)
##
## Call:
## lm(formula = mean_flights ~ Year, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5683 -1.2749 0.1084 0.6337 2.4453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.258e+03 1.353e+02 -16.68 4.29e-11 ***
## Year 1.135e+00 6.729e-02 16.86 3.67e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.359 on 15 degrees of freedom
## Multiple R-squared: 0.9499, Adjusted R-squared: 0.9466
## F-statistic: 284.4 on 1 and 15 DF, p-value: 3.673e-11
Summarising the data of the model indicates a statistically significant positive linear relationship between Year and All_Flights (R^2* = 0.95; p < 0.001).
In conclusion, after removing the outliers for each model and testing the relationship between Year and Max_Seats /All_Flights in Qantas Airways, both showed a positive linear relationship. From the information gathered, I would recommend to the client that Qantas Airways could potentially be a lucrative investment for clients, as an increase in amount of seats and flights each year would suggest an increase of revenue due to a higher volume of passengers.