airlines.df <- read.csv("SixAirlinesDataV2.csv", stringsAsFactors = TRUE) # factors are needed for summary() counts and levels() below
attach(airlines.df)
library(psych)
describe(airlines.df)[,c(1:5,8,9)]
## vars n mean sd median min max
## Airline* 1 458 3.01 1.65 2.00 1.00 6.00
## Aircraft* 2 458 1.67 0.47 2.00 1.00 2.00
## FlightDuration 3 458 7.58 3.54 7.79 1.25 14.66
## TravelMonth* 4 458 2.56 1.17 3.00 1.00 4.00
## IsInternational* 5 458 1.91 0.28 2.00 1.00 2.00
## SeatsEconomy 6 458 202.31 76.37 185.00 78.00 389.00
## SeatsPremium 7 458 33.65 13.26 36.00 8.00 66.00
## PitchEconomy 8 458 31.22 0.66 31.00 30.00 33.00
## PitchPremium 9 458 37.91 1.31 38.00 34.00 40.00
## WidthEconomy 10 458 17.84 0.56 18.00 17.00 19.00
## WidthPremium 11 458 19.47 1.10 19.00 17.00 21.00
## PriceEconomy 12 458 1327.08 988.27 1242.00 65.00 3593.00
## PricePremium 13 458 1845.26 1288.14 1737.00 86.00 7414.00
## PriceRelative 14 458 0.49 0.45 0.36 0.02 1.89
## SeatsTotal 15 458 235.96 85.29 227.00 98.00 441.00
## PitchDifference 16 458 6.69 1.76 7.00 2.00 10.00
## WidthDifference 17 458 1.63 1.19 1.00 0.00 4.00
## PercentPremiumSeats 18 458 14.65 4.84 13.21 4.71 24.69
Relationship between Premium prices and type of aircraft
summary(Aircraft)
## AirBus Boeing
## 151 307
boxplot(PricePremium ~ Aircraft) # formula interface: one box per aircraft make
This suggests that premium prices tend to be lower on Boeing aircraft.
Next let us consider the airlines
boxplot(PricePremium~Airline)
Here, it can be seen that Delta charges the lowest premium prices of all the airlines. However, the airlines can be profiled further by whether their flights are domestic or international, as the following table shows:
xtabs(~Airline+IsInternational)
## IsInternational
## Airline Domestic International
## AirFrance 0 74
## British 0 175
## Delta 40 6
## Jet 0 61
## Singapore 0 40
## Virgin 0 62
Let us have a look at the premium prices again, profiled further by domestic or international flight
par(mfrow=c(2, 1))
boxplot(PricePremium~Airline, subset = IsInternational=='International')
boxplot(PricePremium~Airline, subset = IsInternational=='Domestic')
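The grouped boxplots can be backed up with a numeric summary. The sketch below is a minimal illustration on a tiny hypothetical data frame (`toy` and its values are made up and are not taken from the airline data); it uses `aggregate()` to compute the median premium price for each airline and flight type.

```r
# Hypothetical toy data; the values are illustrative only.
toy <- data.frame(
  Airline         = c("Delta", "Delta", "Delta", "Jet", "Jet", "Jet"),
  IsInternational = c("Domestic", "Domestic", "International",
                      "International", "International", "International"),
  PricePremium    = c(100, 200, 900, 500, 700, 600)
)

# Median premium price per (Airline, IsInternational) combination.
# Combinations with no rows (e.g. Jet / Domestic) are simply absent.
med <- aggregate(PricePremium ~ Airline + IsInternational,
                 data = toy, FUN = median)
print(med)
```

On the real data, the same call with `data = airlines.df` should give one row per airline and flight type that actually occurs.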
How does the premium price for each airline vary across travel month?
TravelMonth <- factor(TravelMonth, levels = levels(TravelMonth)[c(2, 1, 4, 3)]) # reorder the factor levels for plotting
par(mfrow=c(3,2))
boxplot(PricePremium~TravelMonth, subset= Airline=='British')
boxplot(PricePremium~TravelMonth, subset= Airline=='AirFrance')
boxplot(PricePremium~TravelMonth, subset= Airline=='Delta')
boxplot(PricePremium~TravelMonth, subset= Airline=='Jet')
boxplot(PricePremium~TravelMonth, subset= Airline=='Singapore')
boxplot(PricePremium~TravelMonth, subset= Airline=='Virgin')
There appears to be no great variation across months; however, whether any differences are statistically significant remains to be tested.
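One way to test for month-to-month differences is a one-way ANOVA, with a Kruskal-Wallis test as a rank-based alternative when prices are skewed. The sketch below runs on synthetic data only: the data frame `toy`, the month labels, and the price distribution are all assumptions made for illustration, not properties of the airline dataset.

```r
set.seed(1)  # synthetic data only; not the airline dataset

# Hypothetical month labels and prices, drawn with no true month effect.
months <- c("Jul", "Aug", "Sep", "Oct")
toy <- data.frame(
  TravelMonth  = factor(rep(months, each = 30), levels = months),
  PricePremium = rnorm(120, mean = 1800, sd = 400)
)

# One-way ANOVA: does mean price differ across months?
fit_aov <- aov(PricePremium ~ TravelMonth, data = toy)
aov_p   <- summary(fit_aov)[[1]][["Pr(>F)"]][1]

# Kruskal-Wallis: rank-based alternative that does not assume normality.
kw_p <- kruskal.test(PricePremium ~ TravelMonth, data = toy)$p.value

cat("ANOVA p-value:", aov_p, "\n")
cat("Kruskal-Wallis p-value:", kw_p, "\n")
```

On the real data, the same tests could be run per airline by passing a `subset` argument, mirroring the boxplot calls.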
Let us investigate more relations using scatterplots
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
par(mfrow=c(1,1))
scatterplot(PercentPremiumSeats, PricePremium)
scatterplot(PriceEconomy, PricePremium)
scatterplot(PitchDifference, PricePremium)
scatterplot(WidthDifference, PricePremium)
scatterplot(FlightDuration, PricePremium)
From the above plots, the only clear relationships are between FlightDuration and PricePremium and between PriceEconomy and PricePremium; the relationship between the percentage of premium seats and premium price is unclear. For pitch difference and width difference, certain mid-range values appear to command higher premium prices than others, i.e. the relationship is not linear. A correlation plot will shed more light on this.
# Separate data frame with the numeric variables used in the correlation analysis (including the response, PricePremium)
numairline.df <- airlines.df[, c('PricePremium', 'FlightDuration', 'PercentPremiumSeats', 'PriceEconomy', 'PriceRelative', 'PitchDifference', 'WidthDifference')]
library("corrplot")
## corrplot 0.84 loaded
round(cor(numairline.df),2)
## PricePremium FlightDuration PercentPremiumSeats
## PricePremium 1.00 0.65 0.12
## FlightDuration 0.65 1.00 0.06
## PercentPremiumSeats 0.12 0.06 1.00
## PriceEconomy 0.90 0.57 0.07
## PriceRelative 0.03 0.12 -0.16
## PitchDifference -0.02 -0.04 -0.09
## WidthDifference -0.01 -0.12 -0.28
## PriceEconomy PriceRelative PitchDifference
## PricePremium 0.90 0.03 -0.02
## FlightDuration 0.57 0.12 -0.04
## PercentPremiumSeats 0.07 -0.16 -0.09
## PriceEconomy 1.00 -0.29 -0.10
## PriceRelative -0.29 1.00 0.47
## PitchDifference -0.10 0.47 1.00
## WidthDifference -0.08 0.49 0.76
## WidthDifference
## PricePremium -0.01
## FlightDuration -0.12
## PercentPremiumSeats -0.28
## PriceEconomy -0.08
## PriceRelative 0.49
## PitchDifference 0.76
## WidthDifference 1.00
library("corrgram")
corrgram(numairline.df, order = FALSE, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main= "Corrgram of various factors")
The above corrgram indicates that PricePremium is strongly correlated with PriceEconomy (r = 0.90) and moderately to strongly correlated with FlightDuration (r = 0.65).
Pearson’s correlation test
cor.test(PricePremium, FlightDuration)
##
## Pearson's product-moment correlation
##
## data: PricePremium and FlightDuration
## t = 18.204, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5923218 0.6988270
## sample estimates:
## cor
## 0.6487398
Since the p-value is below 0.05, there is a statistically significant correlation between PricePremium and FlightDuration.
cor.test(PricePremium, PriceEconomy)
##
## Pearson's product-moment correlation
##
## data: PricePremium and PriceEconomy
## t = 44.452, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8826622 0.9172579
## sample estimates:
## cor
## 0.9013887
Here again the p-value is below 0.05, so there is a statistically significant correlation between PricePremium and PriceEconomy.
cor.test(PricePremium, PercentPremiumSeats)
##
## Pearson's product-moment correlation
##
## data: PricePremium and PercentPremiumSeats
## t = 2.5024, df = 456, p-value = 0.01268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0250311 0.2058228
## sample estimates:
## cor
## 0.116391
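The p-value is below 0.05, so the correlation is statistically significant, but the coefficient itself is small (r = 0.12). The relationship between PricePremium and PercentPremiumSeats is therefore weak; it is the size of r, not the size of the p-value, that measures the strength of the association.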
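The t statistic and p-value reported by `cor.test()` follow directly from the sample correlation and the sample size, which is why a small r can still be significant when n is large. The sketch below is a worked check using only the numbers printed above (r = 0.116391, n = 458); the variable names are chosen here for illustration.

```r
r <- 0.116391   # sample correlation printed by cor.test()
n <- 458        # number of observations reported by describe()

# Test statistic for H0: true correlation = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of variance in one variable linearly explained by the other.
r_sq <- r^2

cat("t =", round(t_stat, 4), " p =", round(p_val, 5),
    " r^2 =", round(r_sq, 4), "\n")
```

The squared correlation is under 2%, which puts the "significant" result in perspective.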
cor.test(PricePremium, PitchDifference)
##
## Pearson's product-moment correlation
##
## data: PricePremium and PitchDifference
## t = -0.38585, df = 456, p-value = 0.6998
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1095118 0.0736825
## sample estimates:
## cor
## -0.01806629
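The large p-value (0.70) means the null hypothesis of zero correlation cannot be rejected: there is no evidence of a linear relationship between PitchDifference and PricePremium. The correlation matrix above suggests the same for WidthDifference (r = -0.01).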
Hypothesis 1: Longer duration flights have higher Premium Prices
Hypothesis 2: Higher economy prices are correlated with higher premium prices
t.test(PricePremium, FlightDuration)
##
## Welch Two Sample t-test
##
## data: PricePremium and FlightDuration
## t = 30.531, df = 457.01, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1719.395 1955.965
## sample estimates:
## mean of x mean of y
## 1845.257642 7.577838
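The p-value is very small, but this Welch test compares the mean of PricePremium (1845.26) with the mean of FlightDuration (7.58), two quantities measured in different units, so it says nothing about whether the two variables are related. The evidence for Hypothesis 1 is the Pearson test above (r = 0.65, p < 2.2e-16): longer flights are associated with higher premium prices.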
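To see what a Welch two-sample t-test actually responds to, the sketch below simulates two variables that are independent by construction but have very different means. All values are synthetic; the means and standard deviations are loosely modelled on the summary table and are assumptions for illustration only.

```r
set.seed(42)  # synthetic data only; not the airline dataset
n <- 458

# Two variables generated independently of each other.
price    <- rnorm(n, mean = 1845, sd = 1288)
duration <- rnorm(n, mean = 7.6,  sd = 3.5)

# Welch t-test: compares the two means, regardless of any association.
tt <- t.test(price, duration)

# Pearson test: asks whether the two variables move together.
ct <- cor.test(price, duration)

cat("t-test p-value:  ", tt$p.value, "\n")
cat("cor.test r:      ", unname(ct$estimate), "\n")
cat("cor.test p-value:", ct$p.value, "\n")
```

The t-test rejects its null simply because the means differ, even though the variables were generated independently, so a small t-test p-value is not evidence of a relationship.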
t.test(PricePremium, PriceEconomy)
##
## Welch Two Sample t-test
##
## data: PricePremium and PriceEconomy
## t = 6.8304, df = 856.56, p-value = 1.605e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 369.2793 667.0831
## sample estimates:
## mean of x mean of y
## 1845.258 1327.076
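Here the small p-value shows that the mean premium price (1845.26) is higher than the mean economy price (1327.08), with a 95% confidence interval for the difference of 369.28 to 667.08. It does not by itself measure correlation; Hypothesis 2 is supported by the Pearson test above (r = 0.90, p < 2.2e-16).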
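Because each row holds the economy and premium price of the same flight, a paired t-test is a natural way to compare the two price levels, while `cor.test()` addresses the association in Hypothesis 2. The sketch below uses synthetic prices; the 1.4 multiplier, the noise level, and the variable names are assumptions for illustration only.

```r
set.seed(7)  # synthetic data only; not the airline dataset
n <- 458

# Hypothetical prices: premium is roughly 1.4x economy plus noise.
economy <- runif(n, min = 65, max = 3593)
premium <- 1.4 * economy + rnorm(n, sd = 300)

# Paired, one-sided test: is premium higher than economy on the same flight?
paired <- t.test(premium, economy, paired = TRUE, alternative = "greater")

# Association between the two prices.
assoc <- cor.test(premium, economy)

cat("paired t-test p-value:", paired$p.value, "\n")
cat("correlation:", unname(assoc$estimate), "\n")
```

On the real data the equivalent call would be `t.test(PricePremium, PriceEconomy, paired = TRUE)`.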
fit<- lm(PricePremium~ PriceEconomy+ PercentPremiumSeats+ PitchDifference+ WidthDifference+ FlightDuration, data= numairline.df)
summary(fit)
##
## Call:
## lm(formula = PricePremium ~ PriceEconomy + PercentPremiumSeats +
## PitchDifference + WidthDifference + FlightDuration, data = numairline.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -803.2 -287.5 -46.8 151.6 3434.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -572.66050 132.93317 -4.308 2.02e-05 ***
## PriceEconomy 1.02266 0.02881 35.497 < 2e-16 ***
## PercentPremiumSeats 21.58784 5.09201 4.240 2.72e-05 ***
## PitchDifference -4.03335 20.94083 -0.193 0.847353
## WidthDifference 115.29104 32.16186 3.585 0.000374 ***
## FlightDuration 76.97237 8.07003 9.538 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 496.8 on 452 degrees of freedom
## Multiple R-squared: 0.8529, Adjusted R-squared: 0.8512
## F-statistic: 524 on 5 and 452 DF, p-value: < 2.2e-16
The regression indicates that every predictor except PitchDifference (p = 0.85) is statistically significant at the 0.001 level, and the model explains about 85% of the variance in PricePremium (adjusted R-squared 0.8512).
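The significance of each coefficient can also be read as a confidence interval: an estimate whose 95% interval excludes zero is significant at the 5% level. The sketch below rebuilds two such intervals by hand from the estimates and standard errors printed in the regression summary; with the fitted model available, `confint(fit)` should give the same result directly.

```r
# Values copied from the summary(fit) output above.
est <- c(PriceEconomy = 1.02266, PitchDifference = -4.03335)
se  <- c(PriceEconomy = 0.02881, PitchDifference = 20.94083)
df_resid <- 452  # residual degrees of freedom

# Critical value of the t distribution for a two-sided 95% interval.
crit <- qt(0.975, df_resid)

# Interval = estimate +/- critical value * standard error.
ci <- cbind(lower = est - crit * se, upper = est + crit * se)
print(round(ci, 3))
```

The interval for PriceEconomy lies entirely above zero, while the interval for PitchDifference straddles zero, matching the p-values in the table.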