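# stringsAsFactors = TRUE keeps text columns as factors (needed in R >= 4.0 to match the str() output below)
airline.df <- read.csv("F:/Data Analytics for Managerial Applications/SixAirlinesDataV2.csv", stringsAsFactors = TRUE)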
head(airline.df)
## Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 British Boeing 12.25 Jul International 122
## 2 British Boeing 12.25 Aug International 122
## 3 British Boeing 12.25 Sep International 122
## 4 British Boeing 12.25 Oct International 122
## 5 British Boeing 8.16 Aug International 122
## 6 British Boeing 8.16 Sep International 122
## SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1 40 31 38 18 19
## 2 40 31 38 18 19
## 3 40 31 38 18 19
## 4 40 31 38 18 19
## 5 40 31 38 18 19
## 6 40 31 38 18 19
## PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1 2707 3725 0.38 162 7
## 2 2707 3725 0.38 162 7
## 3 2707 3725 0.38 162 7
## 4 2707 3725 0.38 162 7
## 5 1793 2999 0.67 162 7
## 6 1793 2999 0.67 162 7
## WidthDifference PercentPremiumSeats
## 1 1 24.69
## 2 1 24.69
## 3 1 24.69
## 4 1 24.69
## 5 1 24.69
## 6 1 24.69
library(psych)
describe(airline.df[, 6:18])  # summary statistics for the numeric columns
## vars n mean sd median trimmed mad min
## SeatsEconomy 1 458 202.31 76.37 185.00 194.64 85.99 78.00
## SeatsPremium 2 458 33.65 13.26 36.00 33.35 11.86 8.00
## PitchEconomy 3 458 31.22 0.66 31.00 31.26 0.00 30.00
## PitchPremium 4 458 37.91 1.31 38.00 38.05 0.00 34.00
## WidthEconomy 5 458 17.84 0.56 18.00 17.81 0.00 17.00
## WidthPremium 6 458 19.47 1.10 19.00 19.53 0.00 17.00
## PriceEconomy 7 458 1327.08 988.27 1242.00 1244.40 1159.39 65.00
## PricePremium 8 458 1845.26 1288.14 1737.00 1799.05 1845.84 86.00
## PriceRelative 9 458 0.49 0.45 0.36 0.42 0.41 0.02
## SeatsTotal 10 458 235.96 85.29 227.00 228.73 90.44 98.00
## PitchDifference 11 458 6.69 1.76 7.00 6.76 0.00 2.00
## WidthDifference 12 458 1.63 1.19 1.00 1.53 0.00 0.00
## PercentPremiumSeats 13 458 14.65 4.84 13.21 14.31 2.68 4.71
## max range skew kurtosis se
## SeatsEconomy 389.00 311.00 0.72 -0.36 3.57
## SeatsPremium 66.00 58.00 0.23 -0.46 0.62
## PitchEconomy 33.00 3.00 -0.03 -0.35 0.03
## PitchPremium 40.00 6.00 -1.51 3.52 0.06
## WidthEconomy 19.00 2.00 -0.04 -0.08 0.03
## WidthPremium 21.00 4.00 -0.08 -0.31 0.05
## PriceEconomy 3593.00 3528.00 0.51 -0.88 46.18
## PricePremium 7414.00 7328.00 0.50 0.43 60.19
## PriceRelative 1.89 1.87 1.17 0.72 0.02
## SeatsTotal 441.00 343.00 0.70 -0.53 3.99
## PitchDifference 10.00 8.00 -0.54 1.78 0.08
## WidthDifference 4.00 4.00 0.84 -0.53 0.06
## PercentPremiumSeats 24.69 19.98 0.71 0.28 0.23
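Two of the columns in the describe() table can be reproduced by hand from the others, which helps when reading it. The sketch below assumes the usual definitions (se = sd / sqrt(n), range = max - min) and uses values copied from the output above; the variable names are illustrative only.

```r
# Values copied from the describe() output above (n = 458 for every variable).
n   <- 458
sds <- c(SeatsEconomy = 76.37, SeatsPremium = 13.26, PriceEconomy = 988.27)

# Standard error of the mean: sd / sqrt(n)
se <- round(sds / sqrt(n), 2)
print(se)

# Range: max - min, shown here for SeatsEconomy
seats_economy_range <- 389 - 78
print(seats_economy_range)
```

Both results agree with the se and range columns reported for these variables.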
str(airline.df)
## 'data.frame': 458 obs. of 18 variables:
## $ Airline : Factor w/ 6 levels "AirFrance","British",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Aircraft : Factor w/ 2 levels "AirBus","Boeing": 2 2 2 2 2 2 2 2 2 2 ...
## $ FlightDuration : num 12.25 12.25 12.25 12.25 8.16 ...
## $ TravelMonth : Factor w/ 4 levels "Aug","Jul","Oct",..: 2 1 4 3 1 4 3 1 4 4 ...
## $ IsInternational : Factor w/ 2 levels "Domestic","International": 2 2 2 2 2 2 2 2 2 2 ...
## $ SeatsEconomy : int 122 122 122 122 122 122 122 122 122 122 ...
## $ SeatsPremium : int 40 40 40 40 40 40 40 40 40 40 ...
## $ PitchEconomy : int 31 31 31 31 31 31 31 31 31 31 ...
## $ PitchPremium : int 38 38 38 38 38 38 38 38 38 38 ...
## $ WidthEconomy : int 18 18 18 18 18 18 18 18 18 18 ...
## $ WidthPremium : int 19 19 19 19 19 19 19 19 19 19 ...
## $ PriceEconomy : int 2707 2707 2707 2707 1793 1793 1793 1476 1476 1705 ...
## $ PricePremium : int 3725 3725 3725 3725 2999 2999 2999 2997 2997 2989 ...
## $ PriceRelative : num 0.38 0.38 0.38 0.38 0.67 0.67 0.67 1.03 1.03 0.75 ...
## $ SeatsTotal : int 162 162 162 162 162 162 162 162 162 162 ...
## $ PitchDifference : int 7 7 7 7 7 7 7 7 7 7 ...
## $ WidthDifference : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PercentPremiumSeats: num 24.7 24.7 24.7 24.7 24.7 ...
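Several columns appear to be derived from the others. The sketch below checks this against rows 1, 5, 8 and 10 of the str() output. The formulas are an assumption inferred from those rows (for example, PriceRelative as the premium markup relative to the economy price); they are not stated in the dataset itself.

```r
# Prices for rows 1, 5, 8 and 10, copied from the str() output above.
price_economy <- c(2707, 1793, 1476, 1705)
price_premium <- c(3725, 2999, 2997, 2989)

# Assumed definition: relative premium markup over the economy price.
price_relative <- round((price_premium - price_economy) / price_economy, 2)
print(price_relative)

# Seat and comfort columns for the first row (122 economy seats, 40 premium seats,
# pitch 31 vs 38, width 18 vs 19).
seats_total           <- 122 + 40
pitch_difference      <- 38 - 31
width_difference      <- 19 - 18
percent_premium_seats <- round(100 * 40 / seats_total, 2)
print(c(seats_total, pitch_difference, width_difference, percent_premium_seats))
```

Under these assumed formulas the computed values match the PriceRelative, SeatsTotal, PitchDifference, WidthDifference and PercentPremiumSeats entries shown for those rows.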
Boxplots of PriceEconomy and PricePremium by Airline:
par(mfrow = c(2,1))
boxplot(airline.df$PriceEconomy ~ airline.df$Airline, horizontal = TRUE, ylab = "Airline", xlab = "Price of Economy class tickets", yaxt = "n", main = "Boxplot of Price of Economy Tickets by Airlines")
axis(side = 2, las = 2, at = 1:6, labels = levels(factor(airline.df$Airline)))  # labels follow the factor-level order used by boxplot()
boxplot(airline.df$PricePremium ~ airline.df$Airline, horizontal = TRUE, ylab = "Airline", xlab = "Price of Premium class tickets", yaxt = "n", main = "Boxplot of Price of Premium Tickets by Airlines")
axis(side = 2, las = 2, at = 1:6, labels = levels(factor(airline.df$Airline)))  # labels follow the factor-level order used by boxplot()
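boxplot() draws one box per factor level, in level order, so axis labels supplied by hand need to follow that order. As a minimal sketch, using only the six airline names that appear in the plotting code (the vector below is illustrative, not the dataset), the default level order can be checked like this:

```r
# Airline names as they appear in the axis labels, deliberately unsorted.
airlines <- c("British", "Virgin", "Singapore", "Jet", "Delta", "AirFrance")

# factor() sorts its levels alphabetically by default.
airline_levels <- levels(factor(airlines))
print(airline_levels)
```

By default the levels are sorted alphabetically, so Singapore comes before Virgin and the labels for positions 5 and 6 should follow that order.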
Insights from the Boxplots:
To get the correlation matrix for all numeric variables in the dataset, rounded to 2 decimal places:
round(cor(airline.df[,6:18]),2)
## SeatsEconomy SeatsPremium PitchEconomy PitchPremium
## SeatsEconomy 1.00 0.63 0.14 0.12
## SeatsPremium 0.63 1.00 -0.03 0.00
## PitchEconomy 0.14 -0.03 1.00 -0.55
## PitchPremium 0.12 0.00 -0.55 1.00
## WidthEconomy 0.37 0.46 0.29 -0.02
## WidthPremium 0.10 0.00 -0.54 0.75
## PriceEconomy 0.13 0.11 0.37 0.05
## PricePremium 0.18 0.22 0.23 0.09
## PriceRelative 0.00 -0.10 -0.42 0.42
## SeatsTotal 0.99 0.72 0.12 0.11
## PitchDifference 0.04 0.02 -0.78 0.95
## WidthDifference -0.08 -0.22 -0.64 0.70
## PercentPremiumSeats -0.33 0.49 -0.10 -0.18
## WidthEconomy WidthPremium PriceEconomy PricePremium
## SeatsEconomy 0.37 0.10 0.13 0.18
## SeatsPremium 0.46 0.00 0.11 0.22
## PitchEconomy 0.29 -0.54 0.37 0.23
## PitchPremium -0.02 0.75 0.05 0.09
## WidthEconomy 1.00 0.08 0.07 0.15
## WidthPremium 0.08 1.00 -0.06 0.06
## PriceEconomy 0.07 -0.06 1.00 0.90
## PricePremium 0.15 0.06 0.90 1.00
## PriceRelative -0.04 0.50 -0.29 0.03
## SeatsTotal 0.41 0.09 0.13 0.19
## PitchDifference -0.13 0.76 -0.10 -0.02
## WidthDifference -0.39 0.88 -0.08 -0.01
## PercentPremiumSeats 0.23 -0.18 0.07 0.12
## PriceRelative SeatsTotal PitchDifference
## SeatsEconomy 0.00 0.99 0.04
## SeatsPremium -0.10 0.72 0.02
## PitchEconomy -0.42 0.12 -0.78
## PitchPremium 0.42 0.11 0.95
## WidthEconomy -0.04 0.41 -0.13
## WidthPremium 0.50 0.09 0.76
## PriceEconomy -0.29 0.13 -0.10
## PricePremium 0.03 0.19 -0.02
## PriceRelative 1.00 -0.01 0.47
## SeatsTotal -0.01 1.00 0.03
## PitchDifference 0.47 0.03 1.00
## WidthDifference 0.49 -0.11 0.76
## PercentPremiumSeats -0.16 -0.22 -0.09
## WidthDifference PercentPremiumSeats
## SeatsEconomy -0.08 -0.33
## SeatsPremium -0.22 0.49
## PitchEconomy -0.64 -0.10
## PitchPremium 0.70 -0.18
## WidthEconomy -0.39 0.23
## WidthPremium 0.88 -0.18
## PriceEconomy -0.08 0.07
## PricePremium -0.01 0.12
## PriceRelative 0.49 -0.16
## SeatsTotal -0.11 -0.22
## PitchDifference 0.76 -0.09
## WidthDifference 1.00 -0.28
## PercentPremiumSeats -0.28 1.00
To construct a Corrgram based on all numeric variables in the dataset:
library(corrgram)
corrgram(airline.df[,6:18], order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Corrgram of airline.df intercorrelations")
Although the corrgram covers all numeric variables in the data, our research question concerns the variables that account for the difference in price between the two classes, not the individual prices.
Focusing on PriceRelative, PitchDifference, WidthDifference and PercentPremiumSeats, we see that:
- PriceRelative has a moderate positive correlation with PitchDifference (0.47) and with WidthDifference (0.49).
- PriceRelative has a weak negative correlation with PercentPremiumSeats (-0.16): the higher the percentage of premium seats, the smaller the relative price difference between Premium and Economy class.
Scatterplot matrix of the variables that may account for the price difference between the Economy and Premium classes:
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(airline.df[,c("PriceRelative","PitchDifference","WidthDifference","PercentPremiumSeats")], spread = FALSE, smoother.args = list(lty = 2), main = "Scatter Plot Matrix")
The scatterplot matrix is consistent with the observations above: PriceRelative shows a moderate positive association with both PitchDifference and WidthDifference, and a slight negative association with PercentPremiumSeats.
We next check these findings with t-tests, correlation tests and a regression analysis.
Null Hypothesis - PriceRelative (the relative price difference between the two classes) is not correlated with PitchDifference.
Alternative Hypothesis - PriceRelative is correlated with PitchDifference.
t.test(airline.df$PriceRelative, airline.df$PitchDifference)
##
## Welch Two Sample t-test
##
## data: airline.df$PriceRelative and airline.df$PitchDifference
## t = -72.974, df = 516.54, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.367495 -6.033640
## sample estimates:
## mean of x mean of y
## 0.4872052 6.6877729
cor.test(airline.df$PriceRelative, airline.df$PitchDifference)
##
## Pearson's product-moment correlation
##
## data: airline.df$PriceRelative and airline.df$PitchDifference
## t = 11.331, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3940262 0.5372817
## sample estimates:
## cor
## 0.4687302
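The Pearson correlation test gives r = 0.47 (95% confidence interval 0.39 to 0.54) with a p-value < 2.2e-16, far below 0.05. We therefore reject the null hypothesis and conclude that PriceRelative and PitchDifference are correlated. The Welch t-test only compares the means of the two variables (0.49 vs 6.69), which are measured on different scales, so it does not bear on this hypothesis; the conclusion rests on cor.test.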
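The t statistic and confidence interval printed by cor.test can be recovered from the sample correlation and the sample size alone. The sketch below uses the standard formulas, t = r * sqrt(n - 2) / sqrt(1 - r^2) and a Fisher z-transform interval, with r and n copied from the output above; the variable names are illustrative.

```r
r <- 0.4687302  # sample correlation reported by cor.test() above
n <- 458        # number of observations (df = n - 2 = 456)

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
print(round(t_stat, 3))

# 95% confidence interval via the Fisher z-transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
print(round(ci, 4))
```

These reproduce the reported t of about 11.33 and an interval of roughly 0.394 to 0.537.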
Null Hypothesis - PriceRelative (the relative price difference between the two classes) is not correlated with WidthDifference.
Alternative Hypothesis - PriceRelative is correlated with WidthDifference.
t.test(airline.df$PriceRelative, airline.df$WidthDifference)
##
## Welch Two Sample t-test
##
## data: airline.df$PriceRelative and airline.df$WidthDifference
## t = -19.284, df = 585.55, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.262697 -1.029268
## sample estimates:
## mean of x mean of y
## 0.4872052 1.6331878
cor.test(airline.df$PriceRelative, airline.df$WidthDifference)
##
## Pearson's product-moment correlation
##
## data: airline.df$PriceRelative and airline.df$WidthDifference
## t = 11.869, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4125388 0.5528218
## sample estimates:
## cor
## 0.4858024
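The Pearson correlation test gives r = 0.49 (95% confidence interval 0.41 to 0.55) with a p-value < 2.2e-16, far below 0.05. We therefore reject the null hypothesis and conclude that PriceRelative and WidthDifference are correlated. As before, the Welch t-test only compares the means of the two variables (0.49 vs 1.63) and does not bear on this hypothesis; the conclusion rests on cor.test.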
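A Welch two-sample t statistic depends only on each sample's mean, standard deviation and size. It can therefore be approximated from the summary table without any information about how the two variables move together, which is why the correlation hypotheses are assessed with cor.test. The sketch below uses the means from the t.test output and the rounded standard deviations from describe(), so it matches the reported statistics only approximately; welch_t is a hypothetical helper, not a library function.

```r
# Hypothetical helper: Welch t statistic from summary statistics
# (both samples have the same size n here).
welch_t <- function(mean_x, sd_x, mean_y, sd_y, n) {
  se <- sqrt(sd_x^2 / n + sd_y^2 / n)
  (mean_x - mean_y) / se
}

# PriceRelative vs PitchDifference, then PriceRelative vs WidthDifference.
t_pitch <- welch_t(0.4872052, 0.45, 6.6877729, 1.76, 458)
t_width <- welch_t(0.4872052, 0.45, 1.6331878, 1.19, 458)
print(round(c(t_pitch, t_width), 2))
```

The results land close to the reported values of -72.974 and -19.284; the small gaps come from the rounded standard deviations.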
Regression model:
y = b0 + b1(x1) + b2(x2) + b3(x3)
PriceRelative = b0 + b1(PitchDifference) + b2(WidthDifference) + b3(PercentPremiumSeats)
fit <- lm(formula = PriceRelative ~ PitchDifference + WidthDifference + PercentPremiumSeats, data = airline.df)
summary(fit)
##
## Call:
## lm(formula = PriceRelative ~ PitchDifference + WidthDifference +
## PercentPremiumSeats, data = airline.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88643 -0.29471 -0.05005 0.19013 1.17157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.031508 0.097220 -0.324 0.746
## PitchDifference 0.064596 0.016171 3.994 7.56e-05 ***
## WidthDifference 0.104782 0.024813 4.223 2.92e-05 ***
## PercentPremiumSeats -0.005764 0.003971 -1.451 0.147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3882 on 454 degrees of freedom
## Multiple R-squared: 0.2627, Adjusted R-squared: 0.2579
## F-statistic: 53.93 on 3 and 454 DF, p-value: < 2.2e-16
Looking at the regression results, the F-statistic (53.93 on 3 and 454 degrees of freedom, p-value < 2.2e-16, far below 0.05) indicates that the model as a whole is statistically significant, i.e. at least one explanatory variable is related to PriceRelative. The adjusted R-squared (0.2579) shows that the selected explanatory variables explain approximately 25.8% of the variance in PriceRelative, which also suggests that other variables must account for much of the price difference between the two classes. Additionally:
- For a 1-unit increase in PitchDifference, PriceRelative increases by about 0.065, holding the other variables constant.
- For a 1-unit increase in WidthDifference, PriceRelative increases by about 0.105, holding the other variables constant.
- With p = 0.147 (> 0.05), the coefficient of PercentPremiumSeats is not statistically significant.
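To make the fitted model concrete, the sketch below plugs the first row of the data (PitchDifference = 7, WidthDifference = 1, PercentPremiumSeats = 24.69) into the estimated coefficients. It also recomputes the adjusted R-squared and F-statistic from the reported R-squared using the standard formulas. All numbers are copied from the output above; because R-squared is rounded to four decimals, the recomputed values agree only approximately.

```r
# Estimated coefficients from summary(fit), in the order:
# intercept, PitchDifference, WidthDifference, PercentPremiumSeats.
coefs <- c(-0.031508, 0.064596, 0.104782, -0.005764)

# First observation: leading 1 for the intercept, then the three predictors.
first_row <- c(1, 7, 1, 24.69)
fitted_first <- sum(coefs * first_row)
print(round(fitted_first, 3))  # observed PriceRelative for this row is 0.38

# Model-level statistics from the reported multiple R-squared.
r2 <- 0.2627
n  <- 458  # observations
p  <- 3    # predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
print(round(c(adj_r2, f_stat), 4))
```

The fitted value for the first row is about 0.383, close to the observed 0.38, although with an adjusted R-squared near 0.26 many other rows will fit less well.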