Import the dataset:

# stringsAsFactors = TRUE reproduces the Factor columns shown by str() below on R >= 4.0
airline.df <- read.csv("F:/Data Analytics for Managerial Applications/SixAirlinesDataV2.csv", stringsAsFactors = TRUE)
head(airline.df)
##   Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 British   Boeing          12.25         Jul   International          122
## 2 British   Boeing          12.25         Aug   International          122
## 3 British   Boeing          12.25         Sep   International          122
## 4 British   Boeing          12.25         Oct   International          122
## 5 British   Boeing           8.16         Aug   International          122
## 6 British   Boeing           8.16         Sep   International          122
##   SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1           40           31           38           18           19
## 2           40           31           38           18           19
## 3           40           31           38           18           19
## 4           40           31           38           18           19
## 5           40           31           38           18           19
## 6           40           31           38           18           19
##   PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1         2707         3725          0.38        162               7
## 2         2707         3725          0.38        162               7
## 3         2707         3725          0.38        162               7
## 4         2707         3725          0.38        162               7
## 5         1793         2999          0.67        162               7
## 6         1793         2999          0.67        162               7
##   WidthDifference PercentPremiumSeats
## 1               1               24.69
## 2               1               24.69
## 3               1               24.69
## 4               1               24.69
## 5               1               24.69
## 6               1               24.69
library(psych)
describe(airline.df[,6:18]) ##Summarizing the data
##                     vars   n    mean      sd  median trimmed     mad   min
## SeatsEconomy           1 458  202.31   76.37  185.00  194.64   85.99 78.00
## SeatsPremium           2 458   33.65   13.26   36.00   33.35   11.86  8.00
## PitchEconomy           3 458   31.22    0.66   31.00   31.26    0.00 30.00
## PitchPremium           4 458   37.91    1.31   38.00   38.05    0.00 34.00
## WidthEconomy           5 458   17.84    0.56   18.00   17.81    0.00 17.00
## WidthPremium           6 458   19.47    1.10   19.00   19.53    0.00 17.00
## PriceEconomy           7 458 1327.08  988.27 1242.00 1244.40 1159.39 65.00
## PricePremium           8 458 1845.26 1288.14 1737.00 1799.05 1845.84 86.00
## PriceRelative          9 458    0.49    0.45    0.36    0.42    0.41  0.02
## SeatsTotal            10 458  235.96   85.29  227.00  228.73   90.44 98.00
## PitchDifference       11 458    6.69    1.76    7.00    6.76    0.00  2.00
## WidthDifference       12 458    1.63    1.19    1.00    1.53    0.00  0.00
## PercentPremiumSeats   13 458   14.65    4.84   13.21   14.31    2.68  4.71
##                         max   range  skew kurtosis    se
## SeatsEconomy         389.00  311.00  0.72    -0.36  3.57
## SeatsPremium          66.00   58.00  0.23    -0.46  0.62
## PitchEconomy          33.00    3.00 -0.03    -0.35  0.03
## PitchPremium          40.00    6.00 -1.51     3.52  0.06
## WidthEconomy          19.00    2.00 -0.04    -0.08  0.03
## WidthPremium          21.00    4.00 -0.08    -0.31  0.05
## PriceEconomy        3593.00 3528.00  0.51    -0.88 46.18
## PricePremium        7414.00 7328.00  0.50     0.43 60.19
## PriceRelative          1.89    1.87  1.17     0.72  0.02
## SeatsTotal           441.00  343.00  0.70    -0.53  3.99
## PitchDifference       10.00    8.00 -0.54     1.78  0.08
## WidthDifference        4.00    4.00  0.84    -0.53  0.06
## PercentPremiumSeats   24.69   19.98  0.71     0.28  0.23
str(airline.df)
## 'data.frame':    458 obs. of  18 variables:
##  $ Airline            : Factor w/ 6 levels "AirFrance","British",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Aircraft           : Factor w/ 2 levels "AirBus","Boeing": 2 2 2 2 2 2 2 2 2 2 ...
##  $ FlightDuration     : num  12.25 12.25 12.25 12.25 8.16 ...
##  $ TravelMonth        : Factor w/ 4 levels "Aug","Jul","Oct",..: 2 1 4 3 1 4 3 1 4 4 ...
##  $ IsInternational    : Factor w/ 2 levels "Domestic","International": 2 2 2 2 2 2 2 2 2 2 ...
##  $ SeatsEconomy       : int  122 122 122 122 122 122 122 122 122 122 ...
##  $ SeatsPremium       : int  40 40 40 40 40 40 40 40 40 40 ...
##  $ PitchEconomy       : int  31 31 31 31 31 31 31 31 31 31 ...
##  $ PitchPremium       : int  38 38 38 38 38 38 38 38 38 38 ...
##  $ WidthEconomy       : int  18 18 18 18 18 18 18 18 18 18 ...
##  $ WidthPremium       : int  19 19 19 19 19 19 19 19 19 19 ...
##  $ PriceEconomy       : int  2707 2707 2707 2707 1793 1793 1793 1476 1476 1705 ...
##  $ PricePremium       : int  3725 3725 3725 3725 2999 2999 2999 2997 2997 2989 ...
##  $ PriceRelative      : num  0.38 0.38 0.38 0.38 0.67 0.67 0.67 1.03 1.03 0.75 ...
##  $ SeatsTotal         : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ PitchDifference    : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ WidthDifference    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PercentPremiumSeats: num  24.7 24.7 24.7 24.7 24.7 ...
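
The last five columns look like quantities derived from the others. The source does not state how they were computed, so the formulas below are assumptions inferred from the printed rows; this sketch simply checks them against rows 1 and 5 of the head() output.

```r
# Rows 1 and 5 of head(airline.df), copied from the output above
rows <- data.frame(
  SeatsEconomy = c(122, 122),
  SeatsPremium = c(40, 40),
  PitchEconomy = c(31, 31),
  PitchPremium = c(38, 38),
  WidthEconomy = c(18, 18),
  WidthPremium = c(19, 19),
  PriceEconomy = c(2707, 1793),
  PricePremium = c(3725, 2999)
)

# Assumed definitions (inferred from the data, not stated in the source)
derived <- with(rows, data.frame(
  PriceRelative       = round((PricePremium - PriceEconomy) / PriceEconomy, 2),
  SeatsTotal          = SeatsEconomy + SeatsPremium,
  PitchDifference     = PitchPremium - PitchEconomy,
  WidthDifference     = WidthPremium - WidthEconomy,
  PercentPremiumSeats = round(100 * SeatsPremium / (SeatsEconomy + SeatsPremium), 2)
))
print(derived)
```

Under these assumptions PriceRelative is the premium price expressed as a fractional markup over the economy price, so a value of 0.38 means the premium ticket costs about 38% more.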

Including BoxPlots

The following boxplots show the distribution of PriceEconomy and PricePremium by Airline:

par(mfrow = c(2,1))
# boxplot() draws the groups in factor-level (alphabetical) order, so take the
# axis labels from the levels instead of typing them by hand
airline.levels <- levels(factor(airline.df$Airline))
boxplot(airline.df$PriceEconomy ~ airline.df$Airline, horizontal = TRUE, ylab = "Airline", xlab = "Price of Economy class tickets", yaxt = "n", main = "Boxplot of Price of Economy Tickets by Airlines")
axis(side = 2, las = 2, at = seq_along(airline.levels), labels = airline.levels)
boxplot(airline.df$PricePremium ~ airline.df$Airline, horizontal = TRUE, ylab = "Airline", xlab = "Price of Premium class tickets", yaxt = "n", main = "Boxplot of Price of Premium Tickets by Airlines")
axis(side = 2, las = 2, at = seq_along(airline.levels), labels = airline.levels)
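
Because boxplot() orders the groups by factor level, any hand-typed label vector has to follow that same order or the boxes end up mislabelled. A small sketch of the check, using the six airline names and a hypothetical hand-typed order in which the last two names are not alphabetical:

```r
# A hand-typed label vector (hypothetical order, last two names swapped)
typed <- c("AirFrance", "British", "Delta", "Jet", "Virgin", "Singapore")

# factor() sorts its levels alphabetically by default
airline <- factor(typed)
levels(airline)

# Positions where the typed order disagrees with the level order
mismatch <- which(typed != levels(airline))
mismatch
```

Any position reported in mismatch is a box that would carry the wrong airline name if the typed vector were passed to axis().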

Insights from the Boxplots:

Plotting a Corrgram

To get the correlation matrix for all numeric variables in the dataset, rounded to 2 decimal places:

round(cor(airline.df[,6:18]),2)
##                     SeatsEconomy SeatsPremium PitchEconomy PitchPremium
## SeatsEconomy                1.00         0.63         0.14         0.12
## SeatsPremium                0.63         1.00        -0.03         0.00
## PitchEconomy                0.14        -0.03         1.00        -0.55
## PitchPremium                0.12         0.00        -0.55         1.00
## WidthEconomy                0.37         0.46         0.29        -0.02
## WidthPremium                0.10         0.00        -0.54         0.75
## PriceEconomy                0.13         0.11         0.37         0.05
## PricePremium                0.18         0.22         0.23         0.09
## PriceRelative               0.00        -0.10        -0.42         0.42
## SeatsTotal                  0.99         0.72         0.12         0.11
## PitchDifference             0.04         0.02        -0.78         0.95
## WidthDifference            -0.08        -0.22        -0.64         0.70
## PercentPremiumSeats        -0.33         0.49        -0.10        -0.18
##                     WidthEconomy WidthPremium PriceEconomy PricePremium
## SeatsEconomy                0.37         0.10         0.13         0.18
## SeatsPremium                0.46         0.00         0.11         0.22
## PitchEconomy                0.29        -0.54         0.37         0.23
## PitchPremium               -0.02         0.75         0.05         0.09
## WidthEconomy                1.00         0.08         0.07         0.15
## WidthPremium                0.08         1.00        -0.06         0.06
## PriceEconomy                0.07        -0.06         1.00         0.90
## PricePremium                0.15         0.06         0.90         1.00
## PriceRelative              -0.04         0.50        -0.29         0.03
## SeatsTotal                  0.41         0.09         0.13         0.19
## PitchDifference            -0.13         0.76        -0.10        -0.02
## WidthDifference            -0.39         0.88        -0.08        -0.01
## PercentPremiumSeats         0.23        -0.18         0.07         0.12
##                     PriceRelative SeatsTotal PitchDifference
## SeatsEconomy                 0.00       0.99            0.04
## SeatsPremium                -0.10       0.72            0.02
## PitchEconomy                -0.42       0.12           -0.78
## PitchPremium                 0.42       0.11            0.95
## WidthEconomy                -0.04       0.41           -0.13
## WidthPremium                 0.50       0.09            0.76
## PriceEconomy                -0.29       0.13           -0.10
## PricePremium                 0.03       0.19           -0.02
## PriceRelative                1.00      -0.01            0.47
## SeatsTotal                  -0.01       1.00            0.03
## PitchDifference              0.47       0.03            1.00
## WidthDifference              0.49      -0.11            0.76
## PercentPremiumSeats         -0.16      -0.22           -0.09
##                     WidthDifference PercentPremiumSeats
## SeatsEconomy                  -0.08               -0.33
## SeatsPremium                  -0.22                0.49
## PitchEconomy                  -0.64               -0.10
## PitchPremium                   0.70               -0.18
## WidthEconomy                  -0.39                0.23
## WidthPremium                   0.88               -0.18
## PriceEconomy                  -0.08                0.07
## PricePremium                  -0.01                0.12
## PriceRelative                  0.49               -0.16
## SeatsTotal                    -0.11               -0.22
## PitchDifference                0.76               -0.09
## WidthDifference                1.00               -0.28
## PercentPremiumSeats           -0.28                1.00

To construct a Corrgram based on all numeric variables in the dataset:

library(corrgram)
corrgram(airline.df[,6:18], order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Corrgram of airline.df intercorrelations")

While we have plotted a corrgram of all numeric variables in the data, we must remember that our intended research question is to find out the variables that account for the difference in prices in the two classes and NOT the individual prices.

Thus, looking closely at PriceRelative, PitchDifference, WidthDifference and PercentPremiumSeats in the above corrgram, we see that:

- PriceRelative has a moderate positive correlation with both PitchDifference (0.47) and WidthDifference (0.49).
- PriceRelative has a weak negative correlation with PercentPremiumSeats (-0.16): the higher the percentage of premium seats, the smaller the relative price difference between Premium and Economy class, which is plausible.

Including Scatterplots Matrix

Scatterplots Matrix to understand the variables that account for the Price Difference between the Economy and Premium class:

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(airline.df[,c("PriceRelative","PitchDifference","WidthDifference","PercentPremiumSeats")], spread = FALSE, smoother.args = list(lty = 2), main = "Scatter Plot Matrix")

The scatterplot matrix is consistent with our observations above: PriceRelative shows a moderate positive association with both PitchDifference and WidthDifference, and a weak negative association with PercentPremiumSeats.

We now check these findings with a t-test, a correlation test (cor.test) and a regression analysis.

T-test and Cor Test to check correlation between PriceRelative and PitchDifference

Null Hypothesis - There is no correlation between the difference between prices in the two classes and PitchDifference between them.

Alternative Hypothesis - There is a correlation between the difference between prices in the two classes and PitchDifference between them.

t.test(airline.df$PriceRelative, airline.df$PitchDifference)
## 
##  Welch Two Sample t-test
## 
## data:  airline.df$PriceRelative and airline.df$PitchDifference
## t = -72.974, df = 516.54, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.367495 -6.033640
## sample estimates:
## mean of x mean of y 
## 0.4872052 6.6877729
cor.test(airline.df$PriceRelative, airline.df$PitchDifference)
## 
##  Pearson's product-moment correlation
## 
## data:  airline.df$PriceRelative and airline.df$PitchDifference
## t = 11.331, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3940262 0.5372817
## sample estimates:
##       cor 
## 0.4687302

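Note that the Welch t-test only compares the means of the two variables (0.49 vs 6.69), which are measured on different scales, so its p-value says nothing about their correlation. The relevant result is the cor.test: r = 0.47 with p-value < 2.2e-16, which is far below 0.05. Thus, we reject the null hypothesis and conclude that there is indeed a moderate positive correlation between PriceRelative and PitchDifference of the two classes.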
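
The t statistic and confidence interval printed by cor.test can be recovered by hand from r and n alone, which makes clear what is being tested. A minimal sketch, with r and n copied from the output above; the interval uses the standard Fisher z-transformation.

```r
# Values copied from the cor.test() output above
r <- 0.4687302   # sample correlation
n <- 458         # number of observations

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The same three lines of arithmetic apply to the other two cor.test results in this report by substituting their r values.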

T-test and Cor Test to check correlation between PriceRelative and WidthDifference

Null Hypothesis - There is no correlation between the difference between prices in the two classes and WidthDifference between them.

Alternative Hypothesis - There is a correlation between the difference between prices in the two classes and WidthDifference between them.

t.test(airline.df$PriceRelative, airline.df$WidthDifference)
## 
##  Welch Two Sample t-test
## 
## data:  airline.df$PriceRelative and airline.df$WidthDifference
## t = -19.284, df = 585.55, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.262697 -1.029268
## sample estimates:
## mean of x mean of y 
## 0.4872052 1.6331878
cor.test(airline.df$PriceRelative, airline.df$WidthDifference)
## 
##  Pearson's product-moment correlation
## 
## data:  airline.df$PriceRelative and airline.df$WidthDifference
## t = 11.869, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4125388 0.5528218
## sample estimates:
##       cor 
## 0.4858024

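Note that the Welch t-test only compares the means of the two variables (0.49 vs 1.63), which are measured on different scales, so its p-value says nothing about their correlation. The relevant result is the cor.test: r = 0.49 with p-value < 2.2e-16, which is far below 0.05. Thus, we reject the null hypothesis and conclude that there is indeed a moderate positive correlation between PriceRelative and WidthDifference of the two classes.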

T-test and Cor Test to check correlation between PriceRelative and PercentPremiumSeats

Null Hypothesis - There is no correlation between the difference in prices in the two classes and PercentPremiumSeats.

Alternative Hypothesis - There is a correlation between the difference between prices in the two classes and PercentPremiumSeats.

t.test(airline.df$PriceRelative, airline.df$PercentPremiumSeats)
## 
##  Welch Two Sample t-test
## 
## data:  airline.df$PriceRelative and airline.df$PercentPremiumSeats
## t = -62.302, df = 464.91, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.60477 -13.71164
## sample estimates:
##  mean of x  mean of y 
##  0.4872052 14.6454148
cor.test(airline.df$PriceRelative, airline.df$PercentPremiumSeats)
## 
##  Pearson's product-moment correlation
## 
## data:  airline.df$PriceRelative and airline.df$PercentPremiumSeats
## t = -3.496, df = 456, p-value = 0.0005185
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.24949885 -0.07098966
## sample estimates:
##        cor 
## -0.1615656

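Note that the Welch t-test only compares the means of the two variables (0.49 vs 14.65), which are measured on different scales, so its p-value says nothing about their correlation. The relevant result is the cor.test: r = -0.16 with p-value = 0.0005185, which is well below 0.05. Thus, we reject the null hypothesis and conclude that there is a correlation between PriceRelative and PercentPremiumSeats, although it is weak and negative.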

Regression Analysis

y = b0 + b1(x1) + b2(x2) + b3(x3)

PriceRelative = b0 + b1(PitchDifference) + b2(WidthDifference) + b3(PercentPremiumSeats)

fit <- lm(formula = PriceRelative ~ PitchDifference + WidthDifference + PercentPremiumSeats, data = airline.df)
summary(fit)
## 
## Call:
## lm(formula = PriceRelative ~ PitchDifference + WidthDifference + 
##     PercentPremiumSeats, data = airline.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88643 -0.29471 -0.05005  0.19013  1.17157 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.031508   0.097220  -0.324    0.746    
## PitchDifference      0.064596   0.016171   3.994 7.56e-05 ***
## WidthDifference      0.104782   0.024813   4.223 2.92e-05 ***
## PercentPremiumSeats -0.005764   0.003971  -1.451    0.147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3882 on 454 degrees of freedom
## Multiple R-squared:  0.2627, Adjusted R-squared:  0.2579 
## F-statistic: 53.93 on 3 and 454 DF,  p-value: < 2.2e-16

Looking at the results of the regression analysis, the F-statistic (53.93) and its p-value (< 2.2e-16, far below 0.05) indicate that the model as a whole is statistically significant, i.e. at least one of the predictors has a non-zero coefficient. The Adjusted R-squared (0.2579) indicates that the selected explanatory variables explain approximately 25.8% of the variance in PriceRelative, which also suggests that other variables, not included here, account for much of the difference in price between the two classes. Additionally, holding the other predictors constant:

- For a 1 unit increase in PitchDifference, PriceRelative increases by about 0.065.
- For a 1 unit increase in WidthDifference, PriceRelative increases by about 0.105.
- With p = 0.147 (> 0.05), we conclude that the beta coefficient of PercentPremiumSeats is not statistically significant.
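
To make the coefficients concrete, the fitted equation can be applied by hand to one observation, and the summary statistics can be cross-checked from R-squared. A minimal sketch, using only numbers copied from the output above (the rounded coefficients and R-squared mean the results agree only approximately with the printed values):

```r
# Coefficients copied from summary(fit) above
b <- c(Intercept = -0.031508, PitchDifference = 0.064596,
       WidthDifference = 0.104782, PercentPremiumSeats = -0.005764)

# Predictor values from the first row of head(airline.df); the leading 1 is for the intercept
x <- c(1, PitchDifference = 7, WidthDifference = 1, PercentPremiumSeats = 24.69)

# Fitted PriceRelative for that row; the observed value is 0.38
fitted_row1 <- sum(b * x)

# Adjusted R-squared and F-statistic recomputed from the rounded R-squared
r2 <- 0.2627   # Multiple R-squared
n  <- 458      # observations
k  <- 3        # predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

round(c(fitted = fitted_row1, adj_r2 = adj_r2, f = f_stat), 3)
```

Rows 1 and 5 of the data share the same three predictor values, so the model gives both the same fitted value of about 0.38, yet row 5 has an observed PriceRelative of 0.67. That gap is a small illustration of why the model explains only about a quarter of the variance.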