MAIN RESEARCH QUESTION - What factors explain the difference in price between an economy ticket and a premium-economy airline ticket?
Reading Data into R
setwd("C:/Users/Parul Verma/Desktop/Data Analytics Internship")
Data.df <-read.csv(paste ("SixAirlinesDataV2.csv", sep=""))
Summarizing the data to understand the mean, median, standard deviation of each variable using 1. summary command
summary(Data.df)
## Airline Aircraft FlightDuration TravelMonth
## AirFrance: 74 AirBus:151 Min. : 1.250 Aug:127
## British :175 Boeing:307 1st Qu.: 4.260 Jul: 75
## Delta : 46 Median : 7.790 Oct:127
## Jet : 61 Mean : 7.578 Sep:129
## Singapore: 40 3rd Qu.:10.620
## Virgin : 62 Max. :14.660
## IsInternational SeatsEconomy SeatsPremium PitchEconomy
## Domestic : 40 Min. : 78.0 Min. : 8.00 Min. :30.00
## International:418 1st Qu.:133.0 1st Qu.:21.00 1st Qu.:31.00
## Median :185.0 Median :36.00 Median :31.00
## Mean :202.3 Mean :33.65 Mean :31.22
## 3rd Qu.:243.0 3rd Qu.:40.00 3rd Qu.:32.00
## Max. :389.0 Max. :66.00 Max. :33.00
## PitchPremium WidthEconomy WidthPremium PriceEconomy
## Min. :34.00 Min. :17.00 Min. :17.00 Min. : 65
## 1st Qu.:38.00 1st Qu.:18.00 1st Qu.:19.00 1st Qu.: 413
## Median :38.00 Median :18.00 Median :19.00 Median :1242
## Mean :37.91 Mean :17.84 Mean :19.47 Mean :1327
## 3rd Qu.:38.00 3rd Qu.:18.00 3rd Qu.:21.00 3rd Qu.:1909
## Max. :40.00 Max. :19.00 Max. :21.00 Max. :3593
## PricePremium PriceRelative SeatsTotal PitchDifference
## Min. : 86.0 Min. :0.0200 Min. : 98 Min. : 2.000
## 1st Qu.: 528.8 1st Qu.:0.1000 1st Qu.:166 1st Qu.: 6.000
## Median :1737.0 Median :0.3650 Median :227 Median : 7.000
## Mean :1845.3 Mean :0.4872 Mean :236 Mean : 6.688
## 3rd Qu.:2989.0 3rd Qu.:0.7400 3rd Qu.:279 3rd Qu.: 7.000
## Max. :7414.0 Max. :1.8900 Max. :441 Max. :10.000
## WidthDifference PercentPremiumSeats
## Min. :0.000 Min. : 4.71
## 1st Qu.:1.000 1st Qu.:12.28
## Median :1.000 Median :13.21
## Mean :1.633 Mean :14.65
## 3rd Qu.:3.000 3rd Qu.:15.36
## Max. :4.000 Max. :24.69
Alternatively, one can also use 2. describe command
library(psych)
describe(Data.df)
## vars n mean sd median trimmed mad min
## Airline* 1 458 3.01 1.65 2.00 2.89 1.48 1.00
## Aircraft* 2 458 1.67 0.47 2.00 1.71 0.00 1.00
## FlightDuration 3 458 7.58 3.54 7.79 7.57 4.81 1.25
## TravelMonth* 4 458 2.56 1.17 3.00 2.58 1.48 1.00
## IsInternational* 5 458 1.91 0.28 2.00 2.00 0.00 1.00
## SeatsEconomy 6 458 202.31 76.37 185.00 194.64 85.99 78.00
## SeatsPremium 7 458 33.65 13.26 36.00 33.35 11.86 8.00
## PitchEconomy 8 458 31.22 0.66 31.00 31.26 0.00 30.00
## PitchPremium 9 458 37.91 1.31 38.00 38.05 0.00 34.00
## WidthEconomy 10 458 17.84 0.56 18.00 17.81 0.00 17.00
## WidthPremium 11 458 19.47 1.10 19.00 19.53 0.00 17.00
## PriceEconomy 12 458 1327.08 988.27 1242.00 1244.40 1159.39 65.00
## PricePremium 13 458 1845.26 1288.14 1737.00 1799.05 1845.84 86.00
## PriceRelative 14 458 0.49 0.45 0.36 0.42 0.41 0.02
## SeatsTotal 15 458 235.96 85.29 227.00 228.73 90.44 98.00
## PitchDifference 16 458 6.69 1.76 7.00 6.76 0.00 2.00
## WidthDifference 17 458 1.63 1.19 1.00 1.53 0.00 0.00
## PercentPremiumSeats 18 458 14.65 4.84 13.21 14.31 2.68 4.71
## max range skew kurtosis se
## Airline* 6.00 5.00 0.61 -0.95 0.08
## Aircraft* 2.00 1.00 -0.72 -1.48 0.02
## FlightDuration 14.66 13.41 -0.07 -1.12 0.17
## TravelMonth* 4.00 3.00 -0.14 -1.46 0.05
## IsInternational* 2.00 1.00 -2.91 6.50 0.01
## SeatsEconomy 389.00 311.00 0.72 -0.36 3.57
## SeatsPremium 66.00 58.00 0.23 -0.46 0.62
## PitchEconomy 33.00 3.00 -0.03 -0.35 0.03
## PitchPremium 40.00 6.00 -1.51 3.52 0.06
## WidthEconomy 19.00 2.00 -0.04 -0.08 0.03
## WidthPremium 21.00 4.00 -0.08 -0.31 0.05
## PriceEconomy 3593.00 3528.00 0.51 -0.88 46.18
## PricePremium 7414.00 7328.00 0.50 0.43 60.19
## PriceRelative 1.89 1.87 1.17 0.72 0.02
## SeatsTotal 441.00 343.00 0.70 -0.53 3.99
## PitchDifference 10.00 8.00 -0.54 1.78 0.08
## WidthDifference 4.00 4.00 0.84 -0.53 0.06
## PercentPremiumSeats 24.69 19.98 0.71 0.28 0.23
From the list of values given, one can look at the mean, median, mode and standard deviation of each of the column variables.
ScatterPlots and BoxPlots for analysing variables:
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
library(psych)
library(lattice)
For this, we take Relative Pricing and find the correlation between Relative Pricing and the other variable -
boxplot(PriceRelative~Airline, data = Data.df, main = "Airline Vs Relative Price", xlab = "Airlines", ylab = "Relative Price", col = c("red","blue","green","orange","yellow","purple"))
Correlation present.
boxplot(PriceRelative~Aircraft,data = Data.df, main="Relative Price Vs.Aircraft ", xlab="Aircraft", ylab="Price Relative", col = c("blue","red"))
Correlation present (weak)
scatterplot(PriceRelative~FlightDuration , data = Data.df, xlab = "Flight Duration", ylab = "Relative Pricing")
Correlation present.
boxplot(PriceRelative~TravelMonth , data = Data.df, xlab = "Travel Month", ylab = "Relative Pricing", col = c("red","blue","green","orange"))
No correlation.
boxplot(PriceRelative~IsInternational, data = Data.df, xlab = "IsInternational", ylab = "Relative Pricing", col = c("red","blue"))
Correlation present.
scatterplot(PriceRelative~SeatsEconomy, data = Data.df, xlab = "Seats in Economy", ylab = "Relative Pricing")
No correlation
scatterplot(PriceRelative~SeatsPremium, data = Data.df, xlab = "Seats in Premium", ylab = "Relative Pricing")
Correlation present (negative)
scatterplot(PriceRelative~PitchEconomy, data = Data.df, xlab = " Pitch in Economy", ylab = "Relative Pricing")
Correlation present
scatterplot(PriceRelative~PitchPremium, data = Data.df, xlab = "Pitch in Premium", ylab = "Relative Pricing")
Correlation present
scatterplot(PriceRelative~WidthEconomy, data = Data.df, xlab = "Width in Economy", ylab = "Relative Pricing")
Correlation present (not statistically significant)
scatterplot(PriceRelative~WidthPremium, data = Data.df, xlab = "Width in Premium", ylab = "Relative Pricing")
Correlation present.
scatterplot(PriceRelative~SeatsTotal, data = Data.df, xlab = "Total Seats", ylab = "Relative Pricing")
No correlation.
scatterplot(PriceRelative~PitchDifference, data = Data.df, xlab = "Difference in Pitch", ylab = "Relative Pricing")
Correlation present.
scatterplot(PriceRelative~WidthDifference, data = Data.df, xlab = "Difference in Width", ylab = "Relative Pricing")
Correlation present
scatterplot(PriceRelative~PercentPremiumSeats, data = Data.df, xlab = "Percentage of Premium Seats", ylab = "Relative Pricing")
Correlation present (negative)
Therefore, from above, we get the factors that affect the Pricing.
Drawing a CORRGRAM now -
library(corrgram)
par(mfrow=c(1,1))
corrgram(Data.df, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt,main="Corrgram")
Variance-CoVariance Matrix
VarCovarmatrix <- var(Data.df[,1:18])
## Warning in var(Data.df[, 1:18]): NAs introduced by coercion
round(VarCovarmatrix, 2)
## Airline Aircraft FlightDuration TravelMonth
## Airline NA NA NA NA
## Aircraft NA NA NA NA
## FlightDuration NA NA 12.55 NA
## TravelMonth NA NA NA NA
## IsInternational NA NA NA NA
## SeatsEconomy NA NA 52.92 NA
## SeatsPremium NA NA 7.57 NA
## PitchEconomy NA NA 0.68 NA
## PitchPremium NA NA 0.45 NA
## WidthEconomy NA NA 0.90 NA
## WidthPremium NA NA 0.40 NA
## PriceEconomy NA NA 1983.54 NA
## PricePremium NA NA 2959.98 NA
## PriceRelative NA NA 0.19 NA
## SeatsTotal NA NA 60.49 NA
## PitchDifference NA NA -0.23 NA
## WidthDifference NA NA -0.50 NA
## PercentPremiumSeats NA NA 1.04 NA
## IsInternational SeatsEconomy SeatsPremium PitchEconomy
## Airline NA NA NA NA
## Aircraft NA NA NA NA
## FlightDuration NA 52.92 7.57 0.68
## TravelMonth NA NA NA NA
## IsInternational NA NA NA NA
## SeatsEconomy NA 5832.92 633.07 7.21
## SeatsPremium NA 633.07 175.87 -0.30
## PitchEconomy NA 7.21 -0.30 0.43
## PitchPremium NA 11.96 0.09 -0.47
## WidthEconomy NA 15.91 3.37 0.11
## WidthPremium NA 8.58 -0.04 -0.39
## PriceEconomy NA 9673.79 1489.38 238.70
## PricePremium NA 17413.25 3717.36 190.85
## PriceRelative NA 0.14 -0.58 -0.12
## SeatsTotal NA 6465.99 808.94 6.91
## PitchDifference NA 4.75 0.38 -0.90
## WidthDifference NA -7.33 -3.41 -0.50
## PercentPremiumSeats NA -122.39 31.15 -0.33
## PitchPremium WidthEconomy WidthPremium PriceEconomy
## Airline NA NA NA NA
## Aircraft NA NA NA NA
## FlightDuration 0.45 0.90 0.40 1983.54
## TravelMonth NA NA NA NA
## IsInternational NA NA NA NA
## SeatsEconomy 11.96 15.91 8.58 9673.79
## SeatsPremium 0.09 3.37 -0.04 1489.38
## PitchEconomy -0.47 0.11 -0.39 238.70
## PitchPremium 1.73 -0.02 1.08 65.43
## WidthEconomy -0.02 0.31 0.05 37.46
## WidthPremium 1.08 0.05 1.20 -61.85
## PriceEconomy 65.43 37.46 -61.85 976684.06
## PricePremium 149.85 108.12 90.48 1147494.77
## PriceRelative 0.25 -0.01 0.25 -128.50
## SeatsTotal 12.05 19.28 8.54 11163.18
## PitchDifference 2.20 -0.12 1.47 -173.28
## WidthDifference 1.10 -0.26 1.15 -99.32
## PercentPremiumSeats -1.12 0.61 -0.97 312.61
## PricePremium PriceRelative SeatsTotal PitchDifference
## Airline NA NA NA NA
## Aircraft NA NA NA NA
## FlightDuration 2959.98 0.19 60.49 -0.23
## TravelMonth NA NA NA NA
## IsInternational NA NA NA NA
## SeatsEconomy 17413.25 0.14 6465.99 4.75
## SeatsPremium 3717.36 -0.58 808.94 0.38
## PitchEconomy 190.85 -0.12 6.91 -0.90
## PitchPremium 149.85 0.25 12.05 2.20
## WidthEconomy 108.12 -0.01 19.28 -0.12
## WidthPremium 90.48 0.25 8.54 1.47
## PriceEconomy 1147494.77 -128.50 11163.18 -173.28
## PricePremium 1659293.12 18.48 21130.62 -41.00
## PriceRelative 18.48 0.20 -0.44 0.37
## SeatsTotal 21130.62 -0.44 7274.92 5.13
## PitchDifference -41.00 0.37 5.13 3.10
## WidthDifference -17.64 0.26 -10.74 1.59
## PercentPremiumSeats 726.02 -0.35 -91.24 -0.79
## WidthDifference PercentPremiumSeats
## Airline NA NA
## Aircraft NA NA
## FlightDuration -0.50 1.04
## TravelMonth NA NA
## IsInternational NA NA
## SeatsEconomy -7.33 -122.39
## SeatsPremium -3.41 31.15
## PitchEconomy -0.50 -0.33
## PitchPremium 1.10 -1.12
## WidthEconomy -0.26 0.61
## WidthPremium 1.15 -0.97
## PriceEconomy -99.32 312.61
## PricePremium -17.64 726.02
## PriceRelative 0.26 -0.35
## SeatsTotal -10.74 -91.24
## PitchDifference 1.59 -0.79
## WidthDifference 1.41 -1.59
## PercentPremiumSeats -1.59 23.45
Correlation Matrix
correlmatrix <- cor(Data.df[,6:18])
round(correlmatrix,digits = 2)
## SeatsEconomy SeatsPremium PitchEconomy PitchPremium
## SeatsEconomy 1.00 0.63 0.14 0.12
## SeatsPremium 0.63 1.00 -0.03 0.00
## PitchEconomy 0.14 -0.03 1.00 -0.55
## PitchPremium 0.12 0.00 -0.55 1.00
## WidthEconomy 0.37 0.46 0.29 -0.02
## WidthPremium 0.10 0.00 -0.54 0.75
## PriceEconomy 0.13 0.11 0.37 0.05
## PricePremium 0.18 0.22 0.23 0.09
## PriceRelative 0.00 -0.10 -0.42 0.42
## SeatsTotal 0.99 0.72 0.12 0.11
## PitchDifference 0.04 0.02 -0.78 0.95
## WidthDifference -0.08 -0.22 -0.64 0.70
## PercentPremiumSeats -0.33 0.49 -0.10 -0.18
## WidthEconomy WidthPremium PriceEconomy PricePremium
## SeatsEconomy 0.37 0.10 0.13 0.18
## SeatsPremium 0.46 0.00 0.11 0.22
## PitchEconomy 0.29 -0.54 0.37 0.23
## PitchPremium -0.02 0.75 0.05 0.09
## WidthEconomy 1.00 0.08 0.07 0.15
## WidthPremium 0.08 1.00 -0.06 0.06
## PriceEconomy 0.07 -0.06 1.00 0.90
## PricePremium 0.15 0.06 0.90 1.00
## PriceRelative -0.04 0.50 -0.29 0.03
## SeatsTotal 0.41 0.09 0.13 0.19
## PitchDifference -0.13 0.76 -0.10 -0.02
## WidthDifference -0.39 0.88 -0.08 -0.01
## PercentPremiumSeats 0.23 -0.18 0.07 0.12
## PriceRelative SeatsTotal PitchDifference
## SeatsEconomy 0.00 0.99 0.04
## SeatsPremium -0.10 0.72 0.02
## PitchEconomy -0.42 0.12 -0.78
## PitchPremium 0.42 0.11 0.95
## WidthEconomy -0.04 0.41 -0.13
## WidthPremium 0.50 0.09 0.76
## PriceEconomy -0.29 0.13 -0.10
## PricePremium 0.03 0.19 -0.02
## PriceRelative 1.00 -0.01 0.47
## SeatsTotal -0.01 1.00 0.03
## PitchDifference 0.47 0.03 1.00
## WidthDifference 0.49 -0.11 0.76
## PercentPremiumSeats -0.16 -0.22 -0.09
## WidthDifference PercentPremiumSeats
## SeatsEconomy -0.08 -0.33
## SeatsPremium -0.22 0.49
## PitchEconomy -0.64 -0.10
## PitchPremium 0.70 -0.18
## WidthEconomy -0.39 0.23
## WidthPremium 0.88 -0.18
## PriceEconomy -0.08 0.07
## PricePremium -0.01 0.12
## PriceRelative 0.49 -0.16
## SeatsTotal -0.11 -0.22
## PitchDifference 0.76 -0.09
## WidthDifference 1.00 -0.28
## PercentPremiumSeats -0.28 1.00
Hypothesis and its T-testing :
Hypothesis H1 - The price difference of the ticket depends on Seat Width. Null hypothesis H0 - The price difference of the ticket does not depend on Seat Width.
Running a T-Test
t.test( Data.df$WidthPremium,Data.df$PricePremium - Data.df$PriceEconomy, var.equal = TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: Data.df$WidthPremium and Data.df$PricePremium - Data.df$PriceEconomy
## t = -18.277, df = 914, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -552.2598 -445.1594
## sample estimates:
## mean of x mean of y
## 19.47162 518.18122
Since the p=value < 0.05, we reject the null hypothesis. Therefore, seat width does affect pricing.
Hypothesis H2 - The price of the ticket depends on Pitch. Null hypothesis H0 - The price of the ticket does not depend on Pitch.
Running a T-Test
t.test( Data.df$PitchPremium,Data.df$PricePremium - Data.df$PriceEconomy, var.equal = TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: Data.df$PitchPremium and Data.df$PricePremium - Data.df$PriceEconomy
## t = -17.602, df = 914, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -533.8253 -426.7249
## sample estimates:
## mean of x mean of y
## 37.90611 518.18122
Since the p=value < 0.05, we reject the null hypothesis. Therefore, pitch does affect pricing.
Hypothesis H3 - The price of the ticket depends on Flight Duration. Null hypothesis H0 - The price of the ticket does not depend on Flight Duration.
Running a T-Test
t.test( Data.df$FlightDuration,Data.df$PricePremium - Data.df$PriceEconomy, var.equal = TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: Data.df$FlightDuration and Data.df$PricePremium - Data.df$PriceEconomy
## t = -18.713, df = 914, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -564.1545 -457.0523
## sample estimates:
## mean of x mean of y
## 7.577838 518.181223
Since the p=value < 0.05, we reject the null hypothesis. Therefore, flight duration does affect pricing.
Hypothesis H4 - The price difference of the ticket depends on Percentage of Premium Seats. Null hypothesis H0 - The price difference of the ticket does not depend on Percentage of Premium Seats.
Running a T-Test
t.test( Data.df$PercentPremiumSeats,Data.df$PricePremium - Data.df$PriceEconomy, var.equal = TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: Data.df$PercentPremiumSeats and Data.df$PricePremium - Data.df$PriceEconomy
## t = -18.454, df = 914, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -557.0877 -449.9839
## sample estimates:
## mean of x mean of y
## 14.64541 518.18122
Since the p=value < 0.05, we reject the null hypothesis. Therefore, Percentage of Premium Seats does affect pricing.
Regression Analysis :
regmodel <- lm(formula = Data.df$PriceRelative~Data.df$PitchDifference+Data.df$WidthDifference+Data.df$FlightDuration+Data.df$PercentPremiumSeats)
summary(regmodel)
##
## Call:
## lm(formula = Data.df$PriceRelative ~ Data.df$PitchDifference +
## Data.df$WidthDifference + Data.df$FlightDuration + Data.df$PercentPremiumSeats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79439 -0.29424 -0.03427 0.16197 1.13688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.179033 0.101492 -1.764 0.07840 .
## Data.df$PitchDifference 0.059311 0.015921 3.725 0.00022 ***
## Data.df$WidthDifference 0.118140 0.024555 4.811 2.05e-06 ***
## Data.df$FlightDuration 0.021707 0.005085 4.269 2.39e-05 ***
## Data.df$PercentPremiumSeats -0.005999 0.003898 -1.539 0.12454
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.381 on 453 degrees of freedom
## Multiple R-squared: 0.2913, Adjusted R-squared: 0.285
## F-statistic: 46.54 on 4 and 453 DF, p-value: < 2.2e-16
As our p-value is < 0.05, we can also use this to reject our null hypothesis.
The regression model : PriceRelative = ( 0.059310975 PitchDifference) + (0.118140211 WidthDifference) + (0.021707245FlightDuration) + (-0.005999062PercentPremiumSeats) + (-0.179033482=Intercept)
coefficients(regmodel)
## (Intercept) Data.df$PitchDifference
## -0.179033482 0.059310975
## Data.df$WidthDifference Data.df$FlightDuration
## 0.118140211 0.021707245
## Data.df$PercentPremiumSeats
## -0.005999062
The regression model : PriceRelative = ( 0.059310975 PitchDifference) + (0.118140211 WidthDifference) + (0.021707245FlightDuration) + (-0.005999062PercentPremiumSeats) + (-0.179033482=Intercept)
Thus, y can be Relative Price and x = {x1,x2,x3,x4} can be Pitch Difference, Width Difference, Flight Duration and Percentage of Premium Seats.
So, we can conclude that there are various factors (like Pitch Difference, Width Difference, Flight Duration and Percentage of Premium Seats for example) that cause a difference in the prices of Premium Economy and Economy airline tickets.