TASK 3 - MINI PROJECT ON THE AIRLINE INDUSTRY
We will be analyzing the pricing of Premium Economy tickets relative to regular Economy airline tickets.
Read the data into R
setwd("C:/Users/HP/Downloads/Intern/WEEK 3 DAY 6")
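# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), matching the summary below
SixAirlinesData.df <- read.csv("SixAirlinesDataV2.csv", stringsAsFactors = TRUE)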
View(SixAirlinesData.df)
Summarize the data to understand the mean, median, standard deviation of each variable
summary(SixAirlinesData.df)
##       Airline      Aircraft   FlightDuration   TravelMonth
##  AirFrance: 74   AirBus:151   Min.   : 1.250   Aug:127
##  British  :175   Boeing:307   1st Qu.: 4.260   Jul: 75
##  Delta    : 46                Median : 7.790   Oct:127
##  Jet      : 61                Mean   : 7.578   Sep:129
##  Singapore: 40                3rd Qu.:10.620
##  Virgin   : 62                Max.   :14.660
##       IsInternational   SeatsEconomy    SeatsPremium    PitchEconomy
##  Domestic     : 40    Min.   : 78.0   Min.   : 8.00   Min.   :30.00
##  International:418    1st Qu.:133.0   1st Qu.:21.00   1st Qu.:31.00
##                       Median :185.0   Median :36.00   Median :31.00
##                       Mean   :202.3   Mean   :33.65   Mean   :31.22
##                       3rd Qu.:243.0   3rd Qu.:40.00   3rd Qu.:32.00
##                       Max.   :389.0   Max.   :66.00   Max.   :33.00
##   PitchPremium    WidthEconomy    WidthPremium    PriceEconomy
##  Min.   :34.00   Min.   :17.00   Min.   :17.00   Min.   :  65
##  1st Qu.:38.00   1st Qu.:18.00   1st Qu.:19.00   1st Qu.: 413
##  Median :38.00   Median :18.00   Median :19.00   Median :1242
##  Mean   :37.91   Mean   :17.84   Mean   :19.47   Mean   :1327
##  3rd Qu.:38.00   3rd Qu.:18.00   3rd Qu.:21.00   3rd Qu.:1909
##  Max.   :40.00   Max.   :19.00   Max.   :21.00   Max.   :3593
##   PricePremium     PriceRelative     SeatsTotal    PitchDifference
##  Min.   :  86.0   Min.   :0.0200   Min.   : 98   Min.   : 2.000
##  1st Qu.: 528.8   1st Qu.:0.1000   1st Qu.:166   1st Qu.: 6.000
##  Median :1737.0   Median :0.3650   Median :227   Median : 7.000
##  Mean   :1845.3   Mean   :0.4872   Mean   :236   Mean   : 6.688
##  3rd Qu.:2989.0   3rd Qu.:0.7400   3rd Qu.:279   3rd Qu.: 7.000
##  Max.   :7414.0   Max.   :1.8900   Max.   :441   Max.   :10.000
##  WidthDifference   PercentPremiumSeats
##  Min.   :0.000     Min.   : 4.71
##  1st Qu.:1.000     1st Qu.:12.28
##  Median :1.000     Median :13.21
##  Mean   :1.633     Mean   :14.65
##  3rd Qu.:3.000     3rd Qu.:15.36
##  Max.   :4.000     Max.   :24.69
library(psych)
describe(SixAirlinesData.df)
##                     vars   n    mean      sd  median trimmed     mad   min
## Airline*               1 458    3.01    1.65    2.00    2.89    1.48  1.00
## Aircraft*              2 458    1.67    0.47    2.00    1.71    0.00  1.00
## FlightDuration         3 458    7.58    3.54    7.79    7.57    4.81  1.25
## TravelMonth*           4 458    2.56    1.17    3.00    2.58    1.48  1.00
## IsInternational*       5 458    1.91    0.28    2.00    2.00    0.00  1.00
## SeatsEconomy           6 458  202.31   76.37  185.00  194.64   85.99 78.00
## SeatsPremium           7 458   33.65   13.26   36.00   33.35   11.86  8.00
## PitchEconomy           8 458   31.22    0.66   31.00   31.26    0.00 30.00
## PitchPremium           9 458   37.91    1.31   38.00   38.05    0.00 34.00
## WidthEconomy          10 458   17.84    0.56   18.00   17.81    0.00 17.00
## WidthPremium          11 458   19.47    1.10   19.00   19.53    0.00 17.00
## PriceEconomy          12 458 1327.08  988.27 1242.00 1244.40 1159.39 65.00
## PricePremium          13 458 1845.26 1288.14 1737.00 1799.05 1845.84 86.00
## PriceRelative         14 458    0.49    0.45    0.36    0.42    0.41  0.02
## SeatsTotal            15 458  235.96   85.29  227.00  228.73   90.44 98.00
## PitchDifference       16 458    6.69    1.76    7.00    6.76    0.00  2.00
## WidthDifference       17 458    1.63    1.19    1.00    1.53    0.00  0.00
## PercentPremiumSeats   18 458   14.65    4.84   13.21   14.31    2.68  4.71
##                         max   range  skew kurtosis    se
## Airline*               6.00    5.00  0.61    -0.95  0.08
## Aircraft*              2.00    1.00 -0.72    -1.48  0.02
## FlightDuration        14.66   13.41 -0.07    -1.12  0.17
## TravelMonth*           4.00    3.00 -0.14    -1.46  0.05
## IsInternational*       2.00    1.00 -2.91     6.50  0.01
## SeatsEconomy         389.00  311.00  0.72    -0.36  3.57
## SeatsPremium          66.00   58.00  0.23    -0.46  0.62
## PitchEconomy          33.00    3.00 -0.03    -0.35  0.03
## PitchPremium          40.00    6.00 -1.51     3.52  0.06
## WidthEconomy          19.00    2.00 -0.04    -0.08  0.03
## WidthPremium          21.00    4.00 -0.08    -0.31  0.05
## PriceEconomy        3593.00 3528.00  0.51    -0.88 46.18
## PricePremium        7414.00 7328.00  0.50     0.43 60.19
## PriceRelative          1.89    1.87  1.17     0.72  0.02
## SeatsTotal           441.00  343.00  0.70    -0.53  3.99
## PitchDifference       10.00    8.00 -0.54     1.78  0.08
## WidthDifference        4.00    4.00  0.84    -0.53  0.06
## PercentPremiumSeats   24.69   19.98  0.71     0.28  0.23
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
Draw Bar Plots (histograms, since these variables are numeric) to visualize the distribution of each variable independently
par(mfrow=c(1,2))
hist(SixAirlinesData.df$FlightDuration, breaks=18, col="springgreen", xlab="FlightDuration", main="Histogram")
hist(SixAirlinesData.df$SeatsEconomy, breaks=18, col="salmon", xlab="SeatsEconomy", main="Histogram")
hist(SixAirlinesData.df$SeatsPremium, breaks=18, col="tomato2", xlab="SeatsPremium", main="Histogram")
hist(SixAirlinesData.df$PitchEconomy, breaks=18, col="wheat", xlab="PitchEconomy", main="Histogram")
hist(SixAirlinesData.df$PitchPremium, breaks=18, col="tan1", xlab="PitchPremium", main="Histogram")
hist(SixAirlinesData.df$WidthEconomy, breaks=18, col="yellowgreen", xlab="WidthEconomy", main="Histogram")
hist(SixAirlinesData.df$WidthPremium, breaks=18, col="rosybrown", xlab="WidthPremium", main="Histogram")
hist(SixAirlinesData.df$PriceEconomy, breaks=18, col="royalblue", xlab="PriceEconomy", main="Histogram")
hist(SixAirlinesData.df$PricePremium, breaks=18, col="steelblue", xlab="PricePremium", main="Histogram")
hist(SixAirlinesData.df$PriceRelative, breaks=18, col="saddlebrown", xlab="PriceRelative", main="Histogram")
hist(SixAirlinesData.df$SeatsTotal, breaks=18, col="violetred", xlab="SeatsTotal", main="Histogram")
hist(SixAirlinesData.df$PitchDifference, breaks=18, col="sandybrown",xlab="PitchDifference", main="Histogram")
hist(SixAirlinesData.df$WidthDifference, breaks=18, col="seagreen", xlab="WidthDifference", main="Histogram")
hist(SixAirlinesData.df$PercentPremiumSeats, breaks=18, col="gray", xlab="PercentPremiumSeats", main="Histogram")
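The repeated hist() calls above could also be written as a loop over the numeric columns. The following is only a minimal sketch: plot_all_histograms is a hypothetical helper, not part of the original analysis, and R's built-in iris data is used as a stand-in so the example runs without the airline CSV.

```r
# Hypothetical helper: draw one histogram per numeric column of a data frame.
plot_all_histograms <- function(df, breaks = 18) {
  # Keep only the names of numeric columns; factors such as Airline are skipped
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  for (col in numeric_cols) {
    hist(df[[col]], breaks = breaks, xlab = col,
         main = paste("Histogram of", col))
  }
  # Return the plotted column names invisibly so the caller can inspect them
  invisible(numeric_cols)
}

# Stand-in data (assumption): iris has four numeric columns and one factor
plotted <- plot_all_histograms(iris)
```

With the airline data loaded, the equivalent call would presumably be plot_all_histograms(SixAirlinesData.df), at the cost of losing the per-variable colors used above.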
Draw Box Plots to visualize the distribution of each variable independently
par(mfrow=c(1,2))
boxplot(SixAirlinesData.df$SeatsEconomy, col = "tan3")
boxplot(SixAirlinesData.df$SeatsPremium, col = "tan2")
boxplot(SixAirlinesData.df$SeatsTotal, col="tan1")
boxplot(SixAirlinesData.df$PitchEconomy, col = "red4")
boxplot(SixAirlinesData.df$PitchPremium, col = "yellow1")
boxplot(SixAirlinesData.df$PitchDifference, col = "wheat2")
boxplot(SixAirlinesData.df$WidthEconomy, col = "ivory")
boxplot(SixAirlinesData.df$WidthPremium, col= "khaki")
boxplot(SixAirlinesData.df$WidthDifference, col = "hotpink")
boxplot(SixAirlinesData.df$PriceEconomy, col="lawngreen")
boxplot(SixAirlinesData.df$PricePremium, col="magenta")
boxplot(SixAirlinesData.df$PriceRelative, col = "linen")
boxplot(SixAirlinesData.df$PercentPremiumSeats, col = "maroon1")
boxplot(SixAirlinesData.df$PriceRelative ~ SixAirlinesData.df$SeatsTotal, horizontal=TRUE, ylab="Relative Price", xlab="Seats Total", las=1, main="Boxplot", col="honeydew")
boxplot(SixAirlinesData.df$PriceRelative ~ SixAirlinesData.df$WidthDifference, horizontal=TRUE, ylab="Relative Price", xlab="WidthDifference", las=1, main="Boxplot", col="indianred")
boxplot(SixAirlinesData.df$PriceRelative ~ SixAirlinesData.df$PitchDifference, horizontal=TRUE, ylab="Relative Price", xlab="PitchDifference", las=1, main="Boxplot", col="limegreen")
boxplot(SixAirlinesData.df$PriceRelative ~ SixAirlinesData.df$PercentPremiumSeats, horizontal=TRUE,ylab="Relative Price", xlab="PercentPremiumSeats", las=1,main="Boxplot", col="lightblue")
Draw Scatter Plots to understand how the variables are correlated pairwise
library(car)
scatterplotMatrix(SixAirlinesData.df[,c("PriceRelative", "SeatsTotal")],spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix", col="lightpink")
scatterplotMatrix(SixAirlinesData.df[,c("PriceRelative", "PitchDifference")],spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix", col="mediumblue")
scatterplotMatrix(SixAirlinesData.df[,c("PriceRelative", "WidthDifference")],spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix", col="lightcoral")
scatterplotMatrix(SixAirlinesData.df[,c("PriceRelative", "PercentPremiumSeats")],spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix", col="lightcyan")
Draw Corrgram
finals.df <- SixAirlinesData.df[6:18]
library(corrgram)
corrgram(SixAirlinesData.df, order=FALSE,
lower.panel=panel.shade,
upper.panel=panel.pie,
text.panel=panel.txt,
diag.panel=panel.minmax,
main="Corrgram")
So far, we see:
The more seats an aircraft has, the higher the price, and the larger the percentage of premium seats, the lower the price.
Positive correlation: (WidthDifference, PitchDifference). Negative correlation: (PercentPremiumSeats, SeatsTotal).
Create a Variance-Covariance Matrix
#cov(finals.df)
#cor(finals.df)
#library(psych)
#corr.test(finals.df, use="complete")
library(corpcor)
library(tseries)
# cov() already returns a 13 x 13 matrix labeled with the column names of finals.df,
# so it does not need to be reshaped with matrix() or renamed by hand.
covmat <- cov(finals.df)
# covmat
# cov.shrink() from corpcor expects the data matrix, not a covariance matrix
#cov.shrink(as.matrix(finals.df))
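As a minimal sketch of what a variance-covariance matrix looks like and how it relates to the correlation matrix, the example below uses four columns of R's built-in mtcars data as a stand-in (an assumption, so that it runs without the airline CSV). Only base R functions are used.

```r
# Stand-in data (assumption): four numeric columns from the built-in mtcars set
cars_subset <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# cov() on a data frame with p numeric columns returns a p x p matrix
covmat_demo <- cov(cars_subset)
print(dim(covmat_demo))

# The diagonal holds the variance of each variable
variances <- diag(covmat_demo)

# cov2cor() rescales a covariance matrix into the matching correlation matrix
cormat_demo <- cov2cor(covmat_demo)
print(round(cormat_demo, 2))
```

For the airline data, the same calls on finals.df would give a 13 x 13 matrix, one row and one column per variable.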
Articulate a Hypothesis (or two) that you could test using a Regression Model
Run appropriate t-tests to test your hypotheses
library(MASS)
library(psych)
attach(SixAirlinesData.df)
#H1: Increase in PitchDifference leads to increase in relative price.
t.test(PriceRelative, PitchDifference, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: PriceRelative and PitchDifference
## t = -72.974, df = 516.54, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -6.34058 Inf
## sample estimates:
## mean of x mean of y
## 0.4872052 6.6877729
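# p-value is 1 (> 0.05), so we fail to reject the null hypothesis that the mean of PriceRelative is not greater than the mean of PitchDifference.
# The confidence interval contains zero.
# Note: this test compares the means of two variables measured on different scales. It does not show whether PriceRelative rises with PitchDifference; the regression model below tests that relationship directly.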
#H2: Increase in WidthDifference leads to increase in relative price.
t.test(PriceRelative, WidthDifference, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: PriceRelative and WidthDifference
## t = -19.284, df = 585.55, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -1.243885 Inf
## sample estimates:
## mean of x mean of y
## 0.4872052 1.6331878
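# p-value is 1 (> 0.05), so we fail to reject the null hypothesis that the mean of PriceRelative is not greater than the mean of WidthDifference.
# The confidence interval contains zero.
# As with H1, this is a comparison of means, not a test of whether PriceRelative rises with WidthDifference.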
#H3: Increase in PercentPremiumSeats leads to increase in relative price.
t.test(PriceRelative, PercentPremiumSeats)
##
## Welch Two Sample t-test
##
## data: PriceRelative and PercentPremiumSeats
## t = -62.302, df = 464.91, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.60477 -13.71164
## sample estimates:
## mean of x mean of y
## 0.4872052 14.6454148
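# The confidence interval does not contain zero.
# p-value is < 2.2e-16 (< 0.05), so we reject the null hypothesis that PriceRelative and PercentPremiumSeats have equal means.
# This only shows that the two variables have different means (0.49 vs 14.65); it does not show how PriceRelative changes with PercentPremiumSeats.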
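Hypotheses of the form "an increase in X leads to an increase in Y" are about association between two variables, so one possible alternative to comparing their means is a one-sided correlation test. This is a hedged sketch, not the method used in the original analysis: test_positive_association is a hypothetical helper, and the built-in mtcars data stands in for the airline data.

```r
# Hypothetical helper: one-sided Pearson test of H1 "x and y are positively correlated"
test_positive_association <- function(df, y, x) {
  cor.test(df[[y]], df[[x]], alternative = "greater")
}

# Stand-in data (assumption): fuel economy vs rear axle ratio in mtcars
res <- test_positive_association(mtcars, y = "mpg", x = "drat")

print(res$estimate)  # sample correlation coefficient
print(res$p.value)   # small values are evidence of a positive association

# With the airline data loaded, the corresponding call would be, for example:
# test_positive_association(SixAirlinesData.df, "PriceRelative", "PitchDifference")
```

Unlike the two-sample t-test, this test is unaffected by the two variables being measured in different units.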
Formulate a Regression Model:
y = b0 + b1x1 + b2x2 + …
y is PriceRelative.
x = {x1, x2, …}: x1, x2, x3, x4 are SeatsTotal, PitchDifference, WidthDifference, and PercentPremiumSeats, respectively.
Fit a Linear Regression Model using lm()
m1 <- lm(PriceRelative ~
SeatsTotal
+ PitchDifference
+ WidthDifference
+ PercentPremiumSeats,
data=finals.df)
summary(m1)
##
## Call:
## lm(formula = PriceRelative ~ SeatsTotal + PitchDifference + WidthDifference +
## PercentPremiumSeats, data = finals.df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.8892 -0.2959 -0.0509  0.1915  1.1727
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)         -2.434e-02  1.135e-01  -0.214 0.830356
## SeatsTotal          -2.806e-05  2.287e-04  -0.123 0.902421
## PitchDifference      6.509e-02  1.667e-02   3.904 0.000109 ***
## WidthDifference      1.038e-01  2.599e-02   3.995 7.56e-05 ***
## PercentPremiumSeats -5.921e-03  4.175e-03  -1.418 0.156831
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3886 on 453 degrees of freedom
## Multiple R-squared: 0.2628, Adjusted R-squared: 0.2563
## F-statistic: 40.37 on 4 and 453 DF, p-value: < 2.2e-16
# beta coefficients
m1$coefficients
## (Intercept) SeatsTotal PitchDifference
## -2.433740e-02 -2.805517e-05 6.508568e-02
## WidthDifference PercentPremiumSeats
## 1.038415e-01 -5.920510e-03
# confidence intervals
confint(m1)
##                             2.5 %       97.5 %
## (Intercept)         -0.2474495396 0.1987747471
## SeatsTotal          -0.0004775002 0.0004213899
## PitchDifference      0.0323188879 0.0978524650
## WidthDifference      0.0527557878 0.1549271927
## PercentPremiumSeats -0.0141248368 0.0022838171
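The intervals reported by confint() can be reproduced by hand as estimate plus or minus a t critical value times the standard error. The sketch below does this for PitchDifference, using the rounded estimate, standard error, and residual degrees of freedom printed by summary(m1); because those inputs are rounded, the result should only be close to the confint() row, not identical.

```r
# Values copied from the summary(m1) output for PitchDifference
estimate  <- 6.509e-02
std_error <- 1.667e-02
df_resid  <- 453   # residual degrees of freedom reported by summary(m1)

# Two-sided 95% interval: estimate +/- t(0.975, df) * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(ci, 4))
```

Both limits are above zero, which is why PitchDifference is reported as statistically significant at the 5% level.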
library(coefplot)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
coefplot(m1, predictors=c("PitchDifference", "WidthDifference"), col="magenta")
## Warning: Ignoring unknown aesthetics: xmin, xmax
# statistically significant: neither confidence interval includes zero
coefplot(m1, predictors=c("PercentPremiumSeats"), col="magenta")
## Warning: Ignoring unknown aesthetics: xmin, xmax
# not statistically significant, as the confidence interval includes zero
coefplot(m1, predictors=c("SeatsTotal"), col="magenta")
## Warning: Ignoring unknown aesthetics: xmin, xmax
# not statistically significant, as the confidence interval includes zero
#-----------------------------------------------------------------------------------------------
# Compare the PriceRelative with the fitted values
# Here is the actual PriceRelative
#finals.df$PriceRelative
## ------------------------------------------------------------------------
# Here is the PriceRelative, as predicted by the OLS model
#fitted(m1)
# Compare relative price predicted by the model with the actual relative price given in the data
#predictedPriceRelative = data.frame(fitted(m1))
#actualPriceRelative = data.frame(finals.df$PriceRelative)
#PriceRelativeComparison = cbind(actualPriceRelative, predictedPriceRelative)
#View(PriceRelativeComparison)
Use the model results to test your hypotheses and draw inferences
What factors explain the difference in price between an economy ticket and a premium-economy airline ticket?
PitchDifference and WidthDifference both have positive, statistically significant coefficients in the model (p < 0.001), whereas SeatsTotal and PercentPremiumSeats are not statistically significant. An increase in pitch difference or width difference is therefore associated with an increase in relative price, i.e. the price gap between premium-economy and economy seats widens. The model explains about 26% of the variation in PriceRelative (multiple R-squared 0.2628), so factors not included here also play a role.
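To make the size of these effects concrete, the fitted coefficients can be plugged into the regression equation directly. The sketch below copies the coefficients from the m1$coefficients output and the sample means from the describe() and t.test() output; predict_relative_price is a hypothetical helper introduced only for illustration.

```r
# Coefficients copied from the m1$coefficients output
coefs <- c(
  "(Intercept)"       = -2.433740e-02,
  SeatsTotal          = -2.805517e-05,
  PitchDifference     =  6.508568e-02,
  WidthDifference     =  1.038415e-01,
  PercentPremiumSeats = -5.920510e-03
)

# Hypothetical helper: linear prediction b0 + sum(b_i * x_i)
predict_relative_price <- function(x) {
  unname(coefs["(Intercept)"] + sum(coefs[names(x)] * x))
}

# Sample means reported earlier in the document
typical <- c(SeatsTotal = 235.96, PitchDifference = 6.6877729,
             WidthDifference = 1.6331878, PercentPremiumSeats = 14.6454148)
base <- predict_relative_price(typical)

# Same flight with one extra inch of pitch difference, all else held fixed
roomier <- typical
roomier["PitchDifference"] <- roomier["PitchDifference"] + 1
delta <- predict_relative_price(roomier) - base

print(round(base, 3))   # prediction at the sample means
print(round(delta, 3))  # change equals the PitchDifference coefficient
```

The prediction at the sample means lands close to the observed mean of PriceRelative, as expected for a least-squares fit with an intercept, and each additional inch of pitch difference raises the predicted relative price by the PitchDifference coefficient.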