TASK 3 - MINI PROJECT ON THE AIRLINE INDUSTRY

We will be analyzing the pricing of Premium Economy tickets relative to regular Economy airline tickets.

Read the data into R

setwd("C:/Users/HP/Downloads/Intern/WEEK 3 DAY 6")

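# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), matching the summary below
SixAirlinesData.df <- read.csv("SixAirlinesDataV2.csv", stringsAsFactors = TRUE)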
View(SixAirlinesData.df)

Summarize the data to understand the mean, median, and standard deviation of each variable

summary(SixAirlinesData.df)
##       Airline      Aircraft   FlightDuration   TravelMonth
##  AirFrance: 74   AirBus:151   Min.   : 1.250   Aug:127    
##  British  :175   Boeing:307   1st Qu.: 4.260   Jul: 75    
##  Delta    : 46                Median : 7.790   Oct:127    
##  Jet      : 61                Mean   : 7.578   Sep:129    
##  Singapore: 40                3rd Qu.:10.620              
##  Virgin   : 62                Max.   :14.660              
##       IsInternational  SeatsEconomy    SeatsPremium    PitchEconomy  
##  Domestic     : 40    Min.   : 78.0   Min.   : 8.00   Min.   :30.00  
##  International:418    1st Qu.:133.0   1st Qu.:21.00   1st Qu.:31.00  
##                       Median :185.0   Median :36.00   Median :31.00  
##                       Mean   :202.3   Mean   :33.65   Mean   :31.22  
##                       3rd Qu.:243.0   3rd Qu.:40.00   3rd Qu.:32.00  
##                       Max.   :389.0   Max.   :66.00   Max.   :33.00  
##   PitchPremium    WidthEconomy    WidthPremium    PriceEconomy 
##  Min.   :34.00   Min.   :17.00   Min.   :17.00   Min.   :  65  
##  1st Qu.:38.00   1st Qu.:18.00   1st Qu.:19.00   1st Qu.: 413  
##  Median :38.00   Median :18.00   Median :19.00   Median :1242  
##  Mean   :37.91   Mean   :17.84   Mean   :19.47   Mean   :1327  
##  3rd Qu.:38.00   3rd Qu.:18.00   3rd Qu.:21.00   3rd Qu.:1909  
##  Max.   :40.00   Max.   :19.00   Max.   :21.00   Max.   :3593  
##   PricePremium    PriceRelative      SeatsTotal  PitchDifference 
##  Min.   :  86.0   Min.   :0.0200   Min.   : 98   Min.   : 2.000  
##  1st Qu.: 528.8   1st Qu.:0.1000   1st Qu.:166   1st Qu.: 6.000  
##  Median :1737.0   Median :0.3650   Median :227   Median : 7.000  
##  Mean   :1845.3   Mean   :0.4872   Mean   :236   Mean   : 6.688  
##  3rd Qu.:2989.0   3rd Qu.:0.7400   3rd Qu.:279   3rd Qu.: 7.000  
##  Max.   :7414.0   Max.   :1.8900   Max.   :441   Max.   :10.000  
##  WidthDifference PercentPremiumSeats
##  Min.   :0.000   Min.   : 4.71      
##  1st Qu.:1.000   1st Qu.:12.28      
##  Median :1.000   Median :13.21      
##  Mean   :1.633   Mean   :14.65      
##  3rd Qu.:3.000   3rd Qu.:15.36      
##  Max.   :4.000   Max.   :24.69
library(psych)
describe(SixAirlinesData.df)
##                     vars   n    mean      sd  median trimmed     mad   min
## Airline*               1 458    3.01    1.65    2.00    2.89    1.48  1.00
## Aircraft*              2 458    1.67    0.47    2.00    1.71    0.00  1.00
## FlightDuration         3 458    7.58    3.54    7.79    7.57    4.81  1.25
## TravelMonth*           4 458    2.56    1.17    3.00    2.58    1.48  1.00
## IsInternational*       5 458    1.91    0.28    2.00    2.00    0.00  1.00
## SeatsEconomy           6 458  202.31   76.37  185.00  194.64   85.99 78.00
## SeatsPremium           7 458   33.65   13.26   36.00   33.35   11.86  8.00
## PitchEconomy           8 458   31.22    0.66   31.00   31.26    0.00 30.00
## PitchPremium           9 458   37.91    1.31   38.00   38.05    0.00 34.00
## WidthEconomy          10 458   17.84    0.56   18.00   17.81    0.00 17.00
## WidthPremium          11 458   19.47    1.10   19.00   19.53    0.00 17.00
## PriceEconomy          12 458 1327.08  988.27 1242.00 1244.40 1159.39 65.00
## PricePremium          13 458 1845.26 1288.14 1737.00 1799.05 1845.84 86.00
## PriceRelative         14 458    0.49    0.45    0.36    0.42    0.41  0.02
## SeatsTotal            15 458  235.96   85.29  227.00  228.73   90.44 98.00
## PitchDifference       16 458    6.69    1.76    7.00    6.76    0.00  2.00
## WidthDifference       17 458    1.63    1.19    1.00    1.53    0.00  0.00
## PercentPremiumSeats   18 458   14.65    4.84   13.21   14.31    2.68  4.71
##                         max   range  skew kurtosis    se
## Airline*               6.00    5.00  0.61    -0.95  0.08
## Aircraft*              2.00    1.00 -0.72    -1.48  0.02
## FlightDuration        14.66   13.41 -0.07    -1.12  0.17
## TravelMonth*           4.00    3.00 -0.14    -1.46  0.05
## IsInternational*       2.00    1.00 -2.91     6.50  0.01
## SeatsEconomy         389.00  311.00  0.72    -0.36  3.57
## SeatsPremium          66.00   58.00  0.23    -0.46  0.62
## PitchEconomy          33.00    3.00 -0.03    -0.35  0.03
## PitchPremium          40.00    6.00 -1.51     3.52  0.06
## WidthEconomy          19.00    2.00 -0.04    -0.08  0.03
## WidthPremium          21.00    4.00 -0.08    -0.31  0.05
## PriceEconomy        3593.00 3528.00  0.51    -0.88 46.18
## PricePremium        7414.00 7328.00  0.50     0.43 60.19
## PriceRelative          1.89    1.87  1.17     0.72  0.02
## SeatsTotal           441.00  343.00  0.70    -0.53  3.99
## PitchDifference       10.00    8.00 -0.54     1.78  0.08
## WidthDifference        4.00    4.00  0.84    -0.53  0.06
## PercentPremiumSeats   24.69   19.98  0.71     0.28  0.23
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
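
The same three statistics can also be computed with base R alone. The following is a minimal sketch on a made-up toy data frame (toy.df and its values are assumptions for illustration, not the airline data); it keeps the numeric columns and applies mean(), median() and sd() to each.

```r
# Toy data frame standing in for SixAirlinesData.df (values are made up)
toy.df <- data.frame(
  Airline      = factor(c("Delta", "Jet", "Delta", "Virgin")),
  PriceEconomy = c(100, 200, 300, 400),
  PricePremium = c(150, 260, 420, 610)
)

# Keep only the numeric columns, then compute the three statistics for each
numeric.cols <- toy.df[sapply(toy.df, is.numeric)]
stats <- sapply(numeric.cols,
                function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
print(round(stats, 2))
```

The result is a small matrix with one column per variable and one row per statistic, which is easier to scan than the full describe() table when only these three numbers are needed.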

Draw Histograms to visualize the distribution of each variable independently

par(mfrow=c(1,2))

hist(SixAirlinesData.df$FlightDuration, breaks=18, col="springgreen", xlab="FlightDuration", main="Histogram")

hist(SixAirlinesData.df$SeatsEconomy, breaks=18, col="salmon", xlab="SeatsEconomy", main="Histogram")

hist(SixAirlinesData.df$SeatsPremium, breaks=18, col="tomato2", xlab="SeatsPremium", main="Histogram")

hist(SixAirlinesData.df$PitchEconomy, breaks=18, col="wheat", xlab="PitchEconomy", main="Histogram")

hist(SixAirlinesData.df$PitchPremium, breaks=18, col="tan1", xlab="PitchPremium", main="Histogram")

hist(SixAirlinesData.df$WidthEconomy, breaks=18, col="yellowgreen", xlab="WidthEconomy", main="Histogram")

hist(SixAirlinesData.df$WidthPremium, breaks=18, col="rosybrown", xlab="WidthPremium", main="Histogram")

hist(SixAirlinesData.df$PriceEconomy, breaks=18, col="royalblue", xlab="PriceEconomy", main="Histogram")

hist(SixAirlinesData.df$PricePremium, breaks=18, col="steelblue", xlab="PricePremium", main="Histogram")

hist(SixAirlinesData.df$PriceRelative, breaks=18, col="saddlebrown", xlab="PriceRelative", main="Histogram")

hist(SixAirlinesData.df$SeatsTotal, breaks=18, col="violetred", xlab="SeatsTotal", main="Histogram")

hist(SixAirlinesData.df$PitchDifference, breaks=18, col="sandybrown",xlab="PitchDifference", main="Histogram")

hist(SixAirlinesData.df$WidthDifference, breaks=18, col="seagreen", xlab="WidthDifference", main="Histogram")

hist(SixAirlinesData.df$PercentPremiumSeats, breaks=18, col="gray", xlab="PercentPremiumSeats", main="Histogram")
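
The repeated hist() calls above can be collapsed into a loop over the numeric columns. The following is a sketch on simulated data (toy.df is a hypothetical stand-in, and a single colour is used for brevity); pdf(NULL) is used only so the example runs without opening a plot window.

```r
# Toy data frame standing in for SixAirlinesData.df (values are simulated)
set.seed(1)
toy.df <- data.frame(
  Airline        = factor(sample(c("Delta", "Jet"), 50, replace = TRUE)),
  FlightDuration = runif(50, 1, 15),
  SeatsEconomy   = round(runif(50, 78, 389))
)

pdf(NULL)                      # null device, so the sketch runs anywhere
par(mfrow = c(1, 2))           # two plots per page, as in the calls above

# Names of the numeric columns only; factors are skipped
numeric.names <- names(toy.df)[sapply(toy.df, is.numeric)]
for (v in numeric.names) {
  hist(toy.df[[v]], breaks = 18, col = "gray", xlab = v, main = "Histogram")
}
invisible(dev.off())
```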

Draw Box Plots to visualize the distribution of each variable independently

par(mfrow=c(1,2))

boxplot(SixAirlinesData.df$SeatsEconomy, col = "tan3")

boxplot(SixAirlinesData.df$SeatsPremium, col = "tan2")

boxplot(SixAirlinesData.df$SeatsTotal, col="tan1")

boxplot(SixAirlinesData.df$PitchEconomy, col = "red4")

boxplot(SixAirlinesData.df$PitchPremium, col = "yellow1")

boxplot(SixAirlinesData.df$PitchDifference, col = "wheat2")

boxplot(SixAirlinesData.df$WidthEconomy, col = "ivory")

boxplot(SixAirlinesData.df$WidthPremium, col= "khaki")

boxplot(SixAirlinesData.df$WidthDifference, col = "hotpink")

boxplot(SixAirlinesData.df$PriceEconomy, col="lawngreen")

boxplot(SixAirlinesData.df$PricePremium, col="magenta")

boxplot(SixAirlinesData.df$PriceRelative, col = "linen")

boxplot(SixAirlinesData.df$PercentPremiumSeats, col = "maroon1")

# With horizontal=TRUE the values (PriceRelative) run along the x-axis, so xlab names the price and ylab names the grouping variable
boxplot(SixAirlinesData.df$PriceRelative ~ SixAirlinesData.df$SeatsTotal, horizontal=TRUE, xlab="Relative Price", ylab="Seats Total", las=1, main="Boxplot", col="honeydew")

boxplot(SixAirlinesData.df$PriceRelative ~ SixAirlinesData.df$WidthDifference, horizontal=TRUE, xlab="Relative Price", ylab="WidthDifference", las=1, main="Boxplot", col="indianred")

boxplot(SixAirlinesData.df$PriceRelative ~ SixAirlinesData.df$PitchDifference, horizontal=TRUE, xlab="Relative Price", ylab="PitchDifference", las=1, main="Boxplot", col="limegreen")

boxplot(SixAirlinesData.df$PriceRelative ~ SixAirlinesData.df$PercentPremiumSeats, horizontal=TRUE, xlab="Relative Price", ylab="PercentPremiumSeats", las=1, main="Boxplot", col="lightblue")

Draw Scatter Plots to understand how the variables are correlated pairwise

library(car)

scatterplotMatrix(SixAirlinesData.df[,c("PriceRelative", "SeatsTotal")],spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix", col="lightpink")

scatterplotMatrix(SixAirlinesData.df[,c("PriceRelative", "PitchDifference")],spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix", col="mediumblue")

scatterplotMatrix(SixAirlinesData.df[,c("PriceRelative", "WidthDifference")],spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix", col="lightcoral")

scatterplotMatrix(SixAirlinesData.df[,c("PriceRelative", "PercentPremiumSeats")],spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix", col="lightcyan")

Draw Corrgram

finals.df <- SixAirlinesData.df[6:18]

library(corrgram)

corrgram(SixAirlinesData.df, order=FALSE, 
         lower.panel=panel.shade,
         upper.panel=panel.pie, 
         text.panel=panel.txt,
         diag.panel=panel.minmax,
         main="Corrgram")

So far, we see:

The more seats there are, the higher the price; and the higher the percentage of premium seats, the lower the price.

Positive correlation: (WidthDifference, PitchDifference). Negative correlation: (PercentPremiumSeats, SeatsTotal).


Create a Variance-Covariance Matrix

#cov(finals.df)
#cor(finals.df)

#library(psych)
#corr.test(finals.df, use="complete")


library(corpcor)
library(tseries)

range.names = c("SeatsEconomy", "SeatsPremium", "PitchEconomy", "PitchPremium", "WidthEconomy", "WidthPremium", "PriceEconomy", "PricePremium", "PriceRelative", "SeatsTotal", "PitchDifference", "WidthDifference", "PercentPremiumSeats")

# cov() on the 13 columns of finals.df returns a 13 x 13 matrix, so nrow must be 13 (not 457)
covmat = matrix(c(cov(finals.df)), nrow=13, ncol=13)
dimnames(covmat) = list(range.names, range.names)

# covmat

#cov.shrink(covmat)
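
As a reminder of the shapes involved: for p numeric columns, cov() returns a p x p matrix that already carries the column names, and cov2cor() rescales it to the correlation matrix. The sketch below illustrates this on simulated data (toy.df is a hypothetical stand-in for finals.df, with three columns instead of thirteen).

```r
# Toy numeric data standing in for finals.df (values are simulated)
set.seed(42)
toy.df <- data.frame(
  PitchDifference = sample(2:10, 30, replace = TRUE),
  WidthDifference = sample(0:4, 30, replace = TRUE),
  PriceRelative   = runif(30, 0.02, 1.89)
)

covmat <- cov(toy.df)      # 3 columns in, 3 x 3 matrix out, already labelled
cormat <- cov2cor(covmat)  # rescale covariances to correlations

print(dim(covmat))
print(round(cormat, 2))
```

Because the labels come along for free, there is usually no need to rebuild the matrix with matrix() or to set dimnames by hand.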

Articulate a Hypothesis (or two) that you could test using a Regression Model

Run appropriate T-Tests to test your Hypotheses

library(MASS)
library(psych)
attach(SixAirlinesData.df)

#H1: Increase in PitchDifference leads to increase in relative price.

t.test(PriceRelative, PitchDifference, alternative = "greater") 
## 
##  Welch Two Sample t-test
## 
## data:  PriceRelative and PitchDifference
## t = -72.974, df = 516.54, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -6.34058      Inf
## sample estimates:
## mean of x mean of y 
## 0.4872052 6.6877729
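# p-value is 1 (> 0.05), so we fail to reject the null hypothesis: the mean of PriceRelative (0.49) is not greater than the mean of PitchDifference (6.69).
# Note that this test compares the means of the two variables; it does not measure whether they increase together.
# The confidence interval contains zero.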


#H2: Increase in WidthDifference leads to increase in relative price.

t.test(PriceRelative, WidthDifference, alternative = "greater") 
## 
##  Welch Two Sample t-test
## 
## data:  PriceRelative and WidthDifference
## t = -19.284, df = 585.55, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -1.243885       Inf
## sample estimates:
## mean of x mean of y 
## 0.4872052 1.6331878
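# p-value is 1 (> 0.05), so we fail to reject the null hypothesis: the mean of PriceRelative (0.49) is not greater than the mean of WidthDifference (1.63).
# Note that this test compares the means of the two variables; it does not measure whether they increase together.
# The confidence interval contains zero.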


#H3: Increase in PercentPremiumSeats leads to increase in relative price.

t.test(PriceRelative, PercentPremiumSeats) 
## 
##  Welch Two Sample t-test
## 
## data:  PriceRelative and PercentPremiumSeats
## t = -62.302, df = 464.91, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.60477 -13.71164
## sample estimates:
##  mean of x  mean of y 
##  0.4872052 14.6454148
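# The confidence interval does not contain zero.
# p-value is < 2.2e-16 (< 0.05), so we reject the null hypothesis that the means of PriceRelative and PercentPremiumSeats are equal.
# Note that this test compares the means of the two variables; it does not measure whether they increase together.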
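
The hypotheses above are about whether two variables move together, which a test of correlation addresses more directly than a comparison of means. The following is a minimal sketch using cor.test() on simulated data (pitch.diff and price.rel are hypothetical names, and the relationship is built in by construction, so the numbers say nothing about the airline data).

```r
# Simulated data (not the airline data): price.rel rises with pitch.diff by construction
set.seed(123)
n <- 200
pitch.diff <- sample(2:10, n, replace = TRUE)
price.rel  <- 0.07 * pitch.diff + rnorm(n, sd = 0.4)

# H0: the correlation is zero; H1: the correlation is greater than zero
ct <- cor.test(price.rel, pitch.diff, alternative = "greater")
print(ct$estimate)   # sample Pearson correlation
print(ct$p.value)    # small values count against H0
```

On the real data the equivalent call would presumably be cor.test(PriceRelative, PitchDifference, alternative = "greater"), and likewise for the other two hypotheses.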

Formulate a Regression Model:

y = b0 + b1x1 + b2x2 + …

PriceRelative as ‘y’

x = {x1, x2, …} : x1,x2,x3,x4 are SeatsTotal, PitchDifference, WidthDifference, PercentPremiumSeats respectively

Fit a Linear Regression Model using lm()

m1 <- lm(PriceRelative ~ 
           SeatsTotal
         + PitchDifference
         + WidthDifference
         + PercentPremiumSeats,
         data=finals.df)
summary(m1)
## 
## Call:
## lm(formula = PriceRelative ~ SeatsTotal + PitchDifference + WidthDifference + 
##     PercentPremiumSeats, data = finals.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8892 -0.2959 -0.0509  0.1915  1.1727 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2.434e-02  1.135e-01  -0.214 0.830356    
## SeatsTotal          -2.806e-05  2.287e-04  -0.123 0.902421    
## PitchDifference      6.509e-02  1.667e-02   3.904 0.000109 ***
## WidthDifference      1.038e-01  2.599e-02   3.995 7.56e-05 ***
## PercentPremiumSeats -5.921e-03  4.175e-03  -1.418 0.156831    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3886 on 453 degrees of freedom
## Multiple R-squared:  0.2628, Adjusted R-squared:  0.2563 
## F-statistic: 40.37 on 4 and 453 DF,  p-value: < 2.2e-16
# beta coefficients
m1$coefficients
##         (Intercept)          SeatsTotal     PitchDifference 
##       -2.433740e-02       -2.805517e-05        6.508568e-02 
##     WidthDifference PercentPremiumSeats 
##        1.038415e-01       -5.920510e-03
# confidence intervals
confint(m1)
##                             2.5 %       97.5 %
## (Intercept)         -0.2474495396 0.1987747471
## SeatsTotal          -0.0004775002 0.0004213899
## PitchDifference      0.0323188879 0.0978524650
## WidthDifference      0.0527557878 0.1549271927
## PercentPremiumSeats -0.0141248368 0.0022838171
library(coefplot)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
coefplot(m1, predictors=c("PitchDifference", "WidthDifference"), col="magenta")
## Warning: Ignoring unknown aesthetics: xmin, xmax

# statistically significant

coefplot(m1, predictors=c("PercentPremiumSeats"), col="magenta")
## Warning: Ignoring unknown aesthetics: xmin, xmax

# not statistically significant, as the confidence interval includes zero

coefplot(m1, predictors=c("SeatsTotal"), col="magenta")
## Warning: Ignoring unknown aesthetics: xmin, xmax

# not statistically significant, as the confidence interval includes zero
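
The rule used in the comments above (a coefficient is significant at the 5% level when its 95% confidence interval excludes zero) can also be checked programmatically rather than read off a plot. The sketch below does this on simulated data, where x1 is built to affect y and x2 is not; all names here are assumptions for illustration.

```r
# Simulated data (not the airline data): x1 affects y, x2 does not
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
ci  <- confint(fit)                          # 95% intervals by default

# TRUE when the whole interval lies on one side of zero
excludes.zero <- ci[, 1] > 0 | ci[, 2] < 0
print(cbind(ci, excludes.zero))
```

Applied to m1, the same two lines after confint(m1) would flag each predictor without needing a separate coefplot() call per variable.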


#-----------------------------------------------------------------------------------------------
# Compare the PriceRelative with the fitted values 

# Here is the actual PriceRelative
#finals.df$PriceRelative

## ------------------------------------------------------------------------
# Here is the PriceRelative, as predicted by the OLS model
#fitted(m1)

# Compare relative price predicted by the model with the actual relative price given in the data

#predictedPriceRelative = data.frame(fitted(m1)) 
#actualPriceRelative = data.frame(finals.df$PriceRelative)
#PriceRelativeComparison = cbind(actualPriceRelative, predictedPriceRelative)
#View(PriceRelativeComparison)
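
The comparison sketched in the commented-out lines above can be written compactly as a single data frame of actual values, fitted values and residuals. The example below uses simulated data (x, y and fit are hypothetical stand-ins for the airline variables and m1) and also shows one way to summarise the agreement in a single number.

```r
# Simulated data (not the airline data)
set.seed(99)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 0.3 * x + rnorm(n)

fit <- lm(y ~ x)

# Actual values, model predictions, and their difference side by side
comparison <- data.frame(actual = y, predicted = fitted(fit))
comparison$residual <- comparison$actual - comparison$predicted
print(head(comparison))

# For OLS with an intercept, the squared correlation between actual and
# fitted values equals the model's R-squared
print(cor(comparison$actual, comparison$predicted)^2)
print(summary(fit)$r.squared)
```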

Use the model results to test your Hypotheses and draw inferences

What factors explain the difference in price between an economy ticket and a premium-economy airline ticket?

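PitchDifference and WidthDifference are positively associated with PriceRelative, and both coefficients are statistically significant (p < 0.001). An increase in pitch difference or width difference is associated with an increase in relative price (i.e., the gap between premium-economy and economy prices widens). SeatsTotal and PercentPremiumSeats are not statistically significant, since their confidence intervals include zero. Overall, the model explains about 26% of the variation in PriceRelative (R-squared = 0.2628).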