# stringsAsFactors = TRUE keeps Airline, Aircraft, etc. as factors on R >= 4.0
airlines.df <- read.csv("SixAirlinesDataV2.csv", stringsAsFactors = TRUE)
attach(airlines.df)
library(psych)
describe(airlines.df)[,c(1:5,8,9)]
##                     vars   n    mean      sd  median   min     max
## Airline*               1 458    3.01    1.65    2.00  1.00    6.00
## Aircraft*              2 458    1.67    0.47    2.00  1.00    2.00
## FlightDuration         3 458    7.58    3.54    7.79  1.25   14.66
## TravelMonth*           4 458    2.56    1.17    3.00  1.00    4.00
## IsInternational*       5 458    1.91    0.28    2.00  1.00    2.00
## SeatsEconomy           6 458  202.31   76.37  185.00 78.00  389.00
## SeatsPremium           7 458   33.65   13.26   36.00  8.00   66.00
## PitchEconomy           8 458   31.22    0.66   31.00 30.00   33.00
## PitchPremium           9 458   37.91    1.31   38.00 34.00   40.00
## WidthEconomy          10 458   17.84    0.56   18.00 17.00   19.00
## WidthPremium          11 458   19.47    1.10   19.00 17.00   21.00
## PriceEconomy          12 458 1327.08  988.27 1242.00 65.00 3593.00
## PricePremium          13 458 1845.26 1288.14 1737.00 86.00 7414.00
## PriceRelative         14 458    0.49    0.45    0.36  0.02    1.89
## SeatsTotal            15 458  235.96   85.29  227.00 98.00  441.00
## PitchDifference       16 458    6.69    1.76    7.00  2.00   10.00
## WidthDifference       17 458    1.63    1.19    1.00  0.00    4.00
## PercentPremiumSeats   18 458   14.65    4.84   13.21  4.71   24.69

Exploring data

Relationship between Premium prices and type of aircraft

summary(Aircraft)
## AirBus Boeing 
##    151    307
boxplot(PricePremium ~ Aircraft)  # formula form: one box per aircraft make

This suggests that premium prices are lower on Boeing aircraft.
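A grouped comparison like this can be backed by numbers as well as a plot. The following is a minimal sketch on simulated data; the data frame `toy`, its columns and its values are invented for illustration and are not taken from the airline file. The formula form of `boxplot` draws one box per factor level, and `tapply` returns the matching group medians.

```r
# Simulated stand-in data; names and values are hypothetical.
set.seed(1)
toy <- data.frame(
  make  = factor(rep(c("AirBus", "Boeing"), times = c(30, 60))),
  price = c(rnorm(30, mean = 2000, sd = 300),
            rnorm(60, mean = 1700, sd = 300))
)

# Formula interface: one box per level of `make`.
bp <- boxplot(price ~ make, data = toy, ylab = "price")

# Group medians; these are the centre lines of the two boxes.
group_medians <- tapply(toy$price, toy$make, median)
print(group_medians)
```

Reading the medians next to the plot makes it easier to state how large the difference between the two makes actually is.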

Next let us consider the airlines

boxplot(PricePremium~Airline)

Here, Delta appears to charge the lowest premium prices of all the airlines. However, the airlines can be profiled further by whether their flights are domestic or international, as the following cross-tabulation shows:

xtabs(~Airline+IsInternational)
##            IsInternational
## Airline     Domestic International
##   AirFrance        0            74
##   British          0           175
##   Delta           40             6
##   Jet              0            61
##   Singapore        0            40
##   Virgin           0            62

Let us have a look at the premium prices again, profiled further by domestic or international flight

par(mfrow=c(2, 1))
boxplot(PricePremium~Airline, subset = IsInternational=='International')
boxplot(PricePremium~Airline, subset = IsInternational=='Domestic')

How does the premium price for each airline vary across travel month?

# Reorder the factor levels so the months plot in the intended sequence
TravelMonth <- factor(TravelMonth, levels = levels(TravelMonth)[c(2, 1, 4, 3)])
par(mfrow=c(3,2))
boxplot(PricePremium~TravelMonth, subset= Airline=='British')
boxplot(PricePremium~TravelMonth, subset= Airline=='AirFrance')
boxplot(PricePremium~TravelMonth, subset= Airline=='Delta')
boxplot(PricePremium~TravelMonth, subset= Airline=='Jet')
boxplot(PricePremium~TravelMonth, subset= Airline=='Singapore')
boxplot(PricePremium~TravelMonth, subset= Airline=='Virgin')

There appears to be no great variation across months; however, the statistical significance of any differences still has to be tested.
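One common way to test for differences across months is a one-way ANOVA. The sketch below uses simulated prices with the same mean in every month; the variable names `months` and `price` and all values are assumptions made for illustration, not results from the airline data.

```r
# Simulated example: 25 hypothetical fares per month, same true mean throughout.
set.seed(42)
months <- factor(rep(c("Jul", "Aug", "Sep", "Oct"), each = 25),
                 levels = c("Jul", "Aug", "Sep", "Oct"))
price <- rnorm(100, mean = 1800, sd = 400)

# One-way ANOVA: does mean price differ between months?
month_fit <- aov(price ~ months)
print(summary(month_fit))

# Extract the p-value for the month effect from the ANOVA table.
p_month <- summary(month_fit)[[1]][["Pr(>F)"]][1]
```

On the real data the same call could be run per airline, for example with the `subset` argument of `aov`, so that airline and month effects are not mixed.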

Let us investigate more relations using scatterplots

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
par(mfrow=c(1,1))
scatterplot(PercentPremiumSeats, PricePremium)

scatterplot(PriceEconomy, PricePremium)

scatterplot(PitchDifference, PricePremium)

scatterplot(WidthDifference, PricePremium)

scatterplot(FlightDuration, PricePremium)

From the above plots, the only clear relations appear to be between PriceEconomy and PricePremium and between FlightDuration and PricePremium, while the relation between the percentage of premium seats and premium prices is unclear. In the pitch difference and width difference plots, premium prices seem to peak around particular mid-range values rather than rising or falling steadily (i.e. the relationship is not linear). A correlation matrix will shed more light on this.

# Separate data frame holding PricePremium and the numeric candidate predictors
numairline.df <- airlines.df[, c('PricePremium', 'FlightDuration', 'PercentPremiumSeats', 'PriceEconomy', 'PriceRelative', 'PitchDifference', 'WidthDifference')]
library("corrplot")
## corrplot 0.84 loaded
round(cor(numairline.df),2)
##                     PricePremium FlightDuration PercentPremiumSeats
## PricePremium                1.00           0.65                0.12
## FlightDuration              0.65           1.00                0.06
## PercentPremiumSeats         0.12           0.06                1.00
## PriceEconomy                0.90           0.57                0.07
## PriceRelative               0.03           0.12               -0.16
## PitchDifference            -0.02          -0.04               -0.09
## WidthDifference            -0.01          -0.12               -0.28
##                     PriceEconomy PriceRelative PitchDifference
## PricePremium                0.90          0.03           -0.02
## FlightDuration              0.57          0.12           -0.04
## PercentPremiumSeats         0.07         -0.16           -0.09
## PriceEconomy                1.00         -0.29           -0.10
## PriceRelative              -0.29          1.00            0.47
## PitchDifference            -0.10          0.47            1.00
## WidthDifference            -0.08          0.49            0.76
##                     WidthDifference
## PricePremium                  -0.01
## FlightDuration                -0.12
## PercentPremiumSeats           -0.28
## PriceEconomy                  -0.08
## PriceRelative                  0.49
## PitchDifference                0.76
## WidthDifference                1.00
library("corrgram")
corrgram(numairline.df, order = FALSE, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main= "Corrgram of various factors")

The above corrgram indicates that PricePremium is strongly correlated with PriceEconomy (r = 0.90) and moderately to strongly correlated with FlightDuration (r = 0.65).

Pearson’s correlation test

cor.test(PricePremium, FlightDuration)
## 
##  Pearson's product-moment correlation
## 
## data:  PricePremium and FlightDuration
## t = 18.204, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5923218 0.6988270
## sample estimates:
##       cor 
## 0.6487398

Since the p-value is below 0.05, there is a statistically significant correlation between PricePremium and FlightDuration.

cor.test(PricePremium, PriceEconomy)
## 
##  Pearson's product-moment correlation
## 
## data:  PricePremium and PriceEconomy
## t = 44.452, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8826622 0.9172579
## sample estimates:
##       cor 
## 0.9013887

Here again the p-value is below 0.05, so there is a statistically significant correlation between PricePremium and PriceEconomy.

cor.test(PricePremium, PercentPremiumSeats)
## 
##  Pearson's product-moment correlation
## 
## data:  PricePremium and PercentPremiumSeats
## t = 2.5024, df = 456, p-value = 0.01268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0250311 0.2058228
## sample estimates:
##      cor 
## 0.116391

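The p-value is below 0.05, so the correlation is statistically significant, but the coefficient itself is small (r = 0.12), which indicates only a weak relationship. The strength of a correlation is read from r, not from the size of the p-value.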
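The distinction between significance and strength can be illustrated with a small simulation. This is a sketch under stated assumptions: `x` and `y` are invented variables with a true correlation of roughly 0.1, unrelated to the airline data.

```r
# Simulated data with a deliberately weak linear relationship.
set.seed(7)
x <- rnorm(5000)
y <- 0.1 * x + rnorm(5000)   # true correlation is roughly 0.1

# With a large sample, even a weak correlation gives a very small p-value.
big <- cor.test(x, y)
print(c(r = unname(big$estimate), p = big$p.value))

# The same relationship on 50 observations is estimated far less precisely.
small <- cor.test(x[1:50], y[1:50])
print(c(r = unname(small$estimate), p = small$p.value))
```

The p-value shrinks as the sample grows, while r keeps describing how strong the relationship is.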

cor.test(PricePremium, PitchDifference)
## 
##  Pearson's product-moment correlation
## 
## data:  PricePremium and PitchDifference
## t = -0.38585, df = 456, p-value = 0.6998
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1095118  0.0736825
## sample estimates:
##         cor 
## -0.01806629

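The large p-value means the null hypothesis of zero correlation cannot be rejected: there is no evidence of a linear correlation between PitchDifference and PricePremium. The correlation matrix (r = -0.01) suggests the same for WidthDifference, although that pair was not tested separately here.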

Formulating Hypotheses

Hypothesis 1: Longer duration flights have higher Premium Prices

Hypothesis 2: Higher economy prices are correlated with higher premium prices

T-test

t.test(PricePremium, FlightDuration)
## 
##  Welch Two Sample t-test
## 
## data:  PricePremium and FlightDuration
## t = 30.531, df = 457.01, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1719.395 1955.965
## sample estimates:
##   mean of x   mean of y 
## 1845.257642    7.577838

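This test does not address Hypothesis 1. A two-sample t-test compares the means of the two variables (1845.26 for PricePremium against 7.58 hours for FlightDuration), which are measured in different units, so the very small p-value says nothing about whether the two are related. The evidence for Hypothesis 1 is the Pearson test above (r = 0.65, p < 2.2e-16), which supports a positive association between flight duration and premium price.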
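To see what a two-sample t-test does and does not measure, consider a minimal sketch on simulated data. The variables `duration` and `price` are generated independently of each other and are purely illustrative.

```r
# Two variables generated independently, on very different scales.
set.seed(3)
duration <- runif(200, min = 1, max = 15)       # hypothetical hours
price    <- rnorm(200, mean = 1800, sd = 500)   # hypothetical fares

# Two-sample t-test: compares the two means, ignores any pairing or association.
means_test <- t.test(price, duration)
print(means_test$p.value)

# Pearson test: examines whether the two variables vary together.
assoc_test <- cor.test(price, duration)
print(c(r = unname(assoc_test$estimate), p = assoc_test$p.value))
```

The t-test p-value is tiny only because fares are numerically much larger than hours; `cor.test` is the call that speaks to association.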

t.test(PricePremium, PriceEconomy)
## 
##  Welch Two Sample t-test
## 
## data:  PricePremium and PriceEconomy
## t = 6.8304, df = 856.56, p-value = 1.605e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  369.2793 667.0831
## sample estimates:
## mean of x mean of y 
##  1845.258  1327.076

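Again, the t-test compares means rather than testing correlation. It shows that the mean premium price (1845.26) exceeds the mean economy price (1327.08), with a 95% confidence interval for the difference of roughly 369 to 667. Because both prices are recorded for the same flight, a paired test would be the more natural comparison. Hypothesis 2 is supported instead by the Pearson test above (r = 0.90, p < 2.2e-16).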
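When two prices are observed on the same flight, a paired t-test uses that pairing. The sketch below runs on simulated fares; the names `economy` and `premium` and the way they are generated are assumptions made for illustration only.

```r
# Simulated paired fares: each premium fare is tied to an economy fare.
set.seed(11)
economy <- rnorm(100, mean = 1300, sd = 400)
premium <- 1.3 * economy + rnorm(100, mean = 100, sd = 200)

# Welch test treats the two columns as unrelated samples.
welch <- t.test(premium, economy)

# Paired test works on the per-flight differences premium - economy.
paired <- t.test(premium, economy, paired = TRUE)

print(c(welch_p = welch$p.value, paired_p = paired$p.value))
print(paired$conf.int)   # interval for the mean per-flight difference
```

Both tests still answer a question about mean difference, not about correlation between the two prices.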

Regression Analysis

fit<- lm(PricePremium~ PriceEconomy+ PercentPremiumSeats+ PitchDifference+ WidthDifference+ FlightDuration, data= numairline.df)
summary(fit)
## 
## Call:
## lm(formula = PricePremium ~ PriceEconomy + PercentPremiumSeats + 
##     PitchDifference + WidthDifference + FlightDuration, data = numairline.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -803.2 -287.5  -46.8  151.6 3434.6 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -572.66050  132.93317  -4.308 2.02e-05 ***
## PriceEconomy           1.02266    0.02881  35.497  < 2e-16 ***
## PercentPremiumSeats   21.58784    5.09201   4.240 2.72e-05 ***
## PitchDifference       -4.03335   20.94083  -0.193 0.847353    
## WidthDifference      115.29104   32.16186   3.585 0.000374 ***
## FlightDuration        76.97237    8.07003   9.538  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 496.8 on 452 degrees of freedom
## Multiple R-squared:  0.8529, Adjusted R-squared:  0.8512 
## F-statistic:   524 on 5 and 452 DF,  p-value: < 2.2e-16

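The regression indicates that every predictor except PitchDifference is statistically significant (p < 0.001). PriceEconomy and FlightDuration have the largest t values, and the model explains about 85% of the variance in PricePremium (adjusted R-squared 0.85).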
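A natural follow-up is to check whether a non-significant predictor can be dropped. The following is a minimal sketch on simulated data, in which `pitch` has no effect by construction; the data frame `sim` and all coefficients are invented and are not estimates from the airline data.

```r
# Simulated data: premium depends on economy and duration, not on pitch.
set.seed(5)
n <- 200
sim <- data.frame(
  economy  = runif(n, min = 100, max = 3000),
  duration = runif(n, min = 1, max = 15),
  pitch    = sample(2:10, n, replace = TRUE)
)
sim$premium <- 100 + 1.1 * sim$economy + 70 * sim$duration + rnorm(n, sd = 300)

# Full model and a reduced model without the candidate predictor.
full    <- lm(premium ~ economy + duration + pitch, data = sim)
reduced <- update(full, . ~ . - pitch)

# F-test for nested models: does adding pitch reduce the residual error?
cmp <- anova(reduced, full)
print(cmp)
```

For a single dropped term this F-test agrees with the t-test in `summary(full)`, but the same pattern extends to dropping several predictors at once.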