MAIN RESEARCH QUESTION - What factors explain the difference in price between an economy ticket and a premium-economy airline ticket?

Reading Data into R

setwd("C:/Users/Parul Verma/Desktop/Data Analytics Internship")
Data.df <-read.csv(paste ("SixAirlinesDataV2.csv", sep=""))

Summarizing the data to understand the mean, median, standard deviation of each variable using 1. summary command

summary(Data.df)
##       Airline      Aircraft   FlightDuration   TravelMonth
##  AirFrance: 74   AirBus:151   Min.   : 1.250   Aug:127    
##  British  :175   Boeing:307   1st Qu.: 4.260   Jul: 75    
##  Delta    : 46                Median : 7.790   Oct:127    
##  Jet      : 61                Mean   : 7.578   Sep:129    
##  Singapore: 40                3rd Qu.:10.620              
##  Virgin   : 62                Max.   :14.660              
##       IsInternational  SeatsEconomy    SeatsPremium    PitchEconomy  
##  Domestic     : 40    Min.   : 78.0   Min.   : 8.00   Min.   :30.00  
##  International:418    1st Qu.:133.0   1st Qu.:21.00   1st Qu.:31.00  
##                       Median :185.0   Median :36.00   Median :31.00  
##                       Mean   :202.3   Mean   :33.65   Mean   :31.22  
##                       3rd Qu.:243.0   3rd Qu.:40.00   3rd Qu.:32.00  
##                       Max.   :389.0   Max.   :66.00   Max.   :33.00  
##   PitchPremium    WidthEconomy    WidthPremium    PriceEconomy 
##  Min.   :34.00   Min.   :17.00   Min.   :17.00   Min.   :  65  
##  1st Qu.:38.00   1st Qu.:18.00   1st Qu.:19.00   1st Qu.: 413  
##  Median :38.00   Median :18.00   Median :19.00   Median :1242  
##  Mean   :37.91   Mean   :17.84   Mean   :19.47   Mean   :1327  
##  3rd Qu.:38.00   3rd Qu.:18.00   3rd Qu.:21.00   3rd Qu.:1909  
##  Max.   :40.00   Max.   :19.00   Max.   :21.00   Max.   :3593  
##   PricePremium    PriceRelative      SeatsTotal  PitchDifference 
##  Min.   :  86.0   Min.   :0.0200   Min.   : 98   Min.   : 2.000  
##  1st Qu.: 528.8   1st Qu.:0.1000   1st Qu.:166   1st Qu.: 6.000  
##  Median :1737.0   Median :0.3650   Median :227   Median : 7.000  
##  Mean   :1845.3   Mean   :0.4872   Mean   :236   Mean   : 6.688  
##  3rd Qu.:2989.0   3rd Qu.:0.7400   3rd Qu.:279   3rd Qu.: 7.000  
##  Max.   :7414.0   Max.   :1.8900   Max.   :441   Max.   :10.000  
##  WidthDifference PercentPremiumSeats
##  Min.   :0.000   Min.   : 4.71      
##  1st Qu.:1.000   1st Qu.:12.28      
##  Median :1.000   Median :13.21      
##  Mean   :1.633   Mean   :14.65      
##  3rd Qu.:3.000   3rd Qu.:15.36      
##  Max.   :4.000   Max.   :24.69

Alternatively, one can also use 2. describe command

library(psych)
describe(Data.df)
##                     vars   n    mean      sd  median trimmed     mad   min
## Airline*               1 458    3.01    1.65    2.00    2.89    1.48  1.00
## Aircraft*              2 458    1.67    0.47    2.00    1.71    0.00  1.00
## FlightDuration         3 458    7.58    3.54    7.79    7.57    4.81  1.25
## TravelMonth*           4 458    2.56    1.17    3.00    2.58    1.48  1.00
## IsInternational*       5 458    1.91    0.28    2.00    2.00    0.00  1.00
## SeatsEconomy           6 458  202.31   76.37  185.00  194.64   85.99 78.00
## SeatsPremium           7 458   33.65   13.26   36.00   33.35   11.86  8.00
## PitchEconomy           8 458   31.22    0.66   31.00   31.26    0.00 30.00
## PitchPremium           9 458   37.91    1.31   38.00   38.05    0.00 34.00
## WidthEconomy          10 458   17.84    0.56   18.00   17.81    0.00 17.00
## WidthPremium          11 458   19.47    1.10   19.00   19.53    0.00 17.00
## PriceEconomy          12 458 1327.08  988.27 1242.00 1244.40 1159.39 65.00
## PricePremium          13 458 1845.26 1288.14 1737.00 1799.05 1845.84 86.00
## PriceRelative         14 458    0.49    0.45    0.36    0.42    0.41  0.02
## SeatsTotal            15 458  235.96   85.29  227.00  228.73   90.44 98.00
## PitchDifference       16 458    6.69    1.76    7.00    6.76    0.00  2.00
## WidthDifference       17 458    1.63    1.19    1.00    1.53    0.00  0.00
## PercentPremiumSeats   18 458   14.65    4.84   13.21   14.31    2.68  4.71
##                         max   range  skew kurtosis    se
## Airline*               6.00    5.00  0.61    -0.95  0.08
## Aircraft*              2.00    1.00 -0.72    -1.48  0.02
## FlightDuration        14.66   13.41 -0.07    -1.12  0.17
## TravelMonth*           4.00    3.00 -0.14    -1.46  0.05
## IsInternational*       2.00    1.00 -2.91     6.50  0.01
## SeatsEconomy         389.00  311.00  0.72    -0.36  3.57
## SeatsPremium          66.00   58.00  0.23    -0.46  0.62
## PitchEconomy          33.00    3.00 -0.03    -0.35  0.03
## PitchPremium          40.00    6.00 -1.51     3.52  0.06
## WidthEconomy          19.00    2.00 -0.04    -0.08  0.03
## WidthPremium          21.00    4.00 -0.08    -0.31  0.05
## PriceEconomy        3593.00 3528.00  0.51    -0.88 46.18
## PricePremium        7414.00 7328.00  0.50     0.43 60.19
## PriceRelative          1.89    1.87  1.17     0.72  0.02
## SeatsTotal           441.00  343.00  0.70    -0.53  3.99
## PitchDifference       10.00    8.00 -0.54     1.78  0.08
## WidthDifference        4.00    4.00  0.84    -0.53  0.06
## PercentPremiumSeats   24.69   19.98  0.71     0.28  0.23

From the list of values given, one can look at the mean, median, mode and standard deviation of each of the column variables.

ScatterPlots and BoxPlots for analysing variables:

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
library(psych)
library(lattice)

For this, we take Relative Pricing and find the correlation between Relative Pricing and the other variable -

boxplot(PriceRelative~Airline, data = Data.df, main = "Airline Vs Relative Price", xlab = "Airlines", ylab = "Relative Price", col = c("red","blue","green","orange","yellow","purple"))

Correlation present.

boxplot(PriceRelative~Aircraft,data = Data.df, main="Relative Price Vs.Aircraft ",     xlab="Aircraft", ylab="Price Relative", col = c("blue","red"))

Correlation present (weak)

scatterplot(PriceRelative~FlightDuration , data = Data.df, xlab = "Flight Duration", ylab = "Relative Pricing")

Correlation present.

boxplot(PriceRelative~TravelMonth , data = Data.df, xlab = "Travel Month", ylab = "Relative Pricing", col = c("red","blue","green","orange"))

No correlation.

boxplot(PriceRelative~IsInternational, data = Data.df, xlab = "IsInternational", ylab = "Relative Pricing", col = c("red","blue"))

Correlation present.

scatterplot(PriceRelative~SeatsEconomy, data = Data.df, xlab = "Seats in Economy", ylab = "Relative Pricing")

No correlation

scatterplot(PriceRelative~SeatsPremium, data = Data.df, xlab = "Seats in Premium", ylab = "Relative Pricing")

Correlation present (negative)

scatterplot(PriceRelative~PitchEconomy, data = Data.df, xlab = " Pitch in Economy", ylab = "Relative Pricing")

Correlation present

scatterplot(PriceRelative~PitchPremium, data = Data.df, xlab = "Pitch in Premium", ylab = "Relative Pricing")

Correlation present

scatterplot(PriceRelative~WidthEconomy, data = Data.df, xlab = "Width in Economy", ylab = "Relative Pricing")

Correlation present (not statistically significant)

scatterplot(PriceRelative~WidthPremium, data = Data.df, xlab = "Width in Premium", ylab = "Relative Pricing")

Correlation present.

scatterplot(PriceRelative~SeatsTotal, data = Data.df, xlab = "Total Seats", ylab = "Relative Pricing")

No correlation.

scatterplot(PriceRelative~PitchDifference, data = Data.df, xlab = "Difference in Pitch", ylab = "Relative Pricing")

Correlation present.

scatterplot(PriceRelative~WidthDifference, data = Data.df, xlab = "Difference in Width", ylab = "Relative Pricing")

Correlation present

scatterplot(PriceRelative~PercentPremiumSeats, data = Data.df, xlab = "Percentage of Premium Seats", ylab = "Relative Pricing")

Correlation present (negative)

Therefore, from above, we get the factors that affect the Pricing.

Drawing a CORRGRAM now -

library(corrgram)
par(mfrow=c(1,1))
corrgram(Data.df, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt,main="Corrgram")

Variance-CoVariance Matrix

VarCovarmatrix <- var(Data.df[,1:18])
## Warning in var(Data.df[, 1:18]): NAs introduced by coercion
round(VarCovarmatrix, 2)
##                     Airline Aircraft FlightDuration TravelMonth
## Airline                  NA       NA             NA          NA
## Aircraft                 NA       NA             NA          NA
## FlightDuration           NA       NA          12.55          NA
## TravelMonth              NA       NA             NA          NA
## IsInternational          NA       NA             NA          NA
## SeatsEconomy             NA       NA          52.92          NA
## SeatsPremium             NA       NA           7.57          NA
## PitchEconomy             NA       NA           0.68          NA
## PitchPremium             NA       NA           0.45          NA
## WidthEconomy             NA       NA           0.90          NA
## WidthPremium             NA       NA           0.40          NA
## PriceEconomy             NA       NA        1983.54          NA
## PricePremium             NA       NA        2959.98          NA
## PriceRelative            NA       NA           0.19          NA
## SeatsTotal               NA       NA          60.49          NA
## PitchDifference          NA       NA          -0.23          NA
## WidthDifference          NA       NA          -0.50          NA
## PercentPremiumSeats      NA       NA           1.04          NA
##                     IsInternational SeatsEconomy SeatsPremium PitchEconomy
## Airline                          NA           NA           NA           NA
## Aircraft                         NA           NA           NA           NA
## FlightDuration                   NA        52.92         7.57         0.68
## TravelMonth                      NA           NA           NA           NA
## IsInternational                  NA           NA           NA           NA
## SeatsEconomy                     NA      5832.92       633.07         7.21
## SeatsPremium                     NA       633.07       175.87        -0.30
## PitchEconomy                     NA         7.21        -0.30         0.43
## PitchPremium                     NA        11.96         0.09        -0.47
## WidthEconomy                     NA        15.91         3.37         0.11
## WidthPremium                     NA         8.58        -0.04        -0.39
## PriceEconomy                     NA      9673.79      1489.38       238.70
## PricePremium                     NA     17413.25      3717.36       190.85
## PriceRelative                    NA         0.14        -0.58        -0.12
## SeatsTotal                       NA      6465.99       808.94         6.91
## PitchDifference                  NA         4.75         0.38        -0.90
## WidthDifference                  NA        -7.33        -3.41        -0.50
## PercentPremiumSeats              NA      -122.39        31.15        -0.33
##                     PitchPremium WidthEconomy WidthPremium PriceEconomy
## Airline                       NA           NA           NA           NA
## Aircraft                      NA           NA           NA           NA
## FlightDuration              0.45         0.90         0.40      1983.54
## TravelMonth                   NA           NA           NA           NA
## IsInternational               NA           NA           NA           NA
## SeatsEconomy               11.96        15.91         8.58      9673.79
## SeatsPremium                0.09         3.37        -0.04      1489.38
## PitchEconomy               -0.47         0.11        -0.39       238.70
## PitchPremium                1.73        -0.02         1.08        65.43
## WidthEconomy               -0.02         0.31         0.05        37.46
## WidthPremium                1.08         0.05         1.20       -61.85
## PriceEconomy               65.43        37.46       -61.85    976684.06
## PricePremium              149.85       108.12        90.48   1147494.77
## PriceRelative               0.25        -0.01         0.25      -128.50
## SeatsTotal                 12.05        19.28         8.54     11163.18
## PitchDifference             2.20        -0.12         1.47      -173.28
## WidthDifference             1.10        -0.26         1.15       -99.32
## PercentPremiumSeats        -1.12         0.61        -0.97       312.61
##                     PricePremium PriceRelative SeatsTotal PitchDifference
## Airline                       NA            NA         NA              NA
## Aircraft                      NA            NA         NA              NA
## FlightDuration           2959.98          0.19      60.49           -0.23
## TravelMonth                   NA            NA         NA              NA
## IsInternational               NA            NA         NA              NA
## SeatsEconomy            17413.25          0.14    6465.99            4.75
## SeatsPremium             3717.36         -0.58     808.94            0.38
## PitchEconomy              190.85         -0.12       6.91           -0.90
## PitchPremium              149.85          0.25      12.05            2.20
## WidthEconomy              108.12         -0.01      19.28           -0.12
## WidthPremium               90.48          0.25       8.54            1.47
## PriceEconomy          1147494.77       -128.50   11163.18         -173.28
## PricePremium          1659293.12         18.48   21130.62          -41.00
## PriceRelative              18.48          0.20      -0.44            0.37
## SeatsTotal              21130.62         -0.44    7274.92            5.13
## PitchDifference           -41.00          0.37       5.13            3.10
## WidthDifference           -17.64          0.26     -10.74            1.59
## PercentPremiumSeats       726.02         -0.35     -91.24           -0.79
##                     WidthDifference PercentPremiumSeats
## Airline                          NA                  NA
## Aircraft                         NA                  NA
## FlightDuration                -0.50                1.04
## TravelMonth                      NA                  NA
## IsInternational                  NA                  NA
## SeatsEconomy                  -7.33             -122.39
## SeatsPremium                  -3.41               31.15
## PitchEconomy                  -0.50               -0.33
## PitchPremium                   1.10               -1.12
## WidthEconomy                  -0.26                0.61
## WidthPremium                   1.15               -0.97
## PriceEconomy                 -99.32              312.61
## PricePremium                 -17.64              726.02
## PriceRelative                  0.26               -0.35
## SeatsTotal                   -10.74              -91.24
## PitchDifference                1.59               -0.79
## WidthDifference                1.41               -1.59
## PercentPremiumSeats           -1.59               23.45

Correlation Matrix

correlmatrix <- cor(Data.df[,6:18])
round(correlmatrix,digits = 2)
##                     SeatsEconomy SeatsPremium PitchEconomy PitchPremium
## SeatsEconomy                1.00         0.63         0.14         0.12
## SeatsPremium                0.63         1.00        -0.03         0.00
## PitchEconomy                0.14        -0.03         1.00        -0.55
## PitchPremium                0.12         0.00        -0.55         1.00
## WidthEconomy                0.37         0.46         0.29        -0.02
## WidthPremium                0.10         0.00        -0.54         0.75
## PriceEconomy                0.13         0.11         0.37         0.05
## PricePremium                0.18         0.22         0.23         0.09
## PriceRelative               0.00        -0.10        -0.42         0.42
## SeatsTotal                  0.99         0.72         0.12         0.11
## PitchDifference             0.04         0.02        -0.78         0.95
## WidthDifference            -0.08        -0.22        -0.64         0.70
## PercentPremiumSeats        -0.33         0.49        -0.10        -0.18
##                     WidthEconomy WidthPremium PriceEconomy PricePremium
## SeatsEconomy                0.37         0.10         0.13         0.18
## SeatsPremium                0.46         0.00         0.11         0.22
## PitchEconomy                0.29        -0.54         0.37         0.23
## PitchPremium               -0.02         0.75         0.05         0.09
## WidthEconomy                1.00         0.08         0.07         0.15
## WidthPremium                0.08         1.00        -0.06         0.06
## PriceEconomy                0.07        -0.06         1.00         0.90
## PricePremium                0.15         0.06         0.90         1.00
## PriceRelative              -0.04         0.50        -0.29         0.03
## SeatsTotal                  0.41         0.09         0.13         0.19
## PitchDifference            -0.13         0.76        -0.10        -0.02
## WidthDifference            -0.39         0.88        -0.08        -0.01
## PercentPremiumSeats         0.23        -0.18         0.07         0.12
##                     PriceRelative SeatsTotal PitchDifference
## SeatsEconomy                 0.00       0.99            0.04
## SeatsPremium                -0.10       0.72            0.02
## PitchEconomy                -0.42       0.12           -0.78
## PitchPremium                 0.42       0.11            0.95
## WidthEconomy                -0.04       0.41           -0.13
## WidthPremium                 0.50       0.09            0.76
## PriceEconomy                -0.29       0.13           -0.10
## PricePremium                 0.03       0.19           -0.02
## PriceRelative                1.00      -0.01            0.47
## SeatsTotal                  -0.01       1.00            0.03
## PitchDifference              0.47       0.03            1.00
## WidthDifference              0.49      -0.11            0.76
## PercentPremiumSeats         -0.16      -0.22           -0.09
##                     WidthDifference PercentPremiumSeats
## SeatsEconomy                  -0.08               -0.33
## SeatsPremium                  -0.22                0.49
## PitchEconomy                  -0.64               -0.10
## PitchPremium                   0.70               -0.18
## WidthEconomy                  -0.39                0.23
## WidthPremium                   0.88               -0.18
## PriceEconomy                  -0.08                0.07
## PricePremium                  -0.01                0.12
## PriceRelative                  0.49               -0.16
## SeatsTotal                    -0.11               -0.22
## PitchDifference                0.76               -0.09
## WidthDifference                1.00               -0.28
## PercentPremiumSeats           -0.28                1.00

Hypothesis and its T-testing :

Hypothesis H1 - The price difference of the ticket depends on Seat Width. Null hypothesis H0 - The price difference of the ticket does not depend on Seat Width.

Running a T-Test

t.test( Data.df$WidthPremium,Data.df$PricePremium - Data.df$PriceEconomy, var.equal = TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  Data.df$WidthPremium and Data.df$PricePremium - Data.df$PriceEconomy
## t = -18.277, df = 914, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -552.2598 -445.1594
## sample estimates:
## mean of x mean of y 
##  19.47162 518.18122

Since the p=value < 0.05, we reject the null hypothesis. Therefore, seat width does affect pricing.

Hypothesis H2 - The price of the ticket depends on Pitch. Null hypothesis H0 - The price of the ticket does not depend on Pitch.

Running a T-Test

t.test( Data.df$PitchPremium,Data.df$PricePremium - Data.df$PriceEconomy, var.equal = TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  Data.df$PitchPremium and Data.df$PricePremium - Data.df$PriceEconomy
## t = -17.602, df = 914, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -533.8253 -426.7249
## sample estimates:
## mean of x mean of y 
##  37.90611 518.18122

Since the p=value < 0.05, we reject the null hypothesis. Therefore, pitch does affect pricing.

Hypothesis H3 - The price of the ticket depends on Flight Duration. Null hypothesis H0 - The price of the ticket does not depend on Flight Duration.

Running a T-Test

t.test( Data.df$FlightDuration,Data.df$PricePremium - Data.df$PriceEconomy, var.equal = TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  Data.df$FlightDuration and Data.df$PricePremium - Data.df$PriceEconomy
## t = -18.713, df = 914, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -564.1545 -457.0523
## sample estimates:
##  mean of x  mean of y 
##   7.577838 518.181223

Since the p=value < 0.05, we reject the null hypothesis. Therefore, flight duration does affect pricing.

Hypothesis H4 - The price difference of the ticket depends on Percentage of Premium Seats. Null hypothesis H0 - The price difference of the ticket does not depend on Percentage of Premium Seats.

Running a T-Test

t.test( Data.df$PercentPremiumSeats,Data.df$PricePremium - Data.df$PriceEconomy, var.equal = TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  Data.df$PercentPremiumSeats and Data.df$PricePremium - Data.df$PriceEconomy
## t = -18.454, df = 914, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -557.0877 -449.9839
## sample estimates:
## mean of x mean of y 
##  14.64541 518.18122

Since the p=value < 0.05, we reject the null hypothesis. Therefore, Percentage of Premium Seats does affect pricing.

Regression Analysis :

regmodel <- lm(formula = Data.df$PriceRelative~Data.df$PitchDifference+Data.df$WidthDifference+Data.df$FlightDuration+Data.df$PercentPremiumSeats)
summary(regmodel)
## 
## Call:
## lm(formula = Data.df$PriceRelative ~ Data.df$PitchDifference + 
##     Data.df$WidthDifference + Data.df$FlightDuration + Data.df$PercentPremiumSeats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79439 -0.29424 -0.03427  0.16197  1.13688 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -0.179033   0.101492  -1.764  0.07840 .  
## Data.df$PitchDifference      0.059311   0.015921   3.725  0.00022 ***
## Data.df$WidthDifference      0.118140   0.024555   4.811 2.05e-06 ***
## Data.df$FlightDuration       0.021707   0.005085   4.269 2.39e-05 ***
## Data.df$PercentPremiumSeats -0.005999   0.003898  -1.539  0.12454    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.381 on 453 degrees of freedom
## Multiple R-squared:  0.2913, Adjusted R-squared:  0.285 
## F-statistic: 46.54 on 4 and 453 DF,  p-value: < 2.2e-16

As our p-value is < 0.05, we can also use this to reject our null hypothesis.

The regression model : PriceRelative = ( 0.059310975 PitchDifference) + (0.118140211 WidthDifference) + (0.021707245FlightDuration) + (-0.005999062PercentPremiumSeats) + (-0.179033482=Intercept)

coefficients(regmodel)
##                 (Intercept)     Data.df$PitchDifference 
##                -0.179033482                 0.059310975 
##     Data.df$WidthDifference      Data.df$FlightDuration 
##                 0.118140211                 0.021707245 
## Data.df$PercentPremiumSeats 
##                -0.005999062

The regression model : PriceRelative = ( 0.059310975 PitchDifference) + (0.118140211 WidthDifference) + (0.021707245FlightDuration) + (-0.005999062PercentPremiumSeats) + (-0.179033482=Intercept)

Thus, y can be Relative Price and x = {x1,x2,x3,x4} can be Pitch Difference, Width Difference, Flight Duration and Percentage of Premium Seats.

So, we can conclude that there are various factors (like Pitch Difference, Width Difference, Flight Duration and Percentage of Premium Seats for example) that cause a difference in the prices of Premium Economy and Economy airline tickets.