MINI PROJECT ON THE AIRLINE INDUSTRY

About the project

The airline dataset used in this project is used to differentiate economy class air-ticket prices from premium economy class air-ticket prices. It also records several other factors for each flight, such as the airline, aircraft, flight duration, seat counts, seat pitch and seat width.

The aim of this analysis is to understand what drives the price of premium economy class air tickets relative to economy class air tickets.

Read and view airlines data sets

airlines.df <- read.csv("SixAirlinesDataV2.csv")
View(airlines.df)

Summarize the data to understand the range, quartiles, median and mean of each variable

summary(airlines.df)

##       Airline      Aircraft   FlightDuration   TravelMonth
##  AirFrance: 74   AirBus:151   Min.   : 1.250   Aug:127    
##  British  :175   Boeing:307   1st Qu.: 4.260   Jul: 75    
##  Delta    : 46                Median : 7.790   Oct:127    
##  Jet      : 61                Mean   : 7.578   Sep:129    
##  Singapore: 40                3rd Qu.:10.620              
##  Virgin   : 62                Max.   :14.660              
##       IsInternational  SeatsEconomy    SeatsPremium    PitchEconomy  
##  Domestic     : 40    Min.   : 78.0   Min.   : 8.00   Min.   :30.00  
##  International:418    1st Qu.:133.0   1st Qu.:21.00   1st Qu.:31.00  
##                       Median :185.0   Median :36.00   Median :31.00  
##                       Mean   :202.3   Mean   :33.65   Mean   :31.22  
##                       3rd Qu.:243.0   3rd Qu.:40.00   3rd Qu.:32.00  
##                       Max.   :389.0   Max.   :66.00   Max.   :33.00  
##   PitchPremium    WidthEconomy    WidthPremium    PriceEconomy 
##  Min.   :34.00   Min.   :17.00   Min.   :17.00   Min.   :  65  
##  1st Qu.:38.00   1st Qu.:18.00   1st Qu.:19.00   1st Qu.: 413  
##  Median :38.00   Median :18.00   Median :19.00   Median :1242  
##  Mean   :37.91   Mean   :17.84   Mean   :19.47   Mean   :1327  
##  3rd Qu.:38.00   3rd Qu.:18.00   3rd Qu.:21.00   3rd Qu.:1909  
##  Max.   :40.00   Max.   :19.00   Max.   :21.00   Max.   :3593  
##   PricePremium    PriceRelative      SeatsTotal  PitchDifference 
##  Min.   :  86.0   Min.   :0.0200   Min.   : 98   Min.   : 2.000  
##  1st Qu.: 528.8   1st Qu.:0.1000   1st Qu.:166   1st Qu.: 6.000  
##  Median :1737.0   Median :0.3650   Median :227   Median : 7.000  
##  Mean   :1845.3   Mean   :0.4872   Mean   :236   Mean   : 6.688  
##  3rd Qu.:2989.0   3rd Qu.:0.7400   3rd Qu.:279   3rd Qu.: 7.000  
##  Max.   :7414.0   Max.   :1.8900   Max.   :441   Max.   :10.000  
##  WidthDifference PercentPremiumSeats
##  Min.   :0.000   Min.   : 4.71      
##  1st Qu.:1.000   1st Qu.:12.28      
##  Median :1.000   Median :13.21      
##  Mean   :1.633   Mean   :14.65      
##  3rd Qu.:3.000   3rd Qu.:15.36      
##  Max.   :4.000   Max.   :24.69
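
Note that summary() does not report the standard deviation. The following is a minimal sketch of one way to obtain it for every numeric column. It runs on a small made-up data frame, demo.df (a hypothetical stand-in, not the airline data), so that it works without the CSV file; the same two sapply() calls could be applied to airlines.df.

```r
# Hypothetical stand-in for airlines.df so the snippet runs without the CSV file.
demo.df <- data.frame(
  Airline        = factor(c("British", "Delta", "Jet", "Virgin")),
  FlightDuration = c(1.25, 4.26, 7.79, 10.62),
  PriceEconomy   = c(65, 413, 1242, 1909)
)

# Keep only the numeric columns, then compute the standard deviation of each.
numeric.cols <- sapply(demo.df, is.numeric)
sd.by.column <- sapply(demo.df[, numeric.cols, drop = FALSE], sd)
print(round(sd.by.column, 2))
```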

British <- subset(airlines.df, Airline == "British")
Virgin <- subset(airlines.df, Airline == "Virgin")
Delta <- subset(airlines.df, Airline == "Delta")
AirFrance <- subset(airlines.df, Airline == "AirFrance")
Jet <- subset(airlines.df, Airline == "Jet")
Singapore <- subset(airlines.df, Airline == "Singapore")

Draw Box Plots / Bar Plots to visualize the distribution of each variable independently

library(lattice)

boxplot(PriceEconomy ~ IsInternational, data = airlines.df, xlab = "Price ($)", horizontal = TRUE, main = "Price Distribution of Economy class across Domestic and International", col = c("violet", "green"))

boxplot(PricePremium ~ IsInternational, data = airlines.df, xlab = "Price ($)", horizontal = TRUE, main = "Price Distribution of Premium Economy class across Domestic and International", col = c("blue", "green", "yellow","violet", "orange", "red"))

 boxplot(PriceEconomy ~ Airline, data = airlines.df, xlab = "Price ($)", ylab = "Airline", horizontal = TRUE, main = "Price Distribution of Economy class across Airlines", col = c("violet", "orange","green", "yellow","blue",  "red"))

par(mfrow=c(1,2))
with(airlines.df,plot(Aircraft,PriceEconomy,col=c("peachpuff","khaki"),main="Aircraft vs Economy Pricing"))
with(airlines.df,plot(Aircraft,PricePremium,col=c("peachpuff","khaki"), main="Aircraft vs Premium Pricing"))

Comparison of premium economy ticket price and economy ticket price

plot(airlines.df$PriceEconomy, airlines.df$PricePremium, main="Premium Economy Price vs. Economy Price", xlab="Economy Price", ylab="Premium Economy Price")
abline(0,1)

Pitch Analysis: The difference in seat pitch between premium economy and economy seats

library(lattice)

histogram(airlines.df$PitchDifference, main = "Distribution of Pitch Difference", xlab="Difference in Pitch")

Histograms of seat pitch in the economy and premium economy classes

par(mfrow=c(1,2))
hist(airlines.df$PitchEconomy, xlab="Economy Seats Pitch",col = "pink",main="Economy class ")
hist(airlines.df$PitchPremium, xlab="Premium Seats Pitch",col = "red",main="Premium class ")

Histograms of seat width in the economy and premium economy classes

par(mfrow=c(1,2))
hist(airlines.df$WidthEconomy, xlab="Economy Seats Width",col = "yellow",main="Economy class")
hist(airlines.df$WidthPremium, xlab="Premium Seats Width",col = "blue",main="Premium class")

Histograms of seat price in the economy and premium economy classes

par(mfrow=c(1,2))
hist(airlines.df$PriceEconomy, xlab="Economy Seats Price",col = "blue",main="Economy class")
hist(airlines.df$PricePremium, xlab="Premium Seats Price",col = "red",main="Premium class")

Analysis of the effect of the pitch difference on the relative price of the economy and premium economy classes

library(car)
real_price = aggregate(cbind(PriceEconomy,PricePremium, PriceRelative) ~ PitchDifference, data = airlines.df, mean)
scatterplot(real_price$PitchDifference, real_price$PriceRelative, main="Relative Price Difference & Pitch", xlab="Pitch Difference", ylab="Relative Price b/w Economy and Premium Economy")

Distribution of the difference between the seat width of the premium economy class and the seat width of the economy class

histogram(airlines.df$WidthDifference, main = "Distribution of Difference in Seat Width", xlab="Difference in Seat Width")

Analyzing the capacity of the plane

xyplot(airlines.df$PriceRelative ~ airlines.df$SeatsTotal,type = c("p", "g"), xlab = "Total Seats (Economy + Premium Economy Seats)", ylab = "Rel. Price Difference")

Analyze Percentage of Premium Economy Seats

boxplot(airlines.df$PercentPremiumSeats, main="Percentage of Premium Economy Seats", ylab="Percentage of Premium Economy Seats in Plane")

Draw Scatter Plots to understand how the variables are correlated pair-wise

library(car)

scatterplotMatrix(formula = ~PriceEconomy +PricePremium + PitchDifference + WidthDifference + PercentPremiumSeats + FlightDuration, data = airlines.df)

Scatter plots of relative price against individual variables

scatterplot(airlines.df$PriceRelative ~ airlines.df$PitchDifference, data= airlines.df,
            spread=FALSE, smoother.args=list(lty=2), pch=19,
            main="Scatter plot of price relative vs pitch difference",
            xlab="pitch difference",
            ylab="price relative")

scatterplot(airlines.df$PriceRelative ~ airlines.df$WidthDifference, data= airlines.df,
            spread=FALSE, smoother.args=list(lty=2), pch=19,
            main="Scatter plot of price relative vs Width difference",
            xlab="Width difference",
            ylab="price relative")

scatterplot(airlines.df$PriceRelative ~ airlines.df$PercentPremiumSeats, data= airlines.df,
            spread=FALSE, smoother.args=list(lty=2), pch=19,
            main="Scatter plot of price relative vs Percent premium seat",
            xlab="Percent Premium Seat",
            ylab="price relative")

Price variation of economy with flight duration

plot(airlines.df$FlightDuration,airlines.df$PriceEconomy,
     main="Flight duration vs Economy Price",
     xlab="Flight duration",
     ylab = "Economy Price")
abline(lm(airlines.df$PriceEconomy~airlines.df$FlightDuration),
       col="red")

Price variation of premium with flight duration

plot(airlines.df$FlightDuration,airlines.df$PricePremium,
     main="Flight duration vs Premium Price",
     xlab="flight duration",
     ylab="Premium Price")
abline(lm(airlines.df$PricePremium~airlines.df$FlightDuration),
       col="blue")

Prices in both classes rise gradually with the duration of the flight. The correlation table further below shows that the premium economy price is more strongly correlated with flight duration (0.65) than the economy price is (0.57).
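
One way to compare how fast the two prices grow with flight duration is to compare the slopes of the two fitted regression lines directly, rather than reading them off the plots. The sketch below uses synthetic data (demo.df is hypothetical and deliberately built so that the premium price rises faster); with the real data, the same two lm() calls would be run on airlines.df.

```r
set.seed(1)
# Synthetic data (not the airline dataset): premium price is constructed
# to rise faster with duration than economy price.
duration <- runif(100, min = 1, max = 15)
demo.df <- data.frame(
  FlightDuration = duration,
  PriceEconomy   = 100 * duration + rnorm(100, sd = 200),
  PricePremium   = 150 * duration + rnorm(100, sd = 200)
)

# The slope is the estimated price increase per additional hour of flight.
slope.economy <- coef(lm(PriceEconomy ~ FlightDuration, data = demo.df))["FlightDuration"]
slope.premium <- coef(lm(PricePremium ~ FlightDuration, data = demo.df))["FlightDuration"]
print(c(economy = unname(slope.economy), premium = unname(slope.premium)))
```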

attach(airlines.df)
plot(WidthDifference,PriceRelative,main = "Analysis of width of Seats in Difference in price of class")
abline(lm(PriceRelative~WidthDifference),col="pink")

plot(PitchDifference,PriceRelative,main = "Analysis of Pitch of Seats in Difference in price of class")
abline(lm(PriceRelative~PitchDifference),col="green")

Correlation plots

library(corrplot)

## corrplot 0.84 loaded

library(corrgram)

cor(airlines.df[, c(3, 12, 13, 16:18)])

##                     FlightDuration PriceEconomy PricePremium
## FlightDuration          1.00000000   0.56664039   0.64873981
## PriceEconomy            0.56664039   1.00000000   0.90138870
## PricePremium            0.64873981   0.90138870   1.00000000
## PitchDifference        -0.03749288  -0.09952511  -0.01806629
## WidthDifference        -0.11856070  -0.08449975  -0.01151218
## PercentPremiumSeats     0.06051625   0.06532232   0.11639097
##                     PitchDifference WidthDifference PercentPremiumSeats
## FlightDuration          -0.03749288     -0.11856070          0.06051625
## PriceEconomy            -0.09952511     -0.08449975          0.06532232
## PricePremium            -0.01806629     -0.01151218          0.11639097
## PitchDifference          1.00000000      0.76089108         -0.09264869
## WidthDifference          0.76089108      1.00000000         -0.27559416
## PercentPremiumSeats     -0.09264869     -0.27559416          1.00000000

corrgram(airlines.df, order = TRUE, lower.panel = panel.shade, upper.panel = panel.pie, text.panel=panel.txt, main = "Corrgram of airlines intercorrelations")

library(corrgram)
corrgram(airlines.df, order=NULL, panel=panel.cor,text.panel=panel.txt,main="Corrgram")

Pearson’s Test

cor.test(airlines.df$PricePremium, airlines.df$FlightDuration)

## 
##  Pearson's product-moment correlation
## 
## data:  airlines.df$PricePremium and airlines.df$FlightDuration
## t = 18.204, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5923218 0.6988270
## sample estimates:
##       cor 
## 0.6487398

cor.test(airlines.df$PricePremium, airlines.df$SeatsTotal)

## 
##  Pearson's product-moment correlation
## 
## data:  airlines.df$PricePremium and airlines.df$SeatsTotal
## t = 4.1851, df = 456, p-value = 3.421e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1025049 0.2790349
## sample estimates:
##       cor 
## 0.1923253

cor.test(airlines.df$PricePremium, airlines.df$SeatsEconomy)

## 
##  Pearson's product-moment correlation
## 
## data:  airlines.df$PricePremium and airlines.df$SeatsEconomy
## t = 3.8403, df = 456, p-value = 0.0001402
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08678154 0.26434066
## sample estimates:
##       cor 
## 0.1770009

cor.test(airlines.df$PricePremium, airlines.df$SeatsPremium)

## 
##  Pearson's product-moment correlation
## 
## data:  airlines.df$PricePremium and airlines.df$SeatsPremium
## t = 4.761, df = 456, p-value = 2.591e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1285487 0.3031938
## sample estimates:
##       cor 
## 0.2176124

cor.test(airlines.df$PricePremium, airlines.df$PitchEconomy)

## 
##  Pearson's product-moment correlation
## 
## data:  airlines.df$PricePremium and airlines.df$PitchEconomy
## t = 4.9575, df = 456, p-value = 1.009e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1373612 0.3113179
## sample estimates:
##       cor 
## 0.2261418

cor.test(airlines.df$PricePremium, airlines.df$WidthEconomy)

## 
##  Pearson's product-moment correlation
## 
## data:  airlines.df$PricePremium and airlines.df$WidthEconomy
## t = 3.2519, df = 456, p-value = 0.001231
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0597457 0.2388800
## sample estimates:
##       cor 
## 0.1505484

cor.test(airlines.df$PricePremium, airlines.df$PriceEconomy)

## 
##  Pearson's product-moment correlation
## 
## data:  airlines.df$PricePremium and airlines.df$PriceEconomy
## t = 44.452, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8826622 0.9172579
## sample estimates:
##       cor 
## 0.9013887

cor.test(airlines.df$PricePremium, airlines.df$PercentPremiumSeats)

## 
##  Pearson's product-moment correlation
## 
## data:  airlines.df$PricePremium and airlines.df$PercentPremiumSeats
## t = 2.5024, df = 456, p-value = 0.01268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0250311 0.2058228
## sample estimates:
##      cor 
## 0.116391

Note that cor.test() performs Pearson's product-moment correlation test, which is different from the Pearson chi-square test for normality. The latter is usually not recommended for testing the composite hypothesis of normality due to its inferior power properties compared to other tests. For that test, it is common practice to compute the p-value from the chi-square distribution with n.classes - 3 degrees of freedom, in order to adjust for the additional estimation of two parameters.

T-test

Hypotheses:

H1: There is no relation between relative price and width difference.

H2: There is no relation between relative price and pitch difference.

t.test(airlines.df$PriceRelative,airlines.df$WidthDifference)

## 
##  Welch Two Sample t-test
## 
## data:  airlines.df$PriceRelative and airlines.df$WidthDifference
## t = -19.284, df = 585.55, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.262697 -1.029268
## sample estimates:
## mean of x mean of y 
## 0.4872052 1.6331878

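Here, the p-value is less than 0.05, so H1 is rejected. Note that a two-sample t-test compares the means of the two variables (0.49 vs 1.63), so strictly this result shows that the means differ rather than that the variables are related.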

t.test(airlines.df$PriceRelative,airlines.df$PitchDifference)

## 
##  Welch Two Sample t-test
## 
## data:  airlines.df$PriceRelative and airlines.df$PitchDifference
## t = -72.974, df = 516.54, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.367495 -6.033640
## sample estimates:
## mean of x mean of y 
## 0.4872052 6.6877729

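Here again, the p-value is less than 0.05, so H2 is rejected as well. Strictly, this two-sample t-test shows only that the means of the two variables differ (0.49 vs 6.69); it does not test whether they are related.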
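
The difference between comparing means and testing for a relation can be illustrated with a short sketch. It uses synthetic vectors x and y (hypothetical, not columns of the airline data) in which y has a much larger mean than x but is uncorrelated with x by construction. The t-test still rejects, while the correlation test does not; a correlation test such as cor.test(airlines.df$PriceRelative, airlines.df$WidthDifference) would be the direct check of H1.

```r
set.seed(42)
# Synthetic variables (not the airline dataset).
x     <- rnorm(200, mean = 0.5, sd = 0.4)
y.raw <- rnorm(200)
# Residuals of y.raw on x are orthogonal to x, so the sample correlation
# between x and y is zero up to rounding error; 6.5 shifts the mean of y.
y <- 6.5 + residuals(lm(y.raw ~ x))

means.test    <- t.test(x, y)     # Welch test: are the two means equal?
relation.test <- cor.test(x, y)   # Pearson test: is there a linear relation?

print(means.test$p.value)
print(relation.test$p.value)
```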

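The correlation tests above show that all of the tested variables are significantly correlated with PricePremium (p < 0.05). The correlation is strong only for PriceEconomy (0.90) and FlightDuration (0.65); for the other variables it is weak (0.12 to 0.23).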

Making a Linear Regression Model with the above variables on Price Premium

model1 <- lm(PricePremium ~ FlightDuration+SeatsTotal+SeatsEconomy+SeatsPremium+PitchEconomy+WidthEconomy+PriceEconomy+PercentPremiumSeats, data = airlines.df)
summary(model1)

## 
## Call:
## lm(formula = PricePremium ~ FlightDuration + SeatsTotal + SeatsEconomy + 
##     SeatsPremium + PitchEconomy + WidthEconomy + PriceEconomy + 
##     PercentPremiumSeats, data = airlines.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -715.2 -268.7  -79.5  126.9 3193.6 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7510.66412 1274.70557   5.892 7.48e-09 ***
## FlightDuration        75.48733    8.96891   8.417 5.21e-16 ***
## SeatsTotal            19.99110    6.68139   2.992  0.00292 ** 
## SeatsEconomy         -22.01275    7.75443  -2.839  0.00473 ** 
## SeatsPremium                NA         NA      NA       NA    
## PitchEconomy        -258.40498   40.24849  -6.420 3.46e-10 ***
## WidthEconomy          30.51273   57.95981   0.526  0.59884    
## PriceEconomy           1.08216    0.03105  34.851  < 2e-16 ***
## PercentPremiumSeats  -28.30520   15.56054  -1.819  0.06957 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 474.3 on 450 degrees of freedom
## Multiple R-squared:  0.8665, Adjusted R-squared:  0.8644 
## F-statistic: 417.2 on 7 and 450 DF,  p-value: < 2.2e-16

The summary statistics of the above model show that the R-squared value of the model is 0.8665, which is very high. However, the variables WidthEconomy and PercentPremiumSeats are not statistically significant (p > 0.05), and SeatsPremium has no estimate at all (NA) because it is perfectly collinear with SeatsTotal and SeatsEconomy.
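
The NA row for SeatsPremium ("1 not defined because of singularities") arises when one regressor is an exact linear combination of others, as happens when a total is included alongside its parts. The sketch below reproduces the effect on synthetic seat counts (demo.df and the Price column are hypothetical, not the airline data).

```r
set.seed(7)
# Synthetic seat counts (not the airline dataset); the total is an exact sum.
seats.economy <- sample(80:390, 50, replace = TRUE)
seats.premium <- sample(8:66, 50, replace = TRUE)
demo.df <- data.frame(
  SeatsEconomy = seats.economy,
  SeatsPremium = seats.premium,
  SeatsTotal   = seats.economy + seats.premium,
  Price        = 500 + 2 * seats.economy + 10 * seats.premium + rnorm(50, sd = 50)
)

# SeatsPremium equals SeatsTotal - SeatsEconomy, so lm() cannot estimate it
# separately and reports its coefficient as NA.
fit <- lm(Price ~ SeatsTotal + SeatsEconomy + SeatsPremium, data = demo.df)
print(coef(fit))
```

Dropping any one of the three seat variables removes the redundancy without changing the fitted values.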

Making another Linear Regression Model without the variables SeatsPremium, WidthEconomy and PercentPremiumSeats

model2 <- lm(PricePremium~FlightDuration+SeatsTotal+PitchEconomy+SeatsEconomy+PriceEconomy, data = airlines.df)
summary(model2)

## 
## Call:
## lm(formula = PricePremium ~ FlightDuration + SeatsTotal + PitchEconomy + 
##     SeatsEconomy + PriceEconomy, data = airlines.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -811.6 -255.6  -70.1  121.9 3215.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7848.1045  1158.8056   6.773 3.95e-11 ***
## FlightDuration   75.7793     7.7476   9.781  < 2e-16 ***
## SeatsTotal        8.6861     2.1909   3.965 8.54e-05 ***
## PitchEconomy   -265.0437    37.5671  -7.055 6.51e-12 ***
## SeatsEconomy     -8.7836     2.4490  -3.587 0.000372 ***
## PriceEconomy      1.0735     0.0283  37.934  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 475 on 452 degrees of freedom
## Multiple R-squared:  0.8655, Adjusted R-squared:  0.864 
## F-statistic: 581.8 on 5 and 452 DF,  p-value: < 2.2e-16

The model’s R-squared value is 0.8655, which suggests that the model is a good predictor of the price of the premium economy class. All regressors are statistically significant (p < 0.05).

Summary of the above models: Although the scatterplot matrix and corrgram show that all of these variables are related to the price of the premium economy class, some of them do not have sufficient statistical significance in the linear regression models. The flight duration, the total number of seats, the pitch in the economy class, the number of seats in the economy class and the price of the economy class significantly affect the price of the premium economy class. The number of premium seats, the width of economy class seats and the percentage of premium seats are not statistically significant regressors, so these variables are left out of the proposed regression model. The remaining variables can be considered factors that drive the price of premium economy class seats.

Next, a new regression model is defined for the premium economy class price, using the variables that should drive the price difference to a large extent: flight duration, percentage of premium seats, width difference, economy price and pitch difference. These variables are chosen because they relate to the reasons for which the airline industry promotes the premium economy class.

model <- lm(PricePremium~FlightDuration+PercentPremiumSeats+PitchDifference+WidthDifference+PriceEconomy, data=airlines.df)
summary(model)

## 
## Call:
## lm(formula = PricePremium ~ FlightDuration + PercentPremiumSeats + 
##     PitchDifference + WidthDifference + PriceEconomy, data = airlines.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -803.2 -287.5  -46.8  151.6 3434.6 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -572.66050  132.93317  -4.308 2.02e-05 ***
## FlightDuration        76.97237    8.07003   9.538  < 2e-16 ***
## PercentPremiumSeats   21.58784    5.09201   4.240 2.72e-05 ***
## PitchDifference       -4.03335   20.94083  -0.193 0.847353    
## WidthDifference      115.29104   32.16186   3.585 0.000374 ***
## PriceEconomy           1.02266    0.02881  35.497  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 496.8 on 452 degrees of freedom
## Multiple R-squared:  0.8529, Adjusted R-squared:  0.8512 
## F-statistic:   524 on 5 and 452 DF,  p-value: < 2.2e-16

The results show that this model explains the price of the premium economy class with an R-squared value of 0.8529. It can therefore be concluded that the flight duration, the percentage of premium seats, the width difference between the economy and premium economy classes and the price of the economy class are driving factors of the premium economy price. However, the pitch difference is not statistically significant (p = 0.85), which suggests that, once the other variables are accounted for, the extra pitch in the premium economy class plays a small role in driving its price.
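
When several candidate regression models are fitted to the same response, adjusted R-squared and AIC offer a quick side-by-side comparison that penalizes extra regressors. The following is a minimal sketch on synthetic data; demo.df, the Noise column and the coefficients used to generate PricePremium are all made up for illustration and are not taken from the airline data.

```r
set.seed(3)
n <- 120
# Synthetic data (not the airline dataset).
demo.df <- data.frame(
  FlightDuration = runif(n, 1, 15),
  PriceEconomy   = runif(n, 65, 3600),
  Noise          = rnorm(n)   # unrelated to the response by design
)
demo.df$PricePremium <- 1.1 * demo.df$PriceEconomy +
  75 * demo.df$FlightDuration + rnorm(n, sd = 400)

fit.small <- lm(PricePremium ~ PriceEconomy + FlightDuration, data = demo.df)
fit.large <- lm(PricePremium ~ PriceEconomy + FlightDuration + Noise, data = demo.df)

# Higher adjusted R-squared and lower AIC indicate the preferred model.
comparison <- data.frame(
  model  = c("small", "large"),
  adj.r2 = c(summary(fit.small)$adj.r.squared, summary(fit.large)$adj.r.squared),
  aic    = c(AIC(fit.small), AIC(fit.large))
)
print(comparison)
```

Plain R-squared never decreases when a regressor is added, which is why the adjusted value is the more informative one for comparing models of different sizes.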

Conclusions

The boxplots of airline vs premium economy air-ticket cost and airline vs economy class air-ticket cost have a similar shape, which suggests that the prices in both classes depend on the airline.

The airline factor is statistically related to economy class air ticket price, the premium economy class air ticket price and the relative price for the corresponding values for both correlation tests.

The seats in the economy class are statistically related to the price of economy class air ticket, as per the adjoining regression model and correlation table.

The difference in the number of seats between the economy class and the premium economy class does not contribute significantly to the air-ticket prices of the economy and premium economy classes, since its p-value is greater than 0.05 in the adjoining linear regression model.

Based on the correlation test, the international factor is related to the prices of economy and premium economy class air tickets.

Surprisingly, the travel month is positively correlated with the relative price between economy class and premium economy class air tickets, according to the adjoining correlation table and test, but the correlation is close to zero and hence very weak. The travel month is also not statistically significant for the relative price in the regression model.