About the project
The airline data set that we are working with is used to compare economy class air-ticket prices with premium economy class air-ticket prices. Several other factors, such as seat pitch, seat width, flight duration and seat counts, are also provided in the data set.
The aim of this analysis is to identify which factors explain the prices of premium economy and economy class air tickets.
Read and view airlines data sets
airlines.df <- read.csv("SixAirlinesDataV2.csv")
View(airlines.df)
Draw Box Plots / Bar Plots to visualize the distribution of each variable independently
library(lattice)
boxplot(PriceEconomy ~ IsInternational, data = airlines.df, xlab = "Price ($)", horizontal = TRUE, main = "Price Distribution of Economy class across Domestic and International", col = c("violet","green"))

boxplot(PricePremium ~ IsInternational, data = airlines.df, xlab = "Price ($)", horizontal = TRUE, main = "Price Distribution of Premium Economy class across Domestic and International", col = c("blue", "green", "yellow","violet", "orange", "red"))

boxplot(PriceEconomy ~ Airline, data = airlines.df, xlab = "Price ($)", ylab = "Airline", horizontal = TRUE, main = "Price Distribution of Economy class across Airlines", col = c("violet", "orange","green", "yellow","blue", "red"))

par(mfrow=c(1,2))
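# factor() makes plot() draw one box per aircraft type even when read.csv() returns strings
with(airlines.df,plot(factor(Aircraft),PriceEconomy,col=c("peachpuff","khaki"),main="Aircraft vs Economy Pricing"))
with(airlines.df,plot(factor(Aircraft),PricePremium,col=c("peachpuff","khaki"), main="Aircraft vs Premium Pricing"))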

Comparison of premium economy ticket price and economy ticket price
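# Economy price on the x-axis, premium economy price on the y-axis
plot(airlines.df$PriceEconomy, airlines.df$PricePremium, main="Premium Economy Price vs. Economy Price", xlab="Economy Price", ylab="Premium Economy Price")
# Reference line y = x: points above it are routes where premium economy costs more
abline(0,1)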

Pitch Analysis: The difference in seat pitch between premium economy and economy seats
library(lattice)
histogram(airlines.df$PitchDifference, main = "Distribution of Pitch Difference", xlab="Difference in Pitch")

Histograms of seat pitch in the economy and premium economy classes
par(mfrow=c(1,2))
hist(airlines.df$PitchEconomy, xlab="Economy Seats Pitch",col = "pink",main="Economy class ")
hist(airlines.df$PitchPremium, xlab="Premium Seats Pitch",col = "red",main="Premium class ")

Histograms of seat width in the economy and premium economy classes
par(mfrow=c(1,2))
hist(airlines.df$WidthEconomy, xlab="Economy Seats Width",col = "yellow",main="Economy class")
hist(airlines.df$WidthPremium, xlab="Premium Seats Width",col = "blue",main="Premium class")

Histograms of ticket price in the economy and premium economy classes
par(mfrow=c(1,2))
hist(airlines.df$PriceEconomy, xlab="Economy Seats Price",col = "blue",main="Economy class")
hist(airlines.df$PricePremium, xlab="Premium Seats Price",col = "red",main="Premium class")

Analysis of the effect of the pitch difference on the relative price of economy and premium economy.
library(car)
real_price = aggregate(cbind(PriceEconomy,PricePremium, PriceRelative) ~ PitchDifference, data = airlines.df, mean)
scatterplot(real_price$PitchDifference, real_price$PriceRelative, main="Relative Price Difference & Pitch", xlab="Pitch Difference", ylab="Relative Price b/w Economy and Premium Economy")

Distribution of the difference in seat width between premium economy and economy.
histogram(airlines.df$WidthDifference, main = "Distribution of Difference in Seat Width", xlab="Difference in Seat Width")

Analyzing the capacity of the plane
xyplot(airlines.df$PriceRelative ~ airlines.df$SeatsTotal,type = c("p", "g"), xlab = "Total Seats (Economy + Premium Economy Seats)", ylab = "Rel. Price Difference")

Analyze Percentage of Premium Economy Seats
boxplot(airlines.df$PercentPremiumSeats, main="Percentage of Premium Economy Seats", ylab="Percentage of Premium Economy Seats in Plane")

Price variation of premium with flight duration
plot(airlines.df$FlightDuration,airlines.df$PricePremium,
main="Flight duration vs Premium Price",
xlab="flight duration",
ylab="Premium Price")
abline(lm(airlines.df$PricePremium~airlines.df$FlightDuration),
col="blue")

As flight duration increases, the price rises gradually for both classes; however, the growth rate for economy is higher than for premium economy.
# Pass the data frame explicitly instead of attach(), so no names leak into the search path
plot(PriceRelative ~ WidthDifference, data = airlines.df, main = "Relative price vs. difference in seat width")
abline(lm(PriceRelative ~ WidthDifference, data = airlines.df), col = "pink")

plot(PriceRelative ~ PitchDifference, data = airlines.df, main = "Relative price vs. difference in seat pitch")
abline(lm(PriceRelative ~ PitchDifference, data = airlines.df), col = "green")

Correlation plots
library(corrplot)
## corrplot 0.84 loaded
library(corrgram)
cor(airlines.df[, c(3, 12, 13, 16:18)])
##                     FlightDuration PriceEconomy PricePremium
## FlightDuration          1.00000000   0.56664039   0.64873981
## PriceEconomy            0.56664039   1.00000000   0.90138870
## PricePremium            0.64873981   0.90138870   1.00000000
## PitchDifference        -0.03749288  -0.09952511  -0.01806629
## WidthDifference        -0.11856070  -0.08449975  -0.01151218
## PercentPremiumSeats     0.06051625   0.06532232   0.11639097
##                     PitchDifference WidthDifference PercentPremiumSeats
## FlightDuration          -0.03749288     -0.11856070          0.06051625
## PriceEconomy            -0.09952511     -0.08449975          0.06532232
## PricePremium            -0.01806629     -0.01151218          0.11639097
## PitchDifference          1.00000000      0.76089108         -0.09264869
## WidthDifference          0.76089108      1.00000000         -0.27559416
## PercentPremiumSeats     -0.09264869     -0.27559416          1.00000000
corrgram(airlines.df, order = TRUE, lower.panel = panel.shade, upper.panel = panel.pie, text.panel=panel.txt, main = "Corrgram of airlines intercorrelations")

corrgram(airlines.df, order=NULL, panel=panel.cor,text.panel=panel.txt,main="Corrgram")

Pearson’s Test
cor.test(airlines.df$PricePremium, airlines.df$FlightDuration)
##
## Pearson's product-moment correlation
##
## data: airlines.df$PricePremium and airlines.df$FlightDuration
## t = 18.204, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5923218 0.6988270
## sample estimates:
## cor
## 0.6487398
cor.test(airlines.df$PricePremium, airlines.df$SeatsTotal)
##
## Pearson's product-moment correlation
##
## data: airlines.df$PricePremium and airlines.df$SeatsTotal
## t = 4.1851, df = 456, p-value = 3.421e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1025049 0.2790349
## sample estimates:
## cor
## 0.1923253
cor.test(airlines.df$PricePremium, airlines.df$SeatsEconomy)
##
## Pearson's product-moment correlation
##
## data: airlines.df$PricePremium and airlines.df$SeatsEconomy
## t = 3.8403, df = 456, p-value = 0.0001402
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08678154 0.26434066
## sample estimates:
## cor
## 0.1770009
cor.test(airlines.df$PricePremium, airlines.df$SeatsPremium)
##
## Pearson's product-moment correlation
##
## data: airlines.df$PricePremium and airlines.df$SeatsPremium
## t = 4.761, df = 456, p-value = 2.591e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1285487 0.3031938
## sample estimates:
## cor
## 0.2176124
cor.test(airlines.df$PricePremium, airlines.df$PitchEconomy)
##
## Pearson's product-moment correlation
##
## data: airlines.df$PricePremium and airlines.df$PitchEconomy
## t = 4.9575, df = 456, p-value = 1.009e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1373612 0.3113179
## sample estimates:
## cor
## 0.2261418
cor.test(airlines.df$PricePremium, airlines.df$WidthEconomy)
##
## Pearson's product-moment correlation
##
## data: airlines.df$PricePremium and airlines.df$WidthEconomy
## t = 3.2519, df = 456, p-value = 0.001231
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0597457 0.2388800
## sample estimates:
## cor
## 0.1505484
cor.test(airlines.df$PricePremium, airlines.df$PriceEconomy)
##
## Pearson's product-moment correlation
##
## data: airlines.df$PricePremium and airlines.df$PriceEconomy
## t = 44.452, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8826622 0.9172579
## sample estimates:
## cor
## 0.9013887
cor.test(airlines.df$PricePremium, airlines.df$PercentPremiumSeats)
##
## Pearson's product-moment correlation
##
## data: airlines.df$PricePremium and airlines.df$PercentPremiumSeats
## t = 2.5024, df = 456, p-value = 0.01268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0250311 0.2058228
## sample estimates:
## cor
## 0.116391
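The repeated cor.test() calls above can also be collected into one table. The following is only a sketch on synthetic data: toy.df, its values and the cor.table name are hypothetical stand-ins, not the airline data or the author's code.

```r
# Synthetic stand-in for airlines.df (hypothetical values, for illustration only)
set.seed(1)
n <- 200
toy.df <- data.frame(FlightDuration = runif(n, 1, 14),
                     SeatsTotal = round(runif(n, 100, 450)))
toy.df$PricePremium <- 300 + 150 * toy.df$FlightDuration + rnorm(n, sd = 400)

# Variables to test against PricePremium
vars <- c("FlightDuration", "SeatsTotal")

# Run cor.test() once per variable and keep the estimate, 95% CI and p-value
cor.table <- do.call(rbind, lapply(vars, function(v) {
  ct <- cor.test(toy.df$PricePremium, toy.df[[v]])
  data.frame(variable = v,
             cor = unname(ct$estimate),
             ci.low = ct$conf.int[1],
             ci.high = ct$conf.int[2],
             p.value = ct$p.value)
}))
print(cor.table)
```

With the real data, vars would list the column names used in the tests above, and toy.df would be replaced by airlines.df.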
Each test above is a Pearson product-moment correlation test: the null hypothesis is that the true correlation between PricePremium and the other variable is zero.
T-test
Hypotheses
H1: There is no relation between relative price and width difference.
H2: There is no relation between relative price and pitch difference.
t.test(airlines.df$PriceRelative,airlines.df$WidthDifference)
##
## Welch Two Sample t-test
##
## data: airlines.df$PriceRelative and airlines.df$WidthDifference
## t = -19.284, df = 585.55, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.262697 -1.029268
## sample estimates:
## mean of x mean of y
## 0.4872052 1.6331878
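Here the p-value is less than 0.05, so H1 is rejected. Note that the Welch two-sample t-test compares the means of PriceRelative and WidthDifference, so strictly it shows that the two means differ rather than that the two variables are related.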
t.test(airlines.df$PriceRelative,airlines.df$PitchDifference)
##
## Welch Two Sample t-test
##
## data: airlines.df$PriceRelative and airlines.df$PitchDifference
## t = -72.974, df = 516.54, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.367495 -6.033640
## sample estimates:
## mean of x mean of y
## 0.4872052 6.6877729
Here again, the p-value is less than 0.05, so H2 is rejected as well (the test shows that the means of PriceRelative and PitchDifference differ).
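A two-sample t-test and a correlation test answer different questions, which matters when reading the results above. The sketch below uses two synthetic, independently generated vectors (x and y are hypothetical and only mimic the scale of PriceRelative and PitchDifference) to show that t.test() reacts to a difference in means while cor.test() measures association.

```r
set.seed(42)
x <- rnorm(100, mean = 0.5, sd = 0.4)   # plays the role of PriceRelative
y <- rnorm(100, mean = 6.5, sd = 1.5)   # plays the role of PitchDifference, generated independently of x

means.test <- t.test(x, y)     # Welch test: are the two means equal?
assoc.test <- cor.test(x, y)   # Pearson test: is the correlation zero?

means.test$p.value     # very small, because the means are far apart
assoc.test$estimate    # sample correlation between x and y
assoc.test$p.value     # p-value for the association itself
```

To test whether relative price is related to a seat dimension, cor.test() or a regression slope is the more direct tool.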
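The correlation tests show that PricePremium is significantly correlated (p < 0.05) with every variable tested above, although most of the correlations are weak (0.12 to 0.23); only FlightDuration (0.65) and PriceEconomy (0.90) are strongly correlated.
Making a Linear Regression Model with the above variables on PricePremium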
model1 <- lm(PricePremium ~ FlightDuration+SeatsTotal+SeatsEconomy+SeatsPremium+PitchEconomy+WidthEconomy+PriceEconomy+PercentPremiumSeats, data = airlines.df)
summary(model1)
##
## Call:
## lm(formula = PricePremium ~ FlightDuration + SeatsTotal + SeatsEconomy +
## SeatsPremium + PitchEconomy + WidthEconomy + PriceEconomy +
## PercentPremiumSeats, data = airlines.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -715.2 -268.7 -79.5 126.9 3193.6
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7510.66412 1274.70557 5.892 7.48e-09 ***
## FlightDuration 75.48733 8.96891 8.417 5.21e-16 ***
## SeatsTotal 19.99110 6.68139 2.992 0.00292 **
## SeatsEconomy -22.01275 7.75443 -2.839 0.00473 **
## SeatsPremium NA NA NA NA
## PitchEconomy -258.40498 40.24849 -6.420 3.46e-10 ***
## WidthEconomy 30.51273 57.95981 0.526 0.59884
## PriceEconomy 1.08216 0.03105 34.851 < 2e-16 ***
## PercentPremiumSeats -28.30520 15.56054 -1.819 0.06957 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 474.3 on 450 degrees of freedom
## Multiple R-squared: 0.8665, Adjusted R-squared: 0.8644
## F-statistic: 417.2 on 7 and 450 DF, p-value: < 2.2e-16
The summary of the above model shows an R-squared value of 0.8665, which is very high. However, WidthEconomy and PercentPremiumSeats are not statistically significant (p > 0.05), and SeatsPremium has no estimate at all (NA) because it is fully determined by SeatsTotal and SeatsEconomy, which makes the design matrix singular.
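The NA row for SeatsPremium in the coefficient table can be reproduced on synthetic data. The sketch below assumes, as the "1 not defined because of singularities" note and the "Total Seats (Economy + Premium Economy Seats)" axis label suggest, that SeatsTotal is the sum of the other two seat counts; toy.df and its values are hypothetical.

```r
set.seed(7)
n <- 100
toy.df <- data.frame(SeatsEconomy = round(runif(n, 80, 400)),
                     SeatsPremium = round(runif(n, 8, 60)))
# Assumed relationship: total seats is an exact linear combination of the other two
toy.df$SeatsTotal <- toy.df$SeatsEconomy + toy.df$SeatsPremium
toy.df$PricePremium <- 500 + 2 * toy.df$SeatsPremium + rnorm(n, sd = 50)

fit <- lm(PricePremium ~ SeatsTotal + SeatsEconomy + SeatsPremium, data = toy.df)
coef(fit)    # the last of the three collinear terms (SeatsPremium) comes back as NA
alias(fit)   # reports which term is a linear combination of the others
```

Dropping any one of the three seat-count variables removes the singularity.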
Making another Linear Regression Model without the variables SeatsPremium, WidthEconomy and PercentPremiumSeats
model2 <- lm(PricePremium~FlightDuration+SeatsTotal+PitchEconomy+SeatsEconomy+PriceEconomy, data = airlines.df)
summary(model2)
##
## Call:
## lm(formula = PricePremium ~ FlightDuration + SeatsTotal + PitchEconomy +
## SeatsEconomy + PriceEconomy, data = airlines.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -811.6 -255.6 -70.1 121.9 3215.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7848.1045 1158.8056 6.773 3.95e-11 ***
## FlightDuration 75.7793 7.7476 9.781 < 2e-16 ***
## SeatsTotal 8.6861 2.1909 3.965 8.54e-05 ***
## PitchEconomy -265.0437 37.5671 -7.055 6.51e-12 ***
## SeatsEconomy -8.7836 2.4490 -3.587 0.000372 ***
## PriceEconomy 1.0735 0.0283 37.934 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 475 on 452 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.864
## F-statistic: 581.8 on 5 and 452 DF, p-value: < 2.2e-16
The model’s R-squared value is 0.8655, which suggests that the model is a good predictor of the price of the premium economy class. All regressors are statistically significant (p < 0.05).
Summary of the above model: Although the scatterplot matrix and the corrgram show that all the variables are related to the price of the premium economy class, some of them do not have sufficient statistical significance in the linear regression models. Flight duration, the total number of seats, the seat pitch in economy class, the number of economy seats and the economy price are statistically significant regressors. The number of premium seats, the economy seat width and the percentage of premium seats are not, so they are left out of the proposed regression model. The variables that remain can be considered the factors that drive the price of premium economy seats.
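One way to check that dropping non-significant regressors costs little is an F-test between the nested models, alongside the adjusted R-squared values. This is a sketch on synthetic data, not part of the original analysis: toy.df, full and reduced are hypothetical names, and WidthEconomy is generated with no effect on price by construction.

```r
set.seed(3)
n <- 150
toy.df <- data.frame(FlightDuration = runif(n, 1, 14),
                     PriceEconomy = runif(n, 100, 3000),
                     WidthEconomy = sample(17:19, n, replace = TRUE))
# Price depends on duration and economy price only (assumed for this illustration)
toy.df$PricePremium <- 200 + 75 * toy.df$FlightDuration +
  1.1 * toy.df$PriceEconomy + rnorm(n, sd = 300)

full    <- lm(PricePremium ~ FlightDuration + PriceEconomy + WidthEconomy, data = toy.df)
reduced <- lm(PricePremium ~ FlightDuration + PriceEconomy, data = toy.df)

cmp <- anova(reduced, full)   # F-test: does the extra term improve the fit?
print(cmp)
summary(full)$adj.r.squared
summary(reduced)$adj.r.squared
```

On the airline data the analogous call would be anova(model2, model1), since model2 uses a subset of the regressors in model1.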
Conclusions
The boxplots of airline vs. premium economy air-ticket price and airline vs. economy air-ticket price have a similar shape, which suggests that the prices of both classes depend on the airline.
The airline factor is statistically related to the economy class air-ticket price, the premium economy class air-ticket price and the relative price, according to both correlation tests.
The number of seats in the economy class is statistically related to the price of the economy class air ticket, as per the regression model and correlation table.
The number of seats in the economy class is also statistically related to the price of the premium economy class air ticket, as per the regression model and correlation table.
The difference in the number of seats between the economy class and the premium economy class does not contribute significantly to the economy or premium economy air-ticket price, since its p-value is greater than 0.05 in the linear regression model.
Based on the correlation test, the international factor is related to the economy and premium economy class air-ticket prices.
Surprisingly, the travel month is positively correlated with the relative price between economy class and premium economy class air tickets, according to the correlation table and test, but the correlation is close to zero and therefore very weak. The travel month is also not statistically significant for the relative price in the regression model.