In this mini-project, an airlines dataset is studied and analyzed. A basic summary of all the variables and their visualizations is produced to better understand the data.
After this, a corrgram of the data is constructed to identify the variables that are strongly related.
The data needs to be read and attached before it can be analyzed.
setwd("~/Muyeena/Internship/Case Studies/Airlines")
airlines = read.csv("Airlines.csv")
#str(airlines)
attach(airlines)
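The level counts in the summary output and the box plots of price by category both rely on the text columns being factors. In R 4.0.0 and later, read.csv() defaults to stringsAsFactors = FALSE, so the conversion may need to be requested explicitly. A minimal sketch, using a tiny inline CSV with made-up rows (not the real Airlines.csv):

```r
# Hypothetical three-row extract; the real file is Airlines.csv.
csv_text <- "Airline,Aircraft,PriceEconomy
British,Boeing,1200
Delta,AirBus,300
British,AirBus,950"

# stringsAsFactors = TRUE turns text columns into factors,
# which summary() then reports as counts per level.
demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

is.factor(demo$Airline)
summary(demo$Airline)
```

With the real file, the same argument would simply be added to the existing read.csv("Airlines.csv") call.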
We will now try running some basic statistics, to understand the data better.
# Named airline_summary so that base R's sum() is not masked
airline_summary = summary(airlines)
airline_summary[,c(1,2,4,5)]
##       Airline       Aircraft    TravelMonth      IsInternational
##  AirFrance: 74    AirBus:151    Aug:127      Domestic     : 40
##  British  :175    Boeing:307    Jul: 75      International:418
##  Delta    : 46                  Oct:127
##  Jet      : 61                  Sep:129
##  Singapore: 40
##  Virgin   : 62
library(psych)
des = describe(airlines)
des[c(3,6:18), c(3,4,5,8,9,10)]
## mean sd median min max range
## FlightDuration 7.58 3.54 7.79 1.25 14.66 13.41
## SeatsEconomy 202.31 76.37 185.00 78.00 389.00 311.00
## SeatsPremium 33.65 13.26 36.00 8.00 66.00 58.00
## PitchEconomy 31.22 0.66 31.00 30.00 33.00 3.00
## PitchPremium 37.91 1.31 38.00 34.00 40.00 6.00
## WidthEconomy 17.84 0.56 18.00 17.00 19.00 2.00
## WidthPremium 19.47 1.10 19.00 17.00 21.00 4.00
## PriceEconomy 1327.08 988.27 1242.00 65.00 3593.00 3528.00
## PricePremium 1845.26 1288.14 1737.00 86.00 7414.00 7328.00
## PriceRelative 0.49 0.45 0.36 0.02 1.89 1.87
## SeatsTotal 235.96 85.29 227.00 98.00 441.00 343.00
## PitchDifference 6.69 1.76 7.00 2.00 10.00 8.00
## WidthDifference 1.63 1.19 1.00 0.00 4.00 4.00
## PercentPremiumSeats 14.65 4.84 13.21 4.71 24.69 19.98
From this data, we can infer the following about the given dataset:
We will now try visualizing most of the important data points, so that we can draw better inference from the data.
Categorical variables are those that hold labels rather than numerical values. The built-in R function hist() requires numerical data, so the ggplot2 package is used here to get the desired visualization.
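For comparison, base R can also draw a categorical variable once its values have been counted with table(); barplot() then plots the counts. A minimal sketch on a made-up vector (the labels are illustrative, not taken from the dataset):

```r
# Illustrative airline labels; not the real data.
airline_demo <- factor(c("British", "Delta", "British", "Jet", "British"))

# table() counts each level; barplot() draws one bar per count.
counts <- table(airline_demo)
barplot(counts, col = "lightblue", ylab = "Number of flights")
```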
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
# airlines is already a data frame, so it can be passed to ggplot() directly
ggplot(airlines, aes(x = Airline)) +
  geom_bar(fill = "lightblue")
ggplot(airlines, aes(x = Aircraft)) +
  geom_bar(fill = "lightgreen")
ggplot(airlines, aes(x = IsInternational)) +
  geom_bar(fill = "purple")
ggplot(airlines, aes(x = TravelMonth)) +
  geom_bar(fill = "maroon")
par(mfrow=c(1,1))
boxplot(FlightDuration, xlab = "Flight Duration (hours)", col = "pink", cex = 1.5)
par(mfrow = c(2,1))
boxplot(PriceEconomy, horizontal = TRUE, xlab = "Price of Economy Flights (dollars)", col = "skyblue", cex = 1.5, ylim = c(0, 8000))
boxplot(PricePremium, horizontal = TRUE, xlab = "Price of Premium Flights (dollars)", col = "yellow", cex = 1.5, ylim = c(0, 8000))
As can be seen above, the price distributions of Economy and Premium flights are similar in shape, with Premium prices spread more widely. One Premium fare of about 7400$ (the maximum in the summary table is 7414) drastically affects the mean and range of the Premium price.
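The pull of a single extreme value on the mean can be checked numerically. The sketch below uses made-up prices (not the dataset) and boxplot.stats(), which returns the points beyond the box plot whiskers, to compare the mean with and without the flagged value:

```r
# Made-up premium prices with one extreme value, for illustration only.
prices <- c(900, 1200, 1500, 1800, 2100, 2400, 7400)

# boxplot.stats() returns the points beyond the whiskers in $out.
outliers <- boxplot.stats(prices)$out

mean(prices)                          # pulled up by the extreme value
median(prices)                        # barely affected
mean(prices[!prices %in% outliers])   # mean with the flagged points removed
```

The median is the more robust summary here, which is one reason the summary table reports it alongside the mean.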
par(mfrow=c(3,1))
hist(PitchEconomy, xlim = c(30,40), col = "skyblue")
hist(PitchPremium, xlim = c(30,40), col = "yellow")
hist(PitchDifference, col = "lightgreen")
As seen, the maximum pitch in economy (33 inches) is below the minimum pitch in premium (34 inches), so the two classes do not overlap. The minimum pitch difference on any one flight is 2 inches.
Most premium seats have a pitch of 38 inches, except for a few outliers. The median pitch difference is 7 inches and the maximum is 10 inches.
par(mfrow=c(3,1))
hist(WidthEconomy, xlim = c(17,21),col = "skyblue")
hist(WidthPremium, xlim = c(17,21), col = "yellow")
hist(WidthDifference, col = "lightgreen")
When it comes to width, there is no clear separation between economy and premium: the two ranges overlap (17 to 19 inches versus 17 to 21 inches). As a result, there are a few cases where the width difference is zero.
The median width difference is 1 inch and the maximum is 4 inches.
par(mfrow=c(2,2))
hist(SeatsTotal, xlim = c(5,500), col = "pink")
hist(SeatsEconomy, xlim = c(5,390),col = "skyblue")
hist(PercentPremiumSeats, xlim = c(0,30), col = "lightgreen")
hist(SeatsPremium, xlim = c(5,390), col = "yellow")
From the above visualizations, it is noted that on most flights the number of economy seats is above 100, while the number of premium seats is below 100 on every flight in our dataset (the maximum is 66).
We also notice that premium seats make up about 14% of all seats on average (mean 14.65%, median 13.21%).
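The seat variables appear to be related by simple arithmetic: the mean economy and premium seat counts (202.31 and 33.65) add up to the mean total (235.96). The sketch below assumes that SeatsTotal is the sum of the two classes and that PercentPremiumSeats is the premium share of the total; both definitions are assumptions, and the seat counts are made up:

```r
# Made-up seat counts for three flights (not rows from the dataset).
seats_economy <- c(122, 180, 300)
seats_premium <- c(40, 36, 24)

# Assumed definitions of the derived columns.
seats_total <- seats_economy + seats_premium
percent_premium <- 100 * seats_premium / seats_total

round(percent_premium, 2)
```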
In this section, the relationship of the price of both the types of the flights with respect to other variables is explored.
par(mfrow=c(1,2))
clr = c("purple","red","skyblue","green","yellow","pink" )
plot(PriceEconomy ~ Airline, ylim = c(0,4200), col = clr)
plot(PricePremium ~ Airline, ylim = c(0,4200), col = clr)
text = "An outlier value of around 7500$ is not represented"
mtext(text, side = 1, cex = 0.7, col = grey(0.5), line = 2)
par(mfrow=c(1,1))
plot(PriceRelative ~ Airline, col = clr)
From the above visualizations, we can infer the following :
par(mfrow=c(1,2))
clr = c("purple","red","skyblue","green","yellow","pink" )
plot(PriceEconomy ~ Aircraft, ylim = c(0,4200), col = clr)
plot(PricePremium ~ Aircraft, ylim = c(0,4200), col = clr)
text = "An outlier value of around 7500$ is not represented"
mtext(text, side = 1, cex = 0.7, col = grey(0.5), line = 2)
par(mfrow=c(1,1))
plot(PriceRelative ~ Aircraft, col = clr)
From the above visualizations, we can infer the following :
par(mfrow=c(1,2))
clr = c("purple","red","skyblue","green","yellow","pink" )
plot(PriceEconomy ~ TravelMonth, ylim = c(0,4200), col = clr)
plot(PricePremium ~ TravelMonth, ylim = c(0,4200), col = clr)
text = "An outlier value of around 7500$ is not represented"
mtext(text, side = 1, cex = 0.7, col = grey(0.5), line = 2)
par(mfrow=c(1,1))
plot(PriceRelative ~ TravelMonth, col = clr)
From the above visualizations, we can infer the following :
par(mfrow=c(1,2))
clr = c("purple","red","skyblue","green","yellow","pink" )
plot(PriceEconomy ~ IsInternational, ylim = c(0,4200), col = clr)
plot(PricePremium ~ IsInternational, ylim = c(0,4200), col = clr)
text = "An outlier value of around 7500$ is not represented"
mtext(text, side = 1, cex = 0.7, col = grey(0.5), line = 2)
par(mfrow=c(1,1))
plot(PriceRelative ~ IsInternational, col = clr)
From the above visualizations, we can infer the following :
par(mfrow=c(1,2))
clr = c("purple","red","skyblue","green","yellow","pink" )
plot(PriceEconomy ~ FlightDuration, ylim = c(0,4200), col = "black")
abline(lm(PriceEconomy ~ FlightDuration))
abline(lm(PricePremium ~ FlightDuration), col = "red")
plot(PricePremium ~ FlightDuration, ylim = c(0,4200), col = "red")
abline(lm(PriceEconomy ~ FlightDuration))
abline(lm(PricePremium ~ FlightDuration), col = "red")
text = "An outlier value of around 7500$ is not represented"
mtext(text, side = 1, cex = 0.7, col = grey(0.5), line = 2)
par(mfrow=c(1,1))
plot(PriceRelative ~ FlightDuration, col = "black")
abline(lm(PriceRelative ~ FlightDuration))
Note : An outlier value of around 7500$ is not represented in the above diagram (Price of Premium vs. Flight Duration).
From the above visualizations, we can infer the following :
par(mfrow=c(1,2))
clr = c("purple","red","skyblue","green","yellow","pink" )
plot(PriceEconomy ~ PitchEconomy, ylim = c(0,4200), xlim = c(30,40), col = "black")
abline(lm(PriceEconomy ~ PitchEconomy))
abline(lm(PricePremium ~ PitchPremium), col = "red")
plot(PricePremium ~ PitchPremium, ylim = c(0,4200), xlim = c(30,40), col = "red")
abline(lm(PriceEconomy ~ PitchEconomy))
abline(lm(PricePremium ~ PitchPremium), col = "red")
text = "An outlier value of around 7500$ is not represented"
mtext(text, side = 1, cex = 0.7, col = grey(0.5), line = 2)
par(mfrow=c(1,1))
plot(PriceRelative ~ PitchDifference, col = "black")
abline(lm(PriceRelative ~ PitchDifference))
Note : An outlier value of around 7500$ is not represented in the above diagram (Price of Premium vs. Pitch of Premium). It has a pitch of 38 inches.
From the above visualizations, we can infer the following :
par(mfrow=c(1,2))
clr = c("purple","red","skyblue","green","yellow","pink" )
plot(PriceEconomy ~ WidthEconomy, ylim = c(0,4200), xlim = c(17,21), col = "black")
abline(lm(PriceEconomy ~ WidthEconomy))
abline(lm(PricePremium ~ WidthPremium), col = "red")
plot(PricePremium ~ WidthPremium, ylim = c(0,4200), xlim = c(17,21), col = "red")
abline(lm(PriceEconomy ~ WidthEconomy))
abline(lm(PricePremium ~ WidthPremium), col = "red")
text = "An outlier value of around 7500$ is not represented"
mtext(text, side = 1, cex = 0.7, col = grey(0.5), line = 2)
par(mfrow=c(1,1))
plot(PriceRelative ~ WidthDifference, col = "darkgreen", pch = 18)
abline(lm(PriceRelative ~ WidthDifference), col = "darkgreen")
Note : An outlier value of around 7500$ is not represented in the above diagram (Price of Premium vs. Width of Premium). It has a width of 19 inches.
From the above visualizations, we can infer the following :
par(mfrow=c(1,2))
clr = c("purple","red","skyblue","green","yellow","pink" )
# xlim extended to 450 because SeatsTotal goes up to 441
plot(PriceEconomy ~ SeatsTotal, ylim = c(0,4200), xlim = c(0,450), col = "black")
abline(lm(PriceEconomy ~ SeatsTotal))
abline(lm(PricePremium ~ SeatsTotal), col = "red")
plot(PricePremium ~ SeatsTotal, ylim = c(0,4200), xlim = c(0,450), col = "red")
abline(lm(PriceEconomy ~ SeatsTotal))
abline(lm(PricePremium ~ SeatsTotal), col = "red")
text = "An outlier value of around 7500$ is not represented"
mtext(text, side = 1, cex = 0.7, col = grey(0.5), line = 2)
par(mfrow=c(1,1))
plot(PriceRelative ~ SeatsTotal, col = "darkgreen", pch = 18)
abline(lm(PriceRelative ~ SeatsTotal), col = "darkgreen")
Note : An outlier value of around 7500$ is not represented in the above diagram (Price of Premium vs. Total Seats). It has a total of 220 seats.
From the above visualizations, we can infer the following :
par(mfrow=c(1,1))
library(corrgram)
library(corrplot)
## corrplot 0.84 loaded
col = c(3,(6:18))
airlines1 = airlines[,col]
corrplot(corr = cor(airlines1), method = "ellipse", type = "upper")
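The strongest relationships can also be read off numerically instead of by eye. The sketch below ranks variable pairs by absolute correlation; it uses a small made-up data frame in place of airlines1, and the names demo and pair_table are hypothetical:

```r
# Small made-up data frame standing in for airlines1.
set.seed(1)
demo <- data.frame(x = rnorm(50))
demo$y <- demo$x * 2 + rnorm(50, sd = 0.1)   # strongly tied to x
demo$z <- rnorm(50)                           # unrelated noise

cm <- cor(demo)

# Keep each pair once (upper triangle), then sort by absolute correlation.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
pair_table[order(-abs(pair_table$r)), ]
```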
From the above corrplot, we can infer the following :
After the exhaustive analysis of all the above visualizations, it is advisable to explore the relationship between the following variables :
The Null Hypothesis for the same will be :
There is no significant change in Price Economy with respect to a change in Flight Duration.
cor.test(PriceEconomy, FlightDuration)
##
## Pearson's product-moment correlation
##
## data: PriceEconomy and FlightDuration
## t = 14.685, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5010266 0.6257772
## sample estimates:
## cor
## 0.5666404
A low p-value (<0.05) indicates that the null hypothesis can be rejected.
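The t statistic printed by cor.test() follows directly from the sample correlation and the degrees of freedom, t = r * sqrt(n - 2) / sqrt(1 - r^2). As a worked check using only the numbers reported in the output above:

```r
r  <- 0.5666404   # sample correlation reported by cor.test()
df <- 456         # degrees of freedom, n - 2

# Test statistic for H0: true correlation = 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 3)

# Two-sided p-value from the t distribution.
2 * pt(-abs(t_stat), df)
```

The result agrees with the reported t = 14.685, and the p-value is far below 0.05.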
The Null Hypothesis for the same will be :
There is no significant change in Price Relative with respect to a change in Pitch Difference.
cor.test(PriceRelative, PitchDifference)
##
## Pearson's product-moment correlation
##
## data: PriceRelative and PitchDifference
## t = 11.331, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3940262 0.5372817
## sample estimates:
## cor
## 0.4687302
A low p-value (<0.05) indicates that the null hypothesis can be rejected. However, the correlation is only moderate (r = 0.47).
The Null Hypothesis for the same will be :
There is no significant change in Price Relative with respect to a change in Width Difference.
cor.test(PriceRelative, WidthDifference)
##
## Pearson's product-moment correlation
##
## data: PriceRelative and WidthDifference
## t = 11.869, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4125388 0.5528218
## sample estimates:
## cor
## 0.4858024
A low p-value (<0.05) indicates that the null hypothesis can be rejected. However, the correlation is only moderate (r = 0.49).
The Null Hypothesis for the same will be :
There is no significant change in Pitch Difference with respect to a change in Width Difference.
cor.test(PitchDifference, WidthDifference)
##
## Pearson's product-moment correlation
##
## data: PitchDifference and WidthDifference
## t = 25.04, df = 456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7194209 0.7969557
## sample estimates:
## cor
## 0.7608911
A low p-value (<0.05) indicates that the null hypothesis can be rejected.
The Null Hypothesis for the same will be :
There is no significant change in Price Relative with respect to Airline.
airline_pr = xtabs(~PriceRelative+Airline)
chisq.test(airline_pr)
## Warning in chisq.test(airline_pr): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: airline_pr
## X-squared = 1402.9, df = 485, p-value < 2.2e-16
A low p-value (<0.05) indicates that the null hypothesis can be rejected.
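Because PriceRelative is numeric, xtabs() treats every distinct value as its own category. That is why the table has 485 degrees of freedom (97 x 5) and why R warns that the chi-squared approximation may be incorrect. One alternative for comparing a numeric variable across groups is a Kruskal-Wallis test or a one-way ANOVA. The sketch below is not the method used in this analysis; it runs on made-up data with hypothetical names (demo, carrier, price_relative):

```r
# Made-up relative prices for three hypothetical carriers.
set.seed(42)
demo <- data.frame(
  carrier = factor(rep(c("A", "B", "C"), each = 30)),
  price_relative = c(rnorm(30, 0.2, 0.1),
                     rnorm(30, 0.5, 0.1),
                     rnorm(30, 0.9, 0.1))
)

# Kruskal-Wallis rank-sum test: do the groups differ in distribution?
kw <- kruskal.test(price_relative ~ carrier, data = demo)
kw$p.value

# One-way ANOVA on the same data, for comparison.
summary(aov(price_relative ~ carrier, data = demo))
```

On the real data the equivalent call would be kruskal.test(PriceRelative ~ Airline, data = airlines).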
The Null Hypothesis for the same will be :
There is no significant change in Price Relative with respect to a change in Aircraft.
aircraft_pr = xtabs(~PriceRelative+Aircraft)
chisq.test(aircraft_pr)
## Warning in chisq.test(aircraft_pr): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: aircraft_pr
## X-squared = 245.44, df = 97, p-value = 7.647e-15
A low p-value (<0.05) indicates that the null hypothesis can be rejected.
The Null Hypothesis for the same will be :
There is no significant change in Price Relative with respect to International/Domestic.
isint_pr = xtabs(~PriceRelative+IsInternational)
# Pass the International/Domestic table; the recorded run passed airline_pr here
chisq.test(isint_pr)
The output recorded for this step was produced from airline_pr, so it only repeated the Airline result (X-squared = 1402.9, df = 485, p-value < 2.2e-16). The test on isint_pr has to be run before this null hypothesis can be rejected or retained.
The Null Hypothesis for the same will be :
There is no significant change in Price Relative with respect to a change in the month of travel.
month_pr = xtabs(~PriceRelative+TravelMonth)
chisq.test(month_pr)
## Warning in chisq.test(month_pr): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: month_pr
## X-squared = 169.32, df = 291, p-value = 1
As the p-value (1) is far above 0.05, we fail to reject the null hypothesis.
From the above data, we can conclude the following :
The premium price is heavily dependent on the economy price. The economy price, in turn, is dependent on the flight duration.
AND
The relative price of a flight (this variable takes into account the change in economy price) is dependent on a number of factors, which include :
When fitting a model for the relative price, we need to account for all the above variables.
form_pr = PriceRelative ~ PitchDifference + WidthDifference + Airline + IsInternational + Aircraft
reg_pr = lm(formula = form_pr, data = airlines)
summary(reg_pr)
##
## Call:
## lm(formula = form_pr, data = airlines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84780 -0.21460 -0.08242 0.11540 1.39717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.268518 0.288101 -0.932 0.351824
## PitchDifference 0.040276 0.064391 0.625 0.531971
## WidthDifference 0.001098 0.082734 0.013 0.989415
## AirlineBritish 0.175432 0.111163 1.578 0.115235
## AirlineDelta 0.188034 0.185262 1.015 0.310673
## AirlineJet 0.559896 0.143294 3.907 0.000108 ***
## AirlineSingapore 0.318518 0.081863 3.891 0.000115 ***
## AirlineVirgin 0.517611 0.107450 4.817 2e-06 ***
## IsInternationalInternational 0.188594 0.245605 0.768 0.442966
## AircraftBoeing 0.080674 0.043792 1.842 0.066105 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3713 on 448 degrees of freedom
## Multiple R-squared: 0.3342, Adjusted R-squared: 0.3208
## F-statistic: 24.99 on 9 and 448 DF, p-value: < 2.2e-16
From the above, we can conclude the following :
+ When all the categorical variables are accounted for, pitch difference and width difference do not have a significant impact on relative price.
+ The coefficient of each categorical variable is measured against the first level of that factor (AirFrance, Domestic, AirBus).
+ With this, we conclude that Jet, Singapore and Virgin differ significantly from the baseline airline in relative pricing.
Note : This model explains only about 32% of the variance in relative price (adjusted R-squared = 0.3208), so I am not sure how good the model is.
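The adjusted R-squared in the summary can be reproduced from the multiple R-squared, the number of observations and the number of predictors, using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). A worked check with the values reported above:

```r
r2 <- 0.3342   # multiple R-squared from the summary
n  <- 458      # observations (448 residual df + 10 estimated coefficients)
p  <- 9        # predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

This gives 0.3208, matching the summary output.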
form_pr = PriceRelative ~ PitchDifference + WidthDifference
reg_pr = lm(formula = form_pr, data = airlines)
summary(reg_pr)
##
## Call:
## lm(formula = form_pr, data = airlines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84163 -0.28484 -0.07241 0.17698 1.18778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.10514 0.08304 -1.266 0.206077
## PitchDifference 0.06019 0.01590 3.785 0.000174 ***
## WidthDifference 0.11621 0.02356 4.933 1.14e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3886 on 455 degrees of freedom
## Multiple R-squared: 0.2593, Adjusted R-squared: 0.2561
## F-statistic: 79.65 on 2 and 455 DF, p-value: < 2.2e-16
With the above model summary, we can conclude the following :