Sameer Mathur
Delhi-Mumbai Airline Pricing Analysis
The following are the two major factors influencing airline ticket pricing:
Seasonality
Advance Booking
Effect of Seasonality on Ticket Prices
The following are the three types of seasonality in this dataset:
Time of departure (AM vs. PM)
Day of the week (Weekend vs. Weekdays)
Special events (e.g. Diwali)
Are morning flights more expensive than evening flights?
Are weekend flights more expensive than weekday flights?
Are flights more expensive on the day before / after Diwali?
BOM-DEL & DEL-BOM - one way airfares
4 Airlines
63 unique flights
Count of flights by airline
# Count of flights by airline
library(data.table)
dt <- data.table(airline.df)
dt[, list(UniqueFlights = length(unique(FlightNumber)), DataCount=.N), by = list(Airline)]
Airline UniqueFlights DataCount
1: Jet 30 144
2: Spice Jet 8 40
3: IndiGo 16 80
4: Air India 9 41
length(unique(airline.df$FlightNumber)) # total unique flights
[1] 63
nrow(dt) #Total Data
[1] 305
library(gplots)
plotmeans(Price ~ MetroArrival, data = airline.df, mean.labels = TRUE,
main = "Average Airfare on All BOM-DEL, DEL-BOM Flights",
xlab = "All Airlines", legends = FALSE)
# scatterplot of airfare (plot.default has no data argument, so index the column directly)
plot(airline.df$Price, type = "p", ylab = "Airfare (INR)",
main = "Scatterplot of Airfare")
abline(h = mean(airline.df$Price), col="red", lwd=3, lty=2)
agg1 <- aggregate(airline.df$Price, by = list(airline.df$Airline), mean)
colnames(agg1) <- c("Airline", "AverageAirfares")
agg1
Airline AverageAirfares
1 Air India 6335.000
2 IndiGo 4879.525
3 Jet 5496.146
4 Spice Jet 5094.850
# Average airfare for all airlines
avg = mean(airline.df$Price)
round(avg,2)
[1] 5394.54
# boxplot of airfares by airline
boxplot(Price ~ Airline, data = airline.df,
main = "Distribution of Airfares by Airlines",
ylab = "Airline", xlab = "Airfare (INR)",
horizontal = TRUE)
Time of departure (AM vs. PM)
# average airfares and unique flights by departure
library(data.table)
dt <- data.table(airline.df)
dt[, list(UniqueFlights = length(unique(FlightNumber)),
DataCount = .N,
Price = round(mean(Price), 2)),
by = list(Departure)]
Departure UniqueFlights DataCount Price
1: AM 34 169 5598.89
2: PM 29 136 5140.61
# boxplot of airfares by Departure Time (AM/PM)
boxplot(Price ~ Departure, data = airline.df,
main = "Distribution of Airfares by Departure Time (AM/PM)",
ylab = "Departure", xlab = "Airfare (INR)",
horizontal = TRUE)
# mean plot of airfares by Departure Time (AM/PM)
library(gplots)
plotmeans(Price ~ Departure, data = airline.df,
main = "Mean plot of Airfares by Departure Time (AM/PM)",
xlab = "Departure", ylab = "Airfare (INR)",
mean.labels = TRUE, frame = TRUE)
Day of the week (Weekend vs. Weekdays)
# average airfares and unique flights by Weekend / Weekday
library(data.table)
dt <- data.table(airline.df)
dt[, list(NoOfFlights = length(unique(FlightNumber)),
Price = round(mean(Price), 2)),
by = list(IsWeekend)]
IsWeekend NoOfFlights Price
1: No 63 5518.55
2: Yes 29 4596.07
# boxplot of airfares by Weekday/Weekend
boxplot(Price ~ IsWeekend, data = airline.df,
main = "Distribution of Airfares by Weekday/Weekend",
ylab = "Weekend", xlab = "Airfare (INR)",
horizontal = TRUE)
# mean plot of airfares by Weekday/Weekend
library(gplots)
plotmeans(Price ~ IsWeekend, data = airline.df,
main = "Average Airfares by Weekday/Weekend",
xlab = "Weekend", ylab = "Airfare (INR)",
mean.labels = TRUE, frame = TRUE)
Holiday/Festival (e.g. Diwali)
# average airfares at Diwali
library(data.table)
dt <- data.table(airline.df)
dt[, list(NoOfFlights = length(unique(FlightNumber)),
Price = round(mean(Price), 2)),
by = list(IsDiwali)]
IsDiwali NoOfFlights Price
1: 1 62 5897.48
2: 0 63 5063.81
# boxplot of airfares by Diwali
boxplot(Price ~ IsDiwali, data = airline.df,
main = "Distribution of Airfares by Diwali",
ylab = "Diwali", xlab = "Airfare (INR)",
horizontal = TRUE)
# mean plot of airfares by Diwali
library(gplots)
plotmeans(Price ~ IsDiwali, data = airline.df,
main = "Mean plot of Airfares by Diwali",
xlab = "Diwali", ylab = "Airfare (INR)",
mean.labels = TRUE, frame = TRUE)
Effect of Advance Booking on Ticket Prices
Tickets are cheaper if you buy them in advance.
Last minute travel is expensive. (But by how much?)
By how much do ticket prices increase for every additional day's delay in purchasing a ticket?
Date of data collection (Sep 8-19, 2018)
Data was collected for flights departing
# scatterplot of airfare against advance booking days (formula interface accepts data)
plot(Price ~ AdvancedBookingDays, data = airline.df,
main = "Scatter Plot of Airfares by Advanced Booking Days",
xlab = "Advanced Booking Days", ylab = "Airfare (INR)")
# preparing subset of data
subData1 <- subset(airline.df, AdvancedBookingDays >= 2 & AdvancedBookingDays < 7)
subData2 <- subset(airline.df, AdvancedBookingDays > 7 & AdvancedBookingDays <= 30)
# merging subsets of data
subData3 <- rbind(subData1, subData2)
# converting AdvancedBookingDays into a factor
subData3$AdvancedBookingDays <- as.factor(subData3$AdvancedBookingDays)
# table of advanced booking days by airline
#table(subData3$Airline, subData3$AdvancedBookingDays)
# average airfares and unique flights by airline and advance booking
library(data.table)
dt <- data.table(subData3)
dt[, list(NoOfFlights = length(unique(FlightNumber)),
Price = round(mean(Price), 2)),
by = list(AdvancedBookingDays, Airline)]
AdvancedBookingDays Airline NoOfFlights Price
1: 2 IndiGo 16 8628.06
2: 2 Jet 30 6259.47
3: 2 Spice Jet 8 6325.75
4: 2 Air India 9 5613.78
5: 30 IndiGo 16 2966.12
6: 30 Jet 27 4129.37
7: 30 Spice Jet 8 4324.00
8: 30 Air India 8 4766.88
interaction.plot(subData3$AdvancedBookingDays, subData3$Airline, subData3$Price,
main = "Interaction Plot of Advance Booking Days and Airline",
xlab = "Advance Booking Days", ylab = "Average Airfare (INR)",
col=c("red","black","green", "blue"),
fixed=TRUE, lwd = 5,
leg.bty = "o")
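The summary table above also gives a rough answer to the "by how much per day" question. The sketch below re-enters the reported 2-day and 30-day average fares by hand (the `fares` data frame and its column names are illustrative, not part of the original analysis) and assumes, purely for illustration, that fares change linearly between the two booking horizons.

```r
# Average fares copied from the summary table (INR); names are illustrative
fares <- data.frame(
  Airline = c("IndiGo", "Jet", "Spice Jet", "Air India"),
  Price2  = c(8628.06, 6259.47, 6325.75, 5613.78),  # booked 2 days ahead
  Price30 = c(2966.12, 4129.37, 4324.00, 4766.88)   # booked 30 days ahead
)

# Average fare increase per day of delay, assuming a linear change
fares$PerDayIncrease <- (fares$Price2 - fares$Price30) / (30 - 2)

# Airlines sorted from steepest to flattest increase
fares[order(-fares$PerDayIncrease), c("Airline", "PerDayIncrease")]
```

Under this linear assumption, IndiGo's average fare rises by roughly INR 200 per day of delay, against roughly INR 30 per day for Air India. A regression on the full dataset would be needed to check whether the relationship is actually linear.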
Many additional factors are also likely to influence airline ticket prices. Our dataset includes the following additional variables:
DEL-BOM vs. BOM-DEL
Flying Time
Seating Quality
# Average Airfare by Departure City (DEL or BOM)
library(gplots)
plotmeans(Price ~ DepartureCityCode, data = airline.df,
main = "Mean plot of Airfares by Departure City (New Delhi / Mumbai)",
xlab = "Departure City", ylab = "Airfare (INR)",
mean.labels = TRUE, frame = TRUE)
# descriptive statistics of Flying Time
library(psych)
describe(airline.df$FlyingMinutes)[, c(2:5, 8:9)]
n mean sd median min max
X1 305 136.03 4.71 135 125 145
# scatterplot of flying time
plot(airline.df$FlyingMinutes, type = "p", ylab = "Flying Time (minutes)",
main = "Scatterplot of Flying Time")
abline(h = mean(airline.df$FlyingMinutes), col = "blue", lwd=3, lty=2)
library(psych)
# descriptive statistics of SeatPitch
describe(airline.df$SeatPitch)[, c(2:5, 8:9)]
n mean sd median min max
X1 305 30.26 0.93 30 29 33
# descriptive statistics of SeatWidth
describe(airline.df$SeatWidth)[, c(2:5, 8:9)]
n mean sd median min max
X1 305 17.41 0.49 17 17 18
expVar <- airline.df[c("Price", "AdvancedBookingDays", "FlyingMinutes", "Capacity", "SeatPitch", "SeatWidth")]
round(cor(expVar), 2)
                    Price AdvancedBookingDays FlyingMinutes Capacity SeatPitch SeatWidth
Price                1.00               -0.01         -0.02    -0.03      0.07     -0.06
AdvancedBookingDays -0.01                1.00          0.01    -0.01     -0.01      0.05
FlyingMinutes       -0.02                0.01          1.00    -0.32     -0.03     -0.18
Capacity            -0.03               -0.01         -0.32     1.00      0.51      0.45
SeatPitch            0.07               -0.01         -0.03     0.51      1.00      0.32
SeatWidth           -0.06                0.05         -0.18     0.45      0.32      1.00
expVar <- airline.df[c("Price", "AdvancedBookingDays", "FlyingMinutes", "Capacity", "SeatPitch", "SeatWidth")]
library(Hmisc)
rcorr(as.matrix(expVar))
                    Price AdvancedBookingDays FlyingMinutes Capacity SeatPitch SeatWidth
Price                1.00               -0.01         -0.02    -0.03      0.07     -0.06
AdvancedBookingDays -0.01                1.00          0.01    -0.01     -0.01      0.05
FlyingMinutes       -0.02                0.01          1.00    -0.32     -0.03     -0.18
Capacity            -0.03               -0.01         -0.32     1.00      0.51      0.45
SeatPitch            0.07               -0.01         -0.03     0.51      1.00      0.32
SeatWidth           -0.06                0.05         -0.18     0.45      0.32      1.00
n= 305
P
                     Price AdvancedBookingDays FlyingMinutes Capacity SeatPitch SeatWidth
Price                                   0.8732        0.7513   0.6513    0.1942    0.2998
AdvancedBookingDays 0.8732                            0.9292   0.8781    0.8052    0.3411
FlyingMinutes       0.7513              0.9292                 0.0000    0.5521    0.0013
Capacity            0.6513              0.8781        0.0000             0.0000    0.0000
SeatPitch           0.1942              0.8052        0.5521   0.0000              0.0000
SeatWidth           0.2998              0.3411        0.0013   0.0000    0.0000
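The P matrix gives, for each pair of variables, the p-value of a test that the true correlation is zero. As a hedged sketch of where such values come from, the helper below (`cor_p_value` is a hypothetical name, not an Hmisc function) applies the standard t-transformation of a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. It is fed the correlations rounded to two decimals, so it can only approximate the printed p-values.

```r
# Two-sided p-value for H0: true correlation = 0 (hypothetical helper)
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 305
cor_p_value(-0.32, n)  # FlyingMinutes vs Capacity: effectively zero
cor_p_value(-0.01, n)  # Price vs AdvancedBookingDays: far from significant
```

Read this way, the matrix shows that none of the listed variables has a statistically detectable linear association with Price at alpha = 0.05, while Capacity, SeatPitch and SeatWidth are clearly correlated with one another.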
library(corrgram)
corrgram(expVar, order=TRUE, lower.panel=panel.conf,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram")
expVar <- airline.df[c("Price", "AdvancedBookingDays", "FlyingMinutes", "Capacity", "SeatPitch", "SeatWidth")]
library("PerformanceAnalytics")
chart.Correlation(expVar, histogram = TRUE, pch=19)
One-sample t-test
# subset of data having only Mumbai to Delhi (BOM-DEL) flights
depBOMData <- subset(airline.df, DepartureCityCode == "BOM")
# one-sample t-test
t.test(depBOMData$Price, mu = 5000)
One Sample t-test
data: depBOMData$Price
t = 6.0784, df = 129, p-value = 1.277e-08
alternative hypothesis: true mean is not equal to 5000
95 percent confidence interval:
5844.506 6659.601
sample estimates:
mean of x
6252.054
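The p-value of the test is 1.277e-08, which is less than the significance level alpha = 0.05, so we reject the null hypothesis that the mean fare equals INR 5000. Because the sample mean (INR 6252.05) and the entire 95% confidence interval (INR 5844.51 to 6659.60) lie above 5000, we can conclude that the average ticket price of Mumbai to Delhi flights is greater than INR 5000.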
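As a cross-check, the t statistic can be rebuilt from the printed output alone. The sketch below recovers the standard error from the 95% confidence interval and recomputes t = (mean - mu) / SE. It also computes the one-sided p-value, which is what a directional claim ("greater than INR 5000") strictly calls for. All variable names are illustrative.

```r
# Values copied from the one-sample t-test output
n    <- 130                    # df + 1
xbar <- 6252.054               # sample mean (INR)
ci   <- c(5844.506, 6659.601)  # two-sided 95% confidence interval
mu0  <- 5000                   # hypothesised mean

# Half-width of the interval = qt(0.975, df) * SE, so solve for SE
se <- (ci[2] - ci[1]) / (2 * qt(0.975, df = n - 1))

t_stat <- (xbar - mu0) / se                          # close to the reported 6.0784
p_two  <- 2 * pt(-abs(t_stat), df = n - 1)           # two-sided, as in the output
p_one  <- pt(t_stat, df = n - 1, lower.tail = FALSE) # one-sided: mean > mu0

c(t = t_stat, p_two_sided = p_two, p_one_sided = p_one)
```

The same one-sided test can be requested directly with `t.test(depBOMData$Price, mu = 5000, alternative = "greater")`.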
Independent t-test
# boxplot of departure
boxplot(Price ~ Departure, data = airline.df,
horizontal = TRUE)
Do the two populations have the same variances?
varTest1 <- var.test(Price ~ Departure, data = airline.df)
varTest1
F test to compare two variances
data: Price by Departure
F = 2.0941, num df = 168, denom df = 135, p-value = 1.074e-05
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.513842 2.880244
sample estimates:
ratio of variances
2.094081
# independent t-test (AM-PM effect)
t.test(Price ~ Departure, data = airline.df, var.equal = FALSE, alternative = "greater")
Welch Two Sample t-test
data: Price by Departure
t = 1.736, df = 296.58, p-value = 0.0418
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
22.71262 Inf
sample estimates:
mean in group AM mean in group PM
5598.893 5140.610
The p-value of the test is 0.0418, which is less than the significance level alpha = 0.05. We can conclude that ticket prices of morning (AM) flights are higher on average than those of PM flights.
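The workflow used here and in the comparisons that follow is: run an F test on the two variances, then use the Welch test (`var.equal = FALSE`) if the variances differ and the pooled test (`var.equal = TRUE`) otherwise. A minimal sketch of that decision as a helper follows; `compare_means` is a hypothetical name, and the data are simulated only so the example runs on its own. They are not drawn from the airline dataset.

```r
# Hypothetical helper: choose pooled vs. Welch t-test from an F test of variances
compare_means <- function(formula, data, alternative = "two.sided", alpha = 0.05) {
  # F test for equality of the two group variances
  equal_var <- var.test(formula, data = data)$p.value >= alpha
  # Pooled t-test if the variances look equal, Welch t-test otherwise
  t.test(formula, data = data, var.equal = equal_var, alternative = alternative)
}

# Simulated fares with unequal spread (illustration only)
set.seed(1)
sim <- data.frame(
  Departure = rep(c("AM", "PM"), times = c(169, 136)),
  Price = c(rnorm(169, mean = 5600, sd = 2400),
            rnorm(136, mean = 5100, sd = 1200))
)

res <- compare_means(Price ~ Departure, data = sim, alternative = "greater")
res$method  # reports which variant of the test was run
```

On the real data the call would be `compare_means(Price ~ Departure, data = airline.df, alternative = "greater")`. Choosing the test from a preliminary F test is the approach taken in this deck; running Welch's test by default is a common alternative.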
Independent t-test
# boxplot of Diwali
boxplot(Price ~ IsDiwali, data = airline.df,
horizontal = TRUE)
Do the two populations have the same variances?
varTest2 <- var.test(Price ~ IsDiwali, data = airline.df)
varTest2
F test to compare two variances
data: Price by IsDiwali
F = 0.87328, num df = 183, denom df = 120, p-value = 0.4069
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.626232 1.204170
sample estimates:
ratio of variances
0.8732808
# independent t-test (Diwali effect)
t.test(Price ~ IsDiwali, data = airline.df, var.equal = TRUE, alternative = "less")
Two Sample t-test
data: Price by IsDiwali
t = -3.022, df = 303, p-value = 0.001363
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -378.5134
sample estimates:
mean in group 0 mean in group 1
5063.810 5897.479
The p-value of the test is 0.001363, which is less than the significance level alpha = 0.05. We can conclude that airlines charge higher fares around Diwali than on non-Diwali dates.
Independent t-test
# boxplot of Airfare versus Weekday/Weekend
boxplot(Price ~ IsWeekend, data = airline.df,
horizontal = TRUE)
Do the two populations have the same variances?
varTest2 <- var.test(Price ~ IsWeekend, data = airline.df)
varTest2
F test to compare two variances
data: Price by IsWeekend
F = 2.9781, num df = 263, denom df = 40, p-value = 9.145e-05
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.774635 4.596324
sample estimates:
ratio of variances
2.978065
# independent t-test (Weekend effect)
t.test(Price ~ IsWeekend, data = airline.df, var.equal = FALSE, alternative = "greater")
Welch Two Sample t-test
data: Price by IsWeekend
t = 3.3951, df = 82.861, p-value = 0.0005276
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
470.5005 Inf
sample estimates:
mean in group No mean in group Yes
5518.549 4596.073
The p-value of the test is 0.0005276, which is less than the significance level alpha = 0.05. We can conclude that airlines charge higher fares on weekdays than on weekends.
Independent t-test
# subset of data having only Air India and IndiGo airline
subAirline <- subset(airline.df, Airline %in% c("Air India", "IndiGo"))
# boxplot of airline
boxplot(Price ~ Airline, data = subAirline,
horizontal = TRUE)
Do the two populations have the same variances?
# subset of data having only Air India and IndiGo airline
subAirline <- subset(airline.df, Airline %in% c("Air India", "IndiGo"))
varTest3 <- var.test(Price ~ Airline, data = subAirline)
varTest3
F test to compare two variances
data: Price by Airline
F = 0.83011, num df = 40, denom df = 79, p-value = 0.5233
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.4936707 1.4658591
sample estimates:
ratio of variances
0.8301099
# subset of data having only Air India and IndiGo airline
subAirline <- subset(airline.df, Airline %in% c("Air India", "IndiGo"))
# independent t-test (Airline effect)
t.test(subAirline$Price ~ Airline, data = subAirline, var.equal = TRUE, alternative = "greater")
Two Sample t-test
data: subAirline$Price by Airline
t = 2.6396, df = 119, p-value = 0.004705
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
541.4039 Inf
sample estimates:
mean in group Air India mean in group IndiGo
6335.000 4879.525
The p-value of the test is 0.004705, which is less than the significance level alpha = 0.05. We can conclude that ticket prices on Air India are higher on average than those on IndiGo.