This note investigates how airline web visits are related to airline passenger numbers. At this point, our goal is not causal inference from web visits to passengers, but merely to investigate whether web visits help predict passengers.
Our data set consists of 41 leading airlines, including the top 10 in the world and most of Europe’s low cost carriers (LCC). See the appendix for a detailed list of airlines covered.
This document is designed to be read in conjunction with the blog post Digital Airlines: Online traffic predicts 75% of passenger volume. We would be grateful for any feedback, and please get in touch if you would like to receive the raw data for your own analysis: contact details here.
For each airline, we observe the 2014 passenger count from official airline sources or industry publications. Web traffic data comes from SimilarWeb - we annualise the monthly visitor series.
Descriptive Stats: As one would expect, airline size is very heterogeneous. This is clear from the summary statistics, and leads us to adopt a log-log specification:
kable(summary(dds[,c("annualTraffic", "Passengers14")]))
| annualTraffic | Passengers14 |
|---|---|
| Min. : 2800032 | Min. : 1600000 |
| 1st Qu.: 11460738 | 1st Qu.: 7700000 |
| Median : 36634603 | Median : 17270000 |
| Mean : 69706578 | Mean : 30142195 |
| 3rd Qu.:103773833 | 3rd Qu.: 38000000 |
| Max. :334453198 | Max. :135770000 |
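One way to write the log-log specification referred to above is sketched here. The symbols $\alpha$, $\beta$ and $\varepsilon_i$ are notation introduced for illustration only; the variable names are those of the data set, with $i$ indexing airlines.

```latex
\log(\text{annualTraffic}_i) = \alpha + \beta \,\log(\text{Passengers14}_i) + \varepsilon_i
```

In this form $\beta$ can be read as an elasticity: a 1% difference in passenger numbers is associated with roughly a $\beta$% difference in web traffic.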
Using ordinary least squares (OLS), we estimate a bivariate regression of log airline web traffic on log passenger numbers:
reg<-lm(log(dds$annualTraffic) ~ log(dds$Passengers14), data=dds)
summary(reg);
##
## Call:
## lm(formula = log(dds$annualTraffic) ~ log(dds$Passengers14),
## data = dds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.42137 -0.35914 0.07874 0.41187 1.61634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4869 1.7667 0.276 0.784
## log(dds$Passengers14) 1.0095 0.1059 9.536 9.66e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7587 on 39 degrees of freedom
## Multiple R-squared: 0.6998, Adjusted R-squared: 0.6921
## F-statistic: 90.93 on 1 and 39 DF, p-value: 9.662e-12
# elasticity: online traffic - airline traffic not significantly different from 1
linearHypothesis(reg, c("log(dds$Passengers14) = 1"))
## Linear hypothesis test
##
## Hypothesis:
## log(dds$Passengers14) = 1
##
## Model 1: restricted model
## Model 2: log(dds$annualTraffic) ~ log(dds$Passengers14)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 40 22.455
## 2 39 22.450 1 0.004622 0.008 0.9291
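As a rough cross-check on the linear hypothesis test above, the F statistic can be approximately reproduced from the reported coefficient and its standard error. This is only a sketch: the variable names are made up for illustration, and the inputs are the rounded values printed in the regression summary, so the result should come out close to (not exactly equal to) the reported F of 0.008 and p-value of about 0.93.

```r
# Values copied (rounded) from the regression summary above
beta_hat <- 1.0095   # estimated coefficient on log(Passengers14)
se_beta  <- 0.1059   # its standard error
df_resid <- 39       # residual degrees of freedom

# t statistic for H0: coefficient = 1
t_stat <- (beta_hat - 1) / se_beta

# With a single restriction, the F statistic equals the squared t statistic
f_stat <- t_stat^2

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

print(round(c(t = t_stat, F = f_stat, p = p_value), 4))
```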
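The estimates show an elasticity of web traffic with respect to passenger numbers of about 1.01, which is not significantly different from 1 (p = 0.93). Passenger numbers explain about 70% of the variation in log web traffic.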
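In view of the relatively small sample size, we worry that heteroscedasticity or autocorrelation may make our test statistics unreliable. Hence we conduct a Breusch-Pagan test for the former and Durbin-Watson tests for the latter. The Breusch-Pagan test and the Durbin-Watson test with observations ordered by passenger numbers find no evidence against the null, so the t statistics should be reliable. With observations ordered by web traffic, the Durbin-Watson test does reject (p = 0.007); this ordering sorts on the dependent variable, which can by itself induce positive correlation between neighbouring residuals, so that result should be read with caution: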
# bptest: no evidence for heteroscedasticity
bptest(reg);
##
## studentized Breusch-Pagan test
##
## data: reg
## BP = 0.33482, df = 1, p-value = 0.5628
# dwtest: no evidence for autocorrelation
dwtest(reg, order.by = ~dds$Passengers14, alternative="two.sided");
##
## Durbin-Watson test
##
## data: reg
## DW = 2.2765, p-value = 0.465
## alternative hypothesis: true autocorrelation is not 0
dwtest(reg, order.by = ~dds$annualTraffic, alternative="two.sided");
##
## Durbin-Watson test
##
## data: reg
## DW = 1.2374, p-value = 0.006962
## alternative hypothesis: true autocorrelation is not 0
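The two Durbin-Watson results above differ only in how the observations are ordered. A minimal sketch of why ordering matters: the statistic is the sum of squared differences between neighbouring residuals divided by the sum of squared residuals. The helper `dw_stat` and the vector `toy` are hypothetical and purely illustrative; the values are not taken from the airline data.

```r
# Hypothetical helper: Durbin-Watson statistic for a vector of residuals,
# taken in the order given
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

# Toy residuals (illustrative values only, not from the airline data)
toy <- c(1, -1, 1, -1)

dw_alternating <- dw_stat(toy)        # neighbours always differ in sign
dw_sorted      <- dw_stat(sort(toy))  # same values, sorted

print(c(alternating = dw_alternating, sorted = dw_sorted))
```

The same four values give a statistic above 2 in one order and below 2 in the other, which is why the choice of `order.by` in `dwtest` can change the conclusion.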
Defining “digital success” as the residual of the regression, we can identify the “top 10%” of the sample. These airlines are:
dds$residuals <- reg$residuals;
top4<-head(dds[with(dds, order(-residuals)),c("ICAO_Code", "residuals")], n=4);
top4$percentageAhead <- round((exp(top4$residuals)-1)*100);
print(top4);
## ICAO_Code residuals percentageAhead
## 44 VRD 1.6163370 403
## 47 WZZ 1.4660407 333
## 45 VOE 0.9509000 159
## 8 BTI 0.7461732 111
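The `percentageAhead` column follows from the log specification. As a sketch, with $y_i$ denoting an airline's annual web traffic and $\hat{y}_i$ the traffic implied by the fitted regression line (notation introduced here for illustration):

```latex
\hat{\varepsilon}_i = \log y_i - \log \hat{y}_i
\quad\Longrightarrow\quad
\frac{y_i}{\hat{y}_i} = e^{\hat{\varepsilon}_i},
\qquad
\text{percentageAhead}_i = 100\,\bigl(e^{\hat{\varepsilon}_i} - 1\bigr)
```

For VRD, for example, $100\,(e^{1.6163} - 1) \approx 403$, i.e. web traffic roughly five times what the regression line predicts for an airline of its size.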
We explore the unique features of each of these airlines in the companion blog post.
Last but not least, a picture:
# highlight the top 4 in plot
dds$featured <- is.element(dds$ICAO_Code,top4$ICAO_Code);
dds$featureLabel<-'';
dds[which(dds$featured),]$featureLabel <- as.character(dds[which(dds$featured),]$ICAO_Code);
# pretty ggplot2 of the raw data and regression line
#png("out/online-tariff-and-airline-passengers.png", width = 8, height = 6, units = 'in', res = 300);
qplot(log(Passengers14), log(annualTraffic), label = featureLabel, data = dds, color = as.factor(featured), size = I(5), geom="jitter") +
geom_abline(intercept = coef(reg)[[1]], slope = coef(reg)[[2]], size=0.75, colour="#6184DB") + xlab("Annual Passenger Volume") +
ylab("Annual Website Visits") +
labs(title = "Online Traffic and Airline Passengers (log scale)") +
theme(panel.background = element_rect(fill = "white")) + theme(panel.grid.major = element_line(colour = "grey", size = 0.1)) +theme(panel.grid.minor = element_blank()) +
theme(legend.key = element_rect(fill = "white")) +
scale_colour_manual(values=c( "#50C5B7", "#533A71"), name = "") +
geom_text(vjust=-1, color="#533A71", size=5) +
theme(legend.position = "none") +
scale_x_continuous(breaks=c(log(5*10^6),log(2.5*10^7), log(1.25*10^8)), labels=c("5 mln", "25 mln", "125 mln")) +
scale_y_continuous(breaks=c(log(5*10^6),log(2.5*10^7), log(1.25*10^8)), labels=c("5 mln", "25 mln", "125 mln"));
#dev.off()
Appendix: Airlines covered
## [1] "Aeroflot" "Air Astana"
## [3] "Air Berlin" "Air Canada"
## [5] "Air Europa" "AirBaltic"
## [7] "Alitalia" "American Airlines"
## [9] "Austrian" "Belavia"
## [11] "British Airways" "Delta"
## [13] "easyJet" "Emirates"
## [15] "flyBe" "FlyThomasCook"
## [17] "Germanwings" "Hop!"
## [19] "Iberia" "Iberia Express"
## [21] "Jet2" "KLM"
## [23] "Lufthansa" "Monarch"
## [25] "Norwegian Air Shuttle" "Pegasus"
## [27] "Qantas" "Ryanair"
## [29] "SAS" "Southwest"
## [31] "Swiss" "Thomson"
## [33] "Transavia" "Tuifly"
## [35] "Turkish Airlines" "United"
## [37] "Virgin America" "Volotea"
## [39] "Vueling" "Wizzair"
## [41] "Polskie Linie Lotnicze LOT"