Introduction

This note investigates how airline web visits are related to airline passenger numbers. At this point, our goal is not causal inference from web visits to passengers, but merely to investigate whether web visits help predict passengers.

Our data set consists of 41 leading airlines, including the top 10 in the world and most of Europe’s low cost carriers (LCC). See the appendix for a detailed list of airlines covered.

This document is designed to be read in conjunction with the blog post Digital Airlines: Online traffic predicts 75% of passenger volume. We would be grateful for any feedback, and please get in touch if you would like to receive the raw data for your own analysis: contact details here.

The dataset

For each airline, we observe the 2014 passenger count from official airline sources or industry publications. Web traffic data comes from SimilarWeb; we annualise the monthly visitor series.
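The note does not spell out how the monthly series is annualised. A minimal sketch, assuming it means summing twelve monthly totals per airline, might look like this; the `monthly` data frame, its column names and its values are invented for illustration and are not SimilarWeb data.

```r
# Hypothetical monthly visit counts for two made-up airlines (not SimilarWeb data)
monthly <- data.frame(
  airline = rep(c("AAA", "BBB"), each = 12),
  month   = rep(1:12, times = 2),
  visits  = c(rep(1000000, 12), seq(200000, 310000, by = 10000))
)

# Assumption: "annualise" means summing the twelve monthly totals per airline
annual <- aggregate(visits ~ airline, data = monthly, FUN = sum)
names(annual)[2] <- "annualTraffic"
print(annual)
```

If fewer than twelve months were available, scaling the monthly mean by 12 would be a natural alternative.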

Descriptive Stats: As one would expect, airline size is very heterogeneous. This is clear from the summary statistics, and leads us to adopt a log-log specification:

kable(summary(dds[,c("annualTraffic", "Passengers14")]))

	annualTraffic	Passengers14
	Min. : 2800032	Min. : 1600000
	1st Qu.: 11460738	1st Qu.: 7700000
	Median : 36634603	Median : 17270000
	Mean : 69706578	Mean : 30142195
	3rd Qu.:103773833	3rd Qu.: 38000000
	Max. :334453198	Max. :135770000
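
In both columns the mean lies far above the median (for example 69.7 million against 36.6 million annual visits), which indicates a strongly right-skewed distribution. For reference, the log-log specification can be written as below; the symbols \alpha, \beta and \varepsilon_i are notation introduced here and do not appear in the original code.

```latex
\[
\log(\mathrm{annualTraffic}_i) = \alpha + \beta \, \log(\mathrm{Passengers14}_i) + \varepsilon_i
\]
```

In this form \beta is an elasticity: a 1% difference in passenger numbers is associated with a difference of roughly \beta% in web visits.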

Regression Model

We estimate, using ordinary least squares (OLS), a bivariate regression of log airline web traffic on log passenger numbers:

reg<-lm(log(dds$annualTraffic) ~ log(dds$Passengers14), data=dds)
summary(reg);

## 
## Call:
## lm(formula = log(dds$annualTraffic) ~ log(dds$Passengers14), 
##     data = dds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.42137 -0.35914  0.07874  0.41187  1.61634 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.4869     1.7667   0.276    0.784    
## log(dds$Passengers14)   1.0095     0.1059   9.536 9.66e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7587 on 39 degrees of freedom
## Multiple R-squared:  0.6998, Adjusted R-squared:  0.6921 
## F-statistic: 90.93 on 1 and 39 DF,  p-value: 9.662e-12

# elasticity: online traffic - airline traffic not significantly different from 1
linearHypothesis(reg, c("log(dds$Passengers14) = 1"))

## Linear hypothesis test
## 
## Hypothesis:
## log(dds$Passengers14) = 1
## 
## Model 1: restricted model
## Model 2: log(dds$annualTraffic) ~ log(dds$Passengers14)
## 
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     40 22.455                          
## 2     39 22.450  1  0.004622 0.008 0.9291

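The estimates show:

about 70% of the variation in log web visits is explained by log passenger numbers (R^2 = 0.70); in a bivariate regression the R^2 is the same in either direction
the elasticity of web visits with respect to passengers (estimated at 1.01) is not significantly different from unity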
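
As a check on the unit-slope test, the F statistic can be recomputed by hand from the figures printed in the output above. This is only a sketch using the rounded values as reported, so small rounding differences are expected; the variable names are ours.

```r
# Figures copied from the regression and linearHypothesis output above
rss_unrestricted <- 22.450    # residual sum of squares, slope estimated freely
df_resid         <- 39        # residual degrees of freedom
sum_sq           <- 0.004622  # increase in RSS when the slope is fixed at 1

# F statistic for a single linear restriction
f_stat <- (sum_sq / 1) / (rss_unrestricted / df_resid)
p_val  <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

# Equivalent t statistic built from the coefficient table
t_stat <- (1.0095 - 1) / 0.1059

print(round(c(F = f_stat, p = p_val, t_squared = t_stat^2), 4))
```

With one restriction the F statistic equals the squared t statistic, which is why the two routes agree up to rounding.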

Diagnostic Tests

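In view of the relatively small sample size, we worry that heteroscedasticity or autocorrelation may make our test statistics unreliable. Hence we conduct a Breusch-Pagan test for the former and Durbin-Watson tests for the latter. The Breusch-Pagan test shows no evidence of heteroscedasticity. The Durbin-Watson result depends on how the observations are ordered: there is no evidence of autocorrelation when ordering by passengers, but the test rejects at the 1% level when ordering by web traffic (p = 0.007). Since web traffic is the dependent variable, that ordering also tends to sort the residuals, so we place more weight on the first ordering and treat the t statistics as reasonably reliable: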

# bptest: no evidence for heteroscedasticity
bptest(reg);

## 
##  studentized Breusch-Pagan test
## 
## data:  reg
## BP = 0.33482, df = 1, p-value = 0.5628

# dwtest: no evidence for autocorrelation
dwtest(reg, order.by = ~dds$Passengers14, alternative="two.sided");

## 
##  Durbin-Watson test
## 
## data:  reg
## DW = 2.2765, p-value = 0.465
## alternative hypothesis: true autocorrelation is not 0

dwtest(reg, order.by = ~dds$annualTraffic, alternative="two.sided");

## 
##  Durbin-Watson test
## 
## data:  reg
## DW = 1.2374, p-value = 0.006962
## alternative hypothesis: true autocorrelation is not 0
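
The two Durbin-Watson results differ only in how the observations are ordered. The sketch below uses simulated data, not the airline sample, and computes the statistic by hand in base R to illustrate why ordering by the dependent variable can pull it below 2 even when the errors are independent. The sample size, coefficients and the helper `dw_stat` are assumptions chosen for illustration.

```r
# Durbin-Watson statistic computed by hand from an ordered residual vector
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(1)                        # simulated data, NOT the airline sample
n <- 2000                          # large n so the illustration is stable
x <- rnorm(n, mean = 16.5, sd = 1.1)
y <- 0.5 + 1.0 * x + rnorm(n, sd = 0.76)   # independent errors by construction
e <- residuals(lm(y ~ x))

dw_by_x <- dw_stat(e[order(x)])    # ordered by the regressor
dw_by_y <- dw_stat(e[order(y)])    # ordered by the dependent variable

print(round(c(by_regressor = dw_by_x, by_dependent = dw_by_y), 3))
```

Residuals are correlated with the dependent variable by construction, so sorting on it places similar residuals next to each other. That mimics positive autocorrelation without any dependence in the underlying errors.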

Most Successful Airlines in terms of Digital Traffic

Defining “digital success” as the residual of the regression, we can identify the “top 10%” of the sample. These airlines are:

dds$residuals <- reg$residuals;
top4<-head(dds[with(dds, order(-residuals)),c("ICAO_Code", "residuals")], n=4);
top4$percentageAhead <- round((exp(top4$residuals)-1)*100);
print(top4);

##    ICAO_Code residuals percentageAhead
## 44       VRD 1.6163370             403
## 47       WZZ 1.4660407             333
## 45       VOE 0.9509000             159
## 8        BTI 0.7461732             111

We explore the unique features of each of these airlines in the companion blog post.
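
To make the percentageAhead column concrete: because the model is in logs, a residual r means actual web traffic is exp(r) times the fitted value. The short sketch below recomputes the column from the residuals printed above; the vector name `res` is ours.

```r
# Residuals copied from the printed table above
res <- c(VRD = 1.6163370, WZZ = 1.4660407, VOE = 0.9509000, BTI = 0.7461732)

# A log residual r means actual traffic is exp(r) times the fitted value
ratio <- exp(res)

# Percentage by which actual traffic exceeds the fitted value
percentage_ahead <- round((ratio - 1) * 100)
print(percentage_ahead)
```

For VRD, for instance, actual web traffic is about five times the level the regression predicts from its passenger count, which is the 403% shown in the table.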

Last but not least, a picture:

Plot: Airline Traffic - on the web and in the plane

# highlight the top 4 in plot
dds$featured <- is.element(dds$ICAO_Code,top4$ICAO_Code);
dds$featureLabel<-'';
dds[which(dds$featured),]$featureLabel <- as.character(dds[which(dds$featured),]$ICAO_Code);

# pretty ggplot2 of the raw data and regression line
#png("out/online-tariff-and-airline-passengers.png", width = 8, height = 6, units = 'in', res = 300);
qplot(log(Passengers14), log(annualTraffic), label = featureLabel, data = dds, color = as.factor(featured), size = I(5), geom="jitter") +
  geom_abline(intercept = coef(reg)[[1]], slope = coef(reg)[[2]], size=0.75, colour="#6184DB") + xlab("Annual Passenger Volume") +
  ylab("Annual Website Visits") +
  labs(title = "Online Traffic and Airline Passengers (log scale)") +
  theme(panel.background = element_rect(fill = "white"))  + theme(panel.grid.major = element_line(colour = "grey", size = 0.1)) +theme(panel.grid.minor = element_blank()) +
  theme(legend.key = element_rect(fill = "white")) +
  scale_colour_manual(values=c( "#50C5B7", "#533A71"), name = "") +
  geom_text(vjust=-1,  color="#533A71", size=5) +
  theme(legend.position = "none")  +
  scale_x_continuous(breaks=c(log(5*10^6),log(2.5*10^7), log(1.25*10^8)), labels=c("5 mln", "25 mln", "125 mln")) + 
  scale_y_continuous(breaks=c(log(5*10^6),log(2.5*10^7), log(1.25*10^8)), labels=c("5 mln", "25 mln", "125 mln"));

#dev.off()

Appendix: List of Airlines in the Sample

##  [1] "Aeroflot"                   "Air Astana"                
##  [3] "Air Berlin"                 "Air Canada"                
##  [5] "Air Europa"                 "AirBaltic"                 
##  [7] "Alitalia"                   "American Airlines"         
##  [9] "Austrian"                   "Belavia"                   
## [11] "British Airways"            "Delta"                     
## [13] "easyJet"                    "Emirates"                  
## [15] "flyBe"                      "FlyThomasCook"             
## [17] "Germanwings"                "Hop!"                      
## [19] "Iberia"                     "Iberia Express"            
## [21] "Jet2"                       "KLM"                       
## [23] "Lufthansa"                  "Monarch"                   
## [25] "Norwegian Air Shuttle"      "Pegasus"                   
## [27] "Qantas"                     "Ryanair"                   
## [29] "SAS"                        "Southwest"                 
## [31] "Swiss"                      "Thomson"                   
## [33] "Transavia"                  "Tuifly"                    
## [35] "Turkish Airlines"           "United"                    
## [37] "Virgin America"             "Volotea"                   
## [39] "Vueling"                    "Wizzair"                   
## [41] "Polskie Linie Lotnicze LOT"

How airline web visits are related with airline passengers

Hinnerk Gnutzmann

August 3rd, 2015