The dataset provided comes from the Hungarian Central Statistical Office and it is about the flights to and from the Budapest Ferenc Liszt International Airport between 2007 and 2012. In this project I have interest solely in the monthly number of passengers coming to Budapest from different countries on scheduled flights and the available number of seats. My aim is to gain an insight into the monthly variation of efficiency of ticket sales from different countries. I defined efficiency as the number of passengers/seats*100 [%]. Since the total number of passengers during the whole period varies a lot, the sales efficiency from top countries is the most crucial to study and improve at specific times. I have a detailed look at the top 5 countries.
Before I start to group the data, I clean the environment, load the used R packages, set my working directory to where the “BUD flights 2007-2012 v2.xlsx” file is placed. With the use of readxl package I load the data into memory and keep only the “Incoming” and “Scheduled” lines where the number of seats higher than 0. Since more cities are available from Budapest in some countries, I summarize the data for date (month) and country. I rename the columns for easier use and then print the first 10 lines of the dataframe.
rm(list=ls())
library(readxl)
require(dplyr)
library(lubridate)
library(ggplot2)
library(animation)
require(graphics)
setwd("C:/Data Science/SatRday Challenge")
BUDdata <- read_excel("BUD flights 2007-2012 v2.xlsx")
BUDdata <- BUDdata[grep("Incoming", BUDdata$`FLIGH DIRECTION`),]
BUDdata <- BUDdata[which(BUDdata$`FLIGHT TYPE` == "Scheduled"),]
BUDdata <- filter(BUDdata, BUDdata$`SEAT CAPACITY`>0)
BUDdata <- group_by(BUDdata, DATE, COUNTRY)
BUDdata <- summarise(BUDdata, sum(`NBR OF PASSENGERS`), sum(`SEAT CAPACITY`))
names(BUDdata) <- c("DATE", "COUNTRY", "PASSENGERS", "SEATS")
BUDdata[1:10,]
## Source: local data frame [10 x 4]
## Groups: DATE [1]
##
## DATE COUNTRY PASSENGERS SEATS
## (time) (chr) (dbl) (dbl)
## 1 2007-01-01 Albania 2448 4024
## 2 2007-01-01 Austria 2530 5684
## 3 2007-01-01 Azerbaijan 0 2365
## 4 2007-01-01 Belgium 7070 13271
## 5 2007-01-01 Bosnia-Herczegovina 1017 1524
## 6 2007-01-01 Bulgaria 3682 5875
## 7 2007-01-01 Canada 1077 1850
## 8 2007-01-01 China 1252 2205
## 9 2007-01-01 Croatia 1048 2908
## 10 2007-01-01 Cyprus 1562 2650
Later in the analysis I will need to know the names of the countries with the highest number of total incoming passengers. (Ideally I would select these countries considering the statistical distribution of the data, but this is not the aim of this work and my free time is short.) I decided to focus on the top 5 countries from the total 56. The code below prepares this character vector (top5) by grouping the data for countries, summing all passengers for each of them, arranging them in descending order, selecting the first 5 lines and finally transforming the data into a vector and printing it.
totalBUDdata <- group_by(BUDdata, COUNTRY)
totalBUDdata <- summarise(totalBUDdata, sum(PASSENGERS))
orderedTotalBUDdata <- arrange(totalBUDdata, desc(`sum(PASSENGERS)`))
top5 <- orderedTotalBUDdata[1:5,1]
top5 <- top5[['COUNTRY']]
top5
## [1] "Germany" "United Kingdom" "Italy" "France"
## [5] "The Netherlands"
I decided to have a look at the sales efficiency of flights from different countries in different months throughout the given period. I was curious if there is any difference between months typical of travelling like summer or December. I want to hold primarily two subsets of data: 1. monthly data between 2007 and 2012 (number of PASSENGERS and SEATS are already given in the original dataset) and 2. sums of PASSENGERS and SEATS for months, for example for all Januaries from which I calculate average monthly sales efficiency for each country. The code below prepares these subsets, also just for the top 5 countries (even though it will not be used for subset1), and prints their first 10-10 lines into the document.
#adding efficiency to subset1
subset1 <- BUDdata
subset1$EFFICIENCY <- subset1$PASSENGERS/subset1$SEATS*100
subset1 <- arrange(subset1, DATE)
subset1Top5 <- filter(subset1, COUNTRY %in% as.character(top5))
subset1Top5[1:10,]
## Source: local data frame [10 x 5]
## Groups: DATE [2]
##
## DATE COUNTRY PASSENGERS SEATS EFFICIENCY
## (time) (chr) (dbl) (dbl) (dbl)
## 1 2007-01-01 France 14316 30611 46.76750
## 2 2007-01-01 Germany 39946 71170 56.12758
## 3 2007-01-01 Italy 17333 35437 48.91215
## 4 2007-01-01 The Netherlands 10755 20704 51.94648
## 5 2007-01-01 United Kingdom 24918 41369 60.23351
## 6 2007-02-01 France 15126 26324 57.46087
## 7 2007-02-01 Germany 44272 69119 64.05185
## 8 2007-02-01 Italy 18319 31209 58.69781
## 9 2007-02-01 The Netherlands 13444 19937 67.43241
## 10 2007-02-01 United Kingdom 29084 39311 73.98438
#subset2
subset2 <- BUDdata
subset2$MONTH <- month(subset2$DATE)
subset2 <- group_by(subset2, MONTH, COUNTRY)
subset2 <- summarise(subset2, sum(PASSENGERS), sum(SEATS))
subset2$EFFICIENCY <- subset2$`sum(PASSENGERS)`/subset2$`sum(SEATS)`*100
subset2Top5 <- filter(subset2, COUNTRY %in% as.character(top5))
subset2Top5[1:10,]
## Source: local data frame [10 x 5]
## Groups: MONTH [2]
##
## MONTH COUNTRY sum(PASSENGERS) sum(SEATS) EFFICIENCY
## (dbl) (chr) (dbl) (dbl) (dbl)
## 1 1 France 86086 162873 52.85468
## 2 1 Germany 238640 439979 54.23895
## 3 1 Italy 96355 186493 51.66682
## 4 1 The Netherlands 73998 136713 54.12653
## 5 1 United Kingdom 180193 292531 61.59792
## 6 2 France 85863 139744 61.44307
## 7 2 Germany 262301 415090 63.19136
## 8 2 Italy 93279 159664 58.42206
## 9 2 The Netherlands 80676 119257 67.64886
## 10 2 United Kingdom 196252 267357 73.40447
Box-and-Whisker Plots are convenient to compare different groups of data, like different months in the BUD dataset. Below I present monthly sales efficiency data for all countries on two ways using Box-and-Whisker Plots plus I always overlay the data points for the top 5 countries with the size indicating the number of passengers. Median notches are shown to better observe differences in central values of distributions.
I will need to show the sample number (number of countries) for each month, therefore, first I include a function to locate the text and count the data in a vector. I print an example where position is 26 and sample number is 9.
n_fun <- function(x){
return(data.frame(y = 26, label = length(x)))
}
n_fun(c(1,2,3,4,5,6,7,8,9))
## y label
## 1 26 9
Below, I prepare an animation, save it and then embed it in this document. The animation uses two for loops, one for years and one for months within the given year. It subsets the subset1 data. For months it keeps all months until the given month, for example when j is 4, it will contain data for months 1-4 in year i. Then I needed to make a new column for the month and make sure that it has 12 levels (for months of the year) in all cases because I want to keep a space for all of them even though data might not be available in the given month-subset. In this loop I also filter the data for the top 5 countries then I make a plot to include in the animation. In the plot I show sales efficiency in different countries separated by month as a factor. I set y and x axis to make sure they always look the same on the animation. For setting x axis I needed to use the fixed 12 levels of MONTH column. I add the sample number by the above defined function and finally start to overlay the data points of top 5 countries. Text sizes and animation width and length needed to be adjusted empirically and converted between units to appear similar in the document to the later shown figure.
saveGIF({
for(i in unique(year(subset1$DATE))){
yearSubset <- filter(subset1, year(DATE)==i)
for(j in unique(month(subset1$DATE))){
monthSubset <- filter(yearSubset, month(DATE)<=j)
monthSubset$MONTH <- as.factor(month(monthSubset$DATE))
levels(monthSubset$MONTH) <- as.factor(1:12)
monthSubsetTop5 <- filter(monthSubset, COUNTRY %in% as.character(top5))
f<-ggplot(monthSubset, aes(x = as.factor(MONTH), y = EFFICIENCY)) +
geom_boxplot(notch = TRUE) +
scale_y_continuous(limits=c(25, 100)) +
scale_x_discrete(breaks=as.factor(1:12), drop=FALSE, labels=c(month.abb[1:12])) +
stat_summary(fun.data = n_fun, geom = "text") +
annotate("text", x = 0.6, y = 26, label = "n:") +
geom_point(data = monthSubsetTop5, aes(x = as.factor(MONTH), y = EFFICIENCY, colour = as.factor(COUNTRY), size = PASSENGERS), alpha=1/3) +
guides(size=FALSE, colour = guide_legend(title = "top 5 countries")) +
scale_size(range = c(6, 14)) +
labs(title = paste("Monthly variation of sales efficiency of incoming flights \n from all countries and in countries with top 5 number of passengers \n", i), x = "time (month between 2007 and 2012)", y = "sales efficiency (%)")+
theme(axis.title = element_text(size = 18), axis.text = element_text(size = 18), legend.text = element_text(size = 20), legend.title= element_text(size = 22), plot.title= element_text(size = 22))
print(f)
}
}
}, movie.name = "subset1.gif", ani.width = 900, ani.height = 600, interval = 0.4)
## [1] TRUE
One can see above that the years of the dataset show similar variation to each other with lower values in winter months and higher values in the summer. Similarly to the animation, below, I prepare a summarizing figure for the months of all of the available years using the subset2.
f<-ggplot(subset2, aes(x = as.factor(MONTH), y = EFFICIENCY)) +
geom_boxplot(notch = TRUE) +
scale_y_continuous(limits=c(25, 100)) +
scale_x_discrete(labels=c(month.abb[1:12])) +
stat_summary(fun.data = n_fun, geom = "text") +
annotate("text", x = 0.6, y = 26, label = "n:") +
geom_point(data = subset2Top5, aes(x = as.factor(MONTH), y = EFFICIENCY, colour = as.factor(COUNTRY), size = `sum(PASSENGERS)`), alpha=1/2) +
guides(size=FALSE, colour = guide_legend(title = "top 5 countries")) +
scale_size(range = c(2, 10)) +
labs(title = "Average monthly sales efficiency of incoming flights \n from all countries and in countries with top 5 number of passengers \n All years", x = "month in 2007-2012 period", y = "average sales efficiency (%)")+
theme(axis.title = element_text(size = 12), axis.text = element_text(size = 12), legend.text = element_text(size = 12), legend.title= element_text(size = 14), plot.title= element_text(size = 14))
ggsave("subset2.png", width = 21.17, height = 13.23, units = "cm")
Based on the animation and the summarizing figure the following statements can be made:
Sales efficiency for incoming BUD flights from all countries significantly varies throughout the months of the year with typically low values in winter months and higher values in the summer. Therefore, improvement could be focused on winter period.
Top 5 countries follow a similar pattern (low sales efficiency in winter and high in summer) usually with high values on the distribution in comparison with other countries. One significant exception is that they do not show low values in December but even more in January. Therefore, main attention should be paid for improvement in January.
Among the top 5 countries UK performs the best, the second seems to be The Netherlands, which are both countries where Hungarians typically live and not only holiday destinations. Italy seems to perform the worst regarding values of sales efficiency indicating that improvement would be necessary. Germany has the highest number of passengers (size of the markers) but seems its sales efficiency drops in August which raises attention.