During the year 2018 preparation for a trip around the world, I came across an aspect previously unknown about air tickets. That the price per km flown decreases with distance. Although easily understandable from a cost point of view for companies. The variation surprised me, and I decided to investigate the matter further.
I realized that the value per km flown on long-haul flights (greater than 5000 km) always approached USD 0.06 at the time. This was the price I paid for the flights divided by the total distance.
Soon after that, I started to as myself: Would it be possible to calculate the minimum value of an air ticket? Could I predict the occurrence of a price range? How far can tickets go? what is the real chance of buying one at 5% cheaper? Is this a way of knowing what a cheap ticket would be, due to the cost-benefit ratio? Did I take a loss or did I get along?
The data used were obtained from the Transport Statistics Office of the Department of Transport - DOT (USA), the data are for 2018, it was origilnaly compiled by Ryan Zernach (KAGGLE). This was the largest database found and that had long distance ticket values. Although my historical/experimental values were from international Europe/Asia tickets, I believe that due to globalization and strong competition, the prices in the United States were very similar in the same period.
The companies were: (WN) Southwest Airlines, (DL) Delta Air Lines, (AA) American Airlines, (UA) United Air Lines, (B6, JetBlue Airways, (AS) Alaska Airlines, (NK) Spirit Air Lines, ( G4) Allegiant Air, (F9) Frontier Airlines, (HA) Hawaiian Airlines, (SY) Sun Country Airlines, (VX) Virgin America.
All processing was performed in R and RStudio using Dplyr and ggplot.
A table was created from the original file of approximately 9 Million lines and 700 Megabytes in size:
library(readr)
library(tidyverse)
Flights <- read_csv("/mnt/DADOS/Data Science/Viagens/Cleaned_2018_Flights.csv")
Let’s calculate the value per km flown, creating the variable “PricePerKm” and summarize the data with distances (Miles), values per km flown (PricePerKm) and total prices (PricePerTicket):
Flights <- Flights %>% mutate(PricePerKm= PricePerTicket/(Miles*1.8))
summary(Flights$PricePerKm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005839 0.065521 0.103591 0.143329 0.175479 13.232323
summary(Flights$Miles)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11 679 1028 1202 1609 5095
summary(Flights$PricePerTicket)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.0 133.9 197.2 232.4 290.0 1000.0
The average of all 9 million records available is 0.14USD / km, a distance of 1202 Miles (2200km) and a price of 232USD per ticket.
Visualizing the distribution of the data by plotting Distance x Value per Km. Will this answer if the values decrease according to the distance flown, that is, are longer flights proportionally cheaper?
g <- ggplot(Flights,aes(Miles,PricePerKm))
g + geom_point(aes(color=Origin,alpha=0.01),show.legend = FALSE)
g
USD/Km cost graph flown versus distance
We see in Figure 1 that it is inconclusive, there are many extreme values. Compressing the axes using the logarithmic scale (base 10) could simplify visualization:
g +scale_y_continuous(trans="log10") + scale_x_continuous(trans="log10")
Same as the previous one, but in logarithmic scaling (Base 10)
Figure 2 shows a much more interesting result and with very clear price limits in the correlation between value per km and distance flown. However, this correlation and inclination are due to the fact that one of the variables is directly dependent on the other.
Analyzing now (with frequency histograms) the distribution of prices according to the distance flown (X axis is Value per km flown and Y, the frequency of flights in that value), first below 500 miles (900 km, approximately 1 hour flight) and separating each airline by color:
ggplot(filter(Flights,Flights$Miles <=500),aes(PricePerKm))+geom_histogram(aes(fill=AirlineCompany))
+scale_x_continuous(trans="log10")+geom_vline(xintercept=0.06, linetype="solid", color = "blue")
Frequency histogram with flight data below 500 miles, X axis is USD / Km value, Y is the frequency of occurrence
The blue line in figure 3 represents the practical/experimental value of 0.06 USD/km, in this case, short flights have much more expensive values. Even with the logarithmic transformation, the data is quite asymmetric.
Now, for long flights, over 4000 miles (7500km, approximately 8 flight hours):
ggplot(filter(Flights, Miles >= 4000),aes(PricePerKm,fill=AirlineCompany))
+geom_histogram()+scale_x_continuous(trans="log10")+
+geom_vline(xintercept=0.06, linetype="solid", color = "blue")
Histogram of values per Km flown for routes above 4000 miles
Figure 4 shows that the average is practically 0.06USD/Km! In other words, my experience was quite average in terms of getting a cheap ticket. Distribution is largely dominated by America Airlines flights. But let’s see a summary of all flights over 4000 Miles and see how these values are distributed:
summary(subset(Flights$PricePerKm,Flights$Miles >= 4000))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005839 0.048663 0.059851 0.062382 0.073558 0.132250
Yes, I was well above the average. Although 50% of the tickets are cheaper, 25% of them (2nd Quartile) start at 0.048USD/Km which is only 20% cheaper than what I paid, which can be explained by another variable such as the absence of baggage allowance ( that was acquired in my case)
We will investigate the maximum amount paid for the 5% lower tariffs. Selecting the bottom 5% on the 4000 mile acmy flights:
cinco <- subset(Flights,Flights$Miles >= 4000)
quantile(cinco$PricePerKm,0.05)
## 5%
## 0.03612282
The lucky 5% (or not, as it is not clear that discount coupons may have been used) paid an equivalent of 3.6 cents per dollar flown. 60% of my practical/historical value.
Now investigating the values according to the airlines using boxplots:
Flights %>% filter(Miles >= 4000) %>% ggplot(aes(PricePerKm, AirlineCompany))
+geom_boxplot(aes(fill=AirlineCompany))+scale_x_continuous(trans="log10") +geom_vline(xintercept=0.06, linetype="solid", color = "blue")
Flight boxplots above 4000 miles, separated by airline
Each boxplot line in figure 5 represents an airline, the central box comprises 50% of the values per km flown in each company (2nd and 3rd Quartile, 50% and 75%). The centerline within each box is the median. The horizontal lines comprise the first and last quartiles (25% and 100%). Points outside these latter are considered outliers.
Even separating the airlines it is possible to note that the medians are very close and that only Virgin America (low cost) and American Airlines (the largest in the world in terms of passenger volume) do not have many outliers in the lower price range.
Let’s check now, if the same value per Km flown can include some concentration of airline or segments, we will use the values between 0.12 and 0.16 USD/Km as an example:
Dez <- subset(Flights,PricePerKm > 0.12 & PricePerKm <0.16 )
ggplot(Dez) +geom_histogram(aes(Miles, fill=AirlineCompany))+scale_x_continuous(trans="log10")
ggplot(Dez) +geom_histogram(aes(Miles, fill=AirlineCompany))+scale_x_continuous(trans="log10")
+facet_wrap(~AirlineCompany)
Flight distribution in the range between 0.12 and 0.14USD per Km
Distribution of flights in the range between 0.12 and 0.14USD / Km, separated by airline
In figure 6 the values appear to be normally distributed over the range. Figure 7 shows the separation by the airlines. We will now see if the total distribution of flights over distances is constant or if there are many more short than long flights.
Histogram of all flights according to distance
Figure 8 shows that the number of flights increases according to the distance until it reaches a maximum and then decreases.
We are going nwo to create a summary with all the average values per Km flown, but grouping by distance (each distance represents a route). This will help us to try to see a correlation between the number of flights and the decrease in price.
F1 <- aggregate(Flights, list(Flights$Miles), mean)
ggplot(F1, aes(Miles,PricePerKm))+geom_line()+scale_x_continuous(trans="log10")+
+scale_y_continuous(trans="log10")
This is the average USD/Km flown for each distance (route), clear trend
Frequency distribution of flights with decreasing values per km flown (Red)
We observed that the values per Km flown (red line in the graph above) correlate more with the number of occurrences of routes than with the distance itself.
We see an abrupt drop at approximately 2630 Miles (Blue line). This value is the average distance between the two coasts in the USA. Longer distances would only be necessary on specific flights such as Alaska, Hawaii or Puerto Rico, which occur less frequently.
By the graph “PricePerKm x Miles” We already know that the delimitation of values is always fixed. Now, let’s find the ratio between the cheapest and most expensive tickets by filtering for some distances, more specifically, 200, 500 and 2560 Miles:
c1 <- subset(Flights,Flights$Miles == 200)
max(c1$PricePerTicket)
## [1] 998
min(c1$PricePerTicket)
## [1] 50
table(max(c1$PricePerTicket)/min(c1$PricePerTicket))
##
## 19.96
## 1
c1 <- subset(Flights,Flights$Miles == 500)
max(c1$PricePerTicket)
## [1] 991
min(c1$PricePerTicket)
## [1] 50
table(max(c1$PricePerTicket)/min(c1$PricePerTicket))
##
## 19.82
## 1
c1 <- subset(Flights,(Flights$Miles ==2586))
max(c1$PricePerTicket)
## [1] 1000
min(c1$PricePerTicket)
## [1] 50
table(max(c1$PricePerTicket)/min(c1$PricePerTicket))
##
## 20
## 1
Yes, the values per Km flown (in 200, 500 and 2560 miles) always approach a 20x ratio between the cheapest and most expensive.
We confirm that the values per km flown actually decrease according to the distance. One theory would be that it probably suffers amortization of fixed costs over the longest distance. However, the variable number of flights also seems to be related, and that could have the same cause, more flights, greater amortization and competition.
It is interesting to note that the number of flights increases with distance due to the fact that other forms of transport compete with this way of transportation. Longer flights (over 4 hours) are unlikely to be replaced by car or train. Thus, there is an increase in competition and a decrease in fixed costs.
I conclude that despite a lot of research, I bought my tickets at a value very close to the average. As I had checked baggage, I can consider that I would still have the option of giving up this feature to further reduce the average ticket and try to reach the lowest 25% percentil (20% less mopney than I paid, 0.048 USD/Km) and maybe 5 Lower% (40% less than what I paid or 0.036 USD/Km)
Another interesting finding is that the difference between the maximum and minimum values is always in the 20x range. That is, on the eve of a flight (when the value in general is at the maximum) it can be considered that the minimum value that could be found would be only 5% of it. This would be indicative, not a rule.
Another practical approach to minimize the value of an air ticket would be to calculate the distance of the flight and use the average of 0.058 USD / Km and create promotions/sales alerts for flights on that route using this value.
A risky tactic would be to test if increasing the number of flights or connections, there would be an increase in the total distance flown and consequently a decrease in the value of the ticket. That way the passenger, instead of making the connection, would leave the airport, and not board the last flight. This technique, called Skiplagging (or hidden city), made an america website be sued by the united airlines. However, it would not be possible with round-trip flights.
Skiplagging
Lufthansa trying to avoid Skiplagging
2018 Airplane Flights - Predicting prices of airline flights (Ryan Zernach) https://www.kaggle.com/zernach/2018-airplane-flights
The Bureau of Transportation Statistics. 2018. https://www.transtats.bts.gov/Tables.asp?DB_ID=125&DB_Name=Airline%20Origin%20and%20Destination%20Survey%20%28DB1B%29&DB_Short_Name=Origin%20and%20Destination%20Survey