DCD 14.0 Proposal - Flight Fare Prediction
Adding Libraries
library(tidyverse)
library(readxl) #to read excel file
library(lubridate) #to ease of transforming datetime columns
library(ggplot2) #to better visualize the EDA
library(plotly) #make plots interactiveProblem Research
Background
As a fan of The Amazing Race since a very young age, an American reality TV show about competition to become the fastest team to travel around the world, traveling to the most interesting places in the world has become a part of my lifelong goal, as I’m sure most other common people would also enjoy doing. For me, it’s about the interesting culture and culinary that makes me fall in love with doing so, despite some limitations of doing it more often, like the cost of transportations and accommodations as the biggest factor, as they are definitely not cheap, especially those flight trips, as also supported by Sterling Price of ValuePenguin.com that researched about average cost of a vacation, that we can see that the budget is mostly going to the transportation. Therefore, a good deal for flight fare has been a deciding factor for me and most people who also love to travel in the final decision of going through with spending our hard-earned money to travel to interesting places in the world.
As international travel came to quite a standstill since 2020 due to the COVID-19 pandemic, the travel and tourism industries, especially aviation industry, suffered what it describes as the “the worst year in history for air travel demand”. According to the International Air Transport Association (IATA), global passenger traffic as measured in revenue passenger kilometers declined by 65.9 percent compared to 2019, as international passenger demand dropped 75.6 percent and domestic demand fell 48.8 percent below 2019 levels.
From an article at statista.com, we can see IATA’s prediction once again claimed that air travel demand to reach just 50 percent of 2019 levels in 2021, even in its more optimistic scenario. If the new strains of the virus continue to take hold, things could be even worse and recovery could be limited to just 13 percent above 2020 levels, leaving the industry at 38 percent of 2019 passenger demand, as seen from above chart. Combined with rapidly changing local government’s regulations among other things, traveling as a hobby or even a business trip, has become the hardest activity to accomplish during these challenging times.
Keeping positive thoughts that COVID-19 and its variants could soon be tackled, with the emerging vaccines and strict government regulations in place, I can only hope to be able to travel and see the world again one day. Since most people, like me, are not super rich, a budget planning for a duly required vacation would be important for us to be able to maximize our enjoyment of the vacation while balancing the feeling of impending doom of credit card bills and reduced savings after such a trip.
Since we know that the transportation costs the most, if we are planning to visit places that are outside the land that we are currently in, it would be unavoidable to use an aviation service. As a frequent traveler and quite “stingy” person, won’t it be nice to be able to know when the cheapest flight to that destination vacation that we have in mind would become available, and when would the latest date be that can we hold off on buying such tickets just so we can hold off our expenses until it is really required? Sure, we can always look for our preferred travel-related services and applications like Travel_ka and Tik_t.com, that would be such a hassle. Since I have studied with Algoritma in the last few months about data science, can’t we somehow predict it with the help of machine learning?
Business Impact
By having a flight dashboard to mainly help predicting the best range of date of reserving the cheapest ticket possible, it could help potential travelers, be it for leisure or business, to be able to optimize their traveling expenses, so that these people would be able to hold off expending budget until the time that it would be totally necessary and advantageous. Any third-party traveling services could also benefited from this as having the knowledge that their similar competitions could be expected to have the price pattern, and take advantage on being different, for example to plan some special price promotions during the time that most other services would be at their highest, unlikeable price, so that the last-minute potential customers could prefer their business instead.
Flight-related businesses, like accommodation, culinary, and other tourism services would be indirectly impacted by the ease and effective way of travelers spending less money on flights, that they would be more inclined to spend money on other things during the vacation. As a country that had such an established tourism industry and experienced such blowback during the pandemic, Indonesia and most other interesting countries would be benefited in the general if the flight demand increased by the impact of ease of access to the most optimal date to buy airline tickets.
Main Target User and Their Benefits
Target user is more about the potential travelers that would be very likely to use flight services and optimize their expenditure, and also any third-party travel reservation services to be able to optimize their price algorithm to be more advantageous against their competitors.
Other than those two that would be quite directly related, other travel and tourism industry businesses, such as hotels, culinary, etc. would also be indirectly targeted as it should be quite established already that they would be indirectly benefited from this.
Data Collection
Summary
The main dataset that I am about to use was sourced from Koninklijke Nederlandse Akademie van Wetenschappen (KNAW) or simply Royal Netherlands Academy of Arts and Sciences.
flfare <- read.csv("German Air Fares.csv")
nrow(flfare)#> [1] 62627
head(flfare)#> ï..departure_city arrival_city scrape_date departure_date
#> 1 TXL Berlin-Tegel DUS Düsseldorf 18.10.2019 25.10.2019
#> 2 TXL Berlin-Tegel DUS Düsseldorf 18.10.2019 25.10.2019
#> 3 TXL Berlin-Tegel DUS Düsseldorf 18.10.2019 25.10.2019
#> 4 TXL Berlin-Tegel DUS Düsseldorf 18.10.2019 25.10.2019
#> 5 TXL Berlin-Tegel DUS Düsseldorf 18.10.2019 25.10.2019
#> 6 TXL Berlin-Tegel DUS Düsseldorf 18.10.2019 25.10.2019
#> departure_date_distance departure_time arrival_time airline stops
#> 1 1 week 6:30am 7:45am Eurowings direct
#> 2 1 week 6:40am 7:55am easyJet direct
#> 3 1 week 6:45am 8:00am easyJet direct
#> 4 1 week 6:15am 7:30am Eurowings direct
#> 5 1 week 6:55am 8:10am Eurowings direct
#> 6 1 week 8:00am 9:10am Eurowings direct
#> price..â...
#> 1 74.00
#> 2 75.00
#> 3 80.00
#> 4 84.00
#> 5 84.00
#> 6 104.00
The dataset consists of 62627 rows x 10 columns, which are:
-
ï..departure_city: character, 3-letter airport code followed by the airport name of the departure city. -
arrival_city: character, 3-letter airport code followed by the airport name of the arrival city. -
scrape_date: character / date, when the particular observation / row was scraped. -
departure_date: character / date, when the particular flight departed. -
departure_date_distance: character, how far are thescrape_dateanddeparture_dateare distanced. -
departure_time: character / time, when the particular flight departed. -
arrival_time: character / time, when the particular flight arrived. -
airline: character, name of the airline carrier. -
stops: character, showing how many stops did the flight before arriving. -
price..â...
: character / numeric, price of the particular flight in Euro.
A secondary dataset, for the sake of completeness in case that the above dataset is not sufficient.
The dataset was gained from Kaggle.com - Flight Fare Prediction MH, and consists of 3 files: “Data_Train”, “Sample_Submission”, and “Data_Test”. For the sake of basic data exploration, we will look at the Data_Train.xlsx for now:
flfare_excel <- read_xlsx("Data_Train.xlsx")
nrow(flfare_excel)#> [1] 10683
head(flfare_excel)#> # A tibble: 6 x 11
#> Airline Date_of_Journey Source Destination Route Dep_Time Arrival_Time
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 IndiGo 24/03/2019 Banglore New Delhi BLR <U+2192> ~ 22:20 01:10 22 Mar
#> 2 Air India 1/05/2019 Kolkata Banglore CCU <U+2192> ~ 05:50 13:15
#> 3 Jet Airways 9/06/2019 Delhi Cochin DEL <U+2192> ~ 09:25 04:25 10 Jun
#> 4 IndiGo 12/05/2019 Kolkata Banglore CCU <U+2192> ~ 18:05 23:30
#> 5 IndiGo 01/03/2019 Banglore New Delhi BLR <U+2192> ~ 16:50 21:35
#> 6 SpiceJet 24/06/2019 Kolkata Banglore CCU <U+2192> ~ 09:00 11:25
#> # ... with 4 more variables: Duration <chr>, Total_Stops <chr>,
#> # Additional_Info <chr>, Price <dbl>
The dataset has 10683 rows x 11 columns:
-
Airline: The name of the airline. -
Date_of_Journey: The date of the journey -
Source: The source from which the service begins. -
Destination: The destination where the service ends. -
Route: The route taken by the flight to reach the destination. -
Dep_Time: The time when the journey starts from the source. -
Arrival_Time: Time of arrival at the destination. -
Duration: Total duration of the flight. -
Total_Stops: Total stops between the source and destination. -
Additional_Info: Additional information about the flight -
Price: The price of the ticket A regression problem in which duration, source, destination, dates of flight and route are provided
Data Source and Accountability
As the first was sourced from a publicly available source of Royal Netherlands Academy of Arts and Sciences, I believe that the dataset is accountable as it was used for academical purposes and free to use under a Creative Commons Attribution 4.0 International license, that means that: “You can share, copy and modify this dataset so long as you give appropriate credit, provide a link to the CC BY license, and indicate if changes were made, but you may not do so in a way that suggests the rights holder has endorsed you or your use of the dataset. Note that further permission may be required for any content within the dataset that is identified as belonging to a third party.”
Here is the link to the dataset: F, Frederick (2020), “German Flight Ticket Prices”, Mendeley Data, V1, doi: 10.17632/gz75x2pzr7.1
The description mentioned that the dataset was generated by web scraping and includes the ticket prices on 84 German connections over a period of 6 months. a total of 63,000 prices and connections are included in the dataset.
Meanwhile for the secondary dataset, it was publicly available in Kaggle.com: Flight Fare Prediction MH.
When traced for the source, it seems to be produced by a site called MachineHack, similar site to Kaggle, that offers prices for routinely held machine learning competitions. Judging from the currency of the price, the competitions seems mostly based in India. Here is their GitHub link.
Business Requirement
Preferably I would prefer a real-time dataset concerning either the best-known airline carriers in the world or specifically concerning Indonesia as part of the route so that the dataset would be very relevant and can be used directly by inquiring potential customers, but it won’t be feasible since after some researches, that such datasets are only available in a paid or subscription basis. Therefore, for now, we will consider using other freely and publicly available datasets.
From the publicly available datasets as described above, for now, they should be sufficient to fulfill the business requirement, which is to create a flight price prediction model.
The datasets have departure date and time, arrival date and time, departure city/airport, arrival city/airport, airline name, number of stops, and most importantly, the price or the ticket fare of each observations, as this price would likely be our target variable later on.
Of course, some data preparation would be required before we dive in deeper into the modelling, not limited to and most likely including these steps:- Renaming the columns to be more generally readable
- Renaming special characters in future character of factor types, to avoid causing errors when working with the data
-
Adjusting data types, especially the numerical
priceand the date-time formatting of the departure or arrival date-times. -
Further feature engineering like creating a new column that would be more representative of our problem at hand, like possible
durationproduced by arrival time minus departure time, if it hasn’t been available already.
Data Preparation
Target and Predictors
There are at least 2 types of Machine Learning models that I am considering to use for this business requirements, that is: as a Time-Series Analysis to see if the fares have a more noticable seasonality pattern as a more deciding factor of the price, or a Regression case factoring many predictors including other than datetime ones.
Therefore the Targets and Predictors definition would differ based on which kind of Machine Learning algorithm that we are going to pursue.
Time-Series Analysis
Target: Price
Predictor(s): Possibly either one of Departure Datetime or Arrival Datetime.
Regression
Target: Price
-
Departure Datetime(mandatory) -
Arrival Datetime(mandatory) -
Duration(mandatory, possible replacement to bothDeparture DatetimeandArrival Datetime) -
Departure City(optional) -
Arrival City(optional) -
Airline(optional) -
Number of Stops(optional)
Data Preparation Steps
flfare_clean <-
flfare %>%
rename(departure_city = "ï..departure_city",
price = "price..â...") %>%
mutate(departure_datetime = dmy_hm(paste(departure_date, " ", departure_time)),
arrival_datetime = dmy_hm(paste(departure_date, " ", arrival_time)),
arrival_datetime = as_datetime(ifelse(arrival_datetime<departure_datetime,
arrival_datetime + days(1), arrival_datetime)),
scrape_date = dmy(scrape_date),
departure_date = dmy(departure_date),
price = gsub(pattern = '"', replacement = '', x = price),
price = gsub(pattern = ',', replacement = '', x = price),
price = as.numeric(price),
route = as.factor(paste(substr(departure_city, 1, 3), "-", substr(arrival_city, 1, 3))),
stops = as.factor(case_when(stops == "(1 stop)" ~ "1 stop",
stops == "(1 Stopp)" ~ "1 stop",
stops == "(2 Stopps)" ~ "2 stops",
TRUE ~ "direct")),
flight_duration = as.numeric(difftime(arrival_datetime, departure_datetime)),
airline = as.factor(airline)
)
unique(flfare_clean$stops)#> [1] direct 1 stop 2 stops
#> Levels: 1 stop 2 stops direct
Data Exploration
Check Missing Values
colSums(is.na(flfare_clean))#> departure_city arrival_city scrape_date
#> 0 0 1
#> departure_date departure_date_distance departure_time
#> 1 0 0
#> arrival_time airline stops
#> 0 0 0
#> price departure_datetime arrival_datetime
#> 0 1 1
#> route flight_duration
#> 0 1
Since from seeing the CSV file I know that the end of the data, there is one empty row that only totals the price, I am quite sure that this what we are seeing here as the only row without missing value.
tail(flfare_clean,1)#> departure_city arrival_city scrape_date departure_date
#> 62627 <NA> <NA>
#> departure_date_distance departure_time arrival_time airline stops
#> 62627 direct
#> price departure_datetime arrival_datetime route flight_duration
#> 62627 13499686 <NA> <NA> - NA
Seems true! Now dropping the last row:
flfare_clean2 <-
flfare_clean %>%
drop_na()
colSums(is.na(flfare_clean2))#> departure_city arrival_city scrape_date
#> 0 0 0
#> departure_date departure_date_distance departure_time
#> 0 0 0
#> arrival_time airline stops
#> 0 0 0
#> price departure_datetime arrival_datetime
#> 0 0 0
#> route flight_duration
#> 0 0
Summary Statistics
summary(flfare_clean2)#> departure_city arrival_city scrape_date
#> Length:62626 Length:62626 Min. :2019-10-18
#> Class :character Class :character 1st Qu.:2019-10-20
#> Mode :character Mode :character Median :2019-10-22
#> Mean :2019-10-21
#> 3rd Qu.:2019-10-24
#> Max. :2019-10-24
#>
#> departure_date departure_date_distance departure_time
#> Min. :2019-10-25 Length:62626 Length:62626
#> 1st Qu.:2019-11-06 Class :character Class :character
#> Median :2019-12-01 Mode :character Mode :character
#> Mean :2019-12-27
#> 3rd Qu.:2020-01-24
#> Max. :2020-04-24
#>
#> arrival_time airline stops price
#> Length:62626 Lufthansa :45912 1 stop :29276 Min. : 26.0
#> Class :character Eurowings :12289 2 stops: 4072 1st Qu.: 100.0
#> Mode :character easyJet : 2935 direct :29278 Median : 164.0
#> KLM : 341 Mean : 215.6
#> Luxair : 290 3rd Qu.: 333.0
#> British Airways: 197 Max. :3088.0
#> (Other) : 662
#> departure_datetime arrival_datetime route
#> Min. :2019-10-25 06:00:00 Min. :2019-10-25 06:50:00 TXL - NUE: 2969
#> 1st Qu.:2019-11-06 11:00:00 1st Qu.:2019-11-06 14:30:00 NUE - TXL: 2765
#> Median :2019-12-01 06:50:00 Median :2019-12-01 11:55:00 FDH - TXL: 1790
#> Mean :2019-12-27 23:02:31 Mean :2019-12-28 04:02:57 TXL - FDH: 1759
#> 3rd Qu.:2020-01-24 06:00:00 3rd Qu.:2020-01-24 08:00:00 TXL - CGN: 1586
#> Max. :2020-04-24 22:30:00 Max. :2020-04-25 13:50:00 CGN - TXL: 1523
#> (Other) :50234
#> flight_duration
#> Min. : 0
#> 1st Qu.: 4200
#> Median :11100
#> Mean :18026
#> 3rd Qu.:27600
#> Max. :86100
#>
Flight Duration = 0?
flfare_clean2 %>%
filter(flight_duration == 0)#> departure_city arrival_city scrape_date departure_date
#> 1 NUE Nürnberg TXL Berlin-Tegel 2019-10-18 2019-10-25
#> 2 TXL Berlin-Tegel NUE Nürnberg 2019-10-18 2020-01-18
#> 3 HAM Hamburg NUE Nürnberg 2019-10-18 2020-01-18
#> 4 HAM Hamburg NUE Nürnberg 2019-10-18 2020-01-18
#> 5 HAM Hamburg NUE Nürnberg 2019-10-18 2020-01-18
#> 6 HAM Hamburg NUE Nürnberg 2019-10-18 2020-01-18
#> 7 NUE Nürnberg HAM Hamburg 2019-10-18 2020-01-18
#> 8 NUE Nürnberg HAM Hamburg 2019-10-18 2020-01-18
#> 9 NUE Nürnberg HAM Hamburg 2019-10-18 2020-01-18
#> 10 NUE Nürnberg HAM Hamburg 2019-10-18 2020-01-18
#> 11 STR Stuttgart BRE Bremen 2019-10-18 2020-01-18
#> 12 STR Stuttgart BRE Bremen 2019-10-18 2020-01-18
#> 13 STR Stuttgart BRE Bremen 2019-10-18 2020-01-18
#> 14 STR Stuttgart BRE Bremen 2019-10-18 2020-01-18
#> 15 STR Stuttgart HAJ Hannover 2019-10-18 2020-01-18
#> 16 HAM Hamburg NUE Nürnberg 2019-10-19 2019-10-26
#> 17 CGN Köln/Bonn LEJ Leipzig/Halle 2019-10-19 2019-10-26
#> 18 LEJ Leipzig/Halle CGN Köln/Bonn 2019-10-19 2019-10-26
#> 19 LEJ Leipzig/Halle STR Stuttgart 2019-10-19 2019-10-26
#> 20 HAM Hamburg NUE Nürnberg 2019-10-19 2019-11-02
#> 21 HAM Hamburg NUE Nürnberg 2019-10-19 2019-11-02
#> 22 HAM Hamburg NUE Nürnberg 2019-10-19 2019-11-02
#> 23 HAM Hamburg NUE Nürnberg 2019-10-19 2019-11-02
#> 24 HAM Hamburg NUE Nürnberg 2019-10-19 2019-11-02
#> 25 HAM Hamburg NUE Nürnberg 2019-10-19 2019-11-02
#> 26 NUE Nürnberg HAM Hamburg 2019-10-19 2019-11-02
#> 27 NUE Nürnberg HAM Hamburg 2019-10-19 2019-11-02
#> 28 NUE Nürnberg HAM Hamburg 2019-10-19 2019-11-02
#> 29 STR Stuttgart BRE Bremen 2019-10-19 2019-11-02
#> 30 STR Stuttgart BRE Bremen 2019-10-19 2019-11-02
#> 31 STR Stuttgart HAJ Hannover 2019-10-19 2019-11-02
#> 32 HAM Hamburg FDH Friedrichshafen 2019-10-19 2020-01-19
#> departure_date_distance departure_time arrival_time
#> 1 1 week 3:00pm 3:00pm
#> 2 3 month 5:55pm 5:55pm
#> 3 3 month 5:55pm 5:55pm
#> 4 3 month 9:45am 9:45am
#> 5 3 month 9:45am 9:45am
#> 6 3 month 7:20pm 7:20pm
#> 7 3 month 10:25am 10:25am
#> 8 3 month 10:25am 10:25am
#> 9 3 month 10:25am 10:25am
#> 10 3 month 10:25am 10:25am
#> 11 3 month 7:20pm 7:20pm
#> 12 3 month 7:20pm 7:20pm
#> 13 3 month 7:20pm 7:20pm
#> 14 3 month 7:20pm 7:20pm
#> 15 3 month 2:25pm 2:25pm
#> 16 1 week 7:45pm 7:45pm
#> 17 1 week 8:10pm 8:10pm
#> 18 1 week 8:35pm 8:35pm
#> 19 1 week 4:35pm 4:35pm
#> 20 2 weeks 9:45am 9:45am
#> 21 2 weeks 9:45am 9:45am
#> 22 2 weeks 9:45am 9:45am
#> 23 2 weeks 9:45am 9:45am
#> 24 2 weeks 5:55pm 5:55pm
#> 25 2 weeks 7:20pm 7:20pm
#> 26 2 weeks 10:25am 10:25am
#> 27 2 weeks 10:25am 10:25am
#> 28 2 weeks 10:25am 10:25am
#> 29 2 weeks 7:20pm 7:20pm
#> 30 2 weeks 7:20pm 7:20pm
#> 31 2 weeks 2:25pm 2:25pm
#> 32 3 month 2:00pm 2:00pm
#> airline stops price departure_datetime
#> 1 Swiss International Air Lines 1 stop 438 2019-10-25 15:00:00
#> 2 Air France 1 stop 131 2020-01-18 17:55:00
#> 3 Air France 1 stop 127 2020-01-18 17:55:00
#> 4 Lufthansa 1 stop 177 2020-01-18 09:45:00
#> 5 Multiple Airlines 1 stop 178 2020-01-18 09:45:00
#> 6 LOT-Polish Airlines 1 stop 180 2020-01-18 19:20:00
#> 7 Swiss International Air Lines 1 stop 182 2020-01-18 10:25:00
#> 8 Lufthansa 1 stop 182 2020-01-18 10:25:00
#> 9 Multiple Airlines 1 stop 183 2020-01-18 10:25:00
#> 10 Multiple Airlines 1 stop 183 2020-01-18 10:25:00
#> 11 Lufthansa 1 stop 160 2020-01-18 19:20:00
#> 12 Swiss International Air Lines 1 stop 160 2020-01-18 19:20:00
#> 13 Multiple Airlines 1 stop 161 2020-01-18 19:20:00
#> 14 Multiple Airlines 1 stop 161 2020-01-18 19:20:00
#> 15 KLM 1 stop 121 2020-01-18 14:25:00
#> 16 Lufthansa 1 stop 288 2019-10-26 19:45:00
#> 17 Lufthansa 1 stop 304 2019-10-26 20:10:00
#> 18 Lufthansa 1 stop 329 2019-10-26 20:35:00
#> 19 Lufthansa 1 stop 407 2019-10-26 16:35:00
#> 20 Swiss International Air Lines 1 stop 258 2019-11-02 09:45:00
#> 21 Lufthansa 1 stop 258 2019-11-02 09:45:00
#> 22 Multiple Airlines 1 stop 260 2019-11-02 09:45:00
#> 23 Multiple Airlines 1 stop 260 2019-11-02 09:45:00
#> 24 Air France 1 stop 311 2019-11-02 17:55:00
#> 25 LOT-Polish Airlines 1 stop 651 2019-11-02 19:20:00
#> 26 Lufthansa 1 stop 377 2019-11-02 10:25:00
#> 27 Multiple Airlines 1 stop 380 2019-11-02 10:25:00
#> 28 Swiss International Air Lines 1 stop 388 2019-11-02 10:25:00
#> 29 Multiple Airlines 1 stop 363 2019-11-02 19:20:00
#> 30 Lufthansa 1 stop 370 2019-11-02 19:20:00
#> 31 KLM 1 stop 309 2019-11-02 14:25:00
#> 32 Lufthansa 1 stop 379 2020-01-19 14:00:00
#> arrival_datetime route flight_duration
#> 1 2019-10-25 15:00:00 NUE - TXL 0
#> 2 2020-01-18 17:55:00 TXL - NUE 0
#> 3 2020-01-18 17:55:00 HAM - NUE 0
#> 4 2020-01-18 09:45:00 HAM - NUE 0
#> 5 2020-01-18 09:45:00 HAM - NUE 0
#> 6 2020-01-18 19:20:00 HAM - NUE 0
#> 7 2020-01-18 10:25:00 NUE - HAM 0
#> 8 2020-01-18 10:25:00 NUE - HAM 0
#> 9 2020-01-18 10:25:00 NUE - HAM 0
#> 10 2020-01-18 10:25:00 NUE - HAM 0
#> 11 2020-01-18 19:20:00 STR - BRE 0
#> 12 2020-01-18 19:20:00 STR - BRE 0
#> 13 2020-01-18 19:20:00 STR - BRE 0
#> 14 2020-01-18 19:20:00 STR - BRE 0
#> 15 2020-01-18 14:25:00 STR - HAJ 0
#> 16 2019-10-26 19:45:00 HAM - NUE 0
#> 17 2019-10-26 20:10:00 CGN - LEJ 0
#> 18 2019-10-26 20:35:00 LEJ - CGN 0
#> 19 2019-10-26 16:35:00 LEJ - STR 0
#> 20 2019-11-02 09:45:00 HAM - NUE 0
#> 21 2019-11-02 09:45:00 HAM - NUE 0
#> 22 2019-11-02 09:45:00 HAM - NUE 0
#> 23 2019-11-02 09:45:00 HAM - NUE 0
#> 24 2019-11-02 17:55:00 HAM - NUE 0
#> 25 2019-11-02 19:20:00 HAM - NUE 0
#> 26 2019-11-02 10:25:00 NUE - HAM 0
#> 27 2019-11-02 10:25:00 NUE - HAM 0
#> 28 2019-11-02 10:25:00 NUE - HAM 0
#> 29 2019-11-02 19:20:00 STR - BRE 0
#> 30 2019-11-02 19:20:00 STR - BRE 0
#> 31 2019-11-02 14:25:00 STR - HAJ 0
#> 32 2020-01-19 14:00:00 HAM - FDH 0
It does not make sense to have flight duration of 0. I would expect either it is actually takes exactly around 1 day for the flight, data input error, or the flight is cancelled. Some research regarding some of routes shown above, using only Google flights, showing that the flight route exists, and it usually takes around 1-3 hours.
Therefore, since this only affects 32 rows of data, I think to be safe, we should drop it from consideration.
flfare_clean2 <-
flfare_clean2 %>%
filter(flight_duration != 0)
summary(flfare_clean2)#> departure_city arrival_city scrape_date
#> Length:62594 Length:62594 Min. :2019-10-18
#> Class :character Class :character 1st Qu.:2019-10-20
#> Mode :character Mode :character Median :2019-10-22
#> Mean :2019-10-21
#> 3rd Qu.:2019-10-24
#> Max. :2019-10-24
#>
#> departure_date departure_date_distance departure_time
#> Min. :2019-10-25 Length:62594 Length:62594
#> 1st Qu.:2019-11-06 Class :character Class :character
#> Median :2019-12-01 Mode :character Mode :character
#> Mean :2019-12-27
#> 3rd Qu.:2020-01-24
#> Max. :2020-04-24
#>
#> arrival_time airline stops price
#> Length:62594 Lufthansa :45901 1 stop :29244 Min. : 26.0
#> Class :character Eurowings :12289 2 stops: 4072 1st Qu.: 100.0
#> Mode :character easyJet : 2935 direct :29278 Median : 164.0
#> KLM : 339 Mean : 215.5
#> Luxair : 290 3rd Qu.: 333.0
#> British Airways: 197 Max. :3088.0
#> (Other) : 643
#> departure_datetime arrival_datetime route
#> Min. :2019-10-25 06:00:00 Min. :2019-10-25 06:50:00 TXL - NUE: 2968
#> 1st Qu.:2019-11-06 11:26:15 1st Qu.:2019-11-06 14:55:00 NUE - TXL: 2764
#> Median :2019-12-01 06:50:00 Median :2019-12-01 11:55:00 FDH - TXL: 1790
#> Mean :2019-12-27 23:17:30 Mean :2019-12-28 04:18:05 TXL - FDH: 1759
#> 3rd Qu.:2020-01-24 06:00:00 3rd Qu.:2020-01-24 08:10:00 TXL - CGN: 1586
#> Max. :2020-04-24 22:30:00 Max. :2020-04-25 13:50:00 CGN - TXL: 1523
#> (Other) :50204
#> flight_duration
#> Min. : 300
#> 1st Qu.: 4200
#> Median :11100
#> Mean :18035
#> 3rd Qu.:27600
#> Max. :86100
#>
Spread of Departure Date
flfare_clean2 %>%
group_by(departure_date) %>%
summarise(freq = n()) %>%
plot()As we see above, some dates appear to be missing. Therefore, we might not be able to consider Time-Series Analysis for our case here, unless we found a more complete data. Or maybe we need to aggregate it weekly, but we would need to explore the data a bit more.
Spread of Price
# boxplot(flfare_clean2$price)
flfare_viz1 <-
flfare_clean2 %>%
ggplot(mapping = aes(y = price))+
geom_boxplot()
ggplotly(flfare_viz1)As shown at the summary statistics and the boxplot above, the price range is quite far, from 26 EUR to the highest of 3088 EUR! Surely some of prices are outliers, or have some special cases that we haven’t quite see, like distance, or ticket class that is unfortunately is not captured within our dataset.
Route vs Price
ggplot(data = flfare_clean2, mapping = aes(x = route, y = price))+
geom_boxplot(fill = "cyan")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
labs(title = "Route vs Price",
x = "Available Routes",
y = "Price in EUR")There are just too many routes! I think we need to focus on only some of routes to see any patterns here, but 1 insight that I can take from here is especially from FRA - MUC route, having so many obvious and quite standing out outliers, compared to other routes.
Seeing that our data might be affected by outliers, I will try to summarize some insights regarding price using median.
flfare_clean2 %>%
group_by(route) %>%
summarise(med_price = median(price)) %>%
arrange(desc(med_price)) %>%
head(10) %>%
ggplot(mapping = aes(x = med_price, y = reorder(route, med_price)))+
geom_col(fill = "blue", alpha = 0.3)+
labs(title = "Top 10 Routes with Highest Median Fares",
y = "Route",
x = "Median Price in EUR")2 clear route winners here, are back-to-back routes of SCN - TXL and TXL - SCN. Both really stand out and have the median prices of above 1,000 EUR. Other than that, the median prices are quite similar at the middle of 250 -500 scale, which should put them close to 375 EUR.
We can also notice some pairings of back-to-back routes here within the top 10 routes, like LEJ - STR in 3rd highest with STR - LEJ in 7th, or DUS - FDH (4th) and FDH - DUS (6th), showing insights that the median prices between opposite direction of a route shoule be not so much different.
flfare_clean2 %>%
group_by(route) %>%
summarise(med_price = median(price)) %>%
arrange(med_price) %>%
head(10) %>%
ggplot(mapping = aes(x = med_price, y = reorder(route, desc(med_price))))+
geom_col(fill = "red", alpha = 0.3)+
labs(title = "10 Routes with Lowest Median Fares",
y = "Route",
x = "Median Price in EUR")Since the scale is much lower here, we can say that the difference between the lowest median fares are not too large, as the lowest fare at TXL - MUC differ around 50 EUR compared to the 10th place of FKB - TXL.
Duration of Flight vs Price
ggplot(data = flfare_clean2, mapping = aes(x = flight_duration, y = price))+
geom_point()+
labs(title = "Flight Duration vs Price",
y = "Price in EUR",
x = "Flight Duration in seconds") Despite logically there should be some kind of correlation between flight duration to price (we would think that longer flights = more distanced airports = more costly), it appears that there is no clear correlation between them as we can see here.
Number of Stops vs Price
flfare_clean2 %>%
ggplot(mapping = aes(x = stops, y = price))+
geom_boxplot(fill = "pink")+
labs(title = "Number of Stops vs Prices",
y = "Number of Stops",
x = "Price in EUR") Similar to Flight Duration, we would expect to see Number of Stops might factor in the higher flight prices. This is more or less reflected by the Maximum value and even Interquartile Range of the boxplot above, which should predict around 50% of our more centralized data, that Direct prices is more lower than 1 Stop, and 1 Stop is generally more lower than 2 Stops.
The median is quite strange though. As we can see, 1-stop flights is shown to have higher prices compared to both 2-stops and direct ones. Also, 2-stops and direct ones are quite similar if we see the median prices. Will this affect our final model later on?
Airline vs Price
flfare_clean2 %>%
group_by(airline) %>%
summarise(med_price = median(price)) %>%
ggplot(mapping = aes(x = med_price, y = reorder(airline, med_price)))+
geom_col(fill = "blue", alpha = 0.3)+
labs(title = "Airline vs Median Fares",
y = "Airlines",
x = "Median Fare in EUR")Airline Mehrere Fluglinien is clearly having the highest fare here, as it is the only one touching more than 800 EUR as even the median price. Setting this aside, the second highest in SAS is around 500 EUR, making it almost twice the difference. Meanwhile, the easyJet airline is really low on the list, expected to have a median of around 25 EUR. Airline branding could really play a part here in predicting the price.
Meanwhile, on the summary statistics we can see that Lufthansa is having the highest number observations among other airlines, contributing 40,000+ out of our 62,000+ total rows. Clearly it will have some value in later, but whether we should just focus on this airline, let’s see when are doing the modelling.
For now, I want to see how the price is spread only within Lufthansa.
flfare_clean2 %>%
filter(airline == "Lufthansa") %>%
ggplot(mapping = aes(x = price))+
geom_boxplot(fill = "turquoise")+
labs(title = "Lufthansa Fares",
x = "Price in EUR")Most of the fares fall below 600 EUR. It should be noted that we might treat the fares above 600 as outliers, if we want to focus only on Lufthansa when it comes to the modelling.
Product Design
Machine Learning Algorithm
I was thinking of using Time-Series Analysis due to the possibility of the seasonality of the data would make a sense to the price pattern, but then again when checking the spread the of departure_date, seems that a lot of dates are not covered and have a lot of empty ones. Since a Time-Series would require a complete sequence of data, forcing to impute a lot of missing dates would not be a great course of action because our analysis would only based on the assumption of the imputation, not based on fact.
Therefore, the best course I think of, would be to use Regression using Simple Linear Regression, Random Forest, and Neural Network. Or other algorithms that would be great that I found along the research, I might try them here.
Dashboard Features
- Mandatory drop-down menus and date pickers for the mandatory predictors.
- Optional drop-down menus and date pickers for the optional predictors.
- Output of text recommendation of top 3 best dates and airlines with the best (cheapest) fares.
- In a different page, I would love to see overall statistics, shown by visually pleasing interactive plots, of which airline has the best fare, best not-delay / on-time prediction, etc.
Project Output
Interactive webpage dashboard, which has the following features:-
Interactive webpage where we can input some parameters into, like: date picker to pick the range of desired departure date (required), drop-down choices input of arrival airport or city (required), drop-down choices input of departure airport or city (required), drop-down choices of possible airline carriers (optional), drop-down choices of possible airline classes (optional).
When the required parameters are filled, the expected output would be at least a sentence of the best range of date possible for 1 top flight carrier with the best (cheapest) price possible. Optionally, it would be nice to have few other secondary options of airline carriers and/or date range as the recommendation. -
Interactive informative dashboard concerning delay statistics of all flights arriving to the destination city, top overall flight carriers according to fare and on-time rate. Inputs will be optional, in which case by default it will show the overall result. When inputs are filled in, the information shown will be processed to be more specified.
Inputs will include: date picker to pick the range of desired departure date, drop-down choices input of arrival airport or city, drop-down choices input of departure airport or city, drop-down choices of possible airline carriers, drop-down choices of possible airline classes.
Desired outputs would be some sentences of recommendations like the best date range to travel in the destination places, and some plots, like line chart of forecasted lowest price according to a date range, bar plots for best airline carriers according to fare or on-time rate, scatter plots of different flight carriers according to their characteristics, etc.
Similar Interpretations
As it would be likely for me to use Time-Series Forecasting or some other machine learning methods for optimization and prediction purposes, this project could also be implemented to similar businesses that have some seasonality and fluctuating fares factored into the accuracy of the modeling, for examples (but not limited to): train fare prediction, rental house and accommodation prices prediction, etc. In other words, the fare vs time optimization case that would be presented in the project could be highly implemented across industries.