| Matric No. | Name |
|---|---|
| 23054201 | Lim Xin Yi |
| 22109241 | Lai Yong Lin |
| 17161761 | Ong Lay Shen |
| S2028502 | Chin Jia Yi |
| 22109241 | Lee Ching Men |
In the current landscape, airline corporations employ intricate strategies for calculating flight ticket fares, resulting in a highly complex and dynamic pricing structure. This complexity makes it challenging for customers to anticipate fares accurately, as prices change frequently, and the cost of air travel can vary significantly depending on numerous factors. Precise flight price prediction is therefore valuable both for travellers seeking to plan their trips cost-effectively and for airlines aiming to optimise revenue management. Machine learning models can be employed to analyse historical data and predict future flight prices from the relevant features.
The problem revolves around predicting flight prices accurately, given the uncertainty and volatility in the airline industry. The pricing of flight tickets is a complex process influenced by factors such as demand, seasonality, and market dynamics. A model that can accurately predict flight prices can help both passengers and airlines make informed decisions, optimise pricing strategies, and improve the overall travel experience.
Objective I: Analyse Historical Flight Data to Identify Factors Influencing Flight Price
Perform exploratory data analysis to gain insights into the distribution of flight prices, correlations between features, and identify patterns that may impact pricing.
Objective II: Establish Categorical Price Tiers within Flight Pricing Data
Develop a categorisation system within the flight price dataset to classify each ticket price into one of three categories: low, medium, or high. This will involve calculating the median price for each source-destination pair, labelling prices based on their deviation from the median, and preparing the data by encoding categorical variables for subsequent classification modelling.
Objective III: Compare Predictive Models for Accurate Price Forecasting
Evaluate and compare the effectiveness of linear regression and random forest regression models in predicting flight prices. The comparison will be made based on the calculation of Mean Squared Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE) for both models, followed by a thorough cross-validation process to corroborate the findings.
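For reference, these three metrics follow their standard definitions, with $y_i$ the observed price, $\hat{y}_i$ the predicted price, and $n$ the number of test observations:

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2,\qquad \mathrm{RMSE}=\sqrt{\mathrm{MSE}},\qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$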
The dataset contains information about flight booking options from the website “Ease My Trip” for flight travel between India’s top six metro cities. There are 10,683 distinct flight booking records and 11 features in the dataset. The raw data was obtained from Kaggle.
Variable Definition:
1. Airline: The name of the airline operating the flight.
2. Date_of_Journey: The date on which the journey starts.
3. Source: The city from which the flight departs.
4. Destination: The city at which the flight arrives.
5. Route: The route taken by the flight to reach the destination.
6. Dep_Time: The time at which the flight departs from the source.
7. Arrival_Time: The time at which the flight arrives at the destination.
8. Duration: The total time the flight takes to complete the journey.
9. Total_Stops: The number of stops between the source and destination.
10. Additional_Info: Additional information about the flight.
11. Price: The ticket price for the complete journey.
Loading libraries:
library(lubridate)
library(readxl)
library(hms)
library(dplyr)
library(ggplot2)
library(tidyr)
library(fastDummies)
library(randomForest)
library(caret)
library(Metrics)
library(rpart)
library(rpart.plot)
library(corrplot)
Loading dataset:
df<-read_excel("Data_Train.xlsx")
head(df)
Exploring the structure of the dataset:
glimpse(df)
## Rows: 10,683
## Columns: 11
## $ Airline <chr> "IndiGo", "Air India", "Jet Airways", "IndiGo", "IndiG…
## $ Date_of_Journey <chr> "24/03/2019", "1/05/2019", "9/06/2019", "12/05/2019", …
## $ Source <chr> "Banglore", "Kolkata", "Delhi", "Kolkata", "Banglore",…
## $ Destination <chr> "New Delhi", "Banglore", "Cochin", "Banglore", "New De…
## $ Route <chr> "BLR → DEL", "CCU → IXR → BBI → BLR", "DEL → LKO → BOM…
## $ Dep_Time <chr> "22:20", "05:50", "09:25", "18:05", "16:50", "09:00", …
## $ Arrival_Time <chr> "01:10 22 Mar", "13:15", "04:25 10 Jun", "23:30", "21:…
## $ Duration <chr> "2h 50m", "7h 25m", "19h", "5h 25m", "4h 45m", "2h 25m…
## $ Total_Stops <chr> "non-stop", "2 stops", "2 stops", "1 stop", "1 stop", …
## $ Additional_Info <chr> "No info", "No info", "No info", "No info", "No info",…
## $ Price <dbl> 3897, 7662, 13882, 6218, 13302, 3873, 11087, 22270, 11…
The dataset comprises 10,683 records with 11 columns, where each column represents a distinct attribute.
Checking if there are any duplicate records in the dataset:
duplicates<-sum(duplicated(df))
cat("Number of duplicated rows: ",duplicates,"\n")
## Number of duplicated rows: 220
df<-df[!duplicated(df),]
Duplicate records are eliminated, keeping only the first occurrence of each.
Checking if there are any missing values in the dataset:
cat("Number of missing values for each column: \n")
## Number of missing values for each column:
colSums(is.na(df))
## Airline Date_of_Journey Source Destination Route
## 0 0 0 0 1
## Dep_Time Arrival_Time Duration Total_Stops Additional_Info
## 0 0 0 1 0
## Price
## 0
missing_value<-nrow(df[!complete.cases(df),])
cat("Rows with missing values: ",missing_value,"\n")
## Rows with missing values: 1
cat("Percentage of rows with missing values: ",round(missing_value/nrow(df)*100,2),"%\n")
## Percentage of rows with missing values: 0.01 %
Only one record is identified as incomplete. Given that the percentage of rows with missing values is well below 1%, it is acceptable to remove the row; no meaningful data bias is anticipated from this removal.
df1<-na.omit(df)
Converting Date_of_Journey into date format:
df1$Date_of_Journey<-dmy(df1$Date_of_Journey)
Converting Dep_Time and Arrival_Time into time format:
# Parse the "HH:MM" strings into hms time-of-day values
# (as_hms() is the current hms API; the older as.hms() is deprecated)
df1$Dep_Time<-as_hms(as.POSIXct(strptime(df1$Dep_Time, format="%H:%M")))
df1$Arrival_Time<-as_hms(as.POSIXct(strptime(df1$Arrival_Time, format="%H:%M")))
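As a quick supplementary check (not part of the original run), it is worth confirming that the parsing introduced no NAs; entries such as “01:10 22 Mar” are handled because strptime() stops reading once the "%H:%M" format has been consumed:
# Count NAs introduced by time parsing (both should be 0)
sum(is.na(df1$Dep_Time))
sum(is.na(df1$Arrival_Time))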
Transforming Duration from character data type to numeric data type:
# Extract the hour component from the first characters (e.g. "19h" -> 19);
# minutes-only entries such as "5m" fail the conversion and default to 0
df1$Hour<-trimws(substr(df1$Duration,1,3))
df1$Hour<-as.integer(sub("h","",df1$Hour))
df1$Hour[is.na(df1$Hour)]<-0
# Extract the minute component from the tail of the string (e.g. "50m" -> 50);
# hours-only entries such as "19h" default to 0
df1$Minute<-trimws(substr(df1$Duration,
                          nchar(df1$Duration)-2,
                          nchar(df1$Duration)))
df1$Minute<-as.integer(sub("m","",df1$Minute))
df1$Minute[is.na(df1$Minute)]<-0
# Combine into a single numeric duration
df1$Duration<-df1$Hour + df1$Minute/60 # unit: hour
df1<-subset(df1, select=-c(Hour,Minute))
The Duration column has been converted to a numeric data type in units of hours.
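An equivalent, somewhat more robust way to parse Duration relies on regular expressions rather than fixed character positions. A minimal sketch, illustrative only, using a hypothetical helper extract_unit on the raw character column (here called raw_duration, since df1$Duration is already numeric at this point):
# Extract the integer preceding a given unit letter; rows without that unit get 0
extract_unit <- function(x, unit) {
  m <- regexpr(paste0("[0-9]+(?=", unit, ")"), x, perl = TRUE)
  out <- rep(0L, length(x))
  hit <- m != -1
  out[hit] <- as.integer(regmatches(x, m))
  out
}
# "2h 50m" -> 2 + 50/60; "19h" -> 19; "5m" -> 5/60
# duration_hours <- extract_unit(raw_duration, "h") + extract_unit(raw_duration, "m")/60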
Transforming Total_Stops from character data type to numeric data type:
cat("Unique values of Total Stops: \n",paste(unique(df1$Total_Stops),collapse="\n"),"\n")
## Unique values of Total Stops:
## non-stop
## 2 stops
## 1 stop
## 3 stops
## 4 stops
# Map the textual stop counts onto integers (unmatched values, if any, become NA)
df1$Total_Stops<-case_when(df1$Total_Stops=="non-stop"~0L,
                           df1$Total_Stops=="1 stop"~1L,
                           df1$Total_Stops=="2 stops"~2L,
                           df1$Total_Stops=="3 stops"~3L,
                           df1$Total_Stops=="4 stops"~4L)
Removing an irrelevant column from the dataset:
cat("Unique values of Additional Info: \n",paste(unique(df1$Additional_Info),collapse="\n"),"\n")
## Unique values of Additional Info:
## No info
## In-flight meal not included
## No check-in baggage included
## 1 Short layover
## No Info
## 1 Long layover
## Change airports
## Business class
## Red-eye flight
## 2 Long layover
df1<-subset(df1, select=-Additional_Info)
Checking the statistical summary of the numeric variables:
df2<-df1 %>% select(Duration, Total_Stops, Price)
summary(df2)
## Duration Total_Stops Price
## Min. : 0.08333 Min. :0.0000 Min. : 1759
## 1st Qu.: 2.83333 1st Qu.:0.0000 1st Qu.: 5224
## Median : 8.41667 Median :1.0000 Median : 8266
## Mean :10.49636 Mean :0.8023 Mean : 9027
## 3rd Qu.:15.16667 3rd Qu.:1.0000 3rd Qu.:12345
## Max. :47.66667 Max. :4.0000 Max. :79512
Getting the mode of the categorical variables:
df3<-df1 %>% select(Airline, Source, Destination, Route)
getMode<-function(v) {
uniqv<-unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getMode(df3$Airline)
## [1] "Jet Airways"
getMode(df3$Source)
## [1] "Delhi"
getMode(df3$Destination)
## [1] "Cochin"
Remove the ‘Route’ column: it is not necessary for the model, especially given its high cardinality; the ‘Total_Stops’ feature already captures the essence of ‘Route’.
df4 <- df1[ , !(names(df1) %in% c("Route"))]
Step 1: Verify the structure and categories of each attribute
str(df1)
## tibble [10,462 × 10] (S3: tbl_df/tbl/data.frame)
## $ Airline : chr [1:10462] "IndiGo" "Air India" "Jet Airways" "IndiGo" ...
## $ Date_of_Journey: Date[1:10462], format: "2019-03-24" "2019-05-01" ...
## $ Source : chr [1:10462] "Banglore" "Kolkata" "Delhi" "Kolkata" ...
## $ Destination : chr [1:10462] "New Delhi" "Banglore" "Cochin" "Banglore" ...
## $ Route : chr [1:10462] "BLR → DEL" "CCU → IXR → BBI → BLR" "DEL → LKO → BOM → COK" "CCU → NAG → BLR" ...
## $ Dep_Time : 'hms' num [1:10462] 22:20:00 05:50:00 09:25:00 18:05:00 ...
## ..- attr(*, "units")= chr "secs"
## $ Arrival_Time : 'hms' num [1:10462] 01:10:00 13:15:00 04:25:00 23:30:00 ...
## ..- attr(*, "units")= chr "secs"
## $ Duration : num [1:10462] 2.83 7.42 19 5.42 4.75 ...
## $ Total_Stops : int [1:10462] 0 2 2 1 1 0 1 1 1 1 ...
## $ Price : num [1:10462] 3897 7662 13882 6218 13302 ...
## - attr(*, "na.action")= 'omit' Named int 8894
## ..- attr(*, "names")= chr "8894"
df1
Categorical Variables
plot_categorical <- function(data, column) {
  # Use the data argument rather than a hard-coded data frame,
  # and .data[[column]] in place of the deprecated aes_string()
  ggplot(data, aes(x = .data[[column]])) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title = paste("Count Plot of", column), x = column, y = "Count")
}
# Plots for Airline, Source, Destination and Total_Stops
plot_categorical(df4, "Airline")
plot_categorical(df4, "Source")
plot_categorical(df4, "Destination")
plot_categorical(df4, "Total_Stops")
plot_categorical(data, "Total_Stops")
Due to the high number of unique values in some columns, count plots for them would be too cluttered to provide clear insights. Hence, univariate analysis is only performed for Airline, Source, Destination, and Total_Stops.
Continuous Variables
# Histogram for Price (note the + before labs(); without it the labels
# object is printed instead of being added to the plot)
ggplot(df1, aes(x=Price)) +
  geom_histogram(bins=30, fill="blue", color="black") +
  labs(title="Histogram of Price", x="Price", y="Frequency")
# Converting categorical variables to numeric codes
data_num <- df1 %>%
  mutate_if(is.character, as.factor) %>%
  mutate_if(is.factor, as.numeric) %>%
  mutate_if(is.integer, as.numeric)
head(data_num)
str(data_num)
## tibble [10,462 × 10] (S3: tbl_df/tbl/data.frame)
## $ Airline : num [1:10462] 4 2 5 4 4 9 5 5 5 7 ...
## $ Date_of_Journey: Date[1:10462], format: "2019-03-24" "2019-05-01" ...
## $ Source : num [1:10462] 1 4 3 4 1 4 1 1 1 3 ...
## $ Destination : num [1:10462] 6 1 2 1 6 1 6 6 6 2 ...
## $ Route : num [1:10462] 19 85 119 92 30 65 6 6 6 105 ...
## $ Dep_Time : 'hms' num [1:10462] 22:20:00 05:50:00 09:25:00 18:05:00 ...
## ..- attr(*, "units")= chr "secs"
## $ Arrival_Time : 'hms' num [1:10462] 01:10:00 13:15:00 04:25:00 23:30:00 ...
## ..- attr(*, "units")= chr "secs"
## $ Duration : num [1:10462] 2.83 7.42 19 5.42 4.75 ...
## $ Total_Stops : num [1:10462] 0 2 2 1 1 0 1 1 1 1 ...
## $ Price : num [1:10462] 3897 7662 13882 6218 13302 ...
## - attr(*, "na.action")= 'omit' Named int 8894
## ..- attr(*, "names")= chr "8894"
Correlation Analysis
# Date_of_Journey is already a Date object (converted earlier with dmy()),
# so the day and month can be extracted directly
data_num$Journey_Day <- day(data_num$Date_of_Journey)
data_num$Journey_Month <- month(data_num$Date_of_Journey)
# Select only numeric variables (including the newly created ones)
data_numeric <- data_num %>%
select_if(~is.numeric(.))
# Compute correlation matrix
cor_matrix <- cor(data_numeric, use="complete.obs")
# Plot the correlation matrix
corrplot(cor_matrix, method="color")
The correlation matrix shows that Duration and Total_Stops have the strongest correlations with Price. Note that correlations involving the arbitrarily numbered nominal variables (Airline, Source, Destination, Route) should be interpreted with caution.
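To make this concrete, the correlations with Price can be read off the matrix directly and ranked (a small supplementary check using the cor_matrix computed above):
# Rank all variables by the absolute strength of their correlation with Price
# (Price's correlation with itself, 1, appears first)
price_cor <- cor_matrix[, "Price"]
sort(abs(price_cor), decreasing = TRUE)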
Step 1: Define the Target Variable ‘Price_Category’
# 'Route' was already dropped above when df4 was created
# Calculate median price for each Source-Destination pair
median_prices <- df4 %>%
group_by(Source, Destination) %>%
summarize(MedianPrice = median(Price, na.rm = TRUE), .groups = 'drop') # Ensure group structure is dropped
# Join the median prices back to the original dataframe
df5 <- df4 %>%
left_join(median_prices, by = c("Source", "Destination"))
# Categorize prices relative to the median price for that route
df5 <- df5 %>%
mutate(Price_Category = case_when(
Price < 0.8 * MedianPrice ~ "Low", # Prices less than 80% of the median price are considered 'Low'
Price > 1.2 * MedianPrice ~ "High", # Prices more than 120% of the median price are considered 'High'
TRUE ~ "Medium" # Prices within +/- 20% of the median are considered 'Medium'
))
# Remove the intermediate MedianPrice column if no longer needed
df5 <- df5 %>% select(-MedianPrice)
Step 2: Encode Categorical Variables
# install.packages("fastDummies")
# Create dummy variables
df5 <- dummy_cols(df5, select_columns = c("Airline", "Source", "Destination"))
# Drop the original categorical columns
df5a <- df5 %>% select(-Airline, -Source, -Destination)
Step 3: Split Data into Training and Testing Sets
# Set seed for reproducibility
set.seed(234)
# Create data partition and subset data
split_df5a <- createDataPartition(df5a$Price_Category, p = 0.8, list = FALSE)
train_df5a <- df5a[split_df5a, ]
test_df5a <- df5a[-split_df5a, ]
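Because createDataPartition() performs stratified sampling on the supplied outcome, the class proportions should be nearly identical in both partitions; a quick check:
# Verify that the three classes keep similar proportions in train and test
prop.table(table(train_df5a$Price_Category))
prop.table(table(test_df5a$Price_Category))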
Step 4: Machine Learning Model
# install.packages("rpart.plot")
# Train the Decision Tree model
tree_model <- rpart(Price_Category ~ ., data = train_df5a, method = "class")
# Visualize the Decision Tree
rpart.plot(tree_model)
# Make predictions on the test set
predictions <- predict(tree_model, newdata = test_df5a, type = "class")
Step 5: Evaluate the Model
# Convert 'test_df5a$Price_Category' to factor with the same levels as 'predictions'
test_df5a$Price_Category <- factor(test_df5a$Price_Category, levels = levels(predictions))
# calculate the confusion matrix
conf_matrix <- confusionMatrix(predictions, test_df5a$Price_Category)
# Print out the confusion matrix
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 649 6 8
## Low 2 633 22
## Medium 37 5 729
##
## Overall Statistics
##
## Accuracy : 0.9617
## 95% CI : (0.9526, 0.9695)
## No Information Rate : 0.363
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9425
##
## Mcnemar's Test P-Value : 7.027e-07
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.9433 0.9829 0.9605
## Specificity 0.9900 0.9834 0.9685
## Pos Pred Value 0.9789 0.9635 0.9455
## Neg Pred Value 0.9727 0.9923 0.9773
## Prevalence 0.3290 0.3080 0.3630
## Detection Rate 0.3104 0.3027 0.3486
## Detection Prevalence 0.3171 0.3142 0.3687
## Balanced Accuracy 0.9667 0.9832 0.9645
Steps 1 to 3 built the foundation for a classification model by converting each ticket price into a ‘Low’, ‘Medium’, or ‘High’ category relative to the median price for its source-destination pair. Specifically, prices were labelled ‘Low’ if falling below 80% of the associated median, ‘High’ if exceeding 120%, and ‘Medium’ otherwise. Categorical variables were then encoded for machine learning, and the data was split for training and validation.
Steps 4 and 5 introduced a decision tree model to predict these categories. Based on the confusion matrix, the model performs well, with an overall accuracy of 96.17% and high sensitivity, specificity, and precision across all price categories. The ‘Medium’ category shows a slightly lower positive predictive value than ‘High’ and ‘Low’, plausibly because medium-range tickets overlap in their characteristics with both the high and low categories, making accurate classification more difficult. It is worth noting that Price itself remains among the predictors while Price_Category is derived from Price, which likely inflates these accuracy figures relative to a setting where the raw price is unknown.
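The per-class figures quoted above can be pulled out of the caret object directly rather than read off the printed summary (a small convenience, using the conf_matrix computed in Step 5):
# Extract selected per-class statistics from the confusion matrix object
round(conf_matrix$byClass[, c("Sensitivity", "Specificity",
                              "Pos Pred Value", "Balanced Accuracy")], 4)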
Linear Regression Model vs Random Forest Model
Step 1: Split the data into training and test sets for the baseline linear regression and random forest models
set.seed(123)
indices <- sample(1:nrow(df4), size = floor(0.8 * nrow(df4)))
train_data <- df4[indices, ]
test_data <- df4[-indices, ]
Step 2: Model Training
# Linear Regression
lm_model <- lm(Price ~ ., data = train_data)
# Random Forest
rf_model <- randomForest(Price ~ ., data = train_data)
Step 3: Model Evaluation
# Evaluation for Linear Regression
predicted_prices_lm <- predict(lm_model, newdata = test_data)
# Evaluation for Random Forest
predicted_prices_rf <- predict(rf_model, newdata = test_data)
Step 4: Evaluate performance using three common metrics: Mean Squared Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE)
# Output Linear Regression performance
mse_lm <- mse(test_data$Price, predicted_prices_lm)
rmse_lm <- sqrt(mse_lm)
mae_lm <- mae(test_data$Price, predicted_prices_lm)
print(paste("Linear Regression MSE:", mse_lm))
## [1] "Linear Regression MSE: 8071254.09554571"
print(paste("Linear Regression RMSE:", rmse_lm))
## [1] "Linear Regression RMSE: 2840.99526496362"
print(paste("Linear Regression MAE:", mae_lm))
## [1] "Linear Regression MAE: 1922.48144677609"
# Output Random Forest performance
mse_rf <- mse(test_data$Price, predicted_prices_rf)
rmse_rf <- sqrt(mse_rf)
mae_rf <- mae(test_data$Price, predicted_prices_rf)
print(paste("Random Forest MSE:", mse_rf))
## [1] "Random Forest MSE: 4824253.59439937"
print(paste("Random Forest RMSE:", rmse_rf))
## [1] "Random Forest RMSE: 2196.4183559603"
print(paste("Random Forest MAE:", mae_rf))
## [1] "Random Forest MAE: 1479.14260638997"
Step 5: Final Evaluation with Cross-Validation
# Set up cross-validation for linear regression
train_control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train the model
lm_cv_model <- train(Price ~ .,
data = train_data,
method = "lm",
trControl = train_control)
# Summarize the results
print(lm_cv_model)
## Linear Regression
##
## 8369 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 7530, 7533, 7532, 7532, 7531, 7533, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2891.999 0.6092504 1965.419
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Set up cross-validation for random forest
train_control_rf <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train the Random Forest model with cross-validation
rf_cv_model <- train(Price ~ .,
data = train_data,
method = "rf",
trControl = train_control_rf)
# Summarize the Random Forest results
print(rf_cv_model)
## Random Forest
##
## 8369 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 7532, 7531, 7532, 7532, 7532, 7532, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2855.294 0.6782484 2009.968
## 13 1883.365 0.8358895 1146.837
## 25 1991.752 0.8164410 1152.752
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 13.
Linear regression and random forest models were built to predict actual ticket prices. Both models show promise, but the random forest model outperforms the linear regression model on all counts (MSE, RMSE, MAE), indicating it is better able to capture the complexities of the dataset.
Cross-validation results confirm the initial findings: the linear regression model has higher error rates than the random forest model. For the random forest, an mtry of 13 (the number of variables randomly sampled as candidates at each split) gives the lowest cross-validated RMSE, outperforming both the smaller (mtry = 2) and larger (mtry = 25) settings.
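If this tuning result were to be locked in for subsequent runs, caret’s tuneGrid argument allows fixing mtry explicitly (a minimal sketch reusing the objects defined above):
# Refit the random forest with mtry fixed at the cross-validated optimum
rf_final <- train(Price ~ ., data = train_data,
                  method = "rf",
                  trControl = train_control_rf,
                  tuneGrid = data.frame(mtry = 13))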
The classification model developed for Objective II not only achieved a high degree of accuracy but also demonstrated practical utility in segmenting prices into ‘Low’, ‘Medium’, and ‘High’ categories. The use of a decision tree model in this context afforded transparency into the pricing structure, making it especially valuable for customer-facing applications that require quick and comprehensible price indication.
In addressing Objective III, our investigation uncovered significant differences between linear and non-linear modelling techniques. The superior performance of the random forest model reflects the complex, non-linear nature of airfare pricing, while also highlighting the trade-off between computational cost and predictive accuracy. Cross-validation reinforced the conclusion that, for this dataset, the more nuanced random forest approach is preferable to linear regression.
The future of this initiative is promising and brimming with potential. Improved hyperparameter tuning, exploration of ensemble methods, and the integration of advanced regressors like gradient boosting machines are the logical next steps. Additionally, investing effort into feature engineering is expected to uncover more intricate relationships within the data, further enhancing the performance and utility of both the classification and regression models.
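As an illustration of the gradient-boosting direction mentioned above, a baseline could be trained through the same caret interface (a sketch only; it assumes the gbm package is installed and reuses train_data and train_control from earlier):
# Gradient boosting baseline via caret; verbose = FALSE silences gbm's per-iteration output
gbm_cv_model <- train(Price ~ ., data = train_data,
                      method = "gbm",
                      trControl = train_control,
                      verbose = FALSE)
print(gbm_cv_model)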
Reflecting on the broader implications, the findings from this assignment have significant ramifications for the airline industry. The capacity to predict and effectively categorize airfare could translate into more sophisticated pricing strategies, optimized revenue management, and a more personalized customer experience. In a landscape marked by volatility and fierce competition, the ability to not just respond to but anticipate market dynamics confers a tangible competitive edge. By embracing the power of data-driven decision-making models, stakeholders in the airline industry can navigate the complexities of pricing with greater confidence and precision, ultimately leading to more profitable and customer-centric outcomes.