Group Number:
Group 4
Title:
Machine Learning in Aviation: Predicting Fares and Ticket Classifications
Members:
Khor Kean Teng, Loong Shih-Wai, Nur Maisarah Jalalulail, Pang Suk Min, Tioh Zi Cong
1.1.1 Background
Air travel has become an integral part of modern society with millions of passengers flying for business and leisure. The rapidly expanding airline industry is also considered one of the most sophisticated industries in employing complex pricing strategies. This is because flight fare pricing is dynamic and complex, involving numerous factors such as timing, demand, seasonality, and market conditions. The same airline ticket prices can vary dynamically and significantly, even for nearby seats within the same cabin. Research reports that the same ticket for a specific flight can change up to seven times a day.
Admittedly, flight fare volatility creates uncertainty for both travelers seeking the best deals and airlines optimizing their revenue management. This interplay of goals creates a unique tension: customers are seeking to get the lowest price for their ticket, while airline companies aim to maximize their overall revenue and profit. Mismatches between available seats and passenger demand often lead to customers paying more or airlines losing revenue. While airlines are equipped with advanced tools and capabilities to control the pricing process, customers are also becoming more strategic with the development of online tools that compare prices across various airlines. This evolving dynamic underscores the growing sophistication and challenges in the flight pricing ecosystem.
Recent advancement in machine learning has opened new possibilities for more accurate fare prediction and classification models. Through analyzing vast amount of historical flight fare data, it opens up opportunities to develop predictive models that can forecast flight fares with better accuracy as well as greater transparency and efficiency in airline pricing.
1.1.2 Business Objectives
1.1.3 Business Success Criteria
Accurate prediction of flight-level passenger traffic is of paramount importance in airline operations, influencing key decisions from pricing to route optimization (Ehsani et al., 2024). However, forecasting flight fares is a critical tasks in the rapidly expanding aviation industry as it involves numerous factors. Nowadays, aviation industry adopt dynamic pricing to assign airfare prices utilizing complex strategies and methods involving financial, marketing, commercial and social factors.
The high complexity in airfare pricing models make it very difficult for customer to purchase an air ticket in the lowest price since the price changes dynamically. According to a (Abdella et al., 2021), the same ticket price for a specific flight can change up to 7 times a day, presenting a complex predictive modeling opportunities. The best time to buy a ticket is a key issue subjected to research. Various predictions models to optimize airline ticket purchasing have been proposed and achieved promising results. (Subramanian et al., 2022) proposed the use of machine learning algorithms in airline fare prediction utilizing various regression algorithms such as KNN, Linear Regresssion, Lasso, Ridge, Decision Tree, Stacking Tree and Random Forest on scaped data from MakeMyTrip, Data World and New Zealand Domestic airlines. The research results R2 score ranges from 60.24% to 83.83%; 27.95% to 84.46%; 37.16% to 81.89% highlighting the complex and nonlinear relationships of multiple factors in airline fare prediction. However, the data used on cover specific airports limited to certain time period, thus limited the result generalizability and practicality.
(Tziridis et al., 2017) proposed a comparative study using data from the Aegean Airlines using MLP, GRNN, Â ELM, RF, Regression Tree, Bagging Regression Tree, SVM and LR. The research outcome ranges from 49.40% to 87.42%, improvement is made with feature selection from 57.07% to 87.59%. Nonetheless, the research lacks practicality where only 1814 data flights collected in period between December and July is used for training, thus limiting the findings application to a broader cases and demographics.
Recent advancement in deep learning also ushered in novel forecasting approach in flight fare prediction. (Degife and Lin, 2023) suggested the use of gated recurrent unit, GRU and achieved a remarkable 98% R2 score and 5.93 RMSE representing a significant advacement in prediction accuracy. Nevertheless, the use of deep learning architecture can be computational intensive during the model training stage.
Above all, the landscape of flight fare prediction demonstrates significant potential and ongoing challenges. Despite the promising advancements in machine learning and deep learning techniques, current research is constrained by limited dataset generalizability, regional specificity, and computational intensity. Future research could emphasize on the development of more comprehensive, robust models that can integrate diverse data sources, encompass broader geographic regions, and balance predictive accuracy with computational efficiency. Moreover, interdisciplinary approaches that combine machine learning algorithms with domain expertise from aviation economics, marketing, and social sciences could provide more nuanced and adaptable fare prediction models, ultimately benefiting both airlines and passengers by offering more transparent and optimized pricing strategies.
The project make use of CRISP-DM framework to streamline the project workflows. R language will be used for data processing, modeling and evaluation tasks. The report for the project will then be written in RMarkdown and published on RPubs.
###2.1 Collect Initial Data
# clear workspace variables
rm(list = ls())
# set-up: turn off warnings and set plot size
options(warn = -1)
options(repr.plot.width = 10, repr.plot.height = 5)
# set working directory
setwd("C:\\Users\\Loong\\Downloads\\New folder")
# install the required packages
packages <- c("caret", "e1071", "tidyverse", "tidyr", "readr", "randomForest", "class", "nnet", "corrplot", "glmnet")
# Get the list of currently installed packages
installed_packages <- as.data.frame(installed.packages())['Package']
# Function to check and install missing packages
install_if_missing <- function(pkg) {
if (!(pkg %in% installed_packages$Package)) {
suppressMessages(install.packages(pkg))
}
}
# Apply the function to each package in the vector
sapply(packages, install_if_missing)
## $caret
## NULL
##
## $e1071
## NULL
##
## $tidyverse
## NULL
##
## $tidyr
## NULL
##
## $readr
## NULL
##
## $randomForest
## NULL
##
## $class
## NULL
##
## $nnet
## NULL
##
## $corrplot
## NULL
##
## $glmnet
## NULL
# import packages
suppressPackageStartupMessages({
# add the required packages here
library(tidyr)
library(readr)
library(tidyverse)
library(caret)
library(e1071)
library(randomForest)
library(class)
library(nnet)
library(corrplot)
library(glmnet)
})
# Download the dataset
system("kaggle datasets download yashdharme36/airfare-ml-predicting-flight-fares")
## [1] 0
cat("Data loaded...\n")
## Data loaded...
unzip("airfare-ml-predicting-flight-fares.zip")
cat("Done unzip...\n")
## Done unzip...
# first 5 rows of the data
data <- read_csv('Cleaned_dataset.csv', show_col_types = FALSE)
head(data, 5)
## # A tibble: 5 × 13
## Date_of_journey Journey_day Airline Flight_code Class Source Departure
## <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2023-01-16 Monday SpiceJet SG-8169 Economy Delhi After 6 PM
## 2 2023-01-16 Monday Indigo 6E-2519 Economy Delhi After 6 PM
## 3 2023-01-16 Monday GO FIRST G8-354 Economy Delhi After 6 PM
## 4 2023-01-16 Monday SpiceJet SG-8709 Economy Delhi After 6 PM
## 5 2023-01-16 Monday Air India AI-805 Economy Delhi After 6 PM
## # ℹ 6 more variables: Total_stops <chr>, Arrival <chr>, Destination <chr>,
## # Duration_in_hours <dbl>, Days_left <dbl>, Fare <dbl>
cat("The dimension of the dataset is", dim(data)[1], "rows and", dim(data)[2], "columns.\n")
## The dimension of the dataset is 452088 rows and 13 columns.
# Variables list with type
str(head(data))
## tibble [6 × 13] (S3: tbl_df/tbl/data.frame)
## $ Date_of_journey : Date[1:6], format: "2023-01-16" "2023-01-16" ...
## $ Journey_day : chr [1:6] "Monday" "Monday" "Monday" "Monday" ...
## $ Airline : chr [1:6] "SpiceJet" "Indigo" "GO FIRST" "SpiceJet" ...
## $ Flight_code : chr [1:6] "SG-8169" "6E-2519" "G8-354" "SG-8709" ...
## $ Class : chr [1:6] "Economy" "Economy" "Economy" "Economy" ...
## $ Source : chr [1:6] "Delhi" "Delhi" "Delhi" "Delhi" ...
## $ Departure : chr [1:6] "After 6 PM" "After 6 PM" "After 6 PM" "After 6 PM" ...
## $ Total_stops : chr [1:6] "non-stop" "non-stop" "non-stop" "non-stop" ...
## $ Arrival : chr [1:6] "After 6 PM" "Before 6 AM" "Before 6 AM" "After 6 PM" ...
## $ Destination : chr [1:6] "Mumbai" "Mumbai" "Mumbai" "Mumbai" ...
## $ Duration_in_hours: num [1:6] 2.08 2.33 2.17 2.08 2.17 ...
## $ Days_left : num [1:6] 1 1 1 1 1 1
## $ Fare : num [1:6] 5335 5899 5801 5794 5955 ...
# Segregate variables to type
float_columns <- names(data)[sapply(data, function(x) is.numeric(x) & !is.integer(x))]
string_columns <- names(data)[sapply(data, is.character)]
datetime_columns <- names(data)[sapply(data, function(x) inherits(x, "POSIXct") | inherits(x, "Date"))]
cat("Float columns:", paste(float_columns, collapse = ", "), "\n")
## Float columns: Duration_in_hours, Days_left, Fare
cat("String columns:", paste(string_columns, collapse = ", "), "\n")
## String columns: Journey_day, Airline, Flight_code, Class, Source, Departure, Total_stops, Arrival, Destination
cat("Datetime columns:", paste(datetime_columns, collapse = ", "), "\n")
## Datetime columns: Date_of_journey
# Check unique values for each column in the dataset
unique_values <- lapply(data, unique)
# Statistical summary of dataset
summary(data)
## Date_of_journey Journey_day Airline Flight_code
## Min. :2023-01-16 Length:452088 Length:452088 Length:452088
## 1st Qu.:2023-01-28 Class :character Class :character Class :character
## Median :2023-02-10 Mode :character Mode :character Mode :character
## Mean :2023-02-09
## 3rd Qu.:2023-02-22
## Max. :2023-03-06
## Class Source Departure Total_stops
## Length:452088 Length:452088 Length:452088 Length:452088
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Arrival Destination Duration_in_hours Days_left
## Length:452088 Length:452088 Min. : 0.750 Min. : 1.00
## Class :character Class :character 1st Qu.: 6.583 1st Qu.:13.00
## Mode :character Mode :character Median :11.333 Median :26.00
## Mean :12.349 Mean :25.63
## 3rd Qu.:16.500 3rd Qu.:38.00
## Max. :43.583 Max. :50.00
## Fare
## Min. : 1307
## 1st Qu.: 8763
## Median : 13407
## Mean : 22840
## 3rd Qu.: 35587
## Max. :143019
summary(data$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1307 8763 13407 22840 35587 143019
# Histogram for fare
ggplot(data, aes(x = Fare)) + geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
labs(title = "Distribution of Flight Fare", x = "Fare", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Days left
summary(data$Days_left)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 13.00 26.00 25.63 38.00 50.00
# Histogram for fare
ggplot(data, aes(x = Days_left)) + geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
labs(title = "Distribution of Days Left", x = "Days between Booking day and Flight day", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Journey Day - histogram
ggplot(data, aes(x = Journey_day)) + geom_bar(binwidth = 10, fill = "plum", color = "black") +
labs(title = "Journey Day", x = "Journey Day", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Journey Month
ggplot(data, aes(x = lubridate::month(Date_of_journey, label = TRUE))) +
geom_bar(fill = "palevioletred", color = "black") +
labs(title = "Frequency of Journey by Month", x = "Month of Journey", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold"),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 12)
)
# Airline - histogram
ggplot(data, aes(x = Airline)) + geom_bar(binwidth = 10, fill = "turquoise", color = "black") +
labs(title = "Airline Frequency", x = "Airline", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Class - histogram
ggplot(data, aes(x = Class)) + geom_bar(binwidth = 10, fill = "lightpink", color = "black") +
labs(title = "Types of Classes", x = "Class", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Fare with Journey Day
ggplot(data, aes(x = Journey_day, y = Fare)) + geom_boxplot(fill = "lightcyan") +
labs(title = "Fare Distribution by Journey Day", x = "Journey Day", y = "Fare") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Fare with Airline
ggplot(data, aes(x = Airline, y = Fare)) + geom_boxplot(fill = "turquoise") +
labs(title = "Fare Distribution by Airline", x = "Airline", y = "Fare") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Fare with Class
ggplot(data, aes(x = Class, y = Fare)) + geom_boxplot(fill = "lightpink") +
labs(title = "Fare Distribution by Class", x = "Class", y = "Fare") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Class with duration of hours
ggplot(data, aes(x = Class, y = Duration_in_hours)) + geom_boxplot(fill = "lightpink") +
labs(title = "Duration of Flight by Class", x = "Class", y = "Duration in hours") +
theme_minimal() +
theme(
plot.title = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 18, face = "bold"),
axis.title.y = element_text(size = 18, face = "bold"),
axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
axis.text.y = element_text(size = 16) )
# Fare with Days Left
data$days_left_group <- cut(data$Days_left, breaks = c(0, 7, 14, 30, 60, 90, Inf),
labels = c("0-7 days", "8-14 days", "15-30 days", "31-60 days", "61-90 days", "90+ days"))
ggplot(data, aes(x = days_left_group, y = Fare)) +
geom_boxplot(fill = "khaki") +
labs(title = "Fare Prices by Days Left Intervals", x = "Days Left Interval", y = "Fare") +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold"),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
axis.text.x = element_text(size = 12, angle = 45, hjust = 1)
)
# Average Fare with days left
avg_fare_by_days <- data %>% group_by(Days_left) %>% summarise(avg_fare = mean(Fare, na.rm = TRUE))
ggplot(avg_fare_by_days, aes(x = Days_left, y = avg_fare)) +
geom_line(color = "blue", size = 1) +
labs(title = "Average Fare vs. Days Left", x = "Days Left", y = "Average Fare") +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold"),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 12)
)
# Grouping by the extracted month from Date_of_journey and calculating average fare
avg_fare_by_month <- data %>%
group_by(Month_of_journey = lubridate::month(Date_of_journey, label = TRUE)) %>%
summarise(avg_fare = mean(Fare, na.rm = TRUE))
# Plotting the relationship
# Plotting the relationship with proper grouping
ggplot(avg_fare_by_month, aes(x = Month_of_journey, y = avg_fare, group = 1)) +
geom_line(color = "hotpink", size = 1) +
geom_point(color = "blue", size = 2) +
labs(title = "Average Fare vs. Month of Journey", x = "Month of Journey", y = "Average Fare") +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold"),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
axis.text = element_text(size = 12)
)
numeric_data <- data[sapply(data, is.numeric)]
cor_matrix <- cor(numeric_data)
print(cor_matrix)
## Duration_in_hours Days_left Fare
## Duration_in_hours 1.00000000 -0.03287841 0.17990877
## Days_left -0.03287841 1.00000000 -0.08785164
## Fare 0.17990877 -0.08785164 1.00000000
corrplot(cor_matrix, method = "circle", type = "upper", tl.col = "black", tl.srt = 30,col = colorRampPalette(c("green", "pink", "blue"))(200))
# Facet Grid for Fare by Airline and Duration with Total Stops
ggplot(data, aes(x = Duration_in_hours, y = Total_stops, color = Fare)) +
geom_point(alpha = 0.6) +
facet_wrap(~ Airline) +
labs(title = "Fare by Airline, Duration, and Total Stops", x = "Duration in Hours", y = "Total Stops") +
scale_color_gradient(low = "blue", high = "red")
# Count of missing values in each column
colSums(is.na(data))
## Date_of_journey Journey_day Airline Flight_code
## 0 0 0 0
## Class Source Departure Total_stops
## 0 0 0 0
## Arrival Destination Duration_in_hours Days_left
## 0 0 0 0
## Fare days_left_group
## 0 0
# Check rows with missing values
missing_rows <- data[!complete.cases(data), ]
print(missing_rows)
## # A tibble: 0 × 14
## # ℹ 14 variables: Date_of_journey <date>, Journey_day <chr>, Airline <chr>,
## # Flight_code <chr>, Class <chr>, Source <chr>, Departure <chr>,
## # Total_stops <chr>, Arrival <chr>, Destination <chr>,
## # Duration_in_hours <dbl>, Days_left <dbl>, Fare <dbl>, days_left_group <fct>
# Check for duplicate rows
duplicate_rows <- data[duplicated(data), ]
if (nrow(duplicate_rows) == 0) {
print("There is no duplicate row.")
} else {
print("Duplicate Rows:")
print(duplicate_rows)
}
## [1] "Duplicate Rows:"
## # A tibble: 6,722 × 14
## Date_of_journey Journey_day Airline Flight_code Class Source Departure
## <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2023-01-16 Monday Vistara UK-627 Economy Delhi 12 PM - …
## 2 2023-01-16 Monday Vistara UK-627 Premium Eco… Delhi 12 PM - …
## 3 2023-01-16 Monday Vistara UK-627 Premium Eco… Delhi 12 PM - …
## 4 2023-01-17 Tuesday Vistara UK-929 Economy Delhi After 6 …
## 5 2023-01-17 Tuesday Vistara UK-627 Premium Eco… Delhi 12 PM - …
## 6 2023-01-18 Wednesday Vistara UK-975 Economy Delhi Before 6…
## 7 2023-01-18 Wednesday Vistara UK-951 Economy Delhi 12 PM - …
## 8 2023-01-18 Wednesday Vistara UK-993 Economy Delhi 12 PM - …
## 9 2023-01-18 Wednesday Vistara UK-945 Economy Delhi 6 AM - 1…
## 10 2023-01-18 Wednesday Vistara UK-995 Economy Delhi 6 AM - 1…
## # ℹ 6,712 more rows
## # ℹ 7 more variables: Total_stops <chr>, Arrival <chr>, Destination <chr>,
## # Duration_in_hours <dbl>, Days_left <dbl>, Fare <dbl>, days_left_group <fct>
# Verify data quality
verify_data_quality <- function(data) {
# 1. Check if 'Date_of_journey' is in the correct format (YYYY-MM-DD)
data$Date_of_journey <- as.Date(data$Date_of_journey, format = "%Y-%m-%d")
invalid_dates <- sum(is.na(data$Date_of_journey)) # Count invalid dates
# 2. Check if 'Duration_in_hours' contains non-negative values
negative_duration <- sum(data$Duration_in_hours < 0, na.rm = TRUE) # Count negative durations
# 3. Check if 'Days_left' contains non-negative values
negative_days_left <- sum(data$Days_left < 0, na.rm = TRUE) # Count negative days left
# Output the results
cat("Data Quality Check Results:\n")
# Invalid date check
if (invalid_dates > 0) {
cat(paste(invalid_dates, "rows have invalid date format in 'Date_of_journey'.\n"))
} else {
cat("All dates in 'Date_of_journey' are in the correct format.\n")
}
# Negative duration check
if (negative_duration > 0) {
cat(paste(negative_duration, "rows have negative values in 'Duration_in_hours'.\n"))
} else {
cat("All values in 'Duration_in_hours' are non-negative.\n")
}
# Negative days left check
if (negative_days_left > 0) {
cat(paste(negative_days_left, "rows have negative values in 'Days_left'.\n"))
} else {
cat("All values in 'Days_left' are non-negative.\n")
}
}
verify_data_quality(data)
## Data Quality Check Results:
## All dates in 'Date_of_journey' are in the correct format.
## All values in 'Duration_in_hours' are non-negative.
## All values in 'Days_left' are non-negative.
glimpse(data)
## Rows: 452,088
## Columns: 14
## $ Date_of_journey <date> 2023-01-16, 2023-01-16, 2023-01-16, 2023-01-16, 202…
## $ Journey_day <chr> "Monday", "Monday", "Monday", "Monday", "Monday", "M…
## $ Airline <chr> "SpiceJet", "Indigo", "GO FIRST", "SpiceJet", "Air I…
## $ Flight_code <chr> "SG-8169", "6E-2519", "G8-354", "SG-8709", "AI-805",…
## $ Class <chr> "Economy", "Economy", "Economy", "Economy", "Economy…
## $ Source <chr> "Delhi", "Delhi", "Delhi", "Delhi", "Delhi", "Delhi"…
## $ Departure <chr> "After 6 PM", "After 6 PM", "After 6 PM", "After 6 P…
## $ Total_stops <chr> "non-stop", "non-stop", "non-stop", "non-stop", "non…
## $ Arrival <chr> "After 6 PM", "Before 6 AM", "Before 6 AM", "After 6…
## $ Destination <chr> "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "M…
## $ Duration_in_hours <dbl> 2.0833, 2.3333, 2.1667, 2.0833, 2.1667, 2.2500, 2.25…
## $ Days_left <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Fare <dbl> 5335, 5899, 5801, 5794, 5955, 5955, 5955, 5899, 5829…
## $ days_left_group <fct> 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-…
Duplicate rows removed.
# Remove duplicate rows
data <- data[!duplicated(data), ]
# Count of removed duplicate rows
cat("Number of removed duplicate rows:", nrow(duplicate_rows), "\n")
## Number of removed duplicate rows: 6722
Handle missing values
# For simplicity, we can remove rows with missing values
data <- na.omit(data)
# Count of rows removed with missing values in field
cat("Number of removed rows with missing values:", nrow(missing_rows), "\n")
## Number of removed rows with missing values: 0
Removed first class since the count of data lines are insignificant, seen from the initial EDA
first_class_rows <- nrow(data %>% filter(Class == "First"))
cat("Number of rows in First class:", first_class_rows, "\n")
## Number of rows in First class: 144
# Remove rows of first class
data <- data %>% filter(Class != "First")
# Count of rows removed with first class
cat("Number of removed rows with first class:", first_class_rows, "\n")
## Number of removed rows with first class: 144
Removing columns: Flight_code, arrival, duration_in_hours
# Remove columns: Flight_code, arrival, duration_in_hours, duration_category, and fare_per_hour
data <- data %>% select(-Flight_code, -Arrival, -Duration_in_hours)
Verify data quality after cleaning and look at the data.
# Verify data quality again after cleaning
verify_data_quality(data)
## Data Quality Check Results:
## All dates in 'Date_of_journey' are in the correct format.
## All values in 'Duration_in_hours' are non-negative.
## All values in 'Days_left' are non-negative.
# Display the cleaned data
glimpse(data)
## Rows: 445,222
## Columns: 11
## $ Date_of_journey <date> 2023-01-16, 2023-01-16, 2023-01-16, 2023-01-16, 2023-…
## $ Journey_day <chr> "Monday", "Monday", "Monday", "Monday", "Monday", "Mon…
## $ Airline <chr> "SpiceJet", "Indigo", "GO FIRST", "SpiceJet", "Air Ind…
## $ Class <chr> "Economy", "Economy", "Economy", "Economy", "Economy",…
## $ Source <chr> "Delhi", "Delhi", "Delhi", "Delhi", "Delhi", "Delhi", …
## $ Departure <chr> "After 6 PM", "After 6 PM", "After 6 PM", "After 6 PM"…
## $ Total_stops <chr> "non-stop", "non-stop", "non-stop", "non-stop", "non-s…
## $ Destination <chr> "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mum…
## $ Days_left <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Fare <dbl> 5335, 5899, 5801, 5794, 5955, 5955, 5955, 5899, 5829, …
## $ days_left_group <fct> 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7 …
Feature Engineering: Create new features from existing data
# Create a binary feature indicating whether the flight is on a weekend
data$Is_weekend <- ifelse(data$Journey_day %in% c("Saturday", "Sunday"), 1, 0)
# Create a feature indicating whether the flight departs in the morning,
# afternoon, evening, or after midnight
data$Departure_period <- ifelse(data$Departure == "Before 6 AM", 1,
ifelse(data$Departure == "6 AM - 12 PM", 2,
ifelse(data$Departure == "12 PM - 6 PM", 3,
ifelse(data$Departure == "After 6 PM", 4, 0))))
# Create interaction features between Airline and Class, Source and Destination.
data$Airline_Class <- paste(data$Airline, data$Class, sep = "_")
data$Route <- paste(data$Source, data$Destination, sep = "_")
# Update Total_stops feature to be numeric: 0 if non stop, 1 if 1 stop, 2 if 2 or more stops
data$Total_stops <- case_when(
data$Total_stops == "non-stop" ~ 0,
data$Total_stops == "1-stop" ~ 1,
TRUE ~ 2
)
# Mutate Fare to be in thousands instead of ones and update column name to Fare(x1000)
data <- data %>% mutate(`Fare(x1000)` = Fare / 1000) %>% select(-Fare)
Create a new feature Is_Holiday if any of the flight dates align with India Holidays / Restricted Holidays based on information from Administrative offices of central government located at Delhi / New Delhi (between Jan 16 - March 06, 2023 to align with the dataset): https://cag.gov.in/uploads/media/2023-Holiday-List-062c177906f8592-90189505.pdf
# Define the list of holidays
holidays <- as.Date(c("2023-01-26", "2023-02-05", "2023-02-15", "2023-02-18", "2023-02-19"))
# Create the Is_Holiday feature
data$Is_Holiday <- ifelse(data$Date_of_journey %in% holidays, 1, 0)
# Display the constructed data
glimpse(data)
## Rows: 445,222
## Columns: 16
## $ Date_of_journey <date> 2023-01-16, 2023-01-16, 2023-01-16, 2023-01-16, 2023…
## $ Journey_day <chr> "Monday", "Monday", "Monday", "Monday", "Monday", "Mo…
## $ Airline <chr> "SpiceJet", "Indigo", "GO FIRST", "SpiceJet", "Air In…
## $ Class <chr> "Economy", "Economy", "Economy", "Economy", "Economy"…
## $ Source <chr> "Delhi", "Delhi", "Delhi", "Delhi", "Delhi", "Delhi",…
## $ Departure <chr> "After 6 PM", "After 6 PM", "After 6 PM", "After 6 PM…
## $ Total_stops <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Destination <chr> "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mu…
## $ Days_left <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ days_left_group <fct> 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7…
## $ Is_weekend <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Departure_period <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 4, 4, 3, 2, 2,…
## $ Airline_Class <chr> "SpiceJet_Economy", "Indigo_Economy", "GO FIRST_Econo…
## $ Route <chr> "Delhi_Mumbai", "Delhi_Mumbai", "Delhi_Mumbai", "Delh…
## $ `Fare(x1000)` <dbl> 5.335, 5.899, 5.801, 5.794, 5.955, 5.955, 5.955, 5.89…
## $ Is_Holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Adjust the output plots to be wider.
options(repr.plot.width = 20, repr.plot.height = 12)
Preparing distinct_airline_class to plot fare distribution per airline class
# Preparing distinct_airline_class to plot fare distribution for each Airline Class
distinct_airline_class <- data %>%
distinct(Airline_Class) %>%
pull(Airline_Class)
Boxplot of fare distribution for each airline class; comparing weekend and weekday fares
# Box plot of Fare by Route for each Airline_Class comparing weekdays and weekends
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
p <- ggplot(data %>% filter(Airline_Class == airline_class), aes(x = Route, y = `Fare(x1000)`, fill = factor(Is_weekend, labels = c("Weekdays", "Weekends")))) +
geom_boxplot(outlier.size = 0.7) +
labs(title = paste("Fare Distribution by Route for", airline_class, " (Weekdays vs Weekends)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Day") +
theme_minimal() +
theme(
plot.title = element_text(size = 12, face = "bold", hjust=0.5),
axis.title.x = element_text(size = 10, face = "bold"),
axis.title.y = element_text(size = 10, face = "bold"),
axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
axis.text.y = element_text(size = 8)
)
plots[[airline_class]] <- p
}
# Display the plots
for (airline_class in distinct_airline_class) {
print(plots[[airline_class]])
}
Boxplot of fare distribution for each airline class; comparing Holiday and non-holiday
# Box plot of Fare by Route for each Airline_Class comparing Holiday and non-holiday
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
p <- ggplot(data %>% filter(Airline_Class == airline_class), aes(x = Route, y = `Fare(x1000)`, fill = factor(Is_Holiday, labels = c("Non-Holiday", "Holiday")))) +
geom_boxplot(outlier.size = 0.7) +
labs(title = paste("Fare Distribution by Route for", airline_class, " (Holiday vs Non-Holiday)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Day") +
theme_minimal() +
theme(
plot.title = element_text(size = 12, face = "bold", hjust=0.5),
axis.title.x = element_text(size = 10, face = "bold"),
axis.title.y = element_text(size = 10, face = "bold"),
axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
axis.text.y = element_text(size = 8)
)
plots[[airline_class]] <- p
}
# Display the plots
for (airline_class in distinct_airline_class) {
print(plots[[airline_class]])
}
Boxplot of fare distribution for each airline class; comparing departure time
# Box plot of Fare by Route for each Airline_Class comparing Departure time
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
filtered_data <- data %>% filter(Airline_Class == airline_class)
filtered_data$Departure_period <- factor(filtered_data$Departure_period, levels = 1:4, labels = c("12am - 6am", "6am - 12pm", "12pm - 6pm", "6pm - 12am"))
p <- ggplot(filtered_data, aes(x = Route, y = `Fare(x1000)`, fill = Departure_period)) +
geom_boxplot(outlier.size = 0.7) +
labs(title = paste("Fare Distribution by Route for", airline_class, " (Departure Time)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Departure Time") +
theme_minimal() +
theme(
plot.title = element_text(size = 12, face = "bold", hjust=0.5),
axis.title.x = element_text(size = 10, face = "bold"),
axis.title.y = element_text(size = 10, face = "bold"),
axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
axis.text.y = element_text(size = 8)
)
plots[[airline_class]] <- p
}
# Display the plots
for (airline_class in distinct_airline_class) {
print(plots[[airline_class]])
}
Boxplot of fare distribution for each airline class; comparing number of stops
# Box plot of Fare by Route for each Airline_Class comparing number of stops
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
filtered_data <- data %>% filter(Airline_Class == airline_class)
filtered_data$Total_stops <- factor(filtered_data$Total_stops, levels = 0:2, labels = c("Non-stop", "1-stop", "2+ stops"))
p <- ggplot(filtered_data, aes(x = Route, y = `Fare(x1000)`, fill = Total_stops)) +
geom_boxplot(outlier.size = 0.7) +
labs(title = paste("Fare Distribution by Route for", airline_class, " (Total Stops)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Total Stops") +
theme_minimal() +
theme(
plot.title = element_text(size = 12, face = "bold", hjust=0.5),
axis.title.x = element_text(size = 10, face = "bold"),
axis.title.y = element_text(size = 10, face = "bold"),
axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
axis.text.y = element_text(size = 8)
)
plots[[airline_class]] <- p
}
# Display the plots
for (airline_class in distinct_airline_class) {
print(plots[[airline_class]])
}
Boxplot of fare distribution for each airline class; comparing How early the ticket is bought
# Box plot of Fare by Route for each Airline_Class comparing Days Left Group
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
filtered_data <- data %>% filter(Airline_Class == airline_class)
p <- ggplot(filtered_data, aes(x = Route, y = `Fare(x1000)`, fill = days_left_group)) +
geom_boxplot(outlier.size = 0.7) +
labs(title = paste("Fare Distribution by Route for", airline_class, " (Ticket bought days prior to flight)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Ticket bought days prior to flight") +
theme_minimal() +
theme(
plot.title = element_text(size = 12, face = "bold", hjust=0.5),
axis.title.x = element_text(size = 10, face = "bold"),
axis.title.y = element_text(size = 10, face = "bold"),
axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
axis.text.y = element_text(size = 8)
)
plots[[airline_class]] <- p
}
# Display the plots
for (airline_class in distinct_airline_class) {
print(plots[[airline_class]])
}