Group 4 Programming Data Science Assignment

0. Project Details

Group Number:

Group 4

Title:

Machine Learning in Aviation: Predicting Fares and Ticket Classifications

Members:

Khor Kean Teng, Loong Shih-Wai, Nur Maisarah Jalalulail, Pang Suk Min, Tioh Zi Cong


1. Business Understanding

1.1 Business Objective

1.1.1 Background

Air travel has become an integral part of modern society with millions of passengers flying for business and leisure. The rapidly expanding airline industry is also considered one of the most sophisticated industries in employing complex pricing strategies. This is because flight fare pricing is dynamic and complex, involving numerous factors such as timing, demand, seasonality, and market conditions. The same airline ticket prices can vary dynamically and significantly, even for nearby seats within the same cabin. Research reports that the same ticket for a specific flight can change up to seven times a day.

Admittedly, flight fare volatility creates uncertainty for both travelers seeking the best deals and airlines optimizing their revenue management. This interplay of goals creates a unique tension: customers are seeking to get the lowest price for their ticket, while airline companies aim to maximize their overall revenue and profit. Mismatches between available seats and passenger demand often lead to customers paying more or airlines losing revenue. While airlines are equipped with advanced tools and capabilities to control the pricing process, customers are also becoming more strategic with the development of online tools that compare prices across various airlines. This evolving dynamic underscores the growing sophistication and challenges in the flight pricing ecosystem.

Recent advancement in machine learning has opened new possibilities for more accurate fare prediction and classification models. Through analyzing vast amount of historical flight fare data, it opens up opportunities to develop predictive models that can forecast flight fares with better accuracy as well as greater transparency and efficiency in airline pricing.

1.1.2 Business Objectives

  • Increase revenue by optimizing dynamic flight pricing
    • Achieve revenue increment by implementing machine learning models that set competitive ticket prices based on market conditions
  • Improve customer retention
    • Retain and enhance customer loyalty through tailored pricing strategies for frequent travelers
  • Enhance customer satisfaction ratings
    • Improve customer satisfaction metrics with competitive fares in each class while maintaining service quality

1.1.3 Business Success Criteria

  • Enhance forecasting
    • Achieve a measurable accuracy in flight fare prediction and classification by implementing machine learning model
  • Revenue optimization
    • Help airlines optimize their pricing strategy to maximize revenue and minimize seat wastage. Tools to assist customer to find the best time to book flight ticket can also be implemented to avoid overspending.
  • Operational efficiency
    • The implementation of machine learning model reduce the need for manual fare trends analysis as the model continuously learned from collected datasets

1.2 Situation Assessment

Accurate prediction of flight-level passenger traffic is of paramount importance in airline operations, influencing key decisions from pricing to route optimization (Ehsani et al., 2024). However, forecasting flight fares is a critical tasks in the rapidly expanding aviation industry as it involves numerous factors. Nowadays, aviation industry adopt dynamic pricing to assign airfare prices utilizing complex strategies and methods involving financial, marketing, commercial and social factors.

The high complexity in airfare pricing models make it very difficult for customer to purchase an air ticket in the lowest price since the price changes dynamically. According to a (Abdella et al., 2021), the same ticket price for a specific flight can change up to 7 times a day, presenting a complex predictive modeling opportunities. The best time to buy a ticket is a key issue subjected to research. Various predictions models to optimize airline ticket purchasing have been proposed and achieved promising results. (Subramanian et al., 2022) proposed the use of machine learning algorithms in airline fare prediction utilizing various regression algorithms such as KNN, Linear Regresssion, Lasso, Ridge, Decision Tree, Stacking Tree and Random Forest on scaped data from MakeMyTrip, Data World and New Zealand Domestic airlines. The research results R2 score ranges from 60.24% to 83.83%; 27.95% to 84.46%; 37.16% to 81.89% highlighting the complex and nonlinear relationships of multiple factors in airline fare prediction. However, the data used on cover specific airports limited to certain time period, thus limited the result generalizability and practicality.

(Tziridis et al., 2017) proposed a comparative study using data from the Aegean Airlines using MLP, GRNN,  ELM, RF, Regression Tree, Bagging Regression Tree, SVM and LR. The research outcome ranges from 49.40% to 87.42%, improvement is made with feature selection from 57.07% to 87.59%. Nonetheless, the research lacks practicality where only 1814 data flights collected in period between December and July is used for training, thus limiting the findings application to a broader cases and demographics.

Recent advancement in deep learning also ushered in novel forecasting approach in flight fare prediction. (Degife and Lin, 2023) suggested the use of gated recurrent unit, GRU and achieved a remarkable 98% R2 score and 5.93 RMSE representing a significant advacement in prediction accuracy. Nevertheless, the use of deep learning architecture can be computational intensive during the model training stage.

Above all, the landscape of flight fare prediction demonstrates significant potential and ongoing challenges. Despite the promising advancements in machine learning and deep learning techniques, current research is constrained by limited dataset generalizability, regional specificity, and computational intensity. Future research could emphasize on the development of more comprehensive, robust models that can integrate diverse data sources, encompass broader geographic regions, and balance predictive accuracy with computational efficiency. Moreover, interdisciplinary approaches that combine machine learning algorithms with domain expertise from aviation economics, marketing, and social sciences could provide more nuanced and adaptable fare prediction models, ultimately benefiting both airlines and passengers by offering more transparent and optimized pricing strategies.

1.3 Data Mining Goals

  • To implement machine learning models in airline flight ticket price prediction
  • To apply machine learning models in airline flight ticket classes classification
  • To evaluate the prediction and classification performance of the models with and without feature selection
  • To identify factors influencing airline flight ticket pricing and flight ticket classes

1.4 Project Plan

The project make use of CRISP-DM framework to streamline the project workflows. R language will be used for data processing, modeling and evaluation tasks. The report for the project will then be written in RMarkdown and published on RPubs.

2. Data Understanding

###2.1 Collect Initial Data

# clear workspace variables
rm(list = ls())

# set-up: turn off warnings and set plot size
options(warn = -1)
options(repr.plot.width = 10, repr.plot.height = 5)

# set working directory
setwd("C:\\Users\\Loong\\Downloads\\New folder")
# install the required packages
packages <- c("caret", "e1071", "tidyverse", "tidyr", "readr", "randomForest", "class", "nnet", "corrplot", "glmnet")

# Get the list of currently installed packages
installed_packages <- as.data.frame(installed.packages())['Package']

# Function to check and install missing packages
install_if_missing <- function(pkg) {
  if (!(pkg %in% installed_packages$Package)) {
    suppressMessages(install.packages(pkg))
  }
}

# Apply the function to each package in the vector
sapply(packages, install_if_missing)
## $caret
## NULL
## 
## $e1071
## NULL
## 
## $tidyverse
## NULL
## 
## $tidyr
## NULL
## 
## $readr
## NULL
## 
## $randomForest
## NULL
## 
## $class
## NULL
## 
## $nnet
## NULL
## 
## $corrplot
## NULL
## 
## $glmnet
## NULL
# import packages
suppressPackageStartupMessages({
  # add the required packages here
  library(tidyr)
  library(readr)
  library(tidyverse)
  library(caret)
  library(e1071)
  library(randomForest)
  library(class)
  library(nnet)
  library(corrplot)
  library(glmnet)
})
# Download the dataset
system("kaggle datasets download yashdharme36/airfare-ml-predicting-flight-fares")
## [1] 0
cat("Data loaded...\n")
## Data loaded...
unzip("airfare-ml-predicting-flight-fares.zip")
cat("Done unzip...\n")
## Done unzip...
# first 5 rows of the data
data <- read_csv('Cleaned_dataset.csv', show_col_types = FALSE)
head(data, 5)
## # A tibble: 5 × 13
##   Date_of_journey Journey_day Airline   Flight_code Class   Source Departure 
##   <date>          <chr>       <chr>     <chr>       <chr>   <chr>  <chr>     
## 1 2023-01-16      Monday      SpiceJet  SG-8169     Economy Delhi  After 6 PM
## 2 2023-01-16      Monday      Indigo    6E-2519     Economy Delhi  After 6 PM
## 3 2023-01-16      Monday      GO FIRST  G8-354      Economy Delhi  After 6 PM
## 4 2023-01-16      Monday      SpiceJet  SG-8709     Economy Delhi  After 6 PM
## 5 2023-01-16      Monday      Air India AI-805      Economy Delhi  After 6 PM
## # ℹ 6 more variables: Total_stops <chr>, Arrival <chr>, Destination <chr>,
## #   Duration_in_hours <dbl>, Days_left <dbl>, Fare <dbl>

2.2 Describe Data

cat("The dimension of the dataset is", dim(data)[1], "rows and", dim(data)[2], "columns.\n")
## The dimension of the dataset is 452088 rows and 13 columns.
# Variables list with type
str(head(data))
## tibble [6 × 13] (S3: tbl_df/tbl/data.frame)
##  $ Date_of_journey  : Date[1:6], format: "2023-01-16" "2023-01-16" ...
##  $ Journey_day      : chr [1:6] "Monday" "Monday" "Monday" "Monday" ...
##  $ Airline          : chr [1:6] "SpiceJet" "Indigo" "GO FIRST" "SpiceJet" ...
##  $ Flight_code      : chr [1:6] "SG-8169" "6E-2519" "G8-354" "SG-8709" ...
##  $ Class            : chr [1:6] "Economy" "Economy" "Economy" "Economy" ...
##  $ Source           : chr [1:6] "Delhi" "Delhi" "Delhi" "Delhi" ...
##  $ Departure        : chr [1:6] "After 6 PM" "After 6 PM" "After 6 PM" "After 6 PM" ...
##  $ Total_stops      : chr [1:6] "non-stop" "non-stop" "non-stop" "non-stop" ...
##  $ Arrival          : chr [1:6] "After 6 PM" "Before 6 AM" "Before 6 AM" "After 6 PM" ...
##  $ Destination      : chr [1:6] "Mumbai" "Mumbai" "Mumbai" "Mumbai" ...
##  $ Duration_in_hours: num [1:6] 2.08 2.33 2.17 2.08 2.17 ...
##  $ Days_left        : num [1:6] 1 1 1 1 1 1
##  $ Fare             : num [1:6] 5335 5899 5801 5794 5955 ...
# Segregate variables to type
float_columns <- names(data)[sapply(data, function(x) is.numeric(x) & !is.integer(x))]
string_columns <- names(data)[sapply(data, is.character)]
datetime_columns <- names(data)[sapply(data, function(x) inherits(x, "POSIXct") | inherits(x, "Date"))]

cat("Float columns:", paste(float_columns, collapse = ", "), "\n")
## Float columns: Duration_in_hours, Days_left, Fare
cat("String columns:", paste(string_columns, collapse = ", "), "\n")
## String columns: Journey_day, Airline, Flight_code, Class, Source, Departure, Total_stops, Arrival, Destination
cat("Datetime columns:", paste(datetime_columns, collapse = ", "), "\n")
## Datetime columns: Date_of_journey
# Check unique values for each column in the dataset
unique_values <- lapply(data, unique)
# Statistical summary of dataset
summary(data)
##  Date_of_journey      Journey_day          Airline          Flight_code       
##  Min.   :2023-01-16   Length:452088      Length:452088      Length:452088     
##  1st Qu.:2023-01-28   Class :character   Class :character   Class :character  
##  Median :2023-02-10   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2023-02-09                                                           
##  3rd Qu.:2023-02-22                                                           
##  Max.   :2023-03-06                                                           
##     Class              Source           Departure         Total_stops       
##  Length:452088      Length:452088      Length:452088      Length:452088     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Arrival          Destination        Duration_in_hours   Days_left    
##  Length:452088      Length:452088      Min.   : 0.750    Min.   : 1.00  
##  Class :character   Class :character   1st Qu.: 6.583    1st Qu.:13.00  
##  Mode  :character   Mode  :character   Median :11.333    Median :26.00  
##                                        Mean   :12.349    Mean   :25.63  
##                                        3rd Qu.:16.500    3rd Qu.:38.00  
##                                        Max.   :43.583    Max.   :50.00  
##       Fare       
##  Min.   :  1307  
##  1st Qu.:  8763  
##  Median : 13407  
##  Mean   : 22840  
##  3rd Qu.: 35587  
##  Max.   :143019

2.3 Explore Data

  • Univariate
summary(data$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1307    8763   13407   22840   35587  143019
# Histogram for fare
ggplot(data, aes(x = Fare)) + geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Flight Fare", x = "Fare", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

# Days left
summary(data$Days_left)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.00   26.00   25.63   38.00   50.00
# Histogram for fare
ggplot(data, aes(x = Days_left)) + geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Days Left", x = "Days between Booking day and Flight day", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

# Journey Day - histogram
ggplot(data, aes(x = Journey_day)) + geom_bar(binwidth = 10, fill = "plum", color = "black") +
  labs(title = "Journey Day", x = "Journey Day", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

# Journey Month
ggplot(data, aes(x = lubridate::month(Date_of_journey, label = TRUE))) +
  geom_bar(fill = "palevioletred", color = "black") +
  labs(title = "Frequency of Journey by Month", x = "Month of Journey", y = "Count") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12)
  )

# Airline - histogram
ggplot(data, aes(x = Airline)) + geom_bar(binwidth = 10, fill = "turquoise", color = "black") +
  labs(title = "Airline Frequency", x = "Airline", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

# Class - histogram
ggplot(data, aes(x = Class)) + geom_bar(binwidth = 10, fill = "lightpink", color = "black") +
  labs(title = "Types of Classes", x = "Class", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

  • Bivariate
# Fare with Journey Day
ggplot(data, aes(x = Journey_day, y = Fare)) + geom_boxplot(fill = "lightcyan") +
  labs(title = "Fare Distribution by Journey Day", x = "Journey Day", y = "Fare") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

# Fare with Airline
ggplot(data, aes(x = Airline, y = Fare)) + geom_boxplot(fill = "turquoise") +
  labs(title = "Fare Distribution by Airline", x = "Airline", y = "Fare") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

# Fare with Class
ggplot(data, aes(x = Class, y = Fare)) + geom_boxplot(fill = "lightpink") +
  labs(title = "Fare Distribution by Class", x = "Class", y = "Fare") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

# Class with duration of hours
ggplot(data, aes(x = Class, y = Duration_in_hours)) + geom_boxplot(fill = "lightpink") +
  labs(title = "Duration of Flight by Class", x = "Class", y = "Duration in hours") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 16, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 16)          )

# Fare with Days Left
data$days_left_group <- cut(data$Days_left, breaks = c(0, 7, 14, 30, 60, 90, Inf),
                            labels = c("0-7 days", "8-14 days", "15-30 days", "31-60 days", "61-90 days", "90+ days"))

ggplot(data, aes(x = days_left_group, y = Fare)) +
  geom_boxplot(fill = "khaki") +
  labs(title = "Fare Prices by Days Left Intervals", x = "Days Left Interval", y = "Fare") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1)
  )

# Average Fare with days left
avg_fare_by_days <- data %>% group_by(Days_left) %>% summarise(avg_fare = mean(Fare, na.rm = TRUE))

ggplot(avg_fare_by_days, aes(x = Days_left, y = avg_fare)) +
  geom_line(color = "blue", size = 1) +
  labs(title = "Average Fare vs. Days Left", x = "Days Left", y = "Average Fare") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12)
  )

# Grouping by the extracted month from Date_of_journey and calculating average fare
avg_fare_by_month <- data %>%
  group_by(Month_of_journey = lubridate::month(Date_of_journey, label = TRUE)) %>%
  summarise(avg_fare = mean(Fare, na.rm = TRUE))

# Plotting the relationship
# Plotting the relationship with proper grouping
ggplot(avg_fare_by_month, aes(x = Month_of_journey, y = avg_fare, group = 1)) +
  geom_line(color = "hotpink", size = 1) +
  geom_point(color = "blue", size = 2) +
  labs(title = "Average Fare vs. Month of Journey", x = "Month of Journey", y = "Average Fare") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12)
  )

  • Multivariate
numeric_data <- data[sapply(data, is.numeric)]
cor_matrix <- cor(numeric_data)
print(cor_matrix)
##                   Duration_in_hours   Days_left        Fare
## Duration_in_hours        1.00000000 -0.03287841  0.17990877
## Days_left               -0.03287841  1.00000000 -0.08785164
## Fare                     0.17990877 -0.08785164  1.00000000
corrplot(cor_matrix, method = "circle", type = "upper", tl.col = "black", tl.srt = 30,col = colorRampPalette(c("green", "pink", "blue"))(200))

# Facet Grid for Fare by Airline and Duration with Total Stops
ggplot(data, aes(x = Duration_in_hours, y = Total_stops, color = Fare)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ Airline) +
  labs(title = "Fare by Airline, Duration, and Total Stops", x = "Duration in Hours", y = "Total Stops") +
  scale_color_gradient(low = "blue", high = "red")

2.4 Verify Data Quality

# Count of missing values in each column
colSums(is.na(data))
##   Date_of_journey       Journey_day           Airline       Flight_code 
##                 0                 0                 0                 0 
##             Class            Source         Departure       Total_stops 
##                 0                 0                 0                 0 
##           Arrival       Destination Duration_in_hours         Days_left 
##                 0                 0                 0                 0 
##              Fare   days_left_group 
##                 0                 0
# Check rows with missing values
missing_rows <- data[!complete.cases(data), ]
print(missing_rows)
## # A tibble: 0 × 14
## # ℹ 14 variables: Date_of_journey <date>, Journey_day <chr>, Airline <chr>,
## #   Flight_code <chr>, Class <chr>, Source <chr>, Departure <chr>,
## #   Total_stops <chr>, Arrival <chr>, Destination <chr>,
## #   Duration_in_hours <dbl>, Days_left <dbl>, Fare <dbl>, days_left_group <fct>
# Check for duplicate rows
duplicate_rows <- data[duplicated(data), ]
if (nrow(duplicate_rows) == 0) {
  print("There is no duplicate row.")
} else {
  print("Duplicate Rows:")
  print(duplicate_rows)
}
## [1] "Duplicate Rows:"
## # A tibble: 6,722 × 14
##    Date_of_journey Journey_day Airline Flight_code Class        Source Departure
##    <date>          <chr>       <chr>   <chr>       <chr>        <chr>  <chr>    
##  1 2023-01-16      Monday      Vistara UK-627      Economy      Delhi  12 PM - …
##  2 2023-01-16      Monday      Vistara UK-627      Premium Eco… Delhi  12 PM - …
##  3 2023-01-16      Monday      Vistara UK-627      Premium Eco… Delhi  12 PM - …
##  4 2023-01-17      Tuesday     Vistara UK-929      Economy      Delhi  After 6 …
##  5 2023-01-17      Tuesday     Vistara UK-627      Premium Eco… Delhi  12 PM - …
##  6 2023-01-18      Wednesday   Vistara UK-975      Economy      Delhi  Before 6…
##  7 2023-01-18      Wednesday   Vistara UK-951      Economy      Delhi  12 PM - …
##  8 2023-01-18      Wednesday   Vistara UK-993      Economy      Delhi  12 PM - …
##  9 2023-01-18      Wednesday   Vistara UK-945      Economy      Delhi  6 AM - 1…
## 10 2023-01-18      Wednesday   Vistara UK-995      Economy      Delhi  6 AM - 1…
## # ℹ 6,712 more rows
## # ℹ 7 more variables: Total_stops <chr>, Arrival <chr>, Destination <chr>,
## #   Duration_in_hours <dbl>, Days_left <dbl>, Fare <dbl>, days_left_group <fct>
# Verify data quality
verify_data_quality <- function(data) {

  # 1. Check if 'Date_of_journey' is in the correct format (YYYY-MM-DD)
  data$Date_of_journey <- as.Date(data$Date_of_journey, format = "%Y-%m-%d")
  invalid_dates <- sum(is.na(data$Date_of_journey))  # Count invalid dates

  # 2. Check if 'Duration_in_hours' contains non-negative values
  negative_duration <- sum(data$Duration_in_hours < 0, na.rm = TRUE)  # Count negative durations

  # 3. Check if 'Days_left' contains non-negative values
  negative_days_left <- sum(data$Days_left < 0, na.rm = TRUE)  # Count negative days left

  # Output the results
  cat("Data Quality Check Results:\n")

  # Invalid date check
  if (invalid_dates > 0) {
    cat(paste(invalid_dates, "rows have invalid date format in 'Date_of_journey'.\n"))
  } else {
    cat("All dates in 'Date_of_journey' are in the correct format.\n")
  }

  # Negative duration check
  if (negative_duration > 0) {
    cat(paste(negative_duration, "rows have negative values in 'Duration_in_hours'.\n"))
  } else {
    cat("All values in 'Duration_in_hours' are non-negative.\n")
  }

  # Negative days left check
  if (negative_days_left > 0) {
    cat(paste(negative_days_left, "rows have negative values in 'Days_left'.\n"))
  } else {
    cat("All values in 'Days_left' are non-negative.\n")
  }
}

verify_data_quality(data)
## Data Quality Check Results:
## All dates in 'Date_of_journey' are in the correct format.
## All values in 'Duration_in_hours' are non-negative.
## All values in 'Days_left' are non-negative.

3. Data Preparation

3.1 Select Data

Look at the data we have on hand
glimpse(data)
## Rows: 452,088
## Columns: 14
## $ Date_of_journey   <date> 2023-01-16, 2023-01-16, 2023-01-16, 2023-01-16, 202…
## $ Journey_day       <chr> "Monday", "Monday", "Monday", "Monday", "Monday", "M…
## $ Airline           <chr> "SpiceJet", "Indigo", "GO FIRST", "SpiceJet", "Air I…
## $ Flight_code       <chr> "SG-8169", "6E-2519", "G8-354", "SG-8709", "AI-805",…
## $ Class             <chr> "Economy", "Economy", "Economy", "Economy", "Economy…
## $ Source            <chr> "Delhi", "Delhi", "Delhi", "Delhi", "Delhi", "Delhi"…
## $ Departure         <chr> "After 6 PM", "After 6 PM", "After 6 PM", "After 6 P…
## $ Total_stops       <chr> "non-stop", "non-stop", "non-stop", "non-stop", "non…
## $ Arrival           <chr> "After 6 PM", "Before 6 AM", "Before 6 AM", "After 6…
## $ Destination       <chr> "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "M…
## $ Duration_in_hours <dbl> 2.0833, 2.3333, 2.1667, 2.0833, 2.1667, 2.2500, 2.25…
## $ Days_left         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Fare              <dbl> 5335, 5899, 5801, 5794, 5955, 5955, 5955, 5899, 5829…
## $ days_left_group   <fct> 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-…

3.2 Clean Data

Duplicate rows removed.

# Remove duplicate rows
data <- data[!duplicated(data), ]
# Count of removed duplicate rows
cat("Number of removed duplicate rows:", nrow(duplicate_rows), "\n")
## Number of removed duplicate rows: 6722

Handle missing values

# For simplicity, we can remove rows with missing values
data <- na.omit(data)
# Count of rows removed with missing values in field
cat("Number of removed rows with missing values:", nrow(missing_rows), "\n")
## Number of removed rows with missing values: 0

Removed first class since the count of data lines are insignificant, seen from the initial EDA

first_class_rows <- nrow(data %>% filter(Class == "First"))
cat("Number of rows in First class:", first_class_rows, "\n")
## Number of rows in First class: 144
# Remove rows of first class
data <- data %>% filter(Class != "First")
# Count of rows removed with first class
cat("Number of removed rows with first class:", first_class_rows, "\n")
## Number of removed rows with first class: 144

Removing columns: Flight_code, arrival, duration_in_hours

# Remove columns: Flight_code, arrival, duration_in_hours, duration_category, and fare_per_hour
data <- data %>% select(-Flight_code, -Arrival, -Duration_in_hours)

Verify data quality after cleaning and look at the data.

# Verify data quality again after cleaning
verify_data_quality(data)
## Data Quality Check Results:
## All dates in 'Date_of_journey' are in the correct format.
## All values in 'Duration_in_hours' are non-negative.
## All values in 'Days_left' are non-negative.
# Display the cleaned data
glimpse(data)
## Rows: 445,222
## Columns: 11
## $ Date_of_journey <date> 2023-01-16, 2023-01-16, 2023-01-16, 2023-01-16, 2023-…
## $ Journey_day     <chr> "Monday", "Monday", "Monday", "Monday", "Monday", "Mon…
## $ Airline         <chr> "SpiceJet", "Indigo", "GO FIRST", "SpiceJet", "Air Ind…
## $ Class           <chr> "Economy", "Economy", "Economy", "Economy", "Economy",…
## $ Source          <chr> "Delhi", "Delhi", "Delhi", "Delhi", "Delhi", "Delhi", …
## $ Departure       <chr> "After 6 PM", "After 6 PM", "After 6 PM", "After 6 PM"…
## $ Total_stops     <chr> "non-stop", "non-stop", "non-stop", "non-stop", "non-s…
## $ Destination     <chr> "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mum…
## $ Days_left       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Fare            <dbl> 5335, 5899, 5801, 5794, 5955, 5955, 5955, 5899, 5829, …
## $ days_left_group <fct> 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7 …

3.3 Construct Data

Feature Engineering: Create new features from existing data

# Create a binary feature indicating whether the flight is on a weekend
data$Is_weekend <- ifelse(data$Journey_day %in% c("Saturday", "Sunday"), 1, 0)
# Create a feature indicating whether the flight departs in the morning,
# afternoon, evening, or after midnight
data$Departure_period <- ifelse(data$Departure == "Before 6 AM", 1,
                         ifelse(data$Departure == "6 AM - 12 PM", 2,
                         ifelse(data$Departure == "12 PM - 6 PM", 3,
                         ifelse(data$Departure == "After 6 PM", 4, 0))))
# Create interaction features between Airline and Class, Source and Destination.
data$Airline_Class <- paste(data$Airline, data$Class, sep = "_")
data$Route <- paste(data$Source, data$Destination, sep = "_")
# Update Total_stops feature to be numeric: 0 if non stop, 1 if 1 stop, 2 if 2 or more stops
data$Total_stops <- case_when(
    data$Total_stops == "non-stop" ~ 0,
    data$Total_stops == "1-stop" ~ 1,
    TRUE ~ 2
)
# Mutate Fare to be in thousands instead of ones and update column name to Fare(x1000)
data <- data %>% mutate(`Fare(x1000)` = Fare / 1000) %>% select(-Fare)

Create a new feature Is_Holiday if any of the flight dates align with India Holidays / Restricted Holidays based on information from Administrative offices of central government located at Delhi / New Delhi (between Jan 16 - March 06, 2023 to align with the dataset): https://cag.gov.in/uploads/media/2023-Holiday-List-062c177906f8592-90189505.pdf

  • Republic Holiday / Basant Panchami / Sri Panchami : 2023-01-26
  • Hazarat Ali’s Birthday / Guru Ravi Das’ Birthday : 2023-02-05
  • Birthday of Swami Dayananda Saraswati : 2023-02-15
  • Maha Shiratri : 2023-02-18
  • Shiva ji Jayanti : 2023-02-19
# Define the list of holidays
holidays <- as.Date(c("2023-01-26", "2023-02-05", "2023-02-15", "2023-02-18", "2023-02-19"))

# Create the Is_Holiday feature
data$Is_Holiday <- ifelse(data$Date_of_journey %in% holidays, 1, 0)
# Display the constructed data
glimpse(data)
## Rows: 445,222
## Columns: 16
## $ Date_of_journey  <date> 2023-01-16, 2023-01-16, 2023-01-16, 2023-01-16, 2023…
## $ Journey_day      <chr> "Monday", "Monday", "Monday", "Monday", "Monday", "Mo…
## $ Airline          <chr> "SpiceJet", "Indigo", "GO FIRST", "SpiceJet", "Air In…
## $ Class            <chr> "Economy", "Economy", "Economy", "Economy", "Economy"…
## $ Source           <chr> "Delhi", "Delhi", "Delhi", "Delhi", "Delhi", "Delhi",…
## $ Departure        <chr> "After 6 PM", "After 6 PM", "After 6 PM", "After 6 PM…
## $ Total_stops      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Destination      <chr> "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mu…
## $ Days_left        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ days_left_group  <fct> 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7 days, 0-7…
## $ Is_weekend       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Departure_period <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 4, 4, 3, 2, 2,…
## $ Airline_Class    <chr> "SpiceJet_Economy", "Indigo_Economy", "GO FIRST_Econo…
## $ Route            <chr> "Delhi_Mumbai", "Delhi_Mumbai", "Delhi_Mumbai", "Delh…
## $ `Fare(x1000)`    <dbl> 5.335, 5.899, 5.801, 5.794, 5.955, 5.955, 5.955, 5.89…
## $ Is_Holiday       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Adjust the output plots to be wider.

options(repr.plot.width = 20, repr.plot.height = 12)

Preparing distinct_airline_class to plot fare distribution per airline class

# Preparing distinct_airline_class to plot fare distribution for each Airline Class
distinct_airline_class <- data %>%
    distinct(Airline_Class) %>%
    pull(Airline_Class)

Boxplot of fare distribution for each airline class; comparing weekend and weekday fares

# Box plot of Fare by Route for each Airline_Class comparing weekdays and weekends
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
    p <- ggplot(data %>% filter(Airline_Class == airline_class), aes(x = Route, y = `Fare(x1000)`, fill = factor(Is_weekend, labels = c("Weekdays", "Weekends")))) +
        geom_boxplot(outlier.size = 0.7) +
        labs(title = paste("Fare Distribution by Route for", airline_class, " (Weekdays vs Weekends)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Day") +
        theme_minimal() +
        theme(
            plot.title = element_text(size = 12, face = "bold", hjust=0.5),
            axis.title.x = element_text(size = 10, face = "bold"),
            axis.title.y = element_text(size = 10, face = "bold"),
            axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
            axis.text.y = element_text(size = 8)
        )
    plots[[airline_class]] <- p
}

# Display the plots
for (airline_class in distinct_airline_class) {
    print(plots[[airline_class]])
}

Boxplot of fare distribution for each airline class; comparing Holiday and non-holiday

# Box plot of Fare by Route for each Airline_Class comparing Holiday and non-holiday
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
    p <- ggplot(data %>% filter(Airline_Class == airline_class), aes(x = Route, y = `Fare(x1000)`, fill = factor(Is_Holiday, labels = c("Non-Holiday", "Holiday")))) +
        geom_boxplot(outlier.size = 0.7) +
        labs(title = paste("Fare Distribution by Route for", airline_class, " (Holiday vs Non-Holiday)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Day") +
        theme_minimal() +
        theme(
            plot.title = element_text(size = 12, face = "bold", hjust=0.5),
            axis.title.x = element_text(size = 10, face = "bold"),
            axis.title.y = element_text(size = 10, face = "bold"),
            axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
            axis.text.y = element_text(size = 8)
        )
    plots[[airline_class]] <- p
}

# Display the plots
for (airline_class in distinct_airline_class) {
    print(plots[[airline_class]])
}

Boxplot of fare distribution for each airline class; comparing departure time

# Box plot of Fare by Route for each Airline_Class comparing Departure time
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
    filtered_data <- data %>% filter(Airline_Class == airline_class)
    filtered_data$Departure_period <- factor(filtered_data$Departure_period, levels = 1:4, labels = c("12am - 6am", "6am - 12pm", "12pm - 6pm", "6pm - 12am"))

    p <- ggplot(filtered_data, aes(x = Route, y = `Fare(x1000)`, fill = Departure_period)) +
        geom_boxplot(outlier.size = 0.7) +
        labs(title = paste("Fare Distribution by Route for", airline_class, " (Departure Time)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Departure Time") +
        theme_minimal() +
        theme(
            plot.title = element_text(size = 12, face = "bold", hjust=0.5),
            axis.title.x = element_text(size = 10, face = "bold"),
            axis.title.y = element_text(size = 10, face = "bold"),
            axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
            axis.text.y = element_text(size = 8)
        )
    plots[[airline_class]] <- p
}

# Display the plots
for (airline_class in distinct_airline_class) {
    print(plots[[airline_class]])
}

Boxplot of fare distribution for each airline class; comparing number of stops

# Box plot of Fare by Route for each Airline_Class comparing number of stops
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
    filtered_data <- data %>% filter(Airline_Class == airline_class)
    filtered_data$Total_stops <- factor(filtered_data$Total_stops, levels = 0:2, labels = c("Non-stop", "1-stop", "2+ stops"))

    p <- ggplot(filtered_data, aes(x = Route, y = `Fare(x1000)`, fill = Total_stops)) +
        geom_boxplot(outlier.size = 0.7) +
        labs(title = paste("Fare Distribution by Route for", airline_class, " (Total Stops)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Total Stops") +
        theme_minimal() +
        theme(
            plot.title = element_text(size = 12, face = "bold", hjust=0.5),
            axis.title.x = element_text(size = 10, face = "bold"),
            axis.title.y = element_text(size = 10, face = "bold"),
            axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
            axis.text.y = element_text(size = 8)
        )
    plots[[airline_class]] <- p
}

# Display the plots
for (airline_class in distinct_airline_class) {
    print(plots[[airline_class]])
}

Boxplot of fare distribution for each airline class; comparing How early the ticket is bought

# Box plot of Fare by Route for each Airline_Class comparing Days Left Group
# Split the plot into one plot each for each Airline_Class
plots <- list()
for (airline_class in distinct_airline_class) {
    filtered_data <- data %>% filter(Airline_Class == airline_class)

    p <- ggplot(filtered_data, aes(x = Route, y = `Fare(x1000)`, fill = days_left_group)) +
        geom_boxplot(outlier.size = 0.7) +
        labs(title = paste("Fare Distribution by Route for", airline_class, " (Ticket bought days prior to flight)"), x = "Route", y = "Fare (in thousand rupees)", fill = "Ticket bought days prior to flight") +
        theme_minimal() +
        theme(
            plot.title = element_text(size = 12, face = "bold", hjust=0.5),
            axis.title.x = element_text(size = 10, face = "bold"),
            axis.title.y = element_text(size = 10, face = "bold"),
            axis.text.x = element_text(size = 8, angle = 90, hjust = 1),
            axis.text.y = element_text(size = 8)
        )
    plots[[airline_class]] <- p
}

# Display the plots
for (airline_class in distinct_airline_class) {
    print(plots[[airline_class]])
}