1. Introduction


Members

Name Matrix Number
Ding Lee Choong 23075143
Wong Yang Gui 22104322
Li Yang Bo 22099073
Asadullah Qamar Bhatti 2136569
LOW MENG FEI 23063305

Introduction

British Airways (BA), one of the world’s leading airlines, has a rich history of providing air travel services to millions of passengers each year. Founded in 1974, the airline has grown to become a major player in the aviation industry, known for its extensive network, quality service, and commitment to passenger comfort and safety. As the airline industry becomes increasingly competitive, maintaining high levels of customer satisfaction is essential for retaining customers and fostering loyalty.

In recent years, the proliferation of digital communication channels has led to an explosion in customer feedback. Passengers now share their experiences through various platforms, including in-flight surveys, online reviews, social media, and direct communications with customer service. This feedback contains valuable insights into passengers’ experiences, preferences, and expectations.

Analyzing this feedback effectively allows British Airways to identify strengths and weaknesses in their services, respond to customer needs more efficiently, and make data-driven decisions to enhance overall customer satisfaction. By leveraging data science techniques, the study aims to transform raw customer feedback into actionable insights that can drive strategic improvements and ensure a superior travel experience for its passengers.

Objectives

This study has three objectives: (1) to clean and prepare the British Airways customer feedback data for analysis; (2) to derive passenger sentiment from the textual reviews using lexicon-based scoring; and (3) to build and evaluate classification models that predict that sentiment from flight and service attributes.

2. Data Understanding


2.1 Introduction

  • Title: British Airways customer feedback

  • Year: The dataset was collected in 2023.

  • Source: The dataset was extracted from Airline Quality through web scraping by Anshul Chaudhary & Muskan Risinghani, and is available on Kaggle.

  • Purpose: The British Airways customer feedback dataset provides a valuable resource of customer feedback for analysis, spanning from 2010 to 2023. It enables sentiment analysis, service quality assessment, route performance evaluation, aircraft experience analysis, and more. By leveraging this data, British Airways can enhance the overall customer experience, gain a competitive advantage, and make data-driven decisions for targeted improvements.

2.2 Dataset Overview

  • Dimension: 3,701 Rows & 19 Columns

  • Contents & Structure:

    1. Flight Details

     [Date] DateFlown: Date of the flight.
     [Character] Name: Customer's name who provided the feedback.
     [Nominal] TypeOfTraveller: Traveler type (e.g., Business, Leisure).
     [Nominal] SeatType: Seat class (e.g., Business, Economy).
     [Character] Route: The flight route taken by the customer.
     [Character] Aircraft: Aircraft model.

    2. Rating Feedback

     [Nominal] Recommended: Whether the customer recommends British Airways.
     [Ordinal] OverallRating: Overall customer rating.
     [Ordinal] SeatComfort: Seat comfort rating.
     [Ordinal] CabinStaffService: Cabin staff service rating.
     [Ordinal] GroundService: Ground service rating.
     [Ordinal] ValueForMoney: Value for money rating.
     [Ordinal] Food&Beverages: Food & beverages rating.
     [Ordinal] InflightEntertainment: Inflight entertainment rating.
     [Ordinal] Wifi&Connectivity: Onboard wifi and connectivity rating.

    3. Textual Feedback

     [Character] ReviewHeader: Title of the customer's review.
     [Character] ReviewBody: Detailed review content.

    4. Other Details

     [Date] Datetime: The date when the feedback was posted.
     [Nominal] VerifiedReview: Indicates if the review is verified.
if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman") # R package management tool
pacman::p_load(tidyverse) # a meta-package for data import, tidying, manipulation (dplyr), visualisation (ggplot2), text processing (stringr), and more

setwd("C:\\Users\\User\\Desktop\\Melaya\\WQD7004 Programming for Data Science")

# Import data
original_df <- read.csv('BA_AirlineReviews.csv',header=TRUE)[-1]

# Data overview
glimpse(original_df)
## Rows: 3,701
## Columns: 19
## $ OverallRating         <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, 8, 7, 1, …
## $ ReviewHeader          <chr> "\"Service level far worse then Ryanair\"", "\"d…
## $ Name                  <chr> "L Keele", "Austin Jones", "M A Collie", "Nigel …
## $ Datetime              <chr> "19th November 2023", "19th November 2023", "16t…
## $ VerifiedReview        <chr> "True", "True", "False", "True", "False", "True"…
## $ ReviewBody            <chr> "4 Hours before takeoff we received a Mail stati…
## $ TypeOfTraveller       <chr> "Couple Leisure", "Business", "Couple Leisure", …
## $ SeatType              <chr> "Economy Class", "Economy Class", "Business Clas…
## $ Route                 <chr> "London to Stuttgart", "Brussels to London", "Lo…
## $ DateFlown             <chr> "November 2023", "November 2023", "November 2023…
## $ SeatComfort           <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, 4, 3, 2, …
## $ CabinStaffService     <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, 5, 5, 3, …
## $ GroundService         <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, 5, 5, 1, …
## $ ValueForMoney         <chr> "1", "2", "3", "1", "1", "1", "4", "3", "5", "1"…
## $ Recommended           <chr> "no", "no", "yes", "no", "no", "no", "yes", "yes…
## $ Aircraft              <chr> "", "A320", "A320", "", "", "A320", "Boeing 777-…
## $ Food.Beverages        <chr> "", "1", "4", "", "1", "1", "4", "3", "4", "1", …
## $ InflightEntertainment <chr> "", "2", "", "", "1", "1", "4", "", "4", "1", ""…
## $ Wifi.Connectivity     <int> NA, 2, NA, NA, 1, NA, NA, NA, 3, 1, NA, 3, 1, NA…

2.3 Summary

summary(original_df)
##  OverallRating    ReviewHeader           Name             Datetime        
##  Min.   : 1.000   Length:3701        Length:3701        Length:3701       
##  1st Qu.: 2.000   Class :character   Class :character   Class :character  
##  Median : 4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 4.734                                                           
##  3rd Qu.: 8.000                                                           
##  Max.   :10.000                                                           
##  NA's   :5                                                                
##  VerifiedReview      ReviewBody        TypeOfTraveller      SeatType        
##  Length:3701        Length:3701        Length:3701        Length:3701       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Route            DateFlown          SeatComfort    CabinStaffService
##  Length:3701        Length:3701        Min.   :1.000   Min.   :1.000    
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000    
##  Mode  :character   Mode  :character   Median :3.000   Median :3.000    
##                                        Mean   :2.875   Mean   :3.254    
##                                        3rd Qu.:4.000   3rd Qu.:5.000    
##                                        Max.   :5.000   Max.   :5.000    
##                                        NA's   :116     NA's   :127      
##  GroundService   ValueForMoney      Recommended          Aircraft        
##  Min.   :1.000   Length:3701        Length:3701        Length:3701       
##  1st Qu.:1.000   Class :character   Class :character   Class :character  
##  Median :3.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2.784                                                           
##  3rd Qu.:4.000                                                           
##  Max.   :5.000                                                           
##  NA's   :846                                                             
##  Food.Beverages     InflightEntertainment Wifi.Connectivity
##  Length:3701        Length:3701           Min.   :1.000    
##  Class :character   Class :character      1st Qu.:1.000    
##  Mode  :character   Mode  :character      Median :1.000    
##                                           Mean   :1.925    
##                                           3rd Qu.:3.000    
##                                           Max.   :5.000    
##                                           NA's   :3092

3. Data Cleaning


The next stage is data cleaning, which aims to make the data accurate, consistent, and ready for analysis and modelling. This process involves three distinct stages: common cleaning, text pre-processing, and text representation. Each stage is described in detail below:

3.1 Common Cleaning

3.1.1 Data Type Conversion

Ensuring proper data types for each variable to facilitate analysis.

df <- original_df

# convert date variables into date format; strip the ordinal suffix ("st",
# "nd", "rd", "th") first, otherwise only dates ending in "th" would parse
df$Datetime <- as.Date(gsub("(\\d+)(st|nd|rd|th)", "\\1", df$Datetime), format = "%d %B %Y")
df$DateFlown <- as.Date(paste0("1 ", df$DateFlown), format = "%d %B %Y")

df$VerifiedReview <- as.logical(df$VerifiedReview)

# convert rating variables into integer format
df$ValueForMoney  <- as.integer(df$ValueForMoney)
df$Food.Beverages  <- as.integer(df$Food.Beverages)

df$InflightEntertainment  <- as.integer(df$InflightEntertainment)
# note: as.integer(factor(...)) returns the underlying level codes, not the
# labels, so Recommended is encoded as 1 = "no" and 2 = "yes"
df$Recommended <- as.integer(factor(df$Recommended, levels = c("no", "yes")))

# recode empty strings ("") in character variables as NA
df <- mutate_if(df, is.character, na_if, "")

# encode the categorical variables as integer codes
df$encoded_TypeOfTraveller <- as.integer(factor(df$TypeOfTraveller))
df$encoded_SeatType <- as.integer(factor(df$SeatType))
# Output for checking:
glimpse(df)
## Rows: 3,701
## Columns: 21
## $ OverallRating           <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, 8, 7, 1…
## $ ReviewHeader            <chr> "\"Service level far worse then Ryanair\"", "\…
## $ Name                    <chr> "L Keele", "Austin Jones", "M A Collie", "Nige…
## $ Datetime                <date> 2023-11-19, 2023-11-19, 2023-11-16, 2023-11-1…
## $ VerifiedReview          <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TR…
## $ ReviewBody              <chr> "4 Hours before takeoff we received a Mail sta…
## $ TypeOfTraveller         <chr> "Couple Leisure", "Business", "Couple Leisure"…
## $ SeatType                <chr> "Economy Class", "Economy Class", "Business Cl…
## $ Route                   <chr> "London to Stuttgart", "Brussels to London", "…
## $ DateFlown               <date> 2023-11-01, 2023-11-01, 2023-11-01, 2022-12-0…
## $ SeatComfort             <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, 4, 3, 2…
## $ CabinStaffService       <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, 5, 5, 3…
## $ GroundService           <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, 5, 5, 1…
## $ ValueForMoney           <int> 1, 2, 3, 1, 1, 1, 4, 3, 5, 1, 3, 1, 2, 5, 2, 1…
## $ Recommended             <int> 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1…
## $ Aircraft                <chr> NA, "A320", "A320", NA, NA, "A320", "Boeing 77…
## $ Food.Beverages          <int> NA, 1, 4, NA, 1, 1, 4, 3, 4, 1, NA, 3, 1, 4, 3…
## $ InflightEntertainment   <int> NA, 2, NA, NA, 1, 1, 4, NA, 4, 1, NA, 3, NA, 3…
## $ Wifi.Connectivity       <int> NA, 2, NA, NA, 1, NA, NA, NA, 3, 1, NA, 3, 1, …
## $ encoded_TypeOfTraveller <int> 2, 1, 2, 2, 2, 4, 2, 4, 3, 2, 2, 2, 4, 2, 1, 2…
## $ encoded_SeatType        <int> 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 1, 2, 2, 2, 2, 2…

3.1.2 Handling Missing Values

Identifying and addressing missing data points to maintain data integrity.

  1. Wifi.Connectivity has an 84% missing rate (3,092 of 3,701 rows). Imputing such a high share of missing values could introduce significant bias and distort the analysis, so we decided to drop the variable to maintain the integrity and reliability of the dataset (the per-column missing rates are verified in the sketch after this list).

  2. The Route and Aircraft variables were excluded because the same route or aircraft is often represented inconsistently in the data. Moreover, these variables introduce a multitude of categories, making them unwieldy as categorical variables.

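Before dropping these columns, the per-column missing rates cited above can be verified with a quick check. This is a minimal sketch (the original report does not include this chunk), run after empty strings were recoded to NA:

# per-column missing rate, highest first
sort(round(colMeans(is.na(df)), 2), decreasing = TRUE)
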
df_clean1 <- df %>% select(-Wifi.Connectivity,-Aircraft,-Route)
  1. For the rating variables InflightEntertainment, GroundService, Food.Beverages, CabinStaffService, SeatComfort, and ValueForMoney, we propose using K-Nearest Neighbors (KNN) imputation. KNN imputation effectively preserves the data structure by filling in missing values based on similarities between observations. This method suits these rating variables well, given their positive correlation with others. They can thus serve as dependable proxies for completing missing values in associated variables.

  2. For the encoded variables encoded_SeatType and encoded_TypeOfTraveller, we decided to remove the rows with missing values instead. This is because these variables have a low correlation with the others, making them unreliable to impute and less impactful on the overall analysis (see the correlation check below).

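As a minimal sketch of the correlation argument above (this chunk is not from the original report), the Spearman correlations among the numeric variables can be inspected directly:

# rank correlations between the ratings and the encoded categorical variables
round(cor(select(df_clean1, where(is.numeric)),
          use = "pairwise.complete.obs", method = "spearman"), 2)
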
df_clean2 <- df_clean1 %>% drop_na("SeatType","TypeOfTraveller")
pacman::p_load("VIM","imputeTS","conflicted") # load the necessary packages for KNN imputation

conflicts_prefer(dplyr::filter) # resolve the conflict between dplyr::filter and stats::filter
## [conflicted] Will prefer dplyr::filter over any other package.
# columns for imputation
feature1 <- c("SeatComfort", "CabinStaffService", "GroundService", "ValueForMoney", "Food.Beverages", "InflightEntertainment","Recommended","OverallRating")

# apply kNN imputation
df_imputed <- df_clean2 %>% select(all_of(feature1)) %>% kNN()
head(df_imputed)
##   SeatComfort CabinStaffService GroundService ValueForMoney Food.Beverages
## 1           1                 1             1             1              1
## 2           2                 3             1             2              1
## 3           3                 3             4             3              4
## 4           3                 3             1             1              2
## 5           1                 1             1             1              1
## 6           1                 1             1             1              1
##   InflightEntertainment Recommended OverallRating SeatComfort_imp
## 1                     1           1             1           FALSE
## 2                     2           1             3           FALSE
## 3                     4           2             8           FALSE
## 4                     2           1             1           FALSE
## 5                     1           1             1           FALSE
## 6                     1           1             1           FALSE
##   CabinStaffService_imp GroundService_imp ValueForMoney_imp Food.Beverages_imp
## 1                 FALSE             FALSE             FALSE               TRUE
## 2                 FALSE             FALSE             FALSE              FALSE
## 3                 FALSE             FALSE             FALSE              FALSE
## 4                 FALSE             FALSE             FALSE               TRUE
## 5                 FALSE             FALSE             FALSE              FALSE
## 6                 FALSE             FALSE             FALSE              FALSE
##   InflightEntertainment_imp Recommended_imp OverallRating_imp
## 1                      TRUE           FALSE             FALSE
## 2                     FALSE           FALSE             FALSE
## 3                      TRUE           FALSE             FALSE
## 4                      TRUE           FALSE             FALSE
## 5                     FALSE           FALSE             FALSE
## 6                     FALSE           FALSE             FALSE
  1. The study will focus on rating and review data. Consequently, retaining the remaining missing values for DateFlown and Datetime will not affect the model results. However, these variables will be utilized in the analysis phase to extract potential insights.

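The chunk that writes the imputed ratings back into the data frame was omitted from this extract, so df_clean3 (used below) would otherwise be undefined. The following is a minimal sketch, assuming the kNN() output simply replaces the corresponding columns of df_clean2:

# copy the imputed rating columns back; kNN() preserves row order and appends
# *_imp indicator columns, which are not carried over
df_clean3 <- df_clean2
df_clean3[, feature1] <- df_imputed[, feature1]
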
3.1.3 Dealing with Duplicate Entries

Detecting and removing duplicate records to maintain dataset uniqueness.

# show duplicate rows
nrow(df_clean3[duplicated(df_clean3),])
## [1] 0
# remove any duplicated rows (none were found above, so this is a no-op here)
df_clean4 <- unique(df_clean3)

print(paste0("[Before cleaning] ","Rows: ",nrow(original_df),", Columns: ", ncol(original_df)))
## [1] "[Before cleaning] Rows: 3701, Columns: 19"
print(paste0("[After cleaning] ","Rows: ",nrow(df_clean3),", Columns: ", ncol(df_clean3)))
## [1] "[After cleaning] Rows: 2929, Columns: 18"

3.1.4 Handling Outliers

Frequency analysis was used to screen the ordinal rating variables for outliers. The analysis showed no significant evidence that the less frequent levels should be treated as outliers; a sketch of the check follows.

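A minimal sketch of this check (the original chunk is not shown) tabulates each ordinal rating variable to see how often every level occurs:

# frequency of each level per rating variable
rating_cols <- c("SeatComfort", "CabinStaffService", "GroundService",
                 "ValueForMoney", "Food.Beverages", "InflightEntertainment")
lapply(df_clean4[rating_cols], table)
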
df_final <- df_clean4 %>% mutate(row = row_number()) # add a row id for joining the text features back later
glimpse(df_final)
## Rows: 2,929
## Columns: 19
## $ OverallRating           <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, 8, 7, 1…
## $ ReviewHeader            <chr> "\"Service level far worse then Ryanair\"", "\…
## $ Name                    <chr> "L Keele", "Austin Jones", "M A Collie", "Nige…
## $ Datetime                <date> 2023-11-19, 2023-11-19, 2023-11-16, 2023-11-1…
## $ VerifiedReview          <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TR…
## $ ReviewBody              <chr> "4 Hours before takeoff we received a Mail sta…
## $ TypeOfTraveller         <chr> "Couple Leisure", "Business", "Couple Leisure"…
## $ SeatType                <chr> "Economy Class", "Economy Class", "Business Cl…
## $ DateFlown               <date> 2023-11-01, 2023-11-01, 2023-11-01, 2022-12-0…
## $ SeatComfort             <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, 4, 3, 2…
## $ CabinStaffService       <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, 5, 5, 3…
## $ GroundService           <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, 5, 5, 1…
## $ ValueForMoney           <int> 1, 2, 3, 1, 1, 1, 4, 3, 5, 1, 3, 1, 2, 5, 2, 1…
## $ Recommended             <int> 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1…
## $ Food.Beverages          <int> 1, 1, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 1, 4, 3, 3…
## $ InflightEntertainment   <int> 1, 2, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 2, 3, 4, 3…
## $ encoded_TypeOfTraveller <int> 2, 1, 2, 2, 2, 4, 2, 4, 3, 2, 2, 2, 4, 2, 1, 2…
## $ encoded_SeatType        <int> 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ row                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…

3.2 Text Pre-processing

3.2.1 Noise Removal / Normalization / Stop Words Removal

  1. Removing unwanted or irrelevant content from the text, such as punctuation, special characters, duplicate text, HTML tags, URLs, headers, and footers.

  2. Converting all text to lower case so that everything is in a standard format.

  3. Discarding commonly used words that do not carry much significance in a given context, such as “and”, “the”, or “is”.

pacman::p_load(tm) # tm package for cleaning text

# define a text cleaning function
cleaning_text <- function(text){
  text <- tolower(text) # convert to lower case
  text <- removePunctuation(text) # remove punctuations
  text <- removeNumbers(text) # remove numbers
  text <- removeWords(text,stopwords("en")) # stop word removal
  text <- gsub("[^a-z ]","",text) # to remove non-lowercase letters and spaces
  text <- stripWhitespace(text) # ensure no excessive spaces
  return(text)
}

# perform cleaning on the review header
df_head <- df_final %>% select(row,ReviewHeader) %>% mutate(clean_head = sapply(ReviewHeader, cleaning_text)) %>% as.data.frame()
head(df_head,3)
##   row                             ReviewHeader                      clean_head
## 1   1   "Service level far worse then Ryanair" service level far worse ryanair
## 2   2 "do not upgrade members based on status"    upgrade members based status
## 3   3            "Flight was smooth and quick"             flight smooth quick
df_body <- df_final %>% select(row,ReviewBody) %>% mutate(clean_body = sapply(ReviewBody, cleaning_text)) %>% as.data.frame()
head(df_body,3)
##   row
## 1   1
## 2   2
## 3   3
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ReviewBody
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        4 Hours before takeoff we received a Mail stating a cryptic message that there are disruptions to be expected as there is a limit on how many planes can leave at the same time. So did the capacity of the Heathrow Airport really hit British Airways by surprise, 4h before departure? Anyhow - we took the one hour delay so what - but then we have been forced to check in our Hand luggage. I travel only with hand luggage to avoid waiting for the ultra slow processing of the checked in luggage. Overall 2h later at home than planed, with really no reason, just due to incompetent people. Service level far worse then Ryanair and triple the price. Really never again. Thanks for nothing.
## 2 I recently had a delay on British Airways from BRU to LHR that was due to staff shortages. They announced that there was a 2 hour holding delay but they would board us immediately in hopes of clearing the gate and leaving early. We had to wait the full 2 hours inside the airplane. The plane was old, dirty, had no power at the seats. The staff provided a small bag of pretzels and 250ml of water to the passengers for 2 hour delay and 2 hour flight. There were no options to purchase food or drink. There were no entertainment options available. I am a OneWorld emerald elite member but they do not upgrade members based on status. First class lounges at Heathrow are overcrowded, understaffed and poorly equipped. The help desk is completely unhelpful when an error arises with delays and cancellations - even when having the top status. The Avios points system has been devalued to near worthlessness and requires fees to book reward that nearly equal the price of the revenue ticket. British has lost its way in recent years and has a moved from a world-class airline to a budget airline with much worse service and timeliness than Ryanair or EasyJet.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Boarded on time, but it took ages to get to the runway due to congestion. Flight was smooth and quick. Snack and drinks were good for a short flight. Landed only about ten minutes late. One bag of three left in London, forms quickly filled in, and the bag was delivered the next morning.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               clean_body
## 1                                                                                                                                                                                                                                                                                                                                        hours takeoff received mail stating cryptic message disruptions expected limit many planes can leave time capacity heathrow airport really hit british airways surprise h departure anyhow took one hour delay forced check hand luggage travel hand luggage avoid waiting ultra slow processing checked luggage overall h later home planed really reason just due incompetent people service level far worse ryanair triple price really never thanks nothing
## 2  recently delay british airways bru lhr due staff shortages announced hour holding delay board us immediately hopes clearing gate leaving early wait full hours inside airplane plane old dirty power seats staff provided small bag pretzels ml water passengers hour delay hour flight options purchase food drink entertainment options available oneworld emerald elite member upgrade members based status first class lounges heathrow overcrowded understaffed poorly equipped help desk completely unhelpful error arises delays cancellations even top status avios points system devalued near worthlessness requires fees book reward nearly equal price revenue ticket british lost way recent years moved worldclass airline budget airline much worse service timeliness ryanair easyjet
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  boarded time took ages get runway due congestion flight smooth quick snack drinks good short flight landed ten minutes late one bag three left london forms quickly filled bag delivered next morning

3.2.2 Lemmatization

Lemmatization reduces words to their base form while taking context and morphology into account. For instance, “better” is lemmatized to “good”, a mapping that a stemmer, which only strips suffixes, cannot make.

pacman::p_load(textstem) # textstem provides lemmatize_words()
# Function to lemmatize sentences
lemmatize_sentence <- function(sentence) {
  # Tokenize the sentence into words
  words <- unlist(strsplit(sentence, " "))
  # Lemmatize each word
  lemmatized_words <- lemmatize_words(words)
  # Reconstruct the sentence
  lemmatized_sentence <- paste(lemmatized_words, collapse = " ")
  return(lemmatized_sentence)
}
# lemmatization for head
df_head$norm_head <- sapply(df_head$clean_head, lemmatize_sentence)
head(df_head[df_head$norm_head != df_head$clean_head,c('clean_head','norm_head')],3)
##                        clean_head                     norm_head
## 1 service level far worse ryanair service level far bad ryanair
## 2    upgrade members based status    upgrade member base status
## 6      cant imagine worst airline      cant imagine bad airline
# lemmatization for body
df_body$norm_body <- sapply(df_body$clean_body, lemmatize_sentence)
head(df_body[df_body$norm_body != df_body$clean_body,c('clean_body','norm_body')],3)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               clean_body
## 1                                                                                                                                                                                                                                                                                                                                        hours takeoff received mail stating cryptic message disruptions expected limit many planes can leave time capacity heathrow airport really hit british airways surprise h departure anyhow took one hour delay forced check hand luggage travel hand luggage avoid waiting ultra slow processing checked luggage overall h later home planed really reason just due incompetent people service level far worse ryanair triple price really never thanks nothing
## 2  recently delay british airways bru lhr due staff shortages announced hour holding delay board us immediately hopes clearing gate leaving early wait full hours inside airplane plane old dirty power seats staff provided small bag pretzels ml water passengers hour delay hour flight options purchase food drink entertainment options available oneworld emerald elite member upgrade members based status first class lounges heathrow overcrowded understaffed poorly equipped help desk completely unhelpful error arises delays cancellations even top status avios points system devalued near worthlessness requires fees book reward nearly equal price revenue ticket british lost way recent years moved worldclass airline budget airline much worse service timeliness ryanair easyjet
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  boarded time took ages get runway due congestion flight smooth quick snack drinks good short flight landed ten minutes late one bag three left london forms quickly filled bag delivered next morning
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          norm_body
## 1                                                                                                                                                                                                                                                                                                                         hour takeoff receive mail state cryptic message disruption expect limit many plane can leave time capacity heathrow airport really hit british airway surprise h departure anyhow take one hour delay force check hand luggage travel hand luggage avoid wait ultra slow process check luggage overall h late home plane really reason just due incompetent people service level far bad ryanair triple price really never thank nothing
## 2  recently delay british airway bru lhr due staff shortage announce hour hold delay board us immediately hope clear gate leave early wait full hour inside airplane plane old dirty power seat staff provide small bag pretzel ml water passenger hour delay hour flight option purchase food drink entertainment option available oneworld emerald elite member upgrade member base status first class lounge heathrow overcrowd understaffed poorly equip help desk completely unhelpful error arise delay cancellation even top status avios point system devalue near worthlessness require fee book reward nearly equal price revenue ticket british lose way recent year move worldclass airline budget airline much bad service timeliness ryanair easyjet
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       board time take age get runway due congestion flight smooth quick snack drink good short flight land ten minute late one bag three leave london form quickly fill bag deliver next morning

3.2.3 Tokenization

Breaking the text into individual words.

pacman::p_load(tidytext, dplyr)

# tokenize the header text into words
head_token <- df_head %>% select(row, norm_head) %>% unnest_tokens(word, norm_head)
head(head_token,3)
##   row    word
## 1   1 service
## 2   1   level
## 3   1     far
# tokenize the body text into words
body_token <- df_body %>% select(row, norm_body) %>% unnest_tokens(word, norm_body)
head(body_token,3)
##   row    word
## 1   1    hour
## 2   1 takeoff
## 3   1 receive

3.3 Text Representation

Sentiment analysis can be lexicon-based or model-based. Here, a lexicon-based approach is first used to assign sentiment scores, categorized as positive, negative, or neutral, to the text. The scored data is then transformed into a format suitable for machine learning training and testing.

3.3.1 Lexicon-Based Sentiment Analysis

Lexicon-based sentiment analysis is a method used to determine the sentiment or emotional tone of a piece of text by leveraging a pre-defined list of words (a lexicon) where each word is associated with a specific sentiment value. This approach relies on dictionaries of words annotated with sentiment scores, typically indicating whether a word is positive, negative, or neutral.

pacman::p_load(tidytext, textdata, dplyr, stringr)

head_token2 <- head_token %>%
  left_join(get_sentiments("bing"), by = "word") %>%
  mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>%
  mutate(score = ifelse(sentiment == "neutral", 0, ifelse(sentiment == "negative", -1, 1)))
head(head_token2)
##   row    word sentiment score
## 1   1 service   neutral     0
## 2   1   level   neutral     0
## 3   1     far   neutral     0
## 4   1     bad  negative    -1
## 5   1 ryanair   neutral     0
## 6   2 upgrade   neutral     0
body_token2 <- body_token %>%
  left_join(get_sentiments("bing"), by = "word") %>%
  mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>%
  mutate(score = ifelse(sentiment == "neutral", 0, ifelse(sentiment == "negative", -1, 1)))
head(body_token2)
##   row    word sentiment score
## 1   1    hour   neutral     0
## 2   1 takeoff   neutral     0
## 3   1 receive   neutral     0
## 4   1    mail   neutral     0
## 5   1   state   neutral     0
## 6   1 cryptic   neutral     0

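The chunk that aggregates the token-level scores into per-review sentiment was omitted from this extract. The following minimal sketch reproduces the columns used below, assuming each review's score is the mean token score of its header and body averaged together, and assuming a +/-0.05 neutral band for the class labels (the exact cutoff used originally is not shown):

# mean token score per review, for body and header separately
body_scores <- body_token2 %>% group_by(row) %>%
  summarise(overall_sentiment_score_body = round(mean(score), 2))
head_scores <- head_token2 %>% group_by(row) %>%
  summarise(overall_sentiment_score_head = round(mean(score), 2))

overall_sentiment <- body_scores %>%
  left_join(head_scores, by = "row") %>%
  mutate(
    overall_sentiment_score = round((overall_sentiment_score_body +
                                       overall_sentiment_score_head) / 2, 2),
    # assumed cutoff: scores within +/-0.05 are treated as neutral
    overall_sentiment_class = case_when(
      overall_sentiment_score >  0.05 ~ "positive",
      overall_sentiment_score < -0.05 ~ "negative",
      TRUE ~ "neutral"
    ),
    # keep the original (misspelled) column name so the join below matches
    encoded_overll_sentiment_class = case_when(
      overall_sentiment_class == "positive" ~ 1,
      overall_sentiment_class == "negative" ~ -1,
      TRUE ~ 0
    )
  )
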
final_df <- df_final %>%
  left_join(select(overall_sentiment, row, overall_sentiment_score_body,
                   overall_sentiment_score_head, overall_sentiment_score,
                   overall_sentiment_class, encoded_overll_sentiment_class),
            by = 'row')
glimpse(final_df)
## Rows: 2,929
## Columns: 24
## $ OverallRating                  <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, …
## $ ReviewHeader                   <chr> "\"Service level far worse then Ryanair…
## $ Name                           <chr> "L Keele", "Austin Jones", "M A Collie"…
## $ Datetime                       <date> 2023-11-19, 2023-11-19, 2023-11-16, 20…
## $ VerifiedReview                 <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, T…
## $ ReviewBody                     <chr> "4 Hours before takeoff we received a M…
## $ TypeOfTraveller                <chr> "Couple Leisure", "Business", "Couple L…
## $ SeatType                       <chr> "Economy Class", "Economy Class", "Busi…
## $ DateFlown                      <date> 2023-11-01, 2023-11-01, 2023-11-01, 20…
## $ SeatComfort                    <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, …
## $ CabinStaffService              <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, …
## $ GroundService                  <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, …
## $ ValueForMoney                  <int> 1, 2, 3, 1, 1, 1, 4, 3, 5, 1, 3, 1, 2, …
## $ Recommended                    <int> 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, …
## $ Food.Beverages                 <int> 1, 1, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 1, …
## $ InflightEntertainment          <int> 1, 2, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 2, …
## $ encoded_TypeOfTraveller        <int> 2, 1, 2, 2, 2, 4, 2, 4, 3, 2, 2, 2, 4, …
## $ encoded_SeatType               <int> 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 1, 2, 2, …
## $ row                            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ overall_sentiment_score_body   <dbl> -0.08, -0.06, 0.03, -0.01, -0.04, -0.16…
## $ overall_sentiment_score_head   <dbl> -0.20, 0.00, 0.33, -0.33, 0.00, -0.25, …
## $ overall_sentiment_score        <dbl> -0.14, -0.03, 0.18, -0.17, -0.02, -0.21…
## $ overall_sentiment_class        <chr> "negative", "neutral", "positive", "neg…
## $ encoded_overll_sentiment_class <dbl> -1, 0, 1, -1, 0, -1, 1, 1, -1, 0, 0, 1,…

3.3.2 Text Visualisation

Visualizing text data can help identify instances of misclassification in text cleaning and representation. This process allows us to examine and understand the patterns of misclassification more effectively. Additionally, visualizing the distribution of sentiment scores or probabilities for each class can provide further insights into the model’s performance and potential areas for improvement.

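As a minimal sketch of such a visualisation (assuming ggplot2, loaded with the tidyverse), the distribution of the combined sentiment scores can be plotted per assigned class:

ggplot(final_df, aes(x = overall_sentiment_score, fill = overall_sentiment_class)) +
  geom_histogram(bins = 30, alpha = 0.8, position = "identity") +
  labs(x = "Overall sentiment score", fill = "Sentiment class",
       title = "Distribution of sentiment scores by assigned class")
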
3.4 Summary

To summarize, the cleaned datasets are exported for analysis and modelling.

# final clean data for analysis and modelling
glimpse(final_df)
## Rows: 2,929
## Columns: 24
## $ OverallRating                  <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, …
## $ ReviewHeader                   <chr> "\"Service level far worse then Ryanair…
## $ Name                           <chr> "L Keele", "Austin Jones", "M A Collie"…
## $ Datetime                       <date> 2023-11-19, 2023-11-19, 2023-11-16, 20…
## $ VerifiedReview                 <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, T…
## $ ReviewBody                     <chr> "4 Hours before takeoff we received a M…
## $ TypeOfTraveller                <chr> "Couple Leisure", "Business", "Couple L…
## $ SeatType                       <chr> "Economy Class", "Economy Class", "Busi…
## $ DateFlown                      <date> 2023-11-01, 2023-11-01, 2023-11-01, 20…
## $ SeatComfort                    <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, …
## $ CabinStaffService              <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, …
## $ GroundService                  <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, …
## $ ValueForMoney                  <int> 1, 2, 3, 1, 1, 1, 4, 3, 5, 1, 3, 1, 2, …
## $ Recommended                    <int> 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, …
## $ Food.Beverages                 <int> 1, 1, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 1, …
## $ InflightEntertainment          <int> 1, 2, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 2, …
## $ encoded_TypeOfTraveller        <int> 2, 1, 2, 2, 2, 4, 2, 4, 3, 2, 2, 2, 4, …
## $ encoded_SeatType               <int> 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 1, 2, 2, …
## $ row                            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ overall_sentiment_score_body   <dbl> -0.08, -0.06, 0.03, -0.01, -0.04, -0.16…
## $ overall_sentiment_score_head   <dbl> -0.20, 0.00, 0.33, -0.33, 0.00, -0.25, …
## $ overall_sentiment_score        <dbl> -0.14, -0.03, 0.18, -0.17, -0.02, -0.21…
## $ overall_sentiment_class        <chr> "negative", "neutral", "positive", "neg…
## $ encoded_overll_sentiment_class <dbl> -1, 0, 1, -1, 0, -1, 1, 1, -1, 0, 0, 1,…
write.csv(final_df, file = "final_df.csv", row.names=FALSE)

# final text data for text analysis (EDA only)
final_text <- bind_rows(head_token2 %>% mutate(source = "head"), body_token2 %>% mutate(source = "body"))

glimpse(final_text)
## Rows: 258,883
## Columns: 5
## $ row       <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, …
## $ word      <chr> "service", "level", "far", "bad", "ryanair", "upgrade", "mem…
## $ sentiment <chr> "neutral", "neutral", "neutral", "negative", "neutral", "neu…
## $ score     <dbl> 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 1, 0, 0, -1, 0, 0, 0, 0, 0, 0…
## $ source    <chr> "head", "head", "head", "head", "head", "head", "head", "hea…
write.csv(final_text, file = "final_text.csv", row.names=FALSE)

4. Data Analysis


4.1 Analysis with time component

Relationship between ratings and number of flights: While the number of flights reviewed dropped sharply in the early 2020s, the ratings (especially the overall rating) fluctuated widely. This may be due to the reduction in passenger numbers during the pandemic, which gives each individual rating more weight in the averages. As flight numbers recovered from 2021 onwards, the fluctuations decreased, but the ratings have still not returned to the stable levels seen before the pandemic.

Relationship between rating categories: OverallRating is numerically higher than the other categories, but note that it is recorded on a 1-10 scale while the individual service ratings use a 1-5 scale, so the gap partly reflects the scales. Ratings for specific services (e.g., Cabin Staff Service, Food & Beverages) are relatively low and differ little from each other.

Combining these two charts makes it possible to better understand how passenger ratings of the airline's services changed over time and how those changes relate to the number of flights. A sketch of the plotting code follows.

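The charts themselves are not reproduced in this extract. A minimal sketch of the kind of plot described, assuming lubridate for date handling and treating the number of reviews per month flown as a proxy for flight volume:

pacman::p_load(lubridate)

monthly <- final_df %>%
  filter(!is.na(DateFlown)) %>%
  group_by(month = floor_date(DateFlown, "month")) %>%
  summarise(flights = n(),
            mean_overall = mean(OverallRating, na.rm = TRUE))

# ratings over time, with point size indicating the number of reviews
ggplot(monthly, aes(x = month, y = mean_overall)) +
  geom_line() +
  geom_point(aes(size = flights), alpha = 0.5) +
  labs(x = "Month flown", y = "Mean overall rating", size = "Reviews")
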
4.2 Analysis with text data

The combination of the word cloud and the pie chart supports the following conclusions. The most frequently mentioned words in the reviews relate to passengers’ flight experience and service. Most of the prominent words are negative, e.g. ‘delay’, ‘cancel’, ‘refund’, which is consistent with the high percentage of non-recommended reviews shown in the pie chart. Some positive terms, such as ‘comfortable’ and ‘good’, are also present but are relatively few, suggesting that while some passengers are satisfied with the airline’s service, the overall proportion of non-recommendations is high. This visual analysis helps us better understand passenger feedback and the issues in the airline’s services. A sketch of the chart code follows.

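The word cloud and pie chart are likewise not reproduced here. A minimal sketch, assuming the wordcloud package (the original chart chunks were not included):

pacman::p_load(wordcloud, RColorBrewer)

# word cloud of the most frequent review-body words
word_freq <- final_text %>% filter(source == "body") %>% count(word, sort = TRUE)
set.seed(123)
wordcloud(words = word_freq$word, freq = word_freq$n,
          max.words = 100, colors = brewer.pal(8, "Dark2"))

# pie chart of recommended vs non-recommended reviews (1 = "no", 2 = "yes")
pie(table(final_df$Recommended), labels = c("Not recommended", "Recommended"))
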
5. Modelling & Evaluation


5.1 Classification

  • Purpose: To determine whether a given flight plan (SeatType, TypeOfTraveller, CabinStaffService rating, GroundService rating) will produce a positive, negative, or neutral sentiment. Flight planners can then tweak the plan so that it gives passengers a satisfactory experience.

  • Model: Multinomial Logistic Regression, Support Vector Machines (SVM), Naive Bayes, KNN, Decision Tree, and Random Forest

  • Training: stratified 80% train / 20% test split (see 5.1.2)

  • Evaluation: Accuracy and F1 score

5.1.1 Feature Selection and Encoding

# Load required libraries
pacman::p_load(dplyr)

# Assuming final_df is already defined and loaded
final_df_selected <- dplyr::select(final_df, SeatType, TypeOfTraveller, CabinStaffService, GroundService, overall_sentiment_class, overall_sentiment_score)

# Check the selected data frame
print(head(final_df_selected))
##         SeatType TypeOfTraveller CabinStaffService GroundService
## 1  Economy Class  Couple Leisure                 1             1
## 2  Economy Class        Business                 3             1
## 3 Business Class  Couple Leisure                 3             4
## 4  Economy Class  Couple Leisure                 3             1
## 5  Economy Class  Couple Leisure                 1             1
## 6  Economy Class    Solo Leisure                 1             1
##   overall_sentiment_class overall_sentiment_score
## 1                negative                   -0.14
## 2                 neutral                   -0.03
## 3                positive                    0.18
## 4                negative                   -0.17
## 5                 neutral                   -0.02
## 6                negative                   -0.21
# Convert CabinStaffService and GroundService to numeric

final_df_selected <- final_df_selected %>% mutate(CabinStaffService = as.numeric(CabinStaffService), GroundService = as.numeric(GroundService))
pacman::p_load(caret)

# Assume final_df_selected is already defined and loaded

# Create a dummy variable model for SeatType and TypeOfTraveller
dummies <- dummyVars(~ SeatType + TypeOfTraveller, data = final_df_selected)

# Apply the dummy variable model to the data and convert to a data frame
encoded_df <- as.data.frame(predict(dummies, newdata = final_df_selected))

# Combine the one-hot encoded columns back with the original columns that were not part of the encoding
non_encoded_columns <- final_df_selected[, !(colnames(final_df_selected) %in% c("SeatType", "TypeOfTraveller"))]
final_df_selected <- cbind(non_encoded_columns, encoded_df)

# Check the structure to confirm all columns are present
str(final_df_selected)
## 'data.frame':    2929 obs. of  12 variables:
##  $ CabinStaffService            : num  1 3 3 3 1 1 5 3 5 3 ...
##  $ GroundService                : num  1 1 4 1 1 1 4 3 3 3 ...
##  $ overall_sentiment_class      : chr  "negative" "neutral" "positive" "negative" ...
##  $ overall_sentiment_score      : num  -0.14 -0.03 0.18 -0.17 -0.02 -0.21 0.12 0.25 -0.36 0 ...
##  $ SeatTypeBusiness Class       : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ SeatTypeEconomy Class        : num  1 1 0 1 1 1 0 1 1 1 ...
##  $ SeatTypeFirst Class          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SeatTypePremium Economy      : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ TypeOfTravellerBusiness      : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ TypeOfTravellerCouple Leisure: num  1 0 1 1 1 0 1 0 0 1 ...
##  $ TypeOfTravellerFamily Leisure: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ TypeOfTravellerSolo Leisure  : num  0 0 0 0 0 1 0 1 0 0 ...

5.1.2 Train Test Split with Stratification

Stratifying the split by the target variable preserves the class proportions in both the train and test sets.

# Train Test Split, stratified by target_variable
# Load required libraries
pacman::p_load(caret)

# Define the target column
target_column <- "overall_sentiment_class"

# Perform a stratified train-test split
set.seed(123)
trainIndex <- createDataPartition(final_df_selected[[target_column]], p = .8,
                                  list = FALSE,
                                  times = 1)

# Create training and testing sets
train_set <- final_df_selected[trainIndex, ]
test_set <- final_df_selected[-trainIndex, ]

# Separate features and target
X_train <- train_set[, !(colnames(train_set) %in% c(target_column,"overall_sentiment_score"))]
y_train <- train_set[[target_column]]
X_test <- test_set[, !(colnames(test_set) %in% c(target_column,"overall_sentiment_score"))]
y_test <- test_set[[target_column]]

# Check the structures
str(X_train)
## 'data.frame':    2344 obs. of  10 variables:
##  $ CabinStaffService            : num  3 3 3 1 5 3 5 3 3 1 ...
##  $ GroundService                : num  1 4 1 1 4 3 3 4 1 3 ...
##  $ SeatTypeBusiness Class       : num  0 1 0 0 0 0 0 1 0 0 ...
##  $ SeatTypeEconomy Class        : num  1 0 1 1 0 1 1 0 1 1 ...
##  $ SeatTypeFirst Class          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SeatTypePremium Economy      : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ TypeOfTravellerBusiness      : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ TypeOfTravellerCouple Leisure: num  0 1 1 1 1 0 0 1 1 0 ...
##  $ TypeOfTravellerFamily Leisure: num  0 0 0 0 0 0 1 0 0 0 ...
##  $ TypeOfTravellerSolo Leisure  : num  0 0 0 0 0 1 0 0 0 1 ...
str(y_train)
##  chr [1:2344] "neutral" "positive" "negative" "neutral" "positive" ...
str(X_test)
## 'data.frame':    585 obs. of  10 variables:
##  $ CabinStaffService            : num  1 1 3 1 5 3 3 3 3 3 ...
##  $ GroundService                : num  1 1 3 2 1 2 2 1 1 1 ...
##  $ SeatTypeBusiness Class       : num  0 0 0 1 1 0 0 1 0 1 ...
##  $ SeatTypeEconomy Class        : num  1 1 1 0 0 1 1 0 1 0 ...
##  $ SeatTypeFirst Class          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SeatTypePremium Economy      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TypeOfTravellerBusiness      : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ TypeOfTravellerCouple Leisure: num  1 0 1 0 1 1 1 1 0 0 ...
##  $ TypeOfTravellerFamily Leisure: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TypeOfTravellerSolo Leisure  : num  0 1 0 1 0 0 0 0 0 0 ...
str(y_test)
##  chr [1:585] "negative" "negative" "neutral" "negative" "neutral" ...

5.1.3 Preliminary Set of Algorithms

The preliminary set of algorithms comprises Multinomial Logistic Regression, SVM, Naive Bayes, KNN, Decision Tree, and Random Forest.

# Load required libraries

pacman::p_load(nnet,naivebayes,e1071,rpart,randomForest,MLmetrics, caret)

# Function to clean column names to ensure they are valid R variable names
clean_names <- function(df) {
  colnames(df) <- make.names(colnames(df), unique = TRUE)
  return(df)
}

# Assuming X_train, y_train, X_test, and y_test are already defined and loaded
X_train <- clean_names(X_train)
X_test <- clean_names(X_test)

# Ensure y_train and y_test are factors for classification tasks
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)

# Set seed for reproducibility
set.seed(123)

# Initialize a list to store models and their evaluation metrics
models <- list()
evaluation <- data.frame(Model = character(), Accuracy = numeric(), F1 = numeric(), stringsAsFactors = FALSE)

# Multinomial Logistic Regression
multinom_model <- multinom(y_train ~ ., data = data.frame(X_train, y_train))
## # weights:  36 (22 variable)
## initial  value 2575.147205 
## iter  10 value 2360.085766
## iter  20 value 2232.032308
## final  value 2221.951984 
## converged
pred_multinom <- predict(multinom_model, X_test)
acc_multinom <- sum(pred_multinom == y_test) / length(y_test)
f1_multinom <- F1_Score(y_true = y_test, y_pred = pred_multinom, positive = levels(y_test)[1])
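# note: MLmetrics::F1_Score() returns the one-vs-rest F1 for a single class;
# levels(y_test)[1] means every F1 reported below is for the "negative" class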
models$multinom <- multinom_model
evaluation <- rbind(evaluation, data.frame(Model = "Multinomial Logistic Regression", Accuracy = acc_multinom, F1 = f1_multinom))

# Support Vector Machines (SVM)
svm_model <- svm(y_train ~ ., data = data.frame(X_train, y_train), kernel = "linear", probability = TRUE)
pred_svm <- predict(svm_model, X_test)
acc_svm <- sum(pred_svm == y_test) / length(y_test)
f1_svm <- F1_Score(y_true = y_test, y_pred = pred_svm, positive = levels(y_test)[1])
models$svm <- svm_model
evaluation <- rbind(evaluation, data.frame(Model = "SVM", Accuracy = acc_svm, F1 = f1_svm))

# Naive Bayes
nb_model <- naive_bayes(y_train ~ ., data = data.frame(X_train, y_train))
pred_nb <- predict(nb_model, X_test)
acc_nb <- sum(pred_nb == y_test) / length(y_test)
f1_nb <- F1_Score(y_true = y_test, y_pred = pred_nb, positive = levels(y_test)[1])
models$nb <- nb_model
evaluation <- rbind(evaluation, data.frame(Model = "Naive Bayes", Accuracy = acc_nb, F1 = f1_nb))

# K-Nearest Neighbors (KNN)
knn_model <- train(y_train ~ ., data = data.frame(X_train, y_train), method = "knn", tuneLength = 5)
pred_knn <- predict(knn_model, X_test)
acc_knn <- sum(pred_knn == y_test) / length(y_test)
f1_knn <- F1_Score(y_true = y_test, y_pred = pred_knn, positive = levels(y_test)[1])
models$knn <- knn_model
evaluation <- rbind(evaluation, data.frame(Model = "KNN", Accuracy = acc_knn, F1 = f1_knn))

# Decision Tree
dt_model <- rpart(y_train ~ ., data = data.frame(X_train, y_train))
pred_dt <- predict(dt_model, X_test, type = "class")
acc_dt <- sum(pred_dt == y_test) / length(y_test)
f1_dt <- F1_Score(y_true = y_test, y_pred = pred_dt, positive = levels(y_test)[1])
models$dt <- dt_model
evaluation <- rbind(evaluation, data.frame(Model = "Decision Tree", Accuracy = acc_dt, F1 = f1_dt))

# Random Forest
rf_model <- randomForest(y_train ~ ., data = data.frame(X_train, y_train), ntree = 100)
pred_rf <- predict(rf_model, X_test)
acc_rf <- sum(pred_rf == y_test) / length(y_test)
f1_rf <- F1_Score(y_true = y_test, y_pred = pred_rf, positive = levels(y_test)[1])
models$rf <- rf_model
evaluation <- rbind(evaluation, data.frame(Model = "Random Forest", Accuracy = acc_rf, F1 = f1_rf))

# Print evaluation results
print(evaluation)
##                             Model  Accuracy         F1
## 1 Multinomial Logistic Regression 0.5264957 0.41447368
## 2                             SVM 0.4974359 0.09803922
## 3                     Naive Bayes 0.5196581 0.51020408
## 4                             KNN 0.4871795 0.36795252
## 5                   Decision Tree 0.5230769 0.40677966
## 6                   Random Forest 0.4957265 0.35220126

The best model from the preliminary round is Naive Bayes, which achieves only about 52% accuracy and a 51% F1 score. Since this is little better than a naive baseline, it is not good enough, so we will try to remedy it with different models and feature engineering.

5.1.4 Advanced Tree-Based and Kernel Algorithms

This time, we used cross-validated hyperparameter search for SVM (radial kernel), Random Forest, and the gradient-boosting algorithm XGBoost, to see whether this improves performance.

# Load required libraries
pacman::p_load(caret,e1071,randomForest,xgboost,MLmetrics, doParallel, kernlab)


# Function to clean column names to ensure they are valid R variable names
clean_names <- function(df) {
  colnames(df) <- make.names(colnames(df), unique = TRUE)
  return(df)
}

# Assuming X_train, y_train, X_test, and y_test are already defined and loaded
X_train <- clean_names(X_train)
X_test <- clean_names(X_test)

# Ensure y_train and y_test are factors for classification tasks
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)

# Set seed for reproducibility
set.seed(123)

# Set up parallel processing
cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)

# Define a control function for randomized search
train_control <- trainControl(method = "cv", number = 3,
                              summaryFunction = multiClassSummary,
                              classProbs = TRUE, verboseIter = TRUE)
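# note: multiClassSummary() reports "Mean_F1" rather than "F1", so caret will
# warn that metric = "F1" is not in the result set and fall back to another
# metric; "Mean_F1" is likely the intended choice here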

# Grid search for SVM over a limited grid
svm_grid <- expand.grid(C = 2^seq(-5, 2, length.out = 5), sigma = 2^seq(-5, 2, length.out = 5))
svm_model <- train(y_train ~ ., data = data.frame(X_train, y_train),
                   method = "svmRadial",
                   tuneGrid = svm_grid,
                   trControl = train_control,
                   metric = "F1",
                   preProcess = c("center", "scale"),
                   tuneLength = 5)
## Aggregating results
## Selecting tuning parameters
## Fitting sigma = 0.105, C = 0.354 on full training set
# Tune Random Forest over five mtry values
rf_model <- train(y_train ~ ., data = data.frame(X_train, y_train),
                  method = "rf",
                  tuneLength = 5,
                  trControl = train_control,
                  metric = "F1")
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 2 on full training set
# Tune XGBoost over a small hyperparameter grid
xgb_grid <- expand.grid(nrounds = 50,
                        max_depth = seq(3, 7, by = 2),
                        eta = c(0.01, 0.1),
                        gamma = c(0, 0.1),
                        colsample_bytree = c(0.6, 0.8),
                        min_child_weight = c(1, 3),
                        subsample = c(0.7, 0.9))
xgb_model <- train(y_train ~ ., data = data.frame(X_train, y_train),
                   method = "xgbTree",
                   tuneGrid = xgb_grid,
                   trControl = train_control,
                   metric = "F1")
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 50, max_depth = 3, eta = 0.1, gamma = 0.1, colsample_bytree = 0.6, min_child_weight = 1, subsample = 0.7 on full training set
# Stop parallel processing
stopCluster(cl)
registerDoSEQ()

# Predict and evaluate
models <- list(svm = svm_model, rf = rf_model, xgb = xgb_model)
evaluation <- data.frame(Model = character(), Accuracy = numeric(), F1 = numeric(), stringsAsFactors = FALSE)

for (model_name in names(models)) {
  model <- models[[model_name]]
  pred <- predict(model, X_test)
  acc <- sum(pred == y_test) / length(y_test)
  f1 <- F1_Score(y_true = y_test, y_pred = pred, positive = levels(y_test)[1])
  evaluation <- rbind(evaluation, data.frame(Model = model_name, Accuracy = acc, F1 = f1))
}

# Print evaluation results
print(evaluation)
##   Model  Accuracy        F1
## 1   svm 0.5042735 0.2588235
## 2    rf 0.5196581 0.3391003
## 3   xgb 0.5025641 0.3513514

This did not improve the performance; the best algorithm is still Naive Bayes, with the highest accuracy and F1 score.

5.1.5 Advanced Bayesian Algorithms

We also considered other Bayesian algorithms, since Naive Bayes performed the best in the preliminary modelling.

# Load required libraries
pacman::p_load(caret, klaR, MLmetrics, doParallel, kernlab, shiny)
## Installing package into 'C:/Users/User/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
# Function to clean column names to ensure they are valid R variable names
clean_names <- function(df) {
  colnames(df) <- make.names(colnames(df), unique = TRUE)
  return(df)
}

# Assuming X_train, y_train, X_test, and y_test are already defined and loaded
X_train <- clean_names(X_train)
X_test <- clean_names(X_test)

# Ensure y_train and y_test are factors for classification tasks
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)

# Set seed for reproducibility
set.seed(123)

# Set up parallel processing
cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)

# Initialize a list to store models and their evaluation metrics
models <- list()
evaluation <- data.frame(Model = character(), Accuracy = numeric(), F1 = numeric(), stringsAsFactors = FALSE)

# Gaussian Naive Bayes
gnb_model <- foreach(i = 1, .packages = 'klaR') %dopar% {
  NaiveBayes(y_train ~ ., data = data.frame(X_train, y_train), usekernel = FALSE)
}
pred_gnb <- predict(gnb_model[[1]], X_test)$class
acc_gnb <- sum(pred_gnb == y_test) / length(y_test)
f1_gnb <- F1_Score(y_true = y_test, y_pred = pred_gnb, positive = levels(y_test)[1])
models$gnb <- gnb_model[[1]]
evaluation <- rbind(evaluation, data.frame(Model = "Gaussian Naive Bayes", Accuracy = acc_gnb, F1 = f1_gnb))

# Complement Naive Bayes
# Note: klaR does not implement a complement variant, so this block refits the
# same Gaussian model as above, which is why the two result rows below are identical
cnb_model <- foreach(i = 1, .packages = 'klaR') %dopar% {
  NaiveBayes(y_train ~ ., data = data.frame(X_train, y_train), usekernel = FALSE)
}
pred_cnb <- predict(cnb_model[[1]], X_test)$class
acc_cnb <- sum(pred_cnb == y_test) / length(y_test)
f1_cnb <- F1_Score(y_true = y_test, y_pred = pred_cnb, positive = levels(y_test)[1])
models$cnb <- cnb_model[[1]]
evaluation <- rbind(evaluation, data.frame(Model = "Complement Naive Bayes", Accuracy = acc_cnb, F1 = f1_cnb))

# Stop parallel processing
stopCluster(cl)
registerDoSEQ()

# Print evaluation results
print(evaluation)
##                    Model  Accuracy        F1
## 1   Gaussian Naive Bayes 0.5196581 0.5102041
## 2 Complement Naive Bayes 0.5196581 0.5102041

These more advanced Bayesian variants perform exactly the same as the basic Naive Bayes, so we take Naive Bayes as the best-performing algorithm so far.
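One Bayesian variant not tried above is kernel-density Naive Bayes, which replaces the Gaussian class-conditional densities with kernel density estimates. A minimal sketch using klaR, assuming the same train/test objects as above:

# Sketch: kernel-density Naive Bayes (usekernel = TRUE)
kde_nb <- klaR::NaiveBayes(y_train ~ ., data = data.frame(X_train, y_train), usekernel = TRUE)
pred_kde <- predict(kde_nb, X_test)$class
mean(pred_kde == y_test)  # accuracy on the test set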

5.1.6 Feature Engineering to Enhance Performance

# Engineer a combined service feature: the average of the ground and cabin staff ratings
X_train$GroundCabinService = (X_train$GroundService + X_train$CabinStaffService) / 2
X_test$GroundCabinService = (X_test$GroundService + X_test$CabinStaffService) / 2
# Load required libraries

pacman::p_load(nnet, naivebayes, e1071, rpart, randomForest, MLmetrics, caret)

# Function to clean column names to ensure they are valid R variable names
clean_names <- function(df) {
  colnames(df) <- make.names(colnames(df), unique = TRUE)
  return(df)
}

# Assuming X_train, y_train, X_test, and y_test are already defined and loaded
X_train <- clean_names(X_train)
X_test <- clean_names(X_test)

# Ensure y_train and y_test are factors for classification tasks
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)

# Set seed for reproducibility
set.seed(123)

# Initialize a list to store models and their evaluation metrics
models <- list()
evaluation <- data.frame(Model = character(), Accuracy = numeric(), F1 = numeric(), stringsAsFactors = FALSE)

# Multinomial Logistic Regression
multinom_model <- multinom(y_train ~ ., data = data.frame(X_train, y_train))
## # weights:  39 (24 variable)
## initial  value 2575.147205 
## iter  10 value 2342.201854
## iter  20 value 2230.757446
## final  value 2221.951984 
## converged
pred_multinom <- predict(multinom_model, X_test)
acc_multinom <- sum(pred_multinom == y_test) / length(y_test)
f1_multinom <- F1_Score(y_true = y_test, y_pred = pred_multinom, positive = levels(y_test)[1])
models$multinom <- multinom_model
evaluation <- rbind(evaluation, data.frame(Model = "Multinomial Logistic Regression", Accuracy = acc_multinom, F1 = f1_multinom))

# Support Vector Machines (SVM)
svm_model <- svm(y_train ~ ., data = data.frame(X_train, y_train), kernel = "linear", probability = TRUE)
pred_svm <- predict(svm_model, X_test)
acc_svm <- sum(pred_svm == y_test) / length(y_test)
f1_svm <- F1_Score(y_true = y_test, y_pred = pred_svm, positive = levels(y_test)[1])
models$svm <- svm_model
evaluation <- rbind(evaluation, data.frame(Model = "SVM", Accuracy = acc_svm, F1 = f1_svm))

# Naive Bayes
nb_model <- naive_bayes(y_train ~ ., data = data.frame(X_train, y_train))
pred_nb <- predict(nb_model, X_test)
acc_nb <- sum(pred_nb == y_test) / length(y_test)
f1_nb <- F1_Score(y_true = y_test, y_pred = pred_nb, positive = levels(y_test)[1])
models$nb <- nb_model
evaluation <- rbind(evaluation, data.frame(Model = "Naive Bayes", Accuracy = acc_nb, F1 = f1_nb))

# K-Nearest Neighbors (KNN)
knn_model <- train(y_train ~ ., data = data.frame(X_train, y_train), method = "knn", tuneLength = 5)
pred_knn <- predict(knn_model, X_test)
acc_knn <- sum(pred_knn == y_test) / length(y_test)
f1_knn <- F1_Score(y_true = y_test, y_pred = pred_knn, positive = levels(y_test)[1])
models$knn <- knn_model
evaluation <- rbind(evaluation, data.frame(Model = "KNN", Accuracy = acc_knn, F1 = f1_knn))

# Decision Tree
dt_model <- rpart(y_train ~ ., data = data.frame(X_train, y_train))
pred_dt <- predict(dt_model, X_test, type = "class")
acc_dt <- sum(pred_dt == y_test) / length(y_test)
f1_dt <- F1_Score(y_true = y_test, y_pred = pred_dt, positive = levels(y_test)[1])
models$dt <- dt_model
evaluation <- rbind(evaluation, data.frame(Model = "Decision Tree", Accuracy = acc_dt, F1 = f1_dt))

# Random Forest
rf_model <- randomForest(y_train ~ ., data = data.frame(X_train, y_train), ntree = 100)
pred_rf <- predict(rf_model, X_test)
acc_rf <- sum(pred_rf == y_test) / length(y_test)
f1_rf <- F1_Score(y_true = y_test, y_pred = pred_rf, positive = levels(y_test)[1])
models$rf <- rf_model
evaluation <- rbind(evaluation, data.frame(Model = "Random Forest", Accuracy = acc_rf, F1 = f1_rf))

# Print evaluation results
print(evaluation)
##                             Model  Accuracy         F1
## 1 Multinomial Logistic Regression 0.5264957 0.41447368
## 2                             SVM 0.5008547 0.09803922
## 3                     Naive Bayes 0.5162393 0.53023256
## 4                             KNN 0.4974359 0.38596491
## 5                   Decision Tree 0.5299145 0.32432432
## 6                   Random Forest 0.5076923 0.40707965
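With the engineered GroundCabinService feature, Naive Bayes improves its F1 score from 0.510 to 0.530 while accuracy across the models stays essentially unchanged, so the gain from this feature is marginal.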

5.1.7 Boosting Performance through a Stacking Ensemble
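In this final attempt, four base models (Naive Bayes, KNN, Random Forest, and an RBF-kernel SVM) each produce class-probability predictions, and a Random Forest meta-model is trained on those probabilities to make the final prediction.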

pacman::p_load(caretEnsemble, caret, randomForest, MLmetrics, klaR, kernlab)
## Installing package into 'C:/Users/User/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
# Assuming X_train, y_train, X_test, and y_test are already loaded in your environment

# Combine X_train and y_train into a single training dataset
trainData <- data.frame(X_train, target = y_train)
testData <- data.frame(X_test, target = y_test)

# Convert target to a factor if it's a classification problem
trainData$target <- as.factor(trainData$target)
testData$target <- as.factor(testData$target)

# Define control function for training with hyperparameter tuning
control <- trainControl(method = "cv", number = 5, savePredictions = "final",
                        classProbs = TRUE, summaryFunction = multiClassSummary,
                        allowParallel = TRUE)

# Train base models
model_nb <- train(target ~ ., data = trainData, method = "nb", trControl = control)
model_knn <- train(target ~ ., data = trainData, method = "knn", trControl = control)
model_rf <- train(target ~ ., data = trainData, method = "rf", trControl = control)
model_svm <- train(target ~ ., data = trainData, method = "svmRadial", trControl = control)

# Create a new training dataset for the meta-model using predictions from base models
# (note: these are in-sample predictions, which risks overfitting the meta-model;
# out-of-fold predictions, e.g. via caretEnsemble::caretStack, are the safer choice)
meta_train <- data.frame(
  nb = predict(model_nb, newdata = trainData, type = "prob"),
  knn = predict(model_knn, newdata = trainData, type = "prob"),
  rf = predict(model_rf, newdata = trainData, type = "prob"),
  svm = predict(model_svm, newdata = trainData, type = "prob"),
  target = trainData$target
)

# Train the meta-model (using randomForest for example)
meta_control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = multiClassSummary)
meta_model <- train(target ~ ., data = meta_train, method = "rf", trControl = meta_control)

# Create a new test dataset for the meta-model using predictions from base models
meta_test <- data.frame(
  nb = predict(model_nb, newdata = testData, type = "prob"),
  knn = predict(model_knn, newdata = testData, type = "prob"),
  rf = predict(model_rf, newdata = testData, type = "prob"),
  svm = predict(model_svm, newdata = testData, type = "prob")
)

# Predict on test data using the meta-model
stacked_predictions <- predict(meta_model, newdata = meta_test)

# Evaluate performance
confMatrix <- confusionMatrix(stacked_predictions, testData$target)
accuracy <- confMatrix$overall['Accuracy']

# Calculate F1 score for each class and then average
f1_scores <- sapply(levels(testData$target), function(class) {
  F1_Score(y_true = testData$target, y_pred = stacked_predictions, positive = class)
})
f1_score <- mean(f1_scores)

# Output the results
print(paste("Accuracy: ", accuracy))
## [1] "Accuracy:  0.497435897435897"
print(paste("F1 Score: ", f1_score))
## [1] "F1 Score:  0.498716875375897"

5.1.8 Conclusion on Classification

The best performance we could accomplish is an accuracy of about 52% and an F1 score of about 53%, from Naive Bayes after feature engineering; the stacking ensemble of Naive Bayes, KNN, Random Forest, and SVM did not improve on this (accuracy 49.7%, F1 49.9%). Either way, this is scarcely better than random guessing at the sentiment class.

It is therefore safe to conclude that the features available in this dataset cannot reliably predict customer sentiment; the sentiment expressed by customers appears to be driven by factors outside the dataset. Further analysis could bring in additional features and test whether they are predictive of customer satisfaction.
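For example, simple features derived from the review text itself might carry additional signal. A hypothetical sketch, assuming the raw review text is still available as final_df$ReviewBody:

pacman::p_load(tidytext, dplyr)

# Hypothetical sketch: review length after stopword removal as an extra predictor
review_length <- final_df %>%
  mutate(review_id = row_number()) %>%
  tidytext::unnest_tokens(word, ReviewBody) %>%
  anti_join(tidytext::get_stopwords(), by = "word") %>%
  count(review_id, name = "n_content_words")
# n_content_words could then be joined back onto the feature matrix by review_id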

5.2 Regression

  • Purpose: To examine the relationship between customer sentiment and service quality.

  • Models: Linear Regression (benchmark), Decision Tree Regressor, Random Forest Regressor, Polynomial Linear Regression, XGBoost Regressor

  • Training: 70% train / 30% test split

  • Evaluation: RMSE, MAE, MAPE, R-squared (the formulas are sketched after this list)
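For reference, the four evaluation metrics can be computed directly. A minimal sketch with base-R formulas equivalent to the yardstick metrics used below (y and p are hypothetical truth and prediction vectors):

y <- c(1.2, 0.8, 1.9); p <- c(1.0, 1.0, 1.5)  # hypothetical values
rmse <- sqrt(mean((y - p)^2))                 # root mean squared error
mae  <- mean(abs(y - p))                      # mean absolute error
mape <- mean(abs((y - p) / y)) * 100          # mean absolute percentage error
rsq  <- cor(y, p)^2                           # yardstick's rsq is the squared correlation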

pacman::p_load(tidyr, caret, tidyverse, randomForest, xgboost, Metrics, ggplot2, e1071, tidymodels)

# rescale the sentiment score into 0 - 2
rescale_to_0_2 <- function(x) {
  min_x <- min(x, na.rm = TRUE)
  max_x <- max(x, na.rm = TRUE)
  scaled_x <- 2 * (x - min_x) / (max_x - min_x)
  return(scaled_x)
}
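# Quick sanity check (hypothetical input): rescale_to_0_2(c(-1, 0, 3))
# returns 0, 0.5, 2, i.e. the minimum maps to 0 and the maximum to 2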

final_df$overall_sentiment_score <- rescale_to_0_2(final_df$overall_sentiment_score)
# Select encoded_SeatType, encoded_TypeOfTraveller, CabinStaffService and GroundService as X and the sentiment score as Y
X <- dplyr::select(final_df, encoded_SeatType, encoded_TypeOfTraveller, CabinStaffService, GroundService)
Y <- final_df$overall_sentiment_score

# Split the data into a 70% training set and a 30% testing set
set.seed(456)
train_index <- createDataPartition(Y, p = 0.7, list = FALSE)
train_data <- X[train_index,]
test_data <- X[-train_index,]
train_label <- Y[train_index]
test_label <- Y[-train_index]

training_set <- cbind(train_data, Target = train_label)
testing_set <- cbind(test_data, Target = test_label)
# Perform 10-fold cross-validation
mControl <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = "final"
)
# Define the evaluation metrics: Root Mean Squared Error, Mean Absolute Error, Mean Absolute Percentage Error and R-squared

# Initialize final evaluation result
final_eval_result <- data.frame()

# Function to evaluate model
evaluate_model <- function(model, model_name, testing_set, test_label) {
  predictions <- predict(model, newdata = testing_set)
  predictions_df <- data.frame(Truth = test_label, Prediction = predictions)

  metrics <- metric_set(rmse, mae, rsq, mape)
  eval_results <- metrics(predictions_df, truth = Truth, estimate = Prediction)
  eval_results <- eval_results %>%
    mutate(Model = model_name)

  return(eval_results)
}

5.2.2 Linear Regression model

Fit the Linear Regression model and evaluate it with the metric set defined above.

# Linear Regression model
set.seed(456)

# fit the data into the model
lm_model <- train(Target ~ ., data = training_set, method = "lm", metric = "MAE", maximize = FALSE, trControl = mControl)

# evaluate the model
eval_results <- evaluate_model(lm_model, "Linear Regression", testing_set, test_label)
print(eval_results)
## # A tibble: 4 × 4
##   .metric .estimator .estimate Model            
##   <chr>   <chr>          <dbl> <chr>            
## 1 rmse    standard       0.272 Linear Regression
## 2 mae     standard       0.213 Linear Regression
## 3 rsq     standard       0.211 Linear Regression
## 4 mape    standard      30.4   Linear Regression

5.2.3 Decision Tree Regressor model

Fit the Decision Tree Regressor and evaluate it with the metric set defined above.

# Decision Tree Regressor model
set.seed(456)

# fit the data in decision tree regressor model for training
dt_model <- train(Target ~ ., data = training_set, method = "rpart", metric = "MAE", maximize = FALSE, trControl = mControl)

# evaluate the model
eval_results <- evaluate_model(dt_model, "Decision Tree Regressor", testing_set, test_label)
print(eval_results)
## # A tibble: 4 × 4
##   .metric .estimator .estimate Model                  
##   <chr>   <chr>          <dbl> <chr>                  
## 1 rmse    standard       0.274 Decision Tree Regressor
## 2 mae     standard       0.215 Decision Tree Regressor
## 3 rsq     standard       0.199 Decision Tree Regressor
## 4 mape    standard      31.1   Decision Tree Regressor

5.2.4 Random Forest Regressor model

Fit the Random Forest Regressor and evaluate it with the metric set defined above.

# Random Forest Regressor model
set.seed(456)

# fit the data in random forest regressor model for training
rf_model <- train(Target ~ ., data = training_set, method = "rf", metric = "MAE", maximize = FALSE, trControl = mControl)

# evaluate the model
eval_results <- evaluate_model(rf_model, "Random Forest Regressor", testing_set, test_label)
print(eval_results)
## # A tibble: 4 × 4
##   .metric .estimator .estimate Model                  
##   <chr>   <chr>          <dbl> <chr>                  
## 1 rmse    standard       0.277 Random Forest Regressor
## 2 mae     standard       0.217 Random Forest Regressor
## 3 rsq     standard       0.199 Random Forest Regressor
## 4 mape    standard      30.8   Random Forest Regressor

5.2.5 Polynomial Linear Regression model

Fit the Polynomial Linear Regression model and evaluate it with the metric set defined above.

# Polynomial Linear Regression model
formula <- as.formula("Target ~ poly(encoded_SeatType, degree = 2) + poly(encoded_TypeOfTraveller, degree = 2) + poly(CabinStaffService, degree = 2) + poly(GroundService, degree = 2)")
set.seed(456)

# fit the data in polynomial regression model for training
poly_model <- train(formula, data = training_set, method = "lm", metric = "MAE", maximize = FALSE, trControl = mControl)
eval_results <- evaluate_model(poly_model, "Polynomial Linear Regression", testing_set, test_label)
print(eval_results)
## # A tibble: 4 × 4
##   .metric .estimator .estimate Model                       
##   <chr>   <chr>          <dbl> <chr>                       
## 1 rmse    standard       0.272 Polynomial Linear Regression
## 2 mae     standard       0.213 Polynomial Linear Regression
## 3 rsq     standard       0.214 Polynomial Linear Regression
## 4 mape    standard      30.4   Polynomial Linear Regression

5.2.6 XGBoost Regressor model

Fit the XGBoost Regressor and evaluate it with the metric set defined above.

# XGBoost Regressor model

# Define XGBoost model specification
xgb_spec <- boost_tree(
  trees = 100,
  tree_depth = 6,
  min_n = 1,
  loss_reduction = 0,
  sample_size = 1,
  mtry = 1,
  learn_rate = 0.3
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# setup the workflow for xgboost
xgb_workflow <- workflow() %>%
  add_model(xgb_spec) %>%
  add_formula(Target ~ .)

# Create cross-validation folds with 10 folds
# (note: the folds here are built from testing_set, unlike the other models,
# which were trained on training_set and scored on testing_set; see the note after the results)
set.seed(456)
cv_folds <- vfold_cv(testing_set, v = 10)

# Fit the model
xgb_resamples <- fit_resamples(
  xgb_workflow,
  resamples = cv_folds,
  metrics = metric_set(rmse, mae, rsq, mape),
  control = control_resamples(save_pred = TRUE)
)

# Collect metrics from cross-validation
eval_results <- collect_metrics(xgb_resamples)
eval_results <- eval_results %>%
  mutate(Model = "XGBoost Regressor")
print(eval_results)
## # A tibble: 4 × 7
##   .metric .estimator   mean     n std_err .config              Model            
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>                <chr>            
## 1 mae     standard    0.236    10 0.00704 Preprocessor1_Model1 XGBoost Regressor
## 2 mape    standard   33.1      10 2.53    Preprocessor1_Model1 XGBoost Regressor
## 3 rmse    standard    0.302    10 0.00868 Preprocessor1_Model1 XGBoost Regressor
## 4 rsq     standard    0.113    10 0.00921 Preprocessor1_Model1 XGBoost Regressor
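The resampling above is run on testing_set rather than on the training data, so its metrics are not directly comparable with the other models. For a like-for-like comparison, a minimal sketch (assuming the same xgb_workflow and train/test split as above) that fits on the training set and scores the held-out test set:

# Sketch: fit on training_set, evaluate on testing_set
xgb_fit <- fit(xgb_workflow, data = training_set)
xgb_preds <- predict(xgb_fit, new_data = testing_set) %>%
  dplyr::bind_cols(testing_set["Target"])
metric_set(rmse, mae, rsq, mape)(xgb_preds, truth = Target, estimate = .pred)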

5.2.7 Conclusion on Regression

Overview of the regression models built:

  • The RMSE values of all models are nearly identical, meaning their predictions deviate from the actual values by a similar average amount.

  • The MAE values are likewise similar, showing that the models' average prediction errors are nearly the same.

  • The R-squared values are very low (at best about 0.21), indicating that the models capture little of the variability in the data.

  • The MAPE values (around 30%) are high, showing that the predictions are not very accurate.

Thus, the Polynomial Linear Regression model performs the best of all the models (it has the highest R-squared alongside the lowest RMSE and MAE), but overall performance is poor even though cross-validation was applied.

6. Conclusion & Discussion