Members
Name | Matrix Number |
---|---|
Ding Lee Choong | 23075143 |
Wong Yang Gui | 22104322 |
Li Yang Bo | 22099073 |
Asadullah Qamar Bhatti | 2136569 |
LOW MENG FEI | 23063305 |
Introduction
British Airways (BA), one of the world’s leading airlines, has a rich history of providing air travel services to millions of passengers each year. Founded in 1974, the airline has grown to become a major player in the aviation industry, known for its extensive network, quality service, and commitment to passenger comfort and safety. As the airline industry becomes increasingly competitive, maintaining high levels of customer satisfaction is essential for retaining customers and fostering loyalty.
In recent years, the proliferation of digital communication channels has led to an explosion in customer feedback. Passengers now share their experiences through various platforms, including in-flight surveys, online reviews, social media, and direct communications with customer service. This feedback contains valuable insights into passengers’ experiences, preferences, and expectations.
Analyzing this feedback effectively allows British Airways to identify strengths and weaknesses in their services, respond to customer needs more efficiently, and make data-driven decisions to enhance overall customer satisfaction. By leveraging data science techniques, the study aims to transform raw customer feedback into actionable insights that can drive strategic improvements and ensure a superior travel experience for its passengers.
Objectives
Title: British Airways customer feedback
Year: The dataset was collected in 2023.
Source: The dataset was extracted from Airline Quality through web scraping by Anshul Chaudhary & Muskan Risinghani, and is available on Kaggle.
Purpose: The British Airways customer feedback dataset provides a valuable resource of customer feedback for analysis, spanning from 2010 to 2023. It enables sentiment analysis, service quality assessment, route performance evaluation, and aircraft experience analysis. By leveraging this data, British Airways can enhance the overall customer experience, gain a competitive advantage, and make data-driven decisions for targeted improvements.
Dimension: 3,701 Rows & 19 Columns
Contents & Structure:
1. Flight Details
[Date] DateFlown: Date of the flight.
[Character] Name: Customer's name who provided the feedback.
[Nominal] TypeOfTraveller: Traveler type (e.g., Business, Leisure).
[Nominal] SeatType: Seat class (e.g. Business, Economy).
[Character] Route: The flight route taken by the customer.
[Character] Aircraft: Aircraft model.
2. Rating Feedback
[Nominal] Recommended: Whether the customer recommends British Airways.
[Ordinal] OverallRating: Overall customer rating.
[Ordinal] SeatComfort: Seat comfort rating.
[Ordinal] CabinStaffService: Cabin staff service rating.
[Ordinal] GroundService: Ground service rating.
[Ordinal] ValueForMoney: Value for money rating.
[Ordinal] Food&Beverages: Food & beverages rating.
[Ordinal] InflightEntertainment: Inflight entertainment rating.
[Ordinal] Wifi&Connectivity: Onboard wifi and connectivity rating.
3. Textual Feedback
[Character] ReviewHeader: Title of the customer's review.
[Character] ReviewBody: Detailed review content.
4. Other Details
[Date] Datetime: The date when the feedback was posted.
[Nominal] VerifiedReview: Indicates if the review is verified.
if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman") # R package management tool
pacman::p_load(tidyverse) # An integrated collection of packages for data import, tidying, manipulation (dplyr), visualisation (ggplot2), text processing (stringr), etc.
setwd("C:\\Users\\User\\Desktop\\Melaya\\WQD7004 Programming for Data Science")
# Import data
original_df <- read.csv('BA_AirlineReviews.csv',header=TRUE)[-1]
# Data overview
glimpse(original_df)
## Rows: 3,701
## Columns: 19
## $ OverallRating <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, 8, 7, 1, …
## $ ReviewHeader <chr> "\"Service level far worse then Ryanair\"", "\"d…
## $ Name <chr> "L Keele", "Austin Jones", "M A Collie", "Nigel …
## $ Datetime <chr> "19th November 2023", "19th November 2023", "16t…
## $ VerifiedReview <chr> "True", "True", "False", "True", "False", "True"…
## $ ReviewBody <chr> "4 Hours before takeoff we received a Mail stati…
## $ TypeOfTraveller <chr> "Couple Leisure", "Business", "Couple Leisure", …
## $ SeatType <chr> "Economy Class", "Economy Class", "Business Clas…
## $ Route <chr> "London to Stuttgart", "Brussels to London", "Lo…
## $ DateFlown <chr> "November 2023", "November 2023", "November 2023…
## $ SeatComfort <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, 4, 3, 2, …
## $ CabinStaffService <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, 5, 5, 3, …
## $ GroundService <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, 5, 5, 1, …
## $ ValueForMoney <chr> "1", "2", "3", "1", "1", "1", "4", "3", "5", "1"…
## $ Recommended <chr> "no", "no", "yes", "no", "no", "no", "yes", "yes…
## $ Aircraft <chr> "", "A320", "A320", "", "", "A320", "Boeing 777-…
## $ Food.Beverages <chr> "", "1", "4", "", "1", "1", "4", "3", "4", "1", …
## $ InflightEntertainment <chr> "", "2", "", "", "1", "1", "4", "", "4", "1", ""…
## $ Wifi.Connectivity <int> NA, 2, NA, NA, 1, NA, NA, NA, 3, 1, NA, 3, 1, NA…
summary(original_df)
## OverallRating ReviewHeader Name Datetime
## Min. : 1.000 Length:3701 Length:3701 Length:3701
## 1st Qu.: 2.000 Class :character Class :character Class :character
## Median : 4.000 Mode :character Mode :character Mode :character
## Mean : 4.734
## 3rd Qu.: 8.000
## Max. :10.000
## NA's :5
## VerifiedReview ReviewBody TypeOfTraveller SeatType
## Length:3701 Length:3701 Length:3701 Length:3701
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Route DateFlown SeatComfort CabinStaffService
## Length:3701 Length:3701 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :3.000
## Mean :2.875 Mean :3.254
## 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000
## NA's :116 NA's :127
## GroundService ValueForMoney Recommended Aircraft
## Min. :1.000 Length:3701 Length:3701 Length:3701
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :3.000 Mode :character Mode :character Mode :character
## Mean :2.784
## 3rd Qu.:4.000
## Max. :5.000
## NA's :846
## Food.Beverages InflightEntertainment Wifi.Connectivity
## Length:3701 Length:3701 Min. :1.000
## Class :character Class :character 1st Qu.:1.000
## Mode :character Mode :character Median :1.000
## Mean :1.925
## 3rd Qu.:3.000
## Max. :5.000
## NA's :3092
The subsequent stage involves data cleaning, aiming to achieve accuracy, consistency, and readiness for analysis and modelling. This multifaceted process involves three distinct stages: common cleaning, text pre-processing, and text representation. Let’s delve into each stage with more detail:
Ensuring proper data types for each variable to facilitate analysis.
df <- original_df
# convert date variables into date format
df$Datetime <- as.Date(df$Datetime, format = "%dth %B %Y")
df$DateFlown <- as.Date(paste0("1 ", df$DateFlown), format = "%d %B %Y")
df$VerifiedReview <- as.logical(df$VerifiedReview)
# convert rating variables into integer format
df$ValueForMoney <- as.integer(df$ValueForMoney)
df$Food.Beverages <- as.integer(df$Food.Beverages)
df$InflightEntertainment <- as.integer(df$InflightEntertainment)
# encode Recommended (note: as.integer over the factor yields 1 = "no", 2 = "yes")
df$Recommended <- as.integer(factor(df$Recommended, levels = c("no", "yes"), labels = c(0, 1)))
# ensure that character variables containing "" are recorded as NA
df <- mutate_if(df, is.character, na_if, "")
# encode the categorical variables
df$encoded_TypeOfTraveller <- as.integer(factor(df$TypeOfTraveller))
df$encoded_SeatType <- as.integer(factor(df$SeatType))
# Output for checking:
glimpse(df)
## Rows: 3,701
## Columns: 21
## $ OverallRating <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, 8, 7, 1…
## $ ReviewHeader <chr> "\"Service level far worse then Ryanair\"", "\…
## $ Name <chr> "L Keele", "Austin Jones", "M A Collie", "Nige…
## $ Datetime <date> 2023-11-19, 2023-11-19, 2023-11-16, 2023-11-1…
## $ VerifiedReview <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TR…
## $ ReviewBody <chr> "4 Hours before takeoff we received a Mail sta…
## $ TypeOfTraveller <chr> "Couple Leisure", "Business", "Couple Leisure"…
## $ SeatType <chr> "Economy Class", "Economy Class", "Business Cl…
## $ Route <chr> "London to Stuttgart", "Brussels to London", "…
## $ DateFlown <date> 2023-11-01, 2023-11-01, 2023-11-01, 2022-12-0…
## $ SeatComfort <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, 4, 3, 2…
## $ CabinStaffService <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, 5, 5, 3…
## $ GroundService <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, 5, 5, 1…
## $ ValueForMoney <int> 1, 2, 3, 1, 1, 1, 4, 3, 5, 1, 3, 1, 2, 5, 2, 1…
## $ Recommended <int> 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1…
## $ Aircraft <chr> NA, "A320", "A320", NA, NA, "A320", "Boeing 77…
## $ Food.Beverages <int> NA, 1, 4, NA, 1, 1, 4, 3, 4, 1, NA, 3, 1, 4, 3…
## $ InflightEntertainment <int> NA, 2, NA, NA, 1, 1, 4, NA, 4, 1, NA, 3, NA, 3…
## $ Wifi.Connectivity <int> NA, 2, NA, NA, 1, NA, NA, NA, 3, 1, NA, 3, 1, …
## $ encoded_TypeOfTraveller <int> 2, 1, 2, 2, 2, 4, 2, 4, 3, 2, 2, 2, 4, 2, 1, 2…
## $ encoded_SeatType <int> 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 1, 2, 2, 2, 2, 2…
Identifying and addressing missing data points to maintain data integrity.
Wifi.Connectivity has an 84% missing rate, which represents substantial data loss. Imputing such a high percentage of missing values could introduce significant bias and distort the analysis, so we decided to drop the variable to maintain the integrity and reliability of the dataset.
The Route and Aircraft variables were excluded because the same routes and aircraft are represented inconsistently in the data. Moreover, these variables introduce a large number of categories, making them unwieldy as categorical variables.
df_clean1 <- df %>% select(-Wifi.Connectivity,-Aircraft,-Route)
For the rating variables InflightEntertainment, GroundService, Food.Beverages, CabinStaffService, SeatComfort, and ValueForMoney, we propose using K-Nearest Neighbors (KNN) imputation. KNN imputation effectively preserves the data structure by filling in missing values based on similarities between observations. This method suits these rating variables well, given their positive correlation with others. They can thus serve as dependable proxies for completing missing values in associated variables.
For the encoded variables encoded_SeatType and encoded_TypeOfTraveller, we decided to remove the missing values. This decision is based on the finding that these variables have a low correlation with others, making them less reliable for imputation and potentially less impactful on the overall data analysis.
df_clean2 <- df_clean1 %>% drop_na("SeatType","TypeOfTraveller")
pacman::p_load("VIM","imputeTS","conflicted") # load the necessary pacakges for KNN imputation
conflicts_prefer(dplyr::filter) # conflicct in dplyr:filter and stats::filter
## [conflicted] Will prefer dplyr::filter over any other package.
# columns for imputation
feature1 <- c("SeatComfort", "CabinStaffService", "GroundService", "ValueForMoney", "Food.Beverages", "InflightEntertainment","Recommended","OverallRating")
# apply kNN imputation
df_imputed <- df_clean2 %>% select(all_of(feature1)) %>% kNN()
head(df_imputed)
## SeatComfort CabinStaffService GroundService ValueForMoney Food.Beverages
## 1 1 1 1 1 1
## 2 2 3 1 2 1
## 3 3 3 4 3 4
## 4 3 3 1 1 2
## 5 1 1 1 1 1
## 6 1 1 1 1 1
## InflightEntertainment Recommended OverallRating SeatComfort_imp
## 1 1 1 1 FALSE
## 2 2 1 3 FALSE
## 3 4 2 8 FALSE
## 4 2 1 1 FALSE
## 5 1 1 1 FALSE
## 6 1 1 1 FALSE
## CabinStaffService_imp GroundService_imp ValueForMoney_imp Food.Beverages_imp
## 1 FALSE FALSE FALSE TRUE
## 2 FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE TRUE
## 5 FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE
## InflightEntertainment_imp Recommended_imp OverallRating_imp
## 1 TRUE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 TRUE FALSE FALSE
## 4 TRUE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
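The construction of df_clean3 (the cleaned data frame with the imputed ratings) is not shown above; a minimal sketch of one way to build it, assuming the imputed values simply replace the original rating columns of df_clean2, is:
# Replace the original rating columns with their imputed versions to form df_clean3
# (a sketch; the exact construction used in the report is assumed here)
df_clean3 <- df_clean2 %>%
  select(-all_of(feature1)) %>%
  bind_cols(df_imputed %>% select(all_of(feature1)))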
Detecting and removing duplicate records to maintain dataset uniqueness.
# show duplicate rows
nrow(df_clean3[duplicated(df_clean3),])
## [1] 0
# remove any duplicated rows from the dataset
df_clean4 <- unique(df_clean3)
print(paste0("[Before cleaning] ","Rows: ",nrow(original_df),", Columns: ", ncol(original_df)))
## [1] "[Before cleaning] Rows: 3701, Columns: 19"
print(paste0("[After cleaning] ","Rows: ",nrow(df_clean3),", Columns: ", ncol(df_clean3)))
## [1] "[After cleaning] Rows: 2929, Columns: 18"
Frequency analysis has been utilized to identify outliers in ordinal data. The analysis did not show significant evidence that less frequent levels should be considered as outliers.
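The frequency tables themselves are not reproduced here; a minimal sketch of the kind of tabulation used, assuming simple level counts per rating variable, is:
# Tabulate the frequency of each level of the ordinal rating variables
# (a sketch of the frequency analysis; the exact code is assumed)
df_clean4 %>%
  select(all_of(feature1)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "level") %>%
  count(variable, level) %>%
  arrange(variable, level)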
df_final = df_clean4 %>% mutate(row=row_number())
glimpse(df_final)
## Rows: 2,929
## Columns: 19
## $ OverallRating <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, 8, 7, 1…
## $ ReviewHeader <chr> "\"Service level far worse then Ryanair\"", "\…
## $ Name <chr> "L Keele", "Austin Jones", "M A Collie", "Nige…
## $ Datetime <date> 2023-11-19, 2023-11-19, 2023-11-16, 2023-11-1…
## $ VerifiedReview <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TR…
## $ ReviewBody <chr> "4 Hours before takeoff we received a Mail sta…
## $ TypeOfTraveller <chr> "Couple Leisure", "Business", "Couple Leisure"…
## $ SeatType <chr> "Economy Class", "Economy Class", "Business Cl…
## $ DateFlown <date> 2023-11-01, 2023-11-01, 2023-11-01, 2022-12-0…
## $ SeatComfort <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, 4, 3, 2…
## $ CabinStaffService <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, 5, 5, 3…
## $ GroundService <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, 5, 5, 1…
## $ ValueForMoney <int> 1, 2, 3, 1, 1, 1, 4, 3, 5, 1, 3, 1, 2, 5, 2, 1…
## $ Recommended <int> 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1…
## $ Food.Beverages <int> 1, 1, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 1, 4, 3, 3…
## $ InflightEntertainment <int> 1, 2, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 2, 3, 4, 3…
## $ encoded_TypeOfTraveller <int> 2, 1, 2, 2, 2, 4, 2, 4, 3, 2, 2, 2, 4, 2, 1, 2…
## $ encoded_SeatType <int> 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ row <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
Removing unwanted or irrelevant content from the text, such as punctuation, special characters, duplicate text, HTML tags, URLs, headers, and footers.
Converting all text to lower case to ensure a standard format.
Discarding commonly used stop words that do not carry much significance in a given context, such as "and", "the", or "is".
pacman::p_load(tm) # tm package for cleaning text
# define a text cleaning function
cleaning_text <- function(text){
text <- tolower(text) # convert to lower case
text <- removePunctuation(text) # remove punctuations
text <- removeNumbers(text) # remove numbers
text <- removeWords(text,stopwords("en")) # stop word removal
text <- gsub("[^a-z ]","",text) # to remove non-lowercase letters and spaces
text <- stripWhitespace(text) # ensure no excessive spaces
return(text)
}
# perform cleaning on review header
df_head <- df_final %>% select(row,ReviewHeader) %>% mutate(clean_head = sapply(ReviewHeader, cleaning_text)) %>% as.data.frame()
head(df_head,3)
## row ReviewHeader clean_head
## 1 1 "Service level far worse then Ryanair" service level far worse ryanair
## 2 2 "do not upgrade members based on status" upgrade members based status
## 3 3 "Flight was smooth and quick" flight smooth quick
df_body <- df_final %>% select(row,ReviewBody) %>% mutate(clean_body = sapply(ReviewBody, cleaning_text)) %>% as.data.frame()
head(df_body,3)
## row
## 1 1
## 2 2
## 3 3
## ReviewBody
## 1 4 Hours before takeoff we received a Mail stating a cryptic message that there are disruptions to be expected as there is a limit on how many planes can leave at the same time. So did the capacity of the Heathrow Airport really hit British Airways by surprise, 4h before departure? Anyhow - we took the one hour delay so what - but then we have been forced to check in our Hand luggage. I travel only with hand luggage to avoid waiting for the ultra slow processing of the checked in luggage. Overall 2h later at home than planed, with really no reason, just due to incompetent people. Service level far worse then Ryanair and triple the price. Really never again. Thanks for nothing.
## 2 I recently had a delay on British Airways from BRU to LHR that was due to staff shortages. They announced that there was a 2 hour holding delay but they would board us immediately in hopes of clearing the gate and leaving early. We had to wait the full 2 hours inside the airplane. The plane was old, dirty, had no power at the seats. The staff provided a small bag of pretzels and 250ml of water to the passengers for 2 hour delay and 2 hour flight. There were no options to purchase food or drink. There were no entertainment options available. I am a OneWorld emerald elite member but they do not upgrade members based on status. First class lounges at Heathrow are overcrowded, understaffed and poorly equipped. The help desk is completely unhelpful when an error arises with delays and cancellations - even when having the top status. The Avios points system has been devalued to near worthlessness and requires fees to book reward that nearly equal the price of the revenue ticket. British has lost its way in recent years and has a moved from a world-class airline to a budget airline with much worse service and timeliness than Ryanair or EasyJet.
## 3 Boarded on time, but it took ages to get to the runway due to congestion. Flight was smooth and quick. Snack and drinks were good for a short flight. Landed only about ten minutes late. One bag of three left in London, forms quickly filled in, and the bag was delivered the next morning.
## clean_body
## 1 hours takeoff received mail stating cryptic message disruptions expected limit many planes can leave time capacity heathrow airport really hit british airways surprise h departure anyhow took one hour delay forced check hand luggage travel hand luggage avoid waiting ultra slow processing checked luggage overall h later home planed really reason just due incompetent people service level far worse ryanair triple price really never thanks nothing
## 2 recently delay british airways bru lhr due staff shortages announced hour holding delay board us immediately hopes clearing gate leaving early wait full hours inside airplane plane old dirty power seats staff provided small bag pretzels ml water passengers hour delay hour flight options purchase food drink entertainment options available oneworld emerald elite member upgrade members based status first class lounges heathrow overcrowded understaffed poorly equipped help desk completely unhelpful error arises delays cancellations even top status avios points system devalued near worthlessness requires fees book reward nearly equal price revenue ticket british lost way recent years moved worldclass airline budget airline much worse service timeliness ryanair easyjet
## 3 boarded time took ages get runway due congestion flight smooth quick snack drinks good short flight landed ten minutes late one bag three left london forms quickly filled bag delivered next morning
Lemmatization reduces words to their base form, but it considers the context and morphological analysis. For instance, "better" would be lemmatized to "good", rather than stemmed to "bet".
pacman::p_load(textstem) # packages
# Function to lemmatize sentences
lemmatize_sentence <- function(sentence) {
# Tokenize the sentence into words
words <- unlist(strsplit(sentence, " "))
# Lemmatize each word
lemmatized_words <- lemmatize_words(words)
# Reconstruct the sentence
lemmatized_sentence <- paste(lemmatized_words, collapse = " ")
return(lemmatized_sentence)
}
# lemmatization for head
df_head$norm_head <- sapply(df_head$clean_head, lemmatize_sentence)
head(df_head[df_head$norm_head != df_head$clean_head,c('clean_head','norm_head')],3)
## clean_head norm_head
## 1 service level far worse ryanair service level far bad ryanair
## 2 upgrade members based status upgrade member base status
## 6 cant imagine worst airline cant imagine bad airline
# lemmatization for body
df_body$norm_body <- sapply(df_body$clean_body, lemmatize_sentence)
head(df_body[df_body$norm_body != df_body$clean_body,c('clean_body','norm_body')],3)
## clean_body
## 1 hours takeoff received mail stating cryptic message disruptions expected limit many planes can leave time capacity heathrow airport really hit british airways surprise h departure anyhow took one hour delay forced check hand luggage travel hand luggage avoid waiting ultra slow processing checked luggage overall h later home planed really reason just due incompetent people service level far worse ryanair triple price really never thanks nothing
## 2 recently delay british airways bru lhr due staff shortages announced hour holding delay board us immediately hopes clearing gate leaving early wait full hours inside airplane plane old dirty power seats staff provided small bag pretzels ml water passengers hour delay hour flight options purchase food drink entertainment options available oneworld emerald elite member upgrade members based status first class lounges heathrow overcrowded understaffed poorly equipped help desk completely unhelpful error arises delays cancellations even top status avios points system devalued near worthlessness requires fees book reward nearly equal price revenue ticket british lost way recent years moved worldclass airline budget airline much worse service timeliness ryanair easyjet
## 3 boarded time took ages get runway due congestion flight smooth quick snack drinks good short flight landed ten minutes late one bag three left london forms quickly filled bag delivered next morning
## norm_body
## 1 hour takeoff receive mail state cryptic message disruption expect limit many plane can leave time capacity heathrow airport really hit british airway surprise h departure anyhow take one hour delay force check hand luggage travel hand luggage avoid wait ultra slow process check luggage overall h late home plane really reason just due incompetent people service level far bad ryanair triple price really never thank nothing
## 2 recently delay british airway bru lhr due staff shortage announce hour hold delay board us immediately hope clear gate leave early wait full hour inside airplane plane old dirty power seat staff provide small bag pretzel ml water passenger hour delay hour flight option purchase food drink entertainment option available oneworld emerald elite member upgrade member base status first class lounge heathrow overcrowd understaffed poorly equip help desk completely unhelpful error arise delay cancellation even top status avios point system devalue near worthlessness require fee book reward nearly equal price revenue ticket british lose way recent year move worldclass airline budget airline much bad service timeliness ryanair easyjet
## 3 board time take age get runway due congestion flight smooth quick snack drink good short flight land ten minute late one bag three leave london form quickly fill bag deliver next morning
Breaking the text into individual words.
pacman::p_load(tidytext, dplyr)
# tokenize each word in the review header
head_token <- df_head %>% select(row, norm_head) %>% unnest_tokens(word, norm_head)
head(head_token,3)
## row word
## 1 1 service
## 2 1 level
## 3 1 far
# tokenize each word in the review body
body_token <- df_body %>% select(row, norm_body) %>% unnest_tokens(word, norm_body)
head(body_token,3)
## row word
## 1 1 hour
## 2 1 takeoff
## 3 1 receive
Sentiment analysis has evolved through lexicon-based and model-based approaches. Initially, lexicon-based analysis was introduced to assign sentiment scores, often categorized as positive, negative, or neutral, to text. Subsequently, this data was transformed into a format suitable for machine learning training and testing purposes.
Lexicon-based sentiment analysis is a method used to determine the sentiment or emotional tone of a piece of text by leveraging a pre-defined list of words (a lexicon) where each word is associated with a specific sentiment value. This approach relies on dictionaries of words annotated with sentiment scores, typically indicating whether a word is positive, negative, or neutral.
pacman::p_load(tidytext, textdata, janeaustenr, dplyr, stringr)
head_token2 <- head_token %>% left_join(get_sentiments("bing"), by = "word") %>% mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>% mutate(score = ifelse(sentiment=="neutral",0,ifelse(sentiment=="negative",-1,1)))
head(head_token2)
## row word sentiment score
## 1 1 service neutral 0
## 2 1 level neutral 0
## 3 1 far neutral 0
## 4 1 bad negative -1
## 5 1 ryanair neutral 0
## 6 2 upgrade neutral 0
body_token2 <- body_token %>% left_join(get_sentiments("bing"), by = "word") %>% mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>% mutate(score = ifelse(sentiment=="neutral",0,ifelse(sentiment=="negative",-1,1)))
head(body_token2)
## row word sentiment score
## 1 1 hour neutral 0
## 2 1 takeoff neutral 0
## 3 1 receive neutral 0
## 4 1 mail neutral 0
## 5 1 state neutral 0
## 6 1 cryptic neutral 0
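The review-level table overall_sentiment used in the join below is not constructed in the code shown. A minimal sketch of how it could be built follows; averaging the header and body scores matches the values displayed later, but the ±0.05 class thresholds are assumptions:
# Aggregate token-level scores to one score per review for header and body,
# average them into an overall score, and classify it (thresholds assumed)
overall_sentiment <- head_token2 %>%
  group_by(row) %>%
  summarise(overall_sentiment_score_head = round(mean(score), 2), .groups = "drop") %>%
  full_join(
    body_token2 %>%
      group_by(row) %>%
      summarise(overall_sentiment_score_body = round(mean(score), 2), .groups = "drop"),
    by = "row"
  ) %>%
  mutate(
    overall_sentiment_score = round((overall_sentiment_score_head + overall_sentiment_score_body) / 2, 2),
    overall_sentiment_class = case_when(
      overall_sentiment_score <= -0.05 ~ "negative",  # threshold assumed
      overall_sentiment_score >=  0.05 ~ "positive",  # threshold assumed
      TRUE ~ "neutral"
    ),
    encoded_overll_sentiment_class = case_when(
      overall_sentiment_class == "negative" ~ -1,
      overall_sentiment_class == "positive" ~ 1,
      TRUE ~ 0
    )
  )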
final_df = df_final %>% left_join(select(overall_sentiment,row,overall_sentiment_score_body,overall_sentiment_score_head,overall_sentiment_score,overall_sentiment_class, encoded_overll_sentiment_class),by = 'row')
glimpse(final_df)
## Rows: 2,929
## Columns: 24
## $ OverallRating <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, …
## $ ReviewHeader <chr> "\"Service level far worse then Ryanair…
## $ Name <chr> "L Keele", "Austin Jones", "M A Collie"…
## $ Datetime <date> 2023-11-19, 2023-11-19, 2023-11-16, 20…
## $ VerifiedReview <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, T…
## $ ReviewBody <chr> "4 Hours before takeoff we received a M…
## $ TypeOfTraveller <chr> "Couple Leisure", "Business", "Couple L…
## $ SeatType <chr> "Economy Class", "Economy Class", "Busi…
## $ DateFlown <date> 2023-11-01, 2023-11-01, 2023-11-01, 20…
## $ SeatComfort <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, …
## $ CabinStaffService <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, …
## $ GroundService <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, …
## $ ValueForMoney <int> 1, 2, 3, 1, 1, 1, 4, 3, 5, 1, 3, 1, 2, …
## $ Recommended <int> 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, …
## $ Food.Beverages <int> 1, 1, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 1, …
## $ InflightEntertainment <int> 1, 2, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 2, …
## $ encoded_TypeOfTraveller <int> 2, 1, 2, 2, 2, 4, 2, 4, 3, 2, 2, 2, 4, …
## $ encoded_SeatType <int> 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 1, 2, 2, …
## $ row <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ overall_sentiment_score_body <dbl> -0.08, -0.06, 0.03, -0.01, -0.04, -0.16…
## $ overall_sentiment_score_head <dbl> -0.20, 0.00, 0.33, -0.33, 0.00, -0.25, …
## $ overall_sentiment_score <dbl> -0.14, -0.03, 0.18, -0.17, -0.02, -0.21…
## $ overall_sentiment_class <chr> "negative", "neutral", "positive", "neg…
## $ encoded_overll_sentiment_class <dbl> -1, 0, 1, -1, 0, -1, 1, 1, -1, 0, 0, 1,…
Visualizing text data can help identify instances of misclassification in text cleaning and representation. This process allows us to examine and understand the patterns of misclassification more effectively. Additionally, visualizing the distribution of sentiment scores or probabilities for each class can provide further insights into the model’s performance and potential areas for improvement.
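As an illustration, a minimal sketch (not the report's original figure code) of how the distribution of overall sentiment scores could be visualised by class:
# Distribution of overall sentiment scores, coloured by sentiment class (a sketch)
ggplot(final_df, aes(x = overall_sentiment_score, fill = overall_sentiment_class)) +
  geom_histogram(bins = 40, alpha = 0.7, position = "identity") +
  labs(x = "Overall sentiment score", y = "Count", fill = "Sentiment class") +
  theme_minimal()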
To summarise, the cleaned data is saved for analysis and modelling.
# final clean data for analysis and modelling
glimpse(final_df)
## Rows: 2,929
## Columns: 24
## $ OverallRating <int> 1, 3, 8, 1, 1, 1, 8, 7, 2, 3, 8, 1, 6, …
## $ ReviewHeader <chr> "\"Service level far worse then Ryanair…
## $ Name <chr> "L Keele", "Austin Jones", "M A Collie"…
## $ Datetime <date> 2023-11-19, 2023-11-19, 2023-11-16, 20…
## $ VerifiedReview <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, T…
## $ ReviewBody <chr> "4 Hours before takeoff we received a M…
## $ TypeOfTraveller <chr> "Couple Leisure", "Business", "Couple L…
## $ SeatType <chr> "Economy Class", "Economy Class", "Busi…
## $ DateFlown <date> 2023-11-01, 2023-11-01, 2023-11-01, 20…
## $ SeatComfort <int> 1, 2, 3, 3, 1, 1, 5, 3, 4, 3, 3, 3, 3, …
## $ CabinStaffService <int> 1, 3, 3, 3, 1, 1, 5, 3, 5, 3, 3, 3, 1, …
## $ GroundService <int> 1, 1, 4, 1, 1, 1, 4, 3, 3, 3, 4, 1, 3, …
## $ ValueForMoney <int> 1, 2, 3, 1, 1, 1, 4, 3, 5, 1, 3, 1, 2, …
## $ Recommended <int> 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, …
## $ Food.Beverages <int> 1, 1, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 1, …
## $ InflightEntertainment <int> 1, 2, 4, 2, 1, 1, 4, 3, 4, 1, 3, 3, 2, …
## $ encoded_TypeOfTraveller <int> 2, 1, 2, 2, 2, 4, 2, 4, 3, 2, 2, 2, 4, …
## $ encoded_SeatType <int> 2, 2, 1, 2, 2, 2, 4, 2, 2, 2, 1, 2, 2, …
## $ row <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ overall_sentiment_score_body <dbl> -0.08, -0.06, 0.03, -0.01, -0.04, -0.16…
## $ overall_sentiment_score_head <dbl> -0.20, 0.00, 0.33, -0.33, 0.00, -0.25, …
## $ overall_sentiment_score <dbl> -0.14, -0.03, 0.18, -0.17, -0.02, -0.21…
## $ overall_sentiment_class <chr> "negative", "neutral", "positive", "neg…
## $ encoded_overll_sentiment_class <dbl> -1, 0, 1, -1, 0, -1, 1, 1, -1, 0, 0, 1,…
write.csv(final_df, file = "final_df.csv", row.names=FALSE)
# final text data for text analysis (eda part only)
final_text <- bind_rows(head_token2 %>% mutate(source = "head"), body_token2 %>% mutate(source = "body"))
glimpse(final_text)
## Rows: 258,883
## Columns: 5
## $ row <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, …
## $ word <chr> "service", "level", "far", "bad", "ryanair", "upgrade", "mem…
## $ sentiment <chr> "neutral", "neutral", "neutral", "negative", "neutral", "neu…
## $ score <dbl> 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 1, 0, 0, -1, 0, 0, 0, 0, 0, 0…
## $ source <chr> "head", "head", "head", "head", "head", "head", "head", "hea…
write.csv(final_text, file = "final_text.csv", row.names=FALSE)
Comprehensive analysis
Relationship between ratings and number of flights: While the number of flights dropped sharply in the early 2020s, the ratings (especially the overall rating) fluctuated widely. This may be due to the reduction in passenger numbers during the pandemic, which made individual ratings carry more weight. As the number of flights recovered in 2021 and beyond, the fluctuations in the ratings decreased, but they still have not returned to the stable levels seen before the pandemic.
Relationship between rating categories: OverallRating is noticeably higher than the other categories, indicating higher overall passenger satisfaction, while ratings for specific services (e.g., Cabin Staff Service, Food & Beverages) are relatively low and do not differ much from each other.
By combining these two charts, we can better understand how passenger ratings of the airline's services changed over time and how these changes relate to the number of flights.
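A minimal sketch (assumed; the original chart code is not shown) of how the yearly review counts and mean overall ratings behind these charts could be plotted together, with review counts standing in for flight numbers:
# Mean overall rating and number of reviews per year flown (a sketch)
yearly <- final_df %>%
  filter(!is.na(DateFlown)) %>%
  mutate(year = as.integer(format(DateFlown, "%Y"))) %>%
  group_by(year) %>%
  summarise(n_reviews = n(), mean_rating = mean(OverallRating, na.rm = TRUE), .groups = "drop")
ggplot(yearly, aes(x = year)) +
  geom_col(aes(y = n_reviews / 50), fill = "grey80") +   # review counts (scaled to fit)
  geom_line(aes(y = mean_rating), colour = "steelblue") +
  scale_y_continuous(name = "Mean OverallRating",
                     sec.axis = sec_axis(~ . * 50, name = "Number of reviews")) +
  labs(x = "Year flown") +
  theme_minimal()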
The combination of the word cloud and the pie chart supports the following conclusions: the most frequently mentioned words in the reviews relate to passengers' flight experience and service, and most of them are negative (e.g., 'delay', 'cancel', 'refund'), which is in line with the high percentage of non-recommended reviews shown in the pie chart. Some positive terms, such as 'comfortable' and 'good', are also present but relatively few, suggesting that while some passengers are satisfied with the airline's service, the overall proportion of non-recommendations is high. This visual analysis helps us better understand passenger feedback and the issues in the airline's service.
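A minimal sketch (assumed) of how the word cloud and the recommendation pie chart could be produced from final_text and final_df:
# Word cloud of the most frequent review words and pie chart of recommendations (a sketch)
pacman::p_load(wordcloud, RColorBrewer)
word_freq <- final_text %>% count(word, sort = TRUE)
wordcloud(words = word_freq$word, freq = word_freq$n,
          max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
final_df %>%
  count(Recommended) %>%
  ggplot(aes(x = "", y = n, fill = factor(Recommended, labels = c("No", "Yes")))) +  # 1 = no, 2 = yes
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(fill = "Recommended", x = NULL, y = NULL) +
  theme_void()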
Purpose: To determine whether a given flight plan (SeatType, TypeOfTraveller, CabinStaffService, GroundService) will result in a positive, negative, or neutral sentiment. Flight planners can then tweak the flight plan so that it gives passengers a satisfactory experience.
Model: Multinomial Logistic Regression, Support Vector Machines (SVM), Naive Bayes, KNN, Decision Tree, and Random Forest
Training: 80% train / 20% test split (stratified)
Evaluation: Accuracy and F1 score
# Load required libraries
pacman::p_load(dplyr)
# Assuming final_df is already defined and loaded
final_df_selected <- dplyr::select(final_df, SeatType, TypeOfTraveller, CabinStaffService, GroundService, overall_sentiment_class, overall_sentiment_score)
# Check the selected data frame
print(head(final_df_selected))
## SeatType TypeOfTraveller CabinStaffService GroundService
## 1 Economy Class Couple Leisure 1 1
## 2 Economy Class Business 3 1
## 3 Business Class Couple Leisure 3 4
## 4 Economy Class Couple Leisure 3 1
## 5 Economy Class Couple Leisure 1 1
## 6 Economy Class Solo Leisure 1 1
## overall_sentiment_class overall_sentiment_score
## 1 negative -0.14
## 2 neutral -0.03
## 3 positive 0.18
## 4 negative -0.17
## 5 neutral -0.02
## 6 negative -0.21
# Convert CabinStaffService and GroundService to numeric
final_df_selected <- final_df_selected %>% mutate(CabinStaffService = as.numeric(CabinStaffService), GroundService = as.numeric(GroundService))
pacman::p_load(caret)
# Assume final_df_selected is already defined and loaded
# Create a dummy variable model for SeatType and TypeOfTraveller
dummies <- dummyVars(~ SeatType + TypeOfTraveller, data = final_df_selected)
# Apply the dummy variable model to the data and convert to a data frame
encoded_df <- as.data.frame(predict(dummies, newdata = final_df_selected))
# Combine the one-hot encoded columns back with the original columns that were not part of the encoding
non_encoded_columns <- final_df_selected[, !(colnames(final_df_selected) %in% c("SeatType", "TypeOfTraveller"))]
final_df_selected <- cbind(non_encoded_columns, encoded_df)
# Check the structure to confirm all columns are present
str(final_df_selected)
## 'data.frame': 2929 obs. of 12 variables:
## $ CabinStaffService : num 1 3 3 3 1 1 5 3 5 3 ...
## $ GroundService : num 1 1 4 1 1 1 4 3 3 3 ...
## $ overall_sentiment_class : chr "negative" "neutral" "positive" "negative" ...
## $ overall_sentiment_score : num -0.14 -0.03 0.18 -0.17 -0.02 -0.21 0.12 0.25 -0.36 0 ...
## $ SeatTypeBusiness Class : num 0 0 1 0 0 0 0 0 0 0 ...
## $ SeatTypeEconomy Class : num 1 1 0 1 1 1 0 1 1 1 ...
## $ SeatTypeFirst Class : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SeatTypePremium Economy : num 0 0 0 0 0 0 1 0 0 0 ...
## $ TypeOfTravellerBusiness : num 0 1 0 0 0 0 0 0 0 0 ...
## $ TypeOfTravellerCouple Leisure: num 1 0 1 1 1 0 1 0 0 1 ...
## $ TypeOfTravellerFamily Leisure: num 0 0 0 0 0 0 0 0 1 0 ...
## $ TypeOfTravellerSolo Leisure : num 0 0 0 0 0 1 0 1 0 0 ...
A stratified split is used to ensure balanced classes in the train and test sets.
# Train Test Split, stratified by target_variable
# Load required libraries
pacman::p_load(caret)
# Define the target column
target_column <- "overall_sentiment_class" # Replace with the actual name of your target column if different
# Perform a stratified train-test split
set.seed(123)
trainIndex <- createDataPartition(final_df_selected[[target_column]], p = .8,
list = FALSE,
times = 1)
# Create training and testing sets
train_set <- final_df_selected[trainIndex, ]
test_set <- final_df_selected[-trainIndex, ]
# Separate features and target
X_train <- train_set[, !(colnames(train_set) %in% c(target_column,"overall_sentiment_score"))]
y_train <- train_set[[target_column]]
X_test <- test_set[, !(colnames(test_set) %in% c(target_column,"overall_sentiment_score"))]
y_test <- test_set[[target_column]]
# Check the structures
str(X_train)
## 'data.frame': 2344 obs. of 10 variables:
## $ CabinStaffService : num 3 3 3 1 5 3 5 3 3 1 ...
## $ GroundService : num 1 4 1 1 4 3 3 4 1 3 ...
## $ SeatTypeBusiness Class : num 0 1 0 0 0 0 0 1 0 0 ...
## $ SeatTypeEconomy Class : num 1 0 1 1 0 1 1 0 1 1 ...
## $ SeatTypeFirst Class : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SeatTypePremium Economy : num 0 0 0 0 1 0 0 0 0 0 ...
## $ TypeOfTravellerBusiness : num 1 0 0 0 0 0 0 0 0 0 ...
## $ TypeOfTravellerCouple Leisure: num 0 1 1 1 1 0 0 1 1 0 ...
## $ TypeOfTravellerFamily Leisure: num 0 0 0 0 0 0 1 0 0 0 ...
## $ TypeOfTravellerSolo Leisure : num 0 0 0 0 0 1 0 0 0 1 ...
str(y_train)
## chr [1:2344] "neutral" "positive" "negative" "neutral" "positive" ...
str(X_test)
## 'data.frame': 585 obs. of 10 variables:
## $ CabinStaffService : num 1 1 3 1 5 3 3 3 3 3 ...
## $ GroundService : num 1 1 3 2 1 2 2 1 1 1 ...
## $ SeatTypeBusiness Class : num 0 0 0 1 1 0 0 1 0 1 ...
## $ SeatTypeEconomy Class : num 1 1 1 0 0 1 1 0 1 0 ...
## $ SeatTypeFirst Class : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SeatTypePremium Economy : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TypeOfTravellerBusiness : num 0 0 0 0 0 0 0 0 1 1 ...
## $ TypeOfTravellerCouple Leisure: num 1 0 1 0 1 1 1 1 0 0 ...
## $ TypeOfTravellerFamily Leisure: num 0 0 0 0 0 0 0 0 0 0 ...
## $ TypeOfTravellerSolo Leisure : num 0 1 0 1 0 0 0 0 0 0 ...
str(y_test)
## chr [1:585] "negative" "negative" "neutral" "negative" "neutral" ...
The initial set of algorithms comprises Multinomial Logistic Regression, SVM, Naive Bayes, KNN, Decision Tree, and Random Forest.
# Load required libraries
pacman::p_load(nnet,naivebayes,e1071,rpart,randomForest,MLmetrics, caret)
# Function to clean column names to ensure they are valid R variable names
clean_names <- function(df) {
colnames(df) <- make.names(colnames(df), unique = TRUE)
return(df)
}
# Assuming X_train, y_train, X_test, and y_test are already defined and loaded
X_train <- clean_names(X_train)
X_test <- clean_names(X_test)
# Ensure y_train and y_test are factors for classification tasks
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)
# Set seed for reproducibility
set.seed(123)
# Initialize a list to store models and their evaluation metrics
models <- list()
evaluation <- data.frame(Model = character(), Accuracy = numeric(), F1 = numeric(), stringsAsFactors = FALSE)
# Multinomial Logistic Regression
multinom_model <- multinom(y_train ~ ., data = data.frame(X_train, y_train))
## # weights: 36 (22 variable)
## initial value 2575.147205
## iter 10 value 2360.085766
## iter 20 value 2232.032308
## final value 2221.951984
## converged
pred_multinom <- predict(multinom_model, X_test)
acc_multinom <- sum(pred_multinom == y_test) / length(y_test)
f1_multinom <- F1_Score(y_true = y_test, y_pred = pred_multinom, positive = levels(y_test)[1])
models$multinom <- multinom_model
evaluation <- rbind(evaluation, data.frame(Model = "Multinomial Logistic Regression", Accuracy = acc_multinom, F1 = f1_multinom))
# Support Vector Machines (SVM)
svm_model <- svm(y_train ~ ., data = data.frame(X_train, y_train), kernel = "linear", probability = TRUE)
pred_svm <- predict(svm_model, X_test)
acc_svm <- sum(pred_svm == y_test) / length(y_test)
f1_svm <- F1_Score(y_true = y_test, y_pred = pred_svm, positive = levels(y_test)[1])
models$svm <- svm_model
evaluation <- rbind(evaluation, data.frame(Model = "SVM", Accuracy = acc_svm, F1 = f1_svm))
# Naive Bayes
nb_model <- naive_bayes(y_train ~ ., data = data.frame(X_train, y_train))
pred_nb <- predict(nb_model, X_test)
acc_nb <- sum(pred_nb == y_test) / length(y_test)
f1_nb <- F1_Score(y_true = y_test, y_pred = pred_nb, positive = levels(y_test)[1])
models$nb <- nb_model
evaluation <- rbind(evaluation, data.frame(Model = "Naive Bayes", Accuracy = acc_nb, F1 = f1_nb))
# K-Nearest Neighbors (KNN)
knn_model <- train(y_train ~ ., data = data.frame(X_train, y_train), method = "knn", tuneLength = 5)
pred_knn <- predict(knn_model, X_test)
acc_knn <- sum(pred_knn == y_test) / length(y_test)
f1_knn <- F1_Score(y_true = y_test, y_pred = pred_knn, positive = levels(y_test)[1])
models$knn <- knn_model
evaluation <- rbind(evaluation, data.frame(Model = "KNN", Accuracy = acc_knn, F1 = f1_knn))
# Decision Tree
dt_model <- rpart(y_train ~ ., data = data.frame(X_train, y_train))
pred_dt <- predict(dt_model, X_test, type = "class")
acc_dt <- sum(pred_dt == y_test) / length(y_test)
f1_dt <- F1_Score(y_true = y_test, y_pred = pred_dt, positive = levels(y_test)[1])
models$dt <- dt_model
evaluation <- rbind(evaluation, data.frame(Model = "Decision Tree", Accuracy = acc_dt, F1 = f1_dt))
# Random Forest
rf_model <- randomForest(y_train ~ ., data = data.frame(X_train, y_train), ntree = 100)
pred_rf <- predict(rf_model, X_test)
acc_rf <- sum(pred_rf == y_test) / length(y_test)
f1_rf <- F1_Score(y_true = y_test, y_pred = pred_rf, positive = levels(y_test)[1])
models$rf <- rf_model
evaluation <- rbind(evaluation, data.frame(Model = "Random Forest", Accuracy = acc_rf, F1 = f1_rf))
# Print evaluation results
print(evaluation)
## Model Accuracy F1
## 1 Multinomial Logistic Regression 0.5264957 0.41447368
## 2 SVM 0.4974359 0.09803922
## 3 Naive Bayes 0.5196581 0.51020408
## 4 KNN 0.4871795 0.36795252
## 5 Decision Tree 0.5230769 0.40677966
## 6 Random Forest 0.4957265 0.35220126
The best model from the preliminary modelling is Naive Bayes, which achieves only about 52% accuracy and an F1 score of 51%. Since this is barely better than guessing, it is not good enough, so we try to remedy this with different models and feature engineering.
This time, we applied hyperparameter tuning over small grids to SVM, Random Forest, and the more advanced XGBoost, to see whether this improves performance.
# Load required libraries
pacman::p_load(caret,e1071,randomForest,xgboost,MLmetrics, doParallel, kernlab)
# Function to clean column names to ensure they are valid R variable names
clean_names <- function(df) {
colnames(df) <- make.names(colnames(df), unique = TRUE)
return(df)
}
# Assuming X_train, y_train, X_test, and y_test are already defined and loaded
X_train <- clean_names(X_train)
X_test <- clean_names(X_test)
# Ensure y_train and y_test are factors for classification tasks
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)
# Set seed for reproducibility
set.seed(123)
# Set up parallel processing
cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)
# Define a control function for the cross-validated hyperparameter search
train_control <- trainControl(method = "cv", number = 3,
summaryFunction = multiClassSummary,
classProbs = TRUE, verboseIter = TRUE)
# Grid search for SVM over a small tuning grid
svm_grid <- expand.grid(C = 2^seq(-5, 2, length.out = 5), sigma = 2^seq(-5, 2, length.out = 5))
svm_model <- train(y_train ~ ., data = data.frame(X_train, y_train),
method = "svmRadial",
tuneGrid = svm_grid,
trControl = train_control,
metric = "F1",
preProcess = c("center", "scale"),
tuneLength = 5)
## Aggregating results
## Selecting tuning parameters
## Fitting sigma = 0.105, C = 0.354 on full training set
# Tune Random Forest over a small grid (tuneLength = 5)
rf_model <- train(y_train ~ ., data = data.frame(X_train, y_train),
method = "rf",
tuneLength = 5,
trControl = train_control,
metric = "F1")
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 2 on full training set
# Grid search for XGBoost over a small tuning grid
xgb_grid <- expand.grid(nrounds = 50,
max_depth = seq(3, 7, by = 2),
eta = c(0.01, 0.1),
gamma = c(0, 0.1),
colsample_bytree = c(0.6, 0.8),
min_child_weight = c(1, 3),
subsample = c(0.7, 0.9))
xgb_model <- train(y_train ~ ., data = data.frame(X_train, y_train),
method = "xgbTree",
tuneGrid = xgb_grid,
trControl = train_control,
metric = "F1")
## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 50, max_depth = 3, eta = 0.1, gamma = 0.1, colsample_bytree = 0.6, min_child_weight = 1, subsample = 0.7 on full training set
# Stop parallel processing
stopCluster(cl)
registerDoSEQ()
# Predict and evaluate
models <- list(svm = svm_model, rf = rf_model, xgb = xgb_model)
evaluation <- data.frame(Model = character(), Accuracy = numeric(), F1 = numeric(), stringsAsFactors = FALSE)
for (model_name in names(models)) {
model <- models[[model_name]]
pred <- predict(model, X_test)
acc <- sum(pred == y_test) / length(y_test)
f1 <- F1_Score(y_true = y_test, y_pred = pred, positive = levels(y_test)[1])
evaluation <- rbind(evaluation, data.frame(Model = model_name, Accuracy = acc, F1 = f1))
}
# Print evaluation results
print(evaluation)
## Model Accuracy F1
## 1 svm 0.5042735 0.2588235
## 2 rf 0.5196581 0.3391003
## 3 xgb 0.5025641 0.3513514
This did not improve the performance; the best algorithm is still Naive Bayes, with the highest accuracy and F1 score.
We also considered other Naive Bayes variants, since Naive Bayes performed best initially.
# Load required libraries
pacman::p_load(caret, klaR, MLmetrics, doParallel, kernlab, shiny)
## Installing package into 'C:/Users/User/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
# Function to clean column names to ensure they are valid R variable names
clean_names <- function(df) {
colnames(df) <- make.names(colnames(df), unique = TRUE)
return(df)
}
# Assuming X_train, y_train, X_test, and y_test are already defined and loaded
X_train <- clean_names(X_train)
X_test <- clean_names(X_test)
# Ensure y_train and y_test are factors for classification tasks
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)
# Set seed for reproducibility
set.seed(123)
# Set up parallel processing
cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)
# Initialize a list to store models and their evaluation metrics
models <- list()
evaluation <- data.frame(Model = character(), Accuracy = numeric(), F1 = numeric(), stringsAsFactors = FALSE)
# Gaussian Naive Bayes
gnb_model <- foreach(i = 1, .packages = 'klaR') %dopar% {
NaiveBayes(y_train ~ ., data = data.frame(X_train, y_train), usekernel = FALSE)
}
pred_gnb <- predict(gnb_model[[1]], X_test)$class
acc_gnb <- sum(pred_gnb == y_test) / length(y_test)
f1_gnb <- F1_Score(y_true = y_test, y_pred = pred_gnb, positive = levels(y_test)[1])
models$gnb <- gnb_model[[1]]
evaluation <- rbind(evaluation, data.frame(Model = "Gaussian Naive Bayes", Accuracy = acc_gnb, F1 = f1_gnb))
# Complement Naive Bayes
# (note: klaR::NaiveBayes does not provide a Complement variant; this call is identical
#  to the Gaussian fit above, which explains the identical results below)
cnb_model <- foreach(i = 1, .packages = 'klaR') %dopar% {
NaiveBayes(y_train ~ ., data = data.frame(X_train, y_train), usekernel = FALSE)
}
pred_cnb <- predict(cnb_model[[1]], X_test)$class
acc_cnb <- sum(pred_cnb == y_test) / length(y_test)
f1_cnb <- F1_Score(y_true = y_test, y_pred = pred_cnb, positive = levels(y_test)[1])
models$cnb <- cnb_model[[1]]
evaluation <- rbind(evaluation, data.frame(Model = "Complement Naive Bayes", Accuracy = acc_cnb, F1 = f1_cnb))
# Stop parallel processing
stopCluster(cl)
registerDoSEQ()
# Print evaluation results
print(evaluation)
## Model Accuracy F1
## 1 Gaussian Naive Bayes 0.5196581 0.5102041
## 2 Complement Naive Bayes 0.5196581 0.5102041
These variants perform exactly the same as the basic Naive Bayes, so Naive Bayes is taken as the best-performing algorithm.
# Feature engineering: combine GroundService and CabinStaffService into one averaged feature
X_train$GroundCabinService = (X_train$GroundService + X_train$CabinStaffService) / 2
X_test$GroundCabinService = (X_test$GroundService + X_test$CabinStaffService) / 2
# Load required libraries
pacman::p_load(nnet,naivebayes,e1071,rpart,randomForest,MLmetrics, caret)
# Function to clean column names to ensure they are valid R variable names
clean_names <- function(df) {
colnames(df) <- make.names(colnames(df), unique = TRUE)
return(df)
}
# Assuming X_train, y_train, X_test, and y_test are already defined and loaded
X_train <- clean_names(X_train)
X_test <- clean_names(X_test)
# Ensure y_train and y_test are factors for classification tasks
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)
# Set seed for reproducibility
set.seed(123)
# Initialize a list to store models and their evaluation metrics
models <- list()
evaluation <- data.frame(Model = character(), Accuracy = numeric(), F1 = numeric(), stringsAsFactors = FALSE)
# Multinomial Logistic Regression
multinom_model <- multinom(y_train ~ ., data = data.frame(X_train, y_train))
## # weights: 39 (24 variable)
## initial value 2575.147205
## iter 10 value 2342.201854
## iter 20 value 2230.757446
## final value 2221.951984
## converged
pred_multinom <- predict(multinom_model, X_test)
acc_multinom <- sum(pred_multinom == y_test) / length(y_test)
f1_multinom <- F1_Score(y_true = y_test, y_pred = pred_multinom, positive = levels(y_test)[1])
models$multinom <- multinom_model
evaluation <- rbind(evaluation, data.frame(Model = "Multinomial Logistic Regression", Accuracy = acc_multinom, F1 = f1_multinom))
# Support Vector Machines (SVM)
svm_model <- svm(y_train ~ ., data = data.frame(X_train, y_train), kernel = "linear", probability = TRUE)
pred_svm <- predict(svm_model, X_test)
acc_svm <- sum(pred_svm == y_test) / length(y_test)
f1_svm <- F1_Score(y_true = y_test, y_pred = pred_svm, positive = levels(y_test)[1])
models$svm <- svm_model
evaluation <- rbind(evaluation, data.frame(Model = "SVM", Accuracy = acc_svm, F1 = f1_svm))
# Naive Bayes
nb_model <- naive_bayes(y_train ~ ., data = data.frame(X_train, y_train))
pred_nb <- predict(nb_model, X_test)
acc_nb <- sum(pred_nb == y_test) / length(y_test)
f1_nb <- F1_Score(y_true = y_test, y_pred = pred_nb, positive = levels(y_test)[1])
models$nb <- nb_model
evaluation <- rbind(evaluation, data.frame(Model = "Naive Bayes", Accuracy = acc_nb, F1 = f1_nb))
# K-Nearest Neighbors (KNN)
knn_model <- train(y_train ~ ., data = data.frame(X_train, y_train), method = "knn", tuneLength = 5)
pred_knn <- predict(knn_model, X_test)
acc_knn <- sum(pred_knn == y_test) / length(y_test)
f1_knn <- F1_Score(y_true = y_test, y_pred = pred_knn, positive = levels(y_test)[1])
models$knn <- knn_model
evaluation <- rbind(evaluation, data.frame(Model = "KNN", Accuracy = acc_knn, F1 = f1_knn))
# Decision Tree
dt_model <- rpart(y_train ~ ., data = data.frame(X_train, y_train))
pred_dt <- predict(dt_model, X_test, type = "class")
acc_dt <- sum(pred_dt == y_test) / length(y_test)
f1_dt <- F1_Score(y_true = y_test, y_pred = pred_dt, positive = levels(y_test)[1])
models$dt <- dt_model
evaluation <- rbind(evaluation, data.frame(Model = "Decision Tree", Accuracy = acc_dt, F1 = f1_dt))
# Random Forest
rf_model <- randomForest(y_train ~ ., data = data.frame(X_train, y_train), ntree = 100)
pred_rf <- predict(rf_model, X_test)
acc_rf <- sum(pred_rf == y_test) / length(y_test)
f1_rf <- F1_Score(y_true = y_test, y_pred = pred_rf, positive = levels(y_test)[1])
models$rf <- rf_model
evaluation <- rbind(evaluation, data.frame(Model = "Random Forest", Accuracy = acc_rf, F1 = f1_rf))
# Print evaluation results
print(evaluation)
## Model Accuracy F1
## 1 Multinomial Logistic Regression 0.5264957 0.41447368
## 2 SVM 0.5008547 0.09803922
## 3 Naive Bayes 0.5162393 0.53023256
## 4 KNN 0.4974359 0.38596491
## 5 Decision Tree 0.5299145 0.32432432
## 6 Random Forest 0.5076923 0.40707965
pacman::p_load(caretEnsemble, caret, randomForest, MLmetrics, klaR, kernlab)
## Installing package into 'C:/Users/User/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
# Assuming X_train, y_train, X_test, and y_test are already loaded in your environment
# Combine X_train and y_train into a single training dataset
trainData <- data.frame(X_train, target = y_train)
testData <- data.frame(X_test, target = y_test)
# Convert target to a factor if it's a classification problem
trainData$target <- as.factor(trainData$target)
testData$target <- as.factor(testData$target)
# Define control function for training with hyperparameter tuning
control <- trainControl(method = "cv", number = 5, savePredictions = "final",
classProbs = TRUE, summaryFunction = multiClassSummary,
allowParallel = TRUE)
# Train base models
model_nb <- train(target ~ ., data = trainData, method = "nb", trControl = control)
model_knn <- train(target ~ ., data = trainData, method = "knn", trControl = control)
model_rf <- train(target ~ ., data = trainData, method = "rf", trControl = control)
model_svm <- train(target ~ ., data = trainData, method = "svmRadial", trControl = control)
# Create a new training dataset for the meta-model using predictions from base models
meta_train <- data.frame(
nb = predict(model_nb, newdata = trainData, type = "prob"),
knn = predict(model_knn, newdata = trainData, type = "prob"),
rf = predict(model_rf, newdata = trainData, type = "prob"),
svm = predict(model_svm, newdata = trainData, type = "prob"),
target = trainData$target
)
# Train the meta-model (using randomForest for example)
meta_control <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = multiClassSummary)
meta_model <- train(target ~ ., data = meta_train, method = "rf", trControl = meta_control)
# Create a new test dataset for the meta-model using predictions from base models
meta_test <- data.frame(
nb = predict(model_nb, newdata = testData, type = "prob"),
knn = predict(model_knn, newdata = testData, type = "prob"),
rf = predict(model_rf, newdata = testData, type = "prob"),
svm = predict(model_svm, newdata = testData, type = "prob")
)
# Predict on test data using the meta-model
stacked_predictions <- predict(meta_model, newdata = meta_test)
# Evaluate performance
confMatrix <- confusionMatrix(stacked_predictions, testData$target)
accuracy <- confMatrix$overall['Accuracy']
# Calculate F1 score for each class and then average
f1_scores <- sapply(levels(testData$target), function(class) {
F1_Score(y_true = testData$target, y_pred = stacked_predictions, positive = class)
})
f1_score <- mean(f1_scores)
# Output the results
print(paste("Accuracy: ", accuracy))
## [1] "Accuracy: 0.497435897435897"
print(paste("F1 Score: ", f1_score))
## [1] "F1 Score: 0.498716875375897"
The stacking ensemble that integrates Naive Bayes, KNN, Random Forest, and SVM reaches only about 49.7% accuracy and an F1 score of 49.9%, so the best performance across all experiments remains roughly 52% accuracy with an F1 of about 51-53% (Naive Bayes). This is not meaningfully better than guessing the sentiment class.
It is therefore safe to assume that the available variables in this dataset cannot reliably predict customer sentiment, which suggests that sentiment is influenced by factors outside the dataset. Further analysis could incorporate more features to investigate whether they are predictive of customer satisfaction.
Purpose: To examine the relationship between customer sentiment and service quality.
Model: Linear Regression (benchmark), Decision Tree Regressor, Random Forest Regressor, Polynomial Linear Regression, XGBoost Regressor
Training: 70% train / 30% test split
Evaluation: RMSE, MAE, MAPE, R²
pacman::p_load(tidyr,caret,tidyverse,randomForest,xgboost,Metrics,ggplot2,e1071,tidymodels)
# rescale the sentiment score into 0 - 2
rescale_to_0_2 <- function(x) {
min_x <- min(x, na.rm = TRUE)
max_x <- max(x, na.rm = TRUE)
scaled_x <- 2 * (x - min_x) / (max_x - min_x)
return(scaled_x)
}
final_df$overall_sentiment_score <- rescale_to_0_2(final_df$overall_sentiment_score)
# Select encoded_SeatType, encoded_TypeOfTraveller, CabinStaffService, GroundService as X and the overall sentiment score as Y
X <- dplyr::select(final_df, encoded_SeatType, encoded_TypeOfTraveller, CabinStaffService, GroundService)
Y <- final_df$overall_sentiment_score
# split the dataset into X and Y with training dataset 70% and testing dataset 30%
set.seed(456)
train_index <- createDataPartition(Y, p = 0.7, list = FALSE)
train_data <- X[train_index,]
test_data <- X[-train_index,]
train_label <- Y[train_index]
test_label <- Y[-train_index]
training_set <- cbind(train_data, Target = train_label)
testing_set <- cbind(test_data, Target = test_label)
# Perform 10-fold cross-validation
mControl <- trainControl(
method = "cv",
number = 10,
savePredictions = "final"
)
# Define the evaluation metrics: Root Mean Squared Error, Mean Absolute Error, Mean Absolute Percentage Error, and R-Squared
# Initialize final evaluation result
final_eval_result <- data.frame()
# Function to evaluate model
evaluate_model <- function(model, model_name, testing_set, test_label) {
predictions <- predict(model, newdata = testing_set)
predictions_df <- data.frame(Truth = test_label, Prediction = predictions)
metrics <- metric_set(rmse, mae, rsq, mape)
eval_results <- metrics(predictions_df, truth = Truth, estimate = Prediction)
eval_results <- eval_results %>%
mutate(Model = model_name)
return(eval_results)
}
Fit the Linear Regression model and evaluate it with the metrics defined above.
# Linear Regression model
set.seed(456)
# fit the data into the model
lm_model <- train(Target ~ ., data = training_set, method = "lm", metric = "MAE", maximize = FALSE, trControl = mControl)
# evaluate the model
eval_results <- evaluate_model(lm_model, "Linear Regression", testing_set, test_label)
print(eval_results)
## # A tibble: 4 × 4
## .metric .estimator .estimate Model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.272 Linear Regression
## 2 mae standard 0.213 Linear Regression
## 3 rsq standard 0.211 Linear Regression
## 4 mape standard 30.4 Linear Regression
Fit the Decision Tree Regressor model and evaluate it with the metrics defined above.
# Decision Tree Regressor model
set.seed(456)
# fit the data in decision tree regressor model for training
dt_model <- train(Target ~ ., data = training_set, method = "rpart", metric = "MAE", maximize = FALSE, trControl = mControl)
# evaluate the model
eval_results <- evaluate_model(dt_model, "Decision Tree Regressor", testing_set, test_label)
print(eval_results)
## # A tibble: 4 × 4
## .metric .estimator .estimate Model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.274 Decision Tree Regressor
## 2 mae standard 0.215 Decision Tree Regressor
## 3 rsq standard 0.199 Decision Tree Regressor
## 4 mape standard 31.1 Decision Tree Regressor
Fit the Random Forest Regressor model and evaluate it with the metrics defined above.
# Random Forest Regressor model
set.seed(456)
# fit the data in random forest regressor model for training
rf_model <- train(Target ~ ., data = training_set, method = "rf", metric = "MAE", maximize = FALSE, trControl = mControl)
# evaluate the model
eval_results <- evaluate_model(rf_model, "Random Forest Regressor", testing_set, test_label)
print(eval_results)
## # A tibble: 4 × 4
## .metric .estimator .estimate Model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.277 Random Forest Regressor
## 2 mae standard 0.217 Random Forest Regressor
## 3 rsq standard 0.199 Random Forest Regressor
## 4 mape standard 30.8 Random Forest Regressor
Fit the Polynomial Linear Regression model and evaluate it with the metrics defined above.
# Polynomial Linear Regression model
formula <- as.formula("Target ~ poly(encoded_SeatType, degree = 2) + poly(encoded_TypeOfTraveller, degree = 2) + poly(CabinStaffService, degree = 2) + poly(GroundService, degree = 2)")
set.seed(456)
# fit the data in polynomial regression model for training
poly_model <- train(formula, data = training_set, method = "lm", metric = "MAE", maximize = FALSE, trControl = mControl)
eval_results <- evaluate_model(poly_model, "Polynomial Linear Regression", testing_set, test_label)
print(eval_results)
## # A tibble: 4 × 4
## .metric .estimator .estimate Model
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.272 Polynomial Linear Regression
## 2 mae standard 0.213 Polynomial Linear Regression
## 3 rsq standard 0.214 Polynomial Linear Regression
## 4 mape standard 30.4 Polynomial Linear Regression
Fit the XGBoost Regressor model and evaluate it with the metrics defined above.
# XGBoost Regressor model
# Define XGBoost model specification
xgb_spec <- boost_tree(
trees = 100,
tree_depth = 6,
min_n = 1,
loss_reduction = 0,
sample_size = 1,
mtry = 1,
learn_rate = 0.3
) %>%
set_engine("xgboost") %>%
set_mode("regression")
# setup the workflow for xgboost
xgb_workflow <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Target ~ .)
# Create cross-validation with 10 folds
# (note: unlike the other models, XGBoost is evaluated by resampling the testing set
#  rather than being fit on the training set and scored on the hold-out test set)
set.seed(456)
cv_folds <- vfold_cv(testing_set, v = 10)
# Fit the model
xgb_resamples <- fit_resamples(
xgb_workflow,
resamples = cv_folds,
metrics = metric_set(rmse, mae, rsq, mape),
control = control_resamples(save_pred = TRUE)
)
# Collect metrics from cross-validation
eval_results <- collect_metrics(xgb_resamples)
eval_results <- eval_results %>%
mutate(Model = "XGBoost Regressor")
print(eval_results)
## # A tibble: 4 × 7
## .metric .estimator mean n std_err .config Model
## <chr> <chr> <dbl> <int> <dbl> <chr> <chr>
## 1 mae standard 0.236 10 0.00704 Preprocessor1_Model1 XGBoost Regressor
## 2 mape standard 33.1 10 2.53 Preprocessor1_Model1 XGBoost Regressor
## 3 rmse standard 0.302 10 0.00868 Preprocessor1_Model1 XGBoost Regressor
## 4 rsq standard 0.113 10 0.00921 Preprocessor1_Model1 XGBoost Regressor
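The final_eval_result data frame initialised earlier is never populated in the code above; a minimal sketch (assumed) of how the per-model metrics could be collected into it for a side-by-side comparison, omitting XGBoost because its metrics come from resampling and have a different structure:
# Combine the evaluation results of all regression models into one table (a sketch)
final_eval_result <- bind_rows(
  evaluate_model(lm_model, "Linear Regression", testing_set, test_label),
  evaluate_model(dt_model, "Decision Tree Regressor", testing_set, test_label),
  evaluate_model(rf_model, "Random Forest Regressor", testing_set, test_label),
  evaluate_model(poly_model, "Polynomial Linear Regression", testing_set, test_label)
) %>%
  select(Model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)
print(final_eval_result)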
Overview of the regression models built:
The RMSE values of all models are nearly identical, indicating that, on average, the models' predictions deviate from the actual values by similar amounts.
The MAE values are also very similar, showing that the models' average prediction errors are nearly the same.
The R-squared values are very low, indicating that the models capture little of the variability in the data.
The MAPE is high, showing that the predictions are not very accurate.
Thus, the Polynomial Linear Regression model performs best out of all the models, but overall performance is poor, with low accuracy even though cross-validation is applied.
In conclusion, the sentiment analysis of British Airways reviews provides a comprehensive understanding of customer experiences, revealing both areas of excellence and areas needing improvement. Positive sentiments highlight the exceptional service quality and in-flight comfort enjoyed by passengers, emphasizing the professionalism and friendliness of the cabin crew, as well as the cleanliness and amenities of the aircraft. However, negative sentiments draw attention to significant issues such as flight delays, cancellations, subpar customer service, and baggage handling mishaps.
Moving forward, the airline must leverage positive feedback to reinforce its strengths while proactively addressing areas of concern. This entails a multifaceted approach, including improving operational efficiency to minimize delays, enhancing customer service protocols to better handle complaints and disruptions, and implementing robust baggage handling systems to prevent losses and delays. Moreover, by utilizing sentiment analysis as a tool for targeted improvement strategies, the airline can tailor its efforts to address specific pain points identified by customers.
Ultimately, by prioritizing initiatives aimed at enhancing the overall passenger experience, British Airways can not only retain existing customers but also attract new ones, thereby solidifying its competitive position in the industry. Through continuous monitoring of customer feedback and a commitment to delivering exceptional service, the airline can cultivate a loyal customer base, ensuring sustained success and growth in the dynamic aviation market.