🛩 Qatar Airline Review Data Preprocessing

Bro Intelligence

November 6th, 2024

Falcon Project 14

Introduction

In this Data Preprocessing notebook, we conduct the initial steps for analyzing Qatar Airlines reviews data frame, focusing on cleaning and preparing the data for further enhancement, analysis and visualization, and forecasting.

The steps in this notebook include:

Data Cleaning – Handling missing values, removing unnecessary columns, and filtering out irrelevant data.
Feature Engineering – Classifying aircraft types, extracting date-related information, and categorizing key topics within the reviews.

This notebook serves as a preliminary stage before Data Enhancement.

Data Cleaning

Read the data:

library(tidyverse)

df <- read.csv("datasets/Airline_Qatar_Reviews.csv")

Glimpse of the dataframe

df %>% glimpse()

## Rows: 2,369
## Columns: 16
## $ Unnamed..0        <dbl> 0, NA, 2, 3, 4, 5, NA, 7, 8, NA, NA, NA, 12, 13, NA,…
## $ Date.Published    <chr> "2024-03-02", "2024-02-29", "2024-02-29", "2024-02-2…
## $ Rating            <dbl> 1, 1, 1, 10, 7, 10, 10, 7, 10, 4, 6, 10, 7, 1, 1, 2,…
## $ Max.Rating        <dbl> 10, 10, 10, 10, NA, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Title             <chr> "“marred by inconveniences”", "", "\"no attempt to a…
## $ Author            <chr> "Mary Le", "Brian English", "Wayne Burgess", "Alwale…
## $ Country           <chr> "United Kingdom", "", "Australia", "Qatar", "United …
## $ Date              <chr> "2024-03-02", "2024-02-29", "2024-02-29", "2024-02-2…
## $ Review.Body       <chr> " The delay of my flight from Haneda to Doha caused …
## $ Type.Of.Traveller <chr> "Solo Leisure", "Couple Leisure", "Business", "Coupl…
## $ Seat.Type         <chr> "Economy Class", "Business Class", "Economy Class", …
## $ Route             <chr> "Tokyo to London Heathrow via Doha", "Doha to Montre…
## $ Date.Flown        <chr> "March 2024", "February 2024", "", "February 2024", …
## $ Recommended       <chr> "no", "no", "no", "", "no", "", "", "", "", "no", "n…
## $ Aircraft          <chr> "", "", "A380", "A380-800", "", "A380", "", "Boeing …
## $ Verified          <dbl> 0, 1, 1, NA, NA, 1, NA, NA, 1, NA, 1, NA, 1, 1, 0, 1…

First 10 Rows

df %>% head(10)

Last 10 Rows

df %>% tail(10)

There are undetected NAs and non necessary columns. We’re gonna replace the undetected NAs with NA

Replace undected NAs with NA:

df <- df %>% mutate(across(where(is.character), ~ na_if(., "")))

df %>% head()

Remove unnecessary columns:

df <- df %>%
  select(
    -Unnamed..0,
    -Date.Published,
    -Max.Rating,
    -Author,
    -Date.Flown,
    -Recommended,
    -Verified
  ) %>%
  slice(-1)

df %>% glimpse()

## Rows: 2,368
## Columns: 9
## $ Rating            <dbl> 1, 1, 10, 7, 10, 10, 7, 10, 4, 6, 10, 7, 1, 1, 2, NA…
## $ Title             <chr> NA, "\"no attempt to address our complaint\"", "“I h…
## $ Country           <chr> NA, "Australia", "Qatar", "United Kingdom", NA, "Uni…
## $ Date              <chr> "2024-02-29", "2024-02-29", "2024-02-25", "2024-02-2…
## $ Review.Body       <chr> "  They convinced me that I needed to pay $1500 to a…
## $ Type.Of.Traveller <chr> "Couple Leisure", "Business", "Couple Leisure", "Cou…
## $ Seat.Type         <chr> "Business Class", "Economy Class", "First Class", "B…
## $ Route             <chr> "Doha to Montreal", "Doha to Perth", "Doha to London…
## $ Aircraft          <chr> NA, "A380", "A380-800", NA, "A380", NA, "Boeing 787-…

Check duplicates:

df %>% anyDuplicated()

## [1] 0

There are no duplicated row

Check missing values distribution:

df %>% is.na() %>% colSums()

##            Rating             Title           Country              Date 
##               475               474               474               474 
##       Review.Body Type.Of.Traveller         Seat.Type             Route 
##               474               825               474               821 
##          Aircraft 
##              1351

There still many missing values but we dont need to do anything with them except for Rating and Date which we’ll deal with them in Data Enhancement stage. What we need to do for now is just removing rows that are missing both the Rating and Review.Body

Remove useless rows:

Remove rows where Rating and Review.Body are both missing

df <- df %>%
  filter(!(is.na(Review.Body) & is.na(Rating)))

df %>% glimpse()

## Rows: 2,269
## Columns: 9
## $ Rating            <dbl> 1, 1, 10, 7, 10, 10, 7, 10, 4, 6, 10, 7, 1, 1, 2, NA…
## $ Title             <chr> NA, "\"no attempt to address our complaint\"", "“I h…
## $ Country           <chr> NA, "Australia", "Qatar", "United Kingdom", NA, "Uni…
## $ Date              <chr> "2024-02-29", "2024-02-29", "2024-02-25", "2024-02-2…
## $ Review.Body       <chr> "  They convinced me that I needed to pay $1500 to a…
## $ Type.Of.Traveller <chr> "Couple Leisure", "Business", "Couple Leisure", "Cou…
## $ Seat.Type         <chr> "Business Class", "Economy Class", "First Class", "B…
## $ Route             <chr> "Doha to Montreal", "Doha to Perth", "Doha to London…
## $ Aircraft          <chr> NA, "A380", "A380-800", NA, "A380", NA, "Boeing 787-…

Feature Engineering

Classify aircraft type:

Classify between Airbus and Boeing aircraft

library(stringr)

df <- df %>%
  mutate(
    Aircraft_Cleaned = Aircraft %>%
      sub("/.*", "", .) %>%
      sub("-.*", "", .) %>%
      trimws() %>% 
      tolower(),
    
    Aircraft_Classified = case_when(
      str_detect(Aircraft_Cleaned, "7") & !str_detect(Aircraft_Cleaned, "a") ~ "Boeing",  # Boeing jika ada angka 7 dan tidak ada A
      str_detect(Aircraft_Cleaned, "a") ~ "Airbus",  # Airbus jika ada huruf A
      TRUE ~ NA  # NA jika tidak memenuhi keduanya
    ) %>% as.factor()
  )

df %>% 
  select(Aircraft, Aircraft_Cleaned, Aircraft_Classified) %>% 
  filter(complete.cases(.)) %>% 
  head()

df %>% is.na() %>% colSums()

##              Rating               Title             Country                Date 
##                 376                 449                 458                 456 
##         Review.Body   Type.Of.Traveller           Seat.Type               Route 
##                 375                 789                 456                 786 
##            Aircraft    Aircraft_Cleaned Aircraft_Classified 
##                1296                1296                1299

Classify the review topics:

We categorize reviews into different topics based on keywords. The topics are designed to identify key areas of customer feedback, such as customer service, time and delay, seat comfort, and food and beverage. Reviews that contain specific keywords will be counted under the respective topics.

library(NLP)
library(tm)

# My text cleaner function
clean_text <- function(text, as.corpus = T, lower = T, rm.number = T, rm.stopwords_english = T, rm.stopwords_bahasa = F, rm.punctuation = T, stem = T, rm.whitespace = T){
  text_corpus <- text %>% VectorSource() %>% VCorpus()
  
  # Lowercasing
  if (lower){
    text_corpus <- tm_map(x = text_corpus,
                          FUN = content_transformer(tolower))
  }
  # Removing numbers
  if (rm.number){
    text_corpus <- tm_map(x = text_corpus,
                          FUN = removeNumbers)
    
  # Removing punctuation
  if (rm.punctuation){
    text_corpus <- tm_map(x = text_corpus,
                          FUN = removePunctuation)
    text_corpus <- tm_map(text_corpus, content_transformer(function(x) gsub("[[:punct:]]+", "", x)))
    text_corpus <- tm_map(text_corpus, content_transformer(function(x) gsub("/", " ", x)))
  }
    
  }
  # Removing english stop words
  if (rm.stopwords_english){
    list_stop_words_english <- readLines("stop-words_english.txt", warn = FALSE, encoding = "UTF-8")
    
    text_corpus <- tm_map(x = text_corpus,
                          FUN = removeWords,
                          list_stop_words_english)
  }
  # Removing bahasa stop words
  if (rm.stopwords_bahasa){
    list_stop_words_bahasa <- readLines("stop-words_bahasa.txt", warn = FALSE, encoding = "UTF-8")
    
    text_corpus <- tm_map(x = text_corpus,
                          FUN = removeWords,
                          list_stop_words_bahasa)
  }
  
  # Reducing words to their base form
  if (stem){
    text_corpus <- tm_map(x = text_corpus,
                          FUN = stemDocument)
  }
  # Removing white/blank spaces
  if (rm.whitespace){
    text_corpus <- tm_map(x = text_corpus,
                          FUN = stripWhitespace)
  }
  
  # Returning the text as or not as corpus
  if (as.corpus){
    return(text_corpus)
  }
  else(
    return(sapply(text_corpus, as.character))
  )
}


# Function to check for topic keywords and classify reviews
classify_review <- function(review, category_keywords) {
  if (is.na(review)) {
    return(0)  # If the review is NA
  }
  
  cleaned_review <- clean_text(review, F, rm.stopwords_bahasa = F)  # Clean the review text
  
  # Check if any of the keywords appear in the cleaned review
  if (any(sapply(category_keywords, function(keyword) {
    str_detect(cleaned_review, regex(keyword, ignore_case = TRUE))
  }))) {
    return(1)  # Return 1 if any keyword is detected
  } else {
    return(0)  # Return 0 if no keyword is detected
  }
}

# Keywords for each topic
keywords_customer_service <- c("service", "support", "customer", "response", "staff", "help", "crew", "email")
keywords_time_delay <- c("delay", "wait", "time", "hour", "late", "boarding", "missed")
keywords_seat_comfort <- c("seat", "comfort", "space", "legroom", "comfortability", "recline", "pillow", "blanket", "ac", "temperature", "air", "conditioner", "hot")
keywords_food_beverage <- c("food", "meal", "beverage", "drink", "snack", "refreshment", "menu", "lunch", "dinner", "breakfast")

# Classify topics
df <- df %>%
  mutate(
    "Customer_Service" = ifelse(is.na(Review.Body), 0, sapply(Review.Body, classify_review, category_keywords = keywords_customer_service)),
    "Time_and_Delay" = ifelse(is.na(Review.Body), 0, sapply(Review.Body, classify_review, category_keywords = keywords_time_delay)),
    "Seat_and_Comfort" = ifelse(is.na(Review.Body), 0, sapply(Review.Body, classify_review, category_keywords = keywords_seat_comfort)),
    "Food_and_Beverages" = ifelse(is.na(Review.Body), 0, sapply(Review.Body, classify_review, category_keywords = keywords_food_beverage))
  )

df %>% glimpse()

## Rows: 2,269
## Columns: 15
## $ Rating              <dbl> 1, 1, 10, 7, 10, 10, 7, 10, 4, 6, 10, 7, 1, 1, 2, …
## $ Title               <chr> NA, "\"no attempt to address our complaint\"", "“I…
## $ Country             <chr> NA, "Australia", "Qatar", "United Kingdom", NA, "U…
## $ Date                <chr> "2024-02-29", "2024-02-29", "2024-02-25", "2024-02…
## $ Review.Body         <chr> "  They convinced me that I needed to pay $1500 to…
## $ Type.Of.Traveller   <chr> "Couple Leisure", "Business", "Couple Leisure", "C…
## $ Seat.Type           <chr> "Business Class", "Economy Class", "First Class", …
## $ Route               <chr> "Doha to Montreal", "Doha to Perth", "Doha to Lond…
## $ Aircraft            <chr> NA, "A380", "A380-800", NA, "A380", NA, "Boeing 78…
## $ Aircraft_Cleaned    <chr> NA, "a380", "a380", NA, "a380", NA, "boeing 787", …
## $ Aircraft_Classified <fct> NA, Airbus, Airbus, NA, Airbus, NA, Boeing, Boeing…
## $ Customer_Service    <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1,…
## $ Time_and_Delay      <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1,…
## $ Seat_and_Comfort    <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,…
## $ Food_and_Beverages  <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,…

Result

The final data result:

Glimpse of the dataframe

df %>% glimpse()

## Rows: 2,269
## Columns: 15
## $ Rating              <dbl> 1, 1, 10, 7, 10, 10, 7, 10, 4, 6, 10, 7, 1, 1, 2, …
## $ Title               <chr> NA, "\"no attempt to address our complaint\"", "“I…
## $ Country             <chr> NA, "Australia", "Qatar", "United Kingdom", NA, "U…
## $ Date                <chr> "2024-02-29", "2024-02-29", "2024-02-25", "2024-02…
## $ Review.Body         <chr> "  They convinced me that I needed to pay $1500 to…
## $ Type.Of.Traveller   <chr> "Couple Leisure", "Business", "Couple Leisure", "C…
## $ Seat.Type           <chr> "Business Class", "Economy Class", "First Class", …
## $ Route               <chr> "Doha to Montreal", "Doha to Perth", "Doha to Lond…
## $ Aircraft            <chr> NA, "A380", "A380-800", NA, "A380", NA, "Boeing 78…
## $ Aircraft_Cleaned    <chr> NA, "a380", "a380", NA, "a380", NA, "boeing 787", …
## $ Aircraft_Classified <fct> NA, Airbus, Airbus, NA, Airbus, NA, Boeing, Boeing…
## $ Customer_Service    <dbl> 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1,…
## $ Time_and_Delay      <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1,…
## $ Seat_and_Comfort    <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,…
## $ Food_and_Beverages  <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,…

Bro is actually intelligent💀

First 10 Rows

df %>% head(10)

Bro is actually intelligent💀

Last 10 Rows

df %>% tail(10)

Bro is actually intelligent💀

Save the preprocessed data:

write.csv(df, "datasets/1-preprocessed.csv")

Alhamdulillah. Next stage is Data Enhancement, where we will address the missing values in Rating and Date columns and add a sentiment category.