Introduction

This project implements sentiment analysis on Twitter data. Specifically, it aims to demonstrate an efficient, precise, and reliable method for classifying airline review tweets according to their sentiment: positive, negative, or neutral.

The following sections describe the concepts and steps used in implementing this project.

Loading the Packages

Three sets of packages were loaded for this project:

# Tidyverse-related packages
pacman::p_load("readr", "dplyr", "stringr", "purrr")

# Text-cleaning packages
pacman::p_load("textclean", "qdapRegex", "stopwords", "tm")

# Modelling packages
pacman::p_load("fastNaiveBayes", "UBL", "ePCR", "caret", "rsconnect")

Loading the Dataset

The dataset for this project is taken from the fastNaiveBayes package. It contains more than 14,000 tweets about major U.S. airlines from February 2015, along with pre-classified sentiments (positive, negative, and neutral).

The original source of the data can be found here: https://www.figure-eight.com/data-for-everyone/

# Loads the tweets dataset bundled with the fastNaiveBayes package
airline_tweets <- tweets
airline_tweets$airline_sentiment <- factor(airline_tweets$airline_sentiment)

# Previews the dimensions of the dataset
cat("Rows: ", dim(airline_tweets)[1], "\nColumns: ", dim(airline_tweets)[2])
## Rows:  14640 
## Columns:  2
# Displays the first 10 rows
head(airline_tweets, 10)
##    airline_sentiment
## 1            neutral
## 2           positive
## 3            neutral
## 4           negative
## 5           negative
## 6           negative
## 7           positive
## 8            neutral
## 9           positive
## 10          positive
##                                                                                                                                       text
## 1                                                                                                      @VirginAmerica What @dhepburn said.
## 2                                                                 @VirginAmerica plus you've added commercials to the experience... tacky.
## 3                                                                  @VirginAmerica I didn't today... Must mean I need to take another trip!
## 4           @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
## 5                                                                                  @VirginAmerica and it's a really big bad thing about it
## 6  @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA
## 7                                                         @VirginAmerica yes, nearly every time I fly VX this UIear wormU wonUt go away :)
## 8                             @VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP
## 9                                                                                         @virginamerica Well, I didn'tU_but NOW I DO! :-D
## 10                                                        @VirginAmerica it was amazing, and arrived an hour early. You're too good to me.

Inspecting the Dataset

Initial inspection of the data shows that negative tweets greatly outnumber positive and neutral tweets. If left unaddressed, this imbalance could lead to a model biased toward the negative class. To address this problem, the SMOTE algorithm is used for this project (further explanation is found in the succeeding sections).

# Tabulates the number of tweets per sentiment
table(airline_tweets$airline_sentiment)
## 
## negative  neutral positive 
##     9178     3099     2363
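
Expressed as proportions (a quick check using the same table), the negative class accounts for roughly 63% of the tweets:

# Shows the share of each sentiment class
round(prop.table(table(airline_tweets$airline_sentiment)), 2)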

Cleaning the Tweets

The first step in fitting a model is pre-processing the data. Using functions from the textclean, stringr, and tm packages, the cleaning steps are condensed into a single user-defined function tailored to Twitter data. This prepares the text for the document-term matrix built in the next step and removes elements that are not helpful for model training.

# Combines textclean, stringr, and tm functions to clean tweets
clean_tweet <- function(tweet){
  
  tweet <- str_to_lower(tweet) # changes all letters to lowercase
  tweet <- replace_url(tweet) # removes hyperlinks
  tweet <- replace_email(tweet) # removes emails
  tweet <- replace_html(tweet) # changes html markup symbols to correct form
  
  # changes emoticons to word-equivalents, ignores emoticons in words
  tweet <- paste(unlist(map(unlist(str_split(tweet, " ")), 
                            function(word) ifelse(!(str_detect(word, "^[A-Za-z]")), 
                                                  replace_emoticon(word), 
                                                  word))), collapse=" ") 
  
  tweet <- str_squish(tweet) # removes additional space
  tweet <- replace_internet_slang(tweet) # changes slang words to word-equivalents
  tweet <- replace_word_elongation(tweet) # shortens extended words
  tweet <- replace_non_ascii(tweet) # changes non-ascii words to correct word form
  tweet <- replace_white(tweet) # removes escaped characters
  
  # removes stopwords
  tweet <- paste(unlist(str_split(tweet, " "))
                 [!((unlist(str_split(tweet, " "))) %in% stopwords(kind="en"))], collapse=" ")
  
  tweet <- replace_contraction(tweet) # extends any contracted words/words with "'"
  tweet <- str_squish(tweet) # removes additional space
  tweet <- str_replace_all(tweet, "[:digit:]", "") # removes numbers
  tweet <- str_replace_all(tweet, "[:punct:]", "") # removes punctuation marks
  
  # removes any other characters excluding letters
  tweet <- paste(unlist(map(unlist(str_split(tweet, " ")),
                            function(word) paste(unlist(str_extract_all(word, "[:alnum:]")),
                                                 collapse=""))), collapse=" ")
  
  # removes one- and two-letter words
  tweet <- paste(unlist(str_split(tweet, " "))
                 [nchar(unlist(str_split(tweet, " "))) > 2], collapse=" ")
  
  tweet <- str_squish(tweet) # removes additional space
  tweet
  
}

# Cleans all tweets using the function above
cleaned_tweets <- unlist(map(airline_tweets$text, clean_tweet))

# Displays the first 10 rows
head(cleaned_tweets, 10)
##  [1] "virginamerica dhepburn said"                                                               
##  [2] "virginamerica plus added commercials experience tacky"                                     
##  [3] "virginamerica today must mean need take another trip"                                      
##  [4] "virginamerica really aggressive blast obnoxious entertainment guests faces little recourse"
##  [5] "virginamerica really big bad thing"                                                        
##  [6] "virginamerica seriously pay flight seats playing really bad thing flying"                  
##  [7] "virginamerica yes nearly every time fly uiear wormu wonut away smiley"                     
##  [8] "virginamerica really missed prime opportunity men without hats parody there"               
##  [9] "virginamerica well did notubut now laughing"                                               
## [10] "virginamerica amazing arrived hour early good"

Creating a Document-Term Matrix

After the data is pre-processed, a document-term matrix is created to serve as the data for model building. This matrix has one column for every term in the corpus and one row for each pre-processed tweet; each cell counts how many times that term occurs in that tweet.

The sparsity input controls which terms are kept: removeSparseTerms() drops any term whose proportion of empty (zero-count) entries across the tweets exceeds the given threshold. With a sparsity of 0.9999, this project retains only terms that appear in at least 0.01% of the tweets.
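
To make this concrete, here is a minimal toy illustration (made-up tweets, not the project data) of how a document-term matrix and the sparsity filter behave:

# Toy illustration of a document-term matrix (not the project data)
toy_tweets <- c("flight delayed again", "great flight great crew")
toy_corpus <- VCorpus(VectorSource(toy_tweets))
toy_dtm <- DocumentTermMatrix(toy_corpus)
as.matrix(toy_dtm) # one row per tweet, one column per term, cells hold counts
                   # e.g. "great" is counted twice in the second tweet

# With sparse = 0.4, terms missing from more than 40% of the documents are dropped,
# so only "flight", which occurs in both toy tweets, survives
removeSparseTerms(toy_dtm, sparse = 0.4)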

# Creates the final document-term matrix using the vector of cleaned tweets and sparsity input
generate_dtm <- function(cleaned_tweets, sparsity){
  
  # Creates a corpus from the cleaned tweets and stems each word
  tweets_corpus <- VCorpus(VectorSource(cleaned_tweets))
  tweets_corpus <- tm_map(tweets_corpus, stemDocument)
  
  # Creates a document-term matrix and removes terms based on sparsity level
  tweets_dtm <- DocumentTermMatrix(tweets_corpus)
  tweets_dtm <- removeSparseTerms(tweets_dtm, sparse = sparsity)
  
  # Creates the final document-term matrix including the response to be modeled
  tweets_dtm_final <- data.frame(as.matrix(tweets_dtm))
  tweets_dtm_final <- data.frame(cbind(airline_sentiment = airline_tweets$airline_sentiment,
                                       tweets_dtm_final))
  tweets_dtm_final
}

dtm_cleaned_tweets <- generate_dtm(cleaned_tweets, 0.9999)

# Displays the first 10 rows and columns of the dataset
dtm_cleaned_tweets[1:10, 1:10]
##    airline_sentiment aaba aadfw aadv aadvantag aafail aal aano aarp abandon
## 1            neutral    0     0    0         0      0   0    0    0       0
## 2           positive    0     0    0         0      0   0    0    0       0
## 3            neutral    0     0    0         0      0   0    0    0       0
## 4           negative    0     0    0         0      0   0    0    0       0
## 5           negative    0     0    0         0      0   0    0    0       0
## 6           negative    0     0    0         0      0   0    0    0       0
## 7           positive    0     0    0         0      0   0    0    0       0
## 8            neutral    0     0    0         0      0   0    0    0       0
## 9           positive    0     0    0         0      0   0    0    0       0
## 10          positive    0     0    0         0      0   0    0    0       0

Remedy for Imbalanced Data

To tackle the problem of the imbalanced dataset, the SMOTE (Synthetic Minority Over-sampling Technique) algorithm is implemented in the code below. It creates synthetic examples for the under-represented classes (i.e., positive and neutral tweets) by sampling points from a minority class and interpolating between each point and its k-nearest neighbours. This way, the classes are balanced using synthetic observations that are distinct from the existing data points.
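
Before the full implementation, here is a minimal sketch of the interpolation idea behind SMOTE (illustration only with made-up word counts, not the SmoteClassif() routine from the UBL package used below):

# Sketch of SMOTE's core idea: a synthetic observation lies on the line segment
# between a minority-class point and one of its nearest neighbours
set.seed(1)
x        <- c(bad = 1, good = 0, flight = 2)   # a minority-class word-count vector
neighbor <- c(bad = 0, good = 1, flight = 3)   # one of its nearest neighbours
synthetic <- x + runif(1) * (neighbor - x)     # random point between the two
ceiling(synthetic)                             # the project rounds such fractional counts up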

# Generates an updated dataset with balanced classes 
balanced_dataset <- function(train_fold){
  
  # Implements SMOTE while retaining the same number of observations
  balanced_fold <- SmoteClassif(airline_sentiment~., train_fold, C.perc="balance")
  
  # Rounds the synthetic (fractional) word counts up to the nearest integer
  balanced_rounded <- map_dfr(balanced_fold[, -1], ceiling) 
  
  # Combines again the dataset
  balanced_fold <- data.frame(cbind(airline_sentiment = balanced_fold$airline_sentiment,
                                    balanced_rounded))
  balanced_fold
}

Implementing Naive Bayes per Training Fold

The Naive Bayes algorithm was used to model the data under the assumption that the features are conditionally independent of one another given the class. Given the amount of data involved, a fast implementation of the classifier was used (fastNaiveBayes). Compared to other Naive Bayes implementations (e.g., those available through caret), this package trains models considerably faster.
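
To illustrate the independence assumption, the sketch below scores a hypothetical tweet by multiplying a class prior with Laplace-smoothed per-word likelihoods on a tiny made-up document-term matrix (illustration only, not the fastNaiveBayes internals):

# Toy multinomial Naive Bayes scoring (hypothetical data)
toy_x <- matrix(c(2, 0, 1,
                  0, 3, 1,
                  1, 1, 0), nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("bad", "good", "flight")))
toy_y <- factor(c("negative", "positive", "negative"))
new_tweet <- c(bad = 1, good = 0, flight = 1)

scores <- sapply(levels(toy_y), function(cl) {
  counts <- colSums(toy_x[toy_y == cl, , drop = FALSE]) + 1  # Laplace smoothing
  probs  <- counts / sum(counts)                             # P(word | class)
  prior  <- mean(toy_y == cl)                                # P(class)
  prior * prod(probs ^ new_tweet)                            # independence: likelihoods multiply
})
scores / sum(scores)  # normalised class probabilities for the new tweet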

# Trains the training fold of data using the Naive Bayes algorithm
generate_model <- function(train_data){
  nb_model <- fastNaiveBayes(y=train_data$airline_sentiment,
                             x=train_data[, -1], laplace = 1)
  nb_model
}

Model Evaluation per Training Fold

After the model is trained, it is tested on the corresponding test data for each fold. A confusion matrix is then generated, which contains all the relevant metrics output by the model.

# Evaluates the model using the test data and generates a Confusion Matrix
generate_cf <- function(nb_model, test_data){
  predictions <- predict(nb_model, test_data[, -1])
  cf <- confusionMatrix(predictions, test_data[, 1])
  cf
}

Implementing the K-Fold Cross Validation using Naive Bayes

To ensure that the algorithm performs well across different subsets of the data, k-fold cross-validation was implemented. For each fold, a model is fit on the training portion and validated on the held-out portion. The process is repeated with a different training/testing split each time, and the metrics are averaged to obtain the overall scores.
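
As a simple illustration of how the folds partition the rows (the project itself generates the indices with cv() below), each tweet is assigned to exactly one of the k folds; that fold is held out for testing while the rest are used for training:

# Sketch of a single 5-fold split (illustration only)
set.seed(1)
n         <- nrow(dtm_cleaned_tweets)
fold_id   <- sample(rep(1:5, length.out = n))  # assign every tweet to one of 5 folds
test_idx  <- which(fold_id == 1)               # fold 1 is held out for testing
train_idx <- which(fold_id != 1)               # folds 2-5 are used for training
c(test = length(test_idx), train = length(train_idx))  # roughly a 20% / 80% split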

# Runs the k-fold Cross Validation based on number of folds and whether 
# or not the dataset is balanced
cv_nb_models <- function(tweets_dtm, num_folds, is_balanced){
  
  set.seed(1) # for replicability
  folds <- cv(tweets_dtm, fold = num_folds) # generates the train/test indices for each fold
  
  # Generates all k-folds for Training and Testing
  train_folds <- map(1:num_folds, 
                     function(fold) tweets_dtm[folds$train[[fold]], ])
  
  test_folds <- map(1:num_folds, 
                    function(fold) tweets_dtm[folds$test[[fold]], ])
  
  # Balances the training data for each fold if is_balanced is TRUE
  if(is_balanced)
    train_folds <- map(1:num_folds, function(fold) balanced_dataset(train_folds[[fold]]))
  
  # Creates a Naive Bayes model for each training fold
  nb_models <- map(1:num_folds, 
                   function(fold) generate_model(train_folds[[fold]]))
  
  # Evaluates each model per training fold and stores each Confusion Matrix
  cv_cfm <- map(1:num_folds, 
                function(fold) generate_cf(nb_models[[fold]], test_folds[[fold]]))
  cv_cfm
  
}

Performing the 5-fold Cross Validation for Balanced/Imbalanced Datasets

With the user-defined functions for training and testing in place, they are now used to run a 5-fold cross-validation on the given document-term matrix. To observe the effect of balancing, models are also trained on the imbalanced data and compared against the balanced version.

# Performs cross validation on 5-folds using imbalanced/balanced data
cv_results_imbalanced <- cv_nb_models(dtm_cleaned_tweets, 5, FALSE)
cv_results_balanced <- cv_nb_models(dtm_cleaned_tweets, 5, TRUE)

Selecting the Metrics to be Evaluated

The following metrics will be used to evaluate and compare the models trained on the balanced and imbalanced datasets: overall accuracy, and per-class precision, recall, and F1 score.
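
For reference, the per-class metrics are defined as follows (shown here with hypothetical counts for a single class; the project extracts them directly from caret's confusionMatrix() output):

# Definitions of the per-class metrics (hypothetical counts)
tp <- 80; fp <- 20; fn <- 10
precision <- tp / (tp + fp)                                  # correct among predicted positives
recall    <- tp / (tp + fn)                                  # correct among actual positives
f1        <- 2 * precision * recall / (precision + recall)   # harmonic mean of the two
c(precision = precision, recall = recall, f1 = f1)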

# Prints the final metrics to be evaluated
print_metrics <- function(results, num_folds){
  
  # Gets the overall mean accuracy of each fold 
  overall_mean_accuracy <- mean(map_dfr(1:num_folds, function(fold) results[[fold]][["overall"]])$Accuracy)
  
  # Gets the overall mean for each metric per fold and predicted class
  metrics_per_class <- rbind(
    "Precision" = colMeans(map_dfr(1:num_folds, function(fold) results[[fold]][["byClass"]][, "Precision"])),
    "Recall" = colMeans(map_dfr(1:num_folds, function(fold) results[[fold]][["byClass"]][, "Recall"])),
    "F1 Score" = colMeans(map_dfr(1:num_folds, function(fold) results[[fold]][["byClass"]][, "F1"]))
  )
  
  # Prints the final metrics 
  cat(
    "Folds: ", num_folds,
    "\nOverall accuracy: ", overall_mean_accuracy,
    "\n\nMetrics per class:\n"
  )
  print(metrics_per_class)
}

Evaluating the Metrics of the Test

Based on the results below, the accuracy of the model trained on balanced data (74%) dropped slightly compared to its imbalanced counterpart (76%), though the difference is small (about 2 percentage points).

As for predictive ability, the model trained on the balanced dataset showed improved recall for both the neutral and positive classes (by roughly 12 to 16 percentage points) compared with its imbalanced counterpart. This implies that, after balancing, the model correctly identifies neutral and positive tweets more often.

Precision and F1 score shifted as a result. Precision for the negative class improved (tweets predicted as negative are now more likely to actually be negative), but precision dropped for the neutral and positive classes. Consequently, the F1 score shows only a slight improvement for the neutral class and slight decreases for the negative and positive classes.

# Prints the final metrics for imbalanced/balanced data tests
print_metrics(cv_results_imbalanced, 5)
## Folds:  5 
## Overall accuracy:  0.7629781 
## 
## Metrics per class:
##           Class: negative Class: neutral Class: positive
## Precision       0.8122934      0.6033001       0.7087849
## Recall          0.8930993      0.4607156       0.6549134
## F1 Score        0.8507311      0.5222271       0.6804755
print_metrics(cv_results_balanced, 5)
## Folds:  5 
## Overall accuracy:  0.7428279 
## 
## Metrics per class:
##           Class: negative Class: neutral Class: positive
## Precision       0.8983471      0.5317622       0.5927512
## Recall          0.7745710      0.6217527       0.7790409
## F1 Score        0.8318369      0.5730642       0.6730304

Conclusion

The goal of this project was to implement sentiment analysis for predicting the sentiment of tweets about airline reviews. Various methods were introduced to clean, preprocess, and balance the data so that the models could be fit and evaluated reliably using cross-validation.

Results show that implementing the SMOTE algorithm improves the model's ability to identify the under-represented classes (neutral and positive) while maintaining a decent overall accuracy (74%). Using k-fold cross-validation also ensured that the reported metrics reflect performance across multiple train/test splits rather than a single partition.