Sentiment Analysis

Introduction

Source of Picture

In this study, we will perform sentiment analysis based on tweets from the consumer of major US airlines in February 2015. The consumer’s voice is important because it can describe the level of satisfaction with services and products offered by US airlines. The airlines included in this dataset are Virgin American, Delta, Southwest, United, US Airways and Virgin America. The Naive Bayes model will be used in this analysis.

Data Information

The data is collected from the Kaggle website [^1]. The data contains 14640 tweets related to services and products of US airlines. The target variable is airline_sentiment.

Data Preparation

Load libraries

Load all libraries used in this study

library(dbplyr)
library(e1071)
library(inspectdf)
library(tm)
library(wordcloud)
library(caret)
library(ggplot2)
library(wesanderson)
library(ROCR)
library(pROC)

Attach Dataset

Load dataset by using function read.csv()

# Read the dataset
us <- read.csv("Tweets.csv")

# Check dataset
rmarkdown::paged_table(tail(us))

Data Wrangling

Observe Dataset

There are 14640 observations and 15 variables. For feature selection, we require only 2 columns, which are airline_sentiment and text. Thus, we will change text column into factor.

str(us)
## 'data.frame':    14640 obs. of  15 variables:
##  $ tweet_id                    : num  5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
##  $ airline_sentiment           : chr  "neutral" "positive" "neutral" "negative" ...
##  $ airline_sentiment_confidence: num  1 0.349 0.684 1 1 ...
##  $ negativereason              : chr  "" "" "" "Bad Flight" ...
##  $ negativereason_confidence   : num  NA 0 NA 0.703 1 ...
##  $ airline                     : chr  "Virgin America" "Virgin America" "Virgin America" "Virgin America" ...
##  $ airline_sentiment_gold      : chr  "" "" "" "" ...
##  $ name                        : chr  "cairdin" "jnardino" "yvonnalynn" "jnardino" ...
##  $ negativereason_gold         : chr  "" "" "" "" ...
##  $ retweet_count               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ text                        : chr  "@VirginAmerica What @dhepburn said." "@VirginAmerica plus you've added commercials to the experience... tacky." "@VirginAmerica I didn't today... Must mean I need to take another trip!" "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces &amp; they hav"| __truncated__ ...
##  $ tweet_coord                 : chr  "" "" "" "" ...
##  $ tweet_created               : chr  "2015-02-24 11:35:52 -0800" "2015-02-24 11:15:59 -0800" "2015-02-24 11:15:48 -0800" "2015-02-24 11:15:36 -0800" ...
##  $ tweet_location              : chr  "" "" "Lets Play" "" ...
##  $ user_timezone               : chr  "Eastern Time (US & Canada)" "Pacific Time (US & Canada)" "Central Time (US & Canada)" "Pacific Time (US & Canada)" ...

Our target variable is airline_sentiment. We can check the class of this target by using the function unique(). The text has 3 levels of category, which are neutral, positive and negative.

unique(us$airline_sentiment)
## [1] "neutral"  "positive" "negative"

The proportion for each class in the target variable is not balanced. We will keep this proportion and observe the model performance.

rmarkdown::paged_table(as.data.frame(prop.table(table(us$airline_sentiment))))

Now, create the plot of sentiment classification for each airline by using ggplot

# Create dataframe containing the frequency of each sentiment based on each airline. 
airline <- as.data.frame(table(us$airline, us$airline_sentiment))
colnames(airline) <- c("Airline", "Sentiment", "Frequency")

# Create plot 
ggplot(airline, aes(x=Airline, y =Frequency, fill= Sentiment)) +
  geom_col()+
  theme_minimal()+
  labs(title = "Sentiment Classification Each Airline")+
  scale_fill_manual(values= wes_palette(10, name = "Darjeeling1", type="continuous"))+
  theme(plot.title = element_text(size= 14, color = 'black', face ='bold'),
            axis.title.x = element_text(size=12, color = 'black'),
            axis.title.y = element_text(size = 12, color = 'black'),
            axis.text.x = element_text(size = 12, color = 'black', vjust = 0.5),
            axis.text.y = element_text(size = 12, color = 'black'),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            axis.line = element_line(colour = "black"),
            legend.position = "bottom")

Insight: As shown in the plot, the proportion of negative sentiment overpowers the other sentiment class. It implies that most customer of major U.S airline is not satisfied with the services or products of the airline. The highest proportion of negative sentiment belongs to US Airways and United. While Virgin America has a balance proportion for the sentiment of each tweet.

Feature Selection

# Select only 2 columns and change column text airline_sentiment into factor
us <- us %>% 
  dplyr::select(airline_sentiment, text) %>% 
  dplyr::mutate(airline_sentiment=as.factor(airline_sentiment))

# Check the dataset
rmarkdown::paged_table(head(us))

Missing Value

We can observe missing value using function colSums() and is.na(). We don’t have missing value in dataset

colSums(is.na(us))
## airline_sentiment              text 
##                 0                 0

Text Cleaning

Text to Corpus

Corpus is the component from the document. In this case, one document is equal to one observation of the text. In one text, it can be one or more words. One package that can be utilized for text mining is tm package. We use the Vcorpus to change the vector text into the original corpus.

word_corpus <- VCorpus(VectorSource(us$text))

Let’s check observation no 10.

word_corpus[[10]]$content
## [1] "@VirginAmerica it was amazing, and arrived an hour early. You're too good to me."

Text cleaning

In the next code, we will replace the uppercase with lowercase by using the function tolower.

word_corpus <- word_corpus %>% 
  tm_map(content_transformer(tolower))

# Check observation no 10 
word_corpus[[10]]$content
## [1] "@virginamerica it was amazing, and arrived an hour early. you're too good to me."

In the next code, we will remove numbers, punctuation, white spaces and stopwords. Also, we will transform the word into its original form. We can use function tm_map().

# Cleaning the text
word_corpus <- word_corpus %>% 
  tm_map(removeWords, stopwords("english")) %>% 
  tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% 
  tm_map(stemDocument) %>% 
  tm_map(stripWhitespace) 

# Check the observation number 10 
word_corpus[[10]]$content
## [1] "virginamerica amaz arriv hour earli good"

Document-Term Matrix (DTM)

We need to transform the text data into Document-Term Matrix (DTM) through the tokenization process. This process split one text into several terms. In DTM, one word will be one predictor with the value containing the frequency of those words appearing in the document. We use function DocumentTermMatrix(). Then, we inspect the result by using the function inspect().

word_dtm <- DocumentTermMatrix(word_corpus)

# Inspect 
inspect(word_dtm)
## <<DocumentTermMatrix (documents: 14640, terms: 11280)>>
## Non-/sparse entries: 136258/165002942
## Sparsity           : 100%
## Maximal term length: 46
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   americanair can flight get hour jetblu southwestair thank unit usairway
##   1076           0   0      5   0    0      0            0     0    1        0
##   2363           0   0      0   0    0      0            0     0    1        0
##   2945           0   0      0   1    0      0            0     0    2        0
##   3698           0   0      2   0    0      0            0     0    1        0
##   376            0   0      2   0    0      0            0     0    0        0
##   3871           0   0      0   1    0      0            0     1    1        0
##   3995           0   0      0   0    0      0            0     0    1        0
##   4797           0   0      2   0    1      0            1     0    0        0
##   827            0   0      1   0    0      0            0     0    1        0
##   974            0   0      2   0    0      0            0     0    1        0

Cross-Validation

We split the train and test dataset with 75% and 25% proportions. We use function sample()

RNGkind(sample.kind = "Rounding")
set.seed(100)

# Split train and test dataset
intrain <- sample(x = nrow(word_dtm), nrow(word_dtm)*0.75)

# train-test splitting
word_train <- word_dtm[intrain,]
word_test <- word_dtm[-intrain,]

Then, we prepare for the label for train and test dataset

# Prepare label for train and test
label_train <- us[intrain, "airline_sentiment"]
label_test <- us[-intrain, "airline_sentiment"]

# Check proportion 
prop.table(table(label_train))
## label_train
##  negative   neutral  positive 
## 0.6286885 0.2131148 0.1581967

The proportion of the class target is not balanced. Let’s keep it and observe the model performance. Now, Let’s check the dimension of the train dataset

dim(word_train)
## [1] 10980 11280

We use the function findFreqTerms() to reduce the noise in our dataset by collecting the word that frequently appears. Thus, we set lowfreq = 20 to ensure just words with at least 20 times appearance are used in this dataset.

word_freq <- findFreqTerms(x = word_train, lowfreq=20)

Subset the train dataset only containing the word_freq. The output shows that only 644 predictors (words) have appeared more than 20 times.

# Subset the train dataset
word_train <- word_train[ ,word_freq]

# Check for dimension
dim(word_train)
## [1] 10980   753

Bernoulli Converter

Bernoulli Converter is used to transform the frequency value into condition ; 0 as not appear and 1 as appear

  • If frequency > 0, then the value is 1
  • If frequency = 0, then the value is 0
# Create Bernoulli converter function 
bernoulli_conv <- function(x){
  x <- as.factor(ifelse(x > 0, 1, 0)) 
  return(x)
}

# Apply the function to train dataset
word_train_bn <- apply(word_train, 
                       MARGIN =2, # apply function to column, thus the matrix keeps as DocumentTermMatrix
                       FUN = bernoulli_conv)

# Apply the function to test dataset
word_test_bn <- apply(word_test, 
                      MARGIN = 2, 
                      FUN= bernoulli_conv)

# Check the result 
word_train_bn[4:10, 1:5]
##        Terms
## Docs    “jetblu abl absolut accept access
##   826   "0"     "0" "0"     "0"    "0"   
##   6858  "0"     "0" "0"     "0"    "0"   
##   7080  "0"     "0" "0"     "0"    "0"   
##   11889 "0"     "0" "0"     "0"    "0"   
##   5419  "0"     "0" "0"     "0"    "0"   
##   7998  "0"     "0" "0"     "0"    "0"   
##   2492  "0"     "0" "0"     "0"    "0"

Naive Bayes

Naive Bayes is an algorithm of simple “probabilistic classifiers” based on Bayes’ theorem. The model considers independence assumptions between predictor variables [^2]. The model is easy to be used and fast to build a real-time prediction. It can handle both discrete and continuous target variables. However, the model suffers from skewness due to data scarcity.

Model Fitting

Create Naive Bayes model and assign it as model_naive

model_naive <- naiveBayes(x=word_train_bn,
                          y = label_train,
                          laplace=1) # to prevent from data scarcity

Model Prediction

Prediction can be performed by using function predict()

model_pred <- predict(object = model_naive,
                      newdata= word_test_bn,
                      type="class")

Model Evaluation

Confusion Matrix

The confusion matrix is used to describe the performance of a classification algorithm. Four metrics to evaluate classifiers are Accuracy, Sensitivity, Specificity and Precision. In this case, we will use accuracy for our performance metrics. Accuracy is defined as the ratio of correctly predicted cases by the total cases.

confusionMatrix(data = as.factor(model_pred), 
                reference= as.factor(label_test),
                positive="negative")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative neutral positive
##   negative     1776     163       68
##   neutral       345     504       91
##   positive      154      92      467
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7505          
##                  95% CI : (0.7362, 0.7645)
##     No Information Rate : 0.6216          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5643          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                   0.7807         0.6640          0.7460
## Specificity                   0.8332         0.8497          0.9189
## Pos Pred Value                0.8849         0.5362          0.6550
## Neg Pred Value                0.6981         0.9062          0.9460
## Prevalence                    0.6216         0.2074          0.1710
## Detection Rate                0.4852         0.1377          0.1276
## Detection Prevalence          0.5484         0.2568          0.1948
## Balanced Accuracy             0.8069         0.7569          0.8325

Insight: In this case, we are only interested in accuracy as a performance metric. The output shows that the accuracy of the model is 0.7432, which is justified as an apppropriate model.

ROC and AUC

The empirical ROC curve is a probability curve showing the true positive rate (sensitivity) versus the false positive rate (1 - specificity) for all possible cut-off values [^3]. The AUC (Area Under Curve) measures how much the model is capable to distinguish between classes. The higher the AUC value, the better the model is at distinguishing positive and negative classes.

# Create a probability prediction from `model_naive`
prob_naive <- predict(object = model_naive, 
                      newdata= word_test_bn, 
                      type="raw")

# Check the prob_naive
head(prob_naive)
##         negative    neutral  positive
## [1,] 0.086772852 0.63364012 0.2795870
## [2,] 0.038372158 0.02292477 0.9387031
## [3,] 0.196390932 0.27562676 0.5279823
## [4,] 0.028351975 0.58168238 0.3899656
## [5,] 0.008846767 0.31145431 0.6796989
## [6,] 0.088155208 0.66783561 0.2440092
# Create a prob and label from prob_naive. 
roc_naive <- data.frame(prob=prob_naive[,1],
                       label=as.numeric(label_test=="negative"))

# Create an object prediction 
prediction_roc_naive <- prediction(predictions = roc_naive$prob, 
                                   labels = roc_naive$label) 
# Create an ROC plot 
plot(ROCR::performance(prediction.obj = prediction_roc_naive, 
                 measure = "tpr", 
                 x.measure = "fpr"), main = "ROC Naive Bayes", col="#519259")
abline(a = 0, b = 1)

# Obtain the AUC 
auc_naive  <- ROCR::performance(prediction.obj=prediction_roc_naive, measure = "auc")
auc_naive@y.values[[1]]
## [1] 0.890021

Insight: The AUC of the model is 0.8839, implying that the model can distinguish between positive and negative classes.

Conclusion

Even though the target variable is not balanced, the performance metrics of accuracy show that the model is quite appropriate for predicting the airline sentiment of US major airline customers. The ROC value also shows that the model can distinguish the positive and negative classes very well.

References