Natural Language Processing (NLP) refers to the family of Artificial Intelligence techniques developed to ease human-to-machine communication. In this post we introduce a predictive approach: predicting sentiment from opinions collected on the well-known social network Twitter. As usual, let’s load some libraries (if a package is not installed, type install.packages("package_name")).
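Based on the functions used throughout this post, the following set of packages should be enough (a minimal sketch; your session may need one or two more):
library(tm)         # VCorpus, DocumentTermMatrix, removeSparseTerms
library(NLP)        # words() and ngrams() for the tokenizer
library(stringr)    # str_sub, str_squish
library(dplyr)      # pipes and data manipulation
library(ggplot2)    # plots
library(stopwords)  # Spanish stopword list
library(wordcloud)  # word cloud (assumed for the figure below)
library(caret)      # createDataPartition, confusionMatrix
library(xgboost)    # gradient boosting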
For this post we’ll use a previously collected dataset of Twitter opinions (in Spanish), labeled as -1: Negative, 0: Neutral, 1: Positive. It’s available for download here. Let’s have a look at the data.
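As a minimal sketch of that first look (the file name here is hypothetical; the dataset has the columns fecha, texto and opinion used below):
# Read the labeled tweets and check the class distribution (-1, 0, 1)
data_tweets <- read.csv("tweets_labeled.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
table(data_tweets$opinion) %>% as.data.frame()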
Negative opinions are the most common class.
## Var1 Freq
## 1 -1 260
## 2 0 177
## 3 1 239
# Parse the date (dd/mm/yyyy) and plot opinion counts over time, colored by label
data_tweets$fecha <- as.Date(str_squish(str_sub(data_tweets$fecha, 1, 10)), format = "%d/%m/%Y")
ggplot(data_tweets, aes(x = fecha, fill = as.factor(opinion))) +
  geom_histogram(aes(y = stat(count))) + guides(size = FALSE) +
  labs(x = "Date", fill = "Opinion") + theme_gray()
Opinions contain special characters and misspellings; it’s possible to partly correct this with a set of rules. For instance, replace accented characters like “á” with “a”, or cap word length at a given number of characters (this helps remove long hashtags on Twitter). We’ve previously written a function to address this problem; it’s available on GitHub: https://github.com/statscol/clean_tweets/blob/master/clean_tweets.R
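A minimal sketch of those two rules (not the full function from the repository; the 20-character cap is an arbitrary choice here):
# Strip accents and drop words longer than 20 characters (e.g. long hashtags)
clean_text_basic <- function(x, max_len = 20){
  x <- chartr("áéíóúÁÉÍÓÚ", "aeiouAEIOU", x)            # á -> a, é -> e, ...
  tokens <- unlist(strsplit(x, "\\s+"))
  paste(tokens[nchar(tokens) <= max_len], collapse = " ")
}
data_tweets$texto <- sapply(data_tweets$texto, clean_text_basic)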
Let’s look at the wordcloud.
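A possible way to build it, assuming the wordcloud package and the Spanish stopword list (the exact plot may have been produced differently):
# Word frequencies from the cleaned text, with Spanish stopwords and very short tokens removed
all_words <- tolower(unlist(strsplit(data_tweets$texto, "\\s+")))
all_words <- all_words[!all_words %in% stopwords::stopwords("es") & nchar(all_words) > 2]
word_freq <- sort(table(all_words), decreasing = TRUE)
set.seed(1)
wordcloud(names(word_freq), as.numeric(word_freq),
          max.words = 100, colors = brewer.pal(8, "Dark2"))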
Using DocumentTermMatrix it’s possible to split every phrase into single words or word n-grams (here, unigrams and bigrams) and remove stopwords. For that we need to define a tokenizer which will be applied to our text corpus.
# Build a corpus from the cleaned tweet text
corpus_opinions <- VCorpus(VectorSource(data_tweets$texto))

# Tokenizer: unigrams and bigrams, with bigram words joined by "_"
NLP_tokenizer <- function(x){
  unlist(lapply(ngrams(words(x), 1:2), paste, collapse = "_"), use.names = FALSE)
}

# Term-matrix options: keep punctuation and case, drop numbers and Spanish stopwords,
# no stemming, raw term frequency as the weighting scheme
control_list_ngram <- list(tokenize = NLP_tokenizer,
                           removePunctuation = FALSE,
                           removeNumbers = TRUE,
                           stopwords = stopwords::stopwords("es"),
                           tolower = FALSE,
                           stemming = FALSE,
                           weighting = weightTf)

tdm_td <- DocumentTermMatrix(corpus_opinions, control = control_list_ngram)
data_mod <- as.matrix(tdm_td) %>% as.data.frame()
Once our DocumentTermMatrix is complete, we can proceed to build a model. However, there are a few things to consider: the matrix has far more terms than opinions and is extremely sparse, since most terms appear in only a handful of tweets. This can be addressed by removing sparse terms, that is, dropping words that are not frequent enough. The removeSparseTerms function does that for us: it receives a DocumentTermMatrix object and a sparsity threshold, and keeps only the terms present in more than \((1-\text{sparse})\times 100\%\) of opinions.
We can see the effect of removing sparse terms in the dimensions of the dataset we will use to train a model next.
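A sketch of that step; the 0.99 threshold below is an assumption (the value used for this post may differ) and keeps only terms present in more than 1% of opinions:
# Drop very rare terms and rebuild the modelling data.frame
tdm_small <- removeSparseTerms(tdm_td, sparse = 0.99)   # assumed threshold
data_mod2 <- as.matrix(tdm_small) %>% as.data.frame()
# Compare dimensions before and after removal
data.frame(original = dim(tdm_td), after_removal = dim(tdm_small),
           row.names = c("rows", "terms"))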
## original after_removal
## rows 676 676
## terms 15700 382
Now we can proceed to train a model. Since our predictors are simple term counts (mostly zeros and ones), XGBoost could be a good option, but feel free to use LightGBM or CatBoost if you prefer. First, let’s split our data: 10% for validation, and the remaining 90% split into 70% for training and 30% for testing. We also need to recode the opinion values, since xgboost requires class labels to be integers greater than or equal to 0.
# Recode labels from (-1, 0, 1) to (0, 1, 2) as xgboost expects
data_tweets$opinion <- with(data_tweets, factor(opinion, labels = c(0, 1, 2)))
data_mod2$target <- data_tweets$opinion

# Hold out 10% of the data for validation
valid_rows <- createDataPartition(y = data_mod2$target, p = 0.1)$Resample1
valid_data <- data_mod2[valid_rows, ]
train_test_data <- data_mod2[-valid_rows, ]

# Split the remaining 90% into 70% training / 30% testing
sample_train_test <- createDataPartition(y = train_test_data$target, p = 0.7)
trainx <- as.matrix(train_test_data[sample_train_test$Resample1, -ncol(train_test_data)])
trainy <- train_test_data$target[sample_train_test$Resample1] %>% as.character() %>% as.numeric()
testx <- as.matrix(train_test_data[-sample_train_test$Resample1, -ncol(train_test_data)])
testy <- train_test_data$target[-sample_train_test$Resample1] %>% as.character() %>% as.numeric()
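As a quick sanity check (not in the original workflow), we can confirm each partition keeps roughly the same class proportions:
# Class counts per partition after the stratified splits
list(train = table(trainy), test = table(testy), validation = table(valid_data$target))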
Now let’s fit the model. Here we show how to tune one of the parameters (nrounds) using 5-fold CV, but max.depth, min_child_weight and regularization parameters like \(\lambda\) or \(\alpha\) should be tuned over a grid. We initialize the learning rate (eta) at 0.1 and set max.depth to the square root of the number of variables.
xgb.train <- xgb.DMatrix(data = trainx, label = trainy)
xgb.test <- xgb.DMatrix(data = testx, label = testy)
watchlist <- list(train = xgb.train, test = xgb.test)
# Multiclass softmax objective with 3 classes, evaluated with the multiclass log-loss
params <- list(booster = "gbtree", objective = "multi:softmax", num_class = 3,
               eval_metric = "mlogloss", max.depth = floor(sqrt(ncol(data_mod2))))
# 5-fold cross-validation over 100 boosting rounds to choose nrounds
xgbcv <- xgb.cv(params = params, data = xgb.train, nrounds = 100, nfold = 5, showsd = TRUE,
                stratified = TRUE, print_every_n = 10, early_stop_round = 5,
                maximize = FALSE, prediction = TRUE)
## [1] train-mlogloss:0.893321+0.004170 test-mlogloss:0.964936+0.015120
## [11] train-mlogloss:0.281110+0.008540 test-mlogloss:0.703221+0.043394
## [21] train-mlogloss:0.164600+0.005343 test-mlogloss:0.737192+0.047125
## [31] train-mlogloss:0.127583+0.004660 test-mlogloss:0.784956+0.052911
## [41] train-mlogloss:0.111035+0.004637 test-mlogloss:0.816444+0.062374
## [51] train-mlogloss:0.101702+0.004237 test-mlogloss:0.835672+0.072683
## [61] train-mlogloss:0.095395+0.004092 test-mlogloss:0.853016+0.075564
## [71] train-mlogloss:0.091024+0.004021 test-mlogloss:0.862122+0.083004
## [81] train-mlogloss:0.087766+0.003951 test-mlogloss:0.875376+0.084993
## [91] train-mlogloss:0.085120+0.003882 test-mlogloss:0.884724+0.086366
## [100] train-mlogloss:0.083316+0.003744 test-mlogloss:0.890405+0.088559
# Use the number of rounds that minimizes the cross-validated test log-loss
round.max <- which.min(xgbcv$evaluation_log$test_mlogloss_mean)
modxgb <- xgb.train(data = xgb.train, max.depth = floor(sqrt(ncol(data_mod2))), eta = 0.1, nthread = 4,
                    nrounds = round.max, eval_metric = "merror", num_class = 3, watchlist = watchlist, objective = "multi:softmax")
## [1] train-merror:0.138173 test-merror:0.397790
## [2] train-merror:0.119438 test-merror:0.392265
## [3] train-merror:0.114754 test-merror:0.375691
## [4] train-merror:0.105386 test-merror:0.364641
## [5] train-merror:0.105386 test-merror:0.353591
## [6] train-merror:0.098361 test-merror:0.364641
## [7] train-merror:0.098361 test-merror:0.370166
## [8] train-merror:0.096019 test-merror:0.386740
## [9] train-merror:0.088993 test-merror:0.381215
## [10] train-merror:0.077283 test-merror:0.375691
## [11] train-merror:0.070258 test-merror:0.359116
Our base model has a much higher accuracy on the training set than on the test set. We’re clearly overfitting.
# Predictions and confusion matrices for the training and test sets
predxgb_train <- as.factor(predict(modxgb, trainx))
predxgb_test <- as.factor(predict(modxgb, testx))
confusionMatrix(predxgb_train, as.factor(trainy))
confusionMatrix(predxgb_test, as.factor(testy))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 158 13 4
## 1 3 96 4
## 2 3 3 143
##
## Overall Statistics
##
## Accuracy : 0.9297
## 95% CI : (0.9012, 0.9521)
## No Information Rate : 0.3841
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8929
##
## Mcnemar's Test P-Value : 0.08826
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9634 0.8571 0.9470
## Specificity 0.9354 0.9778 0.9783
## Pos Pred Value 0.9029 0.9320 0.9597
## Neg Pred Value 0.9762 0.9506 0.9712
## Prevalence 0.3841 0.2623 0.3536
## Detection Rate 0.3700 0.2248 0.3349
## Detection Prevalence 0.4098 0.2412 0.3489
## Balanced Accuracy 0.9494 0.9175 0.9626
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 45 24 5
## 1 21 19 7
## 2 4 4 52
##
## Overall Statistics
##
## Accuracy : 0.6409
## 95% CI : (0.5664, 0.7107)
## No Information Rate : 0.3867
## P-Value [Acc > NIR] : 4.322e-12
##
## Kappa : 0.4536
##
## Mcnemar's Test P-Value : 0.77
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.6429 0.4043 0.8125
## Specificity 0.7387 0.7910 0.9316
## Pos Pred Value 0.6081 0.4043 0.8667
## Neg Pred Value 0.7664 0.7910 0.9008
## Prevalence 0.3867 0.2597 0.3536
## Detection Rate 0.2486 0.1050 0.2873
## Detection Prevalence 0.4088 0.2597 0.3315
## Balanced Accuracy 0.6908 0.5977 0.8721
It seems our model didn’t learn all that well: accuracy is just above 60% on the held-out data. How about using SMOTE or ADASYN to balance the dataset? Or a spelling-correction model like the one provided by the hunspell package? Also remember that we didn’t run a grid search for our XGBoost model.
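For reference, a minimal sketch of how hunspell could be used for spelling suggestions, assuming a Spanish dictionary (es_ES) is installed on the system; this is not part of the pipeline fitted above:
library(hunspell)
es_dict <- dictionary("es_ES")                          # requires the Spanish dictionary
words_to_check <- c("opinion", "pelicula", "exelente")  # hypothetical examples
ok <- hunspell_check(words_to_check, dict = es_dict)
# First suggestion for each word flagged as misspelled
suggestions <- hunspell_suggest(words_to_check[!ok], dict = es_dict)
sapply(suggestions, `[`, 1)
Back to our model: let’s see how it does on the 10% validation split we held out at the start.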
# Performance on the validation set
predxgb_valid <- as.factor(predict(modxgb, as.matrix(valid_data[, -ncol(valid_data)])))
confusionMatrix(predxgb_valid, as.factor(valid_data$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 15 5 4
## 1 11 12 2
## 2 0 1 18
##
## Overall Statistics
##
## Accuracy : 0.6618
## 95% CI : (0.5368, 0.7721)
## No Information Rate : 0.3824
## P-Value [Acc > NIR] : 2.87e-06
##
## Kappa : 0.4945
##
## Mcnemar's Test P-Value : 0.08643
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.5769 0.6667 0.7500
## Specificity 0.7857 0.7400 0.9773
## Pos Pred Value 0.6250 0.4800 0.9474
## Neg Pred Value 0.7500 0.8605 0.8776
## Prevalence 0.3824 0.2647 0.3529
## Detection Rate 0.2206 0.1765 0.2647
## Detection Prevalence 0.3529 0.3676 0.2794
## Balanced Accuracy 0.6813 0.7033 0.8636
It’s possible to predict opinions from raw text; however, a fair amount of feature engineering and model tuning is needed to do it well.