Natural Language Processing

Natural Language Processing (NLP) refers to Artificial Intelligence tools developed to ease human-to-machine communication. In this post we introduce a predictive approach: predicting sentiment from opinions collected on the well-known social network Twitter. As usual, let's load some libraries (if one is not installed, run install.packages("package_not_installed")).
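A minimal setup might look like the following sketch; the exact package list is an assumption based on the tools used later in this post.

library(tm)         # text corpora and DocumentTermMatrix
library(wordcloud)  # word cloud visualization
library(xgboost)    # gradient boosting
library(caret)      # data splitting and confusion matrices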

Loading Data

For this post we'll use a previously collected dataset of Twitter opinions (in Spanish), labeled as -1: Negative, 0: Neutral, 1: Positive. It's available for download here. Let's have a look at the data.
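A sketch of the loading step; the file name opiniones_twitter.csv and the column names (text, sentiment) are hypothetical, so adjust them to match the downloaded file.

# Hypothetical file and column names; adjust to the downloaded dataset
tweets <- read.csv("opiniones_twitter.csv", stringsAsFactors = FALSE,
                   encoding = "UTF-8")
head(tweets)  # expected columns: text, sentiment (-1, 0, 1)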

Exploring & Cleaning Data

Negative opinions are the most common class.
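The frequency table below can be reproduced as follows, assuming the label column is called sentiment:

as.data.frame(table(tweets$sentiment))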

##   Var1 Freq
## 1   -1  260
## 2    0  177
## 3    1  239

Opinions contain special characters and bad spelling; it's possible to partially correct this with a set of rules. For instance, replace accented characters like "á" with plain "a", or cap word length at a fixed number of characters (this helps remove long Twitter hashtags). We've previously written a function to address this problem; it's available on GitHub: https://github.com/statscol/clean_tweets/blob/master/clean_tweets.R
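A simplified sketch of such rules is shown below; it is not the exact implementation from the repository, just an illustration of the two rules mentioned above.

# Illustrative cleaning rules; see clean_tweets.R on GitHub for the full version
clean_text <- function(x, max_word_len = 15) {
  x <- tolower(x)
  # Replace accented vowels with their plain counterparts
  x <- chartr("áéíóúü", "aeiouu", x)
  # Drop words longer than max_word_len characters (e.g., long hashtags)
  x <- gsub(sprintf("\\b\\w{%d,}\\b", max_word_len + 1), "", x, perl = TRUE)
  # Collapse repeated whitespace
  gsub("\\s+", " ", trimws(x))
}

tweets$text <- clean_text(tweets$text)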

Let's look at the word cloud.
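A quick way to draw it (the stopword removal step and max.words = 100 are assumptions):

corpus <- VCorpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
wordcloud(corpus, max.words = 100, random.order = FALSE)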

Preparing Data for Prediction

Using DocumentTermMatrix, it's possible to split every phrase into single words or groups of words (e.g., bigrams) and remove stopwords. For that we need to define a tokenizer, which will be applied to our text corpus.
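A sketch using the bigram tokenizer from the tm FAQ; words() and ngrams() come from the NLP package, which tm loads automatically. Whether the final matrix used unigrams or bigrams isn't stated, so the bigram case is shown for illustration.

# Canonical bigram tokenizer from the tm FAQ
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

dtm <- DocumentTermMatrix(
  corpus,
  control = list(tokenize = BigramTokenizer,
                 removePunctuation = TRUE,
                 stopwords = stopwords("spanish"))
)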

Once our DocumentTermMatrix is complete, we can proceed to build a model. However, there are a few things to consider:

  • Is the DocumentTermMatrix too large?
  • Do we need all words?

Both issues can be addressed by removing sparse terms, that is, terms that are not frequent enough across documents. The removeSparseTerms function does that for us: it takes a DocumentTermMatrix and a sparsity threshold sparse, and drops every term that appears in fewer than \((1-\text{sparse})\times 100\%\) of the opinions.

We can see the effect of removing sparse terms in the dimensions of the dataset we'll use to train a model next.
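A sketch of both steps; the sparsity threshold of 0.99 is an assumption, as the exact value used isn't stated.

# Keep only terms appearing in at least (1 - 0.99) = 1% of opinions
dtm_small <- removeSparseTerms(dtm, sparse = 0.99)

# Compare dimensions before and after removal
dims <- cbind(original = dim(dtm), after_removal = dim(dtm_small))
rownames(dims) <- c("rows", "terms")
dims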

##       original after_removal
## rows       676           676
## terms    15700           382

Training a Model: XGBoost

Now we can proceed to train a model. Since this time we'll only use binary predictors, XGBoost is a good option, but feel free to use LightGBM or CatBoost if you prefer. First, let's split our data: 10% for validation, and the remaining 90% split into 70% for training and 30% for testing. We also need to relabel the opinion values, since xgboost requires class labels to be integers greater than or equal to 0.
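A sketch of the split using caret's createDataPartition; the seed, the binarization step, and the exact calls are assumptions consistent with the description above.

set.seed(42)                     # assumed seed
m <- as.matrix(dtm_small)
m <- (m > 0) * 1                 # binarize counts into presence/absence predictors
labels <- tweets$sentiment + 1   # relabel {-1, 0, 1} as {0, 1, 2} for xgboost

val_idx <- createDataPartition(factor(labels), p = 0.10, list = FALSE)
x_val  <- m[val_idx, ];  y_val  <- labels[val_idx]
x_rest <- m[-val_idx, ]; y_rest <- labels[-val_idx]

train_idx <- createDataPartition(factor(y_rest), p = 0.70, list = FALSE)
x_train <- x_rest[train_idx, ];  y_train <- y_rest[train_idx]
x_test  <- x_rest[-train_idx, ]; y_test  <- y_rest[-train_idx]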

Now let's fit the model. Here we show how to tune one of the parameters (nrounds) using 5-fold CV; max_depth, min_child_weight, and regularization parameters like \(\lambda\) or \(\alpha\) should be tuned over a grid. We initialize the learning rate (eta) at 0.1 and set max_depth to the square root of the number of variables.
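The calls below are a sketch consistent with the output that follows: xgb.cv tunes nrounds, and which.min over the CV log recovers the best iteration (11 here), which is then used to refit while tracking the classification error.

params <- list(objective = "multi:softmax", num_class = 3,
               eta = 0.1, max_depth = floor(sqrt(ncol(x_train))),
               eval_metric = "mlogloss")

dtrain <- xgb.DMatrix(x_train, label = y_train)
cv <- xgb.cv(params = params, data = dtrain, nrounds = 100,
             nfold = 5, print_every_n = 10)

# Refit at the best iteration, tracking the classification error instead
best_n <- which.min(cv$evaluation_log$test_mlogloss_mean)
watch <- list(train = dtrain, test = xgb.DMatrix(x_test, label = y_test))
model <- xgb.train(params = modifyList(params, list(eval_metric = "merror")),
                   data = dtrain, nrounds = best_n, watchlist = watch)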

## [1]  train-mlogloss:0.893321+0.004170    test-mlogloss:0.964936+0.015120 
## [11] train-mlogloss:0.281110+0.008540    test-mlogloss:0.703221+0.043394 
## [21] train-mlogloss:0.164600+0.005343    test-mlogloss:0.737192+0.047125 
## [31] train-mlogloss:0.127583+0.004660    test-mlogloss:0.784956+0.052911 
## [41] train-mlogloss:0.111035+0.004637    test-mlogloss:0.816444+0.062374 
## [51] train-mlogloss:0.101702+0.004237    test-mlogloss:0.835672+0.072683 
## [61] train-mlogloss:0.095395+0.004092    test-mlogloss:0.853016+0.075564 
## [71] train-mlogloss:0.091024+0.004021    test-mlogloss:0.862122+0.083004 
## [81] train-mlogloss:0.087766+0.003951    test-mlogloss:0.875376+0.084993 
## [91] train-mlogloss:0.085120+0.003882    test-mlogloss:0.884724+0.086366 
## [100]    train-mlogloss:0.083316+0.003744    test-mlogloss:0.890405+0.088559
## [1]  train-merror:0.138173   test-merror:0.397790 
## [2]  train-merror:0.119438   test-merror:0.392265 
## [3]  train-merror:0.114754   test-merror:0.375691 
## [4]  train-merror:0.105386   test-merror:0.364641 
## [5]  train-merror:0.105386   test-merror:0.353591 
## [6]  train-merror:0.098361   test-merror:0.364641 
## [7]  train-merror:0.098361   test-merror:0.370166 
## [8]  train-merror:0.096019   test-merror:0.386740 
## [9]  train-merror:0.088993   test-merror:0.381215 
## [10] train-merror:0.077283   test-merror:0.375691 
## [11] train-merror:0.070258   test-merror:0.359116

Model Evaluation

Our base model has much higher accuracy on the training set (about 93%) than on the testing set (about 64%); we're clearly overfitting. Below are the confusion matrices, first for the training set, then for the testing set.
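The matrices below come from caret's confusionMatrix; with objective multi:softmax, predict returns class indices directly.

pred_train <- predict(model, x_train)
pred_test  <- predict(model, x_test)

confusionMatrix(factor(pred_train, levels = 0:2), factor(y_train, levels = 0:2))
confusionMatrix(factor(pred_test,  levels = 0:2), factor(y_test,  levels = 0:2))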

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0 158  13   4
##          1   3  96   4
##          2   3   3 143
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9297          
##                  95% CI : (0.9012, 0.9521)
##     No Information Rate : 0.3841          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8929          
##                                           
##  Mcnemar's Test P-Value : 0.08826         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.9634   0.8571   0.9470
## Specificity            0.9354   0.9778   0.9783
## Pos Pred Value         0.9029   0.9320   0.9597
## Neg Pred Value         0.9762   0.9506   0.9712
## Prevalence             0.3841   0.2623   0.3536
## Detection Rate         0.3700   0.2248   0.3349
## Detection Prevalence   0.4098   0.2412   0.3489
## Balanced Accuracy      0.9494   0.9175   0.9626
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1  2
##          0 45 24  5
##          1 21 19  7
##          2  4  4 52
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6409          
##                  95% CI : (0.5664, 0.7107)
##     No Information Rate : 0.3867          
##     P-Value [Acc > NIR] : 4.322e-12       
##                                           
##                   Kappa : 0.4536          
##                                           
##  Mcnemar's Test P-Value : 0.77            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.6429   0.4043   0.8125
## Specificity            0.7387   0.7910   0.9316
## Pos Pred Value         0.6081   0.4043   0.8667
## Neg Pred Value         0.7664   0.7910   0.9008
## Prevalence             0.3867   0.2597   0.3536
## Detection Rate         0.2486   0.1050   0.2873
## Detection Prevalence   0.4088   0.2597   0.3315
## Balanced Accuracy      0.6908   0.5977   0.8721

It seems our model didn't learn all that well, scoring just above 60% on held-out data. How about using SMOTE or ADASYN to balance the dataset? Or a spelling-correction step such as the one provided by the hunspell package? (A sketch of that idea follows the confusion matrix below.) Also remember that we didn't run a grid search for our XGBoost model. The confusion matrix on the 10% validation split tells a similar story:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1  2
##          0 15  5  4
##          1 11 12  2
##          2  0  1 18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6618          
##                  95% CI : (0.5368, 0.7721)
##     No Information Rate : 0.3824          
##     P-Value [Acc > NIR] : 2.87e-06        
##                                           
##                   Kappa : 0.4945          
##                                           
##  Mcnemar's Test P-Value : 0.08643         
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.5769   0.6667   0.7500
## Specificity            0.7857   0.7400   0.9773
## Pos Pred Value         0.6250   0.4800   0.9474
## Neg Pred Value         0.7500   0.8605   0.8776
## Prevalence             0.3824   0.2647   0.3529
## Detection Rate         0.2206   0.1765   0.2647
## Detection Prevalence   0.3529   0.3676   0.2794
## Balanced Accuracy      0.6813   0.7033   0.8636
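As promised, here is a rough sketch of the hunspell idea; the Spanish dictionary "es_ES" is an assumption and may need to be installed separately on your system.

library(hunspell)

correct_spelling <- function(text, dict = dictionary("es_ES")) {
  words <- unlist(strsplit(text, "\\s+"))
  bad <- !hunspell_check(words, dict = dict)
  # Replace each misspelled word with hunspell's first suggestion, if any
  sugg <- hunspell_suggest(words[bad], dict = dict)
  words[bad] <- vapply(seq_along(sugg), function(i) {
    if (length(sugg[[i]]) > 0) sugg[[i]][1] else words[bad][i]
  }, character(1))
  paste(words, collapse = " ")
}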

Final Comments

It's possible to predict sentiment from raw text; however, getting good results requires a substantial amount of feature engineering and model tuning.