Natural Language Processing (NLP) refers to the family of Artificial Intelligence techniques developed to ease human-to-machine communication. In this post we introduce a predictive approach: predicting sentiment from opinions collected on the well-known social network Twitter. As usual, let’s load some libraries (if a package is not installed, type install.packages("package_name")).
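Based on the functions used throughout this post, the following set of packages should be enough (a minimal sketch; your session may need one or two more):
library(tm)         # VCorpus, DocumentTermMatrix, removeSparseTerms
library(NLP)        # words() and ngrams() for the tokenizer
library(stringr)    # str_sub, str_squish
library(dplyr)      # pipes and data manipulation
library(ggplot2)    # plots
library(stopwords)  # Spanish stopword list
library(wordcloud)  # word cloud (assumed for the figure below)
library(caret)      # createDataPartition, confusionMatrix
library(xgboost)    # gradient boosting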
For this post we’ll use a previously collected dataset of Twitter opinions (in Spanish), labeled as -1: Negative, 0: Neutral, 1: Positive. It’s available for download here. Let’s have a look at the data.
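As a minimal sketch of that first look (the file name here is hypothetical; the dataset has the columns fecha, texto and opinion used below):
# Read the labeled tweets and check the class distribution (-1, 0, 1)
data_tweets <- read.csv("tweets_labeled.csv", stringsAsFactors = FALSE, encoding = "UTF-8")
table(data_tweets$opinion) %>% as.data.frame()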
Negative opinions are the most common class.
## Var1 Freq
## 1 -1 260
## 2 0 177
## 3 1 239
# Parse the date (dd/mm/yyyy) and plot opinion counts over time, colored by label
data_tweets$fecha <- as.Date(str_squish(str_sub(data_tweets$fecha, 1, 10)), format = "%d/%m/%Y")
ggplot(data_tweets, aes(x = fecha, fill = as.factor(opinion))) +
  geom_histogram(aes(y = stat(count))) + guides(size = FALSE) +
  labs(x = "Date", fill = "Opinion") + theme_gray()
Opinions contain special characters and misspellings; it’s possible to partly correct this with a set of rules. For instance, replace accented characters like “á” with “a”, or cap word length at a given number of characters (this helps remove long hashtags on Twitter). We’ve previously written a function to address this problem; it’s available on GitHub: https://github.com/statscol/clean_tweets/blob/master/clean_tweets.R
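A minimal sketch of those two rules (not the full function from the repository; the 20-character cap is an arbitrary choice here):
# Strip accents and drop words longer than 20 characters (e.g. long hashtags)
clean_text_basic <- function(x, max_len = 20){
  x <- chartr("áéíóúÁÉÍÓÚ", "aeiouAEIOU", x)            # á -> a, é -> e, ...
  tokens <- unlist(strsplit(x, "\\s+"))
  paste(tokens[nchar(tokens) <= max_len], collapse = " ")
}
data_tweets$texto <- sapply(data_tweets$texto, clean_text_basic)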
Let’s look at the wordcloud.
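A possible way to build it, assuming the wordcloud package and the Spanish stopword list (the exact plot may have been produced differently):
# Word frequencies from the cleaned text, with Spanish stopwords and very short tokens removed
all_words <- tolower(unlist(strsplit(data_tweets$texto, "\\s+")))
all_words <- all_words[!all_words %in% stopwords::stopwords("es") & nchar(all_words) > 2]
word_freq <- sort(table(all_words), decreasing = TRUE)
set.seed(1)
wordcloud(names(word_freq), as.numeric(word_freq),
          max.words = 100, colors = brewer.pal(8, "Dark2"))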
Using DocumentTermMatrix it’s possible to split every phrase into single words or word n-grams (here, unigrams and bigrams) and remove stopwords. For that we need to define a tokenizer which will be applied to our text corpus.
# Build a corpus from the cleaned tweet text
corpus_opinions <- VCorpus(VectorSource(data_tweets$texto))

# Tokenizer: unigrams and bigrams, with bigram words joined by "_"
NLP_tokenizer <- function(x){
  unlist(lapply(ngrams(words(x), 1:2), paste, collapse = "_"), use.names = FALSE)
}

# Term-matrix options: keep punctuation and case, drop numbers and Spanish stopwords,
# no stemming, raw term frequency as the weighting scheme
control_list_ngram <- list(tokenize = NLP_tokenizer,
                           removePunctuation = FALSE,
                           removeNumbers = TRUE,
                           stopwords = stopwords::stopwords("es"),
                           tolower = FALSE,
                           stemming = FALSE,
                           weighting = weightTf)

tdm_td <- DocumentTermMatrix(corpus_opinions, control = control_list_ngram)
data_mod <- as.matrix(tdm_td) %>% as.data.frame()
Once our DocumentTermMatrix is complete, we can proceed to build a model. However, there are a few things to consider: the matrix has far more terms than opinions and is extremely sparse, since most terms appear in only a handful of tweets. This can be addressed by removing sparse terms, that is, dropping words that are not frequent enough. The removeSparseTerms function does that for us: it receives a DocumentTermMatrix object and a sparsity threshold, and keeps only the terms present in more than \((1-\text{sparse})\times 100\%\) of opinions.
We can see the effect of removing sparse terms in the dimensions of the dataset we will use to train a model next.
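A sketch of that step; the 0.99 threshold below is an assumption (the value used for this post may differ) and keeps only terms present in more than 1% of opinions:
# Drop very rare terms and rebuild the modelling data.frame
tdm_small <- removeSparseTerms(tdm_td, sparse = 0.99)   # assumed threshold
data_mod2 <- as.matrix(tdm_small) %>% as.data.frame()
# Compare dimensions before and after removal
data.frame(original = dim(tdm_td), after_removal = dim(tdm_small),
           row.names = c("rows", "terms"))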
## original after_removal
## rows 676 676
## terms 15700 382
Now we can proceed to train a model. Since our predictors are simple term counts (mostly zeros and ones), XGBoost could be a good option, but feel free to use LightGBM or CatBoost if you prefer. First, let’s split our data: 10% for validation, and the remaining 90% split into 70% for training and 30% for testing. We also need to recode the opinion values, since xgboost requires class labels to be integers greater than or equal to 0.
# Recode labels from (-1, 0, 1) to (0, 1, 2) as xgboost expects
data_tweets$opinion <- with(data_tweets, factor(opinion, labels = c(0, 1, 2)))
data_mod2$target <- data_tweets$opinion

# Hold out 10% of the data for validation
valid_rows <- createDataPartition(y = data_mod2$target, p = 0.1)$Resample1
valid_data <- data_mod2[valid_rows, ]
train_test_data <- data_mod2[-valid_rows, ]

# Split the remaining 90% into 70% training / 30% testing
sample_train_test <- createDataPartition(y = train_test_data$target, p = 0.7)
trainx <- as.matrix(train_test_data[sample_train_test$Resample1, -ncol(train_test_data)])
trainy <- train_test_data$target[sample_train_test$Resample1] %>% as.character() %>% as.numeric()
testx <- as.matrix(train_test_data[-sample_train_test$Resample1, -ncol(train_test_data)])
testy <- train_test_data$target[-sample_train_test$Resample1] %>% as.character() %>% as.numeric()
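As a quick sanity check (not in the original workflow), we can confirm each partition keeps roughly the same class proportions:
# Class counts per partition after the stratified splits
list(train = table(trainy), test = table(testy), validation = table(valid_data$target))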
Now let’s fit the model. Here we show how to tune one of the parameters (nrounds) using 5-fold CV, but max.depth, min_child_weight and regularization parameters like \(\lambda\) or \(\alpha\) should be tuned over a grid. We initialize the learning rate (eta) at 0.1 and set max.depth to the square root of the number of variables.
xgb.train <- xgb.DMatrix(data = trainx, label = trainy)
xgb.test <- xgb.DMatrix(data = testx, label = testy)
watchlist <- list(train = xgb.train, test = xgb.test)
# Multiclass softmax objective with 3 classes, evaluated with the multiclass log-loss
params <- list(booster = "gbtree", objective = "multi:softmax", num_class = 3,
               eval_metric = "mlogloss", max.depth = floor(sqrt(ncol(data_mod2))))
# 5-fold cross-validation over 100 boosting rounds to choose nrounds
xgbcv <- xgb.cv(params = params, data = xgb.train, nrounds = 100, nfold = 5, showsd = TRUE,
                stratified = TRUE, print_every_n = 10, early_stop_round = 5,
                maximize = FALSE, prediction = TRUE)
## [1] train-mlogloss:0.893321+0.004170 test-mlogloss:0.964936+0.015120
## [11] train-mlogloss:0.281110+0.008540 test-mlogloss:0.703221+0.043394
## [21] train-mlogloss:0.164600+0.005343 test-mlogloss:0.737192+0.047125
## [31] train-mlogloss:0.127583+0.004660 test-mlogloss:0.784956+0.052911
## [41] train-mlogloss:0.111035+0.004637 test-mlogloss:0.816444+0.062374
## [51] train-mlogloss:0.101702+0.004237 test-mlogloss:0.835672+0.072683
## [61] train-mlogloss:0.095395+0.004092 test-mlogloss:0.853016+0.075564
## [71] train-mlogloss:0.091024+0.004021 test-mlogloss:0.862122+0.083004
## [81] train-mlogloss:0.087766+0.003951 test-mlogloss:0.875376+0.084993
## [91] train-mlogloss:0.085120+0.003882 test-mlogloss:0.884724+0.086366
## [100] train-mlogloss:0.083316+0.003744 test-mlogloss:0.890405+0.088559
# Use the number of rounds that minimizes the cross-validated test log-loss
round.max <- which.min(xgbcv$evaluation_log$test_mlogloss_mean)
modxgb <- xgb.train(data = xgb.train, max.depth = floor(sqrt(ncol(data_mod2))), eta = 0.1, nthread = 4,
                    nrounds = round.max, eval_metric = "merror", num_class = 3, watchlist = watchlist, objective = "multi:softmax")
## [1] train-merror:0.138173 test-merror:0.397790
## [2] train-merror:0.119438 test-merror:0.392265
## [3] train-merror:0.114754 test-merror:0.375691
## [4] train-merror:0.105386 test-merror:0.364641
## [5] train-merror:0.105386 test-merror:0.353591
## [6] train-merror:0.098361 test-merror:0.364641
## [7] train-merror:0.098361 test-merror:0.370166
## [8] train-merror:0.096019 test-merror:0.386740
## [9] train-merror:0.088993 test-merror:0.381215
## [10] train-merror:0.077283 test-merror:0.375691
## [11] train-merror:0.070258 test-merror:0.359116
Our base model has a much higher accuracy on the training set than on the test set. We’re clearly overfitting.
# Predictions and confusion matrices for the training and test sets
predxgb_train <- as.factor(predict(modxgb, trainx))
predxgb_test <- as.factor(predict(modxgb, testx))
confusionMatrix(predxgb_train, as.factor(trainy))
confusionMatrix(predxgb_test, as.factor(testy))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 158 13 4
## 1 3 96 4
## 2 3 3 143
##
## Overall Statistics
##
## Accuracy : 0.9297
## 95% CI : (0.9012, 0.9521)
## No Information Rate : 0.3841
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8929
##
## Mcnemar's Test P-Value : 0.08826
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9634 0.8571 0.9470
## Specificity 0.9354 0.9778 0.9783
## Pos Pred Value 0.9029 0.9320 0.9597
## Neg Pred Value 0.9762 0.9506 0.9712
## Prevalence 0.3841 0.2623 0.3536
## Detection Rate 0.3700 0.2248 0.3349
## Detection Prevalence 0.4098 0.2412 0.3489
## Balanced Accuracy 0.9494 0.9175 0.9626
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 45 24 5
## 1 21 19 7
## 2 4 4 52
##
## Overall Statistics
##
## Accuracy : 0.6409
## 95% CI : (0.5664, 0.7107)
## No Information Rate : 0.3867
## P-Value [Acc > NIR] : 4.322e-12
##
## Kappa : 0.4536
##
## Mcnemar's Test P-Value : 0.77
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.6429 0.4043 0.8125
## Specificity 0.7387 0.7910 0.9316
## Pos Pred Value 0.6081 0.4043 0.8667
## Neg Pred Value 0.7664 0.7910 0.9008
## Prevalence 0.3867 0.2597 0.3536
## Detection Rate 0.2486 0.1050 0.2873
## Detection Prevalence 0.4088 0.2597 0.3315
## Balanced Accuracy 0.6908 0.5977 0.8721
It seems our model didn’t learn all that well: accuracy is just above 60% on the held-out data. How about using SMOTE or ADASYN to balance the dataset? Or a spelling-correction model like the one provided by the hunspell package? Also remember that we didn’t run a grid search for our XGBoost model.
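For reference, a minimal sketch of how hunspell could be used for spelling suggestions, assuming a Spanish dictionary (es_ES) is installed on the system; this is not part of the pipeline fitted above:
library(hunspell)
es_dict <- dictionary("es_ES")                          # requires the Spanish dictionary
words_to_check <- c("opinion", "pelicula", "exelente")  # hypothetical examples
ok <- hunspell_check(words_to_check, dict = es_dict)
# First suggestion for each word flagged as misspelled
suggestions <- hunspell_suggest(words_to_check[!ok], dict = es_dict)
sapply(suggestions, `[`, 1)
Back to our model: let’s see how it does on the 10% validation split we held out at the start.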
# Performance on the validation set
predxgb_valid <- as.factor(predict(modxgb, as.matrix(valid_data[, -ncol(valid_data)])))
confusionMatrix(predxgb_valid, as.factor(valid_data$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 15 5 4
## 1 11 12 2
## 2 0 1 18
##
## Overall Statistics
##
## Accuracy : 0.6618
## 95% CI : (0.5368, 0.7721)
## No Information Rate : 0.3824
## P-Value [Acc > NIR] : 2.87e-06
##
## Kappa : 0.4945
##
## Mcnemar's Test P-Value : 0.08643
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.5769 0.6667 0.7500
## Specificity 0.7857 0.7400 0.9773
## Pos Pred Value 0.6250 0.4800 0.9474
## Neg Pred Value 0.7500 0.8605 0.8776
## Prevalence 0.3824 0.2647 0.3529
## Detection Rate 0.2206 0.1765 0.2647
## Detection Prevalence 0.3529 0.3676 0.2794
## Balanced Accuracy 0.6813 0.7033 0.8636
It’s possible to predict opinions from raw text; however, a fair amount of feature engineering and model tuning is needed to do it well.