Hello! We would like to introduce you (and ourselves) to the world of gradient boosting machines. This is where the practical part of our journey starts: we will build a gradient boosting model that predicts the sex of reviewers from the words they are most likely to use in their reviews.

Firstly, let’s load the dataset and do some data cleaning.

library(caret)     # confusionMatrix(), downSample()
library(dplyr)
library(tidytext)  # unnest_tokens(), cast_dfm()
library(stringr)
reviews <- read.csv("review.csv", header=TRUE, sep=";", encoding = "UTF-8")
reviews$Sex <- as.character(reviews$Sex)
# recode the "N/A" placeholder as a real missing value
reviews$Sex <- ifelse(str_detect(reviews$Sex, "N/A"), NA, reviews$Sex)
# normalize any label variant containing "Ж" to the single clean label "Ж"
reviews$Sex <- ifelse(str_detect(reviews$Sex, "Ж"), "Ж", reviews$Sex)
reviews <- reviews %>% na.omit()
reviews$Review <- as.character(reviews$Review)

reviews$id <- 1:nrow(reviews)  # document ids; these become the dfm docnames below
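
As a quick sanity check (not part of the original output), we can confirm that only the two expected labels survive the cleaning:

table(reviews$Sex, useNA = "ifany")  # expect only "Ж" and "М", and no NAs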

Now we perform a few operations whose result is a document-term matrix of the words from the reviews, with stopwords filtered out.

library(quanteda)
library(stopwords)

reviews.dtm <- reviews %>%
    unnest_tokens(word, Review) %>%
    anti_join(tibble(word=stopwords("ru")), by="word") %>%
    filter(!str_detect(word, "[0-9]+")) %>%
    dplyr::count(id, word) %>%
    cast_dfm(id, word, n)
reviews.dtm
## Document-feature matrix of: 2,268 documents, 69,351 features (99.8% sparse).

Too many features: 69,351, with 99.8% of cells containing zeroes. Let’s trim the matrix, keeping only the terms that occur in between 1% and 30% of the documents. We also apply stemming to reduce words to their stems, which makes the document-term matrix considerably more compact.

reviews.trim <- reviews.dtm %>% as.dfm() %>%
    dfm_wordstem(language = "ru") %>%                  # stem the Russian words
    dfm_trim(max_docfreq=0.3, min_docfreq=0.01,
             docfreq_type="prop") %>%                  # keep mid-frequency terms only
    dfm_tfidf()                                        # tf-idf weighting
reviews.trim
## Document-feature matrix of: 2,268 documents, 1,829 features (96.4% sparse).

To build a classifier we need a training set and a test set: the model will be fitted on the former and assessed on the latter. Let’s make a stratified sample split.

library(rsample)
reviews$Sex <- as.factor(reviews$Sex)
reviews.split  <- initial_split(reviews, prop=0.8, strata=Sex)
reviews.split
## <1815/453/2268>

Now we extract the training and testing sets from the split and peek at the sex proportions.

reviews.train <- training(reviews.split)
table(reviews.train$Sex)
## 
##   Ж    М  
## 1569  246
reviews.test  <- testing(reviews.split)
table(reviews.test$Sex)
## 
##  Ж   М  
## 396  57

It can be seen that female reviewers significantly outnumber male ones, in both sets.
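
To quantify the imbalance, we can look at the class shares directly (a quick check, output omitted):

prop.table(table(reviews.train$Sex))  # about 86% female vs 14% male
prop.table(table(reviews.test$Sex))   # about 87% vs 13%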

Next, we split the reviews dfm into train/test parts and set up the training objective (the sex labels, encoded as 0/1).

train.dtmR <- reviews.trim %>%
    dfm_subset(docnames(reviews.dtm) %in% reviews.train$id)
test.dtmR <- reviews.trim %>%
    dfm_subset(docnames(reviews.dtm) %in% reviews.test$id)

trainYR <- reviews.train$Sex
train_labs <- as.numeric(trainYR) - 1   # Ж -> 0, М -> 1 (gbm's bernoulli needs 0/1)
testYR <- reviews.test$Sex
test_labs <- as.numeric(testYR) - 1

Now we can combine the response labels with the document-term matrix for use in model building. We do it for both the training set and the test set.

mat.dfm.train <- as.matrix(train.dtmR)
mat.dfm.test <- as.matrix(test.dtmR)

# make every term a syntactically valid R column name
colnames(mat.dfm.train) <- make.names(colnames(mat.dfm.train))
colnames(mat.dfm.test) <- make.names(colnames(mat.dfm.test))
df.train <- data.frame(Sex = train_labs, mat.dfm.train)
df.test <- data.frame(Sex = test_labs, mat.dfm.test)
# build a formula with every term as a predictor: Sex ~ word1 + word2 + ...
formula <- formula(paste0("Sex ~ ", paste0(colnames(mat.dfm.train), collapse = "+")))

Here the model building itself begins! We use the gbm::gbm() function, setting its parameters as follows (see the cross-validation sketch after this list for a data-driven alternative):

- distribution: “bernoulli”, since we are working on a binary classification problem (for regression, “gaussian” is used);

- n.trees: 1000, a common starting point for the ensemble size; too few trees would underfit, while setting the number much higher risks overfitting;

- interaction.depth: 6, following the Salford Systems default setting.
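
These values are reasonable starting points rather than tuned ones. If you would rather let the data choose the number of trees, gbm can run internal cross-validation, and gbm.perf() then reports the CV-optimal iteration. A minimal sketch of that approach, once gbm is loaded below (slower, and not run here):

set.seed(1604)
model.cv <- gbm(formula,
                data=df.train,
                distribution="bernoulli",
                n.trees=1000,
                interaction.depth=6,
                cv.folds=5)                      # 5-fold cross-validation
best.iter <- gbm.perf(model.cv, method = "cv")   # CV-optimal number of trees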

options(scipen=999)
library(gbm)

set.seed(1604)
model.boost <- gbm(formula, 
                data=df.train, 
                distribution="bernoulli",
                n.trees=1000, 
                interaction.depth=6)
head(summary(model.boost, order=TRUE,  cBars = 50), n = 10)

The table of relative influence reveals that the top five independent variables for predicting reviewers’ sex are the words ‘зрен’, ‘тщательн’, ‘идет’, ‘технолог’, and ‘теор’: they have the greatest relative influence coefficients.
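
If you only need the numbers without the bar plot, the same influence scores can be extracted directly (a small convenience, output omitted):

infl <- relative.influence(model.boost, n.trees = 1000)
head(sort(infl, decreasing = TRUE), 10)  # the same top words, as a named vector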

Next, let’s look at the prediction. n.trees = 100 means that we use only the first 100 of the 1000 trees for the prediction. The output is probabilities, just as with logistic regression.

predTrainProb.boost = predict(model.boost, df.train, n.trees = 100, type = "response")
predTestProb.boost = predict(model.boost, df.test, n.trees = 100, type = "response")

head(predTrainProb.boost)
## [1] 0.1341726 0.1323540 0.1329946 0.1326902 0.1349239 0.1322422

Finally, let’s see how accurate this model is. To do that, we build a confusion matrix with accuracy and other assessment metrics. The threshold for being assigned to class “male” is the standard 0.5; note that with predicted probabilities as low as the ones above, hardly any observation will clear it.
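
Before settling on a cutoff, it can be instructive to check how many “male” predictions different thresholds would yield on the test set (an exploratory check, output omitted):

sapply(c(0.3, 0.5, 0.7), function(th) sum(predTestProb.boost >= th))
# number of test observations that would be labelled "1" at each cutoff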

predTrain.boost = as.factor(ifelse(predTrainProb.boost < 0.5, "0", "1"))
predTest.boost = as.factor(ifelse(predTestProb.boost < 0.5, "0", "1"))

df.train$Sex <- as.factor(as.character(df.train$Sex))
df.test$Sex <- as.factor(as.character(df.test$Sex))

accuracyTrain.boost = confusionMatrix(predTrain.boost, df.train$Sex)$overall["Accuracy"]
## Warning in confusionMatrix.default(predTrain.boost, df.train$Sex): Levels
## are not in the same order for reference and data. Refactoring data to
## match.
accuracyTest.boost = confusionMatrix(predTest.boost, df.test$Sex)$overall["Accuracy"]
## Warning in confusionMatrix.default(predTest.boost, df.test$Sex): Levels are
## not in the same order for reference and data. Refactoring data to match.
accuracyTrain.boost
##  Accuracy 
## 0.8644628
accuracyTest.boost
##  Accuracy 
## 0.8741722
cmTrain <- confusionMatrix(predTrain.boost, df.train$Sex)
## Warning in confusionMatrix.default(predTrain.boost, df.train$Sex): Levels
## are not in the same order for reference and data. Refactoring data to
## match.
cmTest <- confusionMatrix(predTest.boost, df.test$Sex)
## Warning in confusionMatrix.default(predTest.boost, df.test$Sex): Levels are
## not in the same order for reference and data. Refactoring data to match.
cmTrain
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1569  246
##          1    0    0
##                                              
##                Accuracy : 0.8645             
##                  95% CI : (0.8479, 0.8799)   
##     No Information Rate : 0.8645             
##     P-Value [Acc > NIR] : 0.517              
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 1.0000             
##             Specificity : 0.0000             
##          Pos Pred Value : 0.8645             
##          Neg Pred Value :    NaN             
##              Prevalence : 0.8645             
##          Detection Rate : 0.8645             
##    Detection Prevalence : 1.0000             
##       Balanced Accuracy : 0.5000             
##                                              
##        'Positive' Class : 0                  
## 
cmTest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 396  57
##          1   0   0
##                                             
##                Accuracy : 0.8742            
##                  95% CI : (0.8401, 0.9033)  
##     No Information Rate : 0.8742            
##     P-Value [Acc > NIR] : 0.5352            
##                                             
##                   Kappa : 0                 
##                                             
##  Mcnemar's Test P-Value : 0.0000000000001195
##                                             
##             Sensitivity : 1.0000            
##             Specificity : 0.0000            
##          Pos Pred Value : 0.8742            
##          Neg Pred Value :    NaN            
##              Prevalence : 0.8742            
##          Detection Rate : 0.8742            
##    Detection Prevalence : 1.0000            
##       Balanced Accuracy : 0.5000            
##                                             
##        'Positive' Class : 0                 
## 

The accuracy looks high, which could be read as the model making quite accurate predictions. The confusion matrix, however, shows a different picture. The model performs well, but solely on observations of class ‘0’: it correctly predicts women and fails to recognize a single observation of class ‘1’, i.e. male reviewers. This is a consequence of the severe class imbalance: only about 13% of the reviewers in the sample (303 of 2,268) are male; everyone else is female.
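
Note that the model’s accuracy exactly matches the no-information rate, i.e. what we would get by always predicting the majority class. A one-line baseline check (output omitted):

max(prop.table(table(df.test$Sex)))  # accuracy of always predicting "female"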

One way to handle this issue is downsampling, i.e. equalizing the number of observations in each class. Let’s try it. Essentially, we start over with a new initial data frame, one in which the number of female reviewers equals the number of male ones.

reviews2 <- downSample(reviews, reviews$Sex)  # caret: randomly drop majority-class rows until the classes match

reviews2.dtm <- reviews2 %>%
    unnest_tokens(word, Review) %>%
    anti_join(tibble(word=stopwords("ru")), by="word") %>%
    filter(!str_detect(word, "[0-9]+")) %>%
    dplyr::count(id, word) %>%
    cast_dfm(id, word, n)
reviews2.dtm
## Document-feature matrix of: 606 documents, 31,353 features (99.6% sparse).
reviews2.trim <- reviews2.dtm %>% as.dfm %>%
    dfm_wordstem(language = "ru") %>%
    dfm_trim(max_docfreq=0.3, min_docfreq=0.01, docfreq_type="prop") %>%
    dfm_tfidf
reviews2.trim
## Document-feature matrix of: 606 documents, 1,909 features (96.3% sparse).
reviews2$Sex <- as.factor(reviews2$Sex)
reviews2.split  <- initial_split(reviews2, prop=0.8, strata=Sex)
reviews2.split
## <486/120/606>
reviews2.train <- training(reviews2.split)
table(reviews2.train$Sex)
## 
##  Ж   М  
## 243 243
reviews2.test  <- testing(reviews2.split)
table(reviews2.test$Sex)
## 
## Ж  М  
## 60 60
train.dtmR2 <- reviews2.trim %>%
    dfm_subset(docnames(reviews2.dtm) %in% reviews2.train$id)
test.dtmR2 <- reviews2.trim %>%
    dfm_subset(docnames(reviews2.dtm) %in% reviews2.test$id)

trainYR2 <- reviews2.train$Sex 
train_labs2 <- as.numeric(trainYR2)-1
testYR2 <- reviews2.test$Sex
test_labs2 <- as.numeric(testYR2)-1

mat.dfm.train2 <- as.matrix(train.dtmR2)
mat.dfm.test2 <- as.matrix(test.dtmR2)

colnames(mat.dfm.train2) <- make.names(colnames(mat.dfm.train2))
colnames(mat.dfm.test2) <- make.names(colnames(mat.dfm.test2))
df.train2 <- data.frame(Sex = train_labs2, mat.dfm.train2) 
df.test2 <- data.frame(Sex = test_labs2, mat.dfm.test2)
formula2 <- formula(paste0("Sex ~ ", paste0(colnames(mat.dfm.train2), collapse = "+"))) 

set.seed(1604)
model.boost2 <- gbm(formula2, 
                data=df.train2, 
                distribution="bernoulli",
                n.trees=1000, 
                interaction.depth=6 
                )
head(summary(model.boost2, order=TRUE,  cBars = 50), n = 10)
predTrainProb.boost2 = predict(model.boost2, df.train2, n.trees = 1000, type = "response")
predTestProb.boost2 = predict(model.boost2, df.test2, n.trees = 1000, type = "response")

head(predTrainProb.boost2)
## [1] 0.4320361 0.4134269 0.3762085 0.4792158 0.4547749 0.5024926
predTrain.boost2 = as.factor(ifelse(predTrainProb.boost2 < 0.5, "0", "1"))
predTest.boost2 = as.factor(ifelse(predTestProb.boost2 < 0.5, "0", "1"))

df.train2$Sex <- as.factor(as.character(df.train2$Sex))
df.test2$Sex <- as.factor(as.character(df.test2$Sex))

cmTrain_2 <- confusionMatrix(predTrain.boost2, df.train2$Sex)
cmTest_2 <- confusionMatrix(predTest.boost2, df.test2$Sex)
cmTrain_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 222  74
##          1  21 169
##                                                
##                Accuracy : 0.8045               
##                  95% CI : (0.7664, 0.8389)     
##     No Information Rate : 0.5                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6091               
##                                                
##  Mcnemar's Test P-Value : 0.0000000955         
##                                                
##             Sensitivity : 0.9136               
##             Specificity : 0.6955               
##          Pos Pred Value : 0.7500               
##          Neg Pred Value : 0.8895               
##              Prevalence : 0.5000               
##          Detection Rate : 0.4568               
##    Detection Prevalence : 0.6091               
##       Balanced Accuracy : 0.8045               
##                                                
##        'Positive' Class : 0                    
## 
cmTest_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40 36
##          1 20 24
##                                           
##                Accuracy : 0.5333          
##                  95% CI : (0.4401, 0.6249)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.26150         
##                                           
##                   Kappa : 0.0667          
##                                           
##  Mcnemar's Test P-Value : 0.04502         
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.5263          
##          Neg Pred Value : 0.5455          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.6333          
##       Balanced Accuracy : 0.5333          
##                                           
##        'Positive' Class : 0               
## 

It seems this solution did help somewhat. The classifier now recognizes some males, and the Kappa value differs from zero, meaning the classifier is at least better than a random one. However, the accuracy on the test set dropped substantially: it is now about 0.53, i.e. the model correctly predicts only 53% of the observations it has never seen. This is actually expected, given that there are only 486 records in the training set; the model simply did not get enough examples to learn from.
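
A fairer way to compare the two models than raw accuracy is balanced accuracy (the mean of sensitivity and specificity), which caret already computes for us:

cmTest$byClass["Balanced Accuracy"]    # first model:  0.5000 (never predicts males)
cmTest_2$byClass["Balanced Accuracy"]  # second model: 0.5333 (slightly above chance)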

All in all, we think it makes sense to keep this last model, as its predictive power across both classes appears to be greater. As a side note, be careful with your data, because it is also one of the most important determinants of your model’s performance.

That’s all. Thank you for your attention!