Hello! Let us introduce you to the world of gradient boosting machines. This is where the practical part of our journey starts: we will build a gradient boosting model that predicts the sex of reviewers from the words they are most likely to use in their reviews.

Firstly, let’s load the dataset and do some data cleaning.

library(glmnet)
library(caret)
library(tm)
library(dplyr)
library(tidytext)
library(stringr)
library(readr)
reviews <- read.csv("review.csv", header = TRUE, sep = ";", encoding = "UTF-8")
# Clean the Sex variable: convert to character, turn "N/A" strings into
# real missing values, and normalise the female label to a single spelling.
reviews$Sex <- as.character(reviews$Sex)
reviews$Sex <- ifelse(str_detect(reviews$Sex, "N/A"), NA, reviews$Sex)
reviews$Sex <- ifelse(str_detect(reviews$Sex, "Ж"), "Ж", reviews$Sex)
reviews <- reviews %>% na.omit()
reviews$Review <- as.character(reviews$Review)

reviews$id <- 1:nrow(reviews)  # document ids, used to align rows with the dfm later

Next, we build a document-term matrix of the words from the reviews, with stopwords filtered out.

library(quanteda)
library(stopwords)

reviews.dtm <- reviews %>%
    unnest_tokens(word, Review) %>%                 # one row per word occurrence
    anti_join(tibble(word = stopwords("ru"))) %>%   # drop Russian stopwords
    filter(!str_detect(word, "[0-9]+")) %>%         # drop tokens containing digits
    dplyr::count(id, word) %>%                      # word counts per document
    cast_dfm(id, word, n)                           # cast into a quanteda dfm
reviews.dtm
## Document-feature matrix of: 2,268 documents, 69,351 features (99.8% sparse).

Too many features: 69,351, with 99.8% of the cells containing zeroes. Let’s trim the matrix, keeping only features that occur in between 1% and 30% of documents. Before trimming we apply stemming, so that inflected forms of the same word collapse into a single feature, and afterwards we weight the counts by tf-idf.

reviews.trim <- reviews.dtm %>% as.dfm %>%
    dfm_wordstem(language = "ru") %>%                 # reduce words to stems
    dfm_trim(max_docfreq = 0.3, min_docfreq = 0.01,   # keep mid-frequency stems
             docfreq_type = "prop") %>%
    dfm_tfidf                                         # tf-idf weighting
reviews.trim
## Document-feature matrix of: 2,268 documents, 1,829 features (96.4% sparse).
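
As a quick optional sanity check on the trimming, quanteda’s topfeatures() lists the features with the largest total weight, here the stems carrying the most tf-idf mass.

topfeatures(reviews.trim, n = 20)  # the 20 highest-weighted stems in the trimmed dfm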

To build a classifier we need a training set and a test set: the model learns from the former and is evaluated on the latter. Let’s make a stratified sample split.

library(rsample)
## Loading required package: tidyr
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:Matrix':
## 
##     expand
reviews$Sex <- as.factor(reviews$Sex)
reviews.split  <- initial_split(reviews, prop=0.8, strata=Sex)
reviews.split
## <1815/453/2268>

Now we extract the training and testing sets from the split (the numbers above are the train/test/total counts) and peek at the sex proportions.

reviews.train <- training(reviews.split)
table(reviews.train$Sex)
## 
##   Ж    М  
## 1564  251
reviews.test  <- testing(reviews.split)
table(reviews.test$Sex)
## 
##  Ж   М  
## 401  52

Female reviewers significantly outnumber male ones, in both sets.
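
To put a number on the imbalance, we can look at the class shares directly; the female share in each set matches the “No Information Rate” that caret reports further below.

prop.table(table(reviews.train$Sex))  # class proportions: Ж ≈ 0.86, М ≈ 0.14
prop.table(table(reviews.test$Sex))   # Ж ≈ 0.89, М ≈ 0.11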

Next, we split the reviews dfm into the same train/test parts and encode the training objective (the sex labels) as 0/1.

train.dtmR <- reviews.trim %>%
    dfm_subset(docnames(reviews.dtm) %in% reviews.train$id)
test.dtmR <- reviews.trim %>%
    dfm_subset(docnames(reviews.dtm) %in% reviews.test$id)

# Recode the factor as numeric 0/1 labels, as gbm's bernoulli distribution
# expects: the first level "Ж" becomes 0, "М" becomes 1.
trainYR <- reviews.train$Sex
train_labs <- as.numeric(trainYR) - 1
testYR <- reviews.test$Sex
test_labs <- as.numeric(testYR) - 1

Now we can combine the response labels with the document-term matrix for use in model building. We do this for both the train set and the test set.

mat.dfm.train <- as.matrix(train.dtmR)
mat.dfm.test <- as.matrix(test.dtmR)

# Make the feature names syntactically valid so they can appear in a formula.
colnames(mat.dfm.train) <- make.names(colnames(mat.dfm.train))
colnames(mat.dfm.test) <- make.names(colnames(mat.dfm.test))
df.train <- data.frame(Sex = train_labs, mat.dfm.train)
df.test <- data.frame(Sex = test_labs, mat.dfm.test)
# A formula regressing Sex on every feature in the matrix.
formula <- formula(paste0("Sex ~ ", paste0(colnames(mat.dfm.train), collapse = "+")))

Here the model building itself begins! We use the gbm::gbm function, setting its parameters as follows:

distribution to “bernoulli”, since we are working on a binary classification problem (for regression, “gaussian” would be used);

number of trees to 1000, a commonly recommended starting size for a boosted ensemble; setting it much higher would risk overfitting the training data;

interaction depth to 6, following the Salford Systems default setting.

options(scipen = 999)  # print probabilities in fixed rather than scientific notation
library(gbm)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1604)
model.boost <- gbm(formula, 
                data=df.train, 
                distribution="bernoulli",
                n.trees=1000, 
                interaction.depth=6)
head(summary(model.boost, order=TRUE,  cBars = 50), n = 10)

##                     var  rel.inf
## средневеков средневеков 3.563649
## идёт               идёт 3.545140
## ген                 ген 3.338291
## технолог       технолог 3.171834
## российск       российск 2.177620
## велик             велик 2.142676
## собра             собра 1.766862
## зрен               зрен 1.665467
## тщательн       тщательн 1.658618
## подход           подход 1.638755

The relative influence table reveals that the five independent variables weighing the most in predicting reviewers’ sex are the stems ‘средневеков’, ‘идёт’, ‘ген’, ‘технолог’, and ‘российск’, as they have the largest relative influence coefficients.
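
As an aside, instead of fixing the ensemble size in advance, gbm can choose it by cross-validation. A minimal sketch (refitting with cv.folds is slow on a matrix this wide, so treat it as optional):

# Refit with 5-fold cross-validation, then let gbm.perf() pick the
# iteration with the lowest cross-validated deviance.
set.seed(1604)
model.cv <- gbm(formula, data = df.train, distribution = "bernoulli",
                n.trees = 1000, interaction.depth = 6, cv.folds = 5)
best.iter <- gbm.perf(model.cv, method = "cv")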

Next, let’s look at the prediction. Setting n.trees = 100 means that only the first 100 of the 1000 fitted trees are used for the prediction. The output is probabilities, just as with logistic regression.

predTrainProb.boost = predict(model.boost, df.train, n.trees = 100, type = "response")
predTestProb.boost = predict(model.boost, df.test, n.trees = 100, type = "response")

head(predTrainProb.boost)
## [1] 0.1394674 0.1353042 0.1367344 0.1351552 0.1395100 0.1442521

Finally, let’s see how accurate this model is. To do that we build a confusion matrix with accuracy and other assessment metrics. All predicted probabilities cluster around 0.14, so under the conventional rule (label “1” when the probability exceeds 0.5) no observation would be recognised as class “male” at all; the rule below therefore labels an observation “1” whenever its predicted probability falls below 0.3.

# Label an observation "1" when its predicted probability falls below the
# 0.3 cutoff, and "0" otherwise.
predTrain.boost <- as.factor(ifelse(predTrainProb.boost < 0.3, "1", "0"))
predTest.boost <- as.factor(ifelse(predTestProb.boost < 0.3, "1", "0"))

# Convert the numeric 0/1 labels to factors for confusionMatrix().
df.train$Sex <- as.factor(as.character(df.train$Sex))
df.test$Sex <- as.factor(as.character(df.test$Sex))

accuracyTrain.boost = confusionMatrix(predTrain.boost, df.train$Sex)$overall["Accuracy"]
## Warning in confusionMatrix.default(predTrain.boost, df.train$Sex): Levels
## are not in the same order for reference and data. Refactoring data to
## match.
accuracyTest.boost = confusionMatrix(predTest.boost, df.test$Sex)$overall["Accuracy"]
## Warning in confusionMatrix.default(predTest.boost, df.test$Sex): Levels are
## not in the same order for reference and data. Refactoring data to match.
accuracyTrain.boost
## Accuracy 
## 0.138292
accuracyTest.boost
##  Accuracy 
## 0.1147903
cmTrain <- confusionMatrix(predTrain.boost, df.train$Sex)
## Warning in confusionMatrix.default(predTrain.boost, df.train$Sex): Levels
## are not in the same order for reference and data. Refactoring data to
## match.
cmTest <- confusionMatrix(predTest.boost, df.test$Sex)
## Warning in confusionMatrix.default(predTest.boost, df.test$Sex): Levels are
## not in the same order for reference and data. Refactoring data to match.
cmTrain
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0    0    0
##          1 1564  251
##                                              
##                Accuracy : 0.1383             
##                  95% CI : (0.1227, 0.155)    
##     No Information Rate : 0.8617             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.0000             
##             Specificity : 1.0000             
##          Pos Pred Value :    NaN             
##          Neg Pred Value : 0.1383             
##              Prevalence : 0.8617             
##          Detection Rate : 0.0000             
##    Detection Prevalence : 0.0000             
##       Balanced Accuracy : 0.5000             
##                                              
##        'Positive' Class : 0                  
## 
cmTest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0   0   0
##          1 401  52
##                                              
##                Accuracy : 0.1148             
##                  95% CI : (0.0869, 0.1478)   
##     No Information Rate : 0.8852             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.0000             
##             Specificity : 1.0000             
##          Pos Pred Value :    NaN             
##          Neg Pred Value : 0.1148             
##              Prevalence : 0.8852             
##          Detection Rate : 0.0000             
##    Detection Prevalence : 0.0000             
##       Balanced Accuracy : 0.5000             
##                                              
##        'Positive' Class : 0                  
## 

The accuracy is very low, and the model clearly cannot handle the class “female”: not a single observation is assigned to it. One could try to fix this by setting a different prediction threshold on the train and test samples, i.e. raising it. However, when we do so, we miss every observation of the class “male” instead, as there are far too few of them.
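
To see this trade-off concretely, we can sweep the cutoff over the test-set probabilities and watch how the predicted class counts shift (a small diagnostic using the rule from above, where probabilities below the cutoff are labelled “1”):

cuts <- c(0.10, 0.13, 0.15, 0.30, 0.50)
counts <- sapply(cuts, function(cut)
    table(factor(ifelse(predTestProb.boost < cut, "1", "0"),
                 levels = c("0", "1"))))
colnames(counts) <- cuts
counts  # rows: predicted class counts; columns: cutoff values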

Another way to handle this issue is downsampling, i.e. equalizing the number of observations in each class. Let’s try that. Essentially, we start over with a new initial data frame, one in which the number of observations of female reviewers equals that of male ones.

# caret::downSample randomly drops majority-class rows until both classes
# are the same size (it also appends the labels as an extra "Class" column).
reviews2 <- downSample(reviews, reviews$Sex)

reviews2.dtm <- reviews2 %>%
    unnest_tokens(word, Review) %>%
    anti_join(tibble(word=stopwords("ru"))) %>%
    filter(!str_detect(word, "[0-9]+")) %>%
    dplyr::count(id, word) %>%
    cast_dfm(id, word, n)
## Joining, by = "word"
reviews2.dtm
## Document-feature matrix of: 606 documents, 31,353 features (99.6% sparse).
reviews2.trim <- reviews2.dtm %>% as.dfm %>%
    dfm_wordstem(language = "ru") %>%
    dfm_trim(max_docfreq=0.3, min_docfreq=0.01, docfreq_type="prop") %>%
    dfm_tfidf
reviews2.trim
## Document-feature matrix of: 606 documents, 1,909 features (96.3% sparse).
reviews2$Sex <- as.factor(reviews2$Sex)
reviews2.split  <- initial_split(reviews2, prop=0.8, strata=Sex)
reviews2.split
## <486/120/606>
reviews2.train <- training(reviews2.split)
table(reviews2.train$Sex)
## 
##  Ж   М  
## 243 243
reviews2.test  <- testing(reviews2.split)
table(reviews2.test$Sex)
## 
## Ж  М  
## 60 60
train.dtmR2 <- reviews2.trim %>%
    dfm_subset(docnames(reviews2.dtm) %in% reviews2.train$id)
test.dtmR2 <- reviews2.trim %>%
    dfm_subset(docnames(reviews2.dtm) %in% reviews2.test$id)

trainYR2 <- reviews2.train$Sex 
train_labs2 <- as.numeric(trainYR2)-1
testYR2 <- reviews2.test$Sex
test_labs2 <- as.numeric(testYR2)-1

mat.dfm.train2 <- as.matrix(train.dtmR2)
mat.dfm.test2 <- as.matrix(test.dtmR2)

colnames(mat.dfm.train2) <- make.names(colnames(mat.dfm.train2))
colnames(mat.dfm.test2) <- make.names(colnames(mat.dfm.test2))
df.train2 <- data.frame(Sex = train_labs2, mat.dfm.train2) 
df.test2 <- data.frame(Sex = test_labs2, mat.dfm.test2)
formula2 <- formula(paste0("Sex ~ ", paste0(colnames(mat.dfm.train2), collapse = "+"))) 

set.seed(1604)
model.boost2 <- gbm(formula2, 
                data=df.train2, 
                distribution="bernoulli",
                n.trees=1000, 
                interaction.depth=6 
                )
head(summary(model.boost2, order=TRUE,  cBars = 50), n = 10)

##                   var  rel.inf
## станов         станов 5.973086
## рук               рук 4.069700
## рома             рома 2.525221
## ин                 ин 2.320847
## произведен произведен 1.904472
## вообщ           вообщ 1.845403
## мир               мир 1.547765
## талант         талант 1.499328
## будущ           будущ 1.433893
## способн       способн 1.080268
predTrainProb.boost2 = predict(model.boost2, df.train2, n.trees = 1000, type = "response")
predTestProb.boost2 = predict(model.boost2, df.test2, n.trees = 1000, type = "response")

head(predTrainProb.boost2)
## [1] 0.4320361 0.4134269 0.3762085 0.4792158 0.4547749 0.5024926
# As before, an observation is labelled "1" when its predicted probability
# falls below the cutoff, now set to the conventional 0.5.
predTrain.boost2 <- as.factor(ifelse(predTrainProb.boost2 < 0.5, "1", "0"))
predTest.boost2 <- as.factor(ifelse(predTestProb.boost2 < 0.5, "1", "0"))

df.train2$Sex <- as.factor(as.character(df.train2$Sex))
df.test2$Sex <- as.factor(as.character(df.test2$Sex))

cmTrain_2 <- confusionMatrix(predTrain.boost2, df.train2$Sex)
cmTest_2 <- confusionMatrix(predTest.boost2, df.test2$Sex)
cmTrain_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  21 169
##          1 222  74
##                                           
##                Accuracy : 0.1955          
##                  95% CI : (0.1611, 0.2336)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.000000        
##                                           
##                   Kappa : -0.6091         
##                                           
##  Mcnemar's Test P-Value : 0.008545        
##                                           
##             Sensitivity : 0.08642         
##             Specificity : 0.30453         
##          Pos Pred Value : 0.11053         
##          Neg Pred Value : 0.25000         
##              Prevalence : 0.50000         
##          Detection Rate : 0.04321         
##    Detection Prevalence : 0.39095         
##       Balanced Accuracy : 0.19547         
##                                           
##        'Positive' Class : 0               
## 
cmTest_2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 20 24
##          1 40 36
##                                           
##                Accuracy : 0.4667          
##                  95% CI : (0.3751, 0.5599)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.79429         
##                                           
##                   Kappa : -0.0667         
##                                           
##  Mcnemar's Test P-Value : 0.06079         
##                                           
##             Sensitivity : 0.3333          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.4545          
##          Neg Pred Value : 0.4737          
##              Prevalence : 0.5000          
##          Detection Rate : 0.1667          
##    Detection Prevalence : 0.3667          
##       Balanced Accuracy : 0.4667          
##                                           
##        'Positive' Class : 0               
## 

For this model, the top five predictors, i.e. the words contributing most to the prediction, are ‘станов’, ‘рук’, ‘рома’, ‘ин’, and ‘произведен’. Downsampling does seem to have helped somewhat: the classifier now correctly assigns some proportion of the females in the train sample, and the majority of the males in the test sample. Accuracy has changed as well, at about 0.2 on the train set and 0.47 on the test set, which is odd, since a model is normally expected to perform better on the data it was trained on. In fact, a train accuracy this far below chance, together with the strongly negative train Kappa of -0.61, suggests that the cutoff rule maps the classes the wrong way around: reversing it would swap the rows of the confusion matrices, giving roughly (222 + 169)/486 ≈ 0.80 accuracy on the train set and (40 + 24)/120 ≈ 0.53 on the test set.
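
A quick way to check this suspicion is to reverse the direction of the cutoff rule and recompute the accuracy (a diagnostic sketch using the objects defined above):

# Conventional rule: label "1" when the predicted probability of class 1
# is at or above 0.5, the reverse of the rule used earlier.
predTest.flip <- factor(ifelse(predTestProb.boost2 >= 0.5, "1", "0"),
                        levels = c("0", "1"))
confusionMatrix(predTest.flip, df.test2$Sex)$overall["Accuracy"]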

All in all, we think it makes sense to keep this last model, as its predictive power appears to be greater.