Attention: some interpretations may be inconsistent with the data displayed! I tried setting set.seed(), but it didn't help: I could not find the source of randomness, and with every new knitting the numbers in the resulting accuracy tables changed, so I had to revise my interpretations several times. I am sorry for that :`(

Intro

In this project I build a logistic regression model on book review data written by male and female reviewers.

Data preparation & NA removal
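The original setup chunk is not echoed in this report, so, for reproducibility, here is my reconstruction of the packages the code below relies on (based purely on the functions it calls):

library(readr)     # read_delim
library(dplyr)     # %>%, count, filter, arrange
library(stringr)   # str_detect, str_to_lower, str_replace_all
library(tidytext)  # unnest_tokens, cast_dfm
library(quanteda)  # dfm_trim, dfm_tfidf, dfm_subset, stopwords
library(rsample)   # initial_split, training, testing
library(glmnet)    # cv.glmnet
library(caret)     # confusionMatrix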

text <- read_delim("~/review.csv", ";", escape_double = FALSE, trim_ws = TRUE)

# the Sex column encodes missing values as the literal string "N/A";
# recode those to real NA, drop the affected rows, and convert back to a factor
text$Sex = as.character(text$Sex)
text$Sex = ifelse(str_detect(text$Sex, "N/A"), NA, text$Sex)
text = text %>% na.omit()
text$Sex = as.factor(text$Sex)

Is the review date (Date) a unique identifier?

length(unique(text$Date))
## [1] 1961

There are 1961 unique dates out of 2255 reviews, so some reviews were written on the same date and the date cannot serve as a unique identifier of a review. For simplicity, I create an id column to uniquely identify each review.

text$id = 1:nrow(text)

Proceed with data manipulation

text$Review = str_to_lower(text$Review)             # lowercase everything
text$Review = gsub("[[:punct:]]", "", text$Review)  # strip punctuation


# lemmatize with mystem (-c: copy input, -l: print lemmas only, -d: contextual disambiguation)
text$lemma <- system2("mystem", c("-c","-l","-d"), input = text$Review, stdout = TRUE)
# mystem wraps each lemma in braces, e.g. "{книга?}"; keep only the lemma and drop any trailing "?"
text$lemma <- str_replace_all(text$lemma, "\\{([^}]+?)([?]+)?\\}", "\\1")
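To make the brace-stripping step concrete, here is a toy example on made-up mystem-style output (my own string, not a line from the data; the "?" marks a lemma mystem was unsure about):

str_replace_all("{хороший} {книга?}", "\\{([^}]+?)([?]+)?\\}", "\\1")
## expected: "хороший книга"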

rustopwords <- data.frame(
    words = c(stopwords("ru"),
              "книга","это","который","весь","свой","мочь","читать",
              "герой","автор","история","просто","роман","очень","самый"),
    stringsAsFactors = FALSE)

text.tok = text %>% unnest_tokens(words, lemma) %>% na.omit() %>% anti_join(rustopwords, by = "words")
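Before casting the matrix, it is worth glancing at the most frequent tokens that survived stopword removal, to judge whether the custom stopword list needs extending (a sketch; I have not verified its output on this data):

text.tok %>% count(words, sort = TRUE) %>% head(10)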

Creating a document-term matrix, dropping words that occur in fewer than 2% or in more than 98% of the documents, and applying TF-IDF weighting.

Eliminating 10% (and even 5%) of the most and the least frequent words yields a very limited number of features, which in turn can result in bad model performance.

Thus, I decided to eliminate only the top and bottom 2% of words, which leaves 917 features and seems fine. Still, the matrix unfortunately remains extremely sparse.

# count word occurrences per review and cast into a document-feature matrix
text.tok_1 <- text.tok %>% count(id, words) %>% cast_dfm(id, words, n)

books.clean = text.tok_1 %>%   # cast_dfm() already returns a dfm, so the old as.dfm() call was a no-op and is dropped
          dfm_trim(max_docfreq=0.98, min_docfreq=0.02, docfreq_type="prop") %>%
          dfm_tfidf
books.clean
## Document-feature matrix of: 2,255 documents, 917 features (94.7% sparse).
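As a quick sanity check on what survived the trimming, topfeatures() lists the features with the largest summed values (a sketch; on a TF-IDF-weighted dfm these are summed weights, not raw counts):

topfeatures(books.clean, 10)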

Creating data partition

# note: initial_split() draws the partition at random, so without a set.seed()
# immediately before this line the split (and everything downstream) changes on each knit
books.split  <- initial_split(text, prop=0.8, strata=Sex)
books.split
## <1805/450/2255>
books.train <- training(books.split)
table(books.train$Sex)
## 
##    Ж    М 
## 1573  232
books.test  <- testing(books.split)
table(books.test$Sex)
## 
##   Ж   М 
## 385  65
# keep only the dtm rows whose document name (the review id) falls in each partition
train.dtm  <- books.clean %>% dfm_subset(docnames(text.tok_1) %in% books.train$id)
test.dtm  <- books.clean %>% dfm_subset(docnames(text.tok_1) %in% books.test$id)

trainY <- books.train$Sex
testY <- books.test$Sex
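Since the split is stratified by Sex, the class proportions in train and test should nearly match; a quick check (a sketch):

round(prop.table(table(books.train$Sex)), 3)
round(prop.table(table(books.test$Sex)), 3)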

Building models with different types of regularization

At the beginning of each chunk that computes the regularization parameter, I set a seed so that a different lambda is not drawn on every run. As we already discussed in the seminar, this is not the best solution, but since I want my model coefficients to stay constant, it is the best I can do so far.

Ridge

set.seed(321)
cv.ridge <- cv.glmnet(x=train.dtm, y=trainY, alpha=0, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.ridge$lambda.min
## [1] 0.153112

Lasso

set.seed(321)
cv.lasso <- cv.glmnet(x=train.dtm, y=trainY, alpha=1, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.lasso$lambda.min
## [1] 0.003620316

Elasticnet

set.seed(321)
cv.elasticnet <- cv.glmnet(x=train.dtm, y=trainY, alpha = 0.5, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.elasticnet$lambda.min
## [1] 0.0060113
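Before predicting, the three fits can be compared on cross-validated AUC at their best lambda (a sketch using the cvm field of the cv.glmnet objects; I have not verified the numbers it prints):

sapply(list(ridge = cv.ridge, lasso = cv.lasso, elasticnet = cv.elasticnet),
       function(cv) cv$cvm[which(cv$lambda == cv$lambda.min)])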

Predicting

Lasso

It can be seen that the model performs very poorly on the test set. It fails to recognize male reviews almost entirely, labelling everything but a single review as female. It sucks!

set.seed(321)
predicted.lasso <- as.factor(predict(cv.lasso, test.dtm, type="class"))
table(predicted.lasso)
## predicted.lasso
##   Ж   М 
## 449   1
table(testY)
## testY
##   Ж   М 
## 385  65
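Part of the problem is that type="class" applies a fixed 0.5 probability cutoff, which an 87:13 class ratio easily overwhelms. A quick diagnostic (a sketch; the 0.3 cutoff is purely illustrative and not tuned on anything) is to predict probabilities and count how many exceed a lower threshold:

prob.lasso <- predict(cv.lasso, test.dtm, type = "response")  # P(review is "М"), the second factor level
table(ifelse(prob.lasso > 0.3, "М", "Ж"))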

Ridge

The model with Ridge regularization gives a very disappointing result. Not a single male review is recognized :(

set.seed(321)
predicted.ridge <- as.factor(predict(cv.ridge, test.dtm, type="class"))
table(predicted.ridge)
## predicted.ridge
##   Ж 
## 450
table(testY)
## testY
##   Ж   М 
## 385  65

Elasticnet

The same goes for elasticnet.

set.seed(321)
predicted.elasticnet <- as.factor(predict(cv.elasticnet, test.dtm, type="class"))
table(predicted.elasticnet)
## predicted.elasticnet
##   Ж 
## 450
table(testY)
## testY
##   Ж   М 
## 385  65

Looking the bitter truth in the eye

Lasso

cm.lasso <- confusionMatrix(data = predicted.lasso, reference = as.factor(testY), positive="М")
cm.lasso
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   Ж   М
##          Ж 384  65
##          М   1   0
##                                           
##                Accuracy : 0.8533          
##                  95% CI : (0.8172, 0.8847)
##     No Information Rate : 0.8556          
##     P-Value [Acc > NIR] : 0.5856          
##                                           
##                   Kappa : -0.0044         
##                                           
##  Mcnemar's Test P-Value : 8.851e-15       
##                                           
##             Sensitivity : 0.000000        
##             Specificity : 0.997403        
##          Pos Pred Value : 0.000000        
##          Neg Pred Value : 0.855234        
##              Prevalence : 0.144444        
##          Detection Rate : 0.000000        
##    Detection Prevalence : 0.002222        
##       Balanced Accuracy : 0.498701        
##                                           
##        'Positive' Class : М               
## 

As funny as it sounds, the accuracy of a model that almost never recognizes the minority class is still 85%! That is essentially the No Information Rate (385/450 ≈ 0.856), i.e. the accuracy of always guessing Ж. Moreover, Kappa is negative and Sensitivity = 0, which confirms the concern just outlined.

Ridge

cm.ridge <- confusionMatrix(data = predicted.ridge, reference = as.factor(testY), positive="М")
cm.ridge
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   Ж   М
##          Ж 385  65
##          М   0   0
##                                           
##                Accuracy : 0.8556          
##                  95% CI : (0.8196, 0.8867)
##     No Information Rate : 0.8556          
##     P-Value [Acc > NIR] : 0.533           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.051e-15       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8556          
##              Prevalence : 0.1444          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : М               
## 

A very sad result, with Kappa = 0 and not a single male review correctly recognized (Sensitivity = 0).

Now let's look at elasticnet:

cm.elasticnet <- confusionMatrix(data = predicted.elasticnet, reference = as.factor(testY), positive="М")
cm.elasticnet
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   Ж   М
##          Ж 385  65
##          М   0   0
##                                           
##                Accuracy : 0.8556          
##                  95% CI : (0.8196, 0.8867)
##     No Information Rate : 0.8556          
##     P-Value [Acc > NIR] : 0.533           
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.051e-15       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8556          
##              Prevalence : 0.1444          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : М               
## 

The accuracy is again just the No Information Rate of 85.6%, and this time Kappa is exactly 0. Dismal results.

The most important words?

For Lasso:

# coefficients at the best lambda, sorted by absolute size
# (the coefficient column comes out named "1" here; newer glmnet versions name it "s1")
coef(cv.lasso, cv.lasso$lambda.min) %>%
    as.matrix %>% as.data.frame %>%
    tibble::rownames_to_column() %>%
    arrange(-abs(`1`)) %>% head(15)

As far as I can interpret it, a negative coefficient pushes a review containing the word toward the "Ж" class, and a positive one toward the "М" class. Thus, if my judgement is correct, суть (essence), рекомендовать (recommend), собирать (collect), мировой (world), иметь (have), работа (work) are words that characterize male discourse, while узнавать (find out), любовь (love!), смочь (manage), приятно (pleasant), картинка (picture), чудесный (wonderful) characterize female discourse.
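To read the two directions separately, the same table can be split by coefficient sign (a sketch; I rename the coefficient column because its name differs across glmnet versions):

coefs <- coef(cv.lasso, cv.lasso$lambda.min) %>%
    as.matrix %>% as.data.frame %>%
    tibble::rownames_to_column("word")
names(coefs)[2] <- "beta"
coefs %>% filter(word != "(Intercept)") %>% arrange(desc(beta)) %>% head(10)  # most "М"-leaning
coefs %>% filter(word != "(Intercept)") %>% arrange(beta) %>% head(10)        # most "Ж"-leaning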

This time I will not attempt an error analysis, because my models turned out quite poor and misclassify a lot; consequently, too many different words contribute to the errors.

Conclusion

All in all, it was interesting to build a logistic regression model on new data without any specific prior expectations. Several factors could have contributed to the bad model performance:

First, the data itself can be of poor quality.

Second, by fixing the seed I may have locked in an unlucky random draw and cut off better lambdas.

Third, I could have made some mistakes along the way.

Finally, the test subset is really small and disproportionate (only 65 male reviews). Since the initial data was very skewed as well, the model may simply not have had enough data to learn to distinguish male reviews, and it therefore performs badly on the even smaller test set. One cheap experiment for the imbalance is sketched below.
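For the class-imbalance point in particular, cv.glmnet() accepts per-observation weights, so upweighting the minority class would have been cheap to try (a sketch I have not run on this data; it reuses the lasso settings from above):

# weight each М review by the Ж:М ratio so both classes contribute equally
w <- ifelse(trainY == "М", sum(trainY == "Ж") / sum(trainY == "М"), 1)
set.seed(321)
cv.lasso.w <- cv.glmnet(x = train.dtm, y = trainY, alpha = 1, family = "binomial",
                        type.measure = "auc", nfolds = 5, weights = w, standardize = FALSE)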

p.s. I HAD TO CHANGE MY INTERPRETATIONS SEVERAL TIMES BEFORE SETTING set.seed() ALMOST EVERYWHERE (NOT ONLY BEFORE THE LAMBDA CALCULATION BUT ALSO AT THE PREDICTION STAGE)!