set.seed
but it didn’t help. I am unable to find the source of randomness. I changed my interpretations several times but with every new knitting the numbers in the resulting accuracy table change. I am sorry for that :`(In this project I am going to build a logistic regression based on book`s reviews data provided by males and females.
text <- read_delim("~/review.csv", ";", escape_double = FALSE, trim_ws = TRUE)
text$Sex = as.character(text$Sex)
text$Sex = ifelse(str_detect(text$Sex, "N/A"), NA, text$Sex)
text = text %>% na.omit()
text$Sex = as.factor(text$Sex)
Является ли дата написания отзыва (Date
) уникальным идентификатором?
length(unique(text$Date))
## [1] 1961
There are 1961 unique data points out of 2255 reviews. So, some reviews are written at the same date. Thus, date cannot be used as a unique identifier of the review. For simplicity, I will create an id
column to uniquely identify each review.
text$id = 1:nrow(text)
Proceed with data manipulation
text$Review = str_to_lower(text$Review)
text$Review = gsub("[[:punct:]]", "",text$Review)
text$lemma <-system2("mystem", c("-c","-l","-d"), input = text$Review,stdout = T)
text$lemma <- str_replace_all(text$lemma, "\\{([^}]+?)([?]+)?\\}", "\\1")
rustopwords <- data.frame(words= c(stopwords("ru"),"книга","это","который","весь","свой","мочь","читать","герой","автор","история","просто","роман","очень","самый"), stringsAsFactors=FALSE)
text.tok = text %>% unnest_tokens(words, lemma) %>% na.omit() %>% anti_join(rustopwords, by = "words")
Elimination 10 (and even 5) % of the most and the least frequent words yilds a very limited number of features, which in turn can result in bad model performance.
Thus, I decided to eliminate only top and bottom 2 % of words, which results in 917 features which seems fine. Still, the matrix remains extremely sparsed, unfortunately.
text.tok_1 <- text.tok %>% count(id, words) %>% cast_dfm(id, words, n)
books.clean = text.tok_1 %>% as.dfm %>%
dfm_trim(max_docfreq=0.98, min_docfreq=0.02, docfreq_type="prop") %>%
dfm_tfidf
books.clean
## Document-feature matrix of: 2,255 documents, 917 features (94.7% sparse).
books.split <- initial_split(text, prop=0.8, strata=Sex)
books.split
## <1805/450/2255>
books.train <- training(books.split)
table(books.train$Sex)
##
## Ж М
## 1573 232
books.test <- testing(books.split)
table(books.test$Sex)
##
## Ж М
## 385 65
train.dtm <- books.clean %>% dfm_subset(docnames(text.tok_1) %in% books.train$id)
test.dtm <- books.clean %>% dfm_subset(docnames(text.tok_1) %in% books.test$id)
trainY <- books.train$Sex
testY <- books.test$Sex
In the beggining of the chunck with regularization parameter calculation I am setting up a set seed to avoid multiple lmbda creation. As we have already discussed on the seminar, this is not the best solution, but since I want my model coefficients to be constant, this is the best I can do so far.
Ridge
set.seed(321)
cv.ridge <- cv.glmnet(x=train.dtm, y=trainY, alpha=0, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.ridge$lambda.min
## [1] 0.153112
Lasso
set.seed(321)
cv.lasso <- cv.glmnet(x=train.dtm, y=trainY, alpha=1, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.lasso$lambda.min
## [1] 0.003620316
Elasticnet
set.seed(321)
cv.elasticnet <- cv.glmnet(x=train.dtm, y=trainY, alpha = 0.5, family="binomial", type.measure="auc", nfolds = 5, standardize=FALSE)
cv.elasticnet$lambda.min
## [1] 0.0060113
Lasso
It can be seen that the model performs very poorly on the test set. It almost fails to recognize any male`s reviews as such, treating everything as female’s ones (2 exceptions). It sucks!
set.seed(321)
predicted.lasso <- as.factor(predict(cv.lasso, test.dtm, type="class"))
table(predicted.lasso)
## predicted.lasso
## Ж М
## 449 1
table(testY)
## testY
## Ж М
## 385 65
Ridge
Model with Ridge typy gives very disappointing result. No male’s reviews are recognized :(
set.seed(321)
predicted.ridge <- as.factor(predict(cv.ridge, test.dtm, type="class"))
table(predicted.ridge)
## predicted.ridge
## Ж
## 450
table(testY)
## testY
## Ж М
## 385 65
Elasticnet
The same goes for elasticnet.
set.seed(321)
predicted.elasticnet <- as.factor(predict(cv.elasticnet, test.dtm, type="class"))
table(predicted.elasticnet)
## predicted.elasticnet
## Ж
## 450
table(testY)
## testY
## Ж М
## 385 65
Lasso
cm.lasso <- confusionMatrix(data = predicted.lasso, reference = as.factor(testY), positive="М")
cm.lasso
## Confusion Matrix and Statistics
##
## Reference
## Prediction Ж М
## Ж 384 65
## М 1 0
##
## Accuracy : 0.8533
## 95% CI : (0.8172, 0.8847)
## No Information Rate : 0.8556
## P-Value [Acc > NIR] : 0.5856
##
## Kappa : -0.0044
##
## Mcnemar's Test P-Value : 8.851e-15
##
## Sensitivity : 0.000000
## Specificity : 0.997403
## Pos Pred Value : 0.000000
## Neg Pred Value : 0.855234
## Prevalence : 0.144444
## Detection Rate : 0.000000
## Detection Prevalence : 0.002222
## Balanced Accuracy : 0.498701
##
## 'Positive' Class : М
##
As funny as it sounds, the accuracy of the model which does not recognize the opposite class is still 84%! However, we can see that Kappa = is negative, which means we did only worse (and Specificity = 0), which indicates the concern just outlined.
Ridge
cm.ridge <- confusionMatrix(data = predicted.ridge, reference = as.factor(testY), positive="М")
cm.ridge
## Confusion Matrix and Statistics
##
## Reference
## Prediction Ж М
## Ж 385 65
## М 0 0
##
## Accuracy : 0.8556
## 95% CI : (0.8196, 0.8867)
## No Information Rate : 0.8556
## P-Value [Acc > NIR] : 0.533
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 2.051e-15
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8556
## Prevalence : 0.1444
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : М
##
Very sad result with Kappa = 0 and no men`s class are correctly recognized.
Now let`s look at elasticnet:
cm.elasticnet <- confusionMatrix(data = predicted.elasticnet, reference = as.factor(testY), positive="М")
cm.elasticnet
## Confusion Matrix and Statistics
##
## Reference
## Prediction Ж М
## Ж 385 65
## М 0 0
##
## Accuracy : 0.8556
## 95% CI : (0.8196, 0.8867)
## No Information Rate : 0.8556
## P-Value [Acc > NIR] : 0.533
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 2.051e-15
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8556
## Prevalence : 0.1444
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : М
##
The accuracy is still 84%, and Kappa is negative again. Shitty results.
For Lasso:
coef(cv.lasso, cv.lasso$lambda.min) %>%
as.matrix %>% as.data.frame %>%
tibble::rownames_to_column() %>%
arrange(-abs(`1`)) %>% head(15)
As far as I can interpret it, negative sign means that the word provide incentives to be assigned to “Ж” class, and the positive sign refers to the probability(?) of this word to be assigned to the “М” class. Thus, if my judjement is correct, суть, рекомендовать, собирать, мировой, иметь, работа - words that can characterized men`s discourse, while узнавать,любовь(!),смочь,приятно, картинка, чудесный - female’s one.
This time I would not go for error analysis because my models appeared to be quiet poor and misclassified a lot. Consequently, there are a lot of different words causing this.
All in all, it was interesting to build a logistic regression model on a new data without any prior specific expectation. There are several sources could be involved resulting in bad model performance:
First, the data itself can be of poor quality.
Second, I could limit some better lambdas while setting a set seed and the random result appeared to be just this.
Third, I could do some mistakes along the way.
Finally, the test subset is really small and disproportional (only ~50 male review). Since the initial data was very skewed as well, the model could just have not enought data to learn how to distinguish males’ reviews. Thus, performing badly of the even smaller test set.
p.s. I HAD TO CHANGE INTERPRETATIONS SEVERAL TIMES BEFORE SETTING A SET.SEED ALMOST EVERYWHERE! (NOT ONLY AFTER LAMDA CALCULATION BUT ALSO ON THE STAGE OF PREDICTION).