Our personal sentiments on a subject are sometimes conveyed through the vernacular we use in writing. How does the language we use reveal thought patterns or life outlooks? In what way can words with a negative sentiment indicate an underlying mental health issue? Exploring questions like these is a pressing task for people who work in the mental health professions. The Mental Health Corpus data set contains texts that exhibit negative mental health attributes and texts that do not. In this analysis, we build a model that can distinguish between the two types of text.
library(tidyverse)
library(tidytext)
library(superml)
library(e1071)
library(caret)
library(ROCR)
The original data set contains two columns: the text and the label. A label of 1 indicates that the text is related to negative mental health, while 0 indicates that it is not. The data was cleaned and transformed to prepare it for the model. This part of the analysis is mainly concerned with building the model; links to the full preparation are provided separately.
Summary of cleaning and transformation (sketched below):
- Some observations had issues in the text column; these observations were dropped.
- The label column was converted to a factor type.
The result is the comments data set below.
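A minimal sketch of those two steps, assuming the raw file has text and label columns; the file name is hypothetical and the exact drop criterion is an assumption, so refer to the linked preparation for the actual code.
raw <- read.csv('mental_health_raw.csv')                  # hypothetical file name
raw <- raw[!is.na(raw$text) & trimws(raw$text) != "", ]   # drop unusable text rows (assumed criterion)
raw$label <- as.factor(raw$label)                         # the 0/1 label becomes a factor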
# The cleaned data was saved in four parts; read each part and recombine them.
#comments <- read.csv('comments.csv')  # local copy, if available
comments1 <- read.csv('https://raw.githubusercontent.com/djunga/DATA622/main/comments1.csv')
comments2 <- read.csv('https://raw.githubusercontent.com/djunga/DATA622/main/comments2.csv')
comments3 <- read.csv('https://raw.githubusercontent.com/djunga/DATA622/main/comments3.csv')
comments4 <- read.csv('https://raw.githubusercontent.com/djunga/DATA622/main/comments4.csv')
comments <- rbind(comments1, comments2, comments3, comments4)
comments <- comments[,-1]  # drop the first column, which is not needed for modeling
The data set after cleaning and transformation.
After examining the result, we decide to use a Naive Bayes model.
We use a 75-25 training-to-test split ratio.
set.seed(1234)
sample_set <- sample(nrow(comments), round(nrow(comments)*0.75), replace=FALSE)
comments_train <- comments[sample_set, ]
comments_test <- comments[-sample_set, ]
Are the classes balanced?
Overall data:
round(prop.table(table(select(comments, label))),2)
##
## negative positive
## 0.51 0.49
Training data:
round(prop.table(table(select(comments_train, label))),2)
##
## negative positive
## 0.5 0.5
Test data:
round(prop.table(table(select(comments_test, label))),2)
##
## negative positive
## 0.51 0.49
The classes are balanced.
To account for zero-frequency words, which would otherwise zero out the class-conditional probability product, we use Laplace smoothing with a pseudocount of 1.
comments_mod <- naiveBayes(label ~ .-index, data=comments_train, laplace=1)
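As a toy illustration of what laplace = 1 does (a made-up single-feature example, not the real corpus): a word that never appears in one class would otherwise receive a conditional probability of exactly zero and veto any document containing it.
toy <- data.frame(
  dark  = factor(c("yes", "yes", "no", "no")),   # hypothetical word-presence feature
  label = factor(c("negative", "negative", "positive", "positive"))
)
naiveBayes(label ~ dark, data = toy, laplace = 0)$tables$dark  # P(dark = yes | positive) is exactly 0
naiveBayes(label ~ dark, data = toy, laplace = 1)$tables$dark  # smoothed to (0 + 1) / (2 + 2) = 0.25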
Raw predictions
comments_pred_prob <- predict(comments_mod, comments_test, type="raw")
head(comments_pred_prob)
## negative positive
## [1,] 1.000000e+00 4.054344e-58
## [2,] 0.000000e+00 1.000000e+00
## [3,] 0.000000e+00 1.000000e+00
## [4,] 7.563583e-267 1.000000e+00
## [5,] 1.000000e+00 9.503880e-121
## [6,] 1.000000e+00 3.733272e-153
Class predictions
comments_pred <- predict(comments_mod, comments_test, type="class")
head(comments_pred)
## [1] negative positive positive positive negative negative
## Levels: negative positive
confusionMatrix(comments_pred, as.factor(comments_test$label), positive="positive")
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 3153 869
## positive 407 2565
##
## Accuracy : 0.8176
## 95% CI : (0.8083, 0.8265)
## No Information Rate : 0.509
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6341
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7469
## Specificity : 0.8857
## Pos Pred Value : 0.8631
## Neg Pred Value : 0.7839
## Prevalence : 0.4910
## Detection Rate : 0.3667
## Detection Prevalence : 0.4249
## Balanced Accuracy : 0.8163
##
## 'Positive' Class : positive
##
At 81.8% accuracy, the model performed well. The kappa value is 0.63, which measures agreement beyond what the class proportions alone would produce; values in the 0.6 to 0.8 range are conventionally read as substantial agreement, so by this measure the model also performed reasonably well.
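For reference, kappa can be recomputed by hand from the confusion matrix above; it compares the observed accuracy with the agreement expected from the row and column totals alone.
cm <- matrix(c(3153, 407, 869, 2565), nrow = 2)    # rows = prediction, cols = reference
p_o <- sum(diag(cm)) / sum(cm)                     # observed accuracy, ~0.818
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # agreement expected by chance, ~0.501
(p_o - p_e) / (1 - p_e)                            # ~0.634, matching caret's Kappa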
To visualize how the model did, we plot an ROC curve.
roc_pred <-
prediction(
predictions = comments_pred_prob[, "positive"],
labels = comments_test$label
)
roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve", col = "green", lwd=3)
abline(a=0, b=1, lwd=3, lty=2, col=1)
Area Under the Curve, AUC
auc_perf <- performance(roc_pred, measure = "auc")
comments_auc <- unlist(slot(auc_perf, "y.values"))
comments_auc
## [1] 0.8849088
The area under the curve is 0.88; the closer to 1, the better. Overall, the Naive Bayes model did a decent job of predicting whether a comment is related to negative mental health.
The UCI Machine Learning Repository has a data set for glass classification, which has also been uploaded to Kaggle. The goal is to predict the type of glass from predictors describing the sample's refractive index and chemical composition. Which type of model is best for this task?
library(tidyverse)
library(corrplot)
library(reshape2)
library(DMwR)
library(MASS)
library(car)
library(caret)
library(rpart)
library(rstatix) # possibly checking for homogeneity of covariance matrices
library(rpart.plot)
library(ROCR) # only for binary classification
library(pROC)
glass <- read_csv('https://raw.githubusercontent.com/djunga/DATA622/main/glass.csv', col_types = 'nnnnnnnnnnf')
glimpse(glass)
## Rows: 214
## Columns: 10
## $ RI <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.~
## $ Na <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13~
## $ Mg <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,~
## $ Al <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,~
## $ Si <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72~
## $ K <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,~
## $ Ca <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,~
## $ Ba <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Fe <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,~
## $ Type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
The glass types in the raw data are 1, 2, 3, 5, 6, and 7; there is no type 4. We relabel the six types as 1 through 6 so the factor levels are consecutive.
glass$Type <- factor(glass$Type, labels = c(1,2,3,4,5,6))
Add unique ID numbers for each observation.
glass <- cbind(data.frame(Id = as.integer(rownames(glass))), glass)
glass_melt = melt(subset(glass, select=-c(Type)), id.vars = "Id")
ggplot(aes(value), data = glass_melt) + geom_histogram(stat = "bin", fill = "navyblue") + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are many zeroes in Mg, K, Ba, and Fe. The RI, Na, Al, Si, and Ca predictors barely look normally distributed. Curiously, could there be a correlation between glass type and the presence of zeroes?
Which class Types have all these zeroes?
# Count, for each glass Type, how many observations have a value of exactly zero
# in each predictor; predictors with no zeroes get all-zero placeholder columns.
zeroes <- data.frame()
ba_zero <- glass %>%
group_by(Type) %>%
filter(Ba==0) %>%
summarize(
Ba_Zero = n()
)
mg_zero <- glass %>%
group_by(Type) %>%
filter(Mg==0) %>%
summarize(
Mg_Zero = n()
)
mg_zero <- rbind(mg_zero, data.frame(Type=c(1,3), Mg_Zero=c(0,0))) %>% arrange(Type)
fe_zero <- glass %>%
group_by(Type) %>%
filter(Fe==0) %>%
summarize(
Fe_Zero = n()
)
ri_zero <- data.frame(Type=c(1,2,3,4,5,6), RI_Zero=c(0,0,0,0,0,0))
na_zero <- data.frame(Type=c(1,2,3,4,5,6), Na_Zero=c(0,0,0,0,0,0))
al_zero <- data.frame(Type=c(1,2,3,4,5,6), Al_Zero=c(0,0,0,0,0,0))
si_zero <- data.frame(Type=c(1,2,3,4,5,6), Si_Zero=c(0,0,0,0,0,0))
k_zero <- glass %>%
group_by(Type) %>%
filter(K==0) %>%
summarize(
K_Zero = n()
)
k_zero <- rbind(k_zero, c(Type=5,K_Zero=0)) %>% arrange(Type)
ca_zero <- data.frame(Type=c(1,2,3,4,5,6), Ca_Zero=c(0,0,0,0,0,0))
zeroes <- cbind(ri_zero, na_zero, mg_zero, al_zero, si_zero, k_zero, ca_zero, ba_zero, fe_zero)
zeroes <- zeroes[, c(1,2,4,6,8,10,12, 14,16,18)]
head(zeroes)
## Type RI_Zero Na_Zero Mg_Zero Al_Zero Si_Zero K_Zero Ca_Zero Ba_Zero Fe_Zero
## 1 1 0 0 0 0 0 1 0 67 45
## 2 2 0 0 9 0 0 3 0 70 44
## 3 3 0 0 0 0 0 1 0 16 12
## 4 4 0 0 7 0 0 9 0 11 11
## 5 5 0 0 3 0 0 0 0 9 9
## 6 6 0 0 23 0 0 16 0 3 23
Ba and Fe contain the most zeroes.
summary(glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 4:13
## 3rd Qu.:0.10000 5: 9
## Max. :0.51000 6:29
Which predictors are most likely to influence glass type?
corrplot::corrplot(cor(dplyr::select(glass, 'RI':'Fe'), use = "complete.obs"), method = 'circle', type = 'lower', insig='blank', addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)
round(prop.table(table(dplyr::select(glass, Type))),2)
##
## 1 2 3 4 5 6
## 0.33 0.36 0.08 0.06 0.04 0.14
We use upsampling (caret's upSample, which resamples the minority classes with replacement) to balance the classes.
glass <- upSample(x = glass[, -ncol(glass)], y = glass$Type)
colnames(glass)[ncol(glass)] <- "Type"  # upSample names its class column "Class"; rename it back to Type
table(glass$Type)
##
## 1 2 3 4 5 6
## 76 76 76 76 76 76
round(prop.table(table(dplyr::select(glass, Type))),2)
##
## 1 2 3 4 5 6
## 0.17 0.17 0.17 0.17 0.17 0.17
Now there are 456 observations, up from 214 initially.
Takeaways from the data exploration:
Since the predictors are far from normally distributed and the class covariance matrices do not appear equal, linear discriminant analysis is not appropriate, and quadratic discriminant analysis still relies on the normality assumption. Naive Bayes is also a poor fit: it assumes class-conditional independence of the predictors, and the correlation matrix shows that several predictors are strongly correlated. In addition, the predictors are all continuous, so we would have to discretize them (or lean on a Gaussian assumption) to use Naive Bayes. Logistic regression is best suited to binary classification, and it makes assumptions about the data that this data does not satisfy.
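As one quick check of the normality claim (a sketch using base R's shapiro.test on each predictor), p-values near zero would corroborate the histograms above.
# Shapiro-Wilk p-values per predictor; values near zero indicate clear departures from normality.
sapply(dplyr::select(glass, RI:Fe), function(x) shapiro.test(x)$p.value)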
We decide on a random forest. Its basic building block is the decision tree, which makes no strong assumptions about the underlying data and can handle missing and noisy values. Class imbalance would hinder its performance, but we balanced the data earlier. A random forest will likely give higher predictive accuracy than a single tree, in exchange for the explainability of the results.
We use a 75-25 train-to-test split.
set.seed(1234)
sample_set <- sample(nrow(glass), round(nrow(glass)*0.75), replace=FALSE)
glass_train <- glass[sample_set, ]
glass_test <- glass[-sample_set, ]
For the random forest model, we specify the 0.632 bootstrap as the resampling technique with 3 resampling iterations. The mtry parameter for the trees is typically set to the square root of the number of predictors, which here evaluates to 3.
rf_mod <- train(
Type ~ .,
data = glass_train,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "boot632", number=3),
tuneGrid = expand.grid(.mtry = 3) # mtry=sqrt(num predictors)
)
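As a variation (a sketch only; rf_tuned is a hypothetical name and these results are not used below), caret can tune mtry over a small grid instead of fixing it at 3 and keep whichever value gives the best resampled accuracy.
rf_tuned <- train(
  Type ~ .,
  data = glass_train,
  metric = "Accuracy",
  method = "rf",
  trControl = trainControl(method = "boot632", number = 3),
  tuneGrid = expand.grid(.mtry = c(2, 3, 5, 7))   # candidate values around sqrt(9) = 3
)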
rf_pred <- predict(rf_mod, glass_test)
confusionMatrix(rf_pred, glass_test$Type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6
## 1 21 1 0 0 0 0
## 2 3 15 0 0 0 2
## 3 0 0 15 0 0 0
## 4 0 3 0 18 0 2
## 5 0 0 0 0 16 0
## 6 0 0 0 0 0 18
##
## Overall Statistics
##
## Accuracy : 0.9035
## 95% CI : (0.8339, 0.9508)
## No Information Rate : 0.2105
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8839
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.8750 0.7895 1.0000 1.0000 1.0000 0.8182
## Specificity 0.9889 0.9474 1.0000 0.9479 1.0000 1.0000
## Pos Pred Value 0.9545 0.7500 1.0000 0.7826 1.0000 1.0000
## Neg Pred Value 0.9674 0.9574 1.0000 1.0000 1.0000 0.9583
## Prevalence 0.2105 0.1667 0.1316 0.1579 0.1404 0.1930
## Detection Rate 0.1842 0.1316 0.1316 0.1579 0.1404 0.1579
## Detection Prevalence 0.1930 0.1754 0.1316 0.2018 0.1404 0.1579
## Balanced Accuracy 0.9319 0.8684 1.0000 0.9740 1.0000 0.9091
ROC and AUC
The multiclass AUC below was calculated with the pROC package, using the class predictions from dt_mod, a single decision tree fit for comparison (mentioned at the end of this report but not shown above). Since there are more than two classes in the response variable, the pROC documentation notes that the multiclass ROC curve itself cannot be plotted.
prediction <- predict(dt_mod, glass_test, type="vector")
roc_object <- multiclass.roc(glass_test$Type, prediction, percent=TRUE)
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
print(roc_object)
##
## Call:
## multiclass.roc.default(response = glass_test$Type, predictor = prediction, percent = TRUE)
##
## Data: prediction with 6 levels of glass_test$Type: 1, 2, 3, 4, 5, 6.
## Multi-class area under the curve: 87.23%
# Documentation of pROC's multiclass.roc says it cannot be plotted.
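As a sketch of an alternative (assuming pROC's support for a matrix of per-class probabilities as the predictor), the random forest itself could be scored by passing its predicted probabilities to multiclass.roc; the probability columns returned by caret are named after the factor levels, as the function expects.
rf_prob <- predict(rf_mod, glass_test, type = "prob")   # per-class probabilities from the forest
rf_roc <- multiclass.roc(glass_test$Type, rf_prob, percent = TRUE)
print(rf_roc)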
We would have liked to see plots of the individual trees to understand how the decisions were made in each one and which predictors were selected at each level. However, compared to a single decision tree, a random forest acts more like a black box; we cannot see its inner workings. We have traded the explainability of the result for higher accuracy. We did test the data with a single decision tree, and it performed about 16% worse.
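Although the individual trees inside the forest are not easy to inspect, two partial views are possible: caret's variable importance for the fitted forest, and a single rpart tree (similar in spirit to the comparison tree mentioned above; dt_sketch is a hypothetical name) that rpart.plot can draw. This is a sketch, not part of the fitted pipeline.
caret::varImp(rf_mod)                                          # which predictors the forest relies on most
dt_sketch <- rpart::rpart(Type ~ . - Id, data = glass_train)   # single comparison tree
rpart.plot::rpart.plot(dt_sketch)                              # shows the splits chosen at each level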