Results

Mental Health Issues in Comments: Exploratory Analysis

Introduction

Our personal sentiments on a subject are often conveyed through the language we use in writing. How does the language we use reveal thought patterns or life outlooks? In what way are words with a negative sentiment indicative of an underlying mental health issue? Exploring such questions is a pressing task for mental health professionals. The Mental Health Corpus data set contains texts that exhibit negative mental health attributes and texts that do not. In this analysis, we build a model that can distinguish between the two types of texts.

library(tidyverse)
library(tidytext)
library(superml)
library(e1071)
library(caret)
library(ROCR)

The original data set contains 2 columns: the text and the label. A label of 1 indicates that the text is related to negative mental health, while 0 indicates that it is not. The data was cleaned and transformed to prepare it for the model built later. This part of the analysis is mainly concerned with building the model; the preparation steps are summarized below.

Summary of cleaning and transformation:

  • There were 2 missing values in the text column. These observations were dropped.
  • The label column was converted to a factor type.
  • CountVectorizer from the superml library was used to create a document-term matrix. Its column names are the words that appear across the whole data set (see the sketch after this list).
    • Each row represented a comment. If a word appeared in the comment, then there was a 1 in the word’s column, and 0 otherwise.
    • To save space and time, the top 1000 most occurring words were chosen for the columns.
    • Stopwords were removed.
  • The labels were recoded from 0 and 1 to “negative” and “positive”.
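For reference, the vectorization step could look roughly like the sketch below. It was run in a separate preparation script, so the raw object and the exact CountVectorizer arguments (max_features, remove_stopwords) are illustrative and should be checked against the installed superml version.

# Illustrative sketch of the bag-of-words step (not run in this document).
# raw is assumed to hold the original text and label columns.
cfv <- CountVectorizer$new(max_features = 1000, remove_stopwords = TRUE)
dtm <- cfv$fit_transform(raw$text)   # rows = comments, columns = the top 1000 words
dtm <- 1 * (dtm > 0)                 # keep 0/1 presence indicators rather than counts
comments <- data.frame(dtm, label = factor(raw$label, labels = c("negative", "positive")))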

The result is the comments data set below.

#comments <- read.csv('comments.csv')
# The prepared data is stored in four pieces on GitHub; read them and recombine.
comments1 <- read.csv('https://raw.githubusercontent.com/djunga/DATA622/main/comments1.csv')
comments2 <- read.csv('https://raw.githubusercontent.com/djunga/DATA622/main/comments2.csv')
comments3 <- read.csv('https://raw.githubusercontent.com/djunga/DATA622/main/comments3.csv')
comments4 <- read.csv('https://raw.githubusercontent.com/djunga/DATA622/main/comments4.csv')
comments <- rbind(comments1, comments2, comments3, comments4)
comments <- comments[,-1]   # drop the extra leading column

The data set after cleaning and transformation.

After examining the result, we decide to use a Naive Bayes model.

  • The features are words in the data set, which are likely to be independent of each other within the different classes (positive and negative).
  • All of the features are categorical. A few continuous features could have been accommodated, but they would need to be discretized first.
  • NB models handle noisy and missing data well. We dropped the 2 missing values while cleaning, but any remaining noise in the data should not derail the model.
  • This is a large data set. An NB model does not iteratively “learn” parameters; rather, it makes predictions from probabilities estimated by counting (formalized just after this list). Those estimates are more reliable when there is more data to work with.
  • For these reasons, NB will be simple and computationally efficient.
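To make the last point concrete, the standard naive Bayes decision rule for a comment with word-presence features $w_1, \dots, w_n$ is

$$P(\text{class} \mid w_1, \dots, w_n) \;\propto\; P(\text{class}) \prod_{i=1}^{n} P(w_i \mid \text{class}),$$

so fitting the model amounts to estimating these probabilities from counts in the training data, and the class with the larger score is predicted.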

Training and Test Sets

We use a 75-25 training-to-test split ratio.

set.seed(1234)
sample_set <- sample(nrow(comments), round(nrow(comments)*0.75), replace=FALSE)
comments_train <- comments[sample_set, ]
comments_test <- comments[-sample_set, ]

Are the classes balanced?

Overall data:

round(prop.table(table(select(comments, label))),2)
## 
## negative positive 
##     0.51     0.49

Training data:

round(prop.table(table(select(comments_train, label))),2)
## 
## negative positive 
##      0.5      0.5

Test data:

round(prop.table(table(select(comments_test, label))),2)
## 
## negative positive 
##     0.51     0.49

The classes are balanced.

Training the model

To account for zero-frequency words, which would otherwise zero out the probability calculations, we use Laplace smoothing with a pseudocount of 1.

comments_mod <- naiveBayes(label ~ .-index, data=comments_train, laplace=1)
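As a worked illustration of the pseudocount (assuming the 0/1 word indicators are treated as categorical), a word that never appears among a class’s training comments still receives a small nonzero conditional probability of (0 + 1) / (n + 2) rather than zero. The counts below are hypothetical.

# Illustrative only: Laplace-smoothed estimate for a binary word indicator
n_class <- 1000   # hypothetical number of training comments in the class
n_word  <- 0      # the word never appears in that class
(n_word + 1) / (n_class + 2)   # ~0.000998 instead of 0, so the product is never wiped out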

Making predictions

Raw predictions

comments_pred_prob <- predict(comments_mod, comments_test, type="raw")
head(comments_pred_prob)
##           negative      positive
## [1,]  1.000000e+00  4.054344e-58
## [2,]  0.000000e+00  1.000000e+00
## [3,]  0.000000e+00  1.000000e+00
## [4,] 7.563583e-267  1.000000e+00
## [5,]  1.000000e+00 9.503880e-121
## [6,]  1.000000e+00 3.733272e-153

Class predictions

comments_pred <- predict(comments_mod, comments_test, type="class")
head(comments_pred)
## [1] negative positive positive positive negative negative
## Levels: negative positive

Evaluation

confusionMatrix(comments_pred, as.factor(comments_test$label), positive="positive")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction negative positive
##   negative     3153      869
##   positive      407     2565
##                                           
##                Accuracy : 0.8176          
##                  95% CI : (0.8083, 0.8265)
##     No Information Rate : 0.509           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6341          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7469          
##             Specificity : 0.8857          
##          Pos Pred Value : 0.8631          
##          Neg Pred Value : 0.7839          
##              Prevalence : 0.4910          
##          Detection Rate : 0.3667          
##    Detection Prevalence : 0.4249          
##       Balanced Accuracy : 0.8163          
##                                           
##        'Positive' Class : positive        
## 

At 81.8% accuracy, the model performed well. The kappa value is 0.63; on the commonly used Landis and Koch scale this corresponds to substantial agreement beyond chance, so this measure also indicates that the model performed reasonably well.
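As a sanity check, the headline metrics can be reproduced directly from the confusion-matrix cells above:

# Recompute accuracy, sensitivity, and specificity from the confusion-matrix counts
tp <- 2565; tn <- 3153; fp <- 407; fn <- 869
(tp + tn) / (tp + tn + fp + fn)   # accuracy    = 0.8176
tp / (tp + fn)                    # sensitivity = 0.7469
tn / (tn + fp)                    # specificity = 0.8857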

To visualize how the model did, we plot an ROC curve.

roc_pred <- 
  prediction(
    predictions = comments_pred_prob[, "positive"],
    labels = comments_test$label
  )
roc_perf <- performance(roc_pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve", col = "green", lwd=3)
abline(a=0, b=1, lwd=3, lty=2, col=1)

Area Under the Curve (AUC)

auc_perf <- performance(roc_pred, measure = "auc")
comments_auc <- unlist(slot(auc_perf, "y.values"))
comments_auc
## [1] 0.8849088

The area under the curve is 0.88. The closer to 1, the better. Overall, the NB model did a decent job at predicting whether a comment is related to negative mental health.

Glass Types: Exploratory Analysis

Introduction

The UCI Machine Learning Repository has a data set for glass classification. It has also been uploaded to Kaggle. The goal is to predict the type of glass from predictors related to a selection of the sample’s chemical attributes. Which type of model is best for this task?

library(tidyverse)
library(corrplot)
library(reshape2)
library(DMwR)
library(MASS)
library(car)
library(caret)
library(rpart) 
library(rstatix)  # possibly checking for homogeneity of covariance matrices
library(rpart.plot)
library(ROCR) # only for binary classification
library(pROC)
glass <- read_csv('https://raw.githubusercontent.com/djunga/DATA622/main/glass.csv', col_types = 'nnnnnnnnnnf')

Data Exploration

glimpse(glass)
## Rows: 214
## Columns: 10
## $ RI   <dbl> 1.52101, 1.51761, 1.51618, 1.51766, 1.51742, 1.51596, 1.51743, 1.~
## $ Na   <dbl> 13.64, 13.89, 13.53, 13.21, 13.27, 12.79, 13.30, 13.15, 14.04, 13~
## $ Mg   <dbl> 4.49, 3.60, 3.55, 3.69, 3.62, 3.61, 3.60, 3.61, 3.58, 3.60, 3.46,~
## $ Al   <dbl> 1.10, 1.36, 1.54, 1.29, 1.24, 1.62, 1.14, 1.05, 1.37, 1.36, 1.56,~
## $ Si   <dbl> 71.78, 72.73, 72.99, 72.61, 73.08, 72.97, 73.09, 73.24, 72.08, 72~
## $ K    <dbl> 0.06, 0.48, 0.39, 0.57, 0.55, 0.64, 0.58, 0.57, 0.56, 0.57, 0.67,~
## $ Ca   <dbl> 8.75, 7.83, 7.78, 8.22, 8.07, 8.07, 8.17, 8.24, 8.30, 8.40, 8.09,~
## $ Ba   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Fe   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.11, 0.24,~
## $ Type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~

The glass types present are 1, 2, 3, 5, 6, and 7; there is no type 4. We relabel them as consecutive levels 1 through 6 for consistency.

glass$Type <- factor(glass$Type, labels = c(1,2,3,4,5,6))

Add unique ID numbers for each observation.

glass <- cbind(data.frame(Id = as.integer(rownames(glass))), glass) 

Distributions of predictors

glass_melt = melt(subset(glass, select=-c(Type)), id.vars = "Id")

ggplot(aes(value), data = glass_melt) +
  geom_histogram(stat = "bin", fill = "navyblue") +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Distributions of Continuous Variables", x = "Variable", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are many zeroes in Mg, K, Ba, and Fe; curiously, could there be a relationship between glass type and the presence of zeroes? In addition, the RI, Na, Al, Si, and Ca predictors only roughly resemble normal distributions.
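As a quick, purely illustrative check of that visual impression (not part of the original preparation), a Shapiro-Wilk test can be run on each predictor; very small p-values indicate departures from normality.

# Shapiro-Wilk normality test per predictor; small p-values suggest non-normality
sapply(dplyr::select(glass, RI:Fe), function(x) shapiro.test(x)$p.value)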

Which glass types have all these zeroes?

# Count the zeroes per glass type for each predictor. Predictors with no zeroes at all get
# placeholder columns of zeroes; types dropped by the filtering are added back with a count of 0.
zeroes <- data.frame()
ba_zero <- glass %>%
            group_by(Type) %>%
            filter(Ba==0) %>%
            summarize(
              Ba_Zero = n()
            )

mg_zero <- glass %>%
            group_by(Type) %>%
            filter(Mg==0) %>%
            summarize(
              Mg_Zero = n()
            )

mg_zero <- rbind(mg_zero, data.frame(Type=c(1,3), Mg_Zero=c(0,0))) %>%  arrange(Type)

fe_zero <- glass %>%
            group_by(Type) %>%
            filter(Fe==0) %>%
            summarize(
              Fe_Zero = n()
            )

ri_zero <- data.frame(Type=c(1,2,3,4,5,6), RI_Zero=c(0,0,0,0,0,0))

 
na_zero <- data.frame(Type=c(1,2,3,4,5,6), Na_Zero=c(0,0,0,0,0,0))

al_zero <- data.frame(Type=c(1,2,3,4,5,6), Al_Zero=c(0,0,0,0,0,0))

si_zero <- data.frame(Type=c(1,2,3,4,5,6), Si_Zero=c(0,0,0,0,0,0))

k_zero <- glass %>%
            group_by(Type) %>%
            filter(K==0) %>%
            summarize(
              K_Zero = n()
            )
k_zero <- rbind(k_zero, c(Type=5,K_Zero=0)) %>% arrange(Type)

ca_zero <- data.frame(Type=c(1,2,3,4,5,6), Ca_Zero=c(0,0,0,0,0,0))

zeroes <- cbind(ri_zero, na_zero, mg_zero, al_zero, si_zero, k_zero, ca_zero, ba_zero, fe_zero)
zeroes <- zeroes[, c(1,2,4,6,8,10,12, 14,16,18)]
head(zeroes)
##   Type RI_Zero Na_Zero Mg_Zero Al_Zero Si_Zero K_Zero Ca_Zero Ba_Zero Fe_Zero
## 1    1       0       0       0       0       0      1       0      67      45
## 2    2       0       0       9       0       0      3       0      70      44
## 3    3       0       0       0       0       0      1       0      16      12
## 4    4       0       0       7       0       0      9       0      11      11
## 5    5       0       0       3       0       0      0       0       9       9
## 6    6       0       0      23       0       0     16       0       3      23

  • Ba and Fe contain the most zeroes.
  • Note that the classes are imbalanced. The glass types are not represented equally.
  • Because the classes are imbalanced, it would not make sense to judge a class’s number of zeroes from the raw counts in this table. Ratios relative to each class’s size would be better (see the sketch after this list).
  • The significance of zero for these predictors is unknown. We aren’t sure whether they are stand-ins for NA values.
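As an aside, the same per-type zero counts, and the ratios suggested above, can be computed more compactly with dplyr's across() (assuming dplyr 1.0 or later):

# Zero counts and zero proportions per glass type
glass %>% group_by(Type) %>%
  summarize(across(RI:Fe, ~ sum(.x == 0), .names = "{.col}_Zero"))
glass %>% group_by(Type) %>%
  summarize(across(RI:Fe, ~ mean(.x == 0), .names = "{.col}_ZeroProp"))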

Data Summary

summary(glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   4:13  
##  3rd Qu.:0.10000   5: 9  
##  Max.   :0.51000   6:29

Correlation plot

Which predictors are most likely to influence glass type? The plot below shows how the predictors correlate with one another; strong correlations will also matter later, when we consider models that assume the predictors are independent.

corrplot::corrplot(
  cor(dplyr::select(glass, 'RI':'Fe'), use = "complete.obs"),
  method = 'circle', type = 'lower', insig = 'blank',
  addCoef.col = 'black', number.cex = 0.8, order = 'AOE', diag = FALSE
)

Class imbalance

round(prop.table(table(dplyr::select(glass, Type))),2)
## 
##    1    2    3    4    5    6 
## 0.33 0.36 0.08 0.06 0.04 0.14

We use upsampling to balance the classes.

glass <- upSample(x = glass[, -ncol(glass)], y = glass$Type)
colnames(glass)[ncol(glass)] <- "Type"   # upSample names the outcome column "Class"; rename it back to "Type"
table(glass$Type)   
## 
##  1  2  3  4  5  6 
## 76 76 76 76 76 76
round(prop.table(table(dplyr::select(glass, Type))),2)
## 
##    1    2    3    4    5    6 
## 0.17 0.17 0.17 0.17 0.17 0.17

Now there are 456 observations, up from 214 initially.

Model Building

Takeaways from the data exploration:

  • There are a lot of zeroes, which may or may not be missing values.
  • The predictors are not normally distributed.
  • Since the predictors are not normally distributed, Box’s M-test for homogeneity of covariance matrices (which assumes multivariate normality) would not give reliable results. We therefore cannot assume that the covariance matrices are equal.
  • The classes were imbalanced, but we balanced them.

Since the predictors are not normally distributed and we cannot assume equal covariance matrices, linear and quadratic discriminant analysis are poor choices for this classification task. Naive Bayes is also unattractive: it assumes class-conditional independence, and the correlation matrix suggests the predictors are not independent of one another; in addition, the predictors are all continuous, so we would have to discretize them to use the categorical form of Naive Bayes. Logistic regression is better suited to binary classification, and it makes assumptions about the data that this data set does not satisfy.

We decide on a random forest. Its basic building block is the decision tree, which makes few assumptions about the underlying data and can handle missing and noisy values. Class imbalance would hinder its performance, but we balanced the data earlier. A random forest will likely yield higher predictive accuracy than a single tree, at the cost of the explainability of the results.

Create train and test sets

We use a 75-25 train-to-test split.

set.seed(1234)
sample_set <- sample(nrow(glass), round(nrow(glass)*0.75), replace=FALSE)
glass_train <- glass[sample_set, ]
glass_test <- glass[-sample_set, ]

Fit model

For the random forest model, we specify the 0.632 bootstrap as the resampling technique, with 3 resampling iterations. The mtry parameter is typically set to the square root of the number of predictors, which rounds to 3 in this problem.

rf_mod <- train(
  Type ~ .,
  data = glass_train,
  metric = "Accuracy",
  method = "rf",
  trControl = trainControl(method = "boot632", number=3),
  tuneGrid = expand.grid(.mtry = 3)   # mtry=sqrt(num predictors)
)

rf_pred <- predict(rf_mod, glass_test)
confusionMatrix(rf_pred, glass_test$Type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5  6
##          1 21  1  0  0  0  0
##          2  3 15  0  0  0  2
##          3  0  0 15  0  0  0
##          4  0  3  0 18  0  2
##          5  0  0  0  0 16  0
##          6  0  0  0  0  0 18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9035          
##                  95% CI : (0.8339, 0.9508)
##     No Information Rate : 0.2105          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8839          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.8750   0.7895   1.0000   1.0000   1.0000   0.8182
## Specificity            0.9889   0.9474   1.0000   0.9479   1.0000   1.0000
## Pos Pred Value         0.9545   0.7500   1.0000   0.7826   1.0000   1.0000
## Neg Pred Value         0.9674   0.9574   1.0000   1.0000   1.0000   0.9583
## Prevalence             0.2105   0.1667   0.1316   0.1579   0.1404   0.1930
## Detection Rate         0.1842   0.1316   0.1316   0.1579   0.1404   0.1579
## Detection Prevalence   0.1930   0.1754   0.1316   0.2018   0.1404   0.1579
## Balanced Accuracy      0.9319   0.8684   1.0000   0.9740   1.0000   0.9091

ROC and AUC

The AUC was calculated with the pROC package. Since the response variable has more than two classes, the pROC documentation notes that the multi-class ROC curve cannot be plotted, so only the AUC is reported.

prediction <- predict(dt_mod, glass_test, type="vector")

roc_object <- multiclass.roc(glass_test$Type, prediction, percent=TRUE)
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
print(roc_object)
## 
## Call:
## multiclass.roc.default(response = glass_test$Type, predictor = prediction,     percent = TRUE)
## 
## Data: prediction with 6 levels of glass_test$Type: 1, 2, 3, 4, 5, 6.
## Multi-class area under the curve: 87.23%
# Documentation of pROC's multiclass.roc says it cannot be plotted. 
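If per-class curves are still wanted, one option (not part of the original workflow) is to draw one-vs-rest ROC curves from the random forest's class probabilities; a minimal sketch, assuming predict() with type = "prob" on the caret model:

# One-vs-rest ROC curve for glass type 1 as an example (illustrative only)
rf_prob <- predict(rf_mod, glass_test, type = "prob")
roc_1 <- roc(response = as.integer(glass_test$Type == "1"),
             predictor = rf_prob[, "1"])
plot(roc_1, main = "One-vs-rest ROC: glass type 1")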

We would have liked to see plots of the trees to understand how decisions were made in each one and which predictors were selected at each level. However, compared to a single decision tree, a random forest acts more like a black box: we cannot easily inspect its inner workings. We have traded the explainability of the result for higher accuracy. We did test the data with a single decision tree, and its accuracy was about 16% worse.
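For completeness, the comparison tree mentioned above could be fit and inspected with rpart and rpart.plot (both loaded earlier); a minimal sketch, where dt_mod and the default settings are illustrative rather than the exact model that was tested:

# Fit a single decision tree on the same training data and inspect it
dt_mod <- rpart(Type ~ ., data = glass_train, method = "class")
rpart.plot(dt_mod)                                    # visualize the splits and selected predictors
dt_pred <- predict(dt_mod, glass_test, type = "class")
confusionMatrix(dt_pred, glass_test$Type)             # compare accuracy with the random forest above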