Introduction

In this exploratory data analysis, we would like to see how different factors affect depression rates among students. Can we accurately predict if a student is facing depression based on features from our data set?

The approach we will be using is a Decision Tree model, an effective and interpretable model for these sorts of classification problems.

We will begin by cleaning and exploring our data set.

Data

We can clean up the data by removing rows with NA values and dropping the columns Work.Pressure and Job.Satisfaction, whose values are generally 0. Since we want to analyze students only, we will filter Profession to just students and then drop the Profession column entirely. We will also drop City for simplicity.
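
The cleaning steps described above could be sketched as follows. This is a minimal base R sketch on a small made-up data frame (`toy`); the toy values and the label "Student" used in Profession are assumptions for illustration, not taken from the real data set.

```r
# Hypothetical toy data standing in for the real data set
toy <- data.frame(
  Age              = c(21, 24, NA, 30, 27),
  City             = c("A", "B", "C", "A", "B"),
  Profession       = c("Student", "Student", "Student", "Teacher", "Student"),
  Work.Pressure    = c(0, 0, 0, 3, 0),
  Job.Satisfaction = c(0, 0, 0, 4, 0),
  CGPA             = c(7.5, 8.1, 6.9, 0, 9.0),
  Depression       = c(1, 0, 1, 0, 1)
)

# 1. Drop rows containing any NA value
clean <- na.omit(toy)

# 2. Keep students only ("Student" is an assumed label)
clean <- clean[clean$Profession == "Student", ]

# 3. Drop the columns we do not need
drop_cols <- c("Work.Pressure", "Job.Satisfaction", "Profession", "City")
clean <- clean[, !(names(clean) %in% drop_cols)]

str(clean)
```

Dropping Profession only after filtering on it matters: the column is still needed to select the student rows.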

Analyzing our Data

Before we get started with our predictive decision tree model, let us first make some inferences from our data. One thing we can do is look at the frequency of students with depression using a pie chart:

We can see that students with depression outnumber students without depression in this data set.
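
A pie chart like this one can be produced from a simple frequency table. The sketch below uses base R graphics and a made-up vector of 0/1 labels; the plotting library used for the original chart is not shown in this report, so `pie()` is an assumption.

```r
# Hypothetical 0/1 labels standing in for the Depression column
depression <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

counts <- table(depression)   # number of students per class
shares <- prop.table(counts)  # the same counts as proportions

# Label each slice with its class and percentage
slice_labels <- paste0(c("Not depressed", "Depressed"),
                       " (", round(100 * shares), "%)")
pie(counts, labels = slice_labels, main = "Depression among students")
```

Printing `shares` alongside the chart is a cheap way to confirm which class is larger, since slice sizes can be hard to compare by eye.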

3D Scatterplot

We can also examine the relationship between some of the features in the data set. A 3D scatterplot shows how the features Age, CGPA, and Work.Study.Hours relate to one another.

The markers form a dense rectangular block, mostly within the following ranges: ages 0-35, work/study hours 0-12, and CGPA 5-10. Few outliers fall outside these ranges.

Statistical Analysis

It is difficult to capture the full picture visually.

Among age, CGPA, and work study hours, where does most of our data actually lie?

Let us compare the mean and median age, CGPA, and work/study hours between students who are depressed and those who are not.

## # A tibble: 2 × 7
##   Depression mean_age median_age mean_CGPA median_CGPA mean_hrs median_hrs
##        <int>    <dbl>      <dbl>     <dbl>       <dbl>    <dbl>      <dbl>
## 1          0     27.1         28      7.62        7.64     6.24          6
## 2          1     24.9         24      7.68        7.85     7.81          9

CGPA stands out immediately because it barely differs between the two groups: the means are 7.62 and 7.68. Non-depressed students were older on average, with a mean age of 27.14 versus 24.88 for depressed students. Depressed students also spent more time studying or working: their mean is more than an hour higher (7.81 versus 6.24 hours), and their median is a surprising 3 hours higher (9 versus 6), suggesting that less free time might be associated with depressive symptoms.
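
A grouped summary of this kind can be computed in a few lines. The tibble above suggests the original used dplyr; the sketch below uses base R's `aggregate()` instead so it runs without extra packages, and the toy values are made up.

```r
# Hypothetical toy data; column names follow the data set
toy <- data.frame(
  Depression       = c(0, 0, 0, 1, 1, 1),
  Age              = c(28, 30, 26, 24, 22, 26),
  CGPA             = c(7.5, 7.7, 7.6, 7.9, 7.4, 8.0),
  Work.Study.Hours = c(6, 5, 7, 9, 10, 8)
)

# Mean and median of each feature within each Depression group
group_means <- aggregate(cbind(Age, CGPA, Work.Study.Hours) ~ Depression,
                         data = toy, FUN = mean)
group_medians <- aggregate(cbind(Age, CGPA, Work.Study.Hours) ~ Depression,
                           data = toy, FUN = median)

print(group_means)
print(group_medians)
```

Reporting the median next to the mean, as the table above does, helps reveal skew: a median far from the mean hints that a few extreme values are pulling the average.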

Training our Data

We explored the features in our data set as well as some general statistics, but which features actually matter when predicting if a student will experience depression? We can attempt to explore this question further using a decision tree model.

First, we must split the data into training and test sets. We will use an 80/20 split and pass our response variable Depression to createDataPartition, so that both sets keep a similar balance of depressed and non-depressed students.

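library(caret)  # provides createDataPartition()

set.seed(123)  # make the random split reproducible

# 80/20 split, balanced on the response variable
train_index = createDataPartition(SDD$Depression, p = 0.8, list = FALSE)
train_data = SDD[train_index, ]
test_data = SDD[-train_index, ]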
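
To illustrate what a split balanced on the response looks like, here is a minimal base R sketch that samples 80% of the rows within each class. It is not caret's implementation, only an illustration of the idea, and the toy data frame is made up.

```r
set.seed(123)

# Hypothetical toy data: 60 depressed, 40 not depressed
toy <- data.frame(
  id         = 1:100,
  Depression = rep(c(1, 0), times = c(60, 40))
)

# Sample 80% of the row indices within each class, then combine them
idx_by_class <- split(seq_len(nrow(toy)), toy$Depression)
toy_train_index <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

toy_train <- toy[toy_train_index, ]
toy_test  <- toy[-toy_train_index, ]

# Class shares should match across the full, training, and test data
prop.table(table(toy$Depression))
prop.table(table(toy_train$Depression))
prop.table(table(toy_test$Depression))
```

Comparing the class shares in this way is a quick sanity check worth running on any train/test split before fitting a model.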

Building our Tree

Now that we have our training data, we can build our Decision Tree model:

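library(rpart)       # decision tree fitting
library(rpart.plot)  # tree plotting

# Classification tree predicting Depression from all other columns
tree_model = rpart(Depression ~ .,
                   data = train_data,
                   method = "class")

rpart.plot(tree_model)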

Model Interpretation

Now we have our model, but what does it mean?

Let us begin with the root node with values 1, 0.58, and 100%.

The 1 indicates that the model predicts “Depressed” at this node. The 0.58 indicates that 58% of the observations in our training data are depressed. The 100% simply indicates that all of our training data starts at the root node.

The first split is on our feature have.you.ever.had.suicidal.thoughts. Because the tree chooses, at each node, the split that best separates the classes, this suggests it is our most informative single predictor of depression.
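
The root node values and the first split can also be read directly from a fitted rpart object rather than from the plot. The sketch below fits a tree to made-up data in which a hypothetical `Suicidal.Thoughts` column strongly predicts `Depression`; the column names and probabilities are assumptions for illustration only.

```r
library(rpart)

set.seed(123)

# Hypothetical toy data in which suicidal thoughts strongly predict depression
n <- 400
toy <- data.frame(
  Suicidal.Thoughts = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Age               = sample(18:34, n, replace = TRUE),
  Work.Study.Hours  = sample(0:12, n, replace = TRUE)
)
p_depressed    <- ifelse(toy$Suicidal.Thoughts == "Yes", 0.8, 0.2)
toy$Depression <- factor(rbinom(n, 1, p_depressed), levels = c(0, 1))

fit <- rpart(Depression ~ ., data = toy, method = "class")

# The first row of fit$frame describes the root node
root <- fit$frame[1, ]
as.character(root$var)       # variable used for the first split
root$n                       # observations reaching the root (all of them)
mean(toy$Depression == "1")  # share of depressed observations at the root

# Importance scores accumulated over the splits in the tree
fit$variable.importance
```

The `variable.importance` vector is a useful complement to the plot, since a feature can matter further down the tree even when it is not used at the root.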

Bar Plot

We can plot the relationship between this feature and Depression using a bar plot:

We can see that a large majority of students who have had suicidal thoughts were also depressed: 79%.
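
A proportion like this comes from a two-way table normalized by row. The sketch below uses made-up vectors and base R's `barplot()`; the plotting library behind the original bar plot is not shown in this report, so this is only one way to do it.

```r
# Hypothetical toy data: 10 students
suicidal   <- c("Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No")
depression <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0)

counts <- table(Suicidal = suicidal, Depression = depression)

# Row-wise proportions: share of each Depression value within each answer
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))

# Grouped bar plot of counts: one group per answer, one bar per Depression value
barplot(t(counts), beside = TRUE,
        legend.text = c("Not depressed", "Depressed"),
        xlab = "Ever had suicidal thoughts", ylab = "Students")
```

Using `margin = 1` answers "of those who said Yes, how many are depressed?"; `margin = 2` would answer the reverse question, which is a different number.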

Model Evaluation

Now we can run predictions for our decision tree as well as evaluate the performance of our model:

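# confusionMatrix() compares two factors with the same levels
test_data$Depression = as.factor(test_data$Depression)
# Predicted class (0 or 1) for each row of the test set
predictions = predict(tree_model, test_data, type = "class")
cm = confusionMatrix(predictions, test_data$Depression)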

print(cm$table)
##           Reference
## Prediction    0    1
##          0 1628  355
##          1  661 2929
print(cm$overall["Accuracy"])
##  Accuracy 
## 0.8176924

81.77% of the predictions on the test set were correct.
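
Accuracy alone hides how the errors are distributed, so it can help to derive a few more metrics from the same counts. The sketch below rebuilds the confusion matrix printed above by hand and treats class 1 (depressed) as the positive class; that choice is an assumption made here, and caret's confusionMatrix() uses the first factor level as the positive class unless `positive` is set.

```r
# Confusion matrix counts copied from the output above
# (rows = prediction, columns = reference)
cm_counts <- matrix(c(1628, 661, 355, 2929), nrow = 2,
                    dimnames = list(Prediction = c("0", "1"),
                                    Reference  = c("0", "1")))

tn <- cm_counts["0", "0"]  # predicted 0, actually 0
fp <- cm_counts["1", "0"]  # predicted 1, actually 0
fn <- cm_counts["0", "1"]  # predicted 0, actually 1
tp <- cm_counts["1", "1"]  # predicted 1, actually 1

accuracy    <- (tp + tn) / sum(cm_counts)
sensitivity <- tp / (tp + fn)  # share of depressed students correctly flagged
specificity <- tn / (tn + fp)  # share of non-depressed students correctly cleared

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
```

With these counts the model makes more false positives (661) than false negatives (355), so it errs toward flagging students as depressed.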

Confusion Matrix Heat Map

We can visualize the confusion matrix with a heat map:

This visualization helps us distinguish between our true positives/negatives and false positives/negatives.
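
A heat map of this kind can be drawn from the counts alone. The sketch below uses base R's `image()` with the counts from the confusion matrix printed earlier; the original figure may well have been made with another library, so treat the plotting calls as an assumption.

```r
# Confusion matrix counts (rows = prediction, columns = reference)
cm_counts <- matrix(c(1628, 661, 355, 2929), nrow = 2,
                    dimnames = list(Prediction = c("0", "1"),
                                    Reference  = c("0", "1")))

# Draw one coloured cell per matrix entry (x = prediction, y = reference)
image(1:2, 1:2, cm_counts, axes = FALSE, col = rev(heat.colors(20)),
      xlab = "Prediction", ylab = "Reference", main = "Confusion matrix")
axis(1, at = 1:2, labels = rownames(cm_counts))
axis(2, at = 1:2, labels = colnames(cm_counts))

# Print the count inside each cell
text(x = row(cm_counts), y = col(cm_counts), labels = cm_counts)
```

Cells where prediction and reference agree hold the correct predictions; the other two cells hold the false positives and false negatives.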

Concluding Thoughts

At an accuracy of 81.77%, there is room for improvement in our model. The relationship between our features and Depression is complex, and one potential way to improve predictive accuracy is a random forest model, which combines the predictions of many decision trees.
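
To give a feel for how combining trees works, here is a minimal sketch of bagging: several rpart trees are fit on bootstrap resamples and their predictions are combined by majority vote. This is a simplified stand-in, not a random forest (a random forest additionally samples a random subset of features at each split), and the toy data and column names are made up.

```r
library(rpart)

set.seed(123)

# Hypothetical toy data
n <- 300
toy <- data.frame(
  Suicidal.Thoughts = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  Work.Study.Hours  = sample(0:12, n, replace = TRUE),
  Age               = sample(18:34, n, replace = TRUE)
)
p_depressed    <- ifelse(toy$Suicidal.Thoughts == "Yes", 0.8, 0.2)
toy$Depression <- factor(rbinom(n, 1, p_depressed), levels = c(0, 1))

# Fit each tree on a bootstrap resample and collect its predictions
n_trees <- 25  # odd, so a two-class vote cannot tie
votes <- sapply(seq_len(n_trees), function(i) {
  boot <- toy[sample(nrow(toy), replace = TRUE), ]
  fit  <- rpart(Depression ~ ., data = boot, method = "class")
  as.character(predict(fit, toy, type = "class"))
})

# Majority vote across trees for each observation (rows of `votes`)
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Agreement with the labels; note this is measured on the training data itself
mean(ensemble_pred == toy$Depression)
```

For a fair comparison with the single tree above, the ensemble would need to be evaluated on the same held-out test set rather than on the data it was trained on.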