Dimension Reduction: Student Depression

Dataset

The dataset was found on Kaggle (https://www.kaggle.com/datasets/hopesb/student-depression-dataset). Dataset Structure:

• Format: CSV format

• Rows: 27901 observations, each row represents an individual student.

• Columns: 18 columns, each column represents a specific feature or attribute.

Introduction

Goal: execute PCA and MDS on the same dataset and compare results. Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) are both dimension reduction techniques, but they serve different purposes and can complement each other when used on the same dataset in one analysis. PCA transforms the original variables into new uncorrelated principal components, ordered by the amount of variance they explain, and it assumes linearity, making it useful for understanding the largest variance directions in the data. MDS, however, focuses on preserving the pairwise distances between data points in a low-dimensional space, without assuming linearity, and is particularly effective at handling non-linear relationships. MDS can be more robust in the presence of outliers because it aims to maintain the overall distance structure, whereas PCA might be more sensitive to outliers since they can disproportionately influence the principal components. Using both methods together provides complementary insights: PCA helps in understanding the major variance directions, while MDS offers a perspective on the overall distance relationships, ensuring a richer and more nuanced analysis of the data, especially in the presence of outliers

Dataset preparation

I load packages, import dataset and modify it to be suitable for Dimension reduction.

#Delete column because they are not important:  
#“id” column because it is not consecutive, and some values are missing.
#“Work pressure” & “Job Satisfaction” & “Profession” columns because 99% of observations are students so there is no work related information. 
Student_Depression_Dataset <- subset(Student_Depression_Dataset, select = -c(id, Profession, `Work Pressure`, `Job Satisfaction`, Degree))
head(Student_Depression_Dataset)

## # A tibble: 6 × 13
##   Gender   Age City          `Academic Pressure`  CGPA `Study Satisfaction`
##   <chr>  <dbl> <chr>                       <dbl> <dbl>                <dbl>
## 1 Male      33 Visakhapatnam                   5  8.97                    2
## 2 Female    24 Bangalore                       2  5.9                     5
## 3 Male      31 Srinagar                        3  7.03                    5
## 4 Female    28 Varanasi                        3  5.59                    2
## 5 Female    25 Jaipur                          4  8.13                    3
## 6 Male      29 Pune                            2  5.7                     3
## # ℹ 7 more variables: `Sleep Duration` <chr>, `Dietary Habits` <chr>,
## #   `Have you ever had suicidal thoughts ?` <chr>, `Work/Study Hours` <dbl>,
## #   `Financial Stress` <dbl>, `Family History of Mental Illness` <chr>,
## #   Depression <dbl>

# Tranform charcter type columns to numeric to be able to use this column
Student_Depression_Dataset$`Family History of Mental Illness` <- ifelse(Student_Depression_Dataset$`Family History of Mental Illness` == "Yes", 1, 0)
head(Student_Depression_Dataset)

## # A tibble: 6 × 13
##   Gender   Age City          `Academic Pressure`  CGPA `Study Satisfaction`
##   <chr>  <dbl> <chr>                       <dbl> <dbl>                <dbl>
## 1 Male      33 Visakhapatnam                   5  8.97                    2
## 2 Female    24 Bangalore                       2  5.9                     5
## 3 Male      31 Srinagar                        3  7.03                    5
## 4 Female    28 Varanasi                        3  5.59                    2
## 5 Female    25 Jaipur                          4  8.13                    3
## 6 Male      29 Pune                            2  5.7                     3
## # ℹ 7 more variables: `Sleep Duration` <chr>, `Dietary Habits` <chr>,
## #   `Have you ever had suicidal thoughts ?` <chr>, `Work/Study Hours` <dbl>,
## #   `Financial Stress` <dbl>, `Family History of Mental Illness` <dbl>,
## #   Depression <dbl>

# Restructure dataset
depression <- Student_Depression_Dataset[, c(3, 1:2, 4:8, 10:13)]
summary(depression)

##      City              Gender               Age        Academic Pressure
##  Length:27901       Length:27901       Min.   :18.00   Min.   :0.000    
##  Class :character   Class :character   1st Qu.:21.00   1st Qu.:2.000    
##  Mode  :character   Mode  :character   Median :25.00   Median :3.000    
##                                        Mean   :25.82   Mean   :3.141    
##                                        3rd Qu.:30.00   3rd Qu.:4.000    
##                                        Max.   :59.00   Max.   :5.000    
##                                                                         
##       CGPA        Study Satisfaction Sleep Duration     Dietary Habits    
##  Min.   : 0.000   Min.   :0.000      Length:27901       Length:27901      
##  1st Qu.: 6.290   1st Qu.:2.000      Class :character   Class :character  
##  Median : 7.770   Median :3.000      Mode  :character   Mode  :character  
##  Mean   : 7.656   Mean   :2.944                                           
##  3rd Qu.: 8.920   3rd Qu.:4.000                                           
##  Max.   :10.000   Max.   :5.000                                           
##                                                                           
##  Work/Study Hours Financial Stress Family History of Mental Illness
##  Min.   : 0.000   Min.   :1.00     Min.   :0.000                   
##  1st Qu.: 4.000   1st Qu.:2.00     1st Qu.:0.000                   
##  Median : 8.000   Median :3.00     Median :0.000                   
##  Mean   : 7.157   Mean   :3.14     Mean   :0.484                   
##  3rd Qu.:10.000   3rd Qu.:4.00     3rd Qu.:1.000                   
##  Max.   :12.000   Max.   :5.00     Max.   :1.000                   
##                   NA's   :3                                        
##    Depression    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5855  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##

# Remove NA-s
depression$`Financial Stress`[is.na(depression$`Financial Stress`)] <- mean(depression$`Financial Stress`, na.rm = TRUE)

summary(depression$CGPA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.290   7.770   7.656   8.920  10.000

summary(depression$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   21.00   25.00   25.82   30.00   59.00

table(depression$`Sleep Duration`)

## 
##         5-6 hours         7-8 hours Less than 5 hours More than 8 hours 
##              6183              7346              8310              6044 
##            Others 
##                18

# Remove extra outliers that won’t influence dataset since the loss of the data is like 5%:
depression <- depression[depression$`Sleep Duration` != "Others", ]
depression <- depression[depression$CGPA > 0, ]
depression <- depression [depression $Age < 35, ]

#Change columns from character type to factors and then to numeric:
depression$Gender <- as.numeric(as.factor(depression$Gender))
depression$`Sleep Duration` <- as.numeric(as.factor(depression$`Sleep Duration`))
depression$`Dietary Habits` <- as.numeric(as.factor(depression$`Dietary Habits`))
depression$City <- as.numeric(as.factor(depression$City))

# check that coulmns types have been changes and data is ready 
str(depression)

## tibble [27,826 × 12] (S3: tbl_df/tbl/data.frame)
##  $ City                            : num [1:27826] 52 4 45 50 17 40 47 7 34 38 ...
##  $ Gender                          : num [1:27826] 2 1 2 1 1 2 2 1 2 2 ...
##  $ Age                             : num [1:27826] 33 24 31 28 25 29 30 30 28 31 ...
##  $ Academic Pressure               : num [1:27826] 5 2 3 3 4 2 3 2 3 2 ...
##  $ CGPA                            : num [1:27826] 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
##  $ Study Satisfaction              : num [1:27826] 2 5 5 2 3 3 4 4 1 3 ...
##  $ Sleep Duration                  : num [1:27826] 1 1 3 2 1 3 2 3 2 3 ...
##  $ Dietary Habits                  : num [1:27826] 1 2 1 2 2 1 1 4 2 2 ...
##  $ Work/Study Hours                : num [1:27826] 3 3 9 4 1 4 1 0 12 2 ...
##  $ Financial Stress                : num [1:27826] 1 2 1 5 1 1 2 1 3 5 ...
##  $ Family History of Mental Illness: num [1:27826] 0 1 1 1 0 0 0 1 0 0 ...
##  $ Depression                      : num [1:27826] 1 0 0 1 0 0 0 0 1 1 ...

summary(depression)

##       City           Gender           Age       Academic Pressure
##  Min.   : 1.00   Min.   :1.000   Min.   :18.0   Min.   :0.000    
##  1st Qu.:15.00   1st Qu.:1.000   1st Qu.:21.0   1st Qu.:2.000    
##  Median :25.00   Median :2.000   Median :25.0   Median :3.000    
##  Mean   :27.29   Mean   :1.558   Mean   :25.8   Mean   :3.142    
##  3rd Qu.:45.00   3rd Qu.:2.000   3rd Qu.:30.0   3rd Qu.:4.000    
##  Max.   :52.00   Max.   :2.000   Max.   :34.0   Max.   :5.000    
##       CGPA        Study Satisfaction Sleep Duration Dietary Habits 
##  Min.   : 5.030   Min.   :0.000      Min.   :1.00   Min.   :1.000  
##  1st Qu.: 6.290   1st Qu.:2.000      1st Qu.:2.00   1st Qu.:1.000  
##  Median : 7.770   Median :3.000      Median :3.00   Median :2.000  
##  Mean   : 7.659   Mean   :2.944      Mean   :2.51   Mean   :2.466  
##  3rd Qu.: 8.920   3rd Qu.:4.000      3rd Qu.:3.00   3rd Qu.:4.000  
##  Max.   :10.000   Max.   :5.000      Max.   :4.00   Max.   :4.000  
##  Work/Study Hours Financial Stress Family History of Mental Illness
##  Min.   : 0.000   Min.   :1.000    Min.   :0.0000                  
##  1st Qu.: 4.000   1st Qu.:2.000    1st Qu.:0.0000                  
##  Median : 8.000   Median :3.000    Median :0.0000                  
##  Mean   : 7.159   Mean   :3.139    Mean   :0.4838                  
##  3rd Qu.:10.000   3rd Qu.:4.000    3rd Qu.:1.0000                  
##  Max.   :12.000   Max.   :5.000    Max.   :1.0000                  
##    Depression    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5861  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Dimension Reduction and pre-diagnosis

Firstly let’s check correlation of variables. It will show if the data is suitable for analysis and enhances the overall interpretability of the reduced dimensions by providing context on how the original variables contribute to the components or factors derived from the dimension reduction technique

cor_outcome <- cor(depression, method="pearson") 
corrplot(cor_outcome, order ="alphabet", tl.cex=0.6)

Cor plot shows that many variables are not correlated and can be for example removed or merged. I keep it as it is.

I check the Kaiser-Meyer-Olkin (KMO) measure for evaluating sampling adequacy to determine if the dataset is suitable for such analyses by assessing the proportion of variance that is common among the variables. High KMO values, closer to 1, indicate that the data is appropriate for dimension reduction, ensuring reliable and meaningful results.

KMO(cor_outcome)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_outcome)
## Overall MSA =  0.6
## MSA for each item = 
##                             City                           Gender 
##                             0.58                             0.49 
##                              Age                Academic Pressure 
##                             0.63                             0.60 
##                             CGPA               Study Satisfaction 
##                             0.45                             0.71 
##                   Sleep Duration                   Dietary Habits 
##                             0.51                             0.68 
##                 Work/Study Hours                 Financial Stress 
##                             0.67                             0.65 
## Family History of Mental Illness                       Depression 
##                             0.63                             0.57

With a Kaiser-Meyer-Olkin (KMO) measure of 0.6, the dataset has a mediocre level of sampling adequacy for dimension reduction techniques. But since it’s above the threshold of 0.5, it is acceptable to proceed. In this case, I will still go ahead with the dimension reduction but remain cautious about the interpretation of the results.

Bartlett’s test of sphericity:

cortest.bartlett(cor_outcome, n = 12756)

## $chisq
## [1] 7573.865
## 
## $p.value
## [1] 0
## 
## $df
## [1] 66

Test results show that the correlations between variables are strong enough to justify proceeding with dimension reduction

#change City column values to the frequancy of values in City column
freq_table <- table(depression$City)
depression$City <- as.numeric(freq_table[depression$City])

summary(depression)

##       City            Gender           Age       Academic Pressure
##  Min.   :   1.0   Min.   :1.000   Min.   :18.0   Min.   :0.000    
##  1st Qu.: 766.0   1st Qu.:1.000   1st Qu.:21.0   1st Qu.:2.000    
##  Median :1003.0   Median :2.000   Median :25.0   Median :3.000    
##  Mean   : 998.4   Mean   :1.558   Mean   :25.8   Mean   :3.142    
##  3rd Qu.:1138.0   3rd Qu.:2.000   3rd Qu.:30.0   3rd Qu.:4.000    
##  Max.   :1566.0   Max.   :2.000   Max.   :34.0   Max.   :5.000    
##       CGPA        Study Satisfaction Sleep Duration Dietary Habits 
##  Min.   : 5.030   Min.   :0.000      Min.   :1.00   Min.   :1.000  
##  1st Qu.: 6.290   1st Qu.:2.000      1st Qu.:2.00   1st Qu.:1.000  
##  Median : 7.770   Median :3.000      Median :3.00   Median :2.000  
##  Mean   : 7.659   Mean   :2.944      Mean   :2.51   Mean   :2.466  
##  3rd Qu.: 8.920   3rd Qu.:4.000      3rd Qu.:3.00   3rd Qu.:4.000  
##  Max.   :10.000   Max.   :5.000      Max.   :4.00   Max.   :4.000  
##  Work/Study Hours Financial Stress Family History of Mental Illness
##  Min.   : 0.000   Min.   :1.000    Min.   :0.0000                  
##  1st Qu.: 4.000   1st Qu.:2.000    1st Qu.:0.0000                  
##  Median : 8.000   Median :3.000    Median :0.0000                  
##  Mean   : 7.159   Mean   :3.139    Mean   :0.4838                  
##  3rd Qu.:10.000   3rd Qu.:4.000    3rd Qu.:1.0000                  
##  Max.   :12.000   Max.   :5.000    Max.   :1.0000                  
##    Depression    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5861  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

str(depression$City)

##  num [1:27826] 965 766 1368 684 1036 ...

table(depression$City)

## 
##    1    2  460  545  608  643  648  684  691  697  743  765  766  812  821  883 
##   18    8  460  545  608  643  648  684  691  697  743  765  766  812  821  883 
##  932  948  965  966 1003 1036 1063 1076 1090 1106 1138 1152 1288 1337 1368 1566 
##  932  948  965  966 1003 1036 1063 1076 1090 1106 1138 1152 1288 1337 1368 1566

PCA

library(FactoMineR)
pca_PCA <- PCA(scale(depression), graph = FALSE)

Bar plot of the eigenvalues of the principal components indicate the amount of variance explained by each principal component.

fviz_eig(pca_PCA, choice = "eigenvalue", ncp = 22, barfill = "lightblue" , linecolor = "hotpink2",  addlabels = TRUE,   main = "Eigenvalues")

fviz_pca_var(pca_PCA, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)

fviz_contrib(pca_PCA, choice = "var", axes = 1,fill = "lightblue",color = "lightblue")

fviz_contrib(pca_PCA, choice = "var", axes = 1:5, fill = "lightblue",color = "lightblue")

fviz_contrib(pca_PCA, choice = "var", axes = 2:5, fill = "lightblue",color = "lightblue")

Bar plot of the eigenvalues of the principal components indicate the amount of variance explained by each principal component.
Visualization of the contribution of each variable to the principal components and their corelations (negative or positive or non). Color gradient of the factors show contributions, low contributions are colored light green, mid contributions are pink, and high contributions are black.
Three last plots print the contributions of variables to the specified principal components (axes). The first function shows contributions to the first principal component (axis 1), the second function shows contributions to the first five principal components (axes 1 to 5), and the third function shows contributions to the second to fifth principal components (axes 2 to 5)

MDS

MDS_dep <- scale(depression)

# Assuming dist_matrix and mds_result are already defined
dist_matrix <- dist(t(MDS_dep))
mds_result <- cmdscale(dist_matrix, k = 2)

# Convert MDS results to a data frame for easier plotting
mds_data <- as.data.frame(mds_result)
colnames(mds_data) <- c("Dim1", "Dim2")

# Add column names as labels
mds_data$Labels <- rownames(mds_result)

# Create a scatter plot of the MDS results with labels and custom x-axis limits
library(ggplot2)

ggplot(mds_data, aes(x = Dim1, y = Dim2, label = Labels)) +
  geom_point(size = 4, color = "blue") +
  geom_text(vjust = 1.1, hjust = 1.1, size = 3) +
  theme_minimal() +
  labs(title = "MDS Plot", x = "Dimension 1", y = "Dimension 2") +
  xlim(-200, 200) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))

Conclusion

When both PCA and MDS analyses show that the variables Depression, Academic Stress, and Financial Stress consistently stay together while the rest migrate but maintain similar relative positions, it highlights a robust and strong relationship among these three variables, indicating that they likely share a common underlying factor or are highly correlated. This stability across different dimension reduction techniques emphasizes the importance and interconnection of these variables within the dataset. The migrating variables, while maintaining their relative positions, suggest they are influenced by various dimensions or factors but still retain an underlying structure. The agreement between PCA and MDS underscores the complementary nature of these techniques and reinforces the reliability of the identified clusters and patterns, providing a comprehensive and nuanced understanding of the data. This dual perspective confirms the significance of Depression, Academic Stress, and Financial Stress as key interconnected variables in the analysis.