The dataset was found on Kaggle (https://www.kaggle.com/datasets/hopesb/student-depression-dataset). Dataset Structure:
• Format: CSV format
• Rows: 27901 observations, each row represents an individual student.
• Columns: 18 columns, each column represents a specific feature or attribute.
Goal: execute PCA and MDS on the same dataset and compare results. Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) are both dimension reduction techniques, but they serve different purposes and can complement each other when used on the same dataset in one analysis. PCA transforms the original variables into new uncorrelated principal components, ordered by the amount of variance they explain, and it assumes linearity, making it useful for understanding the largest variance directions in the data. MDS, however, focuses on preserving the pairwise distances between data points in a low-dimensional space, without assuming linearity, and is particularly effective at handling non-linear relationships. MDS can be more robust in the presence of outliers because it aims to maintain the overall distance structure, whereas PCA might be more sensitive to outliers since they can disproportionately influence the principal components. Using both methods together provides complementary insights: PCA helps in understanding the major variance directions, while MDS offers a perspective on the overall distance relationships, ensuring a richer and more nuanced analysis of the data, especially in the presence of outliers
I load packages, import dataset and modify it to be suitable for Dimension reduction.
#Delete column because they are not important:
#“id” column because it is not consecutive, and some values are missing.
#“Work pressure” & “Job Satisfaction” & “Profession” columns because 99% of observations are students so there is no work related information.
Student_Depression_Dataset <- subset(Student_Depression_Dataset, select = -c(id, Profession, `Work Pressure`, `Job Satisfaction`, Degree))
head(Student_Depression_Dataset)
## # A tibble: 6 × 13
## Gender Age City `Academic Pressure` CGPA `Study Satisfaction`
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Male 33 Visakhapatnam 5 8.97 2
## 2 Female 24 Bangalore 2 5.9 5
## 3 Male 31 Srinagar 3 7.03 5
## 4 Female 28 Varanasi 3 5.59 2
## 5 Female 25 Jaipur 4 8.13 3
## 6 Male 29 Pune 2 5.7 3
## # ℹ 7 more variables: `Sleep Duration` <chr>, `Dietary Habits` <chr>,
## # `Have you ever had suicidal thoughts ?` <chr>, `Work/Study Hours` <dbl>,
## # `Financial Stress` <dbl>, `Family History of Mental Illness` <chr>,
## # Depression <dbl>
# Tranform charcter type columns to numeric to be able to use this column
Student_Depression_Dataset$`Family History of Mental Illness` <- ifelse(Student_Depression_Dataset$`Family History of Mental Illness` == "Yes", 1, 0)
head(Student_Depression_Dataset)
## # A tibble: 6 × 13
## Gender Age City `Academic Pressure` CGPA `Study Satisfaction`
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Male 33 Visakhapatnam 5 8.97 2
## 2 Female 24 Bangalore 2 5.9 5
## 3 Male 31 Srinagar 3 7.03 5
## 4 Female 28 Varanasi 3 5.59 2
## 5 Female 25 Jaipur 4 8.13 3
## 6 Male 29 Pune 2 5.7 3
## # ℹ 7 more variables: `Sleep Duration` <chr>, `Dietary Habits` <chr>,
## # `Have you ever had suicidal thoughts ?` <chr>, `Work/Study Hours` <dbl>,
## # `Financial Stress` <dbl>, `Family History of Mental Illness` <dbl>,
## # Depression <dbl>
# Restructure dataset
depression <- Student_Depression_Dataset[, c(3, 1:2, 4:8, 10:13)]
summary(depression)
## City Gender Age Academic Pressure
## Length:27901 Length:27901 Min. :18.00 Min. :0.000
## Class :character Class :character 1st Qu.:21.00 1st Qu.:2.000
## Mode :character Mode :character Median :25.00 Median :3.000
## Mean :25.82 Mean :3.141
## 3rd Qu.:30.00 3rd Qu.:4.000
## Max. :59.00 Max. :5.000
##
## CGPA Study Satisfaction Sleep Duration Dietary Habits
## Min. : 0.000 Min. :0.000 Length:27901 Length:27901
## 1st Qu.: 6.290 1st Qu.:2.000 Class :character Class :character
## Median : 7.770 Median :3.000 Mode :character Mode :character
## Mean : 7.656 Mean :2.944
## 3rd Qu.: 8.920 3rd Qu.:4.000
## Max. :10.000 Max. :5.000
##
## Work/Study Hours Financial Stress Family History of Mental Illness
## Min. : 0.000 Min. :1.00 Min. :0.000
## 1st Qu.: 4.000 1st Qu.:2.00 1st Qu.:0.000
## Median : 8.000 Median :3.00 Median :0.000
## Mean : 7.157 Mean :3.14 Mean :0.484
## 3rd Qu.:10.000 3rd Qu.:4.00 3rd Qu.:1.000
## Max. :12.000 Max. :5.00 Max. :1.000
## NA's :3
## Depression
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5855
## 3rd Qu.:1.0000
## Max. :1.0000
##
# Remove NA-s
depression$`Financial Stress`[is.na(depression$`Financial Stress`)] <- mean(depression$`Financial Stress`, na.rm = TRUE)
summary(depression$CGPA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.290 7.770 7.656 8.920 10.000
summary(depression$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 21.00 25.00 25.82 30.00 59.00
table(depression$`Sleep Duration`)
##
## 5-6 hours 7-8 hours Less than 5 hours More than 8 hours
## 6183 7346 8310 6044
## Others
## 18
# Remove extra outliers that won’t influence dataset since the loss of the data is like 5%:
depression <- depression[depression$`Sleep Duration` != "Others", ]
depression <- depression[depression$CGPA > 0, ]
depression <- depression [depression $Age < 35, ]
#Change columns from character type to factors and then to numeric:
depression$Gender <- as.numeric(as.factor(depression$Gender))
depression$`Sleep Duration` <- as.numeric(as.factor(depression$`Sleep Duration`))
depression$`Dietary Habits` <- as.numeric(as.factor(depression$`Dietary Habits`))
depression$City <- as.numeric(as.factor(depression$City))
# check that coulmns types have been changes and data is ready
str(depression)
## tibble [27,826 × 12] (S3: tbl_df/tbl/data.frame)
## $ City : num [1:27826] 52 4 45 50 17 40 47 7 34 38 ...
## $ Gender : num [1:27826] 2 1 2 1 1 2 2 1 2 2 ...
## $ Age : num [1:27826] 33 24 31 28 25 29 30 30 28 31 ...
## $ Academic Pressure : num [1:27826] 5 2 3 3 4 2 3 2 3 2 ...
## $ CGPA : num [1:27826] 8.97 5.9 7.03 5.59 8.13 5.7 9.54 8.04 9.79 8.38 ...
## $ Study Satisfaction : num [1:27826] 2 5 5 2 3 3 4 4 1 3 ...
## $ Sleep Duration : num [1:27826] 1 1 3 2 1 3 2 3 2 3 ...
## $ Dietary Habits : num [1:27826] 1 2 1 2 2 1 1 4 2 2 ...
## $ Work/Study Hours : num [1:27826] 3 3 9 4 1 4 1 0 12 2 ...
## $ Financial Stress : num [1:27826] 1 2 1 5 1 1 2 1 3 5 ...
## $ Family History of Mental Illness: num [1:27826] 0 1 1 1 0 0 0 1 0 0 ...
## $ Depression : num [1:27826] 1 0 0 1 0 0 0 0 1 1 ...
summary(depression)
## City Gender Age Academic Pressure
## Min. : 1.00 Min. :1.000 Min. :18.0 Min. :0.000
## 1st Qu.:15.00 1st Qu.:1.000 1st Qu.:21.0 1st Qu.:2.000
## Median :25.00 Median :2.000 Median :25.0 Median :3.000
## Mean :27.29 Mean :1.558 Mean :25.8 Mean :3.142
## 3rd Qu.:45.00 3rd Qu.:2.000 3rd Qu.:30.0 3rd Qu.:4.000
## Max. :52.00 Max. :2.000 Max. :34.0 Max. :5.000
## CGPA Study Satisfaction Sleep Duration Dietary Habits
## Min. : 5.030 Min. :0.000 Min. :1.00 Min. :1.000
## 1st Qu.: 6.290 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:1.000
## Median : 7.770 Median :3.000 Median :3.00 Median :2.000
## Mean : 7.659 Mean :2.944 Mean :2.51 Mean :2.466
## 3rd Qu.: 8.920 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:4.000
## Max. :10.000 Max. :5.000 Max. :4.00 Max. :4.000
## Work/Study Hours Financial Stress Family History of Mental Illness
## Min. : 0.000 Min. :1.000 Min. :0.0000
## 1st Qu.: 4.000 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8.000 Median :3.000 Median :0.0000
## Mean : 7.159 Mean :3.139 Mean :0.4838
## 3rd Qu.:10.000 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :12.000 Max. :5.000 Max. :1.0000
## Depression
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5861
## 3rd Qu.:1.0000
## Max. :1.0000
Firstly let’s check correlation of variables. It will show if the data is suitable for analysis and enhances the overall interpretability of the reduced dimensions by providing context on how the original variables contribute to the components or factors derived from the dimension reduction technique
cor_outcome <- cor(depression, method="pearson")
corrplot(cor_outcome, order ="alphabet", tl.cex=0.6)
Cor plot shows that many variables are not correlated and can be for example removed or merged. I keep it as it is.
I check the Kaiser-Meyer-Olkin (KMO) measure for evaluating sampling adequacy to determine if the dataset is suitable for such analyses by assessing the proportion of variance that is common among the variables. High KMO values, closer to 1, indicate that the data is appropriate for dimension reduction, ensuring reliable and meaningful results.
KMO(cor_outcome)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_outcome)
## Overall MSA = 0.6
## MSA for each item =
## City Gender
## 0.58 0.49
## Age Academic Pressure
## 0.63 0.60
## CGPA Study Satisfaction
## 0.45 0.71
## Sleep Duration Dietary Habits
## 0.51 0.68
## Work/Study Hours Financial Stress
## 0.67 0.65
## Family History of Mental Illness Depression
## 0.63 0.57
With a Kaiser-Meyer-Olkin (KMO) measure of 0.6, the dataset has a mediocre level of sampling adequacy for dimension reduction techniques. But since it’s above the threshold of 0.5, it is acceptable to proceed. In this case, I will still go ahead with the dimension reduction but remain cautious about the interpretation of the results.
Bartlett’s test of sphericity:
cortest.bartlett(cor_outcome, n = 12756)
## $chisq
## [1] 7573.865
##
## $p.value
## [1] 0
##
## $df
## [1] 66
Test results show that the correlations between variables are strong enough to justify proceeding with dimension reduction
#change City column values to the frequancy of values in City column
freq_table <- table(depression$City)
depression$City <- as.numeric(freq_table[depression$City])
summary(depression)
## City Gender Age Academic Pressure
## Min. : 1.0 Min. :1.000 Min. :18.0 Min. :0.000
## 1st Qu.: 766.0 1st Qu.:1.000 1st Qu.:21.0 1st Qu.:2.000
## Median :1003.0 Median :2.000 Median :25.0 Median :3.000
## Mean : 998.4 Mean :1.558 Mean :25.8 Mean :3.142
## 3rd Qu.:1138.0 3rd Qu.:2.000 3rd Qu.:30.0 3rd Qu.:4.000
## Max. :1566.0 Max. :2.000 Max. :34.0 Max. :5.000
## CGPA Study Satisfaction Sleep Duration Dietary Habits
## Min. : 5.030 Min. :0.000 Min. :1.00 Min. :1.000
## 1st Qu.: 6.290 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:1.000
## Median : 7.770 Median :3.000 Median :3.00 Median :2.000
## Mean : 7.659 Mean :2.944 Mean :2.51 Mean :2.466
## 3rd Qu.: 8.920 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:4.000
## Max. :10.000 Max. :5.000 Max. :4.00 Max. :4.000
## Work/Study Hours Financial Stress Family History of Mental Illness
## Min. : 0.000 Min. :1.000 Min. :0.0000
## 1st Qu.: 4.000 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8.000 Median :3.000 Median :0.0000
## Mean : 7.159 Mean :3.139 Mean :0.4838
## 3rd Qu.:10.000 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :12.000 Max. :5.000 Max. :1.0000
## Depression
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5861
## 3rd Qu.:1.0000
## Max. :1.0000
str(depression$City)
## num [1:27826] 965 766 1368 684 1036 ...
table(depression$City)
##
## 1 2 460 545 608 643 648 684 691 697 743 765 766 812 821 883
## 18 8 460 545 608 643 648 684 691 697 743 765 766 812 821 883
## 932 948 965 966 1003 1036 1063 1076 1090 1106 1138 1152 1288 1337 1368 1566
## 932 948 965 966 1003 1036 1063 1076 1090 1106 1138 1152 1288 1337 1368 1566
library(FactoMineR)
pca_PCA <- PCA(scale(depression), graph = FALSE)
Bar plot of the eigenvalues of the principal components indicate the amount of variance explained by each principal component.
fviz_eig(pca_PCA, choice = "eigenvalue", ncp = 22, barfill = "lightblue" , linecolor = "hotpink2", addlabels = TRUE, main = "Eigenvalues")
fviz_pca_var(pca_PCA, col.var="contrib")+
scale_color_gradient2(low="#99FF33", mid="#CC0066",
high="black", midpoint=11)
fviz_contrib(pca_PCA, choice = "var", axes = 1,fill = "lightblue",color = "lightblue")
fviz_contrib(pca_PCA, choice = "var", axes = 1:5, fill = "lightblue",color = "lightblue")
fviz_contrib(pca_PCA, choice = "var", axes = 2:5, fill = "lightblue",color = "lightblue")
MDS_dep <- scale(depression)
# Assuming dist_matrix and mds_result are already defined
dist_matrix <- dist(t(MDS_dep))
mds_result <- cmdscale(dist_matrix, k = 2)
# Convert MDS results to a data frame for easier plotting
mds_data <- as.data.frame(mds_result)
colnames(mds_data) <- c("Dim1", "Dim2")
# Add column names as labels
mds_data$Labels <- rownames(mds_result)
# Create a scatter plot of the MDS results with labels and custom x-axis limits
library(ggplot2)
ggplot(mds_data, aes(x = Dim1, y = Dim2, label = Labels)) +
geom_point(size = 4, color = "blue") +
geom_text(vjust = 1.1, hjust = 1.1, size = 3) +
theme_minimal() +
labs(title = "MDS Plot", x = "Dimension 1", y = "Dimension 2") +
xlim(-200, 200) +
theme(plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12))
When both PCA and MDS analyses show that the variables Depression, Academic Stress, and Financial Stress consistently stay together while the rest migrate but maintain similar relative positions, it highlights a robust and strong relationship among these three variables, indicating that they likely share a common underlying factor or are highly correlated. This stability across different dimension reduction techniques emphasizes the importance and interconnection of these variables within the dataset. The migrating variables, while maintaining their relative positions, suggest they are influenced by various dimensions or factors but still retain an underlying structure. The agreement between PCA and MDS underscores the complementary nature of these techniques and reinforces the reliability of the identified clusters and patterns, providing a comprehensive and nuanced understanding of the data. This dual perspective confirms the significance of Depression, Academic Stress, and Financial Stress as key interconnected variables in the analysis.