Author: Tamara Milinković
library(readxl)
mydata <- read_xlsx("StudentsStressDataset.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
## ID Gender Age RecStress Heartbeat Anxiety Sleep Headaches Irritation
## 1 1 0 20 3 4 2 5 2 1
## 2 2 0 20 2 3 2 1 1 1
## 3 3 0 20 5 4 2 2 3 4
## 4 4 1 20 3 4 3 2 3 4
## 5 5 0 20 3 3 3 2 4 4
## 6 6 0 20 3 4 3 2 3 4
## Concentration Sadness Illness Loneliness Overwhelm Competition
## 1 2 2 3 1 5 1
## 2 4 2 1 2 1 2
## 3 2 3 2 3 4 5
## 4 3 5 2 4 1 2
## 5 4 4 1 1 1 2
## 6 1 3 1 2 2 3
## RelationshipStress DiffProf WorkingEnv TroubleRelax HomeEnv LackConfidence
## 1 2 3 1 4 1 2
## 2 4 3 2 1 1 3
## 3 2 2 2 2 1 4
## 4 3 1 1 2 1 2
## 5 1 2 3 1 2 2
## 6 2 2 3 1 2 3
## LackChoice Extracurricular RegularlyAttend Weight StressType
## 1 1 3 1 2 Eustress (Positive Stress)
## 2 2 1 4 2 Eustress (Positive Stress)
## 3 1 1 2 1 Eustress (Positive Stress)
## 4 1 1 5 3 Eustress (Positive Stress)
## 5 4 2 2 2 Eustress (Positive Stress)
## 6 2 4 4 4 Eustress (Positive Stress)
This dataset, titled “Stress and Well-being Data of College Students,” was created to study stress levels and well-being factors among college students aged 18-21. It includes responses from 843 students, collected through a structured Google Form survey. The dataset captures various aspects of students’ lives that may affect stress levels, including academic performance, physical and emotional health, social relationships, and relaxation activities. Participants rated their experiences on a five-point scale from “Not at all” to “Extremely,” providing nuanced insights into each participant’s feelings and behaviors.
Source: Kaggle
A. Singh, K. Singh, A. Kumar, A. Shrivastava and S. Kumar, “Machine Learning Algorithms for Detecting Mental Stress in College Students,” 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), Pune, India, 2024, pp. 1-5, doi: 10.1109/I2CT61223.2024.10544243.
Unit of observation: one college student
Sample size: 843
Response Scale: Five-point Likert scale (“Not at all” to “Extremely”) for stress indicators (variables 4-25)
Variables:
ID: ID of the respondent
Gender: Gender of the respondent (0 = Male, 1 = Female)
Age: Age of the respondent
RecStress: Have you recently experienced stress in your life?: Self-reported recent experience of stress
Heartbeat: Have you noticed a rapid heartbeat or palpitations?: Experience of rapid heartbeat as a stress response
Anxiety: Have you been dealing with anxiety or tension recently?: Frequency of recent anxiety or tension
Sleep: Do you face any sleep problems or difficulties falling asleep?: Sleep disturbances or difficulty falling asleep
Headaches: Have you been getting headaches more often than usual?: Frequency of headaches as a potential stress indicator
Irritation: Do you get irritated easily?: Increased irritability as a stress response
Concentration: Do you have trouble concentrating on your academic tasks?: Difficulties concentrating on studies
Sadness: Have you been feeling sadness or low mood?: Experience of sadness or low mood
Illness: Have you been experiencing any illness or health issues?: Health issues that may relate to stress
Loneliness: Do you often feel lonely or isolated?: Feelings of loneliness or social isolation
Overwhelm: Do you feel overwhelmed with your academic workload?: Overwhelm due to academic responsibilities
Competition: Are you in competition with your peers, and does it affect you?: Perceived stress from competition with peers
RelationshipStress: Do you find that your relationship often causes you stress?: Stress due to relationships
DiffProf: Are you facing any difficulties with your professors or instructors?: Stress from difficulties with professors/instructors
WorkingEnv: Is your working environment unpleasant or stressful?: Perceived unpleasantness or stress in the work environment
TroubleRelax: Do you struggle to find time for relaxation and leisure activities?: Difficulty making time for relaxation or leisure
HomeEnv: Is your hostel or home environment causing you difficulties?: Stressful or difficult home/hostel environment
LackConfidence: Do you lack confidence in your academic performance?: Self-reported lack of confidence in academic abilities
LackChoice: Do you lack confidence in your choice of academic subjects?: Lack of confidence in academic subject choices
Extracurricular: Academic and extracurricular activities conflicting for you?: Conflict between academic and extracurricular activities
RegularlyAttend: Do you attend classes regularly?: Frequency of class attendance
Weight: Have you gained/lost weight?: Self-reported weight gain/loss as a stress indicator
StressType: Which type of stress do you primarily experience?: Self-identified primary type of stress (Eustress, Distress, No Stress)
mydata$GenderFactor <- factor(mydata$Gender,
levels = c(0, 1),
labels = c("Male", "Female"))
summary(mydata[, -c(1,2,3,26,27,28)])
## RecStress Heartbeat Anxiety Sleep
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :2.000 Median :3.000
## Mean :2.998 Mean :2.756 Mean :2.543 Mean :2.786
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Headaches Irritation Concentration Sadness Illness
## Min. :1.000 Min. :1.000 Min. :1.0 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.0 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :3.000 Median :3.0 Median :2.000 Median :2.000
## Mean :2.629 Mean :2.702 Mean :2.7 Mean :2.585 Mean :2.549
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.0 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.0 Max. :5.000 Max. :5.000
## Loneliness Overwhelm Competition RelationshipStress
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :2.000 Median :2.000
## Mean :2.497 Mean :2.504 Mean :2.485 Mean :2.515
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## DiffProf WorkingEnv TroubleRelax HomeEnv
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :2.000 Median :2.000
## Mean :2.447 Mean :2.489 Mean :2.517 Mean :2.425
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## LackConfidence LackChoice Extracurricular RegularlyAttend
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.581 Mean :2.642 Mean :2.757 Mean :3.259
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Weight
## Min. :1.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.399
## 3rd Qu.:3.000
## Max. :5.000
The range of each of the variables shown is 4 (minimum is 1, maximum is 5), since they are all measured on a five-point Likert scale.
The average self-reported recent experience of stress is 2.998.
Half of the students rated conflict between academic and extracurricular activities at 3 or lower, and half rated it at 3 or higher.
75% of the students rated lack of confidence in academic abilities at 3 or lower, and 25% rated it at 3 or higher.
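The quantities interpreted above can be checked on a small vector. This is a minimal sketch on made-up Likert responses (the vector `x` is hypothetical, not taken from the survey); it shows how the range, median and third quartile are obtained in base R.

```r
# Hypothetical five-point Likert responses (not taken from the survey)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5)

diff(range(x))        # range = maximum - minimum
median(x)             # half of the values are <= the median, half are >=
quantile(x, 0.75)     # third quartile
mean(x <= median(x))  # share of responses at or below the median
```

Because Likert responses are discrete, many respondents share the median value, so the share at or below the median is usually larger than exactly 50%.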
For the purpose of clustering, I chose 6 cluster variables: WorkingEnv, HomeEnv, RelationshipStress, DiffProf, LackChoice, Extracurricular.
#Saving standardized cluster variables into new data frame
mydata_clu_std <- as.data.frame(scale(mydata[c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular")]))
#Finding outliers
mydata$Dissimilarity <- sqrt(mydata_clu_std$RelationshipStress^2 + mydata_clu_std$DiffProf^2 + mydata_clu_std$LackChoice^2 + mydata_clu_std$WorkingEnv^2 + mydata_clu_std$HomeEnv^2 + mydata_clu_std$Extracurricular^2)
#Finding units with highest value of dissimilarity
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")])
## ID Dissimilarity
## 195 195 4.856241
## 540 540 4.856241
## 727 727 4.856241
## 667 667 4.626405
## 584 584 4.589811
## 47 47 4.559359
There is a relatively big jump between the third and fourth units, so I will check the first three units.
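The dissimilarity used here is the Euclidean distance of each unit from the centroid, which is the origin once the variables are standardized. A minimal sketch on simulated data follows; the data frame `toy` and its columns `a` and `b` are hypothetical stand-ins, and `rowSums()` is used only as a shorter way of writing the sum of squared columns.

```r
set.seed(1)
# Simulated stand-in for two Likert items (hypothetical data)
toy <- data.frame(a = sample(1:5, 50, replace = TRUE),
                  b = sample(1:5, 50, replace = TRUE))
toy_std <- as.data.frame(scale(toy))

# Euclidean distance of each unit from the centroid (the origin after scaling)
toy$Dissimilarity <- sqrt(rowSums(toy_std^2))

# Largest distances and the gaps between consecutive ones
top <- sort(toy$Dissimilarity, decreasing = TRUE)
head(top)
head(-diff(top))  # a large gap marks a possible cut-off for outliers
```

Looking at the gaps rather than the raw distances makes the "jump" criterion explicit, although where to cut remains a judgment call.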
#Showing students ID195, ID540 and ID727
print(mydata[c(195, 540, 727), ])
## ID Gender Age RecStress Heartbeat Anxiety Sleep Headaches Irritation
## 195 195 0 22 4 4 2 1 5 5
## 540 540 0 20 5 5 5 5 5 5
## 727 727 0 22 4 4 2 1 5 5
## Concentration Sadness Illness Loneliness Overwhelm Competition
## 195 5 5 5 5 1 5
## 540 5 5 5 5 5 5
## 727 5 5 5 5 1 5
## RelationshipStress DiffProf WorkingEnv TroubleRelax HomeEnv LackConfidence
## 195 5 5 5 5 5 5
## 540 5 5 5 5 5 5
## 727 5 5 5 5 5 5
## LackChoice Extracurricular RegularlyAttend Weight
## 195 5 5 4 2
## 540 5 5 5 5
## 727 5 5 4 2
## StressType GenderFactor Dissimilarity
## 195 Distress (Negative Stress) Male 4.856241
## 540 Distress (Negative Stress) Male 4.856241
## 727 Distress (Negative Stress) Male 4.856241
Students ID195, ID540 and ID727 answered with 5 on almost all of the questions, so I will remove these units from the original data frame.
#Removing ID195, ID540 and ID727 from original data frame
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata <- mydata %>%
filter(!ID %in% c(195, 540, 727))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular")]))
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Finding Euclidean distances based on 6 Cluster variables, then saving them into object Distances
Distances <- get_dist(mydata_clu_std,
method = "euclidean")
#Showing matrix of distances
fviz_dist(Distances,
gradient = list(low = "slateblue4",
mid = "skyblue3",
high = "skyblue"))
There are some groups of homogeneous objects forming, but they are not very evident.
#Hopkins statistics
library(factoextra)
get_clust_tendency(mydata_clu_std,
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.5188179
##
## $plot
## NULL
The Hopkins statistic (0.52) is only just above the threshold of 0.5, indicating that the data are not ideal for clustering, although they are still weakly clusterable.
#Determining number of clusters for K-means clustering
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
It seems that the biggest break is at 2, indicating that we should form 2 clusters based on the Elbow method. 7 is the next possible option.
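The curve behind the Elbow method is the total within-cluster sum of squares (WSS) for each candidate number of clusters. A minimal base-R sketch on simulated data is given below; `toy_std` is a hypothetical stand-in for the standardized cluster variables, and the range 1 to 8 is chosen only for illustration.

```r
set.seed(1)
# Simulated, standardized stand-in for the cluster variables (hypothetical data)
toy_std <- scale(matrix(rnorm(200 * 3), ncol = 3))

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) {
  kmeans(toy_std, centers = k, nstart = 25)$tot.withinss
})

# The elbow is the k after which the drop in WSS flattens out
round(wss, 1)
round(-diff(wss), 1)  # decrease in WSS when one more cluster is added
```

Printing the successive decreases makes it easier to compare candidate breaks than reading them off the plot.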
#Determining number of clusters for K-means clustering
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
Since we want average Silhouette to be as high as possible, according to this index, it is definitely the best option to form 2 clusters.
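The average silhouette width can also be computed directly, which gives the numbers behind the plot. The sketch below assumes the `cluster` package (a recommended package in standard R installations) and uses simulated data; `toy_std` and the range 2 to 6 are illustrative assumptions.

```r
library(cluster)

set.seed(1)
# Simulated, standardized stand-in for the cluster variables (hypothetical data)
toy_std <- scale(matrix(rnorm(200 * 3), ncol = 3))
d <- dist(toy_std)

# Average silhouette width of the K-means solution for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy_std, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
round(avg_sil, 3)
```

Values close to 0 indicate overlapping clusters, so the size of the best average width matters as well as which k attains it.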
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(factoextra)
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 840
#Dendrogram to determine number of clusters in case of hierarchical clustering
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on the dendrogram, the biggest distance (jump in heterogeneity) is achieved if we cut it so that 2 groups are formed.
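The jump in heterogeneity can also be read from the merge heights stored in the `hclust` object, and `cutree()` returns the resulting groups. This is a minimal sketch on simulated data; `toy_std` and `hc` are hypothetical names, not objects from the analysis above.

```r
set.seed(1)
# Simulated, standardized stand-in for the cluster variables (hypothetical data)
toy_std <- scale(matrix(rnorm(100 * 3), ncol = 3))
hc <- hclust(dist(toy_std), method = "ward.D2")

# Heights of the last few fusions; a large gap suggests where to cut
tail(hc$height, 5)
tail(diff(hc$height), 4)

# Cutting the tree into 2 groups and counting members
groups <- cutree(hc, k = 2)
table(groups)
```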
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 2.6335 194.9864 113.4592 -4.4258 928.4972 3.123015e+17 517871.4 4083.783
## 3 0.2667 167.2229 56.9403 -7.5087 1474.0135 3.670415e+17 386080.7 3596.801
## 4 4.2210 137.8811 58.2759 -11.7757 1788.1087 4.489521e+17 324853.5 3367.700
## 5 0.5964 125.0387 62.8149 -14.0957 2167.6530 4.464668e+17 291752.5 3148.242
## 6 0.7404 119.9751 30.1699 -13.9625 2554.5160 4.056363e+17 250200.4 2927.978
## 7 0.6221 108.4941 60.4983 -16.5951 2707.0236 4.604488e+17 218267.2 2825.756
## 8 2.4115 108.2614 42.7205 -13.3463 3024.5584 4.120921e+17 190499.3 2634.426
## 9 0.9270 104.8066 39.9152 -11.9639 3236.5550 4.052223e+17 180504.7 2505.763
## 10 1.4489 101.9485 33.8664 -10.6463 3457.7238 3.844680e+17 160222.8 2390.921
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 1.6881 1.2327 0.4224 2.0584 0.1824 1.2695 -108.0556 -0.8144 0.2801
## 3 2.6282 1.3996 0.4117 1.9364 0.1571 1.2762 -78.1303 -0.8295 0.3041
## 4 3.1730 1.4948 0.4126 2.0851 0.1419 1.4777 -99.8957 -1.2375 0.2841
## 5 3.9415 1.5990 0.3936 1.9461 0.1254 1.5853 -97.8382 -1.4105 0.2704
## 6 4.7759 1.7193 0.3876 1.7662 0.1447 2.0952 -129.6331 -1.9950 0.2617
## 7 5.1080 1.7815 0.4036 1.7650 0.1322 1.3999 -54.8491 -1.0915 0.2492
## 8 5.8065 1.9109 0.4109 1.6786 0.1457 1.3851 -53.3773 -1.0593 0.2433
## 9 6.2338 2.0090 0.4224 1.6534 0.1449 2.1205 -82.9614 -2.0031 0.2359
## 10 6.7408 2.1055 0.4172 1.6532 0.1456 2.2067 -92.9606 -2.0772 0.2291
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 2041.8913 0.3557 0.5269 0.7623 0.1097 5e-04 1.7551 2.1143 3.1696
## 3 1198.9338 0.3773 0.4298 1.5185 0.1140 5e-04 1.5382 1.9807 2.1819
## 4 841.9250 0.3767 0.6673 2.1468 0.1189 5e-04 1.5086 1.9192 1.0414
## 5 629.6484 0.3541 0.1200 2.9324 0.1164 6e-04 1.5517 1.8509 0.7550
## 6 487.9963 0.3659 0.3029 3.4867 0.1195 6e-04 1.4781 1.7828 0.6523
## 7 403.6795 0.3593 0.1518 4.0201 0.1271 7e-04 1.5178 1.7523 0.6034
## 8 329.3032 0.3602 0.2124 4.6301 0.1338 7e-04 1.4379 1.6967 0.5638
## 9 278.4181 0.3556 0.2242 5.1475 0.1404 8e-04 1.4471 1.6554 0.5358
## 10 239.0921 0.3483 0.0530 5.8146 0.1417 8e-04 1.4946 1.6168 0.5094
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.8021 125.6198 1
## 3 0.7873 97.5067 1
## 4 0.7718 91.3682 1
## 5 0.7507 87.9976 1
## 6 0.7408 86.7609 1
## 7 0.7522 63.2574 1
## 8 0.7262 72.3836 1
## 9 0.6870 71.5133 1
## 10 0.7018 72.2356 1
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 4.000 2.0000 3.0000 2.0000 3.0000 6.000000e+00 3.0
## Value_Index 4.221 194.9864 56.5189 -4.4258 545.5163 9.564296e+16 131790.7
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 3.0000 3.0000 6.0000 10.0000 2.0000 2.0000
## Value_Index 257.8797 0.9401 -0.0717 0.3876 1.6532 0.1824 1.2695
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 3.0000 3.0000 3.0000 1 2.0000
## Value_Index -108.0556 -0.8144 0.3041 842.9575 0.3773 NA 0.7623
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 10.0000 0 8.0000 0 10.0000
## Value_Index 0.1417 0 1.4379 0 0.5094
##
## $Best.partition
## [1] 2 2 2 2 3 3 1 2 1 2 2 2 1 3 2 1 1 2 1 2 1 2 1 1 1 2 1 1 2 1 2 1 2 1 3 2 1
## [38] 2 3 2 2 1 3 1 1 2 1 3 3 2 2 3 2 2 1 3 2 1 3 3 2 3 2 2 1 3 2 1 3 2 3 1 3 3
## [75] 2 3 3 2 2 2 3 2 3 1 3 3 2 3 2 1 2 3 1 2 2 3 3 3 3 1 2 1 2 3 2 2 2 2 2 2 1
## [112] 3 3 3 2 2 3 2 1 2 2 2 3 1 3 3 1 3 2 2 2 2 2 1 3 1 1 1 2 1 2 1 2 2 2 2 2 3
## [149] 2 2 3 1 1 1 2 1 3 2 3 1 2 1 3 1 2 2 3 1 3 1 3 2 1 2 3 1 1 1 3 3 2 1 3 3 1
## [186] 3 3 1 1 1 3 3 1 3 3 1 1 3 1 3 3 3 3 3 3 1 1 1 3 1 1 1 3 3 1 1 3 2 1 2 3 1
## [223] 2 1 3 2 3 1 3 3 2 1 3 1 3 1 3 1 1 1 3 1 1 3 1 3 1 3 3 3 1 3 2 1 2 2 1 3 1
## [260] 3 3 1 3 2 3 3 2 2 2 1 2 1 3 1 1 2 1 2 2 2 1 2 1 2 2 3 1 3 3 2 1 3 1 2 1 3
## [297] 1 1 1 2 1 1 2 1 2 1 2 1 3 2 1 3 1 3 1 2 2 2 3 2 3 2 3 3 3 2 3 2 3 2 3 1 2
## [334] 1 2 3 3 3 3 1 1 2 2 1 3 3 3 2 3 2 2 2 1 2 1 2 1 3 1 2 1 3 1 1 1 2 1 1 3 1
## [371] 2 1 2 2 2 1 3 3 1 2 2 3 2 3 1 2 2 3 2 3 3 3 3 2 2 2 2 1 3 1 2 2 3 2 1 2 2
## [408] 2 2 2 2 1 3 2 2 3 2 2 2 2 3 2 2 3 2 2 2 3 2 2 2 1 2 2 1 1 2 3 2 3 2 3 2 2
## [445] 2 2 2 2 2 2 2 2 2 2 2 1 2 1 3 1 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 3 1 1 3 2 2
## [482] 1 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 1 2 1 1 1 2 1 2 1 2 1 3 3 3 3 2 1
## [519] 1 1 2 1 1 3 1 3 2 1 3 2 1 2 2 2 2 1 2 2 2 3 2 2 2 2 2 1 2 3 2 2 3 2 2 3 3
## [556] 1 2 2 1 3 2 2 1 1 3 2 1 2 3 2 2 1 2 2 2 1 1 3 3 1 2 1 1 1 2 3 2 3 3 3 1 3
## [593] 2 2 3 3 2 2 1 2 3 3 3 1 2 3 3 2 2 3 1 1 3 1 3 1 2 3 2 1 2 3 1 2 2 3 3 3 3
## [630] 1 1 2 2 1 2 2 2 2 2 2 3 3 3 3 2 1 3 2 3 1 2 2 1 2 3 3 3 1 2 2 2 2 2 1 1 1
## [667] 2 1 2 1 2 3 2 2 1 2 2 3 2 2 3 1 1 2 2 3 1 2 3 1 2 1 2 2 1 2 1 2 3 2 3 1 1
## [704] 2 3 3 3 2 3 3 2 3 3 3 1 3 1 3 1 1 3 1 1 3 3 1 1 3 1 3 3 3 3 3 3 1 1 1 3 1
## [741] 1 1 3 3 1 1 3 2 1 2 3 2 2 3 3 1 3 1 3 3 2 1 1 1 3 1 3 1 2 1 1 1 1 3 2 1 3
## [778] 3 3 3 1 3 2 3 2 2 1 3 1 3 1 1 1 2 3 3 1 2 2 1 2 1 3 2 1 2 3 2 1 2 3 3 3 1
## [815] 2 1 1 1 3 2 2 1 1 2 1 3 1 1 2 2 3 1 3 1 2 1 2 3 3 2
According to the majority rule, the best number of clusters for K-means clustering is 3. I also tested 2 clusters (as suggested by the Elbow method and Silhouette analysis), but the between_SS / total_SS ratio was much lower than with 3 clusters (around 17% compared to 28.5%). That is why I chose to form 3 clusters.
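The between_SS / total_SS ratio can be taken straight from the `kmeans` object. The sketch below, on simulated data with hypothetical names, compares the ratio for several values of k. One caveat worth keeping in mind: this ratio tends to rise whenever a cluster is added, so a higher value for 3 than for 2 clusters is expected and is best read together with the other criteria.

```r
set.seed(1)
# Simulated, standardized stand-in for the cluster variables (hypothetical data)
toy_std <- scale(matrix(rnorm(200 * 3), ncol = 3))

# Share of total variability explained by the grouping, for k = 2, ..., 5
ratio <- sapply(2:5, function(k) {
  km <- kmeans(toy_std, centers = k, nstart = 25)
  km$betweenss / km$totss
})
names(ratio) <- 2:5
round(100 * ratio, 1)  # in percent
```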
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
Clustering
## K-means clustering with 3 clusters of sizes 251, 329, 260
##
## Cluster means:
## WorkingEnv HomeEnv RelationshipStress DiffProf LackChoice
## 1 -0.2460615 -0.1188777 -0.3451175 -0.3921323 0.4933120
## 2 -0.4291464 -0.3762847 -0.2373695 -0.3891262 -0.6669402
## 3 0.7805793 0.5909076 0.6335348 0.8709528 0.3677002
## Extracurricular
## 1 0.9786682
## 2 -0.6435059
## 3 -0.1305088
##
## Clustering vector:
## [1] 2 2 2 2 1 1 3 2 3 2 2 2 3 1 2 3 3 2 3 2 3 2 3 3 3 2 3 3 2 3 2 3 2 3 1 2 3
## [38] 2 1 2 2 3 1 3 3 2 3 1 1 2 2 1 2 2 3 1 2 3 1 1 2 1 2 2 3 1 2 3 1 2 1 3 1 1
## [75] 2 1 1 2 2 2 1 2 1 3 1 1 2 1 2 3 2 1 3 2 2 1 1 1 1 3 2 3 2 1 2 2 2 2 2 2 3
## [112] 1 1 1 2 2 1 2 3 2 2 2 1 3 1 1 3 1 2 2 2 2 2 3 1 3 3 3 2 3 2 3 2 2 2 2 2 1
## [149] 2 2 1 3 3 3 2 3 1 2 1 3 2 3 1 3 2 2 1 3 1 3 1 2 3 2 1 3 3 3 1 1 2 3 1 1 3
## [186] 1 1 3 3 3 1 1 3 1 1 3 3 1 3 1 1 1 1 1 1 3 3 3 1 3 3 3 1 1 3 3 1 2 3 2 1 3
## [223] 2 3 1 2 1 3 1 1 2 3 1 3 1 3 1 3 3 3 1 3 3 1 3 1 3 1 1 1 3 1 2 3 2 2 3 1 3
## [260] 1 1 3 1 2 1 1 2 2 2 3 2 3 1 3 3 2 3 2 2 2 3 2 3 2 2 1 3 1 1 2 3 1 3 2 3 1
## [297] 3 3 3 2 3 3 2 3 2 3 2 3 1 2 3 1 3 1 3 2 2 2 1 2 1 2 1 1 1 2 1 2 1 2 1 3 2
## [334] 3 2 1 1 1 1 3 3 2 2 3 1 1 1 2 1 2 2 2 3 2 3 2 3 1 3 2 3 1 3 3 3 2 3 3 1 3
## [371] 2 3 2 2 2 3 1 1 3 2 2 1 2 1 3 2 2 1 2 1 1 1 1 2 2 2 2 3 1 3 2 2 1 2 3 2 2
## [408] 2 2 2 2 3 1 2 2 1 2 2 2 2 1 2 2 1 2 2 2 1 2 2 2 3 2 2 3 3 2 1 2 1 2 1 2 2
## [445] 2 2 2 2 2 2 2 2 2 2 2 3 2 3 1 3 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 1 3 3 1 2 2
## [482] 3 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 3 2 3 2 3 3 3 2 3 2 3 2 3 1 1 1 1 2 3
## [519] 3 3 2 3 3 1 3 1 2 3 1 2 3 2 2 2 2 3 2 2 2 1 2 2 2 2 2 3 2 1 2 2 1 2 2 1 1
## [556] 3 2 2 3 1 2 2 3 3 1 2 3 2 1 2 2 3 2 2 2 3 3 1 1 3 2 3 3 3 2 1 2 1 1 1 3 1
## [593] 2 2 1 1 2 2 3 2 1 1 1 3 2 1 1 2 2 1 3 3 1 3 1 3 2 1 2 3 2 1 3 2 2 1 1 1 1
## [630] 3 3 2 2 3 2 2 2 2 2 2 1 1 1 1 2 3 1 2 1 3 2 2 3 2 1 1 1 3 2 2 2 2 2 3 3 3
## [667] 2 3 2 3 2 1 2 2 3 2 2 1 2 2 1 3 3 2 2 1 3 2 1 3 2 3 2 2 3 2 3 2 1 2 1 3 3
## [704] 2 1 1 1 2 1 1 2 1 1 1 3 1 3 1 3 3 1 3 3 1 1 3 3 1 3 1 1 1 1 1 1 3 3 3 1 3
## [741] 3 3 1 1 3 3 1 2 3 2 1 2 2 1 1 3 1 3 1 1 2 3 3 3 1 3 1 3 2 3 3 3 3 1 2 3 1
## [778] 1 1 1 3 1 2 1 2 2 3 1 3 1 3 3 3 2 1 1 3 2 2 3 2 3 1 2 3 2 1 2 3 2 1 1 1 3
## [815] 2 3 3 3 1 2 2 3 3 2 3 1 3 3 2 2 1 3 1 3 2 3 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 1084.977 1188.643 1323.181
## (between_SS / total_SS = 28.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Units ID382, ID130 and ID506 seem to be far away from the center, so I will remove them.
mydata <- mydata %>%
filter(!ID %in% c(382, 130, 506))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular")]))
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
Clustering
## K-means clustering with 3 clusters of sizes 248, 262, 327
##
## Cluster means:
## WorkingEnv HomeEnv RelationshipStress DiffProf LackChoice
## 1 -0.2430586 -0.1362045 -0.3325003 -0.3847039 0.4960717
## 2 0.7617421 0.5849697 0.6345592 0.8777868 0.3610986
## 3 -0.4259875 -0.3653925 -0.2562521 -0.4115400 -0.6655462
## Extracurricular
## 1 0.9828920
## 2 -0.1264463
## 3 -0.6441232
##
## Clustering vector:
## [1] 3 3 3 3 1 1 2 3 2 3 3 3 2 1 3 2 2 3 2 3 2 3 2 2 2 3 2 2 3 2 3 2 3 2 1 3 2
## [38] 3 1 3 3 2 1 2 2 3 2 1 1 3 3 1 3 3 2 1 3 2 1 1 3 1 3 3 2 1 3 2 1 3 1 2 1 1
## [75] 3 1 1 3 3 3 1 3 1 2 1 1 3 1 3 2 3 1 2 3 3 1 1 1 1 2 3 2 3 1 3 3 3 3 3 3 2
## [112] 1 1 1 3 3 1 3 2 3 3 3 1 2 1 1 2 1 3 3 3 3 2 1 2 2 2 3 2 3 2 3 3 3 3 3 1 3
## [149] 3 1 2 2 2 3 2 1 3 1 2 3 2 1 2 3 3 1 2 1 2 1 3 2 3 1 2 2 2 1 1 3 2 1 1 2 1
## [186] 1 2 2 2 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 2 2 2 1 2 2 2 1 1 2 2 1 3 2 3 1 2 3
## [223] 2 1 3 1 2 1 1 3 2 1 2 1 2 1 2 2 2 1 2 2 1 2 1 2 1 1 1 2 1 3 2 3 3 2 1 2 1
## [260] 1 2 1 3 1 1 3 3 3 2 3 2 1 2 2 3 2 3 3 3 2 3 2 3 3 1 2 1 1 3 2 1 2 3 2 1 2
## [297] 2 2 3 2 2 3 2 3 2 3 2 1 3 2 1 2 1 2 3 3 3 1 3 1 3 1 1 1 3 1 3 1 3 1 2 3 2
## [334] 3 1 1 1 1 2 2 3 3 2 1 1 1 3 1 3 3 3 2 3 2 3 2 1 2 3 2 1 2 2 2 3 2 2 1 2 3
## [371] 2 3 3 3 2 1 1 2 3 3 3 1 2 3 3 1 3 1 1 1 1 3 3 3 3 2 1 2 3 3 1 3 2 3 2 3 3
## [408] 3 3 2 1 3 3 1 3 3 3 3 1 3 3 1 3 3 3 3 3 3 3 2 3 3 2 2 3 3 3 1 3 1 3 3 3 3
## [445] 3 3 3 3 3 3 2 3 3 2 3 2 1 2 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 1 2 2 1 3 3 2 3
## [482] 3 3 3 3 3 3 3 3 2 3 2 3 3 3 3 3 2 3 2 3 2 2 3 2 3 2 3 2 1 1 1 1 3 2 2 2 3
## [519] 2 2 1 2 1 3 2 1 3 2 3 3 3 3 2 3 3 3 1 3 3 3 3 3 2 3 1 3 3 1 3 3 1 1 2 3 3
## [556] 2 1 3 3 2 2 1 3 2 3 1 3 3 2 3 3 3 2 2 1 1 2 3 2 2 2 3 1 3 1 1 1 2 1 3 3 1
## [593] 1 3 3 2 3 1 1 1 2 3 1 1 3 3 1 2 2 1 2 1 2 3 1 3 2 3 1 2 3 3 1 1 1 1 2 2 3
## [630] 3 2 3 3 3 3 3 3 1 1 1 1 3 2 1 3 1 2 3 3 2 3 1 1 1 2 3 3 3 3 3 2 2 2 3 2 3
## [667] 2 3 1 3 3 2 3 3 1 3 3 1 2 2 3 3 1 2 3 1 2 3 2 3 3 2 3 2 3 1 3 1 2 2 3 1 1
## [704] 1 3 1 1 3 1 1 1 2 1 2 1 2 2 1 2 2 1 1 2 2 1 2 1 1 1 1 1 1 2 2 2 1 2 2 2 1
## [741] 1 2 2 1 3 2 3 1 2 3 1 1 2 1 2 1 1 3 2 2 2 1 2 1 2 3 2 2 2 2 1 3 2 1 1 1 1
## [778] 2 1 3 1 3 3 2 1 2 1 2 2 2 3 1 1 2 3 3 2 3 2 1 3 2 3 1 3 2 3 1 1 1 2 3 2 2
## [815] 2 1 3 3 2 2 3 2 1 2 2 3 3 1 2 1 2 3 2 3 1 1 3
##
## Within cluster sum of squares by cluster:
## [1] 1072.378 1336.947 1174.914
## (between_SS / total_SS = 28.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
Based on new cluster plot, I will remove units ID513 and ID768.
mydata <- mydata %>%
filter(!ID %in% c(513, 768))
mydata$ID <- seq(1, nrow(mydata))
mydata_clu_std <- as.data.frame(scale(mydata[c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular")]))
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
Clustering
## K-means clustering with 3 clusters of sizes 261, 248, 326
##
## Cluster means:
## WorkingEnv HomeEnv RelationshipStress DiffProf LackChoice
## 1 0.7625108 0.5865644 0.6395867 0.8785945 0.3608740
## 2 -0.2463915 -0.1415050 -0.3343825 -0.3808009 0.4893792
## 3 -0.4230375 -0.3619635 -0.2576849 -0.4137256 -0.6612090
## Extracurricular
## 1 -0.1298515
## 2 0.9866033
## 3 -0.6465840
##
## Clustering vector:
## [1] 3 3 3 3 2 2 1 3 1 3 3 3 1 2 3 1 1 3 1 3 1 3 1 1 1 3 1 1 3 1 3 1 3 1 2 3 1
## [38] 3 2 3 3 1 2 1 1 3 1 2 2 3 3 2 3 2 1 2 3 1 2 2 3 2 3 3 1 2 3 1 2 3 2 1 2 2
## [75] 3 2 2 3 3 3 2 3 2 1 2 2 3 2 3 1 3 2 1 3 3 2 2 2 2 1 3 1 3 2 3 3 3 3 3 3 1
## [112] 2 2 2 3 3 2 3 1 3 3 3 2 1 2 2 1 2 3 3 3 3 1 2 1 1 1 3 1 3 1 3 3 3 3 3 2 3
## [149] 3 2 1 1 1 3 1 2 3 2 1 3 1 2 1 3 3 2 1 2 1 2 3 1 3 2 1 1 1 2 2 3 1 2 2 1 2
## [186] 2 1 1 1 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2 1 1 2 3 1 3 2 1 3
## [223] 1 2 3 2 1 2 2 3 1 2 1 2 1 2 1 1 1 2 1 1 2 1 2 1 2 2 2 1 2 3 1 3 3 1 2 1 2
## [260] 2 1 2 3 2 2 3 3 3 1 3 1 2 1 1 3 1 3 3 3 1 3 1 3 3 2 1 2 2 3 1 2 1 3 1 2 1
## [297] 1 1 3 1 1 3 1 3 1 3 1 2 3 1 2 1 2 1 3 3 3 2 3 2 3 2 2 2 3 2 3 2 3 2 1 3 1
## [334] 3 2 2 2 2 1 1 3 3 1 2 2 2 3 2 3 3 3 1 3 1 3 1 2 1 3 1 2 1 1 1 3 1 1 2 1 3
## [371] 1 3 3 3 1 2 2 1 3 3 3 2 1 3 3 2 3 2 2 2 2 3 3 3 3 1 2 1 3 3 2 3 1 3 1 3 3
## [408] 3 3 1 2 3 3 2 3 3 3 3 2 3 3 2 3 3 3 3 3 3 3 1 3 3 1 1 3 3 3 2 3 2 3 3 3 3
## [445] 3 3 3 3 3 3 1 3 3 1 3 1 2 1 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 2 1 1 2 3 3 1 3
## [482] 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 1 3 1 3 1 1 3 1 3 1 3 1 2 2 2 3 1 1 1 3 1
## [519] 1 2 1 2 3 1 2 3 1 3 3 3 3 1 3 3 3 2 3 3 3 3 3 1 3 2 3 3 2 3 3 2 2 1 3 3 1
## [556] 2 3 3 1 1 2 3 1 3 2 3 3 1 3 3 3 1 1 2 2 1 3 1 1 1 3 2 3 2 2 2 1 2 3 3 2 2
## [593] 3 3 1 3 2 2 2 1 3 2 2 3 3 2 1 1 2 1 2 1 3 2 3 1 3 2 1 3 3 2 2 2 2 1 1 3 3
## [630] 1 3 3 3 3 3 3 2 2 2 2 3 1 2 3 2 1 3 3 1 3 2 2 2 1 3 3 3 3 3 1 1 1 3 1 3 1
## [667] 3 2 3 3 1 3 3 2 3 3 2 1 1 3 3 2 1 3 2 1 3 1 3 3 1 3 1 3 2 3 2 1 1 3 2 2 2
## [704] 3 2 2 3 2 2 2 1 2 1 2 1 1 2 1 1 2 2 1 1 2 1 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2
## [741] 1 1 2 3 1 3 2 1 3 2 2 1 2 1 2 2 3 1 1 1 2 1 2 1 3 1 1 1 2 3 1 2 2 2 2 1 2
## [778] 3 2 3 3 1 2 1 2 1 1 1 3 2 2 1 3 3 1 3 1 2 3 1 3 2 3 1 3 2 2 2 1 3 1 1 1 2
## [815] 3 3 1 1 3 1 2 1 1 3 3 2 1 2 1 3 1 3 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 1328.425 1070.682 1174.886
## (between_SS / total_SS = 28.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The ratio between_SS / total_SS improved slightly, from 28.5% to 28.6%.
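The cluster numbers are arbitrary and changed between the runs above (the high-stress cluster is numbered 3 in the first run and 1 in the last). Fixing the random seed before `kmeans()` is one way to keep the numbering stable when the report is re-run on the same data. A minimal sketch, with simulated data and an arbitrary seed value:

```r
set.seed(1)
# Simulated, standardized stand-in for the cluster variables (hypothetical data)
toy_std <- scale(matrix(rnorm(200 * 3), ncol = 3))

set.seed(123)  # arbitrary seed, chosen only for illustration
run1 <- kmeans(toy_std, centers = 3, nstart = 25)

set.seed(123)  # same seed, so the same starting positions are drawn
run2 <- kmeans(toy_std, centers = 3, nstart = 25)

identical(run1$cluster, run2$cluster)
```

A fixed seed only guarantees the same labels for the same data; after units are removed, the numbering can still change, so group descriptions should be matched to the cluster means rather than to the numbers.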
#Average values of cluster variables to describe groups
Averages <- Clustering$centers
Averages
## WorkingEnv HomeEnv RelationshipStress DiffProf LackChoice
## 1 0.7625108 0.5865644 0.6395867 0.8785945 0.3608740
## 2 -0.2463915 -0.1415050 -0.3343825 -0.3808009 0.4893792
## 3 -0.4230375 -0.3619635 -0.2576849 -0.4137256 -0.6612090
## Extracurricular
## 1 -0.1298515
## 2 0.9866033
## 3 -0.6465840
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular"),
labels = c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
Group 1 (Stressed)
Experience above average: unpleasantness or stress in the work environment, stressful or difficult home/hostel environment, stress due to relationships, stress from difficulties with professors, and lack of confidence in academic subject choices.
The only below-average value for this group is for the conflict between academic and extracurricular activities.
Group 2 (Average)
Above average rating only for lack of confidence in academic subject choices and conflict between academic and extracurricular activities.
Group 3 (Relaxed)
Below average rating for all of the variables.
#Saving where each unit belongs
mydata$Group <- Clustering$cluster
#Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(WorkingEnv, HomeEnv, RelationshipStress, DiffProf, LackChoice, Extracurricular) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response WorkingEnv :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 315.35 157.677 153.83 < 2.2e-16 ***
## Residuals 832 852.79 1.025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response HomeEnv :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 211.97 105.986 82.108 < 2.2e-16 ***
## Residuals 832 1073.95 1.291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response RelationshipStress :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 236.41 118.206 95.825 < 2.2e-16 ***
## Residuals 832 1026.32 1.234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response DiffProf :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 413.25 206.623 225.58 < 2.2e-16 ***
## Residuals 832 762.08 0.916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response LackChoice :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 395.14 197.569 164.09 < 2.2e-16 ***
## Residuals 832 1001.77 1.204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Extracurricular :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 598.64 299.320 351.73 < 2.2e-16 ***
## Residuals 832 708.03 0.851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the variable WorkingEnv:
H0: μWorkingEnv,G1 = μWorkingEnv,G2 = μWorkingEnv,G3
H1: At least one μWorkingEnv,j is different.
We can reject H0 (p<0.001). We can conclude that the mean for WorkingEnv differs for at least one of the groups.
The same hypotheses and conclusion apply to each of the other cluster variables.
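Since the same test is repeated for every cluster variable, the p-values can be collected in one step. The following is a minimal sketch on simulated data; `toy`, `v1` and `v2` are hypothetical names, and a separate one-way ANOVA is fitted per variable.

```r
set.seed(1)
# Simulated groups with different means (hypothetical data)
toy <- data.frame(Group = factor(rep(1:3, each = 30)),
                  v1 = rnorm(90, mean = rep(c(0, 1, 2), each = 30)),
                  v2 = rnorm(90, mean = rep(c(0, 0.5, 1), each = 30)))

# One-way ANOVA per variable, keeping only the p-value of the group effect
p_values <- sapply(c("v1", "v2"), function(v) {
  fit <- aov(toy[[v]] ~ toy$Group)
  summary(fit)[[1]][["Pr(>F)"]][1]
})
p_values
```

Because the groups were formed from these same variables, small p-values are to be expected here; the tests serve as a check that every variable contributes to the separation.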
#Additional variables
aggregate(mydata$Overwhelm,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 2.923372
## 2 2 2.366935
## 3 3 2.273006
aggregate(mydata$Competition,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 2.931034
## 2 2 2.387097
## 3 3 2.177914
The additional variables confirm the classification above. Students in Group 1 (Stressed) report the highest mean overwhelm from academic responsibilities and the highest mean stress from competition with peers, while students in Group 3 (Relaxed) report the lowest means on both.
#Checking homogeneity of variances
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$Overwhelm, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 4.3044 0.01381 *
## 832
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: σ²Overwhelm,G1 = σ²Overwhelm,G2 = σ²Overwhelm,G3
H1: At least one σ²Overwhelm,j is different.
We can reject H0 (p=0.014). Equal variances cannot be assumed for Overwhelm, so an ANOVA on this variable would need the Welch correction.
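For reference, a Welch-corrected one-way ANOVA is available in base R through `oneway.test()` with `var.equal = FALSE`. The sketch below runs it on simulated stand-in data, because the survey data are not reproduced here. The data frame `toy` and all of its values are hypothetical and are not results from this study.

```r
# Simulated stand-in for mydata (hypothetical values, NOT the survey data)
set.seed(1)
toy <- data.frame(
  Group = factor(rep(1:3, times = c(40, 40, 40))),
  Overwhelm = c(rnorm(40, mean = 2.9, sd = 1.4),
                rnorm(40, mean = 2.4, sd = 1.1),
                rnorm(40, mean = 2.3, sd = 0.9))
)

# Welch's one-way ANOVA: var.equal = FALSE drops the equal-variance assumption
welch <- oneway.test(Overwhelm ~ Group, data = toy, var.equal = FALSE)
welch
```

On the real data the equivalent call would use `data = mydata` and `as.factor(Group)`, as in the other tests in this section.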
library(car)
leveneTest(mydata$Competition, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 2.5122 0.08171 .
## 832
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: σ²Competition,G1 = σ²Competition,G2 = σ²Competition,G3
H1: At least one σ²Competition,j is different.
We can’t reject H0 (p=0.082). We can assume equal variances in all groups for the variable Competition.
#Checking normal distribution of variables
library(dplyr)
library(rstatix)
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(Overwhelm)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Overwhelm 0.897 2.49e-12
## 2 2 Overwhelm 0.879 3.59e-13
## 3 3 Overwhelm 0.874 1.06e-15
H0: Overwhelm is normally distributed in Group 1 (row 1).
H1: Overwhelm is not normally distributed in Group 1 (row 1).
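We can reject H0 (p<0.001); the same holds for Groups 2 and 3 (rows 2 and 3). Because normality is violated, the non-parametric Kruskal-Wallis test will be used instead of ANOVA.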
library(dplyr)
library(rstatix)
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(Competition)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Competition 0.911 2.38e-11
## 2 2 Competition 0.889 1.53e-12
## 3 3 Competition 0.844 1.80e-17
H0: Competition is normally distributed in Group 1 (row 1).
H1: Competition is not normally distributed in Group 1 (row 1).
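We can reject H0 (p<0.001); the same holds for Groups 2 and 3 (rows 2 and 3). Because normality is violated, the non-parametric Kruskal-Wallis test will be used instead of ANOVA.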
kruskal.test(Overwhelm ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Overwhelm by as.factor(Group)
## Kruskal-Wallis chi-squared = 38.387, df = 2, p-value = 4.617e-09
H0: The location of the distribution of Overwhelm is the same in all 3 groups.
H1: The location of the distribution of Overwhelm differs for at least one group.
We can reject H0 (p<0.001). The groups differ in Overwhelm, which supports the cluster solution.
kruskal.test(Competition ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Competition by as.factor(Group)
## Kruskal-Wallis chi-squared = 59.172, df = 2, p-value = 1.416e-13
H0: The location of the distribution of Competition is the same in all 3 groups.
H1: The location of the distribution of Competition differs for at least one group.
We can reject H0 (p<0.001). The groups differ in Competition, which again supports the cluster solution.
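With 835 observations, even small differences produce very small p-values, so an effect size helps put the two Kruskal-Wallis results in context. The sketch below uses one common formula, the eta-squared based on the H statistic, (H - k + 1) / (n - k), applied to the statistics printed above. The variable names are illustrative, and n = 835 is inferred from the degrees of freedom in the earlier output.

```r
# Kruskal-Wallis statistics copied from the output above
H <- c(Overwhelm = 38.387, Competition = 59.172)
n <- 835  # number of students in the tests (residual df 832 + 3 groups)
k <- 3    # number of groups

# Eta-squared based on H: (H - k + 1) / (n - k)
eta_sq_H <- (H - k + 1) / (n - k)
round(eta_sq_H, 3)
```

Under this formula both values fall below 0.1, so the group differences in the two validation variables are clearly significant but modest in size.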
#Checking the association between the self-identified primary type of stress and classification of students into 3 groups
chi_square <- chisq.test(mydata$StressType, as.factor(mydata$Group))
chi_square
##
## Pearson's Chi-squared test
##
## data: mydata$StressType and as.factor(mydata$Group)
## X-squared = 54.158, df = 4, p-value = 4.878e-11
H0: There is no association between the type of stress and the classification of students into 3 groups.
H1: There is an association between the type of stress and the classification of students into 3 groups.
We can reject H0 (p<0.001). The self-identified stress type is associated with group membership, which validates the result again.
addmargins(chi_square$observed)
## as.factor(mydata$Group)
## mydata$StressType 1 2 3 Sum
## Distress (Negative Stress) 19 9 0 28
## Eustress (Positive Stress) 241 231 292 764
## No Stress 1 8 34 43
## Sum 261 248 326 835
addmargins(round(chi_square$expected, 2))
## as.factor(mydata$Group)
## mydata$StressType 1 2 3 Sum
## Distress (Negative Stress) 8.75 8.32 10.93 28
## Eustress (Positive Stress) 238.81 226.91 298.28 764
## No Stress 13.44 12.77 16.79 43
## Sum 261.00 248.00 326.00 835
All expected frequencies are larger than 5.
round(chi_square$res, 2)
## as.factor(mydata$Group)
## mydata$StressType 1 2 3
## Distress (Negative Stress) 3.46 0.24 -3.31
## Eustress (Positive Stress) 0.14 0.27 -0.36
## No Stress -3.39 -1.34 4.20
More students than expected in Group 1 (Stressed) experience Distress (α = 0.1%).
Fewer students than expected in Group 1 (Stressed) experience No Stress (α = 0.1%).
Fewer students than expected in Group 3 (Relaxed) experience Distress (α = 0.1%).
More students than expected in Group 3 (Relaxed) experience No Stress (α = 0.1%).
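The α = 0.1% statements follow from comparing each Pearson residual with the two-sided standard normal critical value. A minimal sketch that rebuilds the observed table from the counts printed above and flags the cells beyond that threshold is given below. The object names and the shortened row labels are chosen here for illustration, and treating Pearson residuals as approximately standard normal is the usual rule of thumb rather than an exact test.

```r
# Observed counts copied from addmargins(chi_square$observed) above
observed <- matrix(
  c( 19,   9,   0,
    241, 231, 292,
      1,   8,  34),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    StressType = c("Distress", "Eustress", "No Stress"),
    Group      = c("1", "2", "3")
  )
)

test <- chisq.test(observed)

# Two-sided critical value of the standard normal at alpha = 0.1%
crit <- qnorm(1 - 0.001 / 2)

round(test$residuals, 2)     # Pearson residuals: (observed - expected) / sqrt(expected)
abs(test$residuals) > crit   # TRUE marks cells that deviate at alpha = 0.1%
```

Only the four corner cells (Distress and No Stress in Groups 1 and 3) exceed the threshold, which matches the four statements above.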
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
effectsize::cramers_v(mydata$StressType, mydata$Group)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.17 | [0.12, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.17)
## [1] "small"
## (Rules: funder2019)
There is a small association between the type of stress and classification of students into 3 groups.
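To show where the 0.17 comes from, Cramér's V can be recomputed by hand from the chi-squared statistic. The sketch below gives the plain version and a bias-corrected version. It assumes that the "(adj.)" value reported by `effectsize` corresponds to this small-sample bias correction, and the variable names are illustrative.

```r
# Values copied from the chi-squared test above
chi2   <- 54.158
n      <- 835
n_rows <- 3   # stress types
n_cols <- 3   # groups

# Plain Cramer's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))
v_plain <- sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Bias-corrected version: shrink phi-squared and the table dimensions
phi2_adj <- max(0, chi2 / n - (n_rows - 1) * (n_cols - 1) / (n - 1))
rows_adj <- n_rows - (n_rows - 1)^2 / (n - 1)
cols_adj <- n_cols - (n_cols - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(rows_adj - 1, cols_adj - 1))

round(c(plain = v_plain, adjusted = v_adj), 2)
```

The plain value rounds to 0.18 and the corrected value to 0.17, so the correction makes little difference at this sample size.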
chi_square <- chisq.test(mydata$GenderFactor, as.factor(mydata$Group))
chi_square
##
## Pearson's Chi-squared test
##
## data: mydata$GenderFactor and as.factor(mydata$Group)
## X-squared = 1.7694, df = 2, p-value = 0.4128
We can’t reject H0 (p=0.41). We found no evidence of an association between the gender of students and their classification into 3 groups.
aggregate(mydata$Age,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 19.80460
## 2 2 20.55242
## 3 3 19.91411
library(car)
leveneTest(mydata$Age, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.6497 0.5225
## 832
We can’t reject H0 (p=0.52). We can assume equal variances in all groups for the variable Age.
library(dplyr)
library(rstatix)
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(Age)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Age 0.607 5.26e-24
## 2 2 Age 0.217 5.85e-31
## 3 3 Age 0.237 2.20e-34
We can reject H0 in all 3 groups (p<0.001). Age is not normally distributed. Kruskal-Wallis test will be used.
kruskal.test(Age ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Age by as.factor(Group)
## Kruskal-Wallis chi-squared = 1.9209, df = 2, p-value = 0.3827
We can’t reject H0 (p=0.38). We can’t say that the groups differ significantly in age.
We clustered 835 college students based on six standardized variables.
Cluster 1 (Stressed) consists of 261 students (261/835, 31.26%) who report above-average stress in many areas of their lives. These include work environment, home/hostel environment, relationships, difficulties with professors, and academic subject choices. More students than expected in this group experience Distress (negative stress). Students in this group are also the most overwhelmed by academic responsibilities and the most stressed by competition with peers, compared to the other two groups. Because extracurricular activities are the only area that does not cause them above-average stress, offering activities that are actively stress-relieving might be useful for this group. Measures to reduce the stress these students experience should be considered.
Cluster 2 (Average) is the smallest group of students (248/835, 29.70%). These students experience some stress, but in fewer areas than the first group. They report above-average stress only for academic subject choices and extracurricular activities, and have relatively peaceful environments and relationships. Extracurricular activities, which this group rated the highest, might therefore be overwhelming for these students, so changing or reducing them could be considered.
Cluster 3 (Relaxed) is the largest group of students (326/835, 39.04%) and consists of the most relaxed students. They reported below-average stress on all six variables. More students than expected in this group experience No Stress. Students in this group are also the least overwhelmed by academic responsibilities and the least stressed by competition with peers, compared to the other two groups.