Tamara Milinković HW2

Author: Tamara Milinković

1. Data import

library(readxl)
mydata <- read_xlsx("StudentsStressDataset.xlsx")

mydata <- as.data.frame(mydata)

head(mydata)

##   ID Gender Age RecStress Heartbeat Anxiety Sleep Headaches Irritation
## 1  1      0  20         3         4       2     5         2          1
## 2  2      0  20         2         3       2     1         1          1
## 3  3      0  20         5         4       2     2         3          4
## 4  4      1  20         3         4       3     2         3          4
## 5  5      0  20         3         3       3     2         4          4
## 6  6      0  20         3         4       3     2         3          4
##   Concentration Sadness Illness Loneliness Overwhelm Competition
## 1             2       2       3          1         5           1
## 2             4       2       1          2         1           2
## 3             2       3       2          3         4           5
## 4             3       5       2          4         1           2
## 5             4       4       1          1         1           2
## 6             1       3       1          2         2           3
##   RelationshipStress DiffProf WorkingEnv TroubleRelax HomeEnv LackConfidence
## 1                  2        3          1            4       1              2
## 2                  4        3          2            1       1              3
## 3                  2        2          2            2       1              4
## 4                  3        1          1            2       1              2
## 5                  1        2          3            1       2              2
## 6                  2        2          3            1       2              3
##   LackChoice Extracurricular RegularlyAttend Weight                 StressType
## 1          1               3               1      2 Eustress (Positive Stress)
## 2          2               1               4      2 Eustress (Positive Stress)
## 3          1               1               2      1 Eustress (Positive Stress)
## 4          1               1               5      3 Eustress (Positive Stress)
## 5          4               2               2      2 Eustress (Positive Stress)
## 6          2               4               4      4 Eustress (Positive Stress)

2. Data description

This dataset, titled “Stress and Well-being Data of College Students”, was created to study stress levels and well-being factors among college students aged 18-21. It includes responses from 843 students, collected through a structured Google Form survey. The dataset captures various aspects of students’ lives that may affect stress levels, including academic performance, physical and emotional health, social relationships, and relaxation activities. Participants rated their experiences on a five-point scale from “Not at all” to “Extremely”, providing nuanced insight into each participant’s feelings and behaviors.

Source: Kaggle

A. Singh, K. Singh, A. Kumar, A. Shrivastava and S. Kumar, “Machine Learning Algorithms for Detecting Mental Stress in College Students,” 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), Pune, India, 2024, pp. 1-5, doi: 10.1109/I2CT61223.2024.10544243.

Unit of observation: one college student

Sample size: 843

Response Scale: Five-point Likert scale (“Not at all” to “Extremely”) for stress indicators (variables 4-25)

Variables:

ID: ID of the respondent
Gender: Gender of the respondent (0 = Male, 1 = Female)
Age: Age of the respondent
RecStress: Have you recently experienced stress in your life?: Self-reported recent experience of stress
Heartbeat: Have you noticed a rapid heartbeat or palpitations?: Experience of rapid heartbeat as a stress response
Anxiety: Have you been dealing with anxiety or tension recently?: Frequency of recent anxiety or tension
Sleep: Do you face any sleep problems or difficulties falling asleep?: Sleep disturbances or difficulty falling asleep
Headaches: Have you been getting headaches more often than usual?: Frequency of headaches as a potential stress indicator
Irritation: Do you get irritated easily?: Increased irritability as a stress response
Concentration: Do you have trouble concentrating on your academic tasks?: Difficulties concentrating on studies
Sadness: Have you been feeling sadness or low mood?: Experience of sadness or low mood
Illness: Have you been experiencing any illness or health issues?: Health issues that may relate to stress
Loneliness: Do you often feel lonely or isolated?: Feelings of loneliness or social isolation
Overwhelm: Do you feel overwhelmed with your academic workload?: Overwhelm due to academic responsibilities
Competition: Are you in competition with your peers, and does it affect you?: Perceived stress from competition with peers
RelationshipStress: Do you find that your relationship often causes you stress?: Stress due to relationships
DiffProf: Are you facing any difficulties with your professors or instructors?: Stress from difficulties with professors/instructors
WorkingEnv: Is your working environment unpleasant or stressful?: Perceived unpleasantness or stress in the work environment
TroubleRelax: Do you struggle to find time for relaxation and leisure activities?: Difficulty making time for relaxation or leisure
HomeEnv: Is your hostel or home environment causing you difficulties?: Stressful or difficult home/hostel environment
LackConfidence: Do you lack confidence in your academic performance?: Self-reported lack of confidence in academic abilities
LackChoice: Do you lack confidence in your choice of academic subjects?: Lack of confidence in academic subject choices
Extracurricular: Academic and extracurricular activities conflicting for you?: Conflict between academic and extracurricular activities
RegularlyAttend: Do you attend classes regularly?: Frequency of class attendance
Weight: Have you gained/lost weight?: Self-reported weight gain/loss as a stress indicator
StressType: Which type of stress do you primarily experience?: Self-identified primary type of stress (Eustress, Distress, No Stress)

mydata$GenderFactor <- factor(mydata$Gender, 
                             levels = c(0, 1), 
                             labels = c("Male", "Female"))

#Dropping ID, Gender, Age, StressType and GenderFactor (columns 1, 2, 3, 26 and 27)
summary(mydata[, -c(1, 2, 3, 26, 27)])

##    RecStress       Heartbeat        Anxiety          Sleep      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :2.000   Median :3.000  
##  Mean   :2.998   Mean   :2.756   Mean   :2.543   Mean   :2.786  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##    Headaches       Irritation    Concentration    Sadness         Illness     
##  Min.   :1.000   Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.0   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :3.000   Median :3.0   Median :2.000   Median :2.000  
##  Mean   :2.629   Mean   :2.702   Mean   :2.7   Mean   :2.585   Mean   :2.549  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.0   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.0   Max.   :5.000   Max.   :5.000  
##    Loneliness      Overwhelm      Competition    RelationshipStress
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000     
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000     
##  Median :2.000   Median :2.000   Median :2.000   Median :2.000     
##  Mean   :2.497   Mean   :2.504   Mean   :2.485   Mean   :2.515     
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000     
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000     
##     DiffProf       WorkingEnv     TroubleRelax      HomeEnv     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :2.000   Median :2.000   Median :2.000  
##  Mean   :2.447   Mean   :2.489   Mean   :2.517   Mean   :2.425  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  LackConfidence    LackChoice    Extracurricular RegularlyAttend
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :2.581   Mean   :2.642   Mean   :2.757   Mean   :3.259  
##  3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      Weight     
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.399  
##  3rd Qu.:3.000  
##  Max.   :5.000

Each of the variables shown has a minimum of 1 and a maximum of 5 (a range of 4), since they are all measured on a five-point Likert scale.

The average self-reported recent experience of stress is 2.998.

Half of the students rated the conflict between academic and extracurricular activities at 3 or lower; the other half rated it at 3 or higher.

75% of the students rated their lack of confidence in academic abilities at 3 or lower; the remaining 25% rated it at 3 or higher.
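As a quick check of how these statistics are read, the following sketch computes the range, the median and the third quartile for a small made-up vector of Likert ratings. The vector `ratings` is hypothetical and is not taken from the dataset.

```r
# Hypothetical five-point Likert ratings (not from the dataset)
ratings <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5)

# Range as a single number: maximum minus minimum
rating_range <- max(ratings) - min(ratings)

# Median: at least half of the ratings are at or below this value
rating_median <- median(ratings)

# Third quartile: about 75% of the ratings are at or below this value
rating_q3 <- unname(quantile(ratings, 0.75))

# Share of ratings at or below the median (can exceed 50% because of ties)
share_at_or_below_median <- mean(ratings <= rating_median)

rating_range
rating_median
rating_q3
share_at_or_below_median
```

Because Likert ratings contain many ties, the share of respondents at or below the median is often larger than exactly one half, which is why "3 or lower" and "3 or higher" overlap.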

3. Research question: How can college students be classified into segments based on stress and well-being data?

For the purpose of clustering, I chose 6 cluster variables: WorkingEnv, HomeEnv, RelationshipStress, DiffProf, LackChoice, Extracurricular.

#Saving standardized cluster variables into new data frame

mydata_clu_std <- as.data.frame(scale(mydata[c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular")]))

#Finding outliers

mydata$Dissimilarity <- sqrt(mydata_clu_std$RelationshipStress^2 + mydata_clu_std$DiffProf^2 + mydata_clu_std$LackChoice^2 + mydata_clu_std$WorkingEnv^2 + mydata_clu_std$HomeEnv^2 + mydata_clu_std$Extracurricular^2)

#Finding units with highest value of dissimilarity

head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")])

##      ID Dissimilarity
## 195 195      4.856241
## 540 540      4.856241
## 727 727      4.856241
## 667 667      4.626405
## 584 584      4.589811
## 47   47      4.559359

There is a relatively big jump between the third and fourth units, so I will check the first three units.
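The dissimilarity used here is the Euclidean distance of each unit from the origin of the standardized variables, that is, from a respondent who is exactly average on every cluster variable. A minimal sketch of the same idea on made-up data (the data frame `toy` and its columns A, B and C are hypothetical):

```r
# Hypothetical ratings for five respondents on three items (not from the dataset)
toy <- data.frame(A = c(2, 3, 2, 3, 5),
                  B = c(1, 2, 2, 1, 5),
                  C = c(3, 3, 2, 2, 5))

# Standardize each column to mean 0 and standard deviation 1
toy_std <- as.data.frame(scale(toy))

# Euclidean distance of each respondent from the origin (the "average" respondent)
dissimilarity <- unname(sqrt(rowSums(toy_std^2)))

# Respondents ordered from most to least unusual
most_unusual_first <- order(dissimilarity, decreasing = TRUE)
most_unusual_first
```

Using `rowSums()` on the squared standardized values gives the same result as writing out the sum of squares variable by variable, and it does not need to be edited when the set of cluster variables changes.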

#Showing students ID195, ID540 and ID727

print(mydata[c(195, 540, 727), ])

##      ID Gender Age RecStress Heartbeat Anxiety Sleep Headaches Irritation
## 195 195      0  22         4         4       2     1         5          5
## 540 540      0  20         5         5       5     5         5          5
## 727 727      0  22         4         4       2     1         5          5
##     Concentration Sadness Illness Loneliness Overwhelm Competition
## 195             5       5       5          5         1           5
## 540             5       5       5          5         5           5
## 727             5       5       5          5         1           5
##     RelationshipStress DiffProf WorkingEnv TroubleRelax HomeEnv LackConfidence
## 195                  5        5          5            5       5              5
## 540                  5        5          5            5       5              5
## 727                  5        5          5            5       5              5
##     LackChoice Extracurricular RegularlyAttend Weight
## 195          5               5               4      2
## 540          5               5               5      5
## 727          5               5               4      2
##                     StressType GenderFactor Dissimilarity
## 195 Distress (Negative Stress)         Male      4.856241
## 540 Distress (Negative Stress)         Male      4.856241
## 727 Distress (Negative Stress)         Male      4.856241

Students ID195, ID540 and ID727 answered 5 on almost all of the questions, so I will remove these units from the original data frame.

#Removing ID195, ID540 and ID727 from original data frame

library(rstatix)

mydata <- mydata %>%
  filter(!ID %in% c(195, 540, 727))
mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular")]))

library(factoextra)

#Finding Euclidean distances based on 6 Cluster variables, then saving them into object Distances

Distances <- get_dist(mydata_clu_std, 
                      method = "euclidean")

#Showing matrix of distances

fviz_dist(Distances, 
          gradient = list(low = "slateblue4",
                          mid = "skyblue3",
                          high = "skyblue"))

There are some groups of homogeneous objects forming, but they are not very evident.

#Hopkins statistic

library(factoextra) 
get_clust_tendency(mydata_clu_std, 
                   n = nrow(mydata_clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.5188179
## 
## $plot
## NULL

The Hopkins statistic (0.52) is just above the threshold of 0.5, indicating that the data are not ideal for clustering but are still clusterable.
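To make the idea behind the statistic concrete, here is a minimal base R sketch of one common formulation. It compares nearest-neighbor distances of uniformly generated points with those of real data points. The helper `hopkins_sketch()` is hypothetical and is not part of factoextra; `get_clust_tendency()` may use a different sampling scheme or convention, so its values should not be expected to match this sketch exactly.

```r
set.seed(1)

# Hypothetical helper (not part of factoextra): Hopkins statistic, one common form
hopkins_sketch <- function(x, m = 20) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)

  # Distance from point p to its nearest neighbor among the rows of pts
  nearest <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))

  # m artificial points drawn uniformly from the bounding box of the data
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  fake <- sapply(seq_len(d), function(j) runif(m, lo[j], hi[j]))
  u <- sapply(seq_len(m), function(i) nearest(fake[i, ], x))

  # m real points, each compared with all the other real points
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))

  sum(u) / (sum(u) + sum(w))
}

# Made-up data with two tight, well-separated groups
clustered <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
                   matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2))
h_clustered <- hopkins_sketch(clustered)
h_clustered
```

Under this formulation, data without cluster structure give values near 0.5 and strongly clustered data give values near 1, which is the reading applied to the result above.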

#Determining number of clusters for K-means clustering

library(factoextra)
library(NbClust)

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

The biggest break appears at 2, so the Elbow method suggests forming 2 clusters. The next possible option is 7.
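The quantity plotted by the Elbow method is the total within-cluster sum of squares for each candidate number of clusters. A minimal base R sketch on made-up data (the matrix `toy` is hypothetical, with two clearly separated groups) shows how the "break" can be read as the largest drop:

```r
set.seed(1)

# Hypothetical data: two well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Drop in the within-cluster sum of squares when moving from k - 1 to k clusters
drops <- -diff(wss)
names(drops) <- 2:6
drops
```

In this toy example the drop at k = 2 dominates all later ones. In real survey data the drops are usually much closer to each other, which is why the elbow is often ambiguous.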

#Determining number of clusters for K-means clustering

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

Since we want the average silhouette width to be as high as possible, this index clearly favors forming 2 clusters.
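The average silhouette width can also be computed directly. The sketch below assumes the `cluster` package is installed (it is one of R's recommended packages) and uses made-up data. The matrix `toy` is hypothetical.

```r
library(cluster)  # provides silhouette()

set.seed(1)

# Hypothetical data: two well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2))
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6 (it is not defined for k = 1)
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters.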

library(dplyr)

library(factoextra)
WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 840

#Dendrogram to determine number of clusters in case of hierarchical clustering

library(factoextra)
fviz_dend(WARD)

Based on the dendrogram, the biggest distance (jump in heterogeneity) is achieved if we cut it so that 2 groups are formed.
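The jump in heterogeneity can also be read from the merge heights stored in the `hclust` object instead of from the plot. A minimal base R sketch on made-up data (the matrix `toy` is hypothetical):

```r
set.seed(1)

# Hypothetical data: two well-separated groups of 50 units each
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2))

# Ward's method on Euclidean distances
ward <- hclust(dist(toy), method = "ward.D2")

# Merge heights in increasing order; the last value is the final merge
last_heights <- tail(ward$height, 3)
last_heights

# Position of the largest jump between consecutive merge heights
largest_jump <- which.max(diff(ward$height))

# Cutting the tree into 2 groups
groups <- cutree(ward, k = 2)
table(groups)
```

When the largest jump occurs just before the final merge, as it does here, cutting the tree into 2 groups avoids joining the two most dissimilar clusters.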

library(NbClust)
NbClust(mydata_clu_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## $All.index
##        KL       CH Hartigan      CCC     Scott      Marriot   TrCovW   TraceW
## 2  2.6335 194.9864 113.4592  -4.4258  928.4972 3.123015e+17 517871.4 4083.783
## 3  0.2667 167.2229  56.9403  -7.5087 1474.0135 3.670415e+17 386080.7 3596.801
## 4  4.2210 137.8811  58.2759 -11.7757 1788.1087 4.489521e+17 324853.5 3367.700
## 5  0.5964 125.0387  62.8149 -14.0957 2167.6530 4.464668e+17 291752.5 3148.242
## 6  0.7404 119.9751  30.1699 -13.9625 2554.5160 4.056363e+17 250200.4 2927.978
## 7  0.6221 108.4941  60.4983 -16.5951 2707.0236 4.604488e+17 218267.2 2825.756
## 8  2.4115 108.2614  42.7205 -13.3463 3024.5584 4.120921e+17 190499.3 2634.426
## 9  0.9270 104.8066  39.9152 -11.9639 3236.5550 4.052223e+17 180504.7 2505.763
## 10 1.4489 101.9485  33.8664 -10.6463 3457.7238 3.844680e+17 160222.8 2390.921
##    Friedman  Rubin Cindex     DB Silhouette   Duda  Pseudot2   Beale Ratkowsky
## 2    1.6881 1.2327 0.4224 2.0584     0.1824 1.2695 -108.0556 -0.8144    0.2801
## 3    2.6282 1.3996 0.4117 1.9364     0.1571 1.2762  -78.1303 -0.8295    0.3041
## 4    3.1730 1.4948 0.4126 2.0851     0.1419 1.4777  -99.8957 -1.2375    0.2841
## 5    3.9415 1.5990 0.3936 1.9461     0.1254 1.5853  -97.8382 -1.4105    0.2704
## 6    4.7759 1.7193 0.3876 1.7662     0.1447 2.0952 -129.6331 -1.9950    0.2617
## 7    5.1080 1.7815 0.4036 1.7650     0.1322 1.3999  -54.8491 -1.0915    0.2492
## 8    5.8065 1.9109 0.4109 1.6786     0.1457 1.3851  -53.3773 -1.0593    0.2433
## 9    6.2338 2.0090 0.4224 1.6534     0.1449 2.1205  -82.9614 -2.0031    0.2359
## 10   6.7408 2.1055 0.4172 1.6532     0.1456 2.2067  -92.9606 -2.0772    0.2291
##         Ball Ptbiserial   Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  2041.8913     0.3557 0.5269  0.7623 0.1097  5e-04  1.7551 2.1143 3.1696
## 3  1198.9338     0.3773 0.4298  1.5185 0.1140  5e-04  1.5382 1.9807 2.1819
## 4   841.9250     0.3767 0.6673  2.1468 0.1189  5e-04  1.5086 1.9192 1.0414
## 5   629.6484     0.3541 0.1200  2.9324 0.1164  6e-04  1.5517 1.8509 0.7550
## 6   487.9963     0.3659 0.3029  3.4867 0.1195  6e-04  1.4781 1.7828 0.6523
## 7   403.6795     0.3593 0.1518  4.0201 0.1271  7e-04  1.5178 1.7523 0.6034
## 8   329.3032     0.3602 0.2124  4.6301 0.1338  7e-04  1.4379 1.6967 0.5638
## 9   278.4181     0.3556 0.2242  5.1475 0.1404  8e-04  1.4471 1.6554 0.5358
## 10  239.0921     0.3483 0.0530  5.8146 0.1417  8e-04  1.4946 1.6168 0.5094
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.8021           125.6198            1
## 3          0.7873            97.5067            1
## 4          0.7718            91.3682            1
## 5          0.7507            87.9976            1
## 6          0.7408            86.7609            1
## 7          0.7522            63.2574            1
## 8          0.7262            72.3836            1
## 9          0.6870            71.5133            1
## 10         0.7018            72.2356            1
## 
## $Best.nc
##                    KL       CH Hartigan     CCC    Scott      Marriot   TrCovW
## Number_clusters 4.000   2.0000   3.0000  2.0000   3.0000 6.000000e+00      3.0
## Value_Index     4.221 194.9864  56.5189 -4.4258 545.5163 9.564296e+16 131790.7
##                   TraceW Friedman   Rubin Cindex      DB Silhouette   Duda
## Number_clusters   3.0000   3.0000  3.0000 6.0000 10.0000     2.0000 2.0000
## Value_Index     257.8797   0.9401 -0.0717 0.3876  1.6532     0.1824 1.2695
##                  PseudoT2   Beale Ratkowsky     Ball PtBiserial Frey McClain
## Number_clusters    2.0000  2.0000    3.0000   3.0000     3.0000    1  2.0000
## Value_Index     -108.0556 -0.8144    0.3041 842.9575     0.3773   NA  0.7623
##                    Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 10.0000      0  8.0000      0 10.0000
## Value_Index      0.1417      0  1.4379      0  0.5094
## 
## $Best.partition
##   [1] 2 2 2 2 3 3 1 2 1 2 2 2 1 3 2 1 1 2 1 2 1 2 1 1 1 2 1 1 2 1 2 1 2 1 3 2 1
##  [38] 2 3 2 2 1 3 1 1 2 1 3 3 2 2 3 2 2 1 3 2 1 3 3 2 3 2 2 1 3 2 1 3 2 3 1 3 3
##  [75] 2 3 3 2 2 2 3 2 3 1 3 3 2 3 2 1 2 3 1 2 2 3 3 3 3 1 2 1 2 3 2 2 2 2 2 2 1
## [112] 3 3 3 2 2 3 2 1 2 2 2 3 1 3 3 1 3 2 2 2 2 2 1 3 1 1 1 2 1 2 1 2 2 2 2 2 3
## [149] 2 2 3 1 1 1 2 1 3 2 3 1 2 1 3 1 2 2 3 1 3 1 3 2 1 2 3 1 1 1 3 3 2 1 3 3 1
## [186] 3 3 1 1 1 3 3 1 3 3 1 1 3 1 3 3 3 3 3 3 1 1 1 3 1 1 1 3 3 1 1 3 2 1 2 3 1
## [223] 2 1 3 2 3 1 3 3 2 1 3 1 3 1 3 1 1 1 3 1 1 3 1 3 1 3 3 3 1 3 2 1 2 2 1 3 1
## [260] 3 3 1 3 2 3 3 2 2 2 1 2 1 3 1 1 2 1 2 2 2 1 2 1 2 2 3 1 3 3 2 1 3 1 2 1 3
## [297] 1 1 1 2 1 1 2 1 2 1 2 1 3 2 1 3 1 3 1 2 2 2 3 2 3 2 3 3 3 2 3 2 3 2 3 1 2
## [334] 1 2 3 3 3 3 1 1 2 2 1 3 3 3 2 3 2 2 2 1 2 1 2 1 3 1 2 1 3 1 1 1 2 1 1 3 1
## [371] 2 1 2 2 2 1 3 3 1 2 2 3 2 3 1 2 2 3 2 3 3 3 3 2 2 2 2 1 3 1 2 2 3 2 1 2 2
## [408] 2 2 2 2 1 3 2 2 3 2 2 2 2 3 2 2 3 2 2 2 3 2 2 2 1 2 2 1 1 2 3 2 3 2 3 2 2
## [445] 2 2 2 2 2 2 2 2 2 2 2 1 2 1 3 1 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 3 1 1 3 2 2
## [482] 1 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 1 2 1 1 1 2 1 2 1 2 1 3 3 3 3 2 1
## [519] 1 1 2 1 1 3 1 3 2 1 3 2 1 2 2 2 2 1 2 2 2 3 2 2 2 2 2 1 2 3 2 2 3 2 2 3 3
## [556] 1 2 2 1 3 2 2 1 1 3 2 1 2 3 2 2 1 2 2 2 1 1 3 3 1 2 1 1 1 2 3 2 3 3 3 1 3
## [593] 2 2 3 3 2 2 1 2 3 3 3 1 2 3 3 2 2 3 1 1 3 1 3 1 2 3 2 1 2 3 1 2 2 3 3 3 3
## [630] 1 1 2 2 1 2 2 2 2 2 2 3 3 3 3 2 1 3 2 3 1 2 2 1 2 3 3 3 1 2 2 2 2 2 1 1 1
## [667] 2 1 2 1 2 3 2 2 1 2 2 3 2 2 3 1 1 2 2 3 1 2 3 1 2 1 2 2 1 2 1 2 3 2 3 1 1
## [704] 2 3 3 3 2 3 3 2 3 3 3 1 3 1 3 1 1 3 1 1 3 3 1 1 3 1 3 3 3 3 3 3 1 1 1 3 1
## [741] 1 1 3 3 1 1 3 2 1 2 3 2 2 3 3 1 3 1 3 3 2 1 1 1 3 1 3 1 2 1 1 1 1 3 2 1 3
## [778] 3 3 3 1 3 2 3 2 2 1 3 1 3 1 1 1 2 3 3 1 2 2 1 2 1 3 2 1 2 3 2 1 2 3 3 3 1
## [815] 2 1 1 1 3 2 2 1 1 2 1 3 1 1 2 2 3 1 3 1 2 1 2 3 3 2

According to the majority rule, the best number of clusters for K-means clustering is 3. In addition, when I tested a 2-cluster solution (as suggested by the Elbow method and Silhouette analysis), the between_SS / total_SS ratio was much lower than with 3 clusters (around 17% compared to 28.5%). That is why I chose to form 3 clusters.
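The comparison of the between_SS / total_SS ratio for two candidate solutions can be reproduced from the components of the `kmeans` object. The sketch below uses made-up data with three groups (the matrix `toy` is hypothetical), so the percentages it prints are not those of the survey data.

```r
set.seed(1)

# Hypothetical data: three groups of 40 units each in two dimensions
toy <- rbind(matrix(rnorm(80, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(80, mean = 4, sd = 0.5), ncol = 2),
             matrix(rnorm(80, mean = 8, sd = 0.5), ncol = 2))

# Share of total variability explained by the grouping, for k = 2 and k = 3
ratio <- sapply(2:3, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  km$betweenss / km$totss
})
names(ratio) <- c("k = 2", "k = 3")
round(100 * ratio, 1)
```

This ratio generally increases as more clusters are added, so it is most informative when read together with the other indices rather than on its own.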

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions

Clustering

## K-means clustering with 3 clusters of sizes 251, 329, 260
## 
## Cluster means:
##   WorkingEnv    HomeEnv RelationshipStress   DiffProf LackChoice
## 1 -0.2460615 -0.1188777         -0.3451175 -0.3921323  0.4933120
## 2 -0.4291464 -0.3762847         -0.2373695 -0.3891262 -0.6669402
## 3  0.7805793  0.5909076          0.6335348  0.8709528  0.3677002
##   Extracurricular
## 1       0.9786682
## 2      -0.6435059
## 3      -0.1305088
## 
## Clustering vector:
##   [1] 2 2 2 2 1 1 3 2 3 2 2 2 3 1 2 3 3 2 3 2 3 2 3 3 3 2 3 3 2 3 2 3 2 3 1 2 3
##  [38] 2 1 2 2 3 1 3 3 2 3 1 1 2 2 1 2 2 3 1 2 3 1 1 2 1 2 2 3 1 2 3 1 2 1 3 1 1
##  [75] 2 1 1 2 2 2 1 2 1 3 1 1 2 1 2 3 2 1 3 2 2 1 1 1 1 3 2 3 2 1 2 2 2 2 2 2 3
## [112] 1 1 1 2 2 1 2 3 2 2 2 1 3 1 1 3 1 2 2 2 2 2 3 1 3 3 3 2 3 2 3 2 2 2 2 2 1
## [149] 2 2 1 3 3 3 2 3 1 2 1 3 2 3 1 3 2 2 1 3 1 3 1 2 3 2 1 3 3 3 1 1 2 3 1 1 3
## [186] 1 1 3 3 3 1 1 3 1 1 3 3 1 3 1 1 1 1 1 1 3 3 3 1 3 3 3 1 1 3 3 1 2 3 2 1 3
## [223] 2 3 1 2 1 3 1 1 2 3 1 3 1 3 1 3 3 3 1 3 3 1 3 1 3 1 1 1 3 1 2 3 2 2 3 1 3
## [260] 1 1 3 1 2 1 1 2 2 2 3 2 3 1 3 3 2 3 2 2 2 3 2 3 2 2 1 3 1 1 2 3 1 3 2 3 1
## [297] 3 3 3 2 3 3 2 3 2 3 2 3 1 2 3 1 3 1 3 2 2 2 1 2 1 2 1 1 1 2 1 2 1 2 1 3 2
## [334] 3 2 1 1 1 1 3 3 2 2 3 1 1 1 2 1 2 2 2 3 2 3 2 3 1 3 2 3 1 3 3 3 2 3 3 1 3
## [371] 2 3 2 2 2 3 1 1 3 2 2 1 2 1 3 2 2 1 2 1 1 1 1 2 2 2 2 3 1 3 2 2 1 2 3 2 2
## [408] 2 2 2 2 3 1 2 2 1 2 2 2 2 1 2 2 1 2 2 2 1 2 2 2 3 2 2 3 3 2 1 2 1 2 1 2 2
## [445] 2 2 2 2 2 2 2 2 2 2 2 3 2 3 1 3 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 1 3 3 1 2 2
## [482] 3 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 3 2 3 2 3 3 3 2 3 2 3 2 3 1 1 1 1 2 3
## [519] 3 3 2 3 3 1 3 1 2 3 1 2 3 2 2 2 2 3 2 2 2 1 2 2 2 2 2 3 2 1 2 2 1 2 2 1 1
## [556] 3 2 2 3 1 2 2 3 3 1 2 3 2 1 2 2 3 2 2 2 3 3 1 1 3 2 3 3 3 2 1 2 1 1 1 3 1
## [593] 2 2 1 1 2 2 3 2 1 1 1 3 2 1 1 2 2 1 3 3 1 3 1 3 2 1 2 3 2 1 3 2 2 1 1 1 1
## [630] 3 3 2 2 3 2 2 2 2 2 2 1 1 1 1 2 3 1 2 1 3 2 2 3 2 1 1 1 3 2 2 2 2 2 3 3 3
## [667] 2 3 2 3 2 1 2 2 3 2 2 1 2 2 1 3 3 2 2 1 3 2 1 3 2 3 2 2 3 2 3 2 1 2 1 3 3
## [704] 2 1 1 1 2 1 1 2 1 1 1 3 1 3 1 3 3 1 3 3 1 1 3 3 1 3 1 1 1 1 1 1 3 3 3 1 3
## [741] 3 3 1 1 3 3 1 2 3 2 1 2 2 1 1 3 1 3 1 1 2 3 3 3 1 3 1 3 2 3 3 3 3 1 2 3 1
## [778] 1 1 1 3 1 2 1 2 2 3 1 3 1 3 3 3 2 1 1 3 2 2 3 2 3 1 2 3 2 1 2 3 2 1 1 1 3
## [815] 2 3 3 3 1 2 2 3 3 2 3 1 3 3 2 2 1 3 1 3 2 3 2 1 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 1084.977 1188.643 1323.181
##  (between_SS / total_SS =  28.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Units ID382, ID130 and ID506 appear to lie far from their cluster centers in the cluster plot, so I will remove them.
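Units that lie far from their cluster center can also be identified numerically rather than by eye. A minimal base R sketch, assuming made-up data in which one deliberately distant unit is appended as the last row (the matrix `toy` is hypothetical):

```r
set.seed(1)

# Hypothetical data: two groups of 50 units plus one distant unit in row 101
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2),
             c(8, 8))
toy_std <- scale(toy)

km <- kmeans(toy_std, centers = 2, nstart = 25)

# Euclidean distance from each unit to the center of its own cluster
dist_to_center <- unname(sqrt(rowSums((toy_std - km$centers[km$cluster, ])^2)))

# Row numbers of the three units farthest from their cluster center
farthest <- head(order(dist_to_center, decreasing = TRUE), 3)
farthest
```

Indexing `km$centers` by `km$cluster` repeats each center once per assigned unit, so the subtraction is done row by row against the correct center.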

mydata <- mydata %>%
  filter(!ID %in% c(382, 130, 506))

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular")]))

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions

Clustering

## K-means clustering with 3 clusters of sizes 248, 262, 327
## 
## Cluster means:
##   WorkingEnv    HomeEnv RelationshipStress   DiffProf LackChoice
## 1 -0.2430586 -0.1362045         -0.3325003 -0.3847039  0.4960717
## 2  0.7617421  0.5849697          0.6345592  0.8777868  0.3610986
## 3 -0.4259875 -0.3653925         -0.2562521 -0.4115400 -0.6655462
##   Extracurricular
## 1       0.9828920
## 2      -0.1264463
## 3      -0.6441232
## 
## Clustering vector:
##   [1] 3 3 3 3 1 1 2 3 2 3 3 3 2 1 3 2 2 3 2 3 2 3 2 2 2 3 2 2 3 2 3 2 3 2 1 3 2
##  [38] 3 1 3 3 2 1 2 2 3 2 1 1 3 3 1 3 3 2 1 3 2 1 1 3 1 3 3 2 1 3 2 1 3 1 2 1 1
##  [75] 3 1 1 3 3 3 1 3 1 2 1 1 3 1 3 2 3 1 2 3 3 1 1 1 1 2 3 2 3 1 3 3 3 3 3 3 2
## [112] 1 1 1 3 3 1 3 2 3 3 3 1 2 1 1 2 1 3 3 3 3 2 1 2 2 2 3 2 3 2 3 3 3 3 3 1 3
## [149] 3 1 2 2 2 3 2 1 3 1 2 3 2 1 2 3 3 1 2 1 2 1 3 2 3 1 2 2 2 1 1 3 2 1 1 2 1
## [186] 1 2 2 2 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 2 2 2 1 2 2 2 1 1 2 2 1 3 2 3 1 2 3
## [223] 2 1 3 1 2 1 1 3 2 1 2 1 2 1 2 2 2 1 2 2 1 2 1 2 1 1 1 2 1 3 2 3 3 2 1 2 1
## [260] 1 2 1 3 1 1 3 3 3 2 3 2 1 2 2 3 2 3 3 3 2 3 2 3 3 1 2 1 1 3 2 1 2 3 2 1 2
## [297] 2 2 3 2 2 3 2 3 2 3 2 1 3 2 1 2 1 2 3 3 3 1 3 1 3 1 1 1 3 1 3 1 3 1 2 3 2
## [334] 3 1 1 1 1 2 2 3 3 2 1 1 1 3 1 3 3 3 2 3 2 3 2 1 2 3 2 1 2 2 2 3 2 2 1 2 3
## [371] 2 3 3 3 2 1 1 2 3 3 3 1 2 3 3 1 3 1 1 1 1 3 3 3 3 2 1 2 3 3 1 3 2 3 2 3 3
## [408] 3 3 2 1 3 3 1 3 3 3 3 1 3 3 1 3 3 3 3 3 3 3 2 3 3 2 2 3 3 3 1 3 1 3 3 3 3
## [445] 3 3 3 3 3 3 2 3 3 2 3 2 1 2 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 1 2 2 1 3 3 2 3
## [482] 3 3 3 3 3 3 3 3 2 3 2 3 3 3 3 3 2 3 2 3 2 2 3 2 3 2 3 2 1 1 1 1 3 2 2 2 3
## [519] 2 2 1 2 1 3 2 1 3 2 3 3 3 3 2 3 3 3 1 3 3 3 3 3 2 3 1 3 3 1 3 3 1 1 2 3 3
## [556] 2 1 3 3 2 2 1 3 2 3 1 3 3 2 3 3 3 2 2 1 1 2 3 2 2 2 3 1 3 1 1 1 2 1 3 3 1
## [593] 1 3 3 2 3 1 1 1 2 3 1 1 3 3 1 2 2 1 2 1 2 3 1 3 2 3 1 2 3 3 1 1 1 1 2 2 3
## [630] 3 2 3 3 3 3 3 3 1 1 1 1 3 2 1 3 1 2 3 3 2 3 1 1 1 2 3 3 3 3 3 2 2 2 3 2 3
## [667] 2 3 1 3 3 2 3 3 1 3 3 1 2 2 3 3 1 2 3 1 2 3 2 3 3 2 3 2 3 1 3 1 2 2 3 1 1
## [704] 1 3 1 1 3 1 1 1 2 1 2 1 2 2 1 2 2 1 1 2 2 1 2 1 1 1 1 1 1 2 2 2 1 2 2 2 1
## [741] 1 2 2 1 3 2 3 1 2 3 1 1 2 1 2 1 1 3 2 2 2 1 2 1 2 3 2 2 2 2 1 3 2 1 1 1 1
## [778] 2 1 3 1 3 3 2 1 2 1 2 2 2 3 1 1 2 3 3 2 3 2 1 3 2 3 1 3 2 3 1 1 1 2 3 2 2
## [815] 2 1 3 3 2 2 3 2 1 2 2 3 3 1 2 1 2 3 2 3 1 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 1072.378 1336.947 1174.914
##  (between_SS / total_SS =  28.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Based on the new cluster plot, I will remove units ID513 and ID768.

mydata <- mydata %>%
  filter(!ID %in% c(513, 768))

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular")]))

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions

Clustering

## K-means clustering with 3 clusters of sizes 261, 248, 326
## 
## Cluster means:
##   WorkingEnv    HomeEnv RelationshipStress   DiffProf LackChoice
## 1  0.7625108  0.5865644          0.6395867  0.8785945  0.3608740
## 2 -0.2463915 -0.1415050         -0.3343825 -0.3808009  0.4893792
## 3 -0.4230375 -0.3619635         -0.2576849 -0.4137256 -0.6612090
##   Extracurricular
## 1      -0.1298515
## 2       0.9866033
## 3      -0.6465840
## 
## Clustering vector:
##   [1] 3 3 3 3 2 2 1 3 1 3 3 3 1 2 3 1 1 3 1 3 1 3 1 1 1 3 1 1 3 1 3 1 3 1 2 3 1
##  [38] 3 2 3 3 1 2 1 1 3 1 2 2 3 3 2 3 2 1 2 3 1 2 2 3 2 3 3 1 2 3 1 2 3 2 1 2 2
##  [75] 3 2 2 3 3 3 2 3 2 1 2 2 3 2 3 1 3 2 1 3 3 2 2 2 2 1 3 1 3 2 3 3 3 3 3 3 1
## [112] 2 2 2 3 3 2 3 1 3 3 3 2 1 2 2 1 2 3 3 3 3 1 2 1 1 1 3 1 3 1 3 3 3 3 3 2 3
## [149] 3 2 1 1 1 3 1 2 3 2 1 3 1 2 1 3 3 2 1 2 1 2 3 1 3 2 1 1 1 2 2 3 1 2 2 1 2
## [186] 2 1 1 1 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2 1 1 2 3 1 3 2 1 3
## [223] 1 2 3 2 1 2 2 3 1 2 1 2 1 2 1 1 1 2 1 1 2 1 2 1 2 2 2 1 2 3 1 3 3 1 2 1 2
## [260] 2 1 2 3 2 2 3 3 3 1 3 1 2 1 1 3 1 3 3 3 1 3 1 3 3 2 1 2 2 3 1 2 1 3 1 2 1
## [297] 1 1 3 1 1 3 1 3 1 3 1 2 3 1 2 1 2 1 3 3 3 2 3 2 3 2 2 2 3 2 3 2 3 2 1 3 1
## [334] 3 2 2 2 2 1 1 3 3 1 2 2 2 3 2 3 3 3 1 3 1 3 1 2 1 3 1 2 1 1 1 3 1 1 2 1 3
## [371] 1 3 3 3 1 2 2 1 3 3 3 2 1 3 3 2 3 2 2 2 2 3 3 3 3 1 2 1 3 3 2 3 1 3 1 3 3
## [408] 3 3 1 2 3 3 2 3 3 3 3 2 3 3 2 3 3 3 3 3 3 3 1 3 3 1 1 3 3 3 2 3 2 3 3 3 3
## [445] 3 3 3 3 3 3 1 3 3 1 3 1 2 1 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 2 1 1 2 3 3 1 3
## [482] 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 1 3 1 3 1 1 3 1 3 1 3 1 2 2 2 3 1 1 1 3 1
## [519] 1 2 1 2 3 1 2 3 1 3 3 3 3 1 3 3 3 2 3 3 3 3 3 1 3 2 3 3 2 3 3 2 2 1 3 3 1
## [556] 2 3 3 1 1 2 3 1 3 2 3 3 1 3 3 3 1 1 2 2 1 3 1 1 1 3 2 3 2 2 2 1 2 3 3 2 2
## [593] 3 3 1 3 2 2 2 1 3 2 2 3 3 2 1 1 2 1 2 1 3 2 3 1 3 2 1 3 3 2 2 2 2 1 1 3 3
## [630] 1 3 3 3 3 3 3 2 2 2 2 3 1 2 3 2 1 3 3 1 3 2 2 2 1 3 3 3 3 3 1 1 1 3 1 3 1
## [667] 3 2 3 3 1 3 3 2 3 3 2 1 1 3 3 2 1 3 2 1 3 1 3 3 1 3 1 3 2 3 2 1 1 3 2 2 2
## [704] 3 2 2 3 2 2 2 1 2 1 2 1 1 2 1 1 2 2 1 1 2 1 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2
## [741] 1 1 2 3 1 3 2 1 3 2 2 1 2 1 2 2 3 1 1 1 2 1 2 1 3 1 1 1 2 3 1 2 2 2 2 1 2
## [778] 3 2 3 3 1 2 1 2 1 1 1 3 2 2 1 3 3 1 3 1 2 3 1 3 2 3 1 3 2 2 2 1 3 1 1 1 2
## [815] 3 3 1 1 3 1 2 1 1 3 3 2 1 2 1 3 1 3 2 2 3
## 
## Within cluster sum of squares by cluster:
## [1] 1328.425 1070.682 1174.886
##  (between_SS / total_SS =  28.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The ratio between_SS / total_SS improved slightly, to 28.6%.
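The ratio printed by kmeans() can also be recomputed from the components of the fitted object. The sketch below uses simulated data, so the object names `toy` and `fit` and the seed are illustrative assumptions rather than the homework data:

```r
# Illustrative data only: two standardized variables, three loose groups.
set.seed(1)
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 3), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2)
))

fit <- kmeans(toy, centers = 3, nstart = 25)

# Share of total variability explained by the cluster assignment.
ratio <- fit$betweenss / fit$totss
round(100 * ratio, 1)
```

A higher ratio means the groups are more distinct relative to the spread inside them; it always increases as more clusters are added, so it is best compared between solutions with the same number of clusters.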

#Average values of cluster variables to describe groups

Averages <- Clustering$centers
Averages

##   WorkingEnv    HomeEnv RelationshipStress   DiffProf LackChoice
## 1  0.7625108  0.5865644          0.6395867  0.8785945  0.3608740
## 2 -0.2463915 -0.1415050         -0.3343825 -0.3808009  0.4893792
## 3 -0.4230375 -0.3619635         -0.2576849 -0.4137256 -0.6612090
##   Extracurricular
## 1      -0.1298515
## 2       0.9866033
## 3      -0.6465840

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular"))

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name, 
                       levels = c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular"), 
                       labels = c("WorkingEnv", "HomeEnv", "RelationshipStress", "DiffProf", "LackChoice", "Extracurricular"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Group 1 (Stressed)

Experience above average: unpleasantness or stress in the work environment, stressful or difficult home/hostel environment, stress due to relationships, stress from difficulties with professors, and lack of confidence in academic subject choices.

The only below-average value for this group is for the conflict between academic and extracurricular activities.

Group 2 (Average)

Above average rating only for lack of confidence in academic subject choices and conflict between academic and extracurricular activities.

Group 3 (Relaxed)

Below average rating for all of the variables.

#Saving where each unit belongs

mydata$Group <- Clustering$cluster

#Checking if clustering variables successfully differentiate between groups

fit <- aov(cbind(WorkingEnv, HomeEnv, RelationshipStress, DiffProf, LackChoice, Extracurricular) ~ as.factor(Group), 
           data = mydata)

summary(fit)

##  Response WorkingEnv :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 315.35 157.677  153.83 < 2.2e-16 ***
## Residuals        832 852.79   1.025                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response HomeEnv :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2  211.97 105.986  82.108 < 2.2e-16 ***
## Residuals        832 1073.95   1.291                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response RelationshipStress :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2  236.41 118.206  95.825 < 2.2e-16 ***
## Residuals        832 1026.32   1.234                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response DiffProf :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 413.25 206.623  225.58 < 2.2e-16 ***
## Residuals        832 762.08   0.916                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response LackChoice :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2  395.14 197.569  164.09 < 2.2e-16 ***
## Residuals        832 1001.77   1.204                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Extracurricular :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 598.64 299.320  351.73 < 2.2e-16 ***
## Residuals        832 708.03   0.851                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the variable WorkingEnv:

H0: μWorkingEnv,G1 = μWorkingEnv,G2 = μWorkingEnv,G3

H1: At least one μWorkingEnv,j is different.

We can reject H0 (p<0.001). We can conclude that the mean for WorkingEnv differs for at least one of the groups.

The same hypotheses and conclusion apply to all six clustering variables (p<0.001 in each case).
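As an optional addition, the sums of squares in the ANOVA output can be turned into an effect size. The sketch below computes eta squared (between-group sum of squares divided by the total); the vector names `ss_between`, `ss_within` and `eta_sq` are illustrative, and the numbers are copied from the output above:

```r
# Sums of squares copied from the ANOVA output above.
ss_between <- c(WorkingEnv = 315.35, HomeEnv = 211.97, RelationshipStress = 236.41,
                DiffProf = 413.25, LackChoice = 395.14, Extracurricular = 598.64)
ss_within  <- c(WorkingEnv = 852.79, HomeEnv = 1073.95, RelationshipStress = 1026.32,
                DiffProf = 762.08, LackChoice = 1001.77, Extracurricular = 708.03)

# Eta squared: share of each variable's variability explained by group membership.
eta_sq <- ss_between / (ss_between + ss_within)
round(sort(eta_sq, decreasing = TRUE), 2)
```

On these numbers Extracurricular separates the groups most strongly and HomeEnv least, which matches the ordering of the F values.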

#Additional variables

aggregate(mydata$Overwhelm, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1        x
## 1       1 2.923372
## 2       2 2.366935
## 3       3 2.273006

aggregate(mydata$Competition, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1        x
## 1       1 2.931034
## 2       2 2.387097
## 3       3 2.177914

The additional variables are consistent with the classification above. Students in Group 1 (Stressed) are the most overwhelmed by academic responsibilities and the most stressed by competition with peers, while students in Group 3 (Relaxed) experience both the least.
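The two aggregate() calls can also be combined with the formula interface. This is a minimal sketch on made-up values; the data frame `toy` is a hypothetical stand-in for mydata:

```r
# Hypothetical toy data standing in for mydata (values are made up).
toy <- data.frame(
  Group       = c(1, 1, 2, 2, 3, 3),
  Overwhelm   = c(3, 4, 2, 3, 2, 1),
  Competition = c(4, 3, 3, 2, 1, 2)
)

# One call returns the group means of both validation variables.
means <- aggregate(cbind(Overwhelm, Competition) ~ Group, data = toy, FUN = mean)
means
```

The result has one row per group and one column per variable, which makes the groups easier to compare side by side.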

#Checking homogeneity of variances

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(mydata$Overwhelm, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   2  4.3044 0.01381 *
##       832                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: σ²Overwhelm,G1 = σ²Overwhelm,G2 = σ²Overwhelm,G3

H1: At least one σ²Overwhelm,j is different.

We can reject H0 at α = 0.05 (p = 0.014). The variances differ between groups, so a parametric comparison of means would require the Welch correction.
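For reference, a Welch-corrected one-way ANOVA is available in base R through oneway.test(). The sketch below runs on simulated data with unequal spread; `toy`, `Score` and the seed are assumptions for illustration and not part of the homework data:

```r
# Simulated example: three groups with unequal spread (not the homework data).
set.seed(42)
toy <- data.frame(
  Group = factor(rep(1:3, each = 50)),
  Score = c(rnorm(50, mean = 3.0, sd = 0.5),
            rnorm(50, mean = 2.5, sd = 1.0),
            rnorm(50, mean = 2.3, sd = 1.5))
)

# var.equal = FALSE requests the Welch correction for unequal variances.
welch <- oneway.test(Score ~ Group, data = toy, var.equal = FALSE)
welch
```

The Welch version still assumes normality within groups, so it only addresses the unequal variances.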

library(car)
leveneTest(mydata$Competition, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   2  2.5122 0.08171 .
##       832                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: σ²Competition,G1 = σ²Competition,G2 = σ²Competition,G3

H1: At least one σ²Competition,j is different.

We can’t reject H0 (p = 0.082). We can assume equal variances in all groups for the variable Competition.

#Checking normal distribution of variables

library(dplyr)
library(rstatix)
mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(Overwhelm)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable  statistic        p
##   <fct>                     <chr>         <dbl>    <dbl>
## 1 1                         Overwhelm     0.897 2.49e-12
## 2 2                         Overwhelm     0.879 3.59e-13
## 3 3                         Overwhelm     0.874 1.06e-15

H0: Overwhelm is normally distributed in Group 1 (row 1).

H1: Overwhelm is not normally distributed in Group 1 (row 1).

We can reject H0 (p<0.001). The same holds for Groups 2 and 3 (rows 2 and 3), so Overwhelm is not normally distributed and the Kruskal-Wallis test will be used.

library(dplyr)
library(rstatix)
mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(Competition)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable    statistic        p
##   <fct>                     <chr>           <dbl>    <dbl>
## 1 1                         Competition     0.911 2.38e-11
## 2 2                         Competition     0.889 1.53e-12
## 3 3                         Competition     0.844 1.80e-17

H0: Competition is normally distributed in Group 1 (row 1).

H1: Competition is not normally distributed in Group 1 (row 1).

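We can reject H0 (p<0.001). The same holds for Groups 2 and 3 (rows 2 and 3), so Competition is not normally distributed and the Kruskal-Wallis test will be used.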

kruskal.test(Overwhelm ~ as.factor(Group), 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Overwhelm by as.factor(Group)
## Kruskal-Wallis chi-squared = 38.387, df = 2, p-value = 4.617e-09

H0: Location distribution of Overwhelm is the same in all 3 groups.

H1: At least one group is different from others in the location distribution of Overwhelm.

We can reject H0 (p<0.001). The groups differ in Overwhelm, a variable that was not used for clustering, which supports the validity of the classification.

kruskal.test(Competition ~ as.factor(Group), 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Competition by as.factor(Group)
## Kruskal-Wallis chi-squared = 59.172, df = 2, p-value = 1.416e-13

H0: Location distribution of Competition is the same in all 3 groups.

H1: At least one group is different from others in the location distribution of Competition.

We can reject H0 (p<0.001). The groups also differ in Competition, which again supports the validity of the classification.
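The Kruskal-Wallis test only indicates that at least one group differs. A possible follow-up, not part of the original analysis, is a set of pairwise Wilcoxon rank sum tests with a multiplicity correction. The sketch below uses simulated data; `toy`, `Score` and the choice of the Bonferroni method are assumptions:

```r
# Simulated example: three groups, the first shifted upwards (not the homework data).
set.seed(7)
toy <- data.frame(
  Group = factor(rep(1:3, each = 40)),
  Score = c(rnorm(40, mean = 3.0), rnorm(40, mean = 2.4), rnorm(40, mean = 2.3))
)

# Pairwise Wilcoxon rank sum tests with Bonferroni-adjusted p-values.
pw <- pairwise.wilcox.test(toy$Score, toy$Group, p.adjust.method = "bonferroni")
pw$p.value
```

The returned matrix holds one adjusted p-value per pair of groups, which shows where the differences lie.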

#Checking the association between the self-identified primary type of stress and classification of students into 3 groups

chi_square <- chisq.test(mydata$StressType, as.factor(mydata$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$StressType and as.factor(mydata$Group)
## X-squared = 54.158, df = 4, p-value = 4.878e-11

H0: There is no association between the type of stress and classification of students into 3 groups.

H1: There is an association between the type of stress and classification of students into 3 groups.

We can reject H0 (p<0.001). The self-identified type of stress is associated with group membership, which again supports the classification.

addmargins(chi_square$observed)

##                             as.factor(mydata$Group)
## mydata$StressType              1   2   3 Sum
##   Distress (Negative Stress)  19   9   0  28
##   Eustress (Positive Stress) 241 231 292 764
##   No Stress                    1   8  34  43
##   Sum                        261 248 326 835

addmargins(round(chi_square$expected, 2))

##                             as.factor(mydata$Group)
## mydata$StressType                 1      2      3 Sum
##   Distress (Negative Stress)   8.75   8.32  10.93  28
##   Eustress (Positive Stress) 238.81 226.91 298.28 764
##   No Stress                   13.44  12.77  16.79  43
##   Sum                        261.00 248.00 326.00 835

All expected frequencies are larger than 5.

round(chi_square$res, 2)

##                             as.factor(mydata$Group)
## mydata$StressType                1     2     3
##   Distress (Negative Stress)  3.46  0.24 -3.31
##   Eustress (Positive Stress)  0.14  0.27 -0.36
##   No Stress                  -3.39 -1.34  4.20

More students than expected from Group 1 (Stressed) experience Distress (α = 0.001).

Fewer students than expected from Group 1 (Stressed) experience No Stress (α = 0.001).

Fewer students than expected from Group 3 (Relaxed) experience Distress (α = 0.001).

More students than expected from Group 3 (Relaxed) experience No Stress (α = 0.001).
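The residuals can be reproduced from the observed counts alone. The sketch below rebuilds the contingency table from the output above (row labels are shortened, and the names `observed`, `by_hand` and `critical` are illustrative) and compares each Pearson residual with the two-sided critical value for α = 0.001:

```r
# Observed counts copied from the contingency table above.
observed <- matrix(
  c(19,   9,   0,
    241, 231, 292,
    1,   8,  34),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    StressType = c("Distress", "Eustress", "No Stress"),
    Group = c("1", "2", "3")
  )
)

test <- chisq.test(observed)

# Pearson residual = (observed - expected) / sqrt(expected).
by_hand <- (test$observed - test$expected) / sqrt(test$expected)

# Two-sided critical value of the standard normal for alpha = 0.001.
critical <- qnorm(1 - 0.001 / 2)

round(by_hand, 2)
abs(by_hand) > critical
```

Only the four corner cells (Distress and No Stress in Groups 1 and 3) exceed the critical value. Note that chisq.test() also returns `stdres`, the adjusted standardized residuals, which could be used for the same comparison.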

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

effectsize::cramers_v(mydata$StressType, mydata$Group)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.17              | [0.12, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.17)

## [1] "small"
## (Rules: funder2019)

There is a small association between the type of stress and classification of students into 3 groups.
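As a rough cross-check, the unadjusted Cramér's V can be computed by hand from the chi-squared statistic. The value reported above is bias-adjusted, so the hand calculation is expected to come out slightly larger; the variable names are illustrative:

```r
chi_sq <- 54.158     # X-squared reported by chisq.test() above
n      <- 835        # total number of students in the table
k      <- min(3, 3)  # smaller of the number of rows and columns

# Unadjusted Cramer's V = sqrt(X^2 / (n * (k - 1))).
v_unadjusted <- sqrt(chi_sq / (n * (k - 1)))
round(v_unadjusted, 2)
```

This gives about 0.18, close to the adjusted 0.17, so the interpretation as a small association does not change.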

chi_square <- chisq.test(mydata$GenderFactor, as.factor(mydata$Group))
chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$GenderFactor and as.factor(mydata$Group)
## X-squared = 1.7694, df = 2, p-value = 0.4128

We can’t reject H0 (p = 0.413). We cannot conclude that there is an association between the gender of students and their classification into 3 groups.

aggregate(mydata$Age, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1        x
## 1       1 19.80460
## 2       2 20.55242
## 3       3 19.91411

library(car)
leveneTest(mydata$Age, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  0.6497 0.5225
##       832

We can’t reject H0 (p = 0.523). We can assume equal variances in all groups for the variable Age.

library(dplyr)
library(rstatix)
mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(Age)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable statistic        p
##   <fct>                     <chr>        <dbl>    <dbl>
## 1 1                         Age          0.607 5.26e-24
## 2 2                         Age          0.217 5.85e-31
## 3 3                         Age          0.237 2.20e-34

We can reject H0 in all 3 groups (p<0.001). Age is not normally distributed. Kruskal-Wallis test will be used.

kruskal.test(Age ~ as.factor(Group), 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Age by as.factor(Group)
## Kruskal-Wallis chi-squared = 1.9209, df = 2, p-value = 0.3827

We can’t reject H0 (p = 0.383). We can’t say that the groups differ significantly in age.

4. Conclusion

We clustered 835 college students into three groups based on six standardized variables.

Cluster 1 (Stressed) consists of 261 students (261/835, 31.26%) who report above-average stress in many areas of their lives: work environment, home/hostel environment, relationships, difficulties with professors, and academic subject choices. More students from this group than expected experience Distress (negative stress). They are also the most overwhelmed by academic responsibilities and the most stressed by competition with peers compared to the other two groups. Because extracurricular activities are the only area not causing them above-average stress, it might be useful to offer this group additional activities that are stress-relieving. Measures to reduce the stress these students experience should be considered.

Cluster 2 (Average) is the smallest group of students (248/835, 29.70%). These students experience some stress, but in fewer areas than the first group. They report above-average stress only for academic subject choices and extracurricular activities, and have relatively peaceful environments and relationships. Extracurricular activities might be overwhelming for this group, since it rated them the highest, so these activities could be changed or reduced.

Cluster 3 (Relaxed) is the largest group of students (326/835, 39.04%) and consists of the most relaxed students. They reported below-average stress for all six variables. More students from this group than expected experience No Stress. They are also the least overwhelmed by academic responsibilities and the least stressed by competition with peers compared to the other two groups.
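The group shares quoted above follow directly from the cluster sizes. A short check, with the sizes copied from the contingency table margins and the vector names chosen for illustration:

```r
# Cluster sizes taken from the column sums of the contingency table.
sizes <- c(Stressed = 261, Average = 248, Relaxed = 326)

# Percentage of all clustered students in each group.
shares <- round(100 * sizes / sum(sizes), 2)
shares
```

With the fitted object from the analysis, the same sizes would be available as Clustering$size.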

2025-01-19