K-means Clustering Countries by Mental Health and GDP

Introduction

K-means clustering is one of the unsupervised learning techniques for grouping observations based on multidimensional space similarity. K-means clustering is a practical tool for exploratory analysis when Data labeling is not available. Therefore in this paper, I analyze country group data on mental health and GDP. By merging two separate datasets obtained from Kaggle and World bank group this study explores patterns in the relationship information between economic development rate and mental health (Increasing stress)indicators. This process may generate countries and groups that follow the mental health and GDP relationship for K-mean clustering. After that I would figure out characteristics.

Methods

Pre-processing

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

gdp<-read.csv("GDP.csv")

mental<-read.csv("Mental Health dataset1.csv")

Duplicate confirm

colSums(is.na(mental))
##                Gender               Country            Occupation 
##                     0                     0                     0 
##          SelfEmployed         FamilyHistory             Treatment 
##                     0                     0                     0 
##           DaysIndoors          HabitsChange   MentalHealthHistory 
##                     0                     0                     0 
##      IncreasingStress            MoodSwings        SocialWeakness 
##                     0                     0                     0 
##       CopingStruggles          WorkInterest      SocialWeakness.1 
##                     0                     0                     0 
## MentalHealthInterview           CareOptions 
##                     0                     0
sum(mental$Country == "")
## [1] 0
sum(mental$Treatment == "")
## [1] 0
sum(mental$FamilyHistory == "")
## [1] 0
sum(mental$SocialWeakness == "")
## [1] 0
sum(mental$MentalHealthHistory == "")
## [1] 0
sum(mental$DaysIndoors == "")
## [1] 0
mental<- mental %>% distinct()

v<- c("Treatment", "FamilyHistory","SocialWeakness", "SocialWeakness","MentalHealthHistory","DaysIndoors")

for(V in v){
   mental[[V]]<- trimws(mental[[V]])
   mental[[V]][mental[[V]] == ""] <- NA
}
cl<-function(x){
  x<- trimws(tolower(x))
  ifelse(x == "yes",1,
         ifelse(x == "no", 0, NA))
}

mental <- mental %>%
  mutate(
    Treatment = cl(Treatment),
    SocialWeakness =cl(SocialWeakness),
    IncreasingStress = cl(IncreasingStress),
    MentalHealthHistory = cl(MentalHealthHistory)
  )

table(mental$Treatment, useNA = "ifany")
## 
##     0     1 
## 44714 41330
table(mental$SocialWeakness, useNA = "ifany" )
## 
##     0     1  <NA> 
## 28906 26528 30610
table(mental$IncreasingStress, useNA = "ifany" )
## 
##     0     1  <NA> 
## 27014 29415 29615
table(mental$MentalHealthHistory, useNA = "ifany" )
## 
##     0     1  <NA> 
## 30845 27019 28180

Aggregate to country level

mental_c <-mental %>%
  group_by(Country) %>%
  summarise(
    n=n(),
    treatment_rate=mean(Treatment, na.rm =TRUE),
   socialWeakness_rate=mean(SocialWeakness, na.rm =T),
    increasingstress_rate=mean(IncreasingStress, na.rm =T),
    mentalhistory_rate=mean(MentalHealthHistory, na.rm = T)
  ) %>%
  ungroup()
library(dplyr)
gdp_clean<- gdp %>%
  filter(!is.na(X), nchar(X) == 3)

gdp_cl <- gdp_clean %>%
  transmute(
    country = X.2,
    gdp_pc = as.numeric(gsub(",", "", X.3))
  )
## Warning: There was 1 warning in `transmute()`.
## ℹ In argument: `gdp_pc = as.numeric(gsub(",", "", X.3))`.
## Caused by warning:
## ! NAs introduced by coercion
head(gdp_cl)
##          country   gdp_pc
## 1  United States 28750956
## 2          China 18743803
## 3        Germany  4685593
## 4          Japan  4027598
## 5          India  3909892
## 6 United Kingdom  3686033
summary(gdp_cl$gdp_pc)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##        62     10651     44458   2003108    347034 110982661         8

GDP dataset is uncleared data. Therfore was cleaned by data cleaning.

df <- mental_c %>%
  left_join(gdp_cl, by =c("Country"="country")) %>%
  filter(!is.na(gdp_pc))

df$gdp_pc1000<-df$gdp_pc/1000

head(df)
## # A tibble: 6 × 8
##   Country             n treatment_rate socialWeakness_rate increasingstress_rate
##   <chr>           <int>          <dbl>               <dbl>                 <dbl>
## 1 Australia        3114          0.598               0.478                 0.532
## 2 Belgium           477          0                   0.5                   0.523
## 3 Bosnia and Her…   237          0                   0.472                 0.497
## 4 Brazil           1338          0.338               0.465                 0.521
## 5 Canada           9149          0.478               0.481                 0.517
## 6 Colombia          384          0                   0.476                 0.510
## # ℹ 3 more variables: mentalhistory_rate <dbl>, gdp_pc <dbl>, gdp_pc1000 <dbl>
nrow(df)
## [1] 32

Mental health country data and GDP country data were merged together. This code is about standardisation for K-means clustering.

X <- df %>% select(gdp_pc1000, treatment_rate, socialWeakness_rate,increasingstress_rate, mentalhistory_rate)
X_scaled <- scale(X)

set.seed(488101)
kmeanc <- sapply(2:8, function(k){
  kmeans(X_scaled, centers = k, nstart =25)$tot.withinss
})

plot(2:8, kmeanc, type="b",
     xlab="Number of clusters(K)",
     ylab="Cluster sum of squres")

From this code i am able to recognize where is the Elbow. According to Elbow method when cluster sum of squre start reduce less, that is the Elbow point. Therefore from this dataset elbow is 4 so deviding cluster at four is effective.

K-means Clustering

set.seed(488101)
km<-kmeans(X_scaled, centers =3, nstart = 50)

df$cluster <- factor(km$cluster)
table(df$cluster)
## 
##  1  2  3 
##  9 22  1
set.seed(488101)
km<-kmeans(X_scaled, centers=5, nstart = 50)

df$cluster <- factor(km$cluster)
table(df$cluster)
## 
##  1  2  3  4  5 
##  4  1  6  7 14
set.seed(488101)
km<-kmeans(X_scaled, center=4, nstart = 50)

df$cluster <- factor(km$cluster)
table(df$cluster)
## 
##  1  2  3  4 
## 10  1  6 15

Choosing K

Regard to K-means clustering, k=4 is the best option for mental health and GDP country data. Because clustering generate overfitting and many outliers at the k=3 and the k=5.Therfore k=4 was selected as a compromise meaningful interpretation.

Results

cluster_Profiles<- df %>%
  group_by(cluster) %>%
  summarise(
    n_countries = n(),
    avg.gdp_pc1000 = mean(gdp_pc1000),
    avg.treatment = mean(treatment_rate),
    avg.socialWeakness =mean(socialWeakness_rate),
    avg.IncreasingStress =mean(increasingstress_rate),
    avg.mentalhistory = mean(mentalhistory_rate)
  )

cluster_Profiles
## # A tibble: 4 × 7
##   cluster n_countries avg.gdp_pc1000 avg.treatment avg.socialWeakness
##   <fct>         <int>          <dbl>         <dbl>              <dbl>
## 1 1                10           630.        0.726               0.479
## 2 2                 1           671.        0                   0.5  
## 3 3                 6          3094.        0.271               0.476
## 4 4                15           461.        0.0333              0.476
## # ℹ 2 more variables: avg.IncreasingStress <dbl>, avg.mentalhistory <dbl>

Each Cluster has an identified character. Average of GDP per capita and mental health indicators by clusters. In this table, treatment experience average is explored that shows an explainable value difference. Therefore Clusters are able to obtain 4 group name. Cluster 1: Low- midium GDP, very high treatment rate Cluster 2: Midium GDP, very low treatment rate Cluster 3: Very high GDP, moderate treatment rate Cluster 4: Low GDP, low treatment rate

Visualisation

plot(df$gdp_pc1000,df$treatment_rate,
     xlab="GDP per capita/1000",
     ylab="Treatment experience rate",
     pch=15, col=df$cluster)

legend("topright",
       legend=levels(df$cluster),
       pch=15, col=1:length(levels(df$cluster)))

Above cluster profiles are summarized by average values. On the other hands this scatter plot reveals within clusters variability. This indicates that clusters remain general patterns rather than mean value.

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
cluster_labels<- c("Low-midium GDP, very high treatment rate", "Midium GDP, very low treatment rate", "Very high GDP, moderate treatment rate","Low GDP, low treatment rate")

df$cluster_name <- factor(df$cluster, labels = cluster_labels)

ggplot(df, aes(gdp_pc1000, treatment_rate, color=cluster_name))+geom_point(size=3, alpha=0.85)+
  scale_x_continuous(labels = comma)+
  scale_y_continuous(labels = percent_format(accuracy=1))+
  labs(x= "GDP per capita (thousand USD)",
       y= "Treatment experience rate",
       color= "Cluster")+
  theme_minimal()

Conclusion

This paper demonstrate K-mean clustering to explore patterns in the relationship between GDP and mental health data by country. Conclusionally countries’ data from GDP and mental health remain 4 groups from k-mean clustering as meaningful compromise.