K-means clustering is one of the unsupervised learning techniques for grouping observations based on multidimensional space similarity. K-means clustering is a practical tool for exploratory analysis when Data labeling is not available. Therefore in this paper, I analyze country group data on mental health and GDP. By merging two separate datasets obtained from Kaggle and World bank group this study explores patterns in the relationship information between economic development rate and mental health (Increasing stress)indicators. This process may generate countries and groups that follow the mental health and GDP relationship for K-mean clustering. After that I would figure out characteristics.
Mental Health Dataset Link from Kaggle https://www.kaggle.com/datasets/alamshihab075/mental-health-dataset
GDP Dataset link from World bank group https://datacatalog.worldbank.org/search/dataset/0038130/gdp-ranking
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Duplicate confirm
## Gender Country Occupation
## 0 0 0
## SelfEmployed FamilyHistory Treatment
## 0 0 0
## DaysIndoors HabitsChange MentalHealthHistory
## 0 0 0
## IncreasingStress MoodSwings SocialWeakness
## 0 0 0
## CopingStruggles WorkInterest SocialWeakness.1
## 0 0 0
## MentalHealthInterview CareOptions
## 0 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
mental<- mental %>% distinct()
v<- c("Treatment", "FamilyHistory","SocialWeakness", "SocialWeakness","MentalHealthHistory","DaysIndoors")
for(V in v){
mental[[V]]<- trimws(mental[[V]])
mental[[V]][mental[[V]] == ""] <- NA
}cl<-function(x){
x<- trimws(tolower(x))
ifelse(x == "yes",1,
ifelse(x == "no", 0, NA))
}
mental <- mental %>%
mutate(
Treatment = cl(Treatment),
SocialWeakness =cl(SocialWeakness),
IncreasingStress = cl(IncreasingStress),
MentalHealthHistory = cl(MentalHealthHistory)
)
table(mental$Treatment, useNA = "ifany")##
## 0 1
## 44714 41330
##
## 0 1 <NA>
## 28906 26528 30610
##
## 0 1 <NA>
## 27014 29415 29615
##
## 0 1 <NA>
## 30845 27019 28180
mental_c <-mental %>%
group_by(Country) %>%
summarise(
n=n(),
treatment_rate=mean(Treatment, na.rm =TRUE),
socialWeakness_rate=mean(SocialWeakness, na.rm =T),
increasingstress_rate=mean(IncreasingStress, na.rm =T),
mentalhistory_rate=mean(MentalHealthHistory, na.rm = T)
) %>%
ungroup()library(dplyr)
gdp_clean<- gdp %>%
filter(!is.na(X), nchar(X) == 3)
gdp_cl <- gdp_clean %>%
transmute(
country = X.2,
gdp_pc = as.numeric(gsub(",", "", X.3))
)## Warning: There was 1 warning in `transmute()`.
## ℹ In argument: `gdp_pc = as.numeric(gsub(",", "", X.3))`.
## Caused by warning:
## ! NAs introduced by coercion
## country gdp_pc
## 1 United States 28750956
## 2 China 18743803
## 3 Germany 4685593
## 4 Japan 4027598
## 5 India 3909892
## 6 United Kingdom 3686033
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 62 10651 44458 2003108 347034 110982661 8
GDP dataset is uncleared data. Therfore was cleaned by data cleaning.
df <- mental_c %>%
left_join(gdp_cl, by =c("Country"="country")) %>%
filter(!is.na(gdp_pc))
df$gdp_pc1000<-df$gdp_pc/1000
head(df)## # A tibble: 6 × 8
## Country n treatment_rate socialWeakness_rate increasingstress_rate
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Australia 3114 0.598 0.478 0.532
## 2 Belgium 477 0 0.5 0.523
## 3 Bosnia and Her… 237 0 0.472 0.497
## 4 Brazil 1338 0.338 0.465 0.521
## 5 Canada 9149 0.478 0.481 0.517
## 6 Colombia 384 0 0.476 0.510
## # ℹ 3 more variables: mentalhistory_rate <dbl>, gdp_pc <dbl>, gdp_pc1000 <dbl>
## [1] 32
Mental health country data and GDP country data were merged together. This code is about standardisation for K-means clustering.
X <- df %>% select(gdp_pc1000, treatment_rate, socialWeakness_rate,increasingstress_rate, mentalhistory_rate)
X_scaled <- scale(X)
set.seed(488101)
kmeanc <- sapply(2:8, function(k){
kmeans(X_scaled, centers = k, nstart =25)$tot.withinss
})
plot(2:8, kmeanc, type="b",
xlab="Number of clusters(K)",
ylab="Cluster sum of squres")From this code i am able to recognize where is the Elbow. According to Elbow method when cluster sum of squre start reduce less, that is the Elbow point. Therefore from this dataset elbow is 4 so deviding cluster at four is effective.
set.seed(488101)
km<-kmeans(X_scaled, centers =3, nstart = 50)
df$cluster <- factor(km$cluster)
table(df$cluster)##
## 1 2 3
## 9 22 1
set.seed(488101)
km<-kmeans(X_scaled, centers=5, nstart = 50)
df$cluster <- factor(km$cluster)
table(df$cluster)##
## 1 2 3 4 5
## 4 1 6 7 14
set.seed(488101)
km<-kmeans(X_scaled, center=4, nstart = 50)
df$cluster <- factor(km$cluster)
table(df$cluster)##
## 1 2 3 4
## 10 1 6 15
Regard to K-means clustering, k=4 is the best option for mental health and GDP country data. Because clustering generate overfitting and many outliers at the k=3 and the k=5.Therfore k=4 was selected as a compromise meaningful interpretation.
cluster_Profiles<- df %>%
group_by(cluster) %>%
summarise(
n_countries = n(),
avg.gdp_pc1000 = mean(gdp_pc1000),
avg.treatment = mean(treatment_rate),
avg.socialWeakness =mean(socialWeakness_rate),
avg.IncreasingStress =mean(increasingstress_rate),
avg.mentalhistory = mean(mentalhistory_rate)
)
cluster_Profiles## # A tibble: 4 × 7
## cluster n_countries avg.gdp_pc1000 avg.treatment avg.socialWeakness
## <fct> <int> <dbl> <dbl> <dbl>
## 1 1 10 630. 0.726 0.479
## 2 2 1 671. 0 0.5
## 3 3 6 3094. 0.271 0.476
## 4 4 15 461. 0.0333 0.476
## # ℹ 2 more variables: avg.IncreasingStress <dbl>, avg.mentalhistory <dbl>
Each Cluster has an identified character. Average of GDP per capita and mental health indicators by clusters. In this table, treatment experience average is explored that shows an explainable value difference. Therefore Clusters are able to obtain 4 group name. Cluster 1: Low- midium GDP, very high treatment rate Cluster 2: Midium GDP, very low treatment rate Cluster 3: Very high GDP, moderate treatment rate Cluster 4: Low GDP, low treatment rate
plot(df$gdp_pc1000,df$treatment_rate,
xlab="GDP per capita/1000",
ylab="Treatment experience rate",
pch=15, col=df$cluster)
legend("topright",
legend=levels(df$cluster),
pch=15, col=1:length(levels(df$cluster)))Above cluster profiles are summarized by average values. On the other hands this scatter plot reveals within clusters variability. This indicates that clusters remain general patterns rather than mean value.
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
cluster_labels<- c("Low-midium GDP, very high treatment rate", "Midium GDP, very low treatment rate", "Very high GDP, moderate treatment rate","Low GDP, low treatment rate")
df$cluster_name <- factor(df$cluster, labels = cluster_labels)
ggplot(df, aes(gdp_pc1000, treatment_rate, color=cluster_name))+geom_point(size=3, alpha=0.85)+
scale_x_continuous(labels = comma)+
scale_y_continuous(labels = percent_format(accuracy=1))+
labs(x= "GDP per capita (thousand USD)",
y= "Treatment experience rate",
color= "Cluster")+
theme_minimal()This paper demonstrate K-mean clustering to explore patterns in the relationship between GDP and mental health data by country. Conclusionally countries’ data from GDP and mental health remain 4 groups from k-mean clustering as meaningful compromise.