Introduction

This paper is a companion to my K-means clustering paper, which is available at the link below. http://rpubs.com/Crescantia/KmeansMentalGDP

Economic data and mental health data are both multidimensional. The components of mental health data can complement GDP, which makes differences between countries easier to identify. Dimension reduction techniques help here: they transform high-dimensional data into a smaller number of latent dimensions that preserve the relevant information, and so uncover the underlying structure of the variables themselves. This approach highlights how countries differ along several dimensions at once. Given the structure of this dataset, PCA, one of the dimension reduction techniques, is a suitable way to describe it. This paper therefore uses PCA to find representative axes for the combined GDP and mental health data.

Method

Variable selection from Dataset

This paper combines GDP data and mental health data by country. The mental health variables are treatment experience, increasing stress, social weakness and mental health history.

Pre-processing

library(tidyverse)
library(dplyr)
library(stringr)

gdp<-read.csv("GDP.csv")

mental<-read.csv("Mental Health dataset1.csv")

dim(gdp); dim(mental)
## [1] 329   6
## [1] 261328     17
head(gdp,3)
##   X Gross.domestic.product.2024 X.1     X.2          X.3 X.4
## 1                                NA                         
## 2                                NA         (millions of    
## 3                       Ranking  NA Economy  US dollars)
head(mental,3)
##   Gender Country Occupation SelfEmployed FamilyHistory Treatment
## 1 Female      UK     Others           No           Yes        No
## 2 Female     USA  Housewife           No           Yes        No
## 3 Female  Canada     Others           No            No       Yes
##          DaysIndoors HabitsChange MentalHealthHistory IncreasingStress
## 1         15-30 days           No                 Yes              Yes
## 2         15-30 days        Maybe               Maybe              Yes
## 3 More than 2 months        Maybe                  No               No
##   MoodSwings SocialWeakness CopingStruggles WorkInterest SocialWeakness.1
## 1       High             No             Yes        Maybe               No
## 2       High          Maybe             Yes        Maybe            Maybe
## 3     Medium             No              No           No               No
##   MentalHealthInterview CareOptions
## 1                    No          No
## 2                    No    Not sure
## 3                    No    Not sure
cl<-function(x){
  x<- trimws(tolower(x))
  ifelse(x == "yes",1,
         ifelse(x == "no", 0, NA))
}

mental <- mental %>%
  mutate(
    Treatment = cl(Treatment),
    SocialWeakness =cl(SocialWeakness),
    IncreasingStress = cl(IncreasingStress),
    MentalHealthHistory = cl(MentalHealthHistory)
  )
colSums(is.na(mental[, c("Country", "Treatment", "SocialWeakness","IncreasingStress", "MentalHealthHistory")]))
##             Country           Treatment      SocialWeakness    IncreasingStress 
##                   0                   0               93271               90697 
## MentalHealthHistory 
##               84989
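
The NA counts above follow from the recoding rule: any answer other than "Yes" or "No" (for example "Maybe") becomes NA. The sketch below applies the same rule to a small made-up vector; the name recode_yes_no and the sample answers are assumptions for illustration only, not part of the dataset.

```r
# Same rule as cl() above, under an illustrative name
recode_yes_no <- function(x) {
  x <- trimws(tolower(x))
  ifelse(x == "yes", 1,
         ifelse(x == "no", 0, NA))
}

# Made-up answers, including values that are neither "yes" nor "no"
answers <- c("Yes", " no ", "Maybe", "Not sure")
coded <- recode_yes_no(answers)
coded

# Number of answers that end up as NA
n_missing <- sum(is.na(coded))

# Proportion of "yes" among the yes/no answers only
yes_rate <- mean(coded, na.rm = TRUE)
yes_rate
```

Because the rates are later computed with na.rm = TRUE, each rate is the share of "Yes" among respondents who answered "Yes" or "No", not among all respondents.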

The mental health variables are recorded per individual respondent, so they must be aggregated to country level by calculating proportions. Answers other than "Yes" or "No" (such as "Maybe") are coded as NA and excluded from each proportion. The GDP dataset is then merged with the mental health data by country name, and only countries that appear in both datasets are kept.

mental_c <- mental %>%
  group_by(Country) %>%
  summarise(
    n = n(),
    treatment_rate = mean(Treatment, na.rm = TRUE),
    socialWeakness_rate = mean(SocialWeakness, na.rm = TRUE),
    increasingstress_rate = mean(IncreasingStress, na.rm = TRUE),
    mentalhistory_rate = mean(MentalHealthHistory, na.rm = TRUE)
  ) %>%
  ungroup()

mental_country<- mental_c %>% filter(n>= 10)

head(mental_country)
## # A tibble: 6 × 6
##   Country             n treatment_rate socialWeakness_rate increasingstress_rate
##   <chr>           <int>          <dbl>               <dbl>                 <dbl>
## 1 Australia        3544          0.606               0.477                 0.537
## 2 Belgium           477          0                   0.5                   0.523
## 3 Bosnia and Her…   237          0                   0.472                 0.497
## 4 Brazil           1338          0.338               0.465                 0.521
## 5 Canada          14177          0.520               0.480                 0.517
## 6 Colombia          384          0                   0.476                 0.510
## # ℹ 1 more variable: mentalhistory_rate <dbl>
gdp_clean <-gdp %>%
  filter(!is.na(X), nchar(as.character(X)) ==3)

gdp_cl <-gdp_clean %>%
  transmute(
    CountryCode = as.character(X),
    gdp_pc = as.numeric(gsub(",","", as.character(X.3)))
  )
## Warning: There was 1 warning in `transmute()`.
## ℹ In argument: `gdp_pc = as.numeric(gsub(",", "", as.character(X.3)))`.
## Caused by warning:
## ! NAs introduced by coercion
summary(gdp_cl$gdp_pc)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##        62     10651     44458   2003108    347034 110982661         8
head(gdp_cl)
##   CountryCode   gdp_pc
## 1         USA 28750956
## 2         CHN 18743803
## 3         DEU  4685593
## 4         JPN  4027598
## 5         IND  3909892
## 6         GBR  3686033
gdp_clean <- gdp %>%
  mutate(
    X = str_trim(as.character(X)),
    X.2 = str_trim(as.character(X.2)),
    X.3 = str_trim(as.character(X.3))
  ) %>%
  filter(!is.na(X), nchar(X) == 3)

gdp_cl2 <- gdp_clean %>%
  transmute(
    country = X.2,
    gdp_pc = as.numeric(gsub(",", "", X.3))
  ) %>%
  filter(!is.na(country), !is.na(gdp_pc))
## Warning: There was 1 warning in `transmute()`.
## ℹ In argument: `gdp_pc = as.numeric(gsub(",", "", X.3))`.
## Caused by warning:
## ! NAs introduced by coercion
head(gdp_cl2)
##          country   gdp_pc
## 1  United States 28750956
## 2          China 18743803
## 3        Germany  4685593
## 4          Japan  4027598
## 5          India  3909892
## 6 United Kingdom  3686033
summary(gdp_cl2)
##    country              gdp_pc         
##  Length:221         Min.   :       62  
##  Class :character   1st Qu.:    10651  
##  Mode  :character   Median :    44458  
##                     Mean   :  2003108  
##                     3rd Qu.:   347034  
##                     Max.   :110982661
normal_country <- function(x){
  x %>%
    as.character() %>%
    str_to_lower() %>%
    str_trim() %>%
    str_replace_all("&", "and")%>%
    str_replace_all("[^a-z]", " ")%>%
    str_squish()
}

mental_country2 <- mental_country %>%
  mutate(country_key =normal_country(Country))

gdp_cl2b <- gdp_cl2 %>%
  mutate(country_key = normal_country(country))
df<-mental_country2 %>%
  left_join(gdp_cl2b %>% select(country_key, gdp_pc),by ="country_key") %>%
  filter(!is.na(gdp_pc)) %>%
  mutate(gdp_pc1000=gdp_pc/1000)
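
The join above keeps only rows whose country_key matches exactly, so countries labelled differently in the two sources are dropped without a message. The survey data uses labels such as "UK" and "USA", while the GDP table uses "United Kingdom" and "United States". The following is a minimal sketch of how such mismatches could be listed and repaired; the key vectors and the alias table are hypothetical examples, not the paper's actual data.

```r
# Hypothetical normalized keys: survey-style labels vs. GDP-table-style labels
survey_keys <- c("usa", "uk", "canada", "australia")
gdp_keys <- c("united states", "united kingdom", "canada", "australia", "china")

# Keys that would be lost by left_join() followed by filter(!is.na(gdp_pc))
unmatched <- setdiff(survey_keys, gdp_keys)
unmatched

# Assumed alias table (illustrative only); extend it after inspecting 'unmatched'
aliases <- c(usa = "united states", uk = "united kingdom")

# Replace a key by its alias when one exists, otherwise keep it unchanged
fixed_keys <- unname(ifelse(survey_keys %in% names(aliases),
                            aliases[survey_keys],
                            survey_keys))

# After the fix, no survey key is left without a GDP match
still_unmatched <- setdiff(fixed_keys, gdp_keys)
still_unmatched
```

On the real data, the same check could be run on mental_country2$country_key and gdp_cl2b$country_key before the join.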

Running PCA

X <- df %>%
  select(gdp_pc1000, 
         treatment_rate, 
         socialWeakness_rate,
         increasingstress_rate,
         mentalhistory_rate)

X <- na.omit(X)
X_scaled <- scale(X)
# X_scaled is already standardized, so center/scale. leave it unchanged here
pca_res <- prcomp(X_scaled, center = TRUE, scale. = TRUE)

summary(pca_res)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.4254 1.1267 0.9790 0.70019 0.50033
## Proportion of Variance 0.4063 0.2539 0.1917 0.09805 0.05007
## Cumulative Proportion  0.4063 0.6602 0.8519 0.94993 1.00000
pca_res$rotation
##                               PC1        PC2          PC3         PC4
## gdp_pc1000            -0.08657137  0.4252005  0.875131500 -0.20234764
## treatment_rate         0.19301356  0.6824869 -0.445371109 -0.53317482
## socialWeakness_rate    0.59210482 -0.2668186  0.148058099 -0.39315589
## increasingstress_rate  0.47714304  0.4643266  0.003746838  0.72006809
## mentalhistory_rate     0.61400128 -0.2581179  0.117703839 -0.04135744
##                               PC5
## gdp_pc1000             0.07007323
## treatment_rate         0.11969356
## socialWeakness_rate   -0.63381936
## increasingstress_rate -0.19550764
## mentalhistory_rate     0.73539988
plot(pca_res, type="l")

scores <- as.data.frame(pca_res$x)

plot(scores$PC1, scores$PC2,
     xlab = "PC1",
     ylab = "PC2",
     pch = 19, col=4)
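
To make the "Proportion of Variance" row easier to read, the sketch below shows on simulated data (not the paper's dataset) that PCA on standardized variables is an eigen-decomposition of the correlation matrix, and that standardizing the data twice does not change the result. The data frame toy and its columns are assumptions made for this example.

```r
set.seed(1)

# Simulated data with three variables, two of them correlated
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
toy$b <- toy$b + 0.8 * toy$a

# PCA on raw data with internal standardization
pca_raw <- prcomp(toy, center = TRUE, scale. = TRUE)

# PCA on data that was already standardized with scale()
pca_pre <- prcomp(scale(toy), center = TRUE, scale. = TRUE)

# Eigenvalues of the correlation matrix
eig <- eigen(cor(toy))

# Component variances (sdev^2) equal the eigenvalues
same_as_eigen <- isTRUE(all.equal(pca_raw$sdev^2, eig$values))

# Standardizing twice gives the same component standard deviations
same_when_prescaled <- isTRUE(all.equal(pca_raw$sdev, pca_pre$sdev))

# Proportion of variance = eigenvalue / number of standardized variables
prop_var <- eig$values / sum(eig$values)
prop_var
```

The same arithmetic applies to the summary above: with five standardized variables, PC1 has variance 1.4254^2, or about 2.03, and 2.03 / 5 gives the reported proportion of about 0.406.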

Result

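The variables were standardized before Principal Component Analysis (PCA), so each one contributes on the same scale to the reduced dimensions that describe cross-country differences in GDP and mental health. The first three components explain about 85% of the variance (41%, 25% and 19%). According to the loadings, PC1 is led by the mental health variables: mental health history (0.61), social weakness (0.59) and increasing stress (0.48), while GDP contributes almost nothing (-0.09). This axis reflects the overall pressure of mental health problems and does not rely on the economic level, so it captures differences between countries in mental health status itself. PC2 is led by the treatment rate (0.68), together with increasing stress (0.46) and GDP (0.43). It can be read as the axis on which a higher economic level goes together with easier access to mental health care. GDP itself loads mainly on PC3 (0.88). Note that the GDP variable is total GDP, which the source table reports in millions of US dollars, not GDP per person.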
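
One quick way to check which variable leads each component is to compare absolute loadings. The sketch below re-enters the PC1 to PC3 columns of the rotation matrix printed above by hand; the object names loadings and dominant are assumptions of this example, and nothing is recomputed from the data.

```r
# PC1-PC3 loadings copied from the pca_res$rotation output above
loadings <- matrix(
  c(-0.08657137,  0.4252005,  0.875131500,
     0.19301356,  0.6824869, -0.445371109,
     0.59210482, -0.2668186,  0.148058099,
     0.47714304,  0.4643266,  0.003746838,
     0.61400128, -0.2581179,  0.117703839),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("gdp_pc1000", "treatment_rate", "socialWeakness_rate",
      "increasingstress_rate", "mentalhistory_rate"),
    c("PC1", "PC2", "PC3")
  )
)

# Variable with the largest absolute loading on each component
dominant <- rownames(loadings)[apply(abs(loadings), 2, which.max)]
names(dominant) <- colnames(loadings)
dominant

# Squared loadings: each variable's share of a unit-length component
round(loadings^2, 2)
```

The sign of a loading is arbitrary up to a flip of the whole component, which is why the comparison uses absolute values.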

How do the PCA and MDS results differ?

d<- dist(X_scaled)
mds_res <- cmdscale(d,k=2)

plot(mds_res[,1],mds_res[,2],
     xlab="MDS1", ylab ="MDS2", pch=19)

scores <- as.data.frame(pca_res$x)

plot(scores$PC1, scores$PC2,
     xlab = "PCA PC1",
     ylab = "PCA PC2",
     pch = 19, col=4)

For comparison, multidimensional scaling (MDS) was run on the same standardized data. The two plots show similar shapes. This is expected, because classical MDS on Euclidean distances reproduces the PCA scores up to the sign of each axis, so the agreement shows that the two methods are consistent rather than providing independent evidence. PCA is also a useful complement to K-means clustering: K-means is effective at grouping similar countries, whereas PCA explains the latent dimensions along which they differ.
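
The similarity of the two plots can be checked numerically. The sketch below uses simulated data (an assumption of this example, not the paper's dataset) to show that cmdscale() on Euclidean distances and prcomp() give the same two-dimensional coordinates, apart from a possible sign flip of each axis.

```r
set.seed(42)

# Simulated standardized data: 20 observations, 4 variables
toy <- matrix(rnorm(20 * 4), nrow = 20)
toy_scaled <- scale(toy)

# First two principal component scores
pca_scores <- prcomp(toy_scaled)$x[, 1:2]

# Classical MDS on Euclidean distances, two dimensions
mds_coords <- cmdscale(dist(toy_scaled), k = 2)

# Compare absolute values, since each axis may be mirrored
max_gap <- max(abs(abs(pca_scores) - abs(mds_coords)))
max_gap
```

A max_gap close to zero means the two configurations coincide. With a non-Euclidean distance, or with non-metric MDS, the two methods would no longer be guaranteed to agree.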

Reference

GeeksforGeeks, Dimensionality Reduction Techniques, https://www.geeksforgeeks.org/data-science/dimensionality-reduction-techniques/