This paper is a companion to my K-means clustering paper, which is available here: http://rpubs.com/Crescantia/KmeansMentalGDP
Economic data and mental health data are both multidimensional. Mental health indicators can complement GDP, because they describe aspects of a country that economic output alone does not capture, and combining the two makes differences between countries easier to identify. Dimension reduction techniques help here: they transform high-dimensional data into a smaller number of latent dimensions that preserve the relevant information, and in doing so they uncover the underlying structure of the variables themselves. This makes it possible to compare countries along several dimensions at once. Principal component analysis (PCA) is one such technique, and it suits the numeric, country-level variables used here. In this paper, PCA is used to find the representative axes of the combined GDP and mental health data.
The analysis uses GDP and mental health data combined by country. The mental health variables are treatment experience, increasing stress, social weakness, and mental health history.
library(dplyr)
library(stringr)
gdp<-read.csv("GDP.csv")
mental<-read.csv("Mental Health dataset1.csv")
dim(gdp); dim(mental)
## [1] 329   6
## [1] 261328     17
## X Gross.domestic.product.2024 X.1 X.2 X.3 X.4
## 1 NA
## 2 NA (millions of
## 3 Ranking NA Economy US dollars)
## Gender Country Occupation SelfEmployed FamilyHistory Treatment
## 1 Female UK Others No Yes No
## 2 Female USA Housewife No Yes No
## 3 Female Canada Others No No Yes
## DaysIndoors HabitsChange MentalHealthHistory IncreasingStress
## 1 15-30 days No Yes Yes
## 2 15-30 days Maybe Maybe Yes
## 3 More than 2 months Maybe No No
## MoodSwings SocialWeakness CopingStruggles WorkInterest SocialWeakness.1
## 1 High No Yes Maybe No
## 2 High Maybe Yes Maybe Maybe
## 3 Medium No No No No
## MentalHealthInterview CareOptions
## 1 No No
## 2 No Not sure
## 3 No Not sure
# Recode yes/no answers to 1/0; any other answer (e.g. "Maybe") becomes NA
cl <- function(x) {
  x <- trimws(tolower(x))
  ifelse(x == "yes", 1,
         ifelse(x == "no", 0, NA))
}
mental <- mental %>%
  mutate(
    Treatment = cl(Treatment),
    SocialWeakness = cl(SocialWeakness),
    IncreasingStress = cl(IncreasingStress),
    MentalHealthHistory = cl(MentalHealthHistory)
  )
colSums(is.na(mental[, c("Country", "Treatment", "SocialWeakness", "IncreasingStress", "MentalHealthHistory")]))
##             Country           Treatment      SocialWeakness    IncreasingStress
##                   0                   0               93271               90697
## MentalHealthHistory
##               84989
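The NA counts above follow from the recoding rule: any answer other than "yes" or "no" is treated as missing. For SocialWeakness, for example, 93271 of 261328 rows (about 36%) end up as NA. The following is a minimal sketch of that behaviour on a made-up answer vector; the function name `recode_yes_no` and the sample answers are illustrative assumptions, and the rule itself is copied from `cl()`.

```r
# Same recoding rule as cl() above, repeated so the block runs on its own
recode_yes_no <- function(x) {
  x <- trimws(tolower(x))
  ifelse(x == "yes", 1, ifelse(x == "no", 0, NA))
}

# Hypothetical answers, including stray whitespace and non-committal replies
answers <- c("Yes", " no ", "Maybe", "Not sure", NA)
coded <- recode_yes_no(answers)

coded                      # "Maybe", "Not sure" and NA all become NA
sum(is.na(coded))          # number of answers treated as missing
mean(coded, na.rm = TRUE)  # share of "yes" among yes/no answers only
```

Because "Maybe" is dropped rather than counted as "no", the country rates computed later describe only respondents who gave a definite answer.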
The mental health variables are recorded per individual respondent, so they must be aggregated to country level by computing the proportion of "yes" answers in each country. The GDP dataset was then merged with the mental health data on a country identifier, and only countries with complete data in both sources were kept.
# Aggregate to country level: share of "yes" among the yes/no answers
mental_c <- mental %>%
  group_by(Country) %>%
  summarise(
    n = n(),
    treatment_rate = mean(Treatment, na.rm = TRUE),
    socialWeakness_rate = mean(SocialWeakness, na.rm = TRUE),
    increasingstress_rate = mean(IncreasingStress, na.rm = TRUE),
    mentalhistory_rate = mean(MentalHealthHistory, na.rm = TRUE)
  ) %>%
  ungroup()
# Keep countries with at least 10 respondents
mental_country <- mental_c %>% filter(n >= 10)
head(mental_country)
## # A tibble: 6 × 6
## Country n treatment_rate socialWeakness_rate increasingstress_rate
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Australia 3544 0.606 0.477 0.537
## 2 Belgium 477 0 0.5 0.523
## 3 Bosnia and Her… 237 0 0.472 0.497
## 4 Brazil 1338 0.338 0.465 0.521
## 5 Canada 14177 0.520 0.480 0.517
## 6 Colombia 384 0 0.476 0.510
## # ℹ 1 more variable: mentalhistory_rate <dbl>
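One detail of the aggregation is worth making explicit: `n` counts every respondent in a country, while each rate is computed only over respondents with a yes/no answer. The sketch below shows the difference on made-up data; the toy data frame and the extra column `n_answered` are assumptions added for illustration and are not part of the original pipeline.

```r
library(dplyr)

# Hypothetical toy data: one row per respondent, answers already coded 1/0/NA
toy <- data.frame(
  Country   = c("A", "A", "A", "A", "B", "B"),
  Treatment = c(1, 0, NA, 1, 0, 0)
)

toy_rates <- toy %>%
  group_by(Country) %>%
  summarise(
    n = n(),                                         # all respondents, NA included
    n_answered = sum(!is.na(Treatment)),             # respondents with a yes/no answer
    treatment_rate = mean(Treatment, na.rm = TRUE),  # share of "yes" among those
    .groups = "drop"
  )

print(toy_rates)
```

If a variable has many NA values, filtering on `n_answered` instead of `n` would be a stricter way to make sure each rate rests on enough answers.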
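# Keep only rows whose first column holds a 3-letter country code
gdp_clean <- gdp %>%
  filter(!is.na(X), nchar(as.character(X)) == 3)
# X.3 holds GDP in millions of US dollars (see the file header above),
# i.e. total GDP, despite the gdp_pc name
gdp_cl <- gdp_clean %>%
  transmute(
    CountryCode = as.character(X),
    gdp_pc = as.numeric(gsub(",", "", as.character(X.3)))
  )
## Warning: There was 1 warning in `transmute()`.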
## ℹ In argument: `gdp_pc = as.numeric(gsub(",", "", as.character(X.3)))`.
## Caused by warning:
## ! NAs introduced by coercion
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 62 10651 44458 2003108 347034 110982661 8
## CountryCode gdp_pc
## 1 USA 28750956
## 2 CHN 18743803
## 3 DEU 4685593
## 4 JPN 4027598
## 5 IND 3909892
## 6 GBR 3686033
gdp_clean <- gdp %>%
  mutate(
    X = str_trim(as.character(X)),
    X.2 = str_trim(as.character(X.2)),
    X.3 = str_trim(as.character(X.3))) %>%
  filter(!is.na(X), nchar(X) == 3)
gdp_cl2 <- gdp_clean %>%
  transmute(
    country = X.2,
    gdp_pc = as.numeric(gsub(",", "", X.3))
  ) %>%
  filter(!is.na(country), !is.na(gdp_pc))
## Warning: There was 1 warning in `transmute()`.
## ℹ In argument: `gdp_pc = as.numeric(gsub(",", "", X.3))`.
## Caused by warning:
## ! NAs introduced by coercion
## country gdp_pc
## 1 United States 28750956
## 2 China 18743803
## 3 Germany 4685593
## 4 Japan 4027598
## 5 India 3909892
## 6 United Kingdom 3686033
## country gdp_pc
## Length:221 Min. : 62
## Class :character 1st Qu.: 10651
## Mode :character Median : 44458
## Mean : 2003108
## 3rd Qu.: 347034
## Max. :110982661
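# Normalise country names into a join key: lower case, "&" to "and",
# every non-letter character to a space, repeated spaces collapsed
normal_country <- function(x) {
  x %>%
    as.character() %>%
    str_to_lower() %>%
    str_trim() %>%
    str_replace_all("&", "and") %>%
    str_replace_all("[^a-z]", " ") %>%
    str_squish()
}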
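The normalised key is presumably what the two country-level tables are joined on, but the join itself does not appear in this write-up. The sketch below shows one way it could look on made-up rows. The toy tables, the use of `inner_join()`, and the definition `gdp_pc1000 = gdp_pc / 1000` are all assumptions, not the author's confirmed method. It also shows how `anti_join()` can list countries that fail to match. The survey data writes "USA" and "UK" while the GDP table writes "United States" and "United Kingdom", so such rows would need an alias before they can be joined.

```r
library(dplyr)
library(stringr)

# Same normalisation as above, repeated so the block runs on its own
normal_country <- function(x) {
  x %>%
    as.character() %>%
    str_to_lower() %>%
    str_trim() %>%
    str_replace_all("&", "and") %>%
    str_replace_all("[^a-z]", " ") %>%
    str_squish()
}

# Hypothetical stand-ins for mental_country and gdp_cl2 (values are made up)
mental_toy <- data.frame(
  Country = c("Canada", "Bosnia and Herzegovina", "USA"),
  treatment_rate = c(0.5, 0.1, 0.6)
)
gdp_toy <- data.frame(
  country = c("Canada", "Bosnia & Herzegovina", "United States"),
  gdp_pc = c(2000000, 30000, 28000000)
)

mental_keyed <- mental_toy %>% mutate(country_key = normal_country(Country))
gdp_keyed <- gdp_toy %>% mutate(country_key = normal_country(country))

# Assumed join: keep only countries present in both tables
df_toy <- inner_join(mental_keyed, gdp_keyed, by = "country_key") %>%
  mutate(gdp_pc1000 = gdp_pc / 1000)  # assumed definition of gdp_pc1000

# Countries in the survey data that found no GDP match
unmatched <- anti_join(mental_keyed, gdp_keyed, by = "country_key")

print(df_toy)
print(unmatched)
```

Checking the unmatched rows before running PCA is a cheap way to see how many countries the join silently drops.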
mental_country2 <- mental_country %>%
  mutate(country_key = normal_country(Country))
gdp_cl2b <- gdp_cl2 %>%
  mutate(country_key = normal_country(country))
# df is the merged country-level table; its construction is not shown here
X <- df %>%
  select(gdp_pc1000,
         treatment_rate,
         socialWeakness_rate,
         increasingstress_rate,
         mentalhistory_rate)
X <- na.omit(X)
X_scaled <- scale(X)
# X_scaled is already standardized, so center/scale. leave it unchanged
pca_res <- prcomp(X_scaled, center = TRUE, scale. = TRUE)
summary(pca_res)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.4254 1.1267 0.9790 0.70019 0.50033
## Proportion of Variance 0.4063 0.2539 0.1917 0.09805 0.05007
## Cumulative Proportion 0.4063 0.6602 0.8519 0.94993 1.00000
## PC1 PC2 PC3 PC4
## gdp_pc1000 -0.08657137 0.4252005 0.875131500 -0.20234764
## treatment_rate 0.19301356 0.6824869 -0.445371109 -0.53317482
## socialWeakness_rate 0.59210482 -0.2668186 0.148058099 -0.39315589
## increasingstress_rate 0.47714304 0.4643266 0.003746838 0.72006809
## mentalhistory_rate 0.61400128 -0.2581179 0.117703839 -0.04135744
## PC5
## gdp_pc1000 0.07007323
## treatment_rate 0.11969356
## socialWeakness_rate -0.63381936
## increasingstress_rate -0.19550764
## mentalhistory_rate 0.73539988
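The "Proportion of Variance" row in the summary can be reproduced from the standard deviations alone: each component's variance is its squared standard deviation, and for five standardized variables these variances sum to about 5. The sketch below recomputes the rows from the printed values. The eigenvalue-greater-than-one rule at the end (the Kaiser rule of thumb) is added here only as an illustration and is not a criterion used in the paper.

```r
# Standard deviations copied from the summary(pca_res) output above
sdev <- c(PC1 = 1.4254, PC2 = 1.1267, PC3 = 0.9790, PC4 = 0.70019, PC5 = 0.50033)

eigenvalues <- sdev^2                       # variance carried by each component
prop_var <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var <- cumsum(prop_var)                 # cumulative proportion

round(rbind(eigenvalues, prop_var, cum_var), 4)

# Kaiser rule of thumb (an assumption, not the paper's criterion):
# keep components whose eigenvalue exceeds 1
names(eigenvalues)[eigenvalues > 1]
```

The first two components together carry about two thirds of the variance, so a PC1 against PC2 scatter plot leaves roughly one third of the variation unseen.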
scores <- as.data.frame(pca_res$x)
plot(scores$PC1, scores$PC2,
     xlab = "PC1",
     ylab = "PC2",
     pch = 19, col = 4)
PCA was run on the standardized variables to reduce the dimension of the data and to describe cross-country differences in GDP and mental health. PC1 (40.6% of the variance) is led by the mental health variables: mental health history (0.61), social weakness (0.59), and increasing stress (0.48) all load strongly, while GDP loads close to zero (-0.09). This axis reflects the overall pressure of mental health problems and does not depend on the economic level. PC2 (25.4%) is led by the treatment rate (0.68), together with increasing stress (0.46) and GDP (0.43). GDP and treatment load in the same direction on this axis, which is consistent with the idea that people in economically stronger countries can reach mental health care more easily. GDP itself is captured mainly by PC3 (loading 0.88, 19.2% of the variance).
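One way to check which variables drive each component is to read the loadings table directly. The sketch below re-enters the first three columns of the printed loadings by hand and picks the variable with the largest absolute loading on each component. It is an illustration built on the printed output, not part of the original analysis.

```r
# First three columns of the loadings, copied from the output above
loadings <- rbind(
  gdp_pc1000            = c(-0.08657137,  0.4252005,  0.875131500),
  treatment_rate        = c( 0.19301356,  0.6824869, -0.445371109),
  socialWeakness_rate   = c( 0.59210482, -0.2668186,  0.148058099),
  increasingstress_rate = c( 0.47714304,  0.4643266,  0.003746838),
  mentalhistory_rate    = c( 0.61400128, -0.2581179,  0.117703839)
)
colnames(loadings) <- c("PC1", "PC2", "PC3")

# Variable with the largest absolute loading on each component
dominant <- apply(abs(loadings), 2, function(v) rownames(loadings)[which.max(v)])
dominant

# Squared loadings sum to 1 within a component, so each entry can be read
# as the share of that component attributable to the variable
round(loadings^2, 2)
```

The sign of a component is arbitrary, so only the relative signs of the loadings within one column carry meaning.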
d <- dist(X_scaled)
mds_res <- cmdscale(d, k = 2)
plot(mds_res[, 1], mds_res[, 2],
     xlab = "MDS1", ylab = "MDS2", pch = 19)
scores <- as.data.frame(pca_res$x)
plot(scores$PC1, scores$PC2,
     xlab = "PCA PC1",
     ylab = "PCA PC2",
     pch = 19, col = 4)
For comparison, multidimensional scaling (MDS) was run on the same dataset. The two plots show the same shape. This is expected, because classical MDS on Euclidean distances reproduces the PCA scores up to a reflection of the axes, so the agreement confirms the computation rather than providing independent evidence. PCA is also a useful complement to K-means clustering: K-means groups countries with similar values, whereas PCA explains the latent dimensions along which the countries differ.
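The relationship between the two plots can be checked numerically. The sketch below uses the built-in `USArrests` dataset purely as a stand-in for the country table (an assumption made so the block runs on its own) and compares the first two PCA scores with the two MDS coordinates. With Euclidean distances on the same standardized data, the two should agree up to the sign of each axis.

```r
# Built-in dataset, used only as a stand-in for the country table
X_demo <- scale(USArrests)

pca_demo <- prcomp(X_demo)                 # PCA on standardized data
mds_demo <- cmdscale(dist(X_demo), k = 2)  # classical MDS, Euclidean distances

# Compare absolute values, because each axis is only defined up to sign
max_diff <- max(abs(abs(unname(mds_demo)) - abs(unname(pca_demo$x[, 1:2]))))
max_diff
```

A different distance, for example `dist(X_demo, method = "manhattan")`, or a non-metric method would be needed for MDS to give a view that is genuinely independent of PCA.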
GeeksforGeeks, Dimensionality Reduction Techniques, https://www.geeksforgeeks.org/data-science/dimensionality-reduction-techniques/