Principal component analysis (PCA) is a statistical technique for analyzing high-dimensional data and capturing its most important information. It transforms the original data into a lower-dimensional space in which highly correlated variables are combined into shared components. In our scenario, we selected six indicators measured among students: mental health (anxiety and/or depression), food insecurity, housing instability, no health insurance, no regular healthcare provider, and needing medical care but not receiving it.
Load Libraries
library('corrr')
library('FactoMineR')
library('factoextra')
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library('ggcorrplot')
library('ggfortify')
library('rmarkdown')
Load & normalize the data
BDP <- read.csv("C:\\Users\\SPH User\\Downloads\\PCAdata_1_12.csv")
colSums(is.na(BDP))
## STUDY_ID Anxious Depressed
## 2 25 24
## Anxiety_or_Depression Housing_Instability Food_Insecurity
## 25 44 75
## No_Reg_Provider No_Medical_Care No_Health_Insurance
## 5 5 2
bdp_clean <- na.omit(BDP)
numerical_data <- bdp_clean[,4:9]
data_normalized <- scale(numerical_data)
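As a minimal sketch of what `scale()` does with its default arguments, the toy example below centers each column to mean 0 and rescales it to standard deviation 1. The data frame `toy` and its column names are hypothetical and are not the study data.

```r
# Toy data frame with hypothetical 0/1 indicators (not the study data)
toy <- data.frame(
  indicator_a = c(0, 1, 1, 0, 1, 0),
  indicator_b = c(1, 1, 0, 0, 1, 1)
)

# scale() subtracts each column mean and divides by the column standard deviation
toy_scaled <- scale(toy)

# The same result computed by hand, column by column
centered   <- sweep(as.matrix(toy), 2, colMeans(toy), "-")
toy_manual <- sweep(centered, 2, apply(toy, 2, sd), "/")

round(colMeans(toy_scaled), 10)  # column means after scaling
apply(toy_scaled, 2, sd)         # column standard deviations after scaling
```

Standardizing in this way puts all indicators on a common scale, so that no single variable dominates the components simply because of how it is coded.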
Positive correlations: dark peach = strongly correlated, medium peach = moderately correlated, light peach = weakly correlated
Strongly correlated
Moderately correlated
Weakly correlated:
Each principal component explains a percentage of the total variance in the data set, and each component is uncorrelated with every other component.
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.4568301 0.4038822 0.3643973 0.3224126 0.25990562
## Proportion of Variance 0.3086725 0.2412670 0.1963989 0.1537491 0.09991252
## Cumulative Proportion 0.3086725 0.5499395 0.7463384 0.9000875 1.00000000
## Comp.6
## Standard deviation 1.747316e-08
## Proportion of Variance 4.515767e-16
## Cumulative Proportion 1.000000e+00
The first two principal components explain only 55% of the total variance in the dataset. Four components are required to capture 90% of the variance across the six variables, which is not ideal. Thus, two principal components alone do not adequately represent the data.
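The percentages quoted above follow directly from the component standard deviations. As a small worked sketch, the values below are copied from the "Importance of components" output; the variable names (`sdev`, `prop_var`, `cum_var`, `n_needed`) are illustrative only.

```r
# Standard deviations of the six components, copied from the summary output
sdev <- c(0.4568301, 0.4038822, 0.3643973, 0.3224126, 0.25990562, 1.747316e-08)

# Each component's variance is its squared standard deviation;
# the proportion of variance is that variance divided by the total
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

round(prop_var, 4)
round(cum_var, 4)

# Smallest number of components whose cumulative proportion reaches 90%
n_needed <- which(cum_var >= 0.90)[1]
n_needed  # 4
```

The same rule can be applied with a different threshold (for example 0.80) by changing the cutoff in the last step.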
Generate loading matrix
The loading matrix shows how the principal components relate to each variable.
## Comp.1 Comp.2
## Anxiety_or_Depression 1.901091e-01 0.30264654
## Housing_Instability 5.194424e-01 -0.21590451
## Food_Insecurity 5.621210e-01 -0.09976055
## No_Reg_Provider -3.967742e-01 0.53582719
## No_Medical_Care -5.881487e-05 0.30156646
## No_Health_Insurance -4.697110e-01 -0.68832045
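One way to read a loading matrix is to look for the variable with the largest absolute loading on each component. The sketch below re-enters the printed loadings by hand (the object names `loadings_mat` and `top_var` are illustrative) and also checks that the squared loadings within each component sum to approximately 1, as expected for unit-length loading vectors.

```r
# Loadings for the first two components, copied from the output above
loadings_mat <- matrix(
  c( 1.901091e-01,  0.30264654,
     5.194424e-01, -0.21590451,
     5.621210e-01, -0.09976055,
    -3.967742e-01,  0.53582719,
    -5.881487e-05,  0.30156646,
    -4.697110e-01, -0.68832045),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Anxiety_or_Depression", "Housing_Instability", "Food_Insecurity",
      "No_Reg_Provider", "No_Medical_Care", "No_Health_Insurance"),
    c("Comp.1", "Comp.2")
  )
)

# Squared loadings within a component sum to (approximately) 1
colSums(loadings_mat^2)

# Variable with the largest absolute loading on each component
top_var <- apply(abs(loadings_mat), 2, function(x) names(which.max(x)))
top_var
```

On these printed values, Food_Insecurity has the largest absolute loading on Comp.1 and No_Health_Insurance has the largest on Comp.2. The sign of a loading indicates the direction of the association with the component, not its strength.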
A scree plot can be used to determine the number of principal components to retain.
A biplot visualizes the similarities and dissimilarities between samples and shows the impact of each variable on each of the principal components.
A cos2 plot is used to determine how well each variable is represented in a given component.
Biplot combined with cos2 plot
Attributes with similar cos2 scores have similar colors.
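For readers unfamiliar with cos2, the sketch below computes it by hand. It assumes a PCA on standardized variables, uses simulated data rather than the study data, and uses base R `prcomp()` in place of the plotting functions behind the figures in this report. Under those assumptions, a variable's cos2 on a component is the squared correlation between the variable and that component.

```r
set.seed(42)

# Simulated data: three hypothetical variables, two of them correlated
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# PCA on standardized variables
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable coordinates: each loading multiplied by its component's standard deviation
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# cos2 is the squared coordinate; across all components each row sums to 1
cos2 <- var_coord^2
round(cos2, 3)
rowSums(cos2)

# Quality of representation on the first two components only
rowSums(cos2[, 1:2])
```

A variable whose cos2 on the first two components is close to 1 is well represented in a two-dimensional biplot; a value near 0 means the biplot says little about that variable.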
Takeaways:
We will use the poLCA package to run latent class analysis (LCA). This method classifies students into latent groups based on the same six characteristics (anxiety and/or depression, food insecurity, housing instability, no health insurance, unable to get medical care when needed, and no regular healthcare provider).
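To illustrate how LCA classifies an individual, the sketch below applies Bayes' rule to a toy model with two classes and three yes/no indicators. All numbers and names (`class_share`, `p_yes`, `item_a`, and so on) are hypothetical and are not estimates from this study.

```r
# Hypothetical class shares (prior probability of each latent class)
class_share <- c(class1 = 0.6, class2 = 0.4)

# Hypothetical probability of answering "yes" to each indicator, by class
p_yes <- rbind(
  class1 = c(item_a = 0.1, item_b = 0.2, item_c = 0.1),
  class2 = c(item_a = 0.7, item_b = 0.8, item_c = 0.6)
)

# One student's observed responses (1 = yes, 0 = no)
y <- c(item_a = 1, item_b = 1, item_c = 0)

# Likelihood of this response pattern within each class, assuming the
# indicators are independent given class membership (local independence)
lik <- apply(p_yes, 1, function(p) prod(p^y * (1 - p)^(1 - y)))

# Bayes' rule: posterior is proportional to class share times likelihood
posterior <- class_share * lik / sum(class_share * lik)
round(posterior, 3)

# Modal assignment: the class with the highest posterior probability
names(which.max(posterior))
```

In a fitted model, the class shares and item probabilities are estimated from the data rather than supplied by hand, but the assignment step follows the same logic.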
Load poLCA package and dataset
library(poLCA)
## Loading required package: scatterplot3d
## Loading required package: MASS
bdp <- read.delim("C:\\Users\\SPH User\\Downloads\\lca_data_12.txt")
View(bdp)
Define our LCA model
f <- cbind(Anx_or_depression, housing_instability, food_insecurity, No_Reg_Provider, Needed_medical_care, No_Health_Insurance) ~ 1
We tried models with different numbers of classes to determine the best fit (code not shown).
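A comparison of this kind might look like the sketch below. It is not the code used for this report: it fits models to the `values` example dataset that ships with poLCA, and the names `f_demo`, `fits`, and `fit_table` are illustrative. Lower AIC and BIC values indicate a better balance between fit and model complexity.

```r
library(poLCA)

# Example data shipped with poLCA (four dichotomous items A to D), not the study data
data(values)
f_demo <- cbind(A, B, C, D) ~ 1

set.seed(123)  # poLCA uses random starting values
n_classes <- 1:3

# Fit one model per candidate number of classes;
# nrep restarts the estimation several times to guard against local maxima
fits <- lapply(n_classes, function(k) {
  poLCA(f_demo, data = values, nclass = k, nrep = 5,
        graphs = FALSE, verbose = FALSE)
})

# Collect the log-likelihood and information criteria for each model
fit_table <- data.frame(
  nclass = n_classes,
  logLik = sapply(fits, function(m) m$llik),
  AIC    = sapply(fits, function(m) m$aic),
  BIC    = sapply(fits, function(m) m$bic)
)
fit_table

# Number of classes with the lowest BIC
fit_table$nclass[which.min(fit_table$BIC)]
```

Information criteria are only one input to the choice; the interpretability and size of the resulting classes usually matter as well.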
Model 2, with three classes, was the best fit.
M2 <- poLCA(f, data=bdp, nclass=3, graphs=TRUE, na.rm=TRUE)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $Anx_or_depression
## Pr(1) Pr(2)
## class 1: 0.1968 0.8032
## class 2: 0.4360 0.5640
## class 3: 0.4730 0.5270
##
## $housing_instability
## Pr(1) Pr(2)
## class 1: 0.2430 0.7570
## class 2: 0.7486 0.2514
## class 3: 0.6458 0.3542
##
## $food_insecurity
## Pr(1) Pr(2)
## class 1: 0.0985 0.9015
## class 2: 0.8015 0.1985
## class 3: 0.6136 0.3864
##
## $No_Reg_Provider
## Pr(1) Pr(2)
## class 1: 0.1152 0.8848
## class 2: 0.0604 0.9396
## class 3: 0.9475 0.0525
##
## $Needed_medical_care
## Pr(1) Pr(2)
## class 1: 0.0430 0.9570
## class 2: 0.2220 0.7780
## class 3: 0.3989 0.6011
##
## $No_Health_Insurance
## Pr(1) Pr(2) Pr(3)
## class 1: 0.0263 0.9139 0.0598
## class 2: 0.0210 0.9528 0.0262
## class 3: 0.2813 0.5082 0.2105
##
## Estimated class population shares
## 0.4867 0.3852 0.1281
##
## Predicted class memberships (by modal posterior prob.)
## 0.4779 0.3778 0.1443
##
## =========================================================
## Fit for 3 latent classes:
## =========================================================
## number of observations: 1739
## number of estimated parameters: 23
## residual degrees of freedom: 72
## maximum log-likelihood: -5627.731
##
## AIC(3): 11301.46
## BIC(3): 11427.07
## G^2(3): 142.7512 (Likelihood ratio/deviance statistic)
## X^2(3): 148.4684 (Chi-square goodness of fit)
##
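The fit statistics in this output can be reproduced by hand from the log-likelihood, the number of observations, and the number of response categories per indicator. The sketch below uses the standard AIC and BIC formulas with values copied from the output; the variable names are illustrative.

```r
# Values copied from the poLCA output above
loglik  <- -5627.731
n_obs   <- 1739
n_class <- 3

# Response categories per indicator: five two-category items and one
# three-category item (No_Health_Insurance shows Pr(1), Pr(2), Pr(3))
n_cat <- c(2, 2, 2, 2, 2, 3)

# Free parameters: (classes - 1) class shares, plus
# (categories - 1) response probabilities per item per class
n_par <- (n_class - 1) + n_class * sum(n_cat - 1)
n_par  # 23

# Residual degrees of freedom: possible response patterns minus 1, minus parameters
resid_df <- prod(n_cat) - 1 - n_par
resid_df  # 72

# Information criteria
aic <- -2 * loglik + 2 * n_par
bic <- -2 * loglik + n_par * log(n_obs)
round(c(AIC = aic, BIC = bic), 2)
```

The results agree with the reported 23 parameters, 72 residual degrees of freedom, AIC(3), and BIC(3).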