Principal Component Analysis (PCA)

PCA is a statistical technique for analyzing high-dimensional data and capturing the most important information in it. It does this by transforming the original data into a lower-dimensional space in which highly correlated variables are combined into the same components. In our scenario, we picked six indicators among students: mental health (anxiety and/or depression), food insecurity, housing instability, no health insurance, no regular healthcare provider, and needed medical care but did not get it.

Load Libraries

library('corrr')
library('FactoMineR')
library('factoextra')
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library('ggcorrplot')
library('ggfortify')
library('rmarkdown')

Load & normalize the data

BDP <- read.csv("C:\\Users\\SPH User\\Downloads\\PCAdata_1_12.csv")

colSums(is.na(BDP))
##              STUDY_ID               Anxious             Depressed 
##                     2                    25                    24 
## Anxiety_or_Depression   Housing_Instability       Food_Insecurity 
##                    25                    44                    75 
##       No_Reg_Provider       No_Medical_Care   No_Health_Insurance 
##                     5                     5                     2
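# Drop every row that has a missing value in any column
bdp_clean <- na.omit(BDP)

# Keep the six indicator columns (columns 4 to 9: Anxiety_or_Depression through No_Health_Insurance)
numerical_data <- bdp_clean[,4:9]

# Center each column to mean 0 and scale it to standard deviation 1
data_normalized <- scale(numerical_data)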

Plot correlation heat map

Positive correlations: dark peach = strongly correlated, medium peach = moderately correlated, light peach = weakly correlated.
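The code that drew the heat map is not shown. A minimal sketch of this step, assuming `ggcorrplot()` from the ggcorrplot package loaded earlier; the simulated data frame `sim` is a hypothetical stand-in for `numerical_data`, so the colors and values it produces say nothing about the study data.

```r
library(ggcorrplot)

# Hypothetical stand-in for numerical_data: six simulated 0/1 indicators
set.seed(1)
sim <- as.data.frame(matrix(rbinom(300 * 6, 1, 0.4), ncol = 6))

# Pairwise correlations of the standardized indicators
corr_matrix <- cor(scale(sim))

# Heat map of the correlations, with the coefficient printed in each cell
heat_map <- ggcorrplot(corr_matrix, lab = TRUE)
print(heat_map)
```

Printing the coefficients (`lab = TRUE`) makes it easier to sort variable pairs into the strong, moderate, and weak groups than reading shades alone.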

Strongly correlated

Moderately correlated

Weakly correlated:

Compute principal components

Each component explains a percentage of the total variance in the data set, and the principal components are uncorrelated with one another.
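The call that produced the output below is not shown. A minimal sketch of this step, assuming base R's `princomp()` is applied to the standardized observations; `sim`, `data_sim`, and `pca_sim` are hypothetical names, and the data are simulated. Note that `princomp()` treats whatever matrix it receives as rows of observations, so a 6x6 correlation matrix passed in place of the data would be read as six observations and could leave the last component with near-zero variance. The sketch therefore passes the observations themselves.

```r
# Hypothetical stand-in for numerical_data: six simulated 0/1 indicators
set.seed(1)
sim <- as.data.frame(matrix(rbinom(300 * 6, 1, 0.4), ncol = 6))

# Standardize, then run PCA on the observations (rows = students)
data_sim <- scale(sim)
pca_sim <- princomp(data_sim)

# Standard deviation, proportion of variance and cumulative proportion
summary(pca_sim)

# Component scores are uncorrelated: off-diagonal correlations are ~0
score_cor <- cor(pca_sim$scores)
round(score_cor, 10)
```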

## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     0.4568301 0.4038822 0.3643973 0.3224126 0.25990562
## Proportion of Variance 0.3086725 0.2412670 0.1963989 0.1537491 0.09991252
## Cumulative Proportion  0.3086725 0.5499395 0.7463384 0.9000875 1.00000000
##                              Comp.6
## Standard deviation     1.747316e-08
## Proportion of Variance 4.515767e-16
## Cumulative Proportion  1.000000e+00

The first two principal components explain just 55% of the total variance in the dataset. Four components are required to explain 90% of the variance across the six variables, which is not ideal. Thus, two principal components alone do not represent the data adequately.
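The proportions in the summary output can be reproduced from the component standard deviations. The sketch below copies the standard deviations printed above; the variable names are illustrative only.

```r
# Component standard deviations copied from the summary output
sdev <- c(0.4568301, 0.4038822, 0.3643973, 0.3224126, 0.25990562, 1.747316e-08)

# Each component's variance is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var <- cumsum(prop_var)

round(prop_var, 4)
round(cum_var, 4)

# Smallest number of components whose cumulative proportion reaches 90%
n_for_90 <- which(cum_var >= 0.90)[1]
n_for_90
```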

Generate loading matrix

Shows how the principal components relate to each variable.

##                              Comp.1      Comp.2
## Anxiety_or_Depression  1.901091e-01  0.30264654
## Housing_Instability    5.194424e-01 -0.21590451
## Food_Insecurity        5.621210e-01 -0.09976055
## No_Reg_Provider       -3.967742e-01  0.53582719
## No_Medical_Care       -5.881487e-05  0.30156646
## No_Health_Insurance   -4.697110e-01 -0.68832045
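One way to read the loading matrix is to look for the variable with the largest absolute loading on each component. The sketch below retypes the loadings printed above into a matrix (the name `loadings_12` is illustrative) and also checks that each loading vector has unit length, as PCA loadings should.

```r
# Loadings for the first two components, copied from the output above
loadings_12 <- matrix(
  c( 1.901091e-01,  0.30264654,
     5.194424e-01, -0.21590451,
     5.621210e-01, -0.09976055,
    -3.967742e-01,  0.53582719,
    -5.881487e-05,  0.30156646,
    -4.697110e-01, -0.68832045),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Anxiety_or_Depression", "Housing_Instability", "Food_Insecurity",
      "No_Reg_Provider", "No_Medical_Care", "No_Health_Insurance"),
    c("Comp.1", "Comp.2")
  )
)

# Squared loadings of each component sum to 1 (unit-length vectors)
col_norms <- colSums(loadings_12^2)

# Variable with the largest absolute loading on each component
top_variable <- rownames(loadings_12)[apply(abs(loadings_12), 2, which.max)]
top_variable
```

The sign of a loading gives the direction of the relationship, so variables with opposite signs on the same component move in opposite directions along it.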

Generate Scree Plot

A scree plot can be used to determine the number of principal components to retain.
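The plotting code is not shown. A minimal sketch, assuming `fviz_eig()` from the factoextra package loaded earlier; the simulated data and the names `sim` and `pca_sim` are hypothetical.

```r
library(factoextra)

# Hypothetical stand-in for numerical_data: six simulated 0/1 indicators
set.seed(1)
sim <- as.data.frame(matrix(rbinom(300 * 6, 1, 0.4), ncol = 6))
pca_sim <- princomp(scale(sim))

# Scree plot: percentage of variance explained by each component
scree <- fviz_eig(pca_sim, addlabels = TRUE)
print(scree)
```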

Generate Biplot

To visualize the similarities and dissimilarities between samples and the impact of each variable on each of the principal components.

Cos2 Plot

To determine how well each variable is represented in a given component. This quality of representation is called cos2 (squared cosine).
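For PCA on standardized variables, the cos2 of a variable on a component equals the squared correlation between that variable and the component scores. The base R sketch below illustrates that definition on simulated data; it is not necessarily how the plot in this report was produced, and all names are hypothetical.

```r
# Hypothetical stand-in for numerical_data: six simulated 0/1 indicators
set.seed(1)
sim <- as.data.frame(matrix(rbinom(300 * 6, 1, 0.4), ncol = 6))
data_sim <- scale(sim)
pca_sim <- princomp(data_sim)

# cos2: squared correlation between each variable and each component
cos2 <- cor(data_sim, pca_sim$scores)^2
round(cos2[, 1:2], 3)

# Across all components, each variable's cos2 values sum to 1
cos2_total <- rowSums(cos2)
cos2_total
```

A variable with a low cos2 on the first two components is poorly represented in any two-dimensional plot of those components.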

Biplot combined with Cos2 Plot

Attributes with similar cos2 scores have similar colors.
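The plotting code is not shown. A minimal sketch, assuming `fviz_pca_var()` from factoextra with variables colored by cos2; the color gradient, the simulated data, and the names `sim` and `pca_sim` are assumptions for illustration.

```r
library(factoextra)

# Hypothetical stand-in for numerical_data: six simulated 0/1 indicators
set.seed(1)
sim <- as.data.frame(matrix(rbinom(300 * 6, 1, 0.4), ncol = 6))
pca_sim <- princomp(scale(sim))

# Variable plot on the first two components, colored by cos2
var_plot <- fviz_pca_var(
  pca_sim,
  col.var = "cos2",
  gradient.cols = c("black", "orange", "green"),
  repel = TRUE  # keep variable labels from overlapping
)
print(var_plot)
```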

Takeaways:

Latent Class Analysis

We will use the poLCA package to run LCA. This method will classify students into latent groups based on the same six characteristics (anxiety and/or depression, food insecurity, housing instability, no health insurance, unable to get medical care when needed, and no regular healthcare provider).

Load poLCA package and dataset

library(poLCA)
## Loading required package: scatterplot3d
## Loading required package: MASS
bdp <- read.delim("C:\\Users\\SPH User\\Downloads\\lca_data_12.txt")
View(bdp)

Define our LCA model

f <- cbind(Anx_or_depression, housing_instability, food_insecurity, No_Reg_Provider, Needed_medical_care, No_Health_Insurance) ~ 1

Try out different models to determine best fit (code not shown).
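As a rough illustration of what such a comparison might look like, the sketch below fits models with two to four classes and tabulates their AIC and BIC. The simulated data, item names, and the choice of `nrep = 5` are assumptions; the actual comparison used for this report is not shown.

```r
library(poLCA)

# Hypothetical stand-in for bdp: six items coded 1/2, as poLCA requires
set.seed(123)
n <- 500
sim <- as.data.frame(matrix(sample(1:2, n * 6, replace = TRUE), ncol = 6))
names(sim) <- paste0("item", 1:6)
f_sim <- cbind(item1, item2, item3, item4, item5, item6) ~ 1

# Fit 2-, 3- and 4-class models; nrep restarts reduce the risk of local maxima
fits <- lapply(2:4, function(k) {
  poLCA(f_sim, data = sim, nclass = k, nrep = 5, maxiter = 3000,
        verbose = FALSE, graphs = FALSE)
})

# Lower AIC/BIC indicates a better balance of fit and complexity
comparison <- data.frame(
  nclass = 2:4,
  AIC = sapply(fits, function(m) m$aic),
  BIC = sapply(fits, function(m) m$bic)
)
comparison
```

Information criteria are usually weighed alongside class sizes and interpretability rather than used as the only criterion.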

Model 2, with three classes, was the best fit.

M2 <- poLCA(f, data=bdp, nclass=3, graphs=TRUE, na.rm=TRUE)

## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $Anx_or_depression
##            Pr(1)  Pr(2)
## class 1:  0.1968 0.8032
## class 2:  0.4360 0.5640
## class 3:  0.4730 0.5270
## 
## $housing_instability
##            Pr(1)  Pr(2)
## class 1:  0.2430 0.7570
## class 2:  0.7486 0.2514
## class 3:  0.6458 0.3542
## 
## $food_insecurity
##            Pr(1)  Pr(2)
## class 1:  0.0985 0.9015
## class 2:  0.8015 0.1985
## class 3:  0.6136 0.3864
## 
## $No_Reg_Provider
##            Pr(1)  Pr(2)
## class 1:  0.1152 0.8848
## class 2:  0.0604 0.9396
## class 3:  0.9475 0.0525
## 
## $Needed_medical_care
##            Pr(1)  Pr(2)
## class 1:  0.0430 0.9570
## class 2:  0.2220 0.7780
## class 3:  0.3989 0.6011
## 
## $No_Health_Insurance
##            Pr(1)  Pr(2)  Pr(3)
## class 1:  0.0263 0.9139 0.0598
## class 2:  0.0210 0.9528 0.0262
## class 3:  0.2813 0.5082 0.2105
## 
## Estimated class population shares 
##  0.4867 0.3852 0.1281 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.4779 0.3778 0.1443 
##  
## ========================================================= 
## Fit for 3 latent classes: 
## ========================================================= 
## number of observations: 1739 
## number of estimated parameters: 23 
## residual degrees of freedom: 72 
## maximum log-likelihood: -5627.731 
##  
## AIC(3): 11301.46
## BIC(3): 11427.07
## G^2(3): 142.7512 (Likelihood ratio/deviance statistic) 
## X^2(3): 148.4684 (Chi-square goodness of fit) 
## 
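The fit statistics above can be checked by hand from the number of observations, the log-likelihood, and the number of response categories per item (five items with two categories and one, `No_Health_Insurance`, with three). The sketch below uses the standard AIC and BIC formulas; the variable names are illustrative.

```r
n_obs   <- 1739       # number of observations, from the output above
loglik  <- -5627.731  # maximum log-likelihood, from the output above
n_class <- 3

# Response categories per item: five binary items, one with three categories
categories <- c(2, 2, 2, 2, 2, 3)

# Free parameters: (categories - 1) per item per class, plus (classes - 1) shares
n_par <- n_class * sum(categories - 1) + (n_class - 1)

# Residual df: distinct response patterns minus 1, minus free parameters
resid_df <- prod(categories) - 1 - n_par

aic <- -2 * loglik + 2 * n_par
bic <- -2 * loglik + log(n_obs) * n_par

c(n_par = n_par, resid_df = resid_df, AIC = round(aic, 2), BIC = round(bic, 2))
```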

Summary of Results: