Principal Component & Latent Class Analyses

Principal Component Analysis (PCA)

PCA is a statistical technique for analyzing high-dimensional data and capturing the most important information in it. It does this by transforming the original data into a lower-dimensional space in which highly correlated variables are combined. In our scenario, we picked six indicators among students: mental health (anxiety and/or depression), food insecurity, housing instability, no health insurance, no regular healthcare provider, and needing medical care but not getting it.

Load Libraries

library('corrr')
library('FactoMineR')
library('factoextra')

## Loading required package: ggplot2

library('ggcorrplot')
library('ggfortify')
library('rmarkdown')

Load & normalize the data

BDP <- read.csv("C:\\Users\\SPH User\\Downloads\\PCAdata_1_12.csv")

colSums(is.na(BDP))

##              STUDY_ID               Anxious             Depressed 
##                     2                    25                    24 
## Anxiety_or_Depression   Housing_Instability       Food_Insecurity 
##                    25                    44                    75 
##       No_Reg_Provider       No_Medical_Care   No_Health_Insurance 
##                     5                     5                     2

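# Drop every row with a missing value in any column, including STUDY_ID, Anxious, and Depressed
bdp_clean <- na.omit(BDP)

# Keep the six indicator columns (Anxiety_or_Depression through No_Health_Insurance)
numerical_data <- bdp_clean[,4:9]

# Standardize each indicator to mean 0 and standard deviation 1
data_normalized <- scale(numerical_data)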
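As a minimal sketch of what the normalization step does, the example below applies scale() to a small made-up data frame and compares it with the same calculation written out by hand. The toy values and the names toy, z_scale, and z_manual are hypothetical and are not part of the study data.

```r
# Toy data standing in for two of the indicator columns (hypothetical values).
toy <- data.frame(
  Housing_Instability = c(0, 1, 1, 0, 1),
  Food_Insecurity     = c(0, 1, 0, 0, 1)
)

# scale() centers each column on its mean and divides by its standard deviation.
z_scale <- scale(toy)

# The same transformation written out by hand.
z_manual <- sapply(toy, function(col) (col - mean(col)) / sd(col))

# Both versions agree; every column now has mean 0 and standard deviation 1.
all.equal(as.numeric(z_scale), as.numeric(z_manual))
round(colMeans(z_scale), 10)
apply(z_scale, 2, sd)
```

Standardizing puts every indicator on the same footing, so no single variable dominates the components simply because it has a larger variance.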

Plot correlation heat map

Positive correlations: dark peach = strongly correlated, medium peach = moderately correlated, light peach = weakly correlated.

Strongly correlated

Unstable housing and food insecurity

Moderately correlated

Needed medical care but didn’t get it & 1) no regular provider, 2) food insecurity, 3) unstable housing, and 4) anxiety and/or depression
Anxiety and/or depression & food insecurity

Weakly correlated

Anxiety and/or depression & Unstable housing
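The code behind the heat map is not shown. The following is a minimal sketch of one way to compute and draw such a correlation matrix, assuming Pearson correlations and the ggcorrplot package loaded earlier. The simulated indicators, the seed, and the names toy, corr_matrix, and heat_map are hypothetical.

```r
set.seed(1)  # reproducible toy data

# Hypothetical 0/1 indicators; food insecurity is built to track housing instability.
n <- 200
housing <- rbinom(n, 1, 0.4)
food    <- ifelse(runif(n) < 0.8, housing, 1 - housing)
anxiety <- rbinom(n, 1, 0.4)
toy <- data.frame(Housing_Instability   = housing,
                  Food_Insecurity       = food,
                  Anxiety_or_Depression = anxiety)

# Pairwise Pearson correlations; the diagonal is always 1.
# Standardizing the columns first would not change these values.
corr_matrix <- cor(toy)
round(corr_matrix, 2)

# Build the heat map only if ggcorrplot is installed; print(heat_map) displays it.
if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  heat_map <- ggcorrplot::ggcorrplot(corr_matrix)
}
```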

Compute principal components

Each component explains a percentage of the total variance in the data set, and each principal component is uncorrelated with every other principal component.
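The call that produced the summary that follows is not shown. As a hedged sketch, the example below runs princomp() on simulated standardized data, with rows as observations, and checks that the component scores are uncorrelated. The data, the seed, and the names toy and toy_pca are hypothetical. The sixth component in the report's output has essentially zero variance, which would be expected if the input had only five independent directions (for example, a centered 6x6 correlation matrix rather than the row-level data). That reading is a guess, since the call is not shown.

```r
set.seed(42)  # reproducible toy data

# Hypothetical stand-in for data_normalized: 300 rows, 4 standardized columns.
n <- 300
shared <- rnorm(n)
toy <- scale(cbind(
  A = shared + rnorm(n, sd = 0.5),
  B = shared + rnorm(n, sd = 0.5),
  C = rnorm(n),
  D = rnorm(n)
))

# Fit PCA on the observations themselves (rows = cases, columns = indicators).
toy_pca <- princomp(toy)
summary(toy_pca)

# Component scores are mutually uncorrelated: off-diagonal entries are 0.
round(cor(toy_pca$scores), 10)
```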

## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     0.4568301 0.4038822 0.3643973 0.3224126 0.25990562
## Proportion of Variance 0.3086725 0.2412670 0.1963989 0.1537491 0.09991252
## Cumulative Proportion  0.3086725 0.5499395 0.7463384 0.9000875 1.00000000
##                              Comp.6
## Standard deviation     1.747316e-08
## Proportion of Variance 4.515767e-16
## Cumulative Proportion  1.000000e+00

The first two principal components explain just 55% of the total variance in the dataset. Four components are required to capture 90% of the variance in the set of six variables (not ideal). Thus, PCA was unable to accurately represent the data with just two principal components.
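The percentages quoted above can be recomputed from the standard deviations in the output. The sketch below copies those six values and derives the proportion and cumulative proportion of variance. The helper components_needed is a hypothetical name introduced only for this illustration.

```r
# Standard deviations copied from the "Importance of components" output above.
sdev <- c(0.4568301, 0.4038822, 0.3643973, 0.3224126, 0.25990562, 1.747316e-08)

# Each component's share of the variance is its variance over the total variance.
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)
round(rbind(prop_var, cum_var), 4)

# Smallest number of components whose cumulative share reaches a target.
components_needed <- function(target) which(cum_var >= target)[1]
components_needed(0.90)  # 4, matching the text
```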

Generate loading matrix

Shows how the principal components relate to each variable.

##                              Comp.1      Comp.2
## Anxiety_or_Depression  1.901091e-01  0.30264654
## Housing_Instability    5.194424e-01 -0.21590451
## Food_Insecurity        5.621210e-01 -0.09976055
## No_Reg_Provider       -3.967742e-01  0.53582719
## No_Medical_Care       -5.881487e-05  0.30156646
## No_Health_Insurance   -4.697110e-01 -0.68832045
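To make the role of the loadings concrete, the sketch below extracts the first two loading columns from a princomp() fit on simulated data. It checks that each column is a unit-length weight vector, and that the component scores are the centered data multiplied by those weights. The data, the seed, and the names toy, toy_pca, load_mat, and manual_scores are hypothetical.

```r
set.seed(7)  # reproducible toy data

# Hypothetical standardized data with three columns.
toy <- scale(matrix(rnorm(200 * 3), ncol = 3,
                    dimnames = list(NULL, c("A", "B", "C"))))
toy_pca <- princomp(toy)

# Loadings for the first two components, as a plain matrix.
load_mat <- unclass(toy_pca$loadings)[, 1:2]
load_mat

# Each loading column has squared length 1.
colSums(load_mat^2)

# Scores are the centered data multiplied by the loadings.
manual_scores <- sweep(toy, 2, toy_pca$center) %*% load_mat
all.equal(unname(manual_scores), unname(toy_pca$scores[, 1:2]))
```

A loading near zero, such as the Comp.1 value for No_Medical_Care above, means that variable contributes almost nothing to that component's score.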

Generate Scree Plot

The scree plot can be used to determine the number of principal components to retain.

Generate Biplot

The biplot visualizes the similarities and dissimilarities between samples and the impact of each variable on each of the principal components.

Variables that are grouped together are positively correlated with one another.
The greater the distance between a variable and the origin, the better that variable is represented.
Variables displayed on opposite sides of the origin are negatively correlated.
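The reading rules above follow from how the variable arrows are built. As a sketch under the assumption that the PCA is run on standardized variables, the example below computes arrow coordinates by hand from simulated data. It shows that their dot products across all components reproduce the correlation matrix; a two-dimensional biplot approximates this. The data, the seed, and the names toy, toy_pca, and coord are hypothetical.

```r
set.seed(3)  # reproducible toy data

n <- 500
shared <- rnorm(n)
toy <- cbind(
  A = shared + rnorm(n, sd = 0.6),   # A and B move together
  B = shared + rnorm(n, sd = 0.6),
  C = -shared + rnorm(n, sd = 0.6)   # C moves against them
)

# PCA on the correlation matrix (equivalent to standardizing first).
toy_pca <- princomp(toy, cor = TRUE)

# Arrow coordinates: loadings scaled by each component's standard deviation.
coord <- sweep(unclass(toy_pca$loadings), 2, toy_pca$sdev, "*")

# Across all components, dot products of arrows reproduce the correlations.
all.equal(unname(coord %*% t(coord)), unname(cor(toy)))

# On the first axis, A and B share a sign and C has the opposite one.
sign(coord[, 1])
```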

Cos2 Plot

To determine how well each variable is represented by a given component (the quality of representation, cos2).

A low cos2 value means the variable is not well represented by the principal component (Needed medical care).
A high cos2 value means the variable is well represented by that principal component (No health insurance).
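For reference, cos2 can be computed by hand. The sketch below assumes a PCA on standardized variables, in which case cos2 is the squared correlation between a variable and a component and sums to 1 across all components for each variable. The simulated data, the seed, and the names toy, toy_pca, coord, and cos2 are hypothetical. The factoextra function used for the report's plot is not shown, so this is an illustration of the quantity rather than of that function.

```r
set.seed(11)  # reproducible toy data

n <- 400
shared <- rnorm(n)
toy <- cbind(
  A = shared + rnorm(n, sd = 0.5),
  B = shared + rnorm(n, sd = 0.5),
  C = rnorm(n)                       # unrelated to A and B
)
toy_pca <- princomp(toy, cor = TRUE)

# Correlation between each variable and each component.
coord <- sweep(unclass(toy_pca$loadings), 2, toy_pca$sdev, "*")

# cos2 is the squared coordinate; per variable it sums to 1 over all components.
cos2 <- coord^2
round(cos2, 3)
rowSums(cos2)

# Quality of representation on the first two components only.
rowSums(cos2[, 1:2])
```

A variable whose cos2 on the first two components is well below 1 has most of its variance on later components, so its position in a two-dimensional biplot should be read with caution.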

Biplot combined with Cos2 Plot

Attributes with similar cos2 scores have similar colors.

Takeaways:

Food insecurity and housing instability are positively correlated.
Not having health insurance is not correlated with the various needs.
Not having a regular healthcare provider is positively correlated with not being able to get medical care when it is needed.
Not being able to get medical care when it is needed also tracks with anxiety and/or depression.
No medical care and anxiety and/or depression are not well represented by the first two principal components.

Latent Class Analysis

We will use the poLCA package to run LCA. This method will classify students into latent groups based on the same six characteristics (anxiety and/or depression, food insecurity, housing instability, no health insurance, unable to get medical care when it was needed, and no regular healthcare provider).

Load poLCA package and dataset

library(poLCA)

## Loading required package: scatterplot3d

## Loading required package: MASS

bdp <- read.delim("C:\\Users\\SPH User\\Downloads\\lca_data_12.txt")
View(bdp)

Define our LCA model

f <- cbind(Anx_or_depression, housing_instability, food_insecurity, No_Reg_Provider, Needed_medical_care, No_Health_Insurance) ~ 1

Try out different models to determine best fit (code not shown).

Model 2, with three classes, was the best fit.
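Since the comparison code is not shown, the following is only a sketch of how such a comparison might look: fit models with different numbers of classes and compare AIC and BIC. It uses simulated data rather than the study data. The seed, the item names, the helper draw_item, and the choice of two to four classes with five random restarts are all assumptions made for illustration.

```r
if (requireNamespace("poLCA", quietly = TRUE)) {
  library(poLCA)
  set.seed(123)  # reproducible toy data

  # Hypothetical data: two hidden groups answering six yes/no items.
  # poLCA expects categories coded as positive integers (1, 2, ...).
  n <- 500
  group <- rbinom(n, 1, 0.4)
  draw_item <- function(p_low, p_high) {
    rbinom(n, 1, ifelse(group == 1, p_high, p_low)) + 1
  }
  toy <- data.frame(
    item1 = draw_item(0.1, 0.8), item2 = draw_item(0.2, 0.7),
    item3 = draw_item(0.1, 0.9), item4 = draw_item(0.3, 0.8),
    item5 = draw_item(0.2, 0.6), item6 = draw_item(0.1, 0.7)
  )
  toy_formula <- cbind(item1, item2, item3, item4, item5, item6) ~ 1

  # Fit 2- to 4-class models; nrep restarts guard against local maxima.
  class_counts <- 2:4
  fits <- lapply(class_counts, function(k) {
    poLCA(toy_formula, data = toy, nclass = k,
          nrep = 5, verbose = FALSE, graphs = FALSE)
  })

  # Lower AIC/BIC indicates a better trade-off between fit and complexity.
  fit_table <- data.frame(
    nclass = class_counts,
    AIC = sapply(fits, function(m) m$aic),
    BIC = sapply(fits, function(m) m$bic)
  )
  print(fit_table)
}
```

Information criteria are one input to the choice. Whether the resulting classes are interpretable and large enough to be useful usually matters as well.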

M2 <- poLCA(f, data=bdp, nclass=3, graphs=TRUE, na.rm=TRUE)

## Conditional item response (column) probabilities,
##  by outcome variable, for each class (row) 
##  
## $Anx_or_depression
##            Pr(1)  Pr(2)
## class 1:  0.1968 0.8032
## class 2:  0.4360 0.5640
## class 3:  0.4730 0.5270
## 
## $housing_instability
##            Pr(1)  Pr(2)
## class 1:  0.2430 0.7570
## class 2:  0.7486 0.2514
## class 3:  0.6458 0.3542
## 
## $food_insecurity
##            Pr(1)  Pr(2)
## class 1:  0.0985 0.9015
## class 2:  0.8015 0.1985
## class 3:  0.6136 0.3864
## 
## $No_Reg_Provider
##            Pr(1)  Pr(2)
## class 1:  0.1152 0.8848
## class 2:  0.0604 0.9396
## class 3:  0.9475 0.0525
## 
## $Needed_medical_care
##            Pr(1)  Pr(2)
## class 1:  0.0430 0.9570
## class 2:  0.2220 0.7780
## class 3:  0.3989 0.6011
## 
## $No_Health_Insurance
##            Pr(1)  Pr(2)  Pr(3)
## class 1:  0.0263 0.9139 0.0598
## class 2:  0.0210 0.9528 0.0262
## class 3:  0.2813 0.5082 0.2105
## 
## Estimated class population shares 
##  0.4867 0.3852 0.1281 
##  
## Predicted class memberships (by modal posterior prob.) 
##  0.4779 0.3778 0.1443 
##  
## ========================================================= 
## Fit for 3 latent classes: 
## ========================================================= 
## number of observations: 1739 
## number of estimated parameters: 23 
## residual degrees of freedom: 72 
## maximum log-likelihood: -5627.731 
##  
## AIC(3): 11301.46
## BIC(3): 11427.07
## G^2(3): 142.7512 (Likelihood ratio/deviance statistic) 
## X^2(3): 148.4684 (Chi-square goodness of fit) 
##
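The fit statistics in the output above can be checked by hand. The sketch below copies the number of observations and the log-likelihood from the output, reads the number of response categories per item from the Pr(1), Pr(2), Pr(3) columns, and recomputes the parameter count, residual degrees of freedom, AIC, and BIC. The variable names are hypothetical.

```r
# Values copied from the 3-class fit summary above.
n_obs   <- 1739
log_lik <- -5627.731
n_class <- 3

# Response categories per item: five two-category items and one
# three-category item (No_Health_Insurance).
n_categories <- c(2, 2, 2, 2, 2, 3)

# Free parameters: (categories - 1) per item per class, plus (classes - 1) shares.
n_par <- n_class * sum(n_categories - 1) + (n_class - 1)

# Residual df: distinct response patterns minus parameters minus 1
# (valid here because there are far more observations than patterns).
resid_df <- prod(n_categories) - n_par - 1

aic <- -2 * log_lik + 2 * n_par
bic <- -2 * log_lik + log(n_obs) * n_par

c(parameters = n_par, residual_df = resid_df)
round(c(AIC = aic, BIC = bic), 2)
```

The recomputed values agree with the reported 23 parameters, 72 residual degrees of freedom, AIC of 11301.46, and BIC of 11427.07.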

Summary of Results:

48.7% of the student population (class 1) has a relatively low probability of mental health, healthcare, housing or food security needs (ranging from 3% to 24%). The highest needs in this group are housing (24%) and anxiety and/or depression (20%).
38.5% of the student population (class 2) has a high probability of housing instability (75%) and food insecurity (80%), with a 44% probability of anxiety and/or depression. This group has a 22% probability of needing medical care but not getting it and a low probability of not having health insurance (2%) and not having a regular healthcare provider (6%).
12.8% of the student population (class 3) has a very high probability of not having a regular healthcare provider (95%). This group has a 28% probability of not having health insurance and a 40% probability of needing medical care but not getting it. This group also has relatively high levels of housing instability (65%), food insecurity (61%), and anxiety and/or depression (47%).