This example illustrates the use of Principal Components Analysis to form an index of overall health using data from the 2017 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART MSA data, along with an example of calculating a place-based index of area deprivation.
Principal Components Analysis is a mathematical technique (not a statistical one!) based on the eigenvalue decomposition/singular value decomposition of a data matrix (or correlation/covariance matrix). This technique re-represents a complicated data set (a set of variables) in a simpler form, where the information in the original variables is now represented by fewer components, which capture the majority (we hope!!) of the variance in the original data.
This is known as a variable reduction strategy. It is also commonly used to form indices in the behavioral/social sciences. It is closely related to factor analysis, and indeed the initial solution of factor analysis is often exactly the same as the PCA solution. PCA preserves the orthogonal nature of the components, while factor analysis can relax this assumption.
If we have a series of variables, called \(X\), with \(X = \left[ x_1, x_2, \ldots, x_k \right]\), and we have the Pearson correlation matrix between these variables, \(R\), with
\[ R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1k} \\ r_{21} & r_{22} & \cdots & r_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & \cdots & r_{kk} \end{bmatrix}\]
The eigenvalue decomposition of \(R\) satisfies \[R a = \lambda a\]
where the \(\lambda\)’s are the eigenvalues of \(R\) and the \(a\)’s are the eigenvectors of \(R\).
These values are found by solving this equation for the eigenvalues and corresponding eigenvectors of \(R\):
\[( R - \lambda I )a = 0\]
Setting the determinant of \(R-\lambda I\) equal to 0 gives the characteristic equation, which has k solutions (one for each of the k original variables), so there are k eigenvalues and k accompanying eigenvectors.
The eigenvalues of R are ranked, with the first being the largest and the kth being the smallest. So, we will have \(\lambda_1 >\lambda_2 > \cdots >\lambda_k\)
Each eigenvalue accounts for a proportion of all the variation within the matrix, and we can see this proportion by:
% variance explained \(= \frac{\lambda_i}{\sum_{j=1}^{k} \lambda_j}\)
The eigenvectors a will each be orthogonal, or uncorrelated with each other, and will be linear combinations of the original k variables.
These two items form what is known as the singular value (or eigenvalue) decomposition of a square matrix, and when that matrix is a correlation matrix, this analysis is called Principal Components Analysis, or PCA. The idea is that we end up with a new set of uncorrelated variables (due to the orthogonal nature of the eigenvectors), or components, that summarize the information in the original k variables in a smaller number of components. For example, we may have had 10 variables originally that we thought measured “health”, but after doing a PCA we may have three components that summarize 75% of the information in those original 10 variables, giving us a simpler set of variables. Moreover, each component now corresponds to a unique and independent subset of the information in the original 10 variables.
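Here is a minimal sketch of these mechanics in R, using made-up variables rather than the BRFSS data used below: eigen() decomposes a correlation matrix, and the proportion of variance explained follows directly from the eigenvalues.
#A minimal sketch of the eigendecomposition behind PCA, using simulated data
set.seed(1115)
x1 <- rnorm(100)
x2 <- .7*x1 + rnorm(100, sd=.5) #correlated with x1
x3 <- rnorm(100) #mostly noise
R <- cor(cbind(x1, x2, x3)) #Pearson correlation matrix
e <- eigen(R) #eigenvalues and eigenvectors of R
e$values #lambda_1 > lambda_2 > lambda_3
e$values/sum(e$values) #proportion of variance explained by each component
e$vectors #columns are the orthogonal loading vectors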
Each new component has an eigenvector, which we call a loading vector. We can see how each variable is related to a new component by examining its loading on that component. These can be thought of in much the same way as standardized regression coefficients: large positive values indicate a positive association of that variable with the component, large negative values the opposite, and values close to 0 indicate that a particular variable doesn’t contribute to that component. There are no formal tests, and interpretation can be quite subjective.
So, this formalism is nice, but what does it boil down to in terms of research? That depends on what you’re looking for. Many people use PCA to summarize information on many variables that may be correlated, in the hope of finding a smaller subset of variables that represents some underlying latent trait they want to measure. Oftentimes, people use this method to construct an index for something. While there are lots of ways to build an index, PCA is advantageous because it acknowledges the correlation among the variables.
For example, Sharkey and Horel (2008) use this method to construct an index of socioeconomic deprivation for census tracts, and Messer et al. (2006) also use this method for a deprivation index in the context of low birth weight outcomes.
Always use z-scored variables when doing a PCA: scale matters a great deal, because we want correlations going in, not covariances.
If the first eigenvalue is less than 1, your PCA isn’t doing much for you: it isn’t finding any correlated information among your variables.
Have a look at how much variation your components summarize; in aggregate data it will tend to be more than in individual-level survey data.
Scatter plots of the first few PCs can often be informative for finding underlying groups in your data, as observations tend to cluster together in multidimensional space.
You can use the PCs in a regression model if you’ve had problems with multicollinearity among variables; this is called principal components regression (see the sketch below).
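As an illustration only, here is a minimal sketch of principal components regression with simulated, collinear predictors (the variable names and outcome are hypothetical, not from the BRFSS): the correlated predictors are replaced by their uncorrelated component scores in an ordinary regression.
#Sketch of principal components regression with simulated, collinear predictors
set.seed(1115)
n <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd=.2) #nearly collinear with x1
x3 <- rnorm(n)
y <- 1 + .5*x1 + .5*x2 + .25*x3 + rnorm(n)
pcs <- prcomp(cbind(x1, x2, x3), center=TRUE, scale.=TRUE)
#regress the outcome on the uncorrelated component scores instead of x1-x3
fit.pcr <- lm(y ~ pcs$x[, 1:2]) #keep the first two components
summary(fit.pcr)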
library(car)
library(stargazer)
library(survey)
library(ggplot2)
library(pander)
library(dplyr)
library(knitr)
load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))
#Healthy days
brfss_17$healthdays<-Recode(brfss_17$physhlth, recodes = "88=0; 77=NA; 99=NA")
#Healthy mental health days
brfss_17$healthmdays<-Recode(brfss_17$menthlth, recodes = "88=0; 77=NA; 99=NA")
brfss_17$badhealth<-Recode(brfss_17$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss_17$black<-Recode(brfss_17$racegr3, recodes="2=1; 9=NA; else=0")
brfss_17$white<-Recode(brfss_17$racegr3, recodes="1=1; 9=NA; else=0")
brfss_17$other<-Recode(brfss_17$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss_17$hispanic<-Recode(brfss_17$racegr3, recodes="5=1; 9=NA; else=0")
brfss_17$race_eth<-Recode(brfss_17$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor = T)
brfss_17$race_eth<-relevel(brfss_17$race_eth, ref = "nhwhite")
#insurance
brfss_17$ins<-Recode(brfss_17$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss_17$inc<-ifelse(brfss_17$incomg==9, NA, brfss_17$incomg)
#education level
brfss_17$educ<-Recode(brfss_17$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor=T)
brfss_17$educ<-relevel(brfss_17$educ, ref='2hsgrad')
#employment
brfss_17$employ<-Recode(brfss_17$employ1,
recodes="1:2='Employed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor=T)
brfss_17$employ<-relevel(brfss_17$employ, ref='Employed')
#marital status
brfss_17$marst<-Recode(brfss_17$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor=T)
brfss_17$marst<-relevel(brfss_17$marst, ref='married')
#Age cut into intervals
brfss_17$agec<-cut(brfss_17$age80, breaks=c(0,24,39,59,79,99))
#BMI: in the BRFSS the bmi5 variable has 2 implied decimal places,
#so we must divide by 100 to get real BMI values
brfss_17$bmi<-brfss_17$bmi5/100
#smoking currently
brfss_17$smoke<-Recode(brfss_17$smoker3,
recodes="1:2=1; 3:4=0; else=NA")
brfss_17$checkup<-Recode(brfss_17$checkup1, recodes = "1:2 = 1; 3:4=0; 8=0; else=NA")
brfss_17$hbp<-Recode(brfss_17$bphigh4, recodes = "1= 1; 2:4=0; else=NA")
brfss_17$hchol<-Recode(brfss_17$toldhi2, recodes = "1= 1; 2:4=0; else=NA")
brfss_17$heart<-Recode(brfss_17$cvdcrhd4, recodes = "1= 1; 2=0; else=NA")
brfss_17$ast<-Recode(brfss_17$asthma3, recodes = "1= 1; 2:4=0; else=NA")
Now we prepare to fit the survey-corrected model. Here I just keep the variables that I will use; this keeps the size of things down in terms of memory used.
We use the princomp() function to do the PCA (prcomp() is a closely related alternative; see the sketch after the call below). We specify the variables we want in our index using a formula and give the name of the data. The variables were already z-scored in the pipeline above using scale(), which removes the mean and divides by the standard deviation of each variable, and scores=T calculates the PC scores for each individual on each component.
brfss_17b<-brfss_17%>%
filter(complete.cases(mmsawt, ststr, agec, race_eth,educ,healthdays, healthmdays, smoke, bmi, badhealth,ins, checkup, hbp, hchol, heart, ast))%>%
select(mmsawt, ststr, agec, race_eth,educ, healthdays, healthmdays, smoke, bmi, badhealth,ins, checkup, hbp, hchol, heart, ast)%>%
mutate_at(vars(healthdays, healthmdays, smoke, bmi, badhealth,ins, checkup, hbp, hchol, heart, ast), scale)
brfss.pc<-princomp(~healthdays+healthmdays+smoke+bmi+badhealth+hbp+ hchol+ heart+ ast,
data=brfss_17b,
scores=T)
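For comparison, here is a sketch of a roughly equivalent call using prcomp(), which computes the components via the singular value decomposition of the data matrix; center and scale. are redundant here since the variables were already standardized, and retx=TRUE returns the scores.
#roughly equivalent PCA using prcomp() instead of princomp()
brfss.pc2<-prcomp(~healthdays+healthmdays+smoke+bmi+badhealth+hbp+ hchol+ heart+ ast,
data=brfss_17b,
center=TRUE, scale.=TRUE, retx=TRUE)
summary(brfss.pc2) #standard deviations and proportions of variance, as below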
#Screeplot
screeplot(brfss.pc,
type = "l",
main = "Scree Plot")
abline(h=1)
Summary of eigenvalues and variance explained
#Request some basic summaries of the PCA analysis; options(digits=3) would shorten the printed output
summary(brfss.pc)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.4822767 1.1499174 1.0048604 0.9654195 0.9528408
## Proportion of Variance 0.2441285 0.1469241 0.1121944 0.1035600 0.1008790
## Cumulative Proportion 0.2441285 0.3910526 0.5032471 0.6068071 0.7076860
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.89826757 0.84805700 0.80413146 0.67682974
## Proportion of Variance 0.08965435 0.07991163 0.07184789 0.05090012
## Cumulative Proportion 0.79734037 0.87725200 0.94909988 1.00000000
The first three components account for about 50% of the variation in the input variables, which isn’t bad. We also see three eigenvalues of at least 1, which suggests there are three real components among these variables (remember, we are looking for eigenvalues > 1 when using z-scored data).
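Since princomp() reports component standard deviations rather than eigenvalues, a quick way to check the eigenvalue-greater-than-1 rule directly is to square them (using the brfss.pc object fitted above):
#eigenvalues are the squared component standard deviations
round(brfss.pc$sdev^2, 3)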
Examine the eigenvectors/loading vectors
loadings(brfss.pc )
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## healthdays 0.489 0.249 0.213 0.224 0.292 0.106 0.706
## healthmdays 0.343 0.416 0.147 0.260 -0.729 -0.274 -0.103
## smoke 0.138 0.369 0.345 -0.702 -0.370 -0.271 0.105
## bmi 0.262 -0.263 -0.539 -0.308 0.298 -0.508 -0.230 0.271
## badhealth 0.505 0.169 0.164 0.182 0.384 0.139 -0.694
## hbp 0.338 -0.478 -0.253 0.198 -0.741
## hchol 0.279 -0.472 0.180 -0.198 -0.155 0.565 -0.148 0.512
## heart 0.267 -0.246 0.388 0.466 -0.384 -0.499 -0.318
## ast 0.191 0.141 -0.633 0.127 -0.705 0.140
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111 0.111
## Cumulative Var 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1.000
In terms of the loadings for PC1, we see positive associations with all of the variables. We may interpret this component as an index of overall (poor) health status, since all of the health-problem variables load in the same direction.
For PC2 we see a mixture of loadings, some positive, some negative; the largest positive loadings are for poor mental health days and smoking, while the chronic condition variables (high blood pressure and high cholesterol) load most strongly in the negative direction.
Of course, interpreting components is more art than science…
#then I plot the first 2 components
scores<-data.frame(brfss.pc$scores)
hist(scores$Comp.1)
hist(scores$Comp.2)
Next, I calculate the correlation between the first 2 components to show they are orthogonal, i.e. correlation == 0
cor(scores[,1:2])
## Comp.1 Comp.2
## Comp.1 1.000000e+00 -1.782005e-14
## Comp.2 -1.782005e-14 1.000000e+00
brfss_17c<-cbind(brfss_17b, scores)
names(brfss_17c)
## [1] "mmsawt" "ststr" "agec" "race_eth" "educ"
## [6] "healthdays" "healthmdays" "smoke" "bmi" "badhealth"
## [11] "ins" "checkup" "hbp" "hchol" "heart"
## [16] "ast" "Comp.1" "Comp.2" "Comp.3" "Comp.4"
## [21] "Comp.5" "Comp.6" "Comp.7" "Comp.8" "Comp.9"
Here we examine correlation among our health variables
round(cor(brfss_17c[,c(-1:-5, -17:-27)]), 3)
## healthdays healthmdays smoke bmi badhealth ins checkup hbp
## healthdays 1.000 0.338 0.100 0.125 0.533 -0.001 0.029 0.162
## healthmdays 0.338 1.000 0.156 0.075 0.285 -0.056 -0.035 0.042
## smoke 0.100 0.156 1.000 -0.024 0.120 -0.104 -0.074 0.007
## bmi 0.125 0.075 -0.024 1.000 0.169 -0.008 0.033 0.229
## badhealth 0.533 0.285 0.120 0.169 1.000 -0.037 0.029 0.208
## ins -0.001 -0.056 -0.104 -0.008 -0.037 1.000 0.171 0.057
## checkup 0.029 -0.035 -0.074 0.033 0.029 0.171 1.000 0.137
## hbp 0.162 0.042 0.007 0.229 0.208 0.057 0.137 1.000
## hchol 0.120 0.043 -0.006 0.124 0.141 0.066 0.124 0.318
## heart 0.154 0.047 0.007 0.050 0.194 0.028 0.049 0.181
## ast 0.129 0.115 0.034 0.097 0.125 0.001 0.006 0.040
## hchol heart ast
## healthdays 0.120 0.154 0.129
## healthmdays 0.043 0.047 0.115
## smoke -0.006 0.007 0.034
## bmi 0.124 0.050 0.097
## badhealth 0.141 0.194 0.125
## ins 0.066 0.028 0.001
## checkup 0.124 0.049 0.006
## hbp 0.318 0.181 0.040
## hchol 1.000 0.167 0.029
## heart 0.167 1.000 0.041
## ast 0.029 0.041 1.000
Sometimes it’s easier to look at the correlations among the original variables and the first 2 components
round(cor(brfss_17c[,c(-1:-5, -19:-27)]),3)
## healthdays healthmdays smoke bmi badhealth ins checkup
## healthdays 1.000 0.338 0.100 0.125 0.533 -0.001 0.029
## healthmdays 0.338 1.000 0.156 0.075 0.285 -0.056 -0.035
## smoke 0.100 0.156 1.000 -0.024 0.120 -0.104 -0.074
## bmi 0.125 0.075 -0.024 1.000 0.169 -0.008 0.033
## badhealth 0.533 0.285 0.120 0.169 1.000 -0.037 0.029
## ins -0.001 -0.056 -0.104 -0.008 -0.037 1.000 0.171
## checkup 0.029 -0.035 -0.074 0.033 0.029 0.171 1.000
## hbp 0.162 0.042 0.007 0.229 0.208 0.057 0.137
## hchol 0.120 0.043 -0.006 0.124 0.141 0.066 0.124
## heart 0.154 0.047 0.007 0.050 0.194 0.028 0.049
## ast 0.129 0.115 0.034 0.097 0.125 0.001 0.006
## Comp.1 0.725 0.509 0.205 0.388 0.749 -0.006 0.075
## Comp.2 0.286 0.478 0.424 -0.302 0.194 -0.114 -0.151
## hbp hchol heart ast Comp.1 Comp.2
## healthdays 0.162 0.120 0.154 0.129 0.725 0.286
## healthmdays 0.042 0.043 0.047 0.115 0.509 0.478
## smoke 0.007 -0.006 0.007 0.034 0.205 0.424
## bmi 0.229 0.124 0.050 0.097 0.388 -0.302
## badhealth 0.208 0.141 0.194 0.125 0.749 0.194
## ins 0.057 0.066 0.028 0.001 -0.006 -0.114
## checkup 0.137 0.124 0.049 0.006 0.075 -0.151
## hbp 1.000 0.318 0.181 0.040 0.501 -0.549
## hchol 0.318 1.000 0.167 0.029 0.414 -0.543
## heart 0.181 0.167 1.000 0.041 0.395 -0.283
## ast 0.040 0.029 0.041 1.000 0.283 0.162
## Comp.1 0.501 0.414 0.395 0.283 1.000 0.000
## Comp.2 -0.549 -0.543 -0.283 0.162 0.000 1.000
This allows us to see the correlations among the original health variables and the new components, which is important for interpreting the direction of each component. In this case, PC1 has positive correlations with all of the health-problem variables, so respondents with higher PC1 scores have worse overall health.
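Because the sign of a principal component is arbitrary, a “higher = better health” version of the index could be created by simply flipping the sign of the scores. A minimal sketch (the name healthindex is just illustrative and is not used in the analyses below):
#optional: reverse the sign so that larger values indicate better health
brfss_17c$healthindex <- -1*brfss_17c$Comp.1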
Next, I will use the PCs we just generated in an analysis.
#Make the survey design object
options(survey.lonely.psu = "adjust")
brfss_17c$pc1q<-cut(brfss_17c$Comp.1,
breaks = quantile(brfss_17c$Comp.1,
probs = c(0,.25,.5,.75,1) ,
na.rm=T),
include.lowest=T)
des<-svydesign(ids=~1,
strata=~ststr,
weights=~mmsawt,
data=brfss_17c)
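As a quick sanity check of the design object (a sketch, not part of the original analysis), we could compute the survey-weighted mean of the index by education level with svyby():
#survey-weighted mean of the health index by education level
svyby(~Comp.1, ~educ, design=des, FUN=svymean, na.rm=TRUE)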
The first analysis will look at variation in my health index across age and education level:
library(ggplot2)
ggplot(aes(x=agec, y=Comp.1),
data=brfss_17c)+
geom_boxplot()
brfss_17c$educ<-factor(brfss_17c$educ, levels(brfss_17c$educ)[c(1,3,2,4,5)])
ggplot(aes(x=educ, y=Comp.1),
data=brfss_17c)+
geom_boxplot()
The age plot is nice, and really shows that older respondents have higher values of our health index, which confirms our interpretation that higher values of the index are “not good”.
Now we do some hypothesis testing. This model will examine variation in my health index across age, education, and race/ethnicity:
fit.1<-svyglm(Comp.1~agec+educ+(race_eth),
des,
family=gaussian)
summary(fit.1)
##
## Call:
## svyglm(formula = Comp.1 ~ agec + educ + (race_eth), design = des,
## family = gaussian)
##
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss_17c)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.62798 0.02183 -28.769 < 2e-16 ***
## agec(24,39] 0.36908 0.02139 17.253 < 2e-16 ***
## agec(39,59] 0.78325 0.02190 35.760 < 2e-16 ***
## agec(59,79] 1.10850 0.02285 48.511 < 2e-16 ***
## agec(79,99] 1.02713 0.04073 25.217 < 2e-16 ***
## educ0Prim 0.44400 0.06147 7.223 5.11e-13 ***
## educ1somehs 0.53562 0.04115 13.017 < 2e-16 ***
## educ3somecol -0.12294 0.01881 -6.534 6.42e-11 ***
## educ4colgrad -0.58688 0.01616 -36.316 < 2e-16 ***
## race_ethhispanic -0.05136 0.02252 -2.281 0.0226 *
## race_ethnh black 0.16877 0.02231 7.563 3.95e-14 ***
## race_ethnh multirace 0.33619 0.04809 6.990 2.76e-12 ***
## race_ethnh other -0.18925 0.03024 -6.257 3.93e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.829605)
##
## Number of Fisher Scoring iterations: 2
So, overall, the index increases with age and decreases with higher education. Hispanic and other race/ethnic respondents have lower index values on average than non-Hispanic whites, while NH Black and multiple-race respondents have higher values.
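To formally test whether the race/ethnicity terms jointly contribute to the model, one option (a sketch using the fitted object above) is a design-based Wald test with regTermTest() from the survey package:
#joint Wald test of the race/ethnicity terms in the survey-weighted model
regTermTest(fit.1, test.terms=~race_eth)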