This example illustrates the use of Principal Components Analysis to form an index of overall health using data from the 2014 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART MSA data.

Principal Components is a mathematical technique (not a statistical one!) based on the eigenvalue decomposition/singular value decomposition of a data matrix (or a correlation/covariance matrix). The technique re-expresses a complicated data set (a set of variables) in a simpler form, in which the information in the original variables is represented by fewer components that capture (we hope!) the majority of the variance in the original data.

This is known as a variable reduction strategy. It is also commonly used to form indices in the behavioral/social sciences. It is closely related to factor analysis, and indeed the initial solution of factor analysis is often identical to the PCA solution. PCA preserves the orthogonal nature of the components, while factor analysis solutions (e.g., after oblique rotation) can relax this constraint.

PCA in formulas

If we have a set of variables \(X = \left[ x_1, x_2, \ldots, x_k \right]\), and we have the Pearson correlation matrix between these variables, \(R\), with

\[ R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1k} \\ r_{21} & r_{22} & \cdots & r_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & \cdots & r_{kk} \end{bmatrix}\]

The eigenvalue (singular value) decomposition of \(R\) is based on the equation \[R a = \lambda a\]

where the \(\lambda\)’s are the eigenvalues of \(R\) and the \(a\)’s are the corresponding eigenvectors of \(R\).

These values are found by solving the characteristic equation for the eigenvalues and corresponding eigenvectors of \(R\):

\[( R - \lambda I )a = 0\]

Setting the determinant of \(R-\lambda I\) equal to 0 and solving gives k solutions to the equation (one for each of the k original variables): the k eigenvalues and their k accompanying eigenvectors.
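
For example, with \(k = 2\) variables having correlation \(r > 0\), the characteristic equation is \((1-\lambda)^2 - r^2 = 0\), giving \(\lambda_1 = 1 + r\) and \(\lambda_2 = 1 - r\): the stronger the correlation, the more of the total variance the first component captures.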

The eigenvalues of R are ranked from largest to smallest, so that \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k\)

Each eigenvalue accounts for a proportion of the total variation in the matrix, and we can see this proportion as:

% variance explained = \(\frac{\lambda_i}{\sum_i \lambda_i}\)
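
For example, if \(k = 4\) z-scored variables give eigenvalues \(\lambda = (2.0, 1.0, 0.6, 0.4)\) (for a correlation matrix the eigenvalues sum to \(k\)), the first component explains \(2.0/4 = 50\%\) of the total variance, and the first two together explain 75%.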

The eigenvectors \(a\) are mutually orthogonal, and the components they define are uncorrelated linear combinations of the original k variables.

These two items form what is known as the singular value (or eigenvalue) decomposition of a square matrix, and when that matrix is a correlation matrix, this is called Principal Components Analysis, or PCA. The idea is that we get a new set of uncorrelated variables (due to the orthogonal nature of the eigenvectors), or components, that summarize the information in the original k variables in a smaller number of dimensions. For example, we may have had 10 variables originally that we thought measured “health”, but after doing a PCA we may have three components that summarize 75% of the information in those original 10 variables, so we have a simpler set of variables. Moreover, each component now corresponds to a unique and independent subset of the information in the original 10 variables.
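
To make this concrete, here is a minimal sketch in R using simulated data (the variable names x1 through x3 are made up for illustration, not the BRFSS variables); eigen() returns exactly the eigenvalues and eigenvectors described above:

#minimal sketch: eigendecomposition of a correlation matrix (simulated data)
set.seed(42)
x1<-rnorm(100)
x2<-x1+rnorm(100)            #correlated with x1
x3<-rnorm(100)               #independent noise
R<-cor(cbind(x1, x2, x3))    #Pearson correlation matrix
e<-eigen(R)                  #eigenvalues and eigenvectors
e$values/sum(e$values)       #proportion of variance for each component
e$vectors                    #columns are the eigenvectors (loading vectors)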

Variable loadings

Each new component has an eigenvector, which we call a loading vector. We can see how each variable is related to a new component by examining the loadings for that component. These can be thought of in a very similar way to standardized regression coefficients: large positive values indicate a positive association of that variable with the component, large negative values the opposite, and values close to 0 indicate that a particular variable doesn’t contribute to that component. There are no formal tests for loadings, so interpretation can be quite subjective.
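
As a small, self-contained illustration (using R’s built-in USArrests data rather than the BRFSS data), the loadings of a prcomp fit live in its rotation matrix and can be ranked by magnitude:

#sketch: inspecting and ranking loadings on a built-in dataset
pc.toy<-prcomp(USArrests, center=TRUE, scale.=TRUE)
round(pc.toy$rotation[,1], 2)                    #loadings on the first component
sort(abs(pc.toy$rotation[,1]), decreasing=TRUE)  #variables ranked by contribution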

Using this in research

So, this formalism is nice, but what does it boil down to in terms of research? Well, that depends on what you’re looking for. Many people use PCA to summarize information on lots of variables that may be correlated, in the hopes of finding a smaller subset of components that represent some underlying latent trait they want to measure. Oftentimes, people use this method to construct an index for something. While there are lots of ways to build an index, PCA is advantageous because it acknowledges the correlation among the input variables.

For example, Sharkey and Horel (2008) use this method to construct an index of socioeconomic deprivation for census tracts, and Messer et al. (2006) also use this method for a deprivation index in the context of low birth weight outcomes.

General rules

  1. Always use z-scored variables when doing a PCA: scale is very important, as we want correlations going in, not covariances.
  2. If the first eigenvalue is not much greater than 1, your PCA isn’t doing much for you; it isn’t finding much correlated information among your variables.
  3. Have a look at how much variation your components summarize; in aggregate data it will tend to be more than in individual-level survey data.
  4. Scatter plots of the first few PCs can often be informative for finding underlying groups in your data, as observations tend to cluster together in multidimensional space.
  5. You can use the PCs in a regression model if you’ve had problems with multicollinearity among variables. This is called principal component regression; a minimal sketch follows this list.
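
Here is a minimal sketch of principal component regression on simulated data (all names here are hypothetical): two nearly collinear predictors are replaced by uncorrelated component scores before fitting.

#principal component regression sketch: simulated, nearly collinear predictors
set.seed(123)
n<-200
x1<-rnorm(n)
x2<-x1+rnorm(n, sd=0.1)         #nearly collinear with x1
x3<-rnorm(n)
y<-x1+x3+rnorm(n)
pc.sim<-prcomp(cbind(x1, x2, x3), center=TRUE, scale.=TRUE)
fit.pcr<-lm(y~pc.sim$x[,1:2])   #regress on the first two (uncorrelated) components
summary(fit.pcr)

With these rules in mind, we first load the needed packages and recode the BRFSS variables used in the analysis.
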
library(car)
library(stargazer)
library(survey)
library(ggplot2)
library(pander)
library(knitr)
load(file = "brfss_14.Rdata")
nams<-names(brfss14)
head(nams, n=10)
##  [1] "dispcode" "genhlth"  "physhlth" "menthlth" "poorhlth" "hlthpln1"
##  [7] "persdoc2" "medcost"  "checkup1" "exerany2"
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
set.seed(1234)
newnames<-gsub(pattern = "x_",replacement =  "",x =  nams)
names(brfss14)<-tolower(newnames)


#Healthy days
brfss14$healthdays<-recode(brfss14$physhlth, recodes = "88=0; 77=NA; 99=NA")

#Healthy mental health days
brfss14$healthmdays<-recode(brfss14$menthlth, recodes = "88=0; 77=NA; 99=NA")

#BMI
brfss14$bmi<-ifelse(is.na(brfss14$bmi5)==T, NA, brfss14$bmi5/100)
#Poor or fair self rated health
brfss14$badhealth<-recode(brfss14$genhlth, recodes="4:5=1; 1:3=0; else=NA")

#race/ethnicity
brfss14$black<-recode(brfss14$racegr3, recodes="2=1; 9=NA; else=0")
brfss14$white<-recode(brfss14$racegr3, recodes="1=1; 9=NA; else=0")
brfss14$other<-recode(brfss14$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss14$hispanic<-recode(brfss14$racegr3, recodes="5=1; 9=NA; else=0")
brfss14$race_eth<-recode(brfss14$racegr3, recodes="1='nhwhite'; 2='nh black'; 3='nh other';
                         4='nh multirace'; 5='hispanic'; else=NA", as.factor.result = T)
brfss14$race_eth<-relevel(brfss14$race_eth, ref = "nhwhite")

#income grouping
brfss14$inc<-ifelse(brfss14$incomg==9, NA, brfss14$incomg)

#education level
brfss14$educ<-recode(brfss14$educa, recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad';
                     5='3somecol'; 6='4colgrad';9=NA", as.factor.result=T)


#employment
brfss14$employ<-recode(brfss14$employ, recodes="1:2='Employed'; 3:6='nilf';
                       7='retired'; 8='unable'; else=NA", as.factor.result=T)
brfss14$employ<-relevel(brfss14$employ, ref='Employed')

#marital status
brfss14$marst<-recode(brfss14$marital, recodes="1='married'; 2='divorced'; 3='widowed';
                      4='separated'; 5='nm';6='cohab'; else=NA", as.factor.result=T)
brfss14$marst<-relevel(brfss14$marst, ref='married')

#Age cut into intervals
brfss14$agec<-cut(brfss14$age80, breaks=c(0,24,39,59,79,99), include.lowest = T)

#Health behaviors
#insurance
brfss14$ins<-recode(brfss14$hlthpln1, recodes = "1=1; 2=0; else=NA" )

#smoking currently
brfss14$smoke<-recode(brfss14$smoker3, recodes="1:2=1; 3:4=0; else=NA", as.factor.result=F)

Analysis

Now we prepare to fit the survey-corrected model. Here I just get the names of the variables that I will use; this keeps the size of things down, in terms of memory used.
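
A minimal sketch of that subsetting step (the variable list here is an assumption, matching the variables recoded above):

#hypothetical subsetting step: keep only the analysis variables
myvars<-c("healthdays", "healthmdays", "ins", "smoke", "bmi", "badhealth")
head(brfss14[, myvars])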

We use the prcomp function to do the PCA. We specify the variables we want in our index using a formula and the name of the data; we also tell R to z-score the data (center=T removes the mean, scale.=T divides by the standard deviation of each variable), and retx=T calculates the PC scores for each individual on each component.

#insurance enters as I(-1*ins) so that lacking insurance points in the same "unhealthy" direction as the other variables
brfss.pc<-prcomp(~healthdays+healthmdays+I(-1*ins)+smoke+bmi+badhealth, data=brfss14, center=T, scale.=T, retx=T)

#Screeplot
screeplot(brfss.pc, type = "l", main = "Scree Plot")
abline(h=1)

#Request some basic summaries of the PCA analysis
summary(brfss.pc)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6
## Standard deviation     1.3804 1.0605 0.9782 0.9125 0.8527 0.67321
## Proportion of Variance 0.3176 0.1875 0.1595 0.1388 0.1212 0.07554
## Cumulative Proportion  0.3176 0.5050 0.6645 0.8033 0.9245 1.00000
brfss.pc$rotation
##                    PC1        PC2         PC3         PC4         PC5
## healthdays  0.58095678 -0.1535405 -0.16245343  0.18076993  0.23501024
## healthmdays 0.47577619  0.1109303 -0.15574783  0.04049504 -0.84796764
## I(-1 * ins) 0.09694554  0.6299130  0.58468210  0.49755359  0.04595082
## smoke       0.22364354  0.6300452 -0.13471310 -0.70434147  0.19342646
## bmi         0.22111549 -0.3886331  0.76618879 -0.44776947 -0.09725386
## badhealth   0.57255507 -0.1390566 -0.04801511  0.14672560  0.42040157
##                     PC6
## healthdays  -0.72430369
## healthmdays  0.12800444
## I(-1 * ins) -0.04782242
## smoke       -0.03699027
## bmi         -0.05541795
## badhealth    0.67251217
#then I plot the distributions of the scores on the first 2 components
hist(brfss.pc$x[,1])

hist(brfss.pc$x[,2])

The first two components account for 50% of the variation in the input variables, which isn’t bad. We also see 2 eigenvalues of at least 1, which suggests there are 2 real components among these 6 variables (remember, we look for eigenvalues > 1 when using z-scored data).
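
Since prcomp() reports standard deviations rather than eigenvalues, squaring them recovers the eigenvalues used in the eigenvalue-greater-than-1 rule:

brfss.pc$sdev^2   #eigenvalues: roughly 1.91, 1.12, 0.96, 0.83, 0.73, 0.45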

In terms of the loadings for PC1, all of the variables load in the same (positive) direction: more unhealthy days, worse self-rated health, higher BMI, smoking, and lacking insurance (recall insurance was entered as I(-1*ins)). We may therefore interpret this component as an index of overall health status, with higher scores indicating worse health.

For PC2 we see a mixture of loadings, some positive, some negative; the variables loading most heavily on this component are smoking and (lack of) insurance, with a moderate negative loading for BMI.

Of course, interpreting components is more art than science…

Next, I calculate the correlation between the first 2 components to show they are orthogonal, i.e., their correlation is 0.

cor(brfss.pc$x[,1:2])
##               PC1           PC2
## PC1  1.000000e+00 -6.800532e-18
## PC2 -6.800532e-18  1.000000e+00
scores<-data.frame(brfss.pc$x)
scores$name<-rownames(brfss.pc$x)
brfss14$name<-rownames(brfss14)
brfss14<-merge(brfss14, scores, by.x="name", by.y="name", all.x=F)
tail(names(brfss14), 10)
##  [1] "employ" "marst"  "agec"   "smoke"  "PC1"    "PC2"    "PC3"   
##  [8] "PC4"    "PC5"    "PC6"

Here we examine the correlations among our health variables.

round(cor(brfss14[,c("healthdays","healthmdays","ins","smoke","bmi","badhealth")]), 3)
##             healthdays healthmdays    ins  smoke    bmi badhealth
## healthdays       1.000       0.351 -0.006  0.099  0.127     0.538
## healthmdays      0.351       1.000 -0.065  0.156  0.079     0.294
## ins             -0.006      -0.065  1.000 -0.128 -0.007    -0.041
## smoke            0.099       0.156 -0.128  1.000 -0.030     0.113
## bmi              0.127       0.079 -0.007 -0.030  1.000     0.165
## badhealth        0.538       0.294 -0.041  0.113  0.165     1.000

Sometimes it’s easier to look at the correlations among the original variables and the components.

round(cor(brfss14[,c("healthdays","healthmdays","ins","smoke","bmi","badhealth", "PC1", "PC2")]),3)
##             healthdays healthmdays    ins  smoke    bmi badhealth    PC1
## healthdays       1.000       0.351 -0.006  0.099  0.127     0.538  0.802
## healthmdays      0.351       1.000 -0.065  0.156  0.079     0.294  0.657
## ins             -0.006      -0.065  1.000 -0.128 -0.007    -0.041 -0.134
## smoke            0.099       0.156 -0.128  1.000 -0.030     0.113  0.309
## bmi              0.127       0.079 -0.007 -0.030  1.000     0.165  0.305
## badhealth        0.538       0.294 -0.041  0.113  0.165     1.000  0.790
## PC1              0.802       0.657 -0.134  0.309  0.305     0.790  1.000
## PC2             -0.163       0.118 -0.668  0.668 -0.412    -0.147  0.000
##                PC2
## healthdays  -0.163
## healthmdays  0.118
## ins         -0.668
## smoke        0.668
## bmi         -0.412
## badhealth   -0.147
## PC1          0.000
## PC2          1.000

This allows us to see the correlations among the original health variables and the new components.

Using the Principal Components

Next, I will use the PCs we just generated in an analysis.

#Make the survey design object
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data=brfss14 , nest=T)

#means of the variables in the index
svymean(~healthdays+healthmdays+ins+smoke+bmi+badhealth, des, estimate.only=T )
##                 mean     SE
## healthdays   3.74410 0.0349
## healthmdays  3.58928 0.0343
## ins          0.87186 0.0017
## smoke        0.16350 0.0016
## bmi         27.63053 0.0276
## badhealth    0.16599 0.0017

The first analysis looks at variation in the health index (PC1) across age groups, and in the second component across education levels:

boxplot(PC1~agec, data=brfss14)

boxplot(PC2~educ, data=brfss14)

The age plot is nice, and really shows that older ages have higher values of our health index, which confirms our interpretation that higher values of the index are “not good”.

Now we do some hypothesis testing. This model will examine variation in the health index across age, education, race/ethnicity, and income:

fit.1<-svyglm(PC1~agec+educ+(race_eth)+inc, des, family=gaussian)
summary(fit.1)
## 
## Call:
## svyglm(formula = PC1 ~ agec + educ + (race_eth) + inc, des, family = gaussian)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss14, 
##     nest = T)
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.172860   0.058320  20.111  < 2e-16 ***
## agec(24,39]           0.363412   0.021082  17.238  < 2e-16 ***
## agec(39,59]           0.594829   0.020397  29.163  < 2e-16 ***
## agec(59,79]           0.489876   0.021066  23.254  < 2e-16 ***
## agec(79,99]           0.167075   0.031502   5.304 1.14e-07 ***
## educ1somehs           0.014213   0.060814   0.234    0.815    
## educ2hsgrad          -0.390427   0.053498  -7.298 2.93e-13 ***
## educ3somecol         -0.453606   0.053651  -8.455  < 2e-16 ***
## educ4colgrad         -0.747221   0.053456 -13.978  < 2e-16 ***
## race_ethhispanic     -0.176830   0.020569  -8.597  < 2e-16 ***
## race_ethnh black     -0.002788   0.020003  -0.139    0.889    
## race_ethnh multirace  0.215212   0.047328   4.547 5.44e-06 ***
## race_ethnh other     -0.261496   0.026254  -9.960  < 2e-16 ***
## inc                  -0.280014   0.005344 -52.396  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.555263)
## 
## Number of Fisher Scoring iterations: 2

So, overall, the index increases with age (peaking in the 39 to 59 group) and decreases with education. Hispanics and non-Hispanic others have lower index values on average than non-Hispanic whites, while multirace respondents have higher values, and those with higher incomes have lower values.

#use the survey package's svyprcomp to do a design-weighted pca
svy.pc<-svyprcomp(formula=~healthdays+healthmdays+ins+smoke+bmi+badhealth, center=T, scale.=T,scores=T, des)

brfss14$svpcs1<-svy.pc$x[,1]
brfss14$svpcs2<-svy.pc$x[,2]

#Just look at the non-survey pc1 vs the survey pc1
ggplot(brfss14, aes(x=PC1, y=svpcs1))+geom_point(alpha=.05)+theme_gray()

ggplot(brfss14, aes(x=PC1, y=svpcs1)) + stat_binhex()
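
A quick numeric check of agreement between the weighted and unweighted first components (remembering that a component’s sign is arbitrary, so a correlation near -1 would also indicate agreement):

cor(brfss14$PC1, brfss14$svpcs1, use="complete.obs")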

Aggregate data example

This is an example from a paper some former students and I presented at the 2014 Applied Demography conference, entitled “Spatial accessibility to food sources in Alamo Area Council of Governments (AACOG)”, in which we examined factors influencing whether or not a census tract was a food desert in the area surrounding San Antonio. We used the index from the Sharkey paper cited above.

In this case, we had to do some reverse coding of the index so that higher values would indicate greater deprivation.
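
Because the sign of any principal component is arbitrary, flipping it is legitimate; here is a self-contained sketch of the idea on toy data (all names hypothetical):

#sketch: reverse-code a component so its direction is interpretable
z<-scale(matrix(rnorm(300), ncol=3))
p.toy<-prcomp(z)
#flip PC1 so it correlates positively with the first input variable,
#anchoring the direction of the resulting index
if(cor(p.toy$x[,1], z[,1])<0) p.toy$x[,1]<- -p.toy$x[,1]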

library(rgdal)
## Loading required package: sp
## rgdal: version: 1.2-5, (SVN revision 648)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.0.1, released 2015/09/15
##  Path to GDAL shared files: C:/Users/Corey Sparks/Documents/R/win-library/3.3/rgdal/gdal
##  Loaded PROJ.4 runtime: Rel. 4.9.2, 08 September 2015, [PJ_VERSION: 492]
##  Path to PROJ.4 shared files: C:/Users/Corey Sparks/Documents/R/win-library/3.3/rgdal/proj
##  Linking to sp version: 1.2-4
sp<-readOGR(dsn = "D:/GoogleDrive/Food Security Paper", layer = "shp2_nm")
## OGR data source with driver: ESRI Shapefile 
## Source: "D:/GoogleDrive/Food Security Paper", layer: "shp2_nm"
## with 466 features
## It has 220 fields
## Integer64 fields read as strings:  OBJECTI DP00100 B010010 B02001001 ... LwIncmT POP2010 region region2 (long list of census table fields truncated)
#note: without I(), the (-1*p25plsl) term is dropped by the formula
#machinery, so p25plsl does not enter the analysis (only six variables
#appear in the rotation below); to reverse-code it, write I(-1*p25plsl)
pcshark<-prcomp(~punemp+pprsnsp+(-1*p25plsl)+gt1prrm+pubasst+phwoveh+phwophn, center=T, scale=T, data=sp, retx=T)
pcshark
## Standard deviations:
## [1] 1.8663852 0.8610036 0.8020119 0.7011190 0.6259235 0.4987064
## 
## Rotation:
##                PC1         PC2           PC3        PC4          PC5
## punemp  -0.4018635  0.43356009  0.0505749296 -0.6663923  0.438027381
## pprsnsp -0.4710992 -0.09081256  0.0002533247 -0.1633937 -0.399702491
## gt1prrm -0.3718360  0.12266195  0.8268309160  0.3184658 -0.126928435
## pubasst -0.3623665  0.57277160 -0.4729613865  0.5619221  0.001537651
## phwoveh -0.4362213 -0.35893552 -0.2906211744 -0.1948689 -0.456054569
## phwophn -0.3958671 -0.57604753 -0.0750973274  0.2721635  0.651360860
##                 PC6
## punemp  -0.10966695
## pprsnsp  0.76377943
## gt1prrm -0.21333403
## pubasst -0.03423333
## phwoveh -0.59199319
## phwophn  0.08645738
summary(pcshark)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6
## Standard deviation     1.8664 0.8610 0.8020 0.70112 0.6259 0.49871
## Proportion of Variance 0.5806 0.1235 0.1072 0.08193 0.0653 0.04145
## Cumulative Proportion  0.5806 0.7041 0.8113 0.89325 0.9586 1.00000
#flip the sign of the first component so that higher values indicate
#greater deprivation/food-insecurity risk
sp$sharkeyindex<--1*prcomp(~punemp+pprsnsp+(-1*p25plsl)+gt1prrm+pubasst+phwoveh+phwophn, center=T, scale=T, data=sp, retx=T)$x[,1]

spplot(sp, "sharkeyindex", col.regions=RColorBrewer::brewer.pal(n=8, "RdBu"), at=quantile(sp$sharkeyindex, p=seq(0,1,length.out = 8)), main="Food Insecurity Risk index - Sharkey")

fit<-glm(LILAT_1A1~scale(plingis)+scale(interct)+Rural+sharkeyindex,family=binomial, sp)
summary(fit)
## 
## Call:
## glm(formula = LILAT_1A1 ~ scale(plingis) + scale(interct) + Rural + 
##     sharkeyindex, family = binomial, data = sp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0309  -0.5892  -0.4166  -0.2578   2.4975  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.41872    0.15210  -9.328  < 2e-16 ***
## scale(plingis)  0.03501    0.21010   0.167   0.8677    
## scale(interct)  0.02544    0.24611   0.103   0.9177    
## Rural1         -1.02594    0.46176  -2.222   0.0263 *  
## sharkeyindex    0.55722    0.10108   5.513 3.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 484.62  on 465  degrees of freedom
## Residual deviance: 387.09  on 461  degrees of freedom
## AIC: 397.09
## 
## Number of Fisher Scoring iterations: 5
sp$fitted<-fit$fitted.values
spplot(sp, "fitted",col.regions=RColorBrewer::brewer.pal(n=8, "Reds"), at=quantile(sp$fitted, p=c(0, .1, .5,.75, .8, .9, .95, 1)) , main="Fitted Probability of Being a Food Desert", col="transparent")

So, in areas that had higher values of the index, the odds of being a food desert were also higher. This was offset by being in a rural area, where the odds of being a food desert were lower.