This example illustrates the use of Principal Components Analysis to form an index of overall health using data from the 2017 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART MSA data.

Principal Components is a mathematical technique (not a statistical one!) based on the eigenvalue/singular value decomposition of a data matrix (or of its correlation/covariance matrix). The technique re-represents a complicated data set (a set of variables) in a simpler form, where the information in the original variables is represented by fewer components that capture the majority (we hope!) of the variance in the original data.

This is known as a variable reduction strategy. It is also commonly used to form indices in the behavioral/social sciences. It is closely related to factor analysis, and indeed the initial solution of factor analysis is often identical to the PCA solution. PCA preserves the orthogonality of the components, while factor analysis can relax this constraint.

PCA in formulas

If we have a series of variables, called \(X\), with \(X = \left[ x_1, x_2, \ldots, x_k \right]\), and we have the Pearson correlation matrix between these variables, \(R\), with

\[ R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1k} \\ r_{21} & r_{22} & \cdots & r_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & \cdots & r_{kk} \end{bmatrix}\]

The eigenvalue decomposition of \(R\) satisfies \[R a = \lambda a\]

where the \(\lambda\)’s are the eigenvalues of \(R\) and the \(a\)’s are the corresponding eigenvectors of \(R\).

These values are found by solving the following equation for the eigenvalues and corresponding eigenvectors of \(R\):

\[( R - \lambda I )a = 0\]

For nonzero eigenvectors to exist, the determinant of \(R-\lambda I\) must equal 0; solving \(\det(R - \lambda I) = 0\) yields k solutions (one for each of the k original variables), i.e., k eigenvalues, each with an accompanying eigenvector.

The eigenvalues of \(R\) are ranked, with the first being the largest and the kth being the smallest, so we have \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k\)

Each eigenvalue accounts for a proportion of all the variation within the matrix (for a correlation matrix, the eigenvalues sum to k, the number of variables), and we can see this proportion by:

\[\text{% variance explained} = \frac{\lambda_i}{\sum_{j=1}^{k} \lambda_j}\]

The eigenvectors a will each be orthogonal, or uncorrelated with each other, and will be linear combinations of the original k variables.

These two items form what is known as the eigenvalue (or singular value) decomposition of a square matrix, and when that matrix is a correlation matrix, this analysis is called Principal Components Analysis, or PCA. The idea is that we obtain a new set of uncorrelated variables, or components (uncorrelated thanks to the orthogonality of the eigenvectors), that summarize the information in the original k variables in a smaller number of dimensions. For example, we may have had 10 variables originally that we thought measured “health”, but after doing a PCA, we may find three components that summarize 75% of the information in those original 10 variables, leaving us with a simpler set of variables. Moreover, each component corresponds to a unique and independent subset of the information in the original 10 variables.
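
As a minimal sketch of these mechanics (simulated data; x1, x2, x3, R, and eig are all made-up names, not part of the BRFSS example below), we can carry out the decomposition directly with R’s eigen() function:

#Minimal sketch: PCA as the eigenvalue decomposition of a correlation
#matrix, using simulated data
set.seed(1115)
n<-1000
x1<-rnorm(n)
x2<-.7*x1+rnorm(n, sd=.5)    #correlated with x1
x3<-rnorm(n)                 #mostly independent noise
R<-cor(cbind(x1, x2, x3))    #the correlation matrix R

eig<-eigen(R)
eig$values                   #eigenvalues, ranked largest to smallest
eig$values/sum(eig$values)   #proportion of variance each component explains
eig$vectors                  #eigenvectors = loading vectors

Because x1 and x2 share information, the first eigenvalue should come out well above 1, while the last component mostly mops up noise.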

Variable loadings

Each new component has an eigenvector, which we call a loading vector. We can see how each variable is related to a component by examining that variable’s loading. Loadings can be thought of much like standardized regression coefficients: large positive values indicate a positive association of that variable with the component, large negative values the opposite, and values close to 0 indicate that a particular variable doesn’t contribute to that component. There are no formal tests for loadings, so interpretation can be quite subjective.

Using this in research

So, this formalism is nice, but what does it boil down to in terms of research? That depends on what you’re looking for. Many people use PCA to summarize information on many possibly correlated variables, in the hope of finding a smaller set of components that represents some underlying latent trait they want to measure. Often, people use this method to construct an index for something. While there are lots of ways to build an index, PCA is advantageous because it acknowledges the correlation among the input variables.

For example, Sharkey and Horel (2008) use this method to construct an index of socioeconomic deprivation for census tracts, and Messer et al. (2006) also use this method for a deprivation index in the context of low birth weight outcomes.

General rules

  1. Always use z-scored variables when doing a PCA: scale matters a great deal, since we want correlations going in, not covariances.
  2. If the first eigenvalue is less than 1, your PCA isn’t doing anything for you; it isn’t finding any correlated information among your variables.
  3. Have a look at how much variation your component is summarizing; in aggregate data it will tend to be more than in individual-level survey data.
  4. Scatter plots of the first few PCs can often be informative for finding underlying groups in your data, as observations tend to cluster together in multidimensional space.
  5. You can use the PCs in a regression model if you’ve had problems with multicollinearity among variables. This is called principal component regression (see the sketch after this list).
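
As a sketch of rule 5 (simulated data again; y, x1, x2, x3, pc, and fit_pcr are all hypothetical names), principal component regression swaps a set of collinear predictors for their leading component scores:

#Principal component regression sketch with nearly collinear predictors
set.seed(1115)
n<-500
x1<-rnorm(n)
x2<-x1+rnorm(n, sd=.2)   #nearly collinear with x1
x3<-x1+rnorm(n, sd=.2)   #also nearly collinear
y<-1+x1+rnorm(n)

#z-score the inputs via center= and scale.=, per rule 1
pc<-prcomp(cbind(x1, x2, x3), center=TRUE, scale.=TRUE)
summary(pc)              #how much variance the leading PC captures

#regress y on the first PC score instead of the three collinear x's
fit_pcr<-lm(y~pc$x[,1])
summary(fit_pcr)

With the general rules covered, we load the packages used in the BRFSS example:
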
library(car)
library(stargazer)
library(survey)
library(ggplot2)
library(pander)
library(dplyr)
library(knitr)
load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))

Recode variables

#Days of poor physical health (88 = none, recoded to 0)
brfss_17$healthdays<-Recode(brfss_17$physhlth, recodes = "88=0; 77=NA; 99=NA")

#Days of poor mental health (88 = none, recoded to 0)
brfss_17$healthmdays<-Recode(brfss_17$menthlth, recodes = "88=0; 77=NA; 99=NA")

brfss_17$badhealth<-Recode(brfss_17$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss_17$black<-Recode(brfss_17$racegr3, recodes="2=1; 9=NA; else=0")
brfss_17$white<-Recode(brfss_17$racegr3, recodes="1=1; 9=NA; else=0")
brfss_17$other<-Recode(brfss_17$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss_17$hispanic<-Recode(brfss_17$racegr3, recodes="5=1; 9=NA; else=0")

brfss_17$race_eth<-Recode(brfss_17$racegr3, 
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor = T)
brfss_17$race_eth<-relevel(brfss_17$race_eth, ref = "nhwhite")

#insurance
brfss_17$ins<-Recode(brfss_17$hlthpln1, recodes ="7:9=NA; 1=1;2=0")

#income grouping
brfss_17$inc<-ifelse(brfss_17$incomg==9, NA, brfss_17$incomg)

#education level
brfss_17$educ<-Recode(brfss_17$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor=T)
brfss_17$educ<-relevel(brfss_17$educ, ref='2hsgrad')

#employment
brfss_17$employ<-Recode(brfss_17$employ1,
recodes="1:2='Employed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor=T)
brfss_17$employ<-relevel(brfss_17$employ, ref='Employed')

#marital status
brfss_17$marst<-Recode(brfss_17$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor=T)
brfss_17$marst<-relevel(brfss_17$marst, ref='married')

#Age cut into intervals
brfss_17$agec<-cut(brfss_17$age80, breaks=c(0,24,39,59,79,99))

#BMI: in the BRFSS the bmi5 variable has 2 implied decimal places,
#so we must divide by 100 to get real BMI values

brfss_17$bmi<-brfss_17$bmi5/100

#smoking currently
brfss_17$smoke<-Recode(brfss_17$smoker3, 
recodes="1:2=1; 3:4=0; else=NA")

brfss_17$checkup<-Recode(brfss_17$checkup1, recodes = "1:2 = 1; 3:4=0; 8=0; else=NA")
brfss_17$hbp<-Recode(brfss_17$bphigh4, recodes = "1= 1; 2:4=0; else=NA")
brfss_17$hchol<-Recode(brfss_17$toldhi2, recodes = "1= 1; 2:4=0; else=NA")
brfss_17$heart<-Recode(brfss_17$cvdcrhd4, recodes = "1= 1; 2=0; else=NA")
brfss_17$ast<-Recode(brfss_17$asthma3, recodes = "1= 1; 2:4=0; else=NA")

Analysis

Now we prepare to fit the survey-corrected model. Here I keep just the variables I will use; this keeps the size of things down in terms of memory used.

The variables going into the index are first z-scored (scale() removes the mean and divides by the standard deviation for each variable). We then use the princomp function to do the PCA, specifying the variables we want in our index with a formula and the name of the data; scores=T calculates the PC scores for each individual on each component.

brfss_17b<-brfss_17%>%
  filter(complete.cases(mmsawt, ststr, agec, race_eth, educ, healthdays, healthmdays, smoke, bmi, badhealth, ins, checkup, hbp, hchol, heart, ast))%>%
  select(mmsawt, ststr, agec, race_eth, educ, healthdays, healthmdays, smoke, bmi, badhealth, ins, checkup, hbp, hchol, heart, ast)%>%
  mutate_at(vars(healthdays, healthmdays, smoke, bmi, badhealth, ins, checkup, hbp, hchol, heart, ast), scale)

brfss.pc<-princomp(~healthdays+healthmdays+smoke+bmi+badhealth+ins+checkup+hbp+hchol+heart+ast, data=brfss_17b, scores=T)

#Screeplot
screeplot(brfss.pc, type = "l", main = "Scree Plot")
abline(h=1)

Summary of eigenvalues and variance explained

#Request some basic summaries of the PCA analysis
summary(brfss.pc)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     1.4858353 1.2155981 1.02540858 1.00477616 0.95934569
## Proportion of Variance 0.2007017 0.1343352 0.09558805 0.09178007 0.08366811
## Cumulative Proportion  0.2007017 0.3350369 0.43062493 0.52240500 0.60607311
##                            Comp.6    Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     0.95284360 0.9038006 0.88880562 0.84639833 0.80335441
## Proportion of Variance 0.08253781 0.0742600 0.07181635 0.06512674 0.05867108
## Cumulative Proportion  0.68861093 0.7628709 0.83468727 0.89981401 0.95848509
##                           Comp.11
## Standard deviation     0.67576731
## Proportion of Variance 0.04151491
## Cumulative Proportion  1.00000000

The first four components account for 52% of the variation in the input variables, which isn’t bad. Also, we see 4 eigenvalues of at least 1, which suggests there are 4 real components among these variables (remember, we are looking for eigenvalues > 1 when using z-scored data).
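
Note that princomp reports component standard deviations rather than eigenvalues; the eigenvalues are their squares, so the greater-than-1 rule can be checked directly against the output above:

#eigenvalues are the squared component standard deviations
brfss.pc$sdev^2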

Examine eigenvectors/loading vectors

loadings(brfss.pc)
## 
## Loadings:
##             Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## healthdays   0.483  0.201  0.239         0.122  0.230                0.294
## healthmdays  0.333  0.355  0.223        -0.133  0.140         0.268 -0.724
## smoke        0.128  0.362 -0.135  0.324 -0.627 -0.402 -0.184 -0.349       
## bmi          0.263 -0.145 -0.324 -0.581 -0.131  0.292 -0.168 -0.443 -0.241
## badhealth    0.500  0.166  0.107         0.129  0.189                0.383
## ins                -0.390  0.595                      -0.690              
## checkup            -0.429  0.409        -0.345         0.663 -0.238 -0.125
## hbp          0.347 -0.369 -0.320        -0.175                       0.194
## hchol        0.288 -0.375 -0.270  0.154 -0.164 -0.165         0.609       
## heart        0.269 -0.171 -0.133  0.376  0.592 -0.356        -0.387 -0.331
## ast          0.188  0.105  0.211 -0.613        -0.700         0.120       
##             Comp.10 Comp.11
## healthdays   0.101   0.705 
## healthmdays -0.258  -0.106 
## smoke        0.103         
## bmi          0.282         
## badhealth    0.129  -0.694 
## ins                        
## checkup                    
## hbp         -0.753         
## hchol        0.493         
## heart                      
## ast                        
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091
## Cumulative Var  0.091  0.182  0.273  0.364  0.455  0.545  0.636  0.727  0.818
##                Comp.10 Comp.11
## SS loadings      1.000   1.000
## Proportion Var   0.091   0.091
## Cumulative Var   0.909   1.000

In terms of the loadings for PC1, we see positive loadings on all of the health variables; insurance status and checkup contribute essentially nothing to this component. We may interpret this component as an index of overall health status (with higher scores indicating poorer health), since all of the health problem variables load in the same direction.

For PC2 we see a mixture of loadings, some positive, some negative; the variables loading most heavily here are smoking and poor mental health days (positively) and insurance status and checkup (negatively).

Of course, interpreting components is more art than science…

#Histograms of the first 2 component scores

scores<-data.frame(brfss.pc$scores)
hist(scores$Comp.1)

hist(scores$Comp.2)

Next, I calculate the correlation between the first 2 components to show they are orthogonal, i.e., their correlation is 0.

cor(scores[,1:2])
##              Comp.1       Comp.2
## Comp.1  1.00000e+00 -9.11502e-14
## Comp.2 -9.11502e-14  1.00000e+00
brfss_17c<-cbind(brfss_17b, scores)
names(brfss_17c)
##  [1] "mmsawt"      "ststr"       "agec"        "race_eth"    "educ"       
##  [6] "healthdays"  "healthmdays" "smoke"       "bmi"         "badhealth"  
## [11] "ins"         "checkup"     "hbp"         "hchol"       "heart"      
## [16] "ast"         "Comp.1"      "Comp.2"      "Comp.3"      "Comp.4"     
## [21] "Comp.5"      "Comp.6"      "Comp.7"      "Comp.8"      "Comp.9"     
## [26] "Comp.10"     "Comp.11"

Here we examine correlation among our health variables

round(cor(brfss_17c[,c(-1:-5, -17:-27)]), 3)
##             healthdays healthmdays  smoke    bmi badhealth    ins checkup   hbp
## healthdays       1.000       0.338  0.100  0.125     0.533 -0.001   0.029 0.162
## healthmdays      0.338       1.000  0.156  0.075     0.285 -0.056  -0.035 0.042
## smoke            0.100       0.156  1.000 -0.024     0.120 -0.104  -0.074 0.007
## bmi              0.125       0.075 -0.024  1.000     0.169 -0.008   0.033 0.229
## badhealth        0.533       0.285  0.120  0.169     1.000 -0.037   0.029 0.208
## ins             -0.001      -0.056 -0.104 -0.008    -0.037  1.000   0.171 0.057
## checkup          0.029      -0.035 -0.074  0.033     0.029  0.171   1.000 0.137
## hbp              0.162       0.042  0.007  0.229     0.208  0.057   0.137 1.000
## hchol            0.120       0.043 -0.006  0.124     0.141  0.066   0.124 0.318
## heart            0.154       0.047  0.007  0.050     0.194  0.028   0.049 0.181
## ast              0.129       0.115  0.034  0.097     0.125  0.001   0.006 0.040
##              hchol heart   ast
## healthdays   0.120 0.154 0.129
## healthmdays  0.043 0.047 0.115
## smoke       -0.006 0.007 0.034
## bmi          0.124 0.050 0.097
## badhealth    0.141 0.194 0.125
## ins          0.066 0.028 0.001
## checkup      0.124 0.049 0.006
## hbp          0.318 0.181 0.040
## hchol        1.000 0.167 0.029
## heart        0.167 1.000 0.041
## ast          0.029 0.041 1.000

Sometimes it’s easier to look at the correlations among the original variables and the first 2 components

round(cor(brfss_17c[,c(-1:-5, -19:-27)]),3)
##             healthdays healthmdays  smoke    bmi badhealth    ins checkup
## healthdays       1.000       0.338  0.100  0.125     0.533 -0.001   0.029
## healthmdays      0.338       1.000  0.156  0.075     0.285 -0.056  -0.035
## smoke            0.100       0.156  1.000 -0.024     0.120 -0.104  -0.074
## bmi              0.125       0.075 -0.024  1.000     0.169 -0.008   0.033
## badhealth        0.533       0.285  0.120  0.169     1.000 -0.037   0.029
## ins             -0.001      -0.056 -0.104 -0.008    -0.037  1.000   0.171
## checkup          0.029      -0.035 -0.074  0.033     0.029  0.171   1.000
## hbp              0.162       0.042  0.007  0.229     0.208  0.057   0.137
## hchol            0.120       0.043 -0.006  0.124     0.141  0.066   0.124
## heart            0.154       0.047  0.007  0.050     0.194  0.028   0.049
## ast              0.129       0.115  0.034  0.097     0.125  0.001   0.006
## Comp.1           0.718       0.495  0.190  0.390     0.743  0.013   0.142
## Comp.2           0.245       0.432  0.441 -0.176     0.201 -0.474  -0.522
##                hbp  hchol  heart   ast Comp.1 Comp.2
## healthdays   0.162  0.120  0.154 0.129  0.718  0.245
## healthmdays  0.042  0.043  0.047 0.115  0.495  0.432
## smoke        0.007 -0.006  0.007 0.034  0.190  0.441
## bmi          0.229  0.124  0.050 0.097  0.390 -0.176
## badhealth    0.208  0.141  0.194 0.125  0.743  0.201
## ins          0.057  0.066  0.028 0.001  0.013 -0.474
## checkup      0.137  0.124  0.049 0.006  0.142 -0.522
## hbp          1.000  0.318  0.181 0.040  0.516 -0.448
## hchol        0.318  1.000  0.167 0.029  0.428 -0.456
## heart        0.181  0.167  1.000 0.041  0.399 -0.208
## ast          0.040  0.029  0.041 1.000  0.279  0.128
## Comp.1       0.516  0.428  0.399 0.279  1.000  0.000
## Comp.2      -0.448 -0.456 -0.208 0.128  0.000  1.000

This allows us to see the correlations between the original health variables and the new components, which is important for interpreting the direction of each component. In this case, PC1 has positive correlations with all of the poor-health indicators, so higher PC1 scores indicate worse overall health.
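
Since the sign of an eigenvector is arbitrary, you can flip the component if you would rather have higher scores mean better health (a one-line sketch; hlthindex is a made-up name, not used below):

#reverse the index so higher values indicate better health (hypothetical)
brfss_17c$hlthindex<- -1*brfss_17c$Comp.1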

Using the Principal Components

Next, I will use the PCs we just generated in an analysis.

#Make the survey design object
options(survey.lonely.psu = "adjust")
brfss_17c$pc1q<-cut(brfss_17c$Comp.1, breaks = quantile(brfss_17c$Comp.1,probs = c(0,.25,.5,.75,1) , na.rm=T))
des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data=brfss_17c)
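
The pc1q quartile version of the index isn’t used in the models below, but as a sketch of how it could be used with the survey design (not part of the original analysis), you could compare survey-weighted means across index quartiles:

#sketch: survey-weighted mean of (z-scored) poor self-rated health
#within each quartile of the health index
svyby(~badhealth, ~pc1q, des, svymean, na.rm=TRUE)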

The first analysis will look at variation in the health index across age and education level:

library(ggplot2)
ggplot(aes(x=agec, y=Comp.1), data=brfss_17c)+geom_boxplot()

brfss_17c$educ<-factor(brfss_17c$educ, levels(brfss_17c$educ)[c(1,3,2,4,5)])
ggplot(aes(x=educ, y=Comp.1), data=brfss_17c)+geom_boxplot()

The age plot is nice, and really shows that older ages have higher values of our health index, which confirms our interpretation that higher values of the index are “not good”.

Now we do some hypothesis testing. This model will examine variation in the health index across age, education, and race/ethnicity:

fit.1<-svyglm(Comp.1~agec+educ+(race_eth), des, family=gaussian)
summary(fit.1)
## 
## Call:
## svyglm(formula = Comp.1 ~ agec + educ + (race_eth), design = des, 
##     family = gaussian)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss_17c)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -0.66670    0.02155 -30.940  < 2e-16 ***
## agec(24,39]           0.35395    0.02115  16.733  < 2e-16 ***
## agec(39,59]           0.79859    0.02171  36.779  < 2e-16 ***
## agec(59,79]           1.15987    0.02260  51.329  < 2e-16 ***
## agec(79,99]           1.09390    0.04045  27.045  < 2e-16 ***
## educ0Prim             0.42492    0.06125   6.937 4.01e-12 ***
## educ1somehs           0.52187    0.04100  12.729  < 2e-16 ***
## educ3somecol         -0.11711    0.01875  -6.246 4.23e-10 ***
## educ4colgrad         -0.56946    0.01615 -35.260  < 2e-16 ***
## race_ethhispanic     -0.04645    0.02243  -2.071   0.0384 *  
## race_ethnh black      0.19350    0.02226   8.694  < 2e-16 ***
## race_ethnh multirace  0.33002    0.04798   6.878 6.07e-12 ***
## race_ethnh other     -0.17733    0.03004  -5.903 3.58e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.826043)
## 
## Number of Fisher Scoring iterations: 2

So, overall, the index increases across age, and decreases with higher education. Hispanics and other race/ethnic groups have lower indices on average than whites, while NH blacks and multiple race respondents have higher values.

Aggregate data example

This is an example from a paper some former students and I presented at the 2014 Applied Demography Conference, entitled “Spatial accessibility to food sources in the Alamo Area Council of Governments (AACOG)”, in which we examined factors influencing whether or not a census tract was a food desert in the area surrounding San Antonio. We used the deprivation index from the Sharkey paper cited above.

In this case, we had to do some reverse coding of the index to make it point in the direction it should.

library(rgdal)
## Loading required package: sp
## rgdal: version: 1.4-8, (SVN revision 845)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: /usr/share/gdal/2.2
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: (autodetected)
##  Linking to sp version: 1.4-1
sp<-readOGR(dsn = "/media/corey/0E45-D54F/classes/dem7283/class_20_7283/data", layer = "shp2_nm")
## OGR data source with driver: ESRI Shapefile 
## Source: "/media/corey/0E45-D54F/classes/dem7283/class_20_7283/data", layer: "shp2_nm"
## with 466 features
## It has 220 fields
## Integer64 fields read as strings:  OBJECTI DP00100 B010010 B02001001 B02001002 B02001003 B02001004 B02001005 B02001006 B02001007 B02001008 B03002001 B03002002 B03002003 B03002004 B03002005 B03002006 B03002007 B03002008 B03002009 B03002010 B03002011 B03002012 B03002013 B03002014 B03002015 B03002016 B03002017 B03002018 B03002019 B03002020 B03002021 B17001001 B17001002 B17001003 B17001004 B17001005 B17001006 B17001007 B17001008 B17001009 B17001010 B17001011 B17001012 B17001013 B17001014 B17001015 B17001016 B17001017 B17001018 B17001019 B17001020 B17001021 B17001022 B17001023 B17001024 B17001025 B17001026 B17001027 B17001028 B17001029 B17001030 B17001031 B17001032 B17001033 B17001034 B17001035 B17001036 B17001037 B17001038 B17001039 B17001040 B17001041 B17001042 B17001043 B17001044 B17001045 B17001046 B17001047 B17001048 B17001049 B17001050 B17001051 B17001052 B17001053 B17001054 B17001055 B17001056 B17001057 B17001058 B17001059 B19001001 B19001002 B19001003 B19001004 B19001005 B19001006 B19001007 B19001008 B19001009 B19001010 B19001011 B19001012 B19001013 B19001014 B19001015 B19001016 B19001017 B190130 hhincom LILAT_1A1 LILAT_A LILAT_1A2 LILAT_V Urban Rural LA1nd10 LAhlf10 LA1nd20 LATrct_ LATrct1 LATrc10 LATrc20 HUNVFlg GrpQrtF OHU2010 NUMGQTR LwIncmT POP2010 region region2
#Note: the (-1*p25plsl) term is dropped by the formula interface, so
#p25plsl does not enter the PCA; only six variables appear in the output
#below. To actually reverse-code it, wrap it in I() or recode it in the
#data before calling prcomp.
pcshark<-prcomp(~punemp+pprsnsp+(-1*p25plsl)+gt1prrm+pubasst+phwoveh+phwophn, center=T, scale=T, data=sp, retx=T)
pcshark
## Standard deviations (1, .., p=6):
## [1] 1.8663852 0.8610036 0.8020119 0.7011190 0.6259235 0.4987064
## 
## Rotation (n x k) = (6 x 6):
##                PC1         PC2           PC3        PC4          PC5
## punemp  -0.4018635  0.43356009  0.0505749296 -0.6663923  0.438027381
## pprsnsp -0.4710992 -0.09081256  0.0002533247 -0.1633937 -0.399702491
## gt1prrm -0.3718360  0.12266195  0.8268309160  0.3184658 -0.126928435
## pubasst -0.3623665  0.57277160 -0.4729613865  0.5619221  0.001537651
## phwoveh -0.4362213 -0.35893552 -0.2906211744 -0.1948689 -0.456054569
## phwophn -0.3958671 -0.57604753 -0.0750973274  0.2721635  0.651360860
##                 PC6
## punemp  -0.10966695
## pprsnsp  0.76377943
## gt1prrm -0.21333403
## pubasst -0.03423333
## phwoveh -0.59199319
## phwophn  0.08645738
summary(pcshark)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6
## Standard deviation     1.8664 0.8610 0.8020 0.70112 0.6259 0.49871
## Proportion of Variance 0.5806 0.1235 0.1072 0.08193 0.0653 0.04145
## Cumulative Proportion  0.5806 0.7041 0.8113 0.89325 0.9586 1.00000
#Here I reverse the direction of the index by multiplying by -1
sp$sharkeyindex<- -1*prcomp(~punemp+pprsnsp+(-1*p25plsl)+gt1prrm+pubasst+phwoveh+phwophn, center=T, scale=T, data=sp, retx=T)$x[,1]

library(mapview)
mapview(sp, zcol="sharkeyindex", "Food Insecurity Risk index")
mapview(sp, zcol="LILAT_1A1", "Food Desert")
fit<-glm(LILAT_1A1~scale(plingis)+scale(interct)+Rural+sharkeyindex,family=binomial, sp)
summary(fit)
## 
## Call:
## glm(formula = LILAT_1A1 ~ scale(plingis) + scale(interct) + Rural + 
##     sharkeyindex, family = binomial, data = sp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0309  -0.5892  -0.4166  -0.2578   2.4975  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.41872    0.15210  -9.328  < 2e-16 ***
## scale(plingis)  0.03501    0.21010   0.167   0.8677    
## scale(interct)  0.02544    0.24611   0.103   0.9177    
## Rural1         -1.02594    0.46176  -2.222   0.0263 *  
## sharkeyindex    0.55722    0.10108   5.513 3.54e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 484.62  on 465  degrees of freedom
## Residual deviance: 387.09  on 461  degrees of freedom
## AIC: 397.09
## 
## Number of Fisher Scoring iterations: 5
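
To read these effects on the odds scale, we can exponentiate the coefficients (a quick extra step, not shown in the original output):

#odds ratios for the food desert model
round(exp(coef(fit)), 3)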

Map estimates of fitted probability

sp$fitted<-fit$fitted.values
mapview(sp, zcol="fitted", "Fitted Probability of Being a Food Desert")

So, in areas that had higher values of the index, the odds of being a food desert were also higher, while tracts in rural areas had lower odds of being a food desert.