1.(PCA) to perform a variable reduction of at least 5 variables. 2. If you have an idea for latent construct, state what you believe this is. 3. Report the summary statistics and correlation matrix for your data 4. Report the results of the PCA, being sure to include the eigenvalues and corresponding vectors. 5. Interpret your component(s) if possible 6. If deemed appropriate, conduct some testing of your index/components/latent variables.

library(car)
## Loading required package: carData
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x tidyr::pack()   masks Matrix::pack()
## x dplyr::recode() masks car::recode()
## x purrr::some()   masks car::some()
## x tidyr::unpack() masks Matrix::unpack()
library(pander)
library(dplyr)
library(knitr)
library(FactoMineR)
brfss<-readRDS(url("https://github.com/coreysparks/DEM7283/blob/master/data/brfss20sm.rds?raw=true"))
names(brfss)<-tolower(gsub(pattern = "_",replacement = "",x=names(brfss)))
#recode
brfss$healthydays<-Recode(brfss$physhlth,
                          recodes="88=0;77=NA;99=NA")

brfss$mentalhealth<-Recode(brfss$menthlth,
                            recodes="88=0;77=NA;99=NA")

brfss$genhealth<-Recode(brfss$genhlth,
                        recodes="1:3=0; 4:5=1; 7:9=NA")

brfss$ins<-Recode(brfss$hlthpln1,
                    recodes ="7:9=NA; 1=0;2=1")

brfss$doctor<-Recode(brfss$persdoc2,
                    recodes ="7:9=NA; 1=1;2=0;3=2")

brfss$skip<-Recode(brfss$medcost,
                    recodes ="7:9=NA; 1=1;2=0")

brfss$rent<-Recode(brfss$renthom1,
                    recodes ="7:9=NA; 1=1;2=2")
brfss5<-brfss%>%
  filter(complete.cases(ins,doctor,skip,rent,healthydays,mentalhealth,genhealth))%>%
  select(ins,doctor,skip,rent,healthydays,mentalhealth,genhealth)
brfss.pca<-PCA(brfss5,scale.unit = T, graph=F)

eigenvalues<-brfss.pca$eig
head(eigenvalues[,1:2])
##        eigenvalue percentage of variance
## comp 1  1.8545468               26.49353
## comp 2  1.5143452               21.63350
## comp 3  0.8792037               12.56005
## comp 4  0.8340436               11.91491
## comp 5  0.7687423               10.98203
## comp 6  0.6656895                9.50985

The eigenvalues for component 1 and 2 are greater than 1, meaning there are two real components among the variables in the PCA. They represent almost 50% of the variance in the original data.

brfss.pca$var
## $coord
##                  Dim.1      Dim.2       Dim.3       Dim.4      Dim.5
## ins          0.3096720  0.6486317 -0.33165577  0.18154160 -0.1696154
## doctor       0.1354850  0.6692598  0.08423534  0.40833244  0.5193205
## skip         0.5013881  0.3646944 -0.44661595 -0.48437150 -0.1457117
## rent         0.4018932  0.3980329  0.69358680 -0.02725376 -0.4362793
## healthydays  0.6920331 -0.4384679 -0.07868868  0.25970003  0.0126311
## mentalhealth 0.6117223 -0.1028511  0.25711527 -0.45642058  0.4947456
## genhealth    0.6886739 -0.3891642 -0.09635316  0.35103921 -0.1173665
## 
## $cor
##                  Dim.1      Dim.2       Dim.3       Dim.4      Dim.5
## ins          0.3096720  0.6486317 -0.33165577  0.18154160 -0.1696154
## doctor       0.1354850  0.6692598  0.08423534  0.40833244  0.5193205
## skip         0.5013881  0.3646944 -0.44661595 -0.48437150 -0.1457117
## rent         0.4018932  0.3980329  0.69358680 -0.02725376 -0.4362793
## healthydays  0.6920331 -0.4384679 -0.07868868  0.25970003  0.0126311
## mentalhealth 0.6117223 -0.1028511  0.25711527 -0.45642058  0.4947456
## genhealth    0.6886739 -0.3891642 -0.09635316  0.35103921 -0.1173665
## 
## $cos2
##                   Dim.1      Dim.2       Dim.3        Dim.4        Dim.5
## ins          0.09589678 0.42072302 0.109995549 0.0329573537 0.0287693750
## doctor       0.01835618 0.44790871 0.007095592 0.1667353824 0.2696937694
## skip         0.25139006 0.13300201 0.199465809 0.2346157518 0.0212319040
## rent         0.16151810 0.15843021 0.481062645 0.0007427675 0.1903396540
## healthydays  0.47890981 0.19225412 0.006191909 0.0674441045 0.0001595446
## mentalhealth 0.37420412 0.01057836 0.066108262 0.2083197493 0.2447732068
## genhealth    0.47427179 0.15144879 0.009283931 0.1232285288 0.0137748877
## 
## $contrib
##                   Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
## ins           5.1709008 27.7825039 12.5108151  3.95151431  3.74239500
## doctor        0.9897935 29.5777148  0.8070476 19.99120607 35.08246585
## skip         13.5553360  8.7828067 22.6870985 28.12991325  2.76190120
## rent          8.7093030 10.4619610 54.7157213  0.08905619 24.75987645
## healthydays  25.8235489 12.6955281  0.7042633  8.08639998  0.02075398
## mentalhealth 20.1776581  0.6985434  7.5191064 24.97708031 31.84073435
## genhealth    25.5734597 10.0009422  1.0559477 14.77482990  1.79187317

According to this output here, the first component includes the largest values for the number of healthy days reported (healthydays), the number of healthy mental health days reported (mentalhealth) and the general health reported (genhealth). Number of healthy days and general health contribute most to this component according to the $contrib output. This makes sense since you would expect these reports to be correlated. The second component has the largest values for whether or not someone has insurance (ins) and whether or not they have a personal care provider (doctor) which, again, is not surprisingly correlated. The third component only loads with whether or not someone owns their homw (rent) which was a wild card variable I threw into the model to see if it had any correlation with the health outcome variables, but this shows that it is in a component by itself. If I were to create an index of some sort, I would remove rent as a variable.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_pca_var(brfss.pca,col.var="contrib")

This visual shows positive associations with all variables. For PC1 we see whether or not someone has a PC and whether or not someone has insurance loading with the largest coefficients. General health, healthy days and mental health days loading associating with PC2, with general health and healthy days making a larger contribution (according to color) and higher loading (according to length of the arrows).

fviz_pca_ind(brfss.pca,label="none",col.ind = "cos2")+
  scale_color_gradient2(low="pink",mid="turquoise",high="purple",midpoint=.5)+
  theme_minimal()