Principal Components
For this homework, you are to use the technique of Principal Components Analysis
(PCA) to perform a variable reduction of at least 5 variables.
These 5 variables will be if the respondent was born in the US, the sex of the respondent, the income of the respondent when they were 16, if the respondent was ever disrespected, and if the respondent was ever treated like they were not smart.
If you have an idea for latent construct, state what you believe this is.
I beleive that the latent function that exist with these variables is the idea of parental factors and the respodents sex. When observing educational attainment for Hispanics various parental factors effect these outcomes esspecially by generation. So to observe what variables have the most effect while controlling for multicollinarity would be the PCA. The respondents sex is also observed in this technique because of whats expressed in the literature of how these racialized outcomes differ between men and women Hispanics. All variables will be tested to observe which variables should be used when measuring this effect in this dataset.
Report the summary statistics and correlation matrix for your data
Report the results of the PCA, being sure to include the eigenvalues and corresponding vectors.
When observing the outcome of the PCA column 1 (subgroupBorn; if respondent was born in the US or not) and column 2 (subgroupsex; respondents sex) passed the threshold of 1 for the eigenvalue.
The variance accounted for regarding the variable of being born in the US is about 30%. 21% of variance is accounted for the respondents sex.
Interpret your component(s) if possible
With the five Principle Compents observed (5 variables) the eigenvalues of the two variables that should be included that best capture independent variance amongst those variables would be if the respondent was born in the US and the sex of the respondent. So there are 2 components of variation out of the 5 principle components.
If deemed appropriate, conduct some testing of your index/components/latent variables.
When observing the summary of the PCA's; only observing those 2 variables that passed the threshold test of 1 for its eigenvalue, the eigenvectors shown for the coordinates show in Dim 1 weak effects (.06 for born outside of US and .24 for respondents sex being a women).
These low numbers in Dim 1 show that much of the commonaility could be explained by ones perception but moving to Dim 2 those variables of perception (being disrespect and not being smart) drop drastically with large growths in being born outside of the US and respondents sex. This shows that outside of the linear relations found in Dim 1 in Dim 2 observing the orthogonal relationship of the PC's shows that when controlling for that multicollinarity the effect of the sex of the respondent being a woman and the respondent being born outside of the US have heavy effects.
The respondents income at 16 is further validated to be removed because of not reaching the threshold of 1 as well as the negative relationship found in its coordinates. Although the close to 0 value found in Dim 1 for respondent being born outside of the US should normally be called to be removed; it is rather keeping its place in the model because of passing the threshold of 1 as well as the large growth from .06 in Dim 1 to .72 in Dim 2, further expressing its independence from the other variables as well as the true strength of the variable. Same can be said for respondents sex but held better weights in Dim 1.
This shows that from these principle components listed that variables more inline with the respondents charecteristics at birth such as their sex and being born outside of the US are more appropriate variables to use rather than what the respondent percieves or thier income at 16 in regards to educational attainment outcomes of Hispanics.
A couple of important mentions, when observing income16 (respondents income at 16) although it does not reach the 1 threshold and is also a negative in Dim 1 for coord. it does continue to appear as a factor such as in Dim 2 in coord, Dim 2 in contribution, and is shown again in Dim 2 for the top variables that are correlated the most with the original variables (desc$Dim.2 -> .65)
This could mean a variable such as this can play apart in what is occuring but not this specfic variable. Also the high levels of Dim 1 found for percieved disrespect and not being smart are explained by the very high levels of sameness that is expressed. That is why they are at such low levels in Dim 2 being that they explain very little variation amongst those PC's.
library(gtsummary)
library(pander)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.3
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(haven)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(scales)
library(sur)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:gtsummary':
##
## mutate
library(summarytools)
library(Rmisc)
## Loading required package: lattice
library(car)
## Loading required package: carData
##
## Attaching package: 'carData'
## The following objects are masked from 'package:sur':
##
## Anscombe, States
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(forcats)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plyr::arrange() masks dplyr::arrange()
## x readr::col_factor() masks scales::col_factor()
## x purrr::compact() masks plyr::compact()
## x plyr::count() masks dplyr::count()
## x purrr::discard() masks scales::discard()
## x plyr::failwith() masks dplyr::failwith()
## x dplyr::filter() masks stats::filter()
## x plyr::id() masks dplyr::id()
## x dplyr::lag() masks stats::lag()
## x plyr::mutate() masks dplyr::mutate(), gtsummary::mutate()
## x car::recode() masks dplyr::recode()
## x plyr::rename() masks dplyr::rename()
## x purrr::some() masks car::some()
## x plyr::summarise() masks dplyr::summarise()
## x plyr::summarize() masks dplyr::summarize()
## x tibble::view() masks summarytools::view()
library(survey)
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(grid)
library(Matrix)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
gss2021_ZERODraft<-read_dta("C:\\Users\\BTP\\Desktop\\STATS 2 FOLDER\\2021_sas\\gss2021.dta")
recode hispanic
gss2021_ZERODraft$subgrouphis <-Recode(gss2021_ZERODraft$hispanic, recodes="1 = 0; 2:50 = 1; else=NA", as.factor=T)
gss2021_ZERODraft %>%
tabyl(subgrouphis)
## subgrouphis n percent valid_percent
## 0 3544 0.87896825 0.8864432
## 1 454 0.11259921 0.1135568
## <NA> 34 0.00843254 NA
subgrouphis_1<-as.factor(ifelse(gss2021_ZERODraft$subgrouphis==1, "Hispanic", "Non Hispanic"))
tabyl(subgrouphis_1)
## subgrouphis_1 n percent valid_percent
## Hispanic 454 0.11259921 0.1135568
## Non Hispanic 3544 0.87896825 0.8864432
## <NA> 34 0.00843254 NA
recode education outcome variable for Multinomial model for those with less than high school, high school, and college
gss2021_ZERODraft$AllEducLevels <-Recode(gss2021_ZERODraft$educ, recodes="0:11 = 1; 12 = 2; 13:16 = 3; 17:20 = 4; else=NA", as.factor=T)
gss2021_ZERODraft$AllEducLevels<-relevel(gss2021_ZERODraft$AllEducLevels, ref = "1")
gss2021_ZERODraft %>%
tabyl(AllEducLevels)
## AllEducLevels n percent valid_percent
## 1 230 0.05704365 0.05799294
## 2 829 0.20560516 0.20902673
## 3 1969 0.48834325 0.49646999
## 4 938 0.23263889 0.23651034
## <NA> 66 0.01636905 NA
sex
gss2021_ZERODraft$subgroupsex <-Recode(gss2021_ZERODraft$sex, recodes="1:1 = 0; 2:2 = 1; else=NA",)
gss2021_ZERODraft %>%
tabyl(subgroupsex)
## subgroupsex n percent valid_percent
## 0 1736 0.43055556 0.4406091
## 1 2204 0.54662698 0.5593909
## NA 92 0.02281746 NA
subgroupsex_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupsex==1, "Women", "Men"))
tabyl(subgroupsex_1)
## subgroupsex_1 n percent valid_percent
## Men 1736 0.43055556 0.4406091
## Women 2204 0.54662698 0.5593909
## <NA> 92 0.02281746 NA
income of household of respondent when age 16
gss2021_ZERODraft$subgroupincom16 <-Recode(gss2021_ZERODraft$incom16, recodes="1:2 = 0; 3:5 = 1; else=NA",)
gss2021_ZERODraft %>%
tabyl(subgroupincom16)
## subgroupincom16 n percent valid_percent
## 0 1434 0.35565476 0.374804
## 1 2392 0.59325397 0.625196
## NA 206 0.05109127 NA
subgroupincom16_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupincom16==1, "secure economic resources at 16", "insecure economic resources"))
tabyl(subgroupincom16_1)
## subgroupincom16_1 n percent valid_percent
## insecure economic resources 1434 0.35565476 0.374804
## secure economic resources at 16 2392 0.59325397 0.625196
## <NA> 206 0.05109127 NA
if respondent was born in the US
gss2021_ZERODraft$subgroupBorn <-Recode(gss2021_ZERODraft$born, recodes="1:1 = 1; 2:2 = 0; else=NA",)
gss2021_ZERODraft %>%
tabyl(subgroupBorn)
## subgroupBorn n percent valid_percent
## 0 444 0.11011905 0.1121212
## 1 3516 0.87202381 0.8878788
## NA 72 0.01785714 NA
subgroupborn_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupBorn==1, "Born in US", "Not born in US"))
tabyl(subgroupborn_1)
## subgroupborn_1 n percent valid_percent
## Born in US 3516 0.87202381 0.8878788
## Not born in US 444 0.11011905 0.1121212
## <NA> 72 0.01785714 NA
if respondents ever been disrespected
gss2021_ZERODraft$subgroupdisrspct <-Recode(gss2021_ZERODraft$disrspct, recodes="1:5 = 1; 6:6 = 0; else=NA",)
gss2021_ZERODraft %>%
tabyl(subgroupdisrspct)
## subgroupdisrspct n percent valid_percent
## 0 554 0.1374008 0.212995
## 1 2047 0.5076885 0.787005
## NA 1431 0.3549107 NA
subgroupdisrspct_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupdisrspct==1, "respondent has been disrespected", "Not being disrespected"))
tabyl(subgroupdisrspct_1)
## subgroupdisrspct_1 n percent valid_percent
## Not being disrespected 554 0.1374008 0.212995
## respondent has been disrespected 2047 0.5076885 0.787005
## <NA> 1431 0.3549107 NA
if respondents ever been called or treated like they were not smart.
gss2021_ZERODraft$subgroupnotsmart <-Recode(gss2021_ZERODraft$notsmart, recodes="1:5 = 1; 6:6 = 0; else=NA",)
gss2021_ZERODraft %>%
tabyl(subgroupnotsmart)
## subgroupnotsmart n percent valid_percent
## 0 867 0.2150298 0.3332052
## 1 1735 0.4303075 0.6667948
## NA 1430 0.3546627 NA
subgroupnotsmart_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupnotsmart==1, "respondent was told or treated as if they are not smart", "Never experienced that sort of treatment of not being smart"))
tabyl(subgroupnotsmart_1)
## subgroupnotsmart_1 n percent
## Never experienced that sort of treatment of not being smart 867 0.2150298
## respondent was told or treated as if they are not smart 1735 0.4303075
## <NA> 1430 0.3546627
## valid_percent
## 0.3332052
## 0.6667948
## NA
age cut into intervals
age1<-cut(gss2021_ZERODraft$age,
breaks = c(0,24,39,59,79,99))
Filtering Data for PCA
gss20b<-gss2021_ZERODraft%>%
filter(complete.cases(AllEducLevels,subgrouphis,subgroupBorn,subgroupsex,subgroupincom16,subgroupdisrspct,subgroupnotsmart,age,vstrat,wtssnrps)) %>%
select(AllEducLevels,subgrouphis,subgroupBorn,subgroupsex,subgroupincom16,subgroupdisrspct,subgroupnotsmart,age,vstrat,wtssnrps) %>%
mutate_at(vars(subgroupBorn,subgroupincom16,subgroupdisrspct,subgroupnotsmart,subgroupsex),scale)
samps<- sample(1:dim(gss20b)[1], replace = FALSE)
gss20b<- gss20b[samps,]
PCA
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.3
gss.pc<-PCA(gss20b[, c(3:7)],
scale.unit = T,
graph = F)
eigenvalues <-gss.pc$eig
head(eigenvalues[, 1:2])
## eigenvalue percentage of variance
## comp 1 1.5428347 30.85669
## comp 2 1.0880438 21.76088
## comp 3 0.9642112 19.28422
## comp 4 0.9016333 18.03267
## comp 5 0.5032771 10.06554
fviz_screeplot(gss.pc, ncp=10)

gss.pc$var
## $coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## subgroupBorn 0.06259212 0.721071718 -0.35410000 -0.59173087 0.02460844
## subgroupsex 0.24607302 0.368337998 0.88913214 -0.10729872 0.04130641
## subgroupincom16 -0.24326640 0.657123957 -0.10516110 0.70482652 -0.03420950
## subgroupdisrspct 0.83384114 0.003904857 -0.16567078 0.18492474 0.49300075
## subgroupnotsmart 0.85082045 0.024480720 -0.09880652 0.09485404 -0.50670034
##
## $cor
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## subgroupBorn 0.06259212 0.721071718 -0.35410000 -0.59173087 0.02460844
## subgroupsex 0.24607302 0.368337998 0.88913214 -0.10729872 0.04130641
## subgroupincom16 -0.24326640 0.657123957 -0.10516110 0.70482652 -0.03420950
## subgroupdisrspct 0.83384114 0.003904857 -0.16567078 0.18492474 0.49300075
## subgroupnotsmart 0.85082045 0.024480720 -0.09880652 0.09485404 -0.50670034
##
## $cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## subgroupBorn 0.003917774 5.199444e-01 0.125386809 0.35014542 0.0006055753
## subgroupsex 0.060551929 1.356729e-01 0.790555955 0.01151302 0.0017062199
## subgroupincom16 0.059178540 4.318119e-01 0.011058858 0.49678042 0.0011702897
## subgroupdisrspct 0.695291049 1.524791e-05 0.027446807 0.03419716 0.2430497354
## subgroupnotsmart 0.723895439 5.993056e-04 0.009762728 0.00899729 0.2567452380
##
## $contrib
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## subgroupBorn 0.2539335 47.787087807 13.004082 38.8345703 0.1203264
## subgroupsex 3.9247191 12.469432483 81.989920 1.2769066 0.3390220
## subgroupincom16 3.8357019 39.686997281 1.146933 55.0978338 0.2325339
## subgroupdisrspct 45.0658152 0.001401406 2.846556 3.7928014 48.2934263
## subgroupnotsmart 46.9198304 0.055081023 1.012509 0.9978879 51.0146914
fviz_pca_var(gss.pc,
col.var = "contrib")+
theme_minimal()

fviz_pca_ind(gss.pc,
label="none",
col.ind="cos2")+
scale_color_gradient2(low = "blue",
mid = "white",
high = "red",
midpoint = .5)+
theme_minimal()

desc <- dimdesc(gss.pc)
desc$Dim.1
## $quanti
## correlation p.value
## subgroupnotsmart 0.85082045 0.000000e+00
## subgroupdisrspct 0.83384114 0.000000e+00
## subgroupsex 0.24607302 9.154763e-34
## subgroupBorn 0.06259212 2.395212e-03
## subgroupincom16 -0.24326640 5.148792e-33
##
## attr(,"class")
## [1] "condes" "list"
desc$Dim.2
## $quanti
## correlation p.value
## subgroupBorn 0.7210717 0.000000e+00
## subgroupincom16 0.6571240 1.120930e-290
## subgroupsex 0.3683380 1.894575e-76
##
## attr(,"class")
## [1] "condes" "list"