Homework 7 STATS2

Principal Components
For this homework, you are to use the technique of Principal Components Analysis
(PCA) to perform a variable reduction of at least 5 variables.

These 5 variables will be if the respondent was born in the US, the sex of the respondent, the income of the respondent when they were 16, if the respondent was ever disrespected, and if the respondent was ever treated like they were not smart.

If you have an idea for latent construct, state what you believe this is.
I beleive that the latent function that exist with these variables is the idea of parental factors and the respodents sex. When observing educational attainment for Hispanics various parental factors effect these outcomes esspecially by generation. So to observe what variables have the most effect while controlling for multicollinarity would be the PCA. The respondents sex is also observed in this technique because of whats expressed in the literature of how these racialized outcomes differ between men and women Hispanics. All variables will be tested to observe which variables should be used when measuring this effect in this dataset. 


Report the summary statistics and correlation matrix for your data


Report the results of the PCA, being sure to include the eigenvalues and corresponding vectors.

When observing the outcome of the PCA column 1 (subgroupBorn; if respondent was born in the US or not) and column 2 (subgroupsex; respondents sex) passed the threshold of 1 for the eigenvalue. 

The variance accounted for regarding the variable of being born in the US is about 30%. 21% of variance is accounted for the respondents sex. 

Interpret your component(s) if possible

With the five Principle Compents observed (5 variables) the eigenvalues of the two variables that should be included that best capture independent variance amongst those variables would be if the respondent was born in the US and the sex of the respondent. So there are 2 components of variation out of the 5 principle components. 

If deemed appropriate, conduct some testing of your index/components/latent variables.

When observing the summary of the PCA's; only observing those 2 variables that passed the threshold test of 1 for its eigenvalue, the eigenvectors shown for the coordinates show in Dim 1 weak effects (.06 for born outside of US and .24 for respondents sex being a women). 

These low numbers in Dim 1 show that much of the commonaility could be explained by ones perception but moving to Dim 2 those variables of perception (being disrespect and not being smart) drop drastically with large growths in being born outside of the US and respondents sex. This shows that outside of the linear relations found in Dim 1 in Dim 2 observing the orthogonal relationship of the PC's shows that when controlling for that multicollinarity the effect of the sex of the respondent being a woman and the respondent being born outside of the US have heavy effects. 

The respondents income at 16 is further validated to be removed because of not reaching the threshold of 1 as well as the negative relationship found in its coordinates. Although the close to 0 value found in Dim 1 for respondent being born outside of the US should normally be called to be removed; it is rather keeping its place in the model because of passing the threshold of 1 as well as the large growth from .06 in Dim 1 to .72 in Dim 2, further expressing its independence from the other variables as well as the true strength of the variable. Same can be said for respondents sex but held better weights in Dim 1. 

This shows that from these principle components listed that variables more inline with the respondents charecteristics at birth such as their sex and being born outside of the US are more appropriate variables to use rather than what the respondent percieves or thier income at 16 in regards to educational attainment outcomes of Hispanics.

A couple of important mentions, when observing income16 (respondents income at 16) although it does not reach the 1 threshold and is also a negative in Dim 1 for coord. it does continue to appear as a factor such as in Dim 2 in coord, Dim 2 in contribution, and is shown again in Dim 2 for the top variables that are correlated the most with the original variables (desc$Dim.2 -> .65) 

This could mean a variable such as this can play apart in what is occuring but not this specfic variable. Also the high levels of Dim 1 found for percieved disrespect and not being smart are explained by the very high levels of sameness that is expressed. That is why they are at such low levels in Dim 2 being that they explain very little variation amongst those PC's.

library(gtsummary)
library(pander)
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.1.3

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(haven)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(scales)
library(sur)
library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:gtsummary':
## 
##     mutate

library(summarytools)
library(Rmisc)

## Loading required package: lattice

library(car)

## Loading required package: carData

## 
## Attaching package: 'carData'

## The following objects are masked from 'package:sur':
## 
##     Anscombe, States

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(forcats)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.6     v purrr   0.3.4
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plyr::arrange()     masks dplyr::arrange()
## x readr::col_factor() masks scales::col_factor()
## x purrr::compact()    masks plyr::compact()
## x plyr::count()       masks dplyr::count()
## x purrr::discard()    masks scales::discard()
## x plyr::failwith()    masks dplyr::failwith()
## x dplyr::filter()     masks stats::filter()
## x plyr::id()          masks dplyr::id()
## x dplyr::lag()        masks stats::lag()
## x plyr::mutate()      masks dplyr::mutate(), gtsummary::mutate()
## x car::recode()       masks dplyr::recode()
## x plyr::rename()      masks dplyr::rename()
## x purrr::some()       masks car::some()
## x plyr::summarise()   masks dplyr::summarise()
## x plyr::summarize()   masks dplyr::summarize()
## x tibble::view()      masks summarytools::view()

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

library(grid)
library(Matrix)
library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

## The following object is masked from 'package:purrr':
## 
##     lift

gss2021_ZERODraft<-read_dta("C:\\Users\\BTP\\Desktop\\STATS 2 FOLDER\\2021_sas\\gss2021.dta")

recode hispanic

gss2021_ZERODraft$subgrouphis <-Recode(gss2021_ZERODraft$hispanic, recodes="1 = 0; 2:50 = 1; else=NA", as.factor=T)

gss2021_ZERODraft %>%
  
tabyl(subgrouphis)

##  subgrouphis    n    percent valid_percent
##            0 3544 0.87896825     0.8864432
##            1  454 0.11259921     0.1135568
##         <NA>   34 0.00843254            NA

subgrouphis_1<-as.factor(ifelse(gss2021_ZERODraft$subgrouphis==1, "Hispanic", "Non Hispanic"))

tabyl(subgrouphis_1)

##  subgrouphis_1    n    percent valid_percent
##       Hispanic  454 0.11259921     0.1135568
##   Non Hispanic 3544 0.87896825     0.8864432
##           <NA>   34 0.00843254            NA

recode education outcome variable for Multinomial model for those with less than high school, high school, and college

gss2021_ZERODraft$AllEducLevels <-Recode(gss2021_ZERODraft$educ, recodes="0:11 = 1; 12 = 2; 13:16 = 3; 17:20 = 4; else=NA", as.factor=T)

gss2021_ZERODraft$AllEducLevels<-relevel(gss2021_ZERODraft$AllEducLevels, ref = "1")

gss2021_ZERODraft %>%
  tabyl(AllEducLevels)

##  AllEducLevels    n    percent valid_percent
##              1  230 0.05704365    0.05799294
##              2  829 0.20560516    0.20902673
##              3 1969 0.48834325    0.49646999
##              4  938 0.23263889    0.23651034
##           <NA>   66 0.01636905            NA

sex

gss2021_ZERODraft$subgroupsex <-Recode(gss2021_ZERODraft$sex, recodes="1:1 = 0; 2:2 = 1; else=NA",)

gss2021_ZERODraft %>%
  
tabyl(subgroupsex)

##  subgroupsex    n    percent valid_percent
##            0 1736 0.43055556     0.4406091
##            1 2204 0.54662698     0.5593909
##           NA   92 0.02281746            NA

subgroupsex_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupsex==1, "Women", "Men"))

tabyl(subgroupsex_1)

##  subgroupsex_1    n    percent valid_percent
##            Men 1736 0.43055556     0.4406091
##          Women 2204 0.54662698     0.5593909
##           <NA>   92 0.02281746            NA

income of household of respondent when age 16

gss2021_ZERODraft$subgroupincom16 <-Recode(gss2021_ZERODraft$incom16, recodes="1:2 = 0; 3:5 = 1; else=NA",)

gss2021_ZERODraft %>%
  
tabyl(subgroupincom16)

##  subgroupincom16    n    percent valid_percent
##                0 1434 0.35565476      0.374804
##                1 2392 0.59325397      0.625196
##               NA  206 0.05109127            NA

subgroupincom16_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupincom16==1, "secure economic resources at 16", "insecure economic resources"))

tabyl(subgroupincom16_1)

##                subgroupincom16_1    n    percent valid_percent
##      insecure economic resources 1434 0.35565476      0.374804
##  secure economic resources at 16 2392 0.59325397      0.625196
##                             <NA>  206 0.05109127            NA

if respondent was born in the US

gss2021_ZERODraft$subgroupBorn <-Recode(gss2021_ZERODraft$born, recodes="1:1 = 1; 2:2 = 0; else=NA",)

gss2021_ZERODraft %>%
  
tabyl(subgroupBorn)

##  subgroupBorn    n    percent valid_percent
##             0  444 0.11011905     0.1121212
##             1 3516 0.87202381     0.8878788
##            NA   72 0.01785714            NA

subgroupborn_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupBorn==1, "Born in US", "Not born in US"))

tabyl(subgroupborn_1)

##  subgroupborn_1    n    percent valid_percent
##      Born in US 3516 0.87202381     0.8878788
##  Not born in US  444 0.11011905     0.1121212
##            <NA>   72 0.01785714            NA

if respondents ever been disrespected

gss2021_ZERODraft$subgroupdisrspct <-Recode(gss2021_ZERODraft$disrspct, recodes="1:5 = 1; 6:6 = 0; else=NA",)

gss2021_ZERODraft %>%
  
tabyl(subgroupdisrspct)

##  subgroupdisrspct    n   percent valid_percent
##                 0  554 0.1374008      0.212995
##                 1 2047 0.5076885      0.787005
##                NA 1431 0.3549107            NA

subgroupdisrspct_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupdisrspct==1, "respondent has been disrespected", "Not being disrespected"))

tabyl(subgroupdisrspct_1)

##                subgroupdisrspct_1    n   percent valid_percent
##            Not being disrespected  554 0.1374008      0.212995
##  respondent has been disrespected 2047 0.5076885      0.787005
##                              <NA> 1431 0.3549107            NA

if respondents ever been called or treated like they were not smart.

gss2021_ZERODraft$subgroupnotsmart <-Recode(gss2021_ZERODraft$notsmart, recodes="1:5 = 1; 6:6 = 0; else=NA",)

gss2021_ZERODraft %>%
  
tabyl(subgroupnotsmart)

##  subgroupnotsmart    n   percent valid_percent
##                 0  867 0.2150298     0.3332052
##                 1 1735 0.4303075     0.6667948
##                NA 1430 0.3546627            NA

subgroupnotsmart_1<-as.factor(ifelse(gss2021_ZERODraft$subgroupnotsmart==1, "respondent was told or treated as if they are not smart", "Never experienced that sort of treatment of not being smart"))

tabyl(subgroupnotsmart_1)

##                                           subgroupnotsmart_1    n   percent
##  Never experienced that sort of treatment of not being smart  867 0.2150298
##      respondent was told or treated as if they are not smart 1735 0.4303075
##                                                         <NA> 1430 0.3546627
##  valid_percent
##      0.3332052
##      0.6667948
##             NA

age cut into intervals

age1<-cut(gss2021_ZERODraft$age,
          breaks = c(0,24,39,59,79,99))

Filtering Data for PCA

gss20b<-gss2021_ZERODraft%>% 
  filter(complete.cases(AllEducLevels,subgrouphis,subgroupBorn,subgroupsex,subgroupincom16,subgroupdisrspct,subgroupnotsmart,age,vstrat,wtssnrps)) %>% 
  select(AllEducLevels,subgrouphis,subgroupBorn,subgroupsex,subgroupincom16,subgroupdisrspct,subgroupnotsmart,age,vstrat,wtssnrps) %>% 
  mutate_at(vars(subgroupBorn,subgroupincom16,subgroupdisrspct,subgroupnotsmart,subgroupsex),scale)

samps<- sample(1:dim(gss20b)[1], replace = FALSE)

gss20b<- gss20b[samps,]

PCA

library(FactoMineR)

## Warning: package 'FactoMineR' was built under R version 4.1.3

gss.pc<-PCA(gss20b[, c(3:7)],
            scale.unit = T,
            graph = F) 
eigenvalues <-gss.pc$eig
head(eigenvalues[, 1:2])

##        eigenvalue percentage of variance
## comp 1  1.5428347               30.85669
## comp 2  1.0880438               21.76088
## comp 3  0.9642112               19.28422
## comp 4  0.9016333               18.03267
## comp 5  0.5032771               10.06554

fviz_screeplot(gss.pc, ncp=10)

gss.pc$var

## $coord
##                        Dim.1       Dim.2       Dim.3       Dim.4       Dim.5
## subgroupBorn      0.06259212 0.721071718 -0.35410000 -0.59173087  0.02460844
## subgroupsex       0.24607302 0.368337998  0.88913214 -0.10729872  0.04130641
## subgroupincom16  -0.24326640 0.657123957 -0.10516110  0.70482652 -0.03420950
## subgroupdisrspct  0.83384114 0.003904857 -0.16567078  0.18492474  0.49300075
## subgroupnotsmart  0.85082045 0.024480720 -0.09880652  0.09485404 -0.50670034
## 
## $cor
##                        Dim.1       Dim.2       Dim.3       Dim.4       Dim.5
## subgroupBorn      0.06259212 0.721071718 -0.35410000 -0.59173087  0.02460844
## subgroupsex       0.24607302 0.368337998  0.88913214 -0.10729872  0.04130641
## subgroupincom16  -0.24326640 0.657123957 -0.10516110  0.70482652 -0.03420950
## subgroupdisrspct  0.83384114 0.003904857 -0.16567078  0.18492474  0.49300075
## subgroupnotsmart  0.85082045 0.024480720 -0.09880652  0.09485404 -0.50670034
## 
## $cos2
##                        Dim.1        Dim.2       Dim.3      Dim.4        Dim.5
## subgroupBorn     0.003917774 5.199444e-01 0.125386809 0.35014542 0.0006055753
## subgroupsex      0.060551929 1.356729e-01 0.790555955 0.01151302 0.0017062199
## subgroupincom16  0.059178540 4.318119e-01 0.011058858 0.49678042 0.0011702897
## subgroupdisrspct 0.695291049 1.524791e-05 0.027446807 0.03419716 0.2430497354
## subgroupnotsmart 0.723895439 5.993056e-04 0.009762728 0.00899729 0.2567452380
## 
## $contrib
##                       Dim.1        Dim.2     Dim.3      Dim.4      Dim.5
## subgroupBorn      0.2539335 47.787087807 13.004082 38.8345703  0.1203264
## subgroupsex       3.9247191 12.469432483 81.989920  1.2769066  0.3390220
## subgroupincom16   3.8357019 39.686997281  1.146933 55.0978338  0.2325339
## subgroupdisrspct 45.0658152  0.001401406  2.846556  3.7928014 48.2934263
## subgroupnotsmart 46.9198304  0.055081023  1.012509  0.9978879 51.0146914

fviz_pca_var(gss.pc,
             col.var = "contrib")+
  theme_minimal()

fviz_pca_ind(gss.pc,
             label="none",
                   col.ind="cos2")+
               scale_color_gradient2(low = "blue",
                                     mid = "white",
                                     high = "red",
                                     midpoint = .5)+
               theme_minimal()

desc <- dimdesc(gss.pc)

desc$Dim.1

## $quanti
##                  correlation      p.value
## subgroupnotsmart  0.85082045 0.000000e+00
## subgroupdisrspct  0.83384114 0.000000e+00
## subgroupsex       0.24607302 9.154763e-34
## subgroupBorn      0.06259212 2.395212e-03
## subgroupincom16  -0.24326640 5.148792e-33
## 
## attr(,"class")
## [1] "condes" "list"

desc$Dim.2

## $quanti
##                 correlation       p.value
## subgroupBorn      0.7210717  0.000000e+00
## subgroupincom16   0.6571240 1.120930e-290
## subgroupsex       0.3683380  1.894575e-76
## 
## attr(,"class")
## [1] "condes" "list"

Homework 7 STATS2

Brandon Flores

March 28, 2022

recode hispanic

recode education outcome variable for Multinomial model for those with less than high school, high school, and college

sex

income of household of respondent when age 16

if respondent was born in the US

if respondents ever been disrespected

if respondents ever been called or treated like they were not smart.

age cut into intervals

Filtering Data for PCA

PCA