For this homework assignment, I’ve been asked to measure a latent construct via Principal Component Analysis. My idea for a latent construct is called “academic achievement” – that is, a measurement of the student’s latent ability to perform well in school.
For this, I’ve decided to reduce five seprate variables into the PCA:
These variables are coded as follows:
#reading scores
els$reading_score<-ifelse(els$bytxrstd==-8,NA,els$bytxrstd)
#math scores
els$math_score<-ifelse(els$bytxmstd==-8,NA,els$bytxmstd)
#repeated a grade
els<-els %>% mutate(repeated_grade=case_when(.$bygrdrpt %in% c(1,2,97)~1,
.$bygrdrpt==0~0,
.$bygrdrpt %in% c(98,99)~NA_real_))
#teacher-reported writing ability
els$teach_rep_write<-ifelse(els$bywrtnga==-4,NA_real_,
ifelse(els$bywrtnga==-8,NA_real_,
ifelse(els$bywrtnga==-9,NA_real_,els$bywrtnga)))
#classroom preperation ability
els$class_prep<-ifelse(els$bystprep==-4,NA_real_,
ifelse(els$bystprep==-8,NA_real_,
ifelse(els$bystprep==-9,NA_real_,els$bystprep)))
Here, I am using the above variables to construct a dataset and survey design for our PCA. I’ve run the PCA to get the following results:
dataset<-els %>% select(stu_id,bystuwt,strat_id,reading_score,math_score,repeated_grade,teach_rep_write,class_prep,male,female,race_eth,non_trad_family) %>% filter(complete.cases(.))
options(survey.lonely.psu = "adjust")
design<-svydesign(ids=~stu_id,strata=~strat_id,weights=~bystuwt,data=dataset)
els.pc<-prcomp(~reading_score+math_score+I(-1*repeated_grade)+teach_rep_write+class_prep,data=dataset,center=T,scale=T,retx=T)
summary(els.pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.5527 0.9685 0.9189 0.7369 0.51369
## Proportion of Variance 0.4822 0.1876 0.1689 0.1086 0.05278
## Cumulative Proportion 0.4822 0.6698 0.8386 0.9472 1.00000
Here we see that PC1 is the only PC with an eigenvalue greater than one. This means that PC1 is the only principal component that we are insterested in. We also see that PC1 explains 48% of the variance in our dataset.
Here are the loadings for these principal components:
els.pc$rotation
## PC1 PC2 PC3 PC4
## reading_score -0.5559781 0.09804135 -0.2503550 0.325676998
## math_score -0.5548258 0.11255045 -0.2003589 0.394193289
## I(-1 * repeated_grade) -0.3060927 0.32066392 0.8956866 -0.006302857
## teach_rep_write -0.4904266 -0.01716569 -0.1656205 -0.854085865
## class_prep -0.2210105 -0.93520066 0.2597964 0.095098652
## PC5
## reading_score 0.715914185
## math_score -0.695678555
## I(-1 * repeated_grade) 0.034463788
## teach_rep_write -0.047898610
## class_prep 0.004024553
We see here that all the loadings (or eigenvectors) for PC1 are going in the same direction (negative), which means that they all relate to the component in the same way. Note that “repeated_grade” – which is “bad” the higher it gets – was replaced by I(-1*repeated_grade) in order to bring it in line with the other variables, which are “good” the higher they get.
Scree Plot:
screeplot(els.pc,type="lines",main="Scree Plot")
abline(h=1)
Here we see, once again, that only PC1 has an eigenvalue higher than 1.
Histogram:
hist(els.pc$x[,1])
Here are the corelations and interpretations of the components. First, the corelations of the educational variables:
scores<-data.frame(els.pc$x)
scores$name<-rownames(els.pc$x)
dataset$name<-rownames(dataset)
dataset2<-merge(dataset,scores,by.x="name",by.y="name",all.x=F)
round(cor(dataset2[,c("reading_score","math_score","repeated_grade","teach_rep_write","class_prep")],method="pearson"),3)
## reading_score math_score repeated_grade teach_rep_write
## reading_score 1.000 0.735 -0.256 0.531
## math_score 0.735 1.000 -0.284 0.508
## repeated_grade -0.256 -0.284 1.000 -0.234
## teach_rep_write 0.531 0.508 -0.234 1.000
## class_prep 0.173 0.173 -0.078 0.196
## class_prep
## reading_score 0.173
## math_score 0.173
## repeated_grade -0.078
## teach_rep_write 0.196
## class_prep 1.000
Now we add PC1 to the educational variables:
round(cor(dataset2[,c("reading_score","math_score","repeated_grade","teach_rep_write","class_prep","PC1")],method="pearson"),3)
## reading_score math_score repeated_grade teach_rep_write
## reading_score 1.000 0.735 -0.256 0.531
## math_score 0.735 1.000 -0.284 0.508
## repeated_grade -0.256 -0.284 1.000 -0.234
## teach_rep_write 0.531 0.508 -0.234 1.000
## class_prep 0.173 0.173 -0.078 0.196
## PC1 -0.863 -0.861 0.475 -0.761
## class_prep PC1
## reading_score 0.173 -0.863
## math_score 0.173 -0.861
## repeated_grade -0.078 0.475
## teach_rep_write 0.196 -0.761
## class_prep 1.000 -0.343
## PC1 -0.343 1.000
We observe, then, that PC1 actually has a negative relationship between the positive variables associated with education (reading scores, math scores, teacher assessments of student’s ability and student’s class preperation skills), while having a positive relationshiop with repeated grades.
We can conclude, then, that PC1 is actually an index of academic disadvantage more than anything else.
Now, we can test PC1 using the data we have available:
design2<-svydesign(ids=~stu_id,strata=~strat_id,weights=~bystuwt,data=dataset2)
fit.1<-svyglm(PC1~male+race_eth+non_trad_family,design2,family=gaussian)
summary(fit.1)
##
## Call:
## svyglm(formula = PC1 ~ male + race_eth + non_trad_family, design2,
## family = gaussian)
##
## Survey design:
## svydesign(ids = ~stu_id, strata = ~strat_id, weights = ~bystuwt,
## data = dataset2)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.56339 0.03019 -18.660 < 2e-16 ***
## male 0.23995 0.03608 6.650 3.12e-11 ***
## race_ethasian_pacific 0.03063 0.06970 0.439 0.66
## race_ethblack 1.19664 0.05713 20.945 < 2e-16 ***
## race_ethhispanic 1.14729 0.05446 21.068 < 2e-16 ***
## race_ethmultirace 0.53131 0.09427 5.636 1.80e-08 ***
## race_ethnative_american 1.27443 0.16688 7.637 2.46e-14 ***
## non_trad_family 0.52591 0.03812 13.797 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.053355)
##
## Number of Fisher Scoring iterations: 2
Here, we see that the index tends to increase based on minority status (with the exception of Asians and Pacific Islanders), maleness and non-traditional family status, which is what we would expect from an index of academic disadvantage.