PCA: Idea for a latent construct - academic performance / academic disadvantage

For this homework assignment, I’ve been asked to measure a latent construct via Principal Component Analysis. My idea for a latent construct is called “academic achievement” – that is, a measurement of the student’s latent ability to perform well in school.

For this, I’ve decided to reduce five seprate variables into the PCA:

  1. Student’s standardized reading score
  2. Student’s standardized math score
  3. If the student has ever repeated a grade
  4. The teacher’s assessment of how well the student can write
  5. The student’s ability to prepare for daily classroom activities

These variables are coded as follows:

#reading scores
els$reading_score<-ifelse(els$bytxrstd==-8,NA,els$bytxrstd)

#math scores
els$math_score<-ifelse(els$bytxmstd==-8,NA,els$bytxmstd)

#repeated a grade
els<-els %>% mutate(repeated_grade=case_when(.$bygrdrpt %in% c(1,2,97)~1,
                                             .$bygrdrpt==0~0,
                                             .$bygrdrpt %in% c(98,99)~NA_real_))

#teacher-reported writing ability
els$teach_rep_write<-ifelse(els$bywrtnga==-4,NA_real_,
                            ifelse(els$bywrtnga==-8,NA_real_,
                                   ifelse(els$bywrtnga==-9,NA_real_,els$bywrtnga)))

#classroom preperation ability
els$class_prep<-ifelse(els$bystprep==-4,NA_real_,
                            ifelse(els$bystprep==-8,NA_real_,
                                   ifelse(els$bystprep==-9,NA_real_,els$bystprep)))

Creating a dataset, design and principal components analysis

Here, I am using the above variables to construct a dataset and survey design for our PCA. I’ve run the PCA to get the following results:

dataset<-els %>% select(stu_id,bystuwt,strat_id,reading_score,math_score,repeated_grade,teach_rep_write,class_prep,male,female,race_eth,non_trad_family) %>% filter(complete.cases(.))
options(survey.lonely.psu = "adjust")
design<-svydesign(ids=~stu_id,strata=~strat_id,weights=~bystuwt,data=dataset)

els.pc<-prcomp(~reading_score+math_score+I(-1*repeated_grade)+teach_rep_write+class_prep,data=dataset,center=T,scale=T,retx=T)
summary(els.pc)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.5527 0.9685 0.9189 0.7369 0.51369
## Proportion of Variance 0.4822 0.1876 0.1689 0.1086 0.05278
## Cumulative Proportion  0.4822 0.6698 0.8386 0.9472 1.00000

Here we see that PC1 is the only PC with an eigenvalue greater than one. This means that PC1 is the only principal component that we are insterested in. We also see that PC1 explains 48% of the variance in our dataset.

Here are the loadings for these principal components:

els.pc$rotation
##                               PC1         PC2        PC3          PC4
## reading_score          -0.5559781  0.09804135 -0.2503550  0.325676998
## math_score             -0.5548258  0.11255045 -0.2003589  0.394193289
## I(-1 * repeated_grade) -0.3060927  0.32066392  0.8956866 -0.006302857
## teach_rep_write        -0.4904266 -0.01716569 -0.1656205 -0.854085865
## class_prep             -0.2210105 -0.93520066  0.2597964  0.095098652
##                                 PC5
## reading_score           0.715914185
## math_score             -0.695678555
## I(-1 * repeated_grade)  0.034463788
## teach_rep_write        -0.047898610
## class_prep              0.004024553

We see here that all the loadings (or eigenvectors) for PC1 are going in the same direction (negative), which means that they all relate to the component in the same way. Note that “repeated_grade” – which is “bad” the higher it gets – was replaced by I(-1*repeated_grade) in order to bring it in line with the other variables, which are “good” the higher they get.

Scree Plot:

screeplot(els.pc,type="lines",main="Scree Plot")
abline(h=1)

Here we see, once again, that only PC1 has an eigenvalue higher than 1.

Histogram:

hist(els.pc$x[,1])

Coorelation for the first two principal components: are they coorelated?

Let’s find out if the first two principal components are orthoginal to each other:

cor(els.pc$x[,1:2])
##              PC1          PC2
## PC1 1.000000e+00 6.309782e-17
## PC2 6.309782e-17 1.000000e+00

It does indeed appear that they are orthognal to each other!

Coorelations and Interpretation of Components

Here are the corelations and interpretations of the components. First, the corelations of the educational variables:

scores<-data.frame(els.pc$x)
scores$name<-rownames(els.pc$x)
dataset$name<-rownames(dataset)
dataset2<-merge(dataset,scores,by.x="name",by.y="name",all.x=F)
round(cor(dataset2[,c("reading_score","math_score","repeated_grade","teach_rep_write","class_prep")],method="pearson"),3)
##                 reading_score math_score repeated_grade teach_rep_write
## reading_score           1.000      0.735         -0.256           0.531
## math_score              0.735      1.000         -0.284           0.508
## repeated_grade         -0.256     -0.284          1.000          -0.234
## teach_rep_write         0.531      0.508         -0.234           1.000
## class_prep              0.173      0.173         -0.078           0.196
##                 class_prep
## reading_score        0.173
## math_score           0.173
## repeated_grade      -0.078
## teach_rep_write      0.196
## class_prep           1.000

Now we add PC1 to the educational variables:

round(cor(dataset2[,c("reading_score","math_score","repeated_grade","teach_rep_write","class_prep","PC1")],method="pearson"),3)
##                 reading_score math_score repeated_grade teach_rep_write
## reading_score           1.000      0.735         -0.256           0.531
## math_score              0.735      1.000         -0.284           0.508
## repeated_grade         -0.256     -0.284          1.000          -0.234
## teach_rep_write         0.531      0.508         -0.234           1.000
## class_prep              0.173      0.173         -0.078           0.196
## PC1                    -0.863     -0.861          0.475          -0.761
##                 class_prep    PC1
## reading_score        0.173 -0.863
## math_score           0.173 -0.861
## repeated_grade      -0.078  0.475
## teach_rep_write      0.196 -0.761
## class_prep           1.000 -0.343
## PC1                 -0.343  1.000

We observe, then, that PC1 actually has a negative relationship between the positive variables associated with education (reading scores, math scores, teacher assessments of student’s ability and student’s class preperation skills), while having a positive relationshiop with repeated grades.

We can conclude, then, that PC1 is actually an index of academic disadvantage more than anything else.

Testing PC1

Now, we can test PC1 using the data we have available:

design2<-svydesign(ids=~stu_id,strata=~strat_id,weights=~bystuwt,data=dataset2)
fit.1<-svyglm(PC1~male+race_eth+non_trad_family,design2,family=gaussian)
summary(fit.1)
## 
## Call:
## svyglm(formula = PC1 ~ male + race_eth + non_trad_family, design2, 
##     family = gaussian)
## 
## Survey design:
## svydesign(ids = ~stu_id, strata = ~strat_id, weights = ~bystuwt, 
##     data = dataset2)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -0.56339    0.03019 -18.660  < 2e-16 ***
## male                     0.23995    0.03608   6.650 3.12e-11 ***
## race_ethasian_pacific    0.03063    0.06970   0.439     0.66    
## race_ethblack            1.19664    0.05713  20.945  < 2e-16 ***
## race_ethhispanic         1.14729    0.05446  21.068  < 2e-16 ***
## race_ethmultirace        0.53131    0.09427   5.636 1.80e-08 ***
## race_ethnative_american  1.27443    0.16688   7.637 2.46e-14 ***
## non_trad_family          0.52591    0.03812  13.797  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.053355)
## 
## Number of Fisher Scoring iterations: 2

Here, we see that the index tends to increase based on minority status (with the exception of Asians and Pacific Islanders), maleness and non-traditional family status, which is what we would expect from an index of academic disadvantage.