Background :

The following exam is the Spring 2024, Midterm Performance. Particularly, an Achievement Test. Intended to measure learned knowledge or skills aquired across 4 weeks and reviewed the day before the exam.

The 3 primary topics assessed are :

Probability
Design
Regression

The following analysis & modeling isn’t meant to be an exhuastive sample meant to generalize across Stats 10 Classes–but rather as a demonstration of the utility & comparison between Latent Variable modeling & typical Data Science Methods s.t. we illuminate underlying assumptions and gain a fundamental understanding of the two.

Basics

Population : Stats 10 Std
Sample : Spring, 2024 Stats 10 Std taught by Thomas, with particular TA’s

df <- read.csv("/Users/isaiahmireles/Desktop/Misconceptions/24S-STATS-10-LEC-4_Midterm/S24_Midterm_student_responses copy.csv")
library(tidyverse)

# grab relevant col : 
df <- df |> select(
  Student.ID, Sections, Max.Points, Total.Score, 
  matches(
      "^Question\\.[0-9]+\\.Score$|
       ^Question\\.[0-9]+\\.Student\\.Response\\.s\\.$|
       ^Question [0-9]+ Correct Response$"
    )
  )

How many unique students? nrow(df) = 155?

unique(length(df$Student.ID))

## [1] 155

yes. So, there are 155 unique students – 1 student per row

Anonymization :

Lets hide the identity of students :

df$Student.ID <- paste0("Student", seq_len(nrow(df)))

Correlation Matrix :

binary_matrix <- 
  df |> select(
  matches("^Question\\.[0-9]+\\.Score$")
)

# incomplete obs " cor(binary_matrix) didn work
# whats missing? 
library(naniar)
miss_var_summary(binary_matrix) |> head()

Okay so, we are missing 6 student for each question about 4% of the data. Who are those six students?

# unique(binary_matrix$Question.1.Score)
lapply(as.list(binary_matrix), function(x){which(is.na(x))}) |>
  as.data.frame()

Yeh, so its the same 6 students

cor_matrix <- cor(binary_matrix, use = "complete.obs")
library(corrplot)

## corrplot 0.95 loaded

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  order = "hclust",
  tl.col = "black",
  tl.cex = 0.7
)

Exam as a whole^

Split correlations \(\phi <0 , \phi =0, \phi>0\)

Before we do so, lets get a notion of the typical variation of correlations :

cor_list <- lapply(
  seq_len(nrow(cor_matrix)),
  function(i) cor_matrix[i, -i]
)
names(cor_list) <- paste0("Q", seq_along(cor_list))

Dist of Correlations

sd <- lapply(cor_list, function(x){
  v <- c(sd(x))
  names(v) <- c("sd")
  v
})

mu <- mean(unlist(sd))
sigma <- sd(unlist(sd))


hist(unlist(sd), freq=FALSE, main = "SD of Correlations")

curve(
  dnorm(x, mean = mu, sd = sigma),
  add = TRUE, 
  col = "red"
)

bsc_sum <-
  lapply(cor_list, function(x){
  v <- c(mean(x), sd(x))
  names(v) <- c("mean", "sd")
  v
})

lapply(seq_along(cor_list), function(i) {
  x <- cor_list[[i]]
  stats <- bsc_sum[[i]]

  hist(
    x,
    main = paste("Correlation Hist:", names(cor_list)[i]),
    xlab = "Correlation"
  )

  legend(
    "topright",
    legend = c(
      paste("Mean =", round(stats["mean"], 3)),
      paste("SD =", round(stats["sd"], 3))
    ),
    bty = "n"
  )
})

What happens if I model per question Score to overall exam performance? What does that mean? Wont i get R^2 = 1?

mdl_mat <- cbind(binary_matrix, df$Total.Score)

fct_mat <- binary_matrix |> mutate(across(everything(), factor))
mdl_mat <- cbind(df$Total.Score, fct_mat)
colnames(mdl_mat)[1] <- "Total.Score" #whoops

mdl <- lm(Total.Score~., dat=mdl_mat); summary(mdl); plot(mdl, which = 2)

## 
## Call:
## lm(formula = Total.Score ~ ., data = mdl_mat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.555e-13 -1.138e-14  2.800e-16  1.222e-14  5.456e-14 
## 
## Coefficients:
##                      Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)        -3.963e-14  5.399e-14 -7.340e-01    0.464    
## Question.1.Score1   1.000e+00  1.869e-14  5.351e+13   <2e-16 ***
## Question.2.Score1   1.000e+00  2.113e-14  4.733e+13   <2e-16 ***
## Question.3.Score1   1.000e+00  1.718e-14  5.820e+13   <2e-16 ***
## Question.4.Score1   1.000e+00  8.317e-15  1.202e+14   <2e-16 ***
## Question.5.Score1   1.000e+00  7.978e-15  1.253e+14   <2e-16 ***
## Question.6.Score1   1.000e+00  8.568e-15  1.167e+14   <2e-16 ***
## Question.7.Score1   1.000e+00  3.201e-14  3.124e+13   <2e-16 ***
## Question.8.Score1   1.000e+00  8.942e-15  1.118e+14   <2e-16 ***
## Question.9.Score1   1.000e+00  2.453e-14  4.076e+13   <2e-16 ***
## Question.10.Score1  1.000e+00  8.826e-15  1.133e+14   <2e-16 ***
## Question.11.Score1  1.000e+00  2.444e-14  4.091e+13   <2e-16 ***
## Question.12.Score1  1.000e+00  1.246e-14  8.024e+13   <2e-16 ***
## Question.13.Score1  1.000e+00  1.438e-14  6.954e+13   <2e-16 ***
## Question.14.Score1  1.000e+00  1.124e-14  8.896e+13   <2e-16 ***
## Question.15.Score1  1.000e+00  1.126e-14  8.883e+13   <2e-16 ***
## Question.16.Score1  1.000e+00  1.435e-14  6.968e+13   <2e-16 ***
## Question.17.Score1  1.000e+00  1.848e-14  5.411e+13   <2e-16 ***
## Question.18.Score1  1.000e+00  1.151e-14  8.688e+13   <2e-16 ***
## Question.19.Score1  1.000e+00  1.273e-14  7.858e+13   <2e-16 ***
## Question.20.Score1  1.000e+00  2.332e-14  4.287e+13   <2e-16 ***
## Question.21.Score1  1.000e+00  9.130e-15  1.095e+14   <2e-16 ***
## Question.22.Score1  1.000e+00  7.830e-15  1.277e+14   <2e-16 ***
## Question.23.Score1  1.000e+00  9.165e-15  1.091e+14   <2e-16 ***
## Question.24.Score1  1.000e+00  7.739e-15  1.292e+14   <2e-16 ***
## Question.25.Score1  1.000e+00  1.188e-14  8.418e+13   <2e-16 ***
## Question.26.Score1  1.000e+00  1.175e-14  8.507e+13   <2e-16 ***
## Question.27.Score1  1.000e+00  1.135e-14  8.812e+13   <2e-16 ***
## Question.28.Score1  1.000e+00  1.088e-14  9.190e+13   <2e-16 ***
## Question.29.Score1  1.000e+00  8.383e-15  1.193e+14   <2e-16 ***
## Question.30.Score1  1.000e+00  1.575e-14  6.348e+13   <2e-16 ***
## Question.31.Score1  1.000e+00  9.918e-15  1.008e+14   <2e-16 ***
## Question.32.Score1  1.000e+00  8.533e-15  1.172e+14   <2e-16 ***
## Question.33.Score1  1.000e+00  9.108e-15  1.098e+14   <2e-16 ***
## Question.34.Score1  1.000e+00  7.889e-15  1.268e+14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.957e-14 on 114 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.573e+28 on 34 and 114 DF,  p-value: < 2.2e-16

The error is exactly normally distributed

Inscribe data with Latent Variables :

Latent Variable : Statistical Reasoning ( Midterm 1 )

Probability
Regression
Design & Experimentation

Latent Variables & Data Science

Isaiah C. Mireles

2026-01-30