The following exam is the Spring 2024, Midterm Performance. Particularly, an Achievement Test. Intended to measure learned knowledge or skills aquired across 4 weeks and reviewed the day before the exam.
The 3 primary topics assessed are :
Probability
Design
Regression
The following analysis & modeling isn’t meant to be an exhuastive sample meant to generalize across Stats 10 Classes–but rather as a demonstration of the utility & comparison between Latent Variable modeling & typical Data Science Methods s.t. we illuminate underlying assumptions and gain a fundamental understanding of the two.
Population : Stats 10 Std
Sample : Spring, 2024 Stats 10 Std taught by Thomas, with particular
TA’s
df <- read.csv("/Users/isaiahmireles/Desktop/Misconceptions/24S-STATS-10-LEC-4_Midterm/S24_Midterm_student_responses copy.csv")
library(tidyverse)
# grab relevant col :
df <- df |> select(
Student.ID, Sections, Max.Points, Total.Score,
matches(
"^Question\\.[0-9]+\\.Score$|
^Question\\.[0-9]+\\.Student\\.Response\\.s\\.$|
^Question [0-9]+ Correct Response$"
)
)
How many unique students? nrow(df) = 155?
unique(length(df$Student.ID))
## [1] 155
yes. So, there are 155 unique students – 1 student per row
Lets hide the identity of students :
df$Student.ID <- paste0("Student", seq_len(nrow(df)))
binary_matrix <-
df |> select(
matches("^Question\\.[0-9]+\\.Score$")
)
# incomplete obs " cor(binary_matrix) didn work
# whats missing?
library(naniar)
miss_var_summary(binary_matrix) |> head()
Okay so, we are missing 6 student for each question about 4% of the data. Who are those six students?
# unique(binary_matrix$Question.1.Score)
lapply(as.list(binary_matrix), function(x){which(is.na(x))}) |>
as.data.frame()
Yeh, so its the same 6 students
cor_matrix <- cor(binary_matrix, use = "complete.obs")
library(corrplot)
## corrplot 0.95 loaded
corrplot(
cor_matrix,
method = "color",
type = "upper",
order = "hclust",
tl.col = "black",
tl.cex = 0.7
)
Exam as a whole^
Before we do so, lets get a notion of the typical variation of correlations :
cor_list <- lapply(
seq_len(nrow(cor_matrix)),
function(i) cor_matrix[i, -i]
)
names(cor_list) <- paste0("Q", seq_along(cor_list))
sd <- lapply(cor_list, function(x){
v <- c(sd(x))
names(v) <- c("sd")
v
})
mu <- mean(unlist(sd))
sigma <- sd(unlist(sd))
hist(unlist(sd), freq=FALSE, main = "SD of Correlations")
curve(
dnorm(x, mean = mu, sd = sigma),
add = TRUE,
col = "red"
)
bsc_sum <-
lapply(cor_list, function(x){
v <- c(mean(x), sd(x))
names(v) <- c("mean", "sd")
v
})
lapply(seq_along(cor_list), function(i) {
x <- cor_list[[i]]
stats <- bsc_sum[[i]]
hist(
x,
main = paste("Correlation Hist:", names(cor_list)[i]),
xlab = "Correlation"
)
legend(
"topright",
legend = c(
paste("Mean =", round(stats["mean"], 3)),
paste("SD =", round(stats["sd"], 3))
),
bty = "n"
)
})
What happens if I model per question Score to overall exam performance? What does that mean? Wont i get R^2 = 1?
mdl_mat <- cbind(binary_matrix, df$Total.Score)
fct_mat <- binary_matrix |> mutate(across(everything(), factor))
mdl_mat <- cbind(df$Total.Score, fct_mat)
colnames(mdl_mat)[1] <- "Total.Score" #whoops
mdl <- lm(Total.Score~., dat=mdl_mat); summary(mdl); plot(mdl, which = 2)
##
## Call:
## lm(formula = Total.Score ~ ., data = mdl_mat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.555e-13 -1.138e-14 2.800e-16 1.222e-14 5.456e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.963e-14 5.399e-14 -7.340e-01 0.464
## Question.1.Score1 1.000e+00 1.869e-14 5.351e+13 <2e-16 ***
## Question.2.Score1 1.000e+00 2.113e-14 4.733e+13 <2e-16 ***
## Question.3.Score1 1.000e+00 1.718e-14 5.820e+13 <2e-16 ***
## Question.4.Score1 1.000e+00 8.317e-15 1.202e+14 <2e-16 ***
## Question.5.Score1 1.000e+00 7.978e-15 1.253e+14 <2e-16 ***
## Question.6.Score1 1.000e+00 8.568e-15 1.167e+14 <2e-16 ***
## Question.7.Score1 1.000e+00 3.201e-14 3.124e+13 <2e-16 ***
## Question.8.Score1 1.000e+00 8.942e-15 1.118e+14 <2e-16 ***
## Question.9.Score1 1.000e+00 2.453e-14 4.076e+13 <2e-16 ***
## Question.10.Score1 1.000e+00 8.826e-15 1.133e+14 <2e-16 ***
## Question.11.Score1 1.000e+00 2.444e-14 4.091e+13 <2e-16 ***
## Question.12.Score1 1.000e+00 1.246e-14 8.024e+13 <2e-16 ***
## Question.13.Score1 1.000e+00 1.438e-14 6.954e+13 <2e-16 ***
## Question.14.Score1 1.000e+00 1.124e-14 8.896e+13 <2e-16 ***
## Question.15.Score1 1.000e+00 1.126e-14 8.883e+13 <2e-16 ***
## Question.16.Score1 1.000e+00 1.435e-14 6.968e+13 <2e-16 ***
## Question.17.Score1 1.000e+00 1.848e-14 5.411e+13 <2e-16 ***
## Question.18.Score1 1.000e+00 1.151e-14 8.688e+13 <2e-16 ***
## Question.19.Score1 1.000e+00 1.273e-14 7.858e+13 <2e-16 ***
## Question.20.Score1 1.000e+00 2.332e-14 4.287e+13 <2e-16 ***
## Question.21.Score1 1.000e+00 9.130e-15 1.095e+14 <2e-16 ***
## Question.22.Score1 1.000e+00 7.830e-15 1.277e+14 <2e-16 ***
## Question.23.Score1 1.000e+00 9.165e-15 1.091e+14 <2e-16 ***
## Question.24.Score1 1.000e+00 7.739e-15 1.292e+14 <2e-16 ***
## Question.25.Score1 1.000e+00 1.188e-14 8.418e+13 <2e-16 ***
## Question.26.Score1 1.000e+00 1.175e-14 8.507e+13 <2e-16 ***
## Question.27.Score1 1.000e+00 1.135e-14 8.812e+13 <2e-16 ***
## Question.28.Score1 1.000e+00 1.088e-14 9.190e+13 <2e-16 ***
## Question.29.Score1 1.000e+00 8.383e-15 1.193e+14 <2e-16 ***
## Question.30.Score1 1.000e+00 1.575e-14 6.348e+13 <2e-16 ***
## Question.31.Score1 1.000e+00 9.918e-15 1.008e+14 <2e-16 ***
## Question.32.Score1 1.000e+00 8.533e-15 1.172e+14 <2e-16 ***
## Question.33.Score1 1.000e+00 9.108e-15 1.098e+14 <2e-16 ***
## Question.34.Score1 1.000e+00 7.889e-15 1.268e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.957e-14 on 114 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.573e+28 on 34 and 114 DF, p-value: < 2.2e-16
The error is exactly normally distributed
Latent Variable : Statistical Reasoning ( Midterm 1 )
Probability
Regression
Design & Experimentation