Project Duration : 7 Days
Citation : https://www.lacan.upc.edu/admoreWeb/2018/05/all-models-are-wrong-but-some-are-useful-george-e-p-box/
This project conducts a district-level analysis of standardized achievement in mathematics, reading, and science across 325 Iowa school districts during the 2000–2001 academic year, integrating test outcomes for grades 4, 8, and 11 with district characteristics such as size, average teacher salary, and teaching experience. The study is designed as a methodological demonstration of rigorous linear modeling, where statistical conclusions are grounded explicitly in mathematical derivations rather than treated as black-box outputs. Using R, we first implement standard regression tools and diagnostics and interpret the empirical results, then derive the corresponding estimators, variance expressions, and test statistics to examine model assumptions, interpretability, and limitations in the context of real educational data. The analysis adopts a generic, reusable programming framework to support extensibility to future datasets, data enrichment, and policy-relevant applications. Theoretical development is guided primarily by Chapter 8 of Introduction to Regression Modeling (Abraham & Ledolter, 2006), complemented by STATS 100C lecture materials, and was carried out entirely by me over the seven project days, enabling a tight integration of theory, computation, and substantive interpretation.
Linear Modeling [ Regression ]
Linear Algebra [ proofs ]
Generic R Programming [ function & package building ]
Literature Review Process [ Iowa in 2001 ]
Data enrichment [ income inequality & district map ]
How does student performance vary across grade, subject, and district size?
How do teacher salaries vary across categories?
What features contribute most to a wealthier teacher?
url <- "https://www.biz.uiowa.edu/faculty/jledolter/RegressionModeling/Data/Chapter8/iowastudent.txt"
# download.file(url, destfile = "iowastudent.txt", mode = "wb")
library(readr)
iowa_std <- read_tsv("iowastudent.txt", na = "*", show_col_types = FALSE)
observations :
## [1] "325 School Districts"
## [1] "22 Features"
## [1] "District" "4thmath" "8thmath" "11thmath" "4th read"
## [6] "8th read" "11th read" "8th sci" "11th sci" "size"
## [11] "size(coded)" "small size" "med size" "large size" "salary"
## [16] "experience" "4th avg" "8th avg" "11th avg" "Math avg"
## [21] "Read avg" "Sci avg"
Notice the size metrics are categorical features; therefore categorical encoding (factor) is appropriate.
library(dplyr)
# Ensure our data behaves like a categorical variable
iowa_std <-
iowa_std |>
mutate(across(contains("size"), as.factor))
library(dplyr)
num_df <- iowa_std[,-1] |>
select(-starts_with("size("), -ends_with(" size"))
num_df$size <- as.numeric(as.character(num_df$size)) # convert back from factor; as.numeric() alone would return level codes, not enrollments
Here we take a quick look at global pairwise patterns; no deep analysis, just simple observations.
## corrplot 0.95 loaded
Weak correlations : there appear to be nearly no linear relationships involving size, salary, or experience, except for the (salary, experience) and (salary, size) pairs.
Strong correlations : the math, reading, and science averages appear to have the strongest pairwise relationships.
All other features are positively correlated [pairwise regression coefficients are positive].
We can test this claim and see :
cor_mat |>
colSums() |>
data.frame() |>
mutate(colSums.cor_mat. = colSums.cor_mat. - 1) |> # take away the diagonal (each column includes cor = 1 with itself)
mutate(colSums.cor_mat. = 2*colSums.cor_mat.) |> # Consider both covariances, cov(xi,xj) = cov(xj, xi)
arrange(desc(colSums.cor_mat.)) |> head(4)
As we can see, these variables capture most variation. The reasoning for this interpretation goes as follows :
Notice :
\[ \text{Let X, Y be Rand. Variables : }\\ \text{Var}(X +Y) = \sigma^2_X + \sigma^2_Y + 2\text{Cov(X,Y)} \]
proof
\[ \text{Let } W=X+Y \implies \text{Var}(X +Y) = \text{Var}(W) \\ \text{Var}(W)=\mathbb{E}(W^2)-\mathbb{E}^2(W) \]
\[ = \mathbb{E}((X+Y)^2) -\mathbb{E}^2(X+Y) \\ = \mathbb{E}(X^2+2XY+Y^2)-(\mu_X+\mu_Y)^2 \]
recall :
\[ \text{Var}(W)=\mathbb{E}(W^2)-\mathbb{E}^2(W) \\ \implies \\ \mathbb{E}(W^2)= \sigma^2_W+\mathbb{E}^2(W) \]
therefore,
\[ \text{Var}(X +Y) \\ = \mathbb{E}(X^2+2XY+Y^2)-(\mu_X+\mu_Y)^2 \\ = \mathbb{E}(X^2) + \mathbb{E}(Y^2) + 2\mathbb{E}(XY)-(\mu_X^2+2\mu_X\mu_Y+\mu_Y^2) \\ =[\mathbb{E}(X^2)-\mu_X^2]+[\mathbb{E}(Y^2)-\mu_Y^2]+ 2[\mathbb{E}(XY)-\mu_X\mu_Y]\\ = \sigma_X^2+\sigma_Y^2+2\text{Cov}(X,Y) \]
or, more generally : \(\{X,Y\}=\{X_1,X_2\}\)
\[ \implies \text{Var}(X_1+X_2)=\sum_1^2\text{Var}(X_i)+2\text{Cov}(X_1,X_2) \]
further, suppose we had \(\text{Var}(\sum{X_i})=\text{Var}(X_1 + ...+X_n)\)
Notice \(\sum{X_i}= X_1 + ...+X_n\) ; suppose we partition the set of random variables into two groups :
\[ \sum{X_i}= X_1 + ...+X_n=A+B \\ \text{For : } \\ A = \sum_1^m{X_i} \\ B=\sum_{m+1}^n{X_i} \]
\[ \implies \text{Var}(\sum{X_i})=\text{Var}(A+B) \\ = \sigma^2_A +\sigma^2_B +2\text{Cov(A,B)} \\ = \sigma^2_{\sum_1^m{X_i}}+\sigma^2_{\sum_{m+1}^n{X_i}} + 2\text{Cov}(\sum_1^m{X_i}, \sum_{m+1}^n{X_i}) \]
and, \(\sigma^2_{\sum_1^m{X_i}}=\text{Var}(\sum_1^m{X_i})=\text{Var}(X_1+...+X_m)\)
so we can iteratively apply the formula to obtain :
\[ \text{Var}\left(\sum{X_i}\right) = \sum{\text{Var}(X_i)}+2\sum_{i<j}{\text{Cov}(X_i, X_j)} \]
Meaning : the variance of the total sum can be described as the sum of the individual variances plus the pairwise covariance contributions.
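As a quick numerical sanity check of this identity (a minimal sketch; the matrix M below is made-up illustrative data, not part of the Iowa dataset) :
# Sketch : Var(X1 + ... + Xk) equals the sum of all entries of the covariance matrix
set.seed(1)
M <- matrix(rnorm(300), ncol = 3)        # 100 observations of 3 made-up variables
all.equal(var(rowSums(M)), sum(cov(M)))  # should be TRUE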
Suppose we are interested in just one column of the covariance matrix – feature \(X_1\), for example :
\[ \sum_{j=1}^{n}\text{Cov}(X_1,X_j)=\text{Var}(X_1)+\sum_{j\ne 1}{\text{Cov}(X_1, X_j)} \]
So a column sum represents \(X_1\)'s variance plus all of its pairwise covariances with the other variables (notice we don't include any covariance twice).
Therefore, realize :
\[ \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) \\ = \underbrace{\left( \operatorname{Var}(X_1) + 2\sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \right)}_{\text{all terms involving } X_1} + \underbrace{\left( \sum_{i=2}^n \operatorname{Var}(X_i) + 2\sum_{2 \le i < j \le n} \operatorname{Cov}(X_i,X_j) \right)}_{\text{all terms not involving } X_1} \]
\[ \implies \\ \operatorname{Var}(X_1) + 2\sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \\ = \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) - \left( \sum_{i=2}^n \operatorname{Var}(X_i) + 2\sum_{2 \le i < j \le n} \operatorname{Cov}(X_i,X_j) \right) \]
Therefore :
\[ \operatorname{Var}(X_1) + 2\sum_{j=2}^n \operatorname{Cov}(X_1,X_j) - \sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \\ = \operatorname{Var}(X_1) +\sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \\ = \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) - \left( \sum_{i=2}^n \operatorname{Var}(X_i) + 2\sum_{2 \le i < j \le n} \operatorname{Cov}(X_i,X_j) \right) - \sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \\ =\text{Var}(X_1)+\sum_{j=2}^n{\text{Cov}(X_1, X_j)} \]
Meaning : the column (equivalently row) sum is what remains of the variance of the total sum after removing the variation of all variables other than \(X_1\) and avoiding double counting of the \(X_1\) covariances. This lets us gauge the contribution of a particular variable without folding in the variation of the others.
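The column-sum interpretation can be checked the same way (again a sketch on made-up data; the identity says a column sum of the covariance matrix equals \(\text{Cov}(X_1, \sum X_i)\)) :
# Sketch : first column sum of cov(M) = Cov(X1, X1 + X2 + X3)
set.seed(1)
M <- matrix(rnorm(300), ncol = 3)
all.equal(sum(cov(M)[, 1]), cov(M[, 1], rowSums(M)))  # should be TRUE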
Similarly consider a correlation matrix :
Consider the relationship between correlation matrix & Covariance Matrix :
Recall :
\[ \sigma^2_{X_i} = \text{cov}(x_i,x_i) \]
and,
\[ \text{cov}(x_i,x_j) \in \Sigma_{X}, \]
Consider the following known equivalence :
\[ \Sigma_{X} \\ = \begin{bmatrix}\sigma^2_{X_1} & \text{Cov}(X_1, X_2) & \text{Cov}(X_1, X_3) \\\text{Cov}(X_2, X_1) & \sigma^2_{X_2} & \text{Cov}(X_2, X_3) \\\text{Cov}(X_3, X_1) & \text{Cov}(X_3, X_2) & \sigma^2_{X_3}\end{bmatrix} \\ = \begin{bmatrix}\sigma^2_{X_1} & r_{1,2}\sigma_{X_1}\sigma_{X_2} & r_{1,3}\sigma_{X_1}\sigma_{X_3} \\r_{2,1}\sigma_{X_2}\sigma_{X_1} & \sigma^2_{X_2} & r_{2,3}\sigma_{X_2}\sigma_{X_3} \\r_{3,1}\sigma_{X_3}\sigma_{X_1} & r_{3,2}\sigma_{X_3}\sigma_{X_2} & \sigma^2_{X_3}\end{bmatrix} \]
That is to say : \(\text{cov}(x_i, x_j)= r_{ij}\sigma_{i}\sigma_{j}\).
Meaning : covariance and correlation are directly related; the covariance is the product of the two standard deviations scaled by the strength of the linear relationship.
Further,
\[
\text{cov}(x_i, x_j)= r_{ij}\sigma_{i}\sigma_{j} \\
\implies \\
r_{i,j} = \frac{\text{cov}(x_i,x_j)}{\sigma_{x_i}\sigma_{x_j}}
\]
and,
\[ r_{i,j} \in C \]
\[ C =\begin{bmatrix}1 & r_{1,2} & r_{1,3} \\r_{2,1} & 1 & r_{2,3} \\r_{3,1} & r_{3,2} & 1\end{bmatrix} \]
Therefore, suppose we wish to sum the first Column :
\[ 1 + r_{2,1} + r_{3,1} \]
equivalently,
\[ \sum_{i=1}^{3} {r_{i,1}}=1 + \sum_{i=2}^{3} {r_{i,1}} \]
Or, more generally for a correlation matrix \(C^{n\times n}\) :
\[ \sum_{i=1}^{n} {r_{i,1}}=1 + \sum_{i=2}^{n} {r_{i,1}} \]
equivalently,
\[ \sum_{i=1}^{n} {r_{i,1}}= \sum_{i=1}^{n}\frac{\text{cov}(x_i,x_1)}{\sigma_{x_i}\sigma_{x_1}} \]
since the diagonal term contributes
\[ \frac{\text{cov}(x_1,x_1)}{\sigma_{x_1}\sigma_{x_1}} =\frac{\text{cov}(x_1,x_1)}{\sigma_{x_1}^2}=\frac{\sigma_{x_1}^2}{\sigma_{x_1}^2}=1 \]
Consider the simple case of two numerical variables \(x\) and \(y\) with correlation \(r_{xy} = \frac{\text{cov}(x,y)}{\sigma_x\sigma_y}\).
More generally, these correlations live in a correlation matrix containing all our numerical features :
\[ r_{ij} \in C^{p \times p} \\ r_{ij} \in [-1, 1] \text{ for : } r_{ij} \in \mathbb{R} \]
Where \(p=\text{Number of Parameters}\) (not inc. intercept). In deriving our correlations ( \(r_{ij}\) ) we are guided by Abraham & Ledolter’s treatment of multicollinearity (Abraham & Ledolter, 2006, Ch. 5.4), rewriting formulas verbatim and using the text as a framework to connect regression structure and predictor correlations.
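This covariance–correlation relationship is easy to confirm numerically (a sketch, assuming the num_df data frame built above and base R's cov2cor()) :
# Sketch : r_ij = cov(x_i, x_j) / (sd_i * sd_j), i.e. cov2cor() recovers the correlation matrix
S <- cov(num_df, use = "complete.obs")   # covariance matrix
R <- cor(num_df, use = "complete.obs")   # correlation matrix
all.equal(cov2cor(S), R)                 # should be TRUE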
We begin by first making assumptions :
\[ \vec{y}=X\vec{\beta} +\epsilon \text{ for : } \epsilon \sim N(\vec{0}, \sigma^2 I_n) \]
Suppose we expand \(\vec{y}\) :
\[ y= f(X) + \epsilon\\ = X\vec{\beta} + \epsilon\\ = \beta_0 + \beta_1\vec{x}_1 + ...+\beta_p\vec{x}_p + \epsilon \\ = \beta_0 + \sum\beta_i \vec{x}_i + \epsilon \]
For this we assume the model includes an intercept \(\beta_0\).
Notice that for each of the following Regression models we hold the same structure :
\[ \vec{y}_{325}=\beta_0+\beta_1\vec{x} + \epsilon \\ \text{for: } \vec{x}\in X^{325 X2} = [\vec{1}_n \vec{x}_1] \\ X'X= \begin{bmatrix}n & \sum{x_i} \\\sum{x_i} & \sum{x_i^2}\end{bmatrix} \\ \implies (X'X)^{-1}\\ =\frac{1}{\text{det}(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} \]
\[ =\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} \]
As we know, this quantity is particularly relevant when developing a notion of variance and center for \(\hat{\beta}\).
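We can verify this 2×2 structure of \(X'X\) directly (a minimal sketch; using Math avg as the single predictor here is purely illustrative) :
# Sketch : for X = [1 x], X'X = [[n, sum(x)], [sum(x), sum(x^2)]]
x <- as.numeric(na.omit(iowa_std$`Math avg`))   # illustrative predictor, NAs dropped
X <- cbind(1, x)                                # design matrix [1  x]
all.equal(unname(crossprod(X)),                 # X'X computed by R
          matrix(c(length(x), sum(x), sum(x), sum(x^2)), nrow = 2))  # built by hand; should be TRUE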
Recall :
\[ \hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1}) \]
proof
We wish to prove 3 claims :
Claim : “Our est. of slopes is Normally distributed”
Claim : “Our est. is unbiased”
Claim : "The covariance matrix of the slope estimates is the random-error variance times the inverse of the Gram matrix, \(\sigma^2(X'X)^{-1}\)"
Let's take them one by one; consider claim 1 :
We will prove this rather informally based upon a basic property of Normally distributed data.
Consider \(\vec{\epsilon} \sim N(0, \sigma^2I_n)\) and \(y = X\vec{\beta} + \vec{\epsilon}\), where \(X\vec{\beta}\) is deterministic (constant). We know that linear combinations ( \(aX +b\) ) of normally distributed data \(X \sim N(\mu, \sigma^2)\) are also normally distributed (More rigorous proof here), therefore \(\vec{y}\) is also normally distributed. In other words, since \(\vec{\epsilon}\) is the random part of our model ( \(y = X\vec{\beta} + \vec{\epsilon}\) ) and \(X\vec{\beta}\) is deterministic, \(y\) inherits its randomness from \(\vec{\epsilon}\); and because \(\vec{\epsilon}\) is normally distributed and \(\vec{y}\) is just a linear combination of \(\vec{\epsilon}\), we say \(\vec{y}\) is also normally distributed. The same reasoning gives claim 1 : \(\hat{\beta}=(X'X)^{-1}X'\vec{y}=A\vec{y}\) is a linear transformation of the normal vector \(\vec{y}\), so \(\hat{\beta}\) is also normally distributed.
Recall for something to be unbiased we say \(\mathbb{E}(\hat{\theta}) - \theta = 0\) :
Writing \(\hat{\beta}=A\vec{y}\) with \(A=(X'X)^{-1}X'\) :
\[ \mathbb{E}(\hat{\beta}) \\ = \mathbb{E}(A\vec{y}) \\ = A\mathbb{E}(\vec{y}) \\ = A\mathbb{E}(X\vec{\beta} + \epsilon) \\ = A[\mathbb{E}(X\vec{\beta}) + \mathbb{E}(\epsilon)] \\ = A[X\vec{\beta} + 0] \\ = AX\vec{\beta} \\ = (X'X)^{-1}X' X \vec{\beta} \\ = (X'X)^{-1}(X' X) \vec{\beta} \\ = I \vec{\beta} \\ = \vec{\beta} \]
\[ \implies \\ \mathbb{E}(\hat{\beta}) -\vec{\beta} \\ = \vec{\beta} -\vec{\beta} \\ = \vec{0} \]
For claim 3, again with \(\hat{\beta}=A\vec{y}\) :
\[ \Sigma_{\hat{\beta}} \\ = \text{Cov}(\hat{\beta}) \\ = \text{Cov}(A\vec{y}) \\ = A\text{Cov}(\vec{y})A' \\ = A\sigma^2 I_nA' \\ = \sigma^2AA' \\ = \sigma^2(X'X)^{-1}X'[(X'X)^{-1}X']' \]
\[ = \sigma^2(X'X)^{-1}X'X[(X'X)^{-1}]' \\ = \sigma^2(X'X)^{-1}X'X[(X'X)']^{-1} \\ = \sigma^2(X'X)^{-1}X'X[X'X]^{-1} \\ = \sigma^2(X'X)^{-1}(X'X)(X'X)^{-1} \\ = \sigma^2 I (X'X)^{-1} \\ = \sigma^2(X'X)^{-1} \]
(using that \(X'X\) is symmetric, so \([(X'X)']^{-1}=(X'X)^{-1}\); note we never need \(X^{-1}\) itself, which does not exist for a non-square \(X\)).
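These claims can be checked numerically against lm() (a sketch; the model below, Read avg on Math avg, is fit here only for the check and reappears later in the analysis) :
# Sketch : compare sigma^2 (X'X)^{-1} with the covariance matrix reported by vcov()
fit <- lm(`Read avg` ~ `Math avg`, data = iowa_std)
X   <- model.matrix(fit)                   # design matrix actually used (NA rows dropped)
s2  <- sigma(fit)^2                        # estimate of the error variance
manual_vcov <- s2 * solve(crossprod(X))    # sigma^2 (X'X)^{-1}
all.equal(unname(manual_vcov), unname(vcov(fit)))   # should be TRUE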
pairwise_dot_plt <- function(data_set) {
  # Generate all pairwise combinations of column indices
  pairwise_comb <- combn(1:ncol(data_set), 2)
  # Iterate across each combination
  for (idx in 1:ncol(pairwise_comb)) {
    comb_vec <- pairwise_comb[, idx]
    x_axis <- comb_vec[1]
    y_axis <- comb_vec[2]
    # Fit a simple regression model for this pair
    mdl <- lm(unlist(data_set[, y_axis]) ~ unlist(data_set[, x_axis]), data = data_set)
    # Scatter plot for the pair
    plot(
      unlist(data_set[, x_axis]),
      unlist(data_set[, y_axis]),   # fixed typo : was y_3axis
      xlab = colnames(data_set)[x_axis],
      ylab = colnames(data_set)[y_axis],
      main = paste("Plot", idx)
    )
    abline(mdl, col = "red")
  }
}
library(dplyr)
# Grade, Topic Correlations :
gd_tpk_cor <- num_df |> select(contains("th"),-contains("avg")) |> cor(use = "complete.obs")
# Topic avg Correlations
sub_avr_cor <- num_df |> select(ends_with("avg"), -matches("^[0-9]+th")) |> cor(use = "complete.obs")
# Grade avg Correlations
grd_avg_cor <- num_df |> select(matches("^[0-9]+th avg")) |> cor(use = "complete.obs")
corrplot(gd_tpk_cor,
method = "color",
type = "upper",
tl.col = "black",
addCoef.col = "black",
number.cex = 0.9
)
# Set bounds and filter :
weak_cutoff <- .3 # below : weak linear association
strong_cutoff <- .6 # above : strong linear association
low_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor < weak_cutoff)
med_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor >= weak_cutoff) & (gd_tpk_cor <= strong_cutoff)
high_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor > strong_cutoff)
# select only ij obs (upper.tri) and not ji, sym matrix
low_pairs <- which(low_idx, arr.ind = TRUE)
med_pairs <- which(med_idx, arr.ind = TRUE)
high_pairs <- which(high_idx, arr.ind = TRUE)
# general NA matrix function :
generate_NA_mat <-
function(mat){
NA_mat <- matrix(
NA,
nrow = nrow(mat),
ncol = ncol(mat),
dimnames = dimnames(mat) # use the input matrix's dimnames, not a hard-coded object
)
return(NA_mat)
}
# generate general function :
build_masked_matrix <- function(cor_matrix, pair_indices) {
NA_mat <- generate_NA_mat(cor_matrix)
NA_mat[pair_indices] <- cor_matrix[pair_indices]
return(NA_mat)
}
print("low_cor_mat")
## [1] "low_cor_mat"
low_cor_mat <- build_masked_matrix(gd_tpk_cor, low_pairs); low_cor_mat
## 4thmath 8thmath 11thmath 4th read 8th read 11th read 8th sci
## 4thmath NA NA 0.2742213 NA NA 0.1940918 0.2283466
## 8thmath NA NA NA NA NA 0.2997928 NA
## 11thmath NA NA NA 0.2819581 NA NA 0.1771869
## 4th read NA NA NA NA NA 0.2880443 0.2739120
## 8th read NA NA NA NA NA NA NA
## 11th read NA NA NA NA NA NA 0.2655738
## 8th sci NA NA NA NA NA NA NA
## 11th sci NA NA NA NA NA NA NA
## 11th sci
## 4thmath 0.1864202
## 8thmath 0.2302258
## 11thmath NA
## 4th read 0.2111796
## 8th read 0.2584866
## 11th read NA
## 8th sci NA
## 11th sci NA
print("med_cor_mat")
## [1] "med_cor_mat"
med_cor_mat <- build_masked_matrix(gd_tpk_cor, med_pairs); med_cor_mat
## 4thmath 8thmath 11thmath 4th read 8th read 11th read 8th sci
## 4thmath NA 0.3817187 NA NA 0.3468462 NA NA
## 8thmath NA NA 0.3118259 0.3787978 NA NA 0.5362033
## 11thmath NA NA NA NA 0.3016574 NA NA
## 4th read NA NA NA NA 0.3526437 NA NA
## 8th read NA NA NA NA NA 0.3201637 0.5382480
## 11th read NA NA NA NA NA NA NA
## 8th sci NA NA NA NA NA NA NA
## 11th sci NA NA NA NA NA NA NA
## 11th sci
## 4thmath NA
## 8thmath NA
## 11thmath 0.4566460
## 4th read NA
## 8th read NA
## 11th read NA
## 8th sci 0.3122344
## 11th sci NA
print("high_cor_mat")
## [1] "high_cor_mat"
high_cor_mat <- build_masked_matrix(gd_tpk_cor, high_pairs); high_cor_mat
## 4thmath 8thmath 11thmath 4th read 8th read 11th read 8th sci
## 4thmath NA NA NA 0.6752072 NA NA NA
## 8thmath NA NA NA NA 0.6897058 NA NA
## 11thmath NA NA NA NA NA 0.6061706 NA
## 4th read NA NA NA NA NA NA NA
## 8th read NA NA NA NA NA NA NA
## 11th read NA NA NA NA NA NA NA
## 8th sci NA NA NA NA NA NA NA
## 11th sci NA NA NA NA NA NA NA
## 11th sci
## 4thmath NA
## 8thmath NA
## 11thmath NA
## 4th read NA
## 8th read NA
## 11th read 0.6401641
## 8th sci NA
## 11th sci NA
gd_tpk_cor :
high_cor_mat : math and reading within the same grade level (4, 8, 11) appear to be strongly correlated ( \(r \approx 0.61\) to \(0.69\) ).
med_cor_mat : 8th grade scores across all subjects appear to have a moderate relationship; the 8th grade science pairs ( \(r \approx 0.54\) ) approach the high_cor_mat category [consider the near-diagonal behavior].
low_cor_mat : most correlations are rather weak.
masked_matrix_func <- function(correlation_matrix, bottom_cutoff = 0.3, top_cutoff = 0.6) {
if (!is.matrix(correlation_matrix)) {
stop("Input must be a matrix.")
}
# Logical indices for each strength
l_idx <- upper.tri(correlation_matrix) & (correlation_matrix < bottom_cutoff)
m_idx <- upper.tri(correlation_matrix) &
(correlation_matrix >= bottom_cutoff) &
(correlation_matrix <= top_cutoff)
h_idx <- upper.tri(correlation_matrix) & (correlation_matrix > top_cutoff)
# Pair indices
l_pairs <- which(l_idx, arr.ind = TRUE)
m_pairs <- which(m_idx, arr.ind = TRUE)
h_pairs <- which(h_idx, arr.ind = TRUE)
# Generate empty NA matrix func
generate_NA_mat <- function(mat) {
matrix(NA, nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
}
# Use helper function :
build_masked_matrix <- function(mat, idx) {
NA_mat <- generate_NA_mat(mat)
NA_mat[idx] <- mat[idx]
return(NA_mat)
}
# Store all three in a named list
cor_lst <- list(
low_cor_mat = build_masked_matrix(correlation_matrix, l_pairs),
med_cor_mat = build_masked_matrix(correlation_matrix, m_pairs),
high_cor_mat = build_masked_matrix(correlation_matrix, h_pairs)
)
return(cor_lst)
}
# plt :
corrplot(sub_avr_cor,
method = "color",
type = "upper",
tl.col = "black",
addCoef.col = "black",
number.cex = 0.9
)
masked_matrix_func(sub_avr_cor)
## $low_cor_mat
## Math avg Read avg Sci avg
## Math avg NA NA NA
## Read avg NA NA NA
## Sci avg NA NA NA
##
## $med_cor_mat
## Math avg Read avg Sci avg
## Math avg NA NA 0.4922274
## Read avg NA NA NA
## Sci avg NA NA NA
##
## $high_cor_mat
## Math avg Read avg Sci avg
## Math avg NA 0.7568206 NA
## Read avg NA NA 0.6127668
## Sci avg NA NA NA
cov(num_df, use = "complete.obs")
## 4thmath 8thmath 11thmath 4th read 8th read
## 4thmath 115.326528 35.645267 29.514496 75.911131 33.704589
## 8thmath 35.645267 75.611479 27.175386 34.483011 54.268186
## 11thmath 29.514496 27.175386 100.447541 29.584075 27.357166
## 4th read 75.911131 34.483011 29.584075 109.599264 33.406233
## 8th read 33.704589 54.268186 27.357166 33.406233 81.879482
## 11th read 24.220580 30.291992 70.595550 35.040950 33.664523
## 8th sci 27.540761 52.364883 19.944262 32.205621 54.699930
## 11th sci 27.295669 27.295050 62.400117 30.143376 31.890522
## size -48.792263 70.141410 96.443208 52.000251 119.116076
## salary 2085.229400 5395.757446 8124.694133 6447.328575 6698.265199
## experience 0.166605 3.322614 4.187824 1.337106 4.311578
## 4th avg 95.618829 35.064139 29.549286 92.755197 33.555411
## 8th avg 32.296859 60.748205 24.825627 33.364939 63.615883
## 11th avg 27.010228 28.254109 77.814433 31.589451 30.970700
## Math avg 60.162129 46.144054 52.379151 46.659430 38.443322
## Read avg 44.612073 39.681051 42.512239 59.348801 49.650075
## Sci avg 27.418215 39.829967 41.172190 31.174498 43.295226
## 11th read 8th sci 11th sci size salary
## 4thmath 24.220580 27.540761 27.295669 -48.792263 2085.229
## 8thmath 30.291992 52.364883 27.295050 70.141410 5395.757
## 11thmath 70.595550 19.944262 62.400117 96.443208 8124.694
## 4th read 35.040950 32.205621 30.143376 52.000251 6447.329
## 8th read 33.664523 54.699930 31.890522 119.116076 6698.265
## 11th read 135.028538 34.658899 101.423879 187.892523 12251.038
## 8th sci 34.658899 126.134426 47.811651 22.400468 3533.438
## 11th sci 101.423879 47.811651 185.896939 186.582176 10819.828
## size 187.892523 22.400468 186.582176 6829.021245 203260.758
## salary 12251.038013 3533.437799 10819.827665 203260.757937 10759419.737
## experience 4.043433 4.621802 5.819243 40.474950 4170.303
## 4th avg 29.630765 29.873191 28.719522 1.603994 4266.279
## 8th avg 32.871806 77.733095 35.665758 70.552723 5209.160
## 11th avg 102.349321 34.138223 116.573634 156.972605 10398.517
## Math avg 41.702716 33.283313 38.996962 39.264063 5201.897
## Read avg 67.911317 40.521453 54.485890 119.669529 8465.541
## Sci avg 68.041389 86.973039 116.854295 104.491322 7176.633
## experience 4th avg 8th avg 11th avg Math avg
## 4thmath 0.1666050 95.6188294 32.296859 27.010228 60.162129
## 8thmath 3.3226142 35.0641388 60.748205 28.254109 46.144054
## 11thmath 4.1878244 29.5492857 24.825627 77.814433 52.379151
## 4th read 1.3371062 92.7551974 33.364939 31.589451 46.659430
## 8th read 4.3115780 33.5554112 63.615883 30.970700 38.443322
## 11th read 4.0434329 29.6307653 32.871806 102.349321 41.702716
## 8th sci 4.6218021 29.8731909 77.733095 34.138223 33.283313
## 11th sci 5.8192429 28.7195224 35.665758 116.573634 38.996962
## size 40.4749498 1.6039938 70.552723 156.972605 39.264063
## salary 4170.3034383 4266.2789873 5209.160190 10398.516555 5201.896513
## experience 6.8160538 0.7518556 4.085333 4.683491 2.559016
## 4th avg 0.7518556 94.1870134 32.830899 29.299840 53.410780
## 8th avg 4.0853333 32.8308994 67.365746 31.121024 39.290240
## 11th avg 4.6834915 29.2998398 31.121024 98.912469 44.359601
## Math avg 2.5590159 53.4107799 39.290240 44.359601 52.895129
## Read avg 3.2307024 51.9804372 43.284193 54.969798 42.268468
## Sci avg 5.2205225 29.2963566 56.699427 75.355929 36.140137
## Read avg Sci avg
## 4thmath 44.612073 27.418215
## 8thmath 39.681051 39.829967
## 11thmath 42.512239 41.172190
## 4th read 59.348801 31.174498
## 8th read 49.650075 43.295226
## 11th read 67.911317 68.041389
## 8th sci 40.521453 86.973039
## 11th sci 54.485890 116.854295
## size 119.669529 104.491322
## salary 8465.540761 7176.632732
## experience 3.230702 5.220522
## 4th avg 51.980437 29.296357
## 8th avg 43.284193 56.699427
## 11th avg 54.969798 75.355929
## Math avg 42.268468 36.140137
## Read avg 58.970051 47.503672
## Sci avg 47.503672 101.913667
sub_avr_cor :
high_cor_mat : Math avg & Read avg ; Read avg & Sci avg
med_cor_mat : Math avg & Sci avg
low_cor_mat : NA
masked_matrix_func(sub_avr_cor^2, bottom_cutoff = 0.3^2, top_cutoff = 0.6^2)
## $low_cor_mat
## Math avg Read avg Sci avg
## Math avg NA NA NA
## Read avg NA NA NA
## Sci avg NA NA NA
##
## $med_cor_mat
## Math avg Read avg Sci avg
## Math avg NA NA 0.2422878
## Read avg NA NA NA
## Sci avg NA NA NA
##
## $high_cor_mat
## Math avg Read avg Sci avg
## Math avg NA 0.5727773 NA
## Read avg NA NA 0.3754831
## Sci avg NA NA NA
grd_avg_cor :
corrplot(grd_avg_cor,
method = "color",
type = "upper",
tl.col = "black",
addCoef.col = "black",
number.cex = 0.9
)
masked_matrix_func(grd_avg_cor)
## $low_cor_mat
## 4th avg 8th avg 11th avg
## 4th avg NA NA NA
## 8th avg NA NA NA
## 11th avg NA NA NA
##
## $med_cor_mat
## 4th avg 8th avg 11th avg
## 4th avg NA 0.4121618 0.3035598
## 8th avg NA NA 0.3812491
## 11th avg NA NA NA
##
## $high_cor_mat
## 4th avg 8th avg 11th avg
## 4th avg NA NA NA
## 8th avg NA NA NA
## 11th avg NA NA NA
grep("_cor$", ls(), value = T)
## [1] "gd_tpk_cor" "grd_avg_cor" "sub_avr_cor"
cor_lst <- list(gd_tpk_cor = gd_tpk_cor, grd_avg_cor = grd_avg_cor, sub_avr_cor = sub_avr_cor)
# Average Correlation
lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)})
## $gd_tpk_cor
## [1] 0.31
##
## $grd_avg_cor
## [1] 0.24
##
## $sub_avr_cor
## [1] 0.41
# average of averages :
paste(lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)}) |> unlist() |> mean(), " avr cor")
## [1] "0.32 avr cor"
Global Observations :
Among the variables of interest we have an average pairwise correlation \(\bar{r}_{ij}\) of about .32, so the linear signals all appear pretty weak.
Subject pairs are more related than grade pairs :
sub_avr_cor > gd_tpk_cor > grd_avg_cor
Mathematical Observations :
Notice that for any pair \((x_i, x_j)\), \(r_{ij}\in C \implies R^2 = r_{ij}^2\) for the simple regression of \(x_j\) on \(x_i\). In other words, the squared pairwise correlations are the coefficients of determination for those pairwise models.
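A quick check of this claim with one pair from our data (sketch; any numeric pair would do, Math avg and Read avg are used only as an example) :
# Sketch : squared pairwise correlation equals the simple-regression R^2
r  <- cor(num_df$`Math avg`, num_df$`Read avg`, use = "complete.obs")
R2 <- summary(lm(`Read avg` ~ `Math avg`, data = num_df))$r.squared
all.equal(r^2, R2)   # should be TRUE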
masked <- lapply(cor_lst, masked_matrix_func)
high_only <- lapply(masked, function(x) x[grepl("^high_", names(x))])
high_only
## $gd_tpk_cor
## $gd_tpk_cor$high_cor_mat
## 4thmath 8thmath 11thmath 4th read 8th read 11th read 8th sci
## 4thmath NA NA NA 0.6752072 NA NA NA
## 8thmath NA NA NA NA 0.6897058 NA NA
## 11thmath NA NA NA NA NA 0.6061706 NA
## 4th read NA NA NA NA NA NA NA
## 8th read NA NA NA NA NA NA NA
## 11th read NA NA NA NA NA NA NA
## 8th sci NA NA NA NA NA NA NA
## 11th sci NA NA NA NA NA NA NA
## 11th sci
## 4thmath NA
## 8thmath NA
## 11thmath NA
## 4th read NA
## 8th read NA
## 11th read 0.6401641
## 8th sci NA
## 11th sci NA
##
##
## $grd_avg_cor
## $grd_avg_cor$high_cor_mat
## 4th avg 8th avg 11th avg
## 4th avg NA NA NA
## 8th avg NA NA NA
## 11th avg NA NA NA
##
##
## $sub_avr_cor
## $sub_avr_cor$high_cor_mat
## Math avg Read avg Sci avg
## Math avg NA 0.7568206 NA
## Read avg NA NA 0.6127668
## Sci avg NA NA NA
observations :
For our programmed function we have defined \(r_{xy}=.6\) as the lowest \(r_{xy}\) value accepted as highly correlated – as an \(R^2\) that is \(.6^2 =0.36\), i.e., 36% of the variation in \(y\) can be explained by \(x\).
Notice grd_avg_cor never contains highly correlated features.
Math avg and Read avg appear to be the most linearly related pair, sharing about 57% of their variation, slightly above half.
Problem : What is the behavior of the linear model? And how does that behavior differ when we estimate the model?
Notice :
\[ y = X\beta+\epsilon \ ; \epsilon \sim N(0, \sigma^2I_n) \]
\[ \mathbb{E}(y)=\mathbb{E}(X\beta+\epsilon) = \mathbb{E}(X\beta) + \mathbb{E}(\epsilon) =X\beta+0=X\beta = \mu \]
\[ \text{Cov}(y) =\text{Cov}(X\beta+\epsilon)=\text{Cov}(\epsilon) :=\sigma^2I_n \]
\[ y \sim N(X\beta, \sigma^2I) \]
Therefore \(\mu =X \beta\) and \(y = \mu +\epsilon\). In other words : \(\text{Event = Model + Noise}\). Further, let's consider the distribution of \(\hat{y}=X\hat{\beta}\).
claim : \(\hat{y}\) follows a normal dist :
\[ \begin{aligned}\hat{\vec y}&= X\hat{\beta} \\&= \underbrace{ \underbrace{X(X^\top X)^{-1}X^\top}_{\displaystyle XA} }_{\displaystyle H}\,\vec y \\&= H\vec y\end{aligned} \]
Therefore, notice that we can describe \(\hat{y}\) as a linear transformation of \(\vec{y}=X\beta+\epsilon\) :
\[ \hat{ y} = H\vec y = \begin{bmatrix} \vec h_1^{\top}\vec y \\ \vec h_2^{\top}\vec y \\ \vdots \\ \vec h_n^{\top}\vec y \end{bmatrix} = y_1 \vec h_1 + y_2 \vec h_2 + \cdots + y_n \vec h_n \]
After establishing normality, we can describe the behavior of \(\hat{y}\) by calculating the two parameters of the normal distribution, the mean ( \(\vec{\mu}\) ) and the covariance matrix ( \(\Sigma\) ), since these quantities fully describe the distribution's behavior :
Center :
\[ \mathbb{E}(\hat{y}) \\ = \mathbb{E}(X\hat{\beta}) \\ =X\mathbb{E}(\hat{\beta}) \\ =X\beta \\ = \mu \]
So we should expect our estimate of the model, in the long run, to be centered at the model mean \(\mu\) – good!
Variance :
\[ \text{Cov}(\hat{y})=\text{Cov}(Hy) =H\text{Cov}(y)H' =H\sigma^2I_nH'=\sigma^2HH'=\sigma^2H \]
This is because \(y \sim N(X\beta, \sigma^2I_n)\) and \(H\) is symmetric ( \(H'=H\) ) and idempotent ( \(H^2=H\) ).
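We can confirm these hat-matrix properties numerically (a sketch; the same Read avg ~ Math avg model is fit for real in the next chunk, here it is built only for this check) :
# Sketch : H is symmetric, idempotent, and its diagonal matches hatvalues()
fit <- lm(`Read avg` ~ `Math avg`, data = iowa_std)
X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)              # H = X (X'X)^{-1} X'
all.equal(H, t(H))                                   # symmetry : H' = H
all.equal(H %*% H, H)                                # idempotence : H^2 = H
all.equal(unname(diag(H)), unname(hatvalues(fit)))   # leverages agree with lm()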
# Fit linear model
mdl <- lm(`Read avg` ~ `Math avg`, data = iowa_std)
# ---- Scatter plot with regression line ----
plot(iowa_std$`Math avg`, iowa_std$`Read avg`,
main = "Math vs Read Avg Mdl",
xlab = "Math Avg", ylab = "Read Avg", pch = 19)
abline(mdl, col = "red", lwd = 2)
# Residual Plot :
plot(mdl, which = 1)
# QQ plt
plot(mdl, which = 2)
observations :
Suppose we wish to summarize \(\hat{y}\) with a single summary statistic : \(\bar{\hat{y}}=\frac{1}{n}\sum{\hat{y}_i}\). We will first derive its distribution and then the parameters that describe that distribution :
Proof
\[ \bar{\hat{y}} \\ = \frac{1}{n}\sum{\hat{y}_i} \\ = \frac{1}{n} \vec{1}'\hat{y} \\ = M\hat{y} \]
but I want you to notice :
\[ \bar{\hat{y}}=M\hat{y}=MH\vec{y}=\frac{1}{n}\vec{1}'H\vec{y} = \frac{1}{n}(\vec{1}'H)\vec{y} \]
and, notice that : \(H\vec{1}=\vec{1}\)
In order to understand why this is true remember that \(X=[\vec{1} \ \vec{x}_1 ....\vec{x}_n]\) and \(H=X(X'X)^{-1}X'\)
Therefore, since \(\vec{1}\) is not only a linear combination but also literally in, \(X\) we can state :
\[ X\vec{v}=\vec{1} \]
\[ \vec{1}v_1+\vec{x}_1v_2+...+\vec{x}_nv_n = \vec{1} \]
Clearly if \(v_2\) to \(v_n\), are all 0 and \(v_1\) is 1 we see :
\[ \vec{1}(1)+\vec{x}_1(0)+...+\vec{x}_n(0) = \vec{1} \\ \vec{1}(1) = \vec{1} \\ \vec{1} = \vec{1} \]
Meaning \(\vec{v} = [1 \ 0 ...0]\).
Further,
\[
H\vec{1} \\
= XA\vec{1} \\
= X(X'X)^{-1}X'\vec{1} \\
= X(X'X)^{-1}X'(X\vec{v}) \\
= X(X'X)^{-1}(X'X)\vec{v} \\
= XI_n\vec{v} \\
= X\vec{v} \\
:=\vec{1}
\]
Therefore, suppose we transpose \(H\vec{1} = \vec{1}\) :
\[ \vec{1}' =(H\vec{1})'=\vec{1}'H' = \vec{1}'H =\vec{1}' \]
Meaning :
\[ \bar{\hat{y}}=\frac{1}{n}\vec{1}'H\vec{y} = \frac{1}{n}\vec{1}'\vec{y} = \frac{1}{n}\sum{y_i}=\bar{y} \]
Therefore, the average of the fitted values \(\hat{y}\) is equal to the average of the observed \(y\) values. So in deriving the distribution of \(\bar{\hat{y}}\) we are really finding the distribution of \(\bar{y}\).
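A one-line sanity check of this identity with the model fitted above (sketch) :
# Sketch : mean of fitted values equals mean of the observed response (model with intercept)
y_used <- mdl$model[[1]]                     # response values actually used in the fit (NAs dropped)
all.equal(mean(fitted(mdl)), mean(y_used))   # should be TRUE
Continuing, the mean and variance of this statistic are :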
\[ \mathbb{E}(\bar{\hat{y}}) \\ = \mathbb{E}(\bar{y}) \\ = \mathbb{E}\left(\frac{1}{n}\sum{y}_i\right) \\ = \frac{1}{n} \mathbb{E}\left(\sum{y}_i\right) \\ = \frac{1}{n} \sum{\mathbb{E}({y}_i)} \]
and \(y_i=\mu +\epsilon_i\),
\[ \mathbb{E}(y_i) \\ =\mathbb{E}(\mu +\epsilon_i)\\ =\mathbb{E}(\mu) +\mathbb{E}(\epsilon_i) \\ = \mathbb{E}(\mu) + 0 \\ = \mu \]
therefore :
\[ = \frac{1}{n} n \mu \\ = \mu \]
and,
\[ \text{Var}(\bar{\hat{y}}) \\ =\text{Var}(\bar{y}) \\ =\text{Var}(\frac{1}{n}\vec{1}'\vec{y}) \\ = (\frac{1}{n})^2\text{Var}(\vec{1}'\vec{y}) \\ = \frac{1}{n^2}\text{Var}(\sum{y}_i) \]
and I want you to recall : \(y_i = \mu + \epsilon_i\) and each \(\epsilon_i\) is independent from \(\epsilon_j\) (ie uncorrelated errors). So we can assume \(y_i\) is independent from \(y_j\) as it is a deterministic component added to a random component \(\epsilon\).
\[ = \frac{1}{n^2}\text{Var}(\sum{y}_i) \\ = \frac{1}{n^2}\sum\text{Var}(y_i) \\ = \frac{1}{n^2}\sum\text{Var}(\mu+\epsilon_i) \\ = \frac{1}{n^2}\sum\text{Var}(\epsilon_i) \\ = \frac{1}{n^2}\text{Trace}(\text{Cov}(\vec{\epsilon})) \\ = \frac{1}{n^2}\text{Trace}(\sigma^2I_n) \\ = \frac{\sigma^2}{n^2}\text{Trace}(I_n) \\ = \frac{\sigma^2}{n^2}\sum^{n}_1{1} \\ = \frac{\sigma^2}{n^2}n \\ = \frac{\sigma^2}{n} \]
Meaning :
\[ \bar{\hat{y}} =\bar{y} \sim N(\mu,\frac{\sigma^2}{n}) \]
However, this is a summary statistic of the overall behavior of \(\hat{y}\), each prediction \(\hat{y}_i\) has its own variance \(\sigma^2_{\hat{y}_i}\) and \(\text{Cov}(\hat{y}_i,\hat{y}_j)=\rho\sigma_{\hat{y}_i}\sigma_{\hat{y}_j}\) for \(\rho \in [-1,1]\) [ Correlation ]; meaning \(\text{Cov}(\hat{y}_i,\hat{y}_j)=\text{Corr}(\hat{y}_i,\hat{y}_j)\sigma_{\hat{y}_i}\sigma_{\hat{y}_j}\)
Covariance matrix of \(\hat{y}\) :
\[ H = XA ; A = (X'X)^{-1}X' \\ H =X(X'X)^{-1}X' \]
[ After re-solving this homework problem it became clear – citation : STATS 100C, HW 5, Q6 ]
and for this case :
\[ X = [1_{325} \vec{x}] \\ \implies \\ H \\ = X\frac{1}{\text{det}(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} X'\\ = X\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}X' \\ = [1_{325} \vec{x}]\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \]
Notice : \(\sum{x_i}^2 = \vec{x}'\vec{x} = ||x||^2\) and, \(\bar{x}=\frac{1}{n} \sum x_i\)
\[ = [1_{325} \vec{x}]\frac{1}{n ||x||^2-(n\bar{x})^2}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ = [1_{325} \vec{x}]\frac{1}{n ||x||^2-n^2\bar{x}^2}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ = [1_{325} \vec{x}]\frac{1}{n (||x||^2-n\bar{x}^2)}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ \]
Recall :
\[ S_{xx}=\sum (x_i -\bar{x})^2 \\ = \sum(x_i -\bar{x})(x_i -\bar{x}) \\ = \sum(x_i^2 -2\bar{x}x_i+\bar{x}^2) \\ = \sum{x_i^2} -2\bar{x}\sum{x_i}+\sum{\bar{x}^2} \\ = \sum{x_i^2} -2\bar{x}n\bar{x}+n\bar{x}^2 \\ = \sum{x_i^2} -2\bar{x}^2n+n\bar{x}^2 \\ = \sum{x_i^2} -n\bar{x}^2 \\ = ||x||^2 -n\bar{x}^2 \]
Therefore :
\[ = [1_{325} \vec{x}]\frac{1}{n (||x||^2-n\bar{x}^2)}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ = [1_{325} \vec{x}]\frac{1}{n S_{xx}}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ \]
and,
\[ = [1_{325} \vec{x}]\frac{1}{n S_{xx}}\begin{bmatrix}||x||^2 & -n\bar{x} \\-n\bar{x} & n\end{bmatrix}[1_{325} \vec{x}]' \\ = [1_{325} \vec{x}]\begin{bmatrix}\frac{||x||^2}{nS_{xx}} & -\frac{\bar{x}}{S_{xx}} \\-\frac{\bar{x}}{S_{xx}} \ & \frac{1}{S_{xx}}\end{bmatrix}[1_{325} \vec{x}]' \\ \]
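The \(S_{xx}\) identity used in this simplification is easy to verify numerically (a sketch, with Math avg as the illustrative \(x\)) :
# Sketch : S_xx = sum((x - xbar)^2) = ||x||^2 - n * xbar^2
x <- as.numeric(na.omit(iowa_std$`Math avg`))
n <- length(x)
all.equal(sum((x - mean(x))^2), sum(x^2) - n * mean(x)^2)   # should be TRUE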
# Histogram
hist(iowa_std$`Read avg`,
probability = TRUE,
breaks = 20,
col = "lightgray",
xlim = c(
mean(iowa_std$`Read avg`, na.rm = TRUE) - 4 * sd(iowa_std$`Read avg`, na.rm = TRUE),
mean(iowa_std$`Read avg`, na.rm = TRUE) + 4 * sd(iowa_std$`Read avg`, na.rm = TRUE)
),
main = "Histogram of Read Avg with Normal Curves",
xlab = "Read Avg")
# Center at Xbeta = \hat{y}
curve(
dnorm(x,
mean = mean(fitted(mdl), na.rm = TRUE),
sd = sd(fitted(mdl), na.rm = TRUE)),
col = "blue", lwd = 2, add = TRUE)
# Raw values :
curve(
dnorm(x,
mean = mean(iowa_std$`Read avg`, na.rm = TRUE),
sd = sd(iowa_std$`Read avg`, na.rm = TRUE)),
col = "green", lwd = 2, add = TRUE)
legend("topright", legend = c("Centered at Xβ", "Empirical Mean"),
col = c("blue", "green"), lwd = 2, bty = "n")
As we can see, our theoretical model for \(y\) does a pretty good job of describing Read avg in terms of Math avg, which tells us there appears to be a strong relationship between a district's math and reading performance.
However, relative to the empirical distribution of the outcome variable, the fit is worse; the difference is in the variance. Our model is overly confident ( smaller \(\sigma_*\) ).
Hypothesis testing :
\[ \text{T-test : }H_o : \beta_i=0 \] Rejecting \(H_o\) indicates that the corresponding regression coefficient is statistically significant.
\[ F\text{-test : } H_o : \beta_1=0 \] Rejecting \(H_o\) indicates that the model as a whole is significant; with a single predictor the F-test and the slope t-test coincide.
In short, this indicates strong evidence that districts which do well in reading also tend to do well in math. Perhaps targeting these two areas together will help overall performance.
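For reference, both test statistics can be pulled directly from the fitted object (a sketch using the mdl defined above) :
# Sketch : per-coefficient t-tests and the overall F-test from summary()
sm <- summary(mdl)
sm$coefficients     # estimates, SEs, t values, Pr(>|t|) for H_o : beta_i = 0
sm$fstatistic       # F value with numerator / denominator degrees of freedom
pf(sm$fstatistic[1], sm$fstatistic[2], sm$fstatistic[3], lower.tail = FALSE)   # overall F-test p-value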
Okay, let's go back to our research questions :
How does student performance vary across grade, subject, and size?
How do teacher salaries vary across categories?
What features contribute most to a wealthier teacher?
# Add avr col :
iowa_std <-
iowa_std |>
rowwise() |>
mutate(avr = mean(c_across(c(`4thmath`, `8thmath`, `11thmath`,
`4th read`, `8th read`, `11th read`,
`8th sci`, `11th sci`)), na.rm = TRUE))
Suppose we divide school districts into three size groups : large districts (x > 2,000 students), medium-sized districts (1,000 < x < 2,000 students), and small districts (x < 1,000 students).
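For reference, these groups could be reconstructed from the raw size column with cut() (a sketch; the dataset already provides size(coded), so this only makes the thresholds explicit, and the handling of districts exactly at 1,000 or 2,000 students is an assumption) :
# Sketch : recode district enrollment into Small / Medium / Large
size_grp <- cut(as.numeric(as.character(iowa_std$size)),   # size was stored as a factor earlier
                breaks = c(-Inf, 1000, 2000, Inf),
                labels = c("Small", "Medium", "Large"))
table(size_grp, iowa_std$`size(coded)`)                     # cross-check against the provided coding
With the groups in hand, we model each district's average score as :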
\[ \text{avr}_i=\beta_0+\beta_1\mathbf{1}[\text{med}_i]+\beta_2\mathbf{1}[\text{large}_i] + \varepsilon_i \]
\[ \begin{align*}\text{avr}_i & : \text{Average test score for district } i \\\beta_0 & : \text{Mean score for districts in the small size group (baseline)} \\\beta_1 & : \text{Difference in mean score between medium and small districts} \\\beta_2 & : \text{Difference in mean score between large and small districts} \\\mathbf{1}[\text{med}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is medium, } 0 \text{ otherwise} \\\mathbf{1}[\text{large}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is large, } 0 \text{ otherwise} \\\varepsilon_i & : \text{Random error term for district } i\end{align*} \]
\[ \beta_0 = \mu_1, \beta_1 = \mu_2 − \mu_1, \text{ and } \beta_2 = \mu_3 − \mu_1 \]
Which is equivalent to :
\[ y_i = \mu_1 x_{i1} + \mu_2 x_{i2} + \mu_3 x_{i3} + \varepsilon_i \]
\[ \begin{align*}y_i & : \text{Average test score for district } i \\x_{i1}, x_{i2}, x_{i3} & : \text{Indicator variables for district size: small, medium, large} \\\mu_1, \mu_2, \mu_3 & : \text{Mean test scores for small, medium, and large districts, respectively} \\\varepsilon_i & : \text{Random error in test score for district }\\\end{align*} \]
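The equivalence between the dummy-coded model and this cell-means form can be seen by fitting both parameterizations (a sketch; the intercept-free formula is the usual R idiom for the cell-means model) :
# Sketch : dummy coding (baseline + differences) vs cell means (one mean per group)
dummy_fit <- lm(avr ~ `size(coded)`, data = iowa_std)       # beta_0, beta_1, beta_2
means_fit <- lm(avr ~ 0 + `size(coded)`, data = iowa_std)   # mu_1, mu_2, mu_3
coef(dummy_fit)   # intercept = mu_1 ; remaining terms = mu_2 - mu_1, mu_3 - mu_1
coef(means_fit)   # the three group means directly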
# Model for overall averages
overall_mdl <- lm(avr~`size(coded)`, data=iowa_std)
Notice these are the same results as a one-way ANOVA. Let's take a closer look at the F-test results :
summary(overall_mdl); aov(avr~`size(coded)`, data=iowa_std)
##
## Call:
## lm(formula = avr ~ `size(coded)`, data = iowa_std)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.2652 -4.5152 0.3901 4.8598 20.4848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.0152 0.4840 150.863 <2e-16 ***
## `size(coded)`2 0.5948 1.0012 0.594 0.553
## `size(coded)`3 2.5525 1.3306 1.918 0.056 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.227 on 322 degrees of freedom
## Multiple R-squared: 0.01151, Adjusted R-squared: 0.005365
## F-statistic: 1.874 on 2 and 322 DF, p-value: 0.1552
## Call:
## aov(formula = avr ~ `size(coded)`, data = iowa_std)
##
## Terms:
## `size(coded)` Residuals
## Sum of Squares 195.767 16819.764
## Deg. of Freedom 2 322
##
## Residual standard error: 7.227399
## Estimated effects may be unbalanced
levels(iowa_std$`size(coded)`) <- c("Small", "Medium", "Large")
boxplot(avr ~ `size(coded)`, data = iowa_std,
main = "Average Test Scores by District Size",
xlab = "District Size",
ylab = "Average Test Score",
col = c("skyblue", "lightgreen", "lightpink"))
abline(h = mean(iowa_std$avr, na.rm = TRUE), col = "red", lty = 2)
Our model compares the group averages and looks for the difference each indicator (S, M, L) makes. It appears large districts tend to outperform the other districts :
iowa_std |> group_by(`size(coded)`) |> summarize(avr_score_per_size = round(mean(avr), 2)) |> arrange(desc(avr_score_per_size))
This claim is supported by a quick glance at grouped summary statistics.
Instead of looking at the average across grades, suppose we look at each grade level separately :
Fourth_grd_size_mdl <- aov(`4th avg`~`size(coded)`, data=iowa_std)
summary(Fourth_grd_size_mdl)
## Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)` 2 151 75.61 0.763 0.467
## Residuals 319 31598 99.05
## 3 observations deleted due to missingness
Observations :
Eighth_grd_size_mdl <- aov(`8th avg`~`size(coded)`, data=iowa_std)
summary(Eighth_grd_size_mdl) # lots of missingness : 61 observations deleted due to missingness [about 20%]
## Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)` 2 54 26.83 0.4 0.671
## Residuals 261 17525 67.15
## 61 observations deleted due to missingness
naniar::miss_var_summary(iowa_std |> select(`8th avg`, `size(coded)`))
Observations :
8th avg is missing for 61 districts (about 20%).
Eleventh_grd_size_mdl <- aov(`11th avg`~`size(coded)`, data=iowa_std)
summary(Eleventh_grd_size_mdl); summary(lm(`11th avg`~`size(coded)`, data=iowa_std))$coeff
## Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)` 2 839 419.3 4.281 0.0149 *
## Residuals 250 24489 98.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 72 observations deleted due to missingness
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.607784 0.7658753 96.109361 2.039968e-199
## `size(coded)`Medium 2.034190 1.5493777 1.312908 1.904179e-01
## `size(coded)`Large 5.392219 1.9098955 2.823306 5.135817e-03
paste("R-Sqr : ", summary(lm(`11th avg`~`size(coded)`, data=iowa_std))$r.squared)
## [1] "R-Sqr : 0.0331127128655936"
Observations :
\[ H_o : \mu_\text{Small} = \mu_\text{Medium} = \mu_\text{Large}\\ H_a : \mu_i \ne \mu_j \text{ for at least one pair } (i,j) \]
In other words, we believe at least one group mean differs from the others. However, notice our model's \(R^2\approx 3\%\), so overall the model isn't capturing much of the variance in 11th grade district average scores.
boxplot(`11th avg`~`size(coded)`, data = iowa_std,
main = "11th Grd District by District Size",
xlab = "District Size",
ylab = "Average 11th Grade Performance",
col = c("skyblue", "lightgreen", "lightpink"))
abline(h = mean(iowa_std$avr, na.rm = TRUE), col = "red", lty = 2)
iowa_std |> group_by(`size(coded)`) |> summarize(Grade_Distr_avr = round(mean(`11th avg`, na.rm = T), 2)) |> arrange(desc(Grade_Distr_avr))
iowa_std$size_coded <- as.factor(iowa_std$`size(coded)`)
Eleventh_grd_size_mdl <- aov(`11th avg`~size_coded, data =iowa_std)
TukeyHSD(Eleventh_grd_size_mdl); plot(TukeyHSD(Eleventh_grd_size_mdl))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = `11th avg` ~ size_coded, data = iowa_std)
##
## $size_coded
## diff lwr upr p adj
## Medium-Small 2.034190 -1.6188689 5.687249 0.3893082
## Large-Small 5.392219 0.8891463 9.895292 0.0141716
## Large-Medium 3.358029 -1.8478375 8.563896 0.2828601
Based upon the results above, the Large-Small pairwise comparison of 11th grade district averages is statistically significant ( \(p \approx 0.014\) ). However, the Large-Medium comparison is not statistically significant (nor is Medium-Small), indicating the difference is only apparent when comparing the extremes against one another.
As with everything, it's important we stay centered on the formulas so that we understand their limitations and, therefore, their use cases.
Consider the formulas used for TukeyHSD() :
[INCOMPLETE – Goal : Compare Formulas ]
Different group sample sizes ( \(n_*\) )
Unequal variances ( \(\sigma_*\) )
Therefore, because we know our group sample sizes & standard deviations are not equal, we try games_howell_test . Consider the statistic used in this test (sketched below) :
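As a sketch of that statistic (stated here from standard descriptions of the Games–Howell procedure rather than from the course text, so treat the exact form as an assumption) : each pair of groups \(i, j\) is compared with a Welch-type statistic and Welch–Satterthwaite degrees of freedom,
\[ t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}}}, \qquad \nu_{ij} = \frac{\left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)^2}{\frac{(s_i^2/n_i)^2}{n_i - 1} + \frac{(s_j^2/n_j)^2}{n_j - 1}} \]
with p-values referred to a studentized range distribution on \(k\) groups and \(\nu_{ij}\) degrees of freedom. Unlike Tukey's HSD, the group-specific variances \(s_i^2\) and sample sizes \(n_i\) enter each comparison separately, which is exactly what our unequal group sizes and spreads call for.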
library(rstatix)
# For games_howell_test to work we need complete cases across all features :
library(tidyr) # provides drop_na()
DF <- iowa_std |> select(District, `11th avg`, size_coded) |> drop_na()
DF |> games_howell_test(`11th avg` ~ size_coded)