Project Duration : 7 Days

Project Description :

Citation : https://www.lacan.upc.edu/admoreWeb/2018/05/all-models-are-wrong-but-some-are-useful-george-e-p-box/

This project conducts a district-level analysis of standardized achievement in mathematics, reading, and science across 325 Iowa school districts during the 2000–2001 academic year, integrating test outcomes for grades 4, 8, and 11 with district characteristics such as size, average teacher salary, and teaching experience. The study is designed as a methodological demonstration of rigorous linear modeling, where statistical conclusions are grounded explicitly in mathematical derivations rather than treated as black-box outputs. Using R, we first implement standard regression tools and diagnostics, interpret empirical results, and then derive the corresponding estimators, variance expressions, and test statistics to examine model assumptions, interpretability, and limitations in the context of real educational data. The analysis adopts a generic, reusable programming framework to support extensibility to future datasets, data enrichment, and policy-relevant applications. Theoretical development is guided primarily by Chapter 8 of Introduction to Regression Modeling (Abraham & Ledolter, 2006), complemented by STATS 100C lecture materials, and was completed entirely by me over the seven project days, enabling a tight integration of theory, computation, and substantive interpretation.

Project Topics :

  • Linear Modeling [ Regression ]

  • Linear Algebra [ proofs ]

  • Generic R Programming [ function & package building ]

  • Literature Review Process [ Iowa in 2001 ]

  • Data enrichment [ income inequality & district map ]

Research Questions :

  • How does student performance vary across grade, subject, and district size?

  • How do teacher salaries vary across categories?

    • What features contribute most to higher teacher salaries?

Toss Files In :

url <- "https://www.biz.uiowa.edu/faculty/jledolter/RegressionModeling/Data/Chapter8/iowastudent.txt"

# download.file(url, destfile = "iowastudent.txt", mode = "wb")

library(readr)
iowa_std <- read_tsv("iowastudent.txt", na = "*", show_col_types = FALSE) 

observations :

  • Notice that we are missing 4th grade science

Basics :

## [1] "325  School Districts"
## [1] "22  Features"
##  [1] "District"    "4thmath"     "8thmath"     "11thmath"    "4th read"   
##  [6] "8th read"    "11th read"   "8th sci"     "11th sci"    "size"       
## [11] "size(coded)" "small size"  "med size"    "large size"  "salary"     
## [16] "experience"  "4th avg"     "8th avg"     "11th avg"    "Math avg"   
## [21] "Read avg"    "Sci avg"

Notice the size metrics are categorical features; therefore, categorical encoding (factor) is appropriate.

library(dplyr)
# Ensure our data behaves like a categorical variable
iowa_std <- 
  iowa_std |> 
  mutate(across(contains("size"), as.factor))

Pairwise Relationships :

Correlation Matrix

  • Since cor(x,y) is a measure defined for numerical data, it is appropriate to generate a strictly numerical data set before summarizing the pairwise linear relationships.
library(dplyr)
num_df <- iowa_std[,-1] |>
  select(-starts_with("size("), -ends_with(" size"))
num_df$size <- as.numeric(num_df$size)

Here we take a look at global pairwise patterns; no deep analysis, just simple observations.

## corrplot 0.95 loaded
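The chunk that produced the plot above is not echoed; below is a minimal sketch of how the global correlation matrix cor_mat (used later for the column-sum check) and its corrplot might be built. The use of complete observations is an assumption, chosen to mirror the later focused-analysis chunks.

library(corrplot)
# Global correlation matrix over the numeric features (complete observations assumed)
cor_mat <- cor(num_df, use = "complete.obs")
# Heatmap-style overview of all pairwise linear relationships
corrplot(cor_mat, method = "color", type = "upper", tl.col = "black")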

Observations

  • Weak correlations : there appear to be almost no linear relationships involving size, salary, or experience, except for the (salary, experience) and (salary, size) pairs

    • Indicating : these features tend to be nearly orthogonal to the other features and hold unique, linearly unrelated information ( no linear redundancy ). When a correlation is exactly 0 and the features are jointly multivariate normal, this indicates independence.
  • Strong correlations : the math, reading, and science averages appear to have the strongest pairwise relationships

    • Indicating : we should take this as a warning sign of multicollinearity when moving to the modeling stage. That is, some features can be represented as nearly linear combinations of one another.
  • All other features are Positively Correlated [Pairwise Regression Coef are Positive]

    • Claim : the most robustly correlated features look to be math and reading. Some individual pairs are more strongly correlated, but math and reading appear the most consistently positive overall.

We can test this claim and see :

cor_mat |> 
  colSums() |> 
  data.frame() |> 
  mutate(colSums.cor_mat. = colSums.cor_mat. - 1) |> # remove the diagonal (self-correlation of 1)
  mutate(colSums.cor_mat. = 2*colSums.cor_mat.) |> # Consider both covariances, cov(xi,xj) = cov(xj, xi)
  arrange(desc(colSums.cor_mat.)) |> head(4)

As we can see, these variables contribute the most to the overall covariation. The reasoning for this interpretation goes as follows :

Notice :

\[ \text{Let X, Y be Rand. Variables : }\\ \text{Var}(X +Y) = \sigma^2_X + \sigma^2_Y + 2\text{Cov(X,Y)} \]

proof

\[ \text{Var}(X +Y) = \text{Var}(W) \implies W=X +Y \\ \text{Var}(W)=\mathbb{E}(W^2)-\mathbb{E}^2(W) \]

\[ = \mathbb{E}((X+Y)^2) -\mathbb{E}^2(X+Y) \\ = \mathbb{E}(X^2+2XY+Y^2)-(\mu_X+\mu_Y)^2 \]

recall :

\[ \text{Var}(W)=\mathbb{E}(W^2)-\mathbb{E}^2(W) \\ \implies \\ \mathbb{E}(W^2)= \sigma^2_W+\mathbb{E}^2(W) \]

therefore,

\[ \text{Var}(X +Y) \\ = \mathbb{E}(X^2+2XY+Y^2)-(\mu_X+\mu_Y)^2 \\ = \mathbb{E}(X^2) + \mathbb{E}(Y^2) + 2\mathbb{E}(XY)-(\mu_X^2+2\mu_X\mu_Y+\mu_Y^2) \\ =[\mathbb{E}(X^2)-\mu_X^2]+[\mathbb{E}(Y^2)-\mu_Y^2]+ 2[\mathbb{E}(XY)-\mu_X\mu_Y]\\ = \sigma_X^2+\sigma_Y^2+2\text{Cov}(X,Y) \]

or, more generally : \(\{X,Y\}=\{X_1,X_2\}\)

\[ \implies \text{Var}(X_1+X_2)=\sum_1^2\text{Var}(X_i)+2\text{Cov}(X_1,X_2) \]

further, suppose we had \(\text{Var}(\sum{X_i})=\text{Var}(X_1 + ...+X_n)\)

Notice \(\sum{X_i}= X_1 + ...+X_n\) ; suppose we partition the set of random variables into two groups :

\[ \sum{X_i}= X_1 + ...+X_n=A+B \\ \text{For : } \\ A = \sum_1^m{X_i} \\ B=\sum_{m+1}^n{X_i} \]

\[ \implies \text{Var}(\sum{X_i})=\text{Var}(A+B) \\ = \sigma^2_A +\sigma^2_B +2\text{Cov(A,B)} \\ = \sigma^2_{\sum_1^m{X_i}}+\sigma^2_{\sum_{m+1}^n{X_i}} + 2\text{Cov}(\sum_1^m{X_i}, \sum_{m+1}^n{X_i}) \]

and, \(\sigma^2_{\sum_1^m{X_i}}=\text{Var}(\sum_1^m{X_i})=\text{Var}(X_1+...+X_m)\)

so we can iteratively apply the formula to obtain :

\[ \text{Var}(\sum{X_i}) = \sum{\text{Var}(X_i)}+2\sum_{i<j}{\text{Cov}(X_i, X_j)} \]

Meaning : the variance of the total sum can be described as the sum of the individual variances plus the pairwise covariance contributions.

Suppose we are interested in just one column of the covariance matrix, feature \(X_1\) for example :

\[ =\text{Var}(X_1)+\sum_{j \ne 1}{\text{Cov}(X_1, X_j)} \]

So a column sum represents \(X_1\)’s variance plus all pairwise covariances involving \(X_1\) (notice we don’t count any covariance twice).

Therefore, realize :

\[ \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) \\ = \underbrace{\left( \operatorname{Var}(X_1) + 2\sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \right)}_{\text{all terms involving } X_1} + \underbrace{\left( \sum_{i=2}^n \operatorname{Var}(X_i) + 2\sum_{2 \le i < j \le n} \operatorname{Cov}(X_i,X_j) \right)}_{\text{all terms not involving } X_1} \]

\[ \implies \\ \operatorname{Var}(X_1) + 2\sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \\ = \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) - \left( \sum_{i=2}^n \operatorname{Var}(X_i) + 2\sum_{2 \le i < j \le n} \operatorname{Cov}(X_i,X_j) \right) \]

Therefore :

\[ \operatorname{Var}(X_1) + 2\sum_{j=2}^n \operatorname{Cov}(X_1,X_j) - \sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \\ = \operatorname{Var}(X_1) +\sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \\ = \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right) - \left( \sum_{i=2}^n \operatorname{Var}(X_i) + 2\sum_{2 \le i < j \le n} \operatorname{Cov}(X_i,X_j) \right) - \sum_{j=2}^n \operatorname{Cov}(X_1,X_j) \\ =\text{Var}(X_1)+\sum{\text{Cov}(X_1, X_j)} \]

Meaning : the column sum (equivalently the row sum, since the matrix is symmetric) represents the variation remaining from the variation of the total sum, after taking away the variation of all variables not involving \(X_1\) and without double counting the \(X_1\) covariances. This allows us to understand the contribution of a particular variable without considering the variation of the others.
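A quick numeric check of these identities, a sketch using the complete cases of the numeric Iowa features (the na.omit step is an assumption for illustration):

nc <- na.omit(num_df)
# Var of the total sum equals the sum of every entry of the covariance matrix
all.equal(var(rowSums(nc)), sum(cov(nc)))
# A column sum of the covariance matrix equals Var(X1) plus all Cov(X1, Xj), j != 1
all.equal(unname(colSums(cov(nc))[1]), var(nc[[1]]) + sum(cov(nc)[-1, 1]))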

Similarly consider a correlation matrix :

Consider the relationship between correlation matrix & Covariance Matrix :

Recall :

\[ \sigma^2_{X_i} = \text{cov}(x_i,x_i) \]

and,

\[ \text{cov}(x_i,x_j) \in \sum_{X}, \]

Consider the following known equivalence :

\[ \sum_{X} \\ = \begin{bmatrix}\sigma^2_{X_1} & \text{Cov}(X_1, X_2) & \text{Cov}(X_1, X_3) \\\text{Cov}(X_2, X_1) & \sigma^2_{X_2} & \text{Cov}(X_2, X_3) \\\text{Cov}(X_3, X_1) & \text{Cov}(X_3, X_2) & \sigma^2_{X_3}\end{bmatrix} \\ = \begin{bmatrix}\sigma^2_{X_1} & r_{1,2}\sigma_{X_1}\sigma_{X_2} & r_{1,3}\sigma_{X_1}\sigma_{X_3} \\r_{2,1}\sigma_{X_2}\sigma_{X_1} & \sigma^2_{X_2} & r_{2,3}\sigma_{X_2}\sigma_{X_3} \\r_{3,1}\sigma_{X_3}\sigma_{X_1} & r_{3,2}\sigma_{X_3}\sigma_{X_2} & \sigma^2_{X_3}\end{bmatrix} \]

That is to say : \(\text{cov}(x_i, x_j)= r_{ij}\sigma_{i}\sigma_{j}\).

Meaning : covariance and correlation are related; we interpret the covariance as the product of the two standard deviations scaled by the strength of the linear relationship.

Further,

\[ \text{cov}(x_i, x_j)= r_{ij}\sigma_{i}\sigma_{j} \\ \implies \\ r_{i,j} = \frac{\text{cov}(x_i,x_j)}{\sigma_{x_i}\sigma_{x_j}} \]
and,

\[ r_{i,j} \in C \]

\[ C =\begin{bmatrix}1 & r_{1,2} & r_{1,3} \\r_{2,1} & 1 & r_{2,3} \\r_{3,1} & r_{3,2} & 1\end{bmatrix} \]

Therefore, suppose we wish to sum the first Column :

\[ 1 + r_{2,1} + r_{3,1} \]

equivalently,

\[ \sum_{i=1}^{3} {r_{i,1}}=1 + \sum_{i=2}^{3} {r_{i,1}} \]

Or, more generally for a correlation matrix \(C^{n \times n}\) :

\[ \sum_{i=1}^{n} {r_{i,1}}=1 + \sum_{i=2}^{n} {r_{i,1}} \]

equivalently,

\[ \sum_{i=1}^{n} {r_{i,1}}= \sum_{i=1}^{n}\frac{\text{cov}(x_i,x_1)}{\sigma_{x_i}\sigma_{x_1}} \]

Meaning, the leading 1 is simply the diagonal term :

\[ \frac{\text{cov}(x_1,x_1)}{\sigma_{x_1}\sigma_{x_1}} =\frac{\text{cov}(x_1,x_1)}{\sigma_{x_1}^2}=\frac{\sigma_{x_1}^2}{\sigma_{x_1}^2}=1 \]
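In code, this rescaling is exactly what stats::cov2cor() does. A small sketch to confirm, assuming complete cases of the numeric features:

nc <- na.omit(num_df)
# The correlation matrix is the covariance matrix rescaled by the standard deviations
all.equal(cov2cor(cov(nc)), cor(nc))
# Element-wise : r_ij = cov(x_i, x_j) / (sd(x_i) * sd(x_j))
all.equal(cor(nc$salary, nc$experience),
          cov(nc$salary, nc$experience) / (sd(nc$salary) * sd(nc$experience)))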

Regression & Correlation Matrices :

Consider the singular case of numerical event \(x \text{ and }y\) for a correlation \(r_{xy} = \frac{\text{cov}(x,y)}{\sigma_x\sigma_y}\).

More generally, these pairwise correlations belong to a correlation matrix containing all of our numerical features :

\[ r_{ij} \in C^{p \times p} \\ r_{ij} \in [-1, 1] \text{ for : } r_{ij} \in \mathbb{R} \]

Where \(p=\text{Number of Parameters}\) (not inc. intercept). In deriving our correlations ( \(r_{ij}\) ) we are guided by Abraham & Ledolter’s treatment of multicollinearity (Abraham & Ledolter, 2006, Ch. 5.4), rewriting formulas verbatim and using the text as a framework to connect regression structure and predictor correlations.

We begin by first making assumptions :

  1. Our Data can be described using a linear model
  2. Our errors are normally distributed
  3. The \(\mathbb{E}(\epsilon)=0\) – the population average of error is centered around 0
  4. We have constant variance and uncorrelated errors, i.e. \(\text{Cov}(\epsilon)=\sigma^2I_n\)

\[ \vec{y}=X\vec{\beta} +\epsilon \text{ for : } \epsilon \sim N(\vec{0}, \sigma^2 I_n) \]

Suppose we expand \(\vec{y}\) :

\[ y= f(X) + \epsilon\\ = X\vec{\beta} + \epsilon\\ = \beta_0 + \beta_1\vec{x}_1 + ...+\beta_p\vec{x}_p + \epsilon \\ = \beta_0 + \sum\beta_i \vec{x}_i + \epsilon \] For this we assume some intercept \(\beta_0\).

Mathematical Considerations in Pairwise Regression :

Notice that for each of the following Regression models we hold the same structure :

\[ \vec{y}_{325}=\beta_0+\beta_1\vec{x} + \epsilon \\ \text{for: } \vec{x}\in X^{325 \times 2} = [\vec{1}_n \ \vec{x}_1] \\ X'X= \begin{bmatrix}n & \sum{x_i} \\\sum{x_i} & \sum{x_i^2}\end{bmatrix} \\ \implies (X'X)^{-1}\\ =\frac{1}{\text{det}(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} \]

\[ =\frac{1}{n \sum{x_i^2}-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} \]

Which as we know is particularly relevant when developing a notion of variance and center for \(\hat{\beta}\).

Recall :

\[ \hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1}) \]

proof

We wish to prove 3 claims :

  1. Claim : “Our est. of slopes is Normally distributed”

  2. Claim : “Our est. is unbiased”

  3. Claim : “Our covariance matrix of the slope estimate is the variance of the random error times the inverse of the Gram matrix”

Let’s take them one by one; consider claim 1 :

    1. Claim : “Our est. of slopes is Normally distributed”

We will prove this rather informally based upon a basic property of Normally distributed data.

Consider \(\vec{\epsilon} \sim N(0, \sigma^2I_n)\) and \(y = X\vec{\beta} + \vec{\epsilon}\) and \(X\vec{\beta}\) is a deterministic (constant). We know that linear combinations ( \(aX +b\) ) of normally distributed data \(X \sim N(\mu, \sigma^2)\) is also normally distributed (More rigorous proof here), therefore \(\vec{y}\) is also normally distributed. In other words, since \(\vec{\epsilon}\) is the random part of our model ( \(y = X\vec{\beta} + \vec{\epsilon}\) ) and \(X\vec{\beta}\) is a deterministic, we say \(y\) inherits its randomness from \(\vec{\epsilon}\) and because \(\vec{\epsilon}\) is normally distributed & \(\vec{y}\) is just a linear combination of \(\vec{\epsilon}\) we say \(\vec{y}\) is also normally distributed.

    2. Claim : “Our est. is unbiased”

Recall for something to be unbiased we say \(\mathbb{E}(\hat{\theta}) - \theta = 0\). Writing \(\hat{\beta}=(X'X)^{-1}X'\vec{y}=A\vec{y}\) :

\[ \mathbb{E}(\hat{\beta}) \\ = \mathbb{E}(A\vec{y}) \\ = A\mathbb{E}(\vec{y}) \\ = A\mathbb{E}(X\vec{\beta} + \epsilon) \\ = A[\mathbb{E}(X\vec{\beta}) + \mathbb{E}(\epsilon)] \\ = A[\mathbb{E}(X\vec{\beta}) + 0] \\ = A\mathbb{E}(X\vec{\beta}) \\ = AX\vec{\beta} \\ = (X'X)^{-1}X' X \vec{\beta} \\ = (X'X)^{-1}(X' X) \vec{\beta} \\ = I \vec{\beta} \\ = \vec{\beta} \]

\[ \implies \\ = \mathbb{E}(\hat{\beta}) -\vec{\beta} \\ = \vec{\beta} -\vec{\beta} \\ = \vec{0} \\ \implies \\ \mathbb{E}(\hat{\beta}) -\vec{\beta}=\vec{0} \]

    3. Claim : “Our covariance matrix of the slope estimate is the variance of the random error times the inverse of the Gram matrix”

\[ \sum_{\hat{\beta}} \\ = \text{Cov}(\hat{\beta}) \\ = \text{Cov}(A\vec{y}) \\ = A\text{Cov}(\vec{y})A' \\ = A\sigma^2 I_nA' \\ = \sigma^2AA' \\ = \sigma^2(X'X)^{-1}X'[(X'X)^{-1}X']' \]

\[ = \sigma^2(X'X)^{-1}X'X[(X'X)^{-1}]' \\ = \sigma^2(X'X)^{-1}X'X[(X'X)']^{-1} \\ = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} \\ = \sigma^2(X'X)^{-1}(X'X)(X'X)^{-1} \\ = \sigma^2 I(X'X)^{-1} \\ = \sigma^2(X'X)^{-1} \] Note we use the symmetry of \(X'X\) here; \(X\) itself is not square, so \(X^{-1}\) alone does not exist.
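As a sanity check, we can verify both the closed-form estimator and its covariance against lm(). This is a sketch using the Math avg / Read avg pair on complete cases of those two columns (an illustrative choice, not part of the derivation):

cc  <- na.omit(iowa_std[, c("Math avg", "Read avg")])
X   <- cbind(1, cc$`Math avg`)                 # design matrix [1, x]
y   <- cc$`Read avg`
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
fit <- lm(`Read avg` ~ `Math avg`, data = cc)
all.equal(as.vector(beta_hat), unname(coef(fit)))
# Cov(beta_hat) = sigma^2 (X'X)^{-1}, with sigma^2 estimated by the MSE
sigma2_hat <- sum(residuals(fit)^2) / (nrow(X) - ncol(X))
all.equal(sigma2_hat * solve(t(X) %*% X), unname(vcov(fit)))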

Generalized Programming :

Pairwise Plots :

pairwise_dot_plt <- function(data_set) {
  # Generate all pairwise combinations
  pairwise_comb <- combn(1:ncol(data_set), 2)

  # iterate across each
  for (idx in 1:ncol(pairwise_comb)) {
    comb_vec <- pairwise_comb[, idx]
    x_axis <- comb_vec[1]
    y_axis <- comb_vec[2]
    
    
    # generate reg mdl 
    mdl <- lm(unlist(data_set[, y_axis]) ~ unlist(data_set[, x_axis]), data = data_set)

    # Plot the pair and overlay the fitted regression line
    plot(
      unlist(data_set[, x_axis]),
      unlist(data_set[, y_axis]),
      xlab = colnames(data_set)[x_axis],
      ylab = colnames(data_set)[y_axis],
      main = paste("Plot", idx)
    )
    abline(mdl, col = "red")
  }
}
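A minimal usage sketch (the output device and file name are assumptions, not part of the original chunk): since the function draws one plot per pair, routing it to a PDF keeps every page.

# Example usage : write every pairwise plot to a multi-page PDF
pdf("pairwise_plots.pdf")
pairwise_dot_plt(num_df)
dev.off()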

Mathematical Considerations :

  • Use \(\text{det}(X'X)\) as a stability metric for the slope estimates’ standard error bounds: as \(\text{det}(X'X) \rightarrow 0\), the inverse \((X'X)^{-1}\) becomes unstable and the standard errors blow up. In the limiting case, \(\text{det}(X'X)=0\implies\text{Let : }A=X'X \implies A^{-1} \text{ DNE}\), as there does not exist a matrix satisfying \(A^{-1}A=I\). A numeric sketch follows.
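A numeric sketch of this stability check; the choice of predictors here is illustrative, not from the original analysis:

nc <- na.omit(num_df)
X  <- cbind(1, nc$`Math avg`, nc$`Read avg`)   # intercept plus two correlated predictors
det(crossprod(X))     # det(X'X); values near 0 signal an unstable inverse
kappa(crossprod(X))   # condition number; very large values signal near-collinearity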

Focused Analysis :

library(dplyr)

# Grade, Topic Correlations : 
gd_tpk_cor <- num_df |> select(contains("th"),-contains("avg")) |> cor(use = "complete.obs")

# Topic avg Correlations
sub_avr_cor <- num_df |> select(ends_with("avg"), -matches("^[0-9]+th")) |> cor(use = "complete.obs")

# Grade avg Correlations 
grd_avg_cor <- num_df |> select(matches("^[0-9]+th avg")) |> cor(use = "complete.obs")

Separate Corr :

Grade-Topic Correlations :

corrplot(gd_tpk_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

# Set bounds and filter :  
weak_cutoff <- .3 # below : weak linear association
strong_cutoff <- .6 # above : strong linear association

low_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor < weak_cutoff)
med_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor >= weak_cutoff) & (gd_tpk_cor <= strong_cutoff)
high_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor > strong_cutoff)

# select only ij obs (upper.tri) and not ji, sym matrix
low_pairs <- which(low_idx, arr.ind = TRUE)
med_pairs <- which(med_idx, arr.ind = TRUE)
high_pairs <- which(high_idx, arr.ind = TRUE)
# general NA matrix function : 
generate_NA_mat <- function(mat){
  # Build an all-NA matrix with the same dimensions and names as the input
  NA_mat <- matrix(
    NA,
    nrow = nrow(mat),
    ncol = ncol(mat),
    dimnames = dimnames(mat)
  )
  return(NA_mat)
}
# generate general function : 
build_masked_matrix <- function(cor_matrix, pair_indices) {
  NA_mat <- generate_NA_mat(cor_matrix)
  NA_mat[pair_indices] <- cor_matrix[pair_indices]
  return(NA_mat)
}

print("low_cor_mat")
## [1] "low_cor_mat"
low_cor_mat  <- build_masked_matrix(gd_tpk_cor, low_pairs); low_cor_mat
##           4thmath 8thmath  11thmath  4th read 8th read 11th read   8th sci
## 4thmath        NA      NA 0.2742213        NA       NA 0.1940918 0.2283466
## 8thmath        NA      NA        NA        NA       NA 0.2997928        NA
## 11thmath       NA      NA        NA 0.2819581       NA        NA 0.1771869
## 4th read       NA      NA        NA        NA       NA 0.2880443 0.2739120
## 8th read       NA      NA        NA        NA       NA        NA        NA
## 11th read      NA      NA        NA        NA       NA        NA 0.2655738
## 8th sci        NA      NA        NA        NA       NA        NA        NA
## 11th sci       NA      NA        NA        NA       NA        NA        NA
##            11th sci
## 4thmath   0.1864202
## 8thmath   0.2302258
## 11thmath         NA
## 4th read  0.2111796
## 8th read  0.2584866
## 11th read        NA
## 8th sci          NA
## 11th sci         NA
print("med_cor_mat")
## [1] "med_cor_mat"
med_cor_mat  <- build_masked_matrix(gd_tpk_cor, med_pairs); med_cor_mat
##           4thmath   8thmath  11thmath  4th read  8th read 11th read   8th sci
## 4thmath        NA 0.3817187        NA        NA 0.3468462        NA        NA
## 8thmath        NA        NA 0.3118259 0.3787978        NA        NA 0.5362033
## 11thmath       NA        NA        NA        NA 0.3016574        NA        NA
## 4th read       NA        NA        NA        NA 0.3526437        NA        NA
## 8th read       NA        NA        NA        NA        NA 0.3201637 0.5382480
## 11th read      NA        NA        NA        NA        NA        NA        NA
## 8th sci        NA        NA        NA        NA        NA        NA        NA
## 11th sci       NA        NA        NA        NA        NA        NA        NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath  0.4566460
## 4th read         NA
## 8th read         NA
## 11th read        NA
## 8th sci   0.3122344
## 11th sci         NA
print("high_cor_mat")
## [1] "high_cor_mat"
high_cor_mat <- build_masked_matrix(gd_tpk_cor, high_pairs); high_cor_mat
##           4thmath 8thmath 11thmath  4th read  8th read 11th read 8th sci
## 4thmath        NA      NA       NA 0.6752072        NA        NA      NA
## 8thmath        NA      NA       NA        NA 0.6897058        NA      NA
## 11thmath       NA      NA       NA        NA        NA 0.6061706      NA
## 4th read       NA      NA       NA        NA        NA        NA      NA
## 8th read       NA      NA       NA        NA        NA        NA      NA
## 11th read      NA      NA       NA        NA        NA        NA      NA
## 8th sci        NA      NA       NA        NA        NA        NA      NA
## 11th sci       NA      NA       NA        NA        NA        NA      NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath         NA
## 4th read         NA
## 8th read         NA
## 11th read 0.6401641
## 8th sci          NA
## 11th sci         NA

Observations for gd_tpk_cor :

  • high_cor_mat : Math and Reading across grade levels (4, 8, 11) appear to be strongly correlated ( \(r_{\text{high}} \approx 0.65\) )

  • med_cor_mat : 8th grade scores across all categories appear to have moderate relationships; 8th grade science in particular appears to approach the high_cor_mat category [consider the diagonal behavior].

  • low_cor_mat : Most correlations are rather weak

Masked Matrix Function :

masked_matrix_func  <- function(correlation_matrix, bottom_cutoff = 0.3, top_cutoff = 0.6) {
  if (!is.matrix(correlation_matrix)) {
    stop("Input must be a matrix.")
  }
  
  # Logical indices for each strength
  l_idx <- upper.tri(correlation_matrix) & (correlation_matrix < bottom_cutoff)
  m_idx <- upper.tri(correlation_matrix) & 
           (correlation_matrix >= bottom_cutoff) & 
           (correlation_matrix <= top_cutoff)
  h_idx <- upper.tri(correlation_matrix) & (correlation_matrix > top_cutoff)
  
  # Pair indices
  l_pairs <- which(l_idx, arr.ind = TRUE)
  m_pairs <- which(m_idx, arr.ind = TRUE)
  h_pairs <- which(h_idx, arr.ind = TRUE)
  
  # Generate empty NA matrix func
  generate_NA_mat <- function(mat) {
    matrix(NA, nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
  }

  # Use helper function : 
  build_masked_matrix <- function(mat, idx) {
    NA_mat <- generate_NA_mat(mat)
    NA_mat[idx] <- mat[idx]
    return(NA_mat)
  }

  # Store all three in a named list
  cor_lst <- list(
    low_cor_mat  = build_masked_matrix(correlation_matrix, l_pairs),
    med_cor_mat  = build_masked_matrix(correlation_matrix, m_pairs),
    high_cor_mat = build_masked_matrix(correlation_matrix, h_pairs)
  )
  return(cor_lst)
}

Subject-Avr Correlations :

# plt : 
corrplot(sub_avr_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

masked_matrix_func(sub_avr_cor)
## $low_cor_mat
##          Math avg Read avg Sci avg
## Math avg       NA       NA      NA
## Read avg       NA       NA      NA
## Sci avg        NA       NA      NA
## 
## $med_cor_mat
##          Math avg Read avg   Sci avg
## Math avg       NA       NA 0.4922274
## Read avg       NA       NA        NA
## Sci avg        NA       NA        NA
## 
## $high_cor_mat
##          Math avg  Read avg   Sci avg
## Math avg       NA 0.7568206        NA
## Read avg       NA        NA 0.6127668
## Sci avg        NA        NA        NA
cov(num_df, use = "complete.obs")
##                4thmath     8thmath    11thmath    4th read    8th read
## 4thmath     115.326528   35.645267   29.514496   75.911131   33.704589
## 8thmath      35.645267   75.611479   27.175386   34.483011   54.268186
## 11thmath     29.514496   27.175386  100.447541   29.584075   27.357166
## 4th read     75.911131   34.483011   29.584075  109.599264   33.406233
## 8th read     33.704589   54.268186   27.357166   33.406233   81.879482
## 11th read    24.220580   30.291992   70.595550   35.040950   33.664523
## 8th sci      27.540761   52.364883   19.944262   32.205621   54.699930
## 11th sci     27.295669   27.295050   62.400117   30.143376   31.890522
## size        -48.792263   70.141410   96.443208   52.000251  119.116076
## salary     2085.229400 5395.757446 8124.694133 6447.328575 6698.265199
## experience    0.166605    3.322614    4.187824    1.337106    4.311578
## 4th avg      95.618829   35.064139   29.549286   92.755197   33.555411
## 8th avg      32.296859   60.748205   24.825627   33.364939   63.615883
## 11th avg     27.010228   28.254109   77.814433   31.589451   30.970700
## Math avg     60.162129   46.144054   52.379151   46.659430   38.443322
## Read avg     44.612073   39.681051   42.512239   59.348801   49.650075
## Sci avg      27.418215   39.829967   41.172190   31.174498   43.295226
##               11th read     8th sci     11th sci          size       salary
## 4thmath       24.220580   27.540761    27.295669    -48.792263     2085.229
## 8thmath       30.291992   52.364883    27.295050     70.141410     5395.757
## 11thmath      70.595550   19.944262    62.400117     96.443208     8124.694
## 4th read      35.040950   32.205621    30.143376     52.000251     6447.329
## 8th read      33.664523   54.699930    31.890522    119.116076     6698.265
## 11th read    135.028538   34.658899   101.423879    187.892523    12251.038
## 8th sci       34.658899  126.134426    47.811651     22.400468     3533.438
## 11th sci     101.423879   47.811651   185.896939    186.582176    10819.828
## size         187.892523   22.400468   186.582176   6829.021245   203260.758
## salary     12251.038013 3533.437799 10819.827665 203260.757937 10759419.737
## experience     4.043433    4.621802     5.819243     40.474950     4170.303
## 4th avg       29.630765   29.873191    28.719522      1.603994     4266.279
## 8th avg       32.871806   77.733095    35.665758     70.552723     5209.160
## 11th avg     102.349321   34.138223   116.573634    156.972605    10398.517
## Math avg      41.702716   33.283313    38.996962     39.264063     5201.897
## Read avg      67.911317   40.521453    54.485890    119.669529     8465.541
## Sci avg       68.041389   86.973039   116.854295    104.491322     7176.633
##              experience      4th avg     8th avg     11th avg    Math avg
## 4thmath       0.1666050   95.6188294   32.296859    27.010228   60.162129
## 8thmath       3.3226142   35.0641388   60.748205    28.254109   46.144054
## 11thmath      4.1878244   29.5492857   24.825627    77.814433   52.379151
## 4th read      1.3371062   92.7551974   33.364939    31.589451   46.659430
## 8th read      4.3115780   33.5554112   63.615883    30.970700   38.443322
## 11th read     4.0434329   29.6307653   32.871806   102.349321   41.702716
## 8th sci       4.6218021   29.8731909   77.733095    34.138223   33.283313
## 11th sci      5.8192429   28.7195224   35.665758   116.573634   38.996962
## size         40.4749498    1.6039938   70.552723   156.972605   39.264063
## salary     4170.3034383 4266.2789873 5209.160190 10398.516555 5201.896513
## experience    6.8160538    0.7518556    4.085333     4.683491    2.559016
## 4th avg       0.7518556   94.1870134   32.830899    29.299840   53.410780
## 8th avg       4.0853333   32.8308994   67.365746    31.121024   39.290240
## 11th avg      4.6834915   29.2998398   31.121024    98.912469   44.359601
## Math avg      2.5590159   53.4107799   39.290240    44.359601   52.895129
## Read avg      3.2307024   51.9804372   43.284193    54.969798   42.268468
## Sci avg       5.2205225   29.2963566   56.699427    75.355929   36.140137
##               Read avg     Sci avg
## 4thmath      44.612073   27.418215
## 8thmath      39.681051   39.829967
## 11thmath     42.512239   41.172190
## 4th read     59.348801   31.174498
## 8th read     49.650075   43.295226
## 11th read    67.911317   68.041389
## 8th sci      40.521453   86.973039
## 11th sci     54.485890  116.854295
## size        119.669529  104.491322
## salary     8465.540761 7176.632732
## experience    3.230702    5.220522
## 4th avg      51.980437   29.296357
## 8th avg      43.284193   56.699427
## 11th avg     54.969798   75.355929
## Math avg     42.268468   36.140137
## Read avg     58.970051   47.503672
## Sci avg      47.503672  101.913667

Observations for sub_avr_cor :

  • high_cor_mat : Math, Reading Avr ; Sci, Reading Avr

  • med_cor_mat : Math, Science Avr

  • low_cor_mat : NA

masked_matrix_func(sub_avr_cor^2, bottom_cutoff = 0.3^2, top_cutoff = 0.6^2)
## $low_cor_mat
##          Math avg Read avg Sci avg
## Math avg       NA       NA      NA
## Read avg       NA       NA      NA
## Sci avg        NA       NA      NA
## 
## $med_cor_mat
##          Math avg Read avg   Sci avg
## Math avg       NA       NA 0.2422878
## Read avg       NA       NA        NA
## Sci avg        NA       NA        NA
## 
## $high_cor_mat
##          Math avg  Read avg   Sci avg
## Math avg       NA 0.5727773        NA
## Read avg       NA        NA 0.3754831
## Sci avg        NA        NA        NA

Interpretation for sub_avr_cor :

Grade-Avr Correlations :

corrplot(grd_avg_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

masked_matrix_func(grd_avg_cor)
## $low_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
## 
## $med_cor_mat
##          4th avg   8th avg  11th avg
## 4th avg       NA 0.4121618 0.3035598
## 8th avg       NA        NA 0.3812491
## 11th avg      NA        NA        NA
## 
## $high_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
grep("_cor$", ls(), value = T)
## [1] "gd_tpk_cor"  "grd_avg_cor" "sub_avr_cor"
cor_lst <- list(gd_tpk_cor = gd_tpk_cor, grd_avg_cor = grd_avg_cor, sub_avr_cor = sub_avr_cor)

# Average Correlation
lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)})
## $gd_tpk_cor
## [1] 0.31
## 
## $grd_avg_cor
## [1] 0.24
## 
## $sub_avr_cor
## [1] 0.41
# average of averages : 
paste(lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)}) |> unlist() |> mean(), " avr cor")
## [1] "0.32  avr cor"

Global Observations :

  • Among the variables of interest we have an average \(\bar{r}_{ij}\) of .32, so overall the linear signals appear fairly weak.

  • Subject pairs are more related than grade pairs

    • In summary : grd_avg_cor < gd_tpk_cor < sub_avr_cor

Mathematical Observations :

Notice \(r_{ij}\in C \implies R^2_{x_i}=r_{i,y}^2\). In other words, the squared correlations are the coefficients of determination for the corresponding pairwise models (a quick check follows).
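A quick check of this claim, sketched on the Math avg / Read avg pair (complete cases assumed):

cc <- na.omit(iowa_std[, c("Math avg", "Read avg")])
r  <- cor(cc$`Math avg`, cc$`Read avg`)
r2_lm <- summary(lm(`Read avg` ~ `Math avg`, data = cc))$r.squared
all.equal(r^2, r2_lm)   # squared pairwise correlation equals the pairwise model's R^2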

masked <- lapply(cor_lst, masked_matrix_func)
high_only <- lapply(masked, function(x) x[grepl("^high_", names(x))])
high_only
## $gd_tpk_cor
## $gd_tpk_cor$high_cor_mat
##           4thmath 8thmath 11thmath  4th read  8th read 11th read 8th sci
## 4thmath        NA      NA       NA 0.6752072        NA        NA      NA
## 8thmath        NA      NA       NA        NA 0.6897058        NA      NA
## 11thmath       NA      NA       NA        NA        NA 0.6061706      NA
## 4th read       NA      NA       NA        NA        NA        NA      NA
## 8th read       NA      NA       NA        NA        NA        NA      NA
## 11th read      NA      NA       NA        NA        NA        NA      NA
## 8th sci        NA      NA       NA        NA        NA        NA      NA
## 11th sci       NA      NA       NA        NA        NA        NA      NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath         NA
## 4th read         NA
## 8th read         NA
## 11th read 0.6401641
## 8th sci          NA
## 11th sci         NA
## 
## 
## $grd_avg_cor
## $grd_avg_cor$high_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
## 
## 
## $sub_avr_cor
## $sub_avr_cor$high_cor_mat
##          Math avg  Read avg   Sci avg
## Math avg       NA 0.7568206        NA
## Read avg       NA        NA 0.6127668
## Sci avg        NA        NA        NA

observations :

  • For our programmed function we have defined \(r_{xy}=.6\) as the lowest \(r_{xy}\) value accepted as highly correlated; as an \(R^2\) that is \(.6^2 =0.36\), i.e., 36% of the variation in \(y\) can be explained by \(x\).

  • Notice grd_avg_cor never contains highly correlated features

  • Math avg and Read avg appear to be the most linearly related, with roughly 57% of the variation explained, slightly above half.

Problem : What is the behavior of the linear model? And how does that differ when we are estimating the model?

Notice :

\[ y = X\beta+\epsilon \ ; \epsilon \sim N(0, \sigma^2I_n) \]

\[ \mathbb{E}(y)=\mathbb{E}(X\beta+\epsilon) = \mathbb{E}(X\beta) + \mathbb{E}(\epsilon) =X\beta+0=X\beta = \mu \]

\[ \text{Cov}(y) =\text{Cov}(X\beta+\epsilon)=\text{Cov}(\epsilon) :=\sigma^2I_n \]

\[ y \sim N(X\beta, \sigma^2I) \]

Therefore \(\mu =X \beta\) and \(y = \mu +\epsilon\). In other words : \(\text{Event = Model + Noise}\). Further, let’s determine the distribution of \(\hat{y}=X\hat{\beta}\).

claim : \(\hat{y}\) follows a normal dist :

\[ \begin{aligned}\hat{\vec y}&= X\hat{\beta} \\&= \underbrace{ \underbrace{X(X^\top X)^{-1}X^\top}_{\displaystyle XA} }_{\displaystyle H}\,\vec y \\&= H\vec y\end{aligned} \]

Therefore, Notice that we can then describe \(\hat{y}\) as a linear combination of \(\vec{y}=X\beta+\epsilon\) :

\[ \hat{ y} = H\vec y = \begin{bmatrix} \vec h_1^{\top}\vec y \\ \vec h_2^{\top}\vec y \\ \vdots \\ \vec h_n^{\top}\vec y \end{bmatrix} = y_1 \vec h_1 + y_2 \vec h_2 + \cdots + y_n \vec h_n \]

After establishing normality, we can describe the behavior of \(\hat{y}\) by calculating the two parameters of the normal distribution, the mean ( \(\vec{\mu}\) ) and the covariance matrix ( \(\sum\) ), since these quantities fully describe the distribution’s behavior :

Center :

\[ \mathbb{E}(\hat{y}) \\ = \mathbb{E}(X\hat{\beta}) \\ =X\mathbb{E}(\hat{\beta}) \\ =X\beta \\ = \mu \]

So we should expect that, in the long run, our estimate of the model recovers \(\mu\). Good!

Variance :

\[ \text{Cov}(\hat{y})=\text{Cov}(Hy) =H\text{Cov}(y)H' =H\sigma^2I_nH'=\sigma^2HH'=\sigma^2H \]

This is because \(y \sim N(X\beta, \sigma^2I_n)\), \(H'=H\) (symmetric), and \(H^2=H\) (idempotent matrix). A numeric sketch of these properties follows.
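A numeric sketch of these hat-matrix properties for the Read avg ~ Math avg pair (complete cases assumed, an illustrative choice):

cc <- na.omit(iowa_std[, c("Math avg", "Read avg")])
X  <- cbind(1, cc$`Math avg`)
H  <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
all.equal(H %*% H, H)                    # idempotent : H^2 = H
all.equal(H, t(H))                       # symmetric  : H' = H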

# Fit linear model
mdl <- lm(`Read avg` ~ `Math avg`, data = iowa_std)

# ---- Scatter plot with regression line ----
plot(iowa_std$`Math avg`, iowa_std$`Read avg`,
     main = "Math vs Read Avg Mdl",
     xlab = "Math Avg", ylab = "Read Avg", pch = 19)
abline(mdl, col = "red", lwd = 2)

# Residual Plot : 
plot(mdl, which = 1)

# QQ plt
plot(mdl, which = 2)

observations :

Suppose we wish to summarize \(\hat{y}\) into a single summary statistic : \(\bar{\hat{y}}=\frac{1}{n}\sum{\hat{y}_i}\). We will first derive its distribution and then the parameters describing that distribution :

Proof

\[ \bar{\hat{y}} \\ = \frac{1}{n}\sum{\hat{y}_i} \\ = \frac{1}{n} \vec{1}'\hat{y} \\ = M\hat{y} \]

except I want you to notice :

\[ \bar{\hat{y}}=M\hat{y}=MH\vec{y}=\frac{1}{n}\vec{1}'H\vec{y} = \frac{1}{n}(\vec{1}'H)\vec{y} \]

and, notice that : \(H\vec{1}=\vec{1}\)

In order to understand why this is true remember that \(X=[\vec{1} \ \vec{x}_1 ....\vec{x}_n]\) and \(H=X(X'X)^{-1}X'\)

Therefore, since \(\vec{1}\) is not only a linear combination but also literally in, \(X\) we can state :

\[ X\vec{v}=\vec{1} \]

\[ \vec{1}v_1+\vec{x}_1v_2+...+\vec{x}_nv_n = \vec{1} \]

Clearly, if \(v_2\) through \(v_n\) are all 0 and \(v_1\) is 1, we see :

\[ \vec{1}(1)+\vec{x}_1(0)+...+\vec{x}_n(0) = \vec{1} \\ \vec{1}(1) = \vec{1} \\ \vec{1} = \vec{1} \]

Meaning \(\vec{v} = [1 \ 0 ...0]\).

Further,

\[ H\vec{1} \\ = XA\vec{1} \\ = X(X'X)^{-1}X'\vec{1} \\ = X(X'X)^{-1}X'(X\vec{v}) \\ = X(X'X)^{-1}(X'X)\vec{v} \\ = XI\vec{v} \\ = X\vec{v} \\ :=\vec{1} \]

Therefore, suppose we transpose \(H\vec{1} = \vec{1}\) :

\[ \vec{1}' =(H\vec{1})'=\vec{1}'H' = \vec{1}'H =\vec{1}' \]

Meaning :

\[ \bar{\hat{y}}=\frac{1}{n}\vec{1}'H\vec{y} = \frac{1}{n}\vec{1}'\vec{y} = \frac{1}{n}\sum{y_i}=\bar{y} \]

Therefore, the average of the fitted values \(\hat{y}\) is equivalent to the average \(y\) value. So in deriving the distribution of \(\bar{\hat{y}}\) we are really finding the distribution of \(\bar{y}\). A quick numeric check follows.
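A quick numeric check, reusing the mdl object fitted above and restricting to the rows lm() actually used:

y_used <- model.response(model.frame(mdl))   # observed y for the rows kept by lm()
all.equal(mean(fitted(mdl)), mean(y_used))   # mean of fitted values equals mean of y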

\[ \mathbb{E}(\bar{\hat{y}}) \\ = \mathbb{E(\bar{y})} \\ = \mathbb{E}(\frac{1}{n}\sum{y}_i) \\ = \frac{1}{n} \mathbb{E}(\sum{y}_i) \\ = \frac{1}{n} \sum{\mathbb{E}({y}_i)} \\ \]

and \(y_i=\mu +\epsilon_i\),

\[ \mathbb{E}(y_i) \\ =\mathbb{E}(\mu +\epsilon_i)\\ =\mathbb{E}(\mu) +\mathbb{E}(\epsilon_i) \\ = \mathbb{E}(\mu) + 0 \\ = \mu \]

therefore :

\[ = \frac{1}{n} n \mu \\ = \mu \]

and,

\[ \text{Var}(\bar{\hat{y}}) \\ =\text{Var}(\bar{y}) \\ =\text{Var}(\frac{1}{n}\vec{1}'\vec{y}) \\ = (\frac{1}{n})^2\text{Var}(\vec{1}'\vec{y}) \\ = \frac{1}{n^2}\text{Var}(\sum{y}_i) \]

and I want you to recall : \(y_i = \mu + \epsilon_i\), and each \(\epsilon_i\) is independent of \(\epsilon_j\) (i.e., uncorrelated errors). So we can treat \(y_i\) as independent of \(y_j\), since each is a deterministic component plus an independent random component \(\epsilon_i\).

\[ = \frac{1}{n^2}\text{Var}(\sum{y}_i) \\ = \frac{1}{n^2}\sum\text{Var}(y_i) \\ = \frac{1}{n^2}\sum\text{Var}(\mu+\epsilon_i) \\ = \frac{1}{n^2}\sum\text{Var}(\epsilon_i) \\ = \frac{1}{n^2}\text{Trace}(\text{Cov}(\vec{\epsilon})) \\ = \frac{1}{n^2}\text{Trace}(\sigma^2I_n) \\ = \frac{\sigma^2}{n^2}\text{Trace}(I_n) \\ = \frac{\sigma^2}{n^2}\sum^{n}_1{1} \\ = \frac{\sigma^2}{n^2}n \\ = \frac{\sigma^2}{n} \]

Meaning :

\[ \bar{\hat{y}} =\bar{y} \sim N(\mu,\frac{\sigma^2}{n}) \]

However, this is a summary statistic of the overall behavior of \(\hat{y}\), each prediction \(\hat{y}_i\) has its own variance \(\sigma^2_{\hat{y}_i}\) and \(\text{Cov}(\hat{y}_i,\hat{y}_j)=\rho\sigma_{\hat{y}_i}\sigma_{\hat{y}_j}\) for \(\rho \in [-1,1]\) [ Correlation ]; meaning \(\text{Cov}(\hat{y}_i,\hat{y}_j)=\text{Corr}(\hat{y}_i,\hat{y}_j)\sigma_{\hat{y}_i}\sigma_{\hat{y}_j}\)

Covariance matrix of \(\hat{y}\) :

\[ H = XA ; A = (X'X)^{-1}X' \\ H =X(X'X)^{-1}X' \]

[ After re-solving this HW problem, it became clear. Citation : HW 5 Q6, STATS 100C ]

and for this case :

\[ X = [1_{325} \vec{x}] \\ \implies \\ H \\ = X\frac{1}{\text{det}(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} X'\\ = X\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}X' \\ = [1_{325} \vec{x}]\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \]

Notice : \(\sum{x_i}^2 = \vec{x}'\vec{x} = ||x||^2\) and, \(\bar{x}=\frac{1}{n} \sum x_i\)

\[ = [1_{325} \vec{x}]\frac{1}{n ||x||^2-(n\bar{x})^2}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ = [1_{325} \vec{x}]\frac{1}{n ||x||^2-n^2\bar{x}^2}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ = [1_{325} \vec{x}]\frac{1}{n (||x||^2-n\bar{x}^2)}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ \]

Recall :

\[ S_{xx}=\sum (x_i -\bar{x})^2 \\ = \sum(x_i -\bar{x})(x_i -\bar{x}) \\ = \sum(x_i^2 -2\bar{x}x_i+\bar{x}^2) \\ = \sum{x_i^2} -2\bar{x}\sum{x_i}+\sum{\bar{x}^2} \\ = \sum{x_i^2} -2\bar{x}n\bar{x}+n\bar{x}^2 \\ = \sum{x_i^2} -2\bar{x}^2n+n\bar{x}^2 \\ = \sum{x_i^2} -n\bar{x}^2 \\ = ||x||^2 -n\bar{x}^2 \]

Therefore :

\[ = [1_{325} \vec{x}]\frac{1}{n (||x||^2-n\bar{x}^2)}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ = [1_{325} \vec{x}]\frac{1}{n S_{xx}}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}]' \\ \]

and,

\[ = [1_{325} \vec{x}]\frac{1}{n S_{xx}}\begin{bmatrix}||x||^2 & -n\bar{x} \\-n\bar{x} & n\end{bmatrix}[1_{325} \vec{x}]' \\ = [1_{325} \vec{x}]\begin{bmatrix}\frac{||x||^2}{nS_{xx}} & -\frac{\bar{x}}{S_{xx}} \\-\frac{\bar{x}}{S_{xx}} \ & \frac{1}{S_{xx}}\end{bmatrix}[1_{325} \vec{x}]' \\ \]
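A sketch confirming this closed form numerically against solve(), using the Math avg predictor from the fitted pairwise model (an illustration, not part of the original derivation):

x_cc <- model.frame(mdl)$`Math avg`          # predictor rows actually used in the fit
n    <- length(x_cc)
Sxx  <- sum((x_cc - mean(x_cc))^2)
closed_form <- matrix(c(sum(x_cc^2) / (n * Sxx), -mean(x_cc) / Sxx,
                        -mean(x_cc) / Sxx,        1 / Sxx), nrow = 2)
all.equal(closed_form, unname(solve(crossprod(cbind(1, x_cc)))))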

# Histogram 
hist(iowa_std$`Read avg`,
     probability = TRUE,
     breaks = 20,
     col = "lightgray",
     xlim = c(
       mean(iowa_std$`Read avg`, na.rm = TRUE) - 4 * sd(iowa_std$`Read avg`, na.rm = TRUE),
       mean(iowa_std$`Read avg`, na.rm = TRUE) + 4 * sd(iowa_std$`Read avg`, na.rm = TRUE)
     ),
     main = "Histogram of Read Avg with Normal Curves",
     xlab = "Read Avg")


# Center at Xbeta = \hat{y}
curve(
  dnorm(x, 
        mean = mean(fitted(mdl), na.rm = TRUE), 
        sd = sd(fitted(mdl), na.rm = TRUE)), 
  col = "blue", lwd = 2, add = TRUE)

# Raw values : 
curve(
  dnorm(x, 
        mean = mean(iowa_std$`Read avg`, na.rm = TRUE), 
        sd = sd(iowa_std$`Read avg`, na.rm = TRUE)), 
  col = "green", lwd = 2, add = TRUE)

legend("topright", legend = c("Centered at Xβ", "Empirical Mean"),
       col = c("blue", "green"), lwd = 2, bty = "n")

  • As we can see, our theoretical model does a pretty good job of estimating Read avg from Math avg; this tells us there appears to be a strong relationship between a district’s math and reading performance.

  • However, relative to our empirical model based on the outcome variable, we get worse performance. We notice the difference is in the variance: our model is overly confident ( smaller \(\sigma_*\) ).

  • Hypothesis testing :

    • \[ \text{T-test : }H_0 : \beta_i=0 \] rejecting this null indicates an individual regression coefficient is statistically significant.

    • \[ F\text{-test : } H_0 : \beta_1=0 \] rejecting this null indicates the regression relationship as a whole is significant (see the sketch after this list).

  • In short this indicates strong evidence that districts which do well in reading also tend to do well in math together. Perhaps targeting these two areas will help overall performance.
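A small sketch of pulling those tests from the fitted pairwise model, reusing the mdl object from above:

summary(mdl)$coefficients   # t-tests of H0 : beta_i = 0 for each coefficient
summary(mdl)$fstatistic     # overall F-test of the regression relationship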

Okay, lets go back to our research questions :

  • How does student performance vary across grade, subject, and district size?

    • In Iowa districts, the relationships among school subjects are stronger than the relationships among grade levels.
    • Reading and math appear to be related across grade levels
  • How do teacher salaries vary across categories?

    • What features contribute most to higher teacher salaries?

Overall Avr ~ District Size :

# Add avr col : 
iowa_std <- 
  iowa_std |> 
  rowwise() |>
  mutate(avr = mean(c_across(c(`4thmath`, `8thmath`, `11thmath`, 
                                        `4th read`, `8th read`, `11th read`, 
                                        `8th sci`, `11th sci`)), na.rm = TRUE))

Suppose we divide school districts into three size groups : large districts (x > 2,000 students), medium-sized districts (1,000 < x < 2,000 students), and small districts (x < 1,000 students).

\[ \text{avr}_i=\beta_0+\beta_1\mathbf{1}[\text{med}_i]+\beta_2\mathbf{1}[\text{large}_i] + \varepsilon_i \]

\[ \begin{align*}\text{avr}_i & : \text{Average test score for district } i \\\beta_0 & : \text{Mean score for districts in the small size group (baseline)} \\\beta_1 & : \text{Difference in mean score between medium and small districts} \\\beta_2 & : \text{Difference in mean score between large and small districts} \\\mathbf{1}[\text{med}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is medium, } 0 \text{ otherwise} \\\mathbf{1}[\text{large}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is large, } 0 \text{ otherwise} \\\varepsilon_i & : \text{Random error term for district } i\end{align*} \]

\[ \beta_0 = \mu_1, \beta_1 = \mu_2 − \mu_1, \text{ and } \beta_2 = \mu_3 − \mu_1 \]

Which is equivalent to :

\[ y_i = \mu_1 x_{i1} + \mu_2 x_{i2} + \mu_3 x_{i3} + \varepsilon_i \]

\[ \begin{align*}y_i & : \text{Average test score for district } i \\x_{i1}, x_{i2}, x_{i3} & : \text{Indicator variables for district size: small, medium, large} \\\mu_1, \mu_2, \mu_3 & : \text{Mean test scores for small, medium, and large districts, respectively} \\\varepsilon_i & : \text{Random error in test score for district }\\\end{align*} \]

# Model for overall averages 
overall_mdl <- lm(avr~`size(coded)`, data=iowa_std)
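A quick sketch of the equivalence stated above: refitting without an intercept estimates the cell means directly, and they match the intercept-plus-difference parameterization (this check is an addition, not in the original analysis):

# Cell-means parameterization : coefficients are mu_1, mu_2, mu_3 directly
cell_mdl <- lm(avr ~ 0 + `size(coded)`, data = iowa_std)
all.equal(unname(coef(cell_mdl)),
          unname(c(coef(overall_mdl)[1],
                   coef(overall_mdl)[1] + coef(overall_mdl)[2],
                   coef(overall_mdl)[1] + coef(overall_mdl)[3])))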

Notice these are the same results as our one-way ANOVA. Let’s take a closer look at the F-test results :

summary(overall_mdl); aov(avr~`size(coded)`, data=iowa_std)
## 
## Call:
## lm(formula = avr ~ `size(coded)`, data = iowa_std)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.2652  -4.5152   0.3901   4.8598  20.4848 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     73.0152     0.4840 150.863   <2e-16 ***
## `size(coded)`2   0.5948     1.0012   0.594    0.553    
## `size(coded)`3   2.5525     1.3306   1.918    0.056 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.227 on 322 degrees of freedom
## Multiple R-squared:  0.01151,    Adjusted R-squared:  0.005365 
## F-statistic: 1.874 on 2 and 322 DF,  p-value: 0.1552
## Call:
##    aov(formula = avr ~ `size(coded)`, data = iowa_std)
## 
## Terms:
##                 `size(coded)` Residuals
## Sum of Squares        195.767 16819.764
## Deg. of Freedom             2       322
## 
## Residual standard error: 7.227399
## Estimated effects may be unbalanced
levels(iowa_std$`size(coded)`) <- c("Small", "Medium", "Large") 
boxplot(avr ~ `size(coded)`, data = iowa_std,
        main = "Average Test Scores by District Size",
        xlab = "District Size",
        ylab = "Average Test Score",
        col = c("skyblue", "lightgreen", "lightpink"))

abline(h = mean(iowa_std$avr, na.rm = TRUE), col = "red", lty = 2)

Our model compares each size group against the baseline and looks for the difference each indicator level (S, M, L) makes. Descriptively, it appears large districts tend to outperform the other districts :

iowa_std |> group_by(`size(coded)`) |> summarize(avr_score_per_size = round(mean(avr), 2)) |> arrange(desc(avr_score_per_size))

This claim is supported by a quick glance at grouped summary statistics.

Grade Avr ~ District Size :

Instead of looking at the average across grades, suppose we look at each grade level separately :

Fourth Grade Avr MDL

Fourth_grd_size_mdl <- aov(`4th avg`~`size(coded)`, data=iowa_std)
summary(Fourth_grd_size_mdl)
##                Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)`   2    151   75.61   0.763  0.467
## Residuals     319  31598   99.05               
## 3 observations deleted due to missingness

Observations :

  • For 4th grade students, we don’t believe any district size level makes a statistically significant difference.

Eighth Grade Avr MDL

Eighth_grd_size_mdl <- aov(`8th avg`~`size(coded)`, data=iowa_std)
summary(Eighth_grd_size_mdl) # lots of missingness : 61 observations deleted due to missingness [about 20%]
##                Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)`   2     54   26.83     0.4  0.671
## Residuals     261  17525   67.15               
## 61 observations deleted due to missingness
naniar::miss_var_summary(iowa_std |> select(`8th avg`, `size(coded)`))

Observations :

  • For 8th grade students, we don’t believe any district size level makes a statistically significant difference.
  • About 20% of 8th avg is missing

Eleventh Grade Avr MDL

Eleventh_grd_size_mdl <- aov(`11th avg`~`size(coded)`, data=iowa_std)
summary(Eleventh_grd_size_mdl); summary(lm(`11th avg`~`size(coded)`, data=iowa_std))$coeff
##                Df Sum Sq Mean Sq F value Pr(>F)  
## `size(coded)`   2    839   419.3   4.281 0.0149 *
## Residuals     250  24489    98.0                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 72 observations deleted due to missingness
##                      Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)         73.607784  0.7658753 96.109361 2.039968e-199
## `size(coded)`Medium  2.034190  1.5493777  1.312908  1.904179e-01
## `size(coded)`Large   5.392219  1.9098955  2.823306  5.135817e-03
paste("R-Sqr : ", summary(lm(`11th avg`~`size(coded)`, data=iowa_std))$r.squared)
## [1] "R-Sqr :  0.0331127128655936"

Observations :

  • For 11th grade students, we do find a statistically significant effect of district size :

\[ H_o : \mu_\text{Small} = \mu_\text{Medium} = \mu_\text{Large}\\ H_a : \mu_i \ne \mu_j \text{ for some } i \ne j \]

In other words, we believe at least one pair of group means differs. However, notice our model’s \(R^2\approx3\%\), so overall the model isn’t capturing much of the variance in 11th grade district average scores.

boxplot(`11th avg`~`size(coded)`, data = iowa_std,
        main = "11th Grd District by District Size",
        xlab = "District Size",
        ylab = "Average 11th Grade Performance",
        col = c("skyblue", "lightgreen", "lightpink"))

abline(h = mean(iowa_std$avr, na.rm = TRUE), col = "red", lty = 2)

iowa_std |> group_by(`size(coded)`) |> summarize(Grade_Distr_avr = round(mean(`11th avg`, na.rm = T), 2)) |> arrange(desc(Grade_Distr_avr))
iowa_std$size_coded <- as.factor(iowa_std$`size(coded)`)
Eleventh_grd_size_mdl <- aov(`11th avg`~size_coded, data =iowa_std)
TukeyHSD(Eleventh_grd_size_mdl); plot(TukeyHSD(Eleventh_grd_size_mdl))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = `11th avg` ~ size_coded, data = iowa_std)
## 
## $size_coded
##                  diff        lwr      upr     p adj
## Medium-Small 2.034190 -1.6188689 5.687249 0.3893082
## Large-Small  5.392219  0.8891463 9.895292 0.0141716
## Large-Medium 3.358029 -1.8478375 8.563896 0.2828601

Based on the results above, the Large-Small pairwise comparison for 11th grade students is statistically significant, while the Large-Medium comparison is not. This indicates the difference is apparent mainly when comparing the extremes against one another.

TukeyHSD Problem :

As with everything, it’s important we stay grounded in the formulas so we understand their limitations and, therefore, their use cases.

Consider the formulas used for TukeyHSD() :

[INCOMPLETE – Goal : Compare Formulas ]

  • Different Group Sample Sizes ( \(n_*\) )

  • Unequal Variance ( \(\sigma_*\) )

Therefore, because we know our group sample sizes and standard deviations are not equal, we try games_howell_test. Consider the formula used in this test :

library(rstatix)
library(tidyr)  # for drop_na()
# For games_howell_test to work we need complete cases across all features : 
DF <- iowa_std |> select(District, `11th avg`, size_coded) |> drop_na()
DF |> games_howell_test(`11th avg` ~ size_coded)

Eleventh Grade Math MDL

Eleventh Grade Reading MDL

Eleventh Grade Science MDL