Project Duration : 3 Days

Project Description :

This project analyzes district-level education data from Iowa (2001), focusing on standardized test performance in mathematics, reading, and science for grades 4, 8, and 11. The dataset contains 325 school districts and includes associated salary information. Variables describe district size, teacher work experience, average grade and class performance, and grade–subject performance by district. The purpose of the project is to demonstrate a rigorous understanding and application of linear modeling, supported by statistical programming and well-defined statistical metrics. We adopt a generic, production-style programming approach, develop linear models within a formal mathematical framework, and incorporate domain-specific context about Iowa’s education system to address our research questions. A central principle of the project is a “use it, then prove it” philosophy. Any regression formula or regression-related statistical measure employed is derived, justified, and mathematically investigated, rather than taken as a black box. Several sections of the project apply metrics and ideas from Chapter 8 of Introduction to Regression Modeling (Abraham & Ledolter, 2006). These are complemented by lecture material from STATS 100C at UCLA and additional textbook results, allowing us to deepen our theoretical understanding and extract richer insight from the data.

Project Topics :

  • Linear Modeling [ Regression ]

  • Linear Algebra [ proofs ]

  • Generic R Programming [ function & package building ]

  • Literature Review Process [ Iowa in 2001 ]

  • Data enrichment [ income inequality & district map ]

Research Questions :

  • How does student performance vary across grade, subject, and district size?

  • How do teacher salaries vary across categories?

    • What features contribute most to a wealthier teacher?

Toss Files In & Take a brief Look :

url <- "https://www.biz.uiowa.edu/faculty/jledolter/RegressionModeling/Data/Chapter8/iowastudent.txt"

# download.file(url, destfile = "iowastudent.txt", mode = "wb")

library(readr)
iowa_std <- read_tsv("iowastudent.txt", na = "*", show_col_types = FALSE) 

observations :

  • Notice that we are missing 4th grade science
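A quick way to double-check this (a minimal sketch, not part of the original file inspection) is to list the science columns that made it into the file and count missing values per feature :

# Science columns present in the file : only 8th and 11th grade appear
grep("sci", names(iowa_std), value = TRUE)

# Missing-value counts per column, as a quick completeness check
colSums(is.na(iowa_std))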

Basics :

## [1] "325  School Districts"
## [1] "22  Features"
##  [1] "District"    "4thmath"     "8thmath"     "11thmath"    "4th read"   
##  [6] "8th read"    "11th read"   "8th sci"     "11th sci"    "size"       
## [11] "size(coded)" "small size"  "med size"    "large size"  "salary"     
## [16] "experience"  "4th avg"     "8th avg"     "11th avg"    "Math avg"   
## [21] "Read avg"    "Sci avg"

Cardinality :

The motivation for and interest in cardinality center on the issue of Big Data. In data science we often deal with data sets that have many features. Framing data in the context of cardinality metrics provides a generalizable framework for grouping features by their associated data types, laying the foundation for the types of analysis most appropriate for a given data structure.

# Detect Cardinality : Get back vector : 
Cardinality_func <- function(df, t = 10) {
  sapply(df, function(x) length(unique(x)) < t)
}
# less than 10 unique values
iowa_std[,Cardinality_func(iowa_std)]

observation :

  • We notice there are 4 features with fewer than 10 unique values ( all size metrics )

  • We may treat these as categorical features. Therefore categorical encoding (factor) is appropriate.

library(dplyr)
# Ensure our data behaves like a categorical variable
iowa_std <- 
  iowa_std |> 
  mutate(across(contains("size"), as.factor))

observation :

  • This is good. However, consider that for large data ( \(n \rightarrow \infty\) ) we can begin to treat low-cardinality features as categorical; let's see if these measures generally agree. Consider the following formula for Eff_Cardinality :

\[ \text{Eff_Cardinality} = \frac{\text{Cardinality}}{\text{Sample Size (n)}} \\ \text{Cardinality} = \text{Number of Unique Values} \]

Key Take-away :

  • When \(\text{Eff_Cardinality} = 1\) we have exactly 1 unique value for every observation. These features are best thought of as row labels that read like joined sentences across the columns. However, generating summary statistics for, or grouping by, such a variable can still prove helpful when summarizing across strata.

  • When \(\text{Eff_Cardinality} \rightarrow 1\) we have nearly as many unique values as observations. If this is true we can safely say that, relative to our sample size, the data has many unique values. Repeated values are then unlikely, making the signal from that categorical or numeric vector alone likely sparse.

  • When \(\text{Eff_Cardinality} = 1/n\) we have exactly 1 unique value in total. In other words, the feature repeats itself : “student”, “student”, … , “student”. These usually just contextualize your data.

  • When \(\text{Eff_Cardinality} \rightarrow 1/n\) blocking tends to be happening: we have few unique values for a large data set, i.e. few groups. Because \(\text{Min(Cardinality)}=1\) (you have at least 1 unique value), this suggests the feature is likely a categorical variable.

Effective_Cardinality  <- 
  function(df, dec = T){
  v <- sapply(df, function(x) {length(unique(x))/nrow(df)})
  v <- sort(v, decreasing = dec)
  data.frame(Effective_Cardinality_Values=v)
  }

# View effective cardinality 
head(Effective_Cardinality(iowa_std), 6)
head(Effective_Cardinality(iowa_std, dec = F), 6)
library(ggplot2)
features <- rownames(Effective_Cardinality(iowa_std))
eff_c_df <- cbind(features, Effective_Cardinality(iowa_std))

Mean <- summary(eff_c_df[5:nrow(eff_c_df),]$Effective_Cardinality_Values)[4]
one_std <- c(Mean + sd(eff_c_df[5:nrow(eff_c_df),]$Effective_Cardinality_Values), Mean - sd(eff_c_df[5:nrow(eff_c_df),]$Effective_Cardinality_Values))

library(patchwork)

bar_plot <- eff_c_df[5:nrow(eff_c_df), ] %>%
  ggplot(aes(x = factor(features, levels = features), y = Effective_Cardinality_Values)) +
  geom_hline(yintercept = Mean, linetype = "dashed", color = "black") +
  geom_hline(yintercept = one_std[1], linetype = "dashed", color = "red") +
  geom_hline(yintercept = one_std[2], linetype = "dashed", color = "red") +
  geom_col() +
  xlab("Features") + ylab("Effective Cardinality") +
  theme(axis.text.x = element_blank(),  
        axis.ticks.x = element_blank()) 

# ggplot boxplot
box_plot <- eff_c_df[5:nrow(eff_c_df), ] %>%
  ggplot(aes(y = Effective_Cardinality_Values)) +
  geom_boxplot()

# Combine side by side
bar_plot + box_plot + plot_layout(widths = c(7, 1))

Observations :

  • This tells us District & salary have exactly 1 unique value for every observation, while size and experience have nearly 1 unique value per observation. We suspect their standalone signals will be weak.

    • In other words, we have a summary statistic of the salary for every District, most likely \(\bar{x}\)
  • The remaining size metrics (the coded and dummy size features) have low Eff_Cardinality, marking them as categorical variables relative to the sample size, which agrees with our simple approach of counting unique values.

# Cardinality scores : Top & Bottom 3 : 
cardinality <- sapply(iowa_std, function(x){length(unique(x))})
head(sort(cardinality, decreasing = T), 3); head(sort(cardinality, decreasing = F), 3)
## District   salary     size 
##      325      325      293
## small size   med size large size 
##          2          2          2

Pairwise Relationships :

Correlation Matrix

  • Consider that cor(x,y) is a measure for numerical data; it is appropriate to build a strictly numerical data set before summarizing the linear pairwise relationships.
num_df <- iowa_std[,-1] |> select(-contains("size"))

Here we take a look at global pairwise patterns; no deep analysis, just simple observations.

library(corrplot)
## corrplot 0.95 loaded
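The global correlation plot summarized in the observations below can be reproduced along these lines (a sketch of the plotting call; the styling options are our assumptions, not the original chunk settings) :

# Global correlation matrix for the strictly numerical features
corrplot(cor(num_df, use = "complete.obs"),
         method = "color",
         type = "upper",
         tl.col = "black")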

Observations

  • No correlation : salary and experience show essentially no correlation (linear relationship) with the other features, indicating weak standalone signal and low predictive power, except for the (salary, experience) pair itself

  • All other features are Positively Correlated [Pairwise Regression Coef are Positive]

Mathematical Foundations of Correlation Matrices :

Consider the single case of two numerical variables \(x \text{ and } y\) with correlation \(r_{xy} = \frac{\text{cov}(x,y)}{\sigma_x\sigma_y}\).

More generally, we collect these correlations into a correlation matrix containing all our numerical features :

\[ r_{ij} \in C^{p \times p} \\ r_{ij} \in [-1, 1] \text{ for : } r_{ij} \in \mathbb{R} \]

Where \(p=\text{Number of Parameters}\) (not inc. intercept). In deriving our correlations ( \(r_{ij}\) ) we are guided by Abraham & Ledolter’s treatment of multicollinearity (Abraham & Ledolter, 2006, Ch. 5.4), rewriting formulas verbatim and using the text as a framework to connect regression structure and predictor correlations.
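Before moving on, here is a quick numerical sanity check of the elementwise definition (a small sketch; the choice of column pair is ours for illustration) :

# Verify r_xy = cov(x, y) / (sd(x) * sd(y)) against cor() on one pair
x  <- iowa_std$`Math avg`
y  <- iowa_std$`Read avg`
ok <- complete.cases(x, y)                       # guard against missing values
r_manual  <- cov(x[ok], y[ok]) / (sd(x[ok]) * sd(y[ok]))
r_builtin <- cor(x[ok], y[ok])
all.equal(r_manual, r_builtin)                   # should be TRUE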

We begin by stating our assumptions :

  1. Our Data can be described using a linear model
  2. Our errors are normally distributed
  3. The \(\mathbb{E}(\epsilon)=0\) – the population average of error is centered around 0
  4. We have constant variance \(\sigma^2I_n\)

\[ \vec{y}=X\vec{\beta} +\epsilon \text{ for : } \epsilon \sim N(\vec{0}, \sigma^2 I_n) \]

Suppose we expand \(\vec{y}\) :

\[ y= f(X) + \epsilon\\ = X\vec{\beta} + \epsilon\\ = \beta_0 + \beta_1\vec{x}_1 + ...+\beta_p\vec{x}_p + \epsilon \\ = \beta_0 + \sum\beta_i \vec{x}_i + \epsilon \] For this we assume some intercept \(\beta_0\).

Mathematical Considerations in Pairwise Regression :

Notice that for each of the following Regression models we hold the same structure :

\[ \vec{y}_{325}=\beta_0+\beta_1\vec{x}_i + \epsilon \\ \text{for: } \vec{x}_i\in X^{325 \times 2} = [\vec{1}_n \; \vec{x}_1] \\ X'X= \begin{bmatrix}n & \sum{x_i} \\ \sum{x_i} & \sum{x_i^2}\end{bmatrix} \\ \implies (X'X)^{-1}\\ =\frac{1}{\text{det}(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\ -\sum{x_i} & n\end{bmatrix} \]

\[ =\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} \]

Which, as we know, is particularly relevant when developing a notion of variance and center for \(\hat{\beta}\).

Recall :

\[ \hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1}) \]

proof

We wish to prove 3 claims :

  1. Claim : “Our est. of slopes is Normally distributed

  2. Claim : “Our est. is unbiased

  3. Claim : “Our covariance matrix of the slope estimate is the variance of random error times the inverse of Gram matrix

Let's take them one at a time; consider claim 1 :

    1. Claim : “Our est. of slopes is Normally distributed

We will prove this rather informally based upon a basic property of Normally distributed data.

Consider \(\vec{\epsilon} \sim N(0, \sigma^2I_n)\) and \(y = X\vec{\beta} + \vec{\epsilon}\) and \(X\vec{\beta}\) is a deterministic (constant). We know that linear combinations ( \(aX +b\) ) of normally distributed data \(X \sim N(\mu, \sigma^2)\) is also normally distributed (More rigorous proof here), therefore \(\vec{y}\) is also normally distributed. In other words, since \(\vec{\epsilon}\) is the random part of our model ( \(y = X\vec{\beta} + \vec{\epsilon}\) ) and \(X\vec{\beta}\) is a deterministic, we say \(y\) inherits its randomness from \(\vec{\epsilon}\) and because \(\vec{\epsilon}\) is normally distributed & \(\vec{y}\) is just a linear combination of \(\vec{\epsilon}\) we say \(\vec{y}\) is also normally distributed.

    2. Claim : “Our est. is unbiased

Recall that for an estimator to be unbiased we need \(\mathbb{E}(\hat{\theta}) - \theta = 0\). Writing \(\hat{\beta} = A\vec{y}\) with \(A = (X'X)^{-1}X'\) :

\[ \mathbb{E}(\hat{\beta}) \\ = \mathbb{E}(A\vec{y}) \\ = A\mathbb{E}(\vec{y}) \\ = A\mathbb{E}(X\vec{\beta} + \epsilon) \\ = A[\mathbb{E}(X\vec{\beta}) + \mathbb{E}(\epsilon)] \\ = A[\mathbb{E}(X\vec{\beta}) + 0] \\ = A\mathbb{E}(X\vec{\beta}) \\ = AX\vec{\beta} \\ = (X'X)^{-1}X' X \vec{\beta} \\ = (X'X)^{-1}(X' X) \vec{\beta} \\ = I \vec{\beta} \\ = \vec{\beta} \]

\[ \implies \mathbb{E}(\hat{\beta}) -\vec{\beta} = \vec{\beta} -\vec{\beta} = \vec{0} \]

    3. Claim : “Our covariance matrix of the slope estimate is the variance of random error times the inverse of Gram matrix

\[ \Sigma_{\hat{\beta}} \\ = \text{Cov}(\hat{\beta}) \\ = \text{Cov}(A\vec{y}) \\ = A\text{Cov}(\vec{y})A' \\ = A\sigma^2 I_nA' \\ = \sigma^2AA' \\ = \sigma^2(X'X)^{-1}X'[(X'X)^{-1}X']' \]

\[ = \sigma^2(X'X)^{-1}X'X[(X'X)^{-1}]' \\ = \sigma^2(X'X)^{-1}X'X[(X'X)']^{-1} \\ = \sigma^2(X'X)^{-1}X'X[X'X]^{-1} \\ = \sigma^2(X'X)^{-1}(X'X)(X'X)^{-1} \\ = \sigma^2 I(X'X)^{-1} \\ = \sigma^2(X'X)^{-1} \]
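To see all three claims play out numerically, here is a small simulation sketch (synthetic data rather than the Iowa data; the particular \(\beta\), \(\sigma\), and sample size are arbitrary choices) :

# Simulate y = X beta + eps repeatedly and compare the empirical mean and
# covariance of beta-hat with beta and sigma^2 (X'X)^{-1}
set.seed(1)
n <- 200; beta <- c(2, 0.5); sigma <- 3
X <- cbind(1, rnorm(n))                  # fixed design : intercept + one predictor
XtX_inv <- solve(t(X) %*% X)

beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(XtX_inv %*% t(X) %*% y)           # beta-hat = (X'X)^{-1} X' y
})

rowMeans(beta_hats)                      # close to (2, 0.5)  -> unbiased
cov(t(beta_hats))                        # close to sigma^2 (X'X)^{-1}
sigma^2 * XtX_inv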

Pairwise Plots : Generalized Programming :

pairwise_dot_plt <- function(data_set) {
  # Generate all pairwise combinations
  pairwise_comb <- combn(1:ncol(data_set), 2)

  # iterate across each
  for (idx in 1:ncol(pairwise_comb)) {
    comb_vec <- pairwise_comb[, idx]
    x_axis <- comb_vec[1]
    y_axis <- comb_vec[2]
    
    
    # generate reg mdl 
    mdl <- lm(unlist(data_set[, y_axis]) ~ unlist(data_set[, x_axis]), data = data_set)

    # Draw the scatter plot for this pair
    plot(
      unlist(data_set[, x_axis]),
      unlist(data_set[, y_axis]),
      xlab = colnames(data_set)[x_axis],
      ylab = colnames(data_set)[y_axis],
      main = paste("Plot", idx)
    )
    abline(mdl, col = "red")
  }
}

Mathematical Considerations :

  • Use the determinant of \(X'X\) as a stability gauge for the slope estimates' standard error bounds: as \(\text{Det}(X'X) \rightarrow 0\), the inverse \((X'X)^{-1}\) becomes unstable and the standard errors blow up. In the limit, \(\text{Det}(X'X)=0\implies\text{Let : }A=X'X \implies A^{-1} \text{ DNE}\), as there does not exist a matrix satisfying \(A^{-1}A=I\).
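A concrete way to monitor this in R (a sketch; the predictor pair is chosen only for illustration) :

# Inspect det(X'X) and the condition number for a small design matrix
df_cc <- na.omit(iowa_std[, c("Math avg", "Read avg")])
X <- cbind(1, df_cc$`Math avg`, df_cc$`Read avg`)
det(t(X) %*% X)      # values near 0 would signal an unstable inverse
kappa(t(X) %*% X)    # large condition numbers warn of near-collinearity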

Focused Analysis :

library(dplyr)

# Grade, Topic Correlations : 
gd_tpk_cor <- num_df |> select(contains("th"),-contains("avg")) |> cor(use = "complete.obs")

# Topic avg Correlations
sub_avr_cor <- num_df |> select(ends_with("avg"), -matches("^[0-9]+th")) |> cor(use = "complete.obs")

# Grade avg Correlations 
grd_avg_cor <- num_df |> select(matches("^[0-9]+th avg")) |> cor(use = "complete.obs")

Separate Corr :

Grade-Topic Correlations :

corrplot(gd_tpk_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

# Set bounds and filter :  
weak_cutoff <- .3 # below : weak linear association
strong_cutoff <- .6 # above : strong linear association

low_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor < weak_cutoff)
med_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor >= weak_cutoff) & (gd_tpk_cor <= strong_cutoff)
high_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor > strong_cutoff)

# select only ij obs (upper.tri) and not ji, sym matrix
low_pairs <- which(low_idx, arr.ind = TRUE)
med_pairs <- which(med_idx, arr.ind = TRUE)
high_pairs <- which(high_idx, arr.ind = TRUE)
# general NA matrix function : 
generate_NA_mat <-
  function(mat){
    NA_mat <- matrix(
      NA,
      nrow = nrow(mat),
      ncol = ncol(mat),
      dimnames = dimnames(mat)
    )
    return(NA_mat)
  }
# generate general function : 
build_masked_matrix <- function(cor_matrix, pair_indices) {
  NA_mat <- generate_NA_mat(cor_matrix)
  NA_mat[pair_indices] <- cor_matrix[pair_indices]
  return(NA_mat)
}

print("low_cor_mat")
## [1] "low_cor_mat"
low_cor_mat  <- build_masked_matrix(gd_tpk_cor, low_pairs); low_cor_mat
##           4thmath 8thmath  11thmath  4th read 8th read 11th read   8th sci
## 4thmath        NA      NA 0.2742213        NA       NA 0.1940918 0.2283466
## 8thmath        NA      NA        NA        NA       NA 0.2997928        NA
## 11thmath       NA      NA        NA 0.2819581       NA        NA 0.1771869
## 4th read       NA      NA        NA        NA       NA 0.2880443 0.2739120
## 8th read       NA      NA        NA        NA       NA        NA        NA
## 11th read      NA      NA        NA        NA       NA        NA 0.2655738
## 8th sci        NA      NA        NA        NA       NA        NA        NA
## 11th sci       NA      NA        NA        NA       NA        NA        NA
##            11th sci
## 4thmath   0.1864202
## 8thmath   0.2302258
## 11thmath         NA
## 4th read  0.2111796
## 8th read  0.2584866
## 11th read        NA
## 8th sci          NA
## 11th sci         NA
print("med_cor_mat")
## [1] "med_cor_mat"
med_cor_mat  <- build_masked_matrix(gd_tpk_cor, med_pairs); med_cor_mat
##           4thmath   8thmath  11thmath  4th read  8th read 11th read   8th sci
## 4thmath        NA 0.3817187        NA        NA 0.3468462        NA        NA
## 8thmath        NA        NA 0.3118259 0.3787978        NA        NA 0.5362033
## 11thmath       NA        NA        NA        NA 0.3016574        NA        NA
## 4th read       NA        NA        NA        NA 0.3526437        NA        NA
## 8th read       NA        NA        NA        NA        NA 0.3201637 0.5382480
## 11th read      NA        NA        NA        NA        NA        NA        NA
## 8th sci        NA        NA        NA        NA        NA        NA        NA
## 11th sci       NA        NA        NA        NA        NA        NA        NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath  0.4566460
## 4th read         NA
## 8th read         NA
## 11th read        NA
## 8th sci   0.3122344
## 11th sci         NA
print("high_cor_mat")
## [1] "high_cor_mat"
high_cor_mat <- build_masked_matrix(gd_tpk_cor, high_pairs); high_cor_mat
##           4thmath 8thmath 11thmath  4th read  8th read 11th read 8th sci
## 4thmath        NA      NA       NA 0.6752072        NA        NA      NA
## 8thmath        NA      NA       NA        NA 0.6897058        NA      NA
## 11thmath       NA      NA       NA        NA        NA 0.6061706      NA
## 4th read       NA      NA       NA        NA        NA        NA      NA
## 8th read       NA      NA       NA        NA        NA        NA      NA
## 11th read      NA      NA       NA        NA        NA        NA      NA
## 8th sci        NA      NA       NA        NA        NA        NA      NA
## 11th sci       NA      NA       NA        NA        NA        NA      NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath         NA
## 4th read         NA
## 8th read         NA
## 11th read 0.6401641
## 8th sci          NA
## 11th sci         NA

Observations for gd_tpk_cor :

  • high_cor_mat : Math and Reading within each grade level (4, 8, 11) appear to be strongly correlated ( \(r_{\text{high}} \approx 0.65\) )

  • med_cor_mat : 8th grade appears to have a moderate relationship with the other categories; the science pairs for 8th grade approach the high_cor_mat category [consider the near-diagonal behavior].

  • low_cor_mat : Most correlations are rather weak

Masked Matrix Function :

masked_matrix_func  <- function(correlation_matrix, bottom_cutoff = 0.3, top_cutoff = 0.6) {
  if (!is.matrix(correlation_matrix)) {
    stop("Input must be a matrix.")
  }
  
  # Logical indices for each strength
  l_idx <- upper.tri(correlation_matrix) & (correlation_matrix < bottom_cutoff)
  m_idx <- upper.tri(correlation_matrix) & 
           (correlation_matrix >= bottom_cutoff) & 
           (correlation_matrix <= top_cutoff)
  h_idx <- upper.tri(correlation_matrix) & (correlation_matrix > top_cutoff)
  
  # Pair indices
  l_pairs <- which(l_idx, arr.ind = TRUE)
  m_pairs <- which(m_idx, arr.ind = TRUE)
  h_pairs <- which(h_idx, arr.ind = TRUE)
  
  # Generate empty NA matrix func
  generate_NA_mat <- function(mat) {
    matrix(NA, nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
  }

  # Use helper function : 
  build_masked_matrix <- function(mat, idx) {
    NA_mat <- generate_NA_mat(mat)
    NA_mat[idx] <- mat[idx]
    return(NA_mat)
  }

  # Store all three in a named list
  cor_lst <- list(
    low_cor_mat  = build_masked_matrix(correlation_matrix, l_pairs),
    med_cor_mat  = build_masked_matrix(correlation_matrix, m_pairs),
    high_cor_mat = build_masked_matrix(correlation_matrix, h_pairs)
  )
  
  return(cor_lst)
}

Subject-Avr Correlations :

# plt : 
corrplot(sub_avr_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

masked_matrix_func(sub_avr_cor)
## $low_cor_mat
##          Math avg Read avg Sci avg
## Math avg       NA       NA      NA
## Read avg       NA       NA      NA
## Sci avg        NA       NA      NA
## 
## $med_cor_mat
##          Math avg Read avg   Sci avg
## Math avg       NA       NA 0.4922274
## Read avg       NA       NA        NA
## Sci avg        NA       NA        NA
## 
## $high_cor_mat
##          Math avg  Read avg   Sci avg
## Math avg       NA 0.7568206        NA
## Read avg       NA        NA 0.6127668
## Sci avg        NA        NA        NA

Grade-Avr Correlations :

corrplot(grd_avg_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

masked_matrix_func(grd_avg_cor)
## $low_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
## 
## $med_cor_mat
##          4th avg   8th avg  11th avg
## 4th avg       NA 0.4121618 0.3035598
## 8th avg       NA        NA 0.3812491
## 11th avg      NA        NA        NA
## 
## $high_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
grep("_cor$", ls(), value = T)
## [1] "gd_tpk_cor"  "grd_avg_cor" "sub_avr_cor"
cor_lst <- list(gd_tpk_cor = gd_tpk_cor, grd_avg_cor = grd_avg_cor, sub_avr_cor = sub_avr_cor)

# Average Correlation
lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)})
## $gd_tpk_cor
## [1] 0.31
## 
## $grd_avg_cor
## [1] 0.24
## 
## $sub_avr_cor
## [1] 0.41
# average of averages : 
paste(lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)}) |> unlist() |> mean(), " avr cor")
## [1] "0.32  avr cor"

Global Observations :

  • Among the variables of interest we have an average \(\bar{r}_{ij}\) of .32, so all signals appear fairly weak in the linear sense.

  • Subject pairs are more related than grade pairs

    • In summary : grd_avg_cor < gd_tpk_cor < sub_avr_cor

Mathematical Observations :

Notice \(r_{ij}\in C \implies R^2_{ij}=r_{ij}^2\). In other words, the squared correlations are the coefficients of determination for the corresponding pairwise models.
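A one-line check of this identity on a single pair (the pair is ours, chosen for illustration) :

# Squared correlation equals R^2 of the matching pairwise regression
r  <- cor(iowa_std$`Math avg`, iowa_std$`Read avg`, use = "complete.obs")
R2 <- summary(lm(`Read avg` ~ `Math avg`, data = iowa_std))$r.squared
c(r_squared = r^2, R2 = R2)              # the two numbers should agree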

masked <- lapply(cor_lst, masked_matrix_func)
high_only <- lapply(masked, function(x) x[grepl("^high_", names(x))])
high_only
## $gd_tpk_cor
## $gd_tpk_cor$high_cor_mat
##           4thmath 8thmath 11thmath  4th read  8th read 11th read 8th sci
## 4thmath        NA      NA       NA 0.6752072        NA        NA      NA
## 8thmath        NA      NA       NA        NA 0.6897058        NA      NA
## 11thmath       NA      NA       NA        NA        NA 0.6061706      NA
## 4th read       NA      NA       NA        NA        NA        NA      NA
## 8th read       NA      NA       NA        NA        NA        NA      NA
## 11th read      NA      NA       NA        NA        NA        NA      NA
## 8th sci        NA      NA       NA        NA        NA        NA      NA
## 11th sci       NA      NA       NA        NA        NA        NA      NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath         NA
## 4th read         NA
## 8th read         NA
## 11th read 0.6401641
## 8th sci          NA
## 11th sci         NA
## 
## 
## $grd_avg_cor
## $grd_avg_cor$high_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
## 
## 
## $sub_avr_cor
## $sub_avr_cor$high_cor_mat
##          Math avg  Read avg   Sci avg
## Math avg       NA 0.7568206        NA
## Read avg       NA        NA 0.6127668
## Sci avg        NA        NA        NA

observations :

  • For our programmed function we have defined \(r_{xy}=.6\) as the lowest \(r_{xy}\) value accepted as highly correlated; as an \(R^2\) that is \(.6^2 =0.36\), i.e. 36% of the variation in \(y\) can be explained by \(x\).

  • Notice grd_avg_cor never contains highly correlated features

  • Math avg and Read avg appear to be the most linearly related, with roughly 57% of the variation explained ( \(0.757^2 \approx 0.57\) ). Slightly above half.

proof

\[ y = X\beta+\epsilon \ ; \epsilon \sim N(0, \sigma^2I_n) \]

\[ \mathbb{E}(y)=\mathbb{E}(X\beta+\epsilon) = \mathbb{E}(X\beta) + \mathbb{E}(\epsilon) =X\beta+0=X\beta \]

\[ \text{Cov}(y) =\text{Cov}(X\beta+\epsilon)=\text{Cov}(\epsilon) :=\sigma^2I_n \]

\[ y \sim N(X\beta, \sigma^2I) \]

Note \(\mu =X \beta\) therefore \(y = \mu +\epsilon\). In other words : \(\text{Event = Model + Noise}\) . Further, let's consider that we wish to determine the distribution of \(\hat{y}=X\hat{\beta}\). We know that \(\hat{y}\) follows a normal distribution. Therefore, we need to find its center and covariance matrix :

Center :

\[ \mathbb{E}(\hat{y}) \\ = \mathbb{E}(X\hat{\beta}) \\ =X\mathbb{E}(\hat{\beta}) \\ =X\beta \\ = \mu \]

So, in the long run our estimate of the model recovers the model mean \(\mu\). Good!

Variance :

\[ \text{Cov}(\hat{y})=\text{Cov}(Hy) =H\text{Cov}(y)H' =H\sigma^2I_nH'=\sigma^2HH'=\sigma^2H \]

This is because \(y \sim N(X\beta, \sigma^2I_n)\) and \(H\) is symmetric and idempotent ( \(H'=H\), \(H^2=H\) ), so \(HH'=H\).

Recall :

\[ H = XA ; A = (X'X)^{-1}X' \\ H =X(X'X)^{-1}X' \]

and for this case :

\[ X = [1_{325} \vec{x}_i] \\ \implies \\ H \\ = X\frac{1}{\text{det}(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} X'\\ = X\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}X' \\ = [1_{325} \vec{x}_i]\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}_i]' \]

Notice : \(\sum{x_i}^2 = \vec{x}'\vec{x} = ||x||^2\) and, \(\bar{x}=\frac{1}{n} \sum x_i\)

\[ = [1_{325} \vec{x}_i]\frac{1}{n ||x||^2-(n\bar{x})^2}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}_i]' \\ = [1_{325} \vec{x}_i]\frac{1}{n ||x||^2-n^2\bar{x}^2}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}_i]' \\ = [1_{325} \vec{x}_i]\frac{1}{n (||x||^2-n\bar{x}^2)}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}_i]' \\ \]

Recall :

\[ S_{xx}=\sum (x_i -\bar{x})^2 \\ = \sum(x_i -\bar{x})(x_i -\bar{x}) \\ = \sum(x_i^2 -2\bar{x}x_i+\bar{x}^2) \\ = \sum{x_i^2} -2\bar{x}\sum{x_i}+\sum{\bar{x}^2} \\ = \sum{x_i^2} -2\bar{x}n\bar{x}+n\bar{x}^2 \\ = \sum{x_i^2} -2\bar{x}^2n+n\bar{x}^2 \\ = \sum{x_i^2} -n\bar{x}^2 \\ = ||x||^2 -n\bar{x}^2 \]

Therefore :

\[ = [1_{325} \vec{x}_i]\frac{1}{n (||x||^2-n\bar{x}^2)}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}_i]' \\ = [1_{325} \vec{x}_i]\frac{1}{n S_{xx}}\begin{bmatrix}||x||^2 & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}_i]' \\ \]

and,

\[ = [1_{325} \vec{x}_i]\frac{1}{n S_{xx}}\begin{bmatrix}||x||^2 & -n\bar{x} \\-n\bar{x} & n\end{bmatrix}[1_{325} \vec{x}_i]' \\ = [1_{325} \vec{x}_i]\begin{bmatrix}\frac{||x||^2}{nS_{xx}} & -\frac{\bar{x}}{S_{xx}} \\-\frac{\bar{x}}{S_{xx}} \ & \frac{1}{S_{xx}}\end{bmatrix}[1_{325} \vec{x}_i]' \\ \]
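These identities can be verified numerically (a small sketch using Math avg as the lone predictor, chosen only for illustration) :

# Build H = X (X'X)^{-1} X' for a simple model and check its properties
dat <- na.omit(iowa_std[, c("Math avg", "Read avg")])
x   <- dat$`Math avg`
X   <- cbind(1, x)
H   <- X %*% solve(t(X) %*% X) %*% t(X)

all.equal(H %*% H, H)                                     # idempotent : H^2 = H
Sxx <- sum((x - mean(x))^2)
all.equal(diag(H), 1/length(x) + (x - mean(x))^2 / Sxx)   # leverage via S_xx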

# Fit linear model
mdl <- lm(`Read avg` ~ `Math avg`, data = iowa_std)

# ---- Scatter plot with regression line ----
plot(iowa_std$`Math avg`, iowa_std$`Read avg`,
     main = "Math vs Read Avg Mdl",
     xlab = "Math Avg", ylab = "Read Avg", pch = 19)
abline(mdl, col = "red", lwd = 2)

# QQ plt
plot(mdl, which = 2)

# Histogram 
hist(iowa_std$`Read avg`,
     probability = TRUE,
     breaks = 20,
     col = "lightgray",
     xlim = c(
       mean(iowa_std$`Read avg`, na.rm = TRUE) - 4 * sd(iowa_std$`Read avg`, na.rm = TRUE),
       mean(iowa_std$`Read avg`, na.rm = TRUE) + 4 * sd(iowa_std$`Read avg`, na.rm = TRUE)
     ),
     main = "Histogram of Read Avg with Normal Curves",
     xlab = "Read Avg")


# Center at Xbeta = \hat{y}
curve(
  dnorm(x, 
        mean = mean(fitted(mdl), na.rm = TRUE), 
        sd = sd(fitted(mdl), na.rm = TRUE)), 
  col = "blue", lwd = 2, add = TRUE)

# Raw values : 
curve(
  dnorm(x, 
        mean = mean(iowa_std$`Read avg`, na.rm = TRUE), 
        sd = sd(iowa_std$`Read avg`, na.rm = TRUE)), 
  col = "green", lwd = 2, add = TRUE)

legend("topright", legend = c("Centered at Xβ", "Empirical Mean"),
       col = c("blue", "green"), lwd = 2, bty = "n")

  • As we can see, our theoretical model \(y\) does a pretty good job of describing Read avg from Math avg; this tells us there appears to be a strong relationship between a student's math and reading performance.

  • However, relative to the empirical curve based on the raw outcome variable, we get a worse fit; we notice that the difference is in the variance.

  • Notice our correlation squared and R^2 are equivalent

  • Hypothesis testing :

    • \[ \text{t-test : }H_0 : \beta_i=0 \] rejecting \(H_0\) indicates the regression coefficient is statistically significant.

    • \[ F\text{-test : } H_0 : \beta_1=0 \] rejecting \(H_0\) indicates the model as a whole is significant (both statistics can be pulled from the fitted object, as sketched below)

  • In short, this indicates strong evidence that districts which do well in reading also tend to do well in math. Perhaps targeting these two areas together will help overall performance.
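The corresponding test statistics can be pulled straight from the fitted object (a short sketch of where these numbers live) :

# t-tests for each coefficient and the overall F-test of the fit
summary(mdl)$coefficients    # estimates, std. errors, t values, p-values
summary(mdl)$fstatistic      # F statistic with its degrees of freedom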

Okay, let's go back to our research questions :

  • How does student performance vary across grade, subject, and size?

    • In districts located in Iowa, the relationship among school subjects is stronger than across grade levels.
    • Reading and math appear to be related through grade levels
  • How do teacher salaries vary across categories?

    • What features contribute most to a wealthier teacher?

Overall Avr ~ District Size MDL :

# Add avr col : 
iowa_std <- 
  iowa_std |> 
  rowwise() |>
  mutate(avr = mean(c_across(c(`4thmath`, `8thmath`, `11thmath`, 
                                        `4th read`, `8th read`, `11th read`, 
                                        `8th sci`, `11th sci`)), na.rm = TRUE))

Suppose we divide school districts into three size groups : large districts (x > 2,000 students), medium-sized districts (1,000 < x < 2,000 students), and small districts (x < 1,000 students).
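For reference, the coded size factor could be rebuilt from the raw enrollment counts along these lines (a sketch; how districts sitting exactly at 1,000 or 2,000 students are assigned in the original coding is an assumption) :

# Rebuild a size category from raw enrollment using the stated cutoffs
size_groups <- cut(as.numeric(as.character(iowa_std$size)),
                   breaks = c(-Inf, 1000, 2000, Inf),
                   labels = c("Small", "Medium", "Large"))
table(size_groups, iowa_std$`size(coded)`)   # cross-check against the coded column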

\[ \text{avr}_i=\beta_0+\beta_1\mathbf{1}[\text{med}_i]+\beta_2\mathbf{1}[\text{large}_i] + \varepsilon_i \]

\[ \begin{align*}\text{avr}_i & : \text{Average test score for district } i \\\beta_0 & : \text{Mean score for districts in the small size group (baseline)} \\\beta_1 & : \text{Difference in mean score between medium and small districts} \\\beta_2 & : \text{Difference in mean score between large and small districts} \\\mathbf{1}[\text{med}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is medium, } 0 \text{ otherwise} \\\mathbf{1}[\text{large}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is large, } 0 \text{ otherwise} \\\varepsilon_i & : \text{Random error term for district } i\end{align*} \]

\[ \beta_0 = \mu_1, \beta_1 = \mu_2 − \mu_1, \text{ and } \beta_2 = \mu_3 − \mu_1 \]

Which is equivalent to :

\[ y_i = \mu_1 x_{i1} + \mu_2 x_{i2} + \mu_3 x_{i3} + \varepsilon_i \]

\[ \begin{align*}y_i & : \text{Average test score for district } i \\x_{i1}, x_{i2}, x_{i3} & : \text{Indicator variables for district size: small, medium, large} \\\mu_1, \mu_2, \mu_3 & : \text{Mean test scores for small, medium, and large districts, respectively} \\\varepsilon_i & : \text{Random error in test score for district }\\\end{align*} \]

# Model for overall averages 
overall_mdl <- lm(avr~`size(coded)`, data=iowa_std)
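The same fit can be expressed in the cell-means form above (a sketch showing the two parameterizations agree) :

# Cell-means parameterization : dropping the intercept estimates group means directly
means_mdl <- lm(avr ~ 0 + `size(coded)`, data = iowa_std)
coef(means_mdl)                                    # mu_1, mu_2, mu_3
c(coef(overall_mdl)[1],                            # beta_0           = mu_1
  coef(overall_mdl)[1] + coef(overall_mdl)[2:3])   # beta_0 + beta_j  = mu_2, mu_3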

Notice these are the same results as our one-way ANOVA. Let's take a closer look at the F-test results :

summary(overall_mdl); aov(avr~`size(coded)`, data=iowa_std)
## 
## Call:
## lm(formula = avr ~ `size(coded)`, data = iowa_std)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.2652  -4.5152   0.3901   4.8598  20.4848 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     73.0152     0.4840 150.863   <2e-16 ***
## `size(coded)`2   0.5948     1.0012   0.594    0.553    
## `size(coded)`3   2.5525     1.3306   1.918    0.056 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.227 on 322 degrees of freedom
## Multiple R-squared:  0.01151,    Adjusted R-squared:  0.005365 
## F-statistic: 1.874 on 2 and 322 DF,  p-value: 0.1552
## Call:
##    aov(formula = avr ~ `size(coded)`, data = iowa_std)
## 
## Terms:
##                 `size(coded)` Residuals
## Sum of Squares        195.767 16819.764
## Deg. of Freedom             2       322
## 
## Residual standard error: 7.227399
## Estimated effects may be unbalanced
levels(iowa_std$`size(coded)`) <- c("Small", "Medium", "Large") 
boxplot(avr ~ `size(coded)`, data = iowa_std,
        main = "Average Test Scores by District Size",
        xlab = "District Size",
        ylab = "Average Test Score",
        col = c("skyblue", "lightgreen", "lightpink"))

abline(h = mean(iowa_std$avr, na.rm = TRUE), col = "red", lty = 2)

Our model takes small districts as the baseline and looks for the difference each size indicator (S, M, L) makes. It appears large districts tend to outperform the other districts :

iowa_std |> group_by(`size(coded)`) |> summarize(avr_score_per_size = round(mean(avr), 2)) |> arrange(desc(avr_score_per_size))

This claim is supported by a quick glance at grouped summary statistics.

Grade Avr ~ District Size MDLs :

Fourth_grd_size_mdl <- aov(`4th avg`~`size(coded)`, data=iowa_std)
summary(Fourth_grd_size_mdl)
##                Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)`   2    151   75.61   0.763  0.467
## Residuals     319  31598   99.05               
## 3 observations deleted due to missingness

Observations :

  • For 4th grade students, we don't believe any district size level makes a statistically significant difference.
Eighth_grd_size_mdl <- aov(`8th avg`~`size(coded)`, data=iowa_std)
summary(Eighth_grd_size_mdl)
##                Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)`   2     54   26.83     0.4  0.671
## Residuals     261  17525   67.15               
## 61 observations deleted due to missingness

Observations :

  • For 8th grade students, we don't believe any district size level makes a statistically significant difference.
Eleventh_grd_size_mdl <- aov(`11th avg`~`size(coded)`, data=iowa_std)
summary(Eleventh_grd_size_mdl); summary(lm(`11th avg`~`size(coded)`, data=iowa_std))$coeff; paste("R-Sqr : ", summary(lm(`11th avg`~`size(coded)`, data=iowa_std))$r.squared)
##                Df Sum Sq Mean Sq F value Pr(>F)  
## `size(coded)`   2    839   419.3   4.281 0.0149 *
## Residuals     250  24489    98.0                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 72 observations deleted due to missingness
##                      Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)         73.607784  0.7658753 96.109361 2.039968e-199
## `size(coded)`Medium  2.034190  1.5493777  1.312908  1.904179e-01
## `size(coded)`Large   5.392219  1.9098955  2.823306  5.135817e-03
## [1] "R-Sqr :  0.0331127128655936"

Observations :

  • For 11th grade students, we have a statistically significant difference across district size levels :

\[ H_0 : \mu_\text{Small} = \mu_\text{Medium} = \mu_\text{Large} \\ H_a : \mu_i \ne \mu_j \text{ for some } i \ne j \]

In other words, we believe at least one group mean differs. However, notice our model's \(R^2 \approx 3\%\), so overall the model isn't capturing much of the variance in 11th grade district average scores.

boxplot(`11th avg`~`size(coded)`, data = iowa_std,
        main = "11th Grd District by District Size",
        xlab = "District Size",
        ylab = "Average 11th Grade Performance",
        col = c("skyblue", "lightgreen", "lightpink"))

abline(h = mean(iowa_std$avr, na.rm = TRUE), col = "red", lty = 2)

iowa_std |> group_by(`size(coded)`) |> summarize(Grade_Distr_avr = round(mean(`11th avg`, na.rm = T), 2)) |> arrange(desc(Grade_Distr_avr))
iowa_std$size_coded <- as.factor(iowa_std$`size(coded)`)
Eleventh_grd_size_mdl <- aov(`11th avg`~size_coded, data =iowa_std)
TukeyHSD(Eleventh_grd_size_mdl); plot(TukeyHSD(Eleventh_grd_size_mdl))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = `11th avg` ~ size_coded, data = iowa_std)
## 
## $size_coded
##                  diff        lwr      upr     p adj
## Medium-Small 2.034190 -1.6188689 5.687249 0.3893082
## Large-Small  5.392219  0.8891463 9.895292 0.0141716
## Large-Medium 3.358029 -1.8478375 8.563896 0.2828601

Based upon the results above, for 11th grade students the Large-Small districts pairwise comparison is statistically significant, while the Large-Medium pairwise comparison is not. This indicates the difference is apparent only when comparing the extremes against one another.

library(rstatix)
library(tidyr) # drop_na()
# For games_howell_test to work we need complete cases across all features : 
DF <- iowa_std |> select(District, `11th avg`, size_coded) |> drop_na()
DF |> games_howell_test(`11th avg` ~ size_coded)