Project Duration : 3 Days

Project Description :

This project explores district-level education data from Iowa, covering standardized test performance across math, reading, and science for grades 4, 8, and 11. The data set includes 325 school districts & their associated salaries. Variables include : district size metrics, Teacher Work Experience, Avr Grade & Class Performance & GradeSubject Performance (per district). Subjects include : mathematics, reading, and science. The purpose of this project is to demonstrate my rigorous use and understanding of Linear Modeling, Statistical Programming & well defined metrics. We will use a generic programming style, Linear Modeling in a mathematical Framework, & Domain Specific knowledge of education to develop answers to our research questions. Alongside this task we will explore mathematical concepts and build a library of functions that will can utilize in future projects. Demonstrating production level programming and rigorous analysis guided by mathematics. Sections of the project apply metrics discussed in Chapter 8 of Introduction to Regression Modeling (Abraham & Ledolter, 2006). Alongside the textbook provided solutions we dive into relevant mathematical theorems discussed in lecture (STATS100C at UCLA) & other textbooks to enrich the information we can pull from this data set.

Research Questions :

  • How do student performance vary across, grade, subject and size?

  • How do Teacher salaries vary across Categories?

    • What features contribute most to a wealthier teacher?

Toss Files In & Take a brief Look :

url <- "https://www.biz.uiowa.edu/faculty/jledolter/RegressionModeling/Data/Chapter8/iowastudent.txt"

# download.file(url, destfile = "iowastudent.txt", mode = "wb")

library(readr)
iowa_std <- read_tsv("iowastudent.txt", na = "*", show_col_types = FALSE) 

observations :

  • Notice that we are missing 4th grade science

Basics :

## [1] "325  School Districts"
## [1] "22  Features"
##  [1] "District"    "4thmath"     "8thmath"     "11thmath"    "4th read"   
##  [6] "8th read"    "11th read"   "8th sci"     "11th sci"    "size"       
## [11] "size(coded)" "small size"  "med size"    "large size"  "salary"     
## [16] "experience"  "4th avg"     "8th avg"     "11th avg"    "Math avg"   
## [21] "Read avg"    "Sci avg"

Cardinality :

The motivation & interest of Cardinality is centered around the issue of Big Data. Often in Data science we deal with data sets with many features. Framing data in the context of Cardinality metrics provides a generalizable framework for grouping features into their associated data types. Laying the foundation for the types of analysis that is most appropriate for our given data structure.

# Detect Cardinality : Get back vector : 
Cardinality_func <- function(df, t = 10) {
  sapply(df, function(x) length(unique(x)) < t)
}
# less than 10 unique values
iowa_std[,Cardinality_func(iowa_std)]

observation :

  • We notice there are 4 features with less than 10 unique values ( All size metrics )

  • We may determine these as categorical features–suppose we do categorical encoding (factor).

library(dplyr)
iowa_std <- 
  iowa_std |> 
  mutate(across(contains("size"), as.factor))

observation :

  • This is good, however consider for large data \(n \rightarrow \infty\) , we can begin to treat low cardinality features as categorical, lets see if these measures generally agree. Consider the particular formula for Eff_Cardinality :

\[ \text{Eff_Cardinality} = \frac{\text{Cardinality}}{\text{Sample Size (n)}} \\ \text{Cardinality} = \text{Number of Unique Values} \]

Key Take-away :

  • When \(\text{Eff_Cardinality} = 1\) this means we have exactly 1 unique value for every observation. These are useful to be thought of as row-labels that you can describe as jointed sentences reading across columns. However generating summary statistics or grouping of categorical variables with an effective cardinality of 1 can also provide itself to be helpful when summarizing across strata.

  • When \(\text{Eff_Cardinality} \rightarrow1\) that tells us we have nearly the same amount of unique values as the number of observations in our data. If this is true we can safely say relative to our sample size, our data has many unique values. This tells us its unlikely that repeated values appear, making signals likely sparse for the categorical or numeric vector alone.

  • When \(\text{Eff_Cardinality} = 1/n\) this means we have exactly 1 unique valin total. In other words, the feature of the data set repeats itself : “student”, “student”, … , “student” . These usually just contextualize your data.

  • When \(\text{Eff_Cardinality} \rightarrow 1/n\) this means blocking tends to be happening. Meaning, we have few unique values for a large data set–we have few groups. Because, \(\text{Min(Cardinality)}=1\) (ie you have atleast 1 unique value). Therefore this suggest the feature is likely a categorical variable

Effective_Cardinality  <- 
  function(df, dec = T){
  v <- sapply(df, function(x) {length(unique(x))/nrow(df)})
  v <- sort(v, decreasing = dec)
  data.frame(Effective_Cardinality_Values=v)
  }

# View effective cardinality 
head(Effective_Cardinality(iowa_std), 6)
head(Effective_Cardinality(iowa_std, dec = F), 6)
library(ggplot2)
features <- rownames(Effective_Cardinality(iowa_std))
eff_c_df <- cbind(features, Effective_Cardinality(iowa_std))

Mean <- summary(eff_c_df[5:nrow(eff_c_df),]$Effective_Cardinality_Values)[4]
one_std <- c(Mean + sd(eff_c_df[5:nrow(eff_c_df),]$Effective_Cardinality_Values), Mean - sd(eff_c_df[5:nrow(eff_c_df),]$Effective_Cardinality_Values))

library(patchwork)

bar_plot <- eff_c_df[5:nrow(eff_c_df), ] %>%
  ggplot(aes(x = factor(features, levels = features), y = Effective_Cardinality_Values)) +
  geom_hline(yintercept = Mean, linetype = "dashed", color = "black") +
  geom_hline(yintercept = one_std[1], linetype = "dashed", color = "red") +
  geom_hline(yintercept = one_std[2], linetype = "dashed", color = "red") +
  geom_col() +
  xlab("Features") + ylab("Effective Cardinality") +
  theme(axis.text.x = element_blank(),  
        axis.ticks.x = element_blank()) 

# ggplot boxplot
box_plot <- eff_c_df[5:nrow(eff_c_df), ] %>%
  ggplot(aes(y = Effective_Cardinality_Values)) +
  geom_boxplot()

# Combine side by side
bar_plot + box_plot + plot_layout(widths = c(7, 1))

Observations :

  • This tells us for District & salary have exactly 1 obs for 1 unique value. While size and, experience we nearly 1 observation for every unique value. We suspect their signals will be weak.

    • In other words, we have a summary statistic of the salary for every District – most likely \(\bar{x}\)
  • size metrics have low Eff Cardinality, indicating themselves as categorical variables relative to their sample size. Therefore agreeing with our simple approach of counting unique values.

# Cardinaliy scores : Top & Bottom 3 : 
cardinality <- sapply(iowa_std, function(x){length(unique(x))})
head(sort(cardinality, decreasing = T), 3); head(sort(cardinality, decreasing = F), 3)
## District   salary     size 
##      325      325      293
## small size   med size large size 
##          2          2          2

Pairwise Relationships :

Correlation Matrix

  • Consider that cor(x,y) is a measure of numerical data, it is appropriate we generate a strictly numerical data set to generate a summary of numerical linear-pairwise relationships.
num_df <- iowa_std[,-1] |> select(-contains("size"))

Here we will take a look at global pairwise-patterns, no deep analysis. Simple observations.

## corrplot 0.95 loaded

Observations

  • No correlation : there appears to be no correlations (linear relationship) for salary or experience, indicating it doesn’t alone appear to holds a weak signal with low predictive power – except for (salary,experience) pairs

  • All other features are Positively Correlated [Pairwise Regression Coef are Positive]

Mathematical Foundations :

Consider the singular case of numerical event \(x \text{ and }y\) for a correlation \(r_{xy} = \frac{\text{cov}(x,y)}{\sigma_x\sigma_y}\).

More generally, we say these correlations exist to a correlation matrix containing all our numerical features :

\[ r_{ij} \in C^{\text{p} X\text{p}} \]

Where \(p=\text{Number of Parameters}\) (not inc. intercept). Lets consider the foundations and reasoning for correlation matrices :

We begin by first making assumptions :

  1. Our Data can be described using a linear model
  2. Our errors are normally distributed
  3. The \(\mathbb{E}(\epsilon)=0\) – the population average of error is centered around 0
  4. We have constant variance \(\sigma^2I_n\)

\[ \vec{y}=X\vec{\beta} +\epsilon \text{ for : } \epsilon \sim N(\vec{0}, \sigma^2 I_n) \]

Suppose we expand \(\vec{y}\) :

\[ y= f(X) + \epsilon\\ = X\vec{\beta} + \epsilon\\ = \beta_0 + \beta_1\vec{x}_1 + ...+\beta_n\vec{x}_n + \epsilon \\ = \beta_0 + \sum\beta_i x_i + \epsilon \] For this we assume some intercept \(\beta_0\).

Mathematical Considerations in Pairwise Regression :

Notice that for each of the following Regression models we hold the same structure :

\[ \vec{y}_{325}=\beta_0+\beta_1\vec{x}_i + \epsilon \\ \text{for: } \vec{x}_i\in X^{325 X2} = [\vec{1}_n \vec{x}_1] \\ X'X= \begin{bmatrix}n & \sum{x_i} \\\sum{x_i} & \sum{x_i^2}\end{bmatrix} \\ \implies (X'X)^{-1}\\ =\frac{1}{\text{det}(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} \]

\[ =\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} \]

Which as we know is particularly relevant when develop a notion of variance of \(\hat{\beta}\).

Recall :

\[ \hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1}) \]

proof

We wish to prove 3 statements :

  1. “Our est. of slopes is Normally distributed

  2. “Our est. is unbiased

  3. “Our covariance matrix of the slope estimate is the variance of random error times the inverse of Gram matrix

Pairwise Plots : Generalized Programming :

pairwise_dot_plt <- function(data_set) {
  # Generate all pairwise combinations
  pairwise_comb <- combn(1:ncol(data_set), 2)

  # iterate across each
  for (idx in 1:ncol(pairwise_comb)) {
    comb_vec <- pairwise_comb[, idx]
    x_axis <- comb_vec[1]
    y_axis <- comb_vec[2]
    
    
    # generate reg mdl 
    mdl <- lm(unlist(data_set[, y_axis]) ~ unlist(data_set[, x_axis]), data = data_set)

    # Start a new plot on the PDF device
    plot(
      unlist(data_set[, x_axis]),
      unlist(data_set[, y_axis]),
      xlab = colnames(data_set)[x_axis],
      ylab = colnames(data_set)[y_axis],
      main = paste("Plot", idx)
    )
    abline(mdl, col = "red")
  }
}

Mathematical Considerations :

  • Utilize the stability of the inverse to metrisize \(\text{Det}(X'X)^{-1} \rightarrow0\) slope estimates standard error bounds. In other words, we know \(\text{Det}(X'X)^{-1}=0\implies\text{Let : }A=(X'X)^{-1} \implies A^{-1} \text{ DNE}\) as there does not exist a matrix \(A^{-1}A=I_n\).

Focused Analysis :

library(dplyr)

# Grade, Topic Correlations : 
gd_tpk_cor <- num_df |> select(contains("th"),-contains("avg")) |> cor(use = "complete.obs")

# Topic avg Correlations
sub_avr_cor <- num_df |> select(ends_with("avg"), -matches("^[0-9]+th")) |> cor(use = "complete.obs")

# Grade avg Correlations 
grd_avg_cor <- num_df |> select(matches("^[0-9]+th avg")) |> cor(use = "complete.obs")

Separate Corr :

Grade-Topic Correlations :

corrplot(gd_tpk_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

# Set bounds and filter :  
weak_cutoff <- .3 # below : weak linear association
strong_cutoff <- .6 # above : strong linear association

low_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor < weak_cutoff)
med_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor >= weak_cutoff) & (gd_tpk_cor <= strong_cutoff)
high_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor > strong_cutoff)

# select only ij obs (upper.tri) and not ji, sym matrix
low_pairs <- which(low_idx, arr.ind = TRUE)
med_pairs <- which(med_idx, arr.ind = TRUE)
high_pairs <- which(high_idx, arr.ind = TRUE)
# general NA matrix function : 
generate_NA_mat <-
  function(mat){
    NA_mat <- matrix(
  NA,
  nrow = nrow(mat),
  ncol = ncol(mat),
  dimnames = dimnames(gd_tpk_cor)
)
    return(NA_mat)
  }
# generate general function : 
build_masked_matrix <- function(cor_matrix, pair_indices) {
  NA_mat <- generate_NA_mat(cor_matrix)
  NA_mat[pair_indices] <- cor_matrix[pair_indices]
  return(NA_mat)
}

print("low_cor_mat")
## [1] "low_cor_mat"
low_cor_mat  <- build_masked_matrix(gd_tpk_cor, low_pairs); low_cor_mat
##           4thmath 8thmath  11thmath  4th read 8th read 11th read   8th sci
## 4thmath        NA      NA 0.2742213        NA       NA 0.1940918 0.2283466
## 8thmath        NA      NA        NA        NA       NA 0.2997928        NA
## 11thmath       NA      NA        NA 0.2819581       NA        NA 0.1771869
## 4th read       NA      NA        NA        NA       NA 0.2880443 0.2739120
## 8th read       NA      NA        NA        NA       NA        NA        NA
## 11th read      NA      NA        NA        NA       NA        NA 0.2655738
## 8th sci        NA      NA        NA        NA       NA        NA        NA
## 11th sci       NA      NA        NA        NA       NA        NA        NA
##            11th sci
## 4thmath   0.1864202
## 8thmath   0.2302258
## 11thmath         NA
## 4th read  0.2111796
## 8th read  0.2584866
## 11th read        NA
## 8th sci          NA
## 11th sci         NA
print("med_cor_mat")
## [1] "med_cor_mat"
med_cor_mat  <- build_masked_matrix(gd_tpk_cor, med_pairs); med_cor_mat
##           4thmath   8thmath  11thmath  4th read  8th read 11th read   8th sci
## 4thmath        NA 0.3817187        NA        NA 0.3468462        NA        NA
## 8thmath        NA        NA 0.3118259 0.3787978        NA        NA 0.5362033
## 11thmath       NA        NA        NA        NA 0.3016574        NA        NA
## 4th read       NA        NA        NA        NA 0.3526437        NA        NA
## 8th read       NA        NA        NA        NA        NA 0.3201637 0.5382480
## 11th read      NA        NA        NA        NA        NA        NA        NA
## 8th sci        NA        NA        NA        NA        NA        NA        NA
## 11th sci       NA        NA        NA        NA        NA        NA        NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath  0.4566460
## 4th read         NA
## 8th read         NA
## 11th read        NA
## 8th sci   0.3122344
## 11th sci         NA
print("high_cor_mat")
## [1] "high_cor_mat"
high_cor_mat <- build_masked_matrix(gd_tpk_cor, high_pairs); high_cor_mat
##           4thmath 8thmath 11thmath  4th read  8th read 11th read 8th sci
## 4thmath        NA      NA       NA 0.6752072        NA        NA      NA
## 8thmath        NA      NA       NA        NA 0.6897058        NA      NA
## 11thmath       NA      NA       NA        NA        NA 0.6061706      NA
## 4th read       NA      NA       NA        NA        NA        NA      NA
## 8th read       NA      NA       NA        NA        NA        NA      NA
## 11th read      NA      NA       NA        NA        NA        NA      NA
## 8th sci        NA      NA       NA        NA        NA        NA      NA
## 11th sci       NA      NA       NA        NA        NA        NA      NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath         NA
## 4th read         NA
## 8th read         NA
## 11th read 0.6401641
## 8th sci          NA
## 11th sci         NA

Observations for gd_tpk_cor :

  • high_cor_mat : For grades with themself, all grades (4th, 8th, 11th) appear to a moderately strong relationship between reading and math performance [consider diag behavior].

  • med_cor_mat : 8th grades across all categories appear to have a moderate relationship — science subject for 8th grades appears to approach high_cor_mat category [consider diag behavior].

  • low_cor_mat :

Masked Matrix Function :

masked_matrix_func  <- function(correlation_matrix, bottom_cutoff = 0.3, top_cutoff = 0.6) {
  if (!is.matrix(correlation_matrix)) {
    stop("Input must be a matrix.")
  }
  
  # Logical indices for each strength
  l_idx <- upper.tri(correlation_matrix) & (correlation_matrix < bottom_cutoff)
  m_idx <- upper.tri(correlation_matrix) & 
           (correlation_matrix >= bottom_cutoff) & 
           (correlation_matrix <= top_cutoff)
  h_idx <- upper.tri(correlation_matrix) & (correlation_matrix > top_cutoff)
  
  # Pair indices
  l_pairs <- which(l_idx, arr.ind = TRUE)
  m_pairs <- which(m_idx, arr.ind = TRUE)
  h_pairs <- which(h_idx, arr.ind = TRUE)
  
  # Generate empty NA matrix func
  generate_NA_mat <- function(mat) {
    matrix(NA, nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
  }

  # Use helper function : 
  build_masked_matrix <- function(mat, idx) {
    NA_mat <- generate_NA_mat(mat)
    NA_mat[idx] <- mat[idx]
    return(NA_mat)
  }

  # Store all three in a named list
  cor_lst <- list(
    low_cor_mat  = build_masked_matrix(correlation_matrix, l_pairs),
    med_cor_mat  = build_masked_matrix(correlation_matrix, m_pairs),
    high_cor_mat = build_masked_matrix(correlation_matrix, h_pairs)
  )
  
  return(cor_lst)
}

Subject-Avr Correlations :

# plt : 
corrplot(sub_avr_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

masked_matrix_func(sub_avr_cor)
## $low_cor_mat
##          Math avg Read avg Sci avg
## Math avg       NA       NA      NA
## Read avg       NA       NA      NA
## Sci avg        NA       NA      NA
## 
## $med_cor_mat
##          Math avg Read avg   Sci avg
## Math avg       NA       NA 0.4922274
## Read avg       NA       NA        NA
## Sci avg        NA       NA        NA
## 
## $high_cor_mat
##          Math avg  Read avg   Sci avg
## Math avg       NA 0.7568206        NA
## Read avg       NA        NA 0.6127668
## Sci avg        NA        NA        NA

Grade-Avr Correlations :

corrplot(grd_avg_cor,
        method = "color",
        type = "upper",
        tl.col = "black",
        addCoef.col = "black",  
        number.cex = 0.9        
      )

masked_matrix_func(grd_avg_cor)
## $low_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
## 
## $med_cor_mat
##          4th avg   8th avg  11th avg
## 4th avg       NA 0.4121618 0.3035598
## 8th avg       NA        NA 0.3812491
## 11th avg      NA        NA        NA
## 
## $high_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
grep("_cor$", ls(), value = T)
## [1] "gd_tpk_cor"  "grd_avg_cor" "sub_avr_cor"
cor_lst <- list(gd_tpk_cor = gd_tpk_cor, grd_avg_cor = grd_avg_cor, sub_avr_cor = sub_avr_cor)

# Average Correlation
lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)})
## $gd_tpk_cor
## [1] 0.31
## 
## $grd_avg_cor
## [1] 0.24
## 
## $sub_avr_cor
## [1] 0.41
# average of averages : 
paste(lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)}) |> unlist() |> mean(), " avr cor")
## [1] "0.32  avr cor"

Global Observations :

  • Among variables of interest we have \(\bar{r}_{ij}\) of .32, so all signals appear pretty linearly-weak.

  • Subject Pairs are more related then Grade Pairs

    • In summary : sub_avr_cor < gd_tpk_cor < grd_avg_cor

Mathematical Observations :

Notice \(r_{ij}\in C \implies R^2_{\text{xi}}=r_{i,y}\)

masked <- lapply(cor_lst, masked_matrix_func)
high_only <- lapply(masked, function(x) x[grepl("^high_", names(x))])
high_only
## $gd_tpk_cor
## $gd_tpk_cor$high_cor_mat
##           4thmath 8thmath 11thmath  4th read  8th read 11th read 8th sci
## 4thmath        NA      NA       NA 0.6752072        NA        NA      NA
## 8thmath        NA      NA       NA        NA 0.6897058        NA      NA
## 11thmath       NA      NA       NA        NA        NA 0.6061706      NA
## 4th read       NA      NA       NA        NA        NA        NA      NA
## 8th read       NA      NA       NA        NA        NA        NA      NA
## 11th read      NA      NA       NA        NA        NA        NA      NA
## 8th sci        NA      NA       NA        NA        NA        NA      NA
## 11th sci       NA      NA       NA        NA        NA        NA      NA
##            11th sci
## 4thmath          NA
## 8thmath          NA
## 11thmath         NA
## 4th read         NA
## 8th read         NA
## 11th read 0.6401641
## 8th sci          NA
## 11th sci         NA
## 
## 
## $grd_avg_cor
## $grd_avg_cor$high_cor_mat
##          4th avg 8th avg 11th avg
## 4th avg       NA      NA       NA
## 8th avg       NA      NA       NA
## 11th avg      NA      NA       NA
## 
## 
## $sub_avr_cor
## $sub_avr_cor$high_cor_mat
##          Math avg  Read avg   Sci avg
## Math avg       NA 0.7568206        NA
## Read avg       NA        NA 0.6127668
## Sci avg        NA        NA        NA

observations :

  • For our programmed function we have defined \(r_{xy}=.6\) as the lowest possible \(r_{xy}\) value accepted to be highly correlated as \(R^2\) that is \(.6^2 =0.36\) or, 36% of the variation in \(y\) can be explained by \(x\).

  • Notice grd_avg_cor never contains highly correlated features

  • Math avg and Read avg appear to be the most linearly related 0.57% of the of variation. Slighly above half.

proof

\[ y = X\beta+\epsilon \ ; \epsilon \sim N(0, \sigma^2I_n) \]

\[ \mathbb{E}(y)=\mathbb{E}(X\beta+\epsilon) = \mathbb{E}(X\beta) + \mathbb{E}(\epsilon) =X\beta+0=X\beta \]

\[ \text{Cov}(y) =\text{Cov}(X\beta+\epsilon)=\text{Cov}(\epsilon) :=\sigma^2I_n \]

\[ y \sim N(X\beta, \sigma^2I) \]

Further, lets consider that we wish to determine the dist of \(\hat{y}=X\hat{\beta}\) and \(\hat{y}\approx\vec{y}\). We know that \(\hat{y}\) follows a normal dist. Therefore, we need to find its center and variance :

Center :

\[ \mathbb{E}(\hat{y}) \\ = \mathbb{E}(X\hat{\beta}) \\ =X\mathbb{E}(\hat{\beta}) \\ =X\beta \\ = \mu \\ = y \]

Variance :

\[ \text{Cov}(\hat{y})=\text{Cov}(Hy) =H\text{Cov}(y)H' =H\sigma^2I_nH'=\sigma^2HH'=\sigma^2H \]

This is because \(y \sim N(X\beta, \sigma^2I_n)\) and \(H^2=H\) (idopent matrix).

Recall :

\[ H = XA ; A = (X'X)^{-1}X' \\ H =X(X'X)^{-1}X' \]

and for this case :

\[ X = [1_{325} \vec{x}_i] \\ \implies \\ H \\ = X\frac{1}{\text{det}(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix} X'\\ = X\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}X' \\ = [1_{325} \vec{x}_i]\frac{1}{n \sum{x_i}^2-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\-\sum{x_i} & n\end{bmatrix}[1_{325} \vec{x}_i]' \]

# Fit linear model
mdl <- lm(`Read avg` ~ `Math avg`, data = iowa_std)

# ---- Scatter plot with regression line ----
plot(iowa_std$`Math avg`, iowa_std$`Read avg`,
     main = "Math vs Read Avg Mdl",
     xlab = "Math Avg", ylab = "Read Avg", pch = 19)
abline(mdl, col = "red", lwd = 2)

# QQ plt
plot(mdl, which = 2)

# Histogram 
hist(iowa_std$`Read avg`,
     probability = TRUE,
     breaks = 20,
     col = "lightgray",
     xlim = c(
       mean(iowa_std$`Read avg`, na.rm = TRUE) - 4 * sd(iowa_std$`Read avg`, na.rm = TRUE),
       mean(iowa_std$`Read avg`, na.rm = TRUE) + 4 * sd(iowa_std$`Read avg`, na.rm = TRUE)
     ),
     main = "Histogram of Read Avg with Normal Curves",
     xlab = "Read Avg")


# Center at Xbeta = \hat{y}
curve(
  dnorm(x, 
        mean = mean(fitted(mdl), na.rm = TRUE), 
        sd = sd(fitted(mdl), na.rm = TRUE)), 
  col = "blue", lwd = 2, add = TRUE)

# Raw values : 
curve(
  dnorm(x, 
        mean = mean(iowa_std$`Read avg`, na.rm = TRUE), 
        sd = sd(iowa_std$`Read avg`, na.rm = TRUE)), 
  col = "green", lwd = 2, add = TRUE)

legend("topright", legend = c("Centered at Xβ", "Empirical Mean"),
       col = c("blue", "green"), lwd = 2, bty = "n")

  • As we can see our theoretical model \(y\) does a pretty good job of estimating Read avg and Math avg, this tells us that there appears to be a strong relationship between a students math & reading performance.

  • However, relative to our emperical model based on our outcome variable, we get a worse performance. We notice that the difference is in the variance.

  • Notice our correlation squared and R^2 are equivalent

  • Hypothesis testing :

    • \[ \text{T-test : }H_o : \beta_i=0 \] indicates regression coefficients are statistically significant.

    • \[ F-\text{test : } H_o : \beta_0=\beta_1=0 \] indicating holding other coefficients constant, model is significant

  • In short this indicates strong evidence that districts which do well in reading also tend to do well in math together. Perhaps targeting these two areas will help overall performance.

Okay, lets go back to our research questions :

  • How do student performance vary across, grade, subject and size?

    • In district located in iowa, relationship among school subjects is stronger than grade level.
    • Reading and math appear to be related through grade levels
  • How do Teacher salaries vary across Categories?

    • What features contribute most to a wealthier teacher?

District Size Model :

# Add avr col : 
iowa_std <- 
  iowa_std |> 
  rowwise() |>
  mutate(avr = mean(c_across(c(`4thmath`, `8thmath`, `11thmath`, 
                                        `4th read`, `8th read`, `11th read`, 
                                        `8th sci`, `11th sci`)), na.rm = TRUE))

Suppose we divide school districts into three size groups : large districts (x > 2,000 students), medium-sized districts (1,000 < x < 2,000 students), and small districts (x < 1,000 students).

We can formulate a model to investigate district size and overall performance : \[ \text{avr}_i = \beta_0 + \beta_1 \cdot \mathbf{1}[\text{med}_i] + \beta_2 \cdot \mathbf{1}[\text{large}_i] + \varepsilon_i \]

\[ \begin{align*}\text{avr}_i & : \text{Average test score for district } i \\\beta_0 & : \text{Mean score for districts in the small size group (baseline)} \\\beta_1 & : \text{Difference in mean score between medium and small districts} \\\beta_2 & : \text{Difference in mean score between large and small districts} \\\mathbf{1}[\text{med}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is medium, } 0 \text{ otherwise} \\\mathbf{1}[\text{large}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is large, } 0 \text{ otherwise} \\\varepsilon_i & : \text{Random error term for district } i\end{align*} \]

Which is equivalent to :

\[ y_i = \mu_1 x_{i1} + \mu_2 x_{i2} + \mu_3 x_{i3} + \varepsilon_i = \beta_0 + \beta_1 x_{i2} + \beta_2 x_{i3} + \varepsilon_i \]

\[ \begin{align*}y_i & : \text{Average test score for district } i \\x_{i1}, x_{i2}, x_{i3} & : \text{Indicator variables for district size: small, medium, large} \\\mu_1, \mu_2, \mu_3 & : \text{Mean test scores for small, medium, and large districts, respectively} \\\varepsilon_i & : \text{Random error in test score for district } i \\\beta_0 & : \text{Mean test score for small districts (baseline group)} \\\beta_1 & : \text{Difference in mean score between medium and small districts} \\\beta_2 & : \text{Difference in mean score between large and small districts} \\\end{align*} \]

mdl <- lm(avr~`size(coded)`, data=iowa_std)
summary(mdl)
## 
## Call:
## lm(formula = avr ~ `size(coded)`, data = iowa_std)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.2652  -4.5152   0.3901   4.8598  20.4848 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     73.0152     0.4840 150.863   <2e-16 ***
## `size(coded)`2   0.5948     1.0012   0.594    0.553    
## `size(coded)`3   2.5525     1.3306   1.918    0.056 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.227 on 322 degrees of freedom
## Multiple R-squared:  0.01151,    Adjusted R-squared:  0.005365 
## F-statistic: 1.874 on 2 and 322 DF,  p-value: 0.1552

Notice these are the same results as our ONE-WAY-ANOVA. Lets take a closer look at the F-test results :

summary(aov(avr~`size(coded)`, data=iowa_std))
##                Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)`   2    196   97.88   1.874  0.155
## Residuals     322  16820   52.24
levels(iowa_std$`size(coded)`) <- c("Small", "Medium", "Large") 
boxplot(avr ~ `size(coded)`, data = iowa_std,
        main = "Average Test Scores by District Size",
        xlab = "District Size",
        ylab = "Average Test Score",
        col = c("skyblue", "lightgreen", "lightpink"))

abline(h = mean(iowa_std$avr, na.rm = TRUE), col = "red", lty = 2)

Our model compares the overall average and looks for differences each indactor (S, M, L) makes. It appears large districts tend to outperform other districts :

iowa_std |> group_by(`size(coded)`) |> summarize(avr_score_per_size = round(mean(avr), 2)) |> arrange(desc(avr_score_per_size))

This claim is supported by a quick glance at grouped summary statistics.