Project Duration : 3 Days
This project explores district-level education data from Iowa, covering standardized test performance in math, reading, and science for grades 4, 8, and 11. The data set includes 325 school districts and their associated teacher salaries. Variables include district size metrics, teacher work experience, average grade and class performance, and grade-subject performance (per district). Subjects include mathematics, reading, and science. The purpose of this project is to demonstrate rigorous use and understanding of linear modeling, statistical programming, and well-defined metrics. We will use a generic programming style, linear modeling in a mathematical framework, and domain-specific knowledge of education to develop answers to our research questions. Alongside this task we will explore mathematical concepts and build a library of functions that we can utilize in future projects, demonstrating production-level programming and rigorous analysis guided by mathematics. Sections of the project apply metrics discussed in Chapter 8 of Introduction to Regression Modeling (Abraham & Ledolter, 2006). Alongside the textbook-provided solutions, we dive into relevant mathematical theorems discussed in lecture (STATS 100C at UCLA) and other textbooks to enrich the information we can pull from this data set.
How does student performance vary across grade, subject, and district size?
How do teacher salaries vary across categories?
What features contribute most to a higher teacher salary?
url <- "https://www.biz.uiowa.edu/faculty/jledolter/RegressionModeling/Data/Chapter8/iowastudent.txt"
# download.file(url, destfile = "iowastudent.txt", mode = "wb")
library(readr)
iowa_std <- read_tsv("iowastudent.txt", na = "*", show_col_types = FALSE)
observations :
## [1] "325 School Districts"
## [1] "22 Features"
## [1] "District" "4thmath" "8thmath" "11thmath" "4th read"
## [6] "8th read" "11th read" "8th sci" "11th sci" "size"
## [11] "size(coded)" "small size" "med size" "large size" "salary"
## [16] "experience" "4th avg" "8th avg" "11th avg" "Math avg"
## [21] "Read avg" "Sci avg"
The motivation for and interest in cardinality is centered on the issue of big data. Often in data science we deal with data sets with many features. Framing data in the context of cardinality metrics provides a generalizable framework for grouping features into their associated data types, laying the foundation for the types of analysis most appropriate for a given data structure.
# Detect low cardinality : returns a logical vector
# (TRUE when a column has fewer than t unique values)
Cardinality_func <- function(df, t = 10) {
  sapply(df, function(x) length(unique(x)) < t)
}
# keep features with fewer than 10 unique values
iowa_std[,Cardinality_func(iowa_std)]
observation :
We notice there are 4 features with fewer than 10 unique values (all size metrics). We may treat these as categorical features, so we apply categorical encoding (factor).
library(dplyr)
iowa_std <-
iowa_std |>
mutate(across(contains("size"), as.factor))
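As a quick sanity check (a minimal sketch), we can confirm the size columns are now factors and count their levels :
# each coded size feature should now be a factor with few levels
sapply(iowa_std |> select(contains("size")), nlevels)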
Eff_Cardinality :
\[ \text{Eff\_Cardinality} = \frac{\text{Cardinality}}{\text{Sample Size } (n)}, \qquad \text{Cardinality} = \text{Number of Unique Values} \]
Key Take-away :
When \(\text{Eff\_Cardinality} = 1\) we have exactly one unique value for every observation. These features are best thought of as row labels that can be read as joined sentences across columns. However, grouping by a categorical variable with an effective cardinality of 1 can still be helpful when summarizing across strata.
When \(\text{Eff\_Cardinality} \rightarrow 1\) we have nearly as many unique values as observations in our data. Relative to our sample size, the data has many unique values, so repeated values are unlikely, making the signal from that categorical or numeric vector alone likely sparse.
When \(\text{Eff\_Cardinality} = 1/n\) we have exactly one unique value in total. In other words, the feature repeats itself : “student”, “student”, … , “student”. These usually just contextualize your data.
When \(\text{Eff\_Cardinality} \rightarrow 1/n\) blocking tends to be happening: we have few unique values for a large data set, i.e., few groups. Because \(\text{Min(Cardinality)} = 1\) (you have at least one unique value), this suggests the feature is likely a categorical variable.
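To make these regimes concrete, here is a minimal sketch on toy vectors (the names toy, id, constant, and block are illustrative, not from the data set) :
n <- 100
toy <- data.frame(
  id       = 1:n,                      # Eff_Cardinality = 1 : a row label
  constant = rep("student", n),        # Eff_Cardinality = 1/n : context only
  block    = rep(c("A", "B"), n / 2)   # Eff_Cardinality -> 1/n : few groups
)
sapply(toy, function(x) length(unique(x)) / n)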
Effective_Cardinality <-
  function(df, dec = T){
    # unique-value counts scaled by sample size, sorted
    v <- sapply(df, function(x) {length(unique(x))/nrow(df)})
    v <- sort(v, decreasing = dec)
    data.frame(Effective_Cardinality_Values = v)
  }
# View effective cardinality
head(Effective_Cardinality(iowa_std), 6)
head(Effective_Cardinality(iowa_std, dec = F), 6)
library(ggplot2)
features <- rownames(Effective_Cardinality(iowa_std))
eff_c_df <- cbind(features, Effective_Cardinality(iowa_std))
eff_vals <- eff_c_df[5:nrow(eff_c_df), ]$Effective_Cardinality_Values
Mean <- mean(eff_vals)  # summary(...)[4] returns the mean; compute it directly
one_std <- c(Mean + sd(eff_vals), Mean - sd(eff_vals))
library(patchwork)
bar_plot <- eff_c_df[5:nrow(eff_c_df), ] %>%
ggplot(aes(x = factor(features, levels = features), y = Effective_Cardinality_Values)) +
geom_hline(yintercept = Mean, linetype = "dashed", color = "black") +
geom_hline(yintercept = one_std[1], linetype = "dashed", color = "red") +
geom_hline(yintercept = one_std[2], linetype = "dashed", color = "red") +
geom_col() +
xlab("Features") + ylab("Effective Cardinality") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
# ggplot boxplot
box_plot <- eff_c_df[5:nrow(eff_c_df), ] %>%
ggplot(aes(y = Effective_Cardinality_Values)) +
geom_boxplot()
# Combine side by side
bar_plot + box_plot + plot_layout(widths = c(7, 1))
Observations :
This tells us District and salary have exactly one unique value for every observation, while size and experience have nearly one unique value per observation; we suspect their standalone signals will be weak. A distinct salary for every district suggests the reported salary is most likely a district average \(\bar{x}\). The coded size metrics have low effective cardinality, marking themselves as categorical variables relative to their sample size, agreeing with our simple approach of counting unique values.
# Cardinality scores : Top & Bottom 3 :
cardinality <- sapply(iowa_std, function(x){length(unique(x))})
head(sort(cardinality, decreasing = T), 3); head(sort(cardinality, decreasing = F), 3)
## District salary size
## 325 325 293
## small size med size large size
## 2 2 2
num_df <- iowa_std[,-1] |> select(-contains("size"))
Here we will take a look at global pairwise patterns; no deep analysis, simple observations.
library(corrplot)
## corrplot 0.95 loaded
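A minimal sketch of the global correlation plot these observations draw on (the original chunk is not shown here; this assumes the corrplot defaults used elsewhere in the project) :
# global pairwise correlations across all numeric features
num_cor <- cor(num_df, use = "complete.obs")
corrplot(num_cor, method = "color", type = "upper", tl.col = "black")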
No correlation : there appear to be no correlations (linear relationships) between salary or experience and the other features, indicating that each alone holds a weak signal with low predictive power – except for the (salary, experience) pair itself.
All other features are positively correlated [pairwise regression coefficients are positive].
Mathematical Foundations :
Consider the single case of numerical variables \(x\) and \(y\) with correlation \(r_{xy} = \frac{\text{cov}(x,y)}{\sigma_x\sigma_y}\).
More generally, we collect these correlations into a correlation matrix containing all our numerical features :
\[ r_{ij} \in C^{p \times p} \]
where \(p = \text{Number of Parameters}\) (not inc. intercept). Let's consider the foundations and reasoning for correlation matrices :
We begin by first making assumptions :
\[ \vec{y}=X\vec{\beta} +\epsilon \text{ for : } \epsilon \sim N(\vec{0}, \sigma^2 I_n) \]
Suppose we expand \(\vec{y}\) :
\[ y = f(X) + \epsilon \\ = X\vec{\beta} + \epsilon \\ = \beta_0 + \beta_1\vec{x}_1 + \dots + \beta_p\vec{x}_p + \epsilon \\ = \beta_0 + \sum_{i=1}^{p}\beta_i \vec{x}_i + \epsilon \] For this we assume some intercept \(\beta_0\).
Notice that for each of the following regression models we hold the same structure :
\[ \vec{y}_{325} = \beta_0 + \beta_1\vec{x}_i + \epsilon \\ \text{for : } \vec{x}_i \in X^{325 \times 2} = [\vec{1}_n \; \vec{x}_i] \\ X'X = \begin{bmatrix} n & \sum{x_i} \\ \sum{x_i} & \sum{x_i^2} \end{bmatrix} \\ \implies (X'X)^{-1} \\ = \frac{1}{\det(X'X)}\begin{bmatrix} \sum{x_i^2} & -\sum{x_i} \\ -\sum{x_i} & n \end{bmatrix} \]
\[ = \frac{1}{n\sum{x_i^2}-(\sum{x_i})^2}\begin{bmatrix} \sum{x_i^2} & -\sum{x_i} \\ -\sum{x_i} & n \end{bmatrix} \]
Which, as we know, is particularly relevant when developing a notion of the variance of \(\hat{\beta}\).
Recall :
\[ \hat{\beta} \sim N(\beta, \sigma^2 (X'X)^{-1}) \]
proof
We wish to prove 3 statements :
“Our est. of slopes is Normally distributed”
“Our est. is unbiased”
“Our covariance matrix of the slope estimate is the variance of random error times the inverse of Gram matrix”
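A sketch of the standard argument under the assumptions above :
\[ \hat{\beta} = (X'X)^{-1}X'y \text{ is a linear transformation of } y \sim N(X\beta, \sigma^2 I_n), \text{ hence normally distributed} \]
\[ \mathbb{E}(\hat{\beta}) = (X'X)^{-1}X'\mathbb{E}(y) = (X'X)^{-1}X'X\beta = \beta \quad \text{(unbiased)} \]
\[ \text{Cov}(\hat{\beta}) = (X'X)^{-1}X'\,\text{Cov}(y)\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1} \]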
pairwise_dot_plt <- function(data_set) {
  # Generate all pairwise combinations of column indices
  pairwise_comb <- combn(1:ncol(data_set), 2)
  # iterate across each pair
  for (idx in 1:ncol(pairwise_comb)) {
    comb_vec <- pairwise_comb[, idx]
    x_axis <- comb_vec[1]
    y_axis <- comb_vec[2]
    # fit the pairwise regression model
    mdl <- lm(unlist(data_set[, y_axis]) ~ unlist(data_set[, x_axis]))
    # scatter plot of the pair
    plot(
      unlist(data_set[, x_axis]),
      unlist(data_set[, y_axis]),
      xlab = colnames(data_set)[x_axis],
      ylab = colnames(data_set)[y_axis],
      main = paste("Plot", idx)
    )
    abline(mdl, col = "red")
  }
}
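The function is defined but not invoked in this section; a hypothetical call on a few numeric columns would look like :
# e.g., all pairwise scatter plots with fitted lines for three features
pairwise_dot_plt(num_df[, 1:3])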
Mathematical Considerations :
library(dplyr)
# Grade, Topic Correlations :
gd_tpk_cor <- num_df |> select(contains("th"),-contains("avg")) |> cor(use = "complete.obs")
# Topic avg Correlations
sub_avr_cor <- num_df |> select(ends_with("avg"), -matches("^[0-9]+th")) |> cor(use = "complete.obs")
# Grade avg Correlations
grd_avg_cor <- num_df |> select(matches("^[0-9]+th avg")) |> cor(use = "complete.obs")
corrplot(gd_tpk_cor,
method = "color",
type = "upper",
tl.col = "black",
addCoef.col = "black",
number.cex = 0.9
)
# Set bounds and filter :
weak_cutoff <- .3 # below : weak linear association
strong_cutoff <- .6 # above : strong linear association
low_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor < weak_cutoff)
med_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor >= weak_cutoff) & (gd_tpk_cor <= strong_cutoff)
high_idx <- upper.tri(gd_tpk_cor) & (gd_tpk_cor > strong_cutoff)
# select only ij obs (upper.tri) and not ji, sym matrix
low_pairs <- which(low_idx, arr.ind = TRUE)
med_pairs <- which(med_idx, arr.ind = TRUE)
high_pairs <- which(high_idx, arr.ind = TRUE)
# general NA matrix function :
generate_NA_mat <-
  function(mat){
    NA_mat <- matrix(
      NA,
      nrow = nrow(mat),
      ncol = ncol(mat),
      dimnames = dimnames(mat)  # take dimnames from the input matrix, not a global
    )
    return(NA_mat)
  }
# general masking function :
build_masked_matrix <- function(cor_matrix, pair_indices) {
  NA_mat <- generate_NA_mat(cor_matrix)
  NA_mat[pair_indices] <- cor_matrix[pair_indices]
  return(NA_mat)
}
print("low_cor_mat")
## [1] "low_cor_mat"
low_cor_mat <- build_masked_matrix(gd_tpk_cor, low_pairs); low_cor_mat
## 4thmath 8thmath 11thmath 4th read 8th read 11th read 8th sci
## 4thmath NA NA 0.2742213 NA NA 0.1940918 0.2283466
## 8thmath NA NA NA NA NA 0.2997928 NA
## 11thmath NA NA NA 0.2819581 NA NA 0.1771869
## 4th read NA NA NA NA NA 0.2880443 0.2739120
## 8th read NA NA NA NA NA NA NA
## 11th read NA NA NA NA NA NA 0.2655738
## 8th sci NA NA NA NA NA NA NA
## 11th sci NA NA NA NA NA NA NA
## 11th sci
## 4thmath 0.1864202
## 8thmath 0.2302258
## 11thmath NA
## 4th read 0.2111796
## 8th read 0.2584866
## 11th read NA
## 8th sci NA
## 11th sci NA
print("med_cor_mat")
## [1] "med_cor_mat"
med_cor_mat <- build_masked_matrix(gd_tpk_cor, med_pairs); med_cor_mat
## 4thmath 8thmath 11thmath 4th read 8th read 11th read 8th sci
## 4thmath NA 0.3817187 NA NA 0.3468462 NA NA
## 8thmath NA NA 0.3118259 0.3787978 NA NA 0.5362033
## 11thmath NA NA NA NA 0.3016574 NA NA
## 4th read NA NA NA NA 0.3526437 NA NA
## 8th read NA NA NA NA NA 0.3201637 0.5382480
## 11th read NA NA NA NA NA NA NA
## 8th sci NA NA NA NA NA NA NA
## 11th sci NA NA NA NA NA NA NA
## 11th sci
## 4thmath NA
## 8thmath NA
## 11thmath 0.4566460
## 4th read NA
## 8th read NA
## 11th read NA
## 8th sci 0.3122344
## 11th sci NA
print("high_cor_mat")
## [1] "high_cor_mat"
high_cor_mat <- build_masked_matrix(gd_tpk_cor, high_pairs); high_cor_mat
## 4thmath 8thmath 11thmath 4th read 8th read 11th read 8th sci
## 4thmath NA NA NA 0.6752072 NA NA NA
## 8thmath NA NA NA NA 0.6897058 NA NA
## 11thmath NA NA NA NA NA 0.6061706 NA
## 4th read NA NA NA NA NA NA NA
## 8th read NA NA NA NA NA NA NA
## 11th read NA NA NA NA NA NA NA
## 8th sci NA NA NA NA NA NA NA
## 11th sci NA NA NA NA NA NA NA
## 11th sci
## 4thmath NA
## 8thmath NA
## 11thmath NA
## 4th read NA
## 8th read NA
## 11th read 0.6401641
## 8th sci NA
## 11th sci NA
gd_tpk_cor :
high_cor_mat : within each grade (4th, 8th, 11th), reading and math performance appear to have a moderately strong relationship [consider the diagonal behavior].
med_cor_mat : 8th grade across all categories appears to have a moderate relationship; the science subject for 8th grade approaches the high_cor_mat category [consider the diagonal behavior].
low_cor_mat : pairs across different grades (e.g., 4thmath with 11th read) tend to be only weakly related.
masked_matrix_func <- function(correlation_matrix, bottom_cutoff = 0.3, top_cutoff = 0.6) {
  if (!is.matrix(correlation_matrix)) {
    stop("Input must be a matrix.")
  }
  # Logical indices for each strength
  l_idx <- upper.tri(correlation_matrix) & (correlation_matrix < bottom_cutoff)
  m_idx <- upper.tri(correlation_matrix) &
    (correlation_matrix >= bottom_cutoff) &
    (correlation_matrix <= top_cutoff)
  h_idx <- upper.tri(correlation_matrix) & (correlation_matrix > top_cutoff)
  # Pair indices
  l_pairs <- which(l_idx, arr.ind = TRUE)
  m_pairs <- which(m_idx, arr.ind = TRUE)
  h_pairs <- which(h_idx, arr.ind = TRUE)
  # Generate empty NA matrix func
  generate_NA_mat <- function(mat) {
    matrix(NA, nrow = nrow(mat), ncol = ncol(mat), dimnames = dimnames(mat))
  }
  # Use helper function :
  build_masked_matrix <- function(mat, idx) {
    NA_mat <- generate_NA_mat(mat)
    NA_mat[idx] <- mat[idx]
    return(NA_mat)
  }
  # Store all three in a named list
  cor_lst <- list(
    low_cor_mat = build_masked_matrix(correlation_matrix, l_pairs),
    med_cor_mat = build_masked_matrix(correlation_matrix, m_pairs),
    high_cor_mat = build_masked_matrix(correlation_matrix, h_pairs)
  )
  return(cor_lst)
}
# plt :
corrplot(sub_avr_cor,
method = "color",
type = "upper",
tl.col = "black",
addCoef.col = "black",
number.cex = 0.9
)
masked_matrix_func(sub_avr_cor)
## $low_cor_mat
## Math avg Read avg Sci avg
## Math avg NA NA NA
## Read avg NA NA NA
## Sci avg NA NA NA
##
## $med_cor_mat
## Math avg Read avg Sci avg
## Math avg NA NA 0.4922274
## Read avg NA NA NA
## Sci avg NA NA NA
##
## $high_cor_mat
## Math avg Read avg Sci avg
## Math avg NA 0.7568206 NA
## Read avg NA NA 0.6127668
## Sci avg NA NA NA
corrplot(grd_avg_cor,
method = "color",
type = "upper",
tl.col = "black",
addCoef.col = "black",
number.cex = 0.9
)
masked_matrix_func(grd_avg_cor)
## $low_cor_mat
## 4th avg 8th avg 11th avg
## 4th avg NA NA NA
## 8th avg NA NA NA
## 11th avg NA NA NA
##
## $med_cor_mat
## 4th avg 8th avg 11th avg
## 4th avg NA 0.4121618 0.3035598
## 8th avg NA NA 0.3812491
## 11th avg NA NA NA
##
## $high_cor_mat
## 4th avg 8th avg 11th avg
## 4th avg NA NA NA
## 8th avg NA NA NA
## 11th avg NA NA NA
grep("_cor$", ls(), value = T)
## [1] "gd_tpk_cor" "grd_avg_cor" "sub_avr_cor"
cor_lst <- list(gd_tpk_cor = gd_tpk_cor, grd_avg_cor = grd_avg_cor, sub_avr_cor = sub_avr_cor)
# Average correlation (the diagonal is zeroed by subtracting I, but the
# zeros still count toward the mean, slightly deflating it)
lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)})
## $gd_tpk_cor
## [1] 0.31
##
## $grd_avg_cor
## [1] 0.24
##
## $sub_avr_cor
## [1] 0.41
# average of averages :
paste(lapply(cor_lst, function(x){as.vector(x - diag(1,nrow(x))) |> mean() |> round(2)}) |> unlist() |> mean(), " avr cor")
## [1] "0.32 avr cor"
Global Observations :
Among variables of interest we have an average \(\bar{r}_{ij}\) of .32, so all signals appear fairly weak in the linear sense.
Subject pairs are more related than grade pairs :
grd_avg_cor < gd_tpk_cor < sub_avr_cor
Mathematical Observations :
Notice that for a simple regression of \(y\) on \(x_i\), \(r_{i,y} \in C \implies R^2_{x_i} = r_{i,y}^2\).
masked <- lapply(cor_lst, masked_matrix_func)
high_only <- lapply(masked, function(x) x[grepl("^high_", names(x))])
high_only
## $gd_tpk_cor
## $gd_tpk_cor$high_cor_mat
## 4thmath 8thmath 11thmath 4th read 8th read 11th read 8th sci
## 4thmath NA NA NA 0.6752072 NA NA NA
## 8thmath NA NA NA NA 0.6897058 NA NA
## 11thmath NA NA NA NA NA 0.6061706 NA
## 4th read NA NA NA NA NA NA NA
## 8th read NA NA NA NA NA NA NA
## 11th read NA NA NA NA NA NA NA
## 8th sci NA NA NA NA NA NA NA
## 11th sci NA NA NA NA NA NA NA
## 11th sci
## 4thmath NA
## 8thmath NA
## 11thmath NA
## 4th read NA
## 8th read NA
## 11th read 0.6401641
## 8th sci NA
## 11th sci NA
##
##
## $grd_avg_cor
## $grd_avg_cor$high_cor_mat
## 4th avg 8th avg 11th avg
## 4th avg NA NA NA
## 8th avg NA NA NA
## 11th avg NA NA NA
##
##
## $sub_avr_cor
## $sub_avr_cor$high_cor_mat
## Math avg Read avg Sci avg
## Math avg NA 0.7568206 NA
## Read avg NA NA 0.6127668
## Sci avg NA NA NA
observations :
For our programmed function we have defined \(r_{xy} = .6\) as the lowest \(r_{xy}\) value accepted as highly correlated – as an \(R^2\) that is \(.6^2 = 0.36\), or 36% of the variation in \(y\) explained by \(x\).
Notice grd_avg_cor never contains highly correlated features.
Math avg and Read avg appear to be the most linearly related: \(0.7568^2 \approx 0.57\), i.e., 57% of the variation, slightly above half.
proof
\[ y = X\beta+\epsilon \ ; \epsilon \sim N(0, \sigma^2I_n) \]
\[ \mathbb{E}(y)=\mathbb{E}(X\beta+\epsilon) = \mathbb{E}(X\beta) + \mathbb{E}(\epsilon) =X\beta+0=X\beta \]
\[ \text{Cov}(y) =\text{Cov}(X\beta+\epsilon)=\text{Cov}(\epsilon) :=\sigma^2I_n \]
\[ y \sim N(X\beta, \sigma^2I) \]
Further, let's determine the distribution of \(\hat{y} = X\hat{\beta}\), our estimate of \(\vec{y}\). We know that \(\hat{y}\) follows a normal distribution, since it is a linear transformation of \(y\). Therefore, we need to find its center and variance :
Center :
\[ \mathbb{E}(\hat{y}) \\ = \mathbb{E}(X\hat{\beta}) \\ = X\mathbb{E}(\hat{\beta}) \\ = X\beta \\ = \mathbb{E}(y) \\ = \mu \]
Variance :
\[ \text{Cov}(\hat{y})=\text{Cov}(Hy) =H\text{Cov}(y)H' =H\sigma^2I_nH'=\sigma^2HH'=\sigma^2H \]
This is because \(y \sim N(X\beta, \sigma^2I_n)\), \(H' = H\) (symmetric), and \(H^2 = H\) (idempotent matrix).
Recall :
\[ H = XA ; A = (X'X)^{-1}X' \\ H =X(X'X)^{-1}X' \]
and for this case :
\[ X = [\vec{1}_{325} \; \vec{x}_i] \\ \implies H \\ = X\frac{1}{\det(X'X)}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\ -\sum{x_i} & n\end{bmatrix}X' \\ = [\vec{1}_{325} \; \vec{x}_i]\frac{1}{n\sum{x_i^2}-(\sum{x_i})^2}\begin{bmatrix}\sum{x_i^2} & -\sum{x_i} \\ -\sum{x_i} & n\end{bmatrix}[\vec{1}_{325} \; \vec{x}_i]' \]
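A minimal numerical check of these hat-matrix properties, assuming we build the simple-regression design matrix from Math avg (complete cases only) :
x <- iowa_std$`Math avg`[!is.na(iowa_std$`Math avg`)]
X <- cbind(1, x)                        # design matrix [1 x_i]
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
all.equal(H %*% H, H)                   # idempotent : H^2 = H
all.equal(t(H), H)                      # symmetric : H' = H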
# Fit linear model
mdl <- lm(`Read avg` ~ `Math avg`, data = iowa_std)
# ---- Scatter plot with regression line ----
plot(iowa_std$`Math avg`, iowa_std$`Read avg`,
main = "Math vs Read Avg Mdl",
xlab = "Math Avg", ylab = "Read Avg", pch = 19)
abline(mdl, col = "red", lwd = 2)
# QQ plt
plot(mdl, which = 2)
# Histogram
hist(iowa_std$`Read avg`,
probability = TRUE,
breaks = 20,
col = "lightgray",
xlim = c(
mean(iowa_std$`Read avg`, na.rm = TRUE) - 4 * sd(iowa_std$`Read avg`, na.rm = TRUE),
mean(iowa_std$`Read avg`, na.rm = TRUE) + 4 * sd(iowa_std$`Read avg`, na.rm = TRUE)
),
main = "Histogram of Read Avg with Normal Curves",
xlab = "Read Avg")
# Center at Xbeta = \hat{y}
curve(
dnorm(x,
mean = mean(fitted(mdl), na.rm = TRUE),
sd = sd(fitted(mdl), na.rm = TRUE)),
col = "blue", lwd = 2, add = TRUE)
# Raw values :
curve(
dnorm(x,
mean = mean(iowa_std$`Read avg`, na.rm = TRUE),
sd = sd(iowa_std$`Read avg`, na.rm = TRUE)),
col = "green", lwd = 2, add = TRUE)
legend("topright", legend = c("Centered at Xβ", "Empirical Mean"),
col = c("blue", "green"), lwd = 2, bty = "n")
As we can see, our theoretical model \(y\) does a pretty good job of estimating Read avg from Math avg, telling us there appears to be a strong relationship between a student's math and reading performance.
However, relative to our empirical model based on the raw outcome variable, we get worse performance. We notice that the difference is in the variance.
Notice our correlation squared and \(R^2\) are equivalent :
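A quick numerical check using the fitted model above :
# for simple regression, R^2 equals the squared pairwise correlation
r <- cor(iowa_std$`Math avg`, iowa_std$`Read avg`, use = "complete.obs")
c(r_squared = r^2, R_squared = summary(mdl)$r.squared)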
Hypothesis testing :
\[ \text{T-test : } H_0 : \beta_i = 0 \] rejecting this indicates an individual regression coefficient is statistically significant, holding the other coefficients constant.
\[ F\text{-test : } H_0 : \beta_1 = \dots = \beta_p = 0 \] rejecting this indicates the model as a whole is significant.
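Both tests can be read off the fitted model; a minimal sketch :
summary(mdl)$coefficients   # t-tests for the individual coefficients
anova(mdl)                  # F-test for the regression as a whole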
In short, this indicates strong evidence that districts which do well in reading also tend to do well in math. Perhaps targeting these two areas together will help overall performance.
Okay, let's go back to our research questions :
How does student performance vary across grade, subject, and district size?
How do teacher salaries vary across categories?
What features contribute most to a higher teacher salary?
# Add avr col :
iowa_std <-
  iowa_std |>
  rowwise() |>
  mutate(avr = mean(c_across(c(`4thmath`, `8thmath`, `11thmath`,
                               `4th read`, `8th read`, `11th read`,
                               `8th sci`, `11th sci`)), na.rm = TRUE)) |>
  ungroup()  # drop the rowwise grouping once the per-district mean is built
Suppose we divide school districts into three size groups : large districts (x > 2,000 students), medium-sized districts (1,000 < x < 2,000 students), and small districts (x < 1,000 students).
We can formulate a model to investigate district size and overall performance : \[ \text{avr}_i = \beta_0 + \beta_1 \cdot \mathbf{1}[\text{med}_i] + \beta_2 \cdot \mathbf{1}[\text{large}_i] + \varepsilon_i \]
\[ \begin{align*}\text{avr}_i & : \text{Average test score for district } i \\\beta_0 & : \text{Mean score for districts in the small size group (baseline)} \\\beta_1 & : \text{Difference in mean score between medium and small districts} \\\beta_2 & : \text{Difference in mean score between large and small districts} \\\mathbf{1}[\text{med}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is medium, } 0 \text{ otherwise} \\\mathbf{1}[\text{large}_i] & : \text{Indicator function: } 1 \text{ if district } i \text{ is large, } 0 \text{ otherwise} \\\varepsilon_i & : \text{Random error term for district } i\end{align*} \]
Which is equivalent to the cell-means form below, with \(\beta_0 = \mu_1\), \(\beta_1 = \mu_2 - \mu_1\), and \(\beta_2 = \mu_3 - \mu_1\) :
\[ y_i = \mu_1 x_{i1} + \mu_2 x_{i2} + \mu_3 x_{i3} + \varepsilon_i = \beta_0 + \beta_1 x_{i2} + \beta_2 x_{i3} + \varepsilon_i \]
\[ \begin{align*}y_i & : \text{Average test score for district } i \\x_{i1}, x_{i2}, x_{i3} & : \text{Indicator variables for district size: small, medium, large} \\\mu_1, \mu_2, \mu_3 & : \text{Mean test scores for small, medium, and large districts, respectively} \\\varepsilon_i & : \text{Random error in test score for district } i \\\beta_0 & : \text{Mean test score for small districts (baseline group)} \\\beta_1 & : \text{Difference in mean score between medium and small districts} \\\beta_2 & : \text{Difference in mean score between large and small districts} \\\end{align*} \]
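To see the indicator coding R builds for us, a minimal sketch (small districts form the dropped baseline level) :
# design matrix : intercept plus medium/large indicators
head(model.matrix(~ `size(coded)`, data = iowa_std))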
mdl <- lm(avr~`size(coded)`, data=iowa_std)
summary(mdl)
##
## Call:
## lm(formula = avr ~ `size(coded)`, data = iowa_std)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.2652 -4.5152 0.3901 4.8598 20.4848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.0152 0.4840 150.863 <2e-16 ***
## `size(coded)`2 0.5948 1.0012 0.594 0.553
## `size(coded)`3 2.5525 1.3306 1.918 0.056 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.227 on 322 degrees of freedom
## Multiple R-squared: 0.01151, Adjusted R-squared: 0.005365
## F-statistic: 1.874 on 2 and 322 DF, p-value: 0.1552
Notice these are the same results as our one-way ANOVA. Let's take a closer look at the F-test results :
summary(aov(avr~`size(coded)`, data=iowa_std))
## Df Sum Sq Mean Sq F value Pr(>F)
## `size(coded)` 2 196 97.88 1.874 0.155
## Residuals 322 16820 52.24
levels(iowa_std$`size(coded)`) <- c("Small", "Medium", "Large")
boxplot(avr ~ `size(coded)`, data = iowa_std,
main = "Average Test Scores by District Size",
xlab = "District Size",
ylab = "Average Test Score",
col = c("skyblue", "lightgreen", "lightpink"))
abline(h = mean(iowa_std$avr, na.rm = TRUE), col = "red", lty = 2)
Our model takes small districts as the baseline and estimates the difference each indicator (Medium, Large) makes. It appears large districts tend to outperform the other districts, though the differences are not statistically significant at the 5% level (F-test p = 0.155) :
iowa_std |> group_by(`size(coded)`) |> summarize(avr_score_per_size = round(mean(avr), 2)) |> arrange(desc(avr_score_per_size))
This claim is supported by a quick glance at grouped summary statistics.
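For formal pairwise comparisons rather than a glance at group means, a minimal sketch using Tukey's HSD on the same one-way ANOVA; given the overall F-test (p = 0.155), we would expect none of the pairwise differences to reach significance :
# pairwise group comparisons with family-wise error control
TukeyHSD(aov(avr ~ `size(coded)`, data = iowa_std))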