Install the UsingR package first: “Packages” panel -> “Install Packages”. Then load it:
library(UsingR)
The data consists of SAT scores for the 50 states in 1994-1995. The outcome variable is the average total SAT score, and we consider 3 predictors:

* salary: estimated average annual salary of teachers in public elementary and secondary schools (in $1000 of USD)
* expend: expenditure per pupil in average daily attendance in public elementary and secondary schools (in $1000 of USD)
* ratio: average pupil/teacher ratio in public elementary and secondary schools
We load the SAT data, consider the help file, and as always look at the data.
data(SAT)
head(SAT)
## state expend ratio salary perc verbal math total
## 1 Alabama 4.405 17.2 31.144 8 491 538 1029
## 2 Alaska 8.963 17.6 47.951 47 445 489 934
## 3 Arizona 4.778 19.3 32.175 27 448 496 944
## 4 Arkansas 4.459 17.1 28.934 6 482 523 1005
## 5 California 4.992 24.0 41.078 45 417 485 902
## 6 Colorado 5.443 18.4 34.571 29 462 518 980
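Before fitting anything, a quick numeric summary of the outcome and the three predictors is also worth a look (a small sketch; output omitted here):
# Five-number summaries and means for the variables we will model
summary(SAT[, c("total", "salary", "expend", "ratio")])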
We fit 3 regression models where, on top of ratio, we include either:

* both salary and expend,
* only expend, or
* only salary.
model1 <- lm(total ~ salary + expend + ratio, data=SAT)
model2 <- lm(total ~ expend + ratio, data=SAT)
model3 <- lm(total ~ salary + ratio, data=SAT)
round(summary(model1)$coefficients, 3)[,c(1,2,4)]
## Estimate Std. Error Pr(>|t|)
## (Intercept) 1069.234 110.925 0.000
## salary -8.823 4.697 0.067
## expend 16.469 22.050 0.459
## ratio 6.330 6.542 0.338
round(summary(model2)$coefficients, 3)[,c(1,2,4)]
## Estimate Std. Error Pr(>|t|)
## (Intercept) 1136.336 107.803 0.000
## expend -22.308 7.956 0.007
## ratio -2.295 4.784 0.634
round(summary(model3)$coefficients, 3)[,c(1,2,4)]
## Estimate Std. Error Pr(>|t|)
## (Intercept) 1113.877 93.003 0.000
## salary -5.538 1.643 0.002
## ratio 2.666 4.307 0.539
Observe the behavior of expend. When included together with salary it is nowhere near significant (p = 0.459), but on its own it looks highly significant (p = 0.007), and its estimated coefficient even flips sign.
This is because the two variables salary and expend are highly correlated, so we have collinear, and in a sense redundant, predictors. When both are in the model they “steal” each other’s effect, and it becomes difficult to isolate the effect of one versus the other.
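One way to quantify this redundancy is the variance inflation factor, 1/(1 - R^2), where R^2 comes from regressing one predictor on the others (a base-R sketch; the vif() function in the car package computes the same quantities directly):
# VIF for salary: how well the other predictors explain it
r2.salary <- summary(lm(salary ~ expend + ratio, data=SAT))$r.squared
1 / (1 - r2.salary)
# VIF for expend
r2.expend <- summary(lm(expend ~ salary + ratio, data=SAT))$r.squared
1 / (1 - r2.expend)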
# Compute correlation coefficient
r <- cor(SAT$salary, SAT$expend)
# Plot scatterplot
plot(SAT$salary, SAT$expend, xlab="salary", ylab="expenditure",
pch=19, type='n', main="Salary vs Expend")
text(SAT$salary, SAT$expend, labels=state.abb)
legend("topleft", legend = paste("R = ", round(r, 3), sep=""), lty=0, col=1, bty='n')
abline(lm(expend ~ salary, data=SAT), lwd=2, lty=2)
We run a principal components analysis on the two variables, plot the mean-centered data with the principal component directions overlaid, and compare this to the data expressed in the new space defined by the principal components.
# Compute principal components using princomp() function
d = SAT[, c("salary", "expend")]
rownames(d) <- state.abb
pca <- princomp( ~ salary + expend, data=d)
# Mean-adjusted values
d$x_adj <- d$salary - mean(d$salary)
d$y_adj <- d$expend - mean(d$expend)
# Calculate covariance matrix and eigenvectors/values
cm <- cov(d[,1:2])
e <- eigen(cm)
# Slopes of the principal component directions (expend against salary)
s1 <- e$vectors[2,1] / e$vectors[1,1] # slope of PC1
s2 <- e$vectors[2,2] / e$vectors[1,2] # slope of PC2
# Plot
par(mfrow=c(1,2))
# In original dimensions: salary and expend
plot(d$x_adj, d$y_adj, pch=16, xlab="salary", asp=1, ylab="expend", type='n')
text(d$x_adj, d$y_adj, labels=state.abb, cex=0.75)
abline(a=0, b=s1, col='black') # PC 1
abline(a=0, b=s2, col='red') # PC 2
legend("topright", legend=c("PC 1", "PC 2"), lty=c(1,1), col=c(1,2), bty='n')
# In new dimension space
biplot(pca, xlab="Principal Component 1", ylab="Principal Component 2", cex=0.75)
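As a quick check (sketch; output omitted), we can look at how much of the joint variation in salary and expend each component captures:
# Proportion of total variance explained by each principal component
pca$sdev^2 / sum(pca$sdev^2)
summary(pca)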
We take each observation’s score on the 1st principal component and define these values as a new variable school.funding. Its units unfortunately don’t make much sense, but it somehow captures the concept of the state allocating a lot of financial resources to schools. Also, we need to take the negative of the scores, as it seems princomp() inverted them. We then plot a scatterplot of its relationship with salary.
# Create new variable
SAT$school.funding <- pca$scores[,1]
SAT$school.funding <- -1 * SAT$school.funding
r <- cor(SAT$school.funding, SAT$salary)
plot(SAT$salary, SAT$school.funding, xlab="salary", ylab="school funding", type='n')
text(SAT$salary, SAT$school.funding, labels=state.abb)
legend("topleft", legend = paste("R = ", round(r, 3), sep=""), lty=0, col=1, bty='n')
# Histogram of scores
hist(SAT$school.funding, xlab="school funding in #$%@^@!% units", main="School Funding")
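The sign flip above can be sanity-checked against the loadings: if princomp() happened to return both loadings on the first component as negative, larger raw scores would mean less funding, which is why we multiplied by -1 (a quick sketch; output omitted):
# Loadings of the first component on salary and expend (sign is arbitrary)
pca$loadings[, 1]
# After flipping, school.funding should correlate positively with both inputs
cor(SAT$school.funding, SAT$expend)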
However, we do know that as it gets larger, it is associated with lower SAT scores. Why is that? That doesn’t make sense.
r <- cor(SAT$school.funding, SAT$total)
plot(SAT$school.funding, SAT$total, xlab="school funding", ylab="total score",
pch=19, type='n', main="School Funding vs SAT Score")
text(SAT$school.funding, SAT$total, labels=state.abb)
legend("topright", legend = paste("R = ", round(r, 3), sep=""), lty=0, col=1, bty='n')
abline(lm(total ~ school.funding, data=SAT), lwd=2, lty=2)
We fit a new regression and look at the results. Unfortunately, the coefficient for school.funding is hard to interpret, since we don’t understand its units. Surprisingly, the diagnostics look good!
model.new <- lm(total ~ school.funding + ratio, data=SAT)
round(summary(model.new)$coefficients, 3)[,c(1,2,4)]
## Estimate Std. Error Pr(>|t|)
## (Intercept) 924.998 73.319 0.000
## school.funding -5.413 1.612 0.002
## ratio 2.427 4.311 0.576
plot(model.new, 1)
plot(model.new, 2)
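To see how the principal-component model compares with the earlier collinear fits, one simple option (a sketch, not part of the original analysis) is an information-criterion comparison:
# Compare the four fits; lower AIC suggests a better fit/complexity trade-off
AIC(model1, model2, model3, model.new)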
Why did ND, IA, MN do so well and TX, NY, FL do so poorly?
plot(SAT$perc, SAT$total, xlab="percent of students taking SAT", ylab="Avg SAT score",
main="Who's taking the SAT's?", type='n')
text(SAT$perc, SAT$total, labels=state.abb)
In states like ND, IA, and UT, a much smaller fraction of students take the SAT, and those who do tend to be the best of the best, whereas in MA, CT, and NY many more students take it, which pulls the average down.
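One way to follow up on this (a sketch, not part of the original analysis) is to add perc as a predictor, so that states are compared at similar participation rates:
# Adjust the school.funding model for the percentage of students taking the SAT
model.adj <- lm(total ~ school.funding + ratio + perc, data=SAT)
round(summary(model.adj)$coefficients, 3)[, c(1, 2, 4)]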