Install the UsingR package first: “Packages” panel -> “Install Packages”. Then load it:
library(UsingR)
The data consists of SAT scores for the 50 states in 1994-1995. The outcome variable is the average total SAT score, and we consider 3 predictors:

* salary: estimated average annual salary of teachers in public elementary and secondary schools (in $1000 of USD)
* expend: expenditure per pupil in average daily attendance in public elementary and secondary schools (in $1000 of USD)
* ratio: average pupil/teacher ratio in public elementary and secondary schools
We load the SAT data, consider the help file, and as always look at the data.
data(SAT)
head(SAT)
## state expend ratio salary perc verbal math total
## 1 Alabama 4.405 17.2 31.144 8 491 538 1029
## 2 Alaska 8.963 17.6 47.951 47 445 489 934
## 3 Arizona 4.778 19.3 32.175 27 448 496 944
## 4 Arkansas 4.459 17.1 28.934 6 482 523 1005
## 5 California 4.992 24.0 41.078 45 417 485 902
## 6 Colorado 5.443 18.4 34.571 29 462 518 980
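Before fitting anything, a quick numeric summary of the outcome and the three predictors is also worth a look (a small sketch; output omitted here):
# Five-number summaries and means for the variables we will model
summary(SAT[, c("total", "salary", "expend", "ratio")])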
We fit 3 regression models where, on top of ratio, we include either:

* both salary and expend,
* only expend, or
* only salary.
model1 <- lm(total ~ salary + expend + ratio, data=SAT)
model2 <- lm(total ~ expend + ratio, data=SAT)
model3 <- lm(total ~ salary + ratio, data=SAT)
round(summary(model1)$coefficients, 3)[,c(1,2,4)]
## Estimate Std. Error Pr(>|t|)
## (Intercept) 1069.234 110.925 0.000
## salary -8.823 4.697 0.067
## expend 16.469 22.050 0.459
## ratio 6.330 6.542 0.338
round(summary(model2)$coefficients, 3)[,c(1,2,4)]
## Estimate Std. Error Pr(>|t|)
## (Intercept) 1136.336 107.803 0.000
## expend -22.308 7.956 0.007
## ratio -2.295 4.784 0.634
round(summary(model3)$coefficients, 3)[,c(1,2,4)]
## Estimate Std. Error Pr(>|t|)
## (Intercept) 1113.877 93.003 0.000
## salary -5.538 1.643 0.002
## ratio 2.666 4.307 0.539
Observe the behavior of expend. When included together with salary it is nowhere near significant (p = 0.459), but on its own it looks highly significant (p = 0.007), and its estimated coefficient even flips sign.
This is because the two variables salary and expend are highly correlated, so we have collinear, and in a sense redundant, predictors. When both are in the model they “steal” each other’s effect, and it becomes difficult to isolate the effect of one versus the other.
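One way to quantify this redundancy is the variance inflation factor, 1/(1 - R^2), where R^2 comes from regressing one predictor on the others (a base-R sketch; the vif() function in the car package computes the same quantities directly):
# VIF for salary: how well the other predictors explain it
r2.salary <- summary(lm(salary ~ expend + ratio, data=SAT))$r.squared
1 / (1 - r2.salary)
# VIF for expend
r2.expend <- summary(lm(expend ~ salary + ratio, data=SAT))$r.squared
1 / (1 - r2.expend)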
# Compute correlation coefficient
r <- cor(SAT$salary, SAT$expend)
# Plot scatterplot
plot(SAT$salary, SAT$expend, xlab="salary", ylab="expenditure",
pch=19, type='n', main="Salary vs Expend")
text(SAT$salary, SAT$expend, labels=state.abb)
legend("topleft", legend = paste("R = ", round(r, 3), sep=""), lty=0, col=1, bty='n')
abline(lm(expend ~ salary, data=SAT), lwd=2, lty=2)
We run a principal components analysis on the two variables, plot the mean-centered data with the principal component directions overlaid, and compare this to the data expressed in the new space defined by the principal components.
# Compute principal components using princomp() function
d = SAT[, c("salary", "expend")]
rownames(d) <- state.abb
pca <- princomp( ~ salary + expend, data=d)
# Mean-adjusted values
d$x_adj <- d$salary - mean(d$salary)
d$y_adj <- d$expend - mean(d$expend)
# Calculate covariance matrix and eigenvectors/values
cm <- cov(d[,1:2])
e <- eigen(cm)
# Slopes of the principal component directions (expend against salary)
s1 <- e$vectors[2,1] / e$vectors[1,1] # slope of PC1
s2 <- e$vectors[2,2] / e$vectors[1,2] # slope of PC2
# Plot
par(mfrow=c(1,2))
# In original dimensions: salary and expend
plot(d$x_adj, d$y_adj, pch=16, xlab="salary", asp=1, ylab="expend", type='n')
text(d$x_adj, d$y_adj, labels=state.abb, cex=0.75)
abline(a=0, b=s1, col='black') # PC 1
abline(a=0, b=s2, col='red') # PC 2
legend("topright", legend=c("PC 1", "PC 2"), lty=c(1,1), col=c(1,2), bty='n')
# In new dimension space
biplot(pca, xlab="Principal Component 1", ylab="Principal Component 2", cex=0.75)
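As a quick check (sketch; output omitted), we can look at how much of the joint variation in salary and expend each component captures:
# Proportion of total variance explained by each principal component
pca$sdev^2 / sum(pca$sdev^2)
summary(pca)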
We take each observation’s score on the 1st principal component and define these values as a new variable school.funding. Its units unfortunately don’t make much sense, but it somehow captures the concept of the state allocating a lot of financial resources to schools. Also, we need to take the negative of the scores, as it seems princomp() inverted them. We then plot a scatterplot of its relationship with salary.
# Create new variable
SAT$school.funding <- pca$scores[,1]
SAT$school.funding <- -1 * SAT$school.funding
r <- cor(SAT$school.funding, SAT$salary)
plot(SAT$salary, SAT$school.funding, xlab="salary", ylab="school funding", type='n')
text(SAT$salary, SAT$school.funding, labels=state.abb)
legend("topleft", legend = paste("R = ", round(r, 3), sep=""), lty=0, col=1, bty='n')
# Histogram of scores
hist(SAT$school.funding, xlab="school funding in #$%@^@!% units", main="School Funding")
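The sign flip above can be sanity-checked against the loadings: if princomp() happened to return both loadings on the first component as negative, larger raw scores would mean less funding, which is why we multiplied by -1 (a quick sketch; output omitted):
# Loadings of the first component on salary and expend (sign is arbitrary)
pca$loadings[, 1]
# After flipping, school.funding should correlate positively with both inputs
cor(SAT$school.funding, SAT$expend)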
However, we do know that as it gets larger, it is associated with lower SAT scores. Why is that? That doesn’t make sense.
r <- cor(SAT$school.funding, SAT$total)
plot(SAT$school.funding, SAT$total, xlab="school funding", ylab="total score",
pch=19, type='n', main="School Funding vs SAT Score")
text(SAT$school.funding, SAT$total, labels=state.abb)
legend("topright", legend = paste("R = ", round(r, 3), sep=""), lty=0, col=1, bty='n')
abline(lm(total ~ school.funding, data=SAT), lwd=2, lty=2)
We fit a new regression and look at the results. Unfortunately, the coefficient for school.funding is hard to interpret, since we don’t understand its units. Surprisingly, the diagnostics look good!
model.new <- lm(total ~ school.funding + ratio, data=SAT)
round(summary(model.new)$coefficients, 3)[,c(1,2,4)]
## Estimate Std. Error Pr(>|t|)
## (Intercept) 924.998 73.319 0.000
## school.funding -5.413 1.612 0.002
## ratio 2.427 4.311 0.576
plot(model.new, 1)
plot(model.new, 2)
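To see how the principal-component model compares with the earlier collinear fits, one simple option (a sketch, not part of the original analysis) is an information-criterion comparison:
# Compare the four fits; lower AIC suggests a better fit/complexity trade-off
AIC(model1, model2, model3, model.new)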
Why did ND, IA, MN do so well and TX, NY, FL do so poorly?
plot(SAT$perc, SAT$total, xlab="percent of students taking SAT", ylab="Avg SAT score",
main="Who's taking the SAT's?", type='n')
text(SAT$perc, SAT$total, labels=state.abb)
In states like ND, IA, and UT, a much smaller fraction of students take the SAT, and those who do tend to be the best of the best, whereas in MA, CT, and NY many more students take it, which pulls the average down.
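One way to follow up on this (a sketch, not part of the original analysis) is to add perc as a predictor, so that states are compared at similar participation rates:
# Adjust the school.funding model for the percentage of students taking the SAT
model.adj <- lm(total ~ school.funding + ratio + perc, data=SAT)
round(summary(model.adj)$coefficients, 3)[, c(1, 2, 4)]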