The data are from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999. The data set contains information on test performance, school characteristics and student demographic backgrounds for school districts.
Test scores are on the Stanford 9 standardized test administered to 5th grade students. School characteristics (averaged across the district) include enrollment, number of teachers (measured as “full-time equivalents”), number of computers per classroom, and expenditures per student. Demographic variables for the students are averaged across the district.
A data frame with 420 observations on the following 14 variables.
library(dplyr)    # glimpse(), mutate(), add_count()
library(ggplot2)  # ggplot() and friends
data(CASchools, package="AER")
dta <- CASchools
# have a look
glimpse(dta)
Rows: 420
Columns: 14
$ district <chr> "75119", "61499", "61549", "61457", "61523", "62042", "685…
$ school <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito …
$ county <fct> Alameda, Butte, Butte, Butte, Butte, Fresno, San Joaquin, …
$ grades <fct> KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK…
$ students <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, …
$ teachers <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.…
$ calworks <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.90…
$ lunch <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.62…
$ computer <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 3…
$ expenditure <dbl> 6384.91, 5099.38, 5501.95, 7101.83, 5235.99, 5580.15, 5253…
$ income <dbl> 22.69000, 9.82400, 8.97800, 8.97800, 9.08033, 10.41500, 6.…
$ english <dbl> 0.00000, 4.58333, 30.00000, 0.00000, 13.85768, 12.40876, 6…
$ read <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 60…
$ math <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 61…
We compute the student-teacher ratio and a score, defined as the average of the math and reading scores. The number of districts in each county is also counted and added to the data frame as the column n.
dta <- dta |> mutate(stratio = students/teachers,
                     score = (math+read)/2) |>
  add_count(county)
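For readers without dplyr, the same three derived columns can be sketched in base R. The data frame `toy` and its values are made up for illustration and are not part of CASchools; only the column names mirror the ones used above.

```r
# Hypothetical toy data: three districts in two counties (values are invented)
toy <- data.frame(
  county   = c("A", "A", "B"),
  students = c(200, 300, 150),
  teachers = c(10, 15, 10),
  math     = c(650, 660, 640),
  read     = c(640, 650, 650)
)

# Student-teacher ratio and the average of the two test scores
toy$stratio <- toy$students / toy$teachers
toy$score   <- (toy$math + toy$read) / 2

# Number of rows per county, repeated on every row (what add_count(county) stores in n)
toy$n <- ave(seq_along(toy$county), toy$county, FUN = length)

print(toy)
```

`ave()` applies `length` within each county and recycles the result to the original row order, which is why it can be assigned directly as a new column.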
# Pooled OLS regression line (aes(group=1)) for counties with more than 10 districts
ggplot(subset(dta, n > 10), aes(x=stratio, y=score, group=county)) +
  geom_point(alpha=0.5) +
  stat_smooth(aes(group=1),
              method="lm", formula=y ~ x, se=FALSE) +
  labs(x="Student-teacher ratio",
       y="Score (average of math and reading)") +
  theme_minimal()
# Pooled OLS regression line (aes(group=1)) for counties with more than 10 districts
ggplot(subset(dta, n > 10), aes(x=lunch, y=score, group=county)) +
  geom_point(alpha=0.5) +
  stat_smooth(aes(group=1), method="lm", formula=y ~ x, se=FALSE) +
  labs(x="lunch",
       y="Score (average of math and reading)") +
  theme_minimal()
# Fit one regression per county, then reorder the county levels by estimated slope
m1 <- lmList(lunch ~ score | county,
             data=dta)
dta$county <- factor(dta$county,
                     levels(dta$county)[c(order(coef(m1)[,2]))])
ggplot(subset(dta, n > 10),
aes(x=lunch,
y=score)) +
geom_point(alpha=.5)+
facet_grid(. ~ county) +
stat_smooth(method='lm',
formula=y~x,
se=F,
col=1,
lwd=rel(.8)) +
theme_minimal()
summary(lm(score ~ stratio, data=subset(dta, n > 10)))
Call:
lm(formula = score ~ stratio, data = subset(dta, n > 10))
Residuals:
Min 1Q Median 3Q Max
-44.44 -14.23 0.42 13.92 47.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 709.472 13.324 53.2 < 2e-16
stratio -2.830 0.673 -4.2 3.7e-05
Residual standard error: 19.5 on 249 degrees of freedom
Multiple R-squared: 0.0663, Adjusted R-squared: 0.0625
F-statistic: 17.7 on 1 and 249 DF, p-value: 3.66e-05
Regress score on student-teacher ratio separately for each county. Compare the average of the estimated slope coefficients with the slope estimate from the overall regression that ignores county clusters.
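In symbols, the comparison can be written as follows. The notation is an assumption introduced here for clarity: i indexes districts, j indexes the J counties with more than 10 districts, and hats denote OLS estimates.

```latex
\begin{align*}
\text{pooled:} \quad
  & \text{score}_{i} = \beta_{0} + \beta_{1}\,\text{stratio}_{i} + \varepsilon_{i} \\
\text{by county:} \quad
  & \text{score}_{ij} = \beta_{0j} + \beta_{1j}\,\text{stratio}_{ij} + \varepsilon_{ij},
    \qquad j = 1, \dots, J \\
\text{average slope:} \quad
  & \bar{\beta}_{1} = \frac{1}{J} \sum_{j=1}^{J} \hat{\beta}_{1j}
\end{align*}
```

The quantity compared with the pooled estimate of \beta_1 is the unweighted mean \bar{\beta}_1, so a county with few districts counts as much as a county with many.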
coef(m1 <- nlme::lmList(score ~ stratio | county, data = subset(dta, n > 10)))
(Intercept) stratio
Orange 662.864 -0.403029
Humboldt 573.106 4.652889
Los Angeles 758.799 -5.315277
Kern 590.222 2.458576
Shasta 649.677 0.478515
Santa Barbara 704.248 -1.903107
San Diego 737.238 -3.877054
Sonoma 673.143 -0.465782
Santa Clara 822.429 -7.921588
Tulare 631.051 0.371337
Merced 672.068 -1.795721
Fresno 663.794 -1.460448
San Mateo 813.695 -7.901506
Placer 655.978 0.491432
coef(m1)[,2] |> mean()
[1] -1.61363
summary(lm(score ~ stratio, data=dta))
Call:
lm(formula = score ~ stratio, data = dta)
Residuals:
Min 1Q Median 3Q Max
-47.73 -14.25 0.48 12.82 48.54
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.93 9.47 73.82 < 2e-16
stratio -2.28 0.48 -4.75 2.8e-06
Residual standard error: 18.6 on 418 degrees of freedom
Multiple R-squared: 0.0512, Adjusted R-squared: 0.049
F-statistic: 22.6 on 1 and 418 DF, p-value: 2.78e-06
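The average of the county-specific slope estimates is -1.61363: within a county, a one-unit increase in the student-teacher ratio is associated with an average decrease of 1.61363 points in test score. When counties are not distinguished, the estimated slope of test score on student-teacher ratio is -2.28, so a one-unit increase in the student-teacher ratio is associated with a decrease of 2.28 points. Restricting the pooled regression to the same counties with more than 10 districts gives an even steeper slope of -2.830. Using the overall estimate to draw inferences about all districts therefore implies a larger change in test scores, because it ignores the similarity among districts within the same county.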
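One way a pooled slope can end up steeper than the average within-group slope is when groups differ in both their typical x and their typical y. The sketch below illustrates this with simulated data only; the group count, coefficients, and noise levels are arbitrary assumptions, and it makes no claim about the mechanism at work in CASchools.

```r
# Simulated illustration (not CASchools): 5 hypothetical groups of 40 units each
set.seed(1)
n_groups <- 5
n_per    <- 40
g <- rep(seq_len(n_groups), each = n_per)

# Groups with a higher mean x also have a lower intercept,
# while the true within-group slope is -1 in every group
x <- rnorm(n_groups * n_per, mean = 18 + g, sd = 1)
y <- 700 - 8 * g - 1 * x + rnorm(n_groups * n_per, sd = 5)
sim <- data.frame(g = factor(g), x = x, y = y)

# Pooled regression that ignores the grouping
pooled_slope <- coef(lm(y ~ x, data = sim))[["x"]]

# Separate regression per group, then the unweighted mean of the slopes
group_slopes <- sapply(split(sim, sim$g),
                       function(d) coef(lm(y ~ x, data = d))[["x"]])
mean_group_slope <- mean(group_slopes)

print(c(pooled = pooled_slope, mean_within = mean_group_slope))
```

Because the pooled fit mixes between-group and within-group variation, its slope is pulled toward the steeper between-group relationship, whereas the per-group fits see only the within-group one.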
Stock, J.H. & Watson, M.W. (2007). Introduction to Econometrics. 2nd Ed. Boston: Addison Wesley.