1 Introduction

The data are from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999. The data set contains information on test performance, school characteristics and student demographic backgrounds for school districts.

Test scores are from the Stanford 9 standardized test administered to 5th grade students. School characteristics (averaged across the district) include enrollment, number of teachers (measured as “full-time equivalents”), number of computers per classroom, and expenditures per student. Demographic variables for the students are averaged across the district.

A data frame with 420 observations on the following 14 variables.

district: character. District code.
school: character. School name.
county: factor indicating county.
grades: factor indicating grade span of district.
students: Total enrollment.
teachers: Number of teachers.
calworks: Percent qualifying for CalWorks (income assistance).
lunch: Percent qualifying for reduced-price lunch.
computer: Number of computers.
expenditure: Expenditure per student.
income: District average income (in USD 1,000).
english: Percent of English learners.
read: Average reading score.
math: Average math score.

2 Data

library(dplyr)    # glimpse(), mutate(), add_count()
library(ggplot2)  # ggplot() and friends
data(CASchools, package="AER")
dta <- CASchools

# have a look
glimpse(dta)

Rows: 420
Columns: 14
$ district    <chr> "75119", "61499", "61549", "61457", "61523", "62042", "685…
$ school      <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito …
$ county      <fct> Alameda, Butte, Butte, Butte, Butte, Fresno, San Joaquin, …
$ grades      <fct> KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK…
$ students    <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, …
$ teachers    <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.…
$ calworks    <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.90…
$ lunch       <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.62…
$ computer    <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 3…
$ expenditure <dbl> 6384.91, 5099.38, 5501.95, 7101.83, 5235.99, 5580.15, 5253…
$ income      <dbl> 22.69000, 9.82400, 8.97800, 8.97800, 9.08033, 10.41500, 6.…
$ english     <dbl> 0.00000, 4.58333, 30.00000, 0.00000, 13.85768, 12.40876, 6…
$ read        <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 60…
$ math        <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 61…

We compute the student-teacher ratio and a score, which is the average of the math and reading scores. The number of observations (schools) in each county is then counted and appended to the data frame as column n.

dta <- dta |> mutate(stratio = students/teachers,
                     score = (math+read)/2) |>
  add_count(county)
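
To make the derived columns concrete, here is a minimal base-R sketch on a made-up three-row data frame. The `toy` object and its values are hypothetical and not taken from CASchools, and `ave()` is used only as a stand-in for `add_count()`: it attaches the per-county row count `n` to every row.

```r
# Hypothetical toy data, not from CASchools
toy <- data.frame(
  county   = c("A", "A", "B"),
  students = c(200, 300, 150),
  teachers = c(10, 20, 10),
  math     = c(650, 660, 640),
  read     = c(654, 664, 650)
)

# Same derived variables as in the pipeline above
toy$stratio <- toy$students / toy$teachers   # student-teacher ratio
toy$score   <- (toy$math + toy$read) / 2     # average of math and reading

# Per-county row count repeated on every row (the role of column n)
toy$n <- ave(seq_along(toy$county), toy$county, FUN = length)

toy[, c("county", "stratio", "score", "n")]
```

Because `n` is repeated on every row of a county, a filter such as `subset(dta, n > 10)` keeps or drops whole counties at once.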

3 Visualization

# One pooled OLS regression line (group=1) across counties with more than 10 schools
ggplot(subset(dta, n > 10), aes(x=stratio, y=score, group=county)) +
  geom_point(alpha=0.5) +
  stat_smooth(aes(group=1),
              method="lm", formula=y ~ x, se=FALSE) +
  labs(x="Student-teacher ratio",
       y="Score (average of math and reading)") +
  theme_minimal()

3.1 Problem

Replace stratio with lunch in the above plot.

# One pooled OLS regression line (group=1) across counties with more than 10 schools
ggplot(subset(dta, n > 10), aes(x=lunch, y=score, group=county)) +
  geom_point(alpha=0.5) +
  stat_smooth(aes(group=1), method="lm", formula=y ~ x, se=FALSE) +
  labs(x="Percent qualifying for reduced-price lunch",
       y="Score (average of math and reading)") +
  theme_minimal()

Graph each county (with more than 10 schools) in its own panel of a single plot, with a separate regression line for each county.

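# Fit one regression per county; the slopes are used only to order the panels.
# Note: this orders counties by the slope of lunch on score.
m1 <- nlme::lmList(lunch ~ score | county,
                   data=dta)
dta$county <- factor(dta$county,
                   levels(dta$county)[order(coef(m1)[,2])])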

ggplot(subset(dta, n > 10),
       aes(x=lunch, 
           y=score)) +
  geom_point(alpha=.5)+
  facet_grid(. ~ county) + 
  stat_smooth(method='lm', 
              formula=y~x, 
              se=F,
              col=1,
              lwd=rel(.8)) +
  theme_minimal()
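
The reordering step above sorts the factor levels by a per-group slope so that the panels appear in slope order. A minimal base-R sketch of the same idea, using hypothetical toy data with known slopes (the names `toy`, `g`, and `slopes_true` are illustrative only, and `split()` plus `lm()` stands in for `lmList()`):

```r
# Hypothetical toy data: three groups whose slopes are known by construction
toy <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 3)),
  x = rep(1:3, times = 3)
)
slopes_true <- c(a = 2, b = -1, c = 0.5)
toy$y <- 10 + unname(slopes_true[as.character(toy$g)]) * toy$x

# One lm per group; keep only the slope on x
slopes <- sapply(split(toy, toy$g),
                 function(d) coef(lm(y ~ x, data = d))[["x"]])

# Reorder the factor levels from most negative to most positive slope
toy$g <- factor(toy$g, levels = names(sort(slopes)))
levels(toy$g)
```

Faceting on the reordered factor then lays out the panels in that order, since ggplot2 follows the factor's level order.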

4 Simple regression

summary(lm(score ~ stratio, data=subset(dta, n > 10)))


Call:
lm(formula = score ~ stratio, data = subset(dta, n > 10))

Residuals:
   Min     1Q Median     3Q    Max 
-44.44 -14.23   0.42  13.92  47.84 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  709.472     13.324    53.2  < 2e-16
stratio       -2.830      0.673    -4.2  3.7e-05

Residual standard error: 19.5 on 249 degrees of freedom
Multiple R-squared:  0.0663,    Adjusted R-squared:  0.0625 
F-statistic: 17.7 on 1 and 249 DF,  p-value: 3.66e-05

4.1 Problem

Regress score on the student-teacher ratio separately for each county. Compare the average of the estimated slope coefficients with the slope estimate from the overall regression that ignores county clusters.

coef(m1 <- nlme::lmList(score ~ stratio | county, data = subset(dta, n > 10)))

              (Intercept)   stratio
Orange            662.864 -0.403029
Humboldt          573.106  4.652889
Los Angeles       758.799 -5.315277
Kern              590.222  2.458576
Shasta            649.677  0.478515
Santa Barbara     704.248 -1.903107
San Diego         737.238 -3.877054
Sonoma            673.143 -0.465782
Santa Clara       822.429 -7.921588
Tulare            631.051  0.371337
Merced            672.068 -1.795721
Fresno            663.794 -1.460448
San Mateo         813.695 -7.901506
Placer            655.978  0.491432

coef(m1)[,2] |> mean()

[1] -1.61363

summary(lm(score ~ stratio, data=dta))


Call:
lm(formula = score ~ stratio, data = dta)

Residuals:
   Min     1Q Median     3Q    Max 
-47.73 -14.25   0.48  12.82  48.54 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   698.93       9.47   73.82  < 2e-16
stratio        -2.28       0.48   -4.75  2.8e-06

Residual standard error: 18.6 on 418 degrees of freedom
Multiple R-squared:  0.0512,    Adjusted R-squared:  0.049 
F-statistic: 22.6 on 1 and 418 DF,  p-value: 2.78e-06

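The average of the county-level slope estimates is -1.61363: within a county, each one-unit increase in the student-teacher ratio is associated with an average drop of about 1.61 points in the test score. Ignoring counties, the estimated slope of score on student-teacher ratio is -2.28, a drop of 2.28 points per one-unit increase. Using the pooled coefficient to draw inferences about all districts therefore implies a larger change in test scores, because it ignores the similarity of districts within the same county. Note that the county-level average covers only the 14 counties with more than 10 schools, whereas the pooled fit above uses all 420 districts; the pooled fit restricted to the same 14 counties (Section 4) gives -2.830.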
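
One way a pooled slope can be steeper than the average within-group slope is when groups differ in both their typical x and their typical y. The following base-R sketch illustrates this mechanism on hypothetical toy data; it is not a claim about what drives the CASchools estimates, and all names and values are made up for illustration.

```r
# Hypothetical toy data: both groups have a within-group slope of -1,
# but the group with larger x values also has a lower intercept
toy <- data.frame(
  g = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 4, 5, 6)
)
toy$y <- ifelse(toy$g == "a", 20, 14) - 1 * toy$x

# Slope from one regression per group, then averaged
within <- sapply(split(toy, toy$g),
                 function(d) coef(lm(y ~ x, data = d))[["x"]])
mean_within <- mean(within)

# Slope from a single pooled regression that ignores the grouping
pooled <- coef(lm(y ~ x, data = toy))[["x"]]

c(mean_within = mean_within, pooled = pooled)
```

In this toy case both within-group slopes are -1, while the pooled slope is about -2.54, because the pooled fit also absorbs the difference between the two groups.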

5 References

Stock, J.H. & Watson, M.W. (2007). Introduction to Econometrics. 2nd Ed. Boston: Addison Wesley.

California Test Score Data

Huang En-Li