1 Introduction

The data are from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999. The data set contains information on test performance, school characteristics and student demographic backgrounds for school districts.

Test scores are from the Stanford 9 standardized test administered to 5th grade students. School characteristics (averaged across the district) include enrollment, number of teachers (measured as “full-time equivalents”), number of computers per classroom, and expenditures per student. Demographic variables for the students are also averaged across the district.

A data frame with 420 observations on the following 14 variables.

  • district: character. District code.
  • school: character. School name.
  • county: factor indicating county.
  • grades: factor indicating grade span of district.
  • students: Total enrollment.
  • teachers: Number of teachers.
  • calworks: Percent qualifying for CalWorks (income assistance).
  • lunch: Percent qualifying for reduced-price lunch.
  • computer: Number of computers.
  • expenditure: Expenditure per student.
  • income: District average income (in USD 1,000).
  • english: Percent of English learners.
  • read: Average reading score.
  • math: Average math score.

2 Data

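library(dplyr)    # glimpse(), mutate(), add_count()
library(ggplot2)  # plots in Section 3
# load the California schools data from the AER package
data(CASchools, package="AER")
dta <- CASchools
# have a look
glimpse(dta)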
Rows: 420
Columns: 14
$ district    <chr> "75119", "61499", "61549", "61457", "61523", "62042", "685…
$ school      <chr> "Sunol Glen Unified", "Manzanita Elementary", "Thermalito …
$ county      <fct> Alameda, Butte, Butte, Butte, Butte, Fresno, San Joaquin, …
$ grades      <fct> KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK-08, KK…
$ students    <dbl> 195, 240, 1550, 243, 1335, 137, 195, 888, 379, 2247, 446, …
$ teachers    <dbl> 10.90, 11.15, 82.90, 14.00, 71.50, 6.40, 10.00, 42.50, 19.…
$ calworks    <dbl> 0.5102, 15.4167, 55.0323, 36.4754, 33.1086, 12.3188, 12.90…
$ lunch       <dbl> 2.0408, 47.9167, 76.3226, 77.0492, 78.4270, 86.9565, 94.62…
$ computer    <dbl> 67, 101, 169, 85, 171, 25, 28, 66, 35, 0, 86, 56, 25, 0, 3…
$ expenditure <dbl> 6384.91, 5099.38, 5501.95, 7101.83, 5235.99, 5580.15, 5253…
$ income      <dbl> 22.69000, 9.82400, 8.97800, 8.97800, 9.08033, 10.41500, 6.…
$ english     <dbl> 0.00000, 4.58333, 30.00000, 0.00000, 13.85768, 12.40876, 6…
$ read        <dbl> 691.6, 660.5, 636.3, 651.9, 641.8, 605.7, 604.5, 605.5, 60…
$ math        <dbl> 690.0, 661.9, 650.9, 643.5, 639.9, 605.4, 609.0, 612.5, 61…

We compute the student-teacher ratio and a score that is the average of the math and reading scores. The number of schools (rows) in each county is then counted and added to the data frame as the column n.

dta <- dta |> mutate(stratio = students/teachers,
                     score = (math+read)/2) |>
  add_count(county)
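
As a quick check on the count column, the sketch below lists the counties that pass the n > 10 filter used in the later sections. It assumes add_count() named the new column n (its default) and rebuilds dta so the block runs on its own; big_counties is an illustrative name.

```r
library(dplyr)

# Rebuild the working data frame so this block is self-contained
data(CASchools, package = "AER")
dta <- CASchools |>
  mutate(stratio = students / teachers,
         score = (math + read) / 2) |>
  add_count(county)   # adds column n = number of rows per county

# One row per county, keeping only counties with more than 10 rows
big_counties <- dta |>
  distinct(county, n) |>
  filter(n > 10) |>
  arrange(desc(n))

big_counties
```

The result should contain the 14 counties that appear in the per-county regressions of Section 4.1, covering 251 rows in total.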

3 Visualization

# Pooled OLS regression line (group=1) for counties with more than 10 schools
ggplot(subset(dta, n > 10), aes(x=stratio, y=score, group=county)) +
  geom_point(alpha=0.5) +
  stat_smooth(aes(group=1),
              method="lm", formula=y ~ x, se=FALSE) +
  labs(x="Student-teacher ratio",
       y="Score (average of math and reading)") +
  theme_minimal()

3.1 Problem

# Same plot with lunch on the x-axis: points colored by county,
# one pooled OLS regression line (group=1) with its confidence band
ggplot(subset(dta, n > 10), aes(x=lunch, y=score, group=county, color=county)) +
  geom_point(alpha=0.5) +
  stat_smooth(aes(group=1),
              method="lm", formula=y ~ x, se=TRUE) +
  labs(x="Percent qualifying for reduced-price lunch",
       y="Score (average of math and reading)") +
  theme_minimal()

  • Replace stratio with lunch in the plot from Section 3.
  • Graph each county (with more than 10 schools) in its own panel of a single plot, with one regression line per county.
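
One possible approach to the second item, offered as a sketch rather than the intended solution: facet_wrap() puts each county in its own panel, and dropping aes(group=1) lets stat_smooth() fit a separate line within each panel. The object name p and the rebuilt dta are assumptions made so the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Rebuild the working data frame so this block is self-contained
data(CASchools, package = "AER")
dta <- CASchools |>
  mutate(stratio = students / teachers,
         score = (math + read) / 2) |>
  add_count(county)

# One panel per county; every point in a panel belongs to one county,
# so stat_smooth() fits one OLS line per panel
p <- ggplot(subset(dta, n > 10), aes(x = lunch, y = score)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ county) +
  labs(x = "Percent qualifying for reduced-price lunch",
       y = "Score (average of math and reading)") +
  theme_minimal()

p
```

By default facet_wrap() shares the axes across panels, which keeps the slopes visually comparable between counties.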

4 Simple regression

summary(lm(score ~ stratio, data=subset(dta, n > 10)))

Call:
lm(formula = score ~ stratio, data = subset(dta, n > 10))

Residuals:
   Min     1Q Median     3Q    Max 
-44.44 -14.23   0.42  13.92  47.84 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  709.472     13.324    53.2  < 2e-16
stratio       -2.830      0.673    -4.2  3.7e-05

Residual standard error: 19.5 on 249 degrees of freedom
Multiple R-squared:  0.0663,    Adjusted R-squared:  0.0625 
F-statistic: 17.7 on 1 and 249 DF,  p-value: 3.66e-05

4.1 Problem

nlme::lmList(score ~ stratio | county, data=subset(dta, n > 10))
Call:
  Model: score ~ stratio | county 
   Data: subset(dta, n > 10) 

Coefficients:
              (Intercept)   stratio
Fresno            663.794 -1.460448
Humboldt          573.106  4.652889
Kern              590.222  2.458576
Los Angeles       758.799 -5.315277
Merced            672.068 -1.795721
Orange            662.864 -0.403029
Placer            655.978  0.491432
San Diego         737.238 -3.877054
San Mateo         813.695 -7.901506
Santa Barbara     704.248 -1.903107
Santa Clara       822.429 -7.921588
Shasta            649.677  0.478515
Sonoma            673.143 -0.465782
Tulare            631.051  0.371337

Degrees of freedom: 251 total; 223 residual
Residual standard error: 15.9413

Regress score on the student-teacher ratio separately for each county. Compare the average of the estimated county slopes with the slope from the overall regression that ignores county clusters.
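
The comparison can be sketched as follows, assuming the unweighted mean of the county slopes is the intended average (a mean weighted by county size would be an equally defensible choice). The names dta_big, fits, mean_slope and pooled_slope are illustrative.

```r
library(dplyr)

# Rebuild the working data frame so this block is self-contained
data(CASchools, package = "AER")
dta <- CASchools |>
  mutate(stratio = students / teachers,
         score = (math + read) / 2) |>
  add_count(county)
dta_big <- subset(dta, n > 10)

# Separate OLS fit per county; coef() returns one row per county
fits <- nlme::lmList(score ~ stratio | county, data = dta_big)
mean_slope <- mean(coef(fits)$stratio)

# Pooled fit that ignores county membership
pooled_slope <- coef(lm(score ~ stratio, data = dta_big))[["stratio"]]

c(mean_county_slope = mean_slope, pooled_slope = pooled_slope)
```

With the coefficients printed above, the unweighted county average is about -1.61, against -2.83 from the pooled regression. This suggests that part of the pooled slope reflects differences between counties rather than the relationship within them.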

5 References

Stock, J.H. & Watson, M.W. (2007). Introduction to Econometrics. 2nd Ed. Boston: Addison Wesley.