Can Smaller Classes Close the Achievement Gap?

Abstract

This study examines whether smaller class sizes improve early reading outcomes using data from Project STAR, a randomized experiment conducted in Tennessee. Students entering kindergarten were randomly assigned to small classes, regular classes, or regular classes with a teacher’s aide. Because assignment occurred within schools, the design allows causal effects of class size to be estimated. Using mixed-effects ANOVA and regression models, this analysis evaluates the effect of class size on reading scores in kindergarten and first grade while controlling for race, free-lunch status, and school-level variation. Results show that students in small classes score higher on average than students in regular classes. The advantage is about 1.3% in kindergarten and grows to about 1.7 to 2.6% in first grade. However, socioeconomic factors such as free-lunch status remain strong predictors of student achievement. Overall, the findings suggest that smaller class sizes improve early reading outcomes but do not fully eliminate achievement gaps.

Background

This study uses data from Project STAR, a large randomized experiment conducted in Tennessee in the mid-1980s. Students entering kindergarten were randomly assigned to one of three classroom types: small classes, regular classes, or regular classes with a teacher’s aide. Random assignment occurred within schools, meaning that students in the same school were randomly placed into different classroom types. Because of this design, treatment status is independent of student background characteristics. This allows the causal effect of class size on student outcomes to be estimated.

This analysis focuses on reading scores as the primary outcome variable. Early reading ability is widely considered a foundational academic skill that supports later development in other subjects (Erbeli, 2022). For this reason, the analysis focuses on early grades, specifically kindergarten and first grade, where foundational reading skills begin to develop.

Several variables were considered to improve precision. Free-lunch status serves as a proxy for socioeconomic status. Urbanicity captures differences between inner-city, urban, suburban, and rural school environments. Race is also included because prior literature finds persistent achievement gaps between Black and White students on standardized tests, often linked to socioeconomic differences: students from lower socioeconomic backgrounds tend to perform worse on standardized tests (Domina, 2018).

School ID is included to account for differences across schools, which may differ in resources and baseline achievement levels. Accounting for school-level differences is common in education research because it absorbs unobserved school characteristics that affect student outcomes (Krueger, 1999). Because students were randomized within schools, treatment comparisons are made among students who share the same school environment. Modeling school as a random effect treats the participating schools as a sample from a larger population of schools, so conclusions about between-school variation extend to comparable schools rather than only to the schools observed. These variables also help improve the precision of the estimated treatment effects.

Questions of Interest

This study examines whether smaller class sizes improve student reading outcomes and whether they help reduce achievement gaps across socioeconomic groups.

The primary question: Does assignment to small classes causally improve reading scores?

The analysis also investigates whether socioeconomic factors are associated with reading scores. In particular, it evaluates whether students from lower socioeconomic backgrounds score lower and whether smaller classes can help. If smaller classes have a meaningful effect for disadvantaged students, reducing class size could help narrow achievement gaps in education.

Cleaning Data

To begin, we conducted an initial investigation of the STAR data set, which contains 11,602 observations and 379 variables. To allow for efficient exploratory data analysis (EDA), the raw data were transformed into long format, giving each student ID one row for each of the 2 school years. Based on our literature review and our questions of interest, we then selected the relevant variables, including school ID, race, grade, class type, city level, free lunch status, and reading scores. This resulted in a data frame with 23,204 observations and the 9 selected variables. The STAR data set contains a large number of null values, which Table 1 summarizes.

Table 1: Missing Data in Reading Score by Class Type
Class Type	Total N	Missing	Percent Missing
Small	3825	263	6.9
Regular	4778	336	7.0
Reg + Aide	4551	371	8.2
NA	10050	10050	100.0

There are 10,050 observations with a null value in the class type column, meaning that the student was not assigned to a class type in that grade. We therefore drop these observations in our EDA. Among students who were assigned to a class type, some have null values in place of their reading scores. It is important to quantify these missing values for each treatment assignment (Small Class, Regular Class, and Regular + Aide Class), because missingness that differs by treatment could bias our causal estimates. The percentages, 6.9%, 7.0%, and 8.2%, are close in value, so a roughly even share of observations is dropped from each class type and the missing entries are unlikely to bias the comparison between class types appreciably.
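The reshaping and missing-value tabulation described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the toy records and the reading-score column names (`gkread`, `g1read`) are hypothetical, while `gkclasstype` and `g1classtype` follow the variable names that appear in the model output later in this report. The last lines recompute the percentages from the counts reported in Table 1.

```python
from collections import defaultdict

# Hypothetical wide-format records: one row per student, one set of columns per grade.
wide = [
    {"id": 1, "gkclasstype": "SMALL CLASS", "gkread": 440,
     "g1classtype": "SMALL CLASS", "g1read": 530},
    {"id": 2, "gkclasstype": "REGULAR CLASS", "gkread": None,
     "g1classtype": "REGULAR CLASS", "g1read": 512},
    {"id": 3, "gkclasstype": None, "gkread": None,
     "g1classtype": "REGULAR + AIDE CLASS", "g1read": 520},
]


def to_long(rows):
    """Return one row per student per grade (two rows per student)."""
    long_rows = []
    for row in rows:
        for grade in ("gk", "g1"):
            long_rows.append({
                "id": row["id"],
                "grade": grade,
                "classtype": row[grade + "classtype"],
                "read": row[grade + "read"],
            })
    return long_rows


def missing_by_class(long_rows):
    """Count rows and missing reading scores per class type (None = unassigned)."""
    counts = defaultdict(lambda: {"total": 0, "missing": 0})
    for row in long_rows:
        cell = counts[row["classtype"]]
        cell["total"] += 1
        cell["missing"] += row["read"] is None
    return dict(counts)


def percent_missing(total, missing):
    """Percentage of missing scores, rounded to one decimal place."""
    return round(100 * missing / total, 1)


long_rows = to_long(wide)
summary = missing_by_class(long_rows)

# Counts copied from Table 1: (total, missing) for each assigned class type.
table1 = {"Small": (3825, 263), "Regular": (4778, 336), "Reg + Aide": (4551, 371)}
table1_pct = {name: percent_missing(t, m) for name, (t, m) in table1.items()}
print(table1_pct)
```

Keeping unassigned rows under the `None` key makes it explicit how many observations are dropped before any comparison between class types is made.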

Because Analysis of Variance (ANOVA) models rely on assumptions of normality and homoscedasticity, Visualization 1 below shows the density curves of each class type’s reading scores for each grade, which lets us check whether these assumptions are plausible.

Exploratory Data Analysis

Class Type

From the plot we can see that although deviations are noticeable, none is severe. Since the density curves are approximately identical across class types, normality and homoscedasticity appear to be reasonably satisfied, which supports the decision to use an ANOVA model.

Class type was randomized within each school, and it is important that each treatment is represented well and roughly equally within each demographic. Tables 2 and 3 below provide the evidence to determine whether this holds.

Table 2: Distribution Check for variables for Kindergarten
Class Type	Total	Percent Black	Percent Free Lunch	Percent Inner City	Mean Reading Score
Small	1900	31.2	47.1	21.1	440.5
Regular	2194	32.4	47.7	23.1	434.7
Reg + Aide	2231	33.8	50.3	23.4	435.4

Table 3: Distribution Check for variables for First Grade
Class Type	Total	Percent Black	Percent Free Lunch	Percent Inner City	Mean Reading Score
Small	1925	31.2	48.8	19.8	530.0
Regular	2584	37.2	53.0	22.1	513.5
Reg + Aide	2320	28.8	52.3	18.5	521.3

In Table 2, representing kindergarten, the composition of each class type is very similar. The percentage of Black students in each class type falls between 31% and 34%. Students on free lunch make up about 47% to 50% of each class type. Finally, students who attend school in the inner city make up 21.1% of small classes, 23.1% of regular classes, and 23.4% of regular + aide classes. In Table 3, representing first grade, the spread is somewhat wider, most visibly for the percentage of Black students (28.8% to 37.2%), which is not surprising given that many students either dropped out of the program or switched class type. Overall, the balance in kindergarten is consistent with successful initial randomization, and the groups remain broadly comparable in first grade. The effect of dropping out and switching within the study is addressed further in the sensitivity analysis.
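The balance comparison above can be made concrete by computing, for each covariate, the gap between the highest and lowest percentage across the three class types. The sketch below uses only the values printed in Tables 2 and 3; the choice of the range as a summary is an illustrative assumption, not a formal balance test.

```python
# Percentages copied from Tables 2 and 3 (order: Small, Regular, Reg + Aide).
balance = {
    "kindergarten": {
        "percent_black": [31.2, 32.4, 33.8],
        "percent_free_lunch": [47.1, 47.7, 50.3],
        "percent_inner_city": [21.1, 23.1, 23.4],
    },
    "first_grade": {
        "percent_black": [31.2, 37.2, 28.8],
        "percent_free_lunch": [48.8, 53.0, 52.3],
        "percent_inner_city": [19.8, 22.1, 18.5],
    },
}


def spread(values):
    """Largest minus smallest percentage across class types, in percentage points."""
    return round(max(values) - min(values), 1)


spreads = {
    grade: {name: spread(values) for name, values in covariates.items()}
    for grade, covariates in balance.items()
}

for grade, result in spreads.items():
    print(grade, result)
```

A larger spread in first grade than in kindergarten would point to attrition or switching rather than to the original assignment, which is why the first-grade comparison is revisited in the sensitivity analysis.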

One of the primary goals of this study is to determine the true effect of class size on the outcome reading scores. To do so, Visualization 2 provides box plots of kindergarten and first grade reading scores based on the class type.

In kindergarten, the Regular and Regular + Aide class types are nearly identical, and small classes have a slightly higher median reading score than their counterparts. At this early stage of a student’s academic career the playing field is quite even. Despite the presence of high-performing students, the core of the data suggests that class type has so far had only a modest visible influence on reading scores. As the students progress into first grade, the scores increase considerably, with small classes showing the largest increase. Again, the regular and regular + aide classes are similar in distribution, with nothing looking extreme or out of the ordinary.

The following plots will serve the purpose of providing sufficient evidence to determine which variables should be included in our final ANOVA model by comparing reading scores in each grade by the specified socio-economic factor.

Race

The majority of students are White and Black, while the number of students from other racial groups is relatively small. Thus, restricting the analysis to Black and White students helps avoid unstable estimates caused by small sample sizes and makes the comparison more reliable.

These boxplots show a sizable gap between the reading scores of Black and White students in both kindergarten and first grade. This gives us ample reason to consider including race in the final model as a precision variable.

Free Lunch Status

Free lunch status also shows a clear difference between students with free lunch and those without. The difference here is larger than for the other variables considered so far, which suggests that free lunch status is a strong predictor of reading scores and should be included in the model.

Urbanicity

Urbanicity appears much more even than the other variables considered so far, with only Inner City students lagging behind when they enter first grade. Although a difference remains, it is not large and could potentially be explained by School ID or Free Lunch Status, since School ID already contains information on schools and Free Lunch Status serves as the proxy for economic status. We will consider removing urbanicity if further testing suggests it is correlated with another variable.

School ID

This plot shows reading scores for each school, with color denoting urbanicity. Schools are ordered from lowest to highest average score to improve readability and to give a sense of the gap in reading scores between schools. There is a good mix of urbanicity types across the ordering, with no specific region having consistently higher or lower scores. Some schools have a sizably higher average score than others, indicating that schools differ and that school should be included in the model as a random effect, since we are interested not in these specific schools but in the variation between schools.

The first grade plot shows somewhat similar results, except that there is now a clear pattern of inner city students underperforming compared to the other regions, which are more evenly distributed. This could reflect systemic issues that affect academic performance in these areas. Inner city areas in the United States have historically been characterized as low-income, mostly minority neighborhoods. This implies that urbanicity is a proxy for socioeconomic status, but because we already have other socioeconomic proxy variables, it may be best to exclude it from the model, as they could be correlated. As in kindergarten, there are clear differences between schools in reading scores, which further supports including school in the model as a random effect. Note also that the confidence intervals are considerably wider than in kindergarten, indicating greater variability in first grade scores.

Correlation

## 
##  Pearson's Chi-squared test
## 
## data:  my_table
## X-squared = 857.9, df = 3, p-value < 2.2e-16

## 
##  Pearson's Chi-squared test
## 
## data:  my_table_1
## X-squared = 822.02, df = 3, p-value < 2.2e-16

The Pearson’s Chi-squared tests above return extremely small p-values, which indicates an association between Free Lunch Status and Urbanicity. We therefore decided to discard urbanicity, as it carries information similar to both School ID and Free Lunch Status.
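For readers who want to see what the test statistic measures, the sketch below computes Pearson's chi-squared statistic for a contingency table and the upper-tail probability for 3 degrees of freedom, which is what a 2 x 4 table (two free-lunch groups by four urbanicity levels) would give. The counts in `toy` are hypothetical and are not the STAR cross-tabulation; the function names are illustrative.

```python
import math


def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical counts: rows = free lunch / no free lunch, columns = four urbanicity levels.
toy = [
    [300, 150, 400, 120],
    [60, 350, 380, 140],
]
stat, df = pearson_chi2(toy)
print(stat, df, chi2_sf_df3(stat))
```

A large statistic relative to its degrees of freedom means the observed counts sit far from what independence would predict, which is the pattern reported for free lunch status and urbanicity.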

Class Switch

There is substantial class switching between Regular and Regular + Aide from kindergarten to first grade. For now, we will assume this does not affect the results and address it later in our Sensitivity Analysis.

Exploratory Data Analysis Overall

A common trend across all the visualizations is that the kindergarten median reading scores are relatively equal, with the groups that usually correlate with lower socioeconomic status, such as Black students, students on free lunch, and inner city students, scoring slightly lower. The disparities in score do not begin to widen until first grade. As the students progress into first grade, these groups begin to fall behind and the difference in scores becomes more obvious. Further testing suggested that urbanicity need not be included, as most of its predictive information is already contained in other variables. The noticeable gaps in reading as the children move into first grade are consistent with our literature review and offer sound evidence that race, free lunch status, and School ID should be included in the ANOVA model.

Final Model

\[ Y_{ijkm} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_m + \epsilon_{ijkm} \]

\(Y_{ijkm}\) : Student Reading Score
\(\mu\) : Overall Grand Mean
\(\alpha_i\) : Effect of Class Size (Small, Regular, Regular + Aide)
\(\beta_j\) : Effect of Free Lunch Status (Free, Not Free)
\(\gamma_k\) : Random Effect of School (School ID)
\(\delta_m\) : Effect of Race (White, Black)
\(\epsilon_{ijkm}\) : Random Error Term

The final model differs from the one given in our presentation. Seeing the other presentations showed how impactful the differences within and between schools are, so we decided to include school in the model. We also decided to exclude urbanicity, as we believe that the location of a school is closely tied to the school itself, and we have reason to believe that urbanicity is already explained by other variables such as Free Lunch Status, since both are proxies for economic status. We modeled school as a random effect because it lets us describe the effect of school on reading scores for the broader population of schools rather than only the specific schools in the study.

Assumptions

\(\epsilon_{ijkm}\) are independent and identically distributed, following a Normal distribution with mean 0 and variance \(\sigma^2\)
The random school effects \(\gamma_k\) are independent of the errors and normally distributed with mean 0 and variance \(\sigma_\gamma^2\)
No interactions are included, as class size showed no notable interactions with other variables and a main-effects model is more interpretable, giving a clear estimate of the effect of class size on reading scores
Class switching is assumed not to affect the findings

Plot of reading scores vs class type indicating no notable interactions as the lines are parallel.

Mixed Effect ANOVA Model

We will first be analyzing the model with a mixed effects ANOVA. This will help us understand the statistical significance of the predictors along with any differences that may appear between kindergarten and first grade.

Kindergarten ANOVA

	npar	Sum Sq	Mean Sq	F value
gkclasstype	2	28691.35	14345.67	19.61975
gkfreelunch	1	127249.82	127249.82	174.03221
race	1	11573.99	11573.99	15.82908

The ANOVA table for the mixed-effects model indicates that Class Type, Free Lunch Status, and Race are all statistically significant predictors of Reading Scores. Free Lunch Status has by far the largest F value and explains the most variation among the fixed effects. This indicates clear differences between the factor levels of each variable.

First Grade ANOVA

	npar	Sum Sq	Mean Sq	F value
g1classtype	2	135465.35	67732.67	29.52766
g1freelunch	1	690740.41	690740.41	301.12426
race	1	45979.47	45979.47	20.04448

The ANOVA for first grade gives similar results. The F values and the variation explained by these factors are larger than in kindergarten, suggesting effects of greater magnitude.

Tukey’s HSD Test

We then conducted a Tukey’s HSD test to compare the 3 class types and their impact on Reading Score. This shows the numerical differences between each pair of class types. Since class type was randomly assigned within schools, these differences should agree with the class-type coefficients from a regression model, as both estimate the average causal effect of class type.

##  contrast                               estimate   SE   df t.ratio p.value
##  (REGULAR + AIDE CLASS) - SMALL CLASS     -5.665 1.08 3856  -5.259  <.0001
##  (REGULAR + AIDE CLASS) - REGULAR CLASS    0.267 1.06 3862   0.251  0.9660
##  SMALL CLASS - REGULAR CLASS               5.932 1.09 3861   5.420  <.0001
## 
## Results are averaged over the levels of: gkfreelunch, race 
## Degrees-of-freedom method: kenward-roger 
## P value adjustment: tukey method for comparing a family of 3 estimates

Results indicate that in kindergarten, reading scores in small classes differ significantly from those in both regular and regular + aide classes. The differences are similar, with students in small classes outperforming those in either of the two regular class types by about 6 points. There is no statistically significant difference in reading score between regular and regular + aide classes, as the p-value (0.966) is far larger than 0.05.

##  contrast                               estimate   SE   df t.ratio p.value
##  SMALL CLASS - REGULAR CLASS               14.13 1.87 3873   7.537  <.0001
##  SMALL CLASS - (REGULAR + AIDE CLASS)       8.97 1.94 3876   4.624  <.0001
##  REGULAR CLASS - (REGULAR + AIDE CLASS)    -5.16 1.95 3888  -2.643  0.0224
## 
## Results are averaged over the levels of: g1freelunch, race 
## Degrees-of-freedom method: kenward-roger 
## P value adjustment: tukey method for comparing a family of 3 estimates

Results were similar for first grade, with the differences between small classes and the two regular class types being significant. This time, however, the difference between the two regular class types was also significant, as all three comparisons had a p-value below 0.05. This suggests that an aide might not help much in kindergarten compared to a regular class but could have an effect as students move up in grade level. It could also be due to other factors such as class switching, as a sizable share of students switched between the 2 regular class types after kindergarten. Overall, students in small classes still outperformed those in regular classes by about 14 points and those in regular + aide classes by about 9 points. Students in regular + aide classes outperformed those in regular classes by about 5 points.

Linear Regression

Next we examine the regression coefficients of the mixed-effects model to estimate the effect of class type, quantify the associations with the precision variables we included, and check their significance.

Kindergarten Regression Model

##  Groups   Name        Std.Dev.
##  gkschid  (Intercept) 14.206  
##  Residual             27.054

##                                   Estimate Std. Error    t value
## (Intercept)                     444.247725   2.323094 191.231045
## gkclasstypeREGULAR + AIDE CLASS  -5.666293   1.077095  -5.260718
## gkclasstypeREGULAR CLASS         -5.932277   1.094391  -5.420617
## gkfreelunchFREE LUNCH           -12.696407   1.070889 -11.855950
## raceWHITE                         7.152766   1.797570   3.979131

All 3 fixed effects in the model are statistically significant predictors. We judge this by comparing the absolute value of each t value with 1.96, the critical value for significance at \(\alpha\) = 0.05. Since we have several thousand observations, the t-distribution is very close to the Normal distribution, which makes 1.96 a good threshold for significance.
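The significance rule described above can be checked directly from the printed estimates and standard errors. The following is a small sketch under the normal approximation stated in the text; the function name `wald_test` is illustrative, and the two-sided p-values it returns are approximations that do not appear in the model output.

```python
import math

# Fixed-effect estimates and standard errors from the kindergarten model output.
kindergarten = {
    "REGULAR + AIDE CLASS": (-5.666293, 1.077095),
    "REGULAR CLASS": (-5.932277, 1.094391),
    "FREE LUNCH": (-12.696407, 1.070889),
    "WHITE": (7.152766, 1.797570),
}


def wald_test(estimate, std_error):
    """Return the t value and a two-sided p-value under the normal approximation."""
    t = estimate / std_error
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


results = {name: wald_test(est, se) for name, (est, se) in kindergarten.items()}

for name, (t, p) in results.items():
    print(f"{name}: t = {t:.3f}, p = {p:.2e}, |t| > 1.96: {abs(t) > 1.96}")
```

Every coefficient clears the 1.96 threshold by a wide margin, so the conclusion does not hinge on the exact choice between the t and Normal reference distributions.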

From the model we can see that the estimated causal effect of class size on reading score matches the mean differences from Tukey’s HSD. With small classes as the reference, students in regular classes and regular + aide classes score around 6 points lower. Students with free lunch score around 13 points lower than students without free lunch, showing how strongly lower socioeconomic status is associated with academic performance. White students outperform Black students by around 7 points, again pointing to background and socioeconomic status as major influences on academic performance. The effects of systemic racism persist and can substantially affect young students’ learning. Our earlier model had urbanicity instead of the random effect of school, and its levels were all deemed statistically insignificant, so replacing it with the school random effect appears to have been a good decision.

The random effect indicates that school baselines differ, with a standard deviation of about 14 points in reading score. This shows the importance of the random effect: each school has a different baseline performance, with some schools scoring higher than others. The residual standard deviation tells us that students within the same school differ with a standard deviation of about 27 points. In other words, scores vary more among students within a school than between the schools themselves.
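One common way to summarize the comparison between school-level and student-level variation is the intraclass correlation, the share of total variance attributable to schools. It is not reported in the original output; the sketch below simply derives it from the two printed standard deviations, and the function name is illustrative.

```python
def intraclass_correlation(sd_school, sd_residual):
    """Share of total variance attributable to the school random intercept."""
    var_school = sd_school ** 2
    var_residual = sd_residual ** 2
    return var_school / (var_school + var_residual)


# Standard deviations from the kindergarten random-effects output.
icc_kindergarten = intraclass_correlation(14.206, 27.054)
print(round(icc_kindergarten, 3))
```

For kindergarten this comes to a little over one fifth of the variance lying between schools, which agrees with the statement that most variation is among students within a school. The same function can be applied to the first-grade standard deviations.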

First Grade Regression Model

##  Groups   Name        Std.Dev.
##  g1schid  (Intercept) 19.190  
##  Residual             47.894

##                                  Estimate Std. Error    t value
## (Intercept)                     537.73560   3.565331 150.823463
## g1classtypeREGULAR CLASS        -14.12822   1.873341  -7.541722
## g1classtypeREGULAR + AIDE CLASS  -8.96508   1.937422  -4.627325
## g1freelunchFREE LUNCH           -29.43036   1.868345 -15.752105
## raceWHITE                        13.53015   3.022076   4.477106

The first grade regression model gives similar results, with all variables being statistically significant. The effects appear larger than in kindergarten: students in small classes outperform those in regular classes by about 14 points and those in regular + aide classes by about 9 points. Students with free lunch score about 29 points lower than those without, and White students score about 14 points higher than Black students. The random effect shows a between-school standard deviation of around 19 points, while students within schools differ with a standard deviation of around 48 points. Class size certainly has an effect on reading scores, but socioeconomic factors appear to play a larger role, with free lunch status, race, and school all serving as proxies for them. Notably, these effects grow from kindergarten to first grade, indicating a widening gap as students grow up.

Overall Results

Overall, there is a causal effect of class size on students’ reading scores in both kindergarten and 1st grade. Expressed relative to the baseline (intercept) score, and with small classes as the reference, students in both regular and regular + aide classes score around 1.3% lower in kindergarten. Students who have free lunch score 2.9% lower than those who do not, and White students outperform Black students by 1.6%. In 1st grade, students in regular classes score 2.6% lower and students in regular + aide classes 1.7% lower than those in small classes. Students with free lunch score 5.5% lower and White students outperform Black students by 2.5%.

This shows that although class size matters, addressing systemic socioeconomic inequalities would most likely be more impactful.
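The percentage effects quoted in this section can be reproduced from the regression output, assuming they were obtained by dividing each coefficient by the corresponding model intercept (the predicted score for a Black student in a small class without free lunch). That choice of baseline is an assumption inferred from the numbers, so results may differ from the text in the last digit.

```python
# Intercepts and coefficients copied from the two regression outputs.
models = {
    "kindergarten": {
        "intercept": 444.247725,
        "regular": -5.932277,
        "regular_aide": -5.666293,
        "free_lunch": -12.696407,
        "white": 7.152766,
    },
    "first_grade": {
        "intercept": 537.73560,
        "regular": -14.12822,
        "regular_aide": -8.96508,
        "free_lunch": -29.43036,
        "white": 13.53015,
    },
}


def percent_of_baseline(model):
    """Express each coefficient as a percentage of the intercept, to one decimal."""
    base = model["intercept"]
    return {
        name: round(100 * value / base, 1)
        for name, value in model.items()
        if name != "intercept"
    }


percent_effects = {grade: percent_of_baseline(m) for grade, m in models.items()}

for grade, effects in percent_effects.items():
    print(grade, effects)
```

Because the baseline score itself rises between grades, a growing percentage effect means the point gaps are widening faster than the scores are.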

Sensitivity Analysis

Residual Plots

In both plots, the residuals are roughly centered around zero, suggesting that the models capture the main pattern in the data reasonably well.

The residual plot for the kindergarten model shows a fairly constant spread of residuals across fitted values, indicating that the constant variance assumption is largely satisfied.

In contrast, the residual plot for the first-grade model shows a slightly increasing spread as the fitted values increase, suggesting a mild degree of heteroskedasticity.

Overall, the residual diagnostics suggest that the models are reasonable for studying the effects of class type, free-lunch status, and race on kindergarten and first-grade reading scores while accounting for random school effects, although the mild heteroskedasticity in first grade means those estimates should be read with somewhat more caution.

Randomization Check

To interpret the class-size effect causally, we first examine whether class assignment was randomized. In the STAR experiment, students were assigned to class types within each school rather than across all schools combined. If class assignment was randomized within schools, then conditional on the school attended, the assignment to class type should be independent of baseline characteristics. Under this assumption, differences in outcomes across class types can be interpreted as causal effects of class size.

To assess whether class assignment was randomized, we test whether class type is independent of baseline characteristics within each school.

Let \(Y\) denote class type, \(X\) a baseline covariate, and \(S\) the school. Under within-school randomization, the assignment should satisfy \[ Y \perp X \mid S . \]

We focus on two baseline variables: race and free-lunch status, which are the main predictors included in our model. We test whether class type is independent of these variables within schools in both Kindergarten (GK) and Grade 1 (G1).

Statistical Test

Since class type has three categories (small class, regular class, and regular class with aide), we model the assignment probability using a multinomial logistic regression with school fixed effects. Specifically, we estimate \[ P(Y_i = k \mid X_i, S_i) = \frac{\exp(\alpha_{k,S_i} + \beta_k X_i)} {\sum_{j=1}^{3}\exp(\alpha_{j,S_i} + \beta_j X_i)}, \] where \(Y_i\) denotes the class type of student \(i\), \(X_i\) is a baseline covariate (race or free-lunch status), and \(\alpha_{k,S_i}\) represents school fixed effects.

To test the randomization assumption, we compare this model with a null model that includes only school fixed effects (i.e., \(\beta_k=0\)). We then conduct a likelihood ratio test to determine whether the covariate significantly predicts class assignment.

If the \(p\)-value is large, it suggests that, after accounting for school, the covariate does not predict class assignment, which is consistent with the randomization assumption. Conversely, a small \(p\)-value indicates that class assignment is associated with the covariate, suggesting a potential violation of randomization.
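To make the model comparison concrete, the sketch below evaluates the multinomial-logit probabilities from the formula above for one school and shows that, when every \(\beta_k\) is zero, the assignment probabilities do not depend on the covariate. All numeric values are hypothetical, the first class type is treated as the reference category, and the function names are illustrative; the log-likelihoods that enter the test would come from fitting both models to the data.

```python
import math


def class_probabilities(school_intercepts, betas, x):
    """Multinomial-logit probabilities for the three class types given covariate x."""
    scores = [a + b * x for a, b in zip(school_intercepts, betas)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def likelihood_ratio(loglik_full, loglik_null):
    """LR statistic comparing the full model with the school-only null model."""
    return 2 * (loglik_full - loglik_null)


# Hypothetical values for one school; the first class type is the reference.
alphas = [0.0, 0.2, 0.1]
null_betas = [0.0, 0.0, 0.0]   # randomization holds: the covariate has no effect
alt_betas = [0.0, 0.5, -0.3]   # the covariate shifts assignment probabilities

p_null_0 = class_probabilities(alphas, null_betas, x=0)
p_null_1 = class_probabilities(alphas, null_betas, x=1)
p_alt_1 = class_probabilities(alphas, alt_betas, x=1)

print(p_null_0, p_null_1, p_alt_1)
```

With one covariate and three class types, the full model has two more free parameters than the null model, which is why each test in the results table has 2 degrees of freedom.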

Results

Likelihood ratio tests for independence between class assignment and covariates within schools
	LR	df	p
GK: Class Type vs Race	1.1462	2	0.5638
GK: Class Type vs Free Lunch	1.6517	2	0.4379
G1: Class Type vs Race	1.8882	2	0.3890
G1: Class Type vs Free Lunch	8.9516	2	0.0114

The results show that class assignment in Kindergarten appears to be randomized. In contrast, in Grade 1 the free-lunch status significantly predicts class assignment, indicating a potential selection bias. Therefore, the causal interpretation of the class-size effect is more credible for Kindergarten, while the Grade 1 analysis may be affected by selection bias.
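The p-values in the table above can be recovered from the LR statistics, since a chi-squared variable with 2 degrees of freedom has the closed-form upper-tail probability \(e^{-x/2}\). The sketch below is only a consistency check on the reported numbers; the function name is illustrative.

```python
import math

# LR statistics copied from the results table; each test has 2 degrees of freedom.
lr_stats = {
    "GK: Class Type vs Race": 1.1462,
    "GK: Class Type vs Free Lunch": 1.6517,
    "G1: Class Type vs Race": 1.8882,
    "G1: Class Type vs Free Lunch": 8.9516,
}


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2)


p_values = {name: chi2_sf_df2(lr) for name, lr in lr_stats.items()}

for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}, reject at 0.05: {p < 0.05}")
```

Only the first-grade free-lunch test falls below 0.05, in line with the interpretation given above.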

However, the matching analysis in the next section shows that, for students who switch class types, smaller classes are still associated with larger improvements in reading scores, which is consistent with the direction of the main model. Although most of these estimates are not statistically significant, this suggests that the observed pattern is not entirely driven by selection. Therefore, the ANOVA results for Grade 1 remain informative, but the causal interpretation should be made with caution.

Matching Approach

To check the robustness of our main results, we conduct a sensitivity analysis using a matching approach to compare the change in scores between students who remained in their original class type and those who transitioned to another class.

In this analysis, students are matched based on baseline characteristics including race, free-lunch status, school ID, and their initial academic performance. Initial performance is measured by the kindergarten (GK) reading score percentile. These variables are used to construct a similarity measure so that students with comparable backgrounds and baseline achievement are paired together. By comparing the changes in reading scores among matched students with similar characteristics, we aim to assess the effect of class size on students who switch class types.

For each transition, we estimate the average treatment effect on the treated (ATT). Specifically, we compare the change in reading scores for students who switched to another class type with the change in scores for matched students who remained in their original class type. Because the matching procedure pairs students with similar baseline characteristics and initial academic performance, the difference in score changes can be interpreted as the effect of class size for students who switch class types.
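A minimal sketch of this kind of estimate is shown below. The report does not specify the similarity measure, so the rule used here (exact match on school, race, and free-lunch status, then the nearest kindergarten percentile, with replacement) is an assumption chosen for illustration, and the student records are hypothetical.

```python
from statistics import mean

# Hypothetical students: 'switched' marks a change of class type after kindergarten,
# 'pct' is the kindergarten reading percentile, 'gain' is the first-grade minus
# kindergarten reading score.
students = [
    {"id": 1, "school": "A", "race": "BLACK", "lunch": "FREE", "pct": 40, "switched": True, "gain": 95},
    {"id": 2, "school": "A", "race": "BLACK", "lunch": "FREE", "pct": 42, "switched": False, "gain": 80},
    {"id": 3, "school": "A", "race": "BLACK", "lunch": "FREE", "pct": 75, "switched": False, "gain": 70},
    {"id": 4, "school": "B", "race": "WHITE", "lunch": "NON-FREE", "pct": 60, "switched": True, "gain": 88},
    {"id": 5, "school": "B", "race": "WHITE", "lunch": "NON-FREE", "pct": 58, "switched": False, "gain": 90},
    {"id": 6, "school": "C", "race": "WHITE", "lunch": "FREE", "pct": 30, "switched": True, "gain": 60},
]


def match_and_estimate(rows):
    """Match each switcher to the most similar stayer and average the gain differences."""
    controls = [r for r in rows if not r["switched"]]
    pairs = []
    for treated in (r for r in rows if r["switched"]):
        # Exact match on the categorical baseline characteristics.
        pool = [
            c for c in controls
            if (c["school"], c["race"], c["lunch"])
            == (treated["school"], treated["race"], treated["lunch"])
        ]
        if not pool:
            continue  # no comparable stayer: the switcher is left unmatched
        # Nearest neighbor on the kindergarten percentile.
        best = min(pool, key=lambda c: abs(c["pct"] - treated["pct"]))
        pairs.append((treated, best))
    att = mean(t["gain"] - c["gain"] for t, c in pairs)
    return pairs, att


pairs, att = match_and_estimate(students)
print([(t["id"], c["id"]) for t, c in pairs], att)
```

Switchers without a comparable stayer are dropped rather than matched poorly, so the estimate applies only to the matched switchers.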

Results

The results suggest that students who switch from regular classes to regular classes with an aide, and from regular classes or regular classes with an aide to small classes, tend to experience larger improvements in reading scores. Conversely, students who move from small classes to regular classes tend to show a decline in their score improvements. These patterns are broadly consistent with the findings from our main model, which suggests that smaller class sizes are associated with better reading outcomes. The results also suggest that regular classes with an aide are associated with larger improvements compared with regular classes without an aide.

However, most of these estimated effects are not statistically significant, suggesting that the evidence for switching effects is limited.

Change-in-Score Model

We also estimated a change-in-score model to examine the effect of class size on the improvement in students’ reading scores.

As an additional robustness check, we considered a change score model in which the outcome was defined as the difference between first-grade and kindergarten reading scores. This specification focuses directly on student improvement over time, rather than the absolute score at a single grade level. If the estimated class-size effect remains similar under this alternative specification, then the main findings are more robust to model choice.

Instead of modeling reading scores at a single grade level, here we defined the response as the difference between first-grade and kindergarten reading scores:

\[ \Delta Y_i = Y_{i,G1} - Y_{i,GK}. \]

This allows us to directly study whether class size is associated with different levels of reading improvement between kindergarten and first grade.

Consistent with the main model, we use gkclasstype as the main treatment variable, while race and free-lunch status are included as fixed effects, with school ID modeled as a random effect.

Change Score Mixed Model

The change score model can be written as \[ \Delta Y_{ijkm} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_m + \epsilon_{ijkm}, \] where \(\Delta Y_{ijkm}\) denotes the change in reading score from Kindergarten to Grade 1 for a student with class type \(i\), free-lunch status \(j\), school \(k\), and race \(m\). The remaining parameters have the same interpretations as in the main model.

Results

	npar	Sum Sq	Mean Sq	F value
gkclasstype	2	20181.379	10090.690	6.614189
gkfreelunch	1	165168.856	165168.856	108.263960
race	1	7519.069	7519.069	4.928557
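
The table reports F ratios but no p-values, because lme4 does not supply denominator degrees of freedom. As a rough check, and not as part of the original analysis, the sketch below recovers the implied residual variance from each row (assuming, as in lme4, that each F value is the mean square divided by the estimated residual variance) and computes approximate p-values by comparing npar × F with a chi-square distribution on npar degrees of freedom. That large-sample approximation is an assumption made here; a method that estimates denominator degrees of freedom could give slightly different values.

```r
# ANOVA rows copied from the table above.
aov_tab <- data.frame(
  term    = c("gkclasstype", "gkfreelunch", "race"),
  npar    = c(2, 1, 1),
  mean_sq = c(10090.690, 165168.856, 7519.069),
  f_value = c(6.614189, 108.263960, 4.928557)
)

# Implied residual variance: every row should give the same value.
aov_tab$resid_var <- aov_tab$mean_sq / aov_tab$f_value

# Approximate p-value (assumption: large-sample chi-square reference,
# i.e. npar * F compared with a chi-square on npar degrees of freedom).
aov_tab$approx_p <- pchisq(aov_tab$npar * aov_tab$f_value,
                           df = aov_tab$npar, lower.tail = FALSE)

print(aov_tab)
```

All three rows imply a residual variance of about 1525.6, that is, a residual standard deviation of about 39 points, and under this approximation each term has an approximate p-value below 0.05.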

Compared with students in small classes, students in larger classes show significantly smaller improvements in reading scores: students in regular classes improve about 4.33 points less on average, and students in regular classes with an aide improve about 5.17 points less. These results suggest that students assigned to small classes experience larger gains in reading performance between kindergarten and first grade.

Additionally, students who do not receive free lunch improve by approximately 14.75 points more than students who do, and White students improve about 5.68 points more than Black students on average.
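
To put these magnitudes side by side, the following sketch uses the point estimates quoted above and the additive structure of the model (no interaction terms) to compare the class-size effect with the free-lunch and race gaps. The variable names are illustrative, and the comparison ignores the uncertainty in each estimate.

```r
# Point estimates quoted in the text (differences in reading-score gain, in points).
small_vs_regular <- 4.33   # small class vs. regular class
small_vs_aide    <- 5.17   # small class vs. regular class with an aide
lunch_gap        <- 14.75  # non-free-lunch vs. free-lunch students
race_gap         <- 5.68   # White vs. Black students

# Class-size effect as a fraction of the free-lunch gap.
ratio_regular <- small_vs_regular / lunch_gap
ratio_aide    <- small_vs_aide / lunch_gap

# Additive model: gaps simply sum when no interactions are included.
combined_gap <- lunch_gap + race_gap

print(round(c(ratio_regular = ratio_regular, ratio_aide = ratio_aide), 2))
print(combined_gap)
```

On these estimates the small-class advantage in gains is roughly 29 to 35 percent of the free-lunch gap, which is in line with the conclusion that class size helps but does not offset socioeconomic differences.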

Overall, the change score model indicates that class size significantly affects reading score improvement. Students in smaller classes not only achieve higher reading scores but also experience greater gains in reading performance between kindergarten and first grade. This result is consistent with the findings from the main mixed-effects regression model and therefore strengthens the evidence that smaller class sizes positively affect early reading development.

Conclusion

This study investigates the effect of class size on early reading outcomes using data from the Project STAR randomized experiment. The results provide evidence that students assigned to small classes perform better on reading assessments than those in regular-sized classes. The findings show that the effect of class size appears early and becomes larger by first grade. Students in small classes score higher on average and experience greater improvements in reading performance over time. These results are consistent across several model specifications, including mixed-effects regression, matching analysis, and change-score models.

At the same time, socioeconomic factors remain strong predictors of academic performance. Students receiving free lunch and Black students tend to have lower reading scores, and differences between schools also contribute to variation in outcomes. These patterns suggest that while reducing class size can improve learning outcomes, it does not fully address broader inequalities in education.

Overall, the results indicate that smaller class sizes can positively influence early reading development. However, policies aimed at improving educational equity should also address socioeconomic disparities that continue to affect student achievement.

Appendix I. References

Domina, T., Pharris-Ciurej, N., Penner, A., Penner, E., Brummet, Q., Porter, S., & Sanabria, T. (2018).
Is free and reduced-price lunch a valid measure of educational disadvantage? Educational Researcher, 47(9), 539–555.

Finn, J. D., & Achilles, C. M. (1990).
Answers and questions about class size: A statewide experiment. American Educational Research Journal, 27(3), 557–577.

Krueger, A. B. (1999).
Experimental estimates of education production functions. The Quarterly Journal of Economics, 114(2), 497–532.

Lippman, L., Burns, S., & McArthur, E. (1996).
Urban schools: The challenge of location and poverty. U.S. Department of Education.

Reardon, S. F. (2011).
The widening academic achievement gap between the rich and the poor: New evidence and possible explanations. In G. J. Duncan & R. J. Murnane (Eds.), Whither opportunity? Rising inequality, schools, and children’s life chances (pp. 91–116). Russell Sage Foundation.

Appendix II. Reproducibility Info

knitr::opts_chunk$set(
  echo    = FALSE,
  message = FALSE,
  warning = FALSE,
  fig.width  = 10,
  fig.height = 5,
  dpi     = 150
)
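# Load the Project STAR student data; the .RData file is expected to provide the data frame `x` used below.
load("STAR_Students.RData")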
library(dplyr)
library(ggplot2)
library(patchwork)
library(lme4)
library(emmeans)
library(tidyr)
library(forcats)
library(knitr)
library(scales)
library(data.table)
library(MatchIt)
library(purrr)
library(tibble)
library(nnet)
library(kableExtra)
library(plotly)

socio_econ_df = x %>% select(stdntid, race, gkclasstype, g1classtype, gksurban, g1surban, gkfreelunch, g1freelunch, gktreadss, g1treadss, gkschid, g1schid)

long_df <- socio_econ_df %>%
  pivot_longer(
    cols         = -c(stdntid, race),
    names_pattern = "(g[k1])(.+)",
    names_to     = c("grade", ".value")
  ) %>%
  mutate(
    classtype = fct_recode(as.factor(classtype),
      "Small"        = "SMALL CLASS",
      "Regular"      = "REGULAR CLASS",
      "Reg + Aide" = "REGULAR + AIDE CLASS"
    ),
    freelunch = as.factor(freelunch),
    race      = as.factor(race),
    surban    = as.factor(surban)
  )
long_df %>%
  group_by(classtype) %>%
  summarise(
    n_total   = n(),
    n_missing = sum(is.na(treadss)),
    pct_miss  = round(100 * mean(is.na(treadss)), 1)
  ) %>%
  kable(
    col.names = c("Class Type", "Total N", "Missing", "Percent Missing"),
    caption   = "Table 1: Missing Data in Reading Score by Class Type"
  )
long_df %>%
  filter(!is.na(classtype), !is.na(treadss)) %>%
  mutate(grade = fct_rev(factor(grade))) %>%
  ggplot(aes(x = treadss, fill = classtype)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~grade) +
  labs(
    title    = "Visualization 1: Reading Score Distribution by Class Type",
    x = "Reading Score", y = "Density"
  ) +
  theme_bw()
long_df %>%
  filter(grade == "gk", !is.na(classtype)) %>%
  group_by(classtype) %>%
  summarise(
    n = n(),
    pct_black = round(100 * mean(race == "BLACK", na.rm = TRUE), 1),
    pct_freelunch = round(100 * mean(freelunch == "FREE LUNCH", na.rm = TRUE), 1),
    pct_urban = round(100 * mean(surban == "INNER CITY", na.rm = TRUE), 1),
    mean_read = round(mean(treadss, na.rm = TRUE), 1)
  ) %>%
  kable(
    col.names = c("Class Type", "Total", "Percent Black", "Percent Free Lunch", "Percent Inner City", "Mean Reading Score"),
    caption   = "Table 2: Distribution Check of Variables for Kindergarten"
  )
long_df %>%
  filter(grade == "g1", !is.na(classtype)) %>%
  group_by(classtype) %>%
  summarise(
    n = n(),
    pct_black = round(100 * mean(race == "BLACK", na.rm = TRUE), 1),
    pct_freelunch = round(100 * mean(freelunch == "FREE LUNCH", na.rm = TRUE), 1),
    pct_urban = round(100 * mean(surban == "INNER CITY", na.rm = TRUE), 1),
    mean_read = round(mean(treadss, na.rm = TRUE), 1)
  ) %>%
  kable(
    col.names = c("Class Type", "Total", "Percent Black", "Percent Free Lunch", "Percent Inner City", "Mean Reading Score"),
    caption   = "Table 3: Distribution Check of Variables for First Grade"
  )

long_df %>%
  filter(!is.na(classtype), !is.na(treadss)) %>%
  mutate(grade = fct_rev(factor(grade))) %>%
  ggplot(aes(x = classtype, y = treadss, fill = classtype)) +
  geom_boxplot(alpha = 0.6) +
  facet_wrap(~grade) +
  labs(
    title    = "Visualization 2: Reading Scores by Class Type",
    x = "Class Type", y = "Reading Score", fill = "Class Type"
  ) +
  theme_bw()

race_df <- x %>%
  filter(!is.na(race)) %>%
  count(race) %>%
  mutate(
    percent = n / sum(n),
    # scales::percent() is called explicitly so it is not confused with the `percent` column.
    label = ifelse(percent > 0.02,
                   paste0(race, "\n", scales::percent(percent, accuracy = 0.1)),
                   "")
  )


race_colors <- c(
  "WHITE" = "#4E79A7",
  "BLACK" = "#E15759",
  "ASIAN" = "#59A14F",
  "HISPANIC" = "#F28E2B",
  "OTHER" = "#B07AA1",
  "NATIVE AMERICAN" = "#76B7B2"
)

p_race <- ggplot(race_df, aes(x = 2, y = n, fill = race)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +     
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5),
            size = 4) +
  scale_fill_manual(values = race_colors) +
  labs(
    title = "Donut Chart of Race Distribution",
    fill = "Race"
  ) +
  theme_void()

print(p_race)
plot_df <- x %>%
  filter(race %in% c("WHITE", "BLACK")) %>%
  pivot_longer(
    c(gktreadss, g1treadss),
    names_to = "grade",
    values_to = "reading_score"
  ) %>%
  mutate(
    class_label = case_when(
      grade == "gktreadss" ~ as.character(gkclasstype),
      grade == "g1treadss" ~ as.character(g1classtype)
    ),
    grade = factor(grade, levels = c("gktreadss", "g1treadss"),
                   labels = c("GK", "G1")),
    class_label = factor(
      class_label,
      levels = c("SMALL CLASS", "REGULAR CLASS", "REGULAR + AIDE CLASS"),
      labels = c("Small\nClass", "Regular\nClass", "Regular\n+ Aide")
    )
  ) %>%
  filter(!is.na(reading_score), !is.na(class_label))

ggplot(plot_df, aes(class_label, reading_score, fill = race)) +
  geom_boxplot(position = position_dodge(0.75), outlier.size = 0.8) +
  facet_wrap(~grade, nrow = 1) +
  scale_fill_manual(values = c("WHITE" = "#4E79A7", "BLACK" = "#E15759")) +
  labs(x = "Class Type", y = "Reading Score", fill = "Race") +
  theme_bw(base_size = 14) +
  theme(legend.position = "top")
long_df %>%
  filter(!is.na(freelunch), !is.na(treadss)) %>%
  mutate(grade = fct_rev(factor(grade))) %>%
  ggplot(aes(x = freelunch, y = treadss, fill = freelunch)) +
  geom_boxplot(alpha = 0.6) +
  facet_wrap(~grade) +
  labs(
    title    = "Visualization 4: Reading Scores by Free Lunch Status",
    x = "Free Lunch Status", y = "Reading Score"
  ) +
  theme_bw() 
long_df %>%
  filter(!is.na(surban), !is.na(treadss)) %>%
  mutate(grade = fct_rev(factor(grade))) %>%
  ggplot(aes(x = surban, y = treadss, fill = surban)) +
  geom_boxplot(alpha = 0.6) +
  facet_wrap(~grade) +
  labs(
    title = "Visualization 5: Reading Scores Across City Level",
    x = "City Level", y = "Reading Score"
  ) +
  theme_bw()
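# Rebuild the modeling data set: complete cases only, categorical columns as factors,
# and White and Black students only (see the steps below).
socio_econ_df = x %>% select(stdntid, gkclasstype, g1classtype, race, gksurban, g1surban, gkfreelunch, g1freelunch, gkschid, g1schid, gktreadss, g1treadss)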

socio_econ_df = na.omit(socio_econ_df)
socio_econ_df[, 2:10] <- lapply(socio_econ_df[, 2:10], function(x) {
  factor(x, levels = unique(x), ordered = FALSE)
})
socio_econ_df = socio_econ_df %>% filter(race == "WHITE" | race == "BLACK")
socio_econ_df$g1classtype <- factor(socio_econ_df$g1classtype, 
                                     levels = c("SMALL CLASS",
                                                "REGULAR CLASS",
                                                "REGULAR + AIDE CLASS"))
ggplot(socio_econ_df, aes(x = reorder(gkschid, gktreadss, FUN = mean), 
                          y = gktreadss, 
                          fill = gksurban)) +
  geom_boxplot(width = 1) +  
  labs(x = "School", y = "Reading Score", title = "School vs Kindergarten Reading Scores") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
ggplot(socio_econ_df, aes(x = reorder(g1schid, g1treadss, FUN = mean), 
                          y = g1treadss, 
                          fill = g1surban)) +
  geom_boxplot(width = 1) +  
  labs(x = "School", y = "Reading Score", title = "School vs First Grade Reading Scores") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
my_table <- table(socio_econ_df$gkfreelunch, socio_econ_df$gksurban)
my_table_1 <- table(socio_econ_df$g1freelunch, socio_econ_df$g1surban)

chisq.test(my_table)
chisq.test(my_table_1)
data_gk_g1 <- x %>%
  select(gkclasstype, g1classtype) %>%
  na.omit()

flow_data <- data_gk_g1 %>%
  group_by(gkclasstype, g1classtype) %>%
  summarise(Freq = n(), .groups = "drop")

flow_data <- flow_data %>%
  mutate(
    gk_class = case_when(
      gkclasstype == "REGULAR CLASS" ~ "regular",
      gkclasstype == "SMALL CLASS" ~ "small",
      gkclasstype == "REGULAR + AIDE CLASS" ~ "regular+aide",
      TRUE ~ as.character(gkclasstype)
    ),
    g1_class = case_when(
      g1classtype == "REGULAR CLASS" ~ "regular",
      g1classtype == "SMALL CLASS" ~ "small",
      g1classtype == "REGULAR + AIDE CLASS" ~ "regular+aide",
      TRUE ~ as.character(g1classtype)
    )
  )

flow_data <- flow_data %>%
  mutate(
    source = paste0(gk_class, "_GK"),
    target = paste0(g1_class, "_G1")
  )

flow_data <- flow_data %>%
  mutate(
    color = case_when(
      gk_class == "regular" ~ "lightblue",
      gk_class == "small" ~ "salmon",
      gk_class == "regular+aide" ~ "lightgray",
      TRUE ~ "gray"
    )
  )

links <- flow_data %>%
  select(source, target, Freq, color)

node_order <- c(
  "regular_GK", "small_GK", "regular+aide_GK",
  "regular_G1", "small_G1", "regular+aide_G1"
)

nodes <- data.frame(name = node_order)


links$source_id <- match(links$source, nodes$name) - 1
links$target_id <- match(links$target, nodes$name) - 1

node_labels <- c(
  "regular GK", "small GK", "reg+aide GK",
  "regular G1", "small G1", "reg+aide G1"
)


node_colors <- c(
  "#EFC000FF", "#868686FF", "#CD534CFF",
  "#EFC000FF", "#868686FF", "#CD534CFF"
)

fig <- plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(
    label = node_labels,
    color = node_colors,
    pad = 20,
    thickness = 20,
    line = list(color = "black", width = 0.5)
  ),
  link = list(
    source = links$source_id,
    target = links$target_id,
    value  = links$Freq,
    color  = links$color
  )
)

fig <- fig %>%
  layout(
    title = "Continuity of Program from Kindergarten to Grade 1",
    font = list(size = 14),
    margin = list(t = 80),
    width = 600
  )

fig
model_k <- lmer(gktreadss ~ gkclasstype + gkfreelunch + race + 
                          (1 | gkschid), data = socio_econ_df, REML = FALSE)
anova(model_k)
model_1 <- lmer(g1treadss ~ g1classtype + g1freelunch + race +
                          (1 | g1schid), data = socio_econ_df, REML = FALSE)
anova(model_1)
group_means_k = emmeans(model_k, "gkclasstype", pbkrtest.limit = 4000)
pairs(group_means_k, adjust = "tukey")
group_means_1 = emmeans(model_1, "g1classtype", pbkrtest.limit = 4000)
pairs(group_means_1, adjust = "tukey")
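# Refit the kindergarten model with SMALL CLASS as the reference level.
# Note: this refit uses the lmer default (REML), unlike the ML fit above.
socio_econ_df$gkclasstype <- relevel(socio_econ_df$gkclasstype, ref = "SMALL CLASS")
model_k <- lmer(gktreadss ~ gkclasstype + gkfreelunch + race + (1|gkschid), 
                data = socio_econ_df)

print(VarCorr(model_k))
summary(model_k)$coefficients
# g1classtype already has SMALL CLASS as its first level (set when the factor
# was rebuilt above), so this relevel is a no-op and model_1 is not refit.
socio_econ_df$g1classtype <- relevel(socio_econ_df$g1classtype, ref = "SMALL CLASS")
print(VarCorr(model_1))

summary(model_1)$coefficients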
res_df <- bind_rows(
  data.frame(
    fitted = fitted(model_k),
    resid  = resid(model_k),
    grade  = "GK"
  ),
  data.frame(
    fitted = fitted(model_1),
    resid  = resid(model_1),
    grade  = "G1"
  )
) %>%
  mutate(grade = factor(grade, levels = c("GK","G1"))) 

ggplot(res_df, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.35, size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
  geom_smooth(method = "loess", se = FALSE, color = "red", linewidth = 1) +
  facet_wrap(~grade, nrow = 1, scales = "free") +  
  labs(
    x = "Fitted Values",
    y = "Residuals",
    title = "Residual Plots for GK and G1 Mixed Models"
  ) +
  theme_bw(base_size = 14) +
  theme(
    strip.text = element_text(size = 14),
    plot.title = element_text(hjust = 0.5)
  )

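# Likelihood-ratio test of whether `covar` predicts class assignment after
# conditioning on school, using two nested multinomial logit models.
test_independence <- function(data, class, school, covar, keep = NULL) {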
  
  df <- data %>%
    select(all_of(c(class, school, covar))) %>%
    na.omit()
  
  if (!is.null(keep)) {
    df <- df %>% filter(.data[[covar]] %in% keep)
  }
  
  df <- df %>% mutate(across(everything(), as.factor))
  
  m1 <- multinom(as.formula(paste(class, "~", school)), data=df, trace=FALSE)
  m2 <- multinom(as.formula(paste(class, "~", school, "+", covar)), data=df, trace=FALSE)
  
  LR <- 2*(logLik(m2) - logLik(m1))
  df_diff <- attr(logLik(m2),"df") - attr(logLik(m1),"df")
  p <- pchisq(LR, df_diff, lower.tail = FALSE)
  
  c(LR = as.numeric(LR), df = df_diff, p = p)
}

results <- rbind(
  "GK: Class Type vs Race" =
    test_independence(x, "gkclasstype","gkschid","race", keep=c("WHITE","BLACK")),
  
  "GK: Class Type vs Free Lunch" =
    test_independence(x,"gkclasstype","gkschid","gkfreelunch"),
  
  "G1: Class Type vs Race" =
    test_independence(x,"g1classtype","g1schid","race", keep=c("WHITE","BLACK")),
  
  "G1: Class Type vs Free Lunch" =
    test_independence(x,"g1classtype","g1schid","g1freelunch")
)

results <- as.data.frame(results)

knitr::kable(
  results,
  digits = 4,
  caption = "Likelihood ratio tests for independence between class assignment and covariates within schools"
)

df <- x %>%
  select(stdntid, race, gkschid, gkclasstype, g1classtype,
         gksurban, gkfreelunch, gktreadss, g1treadss) %>%
  drop_na() %>%
  mutate(
    pctR_gk = 100 * percent_rank(gktreadss),
    delta_read = g1treadss - gktreadss
  )

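# Estimate the effect of moving from class type `origin` to `dest` on the change
# score: movers (treat = 1) are compared with stayers (treat = 0) after 1:1
# nearest-neighbor propensity-score matching within the same kindergarten school.
run_matching <- function(data, origin, dest) {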
  temp <- data %>%
    mutate(
      treat = case_when(
        gkclasstype == origin & g1classtype == dest   ~ 1,
        gkclasstype == origin & g1classtype == origin ~ 0,
        TRUE ~ NA_real_
      )
    ) %>%
    filter(!is.na(treat))

  if (nrow(temp) < 20 || sum(temp$treat == 1) < 10 || sum(temp$treat == 0) < 10) return(NULL)

  m <- matchit(
    treat ~ pctR_gk + race + gkfreelunch,
    data = temp,
    method = "nearest",
    distance = "glm",
    ratio = 1,
    caliper = 0.2,
    std.caliper = TRUE,
    exact = ~ gkschid
  )

  matched <- match.data(m)

  means <- matched %>%
    group_by(treat) %>%
    summarise(mean_change = weighted.mean(delta_read, weights), .groups = "drop")

  tt <- tryCatch(t.test(delta_read ~ treat, data = matched), error = function(e) NULL)

  tibble(
    origin = origin,
    dest = dest,
    ATT = means$mean_change[means$treat == 1] - means$mean_change[means$treat == 0],
    treated_mean = means$mean_change[means$treat == 1],
    control_mean = means$mean_change[means$treat == 0],
    treated_n = sum(matched$treat == 1),
    control_n = sum(matched$treat == 0),
    p_value = ifelse(is.null(tt), NA, tt$p.value)
  )
}

transitions <- tribble(
  ~origin, ~dest,
  "SMALL CLASS", "REGULAR CLASS",
  "SMALL CLASS", "REGULAR + AIDE CLASS",
  "REGULAR CLASS", "SMALL CLASS",
  "REGULAR CLASS", "REGULAR + AIDE CLASS",
  "REGULAR + AIDE CLASS", "SMALL CLASS",
  "REGULAR + AIDE CLASS", "REGULAR CLASS"
)

results <- pmap_dfr(transitions, ~run_matching(df, ..1, ..2)) %>%
  mutate(
    transition = paste(origin, "→", dest),
    sig = ifelse(!is.na(p_value) & p_value < 0.05, "Significant", "Not significant")
  )

p_readchange <- ggplot(results, aes(x = reorder(transition, ATT), y = ATT, fill = sig)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(
    x = "Class transition",
    y = "ATT (G1 reading - GK reading)",
    title = "Estimated effect of class transitions on reading score change",
    fill = NULL
  ) +
  theme(
    axis.text.y = element_text(color = "black", size = 10, face = "bold"),
    plot.title = element_text(hjust = 0.5)
  )

print(p_readchange)
change_df <- x %>%
  select(
    stdntid,
    gkclasstype, g1classtype,
    race,
    gkfreelunch, g1freelunch,
    gkschid, g1schid,
    gktreadss, g1treadss
  ) %>%
  na.omit() %>%
  filter(race %in% c("WHITE", "BLACK")) %>%
  mutate(
    delta_reading = g1treadss - gktreadss,
    gkclasstype = factor(
      gkclasstype,
      levels = c("SMALL CLASS", "REGULAR CLASS", "REGULAR + AIDE CLASS")
    ),
    gkfreelunch = factor(
      gkfreelunch,
      levels = c("FREE LUNCH", "NON-FREE LUNCH")
    ),
    race = factor(
      race,
      levels = c("BLACK", "WHITE")
    ),
    gkschid = factor(gkschid)
  )

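# Change-score mixed model: fixed effects for kindergarten class type, free-lunch
# status, and race; random intercept for kindergarten school; fitted by ML.
change_model <- lmer(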
  delta_reading ~ gkclasstype + gkfreelunch + race + (1 | gkschid),
  data = change_df,
  REML = FALSE
)

anova(change_model)