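library(readxl)     # read_excel()
library(dplyr)      # mutate(), %>%
library(broom)      # tidy()
library(stargazer)  # stargazer()
gpa2 <- read_excel("gpa2.xlsx")
gpa2 <- gpa2 %>% mutate(hsizesq = hsize^2)  # hsize is measured in hundreds of students
model <- lm(sat ~ hsize + hsizesq, data = gpa2)
summary(model)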
##
## Call:
## lm(formula = sat ~ hsize + hsizesq, data = gpa2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -562.38  -93.07   -3.71   90.62  507.72
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  997.981      6.203 160.875  < 2e-16 ***
## hsize         19.814      3.991   4.965 7.14e-07 ***
## hsizesq       -2.131      0.549  -3.881 0.000106 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 138.9 on 4134 degrees of freedom
## Multiple R-squared: 0.00765, Adjusted R-squared: 0.007169
## F-statistic: 15.93 on 2 and 4134 DF, p-value: 1.279e-07
tidy(model)
## # A tibble: 3 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)   998.       6.20     161.   0
## 2 hsize          19.8      3.99       4.97 0.000000714
## 3 hsizesq        -2.13     0.549     -3.88 0.000106
stargazer(model, type = "text", title = "Regression Results", digits = 4)
##
## Regression Results
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                                 sat
## -----------------------------------------------
## hsize                       19.8145***
##                              (3.9907)
##
## hsizesq                     -2.1306***
##                              (0.5490)
##
## Constant                   997.9805***
##                              (6.2034)
##
## -----------------------------------------------
## Observations                   4,137
## R2                            0.0076
## Adjusted R2                   0.0072
## Residual Std. Error    138.9008 (df = 4134)
## F Statistic         15.9335*** (df = 2; 4134)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
The estimated regression model is:
\[ \hat{sat} = 997.9805 + 19.8145 \cdot hsize - 2.1306 \cdot hsize^2 \]
From the regression summary:
hsize: 19.8145, statistically significant with p-value < 0.001
hsizesq: -2.1306, statistically significant with p-value < 0.001
The quadratic term hsize^2 is statistically significant, indicating a non-linear relationship between high school size and SAT score. The positive linear term and negative quadratic term imply a concave relationship: predicted SAT scores initially increase with high school size but eventually decrease beyond a turning point.
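One way to see where that turning point lies is to differentiate the fitted equation with respect to hsize. This is a worked step using the rounded coefficients reported above:

\[ \frac{\partial \hat{sat}}{\partial hsize} = 19.8145 - 2 \times 2.1306 \cdot hsize = 19.8145 - 4.2612 \cdot hsize \]

The marginal effect of size is positive while \( hsize < 19.8145 / 4.2612 \) and negative beyond that value, so the fitted curve rises, flattens, and then falls.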
The optimal high school size can be found by computing the vertex of the quadratic equation:
\[ hsize^* = -\frac{19.8145}{2 \times (-2.1306)} = \frac{19.8145}{4.2612} \approx 4.65 \]
This means the SAT score is maximized when the high school size is approximately 4.65 hundred students, or 465 students. The negative coefficient on hsizesq confirms that this is a maximum point.
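The vertex calculation above can be double-checked numerically. The sketch below hard-codes the rounded coefficients from the regression output; the helper `sat_hat()` and the comparison sizes are illustrative choices, not part of the original analysis.

```r
# Rounded coefficients copied from the regression output above.
b0 <- 997.9805
b1 <- 19.8145
b2 <- -2.1306

# Fitted SAT score for a given school size (in hundreds of students).
sat_hat <- function(hsize) b0 + b1 * hsize + b2 * hsize^2

# Turning point of the fitted quadratic.
hsize_star <- -b1 / (2 * b2)

# Fitted values at the turning point and half a unit (50 students) either side.
fitted_near_star <- sat_hat(hsize_star + c(-0.5, 0, 0.5))
fitted_near_star
```

The middle value should be the largest of the three, consistent with a maximum at roughly 465 students. With these rounded coefficients the fitted score at the turning point is about 1044.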
library(ggplot2)
# Grid of hsize values spanning the observed range, with fitted SAT on that grid
new_data <- data.frame(hsize = seq(min(gpa2$hsize), max(gpa2$hsize), length.out = 100))
new_data$hsizesq <- new_data$hsize^2
new_data$sat_hat <- predict(model, newdata = new_data)
# Vertex of the fitted quadratic, taken from the model instead of hard-coded numbers
optimal_hsize <- unname(-coef(model)["hsize"] / (2 * coef(model)["hsizesq"]))
ggplot(new_data, aes(x = hsize, y = sat_hat)) +
  geom_line(color = "blue", linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_vline(xintercept = optimal_hsize, linetype = "dashed", color = "red") +
  labs(title = "Predicted SAT Score vs. High School Size",
       x = "High School Size (hundreds)",
       y = "Predicted SAT Score") +
  annotate("text", x = optimal_hsize + 1, y = max(new_data$sat_hat),
           label = paste("Optimal hsize =", round(optimal_hsize, 2)), color = "red")
This plot shows how predicted SAT scores vary with high school size. The dashed vertical line marks the size at which the predicted score is maximized.
Now we estimate the same model using log(sat) as the dependent variable:
gpa2 <- gpa2 %>% mutate(logsat = log(sat))
log_model <- lm(logsat ~ hsize + hsizesq, data = gpa2)
summary(log_model)
##
## Call:
## lm(formula = logsat ~ hsize + hsizesq, data = gpa2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.77744 -0.08493  0.00557  0.09465  0.40946
##
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  6.8960291  0.0061515 1121.032  < 2e-16 ***
## hsize        0.0196029  0.0039572    4.954 7.57e-07 ***
## hsizesq     -0.0020872  0.0005444   -3.834 0.000128 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1377 on 4134 degrees of freedom
## Multiple R-squared: 0.007773, Adjusted R-squared: 0.007293
## F-statistic: 16.19 on 2 and 4134 DF, p-value: 9.885e-08
tidy(log_model)
## # A tibble: 3 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)  6.90     0.00615    1121.   0
## 2 hsize        0.0196   0.00396       4.95 0.000000757
## 3 hsizesq     -0.00209  0.000544     -3.83 0.000128
We calculate the optimal size for log(SAT):
log_coef <- coef(log_model)
a <- log_coef["hsizesq"]
b <- log_coef["hsize"]
log_optimal_hsize <- -b / (2 * a)
log_optimal_hsize
## hsize
## 4.695923
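The optimal high school size based on log(sat) is calculated using the vertex formula, where \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are the estimated coefficients on hsize and hsizesq:
\[ hsize^* = -\frac{\hat{\beta}_1}{2 \cdot \hat{\beta}_2} \]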
With the estimates above, the log model puts the optimal size at about 4.70 hundred students (roughly 470 students), similar though not identical to the level model's 4.65. The small gap reflects the change in functional form: the log model fits proportional changes in SAT where the level model fits absolute ones, so the two fitted quadratics need not peak at exactly the same size.
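As a cross-check, both turning points can be recomputed from the rounded coefficients printed in the two regression outputs. This is a minimal sketch; the names `vertex()`, `level_star`, `log_star` and `gap_students` are illustrative and do not appear in the original analysis.

```r
# Coefficients copied from the two regression outputs above.
level_b1 <- 19.8145;   level_b2 <- -2.1306     # sat ~ hsize + hsizesq
log_b1   <- 0.0196029; log_b2   <- -0.0020872  # log(sat) ~ hsize + hsizesq

# Turning point of a fitted quadratic b1 * x + b2 * x^2 is -b1 / (2 * b2).
vertex <- function(b1, b2) -b1 / (2 * b2)

level_star <- vertex(level_b1, level_b2)
log_star   <- vertex(log_b1, log_b2)

# Gap between the two optima, converted from hundreds of students to students.
gap_students <- (log_star - level_star) * 100

round(c(level = level_star, log = log_star, gap_students = gap_students), 2)
```

Under these rounded inputs the two specifications differ by only a handful of students, which is small relative to the scale of hsize.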
While this analysis offers useful insights, it is important to consider whether the findings are representative of all high school seniors. Several limitations apply:
Sample-specific: The model is based on the gpa2 dataset, which may not include a nationally representative sample. It likely reflects a specific region, year, or subset of students (e.g., those applying to a particular college).
Omitted Variables: Many factors affect SAT scores (e.g., socioeconomic status, school resources, teacher quality, family background). These are not included in the model, which may lead to omitted variable bias.
Causality: The model estimates a correlation, not causation. A larger or smaller school size may be associated with higher or lower SAT scores due to unobserved factors.
Measurement: SAT is only one measure of academic performance, and it does not capture other dimensions such as GPA, motivation, or non-cognitive skills.
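To make the omitted-variable point concrete, here is a small simulation on made-up data (not the gpa2 sample). All variable names and parameter values are hypothetical: `z` stands in for an unobserved factor such as school resources, `x` for a regressor correlated with it, and the true effect of `x` on `y` is set to zero by construction.

```r
set.seed(1)
n <- 5000

z <- rnorm(n)                        # hypothetical unobserved factor
x <- 0.8 * z + rnorm(n)              # regressor correlated with z
y <- 1 + 0 * x + 2 * z + rnorm(n)    # x has no true effect on y

# Short regression omits z, so x picks up part of z's effect.
b_short <- unname(coef(lm(y ~ x))["x"])

# Long regression controls for z, so the coefficient on x is close to zero.
b_long <- unname(coef(lm(y ~ x + z))["x"])

c(short = b_short, long = b_long)
```

In this artificial setup the short regression reports a sizeable coefficient on `x` even though the true effect is zero. The same mechanism could distort the hsize coefficients if school size is correlated with factors left out of the model.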
This analysis should be interpreted with caution. While it reveals a statistically significant quadratic relationship between high school size and SAT scores in the sample, the result cannot be generalized to all high school seniors without further validation using broader and more diverse data.