Load Data

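library(readxl)  # read_excel()
library(dplyr)   # %>% and mutate()
gpa2 <- read_excel("gpa2.xlsx")
gpa2 <- gpa2 %>% mutate(hsizesq = hsize^2)  # squared term for the quadratic model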

Estimate Model

model <- lm(sat ~ hsize + hsizesq, data = gpa2)
summary(model)
## 
## Call:
## lm(formula = sat ~ hsize + hsizesq, data = gpa2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -562.38  -93.07   -3.71   90.62  507.72 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  997.981      6.203 160.875  < 2e-16 ***
## hsize         19.814      3.991   4.965 7.14e-07 ***
## hsizesq       -2.131      0.549  -3.881 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.9 on 4134 degrees of freedom
## Multiple R-squared:  0.00765,    Adjusted R-squared:  0.007169 
## F-statistic: 15.93 on 2 and 4134 DF,  p-value: 1.279e-07
broom::tidy(model)
## # A tibble: 3 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)   998.       6.20     161.   0          
## 2 hsize          19.8      3.99       4.97 0.000000714
## 3 hsizesq        -2.13     0.549     -3.88 0.000106

Regression Output in Usual Form

library(stargazer)
stargazer(model, type = "text", title = "Regression Results", digits = 4)
## 
## Regression Results
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 sat            
## -----------------------------------------------
## hsize                       19.8145***         
##                              (3.9907)          
##                                                
## hsizesq                     -2.1306***         
##                              (0.5490)          
##                                                
## Constant                    997.9805***        
##                              (6.2034)          
##                                                
## -----------------------------------------------
## Observations                   4,137           
## R2                            0.0076           
## Adjusted R2                   0.0072           
## Residual Std. Error    138.9008 (df = 4134)    
## F Statistic          15.9335*** (df = 2; 4134) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

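The estimated regression model, with standard errors in parentheses, is:

\[ \widehat{sat} = \underset{(6.2034)}{997.9805} + \underset{(3.9907)}{19.8145} \cdot hsize - \underset{(0.5490)}{2.1306} \cdot hsize^2, \qquad n = 4137,\; R^2 = 0.0076 \]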

Statistical Significance

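From the regression summary, the coefficient on hsizesq is -2.131 with a standard error of 0.549, giving t = -3.881 and p = 0.000106. The coefficient on hsize is 19.814 (t = 4.965, p = 7.14e-07). Both are significant at the 1% level.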
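
The t statistic and p-value for the quadratic term can be checked by hand. The following is a minimal sketch that assumes the rounded estimate and standard error shown in the stargazer table; the variable names are illustrative and do not appear in the original analysis.

```r
# Rounded values copied from the regression output (assumed inputs)
estimate  <- -2.1306   # coefficient on hsizesq
std_error <- 0.5490    # its standard error
df_resid  <- 4134      # residual degrees of freedom

# t statistic for H0: coefficient on hsizesq equals zero
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

round(t_stat, 3)
signif(p_value, 3)
```

Because the inputs are rounded, the result may differ from the summary output in the last digit.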

Conclusion

The quadratic term hsize^2 is statistically significant, indicating a non-linear relationship between high school size and SAT score. The positive linear term and negative quadratic term imply a concave relationship: predicted SAT scores rise with high school size up to a turning point and fall beyond it.
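
The concave shape can be made explicit by differentiating the fitted equation with respect to hsize. This is a worked step using the reported coefficients, not additional output from the model:

\[ \frac{\partial \widehat{sat}}{\partial hsize} = 19.8145 - 2(2.1306) \cdot hsize = 19.8145 - 4.2612 \cdot hsize \]

At hsize = 1 the slope is about 15.55 SAT points per additional hundred students, while at hsize = 6 it is about -5.75, so the marginal effect changes sign between the two.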

Optimal High School Size

The optimal high school size can be found by computing the vertex of the quadratic equation:

\[ hsize^* = -\frac{19.8145}{2 \times (-2.1306)} = \frac{19.8145}{4.2612} \approx 4.65 \]

This means the SAT score is maximized when the high school size is approximately 4.65 hundred students, or 465 students. The negative coefficient on hsizesq confirms that this is a maximum point.
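
The same calculation can be written as a small helper. This is a sketch only: `turning_point` is a hypothetical function introduced for illustration, and the coefficients are the rounded values from the regression table.

```r
# Turning point of y = b0 + b1 * x + b2 * x^2
# (hypothetical helper, not part of the original analysis)
turning_point <- function(b1, b2) {
  stopifnot(b2 != 0)  # a zero quadratic term has no turning point
  -b1 / (2 * b2)
}

# Rounded coefficients from the level model
hsize_star <- turning_point(b1 = 19.8145, b2 = -2.1306)

round(hsize_star, 2)     # in hundreds of students
round(100 * hsize_star)  # in students
```

In the analysis itself, passing `coef(model)["hsize"]` and `coef(model)["hsizesq"]` avoids the rounding in the hard-coded values.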

Visualizing the Relationship

library(ggplot2)

# Grid of high school sizes spanning the observed range
new_data <- data.frame(hsize = seq(min(gpa2$hsize), max(gpa2$hsize), length.out = 100))
new_data$hsizesq <- new_data$hsize^2
new_data$sat_hat <- predict(model, newdata = new_data)

# Turning point of the fitted quadratic, taken from the estimated coefficients
optimal_hsize <- unname(-coef(model)["hsize"] / (2 * coef(model)["hsizesq"]))

ggplot(new_data, aes(x = hsize, y = sat_hat)) +
  geom_line(color = "blue", linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_vline(xintercept = optimal_hsize, linetype = "dashed", color = "red") +
  labs(title = "Predicted SAT Score vs. High School Size",
       x = "High School Size (hundreds)",
       y = "Predicted SAT Score") +
  annotate("text", x = optimal_hsize + 1, y = max(new_data$sat_hat),
           label = paste("Optimal hsize =", round(optimal_hsize, 2)), color = "red")

This plot shows how predicted SAT scores vary with high school size and marks the size at which the predicted score is maximized.
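
To read values off the curve without the plot, the fitted equation can be evaluated directly. The sketch below assumes the rounded coefficients from the regression table; `predict_sat` and the chosen sizes are illustrative and not part of the original analysis.

```r
# Fitted quadratic using the rounded coefficients (hypothetical helper)
predict_sat <- function(hsize) {
  997.9805 + 19.8145 * hsize - 2.1306 * hsize^2
}

# Illustrative sizes in hundreds of students, including the turning point 4.65
sizes <- c(1, 3, 4.65, 6, 8)

# Predicted SAT score at each size
data.frame(hsize = sizes, sat_hat = round(predict_sat(sizes), 1))
```

The predicted score is highest at the turning point and lower on either side of it, which is the pattern the plot displays.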

Log(SAT) Model and Optimal Size

Now we estimate the same model using log(sat) as the dependent variable:

gpa2 <- gpa2 %>% mutate(logsat = log(sat))
log_model <- lm(logsat ~ hsize + hsizesq, data = gpa2)
summary(log_model)
## 
## Call:
## lm(formula = logsat ~ hsize + hsizesq, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77744 -0.08493  0.00557  0.09465  0.40946 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6.8960291  0.0061515 1121.032  < 2e-16 ***
## hsize        0.0196029  0.0039572    4.954 7.57e-07 ***
## hsizesq     -0.0020872  0.0005444   -3.834 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1377 on 4134 degrees of freedom
## Multiple R-squared:  0.007773,   Adjusted R-squared:  0.007293 
## F-statistic: 16.19 on 2 and 4134 DF,  p-value: 9.885e-08
broom::tidy(log_model)
## # A tibble: 3 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)  6.90     0.00615    1121.   0          
## 2 hsize        0.0196   0.00396       4.95 0.000000757
## 3 hsizesq     -0.00209  0.000544     -3.83 0.000128

We calculate the optimal size for log(SAT):

log_coef <- coef(log_model)
a <- log_coef["hsizesq"]
b <- log_coef["hsize"]
log_optimal_hsize <- -b / (2 * a)
log_optimal_hsize
##    hsize 
## 4.695923

Comparison

The optimal high school size based on log(sat) is calculated using the vertex formula:

\[ hsize^* = -\frac{\beta_1}{2 \cdot \beta_2} \]

The log model puts the turning point at about 4.70 (roughly 470 students), close to but not identical to the 4.65 (465 students) from the level model. The small gap arises because the log transformation changes the functional form being fitted, so the SAT data’s distribution influences the level and log-level fits differently.
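
The log model also changes how the slope is read. As a worked step using the reported log-model coefficients, and relying on the usual approximation that a small change in log(sat) is a proportional change in sat:

\[ \frac{\partial \log(sat)}{\partial hsize} = 0.0196029 - 2(0.0020872) \cdot hsize = 0.0196029 - 0.0041744 \cdot hsize \]

At hsize = 1 this is about 0.0154, so one additional hundred students is associated with a predicted SAT score roughly 1.5% higher. The effect reaches zero at the turning point near 4.70.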

Representativeness of the Analysis

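While this analysis offers useful insights, it is important to consider whether the findings are representative of all high school seniors. Several limitations are visible in the output itself: the model explains less than 1% of the variation in SAT scores (R-squared of 0.0076), it includes no explanatory variables other than high school size, and it is estimated on a single sample of 4,137 students.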

Conclusion

This analysis should be interpreted with caution. While it reveals a statistically significant quadratic relationship between high school size and SAT scores in the sample, the result cannot be generalized to all high school seniors without further validation using broader and more diverse data.