WOOLDRIDGE Chapter 6 Exercise C.4


Load Data

library(readxl)
library(dplyr)
gpa2 <- read_excel("gpa2.xlsx")
gpa2 <- gpa2 %>% mutate(hsizesq = hsize^2)

Estimate Model

model <- lm(sat ~ hsize + hsizesq, data = gpa2)
summary(model)

## 
## Call:
## lm(formula = sat ~ hsize + hsizesq, data = gpa2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -562.38  -93.07   -3.71   90.62  507.72 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  997.981      6.203 160.875  < 2e-16 ***
## hsize         19.814      3.991   4.965 7.14e-07 ***
## hsizesq       -2.131      0.549  -3.881 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.9 on 4134 degrees of freedom
## Multiple R-squared:  0.00765,    Adjusted R-squared:  0.007169 
## F-statistic: 15.93 on 2 and 4134 DF,  p-value: 1.279e-07

library(broom)
tidy(model)

## # A tibble: 3 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)   998.       6.20     161.   0          
## 2 hsize          19.8      3.99       4.97 0.000000714
## 3 hsizesq        -2.13     0.549     -3.88 0.000106

Regression Output in Usual Form

library(stargazer)
stargazer(model, type = "text", title = "Regression Results", digits = 4)

## 
## Regression Results
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 sat            
## -----------------------------------------------
## hsize                       19.8145***         
##                              (3.9907)          
##                                                
## hsizesq                     -2.1306***         
##                              (0.5490)          
##                                                
## Constant                    997.9805***        
##                              (6.2034)          
##                                                
## -----------------------------------------------
## Observations                   4,137           
## R2                            0.0076           
## Adjusted R2                   0.0072           
## Residual Std. Error    138.9008 (df = 4134)    
## F Statistic          15.9335*** (df = 2; 4134) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

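The estimated regression model, with standard errors in parentheses, is:

\[ \hat{sat} = \underset{(6.2034)}{997.9805} + \underset{(3.9907)}{19.8145} \cdot hsize - \underset{(0.5490)}{2.1306} \cdot hsize^2, \quad n = 4137, \; R^2 = 0.0076 \]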

Statistical Significance

From the regression summary:

Coefficient on hsize: 19.8145, statistically significant with p-value < 0.001
Coefficient on hsizesq: -2.1306, statistically significant with p-value < 0.001
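
As a cross-check, the t statistics and two-sided p-values can be reproduced by hand from the reported estimates and standard errors. The sketch below assumes only the numbers printed in the regression table and the 4,134 residual degrees of freedom; the variable names are illustrative.

```r
# Estimates and standard errors copied from the regression table above
est <- c(hsize = 19.8145, hsizesq = -2.1306)
se  <- c(hsize = 3.9907,  hsizesq = 0.5490)
df  <- 4134  # residual degrees of freedom reported by summary(model)

# t statistic = estimate / standard error
t_stat <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = df)

print(round(t_stat, 3))
print(signif(p_val, 3))
```

Both statistics agree with the t value column of summary(model) up to rounding, and both p-values fall below 0.001.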

Conclusion

The quadratic term hsize^2 is statistically significant, indicating a non-linear relationship between high school size and SAT score. The positive linear term and negative quadratic term imply a concave relationship: predicted SAT scores initially rise with high school size, then fall once size passes a turning point.
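
One way to see the concavity is to evaluate the slope of the fitted curve at a few sizes. The sketch below uses the reported coefficients; the function name `marginal_effect` and the sizes chosen are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the fitted level model
b1 <- 19.8145   # coefficient on hsize
b2 <- -2.1306   # coefficient on hsizesq

# Slope of the fitted quadratic: d(sat)/d(hsize) = b1 + 2 * b2 * hsize
marginal_effect <- function(hsize) b1 + 2 * b2 * hsize

# Illustrative sizes, in hundreds of students (chosen for this example)
sizes <- c(1, 3, 5, 7)
slopes <- marginal_effect(sizes)
print(data.frame(hsize = sizes, slope = round(slopes, 2)))
```

The slope is positive at 100 and 300 students and negative at 500 and 700, so the fitted curve changes direction somewhere between 300 and 500 students.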

Optimal High School Size

The optimal high school size can be found by computing the vertex of the quadratic equation:

\[ hsize^* = -\frac{19.8145}{2 \times (-2.1306)} = \frac{19.8145}{4.2612} \approx 4.65 \]

This means the SAT score is maximized when the high school size is approximately 4.65 hundred students, or 465 students. The negative coefficient on hsizesq confirms that this is a maximum point.
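
The vertex formula follows from the first-order condition for the fitted quadratic. Writing \(\hat{\beta}_1\) and \(\hat{\beta}_2\) for the estimated coefficients on hsize and hsize^2 (notation introduced here for the derivation):

\[ \frac{\partial \hat{sat}}{\partial hsize} = \hat{\beta}_1 + 2 \hat{\beta}_2 \cdot hsize = 0 \quad \Longrightarrow \quad hsize^* = -\frac{\hat{\beta}_1}{2 \hat{\beta}_2} \]

\[ \frac{\partial^2 \hat{sat}}{\partial hsize^2} = 2 \hat{\beta}_2 = 2 \times (-2.1306) = -4.2612 < 0 \]

Because the second derivative is negative, the stationary point is a maximum rather than a minimum.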

Visualizing the Relationship

new_data <- data.frame(hsize = seq(min(gpa2$hsize), max(gpa2$hsize), length.out = 100))
new_data$hsizesq <- new_data$hsize^2
new_data$sat_hat <- predict(model, newdata = new_data)

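library(ggplot2)

# Turning point taken from the fitted coefficients rather than hand-typed values
optimal_hsize <- unname(-coef(model)["hsize"] / (2 * coef(model)["hsizesq"]))

ggplot(new_data, aes(x = hsize, y = sat_hat)) +
  geom_line(color = "blue", linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0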
  geom_vline(xintercept = optimal_hsize, linetype = "dashed", color = "red") +
  labs(title = "Predicted SAT Score vs. High School Size",
       x = "High School Size (hundreds)",
       y = "Predicted SAT Score") +
  annotate("text", x = optimal_hsize + 1, y = max(new_data$sat_hat), label = paste("Optimal hsize =", round(optimal_hsize, 2)), color = "red")

This plot shows the fitted quadratic: the predicted SAT score rises with high school size up to the dashed line at the estimated turning point and falls beyond it.

Log(SAT) Model and Optimal Size

Now we estimate the same model using log(sat) as the dependent variable:

gpa2 <- gpa2 %>% mutate(logsat = log(sat))
log_model <- lm(logsat ~ hsize + hsizesq, data = gpa2)
summary(log_model)

## 
## Call:
## lm(formula = logsat ~ hsize + hsizesq, data = gpa2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77744 -0.08493  0.00557  0.09465  0.40946 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6.8960291  0.0061515 1121.032  < 2e-16 ***
## hsize        0.0196029  0.0039572    4.954 7.57e-07 ***
## hsizesq     -0.0020872  0.0005444   -3.834 0.000128 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1377 on 4134 degrees of freedom
## Multiple R-squared:  0.007773,   Adjusted R-squared:  0.007293 
## F-statistic: 16.19 on 2 and 4134 DF,  p-value: 9.885e-08

tidy(log_model)

## # A tibble: 3 × 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)  6.90     0.00615    1121.   0          
## 2 hsize        0.0196   0.00396       4.95 0.000000757
## 3 hsizesq     -0.00209  0.000544     -3.83 0.000128

We calculate the optimal size for log(SAT):

log_coef <- coef(log_model)
a <- log_coef["hsizesq"]
b <- log_coef["hsize"]
log_optimal_hsize <- -b / (2 * a)
log_optimal_hsize

##    hsize 
## 4.695923
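
Since the same turning-point calculation is needed for both models, it could be wrapped in a small helper. The function below is a hypothetical convenience (the name `turning_point` and its arguments are not from the original analysis), and it is demonstrated on simulated data rather than gpa2 so that it runs on its own.

```r
# Hypothetical helper: turning point of a fitted quadratic, -b1 / (2 * b2)
turning_point <- function(fit, linear, quadratic) {
  b <- coef(fit)
  unname(-b[linear] / (2 * b[quadratic]))
}

# Simulated data (not gpa2) with a known turning point at x = 5
set.seed(1)
sim <- data.frame(x = seq(0, 10, length.out = 500))
sim$xsq <- sim$x^2
sim$y <- 1000 + 20 * sim$x - 2 * sim$xsq + rnorm(nrow(sim), sd = 1)

sim_fit <- lm(y ~ x + xsq, data = sim)
tp <- turning_point(sim_fit, "x", "xsq")
print(tp)  # should be close to the true value of 5
```

With the models in this document, the equivalent calls would be turning_point(model, "hsize", "hsizesq") and turning_point(log_model, "hsize", "hsizesq").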

Comparison

The optimal high school size based on log(sat) comes from the same vertex formula, applied to the log-model coefficients:

\[ hsize^* = -\frac{\hat{\beta}_1}{2 \cdot \hat{\beta}_2} = \frac{0.0196029}{2 \times 0.0020872} \approx 4.70 \]

The log(sat) model therefore puts the optimum at about 470 students, against about 465 in the level model. The two estimates differ by roughly 5 students, so the choice between sat and log(sat) as the dependent variable makes little practical difference to the estimated optimal size. The small gap may reflect differences in how the SAT data’s distribution influences linear versus log-linear fits.
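
The two turning points can be compared directly from the reported coefficients. This is a minimal sketch that assumes only the estimates printed in the two regression outputs; the helper name `vertex` is illustrative.

```r
# Coefficients copied from the two regression outputs
level_b <- c(hsize = 19.8145,   hsizesq = -2.1306)     # sat model
log_b   <- c(hsize = 0.0196029, hsizesq = -0.0020872)  # log(sat) model

# Turning point of each fitted quadratic, in hundreds of students
vertex <- function(b) unname(-b["hsize"] / (2 * b["hsizesq"]))
level_opt <- vertex(level_b)
log_opt   <- vertex(log_b)

# Gap between the two estimates, converted to students
gap_students <- (log_opt - level_opt) * 100

print(round(c(level = level_opt, log = log_opt), 3))
print(round(gap_students, 1))
```

The gap comes to fewer than five students, which is small relative to a graduating class measured in hundreds.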

Representativeness of the Analysis

While this analysis offers useful insights, it is important to consider whether the findings are representative of all high school seniors. Several limitations apply:

Sample-specific: The model is based on the gpa2 dataset, which may not be a nationally representative sample. Every observation has a recorded SAT score, so seniors who never took the SAT are excluded by construction, and the data likely reflect a specific region, year, or subset of students (e.g., those applying to a particular college).
Omitted Variables: Many factors affect SAT scores (e.g., socioeconomic status, school resources, teacher quality, family background). These are not included in the model, which may lead to omitted variable bias.
Causality: The model estimates a correlation, not causation. A larger or smaller school size may be associated with higher or lower SAT scores due to unobserved factors.
Measurement: SAT is only one measure of academic performance, and it does not capture other dimensions such as GPA, motivation, or non-cognitive skills.

Conclusion

This analysis should be interpreted with caution. While it reveals a statistically significant quadratic relationship between high school size and SAT scores in the sample, the result cannot be generalized to all high school seniors without further validation using broader and more diverse data.

WOOLDRIDGE Chapter 6 Exercise C.4 - GPA2

Robby Fathir Nashary

2025-05-24
