4.8.25 Homework

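library(dplyr)   # select(), %>%
library(tidyr)   # drop_na()
library(lmtest)  # raintest(), bptest()
library(car)     # durbinWatsonTest(), vif()

# Keep the outcome and the three predictors; drop rows with missing values
df_model <- df %>%
  dplyr::select(DPSATOFC, DPSTEXPA, DPSTSPFP, DPFUNAB1T) %>%
  drop_na()

model <- lm(DPSATOFC ~ DPSTEXPA + DPSTSPFP + DPFUNAB1T, data = df_model)
summary(model)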

## 
## Call:
## lm(formula = DPSATOFC ~ DPSTEXPA + DPSTSPFP + DPFUNAB1T, data = df_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6081.7  -159.0   -55.2    37.0  8043.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.387e+02  7.275e+01   3.282  0.00106 ** 
## DPSTEXPA    -2.433e+01  5.402e+00  -4.503 7.36e-06 ***
## DPSTSPFP     1.863e+01  4.689e+00   3.972 7.55e-05 ***
## DPFUNAB1T    3.983e-05  4.779e-07  83.342  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 623.4 on 1196 degrees of freedom
## Multiple R-squared:  0.8617, Adjusted R-squared:  0.8613 
## F-statistic:  2483 on 3 and 1196 DF,  p-value: < 2.2e-16

plot(model, which = 1)

raintest(model)

## 
##  Rainbow test
## 
## data:  model
## Rain = 0.79535, df1 = 600, df2 = 596, p-value = 0.9974

durbinWatsonTest(model)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.07093886      1.858107   0.028
##  Alternative hypothesis: rho != 0

plot(model, which = 3)

bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 192.33, df = 3, p-value < 2.2e-16

plot(model, which = 2)

shapiro.test(resid(model))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model)
## W = 0.49765, p-value < 2.2e-16

vif(model)

##  DPSTEXPA  DPSTSPFP DPFUNAB1T 
##  1.002768  1.044500  1.044045

cor(df_model[, c("DPSTEXPA", "DPSTSPFP", "DPFUNAB1T")])

##              DPSTEXPA  DPSTSPFP   DPFUNAB1T
## DPSTEXPA   1.00000000 0.0363665 -0.02979525
## DPSTSPFP   0.03636650 1.0000000  0.20200478
## DPFUNAB1T -0.02979525 0.2020048  1.00000000

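This model meets the assumption of linearity (Rainbow test, p = 0.997) and shows no multicollinearity (all VIF values are about 1.0, and the largest pairwise correlation among the predictors is 0.20). Independence of errors is borderline: the Durbin-Watson statistic of 1.86 is close to 2, but the test's p-value of 0.028 falls below 0.05, which suggests mild positive autocorrelation. The model violates the assumptions of homoscedasticity (Breusch-Pagan test, p < 0.001) and normality of residuals (Shapiro-Wilk test, p < 0.001).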
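Because the Breusch-Pagan test flags heteroscedasticity, the classical standard errors in the summary above may be unreliable. One common response is to report heteroscedasticity-robust (sandwich) standard errors alongside them. The sketch below is a minimal base-R illustration of the HC0 (White) estimator on simulated data; the variables `x`, `y`, and `fit` are hypothetical stand-ins and are not the homework's `df_model` or `model`.

```r
# Hypothetical simulated data (an assumption for illustration only)
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # error spread grows with x

fit <- lm(y ~ x)

# HC0 (White) sandwich estimator:
# (X'X)^-1  X' diag(e^2) X  (X'X)^-1
X <- model.matrix(fit)
e <- resid(fit)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)            # equals X' diag(e^2) X
robust_se <- sqrt(diag(bread %*% meat %*% bread))

# Compare classical and robust standard errors side by side
cbind(estimate     = coef(fit),
      classical_se = sqrt(diag(vcov(fit))),
      robust_se    = robust_se)
```

In practice the same idea is available through `sandwich::vcovHC()` combined with `lmtest::coeftest()`, for example `coeftest(model, vcov. = vcovHC(model, type = "HC3"))`, assuming the sandwich package is installed. Robust standard errors change the inference, not the coefficient estimates.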

df_model$log_DPSATOFC <- log(df_model$DPSATOFC)
log_model <- lm(log_DPSATOFC ~ DPSTEXPA + DPSTSPFP + DPFUNAB1T, data = df_model)
summary(log_model)

## 
## Call:
## lm(formula = log_DPSATOFC ~ DPSTEXPA + DPSTSPFP + DPFUNAB1T, 
##     data = df_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0418 -0.6281  0.0101  0.6609  3.7277 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.449e+00  1.213e-01  36.692  < 2e-16 ***
## DPSTEXPA    -4.624e-02  9.004e-03  -5.135 3.29e-07 ***
## DPSTSPFP     1.382e-01  7.816e-03  17.682  < 2e-16 ***
## DPFUNAB1T    1.919e-08  7.966e-10  24.094  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.039 on 1196 degrees of freedom
## Multiple R-squared:  0.4879, Adjusted R-squared:  0.4866 
## F-statistic: 379.9 on 3 and 1196 DF,  p-value: < 2.2e-16

plot(log_model, which = 1)

raintest(log_model)

## 
##  Rainbow test
## 
## data:  log_model
## Rain = 1.0222, df1 = 600, df2 = 596, p-value = 0.3942

durbinWatsonTest(log_model)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.1422639      1.715264       0
##  Alternative hypothesis: rho != 0

plot(log_model, which = 3)

bptest(log_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  log_model
## BP = 387.99, df = 3, p-value < 2.2e-16

plot(log_model, which = 2)

shapiro.test(resid(log_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(log_model)
## W = 0.97451, p-value = 1.044e-13

vif(log_model)

##  DPSTEXPA  DPSTSPFP DPFUNAB1T 
##  1.002768  1.044500  1.044045

My original linear model, which uses average SAT scores (DPSATOFC) as the dependent variable, meets some, but not all, of the standard linear regression assumptions. It violates two of them: homoscedasticity and normality of residuals. To improve the model, I applied a log transformation to the dependent variable and created a new variable, log_DPSATOFC. The transformation moved the residuals much closer to normal (Shapiro-Wilk W rose from 0.50 to 0.97), but the log model still does not meet all assumptions: the Breusch-Pagan and Shapiro-Wilk tests both remain significant (p < 0.001), the Durbin-Watson test now indicates autocorrelated errors (reported p-value of 0), and R-squared drops from 0.86 to 0.49. Linearity still holds (Rainbow test, p = 0.39). My predictors make sense and are not collinear (all VIF values are about 1.0).

In both models, teacher experience (DPSTEXPA) has a negative coefficient, so it is not the most impactful predictor of higher test scores and is actually associated with lower scores. Other studies have made a case for possible resource deficits in the most heavily affected districts. The surprising result is the positive coefficient on DPSTSPFP: scores are higher where the percentage of special education students is higher.
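Because the second model predicts log(DPSATOFC), its coefficients are easier to read as percent changes: a one-unit increase in a predictor multiplies the predicted DPSATOFC by exp(b), holding the other predictors fixed. The sketch below applies that conversion to the coefficients copied from the log_model summary; the vector name `b` and the 1,000,000-unit rescaling of DPFUNAB1T are choices made here for illustration.

```r
# Coefficients copied from the log_model summary output
b <- c(DPSTEXPA = -4.624e-02, DPSTSPFP = 1.382e-01, DPFUNAB1T = 1.919e-08)

# Percent change in DPSATOFC for a one-unit increase in each predictor.
# expm1(b) computes exp(b) - 1 accurately even for very small b.
pct_change <- expm1(b) * 100
round(pct_change, 2)

# DPFUNAB1T has a tiny per-unit effect, so rescale it to a
# 1,000,000-unit increase (an illustrative choice of scale)
pct_per_million <- expm1(b[["DPFUNAB1T"]] * 1e6) * 100
round(pct_per_million, 2)
```

Read this way, one more unit of DPSTEXPA goes with roughly a 4.5% lower predicted DPSATOFC, and one more unit of DPSTSPFP with roughly a 14.8% higher one. These are associations from the fitted model, not causal effects.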

Pamela Williamson-Wyllie

2025-04-08