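library(dplyr)   # select(), %>%
library(tidyr)   # drop_na()
library(lmtest)  # raintest(), bptest()
library(car)     # durbinWatsonTest(), vif()
# Keep the outcome and the three predictors; drop incomplete rows
df_model <- df %>%
  dplyr::select(DPSATOFC, DPSTEXPA, DPSTSPFP, DPFUNAB1T) %>%
  drop_na()
model <- lm(DPSATOFC ~ DPSTEXPA + DPSTSPFP + DPFUNAB1T, data = df_model)
summary(model)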
##
## Call:
## lm(formula = DPSATOFC ~ DPSTEXPA + DPSTSPFP + DPFUNAB1T, data = df_model)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6081.7  -159.0   -55.2    37.0  8043.7
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.387e+02  7.275e+01   3.282  0.00106 **
## DPSTEXPA    -2.433e+01  5.402e+00  -4.503 7.36e-06 ***
## DPSTSPFP     1.863e+01  4.689e+00   3.972 7.55e-05 ***
## DPFUNAB1T    3.983e-05  4.779e-07  83.342  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 623.4 on 1196 degrees of freedom
## Multiple R-squared: 0.8617, Adjusted R-squared: 0.8613
## F-statistic: 2483 on 3 and 1196 DF, p-value: < 2.2e-16
plot(model, which = 1)
raintest(model)
##
## Rainbow test
##
## data: model
## Rain = 0.79535, df1 = 600, df2 = 596, p-value = 0.9974
durbinWatsonTest(model)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.07093886      1.858107   0.028
##  Alternative hypothesis: rho != 0
plot(model, which = 3)
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 192.33, df = 3, p-value < 2.2e-16
plot(model, which = 2)
shapiro.test(resid(model))
##
## Shapiro-Wilk normality test
##
## data: resid(model)
## W = 0.49765, p-value < 2.2e-16
vif(model)
##  DPSTEXPA  DPSTSPFP DPFUNAB1T
##  1.002768  1.044500  1.044045
cor(df_model[, c("DPSTEXPA", "DPSTSPFP", "DPFUNAB1T")])
##              DPSTEXPA  DPSTSPFP   DPFUNAB1T
## DPSTEXPA   1.00000000 0.0363665 -0.02979525
## DPSTSPFP   0.03636650 1.0000000  0.20200478
## DPFUNAB1T -0.02979525 0.2020048  1.00000000
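The original model meets the linearity assumption (Rainbow test, p = 0.997) and shows no multicollinearity (all VIFs are below 1.05). Independence of errors is borderline: the Durbin-Watson statistic (1.86) is close to 2, but the test p-value (0.028) points to mild positive autocorrelation at the 0.05 level. The model violates the assumptions of homoscedasticity (Breusch-Pagan test, p < 0.001) and normality of residuals (Shapiro-Wilk test, p < 0.001).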
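To make the Breusch-Pagan result easier to read, here is a minimal base-R sketch of what the studentized version of the test measures: the squared residuals are regressed on the same predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The data are simulated for illustration (the names `x1`, `x2`, `y`, `fit`, and `aux` are hypothetical and are not the district data), and the result should agree with the default `bptest()` output on the same model.

```r
# Sketch of the studentized (Koenker) Breusch-Pagan statistic in base R.
# Simulated data for illustration only, not the data used in this report.
set.seed(42)
n  <- 500
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
# The error spread grows with x1, so the error variance is not constant
y  <- 2 + 3 * x1 - x2 + rnorm(n, sd = 0.5 * x1)

fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(resid(fit)^2 ~ x1 + x2)

bp_stat <- n * summary(aux)$r.squared   # test statistic
bp_df   <- length(coef(aux)) - 1        # one degree of freedom per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value means the residual variance changes systematically with the predictors, which is the pattern the scale-location plot shows visually.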
df_model$log_DPSATOFC <- log(df_model$DPSATOFC)
log_model <- lm(log_DPSATOFC ~ DPSTEXPA + DPSTSPFP + DPFUNAB1T, data = df_model)
summary(log_model)
##
## Call:
## lm(formula = log_DPSATOFC ~ DPSTEXPA + DPSTSPFP + DPFUNAB1T,
##     data = df_model)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.0418 -0.6281  0.0101  0.6609  3.7277
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.449e+00  1.213e-01  36.692  < 2e-16 ***
## DPSTEXPA    -4.624e-02  9.004e-03  -5.135 3.29e-07 ***
## DPSTSPFP     1.382e-01  7.816e-03  17.682  < 2e-16 ***
## DPFUNAB1T    1.919e-08  7.966e-10  24.094  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.039 on 1196 degrees of freedom
## Multiple R-squared: 0.4879, Adjusted R-squared: 0.4866
## F-statistic: 379.9 on 3 and 1196 DF, p-value: < 2.2e-16
plot(log_model, which = 1)
raintest(log_model)
##
## Rainbow test
##
## data: log_model
## Rain = 1.0222, df1 = 600, df2 = 596, p-value = 0.3942
durbinWatsonTest(log_model)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1422639      1.715264       0
##  Alternative hypothesis: rho != 0
plot(log_model, which = 3)
bptest(log_model)
##
## studentized Breusch-Pagan test
##
## data: log_model
## BP = 387.99, df = 3, p-value < 2.2e-16
plot(log_model, which = 2)
shapiro.test(resid(log_model))
##
## Shapiro-Wilk normality test
##
## data: resid(log_model)
## W = 0.97451, p-value = 1.044e-13
vif(log_model)
##  DPSTEXPA  DPSTSPFP DPFUNAB1T
##  1.002768  1.044500  1.044045
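My original linear model, which uses average SAT scores (DPSATOFC) as the dependent variable, meets some but not all of the standard linear regression assumptions: it violates homoscedasticity and normality of residuals. To improve the model, I applied a log transformation to the dependent variable and created a new variable, log_DPSATOFC. The transformation brought the residuals closer to normality (Shapiro-Wilk W rose from 0.50 to 0.97), and the log model still passes the Rainbow test for linearity (p = 0.394). It does not meet all assumptions, however: the Breusch-Pagan and Shapiro-Wilk tests still reject (p < 0.001), and R-squared drops from 0.86 to 0.49. The predictors are not collinear (all VIFs are below 1.05). The Durbin-Watson statistics (1.86 and 1.72) are fairly close to 2, although the test flags some positive autocorrelation in both models, so the errors are not fully independent.

Both models show that teacher experience is not the strongest driver of higher scores. Its coefficient is negative in both models, so more experience is associated with lower scores once the other predictors are held constant. Other studies have made a case for possible resource detriments in those heavily affected districts. The surprising result is the positive association between scores and the percentage of special education students: the DPSTSPFP coefficient is positive and significant in both models.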
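Because the second model predicts the log of DPSATOFC, its coefficients are easier to interpret after back-transformation. The sketch below copies the estimates from the `summary(log_model)` output and converts each to the percent change in DPSATOFC associated with a one-unit increase in that predictor, using (exp(b) - 1) * 100. The names `log_coefs` and `pct_change` are illustrative, and what counts as "one unit" depends on how each variable is measured in the source data.

```r
# Coefficient estimates copied from the summary(log_model) output
log_coefs <- c(
  DPSTEXPA  = -4.624e-02,
  DPSTSPFP  =  1.382e-01,
  DPFUNAB1T =  1.919e-08
)

# Percent change in DPSATOFC per one-unit increase in each predictor,
# holding the other predictors constant
pct_change <- (exp(log_coefs) - 1) * 100

round(pct_change, 2)
```

The DPFUNAB1T effect looks negligible per single unit only because that variable is measured on a much larger scale than the other two; rescaling it before fitting would give a more readable coefficient.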