model <- lm(DPSATOFC ~ DPSTEXPA + DPSTBLFP + DPSTHIFP + DPSTWHFP
+ DPSTSPFP + DPFUNAB1T, data = data)
summary(model)
##
## Call:
## lm(formula = DPSATOFC ~ DPSTEXPA + DPSTBLFP + DPSTHIFP + DPSTWHFP +
## DPSTSPFP + DPFUNAB1T, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5932.0 -161.9 -41.5 60.4 7925.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.575e+02 3.177e+02 1.755 0.0795 .
## DPSTEXPA -1.189e+01 6.217e+00 -1.912 0.0561 .
## DPSTBLFP -2.927e+00 3.637e+00 -0.805 0.4212
## DPSTHIFP -2.319e+00 3.373e+00 -0.688 0.4918
## DPSTWHFP -5.688e+00 3.356e+00 -1.695 0.0904 .
## DPSTSPFP 2.037e+01 4.693e+00 4.340 1.55e-05 ***
## DPFUNAB1T 3.940e-05 4.839e-07 81.428 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 618.6 on 1193 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.8641, Adjusted R-squared: 0.8634
## F-statistic: 1264 on 6 and 1193 DF, p-value: < 2.2e-16
The model above means that 86% of the variations are explained by teacher experience and demographics. Likely more significant than teacher experience is economically disadvantaged students and special education students.One of the most interesting statistics is that the teacher experience impacted the scores in a slightly negative way. I would not have initially expected this, but knowing that many tenured teachers work in low income areas, or with special needs kids this could explain a bit of this. The independent variables are good predictors of the SAT scores, which is shown by the p value. The variables that were not significantly influential are the demographics of Black, Hispanic, White, and teacher experience. As my independent variables, I selected DPSFUNABIT and DPSTSPFP, which are the special education students and the economically disadvantaged students. For every 1% increase the districts average score increased by 20 points. The impact was not as significant with the economically disadvantaged students. I was suprised to see the positive impact of the sped students on the overall test scores. I will be digging into this further, but suspect that there is more programming and teachers that possess more credentials in these particular districts and schools. When you teach for all and not just some, you have better results.
plot(model, which = 1)
This model above shows that it violates the linear assumption. There is
a curve upward and the spread increases. The SAT scores and independent
variables are likely not linear in relationship. I am going to test a
variable with the log transformation.
#data <- data %>%
data <- data %>%
filter(!is.na(DPFUNAB1T) & DPFUNAB1T >= 0) %>%
mutate(log_FUNAB1T = log(DPFUNAB1T + 1))
model2 <- lm(DPSATOFC ~ DPSTEXPA + DPSTBLFP + DPSTHIFP + DPSTWHFP +
DPSTSPFP + log_FUNAB1T, data = data)
summary(model2)
##
## Call:
## lm(formula = DPSATOFC ~ DPSTEXPA + DPSTBLFP + DPSTHIFP + DPSTWHFP +
## DPSTSPFP + log_FUNAB1T, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2795.4 -530.3 -220.7 103.5 21240.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2796.396 773.961 3.613 0.000315 ***
## DPSTEXPA -45.009 16.048 -2.805 0.005119 **
## DPSTBLFP -13.013 8.828 -1.474 0.140763
## DPSTHIFP -26.846 8.262 -3.249 0.001190 **
## DPSTWHFP -41.468 8.219 -5.046 5.23e-07 ***
## DPSTSPFP 58.470 11.775 4.966 7.85e-07 ***
## log_FUNAB1T 114.732 9.738 11.782 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1501 on 1190 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.2019, Adjusted R-squared: 0.1979
## F-statistic: 50.17 on 6 and 1190 DF, p-value: < 2.2e-16
summary(model2)
##
## Call:
## lm(formula = DPSATOFC ~ DPSTEXPA + DPSTBLFP + DPSTHIFP + DPSTWHFP +
## DPSTSPFP + log_FUNAB1T, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2795.4 -530.3 -220.7 103.5 21240.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2796.396 773.961 3.613 0.000315 ***
## DPSTEXPA -45.009 16.048 -2.805 0.005119 **
## DPSTBLFP -13.013 8.828 -1.474 0.140763
## DPSTHIFP -26.846 8.262 -3.249 0.001190 **
## DPSTWHFP -41.468 8.219 -5.046 5.23e-07 ***
## DPSTSPFP 58.470 11.775 4.966 7.85e-07 ***
## log_FUNAB1T 114.732 9.738 11.782 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1501 on 1190 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.2019, Adjusted R-squared: 0.1979
## F-statistic: 50.17 on 6 and 1190 DF, p-value: < 2.2e-16
When transforming the model, the independent variables significance increased, but the “fit” worsened. In the first model there were 2 variables that were strong, but in the second there were 5, which just reveals that there might be some extensive do dependencies that can effect test scores. When the model transformed only 20% can be explained as opposed to the 86% of the first model.
library(ggplot2)
library(broom)
library(dplyr)
data_clean <- data %>%
filter(!is.na(DPSATOFC), !is.na(DPFUNAB1T))
model <- lm(DPSATOFC ~ DPSTEXPA + DPSTBLFP + DPSTHIFP + DPSTWHFP + DPSTSPFP + DPFUNAB1T, data = data_clean)
data_clean <- data_clean %>%
mutate(predicted_SAT = predict(model),
residuals = DPSATOFC - predicted_SAT)
# Plot 1: Predicted vs Actual
ggplot(data_clean, aes(x = predicted_SAT, y = DPSATOFC)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "blue", linetype = "dashed") +
labs(
title = "Predicted vs. Actual SAT Scores",
x = "Predicted SAT Score",
y = "Actual SAT Score"
) +
theme_minimal()
```Using all the independent variables, we can see that those are good predictors of SAT scores, there are some outliers, but there are mostly clusters on the dashed line. The outliers could be missing data, or districts worth exploring further. I wanted to create this plot to see a better visual representation, but also know for my research paper what might be worth exploring further.
``` r
library(ggplot2)
ggplot(data_clean, aes(x = DPSTEXPA, y = DPSATOFC)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "orange", se = TRUE) +
labs(
title = "Teacher experience VS SAT SCORES",
x = "AVG YEARS TEACHER EXPERIENCE",
y = "SAT Score"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
This just shows that there are many variables outside of teacher experience that can explain SAT scores.