Discussion 12
I am using the state of Delaware’s open data set. The regression model is predicting the percentage of students who are proficient on the MATH portion of the “Smarter Balanced Summative Assessment” in 2021
PctProficient: The percent of students who earned an achievement level considered proficient. This is the number of students who earned a proficient achievement level divided by the number of students who tested.
Soruce: https://data.delaware.gov/Education/Student-Assessment-Performance/ms6b-mt82
Loading Data
df <- read.csv('https://raw.githubusercontent.com/nolivercuny/DATA605/main/week%2012/Filtered_Student_Assessment_Performance.csv')Cleanup & Aggregation
This is a very large dataset (over 700k rows) I pre-filtered it so it is easier to work with. Removed all redacted data, removed data for the entire state, removed data without race and gender information and kept just data for the MATH assessment.
Dropping rows that add no extra information
fdf <- df %>%
mutate(NonWhite = Race != 'White') %>%
mutate(Male = Gender == 'Male') %>%
mutate(GradeNum = as.numeric(gsub("([0-9]+).*$", "\\1", Grade))) %>%
mutate(NonFosterCare = SpecialDemo == 'Non-Foster Care') %>%
mutate(NonLowIncome = SpecialDemo == 'Non Low-Income') %>%
mutate(LowIncome = SpecialDemo == 'Low-Income') %>%
mutate(ActiveEL = SpecialDemo == 'Active EL Students') %>%
mutate(NonHomeless = SpecialDemo == 'Non-Homeless') %>%
mutate(NonELStudents = SpecialDemo == 'Non-EL Students') %>%
mutate(Disabilities = SpecialDemo == 'Students with Disabilities') %>%
mutate(MilitaryConnected = SpecialDemo == 'Military Connected Youth') %>%
dplyr::select(NonWhite,Male,GradeNum,NonFosterCare,NonLowIncome,LowIncome,ActiveEL,NonHomeless,NonELStudents,Disabilities,MilitaryConnected,Tested,PctProficient)
glimpse(fdf)## Rows: 4,126
## Columns: 13
## $ NonWhite <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ Male <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ GradeNum <dbl> 4, 4, 4, 4, 4, 7, 7, 7, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7…
## $ NonFosterCare <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE…
## $ NonLowIncome <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALS…
## $ LowIncome <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ ActiveEL <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ NonHomeless <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE…
## $ NonELStudents <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Disabilities <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ MilitaryConnected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ Tested <int> 51, 54, 53, 39, 46, 31, 30, 27, 45, 50, 47, 30, 37, …
## $ PctProficient <dbl> 13.73, 12.96, 13.21, 15.38, 15.22, 16.13, 16.67, 18.…
Analysis
Use built in lm function to fit linear model. Summary shows a p-value of \(\approx\) 2e-16 which much less than .05 which indicates the correlation is statistically significant. Also found out that ~ . is shorthand for including all columns as predictors.
model <- lm(data=fdf, PctProficient ~ .)
summary(model)##
## Call:
## lm(formula = PctProficient ~ ., data = fdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.821 -10.745 -2.781 8.392 58.189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.089098 0.988374 55.737 < 2e-16 ***
## NonWhiteTRUE -13.635451 0.527139 -25.867 < 2e-16 ***
## MaleTRUE 2.115928 0.469323 4.508 6.71e-06 ***
## GradeNum -1.789251 0.137085 -13.052 < 2e-16 ***
## NonFosterCareTRUE -2.221879 0.739430 -3.005 0.00267 **
## NonLowIncomeTRUE 0.309360 0.771575 0.401 0.68848
## LowIncomeTRUE -12.709980 2.159715 -5.885 4.30e-09 ***
## ActiveELTRUE -7.659533 3.614094 -2.119 0.03412 *
## NonHomelessTRUE -2.107574 0.739862 -2.849 0.00441 **
## NonELStudentsTRUE -3.202893 0.777871 -4.118 3.91e-05 ***
## DisabilitiesTRUE -20.113430 3.008449 -6.686 2.61e-11 ***
## MilitaryConnectedTRUE -3.029882 4.570101 -0.663 0.50738
## Tested -0.093203 0.006579 -14.166 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.03 on 4113 degrees of freedom
## Multiple R-squared: 0.2223, Adjusted R-squared: 0.2201
## F-statistic: 97.98 on 12 and 4113 DF, p-value: < 2.2e-16
I experimented with the stepAIC function from the MASS library which can do both backward and forward predictor elimination using the Akaike Information Criterion (AIC) to compare the fit of the model. Interestingly it appears to drop the NonLowIncome and MilitaryConnected predictors.
stepModel <- stepAIC(model, direction = 'both')## Start: AIC=22377.04
## PctProficient ~ NonWhite + Male + GradeNum + NonFosterCare +
## NonLowIncome + LowIncome + ActiveEL + NonHomeless + NonELStudents +
## Disabilities + MilitaryConnected + Tested
##
## Df Sum of Sq RSS AIC
## - NonLowIncome 1 36 929334 22375
## - MilitaryConnected 1 99 929397 22376
## <none> 929297 22377
## - ActiveEL 1 1015 930312 22380
## - NonHomeless 1 1833 931131 22383
## - NonFosterCare 1 2040 931337 22384
## - NonELStudents 1 3831 933128 22392
## - Male 1 4593 933890 22395
## - LowIncome 1 7825 937122 22410
## - Disabilities 1 10099 939396 22420
## - GradeNum 1 38491 967788 22542
## - Tested 1 45340 974637 22572
## - NonWhite 1 151176 1080474 22997
##
## Step: AIC=22375.2
## PctProficient ~ NonWhite + Male + GradeNum + NonFosterCare +
## LowIncome + ActiveEL + NonHomeless + NonELStudents + Disabilities +
## MilitaryConnected + Tested
##
## Df Sum of Sq RSS AIC
## - MilitaryConnected 1 110 929444 22374
## <none> 929334 22375
## + NonLowIncome 1 36 929297 22377
## - ActiveEL 1 1064 930398 22378
## - NonHomeless 1 2837 932171 22386
## - NonFosterCare 1 3137 932470 22387
## - Male 1 4602 933935 22394
## - NonELStudents 1 5508 934842 22398
## - LowIncome 1 8259 937592 22410
## - Disabilities 1 10424 939758 22419
## - GradeNum 1 38498 967832 22541
## - Tested 1 45361 974695 22570
## - NonWhite 1 151655 1080988 22997
##
## Step: AIC=22373.69
## PctProficient ~ NonWhite + Male + GradeNum + NonFosterCare +
## LowIncome + ActiveEL + NonHomeless + NonELStudents + Disabilities +
## Tested
##
## Df Sum of Sq RSS AIC
## <none> 929444 22374
## + MilitaryConnected 1 110 929334 22375
## + NonLowIncome 1 47 929397 22376
## - ActiveEL 1 1060 930504 22376
## - NonHomeless 1 2789 932233 22384
## - NonFosterCare 1 3087 932531 22385
## - Male 1 4595 934039 22392
## - NonELStudents 1 5444 934889 22396
## - LowIncome 1 8230 937674 22408
## - Disabilities 1 10396 939840 22418
## - GradeNum 1 38435 967879 22539
## - Tested 1 45269 974713 22568
## - NonWhite 1 151549 1080993 22995
summary(stepModel)##
## Call:
## lm(formula = PctProficient ~ NonWhite + Male + GradeNum + NonFosterCare +
## LowIncome + ActiveEL + NonHomeless + NonELStudents + Disabilities +
## Tested, data = fdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.820 -10.731 -2.761 8.420 58.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.202601 0.908335 60.773 < 2e-16 ***
## NonWhiteTRUE -13.632484 0.526290 -25.903 < 2e-16 ***
## MaleTRUE 2.116254 0.469215 4.510 6.66e-06 ***
## GradeNum -1.787629 0.137038 -13.045 < 2e-16 ***
## NonFosterCareTRUE -2.350688 0.635882 -3.697 0.000221 ***
## LowIncomeTRUE -12.835724 2.126443 -6.036 1.72e-09 ***
## ActiveELTRUE -7.789625 3.595346 -2.167 0.030324 *
## NonHomelessTRUE -2.236349 0.636391 -3.514 0.000446 ***
## NonELStudentsTRUE -3.331116 0.678481 -4.910 9.48e-07 ***
## DisabilitiesTRUE -20.238045 2.983054 -6.784 1.33e-11 ***
## Tested -0.093087 0.006575 -14.157 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.03 on 4115 degrees of freedom
## Multiple R-squared: 0.2222, Adjusted R-squared: 0.2203
## F-statistic: 117.6 on 10 and 4115 DF, p-value: < 2.2e-16
\[PctProficient = 55.202 -13.63248 \times NonWhiteTRUE \\ + 2.11625 \times MaleTRUE \\ - 1.78763 \times GradeNum \\ -2.35069 \times NonFosterCareTRUE \\ -12.83572 \times LowIncomeTRUE \\ -7.78963 \times ActiveELTRUE \\ -2.23635 \times NonHomelessTRUE \\ -3.33112 \times NonELStudentsTRUE \\ -20.23805 \times DisabilitiesTRUE\\ -0.09309 \times Tested\]
What this means according to the data is you start at a baseline proficiency of 55.202% subtract for non-white, add for male, subtract for grade number, subtract if non-foster care, subtract if they are low income, subtract if they are active el, subtract if they are non-homeless, subtract if they are non-el, subtract if they are disabilities, and subtract based on the number of students tested.
Residuals plot
res = resid(stepModel)
plot(df$PctProficient, res,
ylab="Residuals", xlab="% Proficient")
abline(0, 0) # the horizonggplot(data = df, aes(x = stepModel$residuals)) +
geom_histogram(fill = 'steelblue', color = 'black', binwidth = 1) +
labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')Using the performance library’s check_model function as well.
check_model(stepModel)Summary
Is the linear model appropriate?
The linear model appears to be appropriate. The check model function and plot of the residuals indicates this is an appropriate model