DATA 605 - Discussion 12

Nick Oliver

Discussion 12

I am using the state of Delaware’s open data set. The regression model is predicting the percentage of students who are proficient on the MATH portion of the “Smarter Balanced Summative Assessment” in 2021

PctProficient: The percent of students who earned an achievement level considered proficient. This is the number of students who earned a proficient achievement level divided by the number of students who tested.

Soruce: https://data.delaware.gov/Education/Student-Assessment-Performance/ms6b-mt82

Loading Data

df <- read.csv('https://raw.githubusercontent.com/nolivercuny/DATA605/main/week%2012/Filtered_Student_Assessment_Performance.csv')

Cleanup & Aggregation

This is a very large dataset (over 700k rows) I pre-filtered it so it is easier to work with. Removed all redacted data, removed data for the entire state, removed data without race and gender information and kept just data for the MATH assessment.

Dropping rows that add no extra information

fdf <- df %>%
  mutate(NonWhite = Race != 'White') %>% 
  mutate(Male = Gender == 'Male') %>%
  mutate(GradeNum = as.numeric(gsub("([0-9]+).*$", "\\1", Grade))) %>%
  mutate(NonFosterCare = SpecialDemo == 'Non-Foster Care') %>%
  mutate(NonLowIncome = SpecialDemo == 'Non Low-Income') %>%
  mutate(LowIncome = SpecialDemo == 'Low-Income') %>%
  mutate(ActiveEL = SpecialDemo == 'Active EL Students') %>%
  mutate(NonHomeless = SpecialDemo == 'Non-Homeless') %>%
  mutate(NonELStudents = SpecialDemo == 'Non-EL Students') %>%
  mutate(Disabilities = SpecialDemo == 'Students with Disabilities') %>%
  mutate(MilitaryConnected = SpecialDemo == 'Military Connected Youth') %>%
  dplyr::select(NonWhite,Male,GradeNum,NonFosterCare,NonLowIncome,LowIncome,ActiveEL,NonHomeless,NonELStudents,Disabilities,MilitaryConnected,Tested,PctProficient)
 
glimpse(fdf)
## Rows: 4,126
## Columns: 13
## $ NonWhite          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ Male              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ GradeNum          <dbl> 4, 4, 4, 4, 4, 7, 7, 7, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7…
## $ NonFosterCare     <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE…
## $ NonLowIncome      <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALS…
## $ LowIncome         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ ActiveEL          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ NonHomeless       <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE…
## $ NonELStudents     <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Disabilities      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ MilitaryConnected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ Tested            <int> 51, 54, 53, 39, 46, 31, 30, 27, 45, 50, 47, 30, 37, …
## $ PctProficient     <dbl> 13.73, 12.96, 13.21, 15.38, 15.22, 16.13, 16.67, 18.…

Analysis

Use built in lm function to fit linear model. Summary shows a p-value of \(\approx\) 2e-16 which much less than .05 which indicates the correlation is statistically significant. Also found out that ~ . is shorthand for including all columns as predictors.

model <- lm(data=fdf, PctProficient ~ .)
summary(model)
## 
## Call:
## lm(formula = PctProficient ~ ., data = fdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.821 -10.745  -2.781   8.392  58.189 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            55.089098   0.988374  55.737  < 2e-16 ***
## NonWhiteTRUE          -13.635451   0.527139 -25.867  < 2e-16 ***
## MaleTRUE                2.115928   0.469323   4.508 6.71e-06 ***
## GradeNum               -1.789251   0.137085 -13.052  < 2e-16 ***
## NonFosterCareTRUE      -2.221879   0.739430  -3.005  0.00267 ** 
## NonLowIncomeTRUE        0.309360   0.771575   0.401  0.68848    
## LowIncomeTRUE         -12.709980   2.159715  -5.885 4.30e-09 ***
## ActiveELTRUE           -7.659533   3.614094  -2.119  0.03412 *  
## NonHomelessTRUE        -2.107574   0.739862  -2.849  0.00441 ** 
## NonELStudentsTRUE      -3.202893   0.777871  -4.118 3.91e-05 ***
## DisabilitiesTRUE      -20.113430   3.008449  -6.686 2.61e-11 ***
## MilitaryConnectedTRUE  -3.029882   4.570101  -0.663  0.50738    
## Tested                 -0.093203   0.006579 -14.166  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.03 on 4113 degrees of freedom
## Multiple R-squared:  0.2223, Adjusted R-squared:  0.2201 
## F-statistic: 97.98 on 12 and 4113 DF,  p-value: < 2.2e-16

I experimented with the stepAIC function from the MASS library which can do both backward and forward predictor elimination using the Akaike Information Criterion (AIC) to compare the fit of the model. Interestingly it appears to drop the NonLowIncome and MilitaryConnected predictors.

stepModel <- stepAIC(model, direction = 'both')
## Start:  AIC=22377.04
## PctProficient ~ NonWhite + Male + GradeNum + NonFosterCare + 
##     NonLowIncome + LowIncome + ActiveEL + NonHomeless + NonELStudents + 
##     Disabilities + MilitaryConnected + Tested
## 
##                     Df Sum of Sq     RSS   AIC
## - NonLowIncome       1        36  929334 22375
## - MilitaryConnected  1        99  929397 22376
## <none>                            929297 22377
## - ActiveEL           1      1015  930312 22380
## - NonHomeless        1      1833  931131 22383
## - NonFosterCare      1      2040  931337 22384
## - NonELStudents      1      3831  933128 22392
## - Male               1      4593  933890 22395
## - LowIncome          1      7825  937122 22410
## - Disabilities       1     10099  939396 22420
## - GradeNum           1     38491  967788 22542
## - Tested             1     45340  974637 22572
## - NonWhite           1    151176 1080474 22997
## 
## Step:  AIC=22375.2
## PctProficient ~ NonWhite + Male + GradeNum + NonFosterCare + 
##     LowIncome + ActiveEL + NonHomeless + NonELStudents + Disabilities + 
##     MilitaryConnected + Tested
## 
##                     Df Sum of Sq     RSS   AIC
## - MilitaryConnected  1       110  929444 22374
## <none>                            929334 22375
## + NonLowIncome       1        36  929297 22377
## - ActiveEL           1      1064  930398 22378
## - NonHomeless        1      2837  932171 22386
## - NonFosterCare      1      3137  932470 22387
## - Male               1      4602  933935 22394
## - NonELStudents      1      5508  934842 22398
## - LowIncome          1      8259  937592 22410
## - Disabilities       1     10424  939758 22419
## - GradeNum           1     38498  967832 22541
## - Tested             1     45361  974695 22570
## - NonWhite           1    151655 1080988 22997
## 
## Step:  AIC=22373.69
## PctProficient ~ NonWhite + Male + GradeNum + NonFosterCare + 
##     LowIncome + ActiveEL + NonHomeless + NonELStudents + Disabilities + 
##     Tested
## 
##                     Df Sum of Sq     RSS   AIC
## <none>                            929444 22374
## + MilitaryConnected  1       110  929334 22375
## + NonLowIncome       1        47  929397 22376
## - ActiveEL           1      1060  930504 22376
## - NonHomeless        1      2789  932233 22384
## - NonFosterCare      1      3087  932531 22385
## - Male               1      4595  934039 22392
## - NonELStudents      1      5444  934889 22396
## - LowIncome          1      8230  937674 22408
## - Disabilities       1     10396  939840 22418
## - GradeNum           1     38435  967879 22539
## - Tested             1     45269  974713 22568
## - NonWhite           1    151549 1080993 22995
summary(stepModel)
## 
## Call:
## lm(formula = PctProficient ~ NonWhite + Male + GradeNum + NonFosterCare + 
##     LowIncome + ActiveEL + NonHomeless + NonELStudents + Disabilities + 
##     Tested, data = fdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.820 -10.731  -2.761   8.420  58.187 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        55.202601   0.908335  60.773  < 2e-16 ***
## NonWhiteTRUE      -13.632484   0.526290 -25.903  < 2e-16 ***
## MaleTRUE            2.116254   0.469215   4.510 6.66e-06 ***
## GradeNum           -1.787629   0.137038 -13.045  < 2e-16 ***
## NonFosterCareTRUE  -2.350688   0.635882  -3.697 0.000221 ***
## LowIncomeTRUE     -12.835724   2.126443  -6.036 1.72e-09 ***
## ActiveELTRUE       -7.789625   3.595346  -2.167 0.030324 *  
## NonHomelessTRUE    -2.236349   0.636391  -3.514 0.000446 ***
## NonELStudentsTRUE  -3.331116   0.678481  -4.910 9.48e-07 ***
## DisabilitiesTRUE  -20.238045   2.983054  -6.784 1.33e-11 ***
## Tested             -0.093087   0.006575 -14.157  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.03 on 4115 degrees of freedom
## Multiple R-squared:  0.2222, Adjusted R-squared:  0.2203 
## F-statistic: 117.6 on 10 and 4115 DF,  p-value: < 2.2e-16

\[PctProficient = 55.202 -13.63248 \times NonWhiteTRUE \\ + 2.11625 \times MaleTRUE \\ - 1.78763 \times GradeNum \\ -2.35069 \times NonFosterCareTRUE \\ -12.83572 \times LowIncomeTRUE \\ -7.78963 \times ActiveELTRUE \\ -2.23635 \times NonHomelessTRUE \\ -3.33112 \times NonELStudentsTRUE \\ -20.23805 \times DisabilitiesTRUE\\ -0.09309 \times Tested\]

What this means according to the data is you start at a baseline proficiency of 55.202% subtract for non-white, add for male, subtract for grade number, subtract if non-foster care, subtract if they are low income, subtract if they are active el, subtract if they are non-homeless, subtract if they are non-el, subtract if they are disabilities, and subtract based on the number of students tested.

Residuals plot

 res = resid(stepModel)
plot(df$PctProficient, res,
     ylab="Residuals", xlab="% Proficient")
 abline(0, 0)                  # the horizon

ggplot(data = df, aes(x = stepModel$residuals)) +
    geom_histogram(fill = 'steelblue', color = 'black', binwidth = 1) +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')

Using the performance library’s check_model function as well.

check_model(stepModel)

Summary

Is the linear model appropriate?

The linear model appears to be appropriate. The check model function and plot of the residuals indicates this is an appropriate model