WV County Education Outcomes Prediction

Project 3

Introduction

The quality of public education is a cornerstone of societal development, influencing individual success and community well-being. This project focuses on analyzing public school data in West Virginia, a state with unique demographic, economic, and educational challenges. The project aims to uncover trends and correlations between student achievement, financial investment in education, and socioeconomic factors across West Virginia counties. More specifically, I aim to uncover any trends in unemployment and see if they affect proficiency scores.

Data

This project uses the following data:

Assessment Data: Proficiency rates by county and by school. For some parts of this project, students are grouped by age (middle school versus high school) to see how different age groups respond to unemployment.
Demographic Data: The unemployment rate of each county.

Predictions

Children are affected by the world around them, and learning early in school is crucial to reaching their full potential later in life. I expect unemployment to play a role in how well schools perform on standardized tests, with lower scores where unemployment is higher.

Correlations

The correlation matrix shows high positive correlations among the revenue variables, which makes sense. There is also a fairly strong positive correlation among the testing proficiencies, as expected. One surprise is how low the correlation between revenues and proficiencies is: it is slightly positive, but not strong enough to work with here. Instead, we will look at unemployment and median income, and at how those two affect testing outcomes. The matrix also shows a negative correlation between county enrollment and county unemployment rate, meaning that where enrollment is lower, unemployment tends to be higher. This could be for a few reasons:

People cannot afford to send their children to school.
People who are unemployed may not prioritize education for their children and may instead ask them to help pay the bills.
Education may be poor in these areas, leading to a hard time finding a job after graduation.
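For readers unfamiliar with how a single entry of a correlation matrix is obtained, here is a minimal Python sketch of the Pearson correlation coefficient. The enrollment and unemployment numbers are hypothetical toy values chosen only to illustrate a negative correlation; they are not taken from the West Virginia data, and the function name `pearson` is just an illustrative helper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y around their means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Variation of each variable around its own mean
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values, NOT the project data:
# five imaginary counties with enrollment counts and unemployment rates (%).
enrollment = [1200, 2500, 4000, 6500, 9000]
unemployment = [7.5, 6.8, 6.0, 5.1, 4.2]

r = pearson(enrollment, unemployment)
print(f"r = {r:.3f}")  # negative: larger enrollment pairs with lower unemployment
```

A value near -1 or +1 indicates a strong linear relationship and a value near 0 a weak one. A correlation by itself does not say which of the explanations listed above, if any, is responsible.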

However, we want to look at how unemployment affects children. If they are helping to pay the bills, that should show up as lower standardized test scores for high school students. Let’s look into that next.

This graph shows an interesting pattern: according to the trend lines, middle schoolers are more likely to be proficient in science than high schoolers, but high schoolers tend to be more proficient in math. West Virginia administers its science test in grades 5, 8, and 11, so every third year. Math and writing, on the other hand, are tested in grades 3-8 and again in grade 11.

Linear Regression Models

## 
## Call:
## lm(formula = math_proficiency ~ unemployed + reading_proficiency + 
##     science_proficiency, data = middle_schools)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2696  -4.0279   0.3458   3.9211  13.6862 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -7.10783    3.43424  -2.070   0.0407 *  
## unemployed          -0.05161    0.24905  -0.207   0.8362    
## reading_proficiency  0.64358    0.08700   7.397 2.21e-11 ***
## science_proficiency  0.20821    0.08450   2.464   0.0152 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.386 on 118 degrees of freedom
## Multiple R-squared:  0.5653, Adjusted R-squared:  0.5543 
## F-statistic: 51.16 on 3 and 118 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = math_proficiency ~ unemployed + reading_proficiency + 
##     science_proficiency, data = high_schools)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7650  -3.8178   0.3681   3.5141  14.4111 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -3.25929    3.15593  -1.033   0.3041    
## unemployed          -0.07868    0.20138  -0.391   0.6968    
## reading_proficiency  0.22430    0.10138   2.212   0.0291 *  
## science_proficiency  0.59225    0.12263   4.830 4.72e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.38 on 104 degrees of freedom
## Multiple R-squared:  0.6758, Adjusted R-squared:  0.6664 
## F-statistic: 72.26 on 3 and 104 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = reading_proficiency ~ unemployed + math_proficiency + 
##     science_proficiency, data = middle_schools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.132  -3.561   0.155   2.691  18.387 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         21.11279    2.36029   8.945 6.21e-15 ***
## unemployed          -0.28166    0.21630  -1.302    0.195    
## math_proficiency     0.49227    0.06655   7.397 2.21e-11 ***
## science_proficiency  0.32667    0.06956   4.696 7.22e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.585 on 118 degrees of freedom
## Multiple R-squared:  0.6217, Adjusted R-squared:  0.6121 
## F-statistic: 64.65 on 3 and 118 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = reading_proficiency ~ unemployed + math_proficiency + 
##     science_proficiency, data = high_schools)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7395  -3.2221  -0.0972   2.8832  15.4264 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         20.62494    2.21357   9.318 2.25e-15 ***
## unemployed          -0.25092    0.18890  -1.328   0.1870    
## math_proficiency     0.20040    0.09058   2.212   0.0291 *  
## science_proficiency  0.88488    0.09445   9.369 1.73e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.085 on 104 degrees of freedom
## Multiple R-squared:  0.7915, Adjusted R-squared:  0.7855 
## F-statistic: 131.6 on 3 and 104 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = science_proficiency ~ unemployed + reading_proficiency + 
##     math_proficiency, data = middle_schools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.502  -4.608  -1.410   3.251  23.629 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.41031    3.70758   0.650   0.5169    
## unemployed          -0.03258    0.26462  -0.123   0.9022    
## reading_proficiency  0.48206    0.10265   4.696 7.22e-06 ***
## math_proficiency     0.23501    0.09538   2.464   0.0152 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.784 on 118 degrees of freedom
## Multiple R-squared:  0.4602, Adjusted R-squared:  0.4465 
## F-statistic: 33.54 on 3 and 118 DF,  p-value: 9.485e-16

## 
## Call:
## lm(formula = science_proficiency ~ unemployed + reading_proficiency + 
##     math_proficiency, data = high_schools)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3218  -1.9418   0.2942   2.5901  11.6788 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.94909    2.24046  -2.209   0.0294 *  
## unemployed          -0.01227    0.14564  -0.084   0.9330    
## reading_proficiency  0.51726    0.05521   9.369 1.73e-15 ***
## math_proficiency     0.30932    0.06405   4.830 4.72e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.888 on 104 degrees of freedom
## Multiple R-squared:  0.8182, Adjusted R-squared:  0.813 
## F-statistic: 156.1 on 3 and 104 DF,  p-value: < 2.2e-16

Each model uses one proficiency as the dependent variable and is fit separately for middle schools and high schools. Some of the models are better than others, and there are some trends to uncover here. Unemployment is not statistically significant in any of the six models (its smallest p-value is 0.187). The R-squared values are higher for high schools than for middle schools in every case (for example, 0.82 versus 0.46 for science). This could be for a few reasons:

Some middle schools include elementary grades in their data. Which building a student attends depends on where they go to school: some elementary schools in West Virginia end in third grade, while others end in sixth grade. This may account for some of the variance in middle school testing.
Data from high schools may be more complete and concise than the potentially more extensive middle school data.
High school students may take standardized testing more seriously because it serves as a benchmark for college acceptance.
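The unemployment rows and the adjusted R-squared values in the regression output can be checked by hand. The sketch below is an independent arithmetic check in Python, not the code that produced the output (which comes from R's `lm`). It uses two standard formulas: t = estimate / standard error, and adjusted R² = 1 - (1 - R²)(n - 1) / (residual degrees of freedom), where n is recovered as residual degrees of freedom + 3 predictors + 1 intercept. All numbers are copied from the summaries above; the labels and helper names are my own.

```python
# (label, unemployment estimate, std. error, reported t,
#  R-squared, residual df, reported adjusted R-squared)
models = [
    ("math, middle",    -0.05161, 0.24905, -0.207, 0.5653, 118, 0.5543),
    ("math, high",      -0.07868, 0.20138, -0.391, 0.6758, 104, 0.6664),
    ("reading, middle", -0.28166, 0.21630, -1.302, 0.6217, 118, 0.6121),
    ("reading, high",   -0.25092, 0.18890, -1.328, 0.7915, 104, 0.7855),
    ("science, middle", -0.03258, 0.26462, -0.123, 0.4602, 118, 0.4465),
    ("science, high",   -0.01227, 0.14564, -0.084, 0.8182, 104, 0.8130),
]
N_PREDICTORS = 3  # unemployed plus the two other proficiencies


def t_value(estimate, std_error):
    """t statistic for testing whether a coefficient differs from zero."""
    return estimate / std_error


def adjusted_r2(r2, resid_df, n_predictors):
    """Adjusted R-squared, penalizing R-squared for the number of predictors."""
    n = resid_df + n_predictors + 1  # observations = residual df + predictors + intercept
    return 1 - (1 - r2) * (n - 1) / resid_df


results = []
for label, est, se, t_reported, r2, resid_df, adj_reported in models:
    t = t_value(est, se)
    adj = adjusted_r2(r2, resid_df, N_PREDICTORS)
    results.append((label, t, adj))
    print(f"{label:16s} t = {t:7.3f}   adjusted R2 = {adj:.4f}")
```

Every recomputed value agrees with the reported one up to rounding of the printed inputs. The largest unemployment t statistic in absolute value is about 1.33, well short of the roughly 2 needed for significance at the 5% level with this many degrees of freedom, which is why unemployment is insignificant in every model.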

PCA

## Importance of first k=2 (out of 4) components:
##                           PC1    PC2
## Standard deviation     1.5791 0.9736
## Proportion of Variance 0.6234 0.2370
## Cumulative Proportion  0.6234 0.8604

## Standard deviations (1, .., p=4):
## [1] 1.5790856 0.9736123 0.5507143 0.5052538
## 
## Rotation (n x k) = (4 x 2):
##                            PC1         PC2
## science_proficiency  0.5667548 -0.14152212
## math_proficiency     0.5637386 -0.04662995
## reading_proficiency  0.5700716 -0.13906040
## unemployed          -0.1897529 -0.97900937

PC1 most likely reflects proficiency: all three proficiency variables load strongly on it (about 0.57 each).
PC2 most likely reflects unemployment, which has a loading of about -0.98 on it.
- The ranges on PC1 are more consistent with a proficiency score than with an unemployment percentage.
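The "Proportion of Variance" row in the PCA summary follows directly from the four standard deviations printed above: each component's variance is its standard deviation squared, and its proportion is that variance divided by the total. The Python sketch below is an independent check of that arithmetic, not the code that produced the output.

```python
# Standard deviations of the four principal components, copied from the output above
sdev = [1.5790856, 0.9736123, 0.5507143, 0.5052538]

# Variance (eigenvalue) of each component
variances = [s ** 2 for s in sdev]
total_variance = sum(variances)

# Share of total variance explained by each component
proportions = [v / total_variance for v in variances]

# Running total of the shares
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

print(f"total variance = {total_variance:.4f}")
for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion = {p:.4f}, cumulative = {c:.4f}")
```

The first two proportions come out to 0.6234 and 0.2370, with a cumulative 0.8604, matching the summary. The total variance is almost exactly 4, which is what one would expect if the four variables were standardized before the PCA; that is an inference from the numbers, not something the output states.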

Decision Tree

From this decision tree, we can gauge how important each of the variables is. Enrollment is a stronger predictor than unemployment. Unemployment can still be used to separate the branches towards the end, though not in our high school model. These decision trees also only predict Math Proficiency, using Reading and Writing as predictors. The leaves at the bottom show the proficiency percentage predicted for a high school. It’s surprising to see the jump in scores from even the first split in this tree.

This tree tells a similar story, predicting math proficiency from unemployment, enrollment, and the other test scores. The jump at the end is smaller: only a 9% increase from the second-highest bucket to the highest proficiency bucket.
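To make the idea of a "split" concrete, here is a minimal Python sketch of how a regression tree typically chooses one: it tries each candidate threshold on a predictor and keeps the one that minimizes the total squared error around the two group means. This is a generic illustration, not the procedure or software used for the trees above, and the reading and math values are hypothetical toy numbers rather than project data.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the single split on x that minimizes SSE in y."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yv for xv, yv in pairs if xv < threshold]
        right = [yv for xv, yv in pairs if xv >= threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy values, NOT the project data:
# reading and math proficiency (%) for six imaginary schools.
reading_prof = [30, 35, 40, 45, 50, 55]
math_prof = [20, 22, 21, 35, 37, 36]

threshold, total_sse = best_split(reading_prof, math_prof)
print(f"best split: reading < {threshold} (total SSE = {total_sse})")
```

In this toy example the best threshold falls between the two clusters of schools, and each leaf would predict the mean math proficiency of its group. A full tree repeats the same search inside each branch, across all predictors, which is why a variable that produces larger error reductions appears higher up in the tree.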

Recommendations

I have a few recommendations based on the analysis of this project.

Although I did not use enrollment in any of my regression models, there is a negative correlation between student enrollment and unemployment. Some students in West Virginia are held back by factors outside of their control, such as transportation or restrictive weather conditions. Some students do not have a bus route, or their bus takes far too long to pick them up. Other times, students who live in heavily rural areas may not be able to clear snow in time to get to school that day.
Unemployment does not have a strong correlation with test scores, but the relationship is still slightly negative. The lower the unemployment rate, and the less hardship young students witness, the better they tend to do.
Test scores in West Virginia middle schools are not strong across the board. In addition, schools that test well get more money than schools that do not. This is flawed because it boosts the students who are already doing well, when the students who are doing poorly are the ones who really need the help, rather than being left in the dark to continue struggling.

References

https://hdpulse.nimhd.nih.gov/data-portal/social/table?age=001&age_options=ageall_1&demo=00010&demo_options=income_3&race=00&race_options=race_7&sex=0&sex_options=sexboth_1&socialtopic=030&socialtopic_options=social_6&statefips=54&statefips_options=area_states
- Average Family Income by County Spreadsheet
Dr. Garrett’s Class Notes
ChatGPT for some Visualization Features