Created by Olivia Staud. Updated April 28, 2025 https://rpubs.com/ostaud/1304810
This analysis examines whether local job conditions or school funding is more strongly associated with student performance in West Virginia. I compared county-level science test scores, government education spending data, and unemployment rates. Unemployment had a stronger link to low science scores than spending did: its correlation with proficiency (r = -0.67) corresponds to about 45% shared variance across counties, compared with r = 0.21 for per-pupil spending. The multiple regression model was weaker, with an R-squared of 0.24 on the training data.
## # A tibble: 55 × 7
## county school school_name population_group subgroup science_proficiency
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Barbour 999 Barbour Count… Total Population Total 26.0
## 2 Berkeley 999 Berkeley Coun… Total Population Total 28.6
## 3 Boone 999 Boone County … Total Population Total 19.6
## 4 Braxton 999 Braxton Count… Total Population Total 22.6
## 5 Brooke 999 Brooke County… Total Population Total 21.1
## 6 Cabell 999 Cabell County… Total Population Total 30.8
## 7 Calhoun 999 Calhoun Count… Total Population Total 27.8
## 8 Clay 999 Clay County T… Total Population Total 23.3
## 9 Doddridge 999 Doddridge Cou… Total Population Total 31.3
## 10 Fayette 999 Fayette Count… Total Population Total 17.4
## # ℹ 45 more rows
## # ℹ 1 more variable: proficiency <dbl>
This analysis combines three datasets covering West Virginia’s 55 counties. The WV Department of Education provided science proficiency scores from 2021 standardized tests. The US Census Bureau supplied detailed school spending figures from 2022. County unemployment statistics came from the Bureau of Labor Statistics covering 2018-2022.
Data preparation included removing educational service cooperatives, standardizing county names across datasets, and calculating per-pupil metrics. Counties with any missing values were excluded from the final analysis dataset.
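The preparation steps described above can be sketched in base R. This is a minimal illustration, not the code behind the report: the three finance rows are copied from the output shown in this document, the suffix pattern ` CO SCH DIST` is inferred from those rows, and the assumption that revenue columns are in thousands of dollars (so the per-pupil revenue figures are thousands of dollars per student) is mine. The per-pupil column names mirror those in the regression formula.

```r
# Three finance rows copied from the output in this report.
finance <- data.frame(
  name    = c("BARBOUR CO SCH DIST", "BERKELEY CO SCH DIST", "BOONE CO SCH DIST"),
  enroll  = c(2144, 19722, 3177),
  tfedrev = c(7559, 48407, 8194),
  tstrev  = c(16584, 140127, 26858),
  tlocrev = c(5872, 86699, 14564),
  ppcstot = c(11885, 12704, 14663),
  stringsAsFactors = FALSE
)

# Matching proficiency rows copied from the scores table.
scores <- data.frame(
  county = c("Barbour", "Berkeley", "Boone"),
  science_proficiency = c(26.0, 28.6, 19.6),
  stringsAsFactors = FALSE
)

# Standardize county names: drop the district suffix, then capitalize only
# the first letter. Names such as "McDowell" would need special handling.
district <- sub(" CO SCH DIST$", "", finance$name)
finance$county <- paste0(substr(district, 1, 1), tolower(substring(district, 2)))

# Per-pupil revenue by source (assumed units: thousands of dollars per student).
finance$pp_fed_rev   <- finance$tfedrev / finance$enroll
finance$pp_state_rev <- finance$tstrev  / finance$enroll
finance$pp_local_rev <- finance$tlocrev / finance$enroll

# Join on the standardized county name and drop rows with any missing value.
analysis <- merge(scores, finance, by = "county")
analysis <- analysis[complete.cases(analysis), ]
print(analysis[, c("county", "science_proficiency", "pp_fed_rev", "ppcstot")])
```

The unemployment table would be joined on `county` in the same way.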
Key variables:
## # A tibble: 55 × 8
## name enroll tfedrev tstrev tlocrev totalexp ppcstot county
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 BARBOUR CO SCH DIST 2144 7559 16584 5872 28021 11885 Barbour
## 2 BERKELEY CO SCH DIST 19722 48407 140127 86699 264253 12704 Berkeley
## 3 BOONE CO SCH DIST 3177 8194 26858 14564 48642 14663 Boone
## 4 BRAXTON CO SCH DIST 1747 5479 12748 6404 24417 13153 Braxton
## 5 BROOKE CO SCH DIST 2582 6791 17114 21352 41908 15642 Brooke
## 6 CABELL CO SCH DIST 11667 42518 88337 66699 183621 14538 Cabell
## 7 CALHOUN CO SCH DIST 861 3254 9953 3190 15154 16085 Calhoun
## 8 CLAY CO SCH DIST 1669 6157 17655 2791 25963 13825 Clay
## 9 DODDRIDGE CO SCH DIST 1082 3455 3999 31752 38493 23563 Doddrid…
## 10 FAYETTE CO SCH DIST 5594 15293 51759 23477 83373 13777 Fayette
## # ℹ 45 more rows
## # A tibble: 55 × 2
## county unemployed
## <chr> <dbl>
## 1 McDowell 15.1
## 2 Braxton 14.4
## 3 Logan 13.3
## 4 Calhoun 12.2
## 5 Roane 11.7
## 6 Clay 11.2
## 7 Mingo 11.2
## 8 Webster 11.1
## 9 Monroe 10.6
## 10 Barbour 10.1
## # ℹ 45 more rows
I used both unsupervised and supervised learning techniques to analyze the relationships between economic factors, funding sources, and educational outcomes.
First, I examined correlations between key variables:
The correlation matrix shows that unemployment has a strong negative relationship with proficiency scores (r = -0.67), while per-pupil spending shows a weak positive correlation (r = 0.21). Federal revenue per pupil has a moderate negative correlation (r = -0.42), which suggests that targeted federal funding may not translate into higher test scores, although the same pattern could arise if federal funds flow disproportionately to counties that already have lower scores.
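As a rough illustration of how such a matrix is computed and read, the sketch below uses only the ten counties visible in the tables above, so its values will not match the reported 55-county correlations. It also squares the reported r = -0.67 to show the share of variance a single correlation implies. The object names are my own.

```r
# The ten counties shown in both the scores and finance output.
subset10 <- data.frame(
  county = c("Barbour", "Berkeley", "Boone", "Braxton", "Brooke",
             "Cabell", "Calhoun", "Clay", "Doddridge", "Fayette"),
  science_proficiency = c(26.0, 28.6, 19.6, 22.6, 21.1,
                          30.8, 27.8, 23.3, 31.3, 17.4),
  ppcstot = c(11885, 12704, 14663, 13153, 15642,
              14538, 16085, 13825, 23563, 13777)
)

# Pearson correlation matrix for the numeric columns.
cor_matrix <- cor(subset10[, c("science_proficiency", "ppcstot")])
print(round(cor_matrix, 2))

# Squaring a bivariate correlation gives the share of variance the two
# variables have in common; for the reported r = -0.67 that is about 0.45.
r_reported <- -0.67
shared_variance <- r_reported^2
print(round(shared_variance, 2))
```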
Next, I visualized relationships with scatter plots:
For unsupervised learning, I applied k-means clustering to identify natural groupings of counties:
| cluster | count | avg_proficiency | avg_unemployment | avg_spending |
|---|---|---|---|---|
| 1 | 4 | 25.75 | 4.80 | 20415.75 |
| 2 | 20 | 22.01 | 9.86 | 14241.75 |
| 3 | 31 | 27.37 | 5.54 | 13843.06 |
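A cluster summary of this shape could be produced along the following lines. This is a sketch under assumptions: the data are synthetic, and the report does not state which variables were clustered, whether they were standardized first, or how many random starts were used. Standardizing with `scale()` is shown because per-pupil spending is on a much larger numeric scale than the other two variables.

```r
set.seed(42)

# Synthetic stand-in for the 55-county dataset (illustrative values only).
n <- 55
counties <- data.frame(
  proficiency = rnorm(n, mean = 25, sd = 5),
  unemployed  = rnorm(n, mean = 7, sd = 3),
  ppcstot     = rnorm(n, mean = 14500, sd = 2500)
)

# Standardize so that no single variable dominates the distance calculation.
features <- scale(counties[, c("proficiency", "unemployed", "ppcstot")])

# k-means with three clusters; nstart reruns the algorithm from several
# random starting points and keeps the best solution.
km <- kmeans(features, centers = 3, nstart = 25)
counties$cluster <- km$cluster

# Per-cluster means on the original (unscaled) variables, plus counts.
cluster_summary <- aggregate(
  cbind(proficiency, unemployed, ppcstot) ~ cluster,
  data = counties, FUN = mean
)
cluster_summary$count <- as.vector(table(counties$cluster))
print(cluster_summary)
```

Cluster labels from k-means are arbitrary, so the numbering in a rerun need not match the table above.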
For supervised learning, I used both linear regression and decision tree models, with train/test validation:
##
## Call:
## lm(formula = proficiency ~ unemployed + ppcstot + pp_fed_rev +
## pp_state_rev + pp_local_rev, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4234 -3.1536 -0.1643 3.5720 13.8746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.941332 11.334037 2.906 0.00639 **
## unemployed -0.071521 0.348640 -0.205 0.83868
## ppcstot -0.002498 0.001599 -1.563 0.12737
## pp_fed_rev -0.754078 1.018847 -0.740 0.46430
## pp_state_rev 2.263180 1.465293 1.545 0.13172
## pp_local_rev 2.474299 1.034773 2.391 0.02248 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.501 on 34 degrees of freedom
## Multiple R-squared: 0.2407, Adjusted R-squared: 0.129
## F-statistic: 2.155 on 5 and 34 DF, p-value: 0.08242
| Model | RMSE | R_squared |
|---|---|---|
| Linear Regression | 10.64 | 0.17 |
| Decision Tree | 6.56 | 0.02 |
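The train/test comparison can be sketched as follows. The data are synthetic and the predictors are reduced to two for brevity. The split of 40 training counties follows from the regression output (34 residual degrees of freedom plus 6 coefficients), which leaves 15 for testing. Computing test R-squared as the squared correlation between observed and predicted values is my assumption, since the report does not say how it was calculated. The `evaluate` helper is hypothetical, and the tree step runs only if the `rpart` package is installed.

```r
set.seed(123)

# Synthetic stand-in for the county dataset (illustrative values only).
n <- 55
counties <- data.frame(
  unemployed = runif(n, min = 3, max = 15),
  ppcstot    = runif(n, min = 11000, max = 24000)
)
counties$proficiency <- 35 - 1.0 * counties$unemployed + rnorm(n, sd = 4)

# 40 training rows, 15 held-out test rows.
train_idx  <- sample(seq_len(n), size = 40)
train_data <- counties[train_idx, ]
test_data  <- counties[-train_idx, ]

# Hypothetical helper: RMSE and squared correlation on held-out data.
evaluate <- function(observed, predicted) {
  c(RMSE = sqrt(mean((observed - predicted)^2)),
    R_squared = cor(observed, predicted)^2)
}

lm_fit <- lm(proficiency ~ unemployed + ppcstot, data = train_data)
lm_metrics <- evaluate(test_data$proficiency, predict(lm_fit, newdata = test_data))
metrics <- data.frame(Model = "Linear Regression",
                      RMSE = lm_metrics[["RMSE"]],
                      R_squared = lm_metrics[["R_squared"]])

# Regression tree, fitted only when rpart is available.
if (requireNamespace("rpart", quietly = TRUE)) {
  tree_fit <- rpart::rpart(proficiency ~ unemployed + ppcstot, data = train_data)
  tree_metrics <- evaluate(test_data$proficiency, predict(tree_fit, newdata = test_data))
  metrics <- rbind(metrics,
                   data.frame(Model = "Decision Tree",
                              RMSE = tree_metrics[["RMSE"]],
                              R_squared = tree_metrics[["R_squared"]]))
}
print(metrics)
```

With only 15 test counties, both metrics vary noticeably from one random split to another, so a single split gives a noisy comparison.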
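The regression output above gives a more qualified picture than the correlations. With the funding variables in the model, the unemployment coefficient is -0.07 (p = 0.84): a one-percentage-point increase in unemployment is associated with a decrease of roughly 0.07 percentage points in proficiency, holding funding constant, and the estimate is not statistically significant. Local revenue per pupil is the only predictor significant at the 5% level (p = 0.022). The model has a multiple R-squared of 0.24 (adjusted 0.13), so it explains about 24% of the variation in proficiency among the training counties. The 45% figure corresponds to the squared bivariate correlation between unemployment and proficiency (r = -0.67, r² ≈ 0.45), not to this model.
In the decision tree model, unemployment is the primary splitting variable. Counties with unemployment rates below 10.2% typically have higher proficiency scores regardless of their spending levels. On the test set, however, the tree's R-squared is only 0.02, so this split is better read as descriptive than as strong predictive evidence.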
This analysis has several important constraints that affect interpretation:
The data come from different time periods (2021 for assessment data, 2022 for spending, and 2018-2022 for unemployment), creating a temporal misalignment that could affect the observed relationships.
County-level aggregation masks school-to-school differences within counties. Some counties may have both high and low-performing schools that aren’t captured in this analysis.
The analysis doesn’t account for non-economic demographic factors such as parental education levels, family structure, and healthcare access that likely influence educational outcomes.
While strong correlations were identified, this study cannot establish causation. Both unemployment and educational outcomes could be influenced by underlying historical or structural factors.
The dataset is limited to 55 counties, which restricts the statistical power of the analysis and may make it difficult to detect more subtle relationships.
The k-means clustering algorithm is sensitive to outliers, which could affect the county groupings identified in the unsupervised learning portion of the analysis.
Future research should incorporate school-level data, additional socioeconomic indicators, and longitudinal analysis to better understand the complex relationships between economic conditions and educational outcomes.
Sources included:
Claude AI assisted with:
- Setting up the R Markdown document with echo=FALSE to hide code in the published output
- Implementing the correlation visualization with ggcorrplot
- Suggesting the appropriate syntax for the usmap visualization
- Helping with the implementation of k-means clustering for county grouping