This analysis explores the impact of county-level socioeconomic and funding factors (e.g., income, unemployment, spending) on composite educational proficiency in West Virginia. We aim to predict educational outcomes using both regression and decision tree models and explore hidden variable structures via PCA. My main inputs to determining how income and total spending during the school year affects a county proficiency in schools
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(rpart)
library(readxl)
options(scipen = 999)
assessment_path <- './wv ed student achievement/Historical_AssessmentResults_SY15-to-SY21.xlsx'
t_assess_raw_school <- read_excel(path = assessment_path,
sheet = 'SY21 School & District',
range = 'b2:f7312')
t_assess_raw_science <- read_excel(path = assessment_path,
sheet = 'SY21 School & District',
range = 'db3:db7312',
col_names = c('science_proficiency'),
na = '**')
t_assess_raw_reading <- read_excel(path = assessment_path,
sheet = 'SY21 School & District',
range = 'ch3:ch7312',
col_names = c('reading_proficiency'),
na = '**')
t_assess_raw_math <- read_excel(path = assessment_path,
sheet = 'SY21 School & District',
range = 'at3:at7312',
col_names = c('math_proficiency'),
na = '**')
t_assess_raw <- t_assess_raw_school %>%
bind_cols(t_assess_raw_science) %>%
bind_cols(t_assess_raw_reading) %>%
bind_cols(t_assess_raw_math) %>%
janitor::clean_names()
# Remove subgroups
t_assess <- t_assess_raw %>%
filter(school == 999) %>%
filter(population_group == 'Total Population') %>%
filter(county != 'Statewide') %>%
transmute(
county,
proficiency_science = science_proficiency,
proficiency_reading = reading_proficiency,
proficiency_math = math_proficiency,
proficiency_composite = rowMeans(
cbind(science_proficiency, reading_proficiency, math_proficiency),
na.rm = TRUE
)
)
print(t_assess)
## # A tibble: 55 × 5
## county proficiency_science proficiency_reading proficiency_math
## <chr> <dbl> <dbl> <dbl>
## 1 Barbour 26.0 35.3 24.5
## 2 Berkeley 28.6 38.7 20.8
## 3 Boone 19.6 36.4 26
## 4 Braxton 22.6 31.5 15.8
## 5 Brooke 21.1 35.7 26.8
## 6 Cabell 30.8 42.6 31.6
## 7 Calhoun 27.8 35.0 18.5
## 8 Clay 23.3 36.0 21.8
## 9 Doddridge 31.3 40.6 26.3
## 10 Fayette 17.4 31.9 17.6
## # ℹ 45 more rows
## # ℹ 1 more variable: proficiency_composite <dbl>
spending_path <- './us census ed spending/elsec22t.xls'
t_spending_raw <- read_excel(path = spending_path,,
sheet = 'elsec22t',
range = 'a1:gb14106') %>%
janitor::clean_names()
## New names:
## • `` -> `...73`
## • `` -> `...74`
## • `` -> `...75`
## • `` -> `...76`
## • `` -> `...77`
## • `` -> `...78`
## • `` -> `...79`
## • `` -> `...80`
## • `` -> `...81`
## • `` -> `...82`
## • `` -> `...83`
## • `` -> `...84`
## • `` -> `...85`
## • `` -> `...86`
## • `` -> `...87`
## • `` -> `...88`
## • `` -> `...89`
## • `` -> `...90`
## • `` -> `...91`
## • `` -> `...92`
## • `` -> `...93`
## • `` -> `...94`
## • `` -> `...95`
## • `` -> `...96`
## • `` -> `...97`
## • `` -> `...98`
## • `` -> `...99`
## • `` -> `...100`
## • `` -> `...101`
## • `` -> `...102`
## • `` -> `...103`
## • `` -> `...104`
## • `` -> `...105`
## • `` -> `...106`
## • `` -> `...107`
## • `` -> `...108`
## • `` -> `...109`
## • `` -> `...110`
## • `` -> `...111`
## • `` -> `...112`
## • `` -> `...113`
## • `` -> `...114`
## • `` -> `...115`
## • `` -> `...116`
## • `` -> `...117`
## • `` -> `...118`
## • `` -> `...119`
## • `` -> `...120`
## • `` -> `...121`
## • `` -> `...122`
## • `` -> `...123`
## • `` -> `...124`
## • `` -> `...125`
## • `` -> `...126`
## • `` -> `...127`
## • `` -> `...128`
## • `` -> `...129`
## • `` -> `...130`
## • `` -> `...131`
## • `` -> `...132`
## • `` -> `...133`
## • `` -> `...134`
## • `` -> `...135`
## • `` -> `...136`
## • `` -> `...137`
## • `` -> `...138`
## • `` -> `...139`
## • `` -> `...140`
## • `` -> `...141`
## • `` -> `...142`
## • `` -> `...143`
## • `` -> `...144`
## • `` -> `...145`
## • `` -> `...146`
## • `` -> `...147`
## • `` -> `...148`
## • `` -> `...149`
## • `` -> `...150`
## • `` -> `...151`
## • `` -> `...152`
## • `` -> `...153`
## • `` -> `...154`
## • `` -> `...155`
## • `` -> `...156`
## • `` -> `...157`
## • `` -> `...158`
## • `` -> `...159`
## • `` -> `...160`
## • `` -> `...161`
## • `` -> `...162`
## • `` -> `...163`
## • `` -> `...164`
## • `` -> `...165`
## • `` -> `...166`
## • `` -> `...167`
## • `` -> `...168`
## • `` -> `...169`
## • `` -> `...170`
## • `` -> `...171`
## • `` -> `...172`
## • `` -> `...173`
## • `` -> `...174`
## • `` -> `...175`
## • `` -> `...176`
## • `` -> `...177`
## • `` -> `...178`
## • `` -> `...179`
## • `` -> `...180`
## • `` -> `...181`
## • `` -> `...182`
## • `` -> `...183`
## • `` -> `...184`
cooperates <- c('MOUNTAIN STATE EDUCATIONAL SERVICES COOPERATIVE',
'EASTERN PANHANDLE INSTRUCTIONAL COOPERATIVE',
'SOUTHERN EDUCATIONAL SERVICES COOPERATIVE')
t_spending <- t_spending_raw %>%
filter(state == 49) %>%
filter(!name %in% cooperates) %>%
select(name, enroll, tfedrev, tstrev, tlocrev, totalexp, ppcstot) %>%
mutate(county = str_to_title(str_split_i(name, ' ',1)),
county = ifelse(county == 'Mc', 'McDowell', county))
print(t_spending)
## # A tibble: 55 × 8
## name enroll tfedrev tstrev tlocrev totalexp ppcstot county
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 BARBOUR CO SCH DIST 2144 7559 16584 5872 28021 11885 Barbour
## 2 BERKELEY CO SCH DIST 19722 48407 140127 86699 264253 12704 Berkeley
## 3 BOONE CO SCH DIST 3177 8194 26858 14564 48642 14663 Boone
## 4 BRAXTON CO SCH DIST 1747 5479 12748 6404 24417 13153 Braxton
## 5 BROOKE CO SCH DIST 2582 6791 17114 21352 41908 15642 Brooke
## 6 CABELL CO SCH DIST 11667 42518 88337 66699 183621 14538 Cabell
## 7 CALHOUN CO SCH DIST 861 3254 9953 3190 15154 16085 Calhoun
## 8 CLAY CO SCH DIST 1669 6157 17655 2791 25963 13825 Clay
## 9 DODDRIDGE CO SCH DIST 1082 3455 3999 31752 38493 23563 Doddrid…
## 10 FAYETTE CO SCH DIST 5594 15293 51759 23477 83373 13777 Fayette
## # ℹ 45 more rows
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(stringr)
# Load Unemployment (already done)
t_demographics_unemployed <- read_csv('./demographics/unemployed.csv',
skip = 4,
na = 'N/A') %>%
clean_names() %>%
filter(county != 'West Virginia',
county != 'United States',
!is.na(value_percent)) %>%
select(county, value_percent) %>%
mutate(county = str_remove(county, " County")) %>%
rename(unemployed = value_percent)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 62 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Unemployed)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load Income
t_income <- read_csv('Income (1).csv', skip = 4, na = 'N/A') %>%
clean_names() %>%
filter(county != 'West Virginia',
county != 'United States',
!is.na(value_dollars)) %>%
select(county, value_dollars) %>%
mutate(county = str_remove(county, " County")) %>%
rename(income = value_dollars)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 64 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3139 counties)
## num (1): Value (Dollars)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load Poverty
t_poverty <- read_csv('./Poverty (2).csv', skip = 4, na = 'N/A') %>%
clean_names() %>%
filter(county != 'West Virginia',
county != 'United States',
!is.na(value_percent)) %>%
select(county, value_percent) %>%
mutate(county = str_remove(county, " County")) %>%
rename(poverty = value_percent)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 63 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), Families (Below Poverty)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load population by age CSVs
pop_under18 <- read_csv('Under 18 Pop.csv', skip = 4, na = 'N/A')
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 63 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Age Under 18)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop_18to40 <- read_csv('Population 18-39.csv', skip = 4, na = 'N/A')
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 63 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Age 18-39)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop_over40 <- read_csv('Over 40 pop.csv', skip = 4, na = 'N/A')
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 63 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Age 40 And Over)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean and rename helper function
clean_pop <- function(df, new_colname) {
names(df)[1:4] <- c("county", "fips", "rank", new_colname)
df %>%
select(county, all_of(new_colname)) %>%
mutate(county = str_remove(county, " County"))
}
# Apply cleaning
pop_under18_clean <- clean_pop(pop_under18, "pop_under18")
pop_18to40_clean <- clean_pop(pop_18to40, "pop_18to40")
pop_over40_clean <- clean_pop(pop_over40, "pop_over40")
# Merge all on county
t_population <- pop_18to40_clean %>%
filter(county != "United States" & county != "West Virginia") %>%
left_join(pop_under18_clean, by = "county") %>%
left_join(pop_over40_clean, by = "county") %>%
mutate(
pop_total = rowSums(across(starts_with("pop_")), na.rm = TRUE),
population = pop_18to40 # for consistency with earlier code
)
# Merge data
t_joined <- t_assess %>%
left_join(t_spending, by = "county") %>%
left_join(t_demographics_unemployed, by = "county") %>%
left_join(t_income, by = "county") %>%
left_join(t_population, by = "county") %>%
left_join(t_poverty, by = "county")
##EDA
library(ggplot2)
# Histogram: Unemployment Rate
ggplot(t_joined, aes(x = unemployed)) +
geom_histogram(bins = 20, fill = "steelblue", color = "white") +
labs(title = "Unemployment Rate (%)", x = "Unemployment", y = "Count")
# Histogram: Median Income
ggplot(t_joined, aes(x = income)) +
geom_histogram(bins = 20, fill = "forestgreen", color = "white") +
labs(title = "Median Income ($)", x = "Income", y = "Count")
# Histogram: Population (Age 18–39)
ggplot(t_joined, aes(x = population)) +
geom_histogram(bins = 20, fill = "darkorange", color = "white") +
labs(title = "Population Age 18–39", x = "Population", y = "Count")
# Histogram: Poverty Rate
ggplot(t_joined, aes(x = poverty)) +
geom_histogram(bins = 20, fill = "firebrick", color = "white") +
labs(title = "Poverty Rate (%)", x = "Poverty", y = "Count")
##PCA
t_numeric <- t_joined %>%
select(proficiency_composite,
tlocrev, totalexp, unemployed, income, population)
pca_results <- prcomp(t_numeric, rank = 5, scale = TRUE, center = TRUE)
summary(pca_results)
## Importance of first k=5 (out of 6) components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.947 1.1016 0.74060 0.54735 0.32811
## Proportion of Variance 0.632 0.2023 0.09142 0.04993 0.01794
## Cumulative Proportion 0.632 0.8343 0.92568 0.97561 0.99355
print(pca_results)
## Standard deviations (1, .., p=6):
## [1] 1.9473232 1.1015848 0.7406046 0.5473469 0.3281079 0.1967351
##
## Rotation (n x k) = (6 x 5):
## PC1 PC2 PC3 PC4
## proficiency_composite 0.3570687 -0.4062517 0.65589417 -0.5152698665
## tlocrev 0.4722910 0.2734774 -0.11289400 -0.0371886859
## totalexp 0.4466257 0.4130498 -0.14530825 -0.0598520645
## unemployed -0.2959953 0.5455182 0.71276616 0.2934190066
## income 0.3924816 -0.4172966 0.14445245 0.8021449588
## population 0.4564139 0.3432680 0.08391177 0.0006695556
## PC5
## proficiency_composite 0.06441208
## tlocrev 0.65248737
## totalexp 0.08224018
## unemployed 0.14220328
## income 0.02636055
## population -0.73649930
t_rotate <- as_tibble(pca_results$rotation) %>%
mutate(field = rownames(pca_results$rotation)) %>%
pivot_longer(cols = starts_with('PC'), names_to = 'pc', values_to = 'value')
ggplot(data = t_rotate, mapping = aes(x = pc, y = field)) +
geom_tile(aes(fill = value )) +
geom_text(aes(label = ifelse(abs(value) > 0.2, round(value, 2), ''),
color = "black")) +
scale_fill_gradient2(low = "yellow", mid = "white", high = "green") +
theme_minimal() +
labs(title = "Heatmap of PCA Results",
x = "Principal Component",
y = "County")
library(rpart)
library(rpart.plot)
# Decision Tree
tree_model <- rpart(proficiency_composite ~ income ,
data = t_joined,
method = "anova")
# Plot the tree
rpart.plot(tree_model, main = "Decision Tree for Predicting Composite Proficiency", digits = -5)
library(ggcorrplot)
corr_matrix <- cor(t_numeric, use = "complete.obs")
ggcorrplot(corr_matrix,
lab = TRUE,
title = "Correlation Matrix of County-Level Education and Economic Factors")
ggplot(t_joined, aes(x = income, y = proficiency_composite)) +
geom_point(color = "darkblue", alpha = 0.7, size = 3) +
geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
labs(
title = "Relationship Between Income and Composite Proficiency",
x = "Median Income ($)",
y = "Composite Proficiency (%)"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
model <- lm(proficiency_composite ~
income + totalexp , data = t_joined)
# View model summary
summary(model)
##
## Call:
## lm(formula = proficiency_composite ~ income + totalexp, data = t_joined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8414 -2.8677 0.3934 2.8032 8.9580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.481214904 3.082209317 3.725 0.000482 ***
## income 0.000254224 0.000046272 5.494 0.0000012 ***
## totalexp 0.000007145 0.000009026 0.792 0.432220
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.229 on 52 degrees of freedom
## Multiple R-squared: 0.4499, Adjusted R-squared: 0.4287
## F-statistic: 21.26 on 2 and 52 DF, p-value: 0.0000001788
# Evaluate model performance
# Remove rows with missing values for model predictors
model_data <- t_joined %>%
select(proficiency_composite, income, totalexp)
# Predict using model
predicted <- predict(model, newdata = model_data)
# Compute RMSE using base R
rmse_val <- sqrt(mean((model_data$proficiency_composite - predicted)^2))
library(usmap)
library(tidyverse)
# Step 1: Prepare clean WV counties
wv_counties <- usmap::countypop %>%
filter(abbr == "WV") %>%
mutate(county = str_remove(county, " County")) %>% # <<< Remove " County"
select(fips, county)
# Step 2: Your t_joined already has county names without "County"
# Now join
t_joined_fips <- t_joined %>%
left_join(wv_counties, by = "county")
plot_usmap(data = t_joined_fips,
values = "proficiency_composite",
region = "county",
include = "WV") + # <<< add this line
scale_fill_continuous(name = "Composite Proficiency",
low = "red", high = "blue", label = scales::comma) +
theme(legend.position = "right") +
labs(title = "WV County-Level Composite Proficiency Scores")
The scatter plot reveals a positive relationship between county-level median income and composite student proficiency. Counties with higher median incomes generally demonstrate stronger proficiency outcomes, suggesting that economic resources may enhance educational performance through better access to learning tools, support services, and quality instruction. This is supported by the regression model, which shows income as a statistically significant predictor. Based on these findings, the following recommendations are made:
Increase targeted investment in low-income counties where proficiency scores are low.
Monitor and evaluate the effectiveness of spending strategies, especially in areas where high funding does not lead to better outcomes.
Support economic development, since income levels appear to influence educational success more strongly than direct spending alone.
Future work should analyze longitudinal data to explore changes over time and test for interaction effects between variables.