WV County Education Outcomes Prediction

Thesis

This analysis explores the impact of county-level socioeconomic and funding factors (e.g., income, unemployment, spending) on composite educational proficiency in West Virginia. We aim to predict educational outcomes using both regression and decision tree models and explore hidden variable structures via PCA. My main inputs to determining how income and total spending during the school year affects a county proficiency in schools

Load assessment data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(rpart)
library(readxl)
options(scipen = 999)
assessment_path <- './wv ed student achievement/Historical_AssessmentResults_SY15-to-SY21.xlsx'


t_assess_raw_school <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'b2:f7312')


t_assess_raw_science <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'db3:db7312', 
                           col_names = c('science_proficiency'),
                           na = '**')

t_assess_raw_reading <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'ch3:ch7312', 
                           col_names = c('reading_proficiency'),
                           na = '**')

t_assess_raw_math <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'at3:at7312', 
                           col_names = c('math_proficiency'),
                           na = '**')


t_assess_raw <- t_assess_raw_school %>%
  bind_cols(t_assess_raw_science) %>%
  bind_cols(t_assess_raw_reading) %>%
  bind_cols(t_assess_raw_math) %>%
  janitor::clean_names() 


# Remove subgroups
t_assess <- t_assess_raw %>%
  filter(school == 999) %>%
  filter(population_group == 'Total Population') %>%
  filter(county != 'Statewide') %>%
  transmute(
    county,
    proficiency_science = science_proficiency,
    proficiency_reading = reading_proficiency,
    proficiency_math = math_proficiency,
    proficiency_composite = rowMeans(
      cbind(science_proficiency, reading_proficiency, math_proficiency),
      na.rm = TRUE
    )
  )

print(t_assess)

## # A tibble: 55 × 5
##    county    proficiency_science proficiency_reading proficiency_math
##    <chr>                   <dbl>               <dbl>            <dbl>
##  1 Barbour                  26.0                35.3             24.5
##  2 Berkeley                 28.6                38.7             20.8
##  3 Boone                    19.6                36.4             26  
##  4 Braxton                  22.6                31.5             15.8
##  5 Brooke                   21.1                35.7             26.8
##  6 Cabell                   30.8                42.6             31.6
##  7 Calhoun                  27.8                35.0             18.5
##  8 Clay                     23.3                36.0             21.8
##  9 Doddridge                31.3                40.6             26.3
## 10 Fayette                  17.4                31.9             17.6
## # ℹ 45 more rows
## # ℹ 1 more variable: proficiency_composite <dbl>

Load spending data

spending_path <- './us census ed spending/elsec22t.xls'

t_spending_raw <- read_excel(path = spending_path,,
                           sheet = 'elsec22t',
                           range = 'a1:gb14106') %>% 
  janitor::clean_names()

## New names:
## • `` -> `...73`
## • `` -> `...74`
## • `` -> `...75`
## • `` -> `...76`
## • `` -> `...77`
## • `` -> `...78`
## • `` -> `...79`
## • `` -> `...80`
## • `` -> `...81`
## • `` -> `...82`
## • `` -> `...83`
## • `` -> `...84`
## • `` -> `...85`
## • `` -> `...86`
## • `` -> `...87`
## • `` -> `...88`
## • `` -> `...89`
## • `` -> `...90`
## • `` -> `...91`
## • `` -> `...92`
## • `` -> `...93`
## • `` -> `...94`
## • `` -> `...95`
## • `` -> `...96`
## • `` -> `...97`
## • `` -> `...98`
## • `` -> `...99`
## • `` -> `...100`
## • `` -> `...101`
## • `` -> `...102`
## • `` -> `...103`
## • `` -> `...104`
## • `` -> `...105`
## • `` -> `...106`
## • `` -> `...107`
## • `` -> `...108`
## • `` -> `...109`
## • `` -> `...110`
## • `` -> `...111`
## • `` -> `...112`
## • `` -> `...113`
## • `` -> `...114`
## • `` -> `...115`
## • `` -> `...116`
## • `` -> `...117`
## • `` -> `...118`
## • `` -> `...119`
## • `` -> `...120`
## • `` -> `...121`
## • `` -> `...122`
## • `` -> `...123`
## • `` -> `...124`
## • `` -> `...125`
## • `` -> `...126`
## • `` -> `...127`
## • `` -> `...128`
## • `` -> `...129`
## • `` -> `...130`
## • `` -> `...131`
## • `` -> `...132`
## • `` -> `...133`
## • `` -> `...134`
## • `` -> `...135`
## • `` -> `...136`
## • `` -> `...137`
## • `` -> `...138`
## • `` -> `...139`
## • `` -> `...140`
## • `` -> `...141`
## • `` -> `...142`
## • `` -> `...143`
## • `` -> `...144`
## • `` -> `...145`
## • `` -> `...146`
## • `` -> `...147`
## • `` -> `...148`
## • `` -> `...149`
## • `` -> `...150`
## • `` -> `...151`
## • `` -> `...152`
## • `` -> `...153`
## • `` -> `...154`
## • `` -> `...155`
## • `` -> `...156`
## • `` -> `...157`
## • `` -> `...158`
## • `` -> `...159`
## • `` -> `...160`
## • `` -> `...161`
## • `` -> `...162`
## • `` -> `...163`
## • `` -> `...164`
## • `` -> `...165`
## • `` -> `...166`
## • `` -> `...167`
## • `` -> `...168`
## • `` -> `...169`
## • `` -> `...170`
## • `` -> `...171`
## • `` -> `...172`
## • `` -> `...173`
## • `` -> `...174`
## • `` -> `...175`
## • `` -> `...176`
## • `` -> `...177`
## • `` -> `...178`
## • `` -> `...179`
## • `` -> `...180`
## • `` -> `...181`
## • `` -> `...182`
## • `` -> `...183`
## • `` -> `...184`

cooperates <- c('MOUNTAIN STATE EDUCATIONAL SERVICES COOPERATIVE',
                'EASTERN PANHANDLE INSTRUCTIONAL COOPERATIVE',
                'SOUTHERN EDUCATIONAL SERVICES COOPERATIVE')

t_spending <- t_spending_raw %>% 
  filter(state == 49) %>% 
  filter(!name %in% cooperates) %>% 
  select(name, enroll, tfedrev, tstrev, tlocrev, totalexp, ppcstot) %>% 
  mutate(county = str_to_title(str_split_i(name, ' ',1)),
         county = ifelse(county == 'Mc', 'McDowell', county))


print(t_spending)

## # A tibble: 55 × 8
##    name                  enroll tfedrev tstrev tlocrev totalexp ppcstot county  
##    <chr>                  <dbl>   <dbl>  <dbl>   <dbl>    <dbl>   <dbl> <chr>   
##  1 BARBOUR CO SCH DIST     2144    7559  16584    5872    28021   11885 Barbour 
##  2 BERKELEY CO SCH DIST   19722   48407 140127   86699   264253   12704 Berkeley
##  3 BOONE CO SCH DIST       3177    8194  26858   14564    48642   14663 Boone   
##  4 BRAXTON CO SCH DIST     1747    5479  12748    6404    24417   13153 Braxton 
##  5 BROOKE CO SCH DIST      2582    6791  17114   21352    41908   15642 Brooke  
##  6 CABELL CO SCH DIST     11667   42518  88337   66699   183621   14538 Cabell  
##  7 CALHOUN CO SCH DIST      861    3254   9953    3190    15154   16085 Calhoun 
##  8 CLAY CO SCH DIST        1669    6157  17655    2791    25963   13825 Clay    
##  9 DODDRIDGE CO SCH DIST   1082    3455   3999   31752    38493   23563 Doddrid…
## 10 FAYETTE CO SCH DIST     5594   15293  51759   23477    83373   13777 Fayette 
## # ℹ 45 more rows

Load demographic data

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(stringr)

# Load Unemployment (already done)
t_demographics_unemployed <- read_csv('./demographics/unemployed.csv', 
                            skip = 4,
                            na = 'N/A') %>%
  clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_percent)) %>% 
  select(county, value_percent) %>%
  mutate(county = str_remove(county, " County")) %>%
  rename(unemployed = value_percent)

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 62 Columns: 5

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Unemployed)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Load Income
t_income <- read_csv('Income (1).csv', skip = 4, na = 'N/A') %>%
  clean_names() %>%
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_dollars)) %>%
  select(county, value_dollars) %>%
  mutate(county = str_remove(county, " County")) %>%
  rename(income = value_dollars)

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 64 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3139 counties)
## num (1): Value (Dollars)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Load Poverty
t_poverty <- read_csv('./Poverty (2).csv', skip = 4, na = 'N/A') %>%
  clean_names() %>%
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_percent)) %>%
  select(county, value_percent) %>%
  mutate(county = str_remove(county, " County")) %>%
  rename(poverty = value_percent)

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 63 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), Families (Below Poverty)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Load population by age CSVs
pop_under18 <- read_csv('Under 18 Pop.csv', skip = 4, na = 'N/A')

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 63 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Age Under 18)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pop_18to40 <- read_csv('Population 18-39.csv', skip = 4, na = 'N/A')

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 63 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Age 18-39)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pop_over40 <- read_csv('Over 40 pop.csv', skip = 4, na = 'N/A')

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 63 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Age 40 And Over)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Clean and rename helper function
clean_pop <- function(df, new_colname) {
  names(df)[1:4] <- c("county", "fips", "rank", new_colname)
  df %>%
    select(county, all_of(new_colname)) %>%
    mutate(county = str_remove(county, " County"))
}

# Apply cleaning
pop_under18_clean <- clean_pop(pop_under18, "pop_under18")
pop_18to40_clean <- clean_pop(pop_18to40, "pop_18to40")
pop_over40_clean <- clean_pop(pop_over40, "pop_over40")

# Merge all on county
t_population <- pop_18to40_clean %>%
  filter(county != "United States" & county != "West Virginia") %>% 
  left_join(pop_under18_clean, by = "county") %>%
  left_join(pop_over40_clean, by = "county") %>%
  mutate(
    pop_total = rowSums(across(starts_with("pop_")), na.rm = TRUE),
    population = pop_18to40  # for consistency with earlier code
  )

Joined data

# Merge data
t_joined <- t_assess %>%
  left_join(t_spending, by = "county") %>%
  left_join(t_demographics_unemployed, by = "county") %>%
  left_join(t_income, by = "county") %>%
  left_join(t_population, by = "county") %>%
  left_join(t_poverty, by = "county")

##EDA

library(ggplot2)

# Histogram: Unemployment Rate
ggplot(t_joined, aes(x = unemployed)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(title = "Unemployment Rate (%)", x = "Unemployment", y = "Count")

# Histogram: Median Income
ggplot(t_joined, aes(x = income)) +
  geom_histogram(bins = 20, fill = "forestgreen", color = "white") +
  labs(title = "Median Income ($)", x = "Income", y = "Count")

# Histogram: Population (Age 18–39)
ggplot(t_joined, aes(x = population)) +
  geom_histogram(bins = 20, fill = "darkorange", color = "white") +
  labs(title = "Population Age 18–39", x = "Population", y = "Count")

# Histogram: Poverty Rate
ggplot(t_joined, aes(x = poverty)) +
  geom_histogram(bins = 20, fill = "firebrick", color = "white") +
  labs(title = "Poverty Rate (%)", x = "Poverty", y = "Count")

##PCA

t_numeric <- t_joined %>%
  select(proficiency_composite,
        tlocrev, totalexp, unemployed, income, population)


pca_results <- prcomp(t_numeric, rank = 5, scale = TRUE, center = TRUE)



summary(pca_results)

## Importance of first k=5 (out of 6) components:
##                          PC1    PC2     PC3     PC4     PC5
## Standard deviation     1.947 1.1016 0.74060 0.54735 0.32811
## Proportion of Variance 0.632 0.2023 0.09142 0.04993 0.01794
## Cumulative Proportion  0.632 0.8343 0.92568 0.97561 0.99355

print(pca_results)

## Standard deviations (1, .., p=6):
## [1] 1.9473232 1.1015848 0.7406046 0.5473469 0.3281079 0.1967351
## 
## Rotation (n x k) = (6 x 5):
##                              PC1        PC2         PC3           PC4
## proficiency_composite  0.3570687 -0.4062517  0.65589417 -0.5152698665
## tlocrev                0.4722910  0.2734774 -0.11289400 -0.0371886859
## totalexp               0.4466257  0.4130498 -0.14530825 -0.0598520645
## unemployed            -0.2959953  0.5455182  0.71276616  0.2934190066
## income                 0.3924816 -0.4172966  0.14445245  0.8021449588
## population             0.4564139  0.3432680  0.08391177  0.0006695556
##                               PC5
## proficiency_composite  0.06441208
## tlocrev                0.65248737
## totalexp               0.08224018
## unemployed             0.14220328
## income                 0.02636055
## population            -0.73649930

t_rotate <- as_tibble(pca_results$rotation) %>%
  mutate(field = rownames(pca_results$rotation)) %>%
  pivot_longer(cols = starts_with('PC'), names_to = 'pc', values_to = 'value')


ggplot(data = t_rotate, mapping = aes(x = pc, y = field)) +
  geom_tile(aes(fill = value )) +
  geom_text(aes(label = ifelse(abs(value) > 0.2, round(value, 2), ''),
                color = "black")) +
  scale_fill_gradient2(low = "yellow", mid = "white", high = "green") +
  theme_minimal() +
  labs(title = "Heatmap of PCA Results",
       x = "Principal Component",
       y = "County")

Decision Tree Model

library(rpart)
library(rpart.plot)

# Decision Tree
tree_model <- rpart(proficiency_composite ~ income  , 
                    data = t_joined, 
                    method = "anova")

# Plot the tree
rpart.plot(tree_model, main = "Decision Tree for Predicting Composite Proficiency", digits = -5)

Correlations

library(ggcorrplot)



corr_matrix <- cor(t_numeric, use = "complete.obs")


ggcorrplot(corr_matrix, 
           lab = TRUE, 
           title = "Correlation Matrix of County-Level Education and Economic Factors")

ggplot(t_joined, aes(x = income, y = proficiency_composite)) +
  geom_point(color = "darkblue", alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  labs(
    title = "Relationship Between Income and Composite Proficiency",
    x = "Median Income ($)",
    y = "Composite Proficiency (%)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Linear Regression Model

model <- lm(proficiency_composite ~ 
        income + totalexp , data = t_joined)

# View model summary
summary(model)

## 
## Call:
## lm(formula = proficiency_composite ~ income + totalexp, data = t_joined)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8414 -2.8677  0.3934  2.8032  8.9580 
## 
## Coefficients:
##                 Estimate   Std. Error t value  Pr(>|t|)    
## (Intercept) 11.481214904  3.082209317   3.725  0.000482 ***
## income       0.000254224  0.000046272   5.494 0.0000012 ***
## totalexp     0.000007145  0.000009026   0.792  0.432220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.229 on 52 degrees of freedom
## Multiple R-squared:  0.4499, Adjusted R-squared:  0.4287 
## F-statistic: 21.26 on 2 and 52 DF,  p-value: 0.0000001788

# Evaluate model performance


# Remove rows with missing values for model predictors
model_data <- t_joined %>%
  select(proficiency_composite, income, totalexp) 
  

# Predict using model
predicted <- predict(model, newdata = model_data)

# Compute RMSE using base R
rmse_val <- sqrt(mean((model_data$proficiency_composite - predicted)^2))

library(usmap)
library(tidyverse)

# Step 1: Prepare clean WV counties
wv_counties <- usmap::countypop %>%
  filter(abbr == "WV") %>%
  mutate(county = str_remove(county, " County")) %>%   # <<< Remove " County"
  select(fips, county)

# Step 2: Your t_joined already has county names without "County"
# Now join
t_joined_fips <- t_joined %>%
  left_join(wv_counties, by = "county")

plot_usmap(data = t_joined_fips,
           values = "proficiency_composite",
           region = "county",
           include = "WV") +    # <<< add this line
  scale_fill_continuous(name = "Composite Proficiency",
                        low = "red", high = "blue", label = scales::comma) +
  theme(legend.position = "right") +
  labs(title = "WV County-Level Composite Proficiency Scores")

Summary And Recommendations

The scatter plot reveals a positive relationship between county-level median income and composite student proficiency. Counties with higher median incomes generally demonstrate stronger proficiency outcomes, suggesting that economic resources may enhance educational performance through better access to learning tools, support services, and quality instruction. This is supported by the regression model, which shows income as a statistically significant predictor. Based on these findings, the following recommendations are made:

Increase targeted investment in low-income counties where proficiency scores are low.

Monitor and evaluate the effectiveness of spending strategies, especially in areas where high funding does not lead to better outcomes.

Support economic development, since income levels appear to influence educational success more strongly than direct spending alone.

Future work should analyze longitudinal data to explore changes over time and test for interaction effects between variables.