1 Introduction
2 Data Collection and Cleaning
3 Data Preparation
4 Correlations
5 Linear Regression Model
6 Train and Test Data
7 R-Squared and RMSE
8 Plots of Predictions and Residuals
9 Unsupervised Learning

1 Introduction

This notebook explores the relationship between county-level education outcomes in West Virginia and various socioeconomic factors, including unemployment rates and spending on education. The goal is to build a predictive model for student proficiency based on these variables and to visualize the results.

2 Data Collection and Cleaning

2.1 Assessment Data

assessment_path <- "C:/Users/conmi/Downloads/Historical_AssessmentResults_SY15-to-SY21.xlsx"

t_assess_raw_school <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'b2:f7312')

t_assess_raw_science <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'db3:db7312', 
                           col_names = c('science_proficiency'),
                           na = '**')

t_assess_raw <- t_assess_raw_school %>%
  bind_cols(t_assess_raw_science) %>% 
  janitor::clean_names()  

t_assess <- t_assess_raw %>% 
  filter(school == 999) %>% 
  filter(population_group == 'Total Population') %>% 
  filter(county != 'Statewide') %>% 
  mutate(proficiency = science_proficiency)  

print(t_assess)

2.2 Spending Data

The spending data cleanup was too large to include in this page

2.3 Demographics Data

t_demographics_unemployed <- read_csv("C:/Users/conmi/Downloads/unemployed.csv", 
                            skip = 4,
                            na = 'N/A') %>%
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_percent) ) %>% 
  rename(unemployed = value_percent)

## Rows: 62 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): County, FIPS, Rank within US (of 3143 counties)
## dbl (2): Value (Percent), People (Unemployed)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# get rid of the word county in county column
t_demographics <-  t_demographics_unemployed %>% 
  mutate(county = str_replace(county, ' County', '')) 


 wv_county_population <- tibble::tibble(
  county = c(
    "Kanawha", "Berkeley", "Monongalia", "Cabell", "Wood",
    "Raleigh", "Harrison", "Jefferson", "Mercer", "Marion",
    "Putnam", "Fayette", "Wayne", "Logan", "Monroe",
    "Greenbrier", "Brooke", "Hancock", "Marshall", "Ohio",
    "Upshur", "Nicholas", "Mason", "Jackson", "Boone",
    "Preston", "Barbour", "Randolph", "McDowell", "Wyoming",
    "Lewis", "Taylor", "Tyler", "Lincoln", "Wetzel",
    "Pocahontas", "Gilmer", "Webster", "Braxton", "Calhoun",
    "Ritchie", "Pendleton", "Mineral", "Grant", "Hardy",
    "Tucker", "Roane", "Summers", "Morgan", "Pleasants",
    "Clay", "Doddridge", "Wirt", "Mingo", "Webster"
  ),
  population = c(
    172687, 138562, 109198, 91276, 82380,
    74124, 65921, 57338, 59000, 56000,
    55000, 46000, 42000, 37000, 33000,
    32000, 31000, 30000, 29000, 28000,
    27000, 26000, 25000, 24000, 23000,
    22000, 21000, 20000, 19000, 18000,
    17000, 16000, 15000, 14000, 13000,
    12000, 11000, 10000, 9000, 8000,
    7000, 6000, 5000, 4000, 3000,
    2000, 1000, 900, 800, 700,
    600, 500, 400, 300, 200
  )
) 



print(t_demographics)

3 Data Preparation

I first merged all of the tibbles together using the county as the key. I then created a log transformation of the local revenue and federal revenue per student. I also created a new column for the unemployment rate and expenditures per person. This allowed me to better analyze the data and create a more accurate model because all of the variables are now on the same scale.

Joining Data

# Merge data
t <- t_assess
t <- t %>% 
  left_join(t_spending, by = 'county') %>% 
  left_join(t_demographics, by = 'county') %>%
  left_join(wv_county_population, by = 'county') %>%
  mutate(log_local_rev = log(local_revenue))

t$population <- as.numeric(t$population)

Selecting Relevant Columns and Cleaning Data

# Select relevant columns and clean data

t_select <- t %>%
  select(proficiency, log_local_rev, ppitotal, locrprop, people_unemployed, total_expenditures, unemployed, population) %>%
  mutate(across(everything(), as.numeric),
         unemployed_rate = unemployed / 100,
         unemployed_expenditures = total_expenditures * unemployed_rate) %>% 
  drop_na()

t_attempt <- t_select %>%
  select(proficiency, unemployed_expenditures, locrprop, people_unemployed, population, unemployed_rate) %>% 
  mutate(log_people_unemployed = log(people_unemployed + 1) / population,
       people_unemployed_scaled = scale(people_unemployed),
       expenditures_per_person = unemployed_expenditures / population) %>% 
  select(proficiency, people_unemployed_scaled, unemployed_expenditures, locrprop)

4 Correlations

I created a correlation matrix to visualize the relationships between the variables in the dataset. This helps to identify any potential issues and understand how the variables interact with each other. Below is the correlation matrix and a pairs plot to visualize the relationships between the variables.

5 Linear Regression Model

I then created a linear regression model to predict student proficiency based on the selected variables. The summary of the model provides insights into the significance of each variable and the overall fit of the model. One part I noticed to be interesting was that the people_unemployed_scaled variable was a positive relationship. At first this seemed wrong but I then considored the fact that a plac with more people means more unemployed people, which would lead to a higher amount of funding help torwards things like schools therefore leading to a higher proficiency rate. It was an interesting to see how the model performed and what the results were.

## 
## Call:
## lm(formula = proficiency ~ locrprop + unemployed_expenditures + 
##     people_unemployed_scaled, data = t_attempt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3138 -3.4729 -0.3979  2.6662  8.4882 
## 
## Coefficients:
##                             Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)              32.91222457  2.48686920  13.234 < 0.0000000000000002
## locrprop                  0.00012822  0.00004641   2.763             0.007999
## unemployed_expenditures  -0.00246284  0.00054363  -4.530            0.0000368
## people_unemployed_scaled  8.50482562  2.19883807   3.868             0.000318
##                             
## (Intercept)              ***
## locrprop                 ** 
## unemployed_expenditures  ***
## people_unemployed_scaled ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.419 on 50 degrees of freedom
## Multiple R-squared:  0.4335, Adjusted R-squared:  0.3996 
## F-statistic: 12.76 on 3 and 50 DF,  p-value: 0.000002607

6 Train and Test Data

I created a train and test dataset to evaluate the performance of a regression tree model. The train dataset is used to build the model, while the test dataset is used to evaluate its performance. I then built a regression tree model to evaluate the results.

7 R-Squared and RMSE

## [1] "RMSE: 3.91"

## [1] "R-squared: 0.42"

8 Plots of Predictions and Residuals

Below is a plot of the predicted vs actual proficiency scores, along with a plot of the residuals from the regression tree model. The red line in the first plot represents the ideal case where predicted values equal actual values. The residuals plot shows the difference between the predicted and actual values, which helps to identify any patterns or issues with the model. We can see that the residuals are fairly evenly distributed around zero, indicating that the model is performing well. The histogram of the residuals also shows that a lot of the points are close to zero, which is a good sign.

9 Unsupervised Learning

I created a k-means clustering model to identify patterns in the data. The clusters were formed based on the unemployment expenditures, number of unemployed people, and local property tax. The results were visualized using a scatter plot with log-transformed spending per person on the x-axis and the scaled number of unemployed people on the y-axis. As we can see the clusters in one line meaning that the clusters are not very distinct. This could be due to the fact that the variables are not very different from each other or that there is not enough variation in the data. However, it is still interesting to see how the clusters are formed and how they relate to the variables.

## Warning: Unknown or uninitialised column: `expenditures_per_person`.

# Conclusion

This analysis provided insights into the relationship between county-level education outcomes and socioeconomic factors in West Virginia. The linear regression and regression tree models helped to identify significant predictors of student proficiency. The results suggest that local revenue, unemployment expenditures, and the number of unemployed individuals are important factors influencing education outcomes.

https://rpubs.com/ConnorO123/1303793

WV County Education Outcomes Prediction