Created by Riley Kearney. Updated 12/2/2024
Thesis
Median household income, local revenue, enrollment, and unemployment
are significant factors infuencing proficiency scores in county
schools.
Data
The data used this project includes unemployment demographics, county
revenues and spending, and proficiency scores for various counties.
These data sets were provided by Professor Garrett. Additionally, I
incorporated median family income and region data for each county.
Key Variables Included:
tlocrev: local revenue for each county in dollars
enroll: enrollment for schools in each county
med_income: median family income in each county
unemployed: unemployment rate for each county
proficiency: proficiency scores in each county
Methods
Correlations

Above is a correlation graph for me to see the different correlations
between the variables in my data set.
PCA

This PCA graph shows me that since proficiency and unemployment point
in opposite directions, there is a negative correlation between the two.
Proficiency and average household income have a very strong positive
correlation, as well as the number enrolled and the local revenue.
Decision Tree

Counties with higher local revenue tend to have higher predicted
proficiency scores. In contrast, counties with lower local revenues tend
to have more splits in the decision tree, indicating the presence of
distinct subgroup behaviors. For example, even within the group of
counties with lower local revenues, those with lower enrollment (below
or equal to 2,255) and greater average income tend to have a high
predicted proficiency score.
Neural Network
I haven’t been able to start this visualization yet.
Prediction
Still need to do.
Limitations
Still need to do.
Resources
Sources Included:
---
title: "Project 3"
output: html_notebook
---
Created by Riley Kearney. Updated 12/2/2024

### Thesis 
Median household income, local revenue, enrollment, and unemployment are significant factors infuencing proficiency scores in county schools. 


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(tidyverse)
library(caret)
library(rpart)
library(readxl)
assessment_path <- './wv ed student achievement/Historical_AssessmentResults_SY15-to-SY21.xlsx'


t_assess_raw_school <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'b2:f7312')


t_assess_raw_science <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'db3:db7312', 
                           col_names = c('science_proficiency'),
                           na = '**')

t_assess_raw <- t_assess_raw_school %>%
  bind_cols(t_assess_raw_science) %>% 
  janitor::clean_names()  


# Remove subgroups
t_assess <- t_assess_raw %>% 
  filter(school == 999) %>% 
  filter(population_group == 'Total Population') %>% 
  filter(county != 'Statewide') %>% 
  mutate(proficiency = science_proficiency)  

# Add regions
regions <- read_csv('wv_regions.csv')
t_assess <- left_join(t_assess, regions, by = 'county')

# Assign number to each region
t_assess <- t_assess %>% 
  mutate(region_number = as.numeric(factor(region)))

print(t_assess)
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
spending_path <- './us census ed spending/elsec22t.xls'

t_spending_raw <- read_excel(path = spending_path,,
                           sheet = 'elsec22t',
                           range = 'a1:gb14106') %>% 
  janitor::clean_names()


cooperates <- c('MOUNTAIN STATE EDUCATIONAL SERVICES COOPERATIVE',
                'EASTERN PANHANDLE INSTRUCTIONAL COOPERATIVE',
                'SOUTHERN EDUCATIONAL SERVICES COOPERATIVE')

t_spending <- t_spending_raw %>% 
  filter(state == 49) %>% 
  filter(!name %in% cooperates) %>% 
  select(name, enroll, tfedrev, tstrev, tlocrev, totalexp, ppcstot) %>% 
  mutate(county = str_to_title(str_split_i(name, ' ',1)),
         county = ifelse(county == 'Mc', 'McDowell', county))


print(t_spending)
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}

  

t_demographics_unemployed <- read_csv('./demographics/unemployed.csv', 
                            skip = 4,
                            na = 'N/A') %>%
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_percent) ) %>% 
  select(county, value_percent) %>%
  rename(unemployed = value_percent)


t_unemployed <-  t_demographics_unemployed

t_unemployed$county <- str_replace(t_unemployed$county, " County", "")

print(t_unemployed)
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(stringr)
income_raw <- read_csv('HDPulse_data_export.csv', 
                       skip = 6)
household_income <- income_raw %>% 
  rename(county = 'United States', 
         med_income = '92,646') %>% 
  filter(!is.na(med_income)) %>% 
  select(county, med_income)

household_income$county <- str_replace(household_income$county, " County", "")

print(household_income)

```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}

# Merge data
t <- t_assess %>% 
  left_join(household_income, by = "county") %>% 
  left_join(t_unemployed, by = "county") %>% 
  left_join(t_spending, by = "county")

print(t)
```

### Data

The data used this project includes unemployment demographics, county revenues and spending, and proficiency scores for various counties. These data sets were provided by Professor Garrett. Additionally, I incorporated median family income and region data for each county. 

Key Variables Included:

- tlocrev: local revenue for each county in dollars

- enroll: enrollment for schools in each county

- med_income: median family income in each county

- unemployed: unemployment rate for each county

- proficiency: proficiency scores in each county


### Methods

#### Correlations

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(ggcorrplot)

numeric_data <- t %>% 
  select(where(is.numeric))
corr_data <- numeric_data %>% 
  select(proficiency, med_income, unemployed, enroll, tfedrev, tstrev, tlocrev, totalexp)
cor_matrix <- cor(corr_data, use = "complete.obs")

ggcorrplot(cor(corr_data))

```
Above is a correlation graph for me to see the different correlations between the variables in my data set. 



#### PCA

```{undefined echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
pca_data <- numeric_data %>% 
  select(proficiency, med_income, unemployed, tlocrev, enroll)

pca_results <- prcomp(pca_data, 
                      center = TRUE, 
                      scale = TRUE)
biplot(pca_results)


```
This PCA graph shows me that since proficiency and unemployment point in opposite directions, there is a negative correlation between the two. Proficiency and average household income have a very strong positive correlation, as well as the number enrolled and the local revenue. 


#### Decision Tree

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(rpart.plot)
library(rpart)

tree_model <- rpart(proficiency ~ med_income + unemployed + enroll + tlocrev, 
                    data = numeric_data, 
                    method = "anova", 
                    control = rpart.control(minsplit = 10, 
                                            minbucket = 5))

rpart.plot(tree_model)

```
Counties with higher local revenue tend to have higher predicted proficiency scores. In contrast, counties with lower local revenues tend to have more splits in the decision tree, indicating the presence of distinct subgroup behaviors. For example, even within the group of counties with lower local revenues, those with lower enrollment (below or equal to 2,255) and greater average income tend to have a high predicted proficiency score. 



### Neural Network 

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(neuralnet)




```
I haven't been able to start this visualization yet. 




### Prediction

```{r}





```
Still need to do.


### Limitations

Still need to do. 


### Resources

Sources Included: 

- ChatGPT for code errors

- https://hdpulse.nimhd.nih.gov/data-portal/social/map?socialtopic=030&socialtopic_options=social_6&demo=00010&demo_options=income_3&race=00&race_options=race_7&sex=0&sex_options=sexboth_1&age=001&age_options=ageall_1&statefips=54&statefips_options=area_states

- https://www.dropbox.com/scl/fo/s29xwwg21irckz9gzjx39/AC3W8m02KLAgItDfejmmvrU?rlkey=4h226idmd0n696zyjcrk2kegb&e=1&dl=0

















