Created by Riley Kearney. Updated 12/8/2024

Thesis

Counties that have higher bachelor degree rates tend to have higher science proficiency scores.

Data

The data used this project includes unemployment demographics, county revenues and spending, and proficiency scores for various counties. These data sets were provided by Professor Garrett. Additionally, I incorporated median family income, region data for each county, bachelor degree rate, poverty rate, population rate for under 18 year olds, and foreign population of each county.

Key Variables Included:

Methods

Correlations

Above is a correlation graph for me to see the different correlations between the variables in my data set.

Linear Regression


Call:
lm(formula = proficiency ~ med_income + bachelors_pop + poverty_rate + 
    under_18_rate, data = numeric_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9331 -2.8643 -0.1598  2.8148  8.8230 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    1.889e+01  8.074e+00   2.340   0.0233 *
med_income     1.238e-04  9.788e-05   1.265   0.2119  
bachelors_pop  3.041e-01  1.308e-01   2.326   0.0241 *
poverty_rate  -1.279e-01  2.136e-01  -0.599   0.5522  
under_18_rate -2.983e-01  3.410e-01  -0.875   0.3858  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.24 on 50 degrees of freedom
Multiple R-squared:  0.5082,    Adjusted R-squared:  0.4688 
F-statistic: 12.92 on 4 and 50 DF,  p-value: 2.704e-07

The regression model above shows that there is a positive correlation between median family income and population of people with their bachelors degree with proficiency scores. There is a negative correlation with poverty rate and population of people under the age of 18 with proficiency scores. The overall model has a 0.47 adjusted R-squared which I’m happy with considering a good R-squared is 0.5 and greater. The p-value is also a good number because it is under 0.05. Bachelors population is a significant predictor when predicting proficiency for all grades.

K-means

The visualization above shows counties grouped based on demographics as median family income, bachelor degree rate, poverty rate, proficiency, and population of children (18 and younger). The clusters reveal patterns where counties with higher bachelor degree rates tend to have higher proficiency scores. The red cluster shows counties that may be economically disadvantaged with lower educational attainment, the green cluster represents counties that have higher educational attainment and outcomes, and the blue cluster represents the counties that fall in a middle of the two extremes.

Decision Tree

The decision tree above predicts proficiency based on the key demographic variables I’ve been working with throughout this project. This visualization shows once again that the bachelors degree rate in each county has a significant impact on proficiency scores, because in this decision tree it is the first and most important split. As well as the population of children in each county has an impact on proficiency scores as well. Overall the tree shows how education levels and socioeconomic factors play a role in children’s proficiency scores.

Neural Network

The visualization above represents the structure of the neural network, which predicts proficiency scores based on inputs like median income, bachelor degree rates, poverty rates, and population rate of 18 and younger. The error value is 0.010113 which indicates that the network has learned the relationships effectively capturing complex patterns.

When looking at the actual and predicted proficiency generated by the neural network, the plot shows that the model captures general trends, as many points are close to the red line. However there is still noticeable variability, indicating that the predictions are not perfectly accurate. This suggests the model performs reasonably well but may require further refinement to improve accuracy.

Prediction

Based on all my visualizations, I’ve come to the prediction that percentage of people with bachelor degrees and poverty rates in a county. Counties with higher bachelor’s degree attainment tend to have higher proficiency scores, while those with higher poverty rates generally see lower proficiency. The decision tree and k-means clustering show distinct groups based on these patterns, highlighting the importance of these socioeconomic variables. The neural network predictions align with these trends as well, with little variability. Overall, if I was presenting this to a school board I would recommend to encourage furthing education after k-12 to their students, so that in the future those proficiency scores will increase.

Limitations

One limitation of this project is that it does not provide immediate solutions for improving proficiency scores, as factors identified, such as bachelor’s degree attainment and poverty, require long term efforts to address. Additionally, only including socioeconimic variables and leaving out funding or school resources data could have potentially skewed the predictions. In the future, I would expand the data to included funding variables as well to try to improve the model’s accuracy and provide more insight.

Resources

Sources Included:

---
title: "Project 3"
output: html_notebook
---
Created by Riley Kearney. Updated 12/8/2024

### Thesis 
Counties that have higher bachelor degree rates tend to have higher science proficiency scores. 


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(tidyverse)
library(caret)
library(rpart)
library(readxl)
assessment_path <- './wv ed student achievement/Historical_AssessmentResults_SY15-to-SY21.xlsx'


t_assess_raw_school <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'b2:f7312')


t_assess_raw_science <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'db3:db7312', 
                           col_names = c('science_proficiency'),
                           na = '**')

t_assess_raw <- t_assess_raw_school %>%
  bind_cols(t_assess_raw_science) %>% 
  janitor::clean_names()  


# Remove subgroups
t_assess <- t_assess_raw %>% 
  filter(school == 999) %>% 
  filter(population_group == 'Total Population') %>% 
  filter(county != 'Statewide') %>% 
  mutate(proficiency = science_proficiency)  

# Add regions
regions <- read_csv('wv_regions.csv')
t_assess <- left_join(t_assess, regions, by = 'county')

# Assign number to each region
t_assess <- t_assess %>% 
  mutate(region_number = as.numeric(factor(region)))

print(t_assess)
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
spending_path <- './us census ed spending/elsec22t.xls'

t_spending_raw <- read_excel(path = spending_path,,
                           sheet = 'elsec22t',
                           range = 'a1:gb14106') %>% 
  janitor::clean_names()


cooperates <- c('MOUNTAIN STATE EDUCATIONAL SERVICES COOPERATIVE',
                'EASTERN PANHANDLE INSTRUCTIONAL COOPERATIVE',
                'SOUTHERN EDUCATIONAL SERVICES COOPERATIVE')

t_spending <- t_spending_raw %>% 
  filter(state == 49) %>% 
  filter(!name %in% cooperates) %>% 
  select(name, enroll, tfedrev, tstrev, tlocrev, totalexp, ppcstot) %>% 
  mutate(county = str_to_title(str_split_i(name, ' ',1)),
         county = ifelse(county == 'Mc', 'McDowell', county))


print(t_spending)
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}

  

t_unemployed_raw <- read_csv('./demographics/unemployed.csv', 
                            skip = 4,
                            na = 'N/A') 

t_unemployed <- t_unemployed_raw %>% 
  mutate(county = County, 
         unemployed_pop = `Value (Percent)`) %>% 
  select(county, unemployed_pop)

t_unemployed <- t_unemployed %>% 
  filter(!is.na(unemployed_pop), 
         county != 'West Virginia',
         county != 'United States')

t_unemployed$county <- str_replace(t_unemployed$county, " County", "")


t_unemployed$county <- str_replace(t_unemployed$county, " County", "")

print(t_unemployed)


bachelor_degree_raw <- read_csv("bachelor_degree.csv", 
                            skip = 5)

bachelor_degree <- bachelor_degree_raw %>% 
  mutate(county = County, 
         bachelors_pop = `Value (Percent)`) %>% 
  select(county, bachelors_pop)

bachelor_degree <- bachelor_degree %>% 
  filter(!is.na(bachelors_pop), 
         county != 'West Virginia',
         county != 'United States')

bachelor_degree$county <- str_replace(bachelor_degree$county, " County", "")

foreign_born_raw <- read_csv("foreign_born.csv",
                             skip = 4)
foreign_born <- foreign_born_raw %>% 
  mutate(county = County, 
         foreign_pop = `Value (Percent)`) %>% 
  filter(!is.na(foreign_pop), 
         county != 'West Virginia', 
         county != 'United States') %>% 
  select(county, foreign_pop)

foreign_born$county <- str_replace(foreign_born$county, " County", "")

 poverty_rate <- read_csv("poverty_rate.csv", skip = 3) %>% 
     mutate(county = County, 
         poverty_rate = `Value (Percent)`) %>% 
  filter(!is.na(poverty_rate), 
         county != 'West Virginia', 
         county != 'United States') %>% 
  select(county, poverty_rate)

 poverty_rate$county <- str_replace(poverty_rate$county, " County", "")
 
under_18_pop <- read_csv("under_18_pop.csv", 
                         skip = 3) %>% 
  mutate(county = County, 
         under_18_rate = `Value (Percent)`) %>% 
  filter(!is.na(under_18_rate), 
         county != 'West Virginia', 
         county != 'United States') %>% 
  select(county, under_18_rate)
 
 under_18_pop$county <- str_replace(under_18_pop$county, " County", "")
 
 
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(stringr)
income_raw <- read_csv('HDPulse_data_export.csv', 
                       skip = 6)
household_income <- income_raw %>% 
  rename(county = 'United States', 
         med_income = '92,646') %>% 
  filter(!is.na(med_income)) %>% 
  select(county, med_income)

household_income$county <- str_replace(household_income$county, " County", "")

print(household_income)

demographics <- foreign_born %>% 
  left_join(bachelor_degree, by = "county") %>% 
  left_join(t_unemployed, by = "county") %>% 
  left_join(household_income, by = "county") %>% 
  left_join(under_18_pop, by = "county") %>% 
  left_join(poverty_rate, by = "county")



```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}

# Merge data
t <- t_assess %>% 
  left_join(demographics, by = "county") %>% 
  left_join(t_spending, by = "county")
  

print(t)
```

### Data

The data used this project includes unemployment demographics, county revenues and spending, and proficiency scores for various counties. These data sets were provided by Professor Garrett. Additionally, I incorporated median family income, region data for each county, bachelor degree rate, poverty rate, population rate for under 18 year olds, and foreign population of each county. 

Key Variables Included:

- poverty_rate: poverty rate of each county

- under_18_rate: population of 18 and younger for each county

- med_income: median family income in each county

- bachelors_pop: bachelor degree rate attainment of each county

- proficiency: science proficiency score in each county


### Methods

#### Correlations

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(ggcorrplot)

numeric_data <- t %>% 
  select(where(is.numeric))
corr_data <- numeric_data %>% 
  select(proficiency, med_income, foreign_pop, bachelors_pop, unemployed_pop, enroll, under_18_rate, poverty_rate)
cor_matrix <- cor(corr_data, use = "complete.obs")

ggcorrplot(cor(corr_data))

```
Above is a correlation graph for me to see the different correlations between the variables in my data set. 


#### Linear Regression

```{r echo=FALSE, message=FALSE, warning=FALSE}

model <- lm(proficiency ~ med_income + bachelors_pop + poverty_rate + under_18_rate, data = numeric_data)

summary(model)


```

The regression model above shows that there is a positive correlation between median family income and population of people with their bachelors degree with proficiency scores. There is a negative correlation with poverty rate and population of people under the age of 18 with proficiency scores. The overall model has a 0.47 adjusted R-squared which I'm happy with considering a good R-squared is 0.5 and greater. The p-value is also a good number because it is under 0.05. Bachelors population is a significant predictor when predicting proficiency for all grades. 

#### K-means

```{r echo=FALSE, message=FALSE, warning=FALSE}
k_data <- numeric_data %>% 
  select(proficiency, med_income, bachelors_pop, poverty_rate, under_18_rate)

scaled_data <- k_data %>% 
  mutate(across(everything(), scale))

k_result <- kmeans(
  x = scaled_data, 
  centers = 3, 
  nstart = 25
)

numeric_data <- numeric_data %>% 
  mutate(kmeans_cluster = k_result$cluster)

ggplot(data = numeric_data) + 
  geom_point(mapping = aes(x = bachelors_pop, y = proficiency, color = factor(kmeans_cluster))) +
  labs(color = "Cluster", title = "K-means Clustering", x = "Bachelor Degree Rate", y = "Proficiency") +
  theme_minimal()


```
The visualization above shows counties grouped based on demographics as median family income, bachelor degree rate, poverty rate, proficiency, and population of children (18 and younger). The clusters reveal patterns where counties with higher bachelor degree rates tend to have higher proficiency scores. The red cluster shows counties that may be economically disadvantaged with lower educational attainment, the green cluster represents counties that have higher educational attainment and outcomes, and the blue cluster represents the counties that fall in a middle of the two extremes. 


#### Decision Tree

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(rpart.plot)
library(rpart)

tree_model <- rpart(proficiency ~ med_income + bachelors_pop + poverty_rate + under_18_rate, 
                    data = numeric_data, 
                    method = "anova", 
                    control = rpart.control(minsplit = 10, 
                                            minbucket = 5))

rpart.plot(tree_model)

```
The decision tree above predicts proficiency based on the key demographic variables I've been working with throughout this project. This visualization shows once again that the bachelors degree rate in each county has a significant impact on proficiency scores, because in this decision tree it is the first and most important split. As well as the population of children in each county has an impact on proficiency scores as well. Overall the tree shows how education levels and socioeconomic factors play a role in children's proficiency scores. 



### Neural Network 

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(neuralnet)


set.seed(1)

scaled_data <- as.data.frame(scaled_data)
train_index <- sample(seq_len(nrow(scaled_data)), size = 0.8 * nrow(scaled_data))
train_data <- scaled_data[train_index, ]
test_data <-  scaled_data[-train_index, ]

nn_model <- neuralnet(proficiency ~ med_income + bachelors_pop + poverty_rate + under_18_rate, 
                      data = train_data, 
                      hidden = c(10, 5), 
                      linear.output = TRUE)

plot(nn_model)

test_predictions <- compute(nn_model, test_data[, c("med_income", "bachelors_pop", "poverty_rate", "under_18_rate")])
test_predictions <- test_predictions$net.result

actual_values <- test_data$proficiency
rmse <- mean((actual_values-test_predictions)^2)



```
The visualization above represents the structure of the neural network, which predicts proficiency scores based on inputs like median income, bachelor degree rates, poverty rates, and population rate of 18 and younger. The error value is 0.010113 which indicates that the network has learned the relationships effectively capturing complex patterns. 



```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}

plot(actual_values, test_predictions, main = "Actual vs Predicted Proficiency",
     xlab = "Actual Proficiency", 
     ylab = "Predictd Proficiency")
abline(0,1,col = "red")



```
When looking at the actual and predicted proficiency generated by the neural network, the plot shows that the model captures general trends, as many points are close to the red line. However there is still noticeable variability, indicating that the predictions are not perfectly accurate. This suggests the model performs reasonably well but may require further refinement to improve accuracy. 











### Prediction

Based on all my visualizations, I've come to the prediction that percentage of people with bachelor degrees and poverty rates in a county. Counties with higher bachelor's degree attainment tend to have higher proficiency scores, while those with higher poverty rates generally see lower proficiency. The decision tree and k-means clustering show distinct groups based on these patterns, highlighting the importance of these socioeconomic variables. The neural network predictions align with these trends as well, with little variability. Overall, if I was presenting this to a school board I would recommend to encourage furthing education after k-12 to their students, so that in the future those proficiency scores will increase. 

### Limitations

One limitation of this project is that it does not provide immediate solutions for improving proficiency scores, as factors identified, such as bachelor's degree attainment and poverty, require long term efforts to address. Additionally, only including socioeconimic variables and leaving out funding or school resources data could have potentially skewed the predictions. In the future, I would expand the data to included funding variables as well to try to improve the model's accuracy and provide more insight. 


### Resources

Sources Included: 

- ChatGPT for code errors

- https://hdpulse.nimhd.nih.gov/data-portal/social/map?socialtopic=030&socialtopic_options=social_6&demo=00010&demo_options=income_3&race=00&race_options=race_7&sex=0&sex_options=sexboth_1&age=001&age_options=ageall_1&statefips=54&statefips_options=area_states

- https://www.dropbox.com/scl/fo/s29xwwg21irckz9gzjx39/AC3W8m02KLAgItDfejmmvrU?rlkey=4h226idmd0n696zyjcrk2kegb&e=1&dl=0

















