Created by Riley Kearney. Updated 12/8/2024
Thesis
Counties that have higher bachelor degree rates tend to have higher
science proficiency scores.
Data
The data used this project includes unemployment demographics, county
revenues and spending, and proficiency scores for various counties.
These data sets were provided by Professor Garrett. Additionally, I
incorporated median family income, region data for each county, bachelor
degree rate, poverty rate, population rate for under 18 year olds, and
foreign population of each county.
Key Variables Included:
poverty_rate: poverty rate of each county
under_18_rate: population of 18 and younger for each
county
med_income: median family income in each county
bachelors_pop: bachelor degree rate attainment of each
county
proficiency: science proficiency score in each county
Methods
Correlations

Above is a correlation graph for me to see the different correlations
between the variables in my data set.
Linear Regression
Call:
lm(formula = proficiency ~ med_income + bachelors_pop + poverty_rate +
under_18_rate, data = numeric_data)
Residuals:
Min 1Q Median 3Q Max
-7.9331 -2.8643 -0.1598 2.8148 8.8230
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.889e+01 8.074e+00 2.340 0.0233 *
med_income 1.238e-04 9.788e-05 1.265 0.2119
bachelors_pop 3.041e-01 1.308e-01 2.326 0.0241 *
poverty_rate -1.279e-01 2.136e-01 -0.599 0.5522
under_18_rate -2.983e-01 3.410e-01 -0.875 0.3858
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.24 on 50 degrees of freedom
Multiple R-squared: 0.5082, Adjusted R-squared: 0.4688
F-statistic: 12.92 on 4 and 50 DF, p-value: 2.704e-07
The regression model above shows that there is a positive correlation
between median family income and population of people with their
bachelors degree with proficiency scores. There is a negative
correlation with poverty rate and population of people under the age of
18 with proficiency scores. The overall model has a 0.47 adjusted
R-squared which I’m happy with considering a good R-squared is 0.5 and
greater. The p-value is also a good number because it is under 0.05.
Bachelors population is a significant predictor when predicting
proficiency for all grades.
K-means

The visualization above shows counties grouped based on demographics
as median family income, bachelor degree rate, poverty rate,
proficiency, and population of children (18 and younger). The clusters
reveal patterns where counties with higher bachelor degree rates tend to
have higher proficiency scores. The red cluster shows counties that may
be economically disadvantaged with lower educational attainment, the
green cluster represents counties that have higher educational
attainment and outcomes, and the blue cluster represents the counties
that fall in a middle of the two extremes.
Decision Tree

The decision tree above predicts proficiency based on the key
demographic variables I’ve been working with throughout this project.
This visualization shows once again that the bachelors degree rate in
each county has a significant impact on proficiency scores, because in
this decision tree it is the first and most important split. As well as
the population of children in each county has an impact on proficiency
scores as well. Overall the tree shows how education levels and
socioeconomic factors play a role in children’s proficiency scores.
Neural Network

The visualization above represents the structure of the neural
network, which predicts proficiency scores based on inputs like median
income, bachelor degree rates, poverty rates, and population rate of 18
and younger. The error value is 0.010113 which indicates that the
network has learned the relationships effectively capturing complex
patterns.

When looking at the actual and predicted proficiency generated by the
neural network, the plot shows that the model captures general trends,
as many points are close to the red line. However there is still
noticeable variability, indicating that the predictions are not
perfectly accurate. This suggests the model performs reasonably well but
may require further refinement to improve accuracy.
Prediction
Based on all my visualizations, I’ve come to the prediction that
percentage of people with bachelor degrees and poverty rates in a
county. Counties with higher bachelor’s degree attainment tend to have
higher proficiency scores, while those with higher poverty rates
generally see lower proficiency. The decision tree and k-means
clustering show distinct groups based on these patterns, highlighting
the importance of these socioeconomic variables. The neural network
predictions align with these trends as well, with little variability.
Overall, if I was presenting this to a school board I would recommend to
encourage furthing education after k-12 to their students, so that in
the future those proficiency scores will increase.
Limitations
One limitation of this project is that it does not provide immediate
solutions for improving proficiency scores, as factors identified, such
as bachelor’s degree attainment and poverty, require long term efforts
to address. Additionally, only including socioeconimic variables and
leaving out funding or school resources data could have potentially
skewed the predictions. In the future, I would expand the data to
included funding variables as well to try to improve the model’s
accuracy and provide more insight.
Resources
Sources Included:
---
title: "Project 3"
output: html_notebook
---
Created by Riley Kearney. Updated 12/8/2024

### Thesis 
Counties that have higher bachelor degree rates tend to have higher science proficiency scores. 


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(tidyverse)
library(caret)
library(rpart)
library(readxl)
assessment_path <- './wv ed student achievement/Historical_AssessmentResults_SY15-to-SY21.xlsx'


t_assess_raw_school <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'b2:f7312')


t_assess_raw_science <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'db3:db7312', 
                           col_names = c('science_proficiency'),
                           na = '**')

t_assess_raw <- t_assess_raw_school %>%
  bind_cols(t_assess_raw_science) %>% 
  janitor::clean_names()  


# Remove subgroups
t_assess <- t_assess_raw %>% 
  filter(school == 999) %>% 
  filter(population_group == 'Total Population') %>% 
  filter(county != 'Statewide') %>% 
  mutate(proficiency = science_proficiency)  

# Add regions
regions <- read_csv('wv_regions.csv')
t_assess <- left_join(t_assess, regions, by = 'county')

# Assign number to each region
t_assess <- t_assess %>% 
  mutate(region_number = as.numeric(factor(region)))

print(t_assess)
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
spending_path <- './us census ed spending/elsec22t.xls'

t_spending_raw <- read_excel(path = spending_path,,
                           sheet = 'elsec22t',
                           range = 'a1:gb14106') %>% 
  janitor::clean_names()


cooperates <- c('MOUNTAIN STATE EDUCATIONAL SERVICES COOPERATIVE',
                'EASTERN PANHANDLE INSTRUCTIONAL COOPERATIVE',
                'SOUTHERN EDUCATIONAL SERVICES COOPERATIVE')

t_spending <- t_spending_raw %>% 
  filter(state == 49) %>% 
  filter(!name %in% cooperates) %>% 
  select(name, enroll, tfedrev, tstrev, tlocrev, totalexp, ppcstot) %>% 
  mutate(county = str_to_title(str_split_i(name, ' ',1)),
         county = ifelse(county == 'Mc', 'McDowell', county))


print(t_spending)
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}

  

t_unemployed_raw <- read_csv('./demographics/unemployed.csv', 
                            skip = 4,
                            na = 'N/A') 

t_unemployed <- t_unemployed_raw %>% 
  mutate(county = County, 
         unemployed_pop = `Value (Percent)`) %>% 
  select(county, unemployed_pop)

t_unemployed <- t_unemployed %>% 
  filter(!is.na(unemployed_pop), 
         county != 'West Virginia',
         county != 'United States')

t_unemployed$county <- str_replace(t_unemployed$county, " County", "")


t_unemployed$county <- str_replace(t_unemployed$county, " County", "")

print(t_unemployed)


bachelor_degree_raw <- read_csv("bachelor_degree.csv", 
                            skip = 5)

bachelor_degree <- bachelor_degree_raw %>% 
  mutate(county = County, 
         bachelors_pop = `Value (Percent)`) %>% 
  select(county, bachelors_pop)

bachelor_degree <- bachelor_degree %>% 
  filter(!is.na(bachelors_pop), 
         county != 'West Virginia',
         county != 'United States')

bachelor_degree$county <- str_replace(bachelor_degree$county, " County", "")

foreign_born_raw <- read_csv("foreign_born.csv",
                             skip = 4)
foreign_born <- foreign_born_raw %>% 
  mutate(county = County, 
         foreign_pop = `Value (Percent)`) %>% 
  filter(!is.na(foreign_pop), 
         county != 'West Virginia', 
         county != 'United States') %>% 
  select(county, foreign_pop)

foreign_born$county <- str_replace(foreign_born$county, " County", "")

 poverty_rate <- read_csv("poverty_rate.csv", skip = 3) %>% 
     mutate(county = County, 
         poverty_rate = `Value (Percent)`) %>% 
  filter(!is.na(poverty_rate), 
         county != 'West Virginia', 
         county != 'United States') %>% 
  select(county, poverty_rate)

 poverty_rate$county <- str_replace(poverty_rate$county, " County", "")
 
under_18_pop <- read_csv("under_18_pop.csv", 
                         skip = 3) %>% 
  mutate(county = County, 
         under_18_rate = `Value (Percent)`) %>% 
  filter(!is.na(under_18_rate), 
         county != 'West Virginia', 
         county != 'United States') %>% 
  select(county, under_18_rate)
 
 under_18_pop$county <- str_replace(under_18_pop$county, " County", "")
 
 
```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
library(stringr)
income_raw <- read_csv('HDPulse_data_export.csv', 
                       skip = 6)
household_income <- income_raw %>% 
  rename(county = 'United States', 
         med_income = '92,646') %>% 
  filter(!is.na(med_income)) %>% 
  select(county, med_income)

household_income$county <- str_replace(household_income$county, " County", "")

print(household_income)

demographics <- foreign_born %>% 
  left_join(bachelor_degree, by = "county") %>% 
  left_join(t_unemployed, by = "county") %>% 
  left_join(household_income, by = "county") %>% 
  left_join(under_18_pop, by = "county") %>% 
  left_join(poverty_rate, by = "county")



```


```{r message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}

# Merge data
t <- t_assess %>% 
  left_join(demographics, by = "county") %>% 
  left_join(t_spending, by = "county")
  

print(t)
```

### Data

The data used this project includes unemployment demographics, county revenues and spending, and proficiency scores for various counties. These data sets were provided by Professor Garrett. Additionally, I incorporated median family income, region data for each county, bachelor degree rate, poverty rate, population rate for under 18 year olds, and foreign population of each county. 

Key Variables Included:

- poverty_rate: poverty rate of each county

- under_18_rate: population of 18 and younger for each county

- med_income: median family income in each county

- bachelors_pop: bachelor degree rate attainment of each county

- proficiency: science proficiency score in each county


### Methods

#### Correlations

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(ggcorrplot)

numeric_data <- t %>% 
  select(where(is.numeric))
corr_data <- numeric_data %>% 
  select(proficiency, med_income, foreign_pop, bachelors_pop, unemployed_pop, enroll, under_18_rate, poverty_rate)
cor_matrix <- cor(corr_data, use = "complete.obs")

ggcorrplot(cor(corr_data))

```
Above is a correlation graph for me to see the different correlations between the variables in my data set. 


#### Linear Regression

```{r echo=FALSE, message=FALSE, warning=FALSE}

model <- lm(proficiency ~ med_income + bachelors_pop + poverty_rate + under_18_rate, data = numeric_data)

summary(model)


```

The regression model above shows that there is a positive correlation between median family income and population of people with their bachelors degree with proficiency scores. There is a negative correlation with poverty rate and population of people under the age of 18 with proficiency scores. The overall model has a 0.47 adjusted R-squared which I'm happy with considering a good R-squared is 0.5 and greater. The p-value is also a good number because it is under 0.05. Bachelors population is a significant predictor when predicting proficiency for all grades. 

#### K-means

```{r echo=FALSE, message=FALSE, warning=FALSE}
k_data <- numeric_data %>% 
  select(proficiency, med_income, bachelors_pop, poverty_rate, under_18_rate)

scaled_data <- k_data %>% 
  mutate(across(everything(), scale))

k_result <- kmeans(
  x = scaled_data, 
  centers = 3, 
  nstart = 25
)

numeric_data <- numeric_data %>% 
  mutate(kmeans_cluster = k_result$cluster)

ggplot(data = numeric_data) + 
  geom_point(mapping = aes(x = bachelors_pop, y = proficiency, color = factor(kmeans_cluster))) +
  labs(color = "Cluster", title = "K-means Clustering", x = "Bachelor Degree Rate", y = "Proficiency") +
  theme_minimal()


```
The visualization above shows counties grouped based on demographics as median family income, bachelor degree rate, poverty rate, proficiency, and population of children (18 and younger). The clusters reveal patterns where counties with higher bachelor degree rates tend to have higher proficiency scores. The red cluster shows counties that may be economically disadvantaged with lower educational attainment, the green cluster represents counties that have higher educational attainment and outcomes, and the blue cluster represents the counties that fall in a middle of the two extremes. 


#### Decision Tree

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(rpart.plot)
library(rpart)

tree_model <- rpart(proficiency ~ med_income + bachelors_pop + poverty_rate + under_18_rate, 
                    data = numeric_data, 
                    method = "anova", 
                    control = rpart.control(minsplit = 10, 
                                            minbucket = 5))

rpart.plot(tree_model)

```
The decision tree above predicts proficiency based on the key demographic variables I've been working with throughout this project. This visualization shows once again that the bachelors degree rate in each county has a significant impact on proficiency scores, because in this decision tree it is the first and most important split. As well as the population of children in each county has an impact on proficiency scores as well. Overall the tree shows how education levels and socioeconomic factors play a role in children's proficiency scores. 



### Neural Network 

```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(neuralnet)


set.seed(1)

scaled_data <- as.data.frame(scaled_data)
train_index <- sample(seq_len(nrow(scaled_data)), size = 0.8 * nrow(scaled_data))
train_data <- scaled_data[train_index, ]
test_data <-  scaled_data[-train_index, ]

nn_model <- neuralnet(proficiency ~ med_income + bachelors_pop + poverty_rate + under_18_rate, 
                      data = train_data, 
                      hidden = c(10, 5), 
                      linear.output = TRUE)

plot(nn_model)

test_predictions <- compute(nn_model, test_data[, c("med_income", "bachelors_pop", "poverty_rate", "under_18_rate")])
test_predictions <- test_predictions$net.result

actual_values <- test_data$proficiency
rmse <- mean((actual_values-test_predictions)^2)



```
The visualization above represents the structure of the neural network, which predicts proficiency scores based on inputs like median income, bachelor degree rates, poverty rates, and population rate of 18 and younger. The error value is 0.010113 which indicates that the network has learned the relationships effectively capturing complex patterns. 



```{r echo=FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}

plot(actual_values, test_predictions, main = "Actual vs Predicted Proficiency",
     xlab = "Actual Proficiency", 
     ylab = "Predictd Proficiency")
abline(0,1,col = "red")



```
When looking at the actual and predicted proficiency generated by the neural network, the plot shows that the model captures general trends, as many points are close to the red line. However there is still noticeable variability, indicating that the predictions are not perfectly accurate. This suggests the model performs reasonably well but may require further refinement to improve accuracy. 











### Prediction

Based on all my visualizations, I've come to the prediction that percentage of people with bachelor degrees and poverty rates in a county. Counties with higher bachelor's degree attainment tend to have higher proficiency scores, while those with higher poverty rates generally see lower proficiency. The decision tree and k-means clustering show distinct groups based on these patterns, highlighting the importance of these socioeconomic variables. The neural network predictions align with these trends as well, with little variability. Overall, if I was presenting this to a school board I would recommend to encourage furthing education after k-12 to their students, so that in the future those proficiency scores will increase. 

### Limitations

One limitation of this project is that it does not provide immediate solutions for improving proficiency scores, as factors identified, such as bachelor's degree attainment and poverty, require long term efforts to address. Additionally, only including socioeconimic variables and leaving out funding or school resources data could have potentially skewed the predictions. In the future, I would expand the data to included funding variables as well to try to improve the model's accuracy and provide more insight. 


### Resources

Sources Included: 

- ChatGPT for code errors

- https://hdpulse.nimhd.nih.gov/data-portal/social/map?socialtopic=030&socialtopic_options=social_6&demo=00010&demo_options=income_3&race=00&race_options=race_7&sex=0&sex_options=sexboth_1&age=001&age_options=ageall_1&statefips=54&statefips_options=area_states

- https://www.dropbox.com/scl/fo/s29xwwg21irckz9gzjx39/AC3W8m02KLAgItDfejmmvrU?rlkey=4h226idmd0n696zyjcrk2kegb&e=1&dl=0

















