Project 3 Mackenzie Hoeflinger

Introduction

The purpose of this analysis is to see which factors affect county-level educational outcomes. The scope was narrowed to reading proficiency, analyzed alongside the education of parents and the local revenue received by schools. The more people over 25 years old (standing in for the parents of children in school) who have earned a bachelor's degree or higher, the more local money the school district receives, which in turn raises reading proficiency. In short, this project found that parental education raises local revenue for schools, and schools can then use that money to raise reading proficiency. This forms a cycle: the higher reading proficiency scores rise, the more money schools receive. Another key finding was that few counties have very high scores and few have very low scores. Most counties are concentrated in the middle of the reading proficiency range rather than at either extreme.

Recommendations: Targeted investments should be made in school districts with moderate reading proficiency scores, with a focus on increasing local funding. This could be done through community engagement and/or campaigns that encourage parents to pursue higher education.

Data

Of all the data provided, reading proficiency was chosen as the outcome for this project. Unemployment, federal and state revenue, total expenditures, the percent of the money used per student, and median income were not used.

Key variables included:

This data was obtained from HDPulse.com. The education field was missing for Monongalia County, so I filled it in with data from the US Census Bureau. The information was joined into one tibble using inner and left joins.
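The join-and-fill step can be sketched as follows. This is a minimal illustration, not the project's actual code: the three toy tables, the `county` key, and every number are hypothetical, and it assumes the `dplyr` functions `inner_join()` and `left_join()` were the ones used.

```r
library(dplyr)

# Hypothetical toy tables; column names and values are illustrative only
proficiency <- data.frame(
  county = c("A", "B", "C"),
  reading_proficiency = c(40, 45, 50)
)
revenue <- data.frame(
  county = c("A", "B", "C"),
  tlocrev = c(1000, 2000, 3000)
)
education <- data.frame(
  county = c("A", "B"),            # county "C" has no education record
  people_education = c(500, 800)
)

# inner_join keeps only the counties present in both tables
joined <- inner_join(proficiency, revenue, by = "county")

# left_join keeps every county on the left and fills unmatched values with NA
joined <- left_join(joined, education, by = "county")

# Fill the missing value by hand from a second source (placeholder number)
joined$people_education[joined$county == "C"] <- 1200

print(joined)
```

A left join is the safer choice for the table with the gap, since an inner join would silently drop the county with the missing record.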

Methods

Correlation Matrix

I started by looking at a correlation matrix of all the variables provided. There is a block of strongly correlated variables in the center, but it does not add much new information: it makes sense that federal, state, and local revenue, total expenditures, and enrollment would all correlate with one another. Among the three proficiencies, math and reading are strongly correlated with each other. In the rest of this project, I mainly focus on people with at least a bachelor's degree and how that affects local revenue.
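A correlation matrix like the one described can be produced with base R's `cor()`. The sketch below uses made-up data in which local revenue is constructed to track enrollment, which mimics the "block" of correlated funding variables; the values and the 55-row size are assumptions for illustration.

```r
set.seed(42)

# Hypothetical county-level data: revenue is built to track enrollment closely
n_counties <- 55
enroll  <- runif(n_counties, 1000, 30000)
tlocrev <- 3 * enroll + rnorm(n_counties, sd = 5000)
reading_proficiency <- rnorm(n_counties, mean = 40, sd = 6)

toy <- data.frame(enroll, tlocrev, reading_proficiency)

# Pearson correlation matrix of every numeric column
cor_mat <- cor(toy, use = "complete.obs")
print(round(cor_mat, 2))
```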

PCA

Before settling on reading proficiency, I ran a PCA on the three proficiencies. I decided to choose between reading and math because they were more closely correlated and sit closer together in the PCA, so I felt my findings could be applied more broadly than if I had chosen science. The PCA results show that PC1 explains 85.99% of the variance and PC2 explains an additional 10.50%, for a cumulative 96.49%.

## Importance of first k=2 (out of 3) components:
##                           PC1    PC2
## Standard deviation     1.6061 0.5612
## Proportion of Variance 0.8599 0.1050
## Cumulative Proportion  0.8599 0.9649
## Standard deviations (1, .., p=3):
## [1] 1.6061339 0.5612066 0.3246244
## 
## Rotation (n x k) = (3 x 2):
##                            PC1        PC2
## science_proficiency -0.5571967  0.7767163
## reading_proficiency -0.6000724 -0.1321787
## math_proficiency    -0.5739730 -0.6158251
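The proportions of variance can be checked directly from the reported standard deviations, since each component's share is its variance divided by the total. The first part of this sketch uses only the numbers printed above; the second part shows a `prcomp()` call on hypothetical data, assuming the variables were centered and scaled.

```r
# Standard deviations reported in the PCA output above
sdev <- c(1.6061339, 0.5612066, 0.3246244)

# Each component's share of variance is its variance over the total variance
prop_var <- sdev^2 / sum(sdev^2)
print(round(prop_var, 4))
print(round(cumsum(prop_var), 4))

# Hypothetical data to show the call itself; scaling puts the three scores on equal footing
set.seed(1)
base <- rnorm(55)
scores <- data.frame(
  science_proficiency = base + rnorm(55, sd = 0.5),
  reading_proficiency = base + rnorm(55, sd = 0.3),
  math_proficiency    = base + rnorm(55, sd = 0.4)
)
pca <- prcomp(scores, center = TRUE, scale. = TRUE)
print(summary(pca))
```

With three scaled variables the component variances sum to 3, which is why the reported standard deviations square and add up to that total.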

Visualizations of Data

This graph shows the spread of counties' reading proficiency compared with the number of people with a bachelor's degree or higher. The color shows the local revenue the schools received. The point on the far right is Monongalia County, an outlier most likely because it is home to West Virginia University and therefore has more highly educated residents. There is a clear shift from blue dots on the left to red dots on the right, indicating that counties with more educated parents are more likely to receive more funding.
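A plot of this kind could be built roughly as below. This is a sketch under assumptions: it uses `ggplot2`, puts education on the x-axis and reading proficiency on the y-axis, uses a blue-to-red gradient for local revenue, and runs on made-up data rather than the project's tibble.

```r
library(ggplot2)

set.seed(1)

# Hypothetical data: revenue and reading proficiency both rise with education
toy <- data.frame(people_education = runif(55, 500, 20000))
toy$tlocrev <- 2 * toy$people_education + rnorm(55, sd = 3000)
toy$reading_proficiency <- 35 + toy$people_education / 2000 + rnorm(55, sd = 3)

# Color encodes local revenue, from blue (low) to red (high)
p <- ggplot(toy, aes(x = people_education,
                     y = reading_proficiency,
                     color = tlocrev)) +
  geom_point(size = 2) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(x = "People with a bachelor's degree or higher",
       y = "Reading proficiency",
       color = "Local revenue")

print(p)
```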

This is a map of West Virginia with counties colored by reading proficiency, included as a visual reference for the rest of this project.

K-Means

First is a jitter plot of the k-means clusters. It was hard to make sense of the clusters as a graph, so I moved them to a map, which was much easier to read. Cluster 1 appears to contain the counties with higher reading proficiency, Cluster 2 is middle of the road, and Cluster 3 contains the counties with the lowest reading proficiency.
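The clustering step can be sketched with base R's `kmeans()`. The variables actually clustered on are not listed here, so this example assumes a single standardized reading proficiency column and three centers, and uses made-up values.

```r
set.seed(1)

# Hypothetical reading proficiency values drawn around three levels
reading <- c(rnorm(15, mean = 30, sd = 2),
             rnorm(25, mean = 40, sd = 2),
             rnorm(15, mean = 50, sd = 2))
toy <- data.frame(reading_proficiency = reading)

# Three clusters; nstart reruns the algorithm from several random starts
km <- kmeans(scale(toy), centers = 3, nstart = 25)
toy$cluster <- km$cluster

# Cluster numbers are arbitrary labels, so inspect the means before interpreting them
cluster_means <- aggregate(reading_proficiency ~ cluster, data = toy, FUN = mean)
print(cluster_means)
```

Because the cluster numbering is arbitrary, a table of per-cluster means like this is what justifies calling one cluster "high" and another "low".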

Decision Tree

This is the decision tree for the data. At first I left all of the variables in, but the tree then based reading proficiency mostly on math proficiency because the two are so closely related. I therefore took out both math and science proficiency and left all other variables in.
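Dropping the near-duplicate predictor before fitting can be sketched as below. The package used for the tree is not named here, so `rpart` is an assumption, and the data is synthetic: reading proficiency is generated from an education share, with math proficiency as a close copy.

```r
library(rpart)

set.seed(1)

# Hypothetical data: reading depends on education share, math is a near-duplicate signal
n_counties <- 55
education_prc <- runif(n_counties, 8, 40)
reading_proficiency <- 30 + 0.5 * education_prc + rnorm(n_counties, sd = 2)
math_proficiency <- reading_proficiency - 5 + rnorm(n_counties, sd = 1)
toy <- data.frame(reading_proficiency, math_proficiency, education_prc)

# Drop the other proficiency score so the tree has to split on the remaining predictors
tree_data <- toy[, setdiff(names(toy), "math_proficiency")]

# method = "anova" fits a regression tree for a numeric outcome
tree <- rpart(reading_proficiency ~ ., data = tree_data, method = "anova")
print(tree)
```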

Linear Regression Model

This is the linear regression model I created for predicting reading proficiency. science_proficiency and math_proficiency have statistically significant positive relationships with reading_proficiency, while the other predictors, including enroll, median_income, tlocrev, and people_education, show no significant effect. The model explains about 86.14% of the variance in reading_proficiency (adjusted R-squared of 82.18%), which shows that it is a decent fit. The F-test p-value is also very small (3.084e-14), meaning the model as a whole is statistically significant.

## 
## Call:
## lm(formula = reading_proficiency ~ ., data = cor_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0891 -1.2524 -0.1263  1.0905  6.0368 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.695e+01  5.545e+00   3.056 0.003889 ** 
## science_proficiency  3.491e-01  9.258e-02   3.770 0.000503 ***
## math_proficiency     4.968e-01  7.845e-02   6.332 1.32e-07 ***
## enroll              -4.270e-04  1.180e-03  -0.362 0.719302    
## tfedrev             -1.591e-04  9.947e-05  -1.600 0.117149    
## tstrev               2.255e-05  1.500e-04   0.150 0.881216    
## tlocrev             -3.055e-05  1.004e-04  -0.304 0.762416    
## totalexp             7.677e-05  9.323e-05   0.823 0.414917    
## ppcstot             -1.994e-05  2.552e-04  -0.078 0.938077    
## unemployed          -2.059e-02  1.552e-01  -0.133 0.895134    
## median_income       -3.903e-05  5.421e-05  -0.720 0.475524    
## education_prc        4.151e-02  1.109e-01   0.374 0.710001    
## people_education    -7.669e-05  1.066e-04  -0.719 0.475987    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.379 on 42 degrees of freedom
## Multiple R-squared:  0.8614, Adjusted R-squared:  0.8218 
## F-statistic: 21.75 on 12 and 42 DF,  p-value: 3.084e-14
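The adjusted R-squared in the summary can be reproduced from the reported R-squared and the degrees of freedom. The first part of this sketch uses only numbers printed above; the second part fits a small model on hypothetical data to confirm the same formula against what `summary()` reports.

```r
# Values taken from the regression summary above
r_squared <- 0.8614
n_obs     <- 55   # 42 residual degrees of freedom + 12 predictors + 1 intercept
n_pred    <- 12

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)
print(round(adj_r_squared, 4))

# Same formula checked against summary() on hypothetical data
set.seed(1)
toy <- data.frame(x1 = rnorm(55), x2 = rnorm(55))
toy$y <- 2 * toy$x1 + rnorm(55)
fit <- lm(y ~ ., data = toy)
s <- summary(fit)
manual <- 1 - (1 - s$r.squared) * (nrow(toy) - 1) / fit$df.residual
print(c(manual = manual, reported = s$adj.r.squared))
```

The gap between 86.14% and 82.18% reflects the twelve predictors fitted on only 55 counties.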

Neural Network

The plot of the neural network did not display. However, the error is 5.286 and training took 413 steps. I used tlocrev and the number of people with a bachelor's degree or higher (people_education) to predict reading proficiency.

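library(dplyr)      # provides %>%, select(), mutate(), filter()
library(neuralnet)

# Seed before the random split so both the split and the fitted weights are reproducible
set.seed(1)

t_nn <- t %>% select(reading_proficiency, tlocrev, people_education)

# Randomly flag roughly half of the rows (55 counties) as test data
is_test <- sample(x = c(0, 1),
                  size = nrow(t_nn),
                  replace = TRUE,
                  prob = c(0.5, 0.5))

# Standardize each variable; as.numeric() drops the one-column matrix that scale() returns
t_nn <- t_nn %>% 
  mutate(is_test = is_test, 
         reading_proficiency = as.numeric(scale(reading_proficiency)),
         tlocrev = as.numeric(scale(tlocrev)),
         people_education = as.numeric(scale(people_education)))

t_train <- filter(t_nn, is_test == 0)
t_test <- filter(t_nn, is_test == 1)

# One hidden neuron; linear output because this is a regression problem
n <- neuralnet(formula = reading_proficiency ~ tlocrev + people_education,
               data = t_train,
               hidden = 1,
               linear.output = TRUE)

# When knitting, plot(n) often does not render; plot(n, rep = "best") is the usual workaround
plot(n)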

This is the SSE for the training data.

## [1] 2.917244

This is the SSE for the test data.

## [1] 17.37652
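The code that produced these two numbers is not shown, so the following is only a sketch of how a sum of squared errors is computed. The `sse()` helper and the example values are hypothetical; with the fitted network above, the predictions would presumably come from something like `predict(n, t_test)`, which is an assumption about how it was done.

```r
# Sum of squared errors between observed and predicted values
sse <- function(actual, predicted) {
  sum((actual - predicted)^2)
}

# Hypothetical standardized values, for illustration only
actual    <- c(-1.0, 0.0, 0.5, 1.5)
predicted <- c(-0.5, 0.2, 0.5, 1.0)

example_sse <- sse(actual, predicted)
print(example_sse)
```

SSE grows with the number of rows, so training and test values are only directly comparable when the two sets are similar in size.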

This shows the residuals for the test data. Although the distribution is not a perfect bell curve, it is centered around 0. There is one outlier, which is likely Monongalia County, a consistent outlier throughout this project. I kept it in because WVU is a big part of the state.

Limitations

While this project provides valuable insights into the factors influencing reading proficiency across counties in West Virginia, there are limitations to keep in mind. First, the data is limited in quality and availability. Most of it was obtained from HDPulse, but HDPulse did not have education data for Monongalia County, so that value came from the US Census Bureau, which is a reputable source but a different one from the rest of the data. The project also relies on data aggregated across all grades, which may overlook smaller factors that influence each county differently. Overall, the project did a good job with the data and resources it had. The analysis could also be extended to other states for comparison.

References

Sources included: