The goal of this project is to predict student proficiency levels across counties in West Virginia (WV). The project reviews the available data and uses it to build a model of proficiency. My model indicates that total expenses, local funding, and state funding are statistically significant predictors of proficiency.
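# Merge data
library(dplyr)  # provides %>% and inner_join()

# Join the assessment, spending, and demographics tables on the shared county key
t <- t_assess %>%
  inner_join(t_spending, by = 'county') %>%
  inner_join(t_demographics, by = 'county')
print(t)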
## # A tibble: 55 × 15
##    county    school school_name    population_group subgroup science_proficiency
##    <chr>     <chr>  <chr>          <chr>            <chr>                  <dbl>
##  1 Barbour   999    Barbour Count… Total Population Total                   26.0
##  2 Berkeley  999    Berkeley Coun… Total Population Total                   28.6
##  3 Boone     999    Boone County … Total Population Total                   19.6
##  4 Braxton   999    Braxton Count… Total Population Total                   22.6
##  5 Brooke    999    Brooke County… Total Population Total                   21.1
##  6 Cabell    999    Cabell County… Total Population Total                   30.8
##  7 Calhoun   999    Calhoun Count… Total Population Total                   27.8
##  8 Clay      999    Clay County T… Total Population Total                   23.3
##  9 Doddridge 999    Doddridge Cou… Total Population Total                   31.3
## 10 Fayette   999    Fayette Count… Total Population Total                   17.4
## # ℹ 45 more rows
## # ℹ 9 more variables: proficiency <dbl>, name <chr>, enroll <dbl>,
## #   tfedrev <dbl>, tstrev <dbl>, tlocrev <dbl>, totalexp <dbl>, ppcstot <dbl>,
## #   unemployed <dbl>
The correlations between numeric variables are shown in two ways. The first is a pairs plot, which shows the pairwise relationships between variables. The second is a corrplot, which I find easier to read because it shows the strength of each correlation by color and labels it with the coefficient. The correlations with proficiency are the most important ones to examine here.
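As a rough illustration of the correlation step, the sketch below ranks variables by the strength of their correlation with proficiency using base R only. The data frame `demo` is synthetic: it borrows a few column names from the merged table, but every value is simulated, so the numbers it prints say nothing about the real counties.

```r
# Synthetic stand-in for the merged table: column names follow the report,
# but all values are simulated for illustration only.
set.seed(1)
n <- 55
enroll <- round(runif(n, 500, 25000))
demo <- data.frame(
  proficiency = rnorm(n, mean = 25, sd = 5),
  enroll      = enroll,
  totalexp    = enroll * 12 + rnorm(n, sd = 5000),
  unemployed  = runif(n, 2, 10)
)

# Pairwise Pearson correlations between all numeric columns
cors <- cor(demo)

# Correlation of each predictor with proficiency, strongest first
with_prof <- cors["proficiency", colnames(cors) != "proficiency"]
ranked <- with_prof[order(abs(with_prof), decreasing = TRUE)]
print(round(ranked, 2))

# Visual versions (not run here): pairs(demo) draws the pairs plot, and
# corrplot::corrplot(cors, method = "color", addCoef.col = "black") draws a
# colored, numbered matrix if the corrplot package is installed.
```

Sorting by absolute value keeps strong negative correlations near the top, where they would otherwise be easy to overlook.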
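Correlation is also analyzed in this section using principal component analysis (PCA). PCA combines the original variables into new, uncorrelated components, each a weighted combination of the variables, and the loadings show which variables vary together. The results are summarized as text and as a heat map. In the output below, the first two components account for about 80% of the variance, and the enrollment and funding variables (enroll, tfedrev, tstrev, tlocrev, totalexp) load almost equally on PC1. Note that science_proficiency and proficiency have identical loadings and the ninth standard deviation is effectively zero (about 1e-16), which indicates that the two columns carry the same values.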
## Importance of first k=4 (out of 9) components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.2861 1.3908 0.9814 0.84084
## Proportion of Variance 0.5807 0.2149 0.1070 0.07856
## Cumulative Proportion 0.5807 0.7956 0.9026 0.98118
## Standard deviations (1, .., p=9):
## [1] 2.286075e+00 1.390783e+00 9.814091e-01 8.408411e-01 2.955149e-01
## [6] 2.777063e-01 5.576010e-02 4.296449e-02 9.890724e-17
##
## Rotation (n x k) = (9 x 4):
## PC1 PC2 PC3 PC4
## science_proficiency -0.2201909 -0.57310961 0.335755591 -0.054099098
## proficiency -0.2201909 -0.57310961 0.335755591 -0.054099098
## enroll -0.4263078 0.13770536 0.005682077 -0.005197766
## tfedrev -0.3926048 0.23007619 -0.099002279 -0.168585913
## tstrev -0.4180510 0.16921143 0.040874107 0.004871777
## tlocrev -0.4112420 -0.02846765 -0.212156443 -0.158272467
## totalexp -0.4283677 0.12259154 -0.067458733 -0.086673035
## ppcstot 0.1096831 -0.37358646 -0.669458674 -0.570136800
## unemployed 0.1665509 0.29521940 0.515068910 -0.779779619
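A summary like the one above can be produced with `prcomp()`. The proportions reported are consistent with scaled variables (2.2861² / 9 ≈ 0.5807), so the sketch below scales as well. It runs on synthetic data with a deliberately duplicated column, to show how a duplicate such as science_proficiency and proficiency yields a zero-variance final component. The `demo` data frame and its values are assumptions for illustration, not the project's data.

```r
# Synthetic data; only the column names are borrowed from the report.
set.seed(2)
n <- 55
enroll <- runif(n, 500, 25000)
demo <- data.frame(
  proficiency = rnorm(n, 25, 5),
  enroll      = enroll,
  totalexp    = enroll * 12 + rnorm(n, sd = 5000),
  ppcstot     = rnorm(n, 12000, 1500),
  unemployed  = runif(n, 2, 10)
)
# Deliberately duplicate a column to mimic two identical proficiency fields
demo$science_proficiency <- demo$proficiency

# Center and scale so variables measured in dollars do not dominate
pca <- prcomp(demo, center = TRUE, scale. = TRUE)
print(summary(pca))

# The duplicated column adds no information, so the last component has
# (numerically) zero standard deviation.
last_sd <- pca$sdev[length(pca$sdev)]
print(last_sd)

# Dropping the duplicate removes the degenerate component
pca_clean <- prcomp(demo[, names(demo) != "science_proficiency"],
                    center = TRUE, scale. = TRUE)
print(round(pca_clean$rotation[, 1:2], 3))
```

Removing one of the two identical columns before the PCA would leave eight informative variables and avoid giving proficiency double weight in the components.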
## PCA Correlations for Select Counties
This section shows the PCA for Kanawha, Boone, Putnam, and Clay counties. I selected these counties because this is the area of the state I am from, and I wanted to showcase it and the surrounding areas. With only four counties, the PCA can have at most three components with nonzero variance, which is why PC4 has a standard deviation of effectively zero and its loadings are not meaningful. The first component alone accounts for about 73% of the variance, compared with 58% for all 55 counties, so the variables appear more tightly grouped. With so few observations, these loadings should be read as descriptive only.
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.5586 1.4897 0.48441 4.574e-16
## Proportion of Variance 0.7274 0.2466 0.02607 0.000e+00
## Cumulative Proportion 0.7274 0.9739 1.00000 1.000e+00
## Standard deviations (1, .., p=4):
## [1] 2.558567e+00 1.489659e+00 4.844075e-01 4.573535e-16
##
## Rotation (n x k) = (9 x 4):
## PC1 PC2 PC3 PC4
## science_proficiency 0.2564072 -0.5066436 -0.005194982 0.3659756
## proficiency 0.2564072 -0.5066436 -0.005194982 0.4120043
## enroll 0.3696686 0.2173827 0.048758244 -0.3011233
## tfedrev 0.3374443 0.3350583 -0.152570714 0.2056975
## tstrev 0.3670950 0.2303375 0.020547971 0.1117445
## tlocrev 0.3711631 0.2083354 0.088952018 0.1659562
## totalexp 0.3611541 0.2566345 -0.004066566 0.1787607
## ppcstot -0.3255483 0.2924512 0.704380630 0.5128842
## unemployed -0.3302308 0.2815266 -0.685410494 0.4778009
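The zero-variance fourth component in the four-county output above follows from the sample size rather than from anything specific to these counties. The short sketch below shows the same effect on random numbers; the `small` data frame is purely hypothetical.

```r
set.seed(3)
# Four hypothetical rows (one per county) and nine random variables
small <- as.data.frame(matrix(rnorm(4 * 9), nrow = 4))
pca_small <- prcomp(small, center = TRUE, scale. = TRUE)

# prcomp returns min(n, p) = 4 components, but centering removes one
# degree of freedom, so only n - 1 = 3 can have nonzero variance.
print(pca_small$sdev)
n_nonzero <- sum(pca_small$sdev > 1e-8)
print(n_nonzero)
```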
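A linear regression model is fit to predict the dependent variable, proficiency, from local funding (tlocrev), total expenses (totalexp), and state funding (tstrev). The summary of the model is shown below. All three coefficients have p-values below 0.05, so each is statistically significant at the 5% level. Local and state funding have positive coefficients, while total expenses has a negative coefficient. The multiple R-squared is 0.30 and the adjusted R-squared is 0.26, so the model explains roughly 26% to 30% of the variance in proficiency. R-squared measures variance explained, not prediction accuracy.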
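A decision tree using the same variables is also shown as a more visual representation of the prediction. The accuracy printed below is 0.27. The warning printed with it reports 53 classes in the response, which indicates that the tree treated each distinct proficiency value as its own class. The 0.27 is therefore the share of counties whose exact proficiency value was predicted, and it is not comparable to the R-squared of the linear model.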
##
## Call:
## lm(formula = proficiency ~ tlocrev + totalexp + tstrev, data = t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7718 -4.1128 -0.0673 3.5618 9.5035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.4977577 1.0126687 22.216 < 2e-16 ***
## tlocrev 0.0004192 0.0001164 3.602 0.000715 ***
## totalexp -0.0002727 0.0001041 -2.619 0.011587 *
## tstrev 0.0003290 0.0001477 2.227 0.030416 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.996 on 51 degrees of freedom
## Multiple R-squared: 0.3034, Adjusted R-squared: 0.2624
## F-statistic: 7.404 on 3 and 51 DF, p-value: 0.0003288
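The adjusted R-squared and F-statistic in the summary above can be reproduced from the multiple R-squared, which makes the gap between 0.30 and 0.26 easier to interpret. The sample size of 55 is inferred from the 51 residual degrees of freedom plus three predictors and an intercept.

```r
r2 <- 0.3034   # Multiple R-squared from the summary above
n  <- 55       # counties: 51 residual df + 3 predictors + 1 intercept
p  <- 3        # predictors: tlocrev, totalexp, tstrev

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))

# Overall F-statistic implied by the same R-squared
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
print(round(f_stat, 2))
```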
## Warning: All boxes will be white (the box.palette argument will be ignored) because
## the number of classes in the response 53 is greater than length(box.palette) 6.
## To silence this warning use box.palette=0 or trace=-1.
## [1] 0.2727273
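Because proficiency is a continuous percentage, a regression tree may be a better fit than a classification tree. The sketch below is one possible setup, not the code used in this report: it fits `rpart()` with `method = "anova"` on synthetic data and scores it with RMSE on a held-out set. The simulated values, the 11-row hold-out, and the assumed relationship between tlocrev and proficiency are all illustrative assumptions.

```r
library(rpart)  # recommended package that ships with most R installations

# Synthetic data; only the column names are borrowed from the report.
set.seed(4)
n <- 55
demo <- data.frame(
  tlocrev  = runif(n, 1000, 60000),
  totalexp = runif(n, 10000, 300000),
  tstrev   = runif(n, 5000, 150000)
)
demo$proficiency <- 22 + 0.0002 * demo$tlocrev + rnorm(n, sd = 4)

# Hold out 11 of the 55 rows (20%) for evaluation; the split size is arbitrary
test_idx <- sample(n, size = 11)
train <- demo[-test_idx, ]
test  <- demo[test_idx, ]

# method = "anova" requests a regression tree for a numeric response
tree <- rpart(proficiency ~ tlocrev + totalexp + tstrev,
              data = train, method = "anova")
pred <- predict(tree, newdata = test)

# RMSE is in the same units as proficiency (percentage points)
rmse <- sqrt(mean((test$proficiency - pred)^2))
print(rmse)
```

An RMSE in percentage points can be compared directly with the residual standard error of the linear model, which an exact-match accuracy cannot.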
The model shows that total expenses, local funding, and state funding are statistically significant predictors of proficiency. I would recommend that the state of WV focus on these areas to improve proficiency. This means putting local and state funding into schools and looking closely at the expenses that schools take on, since the coefficient on total expenses is negative once funding is accounted for. This recommendation is based purely on the model provided, which explains less than a third of the variance in proficiency and shows association rather than causation, so other approaches may be warranted under a different model or a different set of data.