Introduction

The goal of this project is to predict student proficiency levels across the counties of West Virginia (WV). The project reviews the data provided and uses it to build a model that predicts proficiency. Of the variables considered, the model indicates that total expenses, local funding, and state funding are the most relevant factors in predicting proficiency.

Load assessment data

  • This section loads the assessment results for the state of WV.

Load spending data

  • This section loads the spending data for the state of WV.

Load demographic data

  • This section loads the demographic data for the state of WV, specifically unemployment rates.

Joined data

  • This section combines all of the previously loaded data into one data set, joined by county. The result has 55 rows and 15 columns.

# Merge data
t <- t_assess %>% 
  inner_join(t_spending, by = 'county') %>% 
  inner_join(t_demographics, by = 'county')

print(t)
## # A tibble: 55 × 15
##    county    school school_name    population_group subgroup science_proficiency
##    <chr>     <chr>  <chr>          <chr>            <chr>                  <dbl>
##  1 Barbour   999    Barbour Count… Total Population Total                   26.0
##  2 Berkeley  999    Berkeley Coun… Total Population Total                   28.6
##  3 Boone     999    Boone County … Total Population Total                   19.6
##  4 Braxton   999    Braxton Count… Total Population Total                   22.6
##  5 Brooke    999    Brooke County… Total Population Total                   21.1
##  6 Cabell    999    Cabell County… Total Population Total                   30.8
##  7 Calhoun   999    Calhoun Count… Total Population Total                   27.8
##  8 Clay      999    Clay County T… Total Population Total                   23.3
##  9 Doddridge 999    Doddridge Cou… Total Population Total                   31.3
## 10 Fayette   999    Fayette Count… Total Population Total                   17.4
## # ℹ 45 more rows
## # ℹ 9 more variables: proficiency <dbl>, name <chr>, enroll <dbl>,
## #   tfedrev <dbl>, tstrev <dbl>, tlocrev <dbl>, totalexp <dbl>, ppcstot <dbl>,
## #   unemployed <dbl>
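
Because inner_join() silently drops any county that is missing from one of the tables, it can be worth checking the keys before merging. The following is a minimal sketch of that check, not the code used in this report: the three small tables and their values are invented for illustration, and only the `county` key and the join pattern come from the merge shown above.

```r
library(dplyr)

# Invented stand-ins for the assessment, spending, and demographic tables.
assess <- tibble(county = c("Barbour", "Berkeley", "Boone"),
                 proficiency = c(10, 20, 30))
spending <- tibble(county = c("Barbour", "Berkeley", "Boone"),
                   totalexp = c(100, 200, 300))
demographics <- tibble(county = c("Barbour", "Berkeley"),
                       unemployed = c(5, 4))

# Counties in the assessment table with no match in the other tables.
# These are the rows an inner join would drop without warning.
missing_spending <- anti_join(assess, spending, by = "county")
missing_demographics <- anti_join(assess, demographics, by = "county")

joined <- assess %>%
  inner_join(spending, by = "county") %>%
  inner_join(demographics, by = "county")

print(missing_demographics$county)
print(nrow(joined))
```

In this toy example Boone has no demographic row, so it appears in `missing_demographics` and is absent from `joined`.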

Extra county data

  • This section loads additional county data and filters the joined data down to a few selected counties.
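
Filtering to a handful of counties can be sketched as follows. This is an illustration only: `toy` is an invented stand-in for the joined data, and the county list is the one named in the select-counties PCA section of this report.

```r
library(dplyr)

# Invented stand-in for the joined table; the values are not real.
toy <- tibble(
  county = c("Kanawha", "Boone", "Putnam", "Clay", "Barbour", "Cabell"),
  proficiency = c(1, 2, 3, 4, 5, 6)
)

selected <- c("Kanawha", "Boone", "Putnam", "Clay")

# Keep only the rows whose county is in the selected set.
toy_select <- toy %>% filter(county %in% selected)

print(toy_select)
```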

Correlations

The correlations between the numeric variables are shown in several ways. First is the pairs plot, which shows the general relationships between variables. This is followed by a corrplot, which I find easier to read: it shows the strength of each correlation by color and labels it with the correlation coefficient. The correlations of these variables with proficiency are the most important ones to look at here.
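
To pull out just the correlations with proficiency, one option is to take the proficiency column of the correlation matrix and sort it. The sketch below uses simulated data with column names borrowed from this report; the numbers are invented, so the resulting correlations say nothing about WV.

```r
set.seed(1)
n <- 55

# Simulated data; only the column names mirror the report.
toy <- data.frame(
  tlocrev = runif(n, 1000, 50000),
  unemployed = runif(n, 3, 12)
)
toy$proficiency <- 20 + 0.0002 * toy$tlocrev - 0.5 * toy$unemployed +
  rnorm(n, sd = 3)

# Full correlation matrix of the numeric variables.
cor_matrix <- cor(toy)

# Correlation of each predictor with proficiency, largest magnitude first.
cor_with_prof <- cor_matrix[, "proficiency"]
cor_with_prof <- cor_with_prof[names(cor_with_prof) != "proficiency"]
cor_with_prof <- cor_with_prof[order(abs(cor_with_prof), decreasing = TRUE)]

print(round(cor_with_prof, 2))
```

A matrix like `cor_matrix` is the kind of input that a function such as corrplot::corrplot() takes to draw the colored plot.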

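The relationships between variables are also analyzed in this section using principal component analysis (PCA). PCA combines the original variables into new variables, called principal components, each of which is a weighted combination of the originals. The weights (loadings) show which variables move together. In the output below, the first two components account for about 80% of the variance, and enrollment and the funding variables all load together on PC1. The summary of these results is shown in text as well as with a heat map.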

## Importance of first k=4 (out of 9) components:
##                           PC1    PC2    PC3     PC4
## Standard deviation     2.2861 1.3908 0.9814 0.84084
## Proportion of Variance 0.5807 0.2149 0.1070 0.07856
## Cumulative Proportion  0.5807 0.7956 0.9026 0.98118
## Standard deviations (1, .., p=9):
## [1] 2.286075e+00 1.390783e+00 9.814091e-01 8.408411e-01 2.955149e-01
## [6] 2.777063e-01 5.576010e-02 4.296449e-02 9.890724e-17
## 
## Rotation (n x k) = (9 x 4):
##                            PC1         PC2          PC3          PC4
## science_proficiency -0.2201909 -0.57310961  0.335755591 -0.054099098
## proficiency         -0.2201909 -0.57310961  0.335755591 -0.054099098
## enroll              -0.4263078  0.13770536  0.005682077 -0.005197766
## tfedrev             -0.3926048  0.23007619 -0.099002279 -0.168585913
## tstrev              -0.4180510  0.16921143  0.040874107  0.004871777
## tlocrev             -0.4112420 -0.02846765 -0.212156443 -0.158272467
## totalexp            -0.4283677  0.12259154 -0.067458733 -0.086673035
## ppcstot              0.1096831 -0.37358646 -0.669458674 -0.570136800
## unemployed           0.1665509  0.29521940  0.515068910 -0.779779619
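
Two details of the output above can be reproduced on toy data. First, the squared standard deviations sum to 9, the number of variables (for example, 2.2861^2 / 0.5807 is about 9), which is what happens when the variables are scaled to unit variance. Second, science_proficiency and proficiency have identical loadings and the ninth standard deviation is essentially zero, which is what an exactly duplicated column produces. The sketch below assumes prcomp() with scaling and uses simulated data, so it illustrates the mechanism rather than reproducing the analysis.

```r
set.seed(42)
n <- 55

# Simulated data; only the column names mirror the report.
toy <- data.frame(
  enroll = rnorm(n, 5000, 2000),
  unemployed = rnorm(n, 6, 1.5)
)
toy$totalexp <- 12 * toy$enroll + rnorm(n, sd = 5000)
toy$proficiency <- rnorm(n, 25, 5)
toy$science_proficiency <- toy$proficiency  # exact duplicate column

# Center and scale so every variable contributes unit variance.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 4))

# The last component has (numerically) zero standard deviation because
# one column is an exact copy of another.
print(pca$sdev[5])
```

Dropping one of the two duplicated columns before running PCA would avoid counting the same information twice.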

PCA Correlations for Select Counties

This section shows the PCA results for Kanawha, Boone, Putnam, and Clay counties. I selected these counties because this is the area of the state I am from, and I wanted to showcase it and the surrounding areas. This is the same analysis restricted to four counties. Because there are only four observations, PCA can produce at most three components with non-zero variance, which is why the fourth component below has a standard deviation of essentially zero and the first two components account for about 97% of the variance. With so little data, these loadings should be interpreted with caution.

## Importance of components:
##                           PC1    PC2     PC3       PC4
## Standard deviation     2.5586 1.4897 0.48441 4.574e-16
## Proportion of Variance 0.7274 0.2466 0.02607 0.000e+00
## Cumulative Proportion  0.7274 0.9739 1.00000 1.000e+00
## Standard deviations (1, .., p=4):
## [1] 2.558567e+00 1.489659e+00 4.844075e-01 4.573535e-16
## 
## Rotation (n x k) = (9 x 4):
##                            PC1        PC2          PC3        PC4
## science_proficiency  0.2564072 -0.5066436 -0.005194982  0.3659756
## proficiency          0.2564072 -0.5066436 -0.005194982  0.4120043
## enroll               0.3696686  0.2173827  0.048758244 -0.3011233
## tfedrev              0.3374443  0.3350583 -0.152570714  0.2056975
## tstrev               0.3670950  0.2303375  0.020547971  0.1117445
## tlocrev              0.3711631  0.2083354  0.088952018  0.1659562
## totalexp             0.3611541  0.2566345 -0.004066566  0.1787607
## ppcstot             -0.3255483  0.2924512  0.704380630  0.5128842
## unemployed          -0.3302308  0.2815266 -0.685410494  0.4778009
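
The near-zero fourth component above follows from the sample size: after centering, n observations span at most n - 1 directions. A minimal sketch with random numbers, assuming prcomp() with scaling and using four rows in place of the four counties and nine invented columns:

```r
set.seed(7)

# Four observations and nine invented numeric variables.
toy <- as.data.frame(matrix(rnorm(4 * 9), nrow = 4))

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# prcomp returns min(n, p) = 4 standard deviations here...
print(length(pca$sdev))

# ...but only n - 1 = 3 of them are meaningfully above zero.
n_nonzero <- sum(pca$sdev > 1e-8)
print(n_nonzero)
```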

Linear Regression Model and Decision Tree

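A linear regression model is fitted to predict the dependent variable, which is proficiency in this case. The model uses local funding (tlocrev), total expenses (totalexp), and state funding (tstrev) as predictors. The summary of the model is shown below. All three p-values are below 0.05, which means each predictor is statistically significant at the 5% level. The coefficients for local and state funding are positive, while the coefficient for total expenses is negative. The multiple R-squared is 0.30 and the adjusted R-squared is 0.26, so the model explains roughly 26% to 30% of the variance in proficiency. R-squared measures how well the model fits, not how accurate its predictions are.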

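A decision tree using these same variables is also shown. It is a more visual representation of how the prediction is made. The average accuracy printed below is 0.27, meaning the tree's predictions matched the actual value about 27% of the time. The warning in the output indicates that the tree treated proficiency as a categorical response with 53 classes, so this accuracy counts only exact matches.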

## 
## Call:
## lm(formula = proficiency ~ tlocrev + totalexp + tstrev, data = t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7718  -4.1128  -0.0673   3.5618   9.5035 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.4977577  1.0126687  22.216  < 2e-16 ***
## tlocrev      0.0004192  0.0001164   3.602 0.000715 ***
## totalexp    -0.0002727  0.0001041  -2.619 0.011587 *  
## tstrev       0.0003290  0.0001477   2.227 0.030416 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.996 on 51 degrees of freedom
## Multiple R-squared:  0.3034, Adjusted R-squared:  0.2624 
## F-statistic: 7.404 on 3 and 51 DF,  p-value: 0.0003288
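
The two R-squared values in the summary above are related by the sample size and the number of predictors. With n = 55 counties and p = 3 predictors, 1 - (1 - 0.3034) * 54 / 51 is about 0.2624, which matches the adjusted value reported. The sketch below recomputes both quantities by hand on simulated data; the data are invented, and only the model formula is taken from the report.

```r
set.seed(123)
n <- 55

# Simulated data; only the column names and formula mirror the report.
toy <- data.frame(
  tlocrev = runif(n, 1000, 60000),
  tstrev = runif(n, 5000, 80000)
)
toy$totalexp <- toy$tlocrev + toy$tstrev + rnorm(n, sd = 5000)
toy$proficiency <- 22 + 0.0002 * toy$tlocrev - 0.0001 * toy$totalexp +
  0.0001 * toy$tstrev + rnorm(n, sd = 5)

fit <- lm(proficiency ~ tlocrev + totalexp + tstrev, data = toy)
s <- summary(fit)

# R-squared: share of the variance in proficiency explained by the model.
rss <- sum(residuals(fit)^2)
tss <- sum((toy$proficiency - mean(toy$proficiency))^2)
r2_manual <- 1 - rss / tss

# Adjusted R-squared penalizes for the number of predictors (p = 3).
p <- 3
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(r2_manual, s$r.squared))
print(c(adj_r2_manual, s$adj.r.squared))
```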
## Warning: All boxes will be white (the box.palette argument will be ignored) because
## the number of classes in the response 53 is greater than length(box.palette) 6.
## To silence this warning use box.palette=0 or trace=-1.

## [1] 0.2727273
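
The warning above reports 53 classes in the response, which suggests the tree was fitted as a classification tree on a numeric outcome. One alternative, sketched here as an assumption rather than as the method used in this report, is a regression tree (method = "anova" in rpart) evaluated with root mean squared error instead of exact-match accuracy. The data are simulated, and the error is computed on the training data, so it is optimistic.

```r
library(rpart)

set.seed(2024)
n <- 55

# Simulated data; only the column names and formula mirror the report.
toy <- data.frame(
  tlocrev = runif(n, 1000, 60000),
  tstrev = runif(n, 5000, 80000)
)
toy$totalexp <- toy$tlocrev + toy$tstrev + rnorm(n, sd = 5000)
toy$proficiency <- 22 + 0.0002 * toy$tlocrev + rnorm(n, sd = 4)

# Regression tree: the response is treated as a continuous number.
tree <- rpart(proficiency ~ tlocrev + totalexp + tstrev,
              data = toy, method = "anova")

# In-sample predictions and root mean squared error.
pred <- predict(tree, newdata = toy)
rmse <- sqrt(mean((toy$proficiency - pred)^2))

# Baseline: always predict the mean proficiency.
baseline_rmse <- sqrt(mean((toy$proficiency - mean(toy$proficiency))^2))

print(c(tree = rmse, baseline = baseline_rmse))
```

Comparing the tree's error with the baseline shows how much the splits improve on simply predicting the average.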

Recommendations

The model shows that total expenses, local funding, and state funding are statistically significant predictors of proficiency. I would recommend that the state of WV focus on these areas to improve proficiency. This means putting funding into schools and looking closely at the expenses that the schools take on, since local and state funding have positive coefficients in the model while total expenses has a negative one. This recommendation is based purely on the model presented here, which explains only about 30% of the variance in proficiency, so other methods may be applicable to improve proficiency under a different model or a different set of data.

Map of Proficiency