Introduction

This document is a template for the West Virginia County Education Outcomes Prediction project. The goal of this project is to predict county-level student proficiency in science from factors such as school spending and demographics.

Load Assessment Data

## # A tibble: 55 × 7
##    county    school school_name    population_group subgroup science_proficiency
##    <chr>     <chr>  <chr>          <chr>            <chr>                  <dbl>
##  1 Barbour   999    Barbour Count… Total Population Total                   26.0
##  2 Berkeley  999    Berkeley Coun… Total Population Total                   28.6
##  3 Boone     999    Boone County … Total Population Total                   19.6
##  4 Braxton   999    Braxton Count… Total Population Total                   22.6
##  5 Brooke    999    Brooke County… Total Population Total                   21.1
##  6 Cabell    999    Cabell County… Total Population Total                   30.8
##  7 Calhoun   999    Calhoun Count… Total Population Total                   27.8
##  8 Clay      999    Clay County T… Total Population Total                   23.3
##  9 Doddridge 999    Doddridge Cou… Total Population Total                   31.3
## 10 Fayette   999    Fayette Count… Total Population Total                   17.4
## # ℹ 45 more rows
## # ℹ 1 more variable: proficiency <dbl>

Load Spending Data

t_spending holds the spending data for each county school district in West Virginia. The data is from the US Census Bureau and includes enrollment (enroll), federal, state, and local revenue (tfedrev, tstrev, tlocrev), total expenditures (totalexp), and per-pupil current spending (ppcstot).

## # A tibble: 55 × 8
##    name                  enroll tfedrev tstrev tlocrev totalexp ppcstot county  
##    <chr>                  <dbl>   <dbl>  <dbl>   <dbl>    <dbl>   <dbl> <chr>   
##  1 BARBOUR CO SCH DIST     2144    7559  16584    5872    28021   11885 Barbour 
##  2 BERKELEY CO SCH DIST   19722   48407 140127   86699   264253   12704 Berkeley
##  3 BOONE CO SCH DIST       3177    8194  26858   14564    48642   14663 Boone   
##  4 BRAXTON CO SCH DIST     1747    5479  12748    6404    24417   13153 Braxton 
##  5 BROOKE CO SCH DIST      2582    6791  17114   21352    41908   15642 Brooke  
##  6 CABELL CO SCH DIST     11667   42518  88337   66699   183621   14538 Cabell  
##  7 CALHOUN CO SCH DIST      861    3254   9953    3190    15154   16085 Calhoun 
##  8 CLAY CO SCH DIST        1669    6157  17655    2791    25963   13825 Clay    
##  9 DODDRIDGE CO SCH DIST   1082    3455   3999   31752    38493   23563 Doddrid…
## 10 FAYETTE CO SCH DIST     5594   15293  51759   23477    83373   13777 Fayette 
## # ℹ 45 more rows

Load Demographic Data

The demographic data is from the US Census Bureau and includes information on the percentage of unemployed individuals in each county.

## # A tibble: 55 × 2
##    county    unemployed
##    <chr>          <dbl>
##  1 Barbour         10.1
##  2 Berkeley         4.6
##  3 Boone            9.8
##  4 Braxton         14.4
##  5 Brooke           5.7
##  6 Cabell           6.1
##  7 Calhoun         12.2
##  8 Clay            11.2
##  9 Doddridge        4.3
## 10 Fayette          7.5
## # ℹ 45 more rows

Data Cleaning

The data is cleaned by removing any rows with missing values and ensuring that the proficiency scores are between 0 and 100. The data is then merged into a single dataframe for analysis.
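The cleaning and merging steps described above can be sketched roughly as follows. This is a minimal illustration in base R, not the project's actual code: the data frame names (assessment, spending, demographics) are assumptions, and the rows are toy values with an NA and an out-of-range score planted on purpose so the filters have something to remove.

```r
# Toy stand-ins for the three county tables. The column names follow the
# tables above; the NA and the score of 140 are planted for illustration.
assessment <- data.frame(
  county = c("Barbour", "Berkeley", "Boone", "Braxton"),
  science_proficiency = c(26.0, 28.6, NA, 140)
)
spending <- data.frame(
  county = c("Barbour", "Berkeley", "Boone", "Braxton"),
  ppcstot = c(11885, 12704, 14663, 13153)
)
demographics <- data.frame(
  county = c("Barbour", "Berkeley", "Boone", "Braxton"),
  unemployed = c(10.1, 4.6, 9.8, 14.4)
)

# Step 1: drop rows with a missing proficiency score.
assessment <- assessment[!is.na(assessment$science_proficiency), ]

# Step 2: keep only scores in the valid 0-100 range.
in_range <- assessment$science_proficiency >= 0 &
  assessment$science_proficiency <= 100
assessment <- assessment[in_range, ]

# Step 3: merge the three tables on the shared county key (inner join).
merged <- merge(assessment, spending, by = "county")
merged <- merge(merged, demographics, by = "county")

# Step 4: drop any row that still has a missing value after the merge.
merged <- merged[complete.cases(merged), ]

print(merged)
```

Using an inner join means a county missing from any one table drops out of the merged data, which matches the stated intent of removing incomplete rows.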

Variables

Correlations

The correlation matrix shows the pairwise relationships between the variables in the dataset. I also included a correlation plot to visualize them, which makes the strongly correlated pairs of variables easy to spot.
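As a rough illustration of how a correlation matrix like this can be computed, the sketch below uses the first ten counties from the tables above and three of the columns. The choice of columns and the 0.5 cutoff for "strong" are assumptions made for the example, not values taken from the project.

```r
# First ten counties from the tables shown above (three columns only).
df <- data.frame(
  science_proficiency = c(26.0, 28.6, 19.6, 22.6, 21.1,
                          30.8, 27.8, 23.3, 31.3, 17.4),
  ppcstot = c(11885, 12704, 14663, 13153, 15642,
              14538, 16085, 13825, 23563, 13777),
  unemployed = c(10.1, 4.6, 9.8, 14.4, 5.7, 6.1, 12.2, 11.2, 4.3, 7.5)
)

# Pearson correlation matrix over complete rows.
corr <- cor(df, use = "complete.obs")
print(round(corr, 2))

# List each variable pair once (upper triangle) whose absolute
# correlation exceeds a hypothetical cutoff.
threshold <- 0.5
idx <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r = corr[idx]
)
print(strong)
```

With only ten rows the coefficients are noisy, so the full 55-county data should be used before drawing conclusions.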

Linear Regression Model

The linear regression model is used to predict student proficiency in science based on the other variables in the dataset. I found an R-squared value of 0.34, which indicates that the model explains about 34% of the variance in science proficiency. The RMSE (Root Mean Square Error) is 4.58, which indicates that the model's predictions are typically off by about 4.58 percentage points. Because RMSE squares the errors before averaging, it weights large misses more heavily than a plain average error would.

## RMSE: 4.578881
## R-squared: 0.3398791
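The two metrics above can be computed as in the following sketch. It fits a linear model on the first ten counties from the tables above, using two predictors chosen for illustration, and evaluates it on the same rows; the project's actual predictors and evaluation split are not shown in this document, so the numbers it prints will not match the ones reported above.

```r
# First ten counties from the tables shown above.
df <- data.frame(
  science_proficiency = c(26.0, 28.6, 19.6, 22.6, 21.1,
                          30.8, 27.8, 23.3, 31.3, 17.4),
  ppcstot = c(11885, 12704, 14663, 13153, 15642,
              14538, 16085, 13825, 23563, 13777),
  unemployed = c(10.1, 4.6, 9.8, 14.4, 5.7, 6.1, 12.2, 11.2, 4.3, 7.5)
)

# Ordinary least squares fit (predictor choice is an assumption).
fit <- lm(science_proficiency ~ ppcstot + unemployed, data = df)

actual <- df$science_proficiency
pred <- predict(fit, newdata = df)

# RMSE: square root of the mean squared prediction error.
rmse <- sqrt(mean((actual - pred)^2))

# R-squared: 1 minus residual sum of squares over total sum of squares.
r_squared <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)

cat("RMSE:", rmse, "\n")
cat("R-squared:", r_squared, "\n")
```

Evaluated in-sample like this, the hand-computed R-squared equals summary(fit)$r.squared; on held-out data the two can differ.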

Decision Tree Model

The decision tree model is used to predict student proficiency in science based on the other variables in the dataset. The model is trained on 80% of the data and tested on the remaining 20%. The model is evaluated using RMSE (Root Mean Square Error) and R-squared values.
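A minimal sketch of the 80/20 workflow is shown below. It assumes the rpart package (bundled with standard R distributions) and uses synthetic data with the same column names as the tables above; the source does not say which tree implementation, predictors, or random seed were used, so all of those are assumptions.

```r
library(rpart)  # assumption: the source does not name the tree package

set.seed(42)
n <- 55  # same number of counties as in the tables above

# Synthetic data for illustration only; not the project's data.
df <- data.frame(
  ppcstot = runif(n, 11000, 24000),
  unemployed = runif(n, 4, 15)
)
df$science_proficiency <- 30 - 0.8 * df$unemployed + rnorm(n, sd = 3)

# 80/20 train/test split.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Regression tree ("anova" method) fitted on the training rows only.
tree <- rpart(science_proficiency ~ ppcstot + unemployed,
              data = train, method = "anova")

# Evaluate on the held-out rows.
pred <- predict(tree, newdata = test)
actual <- test$science_proficiency
rmse <- sqrt(mean((actual - pred)^2))
r_squared <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)

cat("Test RMSE:", rmse, "\n")
cat("Test R-squared:", r_squared, "\n")
```

With 55 counties the test set has only 11 rows, so both metrics will vary noticeably from one random split to another; repeating the split or using cross-validation would give a steadier estimate.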

Heatmap

The heatmap compares the predicted and actual values of student proficiency in science. Cells near the diagonal correspond to accurate predictions.
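One way to build a predicted-versus-actual heatmap is to bin both values and count the counties that fall in each cell. The sketch below does this in base R with synthetic values; the 5-point bin width and the plotting style are assumptions, and the project's own figure may have been produced differently.

```r
set.seed(1)

# Synthetic actual and predicted scores for 55 counties (illustration only).
actual <- runif(55, 15, 35)
predicted <- actual + rnorm(55, sd = 4)

# Bin both axes into 5-point intervals over the 0-100 proficiency scale.
breaks <- seq(0, 100, by = 5)
counts <- table(
  actual = cut(actual, breaks, include.lowest = TRUE),
  predicted = cut(predicted, breaks, include.lowest = TRUE)
)

# Draw the count matrix as a heatmap; the diagonal marks perfect predictions.
image(breaks, breaks, unclass(counts),
      xlab = "Actual proficiency (%)",
      ylab = "Predicted proficiency (%)")
abline(0, 1, lty = 2)
```

Counts that cluster along the dashed diagonal indicate predictions close to the actual values.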

Conclusion

In conclusion, I found that the decision tree model is a good predictor of student proficiency in science based on the other variables in the dataset. The model can be used to identify West Virginia counties where education spending and demographic factors point to room for improvement.

References

Sources used:

Rpubs link: http://rpubs.com/zz00019/1305003