This document is a template for the West Virginia County Education Outcomes Prediction project. The goal of this project is to predict student proficiency in science based on various factors such as spending, demographics, and other relevant data.
## # A tibble: 55 × 7
## county school school_name population_group subgroup science_proficiency
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Barbour 999 Barbour Count… Total Population Total 26.0
## 2 Berkeley 999 Berkeley Coun… Total Population Total 28.6
## 3 Boone 999 Boone County … Total Population Total 19.6
## 4 Braxton 999 Braxton Count… Total Population Total 22.6
## 5 Brooke 999 Brooke County… Total Population Total 21.1
## 6 Cabell 999 Cabell County… Total Population Total 30.8
## 7 Calhoun 999 Calhoun Count… Total Population Total 27.8
## 8 Clay 999 Clay County T… Total Population Total 23.3
## 9 Doddridge 999 Doddridge Cou… Total Population Total 31.3
## 10 Fayette 999 Fayette Count… Total Population Total 17.4
## # ℹ 45 more rows
## # ℹ 1 more variable: proficiency <dbl>
t_spending holds the spending data for each county school district in West Virginia. The data is from the US Census Bureau and includes enrollment, federal, state, and local revenue, total expenditures, and per-pupil current spending.
## # A tibble: 55 × 8
## name enroll tfedrev tstrev tlocrev totalexp ppcstot county
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 BARBOUR CO SCH DIST 2144 7559 16584 5872 28021 11885 Barbour
## 2 BERKELEY CO SCH DIST 19722 48407 140127 86699 264253 12704 Berkeley
## 3 BOONE CO SCH DIST 3177 8194 26858 14564 48642 14663 Boone
## 4 BRAXTON CO SCH DIST 1747 5479 12748 6404 24417 13153 Braxton
## 5 BROOKE CO SCH DIST 2582 6791 17114 21352 41908 15642 Brooke
## 6 CABELL CO SCH DIST 11667 42518 88337 66699 183621 14538 Cabell
## 7 CALHOUN CO SCH DIST 861 3254 9953 3190 15154 16085 Calhoun
## 8 CLAY CO SCH DIST 1669 6157 17655 2791 25963 13825 Clay
## 9 DODDRIDGE CO SCH DIST 1082 3455 3999 31752 38493 23563 Doddrid…
## 10 FAYETTE CO SCH DIST 5594 15293 51759 23477 83373 13777 Fayette
## # ℹ 45 more rows
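The spending table carries both the district name and a county column, and the county name is what the other tables share. The source does not show how the county column was built, so the following is only one possible sketch, assuming every district name ends in " CO SCH DIST" as in the rows shown. The vector district_names is a hypothetical stand-in for the name column.

```r
# Hypothetical stand-in for the `name` column of the spending table.
district_names <- c("BARBOUR CO SCH DIST", "BERKELEY CO SCH DIST", "BOONE CO SCH DIST")

# Strip the trailing " CO SCH DIST" suffix.
stem <- sub(" CO SCH DIST$", "", district_names)

# Keep the first letter upper case and lower-case the rest.
county <- paste0(substr(stem, 1, 1), tolower(substr(stem, 2, nchar(stem))))

print(county)
```

This simple capitalization would not handle names with internal capitals or spaces, so the result should be checked against the county names in the other tables before joining.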
The demographic data is from the US Census Bureau and includes information on the percentage of unemployed individuals in each county.
## # A tibble: 55 × 2
## county unemployed
## <chr> <dbl>
## 1 Barbour 10.1
## 2 Berkeley 4.6
## 3 Boone 9.8
## 4 Braxton 14.4
## 5 Brooke 5.7
## 6 Cabell 6.1
## 7 Calhoun 12.2
## 8 Clay 11.2
## 9 Doddridge 4.3
## 10 Fayette 7.5
## # ℹ 45 more rows
The data is cleaned by removing any rows with missing values and ensuring that the proficiency scores are between 0 and 100. The data is then merged into a single dataframe for analysis.
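The cleaning and merging steps described above could look roughly like the sketch below. It uses base R to stay dependency-free, while the original analysis appears to use tibbles. The data frame names (proficiency_df, spending_df, demographics_df) are assumptions, and the toy rows include one missing value and one out-of-range value that were added purely for illustration.

```r
# Toy stand-ins for the three county-level tables. Braxton's proficiency and
# Boone's missing value are deliberately corrupted for illustration.
proficiency_df <- data.frame(
  county = c("Barbour", "Berkeley", "Boone", "Braxton"),
  proficiency = c(26.0, 28.6, NA, 122.6)
)
spending_df <- data.frame(
  county = c("Barbour", "Berkeley", "Boone", "Braxton"),
  ppcstot = c(11885, 12704, 14663, 13153)
)
demographics_df <- data.frame(
  county = c("Barbour", "Berkeley", "Boone", "Braxton"),
  unemployed = c(10.1, 4.6, 9.8, 14.4)
)

# Drop rows with missing values, then keep proficiency within [0, 100].
clean_prof <- proficiency_df[complete.cases(proficiency_df), ]
clean_prof <- clean_prof[clean_prof$proficiency >= 0 &
                           clean_prof$proficiency <= 100, ]

# Inner-join the three tables on the shared county key.
merged <- merge(clean_prof, spending_df, by = "county")
merged <- merge(merged, demographics_df, by = "county")

print(merged)
```

Because merge() performs an inner join by default, a county that is dropped during cleaning also disappears from the merged table.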
- county: The name of the county in West Virginia. Used as the key to join datasets.
- proficiency: The percentage of students in the county proficient in science. Derived from science_proficiency.
- enroll: Total number of enrolled students in the county.
- tfedrev: Total federal revenue received by the school district in the county.
- tstrev: Total state revenue received by the school district.
- tlocrev: Total local revenue received by the school district.
- totalexp: Total education-related expenditures in the county.
- ppcstot: Per-pupil current spending total (a measure of education spending per student).
- unemployed: Percentage of unemployed individuals in the county’s population (a socioeconomic factor).
- state (added for mapping): Fixed as “West Virginia” to allow mapping with usmap.

The correlation matrix shows the relationships between the variables in the dataset. I also included a correlation plot to visualize the correlations between the variables. It shows that there are strong correlations between the following variables:
The linear regression model is used to predict student proficiency in science based on the other variables in the dataset. I found an R-squared value of 0.34, which indicates that the model explains about 34% of the variance in science proficiency. The RMSE (Root Mean Square Error) is 4.58, meaning the model’s predictions typically miss the observed proficiency by about 4.58 percentage points.
## RMSE: 4.578881
## R-squared: 0.3398791
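As a rough illustration of how these two metrics are obtained from a fitted linear model, the sketch below fits lm() on synthetic data and computes RMSE and R-squared by hand. The data frame toy, its coefficients, and the choice of predictors are assumptions made for the example. The metrics here are computed in-sample, which the source does not confirm for the reported values, and the numbers will not match the output above.

```r
set.seed(42)  # reproducible toy data

# Synthetic stand-in for the merged county table (55 rows, like the real one).
n <- 55
toy <- data.frame(
  ppcstot = runif(n, 11000, 24000),
  unemployed = runif(n, 4, 15)
)
toy$proficiency <- 30 - 0.8 * toy$unemployed + rnorm(n, sd = 4)

# Fit an ordinary least squares model.
fit <- lm(proficiency ~ ppcstot + unemployed, data = toy)

# RMSE: square root of the mean squared residual.
pred <- predict(fit, newdata = toy)
rmse <- sqrt(mean((toy$proficiency - pred)^2))

# R-squared: 1 - residual sum of squares / total sum of squares.
ss_res <- sum((toy$proficiency - pred)^2)
ss_tot <- sum((toy$proficiency - mean(toy$proficiency))^2)
r_squared <- 1 - ss_res / ss_tot

cat("RMSE:", rmse, "\n")
cat("R-squared:", r_squared, "\n")
```

RMSE is in the same units as the response (percentage points of proficiency), which is why it can be read as a typical prediction error.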
The decision tree model is used to predict student proficiency in
science based on the other variables in the dataset. The model is
trained on 80% of the data and tested on the remaining 20%. The model is
evaluated using RMSE (Root Mean Square Error) and R-squared values.
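The 80/20 workflow described above might be sketched as follows. The source does not name the tree package, so rpart is an assumption here, as are the synthetic data frame toy and the choice of predictors. Metrics are computed on the held-out 20% only.

```r
library(rpart)  # assumption: the source does not say which tree package was used

set.seed(123)  # reproducible toy data and split

# Synthetic stand-in for the merged county table.
n <- 55
toy <- data.frame(
  ppcstot = runif(n, 11000, 24000),
  unemployed = runif(n, 4, 15)
)
toy$proficiency <- 30 - 0.8 * toy$unemployed + rnorm(n, sd = 4)

# 80/20 split: sample row indices for training, hold out the rest.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Regression tree (method = "anova") fitted on the training rows only.
tree <- rpart(proficiency ~ ppcstot + unemployed, data = train, method = "anova")

# Evaluate on the held-out rows.
pred <- predict(tree, newdata = test)
rmse <- sqrt(mean((test$proficiency - pred)^2))
r_squared <- 1 - sum((test$proficiency - pred)^2) /
  sum((test$proficiency - mean(test$proficiency))^2)

cat("Test RMSE:", rmse, "\n")
cat("Test R-squared:", r_squared, "\n")
```

With only 55 counties the test set has 11 rows, so the held-out metrics can vary considerably with the random split, and the held-out R-squared can even be negative.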
The heatmap shows the predicted vs actual values of student proficiency in science.
In conclusion, I found that the decision tree model is a good predictor of student proficiency in science based on the other variables in the dataset. The model can be used to identify areas for improvement in education spending and demographics in West Virginia counties.
Sources used:
RPubs link: http://rpubs.com/zz00019/1305003