Test 2 Aaron Cohen
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Generate a 5 x 5 grid of equally spaced points (25 in total)
x <- seq(1, 5, length.out = 5)
y <- seq(1, 5, length.out = 5)
# Create the data set
points <- expand.grid(X = x, Y = y)
points |>
  ggplot() +
  geom_point(aes(X, Y))
k <- 4
# every time this is run, 'cluster centers' get reset
k_means_points <- kmeans(points, centers = k)
# visualize clustering
points |>
  mutate(cluster = k_means_points$cluster) |>
  ggplot() +
  geom_point(aes(X, Y,
                 color = factor(cluster)),
             size = 5)
With this data set of 25 equally spaced points, the points along y = 3 and x = 3 are assigned to different clusters from run to run. K-means clustering works by choosing k initial centroids at random (when centers is a number, R's kmeans() picks k of the data points) and assigning each data point to its nearest centroid. Each centroid is then moved to the mean of the points in its cluster, and points are reassigned to their nearest centroid; these steps repeat until the assignments stop changing. The goal is to minimize the sum of squared distances between points and their cluster centroid, i.e. to minimize within-cluster variation. In this data set there is no “true clustering”; rather, each point is equally spaced from its neighbors. This means that there is no single way to minimize within-cluster variation; instead there are a number of equally effective ways (by the symmetry of the grid, rotating or reflecting a solution typically gives a different partition with the same within-cluster variation), and the algorithm can also settle on different local minima depending on where the centroids start. The inconsistency is also related to the grid being 5x5 rather than 4x4 or some multiple of 4, so the equally spaced points can never be evenly divided into four quadrants and the middle row and column have no natural home. Clustering does not ensure that each group contains an equal number of points, but if, for example, this data set were a 4x4 grid of equally spaced points, the clusters would likely be far more consistent, since the centroids would tend toward the same points: the center of each quadrant.
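The run-to-run variation described above can be checked directly. The following is a minimal sketch in base R, assuming the same 5 x 5 grid; the seeds and the number of restarts are arbitrary choices for illustration, not part of the original analysis.

```r
# 5 x 5 grid of equally spaced points, matching the grid used above
grid_points <- expand.grid(X = 1:5, Y = 1:5)

# Fit k-means from 20 different random starts (seeds are arbitrary)
seeds <- 1:20
fits <- lapply(seeds, function(s) {
  set.seed(s)
  kmeans(grid_points, centers = 4)
})

# Objective value (total within-cluster sum of squares) and sorted cluster sizes per run
wss <- sapply(fits, function(f) f$tot.withinss)
sizes <- t(sapply(fits, function(f) sort(f$size)))

# Distinct objective values and size patterns found across the runs
unique(round(wss, 6))
unique(sizes)

# nstart keeps the best of many random starts, which makes the objective value
# more stable even though the winning partition can still be a rotated copy
set.seed(1)
best_fit <- kmeans(grid_points, centers = 4, nstart = 50)
best_fit$tot.withinss
```

If more than one size pattern or objective value appears, the runs are stopping at different partitions, which is the inconsistency seen in the plots.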
The sign (positive or negative) of a loading indicates whether a variable and the given principal component are positively or negatively correlated. Given that PCA is reducing the dimensionality of the 22 years of data collected about the Southern Ocean, a strongly negative PC1 score for 2009 would imply that the data from that year were strongly negatively correlated with the pattern captured by PC1; in other words, 2009 looked like the opposite of that pattern. On the whole (as the graph shows), areas in the center of the ocean had the strongest positive loadings on PC1, which might suggest that those areas had the highest Net Primary Productivity throughout the study. 2009, then, would have seen lower-than-usual Net Primary Productivity in the center and higher-than-usual productivity in other areas of the ocean.
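To make the relationship between scores and loadings concrete, here is a small sketch using prcomp() on made-up data. The toy matrix, its "year" row names, and its "loc" column names are hypothetical stand-ins, not the Southern Ocean data.

```r
set.seed(42)

# Hypothetical toy data: 22 "years" (rows) by 6 "locations" (columns)
n_years <- 22
n_locations <- 6
trend <- rnorm(n_years)
weights <- c(1, 0.8, 0.5, -0.5, -0.8, -1)
toy_npp <- sapply(weights, function(w) w * trend + rnorm(n_years, sd = 0.3))
colnames(toy_npp) <- paste0("loc", seq_len(n_locations))
rownames(toy_npp) <- paste0("year", seq_len(n_years))

pca <- prcomp(toy_npp, center = TRUE, scale. = TRUE)
loadings_pc1 <- pca$rotation[, "PC1"]  # one loading per location
scores_pc1 <- pca$x[, "PC1"]           # one score per year

# PC1's contribution to the centered, scaled data is score x loading
pc1_part <- outer(scores_pc1, loadings_pc1)

# For the year with the most negative score, positively loaded locations are
# pushed below their mean and negatively loaded locations above it
lowest_year <- names(which.min(scores_pc1))
sign(pc1_part[lowest_year, ]) == -sign(loadings_pc1)

# The overall sign of a component is arbitrary: flipping both the scores and
# the loadings leaves the reconstructed pattern unchanged
flipped <- outer(-scores_pc1, -loadings_pc1)
all.equal(flipped, pc1_part)
```

Because the overall sign is arbitrary, a negative score is only meaningful when read together with the signs of the loadings, as the paragraph above does.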
library(tidyverse)
vt_schools <- read_csv("/Users/aaroncohen/Downloads/VT_School_Data.xls - Sheet1.csv", show_col_types = FALSE)
library(mice)
## Warning in check_dep_version(): ABI version mismatch:
## lme4 was built with Matrix ABI version 1
## Current Matrix ABI version is 0
## Please re-install lme4 from source or restore original 'Matrix' package
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
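# the two columns whose missing values we care about
# (not passed to mice() below, so every incomplete column gets imputed)
columns_to_impute <- c("StudentsFreeReducedLunch", "TotalEnrollment")
# random forest imputation; no seed is set, so results vary from run to run
imputed_vt_schools <- mice(vt_schools,
                           method = "rf",
                           maxit = 5)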
##
## iter imp variable
## (25 rows: iterations 1-5 x imputations 1-5; every row lists the same imputed variables)
## EducationSpendingGrant SchoolConstructionAid SmallSchoolsGrant StateTransportationAid SpecialEducationGrants OtherCategoricalGrants AllOtherStateGrants TotalLocalRevenue StateSourcesRevenue FederalLocalRevenue PupilSupportServices InstructionalStaffSupportServ GeneralAdministration SchoolAdministration StudentTransportation FoodServiceOperations StudentsFreeReducedLunch TotalEnrollment ELA03 ELA04 ELA05 ELA06 ELA07 ELA08 ELA011 M3 M4 M5 M6 M7 M8 M11
## Warning: Number of logged events: 1
# fully imputed data set
completed_vt_schools <- complete(imputed_vt_schools)
# exploratory graphs
# start just by plotting total enrollment vs. lunch and highlighting imputed values
completed_vt_schools |>
  ggplot() +
  geom_point(aes(x = StudentsFreeReducedLunch,
                 y = TotalEnrollment,
                 color = SupervisoryUnion == "Orange Windsor")) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  facet_wrap(~ Year)
# encouraging result... 2016 imputed value falls easily within range of other schools, isn't some outrageous outlier
# are there any trends occurring year to year that might help us justify imputations?
vt_schools |>
  ggplot() +
  geom_point(aes(x = StudentsFreeReducedLunch, y = TotalEnrollment)) +
  facet_wrap(~ Year)
## Warning: Removed 17 rows containing missing values (`geom_point()`).
vt_schools |>
  ggplot() +
  geom_point(aes(x = StudentsFreeReducedLunch, y = TotalEnrollment, color = as.factor(Year))) +
  scale_color_manual(
    values = c("2015" = "red", "2016" = "blue", "2017" = "green2")
  ) +
  labs(color = "Year")
## Warning: Removed 17 rows containing missing values (`geom_point()`).
# Oh but wait, we don't have data for 2015, 16, or 17 for Orange Windsor so we'll have to try something else...
# Let's see if there's any kind of consistent relationship between any other variable and total enrollment
# maybe education spending grant is based on enrollment?
# X = education spending grant per enrolled student
vt_schools <- vt_schools |>
  mutate(X = EducationSpendingGrant / TotalEnrollment)
# visualize
vt_schools |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion, y = X)) +
  ylim(0, 25000) +
  facet_wrap(~ Year)
## Warning: Removed 29 rows containing missing values (`geom_point()`).
# with the exception of some outliers, seems like schools typically receive between $10,000 - 20,000 in education spending grant per student
# now let's do the same with our imputed values
completed_vt_schools <- completed_vt_schools |>
  mutate(X = EducationSpendingGrant / TotalEnrollment)
# again, highlight imputed value (Orange Windsor, 2016)
completed_vt_schools |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion,
                 y = X,
                 color = SupervisoryUnion == "Orange Windsor")) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  ylim(0, 25000) +
  facet_wrap(~ Year)
## Warning: Removed 13 rows containing missing values (`geom_point()`).
# The imputed value of Total Enrollment for Orange Windsor in 2016 appears reasonable since its resulting ratio falls comfortably within this typical range
# Now let's justify the imputed value for 'Students on Free and Reduced Lunch' for Orange Windsor in 2016...
# Similar approach... is there a discernible relationship between Education Spending Grant and free/reduced lunch?
vt_schools <- vt_schools |>
  mutate(Y = EducationSpendingGrant / StudentsFreeReducedLunch)
# visualize
vt_schools |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion, y = Y)) +
  facet_wrap(~ Year)
## Warning: Removed 17 rows containing missing values (`geom_point()`).
# a lot more variation this time around...
# let's consider other variables that might be related to reduced lunch
# what about FoodServiceOperations?
vt_schools <- vt_schools |>
  mutate(Z = FoodServiceOperations / StudentsFreeReducedLunch)
# visualize
vt_schools |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion, y = Z)) +
  facet_wrap(~ Year)
## Warning: Removed 18 rows containing missing values (`geom_point()`).
# ylim(0, 50)
# still a good bit of variation...
# what about enrollment?
vt_schools <- vt_schools |>
  mutate(Z = TotalEnrollment / StudentsFreeReducedLunch)
# visualize
vt_schools |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion, y = Z)) +
  facet_wrap(~ Year)
## Warning: Removed 17 rows containing missing values (`geom_point()`).
# better! vast majority of schools in 2016 had a value between 1 and 3 for the ratio of total enrollment to students receiving free/reduced lunch
# let's see where our imputed value lies within these data (bearing in mind that our enrollment value was imputed as well...)
completed_vt_schools <- completed_vt_schools |>
  mutate(Z = TotalEnrollment / StudentsFreeReducedLunch)
# again, highlight imputed value (Orange Windsor, 2016)
completed_vt_schools |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion,
                 y = Z,
                 color = SupervisoryUnion == "Orange Windsor")) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black")) +
  # ylim(0, 25000) +
  facet_wrap(~ Year)
# again, our imputed value falls very comfortably within the typical range, which supports the reasonableness of the imputation!
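The visual checks above compare an imputed ratio against the typical range of observed ratios. The same idea can be written as a numeric check. This is only a sketch: the observed ratios, the imputed values, and the choice of the central 90% as the "typical range" are all hypothetical assumptions, not values from the Vermont data.

```r
# Hypothetical observed ratios of total enrollment to free/reduced-lunch students
observed_ratio <- seq(1, 3, length.out = 60)

# Hypothetical imputed values for one district-year (not taken from the real data)
imputed_enrollment <- 420
imputed_lunch <- 200
imputed_ratio <- imputed_enrollment / imputed_lunch

# Treat the central 90% of observed ratios as the "typical range" (an assumption)
central_range <- quantile(observed_ratio, probs = c(0.05, 0.95), na.rm = TRUE)

# TRUE when the imputed ratio falls inside the typical range
is_plausible <- imputed_ratio >= central_range[[1]] &&
  imputed_ratio <= central_range[[2]]
is_plausible
```

A check like this only shows that an imputed value is not an outlier relative to other schools; it does not show that the value is correct.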
# remove additional columns... (return to original data set)
vt_schools <- vt_schools |>
  select(-X, -Y, -Z)
# BOOSTED TREE
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.3.3
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
# let's try a boosted tree! (seemed to be a pretty accurate approach from what we saw in class...)
# predictors: every column except the target (M8, 8th-grade math)
vt_subset <- completed_vt_schools |>
  select(-M8)
boost_data <- xgb.DMatrix(data.matrix(vt_subset),
                          label = completed_vt_schools$M8)
boost_model <- xgb.train(data = boost_data,
                         nrounds = 65,
                         params = list(learning_rate = 0.1,
                                       objective = "reg:squarederror"))
predictions <- predict(boost_model, boost_data)
# compare predictions with M8, the column the model was trained to predict
comparison <- data.frame(
  Actual = completed_vt_schools$M8,
  Predicted = predictions
)
print(head(comparison))
# let's overlay on a graph to visualize degree of accuracy
completed_vt_schools <- completed_vt_schools |>
  mutate(Predicted = predictions)
completed_vt_schools |>
  filter(Year == 2015) |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion, y = M8, color = "Actual")) +
  geom_point(aes(x = SupervisoryUnion, y = Predicted, color = "Predicted")) +
  # map the legend labels to the intended colors
  scale_color_manual(values = c("Actual" = "blue", "Predicted" = "red")) +
  ylim(2400, 2650)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
completed_vt_schools |>
  filter(Year == 2016) |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion, y = M8, color = "Actual")) +
  geom_point(aes(x = SupervisoryUnion, y = Predicted, color = "Predicted")) +
  scale_color_manual(values = c("Actual" = "blue", "Predicted" = "red")) +
  ylim(2400, 2650)
completed_vt_schools |>
  filter(Year == 2017) |>
  ggplot() +
  geom_point(aes(x = SupervisoryUnion, y = M8, color = "Actual")) +
  geom_point(aes(x = SupervisoryUnion, y = Predicted, color = "Predicted")) +
  scale_color_manual(values = c("Actual" = "blue", "Predicted" = "red")) +
  ylim(2400, 2650)
# Based on graphs, predictions seem pretty damn accurate
# Let's look at other evaluations of prediction accuracy
mae <- mean(abs(completed_vt_schools$M8 - completed_vt_schools$Predicted))
print(paste("MAE:", mae))
## [1] "MAE: 5.64176957662706"
# "MAE: 5.51034759667695"
summary(completed_vt_schools$M8)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2527 2551 2537 2579 2629
summary(completed_vt_schools$Predicted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.34 2524.91 2548.34 2533.19 2573.02 2618.82
Considering that test scores can range by hundreds of points, a mean absolute error of roughly 5.5 points (5.5 to 5.6 across runs) suggests a high degree of accuracy. The predictions were made on the same data the model was trained on, however, so this is an in-sample error and probably understates how far off predictions would be for new schools or years. Additionally, the summary statistics of the predicted and actual scores differ very little. While it’s impossible to state that this method (boosted trees) is the ‘best’ way for Vermont schools to predict how well their students will do on the 8th-grade Math standardized test, I can say that this particular method yields predicted mean scores that closely resemble true performance. One factor that might contribute to this is the inclusion of all available data. Rather than eliminating any given variable because it seems unrelated to math scores, this method allows the algorithm to learn for itself which variables are, in fact, the most predictive. If schools wished to eliminate variables from their prediction approach, they might consider using SHAP values, which indicate the variables that contributed the most to a prediction.
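One way to see how much an in-sample error can flatter a model is to hold out some rows and compute the error on those instead. The sketch below uses simulated data and a plain linear model (lm) as a stand-in for the boosted tree, so that it runs in base R; the variable names, the 80/20 split, and the simulated coefficients are all assumptions for illustration.

```r
set.seed(123)

# Simulated stand-in data: two predictors and a score-like outcome
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2500 + 30 * toy$x1 - 20 * toy$x2 + rnorm(n, sd = 10)

# Hold out 20% of the rows for testing
test_idx <- sample(n, size = 0.2 * n)
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

# Fit on the training rows only (lm is a stand-in for the boosted model)
fit <- lm(y ~ x1 + x2, data = train)

# Mean absolute error helper
mae <- function(actual, predicted) mean(abs(actual - predicted))

train_mae <- mae(train$y, predict(fit, newdata = train))  # in-sample error
test_mae <- mae(test$y, predict(fit, newdata = test))     # held-out error

c(train = train_mae, test = test_mae)
```

With a flexible model such as a boosted tree, the gap between the two numbers is usually larger than for a linear model, so the held-out figure is the more honest estimate of predictive accuracy.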
Speaking of… let’s use SHAP values to do exactly that: determine which variables contributed most meaningfully to predicting math scores.
library(shapviz)
## Warning: package 'shapviz' was built under R version 4.3.3
shap_values <- shapviz(boost_model,
                       X_pred = data.matrix(vt_subset),
                       X = vt_subset)
# Look at one school as an example
sv_waterfall(shap_values, row_id = 1)
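For intuition about what a waterfall plot decomposes: each SHAP value is one variable's contribution to a single prediction, and the contributions add up to the difference between that prediction and the average prediction. The sketch below shows this for a linear model, where (assuming independent features) the contribution of a variable is its coefficient times its deviation from the mean. The data and the column names ela, grants, and math are hypothetical; the sketch does not use shapviz or the boosted model.

```r
set.seed(1)

# Hypothetical toy data: an ELA score and a grant amount predicting a math score
n <- 100
toy <- data.frame(ela = rnorm(n, mean = 2550, sd = 30),
                  grants = rnorm(n, mean = 100, sd = 20))
toy$math <- 0.8 * toy$ela + 0.5 * toy$grants + rnorm(n, sd = 5)

fit <- lm(math ~ ela + grants, data = toy)

# Baseline: the average prediction over the data
baseline <- mean(predict(fit, newdata = toy))

# Linear-model SHAP values for the first row:
# coefficient x (feature value - feature mean)
coefs <- coef(fit)[c("ela", "grants")]
feature_means <- colMeans(toy[, c("ela", "grants")])
row1 <- unlist(toy[1, c("ela", "grants")])
contrib <- coefs * (row1 - feature_means)

# The baseline plus the contributions reproduces the prediction for row 1
contrib
baseline + sum(contrib)
predict(fit, newdata = toy[1, ])
```

For tree models the contributions are computed differently, but the additive reading of the plot is the same: start at the baseline and add each bar to reach the prediction.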
Unsurprisingly, the strongest predictors of 8th grade math test scores are other test scores, including 8th grade ELA scores and 6th and 7th grade math scores. These factors, however, are not actionable: it’s not very helpful to instruct a school district to improve its 8th grade math test scores simply by improving all its other test scores at other grade levels and in different subjects. What this might suggest, though, is a correlation between performance in other areas and performance in math. In other words, schools with strong English programs are likely to have strong math programs as well.
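One actionable factor that seems to contribute positively to higher 8th grade math test scores is Other Categorical Grants. If schools apply for and receive more grant funding, they can invest in their academic programs, which may in turn raise standardized test scores. SHAP values describe how the model uses a variable rather than what causes scores to change, so this is best read as an association worth investigating.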
Food Service Operations also appears to contribute positively to higher scores. It’s not necessarily a surprise that when schools invest in quality food for their students, performance generally increases. Investing in food and other vital resources for students (which may also include providing more free/reduced lunches) may therefore help students do better on the 8th grade math standardized test.
```