```r
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(caret)
library(tibble)
library(iml)
library(plotly)
library(webshot)
library(broom)
library(rpart.plot)
library(xgboost)
library(DiagrammeR)
```
This evaluation uses data from the California Schools / Academic Performance Index population samples. This is a built-in dataset from the survey package in R. The dataset contains information for all California schools with at least 100 students, together with various probability samples of that population. The full population set contains 6194 observations on 37 variables. Three prediction models (model1, model2, model3) have been created that aim to predict the outcome variable api00 from 17 predictors.

Source: https://r-survey.r-forge.r-project.org/survey/html/api.html

The data is a reduced set of school-level information.
## Load data and models
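A minimal sketch of how the data may be loaded, assuming the built-in API datasets from the survey package (the three fitted models are assumed to be loaded separately, e.g. from saved objects):

```r
# data(api) from the survey package attaches apipop (the full population)
# along with the probability samples (apisrs, apistrat, apiclus1, apiclus2).
library(survey)
data(api)
dim(apipop)  # 6194 observations on 37 variables
```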
- model1 is a GLM, with 17 predictors, trained using 10-fold cross-validation on 4,478 samples
- model2 is a CART model, with 17 predictors, trained using 10-fold cross-validation on 4,478 samples
- model3 is an XGBoost model, with 17 predictors, trained using 10-fold cross-validation on 4,478 samples
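A hypothetical sketch of how the three models may have been trained with caret; the method names and the `train_df` training frame (api00 plus the 17 predictors) are assumptions, as only 10-fold cross-validation on 4,478 samples is stated:

```r
# Hypothetical training setup: train_df holds api00 and the 17 predictors.
ctrl <- trainControl(method = "cv", number = 10)

model1 <- train(api00 ~ ., data = train_df, method = "glm",     trControl = ctrl)
model2 <- train(api00 ~ ., data = train_df, method = "rpart",   trControl = ctrl)
model3 <- train(api00 ~ ., data = train_df, method = "xgbTree", trControl = ctrl)
```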
The three models were trained to predict the variable api00 (the school's academic performance index in the year 2000). The variable meals (the percentage of students eligible for subsidized meals) was the most important feature for predicting api00 in all three models, and its importance was much higher than that of any other variable. The importance of this feature was greatest for model 3 (its MSE-based importance was highest), followed by model 2 and model 1. For model 1, the second most important feature was enroll, the number of students enrolled. For models 2 and 3, avg.ed (the average education level of parents) was the second most important feature. School type (elementary/middle/high school) was the third most important predictor for all models.
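A sketch of how the permutation feature importance behind these rankings may be computed with iml; the `train_df` frame is the assumed training data from the sketch above:

```r
# Wrap a caret model in an iml Predictor, then compute permutation
# importance with MSE as the loss.
X <- train_df[, setdiff(names(train_df), "api00")]

pred1 <- Predictor$new(model1, data = X, y = train_df$api00)
imp1  <- FeatureImp$new(pred1, loss = "mse")
plot(imp1)  # repeat for model2 and model3
```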
A plot of the GLM coefficients, the CART tree, and the first tree in the XGBoost model is shown above.
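A sketch of how those visualisations may be produced, assuming the models are caret wrappers so the underlying fits sit in `$finalModel`:

```r
broom::tidy(model1$finalModel)              # GLM coefficient table
rpart.plot::rpart.plot(model2$finalModel)   # CART tree
xgboost::xgb.plot.tree(model = model3$finalModel,
                       trees = 0)           # first tree of the ensemble only
```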
## Construct PD plots for the most important predictor of the model
The previous plots provide the Individual Conditional Expectation (ICE) plots, with the PDP (the average of the ICE curves) depicted as the thick yellow line in each plot. From these, you can clearly see that model 1 is the GLM model, as the ICE and PDP curves show a completely linear relationship between the predictor "meals" and the outcome variable. Models 2 and 3 both show step-wise responses, indicating that these are the tree models; the higher fidelity of model 3, where the prediction is averaged across many trees, indicates that it is the xgboost model, while model 2 is the simple CART model.
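A sketch of how an ICE plot with the overlaid PDP may be generated for the top predictor; `pred1` is the assumed iml Predictor object from the importance sketch above:

```r
# method = "pdp+ice" draws every ICE curve plus their average (the PDP).
eff <- FeatureEffect$new(pred1, feature = "meals", method = "pdp+ice")
plot(eff)
```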
For all models, the academic performance index is inversely related to the proportion of students eligible for subsidized meals: schools with a lower proportion of students eligible for subsidized meals are likely to have a higher academic performance.
This is further demonstrated by the surface plots of the top two predictors, where model 1 produces a simple plane and models 2 and 3 show a stepped response; however, the response is much smoother for model 3, where the output has been averaged over many trees.
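A sketch of how the two-predictor partial dependence underlying those surface plots may be computed with iml; the feature pairs are assumed from the importance rankings above, and `plot()` renders the grid as a 2-D tile plot, which plotly could turn into a 3-D surface:

```r
# Two-feature PDP: partial dependence evaluated over a 2-D grid.
eff2d <- FeatureEffect$new(pred1, feature = c("meals", "enroll"),
                           method = "pdp")
plot(eff2d)
```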
Visualising the output variable across the two most important predictors shows that, for model 1, the number of students enrolled at the school is also inversely related to academic performance: as the number of enrolled students increases, the academic performance decreases.
For models 2 and 3, the relationship with the number of enrolled students was not evaluated; instead, the average parental education level had the second greatest importance and was evaluated. For both models 2 and 3, as the value of the avg.ed (average parental education) variable increased, the academic performance of the school decreased. It was not possible to ascertain what the numeric values of the parental education level represented in this instance.
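A sketch of how the hold-out metrics in the table below may be computed; the `test_df` hold-out frame is an assumption, and `postResample()` from caret returns RMSE, Rsquared and MAE:

```r
# Evaluate each model on the assumed hold-out set.
rbind(
  "Model 1" = postResample(predict(model1, test_df), test_df$api00),
  "Model 2" = postResample(predict(model2, test_df), test_df$api00),
  "Model 3" = postResample(predict(model3, test_df), test_df$api00)
)
```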
| Model   | RMSE     | Rsquared  | MAE      |
|---------|----------|-----------|----------|
| Model 1 | 48.56357 | 0.8523121 | 37.14739 |
| Model 2 | 65.32222 | 0.7335810 | 51.39215 |
| Model 3 | 44.55533 | 0.8756635 | 33.51161 |
Model 3 (the xgboost model) provided the best performance for predicting api00. The GLM model (model1) performed better than the CART model (model2), and was not substantially worse than the xgboost model.
In this instance, the GLM model may be the preferred model. The GLM is a simpler representation and provides similar performance to the xgboost model. Although the importance ranking of the predictors varied between the two models, the GLM model is easier to interpret: it provides coefficients that identify whether each predictor is positively or inversely related to the outcome variable.
This evaluation sought to use model interpretability tools to understand which predictors had the greatest importance within each model and how those variables impacted the outcome variable. Comparing model performance shows that the simple GLM performs similarly to the XGBoost model, and the GLM makes it far easier to interpret the impact and relationship of the predictors on the outcome variable. In this instance, a linear relationship between the predictors and the outcome variable seems adequate to capture the relationship and provides the most interpretable model.