```r
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(caret)
library(tibble)
library(iml)
library(plotly)
library(webshot)
library(broom)
library(rpart.plot)
library(xgboost)
library(DiagrammeR)
```
This evaluation uses data from the California Schools / Academic Performance Index population samples. This is a built-in dataset from the survey package in R. The dataset contains information for all California schools with at least 100 students, together with various probability samples of that population. The full population set contains 6194 observations on 37 variables. Three prediction models (model1, model2, model3) have been created that aim to predict the outcome variable api00 from 17 predictors.

Source: https://r-survey.r-forge.r-project.org/survey/html/api.html

The data is a reduced set of school-level information.
## Load data and models
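A minimal sketch of how the data may be loaded, assuming the built-in API datasets from the survey package (the three fitted models are assumed to be loaded separately, e.g. from saved objects):

```r
# data(api) from the survey package attaches apipop (the full population)
# along with the probability samples (apisrs, apistrat, apiclus1, apiclus2).
library(survey)
data(api)
dim(apipop)  # 6194 observations on 37 variables
```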
- model1 is a GLM, with 17 predictors, trained using 10-fold cross-validation on 4,478 samples
- model2 is a CART model, with 17 predictors, trained using 10-fold cross-validation on 4,478 samples
- model3 is an XGBoost model, with 17 predictors, trained using 10-fold cross-validation on 4,478 samples
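A hypothetical sketch of how the three models may have been trained with caret; the method names and the `train_df` training frame (api00 plus the 17 predictors) are assumptions, as only 10-fold cross-validation on 4,478 samples is stated:

```r
# Hypothetical training setup: train_df holds api00 and the 17 predictors.
ctrl <- trainControl(method = "cv", number = 10)

model1 <- train(api00 ~ ., data = train_df, method = "glm",     trControl = ctrl)
model2 <- train(api00 ~ ., data = train_df, method = "rpart",   trControl = ctrl)
model3 <- train(api00 ~ ., data = train_df, method = "xgbTree", trControl = ctrl)
```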
The three models were trained to predict the variable api00 (the school's academic performance index in the year 2000). The variable meals (the percentage of students eligible for subsidized meals) was the most important feature for predicting api00 in all three models, and its importance was much higher than that of any other variable. The importance of this feature was greatest for model 3 (its MSE-based importance was highest), followed by model 2 and model 1. For model 1, the second most important feature was enroll, the number of students enrolled. For models 2 and 3, avg.ed (the average education level of parents) was the second most important feature. School type (elementary/middle/high school) was the third most important predictor for all models.
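A sketch of how the permutation feature importance behind these rankings may be computed with iml; the `train_df` frame is the assumed training data from the sketch above:

```r
# Wrap a caret model in an iml Predictor, then compute permutation
# importance with MSE as the loss.
X <- train_df[, setdiff(names(train_df), "api00")]

pred1 <- Predictor$new(model1, data = X, y = train_df$api00)
imp1  <- FeatureImp$new(pred1, loss = "mse")
plot(imp1)  # repeat for model2 and model3
```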
A plot of the GLM coefficients, the CART tree, and the first tree in the XGBoost model is shown above.
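A sketch of how those visualisations may be produced, assuming the models are caret wrappers so the underlying fits sit in `$finalModel`:

```r
broom::tidy(model1$finalModel)              # GLM coefficient table
rpart.plot::rpart.plot(model2$finalModel)   # CART tree
xgboost::xgb.plot.tree(model = model3$finalModel,
                       trees = 0)           # first tree of the ensemble only
```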
## Construct PD plots for the most important predictor of the model
The previous plots provide the Individual Conditional Expectation (ICE) plots, with the PDP (the average of the ICE curves) depicted as the thick yellow line in each plot. From these, you can clearly see that model 1 is the GLM model, as the ICE and PDP curves show a completely linear relationship between the predictor "meals" and the outcome variable. Models 2 and 3 both show step-wise responses, indicating that these are the tree models; the higher fidelity of model 3, where the prediction is averaged across many trees, indicates that it is the xgboost model, while model 2 is the simple CART model.
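A sketch of how an ICE plot with the overlaid PDP may be generated for the top predictor; `pred1` is the assumed iml Predictor object from the importance sketch above:

```r
# method = "pdp+ice" draws every ICE curve plus their average (the PDP).
eff <- FeatureEffect$new(pred1, feature = "meals", method = "pdp+ice")
plot(eff)
```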
For all models, the academic performance index is inversely related to the proportion of students eligible for subsidized meals: schools with a lower proportion of students eligible for subsidized meals are likely to have a higher academic performance.
This is further demonstrated by the surface plots of the top two predictors, where model 1 produces a simple plane and models 2 and 3 show a stepped response; however, the response is much smoother for model 3, where the output has been averaged over many trees.
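A sketch of how the two-predictor partial dependence underlying those surface plots may be computed with iml; the feature pairs are assumed from the importance rankings above, and `plot()` renders the grid as a 2-D tile plot, which plotly could turn into a 3-D surface:

```r
# Two-feature PDP: partial dependence evaluated over a 2-D grid.
eff2d <- FeatureEffect$new(pred1, feature = c("meals", "enroll"),
                           method = "pdp")
plot(eff2d)
```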
Visualising the output variable across the two most important predictors shows that, for model 1, the number of students enrolled at the school is also inversely related to academic performance: as the number of enrolled students increases, the academic performance decreases.
For models 2 and 3, the relationship with the number of enrolled students was not evaluated; instead, the average parental education level had the second greatest importance and was evaluated. For both models 2 and 3, as the value of the avg.ed (average parental education) variable increased, the academic performance of the school decreased. It was not possible to ascertain what the numeric values of the parental education level represented in this instance.
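A sketch of how the hold-out metrics in the table below may be computed; the `test_df` hold-out frame is an assumption, and `postResample()` from caret returns RMSE, Rsquared and MAE:

```r
# Evaluate each model on the assumed hold-out set.
rbind(
  "Model 1" = postResample(predict(model1, test_df), test_df$api00),
  "Model 2" = postResample(predict(model2, test_df), test_df$api00),
  "Model 3" = postResample(predict(model3, test_df), test_df$api00)
)
```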
| Model   | RMSE     | Rsquared  | MAE      |
|---------|----------|-----------|----------|
| Model 1 | 48.56357 | 0.8523121 | 37.14739 |
| Model 2 | 65.32222 | 0.7335810 | 51.39215 |
| Model 3 | 44.55533 | 0.8756635 | 33.51161 |
Model 3 (the xgboost model) provided the best performance for predicting api00. The GLM model (model1) performed better than the CART model (model2), and was not substantially worse than the xgboost model.
In this instance, the GLM model may be the preferred model. The GLM is a simpler representation and provides similar performance to the xgboost model. Although the importance ranking of the predictors varied between the two models, the GLM model is easier to interpret: it provides coefficients that identify whether each predictor is positively or inversely related to the outcome variable.
This evaluation sought to use model interpretability tools to understand which predictors had the greatest importance within each model and how those variables impacted the outcome variable. Comparing model performance shows that the simple GLM performs similarly to the XGBoost model, and the GLM makes it far easier to interpret the impact and relationship of the predictors on the outcome variable. In this instance, a linear relationship between the predictors and the outcome variable seems adequate to capture the relationship and provides the most interpretable model.