We want to predict individual credit balance based on 10 predictor variables.

Credit Balance is distributed with the following density in our sample of 400 individuals:

Looking at the categorical variables, we can draw the following boxplots and density plots for the distribution of credit balance at each variable level:

For discrete variables, we draw boxplots only: density plots are less informative…

For continuous variables, we fit linear models to our sample data points and investigate the error of these fits:
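
A minimal sketch of this step, assuming the data live in a data frame named `credit` (a name not given above):

```r
# Minimal sketch: `credit` is an assumed name for the 400-row data frame.
fit_income <- lm(Balance ~ Income, data = credit)
summary(fit_income)                      # slope, intercept, R^2
plot(credit$Income, resid(fit_income),   # the error of the fitted function
     xlab = "Income", ylab = "Residual")
```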

We scale our numerical variables (but keep an unscaled version of our outcome variable!), and plot the correlations between the scaled variables:

And look at the covariance matrix:
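
A sketch of these two steps, under the same assumed `credit` data frame; we standardize the numeric predictors and leave Balance itself untouched:

```r
# Standardize the numeric predictors; keep the outcome unscaled.
num_cols <- setdiff(names(credit)[sapply(credit, is.numeric)], "Balance")
credit_scaled <- credit
credit_scaled[num_cols] <- scale(credit[num_cols])

cor(credit_scaled[num_cols])   # correlations between the scaled variables
cov(credit[num_cols])          # covariance matrix of the original variables
```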

We create dummy variables for the categorical variables, and add indicator variables corresponding to each “Decade of Life”, as sketched below.
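
A sketch of the recoding; the factor names and levels are assumptions read back from the coefficient names in the tables below:

```r
# 0/1 dummies for the categorical variables (level names are assumptions).
credit_scaled$Male            <- as.numeric(credit$Gender    == "Male")
credit_scaled$Student         <- as.numeric(credit$Student   == "Yes")
credit_scaled$Married         <- as.numeric(credit$Married   == "Yes")
credit_scaled$Caucasian       <- as.numeric(credit$Ethnicity == "Caucasian")
credit_scaled$AfricanAmerican <- as.numeric(credit$Ethnicity == "African American")

# One indicator per "Decade of Life", built from the unscaled ages
# (spelling of "fourties" kept to match the tables below).
decades <- c("twenties", "thirties", "fourties", "fifties",
             "sixties", "seventies", "eighties", "nineties")
for (i in seq_along(decades)) {
  lo <- (i + 1) * 10   # twenties start at 20, thirties at 30, ...
  credit_scaled[[paste0("age_", decades[i])]] <-
    as.numeric(credit$Age >= lo & credit$Age < lo + 10)
}
```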

We then draw a correlation matrix for these variables as well:

We build a linear model on the normalized numerical variables:

    Balance
    Predictor       B       CI              p
    (Intercept)     13.43   13.29 – 13.57   <.001
    Income          5.39    5.16 – 5.63     <.001
    Limit           0.68    -1.45 – 2.81    .529
    Rating          -0.65   -2.79 – 1.48    .549
    Cards           0.09    -0.08 – 0.25    .320
    Age             0.37    0.23 – 0.51     <.001
    Education       0.18    0.04 – 0.33     .011
    Observations    400
    R2 / adj. R2    .937 / .936
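
The B / CI / p layout looks like sjPlot’s tab_model() output (an assumption on our part); a sketch of the model behind the table:

```r
# Linear model on the scaled numeric predictors, with unscaled Balance.
m_num <- lm(Balance ~ Income + Limit + Rating + Cards + Age + Education,
            data = credit_scaled)
sjPlot::tab_model(m_num)   # renders the B / CI / p table
```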

As well as one on the categorical variables:

    Balance
    Predictor         B        CI                p
    (Intercept)       30.88    23.15 – 38.60     <.001
    age_twenties      -20.54   -28.38 – -12.70   <.001
    age_thirties      -18.84   -26.50 – -11.19   <.001
    age_fourties      -17.93   -25.55 – -10.30   <.001
    age_fifties       -17.67   -25.32 – -10.02   <.001
    age_sixties       -16.99   -24.63 – -9.36    <.001
    age_seventies     -18.10   -25.75 – -10.45   <.001
    age_eighties      -13.84   -21.57 – -6.11    <.001
    Male              -0.06    -1.13 – 1.01      .909
    Student           1.30     -0.49 – 3.08      .155
    Married           0.18     -0.94 – 1.29      .756
    Caucasian         -0.22    -1.52 – 1.08      .741
    AfricanAmerican   0.07     -1.45 – 1.58      .932
    Observations      400
    R2 / adj. R2      .121 / .093

Finally, we build a linear model including all these transformed variables:

    Balance
    Predictor         B       CI              p
    (Intercept)       12.09   9.29 – 14.89    <.001
    Income            5.44    5.20 – 5.67     <.001
    Limit             0.82    -1.29 – 2.93    .444
    Rating            -0.77   -2.88 – 1.35    .477
    Cards             0.09    -0.07 – 0.26    .281
    Age               0.14    -0.72 – 1.00    .746
    Education         0.19    0.05 – 0.33     .008
    age_twenties      1.31    -2.66 – 5.29    .516
    age_thirties      1.31    -2.30 – 4.93    .475
    age_fourties      1.13    -2.07 – 4.32    .489
    age_fifties       1.88    -0.96 – 4.72    .194
    age_sixties       1.81    -0.68 – 4.30    .154
    age_seventies     1.92    -0.35 – 4.18    .097
    age_eighties      1.73    -0.39 – 3.85    .109
    Male              -0.05   -0.33 – 0.22    .705
    Student           0.67    0.20 – 1.13     .005
    Married           -0.40   -0.69 – -0.11   .007
    Caucasian         -0.03   -0.36 – 0.31    .883
    AfricanAmerican   -0.11   -0.50 – 0.28    .588
    Observations      400
    R2 / adj. R2      .942 / .940

And a separate simple model for the age_nineties dummy, which was left out of the model above:

    Balance
    Predictor       B       CI             p
    (Intercept)     -0.02   -0.11 – 0.08   .751
    age_nineties    3.11    1.74 – 4.47    <.001
    Observations    400
    R2 / adj. R2    .048 / .046

Looking at the patterns in, and the significance of, the results above, we learn the following:

This is the function we’ll use for our very first model:
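
The formula itself isn’t reproduced here; since we note further down that the initial linear model was based only on Income, a plausible sketch is:

```r
# Assumption: the remark below ("based only on Income") suggests this form.
m1 <- lm(Balance ~ Income, data = credit_scaled)
summary(m1)
```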

Let’s look at the model predictions vs. actual values, and at the distribution of the error:
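
A sketch of both plots:

```r
# Predicted vs. actual, plus a histogram of the error.
pred_m1 <- predict(m1, credit_scaled)
plot(credit_scaled$Balance, pred_m1,
     xlab = "Actual Balance", ylab = "Predicted Balance")
abline(0, 1)   # perfect predictions would fall on this line
hist(credit_scaled$Balance - pred_m1,
     main = "Error distribution", xlab = "Error")
```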

The error is approximately normally distributed, which is a good thing.

Let’s see how we can explain the error in our model with a tree-based model:
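
A sketch with rpart (the tree implementation is an assumption; the original doesn’t name one):

```r
# Fit a regression tree to the residuals of the linear model.
library(rpart)
credit_scaled$m1_error <- credit_scaled$Balance - pred_m1
tree_error <- rpart(m1_error ~ Income + Limit + Rating + Cards + Age + Education,
                    data = credit_scaled)
```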

It seems modeling the error on its own is probably not such a good idea, although, interestingly, there is a linear relationship between the real error and the error in the error prediction…

We’ll instead try to model the original response variable with a tree-based model:
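
The same rpart approach, now aimed at Balance itself:

```r
tree_bal <- rpart(Balance ~ Income + Limit + Rating + Cards + Age + Education,
                  data = credit_scaled)
plot(tree_bal); text(tree_bal)   # Income dominates the early splits
```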

We see that Income overwhelms the decision splits, since it is such an influential variable:

We find that the error is not much different from that of the linear model:

Income is dominating our predictive models in either case.

(This is probably why our error distributions are so similar…)

Random forests prevent this from happening, by considering only a random subset of the predictors at each split, so let’s try that:
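
A sketch with the randomForest package; the package choice, seed, and mtry value are assumptions:

```r
library(randomForest)
set.seed(1)   # arbitrary seed for reproducibility
rf_bal <- randomForest(Balance ~ Income + Limit + Rating + Cards + Age + Education,
                       data = credit_scaled,
                       mtry = 2)   # only 2 of 6 predictors tried per split
```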

It looks like our error has improved, but let’s get a standard measurement of our model errors so far, the sum of squared errors:

LinearModelError TreeModelError RandomForestError
871.9598 993.7124 260.182
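
A sketch of the computation, assuming the models fitted above:

```r
# Sum of squared errors for each model, on the same 400 observations.
sse <- function(actual, predicted) sum((actual - predicted)^2)
c(LinearModelError  = sse(credit_scaled$Balance, pred_m1),
  TreeModelError    = sse(credit_scaled$Balance, predict(tree_bal, credit_scaled)),
  RandomForestError = sse(credit_scaled$Balance, predict(rf_bal, credit_scaled)))
```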

This is misleading…

The random forest’s sum of squared errors is much closer to zero, which means it performs better in aggregate, but the range of errors for individual observations is much wider:

It’s apparent that the errors are being made for similar observations…

Let’s try neural networks…
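
A sketch with the nnet package (the implementation, hidden-layer size, and iteration cap are all assumptions); linout = TRUE gives a linear output unit, as regression needs:

```r
library(nnet)
set.seed(1)
nn_bal <- nnet(Balance ~ Income + Limit + Rating + Cards + Age + Education,
               data = credit_scaled,
               size = 5,        # one hidden layer of 5 units (assumption)
               linout = TRUE,   # linear output for regression
               maxit = 500)
```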

Let’s try an ensemble approach:
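
A minimal sketch of one ensembling choice, assuming a simple average of the three models’ predictions:

```r
# Average the linear, forest, and network predictions.
ensemble_pred <- (pred_m1 +
                  predict(rf_bal, credit_scaled) +
                  as.vector(predict(nn_bal, credit_scaled))) / 3
```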

This is only on 100 test observations; let’s test this ensemble model on all the data:

Our model has improved from the initial linear model based only on Income…

Still, that’s a lot of error…

Bearing in mind the sample range of our outcome variable…

Minimum Balance Maximum Balance
3.749402976 38.78512301

And our sample size…

Sample Size
400

We end up with the following scale of error:

Mean absolute error per observation
0.8846963259
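
As a sketch, over all 400 observations:

```r
# Mean absolute error per observation for the ensemble.
mean(abs(credit_scaled$Balance - ensemble_pred))
```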

If we’re smart about it, we’ll use Generalized Additive Models (GAMs) from the mgcv package:
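
A sketch of the fit; the basis dimensions k are assumptions read back from the k' column of the gam.check() output below:

```r
# One smooth per numeric predictor; gam.check() prints the diagnostics below.
library(mgcv)
gam_bal <- gam(Balance ~ s(Income, k = 3) + s(Limit, k = 3) +
                 s(Rating, k = 30) + s(Cards, k = 8) +
                 s(Age, k = 10) + s(Education, k = 15),
               data = credit_scaled)
gam.check(gam_bal)
```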

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 21 iterations.
## The RMS GCV score gradient at convergence was 0.00000006901775364 .
## The Hessian was positive definite.
## Model rank =  60 / 64 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                 k'   edf k-index p-value
## s(Income)     2.00  1.00    1.00    0.50
## s(Limit)      2.00  1.00    1.09    0.91
## s(Rating)    29.00  1.00    1.02    0.64
## s(Cards)      7.00  3.39    0.97    0.27
## s(Age)        9.00  6.68    1.15    0.99
## s(Education) 14.00  1.00    1.08    0.92

The Sum of Absolute Error in the GAM model is:

GAM Model: Sum of Absolute Error
405.4928862
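
Again as a sketch:

```r
# Sum of absolute error for the GAM, on the same data.
sum(abs(credit_scaled$Balance - predict(gam_bal, credit_scaled)))
```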

In summary, error is a pesky thing to get rid of…