We want to predict individual credit balance based on 10 predictor variables.
Credit Balance is distributed with the following density in our sample of 400 individuals:

We scale our numerical variables (but keep an unscaled version of our outcome variable!), and plot the correlations between the scaled variables:
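As a rough illustration of the scaling step, here is a minimal sketch in plain Python (not the tooling that produced the plots). It assumes "scaling" means z-scoring each variable with its sample mean and standard deviation; the toy numbers and the names `standardize` and `correlation` are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: subtract the sample mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Pearson correlation, computed as the sum of z-score products over n - 1."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical toy values, not taken from the credit data.
income = [20.0, 35.0, 50.0, 80.0, 110.0]
balance = [150.0, 300.0, 420.0, 700.0, 980.0]

balance_raw = list(balance)        # keep an unscaled copy of the outcome
income_z = standardize(income)     # scaled predictor
balance_z = standardize(balance)   # scaled outcome
r = correlation(income, balance)
```

Correlation does not depend on scale, so `r` is the same whether it is computed on the raw or the scaled values. For z-scored variables, the covariance of two columns equals their correlation.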

And look at the covariance matrix:

We create dummy variables for the categorical variables, and add indicator variables for each “Decade of Life”.
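The dummy-coding step can be sketched like this in plain Python. The decade column names follow the ones used in this write-up; the `Ethnicity` levels, the choice of "Asian" as baseline, and the helper names are assumptions made for illustration only.

```python
# Decade-of-life indicator names as they appear in this write-up.
DECADES = {2: "age_twenties", 3: "age_thirties", 4: "age_fourties", 5: "age_fifties",
           6: "age_sixties", 7: "age_seventies", 8: "age_eighties"}

def decade_dummies(age):
    """One 0/1 indicator per decade; ages outside 20-89 get all zeros (the baseline)."""
    decade = age // 10
    return {name: int(decade == d) for d, name in DECADES.items()}

def one_hot(value, levels):
    """0/1 indicators for a categorical value, dropping the first level as baseline."""
    return {level: int(value == level) for level in levels[1:]}

# Hypothetical observation; the baseline level "Asian" is an assumption.
row = {"Age": 34, "Ethnicity": "Caucasian", "Student": "No"}
features = {
    **decade_dummies(row["Age"]),
    **one_hot(row["Ethnicity"], ["Asian", "Caucasian", "AfricanAmerican"]),
    "Student": int(row["Student"] == "Yes"),
}
```

Dropping one level per categorical variable keeps the indicators from summing to a constant, which would otherwise be collinear with the intercept.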
We then draw a correlation matrix for these variables as well:

We build simple linear models for normalized numerical variables:
Outcome: Balance

| Predictor | B | CI | p |
|---|---|---|---|
| (Intercept) | 13.43 | 13.29 – 13.57 | <.001 |
| Income | 5.39 | 5.16 – 5.63 | <.001 |
| Limit | 0.68 | -1.45 – 2.81 | .529 |
| Rating | -0.65 | -2.79 – 1.48 | .549 |
| Cards | 0.09 | -0.08 – 0.25 | .320 |
| Age | 0.37 | 0.23 – 0.51 | <.001 |
| Education | 0.18 | 0.04 – 0.33 | .011 |
| Observations | 400 | | |
| R² / adj. R² | .937 / .936 | | |

As well as categorical variables:
Outcome: Balance

| Predictor | B | CI | p |
|---|---|---|---|
| (Intercept) | 30.88 | 23.15 – 38.60 | <.001 |
| age_twenties | -20.54 | -28.38 – -12.70 | <.001 |
| age_thirties | -18.84 | -26.50 – -11.19 | <.001 |
| age_fourties | -17.93 | -25.55 – -10.30 | <.001 |
| age_fifties | -17.67 | -25.32 – -10.02 | <.001 |
| age_sixties | -16.99 | -24.63 – -9.36 | <.001 |
| age_seventies | -18.10 | -25.75 – -10.45 | <.001 |
| age_eighties | -13.84 | -21.57 – -6.11 | <.001 |
| Male | -0.06 | -1.13 – 1.01 | .909 |
| Student | 1.30 | -0.49 – 3.08 | .155 |
| Married | 0.18 | -0.94 – 1.29 | .756 |
| Caucasian | -0.22 | -1.52 – 1.08 | .741 |
| AfricanAmerican | 0.07 | -1.45 – 1.58 | .932 |
| Observations | 400 | | |
| R² / adj. R² | .121 / .093 | | |

Looking at the patterns in the results above, and at their significance, we learn the following:
- Credit Balance is steeply and significantly linearly dependent on Income
- Decade of life is a reliable intercept modifier
- Our best naive model would be to use the linear model values for Balance~Income
This is the function we’ll use for our very first model:
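A one-predictor linear model of this kind can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the function used here: the closed-form least-squares fit is standard, but the toy data and the names `fit_line` and `predict` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def predict(model, x):
    """Apply a fitted (intercept, slope) pair to one value of the predictor."""
    intercept, slope = model
    return intercept + slope * x

# Hypothetical toy data standing in for (Income, Balance) pairs.
income = [1.0, 2.0, 3.0, 4.0, 5.0]
balance = [3.1, 4.9, 7.2, 8.8, 11.0]

model = fit_line(income, balance)
residuals = [y - predict(model, x) for x, y in zip(income, balance)]
```

The `residuals` list is the "error" that the following steps inspect. With an intercept in the model, least-squares residuals always sum to zero on the data used for fitting.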
Let’s look at the model predictions vs. actual values, and look at the distribution of the error:


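The error looks approximately normally distributed, which is a good thing, since normally distributed residuals are one of the assumptions behind the linear model's confidence intervals and p-values.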
Let’s see how we can explain the error in our model with a tree-based model:
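To make the idea concrete, here is a minimal sketch of the simplest possible tree, a single split (a "stump"), fitted to residuals in plain Python. The predictors and tree settings actually used are not stated here, so the `age` and `resid` values and the helper names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, residuals):
    """Find the threshold on one predictor that minimises the SSE of the two leaves."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost, sum(left) / len(left), sum(right) / len(right))
    return best

# Hypothetical residuals that change sign around age 50.
age = [25, 31, 38, 47, 55, 62, 70, 78]
resid = [-1.2, -0.9, -1.1, -1.0, 0.9, 1.1, 1.0, 1.2]

threshold, cost, left_mean, right_mean = best_split(age, resid)
```

A full regression tree repeats this search recursively inside each leaf and across all predictors. Each leaf predicts the mean residual of the observations that fall into it.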



It seems modeling the error on its own is probably not such a good idea, although, interestingly, there is a linear relationship between the real error and the error in the error prediction.
We’ll instead try to model the original response variable with a tree-based model:
We see that Income overwhelms the decision splits, since it is such an influential variable:

Income is dominating our predictive models in either case.
(This is probably why our error distributions are so similar.)
Random Forests let us limit this, because each split only considers a random subset of the predictors, so let’s try that:
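The mechanism behind this is the random subset of predictors offered to each split. The sketch below, in plain Python, shows how often Income would be left out of that subset. The subset size `mtry = 2` is a hypothetical setting, and the list covers only the six numeric predictors named in this write-up.

```python
import random
from math import comb

def candidate_features(features, mtry, rng):
    """Draw the random subset of predictors that one split is allowed to consider."""
    return rng.sample(features, mtry)

# Numeric predictors named in this write-up.
features = ["Income", "Limit", "Rating", "Cards", "Age", "Education"]
mtry = 2  # hypothetical: number of predictors tried at each split

# Exact probability that Income is NOT among the candidates for a given split.
p_excluded = comb(len(features) - 1, mtry) / comb(len(features), mtry)

# Simulated check of the same quantity, with a fixed seed for repeatability.
rng = random.Random(0)
draws = [candidate_features(features, mtry, rng) for _ in range(10_000)]
share_without_income = sum("Income" not in d for d in draws) / len(draws)
```

Under these assumptions, roughly two thirds of all splits have to pick a predictor other than Income, which gives the weaker variables a chance to shape the trees.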


It looks like our error has improved, but let’s get a standard measurement of our model errors so far, the sum of squared error:
| 871.9598 | 993.7124 | 260.182 |
This is misleading.
The plain (signed) sum of errors comes to about zero, which means the model performs well in aggregate, but the range of errors for individual observations is much wider:

It’s apparent that the errors are being made for similar observations.
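The difference between aggregate and per-observation error measures is easy to see on toy numbers. The sketch below, in plain Python, compares two hypothetical sets of predictions; the function name and all values are made up for illustration.

```python
def error_summary(actual, predicted):
    """Collect several error measures for one set of predictions."""
    errors = [p - a for a, p in zip(actual, predicted)]
    return {
        "sum_of_errors": sum(errors),                       # signed: errors can cancel
        "sum_of_squared_errors": sum(e ** 2 for e in errors),
        "sum_of_absolute_errors": sum(abs(e) for e in errors),
        "error_range": max(errors) - min(errors),
    }

# Hypothetical outcome values and two hypothetical sets of predictions.
actual = [10.0, 12.0, 14.0, 16.0]
narrow = [10.5, 12.5, 14.5, 16.5]   # always off by +0.5
wide = [7.0, 15.0, 11.0, 19.0]      # off by -3, +3, -3, +3

narrow_summary = error_summary(actual, narrow)
wide_summary = error_summary(actual, wide)
```

Here the `wide` predictions have a signed error sum of exactly zero, yet their squared and absolute errors are far larger than those of the `narrow` predictions. A single aggregate number should therefore be read next to the spread of the individual errors.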
Let’s try an ensemble approach:
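How the models were combined is not stated here. One common and simple choice is to average their predictions, which the plain-Python sketch below assumes. The three "models" are hypothetical stand-ins, not the fitted models from this analysis.

```python
def ensemble_predict(models, row):
    """Average the predictions of several fitted models for one observation."""
    predictions = [model(row) for model in models]
    return sum(predictions) / len(predictions)

# Hypothetical stand-ins for a fitted linear model, tree and random forest.
def linear_model(row):
    return 13.0 + 5.0 * row["Income"]

def tree_model(row):
    return 10.0 if row["Income"] < 0 else 17.0

def forest_model(row):
    return 12.0 + 4.0 * row["Income"]

row = {"Income": 1.0}
combined = ensemble_predict([linear_model, tree_model, forest_model], row)
```

Averaging helps most when the models make different mistakes. If they all err on the same observations, the average inherits those errors.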

This is only on 100 test observations, so let’s test this ensemble model on all data:



Our model has improved on the initial linear model, which was based only on Income.

Still, that’s a lot of error.
Bearing in mind the sample range of our outcome variable and our sample size, we end up with the following scale of error:
If we’re smart about it, we’ll use Generalized Additive Models (GAMs) from the mgcv package:

##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 21 iterations.
## The RMS GCV score gradient at convergence was 0.00000006901775364 .
## The Hessian was positive definite.
## Model rank = 60 / 64
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
##                 k'   edf k-index p-value
## s(Income)     2.00  1.00    1.00    0.50
## s(Limit)      2.00  1.00    1.09    0.91
## s(Rating)    29.00  1.00    1.02    0.64
## s(Cards)      7.00  3.39    0.97    0.27
## s(Age)        9.00  6.68    1.15    0.99
## s(Education) 14.00  1.00    1.08    0.92
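The rule of thumb quoted in the output (a low p-value with k-index below 1, especially when edf is close to k') can be applied to the table by hand or with a few lines of plain Python, as sketched below. The rows are copied from the output above; the cut-off `alpha = 0.05` and the helper names are assumptions.

```python
# (term, k', edf, k-index, p-value), copied from the basis dimension check above.
K_CHECK = [
    ("s(Income)", 2.00, 1.00, 1.00, 0.50),
    ("s(Limit)", 2.00, 1.00, 1.09, 0.91),
    ("s(Rating)", 29.00, 1.00, 1.02, 0.64),
    ("s(Cards)", 7.00, 3.39, 0.97, 0.27),
    ("s(Age)", 9.00, 6.68, 1.15, 0.99),
    ("s(Education)", 14.00, 1.00, 1.08, 0.92),
]

def flag_low_k(rows, alpha=0.05):
    """Terms with k-index < 1 and a low p-value (alpha is an assumed cut-off)."""
    return [term for term, _, _, k_index, p in rows if k_index < 1 and p < alpha]

def closest_to_limit(rows):
    """The term whose effective degrees of freedom use the largest share of k'."""
    return max(rows, key=lambda row: row[2] / row[1])

flagged = flag_low_k(K_CHECK)
tightest = closest_to_limit(K_CHECK)
```

With these values no term is flagged. `s(Cards)` has a k-index just below 1, but its p-value is not low. `s(Age)` uses the largest share of its basis dimension (6.68 of 9).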



The Sum of Absolute Error in the GAM model is:
In summary, error is a pesky thing to get rid of.
