The data used for this project consist of 8,161 observations of 26 variables. A sample is shown below.

KIDSDRIV  AGE  HOMEKIDS  YOJ  INCOME  PARENT1  HOME_VAL  MSTATUS  SEX  EDUCATION      JOB            TRAVTIME  CAR_USE     BLUEBOOK
0         60   0         11   67349   No       0         z_No     M    PhD            Professional   14        Private     14230
0         43   0         11   91449   No       257252    z_No     M    z_High School  z_Blue Collar  22        Commercial  14940
0         35   1         10   16039   No       124191    Yes      z_F  z_High School  Clerical       5         Private     4010
0         51   0         14   NaN     No       306251    Yes      M    <High School   z_Blue Collar  32        Private     15440
0         50   0         NaN  114986  No       243925    Yes      z_F  PhD            Doctor         36        Private     18000
0         34   1         12   125301  Yes      0         z_No     z_F  Bachelors      z_Blue Collar  46        Commercial  17430

 
 

Data Exploration

The histograms above show that a small number of values are heavily repeated in the variables INCOME, BLUEBOOK, and HOME_VAL. These values are removed in the histograms below.

 

 
 
 

Categorical Variable Summary
Variables Minivan Panel Truck Pickup Sports Car Van z_SUV
KIDSDRIV
0 1,905 (89%) 608 (90%) 1,227 (88%) 795 (88%) 677 (90%) 1,968 (86%)
1 151 (7.0%) 39 (5.8%) 102 (7.3%) 79 (8.7%) 52 (6.9%) 213 (9.3%)
2 77 (3.6%) 22 (3.3%) 46 (3.3%) 26 (2.9%) 18 (2.4%) 90 (3.9%)
3 12 (0.6%) 7 (1.0%) 13 (0.9%) 6 (0.7%) 3 (0.4%) 21 (0.9%)
4 0 (0%) 0 (0%) 1 (<0.1%) 1 (0.1%) 0 (0%) 2 (<0.1%)
HOMEKIDS
0 1,464 (68%) 506 (75%) 904 (65%) 539 (59%) 540 (72%) 1,336 (58%)
1 205 (9.6%) 63 (9.3%) 155 (11%) 111 (12%) 63 (8.4%) 305 (13%)
2 290 (14%) 56 (8.3%) 186 (13%) 149 (16%) 92 (12%) 345 (15%)
3 145 (6.8%) 43 (6.4%) 112 (8.1%) 86 (9.5%) 43 (5.7%) 245 (11%)
4 37 (1.7%) 7 (1.0%) 31 (2.2%) 20 (2.2%) 10 (1.3%) 59 (2.6%)
5 4 (0.2%) 1 (0.1%) 1 (<0.1%) 2 (0.2%) 2 (0.3%) 4 (0.2%)
PARENT1 259 (12%) 66 (9.8%) 177 (13%) 141 (16%) 77 (10%) 357 (16%)
MSTATUS
Yes 1,288 (60%) 386 (57%) 835 (60%) 558 (62%) 447 (60%) 1,380 (60%)
z_No 857 (40%) 290 (43%) 554 (40%) 349 (38%) 303 (40%) 914 (40%)
SEX
M 1,413 (66%) 642 (95%) 967 (70%) 24 (2.6%) 654 (87%) 86 (3.7%)
z_F 732 (34%) 34 (5.0%) 422 (30%) 883 (97%) 96 (13%) 2,208 (96%)
EDUCATION
<High School 282 (13%) 51 (7.5%) 268 (19%) 154 (17%) 86 (11%) 362 (16%)
Bachelors 631 (29%) 180 (27%) 352 (25%) 232 (26%) 222 (30%) 625 (27%)
Masters 481 (22%) 203 (30%) 217 (16%) 172 (19%) 181 (24%) 404 (18%)
PhD 170 (7.9%) 126 (19%) 107 (7.7%) 63 (6.9%) 102 (14%) 160 (7.0%)
z_High School 581 (27%) 116 (17%) 445 (32%) 286 (32%) 159 (21%) 743 (32%)
JOB
Clerical 319 (15%) 55 (8.1%) 256 (18%) 149 (16%) 92 (12%) 400 (17%)
Doctor 88 (4.1%) 0 (0%) 24 (1.7%) 27 (3.0%) 29 (3.9%) 78 (3.4%)
Home Maker 86 (4.0%) 4 (0.6%) 64 (4.6%) 154 (17%) 12 (1.6%) 321 (14%)
Lawyer 340 (16%) 0 (0%) 70 (5.0%) 104 (11%) 68 (9.1%) 253 (11%)
Manager 282 (13%) 112 (17%) 159 (11%) 100 (11%) 99 (13%) 236 (10%)
nan 22 (1.0%) 241 (36%) 124 (8.9%) 5 (0.6%) 116 (15%) 18 (0.8%)
Professional 329 (15%) 118 (17%) 173 (12%) 100 (11%) 132 (18%) 265 (12%)
Student 163 (7.6%) 27 (4.0%) 160 (12%) 100 (11%) 28 (3.7%) 234 (10%)
z_Blue Collar 516 (24%) 119 (18%) 359 (26%) 168 (19%) 174 (23%) 489 (21%)
CAR_USE
Commercial 441 (21%) 676 (100%) 850 (61%) 160 (18%) 454 (61%) 448 (20%)
Private 1,704 (79%) 0 (0%) 539 (39%) 747 (82%) 296 (39%) 1,846 (80%)
RED_CAR 889 (41%) 389 (58%) 624 (45%) 30 (3.3%) 401 (53%) 45 (2.0%)
CLM_FREQ
0 1,459 (68%) 384 (57%) 831 (60%) 510 (56%) 450 (60%) 1,375 (60%)
1 218 (10%) 90 (13%) 171 (12%) 122 (13%) 99 (13%) 297 (13%)
2 252 (12%) 109 (16%) 211 (15%) 140 (15%) 105 (14%) 354 (15%)
3 169 (7.9%) 64 (9.5%) 141 (10%) 107 (12%) 79 (11%) 216 (9.4%)
4 44 (2.1%) 26 (3.8%) 33 (2.4%) 26 (2.9%) 16 (2.1%) 45 (2.0%)
5 3 (0.1%) 3 (0.4%) 2 (0.1%) 2 (0.2%) 1 (0.1%) 7 (0.3%)
REVOKED 230 (11%) 71 (11%) 184 (13%) 113 (12%) 96 (13%) 306 (13%)
URBANICITY
Highly Urban/ Urban 1,739 (81%) 582 (86%) 1,100 (79%) 690 (76%) 628 (84%) 1,753 (76%)
z_Highly Rural/ Rural 406 (19%) 94 (14%) 289 (21%) 217 (24%) 122 (16%) 541 (24%)
Cell values are n (%).

 
 
 

Data Preparation

In preparing the data for analysis and exploration, the first step was converting the numerical variables from type string to type integer. After some exploration with the NA values still in the data, they were removed by replacing them as follows. The NA values for the variable AGE were replaced by the average age among observations with HOMEKIDS equal to 0, 2, and 3. For the variables HOME_VAL, CAR_AGE, and YOJ, the NA values were replaced, via Python’s NumPy library, with randomly drawn values from a normal distribution that uses each variable’s own mean and standard deviation. For the variable INCOME, the same technique was used, although its standard deviation was divided by three to avoid generating negative values.
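The normal-draw replacement described above can be sketched as follows. This is a minimal illustration rather than the code used for the project: it relies on Python's standard library instead of NumPy so that it runs on its own, and the function name `impute_normal`, the `sd_divisor` parameter, and the seed are assumptions made for the example. The INCOME values are the six shown in the sample table.

```python
import math
import random
import statistics


def impute_normal(values, sd_divisor=1.0, seed=0):
    """Replace missing entries (None or NaN) with draws from a normal
    distribution whose mean and standard deviation come from the observed
    entries. sd_divisor shrinks the spread (hypothetical parameter)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None and not math.isnan(v)]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed) / sd_divisor
    return [
        rng.gauss(mu, sigma) if (v is None or math.isnan(v)) else v
        for v in values
    ]


# INCOME column from the sample rows; the fourth entry is missing.
income = [67349, 91449, 16039, float("nan"), 114986, 125301]

# Dividing the standard deviation by three, as was done for INCOME.
filled = impute_normal(income, sd_divisor=3.0, seed=1)
print(filled)
```

Shrinking the standard deviation makes negative draws unlikely rather than impossible, so a check on the minimum of the filled column is still worthwhile.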

The next significant change to the data was the conversion of its categorical variables to a numerical equivalent. For ‘Yes/No’ variables, ‘No’ was assigned 0, and ‘Yes’ 1.

Likewise, ‘Highly Urban/Highly Rural’, ‘Private/Commercial’, and ‘Female/Male’ all took corresponding values of ‘0/1’ in the order presented here.

After the categorical variables were converted to their associated number, each variable type was unlisted then converted to numeric.
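A small sketch of the two-level recoding described above. The labels are copied from the sample and summary tables, while the dictionary name `BINARY_CODES` and the helper `encode_row` are hypothetical. The numeric codes used for multi-level variables such as EDUCATION, JOB, and CAR_TYPE are not stated in this report, so they are left untouched here.

```python
# Mapping of two-level labels to 0/1, in the order given in the text:
# No/Yes, Urban/Rural, Private/Commercial, Female/Male.
BINARY_CODES = {
    "PARENT1": {"No": 0, "Yes": 1},
    "MSTATUS": {"z_No": 0, "Yes": 1},
    "URBANICITY": {"Highly Urban/ Urban": 0, "z_Highly Rural/ Rural": 1},
    "CAR_USE": {"Private": 0, "Commercial": 1},
    "SEX": {"z_F": 0, "M": 1},
}


def encode_row(row):
    """Return a copy of row with the two-level variables recoded to 0/1.
    Variables without an entry in BINARY_CODES are passed through unchanged."""
    return {
        name: BINARY_CODES[name][value] if name in BINARY_CODES else value
        for name, value in row.items()
    }


# Second sample row from the table at the top of the report (subset of columns).
raw = {"PARENT1": "No", "MSTATUS": "z_No", "SEX": "M", "CAR_USE": "Commercial", "AGE": 43}
encoded = encode_row(raw)
print(encoded)
```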

 

Various other changes happened throughout the entire process. What’s described above is the primary preparation for analysis.

 

For the binary regression models below, no data transformation took place. For the linear regression models, the data underwent three Box-Cox transformations while using the same variables so that comparisons could be made.
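For reference, the standard one-parameter Box-Cox transform can be sketched as below. The function names are assumptions for illustration. The exponent 0.02 is the one that appears in the response-transformed model formulas (`TARGET_AMT^(0.02)`), which use a plain power rather than the shifted and scaled Box-Cox form. The two differ only by a linear rescaling of the response, so the fit is unaffected.

```python
import math


def box_cox(y, lam):
    """Standard Box-Cox transform for y > 0:
    (y**lam - 1) / lam when lam != 0, and log(y) when lam == 0."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def power_response(y, lam=0.02):
    """Plain power transform, matching the TARGET_AMT^(0.02) model formulas."""
    return y ** lam


# Median observed claim amount reported in the Model Selection section.
median_claim = 4104.0
print(power_response(median_claim))  # plain power scale
print(box_cox(median_claim, 0.02))   # Box-Cox scale, close to log(4104)
print(math.log(median_claim))
```

With an exponent this close to zero the transform behaves much like a logarithm, which is why it compresses the long right tail of claim amounts.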

 
 
 

Models


Binary Regression Models

Logit Model 1

[1] "KIDSDRIV" "HOMEKIDS" "TRAVTIME" "TIF"     

Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5141  -0.7916  -0.7104   1.3006   2.0779  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.191943   0.070325 -16.949  < 2e-16 ***
XKIDSDRIV    0.237305   0.050911   4.661 3.14e-06 ***
XHOMEKIDS    0.174733   0.024410   7.158 8.18e-13 ***
XTRAVTIME    0.006871   0.001585   4.336 1.45e-05 ***
XTIF        -0.048384   0.006440  -7.513 5.76e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418  on 8160  degrees of freedom
Residual deviance: 9213  on 8156  degrees of freedom
AIC: 9223

Number of Fisher Scoring iterations: 4
    pred
true    0    1
   0 5969   39
   1 2120   33

The coefficients of this model make sense. Having more teenagers who drive the car increases the likelihood of a crash. The number of children at home is also positively associated with the likelihood of a crash, which may be due to stress, lack of sleep, or distracted driving. Travel time does affect the likelihood of a crash: the odds increase by just under 1% for each additional unit of TRAVTIME. Time in Force (TIF) shows that the longer someone has been a customer, the less likely they are to crash. This makes sense because it indicates their propensity for coverage or ‘safety.’ These variables were chosen at the time because they were the only numerical variables available.
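Because logit coefficients are on the log-odds scale, exponentiating them gives the multiplicative change in the odds of a crash per one-unit increase in a predictor. The sketch below applies this to the Logit Model 1 estimates; the helper name `pct_change_in_odds` is an assumption for illustration.

```python
import math

# Estimates copied from the Logit Model 1 summary (log-odds scale).
coefs = {
    "KIDSDRIV": 0.237305,
    "HOMEKIDS": 0.174733,
    "TRAVTIME": 0.006871,
    "TIF": -0.048384,
}


def pct_change_in_odds(beta):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1.0) * 100.0


odds_change = {name: pct_change_in_odds(beta) for name, beta in coefs.items()}
for name, pct in odds_change.items():
    print(f"{name}: {pct:+.2f}% change in odds per unit")
```

These are changes in odds, not in probability; the change in probability depends on the values of the other predictors.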

 
 

Logit Model 2

[1] "KIDSDRIV"   "URBANICITY" "TRAVTIME"   "TIF"        "AGE"       
[6] "INCOME"     "EDUCATION" 

Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1575  -0.8148  -0.5292   1.0211   2.7840  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  6.225e-01  1.564e-01   3.980 6.90e-05 ***
XKIDSDRIV    4.456e-01  4.936e-02   9.029  < 2e-16 ***
XURBANICITY -2.325e+00  1.053e-01 -22.071  < 2e-16 ***
XTRAVTIME    1.430e-02  1.742e-03   8.209 2.24e-16 ***
XTIF        -5.304e-02  6.797e-03  -7.805 5.97e-15 ***
XAGE        -1.953e-02  3.204e-03  -6.095 1.09e-09 ***
XINCOME     -7.087e-06  7.945e-07  -8.921  < 2e-16 ***
XEDUCATION  -2.003e-01  2.876e-02  -6.964 3.30e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 8282.3  on 8153  degrees of freedom
AIC: 8298.3

Number of Fisher Scoring iterations: 5
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00749 0.12743 0.25399 0.26382 0.37338 0.90245 
    pred
true    0    1
   0 5711  297
   1 1778  375

The coefficients for all variables in this model make sense. Driving in a more rural area decreases the likelihood of a crash, and as drivers get older, their behavior becomes less risky. Income has a statistically significant effect, but it is not large. Education also decreases the odds of getting into a crash, and there are likely many different causes for this.

 
 

Logit Model 3

 [1] "MSTATUS"    "HOMEKIDS"   "INCOME"     "HOME_VAL"   "BLUEBOOK"  
 [6] "CLM_FREQ"   "MVR_PTS"    "EDUCATION"  "TRAVTIME"   "CAR_USE"   
[11] "TIF"        "CAR_TYPE"   "REVOKED"    "URBANICITY"

Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2382  -0.7412  -0.4301   0.6860   2.9399  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.869e-01  1.171e-01  -4.158 3.21e-05 ***
XMSTATUS    -6.184e-01  6.876e-02  -8.994  < 2e-16 ***
XHOMEKIDS    2.052e-01  2.525e-02   8.125 4.46e-16 ***
XINCOME     -4.758e-06  9.563e-07  -4.975 6.51e-07 ***
XHOME_VAL   -1.205e-06  3.230e-07  -3.731 0.000191 ***
XBLUEBOOK   -2.790e-05  3.918e-06  -7.121 1.07e-12 ***
XCLM_FREQ    1.560e-01  2.501e-02   6.235 4.53e-10 ***
XMVR_PTS     1.190e-01  1.331e-02   8.942  < 2e-16 ***
XEDUCATION  -1.820e-01  3.065e-02  -5.937 2.90e-09 ***
XTRAVTIME    1.469e-02  1.838e-03   7.994 1.31e-15 ***
XCAR_USE     7.499e-01  6.569e-02  11.415  < 2e-16 ***
XTIF        -5.387e-02  7.196e-03  -7.486 7.10e-14 ***
XCAR_TYPE    6.520e-02  1.712e-02   3.809 0.000140 ***
XREVOKED     7.512e-01  7.849e-02   9.571  < 2e-16 ***
XURBANICITY -2.210e+00  1.098e-01 -20.127  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418  on 8160  degrees of freedom
Residual deviance: 7544  on 8146  degrees of freedom
AIC: 7574

Number of Fisher Scoring iterations: 5
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.001564 0.089262 0.212915 0.263816 0.395212 0.962634 
    pred
true    0    1
   0 5574  434
   1 1350  803

The coefficients in this model make sense. All are described above or have conclusions easily reached through reason.


 
 

Linear Regression Models

Significant Variables

                 Estimate   Std. Error     t value     Pr(>|t|)
INDEX       -8.547316e-03 5.637293e-02 -0.15162093 8.795003e-01
TARGET_FLAG  2.919362e+03 1.182143e+03  2.46955125 1.360622e-02
KIDSDRIV    -2.110596e+02 3.139251e+02 -0.67232477 5.014500e-01
AGE          2.521098e+01 2.067129e+01  1.21961341 2.227466e-01
HOMEKIDS     2.120844e+02 2.045586e+02  1.03679049 2.999513e-01
YOJ          9.000769e+00 4.115378e+01  0.21871064 8.268964e-01
INCOME      -7.745031e-03 6.070036e-03 -1.27594482 2.021143e-01
PARENT1      3.060087e+02 5.855255e+02  0.52262229 6.012915e-01
HOME_VAL     2.325357e-03 1.958559e-03  1.18727952 2.352499e-01
MSTATUS     -7.642840e+02 4.840063e+02 -1.57907858 1.144667e-01
SEX          8.302593e+02 4.807768e+02  1.72691228 8.432854e-02
EDUCATION    2.039332e+02 2.535872e+02  0.80419362 4.213750e-01
JOB          9.059596e+01 1.177117e+02  0.76964246 4.415974e-01
TRAVTIME     1.125351e+00 1.105023e+01  0.10183956 9.188936e-01
CAR_USE      3.013173e+02 4.041585e+02  0.74554258 4.560261e-01
BLUEBOOK     1.049621e-01 2.292547e-02  4.57840676 4.955673e-06
TIF         -1.203588e+01 4.240100e+01 -0.28385832 7.765466e-01
CAR_TYPE    -4.887690e+01 1.101393e+02 -0.44377327 6.572516e-01
RED_CAR     -2.505819e+02 4.948556e+02 -0.50637371 6.126468e-01
OLDCLAIM     2.281512e-02 2.253146e-02  1.01258914 3.113716e-01
CLM_FREQ    -1.153544e+02 1.572722e+02 -0.73346975 4.633528e-01
REVOKED     -1.012165e+03 5.147216e+02 -1.96643165 4.937856e-02
MVR_PTS      1.207527e+02 6.818887e+01  1.77085628 7.672773e-02
CAR_AGE     -7.350691e+01 3.952775e+01 -1.85962785 6.307611e-02
URBANICITY  -1.479445e+01 7.516699e+02 -0.01968211 9.842988e-01

 

As shown here, few variables have a statistically significant relationship with the TARGET_AMT variable. To start, the seven variables with the smallest p-values will be included in the model.

They are: INCOME, MSTATUS, BLUEBOOK, OLDCLAIM, CLM_FREQ, REVOKED, and MVR_PTS. Three transformations will be applied which will suggest which model to investigate further.

 
 

Linear Model 1 (r1) - No Transformation


Call:
lm(formula = TARGET_AMT ~ INCOME + MSTATUS + BLUEBOOK + OLDCLAIM + 
    CLM_FREQ + REVOKED + MVR_PTS, data = numData1)

Residuals:
   Min     1Q Median     3Q    Max 
 -8425  -3132  -1509    372 101251 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.309e+03  4.484e+02   9.610  < 2e-16 ***
INCOME      -1.520e-03  4.355e-03  -0.349   0.7270    
MSTATUS     -5.153e+02  3.320e+02  -1.552   0.1208    
BLUEBOOK     1.139e-01  2.193e-02   5.192 2.28e-07 ***
OLDCLAIM     2.111e-02  2.240e-02   0.943   0.3459    
CLM_FREQ    -1.222e+02  1.559e+02  -0.784   0.4330    
REVOKED     -9.854e+02  5.094e+02  -1.934   0.0532 .  
MVR_PTS      1.304e+02  6.752e+01   1.932   0.0535 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7683 on 2145 degrees of freedom
Multiple R-squared:  0.01867,   Adjusted R-squared:  0.01546 
F-statistic: 5.829 on 7 and 2145 DF,  p-value: 1.024e-06

 
 

Linear Model 2 (r2) - Response Transformation


Call:
lm(formula = TARGET_AMT^(0.02) ~ INCOME + MSTATUS + BLUEBOOK + 
    OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS, data = numData1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.104624 -0.009621  0.000709  0.009341  0.080365 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  1.178e+00  1.111e-03 1060.059  < 2e-16 ***
INCOME      -6.953e-09  1.079e-08   -0.644   0.5193    
MSTATUS     -1.802e-03  8.225e-04   -2.191   0.0286 *  
BLUEBOOK     2.655e-07  5.433e-08    4.886  1.1e-06 ***
OLDCLAIM     9.658e-08  5.548e-08    1.741   0.0819 .  
CLM_FREQ    -8.207e-04  3.861e-04   -2.125   0.0337 *  
REVOKED     -2.067e-03  1.262e-03   -1.637   0.1017    
MVR_PTS      3.785e-04  1.673e-04    2.262   0.0238 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01903 on 2145 degrees of freedom
Multiple R-squared:  0.01856,   Adjusted R-squared:  0.01535 
F-statistic: 5.794 on 7 and 2145 DF,  p-value: 1.141e-06

 
 

Linear Model 3 (r3) - Predictor Transformation


Call:
lm(formula = TARGET_AMT ~ INCOME + MSTATUS + BLUEBOOK + OLDCLAIM + 
    CLM_FREQ + REVOKED + MVR_PTS, data = bcData)

Residuals:
   Min     1Q Median     3Q    Max 
 -7231  -3150  -1568    346 101267 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -64942.2    12858.8  -5.050 4.78e-07 ***
INCOME         297.3      428.6   0.694   0.4880    
MSTATUS       -500.4      331.6  -1.509   0.1314    
BLUEBOOK     58328.0    10731.3   5.435 6.09e-08 ***
OLDCLAIM      3508.5     9570.0   0.367   0.7139    
CLM_FREQ     -4096.6    11263.1  -0.364   0.7161    
REVOKED       -810.9      469.8  -1.726   0.0845 .  
MVR_PTS        507.0      355.5   1.426   0.1539    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7682 on 2145 degrees of freedom
Multiple R-squared:  0.01887,   Adjusted R-squared:  0.01567 
F-statistic: 5.894 on 7 and 2145 DF,  p-value: 8.402e-07

 
 

Linear Model 4 (r4) - Predictor and Response Transformation


Call:
lm(formula = TARGET_AMT^(0.02) ~ INCOME + MSTATUS + BLUEBOOK + 
    OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS, data = bcData)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.105179 -0.009561  0.000629  0.009433  0.080338 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.9911815  0.0317976  31.172  < 2e-16 ***
INCOME       0.0003063  0.0010597   0.289   0.7726    
MSTATUS     -0.0017169  0.0008200  -2.094   0.0364 *  
BLUEBOOK     0.1564427  0.0265367   5.895 4.33e-09 ***
OLDCLAIM     0.0365292  0.0236651   1.544   0.1228    
CLM_FREQ    -0.0432866  0.0278518  -1.554   0.1203    
REVOKED     -0.0016884  0.0011617  -1.453   0.1463    
MVR_PTS      0.0018712  0.0008790   2.129   0.0334 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.019 on 2145 degrees of freedom
Multiple R-squared:  0.02239,   Adjusted R-squared:  0.0192 
F-statistic: 7.017 on 7 and 2145 DF,  p-value: 2.689e-08

 
 
 

Linear Model Predictions v Actual

 

While the transformed models r2 and r4 better normalize the data and its associated error terms, the predictions these models create are highly affected by the heavy tails observed in their Q-Q plots. Their R^2 values (0.0186 for r2 and 0.0224 for r4, against 0.0187 for r1 and 0.0189 for r3) suggest at most a slightly better fit, and the difference is minuscule. The above graphic also shows that their predictions are too cautious: the high-leverage outliers are compensated for too heavily, which in turn narrows the range of predicted values. The tails are simply too heavy, and it may be worth looking into “Gaussianizing” the data to reduce these effects. These models are ruled out.

 

 
 
 

Model Selection

The binary model chosen is Logit Model 3. It is the most accurate in predicting the number of crashes and non-crashes. Shown below are the counts of NO CRASH/CRASH observations, the proportions of these observations, and the TARGET_FLAG values predicted by Model 3.

Y
   0    1 
6008 2153 
Y
        0         1 
0.7361843 0.2638157 
    pred
true    0    1
   0 5574  434
   1 1350  803

Summing the two diagonal values and dividing by the total number of observations gives the proportion correctly predicted. The figures below are approximate because random values are used to replace NAs, so the counts vary slightly between runs.

\[ 5578 + 809 = 6387 \] \[ 6387 / 8161 = 0.7826 \]
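The same calculation can be repeated for all three logit models using the confusion matrices printed in this report. This is a small sketch with assumed names (`confusion`, `accuracy`, `baseline`). With the tabulated Model 3 counts the result is 6,377 / 8,161, about 0.7814, slightly different from the figure above for the reason already given.

```python
# Confusion matrices as printed above: rows are true 0/1, columns are predicted 0/1.
confusion = {
    "Logit Model 1": [[5969, 39], [2120, 33]],
    "Logit Model 2": [[5711, 297], [1778, 375]],
    "Logit Model 3": [[5574, 434], [1350, 803]],
}


def accuracy(matrix):
    """Share of observations on the diagonal (correct predictions)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


accuracies = {name: accuracy(m) for name, m in confusion.items()}

# Accuracy of always predicting NO CRASH (6008 of 8161 observations).
baseline = 6008 / 8161

for name, acc in accuracies.items():
    print(f"{name}: {acc:.4f}")
print(f"Always predict 0: {baseline:.4f}")
```

Comparing against the always-predict-0 baseline is useful here because about 74% of observations are non-crashes, so overall accuracy alone can flatter a model that rarely predicts a crash.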

 

Comparing linear models r1 and r3, a heavy right skew is seen for both in the Q-Q plots. The Residual v Fitted plot indicates the presence of heteroskedasticity in r3 and shows the multitude of outliers in r1. Weighted least squares could be used to reduce the heteroskedasticity. As seen in the Linear Model Predictions v Actual graphic, r1’s predicted values are higher and r3’s lower.

r1: min 3,254, mean 5,702, max 11,050
r3: min 2,069, mean 5,702, max 8,493

Many combinations of variables were tried with both models to little effect. The difference between these models is not readily apparent, but the chosen model is r3 (Linear Model 3). It is chosen over r1 (Linear Model 1) simply because r1’s smallest prediction (3,254) lies above 34.74% of observed claim amounts, while r3’s largest prediction (8,493) lies below only 9.52% of observations. Below are the quartiles of the observed TARGET_AMT values.

          0%          25%          50%          75%         100% 
    30.27728   2609.77857   4104.00000   5787.00000 107586.13616 

Since the total amount of payments made by the insurance company ($12,276,793) is also the sum of the predicted claim payments for both models, the chosen model is simply the one that best represents the actual data. While a tight fit (high R^2) is not seen anywhere, R^2 is slightly higher in r3 (0.01887) than in r1 (0.01867).
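The equality of total predicted and total observed payments is a general property of least squares with an intercept: the residuals sum to zero, so the fitted values sum to the observed total. The sketch below shows this with a one-predictor fit. The BLUEBOOK values are taken from the sample table, but the claim amounts are made up purely for illustration, and `fit_simple_ols` is a hypothetical helper.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# BLUEBOOK values from the sample rows (sorted).
bluebook = [4010.0, 14230.0, 14940.0, 15440.0, 17430.0, 18000.0]
# Hypothetical claim amounts, NOT taken from the data.
claims = [2500.0, 4100.0, 3900.0, 9800.0, 5200.0, 6100.0]

intercept, slope = fit_simple_ols(bluebook, claims)
fitted = [intercept + slope * v for v in bluebook]

print(sum(claims), sum(fitted))
```

Because every intercept model reproduces the observed total, that total cannot be used to choose between r1 and r3.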

 
 
 

Testing on Evaluation Data

 

INDEX  TARGET_FLAG  TARGET_AMT
3      0            0
9      0            0
10     0            0
18     0            0
21     0            0
30     0            0
31     0            0
37     0            0
39     0            0
47     0            0
60     0            0
62     1            5.77e+03
63     1            5.46e+03
64     0            0
68     0            0
75     1            6.04e+03
76     1            3.17e+03
83     0            0
87     1            4.82e+03
92     0            0
98     0            0
106    0            0
107    0            0
113    0            0
120    0            0
123    0            0
125    0            0
126    0            0
128    0            0
129    0            0
131    0            0
135    0            0
141    0            0
147    0            0
148    0            0
151    0            0
156    0            0
157    0            0
174    0            0
186    0            0