1 Introduction

In this report, we analyze insurance data to estimate the following quantities:

  • The probability that a specified driver will have a car crash.
  • The dollar cost of auto claims, given that the insured was involved in a crash.

In practice, these claim frequency and severity measures are useful for determining appropriate pure premium amounts to charge auto policyholders.

Using the provide training data set, we will build two separate models:

  • a binary logistic regression to determine crash probabilities.
  • a multiple linear regression model to estimate claim severity.

We will then make predictions on the provided insurance testing data set.

2 Data Exploration

2.1 Variable Overview

The training data set includes 8,161 observations, with 26 variables: 23 predictors, two response variables, and one record identifier.

Below is a brief description of the included variables:

Variable Name Description Theoretical Impact
INDEX Identification Variable (do not use) None
TARGET_FLAG In a crash? 1=YES 0=NO None
TARGET_AMT Cost of Crash, if applicable None
KIDSDRIV # Driving Children When teenagers drive your car, increased crash risk
AGE Age of Driver Young and old drivers might be riskier
HOMEKIDS # Children at Home Unknown effect
YOJ Years on Job Long-term employees are usually safer
INCOME Income In theory, rich have fewer crashes
PARENT1 Single Parent Unknown impact
HOME_VAL Home Value In theory, home owners may drive more responsibly
MSTATUS Marital Status In theory, married individuals are less risky
SEX Gender Urban legend: females are safer drivers
EDUCATION Max Education Level Unknown, but in theory more educated people tend to drive more safely
JOB Job Category In theory, white collar workers are less risky
TRAVTIME Commute Distance Long drives to work usually suggest greater risk
CAR_USE Vehicle Use Commercial fleet driven more, may impact collision prob
BLUEBOOK Value of Vehicle Unknown impact on collision prob, but impacts crash payout
TIF Time in Force Long-term customers are usually safer
CAR_TYPE Type of Car Unknown impact on collision prob, but impacts crash payout
RED_CAR A Red Car Urban legend: red cars are riskier, particularly sports cars
OLDCLAIM # Claims (Past 5 Years) If total payout high, future payouts might be high
CLM_FREQ Total Claims (Past 5 Years) Claim count should be positively correlated with future claims
REVOKED License Revoked (Past 7 Years) If your license was revoked, you probably are a riskier driver
MVR_PTS Motor Vehicle Report Points Traffic ticket counts have postive correlation with crashes
CAR_AGE Vehicle Age Unknown impact on collision prob, but impacts crash payout
URBANICITY Home/Work Area Unknown impact

There are a couple issues with the raw data file:

  • Currency fields were treated as factors due to “$” and “,” characters.
  • Multiple character field entries included an extraneous “z_” or “<” prefix.

We also rescaled the fields INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM to be so that dollars are expressed in $1,000s,

After cleaning up these minor issues, we’re ready to explore data types and sample observations from the training set:

'data.frame':   8161 obs. of  26 variables:
 $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
 $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
 $ TARGET_AMT : num  0 0 0 0 0 ...
 $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
 $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
 $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
 $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
 $ INCOME     : num  67.3 91.4 16 NA 115 ...
 $ PARENT1    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ HOME_VAL   : num  0 257 124 306 244 ...
 $ MSTATUS    : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 2 2 1 1 ...
 $ SEX        : Factor w/ 2 levels "F","M": 2 2 1 2 1 1 1 2 1 2 ...
 $ EDUCATION  : Ord.factor w/ 4 levels "High School"<..: 4 1 1 1 4 2 1 2 2 2 ...
 $ JOB        : Factor w/ 9 levels "","Blue Collar",..: 8 2 3 2 4 2 2 2 3 8 ...
 $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
 $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
 $ BLUEBOOK   : num  14.23 14.94 4.01 15.44 18 ...
 $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
 $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 5 1 5 4 5 6 5 6 ...
 $ RED_CAR    : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
 $ OLDCLAIM   : num  4.46 0 38.69 0 19.22 ...
 $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
 $ REVOKED    : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
 $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
 $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
 $ URBANICITY : Factor w/ 2 levels "Highly Rural/ Rural",..: 2 2 2 2 2 2 2 2 2 1 ...

We ignore the field INDEX for modeling purposes, which is used only for identifying observations.

The two response variables, TARGET_FLAGand TARGET_AMT, contain binary and dollar values, respectively.

The predictors include four discrete count variables:

  • KIDSDRIV
  • HOMEKIDS
  • CLM_FREQ
  • MVR_PTS

There are five discrete time measurements:

  • AGE
  • YOJ
  • TRAVTIME
  • TIF
  • CAR_AGE

There are also seven binary, categorical features:

  • PARENT1
  • MSTATUS
  • SEX
  • CAR_USE
  • RED_CAR
  • REVOKED
  • URBANCITY

The data set includes two multinomial, categorical features:

  • JOB
  • CAR_TYPE

There is one ordinal, categorical predictor:

  • EDUCTATION

Finally, there are four predictors that express dollar amounts. These variables are effectively continuous:

  • INCOME
  • HOME_VAL
  • BLUEBOOK
  • OLDCLAIM

Now, let’s review a high level statistical summary of the variables:

     INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV           AGE       
 Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000   Min.   :16.00  
 1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
 Median : 5133   Median :0.0000   Median :     0   Median :0.0000   Median :45.00  
 Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711   Mean   :44.79  
 3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
 Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000   Max.   :81.00  
                                                                    NA's   :6      
    HOMEKIDS           YOJ           INCOME       PARENT1       HOME_VAL     MSTATUS   
 Min.   :0.0000   Min.   : 0.0   Min.   :  0.00   No :7084   Min.   :  0.0   No :3267  
 1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28.10   Yes:1077   1st Qu.:  0.0   Yes:4894  
 Median :0.0000   Median :11.0   Median : 54.03              Median :161.2             
 Mean   :0.7212   Mean   :10.5   Mean   : 61.90              Mean   :154.9             
 3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85.99              3rd Qu.:238.7             
 Max.   :5.0000   Max.   :23.0   Max.   :367.03              Max.   :885.3             
                  NA's   :454    NA's   :445                 NA's   :464               
 SEX            EDUCATION              JOB          TRAVTIME            CAR_USE    
 F:4375   High School:3533   Blue Collar :1825   Min.   :  5.00   Commercial:3029  
 M:3786   Bachelors  :2242   Clerical    :1271   1st Qu.: 22.00   Private   :5132  
          Masters    :1658   Professional:1117   Median : 33.00                    
          PhD        : 728   Manager     : 988   Mean   : 33.49                    
                             Lawyer      : 835   3rd Qu.: 44.00                    
                             Student     : 712   Max.   :142.00                    
                             (Other)     :1413                                     
    BLUEBOOK          TIF                CAR_TYPE    RED_CAR       OLDCLAIM     
 Min.   : 1.50   Min.   : 1.000   Minivan    :2145   no :5783   Min.   : 0.000  
 1st Qu.: 9.28   1st Qu.: 1.000   Panel Truck: 676   yes:2378   1st Qu.: 0.000  
 Median :14.44   Median : 4.000   Pickup     :1389              Median : 0.000  
 Mean   :15.71   Mean   : 5.351   Sports Car : 907              Mean   : 4.037  
 3rd Qu.:20.85   3rd Qu.: 7.000   SUV        :2294              3rd Qu.: 4.636  
 Max.   :69.74   Max.   :25.000   Van        : 750              Max.   :57.037  
                                                                                
    CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE                     URBANICITY  
 Min.   :0.0000   No :7161   Min.   : 0.000   Min.   :-3.000   Highly Rural/ Rural:1669  
 1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000   Highly Urban/ Urban:6492  
 Median :0.0000              Median : 1.000   Median : 8.000                             
 Mean   :0.7986              Mean   : 1.696   Mean   : 8.328                             
 3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000                             
 Max.   :5.0000              Max.   :13.000   Max.   :28.000                             
                                              NA's   :510                                

We notice a variety of missing values and strange field entries that may reflect data errors. These issues will be address in a later section.

Let’s now focus on each variable individually.


2.2 TARGET Variables

TARGET_FLAG

The response variable TARGET_FLAG has a moderate imbalance, with three-quarters of the observations indicating no crashes.

             0      1  Sum
count   6008.0 2153.0 8161
percent   73.6   26.4  100

TARGET_AMT

The other response, TARGET_AMT, exhibits extreme, positive skewness and high kurtosis.

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      StdD      Skew      Kurt 
     0.00      0.00      0.00   1504.32   1036.00 107586.14   4704.03      8.71    115.32 

We already noted that almost three quarters of records indicate no car crash. In these cases, the TARGET_AMT has a zero value.

For modeling purposes, however, we will only be interested in dollar amounts where a crash occurred. Going forward, when performing calculations and summaries involving TARGET_AMT, we will use a subset of the data where zero amounts are filtered out.

Here is a summary of the zero-truncated TARGET_AMT variable:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      StdD      Skew      Kurt 
    30.28   2609.78   4104.00   5702.18   5787.00 107586.14   7743.18      5.64     45.49 

Even after we remove the zero values, the variable remains highly skewed:


2.2.0.1 Count Variables

KIDSDRIV

The discrete variable KIDSDRIV is right skewed, with 88% of insureds in the training data having no teenage drivers in the household.

           0     1     2    3 4  Sum
count   7180 636.0 279.0 62.0 4 8161
percent   88   7.8   3.4  0.8 0  100
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    0.00    0.17    0.00    4.00    0.51    3.35   14.78 

In the barplot below, we see different a slightly different distribution of policyholders involved in crashes vis-a-vis the incident-free insureds. Specifically, we see slightly higher concentration of individuals teenage drivers. The scatter plot also indicates a relationship between relationship between number of teenage drivers and the log odds of an auto incident.

We also plot the the log TARGET_AMT–given that a crash occurred–against KIDSDRIV. Note: we applied the log transformation because TARGET_AMT is highly right skewed, and the directional relationship with KIDSDRIV should be clearer in this form.

While not entirely clear in the boxplot, there appears to be a rough, positive relationship between KIDSDRIV and the median cost of the crash. The relationship with the mean crash amount, however, is less clear.

KIDSDRIV ct mean_cost median_cost
0 1773 5659 4123
1 236 6220 3958
2 111 5542 4148
3 31 4915 4541
4 2 4054 4054

HOMEKIDS

Numerical distribution and statistical summaries are presented below:

             0     1      2     3   4    5  Sum
count   5289.0 902.0 1118.0 674.0 164 14.0 8161
percent   64.8  11.1   13.7   8.3   2  0.2  100
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    0.00    0.72    1.00    5.00    1.12    1.34    3.65 

The distribution of this discrete variable is right skewed, but not to the same extent as KIDSRIV. HOMEKIDS contains some of the same information as KIDSDRIV: presumably, some of the reported children are also drivers.

Let’s look plots relating this predictor to the target variables:

The barplot provides some evidence that policyholders with crashes tend to have more children than than insureds not involved in an auto incident. The scatterplot indicates a significant difference in log odds between policyholders with children vs. insureds without children. We discount the observation associated with five children due to the small sample size.

There appears to be a subtle relationship between HOMEKIDS and the median cost of crashes. The relationship with the mean is less clear, given the highly variable and skewed distribution of crash amounts.

HOMEKIDS ct mean_cost median_cost
0 1173 5685 4080
1 305 5522 4036
2 382 6085 4122
3 230 5432 4192
4 57 5610 4575
5 6 5009 5130

CLM_FREQ

Let’s review numerical and statistical summaries for CLM_FREQ, another discrete count variable:

             0     1      2     3     4    5  Sum
count   5009.0 997.0 1171.0 776.0 190.0 18.0 8161
percent   61.4  12.2   14.3   9.5   2.3  0.2  100
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    0.00    0.80    2.00    5.00    1.16    1.21    3.29 

This variable has a similar skew as the previously reviewed count variable, HOMEKIDS.

Based on the barplot below, there seems to be a significance difference in the distribution of prior claim frequencies for TARGET_FLAG values of zero vs. one. We see roughly 60% of auto claimants have had one or more prior claims in the past five year, while only 40% of non-claimants have had accidents. The scatter plot indicates a nonlinear albeit positive relationship between log odds of a claim and prior claim history. We also do not see a clear pattern between CLM_FREQ and the claim amounts.

CLM_FREQ ct mean_cost median_cost
0 898 5633 4160
1 385 5995 4312
2 469 5463 3858
3 314 5970 4206
4 80 5551 3393
5 7 4247 3737

MVR_PTS

Like the other count variables, MVR_PTS is positively skewed. There seems to be a positive relationship between points and log odds; however the scatter plot below indicates a strange, curved relationship between the two variables. We wonder if the curvature is related to to interaction with another predictor.

We see no straightforward relationship between TARGET_AMT and MVR_PTS.

             0      1     2     3     4     5     6   7  8    9   10   11 13  Sum
count   3712.0 1157.0 948.0 758.0 599.0 399.0 266.0 167 84 45.0 13.0 11.0  2 8161
percent   45.5   14.2  11.6   9.3   7.3   4.9   3.3   2  1  0.6  0.2  0.1  0  100
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    1.00    1.70    3.00   13.00    2.15    1.35    4.38 

BIN_MVR ct mean_cost median_cost
0 985 5379 4037
2 506 5762 4068
4 354 6110 4292
6 198 5896 4194
8 88 6809 4492
10 20 6293 3824
12 2 3786 3786

2.3 Time Variables

AGE

Below is table of values of the AGE predictor, with ages bucketed into 5 year increments (i.e. [15,20), [20,25), [30,35), etc.])

          15   20    25  30     35     40     45     50    55    60   65   70 75 80  Sum
count   14.0 58.0 249.0 649 1259.0 1673.0 1810.0 1373.0 725.0 265.0 65.0 12.0  1  2 8155
percent  0.2  0.7   3.1   8   15.4   20.5   22.2   16.8   8.9   3.2  0.8  0.1  0  0  100

Here is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
  16.00   39.00   45.00   44.79   51.00   81.00    6.00    8.63   -0.03    2.94 

We note six missing values that we’ll need to address later.

The distribution of AGE is almost perfectly normal. When we break out the data by TARGET_FLAG values, the distributions of age by TARGET_FLAG are still roughly normal. However, individuals involved in a crash appear to be slightly younger, on average.

In the bottom left scatter plot, we notice a pattern between log odds and age. Specifically, there appears to be a curved relationship where younger ages have a higher odds of a crash. The odds continue to decrease until around age 60, when costs begin to trend upward again.

TARGET_AMT appears to be slightly higher for both younger (< 25) and older (>60) drivers, but the differences appear to be subtle.

AGEBIN ct mean_cost median_cost
10 7 9650 3917
20 151 5412 4193
30 627 5601 4144
40 804 5470 3936
50 446 6280 4291
60 109 5907 4460
70 4 4284 4276

YOJ

YOB refers to the number of years in the insured’s current job.

Here is a numerical summary, binned in 2 year increments:

            0    2     4     6     8     10     12     14  16   18 22  Sum
count   631.0 51.0 129.0 473.0 905.0 1752.0 2174.0 1248.0 305 37.0  2 7707
percent   8.2  0.7   1.7   6.1  11.7   22.7   28.2   16.2   4  0.5  0  100

Below is the statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
   0.00    9.00   11.00   10.50   13.00   23.00  454.00    4.09   -1.20    4.18 

We have a significant number of NA records–454, or 5.6%–that we’ll need to address.

The variable would be approximately normally distributed if it weren’t for the high percentage of individuals with less than one year on the job.

Insureds with accidents have a relatively high proportion of individuals with less than a year on the job.

The relationship between log odds and YOB appears to be complex–see the fitted loess curve below.

The relationship between YOJ and TARGET_AMT is also not very clear.

YOJBIN ct mean_cost median_cost
0 257 4765 3736
4 134 5993 4512
8 729 5964 4026
12 811 5590 4249
16 99 6001 3969

TRAVTIME

Here is a summary of TRAVTIME, the commuting distance to work, in bins of 10 minutes:

            0     10     20     30     40    50    60   70   80   90 100 110 120 130 140  Sum
count   550.0 1036.0 1788.0 2023.0 1531.0 784.0 305.0 94.0 33.0 11.0   2   1   1   1   1 8161
percent   6.7   12.7   21.9   24.8   18.8   9.6   3.7  1.2  0.4  0.1   0   0   0   0   0  100

Here is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   5.00   22.00   33.00   33.49   44.00  142.00   15.91    0.45    3.67 

The distribution has a slight positive skew. The subset of insureds with no accidents have a higher proportion of individuals with short commute times. In the scatterplot below, we notice a generally positive relationship between log odds and TRAVETIME. Some of the curvature in the loess curve is likely driven by small sample sizes for long commute times.

Finally, we see a slight, upward curvature in the log of TARGET_AMT for long commute times. Judging from the loess curve confidence interval, this upward trend may not be statistically significant.

TRAVBIN ct mean_cost median_cost
0 214 5395 4539
15 596 5846 3998
30 772 5671 4163
45 451 5516 3964
60 104 6470 4376
75 15 6169 4579
90 1 6909 6909

TIF

Here is a distribution of time in force values in bins of 2 year increments:

           0     2      4    6     8     10  12    14    16   18   20 22 24  Sum
count   2533 430.0 1294.0 1961 285.0 1022.0 323 109.0 148.0 32.0 19.0  3  2 8161
percent   31   5.3   15.9   24   3.5   12.5   4   1.3   1.8  0.4  0.2  0  0  100

Here is the statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   1.00    1.00    4.00    5.35    7.00   25.00    4.15    0.89    3.42 

The distribution is somewhat positively skewed. We see somewhat small sub-samples for TIF values of 2-3 and 8-9. Also, the log odds vs. TIF scatterplot suggests a quadratic relationship.

Finally, there does not seem to be a significant relationship between log of TARGET_AMT and TIF.

TIFBIN ct mean_cost median_cost
0 912 5858 4102
4 823 5609 4049
8 289 5518 4220
12 91 5764 4061
16 35 5257 4868
20 3 5025 5610

CAR_AGE

The predictor CAR_AGE describes the age in years of the insured’s car. Below we bin the data in increments of three:

        -3      0     3      6    9     12    15    18   21   24 27  Sum
count    1 1949.0 494.0 1512.0 1455 1035.0 720.0 369.0 96.0 18.0  2 7651
percent  0   25.5   6.5   19.8   19   13.5   9.4   4.8  1.3  0.2  0  100

Here is the statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
  -3.00    1.00    8.00    8.33   12.00   28.00  510.00    5.70    0.28    2.25 

There is an observation that indicates a CAR_AGE value of -3. This is clearly an error. Also, there are 510 missing observations, representing 6% of total records. 1934 out of the 8161 (25%) of the records have a value of one.

Surprisingly, CAR_AGE appears to be negatively correlated with the log odds of a claim. Perhaps there is a confounding variable responsible for this result.

Finally, we expected to see a stronger relationship between CAR_AGE and the log of TARGET_AMT.
Our intuition was that payouts go down with the age of the car, as replacement value goes down with age. However, the bottom right scatter plot below does not seem to indicate a significant association.

CAR_AGE ct mean_cost median_cost
-4 1 1469 1469
0 646 5824 4107
4 392 5884 4224
8 516 5789 4088
12 261 5431 4274
16 158 5327 4076
20 34 5438 3150
24 3 4480 3949

2.3.0.1 Binary Variables

PARENT1

Here is a record summary for PARENT1, a binary variable indicating if an insured is a single parent.

            No    Yes  Sum
count   7084.0 1077.0 8161
percent   86.8   13.2  100

The vast majority (87%) of individuals in the training data are not single parents.

Let’s explore the relationship with TARGET_FLAG

       TARGET_FLAG
PARENT1    0    1  Sum
    No  5407 1677 7084
    Yes  601  476 1077
    Sum 6008 2153 8161

Now let’s review the proportions of individuals involved in a crash, given PARENT1 status:

       TARGET_FLAG
PARENT1    0    1
    No  0.76 0.24
    Yes 0.56 0.44

The is a 20% difference in the calculated proportions. This difference is statistically significant:


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 201.7, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1734351 0.2370404
sample estimates:
   prop 1    prop 2 
0.7632693 0.5580316 

Finally, lets review the relationship of PARENT1 with TARGET_AMT.

PARENT1 ct mean_cost median_cost
No 1677 5603 4101
Yes 476 6050 4133

The distribution of log amounts is pretty similar for both values of PARENT1, with perhaps a very modest increase for single parent insureds.

MSTATUS

The predictor, MSTATUS, refers to the married status of the policyholder.

          No  Yes  Sum
count   3267 4894 8161
percent   40   60  100

There is a fairly balanced split (60/40) between married and single insureds.

Let’s look at the relationship with TARGET_FLAG

       TARGET_FLAG
MSTATUS    0    1  Sum
    No  2167 1100 3267
    Yes 3841 1053 4894
    Sum 6008 2153 8161

Now let’s review the proportions of individuals involved in a crash, given MSTATUS status:

       TARGET_FLAG
MSTATUS    0    1
    No  0.66 0.34
    Yes 0.78 0.22

The 12% difference in proportions is statistically significant:


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 148.38, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1416726 -0.1014053
sample estimates:
   prop 1    prop 2 
0.6632997 0.7848386 

Now lets explore the relationship of MSTATUS with TARGET_AMT.

MSTATUS ct mean_cost median_cost
No 1100 5967 4098
Yes 1053 5426 4117

The distribution of log amounts is very similar across MSTATUS type. Non married individuals appear to have a slightly higher average log cost compared to the married cohort. However, median costs are almost identical between the two cohorts.

SEX

The variable, SEX, denotes the gender of the insured policyholder.

             F      M  Sum
count   4375.0 3786.0 8161
percent   53.6   46.4  100

The split between males and females is split almost 50/50.

Let’s review the relationship with TARGET_FLAG.

     TARGET_FLAG
SEX      0    1  Sum
  F   3183 1192 4375
  M   2825  961 3786
  Sum 6008 2153 8161

Here are the proportions of individuals involved in a crash, given gender type:

   TARGET_FLAG
SEX    0    1
  F 0.73 0.27
  M 0.75 0.25

The 2% difference in proportions is not statistically significant:


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 3.5307, df = 1, p-value = 0.06024
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0380106016  0.0007561151
sample estimates:
   prop 1    prop 2 
0.7275429 0.7461701 

Now lets explore the relationship of SEX with TARGET_AMT.

SEX ct mean_cost median_cost
F 1192 5344 4039
M 961 6147 4220

The log dollar cost of claims seems to be slightly higher for males, on average.

CAR_USE

CAR_USE is a predictor that indicates how the vehicle is uses (i.e. commercial or private purposes).

        Commercial Private  Sum
count       3029.0  5132.0 8161
percent       37.1    62.9  100

The majority of the observations involve private use vehicles, at 60%.

Let’s explore the relationship with TARGET_FLAG.

            TARGET_FLAG
CAR_USE         0    1  Sum
  Commercial 1982 1047 3029
  Private    4026 1106 5132
  Sum        6008 2153 8161

Below are the proportions of individuals involved in a crash, given the CAR_USE indicator:

            TARGET_FLAG
CAR_USE         0    1
  Commercial 0.65 0.35
  Private    0.78 0.22

There is a 13% difference in proportions between the two cohorts. This result is statistically significant.


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 165.45, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1507428 -0.1095535
sample estimates:
   prop 1    prop 2 
0.6543414 0.7844895 

Finally, lets explore the relationship of CAR_USE with TARGET_AMT.

CAR_USE ct mean_cost median_cost
Commercial 1047 6099 4193
Private 1106 5327 4037

The log dollar cost of claims involving commercial transit appears to somewhat higher than costs associated with private transportation.

RED_CAR

RED_CAR is a binary predictor indicating whether a care is primarily red in color.

            no    yes  Sum
count   5783.0 2378.0 8161
percent   70.9   29.1  100

We see only 30% of vehicles in the red category.

Let’s look at the relationship with TARGET_FLAG.

       TARGET_FLAG
RED_CAR    0    1  Sum
    no  4246 1537 5783
    yes 1762  616 2378
    Sum 6008 2153 8161

Here are the proportions of individuals involved in a crash, given the RED_CAR status:

       TARGET_FLAG
RED_CAR    0    1
    no  0.73 0.27
    yes 0.74 0.26

There is only 1% difference in proportions between red and non-red cars. The result is not statistically significant.


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 0.35996, df = 1, p-value = 0.5485
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.02800322  0.01452763
sample estimates:
   prop 1    prop 2 
0.7342210 0.7409588 

Now, we’ll explore the relationship between RED_CAR and TARGET_AMT.

RED_CAR ct mean_cost median_cost
no 1537 5568 4112
yes 616 6036 4082

The distribution for each cohort is very similar, with an subtle uptick in average log costs for red car types.

REVOKED

The variable REVOKED indicates whether a driver’s license has been suspended within the last seven years.

            No    Yes  Sum
count   7161.0 1000.0 8161
percent   87.7   12.3  100

Only 12% of drivers in the training data have a former license suspension on record.

Here is a look look at the relationship of REVOKED with TARGET_FLAG.

       TARGET_FLAG
REVOKED    0    1  Sum
    No  5451 1710 7161
    Yes  557  443 1000
    Sum 6008 2153 8161

Below are the proportions of individuals involved in a crash, given the REVOKED status:

       TARGET_FLAG
REVOKED    0    1
    No  0.76 0.24
    Yes 0.56 0.44

There is statistically significant difference in proportions by REVOKED type, with a 20% observed difference in the training data.


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 187.35, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1713042 0.2371089
sample estimates:
   prop 1    prop 2 
0.7612065 0.5570000 

Finally, let’s explore the relationship between REVOKED and TARGET_AMT.

REVOKED ct mean_cost median_cost
No 1710 5848 4059
Yes 443 5140 4254

The distribution for each cohort is very similar. In the training data, we very slight increase in average log costs for folks that have not had their license suspended. However, the median log cost seems to be slightly higher for folks with a prior suspended license.

URBANICITY

The predictor URBANICITY indicates whether the environment in which the driver primarily drives: urban vs. rural.

        Highly Rural/ Rural Highly Urban/ Urban  Sum
count                1669.0              6492.0 8161
percent                20.5                79.5  100

Only 30% of drivers in the training data are categorized as being in a rural environment. .

Let’s look at the relationship of URBANICITY with TARGET_FLAG.

                     TARGET_FLAG
URBANICITY               0    1  Sum
  Highly Rural/ Rural 1554  115 1669
  Highly Urban/ Urban 4454 2038 6492
  Sum                 6008 2153 8161

Her are the proportions of individuals involved in a crash, given the URBANICITY category :

                     TARGET_FLAG
URBANICITY               0    1
  Highly Rural/ Rural 0.93 0.07
  Highly Urban/ Urban 0.69 0.31

There is a huge difference, 24%, between the two proportions. Urban drivers appear to be significantly more at risk for being involved in a crash.


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 409.14, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 0.2280583 0.2619842
sample estimates:
   prop 1    prop 2 
0.9310965 0.6860752 

Let’s review the relationship between URBANICITY and TARGET_AMT.

URBANICITY ct mean_cost median_cost
Highly Rural/ Rural 115 5545 3589
Highly Urban/ Urban 2038 5711 4125

Urban drivers tend to have somewhat higher claim costs, given a crash. There also appears to be more variability in claim costs for urban drivers compared to their rural counterparts.

2.4 Multinomial Categorical Variables

JOB

The variable JOB indicates the insured’s job category. there is also blank category which we interpret to mean unknown or not working.

              Blue Collar Clerical Doctor Home Maker Lawyer Manager Professional Student  Sum
count   526.0      1825.0   1271.0    246      641.0  835.0   988.0       1117.0   712.0 8161
percent   6.4        22.4     15.6      3        7.9   10.2    12.1         13.7     8.7  100

Here is a breakdown of job categories by TARGET_FLAG:

              TARGET_FLAG
JOB               0    1  Sum
                390  136  526
  Blue Collar  1191  634 1825
  Clerical      900  371 1271
  Doctor        217   29  246
  Home Maker    461  180  641
  Lawyer        682  153  835
  Manager       851  137  988
  Professional  870  247 1117
  Student       446  266  712
  Sum          6008 2153 8161

Let’s calculate the proportion of observations by job category in each TARGET_FLAG indicator:

              TARGET_FLAG
JOB               0    1
               0.74 0.26
  Blue Collar  0.65 0.35
  Clerical     0.71 0.29
  Doctor       0.88 0.12
  Home Maker   0.72 0.28
  Lawyer       0.82 0.18
  Manager      0.86 0.14
  Professional 0.78 0.22
  Student      0.63 0.37

There appears to be significant variability in the probability of a claim by occupational category, with students and blue collar jobs leading the pack.

We reject the null hypothesis that all proportions are identical:


    9-sample test for equality of proportions without continuity correction

data:  tbl[1:9, 1:2]
X-squared = 261.06, df = 8, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4    prop 5    prop 6    prop 7    prop 8    prop 9 
0.7414449 0.6526027 0.7081039 0.8821138 0.7191888 0.8167665 0.8613360 0.7788720 0.6264045 

JOB ct mean_cost median_cost
136 6904 4155
Blue Collar 634 5890 4042
Clerical 371 5446 4144
Doctor 29 4896 4117
Home Maker 180 4951 3612
Lawyer 153 5991 4019
Manager 137 4944 4256
Professional 247 6560 4348
Student 266 5021 4188

There is moderate variability in log costs across job categories, with the unknown, student, and professional, and manager categories appearing to have higher median costs than than other five categories. The relationship are somewhat different though when viewing mean costs (or mean log costs) by category.

CAR_TYPE

The predictor CAR_TYPE indicates the insured’s vehicle type.

        Minivan Panel Truck Pickup Sports Car    SUV   Van  Sum
count    2145.0       676.0   1389      907.0 2294.0 750.0 8161
percent    26.3         8.3     17       11.1   28.1   9.2  100

Here is a breakdown of CAR_TYPE categories by TARGET_FLAG:

             TARGET_FLAG
CAR_TYPE         0    1  Sum
  Minivan     1796  349 2145
  Panel Truck  498  178  676
  Pickup       946  443 1389
  Sports Car   603  304  907
  SUV         1616  678 2294
  Van          549  201  750
  Sum         6008 2153 8161

We’ll now calculate the proportion of observations by CAR_TYPE category in each TARGET_FLAG indicator:

             TARGET_FLAG
CAR_TYPE         0    1
  Minivan     0.84 0.16
  Panel Truck 0.74 0.26
  Pickup      0.68 0.32
  Sports Car  0.66 0.34
  SUV         0.70 0.30
  Van         0.73 0.27

There is significant variability in the probability of a claim by occupational category, with sports cars having the highest proportion of accidents, and minivan have the lowest proportion.

We reject the null hypothesis that all proportions are identical:


    6-sample test for equality of proportions without continuity correction

data:  tbl[1:6, 1:2]
X-squared = 170.38, df = 5, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4    prop 5    prop 6 
0.8372960 0.7366864 0.6810655 0.6648291 0.7044464 0.7320000 

CAR_TYPE ct mean_cost median_cost
Minivan 349 5602 4023
Panel Truck 178 7465 4556
Pickup 443 5430 4147
Sports Car 304 5413 4023
SUV 678 5241 4032
Van 201 6909 4220

Panel trucks and vans appear to have significantly higher log claim costs compared to other car types.

2.5 Ordinal Categorical Variables

EDUCATION

The variable EDUCATION denotes the insured’s highest level of education attained.

        High School Bachelors Masters   PhD  Sum
count        3533.0    2242.0  1658.0 728.0 8161
percent        43.3      27.5    20.3   8.9  100

Below is a breakdown of EDUCATION categories by TARGET_FLAG:

             TARGET_FLAG
EDUCATION        0    1  Sum
  High School 2355 1178 3533
  Bachelors   1719  523 2242
  Masters     1331  327 1658
  PhD          603  125  728
  Sum         6008 2153 8161

Let’s review the proportion of observations by EDUCATION category in each TARGET_FLAG indicator:

             TARGET_FLAG
EDUCATION        0    1
  High School 0.67 0.33
  Bachelors   0.77 0.23
  Masters     0.80 0.20
  PhD         0.83 0.17

There are statistically significant differences between the categories, with higher levels of educational attainment associated with lower probabilities of crashes.


    4-sample test for equality of proportions without continuity correction

data:  tbl[1:4, 1:2]
X-squared = 168.58, df = 3, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.6665723 0.7667261 0.8027744 0.8282967 

EDUCATION ct mean_cost median_cost
High School 1178 5451 4079
Bachelors 523 5883 4036
Masters 327 5966 4256
PhD 125 6623 4395

Interestingly, the TARGET_AMT appears to have a positive association with the level of EDUCATION.

2.6 Continuous Variables

INCOME

Below is table of values of the INCOME predictor, with wage bucketed into $30K increments (i.e. [0,30), [30,60), [60,90), etc.])

             0     30     60    90   120   150   180  210  240  270 300 330 360  Sum
count   2059.0 2206.0 1671.0 951.0 417.0 203.0 117.0 56.0 20.0 11.0   3   1   1 7716
percent   26.7   28.6   21.7  12.3   5.4   2.6   1.5  0.7  0.3  0.1   0   0   0  100

Here is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
   0.00   28.10   54.03   61.90   85.99  367.03  445.00   47.57    1.19    5.13 

There are some missing values, 445 (5.4% of total), that we’ll address later.

The distribution of INCOME is right skewed, with a significant number of observations indicating $0 in income.

Individuals involved in crashes appear to have a high proportion of low-wage earners. We see the log-odds of a crash decreasing with increases to income–see the lower left scatter plot and loess curve. We see the loess curve starting to bend upward around $210k, although this phenomenon is possibly due to sparse data for high wage earners.

Finally, the relationship between income and log TARGET_AMT also does not appear to be very strong.

INCBIN ct mean_cost median_cost
0 695 5102 4045
30 654 5909 4299
60 404 6053 3988
90 163 6603 4254
120 66 6107 3856
150 31 5671 4738
180 10 4091 3940
210 14 6020 5402
240 1 4104 4104
270 4 3100 3007
300 1 3231 3231

HOME_VAL

Below is table of values of the HOME_VAL predictor, with home appraisals bucketed into $50K increments.

             0    50   100    150    200   250   300   350   400  450  500  550 600 650 700 750 850  Sum
count   2294.0 371.0 902.0 1265.0 1181.0 751.0 435.0 208.0 128.0 85.0 43.0 21.0 7.0   3   1   1   1 7697
percent   29.8   4.8  11.7   16.4   15.3   9.8   5.7   2.7   1.7  1.1  0.6  0.3 0.1   0   0   0   0  100

Below is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
   0.00    0.00  161.16  154.87  238.72  885.28  464.00  129.12    0.49    2.98 

There are some missing values, 464 (5.7% of total). We’ll address these later.

The distribution of HOME_VAL is right skewed, with a large proportion of observations indicating $0 in HOME_VAL–this probably refers to policyholders who are not homeowners.

Individuals involved in crashes have a higher proportion of low home values. The loess curve in the lower left scatterplot indicates a decreasing log odds of claim occurrence with increasing incomes. The upward trend in the curve around $400k is possibly due to the the smaller sample size for higher home values.

Finally, the relationship between HOME_VAL and log TARGET_AMT also does not appear to be very strong.

HOMEVALBIN ct mean_cost median_cost
0 846 5550 4006
50 111 4591 4400
100 247 5233 4185
150 328 6004 4280
200 264 6086 4015
250 115 5573 4093
300 59 6990 4021
350 24 9735 4071
400 14 4011 3868
450 14 4458 3981
500 3 5917 5455
550 3 4756 4435
600 2 3540 3540
650 1 2908 2908
750 1 3231 3231

BLUEBOOK

Below is table of values of the BLUEBOOK predictor, with values bucketed into $4K increments.

            0      4      8   12     16    20    24    28    32 36   40   44  48 56 60 64 68  Sum
count   281.0 1327.0 1503.0 1548 1238.0 903.0 628.0 390.0 190.0 83 45.0 14.0 6.0  1  2  1  1 8161
percent   3.4   16.3   18.4   19   15.2  11.1   7.7   4.8   2.3  1  0.6  0.2 0.1  0  0  0  0  100

Here is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   1.50    9.28   14.44   15.71   20.85   69.74    8.42    0.79    3.79 

The distribution of BLUEBOOK is right skewed, like the continuous variables reviewed earlier.

Individuals involved in crashes have a higher proportion of low BLUEBOOK values. The loess curve in the lower left scatterplot indicates a decreasing log odds of claim occurrence with increasing incomes. However, there is a prominent bend in the curve around $30k. Perhaps this upward bend indicates risky driving behavior by owners of luxury and sports cars.

Finally, the relationship between BLUEBOOK and log TARGET_AMT reveals an increase in payouts with the value of the auto, but the curve flattens out around $20k - $30k.

BLUEBKBIN ct mean_cost median_cost
0 171 3680 3423
5 614 5131 4196
10 504 5463 4051
15 373 6148 4164
20 267 7177 4256
25 123 6224 4369
30 57 6180 4321
35 26 11346 5308
40 13 6798 3041
45 2 3846 3846
50 1 18084 18084
55 1 3231 3231
60 1 4104 4104

OLDCLAIM

Below is table of values of the OLDCLAIM variable, with values bucketed into $4K increments.

           0  Sum
count   8161 8161
percent  100  100

Below is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    0.00    4.04    4.64   57.04    8.78    3.12   12.86 

The distribution of OLDCLAIM is extremely right skewed.

There does not appear to be a clear relationship between OLDCLAIM and log of TARGET_CLM. High previous claim amounts seem to be positively associated with the log odds of a future claim. However, this relationship appears to take a quadratic form.

OLD_BIN ct mean_cost median_cost
0 1395 5814 4148
5 452 5409 4025
10 62 6347 4186
15 28 4712 3318
20 33 4865 3764
25 49 5060 4460
30 53 4898 4262
35 33 6875 3779
40 29 6922 4477
45 16 5350 3299
50 2 3070 3070
55 1 7803 7803

3 Data Preparation

3.1 Missing Values

The following five variables have missing variables:

  • AGE: 6 missing values (0.07%)
  • YOJ: 454 missing values (5.6%)
  • CAR_AGE: 510 (6.2%)
  • INCOME: 445 (5.4%)
  • HOME_VAL: 464 (5.7%)

Because four of the predictors are missing a significant number of records (5%-6%), we will attempt to use a sophisticated imputation procedure from R’s MICE package to fill in the missing values. Our goal in using this procedure is to minimize the introduction of bias in our data vis-a-vis simpler methods like mean substitution.

One way to assess the quality of the imputation procedure used in MICE is to compare the distribution of the imputed data to the distribution of the non-missing values. Let’s do that now by reviewing density plots:
With the exception of the AGE variable, the imputed variables seem to reasonably approximate the original distribution.

we’re only missing a handful of values for AGE; so we will apply simply mean imputation for this variable.

Finally, we’ll assume the -3 value for CAR_AGE is actually zero.

4 Data Transformation

TARGET_FLAG
No changes.

TARGET_AMT

The response variable TARGET_AMT is highly right skewed. We will use the box-cox procedure to suggest a reasonable transformation when the TARGET_AMT variable is positive:

Fitted parameters:
     lambda        beta     sigmasq 
0.003172095 8.386645882 0.695511168 

Convergence code returned by optim: 0

Given this output, We will apply the log transformation.

KIDSDRIV

Given the possible curvature between the log odds and KIDSDRIV, we’re going to introduce a centered version of this variable to reduce possible collinearity issues.

HOMEKIDS

We will create another centered version of this variable due to a possible quadratic relationship with the response variable, TARGET_FLAG. Again, our intent with this transformation is avoid multicollinearity issues.

CLM_FREQ

Once again, we’ll center the variable due concerns described earlier.

MVR_PTS

No changes.

AGE

Age appears to have a quadratic relationship with both TARGET_FLAG and TARGET_AMT. We’ll center the variable.

YOJ

The variable YOJ has a significant number of zero observations. We’ll create a mean centered variable, YOJ_MOD, that will be used for years on job of 1 or higher.

TRAVTIME
Well apply the mean-center transformation.

TIF

We’ll mean-center the variable to reduce multicollinearity issues if we implement a quadratic term.

CAR_AGE

No changes.

PARENT1

No changes.

MSTATUS

No changes.

SEX

No changes.

CARUSE

No changes.

RED_CAR

No changes.

REVOKED

No changes.

URBANICITY

No changes.

JOB

No changes.

CAR_TYPE

No changes.

EDUCATION

No changes.

INCOME

Income is a positively skewed variable with a significant number zeroes.

We will apply the square root transformation suggested by the box-cox procedure to the original variable to reduce the overall skew.

Fitted parameters:
    lambda       beta    sigmasq 
 0.4298838 11.0283525 17.8086126 

Convergence code returned by optim: 0

HOME_VAL

Home values are also moderately right skewed with a significant number of zeroes.
We’ll apply a quarter root transformation to the original variable to reduce the overall skew.

Fitted parameters:
   lambda      beta   sigmasq 
0.1951741 9.3491718 1.5206145 

Convergence code returned by optim: 0

BLUEBOOK

The BLUEBOOK variable has a moderate right skew. We’ll apply the square root transformation suggested by the box-cox procedure.

Fitted parameters:
   lambda      beta   sigmasq 
0.4610754 5.2624962 3.7967126 

Convergence code returned by optim: 0

OLDCLAIM

OLDCLAIM is extremely right skewed. We’ll apply a log(x+1) transformation to reduce the overall skew.

Fitted parameters:
     lambda        beta     sigmasq 
-0.04511237  1.76185840  0.82136055 

Convergence code returned by optim: 0

5 Build Models


5.0.0.1 Binary Logistic Regression

Model 1: Manual Variable Selection, Linear Terms Only

Based on our data exploration, we believe the following variables could be relevant in predicted whether or not a claim occurs:

  • KIDSDRIV
  • HOMEKIDS
  • CLM_FREQ
  • MVR_PTS
  • AGE
  • YOJ
  • TRAVTIME
  • TIF
  • CAR_AGE
  • PARENT1
  • MSTATUS
  • CAR_USE
  • REVOKED
  • URBANICITY
  • JOB
  • CAR_TYPE
  • EDUCATION
  • INCOME
  • HOMEVAL
  • BLUEBOOK
  • OLDCLAIM

Granted, we haven’t narrowed down the list much.

Before moving forward, let’s calculate variance inflation factors, including all potential predictors in our model.

                   GVIF Df GVIF^(1/(2*Df))
PARENT1        1.927927  1        1.388498
MSTATUS        2.258214  1        1.502735
EDUCATION      9.204331  3        1.447656
JOB           31.325910  8        1.240206
CAR_USE        2.227957  1        1.492634
CAR_TYPE       2.443471  5        1.093454
REVOKED        1.148154  1        1.071520
MVR_PTS        1.194820  1        1.093078
CAR_AGE        2.123064  1        1.457074
URBANICITY     1.147162  1        1.071057
KIDSDRIV_MOD   1.347203  1        1.160691
HOMEKIDS_MOD   2.183075  1        1.477523
CLM_FREQ_MOD   2.350866  1        1.533253
AGE_MOD        1.455622  1        1.206491
YOJ_MOD        1.608639  1        1.268321
TRAVTIME_MOD   1.039894  1        1.019752
TIF_MOD        1.009853  1        1.004915
INCOME_MOD     3.468635  1        1.862427
HOME_VAL_MOD   1.926043  1        1.387819
BLUEBOOK_MOD   1.689897  1        1.299960
OLD_CLAIM_MOD  2.534623  1        1.592050

Seeing no major VIF issues–once accounting for degrees of freedom–we will proceed with the model with all suggested predictors.


Call:
glm(formula = TARGET_FLAG ~ . - TARGET_AMT - TARGET_AMT_MOD - 
    INDEX - KIDSDRIV - HOMEKIDS - CLM_FREQ - AGE - YOJ - TRAVTIME - 
    TIF - SEX - RED_CAR - INCOME - HOME_VAL - BLUEBOOK - OLDCLAIM - 
    INCOME_30 - HOME_VAL_30 - BLUEBOOK_2 - OLDCLAIM_4, family = "binomial", 
    data = auto)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6137  -0.7132  -0.3982   0.6166   3.1592  

Coefficients:
                               Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.872686   0.282675  -6.625  0.00000000003475425 ***
PARENT1Yes                     0.370690   0.109692   3.379             0.000727 ***
MSTATUSYes                    -0.474052   0.087739  -5.403  0.00000006554066899 ***
EDUCATION.L                   -0.078343   0.149372  -0.524             0.599945    
EDUCATION.Q                    0.233170   0.086835   2.685             0.007249 ** 
EDUCATION.C                   -0.089875   0.081919  -1.097             0.272589    
JOBBlue Collar                 0.308221   0.184894   1.667             0.095512 .  
JOBClerical                    0.347545   0.197037   1.764             0.077756 .  
JOBDoctor                     -0.413664   0.266212  -1.554             0.120211    
JOBHome Maker                 -0.053576   0.220427  -0.243             0.807961    
JOBLawyer                      0.127853   0.168679   0.758             0.448470    
JOBManager                    -0.545841   0.170920  -3.194             0.001405 ** 
JOBProfessional                0.177985   0.177883   1.001             0.317032    
JOBStudent                    -0.170969   0.226935  -0.753             0.451218    
CAR_USEPrivate                -0.779822   0.087565  -8.906 < 0.0000000000000002 ***
CAR_TYPEPanel Truck            0.583726   0.146384   3.988  0.00006673268440400 ***
CAR_TYPEPickup                 0.550812   0.100174   5.499  0.00000003829305886 ***
CAR_TYPESports Car             0.948214   0.108249   8.760 < 0.0000000000000002 ***
CAR_TYPESUV                    0.718723   0.086067   8.351 < 0.0000000000000002 ***
CAR_TYPEVan                    0.657014   0.121696   5.399  0.00000006707980041 ***
REVOKEDYes                     0.748190   0.085838   8.716 < 0.0000000000000002 ***
MVR_PTS                        0.110159   0.013843   7.958  0.00000000000000175 ***
CAR_AGE                       -0.001176   0.007502  -0.157             0.875411    
URBANICITYHighly Urban/ Urban  2.411637   0.113224  21.300 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.392587   0.061248   6.410  0.00000000014567496 ***
HOMEKIDS_MOD                   0.044878   0.037243   1.205             0.228204    
CLM_FREQ_MOD                   0.170681   0.036265   4.707  0.00000251988421409 ***
AGE_MOD                       -0.001631   0.004007  -0.407             0.684036    
YOJ_MOD                       -0.001673   0.008759  -0.191             0.848497    
TRAVTIME_MOD                   0.014751   0.001886   7.819  0.00000000000000532 ***
TIF_MOD                       -0.055559   0.007352  -7.557  0.00000000000004129 ***
INCOME_MOD                    -0.095766   0.016269  -5.886  0.00000000394563137 ***
HOME_VAL_MOD                  -0.090138   0.022401  -4.024  0.00005726974387333 ***
BLUEBOOK_MOD                  -0.177734   0.035205  -5.049  0.00000044508752803 ***
OLD_CLAIM_MOD                 -0.030521   0.039102  -0.781             0.435073    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7281.9  on 8126  degrees of freedom
AIC: 7351.9

Number of Fisher Scoring iterations: 5

The signs of the coefficients mostly make sense:

  • We expect claim probabilities to be higher for single parents.
  • Married individuals should be less prone to accidents compared to singles.
  • The signs of all the EDUCATION levels make sense; however, we expected the Master’s and PhD levels to show a greater reduction in log odds relative to the high school reference level. We are not too concerned with this result though, as other variables such as JOB_TYPE and INCOME help in defining an individual’s composite risk profile.
  • We didn’t have solid a priori expectations about the relationship between different job type. We are not surprised with most of these result though. For instance, we are not surprised by the increased risk associated with blue collar workers relative to doctors.
  • The model is consistent with our expectation that private vehicles are less accident prone compared to commercial vehicles.
  • The car type signs and magnitudes are fairly consistent with our expectations: Minivans(the reference level) should be safer than the other vehicles. Sports cars should be most likely to be involved in an accident. This is what we see.
  • A revoked license is highly indicative of future accident risk.
  • MVR points are positively associated with accident risk.
  • The model shows a slight negative relationship between car age and log odds of an accident, but the result is not statistically significant.
  • Our model shows much greater risk in urban areas compared to rural geographies. This is consistent with our expectation.
  • The number of teenage drivers impacts risk unfavorably, as our model shows.
  • Our model indicates that more children can adversely impact claims risk. We didn’t have firm expectations about this variable’s influence. More important, our model does not indicate a statistically significant result.
  • Our model reveals claim risk declining with age. This is not necessarily surprising; however, the model coefficient is not statistically significant.
  • Our model shows risk increasing slightly with increases to YOB. This is maybe a strange result, but our model indicates a very high p-value for our coefficient.
  • We expect risk to increase with longer travel times, as our model shows.
  • We expect loyal policyholders to be less risky than frequently churning policyholders, as our model indicates.
  • Our model indicates decreasing log odds with increases to home value. This is what we would expect.
  • We are not surprised to see cars with high BLUEBOOK values being associated with reduced risk in our model compared to lower values.
  • The coefficient for prior claims cost is opposite of what we would expect. However, the coefficient has a high associated p-value; so we may drop this predictor from our model.

Let’s clean up our model by removing some of the statistically insignificant predictors. We will leave all JOB levels in our model, as the “Manager” and “Clerical” coefficients are highly significant, and two additional levels have p-values below 10%.


Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB + 
    CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + URBANICITY + KIDSDRIV_MOD + 
    CLM_FREQ_MOD + TRAVTIME_MOD + TIF_MOD + INCOME_MOD + HOME_VAL_MOD + 
    BLUEBOOK_MOD, family = "binomial", data = auto)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6328  -0.7136  -0.3986   0.6135   3.1439  

Coefficients:
                               Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.918503   0.267796  -7.164  0.00000000000078332 ***
PARENT1Yes                     0.455054   0.094479   4.816  0.00000146139638217 ***
MSTATUSYes                    -0.442419   0.084012  -5.266  0.00000013933679907 ***
EDUCATION.L                   -0.093724   0.137261  -0.683              0.49472    
EDUCATION.Q                    0.236564   0.084790   2.790              0.00527 ** 
EDUCATION.C                   -0.085934   0.081180  -1.059              0.28980    
JOBBlue Collar                 0.310997   0.184778   1.683              0.09236 .  
JOBClerical                    0.355083   0.196670   1.805              0.07100 .  
JOBDoctor                     -0.421711   0.265747  -1.587              0.11254    
JOBHome Maker                 -0.045799   0.218566  -0.210              0.83402    
JOBLawyer                      0.119301   0.168323   0.709              0.47847    
JOBManager                    -0.555839   0.170742  -3.255              0.00113 ** 
JOBProfessional                0.172429   0.177740   0.970              0.33199    
JOBStudent                    -0.146979   0.225655  -0.651              0.51482    
CAR_USEPrivate                -0.777122   0.087414  -8.890 < 0.0000000000000002 ***
CAR_TYPEPanel Truck            0.588330   0.146142   4.026  0.00005679472185962 ***
CAR_TYPEPickup                 0.550137   0.100124   5.495  0.00000003917260431 ***
CAR_TYPESports Car             0.945631   0.107956   8.759 < 0.0000000000000002 ***
CAR_TYPESUV                    0.719308   0.085983   8.366 < 0.0000000000000002 ***
CAR_TYPEVan                    0.658243   0.121557   5.415  0.00000006125629378 ***
REVOKEDYes                     0.728369   0.080414   9.058 < 0.0000000000000002 ***
MVR_PTS                        0.108765   0.013572   8.014  0.00000000000000111 ***
URBANICITYHighly Urban/ Urban  2.409081   0.113106  21.299 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.424693   0.055148   7.701  0.00000000000001351 ***
CLM_FREQ_MOD                   0.150238   0.025536   5.883  0.00000000402164286 ***
TRAVTIME_MOD                   0.014697   0.001885   7.796  0.00000000000000637 ***
TIF_MOD                       -0.055311   0.007347  -7.528  0.00000000000005149 ***
INCOME_MOD                    -0.095829   0.015629  -6.131  0.00000000087089256 ***
HOME_VAL_MOD                  -0.091463   0.022383  -4.086  0.00004385319335945 ***
BLUEBOOK_MOD                  -0.181410   0.035000  -5.183  0.00000021811943514 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7284.9  on 8131  degrees of freedom
AIC: 7344.9

Number of Fisher Scoring iterations: 5

All predictors are now statistically significant, with the exception of several of the job types. The coefficients in our model also still make sense.

Model 2: Add Quadratic Terms to Model 1

We noted earlier that there appeared to be quadratic relationship between some predictors and log odds.

Starting with our original Model 1–i.e. before we removed the insignificant predictors–we’ll add second order polynomial terms for the following variables:

  • KIDSDRIV
  • HOMEKIDS
  • CLM_FREQ
  • AGE
  • YOJ
  • TIF

Here is the summary output from model 2:


Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB + 
    CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + CAR_AGE + URBANICITY + 
    KIDSDRIV_MOD + HOMEKIDS_MOD + CLM_FREQ_MOD + AGE_MOD + YOJ_MOD + 
    TRAVTIME_MOD + TIF_MOD + INCOME_MOD + HOME_VAL_MOD + BLUEBOOK_MOD + 
    OLD_CLAIM_MOD + I(KIDSDRIV_MOD^2) + I(HOMEKIDS_MOD^2) + I(CLM_FREQ_MOD^2) + 
    I(AGE_MOD^2) + I(YOJ_MOD^2) + I(TIF_MOD^2), family = "binomial", 
    data = auto)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4212  -0.7079  -0.3896   0.5997   3.0494  

Coefficients:
                                Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.8549782  0.3023391  -6.135 0.000000000849329604 ***
PARENT1Yes                     0.2287145  0.1190337   1.921              0.05468 .  
MSTATUSYes                    -0.5472126  0.0907131  -6.032 0.000000001615991315 ***
EDUCATION.L                   -0.1110321  0.1507177  -0.737              0.46131    
EDUCATION.Q                    0.2302008  0.0871779   2.641              0.00828 ** 
EDUCATION.C                   -0.1063223  0.0824316  -1.290              0.19711    
JOBBlue Collar                 0.3377066  0.1857011   1.819              0.06898 .  
JOBClerical                    0.3711729  0.1980087   1.875              0.06086 .  
JOBDoctor                     -0.4381672  0.2666846  -1.643              0.10038    
JOBHome Maker                 -0.0171249  0.2232588  -0.077              0.93886    
JOBLawyer                      0.1293698  0.1693179   0.764              0.44483    
JOBManager                    -0.5408136  0.1711226  -3.160              0.00158 ** 
JOBProfessional                0.1900593  0.1784593   1.065              0.28688    
JOBStudent                    -0.1735835  0.2291796  -0.757              0.44880    
CAR_USEPrivate                -0.7834831  0.0881933  -8.884 < 0.0000000000000002 ***
CAR_TYPEPanel Truck            0.6238633  0.1471955   4.238 0.000022518613851542 ***
CAR_TYPEPickup                 0.5677139  0.1009806   5.622 0.000000018874685247 ***
CAR_TYPESports Car             0.8853900  0.1094746   8.088 0.000000000000000608 ***
CAR_TYPESUV                    0.7106989  0.0868341   8.185 0.000000000000000273 ***
CAR_TYPEVan                    0.6855025  0.1226882   5.587 0.000000023055663145 ***
REVOKEDYes                     0.8259645  0.0887815   9.303 < 0.0000000000000002 ***
MVR_PTS                        0.0971609  0.0141218   6.880 0.000000000005976967 ***
CAR_AGE                       -0.0012267  0.0075548  -0.162              0.87102    
URBANICITYHighly Urban/ Urban  2.3933908  0.1135895  21.071 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.7280955  0.1425080   5.109 0.000000323602039340 ***
HOMEKIDS_MOD                   0.0617138  0.0685642   0.900              0.36807    
CLM_FREQ_MOD                   0.3677199  0.0737840   4.984 0.000000623693042781 ***
AGE_MOD                       -0.0020003  0.0041442  -0.483              0.62932    
YOJ_MOD                       -0.0006137  0.0110557  -0.056              0.95573    
TRAVTIME_MOD                   0.0148435  0.0018995   7.815 0.000000000000005515 ***
TIF_MOD                       -0.0666080  0.0086300  -7.718 0.000000000000011795 ***
INCOME_MOD                    -0.0829304  0.0168293  -4.928 0.000000831917394078 ***
HOME_VAL_MOD                  -0.0887297  0.0225581  -3.933 0.000083759903483572 ***
BLUEBOOK_MOD                  -0.1860283  0.0354683  -5.245 0.000000156350347101 ***
OLD_CLAIM_MOD                 -0.1226506  0.0492985  -2.488              0.01285 *  
I(KIDSDRIV_MOD^2)             -0.1387882  0.0690188  -2.011              0.04434 *  
I(HOMEKIDS_MOD^2)             -0.0328246  0.0281772  -1.165              0.24405    
I(CLM_FREQ_MOD^2)             -0.0875831  0.0282607  -3.099              0.00194 ** 
I(AGE_MOD^2)                   0.0022268  0.0002811   7.922 0.000000000000002339 ***
I(YOJ_MOD^2)                   0.0012227  0.0016034   0.763              0.44570    
I(TIF_MOD^2)                   0.0031365  0.0013927   2.252              0.02431 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7198.7  on 8120  degrees of freedom
AIC: 7280.7

Number of Fisher Scoring iterations: 5

Below our key results from this second model:

  • The first and second order terms for KIDSDRIV are both statistically significant. The negative sign of the second order term makes sense: the unfavorable impact of adding additional children drivers diminishes with each subsequent child.
  • Neither the first nor second order terms are significant for HOMEKIDS.
  • Both CLM_FREQ terms are significant. The negative second order term indicates diminishing impact of each additional prior claim.
  • The second order term of AGE is statistically significant, but the first first order term is not. We’ll leave both terms in our model. The coefficients of the terms make sense. There is a primary trend of reduced risk with age–see the negative first order coefficient, but the trend diminishes and potentially reverses for higher ages–as reflected in the positive second order term.
  • Neither YOJ terms are statistically significant.
  • Both first and second order TIF terms are significant. The signs of the coefficients have an intuitive explanation: there is a primary effect of risk reduction in risk with increases to TIF but the favorable impact diminishes with higher TIF values.
  • CAR_AGE is still insignificant in this model.

Based on the results above, we’ll remove all HOMEKIDS and YOJ terms from our model. Here are the summary results from this modified, second model:


Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB + 
    CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + URBANICITY + KIDSDRIV_MOD + 
    CLM_FREQ_MOD + AGE_MOD + TRAVTIME_MOD + TIF_MOD + INCOME_MOD + 
    HOME_VAL_MOD + BLUEBOOK_MOD + OLD_CLAIM_MOD + I(KIDSDRIV_MOD^2) + 
    I(CLM_FREQ_MOD^2) + I(AGE_MOD^2) + I(TIF_MOD^2), family = "binomial", 
    data = auto)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4188  -0.7079  -0.3899   0.6017   3.0411  

Coefficients:
                                Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.8566640  0.2816370  -6.592 0.000000000043277015 ***
PARENT1Yes                     0.2739046  0.1017449   2.692              0.00710 ** 
MSTATUSYes                    -0.5275390  0.0855811  -6.164 0.000000000708416117 ***
EDUCATION.L                   -0.1089269  0.1383381  -0.787              0.43105    
EDUCATION.Q                    0.2326644  0.0851694   2.732              0.00630 ** 
EDUCATION.C                   -0.1037084  0.0816550  -1.270              0.20406    
JOBBlue Collar                 0.3308613  0.1855816   1.783              0.07461 .  
JOBClerical                    0.3577260  0.1975653   1.811              0.07019 .  
JOBDoctor                     -0.4363115  0.2668807  -1.635              0.10208    
JOBHome Maker                  0.0138339  0.2199564   0.063              0.94985    
JOBLawyer                      0.1258867  0.1693304   0.743              0.45722    
JOBManager                    -0.5435257  0.1710934  -3.177              0.00149 ** 
JOBProfessional                0.1872710  0.1783856   1.050              0.29381    
JOBStudent                    -0.1531559  0.2268795  -0.675              0.49964    
CAR_USEPrivate                -0.7861317  0.0880463  -8.929 < 0.0000000000000002 ***
CAR_TYPEPanel Truck            0.6224359  0.1471477   4.230 0.000023368219454604 ***
CAR_TYPEPickup                 0.5651171  0.1009060   5.600 0.000000021382163925 ***
CAR_TYPESports Car             0.8883475  0.1093351   8.125 0.000000000000000447 ***
CAR_TYPESUV                    0.7116850  0.0867565   8.203 0.000000000000000234 ***
CAR_TYPEVan                    0.6840691  0.1226023   5.580 0.000000024110109095 ***
REVOKEDYes                     0.8231669  0.0887081   9.280 < 0.0000000000000002 ***
MVR_PTS                        0.0980783  0.0141003   6.956 0.000000000003506304 ***
URBANICITYHighly Urban/ Urban  2.3910929  0.1135255  21.062 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.7850828  0.1289262   6.089 0.000000001133357267 ***
CLM_FREQ_MOD                   0.3652676  0.0737359   4.954 0.000000728054720019 ***
AGE_MOD                       -0.0025611  0.0036151  -0.708              0.47867    
TRAVTIME_MOD                   0.0148180  0.0018989   7.803 0.000000000000006031 ***
TIF_MOD                       -0.0667534  0.0086268  -7.738 0.000000000000010109 ***
INCOME_MOD                    -0.0886512  0.0157886  -5.615 0.000000019668539231 ***
HOME_VAL_MOD                  -0.0886985  0.0225545  -3.933 0.000084020987281530 ***
BLUEBOOK_MOD                  -0.1860655  0.0354614  -5.247 0.000000154612137297 ***
OLD_CLAIM_MOD                 -0.1214653  0.0492809  -2.465              0.01371 *  
I(KIDSDRIV_MOD^2)             -0.1637618  0.0651787  -2.513              0.01199 *  
I(CLM_FREQ_MOD^2)             -0.0865030  0.0282256  -3.065              0.00218 ** 
I(AGE_MOD^2)                   0.0022487  0.0002768   8.124 0.000000000000000452 ***
I(TIF_MOD^2)                   0.0031484  0.0013917   2.262              0.02368 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7201.1  on 8125  degrees of freedom
AIC: 7273.1

Number of Fisher Scoring iterations: 5

Model 3: Stepwise Regression

For our third model, we’ll implement stepwise regression, with variable selecting occurring in both directions. We’ll include all predictors (transformed versions where applicable) in our potential universe of candidates.

For simplicity, we’ll only include first order terms.


Call:
glm(formula = TARGET_FLAG ~ URBANICITY + JOB + MVR_PTS + MSTATUS + 
    CAR_TYPE + REVOKED + KIDSDRIV_MOD + INCOME_MOD + CAR_USE + 
    TIF_MOD + TRAVTIME_MOD + CLM_FREQ_MOD + BLUEBOOK_MOD + PARENT1 + 
    HOME_VAL_MOD + EDUCATION + HOMEKIDS_MOD, family = "binomial", 
    data = auto_redux)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6158  -0.7141  -0.3980   0.6181   3.1513  

Coefficients:
                               Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.887080   0.268674  -7.024  0.00000000000216101 ***
URBANICITYHighly Urban/ Urban  2.408584   0.113076  21.301 < 0.0000000000000002 ***
JOBBlue Collar                 0.309851   0.184810   1.677              0.09362 .  
JOBClerical                    0.347849   0.196756   1.768              0.07707 .  
JOBDoctor                     -0.414469   0.265832  -1.559              0.11896    
JOBHome Maker                 -0.049057   0.218639  -0.224              0.82246    
JOBLawyer                      0.125372   0.168406   0.744              0.45660    
JOBManager                    -0.548824   0.170794  -3.213              0.00131 ** 
JOBProfessional                0.178743   0.177815   1.005              0.31479    
JOBStudent                    -0.164947   0.226014  -0.730              0.46551    
MVR_PTS                        0.108414   0.013576   7.986  0.00000000000000140 ***
MSTATUSYes                    -0.476489   0.087310  -5.457  0.00000004830512386 ***
CAR_TYPEPanel Truck            0.588390   0.146226   4.024  0.00005725500332016 ***
CAR_TYPEPickup                 0.551882   0.100139   5.511  0.00000003564383764 ***
CAR_TYPESports Car             0.944740   0.107966   8.750 < 0.0000000000000002 ***
CAR_TYPESUV                    0.717233   0.086011   8.339 < 0.0000000000000002 ***
CAR_TYPEVan                    0.658849   0.121591   5.419  0.00000006007329279 ***
REVOKEDYes                     0.724890   0.080457   9.010 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.389384   0.060210   6.467  0.00000000009988987 ***
INCOME_MOD                    -0.096508   0.015642  -6.170  0.00000000068430015 ***
CAR_USEPrivate                -0.779104   0.087448  -8.909 < 0.0000000000000002 ***
TIF_MOD                       -0.055518   0.007349  -7.554  0.00000000000004217 ***
TRAVTIME_MOD                   0.014755   0.001885   7.826  0.00000000000000505 ***
CLM_FREQ_MOD                   0.150203   0.025546   5.880  0.00000000410930376 ***
BLUEBOOK_MOD                  -0.179788   0.035019  -5.134  0.00000028357055432 ***
PARENT1Yes                     0.375928   0.109102   3.446              0.00057 ***
HOME_VAL_MOD                  -0.090524   0.022374  -4.046  0.00005210493016275 ***
EDUCATION.L                   -0.086834   0.137351  -0.632              0.52725    
EDUCATION.Q                    0.235675   0.084809   2.779              0.00545 ** 
EDUCATION.C                   -0.087132   0.081189  -1.073              0.28318    
HOMEKIDS_MOD                   0.049238   0.034007   1.448              0.14765    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7282.8  on 8130  degrees of freedom
AIC: 7344.8

Number of Fisher Scoring iterations: 5
 [1] "PARENT1"       "MSTATUS"       "SEX"           "EDUCATION"     "JOB"           "CAR_USE"      
 [7] "CAR_TYPE"      "RED_CAR"       "REVOKED"       "MVR_PTS"       "CAR_AGE"       "URBANICITY"   
[13] "KIDSDRIV_MOD"  "HOMEKIDS_MOD"  "CLM_FREQ_MOD"  "AGE_MOD"       "YOJ_MOD"       "TRAVTIME_MOD" 
[19] "TIF_MOD"       "INCOME_MOD"    "HOME_VAL_MOD"  "BLUEBOOK_MOD"  "OLD_CLAIM_MOD" "TARGET_FLAG"  

Surprisingly, the stepwise procedure produced a model that is identical to our modified Model 1!

5.0.0.2 Multiple Linear Regression

Now We’ll model claim costs using the subset of the training data where claim costs were greater than one.

Before we build our models, let’s check for multicollinearity uses by reviewing variance inflation factors for a linear model that includes all predictors.

                   GVIF Df GVIF^(1/(2*Df))
PARENT1        2.163475  1        1.470876
MSTATUS        2.442422  1        1.562825
SEX            3.756936  1        1.938282
EDUCATION      9.707532  3        1.460556
JOB           40.971902  8        1.261190
CAR_USE        2.259463  1        1.503151
CAR_TYPE       6.325230  5        1.202562
RED_CAR        1.832475  1        1.353689
REVOKED        1.266601  1        1.125433
MVR_PTS        1.158984  1        1.076561
CAR_AGE        2.130468  1        1.459612
URBANICITY     1.052960  1        1.026139
KIDSDRIV_MOD   1.434861  1        1.197857
HOMEKIDS_MOD   2.245965  1        1.498654
CLM_FREQ_MOD   2.232403  1        1.494123
AGE_MOD        1.505499  1        1.226988
YOJ_MOD        1.925940  1        1.387782
TRAVTIME_MOD   1.030401  1        1.015087
TIF_MOD        1.018817  1        1.009364
INCOME_MOD     4.310419  1        2.076155
HOME_VAL_MOD   1.991408  1        1.411173
BLUEBOOK_MOD   2.083098  1        1.443294
OLD_CLAIM_MOD  2.486871  1        1.576982

There does not appear to be an issue with mulicollinearity.

Model 4: Manual Variable Selection, Linear Terms Only

Choosing relevant predictors manually is a challenging exercise as most predictors seemed to have only a subtle influence–if any–on claim costs.

Based on our exploratory work, we believe the following variables may be relevant:

  • AGE
  • KIDSDRIV
  • HOMEKIDS
  • TRAVTIME
  • SEX
  • CAR_USE
  • RED_CAR
  • UBANICITY
  • JOB
  • CARTYPE
  • EDUCATION
  • BLUEBOOK

Let’s look at a preliminary model using all 12 of our proposed predictors:


Call:
lm(formula = TARGET_AMT_MOD ~ KIDSDRIV_MOD + HOMEKIDS_MOD + AGE_MOD + 
    TRAVTIME_MOD + SEX + CAR_USE + RED_CAR + URBANICITY + JOB + 
    CAR_TYPE + EDUCATION + BLUEBOOK_MOD, data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7973 -0.4027  0.0404  0.4037  3.2710 

Coefficients:
                                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)                    7.7902933  0.1678675  46.407 < 0.0000000000000002 ***
KIDSDRIV_MOD                  -0.0361875  0.0331524  -1.092                0.275    
HOMEKIDS_MOD                   0.0205244  0.0190398   1.078                0.281    
AGE_MOD                        0.0004682  0.0021647   0.216                0.829    
TRAVTIME_MOD                  -0.0005157  0.0011623  -0.444                0.657    
SEXM                           0.0939910  0.0678247   1.386                0.166    
CAR_USEPrivate                -0.0042709  0.0519136  -0.082                0.934    
RED_CARyes                     0.0270567  0.0521811   0.519                0.604    
URBANICITYHighly Urban/ Urban  0.0274888  0.0789854   0.348                0.728    
JOBBlue Collar                 0.0579560  0.1194463   0.485                0.628    
JOBClerical                    0.0770022  0.1242427   0.620                0.535    
JOBDoctor                     -0.0423386  0.1844218  -0.230                0.818    
JOBHome Maker                  0.0186071  0.1219750   0.153                0.879    
JOBLawyer                     -0.0040646  0.1076214  -0.038                0.970    
JOBManager                     0.0143688  0.1117657   0.129                0.898    
JOBProfessional                0.0966171  0.1180370   0.819                0.413    
JOBStudent                     0.0708606  0.1262229   0.561                0.575    
CAR_TYPEPanel Truck           -0.0023405  0.0960163  -0.024                0.981    
CAR_TYPEPickup                 0.0298623  0.0622814   0.479                0.632    
CAR_TYPESports Car             0.0719608  0.0781399   0.921                0.357    
CAR_TYPESUV                    0.0919203  0.0689452   1.333                0.183    
CAR_TYPEVan                   -0.0278677  0.0804508  -0.346                0.729    
EDUCATION.L                    0.1274041  0.0887219   1.436                0.151    
EDUCATION.Q                    0.0541159  0.0539471   1.003                0.316    
EDUCATION.C                   -0.0631616  0.0541032  -1.167                0.243    
BLUEBOOK_MOD                   0.0978773  0.0227092   4.310            0.0000171 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8092 on 2127 degrees of freedom
Multiple R-squared:  0.0198,    Adjusted R-squared:  0.008278 
F-statistic: 1.718 on 25 and 2127 DF,  p-value: 0.01486

Only one of our predictors, BLUEBOOK appears to be significant in our model.

Here is our modified model, with BLUEBOOK as the sole predictor:


Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD, data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7842 -0.3929  0.0404  0.3952  3.2528 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   7.95359    0.06005 132.445 < 0.0000000000000002 ***
BLUEBOOK_MOD  0.08921    0.01590   5.609          0.000000023 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8069 on 2151 degrees of freedom
Multiple R-squared:  0.01441,   Adjusted R-squared:  0.01396 
F-statistic: 31.46 on 1 and 2151 DF,  p-value: 0.00000002298

The positive coefficient for Bluebook makes sense: we expect the replacement cost and/or repairs for a high-valued car to be more expensive than auto with a low replacement cost.

Model 5: Add Quadratic Terms to Model 4

In the exploratory section, we noted potential curved relationship between some predictors and the log of TARGET_AMT.

Those predictors were AGE and TRAVTIME. Let’s include squared terms for these two predictors and also for for BLUEBOOK:


Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + I(BLUEBOOK_MOD^2) + 
    TRAVTIME_MOD + I(TRAVTIME_MOD^2) + AGE_MOD + I(AGE_MOD^2), 
    data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7678 -0.3885  0.0345  0.3994  3.2450 

Coefficients:
                     Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)        7.56360993  0.15342722  49.298 < 0.0000000000000002 ***
BLUEBOOK_MOD       0.29960723  0.08326270   3.598             0.000328 ***
I(BLUEBOOK_MOD^2) -0.02798286  0.01094977  -2.556             0.010670 *  
TRAVTIME_MOD      -0.00057584  0.00119020  -0.484             0.628569    
I(TRAVTIME_MOD^2)  0.00001853  0.00005609   0.330             0.741178    
AGE_MOD            0.00042933  0.00186549   0.230             0.818002    
I(AGE_MOD^2)       0.00027328  0.00014705   1.858             0.063242 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8059 on 2146 degrees of freedom
Multiple R-squared:  0.01912,   Adjusted R-squared:  0.01638 
F-statistic: 6.972 on 6 and 2146 DF,  p-value: 0.0000002344

The second order term for BLUEBOOK is statistically significant. The second order term for AGE is borderline significant; so we include leave both age variables in our model; but remove terms related to TRAVTIME.


Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + I(BLUEBOOK_MOD^2) + 
    AGE_MOD + I(AGE_MOD^2), data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7720 -0.3860  0.0358  0.3948  3.2380 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        7.5684831  0.1529052  49.498 < 0.0000000000000002 ***
BLUEBOOK_MOD       0.2990922  0.0831967   3.595             0.000332 ***
I(BLUEBOOK_MOD^2) -0.0279652  0.0109407  -2.556             0.010655 *  
AGE_MOD            0.0004062  0.0018617   0.218             0.827316    
I(AGE_MOD^2)       0.0002761  0.0001468   1.880             0.060177 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8056 on 2148 degrees of freedom
Multiple R-squared:  0.01899,   Adjusted R-squared:  0.01717 
F-statistic:  10.4 on 4 and 2148 DF,  p-value: 0.00000002434

This model indicates that log costs increase quadratically with Age. This result seems possible.

Model 6: Stepwise Regression

Finally, we’ll perform a basic stepwise regression with variable selection performed in both directions. For simplicity, we’ll only include linear terms.


Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + MSTATUS + MVR_PTS + 
    SEX + CLM_FREQ_MOD, data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6968 -0.4038  0.0391  0.4093  3.2236 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   7.940083   0.066264 119.825 < 0.0000000000000002 ***
BLUEBOOK_MOD  0.086861   0.015942   5.449         0.0000000566 ***
MSTATUSYes   -0.073598   0.034727  -2.119               0.0342 *  
MVR_PTS       0.017181   0.007053   2.436               0.0149 *  
SEXM          0.055410   0.035037   1.581               0.1139    
CLM_FREQ_MOD -0.022434   0.014567  -1.540               0.1237    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8051 on 2147 degrees of freedom
Multiple R-squared:  0.02064,   Adjusted R-squared:  0.01836 
F-statistic: 9.051 on 5 and 2147 DF,  p-value: 0.00000001583

The stepwise procedure included four additional predictors in addition to BLUEBOOK:
* MSTATUS: the negative coefficient indicates the married individuals are less expensive than singles. This seems reasonable.
* MVR_PTS: the model indicates a positive association between MVR_PTS and claim costs. This also seems reasonable.
* SEX: According to the model, males are more expensive than females. This is consistent with our earlier exploratory work.
* CLM_FREQ_MOD1: The coefficient is negative. This result is counterintuitive. We would expect folks with a high incidence accident rate to potentially be at risk for higher cost accidents. For now, we’ll leave this predictor in, but we may want do additional analysis.

6 SELECT MODELS

6.1 Binary Logistic Regression Models

Let’s compare model fits for all of our models:

            AIC     AICc      BIC    loglik
m1     7351.936 7352.246 7597.185 -3640.968
m1_mod 7344.898 7345.127 7555.112 -3642.449
m2     7280.684 7281.108 7567.976 -3599.342
m2_mod 7273.059 7273.386 7525.315 -3600.529
m3     7344.813 7345.057 7562.034 -3641.407

Based on the various model evaluation criteria, model m2_mod appears to be the clear winner. This model is the binary logistic regression model that included multiple quadratic terms, but removed statistically insignificant predictors from the original model 2 formulation.

Model 2 is superior in that the AIC, AIC_c, and BIC measures are lower than all other evaluated models. The log likelihood is not quite as the original model 2, but this measure does not account for model complexity as the other models do.

Let’s also compare the AUC measure for all models:

[1] "Model 1: 0.815"
[1] "Model 1 mod: 0.814"
[1] "Model 2: 0.82"
[1] "Model 2 mod: 0.819"
[1] "Model 3: 0.815"

Models 2 and Model 2 mod are tied and have the highest AUC measures. Given that Model 2 mod has fewer parameters than Model 2, the AUC measure supports our contention that Model 2 mod is superior to the other models.

Let’s explore a summary of our model predictions using the training data:

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002406 0.074672 0.195498 0.263816 0.402386 0.960296 

We now need to choose an appropriate probability cutoff measure for predicting whether or not an individual will have a claim.

We will use Youden’s index to determine this optimal cutoff.

[1] 0.2762045

Using Youden’s index, we select a relatively low cutoff measure of 0.276. With such a low cutoff, we will inevitably sacrifice specificity for gains in sensitivity compared to a traditional 0.5 cutoff. However, this lower cutoff provides a better balance between precision and recall.

          Reference
Prediction    0    1
         0 4476  545
         1 1532 1608
$accuracy
[1] 0.7454969

$error_rt
[1] 0.2545031

$precision
[1] 0.5121019

$sensitivity
[1] 0.7468648

$specificity
[1] 0.7450067

$F1
[1] 0.6075949

For comparison purposes, here is the confusion matrix and related classification metrics with a 0.5 cutoff.

          Reference
Prediction    0    1
         0 5534 1207
         1  474  946
$accuracy
[1] 0.7940203

$error_rt
[1] 0.2059797

$precision
[1] 0.6661972

$sensitivity
[1] 0.4393869

$specificity
[1] 0.9211052

$F1
[1] 0.529527

We see that our 0.276 cutoff also results in lower accuracy vis-a-vis the 0.5 threshold. But our lower threshold also results in better balance in precision vs. recall, as indicated by the improved F1 measure.

6.2 Multiple Regression Models

We’ll review five different measures for assessing our multiple regression models:

  • R-Squared
  • Adjusted R-Squared
  • Root Mean Squared Error
  • AIC
  • Corrected AIC
  • BIC
          R_2 adj_R_2     rmse      AIC     AICc      BIC
m4     0.0198  0.0083 7891.505 5226.127 5226.838 5379.341
m4_mod 0.0144  0.0140 7901.013 5189.919 5189.931 5206.943
m5     0.0191  0.0164 7897.224 5189.618 5189.685 5235.015
m5_mod 0.0190  0.0172 7897.273 5185.896 5185.935 5219.944
m6     0.0206  0.0184 7885.800 5184.273 5184.325 5223.995

Based on the table above, model 6 appears to be the superior model.
It has superior measures in all categories except for BIC.

Let’s now do some quick model diagnostics for our selected model, model 6:

The residuals in our model appear to have a relatively constant variance across all fitted values. The qq plot indicates standardized residuals that are fairly well behaved, with only minor departures from normality. Finally, there are only a couple outliers in our data–none appear to be high leverage points.

6.3 Make Predictions

Let’s wrap up by scrubbing the test data set and make predictions. Please refer to the Github account in the Appendix to access the prediction file.

7 Appendix

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, cache=TRUE, comment=NA,
                      fig.align='center',options(width = 75, scipen=999))
# load libraries
library(stringr)
library(knitr)
library(moments)
library(ggplot2)
library(ggthemes)
library(gridExtra)
library(scales)
library(dplyr)
library(mice)
library(geoR)
library(car)
library(bestglm)
library(sme)
library(pROC)
library(caret)


url <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/dataset/insurance_training_data.csv"

auto <- read.csv(url)  

# create table describing variables in dataset
vname <- names(auto)

vdefn <- c("Identification Variable (do not use) ",
           "In a crash? 1=YES 0=NO ",
           "Cost of Crash, if applicable",
           "# Driving Children",
           "Age of Driver", 
           "# Children at Home",
           "Years on Job",
           "Income",
           "Single Parent",
           "Home Value",
           "Marital Status",
           "Gender",
           "Max Education Level",
           "Job Category",
           "Commute Distance",
           "Vehicle Use",
           "Value of Vehicle",
           "Time in Force",
           "Type of Car",
           "A Red Car",
           "# Claims (Past 5 Years) ",
           "Total Claims (Past 5 Years)",
           "License Revoked (Past 7 Years)",
           "Motor Vehicle Report Points",
           "Vehicle Age",
           "Home/Work Area"
)
           

veffect <- c("None",
             "None",
             "None",
             "When teenagers drive your car, increased crash risk",
             "Young and old drivers might be riskier",
             "Unknown effect",
             "Long-term employees are usually safer",
             "In theory, rich have fewer crashes",
             "Unknown impact",
             "In theory, home owners may drive more responsibly",
             "In theory, married individuals are less risky",
             "Urban legend: females are safer drivers",
             "Unknown, but in theory more educated people tend to drive more safely",
             "In theory, white collar workers are less risky",
             "Long drives to work usually suggest greater risk",
             "Commercial fleet driven more, may impact collision prob",
             "Unknown impact on collision prob, but impacts crash payout",
             "Long-term customers are usually safer",
             "Unknown impact on collision prob, but impacts crash payout",
             "Urban legend: red cars are riskier, particularly sports cars",
             "If  total payout  high, future payouts might be high",
             "Claim count should be positively correlated with future claims",
             "If your license was revoked, you probably are a riskier driver",
             "Traffic ticket counts have postive correlation with crashes",
             "Unknown impact on collision prob, but impacts crash payout",
             "Unknown impact"
             
)

var_df <- data.frame(vname,vdefn, veffect, row.names = NULL)
kable(var_df, col.names=c('Variable Name','Description','Theoretical Impact'))

# training data cleaning scripts

currency_fix <- function(x) {
  num <- str_replace_all(x, "\\$","")
  num <- as.numeric(str_replace_all(num, "\\,",""))
  num
}

auto$INCOME <- currency_fix(auto$INCOME)
auto$HOME_VAL <- currency_fix(auto$HOME_VAL)
auto$BLUEBOOK <- currency_fix(auto$BLUEBOOK)
auto$OLDCLAIM <- currency_fix(auto$OLDCLAIM)


auto[sapply(auto, is.factor)] <- lapply(auto[sapply(auto, is.factor)], 
                                        function(x) str_replace(x,"z_|<",""))

auto[sapply(auto, is.character)] <- lapply(auto[sapply(auto, is.character)],as.factor) 

auto$EDUCATION <- ordered(auto$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD") )

auto$INCOME <- auto$INCOME / 1000  
auto$HOME_VAL <- auto$HOME_VAL / 1000
auto$BLUEBOOK <- auto$BLUEBOOK / 1000
auto$OLDCLAIM <- auto$OLDCLAIM / 1000

str(auto)
options(width=95)
summary(auto) 

tbl <- with(auto,rbind(round(addmargins(table(TARGET_FLAG)),0),
                       addmargins(prop.table(table(TARGET_FLAG)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=100)
round(with(auto, c(summary(TARGET_AMT), StdD=sd(TARGET_AMT), Skew=skewness(TARGET_AMT), Kurt=kurtosis(TARGET_AMT))),2)
options(width=100)
auto_censor <- auto[auto$TARGET_FLAG == 1,]
round(with(auto_censor,c(summary(TARGET_AMT), StdD=sd(TARGET_AMT), Skew=skewness(TARGET_AMT), Kurt=kurtosis(TARGET_AMT))),2)




h <- ggplot(auto_censor, aes(TARGET_AMT)) + 
  geom_histogram(color="ghostwhite", fill="steelblue4") +
  theme_classic()+ labs(title = 'Histogram of TARGET_AMT') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 

b <- ggplot(auto_censor, aes(x="",y=TARGET_AMT)) + 
  geom_boxplot(color="ghostwhite", fill="steelblue4",outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'Boxplot of TARGET_AMT') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + coord_flip() + 
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)


grid.arrange(h,b, ncol=2)


tbl <- with(auto, rbind(addmargins(table(KIDSDRIV)),addmargins(prop.table(table(KIDSDRIV)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)


round(with(auto,c(summary(KIDSDRIV), StdD=sd(KIDSDRIV), Skew=skewness(KIDSDRIV), Kurt=kurtosis(KIDSDRIV))),2)

bc <- ggplot(auto, aes(KIDSDRIV, fill=factor(TARGET_FLAG))) + 
  geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
  theme_classic()+ labs(title = 'KIDSDRIV Prop by TARGET_FLAG') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 
 
 
bp <- ggplot(auto_censor, aes(factor(KIDSDRIV), log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by KIDSDRIV', x = "KIDSDRIV") + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

lo <- auto %>%
  group_by(KIDSDRIV) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp <- ggplot(lo, aes(KIDSDRIV,logodds)) + 
  geom_point(shape=21, color = "ghostwhite", fill='steelblue4', size=4, stroke=0.75) + 
  theme_classic()+ labs(title = 'Log Odds Claim by KIDSDRIV') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
  

  
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <-auto[auto$TARGET_AMT > 0,] %>%
  group_by(KIDSDRIV) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)
tbl <- with(auto, rbind(addmargins(table(HOMEKIDS)),addmargins(prop.table(table(HOMEKIDS)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)


round(with(auto,c(summary(HOMEKIDS), StdD=sd(HOMEKIDS), Skew=skewness(HOMEKIDS), Kurt=kurtosis(HOMEKIDS))),2)  

bc <- ggplot(auto, aes(HOMEKIDS, fill=factor(TARGET_FLAG))) + 
  geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
  theme_classic()+ labs(title = 'HOMEKIDS Prop by TARGET_FLAG') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 
 
 
bp <- ggplot(auto_censor, aes(factor(HOMEKIDS), log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by HOMEKIDS', x = "HOMEKIDS") + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

lo <- auto %>%
  group_by(HOMEKIDS) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp <- ggplot(lo, aes(HOMEKIDS,logodds)) + 
  geom_point(shape=21, color = "ghostwhite", fill='steelblue4', size=4, stroke=0.75) + 
  theme_classic()+ labs(title = 'Log Odds Claim by HOMEKIDS') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
   
t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(HOMEKIDS) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)
tbl <- with(auto, rbind(addmargins(table(CLM_FREQ)),addmargins(prop.table(table(CLM_FREQ)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)


round(with(auto,c(summary(CLM_FREQ), StdD=sd(CLM_FREQ), Skew=skewness(CLM_FREQ), Kurt=kurtosis(CLM_FREQ))),2)  

bc <- ggplot(auto, aes(CLM_FREQ, fill=factor(TARGET_FLAG))) + 
  geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
  theme_classic()+ labs(title = 'CLM_FREQ Prop by TARGET_FLAG') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 
 
 
bp <- ggplot(auto_censor, aes(factor(CLM_FREQ), log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by CLM_FREQ', x = "CLM_FREQ") + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

lo <- auto %>%
  group_by(CLM_FREQ) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp <- ggplot(lo, aes(CLM_FREQ,logodds)) + 
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) + 
  theme_classic()+ labs(title = 'Log Odds Claim by CLM_FREQ') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1) 

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(CLM_FREQ) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)
options(width=90)
tbl <- with(auto, rbind(addmargins(table(MVR_PTS)),addmargins(prop.table(table(MVR_PTS)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)


round(with(auto,c(summary(MVR_PTS), StdD=sd(MVR_PTS), Skew=skewness(MVR_PTS), Kurt=kurtosis(MVR_PTS))),2)  

bc <- ggplot(auto, aes(MVR_PTS, fill=factor(TARGET_FLAG))) + 
  geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
  theme_classic()+ labs(title = 'MVR_PTS Prop by TARGET_FLAG') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 
 
 
bp <- ggplot(auto_censor, aes(factor(MVR_PTS), log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by MVR_PTS', x = "MVR_PTS") + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

lo <- auto %>%
  group_by(MVR_PTS) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp <- ggplot(lo, aes(MVR_PTS,logodds)) + 
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) + 
  theme_classic()+ labs(title = 'Log Odds Claim by MVR_PTS') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1) 
t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(BIN_MVR=2*floor(MVR_PTS/2)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)
options(width=100)

tbl <- with(auto, rbind(addmargins(table(5*floor(AGE/5))),addmargins(prop.table(table(5*floor(AGE/5))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)



options(width=100)
round(with(auto,c(summary(AGE), StdD=sd(AGE, na.rm=T), Skew=skewness(AGE, na.rm=T),
                  Kurt=kurtosis(AGE, na.rm=T))),2)


h1 <- ggplot(auto, aes(AGE)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(AGE, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$AGE_5 <- 5 * floor(auto$AGE / 5)
#auto$AGE_10 <- 10 * floor(auto$AGE / 10)

lo <- auto %>%
  group_by(AGE) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(AGE,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by AGE') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, formula='y ~ poly(x, 2)') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(AGE, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(AGE)) %>%
  group_by(AGEBIN=10*floor(AGE/10)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)



tbl <- with(auto, rbind(addmargins(table(2*floor(YOJ/2))),addmargins(prop.table(table(2*floor(YOJ/2)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=90)
round(with(auto,c(summary(YOJ), StdD=sd(YOJ, na.rm=T), Skew=skewness(YOJ, na.rm=T),
                  Kurt=kurtosis(YOJ, na.rm=T))),2)

h1 <- ggplot(auto, aes(YOJ)) + geom_histogram(binwidth = 1, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of YOJ') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(YOJ, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 1, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by YOJ') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$YOJ_5 <- 5 * floor(auto$YOJ / 5)
#auto$YOJ_10 <- 10 * floor(auto$YOJ / 10)

lo <- auto %>%
  group_by(YOJ) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(YOJ,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by YOJ') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(YOJ, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by YOJ') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(YOJ)) %>%
  group_by(YOJBIN=4*floor(YOJ/4)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(10*floor(TRAVTIME/10))),addmargins(prop.table(table(10*floor(TRAVTIME/10)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)

options(width = 90)
round(with(auto,c(summary(TRAVTIME), StdD=sd(TRAVTIME, na.rm=T), Skew=skewness(TRAVTIME, na.rm=T),
                  Kurt=kurtosis(TRAVTIME, na.rm=T))),2)

h1 <- ggplot(auto, aes(TRAVTIME)) + geom_histogram(binwidth = 5, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of TRAVTIME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(TRAVTIME, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 5, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by TRAVTIME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$TRAVTIME_5 <- 5 * floor(auto$TRAVTIME / 5)
#auto$TRAVTIME_10 <- 10 * floor(auto$TRAVTIME / 10)

lo <- auto %>%
  group_by(TRAVTIME) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(TRAVTIME,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by TRAVTIME') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(TRAVTIME, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by TRAVTIME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(TRAVTIME)) %>%
  group_by(TRAVBIN=15*floor(TRAVTIME/15)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(2*floor(TIF/2))),addmargins(prop.table(table(2*floor(TIF/2)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)

options(width = 90)
round(with(auto,c(summary(TIF), StdD=sd(TIF, na.rm=T), Skew=skewness(TIF, na.rm=T),
                  Kurt=kurtosis(TIF, na.rm=T))),2)

h1 <- ggplot(auto, aes(TIF)) + geom_histogram(binwidth = 3, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of TIF') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(TIF, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 3, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by TIF') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$TIF_5 <- 5 * floor(auto$TIF / 5)
#auto$TIF_10 <- 10 * floor(auto$TIF / 10)

lo <- auto %>%
  group_by(TIF) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(TIF,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by TIF') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(TIF, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) Prop by TIF') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(TIF)) %>%
  group_by(TIFBIN=4*floor(TIF/4)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(3*floor(CAR_AGE/3))),addmargins(prop.table(table(3*floor(CAR_AGE/3)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)

options(width = 90)
round(with(auto,c(summary(CAR_AGE), StdD=sd(CAR_AGE, na.rm=T), Skew=skewness(CAR_AGE, na.rm=T),
                  Kurt=kurtosis(CAR_AGE, na.rm=T))),2)

h1 <- ggplot(auto, aes(CAR_AGE)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of CAR_AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))


h2 <- ggplot(auto, aes(CAR_AGE, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by CAR_AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$CAR_AGE_5 <- 5 * floor(auto$CAR_AGE / 5)
#auto$CAR_AGE_10 <- 10 * floor(auto$CAR_AGE / 10)

lo <- auto %>%
  group_by(CAR_AGE) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(CAR_AGE,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by CAR_AGE') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(CAR_AGE, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) Prop by CAR_AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(CAR_AGE)) %>%
  group_by(CAR_AGE=4*floor(CAR_AGE/4)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$PARENT1))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(PARENT1=auto$PARENT1,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(PARENT1, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by PARENT1') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(PARENT1) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$MSTATUS))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(MSTATUS=auto$MSTATUS,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(MSTATUS, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by MSTATUS') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(MSTATUS) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$SEX))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(SEX=auto$SEX,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(SEX, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by SEX') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(SEX) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$CAR_USE))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(CAR_USE=auto$CAR_USE,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(CAR_USE, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by CAR_USE') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(CAR_USE) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$RED_CAR))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(RED_CAR=auto$RED_CAR,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(RED_CAR, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by RED_CAR') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(RED_CAR) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$REVOKED))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(REVOKED=auto$REVOKED,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(REVOKED, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by REVOKED') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(REVOKED) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$URBANICITY))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(URBANICITY=auto$URBANICITY,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(URBANICITY, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by URBANICITY') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(URBANICITY) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(JOB)),addmargins(prop.table(table(JOB)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)

tbl <- with(auto, addmargins(table(JOB, TARGET_FLAG)))
tbl

pt <- round(prop.table(tbl[1:9,1:2], margin=1),2)
pt

prop.test(tbl[1:9,1:2])


bc <- ggplot(auto, aes(x=reorder(JOB, JOB,function(x) length(x)))) + 
  geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") + 
  theme_classic()+ labs(title = 'JOB Prop') +
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + 
  scale_x_discrete(labels=c('Dr','Unk', 'Hmkr', 'Stud', 'Law', 'Mgmt', 'Prof','Clr','BlCol')) + 
  labs(x="JOB") + coord_flip()
  

bp <- ggplot(auto_censor, aes(factor(JOB), log(TARGET_AMT))) +
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by JOB', x = "JOB") +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


lo <- auto %>%
  group_by(JOB) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))

sp <- ggplot(lo, aes(JOB,logodds)) +
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
  theme_classic()+ labs(title = 'Log Odds Claim by JOB') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue")) + 
  scale_x_discrete(labels=c('','BlCol','Clr', 'Dr', 'Hmkr', 'Law', 'Mgmt', 'Prof','Stud'))




grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)  

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(JOB) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(CAR_TYPE)),addmargins(prop.table(table(CAR_TYPE)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)

tbl <- with(auto, addmargins(table(CAR_TYPE, TARGET_FLAG)))
tbl

pt <- round(prop.table(tbl[1:6,1:2], margin=1),2)
pt

prop.test(tbl[1:6,1:2])


bc <- ggplot(auto, aes(x=reorder(CAR_TYPE, CAR_TYPE,function(x) length(x)))) + 
  geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") + 
  theme_classic()+ labs(title = 'CAR_TYPE Prop') +
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + 
  labs(x="CAR_TYPE") + coord_flip()

#scale_x_discrete(labels=c('Dr','Unk', 'Hmkr', 'Stud', 'Law', 'Mgmt', 'Prof','Clr','BlCol')) + 
 

bp <- ggplot(auto_censor, aes(factor(CAR_TYPE), log(TARGET_AMT))) +
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by CAR_TYPE', x = "CAR_TYPE") +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


lo <- auto %>%
  group_by(CAR_TYPE) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))

sp <- ggplot(lo, aes(CAR_TYPE,logodds)) +
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
  theme_classic()+ labs(title = 'Log Odds Claim by CAR_TYPE') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue")) +
  scale_x_discrete(labels=c('MV','PnlTrk','PkpTrk', 'SptCar', 'SUV', 'Van'))
  
  


grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)  

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(CAR_TYPE) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(EDUCATION)),addmargins(prop.table(table(EDUCATION)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)

tbl <- with(auto, addmargins(table(EDUCATION, TARGET_FLAG)))
tbl

pt <- round(prop.table(tbl[1:4,1:2], margin=1),2)
pt

prop.test(tbl[1:4,1:2])


bc <- ggplot(auto, aes(x=reorder(EDUCATION, EDUCATION,function(x) length(x)))) + 
  geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") + 
  theme_classic()+ labs(title = 'EDUCATION Prop') +
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + 
  labs(x="EDUCATION") + coord_flip()


bp <- ggplot(auto_censor, aes(factor(EDUCATION), log(TARGET_AMT))) +
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by EDUCATION', x = "EDUCATION") +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


lo <- auto %>%
  group_by(EDUCATION) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))

sp <- ggplot(lo, aes(EDUCATION,logodds)) +
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
  theme_classic()+ labs(title = 'Log Odds Claim by EDUCATION') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue")) 
  

  


grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)  

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(EDUCATION) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=130)

tbl <- with(auto, rbind(addmargins(table(30*floor(INCOME/30))),addmargins(prop.table(table(30*floor(INCOME/30))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)




options(width=100)
round(with(auto,c(summary(INCOME), StdD=sd(INCOME, na.rm=T), Skew=skewness(INCOME, na.rm=T),
                  Kurt=kurtosis(INCOME, na.rm=T))),2)


h1 <- ggplot(auto, aes(INCOME)) + geom_histogram(binwidth = 20, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of INCOME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))


h2 <- ggplot(auto, aes(INCOME, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 20, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by INCOME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$INCOME_5 <- 5 * floor(auto$INCOME / 5)
#auto$INCOME_30  <- auto$INCOME^0.5
auto$INCOME_30 <- 30 * floor(auto$INCOME / 30)

lo <- auto %>%
  group_by(INCOME_30) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))


  
sp1 <- ggplot(lo, aes(INCOME_30,logodds)) + 
  geom_point(color ='steelblue4', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Clm by Binned INCOME') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=T) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))



sp2 <- ggplot(auto_censor, aes(INCOME, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by INCOME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))

grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(INCOME)) %>%
  group_by(INCBIN=30*floor(INCOME/30)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=120)

tbl <- with(auto, rbind(addmargins(table(50*floor(HOME_VAL/50))),addmargins(prop.table(table(50*floor(HOME_VAL/50))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)




options(width=100)
round(with(auto,c(summary(HOME_VAL), StdD=sd(HOME_VAL, na.rm=T), Skew=skewness(HOME_VAL, na.rm=T),
                  Kurt=kurtosis(HOME_VAL, na.rm=T))),2)


h1 <- ggplot(auto, aes(HOME_VAL)) + geom_histogram(binwidth = 40, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of HOME_VAL') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(HOME_VAL, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 40, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by HOME_VAL') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))


#auto$HOME_VAL_5 <- 5 * floor(auto$HOME_VAL / 5)
auto$HOME_VAL_30 <- 30 * floor(auto$HOME_VAL / 30)

lo <- auto %>%
  group_by(HOME_VAL_30) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(HOME_VAL_30,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Clm by Binned HOME_VAL') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(HOME_VAL, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by HOME_VAL') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(HOME_VAL)) %>%
  group_by(HOMEVALBIN=50*floor(HOME_VAL/50)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=120)
tbl <- with(auto, rbind(addmargins(table(4*floor(BLUEBOOK/4))),addmargins(prop.table(table(4*floor(BLUEBOOK/4))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)




options(width=100)
round(with(auto,c(summary(BLUEBOOK), StdD=sd(BLUEBOOK, na.rm=T), Skew=skewness(BLUEBOOK, na.rm=T),
                  Kurt=kurtosis(BLUEBOOK, na.rm=T))),2)


h1 <- ggplot(auto, aes(BLUEBOOK)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of BLUEBOOK') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(BLUEBOOK, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by BLUEBOOK') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))


#auto$BLUEBOOK_5 <- 5 * floor(auto$BLUEBOOK / 5)
auto$BLUEBOOK_2 <- 2 * floor(auto$BLUEBOOK / 2)

lo <- auto %>%
  group_by(BLUEBOOK_2) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(BLUEBOOK_2,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Clm by Binned BLUEBOOK') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(BLUEBOOK, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by BLUEBOOK') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(BLUEBOOK)) %>%
  group_by(BLUEBKBIN=5*floor(BLUEBOOK/5)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=120)
tbl <- with(auto, rbind(addmargins(table(4000*floor(OLDCLAIM/4000))),addmargins(prop.table(table(4000*floor(OLDCLAIM/4000))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)




options(width=100)
round(with(auto,c(summary(OLDCLAIM), StdD=sd(OLDCLAIM, na.rm=T), Skew=skewness(OLDCLAIM, na.rm=T),
                  Kurt=kurtosis(OLDCLAIM, na.rm=T))),2)


h1 <- ggplot(auto, aes(OLDCLAIM)) + geom_histogram(binwidth = 4, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of OLDCLAIM') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(OLDCLAIM, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 4, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by OLDCLAIM') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))


#auto$OLDCLAIM_5 <- 5 * floor(auto$OLDCLAIM / 5)
auto$OLDCLAIM_4 <- 4 * floor(auto$OLDCLAIM / 4)

lo <- auto %>%
  group_by(OLDCLAIM_4) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(OLDCLAIM_4,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Clm by Binned OLDCLAIM') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(OLDCLAIM, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by OLDCLAIM') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(OLDCLAIM)) %>%
  group_by(OLD_BIN=5*floor(OLDCLAIM/5)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

tmp_data <- mice(auto,maxit=5, method='pmm',seed=20, print=F)
densityplot(tmp_data)[1:5]
auto$AGE <- ifelse(is.na(auto$AGE), mean(auto$AGE),auto$AGE)
tmp_data <- mice(auto,maxit=1, method='pmm',seed=20, print=F)
auto <- complete(tmp_data,1)
#auto_censor <- auto[auto$TARGET_FLAG==1,]
auto$CAR_AGE <- ifelse(auto$CAR_AGE == -3, 0, auto$CAR_AGE)

boxcoxfit(auto$TARGET_AMT[auto$TARGET_FLAG==1])

auto$TARGET_AMT_MOD <- log(auto$TARGET_AMT)

auto$KIDSDRIV_MOD <- scale(auto$KIDSDRIV, scale=F)
auto$HOMEKIDS_MOD <- scale(auto$HOMEKIDS, scale=F)
auto$CLM_FREQ_MOD <- scale(auto$CLM_FREQ, scale=F)

auto$AGE_MOD <- scale(auto$AGE, scale=F)


#auto$YOJ_ZERO <- ifelse(auto$YOJ == 0,1,0)

# auto$YOJ_MOD <- ifelse(auto$YOJ == 0, 
#                          mean(auto$YOJ[auto$YOJ !=0]),
#                          scale(auto$YOJ[auto$YOJ !=0],
#                                scale=F))
auto$YOJ_MOD <- scale(auto$YOJ, scale=F)

auto$TRAVTIME_MOD <- scale(auto$TRAVTIME, scale=F)

auto$TIF_MOD <- scale(auto$TIF, scale=F)

boxcoxfit(auto$INCOME[auto$INCOME >0])
auto$INCOME_MOD <- auto$INCOME ^0.5
boxcoxfit(auto$HOME_VAL[auto$HOME_VAL > 0])
  
#auto$HOME_VAL_ZERO <- ifelse(auto$HOME_VAL==0,1,0)
auto$HOME_VAL_MOD <- auto$HOME_VAL^0.25

boxcoxfit(auto$BLUEBOOK)
auto$BLUEBOOK_MOD <- auto$BLUEBOOK^0.5

boxcoxfit(auto$OLDCLAIM[auto$OLDCLAIM>0])

auto$OLD_CLAIM_MOD <- log(auto$OLDCLAIM + 1)  

m1 <- glm(TARGET_FLAG ~. -TARGET_AMT -TARGET_AMT_MOD -INDEX -KIDSDRIV -HOMEKIDS -CLM_FREQ -AGE -YOJ -TRAVTIME -TIF -SEX -RED_CAR -INCOME -HOME_VAL -BLUEBOOK -OLDCLAIM
          -INCOME_30 -HOME_VAL_30 -BLUEBOOK_2 -OLDCLAIM_4,
          data = auto, family='binomial')
    
vif(m1)        

            
summary(m1)

m1_mod <- update(m1, . ~ . -CAR_AGE -HOMEKIDS_MOD -AGE_MOD -YOJ_MOD -OLD_CLAIM_MOD)

summary(m1_mod)        

m2 <- update(m1, . ~ . + I(KIDSDRIV_MOD^2) + I(HOMEKIDS_MOD^2) + I(CLM_FREQ_MOD^2)
                                               + I(AGE_MOD^2) + I(YOJ_MOD^2)
                                                   + I(TIF_MOD^2))

summary(m2)

m2_mod <- update(m2, . ~ . -YOJ_MOD -HOMEKIDS_MOD -I(YOJ_MOD^2) -I(HOMEKIDS_MOD^2)
                 - CAR_AGE)

summary(m2_mod)

auto_redux <- auto[,c(9,11,12:14,16,19,20,23:26,32:42,2)]

model.upper <- glm(TARGET_FLAG ~ .,family="binomial", data=auto_redux)
model.null <- glm(TARGET_FLAG ~ 1, family = "binomial", data=auto_redux)
 
m3 <- step(model.null,scope=list(upper=model.upper, lower=model.null),
           trace= 0, direction='both')

summary(m3)
# m4 <- bestglm(Xy=auto_redux, family=binomial,IC="AIC", method="exhaustive", nvmax=3)
# summary(m4$BestModel)
names(auto_redux)

auto_clm <- auto[auto$TARGET_FLAG ==
                           1,c(9,11,12:14,16,19,20,23:26,32:42,31)]
mymodel <- lm(TARGET_AMT_MOD ~ ., data=auto_clm)
vif(mymodel)
m4 <- lm(TARGET_AMT_MOD ~ KIDSDRIV_MOD + HOMEKIDS_MOD+ AGE_MOD + TRAVTIME_MOD + SEX + CAR_USE + RED_CAR + URBANICITY + JOB + CAR_TYPE + EDUCATION + BLUEBOOK_MOD , data=auto_clm)
summary(m4)


m4_mod <- lm(TARGET_AMT_MOD ~ BLUEBOOK_MOD, data=auto_clm)

summary(m4_mod)

m5 <- update(m4_mod, . ~ . + I(BLUEBOOK_MOD^2) + TRAVTIME_MOD + I(TRAVTIME_MOD^2) + AGE_MOD + I(AGE_MOD^2)) 

summary(m5)

m5_mod <- update(m4_mod, . ~ . + I(BLUEBOOK_MOD^2) + AGE_MOD + I(AGE_MOD^2))
summary(m5_mod)


model.upper <- lm(TARGET_AMT_MOD ~ ., data=auto_clm)
model.null <- lm(TARGET_AMT_MOD ~ 1, data=auto_clm)
 
m6 <- step(model.null,scope=list(upper=model.upper, lower=model.null),
           trace= 0, direction='both')

summary(m6)


m1_out <- cbind(AIC=AIC(m1), AICc=AICc(m1), BIC = BIC(m1), loglik=logLik(m1))
m1_mod_out <- cbind(AIC=AIC(m1_mod), AICc=AICc(m1_mod), BIC = BIC(m1_mod), loglik=logLik(m1_mod))
m2_out <- cbind(AIC=AIC(m2), AICc=AICc(m2), BIC = BIC(m2), loglik=logLik(m2))
m2_mod_out <- cbind(AIC=AIC(m2_mod), AICc=AICc(m2_mod), BIC = BIC(m2_mod), loglik=logLik(m2_mod))
m3_out <- cbind(AIC=AIC(m3), AICc=AICc(m3), BIC = BIC(m3), loglik=logLik(m3))

model_comp <- rbind(m1_out, m1_mod_out, m2_out, m2_mod_out, m3_out)
rownames(model_comp) <- c("m1","m1_mod","m2","m2_mod","m3")

model_comp
# models 1 and 2 prediction probs
m1_pred <- predict(m1, auto, type="response")
m1_mod_pred <- predict(m1_mod, auto, type="response")
m2_pred <- predict(m2, auto, type="response")
m2_mod_pred <- predict(m2_mod, auto, type="response")
m3_pred <- predict(m3, auto, type="response")

#AUC
paste("Model 1:",round(as.numeric(roc(auto$TARGET_FLAG, m1_pred)["auc"]),3))
paste("Model 1 mod:",round(as.numeric(roc(auto$TARGET_FLAG, m1_mod_pred)["auc"]),3))
paste("Model 2:",round(as.numeric(roc(auto$TARGET_FLAG, m2_pred)["auc"]),3))
paste("Model 2 mod:",round(as.numeric(roc(auto$TARGET_FLAG, m2_mod_pred)["auc"]),3))
paste("Model 3:",round(as.numeric(roc(auto$TARGET_FLAG, m3_pred)["auc"]),3))

mypredict <- predict(m2_mod, auto, type='response')

summary(mypredict)
myroc <- roc(auto$TARGET_FLAG, mypredict, plot=T, asp=NA,
                legacy.axes=T, main = "ROC Curve", ret="tp", col="blue")
coords(myroc, "best", ret='threshold', best.method="youden")
predict_target <- ifelse(mypredict >= 0.276, 1, 0)  
cm <- confusionMatrix(data=as.factor(predict_target), reference=as.factor(auto$TARGET_FLAG), positive='1')

cm$table

a <- cm$overall["Accuracy"]; names(a) <- NULL # accuracy
e <- 1 - a; names(e) <- NULL # error rate
p <- cm$byClass["Precision"]; names(p) <- NULL  # precision
st <- cm$byClass["Sensitivity"]; names(st) <- NULL  # sensitivity
sp <- cm$byClass["Specificity"]; names(sp) <- NULL # specificity
f1 <- cm$byClass["F1"] ; names(f1) <- NULL # F1

# display metrics
list(accuracy=a, error_rt = e, precision=p, sensitivity=st, specificity=sp, F1=f1)
predict_target2 <- ifelse(mypredict >= 0.5, 1, 0)  
cm <- confusionMatrix(data=as.factor(predict_target2), reference=as.factor(auto$TARGET_FLAG), positive='1')

cm$table

a <- cm$overall["Accuracy"]; names(a) <- NULL # accuracy
e <- 1 - a; names(e) <- NULL # error rate
p <- cm$byClass["Precision"]; names(p) <- NULL  # precision
st <- cm$byClass["Sensitivity"]; names(st) <- NULL  # sensitivity
sp <- cm$byClass["Specificity"]; names(sp) <- NULL # specificity
f1 <- cm$byClass["F1"] ; names(f1) <- NULL # F1

# display metrics
list(accuracy=a, error_rt = e, precision=p, sensitivity=st, specificity=sp, F1=f1)

RMSE = function(m, o){
  sqrt(mean((m - o)^2))
}

# predictions
p4 <- predict(m4, auto_clm)
p4_mod <- predict(m4_mod, auto_clm)
p5 <- predict(m5, auto_clm)
p5_mod <- predict(m5_mod, auto_clm)
p6 <- predict(m6, auto_clm)
              

m4_out <- cbind(R_2=round(summary(m4)$r.squared,4),
                adj_R_2=round(summary(m4)$adj.r.squared,4),
                rmse=RMSE(exp(p4),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m4),AICc=AICc(m4),BIC=BIC(m4))
                
m4_mod_out <- cbind(R_2=round(summary(m4_mod)$r.squared,4),
                adj_R_2=round(summary(m4_mod)$adj.r.squared,4),
                rmse=RMSE(exp(p4_mod),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m4_mod),AICc=AICc(m4_mod),BIC=BIC(m4_mod))

m5_out <- cbind(R_2=round(summary(m5)$r.squared,4),
                adj_R_2=round(summary(m5)$adj.r.squared,4),
                rmse=RMSE(exp(p5),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m5),AICc=AICc(m5),BIC=BIC(m5))

m5_mod_out <- cbind(R_2=round(summary(m5_mod)$r.squared,4),
                adj_R_2=round(summary(m5_mod)$adj.r.squared,4),
                rmse=RMSE(exp(p5_mod),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m5_mod),AICc=AICc(m5_mod),BIC=BIC(m5_mod))

m6_out <- cbind(R_2=round(summary(m6)$r.squared,4),
                adj_R_2=round(summary(m6)$adj.r.squared,4),
                rmse=RMSE(exp(p6),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m6),AICc=AICc(m6),BIC=BIC(m6))

reg <- rbind(m4=m4_out, m4_mod=m4_mod_out,m5=m5_out,m5_mod=m5_mod_out,m6=m6_out)
rownames(reg) <- c("m4", "m4_mod", "m5", "m5_mod","m6")
reg            
par(mfrow = c(2,2))
plot(m6)

url <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/dataset/insurance-evaluation-data.csv"

auto <- read.csv(url)  


currency_fix <- function(x) {
  num <- str_replace_all(x, "\\$","")
  num <- as.numeric(str_replace_all(num, "\\,",""))
  num
}

auto$INCOME <- currency_fix(auto$INCOME)
auto$HOME_VAL <- currency_fix(auto$HOME_VAL)
auto$BLUEBOOK <- currency_fix(auto$BLUEBOOK)
auto$OLDCLAIM <- currency_fix(auto$OLDCLAIM)


auto[sapply(auto, is.factor)] <- lapply(auto[sapply(auto, is.factor)], 
                                        function(x) str_replace(x,"z_|<",""))

auto[sapply(auto, is.character)] <- lapply(auto[sapply(auto, is.character)],as.factor) 

auto$EDUCATION <- ordered(auto$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD") )

auto$INCOME <- auto$INCOME / 1000  
auto$HOME_VAL <- auto$HOME_VAL / 1000
auto$BLUEBOOK <- auto$BLUEBOOK / 1000
auto$OLDCLAIM <- auto$OLDCLAIM / 1000


auto$AGE <- ifelse(is.na(auto$AGE), mean(auto$AGE),auto$AGE)
tmp_data <- mice(auto,maxit=1, method='pmm',seed=20, print=F)
auto <- complete(tmp_data,1)

auto$TARGET_AMT_MOD <- log(auto$TARGET_AMT)
auto$KIDSDRIV_MOD <- scale(auto$KIDSDRIV, scale=F)
auto$HOMEKIDS_MOD <- scale(auto$HOMEKIDS, scale=F)
auto$CLM_FREQ_MOD <- scale(auto$CLM_FREQ, scale=F)
auto$AGE_MOD <- scale(auto$AGE, scale=F)
auto$YOJ_MOD <- scale(auto$YOJ, scale=F)
auto$TRAVTIME_MOD <- scale(auto$TRAVTIME, scale=F)
auto$TIF_MOD <- scale(auto$TIF, scale=F)
auto$INCOME_MOD <- auto$INCOME ^0.5
auto$HOME_VAL_MOD <- auto$HOME_VAL^0.25
auto$BLUEBOOK_MOD <- auto$BLUEBOOK^0.5
auto$OLD_CLAIM_MOD <- log(auto$OLDCLAIM + 1) 

mypred <- predict(m2_mod, auto, type='response')
auto$TARGET_FLAG <- ifelse(mypred  >= 0.276, 1, 0)
  
mypred2 <- exp(predict(m6, auto))  
auto$TARGET_AMT <- mypred2

write.csv(auto, "evaluation_data_w_predictions.csv")