1 Introduction

In this report, we analyze insurance data to estimate the following quantities:

The probability that a specified driver will have a car crash.
The dollar cost of auto claims, given that the insured was involved in a crash.

In practice, these claim frequency and severity measures are useful for determining appropriate pure premium amounts to charge auto policyholders.

Using the provide training data set, we will build two separate models:

a binary logistic regression to determine crash probabilities.
a multiple linear regression model to estimate claim severity.

We will then make predictions on the provided insurance testing data set.

2 Data Exploration

2.1 Variable Overview

The training data set includes 8,161 observations, with 26 variables: 23 predictors, two response variables, and one record identifier.

Below is a brief description of the included variables:

Variable Name	Description	Theoretical Impact
INDEX	Identification Variable (do not use)	None
TARGET_FLAG	In a crash? 1=YES 0=NO	None
TARGET_AMT	Cost of Crash, if applicable	None
KIDSDRIV	# Driving Children	When teenagers drive your car, increased crash risk
AGE	Age of Driver	Young and old drivers might be riskier
HOMEKIDS	# Children at Home	Unknown effect
YOJ	Years on Job	Long-term employees are usually safer
INCOME	Income	In theory, rich have fewer crashes
PARENT1	Single Parent	Unknown impact
HOME_VAL	Home Value	In theory, home owners may drive more responsibly
MSTATUS	Marital Status	In theory, married individuals are less risky
SEX	Gender	Urban legend: females are safer drivers
EDUCATION	Max Education Level	Unknown, but in theory more educated people tend to drive more safely
JOB	Job Category	In theory, white collar workers are less risky
TRAVTIME	Commute Distance	Long drives to work usually suggest greater risk
CAR_USE	Vehicle Use	Commercial fleet driven more, may impact collision prob
BLUEBOOK	Value of Vehicle	Unknown impact on collision prob, but impacts crash payout
TIF	Time in Force	Long-term customers are usually safer
CAR_TYPE	Type of Car	Unknown impact on collision prob, but impacts crash payout
RED_CAR	A Red Car	Urban legend: red cars are riskier, particularly sports cars
OLDCLAIM	# Claims (Past 5 Years)	If total payout high, future payouts might be high
CLM_FREQ	Total Claims (Past 5 Years)	Claim count should be positively correlated with future claims
REVOKED	License Revoked (Past 7 Years)	If your license was revoked, you probably are a riskier driver
MVR_PTS	Motor Vehicle Report Points	Traffic ticket counts have postive correlation with crashes
CAR_AGE	Vehicle Age	Unknown impact on collision prob, but impacts crash payout
URBANICITY	Home/Work Area	Unknown impact

There are a couple issues with the raw data file:

Currency fields were treated as factors due to “$” and “,” characters.
Multiple character field entries included an extraneous “z_” or “<” prefix.

We also rescaled the fields INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM to be so that dollars are expressed in $1,000s,

After cleaning up these minor issues, we’re ready to explore data types and sample observations from the training set:

'data.frame':   8161 obs. of  26 variables:
 $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
 $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
 $ TARGET_AMT : num  0 0 0 0 0 ...
 $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
 $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
 $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
 $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
 $ INCOME     : num  67.3 91.4 16 NA 115 ...
 $ PARENT1    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ HOME_VAL   : num  0 257 124 306 244 ...
 $ MSTATUS    : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 2 2 1 1 ...
 $ SEX        : Factor w/ 2 levels "F","M": 2 2 1 2 1 1 1 2 1 2 ...
 $ EDUCATION  : Ord.factor w/ 4 levels "High School"<..: 4 1 1 1 4 2 1 2 2 2 ...
 $ JOB        : Factor w/ 9 levels "","Blue Collar",..: 8 2 3 2 4 2 2 2 3 8 ...
 $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
 $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
 $ BLUEBOOK   : num  14.23 14.94 4.01 15.44 18 ...
 $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
 $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 5 1 5 4 5 6 5 6 ...
 $ RED_CAR    : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
 $ OLDCLAIM   : num  4.46 0 38.69 0 19.22 ...
 $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
 $ REVOKED    : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
 $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
 $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
 $ URBANICITY : Factor w/ 2 levels "Highly Rural/ Rural",..: 2 2 2 2 2 2 2 2 2 1 ...

We ignore the field INDEX for modeling purposes, which is used only for identifying observations.

The two response variables, TARGET_FLAGand TARGET_AMT, contain binary and dollar values, respectively.

The predictors include four discrete count variables:

KIDSDRIV
HOMEKIDS
CLM_FREQ
MVR_PTS

There are five discrete time measurements:

AGE
YOJ
TRAVTIME
TIF
CAR_AGE

There are also seven binary, categorical features:

PARENT1
MSTATUS
SEX
CAR_USE
RED_CAR
REVOKED
URBANCITY

The data set includes two multinomial, categorical features:

JOB
CAR_TYPE

There is one ordinal, categorical predictor:

EDUCTATION

Finally, there are four predictors that express dollar amounts. These variables are effectively continuous:

INCOME
HOME_VAL
BLUEBOOK
OLDCLAIM

Now, let’s review a high level statistical summary of the variables:

     INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV           AGE       
 Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000   Min.   :16.00  
 1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
 Median : 5133   Median :0.0000   Median :     0   Median :0.0000   Median :45.00  
 Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711   Mean   :44.79  
 3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
 Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000   Max.   :81.00  
                                                                    NA's   :6      
    HOMEKIDS           YOJ           INCOME       PARENT1       HOME_VAL     MSTATUS   
 Min.   :0.0000   Min.   : 0.0   Min.   :  0.00   No :7084   Min.   :  0.0   No :3267  
 1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28.10   Yes:1077   1st Qu.:  0.0   Yes:4894  
 Median :0.0000   Median :11.0   Median : 54.03              Median :161.2             
 Mean   :0.7212   Mean   :10.5   Mean   : 61.90              Mean   :154.9             
 3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85.99              3rd Qu.:238.7             
 Max.   :5.0000   Max.   :23.0   Max.   :367.03              Max.   :885.3             
                  NA's   :454    NA's   :445                 NA's   :464               
 SEX            EDUCATION              JOB          TRAVTIME            CAR_USE    
 F:4375   High School:3533   Blue Collar :1825   Min.   :  5.00   Commercial:3029  
 M:3786   Bachelors  :2242   Clerical    :1271   1st Qu.: 22.00   Private   :5132  
          Masters    :1658   Professional:1117   Median : 33.00                    
          PhD        : 728   Manager     : 988   Mean   : 33.49                    
                             Lawyer      : 835   3rd Qu.: 44.00                    
                             Student     : 712   Max.   :142.00                    
                             (Other)     :1413                                     
    BLUEBOOK          TIF                CAR_TYPE    RED_CAR       OLDCLAIM     
 Min.   : 1.50   Min.   : 1.000   Minivan    :2145   no :5783   Min.   : 0.000  
 1st Qu.: 9.28   1st Qu.: 1.000   Panel Truck: 676   yes:2378   1st Qu.: 0.000  
 Median :14.44   Median : 4.000   Pickup     :1389              Median : 0.000  
 Mean   :15.71   Mean   : 5.351   Sports Car : 907              Mean   : 4.037  
 3rd Qu.:20.85   3rd Qu.: 7.000   SUV        :2294              3rd Qu.: 4.636  
 Max.   :69.74   Max.   :25.000   Van        : 750              Max.   :57.037  
                                                                                
    CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE                     URBANICITY  
 Min.   :0.0000   No :7161   Min.   : 0.000   Min.   :-3.000   Highly Rural/ Rural:1669  
 1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000   Highly Urban/ Urban:6492  
 Median :0.0000              Median : 1.000   Median : 8.000                             
 Mean   :0.7986              Mean   : 1.696   Mean   : 8.328                             
 3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000                             
 Max.   :5.0000              Max.   :13.000   Max.   :28.000                             
                                              NA's   :510

We notice a variety of missing values and strange field entries that may reflect data errors. These issues will be address in a later section.

Let’s now focus on each variable individually.

2.2 TARGET Variables

TARGET_FLAG

The response variable TARGET_FLAG has a moderate imbalance, with three-quarters of the observations indicating no crashes.

             0      1  Sum
count   6008.0 2153.0 8161
percent   73.6   26.4  100

TARGET_AMT

The other response, TARGET_AMT, exhibits extreme, positive skewness and high kurtosis.

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      StdD      Skew      Kurt 
     0.00      0.00      0.00   1504.32   1036.00 107586.14   4704.03      8.71    115.32

We already noted that almost three quarters of records indicate no car crash. In these cases, the TARGET_AMT has a zero value.

For modeling purposes, however, we will only be interested in dollar amounts where a crash occurred. Going forward, when performing calculations and summaries involving TARGET_AMT, we will use a subset of the data where zero amounts are filtered out.

Here is a summary of the zero-truncated TARGET_AMT variable:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      StdD      Skew      Kurt 
    30.28   2609.78   4104.00   5702.18   5787.00 107586.14   7743.18      5.64     45.49

Even after we remove the zero values, the variable remains highly skewed:

2.2.0.1 Count Variables

KIDSDRIV

The discrete variable KIDSDRIV is right skewed, with 88% of insureds in the training data having no teenage drivers in the household.

           0     1     2    3 4  Sum
count   7180 636.0 279.0 62.0 4 8161
percent   88   7.8   3.4  0.8 0  100

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    0.00    0.17    0.00    4.00    0.51    3.35   14.78

In the barplot below, we see different a slightly different distribution of policyholders involved in crashes vis-a-vis the incident-free insureds. Specifically, we see slightly higher concentration of individuals teenage drivers. The scatter plot also indicates a relationship between relationship between number of teenage drivers and the log odds of an auto incident.

We also plot the the log TARGET_AMT–given that a crash occurred–against KIDSDRIV. Note: we applied the log transformation because TARGET_AMT is highly right skewed, and the directional relationship with KIDSDRIV should be clearer in this form.

While not entirely clear in the boxplot, there appears to be a rough, positive relationship between KIDSDRIV and the median cost of the crash. The relationship with the mean crash amount, however, is less clear.

KIDSDRIV	ct	mean_cost	median_cost
0	1773	5659	4123
1	236	6220	3958
2	111	5542	4148
3	31	4915	4541
4	2	4054	4054

HOMEKIDS

Numerical distribution and statistical summaries are presented below:

             0     1      2     3   4    5  Sum
count   5289.0 902.0 1118.0 674.0 164 14.0 8161
percent   64.8  11.1   13.7   8.3   2  0.2  100

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    0.00    0.72    1.00    5.00    1.12    1.34    3.65

The distribution of this discrete variable is right skewed, but not to the same extent as KIDSRIV. HOMEKIDS contains some of the same information as KIDSDRIV: presumably, some of the reported children are also drivers.

Let’s look plots relating this predictor to the target variables:

The barplot provides some evidence that policyholders with crashes tend to have more children than than insureds not involved in an auto incident. The scatterplot indicates a significant difference in log odds between policyholders with children vs. insureds without children. We discount the observation associated with five children due to the small sample size.

There appears to be a subtle relationship between HOMEKIDS and the median cost of crashes. The relationship with the mean is less clear, given the highly variable and skewed distribution of crash amounts.

HOMEKIDS	ct	mean_cost	median_cost
0	1173	5685	4080
1	305	5522	4036
2	382	6085	4122
3	230	5432	4192
4	57	5610	4575
5	6	5009	5130

CLM_FREQ

Let’s review numerical and statistical summaries for CLM_FREQ, another discrete count variable:

             0     1      2     3     4    5  Sum
count   5009.0 997.0 1171.0 776.0 190.0 18.0 8161
percent   61.4  12.2   14.3   9.5   2.3  0.2  100

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    0.00    0.80    2.00    5.00    1.16    1.21    3.29

This variable has a similar skew as the previously reviewed count variable, HOMEKIDS.

Based on the barplot below, there seems to be a significance difference in the distribution of prior claim frequencies for TARGET_FLAG values of zero vs. one. We see roughly 60% of auto claimants have had one or more prior claims in the past five year, while only 40% of non-claimants have had accidents. The scatter plot indicates a nonlinear albeit positive relationship between log odds of a claim and prior claim history. We also do not see a clear pattern between CLM_FREQ and the claim amounts.

CLM_FREQ	ct	mean_cost	median_cost
0	898	5633	4160
1	385	5995	4312
2	469	5463	3858
3	314	5970	4206
4	80	5551	3393
5	7	4247	3737

MVR_PTS

Like the other count variables, MVR_PTS is positively skewed. There seems to be a positive relationship between points and log odds; however the scatter plot below indicates a strange, curved relationship between the two variables. We wonder if the curvature is related to to interaction with another predictor.

We see no straightforward relationship between TARGET_AMT and MVR_PTS.

             0      1     2     3     4     5     6   7  8    9   10   11 13  Sum
count   3712.0 1157.0 948.0 758.0 599.0 399.0 266.0 167 84 45.0 13.0 11.0  2 8161
percent   45.5   14.2  11.6   9.3   7.3   4.9   3.3   2  1  0.6  0.2  0.1  0  100

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    1.00    1.70    3.00   13.00    2.15    1.35    4.38

BIN_MVR	ct	mean_cost	median_cost
0	985	5379	4037
2	506	5762	4068
4	354	6110	4292
6	198	5896	4194
8	88	6809	4492
10	20	6293	3824
12	2	3786	3786

2.3 Time Variables

AGE

Below is table of values of the AGE predictor, with ages bucketed into 5 year increments (i.e. [15,20), [20,25), [30,35), etc.])

          15   20    25  30     35     40     45     50    55    60   65   70 75 80  Sum
count   14.0 58.0 249.0 649 1259.0 1673.0 1810.0 1373.0 725.0 265.0 65.0 12.0  1  2 8155
percent  0.2  0.7   3.1   8   15.4   20.5   22.2   16.8   8.9   3.2  0.8  0.1  0  0  100

Here is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
  16.00   39.00   45.00   44.79   51.00   81.00    6.00    8.63   -0.03    2.94

We note six missing values that we’ll need to address later.

The distribution of AGE is almost perfectly normal. When we break out the data by TARGET_FLAG values, the distributions of age by TARGET_FLAG are still roughly normal. However, individuals involved in a crash appear to be slightly younger, on average.

In the bottom left scatter plot, we notice a pattern between log odds and age. Specifically, there appears to be a curved relationship where younger ages have a higher odds of a crash. The odds continue to decrease until around age 60, when costs begin to trend upward again.

TARGET_AMT appears to be slightly higher for both younger (< 25) and older (>60) drivers, but the differences appear to be subtle.

AGEBIN	ct	mean_cost	median_cost
10	7	9650	3917
20	151	5412	4193
30	627	5601	4144
40	804	5470	3936
50	446	6280	4291
60	109	5907	4460
70	4	4284	4276

YOJ

YOB refers to the number of years in the insured’s current job.

Here is a numerical summary, binned in 2 year increments:

            0    2     4     6     8     10     12     14  16   18 22  Sum
count   631.0 51.0 129.0 473.0 905.0 1752.0 2174.0 1248.0 305 37.0  2 7707
percent   8.2  0.7   1.7   6.1  11.7   22.7   28.2   16.2   4  0.5  0  100

Below is the statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
   0.00    9.00   11.00   10.50   13.00   23.00  454.00    4.09   -1.20    4.18

We have a significant number of NA records–454, or 5.6%–that we’ll need to address.

The variable would be approximately normally distributed if it weren’t for the high percentage of individuals with less than one year on the job.

Insureds with accidents have a relatively high proportion of individuals with less than a year on the job.

The relationship between log odds and YOB appears to be complex–see the fitted loess curve below.

The relationship between YOJ and TARGET_AMT is also not very clear.

YOJBIN	ct	mean_cost	median_cost
0	257	4765	3736
4	134	5993	4512
8	729	5964	4026
12	811	5590	4249
16	99	6001	3969

TRAVTIME

Here is a summary of TRAVTIME, the commuting distance to work, in bins of 10 minutes:

            0     10     20     30     40    50    60   70   80   90 100 110 120 130 140  Sum
count   550.0 1036.0 1788.0 2023.0 1531.0 784.0 305.0 94.0 33.0 11.0   2   1   1   1   1 8161
percent   6.7   12.7   21.9   24.8   18.8   9.6   3.7  1.2  0.4  0.1   0   0   0   0   0  100

Here is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   5.00   22.00   33.00   33.49   44.00  142.00   15.91    0.45    3.67

The distribution has a slight positive skew. The subset of insureds with no accidents have a higher proportion of individuals with short commute times. In the scatterplot below, we notice a generally positive relationship between log odds and TRAVETIME. Some of the curvature in the loess curve is likely driven by small sample sizes for long commute times.

Finally, we see a slight, upward curvature in the log of TARGET_AMT for long commute times. Judging from the loess curve confidence interval, this upward trend may not be statistically significant.

TRAVBIN	ct	mean_cost	median_cost
0	214	5395	4539
15	596	5846	3998
30	772	5671	4163
45	451	5516	3964
60	104	6470	4376
75	15	6169	4579
90	1	6909	6909

TIF

Here is a distribution of time in force values in bins of 2 year increments:

           0     2      4    6     8     10  12    14    16   18   20 22 24  Sum
count   2533 430.0 1294.0 1961 285.0 1022.0 323 109.0 148.0 32.0 19.0  3  2 8161
percent   31   5.3   15.9   24   3.5   12.5   4   1.3   1.8  0.4  0.2  0  0  100

Here is the statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   1.00    1.00    4.00    5.35    7.00   25.00    4.15    0.89    3.42

The distribution is somewhat positively skewed. We see somewhat small sub-samples for TIF values of 2-3 and 8-9. Also, the log odds vs. TIF scatterplot suggests a quadratic relationship.

Finally, there does not seem to be a significant relationship between log of TARGET_AMT and TIF.

TIFBIN	ct	mean_cost	median_cost
0	912	5858	4102
4	823	5609	4049
8	289	5518	4220
12	91	5764	4061
16	35	5257	4868
20	3	5025	5610

CAR_AGE

The predictor CAR_AGE describes the age in years of the insured’s car. Below we bin the data in increments of three:

        -3      0     3      6    9     12    15    18   21   24 27  Sum
count    1 1949.0 494.0 1512.0 1455 1035.0 720.0 369.0 96.0 18.0  2 7651
percent  0   25.5   6.5   19.8   19   13.5   9.4   4.8  1.3  0.2  0  100

Here is the statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
  -3.00    1.00    8.00    8.33   12.00   28.00  510.00    5.70    0.28    2.25

There is an observation that indicates a CAR_AGE value of -3. This is clearly an error. Also, there are 510 missing observations, representing 6% of total records. 1934 out of the 8161 (25%) of the records have a value of one.

Surprisingly, CAR_AGE appears to be negatively correlated with the log odds of a claim. Perhaps there is a confounding variable responsible for this result.

Finally, we expected to see a stronger relationship between CAR_AGE and the log of TARGET_AMT.
Our intuition was that payouts go down with the age of the car, as replacement value goes down with age. However, the bottom right scatter plot below does not seem to indicate a significant association.

CAR_AGE	ct	mean_cost	median_cost
-4	1	1469	1469
0	646	5824	4107
4	392	5884	4224
8	516	5789	4088
12	261	5431	4274
16	158	5327	4076
20	34	5438	3150
24	3	4480	3949

2.3.0.1 Binary Variables

PARENT1

Here is a record summary for PARENT1, a binary variable indicating if an insured is a single parent.

            No    Yes  Sum
count   7084.0 1077.0 8161
percent   86.8   13.2  100

The vast majority (87%) of individuals in the training data are not single parents.

Let’s explore the relationship with TARGET_FLAG

       TARGET_FLAG
PARENT1    0    1  Sum
    No  5407 1677 7084
    Yes  601  476 1077
    Sum 6008 2153 8161

Now let’s review the proportions of individuals involved in a crash, given PARENT1 status:

       TARGET_FLAG
PARENT1    0    1
    No  0.76 0.24
    Yes 0.56 0.44

The is a 20% difference in the calculated proportions. This difference is statistically significant:


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 201.7, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1734351 0.2370404
sample estimates:
   prop 1    prop 2 
0.7632693 0.5580316

Finally, lets review the relationship of PARENT1 with TARGET_AMT.

PARENT1	ct	mean_cost	median_cost
No	1677	5603	4101
Yes	476	6050	4133

The distribution of log amounts is pretty similar for both values of PARENT1, with perhaps a very modest increase for single parent insureds.

MSTATUS

The predictor, MSTATUS, refers to the married status of the policyholder.

          No  Yes  Sum
count   3267 4894 8161
percent   40   60  100

There is a fairly balanced split (60/40) between married and single insureds.

Let’s look at the relationship with TARGET_FLAG

       TARGET_FLAG
MSTATUS    0    1  Sum
    No  2167 1100 3267
    Yes 3841 1053 4894
    Sum 6008 2153 8161

Now let’s review the proportions of individuals involved in a crash, given MSTATUS status:

       TARGET_FLAG
MSTATUS    0    1
    No  0.66 0.34
    Yes 0.78 0.22

The 12% difference in proportions is statistically significant:


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 148.38, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1416726 -0.1014053
sample estimates:
   prop 1    prop 2 
0.6632997 0.7848386

Now lets explore the relationship of MSTATUS with TARGET_AMT.

MSTATUS	ct	mean_cost	median_cost
No	1100	5967	4098
Yes	1053	5426	4117

The distribution of log amounts is very similar across MSTATUS type. Non married individuals appear to have a slightly higher average log cost compared to the married cohort. However, median costs are almost identical between the two cohorts.

SEX

The variable, SEX, denotes the gender of the insured policyholder.

             F      M  Sum
count   4375.0 3786.0 8161
percent   53.6   46.4  100

The split between males and females is split almost 50/50.

Let’s review the relationship with TARGET_FLAG.

     TARGET_FLAG
SEX      0    1  Sum
  F   3183 1192 4375
  M   2825  961 3786
  Sum 6008 2153 8161

Here are the proportions of individuals involved in a crash, given gender type:

   TARGET_FLAG
SEX    0    1
  F 0.73 0.27
  M 0.75 0.25

The 2% difference in proportions is not statistically significant:


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 3.5307, df = 1, p-value = 0.06024
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.0380106016  0.0007561151
sample estimates:
   prop 1    prop 2 
0.7275429 0.7461701

Now lets explore the relationship of SEX with TARGET_AMT.

SEX	ct	mean_cost	median_cost
F	1192	5344	4039
M	961	6147	4220

The log dollar cost of claims seems to be slightly higher for males, on average.

CAR_USE

CAR_USE is a predictor that indicates how the vehicle is uses (i.e. commercial or private purposes).

        Commercial Private  Sum
count       3029.0  5132.0 8161
percent       37.1    62.9  100

The majority of the observations involve private use vehicles, at 60%.

Let’s explore the relationship with TARGET_FLAG.

            TARGET_FLAG
CAR_USE         0    1  Sum
  Commercial 1982 1047 3029
  Private    4026 1106 5132
  Sum        6008 2153 8161

Below are the proportions of individuals involved in a crash, given the CAR_USE indicator:

            TARGET_FLAG
CAR_USE         0    1
  Commercial 0.65 0.35
  Private    0.78 0.22

There is a 13% difference in proportions between the two cohorts. This result is statistically significant.


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 165.45, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1507428 -0.1095535
sample estimates:
   prop 1    prop 2 
0.6543414 0.7844895

Finally, lets explore the relationship of CAR_USE with TARGET_AMT.

CAR_USE	ct	mean_cost	median_cost
Commercial	1047	6099	4193
Private	1106	5327	4037

The log dollar cost of claims involving commercial transit appears to somewhat higher than costs associated with private transportation.

RED_CAR

RED_CAR is a binary predictor indicating whether a care is primarily red in color.

            no    yes  Sum
count   5783.0 2378.0 8161
percent   70.9   29.1  100

We see only 30% of vehicles in the red category.

Let’s look at the relationship with TARGET_FLAG.

       TARGET_FLAG
RED_CAR    0    1  Sum
    no  4246 1537 5783
    yes 1762  616 2378
    Sum 6008 2153 8161

Here are the proportions of individuals involved in a crash, given the RED_CAR status:

       TARGET_FLAG
RED_CAR    0    1
    no  0.73 0.27
    yes 0.74 0.26

There is only 1% difference in proportions between red and non-red cars. The result is not statistically significant.


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 0.35996, df = 1, p-value = 0.5485
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.02800322  0.01452763
sample estimates:
   prop 1    prop 2 
0.7342210 0.7409588

Now, we’ll explore the relationship between RED_CAR and TARGET_AMT.

RED_CAR	ct	mean_cost	median_cost
no	1537	5568	4112
yes	616	6036	4082

The distribution for each cohort is very similar, with an subtle uptick in average log costs for red car types.

REVOKED

The variable REVOKED indicates whether a driver’s license has been suspended within the last seven years.

            No    Yes  Sum
count   7161.0 1000.0 8161
percent   87.7   12.3  100

Only 12% of drivers in the training data have a former license suspension on record.

Here is a look look at the relationship of REVOKED with TARGET_FLAG.

       TARGET_FLAG
REVOKED    0    1  Sum
    No  5451 1710 7161
    Yes  557  443 1000
    Sum 6008 2153 8161

Below are the proportions of individuals involved in a crash, given the REVOKED status:

       TARGET_FLAG
REVOKED    0    1
    No  0.76 0.24
    Yes 0.56 0.44

There is statistically significant difference in proportions by REVOKED type, with a 20% observed difference in the training data.


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 187.35, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1713042 0.2371089
sample estimates:
   prop 1    prop 2 
0.7612065 0.5570000

Finally, let’s explore the relationship between REVOKED and TARGET_AMT.

REVOKED	ct	mean_cost	median_cost
No	1710	5848	4059
Yes	443	5140	4254

The distribution for each cohort is very similar. In the training data, we very slight increase in average log costs for folks that have not had their license suspended. However, the median log cost seems to be slightly higher for folks with a prior suspended license.

URBANICITY

The predictor URBANICITY indicates whether the environment in which the driver primarily drives: urban vs. rural.

        Highly Rural/ Rural Highly Urban/ Urban  Sum
count                1669.0              6492.0 8161
percent                20.5                79.5  100

Only 30% of drivers in the training data are categorized as being in a rural environment. .

Let’s look at the relationship of URBANICITY with TARGET_FLAG.

                     TARGET_FLAG
URBANICITY               0    1  Sum
  Highly Rural/ Rural 1554  115 1669
  Highly Urban/ Urban 4454 2038 6492
  Sum                 6008 2153 8161

Her are the proportions of individuals involved in a crash, given the URBANICITY category :

                     TARGET_FLAG
URBANICITY               0    1
  Highly Rural/ Rural 0.93 0.07
  Highly Urban/ Urban 0.69 0.31

There is a huge difference, 24%, between the two proportions. Urban drivers appear to be significantly more at risk for being involved in a crash.


    2-sample test for equality of proportions with continuity correction

data:  tbl[1:2, 1:2]
X-squared = 409.14, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
 0.2280583 0.2619842
sample estimates:
   prop 1    prop 2 
0.9310965 0.6860752

Let’s review the relationship between URBANICITY and TARGET_AMT.

URBANICITY	ct	mean_cost	median_cost
Highly Rural/ Rural	115	5545	3589
Highly Urban/ Urban	2038	5711	4125

Urban drivers tend to have somewhat higher claim costs, given a crash. There also appears to be more variability in claim costs for urban drivers compared to their rural counterparts.

2.4 Multinomial Categorical Variables

JOB

The variable JOB indicates the insured’s job category. there is also blank category which we interpret to mean unknown or not working.

              Blue Collar Clerical Doctor Home Maker Lawyer Manager Professional Student  Sum
count   526.0      1825.0   1271.0    246      641.0  835.0   988.0       1117.0   712.0 8161
percent   6.4        22.4     15.6      3        7.9   10.2    12.1         13.7     8.7  100

Here is a breakdown of job categories by TARGET_FLAG:

              TARGET_FLAG
JOB               0    1  Sum
                390  136  526
  Blue Collar  1191  634 1825
  Clerical      900  371 1271
  Doctor        217   29  246
  Home Maker    461  180  641
  Lawyer        682  153  835
  Manager       851  137  988
  Professional  870  247 1117
  Student       446  266  712
  Sum          6008 2153 8161

Let’s calculate the proportion of observations by job category in each TARGET_FLAG indicator:

              TARGET_FLAG
JOB               0    1
               0.74 0.26
  Blue Collar  0.65 0.35
  Clerical     0.71 0.29
  Doctor       0.88 0.12
  Home Maker   0.72 0.28
  Lawyer       0.82 0.18
  Manager      0.86 0.14
  Professional 0.78 0.22
  Student      0.63 0.37

There appears to be significant variability in the probability of a claim by occupational category, with students and blue collar jobs leading the pack.

We reject the null hypothesis that all proportions are identical:


    9-sample test for equality of proportions without continuity correction

data:  tbl[1:9, 1:2]
X-squared = 261.06, df = 8, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4    prop 5    prop 6    prop 7    prop 8    prop 9 
0.7414449 0.6526027 0.7081039 0.8821138 0.7191888 0.8167665 0.8613360 0.7788720 0.6264045

JOB	ct	mean_cost	median_cost
	136	6904	4155
Blue Collar	634	5890	4042
Clerical	371	5446	4144
Doctor	29	4896	4117
Home Maker	180	4951	3612
Lawyer	153	5991	4019
Manager	137	4944	4256
Professional	247	6560	4348
Student	266	5021	4188

There is moderate variability in log costs across job categories, with the unknown, student, and professional, and manager categories appearing to have higher median costs than than other five categories. The relationship are somewhat different though when viewing mean costs (or mean log costs) by category.

CAR_TYPE

The predictor CAR_TYPE indicates the insured’s vehicle type.

        Minivan Panel Truck Pickup Sports Car    SUV   Van  Sum
count    2145.0       676.0   1389      907.0 2294.0 750.0 8161
percent    26.3         8.3     17       11.1   28.1   9.2  100

Here is a breakdown of CAR_TYPE categories by TARGET_FLAG:

             TARGET_FLAG
CAR_TYPE         0    1  Sum
  Minivan     1796  349 2145
  Panel Truck  498  178  676
  Pickup       946  443 1389
  Sports Car   603  304  907
  SUV         1616  678 2294
  Van          549  201  750
  Sum         6008 2153 8161

We’ll now calculate the proportion of observations by CAR_TYPE category in each TARGET_FLAG indicator:

             TARGET_FLAG
CAR_TYPE         0    1
  Minivan     0.84 0.16
  Panel Truck 0.74 0.26
  Pickup      0.68 0.32
  Sports Car  0.66 0.34
  SUV         0.70 0.30
  Van         0.73 0.27

There is significant variability in the probability of a claim by occupational category, with sports cars having the highest proportion of accidents, and minivan have the lowest proportion.

We reject the null hypothesis that all proportions are identical:


    6-sample test for equality of proportions without continuity correction

data:  tbl[1:6, 1:2]
X-squared = 170.38, df = 5, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4    prop 5    prop 6 
0.8372960 0.7366864 0.6810655 0.6648291 0.7044464 0.7320000

CAR_TYPE	ct	mean_cost	median_cost
Minivan	349	5602	4023
Panel Truck	178	7465	4556
Pickup	443	5430	4147
Sports Car	304	5413	4023
SUV	678	5241	4032
Van	201	6909	4220

Panel trucks and vans appear to have significantly higher log claim costs compared to other car types.

2.5 Ordinal Categorical Variables

EDUCATION

The variable EDUCATION denotes the insured’s highest level of education attained.

        High School Bachelors Masters   PhD  Sum
count        3533.0    2242.0  1658.0 728.0 8161
percent        43.3      27.5    20.3   8.9  100

Below is a breakdown of EDUCATION categories by TARGET_FLAG:

             TARGET_FLAG
EDUCATION        0    1  Sum
  High School 2355 1178 3533
  Bachelors   1719  523 2242
  Masters     1331  327 1658
  PhD          603  125  728
  Sum         6008 2153 8161

Let’s review the proportion of observations by EDUCATION category in each TARGET_FLAG indicator:

             TARGET_FLAG
EDUCATION        0    1
  High School 0.67 0.33
  Bachelors   0.77 0.23
  Masters     0.80 0.20
  PhD         0.83 0.17

There are statistically significant differences between the categories, with higher levels of educational attainment associated with lower probabilities of crashes.


    4-sample test for equality of proportions without continuity correction

data:  tbl[1:4, 1:2]
X-squared = 168.58, df = 3, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4 
0.6665723 0.7667261 0.8027744 0.8282967

EDUCATION	ct	mean_cost	median_cost
High School	1178	5451	4079
Bachelors	523	5883	4036
Masters	327	5966	4256
PhD	125	6623	4395

Interestingly, the TARGET_AMT appears to have a positive association with the level of EDUCATION.

2.6 Continuous Variables

INCOME

Below is table of values of the INCOME predictor, with wage bucketed into $30K increments (i.e. [0,30), [30,60), [60,90), etc.])

             0     30     60    90   120   150   180  210  240  270 300 330 360  Sum
count   2059.0 2206.0 1671.0 951.0 417.0 203.0 117.0 56.0 20.0 11.0   3   1   1 7716
percent   26.7   28.6   21.7  12.3   5.4   2.6   1.5  0.7  0.3  0.1   0   0   0  100

Here is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
   0.00   28.10   54.03   61.90   85.99  367.03  445.00   47.57    1.19    5.13

There are some missing values, 445 (5.4% of total), that we’ll address later.

The distribution of INCOME is right skewed, with a significant number of observations indicating $0 in income.

Individuals involved in crashes appear to have a high proportion of low-wage earners. We see the log-odds of a crash decreasing with increases to income–see the lower left scatter plot and loess curve. We see the loess curve starting to bend upward around $210k, although this phenomenon is possibly due to sparse data for high wage earners.

Finally, the relationship between income and log TARGET_AMT also does not appear to be very strong.

INCBIN	ct	mean_cost	median_cost
0	695	5102	4045
30	654	5909	4299
60	404	6053	3988
90	163	6603	4254
120	66	6107	3856
150	31	5671	4738
180	10	4091	3940
210	14	6020	5402
240	1	4104	4104
270	4	3100	3007
300	1	3231	3231

HOME_VAL

Below is table of values of the HOME_VAL predictor, with home appraisals bucketed into $50K increments.

             0    50   100    150    200   250   300   350   400  450  500  550 600 650 700 750 850  Sum
count   2294.0 371.0 902.0 1265.0 1181.0 751.0 435.0 208.0 128.0 85.0 43.0 21.0 7.0   3   1   1   1 7697
percent   29.8   4.8  11.7   16.4   15.3   9.8   5.7   2.7   1.7  1.1  0.6  0.3 0.1   0   0   0   0  100

Below is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's    StdD    Skew    Kurt 
   0.00    0.00  161.16  154.87  238.72  885.28  464.00  129.12    0.49    2.98

There are some missing values, 464 (5.7% of total). We’ll address these later.

The distribution of HOME_VAL is right skewed, with a large proportion of observations indicating $0 in HOME_VAL–this probably refers to policyholders who are not homeowners.

Individuals involved in crashes have a higher proportion of low home values. The loess curve in the lower left scatterplot indicates a decreasing log odds of claim occurrence with increasing incomes. The upward trend in the curve around $400k is possibly due to the the smaller sample size for higher home values.

Finally, the relationship between HOME_VAL and log TARGET_AMT also does not appear to be very strong.

HOMEVALBIN	ct	mean_cost	median_cost
0	846	5550	4006
50	111	4591	4400
100	247	5233	4185
150	328	6004	4280
200	264	6086	4015
250	115	5573	4093
300	59	6990	4021
350	24	9735	4071
400	14	4011	3868
450	14	4458	3981
500	3	5917	5455
550	3	4756	4435
600	2	3540	3540
650	1	2908	2908
750	1	3231	3231

BLUEBOOK

Below is table of values of the BLUEBOOK predictor, with values bucketed into $4K increments.

            0      4      8   12     16    20    24    28    32 36   40   44  48 56 60 64 68  Sum
count   281.0 1327.0 1503.0 1548 1238.0 903.0 628.0 390.0 190.0 83 45.0 14.0 6.0  1  2  1  1 8161
percent   3.4   16.3   18.4   19   15.2  11.1   7.7   4.8   2.3  1  0.6  0.2 0.1  0  0  0  0  100

Here is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   1.50    9.28   14.44   15.71   20.85   69.74    8.42    0.79    3.79

The distribution of BLUEBOOK is right skewed, like the continuous variables reviewed earlier.

Individuals involved in crashes have a higher proportion of low BLUEBOOK values. The loess curve in the lower left scatterplot indicates a decreasing log odds of claim occurrence with increasing incomes. However, there is a prominent bend in the curve around $30k. Perhaps this upward bend indicates risky driving behavior by owners of luxury and sports cars.

Finally, the relationship between BLUEBOOK and log TARGET_AMT reveals an increase in payouts with the value of the auto, but the curve flattens out around $20k - $30k.

BLUEBKBIN	ct	mean_cost	median_cost
0	171	3680	3423
5	614	5131	4196
10	504	5463	4051
15	373	6148	4164
20	267	7177	4256
25	123	6224	4369
30	57	6180	4321
35	26	11346	5308
40	13	6798	3041
45	2	3846	3846
50	1	18084	18084
55	1	3231	3231
60	1	4104	4104

OLDCLAIM

Below is table of values of the OLDCLAIM variable, with values bucketed into $4K increments.

           0  Sum
count   8161 8161
percent  100  100

Below is a statistical summary:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    StdD    Skew    Kurt 
   0.00    0.00    0.00    4.04    4.64   57.04    8.78    3.12   12.86

The distribution of OLDCLAIM is extremely right skewed.

There does not appear to be a clear relationship between OLDCLAIM and log of TARGET_CLM. High previous claim amounts seem to be positively associated with the log odds of a future claim. However, this relationship appears to take a quadratic form.

OLD_BIN	ct	mean_cost	median_cost
0	1395	5814	4148
5	452	5409	4025
10	62	6347	4186
15	28	4712	3318
20	33	4865	3764
25	49	5060	4460
30	53	4898	4262
35	33	6875	3779
40	29	6922	4477
45	16	5350	3299
50	2	3070	3070
55	1	7803	7803

3 Data Preparation

3.1 Missing Values

The following five variables have missing variables:

AGE: 6 missing values (0.07%)
YOJ: 454 missing values (5.6%)
CAR_AGE: 510 (6.2%)
INCOME: 445 (5.4%)
HOME_VAL: 464 (5.7%)

Because four of the predictors are missing a significant number of records (5%-6%), we will attempt to use a sophisticated imputation procedure from R’s MICE package to fill in the missing values. Our goal in using this procedure is to minimize the introduction of bias in our data vis-a-vis simpler methods like mean substitution.

One way to assess the quality of the imputation procedure used in MICE is to compare the distribution of the imputed data to the distribution of the non-missing values. Let’s do that now by reviewing density plots:
With the exception of the AGE variable, the imputed variables seem to reasonably approximate the original distribution.

we’re only missing a handful of values for AGE; so we will apply simply mean imputation for this variable.

Finally, we’ll assume the -3 value for CAR_AGE is actually zero.

4 Data Transformation

TARGET_FLAG
No changes.

TARGET_AMT

The response variable TARGET_AMT is highly right skewed. We will use the box-cox procedure to suggest a reasonable transformation when the TARGET_AMT variable is positive:

Fitted parameters:
     lambda        beta     sigmasq 
0.003172095 8.386645882 0.695511168 

Convergence code returned by optim: 0

Given this output, We will apply the log transformation.

KIDSDRIV

Given the possible curvature between the log odds and KIDSDRIV, we’re going to introduce a centered version of this variable to reduce possible collinearity issues.

HOMEKIDS

We will create another centered version of this variable due to a possible quadratic relationship with the response variable, TARGET_FLAG. Again, our intent with this transformation is avoid multicollinearity issues.

CLM_FREQ

Once again, we’ll center the variable due concerns described earlier.

MVR_PTS

No changes.

AGE

Age appears to have a quadratic relationship with both TARGET_FLAG and TARGET_AMT. We’ll center the variable.

YOJ

The variable YOJ has a significant number of zero observations. We’ll create a mean centered variable, YOJ_MOD, that will be used for years on job of 1 or higher.

TRAVTIME
Well apply the mean-center transformation.

TIF

We’ll mean-center the variable to reduce multicollinearity issues if we implement a quadratic term.

CAR_AGE

No changes.

PARENT1

No changes.

MSTATUS

No changes.

SEX

No changes.

CARUSE

No changes.

RED_CAR

No changes.

REVOKED

No changes.

URBANICITY

No changes.

JOB

No changes.

CAR_TYPE

No changes.

EDUCATION

No changes.

INCOME

Income is a positively skewed variable with a significant number zeroes.

We will apply the square root transformation suggested by the box-cox procedure to the original variable to reduce the overall skew.

Fitted parameters:
    lambda       beta    sigmasq 
 0.4298838 11.0283525 17.8086126 

Convergence code returned by optim: 0

HOME_VAL

Home values are also moderately right skewed with a significant number of zeroes.
We’ll apply a quarter root transformation to the original variable to reduce the overall skew.

Fitted parameters:
   lambda      beta   sigmasq 
0.1951741 9.3491718 1.5206145 

Convergence code returned by optim: 0

BLUEBOOK

The BLUEBOOK variable has a moderate right skew. We’ll apply the square root transformation suggested by the box-cox procedure.

Fitted parameters:
   lambda      beta   sigmasq 
0.4610754 5.2624962 3.7967126 

Convergence code returned by optim: 0

OLDCLAIM

OLDCLAIM is extremely right skewed. We’ll apply a log(x+1) transformation to reduce the overall skew.

Fitted parameters:
     lambda        beta     sigmasq 
-0.04511237  1.76185840  0.82136055 

Convergence code returned by optim: 0

5 Build Models

5.0.0.1 Binary Logistic Regression

Model 1: Manual Variable Selection, Linear Terms Only

Based on our data exploration, we believe the following variables could be relevant in predicted whether or not a claim occurs:

KIDSDRIV
HOMEKIDS
CLM_FREQ
MVR_PTS
AGE
YOJ
TRAVTIME
TIF
CAR_AGE
PARENT1
MSTATUS
CAR_USE
REVOKED
URBANICITY
JOB
CAR_TYPE
EDUCATION
INCOME
HOMEVAL
BLUEBOOK
OLDCLAIM

Granted, we haven’t narrowed down the list much.

Before moving forward, let’s calculate variance inflation factors, including all potential predictors in our model.

                   GVIF Df GVIF^(1/(2*Df))
PARENT1        1.927927  1        1.388498
MSTATUS        2.258214  1        1.502735
EDUCATION      9.204331  3        1.447656
JOB           31.325910  8        1.240206
CAR_USE        2.227957  1        1.492634
CAR_TYPE       2.443471  5        1.093454
REVOKED        1.148154  1        1.071520
MVR_PTS        1.194820  1        1.093078
CAR_AGE        2.123064  1        1.457074
URBANICITY     1.147162  1        1.071057
KIDSDRIV_MOD   1.347203  1        1.160691
HOMEKIDS_MOD   2.183075  1        1.477523
CLM_FREQ_MOD   2.350866  1        1.533253
AGE_MOD        1.455622  1        1.206491
YOJ_MOD        1.608639  1        1.268321
TRAVTIME_MOD   1.039894  1        1.019752
TIF_MOD        1.009853  1        1.004915
INCOME_MOD     3.468635  1        1.862427
HOME_VAL_MOD   1.926043  1        1.387819
BLUEBOOK_MOD   1.689897  1        1.299960
OLD_CLAIM_MOD  2.534623  1        1.592050

Seeing no major VIF issues–once accounting for degrees of freedom–we will proceed with the model with all suggested predictors.


Call:
glm(formula = TARGET_FLAG ~ . - TARGET_AMT - TARGET_AMT_MOD - 
    INDEX - KIDSDRIV - HOMEKIDS - CLM_FREQ - AGE - YOJ - TRAVTIME - 
    TIF - SEX - RED_CAR - INCOME - HOME_VAL - BLUEBOOK - OLDCLAIM - 
    INCOME_30 - HOME_VAL_30 - BLUEBOOK_2 - OLDCLAIM_4, family = "binomial", 
    data = auto)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6137  -0.7132  -0.3982   0.6166   3.1592  

Coefficients:
                               Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.872686   0.282675  -6.625  0.00000000003475425 ***
PARENT1Yes                     0.370690   0.109692   3.379             0.000727 ***
MSTATUSYes                    -0.474052   0.087739  -5.403  0.00000006554066899 ***
EDUCATION.L                   -0.078343   0.149372  -0.524             0.599945    
EDUCATION.Q                    0.233170   0.086835   2.685             0.007249 ** 
EDUCATION.C                   -0.089875   0.081919  -1.097             0.272589    
JOBBlue Collar                 0.308221   0.184894   1.667             0.095512 .  
JOBClerical                    0.347545   0.197037   1.764             0.077756 .  
JOBDoctor                     -0.413664   0.266212  -1.554             0.120211    
JOBHome Maker                 -0.053576   0.220427  -0.243             0.807961    
JOBLawyer                      0.127853   0.168679   0.758             0.448470    
JOBManager                    -0.545841   0.170920  -3.194             0.001405 ** 
JOBProfessional                0.177985   0.177883   1.001             0.317032    
JOBStudent                    -0.170969   0.226935  -0.753             0.451218    
CAR_USEPrivate                -0.779822   0.087565  -8.906 < 0.0000000000000002 ***
CAR_TYPEPanel Truck            0.583726   0.146384   3.988  0.00006673268440400 ***
CAR_TYPEPickup                 0.550812   0.100174   5.499  0.00000003829305886 ***
CAR_TYPESports Car             0.948214   0.108249   8.760 < 0.0000000000000002 ***
CAR_TYPESUV                    0.718723   0.086067   8.351 < 0.0000000000000002 ***
CAR_TYPEVan                    0.657014   0.121696   5.399  0.00000006707980041 ***
REVOKEDYes                     0.748190   0.085838   8.716 < 0.0000000000000002 ***
MVR_PTS                        0.110159   0.013843   7.958  0.00000000000000175 ***
CAR_AGE                       -0.001176   0.007502  -0.157             0.875411    
URBANICITYHighly Urban/ Urban  2.411637   0.113224  21.300 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.392587   0.061248   6.410  0.00000000014567496 ***
HOMEKIDS_MOD                   0.044878   0.037243   1.205             0.228204    
CLM_FREQ_MOD                   0.170681   0.036265   4.707  0.00000251988421409 ***
AGE_MOD                       -0.001631   0.004007  -0.407             0.684036    
YOJ_MOD                       -0.001673   0.008759  -0.191             0.848497    
TRAVTIME_MOD                   0.014751   0.001886   7.819  0.00000000000000532 ***
TIF_MOD                       -0.055559   0.007352  -7.557  0.00000000000004129 ***
INCOME_MOD                    -0.095766   0.016269  -5.886  0.00000000394563137 ***
HOME_VAL_MOD                  -0.090138   0.022401  -4.024  0.00005726974387333 ***
BLUEBOOK_MOD                  -0.177734   0.035205  -5.049  0.00000044508752803 ***
OLD_CLAIM_MOD                 -0.030521   0.039102  -0.781             0.435073    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7281.9  on 8126  degrees of freedom
AIC: 7351.9

Number of Fisher Scoring iterations: 5

The signs of the coefficients mostly make sense:

We expect claim probabilities to be higher for single parents.
Married individuals should be less prone to accidents compared to singles.
The signs of all the EDUCATION levels make sense; however, we expected the Master’s and PhD levels to show a greater reduction in log odds relative to the high school reference level. We are not too concerned with this result though, as other variables such as JOB_TYPE and INCOME help in defining an individual’s composite risk profile.
We didn’t have solid a priori expectations about the relationship between different job type. We are not surprised with most of these result though. For instance, we are not surprised by the increased risk associated with blue collar workers relative to doctors.
The model is consistent with our expectation that private vehicles are less accident prone compared to commercial vehicles.
The car type signs and magnitudes are fairly consistent with our expectations: Minivans(the reference level) should be safer than the other vehicles. Sports cars should be most likely to be involved in an accident. This is what we see.
A revoked license is highly indicative of future accident risk.
MVR points are positively associated with accident risk.
The model shows a slight negative relationship between car age and log odds of an accident, but the result is not statistically significant.
Our model shows much greater risk in urban areas compared to rural geographies. This is consistent with our expectation.
The number of teenage drivers impacts risk unfavorably, as our model shows.
Our model indicates that more children can adversely impact claims risk. We didn’t have firm expectations about this variable’s influence. More important, our model does not indicate a statistically significant result.
Our model reveals claim risk declining with age. This is not necessarily surprising; however, the model coefficient is not statistically significant.
Our model shows risk increasing slightly with increases to YOB. This is maybe a strange result, but our model indicates a very high p-value for our coefficient.
We expect risk to increase with longer travel times, as our model shows.
We expect loyal policyholders to be less risky than frequently churning policyholders, as our model indicates.
Our model indicates decreasing log odds with increases to home value. This is what we would expect.
We are not surprised to see cars with high BLUEBOOK values being associated with reduced risk in our model compared to lower values.
The coefficient for prior claims cost is opposite of what we would expect. However, the coefficient has a high associated p-value; so we may drop this predictor from our model.

Let’s clean up our model by removing some of the statistically insignificant predictors. We will leave all JOB levels in our model, as the “Manager” and “Clerical” coefficients are highly significant, and two additional levels have p-values below 10%.


Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB + 
    CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + URBANICITY + KIDSDRIV_MOD + 
    CLM_FREQ_MOD + TRAVTIME_MOD + TIF_MOD + INCOME_MOD + HOME_VAL_MOD + 
    BLUEBOOK_MOD, family = "binomial", data = auto)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6328  -0.7136  -0.3986   0.6135   3.1439  

Coefficients:
                               Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.918503   0.267796  -7.164  0.00000000000078332 ***
PARENT1Yes                     0.455054   0.094479   4.816  0.00000146139638217 ***
MSTATUSYes                    -0.442419   0.084012  -5.266  0.00000013933679907 ***
EDUCATION.L                   -0.093724   0.137261  -0.683              0.49472    
EDUCATION.Q                    0.236564   0.084790   2.790              0.00527 ** 
EDUCATION.C                   -0.085934   0.081180  -1.059              0.28980    
JOBBlue Collar                 0.310997   0.184778   1.683              0.09236 .  
JOBClerical                    0.355083   0.196670   1.805              0.07100 .  
JOBDoctor                     -0.421711   0.265747  -1.587              0.11254    
JOBHome Maker                 -0.045799   0.218566  -0.210              0.83402    
JOBLawyer                      0.119301   0.168323   0.709              0.47847    
JOBManager                    -0.555839   0.170742  -3.255              0.00113 ** 
JOBProfessional                0.172429   0.177740   0.970              0.33199    
JOBStudent                    -0.146979   0.225655  -0.651              0.51482    
CAR_USEPrivate                -0.777122   0.087414  -8.890 < 0.0000000000000002 ***
CAR_TYPEPanel Truck            0.588330   0.146142   4.026  0.00005679472185962 ***
CAR_TYPEPickup                 0.550137   0.100124   5.495  0.00000003917260431 ***
CAR_TYPESports Car             0.945631   0.107956   8.759 < 0.0000000000000002 ***
CAR_TYPESUV                    0.719308   0.085983   8.366 < 0.0000000000000002 ***
CAR_TYPEVan                    0.658243   0.121557   5.415  0.00000006125629378 ***
REVOKEDYes                     0.728369   0.080414   9.058 < 0.0000000000000002 ***
MVR_PTS                        0.108765   0.013572   8.014  0.00000000000000111 ***
URBANICITYHighly Urban/ Urban  2.409081   0.113106  21.299 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.424693   0.055148   7.701  0.00000000000001351 ***
CLM_FREQ_MOD                   0.150238   0.025536   5.883  0.00000000402164286 ***
TRAVTIME_MOD                   0.014697   0.001885   7.796  0.00000000000000637 ***
TIF_MOD                       -0.055311   0.007347  -7.528  0.00000000000005149 ***
INCOME_MOD                    -0.095829   0.015629  -6.131  0.00000000087089256 ***
HOME_VAL_MOD                  -0.091463   0.022383  -4.086  0.00004385319335945 ***
BLUEBOOK_MOD                  -0.181410   0.035000  -5.183  0.00000021811943514 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7284.9  on 8131  degrees of freedom
AIC: 7344.9

Number of Fisher Scoring iterations: 5

All predictors are now statistically significant, with the exception of several of the job types. The coefficients in our model also still make sense.

Model 2: Add Quadratic Terms to Model 1

We noted earlier that there appeared to be quadratic relationship between some predictors and log odds.

Starting with our original Model 1–i.e. before we removed the insignificant predictors–we’ll add second order polynomial terms for the following variables:

KIDSDRIV
HOMEKIDS
CLM_FREQ
AGE
YOJ
TIF

Here is the summary output from model 2:


Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB + 
    CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + CAR_AGE + URBANICITY + 
    KIDSDRIV_MOD + HOMEKIDS_MOD + CLM_FREQ_MOD + AGE_MOD + YOJ_MOD + 
    TRAVTIME_MOD + TIF_MOD + INCOME_MOD + HOME_VAL_MOD + BLUEBOOK_MOD + 
    OLD_CLAIM_MOD + I(KIDSDRIV_MOD^2) + I(HOMEKIDS_MOD^2) + I(CLM_FREQ_MOD^2) + 
    I(AGE_MOD^2) + I(YOJ_MOD^2) + I(TIF_MOD^2), family = "binomial", 
    data = auto)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4212  -0.7079  -0.3896   0.5997   3.0494  

Coefficients:
                                Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.8549782  0.3023391  -6.135 0.000000000849329604 ***
PARENT1Yes                     0.2287145  0.1190337   1.921              0.05468 .  
MSTATUSYes                    -0.5472126  0.0907131  -6.032 0.000000001615991315 ***
EDUCATION.L                   -0.1110321  0.1507177  -0.737              0.46131    
EDUCATION.Q                    0.2302008  0.0871779   2.641              0.00828 ** 
EDUCATION.C                   -0.1063223  0.0824316  -1.290              0.19711    
JOBBlue Collar                 0.3377066  0.1857011   1.819              0.06898 .  
JOBClerical                    0.3711729  0.1980087   1.875              0.06086 .  
JOBDoctor                     -0.4381672  0.2666846  -1.643              0.10038    
JOBHome Maker                 -0.0171249  0.2232588  -0.077              0.93886    
JOBLawyer                      0.1293698  0.1693179   0.764              0.44483    
JOBManager                    -0.5408136  0.1711226  -3.160              0.00158 ** 
JOBProfessional                0.1900593  0.1784593   1.065              0.28688    
JOBStudent                    -0.1735835  0.2291796  -0.757              0.44880    
CAR_USEPrivate                -0.7834831  0.0881933  -8.884 < 0.0000000000000002 ***
CAR_TYPEPanel Truck            0.6238633  0.1471955   4.238 0.000022518613851542 ***
CAR_TYPEPickup                 0.5677139  0.1009806   5.622 0.000000018874685247 ***
CAR_TYPESports Car             0.8853900  0.1094746   8.088 0.000000000000000608 ***
CAR_TYPESUV                    0.7106989  0.0868341   8.185 0.000000000000000273 ***
CAR_TYPEVan                    0.6855025  0.1226882   5.587 0.000000023055663145 ***
REVOKEDYes                     0.8259645  0.0887815   9.303 < 0.0000000000000002 ***
MVR_PTS                        0.0971609  0.0141218   6.880 0.000000000005976967 ***
CAR_AGE                       -0.0012267  0.0075548  -0.162              0.87102    
URBANICITYHighly Urban/ Urban  2.3933908  0.1135895  21.071 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.7280955  0.1425080   5.109 0.000000323602039340 ***
HOMEKIDS_MOD                   0.0617138  0.0685642   0.900              0.36807    
CLM_FREQ_MOD                   0.3677199  0.0737840   4.984 0.000000623693042781 ***
AGE_MOD                       -0.0020003  0.0041442  -0.483              0.62932    
YOJ_MOD                       -0.0006137  0.0110557  -0.056              0.95573    
TRAVTIME_MOD                   0.0148435  0.0018995   7.815 0.000000000000005515 ***
TIF_MOD                       -0.0666080  0.0086300  -7.718 0.000000000000011795 ***
INCOME_MOD                    -0.0829304  0.0168293  -4.928 0.000000831917394078 ***
HOME_VAL_MOD                  -0.0887297  0.0225581  -3.933 0.000083759903483572 ***
BLUEBOOK_MOD                  -0.1860283  0.0354683  -5.245 0.000000156350347101 ***
OLD_CLAIM_MOD                 -0.1226506  0.0492985  -2.488              0.01285 *  
I(KIDSDRIV_MOD^2)             -0.1387882  0.0690188  -2.011              0.04434 *  
I(HOMEKIDS_MOD^2)             -0.0328246  0.0281772  -1.165              0.24405    
I(CLM_FREQ_MOD^2)             -0.0875831  0.0282607  -3.099              0.00194 ** 
I(AGE_MOD^2)                   0.0022268  0.0002811   7.922 0.000000000000002339 ***
I(YOJ_MOD^2)                   0.0012227  0.0016034   0.763              0.44570    
I(TIF_MOD^2)                   0.0031365  0.0013927   2.252              0.02431 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7198.7  on 8120  degrees of freedom
AIC: 7280.7

Number of Fisher Scoring iterations: 5

Below our key results from this second model:

The first and second order terms for KIDSDRIV are both statistically significant. The negative sign of the second order term makes sense: the unfavorable impact of adding additional children drivers diminishes with each subsequent child.
Neither the first nor second order terms are significant for HOMEKIDS.
Both CLM_FREQ terms are significant. The negative second order term indicates diminishing impact of each additional prior claim.
The second order term of AGE is statistically significant, but the first first order term is not. We’ll leave both terms in our model. The coefficients of the terms make sense. There is a primary trend of reduced risk with age–see the negative first order coefficient, but the trend diminishes and potentially reverses for higher ages–as reflected in the positive second order term.
Neither YOJ terms are statistically significant.
Both first and second order TIF terms are significant. The signs of the coefficients have an intuitive explanation: there is a primary effect of risk reduction in risk with increases to TIF but the favorable impact diminishes with higher TIF values.
CAR_AGE is still insignificant in this model.

Based on the results above, we’ll remove all HOMEKIDS and YOJ terms from our model. Here are the summary results from this modified, second model:


Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB + 
    CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + URBANICITY + KIDSDRIV_MOD + 
    CLM_FREQ_MOD + AGE_MOD + TRAVTIME_MOD + TIF_MOD + INCOME_MOD + 
    HOME_VAL_MOD + BLUEBOOK_MOD + OLD_CLAIM_MOD + I(KIDSDRIV_MOD^2) + 
    I(CLM_FREQ_MOD^2) + I(AGE_MOD^2) + I(TIF_MOD^2), family = "binomial", 
    data = auto)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4188  -0.7079  -0.3899   0.6017   3.0411  

Coefficients:
                                Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.8566640  0.2816370  -6.592 0.000000000043277015 ***
PARENT1Yes                     0.2739046  0.1017449   2.692              0.00710 ** 
MSTATUSYes                    -0.5275390  0.0855811  -6.164 0.000000000708416117 ***
EDUCATION.L                   -0.1089269  0.1383381  -0.787              0.43105    
EDUCATION.Q                    0.2326644  0.0851694   2.732              0.00630 ** 
EDUCATION.C                   -0.1037084  0.0816550  -1.270              0.20406    
JOBBlue Collar                 0.3308613  0.1855816   1.783              0.07461 .  
JOBClerical                    0.3577260  0.1975653   1.811              0.07019 .  
JOBDoctor                     -0.4363115  0.2668807  -1.635              0.10208    
JOBHome Maker                  0.0138339  0.2199564   0.063              0.94985    
JOBLawyer                      0.1258867  0.1693304   0.743              0.45722    
JOBManager                    -0.5435257  0.1710934  -3.177              0.00149 ** 
JOBProfessional                0.1872710  0.1783856   1.050              0.29381    
JOBStudent                    -0.1531559  0.2268795  -0.675              0.49964    
CAR_USEPrivate                -0.7861317  0.0880463  -8.929 < 0.0000000000000002 ***
CAR_TYPEPanel Truck            0.6224359  0.1471477   4.230 0.000023368219454604 ***
CAR_TYPEPickup                 0.5651171  0.1009060   5.600 0.000000021382163925 ***
CAR_TYPESports Car             0.8883475  0.1093351   8.125 0.000000000000000447 ***
CAR_TYPESUV                    0.7116850  0.0867565   8.203 0.000000000000000234 ***
CAR_TYPEVan                    0.6840691  0.1226023   5.580 0.000000024110109095 ***
REVOKEDYes                     0.8231669  0.0887081   9.280 < 0.0000000000000002 ***
MVR_PTS                        0.0980783  0.0141003   6.956 0.000000000003506304 ***
URBANICITYHighly Urban/ Urban  2.3910929  0.1135255  21.062 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.7850828  0.1289262   6.089 0.000000001133357267 ***
CLM_FREQ_MOD                   0.3652676  0.0737359   4.954 0.000000728054720019 ***
AGE_MOD                       -0.0025611  0.0036151  -0.708              0.47867    
TRAVTIME_MOD                   0.0148180  0.0018989   7.803 0.000000000000006031 ***
TIF_MOD                       -0.0667534  0.0086268  -7.738 0.000000000000010109 ***
INCOME_MOD                    -0.0886512  0.0157886  -5.615 0.000000019668539231 ***
HOME_VAL_MOD                  -0.0886985  0.0225545  -3.933 0.000084020987281530 ***
BLUEBOOK_MOD                  -0.1860655  0.0354614  -5.247 0.000000154612137297 ***
OLD_CLAIM_MOD                 -0.1214653  0.0492809  -2.465              0.01371 *  
I(KIDSDRIV_MOD^2)             -0.1637618  0.0651787  -2.513              0.01199 *  
I(CLM_FREQ_MOD^2)             -0.0865030  0.0282256  -3.065              0.00218 ** 
I(AGE_MOD^2)                   0.0022487  0.0002768   8.124 0.000000000000000452 ***
I(TIF_MOD^2)                   0.0031484  0.0013917   2.262              0.02368 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7201.1  on 8125  degrees of freedom
AIC: 7273.1

Number of Fisher Scoring iterations: 5

Model 3: Stepwise Regression

For our third model, we’ll implement stepwise regression, with variable selecting occurring in both directions. We’ll include all predictors (transformed versions where applicable) in our potential universe of candidates.

For simplicity, we’ll only include first order terms.


Call:
glm(formula = TARGET_FLAG ~ URBANICITY + JOB + MVR_PTS + MSTATUS + 
    CAR_TYPE + REVOKED + KIDSDRIV_MOD + INCOME_MOD + CAR_USE + 
    TIF_MOD + TRAVTIME_MOD + CLM_FREQ_MOD + BLUEBOOK_MOD + PARENT1 + 
    HOME_VAL_MOD + EDUCATION + HOMEKIDS_MOD, family = "binomial", 
    data = auto_redux)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6158  -0.7141  -0.3980   0.6181   3.1513  

Coefficients:
                               Estimate Std. Error z value             Pr(>|z|)    
(Intercept)                   -1.887080   0.268674  -7.024  0.00000000000216101 ***
URBANICITYHighly Urban/ Urban  2.408584   0.113076  21.301 < 0.0000000000000002 ***
JOBBlue Collar                 0.309851   0.184810   1.677              0.09362 .  
JOBClerical                    0.347849   0.196756   1.768              0.07707 .  
JOBDoctor                     -0.414469   0.265832  -1.559              0.11896    
JOBHome Maker                 -0.049057   0.218639  -0.224              0.82246    
JOBLawyer                      0.125372   0.168406   0.744              0.45660    
JOBManager                    -0.548824   0.170794  -3.213              0.00131 ** 
JOBProfessional                0.178743   0.177815   1.005              0.31479    
JOBStudent                    -0.164947   0.226014  -0.730              0.46551    
MVR_PTS                        0.108414   0.013576   7.986  0.00000000000000140 ***
MSTATUSYes                    -0.476489   0.087310  -5.457  0.00000004830512386 ***
CAR_TYPEPanel Truck            0.588390   0.146226   4.024  0.00005725500332016 ***
CAR_TYPEPickup                 0.551882   0.100139   5.511  0.00000003564383764 ***
CAR_TYPESports Car             0.944740   0.107966   8.750 < 0.0000000000000002 ***
CAR_TYPESUV                    0.717233   0.086011   8.339 < 0.0000000000000002 ***
CAR_TYPEVan                    0.658849   0.121591   5.419  0.00000006007329279 ***
REVOKEDYes                     0.724890   0.080457   9.010 < 0.0000000000000002 ***
KIDSDRIV_MOD                   0.389384   0.060210   6.467  0.00000000009988987 ***
INCOME_MOD                    -0.096508   0.015642  -6.170  0.00000000068430015 ***
CAR_USEPrivate                -0.779104   0.087448  -8.909 < 0.0000000000000002 ***
TIF_MOD                       -0.055518   0.007349  -7.554  0.00000000000004217 ***
TRAVTIME_MOD                   0.014755   0.001885   7.826  0.00000000000000505 ***
CLM_FREQ_MOD                   0.150203   0.025546   5.880  0.00000000410930376 ***
BLUEBOOK_MOD                  -0.179788   0.035019  -5.134  0.00000028357055432 ***
PARENT1Yes                     0.375928   0.109102   3.446              0.00057 ***
HOME_VAL_MOD                  -0.090524   0.022374  -4.046  0.00005210493016275 ***
EDUCATION.L                   -0.086834   0.137351  -0.632              0.52725    
EDUCATION.Q                    0.235675   0.084809   2.779              0.00545 ** 
EDUCATION.C                   -0.087132   0.081189  -1.073              0.28318    
HOMEKIDS_MOD                   0.049238   0.034007   1.448              0.14765    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7282.8  on 8130  degrees of freedom
AIC: 7344.8

Number of Fisher Scoring iterations: 5

 [1] "PARENT1"       "MSTATUS"       "SEX"           "EDUCATION"     "JOB"           "CAR_USE"      
 [7] "CAR_TYPE"      "RED_CAR"       "REVOKED"       "MVR_PTS"       "CAR_AGE"       "URBANICITY"   
[13] "KIDSDRIV_MOD"  "HOMEKIDS_MOD"  "CLM_FREQ_MOD"  "AGE_MOD"       "YOJ_MOD"       "TRAVTIME_MOD" 
[19] "TIF_MOD"       "INCOME_MOD"    "HOME_VAL_MOD"  "BLUEBOOK_MOD"  "OLD_CLAIM_MOD" "TARGET_FLAG"

Surprisingly, the stepwise procedure produced a model that is identical to our modified Model 1!

5.0.0.2 Multiple Linear Regression

Now We’ll model claim costs using the subset of the training data where claim costs were greater than one.

Before we build our models, let’s check for multicollinearity uses by reviewing variance inflation factors for a linear model that includes all predictors.

                   GVIF Df GVIF^(1/(2*Df))
PARENT1        2.163475  1        1.470876
MSTATUS        2.442422  1        1.562825
SEX            3.756936  1        1.938282
EDUCATION      9.707532  3        1.460556
JOB           40.971902  8        1.261190
CAR_USE        2.259463  1        1.503151
CAR_TYPE       6.325230  5        1.202562
RED_CAR        1.832475  1        1.353689
REVOKED        1.266601  1        1.125433
MVR_PTS        1.158984  1        1.076561
CAR_AGE        2.130468  1        1.459612
URBANICITY     1.052960  1        1.026139
KIDSDRIV_MOD   1.434861  1        1.197857
HOMEKIDS_MOD   2.245965  1        1.498654
CLM_FREQ_MOD   2.232403  1        1.494123
AGE_MOD        1.505499  1        1.226988
YOJ_MOD        1.925940  1        1.387782
TRAVTIME_MOD   1.030401  1        1.015087
TIF_MOD        1.018817  1        1.009364
INCOME_MOD     4.310419  1        2.076155
HOME_VAL_MOD   1.991408  1        1.411173
BLUEBOOK_MOD   2.083098  1        1.443294
OLD_CLAIM_MOD  2.486871  1        1.576982

There does not appear to be an issue with mulicollinearity.

Model 4: Manual Variable Selection, Linear Terms Only

Choosing relevant predictors manually is a challenging exercise as most predictors seemed to have only a subtle influence–if any–on claim costs.

Based on our exploratory work, we believe the following variables may be relevant:

AGE
KIDSDRIV
HOMEKIDS
TRAVTIME
SEX
CAR_USE
RED_CAR
UBANICITY
JOB
CARTYPE
EDUCATION
BLUEBOOK

Let’s look at a preliminary model using all 12 of our proposed predictors:


Call:
lm(formula = TARGET_AMT_MOD ~ KIDSDRIV_MOD + HOMEKIDS_MOD + AGE_MOD + 
    TRAVTIME_MOD + SEX + CAR_USE + RED_CAR + URBANICITY + JOB + 
    CAR_TYPE + EDUCATION + BLUEBOOK_MOD, data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7973 -0.4027  0.0404  0.4037  3.2710 

Coefficients:
                                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)                    7.7902933  0.1678675  46.407 < 0.0000000000000002 ***
KIDSDRIV_MOD                  -0.0361875  0.0331524  -1.092                0.275    
HOMEKIDS_MOD                   0.0205244  0.0190398   1.078                0.281    
AGE_MOD                        0.0004682  0.0021647   0.216                0.829    
TRAVTIME_MOD                  -0.0005157  0.0011623  -0.444                0.657    
SEXM                           0.0939910  0.0678247   1.386                0.166    
CAR_USEPrivate                -0.0042709  0.0519136  -0.082                0.934    
RED_CARyes                     0.0270567  0.0521811   0.519                0.604    
URBANICITYHighly Urban/ Urban  0.0274888  0.0789854   0.348                0.728    
JOBBlue Collar                 0.0579560  0.1194463   0.485                0.628    
JOBClerical                    0.0770022  0.1242427   0.620                0.535    
JOBDoctor                     -0.0423386  0.1844218  -0.230                0.818    
JOBHome Maker                  0.0186071  0.1219750   0.153                0.879    
JOBLawyer                     -0.0040646  0.1076214  -0.038                0.970    
JOBManager                     0.0143688  0.1117657   0.129                0.898    
JOBProfessional                0.0966171  0.1180370   0.819                0.413    
JOBStudent                     0.0708606  0.1262229   0.561                0.575    
CAR_TYPEPanel Truck           -0.0023405  0.0960163  -0.024                0.981    
CAR_TYPEPickup                 0.0298623  0.0622814   0.479                0.632    
CAR_TYPESports Car             0.0719608  0.0781399   0.921                0.357    
CAR_TYPESUV                    0.0919203  0.0689452   1.333                0.183    
CAR_TYPEVan                   -0.0278677  0.0804508  -0.346                0.729    
EDUCATION.L                    0.1274041  0.0887219   1.436                0.151    
EDUCATION.Q                    0.0541159  0.0539471   1.003                0.316    
EDUCATION.C                   -0.0631616  0.0541032  -1.167                0.243    
BLUEBOOK_MOD                   0.0978773  0.0227092   4.310            0.0000171 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8092 on 2127 degrees of freedom
Multiple R-squared:  0.0198,    Adjusted R-squared:  0.008278 
F-statistic: 1.718 on 25 and 2127 DF,  p-value: 0.01486

Only one of our predictors, BLUEBOOK appears to be significant in our model.

Here is our modified model, with BLUEBOOK as the sole predictor:


Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD, data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7842 -0.3929  0.0404  0.3952  3.2528 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   7.95359    0.06005 132.445 < 0.0000000000000002 ***
BLUEBOOK_MOD  0.08921    0.01590   5.609          0.000000023 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8069 on 2151 degrees of freedom
Multiple R-squared:  0.01441,   Adjusted R-squared:  0.01396 
F-statistic: 31.46 on 1 and 2151 DF,  p-value: 0.00000002298

The positive coefficient for Bluebook makes sense: we expect the replacement cost and/or repairs for a high-valued car to be more expensive than auto with a low replacement cost.

Model 5: Add Quadratic Terms to Model 4

In the exploratory section, we noted potential curved relationship between some predictors and the log of TARGET_AMT.

Those predictors were AGE and TRAVTIME. Let’s include squared terms for these two predictors and also for for BLUEBOOK:


Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + I(BLUEBOOK_MOD^2) + 
    TRAVTIME_MOD + I(TRAVTIME_MOD^2) + AGE_MOD + I(AGE_MOD^2), 
    data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7678 -0.3885  0.0345  0.3994  3.2450 

Coefficients:
                     Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)        7.56360993  0.15342722  49.298 < 0.0000000000000002 ***
BLUEBOOK_MOD       0.29960723  0.08326270   3.598             0.000328 ***
I(BLUEBOOK_MOD^2) -0.02798286  0.01094977  -2.556             0.010670 *  
TRAVTIME_MOD      -0.00057584  0.00119020  -0.484             0.628569    
I(TRAVTIME_MOD^2)  0.00001853  0.00005609   0.330             0.741178    
AGE_MOD            0.00042933  0.00186549   0.230             0.818002    
I(AGE_MOD^2)       0.00027328  0.00014705   1.858             0.063242 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8059 on 2146 degrees of freedom
Multiple R-squared:  0.01912,   Adjusted R-squared:  0.01638 
F-statistic: 6.972 on 6 and 2146 DF,  p-value: 0.0000002344

The second order term for BLUEBOOK is statistically significant. The second order term for AGE is borderline significant; so we include leave both age variables in our model; but remove terms related to TRAVTIME.


Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + I(BLUEBOOK_MOD^2) + 
    AGE_MOD + I(AGE_MOD^2), data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7720 -0.3860  0.0358  0.3948  3.2380 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)        7.5684831  0.1529052  49.498 < 0.0000000000000002 ***
BLUEBOOK_MOD       0.2990922  0.0831967   3.595             0.000332 ***
I(BLUEBOOK_MOD^2) -0.0279652  0.0109407  -2.556             0.010655 *  
AGE_MOD            0.0004062  0.0018617   0.218             0.827316    
I(AGE_MOD^2)       0.0002761  0.0001468   1.880             0.060177 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8056 on 2148 degrees of freedom
Multiple R-squared:  0.01899,   Adjusted R-squared:  0.01717 
F-statistic:  10.4 on 4 and 2148 DF,  p-value: 0.00000002434

This model indicates that log costs increase quadratically with Age. This result seems possible.

Model 6: Stepwise Regression

Finally, we’ll perform a basic stepwise regression with variable selection performed in both directions. For simplicity, we’ll only include linear terms.


Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + MSTATUS + MVR_PTS + 
    SEX + CLM_FREQ_MOD, data = auto_clm)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6968 -0.4038  0.0391  0.4093  3.2236 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   7.940083   0.066264 119.825 < 0.0000000000000002 ***
BLUEBOOK_MOD  0.086861   0.015942   5.449         0.0000000566 ***
MSTATUSYes   -0.073598   0.034727  -2.119               0.0342 *  
MVR_PTS       0.017181   0.007053   2.436               0.0149 *  
SEXM          0.055410   0.035037   1.581               0.1139    
CLM_FREQ_MOD -0.022434   0.014567  -1.540               0.1237    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8051 on 2147 degrees of freedom
Multiple R-squared:  0.02064,   Adjusted R-squared:  0.01836 
F-statistic: 9.051 on 5 and 2147 DF,  p-value: 0.00000001583

The stepwise procedure included four additional predictors in addition to BLUEBOOK:
* MSTATUS: the negative coefficient indicates the married individuals are less expensive than singles. This seems reasonable.
* MVR_PTS: the model indicates a positive association between MVR_PTS and claim costs. This also seems reasonable.
* SEX: According to the model, males are more expensive than females. This is consistent with our earlier exploratory work.
* CLM_FREQ_MOD1: The coefficient is negative. This result is counterintuitive. We would expect folks with a high incidence accident rate to potentially be at risk for higher cost accidents. For now, we’ll leave this predictor in, but we may want do additional analysis.

6 SELECT MODELS

6.1 Binary Logistic Regression Models

Let’s compare model fits for all of our models:

            AIC     AICc      BIC    loglik
m1     7351.936 7352.246 7597.185 -3640.968
m1_mod 7344.898 7345.127 7555.112 -3642.449
m2     7280.684 7281.108 7567.976 -3599.342
m2_mod 7273.059 7273.386 7525.315 -3600.529
m3     7344.813 7345.057 7562.034 -3641.407

Based on the various model evaluation criteria, model m2_mod appears to be the clear winner. This model is the binary logistic regression model that included multiple quadratic terms, but removed statistically insignificant predictors from the original model 2 formulation.

Model 2 is superior in that the AIC, AIC_c, and BIC measures are lower than all other evaluated models. The log likelihood is not quite as the original model 2, but this measure does not account for model complexity as the other models do.

Let’s also compare the AUC measure for all models:

[1] "Model 1: 0.815"

[1] "Model 1 mod: 0.814"

[1] "Model 2: 0.82"

[1] "Model 2 mod: 0.819"

[1] "Model 3: 0.815"

Models 2 and Model 2 mod are tied and have the highest AUC measures. Given that Model 2 mod has fewer parameters than Model 2, the AUC measure supports our contention that Model 2 mod is superior to the other models.

Let’s explore a summary of our model predictions using the training data:

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002406 0.074672 0.195498 0.263816 0.402386 0.960296

We now need to choose an appropriate probability cutoff measure for predicting whether or not an individual will have a claim.

We will use Youden’s index to determine this optimal cutoff.

[1] 0.2762045

Using Youden’s index, we select a relatively low cutoff measure of 0.276. With such a low cutoff, we will inevitably sacrifice specificity for gains in sensitivity compared to a traditional 0.5 cutoff. However, this lower cutoff provides a better balance between precision and recall.

          Reference
Prediction    0    1
         0 4476  545
         1 1532 1608

$accuracy
[1] 0.7454969

$error_rt
[1] 0.2545031

$precision
[1] 0.5121019

$sensitivity
[1] 0.7468648

$specificity
[1] 0.7450067

$F1
[1] 0.6075949

For comparison purposes, here is the confusion matrix and related classification metrics with a 0.5 cutoff.

          Reference
Prediction    0    1
         0 5534 1207
         1  474  946

$accuracy
[1] 0.7940203

$error_rt
[1] 0.2059797

$precision
[1] 0.6661972

$sensitivity
[1] 0.4393869

$specificity
[1] 0.9211052

$F1
[1] 0.529527

We see that our 0.276 cutoff also results in lower accuracy vis-a-vis the 0.5 threshold. But our lower threshold also results in better balance in precision vs. recall, as indicated by the improved F1 measure.

6.2 Multiple Regression Models

We’ll review five different measures for assessing our multiple regression models:

R-Squared
Adjusted R-Squared
Root Mean Squared Error
AIC
Corrected AIC
BIC

          R_2 adj_R_2     rmse      AIC     AICc      BIC
m4     0.0198  0.0083 7891.505 5226.127 5226.838 5379.341
m4_mod 0.0144  0.0140 7901.013 5189.919 5189.931 5206.943
m5     0.0191  0.0164 7897.224 5189.618 5189.685 5235.015
m5_mod 0.0190  0.0172 7897.273 5185.896 5185.935 5219.944
m6     0.0206  0.0184 7885.800 5184.273 5184.325 5223.995

Based on the table above, model 6 appears to be the superior model.
It has superior measures in all categories except for BIC.

Let’s now do some quick model diagnostics for our selected model, model 6:

The residuals in our model appear to have a relatively constant variance across all fitted values. The qq plot indicates standardized residuals that are fairly well behaved, with only minor departures from normality. Finally, there are only a couple outliers in our data–none appear to be high leverage points.

6.3 Make Predictions

Let’s wrap up by scrubbing the test data set and make predictions. Please refer to the Github account in the Appendix to access the prediction file.

7 Appendix

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, cache=TRUE, comment=NA,
                      fig.align='center',options(width = 75, scipen=999))
# load libraries
library(stringr)
library(knitr)
library(moments)
library(ggplot2)
library(ggthemes)
library(gridExtra)
library(scales)
library(dplyr)
library(mice)
library(geoR)
library(car)
library(bestglm)
library(sme)
library(pROC)
library(caret)


url <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/dataset/insurance_training_data.csv"

auto <- read.csv(url)  

# create table describing variables in dataset
vname <- names(auto)

vdefn <- c("Identification Variable (do not use) ",
           "In a crash? 1=YES 0=NO ",
           "Cost of Crash, if applicable",
           "# Driving Children",
           "Age of Driver", 
           "# Children at Home",
           "Years on Job",
           "Income",
           "Single Parent",
           "Home Value",
           "Marital Status",
           "Gender",
           "Max Education Level",
           "Job Category",
           "Commute Distance",
           "Vehicle Use",
           "Value of Vehicle",
           "Time in Force",
           "Type of Car",
           "A Red Car",
           "# Claims (Past 5 Years) ",
           "Total Claims (Past 5 Years)",
           "License Revoked (Past 7 Years)",
           "Motor Vehicle Report Points",
           "Vehicle Age",
           "Home/Work Area"
)
           

veffect <- c("None",
             "None",
             "None",
             "When teenagers drive your car, increased crash risk",
             "Young and old drivers might be riskier",
             "Unknown effect",
             "Long-term employees are usually safer",
             "In theory, rich have fewer crashes",
             "Unknown impact",
             "In theory, home owners may drive more responsibly",
             "In theory, married individuals are less risky",
             "Urban legend: females are safer drivers",
             "Unknown, but in theory more educated people tend to drive more safely",
             "In theory, white collar workers are less risky",
             "Long drives to work usually suggest greater risk",
             "Commercial fleet driven more, may impact collision prob",
             "Unknown impact on collision prob, but impacts crash payout",
             "Long-term customers are usually safer",
             "Unknown impact on collision prob, but impacts crash payout",
             "Urban legend: red cars are riskier, particularly sports cars",
             "If  total payout  high, future payouts might be high",
             "Claim count should be positively correlated with future claims",
             "If your license was revoked, you probably are a riskier driver",
             "Traffic ticket counts have postive correlation with crashes",
             "Unknown impact on collision prob, but impacts crash payout",
             "Unknown impact"
             
)

var_df <- data.frame(vname,vdefn, veffect, row.names = NULL)
kable(var_df, col.names=c('Variable Name','Description','Theoretical Impact'))

# training data cleaning scripts

currency_fix <- function(x) {
  num <- str_replace_all(x, "\\$","")
  num <- as.numeric(str_replace_all(num, "\\,",""))
  num
}

auto$INCOME <- currency_fix(auto$INCOME)
auto$HOME_VAL <- currency_fix(auto$HOME_VAL)
auto$BLUEBOOK <- currency_fix(auto$BLUEBOOK)
auto$OLDCLAIM <- currency_fix(auto$OLDCLAIM)


auto[sapply(auto, is.factor)] <- lapply(auto[sapply(auto, is.factor)], 
                                        function(x) str_replace(x,"z_|<",""))

auto[sapply(auto, is.character)] <- lapply(auto[sapply(auto, is.character)],as.factor) 

auto$EDUCATION <- ordered(auto$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD") )

auto$INCOME <- auto$INCOME / 1000  
auto$HOME_VAL <- auto$HOME_VAL / 1000
auto$BLUEBOOK <- auto$BLUEBOOK / 1000
auto$OLDCLAIM <- auto$OLDCLAIM / 1000

str(auto)
options(width=95)
summary(auto) 

tbl <- with(auto,rbind(round(addmargins(table(TARGET_FLAG)),0),
                       addmargins(prop.table(table(TARGET_FLAG)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=100)
round(with(auto, c(summary(TARGET_AMT), StdD=sd(TARGET_AMT), Skew=skewness(TARGET_AMT), Kurt=kurtosis(TARGET_AMT))),2)
options(width=100)
auto_censor <- auto[auto$TARGET_FLAG == 1,]
round(with(auto_censor,c(summary(TARGET_AMT), StdD=sd(TARGET_AMT), Skew=skewness(TARGET_AMT), Kurt=kurtosis(TARGET_AMT))),2)




h <- ggplot(auto_censor, aes(TARGET_AMT)) + 
  geom_histogram(color="ghostwhite", fill="steelblue4") +
  theme_classic()+ labs(title = 'Histogram of TARGET_AMT') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 

b <- ggplot(auto_censor, aes(x="",y=TARGET_AMT)) + 
  geom_boxplot(color="ghostwhite", fill="steelblue4",outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'Boxplot of TARGET_AMT') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + coord_flip() + 
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)


grid.arrange(h,b, ncol=2)


tbl <- with(auto, rbind(addmargins(table(KIDSDRIV)),addmargins(prop.table(table(KIDSDRIV)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)


round(with(auto,c(summary(KIDSDRIV), StdD=sd(KIDSDRIV), Skew=skewness(KIDSDRIV), Kurt=kurtosis(KIDSDRIV))),2)

bc <- ggplot(auto, aes(KIDSDRIV, fill=factor(TARGET_FLAG))) + 
  geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
  theme_classic()+ labs(title = 'KIDSDRIV Prop by TARGET_FLAG') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 
 
 
bp <- ggplot(auto_censor, aes(factor(KIDSDRIV), log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by KIDSDRIV', x = "KIDSDRIV") + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

lo <- auto %>%
  group_by(KIDSDRIV) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp <- ggplot(lo, aes(KIDSDRIV,logodds)) + 
  geom_point(shape=21, color = "ghostwhite", fill='steelblue4', size=4, stroke=0.75) + 
  theme_classic()+ labs(title = 'Log Odds Claim by KIDSDRIV') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
  

  
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <-auto[auto$TARGET_AMT > 0,] %>%
  group_by(KIDSDRIV) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)
tbl <- with(auto, rbind(addmargins(table(HOMEKIDS)),addmargins(prop.table(table(HOMEKIDS)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)


round(with(auto,c(summary(HOMEKIDS), StdD=sd(HOMEKIDS), Skew=skewness(HOMEKIDS), Kurt=kurtosis(HOMEKIDS))),2)  

bc <- ggplot(auto, aes(HOMEKIDS, fill=factor(TARGET_FLAG))) + 
  geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
  theme_classic()+ labs(title = 'HOMEKIDS Prop by TARGET_FLAG') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 
 
 
bp <- ggplot(auto_censor, aes(factor(HOMEKIDS), log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by HOMEKIDS', x = "HOMEKIDS") + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

lo <- auto %>%
  group_by(HOMEKIDS) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp <- ggplot(lo, aes(HOMEKIDS,logodds)) + 
  geom_point(shape=21, color = "ghostwhite", fill='steelblue4', size=4, stroke=0.75) + 
  theme_classic()+ labs(title = 'Log Odds Claim by HOMEKIDS') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
   
t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(HOMEKIDS) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)
tbl <- with(auto, rbind(addmargins(table(CLM_FREQ)),addmargins(prop.table(table(CLM_FREQ)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)


round(with(auto,c(summary(CLM_FREQ), StdD=sd(CLM_FREQ), Skew=skewness(CLM_FREQ), Kurt=kurtosis(CLM_FREQ))),2)  

bc <- ggplot(auto, aes(CLM_FREQ, fill=factor(TARGET_FLAG))) + 
  geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
  theme_classic()+ labs(title = 'CLM_FREQ Prop by TARGET_FLAG') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 
 
 
bp <- ggplot(auto_censor, aes(factor(CLM_FREQ), log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by CLM_FREQ', x = "CLM_FREQ") + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

lo <- auto %>%
  group_by(CLM_FREQ) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp <- ggplot(lo, aes(CLM_FREQ,logodds)) + 
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) + 
  theme_classic()+ labs(title = 'Log Odds Claim by CLM_FREQ') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1) 

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(CLM_FREQ) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)
options(width=90)
tbl <- with(auto, rbind(addmargins(table(MVR_PTS)),addmargins(prop.table(table(MVR_PTS)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)


round(with(auto,c(summary(MVR_PTS), StdD=sd(MVR_PTS), Skew=skewness(MVR_PTS), Kurt=kurtosis(MVR_PTS))),2)  

bc <- ggplot(auto, aes(MVR_PTS, fill=factor(TARGET_FLAG))) + 
  geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
  theme_classic()+ labs(title = 'MVR_PTS Prop by TARGET_FLAG') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = element_rect(fill='lightblue')) +
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) 
 
 
bp <- ggplot(auto_censor, aes(factor(MVR_PTS), log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by MVR_PTS', x = "MVR_PTS") + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

lo <- auto %>%
  group_by(MVR_PTS) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp <- ggplot(lo, aes(MVR_PTS,logodds)) + 
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) + 
  theme_classic()+ labs(title = 'Log Odds Claim by MVR_PTS') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1) 
t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(BIN_MVR=2*floor(MVR_PTS/2)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)
options(width=100)

tbl <- with(auto, rbind(addmargins(table(5*floor(AGE/5))),addmargins(prop.table(table(5*floor(AGE/5))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)



options(width=100)
round(with(auto,c(summary(AGE), StdD=sd(AGE, na.rm=T), Skew=skewness(AGE, na.rm=T),
                  Kurt=kurtosis(AGE, na.rm=T))),2)


h1 <- ggplot(auto, aes(AGE)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(AGE, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$AGE_5 <- 5 * floor(auto$AGE / 5)
#auto$AGE_10 <- 10 * floor(auto$AGE / 10)

lo <- auto %>%
  group_by(AGE) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(AGE,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by AGE') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=F, formula='y ~ poly(x, 2)') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(AGE, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(AGE)) %>%
  group_by(AGEBIN=10*floor(AGE/10)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)



tbl <- with(auto, rbind(addmargins(table(2*floor(YOJ/2))),addmargins(prop.table(table(2*floor(YOJ/2)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=90)
round(with(auto,c(summary(YOJ), StdD=sd(YOJ, na.rm=T), Skew=skewness(YOJ, na.rm=T),
                  Kurt=kurtosis(YOJ, na.rm=T))),2)

h1 <- ggplot(auto, aes(YOJ)) + geom_histogram(binwidth = 1, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of YOJ') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(YOJ, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 1, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by YOJ') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$YOJ_5 <- 5 * floor(auto$YOJ / 5)
#auto$YOJ_10 <- 10 * floor(auto$YOJ / 10)

lo <- auto %>%
  group_by(YOJ) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(YOJ,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by YOJ') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(YOJ, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by YOJ') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(YOJ)) %>%
  group_by(YOJBIN=4*floor(YOJ/4)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(10*floor(TRAVTIME/10))),addmargins(prop.table(table(10*floor(TRAVTIME/10)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)

options(width = 90)
round(with(auto,c(summary(TRAVTIME), StdD=sd(TRAVTIME, na.rm=T), Skew=skewness(TRAVTIME, na.rm=T),
                  Kurt=kurtosis(TRAVTIME, na.rm=T))),2)

h1 <- ggplot(auto, aes(TRAVTIME)) + geom_histogram(binwidth = 5, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of TRAVTIME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(TRAVTIME, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 5, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by TRAVTIME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$TRAVTIME_5 <- 5 * floor(auto$TRAVTIME / 5)
#auto$TRAVTIME_10 <- 10 * floor(auto$TRAVTIME / 10)

lo <- auto %>%
  group_by(TRAVTIME) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(TRAVTIME,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by TRAVTIME') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(TRAVTIME, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by TRAVTIME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(TRAVTIME)) %>%
  group_by(TRAVBIN=15*floor(TRAVTIME/15)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(2*floor(TIF/2))),addmargins(prop.table(table(2*floor(TIF/2)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)

options(width = 90)
round(with(auto,c(summary(TIF), StdD=sd(TIF, na.rm=T), Skew=skewness(TIF, na.rm=T),
                  Kurt=kurtosis(TIF, na.rm=T))),2)

h1 <- ggplot(auto, aes(TIF)) + geom_histogram(binwidth = 3, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of TIF') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(TIF, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 3, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by TIF') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$TIF_5 <- 5 * floor(auto$TIF / 5)
#auto$TIF_10 <- 10 * floor(auto$TIF / 10)

lo <- auto %>%
  group_by(TIF) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(TIF,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by TIF') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(TIF, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) Prop by TIF') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(TIF)) %>%
  group_by(TIFBIN=4*floor(TIF/4)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(3*floor(CAR_AGE/3))),addmargins(prop.table(table(3*floor(CAR_AGE/3)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)

options(width = 90)
round(with(auto,c(summary(CAR_AGE), StdD=sd(CAR_AGE, na.rm=T), Skew=skewness(CAR_AGE, na.rm=T),
                  Kurt=kurtosis(CAR_AGE, na.rm=T))),2)

h1 <- ggplot(auto, aes(CAR_AGE)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of CAR_AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))


h2 <- ggplot(auto, aes(CAR_AGE, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by CAR_AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$CAR_AGE_5 <- 5 * floor(auto$CAR_AGE / 5)
#auto$CAR_AGE_10 <- 10 * floor(auto$CAR_AGE / 10)

lo <- auto %>%
  group_by(CAR_AGE) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(CAR_AGE,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Claim by CAR_AGE') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(CAR_AGE, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) Prop by CAR_AGE') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(CAR_AGE)) %>%
  group_by(CAR_AGE=4*floor(CAR_AGE/4)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$PARENT1))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(PARENT1=auto$PARENT1,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(PARENT1, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by PARENT1') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(PARENT1) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$MSTATUS))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(MSTATUS=auto$MSTATUS,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(MSTATUS, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by MSTATUS') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(MSTATUS) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$SEX))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(SEX=auto$SEX,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(SEX, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by SEX') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(SEX) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$CAR_USE))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(CAR_USE=auto$CAR_USE,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(CAR_USE, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by CAR_USE') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(CAR_USE) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$RED_CAR))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(RED_CAR=auto$RED_CAR,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(RED_CAR, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by RED_CAR') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(RED_CAR) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$REVOKED))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(REVOKED=auto$REVOKED,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(REVOKED, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by REVOKED') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(REVOKED) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

ct <- addmargins(table(auto$URBANICITY))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)

tbl <- addmargins(table(URBANICITY=auto$URBANICITY,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])

bp <- ggplot(auto_censor, aes(URBANICITY, log(TARGET_AMT))) + 
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by URBANICITY') + 
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  + 
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))

bp

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(URBANICITY) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(JOB)),addmargins(prop.table(table(JOB)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)

tbl <- with(auto, addmargins(table(JOB, TARGET_FLAG)))
tbl

pt <- round(prop.table(tbl[1:9,1:2], margin=1),2)
pt

prop.test(tbl[1:9,1:2])


bc <- ggplot(auto, aes(x=reorder(JOB, JOB,function(x) length(x)))) + 
  geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") + 
  theme_classic()+ labs(title = 'JOB Prop') +
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + 
  scale_x_discrete(labels=c('Dr','Unk', 'Hmkr', 'Stud', 'Law', 'Mgmt', 'Prof','Clr','BlCol')) + 
  labs(x="JOB") + coord_flip()
  

bp <- ggplot(auto_censor, aes(factor(JOB), log(TARGET_AMT))) +
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by JOB', x = "JOB") +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


lo <- auto %>%
  group_by(JOB) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))

sp <- ggplot(lo, aes(JOB,logodds)) +
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
  theme_classic()+ labs(title = 'Log Odds Claim by JOB') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue")) + 
  scale_x_discrete(labels=c('','BlCol','Clr', 'Dr', 'Hmkr', 'Law', 'Mgmt', 'Prof','Stud'))




grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)  

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(JOB) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(CAR_TYPE)),addmargins(prop.table(table(CAR_TYPE)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)

tbl <- with(auto, addmargins(table(CAR_TYPE, TARGET_FLAG)))
tbl

pt <- round(prop.table(tbl[1:6,1:2], margin=1),2)
pt

prop.test(tbl[1:6,1:2])


bc <- ggplot(auto, aes(x=reorder(CAR_TYPE, CAR_TYPE,function(x) length(x)))) + 
  geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") + 
  theme_classic()+ labs(title = 'CAR_TYPE Prop') +
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + 
  labs(x="CAR_TYPE") + coord_flip()

#scale_x_discrete(labels=c('Dr','Unk', 'Hmkr', 'Stud', 'Law', 'Mgmt', 'Prof','Clr','BlCol')) + 
 

bp <- ggplot(auto_censor, aes(factor(CAR_TYPE), log(TARGET_AMT))) +
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by CAR_TYPE', x = "CAR_TYPE") +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


lo <- auto %>%
  group_by(CAR_TYPE) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))

sp <- ggplot(lo, aes(CAR_TYPE,logodds)) +
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
  theme_classic()+ labs(title = 'Log Odds Claim by CAR_TYPE') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue")) +
  scale_x_discrete(labels=c('MV','PnlTrk','PkpTrk', 'SptCar', 'SUV', 'Van'))
  
  


grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)  

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(CAR_TYPE) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=100)
tbl <- with(auto, rbind(addmargins(table(EDUCATION)),addmargins(prop.table(table(EDUCATION)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)

tbl <- with(auto, addmargins(table(EDUCATION, TARGET_FLAG)))
tbl

pt <- round(prop.table(tbl[1:4,1:2], margin=1),2)
pt

prop.test(tbl[1:4,1:2])


bc <- ggplot(auto, aes(x=reorder(EDUCATION, EDUCATION,function(x) length(x)))) + 
  geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") + 
  theme_classic()+ labs(title = 'EDUCATION Prop') +
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + 
  labs(x="EDUCATION") + coord_flip()


bp <- ggplot(auto_censor, aes(factor(EDUCATION), log(TARGET_AMT))) +
  geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
  theme_classic()+ labs(title = 'log(TARGET_AMT) by EDUCATION', x = "EDUCATION") +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)  +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


lo <- auto %>%
  group_by(EDUCATION) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
                                                                     length(TARGET_FLAG))),3))

sp <- ggplot(lo, aes(EDUCATION,logodds)) +
  geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
  theme_classic()+ labs(title = 'Log Odds Claim by EDUCATION') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue")) 
  

  


grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)  

t <- auto[auto$TARGET_AMT > 0,] %>%
  group_by(EDUCATION) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=130)

tbl <- with(auto, rbind(addmargins(table(30*floor(INCOME/30))),addmargins(prop.table(table(30*floor(INCOME/30))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)




options(width=100)
round(with(auto,c(summary(INCOME), StdD=sd(INCOME, na.rm=T), Skew=skewness(INCOME, na.rm=T),
                  Kurt=kurtosis(INCOME, na.rm=T))),2)


h1 <- ggplot(auto, aes(INCOME)) + geom_histogram(binwidth = 20, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of INCOME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))


h2 <- ggplot(auto, aes(INCOME, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 20, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by INCOME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))



#auto$INCOME_5 <- 5 * floor(auto$INCOME / 5)
#auto$INCOME_30  <- auto$INCOME^0.5
auto$INCOME_30 <- 30 * floor(auto$INCOME / 30)

lo <- auto %>%
  group_by(INCOME_30) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))


  
sp1 <- ggplot(lo, aes(INCOME_30,logodds)) + 
  geom_point(color ='steelblue4', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Clm by Binned INCOME') + 
  geom_smooth(method='loess', color='darkred', size=0.75, se=T) +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))



sp2 <- ggplot(auto_censor, aes(INCOME, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by INCOME') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))

grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(INCOME)) %>%
  group_by(INCBIN=30*floor(INCOME/30)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=120)

tbl <- with(auto, rbind(addmargins(table(50*floor(HOME_VAL/50))),addmargins(prop.table(table(50*floor(HOME_VAL/50))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)




options(width=100)
round(with(auto,c(summary(HOME_VAL), StdD=sd(HOME_VAL, na.rm=T), Skew=skewness(HOME_VAL, na.rm=T),
                  Kurt=kurtosis(HOME_VAL, na.rm=T))),2)


h1 <- ggplot(auto, aes(HOME_VAL)) + geom_histogram(binwidth = 40, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of HOME_VAL') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(HOME_VAL, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 40, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by HOME_VAL') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))


#auto$HOME_VAL_5 <- 5 * floor(auto$HOME_VAL / 5)
auto$HOME_VAL_30 <- 30 * floor(auto$HOME_VAL / 30)

lo <- auto %>%
  group_by(HOME_VAL_30) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(HOME_VAL_30,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Clm by Binned HOME_VAL') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(HOME_VAL, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by HOME_VAL') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(HOME_VAL)) %>%
  group_by(HOMEVALBIN=50*floor(HOME_VAL/50)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=120)
tbl <- with(auto, rbind(addmargins(table(4*floor(BLUEBOOK/4))),addmargins(prop.table(table(4*floor(BLUEBOOK/4))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)




options(width=100)
round(with(auto,c(summary(BLUEBOOK), StdD=sd(BLUEBOOK, na.rm=T), Skew=skewness(BLUEBOOK, na.rm=T),
                  Kurt=kurtosis(BLUEBOOK, na.rm=T))),2)


h1 <- ggplot(auto, aes(BLUEBOOK)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of BLUEBOOK') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(BLUEBOOK, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by BLUEBOOK') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))


#auto$BLUEBOOK_5 <- 5 * floor(auto$BLUEBOOK / 5)
auto$BLUEBOOK_2 <- 2 * floor(auto$BLUEBOOK / 2)

lo <- auto %>%
  group_by(BLUEBOOK_2) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(BLUEBOOK_2,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Clm by Binned BLUEBOOK') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(BLUEBOOK, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by BLUEBOOK') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(BLUEBOOK)) %>%
  group_by(BLUEBKBIN=5*floor(BLUEBOOK/5)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

options(width=120)
tbl <- with(auto, rbind(addmargins(table(4000*floor(OLDCLAIM/4000))),addmargins(prop.table(table(4000*floor(OLDCLAIM/4000))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)




options(width=100)
round(with(auto,c(summary(OLDCLAIM), StdD=sd(OLDCLAIM, na.rm=T), Skew=skewness(OLDCLAIM, na.rm=T),
                  Kurt=kurtosis(OLDCLAIM, na.rm=T))),2)


h1 <- ggplot(auto, aes(OLDCLAIM)) + geom_histogram(binwidth = 4, fill="steelblue4", color="ghostwhite" ) + 
  theme_classic() + labs(title = 'Histogram of OLDCLAIM') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))




h2 <- ggplot(auto, aes(OLDCLAIM, ..density.., fill=factor(TARGET_FLAG))) + 
  geom_histogram(binwidth = 4, color="ghostwhite", position="identity", alpha=0.5) + 
  theme_classic() + labs(title = 'TARGET_FLAG Prop by OLDCLAIM') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue")) + 
  scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) + 
  theme(legend.position = c(1,1),legend.justification  = c(1,1), legend.background = 
          element_rect(fill='lightblue'))


#auto$OLDCLAIM_5 <- 5 * floor(auto$OLDCLAIM / 5)
auto$OLDCLAIM_4 <- 4 * floor(auto$OLDCLAIM / 4)

lo <- auto %>%
  group_by(OLDCLAIM_4) %>%
  summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
  
sp1 <- ggplot(lo, aes(OLDCLAIM_4,logodds)) + 
  geom_point(color ='steelblue', size=1) + 
  theme_classic()+ labs(title = 'Log Odds Clm by Binned OLDCLAIM') + 
  geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
  theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
  theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))


sp2 <- ggplot(auto_censor, aes(OLDCLAIM, log(TARGET_AMT))) + 
  geom_point(position="jitter", color="steelblue4", size=1) + 
  geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() + 
  labs(title = 'log(TARGET_AMT) by OLDCLAIM') + 
  theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +  
  theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
        panel.background = element_rect(fill = "lightblue"))



grid.arrange(h1,h2,sp1,sp2, ncol=2)

t <- auto[auto$TARGET_AMT > 0,] %>%
  filter(!is.na(OLDCLAIM)) %>%
  group_by(OLD_BIN=5*floor(OLDCLAIM/5)) %>%
  summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))

kable(t)

tmp_data <- mice(auto,maxit=5, method='pmm',seed=20, print=F)
densityplot(tmp_data)[1:5]
auto$AGE <- ifelse(is.na(auto$AGE), mean(auto$AGE),auto$AGE)
tmp_data <- mice(auto,maxit=1, method='pmm',seed=20, print=F)
auto <- complete(tmp_data,1)
#auto_censor <- auto[auto$TARGET_FLAG==1,]
auto$CAR_AGE <- ifelse(auto$CAR_AGE == -3, 0, auto$CAR_AGE)

boxcoxfit(auto$TARGET_AMT[auto$TARGET_FLAG==1])

auto$TARGET_AMT_MOD <- log(auto$TARGET_AMT)

auto$KIDSDRIV_MOD <- scale(auto$KIDSDRIV, scale=F)
auto$HOMEKIDS_MOD <- scale(auto$HOMEKIDS, scale=F)
auto$CLM_FREQ_MOD <- scale(auto$CLM_FREQ, scale=F)

auto$AGE_MOD <- scale(auto$AGE, scale=F)


#auto$YOJ_ZERO <- ifelse(auto$YOJ == 0,1,0)

# auto$YOJ_MOD <- ifelse(auto$YOJ == 0, 
#                          mean(auto$YOJ[auto$YOJ !=0]),
#                          scale(auto$YOJ[auto$YOJ !=0],
#                                scale=F))
auto$YOJ_MOD <- scale(auto$YOJ, scale=F)

auto$TRAVTIME_MOD <- scale(auto$TRAVTIME, scale=F)

auto$TIF_MOD <- scale(auto$TIF, scale=F)

boxcoxfit(auto$INCOME[auto$INCOME >0])
auto$INCOME_MOD <- auto$INCOME ^0.5
boxcoxfit(auto$HOME_VAL[auto$HOME_VAL > 0])
  
#auto$HOME_VAL_ZERO <- ifelse(auto$HOME_VAL==0,1,0)
auto$HOME_VAL_MOD <- auto$HOME_VAL^0.25

boxcoxfit(auto$BLUEBOOK)
auto$BLUEBOOK_MOD <- auto$BLUEBOOK^0.5

boxcoxfit(auto$OLDCLAIM[auto$OLDCLAIM>0])

auto$OLD_CLAIM_MOD <- log(auto$OLDCLAIM + 1)  

m1 <- glm(TARGET_FLAG ~. -TARGET_AMT -TARGET_AMT_MOD -INDEX -KIDSDRIV -HOMEKIDS -CLM_FREQ -AGE -YOJ -TRAVTIME -TIF -SEX -RED_CAR -INCOME -HOME_VAL -BLUEBOOK -OLDCLAIM
          -INCOME_30 -HOME_VAL_30 -BLUEBOOK_2 -OLDCLAIM_4,
          data = auto, family='binomial')
    
vif(m1)        

            
summary(m1)

m1_mod <- update(m1, . ~ . -CAR_AGE -HOMEKIDS_MOD -AGE_MOD -YOJ_MOD -OLD_CLAIM_MOD)

summary(m1_mod)        

m2 <- update(m1, . ~ . + I(KIDSDRIV_MOD^2) + I(HOMEKIDS_MOD^2) + I(CLM_FREQ_MOD^2)
                                               + I(AGE_MOD^2) + I(YOJ_MOD^2)
                                                   + I(TIF_MOD^2))

summary(m2)

m2_mod <- update(m2, . ~ . -YOJ_MOD -HOMEKIDS_MOD -I(YOJ_MOD^2) -I(HOMEKIDS_MOD^2)
                 - CAR_AGE)

summary(m2_mod)

auto_redux <- auto[,c(9,11,12:14,16,19,20,23:26,32:42,2)]

model.upper <- glm(TARGET_FLAG ~ .,family="binomial", data=auto_redux)
model.null <- glm(TARGET_FLAG ~ 1, family = "binomial", data=auto_redux)
 
m3 <- step(model.null,scope=list(upper=model.upper, lower=model.null),
           trace= 0, direction='both')

summary(m3)
# m4 <- bestglm(Xy=auto_redux, family=binomial,IC="AIC", method="exhaustive", nvmax=3)
# summary(m4$BestModel)
names(auto_redux)

auto_clm <- auto[auto$TARGET_FLAG ==
                           1,c(9,11,12:14,16,19,20,23:26,32:42,31)]
mymodel <- lm(TARGET_AMT_MOD ~ ., data=auto_clm)
vif(mymodel)
m4 <- lm(TARGET_AMT_MOD ~ KIDSDRIV_MOD + HOMEKIDS_MOD+ AGE_MOD + TRAVTIME_MOD + SEX + CAR_USE + RED_CAR + URBANICITY + JOB + CAR_TYPE + EDUCATION + BLUEBOOK_MOD , data=auto_clm)
summary(m4)


m4_mod <- lm(TARGET_AMT_MOD ~ BLUEBOOK_MOD, data=auto_clm)

summary(m4_mod)

m5 <- update(m4_mod, . ~ . + I(BLUEBOOK_MOD^2) + TRAVTIME_MOD + I(TRAVTIME_MOD^2) + AGE_MOD + I(AGE_MOD^2)) 

summary(m5)

m5_mod <- update(m4_mod, . ~ . + I(BLUEBOOK_MOD^2) + AGE_MOD + I(AGE_MOD^2))
summary(m5_mod)


model.upper <- lm(TARGET_AMT_MOD ~ ., data=auto_clm)
model.null <- lm(TARGET_AMT_MOD ~ 1, data=auto_clm)
 
m6 <- step(model.null,scope=list(upper=model.upper, lower=model.null),
           trace= 0, direction='both')

summary(m6)


m1_out <- cbind(AIC=AIC(m1), AICc=AICc(m1), BIC = BIC(m1), loglik=logLik(m1))
m1_mod_out <- cbind(AIC=AIC(m1_mod), AICc=AICc(m1_mod), BIC = BIC(m1_mod), loglik=logLik(m1_mod))
m2_out <- cbind(AIC=AIC(m2), AICc=AICc(m2), BIC = BIC(m2), loglik=logLik(m2))
m2_mod_out <- cbind(AIC=AIC(m2_mod), AICc=AICc(m2_mod), BIC = BIC(m2_mod), loglik=logLik(m2_mod))
m3_out <- cbind(AIC=AIC(m3), AICc=AICc(m3), BIC = BIC(m3), loglik=logLik(m3))

model_comp <- rbind(m1_out, m1_mod_out, m2_out, m2_mod_out, m3_out)
rownames(model_comp) <- c("m1","m1_mod","m2","m2_mod","m3")

model_comp
# models 1 and 2 prediction probs
m1_pred <- predict(m1, auto, type="response")
m1_mod_pred <- predict(m1_mod, auto, type="response")
m2_pred <- predict(m2, auto, type="response")
m2_mod_pred <- predict(m2_mod, auto, type="response")
m3_pred <- predict(m3, auto, type="response")

#AUC
paste("Model 1:",round(as.numeric(roc(auto$TARGET_FLAG, m1_pred)["auc"]),3))
paste("Model 1 mod:",round(as.numeric(roc(auto$TARGET_FLAG, m1_mod_pred)["auc"]),3))
paste("Model 2:",round(as.numeric(roc(auto$TARGET_FLAG, m2_pred)["auc"]),3))
paste("Model 2 mod:",round(as.numeric(roc(auto$TARGET_FLAG, m2_mod_pred)["auc"]),3))
paste("Model 3:",round(as.numeric(roc(auto$TARGET_FLAG, m3_pred)["auc"]),3))

mypredict <- predict(m2_mod, auto, type='response')

summary(mypredict)
myroc <- roc(auto$TARGET_FLAG, mypredict, plot=T, asp=NA,
                legacy.axes=T, main = "ROC Curve", ret="tp", col="blue")
coords(myroc, "best", ret='threshold', best.method="youden")
predict_target <- ifelse(mypredict >= 0.276, 1, 0)  
cm <- confusionMatrix(data=as.factor(predict_target), reference=as.factor(auto$TARGET_FLAG), positive='1')

cm$table

a <- cm$overall["Accuracy"]; names(a) <- NULL # accuracy
e <- 1 - a; names(e) <- NULL # error rate
p <- cm$byClass["Precision"]; names(p) <- NULL  # precision
st <- cm$byClass["Sensitivity"]; names(st) <- NULL  # sensitivity
sp <- cm$byClass["Specificity"]; names(sp) <- NULL # specificity
f1 <- cm$byClass["F1"] ; names(f1) <- NULL # F1

# display metrics
list(accuracy=a, error_rt = e, precision=p, sensitivity=st, specificity=sp, F1=f1)
predict_target2 <- ifelse(mypredict >= 0.5, 1, 0)  
cm <- confusionMatrix(data=as.factor(predict_target2), reference=as.factor(auto$TARGET_FLAG), positive='1')

cm$table

a <- cm$overall["Accuracy"]; names(a) <- NULL # accuracy
e <- 1 - a; names(e) <- NULL # error rate
p <- cm$byClass["Precision"]; names(p) <- NULL  # precision
st <- cm$byClass["Sensitivity"]; names(st) <- NULL  # sensitivity
sp <- cm$byClass["Specificity"]; names(sp) <- NULL # specificity
f1 <- cm$byClass["F1"] ; names(f1) <- NULL # F1

# display metrics
list(accuracy=a, error_rt = e, precision=p, sensitivity=st, specificity=sp, F1=f1)

RMSE = function(m, o){
  sqrt(mean((m - o)^2))
}

# predictions
p4 <- predict(m4, auto_clm)
p4_mod <- predict(m4_mod, auto_clm)
p5 <- predict(m5, auto_clm)
p5_mod <- predict(m5_mod, auto_clm)
p6 <- predict(m6, auto_clm)
              

m4_out <- cbind(R_2=round(summary(m4)$r.squared,4),
                adj_R_2=round(summary(m4)$adj.r.squared,4),
                rmse=RMSE(exp(p4),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m4),AICc=AICc(m4),BIC=BIC(m4))
                
m4_mod_out <- cbind(R_2=round(summary(m4_mod)$r.squared,4),
                adj_R_2=round(summary(m4_mod)$adj.r.squared,4),
                rmse=RMSE(exp(p4_mod),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m4_mod),AICc=AICc(m4_mod),BIC=BIC(m4_mod))

m5_out <- cbind(R_2=round(summary(m5)$r.squared,4),
                adj_R_2=round(summary(m5)$adj.r.squared,4),
                rmse=RMSE(exp(p5),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m5),AICc=AICc(m5),BIC=BIC(m5))

m5_mod_out <- cbind(R_2=round(summary(m5_mod)$r.squared,4),
                adj_R_2=round(summary(m5_mod)$adj.r.squared,4),
                rmse=RMSE(exp(p5_mod),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m5_mod),AICc=AICc(m5_mod),BIC=BIC(m5_mod))

m6_out <- cbind(R_2=round(summary(m6)$r.squared,4),
                adj_R_2=round(summary(m6)$adj.r.squared,4),
                rmse=RMSE(exp(p6),exp(auto_clm$TARGET_AMT_MOD)),
                AIC=AIC(m6),AICc=AICc(m6),BIC=BIC(m6))

reg <- rbind(m4=m4_out, m4_mod=m4_mod_out,m5=m5_out,m5_mod=m5_mod_out,m6=m6_out)
rownames(reg) <- c("m4", "m4_mod", "m5", "m5_mod","m6")
reg            
par(mfrow = c(2,2))
plot(m6)

url <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/dataset/insurance-evaluation-data.csv"

auto <- read.csv(url)  


currency_fix <- function(x) {
  num <- str_replace_all(x, "\\$","")
  num <- as.numeric(str_replace_all(num, "\\,",""))
  num
}

auto$INCOME <- currency_fix(auto$INCOME)
auto$HOME_VAL <- currency_fix(auto$HOME_VAL)
auto$BLUEBOOK <- currency_fix(auto$BLUEBOOK)
auto$OLDCLAIM <- currency_fix(auto$OLDCLAIM)


auto[sapply(auto, is.factor)] <- lapply(auto[sapply(auto, is.factor)], 
                                        function(x) str_replace(x,"z_|<",""))

auto[sapply(auto, is.character)] <- lapply(auto[sapply(auto, is.character)],as.factor) 

auto$EDUCATION <- ordered(auto$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD") )

auto$INCOME <- auto$INCOME / 1000  
auto$HOME_VAL <- auto$HOME_VAL / 1000
auto$BLUEBOOK <- auto$BLUEBOOK / 1000
auto$OLDCLAIM <- auto$OLDCLAIM / 1000


auto$AGE <- ifelse(is.na(auto$AGE), mean(auto$AGE),auto$AGE)
tmp_data <- mice(auto,maxit=1, method='pmm',seed=20, print=F)
auto <- complete(tmp_data,1)

auto$TARGET_AMT_MOD <- log(auto$TARGET_AMT)
auto$KIDSDRIV_MOD <- scale(auto$KIDSDRIV, scale=F)
auto$HOMEKIDS_MOD <- scale(auto$HOMEKIDS, scale=F)
auto$CLM_FREQ_MOD <- scale(auto$CLM_FREQ, scale=F)
auto$AGE_MOD <- scale(auto$AGE, scale=F)
auto$YOJ_MOD <- scale(auto$YOJ, scale=F)
auto$TRAVTIME_MOD <- scale(auto$TRAVTIME, scale=F)
auto$TIF_MOD <- scale(auto$TIF, scale=F)
auto$INCOME_MOD <- auto$INCOME ^0.5
auto$HOME_VAL_MOD <- auto$HOME_VAL^0.25
auto$BLUEBOOK_MOD <- auto$BLUEBOOK^0.5
auto$OLD_CLAIM_MOD <- log(auto$OLDCLAIM + 1) 

mypred <- predict(m2_mod, auto, type='response')
auto$TARGET_FLAG <- ifelse(mypred  >= 0.276, 1, 0)
  
mypred2 <- exp(predict(m6, auto))  
auto$TARGET_AMT <- mypred2

write.csv(auto, "evaluation_data_w_predictions.csv")

DATA621_HW_4

Niteen Kumar

July 8, 2018

1 Introduction

2 Data Exploration

2.1 Variable Overview

2.2 TARGET Variables

2.2.0.1 Count Variables

2.3 Time Variables

2.3.0.1 Binary Variables

2.4 Multinomial Categorical Variables

2.5 Ordinal Categorical Variables

2.6 Continuous Variables

3 Data Preparation

3.1 Missing Values

4 Data Transformation

5 Build Models

5.0.0.1 Binary Logistic Regression

5.0.0.2 Multiple Linear Regression

6 SELECT MODELS

6.1 Binary Logistic Regression Models

6.2 Multiple Regression Models

6.3 Make Predictions

7 Appendix