In this report, we analyze insurance data to estimate the following quantities:
In practice, these claim frequency and severity measures are useful for determining appropriate pure premium amounts to charge auto policyholders.
Using the provide training data set, we will build two separate models:
We will then make predictions on the provided insurance testing data set.
The training data set includes 8,161 observations, with 26 variables: 23 predictors, two response variables, and one record identifier.
Below is a brief description of the included variables:
Variable Name | Description | Theoretical Impact |
---|---|---|
INDEX | Identification Variable (do not use) | None |
TARGET_FLAG | In a crash? 1=YES 0=NO | None |
TARGET_AMT | Cost of Crash, if applicable | None |
KIDSDRIV | # Driving Children | When teenagers drive your car, increased crash risk |
AGE | Age of Driver | Young and old drivers might be riskier |
HOMEKIDS | # Children at Home | Unknown effect |
YOJ | Years on Job | Long-term employees are usually safer |
INCOME | Income | In theory, rich have fewer crashes |
PARENT1 | Single Parent | Unknown impact |
HOME_VAL | Home Value | In theory, home owners may drive more responsibly |
MSTATUS | Marital Status | In theory, married individuals are less risky |
SEX | Gender | Urban legend: females are safer drivers |
EDUCATION | Max Education Level | Unknown, but in theory more educated people tend to drive more safely |
JOB | Job Category | In theory, white collar workers are less risky |
TRAVTIME | Commute Distance | Long drives to work usually suggest greater risk |
CAR_USE | Vehicle Use | Commercial fleet driven more, may impact collision prob |
BLUEBOOK | Value of Vehicle | Unknown impact on collision prob, but impacts crash payout |
TIF | Time in Force | Long-term customers are usually safer |
CAR_TYPE | Type of Car | Unknown impact on collision prob, but impacts crash payout |
RED_CAR | A Red Car | Urban legend: red cars are riskier, particularly sports cars |
OLDCLAIM | # Claims (Past 5 Years) | If total payout high, future payouts might be high |
CLM_FREQ | Total Claims (Past 5 Years) | Claim count should be positively correlated with future claims |
REVOKED | License Revoked (Past 7 Years) | If your license was revoked, you probably are a riskier driver |
MVR_PTS | Motor Vehicle Report Points | Traffic ticket counts have postive correlation with crashes |
CAR_AGE | Vehicle Age | Unknown impact on collision prob, but impacts crash payout |
URBANICITY | Home/Work Area | Unknown impact |
There are a couple issues with the raw data file:
We also rescaled the fields INCOME
, HOME_VAL
, BLUEBOOK
, and OLDCLAIM
to be so that dollars are expressed in $1,000s,
After cleaning up these minor issues, we’re ready to explore data types and sample observations from the training set:
'data.frame': 8161 obs. of 26 variables:
$ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
$ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
$ TARGET_AMT : num 0 0 0 0 0 ...
$ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
$ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
$ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
$ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
$ INCOME : num 67.3 91.4 16 NA 115 ...
$ PARENT1 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
$ HOME_VAL : num 0 257 124 306 244 ...
$ MSTATUS : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 1 2 2 1 1 ...
$ SEX : Factor w/ 2 levels "F","M": 2 2 1 2 1 1 1 2 1 2 ...
$ EDUCATION : Ord.factor w/ 4 levels "High School"<..: 4 1 1 1 4 2 1 2 2 2 ...
$ JOB : Factor w/ 9 levels "","Blue Collar",..: 8 2 3 2 4 2 2 2 3 8 ...
$ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
$ CAR_USE : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
$ BLUEBOOK : num 14.23 14.94 4.01 15.44 18 ...
$ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
$ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 5 1 5 4 5 6 5 6 ...
$ RED_CAR : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
$ OLDCLAIM : num 4.46 0 38.69 0 19.22 ...
$ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
$ REVOKED : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
$ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
$ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
$ URBANICITY : Factor w/ 2 levels "Highly Rural/ Rural",..: 2 2 2 2 2 2 2 2 2 1 ...
We ignore the field INDEX
for modeling purposes, which is used only for identifying observations.
The two response variables, TARGET_FLAG
and TARGET_AMT
, contain binary and dollar values, respectively.
The predictors include four discrete count variables:
KIDSDRIV
HOMEKIDS
CLM_FREQ
MVR_PTS
There are five discrete time measurements:
AGE
YOJ
TRAVTIME
TIF
CAR_AGE
There are also seven binary, categorical features:
PARENT1
MSTATUS
SEX
CAR_USE
RED_CAR
REVOKED
URBANCITY
The data set includes two multinomial, categorical features:
JOB
CAR_TYPE
There is one ordinal, categorical predictor:
EDUCTATION
Finally, there are four predictors that express dollar amounts. These variables are effectively continuous:
INCOME
HOME_VAL
BLUEBOOK
OLDCLAIM
Now, let’s review a high level statistical summary of the variables:
INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE
Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000 Min. :16.00
1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00
Median : 5133 Median :0.0000 Median : 0 Median :0.0000 Median :45.00
Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711 Mean :44.79
3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00
Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000 Max. :81.00
NA's :6
HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS
Min. :0.0000 Min. : 0.0 Min. : 0.00 No :7084 Min. : 0.0 No :3267
1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28.10 Yes:1077 1st Qu.: 0.0 Yes:4894
Median :0.0000 Median :11.0 Median : 54.03 Median :161.2
Mean :0.7212 Mean :10.5 Mean : 61.90 Mean :154.9
3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85.99 3rd Qu.:238.7
Max. :5.0000 Max. :23.0 Max. :367.03 Max. :885.3
NA's :454 NA's :445 NA's :464
SEX EDUCATION JOB TRAVTIME CAR_USE
F:4375 High School:3533 Blue Collar :1825 Min. : 5.00 Commercial:3029
M:3786 Bachelors :2242 Clerical :1271 1st Qu.: 22.00 Private :5132
Masters :1658 Professional:1117 Median : 33.00
PhD : 728 Manager : 988 Mean : 33.49
Lawyer : 835 3rd Qu.: 44.00
Student : 712 Max. :142.00
(Other) :1413
BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM
Min. : 1.50 Min. : 1.000 Minivan :2145 no :5783 Min. : 0.000
1st Qu.: 9.28 1st Qu.: 1.000 Panel Truck: 676 yes:2378 1st Qu.: 0.000
Median :14.44 Median : 4.000 Pickup :1389 Median : 0.000
Mean :15.71 Mean : 5.351 Sports Car : 907 Mean : 4.037
3rd Qu.:20.85 3rd Qu.: 7.000 SUV :2294 3rd Qu.: 4.636
Max. :69.74 Max. :25.000 Van : 750 Max. :57.037
CLM_FREQ REVOKED MVR_PTS CAR_AGE URBANICITY
Min. :0.0000 No :7161 Min. : 0.000 Min. :-3.000 Highly Rural/ Rural:1669
1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000 Highly Urban/ Urban:6492
Median :0.0000 Median : 1.000 Median : 8.000
Mean :0.7986 Mean : 1.696 Mean : 8.328
3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
Max. :5.0000 Max. :13.000 Max. :28.000
NA's :510
We notice a variety of missing values and strange field entries that may reflect data errors. These issues will be address in a later section.
Let’s now focus on each variable individually.
TARGET_FLAG
The response variable TARGET_FLAG
has a moderate imbalance, with three-quarters of the observations indicating no crashes.
0 1 Sum
count 6008.0 2153.0 8161
percent 73.6 26.4 100
TARGET_AMT
The other response, TARGET_AMT
, exhibits extreme, positive skewness and high kurtosis.
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
0.00 0.00 0.00 1504.32 1036.00 107586.14 4704.03 8.71 115.32
We already noted that almost three quarters of records indicate no car crash. In these cases, the TARGET_AMT
has a zero value.
For modeling purposes, however, we will only be interested in dollar amounts where a crash occurred. Going forward, when performing calculations and summaries involving TARGET_AMT
, we will use a subset of the data where zero amounts are filtered out.
Here is a summary of the zero-truncated TARGET_AMT
variable:
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
30.28 2609.78 4104.00 5702.18 5787.00 107586.14 7743.18 5.64 45.49
Even after we remove the zero values, the variable remains highly skewed:
KIDSDRIV
The discrete variable KIDSDRIV
is right skewed, with 88% of insureds in the training data having no teenage drivers in the household.
0 1 2 3 4 Sum
count 7180 636.0 279.0 62.0 4 8161
percent 88 7.8 3.4 0.8 0 100
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
0.00 0.00 0.00 0.17 0.00 4.00 0.51 3.35 14.78
In the barplot below, we see different a slightly different distribution of policyholders involved in crashes vis-a-vis the incident-free insureds. Specifically, we see slightly higher concentration of individuals teenage drivers. The scatter plot also indicates a relationship between relationship between number of teenage drivers and the log odds of an auto incident.
We also plot the the log TARGET_AMT–given that a crash occurred–against KIDSDRIV
. Note: we applied the log transformation because TARGET_AMT
is highly right skewed, and the directional relationship with KIDSDRIV
should be clearer in this form.
While not entirely clear in the boxplot, there appears to be a rough, positive relationship between KIDSDRIV
and the median cost of the crash. The relationship with the mean crash amount, however, is less clear.
KIDSDRIV | ct | mean_cost | median_cost |
---|---|---|---|
0 | 1773 | 5659 | 4123 |
1 | 236 | 6220 | 3958 |
2 | 111 | 5542 | 4148 |
3 | 31 | 4915 | 4541 |
4 | 2 | 4054 | 4054 |
HOMEKIDS
Numerical distribution and statistical summaries are presented below:
0 1 2 3 4 5 Sum
count 5289.0 902.0 1118.0 674.0 164 14.0 8161
percent 64.8 11.1 13.7 8.3 2 0.2 100
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
0.00 0.00 0.00 0.72 1.00 5.00 1.12 1.34 3.65
The distribution of this discrete variable is right skewed, but not to the same extent as KIDSRIV
. HOMEKIDS
contains some of the same information as KIDSDRIV
: presumably, some of the reported children are also drivers.
Let’s look plots relating this predictor to the target variables:
The barplot provides some evidence that policyholders with crashes tend to have more children than than insureds not involved in an auto incident. The scatterplot indicates a significant difference in log odds between policyholders with children vs. insureds without children. We discount the observation associated with five children due to the small sample size.
There appears to be a subtle relationship between HOMEKIDS
and the median cost of crashes. The relationship with the mean is less clear, given the highly variable and skewed distribution of crash amounts.
HOMEKIDS | ct | mean_cost | median_cost |
---|---|---|---|
0 | 1173 | 5685 | 4080 |
1 | 305 | 5522 | 4036 |
2 | 382 | 6085 | 4122 |
3 | 230 | 5432 | 4192 |
4 | 57 | 5610 | 4575 |
5 | 6 | 5009 | 5130 |
CLM_FREQ
Let’s review numerical and statistical summaries for CLM_FREQ
, another discrete count variable:
0 1 2 3 4 5 Sum
count 5009.0 997.0 1171.0 776.0 190.0 18.0 8161
percent 61.4 12.2 14.3 9.5 2.3 0.2 100
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
0.00 0.00 0.00 0.80 2.00 5.00 1.16 1.21 3.29
This variable has a similar skew as the previously reviewed count variable, HOMEKIDS
.
Based on the barplot below, there seems to be a significance difference in the distribution of prior claim frequencies for TARGET_FLAG
values of zero vs. one. We see roughly 60% of auto claimants have had one or more prior claims in the past five year, while only 40% of non-claimants have had accidents. The scatter plot indicates a nonlinear albeit positive relationship between log odds of a claim and prior claim history. We also do not see a clear pattern between CLM_FREQ
and the claim amounts.
CLM_FREQ | ct | mean_cost | median_cost |
---|---|---|---|
0 | 898 | 5633 | 4160 |
1 | 385 | 5995 | 4312 |
2 | 469 | 5463 | 3858 |
3 | 314 | 5970 | 4206 |
4 | 80 | 5551 | 3393 |
5 | 7 | 4247 | 3737 |
MVR_PTS
Like the other count variables, MVR_PTS
is positively skewed. There seems to be a positive relationship between points and log odds; however the scatter plot below indicates a strange, curved relationship between the two variables. We wonder if the curvature is related to to interaction with another predictor.
We see no straightforward relationship between TARGET_AMT
and MVR_PTS
.
0 1 2 3 4 5 6 7 8 9 10 11 13 Sum
count 3712.0 1157.0 948.0 758.0 599.0 399.0 266.0 167 84 45.0 13.0 11.0 2 8161
percent 45.5 14.2 11.6 9.3 7.3 4.9 3.3 2 1 0.6 0.2 0.1 0 100
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
0.00 0.00 1.00 1.70 3.00 13.00 2.15 1.35 4.38
BIN_MVR | ct | mean_cost | median_cost |
---|---|---|---|
0 | 985 | 5379 | 4037 |
2 | 506 | 5762 | 4068 |
4 | 354 | 6110 | 4292 |
6 | 198 | 5896 | 4194 |
8 | 88 | 6809 | 4492 |
10 | 20 | 6293 | 3824 |
12 | 2 | 3786 | 3786 |
AGE
Below is table of values of the AGE
predictor, with ages bucketed into 5 year increments (i.e. [15,20), [20,25), [30,35), etc.])
15 20 25 30 35 40 45 50 55 60 65 70 75 80 Sum
count 14.0 58.0 249.0 649 1259.0 1673.0 1810.0 1373.0 725.0 265.0 65.0 12.0 1 2 8155
percent 0.2 0.7 3.1 8 15.4 20.5 22.2 16.8 8.9 3.2 0.8 0.1 0 0 100
Here is a statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's StdD Skew Kurt
16.00 39.00 45.00 44.79 51.00 81.00 6.00 8.63 -0.03 2.94
We note six missing values that we’ll need to address later.
The distribution of AGE
is almost perfectly normal. When we break out the data by TARGET_FLAG
values, the distributions of age by TARGET_FLAG
are still roughly normal. However, individuals involved in a crash appear to be slightly younger, on average.
In the bottom left scatter plot, we notice a pattern between log odds and age. Specifically, there appears to be a curved relationship where younger ages have a higher odds of a crash. The odds continue to decrease until around age 60, when costs begin to trend upward again.
TARGET_AMT
appears to be slightly higher for both younger (< 25) and older (>60) drivers, but the differences appear to be subtle.
AGEBIN | ct | mean_cost | median_cost |
---|---|---|---|
10 | 7 | 9650 | 3917 |
20 | 151 | 5412 | 4193 |
30 | 627 | 5601 | 4144 |
40 | 804 | 5470 | 3936 |
50 | 446 | 6280 | 4291 |
60 | 109 | 5907 | 4460 |
70 | 4 | 4284 | 4276 |
YOJ
YOB
refers to the number of years in the insured’s current job.
Here is a numerical summary, binned in 2 year increments:
0 2 4 6 8 10 12 14 16 18 22 Sum
count 631.0 51.0 129.0 473.0 905.0 1752.0 2174.0 1248.0 305 37.0 2 7707
percent 8.2 0.7 1.7 6.1 11.7 22.7 28.2 16.2 4 0.5 0 100
Below is the statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's StdD Skew Kurt
0.00 9.00 11.00 10.50 13.00 23.00 454.00 4.09 -1.20 4.18
We have a significant number of NA records–454, or 5.6%–that we’ll need to address.
The variable would be approximately normally distributed if it weren’t for the high percentage of individuals with less than one year on the job.
Insureds with accidents have a relatively high proportion of individuals with less than a year on the job.
The relationship between log odds and YOB
appears to be complex–see the fitted loess curve below.
The relationship between YOJ
and TARGET_AMT
is also not very clear.
YOJBIN | ct | mean_cost | median_cost |
---|---|---|---|
0 | 257 | 4765 | 3736 |
4 | 134 | 5993 | 4512 |
8 | 729 | 5964 | 4026 |
12 | 811 | 5590 | 4249 |
16 | 99 | 6001 | 3969 |
TRAVTIME
Here is a summary of TRAVTIME
, the commuting distance to work, in bins of 10 minutes:
0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 Sum
count 550.0 1036.0 1788.0 2023.0 1531.0 784.0 305.0 94.0 33.0 11.0 2 1 1 1 1 8161
percent 6.7 12.7 21.9 24.8 18.8 9.6 3.7 1.2 0.4 0.1 0 0 0 0 0 100
Here is a statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
5.00 22.00 33.00 33.49 44.00 142.00 15.91 0.45 3.67
The distribution has a slight positive skew. The subset of insureds with no accidents have a higher proportion of individuals with short commute times. In the scatterplot below, we notice a generally positive relationship between log odds and TRAVETIME
. Some of the curvature in the loess curve is likely driven by small sample sizes for long commute times.
Finally, we see a slight, upward curvature in the log of TARGET_AMT
for long commute times. Judging from the loess curve confidence interval, this upward trend may not be statistically significant.
TRAVBIN | ct | mean_cost | median_cost |
---|---|---|---|
0 | 214 | 5395 | 4539 |
15 | 596 | 5846 | 3998 |
30 | 772 | 5671 | 4163 |
45 | 451 | 5516 | 3964 |
60 | 104 | 6470 | 4376 |
75 | 15 | 6169 | 4579 |
90 | 1 | 6909 | 6909 |
TIF
Here is a distribution of time in force values in bins of 2 year increments:
0 2 4 6 8 10 12 14 16 18 20 22 24 Sum
count 2533 430.0 1294.0 1961 285.0 1022.0 323 109.0 148.0 32.0 19.0 3 2 8161
percent 31 5.3 15.9 24 3.5 12.5 4 1.3 1.8 0.4 0.2 0 0 100
Here is the statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
1.00 1.00 4.00 5.35 7.00 25.00 4.15 0.89 3.42
The distribution is somewhat positively skewed. We see somewhat small sub-samples for TIF
values of 2-3 and 8-9. Also, the log odds vs. TIF
scatterplot suggests a quadratic relationship.
Finally, there does not seem to be a significant relationship between log of TARGET_AMT
and TIF
.
TIFBIN | ct | mean_cost | median_cost |
---|---|---|---|
0 | 912 | 5858 | 4102 |
4 | 823 | 5609 | 4049 |
8 | 289 | 5518 | 4220 |
12 | 91 | 5764 | 4061 |
16 | 35 | 5257 | 4868 |
20 | 3 | 5025 | 5610 |
CAR_AGE
The predictor CAR_AGE
describes the age in years of the insured’s car. Below we bin the data in increments of three:
-3 0 3 6 9 12 15 18 21 24 27 Sum
count 1 1949.0 494.0 1512.0 1455 1035.0 720.0 369.0 96.0 18.0 2 7651
percent 0 25.5 6.5 19.8 19 13.5 9.4 4.8 1.3 0.2 0 100
Here is the statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's StdD Skew Kurt
-3.00 1.00 8.00 8.33 12.00 28.00 510.00 5.70 0.28 2.25
There is an observation that indicates a CAR_AGE
value of -3. This is clearly an error. Also, there are 510 missing observations, representing 6% of total records. 1934 out of the 8161 (25%) of the records have a value of one.
Surprisingly, CAR_AGE
appears to be negatively correlated with the log odds of a claim. Perhaps there is a confounding variable responsible for this result.
Finally, we expected to see a stronger relationship between CAR_AGE
and the log of TARGET_AMT
.
Our intuition was that payouts go down with the age of the car, as replacement value goes down with age. However, the bottom right scatter plot below does not seem to indicate a significant association.
CAR_AGE | ct | mean_cost | median_cost |
---|---|---|---|
-4 | 1 | 1469 | 1469 |
0 | 646 | 5824 | 4107 |
4 | 392 | 5884 | 4224 |
8 | 516 | 5789 | 4088 |
12 | 261 | 5431 | 4274 |
16 | 158 | 5327 | 4076 |
20 | 34 | 5438 | 3150 |
24 | 3 | 4480 | 3949 |
PARENT1
Here is a record summary for PARENT1
, a binary variable indicating if an insured is a single parent.
No Yes Sum
count 7084.0 1077.0 8161
percent 86.8 13.2 100
The vast majority (87%) of individuals in the training data are not single parents.
Let’s explore the relationship with TARGET_FLAG
TARGET_FLAG
PARENT1 0 1 Sum
No 5407 1677 7084
Yes 601 476 1077
Sum 6008 2153 8161
Now let’s review the proportions of individuals involved in a crash, given PARENT1
status:
TARGET_FLAG
PARENT1 0 1
No 0.76 0.24
Yes 0.56 0.44
The is a 20% difference in the calculated proportions. This difference is statistically significant:
2-sample test for equality of proportions with continuity correction
data: tbl[1:2, 1:2]
X-squared = 201.7, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
0.1734351 0.2370404
sample estimates:
prop 1 prop 2
0.7632693 0.5580316
Finally, lets review the relationship of PARENT1
with TARGET_AMT
.
PARENT1 | ct | mean_cost | median_cost |
---|---|---|---|
No | 1677 | 5603 | 4101 |
Yes | 476 | 6050 | 4133 |
The distribution of log amounts is pretty similar for both values of PARENT1
, with perhaps a very modest increase for single parent insureds.
MSTATUS
The predictor, MSTATUS
, refers to the married status of the policyholder.
No Yes Sum
count 3267 4894 8161
percent 40 60 100
There is a fairly balanced split (60/40) between married and single insureds.
Let’s look at the relationship with TARGET_FLAG
TARGET_FLAG
MSTATUS 0 1 Sum
No 2167 1100 3267
Yes 3841 1053 4894
Sum 6008 2153 8161
Now let’s review the proportions of individuals involved in a crash, given MSTATUS
status:
TARGET_FLAG
MSTATUS 0 1
No 0.66 0.34
Yes 0.78 0.22
The 12% difference in proportions is statistically significant:
2-sample test for equality of proportions with continuity correction
data: tbl[1:2, 1:2]
X-squared = 148.38, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
-0.1416726 -0.1014053
sample estimates:
prop 1 prop 2
0.6632997 0.7848386
Now lets explore the relationship of MSTATUS
with TARGET_AMT
.
MSTATUS | ct | mean_cost | median_cost |
---|---|---|---|
No | 1100 | 5967 | 4098 |
Yes | 1053 | 5426 | 4117 |
The distribution of log amounts is very similar across MSTATUS
type. Non married individuals appear to have a slightly higher average log cost compared to the married cohort. However, median costs are almost identical between the two cohorts.
SEX
The variable, SEX
, denotes the gender of the insured policyholder.
F M Sum
count 4375.0 3786.0 8161
percent 53.6 46.4 100
The split between males and females is split almost 50/50.
Let’s review the relationship with TARGET_FLAG
.
TARGET_FLAG
SEX 0 1 Sum
F 3183 1192 4375
M 2825 961 3786
Sum 6008 2153 8161
Here are the proportions of individuals involved in a crash, given gender type:
TARGET_FLAG
SEX 0 1
F 0.73 0.27
M 0.75 0.25
The 2% difference in proportions is not statistically significant:
2-sample test for equality of proportions with continuity correction
data: tbl[1:2, 1:2]
X-squared = 3.5307, df = 1, p-value = 0.06024
alternative hypothesis: two.sided
95 percent confidence interval:
-0.0380106016 0.0007561151
sample estimates:
prop 1 prop 2
0.7275429 0.7461701
Now lets explore the relationship of SEX
with TARGET_AMT
.
SEX | ct | mean_cost | median_cost |
---|---|---|---|
F | 1192 | 5344 | 4039 |
M | 961 | 6147 | 4220 |
The log dollar cost of claims seems to be slightly higher for males, on average.
CAR_USE
CAR_USE
is a predictor that indicates how the vehicle is uses (i.e. commercial or private purposes).
Commercial Private Sum
count 3029.0 5132.0 8161
percent 37.1 62.9 100
The majority of the observations involve private use vehicles, at 60%.
Let’s explore the relationship with TARGET_FLAG
.
TARGET_FLAG
CAR_USE 0 1 Sum
Commercial 1982 1047 3029
Private 4026 1106 5132
Sum 6008 2153 8161
Below are the proportions of individuals involved in a crash, given the CAR_USE
indicator:
TARGET_FLAG
CAR_USE 0 1
Commercial 0.65 0.35
Private 0.78 0.22
There is a 13% difference in proportions between the two cohorts. This result is statistically significant.
2-sample test for equality of proportions with continuity correction
data: tbl[1:2, 1:2]
X-squared = 165.45, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
-0.1507428 -0.1095535
sample estimates:
prop 1 prop 2
0.6543414 0.7844895
Finally, lets explore the relationship of CAR_USE
with TARGET_AMT
.
CAR_USE | ct | mean_cost | median_cost |
---|---|---|---|
Commercial | 1047 | 6099 | 4193 |
Private | 1106 | 5327 | 4037 |
The log dollar cost of claims involving commercial transit appears to somewhat higher than costs associated with private transportation.
RED_CAR
RED_CAR
is a binary predictor indicating whether a care is primarily red in color.
no yes Sum
count 5783.0 2378.0 8161
percent 70.9 29.1 100
We see only 30% of vehicles in the red category.
Let’s look at the relationship with TARGET_FLAG
.
TARGET_FLAG
RED_CAR 0 1 Sum
no 4246 1537 5783
yes 1762 616 2378
Sum 6008 2153 8161
Here are the proportions of individuals involved in a crash, given the RED_CAR
status:
TARGET_FLAG
RED_CAR 0 1
no 0.73 0.27
yes 0.74 0.26
There is only 1% difference in proportions between red and non-red cars. The result is not statistically significant.
2-sample test for equality of proportions with continuity correction
data: tbl[1:2, 1:2]
X-squared = 0.35996, df = 1, p-value = 0.5485
alternative hypothesis: two.sided
95 percent confidence interval:
-0.02800322 0.01452763
sample estimates:
prop 1 prop 2
0.7342210 0.7409588
Now, we’ll explore the relationship between RED_CAR
and TARGET_AMT
.
RED_CAR | ct | mean_cost | median_cost |
---|---|---|---|
no | 1537 | 5568 | 4112 |
yes | 616 | 6036 | 4082 |
The distribution for each cohort is very similar, with an subtle uptick in average log costs for red car types.
REVOKED
The variable REVOKED
indicates whether a driver’s license has been suspended within the last seven years.
No Yes Sum
count 7161.0 1000.0 8161
percent 87.7 12.3 100
Only 12% of drivers in the training data have a former license suspension on record.
Here is a look look at the relationship of REVOKED
with TARGET_FLAG
.
TARGET_FLAG
REVOKED 0 1 Sum
No 5451 1710 7161
Yes 557 443 1000
Sum 6008 2153 8161
Below are the proportions of individuals involved in a crash, given the REVOKED
status:
TARGET_FLAG
REVOKED 0 1
No 0.76 0.24
Yes 0.56 0.44
There is statistically significant difference in proportions by REVOKED
type, with a 20% observed difference in the training data.
2-sample test for equality of proportions with continuity correction
data: tbl[1:2, 1:2]
X-squared = 187.35, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
0.1713042 0.2371089
sample estimates:
prop 1 prop 2
0.7612065 0.5570000
Finally, let’s explore the relationship between REVOKED
and TARGET_AMT
.
REVOKED | ct | mean_cost | median_cost |
---|---|---|---|
No | 1710 | 5848 | 4059 |
Yes | 443 | 5140 | 4254 |
The distribution for each cohort is very similar. In the training data, we very slight increase in average log costs for folks that have not had their license suspended. However, the median log cost seems to be slightly higher for folks with a prior suspended license.
URBANICITY
The predictor URBANICITY
indicates whether the environment in which the driver primarily drives: urban vs. rural.
Highly Rural/ Rural Highly Urban/ Urban Sum
count 1669.0 6492.0 8161
percent 20.5 79.5 100
Only 30% of drivers in the training data are categorized as being in a rural environment. .
Let’s look at the relationship of URBANICITY
with TARGET_FLAG
.
TARGET_FLAG
URBANICITY 0 1 Sum
Highly Rural/ Rural 1554 115 1669
Highly Urban/ Urban 4454 2038 6492
Sum 6008 2153 8161
Her are the proportions of individuals involved in a crash, given the URBANICITY
category :
TARGET_FLAG
URBANICITY 0 1
Highly Rural/ Rural 0.93 0.07
Highly Urban/ Urban 0.69 0.31
There is a huge difference, 24%, between the two proportions. Urban drivers appear to be significantly more at risk for being involved in a crash.
2-sample test for equality of proportions with continuity correction
data: tbl[1:2, 1:2]
X-squared = 409.14, df = 1, p-value < 0.00000000000000022
alternative hypothesis: two.sided
95 percent confidence interval:
0.2280583 0.2619842
sample estimates:
prop 1 prop 2
0.9310965 0.6860752
Let’s review the relationship between URBANICITY
and TARGET_AMT
.
URBANICITY | ct | mean_cost | median_cost |
---|---|---|---|
Highly Rural/ Rural | 115 | 5545 | 3589 |
Highly Urban/ Urban | 2038 | 5711 | 4125 |
Urban drivers tend to have somewhat higher claim costs, given a crash. There also appears to be more variability in claim costs for urban drivers compared to their rural counterparts.
JOB
The variable JOB
indicates the insured’s job category. there is also blank category which we interpret to mean unknown or not working.
Blue Collar Clerical Doctor Home Maker Lawyer Manager Professional Student Sum
count 526.0 1825.0 1271.0 246 641.0 835.0 988.0 1117.0 712.0 8161
percent 6.4 22.4 15.6 3 7.9 10.2 12.1 13.7 8.7 100
Here is a breakdown of job categories by TARGET_FLAG
:
TARGET_FLAG
JOB 0 1 Sum
390 136 526
Blue Collar 1191 634 1825
Clerical 900 371 1271
Doctor 217 29 246
Home Maker 461 180 641
Lawyer 682 153 835
Manager 851 137 988
Professional 870 247 1117
Student 446 266 712
Sum 6008 2153 8161
Let’s calculate the proportion of observations by job category in each TARGET_FLAG
indicator:
TARGET_FLAG
JOB 0 1
0.74 0.26
Blue Collar 0.65 0.35
Clerical 0.71 0.29
Doctor 0.88 0.12
Home Maker 0.72 0.28
Lawyer 0.82 0.18
Manager 0.86 0.14
Professional 0.78 0.22
Student 0.63 0.37
There appears to be significant variability in the probability of a claim by occupational category, with students and blue collar jobs leading the pack.
We reject the null hypothesis that all proportions are identical:
9-sample test for equality of proportions without continuity correction
data: tbl[1:9, 1:2]
X-squared = 261.06, df = 8, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6 prop 7 prop 8 prop 9
0.7414449 0.6526027 0.7081039 0.8821138 0.7191888 0.8167665 0.8613360 0.7788720 0.6264045
JOB | ct | mean_cost | median_cost |
---|---|---|---|
136 | 6904 | 4155 | |
Blue Collar | 634 | 5890 | 4042 |
Clerical | 371 | 5446 | 4144 |
Doctor | 29 | 4896 | 4117 |
Home Maker | 180 | 4951 | 3612 |
Lawyer | 153 | 5991 | 4019 |
Manager | 137 | 4944 | 4256 |
Professional | 247 | 6560 | 4348 |
Student | 266 | 5021 | 4188 |
There is moderate variability in log costs across job categories, with the unknown, student, and professional, and manager categories appearing to have higher median costs than than other five categories. The relationship are somewhat different though when viewing mean costs (or mean log costs) by category.
CAR_TYPE
The predictor CAR_TYPE
indicates the insured’s vehicle type.
Minivan Panel Truck Pickup Sports Car SUV Van Sum
count 2145.0 676.0 1389 907.0 2294.0 750.0 8161
percent 26.3 8.3 17 11.1 28.1 9.2 100
Here is a breakdown of CAR_TYPE categories by TARGET_FLAG
:
TARGET_FLAG
CAR_TYPE 0 1 Sum
Minivan 1796 349 2145
Panel Truck 498 178 676
Pickup 946 443 1389
Sports Car 603 304 907
SUV 1616 678 2294
Van 549 201 750
Sum 6008 2153 8161
We’ll now calculate the proportion of observations by CAR_TYPE category in each TARGET_FLAG
indicator:
TARGET_FLAG
CAR_TYPE 0 1
Minivan 0.84 0.16
Panel Truck 0.74 0.26
Pickup 0.68 0.32
Sports Car 0.66 0.34
SUV 0.70 0.30
Van 0.73 0.27
There is significant variability in the probability of a claim by occupational category, with sports cars having the highest proportion of accidents, and minivan have the lowest proportion.
We reject the null hypothesis that all proportions are identical:
6-sample test for equality of proportions without continuity correction
data: tbl[1:6, 1:2]
X-squared = 170.38, df = 5, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6
0.8372960 0.7366864 0.6810655 0.6648291 0.7044464 0.7320000
CAR_TYPE | ct | mean_cost | median_cost |
---|---|---|---|
Minivan | 349 | 5602 | 4023 |
Panel Truck | 178 | 7465 | 4556 |
Pickup | 443 | 5430 | 4147 |
Sports Car | 304 | 5413 | 4023 |
SUV | 678 | 5241 | 4032 |
Van | 201 | 6909 | 4220 |
Panel trucks and vans appear to have significantly higher log claim costs compared to other car types.
EDUCATION
The variable EDUCATION
denotes the insured’s highest level of education attained.
High School Bachelors Masters PhD Sum
count 3533.0 2242.0 1658.0 728.0 8161
percent 43.3 27.5 20.3 8.9 100
Below is a breakdown of EDUCATION categories by TARGET_FLAG
:
TARGET_FLAG
EDUCATION 0 1 Sum
High School 2355 1178 3533
Bachelors 1719 523 2242
Masters 1331 327 1658
PhD 603 125 728
Sum 6008 2153 8161
Let’s review the proportion of observations by EDUCATION category in each TARGET_FLAG
indicator:
TARGET_FLAG
EDUCATION 0 1
High School 0.67 0.33
Bachelors 0.77 0.23
Masters 0.80 0.20
PhD 0.83 0.17
There are statistically significant differences between the categories, with higher levels of educational attainment associated with lower probabilities of crashes.
4-sample test for equality of proportions without continuity correction
data: tbl[1:4, 1:2]
X-squared = 168.58, df = 3, p-value < 0.00000000000000022
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.6665723 0.7667261 0.8027744 0.8282967
EDUCATION | ct | mean_cost | median_cost |
---|---|---|---|
High School | 1178 | 5451 | 4079 |
Bachelors | 523 | 5883 | 4036 |
Masters | 327 | 5966 | 4256 |
PhD | 125 | 6623 | 4395 |
Interestingly, the TARGET_AMT
appears to have a positive association with the level of EDUCATION
.
INCOME
Below is table of values of the INCOME
predictor, with wage bucketed into $30K increments (i.e. [0,30), [30,60), [60,90), etc.])
0 30 60 90 120 150 180 210 240 270 300 330 360 Sum
count 2059.0 2206.0 1671.0 951.0 417.0 203.0 117.0 56.0 20.0 11.0 3 1 1 7716
percent 26.7 28.6 21.7 12.3 5.4 2.6 1.5 0.7 0.3 0.1 0 0 0 100
Here is a statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's StdD Skew Kurt
0.00 28.10 54.03 61.90 85.99 367.03 445.00 47.57 1.19 5.13
There are some missing values, 445 (5.4% of total), that we’ll address later.
The distribution of INCOME
is right skewed, with a significant number of observations indicating $0 in income.
Individuals involved in crashes appear to have a high proportion of low-wage earners. We see the log-odds of a crash decreasing with increases to income–see the lower left scatter plot and loess curve. We see the loess curve starting to bend upward around $210k, although this phenomenon is possibly due to sparse data for high wage earners.
Finally, the relationship between income and log TARGET_AMT
also does not appear to be very strong.
INCBIN | ct | mean_cost | median_cost |
---|---|---|---|
0 | 695 | 5102 | 4045 |
30 | 654 | 5909 | 4299 |
60 | 404 | 6053 | 3988 |
90 | 163 | 6603 | 4254 |
120 | 66 | 6107 | 3856 |
150 | 31 | 5671 | 4738 |
180 | 10 | 4091 | 3940 |
210 | 14 | 6020 | 5402 |
240 | 1 | 4104 | 4104 |
270 | 4 | 3100 | 3007 |
300 | 1 | 3231 | 3231 |
HOME_VAL
Below is table of values of the HOME_VAL
predictor, with home appraisals bucketed into $50K increments.
0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 850 Sum
count 2294.0 371.0 902.0 1265.0 1181.0 751.0 435.0 208.0 128.0 85.0 43.0 21.0 7.0 3 1 1 1 7697
percent 29.8 4.8 11.7 16.4 15.3 9.8 5.7 2.7 1.7 1.1 0.6 0.3 0.1 0 0 0 0 100
Below is a statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's StdD Skew Kurt
0.00 0.00 161.16 154.87 238.72 885.28 464.00 129.12 0.49 2.98
There are some missing values, 464 (5.7% of total). We’ll address these later.
The distribution of HOME_VAL
is right skewed, with a large proportion of observations indicating $0 in HOME_VAL–this probably refers to policyholders who are not homeowners.
Individuals involved in crashes have a higher proportion of low home values. The loess curve in the lower left scatterplot indicates a decreasing log odds of claim occurrence with increasing incomes. The upward trend in the curve around $400k is possibly due to the the smaller sample size for higher home values.
Finally, the relationship between HOME_VAL
and log TARGET_AMT
also does not appear to be very strong.
HOMEVALBIN | ct | mean_cost | median_cost |
---|---|---|---|
0 | 846 | 5550 | 4006 |
50 | 111 | 4591 | 4400 |
100 | 247 | 5233 | 4185 |
150 | 328 | 6004 | 4280 |
200 | 264 | 6086 | 4015 |
250 | 115 | 5573 | 4093 |
300 | 59 | 6990 | 4021 |
350 | 24 | 9735 | 4071 |
400 | 14 | 4011 | 3868 |
450 | 14 | 4458 | 3981 |
500 | 3 | 5917 | 5455 |
550 | 3 | 4756 | 4435 |
600 | 2 | 3540 | 3540 |
650 | 1 | 2908 | 2908 |
750 | 1 | 3231 | 3231 |
BLUEBOOK
Below is table of values of the BLUEBOOK
predictor, with values bucketed into $4K increments.
0 4 8 12 16 20 24 28 32 36 40 44 48 56 60 64 68 Sum
count 281.0 1327.0 1503.0 1548 1238.0 903.0 628.0 390.0 190.0 83 45.0 14.0 6.0 1 2 1 1 8161
percent 3.4 16.3 18.4 19 15.2 11.1 7.7 4.8 2.3 1 0.6 0.2 0.1 0 0 0 0 100
Here is a statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
1.50 9.28 14.44 15.71 20.85 69.74 8.42 0.79 3.79
The distribution of BLUEBOOK
is right skewed, like the continuous variables reviewed earlier.
Individuals involved in crashes have a higher proportion of low BLUEBOOK
values. The loess curve in the lower left scatterplot indicates a decreasing log odds of claim occurrence with increasing incomes. However, there is a prominent bend in the curve around $30k. Perhaps this upward bend indicates risky driving behavior by owners of luxury and sports cars.
Finally, the relationship between BLUEBOOK and log TARGET_AMT
reveals an increase in payouts with the value of the auto, but the curve flattens out around $20k - $30k.
BLUEBKBIN | ct | mean_cost | median_cost |
---|---|---|---|
0 | 171 | 3680 | 3423 |
5 | 614 | 5131 | 4196 |
10 | 504 | 5463 | 4051 |
15 | 373 | 6148 | 4164 |
20 | 267 | 7177 | 4256 |
25 | 123 | 6224 | 4369 |
30 | 57 | 6180 | 4321 |
35 | 26 | 11346 | 5308 |
40 | 13 | 6798 | 3041 |
45 | 2 | 3846 | 3846 |
50 | 1 | 18084 | 18084 |
55 | 1 | 3231 | 3231 |
60 | 1 | 4104 | 4104 |
OLDCLAIM
Below is table of values of the OLDCLAIM
variable, with values bucketed into $4K increments.
0 Sum
count 8161 8161
percent 100 100
Below is a statistical summary:
Min. 1st Qu. Median Mean 3rd Qu. Max. StdD Skew Kurt
0.00 0.00 0.00 4.04 4.64 57.04 8.78 3.12 12.86
The distribution of OLDCLAIM
is extremely right skewed.
There does not appear to be a clear relationship between OLDCLAIM
and log of TARGET_CLM
. High previous claim amounts seem to be positively associated with the log odds of a future claim. However, this relationship appears to take a quadratic form.
OLD_BIN | ct | mean_cost | median_cost |
---|---|---|---|
0 | 1395 | 5814 | 4148 |
5 | 452 | 5409 | 4025 |
10 | 62 | 6347 | 4186 |
15 | 28 | 4712 | 3318 |
20 | 33 | 4865 | 3764 |
25 | 49 | 5060 | 4460 |
30 | 53 | 4898 | 4262 |
35 | 33 | 6875 | 3779 |
40 | 29 | 6922 | 4477 |
45 | 16 | 5350 | 3299 |
50 | 2 | 3070 | 3070 |
55 | 1 | 7803 | 7803 |
The following five variables have missing variables:
Because four of the predictors are missing a significant number of records (5%-6%), we will attempt to use a sophisticated imputation procedure from R’s MICE package to fill in the missing values. Our goal in using this procedure is to minimize the introduction of bias in our data vis-a-vis simpler methods like mean substitution.
One way to assess the quality of the imputation procedure used in MICE is to compare the distribution of the imputed data to the distribution of the non-missing values. Let’s do that now by reviewing density plots:
With the exception of the
AGE
variable, the imputed variables seem to reasonably approximate the original distribution.
we’re only missing a handful of values for AGE
; so we will apply simply mean imputation for this variable.
Finally, we’ll assume the -3 value for CAR_AGE
is actually zero.
TARGET_FLAG
No changes.
TARGET_AMT
The response variable TARGET_AMT
is highly right skewed. We will use the box-cox procedure to suggest a reasonable transformation when the TARGET_AMT
variable is positive:
Fitted parameters:
lambda beta sigmasq
0.003172095 8.386645882 0.695511168
Convergence code returned by optim: 0
Given this output, We will apply the log transformation.
KIDSDRIV
Given the possible curvature between the log odds and KIDSDRIV
, we’re going to introduce a centered version of this variable to reduce possible collinearity issues.
HOMEKIDS
We will create another centered version of this variable due to a possible quadratic relationship with the response variable, TARGET_FLAG.
Again, our intent with this transformation is avoid multicollinearity issues.
CLM_FREQ
Once again, we’ll center the variable due concerns described earlier.
MVR_PTS
No changes.
AGE
Age appears to have a quadratic relationship with both TARGET_FLAG
and TARGET_AMT
. We’ll center the variable.
YOJ
The variable YOJ
has a significant number of zero observations. We’ll create a mean centered variable, YOJ_MOD
, that will be used for years on job of 1 or higher.
TRAVTIME
Well apply the mean-center transformation.
TIF
We’ll mean-center the variable to reduce multicollinearity issues if we implement a quadratic term.
CAR_AGE
No changes.
PARENT1
No changes.
MSTATUS
No changes.
SEX
No changes.
CARUSE
No changes.
RED_CAR
No changes.
REVOKED
No changes.
URBANICITY
No changes.
JOB
No changes.
CAR_TYPE
No changes.
EDUCATION
No changes.
INCOME
Income is a positively skewed variable with a significant number zeroes.
We will apply the square root transformation suggested by the box-cox procedure to the original variable to reduce the overall skew.
Fitted parameters:
lambda beta sigmasq
0.4298838 11.0283525 17.8086126
Convergence code returned by optim: 0
HOME_VAL
Home values are also moderately right skewed with a significant number of zeroes.
We’ll apply a quarter root transformation to the original variable to reduce the overall skew.
Fitted parameters:
lambda beta sigmasq
0.1951741 9.3491718 1.5206145
Convergence code returned by optim: 0
BLUEBOOK
The BLUEBOOK
variable has a moderate right skew. We’ll apply the square root transformation suggested by the box-cox procedure.
Fitted parameters:
lambda beta sigmasq
0.4610754 5.2624962 3.7967126
Convergence code returned by optim: 0
OLDCLAIM
OLDCLAIM
is extremely right skewed. We’ll apply a log(x+1) transformation to reduce the overall skew.
Fitted parameters:
lambda beta sigmasq
-0.04511237 1.76185840 0.82136055
Convergence code returned by optim: 0
Model 1: Manual Variable Selection, Linear Terms Only
Based on our data exploration, we believe the following variables could be relevant in predicted whether or not a claim occurs:
KIDSDRIV
HOMEKIDS
CLM_FREQ
MVR_PTS
AGE
YOJ
TRAVTIME
TIF
CAR_AGE
PARENT1
MSTATUS
CAR_USE
REVOKED
URBANICITY
JOB
CAR_TYPE
EDUCATION
INCOME
HOMEVAL
BLUEBOOK
OLDCLAIM
Granted, we haven’t narrowed down the list much.
Before moving forward, let’s calculate variance inflation factors, including all potential predictors in our model.
GVIF Df GVIF^(1/(2*Df))
PARENT1 1.927927 1 1.388498
MSTATUS 2.258214 1 1.502735
EDUCATION 9.204331 3 1.447656
JOB 31.325910 8 1.240206
CAR_USE 2.227957 1 1.492634
CAR_TYPE 2.443471 5 1.093454
REVOKED 1.148154 1 1.071520
MVR_PTS 1.194820 1 1.093078
CAR_AGE 2.123064 1 1.457074
URBANICITY 1.147162 1 1.071057
KIDSDRIV_MOD 1.347203 1 1.160691
HOMEKIDS_MOD 2.183075 1 1.477523
CLM_FREQ_MOD 2.350866 1 1.533253
AGE_MOD 1.455622 1 1.206491
YOJ_MOD 1.608639 1 1.268321
TRAVTIME_MOD 1.039894 1 1.019752
TIF_MOD 1.009853 1 1.004915
INCOME_MOD 3.468635 1 1.862427
HOME_VAL_MOD 1.926043 1 1.387819
BLUEBOOK_MOD 1.689897 1 1.299960
OLD_CLAIM_MOD 2.534623 1 1.592050
Seeing no major VIF issues–once accounting for degrees of freedom–we will proceed with the model with all suggested predictors.
Call:
glm(formula = TARGET_FLAG ~ . - TARGET_AMT - TARGET_AMT_MOD -
INDEX - KIDSDRIV - HOMEKIDS - CLM_FREQ - AGE - YOJ - TRAVTIME -
TIF - SEX - RED_CAR - INCOME - HOME_VAL - BLUEBOOK - OLDCLAIM -
INCOME_30 - HOME_VAL_30 - BLUEBOOK_2 - OLDCLAIM_4, family = "binomial",
data = auto)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6137 -0.7132 -0.3982 0.6166 3.1592
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.872686 0.282675 -6.625 0.00000000003475425 ***
PARENT1Yes 0.370690 0.109692 3.379 0.000727 ***
MSTATUSYes -0.474052 0.087739 -5.403 0.00000006554066899 ***
EDUCATION.L -0.078343 0.149372 -0.524 0.599945
EDUCATION.Q 0.233170 0.086835 2.685 0.007249 **
EDUCATION.C -0.089875 0.081919 -1.097 0.272589
JOBBlue Collar 0.308221 0.184894 1.667 0.095512 .
JOBClerical 0.347545 0.197037 1.764 0.077756 .
JOBDoctor -0.413664 0.266212 -1.554 0.120211
JOBHome Maker -0.053576 0.220427 -0.243 0.807961
JOBLawyer 0.127853 0.168679 0.758 0.448470
JOBManager -0.545841 0.170920 -3.194 0.001405 **
JOBProfessional 0.177985 0.177883 1.001 0.317032
JOBStudent -0.170969 0.226935 -0.753 0.451218
CAR_USEPrivate -0.779822 0.087565 -8.906 < 0.0000000000000002 ***
CAR_TYPEPanel Truck 0.583726 0.146384 3.988 0.00006673268440400 ***
CAR_TYPEPickup 0.550812 0.100174 5.499 0.00000003829305886 ***
CAR_TYPESports Car 0.948214 0.108249 8.760 < 0.0000000000000002 ***
CAR_TYPESUV 0.718723 0.086067 8.351 < 0.0000000000000002 ***
CAR_TYPEVan 0.657014 0.121696 5.399 0.00000006707980041 ***
REVOKEDYes 0.748190 0.085838 8.716 < 0.0000000000000002 ***
MVR_PTS 0.110159 0.013843 7.958 0.00000000000000175 ***
CAR_AGE -0.001176 0.007502 -0.157 0.875411
URBANICITYHighly Urban/ Urban 2.411637 0.113224 21.300 < 0.0000000000000002 ***
KIDSDRIV_MOD 0.392587 0.061248 6.410 0.00000000014567496 ***
HOMEKIDS_MOD 0.044878 0.037243 1.205 0.228204
CLM_FREQ_MOD 0.170681 0.036265 4.707 0.00000251988421409 ***
AGE_MOD -0.001631 0.004007 -0.407 0.684036
YOJ_MOD -0.001673 0.008759 -0.191 0.848497
TRAVTIME_MOD 0.014751 0.001886 7.819 0.00000000000000532 ***
TIF_MOD -0.055559 0.007352 -7.557 0.00000000000004129 ***
INCOME_MOD -0.095766 0.016269 -5.886 0.00000000394563137 ***
HOME_VAL_MOD -0.090138 0.022401 -4.024 0.00005726974387333 ***
BLUEBOOK_MOD -0.177734 0.035205 -5.049 0.00000044508752803 ***
OLD_CLAIM_MOD -0.030521 0.039102 -0.781 0.435073
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7281.9 on 8126 degrees of freedom
AIC: 7351.9
Number of Fisher Scoring iterations: 5
The signs of the coefficients mostly make sense:
EDUCATION
levels make sense; however, we expected the Master’s and PhD levels to show a greater reduction in log odds relative to the high school reference level. We are not too concerned with this result though, as other variables such as JOB_TYPE
and INCOME
help in defining an individual’s composite risk profile.YOB
. This is maybe a strange result, but our model indicates a very high p-value for our coefficient.BLUEBOOK
values being associated with reduced risk in our model compared to lower values.Let’s clean up our model by removing some of the statistically insignificant predictors. We will leave all JOB
levels in our model, as the “Manager” and “Clerical” coefficients are highly significant, and two additional levels have p-values below 10%.
Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB +
CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + URBANICITY + KIDSDRIV_MOD +
CLM_FREQ_MOD + TRAVTIME_MOD + TIF_MOD + INCOME_MOD + HOME_VAL_MOD +
BLUEBOOK_MOD, family = "binomial", data = auto)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6328 -0.7136 -0.3986 0.6135 3.1439
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.918503 0.267796 -7.164 0.00000000000078332 ***
PARENT1Yes 0.455054 0.094479 4.816 0.00000146139638217 ***
MSTATUSYes -0.442419 0.084012 -5.266 0.00000013933679907 ***
EDUCATION.L -0.093724 0.137261 -0.683 0.49472
EDUCATION.Q 0.236564 0.084790 2.790 0.00527 **
EDUCATION.C -0.085934 0.081180 -1.059 0.28980
JOBBlue Collar 0.310997 0.184778 1.683 0.09236 .
JOBClerical 0.355083 0.196670 1.805 0.07100 .
JOBDoctor -0.421711 0.265747 -1.587 0.11254
JOBHome Maker -0.045799 0.218566 -0.210 0.83402
JOBLawyer 0.119301 0.168323 0.709 0.47847
JOBManager -0.555839 0.170742 -3.255 0.00113 **
JOBProfessional 0.172429 0.177740 0.970 0.33199
JOBStudent -0.146979 0.225655 -0.651 0.51482
CAR_USEPrivate -0.777122 0.087414 -8.890 < 0.0000000000000002 ***
CAR_TYPEPanel Truck 0.588330 0.146142 4.026 0.00005679472185962 ***
CAR_TYPEPickup 0.550137 0.100124 5.495 0.00000003917260431 ***
CAR_TYPESports Car 0.945631 0.107956 8.759 < 0.0000000000000002 ***
CAR_TYPESUV 0.719308 0.085983 8.366 < 0.0000000000000002 ***
CAR_TYPEVan 0.658243 0.121557 5.415 0.00000006125629378 ***
REVOKEDYes 0.728369 0.080414 9.058 < 0.0000000000000002 ***
MVR_PTS 0.108765 0.013572 8.014 0.00000000000000111 ***
URBANICITYHighly Urban/ Urban 2.409081 0.113106 21.299 < 0.0000000000000002 ***
KIDSDRIV_MOD 0.424693 0.055148 7.701 0.00000000000001351 ***
CLM_FREQ_MOD 0.150238 0.025536 5.883 0.00000000402164286 ***
TRAVTIME_MOD 0.014697 0.001885 7.796 0.00000000000000637 ***
TIF_MOD -0.055311 0.007347 -7.528 0.00000000000005149 ***
INCOME_MOD -0.095829 0.015629 -6.131 0.00000000087089256 ***
HOME_VAL_MOD -0.091463 0.022383 -4.086 0.00004385319335945 ***
BLUEBOOK_MOD -0.181410 0.035000 -5.183 0.00000021811943514 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7284.9 on 8131 degrees of freedom
AIC: 7344.9
Number of Fisher Scoring iterations: 5
All predictors are now statistically significant, with the exception of several of the job types. The coefficients in our model also still make sense.
Model 2: Add Quadratic Terms to Model 1
We noted earlier that there appeared to be quadratic relationship between some predictors and log odds.
Starting with our original Model 1–i.e. before we removed the insignificant predictors–we’ll add second order polynomial terms for the following variables:
KIDSDRIV
HOMEKIDS
CLM_FREQ
AGE
YOJ
TIF
Here is the summary output from model 2:
Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB +
CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + CAR_AGE + URBANICITY +
KIDSDRIV_MOD + HOMEKIDS_MOD + CLM_FREQ_MOD + AGE_MOD + YOJ_MOD +
TRAVTIME_MOD + TIF_MOD + INCOME_MOD + HOME_VAL_MOD + BLUEBOOK_MOD +
OLD_CLAIM_MOD + I(KIDSDRIV_MOD^2) + I(HOMEKIDS_MOD^2) + I(CLM_FREQ_MOD^2) +
I(AGE_MOD^2) + I(YOJ_MOD^2) + I(TIF_MOD^2), family = "binomial",
data = auto)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4212 -0.7079 -0.3896 0.5997 3.0494
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8549782 0.3023391 -6.135 0.000000000849329604 ***
PARENT1Yes 0.2287145 0.1190337 1.921 0.05468 .
MSTATUSYes -0.5472126 0.0907131 -6.032 0.000000001615991315 ***
EDUCATION.L -0.1110321 0.1507177 -0.737 0.46131
EDUCATION.Q 0.2302008 0.0871779 2.641 0.00828 **
EDUCATION.C -0.1063223 0.0824316 -1.290 0.19711
JOBBlue Collar 0.3377066 0.1857011 1.819 0.06898 .
JOBClerical 0.3711729 0.1980087 1.875 0.06086 .
JOBDoctor -0.4381672 0.2666846 -1.643 0.10038
JOBHome Maker -0.0171249 0.2232588 -0.077 0.93886
JOBLawyer 0.1293698 0.1693179 0.764 0.44483
JOBManager -0.5408136 0.1711226 -3.160 0.00158 **
JOBProfessional 0.1900593 0.1784593 1.065 0.28688
JOBStudent -0.1735835 0.2291796 -0.757 0.44880
CAR_USEPrivate -0.7834831 0.0881933 -8.884 < 0.0000000000000002 ***
CAR_TYPEPanel Truck 0.6238633 0.1471955 4.238 0.000022518613851542 ***
CAR_TYPEPickup 0.5677139 0.1009806 5.622 0.000000018874685247 ***
CAR_TYPESports Car 0.8853900 0.1094746 8.088 0.000000000000000608 ***
CAR_TYPESUV 0.7106989 0.0868341 8.185 0.000000000000000273 ***
CAR_TYPEVan 0.6855025 0.1226882 5.587 0.000000023055663145 ***
REVOKEDYes 0.8259645 0.0887815 9.303 < 0.0000000000000002 ***
MVR_PTS 0.0971609 0.0141218 6.880 0.000000000005976967 ***
CAR_AGE -0.0012267 0.0075548 -0.162 0.87102
URBANICITYHighly Urban/ Urban 2.3933908 0.1135895 21.071 < 0.0000000000000002 ***
KIDSDRIV_MOD 0.7280955 0.1425080 5.109 0.000000323602039340 ***
HOMEKIDS_MOD 0.0617138 0.0685642 0.900 0.36807
CLM_FREQ_MOD 0.3677199 0.0737840 4.984 0.000000623693042781 ***
AGE_MOD -0.0020003 0.0041442 -0.483 0.62932
YOJ_MOD -0.0006137 0.0110557 -0.056 0.95573
TRAVTIME_MOD 0.0148435 0.0018995 7.815 0.000000000000005515 ***
TIF_MOD -0.0666080 0.0086300 -7.718 0.000000000000011795 ***
INCOME_MOD -0.0829304 0.0168293 -4.928 0.000000831917394078 ***
HOME_VAL_MOD -0.0887297 0.0225581 -3.933 0.000083759903483572 ***
BLUEBOOK_MOD -0.1860283 0.0354683 -5.245 0.000000156350347101 ***
OLD_CLAIM_MOD -0.1226506 0.0492985 -2.488 0.01285 *
I(KIDSDRIV_MOD^2) -0.1387882 0.0690188 -2.011 0.04434 *
I(HOMEKIDS_MOD^2) -0.0328246 0.0281772 -1.165 0.24405
I(CLM_FREQ_MOD^2) -0.0875831 0.0282607 -3.099 0.00194 **
I(AGE_MOD^2) 0.0022268 0.0002811 7.922 0.000000000000002339 ***
I(YOJ_MOD^2) 0.0012227 0.0016034 0.763 0.44570
I(TIF_MOD^2) 0.0031365 0.0013927 2.252 0.02431 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7198.7 on 8120 degrees of freedom
AIC: 7280.7
Number of Fisher Scoring iterations: 5
Below our key results from this second model:
KIDSDRIV
are both statistically significant. The negative sign of the second order term makes sense: the unfavorable impact of adding additional children drivers diminishes with each subsequent child.HOMEKIDS
.CLM_FREQ
terms are significant. The negative second order term indicates diminishing impact of each additional prior claim.AGE
is statistically significant, but the first first order term is not. We’ll leave both terms in our model. The coefficients of the terms make sense. There is a primary trend of reduced risk with age–see the negative first order coefficient, but the trend diminishes and potentially reverses for higher ages–as reflected in the positive second order term.YOJ
terms are statistically significant.TIF
terms are significant. The signs of the coefficients have an intuitive explanation: there is a primary effect of risk reduction in risk with increases to TIF
but the favorable impact diminishes with higher TIF
values.CAR_AGE
is still insignificant in this model.Based on the results above, we’ll remove all HOMEKIDS
and YOJ
terms from our model. Here are the summary results from this modified, second model:
Call:
glm(formula = TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB +
CAR_USE + CAR_TYPE + REVOKED + MVR_PTS + URBANICITY + KIDSDRIV_MOD +
CLM_FREQ_MOD + AGE_MOD + TRAVTIME_MOD + TIF_MOD + INCOME_MOD +
HOME_VAL_MOD + BLUEBOOK_MOD + OLD_CLAIM_MOD + I(KIDSDRIV_MOD^2) +
I(CLM_FREQ_MOD^2) + I(AGE_MOD^2) + I(TIF_MOD^2), family = "binomial",
data = auto)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4188 -0.7079 -0.3899 0.6017 3.0411
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.8566640 0.2816370 -6.592 0.000000000043277015 ***
PARENT1Yes 0.2739046 0.1017449 2.692 0.00710 **
MSTATUSYes -0.5275390 0.0855811 -6.164 0.000000000708416117 ***
EDUCATION.L -0.1089269 0.1383381 -0.787 0.43105
EDUCATION.Q 0.2326644 0.0851694 2.732 0.00630 **
EDUCATION.C -0.1037084 0.0816550 -1.270 0.20406
JOBBlue Collar 0.3308613 0.1855816 1.783 0.07461 .
JOBClerical 0.3577260 0.1975653 1.811 0.07019 .
JOBDoctor -0.4363115 0.2668807 -1.635 0.10208
JOBHome Maker 0.0138339 0.2199564 0.063 0.94985
JOBLawyer 0.1258867 0.1693304 0.743 0.45722
JOBManager -0.5435257 0.1710934 -3.177 0.00149 **
JOBProfessional 0.1872710 0.1783856 1.050 0.29381
JOBStudent -0.1531559 0.2268795 -0.675 0.49964
CAR_USEPrivate -0.7861317 0.0880463 -8.929 < 0.0000000000000002 ***
CAR_TYPEPanel Truck 0.6224359 0.1471477 4.230 0.000023368219454604 ***
CAR_TYPEPickup 0.5651171 0.1009060 5.600 0.000000021382163925 ***
CAR_TYPESports Car 0.8883475 0.1093351 8.125 0.000000000000000447 ***
CAR_TYPESUV 0.7116850 0.0867565 8.203 0.000000000000000234 ***
CAR_TYPEVan 0.6840691 0.1226023 5.580 0.000000024110109095 ***
REVOKEDYes 0.8231669 0.0887081 9.280 < 0.0000000000000002 ***
MVR_PTS 0.0980783 0.0141003 6.956 0.000000000003506304 ***
URBANICITYHighly Urban/ Urban 2.3910929 0.1135255 21.062 < 0.0000000000000002 ***
KIDSDRIV_MOD 0.7850828 0.1289262 6.089 0.000000001133357267 ***
CLM_FREQ_MOD 0.3652676 0.0737359 4.954 0.000000728054720019 ***
AGE_MOD -0.0025611 0.0036151 -0.708 0.47867
TRAVTIME_MOD 0.0148180 0.0018989 7.803 0.000000000000006031 ***
TIF_MOD -0.0667534 0.0086268 -7.738 0.000000000000010109 ***
INCOME_MOD -0.0886512 0.0157886 -5.615 0.000000019668539231 ***
HOME_VAL_MOD -0.0886985 0.0225545 -3.933 0.000084020987281530 ***
BLUEBOOK_MOD -0.1860655 0.0354614 -5.247 0.000000154612137297 ***
OLD_CLAIM_MOD -0.1214653 0.0492809 -2.465 0.01371 *
I(KIDSDRIV_MOD^2) -0.1637618 0.0651787 -2.513 0.01199 *
I(CLM_FREQ_MOD^2) -0.0865030 0.0282256 -3.065 0.00218 **
I(AGE_MOD^2) 0.0022487 0.0002768 8.124 0.000000000000000452 ***
I(TIF_MOD^2) 0.0031484 0.0013917 2.262 0.02368 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7201.1 on 8125 degrees of freedom
AIC: 7273.1
Number of Fisher Scoring iterations: 5
Model 3: Stepwise Regression
For our third model, we’ll implement stepwise regression, with variable selecting occurring in both directions. We’ll include all predictors (transformed versions where applicable) in our potential universe of candidates.
For simplicity, we’ll only include first order terms.
Call:
glm(formula = TARGET_FLAG ~ URBANICITY + JOB + MVR_PTS + MSTATUS +
CAR_TYPE + REVOKED + KIDSDRIV_MOD + INCOME_MOD + CAR_USE +
TIF_MOD + TRAVTIME_MOD + CLM_FREQ_MOD + BLUEBOOK_MOD + PARENT1 +
HOME_VAL_MOD + EDUCATION + HOMEKIDS_MOD, family = "binomial",
data = auto_redux)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6158 -0.7141 -0.3980 0.6181 3.1513
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.887080 0.268674 -7.024 0.00000000000216101 ***
URBANICITYHighly Urban/ Urban 2.408584 0.113076 21.301 < 0.0000000000000002 ***
JOBBlue Collar 0.309851 0.184810 1.677 0.09362 .
JOBClerical 0.347849 0.196756 1.768 0.07707 .
JOBDoctor -0.414469 0.265832 -1.559 0.11896
JOBHome Maker -0.049057 0.218639 -0.224 0.82246
JOBLawyer 0.125372 0.168406 0.744 0.45660
JOBManager -0.548824 0.170794 -3.213 0.00131 **
JOBProfessional 0.178743 0.177815 1.005 0.31479
JOBStudent -0.164947 0.226014 -0.730 0.46551
MVR_PTS 0.108414 0.013576 7.986 0.00000000000000140 ***
MSTATUSYes -0.476489 0.087310 -5.457 0.00000004830512386 ***
CAR_TYPEPanel Truck 0.588390 0.146226 4.024 0.00005725500332016 ***
CAR_TYPEPickup 0.551882 0.100139 5.511 0.00000003564383764 ***
CAR_TYPESports Car 0.944740 0.107966 8.750 < 0.0000000000000002 ***
CAR_TYPESUV 0.717233 0.086011 8.339 < 0.0000000000000002 ***
CAR_TYPEVan 0.658849 0.121591 5.419 0.00000006007329279 ***
REVOKEDYes 0.724890 0.080457 9.010 < 0.0000000000000002 ***
KIDSDRIV_MOD 0.389384 0.060210 6.467 0.00000000009988987 ***
INCOME_MOD -0.096508 0.015642 -6.170 0.00000000068430015 ***
CAR_USEPrivate -0.779104 0.087448 -8.909 < 0.0000000000000002 ***
TIF_MOD -0.055518 0.007349 -7.554 0.00000000000004217 ***
TRAVTIME_MOD 0.014755 0.001885 7.826 0.00000000000000505 ***
CLM_FREQ_MOD 0.150203 0.025546 5.880 0.00000000410930376 ***
BLUEBOOK_MOD -0.179788 0.035019 -5.134 0.00000028357055432 ***
PARENT1Yes 0.375928 0.109102 3.446 0.00057 ***
HOME_VAL_MOD -0.090524 0.022374 -4.046 0.00005210493016275 ***
EDUCATION.L -0.086834 0.137351 -0.632 0.52725
EDUCATION.Q 0.235675 0.084809 2.779 0.00545 **
EDUCATION.C -0.087132 0.081189 -1.073 0.28318
HOMEKIDS_MOD 0.049238 0.034007 1.448 0.14765
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7282.8 on 8130 degrees of freedom
AIC: 7344.8
Number of Fisher Scoring iterations: 5
[1] "PARENT1" "MSTATUS" "SEX" "EDUCATION" "JOB" "CAR_USE"
[7] "CAR_TYPE" "RED_CAR" "REVOKED" "MVR_PTS" "CAR_AGE" "URBANICITY"
[13] "KIDSDRIV_MOD" "HOMEKIDS_MOD" "CLM_FREQ_MOD" "AGE_MOD" "YOJ_MOD" "TRAVTIME_MOD"
[19] "TIF_MOD" "INCOME_MOD" "HOME_VAL_MOD" "BLUEBOOK_MOD" "OLD_CLAIM_MOD" "TARGET_FLAG"
Surprisingly, the stepwise procedure produced a model that is identical to our modified Model 1!
Now We’ll model claim costs using the subset of the training data where claim costs were greater than one.
Before we build our models, let’s check for multicollinearity uses by reviewing variance inflation factors for a linear model that includes all predictors.
GVIF Df GVIF^(1/(2*Df))
PARENT1 2.163475 1 1.470876
MSTATUS 2.442422 1 1.562825
SEX 3.756936 1 1.938282
EDUCATION 9.707532 3 1.460556
JOB 40.971902 8 1.261190
CAR_USE 2.259463 1 1.503151
CAR_TYPE 6.325230 5 1.202562
RED_CAR 1.832475 1 1.353689
REVOKED 1.266601 1 1.125433
MVR_PTS 1.158984 1 1.076561
CAR_AGE 2.130468 1 1.459612
URBANICITY 1.052960 1 1.026139
KIDSDRIV_MOD 1.434861 1 1.197857
HOMEKIDS_MOD 2.245965 1 1.498654
CLM_FREQ_MOD 2.232403 1 1.494123
AGE_MOD 1.505499 1 1.226988
YOJ_MOD 1.925940 1 1.387782
TRAVTIME_MOD 1.030401 1 1.015087
TIF_MOD 1.018817 1 1.009364
INCOME_MOD 4.310419 1 2.076155
HOME_VAL_MOD 1.991408 1 1.411173
BLUEBOOK_MOD 2.083098 1 1.443294
OLD_CLAIM_MOD 2.486871 1 1.576982
There does not appear to be an issue with mulicollinearity.
Model 4: Manual Variable Selection, Linear Terms Only
Choosing relevant predictors manually is a challenging exercise as most predictors seemed to have only a subtle influence–if any–on claim costs.
Based on our exploratory work, we believe the following variables may be relevant:
AGE
KIDSDRIV
HOMEKIDS
TRAVTIME
SEX
CAR_USE
RED_CAR
UBANICITY
JOB
CARTYPE
EDUCATION
BLUEBOOK
Let’s look at a preliminary model using all 12 of our proposed predictors:
Call:
lm(formula = TARGET_AMT_MOD ~ KIDSDRIV_MOD + HOMEKIDS_MOD + AGE_MOD +
TRAVTIME_MOD + SEX + CAR_USE + RED_CAR + URBANICITY + JOB +
CAR_TYPE + EDUCATION + BLUEBOOK_MOD, data = auto_clm)
Residuals:
Min 1Q Median 3Q Max
-4.7973 -0.4027 0.0404 0.4037 3.2710
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.7902933 0.1678675 46.407 < 0.0000000000000002 ***
KIDSDRIV_MOD -0.0361875 0.0331524 -1.092 0.275
HOMEKIDS_MOD 0.0205244 0.0190398 1.078 0.281
AGE_MOD 0.0004682 0.0021647 0.216 0.829
TRAVTIME_MOD -0.0005157 0.0011623 -0.444 0.657
SEXM 0.0939910 0.0678247 1.386 0.166
CAR_USEPrivate -0.0042709 0.0519136 -0.082 0.934
RED_CARyes 0.0270567 0.0521811 0.519 0.604
URBANICITYHighly Urban/ Urban 0.0274888 0.0789854 0.348 0.728
JOBBlue Collar 0.0579560 0.1194463 0.485 0.628
JOBClerical 0.0770022 0.1242427 0.620 0.535
JOBDoctor -0.0423386 0.1844218 -0.230 0.818
JOBHome Maker 0.0186071 0.1219750 0.153 0.879
JOBLawyer -0.0040646 0.1076214 -0.038 0.970
JOBManager 0.0143688 0.1117657 0.129 0.898
JOBProfessional 0.0966171 0.1180370 0.819 0.413
JOBStudent 0.0708606 0.1262229 0.561 0.575
CAR_TYPEPanel Truck -0.0023405 0.0960163 -0.024 0.981
CAR_TYPEPickup 0.0298623 0.0622814 0.479 0.632
CAR_TYPESports Car 0.0719608 0.0781399 0.921 0.357
CAR_TYPESUV 0.0919203 0.0689452 1.333 0.183
CAR_TYPEVan -0.0278677 0.0804508 -0.346 0.729
EDUCATION.L 0.1274041 0.0887219 1.436 0.151
EDUCATION.Q 0.0541159 0.0539471 1.003 0.316
EDUCATION.C -0.0631616 0.0541032 -1.167 0.243
BLUEBOOK_MOD 0.0978773 0.0227092 4.310 0.0000171 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8092 on 2127 degrees of freedom
Multiple R-squared: 0.0198, Adjusted R-squared: 0.008278
F-statistic: 1.718 on 25 and 2127 DF, p-value: 0.01486
Only one of our predictors, BLUEBOOK
appears to be significant in our model.
Here is our modified model, with BLUEBOOK
as the sole predictor:
Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD, data = auto_clm)
Residuals:
Min 1Q Median 3Q Max
-4.7842 -0.3929 0.0404 0.3952 3.2528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.95359 0.06005 132.445 < 0.0000000000000002 ***
BLUEBOOK_MOD 0.08921 0.01590 5.609 0.000000023 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8069 on 2151 degrees of freedom
Multiple R-squared: 0.01441, Adjusted R-squared: 0.01396
F-statistic: 31.46 on 1 and 2151 DF, p-value: 0.00000002298
The positive coefficient for Bluebook
makes sense: we expect the replacement cost and/or repairs for a high-valued car to be more expensive than auto with a low replacement cost.
Model 5: Add Quadratic Terms to Model 4
In the exploratory section, we noted potential curved relationship between some predictors and the log of TARGET_AMT
.
Those predictors were AGE
and TRAVTIME
. Let’s include squared terms for these two predictors and also for for BLUEBOOK
:
Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + I(BLUEBOOK_MOD^2) +
TRAVTIME_MOD + I(TRAVTIME_MOD^2) + AGE_MOD + I(AGE_MOD^2),
data = auto_clm)
Residuals:
Min 1Q Median 3Q Max
-4.7678 -0.3885 0.0345 0.3994 3.2450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.56360993 0.15342722 49.298 < 0.0000000000000002 ***
BLUEBOOK_MOD 0.29960723 0.08326270 3.598 0.000328 ***
I(BLUEBOOK_MOD^2) -0.02798286 0.01094977 -2.556 0.010670 *
TRAVTIME_MOD -0.00057584 0.00119020 -0.484 0.628569
I(TRAVTIME_MOD^2) 0.00001853 0.00005609 0.330 0.741178
AGE_MOD 0.00042933 0.00186549 0.230 0.818002
I(AGE_MOD^2) 0.00027328 0.00014705 1.858 0.063242 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8059 on 2146 degrees of freedom
Multiple R-squared: 0.01912, Adjusted R-squared: 0.01638
F-statistic: 6.972 on 6 and 2146 DF, p-value: 0.0000002344
The second order term for BLUEBOOK
is statistically significant. The second order term for AGE
is borderline significant; so we include leave both age variables in our model; but remove terms related to TRAVTIME
.
Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + I(BLUEBOOK_MOD^2) +
AGE_MOD + I(AGE_MOD^2), data = auto_clm)
Residuals:
Min 1Q Median 3Q Max
-4.7720 -0.3860 0.0358 0.3948 3.2380
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5684831 0.1529052 49.498 < 0.0000000000000002 ***
BLUEBOOK_MOD 0.2990922 0.0831967 3.595 0.000332 ***
I(BLUEBOOK_MOD^2) -0.0279652 0.0109407 -2.556 0.010655 *
AGE_MOD 0.0004062 0.0018617 0.218 0.827316
I(AGE_MOD^2) 0.0002761 0.0001468 1.880 0.060177 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8056 on 2148 degrees of freedom
Multiple R-squared: 0.01899, Adjusted R-squared: 0.01717
F-statistic: 10.4 on 4 and 2148 DF, p-value: 0.00000002434
This model indicates that log costs increase quadratically with Age. This result seems possible.
Model 6: Stepwise Regression
Finally, we’ll perform a basic stepwise regression with variable selection performed in both directions. For simplicity, we’ll only include linear terms.
Call:
lm(formula = TARGET_AMT_MOD ~ BLUEBOOK_MOD + MSTATUS + MVR_PTS +
SEX + CLM_FREQ_MOD, data = auto_clm)
Residuals:
Min 1Q Median 3Q Max
-4.6968 -0.4038 0.0391 0.4093 3.2236
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.940083 0.066264 119.825 < 0.0000000000000002 ***
BLUEBOOK_MOD 0.086861 0.015942 5.449 0.0000000566 ***
MSTATUSYes -0.073598 0.034727 -2.119 0.0342 *
MVR_PTS 0.017181 0.007053 2.436 0.0149 *
SEXM 0.055410 0.035037 1.581 0.1139
CLM_FREQ_MOD -0.022434 0.014567 -1.540 0.1237
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8051 on 2147 degrees of freedom
Multiple R-squared: 0.02064, Adjusted R-squared: 0.01836
F-statistic: 9.051 on 5 and 2147 DF, p-value: 0.00000001583
The stepwise procedure included four additional predictors in addition to BLUEBOOK
:
* MSTATUS
: the negative coefficient indicates the married individuals are less expensive than singles. This seems reasonable.
* MVR_PTS
: the model indicates a positive association between MVR_PTS
and claim costs. This also seems reasonable.
* SEX
: According to the model, males are more expensive than females. This is consistent with our earlier exploratory work.
* CLM_FREQ_MOD1
: The coefficient is negative. This result is counterintuitive. We would expect folks with a high incidence accident rate to potentially be at risk for higher cost accidents. For now, we’ll leave this predictor in, but we may want do additional analysis.
Let’s compare model fits for all of our models:
AIC AICc BIC loglik
m1 7351.936 7352.246 7597.185 -3640.968
m1_mod 7344.898 7345.127 7555.112 -3642.449
m2 7280.684 7281.108 7567.976 -3599.342
m2_mod 7273.059 7273.386 7525.315 -3600.529
m3 7344.813 7345.057 7562.034 -3641.407
Based on the various model evaluation criteria, model m2_mod appears to be the clear winner. This model is the binary logistic regression model that included multiple quadratic terms, but removed statistically insignificant predictors from the original model 2 formulation.
Model 2 is superior in that the AIC, AIC_c, and BIC measures are lower than all other evaluated models. The log likelihood is not quite as the original model 2, but this measure does not account for model complexity as the other models do.
Let’s also compare the AUC measure for all models:
[1] "Model 1: 0.815"
[1] "Model 1 mod: 0.814"
[1] "Model 2: 0.82"
[1] "Model 2 mod: 0.819"
[1] "Model 3: 0.815"
Models 2 and Model 2 mod are tied and have the highest AUC measures. Given that Model 2 mod has fewer parameters than Model 2, the AUC measure supports our contention that Model 2 mod is superior to the other models.
Let’s explore a summary of our model predictions using the training data:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.002406 0.074672 0.195498 0.263816 0.402386 0.960296
We now need to choose an appropriate probability cutoff measure for predicting whether or not an individual will have a claim.
We will use Youden’s index to determine this optimal cutoff.
[1] 0.2762045
Using Youden’s index, we select a relatively low cutoff measure of 0.276. With such a low cutoff, we will inevitably sacrifice specificity for gains in sensitivity compared to a traditional 0.5 cutoff. However, this lower cutoff provides a better balance between precision and recall.
Reference
Prediction 0 1
0 4476 545
1 1532 1608
$accuracy
[1] 0.7454969
$error_rt
[1] 0.2545031
$precision
[1] 0.5121019
$sensitivity
[1] 0.7468648
$specificity
[1] 0.7450067
$F1
[1] 0.6075949
For comparison purposes, here is the confusion matrix and related classification metrics with a 0.5 cutoff.
Reference
Prediction 0 1
0 5534 1207
1 474 946
$accuracy
[1] 0.7940203
$error_rt
[1] 0.2059797
$precision
[1] 0.6661972
$sensitivity
[1] 0.4393869
$specificity
[1] 0.9211052
$F1
[1] 0.529527
We see that our 0.276 cutoff also results in lower accuracy vis-a-vis the 0.5 threshold. But our lower threshold also results in better balance in precision vs. recall, as indicated by the improved F1 measure.
We’ll review five different measures for assessing our multiple regression models:
R_2 adj_R_2 rmse AIC AICc BIC
m4 0.0198 0.0083 7891.505 5226.127 5226.838 5379.341
m4_mod 0.0144 0.0140 7901.013 5189.919 5189.931 5206.943
m5 0.0191 0.0164 7897.224 5189.618 5189.685 5235.015
m5_mod 0.0190 0.0172 7897.273 5185.896 5185.935 5219.944
m6 0.0206 0.0184 7885.800 5184.273 5184.325 5223.995
Based on the table above, model 6 appears to be the superior model.
It has superior measures in all categories except for BIC.
Let’s now do some quick model diagnostics for our selected model, model 6:
The residuals in our model appear to have a relatively constant variance across all fitted values. The qq plot indicates standardized residuals that are fairly well behaved, with only minor departures from normality. Finally, there are only a couple outliers in our data–none appear to be high leverage points.
Let’s wrap up by scrubbing the test data set and make predictions. Please refer to the Github account in the Appendix to access the prediction file.
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, cache=TRUE, comment=NA,
fig.align='center',options(width = 75, scipen=999))
# load libraries
library(stringr)
library(knitr)
library(moments)
library(ggplot2)
library(ggthemes)
library(gridExtra)
library(scales)
library(dplyr)
library(mice)
library(geoR)
library(car)
library(bestglm)
library(sme)
library(pROC)
library(caret)
url <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/dataset/insurance_training_data.csv"
auto <- read.csv(url)
# create table describing variables in dataset
vname <- names(auto)
vdefn <- c("Identification Variable (do not use) ",
"In a crash? 1=YES 0=NO ",
"Cost of Crash, if applicable",
"# Driving Children",
"Age of Driver",
"# Children at Home",
"Years on Job",
"Income",
"Single Parent",
"Home Value",
"Marital Status",
"Gender",
"Max Education Level",
"Job Category",
"Commute Distance",
"Vehicle Use",
"Value of Vehicle",
"Time in Force",
"Type of Car",
"A Red Car",
"# Claims (Past 5 Years) ",
"Total Claims (Past 5 Years)",
"License Revoked (Past 7 Years)",
"Motor Vehicle Report Points",
"Vehicle Age",
"Home/Work Area"
)
veffect <- c("None",
"None",
"None",
"When teenagers drive your car, increased crash risk",
"Young and old drivers might be riskier",
"Unknown effect",
"Long-term employees are usually safer",
"In theory, rich have fewer crashes",
"Unknown impact",
"In theory, home owners may drive more responsibly",
"In theory, married individuals are less risky",
"Urban legend: females are safer drivers",
"Unknown, but in theory more educated people tend to drive more safely",
"In theory, white collar workers are less risky",
"Long drives to work usually suggest greater risk",
"Commercial fleet driven more, may impact collision prob",
"Unknown impact on collision prob, but impacts crash payout",
"Long-term customers are usually safer",
"Unknown impact on collision prob, but impacts crash payout",
"Urban legend: red cars are riskier, particularly sports cars",
"If total payout high, future payouts might be high",
"Claim count should be positively correlated with future claims",
"If your license was revoked, you probably are a riskier driver",
"Traffic ticket counts have postive correlation with crashes",
"Unknown impact on collision prob, but impacts crash payout",
"Unknown impact"
)
var_df <- data.frame(vname,vdefn, veffect, row.names = NULL)
kable(var_df, col.names=c('Variable Name','Description','Theoretical Impact'))
# training data cleaning scripts
currency_fix <- function(x) {
num <- str_replace_all(x, "\\$","")
num <- as.numeric(str_replace_all(num, "\\,",""))
num
}
auto$INCOME <- currency_fix(auto$INCOME)
auto$HOME_VAL <- currency_fix(auto$HOME_VAL)
auto$BLUEBOOK <- currency_fix(auto$BLUEBOOK)
auto$OLDCLAIM <- currency_fix(auto$OLDCLAIM)
auto[sapply(auto, is.factor)] <- lapply(auto[sapply(auto, is.factor)],
function(x) str_replace(x,"z_|<",""))
auto[sapply(auto, is.character)] <- lapply(auto[sapply(auto, is.character)],as.factor)
auto$EDUCATION <- ordered(auto$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD") )
auto$INCOME <- auto$INCOME / 1000
auto$HOME_VAL <- auto$HOME_VAL / 1000
auto$BLUEBOOK <- auto$BLUEBOOK / 1000
auto$OLDCLAIM <- auto$OLDCLAIM / 1000
str(auto)
options(width=95)
summary(auto)
tbl <- with(auto,rbind(round(addmargins(table(TARGET_FLAG)),0),
addmargins(prop.table(table(TARGET_FLAG)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=100)
round(with(auto, c(summary(TARGET_AMT), StdD=sd(TARGET_AMT), Skew=skewness(TARGET_AMT), Kurt=kurtosis(TARGET_AMT))),2)
options(width=100)
auto_censor <- auto[auto$TARGET_FLAG == 1,]
round(with(auto_censor,c(summary(TARGET_AMT), StdD=sd(TARGET_AMT), Skew=skewness(TARGET_AMT), Kurt=kurtosis(TARGET_AMT))),2)
h <- ggplot(auto_censor, aes(TARGET_AMT)) +
geom_histogram(color="ghostwhite", fill="steelblue4") +
theme_classic()+ labs(title = 'Histogram of TARGET_AMT') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background = element_rect(fill='lightblue')) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue"))
b <- ggplot(auto_censor, aes(x="",y=TARGET_AMT)) +
geom_boxplot(color="ghostwhite", fill="steelblue4",outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'Boxplot of TARGET_AMT') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background = element_rect(fill='lightblue')) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) + coord_flip() +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2)
grid.arrange(h,b, ncol=2)
tbl <- with(auto, rbind(addmargins(table(KIDSDRIV)),addmargins(prop.table(table(KIDSDRIV)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
round(with(auto,c(summary(KIDSDRIV), StdD=sd(KIDSDRIV), Skew=skewness(KIDSDRIV), Kurt=kurtosis(KIDSDRIV))),2)
bc <- ggplot(auto, aes(KIDSDRIV, fill=factor(TARGET_FLAG))) +
geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
theme_classic()+ labs(title = 'KIDSDRIV Prop by TARGET_FLAG') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background = element_rect(fill='lightblue')) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue"))
bp <- ggplot(auto_censor, aes(factor(KIDSDRIV), log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by KIDSDRIV', x = "KIDSDRIV") +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
lo <- auto %>%
group_by(KIDSDRIV) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp <- ggplot(lo, aes(KIDSDRIV,logodds)) +
geom_point(shape=21, color = "ghostwhite", fill='steelblue4', size=4, stroke=0.75) +
theme_classic()+ labs(title = 'Log Odds Claim by KIDSDRIV') +
geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <-auto[auto$TARGET_AMT > 0,] %>%
group_by(KIDSDRIV) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
tbl <- with(auto, rbind(addmargins(table(HOMEKIDS)),addmargins(prop.table(table(HOMEKIDS)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
round(with(auto,c(summary(HOMEKIDS), StdD=sd(HOMEKIDS), Skew=skewness(HOMEKIDS), Kurt=kurtosis(HOMEKIDS))),2)
bc <- ggplot(auto, aes(HOMEKIDS, fill=factor(TARGET_FLAG))) +
geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
theme_classic()+ labs(title = 'HOMEKIDS Prop by TARGET_FLAG') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background = element_rect(fill='lightblue')) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue"))
bp <- ggplot(auto_censor, aes(factor(HOMEKIDS), log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by HOMEKIDS', x = "HOMEKIDS") +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
lo <- auto %>%
group_by(HOMEKIDS) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp <- ggplot(lo, aes(HOMEKIDS,logodds)) +
geom_point(shape=21, color = "ghostwhite", fill='steelblue4', size=4, stroke=0.75) +
theme_classic()+ labs(title = 'Log Odds Claim by HOMEKIDS') +
geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(HOMEKIDS) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
tbl <- with(auto, rbind(addmargins(table(CLM_FREQ)),addmargins(prop.table(table(CLM_FREQ)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
round(with(auto,c(summary(CLM_FREQ), StdD=sd(CLM_FREQ), Skew=skewness(CLM_FREQ), Kurt=kurtosis(CLM_FREQ))),2)
bc <- ggplot(auto, aes(CLM_FREQ, fill=factor(TARGET_FLAG))) +
geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
theme_classic()+ labs(title = 'CLM_FREQ Prop by TARGET_FLAG') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background = element_rect(fill='lightblue')) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue"))
bp <- ggplot(auto_censor, aes(factor(CLM_FREQ), log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by CLM_FREQ', x = "CLM_FREQ") +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
lo <- auto %>%
group_by(CLM_FREQ) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp <- ggplot(lo, aes(CLM_FREQ,logodds)) +
geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
theme_classic()+ labs(title = 'Log Odds Claim by CLM_FREQ') +
geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(CLM_FREQ) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=90)
tbl <- with(auto, rbind(addmargins(table(MVR_PTS)),addmargins(prop.table(table(MVR_PTS)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
round(with(auto,c(summary(MVR_PTS), StdD=sd(MVR_PTS), Skew=skewness(MVR_PTS), Kurt=kurtosis(MVR_PTS))),2)
bc <- ggplot(auto, aes(MVR_PTS, fill=factor(TARGET_FLAG))) +
geom_bar(aes(y = ..prop..), position="dodge", color="ghostwhite") +
theme_classic()+ labs(title = 'MVR_PTS Prop by TARGET_FLAG') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background = element_rect(fill='lightblue')) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue"))
bp <- ggplot(auto_censor, aes(factor(MVR_PTS), log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by MVR_PTS', x = "MVR_PTS") +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
lo <- auto %>%
group_by(MVR_PTS) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp <- ggplot(lo, aes(MVR_PTS,logodds)) +
geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
theme_classic()+ labs(title = 'Log Odds Claim by MVR_PTS') +
geom_smooth(method='lm', color='darkred', size=0.75, se=F, linetype='dotted') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(BIN_MVR=2*floor(MVR_PTS/2)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=100)
tbl <- with(auto, rbind(addmargins(table(5*floor(AGE/5))),addmargins(prop.table(table(5*floor(AGE/5))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=100)
round(with(auto,c(summary(AGE), StdD=sd(AGE, na.rm=T), Skew=skewness(AGE, na.rm=T),
Kurt=kurtosis(AGE, na.rm=T))),2)
h1 <- ggplot(auto, aes(AGE)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of AGE') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(AGE, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by AGE') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$AGE_5 <- 5 * floor(auto$AGE / 5)
#auto$AGE_10 <- 10 * floor(auto$AGE / 10)
lo <- auto %>%
group_by(AGE) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(AGE,logodds)) +
geom_point(color ='steelblue', size=1) +
theme_classic()+ labs(title = 'Log Odds Claim by AGE') +
geom_smooth(method='lm', color='darkred', size=0.75, se=F, formula='y ~ poly(x, 2)') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(AGE, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) by AGE') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(AGE)) %>%
group_by(AGEBIN=10*floor(AGE/10)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
tbl <- with(auto, rbind(addmargins(table(2*floor(YOJ/2))),addmargins(prop.table(table(2*floor(YOJ/2)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=90)
round(with(auto,c(summary(YOJ), StdD=sd(YOJ, na.rm=T), Skew=skewness(YOJ, na.rm=T),
Kurt=kurtosis(YOJ, na.rm=T))),2)
h1 <- ggplot(auto, aes(YOJ)) + geom_histogram(binwidth = 1, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of YOJ') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(YOJ, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 1, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by YOJ') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$YOJ_5 <- 5 * floor(auto$YOJ / 5)
#auto$YOJ_10 <- 10 * floor(auto$YOJ / 10)
lo <- auto %>%
group_by(YOJ) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(YOJ,logodds)) +
geom_point(color ='steelblue', size=1) +
theme_classic()+ labs(title = 'Log Odds Claim by YOJ') +
geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(YOJ, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) by YOJ') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(YOJ)) %>%
group_by(YOJBIN=4*floor(YOJ/4)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=100)
tbl <- with(auto, rbind(addmargins(table(10*floor(TRAVTIME/10))),addmargins(prop.table(table(10*floor(TRAVTIME/10)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width = 90)
round(with(auto,c(summary(TRAVTIME), StdD=sd(TRAVTIME, na.rm=T), Skew=skewness(TRAVTIME, na.rm=T),
Kurt=kurtosis(TRAVTIME, na.rm=T))),2)
h1 <- ggplot(auto, aes(TRAVTIME)) + geom_histogram(binwidth = 5, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of TRAVTIME') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(TRAVTIME, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 5, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by TRAVTIME') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$TRAVTIME_5 <- 5 * floor(auto$TRAVTIME / 5)
#auto$TRAVTIME_10 <- 10 * floor(auto$TRAVTIME / 10)
lo <- auto %>%
group_by(TRAVTIME) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(TRAVTIME,logodds)) +
geom_point(color ='steelblue', size=1) +
theme_classic()+ labs(title = 'Log Odds Claim by TRAVTIME') +
geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(TRAVTIME, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) by TRAVTIME') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(TRAVTIME)) %>%
group_by(TRAVBIN=15*floor(TRAVTIME/15)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=100)
tbl <- with(auto, rbind(addmargins(table(2*floor(TIF/2))),addmargins(prop.table(table(2*floor(TIF/2)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width = 90)
round(with(auto,c(summary(TIF), StdD=sd(TIF, na.rm=T), Skew=skewness(TIF, na.rm=T),
Kurt=kurtosis(TIF, na.rm=T))),2)
h1 <- ggplot(auto, aes(TIF)) + geom_histogram(binwidth = 3, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of TIF') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(TIF, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 3, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by TIF') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$TIF_5 <- 5 * floor(auto$TIF / 5)
#auto$TIF_10 <- 10 * floor(auto$TIF / 10)
lo <- auto %>%
group_by(TIF) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(TIF,logodds)) +
geom_point(color ='steelblue', size=1) +
theme_classic()+ labs(title = 'Log Odds Claim by TIF') +
geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(TIF, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) Prop by TIF') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(TIF)) %>%
group_by(TIFBIN=4*floor(TIF/4)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=100)
tbl <- with(auto, rbind(addmargins(table(3*floor(CAR_AGE/3))),addmargins(prop.table(table(3*floor(CAR_AGE/3)))*100)))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width = 90)
round(with(auto,c(summary(CAR_AGE), StdD=sd(CAR_AGE, na.rm=T), Skew=skewness(CAR_AGE, na.rm=T),
Kurt=kurtosis(CAR_AGE, na.rm=T))),2)
h1 <- ggplot(auto, aes(CAR_AGE)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of CAR_AGE') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(CAR_AGE, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by CAR_AGE') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$CAR_AGE_5 <- 5 * floor(auto$CAR_AGE / 5)
#auto$CAR_AGE_10 <- 10 * floor(auto$CAR_AGE / 10)
lo <- auto %>%
group_by(CAR_AGE) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(CAR_AGE,logodds)) +
geom_point(color ='steelblue', size=1) +
theme_classic()+ labs(title = 'Log Odds Claim by CAR_AGE') +
geom_smooth(method='loess', color='darkred', size=0.75, se=F) +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(CAR_AGE, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) Prop by CAR_AGE') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(CAR_AGE)) %>%
group_by(CAR_AGE=4*floor(CAR_AGE/4)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
ct <- addmargins(table(auto$PARENT1))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)
tbl <- addmargins(table(PARENT1=auto$PARENT1,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])
bp <- ggplot(auto_censor, aes(PARENT1, log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by PARENT1') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
bp
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(PARENT1) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
ct <- addmargins(table(auto$MSTATUS))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)
tbl <- addmargins(table(MSTATUS=auto$MSTATUS,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])
bp <- ggplot(auto_censor, aes(MSTATUS, log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by MSTATUS') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
bp
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(MSTATUS) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
ct <- addmargins(table(auto$SEX))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)
tbl <- addmargins(table(SEX=auto$SEX,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])
bp <- ggplot(auto_censor, aes(SEX, log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by SEX') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
bp
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(SEX) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
ct <- addmargins(table(auto$CAR_USE))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)
tbl <- addmargins(table(CAR_USE=auto$CAR_USE,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])
bp <- ggplot(auto_censor, aes(CAR_USE, log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by CAR_USE') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
bp
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(CAR_USE) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
ct <- addmargins(table(auto$RED_CAR))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)
tbl <- addmargins(table(RED_CAR=auto$RED_CAR,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])
bp <- ggplot(auto_censor, aes(RED_CAR, log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by RED_CAR') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
bp
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(RED_CAR) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
ct <- addmargins(table(auto$REVOKED))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)
tbl <- addmargins(table(REVOKED=auto$REVOKED,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])
bp <- ggplot(auto_censor, aes(REVOKED, log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by REVOKED') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
bp
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(REVOKED) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
ct <- addmargins(table(auto$URBANICITY))
per <- round(ct / ct['Sum']* 100,1)
rbind(count=ct, percent=per)
tbl <- addmargins(table(URBANICITY=auto$URBANICITY,TARGET_FLAG=auto$TARGET_FLAG))
tbl
round(prop.table(tbl[1:2,1:2], margin=1),2)
prop.test(tbl[1:2,1:2])
bp <- ggplot(auto_censor, aes(URBANICITY, log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by URBANICITY') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
bp
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(URBANICITY) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=100)
tbl <- with(auto, rbind(addmargins(table(JOB)),addmargins(prop.table(table(JOB)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
tbl <- with(auto, addmargins(table(JOB, TARGET_FLAG)))
tbl
pt <- round(prop.table(tbl[1:9,1:2], margin=1),2)
pt
prop.test(tbl[1:9,1:2])
bc <- ggplot(auto, aes(x=reorder(JOB, JOB,function(x) length(x)))) +
geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") +
theme_classic()+ labs(title = 'JOB Prop') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) +
scale_x_discrete(labels=c('Dr','Unk', 'Hmkr', 'Stud', 'Law', 'Mgmt', 'Prof','Clr','BlCol')) +
labs(x="JOB") + coord_flip()
bp <- ggplot(auto_censor, aes(factor(JOB), log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by JOB', x = "JOB") +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
lo <- auto %>%
group_by(JOB) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp <- ggplot(lo, aes(JOB,logodds)) +
geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
theme_classic()+ labs(title = 'Log Odds Claim by JOB') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue")) +
scale_x_discrete(labels=c('','BlCol','Clr', 'Dr', 'Hmkr', 'Law', 'Mgmt', 'Prof','Stud'))
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(JOB) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=100)
tbl <- with(auto, rbind(addmargins(table(CAR_TYPE)),addmargins(prop.table(table(CAR_TYPE)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
tbl <- with(auto, addmargins(table(CAR_TYPE, TARGET_FLAG)))
tbl
pt <- round(prop.table(tbl[1:6,1:2], margin=1),2)
pt
prop.test(tbl[1:6,1:2])
bc <- ggplot(auto, aes(x=reorder(CAR_TYPE, CAR_TYPE,function(x) length(x)))) +
geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") +
theme_classic()+ labs(title = 'CAR_TYPE Prop') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) +
labs(x="CAR_TYPE") + coord_flip()
#scale_x_discrete(labels=c('Dr','Unk', 'Hmkr', 'Stud', 'Law', 'Mgmt', 'Prof','Clr','BlCol')) +
bp <- ggplot(auto_censor, aes(factor(CAR_TYPE), log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by CAR_TYPE', x = "CAR_TYPE") +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
lo <- auto %>%
group_by(CAR_TYPE) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp <- ggplot(lo, aes(CAR_TYPE,logodds)) +
geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
theme_classic()+ labs(title = 'Log Odds Claim by CAR_TYPE') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue")) +
scale_x_discrete(labels=c('MV','PnlTrk','PkpTrk', 'SptCar', 'SUV', 'Van'))
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(CAR_TYPE) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=100)
tbl <- with(auto, rbind(addmargins(table(EDUCATION)),addmargins(prop.table(table(EDUCATION)))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
tbl <- with(auto, addmargins(table(EDUCATION, TARGET_FLAG)))
tbl
pt <- round(prop.table(tbl[1:4,1:2], margin=1),2)
pt
prop.test(tbl[1:4,1:2])
bc <- ggplot(auto, aes(x=reorder(EDUCATION, EDUCATION,function(x) length(x)))) +
geom_bar(aes(y = ..prop.., group=1),color="ghostwhite", fill="steelblue4") +
theme_classic()+ labs(title = 'EDUCATION Prop') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
legend.text=element_text(size=7),panel.background = element_rect(fill = "lightblue")) +
labs(x="EDUCATION") + coord_flip()
bp <- ggplot(auto_censor, aes(factor(EDUCATION), log(TARGET_AMT))) +
geom_boxplot(fill="steelblue4", color = "ghostwhite", outlier.color="steelblue4", outlier.size = 0.5) +
theme_classic()+ labs(title = 'log(TARGET_AMT) by EDUCATION', x = "EDUCATION") +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
stat_summary(fun.y=mean, colour="darkred", geom="point", shape=16, size=2) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
lo <- auto %>%
group_by(EDUCATION) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/
length(TARGET_FLAG))),3))
sp <- ggplot(lo, aes(EDUCATION,logodds)) +
geom_point(shape=21, color = "ghostwhite", fill='steelblue', size=4, stroke=0.75) +
theme_classic()+ labs(title = 'Log Odds Claim by EDUCATION') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
grid.arrange(grid.arrange(bc,sp, ncol=2),bp, ncol=1)
t <- auto[auto$TARGET_AMT > 0,] %>%
group_by(EDUCATION) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=130)
tbl <- with(auto, rbind(addmargins(table(30*floor(INCOME/30))),addmargins(prop.table(table(30*floor(INCOME/30))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=100)
round(with(auto,c(summary(INCOME), StdD=sd(INCOME, na.rm=T), Skew=skewness(INCOME, na.rm=T),
Kurt=kurtosis(INCOME, na.rm=T))),2)
h1 <- ggplot(auto, aes(INCOME)) + geom_histogram(binwidth = 20, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of INCOME') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(INCOME, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 20, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by INCOME') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$INCOME_5 <- 5 * floor(auto$INCOME / 5)
#auto$INCOME_30 <- auto$INCOME^0.5
auto$INCOME_30 <- 30 * floor(auto$INCOME / 30)
lo <- auto %>%
group_by(INCOME_30) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(INCOME_30,logodds)) +
geom_point(color ='steelblue4', size=1) +
theme_classic()+ labs(title = 'Log Odds Clm by Binned INCOME') +
geom_smooth(method='loess', color='darkred', size=0.75, se=T) +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(INCOME, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) by INCOME') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(INCOME)) %>%
group_by(INCBIN=30*floor(INCOME/30)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=120)
tbl <- with(auto, rbind(addmargins(table(50*floor(HOME_VAL/50))),addmargins(prop.table(table(50*floor(HOME_VAL/50))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=100)
round(with(auto,c(summary(HOME_VAL), StdD=sd(HOME_VAL, na.rm=T), Skew=skewness(HOME_VAL, na.rm=T),
Kurt=kurtosis(HOME_VAL, na.rm=T))),2)
h1 <- ggplot(auto, aes(HOME_VAL)) + geom_histogram(binwidth = 40, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of HOME_VAL') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(HOME_VAL, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 40, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by HOME_VAL') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$HOME_VAL_5 <- 5 * floor(auto$HOME_VAL / 5)
auto$HOME_VAL_30 <- 30 * floor(auto$HOME_VAL / 30)
lo <- auto %>%
group_by(HOME_VAL_30) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(HOME_VAL_30,logodds)) +
geom_point(color ='steelblue', size=1) +
theme_classic()+ labs(title = 'Log Odds Clm by Binned HOME_VAL') +
geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(HOME_VAL, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) by HOME_VAL') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(HOME_VAL)) %>%
group_by(HOMEVALBIN=50*floor(HOME_VAL/50)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=120)
tbl <- with(auto, rbind(addmargins(table(4*floor(BLUEBOOK/4))),addmargins(prop.table(table(4*floor(BLUEBOOK/4))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=100)
round(with(auto,c(summary(BLUEBOOK), StdD=sd(BLUEBOOK, na.rm=T), Skew=skewness(BLUEBOOK, na.rm=T),
Kurt=kurtosis(BLUEBOOK, na.rm=T))),2)
h1 <- ggplot(auto, aes(BLUEBOOK)) + geom_histogram(binwidth = 2, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of BLUEBOOK') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(BLUEBOOK, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 2, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by BLUEBOOK') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$BLUEBOOK_5 <- 5 * floor(auto$BLUEBOOK / 5)
auto$BLUEBOOK_2 <- 2 * floor(auto$BLUEBOOK / 2)
lo <- auto %>%
group_by(BLUEBOOK_2) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(BLUEBOOK_2,logodds)) +
geom_point(color ='steelblue', size=1) +
theme_classic()+ labs(title = 'Log Odds Clm by Binned BLUEBOOK') +
geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(BLUEBOOK, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) by BLUEBOOK') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(BLUEBOOK)) %>%
group_by(BLUEBKBIN=5*floor(BLUEBOOK/5)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
options(width=120)
tbl <- with(auto, rbind(addmargins(table(4000*floor(OLDCLAIM/4000))),addmargins(prop.table(table(4000*floor(OLDCLAIM/4000))))*100))
row.names(tbl) <- c('count','percent')
round(tbl,1)
options(width=100)
round(with(auto,c(summary(OLDCLAIM), StdD=sd(OLDCLAIM, na.rm=T), Skew=skewness(OLDCLAIM, na.rm=T),
Kurt=kurtosis(OLDCLAIM, na.rm=T))),2)
h1 <- ggplot(auto, aes(OLDCLAIM)) + geom_histogram(binwidth = 4, fill="steelblue4", color="ghostwhite" ) +
theme_classic() + labs(title = 'Histogram of OLDCLAIM') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
h2 <- ggplot(auto, aes(OLDCLAIM, ..density.., fill=factor(TARGET_FLAG))) +
geom_histogram(binwidth = 4, color="ghostwhite", position="identity", alpha=0.5) +
theme_classic() + labs(title = 'TARGET_FLAG Prop by OLDCLAIM') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue")) +
scale_fill_manual("TARGET_FLAG",values=c("steelblue4","cornflowerblue")) +
theme(legend.position = c(1,1),legend.justification = c(1,1), legend.background =
element_rect(fill='lightblue'))
#auto$OLDCLAIM_5 <- 5 * floor(auto$OLDCLAIM / 5)
auto$OLDCLAIM_4 <- 4 * floor(auto$OLDCLAIM / 4)
lo <- auto %>%
group_by(OLDCLAIM_4) %>%
summarise(logodds=round(log((sum(TARGET_FLAG)/length(TARGET_FLAG))/(1-sum(TARGET_FLAG)/length(TARGET_FLAG))),3))
sp1 <- ggplot(lo, aes(OLDCLAIM_4,logodds)) +
geom_point(color ='steelblue', size=1) +
theme_classic()+ labs(title = 'Log Odds Clm by Binned OLDCLAIM') +
geom_smooth(method='lm', color='darkred', size=0.75, se=T, formula='y ~ poly(x, 2)') +
theme(plot.title = element_text(hjust = 0.5), axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),panel.background = element_rect(fill = "lightblue"))
sp2 <- ggplot(auto_censor, aes(OLDCLAIM, log(TARGET_AMT))) +
geom_point(position="jitter", color="steelblue4", size=1) +
geom_smooth(method = "loess", size=1, color="darkred") + theme_classic() +
labs(title = 'log(TARGET_AMT) by OLDCLAIM') +
theme(plot.title = element_text(hjust = 0.5),axis.title.y=element_text(size=10)) +
theme(plot.title = element_text(size=12),legend.title=element_text(size=8),
panel.background = element_rect(fill = "lightblue"))
grid.arrange(h1,h2,sp1,sp2, ncol=2)
t <- auto[auto$TARGET_AMT > 0,] %>%
filter(!is.na(OLDCLAIM)) %>%
group_by(OLD_BIN=5*floor(OLDCLAIM/5)) %>%
summarise(ct=n(),mean_cost=round(mean(TARGET_AMT)), median_cost=round(median(TARGET_AMT)))
kable(t)
tmp_data <- mice(auto,maxit=5, method='pmm',seed=20, print=F)
densityplot(tmp_data)[1:5]
auto$AGE <- ifelse(is.na(auto$AGE), mean(auto$AGE),auto$AGE)
tmp_data <- mice(auto,maxit=1, method='pmm',seed=20, print=F)
auto <- complete(tmp_data,1)
#auto_censor <- auto[auto$TARGET_FLAG==1,]
auto$CAR_AGE <- ifelse(auto$CAR_AGE == -3, 0, auto$CAR_AGE)
boxcoxfit(auto$TARGET_AMT[auto$TARGET_FLAG==1])
auto$TARGET_AMT_MOD <- log(auto$TARGET_AMT)
auto$KIDSDRIV_MOD <- scale(auto$KIDSDRIV, scale=F)
auto$HOMEKIDS_MOD <- scale(auto$HOMEKIDS, scale=F)
auto$CLM_FREQ_MOD <- scale(auto$CLM_FREQ, scale=F)
auto$AGE_MOD <- scale(auto$AGE, scale=F)
#auto$YOJ_ZERO <- ifelse(auto$YOJ == 0,1,0)
# auto$YOJ_MOD <- ifelse(auto$YOJ == 0,
# mean(auto$YOJ[auto$YOJ !=0]),
# scale(auto$YOJ[auto$YOJ !=0],
# scale=F))
auto$YOJ_MOD <- scale(auto$YOJ, scale=F)
auto$TRAVTIME_MOD <- scale(auto$TRAVTIME, scale=F)
auto$TIF_MOD <- scale(auto$TIF, scale=F)
boxcoxfit(auto$INCOME[auto$INCOME >0])
auto$INCOME_MOD <- auto$INCOME ^0.5
boxcoxfit(auto$HOME_VAL[auto$HOME_VAL > 0])
#auto$HOME_VAL_ZERO <- ifelse(auto$HOME_VAL==0,1,0)
auto$HOME_VAL_MOD <- auto$HOME_VAL^0.25
boxcoxfit(auto$BLUEBOOK)
auto$BLUEBOOK_MOD <- auto$BLUEBOOK^0.5
boxcoxfit(auto$OLDCLAIM[auto$OLDCLAIM>0])
auto$OLD_CLAIM_MOD <- log(auto$OLDCLAIM + 1)
m1 <- glm(TARGET_FLAG ~. -TARGET_AMT -TARGET_AMT_MOD -INDEX -KIDSDRIV -HOMEKIDS -CLM_FREQ -AGE -YOJ -TRAVTIME -TIF -SEX -RED_CAR -INCOME -HOME_VAL -BLUEBOOK -OLDCLAIM
-INCOME_30 -HOME_VAL_30 -BLUEBOOK_2 -OLDCLAIM_4,
data = auto, family='binomial')
vif(m1)
summary(m1)
m1_mod <- update(m1, . ~ . -CAR_AGE -HOMEKIDS_MOD -AGE_MOD -YOJ_MOD -OLD_CLAIM_MOD)
summary(m1_mod)
m2 <- update(m1, . ~ . + I(KIDSDRIV_MOD^2) + I(HOMEKIDS_MOD^2) + I(CLM_FREQ_MOD^2)
+ I(AGE_MOD^2) + I(YOJ_MOD^2)
+ I(TIF_MOD^2))
summary(m2)
m2_mod <- update(m2, . ~ . -YOJ_MOD -HOMEKIDS_MOD -I(YOJ_MOD^2) -I(HOMEKIDS_MOD^2)
- CAR_AGE)
summary(m2_mod)
auto_redux <- auto[,c(9,11,12:14,16,19,20,23:26,32:42,2)]
model.upper <- glm(TARGET_FLAG ~ .,family="binomial", data=auto_redux)
model.null <- glm(TARGET_FLAG ~ 1, family = "binomial", data=auto_redux)
m3 <- step(model.null,scope=list(upper=model.upper, lower=model.null),
trace= 0, direction='both')
summary(m3)
# m4 <- bestglm(Xy=auto_redux, family=binomial,IC="AIC", method="exhaustive", nvmax=3)
# summary(m4$BestModel)
names(auto_redux)
auto_clm <- auto[auto$TARGET_FLAG ==
1,c(9,11,12:14,16,19,20,23:26,32:42,31)]
mymodel <- lm(TARGET_AMT_MOD ~ ., data=auto_clm)
vif(mymodel)
m4 <- lm(TARGET_AMT_MOD ~ KIDSDRIV_MOD + HOMEKIDS_MOD+ AGE_MOD + TRAVTIME_MOD + SEX + CAR_USE + RED_CAR + URBANICITY + JOB + CAR_TYPE + EDUCATION + BLUEBOOK_MOD , data=auto_clm)
summary(m4)
m4_mod <- lm(TARGET_AMT_MOD ~ BLUEBOOK_MOD, data=auto_clm)
summary(m4_mod)
m5 <- update(m4_mod, . ~ . + I(BLUEBOOK_MOD^2) + TRAVTIME_MOD + I(TRAVTIME_MOD^2) + AGE_MOD + I(AGE_MOD^2))
summary(m5)
m5_mod <- update(m4_mod, . ~ . + I(BLUEBOOK_MOD^2) + AGE_MOD + I(AGE_MOD^2))
summary(m5_mod)
model.upper <- lm(TARGET_AMT_MOD ~ ., data=auto_clm)
model.null <- lm(TARGET_AMT_MOD ~ 1, data=auto_clm)
m6 <- step(model.null,scope=list(upper=model.upper, lower=model.null),
trace= 0, direction='both')
summary(m6)
m1_out <- cbind(AIC=AIC(m1), AICc=AICc(m1), BIC = BIC(m1), loglik=logLik(m1))
m1_mod_out <- cbind(AIC=AIC(m1_mod), AICc=AICc(m1_mod), BIC = BIC(m1_mod), loglik=logLik(m1_mod))
m2_out <- cbind(AIC=AIC(m2), AICc=AICc(m2), BIC = BIC(m2), loglik=logLik(m2))
m2_mod_out <- cbind(AIC=AIC(m2_mod), AICc=AICc(m2_mod), BIC = BIC(m2_mod), loglik=logLik(m2_mod))
m3_out <- cbind(AIC=AIC(m3), AICc=AICc(m3), BIC = BIC(m3), loglik=logLik(m3))
model_comp <- rbind(m1_out, m1_mod_out, m2_out, m2_mod_out, m3_out)
rownames(model_comp) <- c("m1","m1_mod","m2","m2_mod","m3")
model_comp
# models 1 and 2 prediction probs
m1_pred <- predict(m1, auto, type="response")
m1_mod_pred <- predict(m1_mod, auto, type="response")
m2_pred <- predict(m2, auto, type="response")
m2_mod_pred <- predict(m2_mod, auto, type="response")
m3_pred <- predict(m3, auto, type="response")
#AUC
paste("Model 1:",round(as.numeric(roc(auto$TARGET_FLAG, m1_pred)["auc"]),3))
paste("Model 1 mod:",round(as.numeric(roc(auto$TARGET_FLAG, m1_mod_pred)["auc"]),3))
paste("Model 2:",round(as.numeric(roc(auto$TARGET_FLAG, m2_pred)["auc"]),3))
paste("Model 2 mod:",round(as.numeric(roc(auto$TARGET_FLAG, m2_mod_pred)["auc"]),3))
paste("Model 3:",round(as.numeric(roc(auto$TARGET_FLAG, m3_pred)["auc"]),3))
mypredict <- predict(m2_mod, auto, type='response')
summary(mypredict)
myroc <- roc(auto$TARGET_FLAG, mypredict, plot=T, asp=NA,
legacy.axes=T, main = "ROC Curve", ret="tp", col="blue")
coords(myroc, "best", ret='threshold', best.method="youden")
predict_target <- ifelse(mypredict >= 0.276, 1, 0)
cm <- confusionMatrix(data=as.factor(predict_target), reference=as.factor(auto$TARGET_FLAG), positive='1')
cm$table
a <- cm$overall["Accuracy"]; names(a) <- NULL # accuracy
e <- 1 - a; names(e) <- NULL # error rate
p <- cm$byClass["Precision"]; names(p) <- NULL # precision
st <- cm$byClass["Sensitivity"]; names(st) <- NULL # sensitivity
sp <- cm$byClass["Specificity"]; names(sp) <- NULL # specificity
f1 <- cm$byClass["F1"] ; names(f1) <- NULL # F1
# display metrics
list(accuracy=a, error_rt = e, precision=p, sensitivity=st, specificity=sp, F1=f1)
predict_target2 <- ifelse(mypredict >= 0.5, 1, 0)
cm <- confusionMatrix(data=as.factor(predict_target2), reference=as.factor(auto$TARGET_FLAG), positive='1')
cm$table
a <- cm$overall["Accuracy"]; names(a) <- NULL # accuracy
e <- 1 - a; names(e) <- NULL # error rate
p <- cm$byClass["Precision"]; names(p) <- NULL # precision
st <- cm$byClass["Sensitivity"]; names(st) <- NULL # sensitivity
sp <- cm$byClass["Specificity"]; names(sp) <- NULL # specificity
f1 <- cm$byClass["F1"] ; names(f1) <- NULL # F1
# display metrics
list(accuracy=a, error_rt = e, precision=p, sensitivity=st, specificity=sp, F1=f1)
RMSE = function(m, o){
sqrt(mean((m - o)^2))
}
# predictions
p4 <- predict(m4, auto_clm)
p4_mod <- predict(m4_mod, auto_clm)
p5 <- predict(m5, auto_clm)
p5_mod <- predict(m5_mod, auto_clm)
p6 <- predict(m6, auto_clm)
m4_out <- cbind(R_2=round(summary(m4)$r.squared,4),
adj_R_2=round(summary(m4)$adj.r.squared,4),
rmse=RMSE(exp(p4),exp(auto_clm$TARGET_AMT_MOD)),
AIC=AIC(m4),AICc=AICc(m4),BIC=BIC(m4))
m4_mod_out <- cbind(R_2=round(summary(m4_mod)$r.squared,4),
adj_R_2=round(summary(m4_mod)$adj.r.squared,4),
rmse=RMSE(exp(p4_mod),exp(auto_clm$TARGET_AMT_MOD)),
AIC=AIC(m4_mod),AICc=AICc(m4_mod),BIC=BIC(m4_mod))
m5_out <- cbind(R_2=round(summary(m5)$r.squared,4),
adj_R_2=round(summary(m5)$adj.r.squared,4),
rmse=RMSE(exp(p5),exp(auto_clm$TARGET_AMT_MOD)),
AIC=AIC(m5),AICc=AICc(m5),BIC=BIC(m5))
m5_mod_out <- cbind(R_2=round(summary(m5_mod)$r.squared,4),
adj_R_2=round(summary(m5_mod)$adj.r.squared,4),
rmse=RMSE(exp(p5_mod),exp(auto_clm$TARGET_AMT_MOD)),
AIC=AIC(m5_mod),AICc=AICc(m5_mod),BIC=BIC(m5_mod))
m6_out <- cbind(R_2=round(summary(m6)$r.squared,4),
adj_R_2=round(summary(m6)$adj.r.squared,4),
rmse=RMSE(exp(p6),exp(auto_clm$TARGET_AMT_MOD)),
AIC=AIC(m6),AICc=AICc(m6),BIC=BIC(m6))
reg <- rbind(m4=m4_out, m4_mod=m4_mod_out,m5=m5_out,m5_mod=m5_mod_out,m6=m6_out)
rownames(reg) <- c("m4", "m4_mod", "m5", "m5_mod","m6")
reg
par(mfrow = c(2,2))
plot(m6)
url <- "https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/dataset/insurance-evaluation-data.csv"
auto <- read.csv(url)
currency_fix <- function(x) {
num <- str_replace_all(x, "\\$","")
num <- as.numeric(str_replace_all(num, "\\,",""))
num
}
auto$INCOME <- currency_fix(auto$INCOME)
auto$HOME_VAL <- currency_fix(auto$HOME_VAL)
auto$BLUEBOOK <- currency_fix(auto$BLUEBOOK)
auto$OLDCLAIM <- currency_fix(auto$OLDCLAIM)
auto[sapply(auto, is.factor)] <- lapply(auto[sapply(auto, is.factor)],
function(x) str_replace(x,"z_|<",""))
auto[sapply(auto, is.character)] <- lapply(auto[sapply(auto, is.character)],as.factor)
auto$EDUCATION <- ordered(auto$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD") )
auto$INCOME <- auto$INCOME / 1000
auto$HOME_VAL <- auto$HOME_VAL / 1000
auto$BLUEBOOK <- auto$BLUEBOOK / 1000
auto$OLDCLAIM <- auto$OLDCLAIM / 1000
auto$AGE <- ifelse(is.na(auto$AGE), mean(auto$AGE),auto$AGE)
tmp_data <- mice(auto,maxit=1, method='pmm',seed=20, print=F)
auto <- complete(tmp_data,1)
auto$TARGET_AMT_MOD <- log(auto$TARGET_AMT)
auto$KIDSDRIV_MOD <- scale(auto$KIDSDRIV, scale=F)
auto$HOMEKIDS_MOD <- scale(auto$HOMEKIDS, scale=F)
auto$CLM_FREQ_MOD <- scale(auto$CLM_FREQ, scale=F)
auto$AGE_MOD <- scale(auto$AGE, scale=F)
auto$YOJ_MOD <- scale(auto$YOJ, scale=F)
auto$TRAVTIME_MOD <- scale(auto$TRAVTIME, scale=F)
auto$TIF_MOD <- scale(auto$TIF, scale=F)
auto$INCOME_MOD <- auto$INCOME ^0.5
auto$HOME_VAL_MOD <- auto$HOME_VAL^0.25
auto$BLUEBOOK_MOD <- auto$BLUEBOOK^0.5
auto$OLD_CLAIM_MOD <- log(auto$OLDCLAIM + 1)
mypred <- predict(m2_mod, auto, type='response')
auto$TARGET_FLAG <- ifelse(mypred >= 0.276, 1, 0)
mypred2 <- exp(predict(m6, auto))
auto$TARGET_AMT <- mypred2
write.csv(auto, "evaluation_data_w_predictions.csv")