Homework 4 - Binary Logistic Regression & Multiple Linear Regression

Introduction:

We load an auto insurance company dataset containing 8,161 records. Each record represents a customer, and each record has two response variables: TARGET_FLAG and TARGET_AMT. Below is a short description of all the variables of interest in the data set, including these response variables:

VARIABLE NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable None
TARGET_FLAG Was Car in a crash? 1=YES 0=NO None
TARGET_AMT If car was in a crash, what was the cost None
AGE Age of Driver Very young and very old people tend to be risky
BLUEBOOK Value of Vehicle Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_AGE Vehicle Age Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_TYPE Type of Car Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_USE Vehicle Use Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ # Claims (Past 5 Years) The more claims you filed in the past, the more you are likely to file in the future
EDUCATION Max Education Level Unknown, but possibly more educated people tend to drive more safely
HOMEKIDS # Children at Home Unknown
HOME_VAL Home Value Homeowners tend to drive safer
INCOME Income Rich people tend to be in fewer crashes
JOB Job Category White collar jobs tend to be safer
KIDSDRIV # Driving Children When teenagers drive your car, you are more likely to get into crashes
MSTATUS Marital Status Married people drive more safely
MVR_PTS Motor Vehicle Record Points If you get a lot of traffic tickets, you tend to get into more accidents
OLDCLAIM Total Claims (Past 5 Years) If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1 Single Parent Unknown
RED_CAR A Red Car Urban legend says that red cars (especially red sports cars) are more risky
REVOKED License Revoked (Past 7 Years) If your license was revoked in the past 7 years, you probably are a more risky driver
SEX Gender Urban legend says that women have fewer crashes than men
TIF Time in Force People who have been customers for a long time are usually more safe
TRAVTIME Distance to Work Long drives to work usually suggest greater risk
URBANICITY Home/Work Area Unknown
YOJ Years on Job People who stay at a job for a long time are usually more safe

Data Exploration:

We check the classes of our variables to determine whether any of them need to be coerced to numeric or other classes prior to exploratory data analysis.

Class Count Variables
character 14 BLUEBOOK, CAR_TYPE, CAR_USE, EDUCATION, HOME_VAL, INCOME, JOB, MSTATUS, OLDCLAIM, PARENT1, RED_CAR, REVOKED, SEX, URBANICITY
integer 11 AGE, CAR_AGE, CLM_FREQ, HOMEKIDS, INDEX, KIDSDRIV, MVR_PTS, TARGET_FLAG, TIF, TRAVTIME, YOJ
numeric 1 TARGET_AMT

INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are all character variables that will need to be coerced to integers after we strip the “$” from their strings. TARGET_FLAG and the remaining character variables will all need to be coerced to factors.
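
A minimal sketch of these coercions, assuming the loaded data frame is named df:

library(dplyr)

df <- df %>%
  mutate(
    across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM),
           ~ as.numeric(gsub("[$,]", "", .x))),             # strip "$" (and any ",") before coercing
    across(c(TARGET_FLAG, where(is.character)), as.factor)  # coerce the flag and remaining character variables to factors
  )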

We remove the identification variable INDEX and take a look at a summary of the dataset’s completeness.

rows 8161
columns 25
all_missing_columns 0
total_missing_values 2405
complete_rows 6045

None of our columns are completely devoid of data. There are 6,045 complete rows in the dataset, which is about 74% of our observations. There are 2,405 total missing values. We take a look at which variables contain these missing values and what the spread is.

A very small percentage of observations contain missing AGE values. The INCOME, YOJ, HOME_VAL, CAR_AGE, and JOB variables are each missing around 5.5 to 6.5 percent of values. There are no variables containing such extreme proportions of missing values that removal would be warranted on that basis alone.

To check whether the predictor variables are correlated with the binary response variable, we produce a correlation funnel that visualizes the strength of the relationships between our predictors and TARGET_FLAG. This correlation funnel will not include variables for which there are any missing values.
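
The funnel is produced along these lines; this is a hedged sketch, assuming the data frame is named df and that TARGET_FLAG has already been converted to a factor so that binarize() creates a TARGET_FLAG__1 column:

library(correlationfunnel)
library(dplyr)

df %>%
  select(-TARGET_AMT) %>%
  select(where(~ !any(is.na(.x)))) %>%           # the funnel only uses variables with no missing values
  binarize(n_bins = 4, thresh_infreq = 0.01) %>%
  correlate(target = TARGET_FLAG__1) %>%
  plot_correlation_funnel()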

The predictor variables without missing values that are most correlated with getting into a car crash are CLM_FREQ, URBANICITY, MVR_PTS, OLDCLAIM, PARENT1, REVOKED, and CAR_USE. Some of this is unsurprising. Increased claim frequency, increased numbers of traffic tickets, increased past payouts, having your license previously revoked, and using your car commercially all positively correlate with getting into a car crash, as we expected they would. We did not expect URBANICITY to be so relevant, but urban areas can often be more difficult to drive through and have more traffic, so that combination could reasonably make urban-dwellers more likely to get into car crashes, as the correlation suggests. We also did not expect PARENT1 to be so relevant, but the correlation between being a single parent and getting into a car crash is very similar to that of having your license previously revoked and getting into a car crash.

The predictor variables without missing values that are least correlated with getting into a car crash are SEX and RED_CAR. Being a woman has a very slight positive correlation with getting into a car crash, and driving a red car has a slightly negative correlation with getting into a car crash. These are contrary to urban legend, and more importantly they probably won’t be useful when modeling.

To check whether the predictor variables are correlated with the numeric response variable, we produce correlation plots that visualize the strength of the relationships between our predictors and TARGET_AMT (only when observations involve a car crash, as otherwise we know TARGET_AMT = 0). For readability, first we look at numeric predictors only.

It’s worth noting that BLUEBOOK is the single numeric variable most correlated with an increased TARGET_AMT, which is sensible. Cars that are currently still more valuable can be more expensive to fix. We expected CAR_AGE to be more negatively correlated with TARGET_AMT.

Next we look at two-level factors.

We see a small positive correlation between using your car commercially and TARGET_AMT, but an equally large negative correlation between being female and TARGET_AMT. The former is more intuitive than the latter, so ultimately neither may be a good predictor of TARGET_AMT.

Finally we look at factors with more than two levels.

The various car types don’t have as high of a correlation (either positively or negatively) with TARGET_AMT as expected, but we still believe CAR_TYPE will be somewhat useful for modeling.

Because we have so many variables, it would be difficult to check for and visualize collinearity for our responses and predictors all at the same time without setting a threshold. So we will set a correlation threshold of 0.45 (in absolute value) and only visualize variables with any correlation values at or above that level.
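
A hedged sketch of this thresholded plot, assuming the data frame is named df; factors are dummy-encoded so their levels can appear alongside the numeric variables, and rows with missing values are dropped by model.matrix:

library(ggcorrplot)

num_mat <- model.matrix(~ . - 1, data = df)    # dummy-encode factors; rows with NAs are dropped
cors    <- cor(num_mat)
diag(cors) <- 0                                # ignore self-correlations when thresholding
keep    <- apply(abs(cors) >= 0.45, 1, any)    # keep variables with any |correlation| >= 0.45
ggcorrplot(cors[keep, keep], lab = TRUE, lab_size = 2, tl.cex = 7)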

We see some expected collinearity. KIDSDRIV and HOMEKIDS are moderately positively correlated because teenagers driving your car depends on you having kids at all, but the number of teens driving your car won’t always exactly match the number of kids you have. HOME_VAL and INCOME are pretty positively correlated, as higher incomes lead to the ability to purchase higher valued homes. Not being married is also moderately negatively correlated with HOME_VAL, likely because married people often have two incomes instead of one and can therefore purchase higher valued homes. Having a PhD is equally correlated with being a doctor or lawyer, which makes sense because those jobs require them. Working a blue collar job is logically pretty negatively correlated with driving your car privately since driving your car commercially is itself a blue collar job. Being a woman is very negatively correlated with driving a red car. Lastly of note, claim frequency is moderately correlated with higher past payouts, which adds up.

We have 14 numeric variables and 11 categorical variables (including the dummy variable TARGET_FLAG). We list the possible ranges or values for each variable in the breakdown below:

Variable Type Values
AGE Numeric 16 - 81
BLUEBOOK Numeric 1500 - 69740
CAR_AGE Numeric -3 - 28
CLM_FREQ Numeric 0 - 5
HOME_VAL Numeric 0 - 885282
HOMEKIDS Numeric 0 - 5
INCOME Numeric 0 - 367030
KIDSDRIV Numeric 0 - 4
MVR_PTS Numeric 0 - 13
OLDCLAIM Numeric 0 - 57037
TARGET_AMT Numeric 0 - 107586.1
TIF Numeric 1 - 25
TRAVTIME Numeric 5 - 142
YOJ Numeric 0 - 23
CAR_TYPE Categorical Minivan, Panel Truck, Pickup, Sports Car, Van, z_SUV
CAR_USE Categorical Commercial, Private
EDUCATION Categorical <High School, Bachelors, Masters, PhD, z_High School
JOB Categorical Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student, z_Blue Collar
MSTATUS Categorical Yes, z_No
PARENT1 Categorical No, Yes
RED_CAR Categorical no, yes
REVOKED Categorical No, Yes
SEX Categorical M, z_F
TARGET_FLAG Categorical 0, 1
URBANICITY Categorical Highly Urban/ Urban, z_Highly Rural/ Rural

The ranges for TARGET_AMT, HOME_VAL, and INCOME all include zero, and recoding these zero values as NA will make analyzing summary statistics for these variables more meaningful than if we included zeroes in their calculations. (We will maintain a separate copy of the data, in which we do not introduce additional NA values, for later use when creating the fully imputed dataset that some of our models will rely on for completeness.)

The range for CAR_AGE includes -3. Since the variable can only take positive or zero values logically, and only one observation in the dataset has a negative sign, we make the assumption that the age of 3 years is correct for this observation, and the sign is simply a data entry error. We fix this observation.

Some of the factor levels are named inconsistently, so we will rename and relevel them in the next section.

Let’s take a look at the summary statistics for each variable.

##  TARGET_FLAG   TARGET_AMT           KIDSDRIV           AGE       
##  0:6008      Min.   :    30.28   Min.   :0.0000   Min.   :16.00  
##  1:2153      1st Qu.:  2609.78   1st Qu.:0.0000   1st Qu.:39.00  
##              Median :  4104.00   Median :0.0000   Median :45.00  
##              Mean   :  5702.18   Mean   :0.1711   Mean   :44.79  
##              3rd Qu.:  5787.00   3rd Qu.:0.0000   3rd Qu.:51.00  
##              Max.   :107586.14   Max.   :4.0000   Max.   :81.00  
##              NA's   :6008                         NA's   :6      
##     HOMEKIDS           YOJ           INCOME       PARENT1       HOME_VAL     
##  Min.   :0.0000   Min.   : 0.0   Min.   :     5   No :7084   Min.   : 50223  
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 34135   Yes:1077   1st Qu.:153074  
##  Median :0.0000   Median :11.0   Median : 58438              Median :206692  
##  Mean   :0.7212   Mean   :10.5   Mean   : 67259              Mean   :220621  
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 90053              3rd Qu.:270023  
##  Max.   :5.0000   Max.   :23.0   Max.   :367030              Max.   :885282  
##                   NA's   :454    NA's   :1060                NA's   :2758    
##  MSTATUS      SEX               EDUCATION               JOB      
##  Yes :4894   M  :3786   <High School :1203   z_Blue Collar:1825  
##  z_No:3267   z_F:4375   Bachelors    :2242   Clerical     :1271  
##                         Masters      :1658   Professional :1117  
##                         PhD          : 728   Manager      : 988  
##                         z_High School:2330   Lawyer       : 835  
##                                              (Other)      :1599  
##                                              NA's         : 526  
##     TRAVTIME            CAR_USE        BLUEBOOK          TIF        
##  Min.   :  5.00   Commercial:3029   Min.   : 1500   Min.   : 1.000  
##  1st Qu.: 22.00   Private   :5132   1st Qu.: 9280   1st Qu.: 1.000  
##  Median : 33.00                     Median :14440   Median : 4.000  
##  Mean   : 33.49                     Mean   :15710   Mean   : 5.351  
##  3rd Qu.: 44.00                     3rd Qu.:20850   3rd Qu.: 7.000  
##  Max.   :142.00                     Max.   :69740   Max.   :25.000  
##                                                                     
##         CAR_TYPE    RED_CAR       OLDCLAIM        CLM_FREQ      REVOKED   
##  Minivan    :2145   no :5783   Min.   :    0   Min.   :0.0000   No :7161  
##  Panel Truck: 676   yes:2378   1st Qu.:    0   1st Qu.:0.0000   Yes:1000  
##  Pickup     :1389              Median :    0   Median :0.0000             
##  Sports Car : 907              Mean   : 4037   Mean   :0.7986             
##  Van        : 750              3rd Qu.: 4636   3rd Qu.:2.0000             
##  z_SUV      :2294              Max.   :57037   Max.   :5.0000             
##                                                                           
##     MVR_PTS          CAR_AGE                       URBANICITY  
##  Min.   : 0.000   Min.   : 0.000   Highly Urban/ Urban  :6492  
##  1st Qu.: 0.000   1st Qu.: 1.000   z_Highly Rural/ Rural:1669  
##  Median : 1.000   Median : 8.000                               
##  Mean   : 1.696   Mean   : 8.329                               
##  3rd Qu.: 3.000   3rd Qu.:12.000                               
##  Max.   :13.000   Max.   :28.000                               
##                   NA's   :510

The majority of observations live/work in a highly urban or urban area. There are more married than unmarried observations, and there are also more female than male observations. The median driver is 45 years old, has been at their job for 11 years, and has an income of roughly $58,500. Most cars in the dataset are driven for private use rather than commercially, and the median car age is 8 years.

6,008 observations, which is the majority of observations, do not involve car crashes, and we now correctly record 6,008 NA observations for TARGET_AMT. (Since we introduced NA values for TARGET_AMT on purpose, we will not consider imputing them.)

There are 6 NA values in AGE, 510 in CAR_AGE, 454 in YOJ, 1,060 in INCOME, 2,758 in HOME_VAL, and 526 in JOB. In the next section, we will impute all these missing values in an alternate version of our dataset, as we mentioned earlier, and in the main version of our dataset, we will only impute the variables if we determine their data is at least Missing at Random (MAR), and there’s no other evidence we should exclude them from imputation.

We check whether there is evidence that the data are Missing Completely at Random (MCAR), a higher standard than MAR, using the mcar_test function from the naniar package. Meeting this standard is unlikely with real data, but still worth checking.
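
A minimal sketch of the test, assuming the data frame is named df and that the test is run on the numeric columns, which mcar_test handles most naturally:

library(naniar)
library(dplyr)

df %>%
  select(where(is.numeric)) %>%
  mcar_test()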

statistic df p.value missing.patterns
16862.3 1116 0 51

The low p-value provides evidence that missing data on these variables are not MCAR.

Excluding AGE, whose missing-value count is small and which we plan to impute anyway, we check whether missingness in each of the other variables is associated with any of the other predictors or the response variables using the missing_compare function from the finalfit package. Because there are so many variables, we only display comparisons with p-values below 0.05, dropping observed variables that show no evidence of accounting for a variable's missingness.
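
The comparisons below were produced with calls along these lines; a hedged sketch for one dependent variable (INCOME), assuming the data frame is named df, with the same call repeated for HOME_VAL and JOB:

library(finalfit)

explanatory <- setdiff(names(df), c("INCOME", "TARGET_AMT"))
missing_compare(df, dependent = "INCOME", explanatory = explanatory)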

Dependent Explanatory Ref Not Missing Missing p
INCOME TARGET_FLAG 0 5308 (88.3) 700 (11.7) 0.001
INCOME TARGET_FLAG 1 1793 (83.3) 360 (16.7) NA
INCOME AGE Mean (SD) 44.9 (8.6) 43.8 (9.0) 0.001
INCOME HOMEKIDS Mean (SD) 0.7 (1.1) 0.9 (1.2) 0.001
INCOME YOJ Mean (SD) 11.4 (2.8) 4.3 (5.8) 0.001
INCOME PARENT1 No 6188 (87.4) 896 (12.6) 0.022
INCOME PARENT1 Yes 913 (84.8) 164 (15.2) NA
INCOME HOME_VAL Mean (SD) 227842.0 (93771.4) 155319.7 (92741.6) 0.001
INCOME SEX M 3420 (90.3) 366 (9.7) 0.001
INCOME SEX z_F 3681 (84.1) 694 (15.9) NA
INCOME EDUCATION <High School 982 (81.6) 221 (18.4) 0.001
INCOME EDUCATION Bachelors 1972 (88.0) 270 (12.0) NA
INCOME EDUCATION Masters 1547 (93.3) 111 (6.7) NA
INCOME EDUCATION PhD 652 (89.6) 76 (10.4) NA
INCOME EDUCATION z_High School 1948 (83.6) 382 (16.4) NA
INCOME JOB Clerical 1198 (94.3) 73 (5.7) 0.001
INCOME JOB Doctor 232 (94.3) 14 (5.7) NA
INCOME JOB Home Maker 308 (48.0) 333 (52.0) NA
INCOME JOB Lawyer 792 (94.9) 43 (5.1) NA
INCOME JOB Manager 937 (94.8) 51 (5.2) NA
INCOME JOB Professional 1055 (94.4) 62 (5.6) NA
INCOME JOB Student 350 (49.2) 362 (50.8) NA
INCOME JOB z_Blue Collar 1727 (94.6) 98 (5.4) NA
INCOME CAR_USE Commercial 2675 (88.3) 354 (11.7) 0.008
INCOME CAR_USE Private 4426 (86.2) 706 (13.8) NA
INCOME BLUEBOOK Mean (SD) 16199.2 (8430.5) 12432.0 (7574.9) 0.001
INCOME TIF Mean (SD) 5.4 (4.2) 5.1 (4.0) 0.045
INCOME CAR_TYPE Minivan 1922 (89.6) 223 (10.4) 0.001
INCOME CAR_TYPE Panel Truck 632 (93.5) 44 (6.5) NA
INCOME CAR_TYPE Pickup 1225 (88.2) 164 (11.8) NA
INCOME CAR_TYPE Sports Car 729 (80.4) 178 (19.6) NA
INCOME CAR_TYPE Van 683 (91.1) 67 (8.9) NA
INCOME CAR_TYPE z_SUV 1910 (83.3) 384 (16.7) NA
INCOME RED_CAR no 4974 (86.0) 809 (14.0) 0.001
INCOME RED_CAR yes 2127 (89.4) 251 (10.6) NA
INCOME CLM_FREQ Mean (SD) 0.8 (1.2) 0.9 (1.2) 0.048
INCOME CAR_AGE Mean (SD) 8.5 (5.7) 7.2 (5.3) 0.001
INCOME URBANICITY Highly Urban/ Urban 5753 (88.6) 739 (11.4) 0.001
INCOME URBANICITY z_Highly Rural/ Rural 1348 (80.8) 321 (19.2) NA
HOME_VAL TARGET_FLAG 0 4217 (70.2) 1791 (29.8) 0.001
HOME_VAL TARGET_FLAG 1 1186 (55.1) 967 (44.9) NA
HOME_VAL AGE Mean (SD) 45.4 (8.5) 43.5 (8.7) 0.001
HOME_VAL HOMEKIDS Mean (SD) 0.7 (1.1) 0.8 (1.1) 0.009
HOME_VAL YOJ Mean (SD) 11.1 (3.7) 9.3 (4.6) 0.001
HOME_VAL INCOME Mean (SD) 68771.2 (44434.0) 63968.7 (48518.0) 0.001
HOME_VAL PARENT1 No 5055 (71.4) 2029 (28.6) 0.001
HOME_VAL PARENT1 Yes 348 (32.3) 729 (67.7) NA
HOME_VAL MSTATUS Yes 4267 (87.2) 627 (12.8) 0.001
HOME_VAL MSTATUS z_No 1136 (34.8) 2131 (65.2) NA
HOME_VAL EDUCATION <High School 729 (60.6) 474 (39.4) 0.001
HOME_VAL EDUCATION Bachelors 1545 (68.9) 697 (31.1) NA
HOME_VAL EDUCATION Masters 1166 (70.3) 492 (29.7) NA
HOME_VAL EDUCATION PhD 474 (65.1) 254 (34.9) NA
HOME_VAL EDUCATION z_High School 1489 (63.9) 841 (36.1) NA
HOME_VAL JOB Clerical 913 (71.8) 358 (28.2) 0.001
HOME_VAL JOB Doctor 154 (62.6) 92 (37.4) NA
HOME_VAL JOB Home Maker 456 (71.1) 185 (28.9) NA
HOME_VAL JOB Lawyer 596 (71.4) 239 (28.6) NA
HOME_VAL JOB Manager 703 (71.2) 285 (28.8) NA
HOME_VAL JOB Professional 817 (73.1) 300 (26.9) NA
HOME_VAL JOB Student 100 (14.0) 612 (86.0) NA
HOME_VAL JOB z_Blue Collar 1309 (71.7) 516 (28.3) NA
HOME_VAL CAR_USE Commercial 1942 (64.1) 1087 (35.9) 0.002
HOME_VAL CAR_USE Private 3461 (67.4) 1671 (32.6) NA
HOME_VAL BLUEBOOK Mean (SD) 16073.5 (8388.1) 14997.6 (8437.5) 0.001
HOME_VAL OLDCLAIM Mean (SD) 3726.1 (8512.2) 4646.3 (9245.6) 0.001
HOME_VAL CLM_FREQ Mean (SD) 0.7 (1.1) 0.9 (1.2) 0.001
HOME_VAL REVOKED No 4801 (67.0) 2360 (33.0) 0.001
HOME_VAL REVOKED Yes 602 (60.2) 398 (39.8) NA
HOME_VAL MVR_PTS Mean (SD) 1.6 (2.0) 1.9 (2.3) 0.001
HOME_VAL CAR_AGE Mean (SD) 8.4 (5.7) 8.1 (5.7) 0.012
HOME_VAL URBANICITY Highly Urban/ Urban 4345 (66.9) 2147 (33.1) 0.007
HOME_VAL URBANICITY z_Highly Rural/ Rural 1058 (63.4) 611 (36.6) NA
JOB KIDSDRIV Mean (SD) 0.2 (0.5) 0.1 (0.4) 0.005
JOB AGE Mean (SD) 44.7 (8.7) 46.5 (8.0) 0.001
JOB HOMEKIDS Mean (SD) 0.7 (1.1) 0.4 (0.9) 0.001
JOB YOJ Mean (SD) 10.4 (4.2) 11.3 (2.7) 0.001
JOB INCOME Mean (SD) 63334.1 (42157.2) 118852.9 (58861.8) 0.001
JOB PARENT1 No 6601 (93.2) 483 (6.8) 0.001
JOB PARENT1 Yes 1034 (96.0) 43 (4.0) NA
JOB HOME_VAL Mean (SD) 213485.5 (89924.5) 322080.5 (121344.9) 0.001
JOB SEX M 3365 (88.9) 421 (11.1) 0.001
JOB SEX z_F 4270 (97.6) 105 (2.4) NA
JOB EDUCATION <High School 1203 (100.0) 0 (0.0) 0.001
JOB EDUCATION Bachelors 2242 (100.0) 0 (0.0) NA
JOB EDUCATION Masters 1330 (80.2) 328 (19.8) NA
JOB EDUCATION PhD 530 (72.8) 198 (27.2) NA
JOB EDUCATION z_High School 2330 (100.0) 0 (0.0) NA
JOB CAR_USE Commercial 2557 (84.4) 472 (15.6) 0.001
JOB CAR_USE Private 5078 (98.9) 54 (1.1) NA
JOB BLUEBOOK Mean (SD) 15161.5 (8018.6) 23669.5 (9952.7) 0.001
JOB CAR_TYPE Minivan 2123 (99.0) 22 (1.0) 0.001
JOB CAR_TYPE Panel Truck 435 (64.3) 241 (35.7) NA
JOB CAR_TYPE Pickup 1265 (91.1) 124 (8.9) NA
JOB CAR_TYPE Sports Car 902 (99.4) 5 (0.6) NA
JOB CAR_TYPE Van 634 (84.5) 116 (15.5) NA
JOB CAR_TYPE z_SUV 2276 (99.2) 18 (0.8) NA
JOB RED_CAR no 5510 (95.3) 273 (4.7) 0.001
JOB RED_CAR yes 2125 (89.4) 253 (10.6) NA
JOB OLDCLAIM Mean (SD) 3980.4 (8722.8) 4859.5 (9501.7) 0.026
JOB CLM_FREQ Mean (SD) 0.8 (1.2) 1.0 (1.3) 0.001
JOB CAR_AGE Mean (SD) 7.9 (5.6) 14.0 (4.6) 0.001
JOB URBANICITY Highly Urban/ Urban 5987 (92.2) 505 (7.8) 0.001
JOB URBANICITY z_Highly Rural/ Rural 1648 (98.7) 21 (1.3) NA

There is evidence that some of the missingness for INCOME, HOME_VAL, and JOB can be explained by other observed information, so they could be considered Missing at Random (MAR). There is no evidence missing values for CAR_AGE or YOJ can be explained by other observed information, so we will no longer consider imputing them in the main version of our dataset.

It’s reasonable to assume that the missing values in YOJ, HOME_VAL, INCOME and JOB might all be related because money, employment, and assets are interconnected. Therefore the missingness of one or more of these variables might be dependent on the missingness of one or more of the others. Let’s look at the overlap of observations with missing values for these variables using the missing_plot function from the finalfit package.

We do see some overlap in the observations that have missing values for these variables, but it’s hard to detect anything more conclusive from this plot. To take a closer look at the patterns of missingness between these variables, we can use the missing_pattern function from the finalfit package. (Note that in the visualization that follows, the numbers along the bottom axis are unfortunately illegible, but they are just the column-wise counts of missing values for each variable, plus a sum of missing values for all variables, and we have already remarked on these totals.)
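
A minimal sketch of both visualizations, assuming the data frame is named df:

library(finalfit)
library(dplyr)

money_vars <- c("YOJ", "INCOME", "HOME_VAL", "JOB")

df %>% select(all_of(money_vars)) %>% missing_plot()     # row-by-row missingness
df %>% select(all_of(money_vars)) %>% missing_pattern()  # counts of each missingness pattern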

Here, we see several patterns of missingness worth noting. 814 observations are missing two out of these four variables, and 49 observations are missing three. Of the observations that are missing HOME_VAL, 483 are also missing INCOME, 154 are also missing JOB, and 109 are also missing YOJ. Due to these patterns of related missingness, we will no longer consider imputing these variables in the main version of our dataset.

Let’s take a look at the distributions of the numeric variables.

The distribution for AGE is approximately normal. The distribution for YOJ is left-skewed. The distributions for TARGET_AMT, KIDSDRIV, HOMEKIDS, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ, MVR_PTS, and CAR_AGE are all right-skewed. 75% of observations for TARGET_AMT are at or below $5,787.00, but the maximum value recorded is $107,586.14.

Let’s also take a look at the distributions of the categorical variables. First, we look at the distributions for categorical variables with only two levels.

Looking at PARENT1 and REVOKED, we can see that single parents represent relatively few observations in the dataset, as do people whose licenses were revoked in the past seven years. MSTATUS and SEX are the most evenly split categorical variables with two levels in the dataset.

Next we look at the distributions for the categorical variables with more than two levels.

The most common profession represented in the observations is blue collar, and the most commonly represented cars are the SUV and the minivan. The numbers of observations with high school diplomas and bachelor's degrees are fairly similar. Having less or more education is less common.

Data Preparation

First, we rename and relevel the inconsistently named and leveled factor variables we noted earlier. A summary of only the factors we changed the levels for is below, with the first level in each list always being the reference level. For variables which have “Yes” and “No” values, we will replace these with 1/0 (1 = “Yes”, 0 = “No”).

Factor New Levels
CAR_TYPE Minivan, Panel Truck, Pickup, Sports Car, SUV, Van
EDUCATION <High School, High School, Bachelors, Masters, PhD
JOB Blue Collar, Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student
PARENT1 0, 1
MSTATUS 0, 1
RED_CAR 0, 1
REVOKED 0, 1
SEX Male, Female
URBANICITY 0, 1
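
A minimal sketch of the renaming and releveling for two of these factors, assuming the data frame is named df and using forcats (part of the tidyverse):

library(dplyr)
library(forcats)

df <- df %>%
  mutate(
    EDUCATION = EDUCATION %>%
      fct_recode("High School" = "z_High School") %>%
      fct_relevel("<High School", "High School", "Bachelors", "Masters", "PhD"),
    MSTATUS = MSTATUS %>%
      fct_recode("1" = "Yes", "0" = "z_No") %>%
      fct_relevel("0", "1")
  )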

We reduce the scale of the INCOME and HOME_VAL variables to thousands of dollars so the figures will be more readable when visualized. The replacement variables are INCOME_THOU and HOME_VAL_THOU.

Some observations list Student as their occupation as well as a value for YOJ. We recode these values as NA. The most likely interpretation is that people incorrectly listed how many years they’ve been in school here, which will not be useful to our analysis.

Based on the descriptions of some of the variables and their theoretical effects on the target variables, and to handle the variables with missing data that we chose not to impute (including those whose zero or clearly incorrect values we replaced with NA), we create several factors that we believe will be helpful when building models (a sketch of their construction follows the list):

  • HOME_VAL_CAT (Levels based on HOME_VAL_THOU = “<=250K”, “251-500K”, “501-750K”, “751K+”, “”)

  • HOMEOWNER (1 = HOME_VAL_THOU not NA)

  • INCOME_CAT (Levels based on INCOME = “<=50K”, “51-100K”, “101-150K”, “151K+”, “”)

  • INCOME_FLAG (1 = INCOME_THOU not NA)

  • KIDSDRIV_FLAG (1 = KIDSDRIV number of children > 0)

  • HOMEKIDS_FLAG (1 = HOMEKIDS number of children > 0)

  • EMPLOYED (1 = JOB not NA/Student/Home Maker)

  • CAR_AGE_CAT (Levels based on CAR_AGE = “<=4”, “5-8”, “9-12”, “13+”, “”)

  • WHITE_COLLAR (1 = JOB not NA/Student/Home Maker/Blue Collar)
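
A minimal sketch of a few of these derived variables, assuming the data frame is named df and that INCOME_THOU and HOME_VAL_THOU already exist; the empty-string level used for missing categories is omitted here for brevity:

library(dplyr)

df <- df %>%
  mutate(
    HOMEOWNER     = factor(if_else(!is.na(HOME_VAL_THOU), 1, 0)),
    INCOME_FLAG   = factor(if_else(!is.na(INCOME_THOU), 1, 0)),
    KIDSDRIV_FLAG = factor(if_else(KIDSDRIV > 0, 1, 0)),
    HOMEKIDS_FLAG = factor(if_else(HOMEKIDS > 0, 1, 0)),
    EMPLOYED      = factor(if_else(!is.na(JOB) & !JOB %in% c("Student", "Home Maker"), 1, 0)),
    WHITE_COLLAR  = factor(if_else(EMPLOYED == 1 & JOB != "Blue Collar", 1, 0)),
    INCOME_CAT    = cut(INCOME_THOU, breaks = c(-Inf, 50, 100, 150, Inf),
                        labels = c("<=50K", "51-100K", "101-150K", "151K+"),
                        ordered_result = TRUE)
  )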

We then split both the main version of our dataset and the alternate version we created earlier into train and test sets. The main version will have all the derived variables we just created, imputed values for the AGE variable, and any transformations we make. The alternate version will not include any derived variables or transformations, but it will include imputed values for all variables with missing values.

We impute missing data in the main train and test sets for one numeric variable, AGE, using the mean value since it is normally distributed.
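
A minimal sketch of this imputation, assuming the main train and test sets are named train_df and test_df and that the training mean is used for both sets (the report may have imputed each set separately):

age_mean <- mean(train_df$AGE, na.rm = TRUE)

train_df$AGE[is.na(train_df$AGE)] <- age_mean
test_df$AGE[is.na(test_df$AGE)]   <- age_mean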

We take a look at the distributions for our imputed variable to see if the distributions of this variable in the train and test sets differ from what we originally observed or between sets.

The distributions in the train and test sets for AGE are similar to each other and to its original distribution.

We impute missing data in the alternate train and test sets for all variables with missing values using the mice package.
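
A minimal sketch of the mice imputation for the alternate training set (the same steps apply to the alternate test set); the object names and mice settings here are assumptions:

library(mice)

imp <- mice(alt_train_df, m = 5, maxit = 5, seed = 123, printFlag = FALSE)
alt_train_df_imputed <- complete(imp, 1)   # keep a single completed dataset for modeling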

We confirm there are no longer any missing values in the alternate train or test datasets.

## [1] TRUE

We take a look at the distributions for the imputed numeric variables to see if their distributions in the alternate train and test sets differ from what we originally observed or between sets.

The distributions for the imputed numeric variables don’t differ between the alternate train and test sets or from what we originally observed.

We also perform the same check for the single categorical variable we imputed in the alternate train and test sets: JOB.

The distributions in the alternate train and test sets for the single imputed categorical variable, JOB, are similar to each other, and the rankings of most frequent to least frequent occupation here are similar to the rankings of the original distribution. We note that the “Professional” and “Manager” occupations are more tied in the rankings here than they were in the original distribution, however.

Since the distributions of some of our numeric variables are skewed, we transform the data for some of them. In the main dataset, we exclude any numeric variables with missing values that we decided not to impute and for which we have already created factors, as well as the response variable TARGET_AMT. We also use the alternate dataset, which as a reminder has no missing values, as the basis for a third version of the data, in which every skewed numeric predictor and the response variable TARGET_AMT have all been transformed.

Below is a breakdown of the variables, the ideal lambdas proposed by Box-Cox, and the reasonable alternative transformations we have chosen to make in the main dataset (a sketch of the lambda estimation follows the table):

Skewed Variable Ideal Lambda Proposed by Box-Cox Reasonable Alternative Transformation
TRAVTIME 0.7 no transformation
BLUEBOOK 0.45 square root
TIF 0.25 log
OLDCLAIM -0.1 log
CLM_FREQ -0.2 log
MVR_PTS 0.05 log
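
A minimal sketch of how a lambda could be estimated for one predictor, assuming the training data is named train_df and that zero-containing variables were shifted by a small constant before estimation:

library(MASS)

bc <- boxcox(lm(TIF ~ 1, data = train_df), lambda = seq(-2, 2, 0.05), plotit = FALSE)
bc$x[which.max(bc$y)]            # lambda maximizing the profile log-likelihood

bc0 <- boxcox(lm(MVR_PTS + 1 ~ 1, data = train_df), lambda = seq(-2, 2, 0.05), plotit = FALSE)
bc0$x[which.max(bc0$y)]          # MVR_PTS contains zeros, so it is shifted by 1 here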

We check whether the distributions of the transformed variables now differ between the train and test sets.

They do not.

Below is a breakdown of the variables, the ideal lambdas proposed by Box-Cox, and the reasonable alternative transformations we have chosen to make in the third version of the data.

Skewed Variable Ideal Lambda Proposed by Box-Cox Reasonable Alternative Transformation
TARGET_AMT -0.2 log
YOJ 0.65 no transformation
TRAVTIME 0.7 no transformation
KIDSDRIV -1.15 inverse
HOMEKIDS -0.25 log
BLUEBOOK 0.45 square root
TIF 0.25 log
OLDCLAIM -0.1 log
CLM_FREQ -0.2 log
MVR_PTS 0.05 log
INCOME_THOU 0.45 square root
HOME_VAL_THOU 0.2 log
CAR_AGE 0.55 square root

Build Models

Binary Logistic Regression Models

Model BLR:1 - Full Model Using Original, Untransformed Variables, with All Missing Values Imputed - Reduced via Stepwise AIC Model Selection

We create Model BLR:1, our baseline binary logistic regression model based on all the original, untransformed variables, with all missing values imputed so that no observations or predictors have to be excluded from the model. Then we perform stepwise model selection to select the model with the smallest AIC value using the stepAIC() function from the MASS package.
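
A minimal sketch of this model, assuming the fully imputed training data is named alt_train_df_imputed and the resulting model object is named blr1:

library(MASS)

blr1_full <- glm(TARGET_FLAG ~ . - TARGET_AMT, family = "binomial",
                 data = alt_train_df_imputed)
blr1 <- stepAIC(blr1_full, direction = "both", trace = FALSE)
summary(blr1)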

A summary of Model BLR:1 is below:

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + PARENT1 + MSTATUS + 
##     EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + 
##     OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY + 
##     INCOME_THOU + HOME_VAL_THOU, family = "binomial", data = alt_train_df_imputed)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.650e+00  2.392e-01 -11.081  < 2e-16 ***
## KIDSDRIV              4.158e-01  7.151e-02   5.814 6.09e-09 ***
## HOMEKIDS              5.995e-02  4.059e-02   1.477 0.139653    
## PARENT11              3.424e-01  1.302e-01   2.630 0.008531 ** 
## MSTATUS1             -5.248e-01  1.019e-01  -5.149 2.62e-07 ***
## EDUCATIONHigh School -2.886e-01  1.383e-01  -2.086 0.036951 *  
## EDUCATIONBachelors   -3.325e-02  1.132e-01  -0.294 0.768865    
## EDUCATIONMasters     -2.060e-01  2.056e-01  -1.002 0.316383    
## EDUCATIONPhD          9.573e-02  2.454e-01   0.390 0.696413    
## JOBClerical           1.265e-01  1.265e-01   1.000 0.317240    
## JOBDoctor            -7.704e-01  2.858e-01  -2.696 0.007022 ** 
## JOBHome Maker        -8.142e-03  1.669e-01  -0.049 0.961080    
## JOBLawyer            -7.256e-02  1.984e-01  -0.366 0.714507    
## JOBManager           -8.184e-01  1.538e-01  -5.320 1.04e-07 ***
## JOBProfessional      -2.059e-01  1.383e-01  -1.489 0.136444    
## JOBStudent           -2.063e-01  1.474e-01  -1.400 0.161584    
## TRAVTIME              1.636e-02  2.248e-03   7.277 3.42e-13 ***
## CAR_USEPrivate       -7.800e-01  1.049e-01  -7.435 1.05e-13 ***
## BLUEBOOK             -1.894e-05  5.645e-06  -3.356 0.000791 ***
## TIF                  -5.528e-02  8.941e-03  -6.182 6.31e-10 ***
## CAR_TYPEPanel Truck   5.569e-01  1.803e-01   3.089 0.002009 ** 
## CAR_TYPEPickup        5.279e-01  1.201e-01   4.397 1.10e-05 ***
## CAR_TYPESports Car    9.627e-01  1.278e-01   7.535 4.87e-14 ***
## CAR_TYPESUV           6.849e-01  1.033e-01   6.628 3.41e-11 ***
## CAR_TYPEVan           6.620e-01  1.449e-01   4.568 4.93e-06 ***
## OLDCLAIM             -1.496e-05  4.689e-06  -3.190 0.001425 ** 
## CLM_FREQ              1.833e-01  3.384e-02   5.417 6.06e-08 ***
## REVOKED1              9.225e-01  1.105e-01   8.347  < 2e-16 ***
## MVR_PTS               1.312e-01  1.627e-02   8.066 7.24e-16 ***
## CAR_AGE              -1.420e-02  9.019e-03  -1.574 0.115487    
## URBANICITY1           2.385e+00  1.336e-01  17.849  < 2e-16 ***
## INCOME_THOU          -4.250e-03  1.335e-03  -3.184 0.001452 ** 
## HOME_VAL_THOU        -1.392e-03  4.149e-04  -3.354 0.000797 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6640.0  on 5736  degrees of freedom
## Residual deviance: 5107.1  on 5704  degrees of freedom
## AIC: 5173.1
## 
## Number of Fisher Scoring iterations: 5

The AIC of Model BLR:1 is 5173.1.

Feature Odds Ratio (Exponentiated Coefficient) Percentage Change in Odds of Car Crash
URBANICITY1 10.8643928 986.4
CAR_TYPESports Car 2.6188460 161.9
REVOKED1 2.5155223 151.6
CAR_TYPESUV 1.9835867 98.4
CAR_TYPEVan 1.9386614 93.9
CAR_TYPEPanel Truck 1.7452765 74.5
CAR_TYPEPickup 1.6953038 69.5
KIDSDRIV 1.5155229 51.6
PARENT11 1.4083716 40.8
CLM_FREQ 1.2011971 20.1
MVR_PTS 1.1402160 14.0
JOBClerical 1.1348452 13.5
EDUCATIONPhD 1.1004642 10.0
HOMEKIDS 1.0617818 6.2
TRAVTIME 1.0164960 1.6
BLUEBOOK 0.9999811 0.0
OLDCLAIM 0.9999850 0.0
HOME_VAL_THOU 0.9986093 -0.1
INCOME_THOU 0.9957593 -0.4
JOBHome Maker 0.9918908 -0.8
CAR_AGE 0.9859046 -1.4
EDUCATIONBachelors 0.9672964 -3.3
TIF 0.9462246 -5.4
JOBLawyer 0.9300077 -7.0
EDUCATIONMasters 0.8137984 -18.6
JOBProfessional 0.8139173 -18.6
JOBStudent 0.8135512 -18.6
EDUCATIONHigh School 0.7493022 -25.1
MSTATUS1 0.5916818 -40.8
JOBDoctor 0.4628116 -53.7
CAR_USEPrivate 0.4583885 -54.2
JOBManager 0.4411460 -55.9
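
The odds-ratio table above can be assembled from the fitted model along these lines; a hedged sketch, assuming the model object is named blr1:

library(dplyr)

tibble(Feature = names(coef(blr1))[-1],
       Odds_Ratio = exp(coef(blr1))[-1]) %>%                        # drop the intercept
  mutate(Pct_Change_in_Odds = round((Odds_Ratio - 1) * 100, 1)) %>%
  arrange(desc(Odds_Ratio))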

The coefficients for Model BLR:1 mostly match expectations. Using your car privately is one of the biggest reducers of the odds of a car crash. While we expected more educated people to drive more safely, having a high school education is the level that reduces the odds of a car crash the most. All non-blue collar jobs reduce the odds of a car crash, with doctor and manager seeing the largest reductions. The biggest increaser of the odds of a car crash is living/working in an urban area. Some other notable increasers are driving anything other than a minivan, especially a sports car; having had your license revoked; and having teenagers driving your car.

We check for possible multicollinearity within this model.

##                    GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV       1.306531  1        1.143036
## HOMEKIDS       1.830130  1        1.352823
## PARENT1        1.899743  1        1.378312
## MSTATUS        2.139221  1        1.462608
## EDUCATION      9.505786  4        1.325100
## JOB           12.372591  7        1.196831
## TRAVTIME       1.041548  1        1.020562
## CAR_USE        2.229513  1        1.493155
## BLUEBOOK       1.756755  1        1.325426
## TIF            1.010771  1        1.005371
## CAR_TYPE       2.573303  5        1.099130
## OLDCLAIM       1.665731  1        1.290632
## CLM_FREQ       1.459245  1        1.207992
## REVOKED        1.339087  1        1.157189
## MVR_PTS        1.150930  1        1.072814
## CAR_AGE        2.140672  1        1.463104
## URBANICITY     1.142613  1        1.068931
## INCOME_THOU    2.747964  1        1.657698
## HOME_VAL_THOU  2.009943  1        1.417725

The high GVIF values for EDUCATION and JOB are largely an artifact of their larger degrees of freedom. For such variables, the adjusted metric GVIF^(1/(2*Df)) is the appropriate one to examine, and its square is typically considered acceptable if it is below five, the usual VIF threshold. By that standard, we don't need to remove either EDUCATION or JOB.
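
A minimal sketch of this check, assuming the model object is named blr1:

library(car)

vif(blr1)   # reports GVIF, Df, and GVIF^(1/(2*Df)) when the model contains factors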

Model BLR:2 - Select Model Using Original & Derived, but Untransformed Variables, with Only AGE Values Imputed - Reduced via Stepwise AIC Model Selection

We create Model BLR:2, a second binary logistic regression model based on one combination of variables we believe could be the best predictors of TARGET_FLAG, including some original variables and some variables we derived from other variables, but no transformed variables. The only value we’ve imputed for this model is AGE.

A summary of Model BLR:2 is below:

## 
## Call:
## glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ + HOMEOWNER + INCOME_FLAG + 
##     EMPLOYED + WHITE_COLLAR + MSTATUS + PARENT1 + REVOKED + SEX + 
##     TRAVTIME, family = "binomial", data = train_df_imputed)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.593521   0.224284  -2.646 0.008138 ** 
## AGE           -0.014439   0.003959  -3.647 0.000266 ***
## CLM_FREQ       0.374499   0.025589  14.635  < 2e-16 ***
## HOMEOWNER1    -0.262771   0.081438  -3.227 0.001253 ** 
## INCOME_FLAG1  -0.478040   0.092563  -5.164 2.41e-07 ***
## EMPLOYED1      0.473987   0.107889   4.393 1.12e-05 ***
## WHITE_COLLAR1 -0.644178   0.075688  -8.511  < 2e-16 ***
## MSTATUS1      -0.241994   0.084976  -2.848 0.004402 ** 
## PARENT11       0.487306   0.103513   4.708 2.51e-06 ***
## REVOKED1       0.904666   0.087874  10.295  < 2e-16 ***
## SEXFemale      0.110879   0.065567   1.691 0.090822 .  
## TRAVTIME       0.008946   0.001991   4.494 6.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6640  on 5736  degrees of freedom
## Residual deviance: 5993  on 5725  degrees of freedom
## AIC: 6017
## 
## Number of Fisher Scoring iterations: 4

The AIC of Model BLR:2 is 6017.

Feature Odds Ratio (Exponentiated Coefficient) Percentage Change in Odds of Car Crash
REVOKED1 2.4711065 147.1
PARENT11 1.6279239 62.8
EMPLOYED1 1.6063864 60.6
CLM_FREQ 1.4542633 45.4
SEXFemale 1.1172602 11.7
TRAVTIME 1.0089865 0.9
AGE 0.9856651 -1.4
MSTATUS1 0.7850611 -21.5
HOMEOWNER1 0.7689178 -23.1
INCOME_FLAG1 0.6199974 -38.0
WHITE_COLLAR1 0.5250941 -47.5

In Model BLR:2, the largest reducer of the odds of being in a car crash is working a white collar job, and the largest odds increaser is having your license revoked. Being employed at all, i.e. having any job other than student or homemaker, strangely increases the odds. Since we understand the effects of the WHITE_COLLAR factor better than we understand the effects of the EMPLOYED factor, and the two describe overlapping information, we favor the WHITE_COLLAR factor here and remove the EMPLOYED factor. We don't reprint a summary, but the new AIC is 6034.6. As noted earlier, we cannot explain the association between being a single parent and increased car crash odds, but it is worth noting that it is the second largest odds increaser in this subset of predictors. Lastly, being a woman also slightly increases the odds of a car crash, contrary to our prior expectations.

We check for possible multicollinearity within this model.

##          AGE     CLM_FREQ    HOMEOWNER  INCOME_FLAG WHITE_COLLAR      MSTATUS 
##     1.152491     1.003834     1.456116     1.036460     1.059972     1.732481 
##      PARENT1      REVOKED          SEX     TRAVTIME 
##     1.485383     1.002159     1.043544     1.005053

All of the variance inflation factors are less than five, so there are no issues of multicollinearity within this model.

Model BLR:3 - Select Model Using Original, Derived, & Transformed Variables, with Only AGE Values Imputed - Reduced via Stepwise AIC Model Selection

We create Model BLR:3, a third binary logistic regression model based on another combination of variables we believe could be the best predictors of TARGET_FLAG, including some original variables, some variables we derived from other variables, and some variables we transformed. The only value we’ve imputed for this model is AGE.

##  [1] "AGE"           "CLM_FREQ_LOG"  "URBANICITY"    "MVR_PTS_LOG"  
##  [5] "OLDCLAIM_LOG"  "PARENT1"       "REVOKED"       "CAR_USE"      
##  [9] "CAR_TYPE"      "MSTATUS"       "EDUCATION"     "KIDSDRIV_FLAG"
## [13] "INCOME_CAT"    "EMPLOYED"      "HOMEOWNER"     "WHITE_COLLAR"

In choosing some of these variables, we excluded others for which collinearity might be an issue. That is, our factor describing income was chosen over the home value factor, the kids driving factor was chosen over the kids at home factor, and the education factor was chosen over the job factor.

A summary of Model BLR:3 is below:

## 
## Call:
## glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ_LOG + URBANICITY + 
##     MVR_PTS_LOG + OLDCLAIM_LOG + PARENT1 + REVOKED + CAR_USE + 
##     CAR_TYPE + MSTATUS + EDUCATION + KIDSDRIV_FLAG + INCOME_CAT + 
##     EMPLOYED + HOMEOWNER + WHITE_COLLAR, family = "binomial", 
##     data = train_df_trans)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.923096   0.381740  -2.418 0.015601 *  
## AGE                  -0.008696   0.004280  -2.032 0.042169 *  
## CLM_FREQ_LOG          0.248283   0.074101   3.351 0.000806 ***
## URBANICITY1           2.089626   0.130303  16.037  < 2e-16 ***
## MVR_PTS_LOG           0.061683   0.009240   6.676 2.46e-11 ***
## OLDCLAIM_LOG         -0.088658   0.035745  -2.480 0.013129 *  
## PARENT11              0.315903   0.116011   2.723 0.006468 ** 
## REVOKED1              0.857847   0.099265   8.642  < 2e-16 ***
## CAR_USEPrivate       -0.736742   0.104881  -7.025 2.15e-12 ***
## CAR_TYPEPanel Truck   0.151060   0.159413   0.948 0.343335    
## CAR_TYPEPickup        0.525694   0.116545   4.511 6.46e-06 ***
## CAR_TYPESports Car    1.015507   0.123022   8.255  < 2e-16 ***
## CAR_TYPESUV           0.756766   0.099151   7.632 2.30e-14 ***
## CAR_TYPEVan           0.496401   0.139351   3.562 0.000368 ***
## MSTATUS1             -0.497166   0.092775  -5.359 8.38e-08 ***
## EDUCATIONHigh School -0.191783   0.109515  -1.751 0.079912 .  
## EDUCATIONBachelors   -0.718931   0.122381  -5.875 4.24e-09 ***
## EDUCATIONMasters     -0.786428   0.137006  -5.740 9.46e-09 ***
## EDUCATIONPhD         -1.095949   0.167519  -6.542 6.06e-11 ***
## KIDSDRIV_FLAG1        0.720830   0.101306   7.115 1.12e-12 ***
## INCOME_CAT.L          0.092508   0.076456   1.210 0.226299    
## INCOME_CAT.Q          0.368222   0.088759   4.149 3.35e-05 ***
## INCOME_CAT.C          0.233418   0.086987   2.683 0.007289 ** 
## EMPLOYED1             0.169179   0.126425   1.338 0.180839    
## HOMEOWNER1           -0.252892   0.086945  -2.909 0.003630 ** 
## WHITE_COLLAR1        -0.119849   0.105581  -1.135 0.256317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6640.0  on 5736  degrees of freedom
## Residual deviance: 5330.8  on 5711  degrees of freedom
## AIC: 5382.8
## 
## Number of Fisher Scoring iterations: 5

We remove the least statistically significant variable, WHITE_COLLAR, check the new summary, remove the only remaining statistically insignificant variable, EMPLOYED, and reprint only the final summary. We're slightly surprised these variables were significant in the previous model but not in this one; however, that could be because the INCOME_CAT factor supersedes both in this model.

## 
## Call:
## glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ_LOG + URBANICITY + 
##     MVR_PTS_LOG + OLDCLAIM_LOG + PARENT1 + REVOKED + CAR_USE + 
##     CAR_TYPE + MSTATUS + EDUCATION + KIDSDRIV_FLAG + INCOME_CAT + 
##     HOMEOWNER, family = "binomial", data = train_df_trans)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.808918   0.372886  -2.169 0.030057 *  
## AGE                  -0.008346   0.004261  -1.959 0.050154 .  
## CLM_FREQ_LOG          0.248601   0.074047   3.357 0.000787 ***
## URBANICITY1           2.094943   0.130333  16.074  < 2e-16 ***
## MVR_PTS_LOG           0.061745   0.009238   6.684 2.33e-11 ***
## OLDCLAIM_LOG         -0.088956   0.035719  -2.490 0.012758 *  
## PARENT11              0.310786   0.115917   2.681 0.007338 ** 
## REVOKED1              0.856583   0.099228   8.632  < 2e-16 ***
## CAR_USEPrivate       -0.787060   0.085768  -9.177  < 2e-16 ***
## CAR_TYPEPanel Truck   0.097836   0.154648   0.633 0.526972    
## CAR_TYPEPickup        0.498138   0.114647   4.345 1.39e-05 ***
## CAR_TYPESports Car    1.012436   0.122911   8.237  < 2e-16 ***
## CAR_TYPESUV           0.755746   0.099107   7.626 2.43e-14 ***
## CAR_TYPEVan           0.473068   0.138070   3.426 0.000612 ***
## MSTATUS1             -0.511067   0.091593  -5.580 2.41e-08 ***
## EDUCATIONHigh School -0.207637   0.107779  -1.927 0.054040 .  
## EDUCATIONBachelors   -0.746540   0.117738  -6.341 2.29e-10 ***
## EDUCATIONMasters     -0.846365   0.129320  -6.545 5.96e-11 ***
## EDUCATIONPhD         -1.149027   0.162915  -7.053 1.75e-12 ***
## KIDSDRIV_FLAG1        0.726144   0.101196   7.176 7.20e-13 ***
## INCOME_CAT.L          0.075360   0.075063   1.004 0.315402    
## INCOME_CAT.Q          0.343078   0.086893   3.948 7.87e-05 ***
## INCOME_CAT.C          0.234976   0.086869   2.705 0.006831 ** 
## HOMEOWNER1           -0.230535   0.083544  -2.759 0.005790 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6640.0  on 5736  degrees of freedom
## Residual deviance: 5332.9  on 5713  degrees of freedom
## AIC: 5380.9
## 
## Number of Fisher Scoring iterations: 5

The AIC of Model BLR:3 is 5380.9.

Feature Odds Ratio (Exponentiated Coefficient) Percentage Change in Odds of Car Crash
URBANICITY1 8.1249789 712.5
CAR_TYPESports Car 2.7522987 175.2
REVOKED1 2.3551001 135.5
CAR_TYPESUV 2.1292001 112.9
KIDSDRIV_FLAG1 2.0670935 106.7
CAR_TYPEPickup 1.6456545 64.6
CAR_TYPEVan 1.6049102 60.5
INCOME_CAT.Q 1.4092787 40.9
PARENT11 1.3644972 36.4
CLM_FREQ_LOG 1.2822305 28.2
INCOME_CAT.C 1.2648788 26.5
CAR_TYPEPanel Truck 1.1027816 10.3
INCOME_CAT.L 1.0782723 7.8
MVR_PTS_LOG 1.0636907 6.4
AGE 0.9916890 -0.8
OLDCLAIM_LOG 0.9148860 -8.5
EDUCATIONHigh School 0.8125019 -18.7
HOMEOWNER1 0.7941084 -20.6
MSTATUS1 0.5998551 -40.0
EDUCATIONBachelors 0.4740035 -52.6
CAR_USEPrivate 0.4551809 -54.5
EDUCATIONMasters 0.4289714 -57.1
EDUCATIONPhD 0.3169449 -68.3

Interestingly, in Model BLR:3, education levels do reduce the odds of a car crash in the order expected. That is, having a PhD decreases the odds more than a Master's, having a Master's decreases the odds more than a Bachelor's, and having a Bachelor's decreases the odds more than having a High School Diploma. Otherwise, coefficients follow similar patterns to what we discussed with the first model. Private car use is one of the biggest car crash odds reducers; the biggest increaser of the odds of a car crash is living/working in an urban area; and driving anything other than a minivan, having had your license revoked, and having teenagers driving your car are all big odds increasers as well. The INCOME_CAT factor has the opposite effect from what we were expecting. Perhaps higher income categories are associated with higher car crash odds because incomes are usually higher in urban areas, and urban areas are strongly associated with higher car crash odds.

We check for possible multicollinearity within this model.

##                    GVIF Df GVIF^(1/(2*Df))
## AGE            1.192534  1        1.092032
## CLM_FREQ_LOG  68.540964  1        8.278947
## URBANICITY     1.102783  1        1.050135
## MVR_PTS_LOG    1.104418  1        1.050913
## OLDCLAIM_LOG  68.607176  1        8.282945
## PARENT1        1.583420  1        1.258340
## REVOKED        1.129558  1        1.062807
## CAR_USE        1.567173  1        1.251868
## CAR_TYPE       1.599229  5        1.048072
## MSTATUS        1.815128  1        1.347267
## EDUCATION      1.782365  4        1.074916
## KIDSDRIV_FLAG  1.106473  1        1.051890
## INCOME_CAT     1.533759  3        1.073889
## HOMEOWNER      1.437353  1        1.198897

OLDCLAIM_LOG and CLM_FREQ_LOG have variance inflation factors greater than five. Since we believe claim frequency has more to do with TARGET_FLAG, and past claim amounts have more to do with TARGET_AMT, we choose to remove OLDCLAIM_LOG from this model. We don’t reprint a summary, but the new AIC is 5385.1, and none of the variables have variance inflation factors greater than five any longer.

Multiple Linear Regression Models

Model MLR:1 - Full Model Using Original, Untransformed Variables, with All Missing Values Imputed - Reduced via Stepwise Backward Model Selection

We create Model MLR:1, our baseline multiple linear regression model based on all the original, untransformed variables, with all missing values imputed so that no observations or predictors have to be excluded from the model. Then we perform stepwise backward model selection.
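
A minimal sketch of this model, assuming the fully imputed training data is named alt_train_df_imputed and the resulting model object is named mlr1:

library(dplyr)

mlr1_full <- lm(TARGET_AMT ~ .,
                data = alt_train_df_imputed %>% filter(TARGET_FLAG == 1) %>% select(-TARGET_FLAG))
mlr1 <- step(mlr1_full, direction = "backward", trace = FALSE)
summary(mlr1)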

A summary of Model MLR:1 is below:

## 
## Call:
## lm(formula = TARGET_AMT ~ MSTATUS + SEX + BLUEBOOK + RED_CAR + 
##     REVOKED + MVR_PTS, data = select(filter(alt_train_df_imputed, 
##     TARGET_FLAG == 1), -TARGET_FLAG))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8227  -3166  -1587    411 100814 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3861.32677  554.66624   6.962  5.0e-12 ***
## MSTATUS1    -598.72182  414.43071  -1.445   0.1488    
## SEXFemale   1437.88719  570.49291   2.520   0.0118 *  
## BLUEBOOK       0.11586    0.02558   4.529  6.4e-06 ***
## RED_CAR1    -919.61610  626.13920  -1.469   0.1421    
## REVOKED1    -971.02786  511.74890  -1.897   0.0580 .  
## MVR_PTS      127.62768   80.12720   1.593   0.1114    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8073 on 1516 degrees of freedom
## Multiple R-squared:  0.02348,    Adjusted R-squared:  0.01962 
## F-statistic: 6.076 on 6 and 1516 DF,  p-value: 2.67e-06

Only a small number of variables remain, and they explain very little variance in our data. Some of them are not statistically significant by normal standards. We will leave them in and see whether they provide predictive power.

We check for multicollinearity within this model.

##  MSTATUS      SEX BLUEBOOK  RED_CAR  REVOKED  MVR_PTS 
## 1.002852 1.873415 1.008312 1.872006 1.006370 1.004409

None of the variance inflation factors are greater than five, so there are no multicollinearity issues to address for this model.

Model MLR:2 - Full Model Using Original and Transformed Variables, with All Missing Values Imputed - Reduced via Stepwise Backward Model Selection

We create Model MLR:2, a second multiple linear regression model using original and transformed variables (including the response variable), with all missing values imputed so that no observations or predictors have to be excluded from the model. Then we perform stepwise backward model selection.

A summary of Model MLR:2 is below:

## 
## Call:
## lm(formula = TARGET_AMT_LOG ~ MSTATUS + SEX + BLUEBOOK_SQRT, 
##     data = select(filter(alt_train_df_trans, TARGET_FLAG == 1), 
##         -TARGET_FLAG))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0399 -0.4154  0.0412  0.4118  3.2477 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.9364878  0.0761413 104.234  < 2e-16 ***
## MSTATUS1      -0.0779400  0.0412953  -1.887   0.0593 .  
## SEXFemale      0.0967186  0.0416973   2.320   0.0205 *  
## BLUEBOOK_SQRT  0.0028464  0.0006045   4.709 2.72e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8054 on 1519 degrees of freedom
## Multiple R-squared:  0.02127,    Adjusted R-squared:  0.01934 
## F-statistic: 11.01 on 3 and 1519 DF,  p-value: 3.777e-07

Again, only a small number of predictors remain, and they explain very little variance in our data.

We check for multicollinearity within this model.

##       MSTATUS           SEX BLUEBOOK_SQRT 
##      1.000487      1.005604      1.005612

There are no variance inflation factors greater than five, so there are no issues of multicollinearity to address.

Model MLR:3 - Robust Model Using Select Original and Transformed Variables

We create Model MLR:3, a robust model designed to deal with outliers using select original and transformed variables (including the response variable). The predictors were chosen from among those retained in the previous two models, as stepwise backward selection is not possible with a robust model, and the full robust model’s residual standard error was higher than that of this reduced model.

A summary of Model MLR:3 is below:

## 
## Call: rlm(formula = TARGET_AMT_LOG ~ MSTATUS + SEX + BLUEBOOK_SQRT + 
##     RED_CAR + REVOKED + MVR_PTS_LOG, data = select(filter(alt_train_df_trans, 
##     TARGET_FLAG == 1), -TARGET_FLAG))
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4.10955 -0.40365  0.03963  0.40579  3.26592 
## 
## Coefficients:
##               Value    Std. Error t value 
## (Intercept)     8.0765   0.0643   125.6189
## MSTATUS1       -0.0546   0.0343    -1.5935
## SEXFemale       0.0939   0.0472     1.9900
## BLUEBOOK_SQRT   0.0017   0.0005     3.3032
## RED_CAR1       -0.0269   0.0517    -0.5196
## REVOKED1        0.0043   0.0423     0.1020
## MVR_PTS_LOG     0.0042   0.0045     0.9331
## 
## Residual standard error: 0.6011 on 1516 degrees of freedom

Select Models

Binary Logistic Regression Models

To choose our binary logistic regression model, we consider that false positives would likely result in the company charging too high a premium for those customers, and false negatives would likely result in the company charging too low a premium for those customers. Therefore, the effects of those inaccurate predictions could be equally costly. False positive customers might jump to competitors offering them lower rates (perhaps because those competitors more accurately identified them as lower risk), and false negative customers might cost the company more in unanticipated claim costs. So we will rely primarily on the F1 Score, which incorporates both precision and recall to accurately classify positives while minimizing false positives and false negatives, to select the best model. However, we will look at other metrics and goodness of fit checks as well.

To first check for goodness of fit, we create marginal model plots for the response and each predictor in each binary logistic regression model. (Note that the mmps function from the car package, which we use to generate these plots, intentionally skips any factors and interaction terms within the models.)
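
A minimal sketch of generating these plots for one model, assuming the model object is named blr1:

library(car)

mmps(blr1)   # factors and interaction terms are skipped automatically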

The marginal model plots for Model BLR:1 reveal some small fit issues, mostly with TRAVTIME.

The marginal model plots for Model BLR:2 reveal more fit issues than Model BLR:1 had, but Model BLR:2 relies on fewer numeric variables than Model BLR:1, and these plots cannot visualize factors.

The marginal model plots for Model BLR:3 reveal one fit issue, for AGE. This model also relies on a lot of factors, which the marginal model plots cannot visualize.

We calculate the Hosmer-Lemeshow statistic for each model to further check for lack of fit.
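
A hedged sketch of the test for one model, assuming the test comes from the ResourceSelection package and that the model and data objects are named blr1 and alt_train_df_imputed:

library(ResourceSelection)

obs <- as.numeric(as.character(alt_train_df_imputed$TARGET_FLAG))   # factor "0"/"1" to numeric
hoslem.test(obs, fitted(blr1), g = 10)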

Model HL Statistic DoF P Value
Model BLR:1 17.512 8 0.0251979
Model BLR:2 11.1989 8 0.1906817
Model BLR:3 22.34053 8 0.004322592

The low p-values for Models BLR:1 and BLR:3 suggest some lack of fit there. The moderate p-value for Model BLR:2 suggests no lack of fit there. This is not what we expected based on the incomplete pictures provided by looking at just the marginal models plots.

We produce ROC curves to visualize how each model performs on the training and test data.
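
A minimal sketch of the ROC curve and AUC for one model on the test data, assuming pROC is the ROC package used and that the model and data objects are named blr1 and alt_test_df_imputed:

library(pROC)

test_probs <- predict(blr1, newdata = alt_test_df_imputed, type = "response")
roc_test   <- roc(alt_test_df_imputed$TARGET_FLAG, test_probs)
plot(roc_test)
auc(roc_test)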

Model BLR:1 has the highest AUC on both the training and the test data, although Model BLR:3 is not far behind.

We produce confusion matrices for all three models based on the training and test data.

We calculate performance metrics for all models on the training and test data.
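
A hedged sketch of the confusion matrix and metrics for one model on the test data, from which a table like the one below can be assembled; caret is assumed as the source of confusionMatrix, the object names are assumptions, and a 0.5 classification threshold is assumed:

library(caret)

test_probs <- predict(blr1, newdata = alt_test_df_imputed, type = "response")
test_class <- factor(ifelse(test_probs >= 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(data = test_class, reference = alt_test_df_imputed$TARGET_FLAG,
                positive = "1", mode = "everything")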

Metric BLR:1 (Train) BLR:1 (Test) BLR:2 (Train) BLR:2 (Test) BLR:3 (Train) BLR:3 (Test)
Sensitivity 0.443 0.421 0.196 0.181 0.366 0.371
Specificity 0.923 0.919 0.950 0.950 0.925 0.916
Pos Pred Value 0.675 0.645 0.587 0.559 0.639 0.609
Neg Pred Value 0.821 0.819 0.766 0.768 0.802 0.806
Precision 0.675 0.645 0.587 0.559 0.639 0.609
Recall 0.443 0.421 0.196 0.181 0.366 0.371
F1 0.535 0.509 0.294 0.273 0.466 0.462
Prevalence 0.265 0.260 0.265 0.260 0.265 0.260
Detection Rate 0.117 0.109 0.052 0.047 0.097 0.097
Detection Prevalence 0.174 0.170 0.089 0.084 0.152 0.158
Balanced Accuracy 0.683 0.670 0.573 0.565 0.646 0.644
Accuracy 0.796 0.789 0.750 0.750 0.777 0.775
Classification Error Rate 0.204 0.211 0.250 0.250 0.223 0.225
AUC 0.817 0.806 0.710 0.712 0.795 0.791

Model BLR:1 performs best on most metrics, whether on the training data or on the test data. It had the lowest AIC, and its AUC was slightly higher than Model BLR:3's, but most importantly, it balances Precision and Recall the best and therefore has the highest F1 Score. (Note that Model BLR:2's Recall is particularly low compared to the other models, while Precision varies less among them.) Since the F1 Score is our primary metric, we select Model BLR:1 as the final binary logistic regression model for making predictions on the evaluation data.

Multiple Linear Regression Models

To select our multiple linear regression model, we will rely primarily on predictive \(R^2\) and RMSE computed on the test data, but we will also check for goodness of fit.
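
For clarity, both metrics are computed on the held-out test set. The "traditional" form of \(R^2\) used here (via caret's R2 function in the appendix code) and the RMSE are

\[ R^2_{pred} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad RMSE = \sqrt{\tfrac{1}{n} \sum_i (y_i - \hat{y}_i)^2}, \]

where \(y_i\) are the observed test-set crash costs, \(\hat{y}_i\) the (back-transformed, where applicable) predictions, and \(\bar{y}\) the mean observed cost in the test set. A negative \(R^2_{pred}\) means the model predicts worse than simply using the mean.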

To check for goodness of fit, we primarily examine Residuals vs. Fitted Values and Q-Q Residuals plots for all three models.

Model MLR:1 does not fit the data very well. The residuals are not randomly and evenly scattered above and below 0 across the fitted values, the Q-Q plot deviates from the normal line at the right end, and several outliers are flagged.

Model MLR:2 fits the data better, if still not well. Its residuals are more randomly and evenly scattered above and below 0 across the fitted values, but the Q-Q plot deviates from the normal line at both ends.

Model MLR:3 fits the data neither clearly better nor clearly worse than Model MLR:2.

We calculate predictive \(R^2\) and RMSE using the test set for all three models. (In cases where the response variable was transformed, predictions are back-transformed.)
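
As a minimal sketch of that back-transformation for Model MLR:2, which models the log of TARGET_AMT (object names follow the appendix code):

# Predict on the log scale, then exponentiate back to dollars before scoring
test_pred <- exp(predict(model_mlr_2, test_x))
model_mlr_2_test_rmse <- RMSE(test_pred, test_y)   # test_y holds observed costs in dollars

Note that simply exponentiating log-scale predictions tends to understate the conditional mean cost (a retransformation bias that a smearing-type adjustment could address), which is a further reason to treat these estimates cautiously.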

Model Predictive \(R^2\) (Test) RMSE (Test)
Model MLR:1 0.0016 6643.62
Model MLR:2 -0.0572 6836.72
Model MLR:3 -0.0617 6851.11

The predictive power of all these models is low, as expected. Model MLR:1 is the only model that beats predicting with the mean, and it does not beat it by much. We also know that the OLS assumptions are violated in Model MLR:1, even though the other models have flaws as well. We therefore select it as the final multiple linear regression model for making predictions on the evaluation data, but we approach those predictions with caution. Models that clearly outperform the naive mean prediction and do not have statistical flaws should be investigated further; none of the alternative models we have attempted so far has fared better.

Predictions on Evaluation Data

We make predictions on the evaluation dataset, and we save the file with the predicted probabilities, classifications, and costs as “HW4_Eval_PredProbs_Flags_Amounts.csv.”

While we can’t know whether the TARGET_FLAG classifications in particular are accurate, we summarize them below so we can compare the percentage of observations related to car crashes in the original data to the percentage in the evaluation data.

In the original data, about 26.4% of observations were related to car crashes, and in the evaluation data, we’ve identified only 17.1% of observations as being related to car crashes. Again, this does not speak to accuracy.
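
For reference, that evaluation-set percentage can be read off a simple proportion table of the predicted flags (a sketch using the eval_df_w_preds object from the appendix code):

# Share of evaluation records classified as crash-related (TARGET_FLAG = 1)
round(100 * prop.table(table(eval_df_w_preds$TARGET_FLAG)), 1)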

Appendix: Report Code

Below is the code for this report to generate the models and charts above.

knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(DataExplorer)
library(knitr)
library(cowplot)
library(finalfit)
library(correlationfunnel)
library(ggcorrplot)
library(RColorBrewer)
library(naniar)
library(mice)
library(MASS)
select <- dplyr::select
library(car)
library(glmtoolbox)
library(pROC)
library(caret)
library(robustbase)

cur_theme <- theme_set(theme_classic())

my_url <- "https://raw.githubusercontent.com/andrewbowen19/businessAnalyticsDataMiningDATA621/main/data/insurance_training_data.csv"
main_df <- read.csv(my_url, na.strings = "")

classes <- as.data.frame(unlist(lapply(main_df, class))) |>
    rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
    group_by(Class) |>
    summarize(Count = n(),
              Variables = paste(sort(unique(Variable)),collapse=", "))
kable(classes_summary, format = "simple")

vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
main_df <- main_df |>
    mutate(across(all_of(vars), ~gsub("\\$|,", "", .) |> as.integer()))

main_df <- main_df |>
    select(-INDEX)
remove <- c("discrete_columns", "continuous_columns",
            "total_observations", "memory_usage")
completeness <- introduce(main_df) |>
    select(-all_of(remove))
knitr::kable(t(completeness), format = "simple")

p1 <- plot_missing(main_df, missing_only = TRUE,
                   ggtheme = theme_classic(), title = "Missing Values")

p1 <- p1 + 
    scale_fill_brewer(palette = "Paired")
p1

exclude <- c("TARGET_AMT", "AGE", "INCOME", "YOJ", "HOME_VAL", "CAR_AGE", "JOB")
main_df_binarized <- main_df |>
    select(-all_of(exclude)) |>
    binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER",
           one_hot = TRUE)
main_df_corr <- main_df_binarized |>
    correlate(TARGET_FLAG__1)
main_df_corr |>
    plot_correlation_funnel()

palette <- brewer.pal(n = 7, name = "RdBu")[c(1, 4, 7)]
excl <- c("TARGET_FLAG", "JOB", "CAR_TYPE", "CAR_USE", "EDUCATION",
             "MSTATUS", "PARENT1", "RED_CAR", "REVOKED", "SEX", "URBANICITY")
model.matrix(~0+., data = main_df |> filter(TARGET_FLAG == 1) |>select(-all_of(excl))) |>
    cor(use = "pairwise.complete.obs") |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 2.5,
               tl.cex = 8, tl.srt = 90,
               colors = palette, outline.color = "white")

incl <- c("TARGET_AMT", "CAR_USE", "MSTATUS", "PARENT1", "RED_CAR",
             "REVOKED", "SEX", "URBANICITY")
model.matrix(~0+., data = main_df |> filter(TARGET_FLAG == 1) |> select(all_of(incl))) |>
    cor(use = "pairwise.complete.obs") |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 3,
               tl.cex = 8, tl.srt = 90,
               colors = palette, outline.color = "white")

incl <- c("TARGET_AMT", "JOB", "CAR_TYPE", "EDUCATION")
model.matrix(~0+., data = main_df |> filter(TARGET_FLAG == 1) |> select(all_of(incl))) |>
    cor(use = "pairwise.complete.obs") |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 1.75,
               tl.cex = 8, tl.srt = 90,
               colors = palette, outline.color = "white")

r <- model.matrix(~0+., data = main_df) |>
    cor(use = "pairwise.complete.obs")
is.na(r) <- abs(r) < 0.45
r |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 2.5,
               tl.cex = 8, tl.srt = 90,
               colors = palette, outline.color = "white")

output <- split_columns(main_df, binary_as_factor = TRUE)
num <- data.frame(Variable = names(output$continuous),
                   Type = rep("Numeric", ncol(output$continuous)))
cat <- data.frame(Variable = names(output$discrete),
                   Type = rep("Categorical", ncol(output$discrete)))
ranges <- as.data.frame(t(sapply(main_df |> select(-names(output$discrete)),
                                 range, na.rm = TRUE)))
factors <- names(output$discrete)
main_df <- main_df |> 
    mutate(across(all_of(factors), ~as.factor(.)))
values <- as.data.frame(t(sapply(main_df |> select(all_of(factors)),
                                 levels)))
values <- values |>
    mutate(across(all_of(factors), ~toString(unlist(.))))
values <- as.data.frame(t(values)) |>
    rownames_to_column()
cols <- c("Variable", "Values")
colnames(values) <- cols
remove <- c("V1", "V2")
ranges <- ranges |>
    rownames_to_column() |>
    group_by(rowname) |>
    mutate(Values = toString(c(V1, " - ", round(V2, 1))),
           Values = str_replace_all(Values, ",", "")) |>
    select(-all_of(remove))
colnames(ranges) <- cols
num <- num |>
    merge(ranges)
cat <- cat |>
    merge(values)
num_vs_cat <- num |>
    bind_rows(cat)
kable(num_vs_cat, format = "simple")

alt_df <- main_df
main_df <- main_df |>
    mutate(TARGET_AMT = case_when(as.numeric(as.character(TARGET_FLAG)) < 1 ~ NA,
                               TRUE ~ TARGET_AMT),
           HOME_VAL = case_when(HOME_VAL < 1 ~ NA,
                               TRUE ~ HOME_VAL),
           INCOME = case_when(INCOME < 1 ~ NA,
                              TRUE ~ INCOME))

main_df <- main_df |>
    mutate(CAR_AGE = case_when(CAR_AGE < 0 ~ CAR_AGE * -1,
                               TRUE ~ CAR_AGE))
alt_df <- alt_df |>
    mutate(CAR_AGE = case_when(CAR_AGE < 0 ~ CAR_AGE * -1,
                               TRUE ~ CAR_AGE))

summary(main_df)

littles_test <- main_df |>
    mcar_test()
knitr::kable(littles_test, format = "simple")

x <- colnames(main_df)
dep = c("CAR_AGE")
exp = x[!x %in% dep]
missing_comp1 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp1) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
dep = c("YOJ")
exp = x[!x %in% dep]
missing_comp2 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp2) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
dep = c("INCOME")
exp = x[!x %in% dep]
missing_comp3 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp3) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
dep = c("HOME_VAL")
exp = x[!x %in% dep]
missing_comp4 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp4) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
dep = c("JOB")
exp = x[!x %in% dep]
missing_comp5 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp5) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
missing_comp <- missing_comp1 |>
    bind_rows(missing_comp2, missing_comp3, missing_comp4, missing_comp5) |>
    mutate(Explanatory = case_when(is.na(p) ~ NA,
                                   TRUE ~ Explanatory)) |>
    fill(Explanatory, .direction = "down") |>
    group_by(Dependant, Explanatory) |>
    filter(any(p < 0.05)) |>
    select(Dependant, everything())
knitr::kable(missing_comp, format = "simple")

show <- c("YOJ", "INCOME", "HOME_VAL", "JOB")
p2 <- main_df |>
    select(all_of(show)) |>
    missing_plot()
p2

explanatory = c("JOB", "INCOME", "YOJ")
dependent = "HOME_VAL"
p3 <- main_df |>
    select(all_of(show)) |>
    missing_pattern(dependent, explanatory)

# just numeric variables
numeric_train <- main_df[,sapply(main_df, is.numeric)]
par(mfrow=c(4,4))
par(mai=c(.3,.3,.3,.3))
variables <- names(numeric_train)
for (i in 1:(length(variables))) {
  hist(numeric_train[[variables[i]]], main = variables[i], col = "lightblue")
}

cat_pivot <- main_df |>
    select(all_of(factors)) |>
    pivot_longer(cols = all_of(factors),
                 names_to = "Variable",
                 values_to = "Value") |>
    group_by(Variable, Value) |>
    summarize(Count = n()) |>
    group_by(Variable) |>
    mutate(Levels = n()) |>
    ungroup()
p4 <- cat_pivot |>
    filter(Levels == 2) |>
    ggplot(aes(x = Value, y = Count)) +
    geom_col(fill = "lightblue", color = "black") +
    facet_wrap(vars(Variable), ncol = 4, scales = "free_x") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p4

p5 <- cat_pivot |>
    filter(Levels > 2) |>
    ggplot(aes(x = Value, y = Count)) +
    geom_col(fill = "lightblue", color = "black") +
    coord_flip() + 
    facet_wrap(vars(Variable), ncol = 1, scales = "free")
p5

# car type
x <- main_df$CAR_TYPE
main_df$CAR_TYPE <- case_match(x, "z_SUV" ~ "SUV", .default = x)
main_df$CAR_TYPE <- factor(main_df$CAR_TYPE,
                           levels = c("Minivan", "Panel Truck",
                                      "Pickup", "Sports Car", "SUV", "Van"))
x <- alt_df$CAR_TYPE
alt_df$CAR_TYPE <- case_match(x, "z_SUV" ~ "SUV", .default = x)
alt_df$CAR_TYPE <- factor(alt_df$CAR_TYPE,
                           levels = c("Minivan", "Panel Truck",
                                      "Pickup", "Sports Car", "SUV", "Van"))

# education
x <- main_df$EDUCATION
main_df$EDUCATION <- case_match(x, "z_High School" ~ "High School", .default = x)
main_df$EDUCATION <- factor(main_df$EDUCATION,
                             levels = c("<High School", "High School",
                                        "Bachelors", "Masters", "PhD"))
x <- alt_df$EDUCATION
alt_df$EDUCATION <- case_match(x, "z_High School" ~ "High School", .default = x)
alt_df$EDUCATION <- factor(alt_df$EDUCATION,
                             levels = c("<High School", "High School",
                                        "Bachelors", "Masters", "PhD"))

# job
x <- main_df$JOB
main_df$JOB <- case_match(x, "z_Blue Collar" ~ "Blue Collar", .default = x)
main_df$JOB <- factor(main_df$JOB, levels = c("Blue Collar", "Clerical",
                                              "Doctor", "Home Maker","Lawyer",
                                              "Manager", "Professional", "Student"))
x <- alt_df$JOB
alt_df$JOB <- case_match(x, "z_Blue Collar" ~ "Blue Collar", .default = x)
alt_df$JOB <- factor(alt_df$JOB, levels = c("Blue Collar", "Clerical",
                                              "Doctor", "Home Maker","Lawyer",
                                              "Manager", "Professional", "Student"))

# single parent
main_df <- main_df |>
  mutate(PARENT1 = as.factor(ifelse(PARENT1 == "Yes", 1, 0)))
alt_df <- alt_df |>
  mutate(PARENT1 = as.factor(ifelse(PARENT1 == "Yes", 1, 0)))

# marital status
x <- main_df$MSTATUS
main_df$MSTATUS <- case_match(x, "z_No" ~ "No", .default = x)
main_df <- main_df |>
  mutate(MSTATUS = as.factor(ifelse(MSTATUS == "Yes", 1, 0)))
x <- alt_df$MSTATUS
alt_df$MSTATUS <- case_match(x, "z_No" ~ "No", .default = x)
alt_df <- alt_df |>
  mutate(MSTATUS = as.factor(ifelse(MSTATUS == "Yes", 1, 0)))

# red car
x <- main_df$RED_CAR
main_df$RED_CAR <- case_match(x, "no" ~ "No", "yes" ~ "Yes", .default = x)
main_df <- main_df |>
  mutate(RED_CAR = as.factor(ifelse(RED_CAR == "Yes", 1, 0)))
x <- alt_df$RED_CAR
alt_df$RED_CAR <- case_match(x, "no" ~ "No", "yes" ~ "Yes", .default = x)
alt_df <- alt_df |>
  mutate(RED_CAR = as.factor(ifelse(RED_CAR == "Yes", 1, 0)))

# revoked
main_df <- main_df |>
  mutate(REVOKED = as.factor(ifelse(REVOKED == "Yes", 1, 0)))
alt_df <- alt_df |>
  mutate(REVOKED = as.factor(ifelse(REVOKED == "Yes", 1, 0)))

# sex 
x <- main_df$SEX
main_df$SEX <- case_match(x, "M" ~ "Male", "z_F" ~ "Female", .default = x)
main_df$SEX <- factor(main_df$SEX, levels = c("Male", "Female"))
x <- alt_df$SEX
alt_df$SEX <- case_match(x, "M" ~ "Male", "z_F" ~ "Female", .default = x)
alt_df$SEX <- factor(alt_df$SEX, levels = c("Male", "Female"))

# urban city - 1 if urban, 0 if rural
x <- main_df$URBANICITY
main_df$URBANICITY <- case_match(x, "Highly Urban/ Urban" ~ "Urban",
                                 "z_Highly Rural/ Rural" ~ "Rural", .default = x)
main_df <- main_df |>
  mutate(URBANICITY = as.factor(ifelse(URBANICITY == "Urban", 1, 0)))
x <- alt_df$URBANICITY
alt_df$URBANICITY <- case_match(x, "Highly Urban/ Urban" ~ "Urban",
                                 "z_Highly Rural/ Rural" ~ "Rural", .default = x)
alt_df <- alt_df |>
  mutate(URBANICITY = as.factor(ifelse(URBANICITY == "Urban", 1, 0)))

vars <- c("CAR_TYPE", "EDUCATION", "JOB", "PARENT1", "MSTATUS", "RED_CAR",
          "REVOKED", "SEX", "URBANICITY")

levs <- c("Minivan, Panel Truck, Pickup, Sports Car, SUV, Van",
          "<High School, High School, Bachelors, Masters, PhD",
          "Blue Collar, Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student",
          "0, 1",
          "0, 1",
          "0, 1",
          "0, 1",
          "Male, Female",
          "0, 1")

vars_levs <- as.data.frame(cbind(vars, levs))
colnames(vars_levs) <- c("Factor", "New Levels")
knitr::kable(vars_levs, format = "simple")

drop <- c("INCOME", "HOME_VAL")
main_df <- main_df |>
    mutate(INCOME_THOU = INCOME / 1000,
           HOME_VAL_THOU = HOME_VAL / 1000) |>
    select(-all_of(drop))
alt_df <- alt_df |>
    mutate(INCOME_THOU = INCOME / 1000,
           HOME_VAL_THOU = HOME_VAL / 1000) |>
    select(-all_of(drop))
    
main_df <- main_df |>
    mutate(YOJ = case_when(JOB == "Student" ~ NA,
                           TRUE ~ YOJ))
alt_df <- alt_df |>
    mutate(YOJ = case_when(JOB == "Student" ~ 0,
                           TRUE ~ YOJ))

exclude1 <- c("Student", "Homemaker")
exclude2 <- c(exclude1, "Blue Collar")
main_df <- main_df |>
    mutate(HOME_VAL_CAT = factor(case_when(HOME_VAL_THOU < 251 ~ "<=250K",
                                        HOME_VAL_THOU < 501 ~ "251-500K",
                                        HOME_VAL_THOU < 751 ~ "501-750K",
                                        TRUE ~ "751K+"),
                              ordered = TRUE,
                              levels = c("<=250K", "251-500K", "501-750K", "751K+"),
                              exclude = NULL),
           HOMEOWNER = as.factor(ifelse(is.na(HOME_VAL_THOU), 0, 1)),
           INCOME_CAT = factor(case_when(INCOME_THOU < 51 ~ "<=50K",
                                          INCOME_THOU < 101 ~ "51-100K",
                                          INCOME_THOU < 151 ~ "101-150K",
                                          TRUE ~ "151K+"),
                                ordered = TRUE,
                                levels = c("<=50K", "51-100K", "101-150K", "151K+"),
                                exclude = NULL),
           INCOME_FLAG = as.factor(ifelse(is.na(INCOME_THOU), 0, 1)),
           KIDSDRIV_FLAG = as.factor(case_when(KIDSDRIV > 0 ~ 1,
                                               TRUE ~ 0)),
           HOMEKIDS_FLAG = as.factor(case_when(HOMEKIDS > 0 ~ 1,
                                               TRUE ~ 0)),
           EMPLOYED = as.factor(ifelse(JOB %in% exclude1 | is.na(JOB),
                                       0, 1)),
           CAR_AGE_CAT = factor(case_when(CAR_AGE < 5 ~ "<=4",
                                           CAR_AGE < 9 ~ "5-8",
                                           CAR_AGE < 13 ~ "9-12",
                                           TRUE ~ "13+"),
                                 ordered = TRUE,
                                 levels = c("<=4", "5-8", "9-12", "13+"),
                                 exclude = NULL),
           WHITE_COLLAR = as.factor(ifelse(JOB %in% exclude2 | is.na(JOB),
                                           0, 1)))
main_df$JOB <- factor(main_df$JOB, exclude = NULL)

set.seed(202)
rows <- sample(nrow(main_df))
main_df <- main_df[rows, ]
alt_df <- alt_df[rows, ]
sample <- sample(c(TRUE, FALSE), nrow(main_df), replace=TRUE,
                 prob=c(0.7,0.3))
train_df <- main_df[sample, ]
test_df <- main_df[!sample, ]
alt_train_df <- alt_df[sample, ]
alt_test_df <- alt_df[!sample, ]

train_df_imputed <- train_df |>
    mutate(AGE = case_when(is.na(AGE) ~ mean(AGE, na.rm = TRUE),
                           TRUE ~ AGE))
test_df_imputed <- test_df |>
    mutate(AGE = case_when(is.na(AGE) ~ mean(AGE, na.rm = TRUE),
                           TRUE ~ AGE))

missing <- c("AGE")
imp_train_num <- train_df_imputed |>
    select(all_of(missing)) |>
    mutate(Set = "Train")
imp_test_num <- test_df_imputed |>
    select(all_of(missing)) |>
    mutate(Set = "Test")
imp_num <- imp_train_num |>
    bind_rows(imp_test_num)
imp_num_pivot <- imp_num |>
    pivot_longer(!Set, names_to = "Variable", values_to = "Value")
p6 <- imp_num_pivot |>
    ggplot(aes(x = Value)) +
    geom_density(fill = "lightblue", color = "black") +
    labs(y = "Density") +
    facet_grid(rows = vars(Set), cols = vars(Variable),
               switch = "y", scales = "free_x")
p6

col_classes <- unlist(lapply(alt_train_df, class))
missing <- c("AGE", "INCOME_THOU", "YOJ", "HOME_VAL_THOU", "CAR_AGE", "JOB")
x <- names(col_classes)
not_missing <- x[!x %in% missing]
#The imputation process is slow, so we impute once, save the results as .csv files, and load those files on later runs, making sure the factor levels stay the same.
if (file.exists("alt_train_df_imputed.csv") & file.exists("alt_test_df_imputed.csv")){
    alt_train_df_imputed <- read.csv("alt_train_df_imputed.csv", na.strings = "",
                                 colClasses = col_classes)
    alt_test_df_imputed <- read.csv("alt_test_df_imputed.csv", na.strings = "",
                                 colClasses = col_classes)
}else{
    #Start with alt_train_df
    init = mice(alt_train_df, maxit=0) 
    meth = init$method
    predM = init$predictorMatrix
    
    #Skip variables without missing data
    meth[not_missing] = ""
    
    #Set different imputation methods for each of the variables with missing data
    meth[c("AGE")] = "pmm" #Predictive mean matching
    meth[c("INCOME_THOU")] = "pmm"
    meth[c("YOJ")] = "pmm"
    meth[c("HOME_VAL_THOU")] = "pmm"
    meth[c("CAR_AGE")] = "pmm"
    meth[c("JOB")] = "polyreg" #Polytomous (multinomial) logistic regression
    
    #Impute
    imputed = mice(alt_train_df, method=meth, predictorMatrix=predM, m=5,
                   printFlag = FALSE)
    alt_train_df_imputed <- complete(imputed)
    write.csv(alt_train_df_imputed, "alt_train_df_imputed.csv", row.names = FALSE,
              fileEncoding = "UTF-8")
    
    #Repeat for alt_test_df
    init = mice(alt_test_df, maxit=0) 
    meth = init$method
    predM = init$predictorMatrix
    meth[not_missing] = ""
    meth[c("AGE")] = "pmm"
    meth[c("INCOME_THOU")] = "pmm"
    meth[c("YOJ")] = "pmm"
    meth[c("HOME_VAL_THOU")] = "pmm"
    meth[c("CAR_AGE")] = "pmm"
    meth[c("JOB")] = "polyreg"
    imputed = mice(alt_test_df, method=meth, predictorMatrix=predM, m=5,
                   printFlag = FALSE)
    alt_test_df_imputed <- complete(imputed)
    write.csv(alt_test_df_imputed, "alt_test_df_imputed.csv", row.names = FALSE,
              fileEncoding = "UTF-8")
}

#Make sure the factor levels (and their order) stay the same as in main_df.
#Re-declare the factors rather than overwriting level labels positionally,
#which could silently relabel values when the cached CSVs are reloaded.
for (v in c("CAR_TYPE", "EDUCATION", "JOB", "SEX")) {
    alt_train_df_imputed[[v]] <- factor(alt_train_df_imputed[[v]],
                                        levels = levels(main_df[[v]]))
    alt_test_df_imputed[[v]] <- factor(alt_test_df_imputed[[v]],
                                       levels = levels(main_df[[v]]))
}

x <- sapply(alt_train_df_imputed, function(x) sum(is.na(x)))
y <- sapply(alt_test_df_imputed, function(x) sum(is.na(x)))
sum(x, y) == 0

missing_num <- c("AGE", "INCOME_THOU", "YOJ", "HOME_VAL_THOU", "CAR_AGE")
imp_alt_train_num <- alt_train_df_imputed |>
    select(all_of(missing_num)) |>
    mutate(Set = "Train")
imp_alt_test_num <- alt_test_df_imputed |>
    select(all_of(missing_num)) |>
    mutate(Set = "Test")
imp_alt_num <- imp_alt_train_num |>
    bind_rows(imp_alt_test_num)
imp_alt_num_pivot <- imp_alt_num |>
    pivot_longer(!Set, names_to = "Variable", values_to = "Value")
p7 <- imp_alt_num_pivot |>
    ggplot(aes(x = Value)) +
    geom_density(fill = "lightblue", color = "black") +
    labs(y = "Density") +
    facet_grid(rows = vars(Set), cols = vars(Variable),
               switch = "y", scales = "free_x")
p7

missing_cat <- c("JOB")
imp_alt_train_cat <- alt_train_df_imputed |>
    select(all_of(missing_cat)) |>
    pivot_longer(cols = all_of(missing_cat),
                 names_to = "Variable",
                 values_to = "Value") |>
    group_by(Variable, Value) |>
    summarize(Count = n()) |>
    mutate(Set = "Train")
imp_alt_test_cat <- alt_test_df_imputed |>
    select(all_of(missing_cat)) |>
    pivot_longer(cols = all_of(missing_cat),
                 names_to = "Variable",
                 values_to = "Value") |>
    group_by(Variable, Value) |>
    summarize(Count = n()) |>
    mutate(Set = "Test")
imp_alt_pivot_cat <- imp_alt_train_cat |>
    bind_rows(imp_alt_test_cat)
p8 <- imp_alt_pivot_cat |>
    ggplot(aes(x = Value, y = Count)) +
    geom_col(fill = "lightblue", color = "black") +
    labs(x = "Job") + 
    coord_flip() +
    facet_wrap(vars(Set), ncol = 2)
p8

skewed <- c("TRAVTIME", "BLUEBOOK", "TIF", "OLDCLAIM", "CLM_FREQ", "MVR_PTS")
train_df_trans <- train_df_imputed
for (i in 1:(length(skewed))){
    #Add a small constant to columns with any 0 values
    if (sum(train_df_trans[[skewed[i]]] == 0) > 0){
        train_df_trans[[skewed[i]]] <-
            train_df_trans[[skewed[i]]] + 0.001
    }
}
for (i in 1:(length(skewed))){
    if (i == 1){
        lambdas <- c()
    }
    bc <- boxcox(lm(train_df_trans[[skewed[i]]] ~ 1),
                 lambda = seq(-2, 2, length.out = 81),
                 plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]
    lambdas <- append(lambdas, lambda)
}
lambdas <- as.data.frame(cbind(skewed, lambdas))
adj <- c("no transformation", "square root", "log", "log", "log", "log")
lambdas <- cbind(lambdas, adj)
cols <- c("Skewed Variable", "Ideal Lambda Proposed by Box-Cox", "Reasonable Alternative Transformation")
colnames(lambdas) <- cols
knitr::kable(lambdas, format = "simple")

remove <- c("BLUEBOOK", "TIF", "OLDCLAIM", "CLM_FREQ", "MVR_PTS")
train_df_trans <- train_df_trans |>
    mutate(BLUEBOOK_SQRT = BLUEBOOK^0.5,
           TIF_LOG = log(TIF),
           OLDCLAIM_LOG = log(OLDCLAIM),
           CLM_FREQ_LOG = log(CLM_FREQ),
           MVR_PTS_LOG = log(MVR_PTS)) |>
    select(-all_of(remove))
test_df_trans <- test_df_imputed
for (i in 1:(length(skewed))){
    #Add a small constant to columns with any 0 values
    if (sum(test_df_trans[[skewed[i]]] == 0) > 0){
        test_df_trans[[skewed[i]]] <-
            test_df_trans[[skewed[i]]] + 0.001
    }
}
test_df_trans <- test_df_trans |>
    mutate(BLUEBOOK_SQRT = BLUEBOOK^0.5,
           TIF_LOG = log(TIF),
           OLDCLAIM_LOG = log(OLDCLAIM),
           CLM_FREQ_LOG = log(CLM_FREQ),
           MVR_PTS_LOG = log(MVR_PTS)) |>
    select(-all_of(remove))

transformed <- c("BLUEBOOK_SQRT", "TIF_LOG", "OLDCLAIM_LOG", "CLM_FREQ_LOG",
                 "MVR_PTS_LOG")
train_df_trans_set <- train_df_trans |>
    select(all_of(transformed)) |>
    mutate(Set = "Train")
test_df_trans_set <- test_df_trans |>
    select(all_of(transformed)) |>
    mutate(Set = "Test")
trans_sets <- train_df_trans_set |>
    bind_rows(test_df_trans_set)
trans_sets_pivot <- trans_sets |>
    pivot_longer(!Set, names_to = "Variable", values_to = "Value")
p9 <- trans_sets_pivot |>
    ggplot(aes(x = Value)) +
    geom_density(fill = "lightblue", color = "black") +
    labs(y = "Density") +
    facet_grid(rows = vars(Set), cols = vars(Variable),
               switch = "y", scales = "free_x")
p9

skewed <- c("TARGET_AMT", "YOJ", "TRAVTIME", "KIDSDRIV", "HOMEKIDS", "BLUEBOOK",
            "TIF", "OLDCLAIM", "CLM_FREQ", "MVR_PTS", "INCOME_THOU",
            "HOME_VAL_THOU", "CAR_AGE")
alt_train_df_trans <- alt_train_df_imputed
for (i in 1:(length(skewed))){
    #Add a small constant to columns with any 0 values
    if (sum(alt_train_df_trans[[skewed[i]]] == 0) > 0){
        alt_train_df_trans[[skewed[i]]] <-
            alt_train_df_trans[[skewed[i]]] + 0.001
    }
}
for (i in 1:(length(skewed))){
    if (i == 1){
        lambdas <- c()
    }
    bc <- boxcox(lm(alt_train_df_trans[[skewed[i]]] ~ 1),
                 lambda = seq(-2, 2, length.out = 81),
                 plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]
    lambdas <- append(lambdas, lambda)
}
lambdas <- as.data.frame(cbind(skewed, lambdas))
adj <- c("log", "no transformation", "no transformation", "inverse", "log",
         "square root", "log", "log", "log", "log", "square root", "log",
         "square root")
lambdas <- cbind(lambdas, adj)
cols <- c("Skewed Variable", "Ideal Lambda Proposed by Box-Cox", "Reasonable Alternative Transformation")
colnames(lambdas) <- cols
knitr::kable(lambdas, format = "simple")

remove <- c("TARGET_AMT", "KIDSDRIV", "HOMEKIDS", "BLUEBOOK",
            "TIF", "OLDCLAIM", "CLM_FREQ", "MVR_PTS", "INCOME_THOU",
            "HOME_VAL_THOU", "CAR_AGE")
alt_train_df_trans <- alt_train_df_trans |>
    mutate(TARGET_AMT_LOG = log(TARGET_AMT),
           KIDSDRIV_INV = KIDSDRIV^-1,
           HOMEKIDS_LOG = log(HOMEKIDS),
           BLUEBOOK_SQRT = BLUEBOOK^0.5,
           TIF_LOG = log(TIF),
           OLDCLAIM_LOG = log(OLDCLAIM),
           CLM_FREQ_LOG = log(CLM_FREQ),
           MVR_PTS_LOG = log(MVR_PTS),
           INCOME_THOU_SQRT = INCOME_THOU^0.5,
           HOME_VAL_THOU_LOG = log(HOME_VAL_THOU),
           CAR_AGE_SQRT = CAR_AGE^0.5) |>
    select(-all_of(remove))
alt_test_df_trans <- alt_test_df_imputed
for (i in 1:(length(skewed))){
    #Add a small constant to columns with any 0 values
    if (sum(alt_test_df_trans[[skewed[i]]] == 0) > 0){
        alt_test_df_trans[[skewed[i]]] <-
            alt_test_df_trans[[skewed[i]]] + 0.001
    }
}
alt_test_df_trans <- alt_test_df_trans |>
    mutate(TARGET_AMT_LOG = log(TARGET_AMT),
           KIDSDRIV_INV = KIDSDRIV^-1,
           HOMEKIDS_LOG = log(HOMEKIDS),
           BLUEBOOK_SQRT = BLUEBOOK^0.5,
           TIF_LOG = log(TIF),
           OLDCLAIM_LOG = log(OLDCLAIM),
           CLM_FREQ_LOG = log(CLM_FREQ),
           MVR_PTS_LOG = log(MVR_PTS),
           INCOME_THOU_SQRT = INCOME_THOU^0.5,
           HOME_VAL_THOU_LOG = log(HOME_VAL_THOU),
           CAR_AGE_SQRT = CAR_AGE^0.5) |>
    select(-all_of(remove))

model_blr_1 <- glm(TARGET_FLAG ~ . - TARGET_AMT, family = 'binomial',
                   data = alt_train_df_imputed)
model_blr_1 <- stepAIC(model_blr_1, trace = 0)
summary(model_blr_1)

beta <- coef(model_blr_1)
beta_exp <- as.data.frame(exp(beta)) |>
    rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta_exp) <- cols
beta_exp <- beta_exp |>
    filter(Feature != "(Intercept)")
beta_exp <- beta_exp |>
    mutate(diff = round(Coefficient - 1, 3) * 100) |>
    arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds of Car Crash")
colnames(beta_exp) <- cols
knitr::kable(beta_exp, format = "simple")

vif(model_blr_1)

model_blr_2 <- glm(TARGET_FLAG ~ AGE + CLM_FREQ + HOMEOWNER + INCOME_FLAG + EMPLOYED + WHITE_COLLAR + MSTATUS + PARENT1 + RED_CAR + REVOKED + SEX + TRAVTIME,
                   data=train_df_imputed, family='binomial')
model_blr_2 <- stepAIC(model_blr_2, trace=0)
summary(model_blr_2)

beta <- coef(model_blr_2)
beta_exp <- as.data.frame(exp(beta)) |>
    rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta_exp) <- cols
beta_exp <- beta_exp |>
    filter(Feature != "(Intercept)")
beta_exp <- beta_exp |>
    mutate(diff = round(Coefficient - 1, 3) * 100) |>
    arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds of Car Crash")
colnames(beta_exp) <- cols
knitr::kable(beta_exp, format = "simple")

model_blr_2 <- update(model_blr_2, ~ . - EMPLOYED)

vif(model_blr_2)

choices <- c("AGE", "CLM_FREQ_LOG", "URBANICITY", "MVR_PTS_LOG", "OLDCLAIM_LOG", "PARENT1", "REVOKED", "CAR_USE", "CAR_TYPE", "MSTATUS", "EDUCATION", "KIDSDRIV_FLAG", "INCOME_CAT", "EMPLOYED", "HOMEOWNER", "WHITE_COLLAR")
print(choices)

model_blr_3 <- glm(TARGET_FLAG ~ AGE + CLM_FREQ_LOG + URBANICITY + MVR_PTS_LOG +  OLDCLAIM_LOG + PARENT1 + REVOKED + CAR_USE + CAR_TYPE + MSTATUS + EDUCATION + KIDSDRIV_FLAG + INCOME_CAT + EMPLOYED + HOMEOWNER + WHITE_COLLAR,
                   family = 'binomial', data = train_df_trans)
summary(model_blr_3)

model_blr_3 <- update(model_blr_3, ~ . - WHITE_COLLAR)
model_blr_3 <- update(model_blr_3, ~ . - EMPLOYED)
summary(model_blr_3)

beta <- coef(model_blr_3)
beta_exp <- as.data.frame(exp(beta)) |>
    rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta_exp) <- cols
beta_exp <- beta_exp |>
    filter(Feature != "(Intercept)")
beta_exp <- beta_exp |>
    mutate(diff = round(Coefficient - 1, 3) * 100) |>
    arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds of Car Crash")
colnames(beta_exp) <- cols
knitr::kable(beta_exp, format = "simple")

vif(model_blr_3)

model_blr_3 <- update(model_blr_3, ~ . - OLDCLAIM_LOG)

model_mlr_1 <- lm(TARGET_AMT ~ ., data = alt_train_df_imputed |>
                      filter(TARGET_FLAG == 1) |> select(-TARGET_FLAG))
model_mlr_1 <- step(model_mlr_1, trace=0)
summary(model_mlr_1)

vif(model_mlr_1)

model_mlr_2 <- lm(TARGET_AMT_LOG ~ ., data = alt_train_df_trans |>
                      filter(TARGET_FLAG == 1) |> select(-TARGET_FLAG))
model_mlr_2 <- step(model_mlr_2, trace=0)
summary(model_mlr_2)

vif(model_mlr_2)

model_mlr_3 <- rlm(TARGET_AMT_LOG ~ MSTATUS + SEX + BLUEBOOK_SQRT + RED_CAR + REVOKED + MVR_PTS_LOG,
                   data = alt_train_df_trans |>
                      filter(TARGET_FLAG == 1) |> select(-TARGET_FLAG))
summary(model_mlr_3)

palette <- brewer.pal(n = 12, name = "Paired")
mmps(model_blr_1, layout = c(3, 4), grid = FALSE, col.line = palette[c(2,6)],
     main = "Model BLR:1")

mmps(model_blr_2, layout = c(2, 2), grid = FALSE, col.line = palette[c(2,6)],
     main = "Model BLR:2")

mmps(model_blr_3, layout = c(2, 2), grid = FALSE, col.line = palette[c(2,6)],
     main = "Model BLR:3")

hlstat1 <- hltest(model_blr_1, verbose = FALSE)
hlstat2 <- hltest(model_blr_2, verbose = FALSE)
hlstat3 <- hltest(model_blr_3, verbose = FALSE)
models <- c("Model BLR:1",
            "Model BLR:2",
            "Model BLR:3")
hl_tbl <- as.data.frame(cbind(models, rbind(hlstat1[2:4], hlstat2[2:4],
                                            hlstat3[2:4])))
cols <- c("Model", "HL Statistic", "DoF", "P Value")
colnames(hl_tbl) <- cols
knitr::kable(hl_tbl, format = "simple")

model_blr_1_train_preds_df <- alt_train_df_imputed |>
    mutate(linpred = predict(model_blr_1),
           predprob = predict(model_blr_1, type = "response"))
model_blr_1_test_preds_df <- alt_test_df_imputed |>
    mutate(linpred = predict(model_blr_1, alt_test_df_imputed),
           predprob = predict(model_blr_1, alt_test_df_imputed, type = "response"))
model_blr_2_train_preds_df <- train_df_imputed |>
    mutate(linpred = predict(model_blr_2),
           predprob = predict(model_blr_2, type = "response"))
model_blr_2_test_preds_df <- test_df_imputed |>
    mutate(linpred = predict(model_blr_2, test_df_imputed),
           predprob = predict(model_blr_2, test_df_imputed, type = "response"))
model_blr_3_train_preds_df <- train_df_trans |>
    mutate(linpred = predict(model_blr_3),
           predprob = predict(model_blr_3, type = "response"))
model_blr_3_test_preds_df <- test_df_trans |>
    mutate(linpred = predict(model_blr_3, test_df_trans),
           predprob = predict(model_blr_3, test_df_trans, type = "response"))
par(mfrow=c(3,2))
par(mai=c(.3,.3,.3,.3))
roc1 <- roc(model_blr_1_train_preds_df$TARGET_FLAG,
            model_blr_1_train_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:1 ROC (Train)")
roc2 <- roc(model_blr_1_test_preds_df$TARGET_FLAG,
            model_blr_1_test_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:1 ROC (Test)")
roc3 <- roc(model_blr_2_train_preds_df$TARGET_FLAG,
            model_blr_2_train_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:2 ROC (Train)")
roc4 <- roc(model_blr_2_test_preds_df$TARGET_FLAG,
            model_blr_2_test_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:2 ROC (Test)")
roc5 <- roc(model_blr_3_train_preds_df$TARGET_FLAG,
            model_blr_3_train_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:3 ROC (Train)")
roc6 <- roc(model_blr_3_test_preds_df$TARGET_FLAG,
            model_blr_3_test_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:3 ROC (Test)")

model_blr_1_train_preds_df <- model_blr_1_train_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_1_test_preds_df <- model_blr_1_test_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_2_train_preds_df <- model_blr_2_train_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_2_test_preds_df <- model_blr_2_test_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_3_train_preds_df <- model_blr_3_train_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_3_test_preds_df <- model_blr_3_test_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_1_train_cm <- confusionMatrix(model_blr_1_train_preds_df$predicted,
                                        model_blr_1_train_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_1_test_cm <- confusionMatrix(model_blr_1_test_preds_df$predicted,
                                        model_blr_1_test_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_2_train_cm <- confusionMatrix(model_blr_2_train_preds_df$predicted,
                                        model_blr_2_train_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_2_test_cm <- confusionMatrix(model_blr_2_test_preds_df$predicted,
                                        model_blr_2_test_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_3_train_cm <- confusionMatrix(model_blr_3_train_preds_df$predicted,
                                        model_blr_3_train_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_3_test_cm <- confusionMatrix(model_blr_3_test_preds_df$predicted,
                                        model_blr_3_test_preds_df$TARGET_FLAG,
                                        positive = "1")
plt1a <- as.data.frame(model_blr_1_train_cm$table)
plt1a$Reference <- factor(plt1a$Reference, levels=rev(levels(plt1a$Reference)))
plt1a <- plt1a |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm1a <- plt1a |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:1 (Train) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt1b <- as.data.frame(model_blr_1_test_cm$table)
plt1b$Reference <- factor(plt1b$Reference, levels=rev(levels(plt1b$Reference)))
plt1b <- plt1b |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm1b <- plt1b |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:1 (Test) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt2a <- as.data.frame(model_blr_2_train_cm$table)
plt2a$Reference <- factor(plt2a$Reference, levels=rev(levels(plt2a$Reference)))
plt2a <- plt2a |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm2a <- plt2a |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:2 (Train) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt2b <- as.data.frame(model_blr_2_test_cm$table)
plt2b$Reference <- factor(plt2b$Reference, levels=rev(levels(plt2b$Reference)))
plt2b <- plt2b |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm2b <- plt2b |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:2 (Test) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt3a <- as.data.frame(model_blr_3_train_cm$table)
plt3a$Reference <- factor(plt3a$Reference, levels=rev(levels(plt3a$Reference)))
plt3a <- plt3a |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm3a <- plt3a |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:3 (Train) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt3b <- as.data.frame(model_blr_3_test_cm$table)
plt3b$Reference <- factor(plt3b$Reference, levels=rev(levels(plt3b$Reference)))
plt3b <- plt3b |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm3b <- plt3b |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:3 (Test) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
pcm_all <- plot_grid(pcm1a, pcm1b, pcm2a, pcm2b, pcm3a, pcm3b,
                     ncol = 2)
pcm_all

metrics_a <- as.data.frame(cbind(model_blr_1_train_cm$byClass,
                                 model_blr_1_test_cm$byClass,
                                 model_blr_2_train_cm$byClass,
                                 model_blr_2_test_cm$byClass,
                                 model_blr_3_train_cm$byClass,
                                 model_blr_3_test_cm$byClass))
colnames(metrics_a) <-c('1 (Train)',
                        '1 (Test)',
                        '2 (Train)',
                        '2 (Test)',
                        '3 (Train)',
                        '3 (Test)')
metrics <- rbind(metrics_a,
                 c(model_blr_1_train_cm$overall[1],
                   model_blr_1_test_cm$overall[1],
                   model_blr_2_train_cm$overall[1],
                   model_blr_2_test_cm$overall[1],
                   model_blr_3_train_cm$overall[1],
                   model_blr_3_test_cm$overall[1]),
                 c(1-model_blr_1_train_cm$overall[1],
                   1-model_blr_1_test_cm$overall[1],
                   1-model_blr_2_train_cm$overall[1],
                   1-model_blr_2_test_cm$overall[1],
                   1-model_blr_3_train_cm$overall[1],
                   1-model_blr_3_test_cm$overall[1]),
                 c(roc1$auc,
                   roc2$auc,
                   roc3$auc,
                   roc4$auc,
                   roc5$auc,
                   roc6$auc))
metrics <- round(metrics, 3)
rownames(metrics)[12:14] <- c('Accuracy','Classification Error Rate','AUC')
knitr::kable(metrics, format = "simple")

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(model_mlr_1)
mtext("Model MLR:1", side = 3, line = -1.5, outer = TRUE)

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(model_mlr_2)
mtext("Model MLR:2", side = 3, line = -1.5, outer = TRUE)

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(model_mlr_3)
mtext("Model MLR:3", side = 3, line = -1.5, outer = TRUE)

excl <- c("TARGET_AMT", "TARGET_FLAG")
test_x <- alt_test_df_imputed |>
    filter(TARGET_FLAG == 1) |>
    select(-all_of(excl))
test_y <- alt_test_df_imputed |>
    filter(TARGET_FLAG == 1) |>
    select(TARGET_AMT)
test_y <- as.numeric(test_y$TARGET_AMT)
test_pred <- predict(model_mlr_1, test_x)
model_mlr_1_test_rsq <- as.numeric(R2(test_pred, test_y, form = "traditional"))
model_mlr_1_test_rmse <- as.numeric(RMSE(test_pred, test_y))
excl <- c("TARGET_AMT_LOG", "TARGET_FLAG")
test_x <- alt_test_df_trans |>
    filter(TARGET_FLAG == 1) |>
    select(-all_of(excl))
test_y <- alt_test_df_trans |>
    filter(TARGET_FLAG == 1) |>
    select(TARGET_AMT_LOG)
test_y <- exp(as.numeric(test_y$TARGET_AMT_LOG))
test_pred <- exp(predict(model_mlr_2, test_x))
model_mlr_2_test_rsq <- as.numeric(R2(test_pred, test_y, form = "traditional"))
model_mlr_2_test_rmse <- as.numeric(RMSE(test_pred, test_y))
test_x <- alt_test_df_trans |>
    filter(TARGET_FLAG == 1) |>
    select(-all_of(excl))
test_y <- alt_test_df_trans |>
    filter(TARGET_FLAG == 1) |>
    select(TARGET_AMT_LOG)
test_y <- exp(as.numeric(test_y$TARGET_AMT_LOG))
test_pred <- exp(predict(model_mlr_3, test_x))
model_mlr_3_test_rsq <- as.numeric(R2(test_pred, test_y, form = "traditional"))
model_mlr_3_test_rmse <- as.numeric(RMSE(test_pred, test_y))
models <- c("Model MLR:1", "Model MLR:2", "Model MLR:3")
mlr_summary <- as.data.frame(cbind(models,
                                   pred_rsq = c(model_mlr_1_test_rsq,
                                                model_mlr_2_test_rsq,
                                                model_mlr_3_test_rsq),
                                   rmse = c(model_mlr_1_test_rmse,
                                            model_mlr_2_test_rmse,
                                            model_mlr_3_test_rmse)))
knitr::kable(mlr_summary, format = "simple")

my_url <- "https://raw.githubusercontent.com/andrewbowen19/businessAnalyticsDataMiningDATA621/main/data/insurance-evaluation-data.csv"
eval_df <- read.csv(my_url, na.strings = "")
eval_df_w_preds <- eval_df
# car type
x <- eval_df_w_preds$CAR_TYPE
eval_df_w_preds$CAR_TYPE <- case_match(x, "z_SUV" ~ "SUV", .default = x)
eval_df_w_preds$CAR_TYPE <- factor(eval_df_w_preds$CAR_TYPE,
                           levels = c("Minivan", "Panel Truck",
                                      "Pickup", "Sports Car", "SUV", "Van"))
# education
x <- eval_df_w_preds$EDUCATION
eval_df_w_preds$EDUCATION <- case_match(x, "z_High School" ~ "High School", .default = x)
eval_df_w_preds$EDUCATION <- factor(eval_df_w_preds$EDUCATION,
                             levels = c("<High School", "High School",
                                        "Bachelors", "Masters", "PhD"))
# job
x <- eval_df_w_preds$JOB
eval_df_w_preds$JOB <- case_match(x, "z_Blue Collar" ~ "Blue Collar", .default = x)
eval_df_w_preds$JOB <- factor(eval_df_w_preds$JOB, levels = c("Blue Collar", "Clerical",
                                              "Doctor", "Home Maker","Lawyer",
                                              "Manager", "Professional", "Student"))
# single parent
eval_df_w_preds <- eval_df_w_preds |>
  mutate(PARENT1 = as.factor(ifelse(PARENT1 == "Yes", 1, 0)))
# marital status
x <- eval_df_w_preds$MSTATUS
eval_df_w_preds$MSTATUS <- case_match(x, "z_No" ~ "No", .default = x)
eval_df_w_preds <- eval_df_w_preds |>
  mutate(MSTATUS = as.factor(ifelse(MSTATUS == "Yes", 1, 0)))
# red car
x <- eval_df_w_preds$RED_CAR
eval_df_w_preds$RED_CAR <- case_match(x, "no" ~ "No", "yes" ~ "Yes", .default = x)
eval_df_w_preds <- eval_df_w_preds |>
  mutate(RED_CAR = as.factor(ifelse(RED_CAR == "Yes", 1, 0)))
# revoked
eval_df_w_preds <- eval_df_w_preds |>
  mutate(REVOKED = as.factor(ifelse(REVOKED == "Yes", 1, 0)))
# sex 
x <- eval_df_w_preds$SEX
eval_df_w_preds$SEX <- case_match(x, "M" ~ "Male", "z_F" ~ "Female", .default = x)
eval_df_w_preds$SEX <- factor(eval_df_w_preds$SEX, levels = c("Male", "Female"))

# urban city - 1 if urban, 0 if rural
x <- eval_df_w_preds$URBANICITY
eval_df_w_preds$URBANICITY <- case_match(x, "Highly Urban/ Urban" ~ "Urban",
                                 "z_Highly Rural/ Rural" ~ "Rural", .default = x)
eval_df_w_preds <- eval_df_w_preds |>
  mutate(URBANICITY = as.factor(ifelse(URBANICITY == "Urban", 1, 0)))
vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
eval_df_w_preds <- eval_df_w_preds |>
    mutate(across(all_of(vars), ~gsub("\\$|,", "", .) |> as.integer()))
drop <- c("INCOME", "HOME_VAL")
eval_df_w_preds <- eval_df_w_preds |>
    mutate(INCOME_THOU = INCOME / 1000,
           HOME_VAL_THOU = HOME_VAL / 1000) |>
    select(-all_of(drop))
missing <- c("AGE", "INCOME_THOU", "YOJ", "HOME_VAL_THOU", "CAR_AGE", "JOB")
x <- names(eval_df_w_preds)
not_missing <- x[!x %in% missing]
init = mice(eval_df_w_preds, maxit=0) 
meth = init$method
predM = init$predictorMatrix
meth[not_missing] = ""
meth[c("AGE")] = "pmm" #Predictive mean matching
meth[c("INCOME_THOU")] = "pmm"
meth[c("YOJ")] = "pmm"
meth[c("HOME_VAL_THOU")] = "pmm"
meth[c("CAR_AGE")] = "pmm"
meth[c("JOB")] = "polyreg" #Polytomous (multinomial) logistic regression
imputed = mice(eval_df_w_preds, method=meth, predictorMatrix=predM, m=5,
                   printFlag = FALSE)
eval_df_w_preds <- complete(imputed)
eval_df_w_preds <- eval_df_w_preds |>
    mutate(PREDPROB = predict(model_blr_1, eval_df_w_preds,
                              type = "response"),
           TARGET_FLAG = as.factor(ifelse(PREDPROB > 0.5, 1, 0)),
           TARGET_AMT = case_when(TARGET_FLAG == 1 ~ predict(model_mlr_1,
                                                             eval_df_w_preds),
                                  TRUE ~ 0))
write.csv(eval_df_w_preds, file = 'HW4_Eval_PredProbs_Flags_Amounts.csv',
          row.names = FALSE)

eval_df_w_preds |> group_by(TARGET_FLAG) |> summarise(cnt=n())