Homework 4 - Binary Logistic Regression & Multiple Linear Regression

Introduction:

We load an auto insurance company dataset containing 8,161 records. Each record represents a customer, and each record has two response variables: TARGET_FLAG and TARGET_AMT. Below is a short description of all the variables of interest in the data set, including these response variables:

VARIABLE NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable None
TARGET_FLAG Was Car in a crash? 1=YES 0=NO None
TARGET_AMT If car was in a crash, what was the cost None
AGE Age of Driver Very young and very old people tend to be risky
BLUEBOOK Value of Vehicle Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_AGE Vehicle Age Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_TYPE Type of Car Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_USE Vehicle Use Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ # Claims (Past 5 Years) The more claims you filed in the past, the more you are likely to file in the future
EDUCATION Max Education Level Unknown, but possibly more educated people tend to drive more safely
HOMEKIDS # Children at Home Unknown
HOME_VAL Home Value Homeowners tend to drive safer
INCOME Income Rich people tend to be in fewer crashes
JOB Job Category White collar jobs tend to be safer
KIDSDRIV # Driving Children When teenagers drive your car, you are more likely to get into crashes
MSTATUS Marital Status Married people drive more safely
MVR_PTS Motor Vehicle Record Points If you get a lot of traffic tickets, you tend to get into more accidents
OLDCLAIM Total Claims (Past 5 Years) If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1 Single Parent Unknown
RED_CAR A Red Car Urban legend says that red cars (especially red sports cars) are more risky
REVOKED License Revoked (Past 7 Years) If your license was revoked in the past 7 years, you probably are a more risky driver
SEX Gender Urban legend says that women have fewer crashes than men
TIF Time in Force People who have been customers for a long time are usually more safe
TRAVTIME Distance to Work Long drives to work usually suggest greater risk
URBANICITY Home/Work Area Unknown
YOJ Years on Job People who stay at a job for a long time are usually more safe

Data Exploration:

We check the classes of our variables to determine whether any of them need to be coerced to numeric or other classes prior to exploratory data analysis.

Class Count Variables
character 14 BLUEBOOK, CAR_TYPE, CAR_USE, EDUCATION, HOME_VAL, INCOME, JOB, MSTATUS, OLDCLAIM, PARENT1, RED_CAR, REVOKED, SEX, URBANICITY
integer 11 AGE, CAR_AGE, CLM_FREQ, HOMEKIDS, INDEX, KIDSDRIV, MVR_PTS, TARGET_FLAG, TIF, TRAVTIME, YOJ
numeric 1 TARGET_AMT

INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are all character variables that will need to be coerced to integers after we strip the “$” from their strings. TARGET_FLAG and the remaining character variables will all need to be coerced to factors.
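
A minimal sketch of these coercions, assuming the loaded data frame is named df:

library(dplyr)

df <- df %>%
  mutate(
    across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM),
           ~ as.numeric(gsub("[$,]", "", .x))),             # strip "$" (and any ",") before coercing
    across(c(TARGET_FLAG, where(is.character)), as.factor)  # coerce the flag and remaining character variables to factors
  )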

We remove the identification variable INDEX and take a look at a summary of the dataset’s completeness.

rows 8161
columns 25
all_missing_columns 0
total_missing_values 2405
complete_rows 6045

None of our columns are completely devoid of data. There are 6,045 complete rows in the dataset, which is about 74% of our observations. There are 2,405 total missing values. We take a look at which variables contain these missing values and what the spread is.

A very small percentage of observations contain missing AGE values. The INCOME, YOJ, HOME_VAL, CAR_AGE, and JOB variables are each missing around 5.5 to 6.5 percent of values. There are no variables containing such extreme proportions of missing values that removal would be warranted on that basis alone.

To check whether the predictor variables are correlated with the binary response variable, we produce a correlation funnel that visualizes the strength of the relationships between our predictors and TARGET_FLAG. This correlation funnel will not include variables for which there are any missing values.
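
The funnel is produced along these lines; this is a hedged sketch, assuming the data frame is named df and that TARGET_FLAG has already been converted to a factor so that binarize() creates a TARGET_FLAG__1 column:

library(correlationfunnel)
library(dplyr)

df %>%
  select(-TARGET_AMT) %>%
  select(where(~ !any(is.na(.x)))) %>%           # the funnel only uses variables with no missing values
  binarize(n_bins = 4, thresh_infreq = 0.01) %>%
  correlate(target = TARGET_FLAG__1) %>%
  plot_correlation_funnel()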

The predictor variables without missing values that are most correlated with getting into a car crash are CLM_FREQ, URBANICITY, MVR_PTS, OLDCLAIM, PARENT1, REVOKED, and CAR_USE. Some of this is unsurprising. Increased claim frequency, increased numbers of traffic tickets, increased past payouts, having your license previously revoked, and using your car commercially all positively correlate with getting into a car crash, as we expected they would. We did not expect URBANICITY to be so relevant, but urban areas can often be more difficult to drive through and have more traffic, so that combination could reasonably make urban-dwellers more likely to get into car crashes, as the correlation suggests. We also did not expect PARENT1 to be so relevant, but the correlation between being a single parent and getting into a car crash is very similar to that of having your license previously revoked and getting into a car crash.

The predictor variables without missing values that are least correlated with getting into a car crash are SEX and RED_CAR. Being a woman has a very slight positive correlation with getting into a car crash, and driving a red car has a slightly negative correlation with getting into a car crash. These are contrary to urban legend, and more importantly they probably won’t be useful when modeling.

To check whether the predictor variables are correlated with the numeric response variable, we produce correlation plots that visualize the strength of the relationships between our predictors and TARGET_AMT (only when observations involve a car crash, as otherwise we know TARGET_AMT = 0). For readability, first we look at numeric predictors only.

It’s worth noting that BLUEBOOK is the single numeric variable most correlated with an increased TARGET_AMT, which is sensible. Cars that are currently still more valuable can be more expensive to fix. We expected CAR_AGE to be more negatively correlated with TARGET_AMT.

Next we look at two-level factors.

We see a small positive correlation between using your car commercially and TARGET_AMT, but an equally large negative correlation between being female and TARGET_AMT. The former is more intuitive than the latter, so ultimately neither may be a good predictor of TARGET_AMT.

Finally we look at factors with more than two levels.

The various car types don’t have as high of a correlation (either positively or negatively) with TARGET_AMT as expected, but we still believe CAR_TYPE will be somewhat useful for modeling.

Because we have so many variables, it would be difficult to check for and visualize collinearity for our responses and predictors all at the same time without setting a threshold. So we will set a correlation threshold of 0.45 (in absolute value) and only visualize variables with any correlation values at or above that level.
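
A hedged sketch of this thresholded plot, assuming the data frame is named df; factors are dummy-encoded so their levels can appear alongside the numeric variables, and rows with missing values are dropped by model.matrix:

library(ggcorrplot)

num_mat <- model.matrix(~ . - 1, data = df)    # dummy-encode factors; rows with NAs are dropped
cors    <- cor(num_mat)
diag(cors) <- 0                                # ignore self-correlations when thresholding
keep    <- apply(abs(cors) >= 0.45, 1, any)    # keep variables with any |correlation| >= 0.45
ggcorrplot(cors[keep, keep], lab = TRUE, lab_size = 2, tl.cex = 7)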

We see some expected collinearity. KIDSDRIV and HOMEKIDS are moderately positively correlated because teenagers driving your car depends on you having kids at all, but the number of teens driving your car won’t always exactly match the number of kids you have. HOME_VAL and INCOME are pretty positively correlated, as higher incomes lead to the ability to purchase higher valued homes. Not being married is also moderately negatively correlated with HOME_VAL, likely because married people often have two incomes instead of one and can therefore purchase higher valued homes. Having a PhD is equally correlated with being a doctor or lawyer, which makes sense because those jobs require them. Working a blue collar job is logically pretty negatively correlated with driving your car privately since driving your car commercially is itself a blue collar job. Being a woman is very negatively correlated with driving a red car. Lastly of note, claim frequency is moderately correlated with higher past payouts, which adds up.

We have 14 numeric variables and 11 categorical variables (including the dummy variable TARGET_FLAG). We list the possible ranges or values for each variable in the breakdown below:

Variable Type Values
AGE Numeric 16 - 81
BLUEBOOK Numeric 1500 - 69740
CAR_AGE Numeric -3 - 28
CLM_FREQ Numeric 0 - 5
HOME_VAL Numeric 0 - 885282
HOMEKIDS Numeric 0 - 5
INCOME Numeric 0 - 367030
KIDSDRIV Numeric 0 - 4
MVR_PTS Numeric 0 - 13
OLDCLAIM Numeric 0 - 57037
TARGET_AMT Numeric 0 - 107586.1
TIF Numeric 1 - 25
TRAVTIME Numeric 5 - 142
YOJ Numeric 0 - 23
CAR_TYPE Categorical Minivan, Panel Truck, Pickup, Sports Car, Van, z_SUV
CAR_USE Categorical Commercial, Private
EDUCATION Categorical <High School, Bachelors, Masters, PhD, z_High School
JOB Categorical Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student, z_Blue Collar
MSTATUS Categorical Yes, z_No
PARENT1 Categorical No, Yes
RED_CAR Categorical no, yes
REVOKED Categorical No, Yes
SEX Categorical M, z_F
TARGET_FLAG Categorical 0, 1
URBANICITY Categorical Highly Urban/ Urban, z_Highly Rural/ Rural

The ranges for TARGET_AMT, HOME_VAL, and INCOME all include zero, and recoding these zero values as NA will make analyzing summary statistics for these variables more meaningful than if we included zeroes in their calculations. (We will maintain a separate copy of the data, in which we do not introduce additional NA values, for later use when creating the fully imputed dataset that some of our models will rely on for completeness.)

The range for CAR_AGE includes -3. Since the variable can only take positive or zero values logically, and only one observation in the dataset has a negative sign, we make the assumption that the age of 3 years is correct for this observation, and the sign is simply a data entry error. We fix this observation.

Some of the factor levels are named inconsistently, so we will rename and relevel them in the next section.

Let’s take a look at the summary statistics for each variable.

##  TARGET_FLAG   TARGET_AMT           KIDSDRIV           AGE       
##  0:6008      Min.   :    30.28   Min.   :0.0000   Min.   :16.00  
##  1:2153      1st Qu.:  2609.78   1st Qu.:0.0000   1st Qu.:39.00  
##              Median :  4104.00   Median :0.0000   Median :45.00  
##              Mean   :  5702.18   Mean   :0.1711   Mean   :44.79  
##              3rd Qu.:  5787.00   3rd Qu.:0.0000   3rd Qu.:51.00  
##              Max.   :107586.14   Max.   :4.0000   Max.   :81.00  
##              NA's   :6008                         NA's   :6      
##     HOMEKIDS           YOJ           INCOME       PARENT1       HOME_VAL     
##  Min.   :0.0000   Min.   : 0.0   Min.   :     5   No :7084   Min.   : 50223  
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 34135   Yes:1077   1st Qu.:153074  
##  Median :0.0000   Median :11.0   Median : 58438              Median :206692  
##  Mean   :0.7212   Mean   :10.5   Mean   : 67259              Mean   :220621  
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 90053              3rd Qu.:270023  
##  Max.   :5.0000   Max.   :23.0   Max.   :367030              Max.   :885282  
##                   NA's   :454    NA's   :1060                NA's   :2758    
##  MSTATUS      SEX               EDUCATION               JOB      
##  Yes :4894   M  :3786   <High School :1203   z_Blue Collar:1825  
##  z_No:3267   z_F:4375   Bachelors    :2242   Clerical     :1271  
##                         Masters      :1658   Professional :1117  
##                         PhD          : 728   Manager      : 988  
##                         z_High School:2330   Lawyer       : 835  
##                                              (Other)      :1599  
##                                              NA's         : 526  
##     TRAVTIME            CAR_USE        BLUEBOOK          TIF        
##  Min.   :  5.00   Commercial:3029   Min.   : 1500   Min.   : 1.000  
##  1st Qu.: 22.00   Private   :5132   1st Qu.: 9280   1st Qu.: 1.000  
##  Median : 33.00                     Median :14440   Median : 4.000  
##  Mean   : 33.49                     Mean   :15710   Mean   : 5.351  
##  3rd Qu.: 44.00                     3rd Qu.:20850   3rd Qu.: 7.000  
##  Max.   :142.00                     Max.   :69740   Max.   :25.000  
##                                                                     
##         CAR_TYPE    RED_CAR       OLDCLAIM        CLM_FREQ      REVOKED   
##  Minivan    :2145   no :5783   Min.   :    0   Min.   :0.0000   No :7161  
##  Panel Truck: 676   yes:2378   1st Qu.:    0   1st Qu.:0.0000   Yes:1000  
##  Pickup     :1389              Median :    0   Median :0.0000             
##  Sports Car : 907              Mean   : 4037   Mean   :0.7986             
##  Van        : 750              3rd Qu.: 4636   3rd Qu.:2.0000             
##  z_SUV      :2294              Max.   :57037   Max.   :5.0000             
##                                                                           
##     MVR_PTS          CAR_AGE                       URBANICITY  
##  Min.   : 0.000   Min.   : 0.000   Highly Urban/ Urban  :6492  
##  1st Qu.: 0.000   1st Qu.: 1.000   z_Highly Rural/ Rural:1669  
##  Median : 1.000   Median : 8.000                               
##  Mean   : 1.696   Mean   : 8.329                               
##  3rd Qu.: 3.000   3rd Qu.:12.000                               
##  Max.   :13.000   Max.   :28.000                               
##                   NA's   :510

The majority of observations live/work in a highly urban or urban area. There are more married than unmarried observations, and there are also more female than male observations. The median driver is 45 years old, has been at their job for 11 years, and has an income of roughly $58,500. Most cars in the dataset are driven for private use rather than commercially, and the median car age is 8 years.

6,008 observations, which is the majority of observations, do not involve car crashes, and we now correctly record 6,008 NA observations for TARGET_AMT. (Since we introduced NA values for TARGET_AMT on purpose, we will not consider imputing them.)

There are 6 NA values in AGE, 510 in CAR_AGE, 454 in YOJ, 1,060 in INCOME, 2,758 in HOME_VAL, and 526 in JOB. In the next section, we will impute all these missing values in an alternate version of our dataset, as we mentioned earlier, and in the main version of our dataset, we will only impute the variables if we determine their data is at least Missing at Random (MAR), and there’s no other evidence we should exclude them from imputation.

We check whether there is evidence that the data are Missing Completely at Random (MCAR), a higher standard than MAR, using the mcar_test function from the naniar package. Meeting this standard is unlikely with real data, but still worth checking.
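
A minimal sketch of the test, assuming the data frame is named df and that the test is run on the numeric columns, which mcar_test handles most naturally:

library(naniar)
library(dplyr)

df %>%
  select(where(is.numeric)) %>%
  mcar_test()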

statistic df p.value missing.patterns
16862.3 1116 0 51

The low p-value provides evidence that missing data on these variables are not MCAR.

Excluding AGE, whose missing-value count is small and which we plan to impute anyway, we check whether missingness in each of the other variables is associated with any of the other predictors or the response variables using the missing_compare function from the finalfit package. Because there are so many variables, we only display comparisons with p-values below 0.05, dropping observed variables that show no evidence of accounting for a variable's missingness.
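
The comparisons below were produced with calls along these lines; a hedged sketch for one dependent variable (INCOME), assuming the data frame is named df, with the same call repeated for HOME_VAL and JOB:

library(finalfit)

explanatory <- setdiff(names(df), c("INCOME", "TARGET_AMT"))
missing_compare(df, dependent = "INCOME", explanatory = explanatory)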

Dependent Explanatory Ref Not Missing Missing p
INCOME TARGET_FLAG 0 5308 (88.3) 700 (11.7) 0.001
INCOME TARGET_FLAG 1 1793 (83.3) 360 (16.7) NA
INCOME AGE Mean (SD) 44.9 (8.6) 43.8 (9.0) 0.001
INCOME HOMEKIDS Mean (SD) 0.7 (1.1) 0.9 (1.2) 0.001
INCOME YOJ Mean (SD) 11.4 (2.8) 4.3 (5.8) 0.001
INCOME PARENT1 No 6188 (87.4) 896 (12.6) 0.022
INCOME PARENT1 Yes 913 (84.8) 164 (15.2) NA
INCOME HOME_VAL Mean (SD) 227842.0 (93771.4) 155319.7 (92741.6) 0.001
INCOME SEX M 3420 (90.3) 366 (9.7) 0.001
INCOME SEX z_F 3681 (84.1) 694 (15.9) NA
INCOME EDUCATION <High School 982 (81.6) 221 (18.4) 0.001
INCOME EDUCATION Bachelors 1972 (88.0) 270 (12.0) NA
INCOME EDUCATION Masters 1547 (93.3) 111 (6.7) NA
INCOME EDUCATION PhD 652 (89.6) 76 (10.4) NA
INCOME EDUCATION z_High School 1948 (83.6) 382 (16.4) NA
INCOME JOB Clerical 1198 (94.3) 73 (5.7) 0.001
INCOME JOB Doctor 232 (94.3) 14 (5.7) NA
INCOME JOB Home Maker 308 (48.0) 333 (52.0) NA
INCOME JOB Lawyer 792 (94.9) 43 (5.1) NA
INCOME JOB Manager 937 (94.8) 51 (5.2) NA
INCOME JOB Professional 1055 (94.4) 62 (5.6) NA
INCOME JOB Student 350 (49.2) 362 (50.8) NA
INCOME JOB z_Blue Collar 1727 (94.6) 98 (5.4) NA
INCOME CAR_USE Commercial 2675 (88.3) 354 (11.7) 0.008
INCOME CAR_USE Private 4426 (86.2) 706 (13.8) NA
INCOME BLUEBOOK Mean (SD) 16199.2 (8430.5) 12432.0 (7574.9) 0.001
INCOME TIF Mean (SD) 5.4 (4.2) 5.1 (4.0) 0.045
INCOME CAR_TYPE Minivan 1922 (89.6) 223 (10.4) 0.001
INCOME CAR_TYPE Panel Truck 632 (93.5) 44 (6.5) NA
INCOME CAR_TYPE Pickup 1225 (88.2) 164 (11.8) NA
INCOME CAR_TYPE Sports Car 729 (80.4) 178 (19.6) NA
INCOME CAR_TYPE Van 683 (91.1) 67 (8.9) NA
INCOME CAR_TYPE z_SUV 1910 (83.3) 384 (16.7) NA
INCOME RED_CAR no 4974 (86.0) 809 (14.0) 0.001
INCOME RED_CAR yes 2127 (89.4) 251 (10.6) NA
INCOME CLM_FREQ Mean (SD) 0.8 (1.2) 0.9 (1.2) 0.048
INCOME CAR_AGE Mean (SD) 8.5 (5.7) 7.2 (5.3) 0.001
INCOME URBANICITY Highly Urban/ Urban 5753 (88.6) 739 (11.4) 0.001
INCOME URBANICITY z_Highly Rural/ Rural 1348 (80.8) 321 (19.2) NA
HOME_VAL TARGET_FLAG 0 4217 (70.2) 1791 (29.8) 0.001
HOME_VAL TARGET_FLAG 1 1186 (55.1) 967 (44.9) NA
HOME_VAL AGE Mean (SD) 45.4 (8.5) 43.5 (8.7) 0.001
HOME_VAL HOMEKIDS Mean (SD) 0.7 (1.1) 0.8 (1.1) 0.009
HOME_VAL YOJ Mean (SD) 11.1 (3.7) 9.3 (4.6) 0.001
HOME_VAL INCOME Mean (SD) 68771.2 (44434.0) 63968.7 (48518.0) 0.001
HOME_VAL PARENT1 No 5055 (71.4) 2029 (28.6) 0.001
HOME_VAL PARENT1 Yes 348 (32.3) 729 (67.7) NA
HOME_VAL MSTATUS Yes 4267 (87.2) 627 (12.8) 0.001
HOME_VAL MSTATUS z_No 1136 (34.8) 2131 (65.2) NA
HOME_VAL EDUCATION <High School 729 (60.6) 474 (39.4) 0.001
HOME_VAL EDUCATION Bachelors 1545 (68.9) 697 (31.1) NA
HOME_VAL EDUCATION Masters 1166 (70.3) 492 (29.7) NA
HOME_VAL EDUCATION PhD 474 (65.1) 254 (34.9) NA
HOME_VAL EDUCATION z_High School 1489 (63.9) 841 (36.1) NA
HOME_VAL JOB Clerical 913 (71.8) 358 (28.2) 0.001
HOME_VAL JOB Doctor 154 (62.6) 92 (37.4) NA
HOME_VAL JOB Home Maker 456 (71.1) 185 (28.9) NA
HOME_VAL JOB Lawyer 596 (71.4) 239 (28.6) NA
HOME_VAL JOB Manager 703 (71.2) 285 (28.8) NA
HOME_VAL JOB Professional 817 (73.1) 300 (26.9) NA
HOME_VAL JOB Student 100 (14.0) 612 (86.0) NA
HOME_VAL JOB z_Blue Collar 1309 (71.7) 516 (28.3) NA
HOME_VAL CAR_USE Commercial 1942 (64.1) 1087 (35.9) 0.002
HOME_VAL CAR_USE Private 3461 (67.4) 1671 (32.6) NA
HOME_VAL BLUEBOOK Mean (SD) 16073.5 (8388.1) 14997.6 (8437.5) 0.001
HOME_VAL OLDCLAIM Mean (SD) 3726.1 (8512.2) 4646.3 (9245.6) 0.001
HOME_VAL CLM_FREQ Mean (SD) 0.7 (1.1) 0.9 (1.2) 0.001
HOME_VAL REVOKED No 4801 (67.0) 2360 (33.0) 0.001
HOME_VAL REVOKED Yes 602 (60.2) 398 (39.8) NA
HOME_VAL MVR_PTS Mean (SD) 1.6 (2.0) 1.9 (2.3) 0.001
HOME_VAL CAR_AGE Mean (SD) 8.4 (5.7) 8.1 (5.7) 0.012
HOME_VAL URBANICITY Highly Urban/ Urban 4345 (66.9) 2147 (33.1) 0.007
HOME_VAL URBANICITY z_Highly Rural/ Rural 1058 (63.4) 611 (36.6) NA
JOB KIDSDRIV Mean (SD) 0.2 (0.5) 0.1 (0.4) 0.005
JOB AGE Mean (SD) 44.7 (8.7) 46.5 (8.0) 0.001
JOB HOMEKIDS Mean (SD) 0.7 (1.1) 0.4 (0.9) 0.001
JOB YOJ Mean (SD) 10.4 (4.2) 11.3 (2.7) 0.001
JOB INCOME Mean (SD) 63334.1 (42157.2) 118852.9 (58861.8) 0.001
JOB PARENT1 No 6601 (93.2) 483 (6.8) 0.001
JOB PARENT1 Yes 1034 (96.0) 43 (4.0) NA
JOB HOME_VAL Mean (SD) 213485.5 (89924.5) 322080.5 (121344.9) 0.001
JOB SEX M 3365 (88.9) 421 (11.1) 0.001
JOB SEX z_F 4270 (97.6) 105 (2.4) NA
JOB EDUCATION <High School 1203 (100.0) 0 (0.0) 0.001
JOB EDUCATION Bachelors 2242 (100.0) 0 (0.0) NA
JOB EDUCATION Masters 1330 (80.2) 328 (19.8) NA
JOB EDUCATION PhD 530 (72.8) 198 (27.2) NA
JOB EDUCATION z_High School 2330 (100.0) 0 (0.0) NA
JOB CAR_USE Commercial 2557 (84.4) 472 (15.6) 0.001
JOB CAR_USE Private 5078 (98.9) 54 (1.1) NA
JOB BLUEBOOK Mean (SD) 15161.5 (8018.6) 23669.5 (9952.7) 0.001
JOB CAR_TYPE Minivan 2123 (99.0) 22 (1.0) 0.001
JOB CAR_TYPE Panel Truck 435 (64.3) 241 (35.7) NA
JOB CAR_TYPE Pickup 1265 (91.1) 124 (8.9) NA
JOB CAR_TYPE Sports Car 902 (99.4) 5 (0.6) NA
JOB CAR_TYPE Van 634 (84.5) 116 (15.5) NA
JOB CAR_TYPE z_SUV 2276 (99.2) 18 (0.8) NA
JOB RED_CAR no 5510 (95.3) 273 (4.7) 0.001
JOB RED_CAR yes 2125 (89.4) 253 (10.6) NA
JOB OLDCLAIM Mean (SD) 3980.4 (8722.8) 4859.5 (9501.7) 0.026
JOB CLM_FREQ Mean (SD) 0.8 (1.2) 1.0 (1.3) 0.001
JOB CAR_AGE Mean (SD) 7.9 (5.6) 14.0 (4.6) 0.001
JOB URBANICITY Highly Urban/ Urban 5987 (92.2) 505 (7.8) 0.001
JOB URBANICITY z_Highly Rural/ Rural 1648 (98.7) 21 (1.3) NA

There is evidence that some of the missingness for INCOME, HOME_VAL, and JOB can be explained by other observed information, so they could be considered Missing at Random (MAR). There is no evidence missing values for CAR_AGE or YOJ can be explained by other observed information, so we will no longer consider imputing them in the main version of our dataset.

It’s reasonable to assume that the missing values in YOJ, HOME_VAL, INCOME and JOB might all be related because money, employment, and assets are interconnected. Therefore the missingness of one or more of these variables might be dependent on the missingness of one or more of the others. Let’s look at the overlap of observations with missing values for these variables using the missing_plot function from the finalfit package.

We do see some overlap in the observations that have missing values for these variables, but it’s hard to detect anything more conclusive from this plot. To take a closer look at the patterns of missingness between these variables, we can use the missing_pattern function from the finalfit package. (Note that in the visualization that follows, the numbers along the bottom axis are unfortunately illegible, but they are just the column-wise counts of missing values for each variable, plus a sum of missing values for all variables, and we have already remarked on these totals.)
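
A minimal sketch of both visualizations, assuming the data frame is named df:

library(finalfit)
library(dplyr)

money_vars <- c("YOJ", "INCOME", "HOME_VAL", "JOB")

df %>% select(all_of(money_vars)) %>% missing_plot()     # row-by-row missingness
df %>% select(all_of(money_vars)) %>% missing_pattern()  # counts of each missingness pattern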

Here, we see several patterns of missingness worth noting. 814 observations are missing two out of these four variables, and 49 observations are missing three. Of the observations that are missing HOME_VAL, 483 are also missing INCOME, 154 are also missing JOB, and 109 are also missing YOJ. Due to these patterns of related missingness, we will no longer consider imputing these variables in the main version of our dataset.

Let’s take a look at the distributions of the numeric variables.

The distribution for AGE is approximately normal. The distribution for YOJ is left-skewed. The distributions for TARGET_AMT, KIDSDRIV, HOMEKIDS, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ, MVR_PTS, and CAR_AGE are all right-skewed. 75% of observations for TARGET_AMT are at or below $5,787.00, but the maximum value recorded is $107,586.14.

Let’s also take a look at the distributions of the categorical variables. First, we look at the distributions for categorical variables with only two levels.

Looking at PARENT1 and REVOKED, we can see that single parents represent relatively few observations in the dataset, as do people whose licenses were revoked in the past seven years. MSTATUS and SEX are the most evenly split categorical variables with two levels in the dataset.

Next we look at the distributions for the categorical variables with more than two levels.

The most common profession represented in the observations is blue collar, and the most commonly represented cars are the SUV and the minivan. The numbers of observations with high school diplomas and bachelor's degrees are fairly similar. Having less or more education is less common.

Data Preparation

First, we rename and relevel the inconsistently named and leveled factor variables we noted earlier. A summary of only the factors we changed the levels for is below, with the first level in each list always being the reference level. For variables which have “Yes” and “No” values, we will replace these with 1/0 (1 = “Yes”, 0 = “No”).

Factor New Levels
CAR_TYPE Minivan, Panel Truck, Pickup, Sports Car, SUV, Van
EDUCATION <High School, High School, Bachelors, Masters, PhD
JOB Blue Collar, Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student
PARENT1 0, 1
MSTATUS 0, 1
RED_CAR 0, 1
REVOKED 0, 1
SEX Male, Female
URBANICITY 0, 1
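
A minimal sketch of the renaming and releveling for two of these factors, assuming the data frame is named df and using forcats (part of the tidyverse):

library(dplyr)
library(forcats)

df <- df %>%
  mutate(
    EDUCATION = EDUCATION %>%
      fct_recode("High School" = "z_High School") %>%
      fct_relevel("<High School", "High School", "Bachelors", "Masters", "PhD"),
    MSTATUS = MSTATUS %>%
      fct_recode("1" = "Yes", "0" = "z_No") %>%
      fct_relevel("0", "1")
  )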

We reduce the scale of the INCOME and HOME_VAL variables to thousands of dollars so the figures will be more readable when visualized. The replacement variables are INCOME_THOU and HOME_VAL_THOU.

Some observations list Student as their occupation as well as a value for YOJ. We recode these values as NA. The most likely interpretation is that people incorrectly listed how many years they’ve been in school here, which will not be useful to our analysis.

Based on the descriptions of some of the variables and their theoretical effects on the target variables, and to handle the variables with missing data that we chose not to impute (including those whose zero or clearly incorrect values we replaced with NA), we create several factors that we believe will be helpful when building models (a sketch of their construction follows the list):

  • HOME_VAL_CAT (Levels based on HOME_VAL_THOU = “<=250K”, “251-500K”, “501-750K”, “751K+”, “”)

  • HOMEOWNER (1 = HOME_VAL_THOU not NA)

  • INCOME_CAT (Levels based on INCOME = “<=50K”, “51-100K”, “101-150K”, “151K+”, “”)

  • INCOME_FLAG (1 = INCOME_THOU not NA)

  • KIDSDRIV_FLAG (1 = KIDSDRIV number of children > 0)

  • HOMEKIDS_FLAG (1 = HOMEKIDS number of children > 0)

  • EMPLOYED (1 = JOB not NA/Student/Home Maker)

  • CAR_AGE_CAT (Levels based on CAR_AGE = “<=4”, “5-8”, “9-12”, “13+”, “”)

  • WHITE_COLLAR (1 = JOB not NA/Student/Home Maker/Blue Collar)
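
A minimal sketch of a few of these derived variables, assuming the data frame is named df and that INCOME_THOU and HOME_VAL_THOU already exist; the empty-string level used for missing categories is omitted here for brevity:

library(dplyr)

df <- df %>%
  mutate(
    HOMEOWNER     = factor(if_else(!is.na(HOME_VAL_THOU), 1, 0)),
    INCOME_FLAG   = factor(if_else(!is.na(INCOME_THOU), 1, 0)),
    KIDSDRIV_FLAG = factor(if_else(KIDSDRIV > 0, 1, 0)),
    HOMEKIDS_FLAG = factor(if_else(HOMEKIDS > 0, 1, 0)),
    EMPLOYED      = factor(if_else(!is.na(JOB) & !JOB %in% c("Student", "Home Maker"), 1, 0)),
    WHITE_COLLAR  = factor(if_else(EMPLOYED == 1 & JOB != "Blue Collar", 1, 0)),
    INCOME_CAT    = cut(INCOME_THOU, breaks = c(-Inf, 50, 100, 150, Inf),
                        labels = c("<=50K", "51-100K", "101-150K", "151K+"),
                        ordered_result = TRUE)
  )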

We then split both the main version of our dataset and the alternate version we created earlier into train and test sets. The main version will have all the derived variables we just created, imputed values for the AGE variable, and any transformations we make. The alternate version will not include any derived variables or transformations, but it will include imputed values for all variables with missing values.

We impute missing data in the main train and test sets for one numeric variable, AGE, using the mean value since it is normally distributed.
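
A minimal sketch of this imputation, assuming the main train and test sets are named train_df and test_df and that the training mean is used for both sets (the report may have imputed each set separately):

age_mean <- mean(train_df$AGE, na.rm = TRUE)

train_df$AGE[is.na(train_df$AGE)] <- age_mean
test_df$AGE[is.na(test_df$AGE)]   <- age_mean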

We take a look at the distributions for our imputed variable to see if the distributions of this variable in the train and test sets differ from what we originally observed or between sets.

The distributions in the train and test sets for AGE are similar to each other and to its original distribution.

We impute missing data in the alternate train and test sets for all variables with missing values using the mice package.
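
A minimal sketch of the mice imputation for the alternate training set (the same steps apply to the alternate test set); the object names and mice settings here are assumptions:

library(mice)

imp <- mice(alt_train_df, m = 5, maxit = 5, seed = 123, printFlag = FALSE)
alt_train_df_imputed <- complete(imp, 1)   # keep a single completed dataset for modeling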

We confirm there are no longer any missing values in the alternate train or test datasets.

## [1] TRUE

We take a look at the distributions for the imputed numeric variables to see if their distributions in the alternate train and test sets differ from what we originally observed or between sets.

The distributions for the imputed numeric variables don’t differ between the alternate train and test sets or from what we originally observed.

We also perform the same check for the single categorical variable we imputed in the alternate train and test sets: JOB.

The distributions in the alternate train and test sets for the single imputed categorical variable, JOB, are similar to each other, and the rankings of most frequent to least frequent occupation here are similar to the rankings of the original distribution. We note that the “Professional” and “Manager” occupations are more tied in the rankings here than they were in the original distribution, however.

Since the distributions of some of our numeric variables are skewed, we transform the data for some of them. In the main dataset, we exclude any numeric variables with missing values that we decided not to impute and for which we have already created factors, as well as the response variable TARGET_AMT. We also use the alternate dataset, which as a reminder has no missing values, as the basis for a third version of the data, in which every skewed numeric predictor and the response variable TARGET_AMT have all been transformed.

Below is a breakdown of the variables, the ideal lambdas proposed by Box-Cox, and the reasonable alternative transformations we have chosen to make in the main dataset (a sketch of the lambda estimation follows the table):

Skewed Variable Ideal Lambda Proposed by Box-Cox Reasonable Alternative Transformation
TRAVTIME 0.7 no transformation
BLUEBOOK 0.45 square root
TIF 0.25 log
OLDCLAIM -0.1 log
CLM_FREQ -0.2 log
MVR_PTS 0.05 log
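
A minimal sketch of how a lambda could be estimated for one predictor, assuming the training data is named train_df and that zero-containing variables were shifted by a small constant before estimation:

library(MASS)

bc <- boxcox(lm(TIF ~ 1, data = train_df), lambda = seq(-2, 2, 0.05), plotit = FALSE)
bc$x[which.max(bc$y)]            # lambda maximizing the profile log-likelihood

bc0 <- boxcox(lm(MVR_PTS + 1 ~ 1, data = train_df), lambda = seq(-2, 2, 0.05), plotit = FALSE)
bc0$x[which.max(bc0$y)]          # MVR_PTS contains zeros, so it is shifted by 1 here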

We check whether the distributions of the transformed variables now differ between the train and test sets.

They do not.

Below is a breakdown of the variables, the ideal lambdas proposed by Box-Cox, and the reasonable alternative transformations we have chosen to make in the third version of the data.

Skewed Variable Ideal Lambda Proposed by Box-Cox Reasonable Alternative Transformation
TARGET_AMT -0.2 log
YOJ 0.65 no transformation
TRAVTIME 0.7 no transformation
KIDSDRIV -1.15 inverse
HOMEKIDS -0.25 log
BLUEBOOK 0.45 square root
TIF 0.25 log
OLDCLAIM -0.1 log
CLM_FREQ -0.2 log
MVR_PTS 0.05 log
INCOME_THOU 0.45 square root
HOME_VAL_THOU 0.2 log
CAR_AGE 0.55 square root

Build Models

Binary Logistic Regression Models

Model BLR:1 - Full Model Using Original, Untransformed Variables, with All Missing Values Imputed - Reduced via Stepwise AIC Model Selection

We create Model BLR:1, our baseline binary logistic regression model based on all the original, untransformed variables, with all missing values imputed so that no observations or predictors have to be excluded from the model. Then we perform stepwise model selection to select the model with the smallest AIC value using the stepAIC() function from the MASS package.
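
A minimal sketch of this model, assuming the fully imputed training data is named alt_train_df_imputed and the resulting model object is named blr1:

library(MASS)

blr1_full <- glm(TARGET_FLAG ~ . - TARGET_AMT, family = "binomial",
                 data = alt_train_df_imputed)
blr1 <- stepAIC(blr1_full, direction = "both", trace = FALSE)
summary(blr1)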

A summary of Model BLR:1 is below:

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + PARENT1 + MSTATUS + 
##     EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + 
##     OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY + 
##     INCOME_THOU + HOME_VAL_THOU, family = "binomial", data = alt_train_df_imputed)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.650e+00  2.392e-01 -11.081  < 2e-16 ***
## KIDSDRIV              4.158e-01  7.151e-02   5.814 6.09e-09 ***
## HOMEKIDS              5.995e-02  4.059e-02   1.477 0.139653    
## PARENT11              3.424e-01  1.302e-01   2.630 0.008531 ** 
## MSTATUS1             -5.248e-01  1.019e-01  -5.149 2.62e-07 ***
## EDUCATIONHigh School -2.886e-01  1.383e-01  -2.086 0.036951 *  
## EDUCATIONBachelors   -3.325e-02  1.132e-01  -0.294 0.768865    
## EDUCATIONMasters     -2.060e-01  2.056e-01  -1.002 0.316383    
## EDUCATIONPhD          9.573e-02  2.454e-01   0.390 0.696413    
## JOBClerical           1.265e-01  1.265e-01   1.000 0.317240    
## JOBDoctor            -7.704e-01  2.858e-01  -2.696 0.007022 ** 
## JOBHome Maker        -8.142e-03  1.669e-01  -0.049 0.961080    
## JOBLawyer            -7.256e-02  1.984e-01  -0.366 0.714507    
## JOBManager           -8.184e-01  1.538e-01  -5.320 1.04e-07 ***
## JOBProfessional      -2.059e-01  1.383e-01  -1.489 0.136444    
## JOBStudent           -2.063e-01  1.474e-01  -1.400 0.161584    
## TRAVTIME              1.636e-02  2.248e-03   7.277 3.42e-13 ***
## CAR_USEPrivate       -7.800e-01  1.049e-01  -7.435 1.05e-13 ***
## BLUEBOOK             -1.894e-05  5.645e-06  -3.356 0.000791 ***
## TIF                  -5.528e-02  8.941e-03  -6.182 6.31e-10 ***
## CAR_TYPEPanel Truck   5.569e-01  1.803e-01   3.089 0.002009 ** 
## CAR_TYPEPickup        5.279e-01  1.201e-01   4.397 1.10e-05 ***
## CAR_TYPESports Car    9.627e-01  1.278e-01   7.535 4.87e-14 ***
## CAR_TYPESUV           6.849e-01  1.033e-01   6.628 3.41e-11 ***
## CAR_TYPEVan           6.620e-01  1.449e-01   4.568 4.93e-06 ***
## OLDCLAIM             -1.496e-05  4.689e-06  -3.190 0.001425 ** 
## CLM_FREQ              1.833e-01  3.384e-02   5.417 6.06e-08 ***
## REVOKED1              9.225e-01  1.105e-01   8.347  < 2e-16 ***
## MVR_PTS               1.312e-01  1.627e-02   8.066 7.24e-16 ***
## CAR_AGE              -1.420e-02  9.019e-03  -1.574 0.115487    
## URBANICITY1           2.385e+00  1.336e-01  17.849  < 2e-16 ***
## INCOME_THOU          -4.250e-03  1.335e-03  -3.184 0.001452 ** 
## HOME_VAL_THOU        -1.392e-03  4.149e-04  -3.354 0.000797 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6640.0  on 5736  degrees of freedom
## Residual deviance: 5107.1  on 5704  degrees of freedom
## AIC: 5173.1
## 
## Number of Fisher Scoring iterations: 5

The AIC of Model BLR:1 is 5173.1.

Feature Odds Ratio (Exponentiated Coefficient) Percentage Change in Odds of Car Crash
URBANICITY1 10.8643928 986.4
CAR_TYPESports Car 2.6188460 161.9
REVOKED1 2.5155223 151.6
CAR_TYPESUV 1.9835867 98.4
CAR_TYPEVan 1.9386614 93.9
CAR_TYPEPanel Truck 1.7452765 74.5
CAR_TYPEPickup 1.6953038 69.5
KIDSDRIV 1.5155229 51.6
PARENT11 1.4083716 40.8
CLM_FREQ 1.2011971 20.1
MVR_PTS 1.1402160 14.0
JOBClerical 1.1348452 13.5
EDUCATIONPhD 1.1004642 10.0
HOMEKIDS 1.0617818 6.2
TRAVTIME 1.0164960 1.6
BLUEBOOK 0.9999811 0.0
OLDCLAIM 0.9999850 0.0
HOME_VAL_THOU 0.9986093 -0.1
INCOME_THOU 0.9957593 -0.4
JOBHome Maker 0.9918908 -0.8
CAR_AGE 0.9859046 -1.4
EDUCATIONBachelors 0.9672964 -3.3
TIF 0.9462246 -5.4
JOBLawyer 0.9300077 -7.0
EDUCATIONMasters 0.8137984 -18.6
JOBProfessional 0.8139173 -18.6
JOBStudent 0.8135512 -18.6
EDUCATIONHigh School 0.7493022 -25.1
MSTATUS1 0.5916818 -40.8
JOBDoctor 0.4628116 -53.7
CAR_USEPrivate 0.4583885 -54.2
JOBManager 0.4411460 -55.9
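
The odds-ratio table above can be assembled from the fitted model along these lines; a hedged sketch, assuming the model object is named blr1:

library(dplyr)

tibble(Feature = names(coef(blr1))[-1],
       Odds_Ratio = exp(coef(blr1))[-1]) %>%                        # drop the intercept
  mutate(Pct_Change_in_Odds = round((Odds_Ratio - 1) * 100, 1)) %>%
  arrange(desc(Odds_Ratio))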

The coefficients for Model BLR:1 mostly match expectations. Using your car privately is one of the biggest reducers of the odds of a car crash. While we expected more educated people to drive more safely, having a high school education is the level that reduces the odds of a car crash the most. All non-blue collar jobs reduce the odds of a car crash, with doctor and manager seeing the largest reductions. The biggest increaser of the odds of a car crash is living/working in an urban area. Some other notable increasers are driving anything other than a minivan, especially a sports car; having had your license revoked; and having teenagers driving your car.

We check for possible multicollinearity within this model.

##                    GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV       1.306531  1        1.143036
## HOMEKIDS       1.830130  1        1.352823
## PARENT1        1.899743  1        1.378312
## MSTATUS        2.139221  1        1.462608
## EDUCATION      9.505786  4        1.325100
## JOB           12.372591  7        1.196831
## TRAVTIME       1.041548  1        1.020562
## CAR_USE        2.229513  1        1.493155
## BLUEBOOK       1.756755  1        1.325426
## TIF            1.010771  1        1.005371
## CAR_TYPE       2.573303  5        1.099130
## OLDCLAIM       1.665731  1        1.290632
## CLM_FREQ       1.459245  1        1.207992
## REVOKED        1.339087  1        1.157189
## MVR_PTS        1.150930  1        1.072814
## CAR_AGE        2.140672  1        1.463104
## URBANICITY     1.142613  1        1.068931
## INCOME_THOU    2.747964  1        1.657698
## HOME_VAL_THOU  2.009943  1        1.417725

The high GVIF values for EDUCATION and JOB are largely an artifact of their larger degrees of freedom. For such variables, the adjusted metric GVIF^(1/(2*Df)) is the appropriate one to examine, and its square is typically considered acceptable if it is below five, the usual VIF threshold. By that standard, we don't need to remove either EDUCATION or JOB.
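
A minimal sketch of this check, assuming the model object is named blr1:

library(car)

vif(blr1)   # reports GVIF, Df, and GVIF^(1/(2*Df)) when the model contains factors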

Model BLR:2 - Select Model Using Original & Derived, but Untransformed Variables, with Only AGE Values Imputed - Reduced via Stepwise AIC Model Selection

We create Model BLR:2, a second binary logistic regression model based on one combination of variables we believe could be the best predictors of TARGET_FLAG, including some original variables and some variables we derived from other variables, but no transformed variables. The only value we’ve imputed for this model is AGE.

A summary of Model BLR:2 is below:

## 
## Call:
## glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ + HOMEOWNER + INCOME_FLAG + 
##     EMPLOYED + WHITE_COLLAR + MSTATUS + PARENT1 + REVOKED + SEX + 
##     TRAVTIME, family = "binomial", data = train_df_imputed)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.593521   0.224284  -2.646 0.008138 ** 
## AGE           -0.014439   0.003959  -3.647 0.000266 ***
## CLM_FREQ       0.374499   0.025589  14.635  < 2e-16 ***
## HOMEOWNER1    -0.262771   0.081438  -3.227 0.001253 ** 
## INCOME_FLAG1  -0.478040   0.092563  -5.164 2.41e-07 ***
## EMPLOYED1      0.473987   0.107889   4.393 1.12e-05 ***
## WHITE_COLLAR1 -0.644178   0.075688  -8.511  < 2e-16 ***
## MSTATUS1      -0.241994   0.084976  -2.848 0.004402 ** 
## PARENT11       0.487306   0.103513   4.708 2.51e-06 ***
## REVOKED1       0.904666   0.087874  10.295  < 2e-16 ***
## SEXFemale      0.110879   0.065567   1.691 0.090822 .  
## TRAVTIME       0.008946   0.001991   4.494 6.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6640  on 5736  degrees of freedom
## Residual deviance: 5993  on 5725  degrees of freedom
## AIC: 6017
## 
## Number of Fisher Scoring iterations: 4

The AIC of Model BLR:2 is 6017.

Feature Odds Ratio (Exponentiated Coefficient) Percentage Change in Odds of Car Crash
REVOKED1 2.4711065 147.1
PARENT11 1.6279239 62.8
EMPLOYED1 1.6063864 60.6
CLM_FREQ 1.4542633 45.4
SEXFemale 1.1172602 11.7
TRAVTIME 1.0089865 0.9
AGE 0.9856651 -1.4
MSTATUS1 0.7850611 -21.5
HOMEOWNER1 0.7689178 -23.1
INCOME_FLAG1 0.6199974 -38.0
WHITE_COLLAR1 0.5250941 -47.5

In Model BLR:2, the largest reducer of the odds of being in a car crash is working a white collar job, and the largest odds increaser is having your license revoked. Being employed at all, i.e. having any job other than student or homemaker, strangely increases the odds. Since we understand the effects of the WHITE_COLLAR factor better than we understand the effects of the EMPLOYED factor, and the two describe overlapping information, we favor the WHITE_COLLAR factor here and remove the EMPLOYED factor. We don't reprint a summary, but the new AIC is 6034.6. As noted earlier, we cannot explain the association between being a single parent and increased car crash odds, but it is worth noting that it is the second largest odds increaser in this subset of predictors. Lastly, being a woman also slightly increases the odds of a car crash, contrary to our prior expectations.

We check for possible multicollinearity within this model.

##          AGE     CLM_FREQ    HOMEOWNER  INCOME_FLAG WHITE_COLLAR      MSTATUS 
##     1.152491     1.003834     1.456116     1.036460     1.059972     1.732481 
##      PARENT1      REVOKED          SEX     TRAVTIME 
##     1.485383     1.002159     1.043544     1.005053

All of the variance inflation factors are less than five, so there are no issues of multicollinearity within this model.

Model BLR:3 - Select Model Using Original, Derived, & Transformed Variables, with Only AGE Values Imputed - Reduced via Stepwise AIC Model Selection

We create Model BLR:3, a third binary logistic regression model based on another combination of variables we believe could be the best predictors of TARGET_FLAG, including some original variables, some variables we derived from other variables, and some variables we transformed. The only value we’ve imputed for this model is AGE.

##  [1] "AGE"           "CLM_FREQ_LOG"  "URBANICITY"    "MVR_PTS_LOG"  
##  [5] "OLDCLAIM_LOG"  "PARENT1"       "REVOKED"       "CAR_USE"      
##  [9] "CAR_TYPE"      "MSTATUS"       "EDUCATION"     "KIDSDRIV_FLAG"
## [13] "INCOME_CAT"    "EMPLOYED"      "HOMEOWNER"     "WHITE_COLLAR"

In choosing some of these variables, we excluded others for which collinearity might be an issue. That is, our factor describing income was chosen over the home value factor, the kids driving factor was chosen over the kids at home factor, and the education factor was chosen over the job factor.

A summary of Model BLR:3 is below:

## 
## Call:
## glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ_LOG + URBANICITY + 
##     MVR_PTS_LOG + OLDCLAIM_LOG + PARENT1 + REVOKED + CAR_USE + 
##     CAR_TYPE + MSTATUS + EDUCATION + KIDSDRIV_FLAG + INCOME_CAT + 
##     EMPLOYED + HOMEOWNER + WHITE_COLLAR, family = "binomial", 
##     data = train_df_trans)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.923096   0.381740  -2.418 0.015601 *  
## AGE                  -0.008696   0.004280  -2.032 0.042169 *  
## CLM_FREQ_LOG          0.248283   0.074101   3.351 0.000806 ***
## URBANICITY1           2.089626   0.130303  16.037  < 2e-16 ***
## MVR_PTS_LOG           0.061683   0.009240   6.676 2.46e-11 ***
## OLDCLAIM_LOG         -0.088658   0.035745  -2.480 0.013129 *  
## PARENT11              0.315903   0.116011   2.723 0.006468 ** 
## REVOKED1              0.857847   0.099265   8.642  < 2e-16 ***
## CAR_USEPrivate       -0.736742   0.104881  -7.025 2.15e-12 ***
## CAR_TYPEPanel Truck   0.151060   0.159413   0.948 0.343335    
## CAR_TYPEPickup        0.525694   0.116545   4.511 6.46e-06 ***
## CAR_TYPESports Car    1.015507   0.123022   8.255  < 2e-16 ***
## CAR_TYPESUV           0.756766   0.099151   7.632 2.30e-14 ***
## CAR_TYPEVan           0.496401   0.139351   3.562 0.000368 ***
## MSTATUS1             -0.497166   0.092775  -5.359 8.38e-08 ***
## EDUCATIONHigh School -0.191783   0.109515  -1.751 0.079912 .  
## EDUCATIONBachelors   -0.718931   0.122381  -5.875 4.24e-09 ***
## EDUCATIONMasters     -0.786428   0.137006  -5.740 9.46e-09 ***
## EDUCATIONPhD         -1.095949   0.167519  -6.542 6.06e-11 ***
## KIDSDRIV_FLAG1        0.720830   0.101306   7.115 1.12e-12 ***
## INCOME_CAT.L          0.092508   0.076456   1.210 0.226299    
## INCOME_CAT.Q          0.368222   0.088759   4.149 3.35e-05 ***
## INCOME_CAT.C          0.233418   0.086987   2.683 0.007289 ** 
## EMPLOYED1             0.169179   0.126425   1.338 0.180839    
## HOMEOWNER1           -0.252892   0.086945  -2.909 0.003630 ** 
## WHITE_COLLAR1        -0.119849   0.105581  -1.135 0.256317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6640.0  on 5736  degrees of freedom
## Residual deviance: 5330.8  on 5711  degrees of freedom
## AIC: 5382.8
## 
## Number of Fisher Scoring iterations: 5

We remove the least statistically significant variable, WHITE_COLLAR, check the new summary, remove the only remaining statistically insignificant variable, EMPLOYED, and reprint only the final summary. We're slightly surprised these variables were significant in the previous model but not in this one; however, that could be because the INCOME_CAT factor supersedes both in this model.

## 
## Call:
## glm(formula = TARGET_FLAG ~ AGE + CLM_FREQ_LOG + URBANICITY + 
##     MVR_PTS_LOG + OLDCLAIM_LOG + PARENT1 + REVOKED + CAR_USE + 
##     CAR_TYPE + MSTATUS + EDUCATION + KIDSDRIV_FLAG + INCOME_CAT + 
##     HOMEOWNER, family = "binomial", data = train_df_trans)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.808918   0.372886  -2.169 0.030057 *  
## AGE                  -0.008346   0.004261  -1.959 0.050154 .  
## CLM_FREQ_LOG          0.248601   0.074047   3.357 0.000787 ***
## URBANICITY1           2.094943   0.130333  16.074  < 2e-16 ***
## MVR_PTS_LOG           0.061745   0.009238   6.684 2.33e-11 ***
## OLDCLAIM_LOG         -0.088956   0.035719  -2.490 0.012758 *  
## PARENT11              0.310786   0.115917   2.681 0.007338 ** 
## REVOKED1              0.856583   0.099228   8.632  < 2e-16 ***
## CAR_USEPrivate       -0.787060   0.085768  -9.177  < 2e-16 ***
## CAR_TYPEPanel Truck   0.097836   0.154648   0.633 0.526972    
## CAR_TYPEPickup        0.498138   0.114647   4.345 1.39e-05 ***
## CAR_TYPESports Car    1.012436   0.122911   8.237  < 2e-16 ***
## CAR_TYPESUV           0.755746   0.099107   7.626 2.43e-14 ***
## CAR_TYPEVan           0.473068   0.138070   3.426 0.000612 ***
## MSTATUS1             -0.511067   0.091593  -5.580 2.41e-08 ***
## EDUCATIONHigh School -0.207637   0.107779  -1.927 0.054040 .  
## EDUCATIONBachelors   -0.746540   0.117738  -6.341 2.29e-10 ***
## EDUCATIONMasters     -0.846365   0.129320  -6.545 5.96e-11 ***
## EDUCATIONPhD         -1.149027   0.162915  -7.053 1.75e-12 ***
## KIDSDRIV_FLAG1        0.726144   0.101196   7.176 7.20e-13 ***
## INCOME_CAT.L          0.075360   0.075063   1.004 0.315402    
## INCOME_CAT.Q          0.343078   0.086893   3.948 7.87e-05 ***
## INCOME_CAT.C          0.234976   0.086869   2.705 0.006831 ** 
## HOMEOWNER1           -0.230535   0.083544  -2.759 0.005790 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6640.0  on 5736  degrees of freedom
## Residual deviance: 5332.9  on 5713  degrees of freedom
## AIC: 5380.9
## 
## Number of Fisher Scoring iterations: 5

The AIC of Model BLR:3 is 5380.9.

Feature Odds Ratio (Exponentiated Coefficient) Percentage Change in Odds of Car Crash
URBANICITY1 8.1249789 712.5
CAR_TYPESports Car 2.7522987 175.2
REVOKED1 2.3551001 135.5
CAR_TYPESUV 2.1292001 112.9
KIDSDRIV_FLAG1 2.0670935 106.7
CAR_TYPEPickup 1.6456545 64.6
CAR_TYPEVan 1.6049102 60.5
INCOME_CAT.Q 1.4092787 40.9
PARENT11 1.3644972 36.4
CLM_FREQ_LOG 1.2822305 28.2
INCOME_CAT.C 1.2648788 26.5
CAR_TYPEPanel Truck 1.1027816 10.3
INCOME_CAT.L 1.0782723 7.8
MVR_PTS_LOG 1.0636907 6.4
AGE 0.9916890 -0.8
OLDCLAIM_LOG 0.9148860 -8.5
EDUCATIONHigh School 0.8125019 -18.7
HOMEOWNER1 0.7941084 -20.6
MSTATUS1 0.5998551 -40.0
EDUCATIONBachelors 0.4740035 -52.6
CAR_USEPrivate 0.4551809 -54.5
EDUCATIONMasters 0.4289714 -57.1
EDUCATIONPhD 0.3169449 -68.3

Interestingly, in Model BLR:3, education levels do reduce the odds of a car crash in the order expected. That is, having a PhD decreases the odds more than a Master's, having a Master's decreases the odds more than a Bachelor's, and having a Bachelor's decreases the odds more than having a High School Diploma. Otherwise, coefficients follow similar patterns to what we discussed with the first model. Private car use is one of the biggest car crash odds reducers; the biggest increaser of the odds of a car crash is living/working in an urban area; and driving anything other than a minivan, having had your license revoked, and having teenagers driving your car are all big odds increasers as well. The INCOME_CAT factor has the opposite effect from what we were expecting. Perhaps higher income categories are associated with higher car crash odds because incomes are usually higher in urban areas, and urban areas are strongly associated with higher car crash odds.

We check for possible multicollinearity within this model.

##                    GVIF Df GVIF^(1/(2*Df))
## AGE            1.192534  1        1.092032
## CLM_FREQ_LOG  68.540964  1        8.278947
## URBANICITY     1.102783  1        1.050135
## MVR_PTS_LOG    1.104418  1        1.050913
## OLDCLAIM_LOG  68.607176  1        8.282945
## PARENT1        1.583420  1        1.258340
## REVOKED        1.129558  1        1.062807
## CAR_USE        1.567173  1        1.251868
## CAR_TYPE       1.599229  5        1.048072
## MSTATUS        1.815128  1        1.347267
## EDUCATION      1.782365  4        1.074916
## KIDSDRIV_FLAG  1.106473  1        1.051890
## INCOME_CAT     1.533759  3        1.073889
## HOMEOWNER      1.437353  1        1.198897

OLDCLAIM_LOG and CLM_FREQ_LOG have variance inflation factors greater than five. Since we believe claim frequency has more to do with TARGET_FLAG, and past claim amounts have more to do with TARGET_AMT, we choose to remove OLDCLAIM_LOG from this model. We don’t reprint a summary, but the new AIC is 5385.1, and none of the variables have variance inflation factors greater than five any longer.

Multiple Linear Regression Models

Model MLR:1 - Full Model Using Original, Untransformed Variables, with All Missing Values Imputed - Reduced via Stepwise Backward Model Selection

We create Model MLR:1, our baseline multiple linear regression model based on all the original, untransformed variables, with all missing values imputed so that no observations or predictors have to be excluded from the model. Then we perform stepwise backward model selection.
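
A minimal sketch of this model, assuming the fully imputed training data is named alt_train_df_imputed and the resulting model object is named mlr1:

library(dplyr)

mlr1_full <- lm(TARGET_AMT ~ .,
                data = alt_train_df_imputed %>% filter(TARGET_FLAG == 1) %>% select(-TARGET_FLAG))
mlr1 <- step(mlr1_full, direction = "backward", trace = FALSE)
summary(mlr1)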

A summary of Model MLR:1 is below:

## 
## Call:
## lm(formula = TARGET_AMT ~ MSTATUS + SEX + BLUEBOOK + RED_CAR + 
##     REVOKED + MVR_PTS, data = select(filter(alt_train_df_imputed, 
##     TARGET_FLAG == 1), -TARGET_FLAG))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8227  -3166  -1587    411 100814 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3861.32677  554.66624   6.962  5.0e-12 ***
## MSTATUS1    -598.72182  414.43071  -1.445   0.1488    
## SEXFemale   1437.88719  570.49291   2.520   0.0118 *  
## BLUEBOOK       0.11586    0.02558   4.529  6.4e-06 ***
## RED_CAR1    -919.61610  626.13920  -1.469   0.1421    
## REVOKED1    -971.02786  511.74890  -1.897   0.0580 .  
## MVR_PTS      127.62768   80.12720   1.593   0.1114    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8073 on 1516 degrees of freedom
## Multiple R-squared:  0.02348,    Adjusted R-squared:  0.01962 
## F-statistic: 6.076 on 6 and 1516 DF,  p-value: 2.67e-06

Only a small number of variables remain, and they explain very little variance in our data. Some of them are not statistically significant by normal standards. We will leave them in and see whether they provide predictive power.

We check for multicollinearity within this model.

##  MSTATUS      SEX BLUEBOOK  RED_CAR  REVOKED  MVR_PTS 
## 1.002852 1.873415 1.008312 1.872006 1.006370 1.004409

None of the variance inflation factors are greater than five, so there are no multicollinearity issues to address for this model.

Model MLR:2 - Full Model Using Original and Transformed Variables, with All Missing Values Imputed - Reduced via Stepwise Backward Model Selection

We create Model MLR:2, a second multiple linear regression model using original and transformed variables (including the response variable), with all missing values imputed so that no observations or predictors have to be excluded from the model. Then we perform stepwise backward model selection.

A summary of Model MLR:2 is below:

## 
## Call:
## lm(formula = TARGET_AMT_LOG ~ MSTATUS + SEX + BLUEBOOK_SQRT, 
##     data = select(filter(alt_train_df_trans, TARGET_FLAG == 1), 
##         -TARGET_FLAG))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0399 -0.4154  0.0412  0.4118  3.2477 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.9364878  0.0761413 104.234  < 2e-16 ***
## MSTATUS1      -0.0779400  0.0412953  -1.887   0.0593 .  
## SEXFemale      0.0967186  0.0416973   2.320   0.0205 *  
## BLUEBOOK_SQRT  0.0028464  0.0006045   4.709 2.72e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8054 on 1519 degrees of freedom
## Multiple R-squared:  0.02127,    Adjusted R-squared:  0.01934 
## F-statistic: 11.01 on 3 and 1519 DF,  p-value: 3.777e-07

Again, only a small number of predictors remain, and they explain very little variance in our data.

We check for multicollinearity within this model.

##       MSTATUS           SEX BLUEBOOK_SQRT 
##      1.000487      1.005604      1.005612

There are no variance inflation factors greater than five, so there are no issues of multicollinearity to address.

Model MLR:3 - Robust Model Using Select Original and Transformed Variables

We create Model MLR:3, a robust model designed to deal with outliers using select original and transformed variables (including the response variable). The predictors were chosen from among those retained in the previous two models, as stepwise backward selection is not possible with a robust model, and the full robust model’s residual standard error was higher than that of this reduced model.

A summary of Model MLR:3 is below:

## 
## Call: rlm(formula = TARGET_AMT_LOG ~ MSTATUS + SEX + BLUEBOOK_SQRT + 
##     RED_CAR + REVOKED + MVR_PTS_LOG, data = select(filter(alt_train_df_trans, 
##     TARGET_FLAG == 1), -TARGET_FLAG))
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4.10955 -0.40365  0.03963  0.40579  3.26592 
## 
## Coefficients:
##               Value    Std. Error t value 
## (Intercept)     8.0765   0.0643   125.6189
## MSTATUS1       -0.0546   0.0343    -1.5935
## SEXFemale       0.0939   0.0472     1.9900
## BLUEBOOK_SQRT   0.0017   0.0005     3.3032
## RED_CAR1       -0.0269   0.0517    -0.5196
## REVOKED1        0.0043   0.0423     0.1020
## MVR_PTS_LOG     0.0042   0.0045     0.9331
## 
## Residual standard error: 0.6011 on 1516 degrees of freedom

Select Models

Binary Logistic Regression Models

To choose our binary logistic regression model, we consider that false positives would likely result in the company charging too high a premium for those customers, and false negatives would likely result in the company charging too low a premium for those customers. Therefore, the effects of those inaccurate predictions could be equally costly. False positive customers might jump to competitors offering them lower rates (perhaps because those competitors more accurately identified them as lower risk), and false negative customers might cost the company more in unanticipated claim costs. So we will rely primarily on the F1 Score, which incorporates both precision and recall to accurately classify positives while minimizing false positives and false negatives, to select the best model. However, we will look at other metrics and goodness of fit checks as well.

To first check for goodness of fit, we create marginal model plots for the response and each predictor in each binary logistic regression model. (Note that the mmps function from the car package, which we use to generate these plots, intentionally skips any factors and interaction terms within the models.)
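
A minimal sketch of generating these plots for one model, assuming the model object is named blr1:

library(car)

mmps(blr1)   # factors and interaction terms are skipped automatically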

The marginal model plots for Model BLR:1 reveal some small fit issues, mostly with TRAVTIME.

The marginal model plots for Model BLR:2 reveal more fit issues than Model BLR:1 had, but Model BLR:2 relies on fewer numeric variables than Model BLR:1, and these plots cannot visualize factors.

The marginal model plots for Model BLR:3 reveal one fit issue, for AGE. This model also relies on a lot of factors, which the marginal model plots cannot visualize.

We calculate the Hosmer-Lemeshow statistic for each model to further check for lack of fit.
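
A hedged sketch of the test for one model, assuming the test comes from the ResourceSelection package and that the model and data objects are named blr1 and alt_train_df_imputed:

library(ResourceSelection)

obs <- as.numeric(as.character(alt_train_df_imputed$TARGET_FLAG))   # factor "0"/"1" to numeric
hoslem.test(obs, fitted(blr1), g = 10)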

Model HL Statistic DoF P Value
Model BLR:1 17.512 8 0.0251979
Model BLR:2 11.1989 8 0.1906817
Model BLR:3 22.34053 8 0.004322592

The low p-values for Models BLR:1 and BLR:3 suggest some lack of fit there. The moderate p-value for Model BLR:2 suggests no lack of fit there. This is not what we expected based on the incomplete pictures provided by looking at just the marginal models plots.

We produce ROC curves to visualize how each model performs on the training and test data.
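
A minimal sketch of the ROC curve and AUC for one model on the test data, assuming pROC is the ROC package used and that the model and data objects are named blr1 and alt_test_df_imputed:

library(pROC)

test_probs <- predict(blr1, newdata = alt_test_df_imputed, type = "response")
roc_test   <- roc(alt_test_df_imputed$TARGET_FLAG, test_probs)
plot(roc_test)
auc(roc_test)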

Model BLR:1 has the highest AUC on both the training and the test data, although Model BLR:3 is not far behind.

We produce confusion matrices for all three models based on the training and test data.

We calculate performance metrics for all models on the training and test data.
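
A hedged sketch of the confusion matrix and metrics for one model on the test data, from which a table like the one below can be assembled; caret is assumed as the source of confusionMatrix, the object names are assumptions, and a 0.5 classification threshold is assumed:

library(caret)

test_probs <- predict(blr1, newdata = alt_test_df_imputed, type = "response")
test_class <- factor(ifelse(test_probs >= 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(data = test_class, reference = alt_test_df_imputed$TARGET_FLAG,
                positive = "1", mode = "everything")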

Metric BLR:1 (Train) BLR:1 (Test) BLR:2 (Train) BLR:2 (Test) BLR:3 (Train) BLR:3 (Test)
Sensitivity 0.443 0.421 0.196 0.181 0.366 0.371
Specificity 0.923 0.919 0.950 0.950 0.925 0.916
Pos Pred Value 0.675 0.645 0.587 0.559 0.639 0.609
Neg Pred Value 0.821 0.819 0.766 0.768 0.802 0.806
Precision 0.675 0.645 0.587 0.559 0.639 0.609
Recall 0.443 0.421 0.196 0.181 0.366 0.371
F1 0.535 0.509 0.294 0.273 0.466 0.462
Prevalence 0.265 0.260 0.265 0.260 0.265 0.260
Detection Rate 0.117 0.109 0.052 0.047 0.097 0.097
Detection Prevalence 0.174 0.170 0.089 0.084 0.152 0.158
Balanced Accuracy 0.683 0.670 0.573 0.565 0.646 0.644
Accuracy 0.796 0.789 0.750 0.750 0.777 0.775
Classification Error Rate 0.204 0.211 0.250 0.250 0.223 0.225
AUC 0.817 0.806 0.710 0.712 0.795 0.791

Model BLR:1 performs best on most metrics, whether on the training data or on the test data. It had the lowest AIC, and its AUC was slightly higher than Model BLR:3's, but most importantly, it balances Precision and Recall the best and therefore has the highest F1 Score. (Note that Model BLR:2's Recall is particularly low compared to the other models, while Precision varies less among them.) Since the F1 Score is our primary metric, we select Model BLR:1 as the final binary logistic regression model for making predictions on the evaluation data.

Multiple Linear Regression Models

To select our multiple linear regression model, we will rely primarily on predictive \(R^2\) and RMSE computed on the test data, but we will also check for goodness of fit.
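
For clarity, both metrics are computed on the held-out test set. The "traditional" form of \(R^2\) used here (via caret's R2 function in the appendix code) and the RMSE are

\[ R^2_{pred} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad RMSE = \sqrt{\tfrac{1}{n} \sum_i (y_i - \hat{y}_i)^2}, \]

where \(y_i\) are the observed test-set crash costs, \(\hat{y}_i\) the (back-transformed, where applicable) predictions, and \(\bar{y}\) the mean observed cost in the test set. A negative \(R^2_{pred}\) means the model predicts worse than simply using the mean.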

To check for goodness of fit, we primarily examine Residuals vs. Fitted Values and Q-Q Residuals plots for all three models.

Model MLR:1 does not fit the data very well. The residuals are not randomly and evenly scattered above and below 0 across the fitted values, the Q-Q plot deviates from the normal line at the right end, and several outliers are flagged.

Model MLR:2 fits the data better, if still not well. Its residuals are more randomly and evenly scattered above and below 0 across the fitted values, but the Q-Q plot deviates from the normal line at both ends.

Model MLR:3 fits the data neither clearly better nor clearly worse than Model MLR:2.

We calculate predictive \(R^2\) and RMSE using the test set for all three models. (In cases where the response variable was transformed, predictions are back-transformed.)
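
As a minimal sketch of that back-transformation for Model MLR:2, which models the log of TARGET_AMT (object names follow the appendix code):

# Predict on the log scale, then exponentiate back to dollars before scoring
test_pred <- exp(predict(model_mlr_2, test_x))
model_mlr_2_test_rmse <- RMSE(test_pred, test_y)   # test_y holds observed costs in dollars

Note that simply exponentiating log-scale predictions tends to understate the conditional mean cost (a retransformation bias that a smearing-type adjustment could address), which is a further reason to treat these estimates cautiously.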

Model Predictive \(R^2\) (Test) RMSE (Test)
Model MLR:1 0.0016 6643.62
Model MLR:2 -0.0572 6836.72
Model MLR:3 -0.0617 6851.11

The predictive power of all these models is low, as expected. Model MLR:1 is the only model that beats predicting with the mean, and it does not beat it by much. We also know that the OLS assumptions are violated in Model MLR:1, even though the other models have flaws as well. We therefore select it as the final multiple linear regression model for making predictions on the evaluation data, but we approach those predictions with caution. Models that clearly outperform the naive mean prediction and do not have statistical flaws should be investigated further; none of the alternative models we have attempted so far has fared better.

Predictions on Evaluation Data

We make predictions on the evaluation dataset, and we save the file with the predicted probabilities, classifications, and costs as “HW4_Eval_PredProbs_Flags_Amounts.csv.”

While we can’t know whether the TARGET_FLAG classifications in particular are accurate, we summarize them below so we can compare the percentage of observations related to car crashes in the original data to the percentage in the evaluation data.

In the original data, about 26.4% of observations were related to car crashes, and in the evaluation data, we’ve identified only 17.1% of observations as being related to car crashes. Again, this does not speak to accuracy.
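
For reference, that evaluation-set percentage can be read off a simple proportion table of the predicted flags (a sketch using the eval_df_w_preds object from the appendix code):

# Share of evaluation records classified as crash-related (TARGET_FLAG = 1)
round(100 * prop.table(table(eval_df_w_preds$TARGET_FLAG)), 1)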

Appendix: Report Code

Below is the code for this report to generate the models and charts above.

knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(DataExplorer)
library(knitr)
library(cowplot)
library(finalfit)
library(correlationfunnel)
library(ggcorrplot)
library(RColorBrewer)
library(naniar)
library(mice)
library(MASS)
select <- dplyr::select
library(car)
library(glmtoolbox)
library(pROC)
library(caret)
library(robustbase)

cur_theme <- theme_set(theme_classic())

my_url <- "https://raw.githubusercontent.com/andrewbowen19/businessAnalyticsDataMiningDATA621/main/data/insurance_training_data.csv"
main_df <- read.csv(my_url, na.strings = "")

classes <- as.data.frame(unlist(lapply(main_df, class))) |>
    rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
    group_by(Class) |>
    summarize(Count = n(),
              Variables = paste(sort(unique(Variable)),collapse=", "))
kable(classes_summary, format = "simple")

vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
main_df <- main_df |>
    mutate(across(all_of(vars), ~gsub("\\$|,", "", .) |> as.integer()))

main_df <- main_df |>
    select(-INDEX)
remove <- c("discrete_columns", "continuous_columns",
            "total_observations", "memory_usage")
completeness <- introduce(main_df) |>
    select(-all_of(remove))
knitr::kable(t(completeness), format = "simple")

p1 <- plot_missing(main_df, missing_only = TRUE,
                   ggtheme = theme_classic(), title = "Missing Values")

p1 <- p1 + 
    scale_fill_brewer(palette = "Paired")
p1

exclude <- c("TARGET_AMT", "AGE", "INCOME", "YOJ", "HOME_VAL", "CAR_AGE", "JOB")
main_df_binarized <- main_df |>
    select(-all_of(exclude)) |>
    binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER",
           one_hot = TRUE)
main_df_corr <- main_df_binarized |>
    correlate(TARGET_FLAG__1)
main_df_corr |>
    plot_correlation_funnel()

palette <- brewer.pal(n = 7, name = "RdBu")[c(1, 4, 7)]
excl <- c("TARGET_FLAG", "JOB", "CAR_TYPE", "CAR_USE", "EDUCATION",
             "MSTATUS", "PARENT1", "RED_CAR", "REVOKED", "SEX", "URBANICITY")
model.matrix(~0+., data = main_df |> filter(TARGET_FLAG == 1) |>select(-all_of(excl))) |>
    cor(use = "pairwise.complete.obs") |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 2.5,
               tl.cex = 8, tl.srt = 90,
               colors = palette, outline.color = "white")

incl <- c("TARGET_AMT", "CAR_USE", "MSTATUS", "PARENT1", "RED_CAR",
             "REVOKED", "SEX", "URBANICITY")
model.matrix(~0+., data = main_df |> filter(TARGET_FLAG == 1) |> select(all_of(incl))) |>
    cor(use = "pairwise.complete.obs") |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 3,
               tl.cex = 8, tl.srt = 90,
               colors = palette, outline.color = "white")

incl <- c("TARGET_AMT", "JOB", "CAR_TYPE", "EDUCATION")
model.matrix(~0+., data = main_df |> filter(TARGET_FLAG == 1) |> select(all_of(incl))) |>
    cor(use = "pairwise.complete.obs") |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 1.75,
               tl.cex = 8, tl.srt = 90,
               colors = palette, outline.color = "white")

r <- model.matrix(~0+., data = main_df) |>
    cor(use = "pairwise.complete.obs")
is.na(r) <- abs(r) < 0.45
r |>
    ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 2.5,
               tl.cex = 8, tl.srt = 90,
               colors = palette, outline.color = "white")

output <- split_columns(main_df, binary_as_factor = TRUE)
num <- data.frame(Variable = names(output$continuous),
                   Type = rep("Numeric", ncol(output$continuous)))
cat <- data.frame(Variable = names(output$discrete),
                   Type = rep("Categorical", ncol(output$discrete)))
ranges <- as.data.frame(t(sapply(main_df |> select(-names(output$discrete)),
                                 range, na.rm = TRUE)))
factors <- names(output$discrete)
main_df <- main_df |> 
    mutate(across(all_of(factors), ~as.factor(.)))
values <- as.data.frame(t(sapply(main_df |> select(all_of(factors)),
                                 levels)))
values <- values |>
    mutate(across(all_of(factors), ~toString(unlist(.))))
values <- as.data.frame(t(values)) |>
    rownames_to_column()
cols <- c("Variable", "Values")
colnames(values) <- cols
remove <- c("V1", "V2")
ranges <- ranges |>
    rownames_to_column() |>
    group_by(rowname) |>
    mutate(Values = toString(c(V1, " - ", round(V2, 1))),
           Values = str_replace_all(Values, ",", "")) |>
    select(-all_of(remove))
colnames(ranges) <- cols
num <- num |>
    merge(ranges)
cat <- cat |>
    merge(values)
num_vs_cat <- num |>
    bind_rows(cat)
kable(num_vs_cat, format = "simple")

alt_df <- main_df
main_df <- main_df |>
    mutate(TARGET_AMT = case_when(as.numeric(as.character(TARGET_FLAG)) < 1 ~ NA,
                               TRUE ~ TARGET_AMT),
           HOME_VAL = case_when(HOME_VAL < 1 ~ NA,
                               TRUE ~ HOME_VAL),
           INCOME = case_when(INCOME < 1 ~ NA,
                              TRUE ~ INCOME))

main_df <- main_df |>
    mutate(CAR_AGE = case_when(CAR_AGE < 0 ~ CAR_AGE * -1,
                               TRUE ~ CAR_AGE))
alt_df <- alt_df |>
    mutate(CAR_AGE = case_when(CAR_AGE < 0 ~ CAR_AGE * -1,
                               TRUE ~ CAR_AGE))

summary(main_df)

littles_test <- main_df |>
    mcar_test()
knitr::kable(littles_test, format = "simple")

x <- colnames(main_df)
dep = c("CAR_AGE")
exp = x[!x %in% dep]
missing_comp1 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp1) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
dep = c("YOJ")
exp = x[!x %in% dep]
missing_comp2 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp2) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
dep = c("INCOME")
exp = x[!x %in% dep]
missing_comp3 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp3) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
dep = c("HOME_VAL")
exp = x[!x %in% dep]
missing_comp4 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp4) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
dep = c("JOB")
exp = x[!x %in% dep]
missing_comp5 <- main_df |>
    missing_compare(explanatory = exp, dependent = dep) |>
    mutate(p = as.numeric(case_when(p == "<0.001" ~ "0.001",
                                    TRUE ~ p))) |>
    mutate(Dependant = dep)
colnames(missing_comp5) <- c("Explanatory", "Ref", "Not Missing", "Missing", "p",
                             "Dependant")
missing_comp <- missing_comp1 |>
    bind_rows(missing_comp2, missing_comp3, missing_comp4, missing_comp5) |>
    mutate(Explanatory = case_when(is.na(p) ~ NA,
                                   TRUE ~ Explanatory)) |>
    fill(Explanatory, .direction = "down") |>
    group_by(Dependant, Explanatory) |>
    filter(any(p < 0.05)) |>
    select(Dependant, everything())
knitr::kable(missing_comp, format = "simple")

show <- c("YOJ", "INCOME", "HOME_VAL", "JOB")
p2 <- main_df |>
    select(all_of(show)) |>
    missing_plot()
p2

explanatory = c("JOB", "INCOME", "YOJ")
dependent = "HOME_VAL"
p3 <- main_df |>
    select(all_of(show)) |>
    missing_pattern(dependent, explanatory)

# just numeric variables
numeric_train <- main_df[,sapply(main_df, is.numeric)]
par(mfrow=c(4,4))
par(mai=c(.3,.3,.3,.3))
variables <- names(numeric_train)
for (i in 1:(length(variables))) {
  hist(numeric_train[[variables[i]]], main = variables[i], col = "lightblue")
}

cat_pivot <- main_df |>
    select(all_of(factors)) |>
    pivot_longer(cols = all_of(factors),
                 names_to = "Variable",
                 values_to = "Value") |>
    group_by(Variable, Value) |>
    summarize(Count = n()) |>
    group_by(Variable) |>
    mutate(Levels = n()) |>
    ungroup()
p4 <- cat_pivot |>
    filter(Levels == 2) |>
    ggplot(aes(x = Value, y = Count)) +
    geom_col(fill = "lightblue", color = "black") +
    facet_wrap(vars(Variable), ncol = 4, scales = "free_x") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p4

p5 <- cat_pivot |>
    filter(Levels > 2) |>
    ggplot(aes(x = Value, y = Count)) +
    geom_col(fill = "lightblue", color = "black") +
    coord_flip() + 
    facet_wrap(vars(Variable), ncol = 1, scales = "free")
p5

# car type
x <- main_df$CAR_TYPE
main_df$CAR_TYPE <- case_match(x, "z_SUV" ~ "SUV", .default = x)
main_df$CAR_TYPE <- factor(main_df$CAR_TYPE,
                           levels = c("Minivan", "Panel Truck",
                                      "Pickup", "Sports Car", "SUV", "Van"))
x <- alt_df$CAR_TYPE
alt_df$CAR_TYPE <- case_match(x, "z_SUV" ~ "SUV", .default = x)
alt_df$CAR_TYPE <- factor(alt_df$CAR_TYPE,
                           levels = c("Minivan", "Panel Truck",
                                      "Pickup", "Sports Car", "SUV", "Van"))

# education
x <- main_df$EDUCATION
main_df$EDUCATION <- case_match(x, "z_High School" ~ "High School", .default = x)
main_df$EDUCATION <- factor(main_df$EDUCATION,
                             levels = c("<High School", "High School",
                                        "Bachelors", "Masters", "PhD"))
x <- alt_df$EDUCATION
alt_df$EDUCATION <- case_match(x, "z_High School" ~ "High School", .default = x)
alt_df$EDUCATION <- factor(alt_df$EDUCATION,
                             levels = c("<High School", "High School",
                                        "Bachelors", "Masters", "PhD"))

# job
x <- main_df$JOB
main_df$JOB <- case_match(x, "z_Blue Collar" ~ "Blue Collar", .default = x)
main_df$JOB <- factor(main_df$JOB, levels = c("Blue Collar", "Clerical",
                                              "Doctor", "Home Maker","Lawyer",
                                              "Manager", "Professional", "Student"))
x <- alt_df$JOB
alt_df$JOB <- case_match(x, "z_Blue Collar" ~ "Blue Collar", .default = x)
alt_df$JOB <- factor(alt_df$JOB, levels = c("Blue Collar", "Clerical",
                                              "Doctor", "Home Maker","Lawyer",
                                              "Manager", "Professional", "Student"))

# single parent
main_df <- main_df |>
  mutate(PARENT1 = as.factor(ifelse(PARENT1 == "Yes", 1, 0)))
alt_df <- alt_df |>
  mutate(PARENT1 = as.factor(ifelse(PARENT1 == "Yes", 1, 0)))

# marital status
x <- main_df$MSTATUS
main_df$MSTATUS <- case_match(x, "z_No" ~ "No", .default = x)
main_df <- main_df |>
  mutate(MSTATUS = as.factor(ifelse(MSTATUS == "Yes", 1, 0)))
x <- alt_df$MSTATUS
alt_df$MSTATUS <- case_match(x, "z_No" ~ "No", .default = x)
alt_df <- alt_df |>
  mutate(MSTATUS = as.factor(ifelse(MSTATUS == "Yes", 1, 0)))

# red car
x <- main_df$RED_CAR
main_df$RED_CAR <- case_match(x, "no" ~ "No", "yes" ~ "Yes", .default = x)
main_df <- main_df |>
  mutate(RED_CAR = as.factor(ifelse(RED_CAR == "Yes", 1, 0)))
x <- alt_df$RED_CAR
alt_df$RED_CAR <- case_match(x, "no" ~ "No", "yes" ~ "Yes", .default = x)
alt_df <- alt_df |>
  mutate(RED_CAR = as.factor(ifelse(RED_CAR == "Yes", 1, 0)))

# revoked
main_df <- main_df |>
  mutate(REVOKED = as.factor(ifelse(REVOKED == "Yes", 1, 0)))
alt_df <- alt_df |>
  mutate(REVOKED = as.factor(ifelse(REVOKED == "Yes", 1, 0)))

# sex 
x <- main_df$SEX
main_df$SEX <- case_match(x, "M" ~ "Male", "z_F" ~ "Female", .default = x)
main_df$SEX <- factor(main_df$SEX, levels = c("Male", "Female"))
x <- alt_df$SEX
alt_df$SEX <- case_match(x, "M" ~ "Male", "z_F" ~ "Female", .default = x)
alt_df$SEX <- factor(alt_df$SEX, levels = c("Male", "Female"))

# urban city - 1 if urban, 0 if rural
x <- main_df$URBANICITY
main_df$URBANICITY <- case_match(x, "Highly Urban/ Urban" ~ "Urban",
                                 "z_Highly Rural/ Rural" ~ "Rural", .default = x)
main_df <- main_df |>
  mutate(URBANICITY = as.factor(ifelse(URBANICITY == "Urban", 1, 0)))
x <- alt_df$URBANICITY
alt_df$URBANICITY <- case_match(x, "Highly Urban/ Urban" ~ "Urban",
                                 "z_Highly Rural/ Rural" ~ "Rural", .default = x)
alt_df <- alt_df |>
  mutate(URBANICITY = as.factor(ifelse(URBANICITY == "Urban", 1, 0)))

vars <- c("CAR_TYPE", "EDUCATION", "JOB", "PARENT1", "MSTATUS", "RED_CAR",
          "REVOKED", "SEX", "URBANICITY")

levs <- c("Minivan, Panel Truck, Pickup, Sports Car, SUV, Van",
          "<High School, High School, Bachelors, Masters, PhD",
          "Blue Collar, Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student",
          "0, 1",
          "0, 1",
          "0, 1",
          "0, 1",
          "Male, Female",
          "0, 1")

vars_levs <- as.data.frame(cbind(vars, levs))
colnames(vars_levs) <- c("Factor", "New Levels")
knitr::kable(vars_levs, format = "simple")

drop <- c("INCOME", "HOME_VAL")
main_df <- main_df |>
    mutate(INCOME_THOU = INCOME / 1000,
           HOME_VAL_THOU = HOME_VAL / 1000) |>
    select(-all_of(drop))
alt_df <- alt_df |>
    mutate(INCOME_THOU = INCOME / 1000,
           HOME_VAL_THOU = HOME_VAL / 1000) |>
    select(-all_of(drop))
    
main_df <- main_df |>
    mutate(YOJ = case_when(JOB == "Student" ~ NA,
                           TRUE ~ YOJ))
alt_df <- alt_df |>
    mutate(YOJ = case_when(JOB == "Student" ~ 0,
                           TRUE ~ YOJ))

exclude1 <- c("Student", "Homemaker")
exclude2 <- c(exclude1, "Blue Collar")
main_df <- main_df |>
    mutate(HOME_VAL_CAT = factor(case_when(HOME_VAL_THOU < 251 ~ "<=250K",
                                        HOME_VAL_THOU < 501 ~ "251-500K",
                                        HOME_VAL_THOU < 751 ~ "501-750K",
                                        TRUE ~ "751K+"),
                              ordered = TRUE,
                              levels = c("<=250K", "251-500K", "501-750K", "751K+"),
                              exclude = NULL),
           HOMEOWNER = as.factor(ifelse(is.na(HOME_VAL_THOU), 0, 1)),
           INCOME_CAT = factor(case_when(INCOME_THOU < 51 ~ "<=50K",
                                          INCOME_THOU < 101 ~ "51-100K",
                                          INCOME_THOU < 151 ~ "101-150K",
                                          TRUE ~ "151K+"),
                                ordered = TRUE,
                                levels = c("<=50K", "51-100K", "101-150K", "151K+"),
                                exclude = NULL),
           INCOME_FLAG = as.factor(ifelse(is.na(INCOME_THOU), 0, 1)),
           KIDSDRIV_FLAG = as.factor(case_when(KIDSDRIV > 0 ~ 1,
                                               TRUE ~ 0)),
           HOMEKIDS_FLAG = as.factor(case_when(HOMEKIDS > 0 ~ 1,
                                               TRUE ~ 0)),
           EMPLOYED = as.factor(ifelse(JOB %in% exclude1 | is.na(JOB),
                                       0, 1)),
           CAR_AGE_CAT = factor(case_when(CAR_AGE < 5 ~ "<=4",
                                           CAR_AGE < 9 ~ "5-8",
                                           CAR_AGE < 13 ~ "9-12",
                                           TRUE ~ "13+"),
                                 ordered = TRUE,
                                 levels = c("<=4", "5-8", "9-12", "13+"),
                                 exclude = NULL),
           WHITE_COLLAR = as.factor(ifelse(JOB %in% exclude2 | is.na(JOB),
                                           0, 1)))
main_df$JOB <- factor(main_df$JOB, exclude = NULL)

set.seed(202)
rows <- sample(nrow(main_df))
main_df <- main_df[rows, ]
alt_df <- alt_df[rows, ]
sample <- sample(c(TRUE, FALSE), nrow(main_df), replace=TRUE,
                 prob=c(0.7,0.3))
train_df <- main_df[sample, ]
test_df <- main_df[!sample, ]
alt_train_df <- alt_df[sample, ]
alt_test_df <- alt_df[!sample, ]

train_df_imputed <- train_df |>
    mutate(AGE = case_when(is.na(AGE) ~ mean(AGE, na.rm = TRUE),
                           TRUE ~ AGE))
test_df_imputed <- test_df |>
    mutate(AGE = case_when(is.na(AGE) ~ mean(AGE, na.rm = TRUE),
                           TRUE ~ AGE))

missing <- c("AGE")
imp_train_num <- train_df_imputed |>
    select(all_of(missing)) |>
    mutate(Set = "Train")
imp_test_num <- test_df_imputed |>
    select(all_of(missing)) |>
    mutate(Set = "Test")
imp_num <- imp_train_num |>
    bind_rows(imp_test_num)
imp_num_pivot <- imp_num |>
    pivot_longer(!Set, names_to = "Variable", values_to = "Value")
p6 <- imp_num_pivot |>
    ggplot(aes(x = Value)) +
    geom_density(fill = "lightblue", color = "black") +
    labs(y = "Density") +
    facet_grid(rows = vars(Set), cols = vars(Variable),
               switch = "y", scales = "free_x")
p6

col_classes <- unlist(lapply(alt_train_df, class))
missing <- c("AGE", "INCOME_THOU", "YOJ", "HOME_VAL_THOU", "CAR_AGE", "JOB")
x <- names(col_classes)
not_missing <- x[!x %in% missing]
#The imputation process is slow, so we impute once, save the results as .csv files, and load those files on later runs, making sure the factor levels stay the same.
if (file.exists("alt_train_df_imputed.csv") & file.exists("alt_test_df_imputed.csv")){
    alt_train_df_imputed <- read.csv("alt_train_df_imputed.csv", na.strings = "",
                                 colClasses = col_classes)
    alt_test_df_imputed <- read.csv("alt_test_df_imputed.csv", na.strings = "",
                                 colClasses = col_classes)
}else{
    #Start with alt_train_df
    init = mice(alt_train_df, maxit=0) 
    meth = init$method
    predM = init$predictorMatrix
    
    #Skip variables without missing data
    meth[not_missing] = ""
    
    #Set different imputation methods for each of the variables with missing data
    meth[c("AGE")] = "pmm" #Predictive mean matching
    meth[c("INCOME_THOU")] = "pmm"
    meth[c("YOJ")] = "pmm"
    meth[c("HOME_VAL_THOU")] = "pmm"
    meth[c("CAR_AGE")] = "pmm"
    meth[c("JOB")] = "polyreg" #Polytomous (multinomial) logistic regression
    
    #Impute
    imputed = mice(alt_train_df, method=meth, predictorMatrix=predM, m=5,
                   printFlag = FALSE)
    alt_train_df_imputed <- complete(imputed)
    write.csv(alt_train_df_imputed, "alt_train_df_imputed.csv", row.names = FALSE,
              fileEncoding = "UTF-8")
    
    #Repeat for alt_test_df
    init = mice(alt_test_df, maxit=0) 
    meth = init$method
    predM = init$predictorMatrix
    meth[not_missing] = ""
    meth[c("AGE")] = "pmm"
    meth[c("INCOME_THOU")] = "pmm"
    meth[c("YOJ")] = "pmm"
    meth[c("HOME_VAL_THOU")] = "pmm"
    meth[c("CAR_AGE")] = "pmm"
    meth[c("JOB")] = "polyreg"
    imputed = mice(alt_test_df, method=meth, predictorMatrix=predM, m=5,
                   printFlag = FALSE)
    alt_test_df_imputed <- complete(imputed)
    write.csv(alt_test_df_imputed, "alt_test_df_imputed.csv", row.names = FALSE,
              fileEncoding = "UTF-8")
}

#Make sure the factor levels (and their order) stay the same as in main_df.
#Re-declare the factors rather than overwriting level labels positionally,
#which could silently relabel values when the cached CSVs are reloaded.
for (v in c("CAR_TYPE", "EDUCATION", "JOB", "SEX")) {
    alt_train_df_imputed[[v]] <- factor(alt_train_df_imputed[[v]],
                                        levels = levels(main_df[[v]]))
    alt_test_df_imputed[[v]] <- factor(alt_test_df_imputed[[v]],
                                       levels = levels(main_df[[v]]))
}

x <- sapply(alt_train_df_imputed, function(x) sum(is.na(x)))
y <- sapply(alt_test_df_imputed, function(x) sum(is.na(x)))
sum(x, y) == 0

missing_num <- c("AGE", "INCOME_THOU", "YOJ", "HOME_VAL_THOU", "CAR_AGE")
imp_alt_train_num <- alt_train_df_imputed |>
    select(all_of(missing_num)) |>
    mutate(Set = "Train")
imp_alt_test_num <- alt_test_df_imputed |>
    select(all_of(missing_num)) |>
    mutate(Set = "Test")
imp_alt_num <- imp_alt_train_num |>
    bind_rows(imp_alt_test_num)
imp_alt_num_pivot <- imp_alt_num |>
    pivot_longer(!Set, names_to = "Variable", values_to = "Value")
p7 <- imp_alt_num_pivot |>
    ggplot(aes(x = Value)) +
    geom_density(fill = "lightblue", color = "black") +
    labs(y = "Density") +
    facet_grid(rows = vars(Set), cols = vars(Variable),
               switch = "y", scales = "free_x")
p7

missing_cat <- c("JOB")
imp_alt_train_cat <- alt_train_df_imputed |>
    select(all_of(missing_cat)) |>
    pivot_longer(cols = all_of(missing_cat),
                 names_to = "Variable",
                 values_to = "Value") |>
    group_by(Variable, Value) |>
    summarize(Count = n()) |>
    mutate(Set = "Train")
imp_alt_test_cat <- alt_test_df_imputed |>
    select(all_of(missing_cat)) |>
    pivot_longer(cols = all_of(missing_cat),
                 names_to = "Variable",
                 values_to = "Value") |>
    group_by(Variable, Value) |>
    summarize(Count = n()) |>
    mutate(Set = "Test")
imp_alt_pivot_cat <- imp_alt_train_cat |>
    bind_rows(imp_alt_test_cat)
p8 <- imp_alt_pivot_cat |>
    ggplot(aes(x = Value, y = Count)) +
    geom_col(fill = "lightblue", color = "black") +
    labs(x = "Job") + 
    coord_flip() +
    facet_wrap(vars(Set), ncol = 2)
p8

skewed <- c("TRAVTIME", "BLUEBOOK", "TIF", "OLDCLAIM", "CLM_FREQ", "MVR_PTS")
train_df_trans <- train_df_imputed
for (i in 1:(length(skewed))){
    #Add a small constant to columns with any 0 values
    if (sum(train_df_trans[[skewed[i]]] == 0) > 0){
        train_df_trans[[skewed[i]]] <-
            train_df_trans[[skewed[i]]] + 0.001
    }
}
for (i in 1:(length(skewed))){
    if (i == 1){
        lambdas <- c()
    }
    bc <- boxcox(lm(train_df_trans[[skewed[i]]] ~ 1),
                 lambda = seq(-2, 2, length.out = 81),
                 plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]
    lambdas <- append(lambdas, lambda)
}
lambdas <- as.data.frame(cbind(skewed, lambdas))
adj <- c("no transformation", "square root", "log", "log", "log", "log")
lambdas <- cbind(lambdas, adj)
cols <- c("Skewed Variable", "Ideal Lambda Proposed by Box-Cox", "Reasonable Alternative Transformation")
colnames(lambdas) <- cols
knitr::kable(lambdas, format = "simple")

remove <- c("BLUEBOOK", "TIF", "OLDCLAIM", "CLM_FREQ", "MVR_PTS")
train_df_trans <- train_df_trans |>
    mutate(BLUEBOOK_SQRT = BLUEBOOK^0.5,
           TIF_LOG = log(TIF),
           OLDCLAIM_LOG = log(OLDCLAIM),
           CLM_FREQ_LOG = log(CLM_FREQ),
           MVR_PTS_LOG = log(MVR_PTS)) |>
    select(-all_of(remove))
test_df_trans <- test_df_imputed
for (i in 1:(length(skewed))){
    #Add a small constant to columns with any 0 values
    if (sum(test_df_trans[[skewed[i]]] == 0) > 0){
        test_df_trans[[skewed[i]]] <-
            test_df_trans[[skewed[i]]] + 0.001
    }
}
test_df_trans <- test_df_trans |>
    mutate(BLUEBOOK_SQRT = BLUEBOOK^0.5,
           TIF_LOG = log(TIF),
           OLDCLAIM_LOG = log(OLDCLAIM),
           CLM_FREQ_LOG = log(CLM_FREQ),
           MVR_PTS_LOG = log(MVR_PTS)) |>
    select(-all_of(remove))

transformed <- c("BLUEBOOK_SQRT", "TIF_LOG", "OLDCLAIM_LOG", "CLM_FREQ_LOG",
                 "MVR_PTS_LOG")
train_df_trans_set <- train_df_trans |>
    select(all_of(transformed)) |>
    mutate(Set = "Train")
test_df_trans_set <- test_df_trans |>
    select(all_of(transformed)) |>
    mutate(Set = "Test")
trans_sets <- train_df_trans_set |>
    bind_rows(test_df_trans_set)
trans_sets_pivot <- trans_sets |>
    pivot_longer(!Set, names_to = "Variable", values_to = "Value")
p9 <- trans_sets_pivot |>
    ggplot(aes(x = Value)) +
    geom_density(fill = "lightblue", color = "black") +
    labs(y = "Density") +
    facet_grid(rows = vars(Set), cols = vars(Variable),
               switch = "y", scales = "free_x")
p9

skewed <- c("TARGET_AMT", "YOJ", "TRAVTIME", "KIDSDRIV", "HOMEKIDS", "BLUEBOOK",
            "TIF", "OLDCLAIM", "CLM_FREQ", "MVR_PTS", "INCOME_THOU",
            "HOME_VAL_THOU", "CAR_AGE")
alt_train_df_trans <- alt_train_df_imputed
for (i in 1:(length(skewed))){
    #Add a small constant to columns with any 0 values
    if (sum(alt_train_df_trans[[skewed[i]]] == 0) > 0){
        alt_train_df_trans[[skewed[i]]] <-
            alt_train_df_trans[[skewed[i]]] + 0.001
    }
}
for (i in 1:(length(skewed))){
    if (i == 1){
        lambdas <- c()
    }
    bc <- boxcox(lm(alt_train_df_trans[[skewed[i]]] ~ 1),
                 lambda = seq(-2, 2, length.out = 81),
                 plotit = FALSE)
    lambda <- bc$x[which.max(bc$y)]
    lambdas <- append(lambdas, lambda)
}
lambdas <- as.data.frame(cbind(skewed, lambdas))
adj <- c("log", "no transformation", "no transformation", "inverse", "log",
         "square root", "log", "log", "log", "log", "square root", "log",
         "square root")
lambdas <- cbind(lambdas, adj)
cols <- c("Skewed Variable", "Ideal Lambda Proposed by Box-Cox", "Reasonable Alternative Transformation")
colnames(lambdas) <- cols
knitr::kable(lambdas, format = "simple")

remove <- c("TARGET_AMT", "KIDSDRIV", "HOMEKIDS", "BLUEBOOK",
            "TIF", "OLDCLAIM", "CLM_FREQ", "MVR_PTS", "INCOME_THOU",
            "HOME_VAL_THOU", "CAR_AGE")
alt_train_df_trans <- alt_train_df_trans |>
    mutate(TARGET_AMT_LOG = log(TARGET_AMT),
           KIDSDRIV_INV = KIDSDRIV^-1,
           HOMEKIDS_LOG = log(HOMEKIDS),
           BLUEBOOK_SQRT = BLUEBOOK^0.5,
           TIF_LOG = log(TIF),
           OLDCLAIM_LOG = log(OLDCLAIM),
           CLM_FREQ_LOG = log(CLM_FREQ),
           MVR_PTS_LOG = log(MVR_PTS),
           INCOME_THOU_SQRT = INCOME_THOU^0.5,
           HOME_VAL_THOU_LOG = log(HOME_VAL_THOU),
           CAR_AGE_SQRT = CAR_AGE^0.5) |>
    select(-all_of(remove))
alt_test_df_trans <- alt_test_df_imputed
for (i in 1:(length(skewed))){
    #Add a small constant to columns with any 0 values
    if (sum(alt_test_df_trans[[skewed[i]]] == 0) > 0){
        alt_test_df_trans[[skewed[i]]] <-
            alt_test_df_trans[[skewed[i]]] + 0.001
    }
}
alt_test_df_trans <- alt_test_df_trans |>
    mutate(TARGET_AMT_LOG = log(TARGET_AMT),
           KIDSDRIV_INV = KIDSDRIV^-1,
           HOMEKIDS_LOG = log(HOMEKIDS),
           BLUEBOOK_SQRT = BLUEBOOK^0.5,
           TIF_LOG = log(TIF),
           OLDCLAIM_LOG = log(OLDCLAIM),
           CLM_FREQ_LOG = log(CLM_FREQ),
           MVR_PTS_LOG = log(MVR_PTS),
           INCOME_THOU_SQRT = INCOME_THOU^0.5,
           HOME_VAL_THOU_LOG = log(HOME_VAL_THOU),
           CAR_AGE_SQRT = CAR_AGE^0.5) |>
    select(-all_of(remove))

model_blr_1 <- glm(TARGET_FLAG ~ . - TARGET_AMT, family = 'binomial',
                   data = alt_train_df_imputed)
model_blr_1 <- stepAIC(model_blr_1, trace = 0)
summary(model_blr_1)

beta <- coef(model_blr_1)
beta_exp <- as.data.frame(exp(beta)) |>
    rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta_exp) <- cols
beta_exp <- beta_exp |>
    filter(Feature != "(Intercept)")
beta_exp <- beta_exp |>
    mutate(diff = round(Coefficient - 1, 3) * 100) |>
    arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds of Car Crash")
colnames(beta_exp) <- cols
knitr::kable(beta_exp, format = "simple")

vif(model_blr_1)

model_blr_2 <- glm(TARGET_FLAG ~ AGE + CLM_FREQ + HOMEOWNER + INCOME_FLAG + EMPLOYED + WHITE_COLLAR + MSTATUS + PARENT1 + RED_CAR + REVOKED + SEX + TRAVTIME,
                   data=train_df_imputed, family='binomial')
model_blr_2 <- stepAIC(model_blr_2, trace=0)
summary(model_blr_2)

beta <- coef(model_blr_2)
beta_exp <- as.data.frame(exp(beta)) |>
    rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta_exp) <- cols
beta_exp <- beta_exp |>
    filter(Feature != "(Intercept)")
beta_exp <- beta_exp |>
    mutate(diff = round(Coefficient - 1, 3) * 100) |>
    arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds of Car Crash")
colnames(beta_exp) <- cols
knitr::kable(beta_exp, format = "simple")

model_blr_2 <- update(model_blr_2, ~ . - EMPLOYED)

vif(model_blr_2)

choices <- c("AGE", "CLM_FREQ_LOG", "URBANICITY", "MVR_PTS_LOG", "OLDCLAIM_LOG", "PARENT1", "REVOKED", "CAR_USE", "CAR_TYPE", "MSTATUS", "EDUCATION", "KIDSDRIV_FLAG", "INCOME_CAT", "EMPLOYED", "HOMEOWNER", "WHITE_COLLAR")
print(choices)

model_blr_3 <- glm(TARGET_FLAG ~ AGE + CLM_FREQ_LOG + URBANICITY + MVR_PTS_LOG +  OLDCLAIM_LOG + PARENT1 + REVOKED + CAR_USE + CAR_TYPE + MSTATUS + EDUCATION + KIDSDRIV_FLAG + INCOME_CAT + EMPLOYED + HOMEOWNER + WHITE_COLLAR,
                   family = 'binomial', data = train_df_trans)
summary(model_blr_3)

model_blr_3 <- update(model_blr_3, ~ . - WHITE_COLLAR)
model_blr_3 <- update(model_blr_3, ~ . - EMPLOYED)
summary(model_blr_3)

beta <- coef(model_blr_3)
beta_exp <- as.data.frame(exp(beta)) |>
    rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta_exp) <- cols
beta_exp <- beta_exp |>
    filter(Feature != "(Intercept)")
beta_exp <- beta_exp |>
    mutate(diff = round(Coefficient - 1, 3) * 100) |>
    arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds of Car Crash")
colnames(beta_exp) <- cols
knitr::kable(beta_exp, format = "simple")

vif(model_blr_3)

model_blr_3 <- update(model_blr_3, ~ . - OLDCLAIM_LOG)

model_mlr_1 <- lm(TARGET_AMT ~ ., data = alt_train_df_imputed |>
                      filter(TARGET_FLAG == 1) |> select(-TARGET_FLAG))
model_mlr_1 <- step(model_mlr_1, trace=0)
summary(model_mlr_1)

vif(model_mlr_1)

model_mlr_2 <- lm(TARGET_AMT_LOG ~ ., data = alt_train_df_trans |>
                      filter(TARGET_FLAG == 1) |> select(-TARGET_FLAG))
model_mlr_2 <- step(model_mlr_2, trace=0)
summary(model_mlr_2)

vif(model_mlr_2)

model_mlr_3 <- rlm(TARGET_AMT_LOG ~ MSTATUS + SEX + BLUEBOOK_SQRT + RED_CAR + REVOKED + MVR_PTS_LOG,
                   data = alt_train_df_trans |>
                      filter(TARGET_FLAG == 1) |> select(-TARGET_FLAG))
summary(model_mlr_3)

palette <- brewer.pal(n = 12, name = "Paired")
mmps(model_blr_1, layout = c(3, 4), grid = FALSE, col.line = palette[c(2,6)],
     main = "Model BLR:1")

mmps(model_blr_2, layout = c(2, 2), grid = FALSE, col.line = palette[c(2,6)],
     main = "Model BLR:2")

mmps(model_blr_3, layout = c(2, 2), grid = FALSE, col.line = palette[c(2,6)],
     main = "Model BLR:3")

hlstat1 <- hltest(model_blr_1, verbose = FALSE)
hlstat2 <- hltest(model_blr_2, verbose = FALSE)
hlstat3 <- hltest(model_blr_3, verbose = FALSE)
models <- c("Model BLR:1",
            "Model BLR:2",
            "Model BLR:3")
hl_tbl <- as.data.frame(cbind(models, rbind(hlstat1[2:4], hlstat2[2:4],
                                            hlstat3[2:4])))
cols <- c("Model", "HL Statistic", "DoF", "P Value")
colnames(hl_tbl) <- cols
knitr::kable(hl_tbl, format = "simple")

model_blr_1_train_preds_df <- alt_train_df_imputed |>
    mutate(linpred = predict(model_blr_1),
           predprob = predict(model_blr_1, type = "response"))
model_blr_1_test_preds_df <- alt_test_df_imputed |>
    mutate(linpred = predict(model_blr_1, alt_test_df_imputed),
           predprob = predict(model_blr_1, alt_test_df_imputed, type = "response"))
model_blr_2_train_preds_df <- train_df_imputed |>
    mutate(linpred = predict(model_blr_2),
           predprob = predict(model_blr_2, type = "response"))
model_blr_2_test_preds_df <- test_df_imputed |>
    mutate(linpred = predict(model_blr_2, test_df_imputed),
           predprob = predict(model_blr_2, test_df_imputed, type = "response"))
model_blr_3_train_preds_df <- train_df_trans |>
    mutate(linpred = predict(model_blr_3),
           predprob = predict(model_blr_3, type = "response"))
model_blr_3_test_preds_df <- test_df_trans |>
    mutate(linpred = predict(model_blr_3, test_df_trans),
           predprob = predict(model_blr_3, test_df_trans, type = "response"))
par(mfrow=c(3,2))
par(mai=c(.3,.3,.3,.3))
roc1 <- roc(model_blr_1_train_preds_df$TARGET_FLAG,
            model_blr_1_train_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:1 ROC (Train)")
roc2 <- roc(model_blr_1_test_preds_df$TARGET_FLAG,
            model_blr_1_test_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:1 ROC (Test)")
roc3 <- roc(model_blr_2_train_preds_df$TARGET_FLAG,
            model_blr_2_train_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:2 ROC (Train)")
roc4 <- roc(model_blr_2_test_preds_df$TARGET_FLAG,
            model_blr_2_test_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:2 ROC (Test)")
roc5 <- roc(model_blr_3_train_preds_df$TARGET_FLAG,
            model_blr_3_train_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:3 ROC (Train)")
roc6 <- roc(model_blr_3_test_preds_df$TARGET_FLAG,
            model_blr_3_test_preds_df$predprob,
            plot = TRUE, print.auc = TRUE, show.thres = TRUE)
title(main = "Model BLR:3 ROC (Test)")

model_blr_1_train_preds_df <- model_blr_1_train_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_1_test_preds_df <- model_blr_1_test_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_2_train_preds_df <- model_blr_2_train_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_2_test_preds_df <- model_blr_2_test_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_3_train_preds_df <- model_blr_3_train_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_3_test_preds_df <- model_blr_3_test_preds_df |>
    mutate(predicted = as.factor(ifelse(predprob>0.5,1,0)))
model_blr_1_train_cm <- confusionMatrix(model_blr_1_train_preds_df$predicted,
                                        model_blr_1_train_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_1_test_cm <- confusionMatrix(model_blr_1_test_preds_df$predicted,
                                        model_blr_1_test_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_2_train_cm <- confusionMatrix(model_blr_2_train_preds_df$predicted,
                                        model_blr_2_train_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_2_test_cm <- confusionMatrix(model_blr_2_test_preds_df$predicted,
                                        model_blr_2_test_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_3_train_cm <- confusionMatrix(model_blr_3_train_preds_df$predicted,
                                        model_blr_3_train_preds_df$TARGET_FLAG,
                                        positive = "1")
model_blr_3_test_cm <- confusionMatrix(model_blr_3_test_preds_df$predicted,
                                        model_blr_3_test_preds_df$TARGET_FLAG,
                                        positive = "1")
plt1a <- as.data.frame(model_blr_1_train_cm$table)
plt1a$Reference <- factor(plt1a$Reference, levels=rev(levels(plt1a$Reference)))
plt1a <- plt1a |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm1a <- plt1a |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:1 (Train) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt1b <- as.data.frame(model_blr_1_test_cm$table)
plt1b$Reference <- factor(plt1b$Reference, levels=rev(levels(plt1b$Reference)))
plt1b <- plt1b |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm1b <- plt1b |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:1 (Test) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt2a <- as.data.frame(model_blr_2_train_cm$table)
plt2a$Reference <- factor(plt2a$Reference, levels=rev(levels(plt2a$Reference)))
plt2a <- plt2a |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm2a <- plt2a |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:2 (Train) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt2b <- as.data.frame(model_blr_2_test_cm$table)
plt2b$Reference <- factor(plt2b$Reference, levels=rev(levels(plt2b$Reference)))
plt2b <- plt2b |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm2b <- plt2b |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:2 (Test) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt3a <- as.data.frame(model_blr_3_train_cm$table)
plt3a$Reference <- factor(plt3a$Reference, levels=rev(levels(plt3a$Reference)))
plt3a <- plt3a |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm3a <- plt3a |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:3 (Train) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
plt3b <- as.data.frame(model_blr_3_test_cm$table)
plt3b$Reference <- factor(plt3b$Reference, levels=rev(levels(plt3b$Reference)))
plt3b <- plt3b |>
    mutate(Label = case_when(Prediction == 0 & Reference == 0 ~ "TN",
                             Prediction == 1 & Reference == 1 ~ "TP",
                             Prediction == 0 & Reference == 1 ~ "FN",
                             Prediction == 1 & Reference == 0 ~ "FP"))
pcm3b <- plt3b |>
    ggplot(aes(x = Reference, y = Prediction)) +
    geom_tile(fill = "white", col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label, hjust = 3)) + 
    scale_x_discrete(position = "top") +
    labs(x = "Reference", y = "Prediction", title = "Model BLR:3 (Test) CM") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank())
pcm_all <- plot_grid(pcm1a, pcm1b, pcm2a, pcm2b, pcm3a, pcm3b,
                     ncol = 2)
pcm_all

metrics_a <- as.data.frame(cbind(model_blr_1_train_cm$byClass,
                                 model_blr_1_test_cm$byClass,
                                 model_blr_2_train_cm$byClass,
                                 model_blr_2_test_cm$byClass,
                                 model_blr_3_train_cm$byClass,
                                 model_blr_3_test_cm$byClass))
colnames(metrics_a) <-c('1 (Train)',
                        '1 (Test)',
                        '2 (Train)',
                        '2 (Test)',
                        '3 (Train)',
                        '3 (Test)')
metrics <- rbind(metrics_a,
                 c(model_blr_1_train_cm$overall[1],
                   model_blr_1_test_cm$overall[1],
                   model_blr_2_train_cm$overall[1],
                   model_blr_2_test_cm$overall[1],
                   model_blr_3_train_cm$overall[1],
                   model_blr_3_test_cm$overall[1]),
                 c(1-model_blr_1_train_cm$overall[1],
                   1-model_blr_1_test_cm$overall[1],
                   1-model_blr_2_train_cm$overall[1],
                   1-model_blr_2_test_cm$overall[1],
                   1-model_blr_3_train_cm$overall[1],
                   1-model_blr_3_test_cm$overall[1]),
                 c(roc1$auc,
                   roc2$auc,
                   roc3$auc,
                   roc4$auc,
                   roc5$auc,
                   roc6$auc))
metrics <- round(metrics, 3)
rownames(metrics)[12:14] <- c('Accuracy','Classification Error Rate','AUC')
knitr::kable(metrics, format = "simple")

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(model_mlr_1)
mtext("Model MLR:1", side = 3, line = -1.5, outer = TRUE)

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(model_mlr_2)
mtext("Model MLR:2", side = 3, line = -1.5, outer = TRUE)

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
plot(model_mlr_3)
mtext("Model MLR:3", side = 3, line = -1.5, outer = TRUE)

excl <- c("TARGET_AMT", "TARGET_FLAG")
test_x <- alt_test_df_imputed |>
    filter(TARGET_FLAG == 1) |>
    select(-all_of(excl))
test_y <- alt_test_df_imputed |>
    filter(TARGET_FLAG == 1) |>
    select(TARGET_AMT)
test_y <- as.numeric(test_y$TARGET_AMT)
test_pred <- predict(model_mlr_1, test_x)
model_mlr_1_test_rsq <- as.numeric(R2(test_pred, test_y, form = "traditional"))
model_mlr_1_test_rmse <- as.numeric(RMSE(test_pred, test_y))
excl <- c("TARGET_AMT_LOG", "TARGET_FLAG")
test_x <- alt_test_df_trans |>
    filter(TARGET_FLAG == 1) |>
    select(-all_of(excl))
test_y <- alt_test_df_trans |>
    filter(TARGET_FLAG == 1) |>
    select(TARGET_AMT_LOG)
test_y <- exp(as.numeric(test_y$TARGET_AMT_LOG))
test_pred <- exp(predict(model_mlr_2, test_x))
model_mlr_2_test_rsq <- as.numeric(R2(test_pred, test_y, form = "traditional"))
model_mlr_2_test_rmse <- as.numeric(RMSE(test_pred, test_y))
test_x <- alt_test_df_trans |>
    filter(TARGET_FLAG == 1) |>
    select(-all_of(excl))
test_y <- alt_test_df_trans |>
    filter(TARGET_FLAG == 1) |>
    select(TARGET_AMT_LOG)
test_y <- exp(as.numeric(test_y$TARGET_AMT_LOG))
test_pred <- exp(predict(model_mlr_3, test_x))
model_mlr_3_test_rsq <- as.numeric(R2(test_pred, test_y, form = "traditional"))
model_mlr_3_test_rmse <- as.numeric(RMSE(test_pred, test_y))
models <- c("Model MLR:1", "Model MLR:2", "Model MLR:3")
mlr_summary <- as.data.frame(cbind(models,
                                   pred_rsq = c(model_mlr_1_test_rsq,
                                                model_mlr_2_test_rsq,
                                                model_mlr_3_test_rsq),
                                   rmse = c(model_mlr_1_test_rmse,
                                            model_mlr_2_test_rmse,
                                            model_mlr_3_test_rmse)))
knitr::kable(mlr_summary, format = "simple")

my_url <- "https://raw.githubusercontent.com/andrewbowen19/businessAnalyticsDataMiningDATA621/main/data/insurance-evaluation-data.csv"
eval_df <- read.csv(my_url, na.strings = "")
eval_df_w_preds <- eval_df
# car type
x <- eval_df_w_preds$CAR_TYPE
eval_df_w_preds$CAR_TYPE <- case_match(x, "z_SUV" ~ "SUV", .default = x)
eval_df_w_preds$CAR_TYPE <- factor(eval_df_w_preds$CAR_TYPE,
                           levels = c("Minivan", "Panel Truck",
                                      "Pickup", "Sports Car", "SUV", "Van"))
# education
x <- eval_df_w_preds$EDUCATION
eval_df_w_preds$EDUCATION <- case_match(x, "z_High School" ~ "High School", .default = x)
eval_df_w_preds$EDUCATION <- factor(eval_df_w_preds$EDUCATION,
                             levels = c("<High School", "High School",
                                        "Bachelors", "Masters", "PhD"))
# job
x <- eval_df_w_preds$JOB
eval_df_w_preds$JOB <- case_match(x, "z_Blue Collar" ~ "Blue Collar", .default = x)
eval_df_w_preds$JOB <- factor(eval_df_w_preds$JOB, levels = c("Blue Collar", "Clerical",
                                              "Doctor", "Home Maker","Lawyer",
                                              "Manager", "Professional", "Student"))
# single parent
eval_df_w_preds <- eval_df_w_preds |>
  mutate(PARENT1 = as.factor(ifelse(PARENT1 == "Yes", 1, 0)))
# marital status
x <- eval_df_w_preds$MSTATUS
eval_df_w_preds$MSTATUS <- case_match(x, "z_No" ~ "No", .default = x)
eval_df_w_preds <- eval_df_w_preds |>
  mutate(MSTATUS = as.factor(ifelse(MSTATUS == "Yes", 1, 0)))
# red car
x <- eval_df_w_preds$RED_CAR
eval_df_w_preds$RED_CAR <- case_match(x, "no" ~ "No", "yes" ~ "Yes", .default = x)
eval_df_w_preds <- eval_df_w_preds |>
  mutate(RED_CAR = as.factor(ifelse(RED_CAR == "Yes", 1, 0)))
# revoked
eval_df_w_preds <- eval_df_w_preds |>
  mutate(REVOKED = as.factor(ifelse(REVOKED == "Yes", 1, 0)))
# sex 
x <- eval_df_w_preds$SEX
eval_df_w_preds$SEX <- case_match(x, "M" ~ "Male", "z_F" ~ "Female", .default = x)
eval_df_w_preds$SEX <- factor(eval_df_w_preds$SEX, levels = c("Male", "Female"))

# urban city - 1 if urban, 0 if rural
x <- eval_df_w_preds$URBANICITY
eval_df_w_preds$URBANICITY <- case_match(x, "Highly Urban/ Urban" ~ "Urban",
                                 "z_Highly Rural/ Rural" ~ "Rural", .default = x)
eval_df_w_preds <- eval_df_w_preds |>
  mutate(URBANICITY = as.factor(ifelse(URBANICITY == "Urban", 1, 0)))
vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
eval_df_w_preds <- eval_df_w_preds |>
    mutate(across(all_of(vars), ~gsub("\\$|,", "", .) |> as.integer()))
drop <- c("INCOME", "HOME_VAL")
eval_df_w_preds <- eval_df_w_preds |>
    mutate(INCOME_THOU = INCOME / 1000,
           HOME_VAL_THOU = HOME_VAL / 1000) |>
    select(-all_of(drop))
missing <- c("AGE", "INCOME_THOU", "YOJ", "HOME_VAL_THOU", "CAR_AGE", "JOB")
x <- names(eval_df_w_preds)
not_missing <- x[!x %in% missing]
init = mice(eval_df_w_preds, maxit=0) 
meth = init$method
predM = init$predictorMatrix
meth[not_missing] = ""
meth[c("AGE")] = "pmm" #Predictive mean matching
meth[c("INCOME_THOU")] = "pmm"
meth[c("YOJ")] = "pmm"
meth[c("HOME_VAL_THOU")] = "pmm"
meth[c("CAR_AGE")] = "pmm"
meth[c("JOB")] = "polyreg" #Polytomous (multinomial) logistic regression
imputed = mice(eval_df_w_preds, method=meth, predictorMatrix=predM, m=5,
                   printFlag = FALSE)
eval_df_w_preds <- complete(imputed)
eval_df_w_preds <- eval_df_w_preds |>
    mutate(PREDPROB = predict(model_blr_1, eval_df_w_preds,
                              type = "response"),
           TARGET_FLAG = as.factor(ifelse(PREDPROB > 0.5, 1, 0)),
           TARGET_AMT = case_when(TARGET_FLAG == 1 ~ predict(model_mlr_1,
                                                             eval_df_w_preds),
                                  TRUE ~ 0))
write.csv(eval_df_w_preds, file = 'HW4_Eval_PredProbs_Flags_Amounts.csv',
          row.names = FALSE)

eval_df_w_preds |> group_by(TARGET_FLAG) |> summarise(cnt=n())