CUNY-SPS-DATA621: Business Analytics and Data Mining Homework #4: Multiple Linear and Binary Logistic Regression

GitHub repository with the code files: https://github.com/asmozo24/Data621-Multiple-Linear-Regression

Overview

In this homework assignment, we explore, analyze and model a data set in which each record represents a customer at an auto insurance company. Each record has two response variables. The first, TARGET_FLAG, is 1 if the person was in a car crash and 0 otherwise. The second, TARGET_AMT, is zero if the person did not crash their car and a value greater than zero if they did. The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and the amount of money it will cost if they do.

Let's recall a couple of definitions. According to Rebecca Bevans (https://www.scribbr.com/statistics/multiple-linear-regression/), multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. In other words, we want to determine whether the dependent variable is linearly related to the independent variables (not necessarily all of them, but at least two) and how strong that relationship is.

On the other hand, binary logistic regression models the relationship between a binary categorical dependent variable and one or more independent variables. The two methods are similar in structure, but because the dependent variable takes only two values, logistic regression models the log-odds of the outcome rather than the outcome itself.
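For reference, the two model forms can be written side by side. The notation is generic and not taken from the assignment: $y$ is the response, $x_1, \dots, x_k$ are the predictors, $\varepsilon$ is the error term, and $p = P(y = 1 \mid x_1, \dots, x_k)$.

```latex
\begin{aligned}
\text{Multiple linear regression:} \quad & y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon \\
\text{Binary logistic regression:} \quad & \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\end{aligned}
```

The right-hand side is the same linear predictor in both cases; only the quantity being modeled on the left-hand side differs.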

1. Data Exploration

There are two datasets, insurance_training_data and insurance-evaluation-data, provided by the instructor, Nasrin Khansari. Both are CSV files, and we used R to read them from the GitHub repository where they are stored. The training data has 26 variables: INDEX is an identifier, TARGET_FLAG and TARGET_AMT are the response variables, and the remaining 23 are predictors. All of them are defined in the data dictionary below. The case study is to predict the probability that a person will crash their car and the amount of money it will cost if they do.

A short description of the variables of interest in the insurance dataset
VARIABLE NAME	DEFINITION	THEORETICAL EFFECT
INDEX	Identification Variable (do not use)	None
TARGET_FLAG	Was Car in a crash? 1=YES 0=NO	None
TARGET_AMT	If car was in a crash, what was the cost	None
AGE	Age of Driver	Very young people tend to be risky. Maybe very old people also.
BLUEBOOK	Value of Vehicle	Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_AGE	Vehicle Age	Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_TYPE	Type of Car	Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_USE	Vehicle Use	Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ	# Claims (Past 5 Years)	The more claims you filed in the past, the more you are likely to file in the future
EDUCATION	Max Education Level	Unknown effect, but in theory more educated people tend to drive more safely
HOMEKIDS	# Children at Home	Unknown effect
HOME_VAL	Home Value	In theory, home owners tend to drive more responsibly
INCOME	Income	In theory, rich people tend to get into fewer crashes
JOB	Job Category	In theory, white collar jobs tend to be safer
KIDSDRIV	# Driving Children	When teenagers drive your car, you are more likely to get into crashes
MSTATUS	Marital Status	In theory, married people drive more safely
MVR_PTS	Motor Vehicle Record Points	If you get lots of traffic tickets, you tend to get into more crashes
OLDCLAIM	Total Claims (Past 5 Years)	If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1	Single Parent	Unknown effect
RED_CAR	A Red Car	Urban legend says that red cars (especially red sports cars) are more risky. Is that true?
REVOKED	License Revoked (Past 7 Years)	If your license was revoked in the past 7 years, you probably are a more risky driver.
SEX	Gender	Urban legend says that women have fewer crashes than men. Is that true?
TIF	Time in Force	People who have been customers for a long time are usually more safe.
TRAVTIME	Distance to Work	Long drives to work usually suggest greater risk
URBANICITY	Home/Work Area	Unknown
YOJ	Years on Job	People who stay at a job for a long time are usually more safe

Data Structure

The training dataset includes 8161 observations and 26 variables. The variables are stored as integer, numeric and character types. Some predictors will need a change of data type before we can use them to build the models: the dollar amounts are stored as text, and the categorical variables are stored as character strings.

## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...

INDEX	AGE	HOMEKIDS	YOJ	INCOME	PARENT1	HOME_VAL	MSTATUS	SEX	EDUCATION	JOB	TRAVTIME	CAR_USE	BLUEBOOK	TIF	CAR_TYPE	RED_CAR	OLDCLAIM	CLM_FREQ	REVOKED	MVR_PTS	CAR_AGE	URBANICITY
1	60	0	11	$67,349	No	$0	z_No	M	PhD	Professional	14	Private	$14,230	11	Minivan	yes	$4,461	2	No	3	18	Highly Urban/ Urban
2	43	0	11	$91,449	No	$257,252	z_No	M	z_High School	z_Blue Collar	22	Commercial	$14,940	1	Minivan	yes	$0	0	No	0	1	Highly Urban/ Urban
4	35	1	10	$16,039	No	$124,191	Yes	z_F	z_High School	Clerical	5	Private	$4,010	4	z_SUV	no	$38,690	2	No	3	10	Highly Urban/ Urban
5	51	0	14		No	$306,251	Yes	M	<High School	z_Blue Collar	32	Private	$15,440	7	Minivan	yes	$0	0	No	0	6	Highly Urban/ Urban
6	50	0	NA	$114,986	No	$243,925	Yes	z_F	PhD	Doctor	36	Private	$18,000	1	z_SUV	no	$19,217	2	Yes	3	17	Highly Urban/ Urban

##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS              SEX           
##  Length:8161        Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   EDUCATION             JOB               TRAVTIME        CAR_USE         
##  Length:8161        Length:8161        Min.   :  5.00   Length:8161       
##  Class :character   Class :character   1st Qu.: 22.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.49                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##                                                                           
##    BLUEBOOK              TIF           CAR_TYPE           RED_CAR         
##  Length:8161        Min.   : 1.000   Length:8161        Length:8161       
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 4.000   Mode  :character   Mode  :character  
##                     Mean   : 5.351                                        
##                     3rd Qu.: 7.000                                        
##                     Max.   :25.000                                        
##                                                                           
##    OLDCLAIM            CLM_FREQ        REVOKED             MVR_PTS      
##  Length:8161        Min.   :0.0000   Length:8161        Min.   : 0.000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Mode  :character   Median :0.0000   Mode  :character   Median : 1.000  
##                     Mean   :0.7986                      Mean   : 1.696  
##                     3rd Qu.:2.0000                      3rd Qu.: 3.000  
##                     Max.   :5.0000                      Max.   :13.000  
##                                                                         
##     CAR_AGE        URBANICITY       
##  Min.   :-3.000   Length:8161       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 8.328                     
##  3rd Qu.:12.000                     
##  Max.   :28.000                     
##  NA's   :510

2. Data Preparation

Data Cleaning-Stripping Variables

Some variables need to be stripped of formatting characters to simplify data manipulation. INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM are stored as text with a dollar sign and thousands separators (for example, $67,349). We remove those characters and convert the columns from character to numeric.

INDEX	AGE	HOMEKIDS	YOJ	INCOME	PARENT1	HOME_VAL	MSTATUS	SEX	EDUCATION	JOB	TRAVTIME	CAR_USE	BLUEBOOK	TIF	CAR_TYPE	RED_CAR	OLDCLAIM	CLM_FREQ	REVOKED	MVR_PTS	CAR_AGE	URBANICITY
1	60	0	11	67349	No	0	z_No	M	PhD	Professional	14	Private	14230	11	Minivan	yes	4461	2	No	3	18	Highly Urban/ Urban
2	43	0	11	91449	No	257252	z_No	M	z_High School	z_Blue Collar	22	Commercial	14940	1	Minivan	yes	0	0	No	0	1	Highly Urban/ Urban
4	35	1	10	16039	No	124191	Yes	z_F	z_High School	Clerical	5	Private	4010	4	z_SUV	no	38690	2	No	3	10	Highly Urban/ Urban
5	51	0	14	NA	No	306251	Yes	M	<High School	z_Blue Collar	32	Private	15440	7	Minivan	yes	0	0	No	0	6	Highly Urban/ Urban
6	50	0	NA	114986	No	243925	Yes	z_F	PhD	Doctor	36	Private	18000	1	z_SUV	no	19217	2	Yes	3	17	Highly Urban/ Urban
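The stripping step can be sketched in base R as follows. This is a minimal illustration, not the report's actual code: the helper name `strip_currency` and the toy data frame are assumptions, and the sample values are copied from the rows shown above.

```r
# Convert currency strings such as "$67,349" to numeric values.
# Empty strings become NA, matching the NA shown for INCOME in the cleaned table.
strip_currency <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Toy data frame (hypothetical) with the same formatting as the raw columns
toy <- data.frame(
  INCOME   = c("$67,349", "$91,449", "$16,039", ""),
  HOME_VAL = c("$0", "$257,252", "$124,191", "$306,251"),
  stringsAsFactors = FALSE
)

# Apply the helper to every currency-formatted column
money_cols <- c("INCOME", "HOME_VAL")
toy[money_cols] <- lapply(toy[money_cols], strip_currency)
str(toy)
```

Applying one helper over a vector of column names keeps the conversion consistent across all four currency columns.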

Data Cleaning-Missing Values

It is important to check for missing values before applying regression analysis, because missing values can increase the error and add bias to the regression model. As shown below, the dataset has a total of 1879 missing values. If each fell in a different record, that would affect up to (1879/8161)*100 = 23.02% of the records. This is significant and would add bias to the models we build later, so we need to fix the variables with missing values.

## The total of missing values is :  1879

Data Cleaning-Fixing Variables with Missing Values

Let’s see the distribution of the missing values across the training dataset.

## 
## Variable distribution of missing 'NA' is:

variables	number.missing
CAR_AGE	510
HOME_VAL	464
YOJ	454
INCOME	445
AGE	6
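The counts above can be reproduced with a few base R calls. The sketch below uses a small hypothetical data frame in place of the training data, and also distinguishes missing cells from records that contain at least one missing value. The last lines check that the per-variable counts reported in the table add up to the stated total.

```r
# Toy data frame (hypothetical) standing in for the training data
toy <- data.frame(
  AGE     = c(60, 43, NA, 51, 50),
  YOJ     = c(11, 11, 10, NA, NA),
  CAR_AGE = c(18, 1, 10, 6, NA)
)

total_missing  <- sum(is.na(toy))                                # all NA cells
missing_by_var <- sort(colSums(is.na(toy)), decreasing = TRUE)   # NA cells per column
rows_with_na   <- sum(!complete.cases(toy))                      # records with at least one NA

print(total_missing)
print(missing_by_var)
print(rows_with_na)

# Consistency check on the counts reported in the table above
reported <- c(CAR_AGE = 510, HOME_VAL = 464, YOJ = 454, INCOME = 445, AGE = 6)
print(sum(reported))
```

Because one record can have several missing cells, `rows_with_na` can be smaller than `total_missing`, as it is in the toy data.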

Let’s visualize the missing values for variables: CAR_AGE, HOME_VAL, YOJ, INCOME and AGE

Since the variable AGE has only 6 missing values out of 8161 records, we can delete those records, as they represent only 0.0735% of the total. For the variables CAR_AGE, HOME_VAL, YOJ and INCOME, we will impute the missing values, because they represent a non-negligible share of the data for building the model.

## [1] 6

## 
## Let's see the new observation(it should be 1876-6=1870

## [1] 1870

## 
## Let's impute missing values in (CAR_AGE, HOME_VAL, YOJ, INCOME) with median

## 
## Let's see the effect of imputing missing values on the total observations(it should be 0

## [1] 0
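The two steps above (dropping the records with a missing AGE, then imputing the remaining variables with the median) can be sketched in base R. The data frame and the helper `impute_median` are hypothetical; the report itself may have used a package helper for the imputation.

```r
# Toy data frame (hypothetical) with a few missing values
toy <- data.frame(
  AGE     = c(60, NA, 35, 51, 50, 34),
  YOJ     = c(11, 11, 10, 14, NA, 12),
  CAR_AGE = c(18, 1, NA, 6, 17, 7)
)

# Step 1: drop the records with a missing AGE
toy <- toy[!is.na(toy$AGE), ]

# Step 2: replace the remaining NAs with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
impute_cols <- c("YOJ", "CAR_AGE")
toy[impute_cols] <- lapply(toy[impute_cols], impute_median)

# No missing values should remain
print(sum(is.na(toy)))
```

The median is a common choice for skewed variables because it is less sensitive to extreme values than the mean.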

Data Distribution

We want to take a look at how the data are distributed across all variables. The response variable TARGET_FLAG has only two outcomes (0 and 1), so its distribution is binary rather than normal. Most of the other variables are right or left skewed. Based on these density plots, we next want to see how the response varies with the predictors.

Let’s visualize the effect of two of the predictors (travel time and sex) on the response variable TARGET_AMT.

Build Models

We are going to use the training dataset to build two multiple linear regression models and three binary logistic regression models. Because of the large number of predictors (23, which expand to 37 coefficients once the categorical variables are dummy-coded), we will not focus on the individual coefficients of the equation.

Model 1-Multiple Linear Regression:

All variables are included except TARGET_FLAG and INDEX. INDEX is only an identifier and carries no information, and TARGET_FLAG is a categorical variable that we will use later as the response of the binary logistic models.

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = insuranceT_df3a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5889  -1696   -761    343 103790 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.075e+03  5.577e+02   1.928  0.05394 .  
## KIDSDRIV                         3.163e+02  1.133e+02   2.792  0.00524 ** 
## AGE                              5.079e+00  7.074e+00   0.718  0.47277    
## HOMEKIDS                         7.610e+01  6.544e+01   1.163  0.24486    
## YOJ                             -3.911e+00  1.511e+01  -0.259  0.79573    
## INCOME                          -4.406e-03  1.806e-03  -2.440  0.01472 *  
## PARENT1Yes                       5.744e+02  2.023e+02   2.840  0.00453 ** 
## HOME_VAL                        -5.603e-04  5.911e-04  -0.948  0.34325    
## MSTATUSz_No                      5.695e+02  1.449e+02   3.930 8.57e-05 ***
## SEXz_F                          -3.713e+02  1.840e+02  -2.018  0.04358 *  
## EDUCATIONBachelors              -2.576e+02  2.050e+02  -1.257  0.20892    
## EDUCATIONMasters                 2.466e+01  3.002e+02   0.082  0.93453    
## EDUCATIONPhD                     2.877e+02  3.562e+02   0.808  0.41934    
## EDUCATIONz_High School          -8.818e+01  1.722e+02  -0.512  0.60853    
## JOBClerical                      5.281e+02  3.416e+02   1.546  0.12207    
## JOBDoctor                       -4.988e+02  4.089e+02  -1.220  0.22252    
## JOBHome Maker                    3.518e+02  3.649e+02   0.964  0.33500    
## JOBLawyer                        2.316e+02  2.958e+02   0.783  0.43355    
## JOBManager                      -4.780e+02  2.886e+02  -1.656  0.09767 .  
## JOBProfessional                  4.570e+02  3.089e+02   1.480  0.13904    
## JOBStudent                       2.852e+02  3.743e+02   0.762  0.44605    
## JOBz_Blue Collar                 5.072e+02  3.220e+02   1.575  0.11519    
## TRAVTIME                         1.196e+01  3.224e+00   3.709  0.00021 ***
## CAR_USEPrivate                  -7.820e+02  1.646e+02  -4.751 2.06e-06 ***
## BLUEBOOK                         1.434e-02  8.630e-03   1.662  0.09664 .  
## TIF                             -4.815e+01  1.219e+01  -3.951 7.86e-05 ***
## CAR_TYPEPanel Truck              2.625e+02  2.783e+02   0.943  0.34559    
## CAR_TYPEPickup                   3.749e+02  1.708e+02   2.195  0.02819 *  
## CAR_TYPESports Car               1.019e+03  2.179e+02   4.677 2.95e-06 ***
## CAR_TYPEVan                      5.097e+02  2.135e+02   2.388  0.01696 *  
## CAR_TYPEz_SUV                    7.520e+02  1.794e+02   4.191 2.80e-05 ***
## RED_CARyes                      -5.262e+01  1.494e+02  -0.352  0.72473    
## OLDCLAIM                        -1.059e-02  7.441e-03  -1.423  0.15479    
## CLM_FREQ                         1.423e+02  5.511e+01   2.583  0.00982 ** 
## REVOKEDYes                       5.504e+02  1.736e+02   3.170  0.00153 ** 
## MVR_PTS                          1.749e+02  2.595e+01   6.739 1.70e-11 ***
## CAR_AGE                         -2.680e+01  1.280e+01  -2.094  0.03629 *  
## URBANICITYz_Highly Rural/ Rural -1.662e+03  1.395e+02 -11.917  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4546 on 8117 degrees of freedom
## Multiple R-squared:  0.07087,    Adjusted R-squared:  0.06664 
## F-statistic: 16.73 on 37 and 8117 DF,  p-value: < 2.2e-16

Interpretation of summary of the multiple linear regression stats output

Model 1 probably violates the assumptions of normally distributed residuals with constant variance (homoscedasticity): the residuals are not centered on zero (median: -761) and are not evenly spread (min: -5889, max: 103790).

F-statistic: this tests how good the relationship is between the predictors (all predictors in) and the response (TARGET_AMT: if the car was in a crash, what was the cost). The null hypothesis is that all slope coefficients are zero. With F = 16.73 on 37 and 8117 degrees of freedom and a p-value below alpha, we reject the null hypothesis. This means at least some predictors are likely to influence the cost of a crash. The F-statistic shows that a relationship exists, but it does not tell us how strong it is.

R^2: 0.07087, or 7.09%, is weak (on a scale from 0 to 1, with R^2 = 1 being a perfect fit). The model explains only about 7% of the variance in TARGET_AMT, which points to large errors, possibly driven by outliers (as the diagnostic plots below suggest).

Residual standard error: a measure of the quality of the regression fit. We expect every regression model to carry some error. Here, the actual cost of a crash typically deviates from the fitted regression line by about 4546, based on all predictors.

p-value: at the 5% significance level (alpha = 0.05), the overall p-value is below 2.2e-16, far less than 0.05, which indicates that changes in the predictors are associated with changes in the response variable (cost of a crash).
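As a sanity check, the adjusted R-squared and the F-statistic can be recomputed from the other numbers printed in the Model 1 summary. The sketch below only uses values copied from that output; the variable names are illustrative.

```r
# Values copied from the Model 1 summary above
r2       <- 0.07087          # Multiple R-squared
p        <- 37               # number of estimated slopes (numerator DF)
df_resid <- 8117             # residual degrees of freedom
n        <- df_resid + p + 1 # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared
f_stat  <- (r2 / p) / ((1 - r2) / df_resid)
p_value <- pf(f_stat, p, df_resid, lower.tail = FALSE)

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 5))
print(p_value < 0.05)
```

The implied sample size, n = 8155, is consistent with the 8161 original records minus the 6 dropped for a missing AGE.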

Model 2-Multiple Linear Regression:

Based on model 1, some variables do not contribute significantly to the fit: their p-values are larger than the threshold (alpha). To see whether we can improve the fit, we use backward elimination, dropping the predictors that are not significant in model 1.

We use the step() function for model 2. Recall that step() searches in the backward direction by default (when no scope is given), removing at each iteration the variable whose removal gives the lowest AIC (Akaike information criterion).

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = insuranceT_df3b)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5689  -1679   -786    300 103761 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.695e+03  2.276e+02   7.446 1.06e-13 ***
## KIDSDRIV                         3.816e+02  1.022e+02   3.735 0.000189 ***
## INCOME                          -5.816e-03  1.253e-03  -4.643 3.49e-06 ***
## PARENT1Yes                       6.489e+02  1.767e+02   3.673 0.000241 ***
## MSTATUSz_No                      5.903e+02  1.193e+02   4.947 7.68e-07 ***
## SEXz_F                          -2.046e+02  1.451e+02  -1.410 0.158522    
## TRAVTIME                         1.267e+01  3.221e+00   3.934 8.41e-05 ***
## CAR_USEPrivate                  -8.566e+02  1.257e+02  -6.816 1.00e-11 ***
## TIF                             -4.754e+01  1.218e+01  -3.902 9.61e-05 ***
## CAR_TYPEPanel Truck              4.020e+02  2.319e+02   1.734 0.083040 .  
## CAR_TYPEPickup                   3.171e+02  1.651e+02   1.921 0.054747 .  
## CAR_TYPESports Car               8.842e+02  2.045e+02   4.324 1.55e-05 ***
## CAR_TYPEVan                      5.627e+02  2.031e+02   2.770 0.005620 ** 
## CAR_TYPEz_SUV                    6.343e+02  1.654e+02   3.835 0.000127 ***
## CLM_FREQ                         1.108e+02  4.891e+01   2.265 0.023531 *  
## REVOKEDYes                       4.647e+02  1.551e+02   2.996 0.002748 ** 
## MVR_PTS                          1.790e+02  2.582e+01   6.934 4.40e-12 ***
## CAR_AGE                         -3.556e+01  1.006e+01  -3.536 0.000409 ***
## URBANICITYz_Highly Rural/ Rural -1.523e+03  1.358e+02 -11.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4553 on 8136 degrees of freedom
## Multiple R-squared:  0.06584,    Adjusted R-squared:  0.06377 
## F-statistic: 31.86 on 18 and 8136 DF,  p-value: < 2.2e-16
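Backward elimination with `step()` can be sketched on simulated data. This is an illustration under stated assumptions, not the report's code: the data frame `sim`, its column names and the coefficients are all made up, and the exact call used to obtain the reduced data frame in Model 2 is not shown in the report.

```r
set.seed(621)
n <- 200

# Simulated data (hypothetical): two informative predictors and two pure-noise predictors
sim <- data.frame(
  x1     = rnorm(n),
  x2     = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
sim$y <- 2 + 3 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

# Start from the full model and drop terms while the AIC keeps decreasing
full_model    <- lm(y ~ ., data = sim)
reduced_model <- step(full_model, direction = "backward", trace = 0)

print(formula(reduced_model))
print(c(full = extractAIC(full_model)[2], reduced = extractAIC(reduced_model)[2]))
```

Setting `trace = 0` suppresses the per-step log. Note that `step()` selects on AIC rather than on individual p-values, so a term with a p-value above 0.05 can still be retained.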

Interpretation of model 2-multiple linear regression stats output

Model 2 uses the backward elimination method. We removed the following predictors: HOMEKIDS, AGE, YOJ, JOB, BLUEBOOK, HOME_VAL, RED_CAR, OLDCLAIM and EDUCATION. The p-values of the remaining predictors are much better; however, neither the adjusted R-squared (0.06377 versus 0.06664 for model 1) nor the residuals improved. Let’s use the regsubsets() function from the leaps package to find the best-fitting subset of the model 2 predictors (the new data frame with reduced predictors). After running regsubsets() on model 1 (summary omitted to reduce redundancy), we noticed the same result.

## Subset selection object
## Call: regsubsets.formula(TARGET_AMT ~ ., data = insuranceT_df3b, nbest = 1)
## 18 Variables  (and intercept)
##                                 Forced in Forced out
## KIDSDRIV                            FALSE      FALSE
## INCOME                              FALSE      FALSE
## PARENT1Yes                          FALSE      FALSE
## MSTATUSz_No                         FALSE      FALSE
## SEXz_F                              FALSE      FALSE
## TRAVTIME                            FALSE      FALSE
## CAR_USEPrivate                      FALSE      FALSE
## TIF                                 FALSE      FALSE
## CAR_TYPEPanel Truck                 FALSE      FALSE
## CAR_TYPEPickup                      FALSE      FALSE
## CAR_TYPESports Car                  FALSE      FALSE
## CAR_TYPEVan                         FALSE      FALSE
## CAR_TYPEz_SUV                       FALSE      FALSE
## CLM_FREQ                            FALSE      FALSE
## REVOKEDYes                          FALSE      FALSE
## MVR_PTS                             FALSE      FALSE
## CAR_AGE                             FALSE      FALSE
## URBANICITYz_Highly Rural/ Rural     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          KIDSDRIV INCOME PARENT1Yes MSTATUSz_No SEXz_F TRAVTIME CAR_USEPrivate
## 1  ( 1 ) " "      " "    " "        " "         " "    " "      " "           
## 2  ( 1 ) " "      " "    " "        " "         " "    " "      " "           
## 3  ( 1 ) " "      " "    " "        " "         " "    " "      "*"           
## 4  ( 1 ) " "      " "    "*"        " "         " "    " "      "*"           
## 5  ( 1 ) " "      "*"    "*"        " "         " "    " "      "*"           
## 6  ( 1 ) "*"      "*"    " "        "*"         " "    " "      "*"           
## 7  ( 1 ) "*"      "*"    " "        "*"         " "    " "      "*"           
## 8  ( 1 ) "*"      "*"    " "        "*"         " "    "*"      "*"           
##          TIF CAR_TYPEPanel Truck CAR_TYPEPickup CAR_TYPESports Car CAR_TYPEVan
## 1  ( 1 ) " " " "                 " "            " "                " "        
## 2  ( 1 ) " " " "                 " "            " "                " "        
## 3  ( 1 ) " " " "                 " "            " "                " "        
## 4  ( 1 ) " " " "                 " "            " "                " "        
## 5  ( 1 ) " " " "                 " "            " "                " "        
## 6  ( 1 ) " " " "                 " "            " "                " "        
## 7  ( 1 ) "*" " "                 " "            " "                " "        
## 8  ( 1 ) "*" " "                 " "            " "                " "        
##          CAR_TYPEz_SUV CLM_FREQ REVOKEDYes MVR_PTS CAR_AGE
## 1  ( 1 ) " "           " "      " "        "*"     " "    
## 2  ( 1 ) " "           " "      " "        "*"     " "    
## 3  ( 1 ) " "           " "      " "        "*"     " "    
## 4  ( 1 ) " "           " "      " "        "*"     " "    
## 5  ( 1 ) " "           " "      " "        "*"     " "    
## 6  ( 1 ) " "           " "      " "        "*"     " "    
## 7  ( 1 ) " "           " "      " "        "*"     " "    
## 8  ( 1 ) " "           " "      " "        "*"     " "    
##          URBANICITYz_Highly Rural/ Rural
## 1  ( 1 ) " "                            
## 2  ( 1 ) "*"                            
## 3  ( 1 ) "*"                            
## 4  ( 1 ) "*"                            
## 5  ( 1 ) "*"                            
## 6  ( 1 ) "*"                            
## 7  ( 1 ) "*"                            
## 8  ( 1 ) "*"

Interpretation of model 3-multiple linear regression stats output

Running regsubsets() (model 3), we see that the best six-variable model uses only these variables: MVR_PTS, URBANICITY, CAR_USE, MSTATUS, INCOME and KIDSDRIV. An asterisk in the output indicates that the variable was selected for the best subset of that size, so the variables with the most asterisks across subset sizes are the most useful for predicting the response variable.

Binary Logistic Regression

We are going to use the same logic as for the multiple linear regression models. This time, the response is TARGET_FLAG, because the models we are going to build require a response variable with a categorical data type. In this case, TARGET_FLAG is binary (values: 1 and 0).
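A binary logistic regression of this kind can be sketched with `glm()` on simulated data. Everything in the sketch is hypothetical: the data frame `sim`, the lowercase column names, the true coefficients and the 0.5 classification cutoff are assumptions chosen for illustration, not values from the report.

```r
set.seed(621)
n <- 500

# Simulated predictors (hypothetical), loosely inspired by MVR_PTS and TRAVTIME
sim <- data.frame(
  mvr_pts  = rpois(n, 1.7),
  travtime = round(runif(n, 5, 60))
)

# Simulate a binary outcome whose log-odds rise with both predictors
log_odds <- -2 + 0.3 * sim$mvr_pts + 0.02 * sim$travtime
sim$flag <- rbinom(n, size = 1, prob = plogis(log_odds))

# Fit the logistic model with all predictors in
logit_model <- glm(flag ~ ., family = binomial(), data = sim)

# Predicted crash probabilities and a 0.5 classification cutoff (an assumption)
sim$prob <- predict(logit_model, type = "response")
sim$pred <- as.integer(sim$prob > 0.5)

print(coef(logit_model))
print(table(actual = sim$flag, predicted = sim$pred))
```

With `type = "response"`, `predict()` returns probabilities between 0 and 1 rather than values on the log-odds scale.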

Model 1: All variables in

## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(), data = insuranceT_df3c)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5863  -0.7129  -0.3987   0.6242   3.1531  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -9.107e-01  3.217e-01  -2.831 0.004646 ** 
## KIDSDRIV                         3.911e-01  6.127e-02   6.384 1.73e-10 ***
## AGE                             -1.431e-03  4.026e-03  -0.355 0.722320    
## HOMEKIDS                         4.571e-02  3.720e-02   1.229 0.219124    
## YOJ                             -1.048e-02  8.597e-03  -1.219 0.222715    
## INCOME                          -3.432e-06  1.082e-06  -3.172 0.001514 ** 
## PARENT1Yes                       3.755e-01  1.097e-01   3.422 0.000622 ***
## HOME_VAL                        -1.310e-06  3.421e-07  -3.829 0.000129 ***
## MSTATUSz_No                      4.929e-01  8.358e-02   5.898 3.69e-09 ***
## SEXz_F                          -8.772e-02  1.121e-01  -0.783 0.433844    
## EDUCATIONBachelors              -3.793e-01  1.158e-01  -3.275 0.001055 ** 
## EDUCATIONMasters                -2.832e-01  1.789e-01  -1.583 0.113366    
## EDUCATIONPhD                    -1.592e-01  2.141e-01  -0.744 0.456964    
## EDUCATIONz_High School           1.968e-02  9.518e-02   0.207 0.836232    
## JOBClerical                      4.109e-01  1.967e-01   2.089 0.036681 *  
## JOBDoctor                       -4.456e-01  2.671e-01  -1.669 0.095216 .  
## JOBHome Maker                    2.265e-01  2.103e-01   1.077 0.281319    
## JOBLawyer                        1.059e-01  1.695e-01   0.625 0.531963    
## JOBManager                      -5.549e-01  1.715e-01  -3.235 0.001215 ** 
## JOBProfessional                  1.645e-01  1.784e-01   0.922 0.356376    
## JOBStudent                       2.122e-01  2.146e-01   0.989 0.322828    
## JOBz_Blue Collar                 3.119e-01  1.856e-01   1.681 0.092803 .  
## TRAVTIME                         1.462e-02  1.884e-03   7.760 8.51e-15 ***
## CAR_USEPrivate                  -7.573e-01  9.182e-02  -8.247  < 2e-16 ***
## BLUEBOOK                        -2.072e-05  5.267e-06  -3.933 8.38e-05 ***
## TIF                             -5.553e-02  7.350e-03  -7.556 4.15e-14 ***
## CAR_TYPEPanel Truck              5.592e-01  1.618e-01   3.456 0.000549 ***
## CAR_TYPEPickup                   5.559e-01  1.008e-01   5.516 3.47e-08 ***
## CAR_TYPESports Car               1.024e+00  1.300e-01   7.877 3.35e-15 ***
## CAR_TYPEVan                      6.136e-01  1.267e-01   4.844 1.27e-06 ***
## CAR_TYPEz_SUV                    7.695e-01  1.113e-01   6.913 4.74e-12 ***
## RED_CARyes                      -2.052e-02  8.660e-02  -0.237 0.812692    
## OLDCLAIM                        -1.396e-05  3.911e-06  -3.570 0.000357 ***
## CLM_FREQ                         1.978e-01  2.856e-02   6.926 4.33e-12 ***
## REVOKEDYes                       8.875e-01  9.134e-02   9.717  < 2e-16 ***
## MVR_PTS                          1.125e-01  1.362e-02   8.259  < 2e-16 ***
## CAR_AGE                         -1.118e-03  7.542e-03  -0.148 0.882172    
## URBANICITYz_Highly Rural/ Rural -2.384e+00  1.128e-01 -21.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9404.0  on 8154  degrees of freedom
## Residual deviance: 7291.3  on 8117  degrees of freedom
## AIC: 7367.3
## 
## Number of Fisher Scoring iterations: 5

Interpretation of Model 1: All variables included

Comparing model 1 built with binary logistic regression against model 1 from the multiple linear regression, the logit model shows better-behaved residuals: the spread on either side of the median is roughly even.

## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5863  -0.7129  -0.3987   0.6242   3.1531  

p-values: these are about the same as the values obtained previously, and the predictors with the largest p-values are largely the same.

Model 2- Binary Logistic Regression: Backward elimination

Model 2 removes the predictors that had large p-values in model 1. These variables are: HOMEKIDS, AGE, YOJ, JOB, SEX, RED_CAR, EDUCATION.

## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(), data = insuranceT_df3c1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5355  -0.7350  -0.4130   0.6473   3.0718  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -5.977e-01  1.457e-01  -4.101 4.11e-05 ***
## KIDSDRIV                         4.173e-01  5.457e-02   7.647 2.06e-14 ***
## INCOME                          -6.074e-06  9.183e-07  -6.615 3.72e-11 ***
## PARENT1Yes                       4.621e-01  9.312e-02   4.962 6.97e-07 ***
## HOME_VAL                        -1.487e-06  3.301e-07  -4.505 6.65e-06 ***
## MSTATUSz_No                      4.261e-01  7.816e-02   5.451 5.01e-08 ***
## TRAVTIME                         1.488e-02  1.861e-03   7.997 1.27e-15 ***
## CAR_USEPrivate                  -8.679e-01  6.955e-02 -12.479  < 2e-16 ***
## BLUEBOOK                        -2.472e-05  4.682e-06  -5.278 1.30e-07 ***
## TIF                             -5.389e-02  7.278e-03  -7.404 1.32e-13 ***
## CAR_TYPEPanel Truck              5.161e-01  1.403e-01   3.679 0.000234 ***
## CAR_TYPEPickup                   4.969e-01  9.707e-02   5.119 3.07e-07 ***
## CAR_TYPESports Car               9.336e-01  1.054e-01   8.855  < 2e-16 ***
## CAR_TYPEVan                      5.761e-01  1.189e-01   4.847 1.26e-06 ***
## CAR_TYPEz_SUV                    7.075e-01  8.449e-02   8.374  < 2e-16 ***
## OLDCLAIM                        -1.388e-05  3.866e-06  -3.590 0.000331 ***
## CLM_FREQ                         1.940e-01  2.822e-02   6.876 6.16e-12 ***
## REVOKEDYes                       8.961e-01  9.017e-02   9.938  < 2e-16 ***
## MVR_PTS                          1.172e-01  1.349e-02   8.686  < 2e-16 ***
## CAR_AGE                         -2.574e-02  5.799e-03  -4.439 9.03e-06 ***
## URBANICITYz_Highly Rural/ Rural -2.277e+00  1.120e-01 -20.326  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9404.0  on 8154  degrees of freedom
## Residual deviance: 7403.5  on 8134  degrees of freedom
## AIC: 7445.5
## 
## Number of Fisher Scoring iterations: 5

Interpretation of Model 2: Reduced variables

Model 2 shows good results. The deviance residuals look reasonable and are almost unchanged from model 1.

## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5355  -0.7350  -0.4130   0.6473   3.0718  

p-values: all selected variables have p-values < 0.05, which indicates that the selected predictors are significant in model 2 for predicting the outcome of TARGET_FLAG.
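The reduction described for model 2 can be sketched in R as follows. This is a minimal illustration on simulated data, not the assignment's code: the data frame `sim`, its column names, and the choice of `drop_vars` are all hypothetical stand-ins.

```r
# Simulated stand-in data; column names are hypothetical, not the assignment's variables.
set.seed(621)
n <- 2000
sim <- data.frame(
  mvr_pts  = rpois(n, 2),
  travtime = rnorm(n, mean = 33, sd = 15),
  red_car  = factor(sample(c("no", "yes"), n, replace = TRUE)),
  age      = rnorm(n, mean = 45, sd = 9)
)
# Only mvr_pts and travtime drive the simulated outcome.
eta <- -2 + 0.3 * sim$mvr_pts + 0.02 * sim$travtime
sim$target_flag <- rbinom(n, size = 1, prob = plogis(eta))

# Model 1 analogue: every predictor.
full_fit <- glm(target_flag ~ ., family = binomial(), data = sim)

# Model 2 analogue: drop the predictors judged non-significant, then refit.
drop_vars <- c("red_car", "age")
sim_reduced <- sim[, !(names(sim) %in% drop_vars)]
reduced_fit <- glm(target_flag ~ ., family = binomial(), data = sim_reduced)

# Wald p-values of the reduced model.
p_values <- coef(summary(reduced_fit))[, "Pr(>|z|)"]
print(round(p_values, 4))

# Likelihood-ratio test: did the dropped terms matter jointly?
print(anova(reduced_fit, full_fit, test = "Chisq"))
```

The likelihood-ratio test at the end is an optional extra check. In the model 1 output, JOBManager and JOBClerical are individually significant, so a joint test of this kind may be a safer basis for dropping a multi-level factor such as JOB than the per-level p-values.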

Model 3- Binary Logistic Regression: Backward elimination

Model 2 looks good; however, we wanted to explore another method by trying the glmulti() function. Recall that glmulti is an R package for automated model selection and multi-model inference with glm and related functions. From a list of explanatory variables, the glmulti function builds all possible unique models involving these variables. R could not load glmulti because its dependency rJava failed to load (a problem in the interaction between Java and R that installing, reinstalling and reloading did not fix), so we used the step() function instead.

## Start:  AIC=7445.53
## TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + MSTATUS + 
##     TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + 
##     CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY
## 
##              Df Deviance    AIC
## <none>            7403.5 7445.5
## - OLDCLAIM    1   7416.7 7456.7
## - CAR_AGE     1   7423.4 7463.4
## - HOME_VAL    1   7423.8 7463.8
## - PARENT1     1   7428.1 7468.1
## - BLUEBOOK    1   7432.0 7472.0
## - MSTATUS     1   7432.9 7472.9
## - INCOME      1   7449.1 7489.1
## - CLM_FREQ    1   7450.2 7490.2
## - TIF         1   7460.4 7500.4
## - KIDSDRIV    1   7461.6 7501.6
## - TRAVTIME    1   7467.7 7507.7
## - MVR_PTS     1   7479.6 7519.6
## - REVOKED     1   7500.6 7540.6
## - CAR_TYPE    5   7509.7 7541.7
## - CAR_USE     1   7560.7 7600.7
## - URBANICITY  1   7999.2 8039.2

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + 
##     MSTATUS + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + 
##     OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY, 
##     family = binomial(), data = insuranceT_df3c1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5355  -0.7350  -0.4130   0.6473   3.0718  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -5.977e-01  1.457e-01  -4.101 4.11e-05 ***
## KIDSDRIV                         4.173e-01  5.457e-02   7.647 2.06e-14 ***
## INCOME                          -6.074e-06  9.183e-07  -6.615 3.72e-11 ***
## PARENT1Yes                       4.621e-01  9.312e-02   4.962 6.97e-07 ***
## HOME_VAL                        -1.487e-06  3.301e-07  -4.505 6.65e-06 ***
## MSTATUSz_No                      4.261e-01  7.816e-02   5.451 5.01e-08 ***
## TRAVTIME                         1.488e-02  1.861e-03   7.997 1.27e-15 ***
## CAR_USEPrivate                  -8.679e-01  6.955e-02 -12.479  < 2e-16 ***
## BLUEBOOK                        -2.472e-05  4.682e-06  -5.278 1.30e-07 ***
## TIF                             -5.389e-02  7.278e-03  -7.404 1.32e-13 ***
## CAR_TYPEPanel Truck              5.161e-01  1.403e-01   3.679 0.000234 ***
## CAR_TYPEPickup                   4.969e-01  9.707e-02   5.119 3.07e-07 ***
## CAR_TYPESports Car               9.336e-01  1.054e-01   8.855  < 2e-16 ***
## CAR_TYPEVan                      5.761e-01  1.189e-01   4.847 1.26e-06 ***
## CAR_TYPEz_SUV                    7.075e-01  8.449e-02   8.374  < 2e-16 ***
## OLDCLAIM                        -1.388e-05  3.866e-06  -3.590 0.000331 ***
## CLM_FREQ                         1.940e-01  2.822e-02   6.876 6.16e-12 ***
## REVOKEDYes                       8.961e-01  9.017e-02   9.938  < 2e-16 ***
## MVR_PTS                          1.172e-01  1.349e-02   8.686  < 2e-16 ***
## CAR_AGE                         -2.574e-02  5.799e-03  -4.439 9.03e-06 ***
## URBANICITYz_Highly Rural/ Rural -2.277e+00  1.120e-01 -20.326  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9404.0  on 8154  degrees of freedom
## Residual deviance: 7403.5  on 8134  degrees of freedom
## AIC: 7445.5
## 
## Number of Fisher Scoring iterations: 5

Model 3 is identical to model 2: step() removed no term, because dropping any of the remaining predictors would raise the AIC above 7445.5. Perhaps the glmulti() function would have given a different result.
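The behaviour of step() seen above, where the `<none>` row has the lowest AIC and nothing is removed, can be reproduced with a minimal sketch. The data are simulated and the names (`sim`, `noise1`, `noise2`) are hypothetical; only `glm()`, `step()` and `AIC()` from base R are used.

```r
# Simulated stand-in data with two real predictors and two pure-noise columns.
set.seed(621)
n <- 2000
sim <- data.frame(
  mvr_pts  = rpois(n, 2),
  travtime = rnorm(n, mean = 33, sd = 15),
  noise1   = rnorm(n),
  noise2   = rnorm(n)
)
eta <- -2 + 0.3 * sim$mvr_pts + 0.02 * sim$travtime
sim$target_flag <- rbinom(n, size = 1, prob = plogis(eta))

full_fit <- glm(target_flag ~ ., family = binomial(), data = sim)

# Backward elimination: at each step drop the term whose removal lowers AIC
# the most, and stop as soon as no removal lowers it.
step_fit <- step(full_fit, direction = "backward", trace = 0)
print(formula(step_fit))
print(c(full = AIC(full_fit), stepped = AIC(step_fit)))

# Running step() on an already-stepped model removes nothing further,
# which mirrors what happened when model 2 was passed to step().
again_fit <- step(step_fit, direction = "backward", trace = 0)
same_terms <- setequal(
  attr(terms(again_fit), "term.labels"),
  attr(terms(step_fit), "term.labels")
)
print(same_terms)
```

Because step() optimises AIC rather than p-values, a model that was already pruned to highly significant terms will often come back unchanged.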

SELECT MODELS

In this section, we used AIC, p-values, adjusted R-squared and the F-statistic to select the best multiple linear regression model and the best binary logistic regression model.

For multiple linear regression: models 1 and 2 show very low adjusted R-squared and F-statistic values. However, model 3 identifies the variables that are most significant in explaining the variability in the response variable (TARGET_AMT). We want to check the statistical output of a model based on these variables: MVR_PTS, URBANICITY, CAR_USE, MSTATUS, INCOME, KIDSDRIV.

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = insuranceT_df3_final)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4938  -1691   -844    292 104517 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.149e+03  1.382e+02  15.555  < 2e-16 ***
## MVR_PTS                          2.190e+02  2.415e+01   9.067  < 2e-16 ***
## URBANICITYz_Highly Rural/ Rural -1.474e+03  1.304e+02 -11.305  < 2e-16 ***
## CAR_USEPrivate                  -9.669e+02  1.057e+02  -9.149  < 2e-16 ***
## MSTATUSz_No                      8.184e+02  1.038e+02   7.887 3.51e-15 ***
## INCOME                          -8.471e-03  1.129e-03  -7.504 6.87e-14 ***
## KIDSDRIV                         5.007e+02  9.948e+01   5.034 4.92e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4578 on 8148 degrees of freedom
## Multiple R-squared:  0.05397,    Adjusted R-squared:  0.05327 
## F-statistic: 77.47 on 6 and 8148 DF,  p-value: < 2.2e-16

The results from the final multiple linear regression model show more significant p-values and a higher F-statistic. Despite no improvement in the adjusted R-squared, we conclude that model 3 is the best of the three multiple linear regression models. We suspect the outliers have some influence on the adjusted R-squared.
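As a check on the reported figures, the adjusted R-squared can be recomputed from the multiple R-squared in the output above. The sample size n = 8155 is inferred from the 8148 residual degrees of freedom plus the 7 estimated coefficients, and p = 6 is the number of predictors:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.05397)\,\frac{8154}{8148}
          \approx 0.05327
```

With more than 8,000 observations and only six predictors, the adjustment is negligible. The low value therefore reflects that the model explains only about 5% of the variance in TARGET_AMT, not a penalty for the number of predictors.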

For binary logistic regression, we compare the AIC of the three models; the model with the lowest AIC is the best model.

## [1] 7367.323

## [1] 7445.531

## [1] 7445.531

The AIC comparison shows that model 1 is the best binary logistic regression model. However, we are skeptical because model 2 shows better p-values. So we ran some predictions, and the accuracy of model 1 for binary logistic regression is 78.8%.
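The three AIC values can be reproduced from the deviances reported in the model summaries. Assuming ungrouped 0/1 responses, the saturated model's log-likelihood is zero, so the residual deviance equals minus twice the log-likelihood, and k (the number of estimated coefficients) is n = 8155 minus the residual degrees of freedom:

```latex
\begin{align*}
\mathrm{AIC} &= -2\log\hat{L} + 2k = D_{\mathrm{res}} + 2k \\
\text{Model 1: } & 7291.3 + 2(38) = 7367.3 \\
\text{Models 2 and 3: } & 7403.5 + 2(21) = 7445.5
\end{align*}
```

Removing 17 coefficients saves 34 in penalty but raises the deviance by 112.2, so the AIC of the reduced model is higher by 78.2. This is why AIC favors model 1 even though every term in model 2 is individually significant.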

## Start:  AIC=3671.18
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + 
##     REVOKED + MVR_PTS + CAR_AGE + URBANICITY
## 
##              Df Deviance    AIC
## - CAR_AGE     1   3595.3 3669.3
## - RED_CAR     1   3595.5 3669.5
## - SEX         1   3595.7 3669.7
## - AGE         1   3596.7 3670.7
## <none>            3595.2 3671.2
## - HOMEKIDS    1   3597.3 3671.3
## - YOJ         1   3598.3 3672.3
## - EDUCATION   4   3605.8 3673.8
## - INCOME      1   3600.9 3674.9
## - OLDCLAIM    1   3601.8 3675.8
## - HOME_VAL    1   3603.5 3677.5
## - PARENT1     1   3604.7 3678.7
## - KIDSDRIV    1   3608.6 3682.6
## - BLUEBOOK    1   3608.9 3682.9
## - MSTATUS     1   3611.4 3685.4
## - CAR_USE     1   3616.8 3690.8
## - TIF         1   3621.2 3695.2
## - JOB         8   3635.7 3695.7
## - TRAVTIME    1   3622.4 3696.4
## - CLM_FREQ    1   3625.4 3699.4
## - MVR_PTS     1   3626.6 3700.6
## - CAR_TYPE    5   3646.6 3712.6
## - REVOKED     1   3640.6 3714.6
## - URBANICITY  1   3918.7 3992.7
## 
## Step:  AIC=3669.28
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + 
##     REVOKED + MVR_PTS + URBANICITY
## 
##              Df Deviance    AIC
## - RED_CAR     1   3595.6 3667.6
## - SEX         1   3595.8 3667.8
## - AGE         1   3596.8 3668.8
## <none>            3595.3 3669.3
## - HOMEKIDS    1   3597.4 3669.4
## - YOJ         1   3598.4 3670.4
## - EDUCATION   4   3606.3 3672.3
## - INCOME      1   3600.9 3672.9
## - OLDCLAIM    1   3601.9 3673.9
## - HOME_VAL    1   3603.6 3675.6
## - PARENT1     1   3604.8 3676.8
## - KIDSDRIV    1   3608.7 3680.7
## - BLUEBOOK    1   3609.1 3681.1
## - MSTATUS     1   3611.4 3683.4
## - CAR_USE     1   3616.9 3688.9
## - TIF         1   3621.3 3693.3
## - JOB         8   3635.8 3693.8
## - TRAVTIME    1   3622.4 3694.4
## - CLM_FREQ    1   3625.6 3697.6
## - MVR_PTS     1   3626.7 3698.7
## - CAR_TYPE    5   3646.6 3710.6
## - REVOKED     1   3640.8 3712.8
## - URBANICITY  1   3918.7 3990.7
## 
## Step:  AIC=3667.64
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + 
##     MVR_PTS + URBANICITY
## 
##              Df Deviance    AIC
## - SEX         1   3595.9 3665.9
## - AGE         1   3597.2 3667.2
## <none>            3595.6 3667.6
## - HOMEKIDS    1   3597.8 3667.8
## - YOJ         1   3598.8 3668.8
## - EDUCATION   4   3606.6 3670.6
## - INCOME      1   3601.4 3671.4
## - OLDCLAIM    1   3602.3 3672.3
## - HOME_VAL    1   3603.9 3673.9
## - PARENT1     1   3605.1 3675.1
## - KIDSDRIV    1   3608.9 3678.9
## - BLUEBOOK    1   3609.5 3679.5
## - MSTATUS     1   3612.0 3682.0
## - CAR_USE     1   3617.3 3687.3
## - TIF         1   3621.7 3691.7
## - JOB         8   3636.0 3692.0
## - TRAVTIME    1   3622.8 3692.8
## - CLM_FREQ    1   3626.1 3696.1
## - MVR_PTS     1   3627.1 3697.1
## - CAR_TYPE    5   3646.9 3708.9
## - REVOKED     1   3641.2 3711.2
## - URBANICITY  1   3918.7 3988.7
## 
## Step:  AIC=3665.88
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + 
##     MVR_PTS + URBANICITY
## 
##              Df Deviance    AIC
## - AGE         1   3597.3 3665.3
## <none>            3595.9 3665.9
## - HOMEKIDS    1   3598.1 3666.1
## - YOJ         1   3599.0 3667.0
## - EDUCATION   4   3607.0 3669.0
## - INCOME      1   3601.6 3669.6
## - OLDCLAIM    1   3602.6 3670.6
## - HOME_VAL    1   3604.2 3672.2
## - PARENT1     1   3605.3 3673.3
## - KIDSDRIV    1   3609.3 3677.3
## - BLUEBOOK    1   3611.0 3679.0
## - MSTATUS     1   3612.4 3680.4
## - CAR_USE     1   3617.7 3685.7
## - TIF         1   3622.0 3690.0
## - JOB         8   3636.3 3690.3
## - TRAVTIME    1   3623.0 3691.0
## - CLM_FREQ    1   3626.4 3694.4
## - MVR_PTS     1   3627.4 3695.4
## - REVOKED     1   3641.4 3709.4
## - CAR_TYPE    5   3658.3 3718.3
## - URBANICITY  1   3918.8 3986.8
## 
## Step:  AIC=3665.29
## TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + 
##     MVR_PTS + URBANICITY
## 
##              Df Deviance    AIC
## - HOMEKIDS    1   3598.5 3664.5
## <none>            3597.3 3665.3
## - YOJ         1   3599.9 3665.9
## - EDUCATION   4   3608.2 3668.2
## - INCOME      1   3603.4 3669.4
## - OLDCLAIM    1   3604.0 3670.0
## - HOME_VAL    1   3605.2 3671.2
## - PARENT1     1   3606.1 3672.1
## - BLUEBOOK    1   3611.6 3677.6
## - KIDSDRIV    1   3612.9 3678.9
## - MSTATUS     1   3613.6 3679.6
## - CAR_USE     1   3619.4 3685.4
## - JOB         8   3637.0 3689.0
## - TIF         1   3623.4 3689.4
## - TRAVTIME    1   3624.6 3690.6
## - CLM_FREQ    1   3628.0 3694.0
## - MVR_PTS     1   3628.4 3694.4
## - REVOKED     1   3643.1 3709.1
## - CAR_TYPE    5   3660.6 3718.6
## - URBANICITY  1   3919.0 3985.0
## 
## Step:  AIC=3664.54
## TARGET_FLAG ~ KIDSDRIV + YOJ + INCOME + PARENT1 + HOME_VAL + 
##     MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + 
##     URBANICITY
## 
##              Df Deviance    AIC
## <none>            3598.5 3664.5
## - YOJ         1   3600.6 3664.6
## - EDUCATION   4   3609.6 3667.6
## - INCOME      1   3604.5 3668.5
## - OLDCLAIM    1   3605.3 3669.3
## - HOME_VAL    1   3606.9 3670.9
## - BLUEBOOK    1   3613.1 3677.1
## - MSTATUS     1   3613.6 3677.6
## - PARENT1     1   3615.1 3679.1
## - CAR_USE     1   3620.7 3684.7
## - KIDSDRIV    1   3621.6 3685.6
## - TIF         1   3624.6 3688.6
## - JOB         8   3639.3 3689.3
## - TRAVTIME    1   3625.7 3689.7
## - CLM_FREQ    1   3629.5 3693.5
## - MVR_PTS     1   3629.7 3693.7
## - REVOKED     1   3644.8 3708.8
## - CAR_TYPE    5   3662.0 3718.0
## - URBANICITY  1   3920.4 3984.4

## [1] 0.790287
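An accuracy figure like the one above can be computed with a hold-out split and a 0.5 probability cutoff. The sketch below uses simulated data and base R only. The 70/30 split, the cutoff, and all object names are assumptions, since the code that produced the reported value is not shown here.

```r
# Simulated stand-in data; column names are hypothetical.
set.seed(621)
n <- 2000
sim <- data.frame(
  mvr_pts  = rpois(n, 2),
  travtime = rnorm(n, mean = 33, sd = 15)
)
eta <- -2 + 0.3 * sim$mvr_pts + 0.02 * sim$travtime
sim$target_flag <- rbinom(n, size = 1, prob = plogis(eta))

# Hold out 30% of the rows for evaluation (the split ratio is an assumption).
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- glm(target_flag ~ ., family = binomial(), data = train)

# Predicted crash probabilities, converted to 0/1 labels at a 0.5 cutoff.
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Confusion matrix and overall accuracy on the held-out rows.
confusion <- table(predicted = pred, actual = test$target_flag)
accuracy  <- mean(pred == test$target_flag)
print(confusion)
print(accuracy)

# Baseline for comparison: always predicting the majority class.
baseline <- max(prop.table(table(test$target_flag)))
print(baseline)
```

Comparing the accuracy with the majority-class baseline is worthwhile when one outcome is much more common than the other, because a model can then score well while rarely predicting the rarer class.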

References

https://www.scribbr.com/statistics/multiple-linear-regression/

https://bookdown.org/chua/ber642_advanced_regression/binary-logistic-regression.html

https://www.jstatsoft.org/article/view/v034i12

http://r-statistics.co/Missing-Value-Treatment-With-R.html

https://andyreagan.github.io/teaching/2018/09-SDS-291/lectures/14_modelselection.pdf