GitHub repository with the source code: https://github.com/asmozo24/Data621-Multiple-Linear-Regression
In this homework assignment, we explore, analyze and model a data set in which each record represents a customer at an auto insurance company. Each record has two response variables. The first, TARGET_FLAG, is a 1 or a 0: a “1” means that the person was in a car crash, and a “0” means that the person was not. The second, TARGET_AMT, is zero if the person did not crash their car and greater than zero if they did. The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and the amount of money it will cost if they do.

Let’s recall a couple of definitions. According to Rebecca Bevans (https://www.scribbr.com/statistics/multiple-linear-regression/), multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. In other words, we want to determine whether the relationship between the dependent variable and the independent variables (not necessarily all of them, but at least two) is linear, and how strong it is.
On the other hand, binary logistic regression models the relationship between a two-level categorical dependent variable and one or more independent variables. In other words, binary logistic regression and multiple linear regression are similar, except that the dependent variable is categorical rather than continuous.
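For reference, the two model forms can be written side by side. The notation is an assumption introduced here for illustration (it does not appear in the assignment): $y$ is the response, $x_1, \dots, x_p$ are the predictors, $\beta_j$ are the coefficients, $\varepsilon$ is the error term, and $\pi$ is the probability that the binary response equals 1.

```latex
\begin{aligned}
\text{Multiple linear regression:}\quad & y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon \\
\text{Binary logistic regression:}\quad & \log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p,
\qquad \pi = P(y = 1 \mid x_1, \dots, x_p)
\end{aligned}
```

In both cases the right-hand side is linear in the predictors; the logistic model applies it to the log-odds of the outcome rather than to the outcome itself.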
There are two datasets, insurance_training_data and insurance-evaluation-data, provided by the instructor, Nasrin Khansari. These are CSV files, and we used the R programming language to load them from the GitHub repository where they are stored. Of the 26 variables in the training data, INDEX is an identifier, TARGET_FLAG and TARGET_AMT are the response variables, and the remaining 23 are predictors; all are defined in the table below. The case study: to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.
A short description of the variables of interest in the insurance dataset:

VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
---|---|---|
INDEX | Identification Variable (do not use) | None |
TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | None |
TARGET_AMT | If car was in a crash, what was the cost | None |
AGE | Age of Driver | Very young people tend to be risky. Maybe very old people also. |
BLUEBOOK | Value of Vehicle | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
CAR_AGE | Vehicle Age | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
CAR_TYPE | Type of Car | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
CAR_USE | Vehicle Use | Commercial vehicles are driven more, so might increase probability of collision |
CLM_FREQ | # Claims (Past 5 Years) | The more claims you filed in the past, the more you are likely to file in the future |
EDUCATION | Max Education Level | Unknown effect, but in theory more educated people tend to drive more safely |
HOMEKIDS | # Children at Home | Unknown effect |
HOME_VAL | Home Value | In theory, home owners tend to drive more responsibly |
INCOME | Income | In theory, rich people tend to get into fewer crashes |
JOB | Job Category | In theory, white collar jobs tend to be safer |
KIDSDRIV | # Driving Children | When teenagers drive your car, you are more likely to get into crashes |
MSTATUS | Marital Status | In theory, married people drive more safely |
MVR_PTS | Motor Vehicle Record Points | If you get lots of traffic tickets, you tend to get into more crashes |
OLDCLAIM | Total Claims (Past 5 Years) | If your total payout over the past five years was high, this suggests future payouts will be high |
PARENT1 | Single Parent | Unknown effect |
RED_CAR | A Red Car | Urban legend says that red cars (especially red sports cars) are more risky. Is that true? |
REVOKED | License Revoked (Past 7 Years) | If your license was revoked in the past 7 years, you probably are a more risky driver. |
SEX | Gender | Urban legend says that women have fewer crashes than men. Is that true? |
TIF | Time in Force | People who have been customers for a long time are usually more safe. |
TRAVTIME | Distance to Work | Long drives to work usually suggest greater risk |
URBANICITY | Home/Work Area | Unknown |
YOJ | Years on Job | People who stay at a job for a long time are usually more safe |
The training dataset includes 8161 observations and 26 variables. The variables are stored as integer, numeric and character types. Several predictors will need a change of data type before we can use them to build the models.
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : chr "$67,349" "$91,449" "$16,039" "" ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : chr "$0" "$257,252" "$124,191" "$306,251" ...
## $ MSTATUS : chr "z_No" "z_No" "Yes" "Yes" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "PhD" "z_High School" "z_High School" "<High School" ...
## $ JOB : chr "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : chr "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ RED_CAR : chr "yes" "yes" "no" "yes" ...
## $ OLDCLAIM : chr "$4,461" "$0" "$38,690" "$0" ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
INDEX | TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 60 | 0 | 11 | $67,349 | No | $0 | z_No | M | PhD | Professional | 14 | Private | $14,230 | 11 | Minivan | yes | $4,461 | 2 | No | 3 | 18 | Highly Urban/ Urban |
2 | 0 | 0 | 0 | 43 | 0 | 11 | $91,449 | No | $257,252 | z_No | M | z_High School | z_Blue Collar | 22 | Commercial | $14,940 | 1 | Minivan | yes | $0 | 0 | No | 0 | 1 | Highly Urban/ Urban |
4 | 0 | 0 | 0 | 35 | 1 | 10 | $16,039 | No | $124,191 | Yes | z_F | z_High School | Clerical | 5 | Private | $4,010 | 4 | z_SUV | no | $38,690 | 2 | No | 3 | 10 | Highly Urban/ Urban |
5 | 0 | 0 | 0 | 51 | 0 | 14 |  | No | $306,251 | Yes | M | <High School | z_Blue Collar | 32 | Private | $15,440 | 7 | Minivan | yes | $0 | 0 | No | 0 | 6 | Highly Urban/ Urban |
6 | 0 | 0 | 0 | 50 | 0 | NA | $114,986 | No | $243,925 | Yes | z_F | PhD | Doctor | 36 | Private | $18,000 | 1 | z_SUV | no | $19,217 | 2 | Yes | 3 | 17 | Highly Urban/ Urban |
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
Some variables need to be cleaned to simplify data manipulation. INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM are stored as character strings with a dollar sign and thousands separators (for example, "$67,349"). We strip these characters and convert the columns from character to numeric (integer/double).
INDEX | TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 60 | 0 | 11 | 67349 | No | 0 | z_No | M | PhD | Professional | 14 | Private | 14230 | 11 | Minivan | yes | 4461 | 2 | No | 3 | 18 | Highly Urban/ Urban |
2 | 0 | 0 | 0 | 43 | 0 | 11 | 91449 | No | 257252 | z_No | M | z_High School | z_Blue Collar | 22 | Commercial | 14940 | 1 | Minivan | yes | 0 | 0 | No | 0 | 1 | Highly Urban/ Urban |
4 | 0 | 0 | 0 | 35 | 1 | 10 | 16039 | No | 124191 | Yes | z_F | z_High School | Clerical | 5 | Private | 4010 | 4 | z_SUV | no | 38690 | 2 | No | 3 | 10 | Highly Urban/ Urban |
5 | 0 | 0 | 0 | 51 | 0 | 14 | NA | No | 306251 | Yes | M | <High School | z_Blue Collar | 32 | Private | 15440 | 7 | Minivan | yes | 0 | 0 | No | 0 | 6 | Highly Urban/ Urban |
6 | 0 | 0 | 0 | 50 | 0 | NA | 114986 | No | 243925 | Yes | z_F | PhD | Doctor | 36 | Private | 18000 | 1 | z_SUV | no | 19217 | 2 | Yes | 3 | 17 | Highly Urban/ Urban |
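The stripping step can be sketched in base R as follows. This is a minimal illustration, not the code used for the report: the helper name `strip_currency` and the toy data frame `raw` are hypothetical, and only two of the four currency columns are shown.

```r
# Toy data frame mimicking two raw currency columns (values copied from the rows above).
raw <- data.frame(
  INCOME   = c("$67,349", "$91,449", "$16,039", ""),
  HOME_VAL = c("$0", "$257,252", "$124,191", "$306,251"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop "$" and "," and convert the result to numeric.
# Empty strings are turned into NA, which is how a blank INCOME becomes a missing value.
strip_currency <- function(x) {
  cleaned <- gsub("[$,]", "", x)
  cleaned[cleaned %in% ""] <- NA
  as.numeric(cleaned)
}

money_cols <- c("INCOME", "HOME_VAL")
raw[money_cols] <- lapply(raw[money_cols], strip_currency)
str(raw)
```

Applying the helper column by column with `lapply()` keeps the data frame structure intact while changing only the storage type of the selected columns.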
### Data Cleaning - Missing Values
It is important to check for missing values before applying regression analysis, because missing values can increase the error and add bias to the regression model. As we can see below, the dataset shows a total of 1879 missing values. If each missing value fell in a different record, this would affect up to (1879/8161)*100 = 23.02% of the records (measured against all 8161 x 26 cells, it is about 0.89%). This is significant and would add bias to the models that we will build later, so we need to fix the variables with missing values.
## The total of missing values is : 1879
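A count like the one above can be obtained with base R. The sketch below uses a small hypothetical data frame (`toy`) rather than the insurance data, and it also shows the difference between the share of missing cells and the share of affected rows.

```r
# Toy data frame with a few NAs (hypothetical values, not the real data).
toy <- data.frame(
  AGE     = c(60, 43, NA, 51),
  YOJ     = c(11, NA, 10, NA),
  CAR_AGE = c(18, 1, 10, 6)
)

total_missing     <- sum(is.na(toy))            # total number of missing cells
missing_by_column <- colSums(is.na(toy))        # missing cells per variable
rows_with_missing <- sum(!complete.cases(toy))  # records with at least one NA

# Two different denominators give two different percentages.
pct_cells <- 100 * total_missing / (nrow(toy) * ncol(toy))
pct_rows  <- 100 * rows_with_missing / nrow(toy)

print(missing_by_column[missing_by_column > 0])
```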
### Data Cleaning - Fixing Variables with Missing Values
Let’s see the distribution of the missing values across the training dataset.
##
## Variable distribution of missing 'NA' is:
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
variables | number.missing |
---|---|
CAR_AGE | 510 |
HOME_VAL | 464 |
YOJ | 454 |
INCOME | 445 |
AGE | 6 |
Let’s visualize the missing values for variables: CAR_AGE, HOME_VAL, YOJ, INCOME and AGE
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
Since the variable AGE has only 6 missing values out of 8161, we can delete these records, as they represent only 0.0735% of the total. For the variables CAR_AGE, HOME_VAL, YOJ and INCOME, we will impute the missing values, because they represent a non-negligible share of the records used to build the model.
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:e1071':
##
## impute
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## [1] 6
##
## Let's see the new count of missing values (at most 1879-6=1873; the 6 dropped rows also held 3 other missing values)
## [1] 1870
##
## Let's impute missing values in (CAR_AGE, HOME_VAL, YOJ, INCOME) with median
##
## Let's see the effect of imputing missing values on the total count of missing values (it should be 0)
## [1] 0
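The two cleaning steps (dropping the records with a missing AGE, then imputing the remaining gaps with the median) can be sketched as follows. The report's output suggests Hmisc was used for imputation; this sketch swaps in plain base R instead, and the data frame `toy` and helper `impute_median` are hypothetical.

```r
# Toy data frame (hypothetical values) with NAs in several columns.
toy <- data.frame(
  AGE     = c(60, NA, 35, 51, 50),
  CAR_AGE = c(18, 1, NA, 6, 17),
  YOJ     = c(11, NA, 10, 14, NA)
)

# Step 1: drop the rows where AGE is missing.
# Note that this also removes any other NA stored in those rows.
cleaned <- toy[!is.na(toy$AGE), ]

# Step 2: replace the remaining NAs with the column median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
impute_cols <- c("CAR_AGE", "YOJ")
cleaned[impute_cols] <- lapply(cleaned[impute_cols], impute_median)

sum(is.na(cleaned))  # number of missing values left after both steps
```

Median imputation is simple and robust to skew, but it shrinks the variance of the imputed columns, which is worth keeping in mind when reading the model summaries.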
We want to look at how the data are distributed across all variables. The response variable TARGET_FLAG follows a binomial distribution, which makes sense because it has only two outcomes (0 and 1). Most of the remaining variables are skewed to the right or to the left. Based on these density plots, we next want to see what the relationship between the predictors and the response variables looks like.
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.
Let’s visualize the effect of two of the predictors (travel time and sex) on the response variable TARGET_AMT.
## `geom_smooth()` using formula 'y ~ x'
We are going to use the training dataset to build two different multiple linear regression models and three different binary logistic regression models. Due to the large number of predictors (23, which expand to 37 coefficients once the categorical variables are dummy-coded), we will not focus on the individual coefficients of the equation.
Model 1 (multiple linear regression) includes all variables except TARGET_FLAG and INDEX. INDEX is only an identifier, and TARGET_FLAG is a categorical variable that we will use later as the response of the binary logistic models.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = insuranceT_df3a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5889 -1696 -761 343 103790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.075e+03 5.577e+02 1.928 0.05394 .
## KIDSDRIV 3.163e+02 1.133e+02 2.792 0.00524 **
## AGE 5.079e+00 7.074e+00 0.718 0.47277
## HOMEKIDS 7.610e+01 6.544e+01 1.163 0.24486
## YOJ -3.911e+00 1.511e+01 -0.259 0.79573
## INCOME -4.406e-03 1.806e-03 -2.440 0.01472 *
## PARENT1Yes 5.744e+02 2.023e+02 2.840 0.00453 **
## HOME_VAL -5.603e-04 5.911e-04 -0.948 0.34325
## MSTATUSz_No 5.695e+02 1.449e+02 3.930 8.57e-05 ***
## SEXz_F -3.713e+02 1.840e+02 -2.018 0.04358 *
## EDUCATIONBachelors -2.576e+02 2.050e+02 -1.257 0.20892
## EDUCATIONMasters 2.466e+01 3.002e+02 0.082 0.93453
## EDUCATIONPhD 2.877e+02 3.562e+02 0.808 0.41934
## EDUCATIONz_High School -8.818e+01 1.722e+02 -0.512 0.60853
## JOBClerical 5.281e+02 3.416e+02 1.546 0.12207
## JOBDoctor -4.988e+02 4.089e+02 -1.220 0.22252
## JOBHome Maker 3.518e+02 3.649e+02 0.964 0.33500
## JOBLawyer 2.316e+02 2.958e+02 0.783 0.43355
## JOBManager -4.780e+02 2.886e+02 -1.656 0.09767 .
## JOBProfessional 4.570e+02 3.089e+02 1.480 0.13904
## JOBStudent 2.852e+02 3.743e+02 0.762 0.44605
## JOBz_Blue Collar 5.072e+02 3.220e+02 1.575 0.11519
## TRAVTIME 1.196e+01 3.224e+00 3.709 0.00021 ***
## CAR_USEPrivate -7.820e+02 1.646e+02 -4.751 2.06e-06 ***
## BLUEBOOK 1.434e-02 8.630e-03 1.662 0.09664 .
## TIF -4.815e+01 1.219e+01 -3.951 7.86e-05 ***
## CAR_TYPEPanel Truck 2.625e+02 2.783e+02 0.943 0.34559
## CAR_TYPEPickup 3.749e+02 1.708e+02 2.195 0.02819 *
## CAR_TYPESports Car 1.019e+03 2.179e+02 4.677 2.95e-06 ***
## CAR_TYPEVan 5.097e+02 2.135e+02 2.388 0.01696 *
## CAR_TYPEz_SUV 7.520e+02 1.794e+02 4.191 2.80e-05 ***
## RED_CARyes -5.262e+01 1.494e+02 -0.352 0.72473
## OLDCLAIM -1.059e-02 7.441e-03 -1.423 0.15479
## CLM_FREQ 1.423e+02 5.511e+01 2.583 0.00982 **
## REVOKEDYes 5.504e+02 1.736e+02 3.170 0.00153 **
## MVR_PTS 1.749e+02 2.595e+01 6.739 1.70e-11 ***
## CAR_AGE -2.680e+01 1.280e+01 -2.094 0.03629 *
## URBANICITYz_Highly Rural/ Rural -1.662e+03 1.395e+02 -11.917 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4546 on 8117 degrees of freedom
## Multiple R-squared: 0.07087, Adjusted R-squared: 0.06664
## F-statistic: 16.73 on 37 and 8117 DF, p-value: < 2.2e-16
Model 1 probably violates the assumptions of normally distributed residuals with constant variance (homoscedasticity): the residuals are not centered on zero (median: -761) and are far from evenly spread (min: -5889, max: 103790).
F-statistic: this tests whether there is a relationship between the predictors (all predictors in) and the response (TARGET_AMT: if the car was in a crash, what was the cost). Since the p-value of the F-test (< 2.2e-16) is less than alpha, we reject the null hypothesis that none of the predictors is related to the response. This means that at least some predictors are likely to influence the cost of a crash. Here the F-statistic is 16.73 (on 37 and 8117 degrees of freedom), which indicates that there is some relationship between the predictors and the response, although it does not tell us how strong that relationship is.
R^2: 0.07087, or 7.09%, is weak (on a scale from 0 to 1, with R^2 = 1 being a perfect fit): the model explains only about 7% of the variance in TARGET_AMT. This may reflect outliers (as the diagnostic plots show) or large errors.
Residual standard error: a measure of the quality of the regression fit. We expect every regression model to carry some error. Here, the actual cost of a crash typically deviates from the fitted regression line by about 4546, based on all predictors.
p-values: at the 5% significance level (alpha = 0.05), the overall p-value (< 2.2e-16) is below 0.05, which indicates that changes in the predictors are associated with changes in the response variable (the cost of a crash).
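The statistics discussed above can all be pulled out of a fitted `lm` object. The sketch below uses simulated data, so the variable names (`x1`, `x2`, `y`) and the numbers it produces are assumptions for illustration and have nothing to do with the insurance results.

```r
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 + rnorm(n, sd = 3)  # x2 is pure noise by construction

fit <- lm(y ~ ., data = sim)
s <- summary(fit)

r2     <- s$r.squared      # share of variance explained
adj_r2 <- s$adj.r.squared  # same, penalized for the number of predictors
rse    <- s$sigma          # residual standard error
f      <- s$fstatistic     # named vector: value, numdf, dendf

# Overall p-value of the F-test (summary() prints it but does not store it).
f_pval <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Quick look at how symmetric the residuals are around zero.
print(quantile(residuals(fit)))
```

Reading the residual quantiles next to the R-squared and the F-test mirrors the checks made on model 1: a significant F-test can coexist with a low R-squared and badly behaved residuals.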
## Model 2 - Multiple Linear Regression
Based on model 1, some variables do not contribute significantly to the fit of the model: their p-values are larger than the threshold (alpha). We can therefore try to improve the fit with backward elimination, dropping the predictors that are not significant in model 1.
We use the step() function for model 2. Recall that step() searches in the backward direction (the default when no scope is supplied), looking for the set of variables that gives the minimum AIC (Akaike information criterion).
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = insuranceT_df3b)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5689 -1679 -786 300 103761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.695e+03 2.276e+02 7.446 1.06e-13 ***
## KIDSDRIV 3.816e+02 1.022e+02 3.735 0.000189 ***
## INCOME -5.816e-03 1.253e-03 -4.643 3.49e-06 ***
## PARENT1Yes 6.489e+02 1.767e+02 3.673 0.000241 ***
## MSTATUSz_No 5.903e+02 1.193e+02 4.947 7.68e-07 ***
## SEXz_F -2.046e+02 1.451e+02 -1.410 0.158522
## TRAVTIME 1.267e+01 3.221e+00 3.934 8.41e-05 ***
## CAR_USEPrivate -8.566e+02 1.257e+02 -6.816 1.00e-11 ***
## TIF -4.754e+01 1.218e+01 -3.902 9.61e-05 ***
## CAR_TYPEPanel Truck 4.020e+02 2.319e+02 1.734 0.083040 .
## CAR_TYPEPickup 3.171e+02 1.651e+02 1.921 0.054747 .
## CAR_TYPESports Car 8.842e+02 2.045e+02 4.324 1.55e-05 ***
## CAR_TYPEVan 5.627e+02 2.031e+02 2.770 0.005620 **
## CAR_TYPEz_SUV 6.343e+02 1.654e+02 3.835 0.000127 ***
## CLM_FREQ 1.108e+02 4.891e+01 2.265 0.023531 *
## REVOKEDYes 4.647e+02 1.551e+02 2.996 0.002748 **
## MVR_PTS 1.790e+02 2.582e+01 6.934 4.40e-12 ***
## CAR_AGE -3.556e+01 1.006e+01 -3.536 0.000409 ***
## URBANICITYz_Highly Rural/ Rural -1.523e+03 1.358e+02 -11.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4553 on 8136 degrees of freedom
## Multiple R-squared: 0.06584, Adjusted R-squared: 0.06377
## F-statistic: 31.86 on 18 and 8136 DF, p-value: < 2.2e-16
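Backward elimination with `step()` can be sketched on simulated data as follows. The data frame `sim` and its variables are hypothetical; only `x1` actually drives the response, so the noise predictors are candidates for removal.

```r
set.seed(42)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)  # x2 and x3 are noise by construction

full <- lm(y ~ ., data = sim)

# Backward search: at each step, drop the term whose removal lowers the AIC
# the most, and stop when no removal lowers it further.
reduced <- step(full, direction = "backward", trace = 0)

kept <- attr(terms(reduced), "term.labels")
print(kept)
print(c(full = AIC(full), reduced = AIC(reduced)))
```

Setting `trace = 0` hides the step-by-step table; leaving it at the default prints one table per iteration, similar to the `Df / Deviance / AIC` listing shown later for the logistic model.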
Model 2 uses the backward elimination method. We removed the following predictors: HOMEKIDS, AGE, YOJ, JOB, BLUEBOOK, HOME_VAL, RED_CAR, OLDCLAIM and EDUCATION. The result is much better p-values; however, neither the adjusted R-squared nor the residuals improved. Let’s use the regsubsets() function from the leaps package to find the best-fitting model starting from model 2 (the new data frame with reduced predictors). In fact, after running regsubsets() on model 1 (we did not include the summary, to reduce redundancy), we noticed the same result.
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 4.0.5
## Subset selection object
## Call: regsubsets.formula(TARGET_AMT ~ ., data = insuranceT_df3b, nbest = 1)
## 18 Variables (and intercept)
## Forced in Forced out
## KIDSDRIV FALSE FALSE
## INCOME FALSE FALSE
## PARENT1Yes FALSE FALSE
## MSTATUSz_No FALSE FALSE
## SEXz_F FALSE FALSE
## TRAVTIME FALSE FALSE
## CAR_USEPrivate FALSE FALSE
## TIF FALSE FALSE
## CAR_TYPEPanel Truck FALSE FALSE
## CAR_TYPEPickup FALSE FALSE
## CAR_TYPESports Car FALSE FALSE
## CAR_TYPEVan FALSE FALSE
## CAR_TYPEz_SUV FALSE FALSE
## CLM_FREQ FALSE FALSE
## REVOKEDYes FALSE FALSE
## MVR_PTS FALSE FALSE
## CAR_AGE FALSE FALSE
## URBANICITYz_Highly Rural/ Rural FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## KIDSDRIV INCOME PARENT1Yes MSTATUSz_No SEXz_F TRAVTIME CAR_USEPrivate
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " "*"
## 4 ( 1 ) " " " " "*" " " " " " " "*"
## 5 ( 1 ) " " "*" "*" " " " " " " "*"
## 6 ( 1 ) "*" "*" " " "*" " " " " "*"
## 7 ( 1 ) "*" "*" " " "*" " " " " "*"
## 8 ( 1 ) "*" "*" " " "*" " " "*" "*"
## TIF CAR_TYPEPanel Truck CAR_TYPEPickup CAR_TYPESports Car CAR_TYPEVan
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " "
## 8 ( 1 ) "*" " " " " " " " "
## CAR_TYPEz_SUV CLM_FREQ REVOKEDYes MVR_PTS CAR_AGE
## 1 ( 1 ) " " " " " " "*" " "
## 2 ( 1 ) " " " " " " "*" " "
## 3 ( 1 ) " " " " " " "*" " "
## 4 ( 1 ) " " " " " " "*" " "
## 5 ( 1 ) " " " " " " "*" " "
## 6 ( 1 ) " " " " " " "*" " "
## 7 ( 1 ) " " " " " " "*" " "
## 8 ( 1 ) " " " " " " "*" " "
## URBANICITYz_Highly Rural/ Rural
## 1 ( 1 ) " "
## 2 ( 1 ) "*"
## 3 ( 1 ) "*"
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
## 6 ( 1 ) "*"
## 7 ( 1 ) "*"
## 8 ( 1 ) "*"
Using regsubsets() for model 3, we see that the best six-variable model accounts only for these variables: MVR_PTS, URBANICITY, CAR_USE, MSTATUS, INCOME, KIDSDRIV. In the table above, an asterisk means that the variable is included in the best subset of that size; the variables with the most asterisks are the ones selected most consistently as useful for predicting the response variable.
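A best-subset table of this kind can be read programmatically through `summary()`. The sketch below assumes the leaps package is installed (it is skipped otherwise) and uses simulated data; picking the subset size by adjusted R-squared is one possible criterion, chosen here as an assumption.

```r
set.seed(7)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)  # x3 and x4 are noise

selected <- character(0)
if (requireNamespace("leaps", quietly = TRUE)) {
  subsets <- leaps::regsubsets(y ~ ., data = sim, nbest = 1)
  s <- summary(subsets)

  # s$which is a logical matrix: one row per subset size, one column per term.
  best_size <- which.max(s$adjr2)
  selected  <- setdiff(names(which(s$which[best_size, ])), "(Intercept)")
  print(selected)
}
```

The `s$which` matrix carries the same information as the asterisk table printed by `summary()`, in a form that is easier to filter.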
We are going to use the same logic as for the multiple linear regression models. This time, the response is TARGET_FLAG, because the models we are going to build require a categorical response variable, and TARGET_FLAG is binary (values: 1 and 0).
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(), data = insuranceT_df3c)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5863 -0.7129 -0.3987 0.6242 3.1531
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.107e-01 3.217e-01 -2.831 0.004646 **
## KIDSDRIV 3.911e-01 6.127e-02 6.384 1.73e-10 ***
## AGE -1.431e-03 4.026e-03 -0.355 0.722320
## HOMEKIDS 4.571e-02 3.720e-02 1.229 0.219124
## YOJ -1.048e-02 8.597e-03 -1.219 0.222715
## INCOME -3.432e-06 1.082e-06 -3.172 0.001514 **
## PARENT1Yes 3.755e-01 1.097e-01 3.422 0.000622 ***
## HOME_VAL -1.310e-06 3.421e-07 -3.829 0.000129 ***
## MSTATUSz_No 4.929e-01 8.358e-02 5.898 3.69e-09 ***
## SEXz_F -8.772e-02 1.121e-01 -0.783 0.433844
## EDUCATIONBachelors -3.793e-01 1.158e-01 -3.275 0.001055 **
## EDUCATIONMasters -2.832e-01 1.789e-01 -1.583 0.113366
## EDUCATIONPhD -1.592e-01 2.141e-01 -0.744 0.456964
## EDUCATIONz_High School 1.968e-02 9.518e-02 0.207 0.836232
## JOBClerical 4.109e-01 1.967e-01 2.089 0.036681 *
## JOBDoctor -4.456e-01 2.671e-01 -1.669 0.095216 .
## JOBHome Maker 2.265e-01 2.103e-01 1.077 0.281319
## JOBLawyer 1.059e-01 1.695e-01 0.625 0.531963
## JOBManager -5.549e-01 1.715e-01 -3.235 0.001215 **
## JOBProfessional 1.645e-01 1.784e-01 0.922 0.356376
## JOBStudent 2.122e-01 2.146e-01 0.989 0.322828
## JOBz_Blue Collar 3.119e-01 1.856e-01 1.681 0.092803 .
## TRAVTIME 1.462e-02 1.884e-03 7.760 8.51e-15 ***
## CAR_USEPrivate -7.573e-01 9.182e-02 -8.247 < 2e-16 ***
## BLUEBOOK -2.072e-05 5.267e-06 -3.933 8.38e-05 ***
## TIF -5.553e-02 7.350e-03 -7.556 4.15e-14 ***
## CAR_TYPEPanel Truck 5.592e-01 1.618e-01 3.456 0.000549 ***
## CAR_TYPEPickup 5.559e-01 1.008e-01 5.516 3.47e-08 ***
## CAR_TYPESports Car 1.024e+00 1.300e-01 7.877 3.35e-15 ***
## CAR_TYPEVan 6.136e-01 1.267e-01 4.844 1.27e-06 ***
## CAR_TYPEz_SUV 7.695e-01 1.113e-01 6.913 4.74e-12 ***
## RED_CARyes -2.052e-02 8.660e-02 -0.237 0.812692
## OLDCLAIM -1.396e-05 3.911e-06 -3.570 0.000357 ***
## CLM_FREQ 1.978e-01 2.856e-02 6.926 4.33e-12 ***
## REVOKEDYes 8.875e-01 9.134e-02 9.717 < 2e-16 ***
## MVR_PTS 1.125e-01 1.362e-02 8.259 < 2e-16 ***
## CAR_AGE -1.118e-03 7.542e-03 -0.148 0.882172
## URBANICITYz_Highly Rural/ Rural -2.384e+00 1.128e-01 -21.128 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404.0 on 8154 degrees of freedom
## Residual deviance: 7291.3 on 8117 degrees of freedom
## AIC: 7367.3
##
## Number of Fisher Scoring iterations: 5
Comparing the binary logit model 1 with the multiple linear regression model 1, the residuals of the logit model look better: the spread is roughly even around the median. Deviance residuals are on a different scale from linear-model residuals, so this comparison is only indicative.

Min | 1Q | Median | 3Q | Max |
---|---|---|---|---|
-2.5863 | -0.7129 | -0.3987 | 0.6242 | 3.1531 |

p-values: we get roughly the same pattern as before. The predictors with large p-values are mostly the same ones.
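For readers who want to see how a model of this kind is fitted and turned into probabilities, here is a minimal sketch on simulated data. The variable names (`mvr_pts`, `travtime`, `crash`), the true coefficients and the 0.5 classification cutoff are all assumptions made for illustration.

```r
set.seed(123)
n <- 500
sim <- data.frame(mvr_pts = rpois(n, 2), travtime = runif(n, 5, 60))

# Simulate a binary outcome from an assumed linear predictor on the log-odds scale.
lin_pred  <- -2 + 0.4 * sim$mvr_pts + 0.02 * sim$travtime
sim$crash <- rbinom(n, size = 1, prob = plogis(lin_pred))

fit <- glm(crash ~ mvr_pts + travtime, family = binomial(), data = sim)

odds_ratios <- exp(coef(fit))                    # multiplicative effect on the odds
p_hat       <- predict(fit, type = "response")   # fitted crash probabilities
pred_class  <- as.integer(p_hat > 0.5)           # 0.5 cutoff is an assumption

print(round(odds_ratios, 3))
print(table(observed = sim$crash, predicted = pred_class))
```

Exponentiating the coefficients gives odds ratios, which are usually easier to interpret than the raw log-odds estimates printed by `summary()`.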
Model 2 leaves out the predictors with large p-values. These variables are: HOMEKIDS, AGE, YOJ, JOB, SEX, RED_CAR and EDUCATION.
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(), data = insuranceT_df3c1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5355 -0.7350 -0.4130 0.6473 3.0718
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.977e-01 1.457e-01 -4.101 4.11e-05 ***
## KIDSDRIV 4.173e-01 5.457e-02 7.647 2.06e-14 ***
## INCOME -6.074e-06 9.183e-07 -6.615 3.72e-11 ***
## PARENT1Yes 4.621e-01 9.312e-02 4.962 6.97e-07 ***
## HOME_VAL -1.487e-06 3.301e-07 -4.505 6.65e-06 ***
## MSTATUSz_No 4.261e-01 7.816e-02 5.451 5.01e-08 ***
## TRAVTIME 1.488e-02 1.861e-03 7.997 1.27e-15 ***
## CAR_USEPrivate -8.679e-01 6.955e-02 -12.479 < 2e-16 ***
## BLUEBOOK -2.472e-05 4.682e-06 -5.278 1.30e-07 ***
## TIF -5.389e-02 7.278e-03 -7.404 1.32e-13 ***
## CAR_TYPEPanel Truck 5.161e-01 1.403e-01 3.679 0.000234 ***
## CAR_TYPEPickup 4.969e-01 9.707e-02 5.119 3.07e-07 ***
## CAR_TYPESports Car 9.336e-01 1.054e-01 8.855 < 2e-16 ***
## CAR_TYPEVan 5.761e-01 1.189e-01 4.847 1.26e-06 ***
## CAR_TYPEz_SUV 7.075e-01 8.449e-02 8.374 < 2e-16 ***
## OLDCLAIM -1.388e-05 3.866e-06 -3.590 0.000331 ***
## CLM_FREQ 1.940e-01 2.822e-02 6.876 6.16e-12 ***
## REVOKEDYes 8.961e-01 9.017e-02 9.938 < 2e-16 ***
## MVR_PTS 1.172e-01 1.349e-02 8.686 < 2e-16 ***
## CAR_AGE -2.574e-02 5.799e-03 -4.439 9.03e-06 ***
## URBANICITYz_Highly Rural/ Rural -2.277e+00 1.120e-01 -20.326 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404.0 on 8154 degrees of freedom
## Residual deviance: 7403.5 on 8134 degrees of freedom
## AIC: 7445.5
##
## Number of Fisher Scoring iterations: 5
Model 2 shows good results. The deviance residuals are roughly balanced around the median:

Min | 1Q | Median | 3Q | Max |
---|---|---|---|---|
-2.5355 | -0.7350 | -0.4130 | 0.6473 | 3.0718 |

p-values: all selected variables have p-values < 0.05, which indicates that the selected predictors are significant in model 2 for predicting the outcome of TARGET_FLAG.
Model 2 looks good; however, we wanted to explore another method using the glmulti() function. Recall that glmulti is an R package for automated model selection and multi-model inference with glm and related functions: from a list of explanatory variables, glmulti() builds all possible unique models involving these variables. However, R could not load glmulti because rJava failed to load (an issue with the interaction between Java and R that installing, re-installing and reloading did not fix), so we decided to use the step() function instead.
## Start: AIC=7445.53
## TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + MSTATUS +
## TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM +
## CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY
##
## Df Deviance AIC
## <none> 7403.5 7445.5
## - OLDCLAIM 1 7416.7 7456.7
## - CAR_AGE 1 7423.4 7463.4
## - HOME_VAL 1 7423.8 7463.8
## - PARENT1 1 7428.1 7468.1
## - BLUEBOOK 1 7432.0 7472.0
## - MSTATUS 1 7432.9 7472.9
## - INCOME 1 7449.1 7489.1
## - CLM_FREQ 1 7450.2 7490.2
## - TIF 1 7460.4 7500.4
## - KIDSDRIV 1 7461.6 7501.6
## - TRAVTIME 1 7467.7 7507.7
## - MVR_PTS 1 7479.6 7519.6
## - REVOKED 1 7500.6 7540.6
## - CAR_TYPE 5 7509.7 7541.7
## - CAR_USE 1 7560.7 7600.7
## - URBANICITY 1 7999.2 8039.2
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL +
## MSTATUS + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE +
## OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY,
## family = binomial(), data = insuranceT_df3c1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5355 -0.7350 -0.4130 0.6473 3.0718
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.977e-01 1.457e-01 -4.101 4.11e-05 ***
## KIDSDRIV 4.173e-01 5.457e-02 7.647 2.06e-14 ***
## INCOME -6.074e-06 9.183e-07 -6.615 3.72e-11 ***
## PARENT1Yes 4.621e-01 9.312e-02 4.962 6.97e-07 ***
## HOME_VAL -1.487e-06 3.301e-07 -4.505 6.65e-06 ***
## MSTATUSz_No 4.261e-01 7.816e-02 5.451 5.01e-08 ***
## TRAVTIME 1.488e-02 1.861e-03 7.997 1.27e-15 ***
## CAR_USEPrivate -8.679e-01 6.955e-02 -12.479 < 2e-16 ***
## BLUEBOOK -2.472e-05 4.682e-06 -5.278 1.30e-07 ***
## TIF -5.389e-02 7.278e-03 -7.404 1.32e-13 ***
## CAR_TYPEPanel Truck 5.161e-01 1.403e-01 3.679 0.000234 ***
## CAR_TYPEPickup 4.969e-01 9.707e-02 5.119 3.07e-07 ***
## CAR_TYPESports Car 9.336e-01 1.054e-01 8.855 < 2e-16 ***
## CAR_TYPEVan 5.761e-01 1.189e-01 4.847 1.26e-06 ***
## CAR_TYPEz_SUV 7.075e-01 8.449e-02 8.374 < 2e-16 ***
## OLDCLAIM -1.388e-05 3.866e-06 -3.590 0.000331 ***
## CLM_FREQ 1.940e-01 2.822e-02 6.876 6.16e-12 ***
## REVOKEDYes 8.961e-01 9.017e-02 9.938 < 2e-16 ***
## MVR_PTS 1.172e-01 1.349e-02 8.686 < 2e-16 ***
## CAR_AGE -2.574e-02 5.799e-03 -4.439 9.03e-06 ***
## URBANICITYz_Highly Rural/ Rural -2.277e+00 1.120e-01 -20.326 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404.0 on 8154 degrees of freedom
## Residual deviance: 7403.5 on 8134 degrees of freedom
## AIC: 7445.5
##
## Number of Fisher Scoring iterations: 5
Model 3 differs very little from model 2; both report an AIC of 7445.5. Perhaps the glmulti() function would have given a different result.
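The backward elimination traced in the output above can be reproduced in miniature. The following is a minimal sketch on simulated data, assuming base R's step() was the search used for model 3 (the report does not show the call). The data frame sim and its columns x1, x2, noise1 and noise2 are hypothetical stand-ins, not the insurance variables.

```r
# Simulated data: the column names are illustrative, not the insurance variables.
set.seed(621)
n <- 500
sim <- data.frame(
  x1     = rnorm(n),
  x2     = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
# Only x1 and x2 drive the binary outcome.
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

# Full logistic model with every predictor.
full_fit <- glm(y ~ ., family = binomial(), data = sim)

# Backward elimination: at each step, drop the term whose removal lowers
# the AIC the most, and stop when no single removal lowers it further.
step_fit <- step(full_fit, direction = "backward", trace = 0)

formula(step_fit)   # terms that survived the search
AIC(full_fit)       # AIC before elimination
AIC(step_fit)       # AIC after elimination (never higher than the full model)
```

Because step() follows a single greedy path, it can stop at the same model that manual pruning reaches, which is one plausible reason models 2 and 3 share the same AIC here.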
In this section, we used AIC, p-values, adjusted R-squared, and the F-statistic to select the best multiple linear regression model and the best binary logistic regression model.
For multiple linear regression, models 1 and 2 show very low adjusted R-squared and F-statistic values. However, model 3 identifies the variables that are most significant in explaining the variability of the response variable (TARGET_AMT). We therefore check the summary output of a model fitted on these variables only: MVR_PTS, URBANICITY, CAR_USE, MSTATUS, INCOME, and KIDSDRIV.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = insuranceT_df3_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4938 -1691 -844 292 104517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.149e+03 1.382e+02 15.555 < 2e-16 ***
## MVR_PTS 2.190e+02 2.415e+01 9.067 < 2e-16 ***
## URBANICITYz_Highly Rural/ Rural -1.474e+03 1.304e+02 -11.305 < 2e-16 ***
## CAR_USEPrivate -9.669e+02 1.057e+02 -9.149 < 2e-16 ***
## MSTATUSz_No 8.184e+02 1.038e+02 7.887 3.51e-15 ***
## INCOME -8.471e-03 1.129e-03 -7.504 6.87e-14 ***
## KIDSDRIV 5.007e+02 9.948e+01 5.034 4.92e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4578 on 8148 degrees of freedom
## Multiple R-squared: 0.05397, Adjusted R-squared: 0.05327
## F-statistic: 77.47 on 6 and 8148 DF, p-value: < 2.2e-16
The results from the final multiple linear regression model show highly significant p-values for every coefficient (all below 0.001) and a higher F-statistic (77.47 on 6 and 8148 degrees of freedom). Despite no improvement in the adjusted R-squared (0.053), we conclude that model 3 is the best of the three multiple linear regression models built. We suspect that outliers influence the adjusted R-squared: the residuals range from -4938 to 104517.
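The selection statistics discussed above, and the suspicion about outliers, can both be checked directly from a fitted lm object. The following is a minimal sketch on simulated data, not the analysis run in this report: the data frame sim, its columns, the injected extreme values, and the 4/n cutoff for Cook's distance (a common rule of thumb) are all assumptions made for illustration.

```r
set.seed(621)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 0.5 * sim$x2 + rnorm(n)
# Inject three extreme responses to mimic a few very large claim amounts.
sim$y[1:3] <- sim$y[1:3] + 40

fit <- lm(y ~ x1 + x2, data = sim)
s <- summary(fit)

# Statistics used for model selection.
adj_r2 <- s$adj.r.squared
f_stat <- s$fstatistic   # named vector: value, numdf, dendf
f_pval <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                    lower.tail = FALSE))

# Cook's distance measures how much each row moves the fitted coefficients.
cooks   <- cooks.distance(fit)
flagged <- which(cooks > 4 / n)

# Refit without the flagged rows and compare adjusted R-squared.
keep     <- setdiff(seq_len(n), flagged)
fit_trim <- lm(y ~ x1 + x2, data = sim[keep, ])
c(all_rows = adj_r2, trimmed = summary(fit_trim)$adj.r.squared)
```

If the adjusted R-squared rises noticeably after trimming, outliers are a likely contributor to the poor fit. Dropping rows is only a diagnostic here; whether to exclude them from a final model is a separate decision.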
For binary logistic regression, we compare the AIC of the three models; the model with the lowest AIC is taken as the best.
## [1] 7367.323
## [1] 7445.531
## [1] 7445.531
The AIC comparison shows that model 1 (AIC 7367.3, against 7445.5 for models 2 and 3) is the best binary logistic regression model. However, we are skeptical, because model 2 shows better p-values. We therefore ran predictions with model 1, and its accuracy is 78.8%.
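For a 0/1 response, the AIC reported by glm() equals the residual deviance plus twice the number of estimated coefficients. The summary shown earlier agrees with this: 7403.5 + 2 × 21 = 7445.5. The following is a minimal sketch of the comparison and of an accuracy calculation on simulated data; the models m_small and m_large, the data frame sim, and the 0.5 classification cutoff are assumptions for illustration, since the report does not show the code it used.

```r
set.seed(621)
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.0 * sim$x1 + 0.7 * sim$x2))

m_small <- glm(y ~ x1,           family = binomial(), data = sim)
m_large <- glm(y ~ x1 + x2 + x3, family = binomial(), data = sim)

# AIC by hand: residual deviance + 2 * number of estimated coefficients.
k <- length(coef(m_large))
aic_by_hand <- deviance(m_large) + 2 * k

# Side-by-side comparison; the lower AIC is preferred.
aic_table <- AIC(m_small, m_large)
aic_table

# In-sample accuracy at a 0.5 probability cutoff.
p_hat      <- predict(m_large, type = "response")
pred_class <- as.integer(p_hat > 0.5)
accuracy   <- mean(pred_class == sim$y)
confusion  <- table(predicted = pred_class, actual = sim$y)
accuracy
confusion
```

Accuracy computed on the same rows used for fitting tends to be optimistic, and it should be read against the share of the majority class, since a model that always predicts "no crash" already scores that share.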
## Start: AIC=3671.18
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ +
## REVOKED + MVR_PTS + CAR_AGE + URBANICITY
##
## Df Deviance AIC
## - CAR_AGE 1 3595.3 3669.3
## - RED_CAR 1 3595.5 3669.5
## - SEX 1 3595.7 3669.7
## - AGE 1 3596.7 3670.7
## <none> 3595.2 3671.2
## - HOMEKIDS 1 3597.3 3671.3
## - YOJ 1 3598.3 3672.3
## - EDUCATION 4 3605.8 3673.8
## - INCOME 1 3600.9 3674.9
## - OLDCLAIM 1 3601.8 3675.8
## - HOME_VAL 1 3603.5 3677.5
## - PARENT1 1 3604.7 3678.7
## - KIDSDRIV 1 3608.6 3682.6
## - BLUEBOOK 1 3608.9 3682.9
## - MSTATUS 1 3611.4 3685.4
## - CAR_USE 1 3616.8 3690.8
## - TIF 1 3621.2 3695.2
## - JOB 8 3635.7 3695.7
## - TRAVTIME 1 3622.4 3696.4
## - CLM_FREQ 1 3625.4 3699.4
## - MVR_PTS 1 3626.6 3700.6
## - CAR_TYPE 5 3646.6 3712.6
## - REVOKED 1 3640.6 3714.6
## - URBANICITY 1 3918.7 3992.7
##
## Step: AIC=3669.28
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ +
## REVOKED + MVR_PTS + URBANICITY
##
## Df Deviance AIC
## - RED_CAR 1 3595.6 3667.6
## - SEX 1 3595.8 3667.8
## - AGE 1 3596.8 3668.8
## <none> 3595.3 3669.3
## - HOMEKIDS 1 3597.4 3669.4
## - YOJ 1 3598.4 3670.4
## - EDUCATION 4 3606.3 3672.3
## - INCOME 1 3600.9 3672.9
## - OLDCLAIM 1 3601.9 3673.9
## - HOME_VAL 1 3603.6 3675.6
## - PARENT1 1 3604.8 3676.8
## - KIDSDRIV 1 3608.7 3680.7
## - BLUEBOOK 1 3609.1 3681.1
## - MSTATUS 1 3611.4 3683.4
## - CAR_USE 1 3616.9 3688.9
## - TIF 1 3621.3 3693.3
## - JOB 8 3635.8 3693.8
## - TRAVTIME 1 3622.4 3694.4
## - CLM_FREQ 1 3625.6 3697.6
## - MVR_PTS 1 3626.7 3698.7
## - CAR_TYPE 5 3646.6 3710.6
## - REVOKED 1 3640.8 3712.8
## - URBANICITY 1 3918.7 3990.7
##
## Step: AIC=3667.64
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED +
## MVR_PTS + URBANICITY
##
## Df Deviance AIC
## - SEX 1 3595.9 3665.9
## - AGE 1 3597.2 3667.2
## <none> 3595.6 3667.6
## - HOMEKIDS 1 3597.8 3667.8
## - YOJ 1 3598.8 3668.8
## - EDUCATION 4 3606.6 3670.6
## - INCOME 1 3601.4 3671.4
## - OLDCLAIM 1 3602.3 3672.3
## - HOME_VAL 1 3603.9 3673.9
## - PARENT1 1 3605.1 3675.1
## - KIDSDRIV 1 3608.9 3678.9
## - BLUEBOOK 1 3609.5 3679.5
## - MSTATUS 1 3612.0 3682.0
## - CAR_USE 1 3617.3 3687.3
## - TIF 1 3621.7 3691.7
## - JOB 8 3636.0 3692.0
## - TRAVTIME 1 3622.8 3692.8
## - CLM_FREQ 1 3626.1 3696.1
## - MVR_PTS 1 3627.1 3697.1
## - CAR_TYPE 5 3646.9 3708.9
## - REVOKED 1 3641.2 3711.2
## - URBANICITY 1 3918.7 3988.7
##
## Step: AIC=3665.88
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED +
## MVR_PTS + URBANICITY
##
## Df Deviance AIC
## - AGE 1 3597.3 3665.3
## <none> 3595.9 3665.9
## - HOMEKIDS 1 3598.1 3666.1
## - YOJ 1 3599.0 3667.0
## - EDUCATION 4 3607.0 3669.0
## - INCOME 1 3601.6 3669.6
## - OLDCLAIM 1 3602.6 3670.6
## - HOME_VAL 1 3604.2 3672.2
## - PARENT1 1 3605.3 3673.3
## - KIDSDRIV 1 3609.3 3677.3
## - BLUEBOOK 1 3611.0 3679.0
## - MSTATUS 1 3612.4 3680.4
## - CAR_USE 1 3617.7 3685.7
## - TIF 1 3622.0 3690.0
## - JOB 8 3636.3 3690.3
## - TRAVTIME 1 3623.0 3691.0
## - CLM_FREQ 1 3626.4 3694.4
## - MVR_PTS 1 3627.4 3695.4
## - REVOKED 1 3641.4 3709.4
## - CAR_TYPE 5 3658.3 3718.3
## - URBANICITY 1 3918.8 3986.8
##
## Step: AIC=3665.29
## TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED +
## MVR_PTS + URBANICITY
##
## Df Deviance AIC
## - HOMEKIDS 1 3598.5 3664.5
## <none> 3597.3 3665.3
## - YOJ 1 3599.9 3665.9
## - EDUCATION 4 3608.2 3668.2
## - INCOME 1 3603.4 3669.4
## - OLDCLAIM 1 3604.0 3670.0
## - HOME_VAL 1 3605.2 3671.2
## - PARENT1 1 3606.1 3672.1
## - BLUEBOOK 1 3611.6 3677.6
## - KIDSDRIV 1 3612.9 3678.9
## - MSTATUS 1 3613.6 3679.6
## - CAR_USE 1 3619.4 3685.4
## - JOB 8 3637.0 3689.0
## - TIF 1 3623.4 3689.4
## - TRAVTIME 1 3624.6 3690.6
## - CLM_FREQ 1 3628.0 3694.0
## - MVR_PTS 1 3628.4 3694.4
## - REVOKED 1 3643.1 3709.1
## - CAR_TYPE 5 3660.6 3718.6
## - URBANICITY 1 3919.0 3985.0
##
## Step: AIC=3664.54
## TARGET_FLAG ~ KIDSDRIV + YOJ + INCOME + PARENT1 + HOME_VAL +
## MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS +
## URBANICITY
##
## Df Deviance AIC
## <none> 3598.5 3664.5
## - YOJ 1 3600.6 3664.6
## - EDUCATION 4 3609.6 3667.6
## - INCOME 1 3604.5 3668.5
## - OLDCLAIM 1 3605.3 3669.3
## - HOME_VAL 1 3606.9 3670.9
## - BLUEBOOK 1 3613.1 3677.1
## - MSTATUS 1 3613.6 3677.6
## - PARENT1 1 3615.1 3679.1
## - CAR_USE 1 3620.7 3684.7
## - KIDSDRIV 1 3621.6 3685.6
## - TIF 1 3624.6 3688.6
## - JOB 8 3639.3 3689.3
## - TRAVTIME 1 3625.7 3689.7
## - CLM_FREQ 1 3629.5 3693.5
## - MVR_PTS 1 3629.7 3693.7
## - REVOKED 1 3644.8 3708.8
## - CAR_TYPE 5 3662.0 3718.0
## - URBANICITY 1 3920.4 3984.4
## [1] 0.790287
https://www.scribbr.com/statistics/multiple-linear-regression/
https://bookdown.org/chua/ber642_advanced_regression/binary-logistic-regression.html
https://www.jstatsoft.org/article/view/v034i12
http://r-statistics.co/Missing-Value-Treatment-With-R.html
https://andyreagan.github.io/teaching/2018/09-SDS-291/lectures/14_modelselection.pdf