DATA 621 – Business Analytics and Data Mining
In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided).
Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas. a. Mean / Standard Deviation / Median b. Bar Chart or Box Plot of the data c. Is the data correlated to the target variable (or to other variables?) d. Are any of the variables missing and need to be imputed “fixed”?
<- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/insurance_training_data.csv")
ins_train_df <- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/insurance-evaluation-data.csv") ins_eval_df
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
<- as.data.frame(lapply(ins_train_df, gsub, pattern='z_', replacement=''))
ins_train_df <- as.data.frame(lapply(ins_eval_df, gsub, pattern='z_', replacement=''))
ins_eval_df <- c("TARGET_FLAG",
cols.num "TARGET_AMT","AGE","YOJ","TRAVTIME","TIF","CLM_FREQ",
"MVR_PTS","CAR_AGE","KIDSDRIV","HOMEKIDS")
<- sapply(ins_train_df[cols.num],as.numeric)
ins_train_df[cols.num] <- sapply(ins_eval_df[cols.num],as.numeric)
ins_eval_df[cols.num]
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
<- c("EDUCATION", "SEX", "CAR_TYPE", "JOB", "CAR_USE", "URBANICITY")
cols.fac
<- lapply(ins_train_df[cols.fac],factor)
ins_train_df[cols.fac] <- lapply(ins_eval_df[cols.fac],factor)
ins_eval_df[cols.fac]
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 F:4375
## Class :character Class :character Class :character M:3786
## Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## <High School:1203 Blue Collar :1825 Min. : 5.00 Commercial:3029
## Bachelors :2242 Clerical :1271 1st Qu.: 22.00 Private :5132
## High School :2330 Professional:1117 Median : 33.00
## Masters :1658 Manager : 988 Mean : 33.49
## PhD : 728 Lawyer : 835 3rd Qu.: 44.00
## Student : 712 Max. :142.00
## (Other) :1413
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Minivan :2145 Length:8161
## Class :character 1st Qu.: 1.000 Panel Truck: 676 Class :character
## Mode :character Median : 4.000 Pickup :1389 Mode :character
## Mean : 5.351 Sports Car : 907
## 3rd Qu.: 7.000 SUV :2294
## Max. :25.000 Van : 750
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Highly Rural/ Rural:1669
## 1st Qu.: 1.000 Highly Urban/ Urban:6492
## Median : 8.000
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
#print first six values
head(ins_train_df$BLUEBOOK)
## [1] "$14,230" "$14,940" "$4,010" "$15,440" "$18,000" "$17,430"
typeof(ins_train_df$BLUEBOOK)
## [1] "character"
head(ins_train_df$INCOME)
## [1] "$67,349" "$91,449" "$16,039" "" "$114,986" "$125,301"
typeof(ins_train_df$INCOME)
## [1] "character"
head(ins_train_df$HOME_VAL)
## [1] "$0" "$257,252" "$124,191" "$306,251" "$243,925" "$0"
typeof(ins_train_df$HOME_VAL)
## [1] "character"
head(ins_train_df$OLDCLAIM)
## [1] "$4,461" "$0" "$38,690" "$0" "$19,217" "$0"
typeof(ins_train_df$OLDCLAIM)
## [1] "character"
head(ins_train_df$INCOME)
## [1] "$67,349" "$91,449" "$16,039" "" "$114,986" "$125,301"
#change it to as.numeric from original (as.integer)
<- function(n){
dollars as.numeric(paste0('0', str_remove_all(n, '[,$]'))) # add leading zero for blanks
}dollars(ins_train_df$INCOME[1:11])
## [1] 67349 91449 16039 0 114986 125301 18755 107961 62978 106952
## [11] 77100
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 F:4375
## Class :character Class :character Class :character M:3786
## Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## <High School:1203 Blue Collar :1825 Min. : 5.00 Commercial:3029
## Bachelors :2242 Clerical :1271 1st Qu.: 22.00 Private :5132
## High School :2330 Professional:1117 Median : 33.00
## Masters :1658 Manager : 988 Mean : 33.49
## PhD : 728 Lawyer : 835 3rd Qu.: 44.00
## Student : 712 Max. :142.00
## (Other) :1413
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Minivan :2145 Length:8161
## Class :character 1st Qu.: 1.000 Panel Truck: 676 Class :character
## Mode :character Median : 4.000 Pickup :1389 Mode :character
## Mean : 5.351 Sports Car : 907
## 3rd Qu.: 7.000 SUV :2294
## Max. :25.000 Van : 750
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Highly Rural/ Rural:1669
## 1st Qu.: 1.000 Highly Urban/ Urban:6492
## Median : 8.000
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
#converting to dollars of blueblook
$BLUEBOOK = dollars(ins_eval_df$BLUEBOOK)
ins_eval_df$BLUEBOOK = dollars(ins_train_df$BLUEBOOK)
ins_train_df$HOME_VAL = dollars(ins_eval_df$HOME_VAL)
ins_eval_df$HOME_VAL = dollars(ins_train_df$HOME_VAL)
ins_train_df$INCOME = dollars(ins_eval_df$INCOME)
ins_eval_df$INCOME = dollars(ins_train_df$INCOME)
ins_train_df$OLDCLAIM = dollars(ins_eval_df$OLDCLAIM)
ins_eval_df$OLDCLAIM = dollars(ins_train_df$OLDCLAIM)
ins_train_df
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 23157
## Median :45.00 Median :0.0000 Median :11.0 Median : 51116
## Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 58523
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 83304
## Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Min. : 0 Length:8161 F:4375
## Class :character 1st Qu.: 0 Class :character M:3786
## Mode :character Median :151957 Mode :character
## Mean :146062
## 3rd Qu.:233352
## Max. :885282
##
## EDUCATION JOB TRAVTIME CAR_USE
## <High School:1203 Blue Collar :1825 Min. : 5.00 Commercial:3029
## Bachelors :2242 Clerical :1271 1st Qu.: 22.00 Private :5132
## High School :2330 Professional:1117 Median : 33.00
## Masters :1658 Manager : 988 Mean : 33.49
## PhD : 728 Lawyer : 835 3rd Qu.: 44.00
## Student : 712 Max. :142.00
## (Other) :1413
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Min. : 1500 Min. : 1.000 Minivan :2145 Length:8161
## 1st Qu.: 9280 1st Qu.: 1.000 Panel Truck: 676 Class :character
## Median :14440 Median : 4.000 Pickup :1389 Mode :character
## Mean :15710 Mean : 5.351 Sports Car : 907
## 3rd Qu.:20850 3rd Qu.: 7.000 SUV :2294
## Max. :69740 Max. :25.000 Van : 750
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Min. : 0 Min. :0.0000 Length:8161 Min. : 0.000
## 1st Qu.: 0 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Median : 0 Median :0.0000 Mode :character Median : 1.000
## Mean : 4037 Mean :0.7986 Mean : 1.696
## 3rd Qu.: 4636 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :57037 Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Highly Rural/ Rural:1669
## 1st Qu.: 1.000 Highly Urban/ Urban:6492
## Median : 8.000
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
head(ins_train_df$PARENT1)
## [1] "No" "No" "No" "No" "No" "Yes"
head(ins_train_df$MSTATUS)
## [1] "No" "No" "Yes" "Yes" "Yes" "No"
head(ins_train_df$SEX)
## [1] M M F M F F
## Levels: F M
head(ins_train_df$EDUCATION)
## [1] PhD High School High School <High School PhD
## [6] Bachelors
## Levels: <High School Bachelors High School Masters PhD
head(ins_train_df$JOB)
## [1] Professional Blue Collar Clerical Blue Collar Doctor
## [6] Blue Collar
## 9 Levels: Blue Collar Clerical Doctor Home Maker Lawyer ... Student
head(ins_train_df$CAR_USE)
## [1] Private Commercial Private Private Private Commercial
## Levels: Commercial Private
head(ins_train_df$CAR_TYPE)
## [1] Minivan Minivan SUV Minivan SUV Sports Car
## Levels: Minivan Panel Truck Pickup Sports Car SUV Van
head(ins_train_df$REVOKED)
## [1] "No" "No" "No" "No" "Yes" "No"
head(ins_train_df$URBANICITY)
## [1] Highly Urban/ Urban Highly Urban/ Urban Highly Urban/ Urban
## [4] Highly Urban/ Urban Highly Urban/ Urban Highly Urban/ Urban
## Levels: Highly Rural/ Rural Highly Urban/ Urban
head(ins_train_df$RED_CAR)
## [1] "yes" "yes" "no" "yes" "no" "no"
table(ins_train_df$URBANICITY)
##
## Highly Rural/ Rural Highly Urban/ Urban
## 1669 6492
#changes to logical type
table(ins_train_df$REVOKED)
##
## No Yes
## 7161 1000
$REVOKED = tolower(ins_train_df$REVOKED)=='yes'
ins_train_df$REVOKED = tolower(ins_eval_df$REVOKED)=='yes'
ins_eval_df$PARENT1 = tolower(ins_train_df$PARENT1)=='yes'
ins_train_df$PARENT1 = tolower(ins_eval_df$PARENT1)=='yes'
ins_eval_df$RED_CAR = tolower(ins_train_df$RED_CAR)=='yes'
ins_train_df$RED_CAR = tolower(ins_eval_df$RED_CAR)=='yes'
ins_eval_df$MSTATUS = tolower(ins_train_df$MSTATUS)=='yes'
ins_train_df$MSTATUS = tolower(ins_eval_df$MSTATUS)=='yes'
ins_eval_df
table(ins_train_df$REVOKED)
##
## FALSE TRUE
## 7161 1000
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 23157
## Median :45.00 Median :0.0000 Median :11.0 Median : 51116
## Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 58523
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 83304
## Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## Mode :logical Min. : 0 Mode :logical F:4375 <High School:1203
## FALSE:7084 1st Qu.: 0 FALSE:3267 M:3786 Bachelors :2242
## TRUE :1077 Median :151957 TRUE :4894 High School :2330
## Mean :146062 Masters :1658
## 3rd Qu.:233352 PhD : 728
## Max. :885282
##
## JOB TRAVTIME CAR_USE BLUEBOOK
## Blue Collar :1825 Min. : 5.00 Commercial:3029 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280
## Professional:1117 Median : 33.00 Median :14440
## Manager : 988 Mean : 33.49 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:20850
## Student : 712 Max. :142.00 Max. :69740
## (Other) :1413
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 Mode :logical Min. : 0
## 1st Qu.: 1.000 Panel Truck: 676 FALSE:5783 1st Qu.: 0
## Median : 4.000 Pickup :1389 TRUE :2378 Median : 0
## Mean : 5.351 Sports Car : 907 Mean : 4037
## 3rd Qu.: 7.000 SUV :2294 3rd Qu.: 4636
## Max. :25.000 Van : 750 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Mode :logical Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 FALSE:7161 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 TRUE :1000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Rural/ Rural:1669
## Highly Urban/ Urban:6492
##
##
##
##
##
$INCOME <- as.numeric(ins_train_df$INCOME)
ins_train_dftypeof(ins_train_df$INCOME)
## [1] "double"
head(ins_train_df$INCOME)
## [1] 67349 91449 16039 0 114986 125301
#summaries for mean
<- ins_train_df %>%
ins_train_numeric ::select(where(is.numeric))
dplyr#we will see what type of variable they are with str
str(ins_train_df)
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : chr "1" "2" "4" "5" ...
## $ TARGET_FLAG: num 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : num 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : num 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : num 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : num 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : num 67349 91449 16039 0 114986 ...
## $ PARENT1 : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ HOME_VAL : num 0 257252 124191 306251 243925 ...
## $ MSTATUS : logi FALSE FALSE TRUE TRUE TRUE FALSE ...
## $ SEX : Factor w/ 2 levels "F","M": 2 2 1 2 1 1 1 2 1 2 ...
## $ EDUCATION : Factor w/ 5 levels "<High School",..: 5 3 3 1 5 2 1 2 2 2 ...
## $ JOB : Factor w/ 9 levels "","Blue Collar",..: 8 2 3 2 4 2 2 2 3 8 ...
## $ TRAVTIME : num 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
## $ BLUEBOOK : num 14230 14940 4010 15440 18000 ...
## $ TIF : num 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 5 1 5 4 5 6 5 6 ...
## $ RED_CAR : logi TRUE TRUE FALSE TRUE FALSE FALSE ...
## $ OLDCLAIM : num 4461 0 38690 0 19217 ...
## $ CLM_FREQ : num 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ MVR_PTS : num 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : num 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : Factor w/ 2 levels "Highly Rural/ Rural",..: 2 2 2 2 2 2 2 2 2 1 ...
<- kable(format(sapply(ins_train_numeric, function(ins_train_numeric) c( "Stand dev" = round(sd(ins_train_numeric),2),
Summaries_KBL "Mean"= mean(ins_train_numeric,na.rm=TRUE),
"n" = length(ins_train_numeric),
"Median" = median(ins_train_numeric,na.rm = TRUE),
"CoeffofVariation" = sd(ins_train_numeric)/mean(ins_train_numeric,na.rm=TRUE),
"Minimum" = min(ins_train_numeric),
"Maximum" = max(ins_train_numeric),
"Upper Quantile" = quantile(ins_train_numeric,1,na.rm = TRUE),
"LowerQuartile" = quantile(ins_train_numeric,0,na.rm = TRUE)
)scientific = FALSE)
),
)<- sub("0+$", "", as.character(Summaries_KBL))
Summaries_KBL Summaries_KBL
## [1] "| |TARGET_FLAG |TARGET_AMT |KIDSDRIV |AGE |HOMEKIDS |YOJ |INCOME |HOME_VAL |TRAVTIME |BLUEBOOK |TIF |OLDCLAIM |CLM_FREQ |MVR_PTS |CAR_AGE |"
## [2] "|:-------------------|:------------|:--------------|:------------|:------------|:------------|:------------|:--------------|:--------------|:------------|:-------------|:------------|:-------------|:------------|:------------|:------------|"
## [3] "|Stand dev |0.4400000 |4704.0300000 |0.5100000 |NA |1.1200000 |NA |48345.5100000 |130426.7200000 |15.9100000 |8419.7300000 |4.1500000 |8777.1400000 |1.1600000 |2.1500000 |NA |"
## [4] "|Mean |0.2638157 |1504.3246481 |0.1710575 |44.7903127 |0.7212351 |10.4992864 |58522.9381203 |146062.1895601 |33.4857248 |15709.8995221 |5.3513050 |4037.0762161 |0.7985541 |1.6955030 |8.3283231 |"
## [5] "|n |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |"
## [6] "|Median |0.0000000 |0.0000000 |0.0000000 |45.0000000 |0.0000000 |11.0000000 |51116.0000000 |151957.0000000 |33.0000000 |14440.0000000 |4.0000000 |0.0000000 |0.0000000 |1.0000000 |8.0000000 |"
## [7] "|CoeffofVariation |1.6705888 |3.1270025 |2.9904224 |NA |1.5477938 |NA |0.8260951 |0.8929533 |0.4750781 |0.5359509 |0.7748830 |2.1741326 |1.4506878 |1.2663568 |NA |"
## [8] "|Minimum |0.0000000 |0.0000000 |0.0000000 |NA |0.0000000 |NA |0.0000000 |0.0000000 |5.0000000 |1500.0000000 |1.0000000 |0.0000000 |0.0000000 |0.0000000 |NA |"
## [9] "|Maximum |1.0000000 |107586.1361600 |4.0000000 |NA |5.0000000 |NA |367030.0000000 |885282.0000000 |142.0000000 |69740.0000000 |25.0000000 |57037.0000000 |5.0000000 |13.0000000 |NA |"
## [10] "|Upper Quantile.100% |1.0000000 |107586.1361600 |4.0000000 |81.0000000 |5.0000000 |23.0000000 |367030.0000000 |885282.0000000 |142.0000000 |69740.0000000 |25.0000000 |57037.0000000 |5.0000000 |13.0000000 |28.0000000 |"
## [11] "|LowerQuartile.0% |0.0000000 |0.0000000 |0.0000000 |16.0000000 |0.0000000 |0.0000000 |0.0000000 |0.0000000 |5.0000000 |1500.0000000 |1.0000000 |0.0000000 |0.0000000 |0.0000000 |-3.0000000 |"
###summary statistics
#ignore all missing values, provides our SD for all numeric
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 23157
## Median :45.00 Median :0.0000 Median :11.0 Median : 51116
## Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 58523
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 83304
## Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## Mode :logical Min. : 0 Mode :logical F:4375 <High School:1203
## FALSE:7084 1st Qu.: 0 FALSE:3267 M:3786 Bachelors :2242
## TRUE :1077 Median :151957 TRUE :4894 High School :2330
## Mean :146062 Masters :1658
## 3rd Qu.:233352 PhD : 728
## Max. :885282
##
## JOB TRAVTIME CAR_USE BLUEBOOK
## Blue Collar :1825 Min. : 5.00 Commercial:3029 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280
## Professional:1117 Median : 33.00 Median :14440
## Manager : 988 Mean : 33.49 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:20850
## Student : 712 Max. :142.00 Max. :69740
## (Other) :1413
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 Mode :logical Min. : 0
## 1st Qu.: 1.000 Panel Truck: 676 FALSE:5783 1st Qu.: 0
## Median : 4.000 Pickup :1389 TRUE :2378 Median : 0
## Mean : 5.351 Sports Car : 907 Mean : 4037
## 3rd Qu.: 7.000 SUV :2294 3rd Qu.: 4636
## Max. :25.000 Van : 750 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Mode :logical Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 FALSE:7161 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 TRUE :1000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Rural/ Rural:1669
## Highly Urban/ Urban:6492
##
##
##
##
##
sapply(ins_train_numeric, sd, na.rm=TRUE)
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 4.407276e-01 4.704027e+03 5.115341e-01 8.627589e+00 1.116323e+00 4.092474e+00
## INCOME HOME_VAL TRAVTIME BLUEBOOK TIF OLDCLAIM
## 4.834551e+04 1.304267e+05 1.590833e+01 8.419734e+03 4.146635e+00 8.777139e+03
## CLM_FREQ MVR_PTS CAR_AGE
## 1.158453e+00 2.147112e+00 5.700742e+00
%>% gather() %>% head() ins_train_numeric
## key value
## 1 TARGET_FLAG 0
## 2 TARGET_FLAG 0
## 3 TARGET_FLAG 0
## 4 TARGET_FLAG 0
## 5 TARGET_FLAG 0
## 6 TARGET_FLAG 1
ggplot(gather(ins_train_numeric), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
hist(ins_train_df$TARGET_AMT)
summary(ins_train_df$TARGET_AMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1504 1036 107586
var(ins_train_df$TARGET_AMT)
## [1] 22127869
var(log(ins_train_df$TARGET_AMT+1))
## [1] 13.47907
#benefit of log transform, we added because there a lot of 0's in Target_AMT, we added 1
#log of 0 undefined
hist(ins_train_df$INCOME)
summary(ins_train_df$INCOME)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 23157 51116 58523 83304 367030
var(ins_train_df$INCOME)
## [1] 2337288545
var(log(ins_train_df$INCOME+1))
## [1] 13.98386
#age, yoj,car_age many misssing values
#we are focusing on the ones with missing values first
hist(ins_train_df$OLDCLAIM)
summary(ins_train_df$OLDCLAIM)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 4037 4636 57037
var(ins_train_df$OLDCLAIM)
## [1] 77038171
var(log(ins_train_df$OLDCLAIM+1))
## [1] 18.59325
#table(ins_train_df$OLDCLAIM)
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 23157
## Median :45.00 Median :0.0000 Median :11.0 Median : 51116
## Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 58523
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 83304
## Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## Mode :logical Min. : 0 Mode :logical F:4375 <High School:1203
## FALSE:7084 1st Qu.: 0 FALSE:3267 M:3786 Bachelors :2242
## TRUE :1077 Median :151957 TRUE :4894 High School :2330
## Mean :146062 Masters :1658
## 3rd Qu.:233352 PhD : 728
## Max. :885282
##
## JOB TRAVTIME CAR_USE BLUEBOOK
## Blue Collar :1825 Min. : 5.00 Commercial:3029 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280
## Professional:1117 Median : 33.00 Median :14440
## Manager : 988 Mean : 33.49 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:20850
## Student : 712 Max. :142.00 Max. :69740
## (Other) :1413
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 Mode :logical Min. : 0
## 1st Qu.: 1.000 Panel Truck: 676 FALSE:5783 1st Qu.: 0
## Median : 4.000 Pickup :1389 TRUE :2378 Median : 0
## Mean : 5.351 Sports Car : 907 Mean : 4037
## 3rd Qu.: 7.000 SUV :2294 3rd Qu.: 4636
## Max. :25.000 Van : 750 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Mode :logical Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 FALSE:7161 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 TRUE :1000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Rural/ Rural:1669
## Highly Urban/ Urban:6492
##
##
##
##
##
# ggplot(gather(ins_train_numeric, cols, value), aes(x = value)) +
# geom_histogram(binwidth = 20) + facet_grid(.~cols)
##correlation matrix
= rcorr(as.matrix(ins_train_numeric))
ins_train_numeric.rcorr ins_train_numeric.rcorr
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME
## TARGET_FLAG 1.00 0.53 0.10 -0.10 0.12 -0.07 -0.13
## TARGET_AMT 0.53 1.00 0.06 -0.04 0.06 -0.02 -0.05
## KIDSDRIV 0.10 0.06 1.00 -0.08 0.46 0.04 -0.04
## AGE -0.10 -0.04 -0.08 1.00 -0.45 0.14 0.17
## HOMEKIDS 0.12 0.06 0.46 -0.45 1.00 0.09 -0.15
## YOJ -0.07 -0.02 0.04 0.14 0.09 1.00 0.27
## INCOME -0.13 -0.05 -0.04 0.17 -0.15 0.27 1.00
## HOME_VAL -0.17 -0.08 -0.02 0.19 -0.10 0.25 0.50
## TRAVTIME 0.05 0.03 0.01 0.01 -0.01 -0.02 -0.04
## BLUEBOOK -0.10 0.00 -0.02 0.17 -0.11 0.14 0.41
## TIF -0.08 -0.05 0.00 0.00 0.01 0.02 0.00
## OLDCLAIM 0.14 0.07 0.02 -0.03 0.03 0.00 -0.04
## CLM_FREQ 0.22 0.12 0.04 -0.02 0.03 -0.03 -0.04
## MVR_PTS 0.22 0.14 0.05 -0.07 0.06 -0.04 -0.05
## CAR_AGE -0.10 -0.06 -0.05 0.18 -0.15 0.06 0.39
## HOME_VAL TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_FLAG -0.17 0.05 -0.10 -0.08 0.14 0.22 0.22 -0.10
## TARGET_AMT -0.08 0.03 0.00 -0.05 0.07 0.12 0.14 -0.06
## KIDSDRIV -0.02 0.01 -0.02 0.00 0.02 0.04 0.05 -0.05
## AGE 0.19 0.01 0.17 0.00 -0.03 -0.02 -0.07 0.18
## HOMEKIDS -0.10 -0.01 -0.11 0.01 0.03 0.03 0.06 -0.15
## YOJ 0.25 -0.02 0.14 0.02 0.00 -0.03 -0.04 0.06
## INCOME 0.50 -0.04 0.41 0.00 -0.04 -0.04 -0.05 0.39
## HOME_VAL 1.00 -0.03 0.24 0.00 -0.07 -0.09 -0.08 0.20
## TRAVTIME -0.03 1.00 -0.02 -0.01 -0.02 0.01 0.01 -0.04
## BLUEBOOK 0.24 -0.02 1.00 -0.01 -0.03 -0.04 -0.04 0.19
## TIF 0.00 -0.01 -0.01 1.00 -0.02 -0.02 -0.04 0.01
## OLDCLAIM -0.07 -0.02 -0.03 -0.02 1.00 0.50 0.26 -0.01
## CLM_FREQ -0.09 0.01 -0.04 -0.02 0.50 1.00 0.40 -0.01
## MVR_PTS -0.08 0.01 -0.04 -0.04 0.26 0.40 1.00 -0.02
## CAR_AGE 0.20 -0.04 0.19 0.01 -0.01 -0.01 -0.02 1.00
##
## n
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## TARGET_FLAG 8161 8161 8161 8155 8161 7707 8161 8161
## TARGET_AMT 8161 8161 8161 8155 8161 7707 8161 8161
## KIDSDRIV 8161 8161 8161 8155 8161 7707 8161 8161
## AGE 8155 8155 8155 8155 8155 7701 8155 8155
## HOMEKIDS 8161 8161 8161 8155 8161 7707 8161 8161
## YOJ 7707 7707 7707 7701 7707 7707 7707 7707
## INCOME 8161 8161 8161 8155 8161 7707 8161 8161
## HOME_VAL 8161 8161 8161 8155 8161 7707 8161 8161
## TRAVTIME 8161 8161 8161 8155 8161 7707 8161 8161
## BLUEBOOK 8161 8161 8161 8155 8161 7707 8161 8161
## TIF 8161 8161 8161 8155 8161 7707 8161 8161
## OLDCLAIM 8161 8161 8161 8155 8161 7707 8161 8161
## CLM_FREQ 8161 8161 8161 8155 8161 7707 8161 8161
## MVR_PTS 8161 8161 8161 8155 8161 7707 8161 8161
## CAR_AGE 7651 7651 7651 7645 7651 7219 7651 7651
## TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_FLAG 8161 8161 8161 8161 8161 8161 7651
## TARGET_AMT 8161 8161 8161 8161 8161 8161 7651
## KIDSDRIV 8161 8161 8161 8161 8161 8161 7651
## AGE 8155 8155 8155 8155 8155 8155 7645
## HOMEKIDS 8161 8161 8161 8161 8161 8161 7651
## YOJ 7707 7707 7707 7707 7707 7707 7219
## INCOME 8161 8161 8161 8161 8161 8161 7651
## HOME_VAL 8161 8161 8161 8161 8161 8161 7651
## TRAVTIME 8161 8161 8161 8161 8161 8161 7651
## BLUEBOOK 8161 8161 8161 8161 8161 8161 7651
## TIF 8161 8161 8161 8161 8161 8161 7651
## OLDCLAIM 8161 8161 8161 8161 8161 8161 7651
## CLM_FREQ 8161 8161 8161 8161 8161 8161 7651
## MVR_PTS 8161 8161 8161 8161 8161 8161 7651
## CAR_AGE 7651 7651 7651 7651 7651 7651 7651
##
## P
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME
## TARGET_FLAG 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## TARGET_AMT 0.0000 0.0000 0.0002 0.0000 0.0525 0.0000
## KIDSDRIV 0.0000 0.0000 0.0000 0.0000 0.0001 0.0000
## AGE 0.0000 0.0002 0.0000 0.0000 0.0000 0.0000
## HOMEKIDS 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## YOJ 0.0000 0.0525 0.0001 0.0000 0.0000 0.0000
## INCOME 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## HOME_VAL 0.0000 0.0000 0.1648 0.0000 0.0000 0.0000 0.0000
## TRAVTIME 0.0000 0.0115 0.4455 0.6342 0.5128 0.1369 0.0000
## BLUEBOOK 0.0000 0.6712 0.0516 0.0000 0.0000 0.0000 0.0000
## TIF 0.0000 0.0000 0.8574 0.9952 0.2859 0.0296 0.9396
## OLDCLAIM 0.0000 0.0000 0.0653 0.0082 0.0069 0.7936 0.0001
## CLM_FREQ 0.0000 0.0000 0.0008 0.0296 0.0080 0.0209 0.0001
## MVR_PTS 0.0000 0.0000 0.0000 0.0000 0.0000 0.0009 0.0000
## CAR_AGE 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## HOME_VAL TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_FLAG 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## TARGET_AMT 0.0000 0.0115 0.6712 0.0000 0.0000 0.0000 0.0000 0.0000
## KIDSDRIV 0.1648 0.4455 0.0516 0.8574 0.0653 0.0008 0.0000 0.0000
## AGE 0.0000 0.6342 0.0000 0.9952 0.0082 0.0296 0.0000 0.0000
## HOMEKIDS 0.0000 0.5128 0.0000 0.2859 0.0069 0.0080 0.0000 0.0000
## YOJ 0.0000 0.1369 0.0000 0.0296 0.7936 0.0209 0.0009 0.0000
## INCOME 0.0000 0.0000 0.0000 0.9396 0.0001 0.0001 0.0000 0.0000
## HOME_VAL 0.0055 0.0000 0.7376 0.0000 0.0000 0.0000 0.0000
## TRAVTIME 0.0055 0.1246 0.2945 0.0818 0.5535 0.3384 0.0008
## BLUEBOOK 0.0000 0.1246 0.6242 0.0077 0.0010 0.0004 0.0000
## TIF 0.7376 0.2945 0.6242 0.0473 0.0375 0.0002 0.4969
## OLDCLAIM 0.0000 0.0818 0.0077 0.0473 0.0000 0.0000 0.2417
## CLM_FREQ 0.0000 0.5535 0.0010 0.0375 0.0000 0.0000 0.4151
## MVR_PTS 0.0000 0.3384 0.0004 0.0002 0.0000 0.0000 0.0817
## CAR_AGE 0.0000 0.0008 0.0000 0.4969 0.2417 0.4151 0.0817
#a lot of NA's for the three variables, Correlation take all values in the column and associate them
= cor(ins_train_numeric, use = "complete.obs")
ins_train_numeric.cor corrplot(ins_train_numeric.cor)
#HOMEKIDS looks like good group for imputing ages
#incoming needs to be clustered for imputing
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 23157
## Median :45.00 Median :0.0000 Median :11.0 Median : 51116
## Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 58523
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 83304
## Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## Mode :logical Min. : 0 Mode :logical F:4375 <High School:1203
## FALSE:7084 1st Qu.: 0 FALSE:3267 M:3786 Bachelors :2242
## TRUE :1077 Median :151957 TRUE :4894 High School :2330
## Mean :146062 Masters :1658
## 3rd Qu.:233352 PhD : 728
## Max. :885282
##
## JOB TRAVTIME CAR_USE BLUEBOOK
## Blue Collar :1825 Min. : 5.00 Commercial:3029 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280
## Professional:1117 Median : 33.00 Median :14440
## Manager : 988 Mean : 33.49 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:20850
## Student : 712 Max. :142.00 Max. :69740
## (Other) :1413
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 Mode :logical Min. : 0
## 1st Qu.: 1.000 Panel Truck: 676 FALSE:5783 1st Qu.: 0
## Median : 4.000 Pickup :1389 TRUE :2378 Median : 0
## Mean : 5.351 Sports Car : 907 Mean : 4037
## 3rd Qu.: 7.000 SUV :2294 3rd Qu.: 4636
## Max. :25.000 Van : 750 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Mode :logical Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 FALSE:7161 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 TRUE :1000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Rural/ Rural:1669
## Highly Urban/ Urban:6492
##
##
##
##
##
#correlations between variables we highly suspect
<- cor.test(ins_train_df$INCOME, ins_train_df$TARGET_AMT)
cor_inctarget cor_inctarget
##
## Pearson's product-moment correlation
##
## data: ins_train_df$INCOME and ins_train_df$TARGET_AMT
## t = -4.7556, df = 8159, p-value = 2.013e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07418735 -0.03091440
## sample estimates:
## cor
## -0.05257556
<- cor.test(ins_train_df$HOME_VAL, ins_train_df$TARGET_AMT)
cor_hovaltarget cor_hovaltarget
##
## Pearson's product-moment correlation
##
## data: ins_train_df$HOME_VAL and ins_train_df$TARGET_AMT
## t = -7.435, df = 8159, p-value = 1.152e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.10354605 -0.06044509
## sample estimates:
## cor
## -0.08203393
<- cor.test(ins_train_df$CLM_FREQ, ins_train_df$TARGET_AMT)
cor_cfreqtarget cor_cfreqtarget
##
## Pearson's product-moment correlation
##
## data: ins_train_df$CLM_FREQ and ins_train_df$TARGET_AMT
## t = 10.588, df = 8159, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0949626 0.1377676
## sample estimates:
## cor
## 0.1164192
<- cor.test(ins_train_df$MVR_PTS, ins_train_df$TARGET_AMT)
cor_mptstarget cor_mptstarget
##
## Pearson's product-moment correlation
##
## data: ins_train_df$MVR_PTS and ins_train_df$TARGET_AMT
## t = 12.573, df = 8159, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1165176 0.1590861
## sample estimates:
## cor
## 0.1378655
<- cor.test(ins_train_df$TRAVTIME, ins_train_df$TARGET_AMT)
cor_travtarget cor_travtarget
##
## Pearson's product-moment correlation
##
## data: ins_train_df$TRAVTIME and ins_train_df$TARGET_AMT
## t = 2.529, df = 8159, p-value = 0.01146
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.006294418 0.049653286
## sample estimates:
## cor
## 0.02798702
#the weaker the correlation the higher the p value
#ttests
#most P values significantly low
<- t.test(ins_train_df$INCOME~ ins_train_df$TARGET_FLAG)
t_inctarget t_inctarget
##
## Welch Two Sample t-test
##
## data: ins_train_df$INCOME by ins_train_df$TARGET_FLAG
## t = 12.604, df = 4321.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12008.56 16432.63
## sample estimates:
## mean in group 0 mean in group 1
## 62274.55 48053.96
<- t.test(ins_train_df$HOME_VAL~ ins_train_df$TARGET_FLAG)
t_hovaltarget t_hovaltarget
##
## Welch Two Sample t-test
##
## data: ins_train_df$HOME_VAL by ins_train_df$TARGET_FLAG
## t = 16.56, df = 4222.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 44648.07 56639.42
## sample estimates:
## mean in group 0 mean in group 1
## 159422.8 108779.1
<- t.test(ins_train_df$CLM_FREQ~ ins_train_df$TARGET_FLAG)
t_cfreqtarget t_cfreqtarget
##
## Welch Two Sample t-test
##
## data: ins_train_df$CLM_FREQ by ins_train_df$TARGET_FLAG
## t = -18.734, df = 3388.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6277464 -0.5087966
## sample estimates:
## mean in group 0 mean in group 1
## 0.6486352 1.2169066
<- t.test(ins_train_df$MVR_PTS~ ins_train_df$TARGET_FLAG)
t_mptstarget t_mptstarget
##
## Welch Two Sample t-test
##
## data: ins_train_df$MVR_PTS by ins_train_df$TARGET_FLAG
## t = -17.591, df = 3021.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1869030 -0.9488407
## sample estimates:
## mean in group 0 mean in group 1
## 1.413782 2.481654
<- t.test(ins_train_df$TRAVTIME~ ins_train_df$TARGET_FLAG)
t_travtarget t_travtarget
##
## Welch Two Sample t-test
##
## data: ins_train_df$TRAVTIME by ins_train_df$TARGET_FLAG
## t = -4.4996, df = 4007.7, p-value = 7e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.5065923 -0.9851758
## sample estimates:
## mean in group 0 mean in group 1
## 33.02513 34.77102
#percentages is better because same scale for both graphs
library('lattice')
histogram(~ INCOME | TARGET_FLAG, data = ins_train_df)
histogram(~ HOME_VAL | TARGET_FLAG, data = ins_train_df)
histogram(~ CLM_FREQ | TARGET_FLAG, data = ins_train_df)
histogram(~ MVR_PTS | TARGET_FLAG, data = ins_train_df)
#checking the distribution, flag 0 means no car crashes, target 1 means car crashes, we are checking distribution of income between both
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
There was missing values
We are putting the data into clusters, based on these buckets we are imputing the missing values.
#flagging missing values
$AGE_miss <- ifelse(is.na(ins_train_df$AGE),1,0)
ins_train_df$YOJ_miss <- ifelse(is.na(ins_train_df$YOJ),1,0)
ins_train_df$CAR_AGE_miss <- ifelse(is.na(ins_train_df$CAR_AGE),1,0)
ins_train_df
#create five clusters of income
<- kmeans(ins_train_df$INCOME, centers = 5)
kmeans.re table(kmeans.re$cluster)
##
## 1 2 3 4 5
## 2069 2527 335 2160 1070
$inc_clusters <- kmeans.re$cluster
ins_train_df#table(ins_train_df$INCOME)
#we are creating groups based on HOMEKIDS, replace missing value
<- ins_train_df %>%
ins_train_df group_by(HOMEKIDS) %>%
mutate(AGE= ifelse(is.na(AGE), mean(AGE, na.rm=TRUE), AGE))
#we are creating groups based on car_age, replace missing value
<- ins_train_df %>%
ins_train_df group_by(inc_clusters) %>%
mutate(CAR_AGE= ifelse(is.na(CAR_AGE), mean(CAR_AGE, na.rm=TRUE), CAR_AGE))
#replace missing value {based on whole sample/mean based}
#no group value is highly correlated by YOJ
<- ins_train_df %>%
ins_train_df mutate(YOJ= ifelse(is.na(YOJ), mean(YOJ, na.rm=TRUE), YOJ))
#subesection for converting those with huge variance (log x + 1)
$TARGET_AMT_log <- log(ins_train_df$TARGET_AMT+1)
ins_train_df$INCOME_log <- log(ins_train_df$INCOME+1)
ins_train_df$OLDCLAIM_log <- log(ins_train_df$OLDCLAIM+1)
ins_train_df
$AVG_CLM_PAY <- ins_train_df$OLDCLAIM/ins_train_df$CLM_FREQ
ins_train_df
summary(ins_train_df)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Length:8161 Min. :0.0000 Min. : 0 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Mode :character Median :0.0000 Median : 0 Median :0.0000
## Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.00 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 23157
## Median :45.00 Median :0.0000 Median :11.38 Median : 51116
## Mean :44.79 Mean :0.7212 Mean :10.49 Mean : 58523
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.00 3rd Qu.: 83304
## Max. :81.00 Max. :5.0000 Max. :23.00 Max. :367030
##
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## Mode :logical Min. : 0 Mode :logical F:4375 <High School:1203
## FALSE:7084 1st Qu.: 0 FALSE:3267 M:3786 Bachelors :2242
## TRUE :1077 Median :151957 TRUE :4894 High School :2330
## Mean :146062 Masters :1658
## 3rd Qu.:233352 PhD : 728
## Max. :885282
##
## JOB TRAVTIME CAR_USE BLUEBOOK
## Blue Collar :1825 Min. : 5.00 Commercial:3029 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280
## Professional:1117 Median : 33.00 Median :14440
## Manager : 988 Mean : 33.49 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:20850
## Student : 712 Max. :142.00 Max. :69740
## (Other) :1413
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 Mode :logical Min. : 0
## 1st Qu.: 1.000 Panel Truck: 676 FALSE:5783 1st Qu.: 0
## Median : 4.000 Pickup :1389 TRUE :2378 Median : 0
## Mean : 5.351 Sports Car : 907 Mean : 4037
## 3rd Qu.: 7.000 SUV :2294 3rd Qu.: 4636
## Max. :25.000 Van : 750 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Mode :logical Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 FALSE:7161 1st Qu.: 0.000 1st Qu.: 4.000
## Median :0.0000 TRUE :1000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.333
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
##
## URBANICITY AGE_miss YOJ_miss
## Highly Rural/ Rural:1669 Min. :0.0000000 Min. :0.00000
## Highly Urban/ Urban:6492 1st Qu.:0.0000000 1st Qu.:0.00000
## Median :0.0000000 Median :0.00000
## Mean :0.0007352 Mean :0.05563
## 3rd Qu.:0.0000000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :1.00000
##
## CAR_AGE_miss inc_clusters TARGET_AMT_log INCOME_log
## Min. :0.00000 Min. :1.00 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.00000 1st Qu.:1.00 1st Qu.: 0.000 1st Qu.:10.050
## Median :0.00000 Median :2.00 Median : 0.000 Median :10.842
## Mean :0.06249 Mean :2.71 Mean : 2.183 Mean : 9.421
## 3rd Qu.:0.00000 3rd Qu.:4.00 3rd Qu.: 6.944 3rd Qu.:11.330
## Max. :1.00000 Max. :5.00 Max. :11.586 Max. :12.813
##
## OLDCLAIM_log AVG_CLM_PAY
## Min. : 0.000 Min. : 129.8
## 1st Qu.: 0.000 1st Qu.: 1665.1
## Median : 0.000 Median : 3204.8
## Mean : 3.386 Mean : 6315.6
## 3rd Qu.: 8.442 3rd Qu.: 7160.7
## Max. :10.951 Max. :57037.0
## NA's :5009
#inspecting the TARGET_AMT variable
boxplot(ins_train_df$TARGET_AMT)
summary(ins_train_df$TARGET_AMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1504 1036 107586
#Building model using all features.
<- lm(TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME +
lm_ins_train_df +HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME +
PARENT1 + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ +
CAR_USE + MVR_PTS + CAR_AGE + URBANICITY
REVOKED data = ins_train_df)
,
summary(lm_ins_train_df)
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME +
## PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME +
## CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM +
## CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY, data = ins_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5875 -1698 -758 355 103796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.332e+02 5.732e+02 -0.756 0.44979
## KIDSDRIV 3.128e+02 1.132e+02 2.764 0.00573 **
## AGE 5.440e+00 7.068e+00 0.770 0.44153
## HOMEKIDS 8.050e+01 6.539e+01 1.231 0.21838
## YOJ -7.319e+00 1.515e+01 -0.483 0.62899
## INCOME -3.555e-03 1.626e-03 -2.187 0.02878 *
## PARENT1TRUE 5.788e+02 2.020e+02 2.866 0.00417 **
## HOME_VAL -6.343e-04 5.380e-04 -1.179 0.23847
## MSTATUSTRUE -5.588e+02 1.416e+02 -3.946 8.00e-05 ***
## SEXM 3.728e+02 1.838e+02 2.028 0.04255 *
## EDUCATIONBachelors -2.630e+02 2.048e+02 -1.284 0.19917
## EDUCATIONHigh School -9.077e+01 1.719e+02 -0.528 0.59751
## EDUCATIONMasters 2.530e+01 3.007e+02 0.084 0.93294
## EDUCATIONPhD 2.605e+02 3.548e+02 0.734 0.46293
## JOBBlue Collar 5.209e+02 3.216e+02 1.620 0.10531
## JOBClerical 5.536e+02 3.406e+02 1.625 0.10415
## JOBDoctor -4.918e+02 4.087e+02 -1.203 0.22884
## JOBHome Maker 3.763e+02 3.631e+02 1.036 0.30010
## JOBLawyer 2.397e+02 2.956e+02 0.811 0.41751
## JOBManager -4.708e+02 2.884e+02 -1.633 0.10257
## JOBProfessional 4.663e+02 3.086e+02 1.511 0.13083
## JOBStudent 3.027e+02 3.734e+02 0.811 0.41760
## TRAVTIME 1.195e+01 3.222e+00 3.709 0.00021 ***
## CAR_USEPrivate -7.784e+02 1.644e+02 -4.734 2.24e-06 ***
## BLUEBOOK 1.402e-02 8.615e-03 1.628 0.10365
## TIF -4.803e+01 1.218e+01 -3.944 8.07e-05 ***
## CAR_TYPEPanel Truck 2.592e+02 2.781e+02 0.932 0.35134
## CAR_TYPEPickup 3.770e+02 1.707e+02 2.209 0.02720 *
## CAR_TYPESports Car 1.024e+03 2.178e+02 4.703 2.60e-06 ***
## CAR_TYPESUV 7.527e+02 1.793e+02 4.197 2.73e-05 ***
## CAR_TYPEVan 5.071e+02 2.132e+02 2.379 0.01739 *
## RED_CARTRUE -4.812e+01 1.490e+02 -0.323 0.74681
## OLDCLAIM -1.054e-02 7.436e-03 -1.417 0.15655
## CLM_FREQ 1.416e+02 5.503e+01 2.573 0.01009 *
## REVOKEDTRUE 5.488e+02 1.735e+02 3.163 0.00157 **
## MVR_PTS 1.760e+02 2.591e+01 6.790 1.20e-11 ***
## CAR_AGE -2.853e+01 1.292e+01 -2.208 0.02729 *
## URBANICITYHighly Urban/ Urban 1.665e+03 1.394e+02 11.944 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4544 on 8123 degrees of freedom
## Multiple R-squared: 0.07099, Adjusted R-squared: 0.06675
## F-statistic: 16.78 on 37 and 8123 DF, p-value: < 2.2e-16
#Building model Selectively chosen features.
<- lm(TARGET_AMT ~ KIDSDRIV + MSTATUS + CAR_USE + BLUEBOOK + CAR_TYPE
lms_ins_train_df + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY
data = ins_train_df)
,
summary(lms_ins_train_df)
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + MSTATUS + CAR_USE + BLUEBOOK +
## CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,
## data = ins_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5194 -1667 -843 228 104141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.762e+02 2.178e+02 3.564 0.000368 ***
## KIDSDRIV 4.888e+02 9.967e+01 4.904 9.58e-07 ***
## MSTATUSTRUE -7.664e+02 1.040e+02 -7.372 1.85e-13 ***
## CAR_USEPrivate -9.401e+02 1.255e+02 -7.492 7.49e-14 ***
## BLUEBOOK -1.008e-02 7.411e-03 -1.360 0.173783
## CAR_TYPEPanel Truck 2.795e+02 2.452e+02 1.140 0.254497
## CAR_TYPEPickup 3.214e+02 1.672e+02 1.922 0.054636 .
## CAR_TYPESports Car 8.420e+02 1.847e+02 4.558 5.25e-06 ***
## CAR_TYPESUV 5.594e+02 1.408e+02 3.974 7.12e-05 ***
## CAR_TYPEVan 4.990e+02 2.036e+02 2.450 0.014294 *
## OLDCLAIM -1.101e-02 7.479e-03 -1.472 0.141194
## CLM_FREQ 1.769e+02 5.524e+01 3.202 0.001370 **
## REVOKEDTRUE 6.394e+02 1.744e+02 3.666 0.000248 ***
## MVR_PTS 1.995e+02 2.595e+01 7.689 1.65e-14 ***
## URBANICITYHighly Urban/ Urban 1.214e+03 1.315e+02 9.227 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4578 on 8146 degrees of freedom
## Multiple R-squared: 0.05444, Adjusted R-squared: 0.05281
## F-statistic: 33.5 on 14 and 8146 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(lms_ins_train_df$residuals ~ lms_ins_train_df$fitted.values, main="New Reduced Var Model")
abline(h = 0)
plot(lm_ins_train_df$residuals ~ lm_ins_train_df$fitted.values, main="Orignal Model All Vars")
abline(h = 0)
Log model takeaway: even a basic baseline model is valuable for an insurance company