We discuss the challenges from evaluating insurance data, asserting whether a car is involved in a crash and then how expensive the claim would be. Mainly, we discuss imputation but our end models are not able to effectively predict car crashes due to overfitting and high false positives. Our linear regression models pull some insights regarding the effects of marital status and others on claim costs.
The data has 23 features and 2 target variables that we wish to predict. As part of our predictive analytics, first we will decide if the car that belongs to a license plate has been in a crash before. Second, we will determine how much the cost of the crash was if they were.
There are some general myths related to driving that may skew our view about what we expect. How much of it is true? Personally, I think that senior drivers get into more crashes than those younger but what if that’s not the case? Provided below is a quick data dictionary of the features we will be measuring.
The top of the dataset is shown below, where we can see a smattering of isues such as untidy data and even a missing value in one of the income observations.
#this step is necessary in order to analyze data as it is not clean
currencyconv = function(input) {
out = sub("\\$", "", input)
out = as.numeric(sub(",", "", out))
return(out)
}
# Replace spaces with underscores
underscore = function(input) {
out = sub(" ", "_", input)
return(out)
}
rawData = as.tbl(rawData) %>%
mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
currencyconv) %>%
mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
underscore) %>%
mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
as.factor) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
## Warning: `as.tbl()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#check data
summary(rawData) %>% kable() %>% kable_styling()
| INDEX | TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1 | 0:6008 | Min. : 0 | Min. :0.0000 | Min. :16.00 | Min. :0.0000 | Min. : 0.0 | Min. : 0 | Length:8161 | Min. : 0 | Length:8161 | Length:8161 | <High_School :1203 | z_Blue_Collar:1825 | Min. : 5.00 | Length:8161 | Min. : 1500 | Min. : 1.000 | Minivan :2145 | Length:8161 | Min. : 0 | Min. :0.0000 | Length:8161 | Min. : 0.000 | Min. :-3.000 | Highly_Urban/ Urban :6492 | |
| 1st Qu.: 2559 | 1:2153 | 1st Qu.: 0 | 1st Qu.:0.0000 | 1st Qu.:39.00 | 1st Qu.:0.0000 | 1st Qu.: 9.0 | 1st Qu.: 28097 | Class :character | 1st Qu.: 0 | Class :character | Class :character | Bachelors :2242 | Clerical :1271 | 1st Qu.: 22.00 | Class :character | 1st Qu.: 9280 | 1st Qu.: 1.000 | Panel_Truck: 676 | Class :character | 1st Qu.: 0 | 1st Qu.:0.0000 | Class :character | 1st Qu.: 0.000 | 1st Qu.: 1.000 | z_Highly_Rural/ Rural:1669 | |
| Median : 5133 | NA | Median : 0 | Median :0.0000 | Median :45.00 | Median :0.0000 | Median :11.0 | Median : 54028 | Mode :character | Median :161160 | Mode :character | Mode :character | Masters :1658 | Professional :1117 | Median : 33.00 | Mode :character | Median :14440 | Median : 4.000 | Pickup :1389 | Mode :character | Median : 0 | Median :0.0000 | Mode :character | Median : 1.000 | Median : 8.000 | NA | |
| Mean : 5152 | NA | Mean : 1504 | Mean :0.1711 | Mean :44.79 | Mean :0.7212 | Mean :10.5 | Mean : 61898 | NA | Mean :154867 | NA | NA | PhD : 728 | Manager : 988 | Mean : 33.49 | NA | Mean :15710 | Mean : 5.351 | Sports_Car : 907 | NA | Mean : 4037 | Mean :0.7986 | NA | Mean : 1.696 | Mean : 8.328 | NA | |
| 3rd Qu.: 7745 | NA | 3rd Qu.: 1036 | 3rd Qu.:0.0000 | 3rd Qu.:51.00 | 3rd Qu.:1.0000 | 3rd Qu.:13.0 | 3rd Qu.: 85986 | NA | 3rd Qu.:238724 | NA | NA | z_High_School:2330 | Lawyer : 835 | 3rd Qu.: 44.00 | NA | 3rd Qu.:20850 | 3rd Qu.: 7.000 | Van : 750 | NA | 3rd Qu.: 4636 | 3rd Qu.:2.0000 | NA | 3rd Qu.: 3.000 | 3rd Qu.:12.000 | NA | |
| Max. :10302 | NA | Max. :107586 | Max. :4.0000 | Max. :81.00 | Max. :5.0000 | Max. :23.0 | Max. :367030 | NA | Max. :885282 | NA | NA | NA | Student : 712 | Max. :142.00 | NA | Max. :69740 | Max. :25.000 | z_SUV :2294 | NA | Max. :57037 | Max. :5.0000 | NA | Max. :13.000 | Max. :28.000 | NA | |
| NA | NA | NA | NA | NA’s :6 | NA | NA’s :454 | NA’s :445 | NA | NA’s :464 | NA | NA | NA | (Other) :1413 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :510 | NA |
After carefully examining the data in the raw data frame, it was found that certain variables contained unnecessary characters such as dollar signs and garbage data. To address this issue, regular expression (regex) was used to remove these unwanted characters from the variables. By doing so, the data is now cleaner and more suitable for analysis.
Additionally, there are several nominal categories that are yes and no questions. We reduced this to binary where 1=yes and2=no using regex again to further simplify the process.
Finally, we convert variables to type factor or numeric, as appropriate.
### Akin to dplyr::mutate_at but does not throw error ("silent") on missing cols and applies a function
SilentMutateAt <- function(df, dirtyCols = c("Sample", "Text"), FUN, ...) {
ind = colnames(df) %in% dirtyCols
df[, ind] = sapply(df[, ind], FUN = FUN, ... = ...)
return(df)
}
### Rearranges parameters of gsub to be more friendly to pipe operator
GsubPipe <- function(x, pattern, replacement, ...) {
gsub(pattern, replacement, x, ...)
}
### Replaces "yes/no" questions with an integer where yes=1 and no=0
CleanYesNo <- function(df, trace = FALSE, threshold = 0.5) {
yesNoCols = sapply(df, function(x) {
unique(x) %>%
grepl("yes|no|^y$|^n$", ., ignore.case = T) %>%
{ sum(.) / length(.) } %>%
{ . >= threshold }
}) %>%
.[.] %>%
names(.)
result = SilentMutateAt(df, yesNoCols, function(x) {
truthy = as.integer(grepl("yes|^y$", x, ignore.case = T))
truthy = truthy - as.integer(grepl("no|^n$", x, ignore.case = T))
truthy[truthy == 0L] = NA
truthy[truthy == -1L] = 0L
truthy
})
if (trace) {
return(list(result, yesNoCols))
}
return(result)
}
### Pre-processing data pipeline for insurance data; cleans up strings
CleanInsuranceStrings <- function(df) {
SilentMutateAt(df, colnames(df)[sapply(df, is.character)], GsubPipe, "\\$|z_|<|,", "") %>%
CleanYesNo(.) %>%
# could also use readr::parse_number here
SilentMutateAt(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), as.numeric) %>%
### Coerces blank strings to NA
SilentMutateAt(., "JOB", function(x) { x[x == ""] = "Unspecified" ; x }) %>%
### Converts strings to factors
dplyr::mutate_at(., colnames(.)[sapply(., is.character)], as.factor)
}
#runs functions
cleanData <- CleanInsuranceStrings(rawData)
## Warning in `[<-.factor`(`*tmp*`, x == "", value = "Unspecified"): invalid
## factor level, NA generated
cleanTestData <- CleanInsuranceStrings(rawTestData)
imputeData = CleanInsuranceStrings(rawData)
## Warning in `[<-.factor`(`*tmp*`, x == "", value = "Unspecified"): invalid
## factor level, NA generated
#reference: https://statisticsglobe.com/convert-character-to-factor-in-r
summary(cleanData)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## Min. : 1 0:6008 Min. : 0 Min. :0.0000 Min. :16.00
## 1st Qu.: 2559 1:2153 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00
## Median : 5133 Median : 0 Median :0.0000 Median :45.00
## Mean : 5152 Mean : 1504 Mean :0.1711 Mean :44.79
## 3rd Qu.: 7745 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00
## Max. :10302 Max. :107586 Max. :4.0000 Max. :81.00
## NA's :6
## HOMEKIDS YOJ INCOME PARENT1
## Min. :0.0000 Min. : 0.0 Min. : 0 Min. :0.000
## 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097 1st Qu.:0.000
## Median :0.0000 Median :11.0 Median : 54028 Median :0.000
## Mean :0.7212 Mean :10.5 Mean : 61898 Mean :0.132
## 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986 3rd Qu.:0.000
## Max. :5.0000 Max. :23.0 Max. :367030 Max. :1.000
## NA's :454 NA's :445
## HOME_VAL MSTATUS SEX EDUCATION
## Min. : 0 Min. :0.0000 F:4375 <High_School :1203
## 1st Qu.: 0 1st Qu.:0.0000 M:3786 Bachelors :2242
## Median :161160 Median :1.0000 Masters :1658
## Mean :154867 Mean :0.5997 PhD : 728
## 3rd Qu.:238724 3rd Qu.:1.0000 z_High_School:2330
## Max. :885282 Max. :1.0000
## NA's :464
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue_Collar:1825 Min. : 5.00 Commercial:3029 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280
## Professional :1117 Median : 33.00 Median :14440
## Manager : 988 Mean : 33.49 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:20850
## (Other) :1599 Max. :142.00 Max. :69740
## NA's : 526
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 Min. :0.0000 Min. : 0
## 1st Qu.: 1.000 Panel_Truck: 676 1st Qu.:0.0000 1st Qu.: 0
## Median : 4.000 Pickup :1389 Median :0.0000 Median : 0
## Mean : 5.351 Sports_Car : 907 Mean :0.2914 Mean : 4037
## 3rd Qu.: 7.000 Van : 750 3rd Qu.:1.0000 3rd Qu.: 4636
## Max. :25.000 z_SUV :2294 Max. :1.0000 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean :0.1225 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.:0.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :1.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly_Urban/ Urban :6492
## z_Highly_Rural/ Rural:1669
##
##
##
##
##
head(cleanData)
## # A tibble: 6 Ă— 26
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## <int> <fct> <dbl> <int> <int> <int> <int> <dbl> <int>
## 1 1 0 0 0 60 0 11 67349 0
## 2 2 0 0 0 43 0 11 91449 0
## 3 4 0 0 0 35 1 10 16039 0
## 4 5 0 0 0 51 0 14 NA 0
## 5 6 0 0 0 50 0 NA 114986 0
## 6 7 1 2946 0 34 1 12 125301 1
## # ℹ 17 more variables: HOME_VAL <dbl>, MSTATUS <int>, SEX <fct>,
## # EDUCATION <fct>, JOB <fct>, TRAVTIME <int>, CAR_USE <fct>, BLUEBOOK <dbl>,
## # TIF <int>, CAR_TYPE <fct>, RED_CAR <int>, OLDCLAIM <dbl>, CLM_FREQ <int>,
## # REVOKED <int>, MVR_PTS <int>, CAR_AGE <int>, URBANICITY <fct>
Our features span a multitude of different distributions. Many of them are normal, others are binomial, and others are nominal. How we handle each feature will thus influence future analysis.
### Plots the distribution of each variable
# See common-helpers.R
ntrain<-select_if(rawData, is.numeric)
ntrain %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_density()
## Warning: Removed 1879 rows containing non-finite values (`stat_density()`).
According to this correlation matrix, all of the features are not strongly correlated to each other. With this in mind, we will go forward and assume that potentially every feature is a principal component and we will not exclude any until further downstream analysis is conducted.
### Develops a correlation plot for all numeric values
# See common-helpers.R
rawDatanum <- dplyr::select_if(rawData, is.numeric)
rcorr(as.matrix(rawDatanum))
## INDEX TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## INDEX 1.00 0.00 0.02 0.03 0.00 0.03 0.01 0.01
## TARGET_AMT 0.00 1.00 0.06 -0.04 0.06 -0.02 -0.06 -0.09
## KIDSDRIV 0.02 0.06 1.00 -0.08 0.46 0.04 -0.05 -0.02
## AGE 0.03 -0.04 -0.08 1.00 -0.45 0.14 0.18 0.21
## HOMEKIDS 0.00 0.06 0.46 -0.45 1.00 0.09 -0.16 -0.11
## YOJ 0.03 -0.02 0.04 0.14 0.09 1.00 0.29 0.27
## INCOME 0.01 -0.06 -0.05 0.18 -0.16 0.29 1.00 0.58
## HOME_VAL 0.01 -0.09 -0.02 0.21 -0.11 0.27 0.58 1.00
## TRAVTIME -0.02 0.03 0.01 0.01 -0.01 -0.02 -0.05 -0.04
## BLUEBOOK 0.01 0.00 -0.02 0.17 -0.11 0.14 0.43 0.26
## TIF -0.01 -0.05 0.00 0.00 0.01 0.02 0.00 0.00
## OLDCLAIM 0.00 0.07 0.02 -0.03 0.03 0.00 -0.05 -0.07
## CLM_FREQ 0.02 0.12 0.04 -0.02 0.03 -0.03 -0.05 -0.09
## MVR_PTS 0.01 0.14 0.05 -0.07 0.06 -0.04 -0.06 -0.09
## CAR_AGE 0.00 -0.06 -0.05 0.18 -0.15 0.06 0.41 0.22
## TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## INDEX -0.02 0.01 -0.01 0.00 0.02 0.01 0.00
## TARGET_AMT 0.03 0.00 -0.05 0.07 0.12 0.14 -0.06
## KIDSDRIV 0.01 -0.02 0.00 0.02 0.04 0.05 -0.05
## AGE 0.01 0.17 0.00 -0.03 -0.02 -0.07 0.18
## HOMEKIDS -0.01 -0.11 0.01 0.03 0.03 0.06 -0.15
## YOJ -0.02 0.14 0.02 0.00 -0.03 -0.04 0.06
## INCOME -0.05 0.43 0.00 -0.05 -0.05 -0.06 0.41
## HOME_VAL -0.04 0.26 0.00 -0.07 -0.09 -0.09 0.22
## TRAVTIME 1.00 -0.02 -0.01 -0.02 0.01 0.01 -0.04
## BLUEBOOK -0.02 1.00 -0.01 -0.03 -0.04 -0.04 0.19
## TIF -0.01 -0.01 1.00 -0.02 -0.02 -0.04 0.01
## OLDCLAIM -0.02 -0.03 -0.02 1.00 0.50 0.26 -0.01
## CLM_FREQ 0.01 -0.04 -0.02 0.50 1.00 0.40 -0.01
## MVR_PTS 0.01 -0.04 -0.04 0.26 0.40 1.00 -0.02
## CAR_AGE -0.04 0.19 0.01 -0.01 -0.01 -0.02 1.00
##
## n
## INDEX TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## INDEX 8161 8161 8161 8155 8161 7707 7716 7697
## TARGET_AMT 8161 8161 8161 8155 8161 7707 7716 7697
## KIDSDRIV 8161 8161 8161 8155 8161 7707 7716 7697
## AGE 8155 8155 8155 8155 8155 7701 7711 7693
## HOMEKIDS 8161 8161 8161 8155 8161 7707 7716 7697
## YOJ 7707 7707 7707 7701 7707 7707 7291 7270
## INCOME 7716 7716 7716 7711 7716 7291 7716 7285
## HOME_VAL 7697 7697 7697 7693 7697 7270 7285 7697
## TRAVTIME 8161 8161 8161 8155 8161 7707 7716 7697
## BLUEBOOK 8161 8161 8161 8155 8161 7707 7716 7697
## TIF 8161 8161 8161 8155 8161 7707 7716 7697
## OLDCLAIM 8161 8161 8161 8155 8161 7707 7716 7697
## CLM_FREQ 8161 8161 8161 8155 8161 7707 7716 7697
## MVR_PTS 8161 8161 8161 8155 8161 7707 7716 7697
## CAR_AGE 7651 7651 7651 7645 7651 7219 7237 7223
## TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## INDEX 8161 8161 8161 8161 8161 8161 7651
## TARGET_AMT 8161 8161 8161 8161 8161 8161 7651
## KIDSDRIV 8161 8161 8161 8161 8161 8161 7651
## AGE 8155 8155 8155 8155 8155 8155 7645
## HOMEKIDS 8161 8161 8161 8161 8161 8161 7651
## YOJ 7707 7707 7707 7707 7707 7707 7219
## INCOME 7716 7716 7716 7716 7716 7716 7237
## HOME_VAL 7697 7697 7697 7697 7697 7697 7223
## TRAVTIME 8161 8161 8161 8161 8161 8161 7651
## BLUEBOOK 8161 8161 8161 8161 8161 8161 7651
## TIF 8161 8161 8161 8161 8161 8161 7651
## OLDCLAIM 8161 8161 8161 8161 8161 8161 7651
## CLM_FREQ 8161 8161 8161 8161 8161 8161 7651
## MVR_PTS 8161 8161 8161 8161 8161 8161 7651
## CAR_AGE 7651 7651 7651 7651 7651 7651 7651
##
## P
## INDEX TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## INDEX 0.9572 0.1594 0.0022 0.9962 0.0189 0.4385 0.2881
## TARGET_AMT 0.9572 0.0000 0.0002 0.0000 0.0525 0.0000 0.0000
## KIDSDRIV 0.1594 0.0000 0.0000 0.0000 0.0001 0.0000 0.0825
## AGE 0.0022 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000
## HOMEKIDS 0.9962 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## YOJ 0.0189 0.0525 0.0001 0.0000 0.0000 0.0000 0.0000
## INCOME 0.4385 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## HOME_VAL 0.2881 0.0000 0.0825 0.0000 0.0000 0.0000 0.0000
## TRAVTIME 0.0372 0.0115 0.4455 0.6342 0.5128 0.1369 0.0000 0.0018
## BLUEBOOK 0.2089 0.6712 0.0516 0.0000 0.0000 0.0000 0.0000 0.0000
## TIF 0.4053 0.0000 0.8574 0.9952 0.2859 0.0296 0.9276 0.8564
## OLDCLAIM 0.9091 0.0000 0.0653 0.0082 0.0069 0.7936 0.0000 0.0000
## CLM_FREQ 0.0898 0.0000 0.0008 0.0296 0.0080 0.0209 0.0000 0.0000
## MVR_PTS 0.4765 0.0000 0.0000 0.0000 0.0000 0.0009 0.0000 0.0000
## CAR_AGE 0.9513 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## INDEX 0.0372 0.2089 0.4053 0.9091 0.0898 0.4765 0.9513
## TARGET_AMT 0.0115 0.6712 0.0000 0.0000 0.0000 0.0000 0.0000
## KIDSDRIV 0.4455 0.0516 0.8574 0.0653 0.0008 0.0000 0.0000
## AGE 0.6342 0.0000 0.9952 0.0082 0.0296 0.0000 0.0000
## HOMEKIDS 0.5128 0.0000 0.2859 0.0069 0.0080 0.0000 0.0000
## YOJ 0.1369 0.0000 0.0296 0.7936 0.0209 0.0009 0.0000
## INCOME 0.0000 0.0000 0.9276 0.0000 0.0000 0.0000 0.0000
## HOME_VAL 0.0018 0.0000 0.8564 0.0000 0.0000 0.0000 0.0000
## TRAVTIME 0.1246 0.2945 0.0818 0.5535 0.3384 0.0008
## BLUEBOOK 0.1246 0.6242 0.0077 0.0010 0.0004 0.0000
## TIF 0.2945 0.6242 0.0473 0.0375 0.0002 0.4969
## OLDCLAIM 0.0818 0.0077 0.0473 0.0000 0.0000 0.2417
## CLM_FREQ 0.5535 0.0010 0.0375 0.0000 0.0000 0.4151
## MVR_PTS 0.3384 0.0004 0.0002 0.0000 0.0000 0.0817
## CAR_AGE 0.0008 0.0000 0.4969 0.2417 0.4151 0.0817
corrplot(cor(rawDatanum), method="square")
The raw data frame contains 1879` missing values, representing 1713 out of 8161 observations.
### Counts values with missing data
#I don't understand this line, but this does not agree with the next chunk
# AA: i got the same result on my side but the other version is more clear
#nrow(rawData[is.na(rawData),])
### total missing values
colSums(is.na(rawData)) %>% sum()
### Counts rows with missing data
sum(!complete.cases(rawData)) #observations with missing data
The following variables in the dataset have missing values: TARGET_FLAG, TARGET_AMT, AGE, YOJ, and CAR_AGE.
#this function adds up the rows with missing values to get a count of NAs by feature
colSums(is.na(rawData))
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## 0 0 0 0 6 0
## YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX
## 454 445 0 464 0 0
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF
## 0 0 0 0 0 0
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 0 0 0 0 0 0
## CAR_AGE URBANICITY
## 510 0
A good rule of thumb is to drop any feature with more than 80% missing values – that is not the case with any features here. The remaining options are to: 1) Drop rows with missing values 2) Infer missing values
In order to decide which course to take we should first determine if there is any correlation between features with missing values, or if values are missing at random. Because several of these features are numeric, we will first need to convert the features to the appropriate data type in order to determine if there is a correlation.
# TO COME -- Section exploring correlation
# impute data for missing values
# use column mean for calculation
rawData$AGE[is.na(rawData$AGE)] <- mean(rawData$AGE, na.rm=TRUE)
rawData$YOJ[is.na(rawData$YOJ)] <- mean(rawData$YOJ, na.rm=TRUE)
rawData$HOME_VAL[is.na(rawData$HOME_VAL)] <- mean(rawData$HOME_VAL, na.rm=TRUE)
rawData$CAR_AGE[is.na(rawData$CAR_AGE)] <- mean(rawData$CAR_AGE, na.rm=TRUE)
rawData$INCOME[is.na(rawData$INCOME)] <- mean(rawData$INCOME, na.rm=TRUE)
#get complete cases
rawData <- rawData[complete.cases(rawData),]
rawData2<-rawData
# # transform data using log for skewed HOMEKIDS, MVR_PTS, OLDCLAIM, TIF, KIDSDRIVE and CLM_FREQ
rawData$HOMEKIDS <- log(rawData$HOMEKIDS+1)
rawData$MVR_PTS <- log(rawData$MVR_PTS+1)
rawData$TIF <- log(rawData$TIF+1)
rawData$KIDSDRIV <- log(rawData$KIDSDRIV+1)
rawData$CLM_FREQ <- log(rawData$CLM_FREQ+1)
#remove rad per correlation in prior section
rawData <- rawData[, !(colnames(rawData) %in% c("INDEX"))]
#
# #create variable
rawDatanum <- dplyr::select_if(rawData, is.numeric)
rcorr(as.matrix(rawDatanum))
## TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL TRAVTIME
## TARGET_AMT 1.00 0.06 -0.04 0.07 -0.02 -0.06 -0.08 0.03
## KIDSDRIV 0.06 1.00 -0.08 0.49 0.05 -0.05 -0.02 0.01
## AGE -0.04 -0.08 1.00 -0.47 0.13 0.18 0.20 0.01
## HOMEKIDS 0.07 0.49 -0.47 1.00 0.08 -0.16 -0.11 -0.01
## YOJ -0.02 0.05 0.13 0.08 1.00 0.27 0.26 -0.02
## INCOME -0.06 -0.05 0.18 -0.16 0.27 1.00 0.54 -0.05
## HOME_VAL -0.08 -0.02 0.20 -0.11 0.26 0.54 1.00 -0.03
## TRAVTIME 0.03 0.01 0.01 -0.01 -0.02 -0.05 -0.03 1.00
## BLUEBOOK 0.00 -0.02 0.16 -0.11 0.14 0.42 0.25 -0.02
## TIF -0.05 -0.01 0.00 0.00 0.02 -0.01 0.00 -0.01
## OLDCLAIM 0.07 0.02 -0.03 0.03 0.00 -0.04 -0.07 -0.02
## CLM_FREQ 0.13 0.04 -0.03 0.04 -0.02 -0.05 -0.10 0.00
## MVR_PTS 0.13 0.05 -0.07 0.06 -0.03 -0.05 -0.07 0.01
## CAR_AGE -0.06 -0.05 0.17 -0.15 0.06 0.39 0.20 -0.04
## BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_AMT 0.00 -0.05 0.07 0.13 0.13 -0.06
## KIDSDRIV -0.02 -0.01 0.02 0.04 0.05 -0.05
## AGE 0.16 0.00 -0.03 -0.03 -0.07 0.17
## HOMEKIDS -0.11 0.00 0.03 0.04 0.06 -0.15
## YOJ 0.14 0.02 0.00 -0.02 -0.03 0.06
## INCOME 0.42 -0.01 -0.04 -0.05 -0.05 0.39
## HOME_VAL 0.25 0.00 -0.07 -0.10 -0.07 0.20
## TRAVTIME -0.02 -0.01 -0.02 0.00 0.01 -0.04
## BLUEBOOK 1.00 -0.01 -0.03 -0.04 -0.04 0.18
## TIF -0.01 1.00 -0.02 -0.02 -0.04 0.00
## OLDCLAIM -0.03 -0.02 1.00 0.54 0.25 -0.01
## CLM_FREQ -0.04 -0.02 0.54 1.00 0.41 -0.01
## MVR_PTS -0.04 -0.04 0.25 0.41 1.00 -0.01
## CAR_AGE 0.18 0.00 -0.01 -0.01 -0.01 1.00
##
## n= 8161
##
##
## P
## TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL TRAVTIME
## TARGET_AMT 0.0000 0.0002 0.0000 0.0585 0.0000 0.0000 0.0115
## KIDSDRIV 0.0000 0.0000 0.0000 0.0000 0.0000 0.0577 0.5499
## AGE 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.6342
## HOMEKIDS 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.4230
## YOJ 0.0585 0.0000 0.0000 0.0000 0.0000 0.0000 0.1362
## INCOME 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## HOME_VAL 0.0000 0.0577 0.0000 0.0000 0.0000 0.0000 0.0018
## TRAVTIME 0.0115 0.5499 0.6342 0.4230 0.1362 0.0000 0.0018
## BLUEBOOK 0.6712 0.0415 0.0000 0.0000 0.0000 0.0000 0.0000 0.1246
## TIF 0.0000 0.3832 0.9404 0.6725 0.0498 0.4889 0.7280 0.2945
## OLDCLAIM 0.0000 0.0476 0.0082 0.0023 0.7931 0.0000 0.0000 0.0818
## CLM_FREQ 0.0000 0.0000 0.0054 0.0002 0.0272 0.0000 0.0000 0.7501
## MVR_PTS 0.0000 0.0000 0.0000 0.0000 0.0033 0.0000 0.0000 0.5405
## CAR_AGE 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0009
## BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_AMT 0.6712 0.0000 0.0000 0.0000 0.0000 0.0000
## KIDSDRIV 0.0415 0.3832 0.0476 0.0000 0.0000 0.0000
## AGE 0.0000 0.9404 0.0082 0.0054 0.0000 0.0000
## HOMEKIDS 0.0000 0.6725 0.0023 0.0002 0.0000 0.0000
## YOJ 0.0000 0.0498 0.7931 0.0272 0.0033 0.0000
## INCOME 0.0000 0.4889 0.0000 0.0000 0.0000 0.0000
## HOME_VAL 0.0000 0.7280 0.0000 0.0000 0.0000 0.0000
## TRAVTIME 0.1246 0.2945 0.0818 0.7501 0.5405 0.0009
## BLUEBOOK 0.5420 0.0077 0.0003 0.0007 0.0000
## TIF 0.5420 0.0573 0.0408 0.0006 0.9927
## OLDCLAIM 0.0077 0.0573 0.0000 0.0000 0.2402
## CLM_FREQ 0.0003 0.0408 0.0000 0.0000 0.2247
## MVR_PTS 0.0007 0.0006 0.0000 0.0000 0.4250
## CAR_AGE 0.0000 0.9927 0.2402 0.2247 0.4250
It is important to address these missing values before proceeding with any analysis.
Based on the available data, it seems that YOJ tends to increase with age. There are 6 missing values in the AGE column, which is a trivial enough number that the simplest approach is to drop those observations. Observations that are missing the target cannot be used for training so those will also be dropped.
Build the linear regression model with AGE as the predictor and YOJ as the response and predict the missing values of YOJ based on the values of AGE. Insert the predicted values of YOJ into the original dataframe.
# Subset the data with missing values of YOJ
missing_YOJ <- subset(imputeData, is.na(YOJ))
lm_model <- lm(YOJ ~ AGE, data = imputeData)
predicted_YOJ <- predict(lm_model, newdata = missing_YOJ)
rounded_YOJ <- round(predicted_YOJ, digits = 0)
imputeData$YOJ[is.na(imputeData$YOJ)] <- rounded_YOJ
Build the linear regression model with AGE and YOJ as the predictors and INCOME as the response and predict the missing values of INCOME based on the values of AGE and YOJ. insert the predicted values of INCOME into the original dataframe.
# Subset the data with missing values of INCOME
missing_INCOME <- subset(imputeData, is.na(INCOME))
lm_model <- lm(INCOME ~ AGE + YOJ, data = imputeData)
predicted_INCOME <- predict(lm_model, newdata = missing_INCOME)
rounded_INCOME <- round(predicted_INCOME, digits = 0)
imputeData[is.na(imputeData$INCOME),]$INCOME <- rounded_INCOME
Build the linear regression model with AGE, YOJ, and INCOME as predictors and HOME_VAL as the response and predict the missing values of HOME_VAL based on the values of AGE, YOJ, and INCOME. Insert the predicted values of HOME_VAL into the original dataframe
# Subset the data with missing values of HOME_VAL
missing_HOME_VAL <- subset(imputeData, is.na(HOME_VAL))
lm_model <- lm(HOME_VAL ~ AGE + YOJ + INCOME, data = imputeData)
predicted_HOME_VAL <- predict(lm_model, newdata = missing_HOME_VAL)
#rounded_HOME_VAL <- round(predicted_HOME_VAL, digits = 0)
imputeData[is.na(imputeData$HOME_VAL),]$HOME_VAL <- predicted_HOME_VAL
imputeData$HOME_VAL <- round(imputeData$HOME_VAL, digits = 0)
head(imputeData$HOME_VAL)
## [1] 0 257252 124191 306251 243925 0
To handle the missing value in CAR_AGE, we will use chained imputation since it is difficult to assess. This will also fill any other data that is still missing so we now have produced our final cleaning steps. We may next consider doing manipulations such as normalization, zero centering, and other kinds of feature engineering.
imputeData = imputeData %>%
mice::mice(., method = "pmm", printFlag = FALSE) %>%
mice::complete(.)
trainClean = imputeData
summary(dplyr::select(imputeData, -INDEX))
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## 0:6008 Min. : 0 Min. :0.0000 Min. :16.00 Min. :0.0000
## 1:2153 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00 1st Qu.:0.0000
## Median : 0 Median :0.0000 Median :45.00 Median :0.0000
## Mean : 1504 Mean :0.1711 Mean :44.78 Mean :0.7212
## 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00 3rd Qu.:1.0000
## Max. :107586 Max. :4.0000 Max. :81.00 Max. :5.0000
##
## YOJ INCOME PARENT1 HOME_VAL
## Min. : 0.0 Min. : 0 Min. :0.000 Min. : -5482
## 1st Qu.: 9.0 1st Qu.: 29401 1st Qu.:0.000 1st Qu.: 0
## Median :11.0 Median : 55578 Median :0.000 Median :160042
## Mean :10.5 Mean : 61895 Mean :0.132 Mean :154885
## 3rd Qu.:13.0 3rd Qu.: 83871 3rd Qu.:0.000 3rd Qu.:236296
## Max. :23.0 Max. :367030 Max. :1.000 Max. :885282
##
## MSTATUS SEX EDUCATION JOB
## Min. :0.0000 F:4375 <High_School :1203 z_Blue_Collar:1973
## 1st Qu.:0.0000 M:3786 Bachelors :2242 Clerical :1338
## Median :1.0000 Masters :1658 Professional :1204
## Mean :0.5997 PhD : 728 Manager :1094
## 3rd Qu.:1.0000 z_High_School:2330 Lawyer : 872
## Max. :1.0000 Student : 761
## (Other) : 919
## TRAVTIME CAR_USE BLUEBOOK TIF
## Min. : 5.00 Commercial:3029 Min. : 1500 Min. : 1.000
## 1st Qu.: 22.00 Private :5132 1st Qu.: 9280 1st Qu.: 1.000
## Median : 33.00 Median :14440 Median : 4.000
## Mean : 33.49 Mean :15710 Mean : 5.351
## 3rd Qu.: 44.00 3rd Qu.:20850 3rd Qu.: 7.000
## Max. :142.00 Max. :69740 Max. :25.000
##
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ
## Minivan :2145 Min. :0.0000 Min. : 0 Min. :0.0000
## Panel_Truck: 676 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Pickup :1389 Median :0.0000 Median : 0 Median :0.0000
## Sports_Car : 907 Mean :0.2914 Mean : 4037 Mean :0.7986
## Van : 750 3rd Qu.:1.0000 3rd Qu.: 4636 3rd Qu.:2.0000
## z_SUV :2294 Max. :1.0000 Max. :57037 Max. :5.0000
##
## REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.1225 Mean : 1.696 Mean : 8.336
## 3rd Qu.:0.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :1.0000 Max. :13.000 Max. :28.000
##
## URBANICITY
## Highly_Urban/ Urban :6492
## z_Highly_Rural/ Rural:1669
##
##
##
##
##
We have 971 male and 1170 female in this study. Male average income is higher than man. However, man has highest car purchase amount. The boxplots show that the medians and interquartile range of the two distributions are very close.
SEX_df <- imputeData %>%
dplyr::filter(!is.na(INCOME)) %>%
dplyr::group_by(SEX) %>%
dplyr::summarise(INCOME = mean(INCOME))
SEX_df
## # A tibble: 2 Ă— 2
## SEX INCOME
## <fct> <dbl>
## 1 F 57271.
## 2 M 67238.
ggplot2::ggplot(imputeData) +
ggplot2::aes(x = SEX, y = INCOME, color = SEX, fill = SEX) +
ggplot2::geom_bar(data = SEX_df, stat = "identity", alpha = .3) +
ggrepel::geom_text_repel(ggplot2::aes(label = HOME_VAL), color = "black", size = 2.5, segment.color = "grey") +
ggplot2::geom_point() +
ggplot2::guides(color = "none", fill = "none") +
ggplot2::theme_bw() +
ggplot2::labs(title = "Income by Sex", x = "Sex", y = "Income")
## Warning: ggrepel: 8156 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
We will each create a model to explore and compare the differences between them. Additionally, we will discuss the key insights that we gain from the modeling process because it will help us draw conclusions about the dataset and potentially point us towards a preferred model.
Stepwise regression analysis was performed. The purpose of stepwise regression is to identify the subset of predictors that best explain the response variable. In this case, the response variables are TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ, MVR_PTS, and CAR_AGE. The model selects the best subset of these predictors by iteratively adding or removing variables based on statistical criteria, in this case the Akaike Information Criterion (AIC).
The final model contains the variables TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ ,MVR_PTS, CAR_AGE. This subset of variables yielded the lowest AIC value among all possible subsets of variables considered by the algorithm.
To evaluate the performance of a machine learning model, it is important to split the data into training and test sets. The training set is used to train the model, while the test set is used to evaluate how well the model generalizes to new, unseen data.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
set.seed(333)
trainIndex <- sample(1:nrow(trainClean), round(0.7*nrow(trainClean)), replace = FALSE)
trainData <- trainClean[trainIndex, ]
testData <- trainClean[-trainIndex, ]
trainData <- na.omit(trainData)
testData <- na.omit(testData)
logit <- glm(formula = TARGET_FLAG ~ . - TARGET_AMT, data=rawData, family = "binomial" (link="logit"))
summary(logit)
##
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial(link = "logit"),
## data = rawData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5328 -0.7137 -0.3955 0.6469 3.1515
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.039e-01 3.297e-01 -2.438 0.014763 *
## KIDSDRIV 6.724e-01 1.106e-01 6.079 1.21e-09 ***
## AGE -5.647e-05 4.083e-03 -0.014 0.988965
## HOMEKIDS 1.493e-01 8.316e-02 1.795 0.072594 .
## YOJ -1.262e-02 8.592e-03 -1.468 0.142042
## INCOME -3.533e-06 1.079e-06 -3.273 0.001065 **
## PARENT1Yes 3.275e-01 1.146e-01 2.858 0.004270 **
## HOME_VAL -1.300e-06 3.425e-07 -3.796 0.000147 ***
## MSTATUSz_No 5.183e-01 8.499e-02 6.098 1.07e-09 ***
## SEXz_F -8.739e-02 1.122e-01 -0.779 0.436005
## EDUCATIONBachelors -3.791e-01 1.156e-01 -3.280 0.001037 **
## EDUCATIONMasters -2.803e-01 1.788e-01 -1.568 0.116928
## EDUCATIONPhD -1.638e-01 2.141e-01 -0.765 0.444147
## EDUCATIONz_High_School 1.674e-02 9.495e-02 0.176 0.860061
## JOBClerical 3.929e-01 1.967e-01 1.997 0.045806 *
## JOBDoctor -4.416e-01 2.669e-01 -1.654 0.098064 .
## JOBHome_Maker 2.019e-01 2.102e-01 0.960 0.336881
## JOBLawyer 9.864e-02 1.696e-01 0.581 0.560905
## JOBManager -5.670e-01 1.715e-01 -3.305 0.000948 ***
## JOBProfessional 1.513e-01 1.786e-01 0.847 0.396937
## JOBStudent 1.970e-01 2.145e-01 0.918 0.358363
## JOBz_Blue_Collar 3.000e-01 1.856e-01 1.616 0.106095
## TRAVTIME 1.467e-02 1.883e-03 7.790 6.73e-15 ***
## CAR_USEPrivate -7.587e-01 9.176e-02 -8.269 < 2e-16 ***
## BLUEBOOK -2.044e-05 5.263e-06 -3.884 0.000103 ***
## TIF -3.255e-01 4.145e-02 -7.854 4.04e-15 ***
## CAR_TYPEPanel_Truck 5.527e-01 1.617e-01 3.418 0.000630 ***
## CAR_TYPEPickup 5.509e-01 1.008e-01 5.467 4.57e-08 ***
## CAR_TYPESports_Car 1.035e+00 1.300e-01 7.965 1.66e-15 ***
## CAR_TYPEVan 6.171e-01 1.266e-01 4.874 1.09e-06 ***
## CAR_TYPEz_SUV 7.725e-01 1.114e-01 6.936 4.05e-12 ***
## RED_CARyes -3.334e-03 8.638e-02 -0.039 0.969215
## OLDCLAIM -1.724e-05 4.049e-06 -4.258 2.07e-05 ***
## CLM_FREQ 5.010e-01 6.347e-02 7.894 2.93e-15 ***
## REVOKEDYes 9.256e-01 9.201e-02 10.059 < 2e-16 ***
## MVR_PTS 2.902e-01 4.157e-02 6.983 2.90e-12 ***
## CAR_AGE -1.735e-03 7.548e-03 -0.230 0.818180
## URBANICITYz_Highly_Rural/ Rural -2.370e+00 1.130e-01 -20.966 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418 on 8160 degrees of freedom
## Residual deviance: 7290 on 8123 degrees of freedom
## AIC: 7366
##
## Number of Fisher Scoring iterations: 5
The model’s coefficient of 0.5860083 represents the correlation between the predicted values and actual values of the target variable, which is the probability of a car insurance claim being filed (TARGET_FLAG). A value of 0.60 indicates a moderate positive correlation between the predicted and actual values.
The model’s coefficients for INCOME and HOME_VAL are negative, indicating that as these variables decrease, the predicted probability of a car insurance claim being filed (TARGET_FLAG) increases. This observation seems to contradict the expectation that individuals with higher incomes and more valuable homes would be safer drivers and thus less likely to file insurance claims.
If the result is counter intuitive, it might be worthwhile to conduct another experiment to verify the accuracy of the outcome.
exp(logit$coefficients)
## (Intercept) KIDSDRIV
## 0.44759439 1.95890613
## AGE HOMEKIDS
## 0.99994354 1.16101816
## YOJ INCOME
## 0.98746421 0.99999647
## PARENT1Yes HOME_VAL
## 1.38746810 0.99999870
## MSTATUSz_No SEXz_F
## 1.67913068 0.91631881
## EDUCATIONBachelors EDUCATIONMasters
## 0.68447094 0.75557285
## EDUCATIONPhD EDUCATIONz_High_School
## 0.84890695 1.01688024
## JOBClerical JOBDoctor
## 1.48122021 0.64303595
## JOBHome_Maker JOBLawyer
## 1.22370504 1.10366439
## JOBManager JOBProfessional
## 0.56721931 1.16332072
## JOBStudent JOBz_Blue_Collar
## 1.21775367 1.34982753
## TRAVTIME CAR_USEPrivate
## 1.01477923 0.46827081
## BLUEBOOK TIF
## 0.99997956 0.72216334
## CAR_TYPEPanel_Truck CAR_TYPEPickup
## 1.73800068 1.73473229
## CAR_TYPESports_Car CAR_TYPEVan
## 2.81535682 1.85359874
## CAR_TYPEz_SUV RED_CARyes
## 2.16517590 0.99667196
## OLDCLAIM CLM_FREQ
## 0.99998276 1.65039273
## REVOKEDYes MVR_PTS
## 2.52326347 1.33675662
## CAR_AGE URBANICITYz_Highly_Rural/ Rural
## 0.99826632 0.09348369
logitscalar <- mean(dlogis(predict(logit, type = "link")))
logitscalar * coef(logit)
## (Intercept) KIDSDRIV
## -1.168766e-01 9.776014e-02
## AGE HOMEKIDS
## -8.209729e-06 2.170676e-02
## YOJ INCOME
## -1.834135e-03 -5.136424e-07
## PARENT1Yes HOME_VAL
## 4.761333e-02 -1.890080e-07
## MSTATUSz_No SEXz_F
## 7.535364e-02 -1.270601e-02
## EDUCATIONBachelors EDUCATIONMasters
## -5.511974e-02 -4.075057e-02
## EDUCATIONPhD EDUCATIONz_High_School
## -2.381617e-02 2.433782e-03
## JOBClerical JOBDoctor
## 5.711993e-02 -6.419888e-02
## JOBHome_Maker JOBLawyer
## 2.935237e-02 1.434095e-02
## JOBManager JOBProfessional
## -8.243908e-02 2.199482e-02
## JOBStudent JOBz_Blue_Collar
## 2.864354e-02 4.361448e-02
## TRAVTIME CAR_USEPrivate
## 2.133070e-03 -1.103108e-01
## BLUEBOOK TIF
## -2.972373e-06 -4.732594e-02
## CAR_TYPEPanel_Truck CAR_TYPEPickup
## 8.036377e-02 8.009009e-02
## CAR_TYPESports_Car CAR_TYPEVan
## 1.504945e-01 8.972613e-02
## CAR_TYPEz_SUV RED_CARyes
## 1.123162e-01 -4.846794e-04
## OLDCLAIM CLM_FREQ
## -2.506236e-06 7.284374e-02
## REVOKEDYes MVR_PTS
## 1.345688e-01 4.219972e-02
## CAR_AGE URBANICITYz_Highly_Rural/ Rural
## -2.522830e-04 -3.445764e-01
confint.default(logit)
## 2.5 % 97.5 %
## (Intercept) -1.450076e+00 -1.576594e-01
## KIDSDRIV 4.556053e-01 8.891671e-01
## AGE -8.058607e-03 7.945675e-03
## HOMEKIDS -1.368645e-02 3.122811e-01
## YOJ -2.945506e-02 4.225006e-03
## INCOME -5.648457e-06 -1.417123e-06
## PARENT1Yes 1.028624e-01 5.520987e-01
## HOME_VAL -1.971256e-06 -6.287076e-07
## MSTATUSz_No 3.517063e-01 6.848462e-01
## SEXz_F -3.072784e-01 1.324965e-01
## EDUCATIONBachelors -6.056274e-01 -1.525908e-01
## EDUCATIONMasters -6.306668e-01 7.010865e-02
## EDUCATIONPhD -5.833681e-01 2.557567e-01
## EDUCATIONz_High_School -1.693589e-01 2.028376e-01
## JOBClerical 7.320068e-03 7.784124e-01
## JOBDoctor -9.646916e-01 8.158231e-02
## JOBHome_Maker -2.101384e-01 6.139047e-01
## JOBLawyer -2.338210e-01 4.310928e-01
## JOBManager -9.032236e-01 -2.307950e-01
## JOBProfessional -1.987380e-01 5.012952e-01
## JOBStudent -2.233874e-01 6.174032e-01
## JOBz_Blue_Collar -6.385011e-02 6.638038e-01
## TRAVTIME 1.097961e-02 1.836256e-02
## CAR_USEPrivate -9.385494e-01 -5.788676e-01
## BLUEBOOK -3.075900e-05 -1.012847e-05
## TIF -4.067373e-01 -2.442706e-01
## CAR_TYPEPanel_Truck 2.358295e-01 8.696413e-01
## CAR_TYPEPickup 3.533748e-01 7.483314e-01
## CAR_TYPESports_Car 7.803697e-01 1.289808e+00
## CAR_TYPEVan 3.689576e-01 8.653005e-01
## CAR_TYPEz_SUV 5.541958e-01 9.908075e-01
## RED_CARyes -1.726303e-01 1.659631e-01
## OLDCLAIM -2.517283e-05 -9.302541e-06
## CLM_FREQ 3.766150e-01 6.254116e-01
## REVOKEDYes 7.452091e-01 1.105897e+00
## MVR_PTS 2.087775e-01 3.717150e-01
## CAR_AGE -1.652895e-02 1.305859e-02
## URBANICITYz_Highly_Rural/ Rural -2.591520e+00 -2.148416e+00
predlogit <- predict(logit, type="response")
rawData2$pred1 <- predict(logit, type="response")
summary(predlogit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002416 0.076138 0.201101 0.263816 0.403715 0.959539
table(true = rawData$TARGET_FLAG, pred = round(fitted(logit)))
## pred
## true 0 1
## 0 5537 471
## 1 1239 914
#plots for Seung's Model logit
par(mfrow=c(2,2))
plot(logit)
data.frame(rawData2$pred1) %>%
ggplot(aes(x = rawData2.pred1)) +
geom_histogram(bins = 50, fill = 'green') +
labs(title = 'Histogram of Predictions') +
theme_bw()
plot.roc(rawData$TARGET_FLAG, rawData2$pred1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#extract variables that are significant and rerun model
sigvars <- data.frame(summary(logit)$coef[summary(logit)$coef[,4] <= .05, 4])
sigvars <- add_rownames(sigvars, "vars")
## Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::rownames_to_column()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Of the models we build, we will build our first two models after an exhaustive outlier cleaning. It seems that both models favor managers for a lower insurance rate, interestingly enough. Additionally, the type of car has a significant impact on the amount. Also, both models agree that married couples are charged significantly lower rates, whereas this is even worse in the event that an unmarried person is also a single parent. Finally, both models agree that it is extremely significant whether the drivers are in urban or rural areas, such that urban drivers get charged around $1500 more than their rural counterparts.
colist<-dplyr::pull(sigvars, vars)
# colist<-colist[2:11]
colist<-c("KIDSDRIV","INCOME","PARENT1","HOME_VAL","MSTATUS","EDUCATION","JOB","TRAVTIME","CAR_USE","BLUEBOOK","TIF","CAR_TYPE","CLM_FREQ","REVOKED","MVR_PTS","URBANICITY")
idx <- match(colist, names(rawData))
rawDatamod2 <- cbind(rawData[,idx], rawData2['TARGET_FLAG'])
logit2 <- glm(TARGET_FLAG ~ ., data=rawDatamod2, family = "binomial" (link="logit"))
summary(logit2)
exp(logit2$coefficients)
logit2scalar <- mean(dlogis(predict(logit2, type = "link")))
logit2scalar * coef(logit2)
Since we created our initial models, we decided that there are three additional transforms we want to add. We believe that having a claim is predictive of making a claim again, so we will have a boolean column for any number of claims. Additionally, we decided that the only two relevant job types are doctors and managers, based on the models. Therefore, we will also make two boolean columns for if they are a manager or doctor.
### Creates a factor column for Manager/Doctor jobs and no claim history, specific to insurance dataset
InsuranceTransforms <- function(df) {
df %>%
dplyr::mutate(
IS_MANAGER = as.factor(as.integer(JOB %in% "Manager")),
IS_DOCTOR = as.factor(as.integer(JOB %in% "Doctor")),
NO_CLM_HIST = as.factor(as.integer(CLM_FREQ < 1))
) %>%
dplyr::select(-JOB, -CLM_FREQ)
}
Although new models were attempted with the new transforms and additional considerations, none of them successfully beat the existing model by a statistically significant margin. We compared each of the models using ANOVA and evaluated the LM models with MAE and RMSE, whereas the logistical models were compared using F1 and Accuracy metrics. Overall, we kept the previously accepted models due to the lack of advancement.
table(true = rawData$TARGET_FLAG, pred = round(fitted(logit2)))
Anthony’s First Model plots for logit2
par(mfrow=c(2,2))
plot(logit2)
data.frame(rawData2$pred2) %>%
ggplot(aes(x = rawData2.pred2)) +
geom_histogram(bins = 50, fill = 'blue') +
labs(title = 'Histogram of Predictions') +
theme_bw()
#PC Model no racial bias
logit3 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, data=rawData, family = "binomial" (link="logit"))
summary(logit3)
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME,
## family = binomial(link = "logit"), data = rawData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5299 -0.8217 -0.6749 1.2315 2.8090
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.876e-01 7.305e-02 -9.412 < 2e-16 ***
## KIDSDRIV 7.266e-01 8.115e-02 8.953 < 2e-16 ***
## INCOME -3.497e-06 6.826e-07 -5.123 3.01e-07 ***
## HOME_VAL -2.972e-06 2.499e-07 -11.895 < 2e-16 ***
## TRAVTIME 5.880e-03 1.598e-03 3.679 0.000234 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 9021.1 on 8156 degrees of freedom
## AIC: 9031.1
##
## Number of Fisher Scoring iterations: 4
exp(logit3$coefficients)
## (Intercept) KIDSDRIV INCOME HOME_VAL TRAVTIME
## 0.5028055 2.0679778 0.9999965 0.9999970 1.0058969
predlogit3 <- predict(logit3, type="response")
rawData2$pred3 <- predict(logit3, type="response")
summary(predlogit3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01176 0.19679 0.25557 0.26382 0.32927 0.68970
table(true = rawData$TARGET_FLAG, pred = round(fitted(logit3)))
## pred
## true 0 1
## 0 5937 71
## 1 2086 67
par(mfrow=c(2,2))
plot(logit3)
data.frame(rawData2$pred3) %>%
ggplot(aes(x = rawData2.pred3)) +
geom_histogram(bins = 50, fill = 'red') +
labs(title = 'Histogram of Predictions') +
theme_bw()
plot.roc(rawData$TARGET_FLAG, rawData2$pred3)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
logit3scalar <- mean(dlogis(predict(logit3, type = "link")))
logit3scalar * coef(logit3)
## (Intercept) KIDSDRIV INCOME HOME_VAL TRAVTIME
## -1.271908e-01 1.344090e-01 -6.468917e-07 -5.498016e-07 1.087668e-03
round(logitscalar * coef(logit),2)
## (Intercept) KIDSDRIV
## -0.12 0.10
## AGE HOMEKIDS
## 0.00 0.02
## YOJ INCOME
## 0.00 0.00
## PARENT1Yes HOME_VAL
## 0.05 0.00
## MSTATUSz_No SEXz_F
## 0.08 -0.01
## EDUCATIONBachelors EDUCATIONMasters
## -0.06 -0.04
## EDUCATIONPhD EDUCATIONz_High_School
## -0.02 0.00
## JOBClerical JOBDoctor
## 0.06 -0.06
## JOBHome_Maker JOBLawyer
## 0.03 0.01
## JOBManager JOBProfessional
## -0.08 0.02
## JOBStudent JOBz_Blue_Collar
## 0.03 0.04
## TRAVTIME CAR_USEPrivate
## 0.00 -0.11
## BLUEBOOK TIF
## 0.00 -0.05
## CAR_TYPEPanel_Truck CAR_TYPEPickup
## 0.08 0.08
## CAR_TYPESports_Car CAR_TYPEVan
## 0.15 0.09
## CAR_TYPEz_SUV RED_CARyes
## 0.11 0.00
## OLDCLAIM CLM_FREQ
## 0.00 0.07
## REVOKEDYes MVR_PTS
## 0.13 0.04
## CAR_AGE URBANICITYz_Highly_Rural/ Rural
## 0.00 -0.34
round(logit3scalar * coef(logit3),2)
## (Intercept) KIDSDRIV INCOME HOME_VAL TRAVTIME
## -0.13 0.13 0.00 0.00 0.00
model <- lm(TARGET_AMT ~ ., data=rawData)
summary(model)
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = rawData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6211 -464 -60 243 101197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.940e+02 5.009e+02 -1.186 0.2357
## TARGET_FLAG1 5.710e+03 1.135e+02 50.297 < 2e-16 ***
## KIDSDRIV -2.052e+01 1.781e+02 -0.115 0.9083
## AGE 6.147e+00 6.271e+00 0.980 0.3270
## HOMEKIDS 9.206e+01 1.256e+02 0.733 0.4637
## YOJ 7.591e+00 1.319e+01 0.576 0.5648
## INCOME -2.254e-03 1.577e-03 -1.430 0.1528
## PARENT1Yes 1.218e+02 1.830e+02 0.666 0.5056
## HOME_VAL 3.834e-04 5.165e-04 0.742 0.4580
## MSTATUSz_No 1.765e+02 1.282e+02 1.377 0.1685
## SEXz_F -2.901e+02 1.606e+02 -1.806 0.0709 .
## EDUCATIONBachelors 6.878e+01 1.790e+02 0.384 0.7008
## EDUCATIONMasters 2.222e+02 2.620e+02 0.848 0.3965
## EDUCATIONPhD 4.288e+02 3.110e+02 1.379 0.1679
## EDUCATIONz_High_School -1.241e+02 1.502e+02 -0.827 0.4084
## JOBClerical -8.206e+00 2.984e+02 -0.027 0.9781
## JOBDoctor -2.791e+02 3.571e+02 -0.782 0.4345
## JOBHome_Maker -7.036e+01 3.185e+02 -0.221 0.8251
## JOBLawyer 8.014e+01 2.582e+02 0.310 0.7563
## JOBManager -1.261e+02 2.521e+02 -0.500 0.6170
## JOBProfessional 1.750e+02 2.698e+02 0.649 0.5165
## JOBStudent -1.302e+02 3.266e+02 -0.399 0.6900
## JOBz_Blue_Collar 5.242e+01 2.813e+02 0.186 0.8522
## TRAVTIME 5.975e-01 2.824e+00 0.212 0.8324
## CAR_USEPrivate -1.006e+02 1.443e+02 -0.697 0.4857
## BLUEBOOK 2.941e-02 7.536e-03 3.902 9.61e-05 ***
## TIF -1.653e+01 6.277e+01 -0.263 0.7923
## CAR_TYPEPanel_Truck -5.601e+01 2.431e+02 -0.230 0.8178
## CAR_TYPEPickup -3.187e+01 1.493e+02 -0.213 0.8310
## CAR_TYPESports_Car 2.089e+02 1.910e+02 1.094 0.2741
## CAR_TYPEVan 9.699e+01 1.864e+02 0.520 0.6029
## CAR_TYPEz_SUV 1.635e+02 1.571e+02 1.040 0.2982
## RED_CARyes -2.768e+01 1.302e+02 -0.213 0.8316
## OLDCLAIM 3.731e-03 6.731e-03 0.554 0.5793
## CLM_FREQ -8.778e+01 1.060e+02 -0.828 0.4078
## REVOKEDYes -3.348e+02 1.534e+02 -2.182 0.0292 *
## MVR_PTS 1.397e+02 6.644e+01 2.103 0.0355 *
## CAR_AGE -2.521e+01 1.118e+01 -2.255 0.0242 *
## URBANICITYz_Highly_Rural/ Rural 2.772e+01 1.269e+02 0.218 0.8271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3970 on 8122 degrees of freedom
## Multiple R-squared: 0.2912, Adjusted R-squared: 0.2879
## F-statistic: 87.81 on 38 and 8122 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(model$residuals ~ model$fitted.values)
plot(model$fitted.values,rawData$TARGET_AMT)
par(mfrow=c(2,2))
plot(model)
#extract variables that are significant and rerun model
sigvars <- data.frame(summary(model)$coef[summary(model)$coef[,4] <= .05, 4])
sigvars <- add_rownames(sigvars, "vars")
colist<-dplyr::pull(sigvars, vars)
colist<-c("TARGET_FLAG","BLUEBOOK","REVOKED","MVR_PTS","CAR_AGE")
idx <- match(colist, names(rawData))
rawDatamod2 <- cbind(rawData[,idx], rawData['TARGET_AMT'])
model2<-lm(TARGET_AMT ~ ., data=rawDatamod2)
summary(model2)
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = rawDatamod2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6269 -378 -34 192 101505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.315e+02 1.206e+02 -3.579 0.000347 ***
## TARGET_FLAG1 5.735e+03 1.036e+02 55.334 < 2e-16 ***
## BLUEBOOK 3.010e-02 5.328e-03 5.649 1.67e-08 ***
## REVOKEDYes -2.874e+02 1.356e+02 -2.120 0.034021 *
## MVR_PTS 1.309e+02 6.101e+01 2.145 0.031986 *
## CAR_AGE -1.291e+01 8.122e+00 -1.590 0.111894
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3968 on 8155 degrees of freedom
## Multiple R-squared: 0.289, Adjusted R-squared: 0.2886
## F-statistic: 662.9 on 5 and 8155 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2$residuals ~ model2$fitted.values)
plot(model2$fitted.values,rawData$TARGET_AMT)
par(mfrow=c(2,2))
plot(model2)
par(mfrow=c(1,2))
plot(model2$residuals ~ model2$fitted.values, main="New Reduced Var Model")
abline(h = 0)
plot(model$residuals ~ model$fitted.values, main="Orignal Model All Vars")
abline(h = 0)
# IvanTikhonov Model
#remove variables with opposite coefficients
model3<-lm(TARGET_AMT ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, data=rawData)
summary(model3)
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME,
## data = rawData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3610 -1652 -1239 -318 106277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.680e+03 1.470e+02 11.426 < 2e-16 ***
## KIDSDRIV 9.172e+02 1.789e+02 5.126 3.03e-07 ***
## INCOME -1.242e-03 1.336e-03 -0.930 0.3522
## HOME_VAL -2.809e-03 4.920e-04 -5.710 1.17e-08 ***
## TRAVTIME 7.234e+00 3.260e+00 2.219 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4679 on 8156 degrees of freedom
## Multiple R-squared: 0.01096, Adjusted R-squared: 0.01047
## F-statistic: 22.59 on 4 and 8156 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(model3$residuals ~ model3$fitted.values)
plot(model3$fitted.values,rawData$TARGET_AMT)
par(mfrow=c(2,2))
plot(model3)
rawTestData2<- rawTestData
rawTestData = read.csv(file="https://raw.githubusercontent.com/IvanGrozny88/DATA-621-HW-4/main/insurance-evaluation-data.csv")
rawTestData$TARGET_AMT <- 0
rawTestData$TARGET_FLAG <- 0
rawTestData = as.tbl(rawTestData) %>%
mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
currencyconv) %>%
mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
underscore) %>%
mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
as.factor) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
# impute data for missing values
# use column mean for calculation
rawTestData$HOMEKIDS <- log(rawTestData$HOMEKIDS+1)
rawTestData$MVR_PTS <- log(rawTestData$MVR_PTS+1)
rawTestData$OLDCLAIM <- log(rawTestData$OLDCLAIM+1)
rawTestData$TIF <- log(rawTestData$TIF+1)
rawTestData$KIDSDRIV <- log(rawTestData$KIDSDRIV+1)
rawTestData$CLM_FREQ <- log(rawTestData$CLM_FREQ+1)
# use column mean for calculation
rawTestData$AGE[is.na(rawTestData$AGE)] <- mean(rawTestData$AGE, na.rm=TRUE)
rawTestData$YOJ[is.na(rawTestData$YOJ)] <- mean(rawTestData$YOJ, na.rm=TRUE)
rawTestData$HOME_VAL[is.na(rawTestData$HOME_VAL)] <- mean(rawTestData$HOME_VAL, na.rm=TRUE)
rawTestData$CAR_AGE[is.na(rawTestData$CAR_AGE)] <- mean(rawTestData$CAR_AGE, na.rm=TRUE)
rawTestData$INCOME[is.na(rawTestData$INCOME)] <- mean(rawTestData$INCOME, na.rm=TRUE)
#get complete cases
#remove rad per correlation in prior section
rawTestData <- rawTestData[, !(colnames(rawTestData) %in% c("INDEX"))]
TARGET_FLAG <- predict(logit, newdata = rawTestData, type="response")
y_pred_num <- ifelse(TARGET_FLAG > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels=c(0, 1))
summary(y_pred)
## 0 1
## 1719 422
rbind(round(summary(predlogit),4), round(summary(TARGET_FLAG),4)) %>% kable()
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0.0024 | 0.0761 | 0.2011 | 0.2638 | 0.4037 | 0.9595 |
| 0.0031 | 0.0784 | 0.2313 | 0.2846 | 0.4360 | 0.9588 |
rawTestData$TARGET_FLAG <- as.factor(rawTestData$TARGET_FLAG)
rawTestData2 <- rawTestData[, !(colnames(rawTestData) %in% c("TARGET_FLAG"))]
TARGET_AMT<- predict(model, newdata = rawTestData, interval='confidence') #data from scaling originally to get to actual wins
summary(TARGET_AMT)
## fit lwr upr
## Min. :-1283.26 Min. :-2002.8 Min. :-563.7
## 1st Qu.: -272.62 1st Qu.: -799.6 1st Qu.: 242.0
## Median : -36.87 Median : -550.2 Median : 465.8
## Mean : -24.50 Mean : -558.3 Mean : 509.3
## 3rd Qu.: 211.39 3rd Qu.: -312.8 3rd Qu.: 763.2
## Max. : 1250.19 Max. : 503.5 Max. :1996.9
summary(model)
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = rawData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6211 -464 -60 243 101197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.940e+02 5.009e+02 -1.186 0.2357
## TARGET_FLAG1 5.710e+03 1.135e+02 50.297 < 2e-16 ***
## KIDSDRIV -2.052e+01 1.781e+02 -0.115 0.9083
## AGE 6.147e+00 6.271e+00 0.980 0.3270
## HOMEKIDS 9.206e+01 1.256e+02 0.733 0.4637
## YOJ 7.591e+00 1.319e+01 0.576 0.5648
## INCOME -2.254e-03 1.577e-03 -1.430 0.1528
## PARENT1Yes 1.218e+02 1.830e+02 0.666 0.5056
## HOME_VAL 3.834e-04 5.165e-04 0.742 0.4580
## MSTATUSz_No 1.765e+02 1.282e+02 1.377 0.1685
## SEXz_F -2.901e+02 1.606e+02 -1.806 0.0709 .
## EDUCATIONBachelors 6.878e+01 1.790e+02 0.384 0.7008
## EDUCATIONMasters 2.222e+02 2.620e+02 0.848 0.3965
## EDUCATIONPhD 4.288e+02 3.110e+02 1.379 0.1679
## EDUCATIONz_High_School -1.241e+02 1.502e+02 -0.827 0.4084
## JOBClerical -8.206e+00 2.984e+02 -0.027 0.9781
## JOBDoctor -2.791e+02 3.571e+02 -0.782 0.4345
## JOBHome_Maker -7.036e+01 3.185e+02 -0.221 0.8251
## JOBLawyer 8.014e+01 2.582e+02 0.310 0.7563
## JOBManager -1.261e+02 2.521e+02 -0.500 0.6170
## JOBProfessional 1.750e+02 2.698e+02 0.649 0.5165
## JOBStudent -1.302e+02 3.266e+02 -0.399 0.6900
## JOBz_Blue_Collar 5.242e+01 2.813e+02 0.186 0.8522
## TRAVTIME 5.975e-01 2.824e+00 0.212 0.8324
## CAR_USEPrivate -1.006e+02 1.443e+02 -0.697 0.4857
## BLUEBOOK 2.941e-02 7.536e-03 3.902 9.61e-05 ***
## TIF -1.653e+01 6.277e+01 -0.263 0.7923
## CAR_TYPEPanel_Truck -5.601e+01 2.431e+02 -0.230 0.8178
## CAR_TYPEPickup -3.187e+01 1.493e+02 -0.213 0.8310
## CAR_TYPESports_Car 2.089e+02 1.910e+02 1.094 0.2741
## CAR_TYPEVan 9.699e+01 1.864e+02 0.520 0.6029
## CAR_TYPEz_SUV 1.635e+02 1.571e+02 1.040 0.2982
## RED_CARyes -2.768e+01 1.302e+02 -0.213 0.8316
## OLDCLAIM 3.731e-03 6.731e-03 0.554 0.5793
## CLM_FREQ -8.778e+01 1.060e+02 -0.828 0.4078
## REVOKEDYes -3.348e+02 1.534e+02 -2.182 0.0292 *
## MVR_PTS 1.397e+02 6.644e+01 2.103 0.0355 *
## CAR_AGE -2.521e+01 1.118e+01 -2.255 0.0242 *
## URBANICITYz_Highly_Rural/ Rural 2.772e+01 1.269e+02 0.218 0.8271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3970 on 8122 degrees of freedom
## Multiple R-squared: 0.2912, Adjusted R-squared: 0.2879
## F-statistic: 87.81 on 38 and 8122 DF, p-value: < 2.2e-16
logit4 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, data=rawData, family = "binomial" (link="logit"))
summary(logit4)
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME,
## family = binomial(link = "logit"), data = rawData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5299 -0.8217 -0.6749 1.2315 2.8090
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.876e-01 7.305e-02 -9.412 < 2e-16 ***
## KIDSDRIV 7.266e-01 8.115e-02 8.953 < 2e-16 ***
## INCOME -3.497e-06 6.826e-07 -5.123 3.01e-07 ***
## HOME_VAL -2.972e-06 2.499e-07 -11.895 < 2e-16 ***
## TRAVTIME 5.880e-03 1.598e-03 3.679 0.000234 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 9021.1 on 8156 degrees of freedom
## AIC: 9031.1
##
## Number of Fisher Scoring iterations: 4
exp(logit4$coefficients)
## (Intercept) KIDSDRIV INCOME HOME_VAL TRAVTIME
## 0.5028055 2.0679778 0.9999965 0.9999970 1.0058969
predlogit4 <- predict(logit4, type="response")
rawData2$pred4 <- predict(logit4, type="response")
summary(predlogit4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01176 0.19679 0.25557 0.26382 0.32927 0.68970
table(true = rawData$TARGET_FLAG, pred = round(fitted(logit4)))
## pred
## true 0 1
## 0 5937 71
## 1 2086 67
par(mfrow=c(2,2))
plot(logit4)
data.frame(rawData2$pred4) %>%
ggplot(aes(x = rawData2.pred4)) +
geom_histogram(bins = 50, fill = 'violet') +
labs(title = 'Histogram of Predictions') +
theme_bw()
plot.roc(rawData$TARGET_FLAG, rawData2$pred4)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
Overall, none of our models really seemed to have the desired correlation that we were looking for. The one model that Seung made predicts well on the condition that the regression model can correctly predict claim amount. But since we don’t have such a pristine model available, both models perform poorly. Perhaps we need to look into more data engineering techniques to better normalize our data. We spent an excessive amount of time doing imputations and statistical testing but it was overall ineffective to producing results.
Data is provided by CUNY School of Professional Studies and is thus sourced by open data initiatives.
CleanInsuranceStrings
## function(df) {
## SilentMutateAt(df, colnames(df)[sapply(df, is.character)], GsubPipe, "\\$|z_|<|,", "") %>%
## CleanYesNo(.) %>%
## # could also use readr::parse_number here
## SilentMutateAt(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), as.numeric) %>%
## ### Coerces blank strings to NA
## SilentMutateAt(., "JOB", function(x) { x[x == ""] = "Unspecified" ; x }) %>%
## ### Converts strings to factors
## dplyr::mutate_at(., colnames(.)[sapply(., is.character)], as.factor)
## }
## <bytecode: 0x0000021f9c7d7b58>
InsuranceTransforms
## function(df) {
## df %>%
## dplyr::mutate(
## IS_MANAGER = as.factor(as.integer(JOB %in% "Manager")),
## IS_DOCTOR = as.factor(as.integer(JOB %in% "Doctor")),
## NO_CLM_HIST = as.factor(as.integer(CLM_FREQ < 1))
## ) %>%
## dplyr::select(-JOB, -CLM_FREQ)
## }