Abstract

We discuss the challenges from evaluating insurance data, asserting whether a car is involved in a crash and then how expensive the claim would be. Mainly, we discuss imputation but our end models are not able to effectively predict car crashes due to overfitting and high false positives. Our linear regression models pull some insights regarding the effects of marital status and others on claim costs.

Data Exploration & Preparation

The data has 23 features and 2 target variables that we wish to predict. As part of our predictive analytics, first we will decide if the car that belongs to a license plate has been in a crash before. Second, we will determine how much the cost of the crash was if they were.

There are some general myths related to driving that may skew our view about what we expect. How much of it is true? Personally, I think that senior drivers get into more crashes than those younger but what if that’s not the case? Provided below is a quick data dictionary of the features we will be measuring.

The top of the dataset is shown below, where we can see a smattering of isues such as untidy data and even a missing value in one of the income observations.

#this step is necessary in order to analyze data as it is not clean
currencyconv = function(input) {
  out = sub("\\$", "", input)
  out = as.numeric(sub(",", "", out))
  return(out)
}

# Replace spaces with underscores
underscore = function(input) {
  out = sub(" ", "_", input)
  return(out)
}


rawData = as.tbl(rawData) %>% 
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
            currencyconv) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            underscore) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            as.factor) %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
## Warning: `as.tbl()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#check data
summary(rawData) %>% kable() %>% kable_styling()
INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE URBANICITY
Min. : 1 0:6008 Min. : 0 Min. :0.0000 Min. :16.00 Min. :0.0000 Min. : 0.0 Min. : 0 Length:8161 Min. : 0 Length:8161 Length:8161 <High_School :1203 z_Blue_Collar:1825 Min. : 5.00 Length:8161 Min. : 1500 Min. : 1.000 Minivan :2145 Length:8161 Min. : 0 Min. :0.0000 Length:8161 Min. : 0.000 Min. :-3.000 Highly_Urban/ Urban :6492
1st Qu.: 2559 1:2153 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097 Class :character 1st Qu.: 0 Class :character Class :character Bachelors :2242 Clerical :1271 1st Qu.: 22.00 Class :character 1st Qu.: 9280 1st Qu.: 1.000 Panel_Truck: 676 Class :character 1st Qu.: 0 1st Qu.:0.0000 Class :character 1st Qu.: 0.000 1st Qu.: 1.000 z_Highly_Rural/ Rural:1669
Median : 5133 NA Median : 0 Median :0.0000 Median :45.00 Median :0.0000 Median :11.0 Median : 54028 Mode :character Median :161160 Mode :character Mode :character Masters :1658 Professional :1117 Median : 33.00 Mode :character Median :14440 Median : 4.000 Pickup :1389 Mode :character Median : 0 Median :0.0000 Mode :character Median : 1.000 Median : 8.000 NA
Mean : 5152 NA Mean : 1504 Mean :0.1711 Mean :44.79 Mean :0.7212 Mean :10.5 Mean : 61898 NA Mean :154867 NA NA PhD : 728 Manager : 988 Mean : 33.49 NA Mean :15710 Mean : 5.351 Sports_Car : 907 NA Mean : 4037 Mean :0.7986 NA Mean : 1.696 Mean : 8.328 NA
3rd Qu.: 7745 NA 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986 NA 3rd Qu.:238724 NA NA z_High_School:2330 Lawyer : 835 3rd Qu.: 44.00 NA 3rd Qu.:20850 3rd Qu.: 7.000 Van : 750 NA 3rd Qu.: 4636 3rd Qu.:2.0000 NA 3rd Qu.: 3.000 3rd Qu.:12.000 NA
Max. :10302 NA Max. :107586 Max. :4.0000 Max. :81.00 Max. :5.0000 Max. :23.0 Max. :367030 NA Max. :885282 NA NA NA Student : 712 Max. :142.00 NA Max. :69740 Max. :25.000 z_SUV :2294 NA Max. :57037 Max. :5.0000 NA Max. :13.000 Max. :28.000 NA
NA NA NA NA NA’s :6 NA NA’s :454 NA’s :445 NA NA’s :464 NA NA NA (Other) :1413 NA NA NA NA NA NA NA NA NA NA NA’s :510 NA

Tidy Data

After carefully examining the data in the raw data frame, it was found that certain variables contained unnecessary characters such as dollar signs and garbage data. To address this issue, regular expression (regex) was used to remove these unwanted characters from the variables. By doing so, the data is now cleaner and more suitable for analysis.

Additionally, there are several nominal categories that are yes and no questions. We reduced this to binary where 1=yes and2=no using regex again to further simplify the process.

Finally, we convert variables to type factor or numeric, as appropriate.

### Akin to dplyr::mutate_at but does not throw error ("silent") on missing cols and applies a function
SilentMutateAt <- function(df, dirtyCols = c("Sample", "Text"), FUN, ...) {
  ind = colnames(df) %in% dirtyCols
  df[, ind] = sapply(df[, ind], FUN = FUN, ... = ...)
  
  return(df)
}

### Rearranges parameters of gsub to be more friendly to pipe operator
GsubPipe <- function(x, pattern, replacement, ...) {
  gsub(pattern, replacement, x, ...)
}

### Replaces "yes/no" questions with an integer where yes=1 and no=0
CleanYesNo <- function(df, trace = FALSE, threshold = 0.5) {
  yesNoCols = sapply(df, function(x) { 
    unique(x) %>% 
      grepl("yes|no|^y$|^n$", ., ignore.case = T) %>% 
        { sum(.) / length(.) } %>%
          { . >= threshold }
  }) %>%
    .[.] %>%
      names(.)
  
  result = SilentMutateAt(df, yesNoCols, function(x) {
    truthy = as.integer(grepl("yes|^y$", x, ignore.case = T))
    truthy = truthy - as.integer(grepl("no|^n$", x, ignore.case = T))
    
    truthy[truthy == 0L] = NA
    truthy[truthy == -1L] = 0L
    
    truthy
  })
  
  if (trace) {
    return(list(result, yesNoCols))
  }
  
  return(result)
}

### Pre-processing data pipeline for insurance data; cleans up strings
CleanInsuranceStrings <- function(df) {
  SilentMutateAt(df, colnames(df)[sapply(df, is.character)], GsubPipe, "\\$|z_|<|,", "") %>% 
    CleanYesNo(.) %>%
      # could also use readr::parse_number here
      SilentMutateAt(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), as.numeric) %>% 
        ### Coerces blank strings to NA
        SilentMutateAt(., "JOB", function(x) { x[x == ""] = "Unspecified" ; x }) %>%
          ### Converts strings to factors
          dplyr::mutate_at(., colnames(.)[sapply(., is.character)], as.factor)
}
#runs functions
cleanData <- CleanInsuranceStrings(rawData)
## Warning in `[<-.factor`(`*tmp*`, x == "", value = "Unspecified"): invalid
## factor level, NA generated
cleanTestData <- CleanInsuranceStrings(rawTestData)
imputeData = CleanInsuranceStrings(rawData)
## Warning in `[<-.factor`(`*tmp*`, x == "", value = "Unspecified"): invalid
## factor level, NA generated
#reference: https://statisticsglobe.com/convert-character-to-factor-in-r

summary(cleanData)
##      INDEX       TARGET_FLAG   TARGET_AMT        KIDSDRIV           AGE       
##  Min.   :    1   0:6008      Min.   :     0   Min.   :0.0000   Min.   :16.00  
##  1st Qu.: 2559   1:2153      1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##  Median : 5133               Median :     0   Median :0.0000   Median :45.00  
##  Mean   : 5152               Mean   :  1504   Mean   :0.1711   Mean   :44.79  
##  3rd Qu.: 7745               3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##  Max.   :10302               Max.   :107586   Max.   :4.0000   Max.   :81.00  
##                                                                NA's   :6      
##     HOMEKIDS           YOJ           INCOME          PARENT1     
##  Min.   :0.0000   Min.   : 0.0   Min.   :     0   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28097   1st Qu.:0.000  
##  Median :0.0000   Median :11.0   Median : 54028   Median :0.000  
##  Mean   :0.7212   Mean   :10.5   Mean   : 61898   Mean   :0.132  
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85986   3rd Qu.:0.000  
##  Max.   :5.0000   Max.   :23.0   Max.   :367030   Max.   :1.000  
##                   NA's   :454    NA's   :445                     
##     HOME_VAL         MSTATUS       SEX              EDUCATION   
##  Min.   :     0   Min.   :0.0000   F:4375   <High_School :1203  
##  1st Qu.:     0   1st Qu.:0.0000   M:3786   Bachelors    :2242  
##  Median :161160   Median :1.0000            Masters      :1658  
##  Mean   :154867   Mean   :0.5997            PhD          : 728  
##  3rd Qu.:238724   3rd Qu.:1.0000            z_High_School:2330  
##  Max.   :885282   Max.   :1.0000                                
##  NA's   :464                                                    
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  z_Blue_Collar:1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional :1117   Median : 33.00                     Median :14440  
##  Manager      : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer       : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  (Other)      :1599   Max.   :142.00                     Max.   :69740  
##  NA's         : 526                                                     
##       TIF                CAR_TYPE       RED_CAR          OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   Min.   :0.0000   Min.   :    0  
##  1st Qu.: 1.000   Panel_Truck: 676   1st Qu.:0.0000   1st Qu.:    0  
##  Median : 4.000   Pickup     :1389   Median :0.0000   Median :    0  
##  Mean   : 5.351   Sports_Car : 907   Mean   :0.2914   Mean   : 4037  
##  3rd Qu.: 7.000   Van        : 750   3rd Qu.:1.0000   3rd Qu.: 4636  
##  Max.   :25.000   z_SUV      :2294   Max.   :1.0000   Max.   :57037  
##                                                                      
##     CLM_FREQ         REVOKED          MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000   Median :0.0000   Median : 1.000   Median : 8.000  
##  Mean   :0.7986   Mean   :0.1225   Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000   3rd Qu.:0.0000   3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000   Max.   :1.0000   Max.   :13.000   Max.   :28.000  
##                                                     NA's   :510     
##                  URBANICITY  
##  Highly_Urban/ Urban  :6492  
##  z_Highly_Rural/ Rural:1669  
##                              
##                              
##                              
##                              
## 
head(cleanData)
## # A tibble: 6 Ă— 26
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV   AGE HOMEKIDS   YOJ INCOME PARENT1
##   <int> <fct>            <dbl>    <int> <int>    <int> <int>  <dbl>   <int>
## 1     1 0                    0        0    60        0    11  67349       0
## 2     2 0                    0        0    43        0    11  91449       0
## 3     4 0                    0        0    35        1    10  16039       0
## 4     5 0                    0        0    51        0    14     NA       0
## 5     6 0                    0        0    50        0    NA 114986       0
## 6     7 1                 2946        0    34        1    12 125301       1
## # ℹ 17 more variables: HOME_VAL <dbl>, MSTATUS <int>, SEX <fct>,
## #   EDUCATION <fct>, JOB <fct>, TRAVTIME <int>, CAR_USE <fct>, BLUEBOOK <dbl>,
## #   TIF <int>, CAR_TYPE <fct>, RED_CAR <int>, OLDCLAIM <dbl>, CLM_FREQ <int>,
## #   REVOKED <int>, MVR_PTS <int>, CAR_AGE <int>, URBANICITY <fct>

Visulization of the data set

Our features span a multitude of different distributions. Many of them are normal, others are binomial, and others are nominal. How we handle each feature will thus influence future analysis.

### Plots the distribution of each variable
# See common-helpers.R
ntrain<-select_if(rawData, is.numeric)
ntrain %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density() 
## Warning: Removed 1879 rows containing non-finite values (`stat_density()`).

Visualization of Correlation

According to this correlation matrix, all of the features are not strongly correlated to each other. With this in mind, we will go forward and assume that potentially every feature is a principal component and we will not exclude any until further downstream analysis is conducted.

### Develops a correlation plot for all numeric values
# See common-helpers.R
rawDatanum <- dplyr::select_if(rawData, is.numeric)

rcorr(as.matrix(rawDatanum))
##            INDEX TARGET_AMT KIDSDRIV   AGE HOMEKIDS   YOJ INCOME HOME_VAL
## INDEX       1.00       0.00     0.02  0.03     0.00  0.03   0.01     0.01
## TARGET_AMT  0.00       1.00     0.06 -0.04     0.06 -0.02  -0.06    -0.09
## KIDSDRIV    0.02       0.06     1.00 -0.08     0.46  0.04  -0.05    -0.02
## AGE         0.03      -0.04    -0.08  1.00    -0.45  0.14   0.18     0.21
## HOMEKIDS    0.00       0.06     0.46 -0.45     1.00  0.09  -0.16    -0.11
## YOJ         0.03      -0.02     0.04  0.14     0.09  1.00   0.29     0.27
## INCOME      0.01      -0.06    -0.05  0.18    -0.16  0.29   1.00     0.58
## HOME_VAL    0.01      -0.09    -0.02  0.21    -0.11  0.27   0.58     1.00
## TRAVTIME   -0.02       0.03     0.01  0.01    -0.01 -0.02  -0.05    -0.04
## BLUEBOOK    0.01       0.00    -0.02  0.17    -0.11  0.14   0.43     0.26
## TIF        -0.01      -0.05     0.00  0.00     0.01  0.02   0.00     0.00
## OLDCLAIM    0.00       0.07     0.02 -0.03     0.03  0.00  -0.05    -0.07
## CLM_FREQ    0.02       0.12     0.04 -0.02     0.03 -0.03  -0.05    -0.09
## MVR_PTS     0.01       0.14     0.05 -0.07     0.06 -0.04  -0.06    -0.09
## CAR_AGE     0.00      -0.06    -0.05  0.18    -0.15  0.06   0.41     0.22
##            TRAVTIME BLUEBOOK   TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## INDEX         -0.02     0.01 -0.01     0.00     0.02    0.01    0.00
## TARGET_AMT     0.03     0.00 -0.05     0.07     0.12    0.14   -0.06
## KIDSDRIV       0.01    -0.02  0.00     0.02     0.04    0.05   -0.05
## AGE            0.01     0.17  0.00    -0.03    -0.02   -0.07    0.18
## HOMEKIDS      -0.01    -0.11  0.01     0.03     0.03    0.06   -0.15
## YOJ           -0.02     0.14  0.02     0.00    -0.03   -0.04    0.06
## INCOME        -0.05     0.43  0.00    -0.05    -0.05   -0.06    0.41
## HOME_VAL      -0.04     0.26  0.00    -0.07    -0.09   -0.09    0.22
## TRAVTIME       1.00    -0.02 -0.01    -0.02     0.01    0.01   -0.04
## BLUEBOOK      -0.02     1.00 -0.01    -0.03    -0.04   -0.04    0.19
## TIF           -0.01    -0.01  1.00    -0.02    -0.02   -0.04    0.01
## OLDCLAIM      -0.02    -0.03 -0.02     1.00     0.50    0.26   -0.01
## CLM_FREQ       0.01    -0.04 -0.02     0.50     1.00    0.40   -0.01
## MVR_PTS        0.01    -0.04 -0.04     0.26     0.40    1.00   -0.02
## CAR_AGE       -0.04     0.19  0.01    -0.01    -0.01   -0.02    1.00
## 
## n
##            INDEX TARGET_AMT KIDSDRIV  AGE HOMEKIDS  YOJ INCOME HOME_VAL
## INDEX       8161       8161     8161 8155     8161 7707   7716     7697
## TARGET_AMT  8161       8161     8161 8155     8161 7707   7716     7697
## KIDSDRIV    8161       8161     8161 8155     8161 7707   7716     7697
## AGE         8155       8155     8155 8155     8155 7701   7711     7693
## HOMEKIDS    8161       8161     8161 8155     8161 7707   7716     7697
## YOJ         7707       7707     7707 7701     7707 7707   7291     7270
## INCOME      7716       7716     7716 7711     7716 7291   7716     7285
## HOME_VAL    7697       7697     7697 7693     7697 7270   7285     7697
## TRAVTIME    8161       8161     8161 8155     8161 7707   7716     7697
## BLUEBOOK    8161       8161     8161 8155     8161 7707   7716     7697
## TIF         8161       8161     8161 8155     8161 7707   7716     7697
## OLDCLAIM    8161       8161     8161 8155     8161 7707   7716     7697
## CLM_FREQ    8161       8161     8161 8155     8161 7707   7716     7697
## MVR_PTS     8161       8161     8161 8155     8161 7707   7716     7697
## CAR_AGE     7651       7651     7651 7645     7651 7219   7237     7223
##            TRAVTIME BLUEBOOK  TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## INDEX          8161     8161 8161     8161     8161    8161    7651
## TARGET_AMT     8161     8161 8161     8161     8161    8161    7651
## KIDSDRIV       8161     8161 8161     8161     8161    8161    7651
## AGE            8155     8155 8155     8155     8155    8155    7645
## HOMEKIDS       8161     8161 8161     8161     8161    8161    7651
## YOJ            7707     7707 7707     7707     7707    7707    7219
## INCOME         7716     7716 7716     7716     7716    7716    7237
## HOME_VAL       7697     7697 7697     7697     7697    7697    7223
## TRAVTIME       8161     8161 8161     8161     8161    8161    7651
## BLUEBOOK       8161     8161 8161     8161     8161    8161    7651
## TIF            8161     8161 8161     8161     8161    8161    7651
## OLDCLAIM       8161     8161 8161     8161     8161    8161    7651
## CLM_FREQ       8161     8161 8161     8161     8161    8161    7651
## MVR_PTS        8161     8161 8161     8161     8161    8161    7651
## CAR_AGE        7651     7651 7651     7651     7651    7651    7651
## 
## P
##            INDEX  TARGET_AMT KIDSDRIV AGE    HOMEKIDS YOJ    INCOME HOME_VAL
## INDEX             0.9572     0.1594   0.0022 0.9962   0.0189 0.4385 0.2881  
## TARGET_AMT 0.9572            0.0000   0.0002 0.0000   0.0525 0.0000 0.0000  
## KIDSDRIV   0.1594 0.0000              0.0000 0.0000   0.0001 0.0000 0.0825  
## AGE        0.0022 0.0002     0.0000          0.0000   0.0000 0.0000 0.0000  
## HOMEKIDS   0.9962 0.0000     0.0000   0.0000          0.0000 0.0000 0.0000  
## YOJ        0.0189 0.0525     0.0001   0.0000 0.0000          0.0000 0.0000  
## INCOME     0.4385 0.0000     0.0000   0.0000 0.0000   0.0000        0.0000  
## HOME_VAL   0.2881 0.0000     0.0825   0.0000 0.0000   0.0000 0.0000         
## TRAVTIME   0.0372 0.0115     0.4455   0.6342 0.5128   0.1369 0.0000 0.0018  
## BLUEBOOK   0.2089 0.6712     0.0516   0.0000 0.0000   0.0000 0.0000 0.0000  
## TIF        0.4053 0.0000     0.8574   0.9952 0.2859   0.0296 0.9276 0.8564  
## OLDCLAIM   0.9091 0.0000     0.0653   0.0082 0.0069   0.7936 0.0000 0.0000  
## CLM_FREQ   0.0898 0.0000     0.0008   0.0296 0.0080   0.0209 0.0000 0.0000  
## MVR_PTS    0.4765 0.0000     0.0000   0.0000 0.0000   0.0009 0.0000 0.0000  
## CAR_AGE    0.9513 0.0000     0.0000   0.0000 0.0000   0.0000 0.0000 0.0000  
##            TRAVTIME BLUEBOOK TIF    OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## INDEX      0.0372   0.2089   0.4053 0.9091   0.0898   0.4765  0.9513 
## TARGET_AMT 0.0115   0.6712   0.0000 0.0000   0.0000   0.0000  0.0000 
## KIDSDRIV   0.4455   0.0516   0.8574 0.0653   0.0008   0.0000  0.0000 
## AGE        0.6342   0.0000   0.9952 0.0082   0.0296   0.0000  0.0000 
## HOMEKIDS   0.5128   0.0000   0.2859 0.0069   0.0080   0.0000  0.0000 
## YOJ        0.1369   0.0000   0.0296 0.7936   0.0209   0.0009  0.0000 
## INCOME     0.0000   0.0000   0.9276 0.0000   0.0000   0.0000  0.0000 
## HOME_VAL   0.0018   0.0000   0.8564 0.0000   0.0000   0.0000  0.0000 
## TRAVTIME            0.1246   0.2945 0.0818   0.5535   0.3384  0.0008 
## BLUEBOOK   0.1246            0.6242 0.0077   0.0010   0.0004  0.0000 
## TIF        0.2945   0.6242          0.0473   0.0375   0.0002  0.4969 
## OLDCLAIM   0.0818   0.0077   0.0473          0.0000   0.0000  0.2417 
## CLM_FREQ   0.5535   0.0010   0.0375 0.0000            0.0000  0.4151 
## MVR_PTS    0.3384   0.0004   0.0002 0.0000   0.0000           0.0817 
## CAR_AGE    0.0008   0.0000   0.4969 0.2417   0.4151   0.0817
corrplot(cor(rawDatanum), method="square")

Missing values

The raw data frame contains 1879` missing values, representing 1713 out of 8161 observations.

### Counts values with missing data
#I don't understand this line, but this does not agree with the next chunk
# AA: i got the same result on my side but the other version is more clear
#nrow(rawData[is.na(rawData),]) 

### total missing values
colSums(is.na(rawData)) %>% sum()

### Counts rows with missing data
sum(!complete.cases(rawData)) #observations with missing data

The following variables in the dataset have missing values: TARGET_FLAG, TARGET_AMT, AGE, YOJ, and CAR_AGE.

#this function adds up the rows with missing values to get a count of NAs by feature
colSums(is.na(rawData)) 
##       INDEX TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS 
##           0           0           0           0           6           0 
##         YOJ      INCOME     PARENT1    HOME_VAL     MSTATUS         SEX 
##         454         445           0         464           0           0 
##   EDUCATION         JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF 
##           0           0           0           0           0           0 
##    CAR_TYPE     RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS 
##           0           0           0           0           0           0 
##     CAR_AGE  URBANICITY 
##         510           0

A good rule of thumb is to drop any feature with more than 80% missing values – that is not the case with any features here. The remaining options are to: 1) Drop rows with missing values 2) Infer missing values

In order to decide which course to take we should first determine if there is any correlation between features with missing values, or if values are missing at random. Because several of these features are numeric, we will first need to convert the features to the appropriate data type in order to determine if there is a correlation.

# TO COME -- Section exploring correlation
# impute data for missing values
# use column mean for calculation

rawData$AGE[is.na(rawData$AGE)] <- mean(rawData$AGE, na.rm=TRUE)
rawData$YOJ[is.na(rawData$YOJ)] <- mean(rawData$YOJ, na.rm=TRUE)
rawData$HOME_VAL[is.na(rawData$HOME_VAL)] <- mean(rawData$HOME_VAL, na.rm=TRUE)
rawData$CAR_AGE[is.na(rawData$CAR_AGE)] <- mean(rawData$CAR_AGE, na.rm=TRUE)

rawData$INCOME[is.na(rawData$INCOME)] <- mean(rawData$INCOME, na.rm=TRUE)

#get complete cases
rawData <- rawData[complete.cases(rawData),]

rawData2<-rawData
 
# # transform data using log for skewed HOMEKIDS, MVR_PTS, OLDCLAIM, TIF, KIDSDRIVE and CLM_FREQ 

rawData$HOMEKIDS <- log(rawData$HOMEKIDS+1)
rawData$MVR_PTS <- log(rawData$MVR_PTS+1)
rawData$TIF <- log(rawData$TIF+1)
rawData$KIDSDRIV <- log(rawData$KIDSDRIV+1)
rawData$CLM_FREQ <- log(rawData$CLM_FREQ+1)


#remove rad per correlation in prior section

rawData <- rawData[, !(colnames(rawData) %in% c("INDEX"))]
# 
# #create variable
rawDatanum <- dplyr::select_if(rawData, is.numeric)

rcorr(as.matrix(rawDatanum))
##            TARGET_AMT KIDSDRIV   AGE HOMEKIDS   YOJ INCOME HOME_VAL TRAVTIME
## TARGET_AMT       1.00     0.06 -0.04     0.07 -0.02  -0.06    -0.08     0.03
## KIDSDRIV         0.06     1.00 -0.08     0.49  0.05  -0.05    -0.02     0.01
## AGE             -0.04    -0.08  1.00    -0.47  0.13   0.18     0.20     0.01
## HOMEKIDS         0.07     0.49 -0.47     1.00  0.08  -0.16    -0.11    -0.01
## YOJ             -0.02     0.05  0.13     0.08  1.00   0.27     0.26    -0.02
## INCOME          -0.06    -0.05  0.18    -0.16  0.27   1.00     0.54    -0.05
## HOME_VAL        -0.08    -0.02  0.20    -0.11  0.26   0.54     1.00    -0.03
## TRAVTIME         0.03     0.01  0.01    -0.01 -0.02  -0.05    -0.03     1.00
## BLUEBOOK         0.00    -0.02  0.16    -0.11  0.14   0.42     0.25    -0.02
## TIF             -0.05    -0.01  0.00     0.00  0.02  -0.01     0.00    -0.01
## OLDCLAIM         0.07     0.02 -0.03     0.03  0.00  -0.04    -0.07    -0.02
## CLM_FREQ         0.13     0.04 -0.03     0.04 -0.02  -0.05    -0.10     0.00
## MVR_PTS          0.13     0.05 -0.07     0.06 -0.03  -0.05    -0.07     0.01
## CAR_AGE         -0.06    -0.05  0.17    -0.15  0.06   0.39     0.20    -0.04
##            BLUEBOOK   TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_AMT     0.00 -0.05     0.07     0.13    0.13   -0.06
## KIDSDRIV      -0.02 -0.01     0.02     0.04    0.05   -0.05
## AGE            0.16  0.00    -0.03    -0.03   -0.07    0.17
## HOMEKIDS      -0.11  0.00     0.03     0.04    0.06   -0.15
## YOJ            0.14  0.02     0.00    -0.02   -0.03    0.06
## INCOME         0.42 -0.01    -0.04    -0.05   -0.05    0.39
## HOME_VAL       0.25  0.00    -0.07    -0.10   -0.07    0.20
## TRAVTIME      -0.02 -0.01    -0.02     0.00    0.01   -0.04
## BLUEBOOK       1.00 -0.01    -0.03    -0.04   -0.04    0.18
## TIF           -0.01  1.00    -0.02    -0.02   -0.04    0.00
## OLDCLAIM      -0.03 -0.02     1.00     0.54    0.25   -0.01
## CLM_FREQ      -0.04 -0.02     0.54     1.00    0.41   -0.01
## MVR_PTS       -0.04 -0.04     0.25     0.41    1.00   -0.01
## CAR_AGE        0.18  0.00    -0.01    -0.01   -0.01    1.00
## 
## n= 8161 
## 
## 
## P
##            TARGET_AMT KIDSDRIV AGE    HOMEKIDS YOJ    INCOME HOME_VAL TRAVTIME
## TARGET_AMT            0.0000   0.0002 0.0000   0.0585 0.0000 0.0000   0.0115  
## KIDSDRIV   0.0000              0.0000 0.0000   0.0000 0.0000 0.0577   0.5499  
## AGE        0.0002     0.0000          0.0000   0.0000 0.0000 0.0000   0.6342  
## HOMEKIDS   0.0000     0.0000   0.0000          0.0000 0.0000 0.0000   0.4230  
## YOJ        0.0585     0.0000   0.0000 0.0000          0.0000 0.0000   0.1362  
## INCOME     0.0000     0.0000   0.0000 0.0000   0.0000        0.0000   0.0000  
## HOME_VAL   0.0000     0.0577   0.0000 0.0000   0.0000 0.0000          0.0018  
## TRAVTIME   0.0115     0.5499   0.6342 0.4230   0.1362 0.0000 0.0018           
## BLUEBOOK   0.6712     0.0415   0.0000 0.0000   0.0000 0.0000 0.0000   0.1246  
## TIF        0.0000     0.3832   0.9404 0.6725   0.0498 0.4889 0.7280   0.2945  
## OLDCLAIM   0.0000     0.0476   0.0082 0.0023   0.7931 0.0000 0.0000   0.0818  
## CLM_FREQ   0.0000     0.0000   0.0054 0.0002   0.0272 0.0000 0.0000   0.7501  
## MVR_PTS    0.0000     0.0000   0.0000 0.0000   0.0033 0.0000 0.0000   0.5405  
## CAR_AGE    0.0000     0.0000   0.0000 0.0000   0.0000 0.0000 0.0000   0.0009  
##            BLUEBOOK TIF    OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_AMT 0.6712   0.0000 0.0000   0.0000   0.0000  0.0000 
## KIDSDRIV   0.0415   0.3832 0.0476   0.0000   0.0000  0.0000 
## AGE        0.0000   0.9404 0.0082   0.0054   0.0000  0.0000 
## HOMEKIDS   0.0000   0.6725 0.0023   0.0002   0.0000  0.0000 
## YOJ        0.0000   0.0498 0.7931   0.0272   0.0033  0.0000 
## INCOME     0.0000   0.4889 0.0000   0.0000   0.0000  0.0000 
## HOME_VAL   0.0000   0.7280 0.0000   0.0000   0.0000  0.0000 
## TRAVTIME   0.1246   0.2945 0.0818   0.7501   0.5405  0.0009 
## BLUEBOOK            0.5420 0.0077   0.0003   0.0007  0.0000 
## TIF        0.5420          0.0573   0.0408   0.0006  0.9927 
## OLDCLAIM   0.0077   0.0573          0.0000   0.0000  0.2402 
## CLM_FREQ   0.0003   0.0408 0.0000            0.0000  0.2247 
## MVR_PTS    0.0007   0.0006 0.0000   0.0000           0.4250 
## CAR_AGE    0.0000   0.9927 0.2402   0.2247   0.4250

It is important to address these missing values before proceeding with any analysis.

AGE & TARGET_FLAG

Based on the available data, it seems that YOJ tends to increase with age. There are 6 missing values in the AGE column, which is a trivial enough number that the simplest approach is to drop those observations. Observations that are missing the target cannot be used for training so those will also be dropped.

YOJ

Build the linear regression model with AGE as the predictor and YOJ as the response and predict the missing values of YOJ based on the values of AGE. Insert the predicted values of YOJ into the original dataframe.

# Subset the data with missing values of YOJ
missing_YOJ <- subset(imputeData, is.na(YOJ))

lm_model <- lm(YOJ ~ AGE, data = imputeData)

predicted_YOJ <- predict(lm_model, newdata = missing_YOJ)

rounded_YOJ <- round(predicted_YOJ, digits = 0)
imputeData$YOJ[is.na(imputeData$YOJ)] <- rounded_YOJ

Income

Build the linear regression model with AGE and YOJ as the predictors and INCOME as the response and predict the missing values of INCOME based on the values of AGE and YOJ. insert the predicted values of INCOME into the original dataframe.

# Subset the data with missing values of INCOME
missing_INCOME <- subset(imputeData, is.na(INCOME))

lm_model <- lm(INCOME ~ AGE + YOJ, data = imputeData)

predicted_INCOME <- predict(lm_model, newdata = missing_INCOME)

rounded_INCOME <- round(predicted_INCOME, digits = 0)
imputeData[is.na(imputeData$INCOME),]$INCOME <- rounded_INCOME

Home-val

Build the linear regression model with AGE, YOJ, and INCOME as predictors and HOME_VAL as the response and predict the missing values of HOME_VAL based on the values of AGE, YOJ, and INCOME. Insert the predicted values of HOME_VAL into the original dataframe

# Subset the data with missing values of HOME_VAL
missing_HOME_VAL <- subset(imputeData, is.na(HOME_VAL))


lm_model <- lm(HOME_VAL ~ AGE + YOJ + INCOME, data = imputeData)


predicted_HOME_VAL <- predict(lm_model, newdata = missing_HOME_VAL)

#rounded_HOME_VAL <- round(predicted_HOME_VAL, digits = 0)
imputeData[is.na(imputeData$HOME_VAL),]$HOME_VAL <- predicted_HOME_VAL
imputeData$HOME_VAL <- round(imputeData$HOME_VAL, digits = 0)

head(imputeData$HOME_VAL)
## [1]      0 257252 124191 306251 243925      0

Car Age

To handle the missing value in CAR_AGE, we will use chained imputation since it is difficult to assess. This will also fill any other data that is still missing so we now have produced our final cleaning steps. We may next consider doing manipulations such as normalization, zero centering, and other kinds of feature engineering.

imputeData = imputeData %>%
  mice::mice(., method = "pmm", printFlag = FALSE) %>%
  mice::complete(.)
trainClean = imputeData

Summary

summary(dplyr::select(imputeData, -INDEX))
##  TARGET_FLAG   TARGET_AMT        KIDSDRIV           AGE           HOMEKIDS     
##  0:6008      Min.   :     0   Min.   :0.0000   Min.   :16.00   Min.   :0.0000  
##  1:2153      1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00   1st Qu.:0.0000  
##              Median :     0   Median :0.0000   Median :45.00   Median :0.0000  
##              Mean   :  1504   Mean   :0.1711   Mean   :44.78   Mean   :0.7212  
##              3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00   3rd Qu.:1.0000  
##              Max.   :107586   Max.   :4.0000   Max.   :81.00   Max.   :5.0000  
##                                                                                
##       YOJ           INCOME          PARENT1         HOME_VAL     
##  Min.   : 0.0   Min.   :     0   Min.   :0.000   Min.   : -5482  
##  1st Qu.: 9.0   1st Qu.: 29401   1st Qu.:0.000   1st Qu.:     0  
##  Median :11.0   Median : 55578   Median :0.000   Median :160042  
##  Mean   :10.5   Mean   : 61895   Mean   :0.132   Mean   :154885  
##  3rd Qu.:13.0   3rd Qu.: 83871   3rd Qu.:0.000   3rd Qu.:236296  
##  Max.   :23.0   Max.   :367030   Max.   :1.000   Max.   :885282  
##                                                                  
##     MSTATUS       SEX              EDUCATION               JOB      
##  Min.   :0.0000   F:4375   <High_School :1203   z_Blue_Collar:1973  
##  1st Qu.:0.0000   M:3786   Bachelors    :2242   Clerical     :1338  
##  Median :1.0000            Masters      :1658   Professional :1204  
##  Mean   :0.5997            PhD          : 728   Manager      :1094  
##  3rd Qu.:1.0000            z_High_School:2330   Lawyer       : 872  
##  Max.   :1.0000                                 Student      : 761  
##                                                 (Other)      : 919  
##     TRAVTIME            CAR_USE        BLUEBOOK          TIF        
##  Min.   :  5.00   Commercial:3029   Min.   : 1500   Min.   : 1.000  
##  1st Qu.: 22.00   Private   :5132   1st Qu.: 9280   1st Qu.: 1.000  
##  Median : 33.00                     Median :14440   Median : 4.000  
##  Mean   : 33.49                     Mean   :15710   Mean   : 5.351  
##  3rd Qu.: 44.00                     3rd Qu.:20850   3rd Qu.: 7.000  
##  Max.   :142.00                     Max.   :69740   Max.   :25.000  
##                                                                     
##         CAR_TYPE       RED_CAR          OLDCLAIM        CLM_FREQ     
##  Minivan    :2145   Min.   :0.0000   Min.   :    0   Min.   :0.0000  
##  Panel_Truck: 676   1st Qu.:0.0000   1st Qu.:    0   1st Qu.:0.0000  
##  Pickup     :1389   Median :0.0000   Median :    0   Median :0.0000  
##  Sports_Car : 907   Mean   :0.2914   Mean   : 4037   Mean   :0.7986  
##  Van        : 750   3rd Qu.:1.0000   3rd Qu.: 4636   3rd Qu.:2.0000  
##  z_SUV      :2294   Max.   :1.0000   Max.   :57037   Max.   :5.0000  
##                                                                      
##     REVOKED          MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000   Median : 1.000   Median : 8.000  
##  Mean   :0.1225   Mean   : 1.696   Mean   : 8.336  
##  3rd Qu.:0.0000   3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :1.0000   Max.   :13.000   Max.   :28.000  
##                                                    
##                  URBANICITY  
##  Highly_Urban/ Urban  :6492  
##  z_Highly_Rural/ Rural:1669  
##                              
##                              
##                              
##                              
## 

Box-plot

We have 971 male and 1170 female in this study. Male average income is higher than man. However, man has highest car purchase amount. The boxplots show that the medians and interquartile range of the two distributions are very close.

SEX_df <- imputeData %>%
  dplyr::filter(!is.na(INCOME)) %>%
  dplyr::group_by(SEX) %>%
  dplyr::summarise(INCOME = mean(INCOME))

SEX_df
## # A tibble: 2 Ă— 2
##   SEX   INCOME
##   <fct>  <dbl>
## 1 F     57271.
## 2 M     67238.
ggplot2::ggplot(imputeData) +
  ggplot2::aes(x = SEX, y = INCOME, color = SEX, fill = SEX) +
  ggplot2::geom_bar(data = SEX_df, stat = "identity", alpha = .3) +
  ggrepel::geom_text_repel(ggplot2::aes(label = HOME_VAL), color = "black", size = 2.5, segment.color = "grey") +
  ggplot2::geom_point() +
  ggplot2::guides(color = "none", fill = "none") +
  ggplot2::theme_bw() +
  ggplot2::labs(title = "Income by Sex", x = "Sex", y = "Income")
## Warning: ggrepel: 8156 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Model Building

We will each create a model to explore and compare the differences between them. Additionally, we will discuss the key insights that we gain from the modeling process because it will help us draw conclusions about the dataset and potentially point us towards a preferred model.

Seung’s Model logit

Stepwise regression analysis was performed. The purpose of stepwise regression is to identify the subset of predictors that best explain the response variable. In this case, the response variables are TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ, MVR_PTS, and CAR_AGE. The model selects the best subset of these predictors by iteratively adding or removing variables based on statistical criteria, in this case the Akaike Information Criterion (AIC).

The final model contains the variables TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ ,MVR_PTS, CAR_AGE. This subset of variables yielded the lowest AIC value among all possible subsets of variables considered by the algorithm.

To evaluate the performance of a machine learning model, it is important to split the data into training and test sets. The training set is used to train the model, while the test set is used to evaluate how well the model generalizes to new, unseen data.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
set.seed(333)
trainIndex <- sample(1:nrow(trainClean), round(0.7*nrow(trainClean)), replace = FALSE)
trainData <- trainClean[trainIndex, ]
testData <- trainClean[-trainIndex, ]


trainData <- na.omit(trainData)
testData <- na.omit(testData)


logit <- glm(formula = TARGET_FLAG ~ . - TARGET_AMT, data=rawData, family = "binomial" (link="logit"))

summary(logit)
## 
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial(link = "logit"), 
##     data = rawData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5328  -0.7137  -0.3955   0.6469   3.1515  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -8.039e-01  3.297e-01  -2.438 0.014763 *  
## KIDSDRIV                         6.724e-01  1.106e-01   6.079 1.21e-09 ***
## AGE                             -5.647e-05  4.083e-03  -0.014 0.988965    
## HOMEKIDS                         1.493e-01  8.316e-02   1.795 0.072594 .  
## YOJ                             -1.262e-02  8.592e-03  -1.468 0.142042    
## INCOME                          -3.533e-06  1.079e-06  -3.273 0.001065 ** 
## PARENT1Yes                       3.275e-01  1.146e-01   2.858 0.004270 ** 
## HOME_VAL                        -1.300e-06  3.425e-07  -3.796 0.000147 ***
## MSTATUSz_No                      5.183e-01  8.499e-02   6.098 1.07e-09 ***
## SEXz_F                          -8.739e-02  1.122e-01  -0.779 0.436005    
## EDUCATIONBachelors              -3.791e-01  1.156e-01  -3.280 0.001037 ** 
## EDUCATIONMasters                -2.803e-01  1.788e-01  -1.568 0.116928    
## EDUCATIONPhD                    -1.638e-01  2.141e-01  -0.765 0.444147    
## EDUCATIONz_High_School           1.674e-02  9.495e-02   0.176 0.860061    
## JOBClerical                      3.929e-01  1.967e-01   1.997 0.045806 *  
## JOBDoctor                       -4.416e-01  2.669e-01  -1.654 0.098064 .  
## JOBHome_Maker                    2.019e-01  2.102e-01   0.960 0.336881    
## JOBLawyer                        9.864e-02  1.696e-01   0.581 0.560905    
## JOBManager                      -5.670e-01  1.715e-01  -3.305 0.000948 ***
## JOBProfessional                  1.513e-01  1.786e-01   0.847 0.396937    
## JOBStudent                       1.970e-01  2.145e-01   0.918 0.358363    
## JOBz_Blue_Collar                 3.000e-01  1.856e-01   1.616 0.106095    
## TRAVTIME                         1.467e-02  1.883e-03   7.790 6.73e-15 ***
## CAR_USEPrivate                  -7.587e-01  9.176e-02  -8.269  < 2e-16 ***
## BLUEBOOK                        -2.044e-05  5.263e-06  -3.884 0.000103 ***
## TIF                             -3.255e-01  4.145e-02  -7.854 4.04e-15 ***
## CAR_TYPEPanel_Truck              5.527e-01  1.617e-01   3.418 0.000630 ***
## CAR_TYPEPickup                   5.509e-01  1.008e-01   5.467 4.57e-08 ***
## CAR_TYPESports_Car               1.035e+00  1.300e-01   7.965 1.66e-15 ***
## CAR_TYPEVan                      6.171e-01  1.266e-01   4.874 1.09e-06 ***
## CAR_TYPEz_SUV                    7.725e-01  1.114e-01   6.936 4.05e-12 ***
## RED_CARyes                      -3.334e-03  8.638e-02  -0.039 0.969215    
## OLDCLAIM                        -1.724e-05  4.049e-06  -4.258 2.07e-05 ***
## CLM_FREQ                         5.010e-01  6.347e-02   7.894 2.93e-15 ***
## REVOKEDYes                       9.256e-01  9.201e-02  10.059  < 2e-16 ***
## MVR_PTS                          2.902e-01  4.157e-02   6.983 2.90e-12 ***
## CAR_AGE                         -1.735e-03  7.548e-03  -0.230 0.818180    
## URBANICITYz_Highly_Rural/ Rural -2.370e+00  1.130e-01 -20.966  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418  on 8160  degrees of freedom
## Residual deviance: 7290  on 8123  degrees of freedom
## AIC: 7366
## 
## Number of Fisher Scoring iterations: 5

The model’s coefficient of 0.5860083 represents the correlation between the predicted values and actual values of the target variable, which is the probability of a car insurance claim being filed (TARGET_FLAG). A value of 0.60 indicates a moderate positive correlation between the predicted and actual values.

The model’s coefficients for INCOME and HOME_VAL are negative, indicating that as these variables decrease, the predicted probability of a car insurance claim being filed (TARGET_FLAG) increases. This observation seems to contradict the expectation that individuals with higher incomes and more valuable homes would be safer drivers and thus less likely to file insurance claims.

If the result is counter intuitive, it might be worthwhile to conduct another experiment to verify the accuracy of the outcome.

exp(logit$coefficients)
##                     (Intercept)                        KIDSDRIV 
##                      0.44759439                      1.95890613 
##                             AGE                        HOMEKIDS 
##                      0.99994354                      1.16101816 
##                             YOJ                          INCOME 
##                      0.98746421                      0.99999647 
##                      PARENT1Yes                        HOME_VAL 
##                      1.38746810                      0.99999870 
##                     MSTATUSz_No                          SEXz_F 
##                      1.67913068                      0.91631881 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                      0.68447094                      0.75557285 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                      0.84890695                      1.01688024 
##                     JOBClerical                       JOBDoctor 
##                      1.48122021                      0.64303595 
##                   JOBHome_Maker                       JOBLawyer 
##                      1.22370504                      1.10366439 
##                      JOBManager                 JOBProfessional 
##                      0.56721931                      1.16332072 
##                      JOBStudent                JOBz_Blue_Collar 
##                      1.21775367                      1.34982753 
##                        TRAVTIME                  CAR_USEPrivate 
##                      1.01477923                      0.46827081 
##                        BLUEBOOK                             TIF 
##                      0.99997956                      0.72216334 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                      1.73800068                      1.73473229 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                      2.81535682                      1.85359874 
##                   CAR_TYPEz_SUV                      RED_CARyes 
##                      2.16517590                      0.99667196 
##                        OLDCLAIM                        CLM_FREQ 
##                      0.99998276                      1.65039273 
##                      REVOKEDYes                         MVR_PTS 
##                      2.52326347                      1.33675662 
##                         CAR_AGE URBANICITYz_Highly_Rural/ Rural 
##                      0.99826632                      0.09348369
logitscalar <- mean(dlogis(predict(logit, type = "link")))
logitscalar * coef(logit)
##                     (Intercept)                        KIDSDRIV 
##                   -1.168766e-01                    9.776014e-02 
##                             AGE                        HOMEKIDS 
##                   -8.209729e-06                    2.170676e-02 
##                             YOJ                          INCOME 
##                   -1.834135e-03                   -5.136424e-07 
##                      PARENT1Yes                        HOME_VAL 
##                    4.761333e-02                   -1.890080e-07 
##                     MSTATUSz_No                          SEXz_F 
##                    7.535364e-02                   -1.270601e-02 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                   -5.511974e-02                   -4.075057e-02 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                   -2.381617e-02                    2.433782e-03 
##                     JOBClerical                       JOBDoctor 
##                    5.711993e-02                   -6.419888e-02 
##                   JOBHome_Maker                       JOBLawyer 
##                    2.935237e-02                    1.434095e-02 
##                      JOBManager                 JOBProfessional 
##                   -8.243908e-02                    2.199482e-02 
##                      JOBStudent                JOBz_Blue_Collar 
##                    2.864354e-02                    4.361448e-02 
##                        TRAVTIME                  CAR_USEPrivate 
##                    2.133070e-03                   -1.103108e-01 
##                        BLUEBOOK                             TIF 
##                   -2.972373e-06                   -4.732594e-02 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                    8.036377e-02                    8.009009e-02 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                    1.504945e-01                    8.972613e-02 
##                   CAR_TYPEz_SUV                      RED_CARyes 
##                    1.123162e-01                   -4.846794e-04 
##                        OLDCLAIM                        CLM_FREQ 
##                   -2.506236e-06                    7.284374e-02 
##                      REVOKEDYes                         MVR_PTS 
##                    1.345688e-01                    4.219972e-02 
##                         CAR_AGE URBANICITYz_Highly_Rural/ Rural 
##                   -2.522830e-04                   -3.445764e-01
confint.default(logit)
##                                         2.5 %        97.5 %
## (Intercept)                     -1.450076e+00 -1.576594e-01
## KIDSDRIV                         4.556053e-01  8.891671e-01
## AGE                             -8.058607e-03  7.945675e-03
## HOMEKIDS                        -1.368645e-02  3.122811e-01
## YOJ                             -2.945506e-02  4.225006e-03
## INCOME                          -5.648457e-06 -1.417123e-06
## PARENT1Yes                       1.028624e-01  5.520987e-01
## HOME_VAL                        -1.971256e-06 -6.287076e-07
## MSTATUSz_No                      3.517063e-01  6.848462e-01
## SEXz_F                          -3.072784e-01  1.324965e-01
## EDUCATIONBachelors              -6.056274e-01 -1.525908e-01
## EDUCATIONMasters                -6.306668e-01  7.010865e-02
## EDUCATIONPhD                    -5.833681e-01  2.557567e-01
## EDUCATIONz_High_School          -1.693589e-01  2.028376e-01
## JOBClerical                      7.320068e-03  7.784124e-01
## JOBDoctor                       -9.646916e-01  8.158231e-02
## JOBHome_Maker                   -2.101384e-01  6.139047e-01
## JOBLawyer                       -2.338210e-01  4.310928e-01
## JOBManager                      -9.032236e-01 -2.307950e-01
## JOBProfessional                 -1.987380e-01  5.012952e-01
## JOBStudent                      -2.233874e-01  6.174032e-01
## JOBz_Blue_Collar                -6.385011e-02  6.638038e-01
## TRAVTIME                         1.097961e-02  1.836256e-02
## CAR_USEPrivate                  -9.385494e-01 -5.788676e-01
## BLUEBOOK                        -3.075900e-05 -1.012847e-05
## TIF                             -4.067373e-01 -2.442706e-01
## CAR_TYPEPanel_Truck              2.358295e-01  8.696413e-01
## CAR_TYPEPickup                   3.533748e-01  7.483314e-01
## CAR_TYPESports_Car               7.803697e-01  1.289808e+00
## CAR_TYPEVan                      3.689576e-01  8.653005e-01
## CAR_TYPEz_SUV                    5.541958e-01  9.908075e-01
## RED_CARyes                      -1.726303e-01  1.659631e-01
## OLDCLAIM                        -2.517283e-05 -9.302541e-06
## CLM_FREQ                         3.766150e-01  6.254116e-01
## REVOKEDYes                       7.452091e-01  1.105897e+00
## MVR_PTS                          2.087775e-01  3.717150e-01
## CAR_AGE                         -1.652895e-02  1.305859e-02
## URBANICITYz_Highly_Rural/ Rural -2.591520e+00 -2.148416e+00
predlogit <- predict(logit, type="response")
rawData2$pred1 <- predict(logit, type="response")
summary(predlogit)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002416 0.076138 0.201101 0.263816 0.403715 0.959539
table(true = rawData$TARGET_FLAG, pred = round(fitted(logit)))
##     pred
## true    0    1
##    0 5537  471
##    1 1239  914
#plots for Seung's Model logit
par(mfrow=c(2,2))
plot(logit)

data.frame(rawData2$pred1) %>%
    ggplot(aes(x = rawData2.pred1)) + 
    geom_histogram(bins = 50, fill = 'green') +
    labs(title = 'Histogram of Predictions') +
    theme_bw()

plot.roc(rawData$TARGET_FLAG, rawData2$pred1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

#extract variables that are significant and rerun model
sigvars <- data.frame(summary(logit)$coef[summary(logit)$coef[,4] <= .05, 4])
sigvars <- add_rownames(sigvars, "vars")
## Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::rownames_to_column()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Anthony’s First Model logit2

Of the models we build, we will build our first two models after an exhaustive outlier cleaning. It seems that both models favor managers for a lower insurance rate, interestingly enough. Additionally, the type of car has a significant impact on the amount. Also, both models agree that married couples are charged significantly lower rates, whereas this is even worse in the event that an unmarried person is also a single parent. Finally, both models agree that it is extremely significant whether the drivers are in urban or rural areas, such that urban drivers get charged around $1500 more than their rural counterparts.

colist<-dplyr::pull(sigvars, vars)
# colist<-colist[2:11]
colist<-c("KIDSDRIV","INCOME","PARENT1","HOME_VAL","MSTATUS","EDUCATION","JOB","TRAVTIME","CAR_USE","BLUEBOOK","TIF","CAR_TYPE","CLM_FREQ","REVOKED","MVR_PTS","URBANICITY")

idx <- match(colist, names(rawData))
rawDatamod2 <- cbind(rawData[,idx], rawData2['TARGET_FLAG'])


logit2 <- glm(TARGET_FLAG ~ ., data=rawDatamod2, family = "binomial" (link="logit"))
summary(logit2)
exp(logit2$coefficients)
logit2scalar <- mean(dlogis(predict(logit2, type = "link")))
logit2scalar * coef(logit2)

Additional Transforms

Since we created our initial models, we decided that there are three additional transforms we want to add. We believe that having a claim is predictive of making a claim again, so we will have a boolean column for any number of claims. Additionally, we decided that the only two relevant job types are doctors and managers, based on the models. Therefore, we will also make two boolean columns for if they are a manager or doctor.

### Creates a factor column for Manager/Doctor jobs and no claim history, specific to insurance dataset
InsuranceTransforms <- function(df) {
  df %>%
    dplyr::mutate(
      IS_MANAGER = as.factor(as.integer(JOB %in% "Manager")),
      IS_DOCTOR = as.factor(as.integer(JOB %in% "Doctor")),
      NO_CLM_HIST = as.factor(as.integer(CLM_FREQ < 1))
    ) %>%
    dplyr::select(-JOB, -CLM_FREQ)
}

Further Notes

Although new models were attempted with the new transforms and additional considerations, none of them successfully beat the existing model by a statistically significant margin. We compared each of the models using ANOVA and evaluated the LM models with MAE and RMSE, whereas the logistical models were compared using F1 and Accuracy metrics. Overall, we kept the previously accepted models due to the lack of advancement.

table(true = rawData$TARGET_FLAG, pred = round(fitted(logit2)))

Anthony’s First Model plots for logit2

par(mfrow=c(2,2))
plot(logit2)
data.frame(rawData2$pred2) %>%
    ggplot(aes(x = rawData2.pred2)) + 
    geom_histogram(bins = 50, fill = 'blue') +
    labs(title = 'Histogram of Predictions') +
    theme_bw()

IvanTikhonov Best Model

#PC Model no racial bias
logit3 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, data=rawData, family = "binomial" (link="logit"))
summary(logit3)
## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, 
##     family = binomial(link = "logit"), data = rawData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5299  -0.8217  -0.6749   1.2315   2.8090  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.876e-01  7.305e-02  -9.412  < 2e-16 ***
## KIDSDRIV     7.266e-01  8.115e-02   8.953  < 2e-16 ***
## INCOME      -3.497e-06  6.826e-07  -5.123 3.01e-07 ***
## HOME_VAL    -2.972e-06  2.499e-07 -11.895  < 2e-16 ***
## TRAVTIME     5.880e-03  1.598e-03   3.679 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 9021.1  on 8156  degrees of freedom
## AIC: 9031.1
## 
## Number of Fisher Scoring iterations: 4
exp(logit3$coefficients)
## (Intercept)    KIDSDRIV      INCOME    HOME_VAL    TRAVTIME 
##   0.5028055   2.0679778   0.9999965   0.9999970   1.0058969
predlogit3 <- predict(logit3, type="response")
rawData2$pred3 <- predict(logit3, type="response")
summary(predlogit3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01176 0.19679 0.25557 0.26382 0.32927 0.68970
table(true = rawData$TARGET_FLAG, pred = round(fitted(logit3)))
##     pred
## true    0    1
##    0 5937   71
##    1 2086   67
par(mfrow=c(2,2))
plot(logit3)

data.frame(rawData2$pred3) %>%
    ggplot(aes(x = rawData2.pred3)) + 
    geom_histogram(bins = 50, fill = 'red') +
    labs(title = 'Histogram of Predictions') +
    theme_bw()

plot.roc(rawData$TARGET_FLAG, rawData2$pred3)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

logit3scalar <- mean(dlogis(predict(logit3, type = "link")))
logit3scalar * coef(logit3)
##   (Intercept)      KIDSDRIV        INCOME      HOME_VAL      TRAVTIME 
## -1.271908e-01  1.344090e-01 -6.468917e-07 -5.498016e-07  1.087668e-03
round(logitscalar * coef(logit),2)
##                     (Intercept)                        KIDSDRIV 
##                           -0.12                            0.10 
##                             AGE                        HOMEKIDS 
##                            0.00                            0.02 
##                             YOJ                          INCOME 
##                            0.00                            0.00 
##                      PARENT1Yes                        HOME_VAL 
##                            0.05                            0.00 
##                     MSTATUSz_No                          SEXz_F 
##                            0.08                           -0.01 
##              EDUCATIONBachelors                EDUCATIONMasters 
##                           -0.06                           -0.04 
##                    EDUCATIONPhD          EDUCATIONz_High_School 
##                           -0.02                            0.00 
##                     JOBClerical                       JOBDoctor 
##                            0.06                           -0.06 
##                   JOBHome_Maker                       JOBLawyer 
##                            0.03                            0.01 
##                      JOBManager                 JOBProfessional 
##                           -0.08                            0.02 
##                      JOBStudent                JOBz_Blue_Collar 
##                            0.03                            0.04 
##                        TRAVTIME                  CAR_USEPrivate 
##                            0.00                           -0.11 
##                        BLUEBOOK                             TIF 
##                            0.00                           -0.05 
##             CAR_TYPEPanel_Truck                  CAR_TYPEPickup 
##                            0.08                            0.08 
##              CAR_TYPESports_Car                     CAR_TYPEVan 
##                            0.15                            0.09 
##                   CAR_TYPEz_SUV                      RED_CARyes 
##                            0.11                            0.00 
##                        OLDCLAIM                        CLM_FREQ 
##                            0.00                            0.07 
##                      REVOKEDYes                         MVR_PTS 
##                            0.13                            0.04 
##                         CAR_AGE URBANICITYz_Highly_Rural/ Rural 
##                            0.00                           -0.34
round(logit3scalar * coef(logit3),2)
## (Intercept)    KIDSDRIV      INCOME    HOME_VAL    TRAVTIME 
##       -0.13        0.13        0.00        0.00        0.00

Seung’s Model

model <- lm(TARGET_AMT ~ ., data=rawData)
summary(model)
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = rawData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6211   -464    -60    243 101197 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -5.940e+02  5.009e+02  -1.186   0.2357    
## TARGET_FLAG1                     5.710e+03  1.135e+02  50.297  < 2e-16 ***
## KIDSDRIV                        -2.052e+01  1.781e+02  -0.115   0.9083    
## AGE                              6.147e+00  6.271e+00   0.980   0.3270    
## HOMEKIDS                         9.206e+01  1.256e+02   0.733   0.4637    
## YOJ                              7.591e+00  1.319e+01   0.576   0.5648    
## INCOME                          -2.254e-03  1.577e-03  -1.430   0.1528    
## PARENT1Yes                       1.218e+02  1.830e+02   0.666   0.5056    
## HOME_VAL                         3.834e-04  5.165e-04   0.742   0.4580    
## MSTATUSz_No                      1.765e+02  1.282e+02   1.377   0.1685    
## SEXz_F                          -2.901e+02  1.606e+02  -1.806   0.0709 .  
## EDUCATIONBachelors               6.878e+01  1.790e+02   0.384   0.7008    
## EDUCATIONMasters                 2.222e+02  2.620e+02   0.848   0.3965    
## EDUCATIONPhD                     4.288e+02  3.110e+02   1.379   0.1679    
## EDUCATIONz_High_School          -1.241e+02  1.502e+02  -0.827   0.4084    
## JOBClerical                     -8.206e+00  2.984e+02  -0.027   0.9781    
## JOBDoctor                       -2.791e+02  3.571e+02  -0.782   0.4345    
## JOBHome_Maker                   -7.036e+01  3.185e+02  -0.221   0.8251    
## JOBLawyer                        8.014e+01  2.582e+02   0.310   0.7563    
## JOBManager                      -1.261e+02  2.521e+02  -0.500   0.6170    
## JOBProfessional                  1.750e+02  2.698e+02   0.649   0.5165    
## JOBStudent                      -1.302e+02  3.266e+02  -0.399   0.6900    
## JOBz_Blue_Collar                 5.242e+01  2.813e+02   0.186   0.8522    
## TRAVTIME                         5.975e-01  2.824e+00   0.212   0.8324    
## CAR_USEPrivate                  -1.006e+02  1.443e+02  -0.697   0.4857    
## BLUEBOOK                         2.941e-02  7.536e-03   3.902 9.61e-05 ***
## TIF                             -1.653e+01  6.277e+01  -0.263   0.7923    
## CAR_TYPEPanel_Truck             -5.601e+01  2.431e+02  -0.230   0.8178    
## CAR_TYPEPickup                  -3.187e+01  1.493e+02  -0.213   0.8310    
## CAR_TYPESports_Car               2.089e+02  1.910e+02   1.094   0.2741    
## CAR_TYPEVan                      9.699e+01  1.864e+02   0.520   0.6029    
## CAR_TYPEz_SUV                    1.635e+02  1.571e+02   1.040   0.2982    
## RED_CARyes                      -2.768e+01  1.302e+02  -0.213   0.8316    
## OLDCLAIM                         3.731e-03  6.731e-03   0.554   0.5793    
## CLM_FREQ                        -8.778e+01  1.060e+02  -0.828   0.4078    
## REVOKEDYes                      -3.348e+02  1.534e+02  -2.182   0.0292 *  
## MVR_PTS                          1.397e+02  6.644e+01   2.103   0.0355 *  
## CAR_AGE                         -2.521e+01  1.118e+01  -2.255   0.0242 *  
## URBANICITYz_Highly_Rural/ Rural  2.772e+01  1.269e+02   0.218   0.8271    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3970 on 8122 degrees of freedom
## Multiple R-squared:  0.2912, Adjusted R-squared:  0.2879 
## F-statistic: 87.81 on 38 and 8122 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(model$residuals ~ model$fitted.values)
plot(model$fitted.values,rawData$TARGET_AMT)

par(mfrow=c(2,2))
plot(model)

#extract variables that are significant and rerun model
sigvars <- data.frame(summary(model)$coef[summary(model)$coef[,4] <= .05, 4])
sigvars <- add_rownames(sigvars, "vars")
colist<-dplyr::pull(sigvars, vars)
colist<-c("TARGET_FLAG","BLUEBOOK","REVOKED","MVR_PTS","CAR_AGE")

idx <- match(colist, names(rawData))
rawDatamod2 <- cbind(rawData[,idx], rawData['TARGET_AMT'])

Anthony’s First Model

model2<-lm(TARGET_AMT ~ ., data=rawDatamod2)

summary(model2)
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = rawDatamod2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6269   -378    -34    192 101505 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.315e+02  1.206e+02  -3.579 0.000347 ***
## TARGET_FLAG1  5.735e+03  1.036e+02  55.334  < 2e-16 ***
## BLUEBOOK      3.010e-02  5.328e-03   5.649 1.67e-08 ***
## REVOKEDYes   -2.874e+02  1.356e+02  -2.120 0.034021 *  
## MVR_PTS       1.309e+02  6.101e+01   2.145 0.031986 *  
## CAR_AGE      -1.291e+01  8.122e+00  -1.590 0.111894    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3968 on 8155 degrees of freedom
## Multiple R-squared:  0.289,  Adjusted R-squared:  0.2886 
## F-statistic: 662.9 on 5 and 8155 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2$residuals ~ model2$fitted.values)
plot(model2$fitted.values,rawData$TARGET_AMT)


par(mfrow=c(2,2))

plot(model2)

par(mfrow=c(1,2))
plot(model2$residuals ~ model2$fitted.values, main="New Reduced Var Model")
abline(h = 0)
plot(model$residuals ~ model$fitted.values, main="Orignal Model All Vars")
abline(h = 0)

# IvanTikhonov Model

#remove variables with opposite coefficients

model3<-lm(TARGET_AMT ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, data=rawData)
summary(model3)
## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, 
##     data = rawData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3610  -1652  -1239   -318 106277 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.680e+03  1.470e+02  11.426  < 2e-16 ***
## KIDSDRIV     9.172e+02  1.789e+02   5.126 3.03e-07 ***
## INCOME      -1.242e-03  1.336e-03  -0.930   0.3522    
## HOME_VAL    -2.809e-03  4.920e-04  -5.710 1.17e-08 ***
## TRAVTIME     7.234e+00  3.260e+00   2.219   0.0265 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4679 on 8156 degrees of freedom
## Multiple R-squared:  0.01096,    Adjusted R-squared:  0.01047 
## F-statistic: 22.59 on 4 and 8156 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(model3$residuals ~ model3$fitted.values)
plot(model3$fitted.values,rawData$TARGET_AMT)

par(mfrow=c(2,2))
plot(model3)

Charles Ugiagbe Select Models

rawTestData2<- rawTestData
rawTestData = read.csv(file="https://raw.githubusercontent.com/IvanGrozny88/DATA-621-HW-4/main/insurance-evaluation-data.csv")

rawTestData$TARGET_AMT <- 0
rawTestData$TARGET_FLAG <- 0


rawTestData = as.tbl(rawTestData) %>% 
  mutate_at(c("INCOME","HOME_VAL","BLUEBOOK","OLDCLAIM"),
            currencyconv) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            underscore) %>% 
  mutate_at(c("EDUCATION","JOB","CAR_TYPE","URBANICITY"),
            as.factor) %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG))

# impute data for missing values
# use column mean for calculation

rawTestData$HOMEKIDS <- log(rawTestData$HOMEKIDS+1)
rawTestData$MVR_PTS <- log(rawTestData$MVR_PTS+1)
rawTestData$OLDCLAIM <- log(rawTestData$OLDCLAIM+1)
rawTestData$TIF <- log(rawTestData$TIF+1)
rawTestData$KIDSDRIV <- log(rawTestData$KIDSDRIV+1)
rawTestData$CLM_FREQ <- log(rawTestData$CLM_FREQ+1)

# use column mean for calculation

rawTestData$AGE[is.na(rawTestData$AGE)] <- mean(rawTestData$AGE, na.rm=TRUE)
rawTestData$YOJ[is.na(rawTestData$YOJ)] <- mean(rawTestData$YOJ, na.rm=TRUE)
rawTestData$HOME_VAL[is.na(rawTestData$HOME_VAL)] <- mean(rawTestData$HOME_VAL, na.rm=TRUE)
rawTestData$CAR_AGE[is.na(rawTestData$CAR_AGE)] <- mean(rawTestData$CAR_AGE, na.rm=TRUE)

rawTestData$INCOME[is.na(rawTestData$INCOME)] <- mean(rawTestData$INCOME, na.rm=TRUE)

#get complete cases


#remove rad per correlation in prior section

rawTestData <- rawTestData[, !(colnames(rawTestData) %in% c("INDEX"))]



TARGET_FLAG <- predict(logit, newdata = rawTestData, type="response")

y_pred_num <- ifelse(TARGET_FLAG > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels=c(0, 1))
summary(y_pred)
##    0    1 
## 1719  422
rbind(round(summary(predlogit),4), round(summary(TARGET_FLAG),4)) %>% kable()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0024 0.0761 0.2011 0.2638 0.4037 0.9595
0.0031 0.0784 0.2313 0.2846 0.4360 0.9588
rawTestData$TARGET_FLAG <- as.factor(rawTestData$TARGET_FLAG)

rawTestData2 <- rawTestData[, !(colnames(rawTestData) %in% c("TARGET_FLAG"))]
TARGET_AMT<- predict(model, newdata = rawTestData, interval='confidence') #data from scaling originally to get to actual wins
summary(TARGET_AMT)
##       fit                lwr               upr        
##  Min.   :-1283.26   Min.   :-2002.8   Min.   :-563.7  
##  1st Qu.: -272.62   1st Qu.: -799.6   1st Qu.: 242.0  
##  Median :  -36.87   Median : -550.2   Median : 465.8  
##  Mean   :  -24.50   Mean   : -558.3   Mean   : 509.3  
##  3rd Qu.:  211.39   3rd Qu.: -312.8   3rd Qu.: 763.2  
##  Max.   : 1250.19   Max.   :  503.5   Max.   :1996.9
summary(model)
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = rawData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6211   -464    -60    243 101197 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -5.940e+02  5.009e+02  -1.186   0.2357    
## TARGET_FLAG1                     5.710e+03  1.135e+02  50.297  < 2e-16 ***
## KIDSDRIV                        -2.052e+01  1.781e+02  -0.115   0.9083    
## AGE                              6.147e+00  6.271e+00   0.980   0.3270    
## HOMEKIDS                         9.206e+01  1.256e+02   0.733   0.4637    
## YOJ                              7.591e+00  1.319e+01   0.576   0.5648    
## INCOME                          -2.254e-03  1.577e-03  -1.430   0.1528    
## PARENT1Yes                       1.218e+02  1.830e+02   0.666   0.5056    
## HOME_VAL                         3.834e-04  5.165e-04   0.742   0.4580    
## MSTATUSz_No                      1.765e+02  1.282e+02   1.377   0.1685    
## SEXz_F                          -2.901e+02  1.606e+02  -1.806   0.0709 .  
## EDUCATIONBachelors               6.878e+01  1.790e+02   0.384   0.7008    
## EDUCATIONMasters                 2.222e+02  2.620e+02   0.848   0.3965    
## EDUCATIONPhD                     4.288e+02  3.110e+02   1.379   0.1679    
## EDUCATIONz_High_School          -1.241e+02  1.502e+02  -0.827   0.4084    
## JOBClerical                     -8.206e+00  2.984e+02  -0.027   0.9781    
## JOBDoctor                       -2.791e+02  3.571e+02  -0.782   0.4345    
## JOBHome_Maker                   -7.036e+01  3.185e+02  -0.221   0.8251    
## JOBLawyer                        8.014e+01  2.582e+02   0.310   0.7563    
## JOBManager                      -1.261e+02  2.521e+02  -0.500   0.6170    
## JOBProfessional                  1.750e+02  2.698e+02   0.649   0.5165    
## JOBStudent                      -1.302e+02  3.266e+02  -0.399   0.6900    
## JOBz_Blue_Collar                 5.242e+01  2.813e+02   0.186   0.8522    
## TRAVTIME                         5.975e-01  2.824e+00   0.212   0.8324    
## CAR_USEPrivate                  -1.006e+02  1.443e+02  -0.697   0.4857    
## BLUEBOOK                         2.941e-02  7.536e-03   3.902 9.61e-05 ***
## TIF                             -1.653e+01  6.277e+01  -0.263   0.7923    
## CAR_TYPEPanel_Truck             -5.601e+01  2.431e+02  -0.230   0.8178    
## CAR_TYPEPickup                  -3.187e+01  1.493e+02  -0.213   0.8310    
## CAR_TYPESports_Car               2.089e+02  1.910e+02   1.094   0.2741    
## CAR_TYPEVan                      9.699e+01  1.864e+02   0.520   0.6029    
## CAR_TYPEz_SUV                    1.635e+02  1.571e+02   1.040   0.2982    
## RED_CARyes                      -2.768e+01  1.302e+02  -0.213   0.8316    
## OLDCLAIM                         3.731e-03  6.731e-03   0.554   0.5793    
## CLM_FREQ                        -8.778e+01  1.060e+02  -0.828   0.4078    
## REVOKEDYes                      -3.348e+02  1.534e+02  -2.182   0.0292 *  
## MVR_PTS                          1.397e+02  6.644e+01   2.103   0.0355 *  
## CAR_AGE                         -2.521e+01  1.118e+01  -2.255   0.0242 *  
## URBANICITYz_Highly_Rural/ Rural  2.772e+01  1.269e+02   0.218   0.8271    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3970 on 8122 degrees of freedom
## Multiple R-squared:  0.2912, Adjusted R-squared:  0.2879 
## F-statistic: 87.81 on 38 and 8122 DF,  p-value: < 2.2e-16
logit4 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, data=rawData, family = "binomial" (link="logit"))
summary(logit4)
## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME, 
##     family = binomial(link = "logit"), data = rawData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5299  -0.8217  -0.6749   1.2315   2.8090  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.876e-01  7.305e-02  -9.412  < 2e-16 ***
## KIDSDRIV     7.266e-01  8.115e-02   8.953  < 2e-16 ***
## INCOME      -3.497e-06  6.826e-07  -5.123 3.01e-07 ***
## HOME_VAL    -2.972e-06  2.499e-07 -11.895  < 2e-16 ***
## TRAVTIME     5.880e-03  1.598e-03   3.679 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 9021.1  on 8156  degrees of freedom
## AIC: 9031.1
## 
## Number of Fisher Scoring iterations: 4
exp(logit4$coefficients)
## (Intercept)    KIDSDRIV      INCOME    HOME_VAL    TRAVTIME 
##   0.5028055   2.0679778   0.9999965   0.9999970   1.0058969
predlogit4 <- predict(logit4, type="response")
rawData2$pred4 <- predict(logit4, type="response")
summary(predlogit4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01176 0.19679 0.25557 0.26382 0.32927 0.68970
table(true = rawData$TARGET_FLAG, pred = round(fitted(logit4)))
##     pred
## true    0    1
##    0 5937   71
##    1 2086   67
par(mfrow=c(2,2))
plot(logit4)

data.frame(rawData2$pred4) %>%
    ggplot(aes(x = rawData2.pred4)) + 
    geom_histogram(bins = 50, fill = 'violet') +
    labs(title = 'Histogram of Predictions') +
    theme_bw()

plot.roc(rawData$TARGET_FLAG, rawData2$pred4)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Conclusions and Suggestions

Overall, none of our models really seemed to have the desired correlation that we were looking for. The one model that Seung made predicts well on the condition that the regression model can correctly predict claim amount. But since we don’t have such a pristine model available, both models perform poorly. Perhaps we need to look into more data engineering techniques to better normalize our data. We spent an excessive amount of time doing imputations and statistical testing but it was overall ineffective to producing results.

References

Data is provided by CUNY School of Professional Studies and is thus sourced by open data initiatives.

Appendix

CleanInsuranceStrings
## function(df) {
##   SilentMutateAt(df, colnames(df)[sapply(df, is.character)], GsubPipe, "\\$|z_|<|,", "") %>% 
##     CleanYesNo(.) %>%
##       # could also use readr::parse_number here
##       SilentMutateAt(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), as.numeric) %>% 
##         ### Coerces blank strings to NA
##         SilentMutateAt(., "JOB", function(x) { x[x == ""] = "Unspecified" ; x }) %>%
##           ### Converts strings to factors
##           dplyr::mutate_at(., colnames(.)[sapply(., is.character)], as.factor)
## }
## <bytecode: 0x0000021f9c7d7b58>
InsuranceTransforms
## function(df) {
##   df %>%
##     dplyr::mutate(
##       IS_MANAGER = as.factor(as.integer(JOB %in% "Manager")),
##       IS_DOCTOR = as.factor(as.integer(JOB %in% "Doctor")),
##       NO_CLM_HIST = as.factor(as.integer(CLM_FREQ < 1))
##     ) %>%
##     dplyr::select(-JOB, -CLM_FREQ)
## }