DATA 621 – Business Analytics and Data Mining

OVERVIEW

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided).

  • INDEX: Identification Variable (do not use)
  • TARGET_FLAG: Was Car in a crash? 1=YES 0=NO None
  • TARGET_AMT: If car was in a crash, what was the cost None
  • AGE: Age of Driver Very young people tend to be risky. Maybe very old people also.
  • BLUEBOOK: Value of Vehicle Unknown effect on probability of collision, but probably effect the payout if there is a crash
  • CAR_AGE: Vehicle Age Unknown effect on probability of collision, but probably effect the payout if there is a crash
  • CAR_TYPE: Type of Car Unknown effect on probability of collision, but probably effect the payout if there is a crash
  • CAR_USE: Vehicle Use Commercial vehicles are driven more, so might increase probability of collision
  • CLM_FREQ: # Claims (Past 5 Years) The more claims you filed in the past, the more you are likely to file in the future
  • EDUCATION: Max Education Level Unknown effect, but in theory more educated people tend to drive more safely
  • HOMEKIDS: # Children at Home Unknown effect
  • HOME_VAL: Home Value In theory, home owners tend to drive more responsibly
  • INCOME: Income In theory, rich people tend to get into fewer crashes
  • JOB: Job Category In theory, white collar jobs tend to be safer
  • KIDSDRIV: # Driving Children When teenagers drive your car, you are more likely to get into crashes
  • MSTATUS: Marital Status In theory, married people drive more safely
  • MVR_PTS: Motor Vehicle Record Points If you get lots of traffic tickets, you tend to get into more crashes
  • OLDCLAIM: Total Claims (Past 5 Years) If your total payout over the past five years was high, this suggests future payouts will be high
  • PARENT1: Single Parent Unknown effect
  • RED_CAR: A Red Car Urban legend says that red cars (especially red sports cars) are more risky. Is that true?
  • REVOKED: License Revoked (Past 7 Years) If your license was revoked in the past 7 years, you probably are a more risky driver.
  • SEX: Gender Urban legend says that women have less crashes then men. Is that true?
  • TIF: Time in Force People who have been customers for a long time are usually more safe.
  • TRAVTIME: Distance to Work Long drives to work usually suggest greater risk
  • URBANICITY: Home/Work Area Unknown
  • YOJ: Years on Job People who stay at a job for a long time are usually more safe

DATA EXPLORATION

Describe the size and the variables in the insurance training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas. a. Mean / Standard Deviation / Median b. Bar Chart or Box Plot of the data c. Is the data correlated to the target variable (or to other variables?) d. Are any of the variables missing and need to be imputed “fixed”?

Load datasets

ins_train_df <- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/insurance_training_data.csv")
ins_eval_df <- read.csv("https://raw.githubusercontent.com/johnm1990/msds-621/main/insurance-evaluation-data.csv")

inspect values

summary(ins_train_df)
##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS              SEX           
##  Length:8161        Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   EDUCATION             JOB               TRAVTIME        CAR_USE         
##  Length:8161        Length:8161        Min.   :  5.00   Length:8161       
##  Class :character   Class :character   1st Qu.: 22.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.49                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##                                                                           
##    BLUEBOOK              TIF           CAR_TYPE           RED_CAR         
##  Length:8161        Min.   : 1.000   Length:8161        Length:8161       
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 4.000   Mode  :character   Mode  :character  
##                     Mean   : 5.351                                        
##                     3rd Qu.: 7.000                                        
##                     Max.   :25.000                                        
##                                                                           
##    OLDCLAIM            CLM_FREQ        REVOKED             MVR_PTS      
##  Length:8161        Min.   :0.0000   Length:8161        Min.   : 0.000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Mode  :character   Median :0.0000   Mode  :character   Median : 1.000  
##                     Mean   :0.7986                      Mean   : 1.696  
##                     3rd Qu.:2.0000                      3rd Qu.: 3.000  
##                     Max.   :5.0000                      Max.   :13.000  
##                                                                         
##     CAR_AGE        URBANICITY       
##  Min.   :-3.000   Length:8161       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 8.328                     
##  3rd Qu.:12.000                     
##  Max.   :28.000                     
##  NA's   :510
ins_train_df <- as.data.frame(lapply(ins_train_df, gsub, pattern='z_', replacement=''))
ins_eval_df <- as.data.frame(lapply(ins_eval_df, gsub, pattern='z_', replacement=''))
cols.num <- c("TARGET_FLAG",
"TARGET_AMT","AGE","YOJ","TRAVTIME","TIF","CLM_FREQ",
"MVR_PTS","CAR_AGE","KIDSDRIV","HOMEKIDS")


ins_train_df[cols.num] <- sapply(ins_train_df[cols.num],as.numeric)
ins_eval_df[cols.num] <- sapply(ins_eval_df[cols.num],as.numeric)

summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS              SEX           
##  Length:8161        Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   EDUCATION             JOB               TRAVTIME        CAR_USE         
##  Length:8161        Length:8161        Min.   :  5.00   Length:8161       
##  Class :character   Class :character   1st Qu.: 22.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.49                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##                                                                           
##    BLUEBOOK              TIF           CAR_TYPE           RED_CAR         
##  Length:8161        Min.   : 1.000   Length:8161        Length:8161       
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 4.000   Mode  :character   Mode  :character  
##                     Mean   : 5.351                                        
##                     3rd Qu.: 7.000                                        
##                     Max.   :25.000                                        
##                                                                           
##    OLDCLAIM            CLM_FREQ        REVOKED             MVR_PTS      
##  Length:8161        Min.   :0.0000   Length:8161        Min.   : 0.000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Mode  :character   Median :0.0000   Mode  :character   Median : 1.000  
##                     Mean   :0.7986                      Mean   : 1.696  
##                     3rd Qu.:2.0000                      3rd Qu.: 3.000  
##                     Max.   :5.0000                      Max.   :13.000  
##                                                                         
##     CAR_AGE        URBANICITY       
##  Min.   :-3.000   Length:8161       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 8.328                     
##  3rd Qu.:12.000                     
##  Max.   :28.000                     
##  NA's   :510
cols.fac <- c("EDUCATION", "SEX", "CAR_TYPE", "JOB", "CAR_USE", "URBANICITY")

ins_train_df[cols.fac] <- lapply(ins_train_df[cols.fac],factor)
ins_eval_df[cols.fac] <- lapply(ins_eval_df[cols.fac],factor)


summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS          SEX     
##  Length:8161        Length:8161        Length:8161        F:4375  
##  Class :character   Class :character   Class :character   M:3786  
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         EDUCATION              JOB          TRAVTIME            CAR_USE    
##  <High School:1203   Blue Collar :1825   Min.   :  5.00   Commercial:3029  
##  Bachelors   :2242   Clerical    :1271   1st Qu.: 22.00   Private   :5132  
##  High School :2330   Professional:1117   Median : 33.00                    
##  Masters     :1658   Manager     : 988   Mean   : 33.49                    
##  PhD         : 728   Lawyer      : 835   3rd Qu.: 44.00                    
##                      Student     : 712   Max.   :142.00                    
##                      (Other)     :1413                                     
##    BLUEBOOK              TIF                CAR_TYPE      RED_CAR         
##  Length:8161        Min.   : 1.000   Minivan    :2145   Length:8161       
##  Class :character   1st Qu.: 1.000   Panel Truck: 676   Class :character  
##  Mode  :character   Median : 4.000   Pickup     :1389   Mode  :character  
##                     Mean   : 5.351   Sports Car : 907                     
##                     3rd Qu.: 7.000   SUV        :2294                     
##                     Max.   :25.000   Van        : 750                     
##                                                                           
##    OLDCLAIM            CLM_FREQ        REVOKED             MVR_PTS      
##  Length:8161        Min.   :0.0000   Length:8161        Min.   : 0.000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Mode  :character   Median :0.0000   Mode  :character   Median : 1.000  
##                     Mean   :0.7986                      Mean   : 1.696  
##                     3rd Qu.:2.0000                      3rd Qu.: 3.000  
##                     Max.   :5.0000                      Max.   :13.000  
##                                                                         
##     CAR_AGE                     URBANICITY  
##  Min.   :-3.000   Highly Rural/ Rural:1669  
##  1st Qu.: 1.000   Highly Urban/ Urban:6492  
##  Median : 8.000                             
##  Mean   : 8.328                             
##  3rd Qu.:12.000                             
##  Max.   :28.000                             
##  NA's   :510
#print first six values
head(ins_train_df$BLUEBOOK)
## [1] "$14,230" "$14,940" "$4,010"  "$15,440" "$18,000" "$17,430"
typeof(ins_train_df$BLUEBOOK)
## [1] "character"
head(ins_train_df$INCOME)
## [1] "$67,349"  "$91,449"  "$16,039"  ""         "$114,986" "$125,301"
typeof(ins_train_df$INCOME)
## [1] "character"
head(ins_train_df$HOME_VAL)
## [1] "$0"       "$257,252" "$124,191" "$306,251" "$243,925" "$0"
typeof(ins_train_df$HOME_VAL)
## [1] "character"
head(ins_train_df$OLDCLAIM)
## [1] "$4,461"  "$0"      "$38,690" "$0"      "$19,217" "$0"
typeof(ins_train_df$OLDCLAIM)
## [1] "character"
head(ins_train_df$INCOME)
## [1] "$67,349"  "$91,449"  "$16,039"  ""         "$114,986" "$125,301"
#change it to as.numeric from original (as.integer) 
dollars <- function(n){
  as.numeric(paste0('0', str_remove_all(n, '[,$]')))  # add leading zero for blanks
}
dollars(ins_train_df$INCOME[1:11])
##  [1]  67349  91449  16039      0 114986 125301  18755 107961  62978 106952
## [11]  77100
summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS          SEX     
##  Length:8161        Length:8161        Length:8161        F:4375  
##  Class :character   Class :character   Class :character   M:3786  
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         EDUCATION              JOB          TRAVTIME            CAR_USE    
##  <High School:1203   Blue Collar :1825   Min.   :  5.00   Commercial:3029  
##  Bachelors   :2242   Clerical    :1271   1st Qu.: 22.00   Private   :5132  
##  High School :2330   Professional:1117   Median : 33.00                    
##  Masters     :1658   Manager     : 988   Mean   : 33.49                    
##  PhD         : 728   Lawyer      : 835   3rd Qu.: 44.00                    
##                      Student     : 712   Max.   :142.00                    
##                      (Other)     :1413                                     
##    BLUEBOOK              TIF                CAR_TYPE      RED_CAR         
##  Length:8161        Min.   : 1.000   Minivan    :2145   Length:8161       
##  Class :character   1st Qu.: 1.000   Panel Truck: 676   Class :character  
##  Mode  :character   Median : 4.000   Pickup     :1389   Mode  :character  
##                     Mean   : 5.351   Sports Car : 907                     
##                     3rd Qu.: 7.000   SUV        :2294                     
##                     Max.   :25.000   Van        : 750                     
##                                                                           
##    OLDCLAIM            CLM_FREQ        REVOKED             MVR_PTS      
##  Length:8161        Min.   :0.0000   Length:8161        Min.   : 0.000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Mode  :character   Median :0.0000   Mode  :character   Median : 1.000  
##                     Mean   :0.7986                      Mean   : 1.696  
##                     3rd Qu.:2.0000                      3rd Qu.: 3.000  
##                     Max.   :5.0000                      Max.   :13.000  
##                                                                         
##     CAR_AGE                     URBANICITY  
##  Min.   :-3.000   Highly Rural/ Rural:1669  
##  1st Qu.: 1.000   Highly Urban/ Urban:6492  
##  Median : 8.000                             
##  Mean   : 8.328                             
##  3rd Qu.:12.000                             
##  Max.   :28.000                             
##  NA's   :510
#converting to dollars of blueblook

ins_eval_df$BLUEBOOK = dollars(ins_eval_df$BLUEBOOK)
ins_train_df$BLUEBOOK = dollars(ins_train_df$BLUEBOOK)
ins_eval_df$HOME_VAL = dollars(ins_eval_df$HOME_VAL)
ins_train_df$HOME_VAL = dollars(ins_train_df$HOME_VAL)
ins_eval_df$INCOME = dollars(ins_eval_df$INCOME)
ins_train_df$INCOME = dollars(ins_train_df$INCOME)
ins_eval_df$OLDCLAIM = dollars(ins_eval_df$OLDCLAIM)
ins_train_df$OLDCLAIM = dollars(ins_train_df$OLDCLAIM)

summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ           INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 23157  
##  Median :45.00   Median :0.0000   Median :11.0   Median : 51116  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   Mean   : 58523  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 83304  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   Max.   :367030  
##  NA's   :6                        NA's   :454                    
##    PARENT1             HOME_VAL        MSTATUS          SEX     
##  Length:8161        Min.   :     0   Length:8161        F:4375  
##  Class :character   1st Qu.:     0   Class :character   M:3786  
##  Mode  :character   Median :151957   Mode  :character           
##                     Mean   :146062                              
##                     3rd Qu.:233352                              
##                     Max.   :885282                              
##                                                                 
##         EDUCATION              JOB          TRAVTIME            CAR_USE    
##  <High School:1203   Blue Collar :1825   Min.   :  5.00   Commercial:3029  
##  Bachelors   :2242   Clerical    :1271   1st Qu.: 22.00   Private   :5132  
##  High School :2330   Professional:1117   Median : 33.00                    
##  Masters     :1658   Manager     : 988   Mean   : 33.49                    
##  PhD         : 728   Lawyer      : 835   3rd Qu.: 44.00                    
##                      Student     : 712   Max.   :142.00                    
##                      (Other)     :1413                                     
##     BLUEBOOK          TIF                CAR_TYPE      RED_CAR         
##  Min.   : 1500   Min.   : 1.000   Minivan    :2145   Length:8161       
##  1st Qu.: 9280   1st Qu.: 1.000   Panel Truck: 676   Class :character  
##  Median :14440   Median : 4.000   Pickup     :1389   Mode  :character  
##  Mean   :15710   Mean   : 5.351   Sports Car : 907                     
##  3rd Qu.:20850   3rd Qu.: 7.000   SUV        :2294                     
##  Max.   :69740   Max.   :25.000   Van        : 750                     
##                                                                        
##     OLDCLAIM        CLM_FREQ        REVOKED             MVR_PTS      
##  Min.   :    0   Min.   :0.0000   Length:8161        Min.   : 0.000  
##  1st Qu.:    0   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Median :    0   Median :0.0000   Mode  :character   Median : 1.000  
##  Mean   : 4037   Mean   :0.7986                      Mean   : 1.696  
##  3rd Qu.: 4636   3rd Qu.:2.0000                      3rd Qu.: 3.000  
##  Max.   :57037   Max.   :5.0000                      Max.   :13.000  
##                                                                      
##     CAR_AGE                     URBANICITY  
##  Min.   :-3.000   Highly Rural/ Rural:1669  
##  1st Qu.: 1.000   Highly Urban/ Urban:6492  
##  Median : 8.000                             
##  Mean   : 8.328                             
##  3rd Qu.:12.000                             
##  Max.   :28.000                             
##  NA's   :510

Inspect more features

head(ins_train_df$PARENT1)
## [1] "No"  "No"  "No"  "No"  "No"  "Yes"
head(ins_train_df$MSTATUS)
## [1] "No"  "No"  "Yes" "Yes" "Yes" "No"
head(ins_train_df$SEX)
## [1] M M F M F F
## Levels: F M
head(ins_train_df$EDUCATION)
## [1] PhD          High School  High School  <High School PhD         
## [6] Bachelors   
## Levels: <High School Bachelors High School Masters PhD
head(ins_train_df$JOB)
## [1] Professional Blue Collar  Clerical     Blue Collar  Doctor      
## [6] Blue Collar 
## 9 Levels:  Blue Collar Clerical Doctor Home Maker Lawyer ... Student
head(ins_train_df$CAR_USE)
## [1] Private    Commercial Private    Private    Private    Commercial
## Levels: Commercial Private
head(ins_train_df$CAR_TYPE)
## [1] Minivan    Minivan    SUV        Minivan    SUV        Sports Car
## Levels: Minivan Panel Truck Pickup Sports Car SUV Van
head(ins_train_df$REVOKED)
## [1] "No"  "No"  "No"  "No"  "Yes" "No"
head(ins_train_df$URBANICITY)
## [1] Highly Urban/ Urban Highly Urban/ Urban Highly Urban/ Urban
## [4] Highly Urban/ Urban Highly Urban/ Urban Highly Urban/ Urban
## Levels: Highly Rural/ Rural Highly Urban/ Urban
head(ins_train_df$RED_CAR)
## [1] "yes" "yes" "no"  "yes" "no"  "no"
table(ins_train_df$URBANICITY)
## 
## Highly Rural/ Rural Highly Urban/ Urban 
##                1669                6492

Turn boolean strings into booleans

#changes to logical type

table(ins_train_df$REVOKED)
## 
##   No  Yes 
## 7161 1000
ins_train_df$REVOKED = tolower(ins_train_df$REVOKED)=='yes'
ins_eval_df$REVOKED = tolower(ins_eval_df$REVOKED)=='yes'
ins_train_df$PARENT1 = tolower(ins_train_df$PARENT1)=='yes'
ins_eval_df$PARENT1 = tolower(ins_eval_df$PARENT1)=='yes'
ins_train_df$RED_CAR = tolower(ins_train_df$RED_CAR)=='yes'
ins_eval_df$RED_CAR = tolower(ins_eval_df$RED_CAR)=='yes'
ins_train_df$MSTATUS = tolower(ins_train_df$MSTATUS)=='yes'
ins_eval_df$MSTATUS = tolower(ins_eval_df$MSTATUS)=='yes'

table(ins_train_df$REVOKED)
## 
## FALSE  TRUE 
##  7161  1000
summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ           INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 23157  
##  Median :45.00   Median :0.0000   Median :11.0   Median : 51116  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   Mean   : 58523  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 83304  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   Max.   :367030  
##  NA's   :6                        NA's   :454                    
##   PARENT1           HOME_VAL       MSTATUS        SEX             EDUCATION   
##  Mode :logical   Min.   :     0   Mode :logical   F:4375   <High School:1203  
##  FALSE:7084      1st Qu.:     0   FALSE:3267      M:3786   Bachelors   :2242  
##  TRUE :1077      Median :151957   TRUE :4894               High School :2330  
##                  Mean   :146062                            Masters     :1658  
##                  3rd Qu.:233352                            PhD         : 728  
##                  Max.   :885282                                               
##                                                                               
##            JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  Blue Collar :1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical    :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional:1117   Median : 33.00                     Median :14440  
##  Manager     : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer      : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  Student     : 712   Max.   :142.00                     Max.   :69740  
##  (Other)     :1413                                                     
##       TIF                CAR_TYPE     RED_CAR           OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   Mode :logical   Min.   :    0  
##  1st Qu.: 1.000   Panel Truck: 676   FALSE:5783      1st Qu.:    0  
##  Median : 4.000   Pickup     :1389   TRUE :2378      Median :    0  
##  Mean   : 5.351   Sports Car : 907                   Mean   : 4037  
##  3rd Qu.: 7.000   SUV        :2294                   3rd Qu.: 4636  
##  Max.   :25.000   Van        : 750                   Max.   :57037  
##                                                                     
##     CLM_FREQ       REVOKED           MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Mode :logical   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   FALSE:7161      1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000   TRUE :1000      Median : 1.000   Median : 8.000  
##  Mean   :0.7986                   Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000                   3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000                   Max.   :13.000   Max.   :28.000  
##                                                    NA's   :510     
##                URBANICITY  
##  Highly Rural/ Rural:1669  
##  Highly Urban/ Urban:6492  
##                            
##                            
##                            
##                            
## 
ins_train_df$INCOME <- as.numeric(ins_train_df$INCOME)
typeof(ins_train_df$INCOME)
## [1] "double"
head(ins_train_df$INCOME)
## [1]  67349  91449  16039      0 114986 125301
#summaries for mean

ins_train_numeric <- ins_train_df %>% 
  dplyr::select(where(is.numeric))
#we will see what type of variable they are with str
str(ins_train_df)
## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : chr  "1" "2" "4" "5" ...
##  $ TARGET_FLAG: num  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : num  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : num  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : num  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : num  67349 91449 16039 0 114986 ...
##  $ PARENT1    : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ HOME_VAL   : num  0 257252 124191 306251 243925 ...
##  $ MSTATUS    : logi  FALSE FALSE TRUE TRUE TRUE FALSE ...
##  $ SEX        : Factor w/ 2 levels "F","M": 2 2 1 2 1 1 1 2 1 2 ...
##  $ EDUCATION  : Factor w/ 5 levels "<High School",..: 5 3 3 1 5 2 1 2 2 2 ...
##  $ JOB        : Factor w/ 9 levels "","Blue Collar",..: 8 2 3 2 4 2 2 2 3 8 ...
##  $ TRAVTIME   : num  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
##  $ BLUEBOOK   : num  14230 14940 4010 15440 18000 ...
##  $ TIF        : num  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 5 1 5 4 5 6 5 6 ...
##  $ RED_CAR    : logi  TRUE TRUE FALSE TRUE FALSE FALSE ...
##  $ OLDCLAIM   : num  4461 0 38690 0 19217 ...
##  $ CLM_FREQ   : num  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ MVR_PTS    : num  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : num  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : Factor w/ 2 levels "Highly Rural/ Rural",..: 2 2 2 2 2 2 2 2 2 1 ...
Summaries_KBL <- kable(format(sapply(ins_train_numeric, function(ins_train_numeric) c( "Stand dev" = round(sd(ins_train_numeric),2), 
                         "Mean"= mean(ins_train_numeric,na.rm=TRUE),
                         "n" = length(ins_train_numeric),
                         "Median" = median(ins_train_numeric,na.rm = TRUE),
                         "CoeffofVariation" = sd(ins_train_numeric)/mean(ins_train_numeric,na.rm=TRUE),
                         "Minimum" = min(ins_train_numeric),
                         "Maximum" = max(ins_train_numeric),
                         "Upper Quantile" = quantile(ins_train_numeric,1,na.rm = TRUE),
                         "LowerQuartile" = quantile(ins_train_numeric,0,na.rm = TRUE)
                    )
), scientific = FALSE)
)
Summaries_KBL <- sub("0+$", "", as.character(Summaries_KBL))
Summaries_KBL
##  [1] "|                    |TARGET_FLAG  |TARGET_AMT     |KIDSDRIV     |AGE          |HOMEKIDS     |YOJ          |INCOME         |HOME_VAL       |TRAVTIME     |BLUEBOOK      |TIF          |OLDCLAIM      |CLM_FREQ     |MVR_PTS      |CAR_AGE      |"
##  [2] "|:-------------------|:------------|:--------------|:------------|:------------|:------------|:------------|:--------------|:--------------|:------------|:-------------|:------------|:-------------|:------------|:------------|:------------|"
##  [3] "|Stand dev           |0.4400000    |4704.0300000   |0.5100000    |NA           |1.1200000    |NA           |48345.5100000  |130426.7200000 |15.9100000   |8419.7300000  |4.1500000    |8777.1400000  |1.1600000    |2.1500000    |NA           |"
##  [4] "|Mean                |0.2638157    |1504.3246481   |0.1710575    |44.7903127   |0.7212351    |10.4992864   |58522.9381203  |146062.1895601 |33.4857248   |15709.8995221 |5.3513050    |4037.0762161  |0.7985541    |1.6955030    |8.3283231    |"
##  [5] "|n                   |8161.0000000 |8161.0000000   |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000 |8161.0000000   |8161.0000000   |8161.0000000 |8161.0000000  |8161.0000000 |8161.0000000  |8161.0000000 |8161.0000000 |8161.0000000 |"
##  [6] "|Median              |0.0000000    |0.0000000      |0.0000000    |45.0000000   |0.0000000    |11.0000000   |51116.0000000  |151957.0000000 |33.0000000   |14440.0000000 |4.0000000    |0.0000000     |0.0000000    |1.0000000    |8.0000000    |"
##  [7] "|CoeffofVariation    |1.6705888    |3.1270025      |2.9904224    |NA           |1.5477938    |NA           |0.8260951      |0.8929533      |0.4750781    |0.5359509     |0.7748830    |2.1741326     |1.4506878    |1.2663568    |NA           |"
##  [8] "|Minimum             |0.0000000    |0.0000000      |0.0000000    |NA           |0.0000000    |NA           |0.0000000      |0.0000000      |5.0000000    |1500.0000000  |1.0000000    |0.0000000     |0.0000000    |0.0000000    |NA           |"
##  [9] "|Maximum             |1.0000000    |107586.1361600 |4.0000000    |NA           |5.0000000    |NA           |367030.0000000 |885282.0000000 |142.0000000  |69740.0000000 |25.0000000   |57037.0000000 |5.0000000    |13.0000000   |NA           |"
## [10] "|Upper Quantile.100% |1.0000000    |107586.1361600 |4.0000000    |81.0000000   |5.0000000    |23.0000000   |367030.0000000 |885282.0000000 |142.0000000  |69740.0000000 |25.0000000   |57037.0000000 |5.0000000    |13.0000000   |28.0000000   |"
## [11] "|LowerQuartile.0%    |0.0000000    |0.0000000      |0.0000000    |16.0000000   |0.0000000    |0.0000000    |0.0000000      |0.0000000      |5.0000000    |1500.0000000  |1.0000000    |0.0000000     |0.0000000    |0.0000000    |-3.0000000   |"
###summary statistics
#ignore all missing values, provides our SD for all numeric
summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ           INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 23157  
##  Median :45.00   Median :0.0000   Median :11.0   Median : 51116  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   Mean   : 58523  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 83304  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   Max.   :367030  
##  NA's   :6                        NA's   :454                    
##   PARENT1           HOME_VAL       MSTATUS        SEX             EDUCATION   
##  Mode :logical   Min.   :     0   Mode :logical   F:4375   <High School:1203  
##  FALSE:7084      1st Qu.:     0   FALSE:3267      M:3786   Bachelors   :2242  
##  TRUE :1077      Median :151957   TRUE :4894               High School :2330  
##                  Mean   :146062                            Masters     :1658  
##                  3rd Qu.:233352                            PhD         : 728  
##                  Max.   :885282                                               
##                                                                               
##            JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  Blue Collar :1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical    :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional:1117   Median : 33.00                     Median :14440  
##  Manager     : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer      : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  Student     : 712   Max.   :142.00                     Max.   :69740  
##  (Other)     :1413                                                     
##       TIF                CAR_TYPE     RED_CAR           OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   Mode :logical   Min.   :    0  
##  1st Qu.: 1.000   Panel Truck: 676   FALSE:5783      1st Qu.:    0  
##  Median : 4.000   Pickup     :1389   TRUE :2378      Median :    0  
##  Mean   : 5.351   Sports Car : 907                   Mean   : 4037  
##  3rd Qu.: 7.000   SUV        :2294                   3rd Qu.: 4636  
##  Max.   :25.000   Van        : 750                   Max.   :57037  
##                                                                     
##     CLM_FREQ       REVOKED           MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Mode :logical   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   FALSE:7161      1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000   TRUE :1000      Median : 1.000   Median : 8.000  
##  Mean   :0.7986                   Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000                   3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000                   Max.   :13.000   Max.   :28.000  
##                                                    NA's   :510     
##                URBANICITY  
##  Highly Rural/ Rural:1669  
##  Highly Urban/ Urban:6492  
##                            
##                            
##                            
##                            
## 
sapply(ins_train_numeric, sd, na.rm=TRUE)
##  TARGET_FLAG   TARGET_AMT     KIDSDRIV          AGE     HOMEKIDS          YOJ 
## 4.407276e-01 4.704027e+03 5.115341e-01 8.627589e+00 1.116323e+00 4.092474e+00 
##       INCOME     HOME_VAL     TRAVTIME     BLUEBOOK          TIF     OLDCLAIM 
## 4.834551e+04 1.304267e+05 1.590833e+01 8.419734e+03 4.146635e+00 8.777139e+03 
##     CLM_FREQ      MVR_PTS      CAR_AGE 
## 1.158453e+00 2.147112e+00 5.700742e+00
ins_train_numeric %>% gather() %>% head()
##           key value
## 1 TARGET_FLAG     0
## 2 TARGET_FLAG     0
## 3 TARGET_FLAG     0
## 4 TARGET_FLAG     0
## 5 TARGET_FLAG     0
## 6 TARGET_FLAG     1
ggplot(gather(ins_train_numeric), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

hist(ins_train_df$TARGET_AMT)

summary(ins_train_df$TARGET_AMT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1504    1036  107586
var(ins_train_df$TARGET_AMT)
## [1] 22127869
var(log(ins_train_df$TARGET_AMT+1))
## [1] 13.47907
#benefit of log transform, we added because there a lot of 0's in Target_AMT, we added 1
#log of 0 undefined

hist(ins_train_df$INCOME)

summary(ins_train_df$INCOME)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   23157   51116   58523   83304  367030
var(ins_train_df$INCOME)
## [1] 2337288545
var(log(ins_train_df$INCOME+1))
## [1] 13.98386
#age, yoj,car_age many misssing values
#we are focusing on the ones with missing values first

hist(ins_train_df$OLDCLAIM)

summary(ins_train_df$OLDCLAIM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    4037    4636   57037
var(ins_train_df$OLDCLAIM)
## [1] 77038171
var(log(ins_train_df$OLDCLAIM+1))
## [1] 18.59325
#table(ins_train_df$OLDCLAIM)

summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ           INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 23157  
##  Median :45.00   Median :0.0000   Median :11.0   Median : 51116  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   Mean   : 58523  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 83304  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   Max.   :367030  
##  NA's   :6                        NA's   :454                    
##   PARENT1           HOME_VAL       MSTATUS        SEX             EDUCATION   
##  Mode :logical   Min.   :     0   Mode :logical   F:4375   <High School:1203  
##  FALSE:7084      1st Qu.:     0   FALSE:3267      M:3786   Bachelors   :2242  
##  TRUE :1077      Median :151957   TRUE :4894               High School :2330  
##                  Mean   :146062                            Masters     :1658  
##                  3rd Qu.:233352                            PhD         : 728  
##                  Max.   :885282                                               
##                                                                               
##            JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  Blue Collar :1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical    :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional:1117   Median : 33.00                     Median :14440  
##  Manager     : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer      : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  Student     : 712   Max.   :142.00                     Max.   :69740  
##  (Other)     :1413                                                     
##       TIF                CAR_TYPE     RED_CAR           OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   Mode :logical   Min.   :    0  
##  1st Qu.: 1.000   Panel Truck: 676   FALSE:5783      1st Qu.:    0  
##  Median : 4.000   Pickup     :1389   TRUE :2378      Median :    0  
##  Mean   : 5.351   Sports Car : 907                   Mean   : 4037  
##  3rd Qu.: 7.000   SUV        :2294                   3rd Qu.: 4636  
##  Max.   :25.000   Van        : 750                   Max.   :57037  
##                                                                     
##     CLM_FREQ       REVOKED           MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Mode :logical   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   FALSE:7161      1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000   TRUE :1000      Median : 1.000   Median : 8.000  
##  Mean   :0.7986                   Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000                   3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000                   Max.   :13.000   Max.   :28.000  
##                                                    NA's   :510     
##                URBANICITY  
##  Highly Rural/ Rural:1669  
##  Highly Urban/ Urban:6492  
##                            
##                            
##                            
##                            
## 
# ggplot(gather(ins_train_numeric, cols, value), aes(x = value)) + 
#        geom_histogram(binwidth = 20) + facet_grid(.~cols)


##correlation matrix
ins_train_numeric.rcorr = rcorr(as.matrix(ins_train_numeric))
ins_train_numeric.rcorr
##             TARGET_FLAG TARGET_AMT KIDSDRIV   AGE HOMEKIDS   YOJ INCOME
## TARGET_FLAG        1.00       0.53     0.10 -0.10     0.12 -0.07  -0.13
## TARGET_AMT         0.53       1.00     0.06 -0.04     0.06 -0.02  -0.05
## KIDSDRIV           0.10       0.06     1.00 -0.08     0.46  0.04  -0.04
## AGE               -0.10      -0.04    -0.08  1.00    -0.45  0.14   0.17
## HOMEKIDS           0.12       0.06     0.46 -0.45     1.00  0.09  -0.15
## YOJ               -0.07      -0.02     0.04  0.14     0.09  1.00   0.27
## INCOME            -0.13      -0.05    -0.04  0.17    -0.15  0.27   1.00
## HOME_VAL          -0.17      -0.08    -0.02  0.19    -0.10  0.25   0.50
## TRAVTIME           0.05       0.03     0.01  0.01    -0.01 -0.02  -0.04
## BLUEBOOK          -0.10       0.00    -0.02  0.17    -0.11  0.14   0.41
## TIF               -0.08      -0.05     0.00  0.00     0.01  0.02   0.00
## OLDCLAIM           0.14       0.07     0.02 -0.03     0.03  0.00  -0.04
## CLM_FREQ           0.22       0.12     0.04 -0.02     0.03 -0.03  -0.04
## MVR_PTS            0.22       0.14     0.05 -0.07     0.06 -0.04  -0.05
## CAR_AGE           -0.10      -0.06    -0.05  0.18    -0.15  0.06   0.39
##             HOME_VAL TRAVTIME BLUEBOOK   TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_FLAG    -0.17     0.05    -0.10 -0.08     0.14     0.22    0.22   -0.10
## TARGET_AMT     -0.08     0.03     0.00 -0.05     0.07     0.12    0.14   -0.06
## KIDSDRIV       -0.02     0.01    -0.02  0.00     0.02     0.04    0.05   -0.05
## AGE             0.19     0.01     0.17  0.00    -0.03    -0.02   -0.07    0.18
## HOMEKIDS       -0.10    -0.01    -0.11  0.01     0.03     0.03    0.06   -0.15
## YOJ             0.25    -0.02     0.14  0.02     0.00    -0.03   -0.04    0.06
## INCOME          0.50    -0.04     0.41  0.00    -0.04    -0.04   -0.05    0.39
## HOME_VAL        1.00    -0.03     0.24  0.00    -0.07    -0.09   -0.08    0.20
## TRAVTIME       -0.03     1.00    -0.02 -0.01    -0.02     0.01    0.01   -0.04
## BLUEBOOK        0.24    -0.02     1.00 -0.01    -0.03    -0.04   -0.04    0.19
## TIF             0.00    -0.01    -0.01  1.00    -0.02    -0.02   -0.04    0.01
## OLDCLAIM       -0.07    -0.02    -0.03 -0.02     1.00     0.50    0.26   -0.01
## CLM_FREQ       -0.09     0.01    -0.04 -0.02     0.50     1.00    0.40   -0.01
## MVR_PTS        -0.08     0.01    -0.04 -0.04     0.26     0.40    1.00   -0.02
## CAR_AGE         0.20    -0.04     0.19  0.01    -0.01    -0.01   -0.02    1.00
## 
## n
##             TARGET_FLAG TARGET_AMT KIDSDRIV  AGE HOMEKIDS  YOJ INCOME HOME_VAL
## TARGET_FLAG        8161       8161     8161 8155     8161 7707   8161     8161
## TARGET_AMT         8161       8161     8161 8155     8161 7707   8161     8161
## KIDSDRIV           8161       8161     8161 8155     8161 7707   8161     8161
## AGE                8155       8155     8155 8155     8155 7701   8155     8155
## HOMEKIDS           8161       8161     8161 8155     8161 7707   8161     8161
## YOJ                7707       7707     7707 7701     7707 7707   7707     7707
## INCOME             8161       8161     8161 8155     8161 7707   8161     8161
## HOME_VAL           8161       8161     8161 8155     8161 7707   8161     8161
## TRAVTIME           8161       8161     8161 8155     8161 7707   8161     8161
## BLUEBOOK           8161       8161     8161 8155     8161 7707   8161     8161
## TIF                8161       8161     8161 8155     8161 7707   8161     8161
## OLDCLAIM           8161       8161     8161 8155     8161 7707   8161     8161
## CLM_FREQ           8161       8161     8161 8155     8161 7707   8161     8161
## MVR_PTS            8161       8161     8161 8155     8161 7707   8161     8161
## CAR_AGE            7651       7651     7651 7645     7651 7219   7651     7651
##             TRAVTIME BLUEBOOK  TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_FLAG     8161     8161 8161     8161     8161    8161    7651
## TARGET_AMT      8161     8161 8161     8161     8161    8161    7651
## KIDSDRIV        8161     8161 8161     8161     8161    8161    7651
## AGE             8155     8155 8155     8155     8155    8155    7645
## HOMEKIDS        8161     8161 8161     8161     8161    8161    7651
## YOJ             7707     7707 7707     7707     7707    7707    7219
## INCOME          8161     8161 8161     8161     8161    8161    7651
## HOME_VAL        8161     8161 8161     8161     8161    8161    7651
## TRAVTIME        8161     8161 8161     8161     8161    8161    7651
## BLUEBOOK        8161     8161 8161     8161     8161    8161    7651
## TIF             8161     8161 8161     8161     8161    8161    7651
## OLDCLAIM        8161     8161 8161     8161     8161    8161    7651
## CLM_FREQ        8161     8161 8161     8161     8161    8161    7651
## MVR_PTS         8161     8161 8161     8161     8161    8161    7651
## CAR_AGE         7651     7651 7651     7651     7651    7651    7651
## 
## P
##             TARGET_FLAG TARGET_AMT KIDSDRIV AGE    HOMEKIDS YOJ    INCOME
## TARGET_FLAG             0.0000     0.0000   0.0000 0.0000   0.0000 0.0000
## TARGET_AMT  0.0000                 0.0000   0.0002 0.0000   0.0525 0.0000
## KIDSDRIV    0.0000      0.0000              0.0000 0.0000   0.0001 0.0000
## AGE         0.0000      0.0002     0.0000          0.0000   0.0000 0.0000
## HOMEKIDS    0.0000      0.0000     0.0000   0.0000          0.0000 0.0000
## YOJ         0.0000      0.0525     0.0001   0.0000 0.0000          0.0000
## INCOME      0.0000      0.0000     0.0000   0.0000 0.0000   0.0000       
## HOME_VAL    0.0000      0.0000     0.1648   0.0000 0.0000   0.0000 0.0000
## TRAVTIME    0.0000      0.0115     0.4455   0.6342 0.5128   0.1369 0.0000
## BLUEBOOK    0.0000      0.6712     0.0516   0.0000 0.0000   0.0000 0.0000
## TIF         0.0000      0.0000     0.8574   0.9952 0.2859   0.0296 0.9396
## OLDCLAIM    0.0000      0.0000     0.0653   0.0082 0.0069   0.7936 0.0001
## CLM_FREQ    0.0000      0.0000     0.0008   0.0296 0.0080   0.0209 0.0001
## MVR_PTS     0.0000      0.0000     0.0000   0.0000 0.0000   0.0009 0.0000
## CAR_AGE     0.0000      0.0000     0.0000   0.0000 0.0000   0.0000 0.0000
##             HOME_VAL TRAVTIME BLUEBOOK TIF    OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE
## TARGET_FLAG 0.0000   0.0000   0.0000   0.0000 0.0000   0.0000   0.0000  0.0000 
## TARGET_AMT  0.0000   0.0115   0.6712   0.0000 0.0000   0.0000   0.0000  0.0000 
## KIDSDRIV    0.1648   0.4455   0.0516   0.8574 0.0653   0.0008   0.0000  0.0000 
## AGE         0.0000   0.6342   0.0000   0.9952 0.0082   0.0296   0.0000  0.0000 
## HOMEKIDS    0.0000   0.5128   0.0000   0.2859 0.0069   0.0080   0.0000  0.0000 
## YOJ         0.0000   0.1369   0.0000   0.0296 0.7936   0.0209   0.0009  0.0000 
## INCOME      0.0000   0.0000   0.0000   0.9396 0.0001   0.0001   0.0000  0.0000 
## HOME_VAL             0.0055   0.0000   0.7376 0.0000   0.0000   0.0000  0.0000 
## TRAVTIME    0.0055            0.1246   0.2945 0.0818   0.5535   0.3384  0.0008 
## BLUEBOOK    0.0000   0.1246            0.6242 0.0077   0.0010   0.0004  0.0000 
## TIF         0.7376   0.2945   0.6242          0.0473   0.0375   0.0002  0.4969 
## OLDCLAIM    0.0000   0.0818   0.0077   0.0473          0.0000   0.0000  0.2417 
## CLM_FREQ    0.0000   0.5535   0.0010   0.0375 0.0000            0.0000  0.4151 
## MVR_PTS     0.0000   0.3384   0.0004   0.0002 0.0000   0.0000           0.0817 
## CAR_AGE     0.0000   0.0008   0.0000   0.4969 0.2417   0.4151   0.0817
#a lot of NA's for the three variables, Correlation take all values in the column and associate them

ins_train_numeric.cor = cor(ins_train_numeric, use = "complete.obs")
corrplot(ins_train_numeric.cor)

#HOMEKIDS looks like good group for imputing ages
#incoming needs to be clustered for imputing

summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ           INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 23157  
##  Median :45.00   Median :0.0000   Median :11.0   Median : 51116  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   Mean   : 58523  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 83304  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   Max.   :367030  
##  NA's   :6                        NA's   :454                    
##   PARENT1           HOME_VAL       MSTATUS        SEX             EDUCATION   
##  Mode :logical   Min.   :     0   Mode :logical   F:4375   <High School:1203  
##  FALSE:7084      1st Qu.:     0   FALSE:3267      M:3786   Bachelors   :2242  
##  TRUE :1077      Median :151957   TRUE :4894               High School :2330  
##                  Mean   :146062                            Masters     :1658  
##                  3rd Qu.:233352                            PhD         : 728  
##                  Max.   :885282                                               
##                                                                               
##            JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  Blue Collar :1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical    :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional:1117   Median : 33.00                     Median :14440  
##  Manager     : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer      : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  Student     : 712   Max.   :142.00                     Max.   :69740  
##  (Other)     :1413                                                     
##       TIF                CAR_TYPE     RED_CAR           OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   Mode :logical   Min.   :    0  
##  1st Qu.: 1.000   Panel Truck: 676   FALSE:5783      1st Qu.:    0  
##  Median : 4.000   Pickup     :1389   TRUE :2378      Median :    0  
##  Mean   : 5.351   Sports Car : 907                   Mean   : 4037  
##  3rd Qu.: 7.000   SUV        :2294                   3rd Qu.: 4636  
##  Max.   :25.000   Van        : 750                   Max.   :57037  
##                                                                     
##     CLM_FREQ       REVOKED           MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Mode :logical   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   FALSE:7161      1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000   TRUE :1000      Median : 1.000   Median : 8.000  
##  Mean   :0.7986                   Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000                   3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000                   Max.   :13.000   Max.   :28.000  
##                                                    NA's   :510     
##                URBANICITY  
##  Highly Rural/ Rural:1669  
##  Highly Urban/ Urban:6492  
##                            
##                            
##                            
##                            
## 
#correlations between variables we highly suspect

cor_inctarget <- cor.test(ins_train_df$INCOME, ins_train_df$TARGET_AMT)
cor_inctarget
## 
##  Pearson's product-moment correlation
## 
## data:  ins_train_df$INCOME and ins_train_df$TARGET_AMT
## t = -4.7556, df = 8159, p-value = 2.013e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07418735 -0.03091440
## sample estimates:
##         cor 
## -0.05257556
cor_hovaltarget <- cor.test(ins_train_df$HOME_VAL, ins_train_df$TARGET_AMT)
cor_hovaltarget
## 
##  Pearson's product-moment correlation
## 
## data:  ins_train_df$HOME_VAL and ins_train_df$TARGET_AMT
## t = -7.435, df = 8159, p-value = 1.152e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.10354605 -0.06044509
## sample estimates:
##         cor 
## -0.08203393
cor_cfreqtarget<- cor.test(ins_train_df$CLM_FREQ, ins_train_df$TARGET_AMT)
cor_cfreqtarget
## 
##  Pearson's product-moment correlation
## 
## data:  ins_train_df$CLM_FREQ and ins_train_df$TARGET_AMT
## t = 10.588, df = 8159, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0949626 0.1377676
## sample estimates:
##       cor 
## 0.1164192
cor_mptstarget<- cor.test(ins_train_df$MVR_PTS, ins_train_df$TARGET_AMT)
cor_mptstarget
## 
##  Pearson's product-moment correlation
## 
## data:  ins_train_df$MVR_PTS and ins_train_df$TARGET_AMT
## t = 12.573, df = 8159, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1165176 0.1590861
## sample estimates:
##       cor 
## 0.1378655
cor_travtarget<- cor.test(ins_train_df$TRAVTIME, ins_train_df$TARGET_AMT)
cor_travtarget
## 
##  Pearson's product-moment correlation
## 
## data:  ins_train_df$TRAVTIME and ins_train_df$TARGET_AMT
## t = 2.529, df = 8159, p-value = 0.01146
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.006294418 0.049653286
## sample estimates:
##        cor 
## 0.02798702
#the weaker the correlation the higher the p value

#ttests
#most P values significantly low
t_inctarget <- t.test(ins_train_df$INCOME~ ins_train_df$TARGET_FLAG)
t_inctarget
## 
##  Welch Two Sample t-test
## 
## data:  ins_train_df$INCOME by ins_train_df$TARGET_FLAG
## t = 12.604, df = 4321.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12008.56 16432.63
## sample estimates:
## mean in group 0 mean in group 1 
##        62274.55        48053.96
t_hovaltarget <- t.test(ins_train_df$HOME_VAL~ ins_train_df$TARGET_FLAG)
t_hovaltarget
## 
##  Welch Two Sample t-test
## 
## data:  ins_train_df$HOME_VAL by ins_train_df$TARGET_FLAG
## t = 16.56, df = 4222.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  44648.07 56639.42
## sample estimates:
## mean in group 0 mean in group 1 
##        159422.8        108779.1
t_cfreqtarget<- t.test(ins_train_df$CLM_FREQ~ ins_train_df$TARGET_FLAG)
t_cfreqtarget
## 
##  Welch Two Sample t-test
## 
## data:  ins_train_df$CLM_FREQ by ins_train_df$TARGET_FLAG
## t = -18.734, df = 3388.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6277464 -0.5087966
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6486352       1.2169066
t_mptstarget<- t.test(ins_train_df$MVR_PTS~ ins_train_df$TARGET_FLAG)
t_mptstarget
## 
##  Welch Two Sample t-test
## 
## data:  ins_train_df$MVR_PTS by ins_train_df$TARGET_FLAG
## t = -17.591, df = 3021.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1869030 -0.9488407
## sample estimates:
## mean in group 0 mean in group 1 
##        1.413782        2.481654
t_travtarget<- t.test(ins_train_df$TRAVTIME~ ins_train_df$TARGET_FLAG)
t_travtarget
## 
##  Welch Two Sample t-test
## 
## data:  ins_train_df$TRAVTIME by ins_train_df$TARGET_FLAG
## t = -4.4996, df = 4007.7, p-value = 7e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.5065923 -0.9851758
## sample estimates:
## mean in group 0 mean in group 1 
##        33.02513        34.77102
#percentages is better because same scale for both graphs
library('lattice')
histogram(~ INCOME | TARGET_FLAG, data = ins_train_df)

histogram(~ HOME_VAL | TARGET_FLAG, data = ins_train_df)

histogram(~ CLM_FREQ | TARGET_FLAG, data = ins_train_df)

histogram(~ MVR_PTS | TARGET_FLAG, data = ins_train_df)

#checking the distribution, flag 0 means no car crashes, target 1 means car crashes, we are checking distribution of income between both

DATA PREPARATION

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

  1. Fix missing values (maybe with a Mean or Median value)
  2. Create flags to suggest if a variable was missing
  3. Transform data by putting it into buckets
  4. Mathematical transforms such as log or square root (or use Box-Cox)
  5. Combine variables (such as ratios or adding or multiplying) to create new variables

There was missing values

We are putting the data into clusters, based on these buckets we are imputing the missing values.

#flagging missing values
ins_train_df$AGE_miss <- ifelse(is.na(ins_train_df$AGE),1,0)
ins_train_df$YOJ_miss <- ifelse(is.na(ins_train_df$YOJ),1,0)
ins_train_df$CAR_AGE_miss <- ifelse(is.na(ins_train_df$CAR_AGE),1,0)


#create five clusters of income
kmeans.re <- kmeans(ins_train_df$INCOME, centers = 5)
table(kmeans.re$cluster)
## 
##    1    2    3    4    5 
## 2069 2527  335 2160 1070
ins_train_df$inc_clusters <- kmeans.re$cluster
#table(ins_train_df$INCOME)

#we are creating groups based on HOMEKIDS, replace missing value
ins_train_df <- ins_train_df %>% 
             group_by(HOMEKIDS) %>% 
            mutate(AGE= ifelse(is.na(AGE), mean(AGE, na.rm=TRUE), AGE))


#we are creating groups based on car_age, replace missing value
ins_train_df <- ins_train_df %>% 
             group_by(inc_clusters) %>% 
            mutate(CAR_AGE= ifelse(is.na(CAR_AGE), mean(CAR_AGE, na.rm=TRUE), CAR_AGE))

#replace missing value {based on whole sample/mean based}
#no group value is highly correlated by YOJ
ins_train_df <- ins_train_df %>% 
  mutate(YOJ= ifelse(is.na(YOJ), mean(YOJ, na.rm=TRUE), YOJ))


#subesection for converting those with huge variance (log x + 1)
ins_train_df$TARGET_AMT_log <- log(ins_train_df$TARGET_AMT+1)
ins_train_df$INCOME_log <- log(ins_train_df$INCOME+1)
ins_train_df$OLDCLAIM_log <- log(ins_train_df$OLDCLAIM+1)


ins_train_df$AVG_CLM_PAY <- ins_train_df$OLDCLAIM/ins_train_df$CLM_FREQ

summary(ins_train_df)
##     INDEX            TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Length:8161        Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :     0   Median :0.0000  
##                     Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##                     3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                       
##       AGE           HOMEKIDS           YOJ            INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.00   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 23157  
##  Median :45.00   Median :0.0000   Median :11.38   Median : 51116  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.49   Mean   : 58523  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.00   3rd Qu.: 83304  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.00   Max.   :367030  
##                                                                   
##   PARENT1           HOME_VAL       MSTATUS        SEX             EDUCATION   
##  Mode :logical   Min.   :     0   Mode :logical   F:4375   <High School:1203  
##  FALSE:7084      1st Qu.:     0   FALSE:3267      M:3786   Bachelors   :2242  
##  TRUE :1077      Median :151957   TRUE :4894               High School :2330  
##                  Mean   :146062                            Masters     :1658  
##                  3rd Qu.:233352                            PhD         : 728  
##                  Max.   :885282                                               
##                                                                               
##            JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  Blue Collar :1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical    :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional:1117   Median : 33.00                     Median :14440  
##  Manager     : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer      : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  Student     : 712   Max.   :142.00                     Max.   :69740  
##  (Other)     :1413                                                     
##       TIF                CAR_TYPE     RED_CAR           OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   Mode :logical   Min.   :    0  
##  1st Qu.: 1.000   Panel Truck: 676   FALSE:5783      1st Qu.:    0  
##  Median : 4.000   Pickup     :1389   TRUE :2378      Median :    0  
##  Mean   : 5.351   Sports Car : 907                   Mean   : 4037  
##  3rd Qu.: 7.000   SUV        :2294                   3rd Qu.: 4636  
##  Max.   :25.000   Van        : 750                   Max.   :57037  
##                                                                     
##     CLM_FREQ       REVOKED           MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Mode :logical   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   FALSE:7161      1st Qu.: 0.000   1st Qu.: 4.000  
##  Median :0.0000   TRUE :1000      Median : 1.000   Median : 8.000  
##  Mean   :0.7986                   Mean   : 1.696   Mean   : 8.333  
##  3rd Qu.:2.0000                   3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000                   Max.   :13.000   Max.   :28.000  
##                                                                    
##                URBANICITY      AGE_miss            YOJ_miss      
##  Highly Rural/ Rural:1669   Min.   :0.0000000   Min.   :0.00000  
##  Highly Urban/ Urban:6492   1st Qu.:0.0000000   1st Qu.:0.00000  
##                             Median :0.0000000   Median :0.00000  
##                             Mean   :0.0007352   Mean   :0.05563  
##                             3rd Qu.:0.0000000   3rd Qu.:0.00000  
##                             Max.   :1.0000000   Max.   :1.00000  
##                                                                  
##   CAR_AGE_miss      inc_clusters  TARGET_AMT_log     INCOME_log    
##  Min.   :0.00000   Min.   :1.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.00000   1st Qu.:1.00   1st Qu.: 0.000   1st Qu.:10.050  
##  Median :0.00000   Median :2.00   Median : 0.000   Median :10.842  
##  Mean   :0.06249   Mean   :2.71   Mean   : 2.183   Mean   : 9.421  
##  3rd Qu.:0.00000   3rd Qu.:4.00   3rd Qu.: 6.944   3rd Qu.:11.330  
##  Max.   :1.00000   Max.   :5.00   Max.   :11.586   Max.   :12.813  
##                                                                    
##   OLDCLAIM_log     AVG_CLM_PAY     
##  Min.   : 0.000   Min.   :  129.8  
##  1st Qu.: 0.000   1st Qu.: 1665.1  
##  Median : 0.000   Median : 3204.8  
##  Mean   : 3.386   Mean   : 6315.6  
##  3rd Qu.: 8.442   3rd Qu.: 7160.7  
##  Max.   :10.951   Max.   :57037.0  
##                   NA's   :5009

BUILD THE MODELS

Models predicting TARGET_AMT

#inspecting the TARGET_AMT variable
boxplot(ins_train_df$TARGET_AMT)

summary(ins_train_df$TARGET_AMT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1504    1036  107586

Model 1- All variables discussed above or literally ALL?

#Building model using all features.

lm_ins_train_df <- lm(TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME +
                    PARENT1 +HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME +
                    CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ +
                    REVOKED + MVR_PTS + CAR_AGE + URBANICITY
                    , data = ins_train_df)

summary(lm_ins_train_df)
## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + 
##     PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + 
##     CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + 
##     CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY, data = ins_train_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5875  -1698   -758    355 103796 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -4.332e+02  5.732e+02  -0.756  0.44979    
## KIDSDRIV                       3.128e+02  1.132e+02   2.764  0.00573 ** 
## AGE                            5.440e+00  7.068e+00   0.770  0.44153    
## HOMEKIDS                       8.050e+01  6.539e+01   1.231  0.21838    
## YOJ                           -7.319e+00  1.515e+01  -0.483  0.62899    
## INCOME                        -3.555e-03  1.626e-03  -2.187  0.02878 *  
## PARENT1TRUE                    5.788e+02  2.020e+02   2.866  0.00417 ** 
## HOME_VAL                      -6.343e-04  5.380e-04  -1.179  0.23847    
## MSTATUSTRUE                   -5.588e+02  1.416e+02  -3.946 8.00e-05 ***
## SEXM                           3.728e+02  1.838e+02   2.028  0.04255 *  
## EDUCATIONBachelors            -2.630e+02  2.048e+02  -1.284  0.19917    
## EDUCATIONHigh School          -9.077e+01  1.719e+02  -0.528  0.59751    
## EDUCATIONMasters               2.530e+01  3.007e+02   0.084  0.93294    
## EDUCATIONPhD                   2.605e+02  3.548e+02   0.734  0.46293    
## JOBBlue Collar                 5.209e+02  3.216e+02   1.620  0.10531    
## JOBClerical                    5.536e+02  3.406e+02   1.625  0.10415    
## JOBDoctor                     -4.918e+02  4.087e+02  -1.203  0.22884    
## JOBHome Maker                  3.763e+02  3.631e+02   1.036  0.30010    
## JOBLawyer                      2.397e+02  2.956e+02   0.811  0.41751    
## JOBManager                    -4.708e+02  2.884e+02  -1.633  0.10257    
## JOBProfessional                4.663e+02  3.086e+02   1.511  0.13083    
## JOBStudent                     3.027e+02  3.734e+02   0.811  0.41760    
## TRAVTIME                       1.195e+01  3.222e+00   3.709  0.00021 ***
## CAR_USEPrivate                -7.784e+02  1.644e+02  -4.734 2.24e-06 ***
## BLUEBOOK                       1.402e-02  8.615e-03   1.628  0.10365    
## TIF                           -4.803e+01  1.218e+01  -3.944 8.07e-05 ***
## CAR_TYPEPanel Truck            2.592e+02  2.781e+02   0.932  0.35134    
## CAR_TYPEPickup                 3.770e+02  1.707e+02   2.209  0.02720 *  
## CAR_TYPESports Car             1.024e+03  2.178e+02   4.703 2.60e-06 ***
## CAR_TYPESUV                    7.527e+02  1.793e+02   4.197 2.73e-05 ***
## CAR_TYPEVan                    5.071e+02  2.132e+02   2.379  0.01739 *  
## RED_CARTRUE                   -4.812e+01  1.490e+02  -0.323  0.74681    
## OLDCLAIM                      -1.054e-02  7.436e-03  -1.417  0.15655    
## CLM_FREQ                       1.416e+02  5.503e+01   2.573  0.01009 *  
## REVOKEDTRUE                    5.488e+02  1.735e+02   3.163  0.00157 ** 
## MVR_PTS                        1.760e+02  2.591e+01   6.790 1.20e-11 ***
## CAR_AGE                       -2.853e+01  1.292e+01  -2.208  0.02729 *  
## URBANICITYHighly Urban/ Urban  1.665e+03  1.394e+02  11.944  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4544 on 8123 degrees of freedom
## Multiple R-squared:  0.07099,    Adjusted R-squared:  0.06675 
## F-statistic: 16.78 on 37 and 8123 DF,  p-value: < 2.2e-16

Model 2- Selectively chosen variables

#Building model Selectively chosen features.

lms_ins_train_df <- lm(TARGET_AMT ~ KIDSDRIV +   MSTATUS + CAR_USE + BLUEBOOK +  CAR_TYPE 
                      + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY
                    , data = ins_train_df)

summary(lms_ins_train_df)
## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + MSTATUS + CAR_USE + BLUEBOOK + 
##     CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY, 
##     data = ins_train_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5194  -1667   -843    228 104141 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7.762e+02  2.178e+02   3.564 0.000368 ***
## KIDSDRIV                       4.888e+02  9.967e+01   4.904 9.58e-07 ***
## MSTATUSTRUE                   -7.664e+02  1.040e+02  -7.372 1.85e-13 ***
## CAR_USEPrivate                -9.401e+02  1.255e+02  -7.492 7.49e-14 ***
## BLUEBOOK                      -1.008e-02  7.411e-03  -1.360 0.173783    
## CAR_TYPEPanel Truck            2.795e+02  2.452e+02   1.140 0.254497    
## CAR_TYPEPickup                 3.214e+02  1.672e+02   1.922 0.054636 .  
## CAR_TYPESports Car             8.420e+02  1.847e+02   4.558 5.25e-06 ***
## CAR_TYPESUV                    5.594e+02  1.408e+02   3.974 7.12e-05 ***
## CAR_TYPEVan                    4.990e+02  2.036e+02   2.450 0.014294 *  
## OLDCLAIM                      -1.101e-02  7.479e-03  -1.472 0.141194    
## CLM_FREQ                       1.769e+02  5.524e+01   3.202 0.001370 ** 
## REVOKEDTRUE                    6.394e+02  1.744e+02   3.666 0.000248 ***
## MVR_PTS                        1.995e+02  2.595e+01   7.689 1.65e-14 ***
## URBANICITYHighly Urban/ Urban  1.214e+03  1.315e+02   9.227  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4578 on 8146 degrees of freedom
## Multiple R-squared:  0.05444,    Adjusted R-squared:  0.05281 
## F-statistic:  33.5 on 14 and 8146 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(lms_ins_train_df$residuals ~ lms_ins_train_df$fitted.values, main="New Reduced Var Model")
abline(h = 0)
plot(lm_ins_train_df$residuals ~ lm_ins_train_df$fitted.values, main="Orignal Model All Vars")
abline(h = 0)

SELECT THE MODELS

Log model takeaway: even a basic baseline model is valuable for an insurance company