Data 621 - HW 4

Murali Kunissery

April 24, 2019

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

load csv from github

  train_insurance <- read.csv("https://raw.githubusercontent.com/mkunissery/Data621/master/HW4/insurance_training_data.csv") %>%
                            dplyr::select(-INDEX) %>% 
  mutate(
    INCOME = as.numeric(INCOME),
    HOME_VAL = as.numeric(HOME_VAL),
    BLUEBOOK = as.numeric(BLUEBOOK),
    OLDCLAIM = as.numeric(OLDCLAIM),
    MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
    SEX = as.factor(str_remove(SEX, "^z_")),
    EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
    JOB = as.factor(str_remove(JOB, "^z_")),
    CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
    URBANICITY = as.factor(str_remove(URBANICITY, "^z_")))

eval_data <- read.csv("https://raw.githubusercontent.com/mkunissery/Data621/master/HW4/insurance-evaluation-data.csv") %>% 
    dplyr::select(-INDEX) %>% 
  mutate(
    INCOME = as.numeric(INCOME),
    HOME_VAL = as.numeric(HOME_VAL),
    BLUEBOOK = as.numeric(BLUEBOOK),
    OLDCLAIM = as.numeric(OLDCLAIM),
    MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
    SEX = as.factor(str_remove(SEX, "^z_")),
    EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
    JOB = as.factor(str_remove(JOB, "^z_")),
    CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
    URBANICITY = as.factor(str_remove(URBANICITY, "^z_"))
  )
vars n mean sd median trimmed mad min max range skew kurtosis se na_count
TARGET_FLAG 1 8161 0.2638157 0.4407276 0 0.2047787 0.0000 0 1.0 1.0 1.0716614 -0.8516462 0.0048786 0
TARGET_AMT 2 8161 1504.3246481 4704.0269298 0 593.7121106 0.0000 0 107586.1 107586.1 8.7063034 112.2884386 52.0712628 0
KIDSDRIV 3 8161 0.1710575 0.5115341 0 0.0252719 0.0000 0 4.0 4.0 3.3518374 11.7801916 0.0056624 0
AGE 4 8155 44.7903127 8.6275895 45 44.8306513 8.8956 16 81.0 65.0 -0.0289889 -0.0617020 0.0955383 6
HOMEKIDS 5 8161 0.7212351 1.1163233 0 0.4971665 0.0000 0 5.0 5.0 1.3411271 0.6489915 0.0123571 0
YOJ 6 7707 10.4992864 4.0924742 11 11.0711853 2.9652 0 23.0 23.0 -1.2029676 1.1773410 0.0466169 454
INCOME 7 8161 2875.5505453 2090.6786785 2817 2816.9534385 2799.1488 1 6613.0 6612.0 0.1094699 -1.2853032 23.1427840 0
PARENT1* 8 8161 1.1319691 0.3384779 1 1.0399755 0.0000 1 2.0 1.0 2.1743561 2.7281589 0.0037468 0
HOME_VAL 9 8161 1684.8931503 1697.3791897 1245 1516.4994639 1842.8718 1 5107.0 5106.0 0.5162324 -1.1810965 18.7891522 0
MSTATUS* 10 8161 1.5996814 0.4899929 2 1.6245979 0.0000 1 2.0 1.0 -0.4068189 -1.8347231 0.0054240 0
SEX* 11 8161 1.4639137 0.4987266 1 1.4548936 0.0000 1 2.0 1.0 0.1446959 -1.9793056 0.0055207 0
EDUCATION* 12 8161 2.8120328 1.1786322 3 2.7785266 1.4826 1 5.0 4.0 0.1543452 -0.8453783 0.0130469 0
JOB* 13 8161 4.8337214 2.6238293 5 4.7636698 4.4478 1 9.0 8.0 0.1300643 -1.4594539 0.0290445 0
TRAVTIME 14 8161 33.4857248 15.9083334 33 32.9954051 16.3086 5 142.0 137.0 0.4468174 0.6643331 0.1760974 0
CAR_USE* 15 8161 1.6288445 0.4831436 2 1.6610507 0.0000 1 2.0 1.0 -0.5332937 -1.7158080 0.0053482 0
BLUEBOOK 16 8161 1283.6185516 893.5117428 1124 1259.5665492 1132.7064 1 2789.0 2788.0 0.2472837 -1.3624655 9.8907352 0
TIF 17 8161 5.3513050 4.1466353 4 4.8402512 4.4478 1 25.0 24.0 0.8908120 0.4224940 0.0459012 0
CAR_TYPE* 18 8161 3.3405220 1.7553381 3 3.3107673 2.9652 1 6.0 5.0 -0.0981926 -1.4298002 0.0194307 0
RED_CAR* 19 8161 1.2913859 0.4544287 1 1.2392403 0.0000 1 2.0 1.0 0.9180255 -1.1573709 0.0050303 0
OLDCLAIM 20 8161 552.2714128 862.2006829 1 380.3196508 0.0000 1 2857.0 2856.0 1.3085876 0.2461666 9.5441372 0
CLM_FREQ 21 8161 0.7985541 1.1584527 0 0.5886047 0.0000 0 5.0 5.0 1.2087985 0.2842890 0.0128235 0
REVOKED* 22 8161 1.1225340 0.3279216 1 1.0281820 0.0000 1 2.0 1.0 2.3018899 3.2991013 0.0036299 0
MVR_PTS 23 8161 1.6955030 2.1471117 1 1.3138306 1.4826 0 13.0 13.0 1.3478403 1.3754900 0.0237675 0
CAR_AGE 24 7651 8.3283231 5.7007424 8 7.9632413 7.4130 -3 28.0 31.0 0.2819531 -0.7489756 0.0651737 510
URBANICITY* 25 8161 1.7954907 0.4033673 2 1.8693521 0.0000 1 2.0 1.0 -1.4649406 0.1460688 0.0044651 0

Visual Exploration

Boxplots

The below boxplots show all of the variables listed in the dataset. This visualization will assist in showing how the data is spread for each variable.

ggplot(melt(train_insurance), aes(x=factor(variable), y=value)) + 
  facet_wrap(~variable, scale="free") + 
  geom_boxplot()
## Using PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY as id variables
## Warning: Removed 970 rows containing non-finite values (stat_boxplot).

Histograms

ggplot(melt(train_insurance), aes(x=value)) + 
  facet_wrap(~variable, scale="free") + 
  geom_histogram(bins=50)
## Using PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY as id variables
## Warning: Removed 970 rows containing non-finite values (stat_bin).

Appendix

Please find the source code assocoiated.

Sourcecode