Overview
In this homework assignment, you will explore, analyze, and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A “0” means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car and greater than zero if they did.
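The relationship between the two response variables can be checked directly. The following is a minimal sketch on a small made-up data frame; the values are illustrative and are not taken from the data set.

```r
# Toy records (hypothetical values) mimicking the two response variables
toy <- data.frame(
  TARGET_FLAG = c(0, 1, 0, 1, 0),
  TARGET_AMT  = c(0, 2946.5, 0, 4021.0, 0)
)

# TARGET_AMT should be positive exactly when TARGET_FLAG is 1
consistent <- all((toy$TARGET_AMT > 0) == (toy$TARGET_FLAG == 1))

# The mean of a 0/1 flag is the share of customers who crashed
crash_rate <- mean(toy$TARGET_FLAG)
```

Applied to the training data, the same mean is the 0.2638 reported for TARGET_FLAG in the summary table, so roughly 26% of customers had a crash.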
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
Load CSV from GitHub
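# Packages used in this report
library(dplyr)     # %>%, select(), mutate()
library(stringr)   # str_remove()
library(ggplot2)   # plots in the Visual Exploration section
library(reshape2)  # melt()

# Training data: drop the row index, convert the money columns to numeric,
# and strip the "z_" prefix from the categorical labels
train_insurance <- read.csv("https://raw.githubusercontent.com/mkunissery/Data621/master/HW4/insurance_training_data.csv") %>%
  dplyr::select(-INDEX) %>%
  mutate(
    # Note: if these columns are read in as factors, as.numeric() returns
    # level codes rather than dollar amounts
    INCOME = as.numeric(INCOME),
    HOME_VAL = as.numeric(HOME_VAL),
    BLUEBOOK = as.numeric(BLUEBOOK),
    OLDCLAIM = as.numeric(OLDCLAIM),
    MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
    SEX = as.factor(str_remove(SEX, "^z_")),
    EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
    JOB = as.factor(str_remove(JOB, "^z_")),
    CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
    URBANICITY = as.factor(str_remove(URBANICITY, "^z_")))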
eval_data <- read.csv("https://raw.githubusercontent.com/mkunissery/Data621/master/HW4/insurance-evaluation-data.csv") %>%
  dplyr::select(-INDEX) %>%
  mutate(
    INCOME = as.numeric(INCOME),
    HOME_VAL = as.numeric(HOME_VAL),
    BLUEBOOK = as.numeric(BLUEBOOK),
    OLDCLAIM = as.numeric(OLDCLAIM),
    MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
    SEX = as.factor(str_remove(SEX, "^z_")),
    EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
    JOB = as.factor(str_remove(JOB, "^z_")),
    CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
    URBANICITY = as.factor(str_remove(URBANICITY, "^z_"))
)
| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 1 | 8161 | 0.2638157 | 0.4407276 | 0 | 0.2047787 | 0.0000 | 0 | 1.0 | 1.0 | 1.0716614 | -0.8516462 | 0.0048786 | 0 |
| TARGET_AMT | 2 | 8161 | 1504.3246481 | 4704.0269298 | 0 | 593.7121106 | 0.0000 | 0 | 107586.1 | 107586.1 | 8.7063034 | 112.2884386 | 52.0712628 | 0 |
| KIDSDRIV | 3 | 8161 | 0.1710575 | 0.5115341 | 0 | 0.0252719 | 0.0000 | 0 | 4.0 | 4.0 | 3.3518374 | 11.7801916 | 0.0056624 | 0 |
| AGE | 4 | 8155 | 44.7903127 | 8.6275895 | 45 | 44.8306513 | 8.8956 | 16 | 81.0 | 65.0 | -0.0289889 | -0.0617020 | 0.0955383 | 6 |
| HOMEKIDS | 5 | 8161 | 0.7212351 | 1.1163233 | 0 | 0.4971665 | 0.0000 | 0 | 5.0 | 5.0 | 1.3411271 | 0.6489915 | 0.0123571 | 0 |
| YOJ | 6 | 7707 | 10.4992864 | 4.0924742 | 11 | 11.0711853 | 2.9652 | 0 | 23.0 | 23.0 | -1.2029676 | 1.1773410 | 0.0466169 | 454 |
| INCOME | 7 | 8161 | 2875.5505453 | 2090.6786785 | 2817 | 2816.9534385 | 2799.1488 | 1 | 6613.0 | 6612.0 | 0.1094699 | -1.2853032 | 23.1427840 | 0 |
| PARENT1* | 8 | 8161 | 1.1319691 | 0.3384779 | 1 | 1.0399755 | 0.0000 | 1 | 2.0 | 1.0 | 2.1743561 | 2.7281589 | 0.0037468 | 0 |
| HOME_VAL | 9 | 8161 | 1684.8931503 | 1697.3791897 | 1245 | 1516.4994639 | 1842.8718 | 1 | 5107.0 | 5106.0 | 0.5162324 | -1.1810965 | 18.7891522 | 0 |
| MSTATUS* | 10 | 8161 | 1.5996814 | 0.4899929 | 2 | 1.6245979 | 0.0000 | 1 | 2.0 | 1.0 | -0.4068189 | -1.8347231 | 0.0054240 | 0 |
| SEX* | 11 | 8161 | 1.4639137 | 0.4987266 | 1 | 1.4548936 | 0.0000 | 1 | 2.0 | 1.0 | 0.1446959 | -1.9793056 | 0.0055207 | 0 |
| EDUCATION* | 12 | 8161 | 2.8120328 | 1.1786322 | 3 | 2.7785266 | 1.4826 | 1 | 5.0 | 4.0 | 0.1543452 | -0.8453783 | 0.0130469 | 0 |
| JOB* | 13 | 8161 | 4.8337214 | 2.6238293 | 5 | 4.7636698 | 4.4478 | 1 | 9.0 | 8.0 | 0.1300643 | -1.4594539 | 0.0290445 | 0 |
| TRAVTIME | 14 | 8161 | 33.4857248 | 15.9083334 | 33 | 32.9954051 | 16.3086 | 5 | 142.0 | 137.0 | 0.4468174 | 0.6643331 | 0.1760974 | 0 |
| CAR_USE* | 15 | 8161 | 1.6288445 | 0.4831436 | 2 | 1.6610507 | 0.0000 | 1 | 2.0 | 1.0 | -0.5332937 | -1.7158080 | 0.0053482 | 0 |
| BLUEBOOK | 16 | 8161 | 1283.6185516 | 893.5117428 | 1124 | 1259.5665492 | 1132.7064 | 1 | 2789.0 | 2788.0 | 0.2472837 | -1.3624655 | 9.8907352 | 0 |
| TIF | 17 | 8161 | 5.3513050 | 4.1466353 | 4 | 4.8402512 | 4.4478 | 1 | 25.0 | 24.0 | 0.8908120 | 0.4224940 | 0.0459012 | 0 |
| CAR_TYPE* | 18 | 8161 | 3.3405220 | 1.7553381 | 3 | 3.3107673 | 2.9652 | 1 | 6.0 | 5.0 | -0.0981926 | -1.4298002 | 0.0194307 | 0 |
| RED_CAR* | 19 | 8161 | 1.2913859 | 0.4544287 | 1 | 1.2392403 | 0.0000 | 1 | 2.0 | 1.0 | 0.9180255 | -1.1573709 | 0.0050303 | 0 |
| OLDCLAIM | 20 | 8161 | 552.2714128 | 862.2006829 | 1 | 380.3196508 | 0.0000 | 1 | 2857.0 | 2856.0 | 1.3085876 | 0.2461666 | 9.5441372 | 0 |
| CLM_FREQ | 21 | 8161 | 0.7985541 | 1.1584527 | 0 | 0.5886047 | 0.0000 | 0 | 5.0 | 5.0 | 1.2087985 | 0.2842890 | 0.0128235 | 0 |
| REVOKED* | 22 | 8161 | 1.1225340 | 0.3279216 | 1 | 1.0281820 | 0.0000 | 1 | 2.0 | 1.0 | 2.3018899 | 3.2991013 | 0.0036299 | 0 |
| MVR_PTS | 23 | 8161 | 1.6955030 | 2.1471117 | 1 | 1.3138306 | 1.4826 | 0 | 13.0 | 13.0 | 1.3478403 | 1.3754900 | 0.0237675 | 0 |
| CAR_AGE | 24 | 7651 | 8.3283231 | 5.7007424 | 8 | 7.9632413 | 7.4130 | -3 | 28.0 | 31.0 | 0.2819531 | -0.7489756 | 0.0651737 | 510 |
| URBANICITY* | 25 | 8161 | 1.7954907 | 0.4033673 | 2 | 1.8693521 | 0.0000 | 1 | 2.0 | 1.0 | -1.4649406 | 0.1460688 | 0.0044651 | 0 |
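The INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM rows report ranges such as 1 to 6613 with no missing values. That pattern looks like factor level codes rather than dollar amounts, since as.numeric() applied to a factor returns the codes. Below is a minimal sketch of a parser, assuming the raw columns hold currency strings such as "$67,349". The helper name parse_currency and the sample values are hypothetical and do not come from the original analysis.

```r
# Hypothetical helper: turn "$67,349"-style strings (or factors) into numbers
parse_currency <- function(x) {
  x <- as.character(x)                     # factor -> its labels, not its codes
  x[!is.na(x) & x == ""] <- NA_character_  # treat empty strings as missing
  as.numeric(gsub("[$,]", "", x))          # drop "$" and thousands separators
}

# Made-up sample column, stored as a factor as older read.csv() defaults would
raw <- factor(c("$67,349", "$0", "", "$125,301"))

codes  <- as.numeric(raw)     # level codes, not dollars
parsed <- parse_currency(raw) # dollar amounts, with NA for the empty entry
```

If the columns were parsed this way, previously hidden missing values would likely appear, so the n and na_count entries for these four variables could change.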
Visual Exploration
Boxplots
The boxplots below show the spread of each numeric variable in the data set, one panel per variable with its own axis scale. The categorical variables are treated by melt() as id variables and are not plotted.
ggplot(melt(train_insurance), aes(x=factor(variable), y=value)) +
  facet_wrap(~variable, scales="free") +
  geom_boxplot()
## Using PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY as id variables
## Warning: Removed 970 rows containing non-finite values (stat_boxplot).
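The 970 removed rows correspond to the missing values reported in the summary table: 6 for AGE, 454 for YOJ and 510 for CAR_AGE. The snippet below checks that arithmetic and sketches how per-column counts could be obtained; the toy data frame is hypothetical.

```r
# na_count values taken from the summary table
na_counts <- c(AGE = 6, YOJ = 454, CAR_AGE = 510)
total_missing <- sum(na_counts)   # 970, matching the ggplot2 warning

# On a data frame, per-column NA counts can be computed like this
toy <- data.frame(AGE = c(45, NA, 38), YOJ = c(NA, NA, 11))
toy_na <- colSums(is.na(toy))
```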
Histograms
ggplot(melt(train_insurance), aes(x=value)) +
  facet_wrap(~variable, scales="free") +
  geom_histogram(bins=50)
## Using PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY as id variables
## Warning: Removed 970 rows containing non-finite values (stat_bin).