Data 621 - HW 4

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

load csv from github

  train_insurance <- read.csv("https://raw.githubusercontent.com/mkunissery/Data621/master/HW4/insurance_training_data.csv") %>%
                            dplyr::select(-INDEX) %>% 
  mutate(
    INCOME = as.numeric(INCOME),
    HOME_VAL = as.numeric(HOME_VAL),
    BLUEBOOK = as.numeric(BLUEBOOK),
    OLDCLAIM = as.numeric(OLDCLAIM),
    MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
    SEX = as.factor(str_remove(SEX, "^z_")),
    EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
    JOB = as.factor(str_remove(JOB, "^z_")),
    CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
    URBANICITY = as.factor(str_remove(URBANICITY, "^z_")))

eval_data <- read.csv("https://raw.githubusercontent.com/mkunissery/Data621/master/HW4/insurance-evaluation-data.csv") %>% 
    dplyr::select(-INDEX) %>% 
  mutate(
    INCOME = as.numeric(INCOME),
    HOME_VAL = as.numeric(HOME_VAL),
    BLUEBOOK = as.numeric(BLUEBOOK),
    OLDCLAIM = as.numeric(OLDCLAIM),
    MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
    SEX = as.factor(str_remove(SEX, "^z_")),
    EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
    JOB = as.factor(str_remove(JOB, "^z_")),
    CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
    URBANICITY = as.factor(str_remove(URBANICITY, "^z_"))
  )

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se	na_count
TARGET_FLAG	1	8161	0.2638157	0.4407276	0	0.2047787	0.0000	0	1.0	1.0	1.0716614	-0.8516462	0.0048786	0
TARGET_AMT	2	8161	1504.3246481	4704.0269298	0	593.7121106	0.0000	0	107586.1	107586.1	8.7063034	112.2884386	52.0712628	0
KIDSDRIV	3	8161	0.1710575	0.5115341	0	0.0252719	0.0000	0	4.0	4.0	3.3518374	11.7801916	0.0056624	0
AGE	4	8155	44.7903127	8.6275895	45	44.8306513	8.8956	16	81.0	65.0	-0.0289889	-0.0617020	0.0955383	6
HOMEKIDS	5	8161	0.7212351	1.1163233	0	0.4971665	0.0000	0	5.0	5.0	1.3411271	0.6489915	0.0123571	0
YOJ	6	7707	10.4992864	4.0924742	11	11.0711853	2.9652	0	23.0	23.0	-1.2029676	1.1773410	0.0466169	454
INCOME	7	8161	2875.5505453	2090.6786785	2817	2816.9534385	2799.1488	1	6613.0	6612.0	0.1094699	-1.2853032	23.1427840	0
PARENT1*	8	8161	1.1319691	0.3384779	1	1.0399755	0.0000	1	2.0	1.0	2.1743561	2.7281589	0.0037468	0
HOME_VAL	9	8161	1684.8931503	1697.3791897	1245	1516.4994639	1842.8718	1	5107.0	5106.0	0.5162324	-1.1810965	18.7891522	0
MSTATUS*	10	8161	1.5996814	0.4899929	2	1.6245979	0.0000	1	2.0	1.0	-0.4068189	-1.8347231	0.0054240	0
SEX*	11	8161	1.4639137	0.4987266	1	1.4548936	0.0000	1	2.0	1.0	0.1446959	-1.9793056	0.0055207	0
EDUCATION*	12	8161	2.8120328	1.1786322	3	2.7785266	1.4826	1	5.0	4.0	0.1543452	-0.8453783	0.0130469	0
JOB*	13	8161	4.8337214	2.6238293	5	4.7636698	4.4478	1	9.0	8.0	0.1300643	-1.4594539	0.0290445	0
TRAVTIME	14	8161	33.4857248	15.9083334	33	32.9954051	16.3086	5	142.0	137.0	0.4468174	0.6643331	0.1760974	0
CAR_USE*	15	8161	1.6288445	0.4831436	2	1.6610507	0.0000	1	2.0	1.0	-0.5332937	-1.7158080	0.0053482	0
BLUEBOOK	16	8161	1283.6185516	893.5117428	1124	1259.5665492	1132.7064	1	2789.0	2788.0	0.2472837	-1.3624655	9.8907352	0
TIF	17	8161	5.3513050	4.1466353	4	4.8402512	4.4478	1	25.0	24.0	0.8908120	0.4224940	0.0459012	0
CAR_TYPE*	18	8161	3.3405220	1.7553381	3	3.3107673	2.9652	1	6.0	5.0	-0.0981926	-1.4298002	0.0194307	0
RED_CAR*	19	8161	1.2913859	0.4544287	1	1.2392403	0.0000	1	2.0	1.0	0.9180255	-1.1573709	0.0050303	0
OLDCLAIM	20	8161	552.2714128	862.2006829	1	380.3196508	0.0000	1	2857.0	2856.0	1.3085876	0.2461666	9.5441372	0
CLM_FREQ	21	8161	0.7985541	1.1584527	0	0.5886047	0.0000	0	5.0	5.0	1.2087985	0.2842890	0.0128235	0
REVOKED*	22	8161	1.1225340	0.3279216	1	1.0281820	0.0000	1	2.0	1.0	2.3018899	3.2991013	0.0036299	0
MVR_PTS	23	8161	1.6955030	2.1471117	1	1.3138306	1.4826	0	13.0	13.0	1.3478403	1.3754900	0.0237675	0
CAR_AGE	24	7651	8.3283231	5.7007424	8	7.9632413	7.4130	-3	28.0	31.0	0.2819531	-0.7489756	0.0651737	510
URBANICITY*	25	8161	1.7954907	0.4033673	2	1.8693521	0.0000	1	2.0	1.0	-1.4649406	0.1460688	0.0044651	0

Visual Exploration

Boxplots

The below boxplots show all of the variables listed in the dataset. This visualization will assist in showing how the data is spread for each variable.

ggplot(melt(train_insurance), aes(x=factor(variable), y=value)) + 
  facet_wrap(~variable, scale="free") + 
  geom_boxplot()

## Using PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY as id variables

## Warning: Removed 970 rows containing non-finite values (stat_boxplot).

Histograms

ggplot(melt(train_insurance), aes(x=value)) + 
  facet_wrap(~variable, scale="free") + 
  geom_histogram(bins=50)

## Using PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY as id variables

## Warning: Removed 970 rows containing non-finite values (stat_bin).

Appendix

Please find the source code assocoiated.

Sourcecode