Introduction

We will explore, analyze and model a data set containing approximately 8000 records. Each record represents a customer at an auto insurance company. Each record has various predictor variables regarding the customer’s car, job, and demographics. The response variables within this dataset indicate if the customer was in a car crash with a binary label and the value of damages done if the customer was in a car crash.

We will create two unique types of models for each response variable. The first will be a binary logistic regression model in order to predict if someone will crash their car or not. The other model will be a multiple linear regression model to predict how much it will cost in damages if the customer does crash their car.

Subsections:
- Loading Data
- Description of Variables

Loading Data

Here we load in the data from the CSVs which we have downloaded. The data is separated into a training set for building the model and an evaluation set for testing the model.

We can see that we have the expected amount of records loaded in: 8,161.

## Rows: 8,161
## Columns: 26
## $ INDEX       <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19, 20, 2…
## $ TARGET_FLAG <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ TARGET_AMT  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0…
## $ KIDSDRIV    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AGE         <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45…
## $ HOMEKIDS    <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1…
## $ YOJ         <dbl> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, 11, 0, 1…
## $ INCOME      <chr> "$67,349", "$91,449", "$16,039", NA, "$114,986", "$125,301…
## $ PARENT1     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No…
## $ HOME_VAL    <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,925", "$0"…
## $ MSTATUS     <chr> "z_No", "z_No", "Yes", "Yes", "Yes", "z_No", "Yes", "Yes",…
## $ SEX         <chr> "M", "M", "z_F", "M", "z_F", "z_F", "z_F", "M", "z_F", "M"…
## $ EDUCATION   <chr> "PhD", "z_High School", "z_High School", "<High School", "…
## $ JOB         <chr> "Professional", "z_Blue Collar", "Clerical", "z_Blue Colla…
## $ TRAVTIME    <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,…
## $ CAR_USE     <chr> "Private", "Commercial", "Private", "Private", "Private", …
## $ BLUEBOOK    <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,000", "$17…
## $ TIF         <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, …
## $ CAR_TYPE    <chr> "Minivan", "Minivan", "z_SUV", "Minivan", "z_SUV", "Sports…
## $ RED_CAR     <chr> "yes", "yes", "no", "yes", "no", "no", "no", "yes", "no", …
## $ OLDCLAIM    <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", "$0", "$…
## $ CLM_FREQ    <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2…
## $ REVOKED     <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "N…
## $ MVR_PTS     <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, …
## $ CAR_AGE     <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,…
## $ URBANICITY  <chr> "Highly Urban/ Urban", "Highly Urban/ Urban", "Highly Urba…

Description of Variables

Below is a short description of the variables of interest in the data set:

Variable	Quantitative or Categorical	Definition
INDEX	Categorical	Identification Variable
TARGET_FLAG	Categorical	Determines if car was in a crash
TARGET_AMT	Quantitative	Damage value of the car crash
AGE	Quantitative	Age of driver
BLUEBOOK	Quantitative	Value of vehicle
CAR_AGE	Quantitative	Vehicle age
CAR_TYPE	Categorical	Type of Car
CAR_USE	Categorical	If the vehicle is used privately or commercially
CLM_FREQ	Quantitative	# claims (past 5 years)
EDUCATION	Categorical	Max education level of the customer
HOMEKIDS	Quantitative	# children at home
HOME_VAL	Quantitative	Home value of the customer
INCOME	Quantitative	Income of the customer
KIDSDRIV	Quantitative	# driving children of the customer
MSTATUS	Categorical	Marital status of the customer
MVR_PTS	Quantitative	Motor vehicle record points
OLDCLAIM	Quantitative	Total claims (past 5 years)
PARENT1	Categorical	Is the customer a single parent
RED_CAR	Categorical	Is the car red
REVOKED	Categorical	Was the customer’s license revoked (past 7 years)
SEX	Categorical	Customer’s Gender
TIF	Quantitative	Time in force (as a customer)
TRAVTIME	Quantitative	Distance the customer travels to work
URBANICITY	Categorical	How urban the customer’s home/work area is
YOJ	Quantitative	Years on the job the customer has

§1 Data Exploration

Subsections:
- Data Preprocessing - Data Skimming
- Histogram
- Outliers
- Missing Values
- Correlation

Data Preprocessing

From our precursory variable exploration, we know that we have both qualitative and quantitative variables contained within our data. Before we can begin exploring the data, we need to ensure that our data is contained as the type which it should be.

Right away we see variables that both need to be factorized such as urbanicity and the crash flag, variables that need to properly be converted to numeric such as income, and those which have values that need to be cleaned up with a suspicious “z_” prefixing some categorical values such as sex and marital status.

## Rows: 8,161
## Columns: 26
## $ INDEX       <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19, 20, 2…
## $ TARGET_FLAG <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ TARGET_AMT  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0…
## $ KIDSDRIV    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AGE         <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45…
## $ HOMEKIDS    <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1…
## $ YOJ         <dbl> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, 11, 0, 1…
## $ INCOME      <chr> "$67,349", "$91,449", "$16,039", NA, "$114,986", "$125,301…
## $ PARENT1     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No…
## $ HOME_VAL    <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,925", "$0"…
## $ MSTATUS     <chr> "z_No", "z_No", "Yes", "Yes", "Yes", "z_No", "Yes", "Yes",…
## $ SEX         <chr> "M", "M", "z_F", "M", "z_F", "z_F", "z_F", "M", "z_F", "M"…
## $ EDUCATION   <chr> "PhD", "z_High School", "z_High School", "<High School", "…
## $ JOB         <chr> "Professional", "z_Blue Collar", "Clerical", "z_Blue Colla…
## $ TRAVTIME    <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,…
## $ CAR_USE     <chr> "Private", "Commercial", "Private", "Private", "Private", …
## $ BLUEBOOK    <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,000", "$17…
## $ TIF         <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, …
## $ CAR_TYPE    <chr> "Minivan", "Minivan", "z_SUV", "Minivan", "z_SUV", "Sports…
## $ RED_CAR     <chr> "yes", "yes", "no", "yes", "no", "no", "no", "yes", "no", …
## $ OLDCLAIM    <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", "$0", "$…
## $ CLM_FREQ    <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2…
## $ REVOKED     <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "N…
## $ MVR_PTS     <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, …
## $ CAR_AGE     <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,…
## $ URBANICITY  <chr> "Highly Urban/ Urban", "Highly Urban/ Urban", "Highly Urba…

We begin our data preprocessing efforts by converting the categorical columns of TARGET_FLAG, PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, and URBANICITY from character vectors into factors. We then view the levels of each of these factors to better understand which levels need to be modified and which factors should be ordered.

## $TARGET_FLAG
## [1] "0" "1"
## 
## $PARENT1
## [1] "No"  "Yes"
## 
## $MSTATUS
## [1] "z_No" "Yes" 
## 
## $SEX
## [1] "M"   "z_F"
## 
## $EDUCATION
## [1] "PhD"           "z_High School" "<High School"  "Bachelors"    
## [5] "Masters"      
## 
## $JOB
## [1] "Professional"  "z_Blue Collar" "Clerical"      "Doctor"       
## [5] "Lawyer"        "Manager"       "Home Maker"    "Student"      
## 
## $CAR_USE
## [1] "Private"    "Commercial"
## 
## $CAR_TYPE
## [1] "Minivan"     "z_SUV"       "Sports Car"  "Van"         "Panel Truck"
## [6] "Pickup"     
## 
## $RED_CAR
## [1] "yes" "no" 
## 
## $REVOKED
## [1] "No"  "Yes"
## 
## $URBANICITY
## [1] "Highly Urban/ Urban"   "z_Highly Rural/ Rural"

MSTATUS, EDUCATION, JOB, CAR_TYPE, and URBANICITY all contain the “z_” prefix amongst one level of the factors. On a closer look it seems that the “z_” prefix is a marker within the data to attempt to classify which level would lend the most to the probability of the class. We can see this most clearly through SEX, EDUCATION, and MSTATUS. In each one of these the “z_” marks a stereotype. Those that are not married would drive more recklessly, women are worse drivers, and those who only have a high-school education end up driving more recklessly.

We want our data to speak for itself through the models, thus we have no need for the z_ prefixes within our data and decide to remove them to make the values more clear and readable.

## $MSTATUS
## [1] "No"  "Yes"
## 
## $EDUCATION
## [1] "PhD"          "High School"  "<High School" "Bachelors"    "Masters"     
## 
## $JOB
## [1] "Professional" "Blue Collar"  "Clerical"     "Doctor"       "Lawyer"      
## [6] "Manager"      "Home Maker"   "Student"     
## 
## $CAR_TYPE
## [1] "Minivan"     "SUV"         "Sports Car"  "Van"         "Panel Truck"
## [6] "Pickup"     
## 
## $URBANICITY
## [1] "Highly Urban/ Urban" "Highly Rural/ Rural"

Next, we consider if any of these factors should be ordered factors. As this will help us when visualizing grouping down the line in data exploration.

There is no innate ordering in any of the factorized variables that we have except education, which can be sorted from most educated to least. Thus, we will order the education variable into its proper sorting from not having completed high school up to completing a PhD.

## [1] PhD          High School  <High School Bachelors    Masters     
## Levels: <High School < High School < Bachelors < Masters < PhD

With the factored variables preprocessed, we will move on to the variables that need to be converted into numerics. Which are really just the columns that deal with a money value: INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM.

## Rows: 8,161
## Columns: 4
## $ INCOME   <chr> "$67,349", "$91,449", "$16,039", NA, "$114,986", "$125,301", …
## $ HOME_VAL <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,925", "$0", N…
## $ BLUEBOOK <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,000", "$17,43…
## $ OLDCLAIM <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", "$0", "$2,3…

We simply need to remove the dollar signs and commas then convert the columns to numeric. Thankfully, readr has the parse_number function which can do exactly this for us.

## Rows: 8,161
## Columns: 26
## $ INDEX       <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19, 20, 2…
## $ TARGET_FLAG <fct> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ TARGET_AMT  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0…
## $ KIDSDRIV    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AGE         <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45…
## $ HOMEKIDS    <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1…
## $ YOJ         <dbl> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, 11, 0, 1…
## $ INCOME      <dbl> 67349, 91449, 16039, NA, 114986, 125301, 18755, 107961, 62…
## $ PARENT1     <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, No, No, N…
## $ HOME_VAL    <dbl> 0, 257252, 124191, 306251, 243925, 0, NA, 333680, 0, 0, 0,…
## $ MSTATUS     <fct> No, No, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Yes,…
## $ SEX         <fct> M, M, F, M, F, F, F, M, F, M, F, F, M, M, F, F, M, F, F, F…
## $ EDUCATION   <ord> PhD, High School, High School, <High School, PhD, Bachelor…
## $ JOB         <fct> Professional, Blue Collar, Clerical, Blue Collar, Doctor, …
## $ TRAVTIME    <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,…
## $ CAR_USE     <fct> Private, Commercial, Private, Private, Private, Commercial…
## $ BLUEBOOK    <dbl> 14230, 14940, 4010, 15440, 18000, 17430, 8780, 16970, 1120…
## $ TIF         <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, …
## $ CAR_TYPE    <fct> Minivan, Minivan, SUV, Minivan, SUV, Sports Car, SUV, Van,…
## $ RED_CAR     <fct> yes, yes, no, yes, no, no, no, yes, no, no, no, no, yes, y…
## $ OLDCLAIM    <dbl> 4461, 0, 38690, 0, 19217, 0, 0, 2374, 0, 0, 0, 0, 5028, 0,…
## $ CLM_FREQ    <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2…
## $ REVOKED     <fct> No, No, No, No, Yes, No, No, Yes, No, No, No, No, Yes, No,…
## $ MVR_PTS     <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, …
## $ CAR_AGE     <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,…
## $ URBANICITY  <fct> Highly Urban/ Urban, Highly Urban/ Urban, Highly Urban/ Ur…

With the money related columns converted into numerics, we have a good jumping off point to move on from preprocessing to data exploration proper.

Data Skimming

We begin our data exploration by skimming the test data for immediate summary statistics, data types, and a visualization of distributions with mini histograms.

We still have our 8161 rows with 26 columns that are split between 11 factors and 15 numeric variables.

Looking at our response variables we see in the breakdown of the TARGET_FLAG that there are disproportionately more customers without accidents in our dataset compared to customers with accidents. We should modify our cross-validation resampling down the line in order to have approximately equal proportions within the cv training sets. We also see that our target_amt is majorly skewed to the right because there are large amounts of 0 value crashes because a crash never happened at all. This will not need to be something that we worry about as we will subset the data for training our model that predicts crash values on only the data that does have crashes.

Data summary
Name	train_1
Number of rows	8161
Number of columns	26
_______________________
Column type frequency:
factor	11
numeric	15
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	ordered	n_unique	top_counts
TARGET_FLAG	0	FALSE	2	0: 6008, 1: 2153
PARENT1	0	FALSE	2	No: 7084, Yes: 1077
MSTATUS	0	FALSE	2	Yes: 4894, No: 3267
SEX	0	FALSE	2	F: 4375, M: 3786
EDUCATION	0	TRUE	5	Hig: 2330, Bac: 2242, Mas: 1658, <Hi: 1203
JOB	526	FALSE	8	Blu: 1825, Cle: 1271, Pro: 1117, Man: 988
CAR_USE	0	FALSE	2	Pri: 5132, Com: 3029
CAR_TYPE	0	FALSE	6	SUV: 2294, Min: 2145, Pic: 1389, Spo: 907
RED_CAR	0	FALSE	2	no: 5783, yes: 2378
REVOKED	0	FALSE	2	No: 7161, Yes: 1000
URBANICITY	0	FALSE	2	Hig: 6492, Hig: 1669

Variable type: numeric

skim_variable	n_missing	mean	p0	p25	p50	p75	p100	hist
INDEX	0	5151.87	1	2559	5133	7745	10302.0	▇▇▇▇▇
TARGET_AMT	0	1504.32	0	0	0	1036	107586.1	▇▁▁▁▁
KIDSDRIV	0	0.17	0	0	0	0	4.0	▇▁▁▁▁
AGE	6	44.79	16	39	45	51	81.0	▁▆▇▂▁
HOMEKIDS	0	0.72	0	0	0	1	5.0	▇▂▁▁▁
YOJ	454	10.50	0	9	11	13	23.0	▂▃▇▃▁
INCOME	445	61898.09	0	28097	54028	85986	367030.0	▇▃▁▁▁
HOME_VAL	464	154867.29	0	0	161160	238724	885282.0	▇▆▁▁▁
TRAVTIME	0	33.49	5	22	33	44	142.0	▇▇▁▁▁
BLUEBOOK	0	15709.90	1500	9280	14440	20850	69740.0	▇▆▁▁▁
TIF	0	5.35	1	1	4	7	25.0	▇▆▁▁▁
OLDCLAIM	0	4037.08	0	0	0	4636	57037.0	▇▁▁▁▁
CLM_FREQ	0	0.80	0	0	0	2	5.0	▇▂▁▁▁
MVR_PTS	0	1.70	0	0	1	3	13.0	▇▂▁▁▁
CAR_AGE	510	8.33	-3	1	8	12	28.0	▆▇▇▃▁

If we take a look at the counts of factors that previously had a “z_” prefix, we can now see that each level which was prefixed is the most common value for the variable. This doesn’t help us much with building a model, but is something interesting to point out. The other thing we can determine by the count of the factors, which we expand below, is that none of our factor variables are degenerate through having a level with extremely low counts compared to either the total population or the highest level. This means we do not need to worry about removing factor variables due to degeneracy.

JOB	n
Professional	1117
Blue Collar	1825
Clerical	1271
Doctor	246
Lawyer	835
Manager	988
Home Maker	641
Student	712
NA	526

CAR_TYPE	n
Minivan	2145
SUV	2294
Sports Car	907
Van	750
Panel Truck	676
Pickup	1389

EDUCATION	n
<High School	1203
High School	2330
Bachelors	2242
Masters	1658
PhD	728

Missing Values

For factors we are missing 526 values amongst the job column. If we look at the breakdown of all the levels we can see that there is no option for the customer to not have a job which is a very distinct possibility. Our dataset provides a potential avenue to determine if this option is true or not: By filtering on those with an NA value of a job and also indicating if they have no income or some income. As we can see below, there are actually no customers who have a job displayed as NA but also don’t have any income. There are only the two possibilities of having income which is the majority of 502 of these cases, or having a missing value for income which is 24 cases. Thus, our initial hypothesis is false and these customers probably have jobs as the majority have income. Thinking about it, there are two distinct possibilities with 500+ missing values in this column: The customer did not provide the info on their job or the customer did not fall into one of the 8 categories provided. To narrow the choice down between the two, we consider the fact that if a customer wanted to hide their job they might have also wanted to hide their income. Yet, there are only 24 cases where the income is also missing. Which is why we choose to conclude that the missing values for the job column should be grouped into a level of “other”.

While we are taking a look at missing values involving income, we also want to determine what to do with the 445 missing values present within the continuous variable of income. Since, we already have existing values of 0 for income, we know that NA is not going to be used to record no income. So, we group NA income by job and take a look at the distribution of counts. In theory, if the missing value of income meant something then it would be most prominently shown in the distribution of the most related variable here. Yet, there doesn’t seem to be a job that is particularly generating missing values. Which rules out another theory that it might be people choosing to hide their income because their income is too high. So, we classify the data that is missing from income as missing completely at random, and will later use MICE imputation in order to find an optimal value to replace these NAs.

NOINCJOB	n
FALSE	502
NA	24

JOB	NA Income Count
Professional	60
Blue Collar	98
Clerical	73
Doctor	8
Lawyer	36
Manager	50
Home Maker	43
Student	53
NA	24

Continuing with our examination of missing values within the continuous variables we observe the fact that there are 6 missing values within age, a value so minuscule compared to the total amount of data which we have that we should be able to simply remove the rows where AGE is missing without it affecting the model we are building. As 0.1% of data will likely not be providing us much information.

Years on the job, home value, and care age have a more substantial amount of data missing having between 450-510 missing values in each row. Although, the amount of data is not significant enough that it can’t be deleted after a bit more examination if the missing data is informing us of something.

With multiple columns having similar amounts of missing data we want to explore a bit further on if the rows that have missing data are connected. For example, if there were about 400 rows that were missing 3 or 4 values at the same time we would know that these missing values are dependent on each other and related. Yet, plotting the amount of missing values in each row we can see the vast majority of rows having only a single missing value and an absolute minuscule amount that have 3 or more columns with missing values. Thus, we will consider these missing values to not be interconnected.

Next we want to consider how the missing data distribution takes form with their next most related variables to attempt to detect a pattern. For years on the job we visualize below if the missing values can be attributed to one job in particular. Although we have larger amounts of missing data for the more common jobs, it is not something that we would deem as significant. Thus, we also consider YOJ data missing at random which means we will deal with it by MICE imputation later down the line.

For home values and car age, both the variables and their most related variables are continuous. So we plot out scatterplots which use jittered points under the 0 ticks on both the y and x-axis in order to determine how missing values differ in their distributions.

For home value we plot against income, as those with higher income will be able to afford a home with a higher value. The missing home values follow the general distribution of how income spreads compared to home value. One thing to note is that the distribution of income is very similar for both 0 home value and home value that is greater than 0. Which tells us that home value is related to income if there is any home value at all. Either way, we can not say the missing home values are following some sort of unusual pattern with income that would give us more information. So, we deem it missing at random and will impute the missing values through MICE.

For car age we plot against age, as older people will have had a higher chance to stick with a car longer and thus age it up. The car age values follow the general distribution of how car age spreads compared to age, which is seemingly at random but with a mean age around 45. Once again, we can deem the missing data here to be randomly missing and will MICE impute the values.

Variable Distribution

Our skimmed histograms give us an idea that most of our data is not actually normally distributed with a rightward skew being very common. We will want to adjust the distribution of those where outlier removal isn’t enough with box-cox transformations once we get to data preparation. As a variable being normally distributed lends to its predictive power when creating a linear regression model.

However, we also have the unique situation in many of these columns where our mode is overwhelmingly 0. This can be problematic to box-cox transform without shifting the columns by adding a static value. It is also possible that our data is bimodal in the case of variables like target amount or old claim which are overwhelmingly distributed towards counts of 0. Thus, we check histograms of our data with and without 0 values included for a few variables.

We begin by analyzing customer kids that can drive. Here, we see that there is a disproportionate amount of customers without kids that drive. So, we will create an extra category that splits the customers into those with kids that drive and those without kids that drive. Since, we do not quite see a natural distribution without doing this. Looking at customers that have at least one kid who drives, we can see what approximates a negative binomial distribution. Thus, we will not attempt to transform this variable.

Next we look at customer kids overall. Here, we see that there is a disproportionate amount of customers without kids as well. So, we will create an extra category that splits the customers into those with kids and those without kids. Since, we do not quite see a natural distribution without doing this. Now, looking at customers that have at least one kid, we can see what approximates a normal distribution with a slight rightward skew. Thus, we will box-cox transform this variable to better approximate a normal distribution.

Afterwards we look at customer income. Here, we see that that a significant amount of customers do not have any income. So, we will create an extra category that splits the customers into those with income and those without income. Since, we do not quite see a natural distribution without doing this. Now, looking at customers that have income, we can see what approximates a normal distribution with a strong rightward skew. Thus, we will box-cox transform this variable to better approximate a normal distribution. We should also look at outliers afterwards.

Moving on we look at customer home value. Here, we see that that a significant amount of customers do not have any home value thus do not own a home. So, we will create an extra category that splits the customers into those with a home and those without a home. Since, we see two separate distributions without doing this. Now, looking at customers that have a home, we can see what approximates a normal distribution with a strong rightward skew. Thus, we will box-cox transform this variable to better approximate a normal distribution. We should also look at outliers afterwards.

Afterwards we look at customer previous claim payout. Here, we see that that an overwhelming amount of customers do not have any previous claim payout. However, we also are able to observe three unique distributions: No previous claims, low-value previous claims, and high-value previous claims. So, we will create a new categorical variable that splits the customers into those categories of previous claim payout. As, there would not be a way that we can think of to modify the continuous value for prediction based on more than two categories we will not utilize the raw values of previous claim payout for building or models, and thus have no need to transform them.

Below we consider number of customer previous claims. Here, we see that that an overwhelming amount of customers do not have any previous claims. So, we will create an extra category that splits the customers into those with previous claims and those without claims. We get a rightward skewed close to normal distribution after splitting thus we will box-cox transform to better approximate normal.

Now we analyze customer moving violations. Here, we see that there is a significant amount of customers without moving violations. So, we will create an extra category that splits the customers into those with moving violations and those without moving violations. Looking at customers that have at least one moving violation, we can see what approximates a negative binomial distribution. Thus, we will not attempt to transform this variable.

Finally we take a look at our quantitative response variable of the crash cost and see that the portion we are concerned with, crashes that actually happened, is normally distributed but with an extremely large tail. We will definitely need to consider removing outliers along with a box-cox transformation to better normalize the data.

Target Flag Breakdown

We want to take a look into what exactly contributes to if our customers get into a car crash or not. To begin with, we take a look at the distribution of who crashes and who does not filled based on the qualitative variables that we have. It seems that we do have meaningful differences in car crash status within all variables except: If the car is red or not and the gender of the customer. These both make sense because in theory males and females should both be equally as good at not crashing along with those that do and don’t choose to rock a red car. On the other hand we see the most marked differences in the distribution of those that crash and not in between urbanicity and having a revoked license. Both of these make sense as important predictors of crashes because a person that has a revoked license inherently has been driving dangerously to get their license revoked. While those in urban areas have a lot more cars around to crash with compared to those in rural areas.

Next we look at how our quantitative variables are distributed amongst those that crashed and those that did not with boxplots. We are able to tell which quantitative variables do not seem to have much of an effect on if there is a crash or not such as those that have kids of driving age and the travel time to work. However, even with those that are very clearly different such as claim frequency amount, it is hard to tell with boxplots if the effect is coming from the amount of 0s present within the variable or not.

Which is why we create a function to plot the various numeric variables in multiple density plots that are split on if the customers crashed or not. This allows us to tell where exactly the biggest difference lies. Right away we are able to see new information that the boxplots couldn’t tell us, such as the fact that those who don’t have any kids of driving age have proportionately less crashes. Another big difference is with those that have kids at home, once more it seems that those with kids at home have proportionately less crashes than those that do.

multdens <- function(df, var, target) {
  
  df %>% 
    dplyr::select(-TARGET_AMT, -INDEX) %>% 
    filter(!is.na(.data[[var]])) %>% 
    ggplot(aes(.data[[var]], fill = .data[[target]], colour = .data[[target]])) +
      geom_density(alpha = 0.2) +
      labs(y = NULL) +
      theme_minimal(base_size = 6) + 
      theme(legend.position = c(0.85, 0.75), legend.key.size = unit(2,'mm'))
}

num_col <- train_1 |>
  dplyr::select(where(is.numeric), -TARGET_AMT, -INDEX) |>
  colnames()

dens <- lapply(num_col, multdens, df = train_1, target = "TARGET_FLAG")

plot_grid(plotlist = dens[1:5], label_x = NULL, ncol=1)

Moving on to our next set of density plots we see what seems to be a significant difference in a crash occuring for people who do have an old claim versus those that do not.

plot_grid(plotlist = dens[6:10], label_x = NULL, ncol=1)

We once more see this distribution pattern of the variables counting the amount of previous claims and motor vehicle records points where those without any in a much smaller proportion of crashes than those with at least one. Each of these variables that we have brought up in this section were also spotted in the variable distribution section as candidates for having variables that simply indicate if they have 0 or greater than 0 value. Thus, our decision to create these columns has been reinforced.

plot_grid(plotlist = dens[11:13], label_x = NULL, ncol=1)

Target Amount Breakdown

Next we start to exam how our quantitative response variable, the cost of a crash is related to the other variables. The first thing we do is create another function that will create bar charts of how the mean of the response variable is distributed between classes of each qualitative variable. This calculation excludes data where the customer has not had a car crash because that is the true dataset we will be training on down the line. We see some interesting results in that every categorical variable turns out to have an effect on the mean car crash cost. Urbanicity seems to have the least difference in crash costs. While job has the widest range in average cost of crash between classes.

Afterwards we attempt to visualize the relationship between crash values above zero and the quantitative predictor variables. We end up seeing highly heteroscedastic data on our scatterplots where the variance in age ends up changing as we progress through our various predictor values. This usually does not turn out well for models that rely on normal assumptions without transforming the data. However, in this case we know our data distribution and do not want to change the distribution of our predictors too much.

Still, we are able to see some patterns such as when claim frequency increases the value of each crash seems to get closer to 0. Motor vehicle record points, kids at home, and kids at driving age seem to follow a similar pattern. Although, we should consider the fact that this is coming from our 0 values in each class having large amounts more data and thus a higher chance of having a point that highly deviates from the mean.

Correlation

For the last part of data exploration we take a look at what variables might be colinear, and which ones might be correlated to our response variables. Below we plot a correlation plot focusing on each variable except crash value as we won’t get useful info without limiting the crashes to those that are above a value of zero (aka they actually happened). It is a bit hard to see the plot in its entirety, but resizing it and making it fullscreen allows us to read the information.

We get some interesting multicollinearity between columns such as gender being related to owning a red car, type of car, how the car is used, and profession. Although none of the colinearity is particularly strong besides females being more likely to own red cars than males. Since gender is not very correlated to having a car crash or not, it would be smart to not utilize it when dplyr::selecting variables because of this colinearity.

We have some decent correlation between those that have a crash and the category of job the customer has, urbanicity, if they are a parent, and having old claims or motor vehicle record points.

We then check the correlation plot to determine correlation between crash value and the other variables. Unfortunately, the correlation between most of these seems to be particularly weak. We even have counterintuitive correlations such as crash value decreasing with bluebook value of the customer’s car. We’ll have to take a closer look at the significance of each variable when building models to determine which is the right to use in modeling crash value.

Next we attempt to visualize the correlation between variables in a categorical manner with an interactive correlation funnel plot. This makes it easier to see how correlated different categorical variables are too each other. Unfortunately, we don’t really glean new data but it is more convenient viewing the multitude of categorical variables this way.

We create another correlation funnel that is plotting against target amount binned into above and below the median. We still have extremely low correlation among the data here for every point. Making it seem like fitting a model to this data for determining crash value might not provide us with great information.

§2 Data Preparation

After exploring our data we move on to preparing it to be more useful for our goal of creating a linear model to predict team wins.

Subsections:
- Missing Values
- Distribution Transformation
- Interaction Columns
- Correlation

Missing Values

When there are missing values within our data, attempting to create a model from the data will end up dropping multitudes of rows used to calculate the model. To avoid this loss of data we will deal with the missing values ourselves.

Transforming Missing Values to New Level

We transform the missing values within the job column to an aggregate other column to catch jobs that might not be included within our initial levels.

## [1] Professional Blue Collar  Clerical     Doctor       Lawyer      
## [6] Manager      Other        Home Maker   Student     
## 9 Levels: Professional Blue Collar Clerical Doctor Lawyer ... Other

Removing Rows with Missing Values

As discussed in our data exploration we simply remove the minimal rows that have NA values in the age column.

## There are 0 NA values in the age column

MICE Impute the Mean for Other Missing Values

Finally, the remaining columns that still have NAs in them have all been analyzed and deemed as suitable for MICE imputation. Through MICE imputation we are essentially taking a look at multiple ways to impute a model and determining which way leaves the data closest to what it most likely would have been with the NA truly replaced.

Additionally we remove the outlier of negative car age that we have and turn it into an NA value to be processed through imputation.

##       INDEX TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS 
##           0           0           0           0           0           0 
##         YOJ      INCOME     PARENT1    HOME_VAL     MSTATUS         SEX 
##           0           0           0           0           0           0 
##   EDUCATION         JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF 
##           0           0           0           0           0           0 
##    CAR_TYPE     RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS 
##           0           0           0           0           0           0 
##     CAR_AGE  URBANICITY 
##           0           0

New Categorical Columns

As we discussed in the data exploration, we saw many quantitative variables that seemed to have two separate distributions when considering having a zero value and being greater than zero. Due to this we believe that adding an indicator column for if a value is greater than zero in these cases would allow for interaction variables that will optimally train values to the distribution outside of the oppressive mode of the zero value. We additionally do the same for those cars that have been owned for a year or less to classify them as new cars when training our model.

##  [1] "INDEX"       "TARGET_FLAG" "TARGET_AMT"  "KIDSDRIV"    "AGE"        
##  [6] "HOMEKIDS"    "YOJ"         "INCOME"      "PARENT1"     "HOME_VAL"   
## [11] "MSTATUS"     "SEX"         "EDUCATION"   "JOB"         "TRAVTIME"   
## [16] "CAR_USE"     "BLUEBOOK"    "TIF"         "CAR_TYPE"    "RED_CAR"    
## [21] "OLDCLAIM"    "CLM_FREQ"    "REVOKED"     "MVR_PTS"     "CAR_AGE"    
## [26] "URBANICITY"  "HOME_VAL_Y"  "HOMEKIDS_Y"  "INCOME_Y"    "KIDSDRIV_Y" 
## [31] "YOJ_Y"       "OLDCLAIM_Y"  "MVR_PTS_Y"   "CAR_AGE_Y"

Transformation Discussion

We have previously discussed transforming to improve distribution approximation towards normality with some of our quantitative variables in our data. However, upon further consideration, it would lead to the interaction terms we will be creating from our greater than 0 columns to not work quite as well. This will also increase interpretability of our final models. Thus, we will not be transforming the distributions of any of our variables.

Correlation

After all of our data preparation is done we want to revisit the correlation plot to see if any of the transformations we have done have majorly changed the correlation values. We do not seem to see changes in existing correlation values with the missing values imputed. Which is good as it shows the data structure remains mainly the same. Our new indicator columns do seem to have about the same amount of correlation as their original columns when binned as shown in the correlation funnel. In some cases like determining if the customer has any old claims at all we have a higher correlation than the original quantitative variable itself, an excellent change. Yet, these columns will show their power in the interaction terms that they create down the line.

§3 Build Models

Subsections:
- Logistic Models
- Multiple Linear Regression Models

Logistic Models

We begin our model building with the logistic models that we will use in order to predict if a customer is likely to crash or not.

Model 1

Raw data fed into a stepwise function

Model Description

The first model that we will tackle is a generalized linear model containing all the data without any selections. This is to determine if our selections may have had the opposite effect on the logistic regression model and weakened it. We utilize backward stepwise regression to gradually remove variables from it, determining how significant they may be within our model.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + INCOME + 
##     PARENT1 + MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + REVOKED + MVR_PTS + 
##     URBANICITY + HOME_VAL_Y + HOMEKIDS_Y + INCOME_Y + OLDCLAIM_Y, 
##     family = "binomial", data = train_1)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -7.912e-01  2.146e-01  -3.687 0.000227 ***
## KIDSDRIV                       3.719e-01  6.031e-02   6.167 6.96e-10 ***
## HOMEKIDS                      -8.770e-02  5.328e-02  -1.646 0.099782 .  
## YOJ                            1.610e-02  1.069e-02   1.507 0.131868    
## INCOME                        -3.997e-06  1.035e-06  -3.863 0.000112 ***
## PARENT1Yes                     2.096e-01  1.212e-01   1.729 0.083793 .  
## MSTATUSYes                    -5.449e-01  9.107e-02  -5.984 2.18e-09 ***
## EDUCATION.L                   -2.297e-01  1.558e-01  -1.474 0.140472    
## EDUCATION.Q                    1.757e-01  9.479e-02   1.854 0.063743 .  
## EDUCATION.C                    1.354e-01  8.163e-02   1.659 0.097125 .  
## EDUCATION^4                   -1.739e-01  7.074e-02  -2.458 0.013956 *  
## JOBBlue Collar                 1.697e-01  1.202e-01   1.412 0.157898    
## JOBClerical                    2.742e-01  1.252e-01   2.191 0.028463 *  
## JOBDoctor                     -5.917e-01  2.763e-01  -2.142 0.032231 *  
## JOBLawyer                     -7.087e-02  1.678e-01  -0.422 0.672814    
## JOBManager                    -7.077e-01  1.318e-01  -5.370 7.85e-08 ***
## JOBHome Maker                 -1.153e-01  1.637e-01  -0.704 0.481126    
## JOBStudent                    -2.293e-01  1.660e-01  -1.381 0.167148    
## JOBOther                      -1.553e-01  1.787e-01  -0.869 0.384592    
## TRAVTIME                       1.480e-02  1.893e-03   7.819 5.31e-15 ***
## CAR_USECommercial              7.578e-01  9.237e-02   8.205 2.31e-16 ***
## BLUEBOOK                      -2.242e-05  4.748e-06  -4.722 2.33e-06 ***
## TIF                           -5.482e-02  7.372e-03  -7.437 1.03e-13 ***
## CAR_TYPESUV                    7.021e-01  8.653e-02   8.114 4.91e-16 ***
## CAR_TYPESports Car             9.623e-01  1.080e-01   8.911  < 2e-16 ***
## CAR_TYPEVan                    6.301e-01  1.228e-01   5.131 2.87e-07 ***
## CAR_TYPEPanel Truck            5.882e-01  1.513e-01   3.887 0.000101 ***
## CAR_TYPEPickup                 5.636e-01  1.010e-01   5.580 2.41e-08 ***
## OLDCLAIM                      -2.099e-05  4.223e-06  -4.970 6.68e-07 ***
## REVOKEDYes                     9.591e-01  9.315e-02  10.296  < 2e-16 ***
## MVR_PTS                        9.423e-02  1.412e-02   6.672 2.52e-11 ***
## URBANICITYHighly Rural/ Rural -2.362e+00  1.137e-01 -20.782  < 2e-16 ***
## HOME_VAL_Y1                   -3.499e-01  8.455e-02  -4.139 3.49e-05 ***
## HOMEKIDS_Y1                    4.118e-01  1.339e-01   3.075 0.002105 ** 
## INCOME_Y1                     -7.547e-01  1.740e-01  -4.337 1.45e-05 ***
## OLDCLAIM_Y1                    6.385e-01  7.871e-02   8.112 4.98e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9404.0  on 8154  degrees of freedom
## Residual deviance: 7243.7  on 8119  degrees of freedom
## AIC: 7315.7
## 
## Number of Fisher Scoring iterations: 5

For our coefficients we can see a few of the most important predictors for a crash. Those that have kids at a driving age are more likely to have a crash on record with each additional kid of driving age. This makes sense as a driving age kid is an experienced user of your car that could be liable to crash it. Meanwhile, those with kids at home are less likely to crash, possibly explained by parents being less reckless drivers because they realize they have kids waiting for them at home.

Two of the highest magnitude coefficient predictors belong to those who drive sports cars and those that have their license revoked. These customers have high coefficients which means they are more liable to crash, possibly due to sports car drivers wanting to drive fast and reckless to show off the power of their cars while those with a revoked license having a proven record of reckless driving.

Model Results

Model 1 gives us an AIC of 7315.7 which is a fairly large AIC.

## 
## Call:
## roc.formula(formula = TARGET_FLAG ~ glm_1$fitted.values, data = train_1,     plot = TRUE, main = "ROC CURVE", col = "#6aaec8")
## 
## Data: glm_1$fitted.values in 6007 controls (TARGET_FLAG 0) < 2148 cases (TARGET_FLAG 1).
## Area under the curve: 0.8164

We additionally get an AUC of 0.8164 which isn’t terrible but is pretty far from ideal.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5537 1235
##          1  470  913
##                                           
##                Accuracy : 0.7909          
##                  95% CI : (0.7819, 0.7997)
##     No Information Rate : 0.7366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3916          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9218          
##             Specificity : 0.4250          
##          Pos Pred Value : 0.8176          
##          Neg Pred Value : 0.6602          
##              Prevalence : 0.7366          
##          Detection Rate : 0.6790          
##    Detection Prevalence : 0.8304          
##       Balanced Accuracy : 0.6734          
##                                           
##        'Positive' Class : 0               
##

We are able to see various classification assessment measures calculated from our confusion matrix. At first glance an accuracy of 79% doesn’t seem that bad. However, we then look at the specificity and see that only 42.6% of customers that actually crash are predicted as false. Our detection prevalence combines with us to show that our model is biased in predicting customers as not crashing, this is likely due to the fact that the majority class is those that did not crash.

This is not going to be ideal for an insurance company attempting to decide who to choose to take as a customer. As we would lose more money by not correctly identifying those who are prone to crashing and pooling them into those who might not crash versus the other way around. If we simply choose not to take a customer that was incorrectly classified as liable to crash than we are only losing out on potential revenue instead of gaining potential costs.

We want to try rebalancing the classes in our next model generation to remedy this problem.

Model Evaluation

Looking at our marginal model plots we can see that we do not have particularly good fits within our data simply due to the fact that our various quantitative variables aren’t separated in a straight forward manner between the classes of those that crash and those that do not.

Meanwhile, we have very small cook’s distances with at most 0.0025 of distance for what could be classified as “outliers”. Thus, with our model we do not have any influential outliers.

Model 2

Interaction based model fed into a stepwise function.

Model Description

The second model that we will tackle is a generalized linear model containing the data with interaction variables that we create based on split distributions observed during data exploration. We utilize both forward and backwards stepwise regression to gradually add variables from it, determining how significant they may be within our model.

## 
## Call:
## glm(formula = TARGET_FLAG ~ HOME_VAL_Y + HOMEKIDS_Y + INCOME_Y + 
##     KIDSDRIV_Y + OLDCLAIM_Y + MVR_PTS_Y + CAR_AGE_Y + HOME_VAL_Y:HOME_VAL + 
##     INCOME_Y:INCOME + KIDSDRIV_Y:KIDSDRIV + MVR_PTS_Y:MVR_PTS + 
##     CAR_AGE_Y:CAR_AGE, family = "binomial", data = train_1)
## 
## Coefficients: (4 not defined because of singularities)
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -9.454e-01  1.278e-01  -7.399 1.38e-13 ***
## HOME_VAL_Y1          -3.412e-01  1.212e-01  -2.816  0.00487 ** 
## HOMEKIDS_Y1           2.975e-01  6.339e-02   4.694 2.68e-06 ***
## INCOME_Y1            -2.085e-01  1.004e-01  -2.078  0.03772 *  
## KIDSDRIV_Y1           1.587e-01  1.799e-01   0.883  0.37749    
## OLDCLAIM_Y1           8.147e-01  5.975e-02  13.635  < 2e-16 ***
## MVR_PTS_Y1            7.341e-02  7.820e-02   0.939  0.34787    
## CAR_AGE_Y1           -1.184e+01  1.866e+02  -0.063  0.94941    
## HOME_VAL_Y0:HOME_VAL         NA         NA      NA       NA    
## HOME_VAL_Y1:HOME_VAL -1.264e-06  5.495e-07  -2.300  0.02145 *  
## INCOME_Y0:INCOME             NA         NA      NA       NA    
## INCOME_Y1:INCOME     -2.250e-06  9.942e-07  -2.263  0.02361 *  
## KIDSDRIV_Y0:KIDSDRIV         NA         NA      NA       NA    
## KIDSDRIV_Y1:KIDSDRIV  1.591e-01  1.099e-01   1.447  0.14781    
## MVR_PTS_Y0:MVR_PTS           NA         NA      NA       NA    
## MVR_PTS_Y1:MVR_PTS    1.077e-01  1.758e-02   6.127 8.95e-10 ***
## CAR_AGE_Y0:CAR_AGE   -1.981e-02  7.963e-03  -2.487  0.01287 *  
## CAR_AGE_Y1:CAR_AGE    1.191e+01  1.866e+02   0.064  0.94912    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9404.0  on 8154  degrees of freedom
## Residual deviance: 8459.5  on 8141  degrees of freedom
## AIC: 8487.5
## 
## Number of Fisher Scoring iterations: 11

Immediately what draws our eyes are the NA coefficients. In this case the dummy variable and the original variable are deemed as too highly correlated to be able to calculate a coefficient for. However, our model still gets built with the interaction variables where the dummy variable is set to 1. Which succeeds in essentially excluding the zero values from our predictions and only focusing on the values above zero for those quantitative variables while the dummy variable focuses on the values at zero.

For our coefficients we can see a few of the most important predictors for a crash that are familiar. Those that have kids at a driving age are more likely to have a crash on record with each additional kid of driving age. This makes sense as a driving age kid is an experienced user of your car that could be liable to crash it. However, those with kids at home are now seen as more likely to crash, which doesn’t have a very good explanation. Thus, this model seems a bit fishy.

One of the highest magnitude coefficient predictors now belongs to those who have had an old claim. These customers have high coefficients which means they are more liable to crash, possibly as those with an old claim having a proven record of reckless driving.

Model Results

Model 2 gives us an AIC of 8487.5 which is much worse off than our previous model’s AIC of 7315.7.

## 
## Call:
## roc.formula(formula = TARGET_FLAG ~ glm_2$fitted.values, data = train_1,     plot = TRUE, main = "ROC CURVE", col = "#6aaec8")
## 
## Data: glm_2$fitted.values in 6007 controls (TARGET_FLAG 0) < 2148 cases (TARGET_FLAG 1).
## Area under the curve: 0.7172

We additionally get an AUC of 0.7172 which is again worse than our previous model’s AUC of 0.8164 and very close to the threshold of .7 which a good model should beat.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5700 1739
##          1  307  409
##                                           
##                Accuracy : 0.7491          
##                  95% CI : (0.7396, 0.7585)
##     No Information Rate : 0.7366          
##     P-Value [Acc > NIR] : 0.005189        
##                                           
##                   Kappa : 0.1773          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9489          
##             Specificity : 0.1904          
##          Pos Pred Value : 0.7662          
##          Neg Pred Value : 0.5712          
##              Prevalence : 0.7366          
##          Detection Rate : 0.6990          
##    Detection Prevalence : 0.9122          
##       Balanced Accuracy : 0.5697          
##                                           
##        'Positive' Class : 0               
##

We are able to see various classification assessment measures calculated from our confusion matrix. Where we repeat the same issues that we have with the first model but with worse values: At first glance an accuracy of 74.9% doesn’t seem that bad. However, we then look at the specificity and see that only 19.0% of customers that actually crash are predicted as false. Our detection prevalence combines with this to show that our model is biased in predicting customers as not crashing, this is likely due to the fact that the majority class is those that did not crash.

We want to try rebalancing the classes in our next model generation to remedy this problem.

Model Evaluation

Looking at our marginal model plot we can see that we do not have particularly good fits within our data simply due to the fact that our various quantitative variables aren’t separated in a straight forward manner between the classes of those that crash and those that do not.

Meanwhile, we again have very small cook’s distances with at most 0.0025 of distance for what could be classified as “outliers”. Thus, with our model we do not have any influential outliers.

Overall, it seems the first logistic regression model we fit was better than this.

Model 3

All variables and interactions fed into a stepwise function.

Model Description

The third model that we will tackle is a generalized linear model containing the data with interaction variables that we create based on split distributions during data exploration. Additionally, we will include every other variable in the dataset. We utilize backwards stepwise regression to gradually remove variables from it, determining how significant they may be within our model.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + INCOME + 
##     PARENT1 + MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + REVOKED + MVR_PTS + 
##     URBANICITY + HOME_VAL_Y + HOMEKIDS_Y + INCOME_Y + OLDCLAIM_Y, 
##     family = "binomial", data = train_1)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -7.912e-01  2.146e-01  -3.687 0.000227 ***
## KIDSDRIV                       3.719e-01  6.031e-02   6.167 6.96e-10 ***
## HOMEKIDS                      -8.770e-02  5.328e-02  -1.646 0.099782 .  
## YOJ                            1.610e-02  1.069e-02   1.507 0.131868    
## INCOME                        -3.997e-06  1.035e-06  -3.863 0.000112 ***
## PARENT1Yes                     2.096e-01  1.212e-01   1.729 0.083793 .  
## MSTATUSYes                    -5.449e-01  9.107e-02  -5.984 2.18e-09 ***
## EDUCATION.L                   -2.297e-01  1.558e-01  -1.474 0.140472    
## EDUCATION.Q                    1.757e-01  9.479e-02   1.854 0.063743 .  
## EDUCATION.C                    1.354e-01  8.163e-02   1.659 0.097125 .  
## EDUCATION^4                   -1.739e-01  7.074e-02  -2.458 0.013956 *  
## JOBBlue Collar                 1.697e-01  1.202e-01   1.412 0.157898    
## JOBClerical                    2.742e-01  1.252e-01   2.191 0.028463 *  
## JOBDoctor                     -5.917e-01  2.763e-01  -2.142 0.032231 *  
## JOBLawyer                     -7.087e-02  1.678e-01  -0.422 0.672814    
## JOBManager                    -7.077e-01  1.318e-01  -5.370 7.85e-08 ***
## JOBHome Maker                 -1.153e-01  1.637e-01  -0.704 0.481126    
## JOBStudent                    -2.293e-01  1.660e-01  -1.381 0.167148    
## JOBOther                      -1.553e-01  1.787e-01  -0.869 0.384592    
## TRAVTIME                       1.480e-02  1.893e-03   7.819 5.31e-15 ***
## CAR_USECommercial              7.578e-01  9.237e-02   8.205 2.31e-16 ***
## BLUEBOOK                      -2.242e-05  4.748e-06  -4.722 2.33e-06 ***
## TIF                           -5.482e-02  7.372e-03  -7.437 1.03e-13 ***
## CAR_TYPESUV                    7.021e-01  8.653e-02   8.114 4.91e-16 ***
## CAR_TYPESports Car             9.623e-01  1.080e-01   8.911  < 2e-16 ***
## CAR_TYPEVan                    6.301e-01  1.228e-01   5.131 2.87e-07 ***
## CAR_TYPEPanel Truck            5.882e-01  1.513e-01   3.887 0.000101 ***
## CAR_TYPEPickup                 5.636e-01  1.010e-01   5.580 2.41e-08 ***
## OLDCLAIM                      -2.099e-05  4.223e-06  -4.970 6.68e-07 ***
## REVOKEDYes                     9.591e-01  9.315e-02  10.296  < 2e-16 ***
## MVR_PTS                        9.423e-02  1.412e-02   6.672 2.52e-11 ***
## URBANICITYHighly Rural/ Rural -2.362e+00  1.137e-01 -20.782  < 2e-16 ***
## HOME_VAL_Y1                   -3.499e-01  8.455e-02  -4.139 3.49e-05 ***
## HOMEKIDS_Y1                    4.118e-01  1.339e-01   3.075 0.002105 ** 
## INCOME_Y1                     -7.547e-01  1.740e-01  -4.337 1.45e-05 ***
## OLDCLAIM_Y1                    6.385e-01  7.871e-02   8.112 4.98e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9404.0  on 8154  degrees of freedom
## Residual deviance: 7243.7  on 8119  degrees of freedom
## AIC: 7315.7
## 
## Number of Fisher Scoring iterations: 5

What happens here is we don’t have any of our coefficients anymore. In fact, the stepwise variable selection has returned us to our first model. Apparently none of our interaction variables were significant enough to be worth keeping when considering the original variables that we started with.

For our coefficients we can see a few of the most important predictors for a crash. Those that have kids at a driving age are more likely to have a crash on record with each additional kid of driving age. This makes sense as a driving age kid is an experienced user of your car that could be liable to crash it. Additionally, those with kids at home are now seen as more likely to crash.

Model Results

Model 3 gives us an AIC of 7315.7 which is the same as our first model’s AIC of 7315.7.

## 
## Call:
## roc.formula(formula = TARGET_FLAG ~ glm_3$fitted.values, data = train_1,     plot = TRUE, main = "ROC CURVE", col = "#6aaec8")
## 
## Data: glm_3$fitted.values in 6007 controls (TARGET_FLAG 0) < 2148 cases (TARGET_FLAG 1).
## Area under the curve: 0.8164

We additionally get an AUC of 0.8164 which is the same as our first model’s AUC of 0.8164.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5537 1235
##          1  470  913
##                                           
##                Accuracy : 0.7909          
##                  95% CI : (0.7819, 0.7997)
##     No Information Rate : 0.7366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3916          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9218          
##             Specificity : 0.4250          
##          Pos Pred Value : 0.8176          
##          Neg Pred Value : 0.6602          
##              Prevalence : 0.7366          
##          Detection Rate : 0.6790          
##    Detection Prevalence : 0.8304          
##       Balanced Accuracy : 0.6734          
##                                           
##        'Positive' Class : 0               
##

Model Evaluation

Meanwhile, we again have very small cook’s distances with at most 0.0025 of distance for what could be classified as “outliers”. Thus, with our model we do not have any influential outliers.

Overall, it seems the first logistic regression model we fit was the best one as we have returned back to it.

Multiple Linear Regression Models

Model 1

Raw data fed into a stepwise function

Model Description

The first model that we will tackle is a baseline linear regression model from our data that will not use any of our interactions. This is to determine if our transformations may have had the opposite effect on the regression model and weakened it. We utilize automatic step wise regression to gradually add variables to it, determining how significant they may be within our model.

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8010  -3154  -1577    412 100379 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4352.71237  433.41458  10.043  < 2e-16 ***
## BLUEBOOK       0.10769    0.02007   5.366 8.94e-08 ***
## MVR_PTS      129.20710   64.37390   2.007   0.0449 *  
## SEXF        -658.15793  334.75977  -1.966   0.0494 *  
## REVOKEDYes  -678.46505  410.24301  -1.654   0.0983 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7686 on 2143 degrees of freedom
## Multiple R-squared:  0.01877,    Adjusted R-squared:  0.01694 
## F-statistic: 10.25 on 4 and 2143 DF,  p-value: 3.216e-08

We get quiet interesting results here with only the variables of bluebook value, motor vehicle record points, gender, and a previously revoked license affecting crash value. The bluebook record value makes sense as it increases so should the value of the crash. However, the other variables do not quite make sense in reducing or increasing crash value. These coefficients are close to not being significant and the model does not have a very good R-squared. Thus, these coefficients could have just been coincidentally fit.

Model Results

Model 1 gives us an R^2 of 0.01877 with a RSE of 7686.

The F-statistic is 10.25 on 4 and 2143 DF.

Model Evaluation

The residuals for model_1 are normally distributed up until the first positive theoretical quantile as seen from the Normal Q-Q, there is a steep incline in residuals past this point as our model seems to vastly underpredict values at the high end. It should be noted that the variance of the residuals is not heteroscedastic because the max variance clusters around the center of the fitted values while decreasing on both sides in the Residuals vs Fitted graph. The residuals can be considered to be independent here. Overall, our model does not seem to pass assumptions.

Model 2

Interaction and all variables fed into a stepwise function.

Model Description

The second model that we will generate is a linear regression model from the dataset with our interaction variables included. This is to determine if the interaction variables may have increased the predictive power of the regression model. We utilize automatic step wise regression to gradually remove variables to it, determining how significant they may be within our model.

## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8010  -3154  -1577    412 100379 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4352.71237  433.41458  10.043  < 2e-16 ***
## BLUEBOOK       0.10769    0.02007   5.366 8.94e-08 ***
## MVR_PTS      129.20710   64.37390   2.007   0.0449 *  
## SEXF        -658.15793  334.75977  -1.966   0.0494 *  
## REVOKEDYes  -678.46505  410.24301  -1.654   0.0983 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7686 on 2143 degrees of freedom
## Multiple R-squared:  0.01877,    Adjusted R-squared:  0.01694 
## F-statistic: 10.25 on 4 and 2143 DF,  p-value: 3.216e-08

Perhaps unsurprisingly our interaction variables are once more optimized out by the stepwise function. We end up with our original multiple linear regression model again. Thus, the rest of our model analysis remains the same:

Model Results

Model 2 gives us an R^2 of 0.01877 with a RSE of 7686.

The F-statistic is 10.25 on 4 and 2143 DF.

Model Evaluation

The residuals for model_2 are normally distributed up until the first positive theoretical quantile as seen from the Normal Q-Q, there is a steep incline in residuals past this point as our model seems to vastly underpredict values at the high end. It should be noted that the variance of the residuals is not heteroscedastic because the max variance clusters around the center of the fitted values while decreasing on both sides in the Residuals vs Fitted graph. The residuals can be considered to be independent here. Overall, our model does not seem to pass assumptions.

§4 Select Models

Keeping glm_1 for our logistic regression model to predict crash chance and model_1 for our multiple linear regression model to predict crash costs seem like the best ideas. For our logistic regression model we highly want to optimize for specificity to reduce costs to the insurance company which is why we choose glm_1 that simply utilized all of the variables without interactions. model_1 was really the only option for our multiple linear regression model because both the model that fit interaction variables and the one that did not converged to this model. Inherently giving it the highest adjusted R^2 and RMSE.

Thus, we make our predictions with said models and create the csv “evaluated.csv” to export.

Conclusion

It seems that for our logistic regression model we were able to get a fairly competent model if we looked at accuracy alone. However, our imbalanced data posed a roadblock in impeding our specificity, which is an important factor for the insurance company’s goals. SMOTE was unable to be used for resampling the data as it was incompatible with categorical data. Our variables also were not optimally separated for a good logistic regression model. Separating the models further with linear discriminate analysis may provide more promising results.

The simple linear regression model simply was not suited for the predictive analysis that we are doing on crash value with this dataset. There were no variables that had any noticeable correlation with crash value. This is highlighted with the extremely small R^2 and high RSE. The fitted vs residual plot also shows that there is high heteroscedasticity within the residuals of the linear regression model. Ideally, we would source new data to help predict crash values such as more info on the car the customer is actually driving to determine value.

Yet we did our best with the data and techniques that we did have, even if the results were personally unsatisfying..

Code Appendix

library(tidyverse)
library(olsrr)
library(MASS)
library(skimr)
library(DataExplorer)
library(corrplot)
library(fabletools)
library(ggfortify)
library(mice)
library(caret)
library(naniar)
library(cowplot)
library(correlationfunnel)
library(car)
library(pROC)
library(smotefamily)
train_1 <- read_csv("insurance_training_data.csv", show_col_types = FALSE)
eval_1 <- read_csv("insurance-evaluation-data.csv", show_col_types = FALSE)
glimpse(train_1)
glimpse(train_1)
train_1 |>
  mutate(
    across(
      c(TARGET_FLAG, PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY),
      as_factor
    )
  )  -> train_1

eval_1 |>
  mutate(
    across(
      c(PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY),
      as_factor
    )
  )  -> eval_1

train_1 |>
  dplyr::select(where(is.factor)) |>
  sapply(levels)
train_1 |>
  mutate(
    across(
      where(is.factor),
      ~factor(.x, labels = str_replace(levels(.x),"z_",""))
    )
  ) -> train_1

eval_1 |>
  mutate(
    across(
      where(is.factor),
      ~factor(.x, labels = str_replace(levels(.x),"z_",""))
    )
  ) -> eval_1

train_1 |>
  dplyr::select(MSTATUS, EDUCATION, JOB, CAR_TYPE, URBANICITY) |>
  sapply(levels)
train_1 |>
  mutate(
    EDUCATION = ordered(EDUCATION, c("<High School", "High School", "Bachelors", "Masters", "PhD"))
  ) -> train_1

eval_1 |>
  mutate(
    EDUCATION = ordered(EDUCATION, c("<High School", "High School", "Bachelors", "Masters", "PhD"))
  ) -> eval_1

train_1 |>
  pull(EDUCATION) |>
  unique()
train_1 |>
  dplyr::select(where(is.character)) |>
  glimpse()
train_1 |>
  mutate(
    across(
      where(is.character),
      parse_number)
    ) -> train_1

eval_1 |>
  mutate(
    across(
      where(is.character),
      parse_number)
    ) -> eval_1

train_1 |>
  glimpse()
train_1 |>
  skim() |>
  dplyr::select(-complete_rate, -numeric.sd) |>
  mutate(numeric.mean = format(numeric.mean, scientific = FALSE, digits = 2))
d1 <- train_1 |>
  dplyr::select(where(is.factor)) |>
  group_by(JOB) |>
  summarise(n = n())
d2 <- train_1 |>
  dplyr::select(where(is.factor)) |>
  group_by(CAR_TYPE) |>
  summarise(n = n())
d3 <- train_1 |>
  dplyr::select(where(is.factor)) |>
  group_by(EDUCATION) |>
  summarise(n = n())

knitr::kable(
  list(d1, d2, d3)
)
d1 <- train_1 |>
  filter(is.na(JOB)) |>
  mutate(NOINCJOB = if_else(INCOME == 0, TRUE, FALSE)) |>
  group_by(NOINCJOB) |>
  summarise(n = n())

d2 <- train_1 |>
  filter(is.na(INCOME)) |>
  group_by(JOB) |>
  summarise(`NA Income Count` = n())

knitr::kable(
  list(d1, d2)
)
tibble(NAs = rowSums(is.na(train_1))) |>
  filter(NAs != 0) |>
  group_by(NAs) |>
  summarise(n = n()) |>
  ggplot(aes(x = NAs, y = n)) +
    geom_bar(stat="identity") +
    labs(title = "Combined missing values are rare", subtitle = "# NAs per row", y = NULL, x = NULL) +
    theme_minimal()
train_1 |>
  filter(is.na(YOJ)) |>
  group_by(JOB) |>
  summarise(`NA YOJ Count` = n()) |>
  arrange(desc(`NA YOJ Count`)) |>
  ggplot(aes(x = reorder(JOB, -`NA YOJ Count`), y = `NA YOJ Count`)) +
    geom_bar(stat="identity") +
    labs(title = "YOJ Missing values can't be attributed to a single job", subtitle = "# NAs in YOJ by job", y = NULL, x = NULL) +
    theme_minimal()
p1 <- train_1 |>
  ggplot(aes(x=INCOME, y=HOME_VAL)) +
    geom_miss_point() +
    theme_minimal() +
    labs(title = "Missing home values don't significantly differ in income distribution", y = "Home value", x = "Income")

p2 <- train_1 |>
  ggplot(aes(x=AGE,y=CAR_AGE)) +
    geom_miss_point() +
    theme_minimal() +
    labs(title = "Missing car ages don't significantly differ in age distribution", y = "Car age", x = "Age")

plot_grid(p1, p2, ncol = 1)
p1 <- train_1 |>
  #filter(KIDSDRIV != 0) |>
  ggplot(aes(x= KIDSDRIV)) +
    geom_histogram(bins = 5) +
    labs(title = "Distribution of customer kids that drive", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(KIDSDRIV != 0) |>
  ggplot(aes(x= KIDSDRIV)) +
    geom_histogram(bins = 4) +
    labs(title = NULL, x = "# of kids that drive") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(HOMEKIDS != 0) |>
  ggplot(aes(x= HOMEKIDS)) +
    geom_histogram(bins = 6) +
    labs(title = "Distribution of customer kids", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(HOMEKIDS != 0) |>
  ggplot(aes(x= HOMEKIDS)) +
    geom_histogram(bins = 5) +
    labs(title = NULL, x = "# of kids") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(INCOME != 0) |>
  ggplot(aes(x= INCOME)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = "Distribution of customer income", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(INCOME != 0) |>
  ggplot(aes(x= INCOME)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = NULL, x = "Income") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(HOME_VAL != 0) |>
  ggplot(aes(x= HOME_VAL)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = "Distribution of customer home value", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(HOME_VAL != 0) |>
  ggplot(aes(x= HOME_VAL)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = NULL, x = "Home value") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(OLDCLAIM != 0) |>
  ggplot(aes(x= OLDCLAIM)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = "Distribution of customer previous claim payout", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(OLDCLAIM != 0) |>
  ggplot(aes(x= OLDCLAIM)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = NULL, x = NULL) +
    theme_minimal()

p3 <- train_1 |>
  filter(OLDCLAIM > 15000) |>
  ggplot(aes(x= OLDCLAIM)) +
    geom_histogram(bins=30, na.rm = TRUE) +
    labs(title = NULL, x = "Previous claim payout") +
    theme_minimal()

plot_grid(p1,p2,p3,ncol=1)
p1 <- train_1 |>
  #filter(CLM_FREQ != 0) |>
  ggplot(aes(x= CLM_FREQ)) +
    geom_histogram(bins=6, na.rm = TRUE) +
    labs(title = "Distribution of customer claims", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(CLM_FREQ != 0) |>
  ggplot(aes(x= CLM_FREQ)) +
    geom_histogram(bins=5, na.rm = TRUE) +
    labs(title = NULL, x = "# of previous claims") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(MVR_PTS != 0) |>
  ggplot(aes(x= MVR_PTS)) +
    geom_histogram(bins=14, na.rm = TRUE) +
    labs(title = "Distribution of customer moving violations", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(MVR_PTS != 0) |>
  ggplot(aes(x= MVR_PTS)) +
    geom_histogram(bins=13, na.rm = TRUE) +
    labs(title = NULL, x = "# of previous moving violations") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(TARGET_AMT != 0) |>
  ggplot(aes(x= TARGET_AMT)) +
    geom_histogram(bins=100, na.rm = TRUE) +
    labs(title = "Distribution of customer crash cost", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(TARGET_AMT != 0) |>
  ggplot(aes(x= TARGET_AMT)) +
    geom_histogram(bins=100, na.rm = TRUE) +
    labs(title = NULL, x = "# of previous crash cost") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
train_1 |>
  dplyr::select(-TARGET_AMT) |>
  plot_bar(by = "TARGET_FLAG",
           ggtheme = theme_minimal(),
           nrow = 3L,
           ncol = 2L)
train_1 |>
  dplyr::select(-TARGET_AMT, -INDEX) |>
  plot_boxplot(by = "TARGET_FLAG",
           ggtheme = theme_minimal(),
           nrow = 3L,
           ncol = 2L)
multdens <- function(df, var, target) {
  
  df %>% 
    dplyr::select(-TARGET_AMT, -INDEX) %>% 
    filter(!is.na(.data[[var]])) %>% 
    ggplot(aes(.data[[var]], fill = .data[[target]], colour = .data[[target]])) +
      geom_density(alpha = 0.2) +
      labs(y = NULL) +
      theme_minimal(base_size = 6) + 
      theme(legend.position = c(0.85, 0.75), legend.key.size = unit(2,'mm'))
}

num_col <- train_1 |>
  dplyr::select(where(is.numeric), -TARGET_AMT, -INDEX) |>
  colnames()

dens <- lapply(num_col, multdens, df = train_1, target = "TARGET_FLAG")

plot_grid(plotlist = dens[1:5], label_x = NULL, ncol=1)
plot_grid(plotlist = dens[6:10], label_x = NULL, ncol=1)
plot_grid(plotlist = dens[11:13], label_x = NULL, ncol=1)
mean_bar <- function(df, var, target, optar) {
  
  df %>% 
    filter(.data[[optar]] == 1) |>
    mutate(!!var := fct_reorder(.data[[var]],desc(.data[[target]]),mean,na.rm = TRUE)) |>
    ggplot(mapping=aes(x=.data[[var]], y=.data[[target]])) + 
      stat_summary(fun.data=mean_sdl, geom="bar") +
      labs(y = NULL) +
      theme_minimal(base_size = 8) 
}

fac_col <- train_1 |>
  dplyr::select(where(is.factor), - TARGET_FLAG) |>
  colnames()

barmeans <- lapply(fac_col, mean_bar,df = train_1,target = "TARGET_AMT",optar = "TARGET_FLAG")

plot_grid(plotlist = barmeans[1:4], label_x = NULL)
plot_grid(plotlist = barmeans[c(6,8:10)], label_x = NULL)
plot_grid(plotlist = barmeans[c(5,7)], label_x = NULL)
plot_scatterplot(split_columns(train_1 |> filter(TARGET_AMT != 0) |> dplyr::select(-INDEX))$continuous, by = "TARGET_AMT", ggtheme = theme_minimal())
train_1 |>
  na.omit() |>
  dplyr::select(-INDEX, -TARGET_AMT) |>
  relocate(TARGET_FLAG, .after = last_col()) |>
  plot_correlation(
                 ggtheme = theme_minimal(),
                 theme_config = list(legend.position = "none", axis.text.x = element_text(angle =
    45)))
train_1 |>
  filter(TARGET_FLAG != 0) %>% 
  na.omit() %>% 
  dplyr::select(-INDEX,-TARGET_FLAG) %>% 
  plot_correlation(
                 ggtheme = theme_minimal(),
                 theme_config = list(legend.position = "none", axis.text.x = element_text(angle =
    45)))
train_1 %>%
  dplyr::select(-INDEX, -TARGET_AMT) %>%
  na.omit() %>%
  binarize(n_bins = 6, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>% 
  correlate(TARGET_FLAG__1) %>% 
  plot_correlation_funnel(alpha = 0.7, interactive = TRUE)
train_1 %>%
  filter(TARGET_FLAG != 0) %>% 
  dplyr::select(-INDEX, -TARGET_FLAG) %>%
  mutate(TARGET_AMT_BIN = if_else(TARGET_AMT > mean(TARGET_AMT), 1, 0)) %>% 
  dplyr::select(-TARGET_AMT) %>% 
  na.omit() %>%
  binarize(n_bins = 6, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>% 
  correlate(TARGET_AMT_BIN__1) %>% 
  plot_correlation_funnel(alpha = 0.7, interactive = TRUE)
train_1 |>
  mutate(
    across(c(JOB),
      ~fct_na_value_to_level(.x, "Other")
    )
  ) -> train_1

eval_1 |>
  mutate(
    across(c(JOB),
      ~fct_na_value_to_level(.x, "Other")
    )
  ) -> eval_1

train_1 |>
  pull(JOB) |>
  unique()
train_1 %>% 
  filter(!is.na(AGE)) -> train_1

eval_1 %>% 
  filter(!is.na(AGE)) -> eval_1

cat("There are",sum(is.na(train_1$AGE)),"NA values in the age column")
train_1 |>
  mutate(CAR_AGE = if_else(CAR_AGE < 0, NA, CAR_AGE)) -> train_1

train_1 |>
  mice(printFlag = FALSE) -> cheesed

cheesed |>
  complete() -> train_1

eval_1 |>
  mutate(CAR_AGE = if_else(CAR_AGE < 0, NA, CAR_AGE)) -> eval_1


eval_1 |>
  mice(printFlag = FALSE) -> cheesed

cheesed |>
  complete() -> eval_1


colSums(is.na(train_1))
train_1 %>%
  mutate(across(.cols = c(HOME_VAL,HOMEKIDS,INCOME,KIDSDRIV, YOJ, OLDCLAIM, MVR_PTS),
                .fns = ~as_factor(if_else(.x == 0, 0, 1)),
                .names = "{.col}_Y"),
         CAR_AGE_Y = as_factor(if_else(CAR_AGE > 1, 0, 1))) -> train_1

eval_1 %>% 
  mutate(across(.cols = c(HOME_VAL,HOMEKIDS,INCOME,KIDSDRIV, YOJ, OLDCLAIM, MVR_PTS),
                .fns = ~as_factor(if_else(.x == 0, 0, 1)),
                .names = "{.col}_Y"),
         CAR_AGE_Y = as_factor(if_else(CAR_AGE > 1, 0, 1))) -> eval_1

colnames(train_1)
train_1 |>
  filter(TARGET_FLAG != 0) %>% 
  na.omit() %>% 
  dplyr::select(-INDEX,-TARGET_FLAG) %>% 
  plot_correlation(
                 ggtheme = theme_minimal(),
                 theme_config = list(legend.position = "none", axis.text.x = element_text(angle =
    45)))
train_1 %>%
  dplyr::select(-INDEX, -TARGET_AMT) %>%
  na.omit() %>%
  binarize(n_bins = 6, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>% 
  correlate(TARGET_FLAG__1) %>% 
  plot_correlation_funnel(alpha = 0.7, interactive = TRUE)
glm_1 <- glm(TARGET_FLAG ~ . - TARGET_AMT - INDEX,
             data = train_1,
             family = "binomial") |>
          stepAIC(
            direction = "backward",
            trace = FALSE
          )

summary(glm_1)
roc(TARGET_FLAG~glm_1$fitted.values, data = train_1, plot = TRUE, main = "ROC CURVE", col= "#6aaec8")
confusionMatrix(as_factor(as.numeric(predict(glm_1, newdata = train_1 ,type = "response") > 0.5)), 
                reference = train_1$TARGET_FLAG)
mmps(glm_1, ask = FALSE)
plot(glm_1, which = 4, id.n=3)
glm_2 <- glm(TARGET_FLAG ~ HOME_VAL_Y:HOME_VAL + HOME_VAL_Y + HOMEKIDS_Y:HOMEKIDS + HOMEKIDS_Y + INCOME_Y:INCOME + INCOME_Y + KIDSDRIV_Y:KIDSDRIV + KIDSDRIV_Y + YOJ_Y:YOJ + YOJ_Y + OLDCLAIM_Y:OLDCLAIM + OLDCLAIM_Y + MVR_PTS_Y:MVR_PTS + MVR_PTS_Y + CAR_AGE_Y:CAR_AGE + CAR_AGE_Y,
             data = train_1,
             family = "binomial") |>
          stepAIC(
            direction = "both",
            trace = FALSE
          )

summary(glm_2)
roc(TARGET_FLAG~glm_2$fitted.values, data = train_1, plot = TRUE, main = "ROC CURVE", col= "#6aaec8")
confusionMatrix(as_factor(as.numeric(predict(glm_2, newdata = train_1 ,type = "response") > 0.5)), 
                reference = train_1$TARGET_FLAG)
mmps(glm_2, ask = FALSE)
plot(glm_1, which = 4, id.n=3)
glm_3 <- glm(TARGET_FLAG ~ . +
               HOME_VAL_Y:HOME_VAL + HOME_VAL_Y + HOMEKIDS_Y:HOMEKIDS + HOMEKIDS_Y + INCOME_Y:INCOME + INCOME_Y + KIDSDRIV_Y:KIDSDRIV + KIDSDRIV_Y + YOJ_Y:YOJ + YOJ_Y + OLDCLAIM_Y:OLDCLAIM + OLDCLAIM_Y + MVR_PTS_Y:MVR_PTS + MVR_PTS_Y + CAR_AGE_Y:CAR_AGE + CAR_AGE_Y +
               URBANICITY + JOB + CAR_USE + CAR_TYPE + REVOKED
             - INDEX - TARGET_AMT,
             data = train_1,
             family = "binomial") |>
          stepAIC(
            direction = "backward",
            trace = FALSE
          )

summary(glm_3)
roc(TARGET_FLAG~glm_3$fitted.values, data = train_1, plot = TRUE, main = "ROC CURVE", col= "#6aaec8")
confusionMatrix(as_factor(as.numeric(predict(glm_3, newdata = train_1 ,type = "response") > 0.5)), 
                reference = train_1$TARGET_FLAG)
mmps(glm_3, ask = FALSE)
plot(glm_1, which = 4, id.n=3)
train_1 %>% 
  filter(TARGET_FLAG == 1) %>% 
  dplyr::select(-TARGET_FLAG,-INDEX) -> train_2

model_1 <- lm(TARGET_AMT ~ ., data = train_2)
model_1 <- ols_step_both_p(model_1)
summary(model_1$model)
autoplot(model_1$model)
model_2 <- lm(TARGET_AMT ~ . +
               HOME_VAL_Y:HOME_VAL + HOME_VAL_Y + HOMEKIDS_Y:HOMEKIDS + HOMEKIDS_Y + INCOME_Y:INCOME + INCOME_Y + KIDSDRIV_Y:KIDSDRIV + KIDSDRIV_Y + YOJ_Y:YOJ + YOJ_Y + OLDCLAIM_Y:OLDCLAIM + OLDCLAIM_Y + MVR_PTS_Y:MVR_PTS + MVR_PTS_Y + CAR_AGE_Y:CAR_AGE + CAR_AGE_Y +
               URBANICITY + JOB + CAR_USE + CAR_TYPE + REVOKED, 
              data = train_2)
model_2 <- ols_step_both_p(model_2)
summary(model_2$model)
autoplot(model_2$model)
pred_cost <- predict(model_1$model, eval_1)
pred_prob <- predict(glm_1, eval_1, type = "response")
pred_class <- as.numeric(pred_prob)

predictions <- tibble(PROBABILITY = pred_prob, TARGET_FLAG = pred_class, TARGET_AMT = pred_cost)
write_csv(predictions, "evaluated.csv")

D621-HW4: Car Insurance Prediction

Group 5: Taha Ahmad, Diana Plunkett, Justin Williams, PK O’Flaherty

2023-11-01

Introduction

Loading Data

Description of Variables

§1 Data Exploration

Data Preprocessing

Data Skimming

Missing Values

Variable Distribution

Target Flag Breakdown

Target Amount Breakdown

Correlation

§2 Data Preparation

Missing Values

Transforming Missing Values to New Level

Removing Rows with Missing Values

MICE Impute the Mean for Other Missing Values

New Categorical Columns

Transformation Discussion

Correlation

§3 Build Models

Logistic Models

Model 1

Model Description

Model Results

Model Evaluation

Model 2

Model Description

Model Results

Model Evaluation

Model 3

Model Description

Model Results

Model Evaluation

Multiple Linear Regression Models

Model 1

Model Description

Model Results

Model Evaluation

Model 2

Model Description

Model Results

Model Evaluation

§4 Select Models

Conclusion

Code Appendix