Introduction

We will explore, analyze, and model a data set containing approximately 8,000 records. Each record represents a customer at an auto insurance company and includes predictor variables describing the customer’s car, job, and demographics. The data set has two response variables: a binary flag indicating whether the customer was in a car crash, and the value of the damages if a crash occurred.

We will create one model for each response variable. The first will be a binary logistic regression model to predict whether a customer will crash their car. The second will be a multiple linear regression model to predict the cost of damages if the customer does crash.
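As a rough sketch of the two model forms, assuming the cleaned training data ends up in the train_1 data frame used throughout the appendix (the predictors shown here are placeholders, not the final models):

# Sketch of the two model types; predictors are placeholders.
crash_fit <- glm(TARGET_FLAG ~ AGE + MVR_PTS + CLM_FREQ,
                 family = binomial, data = train_1)

# The damage model is fit only on records where a crash occurred.
cost_fit <- lm(TARGET_AMT ~ BLUEBOOK + CAR_AGE + MVR_PTS,
               data = subset(train_1, TARGET_FLAG == 1))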


Subsections:
- Loading Data
- Description of Variables


Loading Data

Here we load the data from the CSVs we have downloaded. The data is separated into a training set for building the models and an evaluation set for testing them.

We can see that we have the expected number of records loaded in: 8,161.

## Rows: 8,161
## Columns: 26
## $ INDEX       <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19, 20, 2…
## $ TARGET_FLAG <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ TARGET_AMT  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0…
## $ KIDSDRIV    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AGE         <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45…
## $ HOMEKIDS    <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1…
## $ YOJ         <dbl> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, 11, 0, 1…
## $ INCOME      <chr> "$67,349", "$91,449", "$16,039", NA, "$114,986", "$125,301…
## $ PARENT1     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No…
## $ HOME_VAL    <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,925", "$0"…
## $ MSTATUS     <chr> "z_No", "z_No", "Yes", "Yes", "Yes", "z_No", "Yes", "Yes",…
## $ SEX         <chr> "M", "M", "z_F", "M", "z_F", "z_F", "z_F", "M", "z_F", "M"…
## $ EDUCATION   <chr> "PhD", "z_High School", "z_High School", "<High School", "…
## $ JOB         <chr> "Professional", "z_Blue Collar", "Clerical", "z_Blue Colla…
## $ TRAVTIME    <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,…
## $ CAR_USE     <chr> "Private", "Commercial", "Private", "Private", "Private", …
## $ BLUEBOOK    <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,000", "$17…
## $ TIF         <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, …
## $ CAR_TYPE    <chr> "Minivan", "Minivan", "z_SUV", "Minivan", "z_SUV", "Sports…
## $ RED_CAR     <chr> "yes", "yes", "no", "yes", "no", "no", "no", "yes", "no", …
## $ OLDCLAIM    <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", "$0", "$…
## $ CLM_FREQ    <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2…
## $ REVOKED     <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "N…
## $ MVR_PTS     <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, …
## $ CAR_AGE     <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,…
## $ URBANICITY  <chr> "Highly Urban/ Urban", "Highly Urban/ Urban", "Highly Urba…


Description of Variables

Below is a short description of the variables of interest in the data set:

Variable | Type (Quantitative or Categorical) | Definition
INDEX | Categorical | Identification variable
TARGET_FLAG | Categorical | Whether the car was in a crash
TARGET_AMT | Quantitative | Damage value of the car crash
AGE | Quantitative | Age of driver
BLUEBOOK | Quantitative | Value of vehicle
CAR_AGE | Quantitative | Vehicle age
CAR_TYPE | Categorical | Type of car
CAR_USE | Categorical | Whether the vehicle is used privately or commercially
CLM_FREQ | Quantitative | # claims (past 5 years)
EDUCATION | Categorical | Max education level of the customer
HOMEKIDS | Quantitative | # children at home
HOME_VAL | Quantitative | Home value of the customer
INCOME | Quantitative | Income of the customer
KIDSDRIV | Quantitative | # driving children of the customer
MSTATUS | Categorical | Marital status of the customer
MVR_PTS | Quantitative | Motor vehicle record points
OLDCLAIM | Quantitative | Total claims (past 5 years)
PARENT1 | Categorical | Whether the customer is a single parent
RED_CAR | Categorical | Whether the car is red
REVOKED | Categorical | Whether the customer’s license was revoked (past 7 years)
SEX | Categorical | Customer’s gender
TIF | Quantitative | Time in force (as a customer)
TRAVTIME | Quantitative | Distance the customer travels to work
URBANICITY | Categorical | How urban the customer’s home/work area is
YOJ | Quantitative | Years on the job



§1 Data Exploration


Subsections:
- Data Preprocessing
- Data Skimming
- Histogram
- Outliers
- Missing Values
- Correlation


Data Preprocessing


From our preliminary look at the variables, we know that we have both qualitative and quantitative variables in our data. Before we can begin exploring the data, we need to ensure that each variable is stored as the type it should be.

Right away we see variables that need to be factorized, such as urbanicity and the crash flag; variables that need to be properly converted to numeric, such as income; and values that need to be cleaned up, with a suspicious “z_” prefix on some categorical values such as sex and marital status.

## Rows: 8,161
## Columns: 26
## $ INDEX       <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19, 20, 2…
## $ TARGET_FLAG <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ TARGET_AMT  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0…
## $ KIDSDRIV    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AGE         <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45…
## $ HOMEKIDS    <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1…
## $ YOJ         <dbl> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, 11, 0, 1…
## $ INCOME      <chr> "$67,349", "$91,449", "$16,039", NA, "$114,986", "$125,301…
## $ PARENT1     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No…
## $ HOME_VAL    <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,925", "$0"…
## $ MSTATUS     <chr> "z_No", "z_No", "Yes", "Yes", "Yes", "z_No", "Yes", "Yes",…
## $ SEX         <chr> "M", "M", "z_F", "M", "z_F", "z_F", "z_F", "M", "z_F", "M"…
## $ EDUCATION   <chr> "PhD", "z_High School", "z_High School", "<High School", "…
## $ JOB         <chr> "Professional", "z_Blue Collar", "Clerical", "z_Blue Colla…
## $ TRAVTIME    <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,…
## $ CAR_USE     <chr> "Private", "Commercial", "Private", "Private", "Private", …
## $ BLUEBOOK    <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,000", "$17…
## $ TIF         <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, …
## $ CAR_TYPE    <chr> "Minivan", "Minivan", "z_SUV", "Minivan", "z_SUV", "Sports…
## $ RED_CAR     <chr> "yes", "yes", "no", "yes", "no", "no", "no", "yes", "no", …
## $ OLDCLAIM    <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", "$0", "$…
## $ CLM_FREQ    <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2…
## $ REVOKED     <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "N…
## $ MVR_PTS     <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, …
## $ CAR_AGE     <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,…
## $ URBANICITY  <chr> "Highly Urban/ Urban", "Highly Urban/ Urban", "Highly Urba…

We begin our data preprocessing efforts by converting the categorical columns TARGET_FLAG, PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, and URBANICITY into factors. We then view the levels of each of these factors to better understand which levels need to be modified and which factors should be ordered.

## $TARGET_FLAG
## [1] "0" "1"
## 
## $PARENT1
## [1] "No"  "Yes"
## 
## $MSTATUS
## [1] "z_No" "Yes" 
## 
## $SEX
## [1] "M"   "z_F"
## 
## $EDUCATION
## [1] "PhD"           "z_High School" "<High School"  "Bachelors"    
## [5] "Masters"      
## 
## $JOB
## [1] "Professional"  "z_Blue Collar" "Clerical"      "Doctor"       
## [5] "Lawyer"        "Manager"       "Home Maker"    "Student"      
## 
## $CAR_USE
## [1] "Private"    "Commercial"
## 
## $CAR_TYPE
## [1] "Minivan"     "z_SUV"       "Sports Car"  "Van"         "Panel Truck"
## [6] "Pickup"     
## 
## $RED_CAR
## [1] "yes" "no" 
## 
## $REVOKED
## [1] "No"  "Yes"
## 
## $URBANICITY
## [1] "Highly Urban/ Urban"   "z_Highly Rural/ Rural"

MSTATUS, SEX, EDUCATION, JOB, CAR_TYPE, and URBANICITY each contain the “z_” prefix on one of their levels. On closer inspection, the “z_” prefix appears to be a marker in the data flagging the level thought to contribute most to the probability of a crash. We can see this most clearly through SEX, EDUCATION, and MSTATUS, where the “z_” marks a stereotype: that unmarried people drive more recklessly, that women are worse drivers, and that those with only a high-school education drive more recklessly.

We want our data to speak for itself through the models, so we have no need for the “z_” prefixes and remove them to make the values clearer and more readable.

## $MSTATUS
## [1] "No"  "Yes"
## 
## $EDUCATION
## [1] "PhD"          "High School"  "<High School" "Bachelors"    "Masters"     
## 
## $JOB
## [1] "Professional" "Blue Collar"  "Clerical"     "Doctor"       "Lawyer"      
## [6] "Manager"      "Home Maker"   "Student"     
## 
## $CAR_TYPE
## [1] "Minivan"     "SUV"         "Sports Car"  "Van"         "Panel Truck"
## [6] "Pickup"     
## 
## $URBANICITY
## [1] "Highly Urban/ Urban" "Highly Rural/ Rural"

Next, we consider whether any of these factors should be ordered, as this will help when visualizing groupings later in the data exploration.

There is no innate ordering in any of the factorized variables except education, which can be sorted from least to most educated. Thus, we order the education variable from not having completed high school up to completing a PhD.

## [1] PhD          High School  <High School Bachelors    Masters     
## Levels: <High School < High School < Bachelors < Masters < PhD

With the factored variables preprocessed, we move on to the variables that need to be converted to numeric, which are just the columns holding a money value: INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM.

## Rows: 8,161
## Columns: 4
## $ INCOME   <chr> "$67,349", "$91,449", "$16,039", NA, "$114,986", "$125,301", …
## $ HOME_VAL <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,925", "$0", N…
## $ BLUEBOOK <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,000", "$17,43…
## $ OLDCLAIM <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", "$0", "$2,3…

We simply need to remove the dollar signs and commas and then convert the columns to numeric. Thankfully, readr's parse_number function does exactly this for us.
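For example, applied to a single value, parse_number strips the currency formatting and returns a numeric:

parse_number("$67,349")
## [1] 67349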

## Rows: 8,161
## Columns: 26
## $ INDEX       <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19, 20, 2…
## $ TARGET_FLAG <fct> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1…
## $ TARGET_AMT  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0…
## $ KIDSDRIV    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AGE         <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45…
## $ HOMEKIDS    <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1…
## $ YOJ         <dbl> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, 11, 0, 1…
## $ INCOME      <dbl> 67349, 91449, 16039, NA, 114986, 125301, 18755, 107961, 62…
## $ PARENT1     <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, No, No, N…
## $ HOME_VAL    <dbl> 0, 257252, 124191, 306251, 243925, 0, NA, 333680, 0, 0, 0,…
## $ MSTATUS     <fct> No, No, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Yes,…
## $ SEX         <fct> M, M, F, M, F, F, F, M, F, M, F, F, M, M, F, F, M, F, F, F…
## $ EDUCATION   <ord> PhD, High School, High School, <High School, PhD, Bachelor…
## $ JOB         <fct> Professional, Blue Collar, Clerical, Blue Collar, Doctor, …
## $ TRAVTIME    <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,…
## $ CAR_USE     <fct> Private, Commercial, Private, Private, Private, Commercial…
## $ BLUEBOOK    <dbl> 14230, 14940, 4010, 15440, 18000, 17430, 8780, 16970, 1120…
## $ TIF         <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, …
## $ CAR_TYPE    <fct> Minivan, Minivan, SUV, Minivan, SUV, Sports Car, SUV, Van,…
## $ RED_CAR     <fct> yes, yes, no, yes, no, no, no, yes, no, no, no, no, yes, y…
## $ OLDCLAIM    <dbl> 4461, 0, 38690, 0, 19217, 0, 0, 2374, 0, 0, 0, 0, 5028, 0,…
## $ CLM_FREQ    <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2…
## $ REVOKED     <fct> No, No, No, No, Yes, No, No, Yes, No, No, No, No, Yes, No,…
## $ MVR_PTS     <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, …
## $ CAR_AGE     <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,…
## $ URBANICITY  <fct> Highly Urban/ Urban, Highly Urban/ Urban, Highly Urban/ Ur…

With the money-related columns converted to numeric, we have a good jumping-off point to move on from preprocessing to data exploration proper.


Data Skimming

We begin our data exploration by skimming the training data for immediate summary statistics, data types, and a visualization of distributions with mini histograms.

We still have our 8,161 rows and 26 columns, split between 11 factors and 15 numeric variables.

Looking at our response variables, the breakdown of TARGET_FLAG shows disproportionately more customers without accidents than with accidents. We should adjust our cross-validation resampling down the line so that the CV training sets have approximately equal class proportions. We also see that TARGET_AMT is heavily skewed to the right, with a large number of zero values corresponding to customers who never crashed at all. This is not something we need to worry about, as the model that predicts crash value will be trained only on the subset of records where a crash occurred.
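One way to get balanced classes within the resamples later, assuming we use caret (which is loaded in the appendix), is to down-sample the majority class inside each fold; a minimal sketch:

# Sketch: down-sample the majority (no-crash) class within each CV fold.
ctrl <- trainControl(method = "cv", number = 10, sampling = "down")
# e.g. train(TARGET_FLAG ~ MVR_PTS + CLM_FREQ, data = train_1,
#            method = "glm", trControl = ctrl)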

Data summary
Name train_1
Number of rows 8161
Number of columns 26
_______________________
Column type frequency:
factor 11
numeric 15
________________________
Group variables None

Variable type: factor

skim_variable n_missing ordered n_unique top_counts
TARGET_FLAG 0 FALSE 2 0: 6008, 1: 2153
PARENT1 0 FALSE 2 No: 7084, Yes: 1077
MSTATUS 0 FALSE 2 Yes: 4894, No: 3267
SEX 0 FALSE 2 F: 4375, M: 3786
EDUCATION 0 TRUE 5 Hig: 2330, Bac: 2242, Mas: 1658, <Hi: 1203
JOB 526 FALSE 8 Blu: 1825, Cle: 1271, Pro: 1117, Man: 988
CAR_USE 0 FALSE 2 Pri: 5132, Com: 3029
CAR_TYPE 0 FALSE 6 SUV: 2294, Min: 2145, Pic: 1389, Spo: 907
RED_CAR 0 FALSE 2 no: 5783, yes: 2378
REVOKED 0 FALSE 2 No: 7161, Yes: 1000
URBANICITY 0 FALSE 2 Hig: 6492, Hig: 1669

Variable type: numeric

skim_variable n_missing mean p0 p25 p50 p75 p100 hist
INDEX 0 5151.87 1 2559 5133 7745 10302.0 ▇▇▇▇▇
TARGET_AMT 0 1504.32 0 0 0 1036 107586.1 ▇▁▁▁▁
KIDSDRIV 0 0.17 0 0 0 0 4.0 ▇▁▁▁▁
AGE 6 44.79 16 39 45 51 81.0 ▁▆▇▂▁
HOMEKIDS 0 0.72 0 0 0 1 5.0 ▇▂▁▁▁
YOJ 454 10.50 0 9 11 13 23.0 ▂▃▇▃▁
INCOME 445 61898.09 0 28097 54028 85986 367030.0 ▇▃▁▁▁
HOME_VAL 464 154867.29 0 0 161160 238724 885282.0 ▇▆▁▁▁
TRAVTIME 0 33.49 5 22 33 44 142.0 ▇▇▁▁▁
BLUEBOOK 0 15709.90 1500 9280 14440 20850 69740.0 ▇▆▁▁▁
TIF 0 5.35 1 1 4 7 25.0 ▇▆▁▁▁
OLDCLAIM 0 4037.08 0 0 0 4636 57037.0 ▇▁▁▁▁
CLM_FREQ 0 0.80 0 0 0 2 5.0 ▇▂▁▁▁
MVR_PTS 0 1.70 0 0 1 3 13.0 ▇▂▁▁▁
CAR_AGE 510 8.33 -3 1 8 12 28.0 ▆▇▇▃▁

If we take a look at the counts of the factors that previously had a “z_” prefix, we can see that most of the prefixed levels are among the most common values for their variables. This doesn’t help us much with building a model, but it is interesting to note. The other thing we can determine from the factor counts, which we expand below, is that none of our factor variables are degenerate: no level has extremely low counts compared to either the total population or the most common level. This means we do not need to worry about removing factor variables due to degeneracy.

JOB n
Professional 1117
Blue Collar 1825
Clerical 1271
Doctor 246
Lawyer 835
Manager 988
Home Maker 641
Student 712
NA 526

CAR_TYPE n
Minivan 2145
SUV 2294
Sports Car 907
Van 750
Panel Truck 676
Pickup 1389

EDUCATION n
<High School 1203
High School 2330
Bachelors 2242
Masters 1658
PhD 728


Missing Values

For the factors, we are missing 526 values in the JOB column. Looking at the breakdown of levels, there is no option for a customer to have no job, which is a distinct possibility. The data set provides one way to test this: filter to the customers with a missing JOB and check whether they have zero income or some income. As we can see below, there are actually no customers with a missing JOB who also have zero income; they either have income (502 of these cases) or a missing income value (24 cases). Thus our initial hypothesis is false, and these customers probably do have jobs, since most of them have income. That leaves two plausible explanations for 500+ missing values in this column: the customer did not provide their job, or the customer's job did not fall into one of the 8 categories provided. To decide between the two, we note that a customer who wanted to hide their job might also have hidden their income, yet income is missing in only 24 of these cases. We therefore conclude that the missing JOB values should be grouped into an “Other” level.

While we are looking at missing values involving income, we also want to decide what to do with the 445 missing values in the continuous INCOME variable. Since we already have values of 0 for income, we know NA is not being used to record no income. So we group the records with missing income by job and look at the distribution of counts. In theory, if the missing income values meant something, it would show up most prominently in the distribution of the most closely related variable. Yet no job stands out as generating a disproportionate share of missing incomes, which also rules out the theory that people are hiding their income because it is too high. So we classify the missing income data as missing completely at random, and will later use MICE imputation to fill in these NAs.

NOINCJOB n
FALSE 502
NA 24

JOB NA Income Count
Professional 60
Blue Collar 98
Clerical 73
Doctor 8
Lawyer 36
Manager 50
Home Maker 43
Student 53
NA 24
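A sketch of the recoding decided on above, treating the missing JOB values as an explicit “Other” level (fct_explicit_na comes from forcats, loaded with the tidyverse; newer forcats versions name this fct_na_value_to_level):

# Sketch: give the missing JOB values an explicit "Other" level.
train_1 <- train_1 |>
  mutate(JOB = fct_explicit_na(JOB, na_level = "Other"))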

Continuing our examination of missing values within the continuous variables, we observe 6 missing values in AGE, an amount so minuscule compared to the total data we have that we can simply remove the rows where AGE is missing without it affecting the model we are building, as roughly 0.1% of the data will likely not provide us much information.
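A one-line sketch of dropping those rows:

# Sketch: drop the handful of rows with a missing AGE.
train_1 <- train_1 |>
  filter(!is.na(AGE))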

Years on the job, home value, and car age have a more substantial amount of missing data, with between 450 and 510 missing values in each column. Still, the amount is not so significant that those rows couldn't be deleted after a bit more examination of whether the missing data is telling us something.

With multiple columns having similar amounts of missing data, we want to explore whether the rows with missing data are connected. For example, if roughly 400 rows were missing 3 or 4 values at the same time, we would know that these missing values are dependent on each other and related. Yet, plotting the number of missing values in each row, we see that the vast majority of rows have only a single missing value and a minuscule number have 3 or more columns missing. Thus, we consider these missing values to not be interconnected.

Next, we consider how the missing data is distributed against each variable's most closely related variable, to attempt to detect a pattern. For years on the job, we visualize below whether the missing values can be attributed to one job in particular. Although the more common jobs have more missing values, the differences are not something we would deem significant. Thus, we also consider the YOJ data missing at random, which means we will deal with it by MICE imputation later down the line.

For home values and car age, both the variables and their most closely related variables are continuous. So we plot scatterplots that jitter the missing values below the 0 ticks on the x and y axes in order to see how the missing values differ in their distributions.

For home value we plot against income, as those with higher income can afford a more valuable home. The missing home values follow the general distribution of income versus home value. One thing to note is that the distribution of income is very similar for both zero home value and home value greater than zero, which tells us that home value is related to income only when there is any home value at all. Either way, we cannot say the missing home values follow some unusual pattern with income that would give us more information. So we deem them missing at random and will impute the missing values through MICE.

For car age we plot against age, as older customers have had a better chance of keeping a car long enough for it to age. The missing car ages follow the general distribution of car age against age, which is seemingly random but with a mean age around 45. Once again, we deem the missing data here to be missing at random and will impute the values with MICE.
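As a preview of that imputation, a minimal mice sketch (the settings here are illustrative, not the final choices):

# Sketch: impute the missing values with mice using its default methods
# (predictive mean matching for numeric columns); settings are illustrative.
imp <- mice(train_1, m = 5, seed = 123, printFlag = FALSE)
train_imputed <- mice::complete(imp, 1)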


Histogram

Our skimmed histograms suggest that most of our data is not normally distributed, with rightward skew being very common. Where outlier removal isn't enough, we will want to adjust these distributions with Box-Cox transformations once we get to data preparation, since an approximately normal predictor lends itself better to a linear regression model.

However, many of these columns have the unique situation where the mode is overwhelmingly 0. This makes a Box-Cox transformation problematic unless the column is first shifted by adding a constant. It is also possible that our data is bimodal in the case of variables like target amount or old claim, which are overwhelmingly concentrated at 0. Thus, we check histograms of our data with and without 0 values included for a few variables.
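One option for the shifted transformation, assuming we use caret's BoxCoxTrans, is sketched below on income (the +1 shift is illustrative):

# Sketch: Box-Cox needs strictly positive values, so shift a
# zero-containing column such as INCOME before estimating lambda.
bc <- BoxCoxTrans(train_1$INCOME + 1, na.rm = TRUE)
income_bc <- predict(bc, train_1$INCOME + 1)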

We begin by analyzing the number of customer kids that can drive. There is a disproportionate number of customers without kids that drive, so we will create an extra category that splits customers into those with and without driving kids, since we do not see a natural distribution otherwise. Looking only at customers with at least one kid who drives, we see what approximates a negative binomial distribution, so we will not attempt to transform this variable.
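A sketch of that split (the indicator column name KIDSDRIV_ANY is ours); the same pattern applies to the other zero-heavy counts discussed below:

# Sketch: flag customers with any driving kids; the raw count is
# kept for the with-kids subset.
train_1 <- train_1 |>
  mutate(KIDSDRIV_ANY = factor(if_else(KIDSDRIV > 0, "Yes", "No")))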

Next we look at customer kids overall. Again there is a disproportionate number of customers without kids, so we will create an extra category splitting customers into those with and without kids, since we do not see a natural distribution otherwise. Looking only at customers with at least one kid, we see an approximately normal distribution with a slight rightward skew, so we will Box-Cox transform this variable to better approximate normality.

Afterwards we look at customer income. A significant number of customers have no income, so we will create an extra category splitting customers into those with and without income, since we do not see a natural distribution otherwise. Looking only at customers with income, we see an approximately normal distribution with a strong rightward skew, so we will Box-Cox transform this variable to better approximate normality. We should also look at outliers afterwards.

Moving on, we look at customer home value. A significant number of customers have no home value and thus do not own a home, so we will create an extra category splitting customers into those with and without a home, since we otherwise see two separate distributions. Looking only at customers with a home, we see an approximately normal distribution with a strong rightward skew, so we will Box-Cox transform this variable to better approximate normality. We should also look at outliers afterwards.

Afterwards we look at the customer's previous claim payout. An overwhelming number of customers have no previous claim payout; however, we can also observe three distinct groups: no previous claims, low-value previous claims, and high-value previous claims. So we will create a new categorical variable that splits customers into these three categories of previous claim payout. Since we cannot think of a way to use the continuous value for prediction across more than two categories, we will not use the raw previous claim payout values when building our models, and thus have no need to transform them.
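A sketch of that three-way split (the $15,000 cut point between low- and high-value claims is illustrative, taken from where the histograms separate):

# Sketch: bucket previous claim payout into the three groups observed;
# the 15,000 boundary is illustrative.
train_1 <- train_1 |>
  mutate(OLDCLAIM_GRP = cut(OLDCLAIM,
                            breaks = c(-Inf, 0, 15000, Inf),
                            labels = c("None", "Low", "High")))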

Below we consider the number of previous customer claims. An overwhelming number of customers have no previous claims, so we will create an extra category splitting customers into those with and without previous claims. After splitting, we get a right-skewed, close to normal distribution, so we will Box-Cox transform it to better approximate normality.

Now we analyze customer moving violations. A significant number of customers have no moving violations, so we will create an extra category splitting customers into those with and without moving violations. Looking only at customers with at least one moving violation, we see what approximates a negative binomial distribution, so we will not attempt to transform this variable.

Finally, we take a look at our quantitative response variable, the crash cost, and see that the portion we are concerned with, crashes that actually happened, is roughly normally distributed but with an extremely long tail. We will definitely need to consider removing outliers along with a Box-Cox transformation to better normalize the data.


Code Appendix


Introduction

library(tidyverse)
library(olsrr)
library(skimr)
library(DataExplorer)
library(corrplot)
library(fabletools)
library(ggfortify)
library(mice)
library(caret)
library(naniar)
library(cowplot)


Loading Data

train_1 <- read_csv("insurance_training_data.csv", show_col_types = FALSE)
eval_1 <- read_csv("insurance-evaluation-data.csv", show_col_types = FALSE)
glimpse(train_1)


Data Preprocessing


glimpse(train_1)
train_1 |>
  mutate(
    across(
      c(TARGET_FLAG, PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY),
      as_factor
    )
  )  -> train_1

eval_1 |>
  mutate(
    across(
      c(PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED, URBANICITY),
      as_factor
    )
  )  -> eval_1

train_1 |>
  select(where(is.factor)) |>
  sapply(levels)
train_1 |>
  mutate(
    across(
      where(is.factor),
      ~factor(.x, labels = str_replace(levels(.x),"z_",""))
    )
  ) -> train_1

eval_1 |>
  mutate(
    across(
      where(is.factor),
      ~factor(.x, labels = str_replace(levels(.x),"z_",""))
    )
  ) -> eval_1

train_1 |>
  select(MSTATUS, EDUCATION, JOB, CAR_TYPE, URBANICITY) |>
  sapply(levels)
train_1 |>
  mutate(
    EDUCATION = ordered(EDUCATION, c("<High School", "High School", "Bachelors", "Masters", "PhD"))
  ) -> train_1

eval_1 |>
  mutate(
    EDUCATION = ordered(EDUCATION, c("<High School", "High School", "Bachelors", "Masters", "PhD"))
  ) -> eval_1

train_1 |>
  pull(EDUCATION) |>
  unique()
train_1 |>
  select(where(is.character)) |>
  glimpse()
train_1 |>
  mutate(
    across(
      where(is.character),
      parse_number)
    ) -> train_1

eval_1 |>
  mutate(
    across(
      where(is.character),
      parse_number)
    ) -> eval_1

train_1 |>
  glimpse()


Data Skimming

train_1 |>
  skim() |>
  select(-complete_rate, -numeric.sd) |>
  mutate(numeric.mean = format(numeric.mean, scientific = FALSE, digits = 2))
d1 <- train_1 |>
  select(where(is.factor)) |>
  group_by(JOB) |>
  summarise(n = n())
d2 <- train_1 |>
  select(where(is.factor)) |>
  group_by(CAR_TYPE) |>
  summarise(n = n())
d3 <- train_1 |>
  select(where(is.factor)) |>
  group_by(EDUCATION) |>
  summarise(n = n())

knitr::kable(
  list(d1, d2, d3)
)


Missing Values

d1 <- train_1 |>
  filter(is.na(JOB)) |>
  mutate(NOINCJOB = if_else(INCOME == 0, TRUE, FALSE)) |>
  group_by(NOINCJOB) |>
  summarise(n = n())

d2 <- train_1 |>
  filter(is.na(INCOME)) |>
  group_by(JOB) |>
  summarise(`NA Income Count` = n())

knitr::kable(
  list(d1, d2)
)
tibble(NAs = rowSums(is.na(train_1))) |>
  filter(NAs != 0) |>
  group_by(NAs) |>
  summarise(n = n()) |>
  ggplot(aes(x = NAs, y = n)) +
    geom_bar(stat="identity") +
    labs(title = "Combined missing values are rare", subtitle = "# NAs per row", y = NULL, x = NULL) +
    theme_minimal()
train_1 |>
  filter(is.na(YOJ)) |>
  group_by(JOB) |>
  summarise(`NA YOJ Count` = n()) |>
  arrange(desc(`NA YOJ Count`)) |>
  ggplot(aes(x = reorder(JOB, -`NA YOJ Count`), y = `NA YOJ Count`)) +
    geom_bar(stat="identity") +
    labs(title = "YOJ Missing values can't be attributed to a single job", subtitle = "# NAs in YOJ by job", y = NULL, x = NULL) +
    theme_minimal()
p1 <- train_1 |>
  ggplot(aes(x=INCOME, y=HOME_VAL)) +
    geom_miss_point() +
    theme_minimal() +
    labs(title = "Missing home values don't significantly differ in income distribution", y = "Home value", x = "Income")

p2 <- train_1 |>
  ggplot(aes(x=AGE,y=CAR_AGE)) +
    geom_miss_point() +
    theme_minimal() +
    labs(title = "Missing car ages don't significantly differ in age distribution", y = "Car age", x = "Age")

plot_grid(p1, p2, ncol = 1)


Histogram

p1 <- train_1 |>
  #filter(KIDSDRIV != 0) |>
  ggplot(aes(x= KIDSDRIV)) +
    geom_histogram(bins = 5) +
    labs(title = "Distribution of customer kids that drive", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(KIDSDRIV != 0) |>
  ggplot(aes(x= KIDSDRIV)) +
    geom_histogram(bins = 4) +
    labs(title = NULL, x = "# of kids that drive") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(HOMEKIDS != 0) |>
  ggplot(aes(x= HOMEKIDS)) +
    geom_histogram(bins = 6) +
    labs(title = "Distribution of customer kids", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(HOMEKIDS != 0) |>
  ggplot(aes(x= HOMEKIDS)) +
    geom_histogram(bins = 5) +
    labs(title = NULL, x = "# of kids") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(INCOME != 0) |>
  ggplot(aes(x= INCOME)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = "Distribution of customer income", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(INCOME != 0) |>
  ggplot(aes(x= INCOME)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = NULL, x = "Income") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(HOME_VAL != 0) |>
  ggplot(aes(x= HOME_VAL)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = "Distribution of customer home value", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(HOME_VAL != 0) |>
  ggplot(aes(x= HOME_VAL)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = NULL, x = "Home value") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(OLDCLAIM != 0) |>
  ggplot(aes(x= OLDCLAIM)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = "Distribution of customer previous claim payout", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(OLDCLAIM != 0) |>
  ggplot(aes(x= OLDCLAIM)) +
    geom_histogram(bins=50, na.rm = TRUE) +
    labs(title = NULL, x = NULL) +
    theme_minimal()

p3 <- train_1 |>
  filter(OLDCLAIM > 15000) |>
  ggplot(aes(x= OLDCLAIM)) +
    geom_histogram(bins=30, na.rm = TRUE) +
    labs(title = NULL, x = "Previous claim payout") +
    theme_minimal()

plot_grid(p1,p2,p3,ncol=1)
p1 <- train_1 |>
  #filter(CLM_FREQ != 0) |>
  ggplot(aes(x= CLM_FREQ)) +
    geom_histogram(bins=6, na.rm = TRUE) +
    labs(title = "Distribution of customer claims", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(CLM_FREQ != 0) |>
  ggplot(aes(x= CLM_FREQ)) +
    geom_histogram(bins=5, na.rm = TRUE) +
    labs(title = NULL, x = "# of previous claims") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(MVR_PTS != 0) |>
  ggplot(aes(x= MVR_PTS)) +
    geom_histogram(bins=14, na.rm = TRUE) +
    labs(title = "Distribution of customer moving violations", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(MVR_PTS != 0) |>
  ggplot(aes(x= MVR_PTS)) +
    geom_histogram(bins=13, na.rm = TRUE) +
    labs(title = NULL, x = "# of previous moving violations") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)
p1 <- train_1 |>
  #filter(TARGET_AMT != 0) |>
  ggplot(aes(x= TARGET_AMT)) +
    geom_histogram(bins=100, na.rm = TRUE) +
    labs(title = "Distribution of customer crash cost", x = NULL) +
    theme_minimal()

p2 <- train_1 |>
  filter(TARGET_AMT != 0) |>
  ggplot(aes(x= TARGET_AMT)) +
    geom_histogram(bins=100, na.rm = TRUE) +
    labs(title = NULL, x = "# of previous crash cost") +
    theme_minimal()

plot_grid(p1,p2,ncol=1)