Overview

In this homework assignment, we will explore, analyze and model a data set containing approximately 8,000 records representing customers of an auto insurance company. We will build multiple linear regression models on the continuous variable TARGET_AMT and binary logistic regression models on the boolean variable TARGET_FLAG to predict the probability that a person will crash their car and the associated cost.

We are going to build several models using different techniques and variable-selection approaches. In order to best assess our predictive models, we will create a validation set within our training data using a 75/25 training/validation split, before applying the finalized models to a separate evaluation dataset that does not contain the targets.

1. Data Exploration

The insurance training dataset contains 8161 observations of 26 variables, each record representing a customer. The evaluation dataset contains 2141 observations of 26 variables. The variables include demographic measures such as age and gender, socioeconomic measures such as education and household income, and vehicle-specific metrics such as car type, age and assessed value.

1.1 Summary Statistics

Each record also has two response variables. The first response variable, TARGET_FLAG, is a boolean where “1” means that the person was in a car crash. The second response variable, TARGET_AMT, is a numeric value indicating the (positive) cost if a car crash occurred; this value is zero if the person did not crash their car.

We can explore a sample of the training data here, and make some initial observations:

  • Some currency variables are strings with ‘$’ symbols instead of numerics (e.g. OLDCLAIM, BLUEBOOK).
  • Some character variables include a prefix z_ that could be removed for readability (EDUCATION, JOB).

The table below provides valuable descriptive statistics about the training data.

Based on this summary table and exploration of the data, we can make the following observations:

  • 14 variables are categorical, 12 are numeric.
  • There is no missing data for character variables.
  • Numeric variables with missing values include YOJ (6%), INCOME (5%), HOME_VAL (6%), CAR_AGE (6%), and AGE (1%).
  • Most of the numeric variables have a minimum of zero.
  • One variable, CAR_AGE, has a negative value of -3, which doesn’t make intuitive sense. We’ll handle this below.
  • Some of the variables are stored as character although they should be numeric, and vice versa.
Data summary
Name train_df
Number of rows 8161
Number of columns 26
_______________________
Column type frequency:
character 14
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
INCOME 0 1 0 8 445 6613 0
PARENT1 0 1 2 3 0 2 0
HOME_VAL 0 1 0 8 464 5107 0
MSTATUS 0 1 3 4 0 2 0
SEX 0 1 1 3 0 2 0
EDUCATION 0 1 3 13 0 5 0
JOB 0 1 0 13 526 9 0
CAR_USE 0 1 7 10 0 2 0
BLUEBOOK 0 1 6 7 0 2789 0
CAR_TYPE 0 1 3 11 0 6 0
RED_CAR 0 1 2 3 0 2 0
OLDCLAIM 0 1 2 7 0 2857 0
REVOKED 0 1 2 3 0 2 0
URBANICITY 0 1 19 21 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
INDEX 0 1.00 5151.87 2978.89 1 2559 5133 7745 10302.0
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0
Data summary
Name eval_df
Number of rows 2141
Number of columns 26
_______________________
Column type frequency:
character 14
logical 2
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
INCOME 0 1 0 8 125 1804 0
PARENT1 0 1 2 3 0 2 0
HOME_VAL 0 1 0 8 111 1398 0
MSTATUS 0 1 3 4 0 2 0
SEX 0 1 1 3 0 2 0
EDUCATION 0 1 3 13 0 5 0
JOB 0 1 0 13 139 9 0
CAR_USE 0 1 7 10 0 2 0
BLUEBOOK 0 1 6 7 0 1417 0
CAR_TYPE 0 1 3 11 0 6 0
RED_CAR 0 1 2 3 0 2 0
OLDCLAIM 0 1 2 7 0 834 0
REVOKED 0 1 2 3 0 2 0
URBANICITY 0 1 19 21 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
TARGET_FLAG 2141 0 NaN :
TARGET_AMT 2141 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
INDEX 0 1.00 5150.10 2956.33 3 2632 5224 7669 10300
KIDSDRIV 0 1.00 0.16 0.49 0 0 0 0 3
AGE 1 1.00 45.02 8.53 17 39 45 51 73
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5
YOJ 94 0.96 10.38 4.17 0 9 11 13 19
TRAVTIME 0 1.00 33.15 15.72 5 22 33 43 105
TIF 0 1.00 5.24 3.97 1 1 4 7 25
CLM_FREQ 0 1.00 0.81 1.14 0 0 0 2 5
MVR_PTS 0 1.00 1.77 2.20 0 0 1 3 12
CAR_AGE 129 0.94 8.18 5.77 0 1 8 12 26

1.2 Distributions

Before building a model, we need to check how well both classes are represented in our TARGET_FLAG variable. Class 1 makes up 26% of the observations and class 0 makes up 74%. As a result, we have an unbalanced class distribution for our target variable that we have to deal with; we may need to take some additional steps (bootstrapping, etc.) before using logistic regression.

Distribution of Target Flag
Value %
0 0.74
1 0.26

Many of these distributions for the numeric variables seem highly skewed and non-normal. As part of our data preparation we’ll use power transformations to find whether transforming variables to more normal distributions improves our models’ efficacy.

Checking the frequencies of the variables that have very few unique values (HOMEKIDS, KIDSDRIV, CLM_FREQ) suggests that these variables are better transformed to binary (1 if the value is greater than 0). As the percentages below show, more than 50% of the data for these variables is 0, with the remainder spread over a few small values.

The distributions of the factor variables can help us determine whether any transformations should be applied (conversion to binary, or new variables derived from the current ones). For example, PARENT1 can become binary, and EDUCATION can be transformed into a higher-education flag equal to 1 for education above Bachelors. The distributions also suggest that single parents, customers with only a high school education, blue-collar employees, SUV drivers, and drivers with revoked licenses are more likely to be in a car crash.

1.3 Box Plots

This is borne out in the box plots, in which we see quite a variable range for INCOME, KIDSDRIV and OLDCLAIM, among others. BLUEBOOK, INCOME, OLDCLAIM and TRAVTIME have a high number of outliers compared to the other variables. In addition, it seems that people with higher income, car age or home value are less likely to get into a car crash.

1.4 Scatter Plot

Interestingly, none of our predictors appear to have strong linear relationships to our TARGET_AMT response variable, which is a primary assumption of linear regression. This suggests that alternative methods might be more successful in modeling the relationships.

1.5 Correlation Matrix

This is also borne out in the correlation matrix, in which the highest coefficient is 0.58, with the others averaging much lower.

2. Data preparation

2.1 Data types

In order to work with our training and evaluation datasets, we’ll need to first convert some variables to more useful data types:

  • Convert currency columns from character to numeric: INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM (a sketch of both conversions follows this list).
  • Convert character columns to factors: TARGET_FLAG, CAR_TYPE, CAR_USE, EDUCATION, JOB, MSTATUS, PARENT1, RED_CAR, REVOKED, SEX and URBANICITY.
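A minimal sketch of both conversions, assuming the training data is loaded as a data frame named train_df (readr::parse_number is one convenient way to strip the "$" and "," formatting; the original code may differ):

```r
library(dplyr)
library(readr)

currency_cols <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
factor_cols   <- c("TARGET_FLAG", "CAR_TYPE", "CAR_USE", "EDUCATION", "JOB",
                   "MSTATUS", "PARENT1", "RED_CAR", "REVOKED", "SEX", "URBANICITY")

train_df <- train_df %>%
  # "$67,349" -> 67349; blank strings become NA
  mutate(across(all_of(currency_cols), parse_number)) %>%
  mutate(across(all_of(factor_cols), as.factor))
```

The same conversions would be applied to the evaluation dataset.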

2.2 Transformations, Outliers and Missing Values

Before we go further, we need to identify and handle any missing (NA) or negative data values so that we can perform log transformations and regression. We perform the operations below on both datasets (training and evaluation).

First, we’ll apply transformations to clean up and align formatting of our variables:

  • Drop the INDEX variable.
  • Remove “z_” from all character class values (MSTATUS, SEX, JOB, CAR_TYPE, URBANICITY, EDUCATION)
  • Remove “$” from the numeric values (INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM).
  • Update RED_CAR, replacing the [no, yes] values with [No, Yes].
  • Replace JOB blank values with ‘Unknown’.
  • Transform the HOMEKIDS and KIDSDRIV variables to binary, where the value is 1 (TRUE) if the original count is greater than 0 (a sketch of these clean-up steps follows this list).
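A sketch of these clean-up steps on train_df (the same operations are applied to the evaluation set; the exact helpers used in the original code are not shown, so this is illustrative):

```r
library(dplyr)
library(stringr)

z_cols <- c("MSTATUS", "SEX", "JOB", "CAR_TYPE", "URBANICITY", "EDUCATION")

train_df <- train_df %>%
  select(-INDEX) %>%
  # strip the "z_" prefix from values, e.g. "z_Blue Collar" -> "Blue Collar"
  mutate(across(all_of(z_cols), ~ str_remove(as.character(.x), "^z_"))) %>%
  mutate(
    RED_CAR  = ifelse(RED_CAR == "yes", "Yes", "No"),
    JOB      = ifelse(JOB == "", "Unknown", JOB),
    # binary flags: 1 if the original count is greater than zero
    HOMEKIDS = factor(as.integer(HOMEKIDS > 0)),
    KIDSDRIV = factor(as.integer(KIDSDRIV > 0))
  )
```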

Next, we’ll manually adjust two special cases of missing or outlier values.

  • In cases where YOJ is zero and INCOME is NA, we’ll set INCOME to zero to avoid imputing new values over legitimate instances of non-employment.
  • There is also at least one value of CAR_AGE that is less than zero in the training data - we’ll assume this is a data collection error and set it to zero (representing a brand-new car).

We’ll use MICE to impute our remaining variables with missing values - AGE, YOJ, CAR_AGE, INCOME and HOME_VAL:

  • We might reasonably assume that relationships exist between these variables (older, more years on the job may correlate with higher income and home value). Taking simple means or medians might suppress those features, but MICE should provide a better imputation.
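A minimal imputation sketch with the mice package (the number of imputations, the seed, and the reliance on mice's default methods, e.g. predictive mean matching for numeric columns, are assumptions):

```r
library(mice)

# impute AGE, YOJ, CAR_AGE, INCOME and HOME_VAL jointly,
# letting the other predictors inform the imputation model
imp <- mice(train_df, m = 5, seed = 123, printFlag = FALSE)
train_df <- complete(imp)
```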

To give our models more variables to work with, we’ll engineer some additional features:

  • Create bin values for CAR_AGE, HOME_VAL and TIF.
  • Create dummy variables for two-level factors, MALE, MARRIED, LIC_REVOKED, CAR_RED, PRIVATE_USE, SINGLE_PARENT and URBAN.

Next we’ll want to consider any power transformations for variables that have skewed distributions. For example, our numeric response variable TARGET_AMT is a good candidate for transformation, as its distribution is very highly skewed and linear regression relies on approximately normally distributed residuals.

  • A log transformation will be applied to TARGET_AMT, OLDCLAIM and INCOME to move their distributions from right-skewed toward normal.
  • Similarly, a Box-Cox transformation will be applied to BLUEBOOK, TRAVTIME and TIF so that they are also more normally distributed and less affected by outliers (a sketch of both transformations follows this list).
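A sketch of both transformations, assuming log1p (log of one plus the value) to keep zero values finite and caret's BoxCoxTrans for the Box-Cox variables; the exact offsets and lambdas used originally may differ:

```r
library(dplyr)
library(caret)

# log1p keeps zero claim/income values finite while reducing right skew
train_transformed <- train_df %>%
  mutate(across(c(TARGET_AMT, OLDCLAIM, INCOME), log1p))

# Box-Cox: estimate lambda on the training data, then apply the transformation
for (v in c("BLUEBOOK", "TRAVTIME", "TIF")) {
  bc <- BoxCoxTrans(train_transformed[[v]])
  train_transformed[[v]] <- predict(bc, train_transformed[[v]])
}
```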

We can examine our final, transformed training dataset and its distributions below (with a temporary numeric variable CAR_CRASH representing the response variable for visualization purposes). As the graphs below show, the distributions of the transformed variables are closer to normal.

Data summary
Name train_transformed
Number of rows 8161
Number of columns 28
_______________________
Column type frequency:
factor 13
numeric 15
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
TARGET_FLAG 0 1 FALSE 2 0: 6008, 1: 2153
KIDSDRIV 0 1 FALSE 2 0: 7180, 1: 981
HOMEKIDS 0 1 FALSE 2 0: 5289, 1: 2872
EDUCATION 0 1 FALSE 5 Hig: 2330, Bac: 2242, Mas: 1658, <Hi: 1203
JOB 0 1 FALSE 9 Blu: 1825, Cle: 1271, Pro: 1117, Man: 988
CAR_TYPE 0 1 FALSE 6 SUV: 2294, Min: 2145, Pic: 1389, Spo: 907
MALE 0 1 FALSE 2 0: 4375, 1: 3786
MARRIED 0 1 FALSE 2 1: 4894, 0: 3267
LIC_REVOKED 0 1 FALSE 2 0: 7161, 1: 1000
CAR_RED 0 1 FALSE 2 0: 5783, 1: 2378
PRIVATE_USE 0 1 FALSE 2 1: 5132, 0: 3029
SINGLE_PARENT 0 1 FALSE 2 0: 7084, 1: 1077
URBAN 0 1 FALSE 2 1: 6492, 0: 1669

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
TARGET_AMT 0 1 2.18 3.67 0.00 0.00 0.00 6.94 11.59
AGE 0 1 44.78 8.63 16.00 39.00 45.00 51.00 81.00
YOJ 0 1 10.49 4.09 0.00 9.00 11.00 13.00 23.00
INCOME 0 1 9.92 3.12 0.00 10.23 10.89 11.36 12.81
HOME_VAL 0 1 154891.79 129358.77 0.00 0.00 160959.00 238741.00 885282.00
TRAVTIME 0 1 15.08 5.82 3.00 11.17 15.34 19.12 45.61
BLUEBOOK 0 1 182.32 48.72 62.21 147.95 182.18 216.49 381.01
TIF 0 1 1.63 1.25 0.00 0.00 1.62 2.43 4.70
OLDCLAIM 0 1 3.39 4.31 0.00 0.00 0.00 8.44 10.95
CLM_FREQ 0 1 0.80 1.16 0.00 0.00 0.00 2.00 5.00
MVR_PTS 0 1 1.70 2.15 0.00 0.00 1.00 3.00 13.00
CAR_AGE 0 1 8.34 5.70 0.00 1.00 8.00 12.00 28.00
CAR_AGE_BIN 0 1 2.48 1.12 1.00 1.00 2.00 3.00 4.00
HOME_VAL_BIN 0 1 2.45 1.16 1.00 1.00 2.00 3.00 4.00
TIF_BIN 0 1 2.41 1.16 1.00 1.00 2.00 3.00 4.00

To proceed with modeling, we’ll split our training data into train (75%) and validation (25%) datasets. Separate splits are used for the linear and logistic regression models. The split is necessary to measure model performance, since the evaluation dataset does not contain the target columns and therefore cannot tell us how good the models are.
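A sketch of the split using caret's createDataPartition, stratified on TARGET_FLAG so that both partitions keep the 26/74 class ratio (the seed and the validation-set name log_valid are assumptions):

```r
library(caret)

set.seed(42)
train_idx <- createDataPartition(train_transformed$TARGET_FLAG, p = 0.75, list = FALSE)
log_train <- train_transformed[train_idx, ]   # 75% used to fit the models
log_valid <- train_transformed[-train_idx, ]  # 25% held out for validation
```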

3. Binary Logistic Regression

We’ll use Binary Logistic Regression to classify our response variable TARGET_FLAG, the probability of a car crash for a given observation.

3.1 Model 1 - Original data

As the first step, we are going to build a generalized linear model based on the original training dataset, with some variables converted to categorical/numeric where needed and with no missing values. Since our dependent variable takes only two values (0 and 1), we use logistic regression, i.e. the glm() function with family = binomial. At the beginning, all variables are included.

With stepAIC, we can determine which variables should be included in our model to improve performance. Based on the results of the function, we will include the variables shown below.
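A minimal sketch of the full fit and the stepAIC selection (the direction and trace settings are assumptions; train_data is the 75% training split used for this model):

```r
library(MASS)

# full logistic model with all candidate predictors
log_model_1 <- glm(TARGET_FLAG ~ ., data = train_data,
                   family = binomial(link = "logit"))

# stepwise search (both directions) minimizing AIC
log_model_1_aic <- stepAIC(log_model_1, direction = "both", trace = FALSE)
summary(log_model_1_aic)
```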

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + INCOME + 
##     HOME_VAL + EDUCATION + JOB + TRAVTIME + BLUEBOOK + TIF + 
##     CAR_TYPE + OLDCLAIM + CLM_FREQ + MVR_PTS + MARRIED + LIC_REVOKED + 
##     PRIVATE_USE + SINGLE_PARENT + URBAN, family = binomial(link = "logit"), 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3959  -0.7159  -0.3939   0.6355   2.8340  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.608e+00  2.486e-01 -10.493  < 2e-16 ***
## KIDSDRIV1             6.283e-01  1.100e-01   5.711 1.12e-08 ***
## HOMEKIDS1             2.452e-01  1.039e-01   2.361 0.018249 *  
## YOJ                  -1.354e-02  9.531e-03  -1.421 0.155316    
## INCOME               -4.356e-06  1.290e-06  -3.378 0.000730 ***
## HOME_VAL             -1.209e-06  3.975e-07  -3.041 0.002357 ** 
## EDUCATIONBachelors   -3.803e-01  1.263e-01  -3.011 0.002601 ** 
## EDUCATIONHigh School  2.515e-02  1.096e-01   0.229 0.818504    
## EDUCATIONMasters     -2.375e-01  1.877e-01  -1.266 0.205663    
## EDUCATIONPhD          2.961e-02  2.302e-01   0.129 0.897631    
## JOBClerical           1.052e-01  1.233e-01   0.853 0.393426    
## JOBDoctor            -8.776e-01  3.267e-01  -2.686 0.007233 ** 
## JOBHome Maker        -2.935e-01  1.779e-01  -1.650 0.098977 .  
## JOBLawyer            -1.464e-01  2.138e-01  -0.685 0.493387    
## JOBManager           -9.237e-01  1.624e-01  -5.687 1.29e-08 ***
## JOBProfessional      -1.669e-01  1.366e-01  -1.222 0.221654    
## JOBStudent           -1.536e-01  1.483e-01  -1.036 0.300133    
## JOBUnknown           -3.764e-01  2.144e-01  -1.756 0.079102 .  
## TRAVTIME              1.459e-02  2.195e-03   6.649 2.96e-11 ***
## BLUEBOOK             -2.134e-05  5.503e-06  -3.878 0.000105 ***
## TIF                  -5.685e-02  8.418e-03  -6.754 1.44e-11 ***
## CAR_TYPEPanel Truck   6.898e-01  1.752e-01   3.938 8.21e-05 ***
## CAR_TYPEPickup        5.885e-01  1.150e-01   5.117 3.10e-07 ***
## CAR_TYPESports Car    9.766e-01  1.241e-01   7.868 3.59e-15 ***
## CAR_TYPESUV           7.358e-01  9.948e-02   7.396 1.40e-13 ***
## CAR_TYPEVan           6.922e-01  1.400e-01   4.944 7.67e-07 ***
## OLDCLAIM             -1.512e-05  4.570e-06  -3.308 0.000939 ***
## CLM_FREQ              1.965e-01  3.310e-02   5.937 2.90e-09 ***
## MVR_PTS               1.121e-01  1.565e-02   7.159 8.10e-13 ***
## MARRIED1             -5.286e-01  1.025e-01  -5.158 2.50e-07 ***
## LIC_REVOKED1          9.007e-01  1.054e-01   8.546  < 2e-16 ***
## PRIVATE_USE1         -7.258e-01  1.053e-01  -6.894 5.44e-12 ***
## SINGLE_PARENT1        2.073e-01  1.399e-01   1.481 0.138559    
## URBAN1                2.424e+00  1.330e-01  18.227  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7064.1  on 6120  degrees of freedom
## Residual deviance: 5472.0  on 6087  degrees of freedom
## AIC: 5540
## 
## Number of Fisher Scoring iterations: 5

The p-value associated with this Chi-Square statistic is below. The lower the p-value, the better the model is able to fit the dataset compared to a model with just an intercept term.

## [1] 0
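This p-value comes from comparing the null and residual deviances of the fitted model; a sketch of the computation:

```r
# likelihood-ratio (Chi-Square) test against the intercept-only model
chi_sq <- log_model_1_aic$null.deviance - log_model_1_aic$deviance   # 7064.1 - 5472.0
df_chi <- log_model_1_aic$df.null - log_model_1_aic$df.residual      # 6120 - 6087
pchisq(chi_sq, df = df_chi, lower.tail = FALSE)                      # effectively 0
```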

Model Performance

The coefficients for the model show us that if you have kids at home or kids who drive, your education stopped at high school, you work in a clerical job, you drive a sports car or SUV, your license was revoked within the last 7 years, or you are a single parent, there is a much higher chance of getting into a car crash. These variables have large positive coefficients, and they all follow the assumptions we made in the data exploration section. Other variables with positive coefficients have less impact but still follow common logic (the more claims you filed in the past, the more likely you are to file in the future; with many traffic tickets, you tend to get into more crashes; etc.).

The negative coefficients reduce the predicted chance of a crash. The factors that reduce the chances the most: education above high school, working as a doctor or manager, being married, and a vehicle in private use. Other variables with negative coefficients have less impact but still follow common logic (people who have been customers for a long time are safer, people who stay at a job for a long time are safer, etc.).
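Since logistic coefficients are on the log-odds scale, exponentiating them gives odds ratios, which can make these effects easier to read; a quick sketch:

```r
# odds ratios (exponentiated log-odds coefficients) with Wald confidence intervals
exp(cbind(OR = coef(log_model_1_aic), confint.default(log_model_1_aic)))
# e.g. exp(2.424) ~ 11.3: urban customers have roughly 11x the odds of a crash
```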

The null deviance of 7064.1 defines how well the target variable can be predicted by a model with only an intercept term.

The residual deviance of 5472.0 defines how well the target variable can be predicted by our AIC model that we fit with predictor variables mentioned above. The lower the value, the better the model is able to predict the value of the response variable.

The p-value associated with this Chi-Square statistic is 0, which is less than .05, so the model is useful compared to the intercept-only model.

The Akaike information criterion (AIC) is 5540.0. We will use this metric to compare the fit of different models. The lower the value, the better the regression model is able to fit the data.

The Accuracy is 0.798, Sensitivity is 0.43 and Specificity is 0.93. We should continue building other models to try to improve the predictions.
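The confusion matrix below comes from scoring the held-out validation split; a sketch, assuming caret, a 0.5 probability cutoff, and the validation set named log_valid:

```r
library(caret)

# score the validation set and classify at a 0.5 probability cutoff
pred_prob  <- predict(log_model_1_aic, newdata = log_valid, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))

confusionMatrix(pred_class, log_valid$TARGET_FLAG, positive = "1")
```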

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1397  306
##          1  105  232
##                                           
##                Accuracy : 0.7985          
##                  95% CI : (0.7805, 0.8157)
##     No Information Rate : 0.7363          
##     P-Value [Acc > NIR] : 3.182e-11       
##                                           
##                   Kappa : 0.4105          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4312          
##             Specificity : 0.9301          
##          Pos Pred Value : 0.6884          
##          Neg Pred Value : 0.8203          
##              Prevalence : 0.2637          
##          Detection Rate : 0.1137          
##    Detection Prevalence : 0.1652          
##       Balanced Accuracy : 0.6807          
##                                           
##        'Positive' Class : 1               
## 

Model Assumptions

We evaluate the modeling assumptions using standard diagnostic plots, marginal model plots, and a Variance Inflation Factor (VIF) to assess collinearity.

The first plot below (predicted values vs. residuals) can help us see whether there are any signs of over- or under-prediction. It seems that we are under-fitting at lower predicted values, with the predicted proportions being larger than the observed proportions; on the other hand, we are over-fitting for a couple of predictor patterns at higher predicted values, where the predicted values are much larger than the observed proportions. The labeled points refer to rows in our data. The Standardized Pearson Residuals should approximately follow a standard normal distribution if the model fits; the model seems good through the middle range, but there are extremes on the left. Approximate normality of the residuals for a binomial logistic regression model is only weak evidence of a decent-fitting model.

The Standardized Residuals plot seems to have a constant variance though there are some outliers.

The marginal model plot is a very useful graphical method for deciding if a logistic regression model is adequate or not. The marginal model plots below show reasonable agreement across the two sets of fits indicating that log_model_1_aic is a valid model.

In terms of multicollinearity, all variables have a VIF less than 5 except JOB and EDUCATION. As a result, multicollinearity between these variables may be a problem for the model. To improve the results, we may consider another way of choosing variables in the future.
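These diagnostics (residual plots, marginal model plots, and the generalized VIFs shown below) can be produced with base R and the car package; a minimal sketch:

```r
library(car)

par(mfrow = c(2, 2))
plot(log_model_1_aic)    # residual / leverage diagnostics
mmps(log_model_1_aic)    # marginal model plots: data smooths vs. model smooths
vif(log_model_1_aic)     # generalized VIFs, as reported below
```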

##                    GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV       1.362310  1        1.167180
## HOMEKIDS       2.298518  1        1.516086
## YOJ            1.409292  1        1.187136
## INCOME         2.852277  1        1.688869
## HOME_VAL       2.023704  1        1.422570
## EDUCATION      8.362579  4        1.304045
## JOB           22.703050  8        1.215501
## TRAVTIME       1.037786  1        1.018718
## BLUEBOOK       1.797552  1        1.340728
## TIF            1.011172  1        1.005570
## CAR_TYPE       2.630076  5        1.101531
## OLDCLAIM       1.642314  1        1.281528
## CLM_FREQ       1.467301  1        1.211322
## MVR_PTS        1.159611  1        1.076853
## MARRIED        2.321032  1        1.523493
## LIC_REVOKED    1.318414  1        1.148222
## PRIVATE_USE    2.418968  1        1.555303
## SINGLE_PARENT  2.349104  1        1.532679
## URBAN          1.134599  1        1.065175

3.2 Model 2 - Transformed data

Next, we can try to build a generalized linear model based on the dataset with the transformations and additional variables. Stepwise selection will help identify the variables that should be included in the model for better performance.

The stepAIC function tells us to include the variables below.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + INCOME + 
##     EDUCATION + JOB + TRAVTIME + BLUEBOOK + CAR_TYPE + OLDCLAIM + 
##     CLM_FREQ + MVR_PTS + MARRIED + LIC_REVOKED + PRIVATE_USE + 
##     SINGLE_PARENT + URBAN + HOME_VAL_BIN + TIF_BIN, family = binomial(link = "logit"), 
##     data = log_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4149  -0.7123  -0.3871   0.6233   3.1629  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.4356429  0.3170747  -4.528 5.96e-06 ***
## KIDSDRIV1             0.6125008  0.1107350   5.531 3.18e-08 ***
## HOMEKIDS1             0.1735345  0.1033879   1.678 0.093253 .  
## YOJ                   0.0181402  0.0120519   1.505 0.132280    
## INCOME               -0.0911977  0.0205675  -4.434 9.25e-06 ***
## EDUCATIONBachelors   -0.3816984  0.1253887  -3.044 0.002334 ** 
## EDUCATIONHigh School  0.0192705  0.1093994   0.176 0.860178    
## EDUCATIONMasters     -0.3027408  0.1866892  -1.622 0.104883    
## EDUCATIONPhD         -0.3648310  0.2239646  -1.629 0.103320    
## JOBClerical          -0.0384467  0.1220872  -0.315 0.752829    
## JOBDoctor            -0.9705109  0.3430773  -2.829 0.004672 ** 
## JOBHome Maker        -0.4197852  0.1915950  -2.191 0.028452 *  
## JOBLawyer            -0.2947849  0.2183775  -1.350 0.177052    
## JOBManager           -1.0023066  0.1640197  -6.111 9.91e-10 ***
## JOBProfessional      -0.2565155  0.1384117  -1.853 0.063842 .  
## JOBStudent           -0.5051695  0.1686454  -2.995 0.002740 ** 
## JOBUnknown           -0.3208242  0.2137563  -1.501 0.133385    
## TRAVTIME              0.0429172  0.0060385   7.107 1.18e-12 ***
## BLUEBOOK             -0.0042701  0.0008927  -4.783 1.72e-06 ***
## CAR_TYPEPanel Truck   0.5646337  0.1696985   3.327 0.000877 ***
## CAR_TYPEPickup        0.5526977  0.1171961   4.716 2.41e-06 ***
## CAR_TYPESports Car    0.9712816  0.1270891   7.643 2.13e-14 ***
## CAR_TYPESUV           0.7496064  0.1000067   7.496 6.60e-14 ***
## CAR_TYPEVan           0.7037403  0.1403958   5.013 5.37e-07 ***
## OLDCLAIM              0.0250502  0.0143675   1.744 0.081242 .  
## CLM_FREQ              0.0744074  0.0500295   1.487 0.136943    
## MVR_PTS               0.1006110  0.0162701   6.184 6.26e-10 ***
## MARRIED1             -0.4767708  0.1010670  -4.717 2.39e-06 ***
## LIC_REVOKED1          0.6638945  0.0939952   7.063 1.63e-12 ***
## PRIVATE_USE1         -0.7045502  0.1059483  -6.650 2.93e-11 ***
## SINGLE_PARENT1        0.2758411  0.1406583   1.961 0.049871 *  
## URBAN1                2.4774945  0.1331895  18.601  < 2e-16 ***
## HOME_VAL_BIN         -0.1885024  0.0397260  -4.745 2.08e-06 ***
## TIF_BIN              -0.2201255  0.0294004  -7.487 7.04e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7064.1  on 6120  degrees of freedom
## Residual deviance: 5427.8  on 6087  degrees of freedom
## AIC: 5495.8
## 
## Number of Fisher Scoring iterations: 5

The p-value associated with this Chi-Square statistic is below.

## [1] 0

Model Performance

The coefficients for the model show us that if you have kids who drive, you drive a sports car, SUV or van, your license was revoked within the last 7 years, or you live in an urban area, there is a much higher chance of getting into a car crash. These variables have large positive coefficients. Other variables with positive coefficients have less impact but still follow common logic (longer commutes lead to greater risk; with many traffic tickets, you tend to get into more crashes; etc.).

The factors with negative coefficients that reduce the chances the most: education above high school, working as a doctor or manager, and a vehicle in private use. Other variables with negative coefficients have less impact but still follow common logic (married people drive more safely, people who have been customers for a long time are safer, etc.). The results are similar to the original model.

The residual deviance is 5427.8.

The p-value associated with this Chi-Square statistic is 0, which is less than .05, so the model is useful.

The Akaike information criterion (AIC) is 5495.8.

The Accuracy is 0.794, Sensitivity is 0.43 and Specificity is 0.92. We should continue building other models to try to improve the predictions.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1389  308
##          1  113  230
##                                          
##                Accuracy : 0.7936         
##                  95% CI : (0.7754, 0.811)
##     No Information Rate : 0.7363         
##     P-Value [Acc > NIR] : 9.692e-10      
##                                          
##                   Kappa : 0.3986         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.4275         
##             Specificity : 0.9248         
##          Pos Pred Value : 0.6706         
##          Neg Pred Value : 0.8185         
##              Prevalence : 0.2637         
##          Detection Rate : 0.1127         
##    Detection Prevalence : 0.1681         
##       Balanced Accuracy : 0.6761         
##                                          
##        'Positive' Class : 1              
## 

Model Assumptions

The resulting plots show a picture similar to model 1: under-fitting at lower predicted values, with the predicted proportions being larger than the observed proportions, and over-fitting for a couple of predictor patterns at higher predicted values, where the predicted values are much larger than the observed proportions. The Standardized Pearson Residuals follow an approximately standard normal distribution, which is consistent with a model that fits; the model seems good through the middle range, but there are extremes on the left side. Approximate normality of the residuals for a binomial logistic regression model is only weak evidence of a decent-fitting model.

The Standardized Residuals plot seems to have a constant variance though there are some outliers.

The marginal model plots below show reasonable agreement across the two sets of fits, indicating that log_model_2_aic is a valid model. The plots look better than for model 1; the data and model values match almost everywhere.

In terms of multicollinearity, all variables have a VIF less than 5 except JOB and EDUCATION. As a result, multicollinearity between these variables may be a problem for the model. To improve the results, we will give up the stepwise approach in favor of lasso regression.

##                    GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV       1.354900  1        1.164002
## HOMEKIDS       2.252261  1        1.500753
## YOJ            2.299540  1        1.516424
## INCOME         4.018139  1        2.004530
## EDUCATION      7.327913  4        1.282692
## JOB           30.563613  8        1.238298
## TRAVTIME       1.037707  1        1.018679
## BLUEBOOK       1.658224  1        1.287721
## CAR_TYPE       2.523453  5        1.096982
## OLDCLAIM       3.594070  1        1.895803
## CLM_FREQ       3.362439  1        1.833695
## MVR_PTS        1.235074  1        1.111339
## MARRIED        2.229072  1        1.493008
## LIC_REVOKED    1.029671  1        1.014727
## PRIVATE_USE    2.428598  1        1.558396
## SINGLE_PARENT  2.293656  1        1.514482
## URBAN          1.146570  1        1.070780
## HOME_VAL_BIN   1.770648  1        1.330657
## TIF_BIN        1.012378  1        1.006170

3.3 Model 3 - Lasso

Lasso Regression may be a good candidate for this dataset, since we are dealing with a large number of complex variables. Lasso helps identify the most important variables and reduces the model complexity.

The cv.glmnet() function was also used to fit a logistic regression model. As with the lasso linear regression model in Section 4.3, k-fold cross-validation was performed with variable selection using lasso regularization. The following attribute settings were selected for the model (a sketch of the call follows the list):

  • type.measure = “class” - The type.measure is set to class to minimize the mis-classification errors of the model since the accurate classification of the validation data set is the desired outcome.
  • nfold = 10 - Given the size of the training dataset, we opted for 10-fold cross-validation as a default.
  • family = binomial - For Logistic Regression, the family attribute of the function is set to binomial.
  • link = logit - For this model, we choose the default link function for a logistic model.
  • alpha =1 - The alpha value of 1 sets the variable shrinkage method to lasso.
  • weights = a weight of 0.2638 / n for observations with TARGET_FLAG = 0 and 0.7362 / n for observations with TARGET_FLAG = 1, to counteract the class imbalance.
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this will normalize the prediction variables around a mean of zero and a standard deviation of one before modeling.
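glmnet requires a numeric design matrix, so the factors are expanded with model.matrix before calling cv.glmnet. A sketch of the approach under the settings above (the printed call below differs slightly, e.g. in the number of folds, and the class weights are omitted here for brevity):

```r
library(glmnet)

# expand factors into dummy columns and drop the intercept column
X <- model.matrix(TARGET_FLAG ~ ., data = log_train)[, -1]
Y <- log_train$TARGET_FLAG

set.seed(42)
lasso_logit <- cv.glmnet(X, Y, family = "binomial", type.measure = "class",
                         nfolds = 10, alpha = 1, standardize = TRUE)
lasso_logit$lambda.min   # lambda that minimizes cross-validated error
lasso_logit$lambda.1se   # most regularized lambda within one standard error
```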
## 
## Call:  cv.glmnet(x = X, y = Y, nfolds = 5, family = "binomial", link = "logit",      standardize = TRUE, alpha = 1) 
## 
## Measure: Binomial Deviance 
## 
##       Lambda Index Measure       SE Nonzero
## min 0.001145    49  0.9078 0.004214      35
## 1se 0.004212    35  0.9113 0.003162      29

Model Performance

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.

  • The coefficients extracted using lambda.min minimize the mean cross-validated error. The resulting model includes 35 non-zero coefficients and has an AICc of -1519.116.
  • The coefficients extracted using lambda.1se produce the most regularized model (the cross-validated error is within one standard error of the minimum). The resulting model includes 29 non-zero coefficients and has an AICc of -1495.889.

The coefficients extracted using lambda.min result in the lowest AICc and the highest-performing model.

The coefficients extracted at the lambda.min value are used to predict the likelihood of an accident. The confusion matrix below shows an accuracy of 79.8%.
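A sketch of scoring the validation set at lambda.min and tabulating the results (the design matrix X_valid, the 0.5 cutoff, and the object names are assumptions carried over from the earlier sketches):

```r
# predict crash probabilities for the validation set at lambda.min
X_valid    <- model.matrix(TARGET_FLAG ~ ., data = log_valid)[, -1]
prob_lasso <- predict(lasso_logit, newx = X_valid, s = "lambda.min", type = "response")
pred_lasso <- ifelse(prob_lasso > 0.5, 1, 0)

# cross-tabulate predictions against observed outcomes (cf. the table below)
table(Predicted = pred_lasso, True = log_valid$TARGET_FLAG)
```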

##          True
## Predicted    0   1 Total
##     0     1401 311  1712
##     1      101 227   328
##     Total 1502 538  2040
## 
##  Percent Correct:  0.798
## $deviance
## lambda.min 
##  0.8943977 
## attr(,"measure")
## [1] "Binomial Deviance"
## 
## $class
## lambda.min 
##  0.2118935 
## attr(,"measure")
## [1] "Misclassification Error"
## 
## $auc
## [1] 0.8133738
## attr(,"measure")
## [1] "AUC"
## 
## $mse
## lambda.min 
##   0.290788 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##  0.5868817 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1519.116
## 
## $BIC
## [1] -1284.348
## $AICc
## [1] -1495.889
## 
## $BIC
## [1] -1301.31

Model Assumptions

Taking a closer look at the remaining 36 non-zero coefficients at the selected lambda.min value, we observe that the URBANICITY Highly Urban/Urban predictor has by far the largest impact on the predicted probability of a car crash. Reviewing the top predictor variables that affect the likelihood of an accident:

  • URBANICITY Highly Urban/ Urban - working or living in an urban neighborhood increases the likelihood of a crash
  • CAR_USE Private - using a car for private (non-commercial) purposes reduces the likelihood of a crash
  • JOB Manager - being a manager reduces the likelihood of a crash
  • REVOKED Yes - a history of having a revoked license increases the likelihood of a crash
  • JOB Doctor - being a doctor reduces the likelihood of a crash

Only the EDUCATIONPhD and CAR_RED coefficients drop out of the model entirely; however, several variables, including INCOME, HOME_VAL and OLDCLAIM, are shrunk substantially.

## 38 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -2.645103e+00
## KIDSDRIV1             6.070639e-01
## AGE                   5.037397e-06
## HOMEKIDS1             2.374358e-01
## YOJ                  -9.989691e-03
## INCOME               -4.127964e-06
## HOME_VAL             -1.182664e-06
## EDUCATIONBachelors   -3.481361e-01
## EDUCATIONHigh School  2.733386e-02
## EDUCATIONMasters     -2.465319e-01
## EDUCATIONPhD          .           
## JOBClerical           1.650799e-01
## JOBDoctor            -6.764737e-01
## JOBHome Maker        -1.231470e-01
## JOBLawyer            -1.493779e-03
## JOBManager           -7.965356e-01
## JOBProfessional      -6.866516e-02
## JOBStudent           -3.515836e-02
## JOBUnknown           -2.169758e-01
## TRAVTIME              1.378769e-02
## BLUEBOOK             -1.783009e-05
## TIF                  -5.384104e-02
## CAR_TYPEPanel Truck   4.792317e-01
## CAR_TYPEPickup        4.905325e-01
## CAR_TYPESports Car    9.392331e-01
## CAR_TYPESUV           7.115781e-01
## CAR_TYPEVan           5.379744e-01
## OLDCLAIM             -1.268722e-05
## CLM_FREQ              1.853511e-01
## MVR_PTS               1.096592e-01
## CAR_AGE              -3.480499e-03
## MALE1                 8.620584e-02
## MARRIED1             -5.048800e-01
## LIC_REVOKED1          8.508616e-01
## CAR_RED1              .           
## PRIVATE_USE1         -7.615175e-01
## SINGLE_PARENT1        2.098218e-01
## URBAN1                2.344305e+00

The lasso regression addresses the multicollinearity issue by keeping one variable from each group of correlated predictors while setting the rest to (nearly) zero.

3.4 Model selection

To choose which of our three models to use for prediction, we will focus on performance only, i.e. comparing ROC curves, performance statistics and confusion matrices; the models do not differ greatly in their number of predictors.

Ideally, all three comparisons would favor a single model, but they did not in our case. Comparing all the models, we see they are very similar, except that model 3 has a slightly higher miss rate (lower sensitivity).

Even though model 3’s sensitivity is slightly lower than that of models 1 and 2, its AIC is the lowest (-1519 compared to 5540/5495 for models 1/2), as is its deviance, so we’ll choose it for predictions. Models 1 and 2 are similar, as we used stepAIC for both of them when selecting variables, but the Box-Cox transformations produced a lower AIC for model 2.

Model Accuracy Classification error rate F1 Deviance R2 Sensitivity Specificity Precision AUC AIC
Model #1: Original data 0.7985 0.2015 0.5303 5471.9902 0.2254 0.4312 0.9301 0.6884 0.6806600 5539.990
Model #2: Transformed data 0.7936 0.2064 0.5221 5427.8251 0.2316 0.4275 0.9248 0.6706 0.6761381 5495.825
Model #3: Lasso 0.8155 0.1845 0.5242 0.8912 NA 0.4219 0.9328 0.6921 0.8154654 -1519.116
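One way to compute the ROC curves and AUC values referenced in the comparison is with the pROC package; a sketch using the validation predictions from the earlier sketches (object names are assumptions):

```r
library(pROC)

# ROC curves / AUC on the validation set for models 1 and 3
roc_model_1 <- roc(response = log_valid$TARGET_FLAG, predictor = pred_prob)
roc_model_3 <- roc(response = log_valid$TARGET_FLAG, predictor = as.numeric(prob_lasso))
auc(roc_model_1)
auc(roc_model_3)

plot(roc_model_1)
lines(roc_model_3, col = "red")
```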

4. Multiple Linear Regression

We’ll use Multiple Linear Regression to model the TARGET_AMT response variable, the estimated cost of a crash for a given observation. The data includes only observations with TARGET_FLAG = 1 (i.e. a crash occurred); if there is no crash, no payment is needed.

4.1 Model 1 - Original data

As the first step, we are going to build a linear model based on the original training dataset, with some variables converted to categorical/numeric where needed and with no missing values. First, we include all variables and use a stepwise approach to select variables for the model.

With stepAIC, we can determine which variables should be included in our model to improve performance. Based on the results of the function, we will include the BLUEBOOK, MVR_PTS, MALE1 and LIC_REVOKED1 variables; a sketch of the fit follows.
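A sketch of this fit, assuming the crash-only training subset (1,617 observations) and MASS::stepAIC; the data frame name lm_train and the selection settings are assumptions, and jtools::summ produces the summary format shown below:

```r
library(MASS)
library(jtools)

# crash-cost model is fit only on records where a crash occurred
crash_train <- subset(lm_train, TARGET_FLAG == 1)

lm_model_1     <- lm(TARGET_AMT ~ . - TARGET_FLAG, data = crash_train)
lm_model_1_aic <- stepAIC(lm_model_1, direction = "both", trace = FALSE)
summ(lm_model_1_aic)   # jtools summary, as shown below
```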
Observations 1617
Dependent variable TARGET_AMT
Type OLS linear regression
F(4,1612) 9.1841
R² 0.0223
Adj. R² 0.0199
Est. S.E. t val. p
(Intercept) 3502.9583 479.1917 7.3101 0.0000
BLUEBOOK 0.1278 0.0241 5.2951 0.0000
MVR_PTS 126.7957 76.4841 1.6578 0.0976
MALE1 634.2666 400.9366 1.5820 0.1139
LIC_REVOKED1 -704.6944 489.8423 -1.4386 0.1505
Standard errors: OLS

Model Performance

The F-statistic is 9.18, the adjusted R-squared is 0.02, and of the retained terms only the intercept and BLUEBOOK have statistically significant p-values. The RMSE is 6621.3; it shows how far the predictions fall from the measured true values. The adjusted R² indicates that only about 2% of the variance in the response variable is explained by the predictor variables. Positive coefficients increase the predicted cost of a crash: a high number of Motor Vehicle Record points and being male increase the cost, while the negative coefficient for a revoked license decreases it. The last effect seems counterintuitive; we will look into it in the next model.

Although the overall F-test is statistically significant, the model’s explanatory power is very low. We should look at other models for better performance.

Adjusted R2: 0.01986
term estimate p.value
(Intercept) 3502.958 0.000
BLUEBOOK 0.128 0.000
MVR_PTS 126.796 0.098
MALE1 634.267 0.114
LIC_REVOKED1 -704.694 0.150

Model Assumptions

We evaluate the linear modeling assumptions using standard diagnostic plots and the Variance Inflation Factor (VIF) to assess collinearity.

The initial review of the diagnostic plots for this model shows some deviation from the linear modeling assumptions. The Q-Q plot shows deviations from the normal distribution at the left end. The residuals-vs-fitted and standardized-residuals-vs-fitted plots show a curve in the main cluster, which indicates that we do not have constant variance. None of the included variables show signs of collinearity.

4.2 Model 2 - Transformed data

Next, we are going to build a linear model based on the transformed dataset with the Box-Cox and log transformations and some new variables, without any missing values. First, we include all variables and use a stepwise approach to select variables for the model.

Based on the results of the function, we will include the BLUEBOOK, MVR_PTS, CLM_FREQ, SINGLE_PARENT1 and EDUCATION variables.
Observations 1617
Dependent variable TARGET_AMT
Type OLS linear regression
F(8,1608) 4.8523
R² 0.0236
Adj. R² 0.0187
Est. S.E. t val. p
(Intercept) 7.9717 0.0859 92.8526 0.0000
EDUCATIONBachelors -0.0812 0.0622 -1.3056 0.1919
EDUCATIONHigh School -0.0239 0.0569 -0.4206 0.6741
EDUCATIONMasters 0.0562 0.0706 0.7960 0.4261
EDUCATIONPhD 0.1448 0.0972 1.4901 0.1364
BLUEBOOK 0.0017 0.0004 3.9506 0.0001
CLM_FREQ -0.0320 0.0169 -1.8984 0.0578
MVR_PTS 0.0197 0.0080 2.4566 0.0141
SINGLE_PARENT1 0.0941 0.0484 1.9439 0.0521
Standard errors: OLS

Model Performance

The resulting model is more parsimonious than the full model, retaining the five predictors summarized below.

The adjusted R-squared is comparable to Model 1’s and still very low (0.019), meaning this model explains only about 2% of the total variance in the response variable TARGET_AMT. The RMSE is 0.8 on the log-transformed scale (the lower, the better). The F-statistic is 4.85, and of the retained terms, BLUEBOOK and MVR_PTS have statistically significant p-values. Positive coefficients increase the predicted cost of a crash: PhD and Masters education levels, a high number of Motor Vehicle Record points, and being a single parent are associated with a higher cost. Negative coefficients for claim frequency (past 5 years) and for high school and bachelor’s education levels decrease the cost. We should try another approach to variable selection, since not all of these results make sense.

Adjusted R2: 0.01871
term estimate p.value
(Intercept) 7.972 0.000
BLUEBOOK 0.002 0.000
MVR_PTS 0.020 0.014
SINGLE_PARENT1 0.094 0.052
CLM_FREQ -0.032 0.058
EDUCATIONPhD 0.145 0.136
EDUCATIONBachelors -0.081 0.192
EDUCATIONMasters 0.056 0.426
EDUCATIONHigh School -0.024 0.674

Model Assumptions

However, an examination of the residuals indicates that most of the key assumptions for linear regression are met: the Residuals vs Fitted plot shows more constant variability of the residuals, and the Q-Q plot indicates a much better level of normality compared to model 1. None of the included variables show signs of collinearity.

4.3 Model 3 - Lasso Regression

The cv.glmnet() function was used to perform k-fold cross-validation with variable selection using lasso regularization. The following attribute settings were selected for the model:

  • type.measure = “mse” - The type.measure is set to minimize the Mean Squared Error for the model.
  • nfold = 10 - Given the size of the dataset we defaulted to 10-fold cross-validation.
  • family = gaussian - For Linear Regression
  • alpha = 1 - The alpha value of 1 sets the variable shrinkage method to lasso.
  • weights = a weight of 0.2638 / n for observations with TARGET_AMT = 0 and 0.7362 / n for observations with any other value of TARGET_AMT.
  • standardize = TRUE - Finally, we explicitly set the standardization attribute to TRUE; this will normalize the prediction variables around a mean of zero and a standard deviation of one before modeling.
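A sketch of this call (the observation weights are omitted and the data frame name lm_train is an assumption; the printed call below is the report's actual output):

```r
library(glmnet)

# design matrix for the crash-cost model (factors expanded to dummies)
X <- model.matrix(TARGET_AMT ~ ., data = lm_train)[, -1]
Y <- lm_train$TARGET_AMT

set.seed(42)
lasso.lm.model <- cv.glmnet(X, Y, type.measure = "mse", nfolds = 10,
                            family = "gaussian", alpha = 1, standardize = TRUE)
```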

The resulting model is explored by extracting coefficients at two different values for lambda, lambda.min and lambda.1se respectively.

The coefficients extracted using lambda.min minimize the mean cross-validated error; the resulting model includes 33 non-zero coefficients and has an AIC of 60.08. The coefficients extracted using lambda.1se produce the most regularized model (the cross-validated error is within one standard error of the minimum); for this model there are 25 non-zero coefficients and an AIC of 44.23.

The coefficients extracted using lambda.1se result in the lowest AIC (highest model performance) with fewer predictor variables.

## 
## Call:  cv.glmnet(x = X, y = Y, type.measure = "mse", nfolds = 10, family = "gaussian",      standardize = TRUE, alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index  Measure       SE Nonzero
## min  424.8    11 64477454 10231301       1
## 1se 1077.1     1 65308879 10292584       0

## $mse
## lambda.min 
##   64278815 
## attr(,"measure")
## [1] "Mean-Squared Error"
## 
## $mae
## lambda.min 
##   3708.151 
## attr(,"measure")
## [1] "Mean Absolute Error"
## $AICc
## [1] -1584023783
## 
## $BIC
## [1] -1584023777
## $AICc
## [1] 0
## 
## $BIC
## [1] 0

Taking a closer look at the non-zero coefficients for the selected lambda value of lambda.min, we can observe that the URBANICITY Highly Urban/Urban predictor has the largest impact on the response variable TARGET_AMT.

In the lasso model, the coefficient for URBANICITY Highly Urban/Urban (home/work area) is the biggest contributor to the cost estimate of a car crash, by roughly a factor of two.

Reviewing the top predictor variables that affect the expected cost associated with an accident:

  • URBANICITY Highly Urban/ Urban - working or living in an urban neighborhood increases the expected cost associated with a crash
  • JOB Doctor - being a doctor reduces the expected costs associated with a crash
  • JOB Manager - being a manager reduces the expected costs associated with a crash
  • CAR_TYPE Sports Car - owning a sports car increases the expected costs associated with a crash
  • CAR_USE Private - using a car for private activities reduces the expected costs associated with a crash
  • REVOKED Yes - a history of having a revoked license increases the expected costs associated with a crash

Some of the notable coefficients that drop out of the model include:

  • HOME_VAL
  • JOB Professional
  • CAR_AGE
  • CAR_RED
## 38 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          4.655667e+03
## KIDSDRIV1            .           
## AGE                  .           
## HOMEKIDS1            .           
## YOJ                  .           
## INCOME               .           
## HOME_VAL             .           
## EDUCATIONBachelors   .           
## EDUCATIONHigh School .           
## EDUCATIONMasters     .           
## EDUCATIONPhD         .           
## JOBClerical          .           
## JOBDoctor            .           
## JOBHome Maker        .           
## JOBLawyer            .           
## JOBManager           .           
## JOBProfessional      .           
## JOBStudent           .           
## JOBUnknown           .           
## TRAVTIME             .           
## BLUEBOOK             7.891294e-02
## TIF                  .           
## CAR_TYPEPanel Truck  .           
## CAR_TYPEPickup       .           
## CAR_TYPESports Car   .           
## CAR_TYPESUV          .           
## CAR_TYPEVan          .           
## OLDCLAIM             .           
## CLM_FREQ             .           
## MVR_PTS              .           
## CAR_AGE              .           
## MALE1                .           
## MARRIED1             .           
## LIC_REVOKED1         .           
## CAR_RED1             .           
## PRIVATE_USE1         .           
## SINGLE_PARENT1       .           
## URBAN1               .

As mentioned earlier, the dataset has high correlation between some predictor variables. The lasso regression approaches this issue by keeping one variable from each group of correlated predictors and shrinking the remaining variables (as can be seen in the plot of coefficients).

Model Performance

The lasso model using coefficients extracted at lambda.1se was used to predict the validation cases, comparing the predicted insurance amount to the actual cost of a car crash. The predicted costs include negative numbers that are effectively 0, so we selected a threshold cost and assigned 0 to all amounts below that threshold.

Since 337.50 was the lowest crash cost in the training data, we used 100 as the measurement threshold and assume that all predicted costs below 100 dollars are effectively 0.

Using the yardstick package to measure model performance, we obtain mape = 165, smape = 58.0, mase = 0.696 and mpe = -148 (the full output is shown below); the large percentage errors point to a weak fit.
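A sketch of this post-processing and the yardstick evaluation (the object names lm_valid and X_valid_lm, and the prediction step itself, are assumptions):

```r
library(dplyr)
library(yardstick)

results <- tibble(
  truth    = lm_valid$TARGET_AMT,
  estimate = as.numeric(predict(lasso.lm.model, newx = X_valid_lm, s = "lambda.1se"))
) %>%
  # predicted costs below the $100 threshold are treated as zero
  mutate(estimate = ifelse(estimate < 100, 0, estimate))

metric_set(mape, smape, mase, mpe)(results, truth = truth, estimate = estimate)
```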

## # A tibble: 4 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 mape    standard     165.   
## 2 smape   standard      58.0  
## 3 mase    standard       0.696
## 4 mpe     standard    -148.

Analyzing a scatter plot of the prediction errors vs. measured costs and a comparative histogram of the predicted and measured costs highlights a shortcoming of the lasso model. The model consistently predicts crash costs lower than the actual measured crash costs. The gap is more pronounced when looking at the predicted and measured distributions side by side.

Model Assumptions

To reduce multicollinearity we can use regularization, which keeps all the features but reduces the magnitude of the model coefficients. This is a good solution when each predictor contributes to predicting the dependent variable.

The Standardized Residuals plot shows increasing variance at higher values of the response variable.

The lasso regression addresses the multicollinearity issue by keeping one variable from each group of correlated predictors while setting the rest to (nearly) zero.

4.4 Model selection

Selected model: Model 4.3 - Lasso Linear Regression.

5. Predictions

6. Conclusion

7. References

Appendix: R code