In this case study, we will examine the dataCar dataset in the insuranceData package in R. This dataset is based on a total of n = 67,856 one-year vehicle insurance policies.

The target variable is clm, a binary variable, equal to 1 if a claim occurred over the policy period and 0 otherwise.

Our objective here is to construct appropriate GLMs to identify key factors associated with claim occurrence (interpretability).

Variable    Description                         Characteristics
veh_value   Vehicle value (in $10,000s)         Numeric from 0 to 34.560
exposure    Exposure of policy                  Numeric from 0 to 1
clm         Claim occurrence                    0 = no, 1 = yes
numclaims   Number of claims                    Integer from 0 to 4
claimcst0   Claim size                          Positive value from 0 to 55,922.1 (= 0 if no claim)
veh_body    Vehicle body type                   Coded as BUS, CONVT (convertible), COUPE, HBACK (hatchback), HDTOP (hardtop), MCARA (motorized caravan), MIBUS (minibus), PANVN (panel van), RDSTR (roadster), SEDAN, STNWG (station wagon), TRUCK, UTE (utility)
veh_age     Vehicle age                         Integer from 1 (youngest) to 4
gender      Gender of policyholder              M = male, F = female
area        Area of residence of policyholder   A, B, C, D, E, F
agecat      Age band of policyholder            Integer from 1 (youngest) to 6
X_OBSTAT_   (Unknown)                           A categorical variable with one level "01101000"

TASK 1: Edit the data

# CHUNK 1
library(insuranceData)

# Load dataCar
data(dataCar)

# Observe summary of data
summary(dataCar)
##    veh_value         exposure             clm            numclaims      
##  Min.   : 0.000   Min.   :0.002738   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.: 1.010   1st Qu.:0.219028   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median : 1.500   Median :0.446270   Median :0.00000   Median :0.00000  
##  Mean   : 1.777   Mean   :0.468651   Mean   :0.06814   Mean   :0.07276  
##  3rd Qu.: 2.150   3rd Qu.:0.709103   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :34.560   Max.   :0.999316   Max.   :1.00000   Max.   :4.00000  
##                                                                         
##    claimcst0          veh_body        veh_age      gender    area     
##  Min.   :    0.0   SEDAN  :22233   Min.   :1.000   F:38603   A:16312  
##  1st Qu.:    0.0   HBACK  :18915   1st Qu.:2.000   M:29253   B:13341  
##  Median :    0.0   STNWG  :16261   Median :3.000             C:20540  
##  Mean   :  137.3   UTE    : 4586   Mean   :2.674             D: 8173  
##  3rd Qu.:    0.0   TRUCK  : 1750   3rd Qu.:4.000             E: 5912  
##  Max.   :55922.1   HDTOP  : 1579   Max.   :4.000             F: 3578  
##                    (Other): 2532                                      
##      agecat                     X_OBSTAT_    
##  Min.   :1.000   01101    0    0    0:67856  
##  1st Qu.:2.000                               
##  Median :3.000                               
##  Mean   :3.485                               
##  3rd Qu.:5.000                               
##  Max.   :6.000                               
## 

Remove questionable observations and variables that are not suitable as predictors

# CHUNK 2

# Remove variables that are not suitable as predictors
dataCar$X_OBSTAT_ <- NULL
dataCar$numclaims <- NULL  # Remove: it would perfectly predict the target (numclaims > 0 exactly when clm = 1)
dataCar$claimcst0 <- NULL  # Remove: it would perfectly predict the target (claimcst0 > 0 exactly when clm = 1)

# Vehicle value cannot be zero. Check how many observations have veh_value equal to zero; if the count is small, remove them.
nrow(dataCar[dataCar$veh_value == 0,])
## [1] 53
dataCar <- dataCar[dataCar$veh_value > 0,]

summary(dataCar)
##    veh_value         exposure             clm             veh_body    
##  Min.   : 0.180   Min.   :0.002738   Min.   :0.00000   SEDAN  :22232  
##  1st Qu.: 1.010   1st Qu.:0.219028   1st Qu.:0.00000   HBACK  :18915  
##  Median : 1.500   Median :0.446270   Median :0.00000   STNWG  :16234  
##  Mean   : 1.778   Mean   :0.468481   Mean   :0.06811   UTE    : 4586  
##  3rd Qu.: 2.150   3rd Qu.:0.709103   3rd Qu.:0.00000   TRUCK  : 1742  
##  Max.   :34.560   Max.   :0.999316   Max.   :1.00000   HDTOP  : 1579  
##                                                        (Other): 2515  
##     veh_age      gender    area          agecat     
##  Min.   :1.000   F:38584   A:16302   Min.   :1.000  
##  1st Qu.:2.000   M:29219   B:13328   1st Qu.:2.000  
##  Median :3.000             C:20530   Median :3.000  
##  Mean   :2.673             D: 8161   Mean   :3.485  
##  3rd Qu.:4.000             E: 5907   3rd Qu.:5.000  
##  Max.   :4.000             F: 3575   Max.   :6.000  
## 

Do factor conversions

# CHUNK 3

# Convert the numeric predictors to factors where needed (the target variable does not need to be converted)
dataCar$veh_age <- as.factor(dataCar$veh_age)
dataCar$agecat <- as.factor(dataCar$agecat)

# Relevel the categorical predictors so the most frequent level is the baseline, using a for loop.
# Save the names of the categorical predictors as a vector
vars.cat <- c("veh_body", "veh_age", "gender", "area", "agecat")

for (v in vars.cat) {
  df <- as.data.frame(table(dataCar[, v]))   # level frequencies
  max <- which.max(df[, 2])                  # index of the most frequent level
  level.name <- as.character(df[max, 1])
  dataCar[, v] <- relevel(dataCar[, v], ref = level.name)
}


# Check the releveling using the summary() function
summary(dataCar)
##    veh_value         exposure             clm             veh_body    
##  Min.   : 0.180   Min.   :0.002738   Min.   :0.00000   SEDAN  :22232  
##  1st Qu.: 1.010   1st Qu.:0.219028   1st Qu.:0.00000   HBACK  :18915  
##  Median : 1.500   Median :0.446270   Median :0.00000   STNWG  :16234  
##  Mean   : 1.778   Mean   :0.468481   Mean   :0.06811   UTE    : 4586  
##  3rd Qu.: 2.150   3rd Qu.:0.709103   3rd Qu.:0.00000   TRUCK  : 1742  
##  Max.   :34.560   Max.   :0.999316   Max.   :1.00000   HDTOP  : 1579  
##                                                        (Other): 2515  
##  veh_age   gender    area      agecat   
##  3:20060   F:38584   C:20530   4:16179  
##  1:12254   M:29219   A:16302   1: 5734  
##  2:16582             B:13328   2:12868  
##  4:18907             D: 8161   3:15757  
##                      E: 5907   5:10722  
##                      F: 3575   6: 6543  
## 

TASK 2: Perform univariate exploration of the numeric variables: veh_value, exposure

Graphical representation

# CHUNK 4

# Draw histogram: veh_value, exposure 
library(ggplot2)
ggplot(dataCar, aes(x = veh_value)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(dataCar, aes(x = exposure)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Apply transformation and draw histogram
ggplot(dataCar, aes(x = log(veh_value))) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Replace the original variable with the transformed variable in the data
dataCar$log_veh_value <- log(dataCar$veh_value)
dataCar$veh_value <- NULL
[Comment 1: Determine which variable is right skewed and which transformation can be applied to it.]
  • The mean of vehicle value (1.778) is higher than its median (1.500), and the maximum (34.560) is far above the third quartile (2.150). This indicates a right-skewed variable, which is confirmed by the histogram.
  • For a strictly positive, right-skewed variable, a log transformation is appropriate (see the sketch below).
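
The mean-versus-median reasoning above can be backed by a direct numeric check. A minimal base-R sketch (the skewness helper is ad hoc, not from any loaded package); since veh_value has already been dropped, exp(log_veh_value) recovers the original scale:

# Moment-based sample skewness: positive values indicate right skew
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(exp(dataCar$log_veh_value))  # original scale: strongly positive
skewness(dataCar$log_veh_value)       # log scale: much closer to zero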

TASK 3: Explore the relationship of each predictor to clm

Numerical feature

Graphical representation: split boxplot
# CHUNK 5

# Draw split boxplots of log_veh_value and exposure over the two levels of clm
ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) + geom_boxplot()

ggplot(dataCar, aes(x = factor(clm), y = exposure)) + geom_boxplot()

Descriptive statistics: mean and median of log_veh_value and exposure over the two levels of clm
# CHUNK 6

# Compute the mean, median, and number of observations of log_veh_value and exposure over the two levels of clm
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   0.3.4      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
dataCar %>% group_by(clm) %>% summarise(mean = mean(log_veh_value), median = median(log_veh_value), n = n())
## # A tibble: 2 × 4
##     clm  mean median     n
##   <int> <dbl>  <dbl> <int>
## 1     0 0.382  0.399 63185
## 2     1 0.448  0.451  4618
dataCar %>% group_by(clm) %>% summarise(mean = mean(exposure), median = median(exposure), n = n())
## # A tibble: 2 × 4
##     clm  mean median     n
##   <int> <dbl>  <dbl> <int>
## 1     0 0.458  0.427 63185
## 2     1 0.611  0.637  4618

Categorical feature

Graphical representation: filled bar
# CHUNK 7

# Use the vector vars.cat to draw a filled bar chart for each categorical predictor
for (v in vars.cat) {
  plot <- ggplot(dataCar, aes(x = .data[[v]], fill = factor(clm))) +
    geom_bar(position = "fill") +
    labs(x = v, y = "Proportion")
  print(plot)
}

[Comment 2: interpret the graphs]
  • veh_body: The claim occurrence rate is much higher for the BUS vehicle body.
  • veh_age: Vehicles in group 2 have a slightly higher claim occurrence rate than the other groups.
  • gender: The claim occurrence rates look similar for the two genders.
  • area: Area D has a lower claim occurrence rate, while areas B and F seem to have the highest rates.
  • agecat: The claim occurrence rate seems to decrease as the policyholder's age increases.
Descriptive statistics
# CHUNK 8

# Compute mean and number of observations of clm over levels of veh_body
dataCar %>% group_by(veh_body) %>% summarise(mean = mean(clm), n = n())
## # A tibble: 13 × 3
##    veh_body   mean     n
##    <fct>     <dbl> <int>
##  1 SEDAN    0.0664 22232
##  2 BUS      0.184     38
##  3 CONVT    0.0370    81
##  4 COUPE    0.0872   780
##  5 HBACK    0.0668 18915
##  6 HDTOP    0.0823  1579
##  7 MCARA    0.116    121
##  8 MIBUS    0.0601   716
##  9 PANVN    0.0824   752
## 10 RDSTR    0.0741    27
## 11 STNWG    0.0721 16234
## 12 TRUCK    0.0683  1742
## 13 UTE      0.0567  4586
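
The same grouped summary can be produced for the remaining categorical predictors. A short sketch reusing vars.cat (across(all_of()) is a standard dplyr idiom; this loop is not part of the original chunks):

# Mean claim occurrence rate and counts for every categorical predictor
for (v in vars.cat) {
  print(dataCar %>% group_by(across(all_of(v))) %>% summarise(mean = mean(clm), n = n()))
}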

TASK 4: Reduce the number of factor levels where appropriate

[Comment 3: do we need to reduce the number of levels of veh_body?]
  • The variable veh_body has a large number of levels (13). This would increase the number of predictors and may dilute the predictive power of the fitted GLM.
  • Some levels of veh_body have relatively few observations ("BUS", "CONVT", "RDSTR").
  • We will therefore reduce the number of factor levels by combining the sparse levels with more populous levels that have a similar mean claim occurrence rate.
[Comment 4: reduce the number of levels and justify the grouping]
  • Ultimately, we choose to relevel veh_body into two levels.

  • The BUS level has few observations and the highest claim occurrence rate, so we combine it with the level with the second highest rate ("MCARA"). The new level is named "BUS-MCARA".

  • The other two sparse levels, "CONVT" and "RDSTR", have mean claim occurrence rates similar to the remaining levels (roughly 0.03 to 0.09), so we combine all levels other than "BUS" and "MCARA" into a single level called "OTHER".

# CHUNK 9

# Combine "BUS" and "MCARA", which has the highest claim occurence rate as one level called "BUS-MCARA"
levels(dataCar$veh_body)[levels(dataCar$veh_body) == "BUS"] <- "BUS-MCARA"
levels(dataCar$veh_body)[levels(dataCar$veh_body) == "MCARA"] <- "BUS-MCARA"

# Combine all levels other than "BUS-MCARA" into "OTHER"
levels(dataCar$veh_body)[!(levels(dataCar$veh_body) == "BUS-MCARA")] <- "OTHER"

# Check the output using summary()
summary(dataCar)
##     exposure             clm               veh_body     veh_age   gender   
##  Min.   :0.002738   Min.   :0.00000   OTHER    :67644   3:20060   F:38584  
##  1st Qu.:0.219028   1st Qu.:0.00000   BUS-MCARA:  159   1:12254   M:29219  
##  Median :0.446270   Median :0.00000                     2:16582            
##  Mean   :0.468481   Mean   :0.06811                     4:18907            
##  3rd Qu.:0.709103   3rd Qu.:0.00000                                        
##  Max.   :0.999316   Max.   :1.00000                                        
##  area      agecat    log_veh_value     
##  C:20530   4:16179   Min.   :-1.71480  
##  A:16302   1: 5734   1st Qu.: 0.00995  
##  B:13328   2:12868   Median : 0.40547  
##  D: 8161   3:15757   Mean   : 0.38675  
##  E: 5907   5:10722   3rd Qu.: 0.76547  
##  F: 3575   6: 6543   Max.   : 3.54270

We will include the interaction between log_veh_value and veh_body when constructing GLMs in later tasks.
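
As a quick graphical check that such an interaction is plausible, one can compare the distribution of log_veh_value by claim status within each (now two-level) veh_body group. A sketch, not part of the original chunks:

# Split boxplots of log_veh_value by clm, faceted by vehicle body group
ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) +
  geom_boxplot() +
  facet_wrap(~ veh_body)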

TASK 6: Specify an offset

Creation of training and test sets

# CHUNK 11
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(4769)
partition <- createDataPartition(y = dataCar$clm, p = .75, list = FALSE)
data.train <- dataCar[partition, ]
data.test <- dataCar[-partition, ]
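
createDataPartition attempts to preserve the distribution of y across the split, so the claim occurrence rates of the two sets should be close. A quick check (not part of the original chunks):

# Compare claim occurrence rates between the training and test sets
mean(data.train$clm)
mean(data.test$clm)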
# CHUNK 12

# Fit the logistic regression model (binomial GLM), excluding exposure but including
# the log_veh_value:veh_body interaction
logit <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train, family = binomial())


# Refit logistic regression model including exposure as an offset
logit.offset <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train, family = binomial(), offset = log(exposure))

# Evaluate the models using the AIC metric
AIC(logit)
## [1] 25354.09
AIC(logit.offset)
## [1] 24459.17
[Comment 5: interpret the results]
  • The AIC of the logistic regression without an offset is 25354.09; with log(exposure) as an offset it is 24459.17.
  • Because the AIC of the model with the offset is lower, it is recommended to incorporate the offset to improve the quality of the logistic regression model.
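
A common sanity check for the offset choice (a sketch, not part of the original analysis) is to fit log(exposure) as an ordinary predictor instead; an estimated coefficient near 1 supports constraining it to 1 via an offset:

# Fit log(exposure) as a free predictor and inspect its coefficient
logit.check <- glm(clm ~ . - exposure + log_veh_value:veh_body + log(exposure),
                   data = data.train, family = binomial())
summary(logit.check)$coefficients["log(exposure)", ]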

TASK 7: Generate confusion matrices and AUCs

Generate confusion matrices (based on an appropriate cutoff), plot the ROC curves, and calculate AUCs for the model fitted on the training set and the test set

# CHUNK 13

cutoff <- 0.06811  # the overall claim occurrence rate (mean of clm); other values can be tried

# Generate predicted probabilities on the training set
pred.train <- predict(logit.offset, type = "response")

# Turn predicted probabilities into predicted classes
class.train <- ifelse(pred.train > cutoff, 1, 0)

# Generate confusion matrix on the training set 
confusionMatrix(factor(class.train), factor(data.train$clm), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 25474  1017
##          1 21885  2477
##                                          
##                Accuracy : 0.5496         
##                  95% CI : (0.5453, 0.554)
##     No Information Rate : 0.9313         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0655         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.70893        
##             Specificity : 0.53789        
##          Pos Pred Value : 0.10167        
##          Neg Pred Value : 0.96161        
##              Prevalence : 0.06871        
##          Detection Rate : 0.04871        
##    Detection Prevalence : 0.47907        
##       Balanced Accuracy : 0.62341        
##                                          
##        'Positive' Class : 1              
## 
# Generate predicted probabilities on the test set
pred.test <- predict(logit.offset, type = "response", newdata = data.test)
class.test <- ifelse(pred.test > cutoff, 1, 0)

# Generate confusion matrix on the test set 
confusionMatrix(factor(class.test), factor(data.test$clm), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8483  332
##          1 7343  792
##                                           
##                Accuracy : 0.5472          
##                  95% CI : (0.5397, 0.5547)
##     No Information Rate : 0.9337          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0617          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.70463         
##             Specificity : 0.53602         
##          Pos Pred Value : 0.09736         
##          Neg Pred Value : 0.96234         
##              Prevalence : 0.06631         
##          Detection Rate : 0.04673         
##    Detection Prevalence : 0.47994         
##       Balanced Accuracy : 0.62032         
##                                           
##        'Positive' Class : 1               
## 
[Comment 6: interpret the confusion matrix and statistics]
  • In both the training and test confusion matrices, the accuracy is only marginally higher than 0.5.

  • Insurance companies should be more interested in correctly classifying policyholders with a claim than policyholders without a claim. In other words, the sensitivity of the model should be of more interest to the insurance company than the specificity.

  • The model has decent sensitivity on both the training and test sets, whereas its specificity is relatively low. This implies that the model identifies policies with a claim well but is not as good at identifying policies with no claim.

  • The accuracy, sensitivity, and specificity on the test set are all slightly lower than the corresponding quantities on the training set. This indicates a small amount of overfitting in the logistic regression model.
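
The cutoff controls the sensitivity/specificity trade-off. A small sketch (the alternative cutoffs are illustrative, not from the original chunks) recomputing both on the training set:

# Sensitivity falls and specificity rises as the cutoff increases
for (p in c(0.05, 0.06811, 0.10)) {
  cl <- ifelse(pred.train > p, 1, 0)
  cat("cutoff =", p,
      "sensitivity =", round(mean(cl[data.train$clm == 1] == 1), 4),
      "specificity =", round(mean(cl[data.train$clm == 0] == 0), 4), "\n")
}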

# CHUNK 14

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Draw ROC curve and compute AUC on the training set
roc.train <- roc(data.train$clm, pred.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.train)

auc(roc.train)
## Area under the curve: 0.6647
# Draw ROC curve and compute AUC on the test set
roc.test <- roc(data.test$clm, pred.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.test)

auc(roc.test)
## Area under the curve: 0.6581
[Comment 7: interpret the results]

The two AUCs are close to 0.66, which means the model makes better predictions than random guessing (AUC = 0.5). The fact that the test AUC is slightly lower than the training AUC supports our earlier conclusion that there is a small degree of overfitting.

TASK 8: Select features using stepwise selection

Do Binarization

# CHUNK 15
library(caret)
binarizer <- dummyVars(~ agecat + area + veh_age, data = dataCar, fullRank = TRUE)
binarized.vars <- data.frame(predict(binarizer, newdata = dataCar))
dataCar.bin <- cbind(dataCar, binarized.vars)
dataCar.bin$agecat <- NULL
dataCar.bin$area <- NULL
dataCar.bin$veh_age <- NULL
data.train.bin <- dataCar.bin[partition, ]
data.test.bin <- dataCar.bin[-partition, ]
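
To confirm what the binarizer produced, the new dummy columns can be listed directly (a quick check; the column names follow the dummyVars default of factor name, dot, level):

# List the binarized columns added to dataCar.bin
setdiff(names(dataCar.bin), names(dataCar))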

Stepwise selection: forward selection using the BIC

We will do forward selection using the BIC. Since the BIC represents a more conservative approach to feature selection, this allows the insurance company to identify key factors affecting claim occurrence rates.

# CHUNK 16

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Fit Full model and Null model
model.full <- glm(clm ~ . - exposure, data = data.train.bin, family = binomial(), offset = log(exposure))
model.null <- glm(clm ~ 1 - exposure, data = data.train.bin, family = binomial(), offset = log(exposure))

# Do forward selection using the BIC
model.reduced <- stepAIC(model.null, direction = "forward", k = log(nrow(data.train.bin)), scope = list(upper = model.full, lower = model.null))
## Start:  AIC=24578.19
## clm ~ 1 - exposure
## 
##                 Df Deviance   AIC
## + log_veh_value  1    24532 24553
## + agecat.1       1    24533 24554
## + agecat.5       1    24541 24563
## + agecat.6       1    24543 24564
## + veh_age.4      1    24549 24571
## + veh_age.2      1    24554 24575
## <none>                24567 24578
## + area.D         1    24559 24580
## + agecat.2       1    24559 24581
## + area.F         1    24561 24583
## + veh_body       1    24562 24583
## + area.B         1    24562 24584
## + area.E         1    24566 24587
## + gender         1    24566 24588
## + veh_age.1      1    24566 24588
## + area.A         1    24567 24588
## + agecat.3       1    24567 24589
## 
## Step:  AIC=24553.41
## clm ~ log_veh_value
## 
##             Df Deviance   AIC
## + agecat.1   1    24498 24531
## + agecat.5   1    24505 24538
## + agecat.6   1    24512 24544
## <none>            24532 24553
## + area.D     1    24522 24554
## + area.B     1    24524 24557
## + agecat.2   1    24525 24557
## + veh_age.2  1    24527 24560
## + veh_body   1    24527 24560
## + gender     1    24529 24561
## + area.F     1    24529 24561
## + area.E     1    24529 24562
## + veh_age.1  1    24530 24563
## + veh_age.4  1    24531 24563
## + area.A     1    24531 24564
## + agecat.3   1    24532 24564
## 
## Step:  AIC=24530.58
## clm ~ log_veh_value + agecat.1
## 
##             Df Deviance   AIC
## + agecat.5   1    24479 24522
## + agecat.6   1    24482 24526
## + agecat.2   1    24485 24528
## <none>            24498 24531
## + area.D     1    24488 24532
## + area.B     1    24490 24534
## + veh_body   1    24494 24537
## + veh_age.2  1    24494 24537
## + gender     1    24495 24538
## + area.F     1    24496 24539
## + area.E     1    24496 24539
## + agecat.3   1    24496 24539
## + veh_age.1  1    24496 24540
## + veh_age.4  1    24497 24541
## + area.A     1    24497 24541
## 
## Step:  AIC=24522.16
## clm ~ log_veh_value + agecat.1 + agecat.5
## 
##             Df Deviance   AIC
## + agecat.6   1    24457 24511
## <none>            24479 24522
## + area.D     1    24469 24524
## + area.B     1    24471 24525
## + agecat.2   1    24472 24526
## + veh_body   1    24474 24528
## + veh_age.2  1    24475 24529
## + gender     1    24476 24530
## + area.E     1    24476 24530
## + area.F     1    24477 24531
## + veh_age.1  1    24477 24531
## + veh_age.4  1    24478 24532
## + area.A     1    24478 24532
## + agecat.3   1    24479 24533
## 
## Step:  AIC=24511.36
## clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6
## 
##             Df Deviance   AIC
## <none>            24457 24511
## + area.D     1    24448 24514
## + area.B     1    24449 24514
## + veh_body   1    24452 24517
## + veh_age.2  1    24452 24517
## + agecat.2   1    24454 24519
## + area.E     1    24455 24520
## + gender     1    24455 24520
## + veh_age.1  1    24456 24521
## + area.F     1    24456 24521
## + veh_age.4  1    24456 24521
## + agecat.3   1    24457 24522
## + area.A     1    24457 24522
summary(model.reduced)
## 
## Call:
## glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, 
##     family = binomial(), data = data.train.bin, offset = log(exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7235  -0.4503  -0.3479  -0.2238   3.9340  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.84794    0.02533 -72.959  < 2e-16 ***
## log_veh_value  0.15913    0.02917   5.455 4.90e-08 ***
## agecat.1       0.27479    0.05845   4.701 2.59e-06 ***
## agecat.5      -0.25995    0.05300  -4.905 9.36e-07 ***
## agecat.6      -0.30815    0.06856  -4.495 6.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24567  on 50852  degrees of freedom
## Residual deviance: 24457  on 50848  degrees of freedom
## AIC: 24467
## 
## Number of Fisher Scoring iterations: 5
[Comment 8: interpret the results]
  • The forward selection process adds predictors in the following order: log_veh_value, agecat.1, agecat.5, and agecat.6, and then terminates.
  • Note that the trace labels its criterion "AIC", but because we set k = log(n) in stepAIC, the values displayed are actually BIC values.
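
Base R's BIC() can confirm this for the selected model (a quick check, not in the original chunks); for 0/1 binomial data the deviance-based stepAIC value and BIC() coincide:

BIC(model.reduced)  # should essentially match the final trace value of about 24511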
# CHUNK 17

# Compute AUC on the test set
pred.reduced <- predict(model.reduced, newdata = data.test.bin, type = "response")
roc.reduced <- roc(data.test.bin$clm, pred.reduced)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc.reduced)
## Area under the curve: 0.6595
[Comment 9: interpret the results]

  • The AUC is slightly higher than that of the full model (0.6595 vs. 0.6581).

  • The final reduced model is therefore preferred in terms of both predictive power and interpretability: it achieves a slightly higher AUC while using fewer predictors.

TASK 9: Interpret the final model

Run the selected model on the full dataset and provide the output.

This refitting is done because in a real-world application new data will be available in the future on a periodic basis and these data can be combined with the existing data to improve the robustness of the model.

# CHUNK 18

# Retrain the model with log_veh_value, agecat.1, agecat.5, agecat.6 and the offset on the full dataset
model.final <- glm(clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, family = binomial, data = dataCar.bin, offset = log(exposure))

# Get summary statistics
summary(model.final)
## 
## Call:
## glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, 
##     family = binomial, data = dataCar.bin, offset = log(exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7130  -0.4490  -0.3472  -0.2236   3.9375  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.86185    0.02205 -84.438  < 2e-16 ***
## log_veh_value  0.15843    0.02532   6.257 3.92e-10 ***
## agecat.1       0.25731    0.05138   5.008 5.51e-07 ***
## agecat.5      -0.25433    0.04603  -5.525 3.29e-08 ***
## agecat.6      -0.23858    0.05784  -4.125 3.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32572  on 67802  degrees of freedom
## Residual deviance: 32445  on 67798  degrees of freedom
## AIC: 32455
## 
## Number of Fisher Scoring iterations: 5
# Exponentiate the coefficients to get the multiplicative effects
exp(coef(model.final))
##   (Intercept) log_veh_value      agecat.1      agecat.5      agecat.6 
##     0.1553857     1.1716752     1.2934442     0.7754369     0.7877484
[Comment 10: interpret the final model]
  • According to our final model, the most important predictors of the probability of a claim are the log of the vehicle value and the age of the policyholder.

  • Holding all other features fixed, a unit increase in the log of vehicle value multiplies the odds of claim occurrence by 1.1716752. Simply put, the probability of a claim increases with the value of the vehicle; drivers are more likely to submit an insurance claim after an accident when the vehicle they own has a high value.

  • Regarding age categories, the odds of claim occurrence broadly decrease as age increases. Holding all other features fixed, the odds of claim occurrence for age category 1, the youngest group, are 1.2934442 times those of the baseline level (age categories 2, 3, and 4). The odds for age categories 5 and 6, the oldest groups, are 0.7754369 and 0.7877484 times those of the baseline level, respectively.

  • Younger drivers may be more reckless than older drivers, leading to more accidents among the younger age groups, while older drivers may be more mature, leading to better decision making and fewer accidents.
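
To translate these odds ratios into a probability, consider a hypothetical full-year policy (exposure = 1) with the average log vehicle value in the baseline age group; a sketch (the policy profile is invented for illustration):

# Hypothetical baseline policy: all selected age dummies are 0 (agecat 2-4),
# and exposure = 1 so the offset log(exposure) is 0
new.policy <- data.frame(log_veh_value = mean(dataCar.bin$log_veh_value),
                         agecat.1 = 0, agecat.5 = 0, agecat.6 = 0,
                         exposure = 1)
predict(model.final, newdata = new.policy, type = "response")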