Data Description
We’ll take a look at the vehins dataset (CHUNK 1) – note the following:
- Over 67k obs and 10 variables – the target being clm which is binary variable indicating whether a claim was incurred for that policyholder during the policy period
- The objective to identify key factors leading to claim occurence
- Varying exposures for each record/policyholder
- The mean/p of clm is 6.814% of the 67k policies
- Since the goal is to develop a model that will help us predict claim occurence based on policyholder characteristics, it’s important to note that potential predictors cannot include info that we’d know AFTER a claim is incurred (e.g. numclaims and clamcst) – hence we should drop these two (CHUNK 2)
- There are 53 obs where the vehicle value is 0 – doesn’t make practical sense so let’s remove
- veh_age and agecat can be binarized/converted to factors (CHUNK 3A)
- Again, these should be changed so the estimated coefficients doesn’t impose the restriction where a one unit increase (2 -> 3, 3 -> 4) would be associated with the same change in the linear predictor (holding all other predictors fixed)
# CHUNK 1
vehins <- read.csv("C:/Users/CN115792/Desktop/Exam PA/NOTES/ACTEX Stock Files/vehins.csv")
# The original version of "vehins.csv" can be accessed using the following code
#library(insuranceData)
#vehins <- data(dataCar)
summary(vehins)
## clm exposure veh_value numclaims
## Min. :0.00000 Min. :0.002738 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.219028 1st Qu.: 1.010 1st Qu.:0.00000
## Median :0.00000 Median :0.446270 Median : 1.500 Median :0.00000
## Mean :0.06814 Mean :0.468651 Mean : 1.777 Mean :0.07276
## 3rd Qu.:0.00000 3rd Qu.:0.709103 3rd Qu.: 2.150 3rd Qu.:0.00000
## Max. :1.00000 Max. :0.999316 Max. :34.560 Max. :4.00000
##
## claimcst veh_body veh_age gender area
## Min. : 0.0 SEDAN :22233 Min. :1.000 F:38603 A:16312
## 1st Qu.: 0.0 HBACK :18915 1st Qu.:2.000 M:29253 B:13341
## Median : 0.0 STNWG :16261 Median :3.000 C:20540
## Mean : 137.3 UTE : 4586 Mean :2.674 D: 8173
## 3rd Qu.: 0.0 TRUCK : 1750 3rd Qu.:4.000 E: 5912
## Max. :55922.1 HDTOP : 1579 Max. :4.000 F: 3578
## (Other): 2532
## agecat
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :3.485
## 3rd Qu.:5.000
## Max. :6.000
##
str(vehins)
## 'data.frame': 67856 obs. of 10 variables:
## $ clm : int 0 0 0 0 0 0 0 0 0 0 ...
## $ exposure : num 0.304 0.649 0.569 0.318 0.649 ...
## $ veh_value: num 1.06 1.03 3.26 4.14 0.72 2.01 1.6 1.47 0.52 0.38 ...
## $ numclaims: int 0 0 0 0 0 0 0 0 0 0 ...
## $ claimcst : num 0 0 0 0 0 0 0 0 0 0 ...
## $ veh_body : Factor w/ 13 levels "BUS","CONVT",..: 4 4 13 11 4 5 8 4 4 4 ...
## $ veh_age : int 3 2 2 2 4 3 3 2 4 4 ...
## $ gender : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 2 1 1 ...
## $ area : Factor w/ 6 levels "A","B","C","D",..: 3 1 5 4 3 3 1 2 1 2 ...
## $ agecat : int 2 4 2 2 2 4 4 6 3 4 ...
# CHUNK 2
vehins$claimcst <- NULL # cannot serve as a predictor in this context
vehins$numclaims <- NULL # cannot serve as a predictor in this context
nrow(vehins[vehins$veh_value == 0, ]) # check how many obs have 0 vehicle value, doesn't make sense
## [1] 53
vehins <- vehins[vehins$veh_value > 0, ] # remove these obs
# CHUNK 3A
# Before conversion
class(vehins$agecat)
## [1] "integer"
class(vehins$veh_age)
## [1] "integer"
vehins$agecat <- as.factor(vehins$agecat)
vehins$veh_age <- as.factor(vehins$veh_age)
# After conversion
class(vehins$agecat)
## [1] "factor"
class(vehins$veh_age)
## [1] "factor"
Data Exploration
In chunk 3B, we relevel all the categorical variables so that the baseline/reference level is that with the most observations - veh_body (no change, SEDAN was already baseline) - gender (no change, F was already baseline) - veh_age (changed 3 to be the baseline) - area (changed C to be the baseline) - agecat (change 4 to be the baseline)
# CHUNK 3B
# Save the names of the factor variables as a vector
# Relevel
vars.cat <- c("veh_age", "veh_body", "gender", "area", "agecat")
for (i in vars.cat){
table <- as.data.frame(table(vehins[, i]))
max <- which.max(table[, 2])
level.name <- as.character(table[max, 1])
vehins[, i] <- relevel(vehins[, i], ref = level.name)
}
summary(vehins)
## clm exposure veh_value veh_body
## Min. :0.00000 Min. :0.002738 Min. : 0.180 SEDAN :22232
## 1st Qu.:0.00000 1st Qu.:0.219028 1st Qu.: 1.010 HBACK :18915
## Median :0.00000 Median :0.446270 Median : 1.500 STNWG :16234
## Mean :0.06811 Mean :0.468481 Mean : 1.778 UTE : 4586
## 3rd Qu.:0.00000 3rd Qu.:0.709103 3rd Qu.: 2.150 TRUCK : 1742
## Max. :1.00000 Max. :0.999316 Max. :34.560 HDTOP : 1579
## (Other): 2515
## veh_age gender area agecat
## 3:20060 F:38584 C:20530 4:16179
## 1:12254 M:29219 A:16302 1: 5734
## 2:16582 B:13328 2:12868
## 4:18907 D: 8161 3:15757
## E: 5907 5:10722
## F: 3575 6: 6543
##
Now only, veh_value is the only numeric variable to analyze (other than clm and exposure):
- Ranges from 0.180 to 34.560 (or 1,800 to 345,600) – wide range
- The mean is greater than the median, indicating a right skew – we’ll confirm in the histogram generated in CHUNK 4
- We’ll log-transform and null the original variable
# CHUNK 4
library(ggplot2)
ggplot(vehins, aes(x = veh_value)) +
geom_histogram()

vehins$log_veh_value <- log(vehins$veh_value)
vehins$veh_value <- NULL
Next we analyze the (bivariate) relationship between the response (clm) and the only other numeric variable, log_veh_value in CHUNK 5 – via a split box plot sinc clm is technically a categorical. We note the following:
- log_veh_value seems to be slightly higher for those with clm = 1. We’ll see how this positive relationship will be quantified in our GLM later.
# CHUNK 5
ggplot(vehins, aes(x = factor(clm), y = log_veh_value)) +
geom_boxplot(fill = "red")

Since the other variables are all categorical, to explore each bivariate relationship with clm (which is also categorical), a bar plot (stacked or side-by-side) may be used to check for any relationships (see CHUNK 6 which uses a for loop). If there are any relationships, we’d expect differnces in the response across the different levels of a variable. We note the following:
- Slightly higher claim occurrence with veh_age 2
- BUS types have a much higher claim occurrence; while CONVT, mIBUS and UTE seem to have lower occurences
- No discernible difference in gender
- Higher occurence in Area F
- Occurence seems to decline with increasing age-bands
# CHUNK 6
for (i in vars.cat) {
plot <- ggplot(vehins, aes(x = vehins[, i], fill = factor(clm))) +
geom_bar(position = "fill") + # makes bar a percentage
labs(x = i, y = "percent") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(plot)
}





In CHUNK 7, we’ll look at summary statistics, splitting each variable into their unique levels and comparing the mean number of claim occurrences and the number of obs. These summaries will be helpful in deciding which levels should be folded into one another to for variables with a high number of levels like veh__body, area and agecat.
In CHUNK 8, we use the grouped statistics to make the following level combinations:
- For veh_body, we can fold all levels except BUS, CONVT and MCARA into one giant category called HYBRID since all except these 3 have similar occurence rates
- Although the three abnormal levels have relatively small n’s, their occurence rate is different from each other that they should be kept separate
- For area, we can fold A/C/D/E together since they have similar occurence rates – call it base. Fold B/F into another called other
- For agecat, 2/3/4 can be folded into one called group 2 and 5/6 can be folded into a new group 3
- For the exam, if asked to reduce/combine levels, we should be doing so in a way that makes sense (intuitively) and backed by summary statistics such as similar means/medians; avoid justifying solely due to small observation sizes, although if severe these levels should be combined into another
# CHUNK 7
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
for (i in vars.cat) {
x <- vehins %>%
group_by_(i) %>% # underscore after group_by lets it pass through string of var name
dplyr::summarise(mean = mean(clm),
n = n()
)
print(x)
}
## Warning: group_by_() is deprecated.
## Please use group_by() instead
##
## The 'programming' vignette or the tidyeval book can help you
## to program with group_by() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.
## # A tibble: 4 x 3
## veh_age mean n
## <fct> <dbl> <int>
## 1 3 0.0679 20060
## 2 1 0.0672 12254
## 3 2 0.0759 16582
## 4 4 0.0620 18907
## # A tibble: 13 x 3
## veh_body mean n
## <fct> <dbl> <int>
## 1 SEDAN 0.0664 22232
## 2 BUS 0.184 38
## 3 CONVT 0.0370 81
## 4 COUPE 0.0872 780
## 5 HBACK 0.0668 18915
## 6 HDTOP 0.0823 1579
## 7 MCARA 0.116 121
## 8 MIBUS 0.0601 716
## 9 PANVN 0.0824 752
## 10 RDSTR 0.0741 27
## 11 STNWG 0.0721 16234
## 12 TRUCK 0.0683 1742
## 13 UTE 0.0567 4586
## # A tibble: 2 x 3
## gender mean n
## <fct> <dbl> <int>
## 1 F 0.0686 38584
## 2 M 0.0675 29219
## # A tibble: 6 x 3
## area mean n
## <fct> <dbl> <int>
## 1 C 0.0688 20530
## 2 A 0.0664 16302
## 3 B 0.0723 13328
## 4 D 0.0607 8161
## 5 E 0.0652 5907
## 6 F 0.0783 3575
## # A tibble: 6 x 3
## agecat mean n
## <fct> <dbl> <int>
## 1 4 0.0682 16179
## 2 1 0.0863 5734
## 3 2 0.0724 12868
## 4 3 0.0705 15757
## 5 5 0.0572 10722
## 6 6 0.0556 6543
# CHUNK 8
# Brief note on how mapvalues works: takes in the factor variable, the old levels, and assigns the new level
# veh_body
var <- "veh_body"
var.levels <- levels(vehins[, var])
vehins[,var] <- mapvalues(vehins[, var], var.levels,
c("HYBRID", "BUS", "CONVT", "HYBRID", "HYBRID",
"HYBRID", "MCARA", "HYBRID", "HYBRID", "HYBRID",
"HYBRID", "HYBRID", "HYBRID"))
# area
var <- "area"
var.levels <- levels(vehins[, var])
vehins[,var] <- mapvalues(vehins[, var], var.levels,
c("BASE", "BASE", "OTHER", "BASE", "BASE", "OTHER"))
# agecat
var <- "agecat"
var.levels <- levels(vehins[, var])
vehins[,var] <- mapvalues(vehins[, var], var.levels,
c("2", "1", "2", "2", "3", "3"))
# Relevel
for (i in vars.cat){
table <- as.data.frame(table(vehins[, i]))
max <- which.max(table[, 2])
level.name <- as.character(table[max, 1])
vehins[, i] <- relevel(vehins[, i], ref = level.name)
}
summary(vehins)
## clm exposure veh_body veh_age gender
## Min. :0.00000 Min. :0.002738 HYBRID:67563 3:20060 F:38584
## 1st Qu.:0.00000 1st Qu.:0.219028 BUS : 38 1:12254 M:29219
## Median :0.00000 Median :0.446270 CONVT : 81 2:16582
## Mean :0.06811 Mean :0.468481 MCARA : 121 4:18907
## 3rd Qu.:0.00000 3rd Qu.:0.709103
## Max. :1.00000 Max. :0.999316
## area agecat log_veh_value
## BASE :50900 2:44804 Min. :-1.71480
## OTHER:16903 1: 5734 1st Qu.: 0.00995
## 3:17265 Median : 0.40547
## Mean : 0.38675
## 3rd Qu.: 0.76547
## Max. : 3.54270
# check level summary stats again
for (i in vars.cat) {
x <- vehins %>%
group_by_(i) %>% # underscore after group_by lets it pass through string of var name
dplyr::summarise(mean = mean(clm),
n = n()
)
print(x)
}
## # A tibble: 4 x 3
## veh_age mean n
## <fct> <dbl> <int>
## 1 3 0.0679 20060
## 2 1 0.0672 12254
## 3 2 0.0759 16582
## 4 4 0.0620 18907
## # A tibble: 4 x 3
## veh_body mean n
## <fct> <dbl> <int>
## 1 HYBRID 0.0680 67563
## 2 BUS 0.184 38
## 3 CONVT 0.0370 81
## 4 MCARA 0.116 121
## # A tibble: 2 x 3
## gender mean n
## <fct> <dbl> <int>
## 1 F 0.0686 38584
## 2 M 0.0675 29219
## # A tibble: 2 x 3
## area mean n
## <fct> <dbl> <int>
## 1 BASE 0.0663 50900
## 2 OTHER 0.0735 16903
## # A tibble: 3 x 3
## agecat mean n
## <fct> <dbl> <int>
## 1 2 0.0702 44804
## 2 1 0.0863 5734
## 3 3 0.0566 17265
Next we’ll select an interaction (where a predictor’s relationship with claim occurence is dependent on another predicotr). CHUNK 9 explores a possible interaction between veh_body and log_veh_value. We note the following:
- It does seem like there is an interaction between log_veh_value and veh_body as for some veh_body, a higher vehicle value is associated with a higher claim occurence (HYBRID, CONVT and MCARA), while for BUS the opposite is true (i.e. higher claim occurence is associated with a lower value)
- Seems like we should include an interaction term for these two predictors
# CHUNK 9
ggplot(vehins, aes(x = factor(clm), y = log_veh_value)) +
geom_boxplot() +
facet_wrap(~ veh_body) +
ylim(-1, 1)
