In this case study, we will examine the dataCar dataset in the insuranceData package libarary in R. This dataset is based on a total of n = 67,856 one-year vehicle insurance policies
The target variable is clm, a binary variable, equal to 1 if a claim occurred over the policy period and 0 otherwise.
Our objective here is to construct appropriate GLMs to identify key factors associated with claim occurrence. (Interpretability)
| Variable | Description | Characteristics |
|---|---|---|
| veh_value | Vehicle value (in $10,000s) | Numeric from 0 to 34.560 |
| exposure | Exposure of policy | Numeric from 0 to 1 |
| clm | Claim occurrence | 0 = no, 1 = yes |
| numclaims | Number of claims | Integer from 0 to 4 |
| claimcst0 | Claim size | Positive value from 0 to 55,922.1 (= 0 if no claim) |
| veh_body | Vehicle body type | Vehicle body, coded as BUS, CONVT(=convertible), COUPE, HBACK(=hatchback), HDTOP(=hardtop), MCARA(=motorized caravan), MIBUS(=minibus), PANVN(=panel van), RDSTR(=roadster), SEDAN, STNWG(=station wagon), TRUCK, UTE(=utility) |
| veh_age | Vehicle age | Integer from 1 (youngest) to 4 |
| gender | Gender of policyholder | M = male, F = female |
| area | Area of residence of policyholder | A, B, C, D, E, F |
| agecat | Age band of policyholder | Integer from 1 (youngest) to 6 |
| X_OBSTAT_ | (Unknown) | A categorical variable with one level “01101000” |
# CHUNK 1
library(insuranceData)
# Load dataCar
data(dataCar)
# Observe summary of data
summary(dataCar)
## veh_value exposure clm numclaims
## Min. : 0.000 Min. :0.002738 Min. :0.00000 Min. :0.00000
## 1st Qu.: 1.010 1st Qu.:0.219028 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 1.500 Median :0.446270 Median :0.00000 Median :0.00000
## Mean : 1.777 Mean :0.468651 Mean :0.06814 Mean :0.07276
## 3rd Qu.: 2.150 3rd Qu.:0.709103 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :34.560 Max. :0.999316 Max. :1.00000 Max. :4.00000
##
## claimcst0 veh_body veh_age gender area
## Min. : 0.0 SEDAN :22233 Min. :1.000 F:38603 A:16312
## 1st Qu.: 0.0 HBACK :18915 1st Qu.:2.000 M:29253 B:13341
## Median : 0.0 STNWG :16261 Median :3.000 C:20540
## Mean : 137.3 UTE : 4586 Mean :2.674 D: 8173
## 3rd Qu.: 0.0 TRUCK : 1750 3rd Qu.:4.000 E: 5912
## Max. :55922.1 HDTOP : 1579 Max. :4.000 F: 3578
## (Other): 2532
## agecat X_OBSTAT_
## Min. :1.000 01101 0 0 0:67856
## 1st Qu.:2.000
## Median :3.000
## Mean :3.485
## 3rd Qu.:5.000
## Max. :6.000
##
# CHUNK 2
# Remove variables that are not suitable as predictors
dataCar$X_OBSTAT_ <- NULL
dataCar$numclaims <- NULL #Remove because it'll perfectly predict target variable
dataCar$claimcst0 <- NULL #Remove because it'll perfectly predict target variable
# Vehicle value cannot be zero. Check how many observations are zero? and if it is small, remove zero observations.
nrow(dataCar[dataCar$veh_value == 0,])
## [1] 53
dataCar <- dataCar[dataCar$veh_value > 0,]
summary(dataCar)
## veh_value exposure clm veh_body
## Min. : 0.180 Min. :0.002738 Min. :0.00000 SEDAN :22232
## 1st Qu.: 1.010 1st Qu.:0.219028 1st Qu.:0.00000 HBACK :18915
## Median : 1.500 Median :0.446270 Median :0.00000 STNWG :16234
## Mean : 1.778 Mean :0.468481 Mean :0.06811 UTE : 4586
## 3rd Qu.: 2.150 3rd Qu.:0.709103 3rd Qu.:0.00000 TRUCK : 1742
## Max. :34.560 Max. :0.999316 Max. :1.00000 HDTOP : 1579
## (Other): 2515
## veh_age gender area agecat
## Min. :1.000 F:38584 A:16302 Min. :1.000
## 1st Qu.:2.000 M:29219 B:13328 1st Qu.:2.000
## Median :3.000 C:20530 Median :3.000
## Mean :2.673 D: 8161 Mean :3.485
## 3rd Qu.:4.000 E: 5907 3rd Qu.:5.000
## Max. :4.000 F: 3575 Max. :6.000
##
# CHUNK 3
# Convert the numeric predictors to factors if needed.(You do not need to convert Target variable to factor)
dataCar$veh_age <- as.factor(dataCar$veh_age)
dataCar$agecat <- as.factor(dataCar$agecat)
# Relevel the categorical predictors using for loop.
# Save the names of the categorical predictors as a vector
vars.cat <- c("veh_body","veh_age", "gender", "area", "agecat")
for (v in vars.cat)
{
df <- as.data.frame(table(dataCar[,v]))
max <- which.max(df[,2])
level.name <- as.character(df[max,1])
dataCar[,v] <- relevel(dataCar[,v], ref = level.name)
}
# Check the releveling using summary function.
summary(dataCar)
## veh_value exposure clm veh_body
## Min. : 0.180 Min. :0.002738 Min. :0.00000 SEDAN :22232
## 1st Qu.: 1.010 1st Qu.:0.219028 1st Qu.:0.00000 HBACK :18915
## Median : 1.500 Median :0.446270 Median :0.00000 STNWG :16234
## Mean : 1.778 Mean :0.468481 Mean :0.06811 UTE : 4586
## 3rd Qu.: 2.150 3rd Qu.:0.709103 3rd Qu.:0.00000 TRUCK : 1742
## Max. :34.560 Max. :0.999316 Max. :1.00000 HDTOP : 1579
## (Other): 2515
## veh_age gender area agecat
## 3:20060 F:38584 C:20530 4:16179
## 1:12254 M:29219 A:16302 1: 5734
## 2:16582 B:13328 2:12868
## 4:18907 D: 8161 3:15757
## E: 5907 5:10722
## F: 3575 6: 6543
##
# CHUNK 4
# Draw histogram: veh_value, exposure
library(ggplot2)
ggplot(dataCar, aes(x = veh_value)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(dataCar, aes(x = exposure)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Apply transformation and draw histogram
ggplot(dataCar, aes(x = log(veh_value))) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Change origianl variable into transformed variable in data.
dataCar$log_veh_value <- log(dataCar$veh_value)
dataCar$veh_value <- NULL
# CHUNK 5
# Draw split boxplot of log veh_value over two levels of clm, and exposure over two levels of clm
ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) + geom_boxplot()
ggplot(dataCar, aes(x = factor(clm), y = exposure)) + geom_boxplot()
# CHUNK 6
# Compute mean, median and number of observations of log veh_value over two levels of clm, and exposure over two levels of clm
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
dataCar %>% group_by_("clm") %>% summarise(mean = mean(log_veh_value),median = median(log_veh_value),n = n())
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## # A tibble: 2 × 4
## clm mean median n
## <int> <dbl> <dbl> <int>
## 1 0 0.382 0.399 63185
## 2 1 0.448 0.451 4618
dataCar %>% group_by_("clm") %>% summarise(mean = mean(exposure),median = median(exposure),n = n())
## # A tibble: 2 × 4
## clm mean median n
## <int> <dbl> <dbl> <int>
## 1 0 0.458 0.427 63185
## 2 1 0.611 0.637 4618
# CHUNK 7
# Use the vector, "vars.cat" and draw filled bar using for loop.
for (v in vars.cat)
{
plot <- ggplot(data = dataCar, aes(x = dataCar[,v], fill = factor(clm))) + geom_bar(position = "fill") + labs(x = v,y = "Percent")
print(plot)
}
# CHUNK 8
# Compute mean and number of observations of clm over levels of veh_body
dataCar %>% group_by_("veh_body") %>% summarise(mean = mean(clm), n = n())
## # A tibble: 13 × 3
## veh_body mean n
## <fct> <dbl> <int>
## 1 SEDAN 0.0664 22232
## 2 BUS 0.184 38
## 3 CONVT 0.0370 81
## 4 COUPE 0.0872 780
## 5 HBACK 0.0668 18915
## 6 HDTOP 0.0823 1579
## 7 MCARA 0.116 121
## 8 MIBUS 0.0601 716
## 9 PANVN 0.0824 752
## 10 RDSTR 0.0741 27
## 11 STNWG 0.0721 16234
## 12 TRUCK 0.0683 1742
## 13 UTE 0.0567 4586
Ultimately, we will choose to relevel the veh_body variable to have two levels.
The BUS level has few observations and the highest claim occurence rate. So we will combine this level with the level with the second highest occurence rate (“MCARA”). The new level will be named “BUS_MCARA”.
The other two sparse levels, “CONVT” and “RDSTR”, have a mean claim occurrence rate similar to to other levels except for BUS and MCARA (0.03 - 0.08). So we will combine all the levels with a mean claim occurrence rate falling in this range as “OTHERS”.
# CHUNK 9
# Combine "BUS" and "MCARA", which has the highest claim occurence rate as one level called "BUS-MCARA"
levels(dataCar$veh_body)[levels(dataCar$veh_body) == "BUS"] <- "BUS-MCARA"
levels(dataCar$veh_body)[levels(dataCar$veh_body) == "MCARA"] <- "BUS-MCARA"
# Combine levels other than "BUS_MCARA" as "OTHER"
levels(dataCar$veh_body)[!(levels(dataCar$veh_body) == "BUS-MCARA")] <- "OTHER"
# Check the output using summmar()
summary(dataCar)
## exposure clm veh_body veh_age gender
## Min. :0.002738 Min. :0.00000 OTHER :67644 3:20060 F:38584
## 1st Qu.:0.219028 1st Qu.:0.00000 BUS-MCARA: 159 1:12254 M:29219
## Median :0.446270 Median :0.00000 2:16582
## Mean :0.468481 Mean :0.06811 4:18907
## 3rd Qu.:0.709103 3rd Qu.:0.00000
## Max. :0.999316 Max. :1.00000
## area agecat log_veh_value
## C:20530 4:16179 Min. :-1.71480
## A:16302 1: 5734 1st Qu.: 0.00995
## B:13328 2:12868 Median : 0.40547
## D: 8161 3:15757 Mean : 0.38675
## E: 5907 5:10722 3rd Qu.: 0.76547
## F: 3575 6: 6543 Max. : 3.54270
We will include the interaction between log_veh_value and veh_body when constructing GLMs in later tasks.
# CHUNK 11
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(lattice)
set.seed(4769)
partition <- createDataPartition(y = dataCar$clm, p = .75, list = FALSE)
data.train <- dataCar[partition, ]
data.test <- dataCar[-partition, ]
# CHUNK 12
# Build logistic regression model (the binomial GLMs) except for exposure
logit <- glm(clm ~. - exposure + log_veh_value:veh_body, data = data.train, family = binomial())
# Refit logistic regression model including exposure as an offset
logit.offset <- glm(clm ~. - exposure + log_veh_value:veh_body, data = data.train, family = binomial(), offset = log(exposure))
# Evaluate the models using AIC metic
AIC(logit)
## [1] 25354.09
AIC(logit.offset)
## [1] 24459.17
# CHUNK 13
cutoff <- 0.06811 # you can try other values
# Generate predicted probabilities on the training set
pred.train <- predict(logit.offset, type = "response")
# Turn predicted probabilities into predicted classes
class.train <- ifelse(pred.train > cutoff, 1, 0)
# Generate confusion matrix on the training set
confusionMatrix(factor(class.train), factor(data.train$clm), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25474 1017
## 1 21885 2477
##
## Accuracy : 0.5496
## 95% CI : (0.5453, 0.554)
## No Information Rate : 0.9313
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0655
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.70893
## Specificity : 0.53789
## Pos Pred Value : 0.10167
## Neg Pred Value : 0.96161
## Prevalence : 0.06871
## Detection Rate : 0.04871
## Detection Prevalence : 0.47907
## Balanced Accuracy : 0.62341
##
## 'Positive' Class : 1
##
# Generate predicted probabilities on the test set
pred.test <- predict(logit.offset, type = "response", newdata = data.test)
class.test <- ifelse(pred.test > cutoff, 1, 0)
# Generate confusion matrix on the test set
confusionMatrix(factor(class.test), factor(data.test$clm), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8483 332
## 1 7343 792
##
## Accuracy : 0.5472
## 95% CI : (0.5397, 0.5547)
## No Information Rate : 0.9337
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0617
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.70463
## Specificity : 0.53602
## Pos Pred Value : 0.09736
## Neg Pred Value : 0.96234
## Prevalence : 0.06631
## Detection Rate : 0.04673
## Detection Prevalence : 0.47994
## Balanced Accuracy : 0.62032
##
## 'Positive' Class : 1
##
In both confusion matrices for the training and testing set, the accuracy is only marginally higher than 0.5.
Insurance companies should be more interested in correctly classifying policyholders with a claim than correctly classifying policyholders without a claim. In other words, the sensitivity of the model should be more of interest to the insurance company than the specificity.
We can see that the model has a decent sensitivity on both the training and testing set, whereas its specificity is relatively low. This implies that the model can identify policies with a claim well but not good at identifying policies with no claim.
The accuracy, sensitivity, and specificity of the model on the test set are all slightly lower than the corresponding quantities on the training set. This indicates a little bit of overfitting for the logistic regression model.
# CHUNK 14
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Draw ROC curve and compute AUC on the training set
roc.train <- roc(data.train$clm, pred.train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.train)
auc(roc.train)
## Area under the curve: 0.6647
# Draw AUC curve and compute AUC on the test set
roc.test <- roc(data.test$clm, pred.test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.test)
auc(roc.test)
## Area under the curve: 0.6581
The two AUCs are close to 0.66, which means that model is making better predictions than random predictions. The fact that the test AUC is slightly lower than the training AUC supports our earlier conclusion that there is a small extent of overfitting.
# CHUNK 15
library(caret)
binarizer <- dummyVars(~ agecat + area + veh_age, data = dataCar, fullRank = T)
binarized.vars <- data.frame(predict(binarizer, newdata = dataCar))
dataCar.bin <- cbind(dataCar, binarized.vars)
dataCar.bin$agecat <- NULL
dataCar.bin$area <- NULL
dataCar.bin$veh_age <- NULL
data.train.bin <- dataCar.bin[partition, ]
data.test.bin <- dataCar.bin[-partition, ]
We will do forward selection using the BIC. Since the BIC represents a more conservative approach to feature selection, this allows the insurance company to identify key factors affecting claim occurrence rates.
# CHUNK 16
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Fit Full model and Null model
model.full <- glm(clm ~. - exposure , data = data.train.bin, family = binomial(), offset = log(exposure))
model.null <- glm(clm ~ 1 - exposure, data = data.train.bin, family = binomial(), offset = log(exposure))
# Do forward selection using the BIC
model.reduced = stepAIC(model.null , direction = "forward", k = log(nrow(data.train.bin)), scope = list(upper = model.full, lower = model.null))
## Start: AIC=24578.19
## clm ~ 1 - exposure
##
## Df Deviance AIC
## + log_veh_value 1 24532 24553
## + agecat.1 1 24533 24554
## + agecat.5 1 24541 24563
## + agecat.6 1 24543 24564
## + veh_age.4 1 24549 24571
## + veh_age.2 1 24554 24575
## <none> 24567 24578
## + area.D 1 24559 24580
## + agecat.2 1 24559 24581
## + area.F 1 24561 24583
## + veh_body 1 24562 24583
## + area.B 1 24562 24584
## + area.E 1 24566 24587
## + gender 1 24566 24588
## + veh_age.1 1 24566 24588
## + area.A 1 24567 24588
## + agecat.3 1 24567 24589
##
## Step: AIC=24553.41
## clm ~ log_veh_value
##
## Df Deviance AIC
## + agecat.1 1 24498 24531
## + agecat.5 1 24505 24538
## + agecat.6 1 24512 24544
## <none> 24532 24553
## + area.D 1 24522 24554
## + area.B 1 24524 24557
## + agecat.2 1 24525 24557
## + veh_age.2 1 24527 24560
## + veh_body 1 24527 24560
## + gender 1 24529 24561
## + area.F 1 24529 24561
## + area.E 1 24529 24562
## + veh_age.1 1 24530 24563
## + veh_age.4 1 24531 24563
## + area.A 1 24531 24564
## + agecat.3 1 24532 24564
##
## Step: AIC=24530.58
## clm ~ log_veh_value + agecat.1
##
## Df Deviance AIC
## + agecat.5 1 24479 24522
## + agecat.6 1 24482 24526
## + agecat.2 1 24485 24528
## <none> 24498 24531
## + area.D 1 24488 24532
## + area.B 1 24490 24534
## + veh_body 1 24494 24537
## + veh_age.2 1 24494 24537
## + gender 1 24495 24538
## + area.F 1 24496 24539
## + area.E 1 24496 24539
## + agecat.3 1 24496 24539
## + veh_age.1 1 24496 24540
## + veh_age.4 1 24497 24541
## + area.A 1 24497 24541
##
## Step: AIC=24522.16
## clm ~ log_veh_value + agecat.1 + agecat.5
##
## Df Deviance AIC
## + agecat.6 1 24457 24511
## <none> 24479 24522
## + area.D 1 24469 24524
## + area.B 1 24471 24525
## + agecat.2 1 24472 24526
## + veh_body 1 24474 24528
## + veh_age.2 1 24475 24529
## + gender 1 24476 24530
## + area.E 1 24476 24530
## + area.F 1 24477 24531
## + veh_age.1 1 24477 24531
## + veh_age.4 1 24478 24532
## + area.A 1 24478 24532
## + agecat.3 1 24479 24533
##
## Step: AIC=24511.36
## clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6
##
## Df Deviance AIC
## <none> 24457 24511
## + area.D 1 24448 24514
## + area.B 1 24449 24514
## + veh_body 1 24452 24517
## + veh_age.2 1 24452 24517
## + agecat.2 1 24454 24519
## + area.E 1 24455 24520
## + gender 1 24455 24520
## + veh_age.1 1 24456 24521
## + area.F 1 24456 24521
## + veh_age.4 1 24456 24521
## + agecat.3 1 24457 24522
## + area.A 1 24457 24522
summary(model.reduced)
##
## Call:
## glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6,
## family = binomial(), data = data.train.bin, offset = log(exposure))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7235 -0.4503 -0.3479 -0.2238 3.9340
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.84794 0.02533 -72.959 < 2e-16 ***
## log_veh_value 0.15913 0.02917 5.455 4.90e-08 ***
## agecat.1 0.27479 0.05845 4.701 2.59e-06 ***
## agecat.5 -0.25995 0.05300 -4.905 9.36e-07 ***
## agecat.6 -0.30815 0.06856 -4.495 6.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24567 on 50852 degrees of freedom
## Residual deviance: 24457 on 50848 degrees of freedom
## AIC: 24467
##
## Number of Fisher Scoring iterations: 5
# CHUNK 17
# Compute AUC on the test set
pred.reduced = predict(model.reduced, newdata = data.test.bin, type = "response")
roc.reduced = roc(data.test.bin$clm, pred.reduced)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(roc.reduced)
## Area under the curve: 0.6595
*The AUC is slightly higher than that of the full model (0.6595 < 0.6581).
This refitting is done because in a real-world application new data will be available in the future on a periodic basis and these data can be combined with the existing data to improve the robustness of the model.
# CHUNK 18
# Retrain the model with log_veh_value, agecat.1, agecat.5, agecat.6 and offset.
model.final = glm(clm~ log_veh_value + agecat.1 + agecat.5 + agecat.6, family = binomial, data = dataCar.bin, offset = log(exposure))
# Get summary statistics
summary(model.final)
##
## Call:
## glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6,
## family = binomial, data = dataCar.bin, offset = log(exposure))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7130 -0.4490 -0.3472 -0.2236 3.9375
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86185 0.02205 -84.438 < 2e-16 ***
## log_veh_value 0.15843 0.02532 6.257 3.92e-10 ***
## agecat.1 0.25731 0.05138 5.008 5.51e-07 ***
## agecat.5 -0.25433 0.04603 -5.525 3.29e-08 ***
## agecat.6 -0.23858 0.05784 -4.125 3.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32572 on 67802 degrees of freedom
## Residual deviance: 32445 on 67798 degrees of freedom
## AIC: 32455
##
## Number of Fisher Scoring iterations: 5
# Exponentiate the coefficients to get the multiplicative effects
exp(coef(model.final))
## (Intercept) log_veh_value agecat.1 agecat.5 agecat.6
## 0.1553857 1.1716752 1.2934442 0.7754369 0.7877484
According to our final model, the most important predictors for predicting the probability of a claim are the log of the vehicle value and the age of the insured.
Holding all other features fixed, a unit increase in the log of vehicle value leads to an expected multiplicative increase of 1.1716752 increase in the odds of claim occurrence. Simply put, the probability of a claim increases as the value of the vehicle increases; drivers will more likely submit an insurance claim in the case of an accident when the vehicle they own has a high value.
Regarding age categories, the odds of a claim occurrence decreases as age increases. Holding all other features fixed, the odds of claim occurrence for age category 1, the youngest age group, is 1.2934442 times that of the base line level (age categories 2,3, and 4). The odds of claim occurrence for age categories 5 and 6, the eldest age groups, are 0.7754369 and 0.7877484 times that of the base line level.
Younger drivers may likely be more reckless than the elder age groups, leading to more accidents among the younger age groups. Elder drivers may be more mature than the other age groups, leading to better decision making and less accidents.
[Comment 1: Determine which variable is right skewed and which transfomation can be used to this variable.]