Analyse the dataset containing price points for a particular motor insurance product for four different insurers on the market covering two price periods. Generate insights from the data and provide guidance on the pricing strategy for this product.
Firstly the data will be compared for the two price periods to see if there was a change in structure or large variance over hte two pricing periods. Second, the dataset will be used to back engineer the pricing composition for all four insurers provided in the dataset. There will be suggestions made on market segmentation considering the pricing structures of the 4 insurers and the price deconstruction results. Market reserach has not been done to contribute but would be neccessary for more indepth market segmentation. It is determined that pricing optimisation would depend heavily on the components of the product such as but not limited to coverage inclusions, duration, excess, max coverage amounts which are unavailable along with the also lacking company specific data like customer segmentation, perentage of new to returning customer makeup, and other comparable portfolio products and further other factors; for this reason pricing optimisation will be greatly limited and will be suggested based upon preliminary analysis around the pricing distribution of the premiums in comparison with other insurers distribution to determine if ther may be room to increase profits without loss to customers while still maintaing a similar standing in the competitor ranking.
All work is done in R and presented in this document. Due to time constraints, all analysis has been greatly simplified and done on an introductory/exploratory level. To appropriately assess and improve the pricing stratgy for this product, one would need to do background reserach within the broader field and local market, obtain additional product and portfolio information, delve more thoroughly into feature selection using knowledge of the relationships between variables, develop better performing models using acquired business acumen (exploring GAMs, GLMs, and Random Forests),d improve model assessment through rigorous accuracy testing using methodologies such as bootstrapping and cross-validation and the statistically test for significant improvements in the various models to determine the best model accuracy vs complexity that may be relied upon for future predictions, strategy improvements and competitor pricing estimations.
library(readr)
library(caret)
library(psych)
library(Hmisc)
library(purrr)
library(tidyr)
library(ggplot2)
library(car)
library(MASS)
library(mgcViz)
library(tidyverse)
library(knitr)
library(corrplot)
library(gridExtra)
library(lattice)
data <- read_csv("~/R/Suncorp Ins Test/data.csv")
defs <- c("Average annual kilometers (in thousands) containing categorical quantitative 5 levels",
"Occurance of convictions or suspensions in the last 5 years binomial",
"Owner age quantitative ranging from 17 to 99",
"Other insurance provider categorical 7 levels including Other and None",
"Registration use type categorical 2 levels",
"Vehicle manufacturer categorical",
"Years insured with other insurance quantitative ranging from 0 to 4",
"Year vehicle was manufactured quantitative ranging from 1999 to 2018",
"Age of youngest driver quantitative ranging from 17 to 79",
"Vehicle model categorical",
"Occurance of at fault accidents in the last 2 years binomial",
"Number of years with NRMA assist quantitative ranging from -1 to 25",
"Number of years drivers license has been obtained quantitative ranging from 0 to 6",
"Other insurance type on the vehicle categorical 3 levles",
"Number of demerit points held quantitative ranging from 0 to 14",
"No claim discount percentage quantitative (2 values)",
"Company ownership binomial Y/N",
"Driver gender, either male or female or Undefined (M/F/U)",
"Occurance of at fault accidents in the last 5 years binomial Y/N",
"Vehical Shape categorical 4 levels",
"Vehical Age in years quantitative ranging from 1 to 20",
"Dependent variable - measures the premium pricing for insurer 1 - Suncorp",
"Dependent variable - measures the premium pricing for insurer 2 - Unknown",
"Dependent variable - measures the premium pricing for insurer 3 - Unknown",
"Dependent variable - measures the premium pricing for insurer 4 - Unknown",
"Commencement date for the policy categorical date 2 levels")
data.dict <- data.frame(names(data), defs, stringsAsFactors = F)
names(data.dict) <- c("Variable Name", "Definition")
kable(data.dict)
| Variable Name | Definition |
|---|---|
| YEARLYKM | Average annual kilometers (in thousands) containing categorical quantitative 5 levels |
| LICCANCNT | Occurance of convictions or suspensions in the last 5 years binomial |
| OWNERAGE | Owner age quantitative ranging from 17 to 99 |
| MVINSURER | Other insurance provider categorical 7 levels including Other and None |
| REGUSE | Registration use type categorical 2 levels |
| MAKE | Vehicle manufacturer categorical |
| MVINSYEARS | Years insured with other insurance quantitative ranging from 0 to 4 |
| MANUFYEAR | Year vehicle was manufactured quantitative ranging from 1999 to 2018 |
| YDAGE | Age of youngest driver quantitative ranging from 17 to 79 |
| MODEL | Vehicle model categorical |
| ATFAULTACCDS | Occurance of at fault accidents in the last 2 years binomial |
| NRMAASST | Number of years with NRMA assist quantitative ranging from -1 to 25 |
| NOYRLICOBT | Number of years drivers license has been obtained quantitative ranging from 0 to 6 |
| MVINSTYPE | Other insurance type on the vehicle categorical 3 levles |
| DEMERITPTS | Number of demerit points held quantitative ranging from 0 to 14 |
| NCBPCT | No claim discount percentage quantitative (2 values) |
| COMREGVEH | Company ownership binomial Y/N |
| YDGENDER | Driver gender, either male or female or Undefined (M/F/U) |
| ATFAULT5YRS | Occurance of at fault accidents in the last 5 years binomial Y/N |
| SHAPE | Vehical Shape categorical 4 levels |
| VEHAGE | Vehical Age in years quantitative ranging from 1 to 20 |
| INSURER1_PREMIUM | Dependent variable - measures the premium pricing for insurer 1 - Suncorp |
| INSURER2_PREMIUM | Dependent variable - measures the premium pricing for insurer 2 - Unknown |
| INSURER3_PREMIUM | Dependent variable - measures the premium pricing for insurer 3 - Unknown |
| INSURER4_PREMIUM | Dependent variable - measures the premium pricing for insurer 4 - Unknown |
| COMMENCEDATE | Commencement date for the policy categorical date 2 levels |
psych::describe(data)
## vars n mean sd median trimmed mad min
## YEARLYKM 1 59504 14.79 8.88 15.00 14.57 0.00 5.00
## LICCANCNT 2 59504 0.00 0.07 0.00 0.00 0.00 0.00
## OWNERAGE 3 59504 49.27 14.03 49.00 48.89 14.83 17.00
## MVINSURER* 4 59504 NaN NA NA NaN NA Inf
## REGUSE* 5 59504 NaN NA NA NaN NA Inf
## MAKE* 6 59504 NaN NA NA NaN NA Inf
## MVINSYEARS 7 59504 1.92 1.70 2.00 1.90 2.97 0.00
## MANUFYEAR 8 59504 2009.37 4.77 2010.00 2009.54 5.93 1999.00
## YDAGE 9 59504 47.37 13.60 46.00 47.11 14.83 17.00
## MODEL* 10 59504 419.10 568.59 323.00 319.56 171.98 2.00
## ATFAULTACCDS 11 59504 0.02 0.14 0.00 0.00 0.00 0.00
## NRMAASST 12 59504 1.97 3.01 2.00 1.73 4.45 -1.00
## NOYRLICOBT 13 59504 5.29 1.83 6.00 5.85 0.00 0.00
## MVINSTYPE* 14 59504 NaN NA NA NaN NA Inf
## DEMERITPTS 15 59504 0.37 1.43 0.00 0.00 0.00 0.00
## NCBPCT 16 59504 25.40 29.65 0.00 24.26 0.00 0.00
## COMREGVEH* 17 59504 NaN NA NA NaN NA Inf
## YDGENDER* 18 59504 NaN NA NA NaN NA Inf
## ATFAULT5YRS* 19 59504 NaN NA NA NaN NA Inf
## SHAPE* 20 59504 999.00 0.00 999.00 999.00 0.00 999.00
## VEHAGE 21 59504 9.63 4.77 9.00 9.46 5.93 1.00
## INSURER1_PREMIUM 22 59504 250.76 67.03 218.98 240.53 32.23 197.07
## INSURER2_PREMIUM 23 59504 288.36 65.74 273.33 287.31 67.88 192.92
## INSURER3_PREMIUM 24 59504 301.21 52.43 295.34 300.71 55.60 202.00
## INSURER4_PREMIUM 25 59504 233.79 53.01 211.13 220.97 18.25 194.32
## COMMENCEDATE 26 59504 NaN NA NA NaN NA Inf
## max range skew kurtosis se
## YEARLYKM 99.00 94.00 8.32 76.91 0.04
## LICCANCNT 1.00 1.00 15.15 227.43 0.00
## OWNERAGE 99.00 82.00 0.24 -0.53 0.06
## MVINSURER* -Inf -Inf NA NA NA
## REGUSE* -Inf -Inf NA NA NA
## MAKE* -Inf -Inf NA NA NA
## MVINSYEARS 4.00 4.00 0.12 -1.69 0.01
## MANUFYEAR 2018.00 19.00 -0.26 -0.89 0.02
## YDAGE 79.00 62.00 0.17 -0.75 0.06
## MODEL* 4008.00 4006.00 5.08 27.80 2.33
## ATFAULTACCDS 1.00 1.00 6.83 44.61 0.00
## NRMAASST 25.00 26.00 2.33 14.09 0.01
## NOYRLICOBT 6.00 6.00 -2.35 3.76 0.01
## MVINSTYPE* -Inf -Inf NA NA NA
## DEMERITPTS 14.00 14.00 5.25 33.01 0.01
## NCBPCT 60.00 60.00 0.31 -1.90 0.12
## COMREGVEH* -Inf -Inf NA NA NA
## YDGENDER* -Inf -Inf NA NA NA
## ATFAULT5YRS* -Inf -Inf NA NA NA
## SHAPE* 999.00 0.00 NaN NaN 0.00
## VEHAGE 20.00 19.00 0.26 -0.89 0.02
## INSURER1_PREMIUM 386.36 189.29 1.32 0.11 0.27
## INSURER2_PREMIUM 386.58 193.66 0.37 -1.22 0.27
## INSURER3_PREMIUM 388.20 186.20 0.19 -0.90 0.21
## INSURER4_PREMIUM 385.34 191.01 1.97 2.43 0.22
## COMMENCEDATE -Inf -Inf NA NA NA
Hmisc::describe(data)
## data
##
## 26 Variables 59504 Observations
## ---------------------------------------------------------------------------
## YEARLYKM
## n missing distinct Info Mean Gmd
## 59504 0 5 0.487 14.79 3.822
##
## Value 5 10 15 20 99
## Frequency 3614 6404 47576 1324 586
## Proportion 0.061 0.108 0.800 0.022 0.010
## ---------------------------------------------------------------------------
## LICCANCNT
## n missing distinct Info Sum Mean Gmd
## 59504 0 2 0.013 256 0.004302 0.008568
##
## ---------------------------------------------------------------------------
## OWNERAGE
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 82 1 49.27 16.01 28 31
## .25 .50 .75 .90 .95
## 38 49 59 69 73
##
## lowest : 17 18 19 20 21, highest: 94 95 96 98 99
## ---------------------------------------------------------------------------
## MVINSURER
## n missing distinct
## 59504 0 7
##
## Value AAMI ALLI GIO NONE NRMA OTH QBE
## Frequency 17496 3538 11130 2078 5412 18796 1054
## Proportion 0.294 0.059 0.187 0.035 0.091 0.316 0.018
## ---------------------------------------------------------------------------
## REGUSE
## n missing distinct
## 59504 0 2
##
## Value BUSG PRIV
## Frequency 4586 54918
## Proportion 0.077 0.923
## ---------------------------------------------------------------------------
## MAKE
## n missing distinct
## 59504 0 58
##
## lowest : ABARTH ALFA ROMEO ASTON MARTIN AUDI BENTLEY
## highest: SUZUKI TESLA TOYOTA VOLKSWAGEN VOLVO
## ---------------------------------------------------------------------------
## MVINSYEARS
## n missing distinct Info Mean Gmd
## 59504 0 5 0.923 1.919 1.869
##
## Value 0 1 2 3 4
## Frequency 19668 9798 5234 5304 19500
## Proportion 0.331 0.165 0.088 0.089 0.328
## ---------------------------------------------------------------------------
## MANUFYEAR
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 20 0.996 2009 5.467 2001 2003
## .25 .50 .75 .90 .95
## 2006 2010 2013 2015 2016
##
## Value 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## Frequency 988 1340 1494 1730 2530 2896 3050 3266 4032 4002
## Proportion 0.017 0.023 0.025 0.029 0.043 0.049 0.051 0.055 0.068 0.067
##
## Value 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
## Frequency 3528 4352 3702 3976 4056 4360 4302 3622 1900 378
## Proportion 0.059 0.073 0.062 0.067 0.068 0.073 0.072 0.061 0.032 0.006
## ---------------------------------------------------------------------------
## YDAGE
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 63 1 47.37 15.58 26 30
## .25 .50 .75 .90 .95
## 37 46 58 67 71
##
## lowest : 17 18 19 20 21, highest: 75 76 77 78 79
## ---------------------------------------------------------------------------
## MODEL
## n missing distinct
## 59504 0 631
##
## lowest : 1 M135I 116I 118D 118I 120D
## highest: YRV Z3 Z4 ZAFIRA ZS
## ---------------------------------------------------------------------------
## ATFAULTACCDS
## n missing distinct Info Sum Mean Gmd
## 59504 0 2 0.059 1200 0.02017 0.03952
##
## ---------------------------------------------------------------------------
## NRMAASST
## n missing distinct Info Mean Gmd
## 59504 0 7 0.942 1.971 3.014
##
## Value -1 1 2 3 5 10 25
## Frequency 19668 9798 5234 9266 14110 1156 272
## Proportion 0.331 0.165 0.088 0.156 0.237 0.019 0.005
## ---------------------------------------------------------------------------
## NOYRLICOBT
## n missing distinct Info Mean Gmd
## 59504 0 7 0.381 5.288 1.251
##
## Value 0 1 2 3 4 5 6
## Frequency 5446 522 706 738 638 774 50680
## Proportion 0.092 0.009 0.012 0.012 0.011 0.013 0.852
## ---------------------------------------------------------------------------
## MVINSTYPE
## n missing distinct
## 59504 0 3
##
## Value C N T
## Frequency 55486 2098 1920
## Proportion 0.932 0.035 0.032
## ---------------------------------------------------------------------------
## DEMERITPTS
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 15 0.256 0.3671 0.6902 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 3
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 53928 804 1022 1586 626 262 502 204 144 96
## Proportion 0.906 0.014 0.017 0.027 0.011 0.004 0.008 0.003 0.002 0.002
##
## Value 10 11 12 13 14
## Frequency 80 42 56 94 58
## Proportion 0.001 0.001 0.001 0.002 0.001
## ---------------------------------------------------------------------------
## NCBPCT
## n missing distinct Info Mean Gmd
## 59504 0 2 0.732 25.4 29.3
##
## Value 0 60
## Frequency 34310 25194
## Proportion 0.577 0.423
## ---------------------------------------------------------------------------
## COMREGVEH
## n missing distinct
## 59504 0 2
##
## Value N Y
## Frequency 55964 3540
## Proportion 0.941 0.059
## ---------------------------------------------------------------------------
## YDGENDER
## n missing distinct
## 59504 0 3
##
## Value F M U
## Frequency 32550 26364 590
## Proportion 0.547 0.443 0.010
## ---------------------------------------------------------------------------
## ATFAULT5YRS
## n missing distinct
## 59504 0 2
##
## Value N Y
## Frequency 58304 1200
## Proportion 0.98 0.02
## ---------------------------------------------------------------------------
## SHAPE
## n missing distinct
## 59504 0 4
##
## Value 999 PVF SED WAG
## Frequency 1816 6120 36886 14682
## Proportion 0.031 0.103 0.620 0.247
## ---------------------------------------------------------------------------
## VEHAGE
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 20 0.996 9.634 5.467 3 4
## .25 .50 .75 .90 .95
## 6 9 13 16 18
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 378 1900 3622 4302 4360 4056 3976 3702 4352 3528
## Proportion 0.006 0.032 0.061 0.072 0.073 0.068 0.067 0.062 0.073 0.059
##
## Value 11 12 13 14 15 16 17 18 19 20
## Frequency 4002 4032 3266 3050 2896 2530 1730 1494 1340 988
## Proportion 0.067 0.068 0.055 0.051 0.049 0.043 0.029 0.025 0.023 0.017
## ---------------------------------------------------------------------------
## INSURER1_PREMIUM
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 41 0.965 250.8 65.47 197.1 197.1
## .25 .50 .75 .90 .95
## 209.3 219.0 247.9 386.4 386.4
##
## lowest : 197.0744 201.9573 203.1732 204.3954 209.2651
## highest: 307.4777 327.9402 339.6837 350.9791 386.3640
## ---------------------------------------------------------------------------
## INSURER2_PREMIUM
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 585 0.996 288.4 74.06 197.6 208.0
## .25 .50 .75 .90 .95
## 236.1 273.3 365.6 386.6 386.6
##
## lowest : 192.9243 193.5552 194.1860 194.8166 195.4470
## highest: 383.3821 383.5928 384.5803 385.1793 386.5811
## ---------------------------------------------------------------------------
## INSURER3_PREMIUM
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 6501 0.998 301.2 59.95 218.9 233.8
## .25 .50 .75 .90 .95
## 262.0 295.3 338.9 388.2 388.2
##
## lowest : 201.9988 202.0748 202.0955 202.1231 202.1715
## highest: 387.9365 387.9826 388.1010 388.1273 388.1997
## ---------------------------------------------------------------------------
## INSURER4_PREMIUM
## n missing distinct Info Mean Gmd .05 .10
## 59504 0 22 0.973 233.8 45.8 198.8 198.8
## .25 .50 .75 .90 .95
## 206.3 211.1 225.5 355.3 379.4
##
## lowest : 194.3246 198.8252 206.3416 211.1321 212.3302
## highest: 355.2922 357.3182 373.8848 379.4100 385.3373
## ---------------------------------------------------------------------------
## COMMENCEDATE
## n missing distinct
## 59504 0 2
##
## Value 2019-07-01 2019-09-01
## Frequency 29752 29752
## Proportion 0.5 0.5
## ---------------------------------------------------------------------------
The dataset has 26 variables, and 59504 cases. The summary reveals some multinomial categorical variables and some quantitative variables are heavily dominated in one level or one value. This occurs in YEARLYKM (80% in 15000km), NOYRLICOBT (85% in 6 years), MVINSTYPE (93% in comprehensive), DEMERITPTS (90% in zero points). Some binomial variables are heavily one sided such as LICCANCNT, ATFAULTACCDS (sparse many zeros), REGUSE (92% PRIV), COMREGVEH (94% No), ATFAULT5YRS (98% No).
Commencement date evenly splits the data 50% 50%. And YDGENDER has a relatively equal split between F/M (55%/45%) with only 1% Unknown and likewise NCBPCT has a similarly even split of 0/60 (58%/42%).
All Insurer Premium distributions have a Mean slightly larger than the median and the mean ranking from smallest to largest is determined to be Insurer4, Insurer1, Insurer2, Insurer3 with the first 3 quartiles for Insurer1 and Insurer3 being much smaller range than that of Insurer2 and Insurer3. ###1.2.1 Missing data
any(is.na(data))
## [1] FALSE
The dataset appears to have no missing cases, yet some issues have been identified through initial raw data review and the results from the above describe statistics. The NRMA assist variable contains a value (-1) which are unable to be interpreted unless taken to be not held. MVINSTYPE has 3 categories assumed to be Comprehensive (C), Third-party (T) and None (N), yet not all rows containing N also signify Zero on MVINSYRS or None on MVINSURER variables. Shape contains the value ‘999’ which may be interpreted as Other. Not all COMREGVEH flagged as ‘Y’ also coincide with the flag ‘BUSG’ under REGUSE variable.
The data has been initially explored for type and values. Now before doing an indepth interrogation into the data distributions and relationships we will split the data into our test and train sets.
set.seed(3456)
split <- createDataPartition(data$MVINSURER, p=0.7, list=FALSE)
train <- data[split,]
test <- data[-split,]
train.num <- train %>%
keep(is.numeric)
train.cat <- train %>%
discard(is.numeric)
train.bpV <- train %>%
select(c(3, 9, 13, 15, 21)) %>%
gather()
summary.boxplotV <- ggplot(train.bpV, aes(x = key, y = value)) +
labs(x = "variable", title = "Insurance Premium Variables Boxplots") +
geom_boxplot(outlier.colour = "red", outlier.shape = 1)
summary.boxplotV
train.bpC <- train %>%
select(c(1, 2, 7, 11, 12, 16)) %>%
gather()
summary.boxplotC <- ggplot(train.bpC, aes(x = key, y = value)) +
labs(x = "variable", title = "Insurance Premium Variables Boxplots") +
geom_boxplot(outlier.colour = "red", outlier.shape = 1)
summary.boxplotC
train.bpI <- train %>%
select(c(22, 23, 24, 25)) %>%
gather()
summary.boxplotI <- ggplot(train.bpI, aes(x = key, y = value)) +
labs(x = "variable", title = "Insurers Premium Boxplots") +
geom_boxplot(outlier.colour = "red", outlier.shape = 1)
summary.boxplotI
The boxplots clearly reinforce what we have been shown in the statistics summary above, the demerit points and number of years license obtained are heavily weighted in one value, while age variables are more normally distributed. We could consider changing any one sided variables to be binomial variables as has specified value or does not.
The boxplots for the insurer insurance premiums clearly group the 4 insurers into two main categories, Insurer1 and Insurer4 are much more similar distribution and pricing wise while Insurer2 and Insurer3 are likewise more similar to each other. This immediately signifies room for market pricing improvements for Suncorp without great effect in changing the market composition.
train.hV <- train %>%
select(c(3, 9, 13, 15, 21)) %>%
gather()
train.histV <- ggplot(data = train.hV, mapping = aes(x = value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
train.histV
train.hC <- train %>%
select(c(1, 2, 7, 11, 12, 16)) %>%
gather()
train.histC <- ggplot(data = train.hC, mapping = aes(x = value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
train.histC
train.hI <- train %>%
select(c(22, 23, 24, 25)) %>%
gather()
train.histI <- ggplot(data = train.hI, mapping = aes(x = value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
train.histI
The histograms provide a good representation of how many categories are heavily skewed distributions. The age variables are seen to be normal as expected and the Insurer premiums now visualised reveal insights on current pricing strategy as Suncorp appears to be heavily skewed to low price premiums indicating a customer segmentation of low risk or low value insurance coverage in this product. Indicating this may be a low end product for Suncorp portfolio. This also indicates the other insurers which are heavily skewed to the high end and may correlate to classifying them as higher risk or pricing a different segment of policies as higher value premiums. This could be explored further with clustering or trees to determine which policy risks are considered as high price in Insurer3 and Insurer2 where as are priced in the lower price category for Suncorp and Insurer4. This will be the segment that can most easily afford a price increase without loss to demand as the gap is widest for this segmentation.
train.b <- train %>%
select(c(4:5, 14,17:20)) %>%
gather()
train.bar <- ggplot(data = train.b, mapping = aes(x = value)) +
geom_bar() +
facet_wrap(~key, scales = 'free_x')
train.bar
train.m <- train %>%
select(c(6)) %>%
gather()
train.barm <- ggplot(data = train.m, mapping = aes(x = value)) +
geom_bar() +
facet_wrap(~key, scales = 'free_x')
train.barm
The bar chart visualisation provides insight into the categorical nominal distributions. As described above we can see some variables are heavily distorted in proportionality while others are more evenly proportioned. The variables with less even distribution will contain more information and provided more predictive power.
The Correlation heatmap is shown below for each of the independent and dependent numb,eric variables in the data.
train.c <- mutate_all(train.num, funs(as.numeric))
corrplot(cor(train.c), method = "color", type = "lower")
The correlation heatmap shows the variables correlations with each of the insurance premiums, some are similar dependencies while others are quite different hinting a difference in the way each insurer would propose their pricing methodologies. For example, youngest driver age appears to be correlated similarly with each premium while demerit points or number of years license obtained has much more variance. From this chart we can also see that the correlation between vehicle age and manufacture year are almost 1 and therefore redundant variables so only one will need to remain in the model. Also noticed is YDAGE and OWNEAGE are quite similar along with NRMAASSIST and MVINSYRS and will will share much information.
corp <- apply(train.c[, -7], 2, function(x) cor.test(x, y=train.c$INSURER1_PREMIUM)$p.value)
cortable <- cor(train.c[, -7], train.c$INSURER1_PREMIUM)
kable(cbind(as.character(corp), cortable), col.names = c("P-value", "Correlation with dependent"))
| P-value | Correlation with dependent | |
|---|---|---|
| YEARLYKM | 0.000494384142394882 | 0.0170684247331176 |
| LICCANCNT | 4.01170752009266e-177 | 0.138365694262635 |
| OWNERAGE | 1.40945800632286e-272 | -0.171556991460753 |
| MVINSYEARS | 2.65330501218757e-122 | -0.114852181381481 |
| MANUFYEAR | 0 | -0.32569740214304 |
| YDAGE | 0 | -0.295169447462153 |
| NRMAASST | 2.14908979503085e-114 | -0.111037982099119 |
| NOYRLICOBT | 0 | -0.721622622947211 |
| DEMERITPTS | 0.00416586503604279 | 0.0140386762140103 |
| NCBPCT | 2.44582103833928e-88 | 0.0973978692306985 |
| VEHAGE | 0 | 0.32569740214304 |
| INSURER1_PREMIUM | 0 | 1 |
| INSURER2_PREMIUM | 0 | 0.651650620864274 |
| INSURER3_PREMIUM | 0 | 0.572303365172195 |
| INSURER4_PREMIUM | 0 | 0.330863594257667 |
In the case above the dependent variable was chosen to be the Insurer1_Premium. The variables with the biggest correlations are NOYRLICOBT followed by VEHAGE and YDAGE. We can also see that despite Insurer1 sharing a similar pricing range and distribution to Insurer4 it is more strongly correlated with premiums from Insurer2 and Insurer3.
We have noted that there is no missing data but there is data that doesn’t exactly align. As not knowing the meaning of -1 for NRMA assist but seeing it is actually quite a common value in the data we will not remove this. The business ownership and regular use mismatch we will leave as there could conceivably be a code where these might not align but we will note this for further investigation. For the data that mismatches for other insurer information, we will remove any that do not make sense. If MVINS = “None”" and MVINSTYPE does not equal “N” and MVINSYRS does not equal “0” (or vise versa) we will remove the data from the set.
train <- train[!((train$MVINSURER=="NONE" & (train$MVINSYEARS!=0 || train$MVINSTYPE!="N")) || (train$MVINSTYPE=="N" & (train$MVINSYEARS!=0 || train$MVINSURER!="NONE"))),]
test <- test[!((test$MVINSURER=="NONE" & (test$MVINSYEARS!=0 || test$MVINSTYPE!="N")) || (test$MVINSTYPE=="N" & (test$MVINSYEARS!=0 || test$MVINSURER!="NONE"))),]
The distribution plots above all the age variables tend to be nearly normal while YEARLYKM and DEMERITPTS are left skewed. Linear regression does not make any assumptions on the normality of any variables, so we will not transform the skewed variables for now. At a more indepth look this may need to be considered for further improvement or reclassification of the variables into categorical is another possible revision.
We will transform three variables that are highly dominated by one value into binary variables, YEARLYKM: LESS15, NOYRSLICOBT: LESS6, DEMERITPTS: ZERO; with any case having a value inside the described range as 1, and all others as 0. We will also convert YDGENDER into an additional variable YDGenderM as a binomial with Unknown assumed in Male and Male taking the 1 and Female taking the value 2. All these additional variables will be used within the feature selection and model training to determine which variables may be better predictors and eliminating the others.
We will also transform any categorical binomials into 1/0 values and any ordinal categorical into numerical ordered values. COMMENCEDATE1 (time1 = 1, time2 = 0), ATFAULT5YRS (FAULT = 1, NOTFAULT = 0),REGUSEP(PRIV = 1, BUSG = 0).
train$YEARLYKMLESS15 <- ifelse(train$YEARLYKM < 15 , 1 , 0)
train$NOYRSLICOBTLESS6 <- ifelse(train$NOYRLICOBT < 6 , 1 , 0)
train$DEMERITPTSZERO <- ifelse(train$DEMERITPTS ==0 , 1 , 0)
train$YDGENDERM <- ifelse(train$YDGENDER =="U" || train$YDGENDER =="M" , 1 , 0)
train$COMMENCEDATE1 <- ifelse(train$COMMENCEDATE =="2019-07-01" , 1 , 0)
train$ATFAULT5YRSY <- ifelse(train$ATFAULT5YRS =="Y" , 1 , 0)
train$REGUSEP <- ifelse(train$REGUSE =="PRIV" , 1 , 0)
test$YEARLYKMLESS15 <- ifelse(test$YEARLYKM < 15 , 1 , 0)
test$NOYRSLICOBTLESS6 <- ifelse(test$NOYRLICOBT < 6 , 1 , 0)
test$DEMERITPTSZERO <- ifelse(test$DEMERITPTS ==0 , 1 , 0)
test$YDGENDERM <- ifelse(test$YDGENDER =="U" || test$YDGENDER =="M" , 1 , 0)
test$COMMENCEDATE1 <- ifelse(test$COMMENCEDATE =="2019-07-01" , 1 , 0)
test$ATFAULT5YRSY <- ifelse(test$ATFAULT5YRS =="Y" , 1 , 0)
test$REGUSEP <- ifelse(test$REGUSE =="PRIV" , 1 , 0)
Although some variables have highly clustered values, all categories have a relatively set amount of values or levels and there is unlikely any reason to remove any of these as outliers.
For the first model, we will not eliminate any variables but will observe the interactions of all in order to serve as a baseline for comparison with subsequent models. We will utilise this output to eliminate one of our duplicate variable sets (with the newly created variables above) and to remove any variables that are not significant enough. We will however remove model because some models may appear in the test data or future data that may not be in the training sets and may not be predicted by the model.
m1 <- lm(formula = INSURER1_PREMIUM ~ . - MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM,
data = train)
options(max.print = 4000)
summary(m1)
##
## Call:
## lm(formula = INSURER1_PREMIUM ~ . - MODEL - INSURER2_PREMIUM -
## INSURER3_PREMIUM - INSURER4_PREMIUM, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -283.453 -16.655 -4.786 11.334 162.461
##
## Coefficients: (6 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.724e+03 1.235e+02 38.265 < 2e-16 ***
## YEARLYKM 1.622e-01 1.945e-02 8.342 < 2e-16 ***
## LICCANCNT 3.106e+01 2.463e+00 12.609 < 2e-16 ***
## OWNERAGE 2.176e-01 1.929e-02 11.282 < 2e-16 ***
## MVINSURERALLI -3.476e+00 7.275e-01 -4.777 1.78e-06 ***
## MVINSURERGIO 6.463e-01 4.999e-01 1.293 0.196022
## MVINSURERNONE -5.297e+01 8.451e+00 -6.267 3.71e-10 ***
## MVINSURERNRMA -3.307e+00 6.207e-01 -5.329 9.95e-08 ***
## MVINSUREROTH -2.185e+00 4.316e-01 -5.062 4.17e-07 ***
## MVINSURERQBE -4.301e+00 1.239e+00 -3.472 0.000517 ***
## REGUSEPRIV -7.473e-01 1.056e+00 -0.707 0.479268
## MAKEALFA ROMEO -8.743e+00 1.209e+01 -0.723 0.469545
## MAKEASTON MARTIN -2.327e+01 1.575e+01 -1.477 0.139657
## MAKEAUDI -1.409e+01 1.150e+01 -1.225 0.220694
## MAKEBENTLEY -3.470e+00 3.438e+01 -0.101 0.919589
## MAKEBMW -1.344e+01 1.149e+01 -1.169 0.242359
## MAKECHERY -9.684e+00 1.750e+01 -0.553 0.580098
## MAKECHRYSLER -1.113e+01 1.220e+01 -0.912 0.361771
## MAKECITROEN -1.701e+01 1.193e+01 -1.425 0.154129
## MAKEDAEWOO -2.303e+01 1.507e+01 -1.529 0.126375
## MAKEDAIHATSU -2.737e+01 1.331e+01 -2.056 0.039777 *
## MAKEDODGE 2.107e+00 1.240e+01 0.170 0.865089
## MAKEFERRARI -2.922e+01 1.850e+01 -1.579 0.114256
## MAKEFIAT -1.320e+01 1.202e+01 -1.098 0.272024
## MAKEFORD -1.194e+01 1.149e+01 -1.040 0.298560
## MAKEGREAT WALL MOTORS 2.186e+01 1.311e+01 1.667 0.095576 .
## MAKEHAVAL -1.612e+01 2.194e+01 -0.735 0.462588
## MAKEHOLDEN -1.904e+01 1.148e+01 -1.659 0.097099 .
## MAKEHONDA -1.035e+01 1.148e+01 -0.902 0.366963
## MAKEHSV 9.515e+00 1.236e+01 0.770 0.441373
## MAKEHUMMER -1.671e+01 3.437e+01 -0.486 0.626808
## MAKEHYUNDAI -1.521e+01 1.148e+01 -1.325 0.185136
## MAKEINFINITI -1.600e+01 1.620e+01 -0.988 0.323402
## MAKEISUZU -1.581e+01 1.311e+01 -1.206 0.227786
## MAKEJAGUAR -1.453e+01 1.213e+01 -1.198 0.231021
## MAKEJEEP -1.512e+01 1.156e+01 -1.308 0.190762
## MAKEKIA -1.310e+01 1.152e+01 -1.138 0.255333
## MAKELAND ROVER -1.447e+01 1.169e+01 -1.239 0.215512
## MAKELDV -2.310e+01 1.679e+01 -1.376 0.168831
## MAKELEXUS -5.352e+00 1.151e+01 -0.465 0.641889
## MAKELOTUS -2.157e+01 1.680e+01 -1.284 0.199186
## MAKEMASERATI 1.129e+01 1.848e+01 0.611 0.541230
## MAKEMAZDA 2.546e+00 1.147e+01 0.222 0.824385
## MAKEMCLAREN -2.076e+01 2.563e+01 -0.810 0.417916
## MAKEMERCEDES-AMG 6.526e+00 1.261e+01 0.518 0.604807
## MAKEMERCEDES-BENZ 9.081e+00 1.264e+01 0.718 0.472466
## MAKEMERCEDES AMG -1.217e+01 1.750e+01 -0.695 0.486886
## MAKEMERCEDES BENZ -3.502e+00 1.150e+01 -0.305 0.760677
## MAKEMG 2.812e+00 1.457e+01 0.193 0.846936
## MAKEMINI -1.985e+01 1.181e+01 -1.681 0.092751 .
## MAKEMITSUBISHI -1.769e+01 1.149e+01 -1.539 0.123784
## MAKENISSAN -1.314e+01 1.149e+01 -1.144 0.252462
## MAKEOPEL -2.296e+01 1.538e+01 -1.493 0.135425
## MAKEPEUGEOT -1.252e+01 1.166e+01 -1.073 0.283311
## MAKEPORSCHE -2.300e+01 1.181e+01 -1.948 0.051419 .
## MAKEPROTON -2.231e+01 1.389e+01 -1.606 0.108376
## MAKERANGE ROVER -1.136e+00 1.183e+01 -0.096 0.923530
## MAKERENAULT -1.552e+01 1.175e+01 -1.320 0.186690
## MAKESAAB 1.647e+01 1.295e+01 1.271 0.203564
## MAKESKODA -2.396e+01 1.186e+01 -2.020 0.043417 *
## MAKESMART -2.175e+01 1.678e+01 -1.296 0.194923
## MAKESSANGYONG -1.722e+01 1.367e+01 -1.260 0.207762
## MAKESUBARU -1.656e+01 1.148e+01 -1.443 0.149168
## MAKESUZUKI -1.190e+01 1.153e+01 -1.032 0.301869
## MAKETESLA -1.001e+01 1.458e+01 -0.687 0.492058
## MAKETOYOTA -1.326e+01 1.147e+01 -1.157 0.247350
## MAKEVOLKSWAGEN -1.608e+01 1.148e+01 -1.400 0.161475
## MAKEVOLVO -1.981e+01 1.160e+01 -1.708 0.087659 .
## MVINSYEARS -4.466e-01 1.806e-01 -2.474 0.013383 *
## MANUFYEAR -2.158e+00 4.003e-02 -53.902 < 2e-16 ***
## YDAGE -1.087e+00 2.094e-02 -51.920 < 2e-16 ***
## ATFAULTACCDS 1.180e+02 1.150e+00 102.576 < 2e-16 ***
## NRMAASST 3.065e-01 1.001e-01 3.060 0.002214 **
## NOYRLICOBT -2.399e+01 2.502e-01 -95.879 < 2e-16 ***
## MVINSTYPEN 1.573e+02 8.394e+00 18.741 < 2e-16 ***
## MVINSTYPET 1.092e+01 9.522e-01 11.467 < 2e-16 ***
## DEMERITPTS 4.844e-01 1.891e-01 2.561 0.010435 *
## NCBPCT 2.823e-02 6.870e-03 4.108 3.99e-05 ***
## COMREGVEHY -1.040e-02 1.192e+00 -0.009 0.993035
## YDGENDERM 5.337e-01 3.256e-01 1.639 0.101249
## YDGENDERU -9.891e-02 1.625e+00 -0.061 0.951454
## ATFAULT5YRSY NA NA NA NA
## SHAPEPVF 2.141e+00 1.102e+00 1.943 0.052013 .
## SHAPESED -7.219e-01 9.912e-01 -0.728 0.466446
## SHAPEWAG -7.928e-01 1.024e+00 -0.774 0.438841
## VEHAGE NA NA NA NA
## COMMENCEDATE 1.797e-03 5.122e-03 0.351 0.725623
## YEARLYKMLESS15 5.164e-01 5.257e-01 0.982 0.325944
## NOYRSLICOBTLESS6 -4.865e+00 1.287e+00 -3.779 0.000158 ***
## DEMERITPTSZERO 3.112e+00 9.440e-01 3.296 0.000980 ***
## YDGENDERM NA NA NA NA
## COMMENCEDATE1 NA NA NA NA
## ATFAULT5YRSY NA NA NA NA
## REGUSEP NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.39 on 41568 degrees of freedom
## Multiple R-squared: 0.7662, Adjusted R-squared: 0.7657
## F-statistic: 1566 on 87 and 41568 DF, p-value: < 2.2e-16
m2 <- lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -REGUSE -REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS,
data = train)
options(max.print = 4000)
summary(m2)
##
## Call:
## lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM -
## INSURER3_PREMIUM - INSURER4_PREMIUM - REGUSE - REGUSEP -
## COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY -
## SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.497 -17.753 -5.598 12.725 160.457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.701e+03 7.691e+01 61.124 < 2e-16 ***
## YEARLYKM 1.554e-01 1.971e-02 7.887 3.16e-15 ***
## LICCANCNT 3.183e+01 2.461e+00 12.933 < 2e-16 ***
## OWNERAGE 2.108e-01 1.955e-02 10.783 < 2e-16 ***
## MVINSURERALLI -3.945e+00 7.373e-01 -5.351 8.80e-08 ***
## MVINSURERGIO 6.841e-01 5.041e-01 1.357 0.174789
## MVINSURERNONE -5.488e+01 8.578e+00 -6.399 1.59e-10 ***
## MVINSURERNRMA -3.596e+00 6.298e-01 -5.710 1.14e-08 ***
## MVINSUREROTH -2.336e+00 4.368e-01 -5.349 8.90e-08 ***
## MVINSURERQBE -3.587e+00 1.256e+00 -2.856 0.004298 **
## MVINSYEARS -3.570e-01 1.830e-01 -1.950 0.051129 .
## MANUFYEAR -2.136e+00 3.823e-02 -55.860 < 2e-16 ***
## YDAGE -1.075e+00 2.124e-02 -50.629 < 2e-16 ***
## ATFAULTACCDS 1.184e+02 1.168e+00 101.371 < 2e-16 ***
## NRMAASST 2.740e-01 1.015e-01 2.698 0.006974 **
## NOYRLICOBT -2.397e+01 2.535e-01 -94.575 < 2e-16 ***
## MVINSTYPEN 1.589e+02 8.517e+00 18.652 < 2e-16 ***
## MVINSTYPET 1.015e+01 9.653e-01 10.515 < 2e-16 ***
## NCBPCT 2.633e-02 6.927e-03 3.801 0.000144 ***
## YEARLYKMLESS15 4.231e-01 5.334e-01 0.793 0.427721
## NOYRSLICOBTLESS6 -4.618e+00 1.304e+00 -3.543 0.000396 ***
## DEMERITPTSZERO 9.225e-01 6.176e-01 1.494 0.135255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.95 on 41634 degrees of freedom
## Multiple R-squared: 0.7578, Adjusted R-squared: 0.7577
## F-statistic: 6203 on 21 and 41634 DF, p-value: < 2.2e-16
m3 <- lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -REGUSE -REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO,
data = train)
options(max.print = 4000)
summary(m3)
##
## Call:
## lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM -
## INSURER3_PREMIUM - INSURER4_PREMIUM - REGUSE - REGUSEP -
## COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY -
## SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS -
## YEARLYKMLESS15 - DEMERITPTSZERO, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.981 -17.746 -5.589 12.720 160.546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.700e+03 7.682e+01 61.178 < 2e-16 ***
## YEARLYKM 1.494e-01 1.827e-02 8.176 3.03e-16 ***
## LICCANCNT 3.145e+01 2.440e+00 12.890 < 2e-16 ***
## OWNERAGE 2.111e-01 1.954e-02 10.802 < 2e-16 ***
## MVINSURERALLI -3.977e+00 7.369e-01 -5.398 6.79e-08 ***
## MVINSURERGIO 6.858e-01 5.041e-01 1.360 0.173704
## MVINSURERNONE -5.489e+01 8.577e+00 -6.399 1.58e-10 ***
## MVINSURERNRMA -3.594e+00 6.297e-01 -5.708 1.15e-08 ***
## MVINSUREROTH -2.363e+00 4.357e-01 -5.424 5.87e-08 ***
## MVINSURERQBE -3.610e+00 1.256e+00 -2.874 0.004053 **
## MVINSYEARS -3.499e-01 1.821e-01 -1.922 0.054627 .
## MANUFYEAR -2.134e+00 3.818e-02 -55.899 < 2e-16 ***
## YDAGE -1.075e+00 2.122e-02 -50.645 < 2e-16 ***
## ATFAULTACCDS 1.184e+02 1.168e+00 101.366 < 2e-16 ***
## NRMAASST 2.731e-01 1.015e-01 2.690 0.007159 **
## NOYRLICOBT -2.397e+01 2.534e-01 -94.614 < 2e-16 ***
## MVINSTYPEN 1.589e+02 8.517e+00 18.656 < 2e-16 ***
## MVINSTYPET 1.015e+01 9.653e-01 10.512 < 2e-16 ***
## NCBPCT 2.659e-02 6.546e-03 4.062 4.87e-05 ***
## NOYRSLICOBTLESS6 -4.602e+00 1.303e+00 -3.531 0.000414 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.95 on 41636 degrees of freedom
## Multiple R-squared: 0.7578, Adjusted R-squared: 0.7577
## F-statistic: 6855 on 19 and 41636 DF, p-value: < 2.2e-16
m4 <- lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -REGUSE -REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO - MVINSURER,
data = train)
options(max.print = 4000)
summary(m4)
##
## Call:
## lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM -
## INSURER3_PREMIUM - INSURER4_PREMIUM - REGUSE - REGUSEP -
## COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY -
## SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS -
## YEARLYKMLESS15 - DEMERITPTSZERO - MVINSURER, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -287.028 -17.684 -5.676 12.747 159.901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.648e+03 7.671e+01 60.591 < 2e-16 ***
## YEARLYKM 1.475e-01 1.830e-02 8.063 7.67e-16 ***
## LICCANCNT 3.123e+01 2.443e+00 12.783 < 2e-16 ***
## OWNERAGE 2.154e-01 1.953e-02 11.029 < 2e-16 ***
## MVINSYEARS -5.709e-01 1.789e-01 -3.190 0.001422 **
## MANUFYEAR -2.110e+00 3.813e-02 -55.325 < 2e-16 ***
## YDAGE -1.069e+00 2.120e-02 -50.411 < 2e-16 ***
## ATFAULTACCDS 1.193e+02 1.165e+00 102.397 < 2e-16 ***
## NRMAASST 3.797e-01 1.004e-01 3.783 0.000155 ***
## NOYRLICOBT -2.396e+01 2.513e-01 -95.343 < 2e-16 ***
## MVINSTYPEN 1.063e+02 9.529e-01 111.558 < 2e-16 ***
## MVINSTYPET 1.074e+01 9.639e-01 11.138 < 2e-16 ***
## NCBPCT 4.327e-02 6.256e-03 6.917 4.69e-12 ***
## NOYRSLICOBTLESS6 -4.269e+00 1.303e+00 -3.277 0.001049 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.99 on 41642 degrees of freedom
## Multiple R-squared: 0.757, Adjusted R-squared: 0.757
## F-statistic: 9981 on 13 and 41642 DF, p-value: < 2.2e-16
Three models were run pruning more variables as each model was created. The first noticeable and surprising variable to be found insignificant toward predicting the target variable was Model, next to no models gave any information on the premium cost. A few other variables were removed in the first round of pruning, but mainly others were more due to duplication of variables with shared information, such as YDAGE, ATFAULT5YRS, VEHAGE.. COMMENCEDATE had very little significance implying that pricing strategy was fairly consistent in the two price periods. Surprising variables to be excluded were the Business vehicle flags and the shape. Adjusted R-Squared = 0.7696.
The second model ran after removing the insignificant predictors. In this second run only two additional variables were found to be poor predictions, YEARLYKM and DEMERITPTSZERO. There was a slight reduction in adjusted R-squared which we will test later to determine if significant. Adjusted R-Squared = 0.7614.
The model was run for a third time with the two variables removed and the adjusted R-Squared was unchanged. Adjusted R-Squared = 0.7614.
Finally a fourth round of the model was run removing MVINSURER as not all categories were significant. There was an expected loss to information as adjusted R-Squared reduced but it may be worth the simplicity. Adjusted R-Squared = 0.7607.
anova(m2, m1)
## Analysis of Variance Table
##
## Model 1: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER +
## REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL +
## ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS +
## NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE +
## INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM +
## COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO +
## YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE -
## MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -
## REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS -
## ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 -
## DEMERITPTS
## Model 2: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER +
## REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL +
## ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS +
## NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE +
## INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM +
## COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO +
## YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MODEL -
## INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41634 45196719
## 2 41568 43620355 66 1576364 22.761 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m3, m2)
## Analysis of Variance Table
##
## Model 1: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER +
## REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL +
## ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS +
## NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE +
## INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM +
## COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO +
## YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE -
## MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -
## REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS -
## ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 -
## DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO
## Model 2: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER +
## REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL +
## ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS +
## NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE +
## INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM +
## COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO +
## YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE -
## MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -
## REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS -
## ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 -
## DEMERITPTS
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41636 45199262
## 2 41634 45196719 2 2542.5 1.171 0.3101
anova(m4, m3)
## Analysis of Variance Table
##
## Model 1: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER +
## REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL +
## ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS +
## NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE +
## INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM +
## COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO +
## YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE -
## MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -
## REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS -
## ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 -
## DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO - MVINSURER
## Model 2: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER +
## REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL +
## ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS +
## NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE +
## INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM +
## COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO +
## YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE -
## MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -
## REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS -
## ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 -
## DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41642 45334345
## 2 41636 45199262 6 135083 20.739 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA test was run on all iterations of the model and in each case the subsequent version of the model was outperformed the predecessor. For this reasons the simplest model, m4, has a slightly reduced R-squired value but has been chosen as the best model for the training set data.
#remove model variable for lm model to run
trainm <- train[,c(1:9,11:33)]
testm <- test[,c(1:9,11:33)]
m4a <- lm(formula = INSURER1_PREMIUM ~ . - MAKE - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -REGUSE -REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO - MVINSURER,
data = trainm)
predicted.premium4 <- predict(object = m4a, newdata = testm, type = "response")
rp1 <- ggplot(m4a, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE) +
labs(title = "Residuals vs Fitted")
rp3 <- ggplot(m4a) +
stat_qq(aes(sample = .stdresid)) +
geom_abline()
rp7 <- ggplot(m4a, aes(.hat, .cooksd)) +
geom_vline(xintercept = 0, colour = NA) +
geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
geom_smooth(se = FALSE) +
geom_point() +
labs(title = "Cook's distance vs Leverage")
rp8 <- ggplot(m4a, aes(.resid)) +
geom_histogram(bins = 7, color="darkblue", fill="steelblue")
grid.arrange(rp1, rp3, rp7, rp8, ncol = 2)
The residual graphs show an acceptable fit for the selected model. The residuals appear to be close enough to normally distributed coupled with the fairly good R-squared value we can determine that the model is fairly accurate with low spread around the mean. We will use it to make predictions and compare results.
results.df <- data.frame(cbind(actuals = testm$INSURER1_PREMIUM, predicted = predicted.premium4))
results.df <- results.df %>%
mutate(error = results.df$actuals - results.df$predicted) %>%
round(., 2)
results.df <- results.df %>%
mutate(percerror = paste0(round(results.df$error/results.df$actuals*100,2),"%"))
kable(head(results.df))
| actuals | predicted | error | percerror |
|---|---|---|---|
| 197.07 | 217.83 | -20.75 | -10.53% |
| 231.08 | 244.56 | -13.48 | -5.83% |
| 240.72 | 217.54 | 23.18 | 9.63% |
| 247.93 | 238.24 | 9.70 | 3.91% |
| 209.27 | 222.31 | -13.04 | -6.23% |
| 209.27 | 220.15 | -10.88 | -5.2% |
sprintf("The mean percent error is: %s%%", round(mean(results.df$error/results.df$actuals*100), 2))
## [1] "The mean percent error is: -1.48%"
Plot actuals verse predicted to see the variance of the predicted from the actual premiums.
rp <- ggplot(results.df, aes(results.df$predicted, results.df$actuals)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE) +
labs(title = "Actuals vs Fitted")
grid.arrange(rp , ncol = 1)
Although the model results appear fairly acceptable when reviewing the mean square error and the adjusted R-Squared, when we review the actuals vs fitted, we can see that in actuality, the INSURER1_PREMIUMS are quite banded (41 price points) and therefore categorical in nature. The linear regression model predicts a continuous output. This highlights an additional method which may be acceptable would be to predict the premium bands using regression trees. This exercise will be left for later. The final model contained 12 variables. The next steps will be to reproduce the same steps for the other insurers to determine and main differences in the pricing models. As insurer 2 and 3 had many more price points we expect a better fit where as insurer4 only had 22 we expect the same banding affect to occur.
In this project the aim was to identify price optimisation opportunity through studying the premium pricing models for Suncorp and 3 other unknown competitors. The data was gathered over 2 relatively close time periods and in all models this was found insignificant expect for Insurer2. The pricing differences in the two time periods could be compared outside of this pricing modeling in a anova comparison table to see if there is any significant differences in the two time periods. No additional time was spent down this avenue at this time. The price optimisation was found to be more interesting when studying the structure of the data at hand, the preliminary linear regression models and the feature selection. The pricing structure for each insurer shows there are much more variance in pricing for Insurers 2 and 3 with on average higher price tags for the same policy holder quotes. These pricing structures had much more variance than Insurers 1 and 4 which is indicative of a more variable pricing model inline with a linear regression. Further studies would include investigating the appropriateness of fit for a GLM for these insurers. Insurers 1 and 4 had much more banded pricing structure with a lower mean and less variation in the pricing models would indicate a better avenue would be to investigate the pricing structure with random forests. Using the same model selection process for all four insurers gave a like for like feature selection attributed to the pricing distribution. The results were actually quite intriguing, with Insurer2 having a very different model base than the other Insurers; Insurer3 had a model with variables in between that of Insurer1 and 2.
The overall conclusions of this investigation is that Suncorp (Insurer1) could improve their pricing structure to allow for more pricing bands and therefore increase profitability for the same customer base if it were to develop a model which included variables such as Gender, Business use and Vehicle shape. This added knowledge could allow for more specified risk structure and pricing variance and if the implementation allows for similar risk with increased pricing with no loss to customer segmentation or demand then there could be room for profit gain and improved market segmentation.
The overall conclusions are that this investigation was quite basic and could be greatly improved with more time. To better predict pricing optimisation opportunists more data covering Risk and customer segmentation along with internal strategy data for acquiring and keeping customers would also be useful to improving the theory. As for the modelling, besides investigating other better fit forms of models for this current data set, there could be better accuracy testing with the use of cross fold validation or leave one out cross validation and bootstrapping which were all not implemented in this model due to time constraints.
Research studies show the GLMs offer better predictive power for the given area of study, with GAMS allowing for even better performance but offering less simply explanations for the interactions at work. The use of regression trees and random forests could allow for better insights on variable splitting for information gain. Bagging would then likewise add a complexity that makes simple explanation of results hard but can add for further accuracy. With each model there will be benefits and difficulties that will need to be considered for the overall desired outcome. In this case the outcome was based around an investigation of backward engineering pricing structures for optimisation not accurately predicting price. Although accurately predicting premiums could greatly improve market knowledge and strategy.
The following is to make note of the sites which were read and used for Insurance background knowledge and modelling procedures specific to modelling insurance pricing.
http://rstudio-pubs-static.s3.amazonaws.com/156481_80ee6ee3a0414fd38f5d3ad33d14c771.html
https://rpubs.com/jt_rpubs/279278
https://datasciencebeginners.com/2018/11/26/functions-and-packages-for-feature-selection-in-r/
https://topepo.github.io/caret/pre-processing.html#identifying-correlated-predictors
https://mlr.mlr-org.com/articles/tutorial/feature_selection.html
https://beta.vu.nl/nl/Images/werkstuk-fonti_tcm235-836234.pdf
https://www.math.mcgill.ca/yyang/demo.html
https://www.actuariesindia.org/SeminarDocs/8thCBSGI/ppt/Motor_Insurance_contd.pdf
https://www.kaggle.com/kernels/scriptcontent/3267460/download
https://www.casact.org/pubs/forum/00wforum/00wf107.pdf
https://www.theanalysisfactor.com/generalized-linear-models-glm-r-part4/
https://statisticsbyjim.com/regression/low-r-squared-regression/