0.Purpose and Aim

Analyse the dataset containing price points for a particular motor insurance product for four different insurers on the market covering two price periods. Generate insights from the data and provide guidance on the pricing strategy for this product.

Firstly the data will be compared for the two price periods to see if there was a change in structure or large variance over hte two pricing periods. Second, the dataset will be used to back engineer the pricing composition for all four insurers provided in the dataset. There will be suggestions made on market segmentation considering the pricing structures of the 4 insurers and the price deconstruction results. Market reserach has not been done to contribute but would be neccessary for more indepth market segmentation. It is determined that pricing optimisation would depend heavily on the components of the product such as but not limited to coverage inclusions, duration, excess, max coverage amounts which are unavailable along with the also lacking company specific data like customer segmentation, perentage of new to returning customer makeup, and other comparable portfolio products and further other factors; for this reason pricing optimisation will be greatly limited and will be suggested based upon preliminary analysis around the pricing distribution of the premiums in comparison with other insurers distribution to determine if ther may be room to increase profits without loss to customers while still maintaing a similar standing in the competitor ranking.

All work is done in R and presented in this document. Due to time constraints, all analysis has been greatly simplified and done on an introductory/exploratory level. To appropriately assess and improve the pricing stratgy for this product, one would need to do background reserach within the broader field and local market, obtain additional product and portfolio information, delve more thoroughly into feature selection using knowledge of the relationships between variables, develop better performing models using acquired business acumen (exploring GAMs, GLMs, and Random Forests),d improve model assessment through rigorous accuracy testing using methodologies such as bootstrapping and cross-validation and the statistically test for significant improvements in the various models to determine the best model accuracy vs complexity that may be relied upon for future predictions, strategy improvements and competitor pricing estimations.

1. Data Exploration

library(readr)
library(caret)
library(psych)
library(Hmisc)
library(purrr)
library(tidyr)
library(ggplot2)
library(car)
library(MASS)
library(mgcViz)
library(tidyverse)
library(knitr)
library(corrplot)
library(gridExtra)
library(lattice)

1.1 Import Dataset

data <- read_csv("~/R/Suncorp Ins Test/data.csv")

1.1.1 Data Dictionary

defs <- c("Average annual kilometers (in thousands) containing categorical quantitative 5 levels",
          "Occurance of convictions or suspensions in the last 5 years binomial",
          "Owner age quantitative ranging from 17 to 99",
          "Other insurance provider categorical 7 levels including Other and None",
          "Registration use type categorical 2 levels", 
          "Vehicle manufacturer categorical",
          "Years insured with other insurance quantitative ranging from 0 to 4",
          "Year vehicle was manufactured quantitative ranging from 1999 to 2018",
          "Age of youngest driver quantitative ranging from 17 to 79",
          "Vehicle model categorical",
          "Occurance of at fault accidents in the last 2 years binomial",
          "Number of years with NRMA assist quantitative ranging from -1 to 25",
          "Number of years drivers license has been obtained quantitative ranging from 0 to 6",
          "Other insurance type on the vehicle categorical 3 levles",
          "Number of demerit points held quantitative ranging from 0 to 14",
          "No claim discount percentage quantitative (2 values)",
          "Company ownership binomial Y/N",
          "Driver gender, either male or female or Undefined (M/F/U)", 
          "Occurance of at fault accidents in the last 5 years binomial Y/N",
          "Vehical Shape categorical 4 levels",
          "Vehical Age in years quantitative ranging from 1 to 20",
          "Dependent variable - measures the premium pricing for insurer 1 - Suncorp",
          "Dependent variable - measures the premium pricing for insurer 2 - Unknown",
          "Dependent variable - measures the premium pricing for insurer 3 - Unknown",
          "Dependent variable - measures the premium pricing for insurer 4 - Unknown",
          "Commencement date for the policy categorical date 2 levels")

data.dict <- data.frame(names(data), defs, stringsAsFactors = F)
names(data.dict) <- c("Variable Name", "Definition")

kable(data.dict)

Variable Name	Definition
YEARLYKM	Average annual kilometers (in thousands) containing categorical quantitative 5 levels
LICCANCNT	Occurance of convictions or suspensions in the last 5 years binomial
OWNERAGE	Owner age quantitative ranging from 17 to 99
MVINSURER	Other insurance provider categorical 7 levels including Other and None
REGUSE	Registration use type categorical 2 levels
MAKE	Vehicle manufacturer categorical
MVINSYEARS	Years insured with other insurance quantitative ranging from 0 to 4
MANUFYEAR	Year vehicle was manufactured quantitative ranging from 1999 to 2018
YDAGE	Age of youngest driver quantitative ranging from 17 to 79
MODEL	Vehicle model categorical
ATFAULTACCDS	Occurance of at fault accidents in the last 2 years binomial
NRMAASST	Number of years with NRMA assist quantitative ranging from -1 to 25
NOYRLICOBT	Number of years drivers license has been obtained quantitative ranging from 0 to 6
MVINSTYPE	Other insurance type on the vehicle categorical 3 levles
DEMERITPTS	Number of demerit points held quantitative ranging from 0 to 14
NCBPCT	No claim discount percentage quantitative (2 values)
COMREGVEH	Company ownership binomial Y/N
YDGENDER	Driver gender, either male or female or Undefined (M/F/U)
ATFAULT5YRS	Occurance of at fault accidents in the last 5 years binomial Y/N
SHAPE	Vehical Shape categorical 4 levels
VEHAGE	Vehical Age in years quantitative ranging from 1 to 20
INSURER1_PREMIUM	Dependent variable - measures the premium pricing for insurer 1 - Suncorp
INSURER2_PREMIUM	Dependent variable - measures the premium pricing for insurer 2 - Unknown
INSURER3_PREMIUM	Dependent variable - measures the premium pricing for insurer 3 - Unknown
INSURER4_PREMIUM	Dependent variable - measures the premium pricing for insurer 4 - Unknown
COMMENCEDATE	Commencement date for the policy categorical date 2 levels

1.2 Data Structure

psych::describe(data)
##                  vars     n    mean     sd  median trimmed    mad     min
## YEARLYKM            1 59504   14.79   8.88   15.00   14.57   0.00    5.00
## LICCANCNT           2 59504    0.00   0.07    0.00    0.00   0.00    0.00
## OWNERAGE            3 59504   49.27  14.03   49.00   48.89  14.83   17.00
## MVINSURER*          4 59504     NaN     NA      NA     NaN     NA     Inf
## REGUSE*             5 59504     NaN     NA      NA     NaN     NA     Inf
## MAKE*               6 59504     NaN     NA      NA     NaN     NA     Inf
## MVINSYEARS          7 59504    1.92   1.70    2.00    1.90   2.97    0.00
## MANUFYEAR           8 59504 2009.37   4.77 2010.00 2009.54   5.93 1999.00
## YDAGE               9 59504   47.37  13.60   46.00   47.11  14.83   17.00
## MODEL*             10 59504  419.10 568.59  323.00  319.56 171.98    2.00
## ATFAULTACCDS       11 59504    0.02   0.14    0.00    0.00   0.00    0.00
## NRMAASST           12 59504    1.97   3.01    2.00    1.73   4.45   -1.00
## NOYRLICOBT         13 59504    5.29   1.83    6.00    5.85   0.00    0.00
## MVINSTYPE*         14 59504     NaN     NA      NA     NaN     NA     Inf
## DEMERITPTS         15 59504    0.37   1.43    0.00    0.00   0.00    0.00
## NCBPCT             16 59504   25.40  29.65    0.00   24.26   0.00    0.00
## COMREGVEH*         17 59504     NaN     NA      NA     NaN     NA     Inf
## YDGENDER*          18 59504     NaN     NA      NA     NaN     NA     Inf
## ATFAULT5YRS*       19 59504     NaN     NA      NA     NaN     NA     Inf
## SHAPE*             20 59504  999.00   0.00  999.00  999.00   0.00  999.00
## VEHAGE             21 59504    9.63   4.77    9.00    9.46   5.93    1.00
## INSURER1_PREMIUM   22 59504  250.76  67.03  218.98  240.53  32.23  197.07
## INSURER2_PREMIUM   23 59504  288.36  65.74  273.33  287.31  67.88  192.92
## INSURER3_PREMIUM   24 59504  301.21  52.43  295.34  300.71  55.60  202.00
## INSURER4_PREMIUM   25 59504  233.79  53.01  211.13  220.97  18.25  194.32
## COMMENCEDATE       26 59504     NaN     NA      NA     NaN     NA     Inf
##                      max   range  skew kurtosis   se
## YEARLYKM           99.00   94.00  8.32    76.91 0.04
## LICCANCNT           1.00    1.00 15.15   227.43 0.00
## OWNERAGE           99.00   82.00  0.24    -0.53 0.06
## MVINSURER*          -Inf    -Inf    NA       NA   NA
## REGUSE*             -Inf    -Inf    NA       NA   NA
## MAKE*               -Inf    -Inf    NA       NA   NA
## MVINSYEARS          4.00    4.00  0.12    -1.69 0.01
## MANUFYEAR        2018.00   19.00 -0.26    -0.89 0.02
## YDAGE              79.00   62.00  0.17    -0.75 0.06
## MODEL*           4008.00 4006.00  5.08    27.80 2.33
## ATFAULTACCDS        1.00    1.00  6.83    44.61 0.00
## NRMAASST           25.00   26.00  2.33    14.09 0.01
## NOYRLICOBT          6.00    6.00 -2.35     3.76 0.01
## MVINSTYPE*          -Inf    -Inf    NA       NA   NA
## DEMERITPTS         14.00   14.00  5.25    33.01 0.01
## NCBPCT             60.00   60.00  0.31    -1.90 0.12
## COMREGVEH*          -Inf    -Inf    NA       NA   NA
## YDGENDER*           -Inf    -Inf    NA       NA   NA
## ATFAULT5YRS*        -Inf    -Inf    NA       NA   NA
## SHAPE*            999.00    0.00   NaN      NaN 0.00
## VEHAGE             20.00   19.00  0.26    -0.89 0.02
## INSURER1_PREMIUM  386.36  189.29  1.32     0.11 0.27
## INSURER2_PREMIUM  386.58  193.66  0.37    -1.22 0.27
## INSURER3_PREMIUM  388.20  186.20  0.19    -0.90 0.21
## INSURER4_PREMIUM  385.34  191.01  1.97     2.43 0.22
## COMMENCEDATE        -Inf    -Inf    NA       NA   NA
Hmisc::describe(data)
## data 
## 
##  26  Variables      59504  Observations
## ---------------------------------------------------------------------------
## YEARLYKM 
##        n  missing distinct     Info     Mean      Gmd 
##    59504        0        5    0.487    14.79    3.822 
##                                         
## Value          5    10    15    20    99
## Frequency   3614  6404 47576  1324   586
## Proportion 0.061 0.108 0.800 0.022 0.010
## ---------------------------------------------------------------------------
## LICCANCNT 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##    59504        0        2    0.013      256 0.004302 0.008568 
## 
## ---------------------------------------------------------------------------
## OWNERAGE 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0       82        1    49.27    16.01       28       31 
##      .25      .50      .75      .90      .95 
##       38       49       59       69       73 
## 
## lowest : 17 18 19 20 21, highest: 94 95 96 98 99
## ---------------------------------------------------------------------------
## MVINSURER 
##        n  missing distinct 
##    59504        0        7 
##                                                     
## Value       AAMI  ALLI   GIO  NONE  NRMA   OTH   QBE
## Frequency  17496  3538 11130  2078  5412 18796  1054
## Proportion 0.294 0.059 0.187 0.035 0.091 0.316 0.018
## ---------------------------------------------------------------------------
## REGUSE 
##        n  missing distinct 
##    59504        0        2 
##                       
## Value       BUSG  PRIV
## Frequency   4586 54918
## Proportion 0.077 0.923
## ---------------------------------------------------------------------------
## MAKE 
##        n  missing distinct 
##    59504        0       58 
## 
## lowest : ABARTH       ALFA ROMEO   ASTON MARTIN AUDI         BENTLEY     
## highest: SUZUKI       TESLA        TOYOTA       VOLKSWAGEN   VOLVO       
## ---------------------------------------------------------------------------
## MVINSYEARS 
##        n  missing distinct     Info     Mean      Gmd 
##    59504        0        5    0.923    1.919    1.869 
##                                         
## Value          0     1     2     3     4
## Frequency  19668  9798  5234  5304 19500
## Proportion 0.331 0.165 0.088 0.089 0.328
## ---------------------------------------------------------------------------
## MANUFYEAR 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0       20    0.996     2009    5.467     2001     2003 
##      .25      .50      .75      .90      .95 
##     2006     2010     2013     2015     2016 
##                                                                       
## Value       1999  2000  2001  2002  2003  2004  2005  2006  2007  2008
## Frequency    988  1340  1494  1730  2530  2896  3050  3266  4032  4002
## Proportion 0.017 0.023 0.025 0.029 0.043 0.049 0.051 0.055 0.068 0.067
##                                                                       
## Value       2009  2010  2011  2012  2013  2014  2015  2016  2017  2018
## Frequency   3528  4352  3702  3976  4056  4360  4302  3622  1900   378
## Proportion 0.059 0.073 0.062 0.067 0.068 0.073 0.072 0.061 0.032 0.006
## ---------------------------------------------------------------------------
## YDAGE 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0       63        1    47.37    15.58       26       30 
##      .25      .50      .75      .90      .95 
##       37       46       58       67       71 
## 
## lowest : 17 18 19 20 21, highest: 75 76 77 78 79
## ---------------------------------------------------------------------------
## MODEL 
##        n  missing distinct 
##    59504        0      631 
## 
## lowest : 1 M135I 116I    118D    118I    120D   
## highest: YRV     Z3      Z4      ZAFIRA  ZS     
## ---------------------------------------------------------------------------
## ATFAULTACCDS 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##    59504        0        2    0.059     1200  0.02017  0.03952 
## 
## ---------------------------------------------------------------------------
## NRMAASST 
##        n  missing distinct     Info     Mean      Gmd 
##    59504        0        7    0.942    1.971    3.014 
##                                                     
## Value         -1     1     2     3     5    10    25
## Frequency  19668  9798  5234  9266 14110  1156   272
## Proportion 0.331 0.165 0.088 0.156 0.237 0.019 0.005
## ---------------------------------------------------------------------------
## NOYRLICOBT 
##        n  missing distinct     Info     Mean      Gmd 
##    59504        0        7    0.381    5.288    1.251 
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency   5446   522   706   738   638   774 50680
## Proportion 0.092 0.009 0.012 0.012 0.011 0.013 0.852
## ---------------------------------------------------------------------------
## MVINSTYPE 
##        n  missing distinct 
##    59504        0        3 
##                             
## Value          C     N     T
## Frequency  55486  2098  1920
## Proportion 0.932 0.035 0.032
## ---------------------------------------------------------------------------
## DEMERITPTS 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0       15    0.256   0.3671   0.6902        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        3 
##                                                                       
## Value          0     1     2     3     4     5     6     7     8     9
## Frequency  53928   804  1022  1586   626   262   502   204   144    96
## Proportion 0.906 0.014 0.017 0.027 0.011 0.004 0.008 0.003 0.002 0.002
##                                         
## Value         10    11    12    13    14
## Frequency     80    42    56    94    58
## Proportion 0.001 0.001 0.001 0.002 0.001
## ---------------------------------------------------------------------------
## NCBPCT 
##        n  missing distinct     Info     Mean      Gmd 
##    59504        0        2    0.732     25.4     29.3 
##                       
## Value          0    60
## Frequency  34310 25194
## Proportion 0.577 0.423
## ---------------------------------------------------------------------------
## COMREGVEH 
##        n  missing distinct 
##    59504        0        2 
##                       
## Value          N     Y
## Frequency  55964  3540
## Proportion 0.941 0.059
## ---------------------------------------------------------------------------
## YDGENDER 
##        n  missing distinct 
##    59504        0        3 
##                             
## Value          F     M     U
## Frequency  32550 26364   590
## Proportion 0.547 0.443 0.010
## ---------------------------------------------------------------------------
## ATFAULT5YRS 
##        n  missing distinct 
##    59504        0        2 
##                       
## Value          N     Y
## Frequency  58304  1200
## Proportion  0.98  0.02
## ---------------------------------------------------------------------------
## SHAPE 
##        n  missing distinct 
##    59504        0        4 
##                                   
## Value        999   PVF   SED   WAG
## Frequency   1816  6120 36886 14682
## Proportion 0.031 0.103 0.620 0.247
## ---------------------------------------------------------------------------
## VEHAGE 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0       20    0.996    9.634    5.467        3        4 
##      .25      .50      .75      .90      .95 
##        6        9       13       16       18 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency    378  1900  3622  4302  4360  4056  3976  3702  4352  3528
## Proportion 0.006 0.032 0.061 0.072 0.073 0.068 0.067 0.062 0.073 0.059
##                                                                       
## Value         11    12    13    14    15    16    17    18    19    20
## Frequency   4002  4032  3266  3050  2896  2530  1730  1494  1340   988
## Proportion 0.067 0.068 0.055 0.051 0.049 0.043 0.029 0.025 0.023 0.017
## ---------------------------------------------------------------------------
## INSURER1_PREMIUM 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0       41    0.965    250.8    65.47    197.1    197.1 
##      .25      .50      .75      .90      .95 
##    209.3    219.0    247.9    386.4    386.4 
## 
## lowest : 197.0744 201.9573 203.1732 204.3954 209.2651
## highest: 307.4777 327.9402 339.6837 350.9791 386.3640
## ---------------------------------------------------------------------------
## INSURER2_PREMIUM 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0      585    0.996    288.4    74.06    197.6    208.0 
##      .25      .50      .75      .90      .95 
##    236.1    273.3    365.6    386.6    386.6 
## 
## lowest : 192.9243 193.5552 194.1860 194.8166 195.4470
## highest: 383.3821 383.5928 384.5803 385.1793 386.5811
## ---------------------------------------------------------------------------
## INSURER3_PREMIUM 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0     6501    0.998    301.2    59.95    218.9    233.8 
##      .25      .50      .75      .90      .95 
##    262.0    295.3    338.9    388.2    388.2 
## 
## lowest : 201.9988 202.0748 202.0955 202.1231 202.1715
## highest: 387.9365 387.9826 388.1010 388.1273 388.1997
## ---------------------------------------------------------------------------
## INSURER4_PREMIUM 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    59504        0       22    0.973    233.8     45.8    198.8    198.8 
##      .25      .50      .75      .90      .95 
##    206.3    211.1    225.5    355.3    379.4 
## 
## lowest : 194.3246 198.8252 206.3416 211.1321 212.3302
## highest: 355.2922 357.3182 373.8848 379.4100 385.3373
## ---------------------------------------------------------------------------
## COMMENCEDATE 
##        n  missing distinct 
##    59504        0        2 
##                                 
## Value      2019-07-01 2019-09-01
## Frequency       29752      29752
## Proportion        0.5        0.5
## ---------------------------------------------------------------------------

The dataset has 26 variables, and 59504 cases. The summary reveals some multinomial categorical variables and some quantitative variables are heavily dominated in one level or one value. This occurs in YEARLYKM (80% in 15000km), NOYRLICOBT (85% in 6 years), MVINSTYPE (93% in comprehensive), DEMERITPTS (90% in zero points). Some binomial variables are heavily one sided such as LICCANCNT, ATFAULTACCDS (sparse many zeros), REGUSE (92% PRIV), COMREGVEH (94% No), ATFAULT5YRS (98% No).

Commencement date evenly splits the data 50% 50%. And YDGENDER has a relatively equal split between F/M (55%/45%) with only 1% Unknown and likewise NCBPCT has a similarly even split of 0/60 (58%/42%).

All Insurer Premium distributions have a Mean slightly larger than the median and the mean ranking from smallest to largest is determined to be Insurer4, Insurer1, Insurer2, Insurer3 with the first 3 quartiles for Insurer1 and Insurer3 being much smaller range than that of Insurer2 and Insurer3. ###1.2.1 Missing data

any(is.na(data))
## [1] FALSE

The dataset appears to have no missing cases, yet some issues have been identified through initial raw data review and the results from the above describe statistics. The NRMA assist variable contains a value (-1) which are unable to be interpreted unless taken to be not held. MVINSTYPE has 3 categories assumed to be Comprehensive (C), Third-party (T) and None (N), yet not all rows containing N also signify Zero on MVINSYRS or None on MVINSURER variables. Shape contains the value ‘999’ which may be interpreted as Other. Not all COMREGVEH flagged as ‘Y’ also coincide with the flag ‘BUSG’ under REGUSE variable.

1.3 Splitting data

The data has been initially explored for type and values. Now before doing an indepth interrogation into the data distributions and relationships we will split the data into our test and train sets.

set.seed(3456)
split <- createDataPartition(data$MVINSURER, p=0.7, list=FALSE)
train <- data[split,]
test <- data[-split,]

train.num <- train %>%
  keep(is.numeric) 
train.cat <- train %>%
  discard(is.numeric)

1.4 Visualizations

1.4.1 Boxplot

train.bpV <- train %>%
  select(c(3, 9, 13, 15, 21)) %>%
  gather()

summary.boxplotV <- ggplot(train.bpV, aes(x = key, y = value)) +
  labs(x = "variable", title = "Insurance Premium Variables Boxplots") +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1)

summary.boxplotV


train.bpC <- train %>%
  select(c(1, 2, 7, 11, 12, 16)) %>%
  gather()

summary.boxplotC <- ggplot(train.bpC, aes(x = key, y = value)) +
  labs(x = "variable", title = "Insurance Premium Variables Boxplots") +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1)

summary.boxplotC


train.bpI <- train %>%
  select(c(22, 23, 24, 25)) %>%
  gather()

summary.boxplotI <- ggplot(train.bpI, aes(x = key, y = value)) +
  labs(x = "variable", title = "Insurers Premium Boxplots") +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1)

summary.boxplotI

The boxplots clearly reinforce what we have been shown in the statistics summary above, the demerit points and number of years license obtained are heavily weighted in one value, while age variables are more normally distributed. We could consider changing any one sided variables to be binomial variables as has specified value or does not.

The boxplots for the insurer insurance premiums clearly group the 4 insurers into two main categories, Insurer1 and Insurer4 are much more similar distribution and pricing wise while Insurer2 and Insurer3 are likewise more similar to each other. This immediately signifies room for market pricing improvements for Suncorp without great effect in changing the market composition.

1.4.2 Histogram

train.hV <- train %>%
  select(c(3, 9, 13, 15, 21)) %>%
  gather()

train.histV <- ggplot(data = train.hV, mapping = aes(x = value)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~key, scales = 'free_x')

train.histV


train.hC <- train %>%
  select(c(1, 2, 7, 11, 12, 16)) %>%
  gather()

train.histC <- ggplot(data = train.hC, mapping = aes(x = value)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~key, scales = 'free_x')

train.histC


train.hI <- train %>%
  select(c(22, 23, 24, 25)) %>%
  gather()

train.histI <- ggplot(data = train.hI, mapping = aes(x = value)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~key, scales = 'free_x')

train.histI

The histograms provide a good representation of how many categories are heavily skewed distributions. The age variables are seen to be normal as expected and the Insurer premiums now visualised reveal insights on current pricing strategy as Suncorp appears to be heavily skewed to low price premiums indicating a customer segmentation of low risk or low value insurance coverage in this product. Indicating this may be a low end product for Suncorp portfolio. This also indicates the other insurers which are heavily skewed to the high end and may correlate to classifying them as higher risk or pricing a different segment of policies as higher value premiums. This could be explored further with clustering or trees to determine which policy risks are considered as high price in Insurer3 and Insurer2 where as are priced in the lower price category for Suncorp and Insurer4. This will be the segment that can most easily afford a price increase without loss to demand as the gap is widest for this segmentation.

1.4.3 Bar Chart

train.b <- train %>%
  select(c(4:5, 14,17:20)) %>%
  gather()

train.bar <- ggplot(data = train.b, mapping = aes(x = value)) + 
  geom_bar() + 
  facet_wrap(~key, scales = 'free_x')

train.bar


train.m <- train %>%
  select(c(6)) %>%
  gather()

train.barm <- ggplot(data = train.m, mapping = aes(x = value)) + 
  geom_bar() + 
  facet_wrap(~key, scales = 'free_x')

train.barm

The bar chart visualisation provides insight into the categorical nominal distributions. As described above we can see some variables are heavily distorted in proportionality while others are more evenly proportioned. The variables with less even distribution will contain more information and provided more predictive power.

1.4.4 Correlation

The Correlation heatmap is shown below for each of the independent and dependent numb,eric variables in the data.


train.c <- mutate_all(train.num, funs(as.numeric))
corrplot(cor(train.c), method = "color", type = "lower")

The correlation heatmap shows the variables correlations with each of the insurance premiums, some are similar dependencies while others are quite different hinting a difference in the way each insurer would propose their pricing methodologies. For example, youngest driver age appears to be correlated similarly with each premium while demerit points or number of years license obtained has much more variance. From this chart we can also see that the correlation between vehicle age and manufacture year are almost 1 and therefore redundant variables so only one will need to remain in the model. Also noticed is YDAGE and OWNEAGE are quite similar along with NRMAASSIST and MVINSYRS and will will share much information.


corp <- apply(train.c[, -7], 2, function(x) cor.test(x, y=train.c$INSURER1_PREMIUM)$p.value)
cortable <- cor(train.c[, -7], train.c$INSURER1_PREMIUM)
kable(cbind(as.character(corp), cortable), col.names = c("P-value", "Correlation with dependent"))

	P-value	Correlation with dependent
YEARLYKM	0.000494384142394882	0.0170684247331176
LICCANCNT	4.01170752009266e-177	0.138365694262635
OWNERAGE	1.40945800632286e-272	-0.171556991460753
MVINSYEARS	2.65330501218757e-122	-0.114852181381481
MANUFYEAR	0	-0.32569740214304
YDAGE	0	-0.295169447462153
NRMAASST	2.14908979503085e-114	-0.111037982099119
NOYRLICOBT	0	-0.721622622947211
DEMERITPTS	0.00416586503604279	0.0140386762140103
NCBPCT	2.44582103833928e-88	0.0973978692306985
VEHAGE	0	0.32569740214304
INSURER1_PREMIUM	0	1
INSURER2_PREMIUM	0	0.651650620864274
INSURER3_PREMIUM	0	0.572303365172195
INSURER4_PREMIUM	0	0.330863594257667

In the case above the dependent variable was chosen to be the Insurer1_Premium. The variables with the biggest correlations are NOYRLICOBT followed by VEHAGE and YDAGE. We can also see that despite Insurer1 sharing a similar pricing range and distribution to Insurer4 it is more strongly correlated with premiums from Insurer2 and Insurer3.

2. Data Preparation

2.1 Missing Data

We have noted that there is no missing data but there is data that doesn’t exactly align. As not knowing the meaning of -1 for NRMA assist but seeing it is actually quite a common value in the data we will not remove this. The business ownership and regular use mismatch we will leave as there could conceivably be a code where these might not align but we will note this for further investigation. For the data that mismatches for other insurer information, we will remove any that do not make sense. If MVINS = “None”" and MVINSTYPE does not equal “N” and MVINSYRS does not equal “0” (or vise versa) we will remove the data from the set.

train <- train[!((train$MVINSURER=="NONE" & (train$MVINSYEARS!=0 || train$MVINSTYPE!="N")) || (train$MVINSTYPE=="N" & (train$MVINSYEARS!=0 || train$MVINSURER!="NONE"))),]
test <- test[!((test$MVINSURER=="NONE" & (test$MVINSYEARS!=0 || test$MVINSTYPE!="N")) || (test$MVINSTYPE=="N" & (test$MVINSYEARS!=0 || test$MVINSURER!="NONE"))),]

2.2 Normality of Predictor Variables

The distribution plots above all the age variables tend to be nearly normal while YEARLYKM and DEMERITPTS are left skewed. Linear regression does not make any assumptions on the normality of any variables, so we will not transform the skewed variables for now. At a more indepth look this may need to be considered for further improvement or reclassification of the variables into categorical is another possible revision.

2.3 Variable Transformation

We will transform three variables that are highly dominated by one value into binary variables, YEARLYKM: LESS15, NOYRSLICOBT: LESS6, DEMERITPTS: ZERO; with any case having a value inside the described range as 1, and all others as 0. We will also convert YDGENDER into an additional variable YDGenderM as a binomial with Unknown assumed in Male and Male taking the 1 and Female taking the value 2. All these additional variables will be used within the feature selection and model training to determine which variables may be better predictors and eliminating the others.

We will also transform any categorical binomials into 1/0 values and any ordinal categorical into numerical ordered values. COMMENCEDATE1 (time1 = 1, time2 = 0), ATFAULT5YRS (FAULT = 1, NOTFAULT = 0),REGUSEP(PRIV = 1, BUSG = 0).

train$YEARLYKMLESS15 <- ifelse(train$YEARLYKM < 15 , 1 , 0)
train$NOYRSLICOBTLESS6 <- ifelse(train$NOYRLICOBT < 6 , 1 , 0)
train$DEMERITPTSZERO <- ifelse(train$DEMERITPTS ==0 , 1 , 0)
train$YDGENDERM <- ifelse(train$YDGENDER =="U" || train$YDGENDER =="M" , 1 , 0)
train$COMMENCEDATE1 <- ifelse(train$COMMENCEDATE =="2019-07-01" , 1 , 0)
train$ATFAULT5YRSY <- ifelse(train$ATFAULT5YRS =="Y" , 1 , 0)
train$REGUSEP <- ifelse(train$REGUSE =="PRIV" , 1 , 0)

test$YEARLYKMLESS15 <- ifelse(test$YEARLYKM < 15 , 1 , 0)
test$NOYRSLICOBTLESS6 <- ifelse(test$NOYRLICOBT < 6 , 1 , 0)
test$DEMERITPTSZERO <- ifelse(test$DEMERITPTS ==0 , 1 , 0)
test$YDGENDERM <- ifelse(test$YDGENDER =="U" || test$YDGENDER =="M" , 1 , 0)
test$COMMENCEDATE1 <- ifelse(test$COMMENCEDATE =="2019-07-01" , 1 , 0)
test$ATFAULT5YRSY <- ifelse(test$ATFAULT5YRS =="Y" , 1 , 0)
test$REGUSEP <- ifelse(test$REGUSE =="PRIV" , 1 , 0)

2.5 Outliers

Although some variables have highly clustered values, all categories have a relatively set amount of values or levels and there is unlikely any reason to remove any of these as outliers.

3. Build Models

3.1 Model 1

For the first model, we will not eliminate any variables but will observe the interactions of all in order to serve as a baseline for comparison with subsequent models. We will utilise this output to eliminate one of our duplicate variable sets (with the newly created variables above) and to remove any variables that are not significant enough. We will however remove model because some models may appear in the test data or future data that may not be in the training sets and may not be predicted by the model.

m1 <- lm(formula = INSURER1_PREMIUM ~ . - MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM,
         data = train)

options(max.print = 4000)
summary(m1)
## 
## Call:
## lm(formula = INSURER1_PREMIUM ~ . - MODEL - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM - INSURER4_PREMIUM, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -283.453  -16.655   -4.786   11.334  162.461 
## 
## Coefficients: (6 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.724e+03  1.235e+02  38.265  < 2e-16 ***
## YEARLYKM               1.622e-01  1.945e-02   8.342  < 2e-16 ***
## LICCANCNT              3.106e+01  2.463e+00  12.609  < 2e-16 ***
## OWNERAGE               2.176e-01  1.929e-02  11.282  < 2e-16 ***
## MVINSURERALLI         -3.476e+00  7.275e-01  -4.777 1.78e-06 ***
## MVINSURERGIO           6.463e-01  4.999e-01   1.293 0.196022    
## MVINSURERNONE         -5.297e+01  8.451e+00  -6.267 3.71e-10 ***
## MVINSURERNRMA         -3.307e+00  6.207e-01  -5.329 9.95e-08 ***
## MVINSUREROTH          -2.185e+00  4.316e-01  -5.062 4.17e-07 ***
## MVINSURERQBE          -4.301e+00  1.239e+00  -3.472 0.000517 ***
## REGUSEPRIV            -7.473e-01  1.056e+00  -0.707 0.479268    
## MAKEALFA ROMEO        -8.743e+00  1.209e+01  -0.723 0.469545    
## MAKEASTON MARTIN      -2.327e+01  1.575e+01  -1.477 0.139657    
## MAKEAUDI              -1.409e+01  1.150e+01  -1.225 0.220694    
## MAKEBENTLEY           -3.470e+00  3.438e+01  -0.101 0.919589    
## MAKEBMW               -1.344e+01  1.149e+01  -1.169 0.242359    
## MAKECHERY             -9.684e+00  1.750e+01  -0.553 0.580098    
## MAKECHRYSLER          -1.113e+01  1.220e+01  -0.912 0.361771    
## MAKECITROEN           -1.701e+01  1.193e+01  -1.425 0.154129    
## MAKEDAEWOO            -2.303e+01  1.507e+01  -1.529 0.126375    
## MAKEDAIHATSU          -2.737e+01  1.331e+01  -2.056 0.039777 *  
## MAKEDODGE              2.107e+00  1.240e+01   0.170 0.865089    
## MAKEFERRARI           -2.922e+01  1.850e+01  -1.579 0.114256    
## MAKEFIAT              -1.320e+01  1.202e+01  -1.098 0.272024    
## MAKEFORD              -1.194e+01  1.149e+01  -1.040 0.298560    
## MAKEGREAT WALL MOTORS  2.186e+01  1.311e+01   1.667 0.095576 .  
## MAKEHAVAL             -1.612e+01  2.194e+01  -0.735 0.462588    
## MAKEHOLDEN            -1.904e+01  1.148e+01  -1.659 0.097099 .  
## MAKEHONDA             -1.035e+01  1.148e+01  -0.902 0.366963    
## MAKEHSV                9.515e+00  1.236e+01   0.770 0.441373    
## MAKEHUMMER            -1.671e+01  3.437e+01  -0.486 0.626808    
## MAKEHYUNDAI           -1.521e+01  1.148e+01  -1.325 0.185136    
## MAKEINFINITI          -1.600e+01  1.620e+01  -0.988 0.323402    
## MAKEISUZU             -1.581e+01  1.311e+01  -1.206 0.227786    
## MAKEJAGUAR            -1.453e+01  1.213e+01  -1.198 0.231021    
## MAKEJEEP              -1.512e+01  1.156e+01  -1.308 0.190762    
## MAKEKIA               -1.310e+01  1.152e+01  -1.138 0.255333    
## MAKELAND ROVER        -1.447e+01  1.169e+01  -1.239 0.215512    
## MAKELDV               -2.310e+01  1.679e+01  -1.376 0.168831    
## MAKELEXUS             -5.352e+00  1.151e+01  -0.465 0.641889    
## MAKELOTUS             -2.157e+01  1.680e+01  -1.284 0.199186    
## MAKEMASERATI           1.129e+01  1.848e+01   0.611 0.541230    
## MAKEMAZDA              2.546e+00  1.147e+01   0.222 0.824385    
## MAKEMCLAREN           -2.076e+01  2.563e+01  -0.810 0.417916    
## MAKEMERCEDES-AMG       6.526e+00  1.261e+01   0.518 0.604807    
## MAKEMERCEDES-BENZ      9.081e+00  1.264e+01   0.718 0.472466    
## MAKEMERCEDES AMG      -1.217e+01  1.750e+01  -0.695 0.486886    
## MAKEMERCEDES BENZ     -3.502e+00  1.150e+01  -0.305 0.760677    
## MAKEMG                 2.812e+00  1.457e+01   0.193 0.846936    
## MAKEMINI              -1.985e+01  1.181e+01  -1.681 0.092751 .  
## MAKEMITSUBISHI        -1.769e+01  1.149e+01  -1.539 0.123784    
## MAKENISSAN            -1.314e+01  1.149e+01  -1.144 0.252462    
## MAKEOPEL              -2.296e+01  1.538e+01  -1.493 0.135425    
## MAKEPEUGEOT           -1.252e+01  1.166e+01  -1.073 0.283311    
## MAKEPORSCHE           -2.300e+01  1.181e+01  -1.948 0.051419 .  
## MAKEPROTON            -2.231e+01  1.389e+01  -1.606 0.108376    
## MAKERANGE ROVER       -1.136e+00  1.183e+01  -0.096 0.923530    
## MAKERENAULT           -1.552e+01  1.175e+01  -1.320 0.186690    
## MAKESAAB               1.647e+01  1.295e+01   1.271 0.203564    
## MAKESKODA             -2.396e+01  1.186e+01  -2.020 0.043417 *  
## MAKESMART             -2.175e+01  1.678e+01  -1.296 0.194923    
## MAKESSANGYONG         -1.722e+01  1.367e+01  -1.260 0.207762    
## MAKESUBARU            -1.656e+01  1.148e+01  -1.443 0.149168    
## MAKESUZUKI            -1.190e+01  1.153e+01  -1.032 0.301869    
## MAKETESLA             -1.001e+01  1.458e+01  -0.687 0.492058    
## MAKETOYOTA            -1.326e+01  1.147e+01  -1.157 0.247350    
## MAKEVOLKSWAGEN        -1.608e+01  1.148e+01  -1.400 0.161475    
## MAKEVOLVO             -1.981e+01  1.160e+01  -1.708 0.087659 .  
## MVINSYEARS            -4.466e-01  1.806e-01  -2.474 0.013383 *  
## MANUFYEAR             -2.158e+00  4.003e-02 -53.902  < 2e-16 ***
## YDAGE                 -1.087e+00  2.094e-02 -51.920  < 2e-16 ***
## ATFAULTACCDS           1.180e+02  1.150e+00 102.576  < 2e-16 ***
## NRMAASST               3.065e-01  1.001e-01   3.060 0.002214 ** 
## NOYRLICOBT            -2.399e+01  2.502e-01 -95.879  < 2e-16 ***
## MVINSTYPEN             1.573e+02  8.394e+00  18.741  < 2e-16 ***
## MVINSTYPET             1.092e+01  9.522e-01  11.467  < 2e-16 ***
## DEMERITPTS             4.844e-01  1.891e-01   2.561 0.010435 *  
## NCBPCT                 2.823e-02  6.870e-03   4.108 3.99e-05 ***
## COMREGVEHY            -1.040e-02  1.192e+00  -0.009 0.993035    
## YDGENDERM              5.337e-01  3.256e-01   1.639 0.101249    
## YDGENDERU             -9.891e-02  1.625e+00  -0.061 0.951454    
## ATFAULT5YRSY                  NA         NA      NA       NA    
## SHAPEPVF               2.141e+00  1.102e+00   1.943 0.052013 .  
## SHAPESED              -7.219e-01  9.912e-01  -0.728 0.466446    
## SHAPEWAG              -7.928e-01  1.024e+00  -0.774 0.438841    
## VEHAGE                        NA         NA      NA       NA    
## COMMENCEDATE           1.797e-03  5.122e-03   0.351 0.725623    
## YEARLYKMLESS15         5.164e-01  5.257e-01   0.982 0.325944    
## NOYRSLICOBTLESS6      -4.865e+00  1.287e+00  -3.779 0.000158 ***
## DEMERITPTSZERO         3.112e+00  9.440e-01   3.296 0.000980 ***
## YDGENDERM                     NA         NA      NA       NA    
## COMMENCEDATE1                 NA         NA      NA       NA    
## ATFAULT5YRSY                  NA         NA      NA       NA    
## REGUSEP                       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.39 on 41568 degrees of freedom
## Multiple R-squared:  0.7662, Adjusted R-squared:  0.7657 
## F-statistic:  1566 on 87 and 41568 DF,  p-value: < 2.2e-16

m2 <- lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -REGUSE -REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS,
         data = train)
options(max.print = 4000)
summary(m2)
## 
## Call:
## lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM - INSURER4_PREMIUM - REGUSE - REGUSEP - 
##     COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - 
##     SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -285.497  -17.753   -5.598   12.725  160.457 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.701e+03  7.691e+01  61.124  < 2e-16 ***
## YEARLYKM          1.554e-01  1.971e-02   7.887 3.16e-15 ***
## LICCANCNT         3.183e+01  2.461e+00  12.933  < 2e-16 ***
## OWNERAGE          2.108e-01  1.955e-02  10.783  < 2e-16 ***
## MVINSURERALLI    -3.945e+00  7.373e-01  -5.351 8.80e-08 ***
## MVINSURERGIO      6.841e-01  5.041e-01   1.357 0.174789    
## MVINSURERNONE    -5.488e+01  8.578e+00  -6.399 1.59e-10 ***
## MVINSURERNRMA    -3.596e+00  6.298e-01  -5.710 1.14e-08 ***
## MVINSUREROTH     -2.336e+00  4.368e-01  -5.349 8.90e-08 ***
## MVINSURERQBE     -3.587e+00  1.256e+00  -2.856 0.004298 ** 
## MVINSYEARS       -3.570e-01  1.830e-01  -1.950 0.051129 .  
## MANUFYEAR        -2.136e+00  3.823e-02 -55.860  < 2e-16 ***
## YDAGE            -1.075e+00  2.124e-02 -50.629  < 2e-16 ***
## ATFAULTACCDS      1.184e+02  1.168e+00 101.371  < 2e-16 ***
## NRMAASST          2.740e-01  1.015e-01   2.698 0.006974 ** 
## NOYRLICOBT       -2.397e+01  2.535e-01 -94.575  < 2e-16 ***
## MVINSTYPEN        1.589e+02  8.517e+00  18.652  < 2e-16 ***
## MVINSTYPET        1.015e+01  9.653e-01  10.515  < 2e-16 ***
## NCBPCT            2.633e-02  6.927e-03   3.801 0.000144 ***
## YEARLYKMLESS15    4.231e-01  5.334e-01   0.793 0.427721    
## NOYRSLICOBTLESS6 -4.618e+00  1.304e+00  -3.543 0.000396 ***
## DEMERITPTSZERO    9.225e-01  6.176e-01   1.494 0.135255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.95 on 41634 degrees of freedom
## Multiple R-squared:  0.7578, Adjusted R-squared:  0.7577 
## F-statistic:  6203 on 21 and 41634 DF,  p-value: < 2.2e-16

m3 <- lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -REGUSE -REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO,
         data = train)
options(max.print = 4000)
summary(m3)
## 
## Call:
## lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM - INSURER4_PREMIUM - REGUSE - REGUSEP - 
##     COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - 
##     SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS - 
##     YEARLYKMLESS15 - DEMERITPTSZERO, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -285.981  -17.746   -5.589   12.720  160.546 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.700e+03  7.682e+01  61.178  < 2e-16 ***
## YEARLYKM          1.494e-01  1.827e-02   8.176 3.03e-16 ***
## LICCANCNT         3.145e+01  2.440e+00  12.890  < 2e-16 ***
## OWNERAGE          2.111e-01  1.954e-02  10.802  < 2e-16 ***
## MVINSURERALLI    -3.977e+00  7.369e-01  -5.398 6.79e-08 ***
## MVINSURERGIO      6.858e-01  5.041e-01   1.360 0.173704    
## MVINSURERNONE    -5.489e+01  8.577e+00  -6.399 1.58e-10 ***
## MVINSURERNRMA    -3.594e+00  6.297e-01  -5.708 1.15e-08 ***
## MVINSUREROTH     -2.363e+00  4.357e-01  -5.424 5.87e-08 ***
## MVINSURERQBE     -3.610e+00  1.256e+00  -2.874 0.004053 ** 
## MVINSYEARS       -3.499e-01  1.821e-01  -1.922 0.054627 .  
## MANUFYEAR        -2.134e+00  3.818e-02 -55.899  < 2e-16 ***
## YDAGE            -1.075e+00  2.122e-02 -50.645  < 2e-16 ***
## ATFAULTACCDS      1.184e+02  1.168e+00 101.366  < 2e-16 ***
## NRMAASST          2.731e-01  1.015e-01   2.690 0.007159 ** 
## NOYRLICOBT       -2.397e+01  2.534e-01 -94.614  < 2e-16 ***
## MVINSTYPEN        1.589e+02  8.517e+00  18.656  < 2e-16 ***
## MVINSTYPET        1.015e+01  9.653e-01  10.512  < 2e-16 ***
## NCBPCT            2.659e-02  6.546e-03   4.062 4.87e-05 ***
## NOYRSLICOBTLESS6 -4.602e+00  1.303e+00  -3.531 0.000414 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.95 on 41636 degrees of freedom
## Multiple R-squared:  0.7578, Adjusted R-squared:  0.7577 
## F-statistic:  6855 on 19 and 41636 DF,  p-value: < 2.2e-16

m4 <- lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -REGUSE -REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO - MVINSURER,
         data = train)
options(max.print = 4000)
summary(m4)
## 
## Call:
## lm(formula = INSURER1_PREMIUM ~ . - MAKE - MODEL - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM - INSURER4_PREMIUM - REGUSE - REGUSEP - 
##     COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - 
##     SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS - 
##     YEARLYKMLESS15 - DEMERITPTSZERO - MVINSURER, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -287.028  -17.684   -5.676   12.747  159.901 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.648e+03  7.671e+01  60.591  < 2e-16 ***
## YEARLYKM          1.475e-01  1.830e-02   8.063 7.67e-16 ***
## LICCANCNT         3.123e+01  2.443e+00  12.783  < 2e-16 ***
## OWNERAGE          2.154e-01  1.953e-02  11.029  < 2e-16 ***
## MVINSYEARS       -5.709e-01  1.789e-01  -3.190 0.001422 ** 
## MANUFYEAR        -2.110e+00  3.813e-02 -55.325  < 2e-16 ***
## YDAGE            -1.069e+00  2.120e-02 -50.411  < 2e-16 ***
## ATFAULTACCDS      1.193e+02  1.165e+00 102.397  < 2e-16 ***
## NRMAASST          3.797e-01  1.004e-01   3.783 0.000155 ***
## NOYRLICOBT       -2.396e+01  2.513e-01 -95.343  < 2e-16 ***
## MVINSTYPEN        1.063e+02  9.529e-01 111.558  < 2e-16 ***
## MVINSTYPET        1.074e+01  9.639e-01  11.138  < 2e-16 ***
## NCBPCT            4.327e-02  6.256e-03   6.917 4.69e-12 ***
## NOYRSLICOBTLESS6 -4.269e+00  1.303e+00  -3.277 0.001049 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.99 on 41642 degrees of freedom
## Multiple R-squared:  0.757,  Adjusted R-squared:  0.757 
## F-statistic:  9981 on 13 and 41642 DF,  p-value: < 2.2e-16

3.2 Model Interpretation

Three models were run pruning more variables as each model was created. The first noticeable and surprising variable to be found insignificant toward predicting the target variable was Model, next to no models gave any information on the premium cost. A few other variables were removed in the first round of pruning, but mainly others were more due to duplication of variables with shared information, such as YDAGE, ATFAULT5YRS, VEHAGE.. COMMENCEDATE had very little significance implying that pricing strategy was fairly consistent in the two price periods. Surprising variables to be excluded were the Business vehicle flags and the shape. Adjusted R-Squared = 0.7696.

The second model ran after removing the insignificant predictors. In this second run only two additional variables were found to be poor predictions, YEARLYKM and DEMERITPTSZERO. There was a slight reduction in adjusted R-squared which we will test later to determine if significant. Adjusted R-Squared = 0.7614.

The model was run for a third time with the two variables removed and the adjusted R-Squared was unchanged. Adjusted R-Squared = 0.7614.

Finally a fourth round of the model was run removing MVINSURER as not all categories were significant. There was an expected loss to information as adjusted R-Squared reduced but it may be worth the simplicity. Adjusted R-Squared = 0.7607.

3.3 Compare models

anova(m2, m1)
## Analysis of Variance Table
## 
## Model 1: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL + 
##     ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + 
##     NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM + 
##     COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO + 
##     YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE - 
##     MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM - 
##     REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - 
##     DEMERITPTS
## Model 2: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL + 
##     ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + 
##     NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM + 
##     COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO + 
##     YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MODEL - 
##     INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41634 45196719                                  
## 2  41568 43620355 66   1576364 22.761 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m3, m2)
## Analysis of Variance Table
## 
## Model 1: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL + 
##     ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + 
##     NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM + 
##     COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO + 
##     YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE - 
##     MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM - 
##     REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - 
##     DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO
## Model 2: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL + 
##     ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + 
##     NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM + 
##     COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO + 
##     YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE - 
##     MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM - 
##     REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - 
##     DEMERITPTS
##   Res.Df      RSS Df Sum of Sq     F Pr(>F)
## 1  41636 45199262                          
## 2  41634 45196719  2    2542.5 1.171 0.3101
anova(m4, m3)
## Analysis of Variance Table
## 
## Model 1: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL + 
##     ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + 
##     NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM + 
##     COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO + 
##     YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE - 
##     MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM - 
##     REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - 
##     DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO - MVINSURER
## Model 2: INSURER1_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + MODEL + 
##     ATFAULTACCDS + NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + 
##     NCBPCT + COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + INSURER4_PREMIUM + 
##     COMMENCEDATE + YEARLYKMLESS15 + NOYRSLICOBTLESS6 + DEMERITPTSZERO + 
##     YDGENDERM + COMMENCEDATE1 + ATFAULT5YRSY + REGUSEP) - MAKE - 
##     MODEL - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM - 
##     REGUSE - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - 
##     DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41642 45334345                                  
## 2  41636 45199262  6    135083 20.739 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA test was run on all iterations of the model and in each case the subsequent version of the model was outperformed the predecessor. For this reasons the simplest model, m4, has a slightly reduced R-squired value but has been chosen as the best model for the training set data.

4. Model Validation

4.1 Prediction

#remove model variable for lm model to run 
trainm <- train[,c(1:9,11:33)]
testm <- test[,c(1:9,11:33)]
m4a <- lm(formula = INSURER1_PREMIUM ~ . - MAKE - INSURER2_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM -REGUSE -REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - SHAPE - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - DEMERITPTSZERO - MVINSURER,
         data = trainm)

predicted.premium4 <- predict(object = m4a, newdata = testm, type = "response")


rp1 <- ggplot(m4a, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Fitted")

rp3 <- ggplot(m4a) +
  stat_qq(aes(sample = .stdresid)) +
  geom_abline()
 

rp7 <- ggplot(m4a, aes(.hat, .cooksd)) +
  geom_vline(xintercept = 0, colour = NA) +
  geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
  geom_smooth(se = FALSE) +
  geom_point() +
  labs(title = "Cook's distance vs Leverage")

rp8 <- ggplot(m4a, aes(.resid)) +
  geom_histogram(bins = 7, color="darkblue", fill="steelblue")
 

grid.arrange(rp1, rp3, rp7, rp8, ncol = 2)

The residual graphs show an acceptable fit for the selected model. The residuals appear to be close enough to normally distributed coupled with the fairly good R-squared value we can determine that the model is fairly accurate with low spread around the mean. We will use it to make predictions and compare results.

4.2 Prediction Results

results.df <- data.frame(cbind(actuals = testm$INSURER1_PREMIUM, predicted = predicted.premium4))

results.df <- results.df %>%
  mutate(error = results.df$actuals - results.df$predicted) %>%
  round(., 2)
results.df <- results.df %>%
  mutate(percerror = paste0(round(results.df$error/results.df$actuals*100,2),"%"))

kable(head(results.df))

actuals	predicted	error	percerror
197.07	217.83	-20.75	-10.53%
231.08	244.56	-13.48	-5.83%
240.72	217.54	23.18	9.63%
247.93	238.24	9.70	3.91%
209.27	222.31	-13.04	-6.23%
209.27	220.15	-10.88	-5.2%


sprintf("The mean percent error is: %s%%", round(mean(results.df$error/results.df$actuals*100), 2))
## [1] "The mean percent error is: -1.48%"

Plot actuals verse predicted to see the variance of the predicted from the actual premiums.

 rp <- ggplot(results.df, aes(results.df$predicted, results.df$actuals)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(title = "Actuals vs Fitted")

grid.arrange(rp , ncol = 1)

4.3 Remarks on Suncorp Model Results

Although the model results appear fairly acceptable when reviewing the mean square error and the adjusted R-Squared, when we review the actuals vs fitted, we can see that in actuality, the INSURER1_PREMIUMS are quite banded (41 price points) and therefore categorical in nature. The linear regression model predicts a continuous output. This highlights an additional method which may be acceptable would be to predict the premium bands using regression trees. This exercise will be left for later. The final model contained 12 variables. The next steps will be to reproduce the same steps for the other insurers to determine and main differences in the pricing models. As insurer 2 and 3 had many more price points we expect a better fit where as insurer4 only had 22 we expect the same banding affect to occur.

5. Insurer Comparison and premium back engineering

5.1 INSURER2_PREMIUMS

m1.2 <- lm(formula = INSURER2_PREMIUM ~ .  - INSURER1_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM,
         data = trainm)

options(max.print = 4000)
summary(m1.2)
## 
## Call:
## lm(formula = INSURER2_PREMIUM ~ . - INSURER1_PREMIUM - INSURER3_PREMIUM - 
##     INSURER4_PREMIUM, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -238.350  -16.161   -2.122   14.112  209.107 
## 
## Coefficients: (6 not defined because of singularities)
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            8.174e+03  1.081e+02   75.634  < 2e-16 ***
## YEARLYKM               1.241e-01  1.702e-02    7.287 3.22e-13 ***
## LICCANCNT             -4.971e+01  2.156e+00  -23.052  < 2e-16 ***
## OWNERAGE              -7.777e-02  1.689e-02   -4.605 4.13e-06 ***
## MVINSURERALLI         -1.877e+00  6.368e-01   -2.947 0.003212 ** 
## MVINSURERGIO          -3.235e+00  4.376e-01   -7.392 1.47e-13 ***
## MVINSURERNONE          3.000e+01  7.398e+00    4.055 5.02e-05 ***
## MVINSURERNRMA         -2.110e+00  5.433e-01   -3.884 0.000103 ***
## MVINSUREROTH          -2.137e+00  3.778e-01   -5.658 1.54e-08 ***
## MVINSURERQBE          -1.699e+00  1.084e+00   -1.567 0.117229    
## REGUSEPRIV            -1.137e+01  9.246e-01  -12.295  < 2e-16 ***
## MAKEALFA ROMEO        -3.924e+01  1.058e+01   -3.708 0.000209 ***
## MAKEASTON MARTIN      -1.659e+01  1.379e+01   -1.203 0.229034    
## MAKEAUDI              -3.670e+01  1.007e+01   -3.645 0.000268 ***
## MAKEBENTLEY           -1.980e+01  3.009e+01   -0.658 0.510489    
## MAKEBMW               -3.566e+01  1.006e+01   -3.544 0.000394 ***
## MAKECHERY             -5.153e+00  1.532e+01   -0.336 0.736638    
## MAKECHRYSLER          -3.637e+01  1.068e+01   -3.405 0.000662 ***
## MAKECITROEN           -3.001e+01  1.045e+01   -2.872 0.004077 ** 
## MAKEDAEWOO            -5.698e+00  1.319e+01   -0.432 0.665712    
## MAKEDAIHATSU          -2.282e+01  1.165e+01   -1.958 0.050180 .  
## MAKEDODGE             -3.818e+01  1.085e+01   -3.518 0.000435 ***
## MAKEFERRARI           -4.792e+01  1.619e+01   -2.959 0.003085 ** 
## MAKEFIAT              -3.699e+01  1.052e+01   -3.516 0.000438 ***
## MAKEFORD              -1.722e+01  1.005e+01   -1.712 0.086825 .  
## MAKEGREAT WALL MOTORS -1.951e+01  1.148e+01   -1.700 0.089165 .  
## MAKEHAVAL             -3.322e+01  1.921e+01   -1.729 0.083781 .  
## MAKEHOLDEN            -1.926e+01  1.005e+01   -1.917 0.055190 .  
## MAKEHONDA             -1.813e+01  1.005e+01   -1.805 0.071074 .  
## MAKEHSV               -1.446e+01  1.082e+01   -1.336 0.181513    
## MAKEHUMMER            -3.077e+01  3.009e+01   -1.023 0.306395    
## MAKEHYUNDAI           -3.221e+01  1.005e+01   -3.205 0.001351 ** 
## MAKEINFINITI          -1.728e+01  1.418e+01   -1.218 0.223147    
## MAKEISUZU             -2.860e+01  1.147e+01   -2.493 0.012678 *  
## MAKEJAGUAR            -3.522e+01  1.062e+01   -3.315 0.000916 ***
## MAKEJEEP              -3.573e+01  1.012e+01   -3.531 0.000415 ***
## MAKEKIA               -3.229e+01  1.008e+01   -3.202 0.001364 ** 
## MAKELAND ROVER        -4.042e+01  1.023e+01   -3.951 7.79e-05 ***
## MAKELDV               -2.506e+01  1.470e+01   -1.705 0.088183 .  
## MAKELEXUS             -3.990e+01  1.007e+01   -3.960 7.50e-05 ***
## MAKELOTUS             -5.526e+00  1.470e+01   -0.376 0.707055    
## MAKEMASERATI          -3.050e+01  1.617e+01   -1.885 0.059377 .  
## MAKEMAZDA             -3.081e+01  1.004e+01   -3.068 0.002157 ** 
## MAKEMCLAREN            5.086e+00  2.244e+01    0.227 0.820663    
## MAKEMERCEDES-AMG      -1.984e+01  1.104e+01   -1.798 0.072245 .  
## MAKEMERCEDES-BENZ     -3.010e+01  1.106e+01   -2.721 0.006514 ** 
## MAKEMERCEDES AMG      -2.248e+01  1.532e+01   -1.468 0.142238    
## MAKEMERCEDES BENZ     -3.079e+01  1.006e+01   -3.059 0.002220 ** 
## MAKEMG                -3.747e+01  1.275e+01   -2.938 0.003307 ** 
## MAKEMINI              -2.019e+01  1.034e+01   -1.954 0.050747 .  
## MAKEMITSUBISHI        -1.980e+01  1.006e+01   -1.968 0.049103 *  
## MAKENISSAN            -1.892e+01  1.005e+01   -1.882 0.059858 .  
## MAKEOPEL              -2.928e+01  1.346e+01   -2.175 0.029613 *  
## MAKEPEUGEOT           -3.569e+01  1.021e+01   -3.495 0.000475 ***
## MAKEPORSCHE           -4.200e+01  1.033e+01   -4.065 4.82e-05 ***
## MAKEPROTON            -2.452e+01  1.216e+01   -2.016 0.043790 *  
## MAKERANGE ROVER       -3.996e+01  1.036e+01   -3.859 0.000114 ***
## MAKERENAULT           -1.761e+01  1.029e+01   -1.711 0.087063 .  
## MAKESAAB              -2.305e+01  1.134e+01   -2.033 0.042026 *  
## MAKESKODA             -2.373e+01  1.038e+01   -2.285 0.022311 *  
## MAKESMART             -3.773e+01  1.469e+01   -2.569 0.010202 *  
## MAKESSANGYONG         -2.002e+01  1.196e+01   -1.673 0.094249 .  
## MAKESUBARU            -3.815e+01  1.005e+01   -3.795 0.000148 ***
## MAKESUZUKI            -1.884e+01  1.009e+01   -1.867 0.061922 .  
## MAKETESLA             -3.977e+00  1.276e+01   -0.312 0.755248    
## MAKETOYOTA            -1.815e+01  1.004e+01   -1.808 0.070580 .  
## MAKEVOLKSWAGEN        -3.721e+01  1.005e+01   -3.702 0.000214 ***
## MAKEVOLVO             -3.743e+01  1.015e+01   -3.687 0.000227 ***
## MVINSYEARS             5.009e-01  1.581e-01    3.169 0.001532 ** 
## MANUFYEAR             -4.551e+00  3.504e-02 -129.866  < 2e-16 ***
## YDAGE                 -1.436e+00  1.833e-02  -78.368  < 2e-16 ***
## ATFAULTACCDS           7.440e+01  1.007e+00   73.908  < 2e-16 ***
## NRMAASST              -3.284e-01  8.767e-02   -3.746 0.000180 ***
## NOYRLICOBT            -1.986e+01  2.190e-01  -90.662  < 2e-16 ***
## MVINSTYPEN             1.051e+01  7.348e+00    1.430 0.152752    
## MVINSTYPET             4.423e+01  8.336e-01   53.062  < 2e-16 ***
## DEMERITPTS            -5.728e-01  1.655e-01   -3.460 0.000540 ***
## NCBPCT                -1.374e-03  6.014e-03   -0.228 0.819334    
## COMREGVEHY            -1.604e+01  1.043e+00  -15.381  < 2e-16 ***
## YDGENDERM              9.658e+00  2.851e-01   33.879  < 2e-16 ***
## YDGENDERU              8.360e+00  1.422e+00    5.878 4.17e-09 ***
## ATFAULT5YRSY                  NA         NA       NA       NA    
## SHAPEPVF               2.655e+00  9.645e-01    2.753 0.005910 ** 
## SHAPESED               5.691e+00  8.676e-01    6.560 5.46e-11 ***
## SHAPEWAG               2.044e+00  8.964e-01    2.280 0.022616 *  
## VEHAGE                        NA         NA       NA       NA    
## COMMENCEDATE           8.640e-02  4.483e-03   19.271  < 2e-16 ***
## YEARLYKMLESS15        -6.039e-01  4.601e-01   -1.312 0.189372    
## NOYRSLICOBTLESS6      -3.190e+01  1.127e+00  -28.304  < 2e-16 ***
## DEMERITPTSZERO        -1.102e+02  8.264e-01 -133.355  < 2e-16 ***
## YDGENDERM                     NA         NA       NA       NA    
## COMMENCEDATE1                 NA         NA       NA       NA    
## ATFAULT5YRSY                  NA         NA       NA       NA    
## REGUSEP                       NA         NA       NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.36 on 41568 degrees of freedom
## Multiple R-squared:  0.8145, Adjusted R-squared:  0.8142 
## F-statistic:  2099 on 87 and 41568 DF,  p-value: < 2.2e-16

m2.2 <- lm(formula = INSURER2_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM - NCBPCT - REGUSEP - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15,
         data = trainm)
options(max.print = 4000)
summary(m2.2)
## 
## Call:
## lm(formula = INSURER2_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER3_PREMIUM - INSURER4_PREMIUM - NCBPCT - REGUSEP - 
##     YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - 
##     DEMERITPTS - YEARLYKMLESS15, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -231.947  -17.929   -2.223   17.006  218.647 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       8.598e+03  1.096e+02   78.435  < 2e-16 ***
## YEARLYKM          1.397e-01  1.639e-02    8.529  < 2e-16 ***
## LICCANCNT        -5.146e+01  2.202e+00  -23.366  < 2e-16 ***
## OWNERAGE         -6.984e-02  1.751e-02   -3.988 6.68e-05 ***
## MVINSURERALLI    -2.782e+00  6.572e-01   -4.234 2.30e-05 ***
## MVINSURERGIO     -3.385e+00  4.531e-01   -7.470 8.15e-14 ***
## MVINSURERNONE     2.383e+01  7.689e+00    3.100 0.001937 ** 
## MVINSURERNRMA    -2.034e+00  5.578e-01   -3.646 0.000267 ***
## MVINSUREROTH     -1.516e+00  3.815e-01   -3.974 7.08e-05 ***
## MVINSURERQBE     -1.387e+00  1.122e+00   -1.236 0.216304    
## REGUSEPRIV       -1.061e+01  9.590e-01  -11.060  < 2e-16 ***
## MVINSYEARS        5.763e-01  1.638e-01    3.519 0.000433 ***
## MANUFYEAR        -4.777e+00  3.485e-02 -137.081  < 2e-16 ***
## YDAGE            -1.429e+00  1.894e-02  -75.435  < 2e-16 ***
## ATFAULTACCDS      7.429e+01  1.046e+00   71.041  < 2e-16 ***
## NRMAASST         -2.651e-01  9.103e-02   -2.912 0.003597 ** 
## NOYRLICOBT       -1.952e+01  2.252e-01  -86.660  < 2e-16 ***
## MVINSTYPEN        1.701e+01  7.637e+00    2.227 0.025945 *  
## MVINSTYPET        4.539e+01  8.465e-01   53.625  < 2e-16 ***
## COMREGVEHY       -1.665e+01  1.080e+00  -15.417  < 2e-16 ***
## YDGENDERM         9.738e+00  2.951e-01   32.995  < 2e-16 ***
## YDGENDERU         8.910e+00  1.473e+00    6.048 1.48e-09 ***
## SHAPEPVF          4.554e+00  9.602e-01    4.743 2.11e-06 ***
## SHAPESED          8.897e+00  8.658e-01   10.276  < 2e-16 ***
## SHAPEWAG          4.433e+00  8.946e-01    4.955 7.26e-07 ***
## COMMENCEDATE      8.608e-02  4.661e-03   18.468  < 2e-16 ***
## NOYRSLICOBTLESS6 -2.970e+01  1.167e+00  -25.451  < 2e-16 ***
## DEMERITPTSZERO   -1.072e+02  5.156e-01 -207.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.49 on 41628 degrees of freedom
## Multiple R-squared:  0.7992, Adjusted R-squared:  0.7991 
## F-statistic:  6136 on 27 and 41628 DF,  p-value: < 2.2e-16

m3.2 <- lm(formula = INSURER2_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM - NCBPCT - REGUSEP - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 -MVINSTYPE - NRMAASST,
         data = trainm)
options(max.print = 4000)
summary(m3.2)
## 
## Call:
## lm(formula = INSURER2_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER3_PREMIUM - INSURER4_PREMIUM - NCBPCT - REGUSEP - 
##     YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - 
##     DEMERITPTS - YEARLYKMLESS15 - MVINSTYPE - NRMAASST, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -238.006  -18.312   -3.196   15.580  219.139 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      9321.48135  112.49271   82.863  < 2e-16 ***
## YEARLYKM            0.12153    0.01694    7.174 7.40e-13 ***
## LICCANCNT         -50.90092    2.27749  -22.350  < 2e-16 ***
## OWNERAGE           -0.06915    0.01811   -3.819 0.000134 ***
## MVINSURERALLI      -2.99943    0.67934   -4.415 1.01e-05 ***
## MVINSURERGIO       -3.68004    0.46505   -7.913 2.57e-15 ***
## MVINSURERNONE      36.34506    0.90211   40.289  < 2e-16 ***
## MVINSURERNRMA      -2.01641    0.57661   -3.497 0.000471 ***
## MVINSUREROTH       -1.84381    0.39423   -4.677 2.92e-06 ***
## MVINSURERQBE       -0.28250    1.15989   -0.244 0.807571    
## REGUSEPRIV        -10.73140    0.99165  -10.822  < 2e-16 ***
## MVINSYEARS          0.10097    0.09560    1.056 0.290915    
## MANUFYEAR          -5.14364    0.03533 -145.569  < 2e-16 ***
## YDAGE              -1.50153    0.01953  -76.883  < 2e-16 ***
## ATFAULTACCDS       75.33979    1.08123   69.680  < 2e-16 ***
## NOYRLICOBT        -18.94186    0.23259  -81.440  < 2e-16 ***
## COMREGVEHY        -16.43799    1.11631  -14.725  < 2e-16 ***
## YDGENDERM          10.10867    0.30507   33.135  < 2e-16 ***
## YDGENDERU           7.41881    1.51934    4.883 1.05e-06 ***
## SHAPEPVF            4.14621    0.99285    4.176 2.97e-05 ***
## SHAPESED            9.51988    0.89524   10.634  < 2e-16 ***
## SHAPEWAG            4.69734    0.92509    5.078 3.84e-07 ***
## COMMENCEDATE        0.08697    0.00482   18.045  < 2e-16 ***
## NOYRSLICOBTLESS6  -26.19178    1.20492  -21.737  < 2e-16 ***
## DEMERITPTSZERO   -108.35987    0.53275 -203.396  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.49 on 41631 degrees of freedom
## Multiple R-squared:  0.7852, Adjusted R-squared:  0.7851 
## F-statistic:  6342 on 24 and 41631 DF,  p-value: < 2.2e-16

m4.2 <- lm(formula = INSURER2_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER3_PREMIUM - INSURER4_PREMIUM - NCBPCT - REGUSEP - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 -MVINSTYPE - NRMAASST -MVINSYEARS - MVINSURER,
         data = trainm)
options(max.print = 4000)
summary(m4.2)
## 
## Call:
## lm(formula = INSURER2_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER3_PREMIUM - INSURER4_PREMIUM - NCBPCT - REGUSEP - 
##     YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - 
##     DEMERITPTS - YEARLYKMLESS15 - MVINSTYPE - NRMAASST - MVINSYEARS - 
##     MVINSURER, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -226.253  -18.790   -3.858   14.598  203.930 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       9.719e+03  1.141e+02   85.141  < 2e-16 ***
## YEARLYKM          1.156e-01  1.734e-02    6.670 2.58e-11 ***
## LICCANCNT        -3.238e+01  2.290e+00  -14.138  < 2e-16 ***
## OWNERAGE         -9.647e-02  1.849e-02   -5.217 1.82e-07 ***
## REGUSEPRIV       -1.079e+01  1.015e+00  -10.632  < 2e-16 ***
## MANUFYEAR        -5.347e+00  3.533e-02 -151.355  < 2e-16 ***
## YDAGE            -1.563e+00  1.977e-02  -79.059  < 2e-16 ***
## ATFAULTACCDS      7.590e+01  1.102e+00   68.862  < 2e-16 ***
## NOYRLICOBT       -1.857e+01  2.353e-01  -78.931  < 2e-16 ***
## COMREGVEHY       -1.554e+01  1.142e+00  -13.606  < 2e-16 ***
## YDGENDERM         1.081e+01  3.116e-01   34.682  < 2e-16 ***
## YDGENDERU         7.497e+00  1.542e+00    4.863 1.16e-06 ***
## SHAPEPVF          4.185e+00  1.016e+00    4.120 3.79e-05 ***
## SHAPESED          1.011e+01  9.158e-01   11.040  < 2e-16 ***
## SHAPEWAG          4.882e+00  9.467e-01    5.157 2.53e-07 ***
## COMMENCEDATE      8.766e-02  4.934e-03   17.766  < 2e-16 ***
## NOYRSLICOBTLESS6 -2.312e+01  1.229e+00  -18.818  < 2e-16 ***
## DEMERITPTSZERO   -1.091e+02  5.385e-01 -202.690  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.22 on 41638 degrees of freedom
## Multiple R-squared:  0.7749, Adjusted R-squared:  0.7748 
## F-statistic:  8430 on 17 and 41638 DF,  p-value: < 2.2e-16

anova(m2.2, m1.2)
## Analysis of Variance Table
## 
## Model 1: INSURER2_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER3_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER3_PREMIUM - 
##     INSURER4_PREMIUM - NCBPCT - REGUSEP - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15
## Model 2: INSURER2_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER3_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - INSURER1_PREMIUM - INSURER3_PREMIUM - 
##     INSURER4_PREMIUM
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41628 36194799                                  
## 2  41568 33425194 60   2769606 57.405 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m3.2, m2.2)
## Analysis of Variance Table
## 
## Model 1: INSURER2_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER3_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER3_PREMIUM - 
##     INSURER4_PREMIUM - NCBPCT - REGUSEP - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - 
##     MVINSTYPE - NRMAASST
## Model 2: INSURER2_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER3_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER3_PREMIUM - 
##     INSURER4_PREMIUM - NCBPCT - REGUSEP - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41631 38708357                                  
## 2  41628 36194799  3   2513557 963.62 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m4.2, m3.2)
## Analysis of Variance Table
## 
## Model 1: INSURER2_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER3_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER3_PREMIUM - 
##     INSURER4_PREMIUM - NCBPCT - REGUSEP - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - 
##     MVINSTYPE - NRMAASST - MVINSYEARS - MVINSURER
## Model 2: INSURER2_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER3_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER3_PREMIUM - 
##     INSURER4_PREMIUM - NCBPCT - REGUSEP - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - VEHAGE - COMMENCEDATE1 - DEMERITPTS - YEARLYKMLESS15 - 
##     MVINSTYPE - NRMAASST
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41638 40577123                                  
## 2  41631 38708357  7   1868767 287.12 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


predicted.premium4.2 <- predict(object = m4.2, newdata = testm, type = "response")

rp1.2 <- ggplot(m4.2, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Fitted")

rp3.2 <- ggplot(m4.2) +
  stat_qq(aes(sample = .stdresid)) +
  geom_abline()
 

rp7.2 <- ggplot(m4.2, aes(.hat, .cooksd)) +
  geom_vline(xintercept = 0, colour = NA) +
  geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
  geom_smooth(se = FALSE) +
  geom_point() +
  labs(title = "Cook's distance vs Leverage")

rp8.2 <- ggplot(m4.2, aes(.resid)) +
  geom_histogram(bins = 7, color="darkblue", fill="steelblue")
 

grid.arrange(rp1.2, rp3.2, rp7.2, rp8.2, ncol = 2)


results.df.2 <- data.frame(cbind(actuals = testm$INSURER2_PREMIUM, predicted = predicted.premium4.2))

results.df.2 <- results.df.2 %>%
  mutate(error = results.df.2$actuals - results.df.2$predicted) %>%
  round(., 2)
results.df.2 <- results.df.2 %>%
  mutate(percerror = paste0(round(results.df.2$error/results.df.2$actuals*100,2),"%"))

kable(head(results.df.2))

actuals	predicted	error	percerror
386.58	357.84	28.74	7.43%
292.33	294.66	-2.33	-0.8%
221.36	229.24	-7.87	-3.56%
342.12	286.04	56.08	16.39%
211.37	238.95	-27.59	-13.05%
241.26	247.99	-6.73	-2.79%


sprintf("The mean percent error is: %s%%", round(mean(results.df.2$error/results.df.2$actuals*100), 2))
## [1] "The mean percent error is: -1.18%"

rp.2 <- ggplot(results.df.2, aes(results.df.2$predicted, results.df.2$actuals)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(title = "Actuals vs Fitted")

grid.arrange(rp.2 , ncol = 1)

Insurer2 had much improved predictive results than Insurer1 with an improved adjusted R-Squared value of 0.7794. The cooks distance leverage plot does show a large effect for some observations to affect the output if removed. The overall Actuals vs Fitted plot is much more in line with expected and only deviates for the higher priced premiums. Although if we remember back to the initial histograms there are a large portion of higher priced premiums for Insurers 2 and 3 than 1 and 4. The final model for Insurer2 had 14 variables significant to predicting the outcome variable. The variables removed from Insurer1 model were: MVINSYEARS, NRMAASST, MVINSTYPE, NCBPCT; while the variables new to the model were: REGUSE, COMREGVEH, YDGENDER, SHAPE, COMMENCEDATE, DEMERITPTSZERO. Many of these new variables are intuitively important variables to determining the driver risk and behavior. The addition of business use is an interesting addition. The additional spread in pricing points for the premium distribution seems to have an ability to better classify policy holders, spread the risk across more specified pricing bandings and for Insurer3 allows for higher pricing and therefore profit in a like for like policy holder scenario.

5.2. INSURER3_PREMIUMS

m1.3 <- lm(formula = INSURER3_PREMIUM ~ .  - INSURER1_PREMIUM - INSURER2_PREMIUM - INSURER4_PREMIUM,
         data = trainm)

options(max.print = 4000)
summary(m1.3)
## 
## Call:
## lm(formula = INSURER3_PREMIUM ~ . - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER4_PREMIUM, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -122.555  -17.661   -3.178   14.519  167.635 
## 
## Coefficients: (6 not defined because of singularities)
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            1.156e+04  1.023e+02  112.998  < 2e-16 ***
## YEARLYKM               1.933e-02  1.611e-02    1.200 0.230124    
## LICCANCNT             -1.711e+01  2.040e+00   -8.384  < 2e-16 ***
## OWNERAGE               2.800e-01  1.598e-02   17.526  < 2e-16 ***
## MVINSURERALLI         -3.711e-01  6.026e-01   -0.616 0.538033    
## MVINSURERGIO           4.512e+00  4.141e-01   10.897  < 2e-16 ***
## MVINSURERNONE         -9.189e+01  7.001e+00  -13.126  < 2e-16 ***
## MVINSURERNRMA         -3.686e+01  5.141e-01  -71.699  < 2e-16 ***
## MVINSUREROTH           5.485e-01  3.575e-01    1.534 0.124961    
## MVINSURERQBE           1.187e+00  1.026e+00    1.157 0.247321    
## REGUSEPRIV             7.160e+00  8.750e-01    8.183 2.84e-16 ***
## MAKEALFA ROMEO        -2.805e+01  1.001e+01   -2.801 0.005094 ** 
## MAKEASTON MARTIN       4.858e-01  1.305e+01    0.037 0.970302    
## MAKEAUDI              -2.798e+01  9.527e+00   -2.937 0.003313 ** 
## MAKEBENTLEY            6.141e+00  2.848e+01    0.216 0.829258    
## MAKEBMW               -1.980e+01  9.520e+00   -2.080 0.037535 *  
## MAKECHERY              8.231e+00  1.450e+01    0.568 0.570249    
## MAKECHRYSLER          -1.483e+01  1.011e+01   -1.468 0.142150    
## MAKECITROEN            5.818e+00  9.885e+00    0.589 0.556175    
## MAKEDAEWOO            -2.464e+01  1.248e+01   -1.975 0.048329 *  
## MAKEDAIHATSU          -2.061e+01  1.103e+01   -1.869 0.061569 .  
## MAKEDODGE              2.953e+01  1.027e+01    2.875 0.004042 ** 
## MAKEFERRARI           -1.635e+01  1.532e+01   -1.067 0.286131    
## MAKEFIAT              -3.975e+01  9.956e+00   -3.992 6.56e-05 ***
## MAKEFORD               1.012e+01  9.514e+00    1.063 0.287624    
## MAKEGREAT WALL MOTORS  5.168e+00  1.086e+01    0.476 0.634269    
## MAKEHAVAL             -2.771e+01  1.818e+01   -1.524 0.127457    
## MAKEHOLDEN             8.238e+00  9.506e+00    0.867 0.386149    
## MAKEHONDA             -7.266e-01  9.507e+00   -0.076 0.939073    
## MAKEHSV                1.331e+01  1.024e+01    1.300 0.193506    
## MAKEHUMMER             4.231e+01  2.847e+01    1.486 0.137284    
## MAKEHYUNDAI            2.075e+00  9.509e+00    0.218 0.827216    
## MAKEINFINITI          -1.126e+01  1.342e+01   -0.839 0.401595    
## MAKEISUZU             -5.898e+00  1.086e+01   -0.543 0.587016    
## MAKEJAGUAR            -2.923e+01  1.005e+01   -2.908 0.003643 ** 
## MAKEJEEP               1.661e+01  9.576e+00    1.735 0.082726 .  
## MAKEKIA                5.065e+00  9.541e+00    0.531 0.595530    
## MAKELAND ROVER         3.380e+00  9.681e+00    0.349 0.726997    
## MAKELDV               -9.874e+00  1.391e+01   -0.710 0.477668    
## MAKELEXUS             -3.138e+01  9.533e+00   -3.291 0.000998 ***
## MAKELOTUS             -1.355e+01  1.392e+01   -0.974 0.330200    
## MAKEMASERATI           5.576e+01  1.531e+01    3.643 0.000270 ***
## MAKEMAZDA              2.584e+00  9.502e+00    0.272 0.785710    
## MAKEMCLAREN           -6.546e-01  2.123e+01   -0.031 0.975404    
## MAKEMERCEDES-AMG      -1.772e+00  1.045e+01   -0.170 0.865262    
## MAKEMERCEDES-BENZ      7.009e-01  1.047e+01    0.067 0.946626    
## MAKEMERCEDES AMG      -1.824e+01  1.450e+01   -1.258 0.208390    
## MAKEMERCEDES BENZ     -1.686e+01  9.524e+00   -1.770 0.076762 .  
## MAKEMG                -1.827e+01  1.207e+01   -1.514 0.130126    
## MAKEMINI              -4.368e+01  9.780e+00   -4.467 7.97e-06 ***
## MAKEMITSUBISHI         1.263e+00  9.521e+00    0.133 0.894508    
## MAKENISSAN            -1.641e+00  9.514e+00   -0.173 0.863044    
## MAKEOPEL               2.442e+01  1.274e+01    1.917 0.055228 .  
## MAKEPEUGEOT           -3.580e+01  9.663e+00   -3.705 0.000211 ***
## MAKEPORSCHE           -2.406e+01  9.779e+00   -2.461 0.013869 *  
## MAKEPROTON            -2.036e+00  1.151e+01   -0.177 0.859599    
## MAKERANGE ROVER        4.317e+00  9.800e+00    0.441 0.659540    
## MAKERENAULT           -2.607e+00  9.737e+00   -0.268 0.788923    
## MAKESAAB              -4.063e-01  1.073e+01   -0.038 0.969787    
## MAKESKODA             -6.505e+00  9.826e+00   -0.662 0.507958    
## MAKESMART             -1.327e+01  1.390e+01   -0.955 0.339696    
## MAKESSANGYONG          1.373e+00  1.132e+01    0.121 0.903467    
## MAKESUBARU             2.819e-01  9.512e+00    0.030 0.976358    
## MAKESUZUKI            -1.617e+00  9.550e+00   -0.169 0.865538    
## MAKETESLA              6.614e+00  1.207e+01    0.548 0.583820    
## MAKETOYOTA             2.784e+00  9.497e+00    0.293 0.769378    
## MAKEVOLKSWAGEN        -3.203e+01  9.513e+00   -3.367 0.000761 ***
## MAKEVOLVO             -2.842e+01  9.607e+00   -2.958 0.003099 ** 
## MVINSYEARS            -8.200e+00  1.496e-01  -54.823  < 2e-16 ***
## MANUFYEAR             -5.557e+00  3.316e-02 -167.556  < 2e-16 ***
## YDAGE                 -1.395e+00  1.735e-02  -80.433  < 2e-16 ***
## ATFAULTACCDS           5.812e+01  9.526e-01   61.006  < 2e-16 ***
## NRMAASST              -9.471e-01  8.296e-02  -11.417  < 2e-16 ***
## NOYRLICOBT             1.289e+00  2.073e-01    6.220 5.02e-10 ***
## MVINSTYPEN             1.250e+02  6.953e+00   17.976  < 2e-16 ***
## MVINSTYPET             3.761e+01  7.888e-01   47.684  < 2e-16 ***
## DEMERITPTS            -2.941e-01  1.567e-01   -1.878 0.060455 .  
## NCBPCT                 1.007e-02  5.691e-03    1.769 0.076853 .  
## COMREGVEHY            -2.276e+00  9.871e-01   -2.305 0.021144 *  
## YDGENDERM             -6.357e-01  2.698e-01   -2.357 0.018452 *  
## YDGENDERU              8.265e+00  1.346e+00    6.141 8.27e-10 ***
## ATFAULT5YRSY                  NA         NA       NA       NA    
## SHAPEPVF               7.355e+00  9.126e-01    8.059 7.89e-16 ***
## SHAPESED               5.988e-01  8.210e-01    0.729 0.465794    
## SHAPEWAG               2.318e+00  8.483e-01    2.732 0.006296 ** 
## VEHAGE                        NA         NA       NA       NA    
## COMMENCEDATE          -9.633e-04  4.242e-03   -0.227 0.820375    
## YEARLYKMLESS15         2.566e-01  4.354e-01    0.589 0.555702    
## NOYRSLICOBTLESS6       1.579e+01  1.066e+00   14.804  < 2e-16 ***
## DEMERITPTSZERO        -2.069e+01  7.820e-01  -26.458  < 2e-16 ***
## YDGENDERM                     NA         NA       NA       NA    
## COMMENCEDATE1                 NA         NA       NA       NA    
## ATFAULT5YRSY                  NA         NA       NA       NA    
## REGUSEP                       NA         NA       NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.83 on 41568 degrees of freedom
## Multiple R-squared:  0.7389, Adjusted R-squared:  0.7384 
## F-statistic:  1352 on 87 and 41568 DF,  p-value: < 2.2e-16

m2.3 <- lm(formula = INSURER3_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - INSURER4_PREMIUM - YEARLYKM - NCBPCT - REGUSEP - COMREGVEH - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - DEMERITPTS - YEARLYKMLESS15,
         data = trainm)
options(max.print = 4000)
summary(m2.3)
## 
## Call:
## lm(formula = INSURER3_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER2_PREMIUM - INSURER4_PREMIUM - YEARLYKM - NCBPCT - 
##     REGUSEP - COMREGVEH - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - 
##     VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - DEMERITPTS - 
##     YEARLYKMLESS15, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -125.549  -18.666   -2.191   17.385  154.292 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       1.189e+04  6.815e+01  174.490  < 2e-16 ***
## LICCANCNT        -1.690e+01  2.207e+00   -7.659 1.92e-14 ***
## OWNERAGE          2.826e-01  1.755e-02   16.109  < 2e-16 ***
## MVINSURERALLI    -1.471e+00  6.584e-01   -2.234  0.02546 *  
## MVINSURERGIO      5.234e+00  4.540e-01   11.529  < 2e-16 ***
## MVINSURERNONE    -9.864e+01  7.703e+00  -12.807  < 2e-16 ***
## MVINSURERNRMA    -3.646e+01  5.591e-01  -65.222  < 2e-16 ***
## MVINSUREROTH      7.789e-01  3.822e-01    2.038  0.04154 *  
## MVINSURERQBE      9.361e-01  1.124e+00    0.833  0.40510    
## REGUSEPRIV        1.127e+01  5.625e-01   20.028  < 2e-16 ***
## MVINSYEARS       -8.127e+00  1.641e-01  -49.536  < 2e-16 ***
## MANUFYEAR        -5.735e+00  3.385e-02 -169.415  < 2e-16 ***
## YDAGE            -1.404e+00  1.898e-02  -73.952  < 2e-16 ***
## ATFAULTACCDS      5.798e+01  1.048e+00   55.338  < 2e-16 ***
## NRMAASST         -8.228e-01  9.123e-02   -9.019  < 2e-16 ***
## NOYRLICOBT        1.370e+00  2.255e-01    6.074 1.26e-09 ***
## MVINSTYPEN        1.320e+02  7.650e+00   17.259  < 2e-16 ***
## MVINSTYPET        3.786e+01  8.475e-01   44.674  < 2e-16 ***
## YDGENDERM        -7.762e-01  2.956e-01   -2.626  0.00865 ** 
## YDGENDERU         1.048e+01  1.476e+00    7.100 1.27e-12 ***
## NOYRSLICOBTLESS6  1.714e+01  1.168e+00   14.671  < 2e-16 ***
## DEMERITPTSZERO   -1.945e+01  5.159e-01  -37.707  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.55 on 41634 degrees of freedom
## Multiple R-squared:  0.6828, Adjusted R-squared:  0.6826 
## F-statistic:  4268 on 21 and 41634 DF,  p-value: < 2.2e-16

m3.3 <- lm(formula = INSURER3_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - INSURER4_PREMIUM - YEARLYKM - NCBPCT - REGUSEP - COMREGVEH - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - DEMERITPTS - YEARLYKMLESS15 - MVINSURER,
         data = trainm)
options(max.print = 4000)
summary(m3.3)
## 
## Call:
## lm(formula = INSURER3_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER2_PREMIUM - INSURER4_PREMIUM - YEARLYKM - NCBPCT - 
##     REGUSEP - COMREGVEH - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - 
##     VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - DEMERITPTS - 
##     YEARLYKMLESS15 - MVINSURER, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.476  -19.691   -1.387   19.169  155.316 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       1.173e+04  7.264e+01  161.518  < 2e-16 ***
## LICCANCNT        -1.798e+01  2.355e+00   -7.634 2.32e-14 ***
## OWNERAGE          2.864e-01  1.870e-02   15.320  < 2e-16 ***
## REGUSEPRIV        1.028e+01  5.992e-01   17.162  < 2e-16 ***
## MVINSYEARS       -9.414e+00  1.720e-01  -54.721  < 2e-16 ***
## MANUFYEAR        -5.658e+00  3.608e-02 -156.817  < 2e-16 ***
## YDAGE            -1.345e+00  2.018e-02  -66.642  < 2e-16 ***
## ATFAULTACCDS      6.102e+01  1.114e+00   54.757  < 2e-16 ***
## NRMAASST         -3.729e-01  9.604e-02   -3.883 0.000103 ***
## NOYRLICOBT        1.371e+00  2.385e-01    5.747 9.12e-09 ***
## MVINSTYPEN        3.618e+01  9.013e-01   40.143  < 2e-16 ***
## MVINSTYPET        3.762e+01  9.044e-01   41.597  < 2e-16 ***
## YDGENDERM        -5.793e-01  3.154e-01   -1.837 0.066259 .  
## YDGENDERU         1.432e+01  1.565e+00    9.150  < 2e-16 ***
## NOYRSLICOBTLESS6  1.852e+01  1.245e+00   14.868  < 2e-16 ***
## DEMERITPTSZERO   -1.899e+01  5.493e-01  -34.578  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.55 on 41640 degrees of freedom
## Multiple R-squared:  0.6384, Adjusted R-squared:  0.6382 
## F-statistic:  4900 on 15 and 41640 DF,  p-value: < 2.2e-16

m4.3 <- lm(formula = INSURER3_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - INSURER4_PREMIUM - YEARLYKM - NCBPCT - REGUSEP - COMREGVEH - YDGENDER - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - NRMAASST - DEMERITPTS - YEARLYKMLESS15 - MVINSURER,
         data = trainm)
options(max.print = 4000)
summary(m4.3)
## 
## Call:
## lm(formula = INSURER3_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER2_PREMIUM - INSURER4_PREMIUM - YEARLYKM - NCBPCT - 
##     REGUSEP - COMREGVEH - YDGENDER - YDGENDERM - ATFAULT5YRS - 
##     ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - 
##     NRMAASST - DEMERITPTS - YEARLYKMLESS15 - MVINSURER, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.770  -19.777   -1.393   19.203  154.743 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       1.176e+04  7.256e+01  162.034  < 2e-16 ***
## LICCANCNT        -1.824e+01  2.358e+00   -7.734 1.06e-14 ***
## OWNERAGE          2.867e-01  1.871e-02   15.324  < 2e-16 ***
## REGUSEPRIV        1.013e+01  5.965e-01   16.977  < 2e-16 ***
## MVINSYEARS       -9.963e+00  9.829e-02 -101.366  < 2e-16 ***
## MANUFYEAR        -5.670e+00  3.605e-02 -157.304  < 2e-16 ***
## YDAGE            -1.343e+00  2.018e-02  -66.577  < 2e-16 ***
## ATFAULTACCDS      6.121e+01  1.115e+00   54.892  < 2e-16 ***
## NOYRLICOBT        1.420e+00  2.386e-01    5.951 2.69e-09 ***
## MVINSTYPEN        3.620e+01  9.011e-01   40.172  < 2e-16 ***
## MVINSTYPET        3.746e+01  9.051e-01   41.388  < 2e-16 ***
## NOYRSLICOBTLESS6  1.874e+01  1.246e+00   15.034  < 2e-16 ***
## DEMERITPTSZERO   -1.891e+01  5.497e-01  -34.403  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.59 on 41643 degrees of freedom
## Multiple R-squared:  0.6375, Adjusted R-squared:  0.6374 
## F-statistic:  6103 on 12 and 41643 DF,  p-value: < 2.2e-16

anova(m2.3, m1.3)
## Analysis of Variance Table
## 
## Model 1: INSURER3_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER4_PREMIUM - YEARLYKM - NCBPCT - REGUSEP - COMREGVEH - 
##     YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - 
##     COMMENCEDATE1 - SHAPE - DEMERITPTS - YEARLYKMLESS15
## Model 2: INSURER3_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER4_PREMIUM
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41634 36362700                                  
## 2  41568 29930853 66   6431847 135.34 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m3.3, m2.3)
## Analysis of Variance Table
## 
## Model 1: INSURER3_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER4_PREMIUM - YEARLYKM - NCBPCT - REGUSEP - COMREGVEH - 
##     YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - 
##     COMMENCEDATE1 - SHAPE - DEMERITPTS - YEARLYKMLESS15 - MVINSURER
## Model 2: INSURER3_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER4_PREMIUM - YEARLYKM - NCBPCT - REGUSEP - COMREGVEH - 
##     YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - 
##     COMMENCEDATE1 - SHAPE - DEMERITPTS - YEARLYKMLESS15
##   Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
## 1  41640 41457881                                 
## 2  41634 36362700  6   5095181 972.3 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m4.3, m3.3)
## Analysis of Variance Table
## 
## Model 1: INSURER3_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER4_PREMIUM - YEARLYKM - NCBPCT - REGUSEP - COMREGVEH - 
##     YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - 
##     COMMENCEDATE - COMMENCEDATE1 - SHAPE - NRMAASST - DEMERITPTS - 
##     YEARLYKMLESS15 - MVINSURER
## Model 2: INSURER3_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER4_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER4_PREMIUM - YEARLYKM - NCBPCT - REGUSEP - COMREGVEH - 
##     YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - 
##     COMMENCEDATE1 - SHAPE - DEMERITPTS - YEARLYKMLESS15 - MVINSURER
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41643 41557901                                  
## 2  41640 41457881  3    100019 33.486 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


predicted.premium4.3 <- predict(object = m4.3, newdata = testm, type = "response")

rp1.3 <- ggplot(m4.3, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Fitted")

rp3.3 <- ggplot(m4.3) +
  stat_qq(aes(sample = .stdresid)) +
  geom_abline()
 

rp7.3 <- ggplot(m4.3, aes(.hat, .cooksd)) +
  geom_vline(xintercept = 0, colour = NA) +
  geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
  geom_smooth(se = FALSE) +
  geom_point() +
  labs(title = "Cook's distance vs Leverage")

rp8.3 <- ggplot(m4.3, aes(.resid)) +
  geom_histogram(bins = 7, color="darkblue", fill="steelblue")
 

grid.arrange(rp1.3, rp3.3, rp7.3, rp8.3, ncol = 2)


results.df.3 <- data.frame(cbind(actuals = testm$INSURER3_PREMIUM, predicted = predicted.premium4.3))

results.df.3 <- results.df.3 %>%
  mutate(error = results.df.3$actuals - results.df.3$predicted) %>%
  round(., 2)
results.df.3 <- results.df.3 %>%
  mutate(percerror = paste0(round(results.df.3$error/results.df.3$actuals*100,2),"%"))

kable(head(results.df.3))

actuals	predicted	error	percerror
309.22	317.78	-8.57	-2.77%
257.64	307.47	-49.83	-19.34%
254.69	268.62	-13.93	-5.47%
265.50	291.72	-26.21	-9.87%
214.49	245.42	-30.93	-14.42%
214.49	235.68	-21.19	-9.88%


sprintf("The mean percent error is: %s%%", round(mean(results.df.3$error/results.df.3$actuals*100), 2))
## [1] "The mean percent error is: -1.15%"

rp.3 <- ggplot(results.df.3, aes(results.df.3$predicted, results.df.3$actuals)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(title = "Actuals vs Fitted")

grid.arrange(rp.3 , ncol = 1)

The results for the actual vs fitted for Insurer3 predicted values and model are again closer to the 45 degree line as expected due to the higher pricing points more with large numbers (6447) as with Insurer2. Yet the similarities with Insurer3 and Insurer1 variable selection is much closer aligned than with Insurer2. The Adjusted R-SQUARE is lower at 0.6403. The model excluded the variables NCBPCT, NRMAASST, YEARLYKM and included REGUSE and DEMERITPTS. Insurer3 has the highest mean and medium premiums of the 4 insurers quoted. This again shows that Insure3 is either classing some of the policy holders as higher risk or pricing the insurance higher for similar risk. We cannot speak to profitability without the attached claims and claim costs data but can say that for the same policy holder quotes Insurers 1 and 4 have much lower averages and therefore are either at greater risk or may be pricing lower to gain more policies or may be losing on possible optimal profits.

5.3. INSURER3_PREMIUMS

m1.4 <- lm(formula = INSURER4_PREMIUM ~ .  - INSURER1_PREMIUM - INSURER2_PREMIUM - INSURER3_PREMIUM,
         data = trainm)

options(max.print = 4000)
summary(m1.4)
## 
## Call:
## lm(formula = INSURER4_PREMIUM ~ . - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM, data = trainm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.946 -14.676  -3.895   6.478 180.385 
## 
## Coefficients: (6 not defined because of singularities)
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            3.663e+03  1.174e+02   31.186  < 2e-16 ***
## YEARLYKM               3.784e-02  1.850e-02    2.045 0.040827 *  
## LICCANCNT             -7.395e+00  2.343e+00   -3.156 0.001602 ** 
## OWNERAGE              -3.935e-01  1.835e-02  -21.444  < 2e-16 ***
## MVINSURERALLI          9.943e-01  6.920e-01    1.437 0.150817    
## MVINSURERGIO           6.860e+00  4.755e-01   14.425  < 2e-16 ***
## MVINSURERNONE          1.126e+00  8.040e+00    0.140 0.888651    
## MVINSURERNRMA          9.955e-01  5.904e-01    1.686 0.091798 .  
## MVINSUREROTH           2.021e+00  4.105e-01    4.923 8.57e-07 ***
## MVINSURERQBE           1.238e+00  1.178e+00    1.051 0.293283    
## REGUSEPRIV             4.388e-01  1.005e+00    0.437 0.662357    
## MAKEALFA ROMEO         1.126e+01  1.150e+01    0.979 0.327403    
## MAKEASTON MARTIN       2.011e+01  1.498e+01    1.342 0.179545    
## MAKEAUDI               1.054e+01  1.094e+01    0.963 0.335383    
## MAKEBENTLEY            2.413e+01  3.270e+01    0.738 0.460569    
## MAKEBMW                1.193e+01  1.093e+01    1.091 0.275224    
## MAKECHERY             -2.556e+00  1.665e+01   -0.153 0.878005    
## MAKECHRYSLER           9.399e+00  1.161e+01    0.810 0.418052    
## MAKECITROEN            8.384e+00  1.135e+01    0.738 0.460227    
## MAKEDAEWOO             2.490e+01  1.433e+01    1.737 0.082382 .  
## MAKEDAIHATSU           5.437e+00  1.266e+01    0.429 0.667683    
## MAKEDODGE              1.248e+01  1.179e+01    1.058 0.289930    
## MAKEFERRARI           -3.608e+00  1.760e+01   -0.205 0.837558    
## MAKEFIAT               1.441e+01  1.143e+01    1.260 0.207532    
## MAKEFORD               1.242e+01  1.093e+01    1.137 0.255733    
## MAKEGREAT WALL MOTORS  2.486e+01  1.248e+01    1.993 0.046255 *  
## MAKEHAVAL              1.510e+01  2.088e+01    0.723 0.469499    
## MAKEHOLDEN             1.278e+01  1.092e+01    1.171 0.241692    
## MAKEHONDA              1.170e+01  1.092e+01    1.072 0.283677    
## MAKEHSV                7.521e+00  1.176e+01    0.640 0.522394    
## MAKEHUMMER             2.164e+01  3.270e+01    0.662 0.508184    
## MAKEHYUNDAI            1.821e+01  1.092e+01    1.668 0.095353 .  
## MAKEINFINITI           1.714e+01  1.541e+01    1.112 0.266266    
## MAKEISUZU              7.327e+00  1.247e+01    0.588 0.556789    
## MAKEJAGUAR             1.432e+01  1.154e+01    1.241 0.214664    
## MAKEJEEP               1.705e+01  1.100e+01    1.551 0.120953    
## MAKEKIA                1.272e+01  1.096e+01    1.161 0.245506    
## MAKELAND ROVER         1.079e+01  1.112e+01    0.970 0.331865    
## MAKELDV                5.433e+00  1.597e+01    0.340 0.733711    
## MAKELEXUS              1.198e+01  1.095e+01    1.095 0.273660    
## MAKELOTUS             -8.132e+00  1.598e+01   -0.509 0.610843    
## MAKEMASERATI           9.281e-01  1.758e+01    0.053 0.957890    
## MAKEMAZDA              1.214e+01  1.091e+01    1.112 0.266034    
## MAKEMCLAREN           -2.725e+01  2.438e+01   -1.118 0.263669    
## MAKEMERCEDES-AMG       4.866e+00  1.200e+01    0.406 0.685017    
## MAKEMERCEDES-BENZ      1.284e+01  1.202e+01    1.068 0.285433    
## MAKEMERCEDES AMG       7.314e+00  1.665e+01    0.439 0.660455    
## MAKEMERCEDES BENZ      1.130e+01  1.094e+01    1.033 0.301684    
## MAKEMG                 2.524e+01  1.386e+01    1.821 0.068582 .  
## MAKEMINI               7.432e+00  1.123e+01    0.662 0.508136    
## MAKEMITSUBISHI         1.487e+01  1.093e+01    1.360 0.173899    
## MAKENISSAN             1.220e+01  1.093e+01    1.117 0.263999    
## MAKEOPEL               1.267e+01  1.463e+01    0.866 0.386484    
## MAKEPEUGEOT            1.089e+01  1.110e+01    0.982 0.326315    
## MAKEPORSCHE            7.989e+00  1.123e+01    0.711 0.476818    
## MAKEPROTON             1.756e+01  1.322e+01    1.329 0.183978    
## MAKERANGE ROVER        8.586e+00  1.125e+01    0.763 0.445499    
## MAKERENAULT            7.586e+00  1.118e+01    0.678 0.497535    
## MAKESAAB               1.349e+01  1.232e+01    1.095 0.273467    
## MAKESKODA              8.781e+00  1.128e+01    0.778 0.436443    
## MAKESMART              7.325e+00  1.596e+01    0.459 0.646276    
## MAKESSANGYONG          1.081e+01  1.300e+01    0.831 0.405738    
## MAKESUBARU             1.399e+01  1.092e+01    1.281 0.200151    
## MAKESUZUKI             1.773e+01  1.097e+01    1.617 0.105980    
## MAKETESLA             -3.425e+00  1.387e+01   -0.247 0.804888    
## MAKETOYOTA             1.228e+01  1.091e+01    1.126 0.260156    
## MAKEVOLKSWAGEN         1.337e+01  1.092e+01    1.224 0.221016    
## MAKEVOLVO              1.360e+01  1.103e+01    1.233 0.217551    
## MVINSYEARS            -5.928e-01  1.718e-01   -3.451 0.000558 ***
## MANUFYEAR             -1.663e+00  3.808e-02  -43.675  < 2e-16 ***
## YDAGE                 -7.474e-01  1.992e-02  -37.520  < 2e-16 ***
## ATFAULTACCDS           1.628e+01  1.094e+00   14.883  < 2e-16 ***
## NRMAASST               4.050e-01  9.527e-02    4.251 2.13e-05 ***
## NOYRLICOBT             5.004e+00  2.380e-01   21.023  < 2e-16 ***
## MVINSTYPEN             1.161e+01  7.985e+00    1.454 0.145832    
## MVINSTYPET             1.278e+01  9.059e-01   14.109  < 2e-16 ***
## DEMERITPTS             2.767e-01  1.799e-01    1.538 0.124104    
## NCBPCT                -8.254e-02  6.535e-03  -12.629  < 2e-16 ***
## COMREGVEHY             1.372e+01  1.134e+00   12.101  < 2e-16 ***
## YDGENDERM             -6.567e-02  3.098e-01   -0.212 0.832130    
## YDGENDERU             -3.877e+00  1.546e+00   -2.509 0.012123 *  
## ATFAULT5YRSY                  NA         NA       NA       NA    
## SHAPEPVF              -3.976e+00  1.048e+00   -3.794 0.000149 ***
## SHAPESED               9.058e-01  9.429e-01    0.961 0.336742    
## SHAPEWAG              -4.980e+00  9.742e-01   -5.112 3.20e-07 ***
## VEHAGE                        NA         NA       NA       NA    
## COMMENCEDATE           1.383e-03  4.872e-03    0.284 0.776576    
## YEARLYKMLESS15         3.477e+00  5.000e-01    6.953 3.62e-12 ***
## NOYRSLICOBTLESS6       4.979e+01  1.225e+00   40.658  < 2e-16 ***
## DEMERITPTSZERO        -1.166e+02  8.981e-01 -129.852  < 2e-16 ***
## YDGENDERM                     NA         NA       NA       NA    
## COMMENCEDATE1                 NA         NA       NA       NA    
## ATFAULT5YRSY                  NA         NA       NA       NA    
## REGUSEP                       NA         NA       NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.82 on 41568 degrees of freedom
## Multiple R-squared:  0.6641, Adjusted R-squared:  0.6634 
## F-statistic: 944.5 on 87 and 41568 DF,  p-value: < 2.2e-16

m2.4 <- lm(formula = INSURER4_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - INSURER3_PREMIUM - MVINSURER - REGUSEP - YDGENDER - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS,
         data = trainm)
options(max.print = 4000)
summary(m2.4)
## 
## Call:
## lm(formula = INSURER4_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER2_PREMIUM - INSURER3_PREMIUM - MVINSURER - REGUSEP - 
##     YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - 
##     COMMENCEDATE - COMMENCEDATE1 - DEMERITPTS, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.023  -14.633   -3.958    6.202  182.080 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      3619.53897   74.81453   48.380  < 2e-16 ***
## YEARLYKM            0.04115    0.01853    2.220  0.02640 *  
## LICCANCNT          -6.31699    2.31031   -2.734  0.00625 ** 
## OWNERAGE           -0.37602    0.01834  -20.498  < 2e-16 ***
## REGUSEPRIV          0.83431    1.00503    0.830  0.40647    
## MVINSYEARS         -0.96689    0.16890   -5.725 1.04e-08 ***
## MANUFYEAR          -1.62417    0.03719  -43.679  < 2e-16 ***
## YDAGE              -0.73740    0.01992  -37.011  < 2e-16 ***
## ATFAULTACCDS       16.99968    1.09286   15.555  < 2e-16 ***
## NRMAASST            0.59419    0.09412    6.313 2.76e-10 ***
## NOYRLICOBT          5.49209    0.23619   23.252  < 2e-16 ***
## MVINSTYPEN         11.00177    0.89513   12.291  < 2e-16 ***
## MVINSTYPET         12.99879    0.90480   14.366  < 2e-16 ***
## NCBPCT             -0.07379    0.00624  -11.825  < 2e-16 ***
## COMREGVEHY         13.83039    1.13466   12.189  < 2e-16 ***
## SHAPEPVF           -2.76618    1.00784   -2.745  0.00606 ** 
## SHAPESED            1.81223    0.90850    1.995  0.04608 *  
## SHAPEWAG           -4.04552    0.93876   -4.309 1.64e-05 ***
## YEARLYKMLESS15      3.42415    0.49964    6.853 7.32e-12 ***
## NOYRSLICOBTLESS6   51.16695    1.22437   41.790  < 2e-16 ***
## DEMERITPTSZERO   -117.57959    0.58245 -201.869  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.94 on 41635 degrees of freedom
## Multiple R-squared:  0.6608, Adjusted R-squared:  0.6606 
## F-statistic:  4055 on 20 and 41635 DF,  p-value: < 2.2e-16

m3.4 <- lm(formula = INSURER4_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - INSURER3_PREMIUM - MVINSURER - REGUSE - REGUSEP - YDGENDER - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - DEMERITPTS,
         data = trainm)
options(max.print = 4000)
summary(m3.4)
## 
## Call:
## lm(formula = INSURER4_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER2_PREMIUM - INSURER3_PREMIUM - MVINSURER - REGUSE - 
##     REGUSEP - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - 
##     VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - DEMERITPTS, 
##     data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.205  -14.699   -4.016    5.965  179.576 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       3.785e+03  7.292e+01   51.899  < 2e-16 ***
## YEARLYKM          3.520e-02  1.858e-02    1.894   0.0582 .  
## LICCANCNT        -6.100e+00  2.318e+00   -2.632   0.0085 ** 
## OWNERAGE         -3.707e-01  1.840e-02  -20.146  < 2e-16 ***
## MVINSYEARS       -9.257e-01  1.694e-01   -5.464 4.69e-08 ***
## MANUFYEAR        -1.707e+00  3.626e-02  -47.071  < 2e-16 ***
## YDAGE            -7.345e-01  1.999e-02  -36.752  < 2e-16 ***
## ATFAULTACCDS      1.719e+01  1.096e+00   15.676  < 2e-16 ***
## NRMAASST          5.880e-01  9.443e-02    6.226 4.83e-10 ***
## NOYRLICOBT        5.620e+00  2.368e-01   23.733  < 2e-16 ***
## MVINSTYPEN        1.153e+01  8.974e-01   12.843  < 2e-16 ***
## MVINSTYPET        1.345e+01  9.071e-01   14.828  < 2e-16 ***
## NCBPCT           -7.342e-02  6.259e-03  -11.730  < 2e-16 ***
## COMREGVEHY        1.237e+01  6.654e-01   18.584  < 2e-16 ***
## YEARLYKMLESS15    3.536e+00  5.012e-01    7.056 1.74e-12 ***
## NOYRSLICOBTLESS6  5.223e+01  1.227e+00   42.575  < 2e-16 ***
## DEMERITPTSZERO   -1.173e+02  5.842e-01 -200.833  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.05 on 41639 degrees of freedom
## Multiple R-squared:  0.6584, Adjusted R-squared:  0.6583 
## F-statistic:  5017 on 16 and 41639 DF,  p-value: < 2.2e-16

m4.4 <- lm(formula = INSURER3_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - INSURER3_PREMIUM - MVINSURER - REGUSE - REGUSEP - YDGENDER - YDGENDERM  - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - DEMERITPTS - YEARLYKM - NOYRLICOBT - YEARLYKMLESS15,
         data = trainm)
options(max.print = 4000)
summary(m4.4)
## 
## Call:
## lm(formula = INSURER3_PREMIUM ~ . - MAKE - INSURER1_PREMIUM - 
##     INSURER2_PREMIUM - INSURER3_PREMIUM - MVINSURER - REGUSE - 
##     REGUSEP - YDGENDER - YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - 
##     VEHAGE - COMMENCEDATE - COMMENCEDATE1 - SHAPE - DEMERITPTS - 
##     YEARLYKM - NOYRLICOBT - YEARLYKMLESS15, data = trainm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.897  -19.218   -0.807   19.313  150.181 
## 
## Coefficients:
##                    Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)       1.068e+04  7.406e+01  144.243  < 2e-16 ***
## LICCANCNT        -1.692e+01  2.289e+00   -7.394 1.45e-13 ***
## OWNERAGE          3.767e-01  1.826e-02   20.631  < 2e-16 ***
## MVINSYEARS       -9.211e+00  1.669e-01  -55.177  < 2e-16 ***
## MANUFYEAR        -5.172e+00  3.656e-02 -141.470  < 2e-16 ***
## YDAGE            -1.197e+00  2.003e-02  -59.747  < 2e-16 ***
## ATFAULTACCDS      5.712e+01  1.084e+00   52.699  < 2e-16 ***
## NRMAASST         -5.241e-01  9.331e-02   -5.617 1.95e-08 ***
## MVINSTYPEN        3.522e+01  8.878e-01   39.673  < 2e-16 ***
## MVINSTYPET        3.633e+01  8.984e-01   40.436  < 2e-16 ***
## NCBPCT            8.938e-02  5.936e-03   15.058  < 2e-16 ***
## COMREGVEHY       -1.256e+01  6.584e-01  -19.075  < 2e-16 ***
## INSURER4_PREMIUM  2.402e-01  4.806e-03   49.978  < 2e-16 ***
## NOYRSLICOBTLESS6  4.707e+00  4.773e-01    9.862  < 2e-16 ***
## DEMERITPTSZERO    8.270e+00  7.909e-01   10.456  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.68 on 41641 degrees of freedom
## Multiple R-squared:  0.6582, Adjusted R-squared:  0.6581 
## F-statistic:  5728 on 14 and 41641 DF,  p-value: < 2.2e-16

anova(m2.4, m1.4)
## Analysis of Variance Table
## 
## Model 1: INSURER4_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM - MVINSURER - REGUSEP - YDGENDER - YDGENDERM - 
##     ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - 
##     DEMERITPTS
## Model 2: INSURER4_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41635 39862843                                  
## 2  41568 39474390 67    388453 6.1053 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m3.4, m2.4)
## Analysis of Variance Table
## 
## Model 1: INSURER4_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM - MVINSURER - REGUSE - REGUSEP - YDGENDER - 
##     YDGENDERM - ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - 
##     COMMENCEDATE1 - SHAPE - DEMERITPTS
## Model 2: INSURER4_PREMIUM ~ (YEARLYKM + LICCANCNT + OWNERAGE + MVINSURER + 
##     REGUSE + MAKE + MVINSYEARS + MANUFYEAR + YDAGE + ATFAULTACCDS + 
##     NRMAASST + NOYRLICOBT + MVINSTYPE + DEMERITPTS + NCBPCT + 
##     COMREGVEH + YDGENDER + ATFAULT5YRS + SHAPE + VEHAGE + INSURER1_PREMIUM + 
##     INSURER2_PREMIUM + INSURER3_PREMIUM + COMMENCEDATE + YEARLYKMLESS15 + 
##     NOYRSLICOBTLESS6 + DEMERITPTSZERO + YDGENDERM + COMMENCEDATE1 + 
##     ATFAULT5YRSY + REGUSEP) - MAKE - INSURER1_PREMIUM - INSURER2_PREMIUM - 
##     INSURER3_PREMIUM - MVINSURER - REGUSEP - YDGENDER - YDGENDERM - 
##     ATFAULT5YRS - ATFAULT5YRSY - VEHAGE - COMMENCEDATE - COMMENCEDATE1 - 
##     DEMERITPTS
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  41639 40136740                                  
## 2  41635 39862843  4    273898 71.519 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m4.4, m3.4)
## Analysis of Variance Table
## 
## Response: INSURER3_PREMIUM
##                     Df   Sum Sq  Mean Sq   F value    Pr(>F)    
## LICCANCNT            1   969171   969171  1029.943 < 2.2e-16 ***
## OWNERAGE             1  6794433  6794433  7220.474 < 2.2e-16 ***
## MVINSYEARS           1 11525987 11525987 12248.718 < 2.2e-16 ***
## MANUFYEAR            1 37160812 37160812 39490.959 < 2.2e-16 ***
## YDAGE                1  7842485  7842485  8334.244 < 2.2e-16 ***
## ATFAULTACCDS         1  3348808  3348808  3558.793 < 2.2e-16 ***
## NRMAASST             1    14109    14109    14.993 0.0001081 ***
## MVINSTYPE            2  3311932  1655966  1759.802 < 2.2e-16 ***
## NCBPCT               1    72108    72108    76.630 < 2.2e-16 ***
## COMREGVEH            1   322891   322891   343.137 < 2.2e-16 ***
## INSURER4_PREMIUM     1  3847946  3847946  4089.230 < 2.2e-16 ***
## NOYRSLICOBTLESS6     1   142508   142508   151.444 < 2.2e-16 ***
## DEMERITPTSZERO       1   102875   102875   109.326 < 2.2e-16 ***
## Residuals        41641 39183991      941                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


predicted.premium3.4 <- predict(object = m3.4, newdata = testm, type = "response")

rp1.4 <- ggplot(m3.4, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(title = "Residuals vs Fitted")

rp3.4 <- ggplot(m3.4) +
  stat_qq(aes(sample = .stdresid)) +
  geom_abline()
 

rp7.4 <- ggplot(m3.4, aes(.hat, .cooksd)) +
  geom_vline(xintercept = 0, colour = NA) +
  geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
  geom_smooth(se = FALSE) +
  geom_point() +
  labs(title = "Cook's distance vs Leverage")

rp8.4 <- ggplot(m3.4, aes(.resid)) +
  geom_histogram(bins = 7, color="darkblue", fill="steelblue")
 

grid.arrange(rp1.4, rp3.4, rp7.4, rp8.4, ncol = 2)


results.df.4 <- data.frame(cbind(actuals = testm$INSURER4_PREMIUM, predicted = predicted.premium3.4))

results.df.4 <- results.df.4 %>%
  mutate(error = results.df.4$actuals - results.df.4$predicted) %>%
  round(., 2)
results.df.4 <- results.df.4 %>%
  mutate(percerror = paste0(round(results.df.4$error/results.df.4$actuals*100,2),"%"))

kable(head(results.df.4))

actuals	predicted	error	percerror
334.59	329.39	5.20	1.55%
211.13	233.64	-22.51	-10.66%
211.13	212.67	-1.54	-0.73%
253.60	242.52	11.08	4.37%
206.34	221.83	-15.49	-7.51%
224.52	230.38	-5.86	-2.61%


sprintf("The mean percent error is: %s%%", round(mean(results.df.4$error/results.df.4$actuals*100), 2))
## [1] "The mean percent error is: -1.2%"

rp.4 <- ggplot(results.df.4, aes(results.df.4$predicted, results.df.4$actuals)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE) +
  labs(title = "Actuals vs Fitted")

grid.arrange(rp.4 , ncol = 1)

We can see that model 3 predicts the premiums with much more variance in the data. The adjusted R-squared was the worst of all 4 insurer models at 0.6595 with a mean percent error of similar 1.15%, the data has the least price points and the model results vary greatly in the range to which it estimates each price points. Also model 4 of the tested variations was not found to be more significant then model 3. This brings to light an additional method to be run, there can me better backward stepwise feature selection chosen to test for the appropriate fits. Taking some features out can actually strengthen the correlation of other previously removed features. If more time was to be dedicated this path would be taken. The cooks distance shows much larger distances and the residual fit is much worse.

6. Final Remarks

In this project the aim was to identify price optimisation opportunity through studying the premium pricing models for Suncorp and 3 other unknown competitors. The data was gathered over 2 relatively close time periods and in all models this was found insignificant expect for Insurer2. The pricing differences in the two time periods could be compared outside of this pricing modeling in a anova comparison table to see if there is any significant differences in the two time periods. No additional time was spent down this avenue at this time. The price optimisation was found to be more interesting when studying the structure of the data at hand, the preliminary linear regression models and the feature selection. The pricing structure for each insurer shows there are much more variance in pricing for Insurers 2 and 3 with on average higher price tags for the same policy holder quotes. These pricing structures had much more variance than Insurers 1 and 4 which is indicative of a more variable pricing model inline with a linear regression. Further studies would include investigating the appropriateness of fit for a GLM for these insurers. Insurers 1 and 4 had much more banded pricing structure with a lower mean and less variation in the pricing models would indicate a better avenue would be to investigate the pricing structure with random forests. Using the same model selection process for all four insurers gave a like for like feature selection attributed to the pricing distribution. The results were actually quite intriguing, with Insurer2 having a very different model base than the other Insurers; Insurer3 had a model with variables in between that of Insurer1 and 2.

The overall conclusions of this investigation is that Suncorp (Insurer1) could improve their pricing structure to allow for more pricing bands and therefore increase profitability for the same customer base if it were to develop a model which included variables such as Gender, Business use and Vehicle shape. This added knowledge could allow for more specified risk structure and pricing variance and if the implementation allows for similar risk with increased pricing with no loss to customer segmentation or demand then there could be room for profit gain and improved market segmentation.

The overall conclusions are that this investigation was quite basic and could be greatly improved with more time. To better predict pricing optimisation opportunists more data covering Risk and customer segmentation along with internal strategy data for acquiring and keeping customers would also be useful to improving the theory. As for the modelling, besides investigating other better fit forms of models for this current data set, there could be better accuracy testing with the use of cross fold validation or leave one out cross validation and bootstrapping which were all not implemented in this model due to time constraints.

Research studies show the GLMs offer better predictive power for the given area of study, with GAMS allowing for even better performance but offering less simply explanations for the interactions at work. The use of regression trees and random forests could allow for better insights on variable splitting for information gain. Bagging would then likewise add a complexity that makes simple explanation of results hard but can add for further accuracy. With each model there will be benefits and difficulties that will need to be considered for the overall desired outcome. In this case the outcome was based around an investigation of backward engineering pricing structures for optimisation not accurately predicting price. Although accurately predicting premiums could greatly improve market knowledge and strategy.

7. Website Reference List

The following is to make note of the sites which were read and used for Insurance background knowledge and modelling procedures specific to modelling insurance pricing.

http://rstudio-pubs-static.s3.amazonaws.com/156481_80ee6ee3a0414fd38f5d3ad33d14c771.html

https://rpubs.com/jt_rpubs/279278

https://datasciencebeginners.com/2018/11/26/functions-and-packages-for-feature-selection-in-r/

https://topepo.github.io/caret/pre-processing.html#identifying-correlated-predictors

https://mlr.mlr-org.com/articles/tutorial/feature_selection.html

https://beta.vu.nl/nl/Images/werkstuk-fonti_tcm235-836234.pdf

https://stats.stackexchange.com/questions/152158/r-model-selection-with-categorical-variables-using-leaps-and-glmnet

https://www.math.mcgill.ca/yyang/demo.html

https://www.actuariesindia.org/SeminarDocs/8thCBSGI/ppt/Motor_Insurance_contd.pdf

https://www.kaggle.com/kernels/scriptcontent/3267460/download

https://www.casact.org/pubs/forum/00wforum/00wf107.pdf

https://www.theanalysisfactor.com/generalized-linear-models-glm-r-part4/