ACST4005 Actuarial Data Analytics Case Study

As actuarial analysts, we have been provided with a dataset containing the following covariates:
\(x_1\): weight of the car (\(\mathtt{weight}\))
\(x_2\): annually driven distance (\(\mathtt{distance}\))
\(x_3\): age of driver (\(\mathtt{age}\))
\(x_4\): age of car (\(\mathtt{carage}\))
\(x_5\): gender of driver (\(\mathtt{gender}\))

Using the given dataset, “CaseStudyData.csv”, the following questions have been answered.

Please ensure that all required packages are installed and loaded; the accompanying Rmd file lists them.

Any package that has not been installed previously can be installed with install.packages().

Part a: Exploratory Data Analysis

First, we read the provided dataset into R and familiarise ourselves with the data. The character variable gender is then converted to a numeric code (female = 1, male = 2) for ease of use.
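The loading and recoding step can be sketched as follows. This is a minimal illustration rather than the code from the Rmd file: it rebuilds only the six rows shown in the output that follows so that it runs without the CSV, and the `ifelse` recoding is one assumed way of producing the 1/2 coding.

```r
# Toy copy of the first six rows shown in the output (not the full file).
# With the real file this would be: data <- read.csv("CaseStudyData.csv")
data <- data.frame(
  Counts   = c(0, 0, 0, 0, 0, 0),
  gender   = c("female", "female", "male", "male", "male", "female"),
  distance = c(23, 8, 12, 13, 44, 35),
  weight   = c(2444, 1147, 2358, 2006, 693, 2304),
  age      = c(40, 34, 27, 84, 31, 43),
  carage   = c(11, 12, 10, 29, 16, 14),
  exposure = c(0.8793250, 0.9407260, 0.9607210, 0.8911884, 0.9622051, 0.8637496),
  stringsAsFactors = FALSE
)

# Inspect the raw data before any recoding.
head(data)
str(data)
summary(data)

# Recode gender to numeric: female = 1, male = 2 (assumed coding).
data$gender <- ifelse(data$gender == "male", 2, 1)
```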

##   Counts gender distance weight age carage  exposure
## 1      0 female       23   2444  40     11 0.8793250
## 2      0 female        8   1147  34     12 0.9407260
## 3      0   male       12   2358  27     10 0.9607210
## 4      0   male       13   2006  84     29 0.8911884
## 5      0   male       44    693  31     16 0.9622051
## 6      0 female       35   2304  43     14 0.8637496

## 'data.frame':    150000 obs. of  7 variables:
##  $ Counts  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ gender  : chr  "female" "female" "male" "male" ...
##  $ distance: int  23 8 12 13 44 35 46 27 9 70 ...
##  $ weight  : int  2444 1147 2358 2006 693 2304 1127 1836 1158 2023 ...
##  $ age     : int  40 34 27 84 31 43 41 33 28 41 ...
##  $ carage  : int  11 12 10 29 16 14 33 4 12 8 ...
##  $ exposure: num  0.879 0.941 0.961 0.891 0.962 ...

##      Counts         gender             distance         weight    
##  Min.   :0.000   Length:150000      Min.   : 1.00   Min.   : 457  
##  1st Qu.:0.000   Class :character   1st Qu.:18.00   1st Qu.:1318  
##  Median :0.000   Mode  :character   Median :32.00   Median :1813  
##  Mean   :0.146                      Mean   :35.96   Mean   :1918  
##  3rd Qu.:0.000                      3rd Qu.:51.00   3rd Qu.:2457  
##  Max.   :5.000                      Max.   :95.00   Max.   :3992  
##       age            carage         exposure     
##  Min.   :20.00   Min.   : 2.00   Min.   :0.8000  
##  1st Qu.:35.00   1st Qu.:13.00   1st Qu.:0.8496  
##  Median :42.00   Median :23.00   Median :0.8997  
##  Mean   :50.11   Mean   :22.02   Mean   :0.8998  
##  3rd Qu.:71.00   3rd Qu.:31.00   3rd Qu.:0.9499  
##  Max.   :96.00   Max.   :44.00   Max.   :1.0000

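The summary above shows that the variables sit on very different scales (for example, weight ranges from 457 to 3992, whereas exposure lies between 0.80 and 1.00), together with their minimum and maximum values and other key statistics. Claims are rare: the mean of Counts is 0.146 and its third quartile is 0.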

Now, to analyse the shape of these variables and the distribution, we plot the histograms for each variable.

Age: The histogram of age shows two peaks, i.e. a bimodal distribution, which suggests two main age groups of drivers.
Car age: Like driver age, car age has a bimodal histogram, pointing to two prevalent car-age brackets.
Distance: The distribution of distance is positively skewed.
Weight: Weight is also positively skewed, so most observations are concentrated at the lower weights, with a longer tail of heavier cars.
Gender: The bar chart shows that there are more male drivers than female drivers in the dataset.
Exposure: Lastly, the exposure (volume) term varies little across observations, staying within the narrow range from 0.80 to 1.00.

Now, we create a correlation chart to check for any collinearity between these variables:

The correlation plot above shows no prominent correlation among the variables, with correlation values as small as 0.01. The highest correlation is between Counts and carage, which agrees with the intuition that older cars generate more claims. The correlation between Counts and age is also worth noting: at 0.13, it indicates a weak tendency for claim counts to increase with the age of the driver.

Part b: Best Model Selection

With many models available to fit to the data, it is important to select one that satisfies the required conditions. For our dataset, we proceed by fitting a Generalised Linear Model (GLM) from the Poisson family.

We chose a GLM with backward selection for its simplicity and interpretability. Compared with penalised approaches such as LASSO and Ridge, a GLM gives a clear, directly interpretable meaning to the relationship between each covariate and the response. Claim counts are modelled as Poisson distributed, and because the GLM framework accommodates a range of response distributions, it is well suited to this choice.

The model for \(\lambda(x)\) assumes that \(\ln\lambda(x)\) is a combination of linear, quadratic and mixed (interaction) terms in the 5 variables.
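Written out, and judging from the terms listed in the initial model summary further below, the assumed form is as follows. The coefficient symbols \(\beta\), \(\gamma\), \(\delta\) and the exposure symbol \(v\) are notation introduced here for illustration only.

\[\ln\lambda(x) = \beta_0 + \sum_{j=1}^{5}\beta_j x_j + \sum_{j=1}^{4}\gamma_j x_j^2 + \sum_{1\le j<k\le 4}\delta_{jk}\,x_j x_k\]

Gender (\(x_5\)) enters through its linear term only. Because \(\ln(\mathtt{exposure})\) is used as an offset, the expected claim count for an observation with exposure \(v\) is \(v\,\lambda(x)\).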

To find the best model, we use the lowest AIC (Akaike Information Criterion) as the selection criterion to determine which covariates are relevant to the claim count and should be retained. The AIC is defined as: \[\mathrm{AIC} = 2k - 2\ln\widehat{L}\]

where \(k\) is the number of estimated parameters and \(\ln\widehat{L}\) is the log-likelihood evaluated at the maximum likelihood estimate (MLE).
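As a quick check of the definition, the sketch below computes the AIC by hand for a small Poisson GLM and compares it with R's built-in `AIC()`. The data are simulated purely for illustration; the variable names `x` and `y` and the coefficients are invented.

```r
set.seed(1)

# Simulated toy data (illustrative only, not the case-study data).
toy <- data.frame(x = runif(500))
toy$y <- rpois(500, exp(-1 + 0.8 * toy$x))

fit <- glm(y ~ x, family = poisson(link = "log"), data = toy)

# k = number of estimated parameters; logLik() gives the maximised log-likelihood.
k <- length(coef(fit))
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit))

# The hand-computed value should agree with the built-in one.
all.equal(aic_manual, AIC(fit))
```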

Using the stepAIC function, backward selection was performed on the initial model containing all the linear, quadratic and mixed terms. The first summary below shows this initial model; the second shows the model retained after selection, which removed several covariates, together with the remaining coefficients:

## 
## Call:
## glm(formula = model, family = poisson(link = "log"), data = data, 
##     offset = log(exposure))
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.559e+00  1.740e-01 -37.702   <2e-16 ***
## weight           1.102e-04  6.523e-05   1.689   0.0911 .  
## distance         5.203e-03  2.184e-03   2.382   0.0172 *  
## age              4.533e-02  3.728e-03  12.158   <2e-16 ***
## carage           1.750e-01  6.227e-03  28.106   <2e-16 ***
## I(weight^2)      2.926e-09  1.053e-08   0.278   0.7811    
## I(distance^2)   -4.822e-06  1.419e-05  -0.340   0.7339    
## I(age^2)        -2.042e-05  2.759e-05  -0.740   0.4593    
## I(carage^2)     -1.511e-03  1.058e-04 -14.284   <2e-16 ***
## gender           2.717e-02  1.383e-02   1.964   0.0495 *  
## weight:distance  9.521e-07  4.038e-07   2.358   0.0184 *  
## weight:age       5.264e-08  4.510e-07   0.117   0.9071    
## weight:carage    5.170e-07  1.118e-06   0.462   0.6437    
## distance:age    -2.694e-05  1.624e-05  -1.659   0.0970 .  
## distance:carage -1.062e-05  4.041e-05  -0.263   0.7928    
## age:carage      -1.060e-03  4.645e-05 -22.829   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89999  on 149999  degrees of freedom
## Residual deviance: 81661  on 149984  degrees of freedom
## AIC: 122662
## 
## Number of Fisher Scoring iterations: 6

##     (Intercept)          weight        distance             age          carage 
##   -6.559040e+00    1.102013e-04    5.203182e-03    4.532931e-02    1.750280e-01 
##     I(weight^2)   I(distance^2)        I(age^2)     I(carage^2)          gender 
##    2.925551e-09   -4.821993e-06   -2.041825e-05   -1.510534e-03    2.716896e-02 
## weight:distance      weight:age   weight:carage    distance:age distance:carage 
##    9.520771e-07    5.264170e-08    5.170357e-07   -2.694253e-05   -1.061542e-05 
##      age:carage 
##   -1.060374e-03

## 
## Call:
## glm(formula = Counts ~ weight + distance + age + carage + I(carage^2) + 
##     gender + weight:distance + distance:age + age:carage, family = poisson(link = "log"), 
##     data = data, offset = log(exposure))
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.518e+00  9.926e-02 -65.668  < 2e-16 ***
## weight           1.393e-04  1.773e-05   7.854 4.03e-15 ***
## distance         4.433e-03  1.239e-03   3.579 0.000345 ***
## age              4.291e-02  1.370e-03  31.315  < 2e-16 ***
## carage           1.754e-01  5.598e-03  31.343  < 2e-16 ***
## I(carage^2)     -1.511e-03  1.057e-04 -14.294  < 2e-16 ***
## gender           2.719e-02  1.383e-02   1.966 0.049337 *  
## weight:distance  9.533e-07  4.036e-07   2.362 0.018168 *  
## distance:age    -2.562e-05  1.553e-05  -1.650 0.098963 .  
## age:carage      -1.055e-03  4.591e-05 -22.986  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89999  on 149999  degrees of freedom
## Residual deviance: 81662  on 149990  degrees of freedom
## AIC: 122651
## 
## Number of Fisher Scoring iterations: 6

##            Variable   Coefficient
##  1:     (Intercept) -6.518323e+00
##  2:          weight  1.392719e-04
##  3:        distance  4.433349e-03
##  4:             age  4.290980e-02
##  5:          carage  1.754431e-01
##  6:     I(carage^2) -1.511457e-03
##  7:          gender  2.718808e-02
##  8: weight:distance  9.533191e-07
##  9:    distance:age -2.562160e-05
## 10:      age:carage -1.055352e-03
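
The backward-selection step described in this part can be sketched as below. This is a hedged illustration on simulated data, not the fit reported above: the covariate `noise`, the coefficients and the sample size are invented. The summaries above pass the offset through the `offset` argument; placing it inside the formula, as here, is a choice made for this sketch.

```r
library(MASS)  # provides stepAIC()

set.seed(2)
n <- 2000

# Simulated toy data: 'noise' is deliberately unrelated to the response.
toy <- data.frame(
  age      = runif(n, 20, 90),
  carage   = runif(n, 2, 44),
  noise    = rnorm(n),
  exposure = runif(n, 0.8, 1)
)
toy$Counts <- rpois(n, toy$exposure * exp(-3 + 0.02 * toy$age + 0.03 * toy$carage))

# Full model with linear, quadratic and mixed terms plus the exposure offset.
full <- glm(
  Counts ~ age + carage + noise + I(carage^2) + age:carage + offset(log(exposure)),
  family = poisson(link = "log"), data = toy
)

# Backward selection: drop terms one at a time while the AIC keeps decreasing.
reduced <- stepAIC(full, direction = "backward", trace = FALSE)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Backward selection only accepts a deletion when it lowers the AIC, so the reduced model's AIC can never exceed that of the full model.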

Now that we have our best model, as a benchmark for the other parts, we estimate \(\lambda(x)\) with all the variables fixed at \(x_1 = 2000\), \(x_2 = 10\), \(x_3 = 30\), \(x_4 = 5\) and \(x_5\) = male. This gives an estimated \(\lambda(x)=0.01526081\).

We also plot a line chart to show the relationship between distance and \(\lambda(x)\) with all other parameters being fixed.

##          1 
## 0.01526081

The chart above shows a direct, increasing relationship between \(\lambda(x)\) and distance: the line has a positive slope. A plausible explanation is that the more distance a driver covers, the more time is spent on the road and the greater the chance of an accident leading to a claim.

Part c: Clustering on Age and Carage

This part uses the variables age and carage to partition the observations into K distinct, non-overlapping clusters using K-means. Each cluster is characterised by its centroid, the mean of the points assigned to it, and each data point belongs to the cluster with the nearest centroid. The technique seeks clusters that minimise the within-cluster sum of squared distances. To choose K, we compute the within-cluster sum of squares for a range of values of K and plot it against K.

In the chart above, the elbow occurs at around 3 clusters, where the within-cluster sum of squares starts to flatten out. This indicates that the dataset separates well into 3 clusters.
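The elbow calculation can be sketched with base R's `kmeans`. The data here are simulated around three invented centres purely for illustration; the range of K (1 to 6) and `nstart = 20` are assumptions rather than settings taken from the report.

```r
set.seed(123)

# Simulated toy data scattered around three invented centres.
centres <- rbind(c(37.5, 11), c(75.5, 23), c(37.5, 32))
idx <- rep(1:3, each = 300)
toy <- data.frame(
  age    = rnorm(900, mean = centres[idx, 1], sd = 3),
  carage = rnorm(900, mean = centres[idx, 2], sd = 3)
)

# Total within-cluster sum of squares for K = 1, ..., 6.
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Elbow plot: look for the K after which the curve flattens.
plot(1:6, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")

# Final fit with the chosen K, and its centroids.
fit3 <- kmeans(toy, centers = 3, nstart = 20)
fit3$centers
```

Because cluster labels depend on the random starting points, fixing the seed keeps the labels stable between runs.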

To confirm the above, we plot the clusters with unique color for each cluster group. The 3 chosen cluster groups and their respective centroids are clearly visible in the below chart.

##        age   carage
## 1 37.51435 11.03777
## 2 75.51893 22.97871
## 3 37.51188 32.01717

Part d: K-means cluster as the 6th Covariate

From the last part, we had chosen 3 cluster groups and labelled each observation as part of one of the 3 clusters. We now combine this information with our existing dataset and use a Poisson regression to find the best model for our dataset which should include linear and mixed terms only.

By laying out all the possible linear and mixed terms, as in part b, we fit the model with the glm function and apply backward selection with AIC as the criterion to remove covariates that do not improve the fit. The stepAIC function repeats the backward step until the AIC can no longer be reduced. Once this is complete, we are left with the covariates retained for the claim count and their corresponding coefficients.

##   Counts gender distance weight age carage  exposure cluster
## 1      0      1       23   2444  40     11 0.8793250       1
## 2      0      1        8   1147  34     12 0.9407260       1
## 3      0      2       12   2358  27     10 0.9607210       1
## 4      0      2       13   2006  84     29 0.8911884       2
## 5      0      2       44    693  31     16 0.9622051       1
## 6      0      1       35   2304  43     14 0.8637496       1

## 
## Call:
## glm(formula = model_form_extended, family = poisson(link = "log"), 
##     data = data_clus, offset = log(exposure))
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -7.163e+00  2.670e-01 -26.826  < 2e-16 ***
## weight            1.812e-04  6.676e-05   2.715  0.00663 ** 
## distance          6.440e-03  2.467e-03   2.610  0.00904 ** 
## age               3.146e-02  4.080e-03   7.710 1.25e-14 ***
## carage            1.227e-01  1.199e-02  10.232  < 2e-16 ***
## gender            1.648e-01  1.026e-01   1.605  0.10840    
## cluster           1.386e+00  1.148e-01  12.077  < 2e-16 ***
## weight:distance   9.061e-07  4.036e-07   2.245  0.02476 *  
## weight:age       -7.879e-08  5.070e-07  -0.155  0.87649    
## weight:carage     7.797e-07  1.621e-06   0.481  0.63045    
## weight:gender    -2.171e-05  1.785e-05  -1.216  0.22394    
## weight:cluster   -9.027e-06  2.285e-05  -0.395  0.69276    
## distance:gender   2.028e-04  6.422e-04   0.316  0.75215    
## distance:cluster -1.249e-03  8.228e-04  -1.518  0.12912    
## distance:age     -3.862e-05  1.820e-05  -2.122  0.03387 *  
## distance:carage   5.408e-05  5.858e-05   0.923  0.35594    
## age:carage       -4.922e-04  8.490e-05  -5.797 6.73e-09 ***
## age:gender       -4.988e-04  8.060e-04  -0.619  0.53602    
## age:cluster      -2.070e-03  1.992e-03  -1.039  0.29880    
## carage:gender     2.420e-03  2.583e-03   0.937  0.34892    
## carage:cluster   -3.541e-02  2.815e-03 -12.580  < 2e-16 ***
## gender:cluster   -5.855e-02  3.644e-02  -1.607  0.10808    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89999  on 149999  degrees of freedom
## Residual deviance: 81231  on 149978  degrees of freedom
## AIC: 122244
## 
## Number of Fisher Scoring iterations: 6

## 
## Call:
## glm(formula = Counts ~ weight + distance + age + carage + gender + 
##     cluster + weight:distance + distance:age + age:carage + carage:cluster, 
##     family = poisson(link = "log"), data = data_clus, offset = log(exposure))
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.760e+00  1.340e-01 -50.456  < 2e-16 ***
## weight           1.413e-04  1.773e-05   7.969 1.60e-15 ***
## distance         4.497e-03  1.240e-03   3.627 0.000287 ***
## age              2.654e-02  1.648e-03  16.103  < 2e-16 ***
## carage           1.364e-01  8.655e-03  15.756  < 2e-16 ***
## gender           2.723e-02  1.383e-02   1.969 0.049009 *  
## cluster          1.159e+00  5.024e-02  23.076  < 2e-16 ***
## weight:distance  9.092e-07  4.034e-07   2.254 0.024197 *  
## distance:age    -2.504e-05  1.557e-05  -1.608 0.107872    
## age:carage      -5.410e-04  7.113e-05  -7.606 2.83e-14 ***
## carage:cluster  -3.674e-02  2.504e-03 -14.676  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 89999  on 149999  degrees of freedom
## Residual deviance: 81239  on 149989  degrees of freedom
## AIC: 122230
## 
## Number of Fisher Scoring iterations: 6

##            Variable   Coefficient
##  1:     (Intercept) -6.759578e+00
##  2:          weight  1.412614e-04
##  3:        distance  4.497108e-03
##  4:             age  2.653666e-02
##  5:          carage  1.363712e-01
##  6:          gender  2.722726e-02
##  7:         cluster  1.159226e+00
##  8: weight:distance  9.092433e-07
##  9:    distance:age -2.503650e-05
## 10:      age:carage -5.409778e-04
## 11:  carage:cluster -3.674465e-02

Using the above coefficients and covariates, we form our final model, dropping the interaction terms removed by the selection; this model will be used in later parts.

In addition to the parameters set in part b, we also need the cluster group implied by the chosen values of age and carage. The point \(age=30\), \(carage=5\) falls in the 1st cluster group, shown in red in the cluster plot. The estimate of \(\lambda(x)\) from the new best model including the clusters is \(0.04884311\).

Now, as in part b, we keep the other parameters fixed and plot the relationship between distance and \(\lambda(x)\) for all distance values, from its minimum (1) to its maximum (95). With the new cluster variable, we plot this relationship for all 3 cluster groups for comparison.
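Assigning a new point to a cluster amounts to finding its nearest centroid. The sketch below does this with the centroids printed in part c; the object names are illustrative, and squared Euclidean distance on the unscaled variables is assumed.

```r
# Centroids as printed in part c (rows correspond to clusters 1, 2, 3).
centroids <- rbind(
  c(age = 37.51435, carage = 11.03777),
  c(age = 75.51893, carage = 22.97871),
  c(age = 37.51188, carage = 32.01717)
)

new_point <- c(age = 30, carage = 5)

# Squared Euclidean distance from the new point to each centroid.
sq_dist <- rowSums(sweep(centroids, 2, new_point)^2)

# Index of the nearest centroid, i.e. the assigned cluster.
nearest <- which.min(sq_dist)
nearest  # 1
```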

##          1 
## 0.04884311

From the above chart, it is evident that cluster 2 has the lowest \(\lambda(x)\) and cluster 3 the highest. This is expected, as cluster 3 has a centre of 75.5 years for the age of the driver and 23 years for the age of the car: an old driver with a relatively old car will show a high \(\lambda(x)\). Similarly, cluster 2, whose centroid has a driver age of 37.5 years and a comparatively newer car, is expected to contribute less to claim counts, which is reflected in the lowest \(\lambda(x)\).

Part e: Poisson Regression Tree model

This part estimates \(\lambda(x)\) using a Poisson regression tree. As the name suggests, this technique repeatedly partitions the data, creating branches that are followed to assign each observation to a terminal node. Because they can be visualised, regression trees are easy to understand and interpret: each node records a split and reports information such as the number of observations, the number of events and the estimated rate. Using the rpart function, we fit the initial tree with xval = 10, minbucket = 10000 and an initial cp of 0.0000001. Depending on the size of the data, minbucket and the other parameters can be adjusted, which may lead to a different number of terminal nodes.

## Call:
## rpart(formula = cbind(exposure, Counts) ~ weight + distance + 
##     carage + age + gender, data = data, method = "poisson", parms = list(shrink = 1), 
##     control = rpart.control(xval = 10, minbucket = 10000, cp = 1e-07))
##   n= 150000 
## 
##             CP nsplit rel error    xerror        xstd
## 1 0.0625694937      0 1.0000000 1.0000138 0.004194198
## 2 0.0191915427      1 0.9374305 0.9384286 0.004061468
## 3 0.0032577960      2 0.9182390 0.9240704 0.003965530
## 4 0.0021087866      3 0.9149812 0.9212613 0.003960397
## 5 0.0006075031      4 0.9128724 0.9178082 0.003952378
## 6 0.0004445469      5 0.9122649 0.9173828 0.003952153
## 7 0.0000001000      6 0.9118203 0.9171723 0.003951285
## 
## Variable importance
##   carage      age   weight distance 
##       71       24        4        1 
## 
## Node number 1: 150000 observations,    complexity param=0.06256949
##   events=21906,  estimated rate=0.1623092 , mean deviance=0.5999927 
##   left son=2 (57002 obs) right son=3 (92998 obs)
##   Primary splits:
##       carage   < 16.5   to the left,  improve=5631.186000, (0 missing)
##       age      < 54.5   to the left,  improve=2591.392000, (0 missing)
##       weight   < 2285.5 to the left,  improve= 317.037200, (0 missing)
##       distance < 57.5   to the left,  improve= 162.785000, (0 missing)
##       gender   < 1.5    to the left,  improve=   2.981832, (0 missing)
##   Surrogate splits:
##       weight < 3976   to the right, agree=0.62, adj=0, (0 split)
## 
## Node number 2: 57002 observations,    complexity param=0.01919154
##   events=3290,  estimated rate=0.06416035 , mean deviance=0.3375049 
##   left son=4 (46821 obs) right son=5 (10181 obs)
##   Primary splits:
##       age      < 48.5   to the left,  improve=1.727218e+03, (0 missing)
##       carage   < 13.5   to the left,  improve=2.360167e+02, (0 missing)
##       weight   < 1691.5 to the left,  improve=4.372351e+01, (0 missing)
##       distance < 56.5   to the left,  improve=3.903214e+01, (0 missing)
##       gender   < 1.5    to the left,  improve=6.304037e-03, (0 missing)
##   Surrogate splits:
##       weight < 3971   to the left,  agree=0.821, adj=0, (0 split)
## 
## Node number 3: 92998 observations,    complexity param=0.003257796
##   events=18616,  estimated rate=0.2224689 , mean deviance=0.7003297 
##   left son=6 (66578 obs) right son=7 (26420 obs)
##   Primary splits:
##       weight   < 2350.5 to the left,  improve=293.198100, (0 missing)
##       age      < 59.5   to the left,  improve=276.928500, (0 missing)
##       distance < 22.5   to the left,  improve=125.684100, (0 missing)
##       carage   < 20.5   to the left,  improve= 51.336590, (0 missing)
##       gender   < 1.5    to the left,  improve=  4.228065, (0 missing)
##   Surrogate splits:
##       age < 95.5   to the left,  agree=0.716, adj=0, (0 split)
## 
## Node number 4: 46821 observations,    complexity param=0.0004445469
##   events=1653,  estimated rate=0.03926357 , mean deviance=0.2380483 
##   left son=8 (20694 obs) right son=9 (26127 obs)
##   Primary splits:
##       weight   < 1691.5 to the left,  improve=40.009950, (0 missing)
##       distance < 17.5   to the left,  improve=25.550930, (0 missing)
##       age      < 33.5   to the right, improve=10.151380, (0 missing)
##       carage   < 8.5    to the left,  improve= 3.617623, (0 missing)
##       gender   < 1.5    to the left,  improve= 1.285290, (0 missing)
##   Surrogate splits:
##       age      < 21.5   to the left,  agree=0.558, adj=0, (0 split)
##       distance < 93.5   to the right, agree=0.558, adj=0, (0 split)
## 
## Node number 5: 10181 observations
##   events=1637,  estimated rate=0.1785488 , mean deviance=0.6252406 
## 
## Node number 6: 66578 observations,    complexity param=0.002108787
##   events=12255,  estimated rate=0.2045565 , mean deviance=0.6707115 
##   left son=12 (37409 obs) right son=13 (29169 obs)
##   Primary splits:
##       age      < 59.5   to the left,  improve=189.78850, (0 missing)
##       distance < 39.5   to the left,  improve= 79.42493, (0 missing)
##       weight   < 1297.5 to the left,  improve= 56.78335, (0 missing)
##       carage   < 22.5   to the left,  improve= 12.09155, (0 missing)
##       gender   < 1.5    to the left,  improve=  4.63418, (0 missing)
##   Surrogate splits:
##       carage < 27.5   to the right, agree=0.77, adj=0.475, (0 split)
## 
## Node number 7: 26420 observations
##   events=6361,  estimated rate=0.2675945 , mean deviance=0.7638694 
## 
## Node number 8: 20694 observations
##   events=604,  estimated rate=0.03248432 , mean deviance=0.2076103 
## 
## Node number 9: 26127 observations
##   events=1049,  estimated rate=0.04466688 , mean deviance=0.2606255 
## 
## Node number 12: 37409 observations,    complexity param=0.0006075031
##   events=6128,  estimated rate=0.1819975 , mean deviance=0.6301997 
##   left son=24 (22582 obs) right son=25 (14827 obs)
##   Primary splits:
##       distance < 39.5   to the left,  improve=54.674630, (0 missing)
##       carage   < 31.5   to the left,  improve=43.488650, (0 missing)
##       weight   < 1233.5 to the left,  improve=27.131100, (0 missing)
##       gender   < 1.5    to the left,  improve= 1.247796, (0 missing)
##       age      < 35.5   to the right, improve= 0.452618, (0 missing)
##   Surrogate splits:
##       carage < 43.5   to the left,  agree=0.604, adj=0, (0 split)
## 
## Node number 13: 29169 observations
##   events=6127,  estimated rate=0.2334934 , mean deviance=0.7161612 
## 
## Node number 24: 22582 observations
##   events=3415,  estimated rate=0.1679898 , mean deviance=0.6034629 
## 
## Node number 25: 14827 observations
##   events=2713,  estimated rate=0.2033281 , mean deviance=0.6672331

Now, to remove the insignificant nodes, we compare the deviance before and after their removal. To assess this, we plot the relative error and the cross-validation error against the complexity parameter.

From the above graph, we can estimate the optimal cp parameter, which should avoid both over-fitting and large errors. Our estimate of the optimal complexity parameter is \(0.002108787\). Using this optimal cp, we prune the tree and plot the pruned regression tree.
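Fitting and pruning a Poisson tree can be sketched as below, on simulated data. The report does not state which rule produced its optimal cp, so the one-standard-error rule used here is an assumption, as are the toy data, the invented rate rule and the smaller `minbucket`.

```r
library(rpart)

set.seed(42)
n <- 5000

# Simulated toy data: the claim rate depends only on carage (an invented rule).
toy <- data.frame(
  age      = sample(20:96, n, replace = TRUE),
  carage   = sample(2:44, n, replace = TRUE),
  exposure = runif(n, 0.8, 1)
)
toy$Counts <- rpois(n, toy$exposure * ifelse(toy$carage < 16.5, 0.06, 0.22))

# Poisson tree: the response is cbind(exposure, number of events).
tree <- rpart(
  cbind(exposure, Counts) ~ age + carage,
  data = toy, method = "poisson", parms = list(shrink = 1),
  control = rpart.control(xval = 10, minbucket = 500, cp = 1e-7)
)

# One-standard-error rule (assumed): smallest tree whose cross-validated
# error is within one standard error of the minimum.
cp_tab    <- tree$cptable
best      <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
cp_choice <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]

pruned <- prune(tree, cp = cp_choice)
pruned
```

Calling `plotcp(tree)` would draw the cross-validated error against cp, which is the kind of graph referred to above.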

##          1 
## 0.03926357

In the pruned tree above, the optimal complexity parameter gives 3 splits and 4 terminal nodes. The variables used in these splits are carage, age and weight. This indicates that distance does not have a significant impact on the response and, as a result, it does not appear in the pruned tree.
To compare against the benchmark calculated in part b, we use the pruned tree to estimate \(\lambda(x)\) at the same fixed parameter values, which gives 0.03926357.

As discussed above, distance has no impact on the splits and therefore on \(\lambda(x)\). This is confirmed by plotting \(\lambda(x)\) against distance: a flat line with zero slope shows a constant \(\lambda(x)\) of about 0.039 over the whole range of distance.

Part f: Poisson Boosting with no base model

For this part, we use Poisson boosting, which requires us to split the data into training and testing sets. The method combines the information from many trees grown sequentially, each one building on those grown before. This makes Poisson boosting a potentially better predictor than the single regression tree used previously, which relies on only a few terminal nodes. It can be beneficial in situations where the variance of the response exceeds its mean: the boosting process adds flexibility relative to the traditional Poisson model, which helps to address this over-dispersion.

This flexibility is controlled by 3 key parameters of the boosting model: the depth, the number of iterations and the shrinkage parameter. The number of iterations sets how many trees are fitted. The shrinkage parameter controls how quickly the procedure learns from one tree to the next. The interaction depth, or depth, sets the number of splits in each tree. Values that are too large or too small can lead to over-fitting or under-fitting and reduce the reliability of the model.

To find good values of these 3 hyper-parameters, we run a for loop over different combinations of shrinkage, depth and boosting steps and calculate the out-of-sample error for each. The optimal parameters are those that give the minimum out-of-sample error. We obtain the following set of optimal parameters:
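The search can be sketched as below. This is one assumed way of implementing Poisson boosting, using rpart trees fitted to a working exposure; it is not necessarily the implementation in the Rmd file. The helper names `poisson_deviance` and `boost_poisson`, the toy data, the small grid and the use of mean Poisson deviance as the error are all illustrative assumptions. Passing starting rates through `init_train` and `init_test` corresponds to boosting from a base model.

```r
library(rpart)

# Mean Poisson deviance, used here as the error measure (an assumption).
poisson_deviance <- function(counts, mu) {
  term <- ifelse(counts > 0, counts * log(counts / mu), 0)
  2 * mean(mu - counts + term)
}

set.seed(1)
n <- 4000
toy <- data.frame(age = runif(n, 20, 90), carage = runif(n, 2, 44),
                  exposure = runif(n, 0.8, 1))
toy$Counts <- rpois(n, toy$exposure * exp(-3 + 0.02 * toy$age + 0.03 * toy$carage))
train <- toy[1:3000, ]
test  <- toy[3001:4000, ]

# Returns the out-of-sample error after each boosting step.
boost_poisson <- function(train, test, depth, shrinkage, steps,
                          init_train = NULL, init_test = NULL) {
  base_rate <- sum(train$Counts) / sum(train$exposure)
  lam_train <- if (is.null(init_train)) rep(base_rate, nrow(train)) else init_train
  lam_test  <- if (is.null(init_test)) rep(base_rate, nrow(test)) else init_test
  errors <- numeric(steps)
  for (m in seq_len(steps)) {
    # Working exposure = exposure times the current rate estimate.
    train$work <- train$exposure * lam_train
    tree <- rpart(cbind(work, Counts) ~ age + carage, data = train,
                  method = "poisson",
                  control = rpart.control(maxdepth = depth, cp = 1e-5,
                                          xval = 0, minbucket = 100))
    # Multiplicative update, damped by the shrinkage parameter.
    lam_train <- lam_train * predict(tree, newdata = train)^shrinkage
    lam_test  <- lam_test * predict(tree, newdata = test)^shrinkage
    errors[m] <- poisson_deviance(test$Counts, test$exposure * lam_test)
  }
  errors
}

# Grid search without a base model.
grid <- expand.grid(depth = 1:2, shrinkage = c(0.25, 0.5))
grid$best_step <- NA
grid$min_error <- NA
for (i in seq_len(nrow(grid))) {
  err <- boost_poisson(train, test, grid$depth[i], grid$shrinkage[i], steps = 10)
  grid$best_step[i] <- which.min(err)
  grid$min_error[i] <- min(err)
}
grid[which.min(grid$min_error), ]

# Variant with a base model: start from the fitted rates of a Poisson GLM.
base_glm <- glm(Counts ~ age + carage + offset(log(exposure)),
                family = poisson(link = "log"), data = train)
rate_train <- predict(base_glm, newdata = train, type = "response") / train$exposure
rate_test  <- predict(base_glm, newdata = test, type = "response") / test$exposure
err_base <- boost_poisson(train, test, depth = 1, shrinkage = 0.5, steps = 5,
                          init_train = rate_train, init_test = rate_test)
```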

##     Depth Shrinkage Iteration Min error
## 666     3       0.7        16 0.5381935

At the optimal depth, the out-of-sample and in-sample errors are also low across all iterations.

Now, with the optimal parameters known, we predict the \(\lambda(x)\) using the set parameters from part b.

## Predicted lambda x = 0.04017778

Part g: Poisson Boosting with base model

This method is very similar to the one used in part f. Here, however, a base model is supplied: the backward-selected GLM chosen in part b. We start the boosting from this model, adjust the boosting step formula to accommodate the change, and recalculate the optimal parameters.

Looking at the plots of the out-of-sample and in-sample errors for our optimal choice, we can confirm that this choice of depth, boosting steps and shrinkage results in the minimum error.

##     Depth Shrinkage Iteration Min error
## 769     4       0.1        19 0.5395754

Now, using the optimal parameters and the set variables, we estimate the \(\lambda(x)\).

Furthermore, to show the relation between \(\lambda(x)\) and distance:

## Predicted lambda x = 0.01558776

Part h: K-fold cross validation comparison

Using k-fold cross-validation, we calculate the cross-validation (CV) error for all 5 models.

The comparison shows that Model 2, the GLM with clustering, has the lowest cross-validation error (see the table below). The reasoning behind this result is that Model 2 adds a new variable \(x_6\) that assigns every observation to a cluster group based on its age and carage. Because each cluster groups similar observations close to their relevant centroid, the accuracy of the model increases and the CV error falls.

On the other hand, the model with the highest CV error is the Poisson boosting tree with no base model. Without a base model as a starting point, boosting can be unstable or fail to converge. There is also a possibility of over-fitting to the training data, which would explain the higher CV error.
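A k-fold CV error for a Poisson GLM can be sketched as below. The number of folds (K = 5), the use of mean Poisson deviance as the error and the simulated data are assumptions made for illustration; the same loop would apply to any of the models by swapping the fit and predict steps.

```r
# Mean Poisson deviance, used here as the CV error (an assumption).
poisson_deviance <- function(counts, mu) {
  term <- ifelse(counts > 0, counts * log(counts / mu), 0)
  2 * mean(mu - counts + term)
}

set.seed(7)
n <- 3000

# Simulated toy data (illustrative only).
toy <- data.frame(age = runif(n, 20, 90), carage = runif(n, 2, 44),
                  exposure = runif(n, 0.8, 1))
toy$Counts <- rpois(n, toy$exposure * exp(-3 + 0.02 * toy$age + 0.03 * toy$carage))

# Randomly assign each observation to one of K folds.
K <- 5
fold <- sample(rep(1:K, length.out = n))

cv_err <- numeric(K)
for (k in 1:K) {
  # Fit on all folds except k, then evaluate on the held-out fold k.
  fit  <- glm(Counts ~ age + carage + offset(log(exposure)),
              family = poisson(link = "log"), data = toy[fold != k, ])
  held <- toy[fold == k, ]
  mu   <- predict(fit, newdata = held, type = "response")
  cv_err[k] <- poisson_deviance(held$Counts, mu)
}

# CV error: average of the K held-out errors.
mean(cv_err)
```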

##                               Model name  CV Error
## 1               GLM - backward selection 0.5445713
## 2                        GLM - Clustered 0.5417893
## 3                        Regression Tree 0.5494924
## 4   Poison Boosting Tree (No base model) 0.7372893
## 5 Poison Boosting Tree (with base model) 0.5442835

46428739ChhabraAarushiCaseStudy

2023-10-16