For this workbook we will be applying linear regression to determine the factors that play a role in insurance pricing. During this workshop we will:

Preliminary Steps

We first need to load our data into the R environment. The data is stored in a file called insurance_dat.rda.

load("G:/My Drive/MSBA-SA/Statistics Toolkit/insurance_dat.rda") # Load Dataset

We also need to load the packages which we are going to use for this analysis into R.

library(ggplot2) # Load ggplot for plotting
## Warning: package 'ggplot2' was built under R version 4.2.3

We see that we have 1338 samples and eight variables in the dataset which are:

Log-transform

To provide a more meaningful response variable and reduce the effect of outliers we can transform the data. One way to do this is to use log transform. Log transform reduces the variance of very high values in the dataset and is often useful for dealing with skewed variables. Look what happens when we log transform the values of 1, 10, 100, 1,000, and 10,000.

log(1) # Return log 1
## [1] 0
log(10) # Return log 10
## [1] 2.302585
log(100) # Return log 100
## [1] 4.60517
log(1000) # Return log 1000
## [1] 6.907755
log(10000) # Return log 10000
## [1] 9.21034
log(100000) # Return log 100000
## [1] 11.51293

Here we see that all the variables are now less than 12. Note that as the log transform value of 1 is 0 we will add 1 to each of the values prior to log transforming them:

log_charges <- log(insurance$charges + 1) # Log transform insurance charges

summary(log_charges) # Summarize log transformed variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.024   8.465   9.147   9.100   9.721  11.063
insurance$log_charges <- log_charges # Add variable to dataset

First Model

As a first attempt at determining the factors involved in insurance prices let’s look at the age of the individual. We hypothesize that higher age leads to a higher price for health insurance given the medical complications that come with old age. First let’s create a scatter plot of the log transformed charges vs the age of the individual:

# Create plot
g_3 <- ggplot(insurance, # Set dataset 
              aes(y = log_charges, # Set y-axis as insurance charges 
                  x = age)) + # Set x-axis as age.
  geom_point(color = "blue", alpha = 0.3) + # Use geom_point to get scatter plot
  theme_bw() + # Set theme for plot
  theme(panel.grid.major = element_blank(), # Turn of the background grid
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank()) +
  labs(y = "Log(Insurance Charges + 1)", # Set plot labels
       x = "Age",
       title = "Insurance Charges v Age")

g_3 # Generate plot

It looks like there is a positive relationship between age and insurance charges, let’s add a smoothing line, geom_smooth() to identify patterns in the relationship.

# Create plot
g_4 <- ggplot(insurance, # Set dataset 
              aes(y = log_charges, # Set y-axis as insurance charges 
                  x = age)) + # Set x-axis as age.
  geom_point(color = "blue", alpha = 0.3) + # Use geom_point to get scatter plot
  geom_smooth(method = "lm") + # Add smoothing line
  theme_bw() + # Set theme for plot
  theme(panel.grid.major = element_blank(), # Turn of the background grid
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank()) +
  labs(y = "Log(Insurance Charges + 1)", # Set plot labels
       x = "Age",
       title = "Insurance Charges v Age")

g_4 # Generate plot
## `geom_smooth()` using formula = 'y ~ x'

From this it does look like there is quite a significant relationship between the variables with those of higher ages paying a higher premium.

Lets try fitting a model and see what the results say.

Fitting a linear regression model

To fit a linear regression model we use the lm command. Inside this function we need to set the formula which is in the form of “response variable ~ explanatory variables”, in our case this is log_charges ~ age. We also need to specify the dataset which we are using for the model, insurance.

fit_1 <- lm(log_charges ~ age, # Set formula
            data = insurance) # Set dataset

We can then access the results of our model by running the summary command on the fitted model:

summary(fit_1)
## 
## Call:
## lm(formula = log_charges ~ age, data = insurance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3451 -0.4160 -0.3095  0.5015  2.1973 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.748453   0.063380  122.25   <2e-16 ***
## age         0.034461   0.001521   22.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.781 on 1335 degrees of freedom
## Multiple R-squared:  0.2776, Adjusted R-squared:  0.2771 
## F-statistic: 513.1 on 1 and 1335 DF,  p-value: < 2.2e-16

This prints out quite a bit of information about the model. We are most interested in the estimated coefficients. From this we can see that our intercept value (beta 0) is 7.744703 and the coefficient for inflation (beta 1) is 0.034538. Thus there does seem to be a positive relationship between the age of the individual and the insurance premium they pay.

We must next ask is this relationship statistically significant. In the column Pr(>|t|) the p-values for our coefficients are reported. p-values measure significance in statistics and can be considered as the probability that chance alone produced the relationship. The p-value of the coefficient for inflation is “2e-16” or 0.0000000000000002. Thus it is unlikely that the relationship between age and insurance premium is due to chance.

The final numbers we are interested in from the summary read out are the “Multiple R-squared” and “Adjusted R-squared” values. R-squared is the proportion of the variation in the response variable which is explained by the explanatory variables while adjusted R-squared is a modified version which also accounts for the number of variables in a model. The values for these for our model are 0.2786 and 0.278 respectively. This indicates that approximately 3% of the variation in the insurance premium is explained by our model at present. So, while we have managed to identify a statistically significant relationship between age and insurance premium this model is only explaining about a quarter of the variation in the data and is likely to have poor predictive performance.

More variables

Here we see that there is not much separation between the two distributions, with men having more extreme values of each side. This is not likely to be a significant variable.

Multiple Linear Regression

To specify a multiple linear regression model we use the same lm() function as before, however, now our formula will be log_charges ~ age + sex + bmi + children + smoker + region.

fit_2 <- lm(log_charges ~ age + sex + bmi + children + smoker + region, # Set formula
            data = insurance) # Set dataset

Now that we have fitted the linear regression model we again use the summary() command to examine the model fit.

summary(fit_2)
## 
## Call:
## lm(formula = log_charges ~ age + sex + bmi + children + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07371 -0.19623 -0.04972  0.06656  2.16460 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.0319900  0.0723707  97.166  < 2e-16 ***
## age              0.0345317  0.0008724  39.584  < 2e-16 ***
## sexmale         -0.0745350  0.0244008  -3.055  0.00230 ** 
## bmi              0.0134005  0.0020953   6.396 2.21e-10 ***
## children         0.1015096  0.0100986  10.052  < 2e-16 ***
## smokeryes        1.5535850  0.0302707  51.323  < 2e-16 ***
## regionnorthwest -0.0620360  0.0349192  -1.777  0.07587 .  
## regionsoutheast -0.1572671  0.0350688  -4.485 7.94e-06 ***
## regionsouthwest -0.1289352  0.0350131  -3.682  0.00024 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4441 on 1328 degrees of freedom
## Multiple R-squared:  0.7676, Adjusted R-squared:  0.7662 
## F-statistic: 548.4 on 8 and 1328 DF,  p-value: < 2.2e-16

Here we see that many of the new variables are significant in our model. With age, bmi, children, and smoker-yes having a positive relationship with the insurance premium. While sex male has a negative relationship.

You will note that the values sex-female, smoker-no, and the north east region do not appear in the results. This is because the linear regression function converts categorical variables into binaries and leaves out one of the levels. The effect of these variables is then absorbed into the intercept term and other coefficients should be interpreted relative to them. E.g. smokers pay 1.5541456 more logged charges than non-smokers, and those in the northwest, southeast, and southwest regions pay less than those in the northeast region.

We have significantly increased the proportion of the variation in insurance charges that we are explaining with the multiple R-squared and adjusted R-squared now having values of 0.768 and 0.766 respectively.

Lets try adding some interaction terms into our model and see if that improves our model fit.

Interaction terms

Interaction terms are used when we believe that the relationship of one variable and the response variable depends on the value of another variable.

Let’s look again at the relationship between BMI and charges:

# Create plot
g_11 <- ggplot(insurance, # Set dataset 
              aes(y = log_charges, # Set y-axis as insurance charges 
                  x = bmi)) + # Set x-axis as bmi.
  geom_point(color = "blue", alpha = 0.3) + # Use geom_point to get scatter plot
  geom_smooth(method = "lm") + # Add smoothing line
  theme_bw() + # Set theme for plot
  theme(panel.grid.major = element_blank(), # Turn of the background grid
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank()) +
  labs(y = "Log(Insurance Charges + 1)", # Set plot labels
       x = "BMI",
       title = "Insurance Charges v BMI")

g_11 # Generate plot
## `geom_smooth()` using formula = 'y ~ x'

We saw previously that this was not a strong relationship though it does appear to be positive.

However, if we think that the relationship between BMI and Insurance Premium is different depending on if someone smokes or doesn’t smoke we could also add that to the plot. Let’s see what happens when we color the points on this plot using the smoker variable:

# Create plot
g_12 <- ggplot(insurance, # Set dataset 
              aes(y = log_charges, # Set y-axis as insurance charges 
                  x = bmi, # Set x-axis as bmi.
                  color = smoker)) + 
  geom_point(alpha = 0.3) + # Use geom_point to get scatter plot
  geom_smooth(method = "lm") + # Add smoothing line
  theme_bw() + # Set theme for plot
  theme(panel.grid.major = element_blank(), # Turn of the background grid
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank()) +
  scale_color_manual(values =  c("no" = "blue", "yes" = "red")) + # Set color values manually
  labs(y = "Log(Insurance Charges + 1)", # Set plot labels
       x = "BMI",
       color = "Smoking Status",
       title = "Insurance Charges v BMI and Smoker")

g_12 # Generate plot
## `geom_smooth()` using formula = 'y ~ x'

So here we see that when someone smokes there’s a very strong positive relationship between BMI and insurance premium, however if someone doesn’t smoke then the relationship between BMI and insurance premium is still positive but much weaker. So here the relationship between BMI and insurance premium is different depending on the smoking variable. Therefore it would be a good idea to include the smoking * BMI interaction term into a model of insurance premiums.

Let’s add this term to our model and see if we get any improvement in explanatory power. To do this, we multiply the terms in the formula instead of adding them together, our formula is now: log_charges ~ age + sex + bmi * smoker + children + region

fit_3 <- lm(log_charges ~ age + sex + bmi * smoker + children + region, # Set formula
            data = insurance) # Set dataset

Now that we have fitted the linear regression model we again use the summary() command to examine the model fit.

summary(fit_3)
## 
## Call:
## lm(formula = log_charges ~ age + sex + bmi * smoker + children + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91344 -0.17755 -0.05193  0.04739  2.22141 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.3386448  0.0766432  95.751  < 2e-16 ***
## age              0.0347462  0.0008432  41.209  < 2e-16 ***
## sexmale         -0.0861980  0.0236065  -3.651 0.000271 ***
## bmi              0.0034354  0.0022668   1.516 0.129873    
## smokeryes        0.1563102  0.1459375   1.071 0.284330    
## children         0.1028083  0.0097583  10.535  < 2e-16 ***
## regionnorthwest -0.0694167  0.0337477  -2.057 0.039888 *  
## regionsoutheast -0.1627921  0.0338886  -4.804 1.73e-06 ***
## regionsouthwest -0.1374913  0.0338413  -4.063 5.13e-05 ***
## bmi:smokeryes    0.0455543  0.0046614   9.773  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4291 on 1327 degrees of freedom
## Multiple R-squared:  0.7832, Adjusted R-squared:  0.7818 
## F-statistic: 532.8 on 9 and 1327 DF,  p-value: < 2.2e-16

We see now that the multiple R-squared value and the adjusted R-squared value have increased to 0.7835 and 0.7821 respectively. In addition the smoker and BMI coefficients are no longer significant as their effect has been captured in the interaction term which has a significant positive effect.

Higher Order Interactions

We can also create higher level interaction terms by multiplying multiple terms together:

fit_5 <- lm(log_charges ~  sex + children + bmi * smoker * age * region, # Set formula
            data = insurance) # Set dataset

Let’s view how the results of the model:

summary(fit_5) # Summarize model 5
## 
## Call:
## lm(formula = log_charges ~ sex + children + bmi * smoker * age * 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69124 -0.15784 -0.06371  0.00325  2.44532 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        6.480e+00  3.690e-01  17.558  < 2e-16 ***
## sexmale                           -8.352e-02  2.123e-02  -3.935 8.77e-05 ***
## children                           1.046e-01  8.814e-03  11.863  < 2e-16 ***
## bmi                                3.080e-02  1.243e-02   2.477   0.0134 *  
## smokeryes                          1.377e+00  7.525e-01   1.830   0.0675 .  
## age                                5.213e-02  9.021e-03   5.779 9.38e-09 ***
## regionnorthwest                    7.507e-01  5.373e-01   1.397   0.1626    
## regionsoutheast                    6.610e-01  5.210e-01   1.269   0.2048    
## regionsouthwest                   -1.338e-01  5.105e-01  -0.262   0.7933    
## bmi:smokeryes                      4.013e-02  2.593e-02   1.547   0.1220    
## bmi:age                           -5.269e-04  2.988e-04  -1.763   0.0781 .  
## smokeryes:age                     -3.197e-02  1.981e-02  -1.614   0.1068    
## bmi:regionnorthwest               -3.097e-02  1.826e-02  -1.696   0.0901 .  
## bmi:regionsoutheast               -3.901e-02  1.647e-02  -2.368   0.0180 *  
## bmi:regionsouthwest               -1.319e-02  1.709e-02  -0.772   0.4405    
## smokeryes:regionnorthwest         -1.158e+00  1.346e+00  -0.861   0.3894    
## smokeryes:regionsoutheast         -1.556e-01  1.038e+00  -0.150   0.8809    
## smokeryes:regionsouthwest          1.067e-01  1.270e+00   0.084   0.9331    
## age:regionnorthwest               -1.610e-02  1.324e-02  -1.215   0.2245    
## age:regionsoutheast               -8.930e-03  1.279e-02  -0.698   0.4853    
## age:regionsouthwest                5.666e-03  1.235e-02   0.459   0.6465    
## bmi:smokeryes:age                  7.782e-05  6.674e-04   0.117   0.9072    
## bmi:smokeryes:regionnorthwest      4.188e-02  4.595e-02   0.911   0.3623    
## bmi:smokeryes:regionsoutheast      1.723e-02  3.328e-02   0.518   0.6046    
## bmi:smokeryes:regionsouthwest      1.074e-02  4.135e-02   0.260   0.7951    
## bmi:age:regionnorthwest            6.167e-04  4.434e-04   1.391   0.1645    
## bmi:age:regionsoutheast            5.960e-04  3.989e-04   1.494   0.1353    
## bmi:age:regionsouthwest            1.000e-04  4.039e-04   0.248   0.8045    
## smokeryes:age:regionnorthwest      2.266e-02  3.378e-02   0.671   0.5025    
## smokeryes:age:regionsoutheast      2.911e-03  2.615e-02   0.111   0.9114    
## smokeryes:age:regionsouthwest     -9.415e-03  3.397e-02  -0.277   0.7817    
## bmi:smokeryes:age:regionnorthwest -8.028e-04  1.140e-03  -0.704   0.4815    
## bmi:smokeryes:age:regionsoutheast -3.043e-04  8.332e-04  -0.365   0.7150    
## bmi:smokeryes:age:regionsouthwest  7.916e-05  1.084e-03   0.073   0.9418    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3833 on 1303 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8259 
## F-statistic: 193.1 on 33 and 1303 DF,  p-value: < 2.2e-16

Note that this can rapidly increase the amount of terms that need to included in the model and harm interpretability. Though out multiple R-squared and adjusted R-squared did increase quite a bit to 0.8304 and 0.8261 respectively.