For this workbook we will be applying linear regression to determine the factors that play a role in insurance pricing. During this workshop we will:
We first need to load our data into the R environment. The data is
stored in a file called insurance_dat.rda.
load("G:/My Drive/MSBA-SA/Statistics Toolkit/insurance_dat.rda") # Load Dataset
We also need to load the packages which we are going to use for this analysis into R.
library(ggplot2) # Load ggplot for plotting
## Warning: package 'ggplot2' was built under R version 4.2.3
We see that we have 1338 samples and eight variables in the dataset which are:
To provide a more meaningful response variable and reduce the effect of outliers we can transform the data. One way to do this is to use log transform. Log transform reduces the variance of very high values in the dataset and is often useful for dealing with skewed variables. Look what happens when we log transform the values of 1, 10, 100, 1,000, and 10,000.
log(1) # Return log 1
## [1] 0
log(10) # Return log 10
## [1] 2.302585
log(100) # Return log 100
## [1] 4.60517
log(1000) # Return log 1000
## [1] 6.907755
log(10000) # Return log 10000
## [1] 9.21034
log(100000) # Return log 100000
## [1] 11.51293
Here we see that all the variables are now less than 12. Note that as the log transform value of 1 is 0 we will add 1 to each of the values prior to log transforming them:
log_charges <- log(insurance$charges + 1) # Log transform insurance charges
summary(log_charges) # Summarize log transformed variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.024 8.465 9.147 9.100 9.721 11.063
insurance$log_charges <- log_charges # Add variable to dataset
As a first attempt at determining the factors involved in insurance prices let’s look at the age of the individual. We hypothesize that higher age leads to a higher price for health insurance given the medical complications that come with old age. First let’s create a scatter plot of the log transformed charges vs the age of the individual:
# Create plot
g_3 <- ggplot(insurance, # Set dataset
aes(y = log_charges, # Set y-axis as insurance charges
x = age)) + # Set x-axis as age.
geom_point(color = "blue", alpha = 0.3) + # Use geom_point to get scatter plot
theme_bw() + # Set theme for plot
theme(panel.grid.major = element_blank(), # Turn of the background grid
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(y = "Log(Insurance Charges + 1)", # Set plot labels
x = "Age",
title = "Insurance Charges v Age")
g_3 # Generate plot
It looks like there is a positive relationship between age and
insurance charges, let’s add a smoothing line,
geom_smooth() to identify patterns in the relationship.
# Create plot
g_4 <- ggplot(insurance, # Set dataset
aes(y = log_charges, # Set y-axis as insurance charges
x = age)) + # Set x-axis as age.
geom_point(color = "blue", alpha = 0.3) + # Use geom_point to get scatter plot
geom_smooth(method = "lm") + # Add smoothing line
theme_bw() + # Set theme for plot
theme(panel.grid.major = element_blank(), # Turn of the background grid
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(y = "Log(Insurance Charges + 1)", # Set plot labels
x = "Age",
title = "Insurance Charges v Age")
g_4 # Generate plot
## `geom_smooth()` using formula = 'y ~ x'
From this it does look like there is quite a significant relationship between the variables with those of higher ages paying a higher premium.
Lets try fitting a model and see what the results say.
To fit a linear regression model we use the lm command.
Inside this function we need to set the formula which is in the form of
“response variable ~ explanatory variables”, in our case this is
log_charges ~ age. We also need to specify the dataset
which we are using for the model, insurance.
fit_1 <- lm(log_charges ~ age, # Set formula
data = insurance) # Set dataset
We can then access the results of our model by running the summary command on the fitted model:
summary(fit_1)
##
## Call:
## lm(formula = log_charges ~ age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3451 -0.4160 -0.3095 0.5015 2.1973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.748453 0.063380 122.25 <2e-16 ***
## age 0.034461 0.001521 22.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.781 on 1335 degrees of freedom
## Multiple R-squared: 0.2776, Adjusted R-squared: 0.2771
## F-statistic: 513.1 on 1 and 1335 DF, p-value: < 2.2e-16
This prints out quite a bit of information about the model. We are most interested in the estimated coefficients. From this we can see that our intercept value (beta 0) is 7.744703 and the coefficient for inflation (beta 1) is 0.034538. Thus there does seem to be a positive relationship between the age of the individual and the insurance premium they pay.
We must next ask is this relationship statistically significant. In
the column Pr(>|t|) the p-values for our coefficients
are reported. p-values measure significance in statistics and can be
considered as the probability that chance alone produced the
relationship. The p-value of the coefficient for inflation is “2e-16” or
0.0000000000000002. Thus it is unlikely that the relationship between
age and insurance premium is due to chance.
The final numbers we are interested in from the summary read out are the “Multiple R-squared” and “Adjusted R-squared” values. R-squared is the proportion of the variation in the response variable which is explained by the explanatory variables while adjusted R-squared is a modified version which also accounts for the number of variables in a model. The values for these for our model are 0.2786 and 0.278 respectively. This indicates that approximately 3% of the variation in the insurance premium is explained by our model at present. So, while we have managed to identify a statistically significant relationship between age and insurance premium this model is only explaining about a quarter of the variation in the data and is likely to have poor predictive performance.
Here we see that there is not much separation between the two distributions, with men having more extreme values of each side. This is not likely to be a significant variable.
To specify a multiple linear regression model we use the same
lm() function as before, however, now our formula will be
log_charges ~ age + sex + bmi + children + smoker + region.
fit_2 <- lm(log_charges ~ age + sex + bmi + children + smoker + region, # Set formula
data = insurance) # Set dataset
Now that we have fitted the linear regression model we again use the
summary() command to examine the model fit.
summary(fit_2)
##
## Call:
## lm(formula = log_charges ~ age + sex + bmi + children + smoker +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07371 -0.19623 -0.04972 0.06656 2.16460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0319900 0.0723707 97.166 < 2e-16 ***
## age 0.0345317 0.0008724 39.584 < 2e-16 ***
## sexmale -0.0745350 0.0244008 -3.055 0.00230 **
## bmi 0.0134005 0.0020953 6.396 2.21e-10 ***
## children 0.1015096 0.0100986 10.052 < 2e-16 ***
## smokeryes 1.5535850 0.0302707 51.323 < 2e-16 ***
## regionnorthwest -0.0620360 0.0349192 -1.777 0.07587 .
## regionsoutheast -0.1572671 0.0350688 -4.485 7.94e-06 ***
## regionsouthwest -0.1289352 0.0350131 -3.682 0.00024 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4441 on 1328 degrees of freedom
## Multiple R-squared: 0.7676, Adjusted R-squared: 0.7662
## F-statistic: 548.4 on 8 and 1328 DF, p-value: < 2.2e-16
Here we see that many of the new variables are significant in our model. With age, bmi, children, and smoker-yes having a positive relationship with the insurance premium. While sex male has a negative relationship.
You will note that the values sex-female, smoker-no, and the north east region do not appear in the results. This is because the linear regression function converts categorical variables into binaries and leaves out one of the levels. The effect of these variables is then absorbed into the intercept term and other coefficients should be interpreted relative to them. E.g. smokers pay 1.5541456 more logged charges than non-smokers, and those in the northwest, southeast, and southwest regions pay less than those in the northeast region.
We have significantly increased the proportion of the variation in insurance charges that we are explaining with the multiple R-squared and adjusted R-squared now having values of 0.768 and 0.766 respectively.
Lets try adding some interaction terms into our model and see if that improves our model fit.
Interaction terms are used when we believe that the relationship of one variable and the response variable depends on the value of another variable.
Let’s look again at the relationship between BMI and charges:
# Create plot
g_11 <- ggplot(insurance, # Set dataset
aes(y = log_charges, # Set y-axis as insurance charges
x = bmi)) + # Set x-axis as bmi.
geom_point(color = "blue", alpha = 0.3) + # Use geom_point to get scatter plot
geom_smooth(method = "lm") + # Add smoothing line
theme_bw() + # Set theme for plot
theme(panel.grid.major = element_blank(), # Turn of the background grid
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(y = "Log(Insurance Charges + 1)", # Set plot labels
x = "BMI",
title = "Insurance Charges v BMI")
g_11 # Generate plot
## `geom_smooth()` using formula = 'y ~ x'
We saw previously that this was not a strong relationship though it does appear to be positive.
However, if we think that the relationship between BMI and Insurance Premium is different depending on if someone smokes or doesn’t smoke we could also add that to the plot. Let’s see what happens when we color the points on this plot using the smoker variable:
# Create plot
g_12 <- ggplot(insurance, # Set dataset
aes(y = log_charges, # Set y-axis as insurance charges
x = bmi, # Set x-axis as bmi.
color = smoker)) +
geom_point(alpha = 0.3) + # Use geom_point to get scatter plot
geom_smooth(method = "lm") + # Add smoothing line
theme_bw() + # Set theme for plot
theme(panel.grid.major = element_blank(), # Turn of the background grid
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
scale_color_manual(values = c("no" = "blue", "yes" = "red")) + # Set color values manually
labs(y = "Log(Insurance Charges + 1)", # Set plot labels
x = "BMI",
color = "Smoking Status",
title = "Insurance Charges v BMI and Smoker")
g_12 # Generate plot
## `geom_smooth()` using formula = 'y ~ x'
So here we see that when someone smokes there’s a very strong positive relationship between BMI and insurance premium, however if someone doesn’t smoke then the relationship between BMI and insurance premium is still positive but much weaker. So here the relationship between BMI and insurance premium is different depending on the smoking variable. Therefore it would be a good idea to include the smoking * BMI interaction term into a model of insurance premiums.
Let’s add this term to our model and see if we get any improvement in
explanatory power. To do this, we multiply the terms in the formula
instead of adding them together, our formula is now:
log_charges ~ age + sex + bmi * smoker + children + region
fit_3 <- lm(log_charges ~ age + sex + bmi * smoker + children + region, # Set formula
data = insurance) # Set dataset
Now that we have fitted the linear regression model we again use the
summary() command to examine the model fit.
summary(fit_3)
##
## Call:
## lm(formula = log_charges ~ age + sex + bmi * smoker + children +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91344 -0.17755 -0.05193 0.04739 2.22141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3386448 0.0766432 95.751 < 2e-16 ***
## age 0.0347462 0.0008432 41.209 < 2e-16 ***
## sexmale -0.0861980 0.0236065 -3.651 0.000271 ***
## bmi 0.0034354 0.0022668 1.516 0.129873
## smokeryes 0.1563102 0.1459375 1.071 0.284330
## children 0.1028083 0.0097583 10.535 < 2e-16 ***
## regionnorthwest -0.0694167 0.0337477 -2.057 0.039888 *
## regionsoutheast -0.1627921 0.0338886 -4.804 1.73e-06 ***
## regionsouthwest -0.1374913 0.0338413 -4.063 5.13e-05 ***
## bmi:smokeryes 0.0455543 0.0046614 9.773 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4291 on 1327 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7818
## F-statistic: 532.8 on 9 and 1327 DF, p-value: < 2.2e-16
We see now that the multiple R-squared value and the adjusted R-squared value have increased to 0.7835 and 0.7821 respectively. In addition the smoker and BMI coefficients are no longer significant as their effect has been captured in the interaction term which has a significant positive effect.
We can also create higher level interaction terms by multiplying multiple terms together:
fit_5 <- lm(log_charges ~ sex + children + bmi * smoker * age * region, # Set formula
data = insurance) # Set dataset
Let’s view how the results of the model:
summary(fit_5) # Summarize model 5
##
## Call:
## lm(formula = log_charges ~ sex + children + bmi * smoker * age *
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69124 -0.15784 -0.06371 0.00325 2.44532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.480e+00 3.690e-01 17.558 < 2e-16 ***
## sexmale -8.352e-02 2.123e-02 -3.935 8.77e-05 ***
## children 1.046e-01 8.814e-03 11.863 < 2e-16 ***
## bmi 3.080e-02 1.243e-02 2.477 0.0134 *
## smokeryes 1.377e+00 7.525e-01 1.830 0.0675 .
## age 5.213e-02 9.021e-03 5.779 9.38e-09 ***
## regionnorthwest 7.507e-01 5.373e-01 1.397 0.1626
## regionsoutheast 6.610e-01 5.210e-01 1.269 0.2048
## regionsouthwest -1.338e-01 5.105e-01 -0.262 0.7933
## bmi:smokeryes 4.013e-02 2.593e-02 1.547 0.1220
## bmi:age -5.269e-04 2.988e-04 -1.763 0.0781 .
## smokeryes:age -3.197e-02 1.981e-02 -1.614 0.1068
## bmi:regionnorthwest -3.097e-02 1.826e-02 -1.696 0.0901 .
## bmi:regionsoutheast -3.901e-02 1.647e-02 -2.368 0.0180 *
## bmi:regionsouthwest -1.319e-02 1.709e-02 -0.772 0.4405
## smokeryes:regionnorthwest -1.158e+00 1.346e+00 -0.861 0.3894
## smokeryes:regionsoutheast -1.556e-01 1.038e+00 -0.150 0.8809
## smokeryes:regionsouthwest 1.067e-01 1.270e+00 0.084 0.9331
## age:regionnorthwest -1.610e-02 1.324e-02 -1.215 0.2245
## age:regionsoutheast -8.930e-03 1.279e-02 -0.698 0.4853
## age:regionsouthwest 5.666e-03 1.235e-02 0.459 0.6465
## bmi:smokeryes:age 7.782e-05 6.674e-04 0.117 0.9072
## bmi:smokeryes:regionnorthwest 4.188e-02 4.595e-02 0.911 0.3623
## bmi:smokeryes:regionsoutheast 1.723e-02 3.328e-02 0.518 0.6046
## bmi:smokeryes:regionsouthwest 1.074e-02 4.135e-02 0.260 0.7951
## bmi:age:regionnorthwest 6.167e-04 4.434e-04 1.391 0.1645
## bmi:age:regionsoutheast 5.960e-04 3.989e-04 1.494 0.1353
## bmi:age:regionsouthwest 1.000e-04 4.039e-04 0.248 0.8045
## smokeryes:age:regionnorthwest 2.266e-02 3.378e-02 0.671 0.5025
## smokeryes:age:regionsoutheast 2.911e-03 2.615e-02 0.111 0.9114
## smokeryes:age:regionsouthwest -9.415e-03 3.397e-02 -0.277 0.7817
## bmi:smokeryes:age:regionnorthwest -8.028e-04 1.140e-03 -0.704 0.4815
## bmi:smokeryes:age:regionsoutheast -3.043e-04 8.332e-04 -0.365 0.7150
## bmi:smokeryes:age:regionsouthwest 7.916e-05 1.084e-03 0.073 0.9418
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3833 on 1303 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8259
## F-statistic: 193.1 on 33 and 1303 DF, p-value: < 2.2e-16
Note that this can rapidly increase the amount of terms that need to included in the model and harm interpretability. Though out multiple R-squared and adjusted R-squared did increase quite a bit to 0.8304 and 0.8261 respectively.