This dataset was sourced from the Department of Statistics and Biostatistics at California State University, East Bay (Spring 2018), from their subject “Stat 4869/6620 Statistical Learning with R”. It appears to be a synthetic dataset relating a number of customer characteristics to their insurance expenses. The dataset can be viewed at these links:

http://www.sci.csueastbay.edu/~esuess/stat6620/
http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/insurance.csv

# Read in the data:
library(tidyverse)
insurance <- read_csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/insurance.csv") 

Let’s take a quick look at the data…

names(insurance)
## [1] "age"      "sex"      "bmi"      "children" "smoker"   "region"  
## [7] "expenses"
summary(insurance)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :16.00   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.67   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.70   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.10   Max.   :5.000  
##     smoker             region             expenses    
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770
sum(is.na(insurance))
## [1] 0

The predictor variables are age, sex, bmi, children, smoker and region, and the target variable is expenses. There are no NA values.
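Since sex, smoker and region are character columns, a quick frequency count gives a feel for how balanced they are (a small optional check, not part of the original workflow):

# Count observations per level of each categorical column
insurance %>% count(sex)
insurance %>% count(smoker)
insurance %>% count(region)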

Let’s run a multiple linear regression including all variables and see what we get.

multiple_lm1 <- lm(expenses ~ age + sex + bmi + children + smoker + region, data = insurance)
summary(multiple_lm1)
## 
## Call:
## lm(formula = expenses ~ age + sex + bmi + children + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11302.7  -2850.9   -979.6   1383.9  29981.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
## age                256.8       11.9  21.586  < 2e-16 ***
## sexmale           -131.3      332.9  -0.395 0.693255    
## bmi                339.3       28.6  11.864  < 2e-16 ***
## children           475.7      137.8   3.452 0.000574 ***
## smokeryes        23847.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -352.8      476.3  -0.741 0.458976    
## regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
## regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

Ok. First of all, the p-value on the F-statistic is < 2.2e-16, which is highly significant. This tells us that at least one of the predictor variables is significantly related to the target variable. Looking next at the coefficients, the t-values indicate that being a smoker has the highest predictive capacity of our variables (it has the largest t-value), followed by age, bmi and children. Each of these variables is statistically significant. The remaining terms (sex, and the region dummies, which are at best marginally significant) add little, so we will remove them from our model.

The R-squared measure shows how closely the data fit the regression line. This output shows that about 75% of the variability in expenses is explained by the model, which is pretty good.
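If you want to pull these quantities out of the model programmatically rather than reading them off the printout, the summary object exposes them (a small sketch):

# Coefficient table (estimate, std. error, t value, p value) as a matrix
coef_table <- summary(multiple_lm1)$coefficients
coef_table

# Keep only the terms with p < 0.05
coef_table[coef_table[, "Pr(>|t|)"] < 0.05, ]

# R-squared and adjusted R-squared
summary(multiple_lm1)$r.squared
summary(multiple_lm1)$adj.r.squared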

multiple_lm2 <- lm(expenses ~ bmi + age + smoker + children, data = insurance)
summary(multiple_lm2)
## 
## Call:
## lm(formula = expenses ~ bmi + age + smoker + children, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11895  -2921   -985   1382  29499 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12105.48     941.95 -12.851  < 2e-16 ***
## bmi            321.94      27.38  11.760  < 2e-16 ***
## age            257.83      11.90  21.674  < 2e-16 ***
## smokeryes    23810.32     411.21  57.903  < 2e-16 ***
## children       473.69     137.79   3.438 0.000604 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.749 
## F-statistic: 998.2 on 4 and 1333 DF,  p-value: < 2.2e-16

By removing the non-significant variables, the F-statistic has jumped from roughly 500 to 998 (largely because fewer parameters are being estimated), and the interpretability of the model has improved. The trade-off is only a marginal decrease in the model’s explanatory power - the R-squared values have dropped by less than 1%.
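Since multiple_lm2 is nested inside multiple_lm1, a partial F-test gives a more formal check that dropping sex and region does not significantly worsen the fit; comparing AIC is another option (a sketch):

# Partial F-test comparing the reduced model to the full model
anova(multiple_lm2, multiple_lm1)

# Information criteria for both models (lower is better)
AIC(multiple_lm1, multiple_lm2)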

Next, let’s look at the plots of the model.

plot(multiple_lm2)
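By default, plot() on an lm object draws its four diagnostic plots one after another; to see them together on a single page, a common pattern is:

# Arrange the four diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(multiple_lm2)
par(mfrow = c(1, 1))  # reset the plotting layout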

Residuals vs Fitted: the output shows that the data clump around three main groups. It seems to indicate that the residuals and fitted values are not correlated and that the relationship between the predictors and the target is roughly linear. Neither heteroscedasticity nor a non-linear relationship is obvious from this output.

Q-Q Plot: shows that our residual distribution has heavy tails. That is, the points fall along the line in the middle but curve off a bit at each end - particularly at the higher end. This shows that the residuals don’t follow a normal distribution and tend to include some extreme values.
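A Shapiro-Wilk test on the residuals gives a numerical counterpart to the Q-Q plot (a sketch; a small p-value indicates a departure from normality, which the plot already suggests here):

# Shapiro-Wilk normality test on the model residuals
shapiro.test(residuals(multiple_lm2))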

Scale-Location: this plot allows us to check for equal variance (homoscedasticity). The output appears to show that this is (unfortunately) not the case, and the higher levels of variance in the first clump drag the line down towards the left.
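To back up this visual impression with a formal test, the Breusch-Pagan test from the lmtest package checks for non-constant variance (a sketch, assuming lmtest is installed; a small p-value indicates heteroscedasticity):

library(lmtest)
# Breusch-Pagan test: the null hypothesis is constant variance
bptest(multiple_lm2)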

Residuals vs Leverage: this plot helps to find influential points that distort our model. When points fall outside a dashed line (Cook’s distance), they are influential on the regression results. In this case, as the Cook’s distance lines aren’t even visible, there are no influential points distorting our model.
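The same check can be made numerically by computing Cook’s distance for every observation and comparing it against a rule-of-thumb cutoff such as 4/n (a sketch):

# Cook's distance for every observation in the model
cd <- cooks.distance(multiple_lm2)

# Largest value, and how many observations exceed the 4/n rule of thumb
max(cd)
sum(cd > 4 / nrow(insurance))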

Let’s go back and look at the correlation between each of the predictor variables and the target. In the case of smoker, which is a categorical variable (yes/no), we need to convert it to numerical values in order to calculate the correlation between expenses and smoking.

# Let's look at the correlation between each predictor (children, BMI, age and smoker) and the response (expenses)
cor(insurance$expenses,insurance$children)
## [1] 0.06799823
cor(insurance$expenses,insurance$bmi)
## [1] 0.1985763
cor(insurance$expenses,insurance$age)
## [1] 0.2990082
library(plyr)
insurance$smoker <- revalue(insurance$smoker, c("yes"=1))
insurance$smoker <- revalue(insurance$smoker, c("no"=0))
insurance$smoker <- as.numeric(insurance$smoker)
cor(insurance$expenses,insurance$smoker)
## [1] 0.7872514
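As an aside, the same recoding can be done in base R with ifelse(), which avoids loading plyr on top of the tidyverse (plyr masks several dplyr functions). This sketch assumes smoker is still the original yes/no character column and uses a hypothetical smoker_num column:

# Recode smoker to 0/1 without plyr (alternative to the revalue() calls above)
insurance$smoker_num <- ifelse(insurance$smoker == "yes", 1, 0)
cor(insurance$expenses, insurance$smoker_num)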

As we saw in the multiple regression output, smoker is the variable most highly correlated with insurance expenses, and has the highest predictive power.

Multi-collinearity:

library(corrplot)
myvars <- insurance %>% 
  select(bmi, age, smoker, children)
myvars$smoker <- as.numeric(as.character(myvars$smoker))
corrplot(cor(myvars), method = "number", tl.col="black")
# Because the numbers in the plot output above are hard to read (the values are small and the text is light), the following code overlays them in black
corrplot(cor(myvars), add=T, type="lower", method="number",
         col="black", diag=F, tl.pos="n", cl.pos="n")

The low pairwise correlations in this plot indicate that there is no serious multicollinearity between our predictors, so this is not something that needs to be factored into the model.
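A more direct multicollinearity check is the variance inflation factor, available from the car package (a sketch, assuming car is installed; VIF values near 1 indicate little collinearity):

library(car)
# Variance inflation factors for the reduced model
vif(multiple_lm2)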