Learning Objectives

In this lesson students will learn how to…

  • Fit a multiple linear regression model
  • Create graphics to explore relationships between two variables
  • Engineer features

0. Import the Data

These data for `insurance’ charges come from the “US Health Insurance Dataset” on Kaggle.

Souce: https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset

library(tidyverse)

### use raw file from github
insurance_all<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/insurance.csv", 
                    stringsAsFactors = TRUE)

Let’s start by looking at the relationships between these variables.

We will use the GGally package because it makes beautiful graphics.

#install.packages("GGally")
library(GGally)

ggpairs(insurance_all, columns=c(1, 3, 4, 5, 7),
        ggplot2::aes(colour = smoker))

We should partition the data into testing and training sets:

dim(insurance_all) # 1338
## [1] 1338    7
set.seed(123)
### Let's do a 70-30 split
trainInd<-sample(1:1338, 937)

insurance<-insurance_all[trainInd,]
insurance_test<-insurance_all[-trainInd,]

Questions:

  • What variables are contained in these data?
  • Which variable might you use as the outcome/response?
  • Which variable(s) might you use as the features/explanatory?
str(insurance)
## 'data.frame':    937 obs. of  7 variables:
##  $ age     : int  19 62 46 18 18 39 41 62 20 24 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 1 2 2 2 ...
##  $ bmi     : num  35.1 38.1 28.9 33.9 34.4 ...
##  $ children: int  0 2 2 0 0 5 3 0 0 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 2 1 4 3 3 2 4 3 4 1 ...
##  $ charges : num  2135 15230 8823 11483 1137 ...

I. Simple Linear Regression

1. Plot the data

Our first task is to create a scatterplot of charges and bmi (body mass index).

### Charges vs BMI
ggplot(data=insurance, aes(x=bmi, y=charges))+
  geom_point()

Question:

  • How would you describe this scatterplot?

    • Direction
    • Form
    • Strength
    • Outliers

2. Fit a Simple Linear Model

Add a geom_smooth() using the linear model method.

### Add a geom_smooth
ggplot(data=insurance, aes(x=bmi, y=charges))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

To find the equation for the line use the lm() function.

mod1<-lm(charges~bmi, data=insurance)
summary(mod1)
## 
## Call:
## lm(formula = charges ~ bmi, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20829  -8248  -3831   4790  49281 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1736.46    2020.40   0.859     0.39    
## bmi           381.26      64.19   5.939 4.04e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12040 on 935 degrees of freedom
## Multiple R-squared:  0.03635,    Adjusted R-squared:  0.03532 
## F-statistic: 35.27 on 1 and 935 DF,  p-value: 4.035e-09

Question:

  • Does it appear that the relationship between charges and bmi is significant? Why or why not?

3. Check the Diagnostics

Using the plot function we can quickly generate diagnostics to assess our model assumptions.

plot(mod1)

Questions:

  • What do you observe?

It appears that there are two groups within these data. Let’s explore that further.

II. Parallel Lines

\[Y_i=\beta_0+\beta_1 X_i+\beta_2 Z_i\] In the parallel lines model,

\(Z_i\) is an indicator that takes that values 0 or 1.

When

  • \(Zi=0\) then \(Y_i=\beta_0+\beta_1 X_i\)
  • \(Zi=1\) then \(Y_i=(\beta_0+\beta_2)+\beta_1 X_i\)

It might be reasonable consider smoker status in our model in addition to bmi. Smoking status is a categorical variable.

1. Plot the data

### Charges vs BMI (color with SMOKER)
ggplot(data=insurance, aes(x=bmi, y=charges, color=smoker))+
  geom_point()

Questions:

  • What are your observations?
  • How can we incorporate these into our model?

2. Fit a model

Let’s add a main effect for smoker.

### ADD another variable with a +
mod2<-lm(charges~bmi+smoker, data=insurance)
summary(mod2)
## 
## Call:
## lm(formula = charges ~ bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16244.6  -4559.1   -646.7   3766.9  30777.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4127.80    1195.20  -3.454 0.000578 ***
## bmi           405.34      37.72  10.746  < 2e-16 ***
## smokeryes   23636.63     561.10  42.125  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7075 on 934 degrees of freedom
## Multiple R-squared:  0.6677, Adjusted R-squared:  0.667 
## F-statistic: 938.4 on 2 and 934 DF,  p-value: < 2.2e-16

When you look at this output you’ll see smokeryes. This represents the variable smoker and the level of the factor yes. Note in this case no is used as the reference group. Factor variables create binary indicators in R.

contrasts(insurance$smoker)
##     yes
## no    0
## yes   1

The estimate associated with smokeryes is how much, on average, charges differ from an individual who does not smoke, while all other variables are held constant.

Let’s look at those coefficients again:

mod2$coefficients
## (Intercept)         bmi   smokeryes 
##  -4127.8038    405.3422  23636.6262

Questions:

  • Is the effect of smoker significant?

3. Visualize the Model

### Parallel Lines
ggplot(data=insurance, aes(x=bmi, y=charges, color=smoker))+
  geom_point()+
  geom_abline(intercept=mod2$coefficients[1]+mod2$coefficients[3], 
              slope=mod2$coefficients[2], color="blue")+
  geom_abline(intercept=mod2$coefficients[1], 
              slope=mod2$coefficients[2], color="red")

Questions:

  • What are your observations?
  • Would to make changes to this model?

III. Interactions

\[Y_i=\beta_0+\beta_1 X_i+\beta_2 Z_i+\beta_3 X_i \times Z_i\] In the parallel lines model,

\(Z_i\) is an indicator that takes that values 0 or 1.

When

  • \(Zi=0\) then \(Y_i=\beta_0+\beta_1 X_i\)
  • \(Zi=1\) then \(Y_i=(\beta_0+\beta_2)+(\beta_1+\beta_3) X_i\)

You might notice that the restriction to parallel lines does not accurately reflect the relationship for the smoker group.

1. Make a Plot

We can use color as a grouping variable that is used to partition these data.

### Charges vs BMI (color with SMOKER)
ggplot(data=insurance, aes(x=bmi, y=charges, color=smoker))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Note that when we partition the data in this way, it performs two separate linear models.

2. Fit a model

We can model these groups simultaneously to decrease the error overall.

Use a * to get all main effects and subsequent interactions.

## * for interactions
mod3<-lm(charges~bmi*smoker, data=insurance)
summary(mod3)
## 
## Call:
## lm(formula = charges ~ bmi * smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19193.0  -4327.6   -771.2   2993.6  31127.5 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5863.85    1175.02   4.990 7.19e-07 ***
## bmi               82.20      37.29   2.204   0.0278 *  
## smokeryes     -18434.12    2408.94  -7.652 4.91e-14 ***
## bmi:smokeryes   1368.25      76.74  17.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6113 on 933 degrees of freedom
## Multiple R-squared:  0.7521, Adjusted R-squared:  0.7514 
## F-statistic: 943.8 on 3 and 933 DF,  p-value: < 2.2e-16

Looking at the output you’ll see bmi:smokeryes. This is known as the interaction. An interaction communicates that the relationship between the response and an explanatory variable changes at different levels of a factor. This causes in change in both the intercept and the slope.

## Look at the coefficient vector
mod3$coefficients
##   (Intercept)           bmi     smokeryes bmi:smokeryes 
##    5863.85365      82.19608  -18434.12034    1368.25455
## Reference
yint_0<-mod3$coefficients[1]
slope_0<-mod3$coefficients[2]

## Alternative
yint_1<-mod3$coefficients[1]+mod3$coefficients[3]
yint_1
## (Intercept) 
##   -12570.27
slope_1<-mod3$coefficients[2]+mod3$coefficients[4]
slope_1
##      bmi 
## 1450.451

3. Visualize the model

### Parallel Lines
ggplot(data=insurance, aes(x=bmi, y=charges, color=smoker))+
  geom_point()+
  geom_abline(intercept=yint_1, 
              slope=slope_1, color="blue")+
  geom_abline(intercept=yint_0, 
              slope=slope_0, color="red")

IV. Fitting Planes

Suppose now that we have two numeric features (explanatory variables) how would we graph that?

We would need to add a dimension.

There are several packages that can be used to create a 3D scatterplot:

1. Make a Plot

## SCATTERPLOT3D
#install.packages("scatterplot3d") # Install
library("scatterplot3d")

s3d<-scatterplot3d(insurance$bmi, insurance$age, insurance$charges)

  1. Fit a Model

\[Y_i=\beta_0+\beta_1 X_{i,1}+\beta_2 X_{i,2}\] In this MLR model we have two different slope coefficients. The slope related to the first feature, \(X_1\), is \(\beta_1\) and the slope related to the second feature, \(X_2\), is \(\beta_2\).

We interpret these slopes, \(\beta_j\) as the average amount of change in the response for a one unit change in the \(X_j\) direction, on average.

mod4<-lm(charges~bmi+age, data=insurance)

summary(mod4)
## 
## Call:
## lm(formula = charges ~ bmi + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15200  -7211  -5238   7503  48012 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5349.72    2156.51  -2.481   0.0133 *  
## bmi           335.27      62.49   5.365 1.02e-07 ***
## age           216.70      27.63   7.842 1.21e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11670 on 934 degrees of freedom
## Multiple R-squared:  0.09588,    Adjusted R-squared:  0.09394 
## F-statistic: 49.52 on 2 and 934 DF,  p-value: < 2.2e-16
  1. Visualize the model
scatterplot3d(insurance$bmi, insurance$age, insurance$charges, pch=16)
# Add regression plane
s3d$plane3d(mod4)

Her’s another example

#install.packages(c("plot3D", "magrittr"))
library(plot3D)
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
insurance%$%
  scatter3D(bmi, age, charges, theta=30, phi=10)

4. BONUS: More Interactions

mod5<-lm(charges~bmi*smoker*age, data=insurance)
summary(mod5)
## 
## Call:
## lm(formula = charges ~ bmi * smoker * age, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12023.5  -1966.2  -1254.6   -268.1  29428.6 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.633e+03  2.797e+03  -0.584 0.559531    
## bmi               -3.305e+00  8.947e+01  -0.037 0.970538    
## smokeryes         -1.034e+04  5.919e+03  -1.746 0.081130 .  
## age                2.449e+02  6.905e+01   3.548 0.000408 ***
## bmi:smokeryes      1.116e+03  1.873e+02   5.958 3.61e-09 ***
## bmi:age            3.724e-01  2.176e+00   0.171 0.864151    
## smokeryes:age     -2.688e+02  1.491e+02  -1.803 0.071736 .  
## bmi:smokeryes:age  8.726e+00  4.701e+00   1.856 0.063732 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4977 on 929 degrees of freedom
## Multiple R-squared:  0.8364, Adjusted R-squared:  0.8352 
## F-statistic: 678.5 on 7 and 929 DF,  p-value: < 2.2e-16
#install.packages("rgl")
library(rgl)
library(car)

scatter3d(charges~bmi*age*as.factor(smoker), data=insurance)

# Print the html graphic
 rglwidget(width = 520, height = 520)

V. Feature Engineering

1. The Kitchen Sink

There are more variables to play with in these data:

str(insurance)
## 'data.frame':    937 obs. of  7 variables:
##  $ age     : int  19 62 46 18 18 39 41 62 20 24 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 1 2 2 2 ...
##  $ bmi     : num  35.1 38.1 28.9 33.9 34.4 ...
##  $ children: int  0 2 2 0 0 5 3 0 0 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 2 1 4 3 3 2 4 3 4 1 ...
##  $ charges : num  2135 15230 8823 11483 1137 ...

2. Fit a Model

mod6<-lm(charges~age+children+bmi+sex+smoker+region, data=insurance)
summary(mod6)
## 
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11021.0  -2989.3   -994.3   1715.5  30293.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -12480.69    1215.51 -10.268  < 2e-16 ***
## age                240.12      14.67  16.371  < 2e-16 ***
## children           674.08     165.24   4.079  4.9e-05 ***
## bmi                370.06      34.73  10.656  < 2e-16 ***
## sexmale           -246.82     408.20  -0.605    0.546    
## smokeryes        23939.65     493.54  48.506  < 2e-16 ***
## regionnorthwest   -595.02     580.29  -1.025    0.305    
## regionsoutheast   -952.66     585.26  -1.628    0.104    
## regionsouthwest  -1116.82     581.28  -1.921    0.055 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6182 on 928 degrees of freedom
## Multiple R-squared:  0.7479, Adjusted R-squared:  0.7458 
## F-statistic: 344.2 on 8 and 928 DF,  p-value: < 2.2e-16

Questions:

  • How do we interpret the region estimates?
    • What is the reference group?
  • Looking at the pairs plot what are your observations? Do the relationships meet our LM assumptions?

3. Mutate!

Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in supervised learning. In order to make machine learning work well on new tasks, it might be necessary to design and train better features.

https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10

You might have made the following observations: * There looks to be some curvature in the relationship between age and charges * Perhaps charges are different for patients that are defined as obese (BMI > 30) * It stands to reason that the having both traits of smoking and being obese could have even worse effects.

Let’s create new variables to help accomplish these:

insurance<-insurance %>%
  mutate(bmi30=(bmi>=30), 
         age2=age^2)

4. Fitting the Model with New Features

mod7<-lm(charges~age+age2+children+bmi+sex+bmi30*smoker+region, data=insurance)
summary(mod7)
## 
## Call:
## lm(formula = charges ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4609.9 -1650.1 -1187.5  -626.9 23877.3 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           514.0293  1665.6029   0.309  0.75768    
## age                   -71.1145    74.5724  -0.954  0.34052    
## age2                    4.1190     0.9309   4.425 1.08e-05 ***
## children              777.4579   126.1384   6.164 1.06e-09 ***
## bmi                   126.2632    40.3375   3.130  0.00180 ** 
## sexmale              -491.5381   293.4848  -1.675  0.09430 .  
## bmi30TRUE            -970.4090   503.2088  -1.928  0.05411 .  
## smokeryes           13624.3500   510.4337  26.692  < 2e-16 ***
## regionnorthwest      -229.4323   417.7103  -0.549  0.58296    
## regionsoutheast      -666.2763   421.3407  -1.581  0.11415    
## regionsouthwest     -1183.2241   417.7891  -2.832  0.00472 ** 
## bmi30TRUE:smokeryes 20122.6098   708.0504  28.420  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4442 on 925 degrees of freedom
## Multiple R-squared:  0.8703, Adjusted R-squared:  0.8687 
## F-statistic: 564.1 on 11 and 925 DF,  p-value: < 2.2e-16

VI. Comparing the models

What metrics can we use for model comparison:

library(broom)

## COMPARE MODEL 1 and MODEL 7
mod1%>%
  glance()
## # A tibble: 1 × 12
##   r.squared adj.r.s…¹  sigma stati…² p.value    df  logLik    AIC    BIC devia…³
##       <dbl>     <dbl>  <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1    0.0364    0.0353 12041.    35.3 4.04e-9     1 -10133. 20271. 20286. 1.36e11
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
mod7%>%
  glance()
## # A tibble: 1 × 12
##   r.squared adj.r.squ…¹ sigma stati…² p.value    df logLik    AIC    BIC devia…³
##       <dbl>       <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1     0.870       0.869 4442.    564.       0    11 -9193. 18412. 18475. 1.83e10
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

Recall that \(\sigma\) is estimated with \(s\), where \[s=\sqrt{\frac{1}{n-p-1}\sum_{i=1}^n(\hat{y}_i-y_i)^2}\]

Metrics from the training set:

## MODEL 1: charges~bmi
train1<-sigma(mod1)
train1
## [1] 12041.5
## MODEL 2: charges~bmi+smoker
train2<-sigma(mod2)
train2
## [1] 7074.866
## MODEL 3: charges~bmi*smoker
train3<-sigma(mod3)
train3
## [1] 6113.404
## MODEL 4: charges~bmi+age
train4<-sigma(mod4)
train4
## [1] 11669.9
## MODEL 5: charges~bmi*smoker*age
train5<-sigma(mod5)
train5
## [1] 4977.349
## MODEL 6: (Kitchen Sink) charges~age+children+bmi+sex+smoker+region
train6<-sigma(mod6)
train6
## [1] 6181.611
## MODEL 7: (NEW Features) charges~age+age2+children+bmi+sex+bmi30*smoker+region
train7<-sigma(mod7)
train7
## [1] 4441.866

Let’s look at the test performance by assessing the root mean squared error: \[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^n(\hat{y}_i-y_i)^2}\]

#install.packages("caret")
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## MOD 1
testPred1<-predict(mod1, insurance_test)
RMSE(testPred1, insurance_test$charges)
## [1] 11477.92
## MOD 2
testPred2<-predict(mod2, insurance_test)
RMSE(testPred2, insurance_test$charges)
## [1] 7124.835
## MOD 3
testPred3<-predict(mod3, insurance_test)
RMSE(testPred3, insurance_test$charges)
## [1] 6271.597
## MOD 4
testPred4<-predict(mod4, insurance_test)
RMSE(testPred4, insurance_test$charges)
## [1] 10721.93
## MOD 5
testPred5<-predict(mod5, insurance_test)
RMSE(testPred5, insurance_test$charges)
## [1] 4742.929
## MOD 6
testPred6<-predict(mod6, insurance_test)
RMSE(testPred6, insurance_test$charges)
## [1] 5828.947
## MOD 7
insurance_test<-insurance_test%>%
    mutate(bmi30=(bmi>=30), 
         age2=age^2)

testPred7<-predict(mod7, insurance_test)
RMSE(testPred7, insurance_test$charges)
## [1] 4476.419