In this lesson students will learn how to fit simple and multiple linear regression models in R, incorporate categorical predictors and interactions, engineer new features, and compare models using training and test error.
These data on insurance charges come from the “US Health Insurance Dataset” on Kaggle.
Source: https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset
library(tidyverse)
### use raw file from github
insurance_all<-read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/insurance.csv",
stringsAsFactors = TRUE)
Let’s start by looking at the relationships between these variables. We will use the GGally package because it makes beautiful graphics; its ggpairs() function draws a scatterplot matrix for several variables at once.
#install.packages("GGally")
library(GGally)
ggpairs(insurance_all, columns=c(1, 3, 4, 5, 7),
ggplot2::aes(colour = smoker))
We should partition the data into testing and training sets:
dim(insurance_all) # 1338
## [1] 1338 7
set.seed(123)
### Let's do a 70-30 split (937 of the 1338 rows is ~70%)
trainInd<-sample(1:1338, 937)
insurance<-insurance_all[trainInd,]
insurance_test<-insurance_all[-trainInd,]
str(insurance)
## 'data.frame': 937 obs. of 7 variables:
## $ age : int 19 62 46 18 18 39 41 62 20 24 ...
## $ sex : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 1 2 2 2 ...
## $ bmi : num 35.1 38.1 28.9 33.9 34.4 ...
## $ children: int 0 2 2 0 0 5 3 0 0 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 2 1 4 3 3 2 4 3 4 1 ...
## $ charges : num 2135 15230 8823 11483 1137 ...
Our first task is to create a scatterplot of charges and bmi (body mass index).
### Charges vs BMI
ggplot(data=insurance, aes(x=bmi, y=charges))+
geom_point()
How would you describe this scatterplot?
Add a geom_smooth() using the linear model method.
### Add a geom_smooth
ggplot(data=insurance, aes(x=bmi, y=charges))+
geom_point()+
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
To find the equation of the line, use the lm() function.
mod1<-lm(charges~bmi, data=insurance)
summary(mod1)
##
## Call:
## lm(formula = charges ~ bmi, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20829 -8248 -3831 4790 49281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1736.46 2020.40 0.859 0.39
## bmi 381.26 64.19 5.939 4.04e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12040 on 935 degrees of freedom
## Multiple R-squared: 0.03635, Adjusted R-squared: 0.03532
## F-statistic: 35.27 on 1 and 935 DF, p-value: 4.035e-09
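From these estimates, the fitted least-squares line is \[\widehat{\text{charges}} = 1736.46 + 381.26 \times \text{bmi},\] so each additional unit of BMI is associated with about $381 more in charges, on average.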
Is the relationship between charges and bmi significant? Why or why not?
Using the plot() function we can quickly generate diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage) to assess our model assumptions.
plot(mod1)
It appears that there are two groups within these data. Let’s explore that further.
\[Y_i=\beta_0+\beta_1 X_i+\beta_2 Z_i\] In the parallel lines model, \(Z_i\) is an indicator that takes the values 0 or 1. When \(Z_i=0\) the line is \(Y_i=\beta_0+\beta_1 X_i\); when \(Z_i=1\) the intercept shifts to \(\beta_0+\beta_2\) while the slope \(\beta_1\) stays the same, so the two lines are parallel.
It might be reasonable to consider smoker status in our model in addition to bmi. Smoking status is a categorical variable.
### Charges vs BMI (color with SMOKER)
ggplot(data=insurance, aes(x=bmi, y=charges, color=smoker))+
geom_point()
Let’s add a main effect for smoker.
### ADD another variable with a +
mod2<-lm(charges~bmi+smoker, data=insurance)
summary(mod2)
##
## Call:
## lm(formula = charges ~ bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16244.6 -4559.1 -646.7 3766.9 30777.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4127.80 1195.20 -3.454 0.000578 ***
## bmi 405.34 37.72 10.746 < 2e-16 ***
## smokeryes 23636.63 561.10 42.125 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7075 on 934 degrees of freedom
## Multiple R-squared: 0.6677, Adjusted R-squared: 0.667
## F-statistic: 938.4 on 2 and 934 DF, p-value: < 2.2e-16
When you look at this output you’ll see smokeryes. This represents the variable smoker at the factor level yes. Note that in this case no is used as the reference group: R encodes factor variables as binary (0/1) indicator variables.
contrasts(insurance$smoker)
## yes
## no 0
## yes 1
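To see these indicator columns explicitly, one option (a small sketch, not part of the original code) is to peek at the model matrix R builds behind the scenes:
# Inspect the 0/1 indicator R creates for smoker (column smokeryes)
head(model.matrix(~smoker, data=insurance))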
The estimate associated with smokeryes is how much, on average, charges for a smoker differ from those of an individual who does not smoke, while all other variables are held constant.
Let’s look at those coefficients again:
mod2$coefficients
## (Intercept) bmi smokeryes
## -4127.8038 405.3422 23636.6262
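As a quick check, a sketch using the coefficient vector above: the two parallel lines share a slope, and the smokeryes estimate is exactly the vertical gap between them.
# Intercepts of the two parallel lines from mod2
b2<-mod2$coefficients
b2[1]          # non-smoker intercept: -4127.80
b2[1]+b2[3]    # smoker intercept: -4127.80 + 23636.63 = 19508.82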
Is the main effect for smoker significant?
### Parallel Lines
ggplot(data=insurance, aes(x=bmi, y=charges, color=smoker))+
geom_point()+
geom_abline(intercept=mod2$coefficients[1]+mod2$coefficients[3],
slope=mod2$coefficients[2], color="blue")+
geom_abline(intercept=mod2$coefficients[1],
slope=mod2$coefficients[2], color="red")
\[Y_i=\beta_0+\beta_1 X_i+\beta_2 Z_i+\beta_3 X_i \times Z_i\] In the interaction model, \(Z_i\) is again an indicator that takes the values 0 or 1. When \(Z_i=0\) the line is \(Y_i=\beta_0+\beta_1 X_i\); when \(Z_i=1\) both the intercept, \(\beta_0+\beta_2\), and the slope, \(\beta_1+\beta_3\), change, so the two lines are no longer parallel.
You might notice that the restriction to parallel lines does not accurately reflect the relationship for the smoker group.
We can use color as a grouping variable to partition these data.
### Charges vs BMI (color with SMOKER)
ggplot(data=insurance, aes(x=bmi, y=charges, color=smoker))+
geom_point()+
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Note that when we partition the data in this way, geom_smooth() fits two separate linear models, one for each group.
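To see this, here is a minimal sketch (not in the original lesson) that fits each group separately; the intercepts and slopes match the two colored lines above:
# Fit non-smokers and smokers separately
lm(charges~bmi, data=subset(insurance, smoker=="no"))
lm(charges~bmi, data=subset(insurance, smoker=="yes"))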
We can model these groups simultaneously to decrease the error overall.
Use a * in the model formula to get all main effects and their interactions.
## * for interactions
mod3<-lm(charges~bmi*smoker, data=insurance)
summary(mod3)
##
## Call:
## lm(formula = charges ~ bmi * smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19193.0 -4327.6 -771.2 2993.6 31127.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5863.85 1175.02 4.990 7.19e-07 ***
## bmi 82.20 37.29 2.204 0.0278 *
## smokeryes -18434.12 2408.94 -7.652 4.91e-14 ***
## bmi:smokeryes 1368.25 76.74 17.829 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6113 on 933 degrees of freedom
## Multiple R-squared: 0.7521, Adjusted R-squared: 0.7514
## F-statistic: 943.8 on 3 and 933 DF, p-value: < 2.2e-16
Looking at the output you’ll see bmi:smokeryes. This is known as the interaction. An interaction communicates that the relationship between the response and an explanatory variable changes at different levels of a factor. This changes both the intercept and the slope.
## Look at the coefficient vector
mod3$coefficients
## (Intercept) bmi smokeryes bmi:smokeryes
## 5863.85365 82.19608 -18434.12034 1368.25455
## Reference
yint_0<-mod3$coefficients[1]
slope_0<-mod3$coefficients[2]
## Alternative
yint_1<-mod3$coefficients[1]+mod3$coefficients[3]
yint_1
## (Intercept)
## -12570.27
slope_1<-mod3$coefficients[2]+mod3$coefficients[4]
slope_1
## bmi
## 1450.451
### Interaction Model: Non-Parallel Lines
ggplot(data=insurance, aes(x=bmi, y=charges, color=smoker))+
geom_point()+
geom_abline(intercept=yint_1,
slope=slope_1, color="blue")+
geom_abline(intercept=yint_0,
slope=slope_0, color="red")
Suppose now that we have two numeric features (explanatory variables); how would we graph that? We would need to add a dimension.
There are several packages that can be used to create a 3D scatterplot:
* scatterplot3d: http://www.sthda.com/english/wiki/scatterplot3d-3d-graphics-r-software-and-data-visualization
* scatter3d in the car package: http://www.sthda.com/english/wiki/amazing-interactive-3d-scatter-plots-r-software-and-data-visualization
* plot3D
## SCATTERPLOT3D
#install.packages("scatterplot3d") # Install
library("scatterplot3d")
s3d<-scatterplot3d(insurance$bmi, insurance$age, insurance$charges)
\[Y_i=\beta_0+\beta_1 X_{i,1}+\beta_2 X_{i,2}\] In this MLR model we have two different slope coefficients. The slope related to the first feature, \(X_1\), is \(\beta_1\) and the slope related to the second feature, \(X_2\), is \(\beta_2\).
We interpret each slope \(\beta_j\) as the average change in the response for a one-unit increase in \(X_j\), holding the other feature constant.
mod4<-lm(charges~bmi+age, data=insurance)
summary(mod4)
##
## Call:
## lm(formula = charges ~ bmi + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15200 -7211 -5238 7503 48012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5349.72 2156.51 -2.481 0.0133 *
## bmi 335.27 62.49 5.365 1.02e-07 ***
## age 216.70 27.63 7.842 1.21e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11670 on 934 degrees of freedom
## Multiple R-squared: 0.09588, Adjusted R-squared: 0.09394
## F-statistic: 49.52 on 2 and 934 DF, p-value: < 2.2e-16
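To make the slope interpretation concrete, here is a small sketch (the two example people are hypothetical): holding age fixed, predictions that differ by one unit of bmi differ by exactly the bmi coefficient.
# Two hypothetical people: same age, bmi differs by 1
newdat<-data.frame(bmi=c(30, 31), age=c(40, 40))
diff(predict(mod4, newdat))   # equals the bmi estimate, about 335.27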
s3d<-scatterplot3d(insurance$bmi, insurance$age, insurance$charges, pch=16)
# Add the fitted regression plane from mod4
s3d$plane3d(mod4)
Here’s another example, this time using the plot3D package.
#install.packages(c("plot3D", "magrittr"))
library(plot3D)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
# %$% exposes the columns of insurance to scatter3D
insurance %$%
scatter3D(bmi, age, charges, theta=30, phi=10)
We can also allow the effect of bmi and smoker to vary with age by crossing all three variables:
mod5<-lm(charges~bmi*smoker*age, data=insurance)
summary(mod5)
##
## Call:
## lm(formula = charges ~ bmi * smoker * age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12023.5 -1966.2 -1254.6 -268.1 29428.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.633e+03 2.797e+03 -0.584 0.559531
## bmi -3.305e+00 8.947e+01 -0.037 0.970538
## smokeryes -1.034e+04 5.919e+03 -1.746 0.081130 .
## age 2.449e+02 6.905e+01 3.548 0.000408 ***
## bmi:smokeryes 1.116e+03 1.873e+02 5.958 3.61e-09 ***
## bmi:age 3.724e-01 2.176e+00 0.171 0.864151
## smokeryes:age -2.688e+02 1.491e+02 -1.803 0.071736 .
## bmi:smokeryes:age 8.726e+00 4.701e+00 1.856 0.063732 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4977 on 929 degrees of freedom
## Multiple R-squared: 0.8364, Adjusted R-squared: 0.8352
## F-statistic: 678.5 on 7 and 929 DF, p-value: < 2.2e-16
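With a three-way interaction, the slope on bmi depends on both smoker status and age. As a sketch (the choice of age 40 is arbitrary), the implied bmi slope for a 40-year-old smoker can be assembled from the coefficients:
# bmi slope for a smoker at age 40
b5<-coef(mod5)
b5["bmi"]+b5["bmi:smokeryes"]+40*(b5["bmi:age"]+b5["bmi:smokeryes:age"])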
#install.packages("rgl")
library(rgl)
library(car)
scatter3d(charges~bmi*age*as.factor(smoker), data=insurance)
# Print the html graphic
rglwidget(width = 520, height = 520)
There are more variables to play with in these data:
str(insurance)
## 'data.frame': 937 obs. of 7 variables:
## $ age : int 19 62 46 18 18 39 41 62 20 24 ...
## $ sex : Factor w/ 2 levels "female","male": 1 1 1 1 2 1 1 2 2 2 ...
## $ bmi : num 35.1 38.1 28.9 33.9 34.4 ...
## $ children: int 0 2 2 0 0 5 3 0 0 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 2 1 4 3 3 2 4 3 4 1 ...
## $ charges : num 2135 15230 8823 11483 1137 ...
mod6<-lm(charges~age+children+bmi+sex+smoker+region, data=insurance)
summary(mod6)
##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11021.0 -2989.3 -994.3 1715.5 30293.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12480.69 1215.51 -10.268 < 2e-16 ***
## age 240.12 14.67 16.371 < 2e-16 ***
## children 674.08 165.24 4.079 4.9e-05 ***
## bmi 370.06 34.73 10.656 < 2e-16 ***
## sexmale -246.82 408.20 -0.605 0.546
## smokeryes 23939.65 493.54 48.506 < 2e-16 ***
## regionnorthwest -595.02 580.29 -1.025 0.305
## regionsoutheast -952.66 585.26 -1.628 0.104
## regionsouthwest -1116.82 581.28 -1.921 0.055 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6182 on 928 degrees of freedom
## Multiple R-squared: 0.7479, Adjusted R-squared: 0.7458
## F-statistic: 344.2 on 8 and 928 DF, p-value: < 2.2e-16
How would you interpret the region estimates?
Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in supervised learning. In order to make machine learning work well on new tasks, it might be necessary to design and construct better features.
You might have made the following observations:
* There looks to be some curvature in the relationship between age and charges
* Perhaps charges are different for patients who are classified as obese (BMI ≥ 30)
* It stands to reason that having both traits, smoking and being obese, could have even worse effects
Let’s create new variables to help accomplish these:
insurance<-insurance %>%
mutate(bmi30=(bmi>=30),
age2=age^2)
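A quick sanity check on the engineered features (output omitted here):
# Confirm the new columns look right
table(insurance$bmi30)
head(insurance[, c("age", "age2", "bmi", "bmi30")])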
mod7<-lm(charges~age+age2+children+bmi+sex+bmi30*smoker+region, data=insurance)
summary(mod7)
##
## Call:
## lm(formula = charges ~ age + age2 + children + bmi + sex + bmi30 *
## smoker + region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4609.9 -1650.1 -1187.5 -626.9 23877.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 514.0293 1665.6029 0.309 0.75768
## age -71.1145 74.5724 -0.954 0.34052
## age2 4.1190 0.9309 4.425 1.08e-05 ***
## children 777.4579 126.1384 6.164 1.06e-09 ***
## bmi 126.2632 40.3375 3.130 0.00180 **
## sexmale -491.5381 293.4848 -1.675 0.09430 .
## bmi30TRUE -970.4090 503.2088 -1.928 0.05411 .
## smokeryes 13624.3500 510.4337 26.692 < 2e-16 ***
## regionnorthwest -229.4323 417.7103 -0.549 0.58296
## regionsoutheast -666.2763 421.3407 -1.581 0.11415
## regionsouthwest -1183.2241 417.7891 -2.832 0.00472 **
## bmi30TRUE:smokeryes 20122.6098 708.0504 28.420 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4442 on 925 degrees of freedom
## Multiple R-squared: 0.8703, Adjusted R-squared: 0.8687
## F-statistic: 564.1 on 11 and 925 DF, p-value: < 2.2e-16
What metrics can we use for model comparison?
library(broom)
## COMPARE MODEL 1 and MODEL 7
mod1%>%
glance()
## # A tibble: 1 × 12
## r.squared adj.r.s…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0364 0.0353 12041. 35.3 4.04e-9 1 -10133. 20271. 20286. 1.36e11
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
mod7%>%
glance()
## # A tibble: 1 × 12
## r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.870 0.869 4442. 564. 0 11 -9193. 18412. 18475. 1.83e10
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
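Since glance() reports AIC and BIC, a compact alternative is to compare them directly with base R (a sketch; lower values indicate a better fit-complexity trade-off):
# Information criteria for the simplest and richest models
AIC(mod1, mod7)
BIC(mod1, mod7)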
Recall that \(\sigma\) is estimated with \(s\), where \[s=\sqrt{\frac{1}{n-p-1}\sum_{i=1}^n(\hat{y}_i-y_i)^2}\]
Metrics from the training set:
## MODEL 1: charges~bmi
train1<-sigma(mod1)
train1
## [1] 12041.5
## MODEL 2: charges~bmi+smoker
train2<-sigma(mod2)
train2
## [1] 7074.866
## MODEL 3: charges~bmi*smoker
train3<-sigma(mod3)
train3
## [1] 6113.404
## MODEL 4: charges~bmi+age
train4<-sigma(mod4)
train4
## [1] 11669.9
## MODEL 5: charges~bmi*smoker*age
train5<-sigma(mod5)
train5
## [1] 4977.349
## MODEL 6: (Kitchen Sink) charges~age+children+bmi+sex+smoker+region
train6<-sigma(mod6)
train6
## [1] 6181.611
## MODEL 7: (NEW Features) charges~age+age2+children+bmi+sex+bmi30*smoker+region
train7<-sigma(mod7)
train7
## [1] 4441.866
Let’s look at the test performance by assessing the root mean squared error: \[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^n(\hat{y}_i-y_i)^2}\]
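The formula translates directly into base R; as a sketch, this is equivalent to caret’s RMSE() used below:
# RMSE by hand on the test set, using mod1 as an example
pred1<-predict(mod1, insurance_test)
sqrt(mean((pred1 - insurance_test$charges)^2))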
#install.packages("caret")
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## MOD 1
testPred1<-predict(mod1, insurance_test)
RMSE(testPred1, insurance_test$charges)
## [1] 11477.92
## MOD 2
testPred2<-predict(mod2, insurance_test)
RMSE(testPred2, insurance_test$charges)
## [1] 7124.835
## MOD 3
testPred3<-predict(mod3, insurance_test)
RMSE(testPred3, insurance_test$charges)
## [1] 6271.597
## MOD 4
testPred4<-predict(mod4, insurance_test)
RMSE(testPred4, insurance_test$charges)
## [1] 10721.93
## MOD 5
testPred5<-predict(mod5, insurance_test)
RMSE(testPred5, insurance_test$charges)
## [1] 4742.929
## MOD 6
testPred6<-predict(mod6, insurance_test)
RMSE(testPred6, insurance_test$charges)
## [1] 5828.947
## MOD 7
insurance_test<-insurance_test%>%
mutate(bmi30=(bmi>=30),
age2=age^2)
testPred7<-predict(mod7, insurance_test)
RMSE(testPred7, insurance_test$charges)
## [1] 4476.419
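Collecting the test RMSEs into one table makes the comparison easier to scan; a minimal sketch using the prediction vectors defined above:
# Test RMSE for all seven models, side by side
preds<-list(testPred1, testPred2, testPred3, testPred4,
            testPred5, testPred6, testPred7)
data.frame(model=paste0("mod", 1:7),
           testRMSE=sapply(preds, RMSE, obs=insurance_test$charges))
Model 7, built with the engineered features, attains the lowest test RMSE (about 4476), narrowly beating the three-way interaction model mod5 (about 4743).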