Predicting medical “expenses” using linear regression

Reference: Lantz, B. , Machine Learning with R, Second Edition, Packt Publishing, 2015

To install the package and active it

#install.packages("psych")
library(psych)

Step 1 – Collecting data

insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)

Step 2 – Exploring and preparing the data

To confirm if data is had expected

str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770
names(insurance)=c("age", "sex", "bmi", "children", "smoker", "region", "expenses")
summary(insurance$expenses)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770

Regression models require that every feature is numeric.

table(insurance$region)
## 
## northeast northwest southeast southwest 
##       324       325       364       325

Correlation matrix - from independent variables to dependent variables.

The diagonal is 1.0000000 since there is a prefect correlaction between variable and itself.

cor(insurance[c("age", "bmi", "children", "expenses")])
##                age       bmi   children   expenses
## age      1.0000000 0.1092719 0.04246900 0.29900819
## bmi      0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## expenses 0.2990082 0.1983410 0.06799823 1.00000000

Histograma

hist(insurance$expenses, col='grey')

Visualizing relationship among them and it gives the possibility to check the transpositions among them, i.e., there are patterns!

A kind of form more easer to visualize it. As we can see there are the same values showed before via correlation matrix: > cor(insurance[c(“age”, “bmi”, “children”, “expenses”)]):

pairs(insurance[c("age", "bmi", "children", "expenses")])

pairs.panels(insurance[c("age", "bmi", "children", "expenses")])

Step 3 – training a model on the data

To train the data the lm function (from stats packet) will be used. It will return the regression model object

that can be used to make predictions, i.e., that function will build te model.

To make predictions the predict function should be used.

head(insurance)
##   age    sex    bmi children smoker    region  expenses
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
ins_model <- lm(expenses ~ age + children + bmi + sex + 
                  smoker + region, data = insurance)

Step 4 – evaluating model performance

To pay attention to stars

summary(ins_model)
## 
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Step 5 – improving model performance

Model specification – adding non-linear relationships

In this case we can face the effect of age on medical expenditure may not be constant thoughout all age values.

The treatment can be more expensive for oldest people.

To add non-linear age to the model. This will allow the model to separate linear and non-linear impact of age.

insurance$age2 <- insurance$age^2
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex + 
                  smoker + region, data = insurance)

Transformation – converting a numeric variable to a binary indicator

Imagine if there is any feeling about the impact of BMI (body mass index) to obesity medical treatment.

1 for “me” 30 and 0 if less.

insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
insurance$bmi30
##    [1] 0 1 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1 0 0 0 1 0 1 1 0 0 0
##   [35] 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0
##   [69] 1 0 0 0 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0 1 0
##  [103] 1 0 0 0 0 1 0 1 1 0 1 1 1 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0
##  [137] 1 0 1 1 0 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 1 0
##  [171] 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1
##  [205] 0 0 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1
##  [239] 0 1 1 0 0 1 0 1 1 1 0 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 1 1 0 1 1 0 0 1
##  [273] 1 0 0 0 0 0 1 0 0 1 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1 1 1
##  [307] 0 1 1 1 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 1 1 1 0 1 0 1 1 0 0 1 0
##  [341] 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1 0 0 0 1 1 0 1 1 0 0 1 1
##  [375] 1 0 0 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 0 1 0 0
##  [409] 0 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 1 0 1 0 1 1
##  [443] 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 1 1 0 1 0 0
##  [477] 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 1 0 0 0
##  [511] 1 1 0 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1
##  [545] 1 0 1 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1
##  [579] 1 0 0 1 1 0 0 0 0 1 1 1 0 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 0 0 0 1 0 1
##  [613] 1 0 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 1 1
##  [647] 0 0 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 0 1 1 0 1 1 1 0
##  [681] 0 0 1 0 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 0 1 1
##  [715] 0 0 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0
##  [749] 1 1 0 0 1 0 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 1 0 1 0 1 1 1 1 1 0 0 1
##  [783] 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 0 0 1 1
##  [817] 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 1 1 1 0 1
##  [851] 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1
##  [885] 0 0 0 1 1 1 0 0 0 1 1 1 0 0 1 0 0 1 0 1 1 0 1 1 1 0 0 1 0 0 0 1 0 0
##  [919] 0 1 0 1 1 1 0 1 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 1 1 0 1 1 1 1 1 0 0 1
##  [953] 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1 0
##  [987] 1 0 1 0 0 0 1 0 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1
## [1021] 1 1 1 0 1 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0
## [1055] 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 0 0 1 1 0 0 1
## [1089] 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 1
## [1123] 1 1 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0
## [1157] 1 0 1 1 1 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0
## [1191] 1 0 1 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 0 0 0
## [1225] 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 0 0 1 1 0 0 0 0 0 1 1 0
## [1259] 1 0 0 1 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 1
## [1293] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1
## [1327] 1 1 0 1 0 1 1 1 1 1 0 0

Model specification – adding interaction effects

* (star) operator <=> : (colon) between two variables

Putting it all together – an improved regression model [Hands on!]

improvements

- Added a non-linear term for age

- Created an indicator for obesity

- Specified an interaction between besity and smoking

ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex + bmi30*smoker + region, data = insurance)

Conclusion

Our model is now explaining 87 percent of the variation in medical treatment costs

summary(ins_model2)
## 
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17296.4  -1656.0  -1263.3   -722.1  24160.2 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       134.2509  1362.7511   0.099 0.921539    
## age               -32.6851    59.8242  -0.546 0.584915    
## age2                3.7316     0.7463   5.000 6.50e-07 ***
## children          678.5612   105.8831   6.409 2.04e-10 ***
## bmi               120.0196    34.2660   3.503 0.000476 ***
## sexmale          -496.8245   244.3659  -2.033 0.042240 *  
## bmi30           -1000.1403   422.8402  -2.365 0.018159 *  
## smokeryes       13404.6866   439.9491  30.469  < 2e-16 ***
## regionnorthwest  -279.2038   349.2746  -0.799 0.424212    
## regionsoutheast  -828.5467   351.6352  -2.356 0.018604 *  
## regionsouthwest -1222.6437   350.5285  -3.488 0.000503 ***
## bmi30:smokeryes 19810.7533   604.6567  32.764  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653 
## F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16
summary(ins_model)
## 
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16