For fully functional html features, please visit http://www.rpubs.com/jasonchanhku/medical

Libraries Used

library(plotly) #data visualization
library(RColorBrewer) #color palette
library(psych) #scatterplot matrix
library(prettydoc) #html document themes


Objective

Medical expenses are difficult to estimate because the most costly conditions are rare and seemingly random. Still, some conditions are more prevalent for certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.

The goal of this project is to use patient data and regression to estimate the average medical care expenses for given population segments.


Step 1: Data Exploration

The dataset is simulated: it contains hypothetical medical expenses for patients in the United States, constructed using demographic statistics from the US Census Bureau.

The dataset includes 1,338 examples of beneficiaries currently enrolled in the insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year.

Data Preview

insurance <- read.csv(file = "Machine-Learning-with-R-datasets-master/insurance.csv", stringsAsFactors = TRUE)

knitr::kable(head(insurance), caption = NULL)
age  sex     bmi     children  smoker  region     charges
19   female  27.900  0         yes     southwest  16884.924
18   male    33.770  1         no      southeast   1725.552
28   male    33.000  3         no      southeast   4449.462
33   male    22.705  0         no      northwest  21984.471
32   male    28.880  0         no      northwest   3866.855
31   female  25.740  0         no      southeast   3756.622
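
As a quick complement to the preview, the distribution of the target variable can be summarised; a minimal sketch, output omitted:

#five-number summary plus mean of the medical charges
summary(insurance$charges)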

Features

From the dataset, the following have been identified as the target variable and key features:

Target Variable
The target variable is charges, the total medical expenses charged to the plan for the year; it serves as the dependent variable in the regression.

Key Features
The six key features identified are the following (their types can be confirmed with the str() sketch after the list):

  • age
  • sex
  • bmi
  • children
  • smoker
  • region
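
The structure of these features (numeric versus factor) can be confirmed with a quick str() call; a minimal sketch, output omitted:

#inspect the structure and types of all columns
str(insurance)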


Data Visualization (Part 1)

The data visualizations below look for patterns in the key features that clearly distinguish a patient's charges. In the scatterplots, point colour identifies whether the patient is a smoker.

Age

plot_ly(data = insurance, x = ~age, y = ~charges, color = ~smoker, colors = brewer.pal(3, "Set1"), type = "scatter", mode = "markers")

Higher charges appear to be associated with older smokers. Perhaps as a smoker ages, their health conditions become more complicated, resulting in higher charges.

BMI

plot_ly(data = insurance, x = ~bmi, y = ~charges, color = ~smoker, colors = brewer.pal(3, "Dark2"), type = "scatter", mode = "markers")

Higher charges also appear to be slightly associated with smokers with a higher BMI. Perhaps a higher BMI makes a smoker more prone to disease, resulting in higher charges.

Sex

plot_ly(insurance, color = ~sex, y = ~charges, type = "box")

Higher charges are associated with males, possibly because smoking is more common among males in this dataset (the cross-tabulation sketched below can confirm this).
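
The claim can be checked with a simple cross-tabulation of sex against smoking status; a minimal sketch, output omitted:

#count of smokers vs non-smokers by sex
table(insurance$sex, insurance$smoker)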

No. of Children

insurance2 <- read.csv(file = "Machine-Learning-with-R-datasets-master/insurance.csv", stringsAsFactors = TRUE)

#recode number of children as a labelled factor so the box plots sort correctly
insurance2$children <- factor(insurance2$children, levels = 0:5,
                              labels = c("a_Nil", "b_One", "c_Two", "d_Three", "e_Four", "f_Five"))
plot_ly(data = insurance2, y = ~charges, color = ~children, type = "box")

As for the number of children, there is no clear association with charges, so a low correlation is expected for this feature. Charges may well be independent of the number of children.

Region

plot_ly(data = insurance, y = ~charges, color = ~region, type = "box")

As for region, there is likewise no clear association with charges, so a low correlation is expected for this feature. Charges may well be independent of region.

Data Visualization (Part 2)

In this section, a scatterplot matrix summarises most of the individual visualizations and adds correlation analysis in a single figure. It covers only the numeric features.

pairs.panels(insurance[c("age", "bmi", "children", "charges")])

Preliminary Insights

Based on the data exploration, the following characteristics are preliminarily identified as being associated with higher medical charges:

  • older age
  • smoker
  • male
  • higher bmi (distinct after bmi > 30)

Also, from the correlation panel of the scatterplot matrix, age and bmi have the largest correlations with charges.
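
The same correlations can be computed directly from the numeric features; a minimal sketch, output omitted:

#correlation matrix of the numeric features and the target
cor(insurance[c("age", "bmi", "children", "charges")])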


Step 2: Model Training

As no further data preparation is needed, the dataset is ready to be modelled using R's built-in lm() regression function from the stats package. Bear in mind that no train/test split is performed; the model is fit on the full dataset.

#building the regressor
ins_model <- lm(charges ~ ., data = insurance)

#regressor preview
ins_model
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Coefficients:
##     (Intercept)              age          sexmale              bmi  
##        -11938.5            256.9           -131.3            339.2  
##        children        smokeryes  regionnorthwest  regionsoutheast  
##           475.5          23848.5           -353.0          -1035.0  
## regionsouthwest  
##          -960.1

Interpreting the Regressor

Below is the interpretation of each coefficient from the output above (a sketch of lm()'s dummy coding for the factor features follows this list):

  • Intercept: the intercept is of little practical value here, since a patient with all independent variables equal to zero is not realistic
  • Age: each additional year of age is associated with an average increase of $256.90 in annual medical charges
  • Sex: males tend to have annual charges about $131.30 lower than females
  • Bmi: each additional unit of BMI is associated with an increase of $339.20 in annual charges
  • Children: each additional child is associated with an increase of $475.50 in annual charges
  • Smoker: being a smoker increases annual charges by an estimated $23,848.50
  • Region: the northeast (the reference level) has the highest average charges; the coefficients for the other regions are all negative relative to it
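
The dummy (reference) coding that lm() applies to the factor features, which is why terms such as sexmale, smokeryes and regionnorthwest appear above, can be inspected directly; a minimal sketch, output omitted:

#design matrix lm() builds; each factor level except the reference becomes a 0/1 column
head(model.matrix(charges ~ ., data = insurance))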


Step 3: Model Evaluation

To evaluate the model, a summary is produced as below with several crucial indicators:

summary(ins_model)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Interpreting the Evaluation

The three key evaluators are the following:

  • Residuals: residuals are actual values minus predicted values. The maximum error of nearly $30,000 is substantial, as the model underpredicts one observation by that much. The middle 50% of residuals fall between -$2,848 (over-prediction) and $1,394 (under-prediction), as the sketch after this list re-derives

  • p-value: a small p-value suggests the feature likely has a genuine relationship with charges, i.e. its coefficient is unlikely to be zero, indicating statistical significance. Most of the features are statistically significant, which is reassuring

  • Multiple R-squared: this measures how much of the variation in the dependent variable is explained by the model. In this case, the features explain about 75% of the variation in charges.
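
To make these numbers concrete, the residual quartiles and R-squared can be recomputed directly from the fitted model; a minimal sketch, output omitted:

#verify the figures reported by summary(ins_model)
pred <- fitted(ins_model)
res  <- insurance$charges - pred
summary(res)                       #matches the Residuals block above
cor(pred, insurance$charges)^2     #matches Multiple R-squared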

Given the preceding three performance indicators, our model is performing fairly well. It is not uncommon for regression models of real-world data to have fairly low R-squared values; a value of 0.75 is actually quite good. The size of some of the errors is a bit concerning, but not surprising given the nature of medical expense data.


Step 4: Improving the Model

A key difference between regression models and other machine learning approaches is that regression typically leaves feature selection and model specification to the user.

Accounting Non-linear Relationship

Linear regression assumes the relationship between an independent variable and the dependent variable is linear, but this may not hold in practice. For example, the effect of age on medical charges may not be constant across all ages. This can be represented with a polynomial term:

\[y = \alpha + \beta_1 x + \beta_2 x^2\]

To add the non-linear (higher-order) age term to the model, a squared-age column is created:

#adding non linear to the model by creating new column
insurance$age2 <- insurance$age^2

#will put in lm() as charges ~ age + age2
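
As an alternative sketch, the quadratic term can also be written directly in the formula with I(), without creating a new column; this is assumed to be equivalent to the age2 approach above:

#quadratic age term specified inline in the formula
lm(charges ~ age + I(age^2) + children + bmi + sex + smoker + region, data = insurance)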

Transformation - Numeric to Binary

Suppose the effect of a feature is not cumulative, rather it has an effect only after a specific threshold has been reached. For instance, BMI may have zero impact on medical expenditures for individuals in the normal weight range, but it may be strongly related to higher costs for the obese (BMI of 30 or above).

This relationship can be modelled with a binary indicator, created using the following code:

#binary coding for bmi feature
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

#will still use both bmi and bmi30 in lm()

Feature Interaction

What if certain features have a combined impact on the dependent variable? For instance, the combined effect of smoking and obesity may be worse than the sum of each one alone. In other words, these two features interact.

To implement the interaction in the model, the following is done:

#in the model in lm(), add charges ~ bmi30*smoker
# same as charges ~ bmi30 + smokeryes + bmi30:smokeryes

Improved Regression Model

Based on subject-matter knowledge of how medical costs may relate to patient characteristics, a more accurate regression model is specified with the following:

  • added non-linearity for a feature (age)
  • binary coding for obesity (bmi >= 30)
  • a feature interaction between obesity (bmi30) and smoker

The following code implements the more specified regression model:

ins_model2 <- lm(charges ~ age + age2 + children + bmi + sex + bmi30*smoker + region, data = insurance)

summary(ins_model2)
## 
## Call:
## lm(formula = charges ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17296.4  -1656.0  -1263.3   -722.1  24160.2 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       134.2509  1362.7511   0.099 0.921539    
## age               -32.6851    59.8242  -0.546 0.584915    
## age2                3.7316     0.7463   5.000 6.50e-07 ***
## children          678.5612   105.8831   6.409 2.04e-10 ***
## bmi               120.0196    34.2660   3.503 0.000476 ***
## sexmale          -496.8245   244.3659  -2.033 0.042240 *  
## bmi30           -1000.1403   422.8402  -2.365 0.018159 *  
## smokeryes       13404.6866   439.9491  30.469  < 2e-16 ***
## regionnorthwest  -279.2038   349.2746  -0.799 0.424212    
## regionsoutheast  -828.5467   351.6352  -2.356 0.018604 *  
## regionsouthwest -1222.6437   350.5285  -3.488 0.000503 ***
## bmi30:smokeryes 19810.7533   604.6567  32.764  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653 
## F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

Improved Evaluation

  • Reduced residuals: the maximum residual dropped from roughly $30,000 to about $24,160, and the interquartile range of the residuals narrowed

  • Improved \(R^{2}\): R-squared improved from 75% to 87%, so much more of the variation in charges is explained by the features

  • Statistically significant additions: the engineered terms age2, bmi30 and the bmi30:smoker interaction are statistically significant (a quick adjusted R-squared comparison is sketched below)
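
For a side-by-side check of the improvement, the adjusted R-squared of both fitted models can be pulled out programmatically; a minimal sketch, output omitted:

#compare adjusted R-squared of the original and improved models
summary(ins_model)$adj.r.squared
summary(ins_model2)$adj.r.squared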

Improved Model Test

The model will predict the medical charges with the following inputs:

test <- data.frame(age = 50, sex = "female", bmi = 33, children = 2, smoker = "yes", region = "northwest", age2 = 2500, bmi30 = 1)

test
##   age    sex bmi children smoker    region age2 bmi30
## 1  50 female  33        2    yes northwest 2500     1
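
Note that the engineered columns age2 and bmi30 must stay consistent with the raw inputs; rather than hard-coding them, they can be derived, which yields the same values shown above (a minimal sketch):

#recompute the engineered features from the raw inputs
test$age2  <- test$age^2                    #50^2 = 2500
test$bmi30 <- ifelse(test$bmi >= 30, 1, 0)  #bmi of 33 gives 1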

After running it through the model, the medical charges are:

ins_pred <- predict(ins_model2, test)
ins_pred
##       1 
## 45082.8
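
If a range around this point estimate is wanted, predict() for lm objects can also return a prediction interval; a minimal sketch, output omitted:

#95% prediction interval for the same test case
predict(ins_model2, test, interval = "prediction", level = 0.95)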