For fully functional html features, please visit http://www.rpubs.com/jasonchanhku/medical
Libraries Used
library(plotly) #data visualization
library(RColorBrewer) #color palette
library(psych) #scatterplot matrix
library(prettydoc)
Objective
Medical expenses are difficult to estimate because the most costly conditions are rare and seemingly random. Still, some conditions are more prevalent for certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.
The goal of this project is to use patient data to estimate the average medical care expenses using regression for provided population segments.
Step 1: Data Exploration
The dataset is simulated containing hypothetical medical expenses for patients in the United Stated, released from the US Census Bureau.
The dataset includes 1,338 examples of beneficiaries currently enrolled in the insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year.
Data Preview
insurance <- read.csv(file = "Machine-Learning-with-R-datasets-master/insurance.csv", stringsAsFactors = TRUE)
knitr::kable(head(insurance), caption = NULL)
| age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|
| 19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
| 18 | male | 33.770 | 1 | no | southeast | 1725.552 |
| 28 | male | 33.000 | 3 | no | southeast | 4449.462 |
| 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
| 32 | male | 28.880 | 0 | no | northwest | 3866.855 |
| 31 | female | 25.740 | 0 | no | southeast | 3756.622 |
Features
From the dataset, the following has been idenfified as target variables and features:
Target Variable
The target variable is obviously the charges, which is also the dependent variable.
Key Features
The identified 6 key features are the following:
- age
- sex
- bmi
- children
- smoker
- region
Data Visualization (Part 1)
The data visualization will focus on identifying patterns of key features which clearly distinguishes a patient’s charges. Some fill identifies if the patient is a smoker.
Age
plot_ly(data = insurance, x = ~age, y = ~charges, color = ~smoker, colors = brewer.pal(3, "Set1"), type = "scatter")
Seems like higher charges are associated with smokers of older age. Perhaps the older the smoker gets, the more complicated his health conditions which results in higher charges.
BMI
plot_ly(data = insurance, x = ~bmi, y = ~charges, color = ~smoker, colors = brewer.pal(3, "Dark2"), type = "scatter")
Seems like higher charges are slightly associated with smokers with higher bmi. Perhaps the higher bmi of the smoker, the more prone to deseases which results in higher charges.
Sex
plot_ly(insurance, color = ~sex, y = ~charges, type = "box")
Higher charges are associated with males. This is because males are more common smokers than females.
No. of Children
insurance2 <- read.csv(file = "Machine-Learning-with-R-datasets-master/insurance.csv", stringsAsFactors = TRUE)
insurance2$children[insurance$children == 0] <- "a_Nil"
insurance2$children[insurance$children == 1] <- "b_One"
insurance2$children[insurance$children == 2] <- "c_Two"
insurance2$children[insurance$children == 3] <- "d_Three"
insurance2$children[insurance$children == 4] <- "e_Four"
insurance2$children[insurance$children == 5] <- "f_Five"
insurance2$children <- as.factor(insurance2$children)
plot_ly(data = insurance2, y = ~charges, color = ~children, type = "box")
As for number of children, there is no clear association of charges. Hence, expect low correlation for this feature. Perhaps having high charges is independent of number of childrens.
Region
plot_ly(data = insurance, y = ~charges, color = ~region, type = "box")
As for region, there is no clear association of charges. Hence, expect low correlation for this feature. Perhaps having high charges is independent of region.
Data Visualization (Part 2)
In this section, a scatterplot matrix is used to summarize most individual visualizations. This also includes correlation analysis all in one visualization. This is only for numeric features.
pairs.panels(insurance[c("age", "bmi", "children", "charges")])
Preliminary Insights
Based on data exploration, the following preliminary insights which leads to higher medical charges is identified:
- older age
- smoker
- male
- higher bmi (distinct after bmi > 30)
Also from the correlation segment of the scatterplot matrix, age and bmi has the largest correlation to charges.
Step 2: Model Training
As there is no need for data preparation, the dataset is now ready to be modeled and trained using R’s built-in regression function from the stats package. Bear in mind that there is no training set present. (Please click show code)
#building the regressor
ins_model <- lm(charges ~. , insurance)
#regressor preview
ins_model
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Coefficients:
## (Intercept) age sexmale bmi
## -11938.5 256.9 -131.3 339.2
## children smokeryes regionnorthwest regionsoutheast
## 475.5 23848.5 -353.0 -1035.0
## regionsouthwest
## -960.1
Interpreting the Regressor
Below are the interpretation of each coefficient from the table above:
- Intercept: Intercept offers of little value here as it is impossible to have independent variables as zero in reality
- Age: An increase in age by 1 year increases medical charges by $256.9 a year
- Sex: Males tend to have $131.3 less annually compared to females
- Bmi: Increase of Bmi by 1 unit increases annual charges of $339.2
- Children: Increase of 1 children increases charges by $475.5
- Smoker: Being a smoker increases charges by $23,848.5 annually
- Region: Northeast region has the highest average charges
Step 3: Model Evaluation
To evaluate the model, a summary is produced as below with several crucial indicators:
summary(ins_model)
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
Interpreting the Evaluation
The three key evaluators are the following:
Residuals: Residuals are actuals minus predicted. Max error of nearly $30,000 is substantial as the model underpredicts by this much. 50% of the data were over-predicted by $2,848 and underpredicted by $1,393
p-value: Small p values suggests likely to have relationship with charges and the coefficient is not zero. Indicates statistically significant. Many of the features are statistically significant, so it’s not a concern
Multiple R-squared: This is a measure of how much of variation is explained by the model. In this case, 75% of the variation is explained by the dependent variable.
Given the preceding three performance indicators, our model is performing fairly well. It is not uncommon for regression models of real-world data to have fairly low R-squared values; a value of 0.75 is actually quite good. The size of some of the errors is a bit concerning, but not surprising given the nature of medical expense data.
Step 4: Improving the Model
A key difference between the regression models and other machine learning approaches is that regression typically leaves feature selection and model specification to the user.
Accounting Non-linear Relationship
In linear regression, the relationship between an independent variable and the dependent variable may not necessarily be linear. For example, the effect of age on medical charges may not be constant throughout all the age values. This can be represented in a polynomial:
To add non linear (higher order) age to the model, the following is done (please click on show code)
#adding non linear to the model by creating new column
insurance$age2 <- insurance$age^2
#will put in lm() as charges ~ age + age2
Transformation - Numeric to Binary
Suppose the effect of a feature is not cumulative, rather it has an effect only after a specific threshold has been reached. For instance, BMI may have zero impact on medical expenditures for individuals in the normal weight range, but it may be strongly related to higher costs for the obese (BMI of 30 or above).
This relationship can be modeled by binary coding using the following code (please click on show code):
#binary coding for bmi feature
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
#will still use both bm and bmi30 in lm()
Feature Interaction
What if certain features have a combined impact on the dependent variable? For instance, smoking and obesity may combined effect may be worse than the sum of each one alone. In other words, these two features have interaction.
To implement interaction in the model, the following is done (please click show code):
#in the model in lm(), add charges ~ bmi30*smoker
# same as charges ~ bmi30 + smokeryes + bmi30:smokeryes
Improved Regression Model
Based on a subject matter knowledge of how medical costs may be related to patient characteristics, a more accurate regression model is specified with the following:
- added non linearity to a feature (age)
- binary coding for obesity (bmi >= 30)
- feature interaction of bmi and smoker
The following code implements the more specified regression model:
ins_model2 <- lm(charges ~ age + age2 + children + bmi + sex + bmi30*smoker + region, data = insurance)
summary(ins_model2)
##
## Call:
## lm(formula = charges ~ age + age2 + children + bmi + sex + bmi30 *
## smoker + region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17296.4 -1656.0 -1263.3 -722.1 24160.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.2509 1362.7511 0.099 0.921539
## age -32.6851 59.8242 -0.546 0.584915
## age2 3.7316 0.7463 5.000 6.50e-07 ***
## children 678.5612 105.8831 6.409 2.04e-10 ***
## bmi 120.0196 34.2660 3.503 0.000476 ***
## sexmale -496.8245 244.3659 -2.033 0.042240 *
## bmi30 -1000.1403 422.8402 -2.365 0.018159 *
## smokeryes 13404.6866 439.9491 30.469 < 2e-16 ***
## regionnorthwest -279.2038 349.2746 -0.799 0.424212
## regionsoutheast -828.5467 351.6352 -2.356 0.018604 *
## regionsouthwest -1222.6437 350.5285 -3.488 0.000503 ***
## bmi30:smokeryes 19810.7533 604.6567 32.764 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8653
## F-statistic: 781.7 on 11 and 1326 DF, p-value: < 2.2e-16
Improved Evaluation
Reduced Residuals: Max residual reduced from $30,000 to $24,160 and reduced interquartile range
Improved \(R^{2}\) : Improvement from 75% to 87%. A lot more of the variation in charges is explained by our features
Statistically Significant: The specified features bmi30 and age2 are statistically significant
Improved Model Test
The model will predict the medical charges with the following inputs:
test <- data.frame("age" = 50, "sex" = "female", bmi = 33, children = 2, smoker = "yes", region = "northwest", age2 = 2500, "bmi30" = 1)
test
## age sex bmi children smoker region age2 bmi30
## 1 50 female 33 2 yes northwest 2500 1
After running it through the model, the medical charges are:
ins_pred <- predict(ins_model2, test)
ins_pred
## 1
## 45082.8