Data: simuated data for hypothetical medical expenses for patients in the United States from the Packt Publishing group (www.packtpub.com)

setwd("C:/Users/Owner/Desktop/MachineLearningR_sampleData")
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
head(insurance)
##   age    sex  bmi children smoker    region expenses
## 1  19 female 27.9        0    yes southwest 16884.92
## 2  18   male 33.8        1     no southeast  1725.55
## 3  28   male 33.0        3     no southeast  4449.46
## 4  33   male 22.7        0     no northwest 21984.47
## 5  32   male 28.9        0     no northwest  3866.86
## 6  31 female 25.7        0     no southeast  3756.62

There are 6 independent features and the expenses column (dependent variable).

Independent variables:

  1. The age of the primary beneficiary, (2) the sex (gender) of the primary beneficiary (3) The body mass index (BMI) of the primary beneficiary (4) The number of children that the primary beneficiary has (5) Weather or not the primary beneficiary smokes (6) the geographic location of the primary beneficiary.

Data exploration

str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...
summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :16.00   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.67   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.70   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.10   Max.   :5.000             
##        region       expenses    
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

Number of examples

1338 people enrolled as primary beneficiaries in this program

Types of variables

Age, BMI, number of children and expenses are numeric variables Sex, gender and smoking status are non-numeric

Distribution of medical expenses (dependent variable)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
ggplot(insurance, aes(x = expenses)) +
  theme_bw() +
  geom_histogram(binwidth = 7500) +
  labs(y = "Frequency",
       x = "Expenses",
       title = "Medical Expenses Distribtion") +
  theme(text = element_text(size=20))

The distribution is positively skewed Most of the people enroll in this program have between $0 and $20,000 medical expenses each year

Corelation matrix of the variables in the insurance dataset

cor(insurance[c("age", "bmi", "children", "expenses")])
##                age        bmi   children   expenses
## age      1.0000000 0.10934101 0.04246900 0.29900819
## bmi      0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000
library(psych)
## Warning: package 'psych' was built under R version 3.4.2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])

  • There is a weak positive correlation between age & medical expenses, BMI & medical expenses and number of children & medical expenses.

Data transformation

age

The older we get, we turn to disproportionately spend more on our health insurance

I will create 2 new features called “age2” and “age3” which has values that are two and three fold the original age values respectively.

insurance$age2 <- insurance$age^2
insurance$age3 <- insurance$age^3

bmi

BMI has a threshold effect on our health. A BMI <= 30 might not affect the health or medical expenses of an individual but if the person becomes obese with a BMI > 30, they might see an increase in their medical expenses.

I will convert the numeric BMI into a binary indicator that distinguishes obese from non-obese individuals

insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

Combine effect of obesity and smoking.

The combination of Obesity and smoking might have a more profound effect on our health and medical expenses.

Combine effect of aging and smoking

Also aging and smoking can have a more negative effect on our health, which can lead to increases on our medical expenses.

Split data

Step 1: Randomize our dataset

insurance <- insurance[sample(nrow(insurance)),]

Step 2: we will split our data into training and testing datasets.

insurance_train <-insurance[1:1100, ]
insurance_test <-insurance[1101:1338, ]

Model training

insurance_model <- lm(expenses ~ age + age2 + age3 + children + bmi + sex +
bmi30*smoker + age3*bmi + age3*smoker + region, data = insurance_train)
insurance_model
## 
## Call:
## lm(formula = expenses ~ age + age2 + age3 + children + bmi + 
##     sex + bmi30 * smoker + age3 * bmi + age3 * smoker + region, 
##     data = insurance_train)
## 
## Coefficients:
##     (Intercept)              age             age2             age3  
##       2.635e+03       -2.848e+02        1.033e+01       -5.170e-02  
##        children              bmi          sexmale            bmi30  
##       7.003e+02        1.444e+02       -6.273e+02       -1.201e+03  
##       smokeryes  regionnorthwest  regionsoutheast  regionsouthwest  
##       1.352e+04       -3.091e+02       -1.115e+03       -1.441e+03  
## bmi30:smokeryes         age3:bmi   age3:smokeryes  
##       1.979e+04       -5.688e-05       -1.141e-03

The intercept values are not very informative for this particular purpose because for some of the variables, there is no true zero. For example, we can not have a BMI of zero.

Model evaluation

summary(insurance_model)
## 
## Call:
## lm(formula = expenses ~ age + age2 + age3 + children + bmi + 
##     sex + bmi30 * smoker + age3 * bmi + age3 * smoker + region, 
##     data = insurance_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17091.0  -1771.7  -1274.1   -682.4  22661.8 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.635e+03  3.854e+03   0.684 0.494241    
## age             -2.848e+02  3.087e+02  -0.923 0.356379    
## age2             1.033e+01  8.100e+00   1.275 0.202434    
## age3            -5.170e-02  6.809e-02  -0.759 0.447871    
## children         7.003e+02  1.220e+02   5.741 1.22e-08 ***
## bmi              1.444e+02  4.588e+01   3.148 0.001691 ** 
## sexmale         -6.273e+02  2.759e+02  -2.274 0.023181 *  
## bmi30           -1.201e+03  4.782e+02  -2.511 0.012197 *  
## smokeryes        1.352e+04  5.964e+02  22.674  < 2e-16 ***
## regionnorthwest -3.091e+02  3.936e+02  -0.785 0.432440    
## regionsoutheast -1.115e+03  3.967e+02  -2.812 0.005014 ** 
## regionsouthwest -1.441e+03  3.959e+02  -3.640 0.000285 ***
## bmi30:smokeryes  1.979e+04  6.814e+02  29.049  < 2e-16 ***
## age3:bmi        -5.688e-05  3.144e-04  -0.181 0.856452    
## age3:smokeryes  -1.141e-03  4.603e-03  -0.248 0.804264    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4550 on 1085 degrees of freedom
## Multiple R-squared:  0.8615, Adjusted R-squared:  0.8597 
## F-statistic:   482 on 14 and 1085 DF,  p-value: < 2.2e-16

Multiple R-squared

The multiple R-squared value of ~ 0.85 suggest that this model can explain about 85% of the variation in the expenses (dependent variable).

Prediction using the test examples

prediction<-predict(insurance_model, insurance_test)
cor(prediction, insurance_test$expenses)
## [1] 0.9431769

There is an excellent corelation (> 0.90) between the predicted expenses to the true expenses in the testing dataset.

References