Introduction

This project is from Book: Machine learning with R by Brett Lantz, chapter 6.

A link to the book https://bit.ly/3gsf2e0

This project is for educational purpose only.

The aim is to use patient data to forecast the average medical care expenses for such population segments.

Required packages

Package pysch , but I will skip it right now.

#library(psych)

Step 1 - collecting data

Data is created using demographic statistics from the US Census Bureau

Step 2 - exploring and preparing the data

#Read the csv file, I set stringAsFactor = TRUE as the data read as character
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)

#Explore structure of the dataset
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...

We will use expenses as our dependence variable

summary(insurance$expenses)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770
#we find median < mean, we expect the distribution of the insurance expenses is right-skewed
hist(insurance$expenses)

#from the chart we found that this distribution is not ideal for linear regression.
#Exploring factor-type features
table(insurance$region)
## 
## northeast northwest southeast southwest 
##       324       325       364       325
#data distributed between the four region almost uniformly.

Exploring relationships amont features - the Correlation matrix

cor(insurance[c("age", "bmi", "children", "expenses")])
##                age        bmi   children   expenses
## age      1.0000000 0.10934101 0.04246900 0.29900819
## bmi      0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000

Visualizing relationships amont features - the scatterplot matrix

pairs(insurance[c("age", "bmi", "children", "expenses")])

The plots are showing some patterns but not completely clear, we will use pairs.panels to get more insights

#I will stop this chart as I didn't import library psych.
##pairs.panel(insurance[c("age", "bmi", "children", "expenses")])

Step 3 - training a model on the data

we will run a linear regression model that relates the six independent variables to the total medical expenses.

ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region, data = insurance)
#we can run the model also 
#ins_model <- lm(expenses ~ ., data = insurance) as we are using all variables in the model.
ins_model
## 
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker + 
##     region, data = insurance)
## 
## Coefficients:
##     (Intercept)              age         children              bmi  
##        -11941.6            256.8            475.7            339.3  
##         sexmale        smokeryes  regionnorthwest  regionsoutheast  
##          -131.4          23847.5           -352.8          -1035.6  
## regionsouthwest  
##          -959.3

Step 4 - evaluating model performance

summary(ins_model)
## 
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11302.7  -2850.9   -979.6   1383.9  29981.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
## age                256.8       11.9  21.586  < 2e-16 ***
## children           475.7      137.8   3.452 0.000574 ***
## bmi                339.3       28.6  11.864  < 2e-16 ***
## sexmale           -131.3      332.9  -0.395 0.693255    
## smokeryes        23847.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -352.8      476.3  -0.741 0.458976    
## regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
## regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

In the residuals we can see that max residual (true value - predicted value) we have under-predicted expenses by nearly 30,000$ as we can see in the max residual. we can see that 50% of the population have errors between 1Q -2850.9 to 3Q 1383.9 USD which means that majority of predictions were between 2850 over the true value and 1389.9 under the true value

The pr(<|t|) indicate the p-value which provides an estimate of the probability that the true coefficient is zero given the value of the estimate. small p-value suggest that the true coefficient is very unlikely to be zero, which means that the feature is extremely unlikely to have no relationship with the dependent variable. some variables have stars beside them which specify the significance level met by the estimate.

Multiple R-squared value provides a measure of how well our model as a whole explains the values of the dependent variable. Adjusted R-squared penalize the model with a large number of independent variables.

Step 5 - improving model performance

Model specification - adding nonlinear relationship

We will add a new engineered feature age^2 so we can create a relation between the expenses and the age squared, but of course it is harder to explain nonlinear coefficients.

insurance$age2 <- insurance$age^2

Transofrmation - converting a numeric variable to a binary indicator

We will add a feature BMI as a binary variable, 0 if it is below 30, and 1 more than 30 which mean obesity

insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

Model specification - adding interaction effects

we can find interaction between multiple features, like bmi30*smoker and include it in the formula. this will include bmi30 + smokeryes + bmi30:smokeryes.

Putting it all together - an improved regression model

ins_model2 <- lm(expenses ~ age + age2+ children + bmi + sex + bmi30*smoker + region, data = insurance)

summary(ins_model2)
## 
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17297.1  -1656.0  -1262.7   -727.8  24161.6 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       139.0053  1363.1359   0.102 0.918792    
## age               -32.6181    59.8250  -0.545 0.585690    
## age2                3.7307     0.7463   4.999 6.54e-07 ***
## children          678.6017   105.8855   6.409 2.03e-10 ***
## bmi               119.7715    34.2796   3.494 0.000492 ***
## sexmale          -496.7690   244.3713  -2.033 0.042267 *  
## bmi30            -997.9355   422.9607  -2.359 0.018449 *  
## smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
## regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
## regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
## regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653 
## F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

R-squared increased from 0.75 to 0.87. so Our model explains now 87 percent of the variation in medical treatment costs.

Making predictions with a regression model

#creating a new column named pred in the insurance data frame for model evaluation
insurance$pred <- predict(ins_model2, insurance)

cor(insurance$pred, insurance$expenses)
## [1] 0.9307999

The correlation of 0.93 suggests a very strong linear relationship between predicted and actual values.

plot(insurance$pred, insurance$expenses)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)

To use the model in predictions we will use the ins_model2 in prediction function along with the new data sets.

# we have three cases to predict

predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "male", bmi30 = 1,
                   smoker = "no", region = "northeast"))
##        1 
## 5973.774
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "female", bmi30 = 1,
                   smoker = "no", region = "northeast"))
##        1 
## 6470.543
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 0,
                   bmi = 30, sex = "female", bmi30 = 1,
                   smoker = "no", region = "northeast"))
##       1 
## 5113.34

From the first and the second example we can see the difference between male and female is 5973-6470= -496 which is the coefficient for example, so on average males are estimated to have about 496 USD less in expenses for the medical expenses per year.

From the third example we can see reducing children by two will reduce the insurance cost to be 5113, which the difference between 6470.543-5113.34 = 1357.203 which equals to 2 * 678 which the coefficient of children in our regression model.

Summary

developing linear regression will help to understand variables impacting dependent variable with a simple explanations, but there is always a way to improve our understanding and prediction.