This project is from Book: Machine learning with R by Brett Lantz, chapter 6.
A link to the book https://bit.ly/3gsf2e0
This project is for educational purpose only.
The aim is to use patient data to forecast the average medical care expenses for such population segments.
Package pysch , but I will skip it right now.
#library(psych)
Data is created using demographic statistics from the US Census Bureau
#Read the csv file, I set stringAsFactor = TRUE as the data read as character
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
#Explore structure of the dataset
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ expenses: num 16885 1726 4449 21984 3867 ...
We will use expenses as our dependence variable
summary(insurance$expenses)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
#we find median < mean, we expect the distribution of the insurance expenses is right-skewed
hist(insurance$expenses)
#from the chart we found that this distribution is not ideal for linear regression.
#Exploring factor-type features
table(insurance$region)
##
## northeast northwest southeast southwest
## 324 325 364 325
#data distributed between the four region almost uniformly.
cor(insurance[c("age", "bmi", "children", "expenses")])
## age bmi children expenses
## age 1.0000000 0.10934101 0.04246900 0.29900819
## bmi 0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000
pairs(insurance[c("age", "bmi", "children", "expenses")])
The plots are showing some patterns but not completely clear, we will use pairs.panels to get more insights
#I will stop this chart as I didn't import library psych.
##pairs.panel(insurance[c("age", "bmi", "children", "expenses")])
we will run a linear regression model that relates the six independent variables to the total medical expenses.
ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region, data = insurance)
#we can run the model also
#ins_model <- lm(expenses ~ ., data = insurance) as we are using all variables in the model.
ins_model
##
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker +
## region, data = insurance)
##
## Coefficients:
## (Intercept) age children bmi
## -11941.6 256.8 475.7 339.3
## sexmale smokeryes regionnorthwest regionsoutheast
## -131.4 23847.5 -352.8 -1035.6
## regionsouthwest
## -959.3
summary(ins_model)
##
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11302.7 -2850.9 -979.6 1383.9 29981.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11941.6 987.8 -12.089 < 2e-16 ***
## age 256.8 11.9 21.586 < 2e-16 ***
## children 475.7 137.8 3.452 0.000574 ***
## bmi 339.3 28.6 11.864 < 2e-16 ***
## sexmale -131.3 332.9 -0.395 0.693255
## smokeryes 23847.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -352.8 476.3 -0.741 0.458976
## regionsoutheast -1035.6 478.7 -2.163 0.030685 *
## regionsouthwest -959.3 477.9 -2.007 0.044921 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.9 on 8 and 1329 DF, p-value: < 2.2e-16
In the residuals we can see that max residual (true value - predicted value) we have under-predicted expenses by nearly 30,000$ as we can see in the max residual. we can see that 50% of the population have errors between 1Q -2850.9 to 3Q 1383.9 USD which means that majority of predictions were between 2850 over the true value and 1389.9 under the true value
The pr(<|t|) indicate the p-value which provides an estimate of the probability that the true coefficient is zero given the value of the estimate. small p-value suggest that the true coefficient is very unlikely to be zero, which means that the feature is extremely unlikely to have no relationship with the dependent variable. some variables have stars beside them which specify the significance level met by the estimate.
Multiple R-squared value provides a measure of how well our model as a whole explains the values of the dependent variable. Adjusted R-squared penalize the model with a large number of independent variables.
We will add a new engineered feature age^2 so we can create a relation between the expenses and the age squared, but of course it is harder to explain nonlinear coefficients.
insurance$age2 <- insurance$age^2
We will add a feature BMI as a binary variable, 0 if it is below 30, and 1 more than 30 which mean obesity
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
we can find interaction between multiple features, like bmi30*smoker and include it in the formula. this will include bmi30 + smokeryes + bmi30:smokeryes.
ins_model2 <- lm(expenses ~ age + age2+ children + bmi + sex + bmi30*smoker + region, data = insurance)
summary(ins_model2)
##
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 *
## smoker + region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17297.1 -1656.0 -1262.7 -727.8 24161.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.0053 1363.1359 0.102 0.918792
## age -32.6181 59.8250 -0.545 0.585690
## age2 3.7307 0.7463 4.999 6.54e-07 ***
## children 678.6017 105.8855 6.409 2.03e-10 ***
## bmi 119.7715 34.2796 3.494 0.000492 ***
## sexmale -496.7690 244.3713 -2.033 0.042267 *
## bmi30 -997.9355 422.9607 -2.359 0.018449 *
## smokeryes 13404.5952 439.9591 30.468 < 2e-16 ***
## regionnorthwest -279.1661 349.2826 -0.799 0.424285
## regionsoutheast -828.0345 351.6484 -2.355 0.018682 *
## regionsouthwest -1222.1619 350.5314 -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534 604.6769 32.762 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8653
## F-statistic: 781.7 on 11 and 1326 DF, p-value: < 2.2e-16
R-squared increased from 0.75 to 0.87. so Our model explains now 87 percent of the variation in medical treatment costs.
#creating a new column named pred in the insurance data frame for model evaluation
insurance$pred <- predict(ins_model2, insurance)
cor(insurance$pred, insurance$expenses)
## [1] 0.9307999
The correlation of 0.93 suggests a very strong linear relationship between predicted and actual values.
plot(insurance$pred, insurance$expenses)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)
To use the model in predictions we will use the ins_model2 in prediction function along with the new data sets.
# we have three cases to predict
predict(ins_model2,
data.frame(age = 30, age2 = 30^2, children = 2,
bmi = 30, sex = "male", bmi30 = 1,
smoker = "no", region = "northeast"))
## 1
## 5973.774
predict(ins_model2,
data.frame(age = 30, age2 = 30^2, children = 2,
bmi = 30, sex = "female", bmi30 = 1,
smoker = "no", region = "northeast"))
## 1
## 6470.543
predict(ins_model2,
data.frame(age = 30, age2 = 30^2, children = 0,
bmi = 30, sex = "female", bmi30 = 1,
smoker = "no", region = "northeast"))
## 1
## 5113.34
From the first and the second example we can see the difference between male and female is 5973-6470= -496 which is the coefficient for example, so on average males are estimated to have about 496 USD less in expenses for the medical expenses per year.
From the third example we can see reducing children by two will reduce the insurance cost to be 5113, which the difference between 6470.543-5113.34 = 1357.203 which equals to 2 * 678 which the coefficient of children in our regression model.
developing linear regression will help to understand variables impacting dependent variable with a simple explanations, but there is always a way to improve our understanding and prediction.