Linear Regression on Health Insurance Cost Prediction
Introduction
This analysis uses the Medical Cost Personal Datasets from Kaggle. Many factors that affect how much you pay for health insurance are not within your control, and medical costs are hard to predict because much of the spending comes from rare conditions. Nonetheless, it is useful to understand what drives them. Using linear regression, we will build a model that predicts insurance charges from personal data such as age, Body Mass Index (BMI), and smoking status. We will also determine which variable most strongly influences insurance costs. Such estimates could be used to create actuarial tables that set yearly premiums higher or lower according to the expected treatment costs.
Data Preparation
Load the required packages.
library(dplyr)
library(ggplot2)
library(caret)
library(MLmetrics)
library(car)
library(lmtest)
library(GGally)
library(kableExtra)
Load Dataset
insurance <- read.csv("data_input/insurance.csv")
rmarkdown::paged_table(insurance)
Data Wrangling
glimpse(insurance)
Rows: 1,338
Columns: 7
$ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
$ sex <chr> "female", "male", "male", "male", "male", "female", "female",~
$ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
$ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
$ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
$ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
$ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
Here is a description of each column in the insurance dataset:
age: age of the primary beneficiary.
sex: insurance contractor gender, female or male.
bmi: body mass index, an objective index of body weight relative to height (kg/m²); the ideal range is 18.5 to 24.9.
children: number of children covered by health insurance / number of dependents.
smoker: smoking status, yes or no.
region: the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest.
charges: individual medical costs billed by health insurance (the target variable).
The data has 1338 rows and 7 columns. Since we are predicting insurance costs, charges will be our target feature.
Before we go further, we need to make sure the data is clean and useful: check for duplicates and missing values, and correct the data type of any column that is stored with the wrong type.
# check duplicated data
insurance[duplicated(insurance), ]
    age  sex   bmi children smoker    region  charges
582  19 male 30.59        0     no northwest 1639.563
There is one duplicated row. It is unlikely that two different people share the same age, sex, bmi, and number of children, live in the same region, are both non-smokers, and have exactly the same medical charges, so we can drop this row.
insurance <- insurance %>% distinct()
# check missing values
colSums(is.na(insurance))
     age      sex      bmi children   smoker   region  charges
       0        0        0        0        0        0        0
The output shows that the dataset has no missing values. From the glimpse output, there are three columns we should convert to a categorical (factor) data type: sex, smoker, and region.
# transform character into factor
insurance <- insurance %>% mutate_if(~is.character(.), ~as.factor(.))
glimpse(insurance)
Rows: 1,337
Columns: 7
$ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
$ sex <fct> female, male, male, male, male, female, female, female, male,~
$ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
$ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
$ smoker <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, yes~
$ region <fct> southwest, southeast, southeast, northwest, northwest, southe~
$ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
Exploratory Data Analysis
Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlations between them.
summary(insurance)
      age            sex           bmi           children        smoker
Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1063
1st Qu.:27.00 male :675 1st Qu.:26.29 1st Qu.:0.000 yes: 274
Median :39.00 Median :30.40 Median :1.000
Mean :39.22 Mean :30.66 Mean :1.096
3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
Max. :64.00 Max. :53.13 Max. :5.000
region charges
northeast:324 Min. : 1122
northwest:324 1st Qu.: 4746
southeast:364 Median : 9386
southwest:325 Mean :13279
3rd Qu.:16658
Max. :63770
In terms of categorical features, the dataset has a similar number of people in each category, except for smoker: there are more non-smokers than smokers, which makes sense. The charges themselves vary greatly, from around $1,000 to $64,000.
Let’s check the distribution of charges.
ggplot(data = insurance, aes(x = charges)) +
geom_density(alpha = 0.5) +
ggtitle("Distribution of Charges")The distribution is right skewed with a long tail to the right. There’s a bump at around $40,000, perhaps another hidden distribution. To dig this up, we need categorical features.
Let's check the boxplots of sex, region, children, and smoker against the health insurance charges.
for (col in c('sex', 'region', 'children', 'smoker')) {
plot <- ggplot(data = insurance,
aes_string(x = col, y = 'charges', group = col, fill = col)) +
geom_boxplot(show.legend = FALSE) +
ggtitle(glue::glue("Boxplot of Medical Charges per {col}"))
print(plot)
}
sex and region don't show noticeable differences across their categories in terms of charges. We can see an increasing trend in charges as children increases. Lastly, smoker seems to make a significant difference to the charges billed by health insurance.
Let's check the distribution of charges split by smoking status.
ggplot(data = insurance, aes(x = charges, fill = smoker)) +
geom_density(alpha = 0.5) +
ggtitle("Distribution of Charges per Smoking Category")Smokers definitely have more charges than non-smokers.
Let's check medical charges by age, bmi, and children, split by the smoker factor.
for (feat in c('age', 'bmi', 'children')) {
plot <- ggplot(data = insurance, aes_string(x = feat, y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) +
geom_jitter() +
geom_smooth(method = 'lm') +
ggtitle(glue::glue("Charges vs {feat}"))
print(plot)
}
smoker appears to have the largest impact on medical charges, even though charges also grow with age, bmi, and children. Also, people who have more children generally smoke less.
Let's check the correlations between features using ggcorr:
ggcorr(insurance %>% mutate_if(is.factor, as.numeric), label = TRUE, label_size = 3, hjust = .85, size = 3)
The strongest and most meaningful correlation is between smoker and charges; for the other features the correlation with charges is low, so their relationships with the target are not strongly linear.
Modeling
Splitting the Data
RNGkind(sample.kind = "Rounding") # so the output is reproducible across R versions
set.seed(417) # lock the random numbers generated by the sample() function
index <- sample(nrow(insurance), nrow(insurance)*0.8)
data_train <- insurance[index, ]
data_test <- insurance[-index, ]
Linear Regression
Here we first create a model using a single predictor. From the ggcorr plot, the second most correlated predictor is age; since smoker, the most correlated predictor, is a categorical variable and would enter the model as a dummy variable, we start with age.
model_1 <- lm(charges ~ age, data = data_train)
summary(model_1)
Call:
lm(formula = charges ~ age, data = data_train)
Residuals:
Min 1Q Median 3Q Max
-8144 -6745 -5995 5430 47718
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3036.49 1070.88 2.836 0.00466 **
age 263.08 25.63 10.264 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11640 on 1067 degrees of freedom
Multiple R-squared: 0.08986, Adjusted R-squared: 0.089
F-statistic: 105.3 on 1 and 1067 DF, p-value: < 0.00000000000000022
From the Pr(>|t|) column (p < 2.2e-16), the regression coefficient for age is significantly different from zero (p < 0.001) and indicates an expected increase of about 263.08 in charges for every one-year increase in age. The multiple R-squared (0.0899) indicates that the model accounts for only about 9% of the variance in charges, which is very low, so this is not a good model. The multiple R-squared is also the squared correlation between the actual and predicted values. The residual standard error (11640) is very high; it can be thought of as the average error when predicting charges from age with this model.
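As a quick sanity check on this interpretation, we can evaluate the fitted line at a couple of ages (a small sketch with hypothetical ages; the exact values depend on the random train split):
# predicted charges for a 30- and a 50-year-old under model_1,
# i.e. intercept + 263.08 * age (hypothetical illustrative ages)
predict(model_1, newdata = data.frame(age = c(30, 50)))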
Next we try automatic feature selection with stepwise regression using backward elimination. Starting from all features, backward elimination iteratively discards predictors and re-evaluates the model until it finds the one with the lowest Akaike Information Criterion (AIC). Given a collection of models for the data, AIC estimates the quality of each model relative to the others in terms of information loss; a lower AIC means a better model. We'll use the step() function to apply backward elimination.
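For reference, AIC values of fitted models can also be inspected directly (a small side check, not part of the original workflow); once model_2 is fitted below, AIC(model_1, model_2) would compare the two:
# AIC of the simple one-predictor model; lower AIC indicates a better
# trade-off between fit and complexity
AIC(model_1)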
model_all <- lm(charges ~., data_train)
step(model_all, direction = "backward")
Start:  AIC=18653.07
charges ~ age + sex + bmi + children + smoker + region
Df Sum of Sq RSS AIC
- region 3 125695795 39909377669 18650
- sex 1 26583308 39810265182 18652
<none> 39783681874 18653
- children 1 325925064 40109606938 18660
- bmi 1 4310717833 44094399707 18761
- age 1 13605497970 53389179844 18966
- smoker 1 98109106718 137892788592 19980
Step: AIC=18650.44
charges ~ age + sex + bmi + children + smoker
Df Sum of Sq RSS AIC
- sex 1 25761162 39935138831 18649
<none> 39909377669 18650
- children 1 323234651 40232612320 18657
- bmi 1 4419512000 44328889669 18761
- age 1 13711875099 53621252768 18964
- smoker 1 98888881401 138798259070 19981
Step: AIC=18649.13
charges ~ age + bmi + children + smoker
Df Sum of Sq RSS AIC
<none> 39935138831 18649
- children 1 321259117 40256397948 18656
- bmi 1 4400297661 44335436492 18759
- age 1 13744865480 53680004310 18963
- smoker 1 99607627702 139542766532 19985
Call:
lm(formula = charges ~ age + bmi + children + smoker, data = data_train)
Coefficients:
(Intercept) age bmi children smokeryes
-12599.2 259.7 336.4 455.7 23801.5
Save the backward-eliminated model into a model_2 object, inspect its summary, then predict. After that, we calculate the evaluation metrics.
model_2 <- lm(formula = charges ~ age + bmi + children + smoker, data = data_train)
summary(model_2)
Call:
lm(formula = charges ~ age + bmi + children + smoker, data = data_train)
Residuals:
Min 1Q Median 3Q Max
-11745 -2982 -988 1485 29488
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12599.15 1077.14 -11.697 < 0.0000000000000002 ***
age 259.74 13.57 19.137 < 0.0000000000000002 ***
bmi 336.44 31.07 10.828 < 0.0000000000000002 ***
children 455.69 155.76 2.926 0.00351 **
smokeryes 23801.49 462.02 51.516 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6126 on 1064 degrees of freedom
Multiple R-squared: 0.7485, Adjusted R-squared: 0.7475
F-statistic: 791.5 on 4 and 1064 DF, p-value: < 0.00000000000000022
Model Evaluation
Performance
1. Overall Test
The first test we do is called the overall test. The hypothesis used are:
H0: The model is not significant (the model is not able to explain the target variable)
H1: The model is significant (the model is able to explain the target variable)
For the overall test, we look at the F-statistic in the model summary and its p-value. Because the p-value of both models (model_1 and model_2) is below \(\alpha\) (p < 2.2e-16 < 0.05), we reject H0, which means our models are able to explain the target variable, charges.
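As a side note (a small sketch beyond the summary output), the overall-test p-value can be recomputed from the stored F-statistic:
# recompute the p-value of the overall F-test for model_2
f <- summary(model_2)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)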
2. R-Squared
We have already removed the non-significant variables, following the output of the backward elimination. To see how this affects the model, we can compare the R-squared values of our two models.
# compare R-squared model_1 & model_2
paste("R2 Simple LR:", round(summary(model_1)$r.squared, 4))[1] "R2 Simple LR: 0.0899"
paste("R2 Multiple LR:", round(summary(model_2)$adj.r.squared, 4))[1] "R2 Multiple LR: 0.7475"
The first model, with only one predictor, has a very low R-squared of about 8.99%, meaning it explains only 8.99% of the variance in the target variable (health insurance charges). Meanwhile, our second model has an adjusted R-squared of 74.75%, a large improvement over the first model. It also shows that it is safe to remove variables whose coefficients are not significant.
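An additional way to verify this (a sketch beyond the original analysis) is a partial F-test comparing the reduced model against the full model fitted earlier; a large p-value means the dropped predictors (sex and region) do not significantly improve the fit:
# partial F-test between the nested models
anova(model_2, model_all)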
3. Model Prediction & Error
The performance of our model (how well it predicts the target variable) can be measured with the Root Mean Squared Error (RMSE). RMSE is often preferred over MAE (mean absolute error) because it squares the differences between the actual and predicted values, so predictions with larger errors are penalized more heavily. This metric is commonly used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package; an equivalent hand-rolled version is sketched below.
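Here is the hand-rolled version mentioned above, a minimal sketch equivalent to caret::RMSE(pred, obs):
# RMSE written out explicitly: square the errors, average, take the root
rmse_manual <- function(pred, obs) sqrt(mean((obs - pred)^2))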
We will try to predict the value of charges based on the value of the predictor variable, and the results will be compared with the data test we have.
# prediction model_1
prediction1 <- predict(model_1, newdata = data_test)
# RMSE of model_1
RMSE(prediction1, data_test$charges)
[1] 11266.94
#prediction model_2
prediction2 <- predict(model_2, newdata = data_test)
# RMSE of model_2
RMSE(prediction2, data_test$charges)
[1] 5844.718
The second model predicts the test set better than the first one, and its error on the training data is also lower. Since the second model is clearly better, we will use model_2 for further evaluation.
Assumptions
Linear regression is a parametric model, meaning that the model equation rests on several classical assumptions. A linear regression that violates these assumptions may be misleading or simply have biased estimators. In this section we only check the second model (the one obtained by backward elimination), which is better than the first.
1. Linearity
The linear regression model assume that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.
Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved upon or that it does not meet the linearity assumption.
Let’s check our model linearity, whether it is satisfying for the assumptions or not :
linearity_plot <- data.frame(residual = model_2$residuals, fitted = model_2$fitted.values)
linearity_plot %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(), panel.background = element_blank())
In the plot, the residuals become more negative as the fitted values increase; this pattern indicates that our model is not quite linear.
2. Normality Test
The second assumption in linear regression is that the residuals follow a normal distribution. We can check this with a histogram of the residuals and with the Shapiro-Wilk normality test.
hist(model_2$residuals, main = "Histogram of model_2 Residuals", xlab = "Model_2 Residuals", ylab = "Count")
Another way is to apply the Shapiro-Wilk test to the residuals.
H0: Residuals are normally distributed
H1: Residuals are not normally distributed
shapiro.test(model_2$residuals)
Shapiro-Wilk normality test
data: model_2$residuals
W = 0.9019, p-value < 0.00000000000000022
Since the p-value is below alpha (0.05), we reject H0: the distribution of the residuals is not approximately normal, consistent with the skewed histogram above.
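As an additional visual check (not part of the original analysis), a normal Q-Q plot of the residuals shows the same heavy right tail:
# Q-Q plot of model_2 residuals against theoretical normal quantiles
qqnorm(model_2$residuals)
qqline(model_2$residuals, col = "red")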
3. Homoscedasticity
Heteroscedasticity is a condition in which the variability of a variable is unequal across its range of values. In a linear regression model, if the variance of the errors changes across the range of the target variable, heteroscedasticity is present, which implies a non-random pattern in the residuals. We can check this visually by plotting fitted values against residuals.
plot <- data.frame(residual = model_2$residuals, fitted = model_2$fitted.values)
plot %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0, col = "red"))
Another way is to use the Breusch-Pagan test.
H0: Homoscedasticity
H1: Heteroscedasticity
bptest(model_2)
studentized Breusch-Pagan test
data: model_2
BP = 97.692, df = 4, p-value < 0.00000000000000022
Since the p-value is lower than alpha (0.05), we reject H0. This means the residuals are heteroscedastic.
4. Multicollinearity
One of the statistical tools to assess multicollinearity is the Variance Inflation Factor (VIF). Put simply, VIF measures the effect of multicollinearity among the predictors in our model. A VIF below 10 means no serious multicollinearity occurs among the predictors.
vif(model_2)
     age      bmi children   smoker
1.011702 1.010234 1.001195 1.000402
All VIF values are below 10, so no multicollinearity occurs among the predictors in our model.
It seems that our model does not satisfy all of the assumptions, so let's try to improve it.
Improving Model Performance
Polynomial Regression
We can improve our model with feature engineering, specifically by creating new features that capture interactions between the existing features. This is called polynomial regression. The idea is to generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to a specified degree. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a², ab, b²]. We will use degree 2.
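As a minimal illustration of what such combinations look like (a toy sketch with made-up features a and b, not the project data):
# degree-2 polynomial expansion of two toy features
toy <- data.frame(a = 1:3, b = c(2, 4, 6))
model.matrix(~ .^2 + poly(a, 2, raw = TRUE)[, 2] + poly(b, 2, raw = TRUE)[, 2], data = toy)
# resulting columns: (Intercept), a, b, a^2, b^2, a:b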
We don’t want charges to be included in the process of generating the polynomial combinations, so we take out charges from train and test and save them as y_train and y_test, respectively.
y_train <- data_train$charges
y_test <- data_test$charges
From the EDA we know that sex and region have little correlation with charges, so we can drop them. Also, since polynomial combinations don't make sense for categorical features, we convert smoker to numeric.
X_train <- data_train %>%
select(-c(charges, sex, region)) %>%
mutate(smoker = as.numeric(smoker))
X_test <- data_test %>%
select(-c(charges, sex, region)) %>%
  mutate(smoker = as.numeric(smoker))
We use the formula below to apply the polynomial combinations.
formula <- as.formula(
paste(
' ~ .^2 + ',
paste('poly(', colnames(X_train), ', 2, raw=TRUE)[, 2]', collapse = ' + ')
)
)
formula
~.^2 + poly(age, 2, raw = TRUE)[, 2] + poly(bmi, 2, raw = TRUE)[, 2] +
    poly(children, 2, raw = TRUE)[, 2] + poly(smoker, 2, raw = TRUE)[, 2]
Then, insert y_train and y_test back to the new datasets.
train_poly <- as.data.frame(model.matrix(formula, data = X_train))
test_poly <- as.data.frame(model.matrix(formula, data = X_test))
train_poly$charges <- y_train
test_poly$charges <- y_test
colnames(train_poly)
 [1] "(Intercept)"                         "age"
[3] "bmi" "children"
[5] "smoker" "poly(age, 2, raw = TRUE)[, 2]"
[7] "poly(bmi, 2, raw = TRUE)[, 2]" "poly(children, 2, raw = TRUE)[, 2]"
[9] "poly(smoker, 2, raw = TRUE)[, 2]" "age:bmi"
[11] "age:children" "age:smoker"
[13] "bmi:children" "bmi:smoker"
[15] "children:smoker" "charges"
We can see that our new datasets train_poly and test_poly now have 16 columns:
- (Intercept) is a column of constant 1s; this is the constant term of the polynomial.
- age, bmi, children, smoker are the original features.
- age², bmi², children², smoker² are the squares of the original features.
- age×bmi, age×children, age×smoker, bmi×children, bmi×smoker, children×smoker are the six pairwise interactions between the four features.
- charges is the target feature.
Now we are ready to build the model. As before, we start with all features and then apply stepwise regression with backward elimination.
Performance
model_all2 <- lm(charges ~., train_poly)
step(model_all2, direction = "backward")
Start:  AIC=18162.46
charges ~ `(Intercept)` + age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`poly(smoker, 2, raw = TRUE)[, 2]` + `age:bmi` + `age:children` +
`age:smoker` + `bmi:children` + `bmi:smoker` + `children:smoker`
Step: AIC=18162.46
charges ~ `(Intercept)` + age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`age:bmi` + `age:children` + `age:smoker` + `bmi:children` +
`bmi:smoker` + `children:smoker`
Step: AIC=18162.46
charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`age:bmi` + `age:children` + `age:smoker` + `bmi:children` +
`bmi:smoker` + `children:smoker`
Df Sum of Sq RSS AIC
- `age:smoker` 1 87237 24907389704 18161
- `bmi:children` 1 12114168 24919416636 18161
- `age:bmi` 1 12714116 24920016584 18161
- age 1 17561078 24924863545 18161
- `age:children` 1 28200470 24935502938 18162
- `children:smoker` 1 40148624 24947451091 18162
<none> 24907302468 18163
- `poly(children, 2, raw = TRUE)[, 2]` 1 63030315 24970332783 18163
- `poly(bmi, 2, raw = TRUE)[, 2]` 1 86401198 24993703666 18164
- children 1 187168061 25094470529 18169
- `poly(age, 2, raw = TRUE)[, 2]` 1 414386731 25321689199 18178
- bmi 1 658673687 25565976155 18188
- smoker 1 2213655840 27120958308 18252
- `bmi:smoker` 1 13939558684 38846861151 18636
Step: AIC=18160.46
charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`age:bmi` + `age:children` + `bmi:children` + `bmi:smoker` +
`children:smoker`
Df Sum of Sq RSS AIC
- `bmi:children` 1 12097944 24919487648 18159
- `age:bmi` 1 12678739 24920068443 18159
- age 1 20268521 24927658226 18159
- `age:children` 1 28218188 24935607892 18160
- `children:smoker` 1 40509187 24947898891 18160
<none> 24907389704 18161
- `poly(children, 2, raw = TRUE)[, 2]` 1 63032142 24970421846 18161
- `poly(bmi, 2, raw = TRUE)[, 2]` 1 86557527 24993947231 18162
- children 1 187380595 25094770299 18167
- `poly(age, 2, raw = TRUE)[, 2]` 1 414336473 25321726177 18176
- bmi 1 659782432 25567172136 18186
- smoker 1 2685916990 27593306694 18268
- `bmi:smoker` 1 14024886066 38932275770 18636
Step: AIC=18158.98
charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`age:bmi` + `age:children` + `bmi:smoker` + `children:smoker`
Df Sum of Sq RSS AIC
- `age:bmi` 1 12011284 24931498932 18158
- age 1 19588711 24939076359 18158
- `age:children` 1 34037250 24953524897 18158
- `children:smoker` 1 38348896 24957836543 18159
<none> 24919487648 18159
- `poly(children, 2, raw = TRUE)[, 2]` 1 62415583 24981903231 18160
- `poly(bmi, 2, raw = TRUE)[, 2]` 1 87634033 25007121681 18161
- children 1 231350017 25150837665 18167
- `poly(age, 2, raw = TRUE)[, 2]` 1 414919697 25334407344 18175
- bmi 1 673172853 25592660501 18186
- smoker 1 2673948395 27593436043 18266
- `bmi:smoker` 1 14034226485 38953714133 18635
Step: AIC=18157.5
charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`age:children` + `bmi:smoker` + `children:smoker`
Df Sum of Sq RSS AIC
- age 1 9160764 24940659696 18156
- `age:children` 1 33154389 24964653321 18157
- `children:smoker` 1 38928840 24970427772 18157
<none> 24931498932 18158
- `poly(children, 2, raw = TRUE)[, 2]` 1 63557572 24995056504 18158
- `poly(bmi, 2, raw = TRUE)[, 2]` 1 81101276 25012600208 18159
- children 1 232877866 25164376798 18165
- `poly(age, 2, raw = TRUE)[, 2]` 1 432337926 25363836858 18174
- bmi 1 666073590 25597572522 18184
- smoker 1 2679798970 27611297902 18265
- `bmi:smoker` 1 14048864520 38980363452 18633
Step: AIC=18155.89
charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`age:children` + `bmi:smoker` + `children:smoker`
Df Sum of Sq RSS AIC
- `age:children` 1 35667383 24976327079 18155
- `children:smoker` 1 38765424 24979425119 18156
<none> 24940659696 18156
- `poly(children, 2, raw = TRUE)[, 2]` 1 56943592 24997603287 18156
- `poly(bmi, 2, raw = TRUE)[, 2]` 1 81482031 25022141727 18157
- children 1 225871743 25166531438 18164
- bmi 1 665798814 25606458510 18182
- smoker 1 2688617038 27629276734 18263
- `poly(age, 2, raw = TRUE)[, 2]` 1 10324076352 35264736047 18524
- `bmi:smoker` 1 14077127035 39017786731 18632
Step: AIC=18155.42
charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`bmi:smoker` + `children:smoker`
Df Sum of Sq RSS AIC
- `children:smoker` 1 37721028 25014048107 18155
<none> 24976327079 18155
- `poly(children, 2, raw = TRUE)[, 2]` 1 49529043 25025856121 18156
- `poly(bmi, 2, raw = TRUE)[, 2]` 1 79418092 25055745171 18157
- children 1 206827875 25183154954 18162
- bmi 1 675467497 25651794576 18182
- smoker 1 2701478319 27677805397 18263
- `bmi:smoker` 1 14103067817 39079394896 18632
- `poly(age, 2, raw = TRUE)[, 2]` 1 14416801552 39393128631 18641
Step: AIC=18155.03
charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`bmi:smoker`
Df Sum of Sq RSS AIC
- `poly(children, 2, raw = TRUE)[, 2]` 1 39048162 25053096269 18155
<none> 25014048107 18155
- `poly(bmi, 2, raw = TRUE)[, 2]` 1 85815707 25099863814 18157
- children 1 247054482 25261102589 18164
- bmi 1 662920634 25676968741 18181
- smoker 1 2975033096 27989081203 18273
- `bmi:smoker` 1 14154861302 39168909409 18632
- `poly(age, 2, raw = TRUE)[, 2]` 1 14393807419 39407855526 18639
Step: AIC=18154.7
charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `bmi:smoker`
Df Sum of Sq RSS AIC
<none> 25053096269 18155
- `poly(bmi, 2, raw = TRUE)[, 2]` 1 87172604 25140268873 18156
- children 1 657421145 25710517414 18180
- bmi 1 661992396 25715088665 18181
- smoker 1 2987010790 28040107059 18273
- `bmi:smoker` 1 14202648668 39255744937 18633
- `poly(age, 2, raw = TRUE)[, 2]` 1 14400185274 39453281543 18638
Call:
lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `bmi:smoker`, data = train_poly)
Coefficients:
(Intercept) bmi
17034.745 -1085.987
children smoker
651.730 -20934.955
`poly(age, 2, raw = TRUE)[, 2]` `poly(bmi, 2, raw = TRUE)[, 2]`
3.318 -5.713
`bmi:smoker`
1458.495
Save the model in an object model_poly.
model_poly <- lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
  `poly(bmi, 2, raw = TRUE)[, 2]` + `bmi:smoker`, data = train_poly)
Model Evaluation
Let's first revisit the summary of our original linear regression model, model_2.
summary(model_2)
Call:
lm(formula = charges ~ age + bmi + children + smoker, data = data_train)
Residuals:
Min 1Q Median 3Q Max
-11745 -2982 -988 1485 29488
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12599.15 1077.14 -11.697 < 0.0000000000000002 ***
age 259.74 13.57 19.137 < 0.0000000000000002 ***
bmi 336.44 31.07 10.828 < 0.0000000000000002 ***
children 455.69 155.76 2.926 0.00351 **
smokeryes 23801.49 462.02 51.516 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6126 on 1064 degrees of freedom
Multiple R-squared: 0.7485, Adjusted R-squared: 0.7475
F-statistic: 791.5 on 4 and 1064 DF, p-value: < 0.00000000000000022
We have four features, all of which are significant for charges. From the coefficients, a non-smoker who is zero years old, has no children, and has a bmi of zero would be charged -$12,599 by health insurance, which is of course an impossible scenario. Also, since smoker has the largest coefficient of all features, a unit change in smoker produces a bigger change in charges than a unit change in any other feature, with all other features held fixed. In this case, with all other features fixed, a non-smoker is charged $23,801 less than a smoker, which makes sense (see the sketch below). This model also has an adjusted R-squared of 0.7475, meaning the model and its features explain 74.75% of the total variation in charges.
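The sketch below illustrates this with a hypothetical profile (40 years old, BMI 30, one child); the two predictions differ by exactly the smokeryes coefficient:
# predicted charges for a non-smoker vs a smoker with an otherwise
# identical (hypothetical) profile
profiles <- data.frame(age = 40, bmi = 30, children = 1,
                       smoker = factor(c("no", "yes"), levels = c("no", "yes")))
predict(model_2, newdata = profiles)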
Now, let’s compare this to the new Polynomial Regression model.
summary(model_poly)
Call:
lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `bmi:smoker`, data = train_poly)
Residuals:
Min 1Q Median 3Q Max
-8661.9 -1735.4 -1352.5 -751.4 30386.0
Coefficients:
Estimate Std. Error t value
(Intercept) 17034.7451 3778.1463 4.509
bmi -1085.9870 205.0060 -5.297
children 651.7300 123.4566 5.279
smoker -20934.9550 1860.4675 -11.253
`poly(age, 2, raw = TRUE)[, 2]` 3.3177 0.1343 24.707
`poly(bmi, 2, raw = TRUE)[, 2]` -5.7128 2.9719 -1.922
`bmi:smoker` 1458.4955 59.4414 24.537
Pr(>|t|)
(Intercept) 0.000007245 ***
bmi 0.000000143 ***
children 0.000000157 ***
smoker < 0.0000000000000002 ***
`poly(age, 2, raw = TRUE)[, 2]` < 0.0000000000000002 ***
`poly(bmi, 2, raw = TRUE)[, 2]` 0.0548 .
`bmi:smoker` < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4857 on 1062 degrees of freedom
Multiple R-squared: 0.8422, Adjusted R-squared: 0.8413
F-statistic: 944.7 on 6 and 1062 DF, p-value: < 0.00000000000000022
We now have six features, nearly all of which are significant for charges (the bmi² term is only marginally significant). The intercept of $17,034 no longer has a sensible stand-alone interpretation, and smoker again has the largest coefficient in absolute value: taken at face value it suggests that, with everything else fixed, a smoker is charged $20,934 less, which is counter-intuitive because the effect of smoking is now split between the smoker and bmi:smoker terms. From this analysis, we learn that introducing new features through polynomial combinations changes the model, and interpreting the individual coefficients can become misleading.
The adjusted R-squared of this model is 0.8413, meaning the model and its features explain about 84% of the total variation in charges. In other words, the polynomial regression model captures more of the variance in charges than the earlier linear regression model.
RMSE
Let's compare the prediction error of model_poly with that of model_2:
# RMSE of model_2
rmse1 <- RMSE(prediction2, data_test$charges)
paste("RMSE model_2:", round(rmse1, 3))[1] "RMSE model_2: 5844.718"
# RMSE model_poly
y_pred <- predict(model_poly, test_poly)
rmse2 <- RMSE(y_pred, data_test$charges)
paste("RMSE model_poly:", round(rmse2, 3))[1] "RMSE model_poly: 4667.661"
model_poly is an improvement: its RMSE of 4667.66 is lower than that of model_2.
Assumptions
We still need to check the assumptions that a linear regression model should satisfy:
1. Linearity
linearity_plot2 <- data.frame(residual = model_poly$residuals, fitted = model_poly$fitted.values)
linearity_plot2 %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
  geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())
There is little to no discernible pattern in the residual plot, so we can conclude that this model is reasonably linear.
2. Normality of Residuals
hist(model_poly$residuals, main = "Histogram of model_poly Residuals", xlab = "model_poly residuals", ylab = "count")
shapiro.test(model_poly$residuals)
Shapiro-Wilk normality test
data: model_poly$residuals
W = 0.63647, p-value < 0.00000000000000022
As with the previous model, model_2, the residuals are not normally distributed (the p-value is below 0.05).
3. Homoscedasticity
As before, we check for heteroscedasticity by plotting fitted values against residuals and with the Breusch-Pagan test.
plot %>% ggplot(aes(model_poly$fitted.values, model_poly$residuals)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0, col = "red"))
Another way is to use the Breusch-Pagan test.
H0: Homoscedasticity H1: Heteroscedasticity
bptest(model_poly)
studentized Breusch-Pagan test
data: model_poly
BP = 9.4173, df = 6, p-value = 0.1514
From the output, the p-value is higher than alpha (0.05), so we fail to reject H0. This means the residuals satisfy the homoscedasticity assumption.
4. Multicollinearity
As before, we use the Variance Inflation Factor (VIF) to assess multicollinearity; a VIF below 10 means no serious multicollinearity among the predictors.
vif(model_poly)
                            bmi                        children
69.966614 1.000774
smoker `poly(age, 2, raw = TRUE)[, 2]`
25.808689 1.016194
`poly(bmi, 2, raw = TRUE)[, 2]` `bmi:smoker`
59.120272 34.573577
No multicollinearity was found in the linear regression model, but several predictors in the polynomial regression model have VIF values well above 10. This makes sense, since some features in the polynomial model were created by squaring or multiplying features from the linear model; a quick check is sketched below.
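A quick check (a sketch, not in the original analysis) confirms that the engineered terms are correlated with the original bmi feature, which is what inflates the VIF:
# correlation between bmi and its square / its interaction with smoker
cor(train_poly$bmi, train_poly$`poly(bmi, 2, raw = TRUE)[, 2]`)
cor(train_poly$bmi, train_poly$`bmi:smoker`)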
Conclusion
Here is the comparison between model_2, built with plain linear regression, and model_poly, improved with the polynomial regression technique:
| Algorithm | Model | Adjusted R-squared | RMSE (test) |
|---|---|---|---|
| Linear Regression | model_2 | 0.7475 | 5844.718 |
| Polynomial Regression | model_poly | 0.8413 | 4667.661 |
We can conclude that feature engineering plays an important role in improving the model. In this case we applied polynomial combinations of the features with degree 2, and the model improved significantly: an adjusted R-squared of 0.8413, higher than model_2, and an RMSE of 4667.661, lower than model_2. However, some linear regression assumptions are still not satisfied: the residuals remain non-normally distributed, and multicollinearity appears in the polynomial regression model, although the latter is expected since some of its features were created by multiplying features from the linear model. In business settings the classical assumptions sometimes matter less than predictive performance, but checking them is still good practice and helps make the model more trustworthy.