Linear Regression on Health Insurance Cost Prediction

Introduction

This analysis uses the Medical Cost Personal Datasets from Kaggle. Many factors that affect how much you pay for health insurance are not within your control, and medical costs are difficult to predict because much of the spending is driven by rare conditions. Nonetheless, it is useful to understand what drives them. Using linear regression we will build a model that predicts insurance charges from personal data such as age, Body Mass Index (BMI), and smoking status, and we will determine which variable has the strongest influence on the charges. Such estimates could be used to build actuarial tables that set yearly premiums higher or lower according to expected treatment costs.

Data Preparation

Load the required package.

library(dplyr)          
library(ggplot2)        
library(caret)          
library(MLmetrics)      
library(car)            
library(lmtest)         
library(GGally)
library(kableExtra)

Load Dataset

insurance <- read.csv("data_input/insurance.csv")

rmarkdown::paged_table(insurance)

Data Wrangling

glimpse(insurance)
Rows: 1,338
Columns: 7
$ age      <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
$ sex      <chr> "female", "male", "male", "male", "male", "female", "female",~
$ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
$ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
$ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
$ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
$ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~

Below is a description of each column in the insurance dataset:

  • age: age of primary beneficiary.

  • sex: insurance contractor gender, female and male.

  • bmi: body mass index (kg/m²), an objective index of body weight relative to height; the ideal range is 18.5 to 24.9.

  • children: Number of children covered by health insurance / Number of dependents.

  • smoker: Smoking, yes or no.

  • region: the beneficiary’s residential area in the US: northeast, northwest, southeast, or southwest.

  • charges: individual medical costs billed by health insurance (the target variable).

The data has 1338 rows and 7 columns. Since we are predicting insurance costs, charges will be our target feature.

Before we go further, we need to make sure the data is clean and usable: check for duplicates and missing values, and check whether each column has the correct data type, converting those that do not.

# check duplicated data
insurance[duplicated(insurance), ]
    age  sex   bmi children smoker    region  charges
582  19 male 30.59        0     no northwest 1639.563

There is one duplicated row. It is unlikely that two people share the same age, sex, bmi, number of children, and region, are both non-smokers, and have exactly the same medical charges, so we can drop this duplicated row.

insurance <- insurance %>% distinct()
# Check Missing value

colSums(is.na(insurance))
     age      sex      bmi children   smoker   region  charges 
       0        0        0        0        0        0        0 

The output shows that our dataset has no missing values. From the glimpse output, three columns should be converted to a categorical (factor) data type: sex, smoker, and region.

# transform character into factor
insurance <- insurance %>% mutate_if(~is.character(.), ~as.factor(.))

glimpse(insurance)
Rows: 1,337
Columns: 7
$ age      <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
$ sex      <fct> female, male, male, male, male, female, female, female, male,~
$ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
$ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
$ smoker   <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, yes~
$ region   <fct> southwest, southeast, southeast, northwest, northwest, southe~
$ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~

Exploratory Data Analysis

Exploratory data analysis is the phase where we explore the variables and look for patterns that might indicate correlations between them.

summary(insurance)
      age            sex           bmi           children     smoker    
 Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1063  
 1st Qu.:27.00   male  :675   1st Qu.:26.29   1st Qu.:0.000   yes: 274  
 Median :39.00                Median :30.40   Median :1.000             
 Mean   :39.22                Mean   :30.66   Mean   :1.096             
 3rd Qu.:51.00                3rd Qu.:34.70   3rd Qu.:2.000             
 Max.   :64.00                Max.   :53.13   Max.   :5.000             
       region       charges     
 northeast:324   Min.   : 1122  
 northwest:324   1st Qu.: 4746  
 southeast:364   Median : 9386  
 southwest:325   Mean   :13279  
                 3rd Qu.:16658  
                 Max.   :63770  

In terms of categorical features, the dataset has a similar number of people in each category, except for smoker: there are far more non-smokers than smokers, which makes sense. The charges themselves vary greatly, from around $1,000 to $64,000.

Let’s check the distribution of charges.

ggplot(data = insurance, aes(x = charges)) + 
  geom_density(alpha = 0.5) + 
  ggtitle("Distribution of Charges")

The distribution is right skewed with a long tail. There is a bump at around $40,000, which may hint at a second, hidden distribution. To dig into this, we turn to the categorical features.

Let's check boxplots of sex, region, children, and smoker against the insurance charges.

for (col in c('sex', 'region', 'children', 'smoker')) {
   plot <- ggplot(data = insurance,
                 aes_string(x = col, y = 'charges', group = col, fill = col)) + 
            geom_boxplot(show.legend = FALSE) + 
            ggtitle(glue::glue("Boxplot of Medical Charges per {col}"))
  print(plot)
}

  • sex and region show no noticeable differences in charges between their categories. There is an increasing trend in charges as the number of children increases. Lastly, smoker appears to make a large difference in the charges billed by health insurance.

Let's check the distribution of charges split by smoking status.

ggplot(data = insurance, aes(x = charges, fill = smoker)) + 
  geom_density(alpha = 0.5) + 
  ggtitle("Distribution of Charges per Smoking Category")

Smokers clearly incur much higher charges than non-smokers.

Let’s check the medical charges by age, bmi and children according to the smoker factor.

for (feat in c('age', 'bmi', 'children')) {
  plot <- ggplot(data = insurance, aes_string(x = feat, y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) + 
    geom_jitter() + 
    geom_smooth(method = 'lm') +
    ggtitle(glue::glue("Charges vs {feat}"))  
  print(plot)
}

smoker appears to have the largest impact on medical charges, although charges also grow with age, bmi, and children. People with more children also tend to smoke less.

Let's check the correlations between features using ggcorr:

ggcorr(insurance %>% mutate_if(is.factor, as.numeric), label = TRUE, label_size = 3, hjust = .85, size = 3)

The strongest and most meaningful correlation is between smoker and charges; for the other features the correlations with charges are low, so their linear relationships with the target are weak.

Modeling

Splitting the Data

RNGkind(sample.kind = "Rounding") # so the output is reproducible across R versions
set.seed(417) # fix the random numbers produced by sample()
index <- sample(nrow(insurance), nrow(insurance)*0.8)
data_train <- insurance[index, ]
data_test <- insurance[-index, ]
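
As a quick sanity check on the split (an assumed check, not part of the original analysis), we can count the rows in each partition:

# roughly 80% of the 1337 rows go to training, the rest to testing
nrow(data_train)
nrow(data_test)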

Linear Regression

Here we first try a model with a single predictor. From the ggcorr plot, the second most correlated predictor after smoker is age; since smoker is a categorical variable (it enters the model as a dummy variable), we start with age.

model_1 <- lm(charges ~ age, data = data_train)
summary(model_1)

Call:
lm(formula = charges ~ age, data = data_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -8144  -6745  -5995   5430  47718 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  3036.49    1070.88   2.836              0.00466 ** 
age           263.08      25.63  10.264 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11640 on 1067 degrees of freedom
Multiple R-squared:  0.08986,   Adjusted R-squared:  0.089 
F-statistic: 105.3 on 1 and 1067 DF,  p-value: < 0.00000000000000022

From the Pr(>|t|) column we see that the regression coefficient is significantly different from zero (p < 0.001) and indicates an expected increase of about 263.08 in charges for every additional year of age. The multiple R-squared (0.0899) indicates that the model accounts for only about 8.99% of the variance in charges, which is very low, so this is not a good model. The multiple R-squared is also the squared correlation between the actual and predicted values. The residual standard error (11,640) is very high; it can be thought of as the average error in predicting charges from age with this model.
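
To make the coefficients concrete, here is a minimal sketch predicting charges for a hypothetical 40-year-old (the age is chosen purely for illustration); by the fitted line this should be roughly 3036.49 + 263.08 * 40, about $13,560.

# predicted charges for a hypothetical 40-year-old using model_1
predict(model_1, newdata = data.frame(age = 40))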

Next we apply automatic feature selection with stepwise regression using backward elimination. Starting from all features, the backward elimination process iteratively discards predictors and re-evaluates the model until it finds the one with the lowest Akaike Information Criterion (AIC). Given a collection of models for the data, AIC estimates the quality of each model relative to the others based on information loss; a lower AIC means a better model. We use the step() function to apply backward elimination.
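
To see the criterion step() optimizes, here is a minimal sketch comparing the AIC of the single-predictor model with a model that uses all predictors; step() runs this kind of comparison at every elimination step.

# AIC trades off goodness of fit against model complexity; lower is better
AIC(model_1)                              # charges ~ age
AIC(lm(charges ~ ., data = data_train))   # charges ~ all predictors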

model_all <- lm(charges ~., data_train)
step(model_all, direction = "backward")
Start:  AIC=18653.07
charges ~ age + sex + bmi + children + smoker + region

           Df   Sum of Sq          RSS   AIC
- region    3   125695795  39909377669 18650
- sex       1    26583308  39810265182 18652
<none>                     39783681874 18653
- children  1   325925064  40109606938 18660
- bmi       1  4310717833  44094399707 18761
- age       1 13605497970  53389179844 18966
- smoker    1 98109106718 137892788592 19980

Step:  AIC=18650.44
charges ~ age + sex + bmi + children + smoker

           Df   Sum of Sq          RSS   AIC
- sex       1    25761162  39935138831 18649
<none>                     39909377669 18650
- children  1   323234651  40232612320 18657
- bmi       1  4419512000  44328889669 18761
- age       1 13711875099  53621252768 18964
- smoker    1 98888881401 138798259070 19981

Step:  AIC=18649.13
charges ~ age + bmi + children + smoker

           Df   Sum of Sq          RSS   AIC
<none>                     39935138831 18649
- children  1   321259117  40256397948 18656
- bmi       1  4400297661  44335436492 18759
- age       1 13744865480  53680004310 18963
- smoker    1 99607627702 139542766532 19985

Call:
lm(formula = charges ~ age + bmi + children + smoker, data = data_train)

Coefficients:
(Intercept)          age          bmi     children    smokeryes  
   -12599.2        259.7        336.4        455.7      23801.5  

Save the backward-elimination model into the object model_2, inspect its summary, then predict and calculate the metrics.

model_2 <- lm(formula = charges ~ age + bmi + children + smoker, data = data_train)
summary(model_2)

Call:
lm(formula = charges ~ age + bmi + children + smoker, data = data_train)

Residuals:
   Min     1Q Median     3Q    Max 
-11745  -2982   -988   1485  29488 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -12599.15    1077.14 -11.697 < 0.0000000000000002 ***
age            259.74      13.57  19.137 < 0.0000000000000002 ***
bmi            336.44      31.07  10.828 < 0.0000000000000002 ***
children       455.69     155.76   2.926              0.00351 ** 
smokeryes    23801.49     462.02  51.516 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6126 on 1064 degrees of freedom
Multiple R-squared:  0.7485,    Adjusted R-squared:  0.7475 
F-statistic: 791.5 on 4 and 1064 DF,  p-value: < 0.00000000000000022

Model Evaluation

Performance

1. Overall Test

The first test we do is called the overall test. The hypothesis used are:

H0: The model is not significant (the model cannot explain the target variable)
H1: The model is significant (the model can explain the target variable)

To run the overall test, we look at the F-statistic and its p-value in the model summary. Because the p-value for both models (model_1 and model_2) is < 2.2e-16, which is below \(\alpha\) = 0.05, we reject H0, meaning that our models can explain the target variable, charges.
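
For reference, the F-statistic and its p-value can also be extracted from the model summary directly; a minimal sketch for model_2:

# overall-test p-value computed from the F statistic of model_2
f <- summary(model_2)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)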

2. R-Squared

We have already removed the non-significant variables following the backward elimination output. To see how this affects our model, we can compare the R-squared values of the two models.

# compare R-squared model_1 & model_2
paste("R2 Simple LR:", round(summary(model_1)$r.squared, 4))
[1] "R2 Simple LR: 0.0899"
paste("R2 Multiple LR:", round(summary(model_2)$adj.r.squared, 4))
[1] "R2 Multiple LR: 0.7475"

The first model, with only one predictor, has a very low R-squared of about 8.99%, meaning it explains only 8.99% of the variance in the target variable (health insurance charges). Our second model has an adjusted R-squared of 74.75%, a large improvement over the first. This shows that it is safe to remove variables whose coefficients are not significant.

3. Model Prediction & Error

The performance of our model (how well it predicts the target variable) can be measured with the Root Mean Squared Error (RMSE). RMSE squares the differences between the actual and predicted values before averaging them, so predictions with larger errors are penalized more heavily than with MAE (mean absolute error). The metric is often used to compare alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the caret package; below we compute it for both models.
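
As a reminder of what RMSE() computes, here is a minimal hand-rolled version (rmse_manual is an illustrative helper, not part of the original analysis):

# square the errors, average them, take the square root; should match caret's RMSE()
rmse_manual <- function(pred, actual) sqrt(mean((actual - pred)^2))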

We predict charges from the predictor variables and compare the results with the actual values in the test set.

# prediction model_1
prediction1 <- predict(model_1, newdata = data_test)

# RMSE of model_1
RMSE(prediction1, data_test$charges)
[1] 11266.94
#prediction model_2
prediction2 <- predict(model_2, newdata = data_test)

# RMSE of model_2
RMSE(prediction2, data_test$charges)
[1] 5844.718

The second model predicts the test set better than the first: its error on the test data is clearly lower. We will therefore use model_2 for further evaluation.

Assumptions

Linear regression is a parametric model, meaning that to produce a valid model equation it must satisfy several classical assumptions. A linear regression that violates them may be misleading or simply have biased estimators. In this section we only check the second model (the backward-elimination model), since it is the better of the two.

1. Linearity

The linear regression model assume that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved upon or that it does not meet the linearity assumption.

Let’s check our model linearity, whether it is satisfying for the assumptions or not :

linearity_plot <- data.frame(residual = model_2$residuals, fitted = model_2$fitted.values)

linearity_plot %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

In the plot, the residuals become more negative as the fitted values increase; this pattern indicates that our model is not quite linear.

2. Normality Test

The second assumption in linear regression is that the residuals follow a normal distribution. We can first check this visually with a histogram of the residuals.

hist(model_2$residuals, main = "Histogram of model_2 Residuals", xlab = "Model_2 Residuals", ylab = "Count")

Another way is to apply the Shapiro-Wilk test to the residuals.

H0: Residuals are normally distributed
H1: Residuals are not normally distributed

shapiro.test(model_2$residuals)

    Shapiro-Wilk normality test

data:  model_2$residuals
W = 0.9019, p-value < 0.00000000000000022

Since the p-value is below alpha (0.05), we reject H0: the residuals are not normally distributed, which matches what the histogram shows.
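
Besides the histogram and the Shapiro-Wilk test, a normal Q-Q plot (base R) gives the same picture; a minimal sketch:

# points deviating strongly from the line in the right tail indicate the heavy skew
qqnorm(model_2$residuals)
qqline(model_2$residuals)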

3. Homoscedasticity

Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values. In a linear regression model, if the variance of the errors changes across the range of fitted values, heteroscedasticity is present, which implies a non-random pattern in the residuals. We can check this visually by plotting the fitted values against the residuals.

plot <- data.frame(residual = model_2$residuals, fitted = model_2$fitted.values)
plot %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(yintercept = 0, col = "red")

Another way is to use the Breusch-Pagan test.

H0: Homoscedasticity
H1: Heteroscedasticity

bptest(model_2)

    studentized Breusch-Pagan test

data:  model_2
BP = 97.692, df = 4, p-value < 0.00000000000000022

Since the p-value is lower than alpha (0.05), we reject H0: the residuals are heteroscedastic.

4. Multicollinearity

One statistical tool to assess multicollinearity is the Variance Inflation Factor (VIF). Put simply, VIF measures how much a coefficient's variance is inflated by correlation among the predictors. VIF < 10 indicates no problematic multicollinearity among the predictors.

vif(model_2)
     age      bmi children   smoker 
1.011702 1.010234 1.001195 1.000402 

All VIF values are below 10, so there is no multicollinearity among the predictors in our model.

Our model does not satisfy all of the assumptions, so let's try to improve it.

Improving Model Performance

Polynomial Regression

We can improve our model with feature engineering, specifically by creating new features that capture interactions between existing ones. This is the idea behind polynomial regression: generate a new feature matrix consisting of all polynomial combinations of the features up to a specified degree. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]. We will use degree 2.
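
To see what such an expansion looks like, here is a tiny toy example; the data frame toy is made up purely for illustration and mirrors the formula we build below.

# degree-2 expansion of two columns a and b:
# (Intercept), a, b, a^2, b^2, and the interaction a:b
toy <- data.frame(a = 1:3, b = c(2, 4, 6))
model.matrix(~ .^2 + poly(a, 2, raw = TRUE)[, 2] + poly(b, 2, raw = TRUE)[, 2], data = toy)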

We don’t want charges to be included in the process of generating the polynomial combinations, so we take out charges from train and test and save them as y_train and y_test, respectively.

y_train <- data_train$charges
y_test <- data_test$charges

From the EDA we know that sex and region have little correlation with charges, so we drop them. Also, since polynomial combinations do not make sense for categorical features, we convert smoker to numeric.

X_train <- data_train %>% 
  select(-c(charges, sex, region)) %>% 
  mutate(smoker = as.numeric(smoker))
X_test <- data_test %>% 
  select(-c(charges, sex, region)) %>% 
  mutate(smoker = as.numeric(smoker))
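
Note that as.numeric() on a factor returns the underlying level codes, so smoker becomes 1 for "no" and 2 for "yes" rather than 0/1; this matters when reading the coefficients of the polynomial model later. A quick check:

# "no" is the first factor level, "yes" the second
levels(insurance$smoker)
table(X_train$smoker)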

We use the formula below to generate the polynomial combinations.

formula <- as.formula(
  paste(
    ' ~ .^2 + ', 
    paste('poly(', colnames(X_train), ', 2, raw=TRUE)[, 2]', collapse = ' + ')
  )
)

formula
~.^2 + poly(age, 2, raw = TRUE)[, 2] + poly(bmi, 2, raw = TRUE)[, 
    2] + poly(children, 2, raw = TRUE)[, 2] + poly(smoker, 2, 
    raw = TRUE)[, 2]

Then, insert y_train and y_test back to the new datasets.

train_poly <- as.data.frame(model.matrix(formula, data = X_train))
test_poly <- as.data.frame(model.matrix(formula, data = X_test))
train_poly$charges <- y_train
test_poly$charges <- y_test
colnames(train_poly)
 [1] "(Intercept)"                        "age"                               
 [3] "bmi"                                "children"                          
 [5] "smoker"                             "poly(age, 2, raw = TRUE)[, 2]"     
 [7] "poly(bmi, 2, raw = TRUE)[, 2]"      "poly(children, 2, raw = TRUE)[, 2]"
 [9] "poly(smoker, 2, raw = TRUE)[, 2]"   "age:bmi"                           
[11] "age:children"                       "age:smoker"                        
[13] "bmi:children"                       "bmi:smoker"                        
[15] "children:smoker"                    "charges"                           

We can see that our new datasets train_poly and test_poly now have 16 columns:

  1. (Intercept) is a column of constant 1, the constant term of the polynomial.
  2. age, bmi, children, smoker are the original features.
  3. age^2, bmi^2, children^2, smoker^2 are the squares of the original features.
  4. age×bmi, age×children, age×smoker, bmi×children, bmi×smoker, children×smoker are the six pairwise interactions of the four features.
  5. charges is the target feature.
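
A quick sanity check on the new training matrix (an assumed check, not part of the original analysis):

# expect the 16 columns listed above and the same number of rows as data_train
dim(train_poly)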

Now we are ready to build the model. As before, we start with all features and apply stepwise regression with backward elimination.

Performance

model_all2 <- lm(charges ~., train_poly)
step(model_all2, direction = "backward")
Start:  AIC=18162.46
charges ~ `(Intercept)` + age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `poly(smoker, 2, raw = TRUE)[, 2]` + `age:bmi` + `age:children` + 
    `age:smoker` + `bmi:children` + `bmi:smoker` + `children:smoker`


Step:  AIC=18162.46
charges ~ `(Intercept)` + age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `age:bmi` + `age:children` + `age:smoker` + `bmi:children` + 
    `bmi:smoker` + `children:smoker`


Step:  AIC=18162.46
charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `age:bmi` + `age:children` + `age:smoker` + `bmi:children` + 
    `bmi:smoker` + `children:smoker`

                                       Df   Sum of Sq         RSS   AIC
- `age:smoker`                          1       87237 24907389704 18161
- `bmi:children`                        1    12114168 24919416636 18161
- `age:bmi`                             1    12714116 24920016584 18161
- age                                   1    17561078 24924863545 18161
- `age:children`                        1    28200470 24935502938 18162
- `children:smoker`                     1    40148624 24947451091 18162
<none>                                                24907302468 18163
- `poly(children, 2, raw = TRUE)[, 2]`  1    63030315 24970332783 18163
- `poly(bmi, 2, raw = TRUE)[, 2]`       1    86401198 24993703666 18164
- children                              1   187168061 25094470529 18169
- `poly(age, 2, raw = TRUE)[, 2]`       1   414386731 25321689199 18178
- bmi                                   1   658673687 25565976155 18188
- smoker                                1  2213655840 27120958308 18252
- `bmi:smoker`                          1 13939558684 38846861151 18636

Step:  AIC=18160.46
charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `age:bmi` + `age:children` + `bmi:children` + `bmi:smoker` + 
    `children:smoker`

                                       Df   Sum of Sq         RSS   AIC
- `bmi:children`                        1    12097944 24919487648 18159
- `age:bmi`                             1    12678739 24920068443 18159
- age                                   1    20268521 24927658226 18159
- `age:children`                        1    28218188 24935607892 18160
- `children:smoker`                     1    40509187 24947898891 18160
<none>                                                24907389704 18161
- `poly(children, 2, raw = TRUE)[, 2]`  1    63032142 24970421846 18161
- `poly(bmi, 2, raw = TRUE)[, 2]`       1    86557527 24993947231 18162
- children                              1   187380595 25094770299 18167
- `poly(age, 2, raw = TRUE)[, 2]`       1   414336473 25321726177 18176
- bmi                                   1   659782432 25567172136 18186
- smoker                                1  2685916990 27593306694 18268
- `bmi:smoker`                          1 14024886066 38932275770 18636

Step:  AIC=18158.98
charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `age:bmi` + `age:children` + `bmi:smoker` + `children:smoker`

                                       Df   Sum of Sq         RSS   AIC
- `age:bmi`                             1    12011284 24931498932 18158
- age                                   1    19588711 24939076359 18158
- `age:children`                        1    34037250 24953524897 18158
- `children:smoker`                     1    38348896 24957836543 18159
<none>                                                24919487648 18159
- `poly(children, 2, raw = TRUE)[, 2]`  1    62415583 24981903231 18160
- `poly(bmi, 2, raw = TRUE)[, 2]`       1    87634033 25007121681 18161
- children                              1   231350017 25150837665 18167
- `poly(age, 2, raw = TRUE)[, 2]`       1   414919697 25334407344 18175
- bmi                                   1   673172853 25592660501 18186
- smoker                                1  2673948395 27593436043 18266
- `bmi:smoker`                          1 14034226485 38953714133 18635

Step:  AIC=18157.5
charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `age:children` + `bmi:smoker` + `children:smoker`

                                       Df   Sum of Sq         RSS   AIC
- age                                   1     9160764 24940659696 18156
- `age:children`                        1    33154389 24964653321 18157
- `children:smoker`                     1    38928840 24970427772 18157
<none>                                                24931498932 18158
- `poly(children, 2, raw = TRUE)[, 2]`  1    63557572 24995056504 18158
- `poly(bmi, 2, raw = TRUE)[, 2]`       1    81101276 25012600208 18159
- children                              1   232877866 25164376798 18165
- `poly(age, 2, raw = TRUE)[, 2]`       1   432337926 25363836858 18174
- bmi                                   1   666073590 25597572522 18184
- smoker                                1  2679798970 27611297902 18265
- `bmi:smoker`                          1 14048864520 38980363452 18633

Step:  AIC=18155.89
charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `age:children` + `bmi:smoker` + `children:smoker`

                                       Df   Sum of Sq         RSS   AIC
- `age:children`                        1    35667383 24976327079 18155
- `children:smoker`                     1    38765424 24979425119 18156
<none>                                                24940659696 18156
- `poly(children, 2, raw = TRUE)[, 2]`  1    56943592 24997603287 18156
- `poly(bmi, 2, raw = TRUE)[, 2]`       1    81482031 25022141727 18157
- children                              1   225871743 25166531438 18164
- bmi                                   1   665798814 25606458510 18182
- smoker                                1  2688617038 27629276734 18263
- `poly(age, 2, raw = TRUE)[, 2]`       1 10324076352 35264736047 18524
- `bmi:smoker`                          1 14077127035 39017786731 18632

Step:  AIC=18155.42
charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `bmi:smoker` + `children:smoker`

                                       Df   Sum of Sq         RSS   AIC
- `children:smoker`                     1    37721028 25014048107 18155
<none>                                                24976327079 18155
- `poly(children, 2, raw = TRUE)[, 2]`  1    49529043 25025856121 18156
- `poly(bmi, 2, raw = TRUE)[, 2]`       1    79418092 25055745171 18157
- children                              1   206827875 25183154954 18162
- bmi                                   1   675467497 25651794576 18182
- smoker                                1  2701478319 27677805397 18263
- `bmi:smoker`                          1 14103067817 39079394896 18632
- `poly(age, 2, raw = TRUE)[, 2]`       1 14416801552 39393128631 18641

Step:  AIC=18155.03
charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `bmi:smoker`

                                       Df   Sum of Sq         RSS   AIC
- `poly(children, 2, raw = TRUE)[, 2]`  1    39048162 25053096269 18155
<none>                                                25014048107 18155
- `poly(bmi, 2, raw = TRUE)[, 2]`       1    85815707 25099863814 18157
- children                              1   247054482 25261102589 18164
- bmi                                   1   662920634 25676968741 18181
- smoker                                1  2975033096 27989081203 18273
- `bmi:smoker`                          1 14154861302 39168909409 18632
- `poly(age, 2, raw = TRUE)[, 2]`       1 14393807419 39407855526 18639

Step:  AIC=18154.7
charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `bmi:smoker`

                                  Df   Sum of Sq         RSS   AIC
<none>                                           25053096269 18155
- `poly(bmi, 2, raw = TRUE)[, 2]`  1    87172604 25140268873 18156
- children                         1   657421145 25710517414 18180
- bmi                              1   661992396 25715088665 18181
- smoker                           1  2987010790 28040107059 18273
- `bmi:smoker`                     1 14202648668 39255744937 18633
- `poly(age, 2, raw = TRUE)[, 2]`  1 14400185274 39453281543 18638

Call:
lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `bmi:smoker`, data = train_poly)

Coefficients:
                    (Intercept)                              bmi  
                      17034.745                        -1085.987  
                       children                           smoker  
                        651.730                       -20934.955  
`poly(age, 2, raw = TRUE)[, 2]`  `poly(bmi, 2, raw = TRUE)[, 2]`  
                          3.318                           -5.713  
                   `bmi:smoker`  
                       1458.495  

Save the model in an object model_poly.

model_poly <- lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `bmi:smoker`, data = train_poly)

Model Evaluation

Let's first revisit the summary of our original linear regression model, model_2.

summary(model_2)

Call:
lm(formula = charges ~ age + bmi + children + smoker, data = data_train)

Residuals:
   Min     1Q Median     3Q    Max 
-11745  -2982   -988   1485  29488 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -12599.15    1077.14 -11.697 < 0.0000000000000002 ***
age            259.74      13.57  19.137 < 0.0000000000000002 ***
bmi            336.44      31.07  10.828 < 0.0000000000000002 ***
children       455.69     155.76   2.926              0.00351 ** 
smokeryes    23801.49     462.02  51.516 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6126 on 1064 degrees of freedom
Multiple R-squared:  0.7485,    Adjusted R-squared:  0.7475 
F-statistic: 791.5 on 4 and 1064 DF,  p-value: < 0.00000000000000022
  • We have four features, all of which are significant for charges. From the coefficients, a non-smoker aged zero with no children and a bmi of zero would be charged -$12,599, which is of course an impossible scenario; the intercept only anchors the regression line. Since smoker has the largest coefficient of all the features, a unit change in smoker changes charges more than a unit change in any other feature, holding everything else fixed: a smoker is charged $23,801 more than an otherwise identical non-smoker, which makes sense. A worked prediction follows after this list.

  • This model also has 0.7475 adjusted R-squared, which means the model with its features explains 74.75% of total variation in charges.
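
To make these coefficients tangible, the sketch below predicts charges for a hypothetical profile; the numbers are invented purely for illustration.

# hypothetical profile: 40 years old, bmi 30, two children, once as a smoker and once not
new_person <- data.frame(age = 40, bmi = 30, children = 2,
                         smoker = factor(c("yes", "no"), levels = levels(data_train$smoker)))
predict(model_2, newdata = new_person)  # the two predictions differ by the smoker coefficient, about $23,801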

Now, let’s compare this to the new Polynomial Regression model.

summary(model_poly)

Call:
lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `bmi:smoker`, data = train_poly)

Residuals:
    Min      1Q  Median      3Q     Max 
-8661.9 -1735.4 -1352.5  -751.4 30386.0 

Coefficients:
                                   Estimate  Std. Error t value
(Intercept)                      17034.7451   3778.1463   4.509
bmi                              -1085.9870    205.0060  -5.297
children                           651.7300    123.4566   5.279
smoker                          -20934.9550   1860.4675 -11.253
`poly(age, 2, raw = TRUE)[, 2]`      3.3177      0.1343  24.707
`poly(bmi, 2, raw = TRUE)[, 2]`     -5.7128      2.9719  -1.922
`bmi:smoker`                      1458.4955     59.4414  24.537
                                            Pr(>|t|)    
(Intercept)                              0.000007245 ***
bmi                                      0.000000143 ***
children                                 0.000000157 ***
smoker                          < 0.0000000000000002 ***
`poly(age, 2, raw = TRUE)[, 2]` < 0.0000000000000002 ***
`poly(bmi, 2, raw = TRUE)[, 2]`               0.0548 .  
`bmi:smoker`                    < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4857 on 1062 degrees of freedom
Multiple R-squared:  0.8422,    Adjusted R-squared:  0.8413 
F-statistic: 944.7 on 6 and 1062 DF,  p-value: < 0.00000000000000022
  • We have six features, all of which are significant for charges except the squared bmi term, which is only marginally significant (p ≈ 0.055). The intercept of $17,034 corresponds to a person with age zero, bmi zero, no children, and smoker coded as zero, which is not a realistic scenario. Because smoker was recoded as numeric (1 = no, 2 = yes) and interacts with bmi, its coefficient of -20,934 cannot be read the way it was in model_2: in isolation it suggests lower charges for smokers, but the large positive bmi:smoker interaction more than reverses this at realistic bmi values. In other words, introducing new features through polynomial combinations changes how the model must be interpreted, and a naive reading of the coefficients can be misleading.

  • The adjusted R-squared of this model is 0.8413, which means the model with its features explains 84% of total variation in charges. In other words, this Polynomial Regression model captures more variance of charges than the earlier Linear Regression model.

RMSE

Let's compare the prediction error of model_poly with that of model_2:

# RMSE of model_2
rmse1 <- RMSE(prediction2, data_test$charges)
paste("RMSE model_2:", round(rmse1, 3))
[1] "RMSE model_2: 5844.718"
# RMSE model_poly
y_pred <- predict(model_poly, test_poly)
rmse2 <- RMSE(y_pred, data_test$charges)
paste("RMSE model_poly:", round(rmse2, 3))
[1] "RMSE model_poly: 4667.661"

model_poly is an improvement: its RMSE of 4667.661 is lower than that of model_2.

Assumptions

We still need to check the assumptions that a linear regression model should satisfy:

1. Linearity

linearity_plot2 <- data.frame(residual = model_poly$residuals, fitted = model_poly$fitted.values)

linearity_plot2 %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())

There is little to no discernible pattern in the residual plot, so we can conclude that this model satisfies the linearity assumption.

2. Normality of Residuals

hist(model_poly$residuals, main = "Histogram of model_poly Residuals", xlab = "model_poly residuals", ylab = "count")

shapiro.test(model_poly$residuals)

    Shapiro-Wilk normality test

data:  model_poly$residuals
W = 0.63647, p-value < 0.00000000000000022

As with the previous model, model_2, the residuals are not normally distributed.

3. Homoscedasticity

As with model_2, we check for heteroscedasticity visually by plotting the fitted values against the residuals.

data.frame(fitted = model_poly$fitted.values, residual = model_poly$residuals) %>% 
  ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(yintercept = 0, col = "red")

Again we use the Breusch-Pagan test.

H0: Homoscedasticity
H1: Heteroscedasticity

bptest(model_poly)

    studentized Breusch-Pagan test

data:  model_poly
BP = 9.4173, df = 6, p-value = 0.1514

The p-value is higher than alpha (0.05), so we fail to reject H0: the residuals satisfy the homoscedasticity assumption.

4. Multicollinearity

Again we use the Variance Inflation Factor (VIF); VIF < 10 means no problematic multicollinearity among the predictors.

vif(model_poly)
                            bmi                        children 
                      69.966614                        1.000774 
                         smoker `poly(age, 2, raw = TRUE)[, 2]` 
                      25.808689                        1.016194 
`poly(bmi, 2, raw = TRUE)[, 2]`                    `bmi:smoker` 
                      59.120272                       34.573577 

No multicollinearity was found in the linear regression model, but several inflated VIF values appear in the polynomial regression model. This makes sense, since some of its features were created by multiplying two features from the linear regression model.

Conclusion

Below is the comparison between model_2 (linear regression) and model_poly (the improved model using polynomial regression):

Algorithm               Model        R-squared   RMSE
Linear Regression       model_2      0.7475      5844.718
Polynomial Regression   model_poly   0.8413      4667.661
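
For the record, a minimal sketch of how such a comparison table can be assembled with the kableExtra package loaded at the top (the objects rmse1 and rmse2 come from the RMSE section above):

comparison <- data.frame(
  Algorithm = c("Linear Regression", "Polynomial Regression"),
  Model     = c("model_2", "model_poly"),
  R_squared = c(summary(model_2)$adj.r.squared, summary(model_poly)$adj.r.squared),
  RMSE      = c(rmse1, rmse2)
)
knitr::kable(comparison, digits = 3) %>% kable_styling()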

We can conclude that feature engineering plays an important role in improving the model; here we applied it by creating polynomial combinations of the features with degree 2. The model improves significantly, with an R-squared of 0.8413 (higher than model_2) and an RMSE of 4667.661 (lower than model_2). However, some linear regression assumptions are still not satisfied: the residuals are still not normally distributed, and multicollinearity appears in the polynomial regression model, although the latter is expected because some of its features were created by multiplying two features from the linear model. In business practice, such assumption violations do not always matter when the goal is prediction, but checking them is good practice and makes the model more trustworthy.