Medical Cost Prediction

Library

library(dplyr)          # data wrangling
library(ggplot2)        # graphing
library(caret)          # machine learning functions
library(MLmetrics)      # machine learning metrics
library(car)            # VIF calculation
library(lmtest)         # linear regression model testing
library(GGally)         # correlation plot

Problem Statement

This project aims to accurately predict insurance costs from personal data such as age, Body Mass Index, and smoking status. Additionally, we will determine which variable most strongly influences insurance costs. This is a regression problem.

Dataset

The original dataset is available on Kaggle. For this project, it has already been randomly split into train and test datasets. Let’s read them in.

train <- read.csv("train.csv", stringsAsFactors=TRUE)
test <- read.csv("test.csv", stringsAsFactors=TRUE)
glimpse(train)
#> Rows: 1,070
#> Columns: 7
#> $ age      <int> 37, 18, 23, 32, 58, 25, 36, 34, 53, 45, 20, 60, 58, 34, 60...
#> $ sex      <fct> male, male, female, male, female, female, male, female, ma...
#> $ bmi      <dbl> 34.100, 34.430, 36.670, 35.200, 32.395, 26.790, 35.200, 33...
#> $ children <int> 4, 0, 2, 2, 1, 2, 1, 1, 0, 5, 1, 1, 0, 0, 0, 0, 1, 3, 5, 1...
#> $ smoker   <fct> yes, no, yes, no, no, no, yes, no, no, no, no, no, no, no,...
#> $ region   <fct> southwest, southeast, northeast, southwest, northeast, nor...
#> $ charges  <dbl> 40182.246, 1137.470, 38511.628, 4670.640, 13019.161, 4189....

As we can see, we got these features:

  1. age: age of primary beneficiary
  2. sex: insurance contractor gender, female, male
  3. bmi: Body Mass Index (\(kg/m^2\)), an objective measure of body weight relative to height, indicating whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9
  4. children: number of children covered by health insurance / number of dependents
  5. smoker: whether the beneficiary smokes
  6. region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
  7. charges: individual medical costs billed by health insurance

Since we are predicting insurance costs, charges will be our target feature.

Data Cleaning

First, we can see above that each feature is already in its correct type. Let’s check whether there are any duplicated observations in the train dataset.

train[duplicated(train), ]
#>     age  sex   bmi children smoker    region  charges
#> 268  19 male 30.59        0     no northwest 1639.563

There is one. It’s unlikely that two different people from the same region share the same age, sex, bmi, and number of children, are both non-smokers, and have exactly the same medical charges. We can drop this duplicated row.

train <- train %>% distinct()

Great! Now, we inspect missing values.

colSums(is.na(train))
#>      age      sex      bmi children   smoker   region  charges 
#>        0        0        0        0        0        0        0
colSums(is.na(test))
#>      age      sex      bmi children   smoker   region  charges 
#>        0        0        0        0        0        0        0

Awesome! No missing values.

Exploratory Data Analysis

Here are some descriptive statistics.

summary(train)
#>       age            sex           bmi           children    smoker   
#>  Min.   :18.00   female:543   Min.   :15.96   Min.   :0.00   no :850  
#>  1st Qu.:26.00   male  :526   1st Qu.:26.32   1st Qu.:0.00   yes:219  
#>  Median :39.00                Median :30.40   Median :1.00            
#>  Mean   :39.11                Mean   :30.73   Mean   :1.08            
#>  3rd Qu.:51.00                3rd Qu.:34.80   3rd Qu.:2.00            
#>  Max.   :64.00                Max.   :53.13   Max.   :5.00            
#>        region       charges     
#>  northeast:249   Min.   : 1122  
#>  northwest:253   1st Qu.: 4747  
#>  southeast:294   Median : 9447  
#>  southwest:273   Mean   :13212  
#>                  3rd Qu.:16587  
#>                  Max.   :63770

In terms of categorical features, the dataset has a similar number of people in each category, except for smoker: there are more non-smokers than smokers, which makes sense. Charges itself varies greatly, from around $1,000 to $64,000.

Let’s see the distribution of charges.

ggplot(data = train, aes(x = charges)) + 
  geom_density(alpha = 0.5) + 
  ggtitle("Distribution of Charges")

The distribution is right-skewed with a long tail. There’s a bump at around $40,000, which suggests another hidden distribution. To dig into this, we can use the categorical features.

for (col in c('sex', 'region', 'children', 'smoker')) {
  plot <- ggplot(data = train,
                 aes_string(x = col, y = 'charges', group = col, fill = col)) + 
            geom_boxplot(show.legend = FALSE) + 
            ggtitle(glue::glue("Boxplot of Medical Charges per {col}"))
  print(plot)
}

sex and region show no noticeable differences in charges across their categories. There is an increasing trend in charges as the number of children increases. Lastly, smoker seems to make a significant difference in the charges billed by health insurance.

Let’s draw the distribution of charges again, now split by smoker.

ggplot(data = train, aes(x = charges, fill = smoker)) + 
  geom_density(alpha = 0.5) + 
  ggtitle("Distribution of Charges per Smoking Category")

Look at what we found! Smokers definitely incur higher charges than non-smokers.

Let’s analyze the medical charges by age, bmi and children according to the smoker factor.

for (feat in c('age', 'bmi', 'children')) {
  plot <- ggplot(data = train, aes_string(x = feat, y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) + 
    geom_jitter() + 
    geom_smooth(method = 'lm') +
    ggtitle(glue::glue("Charges vs {feat}"))  
  print(plot)
}

smoker seems to have the highest impact on medical charges, even though charges also grow with age, bmi, and children. Also, people who have more children generally smoke less.

Lastly, we have the correlations between features as follows.

ggcorr(train %>% mutate_if(is.factor, as.numeric), label = TRUE)

As we can see, there are almost no correlations between features except for smoker and charges.

Metrics and Validation Strategy

We will use Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Root Mean Squared Logarithmic Error (RMSLE) as our metrics. Which of these three metrics matters most depends on the business point of view. To see what we mean, suppose the true value of one observation of charges is $10,000. Assume the model predictions are exactly the same as the true values, except for this particular observation, which the model predicts as \(x\). We will vary \(x\) from $1,000 to $19,000 and look at the resulting error.
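
For reference, with \(n\) observations, predictions \(\hat{y}_i\), and actual values \(y_i\), these metrics are \(MAE = \frac{1}{n}\sum_{i=1}^{n}|\hat{y}_i - y_i|\), \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}\), and \(RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\log \hat{y}_i - \log y_i)^2}\) (implementations such as MLmetrics commonly add 1 inside the logarithm to guard against zero values).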

true <- 10000
pred <- seq(from = 1000, to = 19000, length.out = 181)
x <- pred - true
rmse <- (x ^ 2) ^ 0.5                        # |prediction - actual| for the single mispredicted observation
rmsle <- ((log(pred) - log(true)) ^ 2) ^ 0.5 # |log(prediction) - log(actual)| for the same observation

par(mfrow = c(1, 2))
plot(x = x, y = rmse, 
     type = "l", 
     main = "Root Mean Squared Error", 
     xlab = "Error (prediction - actual)", ylab = "RMSE")
plot(x = x, y = rmsle, 
     type = "l", 
     main = "Root Mean Squared Logarithmic Error", 
     xlab = "Error (prediction - actual)", ylab = "RMSLE")

As we can see, RMSLE incurs a larger penalty for underestimating the actual value than for overestimating it. Also, RMSLE only considers the relative error between the predicted and the actual value, so the absolute scale of the error does not matter, whereas RMSE increases in magnitude as the scale of the error increases. This means RMSLE should be more useful than RMSE when underestimation is particularly undesirable.

MAE and RMSE are indifferent to the direction of errors. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful than MAE when large errors are particularly undesirable.
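
A quick numeric illustration of this sensitivity, using a made-up error vector:

errors <- c(1, 1, 1, 10)   # three small errors and one large one (made-up numbers)
mean(abs(errors))          # MAE = 3.25
sqrt(mean(errors ^ 2))     # RMSE is about 5.07; the single large error dominates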

Knowing the metrics, we validate the performance of our models simply by applying them to the unseen data in test.csv and looking at the metric scores. We don’t do k-fold cross-validation since the dataset is small.
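
In other words, each model will be evaluated along these lines (a minimal sketch; eval_metrics() is a hypothetical helper, not used verbatim below, wrapping the MLmetrics functions):

eval_metrics <- function(y_pred, y_true) {
  c("MAE"   = MAE(y_pred, y_true),
    "RMSE"  = RMSE(y_pred, y_true),
    "RMSLE" = RMSLE(y_pred, y_true))
}
# usage: eval_metrics(predict(model, test), test$charges)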

Modeling

We will build and train a Linear Regression model for this problem. For starters, let’s use all available features in the model.

Linear Regression

Linear Regression will be implemented with automatic feature selection using backward elimination. Starting from all features, the backward elimination process iteratively discards features and re-evaluates the model until it finds the one with the lowest Akaike Information Criterion (AIC). Given a collection of models for the data, AIC estimates the quality of each model relative to the others in terms of information loss; a lower AIC means a better model. We’ll use the step() function to apply backward elimination in a greedy manner.
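
For lm models, step() relies on extractAIC(), which (up to an additive constant) computes \(AIC = n \ln(RSS/n) + 2k\), where \(RSS\) is the residual sum of squares and \(k\) is the number of estimated parameters. Dropping a feature reduces \(k\) by one but typically increases \(RSS\), and the elimination stops when no single drop lowers the AIC further.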

temp <- lm(charges ~ ., data = train)
step(temp)
#> Start:  AIC=18667.88
#> charges ~ age + sex + bmi + children + smoker + region
#> 
#>            Df   Sum of Sq          RSS   AIC
#> - region    3   136779757  40475571916 18666
#> - sex       1       43071  40338835230 18666
#> <none>                     40338792158 18668
#> - children  1   294681919  40633474078 18674
#> - bmi       1  4133255306  44472047465 18770
#> - age       1 13330028969  53668821128 18971
#> - smoker    1 95616334451 135955126610 19965
#> 
#> Step:  AIC=18665.5
#> charges ~ age + sex + bmi + children + smoker
#> 
#>            Df   Sum of Sq          RSS   AIC
#> - sex       1      120916  40475692832 18664
#> <none>                     40475571916 18666
#> - children  1   288132465  40763704381 18671
#> - bmi       1  4147134564  44622706480 18768
#> - age       1 13500662196  53976234111 18971
#> - smoker    1 96273590176 136749162092 19965
#> 
#> Step:  AIC=18663.5
#> charges ~ age + bmi + children + smoker
#> 
#>            Df   Sum of Sq          RSS   AIC
#> <none>                     40475692832 18664
#> - children  1   288027795  40763720627 18669
#> - bmi       1  4151671360  44627364191 18766
#> - age       1 13508639838  53984332670 18969
#> - smoker    1 96513856276 136989549108 19965
#> 
#> Call:
#> lm(formula = charges ~ age + bmi + children + smoker, data = train)
#> 
#> Coefficients:
#> (Intercept)          age          bmi     children    smokeryes  
#>    -11905.1        254.9        320.6        429.9      23586.1

Save the best model as lm_all, then predict. After that, calculate the metrics.

lm_all <- lm(formula = charges ~ age + bmi + children + smoker, data = train)
y_pred <- predict(lm_all, test)
mae <- MAE(y_pred, test$charges)
rmse <- RMSE(y_pred, test$charges)

To calculate RMSLE, note that the predictions and the target variable must be positive, or else the logarithm will result in NaN. As we check below, there is indeed one negative value in the predictions, at observation 149. A common way to overcome this problem is to take the log() of charges before training the model, then convert the predictions back using exp(). However, since we have already fit the Linear Regression model without log-transforming charges, we will stick with this model. The observation with a negative prediction will simply be discarded from the RMSLE calculation.

y_pred[y_pred <=0]
#>       149 
#> -87.05982
rmsle <- RMSLE(y_pred[-149], test$charges[-149])
lin_reg <- cbind("MAE" = mae, "RMSE" = rmse, "RMSLE" = rmsle)
lin_reg
#>           MAE     RMSE     RMSLE
#> [1,] 3941.464 5672.102 0.5455373

We obtain the errors above. For MAE and RMSE, we can interpret them as: the model’s predictions of charges differ from the true values by $3,941 and $5,672 on average, respectively. However, RMSLE can no longer be interpreted directly in terms of charges. Please note that RMSE is always greater than or equal to MAE; in fact, we have the inequality \(MAE \leq RMSE \leq \sqrt{n} \times MAE\).
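
As an aside, the log-transform workaround mentioned above could look roughly like this (a sketch only; we don’t use it for the rest of the project):

# Sketch: fit on log(charges) so back-transformed predictions are always positive
lm_log <- lm(log(charges) ~ age + bmi + children + smoker, data = train)
y_pred_log <- exp(predict(lm_log, test))   # convert back to the original scale
RMSLE(y_pred_log, test$charges)            # no negative predictions to discard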

Polynomial Regression

We can improve our model with feature engineering, specifically by making new features that capture the interactions between existing features. This is called polynomial regression. The idea is to generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form \([a, b]\), the degree-2 polynomial features are \([1, a, b, a^2, ab, b^2]\). We will use degree 2.
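
As a tiny illustration (toy is a made-up two-column data frame), R’s model.matrix() can generate exactly this kind of expansion:

toy <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
model.matrix(~ .^2 + I(a^2) + I(b^2), data = toy)
# columns: (Intercept), a, b, I(a^2), I(b^2), a:b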

We don’t want charges to be included in the process of generating the polynomial combinations, so we take out charges from train and test and save them as y_train and y_test, respectively.

y_train <- train$charges
y_test <- test$charges

From EDA we know that sex and region have no correlation with charges, so we can drop them. Also, since polynomial combinations don’t make sense for categorical features, we convert smoker to numeric.

X_train <- train %>% 
  select(-c(charges, sex, region)) %>% 
  mutate(smoker = as.numeric(smoker))
X_test <- test %>% 
  select(-c(charges, sex, region)) %>% 
  mutate(smoker = as.numeric(smoker))

We use the formula below to generate the polynomial combinations.

formula <- as.formula(
  paste(
    ' ~ .^2 + ', 
    paste('poly(', colnames(X_train), ', 2, raw=TRUE)[, 2]', collapse = ' + ')
  )
)

formula
#> ~.^2 + poly(age, 2, raw = TRUE)[, 2] + poly(bmi, 2, raw = TRUE)[, 
#>     2] + poly(children, 2, raw = TRUE)[, 2] + poly(smoker, 2, 
#>     raw = TRUE)[, 2]

Then, insert y_train and y_test back into the new datasets.

train_poly <- as.data.frame(model.matrix(formula, data = X_train))
test_poly <- as.data.frame(model.matrix(formula, data = X_test))
train_poly$charges <- y_train
test_poly$charges <- y_test
colnames(train_poly)
#>  [1] "(Intercept)"                        "age"                               
#>  [3] "bmi"                                "children"                          
#>  [5] "smoker"                             "poly(age, 2, raw = TRUE)[, 2]"     
#>  [7] "poly(bmi, 2, raw = TRUE)[, 2]"      "poly(children, 2, raw = TRUE)[, 2]"
#>  [9] "poly(smoker, 2, raw = TRUE)[, 2]"   "age:bmi"                           
#> [11] "age:children"                       "age:smoker"                        
#> [13] "bmi:children"                       "bmi:smoker"                        
#> [15] "children:smoker"                    "charges"

We can see that our new datasets train_poly and test_poly now have 16 columns:

  1. \((Intercept)\) is a column consisting of the constant 1; this is the constant term in the polynomial.
  2. \(age, bmi, children, smoker\) are the original features.
  3. \(age^2, bmi^2, children^2, smoker^2\) are the squares of the original features.
  4. \(age \times bmi, age \times children, age \times smoker, bmi \times children, bmi \times smoker, children \times smoker\) are the six pairwise interactions between the four features.
  5. \(charges\) is the target feature.
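
A quick sanity check that both datasets got the expected structure:

ncol(train_poly)   # 16 columns, matching the list above
ncol(test_poly)    # the test set gets the same structure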

Now, we are ready to make the model. As usual, we start with all features, and work our way down using backward elimination.

temp <- lm(formula = charges ~ ., data = train_poly)
step(temp)
#> Start:  AIC=18196.7
#> charges ~ `(Intercept)` + age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `poly(smoker, 2, raw = TRUE)[, 2]` + `age:bmi` + `age:children` + 
#>     `age:smoker` + `bmi:children` + `bmi:smoker` + `children:smoker`
#> 
#> 
#> Step:  AIC=18196.7
#> charges ~ `(Intercept)` + age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `age:bmi` + `age:children` + `age:smoker` + `bmi:children` + 
#>     `bmi:smoker` + `children:smoker`
#> 
#> 
#> Step:  AIC=18196.7
#> charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `age:bmi` + `age:children` + `age:smoker` + `bmi:children` + 
#>     `bmi:smoker` + `children:smoker`
#> 
#>                                        Df   Sum of Sq         RSS   AIC
#> - `age:smoker`                          1      638029 25718637159 18195
#> - `bmi:children`                        1      817883 25718817013 18195
#> - age                                   1     5612815 25723611945 18195
#> - `age:children`                        1     9673381 25727672511 18195
#> - `age:bmi`                             1    26766877 25744766007 18196
#> - `children:smoker`                     1    45654901 25763654030 18197
#> - children                              1    47711668 25765710798 18197
#> <none>                                                25717999130 18197
#> - `poly(children, 2, raw = TRUE)[, 2]`  1    63400816 25781399946 18197
#> - `poly(bmi, 2, raw = TRUE)[, 2]`       1   257208016 25975207146 18205
#> - `poly(age, 2, raw = TRUE)[, 2]`       1   258018286 25976017416 18205
#> - bmi                                   1   446640632 26164639762 18213
#> - smoker                                1  2015205306 27733204436 18275
#> - `bmi:smoker`                          1 13705549198 39423548327 18651
#> 
#> Step:  AIC=18194.73
#> charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `age:bmi` + `age:children` + `bmi:children` + `bmi:smoker` + 
#>     `children:smoker`
#> 
#>                                        Df   Sum of Sq         RSS   AIC
#> - `bmi:children`                        1      828285 25719465444 18193
#> - age                                   1     4976641 25723613800 18193
#> - `age:children`                        1     9825302 25728462461 18193
#> - `age:bmi`                             1    26787718 25745424877 18194
#> - `children:smoker`                     1    45119820 25763756979 18195
#> - children                              1    47286634 25765923793 18195
#> <none>                                                25718637159 18195
#> - `poly(children, 2, raw = TRUE)[, 2]`  1    63295487 25781932646 18195
#> - `poly(bmi, 2, raw = TRUE)[, 2]`       1   257419441 25976056600 18203
#> - `poly(age, 2, raw = TRUE)[, 2]`       1   257631866 25976269025 18203
#> - bmi                                   1   447419331 26166056489 18211
#> - smoker                                1  2445531784 28164168943 18290
#> - `bmi:smoker`                          1 13756292505 39474929664 18651
#> 
#> Step:  AIC=18192.76
#> charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `age:bmi` + `age:children` + `bmi:smoker` + `children:smoker`
#> 
#>                                        Df   Sum of Sq         RSS   AIC
#> - age                                   1     5196666 25724662109 18191
#> - `age:children`                        1    11062155 25730527598 18191
#> - `age:bmi`                             1    27346149 25746811593 18192
#> - `children:smoker`                     1    46443285 25765908729 18193
#> <none>                                                25719465444 18193
#> - `poly(children, 2, raw = TRUE)[, 2]`  1    63858086 25783323530 18193
#> - children                              1   101526014 25820991458 18195
#> - `poly(bmi, 2, raw = TRUE)[, 2]`       1   256592182 25976057625 18201
#> - `poly(age, 2, raw = TRUE)[, 2]`       1   258013469 25977478912 18201
#> - bmi                                   1   446982085 26166447529 18209
#> - smoker                                1  2466700190 28186165633 18289
#> - `bmi:smoker`                          1 13890308278 39609773722 18652
#> 
#> Step:  AIC=18190.98
#> charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `age:bmi` + `age:children` + `bmi:smoker` + `children:smoker`
#> 
#>                                        Df   Sum of Sq         RSS   AIC
#> - `age:children`                        1    10800698 25735462807 18189
#> - `age:bmi`                             1    22525274 25747187384 18190
#> - `children:smoker`                     1    46333040 25770995149 18191
#> <none>                                                25724662109 18191
#> - `poly(children, 2, raw = TRUE)[, 2]`  1    59835331 25784497440 18192
#> - children                              1    97931108 25822593218 18193
#> - `poly(bmi, 2, raw = TRUE)[, 2]`       1   252131105 25976793215 18199
#> - bmi                                   1   441828081 26166490190 18207
#> - `poly(age, 2, raw = TRUE)[, 2]`       1   527300413 26251962522 18211
#> - smoker                                1  2465304613 28189966723 18287
#> - `bmi:smoker`                          1 13885998170 39610660279 18650
#> 
#> Step:  AIC=18189.43
#> charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `age:bmi` + `bmi:smoker` + `children:smoker`
#> 
#>                                        Df   Sum of Sq         RSS   AIC
#> - `age:bmi`                             1    24487435 25759950242 18188
#> <none>                                                25735462807 18189
#> - `children:smoker`                     1    49159638 25784622445 18190
#> - `poly(children, 2, raw = TRUE)[, 2]`  1    64458233 25799921040 18190
#> - children                              1   235234433 25970697239 18197
#> - `poly(bmi, 2, raw = TRUE)[, 2]`       1   255427192 25990889999 18198
#> - bmi                                   1   439409358 26174872164 18206
#> - `poly(age, 2, raw = TRUE)[, 2]`       1   548137618 26283600425 18210
#> - smoker                                1  2461486940 28196949747 18285
#> - `bmi:smoker`                          1 13883155020 39618617827 18649
#> 
#> Step:  AIC=18188.44
#> charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `bmi:smoker` + `children:smoker`
#> 
#>                                        Df   Sum of Sq         RSS   AIC
#> <none>                                                25759950242 18188
#> - `children:smoker`                     1    49343171 25809293413 18189
#> - `poly(children, 2, raw = TRUE)[, 2]`  1    75043200 25834993442 18190
#> - `poly(bmi, 2, raw = TRUE)[, 2]`       1   246287385 26006237626 18197
#> - children                              1   257057142 26017007384 18197
#> - bmi                                   1   414955971 26174906212 18204
#> - smoker                                1  2440592260 28200542501 18283
#> - `bmi:smoker`                          1 13871440992 39631391233 18647
#> - `poly(age, 2, raw = TRUE)[, 2]`       1 14593600500 40353550742 18666
#> 
#> Call:
#> lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `bmi:smoker` + `children:smoker`, data = train_poly)
#> 
#> Coefficients:
#>                          (Intercept)                                   bmi  
#>                            11544.691                              -820.904  
#>                             children                                smoker  
#>                             1711.975                            -18804.065  
#>      `poly(age, 2, raw = TRUE)[, 2]`       `poly(bmi, 2, raw = TRUE)[, 2]`  
#>                                3.319                                -9.138  
#> `poly(children, 2, raw = TRUE)[, 2]`                          `bmi:smoker`  
#>                             -154.913                              1405.110  
#>                    `children:smoker`  
#>                             -458.435

Save the best model as lm_poly, then predict. After that, calculate the metrics. This time we are fortunate to have no negative values in the predictions, so RMSLE can be calculated directly.

lm_poly <- lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
    `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
    `bmi:smoker` + `children:smoker`, data = train_poly)

y_pred <- predict(lm_poly, test_poly)
mae <- MAE(y_pred, test$charges)
rmse <- RMSE(y_pred, test$charges)
rmsle <- RMSLE(y_pred, test$charges)

poly_reg <- cbind("MAE" = mae, "RMSE" = rmse, "RMSLE" = rmsle)
poly_reg
#>           MAE     RMSE     RMSLE
#> [1,] 2835.106 4327.179 0.3926167

Model Evaluation

Let’s see the summary of our original Linear Regression model.

summary(lm_all)
#> 
#> Call:
#> lm(formula = charges ~ age + bmi + children + smoker, data = train)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -11734  -2983  -1004   1356  29708 
#> 
#> Coefficients:
#>              Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) -11905.11    1060.63 -11.225 < 0.0000000000000002 ***
#> age            254.87      13.52  18.844 < 0.0000000000000002 ***
#> bmi            320.64      30.69  10.447 < 0.0000000000000002 ***
#> children       429.86     156.22   2.752              0.00603 ** 
#> smokeryes    23586.13     468.26  50.370 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6168 on 1064 degrees of freedom
#> Multiple R-squared:  0.7359, Adjusted R-squared:  0.7349 
#> F-statistic: 741.3 on 4 and 1064 DF,  p-value: < 0.00000000000000022

We have four features, all of which have a significant effect on charges (a real effect, not due to random chance or sampling). From the coefficients, we see that a non-smoking, zero-year-old person with no children and a bmi of zero would be charged -$11,905 by health insurance (an impossible scenario, of course). Also, since smoker has the largest coefficient of all the features, a unit change in smoker produces a bigger change in charges than a unit change in any other feature, all else fixed. Concretely, holding all other features fixed, a smoker is charged $23,586 more than a non-smoker, which makes sense.
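
For instance, the fitted charge for a hypothetical 40-year-old smoker with a bmi of 30 and 2 children can be read straight off these coefficients:

# hypothetical person: age 40, bmi 30, 2 children, smoker
sum(coef(lm_all) * c(1, 40, 30, 2, 1))
# -11905 + 254.9 * 40 + 320.6 * 30 + 429.9 * 2 + 23586, which is about $32,355
# this should match predict(lm_all, data.frame(age = 40, bmi = 30, children = 2, smoker = "yes"))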

This model also has an adjusted R-squared of 0.7349, which means the model and its features explain about 73% of the total variation in charges.
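
This adjusted value follows from \(\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}\): with \(R^2 = 0.7359\), \(n = 1069\) observations, and \(k = 4\) predictors, we get \(1 - 0.2641 \times 1068/1064 \approx 0.7349\), matching the summary output.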

Now, let’s compare this to the new Polynomial Regression model.

summary(lm_poly)
#> 
#> Call:
#> lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` + 
#>     `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` + 
#>     `bmi:smoker` + `children:smoker`, data = train_poly)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -11017.7  -1932.9  -1339.8   -559.8  29962.3 
#> 
#> Coefficients:
#>                                         Estimate  Std. Error t value
#> (Intercept)                           11544.6914   3699.3731   3.121
#> bmi                                    -820.9038    198.6602  -4.132
#> children                               1711.9751    526.3834   3.252
#> smoker                               -18804.0652   1876.3926 -10.021
#> `poly(age, 2, raw = TRUE)[, 2]`           3.3189      0.1354  24.505
#> `poly(bmi, 2, raw = TRUE)[, 2]`          -9.1376      2.8703  -3.183
#> `poly(children, 2, raw = TRUE)[, 2]`   -154.9126     88.1557  -1.757
#> `bmi:smoker`                           1405.1099     58.8124  23.891
#> `children:smoker`                      -458.4348    321.7241  -1.425
#>                                                  Pr(>|t|)    
#> (Intercept)                                       0.00185 ** 
#> bmi                                             0.0000388 ***
#> children                                          0.00118 ** 
#> smoker                               < 0.0000000000000002 ***
#> `poly(age, 2, raw = TRUE)[, 2]`      < 0.0000000000000002 ***
#> `poly(bmi, 2, raw = TRUE)[, 2]`                   0.00150 ** 
#> `poly(children, 2, raw = TRUE)[, 2]`              0.07916 .  
#> `bmi:smoker`                         < 0.0000000000000002 ***
#> `children:smoker`                                 0.15447    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4930 on 1060 degrees of freedom
#> Multiple R-squared:  0.8319, Adjusted R-squared:  0.8307 
#> F-statistic: 655.9 on 8 and 1060 DF,  p-value: < 0.00000000000000022

We have eight features, all of which have a significant effect on charges except for children:smoker (and \(children^2\) is only marginally significant). From the coefficients, a non-smoking, zero-year-old person with no children and a bmi of zero would be charged $11,545 by health insurance (again an impossible scenario). Also, since smoker has the largest coefficient of all the features, a unit change in smoker produces a bigger change in charges than a unit change in any other feature, all else fixed. In this case, holding all other features fixed, a non-smoker would be charged $18,804 more than a smoker. Wait, what!? This shows that by introducing new features through polynomial combinations, the structure of the model changes and a naive interpretation of individual coefficients can become misleading.

The adjusted R-squared of this model is 0.8307, which means the model with its features explains 83% of total variation in charges. In other words, this Polynomial Regression model captures more variance of charges than the earlier Linear Regression model.

But don’t get puffed up! There are several assumptions that need to be satisfied when using a Linear Regression model:

  1. Linearity of the Data

We need to make sure that there is a linear relationship between the predictors and the target variable. This can be done visually by looking at the correlation between each predictor and the target variable.

cols <- c('age', 'children', 'bmi', 'smoker')

temp <- train %>% 
  select(cols) %>% 
  mutate(smoker = as.numeric(smoker))
temp$charges <- y_train
ggcorr(temp, label = T)

Another way is to use hypothesis testing with Pearson’s product-moment correlation.

  • H0: the predictor does not correlate with charges
  • H1: the predictor correlates with charges
for (col in cols) {
  cor <- cor.test(temp[, col], temp$charges)
  print(round(cor$p.value, 4))
}
#> [1] 0
#> [1] 0.0062
#> [1] 0
#> [1] 0

Since the p-value for each predictor-target pair is below alpha (0.05), we reject H0. We can safely say that the predictors correlate with the target variable.

Now, for the Polynomial Regression.

cols <- c('bmi', 'children', 'smoker', 'poly(age, 2, raw = TRUE)[, 2]', 
          'poly(bmi, 2, raw = TRUE)[, 2]', 'poly(children, 2, raw = TRUE)[, 2]', 
          'bmi:smoker', 'children:smoker')

ggcorr(train_poly %>% select(c(cols, 'charges')), hjust = 1, layout.exp = 2, label = T)

for (col in cols) {
  cor <- cor.test(train_poly[, col], train_poly$charges)
  print(round(cor$p.value, 4))
}
#> [1] 0
#> [1] 0.0062
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0.1327
#> [1] 0
#> [1] 0

We can say that the predictors correlate with the target variable, except for \(children^2\).

  2. Normality of Residuals

The residuals of a Linear Regression model should be normally distributed around zero. To check this, we can plot a histogram of the residuals.

hist(lm_all$residuals)

Another way is to use Shapiro-Wilk test to our residual.

  • H0: Residuals are normally distributed
  • H1: Residuals are not normally distributed
shapiro.test(lm_all$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  lm_all$residuals
#> W = 0.89481, p-value < 0.00000000000000022

Since the p-value is below alpha (0.05), we reject H0. Hence, the residuals are not normally distributed.

Now, for the Polynomial Regression.

hist(lm_poly$residuals)

shapiro.test(lm_poly$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  lm_poly$residuals
#> W = 0.65297, p-value < 0.00000000000000022

The same applies here: the residuals are not normally distributed.

  3. Homoscedasticity

Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values. In a Linear Regression model, if the variance of the errors varies across the range of fitted values, heteroscedasticity is present, which implies a non-random pattern in the residuals. We can check this visually by plotting the fitted values against the residuals.

plot(lm_all$fitted.values, lm_all$residuals)
abline(h=0, col = "red")

Another way is to use the Breusch-Pagan hypothesis test.

  • H0: Homoscedasticity
  • H1: Heteroscedasticity
bptest(lm_all)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  lm_all
#> BP = 89.206, df = 4, p-value < 0.00000000000000022

Since the p-value is below alpha (0.05), we reject H0. This means the residuals are heteroscedastic.

Now, for the Polynomial Regression.

plot(lm_poly$fitted.values, lm_poly$residuals)
abline(h=0, col = "red")

bptest(lm_poly)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  lm_poly
#> BP = 13.059, df = 8, p-value = 0.1098

Since the p-value is above alpha (0.05), we fail to reject H0. This means the residuals satisfy the homoscedasticity assumption.

  4. No multicollinearity between predictors

One of the statistical tools to assess multicollinearity is the Variance Inflation Factor (VIF). Put simply, VIF measures how much multicollinearity among the predictors inflates the variance of the estimated coefficients. A VIF below 10 is commonly taken to mean that no problematic multicollinearity occurs among the predictors.
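
For each predictor \(j\), \(VIF_j = \frac{1}{1 - R_j^2}\), where \(R_j^2\) is the R-squared from regressing predictor \(j\) on all the other predictors; a VIF of 10 therefore corresponds to \(R_j^2 = 0.9\).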

vif(lm_all)
#>      age      bmi children   smoker 
#> 1.018303 1.013178 1.004060 1.003712
vif(lm_poly)
#>                                  bmi                             children 
#>                            66.442833                            17.844408 
#>                               smoker      `poly(age, 2, raw = TRUE)[, 2]` 
#>                            25.228626                             1.020981 
#>      `poly(bmi, 2, raw = TRUE)[, 2]` `poly(children, 2, raw = TRUE)[, 2]` 
#>                            56.740085                             6.702367 
#>                         `bmi:smoker`                    `children:smoker` 
#>                            32.271971                            11.527064

No multicollinearity is found in the Linear Regression model, but plenty is found in the Polynomial Regression model. This makes sense, since some features in the Polynomial Regression model were created by multiplying pairs of features from the Linear Regression model.

Conclusion

result <- rbind(lin_reg, poly_reg)
rownames(result) <- c("Linear Regression", "Polynomial Regression")
result
#>                            MAE     RMSE     RMSLE
#> Linear Regression     3941.464 5672.102 0.5455373
#> Polynomial Regression 2835.106 4327.179 0.3926167

For a simple model like Linear Regression, feature engineering plays an important role in improving the model. In this project, we applied this technique by making degree-2 polynomial combinations of the features. The model improves significantly, reaching MAE 2835, RMSE 4327, and RMSLE 0.39. However, some Linear Regression assumptions may break down in the process. Also, smoking is not good for your wallet!!