Medical Cost Prediction
Library
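The library chunk isn’t shown in the rendered output. Based on the functions called throughout (glimpse, ggplot, ggcorr, MAE/RMSE/RMSLE, bptest, vif), the setup is likely something along these lines; the exact packages are an assumption, and MAE()/RMSE() could equally come from caret:
library(dplyr)      # glimpse(), select(), mutate(), %>%
library(ggplot2)    # density plots, boxplots, scatter plots
library(GGally)     # ggcorr() correlation heatmaps
library(MLmetrics)  # MAE(), RMSE(), RMSLE()
library(lmtest)     # bptest() for the Breusch-Pagan test
library(car)        # vif() for the Variance Inflation Factor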
Problem Statement
This project aims to accurately predict insurance costs from personal data such as age, Body Mass Index, and smoking status. We will also determine which variable has the greatest influence on insurance costs. This is a regression problem.
Dataset
The original dataset is available on Kaggle. For this project, it has already been split randomly into train and test datasets. Let’s read them in.
train <- read.csv("train.csv", stringsAsFactors=TRUE)
test <- read.csv("test.csv", stringsAsFactors=TRUE)
glimpse(train)
#> Rows: 1,070
#> Columns: 7
#> $ age <int> 37, 18, 23, 32, 58, 25, 36, 34, 53, 45, 20, 60, 58, 34, 60...
#> $ sex <fct> male, male, female, male, female, female, male, female, ma...
#> $ bmi <dbl> 34.100, 34.430, 36.670, 35.200, 32.395, 26.790, 35.200, 33...
#> $ children <int> 4, 0, 2, 2, 1, 2, 1, 1, 0, 5, 1, 1, 0, 0, 0, 0, 1, 3, 5, 1...
#> $ smoker <fct> yes, no, yes, no, no, no, yes, no, no, no, no, no, no, no,...
#> $ region <fct> southwest, southeast, northeast, southwest, northeast, nor...
#> $ charges <dbl> 40182.246, 1137.470, 38511.628, 4670.640, 13019.161, 4189....
As we can see, we got these features:
- age: age of primary beneficiary
- sex: insurance contractor gender (female, male)
- bmi: Body Mass Index (\(kg/m^2\)), an objective index of body weight based on the ratio of weight to height, indicating whether weight is relatively high or low for a given height; ideally 18.5 to 24.9
- children: number of children covered by health insurance / number of dependents
- smoker: smoking or not
- region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
- charges: individual medical costs billed by health insurance
Since we are predicting insurance costs, charges will be our target feature.
Data Cleaning
First, we can see above that each feature already has the correct type. Let’s check whether there are any duplicated observations in the train dataset.
#> age sex bmi children smoker region charges
#> 268 19 male 30.59 0 no northwest 1639.563
There is one. It’s unlikely that two different people share the same age, sex, bmi, and number of children, come from the same region, are both non-smokers, and are billed exactly the same medical charges. We can drop this duplicated row.
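The chunk behind the output above isn’t shown; a minimal sketch of the check and the drop, using base R’s duplicated(), could be:
train[duplicated(train), ]            # list rows that repeat an earlier observation
train <- train[!duplicated(train), ]  # keep only the unique rows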
Great! Now, we inspect missing values.
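The per-column counts below can be produced with something like:
colSums(is.na(train))
colSums(is.na(test))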
#> age sex bmi children smoker region charges
#> 0 0 0 0 0 0 0
#> age sex bmi children smoker region charges
#> 0 0 0 0 0 0 0
Awesome! No missing values.
Exploratory Data Analysis
Here are some descriptive statistics.
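The table below is presumably base R’s summary() of the cleaned train set:
summary(train)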
#> age sex bmi children smoker
#> Min. :18.00 female:543 Min. :15.96 Min. :0.00 no :850
#> 1st Qu.:26.00 male :526 1st Qu.:26.32 1st Qu.:0.00 yes:219
#> Median :39.00 Median :30.40 Median :1.00
#> Mean :39.11 Mean :30.73 Mean :1.08
#> 3rd Qu.:51.00 3rd Qu.:34.80 3rd Qu.:2.00
#> Max. :64.00 Max. :53.13 Max. :5.00
#> region charges
#> northeast:249 Min. : 1122
#> northwest:253 1st Qu.: 4747
#> southeast:294 Median : 9447
#> southwest:273 Mean :13212
#> 3rd Qu.:16587
#> Max. :63770
In terms of categorical features, the dataset has a similar number of people in each category, except for smoker: there are more non-smokers than smokers, which makes sense. The charges themselves vary greatly, from around $1,000 to $64,000.
Let’s see the distribution of charges.
ggplot(data = train, aes(x = charges)) +
geom_density(alpha = 0.5) +
ggtitle("Distribution of Charges")
The distribution is right-skewed, with a long tail to the right. There’s a bump at around $40,000, perhaps a second, hidden distribution. To dig into this, we need the categorical features.
for (col in c('sex', 'region', 'children', 'smoker')) {
plot <- ggplot(data = train,
aes_string(x = col, y = 'charges', group = col, fill = col)) +
geom_boxplot(show.legend = FALSE) +
ggtitle(glue::glue("Boxplot of Medical Charges per {col}"))
print(plot)
}
sex and region don’t show noticeable differences between categories in terms of the charges given. We can see an increasing trend in charges as the number of children increases. Lastly, smoker seems to make a significant difference in the charges given by health insurance.
Let’s draw the distribution of charges again, now split by the smoker category.
ggplot(data = train, aes(x = charges, fill = smoker)) +
geom_density(alpha = 0.5) +
ggtitle("Distribution of Charges per Smoking Category")
Look at what we found! Smokers are definitely charged more than non-smokers.
Let’s analyze the medical charges by age, bmi and children according to the smoker factor.
for (feat in c('age', 'bmi', 'children')) {
plot <- ggplot(data = train, aes_string(x = feat, y = 'charges', group = 'smoker', fill = 'smoker', col = 'smoker')) +
geom_jitter() +
geom_smooth(method = 'lm') +
ggtitle(glue::glue("Charges vs {feat}"))
print(plot)
}
smoker seems to have the highest impact on medical charges, although charges also grow with age, bmi, and children. Also, people who have more children generally smoke less.
Lastly, we have the correlations between features as follows.
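The plotting chunk isn’t shown; a sketch using GGally’s ggcorr(), with the factor columns converted to numeric so they can be included, might look like:
train %>%
  mutate(sex = as.numeric(sex), smoker = as.numeric(smoker), region = as.numeric(region)) %>%
  ggcorr(label = TRUE)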
As we can see, there is almost no correlation between the features, except between smoker and charges.
Metrics and Validation Strategy
We will use Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Root Mean Squared Logarithmic Error (RMSLE) as our metrics. Which of the three to emphasize depends on the business point of view. To see what we mean, let the true value of one observation of charges be $10,000. Assume the model predictions are exactly the same as the true values, except for this particular observation, which the model predicts as \(x\). We will vary \(x\) from $1,000 to $19,000 and look at the resulting error.
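For reference, with \(n\) observations, predictions \(\hat{y}_i\), and true values \(y_i\), the standard definitions are (implementations such as MLmetrics::RMSLE add 1 inside the logarithms to guard against zeros; the illustration below drops that offset since all values here are far above zero):
\[
MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad
RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}
\]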
true <- 10000
pred <- seq(from = 1000, to = 19000, length.out = 181)
x <- pred - true
rmse <- (x ^ 2) ^ 0.5
rmsle <- ((log(pred) - log(true)) ^ 2) ^ 0.5
par(mfrow = c(1, 2))
plot(x = x, y = rmse,
type = "l",
main = "Root Mean Squared Error",
xlab = "Error (prediction - actual)", ylab = "RMSE")
plot(x = x, y = rmsle,
type = "l",
main = "Root Mean Squared Logarithmic Error",
xlab = "Error (prediction - actual)", ylab = "RMSLE")
As we can see, RMSLE incurs a larger penalty for underestimating the actual value than for overestimating it. Also, RMSLE only considers the relative error between the predicted and the actual value, so the scale of the error is not significant; RMSE, on the other hand, increases in magnitude as the scale of the error increases. This means RMSLE should be more useful than RMSE when underestimation is undesirable.
MAE and RMSE are indifferent to the direction of errors. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful than MAE when large errors are particularly undesirable.
Knowing the metrics, we validate the performance of our models simply by applying them to the unseen data in test.csv and checking the metric scores. We don’t do k-fold cross-validation since the dataset is small.
Modeling
We will build and train a Linear Regression model for this problem. For starters, let’s use all available features in the model.
Linear Regression
Linear Regression will be implemented with automatic feature selection using backward elimination. Starting from all features, the backward elimination process iteratively discards some and re-evaluates the model until it finds the one with the lowest Akaike Information Criterion (AIC). Given a collection of models for the data, AIC estimates the quality of each model, relative to the others, based on information loss; a lower AIC means a better model. We’ll use the step() function to apply backward elimination in a greedy manner.
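The call that produced the trace below isn’t shown; a sketch would be:
model_all <- lm(charges ~ ., data = train)          # start from all features
lm_step <- step(model_all, direction = "backward")  # greedy backward elimination by AIC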
#> Start: AIC=18667.88
#> charges ~ age + sex + bmi + children + smoker + region
#>
#> Df Sum of Sq RSS AIC
#> - region 3 136779757 40475571916 18666
#> - sex 1 43071 40338835230 18666
#> <none> 40338792158 18668
#> - children 1 294681919 40633474078 18674
#> - bmi 1 4133255306 44472047465 18770
#> - age 1 13330028969 53668821128 18971
#> - smoker 1 95616334451 135955126610 19965
#>
#> Step: AIC=18665.5
#> charges ~ age + sex + bmi + children + smoker
#>
#> Df Sum of Sq RSS AIC
#> - sex 1 120916 40475692832 18664
#> <none> 40475571916 18666
#> - children 1 288132465 40763704381 18671
#> - bmi 1 4147134564 44622706480 18768
#> - age 1 13500662196 53976234111 18971
#> - smoker 1 96273590176 136749162092 19965
#>
#> Step: AIC=18663.5
#> charges ~ age + bmi + children + smoker
#>
#> Df Sum of Sq RSS AIC
#> <none> 40475692832 18664
#> - children 1 288027795 40763720627 18669
#> - bmi 1 4151671360 44627364191 18766
#> - age 1 13508639838 53984332670 18969
#> - smoker 1 96513856276 136989549108 19965
#>
#> Call:
#> lm(formula = charges ~ age + bmi + children + smoker, data = train)
#>
#> Coefficients:
#> (Intercept) age bmi children smokeryes
#> -11905.1 254.9 320.6 429.9 23586.1
Save the best model as lm_all, then predict. After that, calculate the metrics.
lm_all <- lm(formula = charges ~ age + bmi + children + smoker, data = train)
y_pred <- predict(lm_all, test)
mae <- MAE(y_pred, test$charges)
rmse <- RMSE(y_pred, test$charges)
To calculate RMSLE, note that both the predicted and the actual values must be positive; otherwise the logarithm produces NaN. As we check below, there is indeed a negative value in the predictions, at observation 149. A common way to overcome this problem is to take the log() of charges before training the model, then convert the predictions back using exp(). However, since we trained the Linear Regression model without log-transforming charges, we will stick with this model. The observation with a negative prediction will simply be discarded from the RMSLE calculation.
#> 149
#> -87.05982
rmsle <- RMSLE(y_pred[-149], test$charges[-149])
lin_reg <- cbind("MAE" = mae, "RMSE" = rmse, "RMSLE" = rmsle)
lin_reg
#> MAE RMSE RMSLE
#> [1,] 3941.464 5672.102 0.5455373
We obtain the errors above. For MAE and RMSE, we can interpret them as: on average, the model’s predictions of charges differ from the true values by $3,941 and $5,672, respectively. However, RMSLE can no longer be interpreted in terms of charges. Please note that RMSE is always greater than or equal to MAE; in fact, we have the inequality \(MAE \leq RMSE \leq \sqrt{n} \times MAE\).
Polynomial Regression
We can improve our model through feature engineering, specifically by making new features that capture the interactions between existing features. This is called polynomial regression. The idea is to generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form \([a, b]\), the degree-2 polynomial features are \([1, a, b, a^2, ab, b^2]\). We will use degree 2.
We don’t want charges to be included when generating the polynomial combinations, so we take charges out of train and test and save it as y_train and y_test, respectively.
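The extraction itself is a one-liner per dataset (not shown above):
y_train <- train$charges
y_test <- test$charges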
From the EDA we know that sex and region have no correlation with charges, so we can drop them. Also, since polynomial combinations don’t make sense for categorical features, we convert smoker to numeric.
X_train <- train %>%
select(-c(charges, sex, region)) %>%
mutate(smoker = as.numeric(smoker))
X_test <- test %>%
select(-c(charges, sex, region)) %>%
mutate(smoker = as.numeric(smoker))
We use the formula below to generate the polynomial combinations.
formula <- as.formula(
paste(
' ~ .^2 + ',
paste('poly(', colnames(X_train), ', 2, raw=TRUE)[, 2]', collapse = ' + ')
)
)
formula
#> ~.^2 + poly(age, 2, raw = TRUE)[, 2] + poly(bmi, 2, raw = TRUE)[,
#> 2] + poly(children, 2, raw = TRUE)[, 2] + poly(smoker, 2,
#> raw = TRUE)[, 2]
Then, we insert y_train and y_test back into the new datasets.
train_poly <- as.data.frame(model.matrix(formula, data = X_train))
test_poly <- as.data.frame(model.matrix(formula, data = X_test))
train_poly$charges <- y_train
test_poly$charges <- y_test
colnames(train_poly)
#> [1] "(Intercept)" "age"
#> [3] "bmi" "children"
#> [5] "smoker" "poly(age, 2, raw = TRUE)[, 2]"
#> [7] "poly(bmi, 2, raw = TRUE)[, 2]" "poly(children, 2, raw = TRUE)[, 2]"
#> [9] "poly(smoker, 2, raw = TRUE)[, 2]" "age:bmi"
#> [11] "age:children" "age:smoker"
#> [13] "bmi:children" "bmi:smoker"
#> [15] "children:smoker" "charges"
We can see that our new datasets train_poly and test_poly now have 16 columns:
- \((Intercept)\) is a column consisting of the constant 1; this is the constant term in the polynomial.
- \(age, bmi, children, smoker\) are the original features.
- \(age^2, bmi^2, children^2, smoker^2\) are the square of the original features.
- \(age \times bmi, age \times children, age \times smoker, bmi \times children, bmi \times smoker, children \times smoker\) are the six pairwise interactions between the four original features.
- \(charges\) is the target feature.
Now we are ready to build the model. As usual, we start with all features and work our way down using backward elimination.
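As before, the chunk isn’t shown; a sketch mirroring the earlier backward elimination, this time on train_poly:
model_poly_all <- lm(charges ~ ., data = train_poly)          # all polynomial terms
lm_step_poly <- step(model_poly_all, direction = "backward")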
#> Start: AIC=18196.7
#> charges ~ `(Intercept)` + age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `poly(smoker, 2, raw = TRUE)[, 2]` + `age:bmi` + `age:children` +
#> `age:smoker` + `bmi:children` + `bmi:smoker` + `children:smoker`
#>
#>
#> Step: AIC=18196.7
#> charges ~ `(Intercept)` + age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `age:bmi` + `age:children` + `age:smoker` + `bmi:children` +
#> `bmi:smoker` + `children:smoker`
#>
#>
#> Step: AIC=18196.7
#> charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `age:bmi` + `age:children` + `age:smoker` + `bmi:children` +
#> `bmi:smoker` + `children:smoker`
#>
#> Df Sum of Sq RSS AIC
#> - `age:smoker` 1 638029 25718637159 18195
#> - `bmi:children` 1 817883 25718817013 18195
#> - age 1 5612815 25723611945 18195
#> - `age:children` 1 9673381 25727672511 18195
#> - `age:bmi` 1 26766877 25744766007 18196
#> - `children:smoker` 1 45654901 25763654030 18197
#> - children 1 47711668 25765710798 18197
#> <none> 25717999130 18197
#> - `poly(children, 2, raw = TRUE)[, 2]` 1 63400816 25781399946 18197
#> - `poly(bmi, 2, raw = TRUE)[, 2]` 1 257208016 25975207146 18205
#> - `poly(age, 2, raw = TRUE)[, 2]` 1 258018286 25976017416 18205
#> - bmi 1 446640632 26164639762 18213
#> - smoker 1 2015205306 27733204436 18275
#> - `bmi:smoker` 1 13705549198 39423548327 18651
#>
#> Step: AIC=18194.73
#> charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `age:bmi` + `age:children` + `bmi:children` + `bmi:smoker` +
#> `children:smoker`
#>
#> Df Sum of Sq RSS AIC
#> - `bmi:children` 1 828285 25719465444 18193
#> - age 1 4976641 25723613800 18193
#> - `age:children` 1 9825302 25728462461 18193
#> - `age:bmi` 1 26787718 25745424877 18194
#> - `children:smoker` 1 45119820 25763756979 18195
#> - children 1 47286634 25765923793 18195
#> <none> 25718637159 18195
#> - `poly(children, 2, raw = TRUE)[, 2]` 1 63295487 25781932646 18195
#> - `poly(bmi, 2, raw = TRUE)[, 2]` 1 257419441 25976056600 18203
#> - `poly(age, 2, raw = TRUE)[, 2]` 1 257631866 25976269025 18203
#> - bmi 1 447419331 26166056489 18211
#> - smoker 1 2445531784 28164168943 18290
#> - `bmi:smoker` 1 13756292505 39474929664 18651
#>
#> Step: AIC=18192.76
#> charges ~ age + bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `age:bmi` + `age:children` + `bmi:smoker` + `children:smoker`
#>
#> Df Sum of Sq RSS AIC
#> - age 1 5196666 25724662109 18191
#> - `age:children` 1 11062155 25730527598 18191
#> - `age:bmi` 1 27346149 25746811593 18192
#> - `children:smoker` 1 46443285 25765908729 18193
#> <none> 25719465444 18193
#> - `poly(children, 2, raw = TRUE)[, 2]` 1 63858086 25783323530 18193
#> - children 1 101526014 25820991458 18195
#> - `poly(bmi, 2, raw = TRUE)[, 2]` 1 256592182 25976057625 18201
#> - `poly(age, 2, raw = TRUE)[, 2]` 1 258013469 25977478912 18201
#> - bmi 1 446982085 26166447529 18209
#> - smoker 1 2466700190 28186165633 18289
#> - `bmi:smoker` 1 13890308278 39609773722 18652
#>
#> Step: AIC=18190.98
#> charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `age:bmi` + `age:children` + `bmi:smoker` + `children:smoker`
#>
#> Df Sum of Sq RSS AIC
#> - `age:children` 1 10800698 25735462807 18189
#> - `age:bmi` 1 22525274 25747187384 18190
#> - `children:smoker` 1 46333040 25770995149 18191
#> <none> 25724662109 18191
#> - `poly(children, 2, raw = TRUE)[, 2]` 1 59835331 25784497440 18192
#> - children 1 97931108 25822593218 18193
#> - `poly(bmi, 2, raw = TRUE)[, 2]` 1 252131105 25976793215 18199
#> - bmi 1 441828081 26166490190 18207
#> - `poly(age, 2, raw = TRUE)[, 2]` 1 527300413 26251962522 18211
#> - smoker 1 2465304613 28189966723 18287
#> - `bmi:smoker` 1 13885998170 39610660279 18650
#>
#> Step: AIC=18189.43
#> charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `age:bmi` + `bmi:smoker` + `children:smoker`
#>
#> Df Sum of Sq RSS AIC
#> - `age:bmi` 1 24487435 25759950242 18188
#> <none> 25735462807 18189
#> - `children:smoker` 1 49159638 25784622445 18190
#> - `poly(children, 2, raw = TRUE)[, 2]` 1 64458233 25799921040 18190
#> - children 1 235234433 25970697239 18197
#> - `poly(bmi, 2, raw = TRUE)[, 2]` 1 255427192 25990889999 18198
#> - bmi 1 439409358 26174872164 18206
#> - `poly(age, 2, raw = TRUE)[, 2]` 1 548137618 26283600425 18210
#> - smoker 1 2461486940 28196949747 18285
#> - `bmi:smoker` 1 13883155020 39618617827 18649
#>
#> Step: AIC=18188.44
#> charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `bmi:smoker` + `children:smoker`
#>
#> Df Sum of Sq RSS AIC
#> <none> 25759950242 18188
#> - `children:smoker` 1 49343171 25809293413 18189
#> - `poly(children, 2, raw = TRUE)[, 2]` 1 75043200 25834993442 18190
#> - `poly(bmi, 2, raw = TRUE)[, 2]` 1 246287385 26006237626 18197
#> - children 1 257057142 26017007384 18197
#> - bmi 1 414955971 26174906212 18204
#> - smoker 1 2440592260 28200542501 18283
#> - `bmi:smoker` 1 13871440992 39631391233 18647
#> - `poly(age, 2, raw = TRUE)[, 2]` 1 14593600500 40353550742 18666
#>
#> Call:
#> lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `bmi:smoker` + `children:smoker`, data = train_poly)
#>
#> Coefficients:
#> (Intercept) bmi
#> 11544.691 -820.904
#> children smoker
#> 1711.975 -18804.065
#> `poly(age, 2, raw = TRUE)[, 2]` `poly(bmi, 2, raw = TRUE)[, 2]`
#> 3.319 -9.138
#> `poly(children, 2, raw = TRUE)[, 2]` `bmi:smoker`
#> -154.913 1405.110
#> `children:smoker`
#> -458.435
Save the best model as lm_poly, then predict. After that, calculate the metrics. Fortunately, this time there are no negative values in the predictions, so the RMSLE calculation can be applied directly.
lm_poly <- lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
`poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
`bmi:smoker` + `children:smoker`, data = train_poly)
y_pred <- predict(lm_poly, test_poly)
mae <- MAE(y_pred, test$charges)
rmse <- RMSE(y_pred, test$charges)
rmsle <- RMSLE(y_pred, test$charges)
poly_reg <- cbind("MAE" = mae, "RMSE" = rmse, "RMSLE" = rmsle)
poly_reg
#> MAE RMSE RMSLE
#> [1,] 2835.106 4327.179 0.3926167
Model Evaluation
Let’s see the summary of our original Linear Regression model.
#>
#> Call:
#> lm(formula = charges ~ age + bmi + children + smoker, data = train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11734 -2983 -1004 1356 29708
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -11905.11 1060.63 -11.225 < 0.0000000000000002 ***
#> age 254.87 13.52 18.844 < 0.0000000000000002 ***
#> bmi 320.64 30.69 10.447 < 0.0000000000000002 ***
#> children 429.86 156.22 2.752 0.00603 **
#> smokeryes 23586.13 468.26 50.370 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6168 on 1064 degrees of freedom
#> Multiple R-squared: 0.7359, Adjusted R-squared: 0.7349
#> F-statistic: 741.3 on 4 and 1064 DF, p-value: < 0.00000000000000022
We have four features, all of which have a significant effect on charges (a real effect, not one due to random chance and sampling). From the coefficients, we know that a non-smoking newborn with no children and a bmi of zero would be charged -$11,905 by health insurance (a scenario we know is impossible). Also, since smoker has the largest coefficient of all the features, a unit change in smoker produces a bigger change in charges than a unit change in any other feature, with all other features fixed. In this case, holding all other features fixed, a smoker is charged $23,586 more than a non-smoker, which makes sense.
This model also has an adjusted R-squared of 0.7349, which means the model and its features explain about 73% of the total variation in charges.
Now, let’s compare this to the new Polynomial Regression model.
#>
#> Call:
#> lm(formula = charges ~ bmi + children + smoker + `poly(age, 2, raw = TRUE)[, 2]` +
#> `poly(bmi, 2, raw = TRUE)[, 2]` + `poly(children, 2, raw = TRUE)[, 2]` +
#> `bmi:smoker` + `children:smoker`, data = train_poly)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11017.7 -1932.9 -1339.8 -559.8 29962.3
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 11544.6914 3699.3731 3.121
#> bmi -820.9038 198.6602 -4.132
#> children 1711.9751 526.3834 3.252
#> smoker -18804.0652 1876.3926 -10.021
#> `poly(age, 2, raw = TRUE)[, 2]` 3.3189 0.1354 24.505
#> `poly(bmi, 2, raw = TRUE)[, 2]` -9.1376 2.8703 -3.183
#> `poly(children, 2, raw = TRUE)[, 2]` -154.9126 88.1557 -1.757
#> `bmi:smoker` 1405.1099 58.8124 23.891
#> `children:smoker` -458.4348 321.7241 -1.425
#> Pr(>|t|)
#> (Intercept) 0.00185 **
#> bmi 0.0000388 ***
#> children 0.00118 **
#> smoker < 0.0000000000000002 ***
#> `poly(age, 2, raw = TRUE)[, 2]` < 0.0000000000000002 ***
#> `poly(bmi, 2, raw = TRUE)[, 2]` 0.00150 **
#> `poly(children, 2, raw = TRUE)[, 2]` 0.07916 .
#> `bmi:smoker` < 0.0000000000000002 ***
#> `children:smoker` 0.15447
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4930 on 1060 degrees of freedom
#> Multiple R-squared: 0.8319, Adjusted R-squared: 0.8307
#> F-statistic: 655.9 on 8 and 1060 DF, p-value: < 0.00000000000000022
We have eight features, most of which are significant for charges; the exceptions at the 5% level are \(children^2\) and children:smoker. From the coefficients, a non-smoking newborn with no children and a bmi of zero would be charged $11,545 by health insurance (again, an impossible scenario). Also, since smoker has the largest coefficient of all the features, a unit change in smoker produces a bigger change in charges than a unit change in any other feature, with all other features fixed. In this case, holding all other features fixed, a non-smoker would be charged $18,804 more than a smoker. Wait, what!? From this analysis, we learn that introducing new features through polynomial combinations can change the assumptions behind the model, and the naive interpretation of its coefficients may become misleading.
The adjusted R-squared of this model is 0.8307, which means the model and its features explain about 83% of the total variation in charges. In other words, the Polynomial Regression model captures more of the variance in charges than the earlier Linear Regression model.
But don’t get puffed up! There are several assumptions that need to be satisfied when using a Linear Regression model:
- Linearity of the Data
We need to make sure that there is a linear relationship between the predictors and the target variable. This can be checked visually by looking at the correlation between each predictor and the target variable.
cols <- c('age', 'children', 'bmi', 'smoker')
temp <- train %>%
select(cols) %>%
mutate(smoker = as.numeric(smoker))
temp$charges <- y_train
ggcorr(temp, label = T)
Another way is to use hypothesis testing with Pearson’s product-moment correlation.
- H0: the predictor does not correlate with charges
- H1: the predictor correlates with charges
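The corresponding chunk for the Linear Regression predictors isn’t shown; a sketch reusing the temp data frame built above, analogous to the loop used later for the polynomial model (the printed p-values presumably follow the order of cols):
for (col in cols) {
  cor <- cor.test(temp[, col], temp$charges)  # Pearson correlation test against charges
  print(round(cor$p.value, 4))
}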
#> [1] 0
#> [1] 0.0062
#> [1] 0
#> [1] 0
Since the p-value for each predictor-target pair is below alpha (0.05), we reject H0. We can safely say that the predictors correlate with the target variable.
Now, for the Polynomial Regression.
cols <- c('bmi', 'children', 'smoker', 'poly(age, 2, raw = TRUE)[, 2]',
'poly(bmi, 2, raw = TRUE)[, 2]', 'poly(children, 2, raw = TRUE)[, 2]',
'bmi:smoker', 'children:smoker')
ggcorr(train_poly %>% select(c(cols, 'charges')), hjust = 1, layout.exp = 2, label = T)
for (col in cols) {
cor <- cor.test(train_poly[, col], train_poly$charges)
print(round(cor$p.value, 4))
}
#> [1] 0
#> [1] 0.0062
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0.1327
#> [1] 0
#> [1] 0
We can say that the predictors correlate with the target variable, except for \(children^2\).
- Normality of Residuals
The residuals of a Linear Regression model should be normally distributed around zero. To check this, we can plot a histogram of the residuals.
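A minimal sketch of that plot, assuming lm_all holds the fitted linear model:
hist(lm_all$residuals, breaks = 30,
     main = "Histogram of Residuals", xlab = "Residual")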
Another way is to apply the Shapiro-Wilk test to the residuals.
- H0: Residuals are normally distributed
- H1: Residuals are not normally distributed
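The call itself is presumably:
shapiro.test(lm_all$residuals)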
#>
#> Shapiro-Wilk normality test
#>
#> data: lm_all$residuals
#> W = 0.89481, p-value < 0.00000000000000022
Since the p-value is below alpha (0.05), we reject H0. Hence, the residuals are not normally distributed.
Now, for the Polynomial Regression.
#>
#> Shapiro-Wilk normality test
#>
#> data: lm_poly$residuals
#> W = 0.65297, p-value < 0.00000000000000022
The same applies here: the residuals are not normally distributed.
- Homoscedasticity
Heteroscedasticity is a condition where the variability of a variable is unequal across its range of values. In a Linear Regression model, if the error variance changes across the range of the target variable, heteroscedasticity is present, which implies a non-random pattern in the residuals. We can check this visually by plotting fitted values against residuals.
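A sketch of that diagnostic plot for the linear model:
plot(lm_all$fitted.values, lm_all$residuals,
     main = "Fitted Values vs Residuals",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")  # reference line at zero residual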
Another way is to use the Breusch-Pagan test.
- H0: Homoscedasticity
- H1: Heteroscedasticity
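The test comes from the lmtest package; the output below presumably corresponds to:
bptest(lm_all)   # studentized Breusch-Pagan test on the linear model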
#>
#> studentized Breusch-Pagan test
#>
#> data: lm_all
#> BP = 89.206, df = 4, p-value < 0.00000000000000022
Since the p-value is lower than alpha (0.05), we reject H0. This means the residuals are heteroscedastic.
Now, for the Polynomial Regression.
#>
#> studentized Breusch-Pagan test
#>
#> data: lm_poly
#> BP = 13.059, df = 8, p-value = 0.1098
Since the p-value is higher than alpha (0.05), we fail to reject H0. This means the residuals satisfy the homoscedasticity assumption.
- No multicollinearity between predictors
One statistical tool to assess multicollinearity is the Variance Inflation Factor (VIF). Put simply, VIF measures how much the variance of a coefficient is inflated by multicollinearity among the predictors in our model. A VIF below 10 means no serious multicollinearity occurred among the predictors.
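The VIF values below are presumably computed with vif() from the car package, once for each model:
vif(lm_all)    # linear regression model
vif(lm_poly)   # polynomial regression model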
#> age bmi children smoker
#> 1.018303 1.013178 1.004060 1.003712
#> bmi children
#> 66.442833 17.844408
#> smoker `poly(age, 2, raw = TRUE)[, 2]`
#> 25.228626 1.020981
#> `poly(bmi, 2, raw = TRUE)[, 2]` `poly(children, 2, raw = TRUE)[, 2]`
#> 56.740085 6.702367
#> `bmi:smoker` `children:smoker`
#> 32.271971 11.527064
No multicollinearity is found in the Linear Regression model, but plenty is found in the Polynomial Regression model. This makes sense, since some features in the Polynomial Regression were created by multiplying features from the Linear Regression model.
Conclusion
result <- rbind(lin_reg, poly_reg)
rownames(result) <- c("Linear Regression", "Polynomial Regression")
result
#> MAE RMSE RMSLE
#> Linear Regression 3941.464 5672.102 0.5455373
#> Polynomial Regression 2835.106 4327.179 0.3926167
For a simple model like Linear Regression, feature engineering plays an important role in improving performance. In this project, we applied this technique by making polynomial combinations of the features with degree 2. The model improves significantly, reaching MAE 2835, RMSE 4327, and RMSLE 0.39. However, some Linear Regression assumptions may break down in the process. Also, smoking is not good for your wallet!!