This dataset comes from the book Machine Learning with R by Brett Lantz. The data contains medical information and costs billed by health insurance companies. There are 1338 observations and 7 variables in this dataset:
age: age of primary beneficiary
sex: insurance contractor gender, female, male
bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9
children: Number of children covered by health insurance / Number of dependents
smoker: Smoking
region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance
There are 4 quantitative and 3 categorical variables that each describe a certain feature of the contractor. Below are the first five observations of the data.
# A tibble: 6 x 7
age sex bmi children smoker region charges
<dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 19 female 27.9 0 yes southwest 16885.
2 18 male 33.8 1 no southeast 1726.
3 28 male 33 3 no southeast 4449.
4 33 male 22.7 0 no northwest 21984.
5 32 male 28.9 0 no northwest 3867.
6 31 female 25.7 0 no southeast 3757.
One question of interest is whether insurance costs can be determined from individual features. In our dataset, the insurance costs are the charges variable. It is a continuous variable in terms of dollars. age, bmi and children are numerical features while sex, smoker (whether this person smokes or not) and region are categorical features.
Below is a summary of the basic statistics for our variables. Looking at the response variable, the minimum value is 1122 while the maximum value is 63770. Most points cluster between 4740 and 16640. This large variance in the response variable indicates that there are potential outliers. The other quantitative variables are reasonably varied.
age sex bmi children
Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
Median :39.00 Mode :character Median :30.40 Median :1.000
Mean :39.21 Mean :30.66 Mean :1.095
3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
Max. :64.00 Max. :53.13 Max. :5.000
smoker region charges
Length:1338 Length:1338 Min. : 1122
Class :character Class :character 1st Qu.: 4740
Mode :character Mode :character Median : 9382
Mean :13270
3rd Qu.:16640
Max. :63770
The Children variable seems to be a categorical variable because there are only a few values in the variable (1-5). Below is the frequency distribution of children. It confirms that children should be treated as a categorical variable even though it is numerical.
# A tibble: 6 x 2
children count
<dbl> <int>
1 0 574
2 1 324
3 2 240
4 3 157
5 4 25
6 5 18
Also from the plot above, age shows a linear relationship with our response, charges, albeit in clusters. bmi also shows a somewhat linear relationship with charges. In addition, bmi and age displays a certain level of correlation. The interaction of these two variables may play an important role in our model.
The plot above shows the boxplot of variable sex for insurance costs. The median costs for both sexes are pretty equal though there is more variance in insurnace costs for male.
There’s a clear trend here. Smokers have a much higher median insurance costs in comparison with non-smokers.
There’s not a clear trend for variable region in relation with insurance costs. The insurance costs decreases slightly from east to west, however.
The median insurance costs start high for contractors with zero children then goes down for 1 children contractors. The median costs keep increasing but then decreases when a contractor has 5 children. This could be due to the insurance companies policy to start with a high default cost. They give discount for contractors with children at a small rate then give really high discount for contractors with more than 5 children. One thing to note is that the boxplots show there are many outliers in our categorical variables. The outliers have the potential to influence the model so we’ll come back to address this issue if necessary.
One interesting question is whether the insurance costs can be determined by individual features. I’ll start with a full model that includes all the variables in the data. The full model equation is below.
\(charges = \beta_{0} + \beta_{1}age + \beta_{2}bmi + \beta_{3}sex + \beta_{4}children + \beta_{5}region + \beta_{6}smoker\)
Before modeling, we need to develop a rigorous approach for model selection.There are many methods for model selection such as AIC, BIC and cross validation, etc. From my own research AIC and cross validation strive to ahieve the same thing in different ways. Since the priority is the predictive power of the model, cross validation is chosen and the matrics are R-squared and RMSE. The greater the R-squared and the smaller the RMSE values, the better the predictive performance of our model is.
The data will be split into train and test sets using a 70 / 30 ratio. Cross validation is performed on the train set (70/30), then the test set will be used to assess the performance of the model.
Call:
lm(formula = charges ~ . - children - charges, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-14039.8 -2517.8 -822.7 1546.6 29337.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9101.54 1161.80 -7.834 1.30e-14 ***
age 241.16 13.83 17.436 < 2e-16 ***
sexmale -13.17 386.42 -0.034 0.97282
bmi 274.86 33.01 8.327 2.98e-16 ***
smokeryes 24392.10 481.84 50.623 < 2e-16 ***
regionnorthwest -838.04 560.75 -1.495 0.13538
regionsoutheast -1218.66 550.66 -2.213 0.02714 *
regionsouthwest -1493.82 555.91 -2.687 0.00734 **
children_fct1 -330.82 485.87 -0.681 0.49611
children_fct2 2208.49 549.32 4.020 6.28e-05 ***
children_fct3 622.69 635.66 0.980 0.32755
children_fct4 2181.29 1411.28 1.546 0.12254
children_fct5 1046.76 1547.26 0.677 0.49888
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5845 on 923 degrees of freedom
Multiple R-squared: 0.7696, Adjusted R-squared: 0.7666
F-statistic: 257 on 12 and 923 DF, p-value: < 2.2e-16
The diagnostic plots show a lot of problems with our model. The residual vs fitted values plot show that our residuals are not linear as the residuals are clustered in groups. In addition, the QQ plot shows that our residuals are not approximately normally distributed since the values stray from the diagonal line as it goes further to the right. The scale-location plot also shows non-constant variance. The residuals get bigger as the fitted values get bigger. There are outliers (Some residuals are 4 standard deviations away from 0) in our model though no leverage or high influential points as demonstrated by no visible Cook’s distance line.
The summary confirms our earlier analysis. Sex does not play an important role in determining the insurance costs of a contractor as shown by their large p-values. On the other hand, age, bmi and smoker are important factors. Region and children, however, show statistical significance on some levels.
From the initial diagnosis, the linear regression model built above violates the linearity, normal distribution and constant variance assumptions. A few ways to address these problems are presented below. ##### Box-Cox transformation
Box-Cox transformation is one way to address non linearity nad non-normality. The transformation is applied to both the response and predictors where appropriate. The idea is to find a lambda value that maximizes the log-likelihood of the data. The transformation formula is below: \(y_{i}^\lambda = \frac{y_{i}^\lambda - 1}{\lambda} if ~\lambda \neq 0~else~log(y_{i})\)
Call:
lm(formula = charges_box ~ . - children - charges - charges_box,
data = train_data1)
Residuals:
Min 1Q Median 3Q Max
-4.6542 -1.1396 -0.3081 0.3638 11.4065
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.57832 0.45611 29.770 < 2e-16 ***
age 0.16404 0.00543 30.211 < 2e-16 ***
sexmale -0.19211 0.15171 -1.266 0.205720
bmi 0.06850 0.01296 5.286 1.56e-07 ***
smokeryes 8.77037 0.18917 46.363 < 2e-16 ***
regionnorthwest -0.56005 0.22014 -2.544 0.011121 *
regionsoutheast -0.97345 0.21618 -4.503 7.56e-06 ***
regionsouthwest -0.92839 0.21824 -4.254 2.31e-05 ***
children_fct1 0.31220 0.19075 1.637 0.102026
children_fct2 1.37302 0.21566 6.367 3.04e-10 ***
children_fct3 0.96183 0.24955 3.854 0.000124 ***
children_fct4 2.29867 0.55405 4.149 3.65e-05 ***
children_fct5 1.76913 0.60744 2.912 0.003673 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.295 on 923 degrees of freedom
Multiple R-squared: 0.7796, Adjusted R-squared: 0.7767
F-statistic: 272 on 12 and 923 DF, p-value: < 2.2e-16
The diagnostic plots again show improvements in our model. Residuals vs Fitted plot shows a curved pattern though the red line is much aligned with the 0 line. The residuals are also scattered around the 0 line more symmetrically. The QQ plot still shows non-normality while the scale-location plot displays improvement but the variance still increases as fitted values get bigger..
The summary shows an improvement from the non-transformed model as demonstrated by a jump of R-squared adjusted values from 0.7653 to 0.7868. Aside from age, smoker, bmi, most other variables have also become statistically significant (p-values < 0.05).
In the previous analysis, the scatter plots of age vs charges show non-linearity pattern as age is clustered in 3 groups. It is reasonable to apply a transformation to age.
Call:
lm(formula = charges_box ~ . - children - age + age_box - charges -
charges_box, data = train_data2)
Residuals:
Min 1Q Median 3Q Max
-4.4951 -1.1967 -0.3689 0.4387 11.7920
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.20892 0.64299 6.546 9.80e-11 ***
sexmale -0.19321 0.15233 -1.268 0.204980
bmi 0.07318 0.01300 5.631 2.38e-08 ***
smokeryes 8.76425 0.18994 46.142 < 2e-16 ***
regionnorthwest -0.53226 0.22104 -2.408 0.016237 *
regionsoutheast -0.95742 0.21709 -4.410 1.15e-05 ***
regionsouthwest -0.90991 0.21915 -4.152 3.60e-05 ***
children_fct1 0.09672 0.19194 0.504 0.614456
children_fct2 1.11429 0.21693 5.137 3.41e-07 ***
children_fct3 0.77221 0.25124 3.074 0.002177 **
children_fct4 2.05264 0.55623 3.690 0.000237 ***
children_fct5 1.52497 0.60972 2.501 0.012553 *
age_box 3.08870 0.10309 29.962 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.304 on 923 degrees of freedom
Multiple R-squared: 0.7778, Adjusted R-squared: 0.7749
F-statistic: 269.2 on 12 and 923 DF, p-value: < 2.2e-16
The summary and diagnostic plots show similar results as compared to the model with only the response variable transformed. In order to ensure the models do not overfit. Cross validation will be run on 3 models to assess their performances.
The table below shows the results of three models using cross validation. Out of three models, the untransformed model perferms the best with the biggest R2 and smallest RMSE . Clearly, our transformation overfits the data.
full fullbox fullboxwage
R2 0.7007255 0.6829274 0.6837907
RMSE 6651.9781793 7085.1301643 7035.0548614
As discussed above, the non constant variance assumption was violated. One way to address this issue is to put more weights on the non-outliers. Below is fitted model with weighted least square. From now on, when referring to the base model, the model at hand is the first model (non transformed one)
Call:
lm(formula = charges ~ . - children - charges, data = train_data,
weights = 1/sighat)
Weighted Residuals:
Min 1Q Median 3Q Max
-262.27 -22.83 -6.03 18.63 444.02
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4726.487 598.979 -7.891 8.47e-15 ***
age 255.355 7.492 34.083 < 2e-16 ***
sexmale -256.407 201.707 -1.271 0.203983
bmi 71.499 17.100 4.181 3.17e-05 ***
smokeryes 22185.953 492.113 45.083 < 2e-16 ***
regionnorthwest -306.554 294.723 -1.040 0.298545
regionsoutheast -725.664 296.706 -2.446 0.014642 *
regionsouthwest -790.717 290.835 -2.719 0.006675 **
children_fct1 120.244 250.114 0.481 0.630804
children_fct2 1007.446 296.403 3.399 0.000706 ***
children_fct3 1184.519 341.607 3.467 0.000550 ***
children_fct4 2119.571 750.886 2.823 0.004864 **
children_fct5 2140.908 714.089 2.998 0.002790 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 75.77 on 923 degrees of freedom
Multiple R-squared: 0.7765, Adjusted R-squared: 0.7736
F-statistic: 267.2 on 12 and 923 DF, p-value: < 2.2e-16
The model diagnostic plots and summary are pretty similar to the base. The adjusted R-squared is a little bit higher than the non-WLS one. Below is the cv results of the WLS model.
full wlm4_cv
R2 0.7007255 0.6869969
RMSE 6651.9781793 6928.6263306
The base model has higher R2 score and lower RMSE scores which signals the base one is still the best model so far.
From the scatterplots, age displays behaviour that are not strictly linear. It follows that improvements could be made by adding non-linear terms to the model.
full age2_cv
R2 0.7007255 0.7045399
RMSE 6651.9781793 6610.7094775
This time the model with non-linear term in age performs slightly better than the base model. Next, variables that are not statistically significant in the base model will be removed to see if improvements could be made.
full no_sex_cv no_chld_cv no_reg_cv red_cv
R2 0.7007255 0.7007007 0.7069823 0.7029727 0.7084115
RMSE 6651.9781793 6652.2254304 6576.5634766 6624.1697360 6558.9003454
From left to right, respectively, the models are base one, model without sex variable, model without children variable, model without region variable and model without sex, children and region variables. The best model is the one without region variable as it has the biggest R2 and smallest RMSE. Although the model without the region variable only performs the best, the model without sex, children and region performs the second best. Their results are only slightly different. Since parsimonious model is the goal of building a model. The red_cv model should be chosen as the next base model. We know from previous analysis that adding non-linear terms led to better performance. Below shows the cv results when removing region variable from the non-linear model.
full red_cv age2_noreg_cv
R2 0.7007255 0.7084115 0.7102093
RMSE 6651.9781793 6558.9003454 6538.0669398
Although the performance is lightly better for the model with non-linear term, the base model is still preferred as it is most parsimonious and performs almost as well as the one with non-linear term. Next is the results of a model fitted with interactive terms.
full age2_noreg_cv age2_inter_cv
R2 0.7007255 0.7102093 0.8285419
RMSE 6651.9781793 6538.0669398 5028.3938277
The model performance improves significantly in comparison with the base models. R-square value jumps from ~0.71 to ~0.83 while RMSE decreases from ~6538 to ~5028. Below is the model summary.
Call:
lm(formula = charges ~ age * bmi + smoker * age + age * smoker +
bmi * smoker, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-10065.9 -1977.7 -1284.7 -212.9 29144.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1196.868 2442.144 0.490 0.62419
age 175.578 60.177 2.918 0.00361 **
bmi -92.211 77.769 -1.186 0.23605
smokeryes -20535.842 2412.306 -8.513 < 2e-16 ***
age:bmi 2.515 1.882 1.337 0.18170
age:smokeryes 11.788 28.430 0.415 0.67851
bmi:smokeryes 1440.046 68.183 21.120 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4862 on 929 degrees of freedom
Multiple R-squared: 0.8396, Adjusted R-squared: 0.8385
F-statistic: 810.3 on 6 and 929 DF, p-value: < 2.2e-16
The diagnostic plots show improvements accross assumptions for our model, though these improvements still violate the assumptions.
Once the interactive terms were added to the model, most predictors become statistically insignificant. Smoking, age and the interaction between bmi and smoking statistically significant predictors. Below is the model with smoker, age, bmi and the interaction between bmi and smoker as the predictors. One thing to note, however, is that the coeefficients for variable smokeryes change from positive to negative from the full model to the best model.
Call:
lm(formula = charges ~ age + smoker + bmi + smoker * bmi, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-10129.3 -1934.4 -1291.7 -228.4 29189.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1861.140 973.032 -1.913 0.0561 .
age 256.574 11.451 22.407 <2e-16 ***
smokeryes -20117.464 2145.951 -9.375 <2e-16 ***
bmi 3.443 29.124 0.118 0.9059
smokeryes:bmi 1440.427 68.180 21.127 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4862 on 931 degrees of freedom
Multiple R-squared: 0.8392, Adjusted R-squared: 0.8385
F-statistic: 1215 on 4 and 931 DF, p-value: < 2.2e-16
R2 RMSE
0.5364007 8336.4479265
This model clearly overfits the data as the adjusted R-squared on train data is much higher than on the test data.
In this analysis, a linear model with all the variables was fitted on the insurance data. The model shows a lot of violations of linear regression assumptions. Non-linearity, non-norammlly distributed
A few approaches were taken to address the issue. Box-Cox transformations were performed on both the response and age variables. The diagnostic plots show no apparent fix of the non-linear assumption. Weighted-Least square method was used to fit on the data to address non-constant variance. The diagnostic plots show an improvement over the base one. However, the performance of the model decreases as showcased by smaller R-squared and RMSE values for both methods.
Once statistically insignificant predictors were removed, there was a big increase in the model’s performance. The best models shows that age, smoking and and the interaction between bmi and smoking are significant in determining the insurance costs. Removing the interaction terms between age:bmi and age:smoking, nontheless, worsens the model’s performance.