Data 605 Discussion 13

library(knitr)
library(rmdformats)

## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=35)

library(utils)
library(skimr)

Problem Statement

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term,bone dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

After an extensive search, I finally settled on using a Kaggle dataset Medical Cost Personal Datasets (link). I think this is the most appropriate dataset for this exercise instead of some of the auto insurance datasets that did pique my interest in the beginning but did not have the type of response variable I wanted.

Data Exploration

setwd("/Users/dpong/Data 605/Week 13/discussion/Data 605 Discussion 13")
insurance <- read.csv("insurance.csv")
skim(insurance)
Data summary
Name insurance
Number of rows 1338
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 mal: 676, fem: 662
smoker 0 1 FALSE 2 no: 1064, yes: 274
region 0 1 FALSE 4 sou: 364, nor: 325, sou: 325, nor: 324

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 39.21 14.05 18.00 27.00 39.00 51.00 64.00 ▇▅▅▆▆
bmi 0 1 30.66 6.10 15.96 26.30 30.40 34.69 53.13 ▂▇▇▂▁
children 0 1 1.09 1.21 0.00 0.00 1.00 2.00 5.00 ▇▂▂▁▁
charges 0 1 13270.42 12110.01 1121.87 4740.29 9382.03 16639.91 63770.43 ▇▂▁▁▁
head(insurance, n = 10)
   age    sex    bmi children smoker    region   charges
1   19 female 27.900        0    yes southwest 16884.924
2   18   male 33.770        1     no southeast  1725.552
3   28   male 33.000        3     no southeast  4449.462
4   33   male 22.705        0     no northwest 21984.471
5   32   male 28.880        0     no northwest  3866.855
6   31 female 25.740        0     no southeast  3756.622
7   46 female 33.440        1     no southeast  8240.590
8   37 female 27.740        3     no northwest  7281.506
9   37   male 29.830        2     no northeast  6406.411
10  60 female 25.840        0     no northwest 28923.137
par(mfrow = c(1, 2))
hist(insurance$bmi, xlab = "BMI (Body Mass Index)", main = "Distribution of BMI")
hist(insurance$charges, xlab = "Medical Charges", main = "Distribution of Medical Charges")

We see that BMI is distributed normally whereas Medical Charges is right-skewed.

Let’s plot charges against smoker, sex, and regions respectively.

par(mfrow = c(1, 3))
with(insurance, plot(charges ~ smoker + sex + region))

Smokers shown to be having a higher median costs in medical charges than non-smokers.

For the medical charges, the medians are about the same at 10000 for all 4 regions and genders.

Fitting a linear regression model

Dependent variable

  • charges (numerical, continuous)

Independent Variables

  • children (numerical, discrete)

  • sex (categorical)

  • bmi (numerical, continuous)

  • age (numerical, discrete)

  • smoker (categorical)

  • region (categorical)

    \(\widehat{charges} = \beta_{0} + \beta_{1} \times children + \beta_{2} \times sex + \beta_{3} \times bmi + \beta_{4} \times age + \beta_{5} \times smoker + \beta_{6} \times region\)

insur.lm <- lm(charges ~ children + sex + bmi + age + smoker + region, data = insurance)
summary(insur.lm)

Call:
lm(formula = charges ~ children + sex + bmi + age + smoker + 
    region, data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11304.9  -2848.1   -982.1   1393.9  29992.8 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
children           475.5      137.8   3.451 0.000577 ***
sexmale           -131.3      332.9  -0.394 0.693348    
bmi                339.2       28.6  11.860  < 2e-16 ***
age                256.9       11.9  21.587  < 2e-16 ***
smokeryes        23848.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -353.0      476.3  -0.741 0.458769    
regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494 
F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

So this simple multiple linear regression has the following performance metrics.

  • Adj. R-Squared of .75 - pretty good considering it’s without any transformations

  • F Statistic of 501 with a p-value of < 2.2e-16, which is an indication that we have a decently-fit model.

  • There are 2 coefficients that we can dispose as they are not statistically significant in our prediction for medical charges ($charges). They are sexmale and regionnorthwest. But since the variables regionsoutheast and regionsouthwest are stat sig., we’ll go ahead and include the variable region going forward and drop the single variable in sex.

insur.lm1 <- lm(charges ~ children + bmi + age + smoker + region, data = insurance)
summary(insur.lm1)

Call:
lm(formula = charges ~ children + bmi + age + smoker + region, 
    data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11367.2  -2835.4   -979.7   1361.9  29935.5 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
children           474.57     137.74   3.445 0.000588 ***
bmi                338.66      28.56  11.858  < 2e-16 ***
age                256.97      11.89  21.610  < 2e-16 ***
smokeryes        23836.30     411.86  57.875  < 2e-16 ***
regionnorthwest   -352.18     476.12  -0.740 0.459618    
regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6060 on 1330 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7496 
F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
  • Adj. R-Squared of .75

  • F Statistic of 573 with a p-value of < 2.2e-16

Fit a model that has an interaction term and a quadratic term

  • adding bmi*region and bmi*smoker to the linear model.

  • adding a quadratic term in bmi

insur.lm2 <- lm(charges ~ children + bmi + age + smoker + region + bmi * smoker + 
    bmi * region + I(bmi^2), data = insurance)
summary(insur.lm2)

Call:
lm(formula = charges ~ children + bmi + age + smoker + region + 
    bmi * smoker + bmi * region + I(bmi^2), data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11240.4  -1986.9  -1306.7   -224.8  29867.5 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -9.478e+03  2.901e+03  -3.266  0.00112 ** 
children             5.086e+02  1.100e+02   4.625 4.11e-06 ***
bmi                  4.465e+02  1.823e+02   2.450  0.01443 *  
age                  2.618e+02  9.514e+00  27.517  < 2e-16 ***
smokeryes           -2.070e+04  1.657e+03 -12.489  < 2e-16 ***
regionnorthwest     -6.424e+02  2.060e+03  -0.312  0.75522    
regionsoutheast      2.690e+03  2.049e+03   1.313  0.18951    
regionsouthwest      3.784e+02  2.010e+03   0.188  0.85072    
I(bmi^2)            -5.921e+00  2.941e+00  -2.013  0.04427 *  
bmi:smokeryes        1.451e+03  5.289e+01  27.442  < 2e-16 ***
bmi:regionnorthwest  1.666e-01  6.940e+01   0.002  0.99809    
bmi:regionsoutheast -1.228e+02  6.491e+01  -1.892  0.05871 .  
bmi:regionsouthwest -5.647e+01  6.590e+01  -0.857  0.39167    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4830 on 1325 degrees of freedom
Multiple R-squared:  0.8423,    Adjusted R-squared:  0.8409 
F-statistic: 589.9 on 12 and 1325 DF,  p-value: < 2.2e-16
  • Adj. R-Squared of .84

  • F Statistic of 590 with a p-value of < 2.2e-16

This is overwhelmingly better.

Since region and the interaction term bmi*region are not stat sig., I’m going to make a decision to remove them.

insur.lm3 <- lm(charges ~ children + bmi + age + smoker + bmi * smoker + I(bmi^2), 
    data = insurance)
summary(insur.lm3)

Call:
lm(formula = charges ~ children + bmi + age + smoker + bmi * 
    smoker + I(bmi^2), data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11232.3  -2032.4  -1317.6   -301.1  29569.9 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -10864.661   2618.135  -4.150 3.54e-05 ***
children         509.302    110.213   4.621 4.19e-06 ***
bmi              546.683    167.015   3.273  0.00109 ** 
age              263.154      9.534  27.602  < 2e-16 ***
smokeryes     -20260.804   1648.618 -12.290  < 2e-16 ***
I(bmi^2)          -8.583      2.620  -3.276  0.00108 ** 
bmi:smokeryes   1436.783     52.639  27.295  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4853 on 1331 degrees of freedom
Multiple R-squared:  0.8401,    Adjusted R-squared:  0.8394 
F-statistic:  1166 on 6 and 1331 DF,  p-value: < 2.2e-16
  • Adj. R-Squared of .84

  • F Statistic of 1166 with a p-value of < 2.2e-16 as compared to residual standard error of 4853, which is a good ratio. It proves that this is the best model after all these transformations and dropping of variables.

The final model is as follows,

\(\widehat{charges} = -10864.661 + 509.302 \times children + 546.683 \times bmi + 263.154 \times age - 20260.804 \times smoker - 8.583 \times bmi^2 + 1436.783 \times bmi*smoker\)

where smoker = 1 for smokers and = 0 for non-smokers

Interpreting the coefficients

  • Intercept: This tells us that leaving all other terms constant, on average the estimated medical charge is about $-10864.66

    which logically won’t make sense and is good there are other terms in the model. So I will rule the intercept meaningless.

  • Children: All other factors held constant, a person is expected to pay $509.30 per children added in the family.

  • BMI: BMI estimate can be looked at independently as the interaction term of BMI*Smoker has already been added to the equation. So, if we want to look at BMI at isolation, we need to go back to the previous model where there are no interaction terms. By looking at the estimates at insur.lm1, we know that roughly speaking, the cost of medical expenses would increase by $338.66 per each unit of BMI (kg / m2) increase.

  • Age: Leaving all other terms constant, a person can be expected to pay about $262.15 in medical expenses per an increase in 1 year of age.

  • Smoker: Same as BMI. The effect of smoker alone can only be evaluated in the insur.lm1 model. A person who is a non-smoker would be better off by not incurring a medical expense of $23836.30, which a smoker is expected to pay upfront, all other factors held constant.

  • BMI-squared: Leaving all other factors constant, a person will have to pay $1463.78 * BMI2 for a medical expense per each unit of BMI (kg / m2) increase.

  • BMI*Smoker: There are no general simple explanations for interaction terms. So we can skip this.

Checking for assumptions

gvlma::gvlma(insur.lm3)

Call:
lm(formula = charges ~ children + bmi + age + smoker + bmi * 
    smoker + I(bmi^2), data = insurance)

Coefficients:
  (Intercept)       children            bmi            age      smokeryes  
   -10864.661        509.302        546.683        263.154     -20260.804  
     I(bmi^2)  bmi:smokeryes  
       -8.583       1436.783  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = insur.lm3) 

                       Value p-value                   Decision
Global Stat        4335.0482  0.0000 Assumptions NOT satisfied!
Skewness           1410.5282  0.0000 Assumptions NOT satisfied!
Kurtosis           2921.8803  0.0000 Assumptions NOT satisfied!
Link Function         0.3102  0.5776    Assumptions acceptable.
Heteroscedasticity    2.3295  0.1269    Assumptions acceptable.
# plot(gvlma::gvlma(insur.lm3))

Residual Analysis

par(mfrow = c(2, 2))
hist(insur.lm3$residuals, main = "Distribution of Residuals", xlab = "")
plot(insur.lm3$residuals, fitted(insur.lm3))
qqnorm(insur.lm3$residuals)
qqline(insur.lm3$residuals)

Nope. Nothing checks out. Normal Q-Q plot doesn’t normality of residuals. Distribution of residuals doesn’t check out. Lastly, residuals vs fitted values doesn’t show constant variance.

Conclusion

I don’t think a multiple linear regression is the most appropriate approach to fit this dataset for predicting personal medical costs just by sheer violations in assumption check and residual analysis.