Data 605 Discussion 13
library(knitr)
library(rmdformats)
## Global options
options(max.print="85")
opts_chunk$set(cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=35)
library(utils)
library(skimr)Problem Statement
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term,bone dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
After an extensive search, I finally settled on using a Kaggle dataset Medical Cost Personal Datasets (link). I think this is the most appropriate dataset for this exercise instead of some of the auto insurance datasets that did pique my interest in the beginning but did not have the type of response variable I wanted.
Data Exploration
setwd("/Users/dpong/Data 605/Week 13/discussion/Data 605 Discussion 13")
insurance <- read.csv("insurance.csv")
skim(insurance)| Name | insurance |
| Number of rows | 1338 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | mal: 676, fem: 662 |
| smoker | 0 | 1 | FALSE | 2 | no: 1064, yes: 274 |
| region | 0 | 1 | FALSE | 4 | sou: 364, nor: 325, sou: 325, nor: 324 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 39.21 | 14.05 | 18.00 | 27.00 | 39.00 | 51.00 | 64.00 | ▇▅▅▆▆ |
| bmi | 0 | 1 | 30.66 | 6.10 | 15.96 | 26.30 | 30.40 | 34.69 | 53.13 | ▂▇▇▂▁ |
| children | 0 | 1 | 1.09 | 1.21 | 0.00 | 0.00 | 1.00 | 2.00 | 5.00 | ▇▂▂▁▁ |
| charges | 0 | 1 | 13270.42 | 12110.01 | 1121.87 | 4740.29 | 9382.03 | 16639.91 | 63770.43 | ▇▂▁▁▁ |
age sex bmi children smoker region charges
1 19 female 27.900 0 yes southwest 16884.924
2 18 male 33.770 1 no southeast 1725.552
3 28 male 33.000 3 no southeast 4449.462
4 33 male 22.705 0 no northwest 21984.471
5 32 male 28.880 0 no northwest 3866.855
6 31 female 25.740 0 no southeast 3756.622
7 46 female 33.440 1 no southeast 8240.590
8 37 female 27.740 3 no northwest 7281.506
9 37 male 29.830 2 no northeast 6406.411
10 60 female 25.840 0 no northwest 28923.137
par(mfrow = c(1, 2))
hist(insurance$bmi, xlab = "BMI (Body Mass Index)", main = "Distribution of BMI")
hist(insurance$charges, xlab = "Medical Charges", main = "Distribution of Medical Charges")We see that BMI is distributed normally whereas Medical Charges is right-skewed.
Let’s plot charges against smoker, sex, and regions respectively.
Smokers shown to be having a higher median costs in medical charges than non-smokers.
For the medical charges, the medians are about the same at 10000 for all 4 regions and genders.
Fitting a linear regression model
Dependent variable
- charges (numerical, continuous)
Independent Variables
children (numerical, discrete)
sex (categorical)
bmi (numerical, continuous)
age (numerical, discrete)
smoker (categorical)
region (categorical)
\(\widehat{charges} = \beta_{0} + \beta_{1} \times children + \beta_{2} \times sex + \beta_{3} \times bmi + \beta_{4} \times age + \beta_{5} \times smoker + \beta_{6} \times region\)
insur.lm <- lm(charges ~ children + sex + bmi + age + smoker + region, data = insurance)
summary(insur.lm)
Call:
lm(formula = charges ~ children + sex + bmi + age + smoker +
region, data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11304.9 -2848.1 -982.1 1393.9 29992.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
children 475.5 137.8 3.451 0.000577 ***
sexmale -131.3 332.9 -0.394 0.693348
bmi 339.2 28.6 11.860 < 2e-16 ***
age 256.9 11.9 21.587 < 2e-16 ***
smokeryes 23848.5 413.1 57.723 < 2e-16 ***
regionnorthwest -353.0 476.3 -0.741 0.458769
regionsoutheast -1035.0 478.7 -2.162 0.030782 *
regionsouthwest -960.0 477.9 -2.009 0.044765 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
So this simple multiple linear regression has the following performance metrics.
Adj. R-Squared of .75 - pretty good considering it’s without any transformations
F Statistic of 501 with a p-value of < 2.2e-16, which is an indication that we have a decently-fit model.
There are 2 coefficients that we can dispose as they are not statistically significant in our prediction for medical charges ($charges). They are sexmale and regionnorthwest. But since the variables regionsoutheast and regionsouthwest are stat sig., we’ll go ahead and include the variable region going forward and drop the single variable in sex.
insur.lm1 <- lm(charges ~ children + bmi + age + smoker + region, data = insurance)
summary(insur.lm1)
Call:
lm(formula = charges ~ children + bmi + age + smoker + region,
data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11367.2 -2835.4 -979.7 1361.9 29935.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
children 474.57 137.74 3.445 0.000588 ***
bmi 338.66 28.56 11.858 < 2e-16 ***
age 256.97 11.89 21.610 < 2e-16 ***
smokeryes 23836.30 411.86 57.875 < 2e-16 ***
regionnorthwest -352.18 476.12 -0.740 0.459618
regionsoutheast -1034.36 478.54 -2.162 0.030834 *
regionsouthwest -959.37 477.78 -2.008 0.044846 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6060 on 1330 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
Adj. R-Squared of .75
F Statistic of 573 with a p-value of < 2.2e-16
Fit a model that has an interaction term and a quadratic term
adding bmi*region and bmi*smoker to the linear model.
adding a quadratic term in bmi
insur.lm2 <- lm(charges ~ children + bmi + age + smoker + region + bmi * smoker +
bmi * region + I(bmi^2), data = insurance)
summary(insur.lm2)
Call:
lm(formula = charges ~ children + bmi + age + smoker + region +
bmi * smoker + bmi * region + I(bmi^2), data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11240.4 -1986.9 -1306.7 -224.8 29867.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.478e+03 2.901e+03 -3.266 0.00112 **
children 5.086e+02 1.100e+02 4.625 4.11e-06 ***
bmi 4.465e+02 1.823e+02 2.450 0.01443 *
age 2.618e+02 9.514e+00 27.517 < 2e-16 ***
smokeryes -2.070e+04 1.657e+03 -12.489 < 2e-16 ***
regionnorthwest -6.424e+02 2.060e+03 -0.312 0.75522
regionsoutheast 2.690e+03 2.049e+03 1.313 0.18951
regionsouthwest 3.784e+02 2.010e+03 0.188 0.85072
I(bmi^2) -5.921e+00 2.941e+00 -2.013 0.04427 *
bmi:smokeryes 1.451e+03 5.289e+01 27.442 < 2e-16 ***
bmi:regionnorthwest 1.666e-01 6.940e+01 0.002 0.99809
bmi:regionsoutheast -1.228e+02 6.491e+01 -1.892 0.05871 .
bmi:regionsouthwest -5.647e+01 6.590e+01 -0.857 0.39167
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4830 on 1325 degrees of freedom
Multiple R-squared: 0.8423, Adjusted R-squared: 0.8409
F-statistic: 589.9 on 12 and 1325 DF, p-value: < 2.2e-16
Adj. R-Squared of .84
F Statistic of 590 with a p-value of < 2.2e-16
This is overwhelmingly better.
Since region and the interaction term bmi*region are not stat sig., I’m going to make a decision to remove them.
insur.lm3 <- lm(charges ~ children + bmi + age + smoker + bmi * smoker + I(bmi^2),
data = insurance)
summary(insur.lm3)
Call:
lm(formula = charges ~ children + bmi + age + smoker + bmi *
smoker + I(bmi^2), data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11232.3 -2032.4 -1317.6 -301.1 29569.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10864.661 2618.135 -4.150 3.54e-05 ***
children 509.302 110.213 4.621 4.19e-06 ***
bmi 546.683 167.015 3.273 0.00109 **
age 263.154 9.534 27.602 < 2e-16 ***
smokeryes -20260.804 1648.618 -12.290 < 2e-16 ***
I(bmi^2) -8.583 2.620 -3.276 0.00108 **
bmi:smokeryes 1436.783 52.639 27.295 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4853 on 1331 degrees of freedom
Multiple R-squared: 0.8401, Adjusted R-squared: 0.8394
F-statistic: 1166 on 6 and 1331 DF, p-value: < 2.2e-16
Adj. R-Squared of .84
F Statistic of 1166 with a p-value of < 2.2e-16 as compared to residual standard error of 4853, which is a good ratio. It proves that this is the best model after all these transformations and dropping of variables.
The final model is as follows,
\(\widehat{charges} = -10864.661 + 509.302 \times children + 546.683 \times bmi + 263.154 \times age - 20260.804 \times smoker - 8.583 \times bmi^2 + 1436.783 \times bmi*smoker\)
where smoker = 1 for smokers and = 0 for non-smokers
Interpreting the coefficients
Intercept: This tells us that leaving all other terms constant, on average the estimated medical charge is about $-10864.66
which logically won’t make sense and is good there are other terms in the model. So I will rule the intercept meaningless.
Children: All other factors held constant, a person is expected to pay $509.30 per children added in the family.
BMI: BMI estimate can be looked at independently as the interaction term of BMI*Smoker has already been added to the equation. So, if we want to look at BMI at isolation, we need to go back to the previous model where there are no interaction terms. By looking at the estimates at insur.lm1, we know that roughly speaking, the cost of medical expenses would increase by $338.66 per each unit of BMI (kg / m2) increase.
Age: Leaving all other terms constant, a person can be expected to pay about $262.15 in medical expenses per an increase in 1 year of age.
Smoker: Same as BMI. The effect of smoker alone can only be evaluated in the insur.lm1 model. A person who is a non-smoker would be better off by not incurring a medical expense of $23836.30, which a smoker is expected to pay upfront, all other factors held constant.
BMI-squared: Leaving all other factors constant, a person will have to pay $1463.78 * BMI2 for a medical expense per each unit of BMI (kg / m2) increase.
BMI*Smoker: There are no general simple explanations for interaction terms. So we can skip this.
Checking for assumptions
Call:
lm(formula = charges ~ children + bmi + age + smoker + bmi *
smoker + I(bmi^2), data = insurance)
Coefficients:
(Intercept) children bmi age smokeryes
-10864.661 509.302 546.683 263.154 -20260.804
I(bmi^2) bmi:smokeryes
-8.583 1436.783
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma::gvlma(x = insur.lm3)
Value p-value Decision
Global Stat 4335.0482 0.0000 Assumptions NOT satisfied!
Skewness 1410.5282 0.0000 Assumptions NOT satisfied!
Kurtosis 2921.8803 0.0000 Assumptions NOT satisfied!
Link Function 0.3102 0.5776 Assumptions acceptable.
Heteroscedasticity 2.3295 0.1269 Assumptions acceptable.
Residual Analysis
par(mfrow = c(2, 2))
hist(insur.lm3$residuals, main = "Distribution of Residuals", xlab = "")
plot(insur.lm3$residuals, fitted(insur.lm3))
qqnorm(insur.lm3$residuals)
qqline(insur.lm3$residuals)Nope. Nothing checks out. Normal Q-Q plot doesn’t normality of residuals. Distribution of residuals doesn’t check out. Lastly, residuals vs fitted values doesn’t show constant variance.
Conclusion
I don’t think a multiple linear regression is the most appropriate approach to fit this dataset for predicting personal medical costs just by sheer violations in assumption check and residual analysis.