Data are taken from the Kaggle Medical Cost Personal Dataset and extended with a squared age term (age2) and a “normalized bmi” index (bmiNorm). The index is the absolute difference between bmi and 21.75, taken here as the average healthy bmi, so it measures how far a person is from that value in either direction.
Insur <- read.csv("insurance.csv", stringsAsFactors = TRUE)  # factors keep pairs() working on R >= 4.0
Insur$age2 <- Insur$age^2
Insur$bmiNorm <- abs(Insur$bmi - 21.75)
head(Insur)
## age sex bmi children smoker region charges age2 bmiNorm
## 1 19 female 27.90 0 yes southwest 16885 361 6.150
## 2 18 male 33.77 1 no southeast 1726 324 12.020
## 3 28 male 33.00 3 no southeast 4449 784 11.250
## 4 33 male 22.70 0 no northwest 21984 1089 0.955
## 5 32 male 28.88 0 no northwest 3867 1024 7.130
## 6 31 female 25.74 0 no southeast 3757 961 3.990
Ins_lm <- lm(charges ~ age + sex + children + smoker + age2 + bmiNorm, data = Insur)
summary(Ins_lm)
##
## Call:
## lm(formula = charges ~ age + sex + children + smoker + age2 +
## bmiNorm, data = Insur)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13877 -2899 -863 1119 30480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -119.44 1476.15 -0.08 0.93553
## age -50.40 81.21 -0.62 0.53493
## sexmale -127.10 332.03 -0.38 0.70194
## children 637.97 143.87 4.43 0.00001 ***
## smokeryes 23823.69 410.92 57.98 < 0.0000000000000002 ***
## age2 3.91 1.01 3.86 0.00012 ***
## bmiNorm 334.90 29.12 11.50 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6050 on 1331 degrees of freedom
## Multiple R-squared: 0.752, Adjusted R-squared: 0.751
## F-statistic: 672 on 6 and 1331 DF, p-value: <0.0000000000000002
pairs(Insur, gap = 0.5)
Based on the summary we can see that smoking status and the normalized bmi index are the strongest predictors (they have the largest t values), while age and sex are not significant once age2 is in the model. Let’s perform backward elimination and drop those two terms to narrow the model down a bit.
Ins_lm <- update(Ins_lm, .~. - age, data = Insur)
Ins_lm <- update(Ins_lm, .~. - sex, data = Insur)
summary(Ins_lm)
##
## Call:
## lm(formula = charges ~ children + smoker + age2 + bmiNorm, data = Insur)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13712 -2911 -855 1121 30285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1049.440 423.599 -2.48 0.013 *
## children 610.263 137.104 4.45 0.0000093 ***
## smokeryes 23811.004 409.529 58.14 < 0.0000000000000002 ***
## age2 3.291 0.148 22.30 < 0.0000000000000002 ***
## bmiNorm 334.820 29.064 11.52 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6040 on 1333 degrees of freedom
## Multiple R-squared: 0.752, Adjusted R-squared: 0.751
## F-statistic: 1.01e+03 on 4 and 1333 DF, p-value: <0.0000000000000002
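The two manual update() calls above can also be written as a loop. The following is only a sketch of p-value-based backward elimination, not the procedure used for the output above: it runs on simulated data (the column names mirror the dataset, but every value is made up), and both the 0.05 cutoff and the use of drop1() are assumptions.

```r
# Simulated stand-in data; names mirror the post, values are invented.
set.seed(1)
n <- 1000
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  noise    = rnorm(n)                       # predictor unrelated to the response
)
sim$charges <- 2000 + 3 * sim$age^2 + 600 * sim$children +
  24000 * (sim$smoker == "yes") + rnorm(n, sd = 6000)

fit <- lm(charges ~ I(age^2) + children + smoker + noise, data = sim)

# Drop the term with the largest p-value until every remaining term is below alpha.
alpha <- 0.05                               # assumed cutoff
repeat {
  tab <- drop1(fit, test = "F")             # one F-test per droppable term
  p <- tab[["Pr(>F)"]][-1]                  # skip the "<none>" row
  names(p) <- rownames(tab)[-1]
  if (length(p) == 0 || max(p) < alpha) break
  worst <- names(which.max(p))
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

kept <- attr(terms(fit), "term.labels")
final_p <- drop1(fit, test = "F")[["Pr(>F)"]][-1]
print(kept)
```

Removing one term at a time matters because the p-values of the remaining terms change after each refit, as they did for age2 once age was dropped.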
We see here that we have narrowed the predictors down to those with very significant p-values. The R-squared value indicates that the model explains 75.2% of the variance in charges, the same as the full model, so dropping age and sex cost us nothing.
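To make the R-squared figure concrete, here is a minimal sketch of how it is computed from the residuals. It uses simulated data rather than the insurance data, so the variables x and y and the resulting value are illustrative only.

```r
# Simulated data for illustration only; not the insurance dataset.
set.seed(42)
x <- runif(200, 18, 64)
y <- 3 * x^2 + rnorm(200, sd = 3000)
fit <- lm(y ~ I(x^2))

rss <- sum(resid(fit)^2)        # variation left unexplained by the model
tss <- sum((y - mean(y))^2)     # total variation of the response
r2_manual <- 1 - rss / tss      # share of variance explained

r2_summary <- summary(fit)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```

The two numbers agree for any lm() fit that includes an intercept, which is the case for the models in this analysis.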
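Our model suggests that, holding the other predictors fixed, each additional child adds about 610 to charges, being a smoker adds about 23,811, each bmi unit away from 21.75 adds about 335, and charges grow with the square of age (about 3.29 per unit of age2).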
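The coefficients from the reduced-model summary can be turned into a point prediction by hand. The sketch below copies the estimates from that output; the person described (age 40, bmi 30, two children, smoker) is hypothetical and chosen only to show the arithmetic.

```r
# Estimates copied from the reduced-model summary output.
coefs <- c(intercept = -1049.440, children = 610.263,
           smokeryes = 23811.004, age2 = 3.291, bmiNorm = 334.820)

# Hypothetical person; these values are for illustration only.
age <- 40; bmi <- 30; children <- 2; smoker <- TRUE

pred <- coefs[["intercept"]] +
  coefs[["children"]]  * children +
  coefs[["smokeryes"]] * as.numeric(smoker) +   # 1 for smokers, 0 otherwise
  coefs[["age2"]]      * age^2 +
  coefs[["bmiNorm"]]   * abs(bmi - 21.75)       # same transform as bmiNorm
print(pred)  # about 32010
```

With the fitted object available, predict(Ins_lm, newdata = ...) would do the same calculation, provided newdata contains the age2 and bmiNorm columns.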
plot(fitted(Ins_lm), resid(Ins_lm))
qqnorm(resid(Ins_lm))
qqline(resid(Ins_lm))
Based on the residual analysis above we see that the residuals are spread fairly evenly around 0, but there is also a large cluster of residuals well above zero. The Q-Q plot does not follow the reference line well at all, so the residuals are not normally distributed.
Based on this we can conclude that the model is not a very good representation of the behavior at the more extreme values. This makes sense, as very large charges are more likely to be driven by factors outside of this analysis.
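The visual reading of the Q-Q plot can be backed up with numbers. The sketch below is an assumption about how one might do that, not part of the original analysis: it fits a model to simulated data with right-skewed errors (standing in for a few very large charges), then applies shapiro.test() and computes the sample skewness of the residuals.

```r
# Simulated data: exponential errors mimic a handful of very large charges.
set.seed(7)
x <- runif(500, 18, 64)
y <- 100 * x + rexp(500, rate = 1 / 5000)     # errors with mean 5000, long right tail
fit <- lm(y ~ x)

res <- resid(fit)
sw <- shapiro.test(res)                       # null hypothesis: residuals are normal
skew <- mean((res - mean(res))^3) / sd(res)^3 # sample skewness; > 0 means a right tail
print(sw$p.value)
print(skew)
```

A small p-value together with positive skewness is the numeric counterpart of a Q-Q plot whose upper tail bends away from the line. Note that shapiro.test() accepts between 3 and 5000 observations, so it would also run on the 1338 residuals of Ins_lm.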