Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?


Upload Dataset

Data taken from the Kaggle Medical Cost Personal Dataset and extended with a squared age term (age2) and a “normalized bmi” index (bmiNorm): the absolute difference between the bmi and 21.75, taken here as the average healthy bmi.

Insur <- read.csv("insurance.csv")
Insur$age2 <- Insur$age^2
Insur$bmiNorm <- abs(Insur$bmi - 21.75)
head(Insur)
##   age    sex   bmi children smoker    region charges age2 bmiNorm
## 1  19 female 27.90        0    yes southwest   16885  361   6.150
## 2  18   male 33.77        1     no southeast    1726  324  12.020
## 3  28   male 33.00        3     no southeast    4449  784  11.250
## 4  33   male 22.70        0     no northwest   21984 1089   0.955
## 5  32   male 28.88        0     no northwest    3867 1024   7.130
## 6  31 female 25.74        0     no southeast    3757  961   3.990
Ins_lm <- lm(charges ~ age + sex + children + smoker + age2 + bmiNorm,data=Insur)
summary(Ins_lm)
## 
## Call:
## lm(formula = charges ~ age + sex + children + smoker + age2 + 
##     bmiNorm, data = Insur)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13877  -2899   -863   1119  30480 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  -119.44    1476.15   -0.08              0.93553    
## age           -50.40      81.21   -0.62              0.53493    
## sexmale      -127.10     332.03   -0.38              0.70194    
## children      637.97     143.87    4.43              0.00001 ***
## smokeryes   23823.69     410.92   57.98 < 0.0000000000000002 ***
## age2            3.91       1.01    3.86              0.00012 ***
## bmiNorm       334.90      29.12   11.50 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6050 on 1331 degrees of freedom
## Multiple R-squared:  0.752,  Adjusted R-squared:  0.751 
## F-statistic:  672 on 6 and 1331 DF,  p-value: <0.0000000000000002
pairs(Insur, gap = 0.5)

Based on the summary, smoking status and the normalized bmi index are the strongest predictors (largest t values), while age and sex are not significant at the 5% level (p = 0.53 and p = 0.70). Let’s perform backward elimination to drop those two terms.

Ins_lm <- update(Ins_lm, .~. - age, data = Insur)
Ins_lm <- update(Ins_lm, .~. - sex, data = Insur)
summary(Ins_lm)
## 
## Call:
## lm(formula = charges ~ children + smoker + age2 + bmiNorm, data = Insur)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13712  -2911   -855   1121  30285 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -1049.440    423.599   -2.48                0.013 *  
## children      610.263    137.104    4.45            0.0000093 ***
## smokeryes   23811.004    409.529   58.14 < 0.0000000000000002 ***
## age2            3.291      0.148   22.30 < 0.0000000000000002 ***
## bmiNorm       334.820     29.064   11.52 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6040 on 1333 degrees of freedom
## Multiple R-squared:  0.752,  Adjusted R-squared:  0.751 
## F-statistic: 1.01e+03 on 4 and 1333 DF,  p-value: <0.0000000000000002

We see here that we have narrowed the predictors down to those with highly significant p-values. Our R-squared value indicates that the model explains 75.2% of the variance in charges.
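The two update() calls above remove terms by hand. As a rough sketch of how the same decision could be checked term by term, drop1() reports an F test for removing each predictor from a fitted model. The example uses R's built-in mtcars data as a stand-in so that it runs without insurance.csv; the formula and the 0.05 cutoff are illustrative assumptions, not part of the original analysis.

```r
# Stand-in example on built-in data (mtcars), not the insurance data.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# F test for dropping each term in turn from the full model.
tests <- drop1(full, test = "F")
print(tests)

# Candidate for removal: the term with the largest p-value,
# dropped only if that p-value exceeds the (assumed) 0.05 cutoff.
p_vals <- tests[["Pr(>F)"]][-1]        # first row is "<none>"
names(p_vals) <- rownames(tests)[-1]
worst <- names(which.max(p_vals))
reduced <- if (p_vals[[worst]] > 0.05) {
  update(full, as.formula(paste(". ~ . -", worst)))
} else {
  full
}
formula(reduced)
```

Repeating the drop1() and update() steps until no p-value exceeds the cutoff gives a simple p-value-based backward elimination.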

Our model suggests that:

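  1. Each additional child raises the premium by about $610, holding the other predictors fixed
  2. A smoker’s premium is on average $23,811 higher than a non-smoker’s
  3. Each unit increase in the square of the person’s age raises the premium by $3.29, so the cost of an extra year grows with age: going from age 40 to 41 adds about 3.29 × (41² − 40²) = 3.29 × 81 ≈ $267
  4. Each unit of distance between the person’s BMI and 21.75, in either direction, raises the premium by $334.82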
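To make these interpretations concrete, the sketch below plugs the coefficients from the reduced-model summary into the fitted equation by hand. The person described (40 years old, non-smoker, 2 children, BMI 27.75) and the helper name predict_charges are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the reduced-model summary above.
b <- c(intercept = -1049.440, children = 610.263,
       smokeryes = 23811.004, age2 = 3.291, bmiNorm = 334.820)

# Hypothetical helper: fitted value for one person.
predict_charges <- function(age, children, smoker, bmi) {
  unname(b["intercept"] +
    b["children"] * children +
    b["smokeryes"] * as.numeric(smoker) +
    b["age2"] * age^2 +
    b["bmiNorm"] * abs(bmi - 21.75))
}

# Hypothetical person: 40 years old, non-smoker, 2 children, BMI 27.75.
base <- predict_charges(age = 40, children = 2, smoker = FALSE, bmi = 27.75)
base                                          # about 7445.61

# Effect of one more year at age 40: 3.291 * (41^2 - 40^2) = 3.291 * 81
predict_charges(41, 2, FALSE, 27.75) - base   # about 266.57

# Effect of smoking, holding everything else fixed
predict_charges(40, 2, TRUE, 27.75) - base    # the smokeryes coefficient
```

With the fitted model object available, predict(Ins_lm, newdata = ...) would give the same fitted values without copying coefficients by hand.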

Residual Analysis

plot(fitted(Ins_lm),resid(Ins_lm))

qqnorm(resid(Ins_lm))
qqline(resid(Ins_lm))

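Based on the residual plots above, the residuals are fairly evenly spread around 0 for most fitted values, but there is a large clump of positive residuals. This matches the summary output, where the median residual is negative (-855) and the largest residual (30285) is more than twice the size of the smallest (-13712). The Q-Q plot does not follow the reference line well at all, so the residuals are not normally distributed.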
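The visual impression can be backed up with a couple of numeric checks. The sketch below runs them on simulated data with deliberately right-skewed noise, not on the insurance data, so the variable names and numbers are assumptions made for illustration. The same two checks could be applied to resid(Ins_lm).

```r
set.seed(1)                      # simulated data, not the insurance data
n <- 500
x <- runif(n, 18, 64)
# Quadratic trend plus skewed noise (exponential, shifted to mean zero)
y <- 1000 + 3 * x^2 + (rexp(n, rate = 1 / 5000) - 5000)
fit <- lm(y ~ I(x^2))
r <- resid(fit)

# Least-squares residuals average to zero when the model has an intercept,
# so a clearly negative median points to right skew.
c(mean = mean(r), median = median(r))

# Shapiro-Wilk test: a small p-value is evidence against normal residuals.
sw <- shapiro.test(r)
sw$p.value
```

A small p-value here only says the residuals are not normal; the residual and Q-Q plots are still needed to see where the departure occurs.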

Based on this, we can conclude that the linear model is not a very good representation of the behavior at the more extreme values. This makes sense, as very large premiums are more likely to be driven by factors outside of this analysis.
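The assignment also asks for a dichotomous vs. quantitative interaction term, which the model above does not include. As a sketch of how one could be specified and read, the example below fits a smoker by bmiNorm interaction on synthetic data that reuses the insurance column names. The data-generating numbers are invented for illustration, so the estimates say nothing about the real dataset.

```r
set.seed(42)                     # synthetic stand-in for the insurance data
n <- 400
sim <- data.frame(
  children = rpois(n, 1),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6)
)
sim$age2    <- sim$age^2
sim$bmiNorm <- abs(sim$bmi - 21.75)

# Assumed data-generating process: bmiNorm matters more for smokers.
sim$charges <- 3 * sim$age2 + 600 * sim$children +
  ifelse(sim$smoker == "yes", 10000 + 1400 * sim$bmiNorm, 100 * sim$bmiNorm) +
  rnorm(n, sd = 3000)

# smoker * bmiNorm expands to smoker + bmiNorm + smoker:bmiNorm
fit_int <- lm(charges ~ children + age2 + smoker * bmiNorm, data = sim)
round(coef(fit_int), 1)
```

In such a model the bmiNorm coefficient is the slope for non-smokers, and the smokeryes:bmiNorm coefficient is the additional slope for smokers. On the real data the analogous call would be update(Ins_lm, . ~ . + smoker:bmiNorm), followed by the same residual checks.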