insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ expenses: num 16885 1726 4449 21984 3867 ...
The data set consists of 1,338 observations of 7 variables: four numeric (age, bmi, children, and the target, expenses) and three categorical factors (sex, smoker, region). The target variable is expenses.
# summarize the expenses variable
summary(insurance$expenses)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
# histogram of insurance charges
hist(insurance$expenses)
# table of region
table(insurance$region)
##
## northeast northwest southeast southwest
## 324 325 364 325
Expenses range from $1,122 to $63,770 and are right-skewed: the mean ($13,270) sits well above the median ($9,382). Insurance claims are fairly evenly distributed among the four regions, with a somewhat heavier load (in number of claims) in the Southeast.
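Given that skew, a log scale is a common way to inspect the long right tail; a quick exploratory sketch using only base R:
# the skew also shows numerically: the mean sits well above the median
mean(insurance$expenses) > median(insurance$expenses)
# a log transform spreads out the compressed lower range of the histogram
hist(log(insurance$expenses))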
A look at correlations amongst variables, in tabular and graphical format:
# exploring relationships among features: correlation matrix
cor(insurance[c("age", "bmi", "children", "expenses")])
## age bmi children expenses
## age 1.0000000 0.10934101 0.04246900 0.29900819
## bmi 0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000
# visualizing relationships among features: scatterplot matrix
pairs(insurance[c("age", "bmi", "children", "expenses")])
# more informative scatterplot matrix
library(psych)
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])
None of the numeric variables explored above are excessively correlated. The strongest correlation is between age and expenses (0.30), followed by BMI and expenses (0.20), neither of which is surprising.
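Note that cor() and the scatterplot matrices cover only the numeric features; the categorical ones can be eyeballed separately. As an exploratory sketch (base graphics, not part of the original analysis), a boxplot of expenses by smoking status:
# expenses split by smoking status; given the model results below,
# expect the smokers' distribution to sit far higher
boxplot(expenses ~ smoker, data = insurance)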
## Step 3: Training a model on the data ----
ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region,
data = insurance)
# see the estimated beta coefficients
ins_model
##
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker +
## region, data = insurance)
##
## Coefficients:
## (Intercept) age children bmi
## -11941.6 256.8 475.7 339.3
## sexmale smokeryes regionnorthwest regionsoutheast
## -131.4 23847.5 -352.8 -1035.6
## regionsouthwest
## -959.3
Preliminarily, expenses can be predicted using the following formula:
expenses = -11,941.6 + 256.8(age) + 475.7(children) + 339.3(bmi) - 131.4(sex = male) + 23,847.5(smoker = yes) - 352.8(region = NW) - 1,035.6(region = SE) - 959.3(region = SW)
For instance, starting from the intercept of -$11,941.6, expenses rise $257 for each additional year of age, $339 for every unit increase in BMI, and a whopping $23,848 if the insuree is a smoker. The sex and region coefficients are relative to their reference levels (female and the Northeast), so an insuree residing in the Southeast is expected to incur $1,036 less than an otherwise identical insuree in the Northeast, all other variables held constant.
However, we haven't yet looked at which of these variables are significant.
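As a sanity check, the coefficients can be applied to a made-up profile and compared against predict(); the insuree below (a 30-year-old non-smoking male in the Southeast with BMI 30 and one child) is purely hypothetical:
# hypothetical profile, invented for illustration
new_obs <- data.frame(age = 30, sex = "male", bmi = 30, children = 1,
                      smoker = "no", region = "southeast")
predict(ins_model, newdata = new_obs)
# hand calculation with the rounded coefficients above:
# -11941.6 + 256.8*30 + 475.7*1 + 339.3*30 - 131.4 - 1035.6 = 5250.1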
## Step 4: Evaluating model performance ----
# see more detail about the estimated beta coefficients
summary(ins_model)
##
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11302.7 -2850.9 -979.6 1383.9 29981.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11941.6 987.8 -12.089 < 2e-16 ***
## age 256.8 11.9 21.586 < 2e-16 ***
## children 475.7 137.8 3.452 0.000574 ***
## bmi 339.3 28.6 11.864 < 2e-16 ***
## sexmale -131.3 332.9 -0.395 0.693255
## smokeryes 23847.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -352.8 476.3 -0.741 0.458976
## regionsoutheast -1035.6 478.7 -2.163 0.030685 *
## regionsouthwest -959.3 477.9 -2.007 0.044921 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.9 on 8 and 1329 DF, p-value: < 2.2e-16
The resulting model has an RSE of 6,062 on 1,329 degrees of freedom. On its own this number is hard to judge: the RSE is expressed in the units of the target, so predictions are off by roughly $6,000 on average, but its real use is for comparison against other models fit to the same data.
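For context, the RSE is just the residual standard deviation, and it can be recomputed from the fitted model directly; a small sketch to demystify the number:
# RSE = sqrt(RSS / residual df); should reproduce the 6062 reported above
sqrt(sum(residuals(ins_model)^2) / df.residual(ins_model))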
The Multiple and Adjusted R-squared values indicate that about 75% of the variability in expenses is accounted for by this linear regression model, which is good. Six coefficients have p-values below 0.05, the alpha level generally taken as the threshold for significance.
These variables are:
- Age (p < 0.001)
- BMI (p < 0.001)
- Children (p < 0.001)
- Smoker = yes (p < 0.001)
- Region of residence = SE (p = 0.031)
- Region of residence = SW (p = 0.045)
The F-statistic is 500.9 on 8 and 1,329 degrees of freedom, giving an overall p-value < 0.001. Thus we reject the null hypothesis (i.e., that there is no relationship between the explanatory variables and the target).
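Both sets of p-values can also be pulled out of the summary object programmatically; a small sketch using only base R:
# per-coefficient p-values, straight from the summary object
summary(ins_model)$coefficients[, "Pr(>|t|)"]
# the overall p-value is not stored directly, but can be recomputed
# from the F-statistic and its degrees of freedom
f <- summary(ins_model)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)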
First, remove the variables that are not clearly significant (sex, at p = 0.69, and region, two of whose three levels only just clear the 0.05 threshold) and rerun the model:
ins_model1.5 <- lm(expenses ~ age + children + bmi + smoker,
data = insurance)
# see the estimated beta coefficients
ins_model1.5
##
## Call:
## lm(formula = expenses ~ age + children + bmi + smoker, data = insurance)
##
## Coefficients:
## (Intercept) age children bmi smokeryes
## -12105.5 257.8 473.7 321.9 23810.3
summary(ins_model1.5)
##
## Call:
## lm(formula = expenses ~ age + children + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11895 -2921 -985 1382 29499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12105.48 941.95 -12.851 < 2e-16 ***
## age 257.83 11.90 21.674 < 2e-16 ***
## children 473.69 137.79 3.438 0.000604 ***
## bmi 321.94 27.38 11.760 < 2e-16 ***
## smokeryes 23810.32 411.21 57.903 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.749
## F-statistic: 998.2 on 4 and 1333 DF, p-value: < 2.2e-16
The adjustments yield a higher F-statistic, 998.2 on 4 and 1,333 degrees of freedom, though this mostly reflects the smaller number of parameters rather than a better fit: the Multiple R-squared actually slips a little, from 0.7509 to 0.7497. The gap between the Multiple and Adjusted R-squared values also narrows, suggesting we have shed some non-value-adding variables. All four remaining explanatory variables have significant p-values, and while the RSE has crept up to 6,068, a six-point increase is likely inconsequential at this scale.
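One way to weigh fit against parsimony directly is an information criterion; a minimal sketch using base R's AIC(), where lower is better:
# compare the full and reduced models on fit versus complexity
AIC(ins_model, ins_model1.5)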
Now we return to the original model and try transforming the explanatory variables in order to arrive at a stronger model.
The changes made are:
- Square the age: explores the possibility that age and expenses are related in a non-linear fashion.
- Bin the BMI values into an indicator for BMI >= 30: tests whether clinical obesity (BMI >= 30), as a group, contributes to higher expenses.
- Add an interaction between the BMI >= 30 indicator and smoking: tests whether the combination of two unhealthful characteristics (obesity and smoking) contributes more to expenses than either does alone.
These changes made, rerun the model:
## Step 5: Improving model performance ----
# add a higher-order "age" term
insurance$age2 <- insurance$age^2
# add an indicator for BMI >= 30
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
# create improved model with the transformed terms
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
bmi30*smoker + region, data = insurance)
summary(ins_model2)
##
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 *
## smoker + region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17297.1 -1656.0 -1262.7 -727.8 24161.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.0053 1363.1359 0.102 0.918792
## age -32.6181 59.8250 -0.545 0.585690
## age2 3.7307 0.7463 4.999 6.54e-07 ***
## children 678.6017 105.8855 6.409 2.03e-10 ***
## bmi 119.7715 34.2796 3.494 0.000492 ***
## sexmale -496.7690 244.3713 -2.033 0.042267 *
## bmi30 -997.9355 422.9607 -2.359 0.018449 *
## smokeryes 13404.5952 439.9591 30.468 < 2e-16 ***
## regionnorthwest -279.1661 349.2826 -0.799 0.424285
## regionsoutheast -828.0345 351.6484 -2.355 0.018682 *
## regionsouthwest -1222.1619 350.5314 -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534 604.6769 32.762 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8653
## F-statistic: 781.7 on 11 and 1326 DF, p-value: < 2.2e-16
This model is markedly better. The RSE has decreased from about 6,070 to 4,445 on 1,326 degrees of freedom, and the Multiple and Adjusted R-squared values have risen substantially, to around 0.866. The F-statistic is around 782, lower than the previous model's, which is expected with more parameters. A few terms are insignificant or only marginally so (sex and most region levels), and bmi's linear effect is now largely superseded by the bmi30 indicator and its smoker interaction, so we will try one more model removing bmi, sex, and region, and see where we end up.
# create the final, reduced model
ins_model2.5 <- lm(expenses ~ age + age2 + children +
bmi30*smoker, data = insurance)
summary(ins_model2.5)
##
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi30 * smoker,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18718.8 -1641.7 -1313.8 -846.4 23799.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2237.3824 1083.9103 2.064 0.0392 *
## age -24.5114 60.2871 -0.407 0.6844
## age2 3.6643 0.7524 4.870 1.25e-06 ***
## children 669.3870 106.6878 6.274 4.74e-10 ***
## bmi30 42.3929 276.9065 0.153 0.8783
## smokeryes 13389.6763 442.5871 30.253 < 2e-16 ***
## bmi30:smokeryes 19759.1124 608.6309 32.465 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4483 on 1331 degrees of freedom
## Multiple R-squared: 0.8636, Adjusted R-squared: 0.8629
## F-statistic: 1404 on 6 and 1331 DF, p-value: < 2.2e-16
Our final result has an RSE of 4,483 on 1,331 degrees of freedom, Multiple and Adjusted R-squared values around 0.863, and an F-statistic of 1,404 on 6 and 1,331 degrees of freedom. All variables have p-values < 0.001 except age and bmi30, each retained to uphold the hierarchical principle: age supports the age2 term, and bmi30 supports its interaction with smoker.
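As a final sanity check, the model can price a hypothetical insuree; note that the derived columns age2 and bmi30 must be supplied (or recomputed) in any new data:
# hypothetical 45-year-old smoker with BMI 32 (so bmi30 = 1) and two children
new_obs <- data.frame(age = 45, age2 = 45^2, children = 2,
                      bmi30 = 1, smoker = "yes")
predict(ins_model2.5, newdata = new_obs)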
Another look at the results from the four runs, in chronological order (all four models had overall p-values far below 0.05):
| Model | RSE (df) | R-sq | Adj R-sq | F-stat (df) |
|---|---|---|---|---|
| ins_model | 6062 (1329) | 0.7509 | 0.7494 | 501 (8/1329) |
| ins_model1.5 | 6068 (1333) | 0.7497 | 0.7490 | 998 (4/1333) |
| ins_model2 | 4445 (1326) | 0.8664 | 0.8653 | 782 (11/1326) |
| ins_model2.5 | 4483 (1331) | 0.8636 | 0.8629 | 1404 (6/1331) |
On the basis of the highest R-squared values and lowest RSE, the third model (ins_model2) performs best; however, in terms of parsimony, the smaller gap between its R-sq and Adj R-sq scores, and a much higher F-statistic, the final model (ins_model2.5) might slightly edge it out. Whether the difference between the two is meaningful is hard to judge by inspection alone.
It is these subtleties with which I have difficulty.
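For this particular subtlety there is a standard tool: because the final model's terms are a subset of the third model's, the two are nested, and a partial F-test via anova() asks whether the dropped terms (bmi, sex, region) were pulling their weight; a small p-value would favor keeping them:
# partial F-test comparing the reduced (final) model to the fuller third model
anova(ins_model2.5, ins_model2)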