I. Get the data.

insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...

The data set consists of 1338 observations of 7 variables of which 3, including the target, are continuous, and the rest categorical. The target variable is expenses.

II. Explore and prepare the data for modeling.

Distribution

# summarize the charges variable
summary(insurance$expenses)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770
# histogram of insurance charges
hist(insurance$expenses)

# table of region
table(insurance$region)
## 
## northeast northwest southeast southwest 
##       324       325       364       325

The expenses are right skewed between $1,122 and $63,770, with a mean of $13,270. Insurance claims are fairly evenly distributed amongst the four regions, but with somewhat heavier load (in numbers of claims) in the Southeast.

Relationships between variables

A look at correlations amongst variables, in tabular and graphical format:

# exploring relationships among features: correlation matrix
cor(insurance[c("age", "bmi", "children", "expenses")])
##                age        bmi   children   expenses
## age      1.0000000 0.10934101 0.04246900 0.29900819
## bmi      0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000
# visualing relationships among features: scatterplot matrix
pairs(insurance[c("age", "bmi", "children", "expenses")])

# more informative scatterplot matrix
library(psych)
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])

It does not appear that any of the variables explored above are excessively correlated, but the strongest correlation appears to be age and expense, followed by BMI and expense, which are unsurprising.

III. Train the model.

## Step 3: Training a model on the data ----
ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region,
                data = insurance)

# see the estimated beta coefficients
ins_model
## 
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker + 
##     region, data = insurance)
## 
## Coefficients:
##     (Intercept)              age         children              bmi  
##        -11941.6            256.8            475.7            339.3  
##         sexmale        smokeryes  regionnorthwest  regionsoutheast  
##          -131.4          23847.5           -352.8          -1035.6  
## regionsouthwest  
##          -959.3

Preliminarily, expenses can be predicted using the following formula: -$11,941.6 + $256.8(age) - $131.4(sex=male) + $339.3(bmi) + $475.7(children) + $23,847.5(smoker) - $352.8(region=NW) - $1,035.6(region=SE) - $959.3(region=SW)

For instance, starting with a baseline of -$11,941.6, expenses rise $257 for every unit increase in age, $339 for every unit increase in BMI, and a whopping $23,848 if the insuree is a smoker. However if the insuree resides in the Southeast, their expenses are likely to decrease by $1,036 if all other variables are held constant.

However we haven’t yet looked at which of these variables are significant.

IV. Evaluate model performance.

## Step 4: Evaluating model performance ----
# see more detail about the estimated beta coefficients
summary(ins_model)
## 
## Call:
## lm(formula = expenses ~ age + children + bmi + sex + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11302.7  -2850.9   -979.6   1383.9  29981.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
## age                256.8       11.9  21.586  < 2e-16 ***
## children           475.7      137.8   3.452 0.000574 ***
## bmi                339.3       28.6  11.864  < 2e-16 ***
## sexmale           -131.3      332.9  -0.395 0.693255    
## smokeryes        23847.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -352.8      476.3  -0.741 0.458976    
## regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
## regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

The resulting model has an RSE of 6,062 on 1,329 degrees of freedom. (I am as of yet unclear as to how to tell whether this is good or bad, unless I am able to compare the value with another model’s result.)

The Multiple and Adjusted R-squared values indicate that about 75% of the variability in expenses may be accounted for by this LR model, which is good. There are six total variables whose p-values are below 0.05, the alpha value generally considered the threshold for significance.

These variables are:

The F-statistic is 500.9 on 8 and 1,329 Degrees of Freedom, giving an overall p-value < 0.001. Thus we would reject the null hypothesis (e.g. no relationship between explanatory and target variables).

V. Improve model performance.

First, remove the insignificant (p close to 0.05) variables, which are sex and region and rerun the model:

ins_model1.5 <- lm(expenses ~ age + children + bmi + smoker,
                data = insurance)

# see the estimated beta coefficients
ins_model1.5
## 
## Call:
## lm(formula = expenses ~ age + children + bmi + smoker, data = insurance)
## 
## Coefficients:
## (Intercept)          age     children          bmi    smokeryes  
##    -12105.5        257.8        473.7        321.9      23810.3
summary(ins_model1.5)
## 
## Call:
## lm(formula = expenses ~ age + children + bmi + smoker, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11895  -2921   -985   1382  29499 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12105.48     941.95 -12.851  < 2e-16 ***
## age            257.83      11.90  21.674  < 2e-16 ***
## children       473.69     137.79   3.438 0.000604 ***
## bmi            321.94      27.38  11.760  < 2e-16 ***
## smokeryes    23810.32     411.21  57.903  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6068 on 1333 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.749 
## F-statistic: 998.2 on 4 and 1333 DF,  p-value: < 2.2e-16

The adjustments result in a higher F-statistic at 998.2 on 4 and 1,333 Degrees of Freedom, indicating that more of the total variability is accounted for by this model. We also see a smaller difference between the Multiple- and Adjusted R-squared values, indicating that we have shed some non-value-adding variables. All four remaining explanatory variables have significant p-values, but the RSE has increased to 6,068 (although a six-point increase likely is inconsequential, given the range of the RSE).

Now we return to the original model and try transforming the explanatory variables in order to arrive at a stronger model.

The changes made are:

  1. Square the age: explores the possibility that age and expense are related, but in a non-linear fashion.

  2. Bin the BMI values into two levels, < 30 and >= 30: determining whether clinically obesity (BMI > 30) as a group, contributes to higher expenses.

  3. Create an aggregated variable that combines BMI >= 30 and smoking: looks at interaction effect of two unhealthful characteristics (obesity and smoking) to see if the presence of both is likely to contribute more to the outcome than each alone.

These changes made, rerun the model:

## Step 5: Improving model performance ----

# add a higher-order "age" term
insurance$age2 <- insurance$age^2

# add an indicator for BMI >= 30
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

# create final model
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
                   bmi30*smoker + region, data = insurance)

summary(ins_model2)
## 
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17297.1  -1656.0  -1262.7   -727.8  24161.6 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       139.0053  1363.1359   0.102 0.918792    
## age               -32.6181    59.8250  -0.545 0.585690    
## age2                3.7307     0.7463   4.999 6.54e-07 ***
## children          678.6017   105.8855   6.409 2.03e-10 ***
## bmi               119.7715    34.2796   3.494 0.000492 ***
## sexmale          -496.7690   244.3713  -2.033 0.042267 *  
## bmi30            -997.9355   422.9607  -2.359 0.018449 *  
## smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
## regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
## regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
## regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653 
## F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

This model is significantly better. The RSE has decreased from about 6,070 to 4,445 on 1,326 degrees of freedom; the Multiple- and Adjusted R-Squared values have gone up significantly, to around 0.866. The F-statistic is around 782, which is a decrease from the last model. This model has several insignificant variables (bmi, sex, region), so we will try one more model removing them, and see where we end up.

# create final model
ins_model2.5 <- lm(expenses ~ age + age2 + children + 
                   bmi30*smoker, data = insurance)

summary(ins_model2.5)
## 
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi30 * smoker, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18718.8  -1641.7  -1313.8   -846.4  23799.7 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2237.3824  1083.9103   2.064   0.0392 *  
## age               -24.5114    60.2871  -0.407   0.6844    
## age2                3.6643     0.7524   4.870 1.25e-06 ***
## children          669.3870   106.6878   6.274 4.74e-10 ***
## bmi30              42.3929   276.9065   0.153   0.8783    
## smokeryes       13389.6763   442.5871  30.253  < 2e-16 ***
## bmi30:smokeryes 19759.1124   608.6309  32.465  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4483 on 1331 degrees of freedom
## Multiple R-squared:  0.8636, Adjusted R-squared:  0.8629 
## F-statistic:  1404 on 6 and 1331 DF,  p-value: < 2.2e-16

Our final result has an RSE of 4,483 on 1,331 degrees of freedom, Multiple- and Adjusted R-squared values around 0.863, and an F-statistic of 1,404 on 6 and 1,331 degrees of freedom. All variables except age (retained as polynomial regression variable with age^2) and bmi30 (retained to uphold the hierarchical principle) have p-values < 0.001.

Conclusion:

Results Summary:

Another look at the results from the four runs in chronolical order (all models had p-values <<< 0.05):

Model RSE (DF) R-sq Adj R-sq F-stat (DF)
1 6062 (1329) .7509 .7494 501 ( 8/1329)
2 6068 (1333) .7497 .749 998 ( 4/1333)
3 4445 (1326) .8664 .8653 782 (11/1326)
4 4483 (1331) .8636 .8629 1404 (6/1331)

On the basis of the highest R-squared values and lowest RSE, the third model performed better; however in terms of parsimony, the size of the difference between the R-sq and Adj R-sq scores and a much higher F-statistic, the final model might slightly edge it out. The difference between the two may or may not be significant.

It is these subtleties with which I have difficulty.