Step 1-collecting data —-

For this analysis, we will use a simulated dataset containing hypothetical medical expenses for patients in the United States. This data was created for this book using demographic statistics from the US Census Bureau, and thus, approximately reflect real-world conditions.

Step 2: Exploring and preparing the data —-

insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
cannot open file 'insurance.csv': No such file or directoryError in file(file, "rt") : cannot open the connection

summarize the charges variable

summary(insurance$expenses)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1122    4740    9382   13270   16640   63770 

histogram of insurance charges

hist(insurance$expenses)

table of region

table(insurance$region)

northeast northwest southeast southwest 
      324       325       364       325 

exploring relationships among features: correlation matrix

cor(insurance[c("age", "bmi", "children", "expenses")])
               age        bmi   children   expenses
age      1.0000000 0.10934101 0.04246900 0.29900819
bmi      0.1093410 1.00000000 0.01264471 0.19857626
children 0.0424690 0.01264471 1.00000000 0.06799823
expenses 0.2990082 0.19857626 0.06799823 1.00000000

visualing relationships among features: scatterplot matrix paris creates scatter plot matrix

pairs(insurance[c("age", "bmi", "children", "expenses")])

The relationship between age and expenses displays several relatively straight lines, while the bmi versus expenses plot has two distinct groups of points.

more informative scatterplot matrix

library(psych)
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])

The stretched ellipse represents a strong coreelation and Vice Versa. Therefore, there is week correlation between bmi and children.

Step 3: Training a model on the data —-

ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region,
                data = insurance)
ins_model <- lm(expenses ~ ., data = insurance) # this is equivalent to above

see the estimated beta coefficients

ins_model

Call:
lm(formula = expenses ~ ., data = insurance)

Coefficients:
    (Intercept)              age          sexmale  
       -11941.6            256.8           -131.4  
            bmi         children        smokeryes  
          339.3            475.7          23847.5  
regionnorthwest  regionsoutheast  regionsouthwest  
         -352.8          -1035.6           -959.3  

Interpretation: for each one year increase in the age, the expense increase by 256.8; A male has 131.4 less expense each year than a female.

Step 4: Evaluating model performance —-

see more detail about the estimated beta coefficients

summary(ins_model)

Call:
lm(formula = expenses ~ ., data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11302.7  -2850.9   -979.6   1383.9  29981.7 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
age                256.8       11.9  21.586  < 2e-16 ***
sexmale           -131.3      332.9  -0.395 0.693255    
bmi                339.3       28.6  11.864  < 2e-16 ***
children           475.7      137.8   3.452 0.000574 ***
smokeryes        23847.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -352.8      476.3  -0.741 0.458976    
regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494 
F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

Step 5: Improving model performance —-

add a higher-order “age” term : we assume that there is a linear relation between age and expenditure which might not be true.Therefore,adding a square term to verify.

insurance$age2 <- insurance$age^2

Add an indicator for BMI>=30: The estimated beta for this binary feature would then indicate the average net impact on medical expenses for individuals with BMI of 30 or above, relative to those with BMI less than 30.

insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

create final model

ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
                   bmi30*smoker + region, data = insurance)
summary(ins_model2)

Call:
lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
    smoker + region, data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-17297.1  -1656.0  -1262.7   -727.8  24161.6 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       139.0053  1363.1359   0.102 0.918792    
age               -32.6181    59.8250  -0.545 0.585690    
age2                3.7307     0.7463   4.999 6.54e-07 ***
children          678.6017   105.8855   6.409 2.03e-10 ***
bmi               119.7715    34.2796   3.494 0.000492 ***
sexmale          -496.7690   244.3713  -2.033 0.042267 *  
bmi30            -997.9355   422.9607  -2.359 0.018449 *  
smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4445 on 1326 degrees of freedom
Multiple R-squared:  0.8664,    Adjusted R-squared:  0.8653 
F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

Interpretation: 50% of the predict values is above th true values in the range 1656 and 727.8.The model fit statistics help to determine whether our changes improved the performance of the regression model. Relative to our first model, the R-squared value has improved from 0.75 to about 0.87. Similarly, the adjusted R-squared value, which takes into account the fact that the model grew in complexity, also improved from 0.75 to 0.87. Our model is now explaining 87 percent of the variation in medical treatment costs. Additionally, our theories about the model’s functional form seem to be validated. The higher-order age2 term is statistically significant, as is the obesity indicator, bmi30. The interaction between obesity and smoking suggests a massive effect; in addition to the increased costs of over $13,404 for smoking alone, obese smokers spend another $19,810 per year. This may suggest that smoking exacerbates diseases associated with obesity.

LS0tCnRpdGxlOiAiIFByZWRpY3RpbmcgTWVkaWNhbCBFeHBlbnNlcyIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQojIyBTdGVwIDEtY29sbGVjdGluZyBkYXRhIC0tLS0KRm9yIHRoaXMgYW5hbHlzaXMsIHdlIHdpbGwgdXNlIGEgc2ltdWxhdGVkIGRhdGFzZXQgY29udGFpbmluZyBoeXBvdGhldGljYWwgbWVkaWNhbCBleHBlbnNlcyBmb3IgcGF0aWVudHMgaW4gdGhlIFVuaXRlZCBTdGF0ZXMuIFRoaXMgZGF0YSB3YXMgY3JlYXRlZCBmb3IgdGhpcyBib29rIHVzaW5nIGRlbW9ncmFwaGljIHN0YXRpc3RpY3MgZnJvbSB0aGUgVVMgQ2Vuc3VzIEJ1cmVhdSwgYW5kIHRodXMsIGFwcHJveGltYXRlbHkgcmVmbGVjdCByZWFsLXdvcmxkIGNvbmRpdGlvbnMuCgojIyBTdGVwIDI6IEV4cGxvcmluZyBhbmQgcHJlcGFyaW5nIHRoZSBkYXRhIC0tLS0KYGBge3J9Cmluc3VyYW5jZSA8LSByZWFkLmNzdigiaW5zdXJhbmNlLmNzdiIsIHN0cmluZ3NBc0ZhY3RvcnMgPSBUUlVFKQpzdHIoaW5zdXJhbmNlKQpgYGAKCnN1bW1hcml6ZSB0aGUgY2hhcmdlcyB2YXJpYWJsZQpgYGB7cn0Kc3VtbWFyeShpbnN1cmFuY2UkZXhwZW5zZXMpCmBgYAoKaGlzdG9ncmFtIG9mIGluc3VyYW5jZSBjaGFyZ2VzCmBgYHtyfQpoaXN0KGluc3VyYW5jZSRleHBlbnNlcykKYGBgCgp0YWJsZSBvZiByZWdpb24KYGBge3J9CnRhYmxlKGluc3VyYW5jZSRyZWdpb24pCmBgYAoKZXhwbG9yaW5nIHJlbGF0aW9uc2hpcHMgYW1vbmcgZmVhdHVyZXM6IGNvcnJlbGF0aW9uIG1hdHJpeApgYGB7cn0KY29yKGluc3VyYW5jZVtjKCJhZ2UiLCAiYm1pIiwgImNoaWxkcmVuIiwgImV4cGVuc2VzIildKQpgYGAKCnZpc3VhbGluZyByZWxhdGlvbnNoaXBzIGFtb25nIGZlYXR1cmVzOiBzY2F0dGVycGxvdCBtYXRyaXgKcGFyaXMgY3JlYXRlcyBzY2F0dGVyIHBsb3QgbWF0cml4CmBgYHtyfQpwYWlycyhpbnN1cmFuY2VbYygiYWdlIiwgImJtaSIsICJjaGlsZHJlbiIsICJleHBlbnNlcyIpXSkKYGBgClRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiBhZ2UgYW5kIGV4cGVuc2VzIGRpc3BsYXlzIHNldmVyYWwgcmVsYXRpdmVseSBzdHJhaWdodCBsaW5lcywgd2hpbGUgdGhlIGJtaSB2ZXJzdXMgZXhwZW5zZXMgcGxvdCBoYXMgdHdvIGRpc3RpbmN0IGdyb3VwcyBvZiBwb2ludHMuCgptb3JlIGluZm9ybWF0aXZlIHNjYXR0ZXJwbG90IG1hdHJpeApgYGB7cn0KbGlicmFyeShwc3ljaCkKcGFpcnMucGFuZWxzKGluc3VyYW5jZVtjKCJhZ2UiLCAiYm1pIiwgImNoaWxkcmVuIiwgImV4cGVuc2VzIildKQpgYGAKVGhlIHN0cmV0Y2hlZCBlbGxpcHNlIHJlcHJlc2VudHMgYSBzdHJvbmcgY29yZWVsYXRpb24gYW5kIFZpY2UgVmVyc2EuIFRoZXJlZm9yZSwgdGhlcmUgaXMgd2VlayBjb3JyZWxhdGlvbiBiZXR3ZWVuIGJtaSBhbmQgY2hpbGRyZW4uCgojIyBTdGVwIDM6IFRyYWluaW5nIGEgbW9kZWwgb24gdGhlIGRhdGEgLS0tLQpgYGB7cn0KaW5zX21vZGVsIDwtIGxtKGV4cGVuc2VzIH4gYWdlICsgY2hpbGRyZW4gKyBibWkgKyBzZXggKyBzbW9rZXIgKyByZWdpb24sCiAgICAgICAgICAgICAgICBkYXRhID0gaW5zdXJhbmNlKQppbnNfbW9kZWwgPC0gbG0oZXhwZW5zZXMgfiAuLCBkYXRhID0gaW5zdXJhbmNlKSAjIHRoaXMgaXMgZXF1aXZhbGVudCB0byBhYm92ZQpgYGAKCnNlZSB0aGUgZXN0aW1hdGVkIGJldGEgY29lZmZpY2llbnRzCmBgYHtyfQppbnNfbW9kZWwKYGBgCkludGVycHJldGF0aW9uOiBmb3IgZWFjaCBvbmUgeWVhciBpbmNyZWFzZSBpbiB0aGUgYWdlLCB0aGUgZXhwZW5zZSBpbmNyZWFzZSBieSAyNTYuODsgQSBtYWxlIGhhcyAxMzEuNCBsZXNzIGV4cGVuc2UgZWFjaCB5ZWFyIHRoYW4gYSBmZW1hbGUuICAKCgojIyBTdGVwIDQ6IEV2YWx1YXRpbmcgbW9kZWwgcGVyZm9ybWFuY2UgLS0tLQpzZWUgbW9yZSBkZXRhaWwgYWJvdXQgdGhlIGVzdGltYXRlZCBiZXRhIGNvZWZmaWNpZW50cwpgYGB7cn0Kc3VtbWFyeShpbnNfbW9kZWwpCmBgYAoKIyMgU3RlcCA1OiBJbXByb3ZpbmcgbW9kZWwgcGVyZm9ybWFuY2UgLS0tLQphZGQgYSBoaWdoZXItb3JkZXIgImFnZSIgdGVybSA6CndlIGFzc3VtZSB0aGF0IHRoZXJlIGlzIGEgbGluZWFyIHJlbGF0aW9uIGJldHdlZW4gYWdlIGFuZCBleHBlbmRpdHVyZSB3aGljaCBtaWdodCBub3QgYmUgdHJ1ZS5UaGVyZWZvcmUsYWRkaW5nIGEgc3F1YXJlIHRlcm0gdG8gdmVyaWZ5LgpgYGB7cn0KaW5zdXJhbmNlJGFnZTIgPC0gaW5zdXJhbmNlJGFnZV4yCmBgYAoKQWRkIGFuIGluZGljYXRvciBmb3IgQk1JPj0zMDoKVGhlIGVzdGltYXRlZCBiZXRhIGZvciB0aGlzIGJpbmFyeSBmZWF0dXJlIHdvdWxkIHRoZW4gaW5kaWNhdGUgdGhlIGF2ZXJhZ2UgbmV0IGltcGFjdCBvbiBtZWRpY2FsIGV4cGVuc2VzIGZvciBpbmRpdmlkdWFscyB3aXRoIEJNSSBvZiAzMCBvciBhYm92ZSwgcmVsYXRpdmUgdG8gdGhvc2Ugd2l0aCBCTUkgbGVzcyB0aGFuIDMwLgpgYGB7cn0KaW5zdXJhbmNlJGJtaTMwIDwtIGlmZWxzZShpbnN1cmFuY2UkYm1pID49IDMwLCAxLCAwKQpgYGAKCiMgY3JlYXRlIGZpbmFsIG1vZGVsCmBgYHtyfQppbnNfbW9kZWwyIDwtIGxtKGV4cGVuc2VzIH4gYWdlICsgYWdlMiArIGNoaWxkcmVuICsgYm1pICsgc2V4ICsKICAgICAgICAgICAgICAgICAgIGJtaTMwKnNtb2tlciArIHJlZ2lvbiwgZGF0YSA9IGluc3VyYW5jZSkKCnN1bW1hcnkoaW5zX21vZGVsMikKYGBgCkludGVycHJldGF0aW9uOgo1MCUgb2YgdGhlIHByZWRpY3QgdmFsdWVzIGlzIGFib3ZlIHRoIHRydWUgdmFsdWVzIGluIHRoZSByYW5nZSAxNjU2IGFuZCA3MjcuOC5UaGUgbW9kZWwgZml0IHN0YXRpc3RpY3MgaGVscCB0byBkZXRlcm1pbmUgd2hldGhlciBvdXIgY2hhbmdlcyBpbXByb3ZlZCB0aGUgcGVyZm9ybWFuY2Ugb2YgdGhlIHJlZ3Jlc3Npb24gbW9kZWwuIFJlbGF0aXZlIHRvIG91ciBmaXJzdCBtb2RlbCwgdGhlIFItc3F1YXJlZCB2YWx1ZSBoYXMgaW1wcm92ZWQgZnJvbSAwLjc1IHRvIGFib3V0IDAuODcuIFNpbWlsYXJseSwgdGhlIGFkanVzdGVkIFItc3F1YXJlZCB2YWx1ZSwKd2hpY2ggdGFrZXMgaW50byBhY2NvdW50IHRoZSBmYWN0IHRoYXQgdGhlIG1vZGVsIGdyZXcgaW4gIGNvbXBsZXhpdHksIGFsc28gaW1wcm92ZWQgZnJvbSAwLjc1IHRvIDAuODcuIE91ciBtb2RlbCBpcyBub3cgZXhwbGFpbmluZyA4NyBwZXJjZW50IG9mIHRoZSB2YXJpYXRpb24gaW4gbWVkaWNhbCB0cmVhdG1lbnQgY29zdHMuIEFkZGl0aW9uYWxseSwgb3VyIHRoZW9yaWVzIGFib3V0IHRoZSBtb2RlbCdzIGZ1bmN0aW9uYWwgZm9ybSBzZWVtIHRvIGJlIHZhbGlkYXRlZC4gVGhlIGhpZ2hlci1vcmRlciBhZ2UyIHRlcm0gaXMgc3RhdGlzdGljYWxseSBzaWduaWZpY2FudCwgYXMgaXMgdGhlIG9iZXNpdHkgaW5kaWNhdG9yLCBibWkzMC4gVGhlIGludGVyYWN0aW9uIGJldHdlZW4gb2Jlc2l0eSBhbmQgc21va2luZyBzdWdnZXN0cyBhIG1hc3NpdmUgZWZmZWN0OyBpbiBhZGRpdGlvbiB0byB0aGUgaW5jcmVhc2VkIGNvc3RzIG9mIG92ZXIgJDEzLDQwNCBmb3Igc21va2luZyBhbG9uZSwgb2Jlc2UKc21va2VycyBzcGVuZCBhbm90aGVyICQxOSw4MTAgcGVyIHllYXIuIFRoaXMgbWF5IHN1Z2dlc3QgdGhhdCBzbW9raW5nIGV4YWNlcmJhdGVzIGRpc2Vhc2VzIGFzc29jaWF0ZWQgd2l0aCBvYmVzaXR5LgoKCgoKCg==