Overview
This notebook uses a real healthcare-related dataset on medical insurance charges.
You will analyze the charges using linear regression.
Your Task
- Import and load a dataset in R.
- Identify outcome and predictor variables.
- Fit simple and multiple linear regression models, and carry out forward selection based on p-values.
- Interpret coefficients, p-values, and model fit statistics.
- Make predictions from a fitted regression model.
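As a minimal sketch of the first task, the block below reads a CSV file into a data frame and inspects it. The temporary file and its three rows are made up purely so the example runs on its own; in the notebook, `read.csv()` would point at the actual insurance data file, whose name is not given here.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
# In the notebook, read.csv() would be given the path of the real data file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,bmi,children,smoker,region,charges",
  "25,female,22.5,0,no,northeast,3000",
  "40,male,30.1,2,yes,southeast,25000",
  "58,female,27.8,1,no,southwest,12000"
), csv_path)

# Keep text columns as character so their type can be inspected before conversion
df_demo <- read.csv(csv_path, stringsAsFactors = FALSE)

str(df_demo)    # column names and storage types
head(df_demo)   # first rows
```

`str()` is a quick way to see which columns were read as numbers and which as text.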
Data preparation
Question 2a
Which variables are categorical?
Question 2b
How should the data type of these variables be handled before fitting a regression model?
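One common way to prepare text columns for `lm()` is to store them as factors, so that R builds indicator (dummy) variables automatically. The sketch below uses a hypothetical four-row data frame (`df_toy`), not the real data, and assumes the categorical columns were read in as character.

```r
# Hypothetical toy data; values are invented for illustration only
df_toy <- data.frame(
  age    = c(19, 33, 45, 52),
  sex    = c("female", "male", "male", "female"),
  bmi    = c(27.9, 22.7, 30.5, 25.1),
  smoker = c("yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Columns stored as text are the candidates for conversion
is_text <- vapply(df_toy, is.character, logical(1))
names(df_toy)[is_text]

# Convert every text column to a factor in one step
df_toy[is_text] <- lapply(df_toy[is_text], factor)

# The first level is the reference category used by lm()
levels(df_toy$smoker)
```

By default the levels are sorted alphabetically, which is why the outputs later in this notebook show coefficients named `sexmale` and `smokeryes` rather than `sexfemale` and `smokerno`.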
Simple linear regression
Question 3a
Fit a simple linear regression model with charges as the response and bmi as the predictor.
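The mechanics of a simple linear regression fit can be sketched on simulated data. Everything in the block below (the data frame `sim`, the true intercept of 1000, the true slope of 400, and the noise level) is an assumption chosen for illustration, so the estimates it prints are not the answer to this question.

```r
set.seed(42)
n <- 100

# Simulated predictor and response with a known, made-up relationship
sim <- data.frame(bmi = runif(n, min = 18, max = 40))
sim$charges <- 1000 + 400 * sim$bmi + rnorm(n, sd = 2000)

# Response on the left of ~, predictor on the right
fit_simple <- lm(charges ~ bmi, data = sim)

coef(fit_simple)                 # estimated intercept and slope
summary(fit_simple)$r.squared    # share of variance in charges explained by bmi
```

Because the data were generated with a slope of 400, the estimated `bmi` coefficient should land near that value, which is a useful sanity check on how to read the output.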
Multiple linear regression
Question 4a
Fit a multiple linear regression model with charges as the response and age, sex, bmi, children, smoker and region as predictors.
Question 4b
Using the fitted model, interpret the coefficient of smokeryes.
Question 4c
Using the fitted model, interpret the coefficient of age.
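When interpreting coefficients in a multiple regression, it can help to see that each one is the difference in predicted response between two cases that differ in that variable only. The sketch below checks this on simulated data; the data frame `sim` and the effect sizes used to generate it are assumptions for illustration, not estimates from the insurance data.

```r
set.seed(1)
n <- 200

# Simulated data with a numeric and a two-level categorical predictor
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE))
)
sim$charges <- 2000 + 250 * sim$age + 20000 * (sim$smoker == "yes") +
  rnorm(n, sd = 3000)

fit <- lm(charges ~ age + smoker, data = sim)

# Two people of the same age who differ only in smoking status
pair_smoker <- data.frame(
  age    = c(40, 40),
  smoker = factor(c("no", "yes"), levels = levels(sim$smoker))
)
gap_smoker <- unname(diff(predict(fit, newdata = pair_smoker)))

# Two non-smokers who differ by exactly one year of age
pair_age <- data.frame(
  age    = c(40, 41),
  smoker = factor(c("no", "no"), levels = levels(sim$smoker))
)
gap_age <- unname(diff(predict(fit, newdata = pair_age)))

# Each gap equals the corresponding coefficient
all.equal(gap_smoker, coef(fit)[["smokeryes"]])
all.equal(gap_age, coef(fit)[["age"]])
```

The same reading applies to the fitted insurance model: `smokeryes` compares a smoker with an otherwise identical non-smoker, and `age` compares two otherwise identical people one year apart.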
Model Selection - Forward selection based on p-values
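In this section, build a model by forward selection based on p-values: at each step, add the candidate predictor with the smallest p-value, and stop when no remaining candidate is significant at your chosen level (for example, 0.05).
Start with a null model containing only the intercept.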
model_0 <- lm(charges ~ 1, data = df)
summary(model_0)
Call:
lm(formula = charges ~ 1, data = df)
Residuals:
   Min     1Q Median     3Q    Max
-12149  -8530  -3888   3369  50500
Coefficients:
            Estimate Std. Error t value            Pr(>|t|)
(Intercept)  13270.4      331.1   40.08 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12110 on 1337 degrees of freedom
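In an intercept-only model, the intercept estimate is the sample mean of the response and the residual standard error is its sample standard deviation, so the output above says that the mean of charges is about 13270.4 and its standard deviation about 12110, with 1337 + 1 = 1338 observations. The sketch below verifies the general identity on a simulated vector `y`, which is made up for illustration.

```r
set.seed(7)

# Made-up, right-skewed values standing in for a response variable
y <- rexp(50, rate = 1 / 13000)

null_fit <- lm(y ~ 1)

# The intercept of the null model is the sample mean
c(intercept = coef(null_fit)[["(Intercept)"]], mean = mean(y))

# The residual standard error of the null model is the sample standard deviation
c(rse = sigma(null_fit), sd = sd(y))

# Residual degrees of freedom are n - 1
df.residual(null_fit)
```

This is why the null model is a natural baseline: every predictor added later is judged by how much it improves on simply predicting the mean.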
Question 5a
Fit one-predictor models for each candidate variable and compare their p-values.
m_age <- lm(charges ~ age, data = df)
m_sex <- lm(charges ~ sex, data = df)
m_bmi <- lm(charges ~ bmi, data = df)
m_children <- lm(charges ~ children, data = df)
m_smoker <- lm(charges ~ smoker, data = df)
m_region <- lm(charges ~ region, data = df)
summary(m_age)
Call:
lm(formula = charges ~ age, data = df)
Residuals:
   Min     1Q Median     3Q    Max
 -8059  -6671  -5939   5440  47829
Coefficients:
            Estimate Std. Error t value             Pr(>|t|)
(Intercept)   3165.9      937.1   3.378             0.000751 ***
age            257.7       22.5  11.453 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11560 on 1336 degrees of freedom
Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
F-statistic: 131.2 on 1 and 1336 DF, p-value: < 0.00000000000000022
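The quantities in a `summary()` table are linked by simple arithmetic, which can be checked against the printed age model. The sketch below uses only the rounded numbers shown above, so the results match the printed values up to rounding.

```r
# Values copied from the printed summary of lm(charges ~ age)
est      <- 257.7    # slope estimate for age
se       <- 22.5     # its standard error
f_stat   <- 131.2    # overall F-statistic
df_resid <- 1336     # residual degrees of freedom

# t value = estimate / standard error (printed: 11.453)
t_val <- est / se
t_val

# With a single predictor, the F-statistic is the squared t value (printed: 131.2)
t_val^2

# With one predictor, R-squared = F / (F + residual df) (printed: 0.08941)
r2 <- f_stat / (f_stat + df_resid)
r2
```

The slope itself reads as follows: in this one-predictor model, each additional year of age is associated with charges that are about 257.7 higher on average.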
summary(m_sex)
Call:
lm(formula = charges ~ sex, data = df)
Residuals:
   Min     1Q Median     3Q    Max
-12835  -8435  -3980   3476  51201
Coefficients:
            Estimate Std. Error t value            Pr(>|t|)
(Intercept)  12569.6      470.1  26.740 <0.0000000000000002 ***
sexmale       1387.2      661.3   2.098              0.0361 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
summary(m_bmi)
Call:
lm(formula = charges ~ bmi, data = df)
Residuals:
   Min     1Q Median     3Q    Max
-20956  -8118  -3757   4722  49442
Coefficients:
            Estimate Std. Error t value          Pr(>|t|)
(Intercept)  1192.94    1664.80   0.717             0.474
bmi           393.87      53.25   7.397 0.000000000000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
F-statistic: 54.71 on 1 and 1336 DF, p-value: 0.0000000000002459
summary(m_children)
Call:
lm(formula = charges ~ children, data = df)
Residuals:
   Min     1Q Median     3Q    Max
-11585  -8759  -4071   3468  51248
Coefficients:
            Estimate Std. Error t value            Pr(>|t|)
(Intercept)  12522.5      446.5  28.049 <0.0000000000000002 ***
children       683.1      274.2   2.491              0.0129 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared: 0.004624, Adjusted R-squared: 0.003879
F-statistic: 6.206 on 1 and 1336 DF, p-value: 0.01285
summary(m_smoker)
Call:
lm(formula = charges ~ smoker, data = df)
Residuals:
   Min     1Q Median     3Q    Max
-19221  -5042   -919   3705  31720
Coefficients:
            Estimate Std. Error t value            Pr(>|t|)
(Intercept)   8434.3      229.0   36.83 <0.0000000000000002 ***
smokeryes    23616.0      506.1   46.66 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7470 on 1336 degrees of freedom
Multiple R-squared: 0.6198, Adjusted R-squared: 0.6195
F-statistic: 2178 on 1 and 1336 DF, p-value: < 0.00000000000000022
summary(m_region)
Call:
lm(formula = charges ~ region, data = df)
Residuals:
   Min     1Q Median     3Q    Max
-13614  -8463  -3793   3385  49035
Coefficients:
                Estimate Std. Error t value            Pr(>|t|)
(Intercept)      13406.4      671.3  19.971 <0.0000000000000002 ***
regionnorthwest   -988.8      948.6  -1.042               0.297
regionsoutheast   1329.0      922.9   1.440               0.150
regionsouthwest  -1059.4      948.6  -1.117               0.264
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12080 on 1334 degrees of freedom
Multiple R-squared: 0.006634, Adjusted R-squared: 0.0044
F-statistic: 2.97 on 3 and 1334 DF, p-value: 0.03089
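Two details make the comparison less direct than it looks. First, region enters as three dummy coefficients, so the p-value that judges region as a whole is the one on the F-statistic line, not any single row of the coefficient table. Second, the printed p-values for age and smoker are both cut off at the same display limit, so they cannot be ranked from the printout alone. One possible way around both issues, sketched below, is to recompute the overall F-test p-values on the log scale from the F-statistics and degrees of freedom printed above; the data frame `f_tests` is just a container for those printed numbers.

```r
# F-statistics and degrees of freedom copied from the six summaries above
f_tests <- data.frame(
  variable = c("age", "sex", "bmi", "children", "smoker", "region"),
  f_stat   = c(131.2, 4.4, 54.71, 6.206, 2178, 2.97),
  df1      = c(1, 1, 1, 1, 1, 3),
  df2      = c(1336, 1336, 1336, 1336, 1336, 1334)
)

# Upper-tail F probability on the log scale, so that extremely small
# p-values remain distinguishable instead of being truncated
f_tests$log_p <- pf(f_tests$f_stat, f_tests$df1, f_tests$df2,
                    lower.tail = FALSE, log.p = TRUE)
f_tests$p_value <- exp(f_tests$log_p)

# Smallest p-value first
ranked <- f_tests[order(f_tests$log_p), ]
ranked
```

Because the inputs are rounded, the recomputed p-values agree with the printed ones only approximately, but the ordering is what matters for forward selection.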
Which variable should enter first?
Question 5b
Write down your final selected model.
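The full forward selection procedure can be sketched as a loop: test each remaining candidate against the current model, add the one with the smallest p-value, and stop when none falls below the threshold. The block below is a minimal sketch on simulated data, not the insurance data. The helper name `forward_select`, the variables `x1`, `x2`, `x3`, `g`, and the threshold `alpha = 0.05` are all assumptions for illustration. It uses the partial F-test from `anova()` so that a multi-level factor is tested as a single term.

```r
set.seed(123)
n <- 200

# Simulated candidates; by construction only x1 and x2 affect y
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
sim$y <- 2 + 3 * sim$x1 + 1.5 * sim$x2 + rnorm(n)

# Hypothetical helper: forward selection by partial F-test p-value
forward_select <- function(data, response, candidates, alpha = 0.05) {
  selected <- character(0)
  current  <- lm(reformulate("1", response = response), data = data)
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # p-value for adding each remaining candidate to the current model
    p_add <- vapply(remaining, function(v) {
      bigger <- lm(reformulate(c(selected, v), response = response), data = data)
      anova(current, bigger)[["Pr(>F)"]][2]
    }, numeric(1))
    best <- names(p_add)[which.min(p_add)]
    if (p_add[[best]] >= alpha) break   # nothing significant left to add
    selected <- c(selected, best)
    current  <- lm(reformulate(selected, response = response), data = data)
  }
  list(selected = selected, model = current)
}

result <- forward_select(sim, "y", c("x1", "x2", "x3", "g"))
result$selected          # order in which variables entered
formula(result$model)    # final selected model
```

Note that after the first step the p-values must be recomputed with the already-selected variables in the model; the one-predictor summaries from Question 5a only decide the first entry.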
Prediction
Question 6
Predict the insurance charges for the following person:
- age = 40
- sex = female
- bmi = 30
- children = 2
- smoker = no
- region = southeast
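A prediction for a new person is usually obtained by passing a one-row data frame to `predict()`. The sketch below shows the pattern on simulated data with the same column names; the data frame `sim`, the effect sizes used to generate it, and the model `fit_full` are assumptions for illustration, so the number it produces is not the answer to this question. It also reproduces the prediction by hand from the coefficients, which shows how the reference levels (female, non-smoker, northeast) contribute nothing beyond the intercept.

```r
set.seed(2024)
n <- 300

# Simulated data with the same columns as the insurance dataset
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
sim$charges <- 1000 + 250 * sim$age + 300 * sim$bmi + 500 * sim$children +
  20000 * (sim$smoker == "yes") + rnorm(n, sd = 5000)

fit_full <- lm(charges ~ age + sex + bmi + children + smoker + region, data = sim)

# One-row data frame describing the person; factor levels must match the training data
new_person <- data.frame(
  age      = 40,
  sex      = factor("female", levels = levels(sim$sex)),
  bmi      = 30,
  children = 2,
  smoker   = factor("no", levels = levels(sim$smoker)),
  region   = factor("southeast", levels = levels(sim$region))
)
pred <- unname(predict(fit_full, newdata = new_person))

# The same prediction assembled by hand from the coefficients
b <- coef(fit_full)
by_hand <- b[["(Intercept)"]] + 40 * b[["age"]] + 30 * b[["bmi"]] +
  2 * b[["children"]] + b[["regionsoutheast"]]

all.equal(pred, by_hand)
```

With the real data, the same `new_person` data frame would be passed to `predict()` together with whichever model was selected in Question 5b; variables not in that model are simply ignored.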