age sex bmi children smoker region charges
1 19 female 27.900 0 yes southwest 16884.924
2 18 male 33.770 1 no southeast 1725.552
3 28 male 33.000 3 no southeast 4449.462
4 33 male 22.705 0 no northwest 21984.471
5 32 male 28.880 0 no northwest 3866.855
6 31 female 25.740 0 no southeast 3756.622
3 Data preparation
3.1 Question 2a:
List the variables that are categorical?
Ans: There are three variables that are categorical. sex, smoker and region.
3.2 Question 2b:
What should you do with the data type of these variables before fitting a regression model?
Ans: These data are currently in data type characters. We would have to make them into a factor before we can do any modelling.
age sex bmi children smoker
Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
Median :39.00 Median :30.40 Median :1.000
Mean :39.21 Mean :30.66 Mean :1.095
3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
Max. :64.00 Max. :53.13 Max. :5.000
region charges
northeast:324 Min. : 1122
northwest:325 1st Qu.: 4740
southeast:364 Median : 9382
southwest:325 Mean :13270
3rd Qu.:16640
Max. :63770
3.3 Question 2c:
Which is the response variable?
Ans: Based on the question in overview, our response variable should be charges.
4 Simple linear regression
4.1 Question 3a
Fit a simple linear regression model with charges as the response and bmi as the predictor.
model_simple <-lm(charges ~ bmi, data = df)summary(model_simple)
Call:
lm(formula = charges ~ bmi, data = df)
Residuals:
Min 1Q Median 3Q Max
-20956 -8118 -3757 4722 49442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1192.94 1664.80 0.717 0.474
bmi 393.87 53.25 7.397 0.000000000000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
F-statistic: 54.71 on 1 and 1336 DF, p-value: 0.0000000000002459
4.2 Question 3b
Interpret the coefficient of bmi in plain language.
Based on the output from the model, at 95% significance level, BMI of the insured is significantly associated with the medical insurance charges. The coefficient of BMI is 393.296, which means that for every unit increase in BMI, the medical insurance charges increases by 393.296 dollars, on average.
5 Multiple linear regression
5.1 Question 4a
Fit the following multiple linear regression model with age, sex, bmi, children, smoker and region as predictors.
model_full <-lm(charges ~ age + sex + bmi + children + smoker + region, data = df)summary(model_full)
Call:
lm(formula = charges ~ age + sex + bmi + children + smoker +
region, data = df)
Residuals:
Min 1Q Median 3Q Max
-11304.9 -2848.1 -982.1 1393.9 29992.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11938.5 987.8 -12.086 < 0.0000000000000002 ***
age 256.9 11.9 21.587 < 0.0000000000000002 ***
sexmale -131.3 332.9 -0.394 0.693348
bmi 339.2 28.6 11.860 < 0.0000000000000002 ***
children 475.5 137.8 3.451 0.000577 ***
smokeryes 23848.5 413.1 57.723 < 0.0000000000000002 ***
regionnorthwest -353.0 476.3 -0.741 0.458769
regionsoutheast -1035.0 478.7 -2.162 0.030782 *
regionsouthwest -960.0 477.9 -2.009 0.044765 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 0.00000000000000022
5.2 Question 4b
Using the fitted model, interpret the coefficient of smokeryes.
Based on the output from the model, at 95% significance level, being a smoker is significantly associated with the medical insurance charges. The coefficient of smokeryes is 23847.268, which means that being a smoker increases the medical insurance charges by 23847.268 dollars, on average, holding all other variables constant.
5.3 Question 4c
Using the fitted model, interpret the coefficient of age.
Based on the output from the model, at 95% significance level, age of the insured is significantly associated with the medical insurance charges. The coefficient of age is 256.854, which means that for every year increase in age, the medical insurance charges increases by 256.854 dollars, on average, holding all other variables constant.
6 Model Selection - Forward selection based on p-values
In this section, use forward selection based on p-values.
Start with a null model containing only the intercept.
model_0 <-lm(charges ~1, data = df)summary(model_0)
Call:
lm(formula = charges ~ 1, data = df)
Residuals:
Min 1Q Median 3Q Max
-12149 -8530 -3888 3369 50500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13270.4 331.1 40.08 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12110 on 1337 degrees of freedom
6.1 Question 5a
Fit one-predictor models for each candidate variable and compare their p-values.
m_age <-lm(charges ~ age, data = df)m_sex <-lm(charges ~ sex, data = df)m_bmi <-lm(charges ~ bmi, data = df)m_children <-lm(charges ~ children, data = df)m_smoker <-lm(charges ~ smoker, data = df)m_region <-lm(charges ~ region, data = df)summary(m_age)
Call:
lm(formula = charges ~ age, data = df)
Residuals:
Min 1Q Median 3Q Max
-8059 -6671 -5939 5440 47829
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3165.9 937.1 3.378 0.000751 ***
age 257.7 22.5 11.453 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11560 on 1336 degrees of freedom
Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
F-statistic: 131.2 on 1 and 1336 DF, p-value: < 0.00000000000000022
summary(m_sex)
Call:
lm(formula = charges ~ sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-12835 -8435 -3980 3476 51201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12569.6 470.1 26.740 <0.0000000000000002 ***
sexmale 1387.2 661.3 2.098 0.0361 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
summary(m_bmi)
Call:
lm(formula = charges ~ bmi, data = df)
Residuals:
Min 1Q Median 3Q Max
-20956 -8118 -3757 4722 49442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1192.94 1664.80 0.717 0.474
bmi 393.87 53.25 7.397 0.000000000000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
F-statistic: 54.71 on 1 and 1336 DF, p-value: 0.0000000000002459
summary(m_children)
Call:
lm(formula = charges ~ children, data = df)
Residuals:
Min 1Q Median 3Q Max
-11585 -8759 -4071 3468 51248
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12522.5 446.5 28.049 <0.0000000000000002 ***
children 683.1 274.2 2.491 0.0129 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared: 0.004624, Adjusted R-squared: 0.003879
F-statistic: 6.206 on 1 and 1336 DF, p-value: 0.01285
summary(m_smoker)
Call:
lm(formula = charges ~ smoker, data = df)
Residuals:
Min 1Q Median 3Q Max
-19221 -5042 -919 3705 31720
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8434.3 229.0 36.83 <0.0000000000000002 ***
smokeryes 23616.0 506.1 46.66 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7470 on 1336 degrees of freedom
Multiple R-squared: 0.6198, Adjusted R-squared: 0.6195
F-statistic: 2178 on 1 and 1336 DF, p-value: < 0.00000000000000022
summary(m_region)
Call:
lm(formula = charges ~ region, data = df)
Residuals:
Min 1Q Median 3Q Max
-13614 -8463 -3793 3385 49035
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13406.4 671.3 19.971 <0.0000000000000002 ***
regionnorthwest -988.8 948.6 -1.042 0.297
regionsoutheast 1329.0 922.9 1.440 0.150
regionsouthwest -1059.4 948.6 -1.117 0.264
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12080 on 1334 degrees of freedom
Multiple R-squared: 0.006634, Adjusted R-squared: 0.0044
F-statistic: 2.97 on 3 and 1334 DF, p-value: 0.03089
Which variable should enter first?
Ans: The variable that should enter first is smoker. You can see that the model with smoker has the lowest p-value.
6.2 Question 5b
Write down your final selected model.
Assume smoker is selected first. Fit the model with smoker only.
model_1 <-lm(charges ~ smoker, data = df)summary(model_1)
Call:
lm(formula = charges ~ smoker, data = df)
Residuals:
Min 1Q Median 3Q Max
-19221 -5042 -919 3705 31720
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8434.3 229.0 36.83 <0.0000000000000002 ***
smokeryes 23616.0 506.1 46.66 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7470 on 1336 degrees of freedom
Multiple R-squared: 0.6198, Adjusted R-squared: 0.6195
F-statistic: 2178 on 1 and 1336 DF, p-value: < 0.00000000000000022
Now test each remaining variable one at a time after including smoker.
m1_age <-lm(charges ~ smoker + age, data = df)m1_sex <-lm(charges ~ smoker + sex, data = df)m1_bmi <-lm(charges ~ smoker + bmi, data = df)m1_children <-lm(charges ~ smoker + children, data = df)m1_region <-lm(charges ~ smoker + region, data = df)summary(m1_age)
Call:
lm(formula = charges ~ smoker + age, data = df)
Residuals:
Min 1Q Median 3Q Max
-16088.1 -2046.8 -1336.4 -212.7 28760.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2391.63 528.30 -4.527 0.00000652 ***
smokeryes 23855.30 433.49 55.031 < 0.0000000000000002 ***
age 274.87 12.46 22.069 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6397 on 1335 degrees of freedom
Multiple R-squared: 0.7214, Adjusted R-squared: 0.721
F-statistic: 1728 on 2 and 1335 DF, p-value: < 0.00000000000000022
summary(m1_sex)
Call:
lm(formula = charges ~ smoker + sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-19193 -5074 -909 3739 31682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8466.04 303.54 27.89 <0.0000000000000002 ***
smokeryes 23622.13 507.74 46.52 <0.0000000000000002 ***
sexmale -65.38 409.81 -0.16 0.873
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7473 on 1335 degrees of freedom
Multiple R-squared: 0.6198, Adjusted R-squared: 0.6192
F-statistic: 1088 on 2 and 1335 DF, p-value: < 0.00000000000000022
summary(m1_bmi)
Call:
lm(formula = charges ~ smoker + bmi, data = df)
Residuals:
Min 1Q Median 3Q Max
-15992.7 -4600.2 -802.4 3636.2 30677.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3459.10 998.28 -3.465 0.000547 ***
smokeryes 23593.98 480.18 49.136 < 0.0000000000000002 ***
bmi 388.02 31.79 12.207 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7088 on 1335 degrees of freedom
Multiple R-squared: 0.6579, Adjusted R-squared: 0.6574
F-statistic: 1284 on 2 and 1335 DF, p-value: < 0.00000000000000022
summary(m1_children)
Call:
lm(formula = charges ~ smoker + children, data = df)
Residuals:
Min 1Q Median 3Q Max
-19773 -5013 -1199 3902 32413
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7755.7 292.9 26.48 < 0.0000000000000002 ***
smokeryes 23601.7 503.7 46.85 < 0.0000000000000002 ***
children 622.4 168.7 3.69 0.000233 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7435 on 1335 degrees of freedom
Multiple R-squared: 0.6236, Adjusted R-squared: 0.623
F-statistic: 1106 on 2 and 1335 DF, p-value: < 0.00000000000000022
summary(m1_region)
Call:
lm(formula = charges ~ smoker + region, data = df)
Residuals:
Min 1Q Median 3Q Max
-19268 -5041 -795 3746 31362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8533.5 428.2 19.927 <0.0000000000000002 ***
smokeryes 23564.5 507.7 46.417 <0.0000000000000002 ***
regionnorthwest -321.3 586.9 -0.547 0.584
regionsoutheast 310.8 571.2 0.544 0.586
regionsouthwest -391.9 586.9 -0.668 0.504
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7473 on 1333 degrees of freedom
Multiple R-squared: 0.6203, Adjusted R-squared: 0.6192
F-statistic: 544.4 on 4 and 1333 DF, p-value: < 0.00000000000000022
Determine the next variable to add in. In this case, age should be added next as variable as it has the smallest p-value.
Now assume smoker and age are selected. Now test each remaining variable one at a time.
m2_sex <-lm(charges ~ smoker + age + sex, data = df)m2_bmi <-lm(charges ~ smoker + age + bmi, data = df)m2_children <-lm(charges ~ smoker + age + children, data = df)m2_region <-lm(charges ~ smoker + age + region, data = df)summary(m2_sex)
Call:
lm(formula = charges ~ smoker + age + sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-16122.4 -2048.5 -1318.9 -228.2 28725.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2433.56 558.26 -4.359 0.0000141 ***
smokeryes 23847.63 434.89 54.836 < 0.0000000000000002 ***
age 274.93 12.46 22.061 < 0.0000000000000002 ***
sexmale 81.82 350.98 0.233 0.816
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6399 on 1334 degrees of freedom
Multiple R-squared: 0.7214, Adjusted R-squared: 0.7208
F-statistic: 1151 on 3 and 1334 DF, p-value: < 0.00000000000000022
summary(m2_bmi)
Call:
lm(formula = charges ~ smoker + age + bmi, data = df)
Residuals:
Min 1Q Median 3Q Max
-12415.4 -2970.9 -980.5 1480.0 28971.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11676.83 937.57 -12.45 <0.0000000000000002 ***
smokeryes 23823.68 412.87 57.70 <0.0000000000000002 ***
age 259.55 11.93 21.75 <0.0000000000000002 ***
bmi 322.62 27.49 11.74 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6092 on 1334 degrees of freedom
Multiple R-squared: 0.7475, Adjusted R-squared: 0.7469
F-statistic: 1316 on 3 and 1334 DF, p-value: < 0.00000000000000022
summary(m2_children)
Call:
lm(formula = charges ~ smoker + age + children, data = df)
Residuals:
Min 1Q Median 3Q Max
-15547.3 -1941.4 -1319.1 -425.3 29313.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2851.99 543.78 -5.245 0.000000182 ***
smokeryes 23842.60 431.84 55.212 < 0.0000000000000002 ***
age 273.09 12.42 21.990 < 0.0000000000000002 ***
children 486.65 144.70 3.363 0.000792 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6372 on 1334 degrees of freedom
Multiple R-squared: 0.7237, Adjusted R-squared: 0.7231
F-statistic: 1165 on 3 and 1334 DF, p-value: < 0.00000000000000022
summary(m2_region)
Call:
lm(formula = charges ~ smoker + age + region, data = df)
Residuals:
Min 1Q Median 3Q Max
-15695.5 -2069.1 -1292.6 -211.1 28563.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2317.56 612.90 -3.781 0.000163 ***
smokeryes 23797.23 434.61 54.755 < 0.0000000000000002 ***
age 275.10 12.45 22.089 < 0.0000000000000002 ***
regionnorthwest -294.97 502.27 -0.587 0.557117
regionsoutheast 391.25 488.88 0.800 0.423680
regionsouthwest -436.71 502.27 -0.869 0.384744
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6396 on 1332 degrees of freedom
Multiple R-squared: 0.7221, Adjusted R-squared: 0.7211
F-statistic: 692.2 on 5 and 1332 DF, p-value: < 0.00000000000000022
Determine the next variable to add in. In this case, bmi should be added next as variable as it has the smallest p-value.
Now assume smoker, age and bmi are selected. Now test each remaining variable one at a time.
m3_sex <-lm(charges ~ smoker + age + bmi + sex, data = df)m3_children <-lm(charges ~ smoker + age + bmi + children, data = df)m3_region <-lm(charges ~ smoker + age + bmi + region, data = df)summary(m3_sex)
Call:
lm(formula = charges ~ smoker + age + bmi + sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-12364.7 -2972.2 -983.2 1475.8 29018.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11633.49 947.27 -12.281 <0.0000000000000002 ***
smokeryes 23833.87 414.19 57.544 <0.0000000000000002 ***
age 259.45 11.94 21.727 <0.0000000000000002 ***
bmi 323.05 27.53 11.735 <0.0000000000000002 ***
sexmale -109.04 334.66 -0.326 0.745
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6094 on 1333 degrees of freedom
Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
F-statistic: 986.5 on 4 and 1333 DF, p-value: < 0.00000000000000022
summary(m3_children)
Call:
lm(formula = charges ~ smoker + age + bmi + children, data = df)
Residuals:
Min 1Q Median 3Q Max
-11897.9 -2920.8 -986.6 1392.2 29509.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12102.77 941.98 -12.848 < 0.0000000000000002 ***
smokeryes 23811.40 411.22 57.904 < 0.0000000000000002 ***
age 257.85 11.90 21.675 < 0.0000000000000002 ***
bmi 321.85 27.38 11.756 < 0.0000000000000002 ***
children 473.50 137.79 3.436 0.000608 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6068 on 1333 degrees of freedom
Multiple R-squared: 0.7497, Adjusted R-squared: 0.7489
F-statistic: 998.1 on 4 and 1333 DF, p-value: < 0.00000000000000022
summary(m3_region)
Call:
lm(formula = charges ~ smoker + age + bmi + region, data = df)
Residuals:
Min 1Q Median 3Q Max
-11905 -3041 -1000 1542 29419
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11601.56 976.20 -11.884 <0.0000000000000002 ***
smokeryes 23852.48 413.51 57.683 <0.0000000000000002 ***
age 258.64 11.93 21.680 <0.0000000000000002 ***
bmi 340.01 28.67 11.858 <0.0000000000000002 ***
regionnorthwest -303.52 477.85 -0.635 0.5254
regionsoutheast -1038.63 480.49 -2.162 0.0308 *
regionsouthwest -915.94 479.56 -1.910 0.0564 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6085 on 1331 degrees of freedom
Multiple R-squared: 0.7487, Adjusted R-squared: 0.7475
F-statistic: 660.8 on 6 and 1331 DF, p-value: < 0.00000000000000022
Determine the next variable to add in. In this case, children should be added next as variable as it has the smallest p-value.
Now we have smoker, age, bmi and children selected. We will again test each remaining variable one at a time.
m4_sex <-lm(charges ~ smoker + age + bmi + children + sex, data = df)m4_region <-lm(charges ~ smoker + age + bmi + children + region, data = df)summary(m4_sex)
Call:
lm(formula = charges ~ smoker + age + bmi + children + sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-11837.2 -2916.7 -994.2 1375.3 29565.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12052.46 951.26 -12.670 < 0.0000000000000002 ***
smokeryes 23823.39 412.52 57.750 < 0.0000000000000002 ***
age 257.73 11.90 21.651 < 0.0000000000000002 ***
bmi 322.36 27.42 11.757 < 0.0000000000000002 ***
children 474.41 137.86 3.441 0.000597 ***
sexmale -128.64 333.36 -0.386 0.699641
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6070 on 1332 degrees of freedom
Multiple R-squared: 0.7497, Adjusted R-squared: 0.7488
F-statistic: 798 on 5 and 1332 DF, p-value: < 0.00000000000000022
summary(m4_region)
Call:
lm(formula = charges ~ smoker + age + bmi + children + region,
data = df)
Residuals:
Min 1Q Median 3Q Max
-11367.2 -2835.4 -979.7 1361.9 29935.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11990.27 978.76 -12.250 < 0.0000000000000002 ***
smokeryes 23836.30 411.86 57.875 < 0.0000000000000002 ***
age 256.97 11.89 21.610 < 0.0000000000000002 ***
bmi 338.66 28.56 11.858 < 0.0000000000000002 ***
children 474.57 137.74 3.445 0.000588 ***
regionnorthwest -352.18 476.12 -0.740 0.459618
regionsoutheast -1034.36 478.54 -2.162 0.030834 *
regionsouthwest -959.37 477.78 -2.008 0.044846 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6060 on 1330 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
F-statistic: 572.7 on 7 and 1330 DF, p-value: < 0.00000000000000022
Looking at the results, region would be a borderline addition as it is significant only in two of the regions. However, one could observe from R-Square or adjusted R-square that adding region does increase the model fit. Hence, it would be reasonable to add region as a predictor in the model.
Something extra for you to think about - we could also think about whether it makes to do some form of feature engineering by combining the regions into fewer categories.
With the explantion above, our variables selected now are smoker, age, bmi, children and region. We will again test each remaining variable one at a time.
m5_sex <-lm(charges ~ smoker + age + bmi + children + region + sex, data = df)summary(m5_sex)
Call:
lm(formula = charges ~ smoker + age + bmi + children + region +
sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-11304.9 -2848.1 -982.1 1393.9 29992.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11938.5 987.8 -12.086 < 0.0000000000000002 ***
smokeryes 23848.5 413.1 57.723 < 0.0000000000000002 ***
age 256.9 11.9 21.587 < 0.0000000000000002 ***
bmi 339.2 28.6 11.860 < 0.0000000000000002 ***
children 475.5 137.8 3.451 0.000577 ***
regionnorthwest -353.0 476.3 -0.741 0.458769
regionsoutheast -1035.0 478.7 -2.162 0.030782 *
regionsouthwest -960.0 477.9 -2.009 0.044765 *
sexmale -131.3 332.9 -0.394 0.693348
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 0.00000000000000022
The p-value for gender is above 0.05, which means that it is not significant at 95% confidence level. We will stop here and our final model will be the one with smoker, age, bmi, children and region as predictors.
Final Model - charges ~ smoker + age + bmi + children + region
7 Predictions
7.1 Question 6
Predict the insurance charges for the following person using the final model: