  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
3 Data preparation
3.1 Question 2a:
List the variables that are categorical.
Ans: Three variables are categorical: sex, smoker, and region.
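As a quick check, the categorical columns can also be picked out programmatically. The sketch below assumes the data were read with strings left as characters; `df_demo` is a hypothetical stand-in built from the first three rows of the preview above, not the full dataset.

```r
# Hypothetical stand-in for df: the first three rows of the data preview.
df_demo <- data.frame(
  age      = c(19, 18, 28),
  sex      = c("female", "male", "male"),
  bmi      = c(27.900, 33.770, 33.000),
  children = c(0, 1, 3),
  smoker   = c("yes", "no", "no"),
  region   = c("southwest", "southeast", "southeast"),
  charges  = c(16884.924, 1725.552, 4449.462),
  stringsAsFactors = FALSE
)

# A column counts as categorical here if it is stored as character or factor.
is_categorical <- vapply(
  df_demo,
  function(col) is.character(col) || is.factor(col),
  logical(1)
)
categorical_vars <- names(df_demo)[is_categorical]
print(categorical_vars)
```

Note that this check only looks at storage type. A numeric count such as children is not flagged, which matches the answer above.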
3.2 Question 2b:
What should you do with the data type of these variables before fitting a regression model?
Ans: These variables are currently stored as character data. They should be converted to factors before fitting a regression model, so that R treats them as categorical predictors.
      age            sex           bmi           children     smoker
 Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064
 1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274
 Median :39.00                Median :30.40   Median :1.000
 Mean   :39.21                Mean   :30.66   Mean   :1.095
 3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000
 Max.   :64.00                Max.   :53.13   Max.   :5.000
       region       charges
 northeast:324   Min.   : 1122
 northwest:325   1st Qu.: 4740
 southeast:364   Median : 9382
 southwest:325   Mean   :13270
                 3rd Qu.:16640
                 Max.   :63770
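The conversion described in the answer to Question 2b could be sketched as follows. This is one possible base R approach, not necessarily the code used to produce the summary above; `df_demo` and `cat_cols` are hypothetical names, and the toy rows are taken from the data preview.

```r
# Hypothetical stand-in for df, holding only the categorical columns.
df_demo <- data.frame(
  sex    = c("female", "male", "male"),
  smoker = c("yes", "no", "no"),
  region = c("southwest", "southeast", "southeast"),
  stringsAsFactors = FALSE
)

# Convert each character column to a factor in place.
cat_cols <- c("sex", "smoker", "region")
df_demo[cat_cols] <- lapply(df_demo[cat_cols], factor)

# Inspect the result: each column is now a factor with sorted levels.
str(df_demo)
print(levels(df_demo$smoker))
```

By default, factor levels are sorted alphabetically and lm() uses the first level as the baseline. That is why the regression output reports sexmale and smokeryes rather than sexfemale and smokerno.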
3.3 Question 2c:
Which is the response variable?
Ans: Based on the question posed in the overview, the response variable is charges.
4 Simple linear regression
4.1 Question 3a
Fit a simple linear regression model with charges as the response and bmi as the predictor.
model_simple <- lm(charges ~ bmi, data = df)
summary(model_simple)
Call:
lm(formula = charges ~ bmi, data = df)
Residuals:
Min 1Q Median 3Q Max
-20956 -8118 -3757 4722 49442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1192.94 1664.80 0.717 0.474
bmi 393.87 53.25 7.397 0.000000000000246 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared: 0.03934, Adjusted R-squared: 0.03862
F-statistic: 54.71 on 1 and 1336 DF, p-value: 0.0000000000002459
4.2 Question 3b
Interpret the coefficient of bmi in plain language.
Based on the model output, BMI is significantly associated with medical insurance charges at the 5% significance level (p < 0.001). The estimated coefficient of bmi is 393.87, which means that each one-unit increase in BMI is associated with an average increase of about 393.87 dollars in medical insurance charges.
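To make the interpretation concrete, the short sketch below works directly from the numbers printed in the summary for Question 3a (estimate 393.87, standard error 53.25, 1336 residual degrees of freedom). It computes an approximate 95% confidence interval for the slope and the expected difference in charges for a 5-unit difference in BMI. The variable names are illustrative; with the fitted object available, `confint(model_simple)` would give the interval directly.

```r
# Values copied from the printed summary of the simple regression.
est      <- 393.87   # estimated slope for bmi
se       <- 53.25    # standard error of the slope
df_resid <- 1336     # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df_resid)
ci     <- est + c(-1, 1) * t_crit * se
print(round(ci, 2))

# Expected average difference in charges for a 5-unit difference in BMI.
diff_5 <- 5 * est
print(diff_5)
```

Because the interval stays well above zero, it is consistent with the small p-value reported for bmi.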
5 Multiple linear regression
5.1 Question 4a
Fit the following multiple linear regression model with age, sex, bmi, children, smoker and region as predictors.
model_full <- lm(charges ~ age + sex + bmi + children + smoker + region, data = df)
summary(model_full)
Call:
lm(formula = charges ~ age + sex + bmi + children + smoker +
region, data = df)
Residuals:
Min 1Q Median 3Q Max
-11304.9 -2848.1 -982.1 1393.9 29992.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11938.5 987.8 -12.086 < 0.0000000000000002 ***
age 256.9 11.9 21.587 < 0.0000000000000002 ***
sexmale -131.3 332.9 -0.394 0.693348
bmi 339.2 28.6 11.860 < 0.0000000000000002 ***
children 475.5 137.8 3.451 0.000577 ***
smokeryes 23848.5 413.1 57.723 < 0.0000000000000002 ***
regionnorthwest -353.0 476.3 -0.741 0.458769
regionsoutheast -1035.0 478.7 -2.162 0.030782 *
regionsouthwest -960.0 477.9 -2.009 0.044765 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 0.00000000000000022
5.2 Question 4b
Using the fitted model, interpret the coefficient of smokeryes.
Based on the model output, smoking status is significantly associated with medical insurance charges at the 5% significance level. The estimated coefficient of smokeryes is 23848.5, which means that, on average, a smoker's charges are estimated to be about 23848.5 dollars higher than a non-smoker's, holding all other variables constant.
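The name smokeryes comes from how R encodes a factor in the design matrix: the baseline level (no) is absorbed into the intercept, and a 0/1 indicator column is created for the other level. The sketch below illustrates this with a small hypothetical vector rather than the real data.

```r
# Hypothetical smoker values; "no" is the baseline (first) level.
smoker <- factor(c("yes", "no", "no", "yes"), levels = c("no", "yes"))

# model.matrix() shows the columns lm() would build for this predictor.
X <- model.matrix(~ smoker)
print(X)
```

The smokeryes column is 1 for smokers and 0 for non-smokers, so its coefficient is the estimated difference in mean charges between the two groups at fixed values of the other predictors.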
5.3 Question 4c
Using the fitted model, interpret the coefficient of age.
Based on the model output, the age of the insured is significantly associated with medical insurance charges at the 5% significance level. The estimated coefficient of age is 256.9, which means that each additional year of age is associated with an average increase of about 256.9 dollars in medical insurance charges, holding all other variables constant.
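The phrase "holding all other variables constant" can be illustrated by comparing two hypothetical profiles that differ only in age. The sketch below uses the rounded coefficients printed in the summary for Question 4a; the profile (male, BMI 30, no children, non-smoker, southeast) is an arbitrary assumption chosen for illustration.

```r
# Coefficients copied from the printed summary of the full model (rounded).
coefs <- c(
  "(Intercept)"   = -11938.5,
  age             = 256.9,
  sexmale         = -131.3,
  bmi             = 339.2,
  children        = 475.5,
  smokeryes       = 23848.5,
  regionnorthwest = -353.0,
  regionsoutheast = -1035.0,
  regionsouthwest = -960.0
)

# Hypothetical profile: male, BMI 30, no children, non-smoker, southeast.
# Values are listed in the same order as coefs; only age varies.
profile <- function(age) c(1, age, 1, 30, 0, 0, 0, 1, 0)

pred_30 <- sum(coefs * profile(30))
pred_40 <- sum(coefs * profile(40))

# The difference depends only on the age coefficient: 10 * 256.9.
age_diff <- pred_40 - pred_30
print(age_diff)
```

Whatever fixed values are chosen for the other predictors, they cancel in the difference, so a 10-year age gap always corresponds to the same estimated gap in charges under this model.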