  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
3 Data preparation
3.1 Question 2a:
List the variables that are categorical.
Ans: Three variables are categorical: sex, smoker, and region.
3.2 Question 2b:
What should you do with the data type of these variables before fitting a regression model?
Ans: These variables are currently stored as character vectors. They should be converted to factors before modelling, so that R treats them as categorical variables with explicit levels.
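The conversion described above can be sketched as follows. This is a minimal illustration rather than the report's actual code: the small data frame is rebuilt from the six rows printed earlier so the snippet runs on its own, and the helper name `cat_vars` is an assumption introduced here.

```r
# Toy data frame rebuilt from the first six rows shown earlier (assumption:
# in the report, the full data set is already loaded into `df`).
df <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  sex      = c("female", "male", "male", "male", "male", "female"),
  bmi      = c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740),
  children = c(0, 1, 3, 0, 0, 0),
  smoker   = c("yes", "no", "no", "no", "no", "no"),
  region   = c("southwest", "southeast", "southeast",
               "northwest", "northwest", "southeast"),
  charges  = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622),
  stringsAsFactors = FALSE
)

# Find the character columns; these are the categorical variables.
cat_vars <- names(df)[sapply(df, is.character)]

# Convert each of them to a factor in place.
df[cat_vars] <- lapply(df[cat_vars], factor)

# Factor columns are now summarised as level counts instead of "character".
summary(df)
```

After the conversion, `summary(df)` reports counts per level for sex, smoker, and region, which is the form of the summary table that follows.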
      age            sex           bmi           children     smoker
 Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064
 1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274
 Median :39.00                Median :30.40   Median :1.000
 Mean   :39.21                Mean   :30.66   Mean   :1.095
 3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000
 Max.   :64.00                Max.   :53.13   Max.   :5.000
       region       charges
 northeast:324   Min.   : 1122
 northwest:325   1st Qu.: 4740
 southeast:364   Median : 9382
 southwest:325   Mean   :13270
                 3rd Qu.:16640
                 Max.   :63770
3.3 Question 2c:
Which is the response variable?
Ans: Based on the question posed in the overview, the response variable is charges.
4 Simple linear regression
4.1 Question 3a
Fit a simple linear regression model with charges as the response and bmi as the predictor.
model_simple <- lm(charges ~ bmi, data = df)
summary(model_simple)
Call:
lm(formula = charges ~ bmi, data = df)

Residuals:
   Min     1Q Median     3Q    Max
-20956  -8118  -3757   4722  49442

Coefficients:
            Estimate Std. Error t value          Pr(>|t|)
(Intercept)  1192.94    1664.80   0.717             0.474
bmi           393.87      53.25   7.397 0.000000000000246 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared:  0.03934,  Adjusted R-squared:  0.03862
F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459
4.2 Question 3b
Interpret the coefficient of bmi in plain language.
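Ans: Based on the model output, BMI is significantly associated with medical insurance charges at the 5% significance level (p-value of about 0.000000000000246, far below 0.05). The estimated coefficient of bmi is 393.87, which means that each one-unit increase in BMI is associated with an increase of about 393.87 dollars in medical insurance charges, on average. The R-squared of 0.03934 shows that BMI alone explains only about 3.9% of the variation in charges.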
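As a quick check on this interpretation, the fitted line can be evaluated by hand. The sketch below copies the two estimates from the summary output instead of refitting the model, and `predict_charges` is a hypothetical helper name introduced only for illustration.

```r
# Estimates copied from the coefficient table of model_simple.
intercept <- 1192.94
slope_bmi <- 393.87

# Fitted line: charges = intercept + slope_bmi * bmi
predict_charges <- function(bmi) intercept + slope_bmi * bmi

# Predicted average charges for an insured person with a BMI of 30.
predict_charges(30)                        # about 13009.04

# Raising BMI by one unit changes the prediction by exactly the slope.
predict_charges(31) - predict_charges(30)  # about 393.87
```

The difference between the two predictions equals the slope whichever starting BMI is chosen, which is what "for every unit increase in BMI" means for a straight-line model.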