Medical Insurance Charges (Answers)

Author

Samuel Lim

Published

March 28, 2026

1 Overview

This notebook uses a real healthcare-related dataset on medical insurance charges.

You will analyse these charges using linear regression.

1.1 Your Task

  1. Import and load a dataset in R.
  2. Identify outcome and predictor variables.
  3. Fit simple and multiple linear regression models, and carry out forward selection based on p-values.
  4. Interpret coefficients, p-values, and model fit statistics.
  5. Make predictions from a fitted regression model.

2 Load the data

2.1 Question 1:

Go to the following URL: https://www.kaggle.com/datasets/mirichoi0218/insurance/data

Download the data as a CSV file, load it into R, and print the first few rows.

Check that your data set matches the output shown below.

df <- read.csv("~/Documents/Projects/Regression Practice/insurance.csv")
head(df)
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
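
The file path above is specific to the author's machine. As a minimal sketch, the six rows printed above can be embedded directly as text to check that the columns and their types come out as expected. The object names `csv_text`, `sample_df` and `expected_cols` are illustrative, and the commented-out relative path assumes the file was saved as insurance.csv in the working directory.

```r
# Alternative to an absolute path (assumes insurance.csv is in the working directory):
# df <- read.csv("insurance.csv")

# The six rows shown by head(df), embedded as text so this sketch runs without the file.
csv_text <- "age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.924
18,male,33.770,1,no,southeast,1725.552
28,male,33.000,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.471
32,male,28.880,0,no,northwest,3866.855
31,female,25.740,0,no,southeast,3756.622"

sample_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Confirm that the expected seven columns are present, in order.
expected_cols <- c("age", "sex", "bmi", "children", "smoker", "region", "charges")
stopifnot(identical(names(sample_df), expected_cols))

# Inspect the column types: sex, smoker and region are read in as character.
str(sample_df)
```

The same `stopifnot()` check can be run on the full `df` after loading it from disk.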

3 Data preparation

3.1 Question 2a:

Which variables are categorical?

Ans: Three variables are categorical: sex, smoker and region.

3.2 Question 2b:

What should you do with the data type of these variables before fitting a regression model?

Ans: These variables are currently stored as character vectors. We should convert them to factors before fitting a model, so that R treats them as categorical variables with explicit levels.

df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)

str(df)
'data.frame':   1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
 $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
 $ charges : num  16885 1726 4449 21984 3867 ...
summary(df)
      age            sex           bmi           children     smoker    
 Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
 1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
 Median :39.00                Median :30.40   Median :1.000             
 Mean   :39.21                Mean   :30.66   Mean   :1.095             
 3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
 Max.   :64.00                Max.   :53.13   Max.   :5.000             
       region       charges     
 northeast:324   Min.   : 1122  
 northwest:325   1st Qu.: 4740  
 southeast:364   Median : 9382  
 southwest:325   Mean   :13270  
                 3rd Qu.:16640  
                 Max.   :63770  
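
As a small sketch of what the conversion does, the snippet below converts every character column in one step and shows how the first factor level becomes the reference category. It uses only the categorical values from the six rows printed by head(df); the names `sample_df`, `is_char` and `smoker_ref_yes` are illustrative and not part of the original notebook.

```r
# Categorical columns from the first six rows of the data, stored as character.
sample_df <- data.frame(
  sex    = c("female", "male", "male", "male", "male", "female"),
  smoker = c("yes", "no", "no", "no", "no", "no"),
  region = c("southwest", "southeast", "southeast",
             "northwest", "northwest", "southeast"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one step.
is_char <- vapply(sample_df, is.character, logical(1))
sample_df[is_char] <- lapply(sample_df[is_char], factor)

# Levels are sorted alphabetically, so "no" is the reference level for smoker.
levels(sample_df$smoker)

# The design matrix therefore contains a dummy column named "smokeryes".
colnames(model.matrix(~ smoker, data = sample_df))

# relevel() changes the reference level if a different baseline is wanted.
smoker_ref_yes <- relevel(sample_df$smoker, ref = "yes")
levels(smoker_ref_yes)
```

This is why the regression output in section 5 reports a term called smokeryes rather than smoker: the coefficient compares the "yes" level against the reference level "no".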

3.3 Question 2c:

Which is the response variable?

Ans: Based on the task described in the Overview, the response variable is charges.

4 Simple linear regression

4.1 Question 3a

Fit a simple linear regression model with charges as the response and bmi as the predictor.

model_simple <- lm(charges ~ bmi, data = df)
summary(model_simple)

Call:
lm(formula = charges ~ bmi, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-20956  -8118  -3757   4722  49442 

Coefficients:
            Estimate Std. Error t value          Pr(>|t|)    
(Intercept)  1192.94    1664.80   0.717             0.474    
bmi           393.87      53.25   7.397 0.000000000000246 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared:  0.03934,   Adjusted R-squared:  0.03862 
F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459

4.2 Question 3b

Interpret the coefficient of bmi in plain language.

Based on the model output, BMI is significantly associated with medical insurance charges at the 5% significance level. The coefficient of bmi is 393.87, which means that each one-unit increase in BMI is associated with an average increase of 393.87 dollars in medical insurance charges.
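
To make the interpretation concrete, here is a short sketch that plugs the rounded estimates printed by summary(model_simple) into the fitted line by hand. The variable names and the BMI values 30 and 31 are chosen purely for illustration; with the fitted model in memory, `predict(model_simple, newdata = ...)` would be the usual route.

```r
# Rounded estimates copied from the summary(model_simple) output.
b0       <- 1192.94   # intercept
b1       <- 393.87    # slope for bmi
se_b1    <- 53.25     # standard error of the slope
df_resid <- 1336      # residual degrees of freedom

# Predicted average charges at two BMI values one unit apart.
pred_bmi_30 <- b0 + b1 * 30
pred_bmi_31 <- b0 + b1 * 31

# The difference between the two predictions equals the slope.
slope_check <- pred_bmi_31 - pred_bmi_30

# Reproduce the t value and the two-sided p-value reported for bmi.
t_stat  <- b1 / se_b1
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

print(c(pred_bmi_30 = pred_bmi_30, slope_check = slope_check, t_stat = t_stat))
print(p_value)
```

Because the inputs are rounded, the recomputed t value and p-value agree with the printed output only up to rounding.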

5 Multiple linear regression

5.1 Question 4a

Fit the following multiple linear regression model with age, sex, bmi, children, smoker and region as predictors.

model_full <- lm(charges ~ age + sex + bmi + children + smoker + region, data = df)
summary(model_full)

Call:
lm(formula = charges ~ age + sex + bmi + children + smoker + 
    region, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11304.9  -2848.1   -982.1   1393.9  29992.8 

Coefficients:
                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)     -11938.5      987.8 -12.086 < 0.0000000000000002 ***
age                256.9       11.9  21.587 < 0.0000000000000002 ***
sexmale           -131.3      332.9  -0.394             0.693348    
bmi                339.2       28.6  11.860 < 0.0000000000000002 ***
children           475.5      137.8   3.451             0.000577 ***
smokeryes        23848.5      413.1  57.723 < 0.0000000000000002 ***
regionnorthwest   -353.0      476.3  -0.741             0.458769    
regionsoutheast  -1035.0      478.7  -2.162             0.030782 *  
regionsouthwest   -960.0      477.9  -2.009             0.044765 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494 
F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 0.00000000000000022

5.2 Question 4b

Using the fitted model, interpret the coefficient of smokeryes.

Based on the model output, being a smoker is significantly associated with medical insurance charges at the 5% significance level. The coefficient of smokeryes is 23848.5, which means that smokers are charged 23848.5 dollars more than non-smokers, on average, holding all other variables constant.
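
The phrase "holding all other variables constant" can be checked by hand. The sketch below copies the rounded coefficients from the summary(model_full) output and compares two otherwise identical people, one a smoker and one not. The profile (age 19, female, BMI 27.9, no children, southwest) is taken from the first row of the data; the names `coefs`, `x_smoker` and `x_nonsmoker` are illustrative.

```r
# Rounded coefficients copied from the summary(model_full) output.
coefs <- c(
  "(Intercept)"   = -11938.5,
  age             =    256.9,
  sexmale         =   -131.3,
  bmi             =    339.2,
  children        =    475.5,
  smokeryes       =  23848.5,
  regionnorthwest =   -353.0,
  regionsoutheast =  -1035.0,
  regionsouthwest =   -960.0
)

# Design row for a 19-year-old female, BMI 27.9, no children, southwest, smoker.
x_smoker <- c(1, 19, 0, 27.9, 0, 1, 0, 0, 1)

# The same person as a non-smoker: only the smokeryes dummy changes.
x_nonsmoker <- x_smoker
x_nonsmoker[6] <- 0

pred_smoker    <- sum(coefs * x_smoker)
pred_nonsmoker <- sum(coefs * x_nonsmoker)

# The gap between the two predictions is the smokeryes coefficient.
smoker_gap <- pred_smoker - pred_nonsmoker
print(c(smoker = pred_smoker, nonsmoker = pred_nonsmoker, gap = smoker_gap))
```

Any other profile gives the same gap, because the model has no interaction terms involving smoker.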

5.3 Question 4c

Using the fitted model, interpret the coefficient of age.

Based on the model output, the age of the insured is significantly associated with medical insurance charges at the 5% significance level. The coefficient of age is 256.9, which means that each additional year of age is associated with an average increase of 256.9 dollars in medical insurance charges, holding all other variables constant.
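
Since the age effect is linear in this model, it scales directly with the age gap. The following sketch uses the rounded estimate and standard error from the summary(model_full) output to compute the expected difference for a ten-year gap and an approximate 95% confidence interval for the per-year effect. The ten-year gap is an arbitrary illustrative choice; with the fitted model available, `confint(model_full)` would give the interval without rounding.

```r
# Rounded values copied from the age row of the summary(model_full) output.
b_age    <- 256.9   # estimated increase in charges per year of age
se_age   <- 11.9    # standard error of the estimate
df_resid <- 1329    # residual degrees of freedom

# Expected difference in charges between two people ten years apart in age,
# with all other predictors equal.
ten_year_diff <- 10 * b_age

# Approximate 95% confidence interval for the per-year effect.
t_crit <- qt(0.975, df = df_resid)
ci_age <- b_age + c(-1, 1) * t_crit * se_age

print(ten_year_diff)
print(round(ci_age, 1))
```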