Medical Insurance Charges (Questions)

Author

Samuel Lim

Published

March 28, 2026

1 Overview

This notebook uses a real healthcare-related dataset on medical insurance charges.

Your task is to analyze these charges using linear regression in R.

1.1 Your Task

  1. Import and load a dataset in R.
  2. Identify outcome and predictor variables.
  3. Fit simple and multiple linear regression models, and carry out forward selection based on p-values.
  4. Interpret coefficients, p-values, and model fit statistics.
  5. Make predictions from a fitted regression model.

2 Load the data

2.1 Question 1:

Go to the following URL: https://www.kaggle.com/datasets/mirichoi0218/insurance/data

Download the data as a CSV file, load it into R, and print the first few rows.
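As a rough sketch of the loading step, the snippet below writes a tiny stand-in CSV to a temporary file and reads it back with read.csv(). The four rows are made-up toy values, not rows from the Kaggle file. With the real download you would instead pass the path of the downloaded file (the path and file name depend on where you saved it).

```r
# Toy stand-in for the downloaded file: same column names as the worksheet uses,
# but the values below are invented for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,bmi,children,smoker,region,charges",
  "25,female,24.5,0,no,northwest,3000.50",
  "40,male,31.2,2,yes,southeast,38000.00",
  "52,female,28.0,1,no,southwest,11000.75",
  "33,male,22.9,3,no,northeast,5200.10"
), csv_path)

# Read the CSV into a data frame; replace csv_path with your own file path.
df <- read.csv(csv_path)

# Print the first few rows and inspect the column types.
head(df, 3)
str(df)
```

Since R 4.0, read.csv() leaves text columns as character vectors by default, which matters for the data preparation questions that follow.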

Click on this link for answers to question 1: https://rpubs.com/Samuelllim/Q1

3 Data preparation

3.1 Question 2a:

Which variables are categorical?

3.2 Question 2b:

What should you do with the data type of these variables before fitting a regression model?
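One common way to prepare text columns for lm() is to convert them to factors. The sketch below does this on a small toy data frame with invented values. The choice of columns in cat_vars is an assumption for illustration, so check it against your own answer to Question 2a.

```r
# Toy data frame with text columns stored as character (invented values).
df <- data.frame(
  age    = c(25, 40, 52, 33),
  sex    = c("female", "male", "female", "male"),
  smoker = c("no", "yes", "no", "no"),
  region = c("northwest", "southeast", "southwest", "northeast"),
  stringsAsFactors = FALSE
)

# Columns assumed to be categorical for this illustration.
cat_vars <- c("sex", "smoker", "region")

# Convert each listed column to a factor in one step.
df[cat_vars] <- lapply(df[cat_vars], factor)

# Check the result: factor levels are sorted alphabetically by default,
# and the first level becomes the reference category in a regression.
str(df)
levels(df$smoker)
```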

3.3 Question 2c:

Which is the response variable?

Click on this link for answers to question 2: https://rpubs.com/Samuelllim/Q2

4 Simple linear regression

4.1 Question 3a

Fit a simple linear regression model with charges as the response and bmi as the predictor.

4.2 Question 3b

Interpret the coefficient of bmi.
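To see what a slope means in a simple linear regression, the sketch below fits a model on simulated toy data (not the insurance data) and checks that the slope equals the change in the predicted response when the predictor increases by one unit. The data-generating numbers are arbitrary assumptions.

```r
# Simulated toy data; the numbers used to generate it are arbitrary.
set.seed(1)
toy <- data.frame(bmi = round(rnorm(100, mean = 30, sd = 6), 1))
toy$charges <- 1000 + 400 * toy$bmi + rnorm(100, sd = 5000)

# Simple linear regression of charges on bmi.
fit <- lm(charges ~ bmi, data = toy)
slope <- coef(fit)[["bmi"]]

# Predicted charges at bmi = 30 and bmi = 31.
pred <- predict(fit, newdata = data.frame(bmi = c(30, 31)))

# The difference between the two predictions is exactly the slope.
slope
unname(diff(pred))
```

The same reading applies to the real model: the bmi coefficient is the estimated change in mean charges for a one-unit increase in bmi.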

Click on this link for answers to question 3: https://rpubs.com/Samuelllim/Q3

5 Multiple linear regression

5.1 Question 4a

Fit a multiple linear regression model with charges as the response and age, sex, bmi, children, smoker and region as predictors.
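A minimal sketch of the model call, run on simulated toy data so that it is self-contained. The toy data frame and the numbers used to generate it are assumptions for illustration. With the real data you would pass your own data frame to the data argument.

```r
# Simulated stand-in with the same column names as the insurance data.
set.seed(1)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 5000)

# Multiple linear regression with all six predictors.
fit <- lm(charges ~ age + sex + bmi + children + smoker + region, data = toy)

# Coefficient table, R-squared and overall F-test.
summary(fit)

# Each factor contributes one coefficient per non-reference level.
names(coef(fit))
```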

5.2 Question 4b

Using the fitted model, interpret the coefficient of smokeryes.
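The name smokeryes comes from the way R codes a factor as indicator (dummy) columns. The short sketch below, on a made-up four-element factor, shows the design matrix that lm() would build behind the scenes.

```r
# A made-up factor with two levels; "no" is first alphabetically,
# so it becomes the reference level.
smoker <- factor(c("no", "yes", "no", "yes"))

# Design matrix for a model with smoker as the only predictor.
X <- model.matrix(~ smoker)
X

# The smokeryes column is 1 for smokers and 0 for non-smokers, so its
# coefficient compares smokers with the reference group (non-smokers),
# holding any other predictors in the model fixed.
X[, "smokeryes"]
```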

5.3 Question 4c

Using the fitted model, interpret the coefficient of age.

6 Model Selection - Forward selection based on p-values

In this section, use forward selection based on p-values.

Start with a null model containing only the intercept.

model_0 <- lm(charges ~ 1, data = df)
summary(model_0)

Call:
lm(formula = charges ~ 1, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-12149  -8530  -3888   3369  50500 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  13270.4      331.1   40.08 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12110 on 1337 degrees of freedom
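In an intercept-only model the estimated intercept is the sample mean of the response, and the residual standard error is its sample standard deviation. The sketch below checks this on a small made-up vector. By the same reasoning, the 13270.4 in the output above is the mean of charges.

```r
# Made-up response values for illustration.
y <- c(1200, 3400, 8900, 15000, 42000)

# Intercept-only (null) model.
fit0 <- lm(y ~ 1)

# The intercept equals the sample mean.
coef(fit0)[["(Intercept)"]]
mean(y)

# The residual standard error equals the sample standard deviation.
summary(fit0)$sigma
sd(y)
```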

6.1 Question 5a

Fit one-predictor models for each candidate variable and compare their p-values.

m_age      <- lm(charges ~ age, data = df)
m_sex      <- lm(charges ~ sex, data = df)
m_bmi      <- lm(charges ~ bmi, data = df)
m_children <- lm(charges ~ children, data = df)
m_smoker   <- lm(charges ~ smoker, data = df)
m_region   <- lm(charges ~ region, data = df)

summary(m_age)

Call:
lm(formula = charges ~ age, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
 -8059  -6671  -5939   5440  47829 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   3165.9      937.1   3.378             0.000751 ***
age            257.7       22.5  11.453 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11560 on 1336 degrees of freedom
Multiple R-squared:  0.08941,   Adjusted R-squared:  0.08872 
F-statistic: 131.2 on 1 and 1336 DF,  p-value: < 0.00000000000000022

summary(m_sex)

Call:
lm(formula = charges ~ sex, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-12835  -8435  -3980   3476  51201 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  12569.6      470.1  26.740 <0.0000000000000002 ***
sexmale       1387.2      661.3   2.098              0.0361 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared:  0.003282,  Adjusted R-squared:  0.002536 
F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

summary(m_bmi)

Call:
lm(formula = charges ~ bmi, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-20956  -8118  -3757   4722  49442 

Coefficients:
            Estimate Std. Error t value          Pr(>|t|)    
(Intercept)  1192.94    1664.80   0.717             0.474    
bmi           393.87      53.25   7.397 0.000000000000246 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared:  0.03934,   Adjusted R-squared:  0.03862 
F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459

summary(m_children)

Call:
lm(formula = charges ~ children, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-11585  -8759  -4071   3468  51248 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  12522.5      446.5  28.049 <0.0000000000000002 ***
children       683.1      274.2   2.491              0.0129 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared:  0.004624,  Adjusted R-squared:  0.003879 
F-statistic: 6.206 on 1 and 1336 DF,  p-value: 0.01285

summary(m_smoker)

Call:
lm(formula = charges ~ smoker, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-19221  -5042   -919   3705  31720 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   8434.3      229.0   36.83 <0.0000000000000002 ***
smokeryes    23616.0      506.1   46.66 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7470 on 1336 degrees of freedom
Multiple R-squared:  0.6198,    Adjusted R-squared:  0.6195 
F-statistic:  2178 on 1 and 1336 DF,  p-value: < 0.00000000000000022

summary(m_region)

Call:
lm(formula = charges ~ region, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-13614  -8463  -3793   3385  49035 

Coefficients:
                Estimate Std. Error t value            Pr(>|t|)    
(Intercept)      13406.4      671.3  19.971 <0.0000000000000002 ***
regionnorthwest   -988.8      948.6  -1.042               0.297    
regionsoutheast   1329.0      922.9   1.440               0.150    
regionsouthwest  -1059.4      948.6  -1.117               0.264    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12080 on 1334 degrees of freedom
Multiple R-squared:  0.006634,  Adjusted R-squared:  0.0044 
F-statistic:  2.97 on 3 and 1334 DF,  p-value: 0.03089

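Which variable should enter first? For region, which is represented by three dummy coefficients, use the p-value of the overall F-test. Where printed p-values are tied at the display limit, compare the test statistics instead.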
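As an alternative to reading six separate summaries, add1() can tabulate the F-test for adding each candidate to the current model in one call. The sketch below runs it on simulated toy data (an assumption for illustration, not the insurance data). For a single-coefficient term this F-test gives the same p-value as the t-test in summary(), and for a multi-level factor such as region it tests all of its dummy coefficients together.

```r
# Simulated stand-in with the same column names as the insurance data.
set.seed(1)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 5000)

# Start from the intercept-only model.
null_fit <- lm(charges ~ 1, data = toy)

# One row per candidate: F statistic and p-value for adding that term.
step1 <- add1(null_fit,
              scope = ~ age + sex + bmi + children + smoker + region,
              test = "F")
step1

# Candidate with the smallest p-value (the "<none>" row has NA and is skipped).
best <- rownames(step1)[which.min(step1[["Pr(>F)"]])]
best
```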

6.2 Question 5b

Write down your final selected model.
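The full procedure can be sketched as a loop: at each step, add the candidate with the smallest p-value, and stop once no remaining candidate falls below a chosen threshold. The threshold alpha = 0.05 and the simulated toy data below are assumptions for illustration. The worksheet does not prescribe a cutoff, and the model selected here says nothing about the real data.

```r
# Simulated stand-in with the same column names as the insurance data.
set.seed(1)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 5000)

candidates <- c("age", "sex", "bmi", "children", "smoker", "region")
alpha <- 0.05                       # assumed entry threshold
current <- lm(charges ~ 1, data = toy)

repeat {
  # Candidates not yet in the model.
  remaining <- setdiff(candidates, attr(terms(current), "term.labels"))
  if (length(remaining) == 0) break

  # F-test p-value for adding each remaining candidate.
  tab <- add1(current, scope = reformulate(remaining), test = "F")
  pvals <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])

  # Stop when no candidate is significant at the chosen threshold.
  if (min(pvals) >= alpha) break

  # Otherwise add the candidate with the smallest p-value and repeat.
  best <- names(which.min(pvals))
  current <- update(current, as.formula(paste(". ~ . +", best)))
}

# Formula of the selected model.
formula(current)
```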

7 Prediction

7.1 Question 6

Predict the insurance charges for the following person:

  • age = 40
  • sex = female
  • bmi = 30
  • children = 2
  • smoker = no
  • region = southeast
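A minimal sketch of the prediction step using predict() with a one-row newdata data frame. The model is fitted on simulated toy data (an assumption for illustration), so the number it returns is not the answer for the real dataset. With your own fitted model, keep the column names and factor levels in newdata identical to those in the training data.

```r
# Simulated stand-in with the same column names as the insurance data.
set.seed(1)
n <- 200
toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 5000)

fit <- lm(charges ~ age + sex + bmi + children + smoker + region, data = toy)

# One row describing the person from the question.
new_person <- data.frame(
  age = 40, sex = "female", bmi = 30,
  children = 2, smoker = "no", region = "southeast"
)

# Point prediction of charges for this person.
pred <- predict(fit, newdata = new_person)
pred

# Same prediction with a 95% prediction interval.
predict(fit, newdata = new_person, interval = "prediction")
```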