Medical Insurance Charges (Questions)

Author

Samuel Lim

Published

March 28, 2026

1 Overview

This notebook uses a real healthcare-related dataset on medical insurance charges.

Your task is to analyse these charges using linear regression.

1.1 Your Task

  1. Import and load a dataset in R.
  2. Identify outcome and predictor variables.
  3. Fit simple and multiple linear regression models, and carry out forward selection based on p-values.
  4. Interpret coefficients, p-values, and model fit statistics.
  5. Make predictions from a fitted regression model.

2 Load the data

2.1 Question 1

Go to the following URL: https://www.kaggle.com/datasets/mirichoi0218/insurance/data

Download the data as a CSV file, load it into R, and print the first few rows.
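As a rough sketch of this step, base R's read.csv() and head() are one way to load and preview the file. The file name insurance.csv and the object name insurance are assumptions. So that the sketch runs without the download, it first writes three made-up rows (not real Kaggle records) to a temporary file.

```r
# Assumption: the Kaggle download is saved as "insurance.csv" in the working
# directory. So that this sketch runs on its own, a temporary file with three
# made-up rows (NOT real Kaggle records) stands in for it here.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,bmi,children,smoker,region,charges",
  "30,female,25.0,0,no,southwest,4000.00",
  "45,male,31.5,2,yes,southeast,30000.00",
  "52,female,28.2,1,no,northeast,11000.00"
), csv_path)

# With the real download you would instead set:
# csv_path <- "insurance.csv"
insurance <- read.csv(csv_path)

head(insurance)  # first six rows by default (only three exist here)
str(insurance)   # column names and data types
```

With the real file, only the read.csv() and head() calls are needed.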

Click on this link for answers to question 1: https://rpubs.com/Samuelllim/Q1

3 Data preparation

3.1 Question 2a

Which variables are categorical?

3.2 Question 2b

What should you do with the data type of these variables before fitting a regression model?
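One common preparation step in R, sketched here on made-up rows, is to convert text columns to factors so that lm() treats them as categories. The data frame toy and its values are hypothetical; which columns to convert in the real data is for you to decide in Question 2a.

```r
# Made-up rows for illustration only (not taken from the Kaggle file)
toy <- data.frame(
  bmi    = c(22.5, 31.0, 27.4, 35.2),
  smoker = c("no", "yes", "no", "yes"),
  region = c("southeast", "northwest", "southeast", "northeast"),
  stringsAsFactors = FALSE
)

# Identify the character columns, then convert each one to a factor
is_text <- sapply(toy, is.character)
toy[is_text] <- lapply(toy[is_text], factor)

str(toy)            # smoker and region are now factors; bmi stays numeric
levels(toy$smoker)  # levels sort alphabetically, so "no" comes first
```

The first level of a factor acts as the reference category under R's default coding, which matters later when interpreting coefficients.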

3.3 Question 2c

Which is the response variable?

Click on this link for answers to question 2: https://rpubs.com/Samuelllim/Q2

4 Simple linear regression

4.1 Question 3a

Fit a simple linear regression model with charges as the response and bmi as the predictor.
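A minimal sketch of the fitting call is shown below. It uses simulated stand-in data (made-up values with an arbitrary slope), so its numbers say nothing about the real dataset. The object name fit_bmi is an assumption.

```r
# Simulated stand-in data (made-up values, NOT the Kaggle file), so the
# sketch runs on its own; with the real data, skip this part.
set.seed(1)
n <- 50
insurance <- data.frame(bmi = round(runif(n, 18, 40), 1))
insurance$charges <- 1000 + 400 * insurance$bmi + rnorm(n, sd = 3000)

# Response on the left of ~, predictor on the right
fit_bmi <- lm(charges ~ bmi, data = insurance)

summary(fit_bmi)  # coefficient table, p-values, R-squared
coef(fit_bmi)     # intercept and slope only
```

The bmi row of the summary table holds the estimated slope, its standard error, and its p-value.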

4.2 Question 3b

Interpret the coefficient of bmi.

Click on this link for answers to question 3: https://rpubs.com/Samuelllim/Q3

5 Multiple linear regression

5.1 Question 4a

Fit a multiple linear regression model with charges as the response and age, sex, bmi, children, smoker and region as predictors.
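The model formula for this fit could be sketched as follows. The data are simulated stand-ins (made-up values with arbitrary effects), and the object name fit_full is an assumption; only the lm() call carries over to the real data.

```r
# Simulated stand-in data (made-up values, NOT the Kaggle file)
set.seed(2)
cats <- expand.grid(
  sex    = c("female", "male"),
  smoker = c("no", "yes"),
  region = c("northeast", "northwest", "southeast", "southwest")
)
insurance <- cats[rep(seq_len(nrow(cats)), times = 3), ]  # 48 rows
n <- nrow(insurance)
insurance$age      <- sample(18:64, n, replace = TRUE)
insurance$bmi      <- round(runif(n, 18, 40), 1)
insurance$children <- sample(0:3, n, replace = TRUE)
insurance$charges  <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# All six predictors on the right-hand side of the formula
fit_full <- lm(charges ~ age + sex + bmi + children + smoker + region,
               data = insurance)

summary(fit_full)  # one row per numeric predictor and per non-reference level
```

Each factor contributes one coefficient per non-reference level, so region with four levels yields three rows in the table.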

5.2 Question 4b

Using the fitted model, interpret the coefficient of smokeryes.
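The name smokeryes comes from how R codes factors: it pastes the variable name onto the level name and leaves out the reference level. The short sketch below, on made-up values, shows this under R's default treatment coding with alphabetically ordered levels.

```r
# Made-up smoker values for illustration only
smoker <- factor(c("no", "yes", "no", "yes", "no"))

levels(smoker)                    # "no" "yes": the first level is the reference
colnames(model.matrix(~ smoker))  # "(Intercept)" "smokeryes"
```

The coefficient therefore compares the "yes" group with the "no" group, with the other predictors in the model held fixed.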

5.3 Question 4c

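Using the fitted model, interpret the coefficient of age.

Click on this link for answers to question 4: https://rpubs.com/Samuelllim/Q4

6 Model Selection - Forward selection based on p-values

In this section, build a model by forward selection based on p-values: start from the intercept-only model and, at each step, add the candidate predictor with the smallest p-value, stopping when no remaining candidate is statistically significant.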

6.1 Question 5a

Fit one-predictor models for each candidate variable and compare their p-values.

Which variable should enter first?
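One way to run this first step is base R's add1(), which tries each candidate against the intercept-only model and reports an F-test p-value per variable. For a numeric or two-level predictor this matches the t-test p-value from the one-predictor model. The sketch uses simulated stand-in data with arbitrary effects, so the variable it picks says nothing about the real dataset; the names fit_null, step1 and first_in are assumptions.

```r
# Simulated stand-in data (made-up values, NOT the Kaggle file)
set.seed(3)
cats <- expand.grid(
  sex    = c("female", "male"),
  smoker = c("no", "yes"),
  region = c("northeast", "northwest", "southeast", "southwest")
)
insurance <- cats[rep(seq_len(nrow(cats)), times = 3), ]
n <- nrow(insurance)
insurance$age      <- sample(18:64, n, replace = TRUE)
insurance$bmi      <- round(runif(n, 18, 40), 1)
insurance$children <- sample(0:3, n, replace = TRUE)
insurance$charges  <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# Step 0: intercept-only model
fit_null <- lm(charges ~ 1, data = insurance)

# Try adding each candidate on its own; test = "F" gives one p-value per
# variable, including multi-level factors such as region
step1 <- add1(fit_null,
              scope = ~ age + sex + bmi + children + smoker + region,
              test = "F")
step1

# Candidate with the smallest p-value (the "<none>" row is NA and is skipped)
p_values <- step1[["Pr(>F)"]]
names(p_values) <- rownames(step1)
first_in <- names(which.min(p_values))
first_in
```

For later steps, update the model with the chosen variable, for example update(fit_null, . ~ . + bmi), and call add1() again on the remaining candidates.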

6.2 Question 5b

Write down your final selected model.

Click on this link for answers to question 5: https://rpubs.com/Samuelllim/Q5

7 Prediction

7.1 Question 6

Predict the insurance charges for the following person:

  • age = 40
  • sex = female
  • bmi = 30
  • children = 2
  • smoker = no
  • region = southeast
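A prediction for this person could be sketched with predict() and a one-row data frame whose column names match the predictors. The sketch fits the full model from Question 4a on simulated stand-in data (made-up values), so the number it returns is not the answer; if your selected model from Question 5b differs, use that model instead. The names fit_full, new_person and pred are assumptions.

```r
# Simulated stand-in data (made-up values, NOT the Kaggle file)
set.seed(4)
cats <- expand.grid(
  sex    = c("female", "male"),
  smoker = c("no", "yes"),
  region = c("northeast", "northwest", "southeast", "southwest")
)
insurance <- cats[rep(seq_len(nrow(cats)), times = 3), ]
n <- nrow(insurance)
insurance$age      <- sample(18:64, n, replace = TRUE)
insurance$bmi      <- round(runif(n, 18, 40), 1)
insurance$children <- sample(0:3, n, replace = TRUE)
insurance$charges  <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

fit_full <- lm(charges ~ age + sex + bmi + children + smoker + region,
               data = insurance)

# The person described above: one row, same column names as the predictors
new_person <- data.frame(
  age = 40, sex = "female", bmi = 30, children = 2,
  smoker = "no", region = "southeast"
)

pred <- predict(fit_full, newdata = new_person)  # point prediction
pred
predict(fit_full, newdata = new_person, interval = "prediction")  # 95% interval
```

The category values in new_person must be spelled exactly as the levels in the fitted data, otherwise predict() stops with an error about new factor levels.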

Click on this link for answers to question 6: https://rpubs.com/Samuelllim/Q6