Medical Insurance Charges (Answers)

Author

Samuel Lim

Published

March 28, 2026

1 Overview

This notebook uses a real healthcare-related dataset on medical insurance charges.

Your task is to analyse what drives these charges using linear regression.

1.1 Your Task

  1. Import and load a dataset in R.
  2. Identify outcome and predictor variables.
  3. Fit simple and multiple linear regression models, and carry out forward selection based on p-values.
  4. Interpret coefficients, p-values, and model fit statistics.
  5. Make predictions from a fitted regression model.

2 Load the data

2.1 Question 1:

Go to https://www.kaggle.com/datasets/mirichoi0218/insurance/data and download the data as a CSV file.

Load the file into R and print the first few rows.

Ans: Check that your data set matches the output shown below.

# Change this path to wherever you saved insurance.csv on your machine.
df <- read.csv("~/Documents/Projects/Regression Practice/insurance.csv")
head(df)
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
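As a quick sanity check after loading, it can help to confirm the column names and types before going further. The sketch below is illustrative only: it rebuilds the six rows printed above from inline text (the names `csv_text`, `df_head` and `expected_cols` are made up for this example), so it runs without the downloaded file. With the real file you would run the same checks on `df`.

```r
# The six rows shown in the head(df) output above, as inline CSV text.
csv_text <- "age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.924
18,male,33.770,1,no,southeast,1725.552
28,male,33.000,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.471
32,male,28.880,0,no,northwest,3866.855
31,female,25.740,0,no,southeast,3756.622"

# stringsAsFactors = FALSE keeps text columns as character, which is the
# default in R 4.0 and later; it is spelled out here to be explicit.
df_head <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Check that all seven expected columns are present, in order.
expected_cols <- c("age", "sex", "bmi", "children", "smoker", "region", "charges")
stopifnot(identical(names(df_head), expected_cols))

# Show the storage type of each column.
sapply(df_head, class)
```

The text columns come back as character here, which is what Question 2b asks you to deal with.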

3 Data preparation

3.1 Question 2a:

Which variables are categorical?

Ans: Three variables are categorical: sex, smoker and region.

3.2 Question 2b:

What should you do with the data type of these variables before fitting a regression model?

Ans: These variables are currently stored as character vectors. We convert them to factors before modelling, so that R treats them as categorical variables with a defined set of levels.

df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)

str(df)
'data.frame':   1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
 $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
 $ charges : num  16885 1726 4449 21984 3867 ...
summary(df)
      age            sex           bmi           children     smoker    
 Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
 1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
 Median :39.00                Median :30.40   Median :1.000             
 Mean   :39.21                Mean   :30.66   Mean   :1.095             
 3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
 Max.   :64.00                Max.   :53.13   Max.   :5.000             
       region       charges     
 northeast:324   Min.   : 1122  
 northwest:325   1st Qu.: 4740  
 southeast:364   Median : 9382  
 southwest:325   Mean   :13270  
                 3rd Qu.:16640  
                 Max.   :63770  
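The three `as.factor()` calls above can also be written as a single step that converts every character column at once. This is a minimal sketch on a small made-up data frame (`df_demo` is hypothetical and holds only a few values taken from the rows shown earlier); it is an alternative style, not the method used in this notebook. It also shows how to inspect and change the reference level, which matters later because each factor coefficient is interpreted relative to that level.

```r
# A small stand-in data frame; the real data would be df from read.csv().
df_demo <- data.frame(
  sex    = c("female", "male", "male"),
  smoker = c("yes", "no", "no"),
  region = c("southwest", "southeast", "southeast"),
  bmi    = c(27.9, 33.77, 33),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors.
is_chr <- vapply(df_demo, is.character, logical(1))
df_demo[is_chr] <- lapply(df_demo[is_chr], as.factor)

# Levels are sorted alphabetically by default; the first one is the
# reference level that lm() uses as the baseline.
levels(df_demo$smoker)

# relevel() chooses a different baseline if that makes interpretation easier.
df_demo$region <- relevel(df_demo$region, ref = "southwest")
levels(df_demo$region)
```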

3.3 Question 2c:

Which is the response variable?

Ans: Based on the task described in the Overview, the response variable is charges.

4 Simple linear regression

4.1 Question 3a

Fit a simple linear regression model with charges as the response and bmi as the predictor.

model_simple <- lm(charges ~ bmi, data = df)
summary(model_simple)

Call:
lm(formula = charges ~ bmi, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-20956  -8118  -3757   4722  49442 

Coefficients:
            Estimate Std. Error t value          Pr(>|t|)    
(Intercept)  1192.94    1664.80   0.717             0.474    
bmi           393.87      53.25   7.397 0.000000000000246 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared:  0.03934,   Adjusted R-squared:  0.03862 
F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459
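To see how the numbers in this summary relate to one another, the sketch below recomputes several of them from the reported estimate, standard error and residual degrees of freedom. The values are copied from the output above and the variable names are made up for illustration. Because the inputs are already rounded, the results should agree with the printed output only up to rounding.

```r
# Values copied from the summary(model_simple) output above.
estimate  <- 393.87   # bmi coefficient
std_error <- 53.25    # standard error of the bmi coefficient
df_resid  <- 1336     # residual degrees of freedom

# t value: the estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# With a single predictor, the F statistic equals the squared t value,
# and R-squared can be recovered from F and the residual df.
f_stat    <- t_value^2
r_squared <- f_stat / (f_stat + df_resid)

# Approximate 95% confidence interval for the bmi coefficient.
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

round(t_value, 3)
round(f_stat, 2)
round(r_squared, 5)
p_value
ci_95
```

With the fitted model object available, `confint(model_simple)` gives the confidence interval directly, without the rounding in the inputs used here.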

4.2 Question 3b

Interpret the coefficient of bmi in plain language.

Ans: The p-value for bmi is far below 0.05, so at the 5% significance level BMI is significantly associated with medical insurance charges. The estimated coefficient is 393.87, which means that each one-unit increase in BMI is associated with an increase of about 393.87 dollars in charges, on average. The R-squared of 0.039 shows that BMI alone explains only about 4% of the variation in charges.
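The interpretation can be checked with a small worked example: compare the fitted charges for two people whose BMI differs by one unit. The sketch below uses the intercept and slope reported in the model summary. The helper `predict_charges()` and the BMI values 30 and 31 are made up for illustration.

```r
# Coefficients copied from the summary(model_simple) output.
intercept <- 1192.94
slope_bmi <- 393.87

# Hypothetical helper: fitted charges for a given BMI under the simple model.
predict_charges <- function(bmi) {
  intercept + slope_bmi * bmi
}

charges_30 <- predict_charges(30)
charges_31 <- predict_charges(31)

# The difference between the two fitted values equals the bmi coefficient.
charges_31 - charges_30
```

With the fitted model object available, the same fitted values would come from `predict(model_simple, newdata = data.frame(bmi = c(30, 31)))`, up to the rounding of the coefficients used here.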