Medical Insurance Charges (Answers)

Author

Samuel Lim

Published

March 28, 2026

1 Overview

This notebook uses a real healthcare-related dataset on medical insurance charges.

You will be tasked to conduct analysis on the medical insurance charges.

1.1 Your Task

Import and load a dataset in R.
Identify outcome and predictor variables.
Fit simple and multiple linear regression models & carry out forward selection based on p-values.
Interpret coefficients, p-values, and model fit statistics.
Make predictions from a fitted regression model.

2 Load the data

2.1 Question 1:

Go to the following URL (https://www.kaggle.com/datasets/mirichoi0218/insurance/data)

Download the data in csv and print out the first few rows of the data.

Click on this link for answers to question 1

check that you have the data set as shown below.

df <- read.csv("~/Documents/Projects/Regression Practice/insurance.csv")
head(df)

  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622

3 Data preparation

3.1 Question 2a:

List the variables that are categorical?

Ans: There are three variables that are categorical. sex, smoker and region.

3.2 Question 2b:

What should you do with the data type of these variables before fitting a regression model?

Ans: These data are currently in data type characters. We would have to make them into a factor before we can do any modelling.

df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)

str(df)

'data.frame':   1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
 $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
 $ charges : num  16885 1726 4449 21984 3867 ...

summary(df)

      age            sex           bmi           children     smoker    
 Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
 1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
 Median :39.00                Median :30.40   Median :1.000             
 Mean   :39.21                Mean   :30.66   Mean   :1.095             
 3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
 Max.   :64.00                Max.   :53.13   Max.   :5.000             
       region       charges     
 northeast:324   Min.   : 1122  
 northwest:325   1st Qu.: 4740  
 southeast:364   Median : 9382  
 southwest:325   Mean   :13270  
                 3rd Qu.:16640  
                 Max.   :63770

3.3 Question 2c:

Which is the response variable?

Ans: Based on the question in overview, our response variable should be charges.

4 Simple linear regression

4.1 Question 3a

Fit a simple linear regression model with charges as the response and bmi as the predictor.

model_simple <- lm(charges ~ bmi, data = df)
summary(model_simple)


Call:
lm(formula = charges ~ bmi, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-20956  -8118  -3757   4722  49442 

Coefficients:
            Estimate Std. Error t value          Pr(>|t|)    
(Intercept)  1192.94    1664.80   0.717             0.474    
bmi           393.87      53.25   7.397 0.000000000000246 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared:  0.03934,   Adjusted R-squared:  0.03862 
F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459

4.2 Question 3b

Interpret the coefficient of bmi in plain language.

Based on the output from the model, at 95% significance level, BMI of the insured is significantly associated with the medical insurance charges. The coefficient of BMI is 393.296, which means that for every unit increase in BMI, the medical insurance charges increases by 393.296 dollars, on average, holding all other variables constant.

5 Multiple linear regression

5.1 Question 4a

Fit the following multiple linear regression model with age, sex, bmi, children, smoker and region as predictors.

model_full <- lm(charges ~ age + sex + bmi + children + smoker + region, data = df)
summary(model_full)


Call:
lm(formula = charges ~ age + sex + bmi + children + smoker + 
    region, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11304.9  -2848.1   -982.1   1393.9  29992.8 

Coefficients:
                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)     -11938.5      987.8 -12.086 < 0.0000000000000002 ***
age                256.9       11.9  21.587 < 0.0000000000000002 ***
sexmale           -131.3      332.9  -0.394             0.693348    
bmi                339.2       28.6  11.860 < 0.0000000000000002 ***
children           475.5      137.8   3.451             0.000577 ***
smokeryes        23848.5      413.1  57.723 < 0.0000000000000002 ***
regionnorthwest   -353.0      476.3  -0.741             0.458769    
regionsoutheast  -1035.0      478.7  -2.162             0.030782 *  
regionsouthwest   -960.0      477.9  -2.009             0.044765 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494 
F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 0.00000000000000022

5.2 Question 4b

Using the fitted model, interpret the coefficient of smokeryes.

Based on the output from the model, at 95% significance level, being a smoker is significantly associated with the medical insurance charges. The coefficient of smokeryes is 23847.268, which means that being a smoker increases the medical insurance charges by 23847.268 dollars, on average, holding all other variables constant.

5.3 Question 4c

Using the fitted model, interpret the coefficient of age.

Based on the output from the model, at 95% significance level, age of the insured is significantly associated with the medical insurance charges. The coefficient of age is 256.854, which means that for every year increase in age, the medical insurance charges increases by 256.854 dollars, on average, holding all other variables constant.

6 Model Selection - Forward selection based on p-values

In this section, use forward selection based on p-values.

Start with a null model containing only the intercept.

model_0 <- lm(charges ~ 1, data = df)
summary(model_0)


Call:
lm(formula = charges ~ 1, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-12149  -8530  -3888   3369  50500 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  13270.4      331.1   40.08 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12110 on 1337 degrees of freedom

6.1 Question 5a

Fit one-predictor models for each candidate variable and compare their p-values.

m_age      <- lm(charges ~ age, data = df)
m_sex      <- lm(charges ~ sex, data = df)
m_bmi      <- lm(charges ~ bmi, data = df)
m_children <- lm(charges ~ children, data = df)
m_smoker   <- lm(charges ~ smoker, data = df)
m_region   <- lm(charges ~ region, data = df)

summary(m_age)


Call:
lm(formula = charges ~ age, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
 -8059  -6671  -5939   5440  47829 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   3165.9      937.1   3.378             0.000751 ***
age            257.7       22.5  11.453 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11560 on 1336 degrees of freedom
Multiple R-squared:  0.08941,   Adjusted R-squared:  0.08872 
F-statistic: 131.2 on 1 and 1336 DF,  p-value: < 0.00000000000000022

summary(m_sex)


Call:
lm(formula = charges ~ sex, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-12835  -8435  -3980   3476  51201 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  12569.6      470.1  26.740 <0.0000000000000002 ***
sexmale       1387.2      661.3   2.098              0.0361 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared:  0.003282,  Adjusted R-squared:  0.002536 
F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

summary(m_bmi)


Call:
lm(formula = charges ~ bmi, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-20956  -8118  -3757   4722  49442 

Coefficients:
            Estimate Std. Error t value          Pr(>|t|)    
(Intercept)  1192.94    1664.80   0.717             0.474    
bmi           393.87      53.25   7.397 0.000000000000246 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11870 on 1336 degrees of freedom
Multiple R-squared:  0.03934,   Adjusted R-squared:  0.03862 
F-statistic: 54.71 on 1 and 1336 DF,  p-value: 0.0000000000002459

summary(m_children)


Call:
lm(formula = charges ~ children, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-11585  -8759  -4071   3468  51248 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  12522.5      446.5  28.049 <0.0000000000000002 ***
children       683.1      274.2   2.491              0.0129 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared:  0.004624,  Adjusted R-squared:  0.003879 
F-statistic: 6.206 on 1 and 1336 DF,  p-value: 0.01285

summary(m_smoker)


Call:
lm(formula = charges ~ smoker, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-19221  -5042   -919   3705  31720 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   8434.3      229.0   36.83 <0.0000000000000002 ***
smokeryes    23616.0      506.1   46.66 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7470 on 1336 degrees of freedom
Multiple R-squared:  0.6198,    Adjusted R-squared:  0.6195 
F-statistic:  2178 on 1 and 1336 DF,  p-value: < 0.00000000000000022

summary(m_region)


Call:
lm(formula = charges ~ region, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-13614  -8463  -3793   3385  49035 

Coefficients:
                Estimate Std. Error t value            Pr(>|t|)    
(Intercept)      13406.4      671.3  19.971 <0.0000000000000002 ***
regionnorthwest   -988.8      948.6  -1.042               0.297    
regionsoutheast   1329.0      922.9   1.440               0.150    
regionsouthwest  -1059.4      948.6  -1.117               0.264    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12080 on 1334 degrees of freedom
Multiple R-squared:  0.006634,  Adjusted R-squared:  0.0044 
F-statistic:  2.97 on 3 and 1334 DF,  p-value: 0.03089

Which variable should enter first?

Ans: The variable that should enter first is smoker. You can see that the model with smoker has the lowest p-value.

6.2 Question 5b

Write down your final selected model.

Assume smoker is selected first. Fit the model with smoker only.

model_1 <- lm(charges ~ smoker, data = df)
summary(model_1)


Call:
lm(formula = charges ~ smoker, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-19221  -5042   -919   3705  31720 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   8434.3      229.0   36.83 <0.0000000000000002 ***
smokeryes    23616.0      506.1   46.66 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7470 on 1336 degrees of freedom
Multiple R-squared:  0.6198,    Adjusted R-squared:  0.6195 
F-statistic:  2178 on 1 and 1336 DF,  p-value: < 0.00000000000000022

Now test each remaining variable one at a time after including smoker.

m1_age      <- lm(charges ~ smoker + age, data = df)
m1_sex      <- lm(charges ~ smoker + sex, data = df)
m1_bmi      <- lm(charges ~ smoker + bmi, data = df)
m1_children <- lm(charges ~ smoker + children, data = df)
m1_region   <- lm(charges ~ smoker + region, data = df)

summary(m1_age)


Call:
lm(formula = charges ~ smoker + age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-16088.1  -2046.8  -1336.4   -212.7  28760.0 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -2391.63     528.30  -4.527           0.00000652 ***
smokeryes   23855.30     433.49  55.031 < 0.0000000000000002 ***
age           274.87      12.46  22.069 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6397 on 1335 degrees of freedom
Multiple R-squared:  0.7214,    Adjusted R-squared:  0.721 
F-statistic:  1728 on 2 and 1335 DF,  p-value: < 0.00000000000000022

summary(m1_sex)


Call:
lm(formula = charges ~ smoker + sex, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-19193  -5074   -909   3739  31682 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  8466.04     303.54   27.89 <0.0000000000000002 ***
smokeryes   23622.13     507.74   46.52 <0.0000000000000002 ***
sexmale       -65.38     409.81   -0.16               0.873    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7473 on 1335 degrees of freedom
Multiple R-squared:  0.6198,    Adjusted R-squared:  0.6192 
F-statistic:  1088 on 2 and 1335 DF,  p-value: < 0.00000000000000022

summary(m1_bmi)


Call:
lm(formula = charges ~ smoker + bmi, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-15992.7  -4600.2   -802.4   3636.2  30677.8 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -3459.10     998.28  -3.465             0.000547 ***
smokeryes   23593.98     480.18  49.136 < 0.0000000000000002 ***
bmi           388.02      31.79  12.207 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7088 on 1335 degrees of freedom
Multiple R-squared:  0.6579,    Adjusted R-squared:  0.6574 
F-statistic:  1284 on 2 and 1335 DF,  p-value: < 0.00000000000000022

summary(m1_children)


Call:
lm(formula = charges ~ smoker + children, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-19773  -5013  -1199   3902  32413 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   7755.7      292.9   26.48 < 0.0000000000000002 ***
smokeryes    23601.7      503.7   46.85 < 0.0000000000000002 ***
children       622.4      168.7    3.69             0.000233 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7435 on 1335 degrees of freedom
Multiple R-squared:  0.6236,    Adjusted R-squared:  0.623 
F-statistic:  1106 on 2 and 1335 DF,  p-value: < 0.00000000000000022

summary(m1_region)


Call:
lm(formula = charges ~ smoker + region, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-19268  -5041   -795   3746  31362 

Coefficients:
                Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       8533.5      428.2  19.927 <0.0000000000000002 ***
smokeryes        23564.5      507.7  46.417 <0.0000000000000002 ***
regionnorthwest   -321.3      586.9  -0.547               0.584    
regionsoutheast    310.8      571.2   0.544               0.586    
regionsouthwest   -391.9      586.9  -0.668               0.504    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7473 on 1333 degrees of freedom
Multiple R-squared:  0.6203,    Adjusted R-squared:  0.6192 
F-statistic: 544.4 on 4 and 1333 DF,  p-value: < 0.00000000000000022

Determine the next variable to add in. In this case, age should be added next as variable as it has the smallest p-value.

Now assume smoker and age are selected. Now test each remaining variable one at a time.

m2_sex      <- lm(charges ~ smoker + age + sex, data = df)
m2_bmi      <- lm(charges ~ smoker + age + bmi, data = df)
m2_children <- lm(charges ~ smoker + age + children, data = df)
m2_region   <- lm(charges ~ smoker + age + region, data = df)

summary(m2_sex)


Call:
lm(formula = charges ~ smoker + age + sex, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-16122.4  -2048.5  -1318.9   -228.2  28725.3 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -2433.56     558.26  -4.359            0.0000141 ***
smokeryes   23847.63     434.89  54.836 < 0.0000000000000002 ***
age           274.93      12.46  22.061 < 0.0000000000000002 ***
sexmale        81.82     350.98   0.233                0.816    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6399 on 1334 degrees of freedom
Multiple R-squared:  0.7214,    Adjusted R-squared:  0.7208 
F-statistic:  1151 on 3 and 1334 DF,  p-value: < 0.00000000000000022

summary(m2_bmi)


Call:
lm(formula = charges ~ smoker + age + bmi, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-12415.4  -2970.9   -980.5   1480.0  28971.8 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept) -11676.83     937.57  -12.45 <0.0000000000000002 ***
smokeryes    23823.68     412.87   57.70 <0.0000000000000002 ***
age            259.55      11.93   21.75 <0.0000000000000002 ***
bmi            322.62      27.49   11.74 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6092 on 1334 degrees of freedom
Multiple R-squared:  0.7475,    Adjusted R-squared:  0.7469 
F-statistic:  1316 on 3 and 1334 DF,  p-value: < 0.00000000000000022

summary(m2_children)


Call:
lm(formula = charges ~ smoker + age + children, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-15547.3  -1941.4  -1319.1   -425.3  29313.3 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -2851.99     543.78  -5.245          0.000000182 ***
smokeryes   23842.60     431.84  55.212 < 0.0000000000000002 ***
age           273.09      12.42  21.990 < 0.0000000000000002 ***
children      486.65     144.70   3.363             0.000792 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6372 on 1334 degrees of freedom
Multiple R-squared:  0.7237,    Adjusted R-squared:  0.7231 
F-statistic:  1165 on 3 and 1334 DF,  p-value: < 0.00000000000000022

summary(m2_region)


Call:
lm(formula = charges ~ smoker + age + region, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-15695.5  -2069.1  -1292.6   -211.1  28563.2 

Coefficients:
                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)     -2317.56     612.90  -3.781             0.000163 ***
smokeryes       23797.23     434.61  54.755 < 0.0000000000000002 ***
age               275.10      12.45  22.089 < 0.0000000000000002 ***
regionnorthwest  -294.97     502.27  -0.587             0.557117    
regionsoutheast   391.25     488.88   0.800             0.423680    
regionsouthwest  -436.71     502.27  -0.869             0.384744    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6396 on 1332 degrees of freedom
Multiple R-squared:  0.7221,    Adjusted R-squared:  0.7211 
F-statistic: 692.2 on 5 and 1332 DF,  p-value: < 0.00000000000000022

Determine the next variable to add in. In this case, bmi should be added next as variable as it has the smallest p-value.

Now assume smoker, age and bmi are selected. Now test each remaining variable one at a time.

m3_sex      <- lm(charges ~ smoker + age + bmi + sex, data = df)
m3_children <- lm(charges ~ smoker + age + bmi + children, data = df)
m3_region   <- lm(charges ~ smoker + age + bmi + region, data = df)

summary(m3_sex)


Call:
lm(formula = charges ~ smoker + age + bmi + sex, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-12364.7  -2972.2   -983.2   1475.8  29018.3 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept) -11633.49     947.27 -12.281 <0.0000000000000002 ***
smokeryes    23833.87     414.19  57.544 <0.0000000000000002 ***
age            259.45      11.94  21.727 <0.0000000000000002 ***
bmi            323.05      27.53  11.735 <0.0000000000000002 ***
sexmale       -109.04     334.66  -0.326               0.745    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6094 on 1333 degrees of freedom
Multiple R-squared:  0.7475,    Adjusted R-squared:  0.7467 
F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 0.00000000000000022

summary(m3_children)


Call:
lm(formula = charges ~ smoker + age + bmi + children, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11897.9  -2920.8   -986.6   1392.2  29509.6 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -12102.77     941.98 -12.848 < 0.0000000000000002 ***
smokeryes    23811.40     411.22  57.904 < 0.0000000000000002 ***
age            257.85      11.90  21.675 < 0.0000000000000002 ***
bmi            321.85      27.38  11.756 < 0.0000000000000002 ***
children       473.50     137.79   3.436             0.000608 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6068 on 1333 degrees of freedom
Multiple R-squared:  0.7497,    Adjusted R-squared:  0.7489 
F-statistic: 998.1 on 4 and 1333 DF,  p-value: < 0.00000000000000022

summary(m3_region)


Call:
lm(formula = charges ~ smoker + age + bmi + region, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-11905  -3041  -1000   1542  29419 

Coefficients:
                 Estimate Std. Error t value            Pr(>|t|)    
(Intercept)     -11601.56     976.20 -11.884 <0.0000000000000002 ***
smokeryes        23852.48     413.51  57.683 <0.0000000000000002 ***
age                258.64      11.93  21.680 <0.0000000000000002 ***
bmi                340.01      28.67  11.858 <0.0000000000000002 ***
regionnorthwest   -303.52     477.85  -0.635              0.5254    
regionsoutheast  -1038.63     480.49  -2.162              0.0308 *  
regionsouthwest   -915.94     479.56  -1.910              0.0564 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6085 on 1331 degrees of freedom
Multiple R-squared:  0.7487,    Adjusted R-squared:  0.7475 
F-statistic: 660.8 on 6 and 1331 DF,  p-value: < 0.00000000000000022

Determine the next variable to add in. In this case, children should be added next as variable as it has the smallest p-value.

Now assume smoker, age, bmi and children are selected. We will again test each remaining variable one at a time.

m4_sex      <- lm(charges ~ smoker + age + bmi + children + sex, data = df)
m4_region   <- lm(charges ~ smoker + age + bmi + children + region, data = df)

summary(m4_sex)


Call:
lm(formula = charges ~ smoker + age + bmi + children + sex, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11837.2  -2916.7   -994.2   1375.3  29565.5 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -12052.46     951.26 -12.670 < 0.0000000000000002 ***
smokeryes    23823.39     412.52  57.750 < 0.0000000000000002 ***
age            257.73      11.90  21.651 < 0.0000000000000002 ***
bmi            322.36      27.42  11.757 < 0.0000000000000002 ***
children       474.41     137.86   3.441             0.000597 ***
sexmale       -128.64     333.36  -0.386             0.699641    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6070 on 1332 degrees of freedom
Multiple R-squared:  0.7497,    Adjusted R-squared:  0.7488 
F-statistic:   798 on 5 and 1332 DF,  p-value: < 0.00000000000000022

summary(m4_region)


Call:
lm(formula = charges ~ smoker + age + bmi + children + region, 
    data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11367.2  -2835.4   -979.7   1361.9  29935.5 

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)     -11990.27     978.76 -12.250 < 0.0000000000000002 ***
smokeryes        23836.30     411.86  57.875 < 0.0000000000000002 ***
age                256.97      11.89  21.610 < 0.0000000000000002 ***
bmi                338.66      28.56  11.858 < 0.0000000000000002 ***
children           474.57     137.74   3.445             0.000588 ***
regionnorthwest   -352.18     476.12  -0.740             0.459618    
regionsoutheast  -1034.36     478.54  -2.162             0.030834 *  
regionsouthwest   -959.37     477.78  -2.008             0.044846 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6060 on 1330 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7496 
F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 0.00000000000000022

Determine the next variable to add in.

In this case, region would be a borderline addition as it is significant only in two of the regions.

One could also think about whether it makes to do some form of feature engineering by combining the regions into fewer categories.

In any case, now we have smoker, age, bmi, children and region. We will again test each remaining variable one at a time.

m5_sex      <- lm(charges ~ smoker + age + bmi + children + region + sex, data = df)

summary(m5_sex)


Call:
lm(formula = charges ~ smoker + age + bmi + children + region + 
    sex, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11304.9  -2848.1   -982.1   1393.9  29992.8 

Coefficients:
                Estimate Std. Error t value             Pr(>|t|)    
(Intercept)     -11938.5      987.8 -12.086 < 0.0000000000000002 ***
smokeryes        23848.5      413.1  57.723 < 0.0000000000000002 ***
age                256.9       11.9  21.587 < 0.0000000000000002 ***
bmi                339.2       28.6  11.860 < 0.0000000000000002 ***
children           475.5      137.8   3.451             0.000577 ***
regionnorthwest   -353.0      476.3  -0.741             0.458769    
regionsoutheast  -1035.0      478.7  -2.162             0.030782 *  
regionsouthwest   -960.0      477.9  -2.009             0.044765 *  
sexmale           -131.3      332.9  -0.394             0.693348    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494 
F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 0.00000000000000022

The p-value for gender is above 0.05, which means that it is not significant at 95% confidence level. We will stop here and our final model will be the one with smoker, age, bmi, children and region as predictors.

Final Model - charges ~ smoker + age + bmi + children + region