Introduction:
For this analysis, I am using the insurance dataset. It comes from the GitHub repository for Brett Lantz's book "Machine Learning with R". The dataset consists of 7 variables: age, sex, BMI, children, smoker, region, and charges.
First, I loaded the data into R:
# Load the data
insurance <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
head(insurance)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
In my prior analysis, I fit a Gamma regression model to determine whether certain predictors have an effect on insurance charges; specifically, how age, BMI, and smoking status affect charges.
# Convert "smoker" into a binary factor
insurance$smoker <- as.factor(insurance$smoker)
# Fit the Gamma Regression Model
m_gama <- glm(charges ~ age + bmi + smoker,
              family = Gamma(link = "log"),
              data = insurance)
summary(m_gama)
##
## Call:
## glm(formula = charges ~ age + bmi + smoker, family = Gamma(link = "log"),
## data = insurance)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.444595 0.105187 70.775 < 2e-16 ***
## age 0.028426 0.001339 21.230 < 2e-16 ***
## bmi 0.012131 0.003084 3.934 8.79e-05 ***
## smokeryes 1.470576 0.046320 31.748 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.4671822)
##
## Null deviance: 1056.04 on 1337 degrees of freedom
## Residual deviance: 357.41 on 1334 degrees of freedom
## AIC: 26448
##
## Number of Fisher Scoring iterations: 7
Some of the key findings from that previous analysis were:

- The p-values for age, BMI, and smoking status are all well below 0.05, indicating that each of these predictors has a statistically significant effect on insurance charges.
- Smoking status has by far the largest effect on insurance charges, followed by age and BMI (see the sketch after this list).
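Because the model uses a log link, the coefficients are on the log scale. As a quick illustration (a minimal sketch using the m_gama object fitted above; the numbers in the comments are rounded from the coefficients shown), exponentiating the coefficients turns them into multiplicative effects on expected charges:

# Coefficients are on the log scale, so exponentiating gives multiplicative
# effects on expected charges
round(exp(coef(m_gama)), 3)
# exp(1.4706) is roughly 4.35, so smokers are predicted to pay about 4.4 times
# as much as non-smokers with the same age and BMI; exp(0.0284) is about 1.029,
# or roughly a 2.9% increase in expected charges per additional year of age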
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(clarify)
library(modelsummary)
dim(insurance) # Shows the dimensions of the data
## [1] 1338 7
Using the dim() function, I can see the dimensions of the dataset: 1338 rows and 7 variables (age, sex, bmi, children, smoker, region, and charges).
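Beyond the dimensions, it can also be useful to confirm how each column was read in before modeling; a minimal sketch:

str(insurance) # class of each column (numeric, integer, character, or factor)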
library(modelsummary)
model.ins <- lm(charges ~ age + bmi + smoker, data = insurance)
modelsummary(model.ins, output = "markdown")
|             | (1)        |
|:------------|-----------:|
| (Intercept) | -11676.830 |
|             | (937.569)  |
| age         | 259.547    |
|             | (11.934)   |
| bmi         | 322.615    |
|             | (27.487)   |
| smokeryes   | 23823.684  |
|             | (412.867)  |
| Num.Obs.    | 1338       |
| R2          | 0.747      |
| R2 Adj.     | 0.747      |
| AIC         | 27123.8    |
| BIC         | 27149.8    |
| Log.Lik.    | -13556.918 |
| F           | 1316.230   |
| RMSE        | 6083.21    |
However, only three predictors were used in this model: age, BMI, and smoker. The other variables were left out because, in the earlier analysis, they were not significant predictors of insurance charges.
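As a quick check of that choice (not part of the original analysis), one could fit a linear model with all available predictors and compare it to the reduced model side by side; a sketch assuming the columns shown earlier, with model.full as a hypothetical name:

# Hypothetical comparison: reduced model vs. a model with all predictors
model.full <- lm(charges ~ age + sex + bmi + children + smoker + region,
                 data = insurance)
modelsummary(list("Reduced" = model.ins, "Full" = model.full),
             output = "markdown")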
library(Amelia)
# Imputation (for missing values)
a.out <- amelia(insurance, m = 5, idvars = c("sex", "region", "smoker"), ords = "children")
## Amelia Error Code: 39
## Your data has no missing values. Make sure the code for
## missing data is set to the code for R, which is NA.
# View imputed data
summary(a.out)
## Length Class Mode
## code 1 -none- numeric
## message 1 -none- character
This confirms that the dataset has no missing values. To illustrate the functionality of the Amelia package, I simulated missing data:
set.seed(123) # for reproducibility
insurance_missing <- insurance
insurance_missing[sample(1:nrow(insurance), 50), "bmi"] <- NA # randomly remove 50 BMI values
# Now try imputation
library(Amelia)
a.out <- amelia(insurance_missing, m = 5, idvars = c("sex", "region", "smoker"), ords = "children")
## -- Imputation 1 --
##
## 1 2
##
## -- Imputation 2 --
##
## 1 2
##
## -- Imputation 3 --
##
## 1 2
##
## -- Imputation 4 --
##
## 1 2
##
## -- Imputation 5 --
##
## 1 2
summary(a.out)
##
## Amelia output with 5 imputed datasets.
## Return code: 1
## Message: Normal EM convergence.
##
## Chain Lengths:
## --------------
## Imputation 1: 2
## Imputation 2: 2
## Imputation 3: 2
## Imputation 4: 2
## Imputation 5: 2
##
## Rows after Listwise Deletion: 1288
## Rows after Imputation: 1338
## Patterns of missingness in the data: 2
##
## Fraction Missing for original variables:
## -----------------------------------------
##
## Fraction Missing
## age 0.00000000
## sex 0.00000000
## bmi 0.03736921
## children 0.00000000
## smoker 0.00000000
## region 0.00000000
## charges 0.00000000
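The pattern of missingness can also be visualized with Amelia's missmap() function; a short sketch on the simulated data, where only bmi should show gaps:

# Map of missing vs. observed cells in the simulated dataset
missmap(insurance_missing)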
head(a.out$imputations[[1]])
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
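Amelia also provides diagnostics for judging whether the imputed values are plausible. For example, compare.density() overlays the distribution of the imputed bmi values on the distribution of the observed values; if the two densities are broadly similar, the imputations look reasonable:

# Density of imputed bmi values vs. observed bmi values
compare.density(a.out, var = "bmi")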
library(broom) # for tidy model output
library(modelsummary) # for formatting
library(purrr) # for mapping across imputations
# Run the Gamma GLM on each imputed dataset
models <- map(a.out$imputations, ~
  glm(charges ~ age + bmi + smoker,
      family = Gamma(link = "log"),
      data = .x)
)
modelsummary(models, output = "markdown")
|             | imp1       | imp2       | imp3       | imp4       | imp5       |
|:------------|-----------:|-----------:|-----------:|-----------:|-----------:|
| (Intercept) | 7.451      | 7.449      | 7.471      | 7.434      | 7.465      |
|             | (0.105)    | (0.104)    | (0.105)    | (0.104)    | (0.105)    |
| age         | 0.028      | 0.028      | 0.028      | 0.028      | 0.028      |
|             | (0.001)    | (0.001)    | (0.001)    | (0.001)    | (0.001)    |
| bmi         | 0.012      | 0.012      | 0.011      | 0.013      | 0.011      |
|             | (0.003)    | (0.003)    | (0.003)    | (0.003)    | (0.003)    |
| smokeryes   | 1.472      | 1.470      | 1.471      | 1.468      | 1.470      |
|             | (0.046)    | (0.046)    | (0.046)    | (0.046)    | (0.046)    |
| Num.Obs.    | 1338       | 1338       | 1338       | 1338       | 1338       |
| AIC         | 26449.2    | 26448.5    | 26451.5    | 26446.1    | 26450.8    |
| BIC         | 26475.2    | 26474.5    | 26477.5    | 26472.1    | 26476.8    |
| Log.Lik.    | -13219.579 | -13219.259 | -13220.733 | -13218.075 | -13220.385 |
| RMSE        | 7491.09    | 7472.75    | 7492.37    | 7419.69    | 7491.53    |
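Before pooling with mitools, it may help to see what the pooling step actually does. The sketch below applies Rubin's rules by hand to the age coefficient from the models list above (illustration only; MIcombine performs the same calculation across all coefficients):

# Rubin's rules by hand for the age coefficient
est <- sapply(models, function(m) coef(m)["age"])
se  <- sapply(models, function(m) summary(m)$coefficients["age", "Std. Error"])
m   <- length(est)
q_bar <- mean(est)                # pooled point estimate
w_bar <- mean(se^2)               # average within-imputation variance
b     <- var(est)                 # between-imputation variance
t_var <- w_bar + (1 + 1/m) * b    # total variance under Rubin's rules
c(estimate = q_bar, se = sqrt(t_var))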
library(mitools) # used for pooling data
# Convert models to 'imputationList' format
imputation_list <- imputationList(lapply(a.out$imputations, as.data.frame))
# Refit model using 'with()' from mitools
glm_results <- with(imputation_list,
  glm(charges ~ age + bmi + smoker, family = Gamma(link = "log"))
)
# Pool results
pooled <- MIcombine(glm_results)
# View pooled coefficients and standard errors
summary(pooled)
## Multiple imputation results:
## with(imputation_list, glm(charges ~ age + bmi + smoker, family = Gamma(link = "log")))
## MIcombine.default(glm_results)
## results se (lower upper) missInfo
## (Intercept) 7.45393999 0.105934349 7.246280471 7.66159951 2 %
## age 0.02841702 0.001338188 0.025794219 0.03103982 0 %
## bmi 0.01184797 0.003115676 0.005739423 0.01795652 3 %
## smokeryes 1.47009738 0.046311792 1.379327917 1.56086685 0 %
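Before comparing the pre- and post-imputation models in prose, a quick side-by-side of the two sets of coefficients (a sketch; it assumes the standard coef() accessor works on the MIcombine result, otherwise pooled$coefficients can be used directly):

# Complete-data vs. pooled post-imputation coefficients, both on the log scale
round(cbind(original = coef(m_gama), pooled = coef(pooled)), 4)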
Comparing pre- and post-imputation models:
My original dataset had no missing values, so the first model summary was based on the complete data. The simulated dataset had 50 missing BMI values, which were imputed with the Amelia package. The coefficients are nearly unchanged after imputation, which suggests that the imputation did not meaningfully alter the relationships between the variables. The standard errors are slightly larger in the pooled model, which is expected given the added uncertainty from the imputation process.
In conclusion, age and BMI significantly increase insurance charges, while smoking has a much larger effect. The imputation process did not materially change the model results, indicating that the missing data was not a major issue in this analysis.