Assignment 9

Donna Parker

2025-04-25


Introduction:

For this analysis, I am using the insurance dataset, which comes from a GitHub repository of datasets accompanying Brett Lantz's book "Machine Learning with R". The dataset contains seven variables: age, sex, BMI, children, smoker, region, and charges.

First, I loaded the data into R:

# Load the data

insurance <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")

head(insurance)  
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

In my prior analysis, I built a Gamma regression model to determine whether certain predictors have an effect on insurance charges; specifically, I examined how age, BMI, and smoking status affect charges.

# Convert "smoker" into a binary factor

insurance$smoker <- as.factor(insurance$smoker)

# Fit the Gamma Regression Model

m_gama <- glm(charges ~ age + bmi + smoker, 
              family = Gamma(link = "log"), 
              data = insurance)

summary(m_gama)
## 
## Call:
## glm(formula = charges ~ age + bmi + smoker, family = Gamma(link = "log"), 
##     data = insurance)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.444595   0.105187  70.775  < 2e-16 ***
## age         0.028426   0.001339  21.230  < 2e-16 ***
## bmi         0.012131   0.003084   3.934 8.79e-05 ***
## smokeryes   1.470576   0.046320  31.748  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.4671822)
## 
##     Null deviance: 1056.04  on 1337  degrees of freedom
## Residual deviance:  357.41  on 1334  degrees of freedom
## AIC: 26448
## 
## Number of Fisher Scoring iterations: 7

Some of the key findings from my previous analysis were:

The p-values for age, BMI, and smoking status are all well below 0.05, indicating that each of these predictors has a statistically significant effect on insurance charges.

Smoking had the largest effect on insurance charges, followed by age and BMI.
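
Because the Gamma model uses a log link, exponentiating the coefficients converts them into multiplicative effects on expected charges. The short sketch below is illustrative and was not part of the original output; it only uses the m_gama object fit above.

# Exponentiate the log-link coefficients to express them as multiplicative effects
exp(coef(m_gama))

# Using the estimates reported in the summary above:
# exp(0.0284) is about 1.03, so each additional year of age multiplies expected charges by roughly 1.03
# exp(0.0121) is about 1.01, so each one-unit increase in BMI multiplies expected charges by roughly 1.01
# exp(1.4706) is about 4.35, so smokers' expected charges are roughly 4.35 times those of non-smokers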

library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(clarify)
library(modelsummary)


dim(insurance) # Shows the dimensions of the data
## [1] 1338    7

Using the dim() function, I was able to view the dimensions of the dataset: 1338 rows and 7 variables (age, sex, bmi, children, smoker, region, and charges).
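
As an additional, illustrative check (not part of the original assignment), the variable types and the number of missing values in each column can be inspected before modeling:

str(insurance)             # variable types and a preview of each column
colSums(is.na(insurance))  # count of missing values per column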

library(modelsummary)

model.ins <- lm(charges ~ age + bmi + smoker, data = insurance)

modelsummary(model.ins, output = "markdown")
|             |        (1) |
|:------------|-----------:|
| (Intercept) | -11676.830 |
|             |  (937.569) |
| age         |    259.547 |
|             |   (11.934) |
| bmi         |    322.615 |
|             |   (27.487) |
| smokeryes   |  23823.684 |
|             |  (412.867) |
| Num.Obs.    |       1338 |
| R2          |      0.747 |
| R2 Adj.     |      0.747 |
| AIC         |    27123.8 |
| BIC         |    27149.8 |
| Log.Lik.    | -13556.918 |
| F           |   1316.230 |
| RMSE        |    6083.21 |

However, only three predictors were included in the model: age, BMI, and smoker (which appears in the table as the dummy-coded term smokeryes). The other variables (sex, children, and region) were not included because they were not treated as significant predictors of insurance charges in this analysis.
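
One way this could be checked, sketched below for illustration (model.full is a name introduced here, not part of the original assignment), is to fit a model with every available predictor and compare its coefficient table with the reduced model above:

# Fit a linear model with all available predictors for comparison
model.full <- lm(charges ~ age + sex + bmi + children + smoker + region,
                 data = insurance)
summary(model.full)  # compare these p-values with the reduced model above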

library(Amelia)

# Attempt imputation on the original data (Amelia will report that there are no missing values)
a.out <- amelia(insurance, m = 5, idvars = c("sex", "region", "smoker"), ords = "children")
## Amelia Error Code:  39 
##  Your data has no missing values.  Make sure the code for 
##  missing data is set to the code for R, which is NA.
# View the Amelia output
summary(a.out)
##         Length Class  Mode     
## code    1      -none- numeric  
## message 1      -none- character

This confirms that the original data has no missing values. To illustrate the functionality of the Amelia package, I therefore created simulated missing data:

set.seed(123)  # for reproducibility
insurance_missing <- insurance
insurance_missing[sample(1:nrow(insurance), 50), "bmi"] <- NA  # randomly remove 50 BMI values

# Now try imputation
library(Amelia)
a.out <- amelia(insurance_missing, m = 5, idvars = c("sex", "region", "smoker"), ords = "children")
## -- Imputation 1 --
## 
##   1  2
## 
## -- Imputation 2 --
## 
##   1  2
## 
## -- Imputation 3 --
## 
##   1  2
## 
## -- Imputation 4 --
## 
##   1  2
## 
## -- Imputation 5 --
## 
##   1  2
summary(a.out)
## 
## Amelia output with 5 imputed datasets.
## Return code:  1 
## Message:  Normal EM convergence. 
## 
## Chain Lengths:
## --------------
## Imputation 1:  2
## Imputation 2:  2
## Imputation 3:  2
## Imputation 4:  2
## Imputation 5:  2
## 
## Rows after Listwise Deletion:  1288 
## Rows after Imputation:  1338 
## Patterns of missingness in the data:  2 
## 
## Fraction Missing for original variables: 
## -----------------------------------------
## 
##          Fraction Missing
## age            0.00000000
## sex            0.00000000
## bmi            0.03736921
## children       0.00000000
## smoker         0.00000000
## region         0.00000000
## charges        0.00000000
head(a.out$imputations[[1]])
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
library(broom)       # for tidy model output
library(modelsummary) # for formatting
library(purrr)       # for mapping across imputations

# Run the Gamma GLM on each imputed dataset
models <- map(a.out$imputations, ~ 
  glm(charges ~ age + bmi + smoker, 
      family = Gamma(link = "log"), 
      data = .x)
)

modelsummary(models, output = "markdown")
|             | imp1       | imp2       | imp3       | imp4       | imp5       |
|:------------|-----------:|-----------:|-----------:|-----------:|-----------:|
| (Intercept) | 7.451      | 7.449      | 7.471      | 7.434      | 7.465      |
|             | (0.105)    | (0.104)    | (0.105)    | (0.104)    | (0.105)    |
| age         | 0.028      | 0.028      | 0.028      | 0.028      | 0.028      |
|             | (0.001)    | (0.001)    | (0.001)    | (0.001)    | (0.001)    |
| bmi         | 0.012      | 0.012      | 0.011      | 0.013      | 0.011      |
|             | (0.003)    | (0.003)    | (0.003)    | (0.003)    | (0.003)    |
| smokeryes   | 1.472      | 1.470      | 1.471      | 1.468      | 1.470      |
|             | (0.046)    | (0.046)    | (0.046)    | (0.046)    | (0.046)    |
| Num.Obs.    | 1338       | 1338       | 1338       | 1338       | 1338       |
| AIC         | 26449.2    | 26448.5    | 26451.5    | 26446.1    | 26450.8    |
| BIC         | 26475.2    | 26474.5    | 26477.5    | 26472.1    | 26476.8    |
| Log.Lik.    | -13219.579 | -13219.259 | -13220.733 | -13218.075 | -13220.385 |
| RMSE        | 7491.09    | 7472.75    | 7492.37    | 7419.69    | 7491.53    |

library(mitools)  # used for pooling data

# Convert models to 'imputationList' format
imputation_list <- imputationList(lapply(a.out$imputations, as.data.frame))

# Refit model using 'with()' from mitools
glm_results <- with(imputation_list, 
  glm(charges ~ age + bmi + smoker, family = Gamma(link = "log"))
)

# Pool results 
pooled <- MIcombine(glm_results)

# View pooled coefficients and standard errors
summary(pooled)
## Multiple imputation results:
##       with(imputation_list, glm(charges ~ age + bmi + smoker, family = Gamma(link = "log")))
##       MIcombine.default(glm_results)
##                results          se      (lower     upper) missInfo
## (Intercept) 7.45393999 0.105934349 7.246280471 7.66159951      2 %
## age         0.02841702 0.001338188 0.025794219 0.03103982      0 %
## bmi         0.01184797 0.003115676 0.005739423 0.01795652      3 %
## smokeryes   1.47009738 0.046311792 1.379327917 1.56086685      0 %

Comparing pre- and post-imputation models:

My original dataset had no missing values, so the first model summary was based on the complete data. For the imputed analysis, I removed 50 BMI values at random and then filled them in using the Amelia package. The pooled coefficients are nearly identical to the complete-data estimates, which suggests that the imputation did not meaningfully alter the relationships between the variables. The standard errors are slightly larger after pooling, which is expected given the added uncertainty from the imputation process. In conclusion, age and BMI both significantly increase insurance charges, while smoking has a much larger effect on charges. Because imputation barely changed the results, the simulated missing data was not a major issue in this analysis.
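
As a final illustrative sketch (the object name comparison is introduced here and is not part of the original assignment), the complete-data Gamma coefficients can be placed next to the pooled estimates to make this comparison explicit; it relies on the coef() methods that stats and mitools provide for the fitted and pooled objects:

# Side-by-side comparison of complete-data and pooled post-imputation coefficients
comparison <- data.frame(
  complete_data  = coef(m_gama),   # Gamma GLM fit on the original, complete data
  pooled_imputed = coef(pooled)    # pooled estimates from MIcombine
)
round(comparison, 4)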