Loading package ———————————————————

pacman::p_load(
  rio,          # File import
  here,         # File locator
  janitor,      # adding totals and percents to tables
  tidyverse,
  VIM,
  mice,
  ggplot2
)

Backgound

An adequate level of risk perception is critical to adopt a protective behavior concerning HIV acquisition. The purpose of this study is to identify factors associated with HIV risk perception among AGYW. Here, we will focus on risk perception at baseline and which factors are associated therewith. Data description: - Outcome: risk_score - Risk factors: age + factor(educational_level) + factor(residential_area) + monthly_income

## Warning: NAs introduced by coercion

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00   10.00   16.48   20.00  999.00     531

##    Length     Class      Mode 
##      2200 character character

##    Length     Class      Mode 
##      2200 character character

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2500    4000    5321    6000  100000

Primary endpoint

The primary endpoint of this analysis was self-assessed risk of acquiring HIV infection. At baseline, participants were asked “What do you think the chance is that you will get HIV in the next six months?”, thereby expressing this on a scale ranging from 0 to 100, with 100 being the maximum value. 1. How many participants are included in the study? How many variables?

n <- nrow(all_dat)
n

## [1] 2200

tab_primary = table(all_dat$risk_score)
all_dat$risk_score = as.numeric(all_dat$risk_score)
tab_score = table(all_dat$risk_score)
rel_freq = tab_score/n
h_primary = hist(all_dat$risk_score, xlim = c(0,100), breaks = 50, main = "",
                 xlab = "Risk score (between 0 and 100)")

# lines(as.numeric(names(tab_score)), tab_score, col = 2, lwd = 2)
# text(xgrid[-length(4)] + 0.5, tab_score + 25,
#      paste(round(rel_freq*100,1),"%"), cex = 0.8)

What is the age distribution of individuals in the DREAMS study?

Skew distribution

tab_age = table(all_dat$age)
rel_freq = tab_age/n
xgrid = seq(min(all_dat$age)-0.5,max(all_dat$age)+0.5,1)
h = hist(all_dat$age, main = "The age distribution of individuals in the DREAMS study", ylab = "Frequency", xlab = "Age",
         breaks = xgrid, ylim = c(0,550))
lines(as.numeric(names(tab_age)), tab_age, col = 2, lwd = 2)
text(xgrid[-length(xgrid)] + 0.5, tab_age + 25,
     paste(round(rel_freq*100,1),"%"), cex = 0.8)

Missing data

For some of the variables, missing observations are coded as 999 instead of NA. First recode missing observations as NA. NOTE: for the primary endpoint, some observations are equal to 777 which requires recording to NA as well.

all_dat$risk_score[all_dat$risk_score == 777 | all_dat$risk_score == 999] = NA
n_recodings = apply(all_dat, 2, function(x){sum(x == 999, na.rm = T)})
all_dat[which(all_dat == 999, arr.ind = T)] = NA
# Number of missing observations for the primary endpoint
sum(is.na(all_dat$risk_score))

## [1] 535

Explore the level of missing data in the data set by calculating the frequency of missing observations for each variable

n_missing = apply(all_dat, 2, FUN = function(x){sum(is.na(x))})
perc_missing = n_missing/nrow(all_dat)
perc_missing

##                     study_id                          age 
##                 0.0000000000                 0.0000000000 
##                     studying            educational_level 
##                 0.0000000000                 0.0000000000 
##            country_of_origin                 mother_alive 
##                 0.0000000000                 0.0000000000 
##                mother_status                 father_alive 
##                 0.0000000000                 0.0000000000 
##                father_status             residential_area 
##                 0.0000000000                 0.0000000000 
##            household_members              household_water 
##                 0.0000000000                 0.0000000000 
##          electricity_at_home         no_food_in_household 
##                 0.0000000000                 0.0000000000 
##              not_enough_food       whole_day_without_food 
##                 0.0000000000                 0.0000000000 
##               monthly_income                steady_income 
##                 0.0000000000                 0.0000000000 
##             government_grant   r1000_in_personal_accoount 
##                 0.0000000000                 0.0000000000 
##                   dependants            financial_support 
##                 0.0000000000                 0.0000000000 
##   control_on_household_money               away_from_home 
##                 0.0000000000                 0.0004545455 
##                have_children            how_many_children 
##                 0.0000000000                 0.4559090909 
##             kids_living_with           ever_been_pregnant 
##                 0.4513636364                 0.0000000000 
##           currently_pregnant   want_to_have_more_children 
##                 0.3263636364                 0.1618181818 
##         preventing_pregnancy     method_prevent_pregnancy 
##                 0.1622727273                 0.3936363636 
##                  alcohol_use                          sti 
##                 0.0000000000                 0.0000000000 
##         discuss_health_issue                     hiv_talk 
##                 0.0000000000                 0.0000000000 
##              hiv_talk_mostly          hiv_on_social_media 
##                 0.1009090909                 0.0000000000 
##            hiv_info_internet          ever_tested_for_hiv 
##                 0.0000000000                 0.0000000000 
##                last_hiv_test       first_sexual_encounter 
##                 0.0445454545                 0.0000000000 
##              sexual_partners        how_many_intercourses 
##                 0.0000000000                 0.0000000000 
##            sex_after_alcohol       sex_with_drunk_partner 
##                 0.0000000000                 0.0000000000 
##     sex_for_service_or_goods                sex_for_money 
##                 0.0000000000                 0.0000000000 
##            physical_violence physical_violence_by_partner 
##                 0.0000000000                 0.0000000000 
##              sexual_violence          relationship_status 
##                 0.0000000000                 0.0000000000 
##      current_sexual_partners                   risk_score 
##                 0.0009090909                 0.2431818182 
##      sex_partners_past_6_mos                  risk_cat_bl 
##                 0.8268181818                 0.0000000000 
##                     risk_cat                      agediff 
##                 0.2404545455                 0.0000000000 
##                     age5diff       total_relationship_yrs 
##                 0.0000000000                 0.0000000000

#Method 2
missing_vars <- names(which(colSums(is.na(all_dat)) > 0))
data_with_missing <- all_dat[missing_vars]
missing_freq <- colSums(is.na(data_with_missing))
# Calculate the percentage of missing observations for each variable
missing_pct <- 100 * missing_freq / nrow(all_dat)
# Create a table with variable names, missing frequencies, and missing percentages
missing_table <- data.frame(
  Missing_Frequency = missing_freq,
  Missing_Percentage = missing_pct
)
# Print the table
missing_table

##                            Missing_Frequency Missing_Percentage
## away_from_home                             1         0.04545455
## how_many_children                       1003        45.59090909
## kids_living_with                         993        45.13636364
## currently_pregnant                       718        32.63636364
## want_to_have_more_children               356        16.18181818
## preventing_pregnancy                     357        16.22727273
## method_prevent_pregnancy                 866        39.36363636
## hiv_talk_mostly                          222        10.09090909
## last_hiv_test                             98         4.45454545
## current_sexual_partners                    2         0.09090909
## risk_score                               535        24.31818182
## sex_partners_past_6_mos                 1819        82.68181818
## risk_cat                                 529        24.04545455

How many participants have at least one missing observation for the recorded variables?

n_subjects_missing_values = sum(apply(all_dat, 1, function(x){sum(is.na(x))}) > 0)
n_subjects_missing_values

## [1] 2071

# Method 2
# Calculate the number of missing observations for each participant
missing_obs <- rowSums(is.na(all_dat))

# Count the number of participants with at least one missing observation
n_missing_participants <- sum(missing_obs > 0)

# Print the result
n_missing_participants

## [1] 2071

Alternatively, provide a graphical exploration of the amount of missing data by variable using the function aggr in the R package VIM.

aggr_plot <- aggr(all_dat, col=c('navyblue','yellow'), numbers=TRUE, sortVars=TRUE,
                  labels=names(all_dat), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## Warning in plot.aggr(res, ...): not enough vertical space to display
## frequencies (too many combinations)

## 
##  Variables sorted by number of missings: 
##                      Variable        Count
##       sex_partners_past_6_mos 0.8268181818
##             how_many_children 0.4559090909
##              kids_living_with 0.4513636364
##      method_prevent_pregnancy 0.3936363636
##            currently_pregnant 0.3263636364
##                    risk_score 0.2431818182
##                      risk_cat 0.2404545455
##          preventing_pregnancy 0.1622727273
##    want_to_have_more_children 0.1618181818
##               hiv_talk_mostly 0.1009090909
##                 last_hiv_test 0.0445454545
##       current_sexual_partners 0.0009090909
##                away_from_home 0.0004545455
##                      study_id 0.0000000000
##                           age 0.0000000000
##                      studying 0.0000000000
##             educational_level 0.0000000000
##             country_of_origin 0.0000000000
##                  mother_alive 0.0000000000
##                 mother_status 0.0000000000
##                  father_alive 0.0000000000
##                 father_status 0.0000000000
##              residential_area 0.0000000000
##             household_members 0.0000000000
##               household_water 0.0000000000
##           electricity_at_home 0.0000000000
##          no_food_in_household 0.0000000000
##               not_enough_food 0.0000000000
##        whole_day_without_food 0.0000000000
##                monthly_income 0.0000000000
##                 steady_income 0.0000000000
##              government_grant 0.0000000000
##    r1000_in_personal_accoount 0.0000000000
##                    dependants 0.0000000000
##             financial_support 0.0000000000
##    control_on_household_money 0.0000000000
##                 have_children 0.0000000000
##            ever_been_pregnant 0.0000000000
##                   alcohol_use 0.0000000000
##                           sti 0.0000000000
##          discuss_health_issue 0.0000000000
##                      hiv_talk 0.0000000000
##           hiv_on_social_media 0.0000000000
##             hiv_info_internet 0.0000000000
##           ever_tested_for_hiv 0.0000000000
##        first_sexual_encounter 0.0000000000
##               sexual_partners 0.0000000000
##         how_many_intercourses 0.0000000000
##             sex_after_alcohol 0.0000000000
##        sex_with_drunk_partner 0.0000000000
##      sex_for_service_or_goods 0.0000000000
##                 sex_for_money 0.0000000000
##             physical_violence 0.0000000000
##  physical_violence_by_partner 0.0000000000
##               sexual_violence 0.0000000000
##           relationship_status 0.0000000000
##                   risk_cat_bl 0.0000000000
##                       agediff 0.0000000000
##                      age5diff 0.0000000000
##        total_relationship_yrs 0.0000000000

Missing data techniques and questions

1. Which missingness mechanisms do exist? Explain.

Missing Completely At Random (MCAR): If the missingness is evenly distributed across all values of the variable, the missingness mechanism may be MCAR. This means that the missingness is not related to the observed or unobserved values of the variable or any other variables in the data set. In the aggr plot, an MCAR pattern would show random yellow bars across all values of the variable. –> Not MCAR

Missing At Random (MAR): If the missingness is related to the observed values of other variables in the data set, the missingness mechanism may be MAR. This means that the probability of missingness depends on the observed values of other variables in the data set, but not on the unobserved values of the variable itself. In the aggr plot, an MAR pattern would show non-random yellow bars for certain values of the variable, depending on the observed values of other variables in the data set.

Missing Not At Random (MNAR): If the missingness is related to the unobserved values of the variable, the missingness mechanism may be MNAR. This means that the probability of missingness depends on the unobserved values of the variable itself or other variables in the data set. In the aggr plot, an MNAR pattern would show non-random yellow bars for certain values of the variable, depending on the unobserved values of the variable itself or other variables in the data set.

–> Maybe MAR or MNAR

2. Formulate an imputation model for the (pseudo-)continuous endpoint Z defined as the risk perception score (on a range of 0 – 100). What about the excess number of zero observations?

The presence of many zeros in the outcome variable can lead to inflated or deflated regression coefficients, depending on the specific model assumptions.

Step 1: Choose a method: There are several methods for imputing missing values, including mean imputation, regression imputation, and multiple imputation. In this case, a regression imputation method could be appropriate, as we want to model the relationship between Z and the predictor variables.

lm_fit1 = glm(risk_score ~ age + factor(educational_level) +
                factor(residential_area) + monthly_income, data = all_dat)
summary(lm_fit1)

## 
## Call:
## glm(formula = risk_score ~ age + factor(educational_level) + 
##     factor(residential_area) + monthly_income, data = all_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -24.199  -13.681   -5.067    5.428   86.853  
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                             1.105e+01  5.162e+00
## age                                                     1.339e-01  2.234e-01
## factor(educational_level)Completed primary school      -2.986e+00  3.865e+00
## factor(educational_level)No formal schooling            1.323e+01  1.347e+01
## factor(educational_level)Some high/secondary schooling  7.670e-01  1.092e+00
## factor(educational_level)Some primary school           -5.813e+00  5.310e+00
## factor(educational_level)Tertiary education             4.507e-01  1.254e+00
## factor(residential_area)Diepsloot                       1.232e+00  1.545e+00
## factor(residential_area)Fourways                        2.625e+00  2.638e+00
## factor(residential_area)Kyasand/pipeline                3.653e+00  2.504e+00
## factor(residential_area)Msawawa                        -6.355e-01  2.796e+00
## factor(residential_area)Other                          -7.215e-01  1.800e+00
## monthly_income                                         -1.296e-04  7.654e-05
##                                                        t value Pr(>|t|)  
## (Intercept)                                              2.141   0.0324 *
## age                                                      0.599   0.5492  
## factor(educational_level)Completed primary school       -0.773   0.4399  
## factor(educational_level)No formal schooling             0.982   0.3260  
## factor(educational_level)Some high/secondary schooling   0.703   0.4825  
## factor(educational_level)Some primary school            -1.095   0.2738  
## factor(educational_level)Tertiary education              0.359   0.7194  
## factor(residential_area)Diepsloot                        0.797   0.4253  
## factor(residential_area)Fourways                         0.995   0.3199  
## factor(residential_area)Kyasand/pipeline                 1.459   0.1448  
## factor(residential_area)Msawawa                         -0.227   0.8202  
## factor(residential_area)Other                           -0.401   0.6886  
## monthly_income                                          -1.694   0.0905 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 355.6833)
## 
##     Null deviance: 592378  on 1664  degrees of freedom
## Residual deviance: 587589  on 1652  degrees of freedom
##   (535 observations deleted due to missingness)
## AIC: 14520
## 
## Number of Fisher Scoring iterations: 2

## Model diagnostics (normality)
##-------------------------------
qqnorm(residuals(lm_fit1), pch = 1, frame = FALSE)
qqline(residuals(lm_fit1), col = "steelblue", lwd = 2)

## Linear regression model with transformation
##------------------------------------------------
lm_fit2 = glm(log(risk_score + 1) ~ age + factor(educational_level) + #to avoid taking the log of zero or negative values
                factor(residential_area) + monthly_income, data = all_dat)
summary(lm_fit2)

## 
## Call:
## glm(formula = log(risk_score + 1) ~ age + factor(educational_level) + 
##     factor(residential_area) + monthly_income, data = all_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0619  -1.7277   0.5179   1.1975   2.9059  
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                             1.876e+00  4.148e-01
## age                                                    -4.932e-03  1.796e-02
## factor(educational_level)Completed primary school      -2.068e-01  3.106e-01
## factor(educational_level)No formal schooling            1.191e+00  1.083e+00
## factor(educational_level)Some high/secondary schooling  1.128e-01  8.774e-02
## factor(educational_level)Some primary school           -5.101e-01  4.268e-01
## factor(educational_level)Tertiary education            -1.348e-02  1.008e-01
## factor(residential_area)Diepsloot                       1.218e-01  1.242e-01
## factor(residential_area)Fourways                       -4.595e-02  2.120e-01
## factor(residential_area)Kyasand/pipeline                1.744e-01  2.013e-01
## factor(residential_area)Msawawa                        -1.398e-01  2.247e-01
## factor(residential_area)Other                          -7.376e-02  1.447e-01
## monthly_income                                         -7.442e-06  6.151e-06
##                                                        t value Pr(>|t|)    
## (Intercept)                                              4.522 6.57e-06 ***
## age                                                     -0.275    0.784    
## factor(educational_level)Completed primary school       -0.666    0.506    
## factor(educational_level)No formal schooling             1.100    0.271    
## factor(educational_level)Some high/secondary schooling   1.285    0.199    
## factor(educational_level)Some primary school            -1.195    0.232    
## factor(educational_level)Tertiary education             -0.134    0.894    
## factor(residential_area)Diepsloot                        0.981    0.327    
## factor(residential_area)Fourways                        -0.217    0.828    
## factor(residential_area)Kyasand/pipeline                 0.866    0.386    
## factor(residential_area)Msawawa                         -0.622    0.534    
## factor(residential_area)Other                           -0.510    0.610    
## monthly_income                                          -1.210    0.227    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.297046)
## 
##     Null deviance: 3831.5  on 1664  degrees of freedom
## Residual deviance: 3794.7  on 1652  degrees of freedom
##   (535 observations deleted due to missingness)
## AIC: 6124.7
## 
## Number of Fisher Scoring iterations: 2

## Model diagnostics (normality)
##-------------------------------
qqnorm(residuals(lm_fit2), pch = 1, frame = FALSE)
qqline(residuals(lm_fit2), col = "steelblue", lwd = 2)

Step 2: Impute missing values: Finally, use the trained model to impute missing values for Z based on the values of the predictor variables for each observation with missing data.

predictors <- all_dat %>% select(age, educational_level,
              residential_area, monthly_income)
predictors = as.matrix(predictors)
predictions <- predict(lm_fit1, all_dat)
# We can evaluate the performance of our model using the root mean squared error (RMSE) on the training data
rmse <- sqrt(mean((all_dat$risk_score - predictions)^2))
print(paste("RMSE:", rmse))

## [1] "RMSE: NA"

# imputed_data <- mice(all_dat, m=5, method="pmm", pred=predictors)
# complete_data <- complete(imputed_data, 1)

Consider the random variable Y defined as having a risk score equal to zero (no risk of HIV acquisition). Construct an imputation model for Y, under the assumption of missing at random (MAR), and depending on the covariates age, educational level, residential area, and monthly income. What type of model will you consider?

Logistic resgression model

Step 1: Define Y = risk perception score equal to zero. Y=1 - risk perception score equal to zero, Y=0 - risk perception score is not equal to zero

all_dat$y1 = as.numeric(all_dat$risk_score == 0)
table(all_dat$y1)

## 
##    0    1 
## 1051  614

sum(is.na(all_dat$y1))

## [1] 535

Step 2: Logistic regression model

glm_fit = glm(y1 ~ age + factor(educational_level) + factor(residential_area) +
                monthly_income, data = all_dat,
              family = "binomial"(link = logit))
summary(glm_fit)

## 
## Call:
## glm(formula = y1 ~ age + factor(educational_level) + factor(residential_area) + 
##     monthly_income, family = binomial(link = logit), data = all_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5278  -0.9541  -0.8862   1.3584   1.5555  
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                            -7.181e-01  5.713e-01
## age                                                     1.283e-02  2.477e-02
## factor(educational_level)Completed primary school       2.061e-01  4.148e-01
## factor(educational_level)No formal schooling           -1.331e+01  3.785e+02
## factor(educational_level)Some high/secondary schooling -1.710e-01  1.213e-01
## factor(educational_level)Some primary school            5.908e-01  5.683e-01
## factor(educational_level)Tertiary education            -1.993e-03  1.370e-01
## factor(residential_area)Diepsloot                      -1.566e-01  1.689e-01
## factor(residential_area)Fourways                        2.139e-01  2.841e-01
## factor(residential_area)Kyasand/pipeline               -1.963e-01  2.796e-01
## factor(residential_area)Msawawa                         2.031e-01  3.032e-01
## factor(residential_area)Other                           8.328e-02  1.953e-01
## monthly_income                                          4.637e-06  8.306e-06
##                                                        z value Pr(>|z|)
## (Intercept)                                             -1.257    0.209
## age                                                      0.518    0.604
## factor(educational_level)Completed primary school        0.497    0.619
## factor(educational_level)No formal schooling            -0.035    0.972
## factor(educational_level)Some high/secondary schooling  -1.410    0.159
## factor(educational_level)Some primary school             1.040    0.298
## factor(educational_level)Tertiary education             -0.015    0.988
## factor(residential_area)Diepsloot                       -0.927    0.354
## factor(residential_area)Fourways                         0.753    0.452
## factor(residential_area)Kyasand/pipeline                -0.702    0.483
## factor(residential_area)Msawawa                          0.670    0.503
## factor(residential_area)Other                            0.427    0.670
## monthly_income                                           0.558    0.577
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2192.1  on 1664  degrees of freedom
## Residual deviance: 2177.0  on 1652  degrees of freedom
##   (535 observations deleted due to missingness)
## AIC: 2203
## 
## Number of Fisher Scoring iterations: 12

4. Based on the fitted imputation model (for Y), predict the missing y-values, and store the results in a new data frame in R. Prediction of the probability of a zero risk can be performed using the predict-function in R and prediction of y (being equal to zero or one) should be based on an estimated probability of having a zero-risk perception exceeding 0.5.

# Define new imputed dataset
##----------------------------
imp_dat = all_dat
## Single imputation
##-------------------
imp_dat$y1_imp = as.numeric(predict(glm_fit, newdata = all_dat,
                                    type = "response") > 0.5)
## RESTORE previous observed values (after assessment of prediction error)
##-------------------------------------------------------------------------
imp_dat$y1_imp[is.na(imp_dat$y1) == F] = imp_dat$y1[is.na(imp_dat$y1) == F]

5. How would you evaluate the performance of this imputation (or prediction) model?

sum((imp_dat$y1 - imp_dat$y1_imp) == 1, na.rm = T)/sum(!is.na(imp_dat$y1))

## [1] 0

# We can evaluate the performance of our model using the root mean squared error (RMSE) on the training data
rmse <- sqrt(mean((all_dat$risk_score - predictions)^2))
print(paste("RMSE:", rmse))

## [1] "RMSE: NA"

6. Repeatedly and multiple timesimpute the missing observations for the random variable Y? Create 10 different imputation sets based on such an approach.

## Multiple imputation
##---------------------
M = 10
set.seed(2504)
for (imp_id in 1:M){
  imp_set_name = paste0("imp_dataset",imp_id)
  ynew = rbinom(n, size = 1, prob = predict(glm_fit, newdata = all_dat,
                                            type = "response"))
  ynew[is.na(all_dat$y1) == F] = all_dat$y1[is.na(all_dat$y1) == F]
  assign(imp_set_name, cbind(all_dat, ynew))
}

7. How can we impute the missing values for our primary endpoint Z, conditional on Y being equal to 0 (i.e., implying that the risk perception score differs from zero). Formulate an imputation model for Z depending on age, educational level, residential area, and monthly income. What type of model will you consider?

Linear regression

## Define data with non-zero risk scores
##---------------------------------------
reduced_dat = all_dat[all_dat$risk_score != 0, ]
## Linear regression models
##--------------------------
lm_fit = lm(risk_score ~ age + factor(educational_level) +
              factor(residential_area) + monthly_income, data = reduced_dat)
summary(lm_fit)

## 
## Call:
## lm(formula = risk_score ~ age + factor(educational_level) + factor(residential_area) + 
##     monthly_income, data = reduced_dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.119 -12.851  -7.539   7.443  78.298 
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                             1.528e+01  6.792e+00
## age                                                     3.387e-01  2.910e-01
## factor(educational_level)Completed primary school      -2.713e+00  5.262e+00
## factor(educational_level)No formal schooling            2.271e+00  1.389e+01
## factor(educational_level)Some high/secondary schooling -1.121e-01  1.395e+00
## factor(educational_level)Some primary school           -4.876e+00  8.081e+00
## factor(educational_level)Tertiary education             8.339e-01  1.640e+00
## factor(residential_area)Diepsloot                       8.001e-01  2.015e+00
## factor(residential_area)Fourways                        6.701e+00  3.574e+00
## factor(residential_area)Kyasand/pipeline                4.261e+00  3.183e+00
## factor(residential_area)Msawawa                         1.058e+00  3.707e+00
## factor(residential_area)Other                          -4.441e-01  2.381e+00
## monthly_income                                         -1.459e-04  9.493e-05
##                                                        t value Pr(>|t|)  
## (Intercept)                                              2.250   0.0247 *
## age                                                      1.164   0.2447  
## factor(educational_level)Completed primary school       -0.516   0.6062  
## factor(educational_level)No formal schooling             0.164   0.8702  
## factor(educational_level)Some high/secondary schooling  -0.080   0.9359  
## factor(educational_level)Some primary school            -0.603   0.5464  
## factor(educational_level)Tertiary education              0.509   0.6112  
## factor(residential_area)Diepsloot                        0.397   0.6914  
## factor(residential_area)Fourways                         1.875   0.0611 .
## factor(residential_area)Kyasand/pipeline                 1.339   0.1809  
## factor(residential_area)Msawawa                          0.286   0.7753  
## factor(residential_area)Other                           -0.186   0.8521  
## monthly_income                                          -1.537   0.1246  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.31 on 1038 degrees of freedom
##   (535 observations deleted due to missingness)
## Multiple R-squared:  0.01006,    Adjusted R-squared:  -0.001384 
## F-statistic: 0.879 on 12 and 1038 DF,  p-value: 0.5682

hist(residuals(lm_fit), main = "", xlab = "Residuals")

plot(reduced_dat$age[!is.na(reduced_dat$age)], residuals(lm_fit), xlab = "Age",
     ylab = "Residuals")

lm_fit2 = lm(log(risk_score + 1) ~ age + factor(educational_level) +
               factor(residential_area) + monthly_income, data = reduced_dat)
summary(lm_fit2)

## 
## Call:
## lm(formula = log(risk_score + 1) ~ age + factor(educational_level) + 
##     factor(residential_area) + monthly_income, data = reduced_dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2777 -0.4986 -0.1094  0.5412  1.7537 
## 
## Coefficients:
##                                                          Estimate Std. Error
## (Intercept)                                             2.765e+00  2.687e-01
## age                                                     6.023e-03  1.151e-02
## factor(educational_level)Completed primary school      -1.013e-01  2.082e-01
## factor(educational_level)No formal schooling           -9.313e-02  5.494e-01
## factor(educational_level)Some high/secondary schooling  2.185e-03  5.518e-02
## factor(educational_level)Some primary school           -1.723e-01  3.197e-01
## factor(educational_level)Tertiary education            -2.020e-02  6.487e-02
## factor(residential_area)Diepsloot                       3.200e-02  7.971e-02
## factor(residential_area)Fourways                        1.741e-01  1.414e-01
## factor(residential_area)Kyasand/pipeline                7.667e-02  1.259e-01
## factor(residential_area)Msawawa                         6.093e-03  1.466e-01
## factor(residential_area)Other                          -2.753e-02  9.420e-02
## monthly_income                                         -5.909e-06  3.755e-06
##                                                        t value Pr(>|t|)    
## (Intercept)                                             10.292   <2e-16 ***
## age                                                      0.523    0.601    
## factor(educational_level)Completed primary school       -0.486    0.627    
## factor(educational_level)No formal schooling            -0.170    0.865    
## factor(educational_level)Some high/secondary schooling   0.040    0.968    
## factor(educational_level)Some primary school            -0.539    0.590    
## factor(educational_level)Tertiary education             -0.311    0.756    
## factor(residential_area)Diepsloot                        0.401    0.688    
## factor(residential_area)Fourways                         1.232    0.218    
## factor(residential_area)Kyasand/pipeline                 0.609    0.543    
## factor(residential_area)Msawawa                          0.042    0.967    
## factor(residential_area)Other                           -0.292    0.770    
## monthly_income                                          -1.574    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7639 on 1038 degrees of freedom
##   (535 observations deleted due to missingness)
## Multiple R-squared:  0.006073,   Adjusted R-squared:  -0.005417 
## F-statistic: 0.5286 on 12 and 1038 DF,  p-value: 0.8972

hist(residuals(lm_fit2), main = "", xlab = "Residuals")

plot(reduced_dat$age[!is.na(reduced_dat$age)], residuals(lm_fit2), xlab = "Age",
     ylab = "Residuals")

par(mfrow = c(1,2))
qqnorm(residuals(lm_fit), pch = 1, frame = FALSE)
qqline(residuals(lm_fit), col = "steelblue", lwd = 2)
qqnorm(residuals(lm_fit2), pch = 1, frame = FALSE)
qqline(residuals(lm_fit2), col = "steelblue", lwd = 2)

Although the model based on the transformed outcome performs better in terms of underlying normality assumption, the presence of digit preference stil induces deviations from normality as well as problems with the assumption of homoscedasticity (constancy of variance of the error terms). Moreover, the explanatory power of the ‘final model’ (lm_fit2) is very low indicating that the predictive ability of the set of covariates is low. The problem regarding digit preference will not be discussed further in this tutorial, while the issue about the predictive power of the imputation model is partly remediated by considering the full conditional specification (FCS) approach in the mice package. More specifically, the FCS approach will rely on the information about all covariates in the construction of the imputation model for the primary endpoint.

9. How would you extend the approach to perform multiple imputation?

Consider random generation of error terms (random noise);

M = 10
d_imp = plot(density(all_dat$risk_score, na.rm = T), main = "",
             xlab = "Risk score", ylab = "Density plot", lwd = 3, col = 1)
for (imp_id in 1:M){
  imp_set_old = paste0("imp_dataset",imp_id)
  imp_set_name = paste0("imp_dataset_full",imp_id)## Without transformation
  ##------------------------
  #znew = predict(lm_fit, newdata = all_dat) + rnorm(n, mean = 0,
  # sd = sigma(lm_fit))
  ## With transformation
  ##---------------------
  znew = exp(predict(lm_fit2, newdata = all_dat) + rnorm(n, mean = 0,
                                                         sd = sigma(lm_fit2))-1)
  znew[is.na(all_dat$risk_score) == F] =
    all_dat$risk_score[is.na(all_dat$risk_score) == F]
  znew[get(imp_set_old)$ynew == 1] = 0
  lines(density(znew), col = imp_id + 1, lwd = 2)
  assign(imp_set_name, cbind(get(imp_set_old), znew))
}

b. Random generation of the asymptotic distribution of the estimators of the model parameters in combination with random noise

M = 10
d_imp = plot(density(all_dat$risk_score, na.rm = T), main = "",
             xlab = "Risk score", ylab = "Density plot", lwd = 3, col = 1)
for (imp_id in 1:M){
  imp_set_old = paste0("imp_dataset",imp_id)
  imp_set_name = paste0("imp_dataset_full2",imp_id)
  ## Without transformation
  ##------------------------
  #sm_values = summary(lm_fit)$coefficients
  #new_coef = rnorm(nrow(sm_values), mean = sm_values[,1], sd = sm_values[,2])
  #new_pred = apply(model.matrix.lm(lm_fit, data = all_dat,
  # na.action = "na.pass")%*%new_coef, 1, sum) +
  # rnorm(n, mean = 0, sd = sigma(lm_fit))
  #znew = new_pred
  ## With transformation
  ##---------------------
  sm_values = summary(lm_fit2)$coefficients
  new_coef = rnorm(nrow(sm_values), mean = sm_values[,1], sd = sm_values[,2])
  new_pred = apply(model.matrix.lm(lm_fit2, data = all_dat,
                                   na.action = "na.pass")%*%new_coef, 1, sum) +
    rnorm(n, mean = 0, sd = sigma(lm_fit2))
  znew = exp(new_pred)-1
  znew[is.na(all_dat$risk_score) == F] =
    all_dat$risk_score[is.na(all_dat$risk_score) == F]
  znew[get(imp_set_old)$ynew == 1] = 0
  lines(density(znew), col = imp_id + 1, lwd = 2)
  assign(imp_set_name, cbind(get(imp_set_old), znew))
}

1. Construct M = 10 imputation sets and estimate whether there exists a significant difference in average baseline risk perception score among AGYW living in Cosmocity versus those living in Fourways. How would you combine the results of the different analyses?

## MI analyses
##-------------
coef_imp = vector()
se_imp = vector()
for (imp_id in 1:M){
  imp_set_name = paste0("imp_dataset_full2",imp_id)
  glm_fit_imp = glm(ynew ~ age + factor(educational_level) +
                      factor(residential_area) + monthly_income,
                    data = get(imp_set_name), family = binomial(link = "logit"))
  sm_glm_imp = summary(glm_fit_imp)$coefficients
  cov_id = which(rownames(sm_glm_imp) == "factor(residential_area)Fourways")
  coef_imp[imp_id] = sm_glm_imp[cov_id,1]
  se_imp[imp_id] = sm_glm_imp[cov_id,2]
}
mean_estimate = mean(coef_imp); mean_estimate

## [1] 0.2410253

between_var = var(coef_imp); between_var

## [1] 0.009113935

within_var = mean(se_imp**2); within_var

## [1] 0.07136779

total_var = within_var + (1 + 1/M)*between_var; total_var

## [1] 0.08139312

### Odds ratio estimation
###-----------------------
OR = exp(mean_estimate); OR

## [1] 1.272553

### Hypothesis problem H0: beta = 0; H1: beta != 0
###------------------------------------------------
wald_test_statistic = mean_estimate/sqrt(total_var);
wald_test_statistic

## [1] 0.8448288

n = nrow(imp_dataset_full21)
k = nrow(sm_glm_imp)
lambda = (between_var + (between_var/M))/total_var
df_old = (M - 1)/(lambda*lambda)
df_obs = (((n - k) + 1)/((n - k) + 3)) * ((n - k)*(1 - lambda))
df_adj = (df_old*df_obs)/(df_old + df_obs)
### Calculation of the two-sided p-value
###--------------------------------------
p_old = 2*pt(wald_test_statistic, df = df_old, lower.tail = FALSE)
p_adj = 2*pt(wald_test_statistic, df = df_adj, lower.tail = FALSE)
p_old; p_adj

## [1] 0.398547

## [1] 0.3986524

### 95% CI for the odds ratio
###---------------------------
logOR = mean_estimate;
alpha = 0.05
df = df_adj
cv = qt(1-alpha/2, df = df)
ll_logOR = logOR - cv*sqrt(total_var)
ul_logOR = logOR + cv*sqrt(total_var)
ll_OR = exp(ll_logOR)
ul_OR = exp(ul_logOR)
print(c(OR, ll_OR, ul_OR))

## [1] 1.2725532 0.7264102 2.2293072

Based on the unadjusted (old) and adjusted p-value (< 0.05), we do not reject the null hypothesis (beta = 0) at 5% s.l. implying that there does not exist a significant difference in probability of reporting zero perceived risk between Fourways and Cosmocity AGYW (reference category). Alternatively, the $pooled odds ratio (after multiple imputation) is found to be NOT significantly different from one given that the 95% CI for the pooled OR contains value one.

NOTE: the use of Rubin’s rules is based on the assumption of asymptotic normality of the quantity of interest. If this assumption is violated one can not rely on the Wald test statistic (and corresponding t-distribution).

#11. Use the MICE package to perform multiple imputation for the DREAMS dataset at hand.

MICE is based on Full Conditional Specification meaning that the default setting is to use all available information (read for all variables) to impute missing observations in a sequential manner. More specifically, the algorithm starts with imputing a specific variable based on the available information regarding all other variables in the dataset and subsequently performs imputation on the ‘next’ variable until a complete dataset is retained. For specific details concerning the MICE algorithm, please have a look at the MICE documentation or the book by Van Buuren (Van Buuren, 2011).

Here under, an easy example concerning the use of the MICE package is given. Note that we start here from a subset of the entire DREAMS dataset, which is strictly speaking not necessary, but for illustration purposes we consider this more convenient. As mentioned previously, the imputation model for the perceived risk score still suffers from the fact that the data are subject to digit preference and the lack of predictive power for the subset of covariates considered in the imputation model. Under FCS, we are able to improve the prediction model, at least under the assumption of data being missing at random.

## Define the dataset to be used
##-------------------------------
mice_dat = subset(all_dat, select = c("age", "educational_level",
"residential_area", "household_members",
"household_water", "electricity_at_home",
"monthly_income","alcohol_use","sti",
"discuss_health_issue","hiv_talk",
"ever_tested_for_hiv","sexual_partners",
"physical_violence","preventing_pregnancy",
"risk_score"))

## Perform multiple imputation using the mice function
##-----------------------------------------------------
imp = mice(mice_dat, m = 10, maxit = 10, printFlag = FALSE, seed = 123)

## Warning: Number of logged events: 12

summary(imp)

## Class: mids
## Number of multiple imputations:  10 
## Imputation methods:
##                  age    educational_level     residential_area 
##                   ""                   ""                   "" 
##    household_members      household_water  electricity_at_home 
##                   ""                   ""                   "" 
##       monthly_income          alcohol_use                  sti 
##                   ""                   ""                   "" 
## discuss_health_issue             hiv_talk  ever_tested_for_hiv 
##                   ""                   ""                   "" 
##      sexual_partners    physical_violence preventing_pregnancy 
##                   ""                   ""                   "" 
##           risk_score 
##                "pmm" 
## PredictorMatrix:
##                     age educational_level residential_area household_members
## age                   0                 0                0                 0
## educational_level     1                 0                0                 0
## residential_area      1                 0                0                 0
## household_members     1                 0                0                 0
## household_water       1                 0                0                 0
## electricity_at_home   1                 0                0                 0
##                     household_water electricity_at_home monthly_income
## age                               0                   0              1
## educational_level                 0                   0              1
## residential_area                  0                   0              1
## household_members                 0                   0              1
## household_water                   0                   0              1
## electricity_at_home               0                   0              1
##                     alcohol_use sti discuss_health_issue hiv_talk
## age                           0   0                    0        0
## educational_level             0   0                    0        0
## residential_area              0   0                    0        0
## household_members             0   0                    0        0
## household_water               0   0                    0        0
## electricity_at_home           0   0                    0        0
##                     ever_tested_for_hiv sexual_partners physical_violence
## age                                   0               1                 0
## educational_level                     0               1                 0
## residential_area                      0               1                 0
## household_members                     0               1                 0
## household_water                       0               1                 0
## electricity_at_home                   0               1                 0
##                     preventing_pregnancy risk_score
## age                                    0          1
## educational_level                      0          1
## residential_area                       0          1
## household_members                      0          1
## household_water                        0          1
## electricity_at_home                    0          1
## Number of logged events:  12 
##   it im dep     meth                 out
## 1  0  0     constant   educational_level
## 2  0  0     constant    residential_area
## 3  0  0     constant   household_members
## 4  0  0     constant     household_water
## 5  0  0     constant electricity_at_home
## 6  0  0     constant         alcohol_use

Here we use the default method for imputation depending on the type of variable in the dataset. By default, the method uses (1) pmm, predictive mean matching for numeric data, (2) logreg, logistic regression imputation for binary data, (3) polyreg, polytomous regression imputation for unordered categorical data (factor > 2 levels) and (4) polr, proportional odds model for (ordered, > 2 levels).

NOTE: the default method is not necessarily the best option, please verify the underlying assumptions of each of these models in the context of the different variables that require imputation.

## Analyse the imputed dataset
##-----------------------------
m1 <- with(data = imp, expr = lm(log(risk_score + 1) ~ age +
factor(educational_level) +
factor(residential_area) +
  factor(household_water) +
monthly_income +
factor(alcohol_use) +
sti +
hiv_talk +
ever_tested_for_hiv +
sexual_partners +
physical_violence +
preventing_pregnancy))

pooled_analysis <- pool(m1)
pooled_analysis

## Class: mipo    m = 10 
##                                                                    term  m
## 1                                                           (Intercept) 10
## 2                                                                   age 10
## 3                     factor(educational_level)Completed primary school 10
## 4                          factor(educational_level)No formal schooling 10
## 5                factor(educational_level)Some high/secondary schooling 10
## 6                          factor(educational_level)Some primary school 10
## 7                           factor(educational_level)Tertiary education 10
## 8                                     factor(residential_area)Diepsloot 10
## 9                                      factor(residential_area)Fourways 10
## 10                             factor(residential_area)Kyasand/pipeline 10
## 11                                      factor(residential_area)Msawawa 10
## 12                                        factor(residential_area)Other 10
## 13                                    factor(household_water)Public tap 10
## 14                                   factor(household_water)Tap in yard 10
## 15                              factor(household_water)Tap inside house 10
## 16                                                       monthly_income 10
## 17                                     factor(alcohol_use)I never drink 10
## 18     factor(alcohol_use)I only drink with friends but never get drunk 10
## 19 factor(alcohol_use)I sometimes get drunk when going out with friends 10
## 20                                                               stiYes 10
## 21                                                          hiv_talkYes 10
## 22                                               ever_tested_for_hivYes 10
## 23                                                      sexual_partners 10
## 24                                                 physical_violenceYes 10
## 25                                              preventing_pregnancyYes 10
##         estimate         ubar            b            t dfcom         df
## 1   1.389681e+00 6.136016e-01 4.325889e-01 1.089449e+00  1818   45.09594
## 2   2.951896e-03 2.897828e-04 2.695790e-04 5.863196e-04  1818   33.85746
## 3  -1.391368e-02 8.176015e-02 3.821599e-02 1.237977e-01  1818   73.28416
## 4   2.153583e+00 2.392759e+00 1.676663e-02 2.411203e+00  1818 1781.24509
## 5   5.738824e-02 6.856657e-03 1.346487e-03 8.337793e-03  1818  239.47074
## 6  -1.147794e-01 1.110010e-01 7.631146e-02 1.949436e-01  1818   46.36316
## 7  -1.125283e-02 1.046706e-02 3.523574e-03 1.434300e-02  1818  112.75897
## 8   1.018402e-01 1.755289e-02 1.407315e-03 1.910093e-02  1818  752.42008
## 9  -8.781784e-02 6.350173e-02 5.489005e-03 6.953964e-02  1818  694.12003
## 10  1.057038e-01 4.423044e-02 1.174809e-02 5.715334e-02  1818  156.44192
## 11 -5.694351e-02 4.981363e-02 9.611845e-03 6.038666e-02  1818  245.47278
## 12 -4.045472e-02 2.211832e-02 1.896585e-03 2.420456e-02  1818  700.25415
## 13  1.779829e-01 2.650839e-01 1.315794e-01 4.098213e-01  1818   67.97979
## 14  1.493264e-01 2.680807e-01 1.364533e-01 4.181792e-01  1818   65.90311
## 15  5.540675e-02 2.742222e-01 1.493008e-01 4.384530e-01  1818   60.71816
## 16 -9.312449e-06 3.520037e-11 7.893914e-12 4.388368e-11  1818  198.53790
## 17 -2.228877e-01 1.486809e-01 3.856259e-02 1.910998e-01  1818  161.74963
## 18 -1.723853e-01 1.536104e-01 3.671404e-02 1.939958e-01  1818  181.46439
## 19 -2.656930e-01 1.532982e-01 4.331472e-02 2.009444e-01  1818  143.49933
## 20  2.493106e-01 1.009886e-02 2.433675e-03 1.277590e-02  1818  179.36845
## 21  5.433033e-02 1.528497e-02 2.475518e-03 1.800804e-02  1818  313.53897
## 22  2.089980e-01 3.173274e-02 4.127311e-03 3.627279e-02  1818  421.92150
## 23  9.221425e-02 2.410156e-03 2.722489e-04 2.709630e-03  1818  505.99159
## 24  1.546882e-01 3.544881e-02 5.187693e-03 4.115527e-02  1818  360.29597
## 25  4.990773e-02 6.543927e-03 1.565009e-03 8.265438e-03  1818  181.30727
##            riv     lambda         fmi
## 1  0.775499696 0.43677828 0.460199038
## 2  1.023307522 0.50575976 0.532578767
## 3  0.514157474 0.33956671 0.356881795
## 4  0.007707958 0.00764900 0.008761348
## 5  0.216014196 0.17764118 0.184424339
## 6  0.756232889 0.43059943 0.453669295
## 7  0.370297882 0.27023167 0.282840079
## 8  0.088193244 0.08104557 0.083478535
## 9  0.095082529 0.08682682 0.089446661
## 10 0.292172123 0.22610929 0.235816785
## 11 0.212251735 0.17508883 0.181728679
## 12 0.094321948 0.08619214 0.088790936
## 13 0.546005758 0.35317188 0.371397576
## 14 0.559900849 0.35893361 0.377541377
## 15 0.598896928 0.37456882 0.394199990
## 16 0.246682197 0.19787095 0.205831036
## 17 0.285301190 0.22197224 0.231417216
## 18 0.262908307 0.20817688 0.216761982
## 19 0.310807253 0.23711133 0.247526241
## 20 0.265083709 0.20953847 0.218207314
## 21 0.178153438 0.15121412 0.156577038
## 22 0.143071204 0.12516386 0.129281496
## 23 0.124254938 0.11052203 0.114017090
## 24 0.160977521 0.13865688 0.143398706
## 25 0.263069869 0.20827816 0.216869486

Difference in mean perceived risk score for AGYW living in Fourways versus Cosmocity (based on the univariate Wald test discussed above)

pooled_analysis$pooled$estimate[9]

## [1] -0.08781784

pvalue = 2*pt(pooled_analysis$pooled$t[9], df = pooled_analysis$pooled$df[9],
lower.tail = FALSE)
pvalue

## [1] 0.9445801

12. Explain the difference between MICE and Multivariate Normal Imputation? Which method is to be preferred?

MICE is an iterative imputation method that imputes missing values by modeling each variable with missing data as a function of other variables in the dataset. In each iteration, the missing values in one variable are imputed based on the available data in the other variables. This process is repeated until convergence, and multiple imputed datasets are generated. MICE is a flexible method that can handle different types of missing data patterns and can incorporate a range of statistical models to impute missing values.

On the other hand, Multivariate Normal Imputation is a method that assumes that the data follows a multivariate normal distribution, and the missing values are imputed by sampling from the multivariate normal distribution estimated from the observed data. This method requires the assumption that the data is normally distributed, and it is sensitive to outliers in the data.

The choice between MICE and Multivariate Normal Imputation depends on the specific characteristics of the dataset and the research question. MICE is generally preferred over Multivariate Normal Imputation because it is more flexible and can handle a wider range of data patterns and missing data mechanisms. However, if the data is known to be normally distributed, Multivariate Normal Imputation can be a simple and effective approach. Ultimately, the best approach will depend on the specific research question and the nature of the missing data in the dataset.

13. What is the difference between predictive mean matching (pmm) and regression imputation considered in the previous steps? What are the advantages and disadvantages of both approaches?

Predictive mean matching (PMM) and regression imputation are two commonly used approaches for imputing missing data in regression analysis.

PMM is a non-parametric imputation method that is based on a matching procedure. In PMM, the observed values of a predictor variable (or variables) are used to predict the missing value of the response variable. The imputed value is then selected from the observed response values that have a predicted value closest to the predicted value of the missing value. PMM is useful when the distribution of the response variable is skewed or has outliers and can preserve the distribution of the response variable better than other imputation methods.

Regression imputation, on the other hand, is a parametric imputation method that is based on fitting a regression model to the observed data and then using the estimated model to impute the missing values. The imputed values are the predictions of the response variable based on the observed values of the predictor variables. Regression imputation assumes a linear relationship between the predictor and response variables and can be sensitive to outliers.

The advantages of PMM are that it is a non-parametric method and can preserve the distribution of the response variable. It can also handle nonlinear relationships between the predictor and response variables. However, PMM can be computationally intensive, especially when dealing with large datasets, and may not work well with highly skewed or heavy-tailed distributions.

The advantages of regression imputation are that it is a simple and fast method that can handle large datasets with many missing values. It is also useful when the relationship between the predictor and response variables is well understood and linear. However, regression imputation assumes a linear relationship between the predictor and response variables and can be sensitive to outliers and misspecification of the model.

In summary, the choice between PMM and regression imputation will depend on the characteristics of the dataset and the research question. PMM may be more appropriate when the distribution of the response variable is skewed or when there are nonlinear relationships between the predictor and response variables, while regression imputation may be more appropriate when the relationship between the predictor and response variables is well understood and linear.

The MICE (Multivariate Imputation by Chained Equations) package is a flexible imputation method that can be used with a variety of imputation models, including both predictive mean matching (PMM) and regression imputation. In fact, MICE can be used with a range of imputation methods, including logistic regression, random forests, and others, depending on the data and the research question.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Multiple imputattion

Khanh Huyen

2023-02-19