Loading package ———————————————————
pacman::p_load(
rio, # File import
here, # File locator
janitor, # adding totals and percents to tables
tidyverse,
VIM,
mice,
ggplot2
)
Backgound
An adequate level of risk perception is critical to adopt a protective behavior concerning HIV acquisition. The purpose of this study is to identify factors associated with HIV risk perception among AGYW. Here, we will focus on risk perception at baseline and which factors are associated therewith. Data description: - Outcome: risk_score - Risk factors: age + factor(educational_level) + factor(residential_area) + monthly_income
## Warning: NAs introduced by coercion
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 10.00 16.48 20.00 999.00 531
## Length Class Mode
## 2200 character character
## Length Class Mode
## 2200 character character
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2500 4000 5321 6000 100000
Primary endpoint
The primary endpoint of this analysis was self-assessed risk of acquiring HIV infection. At baseline, participants were asked “What do you think the chance is that you will get HIV in the next six months?”, thereby expressing this on a scale ranging from 0 to 100, with 100 being the maximum value. 1. How many participants are included in the study? How many variables?
n <- nrow(all_dat)
n
## [1] 2200
tab_primary = table(all_dat$risk_score)
all_dat$risk_score = as.numeric(all_dat$risk_score)
tab_score = table(all_dat$risk_score)
rel_freq = tab_score/n
h_primary = hist(all_dat$risk_score, xlim = c(0,100), breaks = 50, main = "",
xlab = "Risk score (between 0 and 100)")

# lines(as.numeric(names(tab_score)), tab_score, col = 2, lwd = 2)
# text(xgrid[-length(4)] + 0.5, tab_score + 25,
# paste(round(rel_freq*100,1),"%"), cex = 0.8)
- What is the age distribution of individuals in the DREAMS study?
Skew distribution
tab_age = table(all_dat$age)
rel_freq = tab_age/n
xgrid = seq(min(all_dat$age)-0.5,max(all_dat$age)+0.5,1)
h = hist(all_dat$age, main = "The age distribution of individuals in the DREAMS study", ylab = "Frequency", xlab = "Age",
breaks = xgrid, ylim = c(0,550))
lines(as.numeric(names(tab_age)), tab_age, col = 2, lwd = 2)
text(xgrid[-length(xgrid)] + 0.5, tab_age + 25,
paste(round(rel_freq*100,1),"%"), cex = 0.8)

Missing data
- For some of the variables, missing observations are coded as 999 instead of NA. First recode missing observations as NA. NOTE: for the primary endpoint, some observations are equal to 777 which requires recording to NA as well.
all_dat$risk_score[all_dat$risk_score == 777 | all_dat$risk_score == 999] = NA
n_recodings = apply(all_dat, 2, function(x){sum(x == 999, na.rm = T)})
all_dat[which(all_dat == 999, arr.ind = T)] = NA
# Number of missing observations for the primary endpoint
sum(is.na(all_dat$risk_score))
## [1] 535
- Explore the level of missing data in the data set by calculating the frequency of missing observations for each variable
n_missing = apply(all_dat, 2, FUN = function(x){sum(is.na(x))})
perc_missing = n_missing/nrow(all_dat)
perc_missing
## study_id age
## 0.0000000000 0.0000000000
## studying educational_level
## 0.0000000000 0.0000000000
## country_of_origin mother_alive
## 0.0000000000 0.0000000000
## mother_status father_alive
## 0.0000000000 0.0000000000
## father_status residential_area
## 0.0000000000 0.0000000000
## household_members household_water
## 0.0000000000 0.0000000000
## electricity_at_home no_food_in_household
## 0.0000000000 0.0000000000
## not_enough_food whole_day_without_food
## 0.0000000000 0.0000000000
## monthly_income steady_income
## 0.0000000000 0.0000000000
## government_grant r1000_in_personal_accoount
## 0.0000000000 0.0000000000
## dependants financial_support
## 0.0000000000 0.0000000000
## control_on_household_money away_from_home
## 0.0000000000 0.0004545455
## have_children how_many_children
## 0.0000000000 0.4559090909
## kids_living_with ever_been_pregnant
## 0.4513636364 0.0000000000
## currently_pregnant want_to_have_more_children
## 0.3263636364 0.1618181818
## preventing_pregnancy method_prevent_pregnancy
## 0.1622727273 0.3936363636
## alcohol_use sti
## 0.0000000000 0.0000000000
## discuss_health_issue hiv_talk
## 0.0000000000 0.0000000000
## hiv_talk_mostly hiv_on_social_media
## 0.1009090909 0.0000000000
## hiv_info_internet ever_tested_for_hiv
## 0.0000000000 0.0000000000
## last_hiv_test first_sexual_encounter
## 0.0445454545 0.0000000000
## sexual_partners how_many_intercourses
## 0.0000000000 0.0000000000
## sex_after_alcohol sex_with_drunk_partner
## 0.0000000000 0.0000000000
## sex_for_service_or_goods sex_for_money
## 0.0000000000 0.0000000000
## physical_violence physical_violence_by_partner
## 0.0000000000 0.0000000000
## sexual_violence relationship_status
## 0.0000000000 0.0000000000
## current_sexual_partners risk_score
## 0.0009090909 0.2431818182
## sex_partners_past_6_mos risk_cat_bl
## 0.8268181818 0.0000000000
## risk_cat agediff
## 0.2404545455 0.0000000000
## age5diff total_relationship_yrs
## 0.0000000000 0.0000000000
#Method 2
missing_vars <- names(which(colSums(is.na(all_dat)) > 0))
data_with_missing <- all_dat[missing_vars]
missing_freq <- colSums(is.na(data_with_missing))
# Calculate the percentage of missing observations for each variable
missing_pct <- 100 * missing_freq / nrow(all_dat)
# Create a table with variable names, missing frequencies, and missing percentages
missing_table <- data.frame(
Missing_Frequency = missing_freq,
Missing_Percentage = missing_pct
)
# Print the table
missing_table
## Missing_Frequency Missing_Percentage
## away_from_home 1 0.04545455
## how_many_children 1003 45.59090909
## kids_living_with 993 45.13636364
## currently_pregnant 718 32.63636364
## want_to_have_more_children 356 16.18181818
## preventing_pregnancy 357 16.22727273
## method_prevent_pregnancy 866 39.36363636
## hiv_talk_mostly 222 10.09090909
## last_hiv_test 98 4.45454545
## current_sexual_partners 2 0.09090909
## risk_score 535 24.31818182
## sex_partners_past_6_mos 1819 82.68181818
## risk_cat 529 24.04545455
- How many participants have at least one missing observation for the recorded variables?
n_subjects_missing_values = sum(apply(all_dat, 1, function(x){sum(is.na(x))}) > 0)
n_subjects_missing_values
## [1] 2071
# Method 2
# Calculate the number of missing observations for each participant
missing_obs <- rowSums(is.na(all_dat))
# Count the number of participants with at least one missing observation
n_missing_participants <- sum(missing_obs > 0)
# Print the result
n_missing_participants
## [1] 2071
- Alternatively, provide a graphical exploration of the amount of missing data by variable using the function aggr in the R package VIM.
aggr_plot <- aggr(all_dat, col=c('navyblue','yellow'), numbers=TRUE, sortVars=TRUE,
labels=names(all_dat), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
## Warning in plot.aggr(res, ...): not enough vertical space to display
## frequencies (too many combinations)

##
## Variables sorted by number of missings:
## Variable Count
## sex_partners_past_6_mos 0.8268181818
## how_many_children 0.4559090909
## kids_living_with 0.4513636364
## method_prevent_pregnancy 0.3936363636
## currently_pregnant 0.3263636364
## risk_score 0.2431818182
## risk_cat 0.2404545455
## preventing_pregnancy 0.1622727273
## want_to_have_more_children 0.1618181818
## hiv_talk_mostly 0.1009090909
## last_hiv_test 0.0445454545
## current_sexual_partners 0.0009090909
## away_from_home 0.0004545455
## study_id 0.0000000000
## age 0.0000000000
## studying 0.0000000000
## educational_level 0.0000000000
## country_of_origin 0.0000000000
## mother_alive 0.0000000000
## mother_status 0.0000000000
## father_alive 0.0000000000
## father_status 0.0000000000
## residential_area 0.0000000000
## household_members 0.0000000000
## household_water 0.0000000000
## electricity_at_home 0.0000000000
## no_food_in_household 0.0000000000
## not_enough_food 0.0000000000
## whole_day_without_food 0.0000000000
## monthly_income 0.0000000000
## steady_income 0.0000000000
## government_grant 0.0000000000
## r1000_in_personal_accoount 0.0000000000
## dependants 0.0000000000
## financial_support 0.0000000000
## control_on_household_money 0.0000000000
## have_children 0.0000000000
## ever_been_pregnant 0.0000000000
## alcohol_use 0.0000000000
## sti 0.0000000000
## discuss_health_issue 0.0000000000
## hiv_talk 0.0000000000
## hiv_on_social_media 0.0000000000
## hiv_info_internet 0.0000000000
## ever_tested_for_hiv 0.0000000000
## first_sexual_encounter 0.0000000000
## sexual_partners 0.0000000000
## how_many_intercourses 0.0000000000
## sex_after_alcohol 0.0000000000
## sex_with_drunk_partner 0.0000000000
## sex_for_service_or_goods 0.0000000000
## sex_for_money 0.0000000000
## physical_violence 0.0000000000
## physical_violence_by_partner 0.0000000000
## sexual_violence 0.0000000000
## relationship_status 0.0000000000
## risk_cat_bl 0.0000000000
## agediff 0.0000000000
## age5diff 0.0000000000
## total_relationship_yrs 0.0000000000
Missing data techniques and questions
1. Which missingness mechanisms do exist? Explain.
Missing Completely At Random (MCAR): If the missingness is evenly distributed across all values of the variable, the missingness mechanism may be MCAR. This means that the missingness is not related to the observed or unobserved values of the variable or any other variables in the data set. In the aggr plot, an MCAR pattern would show random yellow bars across all values of the variable. –> Not MCAR
Missing At Random (MAR): If the missingness is related to the observed values of other variables in the data set, the missingness mechanism may be MAR. This means that the probability of missingness depends on the observed values of other variables in the data set, but not on the unobserved values of the variable itself. In the aggr plot, an MAR pattern would show non-random yellow bars for certain values of the variable, depending on the observed values of other variables in the data set.
Missing Not At Random (MNAR): If the missingness is related to the unobserved values of the variable, the missingness mechanism may be MNAR. This means that the probability of missingness depends on the unobserved values of the variable itself or other variables in the data set. In the aggr plot, an MNAR pattern would show non-random yellow bars for certain values of the variable, depending on the unobserved values of the variable itself or other variables in the data set.
–> Maybe MAR or MNAR
2. Formulate an imputation model for the (pseudo-)continuous endpoint Z defined as the risk perception score (on a range of 0 – 100). What about the excess number of zero observations?
The presence of many zeros in the outcome variable can lead to inflated or deflated regression coefficients, depending on the specific model assumptions.
Step 1: Choose a method: There are several methods for imputing missing values, including mean imputation, regression imputation, and multiple imputation. In this case, a regression imputation method could be appropriate, as we want to model the relationship between Z and the predictor variables.
lm_fit1 = glm(risk_score ~ age + factor(educational_level) +
factor(residential_area) + monthly_income, data = all_dat)
summary(lm_fit1)
##
## Call:
## glm(formula = risk_score ~ age + factor(educational_level) +
## factor(residential_area) + monthly_income, data = all_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -24.199 -13.681 -5.067 5.428 86.853
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.105e+01 5.162e+00
## age 1.339e-01 2.234e-01
## factor(educational_level)Completed primary school -2.986e+00 3.865e+00
## factor(educational_level)No formal schooling 1.323e+01 1.347e+01
## factor(educational_level)Some high/secondary schooling 7.670e-01 1.092e+00
## factor(educational_level)Some primary school -5.813e+00 5.310e+00
## factor(educational_level)Tertiary education 4.507e-01 1.254e+00
## factor(residential_area)Diepsloot 1.232e+00 1.545e+00
## factor(residential_area)Fourways 2.625e+00 2.638e+00
## factor(residential_area)Kyasand/pipeline 3.653e+00 2.504e+00
## factor(residential_area)Msawawa -6.355e-01 2.796e+00
## factor(residential_area)Other -7.215e-01 1.800e+00
## monthly_income -1.296e-04 7.654e-05
## t value Pr(>|t|)
## (Intercept) 2.141 0.0324 *
## age 0.599 0.5492
## factor(educational_level)Completed primary school -0.773 0.4399
## factor(educational_level)No formal schooling 0.982 0.3260
## factor(educational_level)Some high/secondary schooling 0.703 0.4825
## factor(educational_level)Some primary school -1.095 0.2738
## factor(educational_level)Tertiary education 0.359 0.7194
## factor(residential_area)Diepsloot 0.797 0.4253
## factor(residential_area)Fourways 0.995 0.3199
## factor(residential_area)Kyasand/pipeline 1.459 0.1448
## factor(residential_area)Msawawa -0.227 0.8202
## factor(residential_area)Other -0.401 0.6886
## monthly_income -1.694 0.0905 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 355.6833)
##
## Null deviance: 592378 on 1664 degrees of freedom
## Residual deviance: 587589 on 1652 degrees of freedom
## (535 observations deleted due to missingness)
## AIC: 14520
##
## Number of Fisher Scoring iterations: 2
## Model diagnostics (normality)
##-------------------------------
qqnorm(residuals(lm_fit1), pch = 1, frame = FALSE)
qqline(residuals(lm_fit1), col = "steelblue", lwd = 2)

## Linear regression model with transformation
##------------------------------------------------
lm_fit2 = glm(log(risk_score + 1) ~ age + factor(educational_level) + #to avoid taking the log of zero or negative values
factor(residential_area) + monthly_income, data = all_dat)
summary(lm_fit2)
##
## Call:
## glm(formula = log(risk_score + 1) ~ age + factor(educational_level) +
## factor(residential_area) + monthly_income, data = all_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0619 -1.7277 0.5179 1.1975 2.9059
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.876e+00 4.148e-01
## age -4.932e-03 1.796e-02
## factor(educational_level)Completed primary school -2.068e-01 3.106e-01
## factor(educational_level)No formal schooling 1.191e+00 1.083e+00
## factor(educational_level)Some high/secondary schooling 1.128e-01 8.774e-02
## factor(educational_level)Some primary school -5.101e-01 4.268e-01
## factor(educational_level)Tertiary education -1.348e-02 1.008e-01
## factor(residential_area)Diepsloot 1.218e-01 1.242e-01
## factor(residential_area)Fourways -4.595e-02 2.120e-01
## factor(residential_area)Kyasand/pipeline 1.744e-01 2.013e-01
## factor(residential_area)Msawawa -1.398e-01 2.247e-01
## factor(residential_area)Other -7.376e-02 1.447e-01
## monthly_income -7.442e-06 6.151e-06
## t value Pr(>|t|)
## (Intercept) 4.522 6.57e-06 ***
## age -0.275 0.784
## factor(educational_level)Completed primary school -0.666 0.506
## factor(educational_level)No formal schooling 1.100 0.271
## factor(educational_level)Some high/secondary schooling 1.285 0.199
## factor(educational_level)Some primary school -1.195 0.232
## factor(educational_level)Tertiary education -0.134 0.894
## factor(residential_area)Diepsloot 0.981 0.327
## factor(residential_area)Fourways -0.217 0.828
## factor(residential_area)Kyasand/pipeline 0.866 0.386
## factor(residential_area)Msawawa -0.622 0.534
## factor(residential_area)Other -0.510 0.610
## monthly_income -1.210 0.227
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.297046)
##
## Null deviance: 3831.5 on 1664 degrees of freedom
## Residual deviance: 3794.7 on 1652 degrees of freedom
## (535 observations deleted due to missingness)
## AIC: 6124.7
##
## Number of Fisher Scoring iterations: 2
## Model diagnostics (normality)
##-------------------------------
qqnorm(residuals(lm_fit2), pch = 1, frame = FALSE)
qqline(residuals(lm_fit2), col = "steelblue", lwd = 2)

Step 2: Impute missing values: Finally, use the trained model to impute missing values for Z based on the values of the predictor variables for each observation with missing data.
predictors <- all_dat %>% select(age, educational_level,
residential_area, monthly_income)
predictors = as.matrix(predictors)
predictions <- predict(lm_fit1, all_dat)
# We can evaluate the performance of our model using the root mean squared error (RMSE) on the training data
rmse <- sqrt(mean((all_dat$risk_score - predictions)^2))
print(paste("RMSE:", rmse))
## [1] "RMSE: NA"
# imputed_data <- mice(all_dat, m=5, method="pmm", pred=predictors)
# complete_data <- complete(imputed_data, 1)
- Consider the random variable Y defined as having a risk score equal to zero (no risk of HIV acquisition). Construct an imputation model for Y, under the assumption of missing at random (MAR), and depending on the covariates age, educational level, residential area, and monthly income. What type of model will you consider?
Logistic resgression model
Step 1: Define Y = risk perception score equal to zero. Y=1 - risk perception score equal to zero, Y=0 - risk perception score is not equal to zero
all_dat$y1 = as.numeric(all_dat$risk_score == 0)
table(all_dat$y1)
##
## 0 1
## 1051 614
sum(is.na(all_dat$y1))
## [1] 535
Step 2: Logistic regression model
glm_fit = glm(y1 ~ age + factor(educational_level) + factor(residential_area) +
monthly_income, data = all_dat,
family = "binomial"(link = logit))
summary(glm_fit)
##
## Call:
## glm(formula = y1 ~ age + factor(educational_level) + factor(residential_area) +
## monthly_income, family = binomial(link = logit), data = all_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5278 -0.9541 -0.8862 1.3584 1.5555
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -7.181e-01 5.713e-01
## age 1.283e-02 2.477e-02
## factor(educational_level)Completed primary school 2.061e-01 4.148e-01
## factor(educational_level)No formal schooling -1.331e+01 3.785e+02
## factor(educational_level)Some high/secondary schooling -1.710e-01 1.213e-01
## factor(educational_level)Some primary school 5.908e-01 5.683e-01
## factor(educational_level)Tertiary education -1.993e-03 1.370e-01
## factor(residential_area)Diepsloot -1.566e-01 1.689e-01
## factor(residential_area)Fourways 2.139e-01 2.841e-01
## factor(residential_area)Kyasand/pipeline -1.963e-01 2.796e-01
## factor(residential_area)Msawawa 2.031e-01 3.032e-01
## factor(residential_area)Other 8.328e-02 1.953e-01
## monthly_income 4.637e-06 8.306e-06
## z value Pr(>|z|)
## (Intercept) -1.257 0.209
## age 0.518 0.604
## factor(educational_level)Completed primary school 0.497 0.619
## factor(educational_level)No formal schooling -0.035 0.972
## factor(educational_level)Some high/secondary schooling -1.410 0.159
## factor(educational_level)Some primary school 1.040 0.298
## factor(educational_level)Tertiary education -0.015 0.988
## factor(residential_area)Diepsloot -0.927 0.354
## factor(residential_area)Fourways 0.753 0.452
## factor(residential_area)Kyasand/pipeline -0.702 0.483
## factor(residential_area)Msawawa 0.670 0.503
## factor(residential_area)Other 0.427 0.670
## monthly_income 0.558 0.577
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2192.1 on 1664 degrees of freedom
## Residual deviance: 2177.0 on 1652 degrees of freedom
## (535 observations deleted due to missingness)
## AIC: 2203
##
## Number of Fisher Scoring iterations: 12
6. Repeatedly and multiple timesimpute the missing observations for the random variable Y? Create 10 different imputation sets based on such an approach.
## Multiple imputation
##---------------------
M = 10
set.seed(2504)
for (imp_id in 1:M){
imp_set_name = paste0("imp_dataset",imp_id)
ynew = rbinom(n, size = 1, prob = predict(glm_fit, newdata = all_dat,
type = "response"))
ynew[is.na(all_dat$y1) == F] = all_dat$y1[is.na(all_dat$y1) == F]
assign(imp_set_name, cbind(all_dat, ynew))
}
7. How can we impute the missing values for our primary endpoint Z, conditional on Y being equal to 0 (i.e., implying that the risk perception score differs from zero). Formulate an imputation model for Z depending on age, educational level, residential area, and monthly income. What type of model will you consider?
Linear regression
## Define data with non-zero risk scores
##---------------------------------------
reduced_dat = all_dat[all_dat$risk_score != 0, ]
## Linear regression models
##--------------------------
lm_fit = lm(risk_score ~ age + factor(educational_level) +
factor(residential_area) + monthly_income, data = reduced_dat)
summary(lm_fit)
##
## Call:
## lm(formula = risk_score ~ age + factor(educational_level) + factor(residential_area) +
## monthly_income, data = reduced_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.119 -12.851 -7.539 7.443 78.298
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.528e+01 6.792e+00
## age 3.387e-01 2.910e-01
## factor(educational_level)Completed primary school -2.713e+00 5.262e+00
## factor(educational_level)No formal schooling 2.271e+00 1.389e+01
## factor(educational_level)Some high/secondary schooling -1.121e-01 1.395e+00
## factor(educational_level)Some primary school -4.876e+00 8.081e+00
## factor(educational_level)Tertiary education 8.339e-01 1.640e+00
## factor(residential_area)Diepsloot 8.001e-01 2.015e+00
## factor(residential_area)Fourways 6.701e+00 3.574e+00
## factor(residential_area)Kyasand/pipeline 4.261e+00 3.183e+00
## factor(residential_area)Msawawa 1.058e+00 3.707e+00
## factor(residential_area)Other -4.441e-01 2.381e+00
## monthly_income -1.459e-04 9.493e-05
## t value Pr(>|t|)
## (Intercept) 2.250 0.0247 *
## age 1.164 0.2447
## factor(educational_level)Completed primary school -0.516 0.6062
## factor(educational_level)No formal schooling 0.164 0.8702
## factor(educational_level)Some high/secondary schooling -0.080 0.9359
## factor(educational_level)Some primary school -0.603 0.5464
## factor(educational_level)Tertiary education 0.509 0.6112
## factor(residential_area)Diepsloot 0.397 0.6914
## factor(residential_area)Fourways 1.875 0.0611 .
## factor(residential_area)Kyasand/pipeline 1.339 0.1809
## factor(residential_area)Msawawa 0.286 0.7753
## factor(residential_area)Other -0.186 0.8521
## monthly_income -1.537 0.1246
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.31 on 1038 degrees of freedom
## (535 observations deleted due to missingness)
## Multiple R-squared: 0.01006, Adjusted R-squared: -0.001384
## F-statistic: 0.879 on 12 and 1038 DF, p-value: 0.5682
hist(residuals(lm_fit), main = "", xlab = "Residuals")

plot(reduced_dat$age[!is.na(reduced_dat$age)], residuals(lm_fit), xlab = "Age",
ylab = "Residuals")

lm_fit2 = lm(log(risk_score + 1) ~ age + factor(educational_level) +
factor(residential_area) + monthly_income, data = reduced_dat)
summary(lm_fit2)
##
## Call:
## lm(formula = log(risk_score + 1) ~ age + factor(educational_level) +
## factor(residential_area) + monthly_income, data = reduced_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2777 -0.4986 -0.1094 0.5412 1.7537
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 2.765e+00 2.687e-01
## age 6.023e-03 1.151e-02
## factor(educational_level)Completed primary school -1.013e-01 2.082e-01
## factor(educational_level)No formal schooling -9.313e-02 5.494e-01
## factor(educational_level)Some high/secondary schooling 2.185e-03 5.518e-02
## factor(educational_level)Some primary school -1.723e-01 3.197e-01
## factor(educational_level)Tertiary education -2.020e-02 6.487e-02
## factor(residential_area)Diepsloot 3.200e-02 7.971e-02
## factor(residential_area)Fourways 1.741e-01 1.414e-01
## factor(residential_area)Kyasand/pipeline 7.667e-02 1.259e-01
## factor(residential_area)Msawawa 6.093e-03 1.466e-01
## factor(residential_area)Other -2.753e-02 9.420e-02
## monthly_income -5.909e-06 3.755e-06
## t value Pr(>|t|)
## (Intercept) 10.292 <2e-16 ***
## age 0.523 0.601
## factor(educational_level)Completed primary school -0.486 0.627
## factor(educational_level)No formal schooling -0.170 0.865
## factor(educational_level)Some high/secondary schooling 0.040 0.968
## factor(educational_level)Some primary school -0.539 0.590
## factor(educational_level)Tertiary education -0.311 0.756
## factor(residential_area)Diepsloot 0.401 0.688
## factor(residential_area)Fourways 1.232 0.218
## factor(residential_area)Kyasand/pipeline 0.609 0.543
## factor(residential_area)Msawawa 0.042 0.967
## factor(residential_area)Other -0.292 0.770
## monthly_income -1.574 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7639 on 1038 degrees of freedom
## (535 observations deleted due to missingness)
## Multiple R-squared: 0.006073, Adjusted R-squared: -0.005417
## F-statistic: 0.5286 on 12 and 1038 DF, p-value: 0.8972
hist(residuals(lm_fit2), main = "", xlab = "Residuals")

plot(reduced_dat$age[!is.na(reduced_dat$age)], residuals(lm_fit2), xlab = "Age",
ylab = "Residuals")

par(mfrow = c(1,2))
qqnorm(residuals(lm_fit), pch = 1, frame = FALSE)
qqline(residuals(lm_fit), col = "steelblue", lwd = 2)
qqnorm(residuals(lm_fit2), pch = 1, frame = FALSE)
qqline(residuals(lm_fit2), col = "steelblue", lwd = 2)
Although the model based on the transformed outcome performs better in terms of underlying normality assumption, the presence of digit preference stil induces deviations from normality as well as problems with the assumption of homoscedasticity (constancy of variance of the error terms). Moreover, the explanatory power of the ‘final model’ (lm_fit2) is very low indicating that the predictive ability of the set of covariates is low. The problem regarding digit preference will not be discussed further in this tutorial, while the issue about the predictive power of the imputation model is partly remediated by considering the full conditional specification (FCS) approach in the mice package. More specifically, the FCS approach will rely on the information about all covariates in the construction of the imputation model for the primary endpoint.
9. How would you extend the approach to perform multiple imputation?
- Consider random generation of error terms (random noise);
M = 10
d_imp = plot(density(all_dat$risk_score, na.rm = T), main = "",
xlab = "Risk score", ylab = "Density plot", lwd = 3, col = 1)
for (imp_id in 1:M){
imp_set_old = paste0("imp_dataset",imp_id)
imp_set_name = paste0("imp_dataset_full",imp_id)## Without transformation
##------------------------
#znew = predict(lm_fit, newdata = all_dat) + rnorm(n, mean = 0,
# sd = sigma(lm_fit))
## With transformation
##---------------------
znew = exp(predict(lm_fit2, newdata = all_dat) + rnorm(n, mean = 0,
sd = sigma(lm_fit2))-1)
znew[is.na(all_dat$risk_score) == F] =
all_dat$risk_score[is.na(all_dat$risk_score) == F]
znew[get(imp_set_old)$ynew == 1] = 0
lines(density(znew), col = imp_id + 1, lwd = 2)
assign(imp_set_name, cbind(get(imp_set_old), znew))
}
b. Random generation of the asymptotic distribution of the estimators of the model parameters in combination with random noise
M = 10
d_imp = plot(density(all_dat$risk_score, na.rm = T), main = "",
xlab = "Risk score", ylab = "Density plot", lwd = 3, col = 1)
for (imp_id in 1:M){
imp_set_old = paste0("imp_dataset",imp_id)
imp_set_name = paste0("imp_dataset_full2",imp_id)
## Without transformation
##------------------------
#sm_values = summary(lm_fit)$coefficients
#new_coef = rnorm(nrow(sm_values), mean = sm_values[,1], sd = sm_values[,2])
#new_pred = apply(model.matrix.lm(lm_fit, data = all_dat,
# na.action = "na.pass")%*%new_coef, 1, sum) +
# rnorm(n, mean = 0, sd = sigma(lm_fit))
#znew = new_pred
## With transformation
##---------------------
sm_values = summary(lm_fit2)$coefficients
new_coef = rnorm(nrow(sm_values), mean = sm_values[,1], sd = sm_values[,2])
new_pred = apply(model.matrix.lm(lm_fit2, data = all_dat,
na.action = "na.pass")%*%new_coef, 1, sum) +
rnorm(n, mean = 0, sd = sigma(lm_fit2))
znew = exp(new_pred)-1
znew[is.na(all_dat$risk_score) == F] =
all_dat$risk_score[is.na(all_dat$risk_score) == F]
znew[get(imp_set_old)$ynew == 1] = 0
lines(density(znew), col = imp_id + 1, lwd = 2)
assign(imp_set_name, cbind(get(imp_set_old), znew))
}

- Construct M = 10 imputation sets and estimate whether there exists a significant difference in average baseline risk perception score among AGYW living in Cosmocity versus those living in Fourways. How would you combine the results of the different analyses?
## MI analyses
##-------------
coef_imp = vector()
se_imp = vector()
for (imp_id in 1:M){
imp_set_name = paste0("imp_dataset_full2",imp_id)
glm_fit_imp = glm(ynew ~ age + factor(educational_level) +
factor(residential_area) + monthly_income,
data = get(imp_set_name), family = binomial(link = "logit"))
sm_glm_imp = summary(glm_fit_imp)$coefficients
cov_id = which(rownames(sm_glm_imp) == "factor(residential_area)Fourways")
coef_imp[imp_id] = sm_glm_imp[cov_id,1]
se_imp[imp_id] = sm_glm_imp[cov_id,2]
}
mean_estimate = mean(coef_imp); mean_estimate
## [1] 0.2410253
between_var = var(coef_imp); between_var
## [1] 0.009113935
within_var = mean(se_imp**2); within_var
## [1] 0.07136779
total_var = within_var + (1 + 1/M)*between_var; total_var
## [1] 0.08139312
### Odds ratio estimation
###-----------------------
OR = exp(mean_estimate); OR
## [1] 1.272553
### Hypothesis problem H0: beta = 0; H1: beta != 0
###------------------------------------------------
wald_test_statistic = mean_estimate/sqrt(total_var);
wald_test_statistic
## [1] 0.8448288
n = nrow(imp_dataset_full21)
k = nrow(sm_glm_imp)
lambda = (between_var + (between_var/M))/total_var
df_old = (M - 1)/(lambda*lambda)
df_obs = (((n - k) + 1)/((n - k) + 3)) * ((n - k)*(1 - lambda))
df_adj = (df_old*df_obs)/(df_old + df_obs)
### Calculation of the two-sided p-value
###--------------------------------------
p_old = 2*pt(wald_test_statistic, df = df_old, lower.tail = FALSE)
p_adj = 2*pt(wald_test_statistic, df = df_adj, lower.tail = FALSE)
p_old; p_adj
## [1] 0.398547
## [1] 0.3986524
### 95% CI for the odds ratio
###---------------------------
logOR = mean_estimate;
alpha = 0.05
df = df_adj
cv = qt(1-alpha/2, df = df)
ll_logOR = logOR - cv*sqrt(total_var)
ul_logOR = logOR + cv*sqrt(total_var)
ll_OR = exp(ll_logOR)
ul_OR = exp(ul_logOR)
print(c(OR, ll_OR, ul_OR))
## [1] 1.2725532 0.7264102 2.2293072
Based on the unadjusted (old) and adjusted p-value (< 0.05), we do not reject the null hypothesis (beta = 0) at 5% s.l. implying that there does not exist a significant difference in probability of reporting zero perceived risk between Fourways and Cosmocity AGYW (reference category). Alternatively, the $pooled odds ratio (after multiple imputation) is found to be NOT significantly different from one given that the 95% CI for the pooled OR contains value one.
NOTE: the use of Rubin’s rules is based on the assumption of asymptotic normality of the quantity of interest. If this assumption is violated one can not rely on the Wald test statistic (and corresponding t-distribution).
#11. Use the MICE package to perform multiple imputation for the DREAMS dataset at hand.
MICE is based on Full Conditional Specification meaning that the default setting is to use all available information (read for all variables) to impute missing observations in a sequential manner. More specifically, the algorithm starts with imputing a specific variable based on the available information regarding all other variables in the dataset and subsequently performs imputation on the ‘next’ variable until a complete dataset is retained. For specific details concerning the MICE algorithm, please have a look at the MICE documentation or the book by Van Buuren (Van Buuren, 2011).
Here under, an easy example concerning the use of the MICE package is given. Note that we start here from a subset of the entire DREAMS dataset, which is strictly speaking not necessary, but for illustration purposes we consider this more convenient. As mentioned previously, the imputation model for the perceived risk score still suffers from the fact that the data are subject to digit preference and the lack of predictive power for the subset of covariates considered in the imputation model. Under FCS, we are able to improve the prediction model, at least under the assumption of data being missing at random.
## Define the dataset to be used
##-------------------------------
mice_dat = subset(all_dat, select = c("age", "educational_level",
"residential_area", "household_members",
"household_water", "electricity_at_home",
"monthly_income","alcohol_use","sti",
"discuss_health_issue","hiv_talk",
"ever_tested_for_hiv","sexual_partners",
"physical_violence","preventing_pregnancy",
"risk_score"))
## Perform multiple imputation using the mice function
##-----------------------------------------------------
imp = mice(mice_dat, m = 10, maxit = 10, printFlag = FALSE, seed = 123)
## Warning: Number of logged events: 12
summary(imp)
## Class: mids
## Number of multiple imputations: 10
## Imputation methods:
## age educational_level residential_area
## "" "" ""
## household_members household_water electricity_at_home
## "" "" ""
## monthly_income alcohol_use sti
## "" "" ""
## discuss_health_issue hiv_talk ever_tested_for_hiv
## "" "" ""
## sexual_partners physical_violence preventing_pregnancy
## "" "" ""
## risk_score
## "pmm"
## PredictorMatrix:
## age educational_level residential_area household_members
## age 0 0 0 0
## educational_level 1 0 0 0
## residential_area 1 0 0 0
## household_members 1 0 0 0
## household_water 1 0 0 0
## electricity_at_home 1 0 0 0
## household_water electricity_at_home monthly_income
## age 0 0 1
## educational_level 0 0 1
## residential_area 0 0 1
## household_members 0 0 1
## household_water 0 0 1
## electricity_at_home 0 0 1
## alcohol_use sti discuss_health_issue hiv_talk
## age 0 0 0 0
## educational_level 0 0 0 0
## residential_area 0 0 0 0
## household_members 0 0 0 0
## household_water 0 0 0 0
## electricity_at_home 0 0 0 0
## ever_tested_for_hiv sexual_partners physical_violence
## age 0 1 0
## educational_level 0 1 0
## residential_area 0 1 0
## household_members 0 1 0
## household_water 0 1 0
## electricity_at_home 0 1 0
## preventing_pregnancy risk_score
## age 0 1
## educational_level 0 1
## residential_area 0 1
## household_members 0 1
## household_water 0 1
## electricity_at_home 0 1
## Number of logged events: 12
## it im dep meth out
## 1 0 0 constant educational_level
## 2 0 0 constant residential_area
## 3 0 0 constant household_members
## 4 0 0 constant household_water
## 5 0 0 constant electricity_at_home
## 6 0 0 constant alcohol_use
Here we use the default method for imputation depending on the type of variable in the dataset. By default, the method uses (1) pmm, predictive mean matching for numeric data, (2) logreg, logistic regression imputation for binary data, (3) polyreg, polytomous regression imputation for unordered categorical data (factor > 2 levels) and (4) polr, proportional odds model for (ordered, > 2 levels).
NOTE: the default method is not necessarily the best option, please verify the underlying assumptions of each of these models in the context of the different variables that require imputation.
## Analyse the imputed dataset
##-----------------------------
m1 <- with(data = imp, expr = lm(log(risk_score + 1) ~ age +
factor(educational_level) +
factor(residential_area) +
factor(household_water) +
monthly_income +
factor(alcohol_use) +
sti +
hiv_talk +
ever_tested_for_hiv +
sexual_partners +
physical_violence +
preventing_pregnancy))
pooled_analysis <- pool(m1)
pooled_analysis
## Class: mipo m = 10
## term m
## 1 (Intercept) 10
## 2 age 10
## 3 factor(educational_level)Completed primary school 10
## 4 factor(educational_level)No formal schooling 10
## 5 factor(educational_level)Some high/secondary schooling 10
## 6 factor(educational_level)Some primary school 10
## 7 factor(educational_level)Tertiary education 10
## 8 factor(residential_area)Diepsloot 10
## 9 factor(residential_area)Fourways 10
## 10 factor(residential_area)Kyasand/pipeline 10
## 11 factor(residential_area)Msawawa 10
## 12 factor(residential_area)Other 10
## 13 factor(household_water)Public tap 10
## 14 factor(household_water)Tap in yard 10
## 15 factor(household_water)Tap inside house 10
## 16 monthly_income 10
## 17 factor(alcohol_use)I never drink 10
## 18 factor(alcohol_use)I only drink with friends but never get drunk 10
## 19 factor(alcohol_use)I sometimes get drunk when going out with friends 10
## 20 stiYes 10
## 21 hiv_talkYes 10
## 22 ever_tested_for_hivYes 10
## 23 sexual_partners 10
## 24 physical_violenceYes 10
## 25 preventing_pregnancyYes 10
## estimate ubar b t dfcom df
## 1 1.389681e+00 6.136016e-01 4.325889e-01 1.089449e+00 1818 45.09594
## 2 2.951896e-03 2.897828e-04 2.695790e-04 5.863196e-04 1818 33.85746
## 3 -1.391368e-02 8.176015e-02 3.821599e-02 1.237977e-01 1818 73.28416
## 4 2.153583e+00 2.392759e+00 1.676663e-02 2.411203e+00 1818 1781.24509
## 5 5.738824e-02 6.856657e-03 1.346487e-03 8.337793e-03 1818 239.47074
## 6 -1.147794e-01 1.110010e-01 7.631146e-02 1.949436e-01 1818 46.36316
## 7 -1.125283e-02 1.046706e-02 3.523574e-03 1.434300e-02 1818 112.75897
## 8 1.018402e-01 1.755289e-02 1.407315e-03 1.910093e-02 1818 752.42008
## 9 -8.781784e-02 6.350173e-02 5.489005e-03 6.953964e-02 1818 694.12003
## 10 1.057038e-01 4.423044e-02 1.174809e-02 5.715334e-02 1818 156.44192
## 11 -5.694351e-02 4.981363e-02 9.611845e-03 6.038666e-02 1818 245.47278
## 12 -4.045472e-02 2.211832e-02 1.896585e-03 2.420456e-02 1818 700.25415
## 13 1.779829e-01 2.650839e-01 1.315794e-01 4.098213e-01 1818 67.97979
## 14 1.493264e-01 2.680807e-01 1.364533e-01 4.181792e-01 1818 65.90311
## 15 5.540675e-02 2.742222e-01 1.493008e-01 4.384530e-01 1818 60.71816
## 16 -9.312449e-06 3.520037e-11 7.893914e-12 4.388368e-11 1818 198.53790
## 17 -2.228877e-01 1.486809e-01 3.856259e-02 1.910998e-01 1818 161.74963
## 18 -1.723853e-01 1.536104e-01 3.671404e-02 1.939958e-01 1818 181.46439
## 19 -2.656930e-01 1.532982e-01 4.331472e-02 2.009444e-01 1818 143.49933
## 20 2.493106e-01 1.009886e-02 2.433675e-03 1.277590e-02 1818 179.36845
## 21 5.433033e-02 1.528497e-02 2.475518e-03 1.800804e-02 1818 313.53897
## 22 2.089980e-01 3.173274e-02 4.127311e-03 3.627279e-02 1818 421.92150
## 23 9.221425e-02 2.410156e-03 2.722489e-04 2.709630e-03 1818 505.99159
## 24 1.546882e-01 3.544881e-02 5.187693e-03 4.115527e-02 1818 360.29597
## 25 4.990773e-02 6.543927e-03 1.565009e-03 8.265438e-03 1818 181.30727
## riv lambda fmi
## 1 0.775499696 0.43677828 0.460199038
## 2 1.023307522 0.50575976 0.532578767
## 3 0.514157474 0.33956671 0.356881795
## 4 0.007707958 0.00764900 0.008761348
## 5 0.216014196 0.17764118 0.184424339
## 6 0.756232889 0.43059943 0.453669295
## 7 0.370297882 0.27023167 0.282840079
## 8 0.088193244 0.08104557 0.083478535
## 9 0.095082529 0.08682682 0.089446661
## 10 0.292172123 0.22610929 0.235816785
## 11 0.212251735 0.17508883 0.181728679
## 12 0.094321948 0.08619214 0.088790936
## 13 0.546005758 0.35317188 0.371397576
## 14 0.559900849 0.35893361 0.377541377
## 15 0.598896928 0.37456882 0.394199990
## 16 0.246682197 0.19787095 0.205831036
## 17 0.285301190 0.22197224 0.231417216
## 18 0.262908307 0.20817688 0.216761982
## 19 0.310807253 0.23711133 0.247526241
## 20 0.265083709 0.20953847 0.218207314
## 21 0.178153438 0.15121412 0.156577038
## 22 0.143071204 0.12516386 0.129281496
## 23 0.124254938 0.11052203 0.114017090
## 24 0.160977521 0.13865688 0.143398706
## 25 0.263069869 0.20827816 0.216869486
Difference in mean perceived risk score for AGYW living in Fourways versus Cosmocity (based on the univariate Wald test discussed above)
pooled_analysis$pooled$estimate[9]
## [1] -0.08781784
pvalue = 2*pt(pooled_analysis$pooled$t[9], df = pooled_analysis$pooled$df[9],
lower.tail = FALSE)
pvalue
## [1] 0.9445801
12. Explain the difference between MICE and Multivariate Normal Imputation? Which method is to be preferred?
MICE is an iterative imputation method that imputes missing values by modeling each variable with missing data as a function of other variables in the dataset. In each iteration, the missing values in one variable are imputed based on the available data in the other variables. This process is repeated until convergence, and multiple imputed datasets are generated. MICE is a flexible method that can handle different types of missing data patterns and can incorporate a range of statistical models to impute missing values.
On the other hand, Multivariate Normal Imputation is a method that assumes that the data follows a multivariate normal distribution, and the missing values are imputed by sampling from the multivariate normal distribution estimated from the observed data. This method requires the assumption that the data is normally distributed, and it is sensitive to outliers in the data.
The choice between MICE and Multivariate Normal Imputation depends on the specific characteristics of the dataset and the research question. MICE is generally preferred over Multivariate Normal Imputation because it is more flexible and can handle a wider range of data patterns and missing data mechanisms. However, if the data is known to be normally distributed, Multivariate Normal Imputation can be a simple and effective approach. Ultimately, the best approach will depend on the specific research question and the nature of the missing data in the dataset.
13. What is the difference between predictive mean matching (pmm) and regression imputation considered in the previous steps? What are the advantages and disadvantages of both approaches?
Predictive mean matching (PMM) and regression imputation are two commonly used approaches for imputing missing data in regression analysis.
PMM is a non-parametric imputation method that is based on a matching procedure. In PMM, the observed values of a predictor variable (or variables) are used to predict the missing value of the response variable. The imputed value is then selected from the observed response values that have a predicted value closest to the predicted value of the missing value. PMM is useful when the distribution of the response variable is skewed or has outliers and can preserve the distribution of the response variable better than other imputation methods.
Regression imputation, on the other hand, is a parametric imputation method that is based on fitting a regression model to the observed data and then using the estimated model to impute the missing values. The imputed values are the predictions of the response variable based on the observed values of the predictor variables. Regression imputation assumes a linear relationship between the predictor and response variables and can be sensitive to outliers.
The advantages of PMM are that it is a non-parametric method and can preserve the distribution of the response variable. It can also handle nonlinear relationships between the predictor and response variables. However, PMM can be computationally intensive, especially when dealing with large datasets, and may not work well with highly skewed or heavy-tailed distributions.
The advantages of regression imputation are that it is a simple and fast method that can handle large datasets with many missing values. It is also useful when the relationship between the predictor and response variables is well understood and linear. However, regression imputation assumes a linear relationship between the predictor and response variables and can be sensitive to outliers and misspecification of the model.
In summary, the choice between PMM and regression imputation will depend on the characteristics of the dataset and the research question. PMM may be more appropriate when the distribution of the response variable is skewed or when there are nonlinear relationships between the predictor and response variables, while regression imputation may be more appropriate when the relationship between the predictor and response variables is well understood and linear.
The MICE (Multivariate Imputation by Chained Equations) package is a flexible imputation method that can be used with a variety of imputation models, including both predictive mean matching (PMM) and regression imputation. In fact, MICE can be used with a range of imputation methods, including logistic regression, random forests, and others, depending on the data and the research question.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.