In this assignment, we aim to improve our imputation model and analysis of the NHANES dataset by using multiple imputation. We first performed multiple imputation using naive and passive imputation models. As before, we are assessing the effect that age has on depression, using the variables ‘age’ and ‘depression’ score. Then, we also add BMI to our analysis, represented by the variable ‘bmi,’ since this makes our model more interesting and we found out this variable has a stronger correlation with depression scores. Our research question is the following: Do age and BMI have a significant effect on depression score? In order to obtain a meaningful insight into our imputation models, we first start imputing and analyzing for our research question with only age involved, like we did in assignment A, and discuss convergence and plausibility. Then, later we add BMI to the analysis to see if this provides us with more information about our data. Finally, we discuss the limitations and possible improvements that can be made to our procedure Research Question 1: Does age have a significant impact on depression H1: Age has a significant impact on depression score H0: Age does not have a significant impact on depression score. Research Question 2: Does BMI have a significant impact on depression? H1: BMI has a significant impact on depression score H0: BMI does not have a significant impact on depression score. ## 2 Preparing Data We begin by preparing our data and creating our depression score (a combination of depression scores 1 to 9) using both the complete and incomplete data.
data_complete <- readRDS("g3_complete_data.rds")
data_incomplete <- readRDS("g3_incomplete_data.rds")
#adding depression score:
data_complete1 <- data_complete %>% mutate(depression_score = dep1 + dep2 + dep3 + dep4 + dep5 + dep6 + dep7 + dep8 + dep9)
data_incomplete1 <- data_incomplete %>% mutate(depression_score = dep1 + dep2 + dep3 + dep4 + dep5 + dep6 + dep7 + dep8 + dep9)
In order to improve our analysis in comparison to the first assignment, we have now employed multiple imputation methods rather than single imputation methods. Using multiple imputation helps to reduce bias, and increase precision, statistical power. It is a more flexible and robust strategy for handling missing data that provides more accurate and precise estimates while reducing bias and increasing statistical power. We first use the default MICE imputation method, and then use passive imputation to perform multiple imputations on our data. Then, we pool and analyze the data, and then compare the effect of age on depression score and the effect of age and BMI on depression score. Throughout, we assess the convergence of plausibility of the imputed data. It is critical to assess both the convergence and plausibility in order to assess the validity and quality of the imputed data. Convergence refers to the stability of the imputation and how much the imputed values have stabilized. On the other hand, plausibility refers to how reasonable or ‘plausible’ the imputed values appear when compared to the observed data. Both are necessary to assess to ensure accurate imputations. ### 3.1 Default MICE Imputation Method We begin by performing a default imputation of the data using the MICE algorithm, using default multiple imputation. The default imputation method uses “pmm” (predictive mean matching) for continuous variables and “polyreg” (polynomial regression) for categorical variables. We chose this as a starting point for our imputations.
# default multiple imputation with 10 imps, 20 its:
imp <- mice(data_incomplete1, m = 10, maxit = 20, seed = 1234567, print = FALSE)
# checking the loggedEvents; due to depression_score:
imp$loggedEvents %>% head()
## it im dep meth out
## 1 2 3 height pmm depression_score
## 2 2 3 bmi pmm depression_score
## 3 2 3 bp_sys2 pmm depression_score
## 4 2 3 bp_dia2 pmm depression_score
## 5 2 6 drink_regularly logreg depression_score
## 6 2 8 weight pmm depression_score
Now, we will assess the convergence and plausibility of this imputation. In order to do this, we must create plots in order to visualize the convergence and plausibility. Convergence can be assessed using a convergence plot. A convergence plot shows the change in imputation estimates as the number of imputations increase. In order to interpret a convergence plot, we look at the stabilization, plateau, and fluctuations of the imputation estimates in the plot. Convergence can also be assessed by using the ‘convergence’ function in MICE. This function provides information on the maximum absolute difference between successive imputations for each variable at each iteration. This can be plotted to show auto-correlation. Ideally auto-correlation should be close to 0 to demonstrate convergence. Plausibility is assessed by comparing the imputed data to the original data or to external sources. This can be done by a density plot, which visualizes the distribution of the imputed values vs the observed data. If the imputed values are plausible, then the shape, center, and spread of the density curves should be similar between the imputed values and the observed values.
# looking at the plots from our initial model:
plot(imp)
densityplot(imp)
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot a variable named 'my_vrb' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = my_vrb, group = .imp)) +
## ggplot2::geom_density()
##
## See amices.org/ggmice for more info.
stripplot(imp)
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot a variable named 'my_vrb' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = .imp, y = my_vrb)) +
## ggplot2::geom_jitter()
##
## See amices.org/ggmice for more info.
# auto-correlation should ideally be 0, but deviates from that for some variables:
conv <- convergence(imp)
ggplot(conv, aes(x = .it, y = ac)) +
geom_hline(yintercept = 0, color = "grey", linetype = "dashed") +
geom_line() +
facet_wrap(~vrb, scales = "free") +
theme_classic() +
labs(x = "Iteration", y = "Auto-correlation")
# potential scale reduction factors are high; ideally they should be 1 for each variable:
ggplot(conv, aes(x = .it, y = psrf)) +
geom_hline(yintercept = 1, color = "grey", linetype = "dashed") +
geom_line() +
facet_wrap(~vrb, scales = "free") +
theme_classic() +
labs(x = "Iteration", y = "Potential scale reduction factor")
It is clear that this imputation model shows non-convergence, which we have quantified and displayed. It can be seen in the plot of the imputation using the ‘plot’ function that the depression score, weight, height, and bmi do not show or show little convergence over iterations. In the plot of the ‘convergence’ function, the auto-correlation value is high for weight, height, and depression score. This indicates non-convergence. The density plot also demonstrates low plausibility for the weight and height variables. Depression score has better plausibility, but some of the individual depression variables have low plausibility. ### 3.3 Passive Imputation Now we try to improve our multiple imputation in order to have better convergence. Since we added a column for the depression score, which is the sum of all depression scores in a row, we are going to do passive imputation for this variable. Passive imputation is ideal for transformed variables, when compared with other methods such as Impute, then transform. This gives us a more accurate imputation on the depression score variable. To implement passive imputation, we adjust two features of the mice() setup: the method vector: we use the method vector to define the deterministic relations, and the predictor matrix: we adjust the predictor matrix to keep a transformed variable from being used as a predictor of its raw version.
# currently all missing data is treated with pmm:
(method <- imp$method)
## id sex age ethnicity
## "" "" "" ""
## education marital household_size household_income
## "" "" "" ""
## weight height bmi pulse
## "pmm" "pmm" "pmm" ""
## bp_sys1 bp_dia1 bp_sys2 bp_dia2
## "" "" "pmm" "pmm"
## time_sed drink_regularly days_drinking dep1
## "" "logreg" "" ""
## dep2 dep3 dep4 dep5
## "pmm" "pmm" "" "pmm"
## dep6 dep7 dep8 dep9
## "pmm" "" "pmm" "pmm"
## depression_score
## "pmm"
# since our data is not MCAR, pmm is mostly fine:
out <- mcar_test(data_incomplete)
out$statistic # 1448.785
## [1] 1448.785
out$p.value # 1.968048e-10
## [1] 1.968048e-10
# we change the method to implement passive imputation depression_score :
method["depression_score"]<- "~ I(dep1 + dep2 + dep3 + dep4 + dep5 + dep6 + dep7 + dep8 + dep9)"
method
## id
## ""
## sex
## ""
## age
## ""
## ethnicity
## ""
## education
## ""
## marital
## ""
## household_size
## ""
## household_income
## ""
## weight
## "pmm"
## height
## "pmm"
## bmi
## "pmm"
## pulse
## ""
## bp_sys1
## ""
## bp_dia1
## ""
## bp_sys2
## "pmm"
## bp_dia2
## "pmm"
## time_sed
## ""
## drink_regularly
## "logreg"
## days_drinking
## ""
## dep1
## ""
## dep2
## "pmm"
## dep3
## "pmm"
## dep4
## ""
## dep5
## "pmm"
## dep6
## "pmm"
## dep7
## ""
## dep8
## "pmm"
## dep9
## "pmm"
## depression_score
## "~ I(dep1 + dep2 + dep3 + dep4 + dep5 + dep6 + dep7 + dep8 + dep9)"
# adjusting the prediction matrix to avoid circularity:
(pm <- imp$predictorMatrix)
## id sex age ethnicity education marital household_size
## id 0 1 1 1 1 1 1
## sex 1 0 1 1 1 1 1
## age 1 1 0 1 1 1 1
## ethnicity 1 1 1 0 1 1 1
## education 1 1 1 1 0 1 1
## marital 1 1 1 1 1 0 1
## household_size 1 1 1 1 1 1 0
## household_income 1 1 1 1 1 1 1
## weight 1 1 1 1 1 1 1
## height 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1
## pulse 1 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1 1
## bp_sys2 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1
## dep2 1 1 1 1 1 1 1
## dep3 1 1 1 1 1 1 1
## dep4 1 1 1 1 1 1 1
## dep5 1 1 1 1 1 1 1
## dep6 1 1 1 1 1 1 1
## dep7 1 1 1 1 1 1 1
## dep8 1 1 1 1 1 1 1
## dep9 1 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1 1
## household_income weight height bmi pulse bp_sys1 bp_dia1
## id 1 1 1 1 1 1 1
## sex 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1 1
## education 1 1 1 1 1 1 1
## marital 1 1 1 1 1 1 1
## household_size 1 1 1 1 1 1 1
## household_income 0 1 1 1 1 1 1
## weight 1 0 1 1 1 1 1
## height 1 1 0 1 1 1 1
## bmi 1 1 1 0 1 1 1
## pulse 1 1 1 1 0 1 1
## bp_sys1 1 1 1 1 1 0 1
## bp_dia1 1 1 1 1 1 1 0
## bp_sys2 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1
## dep2 1 1 1 1 1 1 1
## dep3 1 1 1 1 1 1 1
## dep4 1 1 1 1 1 1 1
## dep5 1 1 1 1 1 1 1
## dep6 1 1 1 1 1 1 1
## dep7 1 1 1 1 1 1 1
## dep8 1 1 1 1 1 1 1
## dep9 1 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1 1
## bp_sys2 bp_dia2 time_sed drink_regularly days_drinking dep1
## id 1 1 1 1 1 1
## sex 1 1 1 1 1 1
## age 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1
## education 1 1 1 1 1 1
## marital 1 1 1 1 1 1
## household_size 1 1 1 1 1 1
## household_income 1 1 1 1 1 1
## weight 1 1 1 1 1 1
## height 1 1 1 1 1 1
## bmi 1 1 1 1 1 1
## pulse 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1
## bp_sys2 0 1 1 1 1 1
## bp_dia2 1 0 1 1 1 1
## time_sed 1 1 0 1 1 1
## drink_regularly 1 1 1 0 1 1
## days_drinking 1 1 1 1 0 1
## dep1 1 1 1 1 1 0
## dep2 1 1 1 1 1 1
## dep3 1 1 1 1 1 1
## dep4 1 1 1 1 1 1
## dep5 1 1 1 1 1 1
## dep6 1 1 1 1 1 1
## dep7 1 1 1 1 1 1
## dep8 1 1 1 1 1 1
## dep9 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1
## dep2 dep3 dep4 dep5 dep6 dep7 dep8 dep9 depression_score
## id 1 1 1 1 1 1 1 1 1
## sex 1 1 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1 1 1 1
## education 1 1 1 1 1 1 1 1 1
## marital 1 1 1 1 1 1 1 1 1
## household_size 1 1 1 1 1 1 1 1 1
## household_income 1 1 1 1 1 1 1 1 1
## weight 1 1 1 1 1 1 1 1 1
## height 1 1 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1 1 1
## pulse 1 1 1 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1 1 1 1
## bp_sys2 1 1 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1 1 1
## dep2 0 1 1 1 1 1 1 1 1
## dep3 1 0 1 1 1 1 1 1 1
## dep4 1 1 0 1 1 1 1 1 1
## dep5 1 1 1 0 1 1 1 1 1
## dep6 1 1 1 1 0 1 1 1 1
## dep7 1 1 1 1 1 0 1 1 1
## dep8 1 1 1 1 1 1 0 1 1
## dep9 1 1 1 1 1 1 1 0 1
## depression_score 1 1 1 1 1 1 1 1 0
pm <- as.data.frame(pm) #converted to df to change the cells
pm[20:28, 29] <- 0 #changing the values to 0
pm <- as.matrix(pm) #conversion back to matrix
pm
## id sex age ethnicity education marital household_size
## id 0 1 1 1 1 1 1
## sex 1 0 1 1 1 1 1
## age 1 1 0 1 1 1 1
## ethnicity 1 1 1 0 1 1 1
## education 1 1 1 1 0 1 1
## marital 1 1 1 1 1 0 1
## household_size 1 1 1 1 1 1 0
## household_income 1 1 1 1 1 1 1
## weight 1 1 1 1 1 1 1
## height 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1
## pulse 1 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1 1
## bp_sys2 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1
## dep2 1 1 1 1 1 1 1
## dep3 1 1 1 1 1 1 1
## dep4 1 1 1 1 1 1 1
## dep5 1 1 1 1 1 1 1
## dep6 1 1 1 1 1 1 1
## dep7 1 1 1 1 1 1 1
## dep8 1 1 1 1 1 1 1
## dep9 1 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1 1
## household_income weight height bmi pulse bp_sys1 bp_dia1
## id 1 1 1 1 1 1 1
## sex 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1 1
## education 1 1 1 1 1 1 1
## marital 1 1 1 1 1 1 1
## household_size 1 1 1 1 1 1 1
## household_income 0 1 1 1 1 1 1
## weight 1 0 1 1 1 1 1
## height 1 1 0 1 1 1 1
## bmi 1 1 1 0 1 1 1
## pulse 1 1 1 1 0 1 1
## bp_sys1 1 1 1 1 1 0 1
## bp_dia1 1 1 1 1 1 1 0
## bp_sys2 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1
## dep2 1 1 1 1 1 1 1
## dep3 1 1 1 1 1 1 1
## dep4 1 1 1 1 1 1 1
## dep5 1 1 1 1 1 1 1
## dep6 1 1 1 1 1 1 1
## dep7 1 1 1 1 1 1 1
## dep8 1 1 1 1 1 1 1
## dep9 1 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1 1
## bp_sys2 bp_dia2 time_sed drink_regularly days_drinking dep1
## id 1 1 1 1 1 1
## sex 1 1 1 1 1 1
## age 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1
## education 1 1 1 1 1 1
## marital 1 1 1 1 1 1
## household_size 1 1 1 1 1 1
## household_income 1 1 1 1 1 1
## weight 1 1 1 1 1 1
## height 1 1 1 1 1 1
## bmi 1 1 1 1 1 1
## pulse 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1
## bp_sys2 0 1 1 1 1 1
## bp_dia2 1 0 1 1 1 1
## time_sed 1 1 0 1 1 1
## drink_regularly 1 1 1 0 1 1
## days_drinking 1 1 1 1 0 1
## dep1 1 1 1 1 1 0
## dep2 1 1 1 1 1 1
## dep3 1 1 1 1 1 1
## dep4 1 1 1 1 1 1
## dep5 1 1 1 1 1 1
## dep6 1 1 1 1 1 1
## dep7 1 1 1 1 1 1
## dep8 1 1 1 1 1 1
## dep9 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1
## dep2 dep3 dep4 dep5 dep6 dep7 dep8 dep9 depression_score
## id 1 1 1 1 1 1 1 1 1
## sex 1 1 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1 1 1 1
## education 1 1 1 1 1 1 1 1 1
## marital 1 1 1 1 1 1 1 1 1
## household_size 1 1 1 1 1 1 1 1 1
## household_income 1 1 1 1 1 1 1 1 1
## weight 1 1 1 1 1 1 1 1 1
## height 1 1 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1 1 1
## pulse 1 1 1 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1 1 1 1
## bp_sys2 1 1 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1 1 0
## dep2 0 1 1 1 1 1 1 1 0
## dep3 1 0 1 1 1 1 1 1 0
## dep4 1 1 0 1 1 1 1 1 0
## dep5 1 1 1 0 1 1 1 1 0
## dep6 1 1 1 1 0 1 1 1 0
## dep7 1 1 1 1 1 0 1 1 0
## dep8 1 1 1 1 1 1 0 1 0
## dep9 1 1 1 1 1 1 1 0 0
## depression_score 1 1 1 1 1 1 1 1 0
imp1 <- mice(data_incomplete1,
m = 10,
method = method,
predictorMatrix = pm,
maxit = 20,
seed = 1234567,
print = FALSE)
# looking at the plots from our passive imputation model:
plot(imp1)
densityplot(imp1)
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot a variable named 'my_vrb' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = my_vrb, group = .imp)) +
## ggplot2::geom_density()
##
## See amices.org/ggmice for more info.
stripplot(imp1)
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot a variable named 'my_vrb' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = .imp, y = my_vrb)) +
## ggplot2::geom_jitter()
##
## See amices.org/ggmice for more info.
After performing passive imputation on our data, we have reassessed the convergence and plausibility. The convergence plot using the ‘plot’ function is similar for weight, height and bmi, though depression score shows much more convergence. On the other hand, the density plot appears relatively unchanged when compared with our first imputation. The weight and height variables still display low plausibility. Therefore, we now apply passive imputation for these variables.
#BMI, height, weight could be better. The rest looks good. We add one more pmi for BMI:
method["bmi"] <- "~ I(weight / (height / 100)^2)"
method
## id
## ""
## sex
## ""
## age
## ""
## ethnicity
## ""
## education
## ""
## marital
## ""
## household_size
## ""
## household_income
## ""
## weight
## "pmm"
## height
## "pmm"
## bmi
## "~ I(weight / (height / 100)^2)"
## pulse
## ""
## bp_sys1
## ""
## bp_dia1
## ""
## bp_sys2
## "pmm"
## bp_dia2
## "pmm"
## time_sed
## ""
## drink_regularly
## "logreg"
## days_drinking
## ""
## dep1
## ""
## dep2
## "pmm"
## dep3
## "pmm"
## dep4
## ""
## dep5
## "pmm"
## dep6
## "pmm"
## dep7
## ""
## dep8
## "pmm"
## dep9
## "pmm"
## depression_score
## "~ I(dep1 + dep2 + dep3 + dep4 + dep5 + dep6 + dep7 + dep8 + dep9)"
# adjusting the prediction matrix to avoid circularity:
(pm1 <- imp1$predictorMatrix)
## id sex age ethnicity education marital household_size
## id 0 1 1 1 1 1 1
## sex 1 0 1 1 1 1 1
## age 1 1 0 1 1 1 1
## ethnicity 1 1 1 0 1 1 1
## education 1 1 1 1 0 1 1
## marital 1 1 1 1 1 0 1
## household_size 1 1 1 1 1 1 0
## household_income 1 1 1 1 1 1 1
## weight 1 1 1 1 1 1 1
## height 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1
## pulse 1 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1 1
## bp_sys2 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1
## dep2 1 1 1 1 1 1 1
## dep3 1 1 1 1 1 1 1
## dep4 1 1 1 1 1 1 1
## dep5 1 1 1 1 1 1 1
## dep6 1 1 1 1 1 1 1
## dep7 1 1 1 1 1 1 1
## dep8 1 1 1 1 1 1 1
## dep9 1 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1 1
## household_income weight height bmi pulse bp_sys1 bp_dia1
## id 1 1 1 1 1 1 1
## sex 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1 1
## education 1 1 1 1 1 1 1
## marital 1 1 1 1 1 1 1
## household_size 1 1 1 1 1 1 1
## household_income 0 1 1 1 1 1 1
## weight 1 0 1 1 1 1 1
## height 1 1 0 1 1 1 1
## bmi 1 1 1 0 1 1 1
## pulse 1 1 1 1 0 1 1
## bp_sys1 1 1 1 1 1 0 1
## bp_dia1 1 1 1 1 1 1 0
## bp_sys2 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1
## dep2 1 1 1 1 1 1 1
## dep3 1 1 1 1 1 1 1
## dep4 1 1 1 1 1 1 1
## dep5 1 1 1 1 1 1 1
## dep6 1 1 1 1 1 1 1
## dep7 1 1 1 1 1 1 1
## dep8 1 1 1 1 1 1 1
## dep9 1 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1 1
## bp_sys2 bp_dia2 time_sed drink_regularly days_drinking dep1
## id 1 1 1 1 1 1
## sex 1 1 1 1 1 1
## age 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1
## education 1 1 1 1 1 1
## marital 1 1 1 1 1 1
## household_size 1 1 1 1 1 1
## household_income 1 1 1 1 1 1
## weight 1 1 1 1 1 1
## height 1 1 1 1 1 1
## bmi 1 1 1 1 1 1
## pulse 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1
## bp_sys2 0 1 1 1 1 1
## bp_dia2 1 0 1 1 1 1
## time_sed 1 1 0 1 1 1
## drink_regularly 1 1 1 0 1 1
## days_drinking 1 1 1 1 0 1
## dep1 1 1 1 1 1 0
## dep2 1 1 1 1 1 1
## dep3 1 1 1 1 1 1
## dep4 1 1 1 1 1 1
## dep5 1 1 1 1 1 1
## dep6 1 1 1 1 1 1
## dep7 1 1 1 1 1 1
## dep8 1 1 1 1 1 1
## dep9 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1
## dep2 dep3 dep4 dep5 dep6 dep7 dep8 dep9 depression_score
## id 1 1 1 1 1 1 1 1 1
## sex 1 1 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1 1 1 1
## education 1 1 1 1 1 1 1 1 1
## marital 1 1 1 1 1 1 1 1 1
## household_size 1 1 1 1 1 1 1 1 1
## household_income 1 1 1 1 1 1 1 1 1
## weight 1 1 1 1 1 1 1 1 1
## height 1 1 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1 1 1
## pulse 1 1 1 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1 1 1 1
## bp_sys2 1 1 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1 1 0
## dep2 0 1 1 1 1 1 1 1 0
## dep3 1 0 1 1 1 1 1 1 0
## dep4 1 1 0 1 1 1 1 1 0
## dep5 1 1 1 0 1 1 1 1 0
## dep6 1 1 1 1 0 1 1 1 0
## dep7 1 1 1 1 1 0 1 1 0
## dep8 1 1 1 1 1 1 0 1 0
## dep9 1 1 1 1 1 1 1 0 0
## depression_score 1 1 1 1 1 1 1 1 0
pm1 <- as.data.frame(pm1) #converted to df to change the cells
pm1[9:10, 11] <- 0 #changing the values to 0
pm1 <- as.matrix(pm1) #conversion back to matrix
pm1
## id sex age ethnicity education marital household_size
## id 0 1 1 1 1 1 1
## sex 1 0 1 1 1 1 1
## age 1 1 0 1 1 1 1
## ethnicity 1 1 1 0 1 1 1
## education 1 1 1 1 0 1 1
## marital 1 1 1 1 1 0 1
## household_size 1 1 1 1 1 1 0
## household_income 1 1 1 1 1 1 1
## weight 1 1 1 1 1 1 1
## height 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1
## pulse 1 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1 1
## bp_sys2 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1
## dep2 1 1 1 1 1 1 1
## dep3 1 1 1 1 1 1 1
## dep4 1 1 1 1 1 1 1
## dep5 1 1 1 1 1 1 1
## dep6 1 1 1 1 1 1 1
## dep7 1 1 1 1 1 1 1
## dep8 1 1 1 1 1 1 1
## dep9 1 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1 1
## household_income weight height bmi pulse bp_sys1 bp_dia1
## id 1 1 1 1 1 1 1
## sex 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1 1
## education 1 1 1 1 1 1 1
## marital 1 1 1 1 1 1 1
## household_size 1 1 1 1 1 1 1
## household_income 0 1 1 1 1 1 1
## weight 1 0 1 0 1 1 1
## height 1 1 0 0 1 1 1
## bmi 1 1 1 0 1 1 1
## pulse 1 1 1 1 0 1 1
## bp_sys1 1 1 1 1 1 0 1
## bp_dia1 1 1 1 1 1 1 0
## bp_sys2 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1
## dep2 1 1 1 1 1 1 1
## dep3 1 1 1 1 1 1 1
## dep4 1 1 1 1 1 1 1
## dep5 1 1 1 1 1 1 1
## dep6 1 1 1 1 1 1 1
## dep7 1 1 1 1 1 1 1
## dep8 1 1 1 1 1 1 1
## dep9 1 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1 1
## bp_sys2 bp_dia2 time_sed drink_regularly days_drinking dep1
## id 1 1 1 1 1 1
## sex 1 1 1 1 1 1
## age 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1
## education 1 1 1 1 1 1
## marital 1 1 1 1 1 1
## household_size 1 1 1 1 1 1
## household_income 1 1 1 1 1 1
## weight 1 1 1 1 1 1
## height 1 1 1 1 1 1
## bmi 1 1 1 1 1 1
## pulse 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1
## bp_sys2 0 1 1 1 1 1
## bp_dia2 1 0 1 1 1 1
## time_sed 1 1 0 1 1 1
## drink_regularly 1 1 1 0 1 1
## days_drinking 1 1 1 1 0 1
## dep1 1 1 1 1 1 0
## dep2 1 1 1 1 1 1
## dep3 1 1 1 1 1 1
## dep4 1 1 1 1 1 1
## dep5 1 1 1 1 1 1
## dep6 1 1 1 1 1 1
## dep7 1 1 1 1 1 1
## dep8 1 1 1 1 1 1
## dep9 1 1 1 1 1 1
## depression_score 1 1 1 1 1 1
## dep2 dep3 dep4 dep5 dep6 dep7 dep8 dep9 depression_score
## id 1 1 1 1 1 1 1 1 1
## sex 1 1 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1 1 1
## ethnicity 1 1 1 1 1 1 1 1 1
## education 1 1 1 1 1 1 1 1 1
## marital 1 1 1 1 1 1 1 1 1
## household_size 1 1 1 1 1 1 1 1 1
## household_income 1 1 1 1 1 1 1 1 1
## weight 1 1 1 1 1 1 1 1 1
## height 1 1 1 1 1 1 1 1 1
## bmi 1 1 1 1 1 1 1 1 1
## pulse 1 1 1 1 1 1 1 1 1
## bp_sys1 1 1 1 1 1 1 1 1 1
## bp_dia1 1 1 1 1 1 1 1 1 1
## bp_sys2 1 1 1 1 1 1 1 1 1
## bp_dia2 1 1 1 1 1 1 1 1 1
## time_sed 1 1 1 1 1 1 1 1 1
## drink_regularly 1 1 1 1 1 1 1 1 1
## days_drinking 1 1 1 1 1 1 1 1 1
## dep1 1 1 1 1 1 1 1 1 0
## dep2 0 1 1 1 1 1 1 1 0
## dep3 1 0 1 1 1 1 1 1 0
## dep4 1 1 0 1 1 1 1 1 0
## dep5 1 1 1 0 1 1 1 1 0
## dep6 1 1 1 1 0 1 1 1 0
## dep7 1 1 1 1 1 0 1 1 0
## dep8 1 1 1 1 1 1 0 1 0
## dep9 1 1 1 1 1 1 1 0 0
## depression_score 1 1 1 1 1 1 1 1 0
imp2 <- mice(data_incomplete1,
m = 10,
method = method,
predictorMatrix = pm1,
maxit = 20,
seed = 1234567,
print = FALSE)
# Checking if the deterministic definition of depression score is maintained in the imputed data:
xyplot(imp1, depression_score ~ I(dep1 + dep2 + dep3 + dep4 + dep5 + dep6 + dep7 + dep8 + dep9), ylab="Imputed depression score", xlab="Calculated depression score")
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot 2 variables named 'my_x' and 'my_y' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = my_x, y = my_y)) +
## ggplot2::geom_point()
##
## See amices.org/ggmice for more info.
# Checking if the deterministic definition of BMI is maintained in the imputed data:
xyplot(imp2, bmi ~ I(weight / (height / 100)^2), ylab="Imputed BMI", xlab="Calculated BMI")
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot 2 variables named 'my_x' and 'my_y' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = my_x, y = my_y)) +
## ggplot2::geom_point()
##
## See amices.org/ggmice for more info.
# looking at the plots from our model with 2 passive imputations:
plot(imp2) #convergence is better now: also good for bmi, height, weight
densityplot(imp2) #plausibility
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot a variable named 'my_vrb' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = my_vrb, group = .imp)) +
## ggplot2::geom_density()
##
## See amices.org/ggmice for more info.
stripplot(imp2) #plausibility
## Hint: Did you know, an equivalent figure can be created with `ggmice()`?
## For example, to plot a variable named 'my_vrb' from a mids object called 'my_mids', run:
##
## ggmice(my_mids, ggplot2::aes(x = .imp, y = my_vrb)) +
## ggplot2::geom_jitter()
##
## See amices.org/ggmice for more info.
# checking the default prediction matrix to find correlating variables: BMI (and height + weight) correlate with depression_score (and two blood pressure variables and household income correlate with BMI). Since it does not include a correlation between age and depression_score, as we have in our research question, we stick to keeping all variables, except the ones removed to avoid circularity from passive imputation. In other words, we stick with the predictor matrix in imp2 due to its fit with our research question.
(pred <- quickpred(data_incomplete1))
## id sex age ethnicity education marital household_size
## id 0 0 0 0 0 0 0
## sex 0 0 0 0 0 0 0
## age 0 0 0 0 0 0 0
## ethnicity 0 0 0 0 0 0 0
## education 0 0 0 0 0 0 0
## marital 0 0 0 0 0 0 0
## household_size 0 0 0 0 0 0 0
## household_income 0 0 0 0 0 0 0
## weight 0 1 1 1 1 0 1
## height 0 1 0 1 1 0 1
## bmi 0 1 1 0 0 0 0
## pulse 0 0 0 0 0 0 0
## bp_sys1 0 0 0 0 0 0 0
## bp_dia1 0 0 0 0 0 0 0
## bp_sys2 0 1 1 0 1 0 1
## bp_dia2 0 1 1 0 0 0 0
## time_sed 0 0 0 0 0 0 0
## drink_regularly 0 1 0 0 0 0 0
## days_drinking 0 0 0 0 0 0 0
## dep1 0 0 0 0 0 0 0
## dep2 0 1 0 0 0 0 0
## dep3 0 1 0 0 1 0 0
## dep4 0 0 0 0 0 0 0
## dep5 0 1 1 0 1 1 0
## dep6 0 1 0 0 0 0 0
## dep7 0 0 0 0 0 0 0
## dep8 0 0 0 0 1 1 0
## dep9 0 0 0 0 1 0 1
## depression_score 0 1 0 0 1 1 0
## household_income weight height bmi pulse bp_sys1 bp_dia1
## id 0 0 0 0 0 0 0
## sex 0 0 0 0 0 0 0
## age 0 0 0 0 0 0 0
## ethnicity 0 0 0 0 0 0 0
## education 0 0 0 0 0 0 0
## marital 0 0 0 0 0 0 0
## household_size 0 0 0 0 0 0 0
## household_income 0 0 0 0 0 0 0
## weight 0 0 1 1 1 1 1
## height 1 1 0 0 0 0 1
## bmi 0 1 1 0 1 1 1
## pulse 0 0 0 0 0 0 0
## bp_sys1 0 0 0 0 0 0 0
## bp_dia1 0 0 0 0 0 0 0
## bp_sys2 0 1 0 1 1 1 1
## bp_dia2 0 1 1 1 1 1 1
## time_sed 0 0 0 0 0 0 0
## drink_regularly 1 0 1 0 0 0 0
## days_drinking 0 0 0 0 0 0 0
## dep1 0 0 0 0 0 0 0
## dep2 1 0 1 1 1 0 0
## dep3 1 1 1 1 1 0 0
## dep4 0 0 0 0 0 0 0
## dep5 1 1 1 1 1 0 0
## dep6 1 0 1 1 0 0 0
## dep7 0 0 0 0 0 0 0
## dep8 1 0 0 0 1 0 0
## dep9 0 0 0 0 0 0 0
## depression_score 1 1 1 1 1 0 0
## bp_sys2 bp_dia2 time_sed drink_regularly days_drinking dep1
## id 0 0 0 0 0 0
## sex 0 0 0 0 0 0
## age 0 0 0 0 0 0
## ethnicity 0 0 0 0 0 0
## education 0 0 0 0 0 0
## marital 0 0 0 0 0 0
## household_size 0 0 0 0 0 0
## household_income 0 0 0 0 0 0
## weight 1 1 1 0 0 0
## height 0 1 1 1 1 0
## bmi 1 1 1 0 0 1
## pulse 0 0 0 0 0 0
## bp_sys1 0 0 0 0 0 0
## bp_dia1 0 0 0 0 0 0
## bp_sys2 0 1 0 0 1 0
## bp_dia2 1 0 0 0 1 0
## time_sed 0 0 0 0 0 0
## drink_regularly 0 0 0 0 1 1
## days_drinking 0 0 0 0 0 0
## dep1 0 0 0 0 0 0
## dep2 0 0 0 0 1 1
## dep3 0 0 0 0 1 1
## dep4 0 0 0 0 0 0
## dep5 0 0 0 0 1 1
## dep6 0 0 0 0 1 1
## dep7 0 0 0 0 0 0
## dep8 0 0 0 0 0 1
## dep9 0 0 1 0 0 1
## depression_score 0 0 0 0 1 1
## dep2 dep3 dep4 dep5 dep6 dep7 dep8 dep9 depression_score
## id 0 0 0 0 0 0 0 0 0
## sex 0 0 0 0 0 0 0 0 0
## age 0 0 0 0 0 0 0 0 0
## ethnicity 0 0 0 0 0 0 0 0 0
## education 0 0 0 0 0 0 0 0 0
## marital 0 0 0 0 0 0 0 0 0
## household_size 0 0 0 0 0 0 0 0 0
## household_income 0 0 0 0 0 0 0 0 0
## weight 1 1 0 1 0 0 0 0 1
## height 0 0 0 1 0 1 0 0 1
## bmi 1 1 1 1 0 1 0 0 1
## pulse 0 0 0 0 0 0 0 0 0
## bp_sys1 0 0 0 0 0 0 0 0 0
## bp_dia1 0 0 0 0 0 0 0 0 0
## bp_sys2 0 0 0 0 1 0 0 0 0
## bp_dia2 0 0 0 0 1 0 0 0 0
## time_sed 0 0 0 0 0 0 0 0 0
## drink_regularly 0 0 0 0 0 0 0 0 0
## days_drinking 0 0 0 0 0 0 0 0 0
## dep1 0 0 0 0 0 0 0 0 0
## dep2 0 1 1 1 1 1 1 1 1
## dep3 1 0 1 1 1 1 1 1 1
## dep4 0 0 0 0 0 0 0 0 0
## dep5 1 1 1 0 1 1 1 1 1
## dep6 1 1 1 1 0 1 1 1 1
## dep7 0 0 0 0 0 0 0 0 0
## dep8 1 1 1 1 1 1 0 1 1
## dep9 1 1 1 1 1 1 1 0 1
## depression_score 1 1 1 1 1 1 1 1 0
The plausibility and convergence for this final imputation have improved. When looking at the convergence plot for weight, height, bmi, and depression score, the convergence is now acceptable. As seen in the density plot, the plausibility of the weight and height variables has greatly improved. We can now be satisfied with the validity and quality of our imputation. ### 3.4 Analysis and pooling - Age Imputation With our chosen multiple imputation model, we perform analysis for our research question and pool the results. Pooling the imputed datasets allows us to obtain a single set of imputed values. This creates estimates that are more accurate and better reflect the uncertainty of the missing data.
# analysis for each imp to see if age is a predictor for depression score:
fit <- with(imp2, lm(depression_score ~ age))
# results:
fit
## call :
## with.mids(data = imp2, expr = lm(depression_score ~ age))
##
## call1 :
## mice(data = data_incomplete1, m = 10, method = method, predictorMatrix = pm1,
## maxit = 20, printFlag = FALSE, seed = 1234567)
##
## nmis :
## id sex age ethnicity
## 0 0 0 0
## education marital household_size household_income
## 0 0 0 0
## weight height bmi pulse
## 150 96 222 0
## bp_sys1 bp_dia1 bp_sys2 bp_dia2
## 0 0 100 100
## time_sed drink_regularly days_drinking dep1
## 0 50 0 0
## dep2 dep3 dep4 dep5
## 75 75 0 75
## dep6 dep7 dep8 dep9
## 75 0 40 81
## depression_score
## 177
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.3210 -0.0154
##
##
## [[2]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.33318 -0.01783
##
##
## [[3]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.14071 -0.01319
##
##
## [[4]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.18988 -0.01393
##
##
## [[5]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.22247 -0.01408
##
##
## [[6]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.21945 -0.01402
##
##
## [[7]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.23324 -0.01284
##
##
## [[8]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.27384 -0.01596
##
##
## [[9]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.14297 -0.01315
##
##
## [[10]]
##
## Call:
## lm(formula = depression_score ~ age)
##
## Coefficients:
## (Intercept) age
## 4.22587 -0.01533
fit$analyses[[1]] %>% summary()
##
## Call:
## lm(formula = depression_score ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.013 -3.351 -1.805 1.176 20.203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.32100 0.69700 6.199 1.19e-09 ***
## age -0.01540 0.01491 -1.033 0.302
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.788 on 498 degrees of freedom
## Multiple R-squared: 0.002137, Adjusted R-squared: 0.0001333
## F-statistic: 1.067 on 1 and 498 DF, p-value: 0.3022
fit$analyses[[2]] %>% coef()
## (Intercept) age
## 4.33318014 -0.01783068
# pooling the results and checking them:
poolFit <- pool(fit)
summary(poolFit)
## term estimate std.error statistic df p.value
## 1 (Intercept) 4.23026217 0.68210600 6.2017665 488.3516 1.190560e-09
## 2 age -0.01457293 0.01460972 -0.9974816 485.7817 3.190274e-01
The p-value of these pooled results is 0.319, which indicates that the observed differences in the pooled estimate are likely to have occurred by chance when assuming that the null hypothesis is true. It suggests weak evidence in support of the alternative hypothesis. Because the estimate is negative (-0.015), it indicates a negative association between variables. However, because the absolute value of -0.015 is very low, it indicates weak associations or effects. ### 3.5 Analysis of imputation with age and bmi As the multiple imputation model above with a single predictor “age” does not predict the depression score of individuals well, we decided to add an extra variable “bmi” to our analysis. This was in order to see whether independent variables “age” and “bmi” together better explains the outcome. Therefore a multiple linear regression with multiple imputation for missing data is performed. We hope that this will provide more interesting results from our imputed data. We also used imputed data with an adjusted method vector and predictor matrix due to the transformed variables “depression score” and “bmi”.
## Analysis phase: fit the model
```r
fit1 <- with(imp2, lm(`depression_score` ~ age + bmi))
fit1
## call :
## with.mids(data = imp2, expr = lm(depression_score ~ age + bmi))
##
## call1 :
## mice(data = data_incomplete1, m = 10, method = method, predictorMatrix = pm1,
## maxit = 20, printFlag = FALSE, seed = 1234567)
##
## nmis :
## id sex age ethnicity
## 0 0 0 0
## education marital household_size household_income
## 0 0 0 0
## weight height bmi pulse
## 150 96 222 0
## bp_sys1 bp_dia1 bp_sys2 bp_dia2
## 0 0 100 100
## time_sed drink_regularly days_drinking dep1
## 0 50 0 0
## dep2 dep3 dep4 dep5
## 75 75 0 75
## dep6 dep7 dep8 dep9
## 75 0 40 81
## depression_score
## 177
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## 2.01588 -0.02214 0.08764
##
##
## [[2]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## 0.77178 -0.02994 0.13642
##
##
## [[3]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## -0.43965 -0.01941 0.16200
##
##
## [[4]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## 2.30241 -0.02082 0.07389
##
##
## [[5]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## -0.60831 -0.02195 0.17281
##
##
## [[6]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## -0.41383 -0.02188 0.16830
##
##
## [[7]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## 0.008271 -0.021996 0.154333
##
##
## [[8]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## 0.28171 -0.02795 0.14837
##
##
## [[9]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## -0.24223 -0.02317 0.16038
##
##
## [[10]]
##
## Call:
## lm(formula = depression_score ~ age + bmi)
##
## Coefficients:
## (Intercept) age bmi
## 1.42648 -0.02345 0.10598
pool <- pool(fit1)
summary(pool)
## term estimate std.error statistic df p.value
## 1 (Intercept) 0.51025059 1.51943828 0.3358153 26.85726 0.7396195
## 2 age -0.02326988 0.01473571 -1.5791485 409.17023 0.1150747
## 3 bmi 0.13701237 0.04710187 2.9088518 20.85309 0.0084347
After pooling the estimates from each model into a single set of estimates, we can see age does not have a significant effect on an individual’s depression score, with a p value of 0.115. However, there is a significant positive relation between depression score and bmi (p = 0.008). For each unit bmi increases, depression score increases by 0.137 points after controlling for age, indicating a higher probability of depression in those with higher BMIs. ### 3.6 Compare the models (age and depression score vs. age, bmi and depression score).
# compare R2
pool.r.squared(fit)
## est lo 95 hi 95 fmi
## R^2 0.002023275 0.001893637 0.01762979 0.01437129
pool.r.squared(fit1)
## est lo 95 hi 95 fmi
## R^2 0.04540625 0.005195789 0.1195377 0.6570686
# compare estimates
D1(fit1,fit)
## test statistic df1 df2 dfcom p.value riv
## 1 ~~ 2 8.461419 1 13.79889 497 0.01158236 1.62826
In the first model, using only age, the R2 is 0.002, whereas for the second with age and bmi, it is 0.045. Though the general R2 is still low in the multiple linear model, adding predictor “bmi” explains about 4.3% more variability in depression score. Comparing estimates, it shows that the “bmi” factor is a significant predictor of depression score, and the multiple linear model with age and bmi as predictors better predict one’s depression score. This indicates that a person’s BMI is a better predictor of depression than a person’s age. ## 4 Conclusion ### 4.1 Discussion In conclusion, we cannot reject the null hypothesis of the first research question because we found no significant relationship between age and depression score using our imputed data. However, we can reject the null hypothesis of our second research question, because we did find a statistically significant relationship between bmi and depression score (0.008). In the complete data, the p-value of the correlation between age and depression score was 0.171, whereas in the imputed data, the p-value was 0.115. On the other hand, the bmi and depressions score was 0.001 in the complete data, and 0.008 on the incomplete data. The difference between these two in both cases is small and does not affect whether or not the null hypothesis is rejected. The correlation estimate of age and depression score in the complete data was -0.061, whereas it was -0.023 in the imputation data. This means the relationship is negative and weak in both cases. For bmi, the correlation estimate with depression score in the complete data was 0.142, while it was 0.137 in the imputation data This means that whether a researcher was studying the complete data or our imputed data, they would reach the same conclusion: there is a weak, negative, insignificant relationship between age and depression score, whereas there is a weak, positive, significant relationship between bmi and depression score. We can therefore conclude that our MI was adequate when addressing our research questions as it would not have led to a different conclusion.
There are some limitations to the models and analysis we used in this assignment that should be discussed. For example, as seen in the first imputation, the default imputation settings in MICE were not appropriate for our missing data and led to non-convergence and low plausibility on several variables. Using the default methods in MICE may limit flexibility and make it more difficult to fine tune the imputation process. It is also difficult to apply to larger datasets. We also used passive imputation. While our final imputation had acceptable convergence and plausibility that was an improvement over the default MICE imputation, it also presents its own issues. For example, while the variables height, bmi, and weight all converged more from the first to the final passive imputation, the depression score had weaker convergence. While it was still acceptable, this should be taken into account. Passive imputation also has limitations in general. Passive imputation ignores the relationships between variables, potentially causing imputed values that are inconsistent with the observed data. Passive imputation may also lead to limited accuracy, and the accuracy of the imputed data is dependent on the variability of the observed data and the extent of missingness in the dataset. Some of this may have been remedied by undergoing more tests and comparisons to choose the best MI model. Dep-scores vs. depression score with convergence? ### 4.3 Potential improvements An improvement that can be made to this assignment would be to more extensively investigate the best MI model for our dataset. For example, we determined that our data is not MCAR, but we did not determine if the data is MAR or MNAR. As passive imputation assumes that data is MCAR or MAR, it may be beneficial to determine if the missing data is likely to be MAR. A sensitivity analysis may have also been useful in determining the best MI model for our dataset. In general, our imputed values were plausible. However, there were some outliers in the imputed data that did not occur in the observed data. This is probably because we chose a model with many predictor variables. We made this choice, because our outcome variable had no strong correlations in general. An improvement could be to perform an analysis on what variables might be acceptable to remove from the prediction matrix in order to completely prevent overfitting, while still using all relevant predictors.