Introduction
While alcohol can have a brief positive effect on mood, heavy long-term use can cause severe mental health problems. Regular smoking may have a similar adverse effect. Mental health is a broad term, and psychologists define mental health problems as a combination of many different symptoms. The Nonspecific Psychological Distress Scale (K6) was introduced to measure the level of distress in individuals. In this brief analysis, we use the K6 scale to measure the effect of two health behaviors, drinking and smoking. The goal of the analysis is to use multiple imputation to handle the missing values in our dependent variable.
Data
The data used in this analysis come from the National Health Interview Survey (NHIS). This is a five-year (2011-2016) consolidated data set with 717,513 observations and 45 health-related variables. NHIS data have been used extensively throughout the Department of Health and Human Services (DHHS) to monitor trends in illness and disability and to track progress toward national health objectives. The public health research community also uses the data for epidemiologic and policy analysis, such as characterizing people with various health problems, determining barriers to accessing and using appropriate health care, and evaluating federal health programs (CDC).
Primary Dependent Variable
To measure mental distress, we formed the K6 (Nonspecific Psychological Distress Scale) variable by combining the scores from six questions in the NHIS data set into one composite score. The score is then grouped into three levels of mental distress: 1 - Normal (score 0-5), 2 - Moderate (score 6-12), 3 - Severe (score 13-24).
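The scoring rule above can be sketched on a few made-up respondents. This is only an illustration: the item values are hypothetical, and `cut()` is used here as one possible way to apply the 0-5 / 6-12 / 13-24 bands, not necessarily the approach taken in the analysis itself.

```r
# Toy item responses (0-4 each) for three hypothetical respondents
items <- data.frame(
  sad       = c(0, 2, 4),
  effort    = c(1, 2, 3),
  hopeless  = c(0, 1, 4),
  nervous   = c(1, 2, 3),
  restless  = c(0, 1, 2),
  worthless = c(0, 1, 4)
)

# K6 is the sum of the six items, so it ranges from 0 to 24
k6 <- rowSums(items)

# Cut the score into the three distress bands:
# 0-5 Normal, 6-12 Moderate, 13-24 Severe
md3 <- cut(k6,
           breaks = c(-Inf, 5, 12, 24),
           labels = c("Normal", "Moderate", "Severe"),
           ordered_result = TRUE)

data.frame(k6, md3)
```

The three toy respondents score 2, 9 and 20, so they land in the Normal, Moderate and Severe bands respectively.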
Key Independent Variables
There are two independent variables in the study: “Alcohol Drinks”, renamed “drinks”, and “Cigarettes Smoked”, renamed “cigs”. Both variables are originally numeric, with values from 0 to 90, where each number represents the quantity of drinks or cigarettes consumed by the respondent. Both are recoded into categories for the analysis. Higher values are treated with extreme caution, because they may result from the respondent misunderstanding the survey question.
Statistical Analysis
For our analysis, we first fit an ordered logistic regression model to obtain quantities of interest. Our primary focus, however, is to generate imputed values for our dependent variable “MD3” (Mental Distress, 3 Categories). The multiple imputation is done with the Amelia package. Finally, using one of the imputed data sets, we compare the imputed model with the non-imputed model.
It is important to mention that the NHIS data have a high number of missing cases, especially for modifiable risk factors such as alcohol use and cigarette smoking. Given the motivation behind multiple imputation, I do not impute our independent variables, alcohol drinks and cigarettes smoked.
The reason is that the majority of respondents might not be heavy drinkers or smokers, so their responses are missing not through sheer randomness but because they are in fact not heavy drinkers or smokers. Our dependent variable K6, however, is a legitimate candidate for multiple imputation, because a mental health level can in principle be measured and recorded for every person. Values for this variable may be missing at random or because of item non-response. That is why we focus only on our dependent variable (MD3), for which multiple imputation makes much more sense than for the other two variables.
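Before deciding which variables to impute, it can help to see how much is missing in each one and whether the gaps overlap. The sketch below uses a small made-up data frame (the column names mirror ours, the values are hypothetical) and base R only.

```r
# Toy data standing in for the NHIS extract (values are made up)
toy <- data.frame(
  MD3    = c("Normal", NA, "Severe", NA, "Moderate"),
  drinks = c(2, NA, NA, 1, NA),
  cigs   = c(NA, NA, 10, NA, NA)
)

# Share of missing values in each column
miss_share <- colMeans(is.na(toy))
round(miss_share, 2)

# Cross-tabulate missingness in the outcome against missingness in a predictor
miss_table <- table(MD3_missing    = is.na(toy$MD3),
                    drinks_missing = is.na(toy$drinks))
miss_table
```

A cross-tabulation like this does not tell us why values are missing, but it shows how many cases would survive listwise deletion.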
K6 Composite Score Variable.
Recoding the original variables.
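library(dplyr)  # provides mutate()
# Copy the six K6 items to shorter, lower-case names
nhis <- mutate(nhis,
               sad = ASAD,
               effort = AEFFORT,
               hopeless = AHOPELESS,
               nervous = ANERVOUS,
               restless = ARESTLESS,
               worthless = AWORTHLESS)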
Setting up missing values for each K6 variable.
nhis$sad <- car::recode(nhis$sad,"6:9= NA")
nhis$effort <- car::recode(nhis$effort,"6:9= NA")
nhis$hopeless <- car::recode(nhis$hopeless,"6:9= NA")
nhis$nervous <- car::recode(nhis$nervous,"6:9= NA")
nhis$restless <- car::recode(nhis$restless,"6:9= NA")
nhis$worthless <- car::recode(nhis$worthless,"6:9= NA")
Adding the six items to form the composite score variable (K6)
nhis <- mutate(nhis, K6=sad+effort+hopeless+nervous+restless+worthless)
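One consequence of summing with `+` is worth spelling out: a respondent with even one missing item gets a missing K6 score. The toy vectors below are hypothetical and only illustrate that behavior; the `rowSums()` line is shown for contrast, not as a recommendation.

```r
# Two hypothetical respondents; the second skipped one item
sad      <- c(1, 2)
nervous  <- c(0, NA)
restless <- c(3, 1)

# The `+` operator returns NA as soon as one term is NA
strict_sum <- sad + nervous + restless
strict_sum

# rowSums() with na.rm = TRUE would instead count the skipped item as 0,
# which understates distress, so it is not a drop-in replacement
lenient_sum <- rowSums(cbind(sad, nervous, restless), na.rm = TRUE)
lenient_sum
```

This is one reason the composite score can have more missing cases than any single item.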
Creating the 3-category K6 score variable (MD3)
nhis <- mutate (nhis, MD3=K6)
Giving appropriate score to each category of (MD3)
nhis$MD3 <- car::recode(nhis$MD3, "0:5=1;6:12=2; 13:24=3")
Validating the new variable, as a factor variable
nhis$MD3 <- factor(nhis$MD3)
is.factor(nhis$MD3)
## [1] TRUE
Recoding for the ordered logistic regression.
The summary of our dependent variable MD3 shows that 543,115 cases are missing from the original data set. We use multiple imputation exclusively to account for these missing values.
nhis$MD3 <-factor(nhis$MD3, labels = c("Normal","Moderate","Severe"))
nhis$MD3 <- factor(nhis$MD3, ordered = TRUE, levels = c("Normal", "Moderate", "Severe"))
summary(nhis$MD3)
## Normal Moderate Severe NA's
## 209416 33584 9530 543115
Alcohol Consumption Variable
Creating the variable “Alcohol Drinks Per Day”
For the alcohol consumption variable, we use the default listwise deletion in Zelig. This is appropriate for this type of variable because, as mentioned earlier, it is highly probable that the majority of respondents in our sample are not heavy or even occasional drinkers. It would therefore introduce bias to use multiple imputation (MI) on this variable or on the other independent variable, smoking.
nhis <- mutate(nhis, drinks=ALCAMT/10)
nhis$drinks <- car::recode(nhis$drinks,"0=NA;90:99.9=NA")
nhis$drinks2 <- car::recode(nhis$drinks,"1:4=1;5:10=2;11:89.9=3")
nhis$drinks2 <- factor(nhis$drinks2)
nhis$drinks2 <-factor(nhis$drinks2,
labels =c("<5",
"5-10",
">10"))
summary(nhis$drinks2)
## <5 5-10 >10 NA's
## 146227 14085 1598 633735
Cigarettes Smoked Variable
Creating the variable “cigarettes smoked per day”.
We take a similar approach for the smoking variable and rely on Zelig's default listwise deletion to handle its missing values.
nhis <- mutate(nhis, cigs=CIGSDAY)
nhis$cigs <- car::recode(nhis$cigs,"41:99=NA")
nhis$cigs2 <- car::recode(nhis$cigs,"1:4=1; 5:10=2; 11:15=3; 16:40=4")
nhis$cigs2 <- factor(nhis$cigs2)
nhis$cigs2 <-factor(nhis$cigs2,labels = c("<5","05-10","11-15",">15"))
summary(nhis$cigs2)
## <5 05-10 11-15 >15 NA's
## 9861 16955 4340 13067 751422
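As an alternative to spelling out each range in a recode string, the same bins can be written with base R's `cut()`, which makes it harder to leave gaps or overlaps between categories. This is a sketch on hypothetical values, reusing the category labels from above. It is not the code that produced the counts shown.

```r
# Hypothetical cigarettes-per-day values, including boundary cases and
# a value above 40 that the analysis treats as missing
cigs <- c(1, 4, 5, 10, 11, 15, 16, 40, 96)

# Right-closed intervals: (0,4], (4,10], (10,15], (15,40]
# Anything outside (0, 40] becomes NA automatically
cigs_binned <- cut(cigs,
                   breaks = c(0, 4, 10, 15, 40),
                   labels = c("<5", "05-10", "11-15", ">15"))

table(cigs_binned, useNA = "ifany")
```

Because the breaks are shared between neighbouring bins, every value in the valid range falls into exactly one category.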
Model#1: Default Listwise Deletion Method in Zelig
Model1 <- zelig(MD3 ~ drinks2 + cigs2,
model = "ologit",
data = nhis,
cite = FALSE)
summary(Model1)
## Model:
## Call:
## z5$zelig(formula = MD3 ~ drinks2 + cigs2, data = nhis)
##
## Coefficients:
## Value Std. Error t value
## drinks25-10 0.0463 0.03296 1.404
## drinks2>10 0.3443 0.07523 4.577
## cigs205-10 0.1143 0.03388 3.373
## cigs211-15 0.1689 0.04750 3.555
## cigs2>15 0.3791 0.03523 10.762
##
## Intercepts:
## Value Std. Error t value
## Normal|Moderate 1.1716 0.0276 42.5059
## Moderate|Severe 2.7282 0.0329 82.9883
##
## Residual Deviance: 46729.81
## AIC: 46743.81
## (764144 observations deleted due to missingness)
## Next step: Use 'setx' method
Model1$setx()
Model1$sim()
plot(Model1)
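For readers without Zelig installed, the same kind of model can be sketched with `MASS::polr()`. The coefficient table printed above has the same layout as a `polr` summary, so we assume it is a reasonable stand-in. The data below are simulated, and the names `smoke` and `distress` are hypothetical, not NHIS variables.

```r
library(MASS)  # polr() fits proportional-odds (ordered logit) models

set.seed(42)
n <- 600

# Simulated predictor with three levels (hypothetical stand-in for cigs2)
smoke <- factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                levels = c("low", "mid", "high"))

# Latent distress rises with smoking level; logistic noise gives an ologit
latent <- 0.4 * (smoke == "mid") + 0.9 * (smoke == "high") + rlogis(n)
distress <- cut(latent,
                breaks = c(-Inf, 1, 2.5, Inf),
                labels = c("Normal", "Moderate", "Severe"),
                ordered_result = TRUE)
sim <- data.frame(smoke, distress)

fit <- polr(distress ~ smoke, data = sim, Hess = TRUE)
summary(fit)

# Predicted probability of each distress category for the "high" group
new_obs <- data.frame(smoke = factor("high", levels = levels(smoke)))
probs <- predict(fit, newdata = new_obs, type = "probs")
probs
```

The predicted probabilities play roughly the same role as the quantities of interest that `setx()` and `sim()` produce, although Zelig obtains them by simulation rather than as a single point prediction.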

Multiple Imputations for MD3
For multiple imputation, we include only the variables that are part of our analysis. That is, any variable that will be in the analysis model should also be in the imputation model, as the suggested reading recommends. That is why we subset the data to the variables that are essential and help increase the statistical power of our model.
nhis3 <- subset(nhis2, select=c(SERIAL,YEAR,MD3,drinks,cigs))
boundK6 <- matrix(c(3, 0, 24),
nrow = 1,
ncol = 3)
boundK6
## [,1] [,2] [,3]
## [1,] 3 0 24
Using Amelia
imputeddata <- amelia(x =nhis3,
m=10 ,
ts="YEAR",
idvars =c("SERIAL","drinks","cigs"),
ords = "MD3",
polytime = 2)
## -- Imputation 1 --
##
## 1 2
##
## -- Imputation 2 --
##
## 1 2
##
## -- Imputation 3 --
##
## 1 2
##
## -- Imputation 4 --
##
## 1 2
##
## -- Imputation 5 --
##
## 1 2
##
## -- Imputation 6 --
##
## 1 2
##
## -- Imputation 7 --
##
## 1 2
##
## -- Imputation 8 --
##
## 1 2
##
## -- Imputation 9 --
##
## 1 2
##
## -- Imputation 10 --
##
## 1 2
Summary of Imputations
Below is the summary of one of the 10 imputed data sets. For our dependent variable MD3, there are no more missing values; the cases are complete and ready to be analyzed.
summary(imputeddata$imputations$imp1)
## SERIAL YEAR MD3 drinks
## Min. : 1 Min. :2010 Normal :610925 Min. : 0.000
## 1st Qu.:13166 1st Qu.:2012 Moderate:144602 1st Qu.: 0.000
## Median :26282 Median :2013 Severe : 40118 Median : 0.000
## Mean :26735 Mean :2013 Mean : 0.486
## 3rd Qu.:39627 3rd Qu.:2015 3rd Qu.: 0.000
## Max. :60347 Max. :2017 Max. :87.000
## NA's :829
## cigs
## Min. : 1.00
## 1st Qu.:96.00
## Median :96.00
## Mean :91.31
## 3rd Qu.:96.00
## Max. :96.00
## NA's :341
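A quick sanity check on an imputed data set is to compare the category shares of MD3 before and after imputation. The sketch below simply reuses the counts printed in the two summaries above and needs nothing beyond base R.

```r
# Counts copied from the summaries printed above
observed <- c(Normal = 209416, Moderate = 33584, Severe = 9530)
imputed1 <- c(Normal = 610925, Moderate = 144602, Severe = 40118)

# Category shares before imputation (complete cases) and after (imp1)
shares <- rbind(observed = prop.table(observed),
                imputed1 = prop.table(imputed1))
round(shares, 3)

# Cases filled in by the imputation = total rows minus observed rows
n_filled <- sum(imputed1) - sum(observed)
n_filled
```

The number of filled-in cases equals the count of NA's reported earlier for MD3. The Normal share falls from about 83% among observed cases to about 77% in this imputed data set, so the imputed values are somewhat more distressed on average than the observed ones.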
Recoding Alcohol Drinks Variable in Imputed Dataset “1”
imputeddata$imputations$imp1$drinks2 <-
car::recode(imputeddata$imputations$imp1$drinks,"0=NA; 21:99.9=NA")
imputeddata$imputations$imp1$drinks2 <-
car::recode(imputeddata$imputations$imp1$drinks2,"1:4=1;5:9=2;10:20=3")
imputeddata$imputations$imp1$drinks2 <-
as.factor(imputeddata$imputations$imp1$drinks2)
imputeddata$imputations$imp1$drinks2 <-
factor(imputeddata$imputations$imp1$drinks2,labels = c("<5","5-9",">10"))
summary(imputeddata$imputations$imp1$drinks2)
## <5 5-9 >10 NA's
## 146227 12982 2539 633897
Recoding Cigarettes Smoked Variable in Imputed Dataset “1”
imputeddata$imputations$imp1$cigs2 <-
car::recode(imputeddata$imputations$imp1$cigs,"41:99=NA")
imputeddata$imputations$imp1$cigs2 <-
car::recode(imputeddata$imputations$imp1$cigs2,"1:4=1;5:10=2;11:15=3;16:40=4")
imputeddata$imputations$imp1$cigs2 <-
as.factor(imputeddata$imputations$imp1$cigs2)
imputeddata$imputations$imp1$cigs2 <-factor(imputeddata$imputations$imp1$cigs2,
labels =c("<5","6-10","11-15",">16"))
summary(imputeddata$imputations$imp1$cigs2)
## <5 6-10 11-15 >16 NA's
## 9861 16955 4340 13067 751422
Model#2: Multiple Imputations Dataset
Model2 <- zelig(MD3 ~ drinks2 + cigs2,
                model = "ologit",
                data = imputeddata$imputations$imp1,
                cite = FALSE)
summary(Model2)
## Model:
## Call:
## z5$zelig(formula = MD3 ~ drinks2 + cigs2, data = imputeddata$imputations$imp1)
##
## Coefficients:
## Value Std. Error t value
## drinks25-9 0.03312 0.03399 0.9743
## drinks2>10 0.22833 0.06251 3.6524
## cigs26-10 0.10878 0.03355 3.2427
## cigs211-15 0.15643 0.04708 3.3226
## cigs2>16 0.36722 0.03493 10.5139
##
## Intercepts:
## Value Std. Error t value
## Normal|Moderate 1.1630 0.0273 42.6393
## Moderate|Severe 2.7286 0.0326 83.6959
##
## Residual Deviance: 47484.04
## AIC: 47498.04
## (763598 observations deleted due to missingness)
## Next step: Use 'setx' method
Model2$setx()
Model2$sim()
plot(Model2)
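Model 2 uses only the first of the imputed data sets. A common next step in multiple imputation is to fit the model on every imputed data set and combine the results with Rubin's rules. The sketch below shows that pooling step for a single coefficient. The estimates and standard errors are made-up numbers, not output from our models.

```r
# Hypothetical coefficient estimates and standard errors for one term,
# one pair per imputed data set (m = 5 here; the numbers are made up)
est <- c(0.22, 0.25, 0.21, 0.24, 0.23)
se  <- c(0.06, 0.06, 0.07, 0.06, 0.06)
m   <- length(est)

pooled_est  <- mean(est)            # pooled point estimate
within_var  <- mean(se^2)           # average sampling variance
between_var <- var(est)             # variance across imputations
total_var   <- within_var + (1 + 1 / m) * between_var

pooled_se <- sqrt(total_var)
c(estimate = pooled_est, std_error = pooled_se)
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the uncertainty that comes from the imputed values themselves.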

Statistical models
|                | Model 1   | Model 2   |
|----------------|-----------|-----------|
| drinks25-10    | 0.05      |           |
|                | (0.03)    |           |
| drinks2>10     | 0.34***   | 0.23***   |
|                | (0.08)    | (0.06)    |
| cigs205-10     | 0.11***   |           |
|                | (0.03)    |           |
| cigs211-15     | 0.17***   | 0.16***   |
|                | (0.05)    | (0.05)    |
| cigs2>15       | 0.38***   |           |
|                | (0.04)    |           |
| drinks25-9     |           | 0.03      |
|                |           | (0.03)    |
| cigs26-10      |           | 0.11**    |
|                |           | (0.03)    |
| cigs2>16       |           | 0.37***   |
|                |           | (0.03)    |
| AIC            | 46743.81  | 47498.04  |
| BIC            | 46802.31  | 47556.67  |
| Log Likelihood | -23364.90 | -23742.02 |
| Deviance       | 46729.81  | 47484.04  |
| Num. obs.      | 31501     | 32047     |
*** p < 0.001, ** p < 0.01, * p < 0.05
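Ordered logit coefficients are on the log-odds scale, so exponentiating them gives proportional odds ratios that are easier to compare across the two models. The sketch below does this for the top category of each predictor, using coefficients copied from the model summaries above.

```r
# Coefficients copied from the Model 1 and Model 2 summaries above
model1 <- c("drinks2>10" = 0.3443, "cigs2>15" = 0.3791)
model2 <- c("drinks2>10" = 0.22833, "cigs2>16" = 0.36722)

# Exponentiating a log-odds coefficient gives a proportional odds ratio
or1 <- exp(model1)
or2 <- exp(model2)

round(or1, 2)
round(or2, 2)
```

All four odds ratios are above 1, meaning the heaviest drinking and smoking categories are associated with higher odds of being in a more severe distress category than the reference group in both models.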
Conclusion & Discussion
Finally, it is important to mention that although we imputed the missing values for our dependent variable MD3, the results from the two models are almost identical. There is no major difference between Model#1 and Model#2 apart from small variation in the coefficients. This is due to the underlying fact that our two independent variables have a weak effect on mental health. In addition, using multiple imputation on our two independent variables would introduce bias, since not all respondents are smokers or drinkers. It is also very common for multiple imputation not to produce large differences in the associations.
The goal of this analysis was to handle the missing values in our data carefully, which we did using the Amelia package. It is quite possible that more detailed data with more reliable variables would help us estimate more nuanced results on the association between modifiable risk factors and nonspecific psychological distress.