Introduction

Alcohol can briefly lift our mood, but heavy long-term consumption can cause severe mental health problems. Regular smoking may have a similar adverse effect. Mental health is a broad term, and psychologists define mental health problems as combinations of many different symptoms. The Nonspecific Psychological Distress Scale (K6) was introduced to measure the level of distress in individuals. In this brief analysis, we use the K6 scale to examine how health behaviors such as drinking and smoking relate to distress. The goal of the analysis is to use multiple imputation to handle the missing values in our dependent variable.

Data

The data used in this analysis come from the National Health Interview Survey (NHIS). This is a consolidated multi-year data set (2011-2016) with 717,513 observations and 45 health-related variables. NHIS data have been used extensively throughout the Department of Health and Human Services (DHHS) to monitor trends in illness and disability and to track progress toward national health objectives. Public health researchers also use the data for epidemiological and policy analysis, such as characterizing people with various health problems, identifying barriers to accessing and using appropriate health care, and evaluating federal health programs (CDC).

Primary Dependent Variable

To measure mental distress, we formed a K6 (Nonspecific Psychological Distress Scale) variable by combining the scores from six questions in the NHIS data set into one consolidated variable, K6. The variable represents three levels of mental distress: 1 - Normal (score 0-5), 2 - Moderate (score 6-12), and 3 - Severe (score 13-24).
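As an illustration of the three score ranges above, the following base R sketch bins a handful of made-up K6 totals. The helper `k6_category()` and the example scores are hypothetical and are not part of the original analysis, which uses `car::recode()` instead.

```r
# Hypothetical helper (not from the original analysis): map a K6 total
# (0-24) to the three distress levels described in the text.
k6_category <- function(score) {
  cut(score,
      breaks = c(-Inf, 5, 12, 24),   # (-Inf,5], (5,12], (12,24]
      labels = c("Normal", "Moderate", "Severe"),
      ordered_result = TRUE)
}

# Made-up scores at the edges of each range, plus one missing value.
example_scores <- c(0, 5, 6, 12, 13, 24, NA)
print(k6_category(example_scores))
```

Because `cut()` uses right-closed intervals by default, the boundary scores 5 and 12 fall in the lower category, which matches the ranges stated above.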

Key Independent Variable

There are two independent variables in the study: “Alcohol Drinks”, renamed “drinks”, and “Cigarettes Smoked”, renamed “cigs”. Both are categorized for the analysis. Originally the two variables are numeric, with values from 0 to 90, where each number represents the quantity of drinks or cigarettes consumed by the respondent. Higher values are treated with extreme caution because they may result from respondents misunderstanding the survey question.

Statistical Analysis

We first fit an ordered logistic regression model to obtain the quantities of interest. Our primary focus, however, is to generate imputed values for our dependent variable “MD3” (Mental Distress, 3 Categories). The multiple imputation is carried out with the Amelia package. Finally, using one of the imputed data sets, we compare the imputed model with the non-imputed model.

It is important to note that the NHIS data have a high number of missing cases, especially for modifiable risk factors such as alcohol use and cigarette smoking. Given the motivation behind multiple imputation, we do not impute our independent variables, alcohol drinks and cigarettes smoked.

The reason is that most respondents are probably not heavy drinkers or smokers, so their responses are missing not at random but because the questions do not apply to them. Our dependent variable K6, by contrast, is a legitimate candidate for multiple imputation, because a mental health level can in principle be measured and recorded for every person. Its missing values may be due to randomness or item non-response. That is why we focus only on the dependent variable (MD3), for which multiple imputation makes much more sense than for the other two variables.

K6 Composite Score Variable.

Recoding the original variables.

library(dplyr)   # mutate()
library(Zelig)   # zelig()
library(Amelia)  # amelia()

nhis <- mutate(nhis,
               sad       = ASAD,
               effort    = AEFFORT,
               hopeless  = AHOPELESS,
               nervous   = ANERVOUS,
               restless  = ARESTLESS,
               worthless = AWORTHLESS)

Setting up missing values for each K6 variable.

nhis$sad       <- car::recode(nhis$sad, "6:9=NA")
nhis$effort    <- car::recode(nhis$effort, "6:9=NA")
nhis$hopeless  <- car::recode(nhis$hopeless, "6:9=NA")
nhis$nervous   <- car::recode(nhis$nervous, "6:9=NA")
nhis$restless  <- car::recode(nhis$restless, "6:9=NA")
nhis$worthless <- car::recode(nhis$worthless, "6:9=NA")

Adding the six items to form the composite score variable (K6)

nhis <- mutate(nhis, K6=sad+effort+hopeless+nervous+restless+worthless)
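One consequence of summing the items with `+` is worth seeing on a toy example: a respondent with even one missing item gets a missing K6 total. The sketch below uses a made-up three-row data frame named `toy` (an assumption for illustration, not NHIS data).

```r
# Toy data (made up): three respondents, six K6 items scored 0-4.
toy <- data.frame(
  sad       = c(0, 2, NA),
  effort    = c(1, 3, 4),
  hopeless  = c(0, 1, 2),
  nervous   = c(1, 4, 3),
  restless  = c(0, 2, 1),
  worthless = c(0, 1, 0)
)
item_names <- c("sad", "effort", "hopeless", "nervous", "restless", "worthless")

# Same arithmetic as the composite score above: `+` propagates NA.
toy$K6 <- with(toy, sad + effort + hopeless + nervous + restless + worthless)
print(toy$K6)

# Number of respondents with at least one missing item.
n_incomplete <- sum(!complete.cases(toy[, item_names]))
print(n_incomplete)
```

This is one reason the composite, and hence MD3, can carry more missing values than any single item.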

Creating the 3-category K6 score variable (MD3)

nhis <- mutate(nhis, MD3 = K6)

Assigning a category code to each score range of MD3

nhis$MD3 <- car::recode(nhis$MD3, "0:5=1;6:12=2; 13:24=3")

Validating the new variable, as a factor variable

nhis$MD3 <- factor(nhis$MD3)
is.factor(nhis$MD3)
## [1] TRUE

Recoding for the ordered logistic regression.

The summary of our dependent variable MD3 shows that 543,115 cases are missing from the original data set. We use multiple imputation exclusively to account for these missing values.

nhis$MD3 <- factor(nhis$MD3, labels = c("Normal", "Moderate", "Severe"))
nhis$MD3 <- factor(nhis$MD3, ordered = TRUE, levels = c("Normal", "Moderate", "Severe"))
summary(nhis$MD3)
##   Normal Moderate   Severe     NA's 
##   209416    33584     9530   543115

Alcohol Consumption Variable

Creating the variable “Alcohol Drinks Per Day”

For the alcohol consumption variable, we rely on the default listwise deletion in Zelig. This is appropriate for this type of variable because, as mentioned earlier, most respondents in our sample are probably not heavy or even occasional drinkers. Applying multiple imputation (MI) to this variable, or to the other independent variable, smoking, would therefore introduce bias.

nhis <- mutate(nhis, drinks = ALCAMT / 10)
nhis$drinks <- car::recode(nhis$drinks, "0=NA; 90:99.9=NA")
nhis$drinks2 <- car::recode(nhis$drinks, "1:4=1; 5:10=2; 11:89.9=3")
nhis$drinks2 <- factor(nhis$drinks2)
nhis$drinks2 <- factor(nhis$drinks2,
                       labels = c("<5",
                                  "5-10",
                                  ">10"))
summary(nhis$drinks2)
##     <5   5-10    >10   NA's 
## 146227  14085   1598 633735

Cigarettes Smoked Variable

Creating the variable “cigarettes smoked per day”.

We take a similar approach for the smoking variable, relying on Zelig's default listwise deletion to handle its missing values.

nhis <- mutate(nhis, cigs = CIGSDAY)
nhis$cigs <- car::recode(nhis$cigs, "41:99=NA")
# Ranges must not overlap: 15 belongs to category 3, so category 4 starts at 16.
nhis$cigs2 <- car::recode(nhis$cigs, "1:4=1; 5:10=2; 11:15=3; 16:40=4")
nhis$cigs2 <- factor(nhis$cigs2)
nhis$cigs2 <- factor(nhis$cigs2, labels = c("<5", "05-10", "11-15", ">15"))
summary(nhis$cigs2)
##     <5  05-10  11-15    >15   NA's 
##   9861  16955   4340  13067 751422

Model#1: Default Listwise Deletion Method in Zelig

Model1 <- zelig(MD3 ~ drinks2 + cigs2, 
                model = "ologit", 
                data = nhis, 
                cite = FALSE)
summary(Model1)
## Model: 
## Call:
## z5$zelig(formula = MD3 ~ drinks2 + cigs2, data = nhis)
## 
## Coefficients:
##              Value Std. Error t value
## drinks25-10 0.0463    0.03296   1.404
## drinks2>10  0.3443    0.07523   4.577
## cigs205-10  0.1143    0.03388   3.373
## cigs211-15  0.1689    0.04750   3.555
## cigs2>15    0.3791    0.03523  10.762
## 
## Intercepts:
##                 Value   Std. Error t value
## Normal|Moderate  1.1716  0.0276    42.5059
## Moderate|Severe  2.7282  0.0329    82.9883
## 
## Residual Deviance: 46729.81 
## AIC: 46743.81 
## (764144 observations deleted due to missingness)
## Next step: Use 'setx' method
Model1$setx()
Model1$sim()
plot(Model1)

Multiple Imputations for MD3

For the multiple imputation, we include only the variables that are part of our analysis: any variable in the analysis model should also be in the imputation model, as the suggested reading recommends. We therefore subset the data to the variables that are essential and help increase the statistical power of our model.

nhis3 <- subset(nhis2, select=c(SERIAL,YEAR,MD3,drinks,cigs))
boundK6 <- matrix(c(3, 0, 24),
                  nrow = 1,
                  ncol = 3)
boundK6
##      [,1] [,2] [,3]
## [1,]    3    0   24

Using Amelia

imputeddata <- amelia(x = nhis3,
                      m = 10,
                      ts = "YEAR",
                      idvars = c("SERIAL", "drinks", "cigs"),
                      ords = "MD3",
                      polytime = 2)
## -- Imputation 1 --
## 
##   1  2
## 
## -- Imputation 2 --
## 
##   1  2
## 
## -- Imputation 3 --
## 
##   1  2
## 
## -- Imputation 4 --
## 
##   1  2
## 
## -- Imputation 5 --
## 
##   1  2
## 
## -- Imputation 6 --
## 
##   1  2
## 
## -- Imputation 7 --
## 
##   1  2
## 
## -- Imputation 8 --
## 
##   1  2
## 
## -- Imputation 9 --
## 
##   1  2
## 
## -- Imputation 10 --
## 
##   1  2

Summary of Imputations

Below is the summary of one of the 10 imputed data sets. Our dependent variable MD3 has no remaining missing values; its cases are complete and ready to be analyzed.

summary(imputeddata$imputations$imp1)
##      SERIAL           YEAR            MD3             drinks      
##  Min.   :    1   Min.   :2010   Normal  :610925   Min.   : 0.000  
##  1st Qu.:13166   1st Qu.:2012   Moderate:144602   1st Qu.: 0.000  
##  Median :26282   Median :2013   Severe  : 40118   Median : 0.000  
##  Mean   :26735   Mean   :2013                     Mean   : 0.486  
##  3rd Qu.:39627   3rd Qu.:2015                     3rd Qu.: 0.000  
##  Max.   :60347   Max.   :2017                     Max.   :87.000  
##                                                   NA's   :829     
##       cigs      
##  Min.   : 1.00  
##  1st Qu.:96.00  
##  Median :96.00  
##  Mean   :91.31  
##  3rd Qu.:96.00  
##  Max.   :96.00  
##  NA's   :341

Recoding Alcohol Drinks Variable in Imputed Dataset “1”

imputeddata$imputations$imp1$drinks2 <-
  car::recode(imputeddata$imputations$imp1$drinks, "0=NA; 21:99.9=NA")

imputeddata$imputations$imp1$drinks2 <-
  car::recode(imputeddata$imputations$imp1$drinks2, "1:4=1; 5:9=2; 10:20=3")

imputeddata$imputations$imp1$drinks2 <-
  as.factor(imputeddata$imputations$imp1$drinks2)

imputeddata$imputations$imp1$drinks2 <-
  factor(imputeddata$imputations$imp1$drinks2, labels = c("<5", "5-9", ">10"))

summary(imputeddata$imputations$imp1$drinks2)
##     <5    5-9    >10   NA's 
## 146227  12982   2539 633897

Recoding Cigarettes Smoked Variable in Imputed Dataset “1”

imputeddata$imputations$imp1$cigs2 <-
  car::recode(imputeddata$imputations$imp1$cigs, "41:99=NA")

imputeddata$imputations$imp1$cigs2 <-
  car::recode(imputeddata$imputations$imp1$cigs2, "1:4=1; 5:10=2; 11:15=3; 16:40=4")

imputeddata$imputations$imp1$cigs2 <-
  as.factor(imputeddata$imputations$imp1$cigs2)

imputeddata$imputations$imp1$cigs2 <-
  factor(imputeddata$imputations$imp1$cigs2,
         labels = c("<5", "6-10", "11-15", ">16"))

summary(imputeddata$imputations$imp1$cigs2)
##     <5   6-10  11-15    >16   NA's 
##   9861  16955   4340  13067 751422

Model#2: Multiple Imputations Dataset

Model2 <- zelig(MD3 ~ drinks2 + cigs2, 
                model = "ologit",
                data=imputeddata$imputations$imp1, 
                cite=FALSE)
summary(Model2)
## Model: 
## Call:
## z5$zelig(formula = MD3 ~ drinks2 + cigs2, data = imputeddata$imputations$imp1)
## 
## Coefficients:
##              Value Std. Error t value
## drinks25-9 0.03312    0.03399  0.9743
## drinks2>10 0.22833    0.06251  3.6524
## cigs26-10  0.10878    0.03355  3.2427
## cigs211-15 0.15643    0.04708  3.3226
## cigs2>16   0.36722    0.03493 10.5139
## 
## Intercepts:
##                 Value   Std. Error t value
## Normal|Moderate  1.1630  0.0273    42.6393
## Moderate|Severe  2.7286  0.0326    83.6959
## 
## Residual Deviance: 47484.04 
## AIC: 47498.04 
## (763598 observations deleted due to missingness)
## Next step: Use 'setx' method
Model2$setx()
Model2$sim()
plot(Model2)
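Model#2 is fitted to only one of the ten imputed data sets. A common way to use all of them is to fit the same model to each data set and combine the results with Rubin's rules. The base R sketch below shows that arithmetic for a single coefficient; the function `pool_rubin()` is a hypothetical helper and the input numbers are made up for illustration, not taken from the imputed NHIS data.

```r
# Hypothetical helper: pool one coefficient across m imputed data sets
# using Rubin's rules.
pool_rubin <- function(estimates, std_errors) {
  m       <- length(estimates)
  q_bar   <- mean(estimates)            # pooled point estimate
  within  <- mean(std_errors^2)         # average within-imputation variance
  between <- var(estimates)             # between-imputation variance
  total   <- within + (1 + 1 / m) * between
  c(estimate = q_bar, std_error = sqrt(total))
}

# Made-up estimates and standard errors from three imputed data sets.
est <- c(0.22, 0.24, 0.23)
se  <- c(0.06, 0.06, 0.06)
pooled <- pool_rubin(est, se)
print(pooled)
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the uncertainty that comes from the imputation itself.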

Statistical models

                    Model 1       Model 2
drinks25-10           0.05
                     (0.03)
drinks2>10            0.34***       0.23***
                     (0.08)        (0.06)
cigs205-10            0.11***
                     (0.03)
cigs211-15            0.17***       0.16***
                     (0.05)        (0.05)
cigs2>15              0.38***
                     (0.04)
drinks25-9                          0.03
                                   (0.03)
cigs26-10                           0.11**
                                   (0.03)
cigs2>16                            0.37***
                                   (0.03)
AIC               46743.81      47498.04
BIC               46802.31      47556.67
Log Likelihood   -23364.90     -23742.02
Deviance          46729.81      47484.04
Num. obs.            31501         32047

*** p < 0.001, ** p < 0.01, * p < 0.05

Conclusion & Discussion

Finally, although we imputed the missing values of our dependent variable MD3, the results from the two models are almost identical. There is no major difference between Model#1 and Model#2 apart from small variations in the coefficients. Model#2 still drops cases with missing values on the two independent variables, so it uses 32,047 observations against 31,501 in Model#1. The similarity may also reflect a weak effect of our two independent variables on mental health. In addition, applying multiple imputation to the two independent variables would introduce bias, since not all respondents are smokers or drinkers. It is also common for multiple imputation to change estimated associations only slightly.

The goal of this analysis was to handle the missing values in our data carefully, which we did using the Amelia package. It is quite possible that more detailed data with more reliable variables would yield more nuanced estimates of the association between modifiable risk factors and nonspecific psychological distress.