Count Data Model: Regression and simulation of a frequency of days drinking alcohol

Introduction

In this assignment I am going to use the National Health Interview Survey (NHIS) to discover the causations of the frequency of drinking alcohol. The NHIS database that I am using is an integrated data collected by the National Center for Health Statistics (NCHS), and Centers for Disease Control and Prevention (CDC). The National Health Interview Survey has monitored the health of the nation since 1957. NHIS data on a broad range of health topics are collected through personal household interviews. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives.

The purpose of this analysis is to discover the influence of the amount of days at work, depression, age and gender on frequency of drinking. The aim is to find out which of these variables decreases and increases alcohol consumption. The variables I am going to use for the purpose of this study are:

  1. alc5upyr - Days had 5+ drinks, past year: Time period
  2. hourswrk - Total hours worked last week or usually during the week
  3. depfreq - Respondents reported how often they felt depressed
  4. age - The variable reports individual’s age
  5. sex - Indicates whether the person was male or female

Loading packages

library(DAAG)
library(Zelig)
library(ZeligChoice)
library(dplyr)
library(tidyr)
library(ggplot2)
library(sjmisc)

Loading the NHIS data into R

library(haven)
healthdata <- read_dta ("/Users/lasha/Desktop/Queens College/755/NHIS_v2.dta")
head(healthdata)

Creating a smaller dataset called “health”

It contains variables needed for the analysis. The variables were recoded so the missing values are not included. The variable SEX was changed to a binary 0/1 coding. The variable DEPFREQ “How often feel depressed” has values from 1 to 5 where 1 indicates never depressed, 2 = a few times a year, 3 = monthly, 4 = weekly, 5 = daily.

health <- healthdata %>%
  filter (alc5upyr<=365, depfreq >0, depfreq <=5, hourswrk<=95) %>%
  mutate (depfreq1 = sjmisc::rec(depfreq, rec = "5=1; 4=2; 3=3; 2=4; 1=5")) %>%
  mutate (sex1 = sjmisc::rec(sex, rec = "0=NA; 1=1; 2=0; 7/8/9=NA")) %>%
  select (alc5upyr, hourswrk, depfreq1, age, sex1)
head(health)
## # A tibble: 6 x 5
##    alc5upyr  hourswrk depfreq1   age  sex1
##   <dbl+lbl> <dbl+lbl>    <dbl> <dbl> <dbl>
## 1         0         0        2    81     0
## 2         0         0        1    58     1
## 3         0        95        1    57     1
## 4        12        40        1    27     1
## 5         0        44        4    37     1
## 6        10         4        2    27     1

Overview of the variables

ggplot hours worked

Plot showing the number of total hours worked last week or weekly usually. The high number of 0 hours worked indicates that the question does not apply to a person (for example a person is not employed).

ggplot(data=health) +
geom_histogram(aes(x=hourswrk))

ggplot frequency of drinking

Plot showing the number of days when having 5 or more drinks in the past year (Time period)

ggplot(data=health) +
geom_histogram(aes(x=alc5upyr))

Count data regresion model using Zelig

The analysis of influence of independent variables (hours work, depression, age and sex) on a dependent variable which is alc5upyr (number of days drinking 5+ drinks a year).

The results of the regression show that the the amount of hours worked in a week has a negative influence on days drinking. The same negative influence has the age of the person. With the increase of years the amounts of days drinking 5 or more drinks declines.

We can see that depression has a positive influence on the number of days drinking alcohol. It means that the more depressed people report to be the more days a year they drink alcohol. The regression also show that males are more likely to drink 5 or more drinks more often during the year.

options(scipen=9999)
d <-zelig(alc5upyr ~ hourswrk + depfreq1 + age + sex1, model = "poisson", data = health, cite = F)
summary(d)
## Model: 
## 
## Call:
## z5$zelig(formula = alc5upyr ~ hourswrk + depfreq1 + age + sex1, 
##     data = health)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -10.429   -4.876   -3.376   -2.532   51.998  
## 
## Coefficients:
##                Estimate  Std. Error z value            Pr(>|z|)
## (Intercept)  2.07916088  0.00514647  404.00 <0.0000000000000002
## hourswrk    -0.00363771  0.00005869  -61.98 <0.0000000000000002
## depfreq1     0.20854862  0.00101379  205.71 <0.0000000000000002
## age         -0.01386634  0.00007598 -182.49 <0.0000000000000002
## sex1         1.13766929  0.00277830  409.48 <0.0000000000000002
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2890084  on 59796  degrees of freedom
## Residual deviance: 2639747  on 59792  degrees of freedom
## AIC: 2727221
## 
## Number of Fisher Scoring iterations: 8
## 
## Next step: Use 'setx' method

Simulations

The visualization show the relationship between the level of depression and frequency of days in a year when drinking 5 or more alcoholic beverages

The graph indicates that people who are never depressed (1) consume alcohol less often than people who have a high depression level (5). People who are more depressed tend to drink alcohol much more frequently than people who feel depressed rarely or not at all. This graph show that the depression increase the amount of days in which the alcohol was consumed.

a.range = min(health$depfreq1): max(health$depfreq1)
x <- setx(d, depfreq1 = a.range)
s <- sim (d, x =x )
ci.plot (s)

The visualization show the relationship between the hours worked in a week and frequency of days a year when drinking 5 or more alcoholic beverages

The graph indicates that people who work more hours a week tend to drink less days a year than people who work less often or not at all.

a.range = min(health$hourswrk): max(health$hourswrk)
x <- setx(d, hourswrk = a.range)
w <- sim (d, x =x )
ci.plot (w)

The visualization show the relationship between the age and frequency of drinking

The graph indicates that people who are older tend to drink less days a year than people who are younger.

a.range = min(health$age): max(health$age)
x <- setx(d, age = a.range)
y <- sim (d, x =x )
ci.plot (y)

Simulated comparison between a level of depression and the number of days drinking

Never depressed vs Daily depressed

ev1 = when the person feels never depressed the mean number of days in which 5 or more alcohol drinks were consumed equals 8.179 ev2 = when the person feels depressed everyday the mean number of days of drinking alcohol increases to 18.832

fd = It is a difference between the second expected value and the first expected value: 18.83249 - 8.179053 = 10.65344

The first difference means that the person who is never depressed drinks 10.653 days a year less than a person who is depressed everyday.

x.low <- setx(d, depfreq1 = 1)
x.high <- setx(d, depfreq1 = 5)
sx <- sim(d, x = x.low, x1 = x.high)
summary(sx)
## 
##  sim x :
##  -----
## ev
##          mean         sd      50%     2.5%    97.5%
## [1,] 8.178777 0.01423644 8.178813 8.151901 8.208176
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 8.204 2.819634   8    3    14
## 
##  sim x1 :
##  -----
## ev
##          mean         sd      50%     2.5%    97.5%
## [1,] 18.83655 0.06545293 18.83734 18.70591 18.96949
## pv
##        mean       sd 50% 2.5% 97.5%
## [1,] 18.959 4.347899  19   11    28
## fd
##          mean         sd      50%     2.5%    97.5%
## [1,] 10.65778 0.06981764 10.65607 10.52136 10.80337
#Visualization of a difference between a depression level and a number of days of drinking alcohol.
plot(sx)

Simulated comparison between a gender and the number of days drinking

Males vs Females

ev1 = men drinks more then 5 alcohol drinks around 16.815 days a year. ev2 = women drinks more then 5 alcohol drinks around 5.390 days a year.

fd = It is a difference between the number of drinking days between males and females: 5.390514 - 16.81578 = -11.42527

The first difference means that females tend to drink alcohol 11.42527 less days a year then males.

x.m <- setx(d, sex1 = 1)
x.f <- setx(d, sex1 = 0)
mf <- sim(d, x = x.m, x1 = x.f)
summary(mf)
## 
##  sim x :
##  -----
## ev
##          mean         sd      50%    2.5%    97.5%
## [1,] 16.81414 0.02411747 16.81411 16.7669 16.86077
## pv
##        mean      sd 50%  2.5% 97.5%
## [1,] 16.945 4.08248  17 9.975    25
## 
##  sim x1 :
##  -----
## ev
##          mean         sd      50%    2.5%    97.5%
## [1,] 5.389991 0.01327621 5.389421 5.36467 5.417001
## pv
##       mean       sd 50% 2.5% 97.5%
## [1,] 5.402 2.304714   5    1    10
## fd
##           mean        sd       50%     2.5%     97.5%
## [1,] -11.42414 0.0275508 -11.42386 -11.4764 -11.36955
#Visualization of a difference between males and females and a number of days of drinking alcohol.
plot(mf)

Sex difference in frequency of drinking

#gender difference between people who drinks 5 or more drinks zero days a year
d1x <- setx (d, alc5upyr = 0,  sex1 = 0)
d1x1 <- setx(d, alc5upyr = 0, sex1 = 1)
d1s <- sim(d, x = d1x, x1 = d1x1)

#gender difference between people who drinks 5 or more drinks 10 days a year
d2x <- setx (d, alc5upyr = 10,  sex1 = 0)
d2x1 <- setx(d, alc5upyr = 10, sex1 = 1)
d2s <- sim(d, x = d2x, x1 = d2x1)

#gender difference between people who drinks 5 or more drinks 20 days a year
d3x <- setx (d, alc5upyr = 20,  sex1 = 0)
d3x1 <- setx(d, alc5upyr = 20, sex1 = 1)
d3s <- sim(d, x = d3x, x1 = d3x1)


d1 <- d1s$get_qi(xvalue="x1", qi="fd")
d2 <- d2s$get_qi(xvalue="x1", qi="fd")
d3 <- d3s$get_qi(xvalue="x1", qi="fd")

dfd <- as.data.frame(cbind(d1, d2, d3))
tidd <- dfd %>% 
  gather(alc5upyr, simv, 1:3)

tidd %>% 
  group_by(alc5upyr) %>% 
  summarise(mean = mean(simv), sd = sd(simv))
## # A tibble: 3 x 3
##   alc5upyr     mean         sd
##      <chr>    <dbl>      <dbl>
## 1       V1 11.42288 0.02714639
## 2       V2 11.42524 0.02706163
## 3       V3 11.42454 0.02671804
ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~alc5upyr)

The relationship between the number of days drinking and hours in work

The hourswork was grouped into ordinal leveled variable called “hoursrank” to make an ordinal logit regression model. The hours of work were grouped into categories “0-20” hours a week, “21-40” hours a week, “41-60” hours a week, “61-80” hours a week, “81+”hours a week.

health2 <- health %>%
  mutate (hoursrank = rec(hourswrk, rec="0:20=0-20; 21:40=21-40; 41:60=41-60; 61:80=61-80; else= 81+")) %>%
  select(alc5upyr, hourswrk, hoursrank, depfreq1, age, sex1)
head(health2)
## # A tibble: 6 x 6
##    alc5upyr  hourswrk hoursrank depfreq1   age  sex1
##   <dbl+lbl> <dbl+lbl>     <chr>    <dbl> <dbl> <dbl>
## 1         0         0      0-20        2    81     0
## 2         0         0      0-20        1    58     1
## 3         0        95       81+        1    57     1
## 4        12        40     21-40        1    27     1
## 5         0        44     41-60        4    37     1
## 6        10         4      0-20        2    27     1

Order logit model

The regression is showing the influence of number of days drinking alcohol a year on the amount of hours at work weekly. The results in this regression show that with the increase of hours worked people tend to drink alcohol more days a year.

health2$hoursrank <- factor(health2$hoursrank, ordered = TRUE,
                         levels = c("0-20", "21-40", "41-60", "61-80", "81+"))

d.ord <- zelig(hoursrank ~ alc5upyr, model = "ologit",
               data = health2, cite = F)
summary(d.ord)
## Model: 
## Call:
## z5$zelig(formula = hoursrank ~ alc5upyr, data = health2)
## 
## Coefficients:
##              Value Std. Error t value
## alc5upyr 0.0004473   0.000188   2.379
## 
## Intercepts:
##             Value    Std. Error t value 
## 0-20|21-40   -0.3835   0.0086   -44.6369
## 21-40|41-60   1.3692   0.0104   131.8724
## 41-60|61-80   3.6390   0.0259   140.3899
## 61-80|81+     5.6112   0.0677    82.8420
## 
## Residual Deviance: 136936.96 
## AIC: 136946.96 
## Next step: Use 'setx' method

Conclusion

The analysis showed that the frequency of drinking changes under the influence of hours at work and depression level but it also vary for people in different age and for males and females. The key finding is that the people with high level of depression tend to drink more days a year. The research also show that males tend to drink much more days a year than females and that older people drinks less frequently.

NHIS data citation: Citation “Lynn A. Blewett, Julia A. Rivera Drew, Risa Griffin, Miram L. King, and Kari C.W. Williams. IPUMS Health Surveys: National Health Interview Survey, Version 6.2 [dataset]. Minneapolis, MN: University of Minnesota, 2016. http://doi.org/10.18128/D070.V6.2