The dataset used for this investigation is the National Health Interview Survey (NHIS) from 1997-2016. The analysis is between Average number of cigarettes smoker per day, the dependent variable and total hours worked in the past week, being the independant variable. For this analysis, I will study the releationship between # of cigarettes smoked and hours worked along with other predicting independent varaibles such as race, sex, region (Northeast, North Central/Midwest, South and the West) and BMI. There has been alot of study done regarding the BMI level of smokers so it will be interesting to see the BMI effect on the relationship between number of cigarettes smoked and hours worked. My hypothesis for this is that those who worker longer hours, smoker more number of cigarettes then those with lower hours worked in the past week. Additionally, I think male smoker will tend to smoke more than females depsite working hours. I also think it will be interesting to see the impact of race on smokers with working hours such as between Whites and Blacks. For this investigation, we will be doing a count data models using the Zelig package.
library(readr)
library(dplyr)
library(texreg)
library(ggplot2)
library(Zelig)
library(tidyr)
load("/Users/Deepakie/Documents/Queens College/SOC712/Data/NHIS_v3.rdata")
head(NHIS_v3)
NHIS<- NHIS_v3%>%
select(sex,racenew,bmi_7,region, hourswrk,cigsday)%>%
mutate (sex = ifelse(sex==1, "1","2"),
BMI = ifelse(bmi_7==1, "1",
ifelse(bmi_7==2, "2",
ifelse(bmi_7==3, "3",
ifelse(bmi_7==4, "4",
ifelse(bmi_7==5, "5",
ifelse(bmi_7==6, "6",NA)))))),
race =ifelse(racenew==10, "Black",
ifelse(racenew==20, "White",
ifelse(between(racenew, 30, 60), "Other",
ifelse(racenew>=61, NA,NA)))),
hourswork = ifelse(hourswrk==0, NA,
ifelse(hourswrk>=97, NA, hourswrk)),
cigsday = ifelse(cigsday >= 91, NA, cigsday))%>%
select(-bmi_7,-racenew,-hourswrk)%>%
filter(!is.na(BMI), !is.na(sex), !is.na(BMI), !is.na(race), !is.na(hourswork), !is.na(cigsday))
This is where we recoded and cleaned the NHIS data set accordingly to our analysis. As you can see, the variables are recoded numerically for the most part and are either categorical or continuous. I got rid of the NAs for each varaible. Lets see how the data set looks now!
head(NHIS)
M <- zelig(cigsday ~ hourswork , model = "poisson", data = NHIS, cite = F)
summary(M)
Model:
Call:
z5$zelig(formula = cigsday ~ hourswork, data = NHIS)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.9584 -2.5475 -0.8116 1.7867 14.7293
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.416e+00 3.556e-03 679.3 <2e-16
hourswork 3.804e-03 8.269e-05 46.0 <2e-16
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 449938 on 64479 degrees of freedom
Residual deviance: 447834 on 64478 degrees of freedom
AIC: 711786
Number of Fisher Scoring iterations: 5
Next step: Use 'setx' method
This poisson model is between cigarettes per day smoked by current smokers and hours worked in the week, on average. Poisson model is used to tell us the statistical significance effect of the explanatory variable on the response varaible. In other words, the effect hours worked in the past week have on numbers of cigarettes smoked.
M1 <- zelig(cigsday ~ hourswork + BMI , model = "poisson", data = NHIS, cite = F)
summary(M1)
Model:
Call:
z5$zelig(formula = cigsday ~ hourswork + BMI, data = NHIS)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.9735 -2.5342 -0.8148 1.7637 14.6941
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.4019595 0.0048660 493.619 < 2e-16
hourswork 0.0037864 0.0000828 45.731 < 2e-16
BMI 0.0050568 0.0012308 4.109 3.98e-05
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 449938 on 64479 degrees of freedom
Residual deviance: 447817 on 64477 degrees of freedom
AIC: 711771
Number of Fisher Scoring iterations: 5
Next step: Use 'setx' method
In this poisson model, we added BMI. BMI is measured in 6 categories being underweight (1), normal (2), overweight(3) Obese30s (4), Obese40s(5) and Obese50s (6) numbered from 1- 6, respectively. In respect to BMI levels, Obese30s means one with BMI of 30 until 39, obese40s means one with a BMI of 40 to 49, and so on.
NHIS$BMI <- as.integer(NHIS$BMI)
NHIS$hourswork <- as.integer(NHIS$hourswork)
bmi.range = min(NHIS$BMI):max(NHIS$BMI)
x <- setx(M1, BMI = bmi.range)
s <- sim(M1, x = x)
ci.plot(s)
Here we see the range of BMI levels in respective to our model. The plot summarizes the BMI effect on cigarettes smoked. Those who have a higher BMIs tend to smoke more number of cigarettes than those with a Normal or Underweight BMIs with respect to hours worked. We see that there is a postive relatiosnhip between numbers of cigarettes smoked on average per day as well as hours worked with BMI. The higher BMIs have a higher probablity of more number of cigarettes smoked per day than lower BMIs such as underweight.
h.range = min(NHIS$hourswork):max(NHIS$hourswork)
x <- setx(M1, hourswork = h.range)
s1 <- sim(M1, x = x)
plot(s1)
I wanted to see the relationship between the hours worked, our independent variable, and cigarettes smoked. The plot above shows us the postive relationship between number of cigarettes smoked per day and hours worked. The higher number of hours worked per week, the higher number of cigs smoked per day. For example, those who work on average 20 or less hours a week, smoke 12 or less cigarettes per day whereas those working 80 or more hours smoke upto atleast 60 cigarettes per day. My hypothesis of the effect of longer working hours and its effect on current smokers is correct.
M2 <- zelig(cigsday ~ hourswork + sex , model = "poisson", data = NHIS, cite = F)
summary(M2)
Model:
Call:
z5$zelig(formula = cigsday ~ hourswork + sex, data = NHIS)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.9856 -2.4191 -0.5528 1.5008 14.9962
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.702e+00 5.203e-03 519.31 <2e-16
hourswork 2.647e-03 8.411e-05 31.47 <2e-16
sex -1.678e-01 2.256e-03 -74.39 <2e-16
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 449938 on 64479 degrees of freedom
Residual deviance: 442252 on 64477 degrees of freedom
AIC: 706206
Number of Fisher Scoring iterations: 5
Next step: Use 'setx' method
x.male<- setx (M2, sex = 1)
x.female <- setx(M2, sex = 2)
s.sex<- sim(M2, x = x.male, x1=x.female)
summary(s.sex)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 14.02712 0.01996998 14.02769 13.98801 14.06491
pv
mean sd 50% 2.5% 97.5%
[1,] 14.221 3.670372 14 7 22
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 11.86028 0.0204066 11.86095 11.81835 11.89926
pv
mean sd 50% 2.5% 97.5%
[1,] 11.7 3.495457 11 6 19
fd
mean sd 50% 2.5% 97.5%
[1,] -2.166834 0.02855572 -2.166113 -2.222776 -2.111423
The results above show us the difference of average cigarettes smoked between men and females. Men on average smoke 14 cigarettes per day whereas woman smoke about 12. The difference between the average number of cigarettes smoked per day between females and males is about 2 cigarettes
M3 <- zelig(cigsday ~ hourswork + race , model = "poisson", data = NHIS, cite = F)
summary(M3)
Model:
Call:
z5$zelig(formula = cigsday ~ hourswork + race, data = NHIS)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.0761 -2.4160 -0.6648 1.5639 16.0476
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.486e+00 3.599e-03 690.53 <2e-16
hourswork 3.471e-03 8.268e-05 41.99 <2e-16
raceOther -3.312e-01 4.940e-03 -67.05 <2e-16
raceWhite -3.230e-01 3.662e-03 -88.20 <2e-16
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 449938 on 64479 degrees of freedom
Residual deviance: 435521 on 64476 degrees of freedom
AIC: 699477
Number of Fisher Scoring iterations: 5
Next step: Use 'setx' method
NHIS$race<- as.factor(NHIS$race)
x <- setx(M3, race = "White")
x1 <- setx(M3, race = "Black")
s.race<- sim(M3, x = x, x1=x1)
summary(s.race)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 9.999402 0.03392409 10.00072 9.931534 10.06474
pv
mean sd 50% 2.5% 97.5%
[1,] 9.92 3.049081 10 4 16
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 13.81051 0.01593066 13.80991 13.78028 13.84041
pv
mean sd 50% 2.5% 97.5%
[1,] 13.749 3.957482 13 7 22
fd
mean sd 50% 2.5% 97.5%
[1,] 3.811112 0.03770742 3.809674 3.736018 3.889361
The difference between blacks and White smokers is about 4 cigarettes This is interesting, as I wondered how race would have a effect on number of cigarettes smoked on average. The mean number of cigarettes smoked by whites is 10 per day whereas for Blacks it is about 14 cigs.
xh <- setx(M2, hourswork = quantile(NHIS$hourswork, .25))
xh1 <- setx(M2, hourswork = quantile(NHIS$hourswork, .25) + sd(NHIS$hourswork))
sl <- sim(M2, x = xh, x1 = xh1)
fhl <- sl$get_qi(xvalue="x1", qi="fd")
xhm <- setx(M2, hourswork = quantile(NHIS$hourswork, .50))
xhm1 <- setx(M2, hourswork = quantile(NHIS$hourswork, .50) + sd(NHIS$hourswork))
sm <- sim(M2, x = xhm, x1 = xhm1)
fhm <- sm$get_qi(xvalue="x1", qi="fd")
xhx <- setx(M2, hourswork = quantile(NHIS$hourswork, .75))
xh1x <- setx(M2, hourswork = quantile(NHIS$hourswork, .75) + sd(NHIS$hourswork))
sh <- sim(M2, x = xhx, x1 = xh1x)
fhh <- sh$get_qi(xvalue="x1", qi="fd")
The difference of hours worked between quartiles and each quartile plus one standard deviation is ran in the syntax above. Lets see the number of hours worked in each quartile as well as 1 standard deviation above per number of cigarettes.
h <- as.data.frame(cbind(fhl, fhm, fhh))
head(h)
The table gives us the average number of hours worked per cigarettes smoked in each quartile. For example, in the 25th quartile and 25th quartile plus 1 Standard deviation, the average difference in number of hours worked is .44 for those who smoke 1 cigarettes per day .For the 50th quartile it is about .46 for those who smoke 1 cigarettes. Interestingly, in the 75th quartile there is a difference of .45 in average number of hours worked.Lets see the difference between the queartiles and 1 standard deviation above for higher number of cigarettes smoked.
tail(h)
tidd <- h %>%
gather(quantile, dif, 1:3)
head(tidd)
tidd %>%
group_by(quantile) %>%
summarise(mean = mean(dif), sd = sd(dif))
The table above gives us the average mean and standard deviation of each quartile for hours worked per week. Lets plot it and see the difference.
ggplot(tidd, aes(dif)) + geom_histogram() + facet_grid(~ quantile)
On average, there is a higher difference in number of hours worked in the 50th quartile compared to 25th and 75th quartile. Additionally we see that the difference between each quartile and quartile plus 1 standard deviation is more spread out across the 75th percentile. This may be due to the extreme number of hours worked per week in the data set.
M4 <- zelig(cigsday ~ hourswork + region , model = "poisson", data = NHIS, cite = F)
summary(M4)
Model:
Call:
z5$zelig(formula = cigsday ~ hourswork + region, data = NHIS)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.1936 -2.4719 -0.7271 1.6529 14.3735
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.564e+00 4.521e-03 567.09 <2e-16
hourswork 3.858e-03 8.269e-05 46.66 <2e-16
region -5.822e-02 1.109e-03 -52.49 <2e-16
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 449938 on 64479 degrees of freedom
Residual deviance: 445086 on 64477 degrees of freedom
AIC: 709040
Number of Fisher Scoring iterations: 5
Next step: Use 'setx' method
r.range = min(NHIS$region):max(NHIS$region)
M4$setrange(region = r.range)
M4$sim()
ci.plot(M4)
In conclusion, I wanted to see the difference in average number of cigarettes smoked across the four regions in the United States in regard to hours worked per week. We see that in the Northeast, the average number of cigarettes smoked are on average above 14, for the Midwest/ North Central, it is about 13.In the South region, the average number of cigarettes smoked are approximately 12.5 and 12 for those who are from the West. This is interesting and there must be some research done on this to better understand the difference of smoker behaviour between regions in the United States.