The dataset used for this investigation is the National Health Interview Survey (NHIS) from 1997-2016. The analysis is between Average number of cigarettes smoker per day, the dependent variable and total hours worked in the past week, being the independant variable. For this analysis, I will study the releationship between # of cigarettes smoked and hours worked along with an predicting independent varaible, sex. In my orginal investigation, I had run poisson models but for the sake of this analysis we will run ls models. My hypothesis was that those who work longder hours, smoke more number of cigarettes then those with lower hours worked in the past week. Additionally, I think male smoker wills tend to smoke more than females despite working hours.For this investigation, we will be models using the Zelig and ZeligChoice package. In earlier model, I had filtered out the NA(missing data) while in this analysis I will be tackling missing data with multiple imputations using the Amelia Package.
library(readr)
library(dplyr)
library(texreg)
library(ggplot2)
library(tidyr)
library(Amelia)
library(Zelig)
library(ZeligChoice)
load("/Users/Deepakie/Documents/Queens College/SOC712/Data/NHIS_v3.rdata")
#head(NHIS_v3)
NHIS1 <- NHIS_v3%>%
select(sex,bmi, hourswrk,cigsday)%>%
mutate (sex = ifelse(sex==1, "1","2"),
hourswork = ifelse(hourswrk==0, NA,
ifelse(hourswrk>=97, NA, hourswrk)),
cigsday = ifelse(cigsday >= 91, NA, cigsday))%>%
select(-hourswrk)
head(NHIS1)
NHIS1$sex <- as.integer(NHIS1$sex)
NHIS1$hourswork <- as.integer(NHIS1$hourswork)
summary(lm(cigsday ~ hourswork + sex, data = NHIS1, na.action = na.omit))
Call:
lm(formula = cigsday ~ hourswork + sex, data = NHIS1, na.action = na.omit)
Residuals:
Min 1Q Median 3Q Max
-15.372 -7.682 -2.182 5.589 79.208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.213010 0.171604 88.65 <2e-16 ***
hourswork 0.035653 0.002771 12.87 <2e-16 ***
sex -2.228564 0.072773 -30.62 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.861 on 76593 degrees of freedom
(542941 observations deleted due to missingness)
Multiple R-squared: 0.0166, Adjusted R-squared: 0.01658
F-statistic: 646.6 on 2 and 76593 DF, p-value: < 2.2e-16
54,294 cases were deleted due to missingness. These observations have valuable information about the relationships between the number of cigs smoke daily and hours worked. Multiple imputation using Amelia help retrieve the missing values to make inferences.
summary(NHIS1)
sex bmi cigsday hourswork
Min. :1.00 Min. : 6.60 Min. : 1.0 Min. : 1.00
1st Qu.:1.00 1st Qu.:23.15 1st Qu.: 5.0 1st Qu.:36.00
Median :2.00 Median :26.35 Median :10.0 Median :40.00
Mean :1.56 Mean :27.31 Mean :13.8 Mean :39.89
3rd Qu.:2.00 3rd Qu.:30.15 3rd Qu.:20.0 3rd Qu.:45.00
Max. :2.00 Max. :99.80 Max. :90.0 Max. :95.00
NA's :24428 NA's :496052 NA's :252283
a.imp <- amelia(x = NHIS1, m = 30)
-- Imputation 1 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 2 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 3 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 4 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 5 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 6 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 7 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 8 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 9 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 10 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 11 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 12 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 13 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 14 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 15 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 16 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 17 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 18 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 19 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 20 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 21 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 22 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 23 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 24 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23
-- Imputation 25 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 26 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 27 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23
-- Imputation 28 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 29 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
-- Imputation 30 --
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24
Implementing multiple imputation requires a statistical model from which to compute the m imputations for each missing value in this data set. Here I run 30 imputation.
a.imp
Amelia output with 30 imputed datasets.
Return code: 1
Message: Normal EM convergence.
Chain Lengths:
--------------
Imputation 1: 24
Imputation 2: 24
Imputation 3: 24
Imputation 4: 24
Imputation 5: 24
Imputation 6: 24
Imputation 7: 24
Imputation 8: 24
Imputation 9: 24
Imputation 10: 24
Imputation 11: 24
Imputation 12: 24
Imputation 13: 24
Imputation 14: 24
Imputation 15: 24
Imputation 16: 24
Imputation 17: 24
Imputation 18: 24
Imputation 19: 24
Imputation 20: 24
Imputation 21: 24
Imputation 22: 24
Imputation 23: 24
Imputation 24: 23
Imputation 25: 24
Imputation 26: 24
Imputation 27: 23
Imputation 28: 24
Imputation 29: 24
Imputation 30: 24
names(a.imp)
[1] "imputations" "m" "missMatrix" "overvalues" "theta" "mu" "covMatrices" "code"
[9] "message" "iterHist" "arguments" "orig.vars"
View(a.imp$imputations$imp1)
View(a.imp$imputations$imp2)
View(a.imp$imputations$imp3)
z.out <- zelig(cigsday ~ hourswork + sex, model="ls", data=a.imp, cite = FALSE)
|====== | 7% ~37 s remaining
|========= | 10% ~47 s remaining
|============ | 13% ~46 s remaining
|=============== | 17% ~44 s remaining
|================== | 20% ~43 s remaining
|===================== | 23% ~39 s remaining
|========================= | 27% ~36 s remaining
|============================ | 30% ~34 s remaining
|=============================== | 33% ~32 s remaining
|================================== | 37% ~32 s remaining
|===================================== | 40% ~32 s remaining
|======================================== | 43% ~31 s remaining
|=========================================== | 47% ~29 s remaining
|=============================================== | 50% ~27 s remaining
|================================================== | 53% ~36 s remaining
|===================================================== | 57% ~32 s remaining
|======================================================== | 60% ~29 s remaining
|=========================================================== | 63% ~26 s remaining
|============================================================== | 67% ~24 s remaining
|================================================================= | 70% ~21 s remaining
|==================================================================== | 73% ~18 s remaining
|======================================================================== | 77% ~16 s remaining
|=========================================================================== | 80% ~14 s remaining
|============================================================================== | 83% ~12 s remaining
|================================================================================= | 87% ~9 s remaining
|==================================================================================== | 90% ~7 s remaining
|======================================================================================= | 93% ~5 s remaining
|========================================================================================== | 97% ~2 s remaining
|==============================================================================================|100% ~0 s remaining
summary(z.out)
Model: Combined Imputations
Estimate Std.Error z value Pr(>|z|)
(Intercept) 15.32384 0.15005 102.1 <2e-16
hourswork 0.03727 0.00278 13.4 <2e-16
sex -2.02131 0.05527 -36.6 <2e-16
For results from individual imputed datasets, use summary(x, subset = i:j)
Next step: Use 'setx' method
summary(z.out, subset = 1)
Imputed Dataset 1
Call:
z5$zelig(formula = cigsday ~ hourswork + sex, data = a.imp)
Residuals:
Min 1Q Median 3Q Max
-48.264 -7.050 -0.330 6.601 78.608
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.40423 0.06395 240.88 <2e-16
hourswork 0.03476 0.00100 34.75 <2e-16
sex -2.02331 0.02645 -76.49 <2e-16
Residual standard error: 10.13 on 619534 degrees of freedom
Multiple R-squared: 0.01347, Adjusted R-squared: 0.01347
F-statistic: 4229 on 2 and 619534 DF, p-value: < 2.2e-16
Next step: Use 'setx' method
summary(z.out, subset = 2)
Imputed Dataset 2
Call:
z5$zelig(formula = cigsday ~ hourswork + sex, data = a.imp)
Residuals:
Min 1Q Median 3Q Max
-47.252 -7.076 -0.341 6.645 78.599
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.373892 0.064080 239.92 <2e-16
hourswork 0.035210 0.001002 35.13 <2e-16
sex -2.004235 0.026506 -75.61 <2e-16
Residual standard error: 10.15 on 619534 degrees of freedom
Multiple R-squared: 0.01329, Adjusted R-squared: 0.01329
F-statistic: 4172 on 2 and 619534 DF, p-value: < 2.2e-16
Next step: Use 'setx' method
z.out$setx()
z.out$sim()
|======= | 20% ~8 s remaining
|======== | 23% ~9 s remaining
|========= | 27% ~9 s remaining
|========== | 30% ~8 s remaining
|=========== | 33% ~8 s remaining
|============ | 37% ~8 s remaining
|============== | 40% ~8 s remaining
|=============== | 43% ~7 s remaining
|================ | 47% ~7 s remaining
|================= | 50% ~7 s remaining
|================== | 53% ~6 s remaining
|=================== | 57% ~6 s remaining
|===================== | 60% ~5 s remaining
|====================== | 63% ~5 s remaining
|======================= | 67% ~4 s remaining
|======================== | 70% ~4 s remaining
|========================= | 73% ~3 s remaining
|========================== | 77% ~3 s remaining
|============================ | 80% ~2 s remaining
|============================= | 83% ~2 s remaining
|============================== | 87% ~2 s remaining
|=============================== | 90% ~1 s remaining
|================================ | 93% ~1 s remaining
|================================= | 97% ~2 s remaining
|===================================|100% ~0 s remaining
plot(z.out)
After comparing the litewise method to multiple impitations, we can see the results vary. THe litewise deletion method shows that for every extra hour of working, cigs smoked per smoker would be 0.035653 more, and in imputation 2 from Amelia, it is 0.03670. So despite the difference being small, there is a slight difference.Examing the data through two different types of missing values analysis, there are certain differences. The Amelia package changed the coefficent a big higher than litewise deletion.