Age, education, marital status, employment status, and income level are the predictor variables for smoking.
The aim is to see how the relationship between smoking and each predictor changes between the original data and the data with missing values imputed.
The expectation is that the results from the original and imputed datasets will not differ substantively.
## # A tibble: 25,000 x 6
## smoke agec educ employ marst inc
## <dbl> <fct> <fct> <fct> <fct> <fct>
## 1 0 (59,79] 4colgrad employloyed married 5_50kplus
## 2 0 (59,79] 4colgrad employloyed divorced <NA>
## 3 0 (59,79] 3somecol retired married 3_25-35k
## 4 0 (39,59] 4colgrad employloyed married <NA>
## 5 0 (39,59] 4colgrad employloyed married 5_50kplus
## 6 0 (39,59] 4colgrad employloyed married 4_35-50k
## 7 1 (24,39] 3somecol employloyed nm 3_25-35k
## 8 0 (59,79] 1somehs unable widowed <NA>
## 9 0 (59,79] 4colgrad retired married 5_50kplus
## 10 0 (59,79] 2hsgrad retired married 4_35-50k
## # ... with 24,990 more rows
## smoke agec educ employ
## Min. :0.0000 (0,24] :1544 2hsgrad : 6046 employloyed:12929
## 1st Qu.:0.0000 (24,39]:4398 0Prim : 564 nilf : 3181
## Median :0.0000 (39,59]:7885 1somehs : 968 retired : 7095
## Mean :0.1363 (59,79]:9238 3somecol: 6861 unable : 1558
## 3rd Qu.:0.0000 (79,99]:1935 4colgrad:10435 NA's : 237
## Max. :1.0000 NA's : 126
## NA's :1025
## marst inc
## married :12600 1_lt15k : 1810
## cohab : 823 2_15-25k : 3123
## divorced : 3412 3_25-35k : 2051
## nm : 4526 4_35-50k : 2798
## separated: 557 5_50kplus:11016
## widowed : 2860 NA's : 4202
## NA's : 222
We can see that income has the most missing values, and education has the fewest among the variables with any missingness; age has no missing values at all.
The lowest percentage of missing values is in the educ variable, which has only 0.504% missing.
The marst variable has only 0.888% missing.
The employ variable has only 0.948% missing.
The smoke variable has 4.1% missing.
Among these recoded variables, inc, the income variable, has the most missing values: 4202 people in mydat, or 16.808% of the sample.
The agec variable has no missing values (0%).
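The percentages above follow directly from the missing counts in the summary output and the sample size of 25,000. A minimal sketch of that arithmetic in base R is below; the `toy` data frame is a hypothetical stand-in used only to show the equivalent calculation on a data frame.

```r
# Missing counts per variable, copied from the summary() output above
n_total   <- 25000
n_missing <- c(agec = 0, educ = 126, marst = 222,
               employ = 237, smoke = 1025, inc = 4202)

# Percentage of the sample missing on each variable
pct_missing <- 100 * n_missing / n_total
print(pct_missing)

# On a data frame, the same figures come from the column means of is.na().
# `toy` is a hypothetical example, not the survey data.
toy <- data.frame(x = c(1, NA, 3, 4), y = c("a", "b", NA, NA))
toy_pct <- 100 * colMeans(is.na(toy))
print(toy_pct)
```

A variable with a zero count simply yields 0%, which is how agec should be reported.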
## agec educ marst employ smoke inc
## 20018 1 1 1 1 1 1 0
## 3537 1 1 1 1 1 0 1
## 540 1 1 1 1 0 1 1
## 405 1 1 1 1 0 0 2
## 109 1 1 1 0 1 1 1
## 59 1 1 1 0 1 0 2
## 4 1 1 1 0 0 1 2
## 20 1 1 1 0 0 0 3
## 70 1 1 0 1 1 1 1
## 72 1 1 0 1 1 0 2
## 7 1 1 0 1 0 1 2
## 15 1 1 0 1 0 0 3
## 5 1 1 0 0 1 1 2
## 10 1 1 0 0 1 0 3
## 3 1 1 0 0 0 0 4
## 32 1 0 1 1 1 1 1
## 34 1 0 1 1 1 0 2
## 4 1 0 1 1 0 1 2
## 7 1 0 1 1 0 0 3
## 3 1 0 1 0 1 1 2
## 3 1 0 1 0 1 0 3
## 3 1 0 1 0 0 0 4
## 1 1 0 0 1 1 1 2
## 12 1 0 0 1 1 0 3
## 3 1 0 0 1 0 1 3
## 6 1 0 0 1 0 0 4
## 2 1 0 0 0 1 1 3
## 8 1 0 0 0 1 0 4
## 8 1 0 0 0 0 0 5
## 0 126 222 237 1025 4202 5812
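In the pattern table, 1 means the variable is observed and 0 means it is missing; the left-hand number is how many cases share that pattern, and the right-hand number is how many variables are missing in it.
The first row shows the number of complete observations (20,018).
The second row shows the number of people who are missing only the inc variable (3,537).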
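The table above has the layout produced by `mice::md.pattern()`. To make the reading of its rows concrete, here is a small base R sketch that tabulates missingness patterns by hand. The `toy` data frame and its values are hypothetical, and this is an illustration of the idea rather than the function's actual implementation.

```r
# Hypothetical toy data; in the pattern strings 1 = observed, 0 = missing
toy <- data.frame(
  agec  = c("a", "b", "a", "b", "a", "b"),
  smoke = c(0, 1, NA, 0, NA, 1),
  inc   = c("lo", NA, NA, "hi", "lo", NA)
)

observed <- !is.na(toy)  # TRUE where a value is present
pattern  <- apply(observed, 1, function(r) paste(as.integer(r), collapse = ""))

# Number of rows sharing each missingness pattern
pattern_counts <- table(pattern)
print(pattern_counts)

# Rows with every variable observed (the first row of a pattern table)
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Here the pattern `111` corresponds to complete cases and `110` to cases missing only the last variable, mirroring the first two rows discussed above.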
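To see how missingness co-occurs across pairs of variables, we count, for each pair, the cases where both are observed (rr), the row variable is observed and the column variable is missing (rm), the row variable is missing and the column variable is observed (mr), and both are missing (mm):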
## $rr
## smoke agec educ employ marst inc
## smoke 23975 23975 23880 23776 23795 20240
## agec 23975 25000 24874 24763 24778 20798
## educ 23880 24874 24874 24664 24692 20753
## employ 23776 24763 24664 24763 24577 20675
## marst 23795 24778 24692 24577 24778 20710
## inc 20240 20798 20753 20675 20710 20798
##
## $rm
## smoke agec educ employ marst inc
## smoke 0 0 95 199 180 3735
## agec 1025 0 126 237 222 4202
## educ 994 0 0 210 182 4121
## employ 987 0 99 0 186 4088
## marst 983 0 86 201 0 4068
## inc 558 0 45 123 88 0
##
## $mr
## smoke agec educ employ marst inc
## smoke 0 1025 994 987 983 558
## agec 0 0 0 0 0 0
## educ 95 126 0 99 86 45
## employ 199 237 210 0 201 123
## marst 180 222 182 186 0 88
## inc 3735 4202 4121 4088 4068 0
##
## $mm
## smoke agec educ employ marst inc
## smoke 1025 0 31 38 42 467
## agec 0 0 0 0 0 0
## educ 31 0 126 27 40 81
## employ 38 0 27 237 36 114
## marst 42 0 40 36 222 134
## inc 467 0 81 114 134 4202
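The four matrices above can be reproduced from a response indicator matrix with matrix products, and for any pair the four counts must add up to the sample size. The sketch below shows this on a hypothetical `toy` data frame and then checks the smoke/inc pair using the counts printed above.

```r
# Hypothetical toy data with two partially observed variables
toy <- data.frame(
  smoke = c(0, 1, NA, 0, NA, 1),
  inc   = c(3, NA, NA, 5, 1, NA)
)

r <- !is.na(toy)  # response indicator: TRUE = observed

pair_counts <- list(
  rr = t(r)  %*% r,     # both observed
  rm = t(r)  %*% (!r),  # row observed, column missing
  mr = t(!r) %*% r,     # row missing, column observed
  mm = t(!r) %*% (!r)   # both missing
)
print(pair_counts)

# For every pair the four counts sum to the number of rows
total <- pair_counts$rr + pair_counts$rm + pair_counts$mr + pair_counts$mm
print(total)

# Same check with the smoke/inc counts from the output above
src <- c(rr = 20240, rm = 3735, mr = 558, mm = 467)
print(sum(src))
```

The smoke/inc entries from the output sum to 25,000, so the four tables are consistent with the sample size.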
## # A tibble: 6 x 2
## smokedat smoke
## <dbl> <dbl>
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
##
## iter imp variable
## 1 1 smokedat inc educ marst employ
## 1 2 smokedat inc educ marst employ
## 1 3 smokedat inc educ marst employ
## 1 4 smokedat inc educ marst employ
## 1 5 smokedat inc educ marst employ
## 2 1 smokedat inc educ marst employ
## 2 2 smokedat inc educ marst employ
## 2 3 smokedat inc educ marst employ
## 2 4 smokedat inc educ marst employ
## 2 5 smokedat inc educ marst employ
## 3 1 smokedat inc educ marst employ
## 3 2 smokedat inc educ marst employ
## 3 3 smokedat inc educ marst employ
## 3 4 smokedat inc educ marst employ
## 3 5 smokedat inc educ marst employ
## 4 1 smokedat inc educ marst employ
## 4 2 smokedat inc educ marst employ
## 4 3 smokedat inc educ marst employ
## 4 4 smokedat inc educ marst employ
## 4 5 smokedat inc educ marst employ
## 5 1 smokedat inc educ marst employ
## 5 2 smokedat inc educ marst employ
## 5 3 smokedat inc educ marst employ
## 5 4 smokedat inc educ marst employ
## 5 5 smokedat inc educ marst employ
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## smokedat inc agec educ marst employ
## "pmm" "polyreg" "" "polyreg" "polyreg" "polyreg"
## PredictorMatrix:
## smokedat inc agec educ marst employ
## smokedat 0 1 1 1 1 1
## inc 1 0 1 1 1 1
## agec 1 1 0 1 1 1
## educ 1 1 1 0 1 1
## marst 1 1 1 1 0 1
## employ 1 1 1 1 1 0
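The printed object above is a `mids` object with five imputations, `pmm` for the numeric smoking variable and `polyreg` for the multi-level factors. A minimal sketch of a call that yields an object of this kind is below. It assumes the mice package is installed; the `toy` data, the seed, and the object names are hypothetical and are not the author's actual code.

```r
library(mice)  # assumes the mice package is installed

set.seed(1)
n <- 300
# Hypothetical toy data standing in for the survey extract
toy <- data.frame(
  smokedat = rbinom(n, 1, 0.15),
  inc      = factor(sample(c("lo", "mid", "hi"), n, replace = TRUE)),
  agec     = factor(sample(c("young", "old"), n, replace = TRUE))
)
toy$smokedat[sample(n, 15)] <- NA  # introduce missing values
toy$inc[sample(n, 50)]      <- NA

# Five imputations, five iterations each; the seed value is an arbitrary choice
imp <- mice(toy, m = 5, maxit = 5, seed = 22, printFlag = FALSE)
print(imp$method)  # default method chosen per variable

# Imputed values for the numeric variable: one column per imputation
imputed_smoke <- unlist(imp$imp$smokedat)
print(range(imputed_smoke))

# Extract the first completed dataset
toy_imp <- complete(imp, 1)
print(colSums(is.na(toy_imp)))
```

Because predictive mean matching draws imputations from observed donor values, the imputed smoking values stay within the observed 0 to 1 range.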
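We check that the imputed values are plausible, i.e., that no imputed smoke values fall outside the range of the observed data (0 to 1). The summaries below show the imputed values for each of the five imputations, followed by the original variable: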
## 1 2 3 4
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1372 Mean :0.1385 Mean :0.1292 Mean :0.1438
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## 5
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1365
## 3rd Qu.:0.0000
## Max. :1.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1363 0.0000 1.0000 1025
## 1 2 3 4
## 1_lt15k : 516 1_lt15k : 504 1_lt15k : 489 1_lt15k : 499
## 2_15-25k : 811 2_15-25k : 735 2_15-25k : 820 2_15-25k : 768
## 3_25-35k : 456 3_25-35k : 457 3_25-35k : 441 3_25-35k : 478
## 4_35-50k : 556 4_35-50k : 605 4_35-50k : 562 4_35-50k : 603
## 5_50kplus:1863 5_50kplus:1901 5_50kplus:1890 5_50kplus:1854
## 5
## 1_lt15k : 486
## 2_15-25k : 802
## 3_25-35k : 457
## 4_35-50k : 597
## 5_50kplus:1860
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus NA's
## 1810 3123 2051 2798 11016 4202
The number of imputed cases in each income category varies slightly across the five imputations.
We see the distribution of the original data (blue dots) and the imputed data (red dots) across the levels of employment for each of the five imputation runs (the number at the top shows which run; the first plot is the original data).
Because smoke is a binary variable, the points do not spread out; they take only the values 0 and 1. All five imputations give similar results.
We will use the first imputed dataset.
## smokedat inc agec educ marst employ
## 1 0 5_50kplus (59,79] 4colgrad married employloyed
## 2 0 5_50kplus (59,79] 4colgrad divorced employloyed
## 3 0 3_25-35k (59,79] 3somecol married retired
## 4 0 5_50kplus (39,59] 4colgrad married employloyed
## 5 0 5_50kplus (39,59] 4colgrad married employloyed
## 6 0 4_35-50k (39,59] 4colgrad married employloyed
## 7 1 3_25-35k (24,39] 3somecol nm employloyed
## 8 0 1_lt15k (59,79] 1somehs widowed unable
## 9 0 5_50kplus (59,79] 4colgrad married retired
## 10 0 4_35-50k (59,79] 2hsgrad married retired
## # A tibble: 10 x 3
## smoke inc agec
## <dbl> <fct> <fct>
## 1 0 5_50kplus (59,79]
## 2 0 <NA> (59,79]
## 3 0 3_25-35k (59,79]
## 4 0 <NA> (39,59]
## 5 0 5_50kplus (39,59]
## 6 0 4_35-50k (39,59]
## 7 1 3_25-35k (24,39]
## 8 0 <NA> (59,79]
## 9 0 5_50kplus (59,79]
## 10 0 4_35-50k (59,79]
While the first few cases don’t show much missingness, we can coax some more interesting cases out and compare the original data to the imputed:
## smokedat inc agec educ marst employ
## 16 0 5_50kplus (79,99] 4colgrad married retired
## 32 1 3_25-35k (39,59] 0Prim married employloyed
## 50 0 5_50kplus (24,39] 4colgrad married employloyed
## 128 0 2_15-25k (59,79] 1somehs nm retired
## 134 0 5_50kplus (39,59] 4colgrad married employloyed
## 140 0 4_35-50k (39,59] 3somecol married employloyed
## 147 1 5_50kplus (59,79] 2hsgrad married retired
## 176 0 2_15-25k (0,24] 3somecol nm employloyed
## 202 0 2_15-25k (24,39] 2hsgrad nm employloyed
## 210 0 5_50kplus (39,59] 4colgrad married employloyed
## # A tibble: 10 x 6
## smoke inc agec educ marst employ
## <dbl> <fct> <fct> <fct> <fct> <fct>
## 1 NA 5_50kplus (79,99] 4colgrad married retired
## 2 NA 3_25-35k (39,59] 0Prim married employloyed
## 3 NA <NA> (24,39] 4colgrad married employloyed
## 4 NA <NA> (59,79] 1somehs nm retired
## 5 NA 5_50kplus (39,59] 4colgrad married employloyed
## 6 NA 4_35-50k (39,59] 3somecol married employloyed
## 7 NA <NA> (59,79] 2hsgrad married retired
## 8 NA 2_15-25k (0,24] 3somecol nm employloyed
## 9 NA 2_15-25k (24,39] 2hsgrad nm employloyed
## 10 NA <NA> (39,59] 4colgrad married employloyed
Here I fit a linear model for smoke on the imputed data:
##
## Call:
## lm(formula = smokedat ~ inc + agec + educ + marst + employ)
##
## Coefficients:
## (Intercept) inc2_15-25k inc3_25-35k inc4_35-50k inc5_50kplus
## 0.15924 -0.02278 -0.03914 -0.05903 -0.07342
## agec(24,39] agec(39,59] agec(59,79] agec(79,99] educ0Prim
## 0.10149 0.08908 0.03658 -0.04043 -0.07174
## educ1somehs educ3somecol educ4colgrad marstcohab marstdivorced
## 0.06802 -0.03097 -0.11275 0.07628 0.09007
## marstnm marstseparated marstwidowed employnilf employretired
## 0.04941 0.11640 0.01202 -0.01459 -0.01278
## employunable
## 0.09947
## [1] 0.3430501
## inc
## 1_lt15k 2_15-25k 3_25-35k 4_35-50k 5_50kplus
## 0.09304 0.15736 0.10028 0.13416 0.51516
## agec
## (0,24] (24,39] (39,59] (59,79] (79,99]
## 0.06176 0.17592 0.31540 0.36952 0.07740
## educ
## 2hsgrad 0Prim 1somehs 3somecol 4colgrad
## 0.24332 0.02268 0.03884 0.27548 0.41968
The model fit on the original data with incomplete cases removed (20,018 complete cases), followed by the same model fit on the first imputed dataset:
##
## Call:
## lm(formula = smoke ~ inc + agec + educ + marst + employ, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57850 -0.16698 -0.09455 -0.01782 1.08567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.155034 0.015900 9.751 < 2e-16 ***
## inc2_15-25k -0.020557 0.010263 -2.003 0.04519 *
## inc3_25-35k -0.032679 0.011426 -2.860 0.00424 **
## inc4_35-50k -0.055896 0.011009 -5.077 3.86e-07 ***
## inc5_50kplus -0.070855 0.010233 -6.924 4.53e-12 ***
## agec(24,39] 0.096817 0.012051 8.034 9.95e-16 ***
## agec(39,59] 0.090487 0.012161 7.441 1.04e-13 ***
## agec(59,79] 0.041096 0.012879 3.191 0.00142 **
## agec(79,99] -0.039999 0.016307 -2.453 0.01418 *
## educ0Prim -0.073794 0.017673 -4.176 2.98e-05 ***
## educ1somehs 0.101414 0.013600 7.457 9.22e-14 ***
## educ3somecol -0.030730 0.006645 -4.625 3.77e-06 ***
## educ4colgrad -0.110984 0.006471 -17.152 < 2e-16 ***
## marstcohab 0.073370 0.013317 5.509 3.64e-08 ***
## marstdivorced 0.086508 0.007371 11.737 < 2e-16 ***
## marstnm 0.054139 0.007479 7.239 4.69e-13 ***
## marstseparated 0.128181 0.016170 7.927 2.36e-15 ***
## marstwidowed 0.022397 0.008915 2.512 0.01200 *
## employnilf -0.008257 0.007846 -1.052 0.29263
## employretired -0.018869 0.007413 -2.545 0.01092 *
## employunable 0.103386 0.010937 9.453 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3301 on 19997 degrees of freedom
## Multiple R-squared: 0.08606, Adjusted R-squared: 0.08515
## F-statistic: 94.15 on 20 and 19997 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = smokedat ~ inc + agec + educ + marst + employ, data = dat.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54463 -0.16543 -0.09143 -0.01127 1.08013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.159244 0.013731 11.597 < 2e-16 ***
## inc2_15-25k -0.022777 0.008824 -2.581 0.00985 **
## inc3_25-35k -0.039137 0.009931 -3.941 8.14e-05 ***
## inc4_35-50k -0.059028 0.009586 -6.158 7.49e-10 ***
## inc5_50kplus -0.073417 0.008893 -8.256 < 2e-16 ***
## agec(24,39] 0.101493 0.010348 9.808 < 2e-16 ***
## agec(39,59] 0.089079 0.010456 8.520 < 2e-16 ***
## agec(59,79] 0.036581 0.011077 3.302 0.00096 ***
## agec(79,99] -0.040427 0.013842 -2.921 0.00350 **
## educ0Prim -0.071745 0.014652 -4.896 9.82e-07 ***
## educ1somehs 0.068015 0.011510 5.909 3.48e-09 ***
## educ3somecol -0.030972 0.005842 -5.302 1.16e-07 ***
## educ4colgrad -0.112747 0.005719 -19.714 < 2e-16 ***
## marstcohab 0.076284 0.012103 6.303 2.97e-10 ***
## marstdivorced 0.090072 0.006644 13.556 < 2e-16 ***
## marstnm 0.049407 0.006696 7.379 1.65e-13 ***
## marstseparated 0.116403 0.014481 8.038 9.52e-16 ***
## marstwidowed 0.012021 0.007757 1.550 0.12124
## employnilf -0.014587 0.006829 -2.136 0.03270 *
## employretired -0.012780 0.006551 -1.951 0.05108 .
## employunable 0.099474 0.009620 10.341 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3283 on 24979 degrees of freedom
## Multiple R-squared: 0.08509, Adjusted R-squared: 0.08436
## F-statistic: 116.2 on 20 and 24979 DF, p-value: < 2.2e-16
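In the imputed dataset, the significance of the income groups increased compared with the original data, which is consistent with the smaller standard errors from the larger sample. employnilf was not significant in the original data (p = 0.293) but became significant in the imputed data (p = 0.033). Conversely, the association between being widowed and smoking became non-significant (p = 0.012 in the original data, p = 0.121 in the imputed data), and employretired moved from p = 0.011 to p = 0.051.
Overall, the coefficient estimates keep the same sign and similar magnitude in both datasets, so the substantive conclusions do not change.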
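To put the comparison on a common footing, the sketch below lines up a few coefficients from the two model summaries above and expresses each change in units of the original standard error. The selection of terms is mine, and the ratio is only a rough descriptive scale, not a formal test.

```r
# Estimates and standard errors copied from the two summaries above
coefs <- data.frame(
  term     = c("inc5_50kplus", "educ1somehs", "marstwidowed",
               "employnilf", "employretired"),
  original = c(-0.070855, 0.101414, 0.022397, -0.008257, -0.018869),
  imputed  = c(-0.073417, 0.068015, 0.012021, -0.014587, -0.012780),
  se_orig  = c(0.010233, 0.013600, 0.008915, 0.007846, 0.007413)
)

# Change in the estimate, in raw units and relative to the original SE
coefs$diff       <- coefs$imputed - coefs$original
coefs$diff_in_se <- coefs$diff / coefs$se_orig

# Does the direction of the association agree across datasets?
coefs$same_sign <- sign(coefs$original) == sign(coefs$imputed)

print(coefs)

# Term with the largest absolute change
largest_change <- coefs$term[which.max(abs(coefs$diff))]
print(largest_change)
```

Among these terms, every coefficient keeps its sign, and the largest shift is in educ1somehs.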