Selection of variables

Age, education, marital status, employment status and income level are the predictor variables for smoking.

The aim is to see how the relationship between smoking and each variable changes between the original data and the data with missing values imputed.

The expectation is to find no substantial changes between the results from the original and imputed datasets.

Recoding variables

## # A tibble: 25,000 x 6
##    smoke agec    educ     employ      marst    inc      
##    <dbl> <fct>   <fct>    <fct>       <fct>    <fct>    
##  1     0 (59,79] 4colgrad employloyed married  5_50kplus
##  2     0 (59,79] 4colgrad employloyed divorced <NA>     
##  3     0 (59,79] 3somecol retired     married  3_25-35k 
##  4     0 (39,59] 4colgrad employloyed married  <NA>     
##  5     0 (39,59] 4colgrad employloyed married  5_50kplus
##  6     0 (39,59] 4colgrad employloyed married  4_35-50k 
##  7     1 (24,39] 3somecol employloyed nm       3_25-35k 
##  8     0 (59,79] 1somehs  unable      widowed  <NA>     
##  9     0 (59,79] 4colgrad retired     married  5_50kplus
## 10     0 (59,79] 2hsgrad  retired     married  4_35-50k 
## # ... with 24,990 more rows
##      smoke             agec            educ               employ     
##  Min.   :0.0000   (0,24] :1544   2hsgrad : 6046   employloyed:12929  
##  1st Qu.:0.0000   (24,39]:4398   0Prim   :  564   nilf       : 3181  
##  Median :0.0000   (39,59]:7885   1somehs :  968   retired    : 7095  
##  Mean   :0.1363   (59,79]:9238   3somecol: 6861   unable     : 1558  
##  3rd Qu.:0.0000   (79,99]:1935   4colgrad:10435   NA's       :  237  
##  Max.   :1.0000                  NA's    :  126                      
##  NA's   :1025                                                        
##        marst              inc       
##  married  :12600   1_lt15k  : 1810  
##  cohab    :  823   2_15-25k : 3123  
##  divorced : 3412   3_25-35k : 2051  
##  nm       : 4526   4_35-50k : 2798  
##  separated:  557   5_50kplus:11016  
##  widowed  : 2860   NA's     : 4202  
##  NA's     :  222
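
The recoding code itself is not shown here. As a minimal sketch, assuming a hypothetical numeric `age` vector (the real source column name is not given), the `agec` intervals and the `educ` level order in the output are consistent with base R's `cut()` and `relevel()`:

```r
# Hypothetical ages; the real source column is not shown in this document.
age <- c(18, 24, 25, 40, 59, 60, 85)

# cut() builds right-closed intervals by default, giving labels such as "(59,79]".
agec <- cut(age, breaks = c(0, 24, 39, 59, 79, 99))
table(agec)

# Made-up education codes; relevel() moves the chosen reference level to the front,
# which matches the order seen in the summary (2hsgrad listed first).
educ <- factor(c("0Prim", "1somehs", "2hsgrad", "3somecol", "4colgrad"))
educ <- relevel(educ, ref = "2hsgrad")
levels(educ)
```

The break points are read off the interval labels in the output; whether the author used exactly this call is an assumption.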

Income has the most missing values (4202). Among the variables with any missing values, education has the fewest (126); age has none.

The educ variable has 0.504% missing.

The marst variable has 0.888% missing.

The employ variable has 0.948% missing.

The smoke variable has 4.1% missing.

The inc variable, income, has the most: 4202 people in mydat, or 16.808% of the sample.

The agec variable has no missing values (0%).
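
One way to get these percentages for every column at once is sketched below in base R. The `toy` data frame is made up for illustration and only borrows the variable names; this approach returns 0 for a fully observed column.

```r
# Toy data frame standing in for the recoded data (values are made up).
toy <- data.frame(
  smoke = c(0, 1, NA, 0),
  agec  = factor(c("(24,39]", "(39,59]", "(59,79]", "(59,79]")),
  inc   = factor(c("1_lt15k", NA, NA, "5_50kplus"))
)

# is.na() gives a logical matrix; the column means are the missing shares.
pct_missing <- colMeans(is.na(toy)) * 100
pct_missing
```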

Multiple Imputation

##       agec educ marst employ smoke  inc     
## 20018    1    1     1      1     1    1    0
## 3537     1    1     1      1     1    0    1
## 540      1    1     1      1     0    1    1
## 405      1    1     1      1     0    0    2
## 109      1    1     1      0     1    1    1
## 59       1    1     1      0     1    0    2
## 4        1    1     1      0     0    1    2
## 20       1    1     1      0     0    0    3
## 70       1    1     0      1     1    1    1
## 72       1    1     0      1     1    0    2
## 7        1    1     0      1     0    1    2
## 15       1    1     0      1     0    0    3
## 5        1    1     0      0     1    1    2
## 10       1    1     0      0     1    0    3
## 3        1    1     0      0     0    0    4
## 32       1    0     1      1     1    1    1
## 34       1    0     1      1     1    0    2
## 4        1    0     1      1     0    1    2
## 7        1    0     1      1     0    0    3
## 3        1    0     1      0     1    1    2
## 3        1    0     1      0     1    0    3
## 3        1    0     1      0     0    0    4
## 1        1    0     0      1     1    1    2
## 12       1    0     0      1     1    0    3
## 3        1    0     0      1     0    1    3
## 6        1    0     0      1     0    0    4
## 2        1    0     0      0     1    1    3
## 8        1    0     0      0     1    0    4
## 8        1    0     0      0     0    0    5
##          0  126   222    237  1025 4202 5812

The first row shows the number of observations that are complete on every variable (20018).

The second row shows the number of people who are missing only the inc variable (3537).
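
The table above has the layout printed by `md.pattern()` in the mice package: one row per missingness pattern, with 1 for observed and 0 for missing. The two counts just described can also be obtained in base R, as in this sketch on made-up data:

```r
# Made-up data; only the variable names echo the document.
toy <- data.frame(
  smoke = c(0, 1, NA, 0, 1),
  educ  = c("a", "b", "b", NA, "a"),
  inc   = c(1, NA, 2, NA, 3)
)

miss <- is.na(toy)

# Rows with nothing missing (the first row of the pattern table).
n_complete <- sum(complete.cases(toy))

# Rows where inc is the only missing variable.
n_inc_only <- sum(miss[, "inc"] & rowSums(miss) == 1)

# One key per row, 1 = observed and 0 = missing, then count each pattern.
pattern_key <- apply(ifelse(miss, 0, 1), 1, paste, collapse = "")
table(pattern_key)
```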

To see how often pairs of variables are observed or missing together:

## $rr
##        smoke  agec  educ employ marst   inc
## smoke  23975 23975 23880  23776 23795 20240
## agec   23975 25000 24874  24763 24778 20798
## educ   23880 24874 24874  24664 24692 20753
## employ 23776 24763 24664  24763 24577 20675
## marst  23795 24778 24692  24577 24778 20710
## inc    20240 20798 20753  20675 20710 20798
## 
## $rm
##        smoke agec educ employ marst  inc
## smoke      0    0   95    199   180 3735
## agec    1025    0  126    237   222 4202
## educ     994    0    0    210   182 4121
## employ   987    0   99      0   186 4088
## marst    983    0   86    201     0 4068
## inc      558    0   45    123    88    0
## 
## $mr
##        smoke agec educ employ marst inc
## smoke      0 1025  994    987   983 558
## agec       0    0    0      0     0   0
## educ      95  126    0     99    86  45
## employ   199  237  210      0   201 123
## marst    180  222  182    186     0  88
## inc     3735 4202 4121   4088  4068   0
## 
## $mm
##        smoke agec educ employ marst  inc
## smoke   1025    0   31     38    42  467
## agec       0    0    0      0     0    0
## educ      31    0  126     27    40   81
## employ    38    0   27    237    36  114
## marst     42    0   40     36   222  134
## inc      467    0   81    114   134 4202
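
The four matrices above have the layout returned by `md.pairs()` in the mice package: `rr` counts cases where both variables are observed, `rm` where the row variable is observed and the column variable is missing, `mr` the reverse, and `mm` where both are missing. A minimal base R sketch of the same counts, on made-up data:

```r
# Made-up data with two variables.
toy <- data.frame(
  smoke = c(0, NA, 1, NA, 0),
  inc   = c(1, NA, NA, 3, 2)
)

r <- ifelse(is.na(toy), 0, 1)  # 1 where a value is observed
m <- 1 - r                     # 1 where a value is missing

pairs_toy <- list(
  rr = crossprod(r),     # both observed
  rm = crossprod(r, m),  # row observed, column missing
  mr = crossprod(m, r),  # row missing, column observed
  mm = crossprod(m)      # both missing
)
pairs_toy
```

Reading the real output the same way, `rm["smoke", "inc"]` of 3735 means 3735 people reported smoking status but not income.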

Basic imputation:

## # A tibble: 6 x 2
##   smokedat smoke
##      <dbl> <dbl>
## 1        0     0
## 2        0     0
## 3        0     0
## 4        0     0
## 5        0     0
## 6        0     0
## 
##  iter imp variable
##   1   1  smokedat  inc  educ  marst  employ
##   1   2  smokedat  inc  educ  marst  employ
##   1   3  smokedat  inc  educ  marst  employ
##   1   4  smokedat  inc  educ  marst  employ
##   1   5  smokedat  inc  educ  marst  employ
##   2   1  smokedat  inc  educ  marst  employ
##   2   2  smokedat  inc  educ  marst  employ
##   2   3  smokedat  inc  educ  marst  employ
##   2   4  smokedat  inc  educ  marst  employ
##   2   5  smokedat  inc  educ  marst  employ
##   3   1  smokedat  inc  educ  marst  employ
##   3   2  smokedat  inc  educ  marst  employ
##   3   3  smokedat  inc  educ  marst  employ
##   3   4  smokedat  inc  educ  marst  employ
##   3   5  smokedat  inc  educ  marst  employ
##   4   1  smokedat  inc  educ  marst  employ
##   4   2  smokedat  inc  educ  marst  employ
##   4   3  smokedat  inc  educ  marst  employ
##   4   4  smokedat  inc  educ  marst  employ
##   4   5  smokedat  inc  educ  marst  employ
##   5   1  smokedat  inc  educ  marst  employ
##   5   2  smokedat  inc  educ  marst  employ
##   5   3  smokedat  inc  educ  marst  employ
##   5   4  smokedat  inc  educ  marst  employ
##   5   5  smokedat  inc  educ  marst  employ
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##  smokedat       inc      agec      educ     marst    employ 
##     "pmm" "polyreg"        "" "polyreg" "polyreg" "polyreg" 
## PredictorMatrix:
##          smokedat inc agec educ marst employ
## smokedat        0   1    1    1     1      1
## inc             1   0    1    1     1      1
## agec            1   1    0    1     1      1
## educ            1   1    1    0     1      1
## marst           1   1    1    1     0      1
## employ          1   1    1    1     1      0
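
The iteration log and `mids` summary above are what `mice()` prints. The author's actual call is not shown, so the following is only a sketch on simulated data: the data, sample size, and seeds are made up, and `printFlag = FALSE` suppresses the iteration log. With default settings, a numeric variable gets `"pmm"`, an unordered factor with more than two levels gets `"polyreg"`, and a complete variable gets `""`, which agrees with the methods listed above.

```r
library(mice)

set.seed(1)
n <- 300

# Simulated stand-in for the analysis data (values are made up).
toy <- data.frame(
  smokedat = rbinom(n, 1, 0.15),
  agec     = factor(sample(c("young", "mid", "old"), n, replace = TRUE)),
  inc      = factor(sample(c("low", "mid", "high"), n, replace = TRUE))
)
toy$smokedat[sample(n, 15)] <- NA
toy$inc[sample(n, 50)] <- NA

# Five imputations with the default methods.
imp <- mice(toy, m = 5, seed = 22, printFlag = FALSE)
imp$method

# Imputed smokedat values: one row per originally missing case,
# one column per imputation. Their range should stay within 0 and 1.
range(unlist(imp$imp$smokedat))
```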

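The imputed values are plausible: no imputed smoke value falls outside the range of the observed data (0 to 1), and the mean of the imputed values in each run (0.129 to 0.144) is close to the observed mean (0.136).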

##        1                2                3                4         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1372   Mean   :0.1385   Mean   :0.1292   Mean   :0.1438  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##        5         
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1365  
##  3rd Qu.:0.0000  
##  Max.   :1.0000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1363  0.0000  1.0000    1025
##          1                2                3                4       
##  1_lt15k  : 516   1_lt15k  : 504   1_lt15k  : 489   1_lt15k  : 499  
##  2_15-25k : 811   2_15-25k : 735   2_15-25k : 820   2_15-25k : 768  
##  3_25-35k : 456   3_25-35k : 457   3_25-35k : 441   3_25-35k : 478  
##  4_35-50k : 556   4_35-50k : 605   4_35-50k : 562   4_35-50k : 603  
##  5_50kplus:1863   5_50kplus:1901   5_50kplus:1890   5_50kplus:1854  
##          5       
##  1_lt15k  : 486  
##  2_15-25k : 802  
##  3_25-35k : 457  
##  4_35-50k : 597  
##  5_50kplus:1860
##   1_lt15k  2_15-25k  3_25-35k  4_35-50k 5_50kplus      NA's 
##      1810      3123      2051      2798     11016      4202

The number of imputed cases in each income category varies slightly across the five imputations.

Plotting of smoke and education

The plot shows the distribution of the original data (blue dots) and the imputed data (red dots) across the levels of employment, for each of the five imputation runs (the number at the top shows which run, and the first plot is the original data).

Because smoke is a binary variable, the dots do not spread out; they sit only at 0 and 1. All five imputations give similar results.

We will use the first imputed dataset.

##    smokedat       inc    agec     educ    marst      employ
## 1         0 5_50kplus (59,79] 4colgrad  married employloyed
## 2         0 5_50kplus (59,79] 4colgrad divorced employloyed
## 3         0  3_25-35k (59,79] 3somecol  married     retired
## 4         0 5_50kplus (39,59] 4colgrad  married employloyed
## 5         0 5_50kplus (39,59] 4colgrad  married employloyed
## 6         0  4_35-50k (39,59] 4colgrad  married employloyed
## 7         1  3_25-35k (24,39] 3somecol       nm employloyed
## 8         0   1_lt15k (59,79]  1somehs  widowed      unable
## 9         0 5_50kplus (59,79] 4colgrad  married     retired
## 10        0  4_35-50k (59,79]  2hsgrad  married     retired
## # A tibble: 10 x 3
##    smoke inc       agec   
##    <dbl> <fct>     <fct>  
##  1     0 5_50kplus (59,79]
##  2     0 <NA>      (59,79]
##  3     0 3_25-35k  (59,79]
##  4     0 <NA>      (39,59]
##  5     0 5_50kplus (39,59]
##  6     0 4_35-50k  (39,59]
##  7     1 3_25-35k  (24,39]
##  8     0 <NA>      (59,79]
##  9     0 5_50kplus (59,79]
## 10     0 4_35-50k  (59,79]

While the first few cases don’t show much missingness, we can coax some more interesting cases out and compare the original data to the imputed:

##     smokedat       inc    agec     educ   marst      employ
## 16         0 5_50kplus (79,99] 4colgrad married     retired
## 32         1  3_25-35k (39,59]    0Prim married employloyed
## 50         0 5_50kplus (24,39] 4colgrad married employloyed
## 128        0  2_15-25k (59,79]  1somehs      nm     retired
## 134        0 5_50kplus (39,59] 4colgrad married employloyed
## 140        0  4_35-50k (39,59] 3somecol married employloyed
## 147        1 5_50kplus (59,79]  2hsgrad married     retired
## 176        0  2_15-25k  (0,24] 3somecol      nm employloyed
## 202        0  2_15-25k (24,39]  2hsgrad      nm employloyed
## 210        0 5_50kplus (39,59] 4colgrad married employloyed
## # A tibble: 10 x 6
##    smoke inc       agec    educ     marst   employ     
##    <dbl> <fct>     <fct>   <fct>    <fct>   <fct>      
##  1    NA 5_50kplus (79,99] 4colgrad married retired    
##  2    NA 3_25-35k  (39,59] 0Prim    married employloyed
##  3    NA <NA>      (24,39] 4colgrad married employloyed
##  4    NA <NA>      (59,79] 1somehs  nm      retired    
##  5    NA 5_50kplus (39,59] 4colgrad married employloyed
##  6    NA 4_35-50k  (39,59] 3somecol married employloyed
##  7    NA <NA>      (59,79] 2hsgrad  married retired    
##  8    NA 2_15-25k  (0,24]  3somecol nm      employloyed
##  9    NA 2_15-25k  (24,39] 2hsgrad  nm      employloyed
## 10    NA <NA>      (39,59] 4colgrad married employloyed
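
A sketch of how such a comparison can be set up, assuming the mice package and simulated data (the names `toy`, `toy_imp`, and `idx` are hypothetical): extract the first completed dataset with `complete()`, then index both data frames by the rows where smoke was originally missing.

```r
library(mice)

set.seed(2)
n <- 200

# Simulated stand-in for the original data (values are made up).
toy <- data.frame(
  smoke = rbinom(n, 1, 0.15),
  agec  = factor(sample(c("young", "mid", "old"), n, replace = TRUE)),
  inc   = factor(sample(c("low", "mid", "high"), n, replace = TRUE))
)
toy$smoke[sample(n, 20)] <- NA
toy$inc[sample(n, 40)] <- NA

imp <- mice(toy, m = 5, seed = 22, printFlag = FALSE)
toy_imp <- mice::complete(imp, 1)  # first completed dataset

# Rows where smoke was missing before imputation.
idx <- which(is.na(toy$smoke))

head(toy[idx, ], 10)      # original rows, smoke is NA
head(toy_imp[idx, ], 10)  # the same rows after imputation
```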

Analyzing the imputed data

Here I look at a linear model for smoke:

## 
## Call:
## lm(formula = smokedat ~ inc + agec + educ + marst + employ)
## 
## Coefficients:
##    (Intercept)     inc2_15-25k     inc3_25-35k     inc4_35-50k    inc5_50kplus  
##        0.15924        -0.02278        -0.03914        -0.05903        -0.07342  
##    agec(24,39]     agec(39,59]     agec(59,79]     agec(79,99]       educ0Prim  
##        0.10149         0.08908         0.03658        -0.04043        -0.07174  
##    educ1somehs    educ3somecol    educ4colgrad      marstcohab   marstdivorced  
##        0.06802        -0.03097        -0.11275         0.07628         0.09007  
##        marstnm  marstseparated    marstwidowed      employnilf   employretired  
##        0.04941         0.11640         0.01202        -0.01459        -0.01278  
##   employunable  
##        0.09947

Variation in smoke (standard deviation)

## [1] 0.3430501

Frequency table for income

## inc
##   1_lt15k  2_15-25k  3_25-35k  4_35-50k 5_50kplus 
##   0.09304   0.15736   0.10028   0.13416   0.51516

Frequency table for age

## agec
##  (0,24] (24,39] (39,59] (59,79] (79,99] 
## 0.06176 0.17592 0.31540 0.36952 0.07740

Frequency table for education

## educ
##  2hsgrad    0Prim  1somehs 3somecol 4colgrad 
##  0.24332  0.02268  0.03884  0.27548  0.41968
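
The three tables above report proportions rather than counts. A minimal base R sketch of that kind of table, on a made-up factor:

```r
# Made-up income categories.
inc <- factor(c("1_lt15k", "5_50kplus", "5_50kplus", "2_15-25k"))

# table() counts each level; prop.table() turns the counts into shares that sum to 1.
inc_share <- prop.table(table(inc))
inc_share
```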

The model fit on the original data, with missings eliminated:

## 
## Call:
## lm(formula = smoke ~ inc + agec + educ + marst + employ, data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57850 -0.16698 -0.09455 -0.01782  1.08567 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.155034   0.015900   9.751  < 2e-16 ***
## inc2_15-25k    -0.020557   0.010263  -2.003  0.04519 *  
## inc3_25-35k    -0.032679   0.011426  -2.860  0.00424 ** 
## inc4_35-50k    -0.055896   0.011009  -5.077 3.86e-07 ***
## inc5_50kplus   -0.070855   0.010233  -6.924 4.53e-12 ***
## agec(24,39]     0.096817   0.012051   8.034 9.95e-16 ***
## agec(39,59]     0.090487   0.012161   7.441 1.04e-13 ***
## agec(59,79]     0.041096   0.012879   3.191  0.00142 ** 
## agec(79,99]    -0.039999   0.016307  -2.453  0.01418 *  
## educ0Prim      -0.073794   0.017673  -4.176 2.98e-05 ***
## educ1somehs     0.101414   0.013600   7.457 9.22e-14 ***
## educ3somecol   -0.030730   0.006645  -4.625 3.77e-06 ***
## educ4colgrad   -0.110984   0.006471 -17.152  < 2e-16 ***
## marstcohab      0.073370   0.013317   5.509 3.64e-08 ***
## marstdivorced   0.086508   0.007371  11.737  < 2e-16 ***
## marstnm         0.054139   0.007479   7.239 4.69e-13 ***
## marstseparated  0.128181   0.016170   7.927 2.36e-15 ***
## marstwidowed    0.022397   0.008915   2.512  0.01200 *  
## employnilf     -0.008257   0.007846  -1.052  0.29263    
## employretired  -0.018869   0.007413  -2.545  0.01092 *  
## employunable    0.103386   0.010937   9.453  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3301 on 19997 degrees of freedom
## Multiple R-squared:  0.08606,    Adjusted R-squared:  0.08515 
## F-statistic: 94.15 on 20 and 19997 DF,  p-value: < 2.2e-16

Compare imputed model to original data

## 
## Call:
## lm(formula = smoke ~ inc + agec + educ + marst + employ, data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57850 -0.16698 -0.09455 -0.01782  1.08567 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.155034   0.015900   9.751  < 2e-16 ***
## inc2_15-25k    -0.020557   0.010263  -2.003  0.04519 *  
## inc3_25-35k    -0.032679   0.011426  -2.860  0.00424 ** 
## inc4_35-50k    -0.055896   0.011009  -5.077 3.86e-07 ***
## inc5_50kplus   -0.070855   0.010233  -6.924 4.53e-12 ***
## agec(24,39]     0.096817   0.012051   8.034 9.95e-16 ***
## agec(39,59]     0.090487   0.012161   7.441 1.04e-13 ***
## agec(59,79]     0.041096   0.012879   3.191  0.00142 ** 
## agec(79,99]    -0.039999   0.016307  -2.453  0.01418 *  
## educ0Prim      -0.073794   0.017673  -4.176 2.98e-05 ***
## educ1somehs     0.101414   0.013600   7.457 9.22e-14 ***
## educ3somecol   -0.030730   0.006645  -4.625 3.77e-06 ***
## educ4colgrad   -0.110984   0.006471 -17.152  < 2e-16 ***
## marstcohab      0.073370   0.013317   5.509 3.64e-08 ***
## marstdivorced   0.086508   0.007371  11.737  < 2e-16 ***
## marstnm         0.054139   0.007479   7.239 4.69e-13 ***
## marstseparated  0.128181   0.016170   7.927 2.36e-15 ***
## marstwidowed    0.022397   0.008915   2.512  0.01200 *  
## employnilf     -0.008257   0.007846  -1.052  0.29263    
## employretired  -0.018869   0.007413  -2.545  0.01092 *  
## employunable    0.103386   0.010937   9.453  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3301 on 19997 degrees of freedom
## Multiple R-squared:  0.08606,    Adjusted R-squared:  0.08515 
## F-statistic: 94.15 on 20 and 19997 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = smokedat ~ inc + agec + educ + marst + employ, data = dat.imp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54463 -0.16543 -0.09143 -0.01127  1.08013 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.159244   0.013731  11.597  < 2e-16 ***
## inc2_15-25k    -0.022777   0.008824  -2.581  0.00985 ** 
## inc3_25-35k    -0.039137   0.009931  -3.941 8.14e-05 ***
## inc4_35-50k    -0.059028   0.009586  -6.158 7.49e-10 ***
## inc5_50kplus   -0.073417   0.008893  -8.256  < 2e-16 ***
## agec(24,39]     0.101493   0.010348   9.808  < 2e-16 ***
## agec(39,59]     0.089079   0.010456   8.520  < 2e-16 ***
## agec(59,79]     0.036581   0.011077   3.302  0.00096 ***
## agec(79,99]    -0.040427   0.013842  -2.921  0.00350 ** 
## educ0Prim      -0.071745   0.014652  -4.896 9.82e-07 ***
## educ1somehs     0.068015   0.011510   5.909 3.48e-09 ***
## educ3somecol   -0.030972   0.005842  -5.302 1.16e-07 ***
## educ4colgrad   -0.112747   0.005719 -19.714  < 2e-16 ***
## marstcohab      0.076284   0.012103   6.303 2.97e-10 ***
## marstdivorced   0.090072   0.006644  13.556  < 2e-16 ***
## marstnm         0.049407   0.006696   7.379 1.65e-13 ***
## marstseparated  0.116403   0.014481   8.038 9.52e-16 ***
## marstwidowed    0.012021   0.007757   1.550  0.12124    
## employnilf     -0.014587   0.006829  -2.136  0.03270 *  
## employretired  -0.012780   0.006551  -1.951  0.05108 .  
## employunable    0.099474   0.009620  10.341  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3283 on 24979 degrees of freedom
## Multiple R-squared:  0.08509,    Adjusted R-squared:  0.08436 
## F-statistic: 116.2 on 20 and 24979 DF,  p-value: < 2.2e-16

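In the imputed dataset, the income coefficients are more strongly significant than in the original data. employnilf was not significant in the original data (p = 0.29) but is significant in the imputed data (p = 0.03). Conversely, the association between being widowed and smoking is no longer significant (p = 0.12), and employretired moves from significant (p = 0.01) to marginal (p = 0.05).

Overall, there are no substantial changes between the two sets of results: every coefficient keeps its sign, and the magnitudes are broadly similar.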
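
Putting the two sets of estimates side by side makes this comparison easier to read than two separate summaries. The sketch below uses simulated data and base R only; `full` is a made-up complete dataset that merely stands in for an imputed one, and `orig` is the same data with some income values removed.

```r
set.seed(42)
n <- 500

# Made-up complete data: smoking is more common in the low-income group.
full <- data.frame(
  inc = factor(sample(c("low", "high"), n, replace = TRUE),
               levels = c("low", "high"))
)
full$smoke <- rbinom(n, 1, ifelse(full$inc == "low", 0.25, 0.10))

# Remove some income values to mimic the original data with missings.
orig <- full
orig$inc[sample(n, 80)] <- NA

fit_orig <- lm(smoke ~ inc, data = orig)  # lm() drops incomplete rows by default
fit_full <- lm(smoke ~ inc, data = full)  # stand-in for the fit on an imputed dataset

# Coefficients side by side, with their difference.
comparison <- cbind(original = coef(fit_orig), imputed = coef(fit_full))
comparison <- cbind(comparison,
                    difference = comparison[, "imputed"] - comparison[, "original"])
round(comparison, 3)
```

The same `cbind()` of `coef()` values would work for the two models fitted above, provided both use the same formula so that the coefficient names line up.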