library(tidyverse)
library(psych)
library(naniar)
library(haven)
oregon <- read_dta("oregon2004.dta")
glimpse(oregon)
Rows: 4,508
Columns: 8
$ ID <dbl> 100001, 100004, 100005, 100006, 100007, 100008, 100009, 100010, 1…
$ com3 <dbl> 3, 5, 5, 5, 4, 3, 4, 2, 5, 3, 3, 3, 3, 5, 3, 4, 3, 5, 3, 3, 4, 2,…
$ epht3 <dbl> 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 3, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2,…
$ env_con <dbl> 2.090909, 1.636364, 1.272727, 1.000000, 2.111111, 2.727273, 2.363…
$ male <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,…
$ hlthprob <dbl> 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,…
$ inc <dbl> 60000, 40000, 13750, 11250, 6250, 40000, NA, 27500, 50000, NA, 80…
$ educat <dbl> 12, 16, 12, NA, NA, 12, 16, 14, NA, NA, 13, 12, 13, 8, NA, 14, 11…
n
changes for each variable!Start with a describe
from the psych
package- this gives us observations numbers to compare:
describe(oregon, fast = TRUE)
Say we want to run this regression model, with concern for the environment as the outcome and the other variables as predictors. Watch what happens to our observations:
linearmodel1 <- lm(env_con ~ com3 + epht3 + male + hlthprob + inc + educat, data = oregon)
summary(linearmodel1)
Call:
lm(formula = env_con ~ com3 + epht3 + male + hlthprob + inc +
educat, data = oregon)
Residuals:
Min 1Q Median 3Q Max
-1.95449 -0.46784 -0.02464 0.46097 1.93741
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.329e+00 7.602e-02 17.487 < 2e-16 ***
com3 -4.894e-02 1.068e-02 -4.583 4.78e-06 ***
epht3 4.037e-01 1.460e-02 27.658 < 2e-16 ***
male -5.306e-02 2.472e-02 -2.146 0.0319 *
hlthprob 2.911e-01 2.858e-02 10.188 < 2e-16 ***
inc 3.546e-07 4.128e-07 0.859 0.3905
educat -3.406e-03 4.705e-03 -0.724 0.4691
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6341 on 2819 degrees of freedom
(1682 observations deleted due to missingness)
Multiple R-squared: 0.3075, Adjusted R-squared: 0.3061
F-statistic: 208.7 on 6 and 2819 DF, p-value: < 2.2e-16
The miss_var_summary
function from the naniar
package is a nice way to get a quick summary of missingness.
miss_var_summary(oregon)
Sometimes, a visual is actually quicker and easier way to get a sense of the missingness in a dataset. Either way, we can see that our biggest issues with missingess in this dataset are with the inc
(income) and educat
(highest level of education) variables.
library(visdat)
vis_miss(oregon)
geom_miss_point
is a cool add-on to ggplot2
that shows you the scores of variables that might be missing pairwise comparisons.
ggplot(data = oregon, mapping = aes(x = educat, y = inc)) +
geom_miss_point()
To make this function work, we need to load this function in R. This is a little different from loading a package. Open the R code called mcar-test.R
and run the whole thing. You will see mcar_test
added to your environment as a function. This only works when the dataset you are working with is all numeric (no factors, no characters).
mcar_test(oregon)
Now, let’s do a little data management. Since our data do not satisfy the MCAR assumption, it will be important to show that we have variables in this dataset that can explain why income and years of education might be missing. So, we create two new variables, educat_miss
and inc_miss
, which are binary variables, = 1 if the participant is missing that variable, and = 0 if not.
missing <- oregon %>%
mutate(.,
educat_miss = case_when(
is.na(oregon$educat) ~ 1,
TRUE ~ 0),
inc_miss = case_when(
is.na(oregon$inc) ~ 1,
TRUE ~ 0))
##Describe and summarize the missing variable predictors:
describe(missing, fast = TRUE)
logistic1 <- glm(educat_miss ~ com3 + epht3 + env_con + male + hlthprob + inc, family = binomial, data = missing)
summary(logistic1)
Call:
glm(formula = educat_miss ~ com3 + epht3 + env_con + male + hlthprob +
inc, family = binomial, data = missing)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8781 -0.7804 -0.7385 1.5267 1.8388
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.357e+00 1.941e-01 -6.993 2.7e-12 ***
com3 8.840e-02 3.410e-02 2.593 0.00953 **
epht3 -3.971e-02 5.190e-02 -0.765 0.44422
env_con 7.326e-02 5.908e-02 1.240 0.21497
male -9.449e-02 7.854e-02 -1.203 0.22891
hlthprob -1.694e-01 9.337e-02 -1.814 0.06970 .
inc -7.011e-07 1.185e-06 -0.591 0.55425
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4257.1 on 3774 degrees of freedom
Residual deviance: 4244.2 on 3768 degrees of freedom
(733 observations deleted due to missingness)
AIC: 4258.2
Number of Fisher Scoring iterations: 4
logistic2 <- glm(inc_miss ~ com3 + epht3 + env_con + male + hlthprob + educat, family = binomial, data = missing)
summary(logistic2)
Call:
glm(formula = inc_miss ~ com3 + epht3 + env_con + male + hlthprob +
educat, family = binomial, data = missing)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7146 -0.5851 -0.5081 -0.4521 2.3201
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.500154 0.339533 -4.418 9.95e-06 ***
com3 0.001587 0.045496 0.035 0.9722
epht3 -0.132026 0.071900 -1.836 0.0663 .
env_con 0.157490 0.079415 1.983 0.0474 *
male -0.489780 0.112820 -4.341 1.42e-05 ***
hlthprob -0.303341 0.131099 -2.314 0.0207 *
educat -0.010996 0.018492 -0.595 0.5521
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2581.9 on 3265 degrees of freedom
Residual deviance: 2551.6 on 3259 degrees of freedom
(1242 observations deleted due to missingness)
AIC: 2565.6
Number of Fisher Scoring iterations: 4
library(mice)
Attaching package: ‘mice’
The following object is masked from ‘package:stats’:
filter
The following objects are masked from ‘package:base’:
cbind, rbind
imputed_data <- mice(oregon, m=5, maxit = 5, method = 'pmm', seed = 23284)
iter imp variable
1 1 com3 epht3 env_con male hlthprob inc educat
1 2 com3 epht3 env_con male hlthprob inc educat
1 3 com3 epht3 env_con male hlthprob inc educat
1 4 com3 epht3 env_con male hlthprob inc educat
1 5 com3 epht3 env_con male hlthprob inc educat
2 1 com3 epht3 env_con male hlthprob inc educat
2 2 com3 epht3 env_con male hlthprob inc educat
2 3 com3 epht3 env_con male hlthprob inc educat
2 4 com3 epht3 env_con male hlthprob inc educat
2 5 com3 epht3 env_con male hlthprob inc educat
3 1 com3 epht3 env_con male hlthprob inc educat
3 2 com3 epht3 env_con male hlthprob inc educat
3 3 com3 epht3 env_con male hlthprob inc educat
3 4 com3 epht3 env_con male hlthprob inc educat
3 5 com3 epht3 env_con male hlthprob inc educat
4 1 com3 epht3 env_con male hlthprob inc educat
4 2 com3 epht3 env_con male hlthprob inc educat
4 3 com3 epht3 env_con male hlthprob inc educat
4 4 com3 epht3 env_con male hlthprob inc educat
4 5 com3 epht3 env_con male hlthprob inc educat
5 1 com3 epht3 env_con male hlthprob inc educat
5 2 com3 epht3 env_con male hlthprob inc educat
5 3 com3 epht3 env_con male hlthprob inc educat
5 4 com3 epht3 env_con male hlthprob inc educat
5 5 com3 epht3 env_con male hlthprob inc educat
summary(imputed_data)
Class: mids
Number of multiple imputations: 5
Imputation methods:
ID com3 epht3 env_con male hlthprob inc educat
"" "pmm" "pmm" "pmm" "pmm" "pmm" "pmm" "pmm"
PredictorMatrix:
ID com3 epht3 env_con male hlthprob inc educat
ID 0 1 1 1 1 1 1 1
com3 1 0 1 1 1 1 1 1
epht3 1 1 0 1 1 1 1 1
env_con 1 1 1 0 1 1 1 1
male 1 1 1 1 0 1 1 1
hlthprob 1 1 1 1 1 0 1 1
linearmodel2 <- with(imputed_data, lm(env_con ~ com3 + epht3 + male + hlthprob + inc + educat))
summary(pool(linearmodel2))
term estimate std.error statistic df p.value
1 (Intercept) 1.308790e+00 7.105495e-02 18.4194018 49.95552 0.000000e+00
2 com3 -4.892321e-02 8.668949e-03 -5.6434994 1883.43799 1.919408e-08
3 epht3 4.035296e-01 1.190996e-02 33.8816918 3780.45969 0.000000e+00
4 male -4.201650e-02 2.026166e-02 -2.0736943 4390.40980 3.816588e-02
5 hlthprob 2.936229e-01 2.384667e-02 12.3129555 1995.18157 0.000000e+00
6 inc 2.637422e-07 3.599334e-07 0.7327530 292.74497 4.642954e-01
7 educat -1.180927e-03 4.904356e-03 -0.2407915 23.37258 8.118164e-01
stripplot
stripplot(imputed_data,
educat,
jitter.data = TRUE,
factor = 2,
xlab = "Imputation number")
summary(linearmodel1)
Call:
lm(formula = env_con ~ com3 + epht3 + male + hlthprob + inc +
educat, data = oregon)
Residuals:
Min 1Q Median 3Q Max
-1.95449 -0.46784 -0.02464 0.46097 1.93741
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.329e+00 7.602e-02 17.487 < 2e-16 ***
com3 -4.894e-02 1.068e-02 -4.583 4.78e-06 ***
epht3 4.037e-01 1.460e-02 27.658 < 2e-16 ***
male -5.306e-02 2.472e-02 -2.146 0.0319 *
hlthprob 2.911e-01 2.858e-02 10.188 < 2e-16 ***
inc 3.546e-07 4.128e-07 0.859 0.3905
educat -3.406e-03 4.705e-03 -0.724 0.4691
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6341 on 2819 degrees of freedom
(1682 observations deleted due to missingness)
Multiple R-squared: 0.3075, Adjusted R-squared: 0.3061
F-statistic: 208.7 on 6 and 2819 DF, p-value: < 2.2e-16