PART I: Missing Data

1. Load and Prepare Data

We will use the 2022 General Social Survey (GSS). Variables used: income, education, age, hours worked, gender

gss <- gss_get_yr(2022)
## Fetching: https://gss.norc.org/documents/stata/2022_stata.zip
gss_sub <- gss %>%
  zap_labels() %>%
  transmute(
    income = as.numeric(income),
    educ   = as.numeric(educ),
    age    = as.numeric(age),
    hrs1   = as.numeric(hrs1),
    male   = ifelse(sex == 1, 1, 0)
  )

Question 1 Why do we use zap_labels() when importing Stata data?

This ensures that these labelled vectors are removed so the variable behaves like a standard R vector for analytical purposes.

2. Inspect Missingness

colSums(is.na(gss_sub))
## income   educ    age   hrs1   male 
##    484     29    256   1829     23
count(gss_sub)
## # A tibble: 1 × 1
##       n
##   <int>
## 1  4149
View(gss_sub)

vis_miss(gss_sub)

gss_sub %>%
  mutate(across(everything(), is.na)) %>%
  cor() %>%
  heatmap()

Question 2: How many missing values exist for each variable?

income: 434 educ: 20 age: 208 hrs1: 1606 male: 20

Question 3: What proportion of income is missing?

434/3544 = 12.2%

Question 4: Write 2–3 sentences describing the missingness pattern.

Based on the visualizaations above, there appear to be some correlation between missing data in income and age and hrs1.Furthermore, male and age also exhibit some relationship in their missingness.

3 Create a Missingness Indicator

gss_sub <- gss_sub %>%
  mutate(income_missing = as.integer(is.na(income)))

Question 5 What does a value of 1 represent in income_missing?

A value of 1 in this column serves to indicate that that observation is missing data for income.

Question 6 Why is creating a missingness indicator useful?

This could help us draw correlation between missing variables. In this case, we could use it to see if missing data for income is linked to missing data in other variables. This could eventually be useful when imputing later down the line.

4 Diagnostic Model for Missingness

mar_mod <- glm(income_missing ~ educ + age + male + hrs1,
               data = gss_sub,
               family = binomial)

summary(mar_mod)
## 
## Call:
## glm(formula = income_missing ~ educ + age + male + hrs1, family = binomial, 
##     data = gss_sub)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.249484   0.506824  -0.492   0.6225   
## educ        -0.071532   0.029258  -2.445   0.0145 * 
## age         -0.017516   0.006085  -2.878   0.0040 **
## male        -0.112346   0.169085  -0.664   0.5064   
## hrs1        -0.012617   0.006026  -2.094   0.0363 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1129.5  on 2152  degrees of freedom
## Residual deviance: 1107.9  on 2148  degrees of freedom
##   (1996 observations deleted due to missingness)
## AIC: 1117.9
## 
## Number of Fisher Scoring iterations: 5

Question 7 What is the dependent variable in this model?

The dependent variable in this model is the missingness indicator for income.

Question 8 Do observed variables predict missing income?

Education and age are statistically significant for the missingness indicator, and we see that their coefficients are both negative. In this case, we can interpret this to mean that missing income data is negatively correlated with age and income, so less educated/younger individuals are more likely to omit income data in this dataset.

Question 9 Does this prove the data are Missing at Random (MAR)? Explain.

I would say that this likely means that the data is missing at random but not completely at random. MAR usually means that the missingness is tied to something observable, but it is still pretty random given that it’s not like all people below a certain age and education omitted their income.

5 Complete Case Analysis

m_cc <- lm(income ~ educ + age + male + hrs1,
           data = gss_sub,
           na.action = na.omit)

summary(m_cc)
## 
## Call:
## lm(formula = income ~ educ + age + male + hrs1, data = gss_sub, 
##     na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2671  -0.0284   0.2815   0.5637   1.6796 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 8.918792   0.218285  40.858 < 0.0000000000000002 ***
## educ        0.106839   0.011788   9.064 < 0.0000000000000002 ***
## age         0.012136   0.002387   5.084         0.0000004033 ***
## male        0.161143   0.066767   2.414               0.0159 *  
## hrs1        0.012920   0.002387   5.414         0.0000000692 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 1990 degrees of freedom
##   (2154 observations deleted due to missingness)
## Multiple R-squared:  0.07251,    Adjusted R-squared:  0.07065 
## F-statistic: 38.89 on 4 and 1990 DF,  p-value: < 0.00000000000000022
nobs(m_cc)
## [1] 1995

Question 10 What does na.omit do?

In regressions, na.omit removes all rows for which a variable is missing.

Question 11 How many observations remain?

In this case, 1664 observations remain.

Question 12 Interpret the education coefficient.

Holding all else constant, for every one year increase in education observed, income likely increases by .106 – and the relationship is statistically significant.

6 Mean Imputation

gss_mean <- gss_sub %>%
  mutate(income = ifelse(is.na(income),
                         mean(income, na.rm = TRUE),
                         income))

m_mean <- lm(income ~ educ + age + male + hrs1,
             data = gss_mean)

summary(m_mean)
## 
## Call:
## lm(formula = income ~ educ + age + male + hrs1, data = gss_mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2178  -0.0389   0.2683   0.5499   1.6191 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 9.070625   0.200791  45.174 < 0.0000000000000002 ***
## educ        0.101351   0.011029   9.189 < 0.0000000000000002 ***
## age         0.011146   0.002197   5.073         0.0000004259 ***
## male        0.160478   0.062113   2.584              0.00984 ** 
## hrs1        0.011775   0.002193   5.370         0.0000000874 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.42 on 2148 degrees of freedom
##   (1996 observations deleted due to missingness)
## Multiple R-squared:  0.06937,    Adjusted R-squared:  0.06764 
## F-statistic: 40.03 on 4 and 2148 DF,  p-value: < 0.00000000000000022

Question 13 What value replaces missing income?

Missing income here is replaced by the mean of income across observed values (about 11~ish).

Question 14 Why is mean imputation generally not recommended?

This method is generally not recommended because it can distort the distribution and underestimate the variance.

7 Multiple Imputation

imp <- mice(gss_sub %>% select(income, educ, age, male, hrs1),
            m = 5,
            method = "pmm",
            seed = 123,
            printFlag = FALSE)

fit_mi <- with(imp, lm(income ~ educ + age + male + hrs1))
#summary(fit_mi) #pause here, and review the output, what did you see?

mi_pooled <- pool(fit_mi) 

summary(mi_pooled) # puase here, and review the output again. What did you see? 
##          term   estimate   std.error statistic         df
## 1 (Intercept) 7.99780066 0.240903788 33.199149   28.32891
## 2        educ 0.15460872 0.011056838 13.983086  836.96820
## 3         age 0.00495595 0.002678731  1.850111   12.11950
## 4        male 0.19750108 0.063658297  3.102519 3044.53691
## 5        hrs1 0.01804122 0.002998555  6.016638   17.31508
##                                            p.value
## 1 0.0000000000000000000000031564693285124606240409
## 2 0.0000000000000000000000000000000000000004392022
## 3 0.0888184119617555223324956159558496437966823578
## 4 0.0019364054304744924075998291357336711371317506
## 5 0.0000128378148088285520312962911804177679186978

Question 15 What does m = 5 represent?

The m in mice represents the number of predicted values we want the algorithm to generate and compare.

Question 16 What is Predictive Mean Matching (PMM)?

PMM is a method whereby we create a new variable with predicted values based on the column with missing values; it then looks at the difference between each predicted value and the one that’s missing and replaces it with whichever one is closest.

MICE Diagnostics

densityplot(imp)

!! Note, this is something I did not cover in class. There are several ways to evaluate our imputation. One diagnostic plot we can make is to map the densities of our imputed variables (in red) against the actual (in blue). If you have 5 imputated datasets, you will have 5 red lines representing each of the imputated dataset. A good MI is when the imputated datasets have similar data distribution as the original data distribution.

Question 16. Do the imputed distributions resemble the observed data?

The imputed distributions modereately resemble the observed ones, though I would be unnerved by the education variable’s distribution given how fuzzy it is. I do think it’s fascinating to see how the imputations smooth out the male variable so much while almost over-accentuating the income one.

imp2 <- mice(gss_sub %>% select(income, educ, age, male, hrs1),
            m = 50,
            method = "pmm",
            seed = 123,
            printFlag = FALSE)

fit_mi2 <- with(imp2, lm(income ~ educ + age + male + hrs1))
#summary(fit_mi) #pause here, and review the output, what did you see?

mi_pooled2 <- pool(fit_mi2) 

summary(mi_pooled2)
##          term    estimate   std.error statistic        df
## 1 (Intercept) 7.930331772 0.249758578 31.751990 262.11595
## 2        educ 0.161674554 0.012457049 12.978560 622.32655
## 3         age 0.004142534 0.001980559  2.091599 862.53301
## 4        male 0.203434692 0.072767623  2.795676 689.36650
## 5        hrs1 0.017836873 0.004239111  4.207692  84.28898
##                                                                                                p.value
## 1 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008184665
## 2 0.00000000000000000000000000000000029322575031163739026959033115441889676127704833614166665231326817
## 3 0.03676637300818753895104151752093457616865634918212890625000000000000000000000000000000000000000000
## 4 0.00532329249625231124692925632757578568998724222183227539062500000000000000000000000000000000000000
## 5 0.00006411813955025799855510365565791630615422036498785018920898437500000000000000000000000000000000
densityplot(imp2)

Question 17. Revise imputation number to 50 (instead of 5), and do the diagnostic again. Did the MI perform better here?

This is really cool to visualize! Furthermore, the coefficients didn’t change too much but the degrees of freedom grew significantly. This would lead me to assume this is a better model, but I’m not sure it would change my interpretation much if at all.

8. Model Comparison

models <- list(
  "Complete cases" = m_cc,
  "Mean imputation" = m_mean,
  "Multiple imputation" = mi_pooled2
)

modelsummary(models)
Complete cases Mean imputation Multiple imputation
(Intercept) 8.919 9.071 7.930
(0.218) (0.201) (0.250)
educ 0.107 0.101 0.162
(0.012) (0.011) (0.012)
age 0.012 0.011 0.004
(0.002) (0.002) (0.002)
male 0.161 0.160 0.203
(0.067) (0.062) (0.073)
hrs1 0.013 0.012 0.018
(0.002) (0.002) (0.004)
Num.Obs. 1995 2153 4149
Num.Imp. 50
R2 0.073 0.069 0.077
R2 Adj. 0.071 0.068 0.076
AIC 7203.7 7626.1
BIC 7237.3 7660.1
Log.Lik. -3595.847 -3807.030
F 40.030
RMSE 1.47 1.42

Question 18 Which method appears most credible?

The MICE’d model is more credible given that it has a slightly better adjusted r-squared. However, the slightly larger standard errors give me some pause, as this could indicate slightly more uncertainty in the MICE’d model.

Question 19 Which coefficients are most stable across models?

The age and hours worked variables have the most stable coefficients across all models.

Note the N for MI is not accurate. This happens when the modelsummary doesn’t know how to treat the pooled data analysis (remember, the coefficients are the averages of all 5 or 50 imputated datasets). I’ll show you an example of how to correct the N in the table when we meet. For now, just know that the N is not accurate for model3 MI.

PART II: Model Diagnostics

Now we examine outliers and multicollinearity.

9 Regression Model

gss_avail <- gss %>%
  select(conrinc, prestg10, wrkslf, sex, race) %>%
  zap_labels() %>%
  drop_na()

mod1 <- lm(conrinc ~ prestg10 + wrkslf + sex + race,
           data = gss_avail)

summary(mod1)
## 
## Call:
## lm(formula = conrinc ~ prestg10 + wrkslf + sex + race, data = gss_avail)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -76490 -20763  -6730   9343 146690 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  21159.68    5651.45   3.744             0.000186 ***
## prestg10      1099.69      58.26  18.876 < 0.0000000000000002 ***
## wrkslf       -3934.26    2430.28  -1.619             0.105642    
## sex         -14083.49    1587.74  -8.870 < 0.0000000000000002 ***
## race         -2332.30    1139.97  -2.046             0.040898 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34670 on 1937 degrees of freedom
## Multiple R-squared:  0.1873, Adjusted R-squared:  0.1856 
## F-statistic: 111.6 on 4 and 1937 DF,  p-value: < 0.00000000000000022

10 Cook’s Distance

Cook’s Distance measures how much each observation influences model estimates.

gss_avail$cooksd <- cooks.distance(mod1)
summary(gss_avail$cooksd)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## 0.0000000003 0.0000202847 0.0001014516 0.0005713772 0.0003329013 0.0196211183
ggplot(gss_avail, aes(seq_along(cooksd), cooksd)) +
  geom_point() +
  geom_hline(yintercept = 4/nrow(gss_avail), linetype = "dashed") +
  labs(x="Observation", y="Cook's Distance")

Question 20 Should we be concerned about influential outliers here? Explain.

None of the Cook’s D’s are greater than 1, so we don’t have any outliers to be concerned about here.

11 Multicollinearity

Variance Inflation Factor (VIF) measures multicollinearity.

vif(mod1)
## prestg10   wrkslf      sex     race 
## 1.013150 1.025979 1.017827 1.011182

Question 21 Which predictors have the highest VIF? What VIF threshold suggests problematic multicollinearity?

VIF shows multicollinearity in a model, and the highest ones here are wrkslf and sex. However, we wouldn’t consider it an issue unless their values were greater than 5. In that case, we would likely need to drop one of the variables.

Final Reflection

Question 22 What is the main lesson about missing data from this lab?

I think the main lesson about missing data here is that it can be easy to correct but we should take tons of precaution and be precise in how it alters our outputs/interpretations.

Question 23 Why is multiple imputation often preferred in social science research?

Multiple imputation is preferred in social science research because it: - can reduces bias; - can preserve statistical power; - can produce realistic standard errors; and - can preserve relationships between variables.