MISSING DATA ANALYSES AND PRACTICES

This is an R Markdown document that summarizes methods for identifying the extent of missing data, the mechanism of the missing data, and compares mean substitution and multiple imputation approaches for handling missing data.

We are first going to load our libraries and read in a file that includes 200 observations on 4 variables.

library(mice)
## Warning: package 'mice' was built under R version 3.6.2
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(psych)
library(MissMech)
library(rio)

setwd("/Users/munira/Desktop/Education/Masters/Illinois Institute of Technology/Coursework/554-Multivariate Statistics/R/Data")
album <- import("Album Sales.csv")

knitr::opts_chunk$set(comment = NA)

AMPUTE DATA

We are next going to ampute data to create missing values.

missalbum <- ampute(album, prop = .5)

What does our missing data look like?

summary(missalbum$amp)
    Adverts             Sales          Airplay         Attract     
 Min.   :   9.104   Min.   : 10.0   Min.   : 0.00   Min.   :1.000  
 1st Qu.: 204.102   1st Qu.:122.5   1st Qu.:19.00   1st Qu.:6.000  
 Median : 508.601   Median :190.0   Median :28.00   Median :7.000  
 Mean   : 590.494   Mean   :187.9   Mean   :27.35   Mean   :6.689  
 3rd Qu.: 903.380   3rd Qu.:240.0   3rd Qu.:35.00   3rd Qu.:8.000  
 Max.   :1985.119   Max.   :360.0   Max.   :63.00   Max.   :9.000  
 NA's   :24         NA's   :22      NA's   :28      NA's   :20     

MCAR or MAR?

We now test the hypothesis that the data are missing MCAR. The null hypothesis is that data are MCAR, so a significant test results indicates the data are MAR.

out <- TestMCARNormality(missalbum$amp,  method = "Auto")
print(out)
Call:
TestMCARNormality(data = missalbum$amp, method = "Auto")

Number of Patterns:  5 

Total number of cases used in the analysis:  200 

 Pattern(s) used:
          Adverts   Sales   Airplay   Attract   Number of cases
group.1         1       1         1        NA                20
group.2         1       1         1         1               106
group.3        NA       1         1         1                24
group.4         1       1        NA         1                28
group.5         1      NA         1         1                22

    Test of normality and Homoscedasticity:
  -------------------------------------------

Hawkins Test:

    P-value for the Hawkins test of normality and homoscedasticity:  0.01868399 

    Either the test of multivariate normality or homoscedasticity (or both) is rejected.
    Provided that normality can be assumed, the hypothesis of MCAR is 
    rejected at 0.05 significance level. 

Non-Parametric Test:

    P-value for the non-parametric test of homoscedasticity:  0.1610754 

    Reject Normality at 0.05 significance level.
    There is not sufficient evidence to reject MCAR at 0.05 significance level.

What is the pattern of missing data?

md.pattern(missalbum$amp)

    Attract Sales Adverts Airplay   
106       1     1       1       1  0
28        1     1       1       0  1
24        1     1       0       1  1
22        1     0       1       1  1
20        0     1       1       1  1
         20    22      24      28 94

MEAN SUBSTITUTION

Let’s just focus on the Adverts variable and perform a mean substitution. What happens when we then look at basic descriptive values in the original file and the imputed file?

missalbum$amp$Adverts.mean<-ifelse(is.na(missalbum$amp$Adverts)==T, mean(missalbum$amp$Adverts, na.rm=T), missalbum$amp$Adverts)

describe(missalbum$amp$Adverts)
   vars   n   mean     sd median trimmed    mad min     max   range skew kurtosis    se
X1    1 176 590.49 473.26  508.6  540.63 492.15 9.1 1985.12 1976.01 0.75    -0.28 35.67
describe(missalbum$amp$Adverts.mean)
   vars   n   mean     sd median trimmed    mad min     max   range skew kurtosis    se
X1    1 200 590.49 443.81 587.06  543.53 494.37 9.1 1985.12 1976.01  0.8     0.09 31.38

We can also compare the histograms across the two files to visualize the impact of the mean substitution.

hist(missalbum$amp$Adverts.mean)
hist(missalbum$amp$Adverts, add=T ,col=1)

I want to run a regression model using the mean substitution approach for our missing data, so we will first perform a mean substitute for the other two predictors.

missalbum$amp$Airplay.mean<-ifelse(is.na(missalbum$amp$Airplay)==T, mean(missalbum$amp$Airplay, na.rm=T), missalbum$amp$Airplay)
missalbum$amp$Attract.mean<-ifelse(is.na(missalbum$amp$Attract)==T, mean(missalbum$amp$Attract, na.rm=T), missalbum$amp$Attract)

IMPORTANT NOTE: I did NOT impute values for any missing criterion variable data. Standard practice is to delete those cases.

We then run our regression analysis in which we predict SALES from the other 3 variables.

reg <- lm(Sales ~ Adverts.mean + Airplay.mean + Attract.mean, data = missalbum$amp)
summary(reg)

Call:
lm(formula = Sales ~ Adverts.mean + Airplay.mean + Attract.mean, 
    data = missalbum$amp)

Residuals:
    Min      1Q  Median      3Q     Max 
-160.31  -33.66    0.71   31.21  191.89 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -26.484256  21.610189  -1.226 0.222026    
Adverts.mean   0.089747   0.009138   9.822  < 2e-16 ***
Airplay.mean   3.284338   0.377611   8.698 2.48e-15 ***
Attract.mean  11.460894   3.030198   3.782 0.000214 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.02 on 174 degrees of freedom
  (22 observations deleted due to missingness)
Multiple R-squared:  0.5552,    Adjusted R-squared:  0.5475 
F-statistic: 72.39 on 3 and 174 DF,  p-value: < 2.2e-16

Let’s get rid of those mean subsitution variables from the dataframe. We aren’t going to use them anymore

missalbum$amp <- missalbum$amp[, -c(5:7)]

MULTIPLE IMPUTATION

Let’s do a basic multiple imputation procedure in which we generate 10 datasets. Note that the default for the mice function is to generate 5 datasets.

imp <- mice(data = missalbum$amp, m=10)
print(imp)

We can check on the imputed values to see if they make sense

head(imp$imp$Sales)
     1   2   3   4   5   6   7   8   9  10
18 210 210 210 300 200 200 210 200 180 210
28 290 340 290 290 360 360 340 340 290 360
39 240 240 230 250 120 240 190 210 250 210
41 310 310 240 210 240 240 340 230 120 310
42 360 250 340 220 250 300 280 210 320 300
63 300 150 300 200 300 150 180 300 230 340
head(imp$imp$Adverts)
          1        2        3        4        5        6        7       8        9      10
5   283.161   97.972  513.694  982.063  715.678  917.017  583.627 976.641   42.568 215.368
15  715.678 1095.578  910.851  900.889 1326.598 1051.168  295.840 644.151  174.093 883.877
20  204.568  601.434  102.563  746.024  125.179 1112.470  313.362 268.598 1051.168 215.368
22  263.598  893.355 1051.168 1132.877 1323.287 1567.548 1542.329 450.562  856.985 985.968
24  669.811  102.568   51.229  263.268  767.134  196.650 1507.972 507.638  642.786 746.024
25 1542.329  526.142  102.568  826.859  283.161 1542.329  256.894 295.840   26.598 982.063

We can also do things like look at observed and imputed values of Sales respect to Adverts

stripplot(imp,Sales~Adverts | .imp, pch=20)

Look at the first imputed data set

dat.imp<-complete(imp)
head(dat.imp, n=10)
    Adverts Sales Airplay Attract
1    10.256   330      43       7
2   985.685   120      28       7
3  1445.563   360      35       7
4  1188.193   270      33       7
5   283.161   220      44       5
6   568.954   170      19       5
7   471.814    70      20       1
8   537.352   210      22       9
9   514.068   200      21       7
10  174.093   300      44       7

Compare this to the original data frame

head(missalbum$amp [,c("Adverts", "Attract", "Sales", "Airplay")], n=10)
    Adverts Attract Sales Airplay
1    10.256      NA   330      43
2   985.685       7   120      28
3  1445.563       7   360      35
4  1188.193       7   270      33
5        NA       5   220      44
6   568.954       5   170      19
7   471.814       1    70      20
8   537.352       9   210      22
9   514.068       7   200      21
10  174.093       7   300      NA

Let’s look at the sixth imputed data set

dat.imp<-complete(imp, action = 6)
head(dat.imp, n=10)
    Adverts Sales Airplay Attract
1    10.256   330      43       7
2   985.685   120      28       7
3  1445.563   360      35       7
4  1188.193   270      33       7
5   917.017   220      44       5
6   568.954   170      19       5
7   471.814    70      20       1
8   537.352   210      22       9
9   514.068   200      21       7
10  174.093   300      53       7

Now let’s run our model using the imputed data

fit.sales <- with(data = imp, exp = lm(Sales ~ Adverts + Airplay + Attract))
fit.sales
call :
with.mids(data = imp, expr = lm(Sales ~ Adverts + Airplay + Attract))

call1 :
mice(data = missalbum$amp, m = 10)

nmis :
Adverts   Sales Airplay Attract 
     24      22      28      20 

analyses :
[[1]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -30.67913      0.08698      3.12035     12.90642  


[[2]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -30.74956      0.09016      3.18020     12.11589  


[[3]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -44.98728      0.09224      3.14329     14.71304  


[[4]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -33.90017      0.08404      3.18692     13.10624  


[[5]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -28.21678      0.08888      3.06767     12.11966  


[[6]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -23.20231      0.08548      3.44898     10.23442  


[[7]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -35.42408      0.09144      3.41801     12.18775  


[[8]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -27.70089      0.09187      3.04473     11.64575  


[[9]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -23.60785      0.08504      3.27392     10.92841  


[[10]]

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)

Coefficients:
(Intercept)      Adverts      Airplay      Attract  
  -31.79345      0.08071      3.23530     12.92729  

What happens if we run the model using the data frame that had missing data?

summary(lm(Sales ~ Adverts + Airplay + Attract, data = missalbum$amp))

Call:
lm(formula = Sales ~ Adverts + Airplay + Attract, data = missalbum$amp)

Residuals:
     Min       1Q   Median       3Q      Max 
-105.870  -26.383   -1.531   29.314  105.156 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -20.00573   20.18977  -0.991 0.324087    
Adverts       0.08973    0.01119   8.020 1.85e-12 ***
Airplay       2.71786    0.36882   7.369 4.62e-11 ***
Attract      11.61895    2.92418   3.973 0.000132 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.33 on 102 degrees of freedom
  (94 observations deleted due to missingness)
Multiple R-squared:  0.6039,    Adjusted R-squared:  0.5922 
F-statistic: 51.83 on 3 and 102 DF,  p-value: < 2.2e-16

Let’s compare that to what we get when we combine results across the imputed datasets.

summary(pool(fit.sales))
         term     estimate    std.error statistic        df      p.value
1 (Intercept) -31.02614923 19.146690700 -1.620444 134.39584 1.074808e-01
2     Adverts   0.08768418  0.008296957 10.568234  76.49639 0.000000e+00
3     Airplay   3.21193627  0.319226644 10.061617  91.96586 2.220446e-16
4     Attract  12.28848421  2.828837713  4.344005  86.65268 3.785861e-05
pool.r.squared(fit.sales, adjusted = FALSE)
         est     lo 95     hi 95 fmi
R^2 0.642939 0.5313541 0.7340211 NaN