This is an R Markdown document that summarizes methods for identifying the extent of missing data, the mechanism of the missing data, and compares mean substitution and multiple imputation approaches for handling missing data.
We are first going to load our libraries and read in a file that includes 200 observations on 4 variables.
library(mice)
## Warning: package 'mice' was built under R version 3.6.2
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(psych)
library(MissMech)
library(rio)
setwd("/Users/munira/Desktop/Education/Masters/Illinois Institute of Technology/Coursework/554-Multivariate Statistics/R/Data")
album <- import("Album Sales.csv")
knitr::opts_chunk$set(comment = NA)
We are next going to ampute data to create missing values.
missalbum <- ampute(album, prop = .5)
What does our missing data look like?
summary(missalbum$amp)
Adverts Sales Airplay Attract
Min. : 9.104 Min. : 10.0 Min. : 0.00 Min. :1.000
1st Qu.: 204.102 1st Qu.:122.5 1st Qu.:19.00 1st Qu.:6.000
Median : 508.601 Median :190.0 Median :28.00 Median :7.000
Mean : 590.494 Mean :187.9 Mean :27.35 Mean :6.689
3rd Qu.: 903.380 3rd Qu.:240.0 3rd Qu.:35.00 3rd Qu.:8.000
Max. :1985.119 Max. :360.0 Max. :63.00 Max. :9.000
NA's :24 NA's :22 NA's :28 NA's :20
We now test the hypothesis that the data are missing MCAR. The null hypothesis is that data are MCAR, so a significant test results indicates the data are MAR.
out <- TestMCARNormality(missalbum$amp, method = "Auto")
print(out)
Call:
TestMCARNormality(data = missalbum$amp, method = "Auto")
Number of Patterns: 5
Total number of cases used in the analysis: 200
Pattern(s) used:
Adverts Sales Airplay Attract Number of cases
group.1 1 1 1 NA 20
group.2 1 1 1 1 106
group.3 NA 1 1 1 24
group.4 1 1 NA 1 28
group.5 1 NA 1 1 22
Test of normality and Homoscedasticity:
-------------------------------------------
Hawkins Test:
P-value for the Hawkins test of normality and homoscedasticity: 0.01868399
Either the test of multivariate normality or homoscedasticity (or both) is rejected.
Provided that normality can be assumed, the hypothesis of MCAR is
rejected at 0.05 significance level.
Non-Parametric Test:
P-value for the non-parametric test of homoscedasticity: 0.1610754
Reject Normality at 0.05 significance level.
There is not sufficient evidence to reject MCAR at 0.05 significance level.
What is the pattern of missing data?
md.pattern(missalbum$amp)
Attract Sales Adverts Airplay
106 1 1 1 1 0
28 1 1 1 0 1
24 1 1 0 1 1
22 1 0 1 1 1
20 0 1 1 1 1
20 22 24 28 94
Let’s just focus on the Adverts variable and perform a mean substitution. What happens when we then look at basic descriptive values in the original file and the imputed file?
missalbum$amp$Adverts.mean<-ifelse(is.na(missalbum$amp$Adverts)==T, mean(missalbum$amp$Adverts, na.rm=T), missalbum$amp$Adverts)
describe(missalbum$amp$Adverts)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 176 590.49 473.26 508.6 540.63 492.15 9.1 1985.12 1976.01 0.75 -0.28 35.67
describe(missalbum$amp$Adverts.mean)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 200 590.49 443.81 587.06 543.53 494.37 9.1 1985.12 1976.01 0.8 0.09 31.38
We can also compare the histograms across the two files to visualize the impact of the mean substitution.
hist(missalbum$amp$Adverts.mean)
hist(missalbum$amp$Adverts, add=T ,col=1)
I want to run a regression model using the mean substitution approach for our missing data, so we will first perform a mean substitute for the other two predictors.
missalbum$amp$Airplay.mean<-ifelse(is.na(missalbum$amp$Airplay)==T, mean(missalbum$amp$Airplay, na.rm=T), missalbum$amp$Airplay)
missalbum$amp$Attract.mean<-ifelse(is.na(missalbum$amp$Attract)==T, mean(missalbum$amp$Attract, na.rm=T), missalbum$amp$Attract)
We then run our regression analysis in which we predict SALES from the other 3 variables.
reg <- lm(Sales ~ Adverts.mean + Airplay.mean + Attract.mean, data = missalbum$amp)
summary(reg)
Call:
lm(formula = Sales ~ Adverts.mean + Airplay.mean + Attract.mean,
data = missalbum$amp)
Residuals:
Min 1Q Median 3Q Max
-160.31 -33.66 0.71 31.21 191.89
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.484256 21.610189 -1.226 0.222026
Adverts.mean 0.089747 0.009138 9.822 < 2e-16 ***
Airplay.mean 3.284338 0.377611 8.698 2.48e-15 ***
Attract.mean 11.460894 3.030198 3.782 0.000214 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54.02 on 174 degrees of freedom
(22 observations deleted due to missingness)
Multiple R-squared: 0.5552, Adjusted R-squared: 0.5475
F-statistic: 72.39 on 3 and 174 DF, p-value: < 2.2e-16
Let’s get rid of those mean subsitution variables from the dataframe. We aren’t going to use them anymore
missalbum$amp <- missalbum$amp[, -c(5:7)]
Let’s do a basic multiple imputation procedure in which we generate 10 datasets. Note that the default for the mice function is to generate 5 datasets.
imp <- mice(data = missalbum$amp, m=10)
print(imp)
We can check on the imputed values to see if they make sense
head(imp$imp$Sales)
1 2 3 4 5 6 7 8 9 10
18 210 210 210 300 200 200 210 200 180 210
28 290 340 290 290 360 360 340 340 290 360
39 240 240 230 250 120 240 190 210 250 210
41 310 310 240 210 240 240 340 230 120 310
42 360 250 340 220 250 300 280 210 320 300
63 300 150 300 200 300 150 180 300 230 340
head(imp$imp$Adverts)
1 2 3 4 5 6 7 8 9 10
5 283.161 97.972 513.694 982.063 715.678 917.017 583.627 976.641 42.568 215.368
15 715.678 1095.578 910.851 900.889 1326.598 1051.168 295.840 644.151 174.093 883.877
20 204.568 601.434 102.563 746.024 125.179 1112.470 313.362 268.598 1051.168 215.368
22 263.598 893.355 1051.168 1132.877 1323.287 1567.548 1542.329 450.562 856.985 985.968
24 669.811 102.568 51.229 263.268 767.134 196.650 1507.972 507.638 642.786 746.024
25 1542.329 526.142 102.568 826.859 283.161 1542.329 256.894 295.840 26.598 982.063
We can also do things like look at observed and imputed values of Sales respect to Adverts
stripplot(imp,Sales~Adverts | .imp, pch=20)
Look at the first imputed data set
dat.imp<-complete(imp)
head(dat.imp, n=10)
Adverts Sales Airplay Attract
1 10.256 330 43 7
2 985.685 120 28 7
3 1445.563 360 35 7
4 1188.193 270 33 7
5 283.161 220 44 5
6 568.954 170 19 5
7 471.814 70 20 1
8 537.352 210 22 9
9 514.068 200 21 7
10 174.093 300 44 7
Compare this to the original data frame
head(missalbum$amp [,c("Adverts", "Attract", "Sales", "Airplay")], n=10)
Adverts Attract Sales Airplay
1 10.256 NA 330 43
2 985.685 7 120 28
3 1445.563 7 360 35
4 1188.193 7 270 33
5 NA 5 220 44
6 568.954 5 170 19
7 471.814 1 70 20
8 537.352 9 210 22
9 514.068 7 200 21
10 174.093 7 300 NA
Let’s look at the sixth imputed data set
dat.imp<-complete(imp, action = 6)
head(dat.imp, n=10)
Adverts Sales Airplay Attract
1 10.256 330 43 7
2 985.685 120 28 7
3 1445.563 360 35 7
4 1188.193 270 33 7
5 917.017 220 44 5
6 568.954 170 19 5
7 471.814 70 20 1
8 537.352 210 22 9
9 514.068 200 21 7
10 174.093 300 53 7
Now let’s run our model using the imputed data
fit.sales <- with(data = imp, exp = lm(Sales ~ Adverts + Airplay + Attract))
fit.sales
call :
with.mids(data = imp, expr = lm(Sales ~ Adverts + Airplay + Attract))
call1 :
mice(data = missalbum$amp, m = 10)
nmis :
Adverts Sales Airplay Attract
24 22 28 20
analyses :
[[1]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-30.67913 0.08698 3.12035 12.90642
[[2]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-30.74956 0.09016 3.18020 12.11589
[[3]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-44.98728 0.09224 3.14329 14.71304
[[4]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-33.90017 0.08404 3.18692 13.10624
[[5]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-28.21678 0.08888 3.06767 12.11966
[[6]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-23.20231 0.08548 3.44898 10.23442
[[7]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-35.42408 0.09144 3.41801 12.18775
[[8]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-27.70089 0.09187 3.04473 11.64575
[[9]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-23.60785 0.08504 3.27392 10.92841
[[10]]
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract)
Coefficients:
(Intercept) Adverts Airplay Attract
-31.79345 0.08071 3.23530 12.92729
What happens if we run the model using the data frame that had missing data?
summary(lm(Sales ~ Adverts + Airplay + Attract, data = missalbum$amp))
Call:
lm(formula = Sales ~ Adverts + Airplay + Attract, data = missalbum$amp)
Residuals:
Min 1Q Median 3Q Max
-105.870 -26.383 -1.531 29.314 105.156
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.00573 20.18977 -0.991 0.324087
Adverts 0.08973 0.01119 8.020 1.85e-12 ***
Airplay 2.71786 0.36882 7.369 4.62e-11 ***
Attract 11.61895 2.92418 3.973 0.000132 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45.33 on 102 degrees of freedom
(94 observations deleted due to missingness)
Multiple R-squared: 0.6039, Adjusted R-squared: 0.5922
F-statistic: 51.83 on 3 and 102 DF, p-value: < 2.2e-16
Let’s compare that to what we get when we combine results across the imputed datasets.
summary(pool(fit.sales))
term estimate std.error statistic df p.value
1 (Intercept) -31.02614923 19.146690700 -1.620444 134.39584 1.074808e-01
2 Adverts 0.08768418 0.008296957 10.568234 76.49639 0.000000e+00
3 Airplay 3.21193627 0.319226644 10.061617 91.96586 2.220446e-16
4 Attract 12.28848421 2.828837713 4.344005 86.65268 3.785861e-05
pool.r.squared(fit.sales, adjusted = FALSE)
est lo 95 hi 95 fmi
R^2 0.642939 0.5313541 0.7340211 NaN