Missing Data and Multiple Imputation

Missing values are often the first obstacle in predictive modeling, so it is important to master the methods for handling them. Some R functions claim to handle missing values internally, but it is hard to know how well that works inside the ‘black box’. Before treating missing values, an analyst must understand the variables and decide what treatment they need. The dataset for this simple missing value analysis comes from Social Explorer’s health data. The data is meant to build awareness of the factors that affect certain health issues.

Review all of the data and subset it to the variables that will be used in the analysis. For this analysis, I would like to evaluate how the food environment index and physical inactivity relate to the percent of obese adults.

Viewing the dataset as a whole shows that a few NAs are present. For this dataset we can try two different ways of handling the missing values, and the choice can influence the model’s predictive ability. The first is listwise deletion, which drops every observation that has a missing value in any model variable. It is the default in most regression functions, but it can lead to information loss, which in turn can bias the analysis.

library(dplyr)

# Give the Social Explorer columns readable names
Obesity2 <- rename(Obesity1,
          "County" = Geo_QNAME,
          "STATEFP" = Geo_STATE,
          "Ad_Diabet" = SE_T009_001,
          "Ad_limitFds" = SE_T012_001,
          "Access_Exer_Opport" = SE_T012_002,
          "Obese_Adults" = SE_T012_003,
          "PhysInactive" = SE_T012_004,
          "FEI" = SE_T013_001)
# Preview the analysis columns; the result is not assigned, so Obesity2 is unchanged
select(Obesity2, STATEFP, Obese_Adults, PhysInactive, FEI, Ad_Diabet, Access_Exer_Opport)

head(Obesity2)

# Drop the columns that are not needed for the analysis
Obesity2 <- subset(Obesity2, select = -c(SE_NV007_001, SE_T009_002, SE_T012_005,
                                         SE_NV007_002, Geo_COUNTY, Geo_NAME,
                                         Geo_FIPS, County))
# Treat the state FIPS code as a categorical identifier
Obesity2$STATEFP <- as.factor(Obesity2$STATEFP)

dim(Obesity2)

## [1] 3141    7

names(Obesity2)

## [1] "STATEFP"            "Ad_Diabet"          "Ad_limitFds"       
## [4] "Access_Exer_Opport" "Obese_Adults"       "PhysInactive"      
## [7] "FEI"
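
Before fitting anything, it helps to count how many values are missing in each column and how many rows are fully observed. The following is a minimal base R sketch on a small made-up data frame (`toy` and its values are hypothetical, not the Obesity2 data); the same calls could be pointed at Obesity2.

```r
# Hypothetical toy data reusing three of the analysis column names
toy <- data.frame(
  Obese_Adults = c(30.1, NA, 28.4, 33.0, 25.2),
  FEI          = c(7.2, 6.8, NA, 7.9, 8.1),
  PhysInactive = c(27.0, 31.5, 24.3, NA, 20.8)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows with no missing values (the rows listwise deletion keeps)
n_complete <- sum(complete.cases(toy))

print(na_per_column)
print(n_complete)
```

Comparing the complete-case count with `nrow()` shows how many observations listwise deletion would discard.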

Listwise deletion

The listwise deletion model puts the intercept for the percent of obese adults at 18.16. For every one-unit increase in the food environment index, the percent of obese adults decreases by 0.34. For every one-unit increase in physical inactivity, the percent of obese adults increases by 0.55. The R² of this model is 0.51. The original dataset has 3141 observations, but the listwise deletion model is fitted only on the 2332 complete observations; the 809 observations with missing values are dropped.

library(Zelig)
library(texreg)
z.obe <- zelig(Obese_Adults ~ FEI + PhysInactive, model = "ls", data = Obesity2, cite = FALSE)
htmlreg(z.obe, doctype = FALSE)

Statistical models
                Model 1
(Intercept)     18.16***
                (0.60)
FEI             -0.34***
                (0.06)
PhysInactive    0.55***
                (0.01)
R²              0.51
Adj. R²         0.51
Num. obs.       2332
RMSE            3.09
*** p < 0.001, ** p < 0.01, * p < 0.05
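
To see listwise deletion at work outside Zelig, here is a small base R sketch on simulated data (the `sim` data frame and the coefficients used to generate it are invented for illustration). Under R's default options, `lm()` drops incomplete rows (`na.action = na.omit`), so the number of observations it uses should match the count of complete cases.

```r
set.seed(1)
n <- 200

# Simulated predictors and outcome (illustrative values only)
sim <- data.frame(FEI = rnorm(n, 7, 1), PhysInactive = rnorm(n, 26, 5))
sim$Obese_Adults <- 18 - 0.3 * sim$FEI + 0.55 * sim$PhysInactive + rnorm(n, 0, 3)

# Knock out some values at random
sim$FEI[sample(n, 30)] <- NA
sim$Obese_Adults[sample(n, 20)] <- NA

# lm() silently removes every row with an NA in a model variable
fit <- lm(Obese_Adults ~ FEI + PhysInactive, data = sim)

n_used     <- nobs(fit)                  # rows actually used in the fit
n_complete <- sum(complete.cases(sim))   # rows with no missing values
n_dropped  <- nrow(sim) - n_used         # rows lost to listwise deletion

print(c(used = n_used, complete = n_complete, dropped = n_dropped))
```

The gap between `nrow(sim)` and `nobs(fit)` is the information loss that the imputation approach in the next section tries to avoid.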

Amelia

Amelia uses multiple imputation to analyze data that has missing values. It imputes m values for each missing cell in the data matrix, creating m completed datasets. The program requires a few extra steps before the variables can be analyzed. First, variables that will not be used must be removed from the dataset; for this dataset, a subset was created that removes all unneeded variables. Second, we must tell Amelia how to treat the identification variable STATEFP and list in the logs argument any variables that require a log transformation.

After the imputation process was completed, a regression analysis was performed and plotted. Amelia produced five imputed datasets, and the regression summaries for the first two are shown below. These two imputations give fairly similar results. Unlike listwise deletion, each imputed dataset contains all 3141 observations, and the R² values are 0.51 and 0.50.

library(Amelia)
a.out <- amelia(x = Obesity2, cs = "STATEFP", logs = "Obese_Adults")

## -- Imputation 1 --
## 
##   1  2  3  4  5  6
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6

a.out

## 
## Amelia output with 5 imputed datasets.
## Return code:  1 
## Message:  Normal EM convergence. 
## 
## Chain Lengths:
## --------------
## Imputation 1:  6
## Imputation 2:  6
## Imputation 3:  5
## Imputation 4:  6
## Imputation 5:  6

names(a.out)

##  [1] "imputations" "m"           "missMatrix"  "overvalues"  "theta"      
##  [6] "mu"          "covMatrices" "code"        "message"     "iterHist"   
## [11] "arguments"   "orig.vars"

# Re-run the imputation from the earlier output, now passing STATEFP as an ID variable
tmp <- amelia(a.out, idvars = c("STATEFP"))

## -- Imputation 1 --
## 
##   1  2  3  4  5  6
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6

View(tmp$imputations$imp1)
View(tmp$imputations$imp2)
View(tmp$imputations$imp3)

z.out <- zelig(Obese_Adults ~ FEI + PhysInactive, model = "ls", data = tmp, cite = FALSE)
summary(z.out)

## Model: Combined Imputations 
## 
##              Estimate Std.Error z value Pr(>|z|)
## (Intercept)   17.8470    0.5741   31.09  < 2e-16
## FEI           -0.3328    0.0540   -6.16  7.1e-10
## PhysInactive   0.5672    0.0123   46.14  < 2e-16
## 
## For results from individual imputed datasets, use summary(x, subset = i:j)
## Next step: Use 'setx' method

summary(z.out, subset = 1)

## Imputed Dataset 1
## Call:
## z5$zelig(formula = Obese_Adults ~ FEI + PhysInactive, data = tmp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5960  -2.1515   0.0557   2.0655  14.3973 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  17.77463    0.51470  34.534  < 2e-16
## FEI          -0.30342    0.04815  -6.301 3.37e-10
## PhysInactive  0.56182    0.01086  51.738  < 2e-16
## 
## Residual standard error: 3.145 on 3138 degrees of freedom
## Multiple R-squared:  0.5065, Adjusted R-squared:  0.5061 
## F-statistic:  1610 on 2 and 3138 DF,  p-value: < 2.2e-16
## 
## Next step: Use 'setx' method

summary(z.out, subset = 2)

## Imputed Dataset 2
## Call:
## z5$zelig(formula = Obese_Adults ~ FEI + PhysInactive, data = tmp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6752  -2.1799   0.0365   2.0819  14.2342 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  17.66410    0.52177  33.854  < 2e-16
## FEI          -0.30073    0.04840  -6.213 5.88e-10
## PhysInactive  0.56673    0.01111  50.992  < 2e-16
## 
## Residual standard error: 3.212 on 3138 degrees of freedom
## Multiple R-squared:  0.499,  Adjusted R-squared:  0.4987 
## F-statistic:  1563 on 2 and 3138 DF,  p-value: < 2.2e-16
## 
## Next step: Use 'setx' method
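
Estimates from multiply imputed datasets are conventionally pooled with Rubin's rules, and the sketch below assumes that is roughly what the combined summary does. It uses the FEI estimates and standard errors from the two per-imputation summaries shown above. Because the other three imputations are not shown, it will not reproduce the combined output exactly; it only illustrates the arithmetic.

```r
# FEI estimates and standard errors copied from the summaries of
# imputed datasets 1 and 2 (the remaining three fits are not shown)
est <- c(-0.30342, -0.30073)
se  <- c(0.04815, 0.04840)
m   <- length(est)

# Rubin's rules
pooled_est  <- mean(est)     # pooled point estimate
within_var  <- mean(se^2)    # average within-imputation variance
between_var <- var(est)      # variance of the estimates across imputations
total_var   <- within_var + (1 + 1 / m) * between_var
pooled_se   <- sqrt(total_var)

print(c(estimate = pooled_est, std.error = pooled_se))
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the extra uncertainty that comes from not knowing the missing values.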

z.out$setx()
z.out$sim()
plot(z.out)

Conclusion

After examining the data with two different approaches to missing values, I found a couple of small differences. The coefficients from the Amelia imputations differed only slightly from those of the listwise deletion model. I believe that if my dataset were larger and had a time series variable, this method would show more differences. For this particular dataset, listwise deletion should be sufficient to see how FEI and physical inactivity relate to the percent of obese adults.