R Notebook

Missing Data and Multiple Imputation

Missing values are often the first obstacle in predictive modeling, and they occur frequently in real data. Before treating them, an analyst must understand the variables involved and decide which treatment is appropriate. The dataset chosen for this simple missing-value analysis comes from Social Explorer’s Health data, which covers a range of health-related issues. For this analysis, I would like to evaluate how Quality of Life is affected by not having insurance and by poor health. The first mode of analysis is listwise deletion, the default way most modeling functions handle missing values. It does not impute anything: any row with a missing value is simply dropped. That can lead to information loss, which in turn can bias the analysis.
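To make listwise deletion concrete, here is a minimal base R sketch. The `toy` data frame and its values are hypothetical (they are not taken from the Social Explorer file); only the column names mirror the variables used later in this notebook.

```r
# Toy data with hypothetical values, not the Social Explorer file.
toy <- data.frame(
  QualityOfLife = c(3.1, 3.8, NA, 4.2, 3.5, 4.0),
  PoorHealth    = c(12, 16, 15, 20, NA, 18),
  Insurance     = c(10, 14, 9, 21, 12, 17)
)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(toy))

# Listwise deletion: keep only the rows that have no missing value at all.
complete_rows <- toy[complete.cases(toy), ]

# lm() applies the same rule by default (na.action = na.omit),
# so the model is fit on the complete rows only.
fit <- lm(QualityOfLife ~ Insurance + PoorHealth, data = toy)
rows_used <- nobs(fit)
```

Two of the six toy rows contain an `NA`, so both `nrow(complete_rows)` and `rows_used` come out as 4. Comparing `nobs()` with `nrow()` is a quick way to see how much data a model silently dropped.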

library(reshape)
library(Amelia)
library(Zelig)
library(ZeligChoice)
library(texreg)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(nlme)
library(magrittr)
library(tidyr)
health <- read_csv("C:/Users/Meghan/Documents/health.csv")
options(dplyr.show_progress = FALSE)

health <- rename (health,
          "County" = Geo_QNAME,
          "STATEFP" = Geo_STATE,
          "QualityOfLife" = SE_T001_001,
          "PoorHealth" = SE_T002_001,
          "Insurance" = SE_T006_002)
select(health, STATEFP, QualityOfLife, PoorHealth, Insurance)

head(health)

# Drop the columns that are not used in the analysis in a single call.
health <- subset(health, select = -c(SE_T001_002, SE_T006_003, SE_T006_001,
                                     SE_NV005_001, SE_NV005_002, SE_NV005_003,
                                     Geo_COUNTY, Geo_NAME, Geo_FIPS, County))
# Treat the state code as a categorical variable.
health$STATEFP <- as.factor(health$STATEFP)

dim(health)

## [1] 3141    4

names(health)

## [1] "STATEFP"       "QualityOfLife" "PoorHealth"    "Insurance"

Listwise Deletion

The listwise deletion model is fit on 3,139 of the 3,141 counties, because the two counties with missing values are dropped. Its intercept is about 1.57, the expected number of sick days reported per month when both predictors are zero. For every one-unit increase in the uninsured measure, the number of sick days reported per month decreases by about 0.01. For every one-unit increase in poor health, the number of sick days reported increases by 0.14. The R² of this model is 0.83.

z.obe <- zelig(QualityOfLife~ Insurance + PoorHealth, model="ls", data=health, cite = F)
htmlreg(z.obe, doctype = FALSE)

Statistical models
                Model 1
(Intercept)     1.57***
                (0.02)
Insurance       -0.01***
                (0.00)
PoorHealth      0.14***
                (0.00)
R²              0.83
Adj. R²         0.83
Num. obs.       3139
RMSE            0.31
***p < 0.001, **p < 0.01, *p < 0.05
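As a worked example of reading these coefficients, the sketch below plugs values into the fitted equation. The coefficients are the unrounded estimates printed in the summary output further down in this notebook; the county values (`PoorHealth = 15`, `Insurance = 20`) are hypothetical and chosen only for illustration.

```r
# Unrounded estimates from the summary output in this notebook.
b_intercept  <- 1.569854
b_poorhealth <- 0.140268
b_insurance  <- -0.006402

# Hypothetical county, for illustration only.
poor_health <- 15
uninsured   <- 20

# Predicted sick days per month from the fitted linear equation.
predicted <- b_intercept + b_poorhealth * poor_health + b_insurance * uninsured

# Effect of raising PoorHealth by one unit while Insurance stays fixed.
predicted_plus_one <- b_intercept + b_poorhealth * (poor_health + 1) +
  b_insurance * uninsured
change <- predicted_plus_one - predicted
```

Here `predicted` is about 3.55 and `change` equals the PoorHealth coefficient, 0.14, which is what "for every one-unit increase" means in a linear model.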

Amelia

Amelia creates multiple imputations so that data with missing values can still be analyzed. It imputes m values for each missing cell in the data matrix, creating m completed data sets (five in the output below). Variables that are not used in the analysis were removed first, which is why the subset above drops all unneeded columns. STATEFP is passed to amelia as the cross-section variable (cs), and QualityOfLife is listed under logs so that it receives a log-linear transformation. After the imputation process completed, a regression analysis was performed and plotted.
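The idea of "m completed data sets" can be sketched in base R. This is not Amelia's algorithm (Amelia uses a bootstrapped EM procedure); the stand-in here is a simple regression imputation with added noise, and the data are simulated. The function name `impute_once` is hypothetical. The final step combines the m estimates with Rubin's rules, the standard way to pool results across imputations.

```r
set.seed(1)

# Simulated data (hypothetical, not the health file): y depends on x,
# and 40 of the 200 y values are set to missing.
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3)
y[sample(n, 40)] <- NA
dat <- data.frame(x = x, y = y)

m <- 5  # number of completed data sets

# Simplified stand-in for the imputation step: predict each missing y
# from x and add random noise, so the m completed data sets differ.
impute_once <- function(d) {
  fit  <- lm(y ~ x, data = d)  # fit on the observed rows only
  miss <- is.na(d$y)
  d$y[miss] <- predict(fit, newdata = d[miss, , drop = FALSE]) +
    rnorm(sum(miss), sd = sigma(fit))
  d
}
completed <- lapply(seq_len(m), function(i) impute_once(dat))

# Fit the analysis model on every completed data set.
fits <- lapply(completed, function(d) lm(y ~ x, data = d))
est  <- sapply(fits, function(f) coef(f)[["x"]])
se2  <- sapply(fits, function(f) vcov(f)["x", "x"])

# Rubin's rules: average the estimates, then combine the
# within-imputation and between-imputation variance.
pooled_est  <- mean(est)
within_var  <- mean(se2)
between_var <- var(est)
pooled_se   <- sqrt(within_var + (1 + 1 / m) * between_var)
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the uncertainty that comes from not knowing the missing values.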

a.out <- amelia(x = health, cs = "STATEFP", logs = "QualityOfLife")

## -- Imputation 1 --
## 
##   1  2
## 
## -- Imputation 2 --
## 
##   1  2
## 
## -- Imputation 3 --
## 
##  No missing data in bootstrapped sample:  EM chain unnecessary
## -- Imputation 4 --
## 
##   1  2
## 
## -- Imputation 5 --
## 
##   1  2

a.out

## 
## Amelia output with 5 imputed datasets.
## Return code:  1 
## Message:  Normal EM convergence. 
## 
## Chain Lengths:
## --------------
## Imputation 1:  2
## Imputation 2:  2
## Imputation 3:  
## Imputation 4:  2
## Imputation 5:  2

names(a.out)

##  [1] "imputations" "m"           "missMatrix"  "overvalues"  "theta"      
##  [6] "mu"          "covMatrices" "code"        "message"     "iterHist"   
## [11] "arguments"   "orig.vars"

tmp<- amelia(a.out, idvars = c("STATEFP"))

## -- Imputation 1 --
## 
##   1  2
## 
## -- Imputation 2 --
## 
##   1  2
## 
## -- Imputation 3 --
## 
##   1  2
## 
## -- Imputation 4 --
## 
##   1  2
## 
## -- Imputation 5 --
## 
##   1  2

View(tmp$imputations$imp1)
View(tmp$imputations$imp2)
View(tmp$imputations$imp3)

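# Note: data = health fits the model on the original data, so the two
# incomplete rows are dropped again. Passing the Amelia object
# (data = a.out) would fit the model on the imputed data sets instead.
z.out <- zelig(QualityOfLife ~ + PoorHealth + Insurance, model = "ls", data = health, cite = FALSE)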
summary(z.out)

## Model: 
## 
## Call:
## z5$zelig(formula = QualityOfLife ~ +PoorHealth + Insurance, data = health)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61579 -0.21142  0.02469  0.22027  0.92082 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.569854   0.021071  74.505  < 2e-16
## PoorHealth   0.140268   0.001395 100.517  < 2e-16
## Insurance   -0.006402   0.001055  -6.069 1.44e-09
## 
## Residual standard error: 0.3088 on 3136 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8261, Adjusted R-squared:  0.826 
## F-statistic:  7450 on 2 and 3136 DF,  p-value: < 2.2e-16
## 
## Next step: Use 'setx' method

summary(z.out, subset = 1)

## Model: 
## 
## Call:
## z5$zelig(formula = QualityOfLife ~ +PoorHealth + Insurance, data = health)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61579 -0.21142  0.02469  0.22027  0.92082 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.569854   0.021071  74.505  < 2e-16
## PoorHealth   0.140268   0.001395 100.517  < 2e-16
## Insurance   -0.006402   0.001055  -6.069 1.44e-09
## 
## Residual standard error: 0.3088 on 3136 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8261, Adjusted R-squared:  0.826 
## F-statistic:  7450 on 2 and 3136 DF,  p-value: < 2.2e-16
## 
## Next step: Use 'setx' method

summary(z.out, subset = 2)

## Model: 
## 
## Call:
## z5$zelig(formula = QualityOfLife ~ +PoorHealth + Insurance, data = health)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61579 -0.21142  0.02469  0.22027  0.92082 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.569854   0.021071  74.505  < 2e-16
## PoorHealth   0.140268   0.001395 100.517  < 2e-16
## Insurance   -0.006402   0.001055  -6.069 1.44e-09
## 
## Residual standard error: 0.3088 on 3136 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8261, Adjusted R-squared:  0.826 
## F-statistic:  7450 on 2 and 3136 DF,  p-value: < 2.2e-16
## 
## Next step: Use 'setx' method

z.out$setx()
z.out$sim()
plot(z.out)

Conclusion

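After examining the data through two different types of missing-value analysis, I found a couple of differences. The coefficients from the Amelia analysis appeared slightly larger than those from listwise deletion. Poor health was about the same, but insurance seemed to have more of an effect on quality of life in the Amelia analysis than under listwise deletion. However, the zelig summaries printed above were fit on health rather than on the imputed data (each reports 2 observations deleted due to missingness), so they reproduce the listwise estimates of 0.14 for PoorHealth and -0.0064 for Insurance. With only 2 of 3,141 counties incomplete, any real difference between the two methods is likely to be small.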