In this analysis I will look at variables that contain missing data. The dataset I am “fixing” for my missing-value analysis is the same data used in previous homeworks. It was collected through Social Explorer's Health Data Set on Obesity and includes variables such as Adult Alcohol Consumption and Free Lunches. For this analysis I would like to look at the relationship between Alcohol Consumption (Drinking Adults), Free Lunch, and Obesity.
The two types of analysis I will demonstrate are Listwise Deletion and Multiple Imputation.
* Listwise Deletion: an entire record is excluded from the analysis if any single value is missing.
* Multiple Imputation: several completed datasets are generated, with each missing value filled in by a draw from an imputation model; the analysis is then run on each completed dataset and the results are combined.
We will use this analysis to compare the results of each method; a toy sketch of the difference between the two follows.
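Here is a small sketch, using simulated toy data rather than the homework dataset, of how the two methods treat the same missing values (the object names are my own):
library(Amelia)
set.seed(42)
toy <- data.frame(obesity  = rnorm(50, 30, 4),
                  drinking = rnorm(50, 18, 3),
                  lunch    = rnorm(50, 45, 10))
toy$drinking[sample(50, 5)] <- NA   # punch a few holes in the data
toy$lunch[sample(50, 5)] <- NA
# Listwise deletion: only the fully observed rows survive.
nrow(toy[complete.cases(toy), ])
# Multiple imputation with Amelia: all 50 rows are kept, and every NA is
# replaced with a plausible draw in each of the m completed datasets.
toy.imp <- amelia(toy, m = 5, p2s = 0)   # p2s = 0 hides the progress printout
sapply(toy.imp$imputations, nrow)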
library(Amelia)
library(Zelig)
library(ZeligChoice)
library(texreg)
library(dplyr)
# Give the Social Explorer variable codes readable names.
AlcoObesity2 <- rename(AlcoObesity,
"STATEFP" = Geo_STATE,
"Fair_to_Poor_Health" = SE_T002_001,
"Current_Smokers" = SE_T011_001,
"Drinking_Adults" = SE_T011_002,
"Persons_with_Limited_Access_to_Healthy_Foods" = SE_T012_001,
"Access_to_Exercise" = SE_T012_002,
"Obese_Persons" = SE_T012_003,
"Physically_Inactive" = SE_T012_004,
"Free_Lunch" = SE_T012_005)
# Preview the columns of interest (this only prints them; nothing is dropped because the result is not assigned).
select(AlcoObesity2, STATEFP, Drinking_Adults, Persons_with_Limited_Access_to_Healthy_Foods, Obese_Persons, Free_Lunch)
head(AlcoObesity2)
# Drop the geographic identifier columns that are not needed for the models.
AlcoObesity2 <- subset(AlcoObesity2, select = -c(Geo_FIPS, Geo_NAME, Geo_QNAME, Geo_COUNTY))
# Treat the state FIPS code as a factor, since it is an identifier rather than a quantity.
AlcoObesity2$STATEFP <- as.factor(AlcoObesity2$STATEFP)
dim(AlcoObesity2)
[1] 3141 9
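Before fitting anything, it helps to see how much data is actually missing. These are quick checks I would run; note that missmap() is usually shown an amelia output object, though recent versions of Amelia should also accept a plain data frame:
colSums(is.na(AlcoObesity2))          # NA count per variable
mean(!complete.cases(AlcoObesity2))   # share of rows listwise deletion would drop
missmap(AlcoObesity2)                 # Amelia's map of where the holes are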
# Listwise-deletion benchmark: the least-squares fit drops any row with a missing value in the model variables.
z.alco <- zelig(Obese_Persons ~ Drinking_Adults + Free_Lunch, model = "ls", data = AlcoObesity2, cite = FALSE)
htmlreg(z.alco, doctype = FALSE)
|                 | Model 1  |
|-----------------|----------|
| (Intercept)     | 31.74*** |
|                 | (0.61)   |
| Drinking_Adults | -0.27*** |
|                 | (0.03)   |
| Free_Lunch      | 0.08***  |
|                 | (0.01)   |
| R2              | 0.21     |
| Adj. R2         | 0.21     |
| Num. obs.       | 2978     |
| RMSE            | 3.89     |

***p < 0.001; **p < 0.01; *p < 0.05 (standard errors in parentheses)
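Since Num. obs. is 2,978 rather than 3,141, Zelig clearly dropped every row with a missing value in these variables before fitting. As a sanity check, the same listwise-deletion fit can be reproduced with base R's lm() on the complete cases; the object names below are my own:
keep <- complete.cases(AlcoObesity2[, c("Obese_Persons", "Drinking_Adults", "Free_Lunch")])
lm.listwise <- lm(Obese_Persons ~ Drinking_Adults + Free_Lunch, data = AlcoObesity2[keep, ])
summary(lm.listwise)   # should match the coefficients in the table above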
names(AlcoObesity2)
[1] "STATEFP" "Fair_to_Poor_Health"
[3] "Current_Smokers" "Drinking_Adults"
[5] "Persons_with_Limited_Access_to_Healthy_Foods" "Access_to_Exercise"
[7] "Obese_Persons" "Physically_Inactive"
[9] "Free_Lunch"
# Multiple imputation: STATEFP is the cross-section identifier and Obese_Persons is log-transformed while imputing.
a.out <- amelia(x = AlcoObesity2, cs = "STATEFP", logs = "Obese_Persons")
-- Imputation 1 --
1 2 3 4
-- Imputation 2 --
1 2 3 4
-- Imputation 3 --
1 2 3 4
-- Imputation 4 --
1 2 3 4
-- Imputation 5 --
1 2 3 4
a.out
Amelia output with 5 imputed datasets.
Return code: 1
Message: Normal EM convergence.
Chain Lengths:
--------------
Imputation 1: 4
Imputation 2: 4
Imputation 3: 4
Imputation 4: 4
Imputation 5: 4
names(a.out)
[1] "imputations" "m" "missMatrix" "overvalues" "theta" "mu" "covMatrices" "code" "message"
[10] "iterHist" "arguments" "orig.vars"
# Re-run the imputation, this time flagging STATEFP as an id variable.
tmp <- amelia(a.out, idvars = c("STATEFP"))
-- Imputation 1 --
1 2 3 4
-- Imputation 2 --
1 2 3 4
-- Imputation 3 --
1 2 3 4
-- Imputation 4 --
1 2 3 4
-- Imputation 5 --
1 2 3 4
View(tmp$imputations$imp1)
View(tmp$imputations$imp2)
View(tmp$imputations$imp3)
head(tmp$imputations$imp1)
head(tmp$imputations$imp2)
head(tmp$imputations$imp3)
# Fit the same least-squares model to all five imputed datasets at once; Zelig combines the results.
z.out <- zelig(Obese_Persons ~ Drinking_Adults + Free_Lunch, model = "ls", data = tmp, cite = FALSE)
summary(z.out)
Model: Combined Imputations
Estimate Std.Error z value Pr(>|z|)
(Intercept) 31.50212 0.61821 51.0 <2e-16
Drinking_Adults -0.27456 0.02622 -10.5 <2e-16
Free_Lunch 0.08888 0.00545 16.3 <2e-16
For results from individual imputed datasets, use summary(x, subset = i:j)
Next step: Use 'setx' method
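The combined table above is Zelig pooling the five separate fits with Rubin's rules. The same pooling can be done by hand with Amelia's mi.meld(); the loop and object names below are my own sketch, fitting a plain lm() to each completed dataset:
b <- se <- NULL
for (i in 1:tmp$m) {
  fit <- lm(Obese_Persons ~ Drinking_Adults + Free_Lunch, data = tmp$imputations[[i]])
  b <- rbind(b, coef(fit))
  se <- rbind(se, coef(summary(fit))[, "Std. Error"])
}
mi.meld(q = b, se = se)   # pooled estimates and standard errors across the five imputations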
summary(z.out, subset = 1)
Imputed Dataset 1
Call:
z5$zelig(formula = Obese_Persons ~ Drinking_Adults + Free_Lunch,
data = tmp)
Residuals:
Min 1Q Median 3Q Max
-18.9483 -2.1229 0.3924 2.6186 11.1083
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.485617 0.601622 52.34 <2e-16
Drinking_Adults -0.274062 0.025768 -10.64 <2e-16
Free_Lunch 0.089074 0.005289 16.84 <2e-16
Residual standard error: 3.931 on 3138 degrees of freedom
Multiple R-squared: 0.2256, Adjusted R-squared: 0.2251
F-statistic: 457 on 2 and 3138 DF, p-value: < 2.2e-16
Next step: Use 'setx' method
summary(z.out, subset = 2)
Imputed Dataset 2
Call:
z5$zelig(formula = Obese_Persons ~ Drinking_Adults + Free_Lunch,
data = tmp)
Residuals:
Min 1Q Median 3Q Max
-18.9370 -2.1451 0.4029 2.6054 10.5402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.636171 0.599768 52.75 <2e-16
Drinking_Adults -0.279387 0.025714 -10.87 <2e-16
Free_Lunch 0.087731 0.005279 16.62 <2e-16
Residual standard error: 3.936 on 3138 degrees of freedom
Multiple R-squared: 0.2238, Adjusted R-squared: 0.2233
F-statistic: 452.4 on 2 and 3138 DF, p-value: < 2.2e-16
Next step: Use 'setx' method
# Set the covariates to their means, simulate quantities of interest, and plot the simulations.
z.out$setx()
z.out$sim()
plot(z.out)
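To make the simulations more concrete, specific covariate values can also be set and compared; the 10 percent versus 30 percent Free_Lunch values below are just an illustration, not figures from the assignment:
x.low <- setx(z.out, Free_Lunch = 10)
x.high <- setx(z.out, Free_Lunch = 30)
s.out <- sim(z.out, x = x.low, x1 = x.high)
summary(s.out)   # expected obesity rates under each scenario and the first difference
plot(s.out)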
So there were a couple of differences, but they were small. Listwise deletion kept 2,978 of the 3,141 rows, while the Amelia imputations used them all, yet the estimates barely moved: the intercept went from 31.74 to about 31.50, the Drinking_Adults coefficient stayed near -0.27, and the Free_Lunch coefficient went from 0.08 to about 0.09. The values Amelia filled in were also very similar across the five imputed datasets; nothing stood out as extreme. I think that with a larger dataset, say countries instead of states, and with more missing data, the two approaches would diverge more. For this dataset, either listwise deletion or the Amelia package would work.