Assignment-11

Title: Comparative Analysis of Listwise Deletion and Imputation of Missing Data

Introduction

Missing values are very common in datasets, especially in cross-sectional studies. Listwise deletion, which drops every observation that has a missing value on any analysis variable, is a common way to handle them. Because it discards information and can give inefficient or unreliable results, statisticians have increasingly moved away from it, and newer methods have been developed. Imputation, which fills in missing values with plausible estimates, is among the most widely used contemporary approaches and is generally considered effective. The purpose of this study was to compare regression results obtained under listwise deletion and under imputation.

Data

The dataset was extracted from www.ipums.org. It consisted of people aged 15 to 70 years who lived in households, and it covered the census years 2001 to 2005.

Defining variables

Dependent variable

Migration: Migration was defined as whether a person changed residence during the previous year and moved to a different state. It has two levels: migrated and not migrated. The level “migrated” consists of people who moved to a different state during the previous year. The level “not migrated” consists of people who remained in the same house or moved within the same state during that year. People who were abroad one year earlier (MIGRATE1 code 4) were excluded.
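The coding rule above can be sketched in base R as follows. The snippet uses a made-up vector of MIGRATE1 codes and assumes the coding implied by the definition (1 = same house, 2 = moved within state, 3 = moved between states, 4 = abroad one year earlier); the object names are illustrative only.

```r
# Hypothetical MIGRATE1 codes for seven people (not taken from the extract)
migrate1 <- c(1, 2, 3, 1, 4, 3, 2)

# Exclude people who were abroad one year earlier (code 4)
migrate1 <- migrate1[migrate1 != 4]

# Code 3 (moved between states) counts as migrated; codes 1 and 2 do not
migrated <- factor(ifelse(migrate1 == 3, "Migrated", "Not migrated"),
                   levels = c("Not migrated", "Migrated"))

table(migrated)
```

Putting "Not migrated" first makes it the reference level, so a binomial model fitted to this factor estimates the probability of migrating.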

Independent variables

Family income status: It has two codes: “0” indicates that the person is the only earning member of the family, and “1” indicates that the person is not the only earner. Total personal pre-tax income was subtracted from total pre-tax family income to determine the person’s status in the family earnings. If the person’s total personal income equaled total family income, the person was coded “0”; if total family income was greater than total personal income, the person was coded “1”. Note that negative personal and family incomes (losses) were excluded before the subtraction.
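As a small worked example of this rule, the following base R sketch codes four invented people. The income values and the data frame name are hypothetical; only the variable names INCTOT and FTOTINC come from the extract.

```r
# Hypothetical pre-tax incomes in dollars (illustrative values only)
income <- data.frame(
  INCTOT  = c(30000, 12000, 0, 45000),      # total personal income
  FTOTINC = c(30000, 52000, 61000, 45000)   # total family income
)

# Difference between family and personal income
income$inc_diff <- income$FTOTINC - income$INCTOT

# 0  = sole earner (family income equals personal income)
# 1  = other earners (family income exceeds personal income)
# NA = neither rule applies (personal income above family income)
income$Family_income <- ifelse(income$inc_diff == 0, 0,
                               ifelse(income$inc_diff > 0, 1, NA))

income
```

The first and last person are coded 0 because their personal income accounts for all family income; the other two are coded 1.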

Employment status: It indicates whether the respondent was part of the labor force (working or seeking work) or not.

Educational attainment: It indicates respondents’ educational attainment, as measured by the highest year of school or degree completed.

Control variables

Age: Age reports the person’s age in years as of the last birthday.

Race: The variable “race” indicates the person’s major race group: White, Black, Asian, or Other.

Sex: Sex reports whether the person was male or female.

Data Analysis

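library(tidyverse)
# Read the IPUMS extract
mydata <- read_csv("usa_00051.csv")
# Exclude CITIZEN code 0, negative family or personal income, and MIGRATE1 code 4
mydata <- filter(mydata, CITIZEN != 0, FTOTINC >= 0, INCTOT >= 0, MIGRATE1 != 4)
# Keep only the analysis columns (selected by position in the extract)
mydata1 <- mydata[c(1, 6, 13, 14, 16, 19, 21, 24:27)]
summary(mydata1)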
##       YEAR         STATEFIP         SEX             AGE       
##  Min.   :2001   Min.   : 1.0   Min.   :1.000   Min.   :15.00  
##  1st Qu.:2002   1st Qu.: 6.0   1st Qu.:1.000   1st Qu.:30.00  
##  Median :2004   Median :18.0   Median :2.000   Median :40.00  
##  Mean   :2004   Mean   :23.5   Mean   :1.516   Mean   :40.91  
##  3rd Qu.:2005   3rd Qu.:36.0   3rd Qu.:2.000   3rd Qu.:51.00  
##  Max.   :2005   Max.   :56.0   Max.   :2.000   Max.   :70.00  
##       RACE            EDUC           EMPSTAT          INCTOT       
##  Min.   :1.000   Min.   : 0.000   Min.   :0.000   Min.   :      0  
##  1st Qu.:1.000   1st Qu.: 5.000   1st Qu.:1.000   1st Qu.:   5200  
##  Median :2.000   Median : 6.000   Median :1.000   Median :  18000  
##  Mean   :3.403   Mean   : 6.546   Mean   :1.627   Mean   :  29596  
##  3rd Qu.:6.000   3rd Qu.:10.000   3rd Qu.:3.000   3rd Qu.:  38000  
##  Max.   :9.000   Max.   :11.000   Max.   :3.000   Max.   :9999999  
##     FTOTINC             SEI           MIGRATE1   
##  Min.   :      0   Min.   : 0.00   Min.   :1.00  
##  1st Qu.:  26000   1st Qu.:10.00   1st Qu.:1.00  
##  Median :  49800   Median :23.00   Median :1.00  
##  Mean   :  66623   Mean   :33.57   Mean   :1.17  
##  3rd Qu.:  85000   3rd Qu.:61.00   3rd Qu.:1.00  
##  Max.   :1536000   Max.   :96.00   Max.   :3.00
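# Recode special values as missing: 9999999 in INCTOT and 0 in EMPSTAT
mydata1$INCTOT[mydata1$INCTOT == 9999999] <- NA
mydata1$EMPSTAT[mydata1$EMPSTAT == 0] <- NA
# Collapse the detailed codes into analysis categories
mydata2 <- mydata1 %>%
  mutate(Race = sjmisc::rec(RACE, rec = "1=1; 2=2; 4:6=3; 3=4; 7:9=4")) %>%
  mutate(Education = sjmisc::rec(EDUC, rec = "0:2=1; 3:6=2; 7:9=3; 10:11=4")) %>%
  # MIGRATE1 code 3 (moved between states) is the event; codes 1-2 are non-migrants
  mutate(Migrate = sjmisc::rec(MIGRATE1, rec = "1=0; 2=0; 3=1")) %>%
  # EMPSTAT codes 1-2 are in the labor force; code 3 is not
  mutate(Employed = sjmisc::rec(EMPSTAT, rec = "1=1; 2=1; 3=0"))
mydata2$SEX <- factor(mydata2$SEX, levels = c(1, 2),
                      labels = c("Male", "Female"))
mydata2$Race <- factor(mydata2$Race, levels = c(1, 2, 3, 4),
                       labels = c("White", "Black", "Asian", "Other"))
mydata2$Education <- factor(mydata2$Education, levels = c(1, 2, 3, 4),
                            labels = c("Junior school or less", "High School", "college 3y or less", "college 4y or more"))
# "Not migrated" is the reference level, so the models estimate the probability of migrating
mydata2$Migrate <- factor(mydata2$Migrate, levels = c(0, 1),
                          labels = c("Not migrated", "Migrated"))
mydata2 <- mydata2 %>%
  # Family income minus personal income, in thousands of dollars
  mutate(inc_diff = (FTOTINC - INCTOT) / 1000) %>%
  # 0 = sole earner, 1 = other earners; values outside the listed ranges become NA
  mutate(Family_income = sjmisc::rec(inc_diff, rec = "0.00=0; 0.01:998.20=1")) %>%
  # Center age at the sample median
  mutate(c_age = AGE - median(AGE))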
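Before choosing between deletion and imputation, it helps to see how many values are missing and how many rows listwise deletion would drop. The following is a minimal base R sketch on a toy data frame; the values are invented, and in this analysis the same calls would be applied to mydata2.

```r
# Toy data frame with a few missing values (invented for illustration)
toy <- data.frame(
  Employed      = c(1, 0, NA, 1, 1, NA),
  Family_income = c(1, 1, 0, NA, 1, 1),
  c_age         = c(-5, 0, 12, 3, -20, 7)
)

# Number of missing values per variable
colSums(is.na(toy))

# Rows that are complete on every variable
sum(complete.cases(toy))

# na.omit() drops every row that has at least one NA
nrow(na.omit(toy))
```

Here three of the six rows contain at least one NA, so listwise deletion keeps only three rows even though each variable on its own is mostly observed.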

Method-1: Listwise Deletion

listwisedeletion<-na.omit(mydata2)
m1 <- glm(Migrate~Family_income+Race+SEX+c_age+Employed+Education, family = binomial, data = listwisedeletion)
m2 <- glm(Migrate~Family_income*Employed+Race+SEX+c_age+Education, family = binomial, data = listwisedeletion)
m3 <- glm(Migrate~Family_income*Employed*SEX+Race+c_age+Education, family = binomial, data = listwisedeletion)
library(texreg)
htmlreg(list(m1, m2, m3))
Statistical models
                                    Model 1      Model 2      Model 3
(Intercept)                         -3.60***     -3.75***     -3.73***
                                    (0.04)       (0.04)       (0.05)
Family_income                       -0.60***     -0.40***     -0.65***
                                    (0.02)       (0.04)       (0.06)
RaceBlack                           0.10**       0.10**       0.11***
                                    (0.03)       (0.03)       (0.03)
RaceAsian                           0.06**       0.07**       0.07***
                                    (0.02)       (0.02)       (0.02)
RaceOther                           -0.02        -0.02        -0.02
                                    (0.02)       (0.02)       (0.02)
SEXFemale                           -0.04*       -0.05**      -0.10
                                    (0.02)       (0.02)       (0.06)
c_age                               -0.03***     -0.03***     -0.03***
                                    (0.00)       (0.00)       (0.00)
Employed                            -0.26***     -0.07*       -0.07
                                    (0.02)       (0.04)       (0.05)
EducationHigh School                0.16***      0.16***      0.17***
                                    (0.03)       (0.03)       (0.03)
Educationcollege 3y or less         0.37***      0.38***      0.39***
                                    (0.04)       (0.04)       (0.04)
Educationcollege 4y or more         0.90***      0.90***      0.90***
                                    (0.03)       (0.03)       (0.03)
Family_income:Employed                           -0.26***     0.00
                                                 (0.04)       (0.06)
Family_income:SEXFemale                                       0.36***
                                                              (0.08)
Employed:SEXFemale                                            -0.02
                                                              (0.07)
Family_income:Employed:SEXFemale                              -0.35***
                                                              (0.09)
AIC                                 134840.41    134802.92    134737.98
BIC                                 134965.18    134939.04    134908.13
Log Likelihood                      -67409.20    -67389.46    -67353.99
Deviance                            134818.41    134778.92    134707.98
Num. obs.                           623619       623619       623619
*** p < 0.001, ** p < 0.01, * p < 0.05
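Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for a few Model 1 coefficients copied from the table above; with the fitted object available, exp(coef(m1)) would give the same quantities at full precision.

```r
# Selected Model 1 logit coefficients, copied from the table (rounded to 2 decimals)
b <- c(Family_income = -0.60,
       Employed = -0.26,
       `Educationcollege 4y or more` = 0.90)

# Odds ratio = exp(coefficient)
odds_ratios <- exp(b)
round(odds_ratios, 2)
```

An odds ratio below 1 (as for Family_income) means lower odds of the outcome than in the reference category, and a ratio above 1 (as for four or more years of college) means higher odds.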
library(Zelig)
library(ZeligChoice)
library(survival)
z.listwise<-zelig(Migrate~Family_income*Employed*SEX+Race+c_age+Education, model="probit", data=listwisedeletion, cite = F)
htmlreg(z.listwise, doctype = FALSE)
Statistical models
                                    Model 1
(Intercept)                         -1.96***
                                    (0.02)
Family_income                       -0.27***
                                    (0.02)
Employed                            -0.04
                                    (0.02)
SEXFemale                           -0.05
                                    (0.03)
RaceBlack                           0.05***
                                    (0.01)
RaceAsian                           0.03***
                                    (0.01)
RaceOther                           -0.01
                                    (0.01)
c_age                               -0.01***
                                    (0.00)
EducationHigh School                0.07***
                                    (0.01)
Educationcollege 3y or less         0.16***
                                    (0.01)
Educationcollege 4y or more         0.37***
                                    (0.01)
Family_income:Employed              -0.01
                                    (0.03)
Family_income:SEXFemale             0.15***
                                    (0.03)
Employed:SEXFemale                  -0.01
                                    (0.03)
Family_income:Employed:SEXFemale    -0.14***
                                    (0.04)
AIC                                 134857.33
BIC                                 135027.48
Log Likelihood                      -67413.66
Deviance                            134827.33
Num. obs.                           623619
*** p < 0.001, ** p < 0.01, * p < 0.05
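Probit coefficients are on the scale of a standard normal index, so a predicted probability is obtained by passing the linear predictor through the normal cumulative distribution function. The worked example below uses two coefficients copied from the table above. It considers a person in every reference category (male, White, junior school or less, median age, Family_income = 0, Employed = 0) and the same person with four or more years of college; the object names are illustrative.

```r
# Probit coefficients copied from the table (rounded to 2 decimals)
intercept  <- -1.96
b_college4 <- 0.37

# Predicted probability = pnorm(linear predictor)
p_reference <- pnorm(intercept)
p_college4  <- pnorm(intercept + b_college4)

round(c(reference = p_reference, college_4y = p_college4), 3)
```

Because the probit link is nonlinear, the same coefficient changes the predicted probability by different amounts depending on the values of the other covariates.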

Method-2: Imputation

library(Amelia)
mydata4<-mydata2[c(1:3,5:7,10, 11:18)]
summary(mydata4)
##       YEAR         STATEFIP        SEX              RACE      
##  Min.   :2001   Min.   : 1.0   Male  :308087   Min.   :1.000  
##  1st Qu.:2002   1st Qu.: 6.0   Female:328559   1st Qu.:1.000  
##  Median :2004   Median :18.0                   Median :2.000  
##  Mean   :2004   Mean   :23.5                   Mean   :3.403  
##  3rd Qu.:2005   3rd Qu.:36.0                   3rd Qu.:6.000  
##  Max.   :2005   Max.   :56.0                   Max.   :9.000  
##                                                               
##       EDUC           EMPSTAT           SEI           MIGRATE1   
##  Min.   : 0.000   Min.   :1.000   Min.   : 0.00   Min.   :1.00  
##  1st Qu.: 5.000   1st Qu.:1.000   1st Qu.:10.00   1st Qu.:1.00  
##  Median : 6.000   Median :1.000   Median :23.00   Median :1.00  
##  Mean   : 6.546   Mean   :1.644   Mean   :33.57   Mean   :1.17  
##  3rd Qu.:10.000   3rd Qu.:3.000   3rd Qu.:61.00   3rd Qu.:1.00  
##  Max.   :11.000   Max.   :3.000   Max.   :96.00   Max.   :3.00  
##                   NA's   :6479                                  
##     Race                        Education              Migrate      
##  White:303960   Junior school or less: 99483   Migrated    :621449  
##  Black: 43636   High School          :253072   Not migrated: 15197  
##  Asian:166910   college 3y or less   :105662                        
##  Other:122140   college 4y or more   :178429                        
##                                                                     
##                                                                     
##                                                                     
##     Employed        inc_diff       Family_income       c_age         
##  Min.   :0.000   Min.   :-419.00   Min.   :0.000   Min.   :-25.0000  
##  1st Qu.:0.000   1st Qu.:   0.80   1st Qu.:1.000   1st Qu.:-10.0000  
##  Median :1.000   Median :  24.00   Median :1.000   Median :  0.0000  
##  Mean   :0.703   Mean   :  37.04   Mean   :0.767   Mean   :  0.9082  
##  3rd Qu.:1.000   3rd Qu.:  51.00   3rd Qu.:1.000   3rd Qu.: 11.0000  
##  Max.   :1.000   Max.   : 998.20   Max.   :1.000   Max.   : 30.0000  
##  NA's   :6479    NA's   :1         NA's   :6550
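The call that creates a.out, which the next step uses, does not appear above. The following is only a sketch of how multiple imputation with Amelia is typically run, shown on an invented toy data frame. The number of imputations (m = 5), the variables declared nominal, and all object names are assumptions, not the settings used in this analysis.

```r
library(Amelia)

set.seed(11)
n <- 500

# Invented toy data: two continuous and two binary variables
toy <- data.frame(
  c_age         = rnorm(n, mean = 0, sd = 12),
  SEI           = runif(n, min = 0, max = 96),
  Employed      = rbinom(n, 1, 0.7),
  Family_income = rbinom(n, 1, 0.75)
)

# Introduce some missing values at random
toy$Employed[sample(n, 25)] <- NA
toy$Family_income[sample(n, 25)] <- NA

# Create m = 5 imputed datasets; 'noms' marks the binary variables as nominal
# so imputed values stay in {0, 1}; p2s = 0 silences progress output
a.toy <- amelia(toy, m = 5, noms = c("Employed", "Family_income"), p2s = 0)

# Each element of a.toy$imputations is a completed copy of the data
length(a.toy$imputations)
sapply(a.toy$imputations, function(d) sum(is.na(d)))
```

With real data, identifier columns that should not enter the imputation model (for example a state code) can be passed through the idvars argument, and factor columns need to be declared via noms, ords, or idvars.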
z.out <- zelig(Migrate~Family_income*Employed*SEX+Race+c_age+Education, model="probit", data = a.out, cite = FALSE)
summary(z.out)
## Model: Combined Imputations 
## 
##                                   Estimate Std.Error z value Pr(>|z|)
## (Intercept)                      -1.966947  0.022842  -86.11  < 2e-16
## Family_income                    -0.274294  0.025055  -10.95  < 2e-16
## Employed                         -0.034635  0.021809   -1.59  0.11227
## SEXFemale                        -0.049621  0.028028   -1.77  0.07665
## RaceBlack                         0.051982  0.013905    3.74  0.00019
## RaceAsian                         0.029194  0.008498    3.44  0.00059
## RaceOther                        -0.005048  0.009999   -0.50  0.61367
## c_age                            -0.013071  0.000278  -46.95  < 2e-16
## EducationHigh School              0.070660  0.012204    5.79  7.0e-09
## Educationcollege 3y or less       0.162684  0.013883   11.72  < 2e-16
## Educationcollege 4y or more       0.376257  0.012582   29.90  < 2e-16
## Family_income:Employed           -0.004148  0.027370   -0.15  0.87954
## Family_income:SEXFemale           0.148755  0.032748    4.54  5.6e-06
## Employed:SEXFemale               -0.006252  0.031778   -0.20  0.84404
## Family_income:Employed:SEXFemale -0.142801  0.037641   -3.79  0.00015
## 
## For results from individual imputed datasets, use summary(x, subset = i:j)
## Next step: Use 'setx' method
x.out <- setx(z.out)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
s.out <- sim(z.out, x = x.out)
plot(s.out)
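The "Combined Imputations" table pools the estimates from the separate imputed datasets. The standard way to do this is Rubin's rules: average the point estimates, then combine the within-imputation and between-imputation variances. Below is a base R sketch for a single coefficient. The per-imputation numbers are invented, and the code is not claimed to reproduce Zelig's internal computation.

```r
# Hypothetical estimates and standard errors of one coefficient
# from m = 5 imputed datasets (invented numbers)
est <- c(-0.274, -0.276, -0.273, -0.275, -0.272)
se  <- c(0.0250, 0.0251, 0.0249, 0.0250, 0.0250)
m   <- length(est)

# Pooled point estimate: mean of the per-imputation estimates
q_bar <- mean(est)

# Within-imputation variance: mean of the squared standard errors
w <- mean(se^2)

# Between-imputation variance: variance of the estimates across imputations
b <- var(est)

# Total variance and pooled standard error
t_var     <- w + (1 + 1 / m) * b
pooled_se <- sqrt(t_var)

c(estimate = q_bar, std.error = pooled_se)
```

The between-imputation term is what carries the extra uncertainty due to the missing data: the more the estimates vary across imputed datasets, the larger the pooled standard error.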

Comparative Analysis of Results from Listwise Deletion and Imputation

The regression results from the dataset produced by listwise deletion and from the datasets produced by imputation were compared on estimated coefficients, standard errors, z-values, and significance levels. Because the number of missing values (“Employed” = 6479; “Family_income” = 6550) was small relative to the total number of observations (636646), about 1% for each variable, the estimated coefficients differed only very slightly between the two approaches. However, standard errors tended to be slightly lower with imputation than with listwise deletion.
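The comparison can be tabulated directly. The sketch below places a few probit coefficients and standard errors from the two outputs side by side. The listwise values are copied from the texreg table, which rounds to two decimals, so the comparison is necessarily coarse. It also computes the share of rows lost to listwise deletion from the reported sample sizes. The data frame and column names are illustrative.

```r
# Values copied from the two probit outputs above
cmp <- data.frame(
  term         = c("Family_income", "Employed", "SEXFemale", "RaceBlack"),
  est_listwise = c(-0.27, -0.04, -0.05, 0.05),
  se_listwise  = c(0.02, 0.02, 0.03, 0.01),
  est_imputed  = c(-0.274294, -0.034635, -0.049621, 0.051982),
  se_imputed   = c(0.025055, 0.021809, 0.028028, 0.013905)
)

# Difference in point estimates between the two approaches
cmp$est_diff <- cmp$est_imputed - cmp$est_listwise
cmp

# Rows before deletion (from the summary) and after (Num. obs. in the tables)
n_total    <- 636646
n_listwise <- 623619
n_dropped  <- n_total - n_listwise

# Percent of rows dropped by listwise deletion
round(100 * n_dropped / n_total, 2)
```

Re-running the listwise model with more printed digits (for example via summary() on the fitted object) would allow a finer comparison of the standard errors than the rounded table permits.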