Missing values are very common in datasets, especially in cross-sectional studies. Listwise deletion is a common way of dealing with missing data, but because it is inefficient and can give unreliable results, statisticians increasingly avoid it. Newer methods have been developed in its place. Imputation is among the more recent of these and is widely considered effective. The purpose of this study was to compare regression results obtained with listwise deletion and with imputation.
The dataset was extracted from www.ipums.org. It consisted of people aged 15 to 70 years who lived in households, and it covered the census years 2001 to 2005.

## Defining variables
Migration: Migration was defined as whether or not a person had changed residence in the previous year and moved to a different state. It has two levels: migrated and not migrated. The level “migrated” consists of people who moved to a different state in the previous year. The level “not migrated” consists of people who stayed in the same house or moved within a state in the previous year. People who had been living abroad one year earlier were excluded.
Family income status: It has two codes: “0” indicates that the person is the only earner in the family, and “1” indicates that the person is not the only earner. Total personal pre-tax income was subtracted from total pre-tax family income to determine the person’s status in the family’s earnings. If total personal income equaled total family income, the person was coded “0”; if total family income was greater than total personal income, the person was coded “1”. Note that negative personal and family incomes (losses) were excluded before the subtraction.
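The coding rule can be sketched on a few hypothetical people. The values and the names `personal`, `family`, and `family_income` are illustrative only, and base R `ifelse()` stands in here for the `sjmisc::rec()` call used in the analysis itself.

```r
# Hypothetical incomes in dollars; not taken from the IPUMS extract
personal <- c(20000, 15000, 40000)
family   <- c(20000, 55000, 65000)

# Difference in thousands of dollars, as in the analysis
inc_diff <- (family - personal) / 1000

# 0 = sole earner (family income equals personal income)
# 1 = other earners present (family income exceeds personal income)
# Anything else (a negative difference) is left missing
family_income <- ifelse(inc_diff == 0, 0L,
                        ifelse(inc_diff > 0, 1L, NA_integer_))
family_income
```

The first person is coded 0 and the other two are coded 1.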
Employment status: It indicates whether or not the respondent was part of the labor force (working or seeking work).
Educational attainment: It indicates respondents’ educational attainment, as measured by the highest year of school or degree completed.
Age: Age reports the person’s age in years as of the last birthday.
Race: The variable “race” indicates the person’s major race group: White, Black, Asian, or Other.
Sex: Sex reports whether the person was male or female.
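library(tidyverse)
# Read the IPUMS extract
mydata <- read_csv("usa_00051.csv")
# Drop records with CITIZEN code 0, negative incomes (losses), and MIGRATE1 code 4 (abroad)
mydata <- filter(mydata, !(CITIZEN == 0), !(FTOTINC < 0), !(INCTOT < 0), !(MIGRATE1 == 4))
# Keep only the columns used in the analysis
mydata1 <- mydata[c(1, 6, 13, 14, 16, 19, 21, 24:27)]
summary(mydata1)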
## YEAR STATEFIP SEX AGE
## Min. :2001 Min. : 1.0 Min. :1.000 Min. :15.00
## 1st Qu.:2002 1st Qu.: 6.0 1st Qu.:1.000 1st Qu.:30.00
## Median :2004 Median :18.0 Median :2.000 Median :40.00
## Mean :2004 Mean :23.5 Mean :1.516 Mean :40.91
## 3rd Qu.:2005 3rd Qu.:36.0 3rd Qu.:2.000 3rd Qu.:51.00
## Max. :2005 Max. :56.0 Max. :2.000 Max. :70.00
## RACE EDUC EMPSTAT INCTOT
## Min. :1.000 Min. : 0.000 Min. :0.000 Min. : 0
## 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:1.000 1st Qu.: 5200
## Median :2.000 Median : 6.000 Median :1.000 Median : 18000
## Mean :3.403 Mean : 6.546 Mean :1.627 Mean : 29596
## 3rd Qu.:6.000 3rd Qu.:10.000 3rd Qu.:3.000 3rd Qu.: 38000
## Max. :9.000 Max. :11.000 Max. :3.000 Max. :9999999
## FTOTINC SEI MIGRATE1
## Min. : 0 Min. : 0.00 Min. :1.00
## 1st Qu.: 26000 1st Qu.:10.00 1st Qu.:1.00
## Median : 49800 Median :23.00 Median :1.00
## Mean : 66623 Mean :33.57 Mean :1.17
## 3rd Qu.: 85000 3rd Qu.:61.00 3rd Qu.:1.00
## Max. :1536000 Max. :96.00 Max. :3.00
# Treat the sentinel codes (9999999 in INCTOT, 0 in EMPSTAT) as missing
mydata1$INCTOT[mydata1$INCTOT == 9999999] <- NA
mydata1$EMPSTAT[mydata1$EMPSTAT == 0] <- NA
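# Collapse the detailed IPUMS codes into the analysis categories
mydata2 <- mydata1 %>%
  mutate(Race = sjmisc::rec(RACE, rec = "1=1; 2=2; 4:6=3; 3=4;7:9=4 ")) %>%
  mutate(Education = sjmisc::rec(EDUC, rec = "0:2=1; 3:6=2; 7:9=3; 10:11=4")) %>%
  mutate(Migrate = sjmisc::rec(MIGRATE1, rec = "1=1; 2=1; 3=0")) %>%
  mutate(Employed = sjmisc::rec(EMPSTAT, rec = "1=1; 2=1; 3=0"))
# Convert the recoded variables to labelled factors
mydata2$SEX <- factor(mydata2$SEX, levels = c(1, 2),
                      labels = c("Male", "Female"))
mydata2$Race <- factor(mydata2$Race, levels = c(1, 2, 3, 4),
                       labels = c("White", "Black", "Asian", "Other"))
mydata2$Education <- factor(mydata2$Education, levels = c(1, 2, 3, 4),
                            labels = c("Junior school or less", "High School", "college 3y or less", "college 4y or more"))
# NOTE: in the IPUMS codebook MIGRATE1 codes 1-2 are "same house" and "moved within state"
# and code 3 is "moved between states", so the value 1 here holds the people who did not
# change state. The labels below therefore look swapped relative to the definition in the text.
# glm() models the second factor level, which holds the between-state movers.
mydata2$Migrate <- factor(mydata2$Migrate, levels = c(1, 0),
                          labels = c("Migrated", "Not migrated"))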
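mydata2 <- mydata2 %>%
  # Family income minus personal income, in thousands of dollars
  mutate(inc_diff = (FTOTINC - INCTOT) / 1000) %>%
  # 0 = sole earner, 1 = other earners present; values outside the listed ranges become NA
  mutate(Family_income = sjmisc::rec(inc_diff, rec = "0.00=0; 0.01:998.20=1 ")) %>%
  # Center age at its median
  mutate(c_age = AGE - median(AGE))
# Listwise deletion: drop every row that has at least one missing value
listwisedeletion <- na.omit(mydata2)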
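What `na.omit()` does to a data frame can be seen on a small example. The data frame `toy` and its values are hypothetical and only illustrate the mechanics of listwise deletion.

```r
# Hypothetical toy data; not the IPUMS extract
toy <- data.frame(
  age      = c(25, 40, 33, 58, 61),
  employed = c(1, NA, 0, 1, 1),
  income   = c(30, 45, NA, 52, 20)
)

# Flag rows with at least one missing value
incomplete <- !complete.cases(toy)

# Listwise deletion keeps only the fully observed rows
toy_complete <- na.omit(toy)

# Share of rows lost to deletion
share_dropped <- mean(incomplete)

nrow(toy_complete)
share_dropped
```

Two of the five rows are dropped even though each has only one missing cell. This is why listwise deletion can discard much more information than the number of missing values alone suggests.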
m1 <- glm(Migrate~Family_income+Race+SEX+c_age+Employed+Education, family = binomial, data = listwisedeletion)
m2 <- glm(Migrate~Family_income*Employed+Race+SEX+c_age+Education, family = binomial, data = listwisedeletion)
m3 <- glm(Migrate~Family_income*Employed*SEX+Race+c_age+Education, family = binomial, data = listwisedeletion)
library(texreg)
htmlreg(list(m1, m2, m3))
| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | -3.60*** | -3.75*** | -3.73*** |
| | (0.04) | (0.04) | (0.05) |
| Family_income | -0.60*** | -0.40*** | -0.65*** |
| | (0.02) | (0.04) | (0.06) |
| RaceBlack | 0.10** | 0.10** | 0.11*** |
| | (0.03) | (0.03) | (0.03) |
| RaceAsian | 0.06** | 0.07** | 0.07*** |
| | (0.02) | (0.02) | (0.02) |
| RaceOther | -0.02 | -0.02 | -0.02 |
| | (0.02) | (0.02) | (0.02) |
| SEXFemale | -0.04* | -0.05** | -0.10 |
| | (0.02) | (0.02) | (0.06) |
| c_age | -0.03*** | -0.03*** | -0.03*** |
| | (0.00) | (0.00) | (0.00) |
| Employed | -0.26*** | -0.07* | -0.07 |
| | (0.02) | (0.04) | (0.05) |
| EducationHigh School | 0.16*** | 0.16*** | 0.17*** |
| | (0.03) | (0.03) | (0.03) |
| Educationcollege 3y or less | 0.37*** | 0.38*** | 0.39*** |
| | (0.04) | (0.04) | (0.04) |
| Educationcollege 4y or more | 0.90*** | 0.90*** | 0.90*** |
| | (0.03) | (0.03) | (0.03) |
| Family_income:Employed | | -0.26*** | 0.00 |
| | | (0.04) | (0.06) |
| Family_income:SEXFemale | | | 0.36*** |
| | | | (0.08) |
| Employed:SEXFemale | | | -0.02 |
| | | | (0.07) |
| Family_income:Employed:SEXFemale | | | -0.35*** |
| | | | (0.09) |
| AIC | 134840.41 | 134802.92 | 134737.98 |
| BIC | 134965.18 | 134939.04 | 134908.13 |
| Log Likelihood | -67409.20 | -67389.46 | -67353.99 |
| Deviance | 134818.41 | 134778.92 | 134707.98 |
| Num. obs. | 623619 | 623619 | 623619 |

*** p < 0.001, ** p < 0.01, * p < 0.05
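The three logit models are nested, so they can also be compared with likelihood-ratio tests. The sketch below works only from the deviances and AIC values reported in the table. The parameter counts in `k` are inferred from AIC minus deviance (which equals twice the number of parameters), and the object names are illustrative.

```r
# Deviance and AIC as reported in the table
dev <- c(m1 = 134818.41, m2 = 134778.92, m3 = 134707.98)
aic <- c(m1 = 134840.41, m2 = 134802.92, m3 = 134737.98)

# Number of estimated parameters: AIC = deviance + 2k
k <- round((aic - dev) / 2)

# Likelihood-ratio statistics for m1 vs m2 and m2 vs m3
lr_stat <- -diff(dev)
lr_df   <- diff(k)

# Chi-square p-values
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

data.frame(comparison = c("m1 vs m2", "m2 vs m3"),
           lr_stat = unname(lr_stat), df = unname(lr_df), p_value = unname(p_value))
```

Both tests give p-values well below 0.001, in line with the drop in AIC and BIC from Model 1 to Model 3. With the fitted objects at hand, `anova(m1, m2, m3, test = "Chisq")` would make the same comparison directly.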
library(Zelig)
library(ZeligChoice)
library(survival)
z.listwise<-zelig(Migrate~Family_income*Employed*SEX+Race+c_age+Education, model="probit", data=listwisedeletion, cite = F)
htmlreg(z.listwise, doctype = FALSE)
| | Model 1 |
|---|---|
| (Intercept) | -1.96*** |
| | (0.02) |
| Family_income | -0.27*** |
| | (0.02) |
| Employed | -0.04 |
| | (0.02) |
| SEXFemale | -0.05 |
| | (0.03) |
| RaceBlack | 0.05*** |
| | (0.01) |
| RaceAsian | 0.03*** |
| | (0.01) |
| RaceOther | -0.01 |
| | (0.01) |
| c_age | -0.01*** |
| | (0.00) |
| EducationHigh School | 0.07*** |
| | (0.01) |
| Educationcollege 3y or less | 0.16*** |
| | (0.01) |
| Educationcollege 4y or more | 0.37*** |
| | (0.01) |
| Family_income:Employed | -0.01 |
| | (0.03) |
| Family_income:SEXFemale | 0.15*** |
| | (0.03) |
| Employed:SEXFemale | -0.01 |
| | (0.03) |
| Family_income:Employed:SEXFemale | -0.14*** |
| | (0.04) |
| AIC | 134857.33 |
| BIC | 135027.48 |
| Log Likelihood | -67413.66 |
| Deviance | 134827.33 |
| Num. obs. | 623619 |

*** p < 0.001, ** p < 0.01, * p < 0.05
library(Amelia)
mydata4<-mydata2[c(1:3,5:7,10, 11:18)]
summary(mydata4)
## YEAR STATEFIP SEX RACE
## Min. :2001 Min. : 1.0 Male :308087 Min. :1.000
## 1st Qu.:2002 1st Qu.: 6.0 Female:328559 1st Qu.:1.000
## Median :2004 Median :18.0 Median :2.000
## Mean :2004 Mean :23.5 Mean :3.403
## 3rd Qu.:2005 3rd Qu.:36.0 3rd Qu.:6.000
## Max. :2005 Max. :56.0 Max. :9.000
##
## EDUC EMPSTAT SEI MIGRATE1
## Min. : 0.000 Min. :1.000 Min. : 0.00 Min. :1.00
## 1st Qu.: 5.000 1st Qu.:1.000 1st Qu.:10.00 1st Qu.:1.00
## Median : 6.000 Median :1.000 Median :23.00 Median :1.00
## Mean : 6.546 Mean :1.644 Mean :33.57 Mean :1.17
## 3rd Qu.:10.000 3rd Qu.:3.000 3rd Qu.:61.00 3rd Qu.:1.00
## Max. :11.000 Max. :3.000 Max. :96.00 Max. :3.00
## NA's :6479
## Race Education Migrate
## White:303960 Junior school or less: 99483 Migrated :621449
## Black: 43636 High School :253072 Not migrated: 15197
## Asian:166910 college 3y or less :105662
## Other:122140 college 4y or more :178429
##
##
##
## Employed inc_diff Family_income c_age
## Min. :0.000 Min. :-419.00 Min. :0.000 Min. :-25.0000
## 1st Qu.:0.000 1st Qu.: 0.80 1st Qu.:1.000 1st Qu.:-10.0000
## Median :1.000 Median : 24.00 Median :1.000 Median : 0.0000
## Mean :0.703 Mean : 37.04 Mean :0.767 Mean : 0.9082
## 3rd Qu.:1.000 3rd Qu.: 51.00 3rd Qu.:1.000 3rd Qu.: 11.0000
## Max. :1.000 Max. : 998.20 Max. :1.000 Max. : 30.0000
## NA's :6479 NA's :1 NA's :6550
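The model fitted next reads from an object named `a.out`, but the call that creates it is not shown. A minimal sketch of how such an object is typically produced with `amelia()` follows, using hypothetical toy data. The number of imputations (`m = 5`), the choice of `noms`, and all variable names here are assumptions, not the author's actual settings.

```r
# Hypothetical toy data standing in for the analysis file
set.seed(1)
n <- 200
toy_mi <- data.frame(
  age = round(runif(n, 15, 70)),
  sex = factor(sample(c("Male", "Female"), n, replace = TRUE))
)
toy_mi$income <- 20 + 0.2 * toy_mi$age + rnorm(n, sd = 5)

# Remove 20 income values to create missing data
toy_mi$income[sample(n, 20)] <- NA

# Run the imputation only if the Amelia package is installed
if (requireNamespace("Amelia", quietly = TRUE)) {
  # m    = number of imputed datasets
  # noms = nominal (categorical) variables
  # p2s  = 0 suppresses console output
  a.toy <- Amelia::amelia(toy_mi, m = 5, noms = "sex", p2s = 0)

  # Each element of a.toy$imputations is one completed dataset
  print(length(a.toy$imputations))
}
```

With the real data, factor columns would need to be declared through `noms` or `idvars`, and columns that are exact recodes of one another (for example `RACE` and `Race`) would probably have to be dropped or set as ID variables before imputing.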
z.out <- zelig(Migrate~Family_income*Employed*SEX+Race+c_age+Education, model="probit", data = a.out, cite = FALSE)
summary(z.out)
## Model: Combined Imputations
##
## Estimate Std.Error z value Pr(>|z|)
## (Intercept) -1.966947 0.022842 -86.11 < 2e-16
## Family_income -0.274294 0.025055 -10.95 < 2e-16
## Employed -0.034635 0.021809 -1.59 0.11227
## SEXFemale -0.049621 0.028028 -1.77 0.07665
## RaceBlack 0.051982 0.013905 3.74 0.00019
## RaceAsian 0.029194 0.008498 3.44 0.00059
## RaceOther -0.005048 0.009999 -0.50 0.61367
## c_age -0.013071 0.000278 -46.95 < 2e-16
## EducationHigh School 0.070660 0.012204 5.79 7.0e-09
## Educationcollege 3y or less 0.162684 0.013883 11.72 < 2e-16
## Educationcollege 4y or more 0.376257 0.012582 29.90 < 2e-16
## Family_income:Employed -0.004148 0.027370 -0.15 0.87954
## Family_income:SEXFemale 0.148755 0.032748 4.54 5.6e-06
## Employed:SEXFemale -0.006252 0.031778 -0.20 0.84404
## Family_income:Employed:SEXFemale -0.142801 0.037641 -3.79 0.00015
##
## For results from individual imputed datasets, use summary(x, subset = i:j)
## Next step: Use 'setx' method
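The "Combined Imputations" estimates pool the fits from the individual imputed datasets. Pooling of this kind is commonly done with Rubin's rules; whether Zelig's summary applies exactly this formula is an assumption here. The function `pool_rubin()` and the input numbers below are hypothetical and serve only to show the calculation.

```r
# Pool one coefficient across m imputed datasets using Rubin's rules
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar   <- mean(est)                       # pooled point estimate
  within  <- mean(se^2)                      # average within-imputation variance
  between <- var(est)                        # between-imputation variance
  total   <- within + (1 + 1 / m) * between  # total variance
  c(estimate = q_bar, std_error = sqrt(total))
}

# Hypothetical per-imputation results for a single coefficient
est <- c(-0.27, -0.28, -0.27, -0.26, -0.28)
se  <- rep(0.025, 5)

pooled <- pool_rubin(est, se)
pooled
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term adds the uncertainty that comes from the missing data.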
x.out <- setx(z.out)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
s.out <- sim(z.out, x = x.out)
plot(s.out)
Comparing the regression results from the listwise-deletion dataset with those from the imputed datasets showed differences in the estimated coefficients, standard errors, z-values, and significance levels. Because the number of missing values (“Employed” = 6,479; “Family_income” = 6,550) is small relative to the total number of observations (636,646), the estimated coefficients from the two approaches differed only slightly. However, standard errors were somewhat lower with imputation than with listwise deletion.
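The size of the missing-data problem and the gap between the two sets of estimates can be checked with figures that already appear in the outputs. The object names below are illustrative, and only the first four probit coefficients are compared. The listwise values are printed to two decimals, so differences below 0.01 cannot be told apart from rounding.

```r
# Counts taken from the outputs
n_total    <- 636646   # rows before deletion
n_listwise <- 623619   # rows used in the listwise-deletion models
n_missing  <- c(Employed = 6479, Family_income = 6550)

# Percentage of missing values per variable and of rows dropped
pct_missing  <- 100 * n_missing / n_total
rows_dropped <- n_total - n_listwise
pct_dropped  <- 100 * rows_dropped / n_total

# Probit coefficients: listwise deletion (rounded) vs combined imputations
coefs <- data.frame(
  term     = c("(Intercept)", "Family_income", "Employed", "SEXFemale"),
  listwise = c(-1.96, -0.27, -0.04, -0.05),
  imputed  = c(-1.966947, -0.274294, -0.034635, -0.049621)
)
coefs$abs_diff <- abs(coefs$listwise - coefs$imputed)

round(pct_missing, 2)
round(pct_dropped, 2)
coefs
```

About 1% of values are missing on each variable and about 2% of rows are lost to listwise deletion. All four coefficient differences are below 0.01.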