Income Differences by Marital Status and Sex

Introduction

The income gap between men and women is a long-discussed issue in the United States. As more women enter the workforce, the pay gap is narrowing. According to a report released by the Bureau of Labor Statistics in April 2018:

Median weekly earnings of full-time workers were $881 in the first quarter of 2018. Women had median weekly earnings of $783, or 81.1 percent of the $965 median for men.
By educational attainment, full-time workers age 25 and over without a high school diploma had median weekly earnings of $563, compared with $713 for high school graduates (no college) and $1,286 for those holding at least a bachelor’s degree. Among college graduates with advanced degrees (master’s or professional degree and above), the highest earning 10 percent of male workers made $3,894 or more per week, compared with $2,875 or more for their female counterparts.

Data

The data for this analysis were obtained from IPUMS.org, a free, open data source. They cover three years, 2014 to 2016, and focus on income differences in New York City. The variables extracted from IPUMS include income, sex, marital status (married and single), and education.

The dataset has 135,385 observations and 11 variables, with 26,562 missing cases for the income variable.

# Load the packages.
library(Amelia)
library(Zelig)
library(ZeligChoice)
library(texreg)
library(dplyr)
library(mosaic)
library(ggplot2)

# Load the data.
data<- read.csv("/users/sharanbhamra/Desktop/SOC 712/usa_00024.csv") 

# Recode the City variable (IPUMS code 4610) to a readable label.
data$CITY <- dplyr::recode(data$CITY,
                           '4610' = "New York")
# Recode the Sex variable so that male = 1 and female = 0.
data$SEX <- car::recode(data$SEX, "1 = 1; 2 = 0")

# Recode the Marital status variable so that married = 1 and single = 0.
# car::recode leaves any other MARST codes unchanged.
data$MARST <- car::recode(data$MARST, "1 = 1; 6 = 0")

# Create a new Income variable and recode 999999 and 999998 as missing.
data <- mutate(data, INCOME = INCWAGE)
data$INCOME <- car::recode(data$INCOME, "999999 = NA; 999998 = NA")

data2 <- data %>%
  select(YEAR, CITY, SEX, MARST, EDUC, INCOME)
sum(is.na(data2))

## [1] 26562
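Because `sum(is.na(data2))` counts missing cells across all columns, a per-column count can help confirm that the missing values sit in INCOME alone. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the IPUMS extract) to show the idea:

```r
# Toy data frame standing in for data2; the values are invented for illustration.
toy <- data.frame(
  SEX    = c(1, 0, 1, 0, 1),
  MARST  = c(1, 0, 0, 1, 1),
  EDUC   = c(6, 10, 7, 11, 8),
  INCOME = c(52000, NA, 31000, NA, 78000)
)

# Number of missing values in each column.
na_by_column <- colSums(is.na(toy))
print(na_by_column)

# Number of rows that listwise deletion would keep (rows with no NA at all).
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Running the same two calls on `data2` would show which variables contribute to the total and how many rows a complete-case model can use.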

Listwise Deletion Model

data3 <- zelig(INCOME ~ EDUC + MARST * SEX, model = "ls", data = data2, cite = FALSE)
htmlreg(data3, doctype = FALSE)

Statistical models
	Model 1
(Intercept)	-38275.29***
	(685.41)
EDUC	8575.81***
	(73.85)
MARST	7023.43***
	(567.88)
SEX	6093.55***
	(564.78)
MARST:SEX	19815.58***
	(807.75)
R²	0.14
Adj. R²	0.14
Num. obs.	108823
RMSE	66566.08
***p < 0.001, **p < 0.01, *p < 0.05

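Listwise deletion dropped the 26,562 observations with missing income, leaving 108,823 cases. Holding the other variables constant, each one-unit increase in the education code is associated with $8,575.81 more income. Because the model includes a MARST*SEX interaction, the MARST and SEX coefficients are conditional effects: married women earn $7,023.43 more than single women, and single men earn $6,093.55 more than single women. The interaction term means the male premium is $19,815.58 larger among married people, so a married man is predicted to earn $25,909.13 more than a married woman ($6,093.55 + $19,815.58).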
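The group differences implied by an interaction model can be worked out directly from the coefficients. The following sketch copies the estimates from the listwise deletion table; the helper `gap()` is a hypothetical name introduced here for illustration, and education is held fixed so it cancels out of every comparison:

```r
# Coefficients copied from the listwise deletion table.
b <- c(MARST = 7023.43, SEX = 6093.55, MARST_SEX = 19815.58)

# Predicted income difference relative to single women (MARST = 0, SEX = 0),
# at any fixed level of EDUC.
gap <- function(marst, sex) {
  unname(b["MARST"] * marst + b["SEX"] * sex + b["MARST_SEX"] * marst * sex)
}

single_men    <- gap(0, 1)
married_women <- gap(1, 0)
married_men   <- gap(1, 1)

# Sex gap among married people: SEX + MARST:SEX.
print(married_men - married_women)

# Marriage premium among men: MARST + MARST:SEX.
print(married_men - single_men)
```

The same arithmetic applies to any of the imputed-data fits by swapping in their coefficients.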

Multiple Imputations Model

Running the model with 20 imputations

a.out1 <- amelia(x= data2, m = 20, ts = "YEAR", cs = "CITY")

## -- Imputation 1 --
## 
##   1  2
## 
## -- Imputation 2 --
## 
##   1  2
## 
## -- Imputation 3 --
## 
##   1  2
## 
## -- Imputation 4 --
## 
##   1  2
## 
## -- Imputation 5 --
## 
##   1  2
## 
## -- Imputation 6 --
## 
##   1  2
## 
## -- Imputation 7 --
## 
##   1  2
## 
## -- Imputation 8 --
## 
##   1  2
## 
## -- Imputation 9 --
## 
##   1  2
## 
## -- Imputation 10 --
## 
##   1  2
## 
## -- Imputation 11 --
## 
##   1  2
## 
## -- Imputation 12 --
## 
##   1  2
## 
## -- Imputation 13 --
## 
##   1  2
## 
## -- Imputation 14 --
## 
##   1  2
## 
## -- Imputation 15 --
## 
##   1  2
## 
## -- Imputation 16 --
## 
##   1  2
## 
## -- Imputation 17 --
## 
##   1  2
## 
## -- Imputation 18 --
## 
##   1  2
## 
## -- Imputation 19 --
## 
##   1  2
## 
## -- Imputation 20 --
## 
##   1  2

z.out1 <- zelig(INCOME ~ EDUC + MARST*SEX, model = "ls", data = a.out1, cite = FALSE)

Subsetting the imputed datasets

summary(z.out1, subset = 10)

## Imputed Dataset 10
## Call:
## z5$zelig(formula = INCOME ~ EDUC + MARST * SEX, data = a.out1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -288831  -29334  -10653   19090  644433 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40900.32     443.75  -92.17   <2e-16
## EDUC          8681.07      53.63  161.88   <2e-16
## MARST         8842.46     537.63   16.45   <2e-16
## SEX           9467.21     466.52   20.29   <2e-16
## MARST:SEX    16460.38     744.34   22.11   <2e-16
## 
## Residual standard error: 66840 on 135380 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2179 
## F-statistic:  9432 on 4 and 135380 DF,  p-value: < 2.2e-16
## 
## Next step: Use 'setx' method

summary(z.out1, subset = 20)

## Imputed Dataset 20
## Call:
## z5$zelig(formula = INCOME ~ EDUC + MARST * SEX, data = a.out1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -267168  -29225  -10655   19072  644643 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40943.3      442.7  -92.48   <2e-16
## EDUC          8695.4       53.5  162.53   <2e-16
## MARST         8777.6      536.4   16.36   <2e-16
## SEX           9300.6      465.4   19.98   <2e-16
## MARST:SEX    16620.7      742.6   22.38   <2e-16
## 
## Residual standard error: 66690 on 135380 degrees of freedom
## Multiple R-squared:  0.2192, Adjusted R-squared:  0.2192 
## F-statistic:  9500 on 4 and 135380 DF,  p-value: < 2.2e-16
## 
## Next step: Use 'setx' method

z.out1$setx()
z.out1$sim()
plot(z.out1)
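Results from separate imputed datasets are usually combined with Rubin's rules: average the point estimates, then add the average sampling variance to the inflated between-imputation variance. The sketch below applies those rules by hand to the two summaries printed above (imputed datasets 10 and 20). It is only an illustration, since the other 18 fits are not shown here, and the object names are assumptions introduced for this example:

```r
# Estimates and standard errors copied from the summaries of imputed
# datasets 10 and 20; the remaining 18 fits are not printed in this report.
est <- rbind(
  imp10 = c(EDUC = 8681.07, MARST = 8842.46, SEX = 9467.21, MARST_SEX = 16460.38),
  imp20 = c(EDUC = 8695.4,  MARST = 8777.6,  SEX = 9300.6,  MARST_SEX = 16620.7)
)
se <- rbind(
  imp10 = c(EDUC = 53.63, MARST = 537.63, SEX = 466.52, MARST_SEX = 744.34),
  imp20 = c(EDUC = 53.5,  MARST = 536.4,  SEX = 465.4,  MARST_SEX = 742.6)
)

m <- nrow(est)                          # number of imputations being pooled
pooled_est  <- colMeans(est)            # pooled point estimate
within_var  <- colMeans(se^2)           # average within-imputation variance
between_var <- apply(est, 2, var)       # variance of estimates across imputations
total_var   <- within_var + (1 + 1 / m) * between_var
pooled_se   <- sqrt(total_var)

print(round(cbind(estimate = pooled_est, std_error = pooled_se), 2))
```

With all 20 fits available, the same steps would run over 20 rows instead of two.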

Conclusion

Missing data can be crucial when running an analysis. Listwise deletion can be adequate for small datasets that do not have many observations and variables. For large datasets, however, how missing values are handled is important because it can influence the results. Of the models above, the listwise deletion model gave an approximation of income differences by marital status and sex. Multiple imputation produced 20 imputed datasets, each with its own estimates of those differences, though the results differ little between the two approaches.
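One way to check how closely the two approaches agree is to place the coefficients side by side. This sketch compares the listwise deletion estimates with those from imputed dataset 10, both copied from the output above; using a single imputed dataset rather than a pooled result is a simplification made for illustration:

```r
# Coefficients copied from the listwise deletion table and from the
# summary of imputed dataset 10.
listwise  <- c(EDUC = 8575.81, MARST = 7023.43, SEX = 6093.55, MARST_SEX = 19815.58)
imputed10 <- c(EDUC = 8681.07, MARST = 8842.46, SEX = 9467.21, MARST_SEX = 16460.38)

# Absolute and percentage change from the listwise to the imputed estimate.
comparison <- data.frame(
  listwise   = listwise,
  imputed10  = imputed10,
  difference = imputed10 - listwise,
  pct_change = round(100 * (imputed10 - listwise) / listwise, 1)
)
print(comparison)
```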

Assignment 11