The income gap between genders is a long-discussed issue in the United States. As more women enter the workforce, the pay gap is narrowing. According to a report released by the Bureau of Labor Statistics in April 2018:
Median weekly earnings of full-time workers were $881 in the first quarter of 2018. Women had median weekly earnings of $783, or 81.1 percent of the $965 median for men.
By educational attainment, full-time workers age 25 and over without a high school diploma had median weekly earnings of $563, compared with $713 for high school graduates (no college) and $1,286 for those holding at least a bachelor’s degree. Among college graduates with advanced degrees (master’s or professional degree and above), the highest earning 10 percent of male workers made $3,894 or more per week, compared with $2,875 or more for their female counterparts.
The data for this analysis were obtained from IPUMS.org, a free, open data source. The data cover three years, 2014 to 2016, and focus on income differences in New York City. Variables such as income, sex, marital status (married and single), and education were extracted from IPUMS.
The dataset has 135,385 observations and 11 variables, with 26,562 missing cases for the Income variable.
# loading the packages (car is called below via car::recode, so it must be installed).
library(Amelia)
library(Zelig)
library(ZeligChoice)
library(texreg)
library(dplyr)
library(mosaic)
library(ggplot2)
# loading the data.
data <- read.csv("/users/sharanbhamra/Desktop/SOC 712/usa_00024.csv")
# recoding City variable.
data$CITY<- recode(data$CITY,
'4610' = "New York")
# recoding Sex variable so that male = 1 and female = 0.
data$SEX <- car::recode(data$SEX," 1=1; 2=0")
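# recoding Marital status variable so that married = 1 and single = 0
# (any other MARST codes present in the extract are left unchanged by this call).
data$MARST <- car::recode(data$MARST," 1=1; 6=0")
# creating a new variable for Income and recoding the codes 999999 and 999998 to missing cases.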
data<- mutate(data, INCOME=INCWAGE)
data$INCOME <- car::recode(data$INCOME,"999999 = NA")
data$INCOME <- car::recode(data$INCOME,"999998 = NA")
data2 <- data %>%
  select(YEAR, CITY, SEX, MARST, EDUC, INCOME)
sum(is.na(data2))
## [1] 26562
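Because `sum(is.na(data2))` counts missing cells across every column, a per-column count is a quick way to confirm that the missing values all sit in `INCOME`. The following is a minimal sketch on a small made-up data frame; the `toy` object and its values are illustrative assumptions, not taken from the IPUMS extract.

```r
# Hypothetical toy data standing in for data2; values are invented for illustration.
toy <- data.frame(
  SEX    = c(1, 0, 1, 0),
  MARST  = c(1, 0, 0, 1),
  INCOME = c(52000, NA, 31000, NA)
)

# Total number of missing cells, the same calculation as sum(is.na(data2)).
total_missing <- sum(is.na(toy))

# Missing cells broken down by column.
missing_by_column <- colSums(is.na(toy))

print(total_missing)
print(missing_by_column)
```

Running `colSums(is.na(data2))` on the real data would show whether the 26,562 missing cells belong to `INCOME` alone.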
data3 <- zelig(INCOME ~ EDUC + MARST*SEX, model = "ls", data = data2, cite = FALSE)
htmlreg(data3, doctype = FALSE)
| | Model 1 |
|---|---|
| (Intercept) | -38275.29*** |
| | (685.41) |
| EDUC | 8575.81*** |
| | (73.85) |
| MARST | 7023.43*** |
| | (567.88) |
| SEX | 6093.55*** |
| | (564.78) |
| MARST:SEX | 19815.58*** |
| | (807.75) |
| R² | 0.14 |
| Adj. R² | 0.14 |
| Num. obs. | 108823 |
| RMSE | 66566.08 |

***p < 0.001, **p < 0.01, *p < 0.05
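The listwise deletion model above dropped 26,562 observations. It indicates that each one-unit increase in the EDUC code is associated with $8,575.81 more income. Because the model includes a MARST × SEX interaction, the MARST and SEX coefficients apply to the reference groups: married women earn $7,023.43 more than single women, and single men earn $6,093.55 more than single women. The interaction term adds a further $19,815.58 for married men, so a married man earns $25,909.13 ($6,093.55 + $19,815.58) more than a married woman with the same education.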
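One way to read the interaction term is to compute the implied gaps directly from the coefficients in the table. The sketch below does that arithmetic, assuming `MARST` and `SEX` are both coded 0/1 as intended by the recoding step and that education is held constant; the variable names are illustrative.

```r
# Coefficients copied from the listwise-deletion table (Model 1).
b <- c(MARST = 7023.43, SEX = 6093.55, MARST_SEX = 19815.58)

# Male-female gap among single respondents (MARST = 0): the SEX coefficient alone.
gap_single <- b[["SEX"]]

# Male-female gap among married respondents (MARST = 1): SEX plus the interaction.
gap_married <- b[["SEX"]] + b[["MARST_SEX"]]

# Married-single gap among men (SEX = 1): MARST plus the interaction.
marriage_gap_men <- b[["MARST"]] + b[["MARST_SEX"]]

print(c(gap_single = gap_single,
        gap_married = gap_married,
        marriage_gap_men = marriage_gap_men))
```

The interaction coefficient on its own is the amount by which the male-female gap differs between married and single respondents, not the gap itself.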
a.out1 <- amelia(x= data2, m = 20, ts = "YEAR", cs = "CITY")
## -- Imputation 1 --
##
## 1 2
##
## -- Imputation 2 --
##
## 1 2
##
## -- Imputation 3 --
##
## 1 2
##
## -- Imputation 4 --
##
## 1 2
##
## -- Imputation 5 --
##
## 1 2
##
## -- Imputation 6 --
##
## 1 2
##
## -- Imputation 7 --
##
## 1 2
##
## -- Imputation 8 --
##
## 1 2
##
## -- Imputation 9 --
##
## 1 2
##
## -- Imputation 10 --
##
## 1 2
##
## -- Imputation 11 --
##
## 1 2
##
## -- Imputation 12 --
##
## 1 2
##
## -- Imputation 13 --
##
## 1 2
##
## -- Imputation 14 --
##
## 1 2
##
## -- Imputation 15 --
##
## 1 2
##
## -- Imputation 16 --
##
## 1 2
##
## -- Imputation 17 --
##
## 1 2
##
## -- Imputation 18 --
##
## 1 2
##
## -- Imputation 19 --
##
## 1 2
##
## -- Imputation 20 --
##
## 1 2
z.out1 <- zelig(INCOME ~ EDUC + MARST*SEX, model = "ls", data = a.out1, cite = FALSE)
summary(z.out1, subset = 10)
## Imputed Dataset 10
## Call:
## z5$zelig(formula = INCOME ~ EDUC + MARST * SEX, data = a.out1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -288831 -29334 -10653 19090 644433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40900.32 443.75 -92.17 <2e-16
## EDUC 8681.07 53.63 161.88 <2e-16
## MARST 8842.46 537.63 16.45 <2e-16
## SEX 9467.21 466.52 20.29 <2e-16
## MARST:SEX 16460.38 744.34 22.11 <2e-16
##
## Residual standard error: 66840 on 135380 degrees of freedom
## Multiple R-squared: 0.2179, Adjusted R-squared: 0.2179
## F-statistic: 9432 on 4 and 135380 DF, p-value: < 2.2e-16
##
## Next step: Use 'setx' method
summary(z.out1, subset = 20)
## Imputed Dataset 20
## Call:
## z5$zelig(formula = INCOME ~ EDUC + MARST * SEX, data = a.out1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -267168 -29225 -10655 19072 644643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40943.3 442.7 -92.48 <2e-16
## EDUC 8695.4 53.5 162.53 <2e-16
## MARST 8777.6 536.4 16.36 <2e-16
## SEX 9300.6 465.4 19.98 <2e-16
## MARST:SEX 16620.7 742.6 22.38 <2e-16
##
## Residual standard error: 66690 on 135380 degrees of freedom
## Multiple R-squared: 0.2192, Adjusted R-squared: 0.2192
## F-statistic: 9500 on 4 and 135380 DF, p-value: < 2.2e-16
##
## Next step: Use 'setx' method
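Each `summary(z.out1, subset = k)` call reports a single imputed dataset. Estimates from several imputed datasets are commonly combined with Rubin's rules: average the point estimates, then add the within-imputation and between-imputation variance. The sketch below applies those rules by hand to the `SEX` coefficient from the two summaries shown (datasets 10 and 20). A proper pooled estimate would use all 20 datasets, so the result is illustrative only, and `pool_rubin` is a hypothetical helper name, not an Amelia or Zelig function.

```r
# Hypothetical helper: pool one coefficient across imputations with Rubin's rules.
pool_rubin <- function(est, se) {
  m <- length(est)                          # number of imputed datasets used
  q_bar <- mean(est)                        # pooled point estimate
  within <- mean(se^2)                      # average within-imputation variance
  between <- var(est)                       # between-imputation variance
  total <- within + (1 + 1 / m) * between   # total variance
  c(estimate = q_bar, se = sqrt(total))
}

# SEX coefficient and standard error from imputed datasets 10 and 20 above.
sex_est <- c(9467.21, 9300.6)
sex_se  <- c(466.52, 465.4)

pooled_sex <- pool_rubin(sex_est, sex_se)
print(pooled_sex)
```

The pooled standard error is larger than either single-dataset standard error because it also reflects the uncertainty introduced by imputation.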
z.out1$setx()
z.out1$sim()
plot(z.out1)
Missing data can be crucial when running an analysis. Listwise deletion can be adequate for small datasets with few observations and variables. For large datasets, however, handling missing values is important because it can influence the results. In the models above, listwise deletion gave an approximate picture of income differences by marital status and sex. Multiple imputation produced 20 different scenarios of those differences, although the results of the two approaches do not differ much.
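To make the comparison concrete, the coefficients reported above can be placed side by side. This sketch uses the listwise-deletion estimates and the estimates from imputed dataset 10 only (not a pooled result across all 20 datasets); the `coefs` data frame and its column names are illustrative.

```r
# Coefficients copied from the listwise table and from the summary of imputed dataset 10.
coefs <- data.frame(
  term       = c("(Intercept)", "EDUC", "MARST", "SEX", "MARST:SEX"),
  listwise   = c(-38275.29, 8575.81, 7023.43, 6093.55, 19815.58),
  imputed_10 = c(-40900.32, 8681.07, 8842.46, 9467.21, 16460.38)
)

# Absolute and relative change when moving from listwise deletion to the imputed data.
coefs$difference <- coefs$imputed_10 - coefs$listwise
coefs$pct_change <- round(100 * coefs$difference / abs(coefs$listwise), 1)

print(coefs)
```

The table shows where the two sets of estimates are close (`EDUC`) and where they move more (`SEX` and the interaction), while every coefficient keeps the same sign.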