“Income Inequality” is widespread in the United States. According to “inequality.org”; income disparities have become so pronounced that America’s top 10 percent now average more than nine times as much income as the bottom 90 percent. The evidence that income inequality in the United States has been growing for decades and is greater than in any other developed democracy is not much disputed. Therefore, in this analysis, we will try to investigate that what are the common factors of income differences in the US, using a probability based sample data from various nationally Representative sources.
The data used in this analysis has been acquired from IPUMS.org. IPUMS.org a free data source. The data is a consolidated data set which is originally obtained from Scientists and Engineers Statistical Data System (SESTAT), the leading surveys for studying the science and engineering (STEM) workforce in the United States and also from National Surveys of College Graduates (NSCG), Recent College Graduates (NSRCG) and Doctorate Recipients (SDR). These different data sets are integrated from 1993 to the present. The data for this analysis is specifically from 2003:2016.
The data has 5% percent of missing values and can only be used for preliminary analysis. The limited resources didn’t allow to run simulation on this data, as the data set is larger than normally available data sets and has 413,251 observations.
For this data, our primary dependent variable is Annual Income. The annual income is a metric variable and it is labeled as SALARY in our analysis.
We will use three key independent variables, gender, education, and race. These are categorical (factor) variables. The reason of choosing these particular variables is because social scientist, experts and various other research studies have confirmed that in the US race and gender are strong correlates of income. Following the current literature, we will also include education as another factor to study the differences. These are all very well understood factors associated with income and our goal is to use this nationally representative data to see the current patterns in the US.
First, we will check the distribution of the data using basic histogram for any outlines. Second, using ggplot2, we construct in detailed histograms to check the gender differences in the distribution of data. Third, using ggplot2 Bar plots we will look at median and mean differences in the income. Lastly, we run a simple linear regression to see the association between independent variables.
ggplot(educ1) + aes(x=SALARY) + geom_bar()
## Warning: Removed 78812 rows containing non-finite values (stat_count).
educ_new <-subset(educ1, SALARY<150000)
ggplot(educ_new) + aes(x=SALARY) + geom_bar()
summary(educ_new)
## PERSONID YEAR WEIGHT SAMPLE
## Min. :2.000e+16 Min. :2003 Min. : 0.376 Min. : 601.0
## 1st Qu.:2.000e+16 1st Qu.:2006 1st Qu.: 28.824 1st Qu.: 701.0
## Median :5.000e+16 Median :2008 Median : 99.579 Median : 801.0
## Mean :6.894e+16 Mean :2008 Mean : 224.593 Mean : 803.1
## 3rd Qu.:1.100e+17 3rd Qu.:2010 3rd Qu.: 240.760 3rd Qu.: 902.0
## Max. :1.400e+17 Max. :2013 Max. :14782.721 Max. :1002.0
## SURID AGE GENDER RACE
## Min. :1.000 Min. :23.00 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:32.00 1st Qu.:1.000 1st Qu.:2.000
## Median :1.000 Median :42.00 Median :2.000 Median :2.000
## Mean :1.495 Mean :42.84 Mean :1.585 Mean :2.048
## 3rd Qu.:2.000 3rd Qu.:52.00 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :3.000 Max. :75.00 Max. :2.000 Max. :3.000
## EDUC LFSTAT WRKG HRSWKGR WKSWKGR
## Min. :1.000 Min. :1 Min. :1 Min. :1.000 Min. :1.00
## 1st Qu.:1.000 1st Qu.:1 1st Qu.:1 1st Qu.:3.000 1st Qu.:4.00
## Median :2.000 Median :1 Median :1 Median :3.000 Median :4.00
## Mean :1.966 Mean :1 Mean :1 Mean :3.231 Mean :3.89
## 3rd Qu.:3.000 3rd Qu.:1 3rd Qu.:1 3rd Qu.:4.000 3rd Qu.:4.00
## Max. :4.000 Max. :1 Max. :1 Max. :4.000 Max. :4.00
## OCEDRLP SALARY
## Min. :1.000 Min. : 0
## 1st Qu.:1.000 1st Qu.: 42000
## Median :1.000 Median : 65000
## Mean :1.517 Mean : 67132
## 3rd Qu.:2.000 3rd Qu.: 90000
## Max. :3.000 Max. :149000
educ_new$GENDER = factor(educ_new$GENDER,
levels= c(1,2),
labels= c("Female",
"Male"))
educ_new$RACE = factor(educ_new$RACE,
levels= c(1,2,3,4),
labels= c("Asian",
"White",
"Under-presented",
"Other"))
educ_new$EDUC = factor(educ_new$EDUC,
levels= c(1,2,3,4),
labels= c("Bachelors",
"Masters",
"PHD",
"Professional"))
educ_new$YEAR2 = factor(educ_new$YEAR)
meansalary <- ddply(educ_new, "GENDER", summarise, salarymean=mean(SALARY))
head(meansalary)
## GENDER salarymean
## 1 Female 57702.20
## 2 Male 73809.67
H1 <- ggplot(data=educ_new,
aes(educ_new$SALARY,
color=GENDER)) +
geom_histogram(fill="white",
position="dodge",
binwidth = 1) +
geom_vline(aes(xintercept=mean(SALARY)),
color="blue",
linetype="dashed",
size=1) +
scale_color_brewer(palette="Dark2")+
theme_classic()+theme(legend.position="top")
H1
H2 <- ggplot(data=educ_new, aes(educ_new$SALARY, color=GENDER)) +
geom_histogram(fill="white", position="dodge") +
geom_vline(data=meansalary, aes(xintercept=salarymean, color=GENDER), linetype="dashed")+
scale_color_brewer(palette="Dark2") +
theme_classic()+theme(legend.position="top")+
labs(title="Salary Plot",x="Annual Salary", y = "Count")
H2
educ3 <- select(educ_new, GENDER, RACE, EDUC, SALARY)
educ3$SALARY <-factor(educ3$SALARY, labels = c("<25K",
"Between 25k-49k",
"Between 50K-74K",
"Between 75K-99K",
">100K"))
educ3$SALARY <- factor(educ3$SALARY, ordered = TRUE,
levels = c("<25K",
"Between 25k-49k",
"Between 50K-74K",
"Between 75K-99K",
">100K"))
table.SALARY <- table(educ3$SALARY)
table.SALARY
##
## <25K Between 25k-49k Between 50K-74K Between 75K-99K
## 45352 84298 114673 87372
## >100K
## 81556
export2md(check2, header.labels = c(p.overall = "p-value"))
Female N=171333 | Male N=241918 | p-value | |
---|---|---|---|
RACE: | 0.000 | ||
Asian | 24487 (14.3%) | 39412 (16.3%) | |
White | 104206 (60.8%) | 161613 (66.8%) | |
Under-presented | 42640 (24.9%) | 40893 (16.9%) | |
EDUC: | 0.000 | ||
Bachelors | 67099 (39.2%) | 98951 (40.9%) | |
Masters | 52055 (30.4%) | 57381 (23.7%) | |
PHD | 45812 (26.7%) | 77800 (32.2%) | |
Professional | 6367 (3.72%) | 7786 (3.22%) | |
SALARY: | 0.000 | ||
<25K | 25364 (14.8%) | 19988 (8.26%) | |
Between 25k-49k | 46663 (27.2%) | 37635 (15.6%) | |
Between 50K-74K | 50551 (29.5%) | 64122 (26.5%) | |
Between 75K-99K | 28898 (16.9%) | 58474 (24.2%) | |
>100K | 19857 (11.6%) | 61699 (25.5%) |
freqtable <- table(educ3$SALARY)
df <- as.data.frame.table(freqtable)
head(df)
## Var1 Freq
## 1 <25K 45352
## 2 Between 25k-49k 84298
## 3 Between 50K-74K 114673
## 4 Between 75K-99K 87372
## 5 >100K 81556
Bar1 <- ggplot(df, aes(Var1, Freq)) +
geom_bar(stat="identity", width = 0.5, fill="Black") +
labs(title="Categorical View of Income",
subtitle="Frequency of Different Income Categories",
caption="") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) + cleanup
Bar1
Bar2 <- ggplot(educ_new, aes(GENDER, SALARY, fill=GENDER)) +
stat_summary(fun.y = median, geom = "bar", position = "dodge")+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Income Differences By Gender", subtitle="Mean Income by Gender",
caption="Source: National Surveys of College Graduates (NSCG)") +
scale_fill_manual(name = "Gender of Respondent",
labels= c("Female", "Male"),
values = c("Black","Gray86"))+ cleanup
Bar2
Bar3 <- ggplot(educ_new, aes(EDUC, SALARY, fill=GENDER)) +
stat_summary(fun.y = median, geom = "bar", position = "dodge")+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Income Differences By Education",
subtitle="Y=Median Annual Income",
caption="National Surveys of College Graduates(NSCG)")+
scale_fill_manual(name = "Gender of Respondent",
labels= c("Female", "Male"),
values = c("Black","Gray86")) +cleanup
Bar3
Bar4 <- ggplot(educ_new, aes(EDUC, SALARY, fill=RACE)) +
stat_summary(fun.y = median, geom = "bar", position = "dodge")+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Income Differences By Education", subtitle="Y=Median Income",
caption="National Surveys of College Graduates (NSCG)")+
scale_fill_manual(name = "Race of Respondent",
labels= c("Asian", "White", "Under-Represented"),
values = c("Black", "Gray","Gray86"))+ cleanup
Bar4
Bar5 <- ggplot(educ_new, aes(RACE, SALARY, fill=GENDER)) +
stat_summary(fun.y = median, geom = "bar", position = "dodge")+
scale_fill_manual(name = "Gender of Respondent",
labels= c("Female", "Male"),
values = c("Black", "Gray"))+
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Income Differences By Race",
subtitle="Median Income by Race",
caption="National Surveys of College Graduates (NSCG)")
Bar5
Line1 <- ggplot(educ_new, aes(YEAR2, SALARY)) +
stat_summary(fun.y = mean, geom = "point") +
stat_summary(fun.y = mean, geom= "line", aes(group=1), color= "Red", linetype = "dashed")+
theme_bw()
Line1
Line2 <- ggplot(educ_new, aes(YEAR2, SALARY, color=GENDER )) + stat_summary(fun.y = mean, geom = "point") +
stat_summary(fun.y = mean, geom= "line", aes(group=GENDER))+scale_color_brewer(palette="Dark2")+ theme_bw()
Line2
Line3 <- ggplot(educ_new, aes(YEAR2, SALARY, color=EDUC )) + stat_summary(fun.y = mean, geom = "point") +
stat_summary(fun.y = mean, geom= "line", aes(group=EDUC))+
scale_color_brewer(palette="Dark2")+ theme_bw()
Line3
model1 <- lm(SALARY ~ GENDER, data= educ_new)
model2 <- lm(SALARY ~ EDUC, data= educ_new)
model3 <-lm (SALARY ~ RACE, data = educ_new)
model4 <-lm (SALARY ~ EDUC*GENDER, data=educ_new)
The association between income and gender seems to be significant. Model#4 is the best for our analysis. It simply tells us, that being male increase your income by $18982 per annual. Similarly, higher levels of education is associated with higher levels of income.
Interestingly, the figure#1-5 (Bar Plots) show that Asians are more likely to earn higher income, as compared to other racial groups and minorities. However, underrepresented minorities, are more likely to earn less income.
In the end, the income differences between men and women, are significant. Not only gender differences but racial differences have considerable influence of annual income. However, it is important mention that education is an important factors in determining the level of income, person earns. Therefore, it important to encourage more educational incentives especially for minorities, so that this gap can filled and reduced in future.
htmlreg(list(model1, model2, model3, model4))
Model 1 | Model 2 | Model 3 | Model 4 | ||
---|---|---|---|---|---|
(Intercept) | 57702.20*** | 59130.88*** | 72743.56*** | 48261.27*** | |
(79.20) | (80.40) | (132.47) | (122.77) | ||
GENDERMale | 16107.47*** | 18240.33*** | |||
(103.51) | (159.04) | ||||
EDUCMasters | 6678.56*** | 8715.00*** | |||
(127.56) | (185.75) | ||||
EDUCPHD | 18832.30*** | 21956.27*** | |||
(123.07) | (192.74) | ||||
EDUCProfessional | 17488.07*** | 24818.99*** | |||
(286.87) | (417.04) | ||||
RACEWhite | -4659.51*** | ||||
(147.53) | |||||
RACEUnder-presented | -12936.05*** | ||||
(175.99) | |||||
EDUCMasters:GENDERMale | -1393.88*** | ||||
(249.70) | |||||
EDUCPHD:GENDERMale | -5933.72*** | ||||
(245.71) | |||||
EDUCProfessional:GENDERMale | -11807.87*** | ||||
(560.40) | |||||
R2 | 0.06 | 0.06 | 0.01 | 0.11 | |
Adj. R2 | 0.06 | 0.06 | 0.01 | 0.11 | |
Num. obs. | 413251 | 413251 | 413251 | 413251 | |
RMSE | 32781.55 | 32760.42 | 33485.92 | 31802.52 | |
p < 0.001, p < 0.01, p < 0.05 |