Income Differences & Relatable Factors

Brieft Analysis Using GGPLOT2

Introduction

“Income Inequality” is widespread in the United States. According to “inequality.org”; income disparities have become so pronounced that America’s top 10 percent now average more than nine times as much income as the bottom 90 percent. The evidence that income inequality in the United States has been growing for decades and is greater than in any other developed democracy is not much disputed. Therefore, in this analysis, we will try to investigate that what are the common factors of income differences in the US, using a probability based sample data from various nationally Representative sources.

Data

The data used in this analysis has been acquired from IPUMS.org. IPUMS.org a free data source. The data is a consolidated data set which is originally obtained from Scientists and Engineers Statistical Data System (SESTAT), the leading surveys for studying the science and engineering (STEM) workforce in the United States and also from National Surveys of College Graduates (NSCG), Recent College Graduates (NSRCG) and Doctorate Recipients (SDR). These different data sets are integrated from 1993 to the present. The data for this analysis is specifically from 2003:2016.

Limitations

The data has 5% percent of missing values and can only be used for preliminary analysis. The limited resources didn’t allow to run simulation on this data, as the data set is larger than normally available data sets and has 413,251 observations.

Primay Dependent Variable

For this data, our primary dependent variable is Annual Income. The annual income is a metric variable and it is labeled as SALARY in our analysis.

Key Independent Variables

We will use three key independent variables, gender, education, and race. These are categorical (factor) variables. The reason of choosing these particular variables is because social scientist, experts and various other research studies have confirmed that in the US race and gender are strong correlates of income. Following the current literature, we will also include education as another factor to study the differences. These are all very well understood factors associated with income and our goal is to use this nationally representative data to see the current patterns in the US.

Statistical Analyis

First, we will check the distribution of the data using basic histogram for any outlines. Second, using ggplot2, we construct in detailed histograms to check the gender differences in the distribution of data. Third, using ggplot2 Bar plots we will look at median and mean differences in the income. Lastly, we run a simple linear regression to see the association between independent variables.

Note: All the figures are interactive, hover on the points to check mean values of each categorical variable.

Checking Data For Outliers

Using a simple bar_plot.

ggplot(educ1) + aes(x=SALARY) + geom_bar()
## Warning: Removed 78812 rows containing non-finite values (stat_count).

Droping High Values for income >$149,999

educ_new <-subset(educ1, SALARY<150000) 

Rechecking for Outliers

ggplot(educ_new) + aes(x=SALARY) + geom_bar()

summary(educ_new)
##     PERSONID              YEAR          WEIGHT              SAMPLE      
##  Min.   :2.000e+16   Min.   :2003   Min.   :    0.376   Min.   : 601.0  
##  1st Qu.:2.000e+16   1st Qu.:2006   1st Qu.:   28.824   1st Qu.: 701.0  
##  Median :5.000e+16   Median :2008   Median :   99.579   Median : 801.0  
##  Mean   :6.894e+16   Mean   :2008   Mean   :  224.593   Mean   : 803.1  
##  3rd Qu.:1.100e+17   3rd Qu.:2010   3rd Qu.:  240.760   3rd Qu.: 902.0  
##  Max.   :1.400e+17   Max.   :2013   Max.   :14782.721   Max.   :1002.0  
##      SURID            AGE            GENDER           RACE      
##  Min.   :1.000   Min.   :23.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:32.00   1st Qu.:1.000   1st Qu.:2.000  
##  Median :1.000   Median :42.00   Median :2.000   Median :2.000  
##  Mean   :1.495   Mean   :42.84   Mean   :1.585   Mean   :2.048  
##  3rd Qu.:2.000   3rd Qu.:52.00   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :3.000   Max.   :75.00   Max.   :2.000   Max.   :3.000  
##       EDUC           LFSTAT       WRKG      HRSWKGR         WKSWKGR    
##  Min.   :1.000   Min.   :1   Min.   :1   Min.   :1.000   Min.   :1.00  
##  1st Qu.:1.000   1st Qu.:1   1st Qu.:1   1st Qu.:3.000   1st Qu.:4.00  
##  Median :2.000   Median :1   Median :1   Median :3.000   Median :4.00  
##  Mean   :1.966   Mean   :1   Mean   :1   Mean   :3.231   Mean   :3.89  
##  3rd Qu.:3.000   3rd Qu.:1   3rd Qu.:1   3rd Qu.:4.000   3rd Qu.:4.00  
##  Max.   :4.000   Max.   :1   Max.   :1   Max.   :4.000   Max.   :4.00  
##     OCEDRLP          SALARY      
##  Min.   :1.000   Min.   :     0  
##  1st Qu.:1.000   1st Qu.: 42000  
##  Median :1.000   Median : 65000  
##  Mean   :1.517   Mean   : 67132  
##  3rd Qu.:2.000   3rd Qu.: 90000  
##  Max.   :3.000   Max.   :149000

Recoding Variables

educ_new$GENDER = factor(educ_new$GENDER,
                     levels= c(1,2),
                      labels= c("Female", 
                                  "Male"))
educ_new$RACE = factor(educ_new$RACE,
                     levels= c(1,2,3,4),
                      labels= c("Asian",
                                  "White",
                                    "Under-presented", 
                                      "Other"))
educ_new$EDUC = factor(educ_new$EDUC,
                     levels= c(1,2,3,4),
                     labels= c("Bachelors",
                                "Masters",
                                  "PHD", 
                                    "Professional"))
educ_new$YEAR2 = factor(educ_new$YEAR)
meansalary <- ddply(educ_new, "GENDER", summarise, salarymean=mean(SALARY))
head(meansalary)
##   GENDER salarymean
## 1 Female   57702.20
## 2   Male   73809.67

Historgram#1 to Check Distribution of Income by Gender

Mean Represented By Blue Line

H1 <- ggplot(data=educ_new, 
             aes(educ_new$SALARY,
                 color=GENDER)) + 
  geom_histogram(fill="white", 
                 position="dodge", 
                 binwidth = 1) + 
    geom_vline(aes(xintercept=mean(SALARY)), 
               color="blue", 
               linetype="dashed", 
               size=1) +
      scale_color_brewer(palette="Dark2")+
        theme_classic()+theme(legend.position="top")
H1

Historgram#2 to Check Distribution of Income by Gender

Mean Represented by “Red” & “Green” Line

H2 <- ggplot(data=educ_new, aes(educ_new$SALARY, color=GENDER)) +
  geom_histogram(fill="white", position="dodge") + 
    geom_vline(data=meansalary, aes(xintercept=salarymean, color=GENDER), linetype="dashed")+
      scale_color_brewer(palette="Dark2") +
        theme_classic()+theme(legend.position="top")+
            labs(title="Salary Plot",x="Annual Salary", y = "Count")
H2

educ3 <- select(educ_new, GENDER, RACE, EDUC, SALARY)
educ3$SALARY <-factor(educ3$SALARY, labels = c("<25K", 
                                               "Between 25k-49k", 
                                                "Between 50K-74K", 
                                                 "Between 75K-99K", 
                                                    ">100K"))

educ3$SALARY <- factor(educ3$SALARY, ordered = TRUE,
                         levels = c("<25K", 
                                      "Between 25k-49k", 
                                        "Between 50K-74K", 
                                          "Between 75K-99K", 
                                            ">100K"))

table.SALARY <- table(educ3$SALARY)
table.SALARY
## 
##            <25K Between 25k-49k Between 50K-74K Between 75K-99K 
##           45352           84298          114673           87372 
##           >100K 
##           81556

Proportion Table (By Percent%)

export2md(check2, header.labels = c(p.overall = "p-value"))
Summary descriptives table by groups of `GENDER’
Female N=171333 Male N=241918 p-value
RACE: 0.000
    Asian 24487 (14.3%) 39412 (16.3%)
    White 104206 (60.8%) 161613 (66.8%)
    Under-presented 42640 (24.9%) 40893 (16.9%)
EDUC: 0.000
    Bachelors 67099 (39.2%) 98951 (40.9%)
    Masters 52055 (30.4%) 57381 (23.7%)
    PHD 45812 (26.7%) 77800 (32.2%)
    Professional 6367 (3.72%) 7786 (3.22%)
SALARY: 0.000
    <25K 25364 (14.8%) 19988 (8.26%)
    Between 25k-49k 46663 (27.2%) 37635 (15.6%)
    Between 50K-74K 50551 (29.5%) 64122 (26.5%)
    Between 75K-99K 28898 (16.9%) 58474 (24.2%)
    >100K 19857 (11.6%) 61699 (25.5%)
freqtable <- table(educ3$SALARY)
df <- as.data.frame.table(freqtable)
head(df)
##              Var1   Freq
## 1            <25K  45352
## 2 Between 25k-49k  84298
## 3 Between 50K-74K 114673
## 4 Between 75K-99K  87372
## 5           >100K  81556

Figure1: Different Income Cateogries in Data

Bar1 <- ggplot(df, aes(Var1, Freq)) + 
  geom_bar(stat="identity", width = 0.5, fill="Black") + 
      labs(title="Categorical View of Income", 
           subtitle="Frequency of Different Income Categories", 
             caption="") +
              theme(axis.text.x = element_text(angle=65, vjust=0.6)) + cleanup
Bar1

Figure2: Income Differences by Gender

Bar2 <- ggplot(educ_new, aes(GENDER, SALARY, fill=GENDER)) +
  stat_summary(fun.y = median, geom = "bar", position = "dodge")+
    theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
      labs(title="Income Differences By Gender", subtitle="Mean Income by Gender",
                          caption="Source: National Surveys of College Graduates (NSCG)") +
                            scale_fill_manual(name = "Gender of Respondent", 
                                              labels= c("Female", "Male"),
                                              values = c("Black","Gray86"))+ cleanup
Bar2

Figure3: Income Differences by Education & Gender

Bar3 <- ggplot(educ_new, aes(EDUC, SALARY, fill=GENDER)) + 
  stat_summary(fun.y = median, geom = "bar", position = "dodge")+
    theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
      labs(title="Income Differences By Education", 
           subtitle="Y=Median Annual Income", 
            caption="National Surveys of College Graduates(NSCG)")+ 
              scale_fill_manual(name = "Gender of Respondent", 
                                labels= c("Female", "Male"), 
                                  values = c("Black","Gray86")) +cleanup
Bar3

Figure4: Income Differences by Education & Race

Bar4 <- ggplot(educ_new, aes(EDUC, SALARY, fill=RACE)) + 
  stat_summary(fun.y = median, geom = "bar", position = "dodge")+ 
    theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
      labs(title="Income Differences By Education", subtitle="Y=Median Income",
           caption="National Surveys of College Graduates (NSCG)")+
            scale_fill_manual(name = "Race of Respondent", 
                          labels= c("Asian", "White", "Under-Represented"), 
                            values = c("Black", "Gray","Gray86"))+ cleanup
Bar4

Figure5: Income Differences by Race & Gender

Bar5 <- ggplot(educ_new, aes(RACE, SALARY, fill=GENDER)) + 
  stat_summary(fun.y = median, geom = "bar", position = "dodge")+ 
    scale_fill_manual(name = "Gender of Respondent", 
                      labels= c("Female", "Male"), 
                        values = c("Black", "Gray"))+ 
                        theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
                          labs(title="Income Differences By Race", 
                               subtitle="Median Income by Race",
                            caption="National Surveys of College Graduates (NSCG)") 
Bar5

Figure6: Gross Income by Year (2003 - 2013)

Line1 <- ggplot(educ_new, aes(YEAR2, SALARY)) + 
  stat_summary(fun.y = mean, geom = "point") + 
    stat_summary(fun.y = mean, geom= "line", aes(group=1), color= "Red", linetype = "dashed")+ 
      theme_bw()
Line1

Figure7: Mean Income by Year & Gender

Line2 <- ggplot(educ_new, aes(YEAR2, SALARY, color=GENDER )) + stat_summary(fun.y = mean, geom = "point") + 
        stat_summary(fun.y = mean, geom= "line", aes(group=GENDER))+scale_color_brewer(palette="Dark2")+ theme_bw()
Line2

Figure8: Mean Income By Year and Education Attainment

Line3 <- ggplot(educ_new, aes(YEAR2, SALARY, color=EDUC )) + stat_summary(fun.y = mean, geom = "point") + 
        stat_summary(fun.y = mean, geom= "line", aes(group=EDUC))+
  scale_color_brewer(palette="Dark2")+ theme_bw()
Line3

model1 <- lm(SALARY ~ GENDER, data= educ_new)
model2 <- lm(SALARY ~ EDUC, data= educ_new)
model3 <-lm (SALARY ~ RACE, data = educ_new)
model4 <-lm (SALARY ~ EDUC*GENDER, data=educ_new)

Results

The association between income and gender seems to be significant. Model#4 is the best for our analysis. It simply tells us, that being male increase your income by $18982 per annual. Similarly, higher levels of education is associated with higher levels of income.

Interestingly, the figure#1-5 (Bar Plots) show that Asians are more likely to earn higher income, as compared to other racial groups and minorities. However, underrepresented minorities, are more likely to earn less income.

In the end, the income differences between men and women, are significant. Not only gender differences but racial differences have considerable influence of annual income. However, it is important mention that education is an important factors in determining the level of income, person earns. Therefore, it important to encourage more educational incentives especially for minorities, so that this gap can filled and reduced in future.

htmlreg(list(model1, model2, model3, model4))
Statistical models
Model 1 Model 2 Model 3 Model 4
(Intercept) 57702.20*** 59130.88*** 72743.56*** 48261.27***
(79.20) (80.40) (132.47) (122.77)
GENDERMale 16107.47*** 18240.33***
(103.51) (159.04)
EDUCMasters 6678.56*** 8715.00***
(127.56) (185.75)
EDUCPHD 18832.30*** 21956.27***
(123.07) (192.74)
EDUCProfessional 17488.07*** 24818.99***
(286.87) (417.04)
RACEWhite -4659.51***
(147.53)
RACEUnder-presented -12936.05***
(175.99)
EDUCMasters:GENDERMale -1393.88***
(249.70)
EDUCPHD:GENDERMale -5933.72***
(245.71)
EDUCProfessional:GENDERMale -11807.87***
(560.40)
R2 0.06 0.06 0.01 0.11
Adj. R2 0.06 0.06 0.01 0.11
Num. obs. 413251 413251 413251 413251
RMSE 32781.55 32760.42 33485.92 31802.52
p < 0.001, p < 0.01, p < 0.05