0. Project Objective:

This report addresses the following question:

Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?

1. Data summary

To address the project objective, we will use the NLSY97 (National Longitudinal Survey of Youth, 1997 cohort) data set. The NLSY97 data set contains survey responses from 8984 individuals who have been surveyed every one or two years starting in 1997. As our research is focused on differences in income between male and female respondents, we will focus on the subset of 5829 survey respondents who indicated that income was earned. This is measured in our outcome variable: Total Income (Survey Reference #: T8976700).

Demographic Variables:

To provide more details about our data subset, we can see the two demographic characteristics of our survey respondents: Race (Survey Reference #: R1482600) and Birth Year (Survey Reference #: R0536402).

Income Earners by Race

          Black  Hispanic  Mixed  Other
Female      807       579     23   1399
Male        704       659     29   1629

Income Earners by Birth Year

          1980  1981  1982  1983  1984
Female     528   552   574   578   576
Male       538   629   606   645   603

The data presented show that while Birth Year has a healthy spread of data points across all years, Race is not as evenly spread. Only 52 survey responses indicated a Race of “Mixed”, too few to expect statistically significant results for that group in any hypothesis test or linear regression analysis by Race. Therefore, to improve our analysis, we will combine the “Mixed” and “Other” Race categories in all subsequent analysis and label the result “Other.”
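The recoding described above can be sketched in a few lines. This is an illustrative Python sketch, not the code used to produce the report; the helper name `collapse_race` is hypothetical, and the counts are copied from the Race table above.

```python
from collections import Counter

# Counts of income earners by (gender, race), copied from the Race table above.
counts = {
    ("Female", "Black"): 807, ("Female", "Hispanic"): 579,
    ("Female", "Mixed"): 23, ("Female", "Other"): 1399,
    ("Male", "Black"): 704, ("Male", "Hispanic"): 659,
    ("Male", "Mixed"): 29, ("Male", "Other"): 1629,
}


def collapse_race(race):
    """Fold the sparse "Mixed" level into "Other"; leave other levels alone."""
    return "Other" if race == "Mixed" else race


# Re-tabulate the counts under the collapsed Race levels.
collapsed = Counter()
for (gender, race), n in counts.items():
    collapsed[(gender, collapse_race(race))] += n

for key in sorted(collapsed):
    print(key, collapsed[key])
```

After the recode, each gender has three Race levels, and the total number of respondents is unchanged.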

Our key outcome variable, Total Income (as observed over two values of Gender), is top-coded due to data sensitivity rules followed by the NLSY97 process. In such cases, the top 2% of values in the data set on this variable are averaged, and only the average is displayed for these top 2% of earners. This would negatively impact an income gap study, as men and women in the top 2% bracket would all appear to have the same income. Therefore, we remove the top 108 values (all of which displayed the average of the top 2%, which was $180331). The implication of removing these values is that any patterns and linear regression models we derive apply only to the narrower range of incomes observed and retained in the study.
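A minimal sketch of the top-code filter follows, assuming incomes are held in a plain list. The function name `drop_top_coded` and the sample incomes are hypothetical; only the top-code value of $180331 comes from the text above.

```python
TOP_CODE = 180331  # top-coded value reported in the text above


def drop_top_coded(incomes, top_code=TOP_CODE):
    """Return the incomes with top-coded entries removed."""
    return [x for x in incomes if x < top_code]


# Hypothetical incomes for illustration only (not NLSY97 data).
sample = [12000, 45000, 180331, 98000, 180331, 30500]
kept = drop_top_coded(sample)
print(len(sample) - len(kept), "top-coded values removed; max retained:", max(kept))
```

Any model fitted to the retained values should only be read within the retained income range.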

Total Income appears to differ among Male and Female respondents over the entire subset of respondents indicating an income was earned, with an average income for Males of $38719.11, and an average income for all Females of $31500.79. This income gap is detailed further below across both demographics (Race and Birth year):

The above demographic analysis of our data subset includes 95% confidence bars, which help to determine the statistical significance of the difference in income between Females and Males across the various Races or Birth Years. One takeaway is clear: a hypothesis test (t-test) on the difference between mean Female and Male incomes, at every value of the two demographics above, would reject the null hypothesis (H0: means are equal across men and women) and conclude that the means differ. In fact, because the confidence bars above lie entirely above 0, we can also conclude with 95% confidence that male incomes are greater than female incomes across all values of the two demographic variables above.
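The reasoning above (an interval for the difference in means that excludes 0 implies rejecting H0 at the matching level) can be sketched as follows. This is an illustrative sketch under stated assumptions: it uses a large-sample normal approximation rather than an exact t interval, the function name `diff_in_means_ci` is hypothetical, and the incomes are invented rather than taken from NLSY97.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_in_means_ci(a, b, level=0.95):
    """Large-sample confidence interval for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    # Standard error of the difference, allowing unequal group variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


# Hypothetical incomes for illustration only (not NLSY97 data).
male = [30000, 42000, 38000, 51000, 36000, 44000, 39000, 47000]
female = [26000, 31000, 29000, 35000, 27000, 33000, 30000, 32000]

low, high = diff_in_means_ci(male, female)
print(f"95% CI for the gap: ({low:.0f}, {high:.0f})")
print("interval excludes 0:", low > 0 or high < 0)
```

With samples as small as this toy example, a t-based interval would be the safer choice; the normal approximation is more reasonable at the sample sizes in the tables above.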

Additionally, the income gap among Black respondents is sharply lower (about half) than that of the other two races (Hispanic and Other), and the confidence intervals support the hypothesis that the income gap for Black respondents is smaller than for all other Races, given that the confidence bars do not overlap.

Birth Year, on the other hand, appears to the naked eye to reduce the income gap as the year increases. However, the confidence intervals overlap between most years. Therefore, this apparent trend of the income gap decreasing as Birth Year increases requires further analysis to confirm.

In addition to the two demographic variables analyzed above, I have chosen 5 additional variables to test for impact on our outcome variable, Total Income, over the two values of Gender. These will be analyzed with linear regression models. The 5 testing variables were chosen to represent a cross section of numeric and factor variables, covering the respondent’s economic background, educational background, and legal background. Between all 5 variables, together with Race and Birth Year, I hope to find confirmed patterns in the data.

Numeric Test Variables:

Factor Test Variables:

Below are the 5 testing variables’ distributions analyzed across Gender:

The above distributions show a healthy spread of our data set across all variables, except for the # of Incarcerations, which we would expect to be clustered around 0. However, this variable may have a significant impact on earnings potential when greater than 0; therefore, we will retain the variable but take care to check whether its effect is statistically significant at the 0.05 level.

Before we begin our Linear Regression analysis, let us confirm none of the chosen variables are collinear:

As shown above, none of our variable pairs appear to be collinear, which would be indicated by a correlation with an absolute value nearing 1.0. All of our values appear to be lower than 0.2. Therefore, we are safe to proceed with linear regression on these 5 testing variables, along with our 2 chosen demographic variables and our outcome variable, Total Income, over Gender.
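A pairwise check of this kind can be sketched as below. This is an illustrative sketch only: the helper `pearson`, the 0.8 warning threshold, and the sample values are assumptions, and only the variable names are taken from the models in this report.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical values for illustration only (not NLSY97 data).
data = {
    "Dad_Max_Grade": [12, 16, 10, 14, 12, 18],
    "Mom_Age_Child1": [22, 19, 27, 24, 30, 21],
    "X._Incarcerations": [0, 1, 0, 0, 2, 0],
}

THRESHOLD = 0.8  # assumed cut-off for flagging a pair; not from the report

names = list(data)
pairs = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        pairs[(a, b)] = pearson(data[a], data[b])

for (a, b), r in pairs.items():
    flag = "check for collinearity" if abs(r) > THRESHOLD else "ok"
    print(f"{a} vs {b}: r = {r:.2f} ({flag})")
```

The check uses the absolute value because a correlation near -1 signals collinearity just as one near +1 does.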

2. Methodology

I followed a strict process in cleaning and analyzing the data. In each of the methodology sections below, you will find the details around assumptions I made, changes that were applied, and why.

Data Cleansing Overview:

Data Naming Convention:

Data Analysis:

3. Findings

Simple Linear Model (Demographics only)

To begin the linear regression analysis, let’s first fit a simple linear model, including only our Outcome Variable, Gender, and the two Demographic variables, to explore the meaning of the estimated coefficients:

Simple Linear Model 1: lm(Total_Income ~ Gender + Race + Birth_Year, data = nlsy97_subset)

                Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      2697920      434530      6.2         0
GenderMale          6744         611     11.0         0
RaceHispanic        5715         899      6.3         0
RaceOther           9914         741     13.4         0
Birth_Year         -1348         219     -6.2         0

As shown above, we constructed a simple linear regression model on our outcome variable, Total Income, over the demographics of Gender, Race, and Birth Year. What is shown above tells us the following:
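One way to read the estimates in the table above is to plug them into the fitted equation. The sketch below is illustrative Python, with a hypothetical helper `predict_income`; it treats Female and Black as the baseline levels, since those are the levels without their own row in the table. Because the printed coefficients are rounded and Birth_Year is multiplied by a value near 2000, the predicted income levels are only rough, while differences between groups equal the printed coefficients exactly.

```python
# Rounded estimates copied from the Simple Linear Model 1 table above.
COEF = {
    "(Intercept)": 2697920,
    "GenderMale": 6744,
    "RaceHispanic": 5715,
    "RaceOther": 9914,
    "Birth_Year": -1348,
}


def predict_income(gender, race, birth_year):
    """Predicted Total_Income; Female and Black are the baseline levels."""
    total = COEF["(Intercept)"] + COEF["Birth_Year"] * birth_year
    if gender == "Male":
        total += COEF["GenderMale"]
    if race in ("Hispanic", "Other"):
        total += COEF["Race" + race]
    return total


male = predict_income("Male", "Hispanic", 1982)
female = predict_income("Female", "Hispanic", 1982)
print("predicted male income:", male)
print("predicted female income:", female)
print("predicted gap:", male - female)
```

Holding Race and Birth Year fixed, the predicted gap is always the GenderMale coefficient, which is why this additive model cannot show the gap varying by Race or Birth Year.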

We now turn our attention to creating more complex linear models against a new variable, Income Gap: the mean Male income minus the mean Female income for each combination of the model covariates. We will add in each of our 5 testing variables, each time determining the effects on the model and using anova() to determine whether the new model is a statistically significant improvement over its predecessor. We will continue this until all 5 variables are included.
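The Income Gap variable described above can be sketched as a group-and-subtract step. This is an illustrative Python sketch, not the report's own code; the function name `income_gap_by_cell`, the choice to skip cells lacking one gender, and the sample records are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (gender, race, birth year, income); not NLSY97 data.
rows = [
    ("Male", "Black", 1980, 30000),
    ("Male", "Black", 1980, 34000),
    ("Female", "Black", 1980, 28000),
    ("Female", "Black", 1980, 30000),
    ("Male", "Other", 1981, 45000),
    ("Female", "Other", 1981, 36000),
    ("Male", "Hispanic", 1982, 40000),  # no female counterpart in this cell
]


def income_gap_by_cell(rows):
    """Mean male income minus mean female income per (race, birth year) cell."""
    cells = defaultdict(lambda: {"Male": [], "Female": []})
    for gender, race, year, income in rows:
        cells[(race, year)][gender].append(income)
    gaps = {}
    for key, by_gender in cells.items():
        # A gap is only defined where both genders are observed.
        if by_gender["Male"] and by_gender["Female"]:
            gaps[key] = mean(by_gender["Male"]) - mean(by_gender["Female"])
    return gaps


gaps = income_gap_by_cell(rows)
for key in sorted(gaps):
    print(key, gaps[key])
```

Each added covariate multiplies the number of cells, so the gap in each cell rests on fewer respondents, which is one reason a model on cell-level gaps can lose significance as variables are added.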

Father Education Level:

Linear Model 2: lm(Income_Gap ~ Race + Birth_Year + Dad_Max_Grade, data = data_lm_2)

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)        745689     1811159     0.41      0.68
RaceHispanic         6726        3410     1.97      0.05
RaceOther            5857        3471     1.69      0.09
Birth_Year           -374         914    -0.41      0.68
Dad_Max_Grade        -147         172    -0.86      0.39

The first model introducing a numeric test variable is presented above, against true income gap data (mean of Male incomes - Female incomes across each combination of variables). Our analysis of it is as follows:

Linear Model 2B: lm(Total_Income ~ Gender + Race + Birth_Year + Dad_Max_Grade, data = nlsy97_subset)

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       2510013      543305      4.6         0
GenderMale           7302         765      9.5         0
RaceHispanic         6992        1287      5.4         0
RaceOther            8175        1066      7.7         0
Birth_Year          -1258         274     -4.6         0
Dad_Max_Grade         839         104      8.1         0

This result is far better, with the p-values of all previously included coefficients returning to below the significance level. In addition, our new covariate also has a statistically significant p-value of 9.32 × 10^{-16}, with a positive estimated coefficient of 838.89, indicating that each additional year of the father’s education is associated with an estimated increase of this amount, on average, in the child’s income.

This graph shows us the plots of data for our subset, spread across the Father’s years of education and income levels, as well as the approximate linear model on this data. What is clear from this graph is that it also agrees with the model presented previously (2B) that the Father’s Years of Education is positively related to the Total Income in this linear model. Also clear is the gap of income between Male and Female respondents is statistically significant as the 95% confidence bands (shown in grey shading) do not overlap. However, interestingly enough, we see a narrowing of this gap as the father’s education level increases towards our maximum value of 20.

Mother Age at First Childbirth:

Next we will add to this model the next numerical variable, Mother’s Age at First Childbirth:

Linear Model 3: lm(Total_Income ~ Gender + Race + Birth_Year + Dad_Max_Grade + Mom_Age_Child1, data = nlsy97_subset)

                  Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)        2257031      558818      4.0         0
GenderMale            7738         786      9.8         0
RaceHispanic          5739        1350      4.2         0
RaceOther             6406        1135      5.6         0
Birth_Year           -1135         282     -4.0         0
Dad_Max_Grade          699         109      6.4         0
Mom_Age_Child1         510          88      5.8         0

Model #3 builds upon the previous model by adding Mother’s Age at First Childbirth. Key findings are:

Let us also view a graphical display of the data subset across Total Income and Mother’s Age at First Childbirth:

This graph provides an interesting analysis. At first, we see a statistically significant income gap (shown by the lack of overlap of the 95% confidence bands from about age 10 to 34). After this age, not only do the confidence bands begin to overlap more, but the income gap appears to narrow, almost to the point where Female incomes have caught up to Male incomes. Although our data set does not provide data beyond age 50, the slopes of the two lines at the highest X value suggest that, in this model, a female might begin to earn more than a male beyond that point. However, we cannot draw such a conclusion, as this model can only be expected to perform well within the observed range of data.

Number of Incarcerations:

Let us now turn to the final numeric variable, Number of Incarcerations:

Linear Model 4: lm(Total_Income ~ Gender + Race + Birth_Year + Dad_Max_Grade + Mom_Age_Child1 + X._Incarcerations, data = nlsy97_subset)

                     Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)           2314190      553724      4.2         0
GenderMale               8586         787     10.9         0
RaceHispanic             5740        1337      4.3         0
RaceOther                6455        1124      5.7         0
Birth_Year              -1163         279     -4.2         0
Dad_Max_Grade             658         108      6.1         0
Mom_Age_Child1            476          87      5.4         0
X._Incarcerations       -5462         707     -7.7         0

Model #4 again presents a strong linear model, with all previously held variables staying statistically significant, and without any major changes to estimated coefficients or direction of impact (positive/negative.)

The linear depiction above seems to present the findings that at 0, 1, and 2 incarcerations, there is a gender gap that is statistically significant (no overlap of confidence bands.) However, that gap between confidence bands narrows the further from 0 you go, and eventually after 2 incarcerations, the confidence bands are entirely overlapped so we cannot conclude at 3 or higher incarcerations whether there is a true income gap.

Additionally, most of the data points higher than zero are, not surprisingly, male. This is why the slopes of the two regression lines do not differ much even as the confidence bands widen (especially on the pink line, an indication of insufficient female data at the higher values).

Grades Received in Middle School:

Now we will begin to add our factor variables to our regression model, starting with Grades Received in Middle School:

Linear Model 5: lm(Total_Income ~ Gender + Race + Birth_Year + Dad_Max_Grade + Mom_Age_Child1 + X._Incarcerations + Performance_MS, data = nlsy97_subset)

                                             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                   -582564     1563321    -0.37      0.71
GenderMale                                       8175        1232     6.64      0.00
RaceHispanic                                     6019        2013     2.99      0.00
RaceOther                                        6399        1677     3.81      0.00
Birth_Year                                        291         788     0.37      0.71
Dad_Max_Grade                                     285         136     2.10      0.04
Mom_Age_Child1                                    154         135     1.14      0.25
X._Incarcerations                               -4626        1201    -3.85      0.00
Performance_MSMixed                             29669       24087     1.23      0.22
Performance_MSMostly below D’s                  14172       21488     0.66      0.51
Performance_MSMostly D’s                        18185       21224     0.86      0.39
Performance_MSAbout half C’s and half D’s       15003       20980     0.72      0.47
Performance_MSMostly C’s                        20189       20932     0.96      0.33
Performance_MSAbout half B’s and half C’s       23626       20889     1.13      0.26
Performance_MSMostly B’s                        23343       20905     1.12      0.26
Performance_MSA’s to C’s                        20917       24061     0.87      0.38
Performance_MSAbout half A’s and half B’s       27306       20884     1.31      0.19
Performance_MSMostly A’s                        32307       20897     1.55      0.12

Linear model 5 is our first attempt (other than attempt 2B) where adding an additional variable has reduced the strength of our model:

Highest Degree Attained

Our last testing variable is Highest Degree Attained, also a factor variable, which we will add on top of model 4 (rather than model 5.)

Linear Model 6: lm(Total_Income ~ Gender + Race + Birth_Year + Dad_Max_Grade + Mom_Age_Child1 + X._Incarcerations + Highest_Degree, data = nlsy97_subset)

                                   Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                         1875071      520497     3.60      0.00
GenderMale                            10911         748    14.59      0.00
RaceHispanic                           6326        1254     5.04      0.00
RaceOther                              5404        1057     5.11      0.00
Birth_Year                             -937         263    -3.57      0.00
Dad_Max_Grade                           156         105     1.49      0.14
Mom_Age_Child1                          104          84     1.24      0.21
X._Incarcerations                     -3617         681    -5.31      0.00
Highest_DegreeNon Interview            5616        4638     1.21      0.23
Highest_DegreeNone                    -7984        4604    -1.73      0.08
Highest_DegreeGED                     -5663        4507    -1.26      0.21
Highest_DegreeHS Diploma                719        4314     0.17      0.87
Highest_DegreeAssociates               5315        4470     1.19      0.23
Highest_DegreeBachelors               13400        4322     3.10      0.00
Highest_DegreeMasters                 21359        4485     4.76      0.00
Highest_DegreePhD                     39013        8830     4.42      0.00
Highest_DegreePro.(DDS, JD, MD)       33721        5558     6.07      0.00

Linear model 6 gives our most interesting analysis yet:

To further analyze how the Highest Degree Attained affects income gaps, let us compute the average income for Males and Females in our data subset, and compare:

Highest_Degree        Male_Income  Female_Income  Income_Gap
Invalid Skip                31205          28168        3037
Non Interview               38981          33624        5357
None                        26114          15348       10767
GED                         27518          20791        6727
HS Diploma                  36904          24662       12242
Associates                  40323          33130        7193
Bachelors                   50040          41233        8807
Masters                     57116          49355        7761
PhD                         55667          69833      -14167
Pro.(DDS, JD, MD)           68703          58854        9849

This table agrees with most of our linear regression model, in that from the degree level of None all the way to Masters the income rises with each progressive degree step. PhD is the anomaly: for Males only, its average income is lower than that of a Masters degree. This may be because we only have 4 individuals in our survey subset with such characteristics. Also of interest, Females have an income advantage in the PhD row, which again may reflect insufficient data on Males to test this theory. Overall, the gap itself does not follow a consistent pattern (up or down) as the degree level rises, so using degree level alone to explain the income gap may be difficult, but it clearly has an effect on the incomes themselves.
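The two observations above (incomes rise from None to Masters, while the gap shows no consistent direction) can be checked directly against the table. The sketch below is illustrative Python with hypothetical helper names; the figures are the rounded means copied from the table, so a recomputed gap can differ by 1 from the printed Income_Gap column.

```python
# (degree, male mean income, female mean income), copied from the table above
# and ordered from None to Masters as in the text.
ladder = [
    ("None", 26114, 15348),
    ("GED", 27518, 20791),
    ("HS Diploma", 36904, 24662),
    ("Associates", 40323, 33130),
    ("Bachelors", 50040, 41233),
    ("Masters", 57116, 49355),
]


def strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


def monotone(values):
    """True if the values only ever rise or only ever fall."""
    return strictly_increasing(values) or strictly_increasing(values[::-1])


male = [m for _, m, _ in ladder]
female = [f for _, _, f in ladder]
gaps = [m - f for _, m, f in ladder]

print("male incomes rise at every step:", strictly_increasing(male))
print("female incomes rise at every step:", strictly_increasing(female))
print("gap moves in one direction only:", monotone(gaps))
```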

Final Interpretation of our Linear Models

To further test the significance of model 6, let us use the Analysis of Variance (anova) function to compare it with model 4, our last successful variable integration. In doing so, we observe that the p-value (F test) for model 6 is 8.17 × 10^{-84}, which indicates that adding Highest Degree Attained is a statistically significant improvement over model 4.
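For nested linear models, a comparison of this kind rests on an F statistic built from the two residual sums of squares. The sketch below shows that arithmetic only; the function name `nested_f_statistic` and the input numbers are hypothetical and are not taken from models 4 and 6, and converting the statistic to a p-value would additionally require the F distribution.

```python
def nested_f_statistic(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic comparing a reduced model with a fuller nested model.

    rss_*: residual sum of squares; df_*: residual degrees of freedom.
    """
    if df_reduced <= df_full:
        raise ValueError("the full model must estimate more parameters")
    # Drop in unexplained variation per extra parameter ...
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    # ... relative to the unexplained variation left in the full model.
    denominator = rss_full / df_full
    return numerator / denominator


# Hypothetical values for illustration only.
f_stat = nested_f_statistic(rss_reduced=1000.0, df_reduced=95,
                            rss_full=800.0, df_full=90)
print(f"F = {f_stat:.2f} on {95 - 90} and {90} degrees of freedom")
```

A large F means the extra coefficients remove far more residual variation than chance alone would explain, which corresponds to a small p-value.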

Another way to assess the strength of a model is to run diagnostic plots, which we do here:

Our diagnostic plots tell us some good news and some imperfect news about our model, which may affect our confidence level in findings:

Plot 1:

Plot 2:

Plot 3:

Plot 4:

4. Discussion

My conclusion from this report is that, of the 7 linear models I constructed over 5 testing variables and 3 demographic variables, Linear Model 6, lm(Total_Income ~ Gender + Race + Birth_Year + Dad_Max_Grade + Mom_Age_Child1 + X._Incarcerations + Highest_Degree, data = nlsy97_subset), is our strongest performing model of our outcome variable, Total Income. Additionally, analyzing the linear regression lines plotted for 3 of our testing variables and other tabular output gave us important insight into all variables’ effects on income gaps:

My confidence level in these conclusions is very high due to the process I followed. I followed the appropriate process to determine statistical significance before putting too much faith in an answer (for example, the Grades Received in Middle School model). I do feel strongly that both Father Education Level and Mother Age at First Childbirth have a positive impact on incomes and a negative impact on the income gap (female incomes rise over the spectrum faster than male incomes in both cases). I have less confidence in my hypothesis as to why both of these cases are true; however, I believe that with a fair amount of domain knowledge in the area (as a parent myself) these are strong words to live by. Lastly, I also have strong confidence in both the negative relationship between Number of Incarcerations and Incomes and the positive relationship between Degree Attained and Incomes; however, neither proved to have a significant impact or recognizable pattern on the income gap itself.