This report addresses the following question:
Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?
To address the project objective, we will use the NLSY97 (National Longitudinal Survey of Youth, 1997 cohort) data set. The NLSY97 data set contains survey responses from 8984 individuals who have been surveyed every one or two years starting in 1997. As our research is focused on differences in income between male and female respondents, we will focus on the subset of 5829 survey respondents who indicated that income was earned. This is measured in our outcome variable: Total Income (Survey Reference #: T8976700).
Demographic Variables:
To provide more details about our data subset, we can see the two demographic characteristics of our survey respondents: Race (Survey Reference #: R1482600) and Birth Year (Survey Reference #: R0536402).
| | Black | Hispanic | Mixed | Other |
|---|---|---|---|---|
| Female | 807 | 579 | 23 | 1399 |
| Male | 704 | 659 | 29 | 1629 |

| | 1980 | 1981 | 1982 | 1983 | 1984 |
|---|---|---|---|---|---|
| Female | 528 | 552 | 574 | 578 | 576 |
| Male | 538 | 629 | 606 | 645 | 603 |
The data presented show that while Birth Year has a healthy spread of data points across all years, Race is not as evenly spread. Only 52 survey responses indicated a Race of “Mixed”, too few to support statistically significant results in any hypothesis test or linear regression analysis by Race. Therefore, to improve our analysis we will combine all future Race data for both “Mixed” and “Other” under the label “Other.”
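The recoding described above can be sketched in a few lines. This is a minimal Python illustration, not the code used to produce this report; the counts are copied from the Race table above, and the helper name `merge_levels` is hypothetical.

```python
# Counts copied from the Race-by-Gender table in this report.
race_counts = {
    "Female": {"Black": 807, "Hispanic": 579, "Mixed": 23, "Other": 1399},
    "Male": {"Black": 704, "Hispanic": 659, "Mixed": 29, "Other": 1629},
}


def merge_levels(counts, source, target):
    """Return a copy of `counts` with the level `source` folded into `target`."""
    merged = {}
    for level, n in counts.items():
        key = target if level == source else level
        merged[key] = merged.get(key, 0) + n
    return merged


# Fold "Mixed" into "Other" separately for each gender.
merged_counts = {
    gender: merge_levels(counts, "Mixed", "Other")
    for gender, counts in race_counts.items()
}

for gender, counts in merged_counts.items():
    print(gender, counts)
```

After the merge, each gender has three Race levels (Black, Hispanic, Other), and the total number of respondents is unchanged.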
Our key outcome variable, Total Income (as observed over two values of Gender), is top-coded due to data sensitivity rules followed by the NLSY97 process. In such cases, the top 2% of values in the data set on this variable are averaged, and only the average is displayed for these top 2% of earners. This would negatively impact an income gap study, as men and women in the top 2% bracket would all appear to have the same income. Therefore, we remove the top 108 values (all of which displayed the average of the top 2%, which was $180331). The implication of removing these values is that any patterns and linear regression models we find apply only to the narrower range of incomes observed and retained in the study.
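A minimal Python sketch of the top-code filter follows. It is an illustration only: the income list is hypothetical toy data, not NLSY97 records, and only the top-coded value 180331 comes from the text above.

```python
TOP_CODE = 180331  # average of the top 2% of earners, as reported above


def drop_top_coded(incomes, top_code=TOP_CODE):
    """Keep only incomes strictly below the top-coded value."""
    return [income for income in incomes if income < top_code]


# Hypothetical toy incomes; two of them carry the top-coded value.
incomes = [25000, 180331, 61000, 180331, 43000]
kept = drop_top_coded(incomes)
print(len(incomes) - len(kept), "top-coded values removed; max retained:", max(kept))
```

Because every top-coded respondent shows the same value, a single comparison against that value is enough to identify them.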
Total Income appears to differ among Male and Female respondents over the entire subset of respondents indicating an income was earned, with an average income for Males of $38719.11, and an average income for all Females of $31500.79. This income gap is detailed further below across both demographics (Race and Birth year):
The above demographic analysis of our data subset includes 95% confidence bars, which help to determine the statistical significance of the difference in income between Females and Males across the various Races or Birth Years. One take-away is clear: for every value of the two demographics above, a hypothesis test (t-test) on the difference in mean income between Females and Males in this survey would reject the null hypothesis (H0: means are equal across men and women) and conclude that the means differ. In fact, because the confidence bars above lie entirely above 0, we can conclude with 95% confidence that male incomes are greater than female incomes across all values of the two demographic variables above.
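To make the confidence-bar reasoning concrete, here is a minimal Python sketch of a 95% interval for a difference in means. It assumes a large-sample normal approximation (z = 1.96) with unequal variances, which may differ from how the bars in this report were computed, and the two income lists are hypothetical toy values.

```python
from math import sqrt
from statistics import mean, variance


def diff_ci(a, b, z=1.96):
    """Approximate 95% CI for mean(a) - mean(b), allowing unequal variances."""
    diff = mean(a) - mean(b)
    # Standard error of the difference between two independent sample means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return diff - z * se, diff + z * se


# Hypothetical toy incomes, far fewer than the groups in this report.
male = [38000, 41000, 36000, 40000, 39000, 42000]
female = [31000, 33000, 30000, 32000, 29000, 34000]

low, high = diff_ci(male, female)
print("interval excludes zero:", low > 0 or high < 0)
```

If the interval lies entirely above 0, the data support a higher male mean at roughly the 95% level. For groups as small as this toy example a t quantile would be more appropriate than 1.96.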
Additionally, the income gap among Black respondents is sharply lower (about half) than that of the other two races (Hispanic and Other), and because the confidence bars do not overlap, the confidence intervals support the hypothesis that the income gap for Black respondents is smaller than for all other Races.
Birth Year on the other hand, when observed by the naked eye, seems to affect income gap negatively as the year increases. However, an analysis of confidence intervals shows us some overlap between most years. Therefore this apparent trend of the income gap decreasing as Birth Year increases requires further analysis to confirm.
In addition to the two demographic variables analyzed above, I have also chosen 5 additional variables to test for impact on our outcome variable, Total Income over the two values of Gender. These will be analyzed in linear regression models. Each of these 5 testing variables was chosen so that together they represent a cross section of numeric and factor variables, covering the respondent’s economic background, educational background, and legal background. Across all 5 variables, together with Race and Birth Year, I hope to find confirmed patterns in the data.
Numeric Test Variables:
Father Education Level (Field name: Dad_Max_Grade; Survey Reference #: R1302600). This variable is the numeric response to the question: “Highest grade completed by respondent’s residential father (includes both biological and non-biological fathers).” Answers range from 0 (indicating no formal education) to 20 (indicating 20 full years of formal education, or 8 years of post-secondary education). This variable was chosen because it bears on an individual’s economic status while growing up, which may have an impact on future life choices and earnings.
Mother Age at First Childbirth (Field name: Mom_Age_Child1; Survey Reference #: R1200100). This variable is the numeric response to the question: “What was the age of your biological mother at her first childbirth?” The answers included some extreme responses at the lower end of the spectrum, so we evaluate only values from age 10 to age 50. This variable was also chosen because it bears on an individual’s economic status while growing up, which may also have an impact on future life choices and earnings.
Number of Incarcerations (Field name: #_Incarcerations; Survey Reference #: E8043100). The last numeric variable is the response to the question: “What is the total number of incarcerations you were implicated in?” Values range from 0 to 9 in our subset of data. This variable was chosen because it captures the legal record of respondents, which may affect their future earnings.
Factor Test Variables:
Grades Received in Middle School (Field name: Performance_MS; Survey Reference #: R1700500). This variable is the categorical response to the question: “Overall, what grades did you receive in 8th grade?” While a similar question was provided for High School grades 9-12, ultimately the Middle School version was chosen because it had a larger number of responses, giving a larger sample. This variable was chosen to include analysis of the respondent’s educational background, which may be a future indicator of earning potential. The various response values are detailed in the graph below.
Highest Degree Attained (Field name: Highest_Degree; Survey Reference #: T6657300). This variable is the categorical response to the question: “What is the highest degree received prior to the start of the 2011/2012 academic school year?” This variable was chosen to include analysis of the respondent’s educational attainment level, which may be a future indicator of earning potential. The various response values are detailed in the graph below.
Below are the 5 testing variables’ distributions analyzed across Gender:
The above distributions show us a healthy spread of our data set across all variables, except for the # of Incarcerations, which we would expect to be clustered around 0. However this variable may have significant impact on earnings potential when greater than 0, therefore we will retain the variable but be careful to consider if the data is statistically significant at the 0.05 level.
Before we begin our Linear Regression analysis, let us confirm none of the chosen variables are collinear:
As shown above, none of our variable pairs appear to be collinear, which would be indicated by a correlation with absolute value near 1.0. All of our values appear to be lower than 0.2. Therefore we are safe to proceed with linear regression on these 5 testing variables, along with our chosen 2 demographic variables, and our outcome variable Total Income, over Gender.
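A pairwise collinearity check of this kind can be sketched as below. This is a minimal Python illustration on hypothetical toy columns; the 0.8 flagging threshold and the deliberately redundant column `Dad_Grade_Doubled` are assumptions made for the example, not part of the report's analysis.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def collinear_pairs(columns, threshold=0.8):
    """List variable pairs whose absolute correlation reaches the threshold."""
    flagged = []
    for (name_1, col_1), (name_2, col_2) in combinations(columns.items(), 2):
        r = pearson(col_1, col_2)
        if abs(r) >= threshold:
            flagged.append((name_1, name_2, r))
    return flagged


# Hypothetical toy data; the third column is an exact multiple of the first.
columns = {
    "Dad_Max_Grade": [12, 16, 10, 14, 12, 18],
    "Mom_Age_Child1": [19, 21, 30, 18, 26, 24],
    "Dad_Grade_Doubled": [24, 32, 20, 28, 24, 36],
}

flagged = collinear_pairs(columns)
print([(a, b) for a, b, _ in flagged])
```

Taking the absolute value matters: a correlation near -1.0 signals collinearity just as much as one near 1.0.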
I followed a strict process in cleaning and analyzing the data. In each of the methodology sections below, you will find the details around assumptions I made, changes that were applied, and why.
Data Cleansing Overview:
The data set was identified to have 33 factor variables based on question and answer detail found in ‘nlsy97_new_income_info_txt’. Additionally, the data set contained 48 numeric variables.
For each factor I identified the mapping values, including 5 negative NLSY97 codes, which were saved in my data set for most factor variables. This includes both “Refuse to respond” (-1) and “Don’t know” (-2): both of which indicate that the survey respondent went out of their way to avoid the question, which may indicate to us a hidden truth, depending on the question (for example, use of hard drugs). Other negative values were not reflective of a survey respondent’s desire to avoid the question, and may have reflected other scenarios such as an out of scope question for a respondent (for example, a question about College type for a High School student.) However all negative NLSY97 codes were saved for all factor variables for use in analysis if necessary.
All numeric variables were considered and analyzed for their range of numeric responses, the questions asked, the total number of negative values submitted, and/or the perceived relative importance of a question to the concept of income inequality. This due diligence was critical so that I neither mislabeled a genuine value as one of the NLSY97 codes (values -1 through -5) nor discarded hidden information carried by a nonstandard answer. In the end I determined that only two numeric variables showed an indication that the negative values would constitute hidden information (see the next item). As a result I coded all other numeric negative values as NA. All subsequent data analysis ignores NA values.
For these two numeric variables (which were based on the same question, asked in two separate survey years, 2000 and 2011), I determined that the wording of the question (“On how many days have you used marijuana in the last 30 days?”) meant that relevant information may be carried by two of the negative responses: “Refuse to answer” (-1) and “Don’t know” (-2). Therefore, I restored these negative responses for both variables for later analysis.
Although only 9 variables were chosen for further analysis, the data set has been cleaned and mapped for all 81 variables, so that further analysis could be done on any variable in the future.
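The cleaning rule for numeric variables described above can be sketched as follows. This is a minimal Python illustration, not the code used for this report; the two marijuana field names are hypothetical placeholders, and `None` stands in for NA.

```python
# Hypothetical field names for the two marijuana-use variables (2000 and 2011).
KEEP_NEGATIVE_FIELDS = {"Marijuana_Days_2000", "Marijuana_Days_2011"}

# NLSY97 codes retained for those fields: -1 (refused) and -2 (don't know).
RETAINED_CODES = {-1, -2}


def clean_numeric(field, value):
    """Return the value to analyze, or None (NA) for a discarded negative code."""
    if value >= 0:
        return value
    if field in KEEP_NEGATIVE_FIELDS and value in RETAINED_CODES:
        return value
    return None


print(clean_numeric("Dad_Max_Grade", -4))
print(clean_numeric("Marijuana_Days_2011", -1))
```

Keeping the retained codes as their original negative numbers leaves it to the later analysis to decide whether to treat them as separate categories.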
Data Naming Convention:
Data Analysis:
My approach to running linear regression on the variables above was first to start with a simple model based on demographic variables alone. This allowed me to see the relative direction and statistical significance of each of those variables, which were all categorical. This also helped to understand the way the model would be built as more complex items were added (for example, what factor levels were included in the baseline/intercept, and those that establish a positive or negative relationship.)
My next step (which you will see was followed for model 2 below) was to switch to an analysis against the “income gap” variable itself, rather than Total Income with Gender included as a covariate. This was done by taking the difference in mean income between men and women in each of the combinations of variables that I would subsequently add to the model (ultimately resulting in quite a few combinations, giving me a good data set to work with). However, as you will see, the first result of using this approach (model 2) did not show promising results, and even lowered the statistical significance (increased p values) of all of my demographic variables. For that reason, all later models were built using the Total Income variable alone as the outcome variable, and using the coefficient estimate for the Gender variable as my guide for the income gap itself in that model.
As I worked through each of my 5 testing variables, I added one variable at a time on top of my previously run model. After doing so, I first checked that the variables from the previous model were not affected in a major way (in p value or coefficient direction). I then analyzed the newly added variable to ensure it was statistically significant. Once I confirmed both, I moved on to the next variable, adding it on top of the latest model.
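The construction of the income gap variable can be sketched as below. This is a minimal Python illustration on hypothetical toy rows; the function name `income_gaps` is made up for the example, and combinations lacking either gender are simply skipped, which is an assumption rather than something the report states.

```python
from collections import defaultdict
from statistics import mean


def income_gaps(rows, keys):
    """Mean male income minus mean female income for each combination of `keys`."""
    groups = defaultdict(lambda: {"Male": [], "Female": []})
    for row in rows:
        combo = tuple(row[k] for k in keys)
        groups[combo][row["Gender"]].append(row["Total_Income"])
    gaps = {}
    for combo, by_gender in groups.items():
        # A gap is only defined when both genders appear in the combination.
        if by_gender["Male"] and by_gender["Female"]:
            gaps[combo] = mean(by_gender["Male"]) - mean(by_gender["Female"])
    return gaps


# Hypothetical toy rows, not NLSY97 records.
rows = [
    {"Gender": "Male", "Race": "Black", "Birth_Year": 1980, "Total_Income": 40000},
    {"Gender": "Male", "Race": "Black", "Birth_Year": 1980, "Total_Income": 30000},
    {"Gender": "Female", "Race": "Black", "Birth_Year": 1980, "Total_Income": 30000},
    {"Gender": "Male", "Race": "Other", "Birth_Year": 1981, "Total_Income": 50000},
    {"Gender": "Female", "Race": "Other", "Birth_Year": 1981, "Total_Income": 42000},
    {"Gender": "Male", "Race": "Hispanic", "Birth_Year": 1982, "Total_Income": 45000},
]

gaps = income_gaps(rows, ["Race", "Birth_Year"])
print(gaps)
```

Each combination contributes a single observation to the gap model, so the regression sees far fewer data points than the respondent-level models, which may help explain the larger p values.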
Simple Linear Model (Demographics only)
To begin the linear regression analysis, let’s first explore a simple linear model, including only our Outcome Variable, Gender, and the two Demographic variables to explore the meaning of the estimated coefficients:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2697920 | 434530 | 6.2 | 0 |
| GenderMale | 6744 | 611 | 11.0 | 0 |
| RaceHispanic | 5715 | 899 | 6.3 | 0 |
| RaceOther | 9914 | 741 | 13.4 | 0 |
| Birth_Year | -1348 | 219 | -6.2 | 0 |
As shown above, we constructed a simple linear regression model on our outcome variable, Total Income, over the demographics of Gender, Race, and Birth Year. What is shown above tells us the following:
Gender appears to have a statistically significant effect on Total Income, with a p value of $5.44 \times 10^{-28}$ and a positive coefficient for “Male”. Across the observed values, this model predicts Total Income to be on average 6743.87 higher for a male respondent than for a female respondent with the same Race and Birth Year.
Race also appears to have a statistically significant effect on Total Income across all its values, with a p value of $2.29 \times 10^{-10}$ for “Hispanic” and $3.47 \times 10^{-40}$ for “Other” (while “Black” remains part of the baseline). Both levels have positive coefficient estimates, 5714.91 for “Hispanic” and 9913.87 for “Other”, indicating that average Total Income is expected to be higher by these amounts than for the baseline Race, all else equal.
Birth Year also appears to have a statistically significant effect on Total Income, with a p value of $8.3 \times 10^{-10}$. Here, however, we see a negative estimated coefficient of -1348.46, indicating an expected decrease in Total Income as the Birth Year increases by 1. This matches our earlier graph of this variable distribution for Total Income.
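To show how these coefficients combine, here is a minimal Python sketch that evaluates the demographics-only model for a given respondent. It uses the rounded estimates from the table above, so the outputs are illustrative only: rounding the Birth Year coefficient alone shifts a prediction by several hundred dollars. The function name `predict_income` is hypothetical.

```python
# Rounded estimates copied from the demographics-only model table above.
COEF = {
    "(Intercept)": 2697920,
    "GenderMale": 6744,
    "RaceHispanic": 5715,
    "RaceOther": 9914,
    "Birth_Year": -1348,
}


def predict_income(gender, race, birth_year, coef=COEF):
    """Predicted Total Income; Female and Black are the baseline levels."""
    total = coef["(Intercept)"] + coef["Birth_Year"] * birth_year
    if gender == "Male":
        total += coef["GenderMale"]
    if race != "Black":
        total += coef["Race" + race]
    return total


male = predict_income("Male", "Hispanic", 1982)
female = predict_income("Female", "Hispanic", 1982)
print(male - female)  # the GenderMale coefficient, 6744
```

Because the model has no interaction terms, the predicted male-female difference is the same 6744 for every Race and Birth Year. This is why the Gender coefficient can serve as a single summary of the income gap in this model.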
We now turn our attention to creating more complex linear models against a new variable: Income Gap, defined as mean male income minus mean female income for each combination of the covariates. We will add in each of our 5 testing variables, each time determining the effects on the model, and using anova() to determine whether the new model fits significantly better than its predecessor. We will continue this until all 5 variables are included.
Father Education Level:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 745689 | 1811159 | 0.41 | 0.68 |
| RaceHispanic | 6726 | 3410 | 1.97 | 0.05 |
| RaceOther | 5857 | 3471 | 1.69 | 0.09 |
| Birth_Year | -374 | 914 | -0.41 | 0.68 |
| Dad_Max_Grade | -147 | 172 | -0.86 | 0.39 |
The first model introducing a numeric test variable is presented above, against true income gap data (mean of Male incomes - Female incomes across each combination of variables). Our analysis of it is as follows:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2510013 | 543305 | 4.6 | 0 |
| GenderMale | 7302 | 765 | 9.5 | 0 |
| RaceHispanic | 6992 | 1287 | 5.4 | 0 |
| RaceOther | 8175 | 1066 | 7.7 | 0 |
| Birth_Year | -1258 | 274 | -4.6 | 0 |
| Dad_Max_Grade | 839 | 104 | 8.1 | 0 |
This result is far better, with the p values of all previously included coefficients returning below the significance level. In addition, our new covariate also has a statistically significant p value of $9.32 \times 10^{-16}$, with a positive estimated coefficient of 838.89, indicating that for each additional year of the father’s education, the income of the resident child is estimated to increase by this amount on average.
This graph shows us the plots of data for our subset, spread across the Father’s years of education and income levels, as well as the approximate linear model on this data. What is clear from this graph is that it also agrees with the model presented previously (2B) that the Father’s Years of Education is positively related to the Total Income in this linear model. Also clear is the gap of income between Male and Female respondents is statistically significant as the 95% confidence bands (shown in grey shading) do not overlap. However, interestingly enough, we see a narrowing of this gap as the father’s education level increases towards our maximum value of 20.
Mother Age at First Childbirth:
Next we will add to this model the next numerical variable, Mother’s Age at First Childbirth:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2257031 | 558818 | 4.0 | 0 |
| GenderMale | 7738 | 786 | 9.8 | 0 |
| RaceHispanic | 5739 | 1350 | 4.2 | 0 |
| RaceOther | 6406 | 1135 | 5.6 | 0 |
| Birth_Year | -1135 | 282 | -4.0 | 0 |
| Dad_Max_Grade | 699 | 109 | 6.4 | 0 |
| Mom_Age_Child1 | 510 | 88 | 5.8 | 0 |
Model #3 builds upon the previous model with adding Mother’s Age at First Childbirth. Key findings are:
All existing variables are not affected (detrimentally), as each of their p values remain statistically significant, and no directions nor estimated coefficients changed in any major way.
The Mother’s Age at First Childbirth appears to be statistically significant, with a p value of $7.7 \times 10^{-9}$.
The Mother’s Age at First Childbirth appears to have a positive relationship with Total Income: for each one-year increase in the mother’s age at her first childbirth, we expect the respondent’s Total Income to increase by 509.59. This is a similar effect to Father’s Education Level.
Let us also view a graphical display of the data subset across Total Income and Mother’s Age at First Childbirth:
This graph provides an interesting analysis. At first, we see a statistically significant income gap (shown by the lack of overlap of the 95% confidence bands, from about age 10 to 34.) After this age, not only do the confidence bands begin to overlap more, but the income gap appears to narrow, almost to the point where Female incomes have caught up to Male incomes. Although our data set does not provide data beyond the range of age 50, based on the slope of the two lines at the highest X value, one could hypothesize that a female may begin to earn more than a male in that model. However we cannot make such an assumption as this model can only be expected to perform well under the given range of data.
Number of Incarcerations:
Let us now turn to the final numeric variable, Number of Incarcerations:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2314190 | 553724 | 4.2 | 0 |
| GenderMale | 8586 | 787 | 10.9 | 0 |
| RaceHispanic | 5740 | 1337 | 4.3 | 0 |
| RaceOther | 6455 | 1124 | 5.7 | 0 |
| Birth_Year | -1163 | 279 | -4.2 | 0 |
| Dad_Max_Grade | 658 | 108 | 6.1 | 0 |
| Mom_Age_Child1 | 476 | 87 | 5.4 | 0 |
| X._Incarcerations | -5462 | 707 | -7.7 | 0 |
Model #4 again presents a strong linear model, with all previously held variables staying statistically significant, and without any major changes to estimated coefficients or direction of impact (positive/negative.)
For our Number of Incarcerations variable, we find a statistically significant impact on Total Income in our new model, with a p value of $1.47 \times 10^{-14}$.
This model also presents our second coefficient with a negative impact on Total Income. For each one-unit increase in Number of Incarcerations, we expect Total Income to decrease by 5461.91. This matches my expectations, as I would have hypothesized that individuals with a criminal record would make progressively less money in their lifetimes.
The linear depiction above seems to present the findings that at 0, 1, and 2 incarcerations, there is a gender gap that is statistically significant (no overlap of confidence bands.) However, that gap between confidence bands narrows the further from 0 you go, and eventually after 2 incarcerations, the confidence bands are entirely overlapped so we cannot conclude at 3 or higher incarcerations whether there is a true income gap.
Additionally, and not surprisingly, most of the data points above zero belong to male respondents. This is why the slopes of the two regression lines do not differ much even as the confidence bands widen (especially on the pink line, an indication of insufficient female evidence at the higher values).
Grades Received in Middle School:
Now we will begin to add our factor variables to our regression model, starting with Grades Received in Middle School:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -582564 | 1563321 | -0.37 | 0.71 |
| GenderMale | 8175 | 1232 | 6.64 | 0.00 |
| RaceHispanic | 6019 | 2013 | 2.99 | 0.00 |
| RaceOther | 6399 | 1677 | 3.81 | 0.00 |
| Birth_Year | 291 | 788 | 0.37 | 0.71 |
| Dad_Max_Grade | 285 | 136 | 2.10 | 0.04 |
| Mom_Age_Child1 | 154 | 135 | 1.14 | 0.25 |
| X._Incarcerations | -4626 | 1201 | -3.85 | 0.00 |
| Performance_MSMixed | 29669 | 24087 | 1.23 | 0.22 |
| Performance_MSMostly below D’s | 14172 | 21488 | 0.66 | 0.51 |
| Performance_MSMostly D’s | 18185 | 21224 | 0.86 | 0.39 |
| Performance_MSAbout half C’s and half D’s | 15003 | 20980 | 0.72 | 0.47 |
| Performance_MSMostly C’s | 20189 | 20932 | 0.96 | 0.33 |
| Performance_MSAbout half B’s and half C’s | 23626 | 20889 | 1.13 | 0.26 |
| Performance_MSMostly B’s | 23343 | 20905 | 1.12 | 0.26 |
| Performance_MSA’s to C’s | 20917 | 24061 | 0.87 | 0.38 |
| Performance_MSAbout half A’s and half B’s | 27306 | 20884 | 1.31 | 0.19 |
| Performance_MSMostly A’s | 32307 | 20897 | 1.55 | 0.12 |
Linear model 5 is our first attempt (other than attempt 2B) where adding an additional variable has reduced the strength of our model:
Existing variables Birth Year and Mother’s Age at First Childbirth have both become statistically insignificant, with p values of 0.71 and 0.25, respectively.
Existing variable Birth Year also had its sign flip: it now has a positive coefficient of 290.98.
Existing variable Father Education Level, while still statistically significant with a p value of 0.04, is now close to our significance threshold of 0.05.
In addition, every level estimated for the Grades Received in Middle School factor is statistically insignificant, with all p values above 0.05.
While other variables (Race, Gender, Number of Incarcerations) remained unaffected, we will reject this variable as having a linear effect on Total Income, and thus revert to model 4 when we add our next variable.
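The comparison between models 4 and 5 can be expressed as a small check. This is a minimal Python sketch using the rounded estimates and p values shown in the two tables above (p values displayed as 0 are entered as 0.0). The function name `destabilized` is hypothetical, and the intercept is left out because its value is not of direct interest here.

```python
ALPHA = 0.05

# term: (estimate, p value), copied from the model 4 and model 5 tables.
MODEL_4 = {
    "GenderMale": (8586, 0.0),
    "RaceHispanic": (5740, 0.0),
    "RaceOther": (6455, 0.0),
    "Birth_Year": (-1163, 0.0),
    "Dad_Max_Grade": (658, 0.0),
    "Mom_Age_Child1": (476, 0.0),
    "X._Incarcerations": (-5462, 0.0),
}
MODEL_5 = {
    "GenderMale": (8175, 0.00),
    "RaceHispanic": (6019, 0.00),
    "RaceOther": (6399, 0.00),
    "Birth_Year": (291, 0.71),
    "Dad_Max_Grade": (285, 0.04),
    "Mom_Age_Child1": (154, 0.25),
    "X._Incarcerations": (-4626, 0.00),
}


def destabilized(previous, current, alpha=ALPHA):
    """Report terms whose sign flipped or whose p value crossed alpha."""
    issues = {}
    for term, (old_est, old_p) in previous.items():
        new_est, new_p = current[term]
        problems = []
        if (old_est > 0) != (new_est > 0):
            problems.append("sign flipped")
        if old_p < alpha <= new_p:
            problems.append("lost significance")
        if problems:
            issues[term] = problems
    return issues


issues = destabilized(MODEL_4, MODEL_5)
print(issues)
```

The check flags Birth Year and Mother’s Age at First Childbirth, the same two variables identified in the discussion above. Father Education Level is not flagged because its p value of 0.04 is still below 0.05.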
Highest Degree Attained
Our last testing variable is Highest Degree Attained, also a factor variable, which we will add on top of model 4 (rather than model 5.)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1875071 | 520497 | 3.60 | 0.00 |
| GenderMale | 10911 | 748 | 14.59 | 0.00 |
| RaceHispanic | 6326 | 1254 | 5.04 | 0.00 |
| RaceOther | 5404 | 1057 | 5.11 | 0.00 |
| Birth_Year | -937 | 263 | -3.57 | 0.00 |
| Dad_Max_Grade | 156 | 105 | 1.49 | 0.14 |
| Mom_Age_Child1 | 104 | 84 | 1.24 | 0.21 |
| X._Incarcerations | -3617 | 681 | -5.31 | 0.00 |
| Highest_DegreeNon Interview | 5616 | 4638 | 1.21 | 0.23 |
| Highest_DegreeNone | -7984 | 4604 | -1.73 | 0.08 |
| Highest_DegreeGED | -5663 | 4507 | -1.26 | 0.21 |
| Highest_DegreeHS Diploma | 719 | 4314 | 0.17 | 0.87 |
| Highest_DegreeAssociates | 5315 | 4470 | 1.19 | 0.23 |
| Highest_DegreeBachelors | 13400 | 4322 | 3.10 | 0.00 |
| Highest_DegreeMasters | 21359 | 4485 | 4.76 | 0.00 |
| Highest_DegreePhD | 39013 | 8830 | 4.42 | 0.00 |
| Highest_DegreePro.(DDS, JD, MD) | 33721 | 5558 | 6.07 | 0.00 |
Linear model 6 gives our most interesting analysis yet:
Two of our pre-existing variables, Father Education Level and Mother Age at First Childbirth, saw their p values rise to 0.14 and 0.21, respectively, so neither is significant at the 0.05 level in this model. This alone may not indicate that the model is weaker, so we follow up with an analysis of variance between our models (see next section).
All other pre-existing variables retained their significance levels and their signs (positive/negative), with some expected adjustment in coefficient estimates.
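Among the levels of the Highest Degree factor variable (ignoring the two negative values, Non Interview and Invalid Skip, which had very few responses), only Bachelors, Masters, PhD, and Professional have statistically significant p values, all less than 0.005. None (0.08), GED (0.21), HS Diploma (0.87), and Associates (0.23) are not significant at the 0.05 level.
As expected, the most advanced degrees (PhD, Professional, Masters, and Bachelors) all have positive relationships with Total Income in the model, with estimated coefficients well above 0. Of the remaining degree levels, None and GED have negative coefficient estimates, while HS Diploma (719) and Associates (5315) have positive estimates. All estimates are relative to the baseline level, Invalid Skip, which is the level absent from the table.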
To further analyze how the Highest Degree Attained affects income gaps, let us compute the average income for Males and Females in our data subset, and compare:
| Highest_Degree | Male_Income | Female_Income | Income_Gap |
|---|---|---|---|
| Invalid Skip | 31205 | 28168 | 3037 |
| Non Interview | 38981 | 33624 | 5357 |
| None | 26114 | 15348 | 10767 |
| GED | 27518 | 20791 | 6727 |
| HS Diploma | 36904 | 24662 | 12242 |
| Associates | 40323 | 33130 | 7193 |
| Bachelors | 50040 | 41233 | 8807 |
| Masters | 57116 | 49355 | 7761 |
| PhD | 55667 | 69833 | -14167 |
| Pro.(DDS, JD, MD) | 68703 | 58854 | 9849 |
This table agrees with most of our linear regression model, in that from the degree level of None all the way to Masters, income rises with each progressive degree step. PhD seems to be the anomaly in that, for Males only, its average income is lower than that of a Masters degree. This may be due to the fact that we only have 4 individuals in our survey subset with such characteristics. Also interestingly, Females have an income advantage in the PhD row, which again may be because we have insufficient data on Males to test this theory. Overall, the gap itself does not seem to follow a consistent pattern (up or down) as the degree level rises, so using degree level alone to explain the income gap may be difficult, but it clearly has an effect on the incomes themselves.
Final Interpretation of our Linear Models
To further test the significance of model 6, let us use the Analysis of Variance (anova) function to determine whether model 6 fits significantly better than model 4, our last successful variable integration. In doing so, we observe that the p value (F test) for model 6 is $8.17 \times 10^{-84}$, which indicates that adding Highest Degree Attained significantly improves the fit over model 4.
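For readers unfamiliar with the F test behind this comparison, here is a minimal Python sketch of the statistic for two nested linear models. The residual sums of squares and degrees of freedom below are hypothetical toy numbers, not values from this report; the only detail taken from the model 6 table is that Highest Degree Attained adds nine coefficients.

```python
def nested_f_statistic(rss_small, df_small, rss_large, df_large):
    """F statistic comparing a smaller model with a larger model that nests it.

    rss_* are residual sums of squares; df_* are residual degrees of freedom.
    """
    extra_terms = df_small - df_large
    # Reduction in unexplained variation per added term ...
    numerator = (rss_small - rss_large) / extra_terms
    # ... relative to the residual variance of the larger model.
    denominator = rss_large / df_large
    return numerator / denominator


# Hypothetical toy inputs: the larger model adds 9 terms (100 - 91).
f_value = nested_f_statistic(rss_small=1000.0, df_small=100,
                             rss_large=800.0, df_large=91)
print(round(f_value, 3))
```

A large F value means the added terms reduce the residual variation by more than chance alone would suggest. Converting it to a p value requires the F distribution with (extra terms, residual df of the larger model) degrees of freedom, which is not included in this sketch.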
Another tool to assess the strength of a model is by running diagnostic plots, which we run here:
Our diagnostic plots tell us some good news and some imperfect news about our model, which may affect our confidence level in findings:
Plot 1:
Plot 2:
Plot 3:
Plot 4:
My conclusion from this report is that of the 7 linear models I constructed, over 5 testing variables and 3 demographic variables, Linear Model 6 (lm fit on nlsy97_subset with the formula Total_Income ~ Gender + Race + Birth_Year + Dad_Max_Grade + Mom_Age_Child1 + X._Incarcerations + Highest_Degree) is our strongest performing model for our outcome variable Total Income. Additionally, analyzing the linear regression lines plotted for 3 of our testing variables, along with other tabular output, gave us important insight into each variable’s effect on income gaps:
As Father Education Level increased, we saw increasing incomes in both sexes, and a narrowing of the income gap. Therefore we can conclude that a father’s education level has lasting effects on his offspring. This may be due to the role model effect, where offspring seek to further their own education level as their father did. It may also reflect the economic status of the family, and the accompanying ability to pay for additional opportunities for children such as tutoring, camps, and other educational events. In either case, both support a theory that increased education and/or educational opportunities further an individual’s future earnings potential.
As Mother Age at First Childbirth increased, we also saw increasing incomes in both sexes, and an even greater narrowing of the income gap as that age increased. This narrowing was so strong that by the upper tail of our range, the confidence bands for the two sexes overlapped, and the trend pointed towards a higher female income at higher levels of the variable (if mothers were physically capable of having children beyond their 50s). I find this to be a very interesting effect: Female offspring are more likely to be affected positively by the increased age of their mother at her first childbirth. This could be due to the role model effect as well, where daughters may feel encouraged (whether consciously or unconsciously) to devote their late teens and early 20s to enhancing their career, similar to their mother’s life choices, rather than having children early. This seems to have improved these females’ chances of a higher income more so than the standard control group.
As Number of Incarcerations increased, we saw incomes decrease. However, as most incarcerations were reported by male respondents, the income gap analysis is somewhat of a moot point (there was a gap at levels 0, 1, and 2, but the gap between the confidence bands was narrowing quickly due to lack of data on female incarcerations). The greater observation overall from this variable is a clear indication that the number of incarcerations negatively affects all respondents’ ability to earn a high wage later in life.
Grades Received in Middle School proved to be an un-useful testing variable, as our linear model with this variable showed discouraging results. This may be due to the fact that Middle School grades have less of an effect overall on earnings potential, due to the ample opportunity to recover from poor performance later on in High School or Post Secondary Education. In hindsight, perhaps the High School version of this variable would have been more useful, and its importance possibly outweighed the fact that we had far fewer responses on that variable.
Highest Degree Attained proved to have a strong effect on overall incomes, but not necessarily on income gaps. While Males earned more at all but one degree level, our income gaps varied widely as we moved up the scale of degree attainment. Our one data anomaly, a negative income gap at the PhD level (Female mean income was higher), was likely due to too few male responses at this degree level.
My confidence level in these conclusions is very high due to the process I followed. I followed the appropriate process to determine statistical significance before putting too much faith in an answer (for example, the Grades Received in Middle School model). I do feel strongly that both Father Education Level and Mother Age at First Childbirth have a positive impact on incomes, and a negative impact on income gap (female incomes rise over the spectrum faster than males in both cases.) I have less confidence in my hypothesis as to why both of these cases are true, however I believe with a fair amount of domain knowledge in the area (as a parent myself) these are strong words to live by. Lastly, I also have strong confidence in both the negative relationship between Number of Incarcerations and Incomes, and the positive relationship between Degree Attained and Incomes, however neither proved to have a significant impact or recognized pattern on the income gap itself.