title: “Final_Project_Adam_Kurstin”
output: html_document

Part 1: Data Summary

This analysis will examine the gender income gap using the National Longitudinal Survey of Youth (1979 cohort) data set.

This table shows the 1987 poverty rate broken down by race and gender. The poverty rate is higher among females in each racial category, although the size of the difference is not consistent across categories. The graph shows the results of the table visually. Black respondents are the most likely to be in poverty, followed by Hispanic respondents and then non-Black/non-Hispanic respondents. The gender divide in poverty rates shrinks from Black to Hispanic to non-Black/non-Hispanic respondents.

| | Black | Hispanic | Non-Black, Non-Hispanic |
|---|---|---|---|
| Female | 0.3753501 | 0.2536232 | 0.1448353 |
| Male | 0.2128079 | 0.1590551 | 0.0928168 |
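
The shrinking gender divide can be checked directly from the table. The following is a small sketch in Python (not the R code behind this report) that copies the six rates from the table above and computes the female-minus-male gap in percentage points. The dictionary layout and the function name `gender_gap` are illustrative choices of mine, not part of the original analysis.

```python
# Poverty rates copied from the table above (share of each group in poverty, 1987).
poverty_rate = {
    "Black": {"Female": 0.3753501, "Male": 0.2128079},
    "Hispanic": {"Female": 0.2536232, "Male": 0.1590551},
    "Non-Black, Non-Hispanic": {"Female": 0.1448353, "Male": 0.0928168},
}


def gender_gap(rates):
    """Female minus male poverty rate for each group, in percentage points."""
    return {race: 100 * (r["Female"] - r["Male"]) for race, r in rates.items()}


gaps = gender_gap(poverty_rate)
for race, gap in gaps.items():
    print(f"{race}: {gap:.1f} percentage points")
```

The gap comes out at roughly 16, 9, and 5 percentage points respectively, in line with the description above.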

The table and graph show that the gender divide in income is not consistent across categories of marital status. The disparity is most pronounced in the married, spouse present category, where the average income of men far exceeds that of women. The other two categories show only modest differences in average income. In fact, never-married women earn more on average than never-married men.

It is possible that these married households follow more conventional social arrangements, in which men focus on their careers and women on raising a family. Interruptions in women's careers may keep their income substantially lower than that of their husbands.

| | MARRIED, SPOUSE PRESENT | NEVER MARRIED | OTHER |
|---|---|---|---|
| Female | 30687.68 | 25147.77 | 26672.27 |
| Male | 54238.21 | 22156.47 | 31275.01 |
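
To put the three categories on a common scale, one option is the ratio of female to male average income. The sketch below, written in Python rather than R, uses only the six averages from the table above. The variable names are my own, and the ratio is an illustrative summary that was not part of the original analysis.

```python
# Average incomes copied from the table above.
mean_income = {
    "Married, spouse present": {"Female": 30687.68, "Male": 54238.21},
    "Never married": {"Female": 25147.77, "Male": 22156.47},
    "Other": {"Female": 26672.27, "Male": 31275.01},
}

# Female average income as a share of male average income, and the dollar gap.
ratios = {s: v["Female"] / v["Male"] for s, v in mean_income.items()}
dollar_gaps = {s: v["Male"] - v["Female"] for s, v in mean_income.items()}

for status in mean_income:
    print(f"{status}: ratio {ratios[status]:.2f}, male minus female {dollar_gaps[status]:.2f}")
```

A ratio above 1 means women earn more on average in that category, which is the case only for the never married group.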

The data here show that men's average income exceeds women's in every job sector. Average income varies across job sectors, but the gender divide in income does not appear to depend much on the sector. Job category may therefore not be a significant variable for analyzing the income difference between men and women.

| | FAMILY BUSINESS | GOVERNMENT | NON-PROFIT | PRIVATE FOR PROFIT | SELF-EMPLOYED |
|---|---|---|---|---|---|
| Female | 24375.3 | 43456.55 | 39652.92 | 34584.65 | 22078.61 |
| Male | 34108.0 | 59012.88 | 50518.36 | 50281.52 | 37124.01 |

This plot shows the average income of individuals by highest completed grade of education, separated by gender. Incomes are substantially higher from grade 12 (high school graduation) and again from grade 16 (college graduation). At first glance, women appear to be concentrated at the lower end of the income range at each grade level, and men at the higher end.

Part 2: Methodology

I generally excluded missing values from my datasets. For each question, I created a sub-dataset from the original data and excluded missing values within it. For the next question, I created a new sub-dataset from the original data rather than from the previous subset. In this way, I was not progressively losing data each time I answered a question.
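
The idea can be sketched as follows, in Python rather than the R used for the report. The records, the column names, and the helper `complete_cases` are made up for illustration. Dropping rows only when the columns needed for a given question are missing is my assumption about how the subsets were built.

```python
# Toy records standing in for the survey data; None marks a missing value.
# Column names and values are hypothetical, not taken from the data set.
original = [
    {"sex": "Female", "income": 28000, "grade": 12},
    {"sex": "Male", "income": None, "grade": 16},
    {"sex": "Female", "income": 35000, "grade": None},
    {"sex": "Male", "income": 41000, "grade": 14},
]


def complete_cases(rows, columns):
    """Keep only the rows in which every listed column is present."""
    return [row for row in rows if all(row[c] is not None for c in columns)]


# Each question starts again from `original`, never from an earlier subset.
income_by_sex = complete_cases(original, ["sex", "income"])
grade_by_sex = complete_cases(original, ["sex", "grade"])

# Chaining the filters instead would shrink the data with every question.
chained = complete_cases(income_by_sex, ["sex", "grade"])

print(len(original), len(income_by_sex), len(grade_by_sex), len(chained))
```

In this toy example each question keeps three of the four records, while the chained version keeps only two.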

I omitted observations with truncated values, notably those for income. Including censored data leads to inconsistent results, because the coefficients do not converge to their true values. Eliminating these outlying observations is intended to let the coefficients converge to their true values. However, it must be noted that the resulting model is more limited and does not capture the full range of the data.

I did not add any additional variables to the dataset.

I thought that the job sector of a worker might add to the discussion of the gender income gap. However, this variable proved to be something of a dead end and seemed to be independent of the gender income gap.

I also tried a log transformation of income, expecting the distribution to be closer to normal. However, this was not the case. Neither income nor its log transformation was normally distributed. Perhaps a power transformation would be appropriate. This was not a large obstacle, because there were thousands of observations, so the Central Limit Theorem should still be applicable.

In my final analysis, I focused on poverty status and total income as measures of whether men earn more than women. I used education level, race, and marital status as control variables to explain the gender income gap. I used regressions and decision trees to carry out the analysis.

Part 3: Findings

I ran Q-Q plots on the 2012 total income variable (with censored values removed) to check whether it was normally distributed. For both males and females, incomes were not normally distributed. I then tested whether a logarithmic transformation would yield better results, but it too failed to produce normally distributed data. This should be acceptable going forward, because there are enough observations for the central limit theorem to apply, so t-tests are appropriate.
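
For readers unfamiliar with Q-Q plots, the points on such a plot can be computed as in the sketch below. It is written in Python rather than R, and the incomes are made-up, right-skewed toy values, not survey data. The plotting positions `(i + 0.5) / n` are one common convention and may differ from the ones R uses.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pair each sorted, standardized observation with its normal quantile."""
    n = len(sample)
    m, s = mean(sample), stdev(sample)
    observed = [(x - m) / s for x in sorted(sample)]
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


# Hypothetical incomes used purely for illustration.
toy_incomes = [8000, 12000, 15000, 21000, 26000, 30000, 38000, 52000, 90000, 140000]

points = qq_points(toy_incomes)
for theo, obs in points:
    print(f"{theo:6.2f} {obs:6.2f}")
```

If the data were normal, the two columns would be roughly equal. For a right-skewed sample like this one, the largest observations sit well above their normal quantiles.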

I created a boxplot to better visualize the distribution of incomes for males and females. The boxplot suggests that males earn more on average than females. I then ran a simple t-test to determine whether that was indeed the case. The resulting p-value showed a high level of significance, so we can reject the null hypothesis that the mean incomes of men and women are equal.

##   sex.factor total.income12.mean total.income12.std_err
## 1     Female          28492.3221               505.8242
## 2       Male          41834.0904               675.9342
## 
##  Welch Two Sample t-test
## 
## data:  total.income12 by sex.factor
## t = -15.8032, df = 6202.509, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14996.78 -11686.76
## sample estimates:
## mean in group Female   mean in group Male 
##             28492.32             41834.09
## [1] 3.487194e-55
## [1] -14996.78 -11686.76
## attr(,"conf.level")
## [1] 0.95
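
The t statistic above can be recovered from the group means and standard errors printed at the top of the output. The following Python sketch (not the R call that produced the output) does that arithmetic. It uses a normal quantile for the confidence interval, which is my simplification: with about 6,200 degrees of freedom it should be very close to the t quantile, but the interval will not match the R output to the last digit.

```python
from math import sqrt
from statistics import NormalDist

# Group means and standard errors copied from the output above.
mean_f, se_f = 28492.3221, 505.8242
mean_m, se_m = 41834.0904, 675.9342

# Welch's t statistic: difference in means over its (unpooled) standard error.
diff = mean_f - mean_m
se_diff = sqrt(se_f ** 2 + se_m ** 2)
t_stat = diff / se_diff

# Approximate 95% interval using a normal quantile in place of the t quantile.
z = NormalDist().inv_cdf(0.975)
ci = (diff - z * se_diff, diff + z * se_diff)

print(round(t_stat, 4), [round(x, 2) for x in ci])
```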

I then examined poverty status by gender in a proportion table. The resulting table shows a stark gender divide in poverty. Among those in poverty, the proportion of women far exceeds that of men. However, among those not in poverty, the proportions of men and women are roughly equal.

| Poverty status (1 = in poverty) | Female | Male |
|---|---|---|
| 0 | 3469 | 3581 |
| 1 | 964 | 547 |
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  pov.table
## X-squared = 105.5463, df = 1, p-value < 2.2e-16
| | Not in poverty (0) | In poverty (1) |
|---|---|---|
| Female | 0.4920567 | 0.6379881 |
| Male | 0.5079433 | 0.3620119 |
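
Both the chi-squared statistic and the proportions can be reproduced from the four counts above. The sketch below is in Python rather than R and applies the standard formula for a 2x2 table with Yates' continuity correction. The variable names are mine.

```python
# Counts copied from the table above: (Female, Male) for each poverty status.
not_poor = (3469, 3581)
poor = (964, 547)

a, b = not_poor
c, d = poor
n = a + b + c + d

# Pearson chi-squared for a 2x2 table with Yates' continuity correction.
numerator = n * (abs(a * d - b * c) - n / 2) ** 2
denominator = (a + b) * (c + d) * (a + c) * (b + d)
chi2 = numerator / denominator

# Share of women within each poverty group (the Female row of the proportions).
female_share_not_poor = a / (a + b)
female_share_poor = c / (c + d)

print(round(chi2, 4), round(female_share_not_poor, 7), round(female_share_poor, 7))
```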

Linear Regression

## 
## Call:
## lm(formula = total.income12 ~ sex.factor + race.factor + highest.grade.com89, 
##     data = data.reg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79075 -22599  -3300  16951 150481 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        -56204.0     2289.2 -24.552   <2e-16
## sex.factorMale                      15157.2      765.5  19.802   <2e-16
## race.factorHispanic                  9210.8     1108.0   8.313   <2e-16
## race.factorNon-Black, Non-Hispanic  11009.3      879.1  12.524   <2e-16
## highest.grade.com89                  6061.8      170.9  35.478   <2e-16
##                                       
## (Intercept)                        ***
## sex.factorMale                     ***
## race.factorHispanic                ***
## race.factorNon-Black, Non-Hispanic ***
## highest.grade.com89                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31030 on 6609 degrees of freedom
## Multiple R-squared:  0.2205, Adjusted R-squared:   0.22 
## F-statistic: 467.3 on 4 and 6609 DF,  p-value: < 2.2e-16
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -56203.980 | 2289.2221 | -24.551563 | 0 |
| sex.factorMale | 15157.243 | 765.4509 | 19.801720 | 0 |
| race.factorHispanic | 9210.800 | 1108.0318 | 8.312758 | 0 |
| race.factorNon-Black, Non-Hispanic | 11009.261 | 879.0749 | 12.523690 | 0 |
| highest.grade.com89 | 6061.785 | 170.8604 | 35.478005 | 0 |

All of our variables are highly significant: each p-value is below 2e-16 and is displayed as 0 in the table above. Statistical significance is not the same as explanatory power, however. The model as a whole explains about 22% of the variation in income (R-squared of 0.2205). Holding the other variables constant, each additional year of education completed is associated with about $6,062 of additional income. The coefficient for male indicates that men are expected to earn about $15,157 more than women (the baseline). Similarly, Hispanic respondents are expected to earn about $9,211 more than Black respondents (the baseline), and non-Black/non-Hispanic respondents about $11,009 more than Black respondents.
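
To make the coefficients concrete, the fitted equation can be evaluated for a given profile. This Python sketch (not part of the original R analysis) copies the estimates from the coefficient table above. The function `predicted_income` and its argument names are hypothetical helpers of mine.

```python
# Coefficients copied from the regression table above.
coef = {
    "(Intercept)": -56203.980,
    "sex.factorMale": 15157.243,
    "race.factorHispanic": 9210.800,
    "race.factorNon-Black, Non-Hispanic": 11009.261,
    "highest.grade.com89": 6061.785,
}


def predicted_income(male, race, grade):
    """Fitted 2012 income for one profile; Black and Female are the baselines."""
    total = coef["(Intercept)"] + coef["highest.grade.com89"] * grade
    if male:
        total += coef["sex.factorMale"]
    if race != "Black":
        total += coef["race.factor" + race]
    return total


woman = predicted_income(male=False, race="Black", grade=12)
man = predicted_income(male=True, race="Black", grade=12)
print(round(woman, 2), round(man, 2), round(man - woman, 2))
```

For two otherwise identical high school graduates, the predicted difference is exactly the male coefficient, because the model contains no interaction terms.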

Diagnostic Plots: The Residuals vs Fitted plot suggests that our residuals do not have constant variance across the fitted values. This means the errors are heteroskedastic. This is to be expected with income, as it is not normally distributed and is bounded below by 0 and above by the truncation threshold. The residuals and the fitted values do appear to be decently uncorrelated, which suggests our model might not suffer from significant omitted variable bias, although this is weak evidence, because least-squares residuals are uncorrelated with the fitted values by construction.

Our Normal Q-Q plot tells us that the residuals from the regression do look approximately normally distributed. This suggests that our p-values are valid.

The diagnostic plots do not show any significant outliers, because we deliberately removed incomes above a certain threshold. We did this because extremely high incomes could disproportionately influence our model, but removing them also limits the model's applicability to the real world.

Logistic Regression (approach learned in David Choi’s R class)

This logistic regression estimates the probability that a respondent is male rather than female at each level of income. At low levels of income, a respondent is more likely to be female, up to an income of around $45,000. Above that level, a respondent is increasingly likely to be male.
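
The shape of such a curve can be sketched as follows, in Python rather than R. Only the crossover at about $45,000 comes from the text above. The slope is a made-up value chosen for illustration, not an estimate from the data, and the fitted coefficients are not reported here.

```python
from math import exp

CROSSOVER = 45_000  # income at which P(male) = 0.5, as described above
SLOPE = 0.00002     # hypothetical slope per dollar, NOT estimated from the data
INTERCEPT = -SLOPE * CROSSOVER  # makes the linear predictor zero at the crossover


def prob_male(income):
    """Logistic curve: probability that a respondent with this income is male."""
    return 1 / (1 + exp(-(INTERCEPT + SLOPE * income)))


for income in (10_000, 45_000, 80_000):
    print(income, round(prob_male(income), 3))
```

In a logistic model, the income at which the two outcomes are equally likely is the negative of the intercept divided by the slope, which is why fixing the crossover and the slope determines the intercept.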

Decision Tree

This shows that education level is among the best determinants of poverty status, but gender and race are also important in determining poverty status.

Part 4: Discussion

I found that there is a significant difference in income levels between men and women, with men earning about $15,000 more a year and being less likely to be in poverty. However, the control variables provided greater insight into this finding. Marital status showed that married respondents largely drove the gap in income, which was not nearly as pronounced for unmarried respondents. The income gap also varied in size across races and education levels.

I have limited confidence in my models. Removing truncated values limits the applicability of the model. Also, the factor variables for race and marital status are coarse, with only three categories each. The regression did not include interaction terms. Furthermore, a time-series analysis would add value.

However, it still seems safe to conclude that the main finding, a significant gender income gap, holds.