To investigate the income gap between men and women and the factors related to it, I chose seven representative variables from the NLSY79 dataset (National Longitudinal Survey of Youth, 1979 cohort). Here’s information on what the seven variables mean.
## Parsed with column specification:
## cols(
## .default = col_double()
## )
## See spec(...) for full column specifications.
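# Transform and relabel gender, race, childhood poverty status, marital status, and education variables
# Note: the `NA` = ... entries below do not match missing values (they only trigger the
# coercion warnings shown after the code); recode_factor()'s .missing argument would relabel them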
nlsy <- mutate(nlsy,
gender = recode_factor(gender,
`1` = "Male",
`2` = "Female"),
povstatus_1979 = recode_factor(povstatus_1979,
`0` = "not in poverty",
`1` = "in poverty",
`NA` = 'missing'),
race = recode_factor(race,
`3` = "Other",
`2` = "Black",
`1` = "Hispanic"),
marstat.col_2000 = recode_factor(marstat.col_2000,
`2` = "Married(spouse present)",
`1` = "Never married",
`3` = "Other",
`NA` = "missing"),
edu_2012 = recode_factor(edu_2012,
`1` = '1st grade',
`2` = '2nd grade',
`3` = '3rd grade',
`4` = '4th grade',
`5` = '5th grade',
`6` = '6th grade',
`7` = '7th grade',
`8` = '8th grade',
`9` = '9th grade',
`10` = '10th grade',
`11` = '11th grade',
`12` = '12th grade',
`13` = '1st year college',
`14` = '2nd year college',
`15` = '3rd year college',
`16` = '4th year college',
`17` = '5th year college',
`18` = '6th year college',
`19` = '7th year college',
`20` = '8th year college'))
## Warning in recode.numeric(.x, !!!values, .default = .default, .missing
## = .missing): NAs introduced by coercion
## Warning in recode.numeric(.x, !!!values, .default = .default, .missing
## = .missing): NAs introduced by coercion
## Warning: Unreplaced values treated as NA as .x is not compatible. Please
## specify replacements exhaustively or supply .default
Note that I chose “Other” race, male, and no criminal history as the baseline levels (because of their prevalence in the dataset).
To test whether there is a significant income gap between men and women, we first plot income by gender and compare the two distributions.
# Create boxplot showing how income varies between men and women
qplot(x = gender, y = income,
geom = "boxplot", data = nlsy,
xlab = "Gender",
ylab = "Annual Individual Income in 2011 ($)",
fill = I('#CC79A7'))
## Warning: Removed 5662 rows containing non-finite values (stat_boxplot).
We can tell from the boxplot that about 75 percent of men have an annual income below roughly $75k, whereas about 75 percent of women have an annual income below roughly $50k. Overall, being female is associated with lower income. But how can we assess whether this difference is statistically significant? Let’s compute a summary table that includes the standard error (the standard deviation divided by the square root of the group size) to assess the statistical significance.
# Notice the consistent use of round() to ensure that our summaries
# do not have too many decimal values
nlsy %>%
group_by(gender) %>%
filter(!is.nan(income))%>%  # note: is.nan() does not drop NA incomes, so n() below counts every respondent
summarize(num.obs = n(),
mean.income = round(mean(income,na.rm = TRUE),0),
sd.income = round(sd(income,na.rm = TRUE),0),
se.income = round(sd(income,na.rm = TRUE) / sqrt(num.obs),0))
## # A tibble: 2 x 5
## gender num.obs mean.income sd.income se.income
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Male 6403 53446 69369 867
## 2 Female 6283 29539 35330 446
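The standard error in this table divides by the number of rows in each group. A minimal sketch of the same calculation using only the non-missing incomes is shown below; the `toy` data frame and the `se_mean` helper are hypothetical and only stand in for `nlsy`.

```r
# Toy data standing in for nlsy (values are made up for illustration)
toy <- data.frame(
  gender = c("Male", "Male", "Male", "Female", "Female", "Female"),
  income = c(50000, 70000, NA, 30000, NA, 40000)
)

# Standard error of the mean, using only the non-missing values
se_mean <- function(x) {
  x <- x[!is.na(x)]            # is.na() catches NA; is.nan() would not
  sd(x) / sqrt(length(x))
}

# One standard error and one non-missing count per gender
se.by.gender <- tapply(toy$income, toy$gender, se_mean)
n.by.gender  <- tapply(toy$income, toy$gender, function(x) sum(!is.na(x)))

print(se.by.gender)
print(n.by.gender)
```

With many missing incomes, dividing by the full group size makes the standard error look smaller than it is, so the non-missing count is the safer denominator.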
The summary table suggests that the standard errors of both group means are small relative to the difference between them (the standard error of average female income is especially small), so each sample mean should be a fairly precise estimate of its population mean. The income difference between men and women looks significant. Now we will run a t-test to check this and incorporate the interpretation in our findings.
##
## Welch Two Sample t-test
##
## data: income by gender
## t = 18.034, df = 4993, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 21308.43 26506.37
## sample estimates:
## mean in group Male mean in group Female
## 53445.91 29538.51
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "stderr" "alternative" "method" "data.name"
## [1] 1.752135e-70
## mean in group Male mean in group Female
## 53445.91 29538.51
## [1] 21308.43 26506.37
## attr(,"conf.level")
## [1] 0.95
income.gender.diff <- round(income.t.test$estimate[1] - income.t.test$estimate[2], 1)
# Confidence level as a %
conf.level <- attr(income.t.test$conf.int, "conf.level") * 100
Assessment of Significance - Our study finds that men on average earn 23907.4 dollars more than women annually (t-statistic 18.03, p < 0.001, 95% CI [21308.4, 26506.4]).
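The call that creates `income.t.test` is not shown in this report. A minimal sketch of how such an object could be produced and turned into a reporting sentence follows; it uses simulated incomes (an assumption, not the NLSY79 data), so the numbers it prints will differ from those above.

```r
# Simulated incomes standing in for nlsy (made-up means and spreads)
set.seed(1)
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = 50),
                  levels = c("Male", "Female")),
  income = c(rnorm(50, mean = 50000, sd = 20000),
             rnorm(50, mean = 30000, sd = 10000))
)

# t.test() uses the Welch (unequal variance) version by default
income.t.test <- t.test(income ~ gender, data = toy)

# Male mean minus female mean, matching the order of the factor levels
gap <- unname(income.t.test$estimate[1] - income.t.test$estimate[2])
ci  <- income.t.test$conf.int

# Report very small p-values as "<0.001" instead of rounding them to 0
p.text <- format.pval(income.t.test$p.value, digits = 3, eps = 0.001)

cat(sprintf("gap = %.1f, t = %.2f, p-value %s, 95%% CI [%.1f, %.1f]\n",
            gap, income.t.test$statistic, p.text, ci[1], ci[2]))
```

Formatting the p-value with `format.pval()` avoids printing "p = 0" when the value is merely too small to display.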
Here are some questions I will try to answer using the tabular summaries and graphics:
- Does race appear to have an effect on the income difference?
- Does the income difference appear to be consistent across racial groups?
- What is the association between race and the income difference between men and women?
##color-blind friendly
income.summary <- nlsy%>%
group_by(race,gender)%>%
summarize(avg.income = round(mean(income, na.rm= TRUE),0))
kable(income.summary,digits = 3,
format = "markdown")
| race | gender | avg.income |
|---|---|---|
| Other | Male | 69113 |
| Other | Female | 33570 |
| Black | Male | 32779 |
| Black | Female | 24096 |
| Hispanic | Male | 46017 |
| Hispanic | Female | 27827 |
##we can plot the table as follows:
# Define basic aesthetic parameters
p.income <- ggplot(data = income.summary,
aes(y = avg.income, x = race, fill = gender))
# Pick colors for the bars
gender.colors <- c("#999999", "#CC79A7")
# Display barchart
p.income + geom_bar(stat = "identity", position = "dodge") +
ylab("Average annual individual income ($) in 2011") +
xlab("Respondents' race") +
guides(fill = guide_legend(title = "gender")) +
scale_fill_manual(values=gender.colors)
The table summarizes the average income of each group, and the bar chart shows the income difference visually. Here are some insights:
- Race seems to have an effect on the income difference between men and women. The gap is widest in the “Other” group and smallest in the Black group, where the average annual income of both men and women is below $35k.
- The direction of the income difference is consistent across all racial groups: on average, men earn more than women. Its size is not: in the “Other” group (referred to as white in this report), men earn on average about double what women earn, while the gaps in the Hispanic and Black groups are smaller.
- From the discussion above we can conclude that race is associated with the income difference between men and women.

##### 2. marital status
Here are some questions I will try to answer using the tabular summaries and graphics:
- Does marital status appear to have an effect on the income difference?
- Does the income difference appear to be consistent across marital statuses?
- What is the association between marital status and the income difference between men and women?
## Income and marital status (rows with missing marital status are dropped)
base.plot <- ggplot(data = subset(nlsy, !is.na(marstat.col_2000 )), aes(x = as.factor(marstat.col_2000 ), y = income)) +
xlab("marital status") +
ylab("Annual individual income ($) in 2011")
# Box plot
base.plot + geom_boxplot(aes(fill = gender))+
guides(fill = guide_legend(title = "gender")) +
scale_fill_manual(values=gender.colors)
## Warning: Removed 1521 rows containing non-finite values (stat_boxplot).
As we can observe from the grouped boxplots:
- Overall, married people earn more than people who never married or who have another marital status, such as divorced or widowed.
- Married men clearly earn more than never-married men. Women's income, however, looks similar across all marital statuses: married women earn slightly more than never-married women and the other group, but the difference is small.
- Marital status is associated with men’s income rather than women’s income.

##### 3. Summary table with multiple variables
The grouped boxplots show many high outliers that we have not yet studied well: the high earners who make more than $50k. How does the share of high earners among men and women vary by marital status and racial group?
Our main variable of interest here is high.income, which indicates whether the individual’s income was at least $50K. Anyone for whom high.income == 1 is considered a “high earner”. First, we will create a summary table showing how the proportion of high earners varies across all combinations of the following variables: gender, race, and marital status.
## First, we add a column indicating whether the person earns at least 50k; rows with missing marital status are filtered out.
nlsy <- mutate(nlsy, high.income = as.numeric(income >= 50000))
multi.value <- nlsy %>%
filter(!is.na(marstat.col_2000)) %>%
group_by(gender, race, marstat.col_2000) %>%
summarize(count = n(),
high.earn.rate = round(sum(high.income == 1, na.rm = TRUE)/count,3))  # count also includes respondents with missing income
multi.value
## # A tibble: 18 x 5
## # Groups: gender, race [6]
## gender race marstat.col_2000 count high.earn.rate
## <fct> <fct> <fct> <int> <dbl>
## 1 Male Other Married(spouse present) 1326 0.489
## 2 Male Other Never married 296 0.26
## 3 Male Other Other 366 0.265
## 4 Male Black Married(spouse present) 439 0.317
## 5 Male Black Never married 433 0.109
## 6 Male Black Other 308 0.182
## 7 Male Hispanic Married(spouse present) 419 0.394
## 8 Male Hispanic Never married 160 0.156
## 9 Male Hispanic Other 170 0.176
## 10 Female Other Married(spouse present) 1418 0.198
## 11 Female Other Never married 172 0.326
## 12 Female Other Other 481 0.168
## 13 Female Black Married(spouse present) 430 0.151
## 14 Female Black Never married 410 0.12
## 15 Female Black Other 418 0.151
## 16 Female Hispanic Married(spouse present) 447 0.186
## 17 Female Hispanic Never married 107 0.187
## 18 Female Hispanic Other 230 0.165
## Then we use kable to produce a nicely formatted table
kable(multi.value,digits = 3,
format = "markdown")
| gender | race | marstat.col_2000 | count | high.earn.rate |
|---|---|---|---|---|
| Male | Other | Married(spouse present) | 1326 | 0.489 |
| Male | Other | Never married | 296 | 0.260 |
| Male | Other | Other | 366 | 0.265 |
| Male | Black | Married(spouse present) | 439 | 0.317 |
| Male | Black | Never married | 433 | 0.109 |
| Male | Black | Other | 308 | 0.182 |
| Male | Hispanic | Married(spouse present) | 419 | 0.394 |
| Male | Hispanic | Never married | 160 | 0.156 |
| Male | Hispanic | Other | 170 | 0.176 |
| Female | Other | Married(spouse present) | 1418 | 0.198 |
| Female | Other | Never married | 172 | 0.326 |
| Female | Other | Other | 481 | 0.168 |
| Female | Black | Married(spouse present) | 430 | 0.151 |
| Female | Black | Never married | 410 | 0.120 |
| Female | Black | Other | 418 | 0.151 |
| Female | Hispanic | Married(spouse present) | 447 | 0.186 |
| Female | Hispanic | Never married | 107 | 0.187 |
| Female | Hispanic | Other | 230 | 0.165 |
The summary table shows the high-earner rate by racial group, marital status, and gender.
- Married white (“Other” race) men have the highest rate: 48.9 percent earn at least $50k, followed by married Hispanic men at 39.4 percent. Never-married Black men have the lowest rate, only 10.9 percent.
- Surprisingly, never-married white women have the highest rate among women, 32.6 percent, well above the 19.8 percent for married white women. The remaining rates do not vary much across the combinations of these variables.
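The rates above divide the number of high earners by the full group count, which also includes respondents whose income is missing. The sketch below, on a hypothetical `toy` data frame, contrasts that denominator with one restricted to respondents with a known income; it is an illustration of the choice, not a recomputation of the table.

```r
# Toy data standing in for nlsy (values are made up for illustration)
toy <- data.frame(
  gender = c("Male", "Male", "Male", "Male",
             "Female", "Female", "Female", "Female"),
  income = c(80000, 20000, NA, NA, 60000, 10000, 30000, NA)
)
toy$high.income <- as.numeric(toy$income >= 50000)

# Denominator = everyone in the group, as in the summary table above
rate.all <- tapply(toy$high.income, toy$gender,
                   function(x) sum(x == 1, na.rm = TRUE) / length(x))

# Denominator = only respondents whose income is known
rate.known <- tapply(toy$high.income, toy$gender,
                     function(x) mean(x, na.rm = TRUE))

print(round(rate.all, 3))
print(round(rate.known, 3))
```

When the share of missing incomes differs between groups, the two definitions can rank the groups differently, so it helps to state which one a table uses.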
income.plot <- ggplot(data = multi.value, aes(x = marstat.col_2000, y = high.earn.rate))
income.plot + geom_bar(stat="identity", aes(fill = race)) + facet_grid(gender ~ race) +
coord_flip() +
xlab("Marital status") +
ylab("Proportion earning over $50K in 2011") +
guides(fill = "none") +
theme(axis.text.x=element_text(angle = 90, hjust = 0))
The bar charts visualize the summary statistics and reconfirm our conclusion that married men in every racial group have a higher proportion earning over $50k than other men. We do not observe the same pattern among women, except that never-married white women have the highest high-earner rate of all female groups.
In this part, we examine the effect of education level on the income gap between men and women using a bar chart with error bars. The horizontal axis shows the highest education level the respondent completed (the edu_2012 variable), and the vertical axis shows the difference in average income between men and women, with 95% confidence intervals as error bars.
Here are some questions I am interested in:
Here’s a set of commands that calculates the difference in average income between men and women for each education level
## Income gap by education attainment level
gap.data.conf <- nlsy %>%
filter(!is.na(edu_2012)) %>%
group_by(edu_2012) %>%
summarize(income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE),
# conf.int[1] is the lower bound and conf.int[2] the upper bound
lower = t.test(income ~ gender)$conf.int[1],
upper = t.test(income ~ gender)$conf.int[2],
is.significant = as.numeric(t.test(income ~ gender)$p.value < 0.05))
kable(gap.data.conf ,digits = 3,
format = "markdown")
| edu_2012 | income.gap | lower | upper | is.significant |
|---|---|---|---|---|
| 3rd grade | 24742.857 | -141360.417 | 190846.13 | 0 |
| 4th grade | -3825.000 | -143553.453 | 135903.45 | 0 |
| 5th grade | 18666.667 | -23368.389 | 60701.72 | 0 |
| 6th grade | 16918.009 | 2815.127 | 31020.89 | 1 |
| 7th grade | 4972.159 | -4952.595 | 14896.91 | 0 |
| 8th grade | 11912.897 | 5399.298 | 18426.50 | 1 |
| 9th grade | 14059.536 | 5231.685 | 22887.39 | 1 |
| 10th grade | 9205.695 | 3867.041 | 14544.35 | 1 |
| 11th grade | 7851.805 | 2587.913 | 13115.70 | 1 |
| 12th grade | 14772.958 | 12473.567 | 17072.35 | 1 |
| 1st year college | 22300.396 | 14823.478 | 29777.31 | 1 |
| 2nd year college | 21538.275 | 14685.949 | 28390.60 | 1 |
| 3rd year college | 31804.522 | 21018.761 | 42590.28 | 1 |
| 4th year college | 52317.325 | 42056.756 | 62577.89 | 1 |
| 5th year college | 40139.301 | 22546.934 | 57731.67 | 1 |
| 6th year college | 72665.607 | 51028.668 | 94302.55 | 1 |
| 7th year college | 54057.965 | 15521.182 | 92594.75 | 1 |
| 8th year college | 93166.515 | 61428.526 | 124904.51 | 1 |
# Plot, with error bars
ggplot(data = gap.data.conf, aes(x = edu_2012, y = income.gap,
fill = is.significant)) +
geom_bar(position = "dodge", stat = "identity") +
xlab("The highest education degree completed in 2011") +
ylab("Average income gap between men and women($)") +
ggtitle("Income gap between men and women, by education attainment level") +
guides(fill = guide_legend(title = "significant level")) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.2) +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
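The per-level gap and confidence interval computed above can also be obtained with a single `t.test()` call per group. The sketch below uses base R and simulated data; the `toy` data frame and the `gap_with_ci` helper are hypothetical names introduced only for illustration.

```r
# Simulated data: two education levels, 20 men and 20 women in each
set.seed(2)
toy <- data.frame(
  edu    = rep(c("12th grade", "4th year college"), each = 40),
  gender = factor(rep(rep(c("Male", "Female"), each = 20), times = 2),
                  levels = c("Male", "Female")),
  income = c(rnorm(20, 45000, 8000),  rnorm(20, 30000, 8000),
             rnorm(20, 90000, 15000), rnorm(20, 50000, 15000))
)

# Run the Welch t-test once per group and keep the pieces we need
gap_with_ci <- function(d) {
  tt <- t.test(income ~ gender, data = d)
  data.frame(
    income.gap     = unname(tt$estimate[1] - tt$estimate[2]),
    lower          = tt$conf.int[1],   # first element is the lower bound
    upper          = tt$conf.int[2],   # second element is the upper bound
    is.significant = as.numeric(tt$p.value < 0.05)
  )
}

# One row per education level
gap.toy <- do.call(rbind, lapply(split(toy, toy$edu), gap_with_ci))
print(round(gap.toy))
```

Calling the test once per group keeps the gap, interval, and significance flag consistent with each other and avoids repeating the same computation three times.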
Now we test whether childhood poverty status mitigates or exacerbates the income gap between men and women, using the same approach: a bar chart of the gap by poverty status in 1979, with 95 percent confidence intervals as error bars.
## Income gap between men and women, by povstatus_1979 (missing poverty status is not filtered out here)
gap.data.conf <- nlsy %>%
group_by(povstatus_1979) %>%
summarize(income.gap = mean(income[gender == "Male"], na.rm = TRUE) - mean(income[gender == "Female"], na.rm = TRUE),
# conf.int[1] is the lower bound and conf.int[2] the upper bound
lower = t.test(income ~ gender)$conf.int[1],
upper = t.test(income ~ gender)$conf.int[2],
is.significant = as.numeric(t.test(income ~ gender)$p.value < 0.05))
## Warning: Factor `povstatus_1979` contains implicit NA, consider using
## `forcats::fct_explicit_na`
ggplot(data = gap.data.conf, aes(x = povstatus_1979, y = income.gap,
fill = is.significant)) +
geom_bar(position = "dodge", stat = "identity") +
xlab("Childhood Poverty Status") +
ylab("Income gap($)") +
ggtitle("Income gap between men and women, by childhood status") +
guides(fill = "none") +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12))
This time we found that the income gap between men and women appears to be statistically significant both for people who grew up in poverty and for those who did not. The error bars of the two groups do not overlap, which suggests that the size of the gap itself differs: the gap for people not in poverty (almost $30k) is much larger than for people who were in poverty in childhood (almost $10k). This indicates that among children who were not raised in poverty, men earn much more than women as adults, whereas the gap among children raised in poverty is much smaller.
- To investigate the income difference between men and women and the factors that influence the gap, we first compare the average income of men and women and test the significance of the difference. We then select variables of interest that might be related to it (gender, income, marital status, race, number of jobs, education level, poverty status). For example, we assume that married people will have a wider income gap than unmarried people because some women take the traditional role of housewife while men are the breadwinners. We compare marital status across ethnic groups to test whether the income difference trend is consistent.
- Most of the variables I chose are categorical, so I used bar charts and boxplots to examine the relationships. To better diagnose the income gap, I added error bars for the education level and poverty status groups. Both appear to be related to the income gap.
- Then I ran multiple regression models and added interaction terms, examining the estimates and p-values to assess significance.
- Diagnostic plots are used to examine the variance of the residuals and to look for clear patterns. Finally, we compare two regression models to test which one fits the dataset better.
- The National Longitudinal Survey of Youth 1979 has a lot of missing data because of its long time span. To analyze the survey responses, we first recode the negative values (refusal, don't know, valid skip, non-interview) as missing, then divide the variables into two categories: factor variables and numeric variables.
- For factor variables, we treat missing values as just another factor level. For example, marital status has around 4700 non-interview respondents, more than a third of all observations, so we code them as a “missing” level. Missingness can sometimes be informative or predictive, leading to a significant coefficient for the missing level. For example, when we ran a logistic regression that used “missing” as a level for individuals whose childhood poverty status is unknown, poverty status = missing was likely to be associated with a high income gap.
- For numeric variables, we simply recode negative values to NA.
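The two recoding rules above can be sketched in base R as follows. The `raw` data frame and its negative codes are made up for illustration; the real NLSY79 codes and column names may differ.

```r
# Toy extract with negative codes standing for refusal, skip, non-interview
raw <- data.frame(
  marstat  = c(1, 2, -5, 3, -4, 2),
  jobs.num = c(3, -1, 7, 12, -2, 5)
)

# Rule for every variable: negative codes become NA
raw[raw < 0] <- NA

# Rule for factor variables: keep missingness as its own level
marstat <- factor(raw$marstat,
                  levels = c(2, 1, 3),
                  labels = c("Married(spouse present)", "Never married", "Other"))
marstat <- addNA(marstat)                              # add NA as a level
levels(marstat)[is.na(levels(marstat))] <- "missing"   # give it a label

print(table(marstat))
print(raw$jobs.num)   # numeric variable simply keeps NA
```

Making the missing level explicit means those rows stay in a regression instead of being dropped, at the cost of one extra coefficient to interpret.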
- The income variable available to us is topcoded: for the top 2% of earners, the actual income is not reported for privacy reasons. Instead, their income is recoded as the average of the top 2% of incomes. That is why many graphs show a row of horizontally aligned outliers.
- Capping is introduced so that the model does not learn to associate extremely high incomes with the outcome variable. At the same time, these high earners may not be outliers on other variables (say, the number of children in the household). It is better to keep the record for the person and cap only outlier variables such as income. Including their data in the regression analysis helps us get closer to the “true” estimates.
- So here we compare the original income (income.orig) with the topcoded income (income) in the regression model used in 3. Findings, part (a).
###truncated income
nlsy.lm.cap <- lm(income ~ + gender + povstatus_1979 + race + jobs.num, na.action=na.exclude, data = nlsy)
# Pull coefficients element from summary(lm) object
round(summary(nlsy.lm.cap)$coef, 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68899.881 1881.970 36.611 0.000
## genderFemale -24414.487 1455.576 -16.773 0.000
## povstatus_1979in poverty -17175.492 1830.209 -9.384 0.000
## raceBlack -16488.248 1771.348 -9.308 0.000
## raceHispanic -7995.504 1972.889 -4.053 0.000
## jobs.num -513.028 161.723 -3.172 0.002
###original income
nlsy.lm.ori <- lm(income.orig ~ + gender + povstatus_1979 + race + jobs.num, na.action=na.exclude, data = nlsy)
# Pull coefficients element from summary(lm) object
round(summary(nlsy.lm.ori)$coef, 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17509.835 9808.865 1.785 0.081
## genderFemale -3725.323 5630.802 -0.662 0.512
## povstatus_1979in poverty -4947.737 6685.992 -0.740 0.463
## raceBlack -1030.302 6543.279 -0.157 0.876
## raceHispanic -3218.488 8789.524 -0.366 0.716
## jobs.num 770.167 787.530 0.978 0.334
- The first table uses the topcoded income and the second uses the original values of income earned in 2011. With the original income, the intercept and the coefficient estimates are much smaller in magnitude and the standard errors are much larger, so none of the estimates is statistically significant (all p-values are above 0.05). With the topcoded income, every coefficient is estimated precisely enough to be significant. We therefore use the topcoded income in the regression models that test whether there is a statistically significant income gap between genders.
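The topcoding rule described in this section (replace the top 2% of incomes with their average) can be sketched as below. The `topcode` helper and the quantile-based cutoff are assumptions made for illustration, not the survey's actual procedure.

```r
# Replace the top share of values with the mean of those values
topcode <- function(x, top.share = 0.02) {
  cutoff <- quantile(x, probs = 1 - top.share, na.rm = TRUE)
  is.top <- !is.na(x) & x > cutoff
  x[is.top] <- mean(x[is.top])
  x
}

# 100 made-up incomes: 98 ordinary values and two very high earners
income.raw    <- c(seq(10000, 98000, length.out = 98), 500000, 900000)
income.capped <- topcode(income.raw)

print(tail(income.raw, 3))
print(tail(income.capped, 3))
```

Because the top values are replaced by their own mean, the overall mean of the variable is unchanged while the extreme values collapse onto one number, which is what produces the horizontally aligned points in the plots.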
- In the data summaries, marital status is a variable I expected to be highly correlated with the income difference. Surprisingly, only being married is associated with a large income gap, whereas the other marital statuses (divorced, widowed) show little difference in income between genders. One possible explanation is that married women carry more of the family burden and tend to earn less than their working husbands. Among never-married white women, the share earning over $50k is even higher than for women of any other marital status. This finding makes me rethink the assumption that “marriage leads to success in career and family”: it seems to apply to men, not women.
nlsy.lm2.interact <- lm(income ~ + gender + edu_2012 + gender*edu_2012 , na.action=na.exclude, data = nlsy)
summary(nlsy.lm2.interact)
##
## Call:
## lm(formula = income ~ +gender + edu_2012 + gender * edu_2012,
## data = nlsy, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -165951 -22043 -6055 14598 322349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34000 34155 0.995 0.319543
## genderFemale -24743 38728 -0.639 0.522915
## edu_20124th grade -18825 41831 -0.450 0.652705
## edu_20125th grade -15333 44094 -0.348 0.728042
## edu_20126th grade -12818 37130 -0.345 0.729937
## edu_20127th grade -23619 36227 -0.652 0.514440
## edu_20128th grade -18052 34729 -0.520 0.603210
## edu_20129th grade -12519 34512 -0.363 0.716820
## edu_201210th grade -18273 34574 -0.529 0.597151
## edu_201211th grade -16021 34472 -0.465 0.642126
## edu_201212th grade 1594 34176 0.047 0.962805
## edu_20121st year college 16073 34283 0.469 0.639203
## edu_20122nd year college 18006 34264 0.526 0.599250
## edu_20123rd year college 26523 34394 0.771 0.440651
## edu_20124th year college 65373 34240 1.909 0.056268
## edu_20125th year college 53995 34579 1.561 0.118455
## edu_20126th year college 92561 34445 2.687 0.007223
## edu_20127th year college 90460 34906 2.592 0.009573
## edu_20128th year college 131950 34574 3.816 0.000137
## genderFemale:edu_20124th grade 28568 57006 0.501 0.616289
## genderFemale:edu_20125th grade 6076 58686 0.104 0.917540
## genderFemale:edu_20126th grade 7825 42697 0.183 0.854596
## genderFemale:edu_20127th grade 19771 41854 0.472 0.636671
## genderFemale:edu_20128th grade 12830 40031 0.321 0.748596
## genderFemale:edu_20129th grade 10683 39508 0.270 0.786852
## genderFemale:edu_201210th grade 15537 39528 0.393 0.694282
## genderFemale:edu_201211th grade 16891 39400 0.429 0.668149
## genderFemale:edu_201212th grade 9970 38768 0.257 0.797054
## genderFemale:edu_20121st year college 2442 38925 0.063 0.949969
## genderFemale:edu_20122nd year college 3205 38897 0.082 0.934343
## genderFemale:edu_20123rd year college -7062 39075 -0.181 0.856593
## genderFemale:edu_20124th year college -27574 38871 -0.709 0.478115
## genderFemale:edu_20125th year college -15396 39340 -0.391 0.695540
## genderFemale:edu_20126th year college -47923 39160 -1.224 0.221082
## genderFemale:edu_20127th year college -29315 39795 -0.737 0.461360
## genderFemale:edu_20128th year college -68424 39554 -1.730 0.083698
##
## (Intercept)
## genderFemale
## edu_20124th grade
## edu_20125th grade
## edu_20126th grade
## edu_20127th grade
## edu_20128th grade
## edu_20129th grade
## edu_201210th grade
## edu_201211th grade
## edu_201212th grade
## edu_20121st year college
## edu_20122nd year college
## edu_20123rd year college
## edu_20124th year college .
## edu_20125th year college
## edu_20126th year college **
## edu_20127th year college **
## edu_20128th year college ***
## genderFemale:edu_20124th grade
## genderFemale:edu_20125th grade
## genderFemale:edu_20126th grade
## genderFemale:edu_20127th grade
## genderFemale:edu_20128th grade
## genderFemale:edu_20129th grade
## genderFemale:edu_201210th grade
## genderFemale:edu_201211th grade
## genderFemale:edu_201212th grade
## genderFemale:edu_20121st year college
## genderFemale:edu_20122nd year college
## genderFemale:edu_20123rd year college
## genderFemale:edu_20124th year college
## genderFemale:edu_20125th year college
## genderFemale:edu_20126th year college
## genderFemale:edu_20127th year college
## genderFemale:edu_20128th year college .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48300 on 6986 degrees of freedom
## (5664 observations deleted due to missingness)
## Multiple R-squared: 0.2558, Adjusted R-squared: 0.2521
## F-statistic: 68.63 on 35 and 6986 DF, p-value: < 2.2e-16
- We also tried to involve other factor variables in our regression models, such as the interaction between education level and gender. None of the individual genderFemale:edu_2012 coefficients is significant at the 5% level, while the main effects of the highest education levels are. Read this way, education attainment is positively related to income, but we do not find evidence that it changes the income gap between men and women, so we cannot say that it exacerbates or mitigates the gap. This is the difference between significant main effects and significant interactions. Note that each interaction coefficient is measured against the 3rd-grade baseline, which is estimated very imprecisely (standard errors above $34k), so the individual p-values are a weak test of the interaction as a whole.
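A joint test of all interaction terms at once is one way to check an interaction without depending on the baseline level. The sketch below compares nested models with `anova()` on simulated data; the `toy` data frame and its effect sizes are assumptions, so its result says nothing about the NLSY79 data.

```r
# Simulated data with a gender gap that is wider at the highest education level
set.seed(3)
n <- 400
toy <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  edu    = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                  levels = c("low", "mid", "high"))
)
toy$income <- 30000 +
  10000 * (toy$edu == "mid") +
  30000 * (toy$edu == "high") -
  5000  * (toy$gender == "Female") -
  15000 * (toy$gender == "Female" & toy$edu == "high") +
  rnorm(n, sd = 10000)

fit.main <- lm(income ~ gender + edu, data = toy)   # main effects only
fit.int  <- lm(income ~ gender * edu, data = toy)   # adds gender:edu terms

# F-test of all interaction coefficients together
joint.test <- anova(fit.main, fit.int)
print(joint.test)
```

The F-test asks whether the gap differs across education levels at all, which is the question of interest, rather than whether each level differs from one sparsely populated baseline.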
Assessment of Significance of income difference between men and women - Our study finds that men on average earn 23907.4 dollars more than women annually (t-statistic 18.03, p < 0.001, 95% CI [21308.4, 26506.4]).
We include poverty status, race, gender, and jobs.num in our linear regression model. We will try to answer the following questions:
- What is the interpretation of the coefficient of jobs.num in this model?
- What is the interpretation of the coefficient of poverty status?
options(scipen=4) # Set scipen = 0 to get back to default
# Fit regression model
nlsy.lm <- lm(income ~ + gender + povstatus_1979 + race + jobs.num, na.action=na.exclude, data = nlsy)
# Regression model summary
summary.nlsy.lm <- summary(nlsy.lm)
# Pull coefficients element from summary(lm) object
kable(summary.nlsy.lm$coef,
digits = c(2,2,2,5 ), format = 'markdown')
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 68899.88 | 1881.97 | 36.61 | 0.00000 |
| genderFemale | -24414.49 | 1455.58 | -16.77 | 0.00000 |
| povstatus_1979in poverty | -17175.49 | 1830.21 | -9.38 | 0.00000 |
| raceBlack | -16488.25 | 1771.35 | -9.31 | 0.00000 |
| raceHispanic | -7995.50 | 1972.89 | -4.05 | 0.00005 |
| jobs.num | -513.03 | 161.72 | -3.17 | 0.00152 |
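One way to read a coefficient table like this is as a difference between two predictions that differ in a single variable. The sketch below shows this on simulated data with a smaller, hypothetical model; the variable names mirror the report, but the numbers are made up.

```r
# Simulated data with a gender effect and a jobs.num slope
set.seed(4)
n <- 300
toy <- data.frame(
  gender   = factor(sample(c("Male", "Female"), n, replace = TRUE),
                    levels = c("Male", "Female")),
  jobs.num = sample(1:20, n, replace = TRUE)
)
toy$income <- 60000 - 20000 * (toy$gender == "Female") -
  500 * toy$jobs.num + rnorm(n, sd = 15000)

fit <- lm(income ~ gender + jobs.num, data = toy)

# Two people with the same number of jobs who differ only in gender
same.jobs <- data.frame(
  gender   = factor(c("Male", "Female"), levels = c("Male", "Female")),
  jobs.num = c(10, 10)
)
pred     <- predict(fit, newdata = same.jobs)
pred.gap <- unname(pred[2] - pred[1])   # female minus male prediction

print(pred.gap)
print(coef(fit)["genderFemale"])
```

The predicted difference equals the genderFemale coefficient, which is what "holding the other variables fixed" means in an additive model.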
- Because there are three categorical variables in our regression model, we need to specify the baseline before interpreting the coefficients. The baseline is a white (“Other” race) man who was not in poverty during childhood and has held zero jobs. The estimate for raceBlack means that, holding the other variables fixed, estimated income is 16488 dollars lower for Black respondents than for white respondents. Similarly, the estimate for genderFemale means that estimated income is 24414 dollars lower for women than for men.
- For two people of the same race, gender, and poverty status, every additional prior job is on average associated with a $513 decrease in income.
- Among people of the same race and gender who have previously held the same number of jobs, people who were raised in poverty earn on average $17175 less than people who were not.
- Looking at the p-values, gender, povstatus_1979 (childhood poverty status in 1979), race, and number of jobs are all statistically significant predictors of individual income.

#### (b) Diagnostic tools to assess whether the linear model is appropriate.
- The first two plots are the most important, but the last two can also help identify outliers and non-linearities.
- Residuals vs. Fitted plot: there is clear non-linearity in the plot. The variance appears to increase with the fitted value, giving a horizontal funnel shape, and there are plenty of outlying residuals.
- Normal Q-Q plot: the normality assumption does not hold here; the residuals appear highly non-normal. Both the lower and upper tails are heavier than we would expect under normality, and the isolated points in the upper tail correspond to the outliers in the Residuals vs. Fitted plot.
- Scale-Location plot: there is a slight indication of non-constant (heteroskedastic) variance.
- Residuals vs. Leverage plot: there appear to be clear outliers in the data.
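The four diagnostic plots discussed above are the default output of `plot()` on an `lm` object. The sketch below produces them for a simulated model whose noise grows with the predictor, which gives the same funnel shape; the data are made up, and the plots are sent to a null device so the code runs without a display.

```r
# Simulated data whose noise grows with x, giving a funnel-shaped residual plot
set.seed(5)
x <- runif(200, 1, 10)
y <- 1000 * x + rnorm(200, sd = 300 * x)
fit <- lm(y ~ x)

# Remove pdf(NULL) and dev.off() to see the plots on screen
pdf(NULL)
op <- par(mfrow = c(2, 2))   # 2 x 2 grid for the four default plots
plot(fit)                    # residuals vs fitted, Q-Q, scale-location, leverage
par(op)
dev.off()

# A rough numeric companion to the funnel: do absolute residuals grow with fit?
spread.check <- cor(abs(resid(fit)), fitted(fit))
print(spread.check)
```

A clearly positive correlation between absolute residuals and fitted values points the same way as the funnel in the first plot.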
nlsy.lm.interact <- update(nlsy.lm, . ~ . + race*gender)
summary.nlsy.lm.int <- summary(nlsy.lm.interact)
# Pull coefficients element from summary(lm) object
kable(summary.nlsy.lm.int$coef,
digits = c(2,2,2,5 ), format = 'markdown')
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 73514.25 | 1973.08 | 37.26 | 0.00000 |
| genderFemale | -34925.51 | 2032.41 | -17.18 | 0.00000 |
| povstatus_1979in poverty | -17844.20 | 1822.62 | -9.79 | 0.00000 |
| raceBlack | -29428.94 | 2429.40 | -12.11 | 0.00000 |
| raceHispanic | -15107.16 | 2760.96 | -5.47 | 0.00000 |
| jobs.num | -416.85 | 161.38 | -2.58 | 0.00982 |
| genderFemale:raceBlack | 25739.18 | 3323.67 | 7.74 | 0.00000 |
| genderFemale:raceHispanic | 14444.32 | 3823.94 | 3.78 | 0.00016 |
# Next we run an anova to compare the model without the interaction to the model with the race:gender interaction
anova(nlsy.lm, nlsy.lm.interact)
## Analysis of Variance Table
##
## Model 1: income ~ +gender + povstatus_1979 + race + jobs.num
## Model 2: income ~ gender + povstatus_1979 + race + jobs.num + gender:race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5461 15565922814681
## 2 5459 15392073630859 2 173849183822 30.829 4.855e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The interaction adds two estimated coefficients to the model. For example, the coefficient of genderFemale:raceBlack indicates that, among people of the same poverty status and the same number of jobs, the average income gap between men and women is 25739 dollars smaller in the Black group than in the “Other” group.
- Looking at the p-values, even after adjusting for poverty status and number of jobs, there is a statistically significant difference in the income gap between men and women across racial groups. The income gap in the “Other” group is much wider than in the Black and Hispanic groups.
- The interaction between race and gender turns out to be a highly statistically significant predictor of income in the model, and the anova F-test (p = 4.855e-14) favors the model with the interaction. Once we control for poverty status and number of prior jobs, our data are not consistent with the income gap between men and women being the same across racial groups in the U.S.
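The reading of the interaction coefficient as a difference between two gender gaps can be checked on a small example. In the sketch below the model contains only gender, race, and their interaction, so the coefficient equals the difference in gaps computed from the cell means; the eight incomes are made up and the model is simpler than the one in this report.

```r
# Made-up incomes for a 2 x 2 design (two observations per cell)
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), times = 4),
                  levels = c("Male", "Female")),
  race   = factor(rep(c("Other", "Black"), each = 4),
                  levels = c("Other", "Black")),
  income = c(70000, 34000, 68000, 33000, 33000, 24000, 32000, 25000)
)

fit <- lm(income ~ gender * race, data = toy)

# Gender gap (female minus male) within each racial group, from cell means
cell      <- tapply(toy$income, list(toy$gender, toy$race), mean)
gap.other <- cell["Female", "Other"] - cell["Male", "Other"]
gap.black <- cell["Female", "Black"] - cell["Male", "Black"]

print(gap.black - gap.other)
print(coef(fit)["genderFemale:raceBlack"])
```

A positive interaction coefficient here means the female-minus-male gap is less negative in the second group, that is, the gap is narrower there.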
- I have around 70 percent confidence in my analysis, and I consider the main results statistically significant. I believe that race, childhood poverty status, and marital status are related to the income gap between men and women, because multiple tests and graphs with error bars support this view. I hope these findings can help establish more women's empowerment policies that benefit married women and people of different races who were raised in poverty.