Introduction

The question we are interested in answering is “Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?”

Income gaps between men and women have long been reported, and we want to parse out which other factors contribute to them, and by how much. To answer this question, we first explore the data to see which variables might be worth including in a regression model. We then fit the regression model, plot its diagnostics, and test which variables appear most significant. Finally, we run ANOVA tests to see which interaction terms contribute most to the model, and plot the diagnostics again for the regression model with the most significant interaction term.

We find that race, education level, number of previous jobs, and criminal charge history, along with their interactions with gender, are significant predictors of 2012 income.

Importing and Cleaning the Data

First we import the data, change the column names to something more understandable, and print out the first 6 rows to check the data.

## # A tibble: 6 x 67
##      pk caseid birthisus birthissouth isforlang forlang issouth14
##   <dbl>  <dbl>     <dbl>        <dbl>     <dbl>   <dbl>     <dbl>
## 1   445      1         1            0         1       4         0
## 2   445      2         2           -4         1       4        -4
## 3   445      3         1            0         0      -4         0
## 4   445      4         1            0         0      -4        -3
## 5   445      5         1            0         0      -4         0
## 6   445      6         1            0         0      -4         0
## # ... with 60 more variables: urbanrural14 <dbl>, religion <dbl>,
## #   expedu <dbl>, ismilitary <dbl>, faplace <dbl>, fatime <dbl>,
## #   fauseful <dbl>, fajuvie <dbl>, faroles <dbl>, fashare <dbl>,
## #   fahappier <dbl>, expoccup1979 <dbl>, isedu5yrs <dbl>, race <dbl>,
## #   gender <dbl>, maritalstatus1979 <dbl>, famsize1979 <dbl>,
## #   ispoverty1979 <dbl>, ispolicestop <dbl>, agepolicestop <dbl>,
## #   ischarged <dbl>, agealc <dbl>, numweed <dbl>, ageweed <dbl>,
## #   totinc1990 <dbl>, ispoverty1990 <dbl>, edu1990 <dbl>,
## #   numjobs1990 <dbl>, numchild1990 <dbl>, youngchild1990 <dbl>,
## #   numcoke <dbl>, agecoke <dbl>, typeemp2000 <dbl>, typeoccup2000 <dbl>,
## #   spouseoccup2000 <dbl>, spousehrsweek2000 <dbl>, numchildren2000 <dbl>,
## #   totinc2000 <dbl>, famsize2000 <dbl>, totfaminc2000 <dbl>,
## #   ispoverty2000 <dbl>, marstatcollap2000 <dbl>, maritalstatus2000 <dbl>,
## #   mnthsmarraigebirth <dbl>, dobpartner2012 <dbl>, typeemp2012 <dbl>,
## #   typeoccup2012 <dbl>, spouseoccup2012 <dbl>, numwksspouse2012 <dbl>,
## #   numdrinks2012 <dbl>, totinc2012 <dbl>, estinc20121 <dbl>,
## #   estinc20122 <dbl>, totincspouse2012 <dbl>, estincspouse2012 <dbl>,
## #   famsize2012 <dbl>, region2012 <dbl>, edu2012 <dbl>,
## #   urbanrural2012 <dbl>, numjobs2012 <dbl>

Looking at this data, it is clear that we need to recode some of the variables into the format we need. We recode the variables needed for a good exploratory analysis and skip the ones we are not interested in exploring. In particular, we recode as factors the family attitudes questions (faplace, fatime, fauseful, fajuvie, faroles, fashare, fahappier), race (race), gender (gender), the crime questions (ispolicestop, ischarged), and education level in 2012 (edu2012), and we recode as numeric the total incomes in 1990 (totinc1990), 2000 (totinc2000), and 2012 (totinc2012) and the number of previous jobs in 2012 (numjobs2012).

totinc2012 will be our outcome variable, which we model against the rest of these covariates. For those variables we also recoded the nonrespondents to NA. We will look at how we handle those missing values in the analysis section.

totinc2012, totinc1990, and totinc2000 are also all topcoded, which means that the top 2% of responses all share the same recorded value, which is the maximum in the data. We will look at how to deal with these values later in the analysis.

  • Note that we recode fauseful and fashare so that a higher factor level corresponds to a more sexist response to those questions.

Variable Name | Question Asked on Survey | Recoded Responses
totinc2012 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2012) | numeric, topcoded
totinc1990 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (1990) | numeric, topcoded
totinc2000 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2000) | numeric, topcoded
numjobs2012 | NUMBER OF DIFFERENT JOBS EVER REPORTED AS OF INTERVIEW DATE | numeric
edu2012 | HIGHEST GRADE COMPLETED AS OF MAY 1 SURVEY YEAR | 0-5, 93, 94=“Elementary”, 6-8=“Middle”, 9-12=“High”, 13-16=“College”, 17-20=“Post College”, 95=“Unknown”
race | R’S RACIAL/ETHNIC COHORT FROM SCREENER | 3=“Other”, 2=“Black”, 1=“Hispanic”
gender | SEX OF R | 1=“Male”, 2=“Female”
ispolicestop | EVER “STOPPED” BY POLICE FOR OTHER THAN MINOR TRAFFIC OFFENSE? | 0=“No”, 1=“Yes”
ischarged | EVER CHARGED WITH ILLEGAL ACTIVITY? 80 INT (EXC MINOR TRAFFIC OFFENSE) | 0=“No”, 1=“Yes”
faplace | FAMILY ATTITUDES - WOMAN’S PLACE IS IN THE HOME? | 1=“Strongly Disagree”, 2=“Disagree”, 3=“Agree”, 4=“Strongly Agree”
fatime | FAMILY ATTITUDES - WIFE WITH FAMILY HAS NO TIME FOR OTHER EMPLOYMENT? | 1=“Strongly Disagree”, 2=“Disagree”, 3=“Agree”, 4=“Strongly Agree”
fauseful | FAMILY ATTITUDES - WORKING WIFE FEELS MORE USEFUL? | 4=“Strongly Agree”, 3=“Agree”, 2=“Disagree”, 1=“Strongly Disagree”
fajuvie | FAMILY ATTITUDES - EMPLOYMENT OF WIVES LEADS TO JUVENILE DELINQUENCY? | 1=“Strongly Disagree”, 2=“Disagree”, 3=“Agree”, 4=“Strongly Agree”
faroles | FAMILY ATTITUDES - TRADITIONAL HUSBAND/WIFE ROLES BEST? | 1=“Strongly Disagree”, 2=“Disagree”, 3=“Agree”, 4=“Strongly Agree”
fashare | FAMILY ATTITUDES - MEN SHOULD SHARE HOUSEWORK? | 4=“Strongly Agree”, 3=“Agree”, 2=“Disagree”, 1=“Strongly Disagree”
fahappier | FAMILY ATTITUDES - WOMEN ARE HAPPIER IN TRADITIONAL ROLES? | 1=“Strongly Disagree”, 2=“Disagree”, 3=“Agree”, 4=“Strongly Agree”
## # A tibble: 6 x 67
##      pk caseid birthisus birthissouth isforlang forlang issouth14
##   <dbl>  <dbl> <fct>     <fct>        <fct>     <fct>   <fct>    
## 1   445      1 US        Not South    Yes       Other   No       
## 2   445      2 Not US    -4           Yes       Other   -4       
## 3   445      3 US        Not South    No        -4      No       
## 4   445      4 US        Not South    No        -4      -3       
## 5   445      5 US        Not South    No        -4      No       
## 6   445      6 US        Not South    No        -4      No       
## # ... with 60 more variables: urbanrural14 <fct>, religion <fct>,
## #   expedu <fct>, ismilitary <fct>, faplace <fct>, fatime <fct>,
## #   fauseful <fct>, fajuvie <fct>, faroles <fct>, fashare <fct>,
## #   fahappier <fct>, expoccup1979 <dbl>, isedu5yrs <dbl>, race <fct>,
## #   gender <fct>, maritalstatus1979 <dbl>, famsize1979 <dbl>,
## #   ispoverty1979 <dbl>, ispolicestop <fct>, agepolicestop <dbl>,
## #   ischarged <fct>, agealc <dbl>, numweed <dbl>, ageweed <dbl>,
## #   totinc1990 <dbl>, ispoverty1990 <dbl>, edu1990 <dbl>,
## #   numjobs1990 <dbl>, numchild1990 <dbl>, youngchild1990 <dbl>,
## #   numcoke <dbl>, agecoke <dbl>, typeemp2000 <dbl>, typeoccup2000 <dbl>,
## #   spouseoccup2000 <dbl>, spousehrsweek2000 <dbl>, numchildren2000 <dbl>,
## #   totinc2000 <dbl>, famsize2000 <dbl>, totfaminc2000 <dbl>,
## #   ispoverty2000 <dbl>, marstatcollap2000 <dbl>, maritalstatus2000 <dbl>,
## #   mnthsmarraigebirth <dbl>, dobpartner2012 <dbl>, typeemp2012 <dbl>,
## #   typeoccup2012 <dbl>, spouseoccup2012 <dbl>, numwksspouse2012 <dbl>,
## #   numdrinks2012 <dbl>, totinc2012 <dbl>, estinc20121 <dbl>,
## #   estinc20122 <dbl>, totincspouse2012 <dbl>, estincspouse2012 <dbl>,
## #   famsize2012 <dbl>, region2012 <fct>, edu2012 <fct>,
## #   urbanrural2012 <fct>, numjobs2012 <dbl>
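
The recoding described above can be sketched roughly as follows. This is a minimal illustration on an invented four-row data frame (raw and nlsy_toy are hypothetical names, not the objects used in this report), and it assumes that negative codes mark nonresponse and that the “Unknown” education code can be treated as missing.

```r
# Toy stand-in for a few raw columns; the values are invented for illustration.
raw <- data.frame(
  gender     = c(1, 2, 2, 1),
  edu2012    = c(12, 16, 95, -5),
  fashare    = c(4, 1, 2, -3),
  totinc2012 = c(40000, -4, 25000, 0)
)

# Assumption: negative codes mark nonresponse, so convert them to NA first.
raw[raw < 0] <- NA

nlsy_toy <- raw
nlsy_toy$gender <- factor(raw$gender, levels = c(1, 2),
                          labels = c("Male", "Female"))

# Collapse highest grade completed into the bands from the table.
grade <- raw$edu2012
grade[grade %in% c(93, 94)] <- 0   # the table maps these codes to "Elementary"
grade[grade %in% 95] <- NA         # assumption: treat "Unknown" as missing
nlsy_toy$edu2012 <- cut(grade, breaks = c(-Inf, 5, 8, 12, 16, 20),
                        labels = c("Elementary", "Middle", "High",
                                   "College", "Post College"))

# Reverse the level order so a higher level means a more sexist response.
nlsy_toy$fashare <- factor(raw$fashare, levels = 4:1,
                           labels = c("Strongly Agree", "Agree",
                                      "Disagree", "Strongly Disagree"))
```

Listing the levels of fashare in reverse order is what makes “Strongly Disagree” the highest level, which matches the direction of the other family attitudes questions.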

Data Summary & Methodology

We’ll start by doing some exploratory analysis of the variables we’re interested in. As we work through the data, we’ll explore what the graphs and plots look like when we leave the topcoded values in versus when we don’t, and at each stage we’ll remove the missing values only for the variables we’re graphing. This keeps the plots readable and ensures that what we’re seeing in them makes sense, but it also means that our findings are only generalizable to the people who answered that specific question. Survey nonresponse error is a huge issue for generalizability, and we need to take it into account when interpreting our findings later.

Introduction

We first want to just get a sense of where our data is on the variables we’re interested in exploring - gender and totinc2012.

We first create simple count tables to see how many people of each gender we have in our responses, and then partition those tables by both race and education to see what that distribution looks like.

Count by Gender
Gender Count
Male 6403
Female 6283
Count by Race & Gender
Race Gender Count
Other Male 3790
Other Female 3720
Black Male 1613
Black Female 1561
Hispanic Male 1000
Hispanic Female 1002
Count by Education Level & Gender
Education Level Gender Count
Elementary Male 9
Elementary Female 13
Middle Male 90
Middle Female 83
High Male 1936
High Female 1753
College Male 1153
College Female 1486
Post College Male 336
Post College Female 442
NA Male 2879
NA Female 2506

We see that there are roughly equal numbers of men and women in our data, which is important for further statistical analysis, but that some of the education subsets have fewer than 30 people, which is also important to note. T-tests are less reliable on groups of fewer than 30 people, so we need to do further data cleaning before using edu2012 in our analysis.

Tables are sometimes hard to read so we can also display those as plots:

Another important feature of this data is the topcoded values. This table shows how many men and women are topcoded, which will be important to know when deciding whether to remove them. Neither group has a large number of topcoded values, so we may be able to remove them safely. We’ll continue exploring this later, but it is interesting to note that there are far more men than women among the topcoded values.

Number of Topcoded Values by Gender
Gender Num Topcoded
Male 131
Female 12
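
A count like the one above could be produced along these lines. This is only a sketch on invented data: toy, top_value, and is_topcoded are hypothetical names, the income figures are made up, and it assumes that topcoded rows are exactly those equal to the column maximum.

```r
# Invented example data; 300000 plays the role of the topcode value.
toy <- data.frame(
  gender     = factor(c("Male", "Male", "Female", "Male", "Female", "Female"),
                      levels = c("Male", "Female")),
  totinc2012 = c(300000, 52000, 300000, 300000, 18000, NA)
)

# Assumption: topcoded rows are those equal to the column maximum.
top_value <- max(toy$totinc2012, na.rm = TRUE)
toy$is_topcoded <- !is.na(toy$totinc2012) & toy$totinc2012 == top_value

# Number of topcoded values by gender.
topcoded_by_gender <- table(toy$gender[toy$is_topcoded])

# Data with the topcoded rows removed (missing incomes are kept here).
no_topcoded <- toy[!toy$is_topcoded, ]
```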

The outcome variable we’re interested in is totinc2012, so let’s look at what it looks like across gender, with both a boxplot and a histogram. Here, we’ve plotted the graphs without the topcoded values on the left, and the graphs with the topcoded values on the right. We can see that men seem to be making more than women on average, that there seem to be more women than men at the bottom of the income spectrum, and that the topcoded values seem to be outliers for both genders. When we take them out, the data looks more spread out and the gap between men and women looks less pronounced. It’s also easier to see trends in the data with them taken out.

Now that we have a basic idea of what our data looks like across the variables we are interested in, we turn to a more nuanced analysis of each variable and its relationship with income gaps between men and women. Unless specified otherwise, in each of these analyses I will be taking the topcoded values out in order to better showcase the trends in the data.

Race & Gender

Race and gender are incredibly interrelated with one another. Here we first explore in a table and then a plot what income gaps for men and women look like across race.

Income gaps seem smallest for black people, which is interesting, and largest for the “other” category, which makes sense since “other” covers every race except black and hispanic, and that’s a lot of races and therefore a lot of variation. The gaps are also statistically significant for each racial group.

Income Gaps by Race
Race Income Gap Lower Bound Upper Bound Significant?
Other 19558.80 17065.65 22051.95 TRUE
Black 5402.63 2841.95 7963.31 TRUE
Hispanic 11124.96 7596.31 14653.60 TRUE
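
For illustration, here is one way a table of gaps like this could be computed. It is a sketch on simulated data, assuming the gap is the difference in mean income (male minus female) with a 95% confidence interval from a Welch two-sample t-test; gap_by_group and the toy data frame are hypothetical and not the code behind the table above.

```r
set.seed(1)
n <- 200
# Simulated data: men earn about 15000 more on average in every group.
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = n / 2),
                  levels = c("Male", "Female")),
  race   = factor(rep(c("Other", "Black"), times = n / 2))
)
toy$income <- 30000 + 15000 * (toy$gender == "Male") + rnorm(n, sd = 5000)

# Hypothetical helper: one t-test per level of `group`, returning the
# male-minus-female gap, its confidence bounds, and a significance flag.
gap_by_group <- function(data, group) {
  rows <- lapply(split(data, data[[group]]), function(d) {
    tt <- t.test(income ~ gender, data = d)
    data.frame(gap = unname(tt$estimate[1] - tt$estimate[2]),
               lower = tt$conf.int[1],
               upper = tt$conf.int[2],
               significant = tt$p.value < 0.05)
  })
  cbind(group = names(rows), do.call(rbind, rows))
}

gaps <- gap_by_group(toy, "race")
```

The same helper could be pointed at any other grouping column, which is how the education and crime tables in the following sections could be built under the same assumptions.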

Education & Gender

We next do the same with education and gender. Except for the Elementary category (which we’ll remember did not have >30 people in that group when broken down by gender), income gaps are significant across education categories, and seem to be largest during college and post-college.

Income Gaps by Education Level
Education Level Income Gap Lower Bound Upper Bound Significant?
Elementary 12614.53 -1737.77 26966.83 FALSE
Middle 11108.67 6094.88 16122.45 TRUE
High 11774.34 10031.51 13517.18 TRUE
College 21577.29 18627.23 24527.36 TRUE
Post College 21588.06 14887.29 28288.83 TRUE

Number of Previous Jobs & Gender

This variable is another one where we test to see what leaving in or taking out the topcoded values will do for our analysis.

We can see the difference that taking them out makes: the data is much more spread out without all of those outliers. The topcoded values, for both men and women, are clearly outliers, and the confidence intervals understandably get larger when we take those values out. Income also seems to trend higher with fewer previous jobs for both men and women, and the genders are distributed fairly evenly along the spectrum. The trend lines seem to show that women are more likely to have a higher income with more jobs, but the confidence intervals are so large that the difference doesn’t seem significant.

Family Attitudes & Gender

I’m curious about the income gaps and their relationship with the family attitudes questions. When we plot them, we see that the general trend (overall) is that the more sexist responses lead to higher income gaps until the last factor (either “Strongly Agree” or “Strongly Disagree”, based on the question) - this could be because those women had more incomes of 0, or because there were fewer responses. We see, interestingly, that all the income gaps are significant.

These graphs are interesting, but I don’t think I’ll use them in my analysis - it would add too many variables, and I would assume that the correlation between these variables would be quite strong and potentially impact the regression.

Crime & Gender

We now turn to our analysis of crime and its relationship with income gaps. We have two crime-related variables we’re looking at: ispolicestop and ischarged. I’d like to use them in my regression later but am curious about their relationship, which I would assume is quite strong. I’m also particularly curious about their relationship with race and income gaps together.

Income Gaps by Police Stops
Stopped by Police? Income Gap Lower Bound Upper Bound Significant?
No 26401.07 23266.48 29535.65 TRUE
Yes 19949.03 13903.55 25994.50 TRUE
Income Gaps by Crime Charges
Charged with Crime? Income Gap Lower Bound Upper Bound Significant?
No 27389.23 24442.37 30336.10 TRUE
Yes 11811.40 4973.07 18649.72 TRUE

Interestingly, for both variables, the income gap seems smaller for those who have been stopped by the police or charged with a crime at some point in their lives, and the gaps are statistically significant in every group.

We know that race and being charged with a crime or being stopped by the police have a relationship with each other in our society, so we check the significance of crime on income gaps between the races.

Income Gaps by Race & Police Stops
Stopped by Police? Race Income Gap Lower Bound Upper Bound Significant?
No Other 38060.26 33006.23 43114.29 TRUE
No Black 10840.25 6884.22 14796.28 TRUE
No Hispanic 19923.56 13634.53 26212.59 TRUE
Yes Other 30767.58 20886.90 40648.25 TRUE
Yes Black 8540.79 2035.51 15046.07 TRUE
Yes Hispanic 16926.88 5092.75 28761.00 TRUE
Income Gaps by Race and Crime Charges
Charged with Crime? Race Income Gap Lower Bound Upper Bound Significant?
No Other 40000.08 35218.30 44781.87 TRUE
No Black 11077.04 7483.62 14670.46 TRUE
No Hispanic 19822.41 13955.04 25689.78 TRUE
Yes Other 18714.17 8721.68 28706.66 TRUE
Yes Black 349.16 -10834.64 11532.95 FALSE
Yes Hispanic 15933.04 323.96 31542.11 TRUE

We see that again, income gaps go down after being stopped by the police or being charged with a crime across all races, with only one grouping of the data being not significant. This may have to do with the counts of men and women in that category. We also see that the error bars (and therefore the confidence intervals) have gotten larger, which could have to do with the fewer number of people in those categories as well.

Previous Income & Gender

I’m curious about the relationship between previous income in 1990 and 2000 with the respondents’ income in 2012. We first plot each of the previous incomes with totinc2012 (we take out the topcoded values to better see any trends within the data) and then do a pairs plot to see collinearity.

We can see from the graphs that there is at least some positive correlation between both previous incomes and current (2012) income, which makes sense. We can also see from the pairs plots that the correlation coefficients are relatively high across the board, so we won’t be using either of the previous incomes in our model.
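
As a rough sketch of this check, the correlation matrix behind a pairs plot can be computed directly. The data below is simulated so that the three incomes share a common component; inc and inc_cor are hypothetical names, and the numbers have nothing to do with the actual survey values.

```r
set.seed(2)
# Simulated incomes that share a common component, so they are correlated.
base <- rnorm(100, mean = 40000, sd = 10000)
inc <- data.frame(
  totinc1990 = base + rnorm(100, sd = 5000),
  totinc2000 = base + rnorm(100, sd = 5000),
  totinc2012 = base + rnorm(100, sd = 5000)
)
inc$totinc1990[5] <- NA  # a nonrespondent

# Pairwise correlations, using every pair of observations that is complete.
inc_cor <- cor(inc, use = "pairwise.complete.obs")

# pairs(inc) would draw the matching scatterplot matrix.
```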

Regression

We now turn to building a model on variables that we’ve explored above. Based on the exploratory analysis, this is the regression model I’ve created:

gender.lm <- lm(totinc2012 ~ gender + race + edu2012 + numjobs2012 + ispolicestop + ischarged, data = nlsy)

As a reminder, this is what the variables mean and how they are coded:

Variable Name | Question Asked on Survey | Recoded Responses
totinc2012 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2012) | numeric, topcoded
totinc1990 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (1990) | numeric, topcoded
totinc2000 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2000) | numeric, topcoded
numjobs2012 | NUMBER OF DIFFERENT JOBS EVER REPORTED AS OF INTERVIEW DATE | numeric
edu2012 | HIGHEST GRADE COMPLETED AS OF MAY 1 SURVEY YEAR | 0-5, 93, 94=“Elementary”, 6-8=“Middle”, 9-12=“High”, 13-16=“College”, 17-20=“Post College”, 95=“Unknown”
race | R’S RACIAL/ETHNIC COHORT FROM SCREENER | 3=“Other”, 2=“Black”, 1=“Hispanic”
gender | SEX OF R | 1=“Male”, 2=“Female”
ispolicestop | EVER “STOPPED” BY POLICE FOR OTHER THAN MINOR TRAFFIC OFFENSE? | 0=“No”, 1=“Yes”
ischarged | EVER CHARGED WITH ILLEGAL ACTIVITY? 80 INT (EXC MINOR TRAFFIC OFFENSE) | 0=“No”, 1=“Yes”

Introduction

We first need to decide how we’re handling missing and topcoded values.

We don’t want coefficients for factor levels that are NA, because those are difficult to interpret, and we don’t want them distorting the estimates for the levels where values do exist. For those reasons, I’ve decided to remove the NA values.

The next question is how to handle the top coded values. We can plot the lm object to see what the diagnostic plots look like for the regression where we keep the top coded values in.

Regression Output with TopCoded
Term Coefficient Std Error T-Value P-Value
(Intercept) 45555.288 11050.282 4.123 0.0000
genderFemale -29773.056 1270.210 -23.439 0.0000
raceBlack -18109.599 1393.790 -12.993 0.0000
raceHispanic -7506.218 1649.109 -4.552 0.0000
edu2012Middle -1636.406 11656.020 -0.140 0.8884
edu2012High 14197.543 10986.015 1.292 0.1963
edu2012College 39919.807 11002.000 3.628 0.0003
edu2012Post College 75262.615 11120.841 6.768 0.0000
numjobs2012 -867.370 86.919 -9.979 0.0000
ispolicestopYes 654.078 1715.731 0.381 0.7030
ischargedYes -10319.521 2229.750 -4.628 0.0000

This next set of plots is for the regression model where we take the topcoded values out.

Regression Output without TopCoded
Term Coefficient Std Error T-Value P-Value
(Intercept) 32995.234 6912.757 4.773 0.0000
genderFemale -17931.479 806.231 -22.241 0.0000
raceBlack -12160.662 877.459 -13.859 0.0000
raceHispanic -4689.720 1040.826 -4.506 0.0000
edu2012Middle -907.444 7289.107 -0.124 0.9009
edu2012High 13879.138 6870.517 2.020 0.0434
edu2012College 32212.658 6881.175 4.681 0.0000
edu2012Post College 52070.978 6964.158 7.477 0.0000
numjobs2012 -583.077 54.718 -10.656 0.0000
ispolicestopYes -257.511 1086.940 -0.237 0.8127
ischargedYes -6810.899 1404.005 -4.851 0.0000

Neither set of plots looks great: the residual plots show a discernible trend where there should be only noise, which suggests that a linear model is not the best fit, and the Q-Q plots show a strong departure from normality. Still, the plots for the regression without the topcoded values definitely look better than those for the regression with them, so we’ll continue the analysis without the topcoded values.

We next do a pairs plot to see if there’s any discernible collinearity between any of the variables we’ve chosen. It’s hard to say for sure with a coefficient because most of these are categorical variables, so this is really just for exploratory analysis.
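
The comparison between the two fits can be sketched like this. The data is simulated, the topcode value of 150000 is invented, and the model is cut down to two predictors to keep the example short; toy, keep, fit_with, and fit_without are hypothetical names rather than the objects used in this report.

```r
set.seed(3)
n <- 300
toy <- data.frame(
  gender      = factor(sample(c("Male", "Female"), n, replace = TRUE),
                       levels = c("Male", "Female")),
  numjobs2012 = rpois(n, 8)
)
toy$totinc2012 <- round(40000 - 12000 * (toy$gender == "Female") -
                          500 * toy$numjobs2012 + rnorm(n, sd = 8000))
toy$totinc2012[1:10]  <- NA      # nonrespondents
toy$totinc2012[11:16] <- 150000  # invented topcode value

# Assumption: topcoded rows are those equal to the column maximum.
top_value <- max(toy$totinc2012, na.rm = TRUE)
complete  <- complete.cases(toy)
keep      <- complete & toy$totinc2012 < top_value

# Same model, fitted with and without the topcoded rows.
fit_with    <- lm(totinc2012 ~ gender + numjobs2012, data = toy[complete, ])
fit_without <- lm(totinc2012 ~ gender + numjobs2012, data = toy[keep, ])

# plot(fit_without, which = 1:2) would draw the residuals-vs-fitted
# and normal Q-Q diagnostics for the fit without topcoded values.
```

In this toy setup the residual standard error drops once the topcoded rows are removed, which is the same kind of improvement the diagnostic plots suggest for the real data.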

The one pair of variables whose relationship I’m especially interested in is ispolicestop and ischarged. Intuitively, they seem like they would be related. ispolicestop is not significant in the initial regression (with a p-value of 0.812731) while ischarged is (with a p-value of 1.2559918 × 10^{-6}), so I’m curious whether that would change when I take one out.

We test both these variables by seeing what the regression looks like when we take out one and then the other.

Regression Output without ischarged
Term Coefficient Std Error T-Value P-Value
(Intercept) 32687.412 6924.145 4.721 0.0000
genderFemale -17451.072 801.478 -21.774 0.0000
raceBlack -12076.948 878.772 -13.743 0.0000
raceHispanic -4730.921 1042.550 -4.538 0.0000
edu2012Middle -1425.498 7300.640 -0.195 0.8452
edu2012High 13719.560 6882.047 1.994 0.0462
edu2012College 32295.974 6892.781 4.685 0.0000
edu2012Post College 52366.176 6975.659 7.507 0.0000
numjobs2012 -610.073 54.526 -11.189 0.0000
ispolicestopYes -1614.922 1052.079 -1.535 0.1248

When we take ischarged out of the model, the p-value for ispolicestop goes down, but only to 0.1248359, which is still not significant.

We then try to take out ispolicestop.

Regression Output without ispolicestop
Term Coefficient Std Error T-Value P-Value
(Intercept) 32971.290 6911.528 4.770 0.0000
genderFemale -17892.290 789.024 -22.676 0.0000
raceBlack -12159.074 877.371 -13.859 0.0000
raceHispanic -4690.934 1040.739 -4.507 0.0000
edu2012Middle -943.901 7286.966 -0.130 0.8969
edu2012High 13851.677 6869.053 2.017 0.0438
edu2012College 32195.397 6880.302 4.679 0.0000
edu2012Post College 52055.305 6963.350 7.476 0.0000
numjobs2012 -584.012 54.571 -10.702 0.0000
ischargedYes -6896.530 1356.587 -5.084 0.0000

ischarged stays significant, now with a p-value of 3.8017447 × 10^{-7}, so this variable seems to be more important to the model than ispolicestop. I’d still expect them to be related, so we try adding an interaction term and run an ANOVA to see if it’s significant.

Regression Output
Term Coefficient Std Error T-Value P-Value
(Intercept) 33011.914 6913.184 4.775 0.0000
genderFemale -17913.780 806.912 -22.200 0.0000
raceBlack -12157.048 877.529 -13.854 0.0000
raceHispanic -4687.994 1040.885 -4.504 0.0000
edu2012Middle -969.068 7290.344 -0.133 0.8943
edu2012High 13813.938 6871.893 2.010 0.0444
edu2012College 32157.320 6882.266 4.672 0.0000
edu2012Post College 52022.977 6965.065 7.469 0.0000
numjobs2012 -583.669 54.731 -10.664 0.0000
ispolicestopYes 21.790 1199.155 0.018 0.9855
ischargedYes -6081.995 1928.158 -3.154 0.0016
ispolicestopYes:ischargedYes -1525.535 2765.786 -0.552 0.5813
ANOVA Output for ispolicestop*ischarged
Res.Df RSS Df Sum of Sq Pr(>Chi)
6660 6.487017e+12 NA NA NA
6661 6.487314e+12 -1 -296331764 0.5812403

With a coefficient p-value of 0.5812588 and an ANOVA p-value of 0.5812403, we see that my hypothesis was wrong and that the interaction term doesn’t actually add anything to the regression.

We then run quick ANOVA tests on both of these variables to see which one to keep in our model moving forward. With a p-value of 0.8127237 for ispolicestop and one of 1.2280827 × 10^{-6} for ischarged, we remove ispolicestop from our regression and move forward to test other interaction terms with gender.

ANOVA Output for ischarged
Res.Df RSS Df Sum of Sq Pr(>Chi)
6661 6.487314e+12 NA NA NA
6662 6.510233e+12 -1 -22919088918 1.2e-06
ANOVA Output for ispolicestop
Res.Df RSS Df Sum of Sq Pr(>Chi)
6661 6.487314e+12 NA NA NA
6662 6.487368e+12 -1 -54664589 0.8127237
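
Nested-model comparisons like the two above can be sketched as follows. The Pr(>Chi) column suggests a chi-squared test, so the sketch assumes anova() was called with test = "Chisq"; the data is simulated so that only ischarged has a real effect, and toy, full, no_charge, and no_stop are hypothetical names.

```r
set.seed(4)
n <- 400
toy <- data.frame(
  ischarged    = factor(sample(c("No", "Yes"), n, replace = TRUE,
                               prob = c(0.8, 0.2))),
  ispolicestop = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Only ischarged affects income in this simulation.
toy$income <- 40000 - 8000 * (toy$ischarged == "Yes") + rnorm(n, sd = 6000)

full      <- lm(income ~ ispolicestop + ischarged, data = toy)
no_charge <- update(full, . ~ . - ischarged)     # drop ischarged
no_stop   <- update(full, . ~ . - ispolicestop)  # drop ispolicestop

# Compare each reduced model against the full one; row 2 holds the test.
p_charge <- anova(full, no_charge, test = "Chisq")[2, "Pr(>Chi)"]
p_stop   <- anova(full, no_stop,   test = "Chisq")[2, "Pr(>Chi)"]
```

A small p-value means that dropping the variable makes the fit significantly worse, so the variable with the larger p-value is the better candidate for removal.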

Testing Interaction Terms

We saw earlier that the residual plots for the regression implied that our model wasn’t a great fit for these variables, due to a lack of normality and signs of nonlinearity, among other issues. In this section we test whether adding an interaction term between gender and one other variable fixes these issues. To choose which interaction term to add, we run an ANOVA on each candidate interaction term and choose the one with the smallest p-value.
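
The selection procedure can be sketched as a loop over candidate terms. This is a simplified illustration on simulated data in which only the gender and race interaction is real; toy, base_fit, candidates, and best_term are hypothetical names, and the chi-squared ANOVA is assumed from the Pr(>Chi) column in the output tables.

```r
set.seed(5)
n <- 400
toy <- data.frame(
  gender      = factor(sample(c("Male", "Female"), n, replace = TRUE),
                       levels = c("Male", "Female")),
  race        = factor(sample(c("Other", "Black", "Hispanic"), n, replace = TRUE),
                       levels = c("Other", "Black", "Hispanic")),
  numjobs2012 = rpois(n, 8)
)
female <- toy$gender == "Female"
black  <- toy$race == "Black"
# Simulated income with a built-in gender-by-race interaction.
toy$income <- 45000 - 20000 * female - 15000 * black +
  12000 * (female & black) - 500 * toy$numjobs2012 + rnorm(n, sd = 6000)

base_fit   <- lm(income ~ gender + race + numjobs2012, data = toy)
candidates <- c("gender:race", "gender:numjobs2012")

# For each candidate, add the term and test it against the base model.
p_values <- sapply(candidates, function(term) {
  bigger <- update(base_fit, as.formula(paste(". ~ . +", term)))
  anova(bigger, base_fit, test = "Chisq")[2, "Pr(>Chi)"]
})

best_term <- names(which.min(p_values))
```

Because each candidate is tested against the same base model, the p-values are directly comparable, although picking the smallest of several p-values is itself a form of multiple testing.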

Regression Output for gender*race
Term Coefficient Std Error T-Value P-Value
(Intercept) 34990.596 6892.825 5.076 0.0000
genderFemale -23049.252 1097.254 -21.006 0.0000
raceBlack -18462.211 1261.681 -14.633 0.0000
raceHispanic -8608.291 1502.720 -5.728 0.0000
edu2012Middle -664.454 7261.593 -0.092 0.9271
edu2012High 14345.351 6846.031 2.095 0.0362
edu2012College 32427.331 6857.058 4.729 0.0000
edu2012Post College 52538.025 6940.122 7.570 0.0000
numjobs2012 -558.828 54.511 -10.252 0.0000
ischargedYes -6812.844 1351.886 -5.040 0.0000
genderFemale:raceBlack 12080.251 1744.569 6.924 0.0000
genderFemale:raceHispanic 7414.436 2055.975 3.606 0.0003
ANOVA Output for gender*race
Res.Df RSS Df Sum of Sq Pr(>Chi)
6660 6.439172e+12 NA NA NA
6662 6.487368e+12 -2 -48196770651 0

We try gender*race first. Gender and race, as we saw earlier, are interrelated, and it makes sense that their interaction would have a significant impact on income gaps between men and women. With small p-values for the interaction coefficients and an ANOVA p-value of 1.4972168 × 10^{-11}, we see that it does.

Regression Output for gender*edu2012
Term Coefficient Std Error T-Value P-Value
(Intercept) 29216.714 11068.656 2.640 0.0083
genderFemale -12216.717 14004.535 -0.872 0.3831
raceBlack -11984.648 876.963 -13.666 0.0000
raceHispanic -4618.993 1039.714 -4.443 0.0000
edu2012Middle 1847.875 11583.253 0.160 0.8733
edu2012High 15609.371 11063.511 1.411 0.1583
edu2012College 37979.968 11082.589 3.427 0.0006
edu2012Post College 57635.192 11215.277 5.139 0.0000
numjobs2012 -564.304 54.688 -10.319 0.0000
ischargedYes -6685.017 1357.170 -4.926 0.0000
genderFemale:edu2012Middle -4343.277 14895.237 -0.292 0.7706
genderFemale:edu2012High -2166.110 14043.919 -0.154 0.8774
genderFemale:edu2012College -9777.701 14062.016 -0.695 0.4869
genderFemale:edu2012Post College -9105.121 14219.580 -0.640 0.5220
ANOVA Output for gender*edu2012
Res.Df RSS Df Sum of Sq Pr(>Chi)
6658 6.465093e+12 NA NA NA
6662 6.487368e+12 -4 -22275818471 0.0001301

We next try gender*edu2012. Education levels can also be related to gender, so we expect this to have an impact as well. With an ANOVA p-value of 1.3013995 × 10^{-4}, we see that it does. It is interesting to note that the p-values for the individual interaction coefficients in the model are not significant, but the term as a whole is a significant predictor of totinc2012.

Regression Output for gender*numjobs2012
Term Coefficient Std Error T-Value P-Value
(Intercept) 36734.008 6933.813 5.298 0.0000
genderFemale -24895.526 1536.565 -16.202 0.0000
raceBlack -11897.184 876.977 -13.566 0.0000
raceHispanic -4412.893 1039.944 -4.243 0.0000
edu2012Middle -822.799 7272.187 -0.113 0.9099
edu2012High 13577.110 6855.283 1.981 0.0477
edu2012College 31622.436 6867.163 4.605 0.0000
edu2012Post College 51366.328 6950.406 7.390 0.0000
numjobs2012 -856.237 74.810 -11.445 0.0000
ischargedYes -6549.988 1355.402 -4.833 0.0000
genderFemale:numjobs2012 576.306 108.581 5.308 0.0000
ANOVA Output for gender*numjobs2012
Res.Df RSS Df Sum of Sq Pr(>Chi)
6661 6.460048e+12 NA NA NA
6662 6.487368e+12 -1 -27320873100 1e-07

We noticed earlier that there didn’t seem to be a discernible trend between gender and number of previous jobs, but it’s worth checking that interaction as well, so we add gender*numjobs2012. It’s important to note here that numjobs2012 is coded as a numeric variable. Even though we didn’t see any trends in the data, with an ANOVA p-value of 1.1107091 × 10^{-7}, this interaction also seems to be a significant predictor in this model.

Regression Output for gender*ischarged
Term Coefficient Std Error T-Value P-Value
(Intercept) 33296.221 6911.186 4.818 0.0000
genderFemale -18338.434 814.892 -22.504 0.0000
raceBlack -12101.930 877.515 -13.791 0.0000
raceHispanic -4665.694 1040.510 -4.484 0.0000
edu2012Middle -1030.132 7285.019 -0.141 0.8876
edu2012High 13739.600 6867.309 2.001 0.0455
edu2012College 32123.262 6878.442 4.670 0.0000
edu2012Post College 51920.351 6961.662 7.458 0.0000
numjobs2012 -583.938 54.556 -10.703 0.0000
ischargedYes -8469.294 1535.989 -5.514 0.0000
genderFemale:ischargedYes 6993.280 3206.319 2.181 0.0292
ANOVA Output for gender*ischarged
Res.Df RSS Df Sum of Sq Pr(>Chi)
6661 6.482739e+12 NA NA NA
6662 6.487368e+12 -1 -4629854986 0.0291765

We then check gender*ischarged, and see that it is less significant than expected (but still significant) at a p-value of 0.0291765.

Regression Output for race*ischarged
Term Coefficient Std Error T-Value P-Value
(Intercept) 32883.343 6913.749 4.756 0.0000
genderFemale -17934.837 789.247 -22.724 0.0000
raceBlack -11639.677 918.568 -12.672 0.0000
raceHispanic -4595.488 1100.017 -4.178 0.0000
edu2012Middle -1004.953 7286.090 -0.138 0.8903
edu2012High 13835.341 6869.166 2.014 0.0440
edu2012College 32166.567 6880.464 4.675 0.0000
edu2012Post College 52083.013 6963.831 7.479 0.0000
numjobs2012 -588.512 54.614 -10.776 0.0000
ischargedYes -4959.766 1894.660 -2.618 0.0089
raceBlack:ischargedYes -5899.867 3075.965 -1.918 0.0551
raceHispanic:ischargedYes -1113.637 3374.626 -0.330 0.7414
ANOVA Output for race*ischarged
Res.Df RSS Df Sum of Sq Pr(>Chi)
6660 6.483681e+12 NA NA NA
6662 6.487368e+12 -2 -3687593707 0.1504781

We explored the relationship between race and ischarged earlier, so I’m curious whether an interaction term between those two variables would be significant. We try adding race*ischarged to the model and see that it is in fact not significant, with an ANOVA p-value of 0.1504781.

With gender*race being the most significant interaction term, we add that to our model and plot it to see if that helps our plots look better at all.

Regression Output for regression with gender*race
Term Coefficient Std Error T-Value P-Value
(Intercept) 34990.596 6892.825 5.076 0.0000
genderFemale -23049.252 1097.254 -21.006 0.0000
raceBlack -18462.211 1261.681 -14.633 0.0000
raceHispanic -8608.291 1502.720 -5.728 0.0000
edu2012Middle -664.454 7261.593 -0.092 0.9271
edu2012High 14345.351 6846.031 2.095 0.0362
edu2012College 32427.331 6857.058 4.729 0.0000
edu2012Post College 52538.025 6940.122 7.570 0.0000
numjobs2012 -558.828 54.511 -10.252 0.0000
ischargedYes -6812.844 1351.886 -5.040 0.0000
genderFemale:raceBlack 12080.251 1744.569 6.924 0.0000
genderFemale:raceHispanic 7414.436 2055.975 3.606 0.0003

It doesn’t. The diagnostic plots still show some trend where there should be just scatter or noise (though the trend line is straight in the residuals vs fitted graph), and the Q-Q plot shows a distinct lack of normality, especially in the tails. The plots don’t look like they’ve changed much.

Discussion

Our findings show that the variables race, edu2012, numjobs2012, ischarged and their individual interactions with gender are significant in predicting the outcome of totinc2012, but these findings should all be taken with a huge grain of salt - we are trying to predict a huge outcome variable with very few variables, and the regression diagnostic plots (even when we added the most significant interaction term) show a lack of normality and the need for a nonlinear model, which ours isn’t.

I have very little confidence in this model or my analysis. A lot of our findings could also be impacted by the fact that we took out the topcoded values - we saw in the exploratory analysis that our error bars and confidence intervals got larger when we took the topcoded values out, so there may have been some variables that would be more significant if we’d kept them in. Some variables may have been impacted by low response numbers as well, since we took all of the missing values out, but also because some variables like ischarged had very few responses for “Yes” relative to responses of “No.” That definitely may have skewed both our model and our findings.