The question we are interested in answering is “Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?”
Income gaps between men and women have long been reported, and we want to work out which other factors play into that gap, and by how much. To answer this question, we first do some exploratory analysis of the data to see which variables might be interesting in a regression model. We then fit the regression model, plot its diagnostics, and test which variables seem especially significant. Finally, we run ANOVA tests to see which interaction terms contribute most to the model, and plot diagnostics again for the regression model with the most significant interaction term.
We find that race, education level, number of previous jobs, and crime history are significant predictors of 2012 income, and that the income gap between men and women varies with them.
First we import the data, change the column names to something more understandable, and print out the first 6 rows to check the data.
## # A tibble: 6 x 67
## pk caseid birthisus birthissouth isforlang forlang issouth14
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 445 1 1 0 1 4 0
## 2 445 2 2 -4 1 4 -4
## 3 445 3 1 0 0 -4 0
## 4 445 4 1 0 0 -4 -3
## 5 445 5 1 0 0 -4 0
## 6 445 6 1 0 0 -4 0
## # ... with 60 more variables: urbanrural14 <dbl>, religion <dbl>,
## # expedu <dbl>, ismilitary <dbl>, faplace <dbl>, fatime <dbl>,
## # fauseful <dbl>, fajuvie <dbl>, faroles <dbl>, fashare <dbl>,
## # fahappier <dbl>, expoccup1979 <dbl>, isedu5yrs <dbl>, race <dbl>,
## # gender <dbl>, maritalstatus1979 <dbl>, famsize1979 <dbl>,
## # ispoverty1979 <dbl>, ispolicestop <dbl>, agepolicestop <dbl>,
## # ischarged <dbl>, agealc <dbl>, numweed <dbl>, ageweed <dbl>,
## # totinc1990 <dbl>, ispoverty1990 <dbl>, edu1990 <dbl>,
## # numjobs1990 <dbl>, numchild1990 <dbl>, youngchild1990 <dbl>,
## # numcoke <dbl>, agecoke <dbl>, typeemp2000 <dbl>, typeoccup2000 <dbl>,
## # spouseoccup2000 <dbl>, spousehrsweek2000 <dbl>, numchildren2000 <dbl>,
## # totinc2000 <dbl>, famsize2000 <dbl>, totfaminc2000 <dbl>,
## # ispoverty2000 <dbl>, marstatcollap2000 <dbl>, maritalstatus2000 <dbl>,
## # mnthsmarraigebirth <dbl>, dobpartner2012 <dbl>, typeemp2012 <dbl>,
## # typeoccup2012 <dbl>, spouseoccup2012 <dbl>, numwksspouse2012 <dbl>,
## # numdrinks2012 <dbl>, totinc2012 <dbl>, estinc20121 <dbl>,
## # estinc20122 <dbl>, totincspouse2012 <dbl>, estincspouse2012 <dbl>,
## # famsize2012 <dbl>, region2012 <dbl>, edu2012 <dbl>,
## # urbanrural2012 <dbl>, numjobs2012 <dbl>
Taking a look at this data, it’s clear to see that we need to recode some of the variables to be in the correct format we need. We recode the variables we need to get a good exploratory analysis of the data, skipping over the ones we are not interested in exploring. Particularly, we recode as factor the family attitudes questions (faplace, fatime, fauseful, fajuvie, faroles, fashare, fahappier), race (race), gender (gender), the crime questions (ispolicestop, ischarged), education level in 2012 (edu2012), and recode as numeric the total incomes in 1990 (totinc1990), 2000 (totinc2000), and 2012 (totinc2012) and the number of previous jobs in 2012 (numjobs2012).
totinc2012 will become our outcome variable, which we regress on the rest of these covariates. For all of these variables we also recoded the nonrespondents to NA. We will look at how we process those missing values in the analysis section.
totinc2012, totinc1990, and totinc2000 are also all topcoded, which means that the top 2% of the data all have the same number, which is the maximum of the responses. We will look at how to deal with these values later on in the analysis.
We also recode fauseful and fashare in reverse order, so that for every family attitudes question a higher factor level implies a higher level of sexism in the response.

| Variable Name | Question Asked on Survey | Recoded Responses |
|---|---|---|
| totinc2012 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2012) | numeric, topcoded |
| totinc1990 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (1990) | numeric, topcoded |
| totinc2000 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2000) | numeric, topcoded |
| numjobs2012 | NUMBER OF DIFFERENT JOBS EVER REPORTED AS OF INTERVIEW DATE | numeric |
| edu2012 | HIGHEST GRADE COMPLETED AS OF MAY 1 SURVEY YEAR | 0=“Elementary”,93=“Elementary”,94=“Elementary”,1=“Elementary”,2=“Elementary”,3=“Elementary”,4=“Elementary”,5=“Elementary”,6=“Middle”,7=“Middle”,8=“Middle”,9=“High”,10=“High”,11=“High”,12=“High”,13=“College”,14=“College”,15=“College”,16=“College”,17=“Post College”,18=“Post College”,19=“Post College”,20=“Post College”,95=“Unknown” |
| race | R’S RACIAL/ETHNIC COHORT FROM SCREENER | 3=“Other”,2=“Black”,1=“Hispanic” |
| gender | SEX OF R | 1=“Male”,2=“Female” |
| ispolicestop | EVER “STOPPED” BY POLICE FOR OTHER THAN MINOR TRAFFIC OFFENSE? | 0=“No”,1=“Yes” |
| ischarged | EVER CHARGED WITH ILLEGAL ACTIVITY? 80 INT (EXC MINOR TRAFFIC OFFENSE) | 0=“No”,1=“Yes” |
| faplace | FAMILY ATTITUDES - WOMAN’S PLACE IS IN THE HOME? | 1= “Strongly Disagree”,2 = “Disagree”,3=“Agree”,4=“Strongly Agree” |
| fatime | FAMILY ATTITUDES - WIFE WITH FAMILY HAS NO TIME FOR OTHER EMPLOYMENT? | 1= “Strongly Disagree”,2 = “Disagree”,3=“Agree”,4=“Strongly Agree” |
| fauseful | FAMILY ATTITUDES - WORKING WIFE FEELS MORE USEFUL? | 4=“Strongly Agree”,3=“Agree”,2 = “Disagree”,1= “Strongly Disagree” |
| fajuvie | FAMILY ATTITUDES - EMPLOYMENT OF WIVES LEADS TO JUVENILE DELINQUENCY? | 1= “Strongly Disagree”,2 = “Disagree”,3=“Agree”,4=“Strongly Agree” |
| faroles | FAMILY ATTITUDES - TRADITIONAL HUSBAND/WIFE ROLES BEST? | 1= “Strongly Disagree”,2 = “Disagree”,3=“Agree”,4=“Strongly Agree” |
| fashare | FAMILY ATTITUDES - MEN SHOULD SHARE HOUSEWORK? | 4=“Strongly Agree”,3=“Agree”,2 = “Disagree”,1= “Strongly Disagree” |
| fahappier | FAMILY ATTITUDES - WOMEN ARE HAPPIER IN TRADITIONAL ROLES? | 1= “Strongly Disagree”,2 = “Disagree”,3=“Agree”,4=“Strongly Agree” |
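The recoding described above can be sketched in base R roughly as follows. This is a minimal illustration on a tiny made-up data frame (`toy`), not the code used for this report. It assumes that negative codes mark nonresponse and that the topcode is the largest observed income; the income figures are invented.

```r
# Toy data standing in for a few NLSY columns (all values are made up).
toy <- data.frame(
  gender     = c(1, 2, 2, 1, 2),
  fashare    = c(1, 4, -4, 2, 3),
  totinc2012 = c(25000, -4, 40000, 343830, 343830)
)

# Assumption: negative codes mark nonresponse, so they become NA.
toy[toy < 0] <- NA

# Turn numeric codes into labelled factors.
toy$gender <- factor(toy$gender, levels = c(1, 2), labels = c("Male", "Female"))

# Reverse-coded question: listing the levels as 4, 3, 2, 1 makes
# "Strongly Disagree" the highest factor level.
toy$fashare <- factor(
  toy$fashare,
  levels = c(4, 3, 2, 1),
  labels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
)

# Assumption: the topcode is the maximum observed income.
topcode <- max(toy$totinc2012, na.rm = TRUE)
toy$is_topcoded <- !is.na(toy$totinc2012) & toy$totinc2012 == topcode
```

Keeping the topcode flag as its own column makes it easy to run each later step both with and without the topcoded rows.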
## # A tibble: 6 x 67
## pk caseid birthisus birthissouth isforlang forlang issouth14
## <dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
## 1 445 1 US Not South Yes Other No
## 2 445 2 Not US -4 Yes Other -4
## 3 445 3 US Not South No -4 No
## 4 445 4 US Not South No -4 -3
## 5 445 5 US Not South No -4 No
## 6 445 6 US Not South No -4 No
## # ... with 60 more variables: urbanrural14 <fct>, religion <fct>,
## # expedu <fct>, ismilitary <fct>, faplace <fct>, fatime <fct>,
## # fauseful <fct>, fajuvie <fct>, faroles <fct>, fashare <fct>,
## # fahappier <fct>, expoccup1979 <dbl>, isedu5yrs <dbl>, race <fct>,
## # gender <fct>, maritalstatus1979 <dbl>, famsize1979 <dbl>,
## # ispoverty1979 <dbl>, ispolicestop <fct>, agepolicestop <dbl>,
## # ischarged <fct>, agealc <dbl>, numweed <dbl>, ageweed <dbl>,
## # totinc1990 <dbl>, ispoverty1990 <dbl>, edu1990 <dbl>,
## # numjobs1990 <dbl>, numchild1990 <dbl>, youngchild1990 <dbl>,
## # numcoke <dbl>, agecoke <dbl>, typeemp2000 <dbl>, typeoccup2000 <dbl>,
## # spouseoccup2000 <dbl>, spousehrsweek2000 <dbl>, numchildren2000 <dbl>,
## # totinc2000 <dbl>, famsize2000 <dbl>, totfaminc2000 <dbl>,
## # ispoverty2000 <dbl>, marstatcollap2000 <dbl>, maritalstatus2000 <dbl>,
## # mnthsmarraigebirth <dbl>, dobpartner2012 <dbl>, typeemp2012 <dbl>,
## # typeoccup2012 <dbl>, spouseoccup2012 <dbl>, numwksspouse2012 <dbl>,
## # numdrinks2012 <dbl>, totinc2012 <dbl>, estinc20121 <dbl>,
## # estinc20122 <dbl>, totincspouse2012 <dbl>, estincspouse2012 <dbl>,
## # famsize2012 <dbl>, region2012 <fct>, edu2012 <fct>,
## # urbanrural2012 <fct>, numjobs2012 <dbl>
We’ll start by doing some exploratory analysis of the variables we’re interested in. As we work through the data, we’ll explore what the graphs and plots look like when we leave the topcoded values in versus when we don’t, and at each stage we’ll remove the missing values only for the variables we’re graphing. This keeps the plots readable and ensures that what we’re seeing in them makes sense, but it also means that our findings are only generalizable to the people who answered that specific question. Survey nonresponse error is a huge issue for generalizability, and we need to take that into account when interpreting our findings later.
We first want to just get a sense of where our data is on the variables we’re interested in exploring - gender and totinc2012.
We first create simple count tables to see how many people of each gender we have in our responses, and then partition those tables by both race and education to see what that distribution looks like.
| Gender | Count |
|---|---|
| Male | 6403 |
| Female | 6283 |
| Race | Gender | Count |
|---|---|---|
| Other | Male | 3790 |
| Other | Female | 3720 |
| Black | Male | 1613 |
| Black | Female | 1561 |
| Hispanic | Male | 1000 |
| Hispanic | Female | 1002 |
| Education Level | Gender | Count |
|---|---|---|
| Elementary | Male | 9 |
| Elementary | Female | 13 |
| Middle | Male | 90 |
| Middle | Female | 83 |
| High | Male | 1936 |
| High | Female | 1753 |
| College | Male | 1153 |
| College | Female | 1486 |
| Post College | Male | 336 |
| Post College | Female | 442 |
| NA | Male | 2879 |
| NA | Female | 2506 |
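Count tables like the ones above can be produced with `table()`. The sketch below uses a handful of made-up rows rather than the survey data, and the `small_cell` column is a hypothetical helper that flags cells below the n = 30 rule of thumb.

```r
# Made-up responses; NA marks a missing education level.
toy <- data.frame(
  gender  = factor(c("Male", "Female", "Male", "Female", "Male", "Female")),
  edu2012 = factor(c("High", "High", "College", NA, "High", "College"))
)

# Cross-tabulate education by gender, keeping NA as its own category.
counts <- table(toy$edu2012, toy$gender, useNA = "ifany")

# Long format: one row per education/gender cell.
counts_long <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_long) <- c("edu2012", "gender", "count")

# Flag cells that fall below the n = 30 rule of thumb.
counts_long$small_cell <- counts_long$count < 30
counts_long
```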
We see that there are roughly the same number of men and women in our data, which is important for further statistical analysis, but that some of the education subsets have fewer than 30 people, which is important to note as well. T-tests are not reliable on groups of fewer than 30 people unless the data is close to normal, so we need to do further data cleaning before using edu2012 in our analysis.
Tables are sometimes hard to read so we can also display those as plots:
Another important component of this data is the set of topcoded values. This table shows how many men and women are in the topcoded section, which will be important to know when deciding whether to remove them. Neither group is large relative to the sample, so we may be able to remove them safely. We’ll continue exploring this later, but it is interesting to note that there are far more men than women among the topcoded values.
| Gender | Num Topcoded |
|---|---|
| Male | 131 |
| Female | 12 |
The outcome variable we’re interested in is totinc2012, so let’s look at what it looks like across gender, with both a boxplot and a histogram. Here, we’ve plotted the graphs without the topcoded values on the left, and the graphs with the topcoded values on the right. We can see that men seem to make more than women on average, that there seem to be more women than men at the bottom of the income spectrum, and that the topcoded values appear as outliers for both genders. When we take them out, the data looks more spread out and the gap between men and women looks less pronounced. It’s also easier to see trends in the data with them taken out.
Now that we have a basic idea of what our data looks like across the variables we are interested in, we turn to a more nuanced analysis of each variable and its relationship with income gaps between men and women. Unless specified otherwise, in each of these analyses I will be taking the topcoded values out in order to better showcase the trends in the data.
Race and gender are incredibly interrelated with one another. Here we first explore in a table and then a plot what income gaps for men and women look like across race.
Income gaps seem smallest for black people, which is interesting, and largest for the “other” category, which makes sense since “other” covers every race except black and hispanic, and that’s a lot of races and therefore a lot of variation. The gaps are also statistically significant for each racial group.
| Race | Income Gap | Lower Bound | Upper Bound | Significant? |
|---|---|---|---|---|
| Other | 19558.80 | 17065.65 | 22051.95 | TRUE |
| Black | 5402.63 | 2841.95 | 7963.31 | TRUE |
| Hispanic | 11124.96 | 7596.31 | 14653.60 | TRUE |
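A table of income gaps with confidence bounds can be built from one two-sample t-test per group. The sketch below is an assumption about the method rather than the code behind the table above: it uses simulated incomes, a Welch t-test, and a hypothetical helper named `income_gap`.

```r
set.seed(1)

# Simulated incomes, NOT the NLSY data: men are drawn with a higher mean.
n <- 200
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = n), levels = c("Male", "Female")),
  race   = factor(rep(c("Other", "Black"), times = n))
)
toy$totinc2012 <- rnorm(
  2 * n,
  mean = ifelse(toy$gender == "Male", 50000, 35000),
  sd   = 15000
)

# Gap (male mean minus female mean) with a 95% Welch t-interval for one group.
income_gap <- function(d) {
  tt <- t.test(totinc2012 ~ gender, data = d)
  data.frame(
    gap         = unname(tt$estimate[1] - tt$estimate[2]),
    lower       = tt$conf.int[1],
    upper       = tt$conf.int[2],
    significant = tt$p.value < 0.05
  )
}

# One row per race, stacked into a single table.
gaps <- do.call(rbind, lapply(split(toy, toy$race), income_gap))
gaps
```

Because "Male" is the first factor level, the interval returned by `t.test` is for the male mean minus the female mean, which matches the sign convention of the gaps reported here.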
We next do the same with education and gender. Except for the Elementary category (which we’ll remember did not have >30 people in that group when broken down by gender), income gaps are significant across education categories, and seem to be largest during college and post-college.
| Education Level | Income Gap | Lower Bound | Upper Bound | Significant? |
|---|---|---|---|---|
| Elementary | 12614.53 | -1737.77 | 26966.83 | FALSE |
| Middle | 11108.67 | 6094.88 | 16122.45 | TRUE |
| High | 11774.34 | 10031.51 | 13517.18 | TRUE |
| College | 21577.29 | 18627.23 | 24527.36 | TRUE |
| Post College | 21588.06 | 14887.29 | 28288.83 | TRUE |
Number of previous jobs (numjobs2012) is another variable where we test what leaving in or taking out the topcoded values does to our analysis.
We see the difference that taking them out makes: the data is much more spread out once those outliers are gone. The topcoded values, for both men and women, are clearly outliers, and the confidence intervals understandably get larger when we take the values out. Fewer previous jobs also seems to go with higher income for both men and women, and there seems to be a fairly even distribution of genders along the spectrum. The trend lines seem to show that females are more likely to have a higher income with more jobs, but the confidence intervals are so large that the difference doesn’t seem significant.
I’m curious about the income gaps and their relationship with the family attitudes questions. When we plot them, we see that the general trend (overall) is that the more sexist responses lead to higher income gaps until the last factor (either “Strongly Agree” or “Strongly Disagree”, based on the question) - this could be because those women had more incomes of 0, or because there were fewer responses. We see, interestingly, that all the income gaps are significant.
These graphs are interesting, but I don’t think I’ll use them in my analysis - it would add too many variables, and I would assume that the correlation between these variables would be quite strong and potentially impact the regression.
We now turn to our analysis of crime and its relationship with income gaps. We have two crime-related variables we’re looking at: ispolicestop and ischarged. I’d like to use them in my regression later, but I’m curious about how they relate to each other, since I would assume they are quite strongly related. I’m also particularly curious about their relationship with race and income gaps together.
| Stopped by Police? | Income Gap | Lower Bound | Upper Bound | Significant? |
|---|---|---|---|---|
| No | 26401.07 | 23266.48 | 29535.65 | TRUE |
| Yes | 19949.03 | 13903.55 | 25994.50 | TRUE |
| Charged with Crime? | Income Gap | Lower Bound | Upper Bound | Significant? |
|---|---|---|---|---|
| No | 27389.23 | 24442.37 | 30336.10 | TRUE |
| Yes | 11811.40 | 4973.07 | 18649.72 | TRUE |
Interestingly, for both variables, the income gap seems smaller for those who have been stopped by the police or charged with a crime at some point in their lives, and the gaps are statistically significant in every group.
We know that race and being charged with a crime or being stopped by the police have a relationship with each other in our society, so we check the significance of crime on income gaps between the races.
| Stopped by Police? | Race | Income Gap | Lower Bound | Upper Bound | Significant? |
|---|---|---|---|---|---|
| No | Other | 38060.26 | 33006.23 | 43114.29 | TRUE |
| No | Black | 10840.25 | 6884.22 | 14796.28 | TRUE |
| No | Hispanic | 19923.56 | 13634.53 | 26212.59 | TRUE |
| Yes | Other | 30767.58 | 20886.90 | 40648.25 | TRUE |
| Yes | Black | 8540.79 | 2035.51 | 15046.07 | TRUE |
| Yes | Hispanic | 16926.88 | 5092.75 | 28761.00 | TRUE |
| Charged with Crime? | Race | Income Gap | Lower Bound | Upper Bound | Significant? |
|---|---|---|---|---|---|
| No | Other | 40000.08 | 35218.30 | 44781.87 | TRUE |
| No | Black | 11077.04 | 7483.62 | 14670.46 | TRUE |
| No | Hispanic | 19822.41 | 13955.04 | 25689.78 | TRUE |
| Yes | Other | 18714.17 | 8721.68 | 28706.66 | TRUE |
| Yes | Black | 349.16 | -10834.64 | 11532.95 | FALSE |
| Yes | Hispanic | 15933.04 | 323.96 | 31542.11 | TRUE |
We see that again, income gaps go down after being stopped by the police or being charged with a crime across all races, with only one grouping of the data being not significant. This may have to do with the counts of men and women in that category. We also see that the error bars (and therefore the confidence intervals) have gotten larger, which could have to do with the fewer number of people in those categories as well.
I’m curious about the relationship between previous income in 1990 and 2000 with the respondents’ income in 2012. We first plot each of the previous incomes with totinc2012 (we take out the topcoded values to better see any trends within the data) and then do a pairs plot to see collinearity.
We can see from the graphs that there is at least somewhat of a positive correlation between both previous incomes and current (2012) income, which makes sense. We can also see from the pairs plots that the correlation coefficients are relatively high across the board, so to avoid collinearity we won’t be using either of the previous incomes in our model.
We now turn to building a model on variables that we’ve explored above. Based on the exploratory analysis, this is the regression model I’ve created:
gender.lm <- lm(totinc2012 ~ gender + race + edu2012 + numjobs2012 + ispolicestop + ischarged, data = nlsy)
As a reminder, this is what the variables mean and how they are coded:
| Variable Name | Question Asked on Survey | Recoded Responses |
|---|---|---|
| totinc2012 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2012) | numeric, topcoded |
| totinc1990 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (1990) | numeric, topcoded |
| totinc2000 | TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2000) | numeric, topcoded |
| numjobs2012 | NUMBER OF DIFFERENT JOBS EVER REPORTED AS OF INTERVIEW DATE | numeric |
| edu2012 | HIGHEST GRADE COMPLETED AS OF MAY 1 SURVEY YEAR | 0=“Elementary”,93=“Elementary”,94=“Elementary”,1=“Elementary”,2=“Elementary”,3=“Elementary”,4=“Elementary”,5=“Elementary”,6=“Middle”,7=“Middle”,8=“Middle”,9=“High”,10=“High”,11=“High”,12=“High”,13=“College”,14=“College”,15=“College”,16=“College”,17=“Post College”,18=“Post College”,19=“Post College”,20=“Post College”,95=“Unknown” |
| race | R’S RACIAL/ETHNIC COHORT FROM SCREENER | 3=“Other”,2=“Black”,1=“Hispanic” |
| gender | SEX OF R | 1=“Male”,2=“Female” |
| ispolicestop | EVER “STOPPED” BY POLICE FOR OTHER THAN MINOR TRAFFIC OFFENSE? | 0=“No”,1=“Yes” |
| ischarged | EVER CHARGED WITH ILLEGAL ACTIVITY? 80 INT (EXC MINOR TRAFFIC OFFENSE) | 0=“No”,1=“Yes” |
We first need to decide how we’re handling missing and topcoded values.
We don’t want coefficients for factor levels where the value is NA, because those are difficult to interpret, and we don’t want them interfering with the estimates for the levels where values do exist. For those reasons, I’ve decided to remove the NA values.
The next question is how to handle the top coded values. We can plot the lm object to see what the diagnostic plots look like for the regression where we keep the top coded values in.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 45555.288 | 11050.282 | 4.123 | 0.0000 |
| genderFemale | -29773.056 | 1270.210 | -23.439 | 0.0000 |
| raceBlack | -18109.599 | 1393.790 | -12.993 | 0.0000 |
| raceHispanic | -7506.218 | 1649.109 | -4.552 | 0.0000 |
| edu2012Middle | -1636.406 | 11656.020 | -0.140 | 0.8884 |
| edu2012High | 14197.543 | 10986.015 | 1.292 | 0.1963 |
| edu2012College | 39919.807 | 11002.000 | 3.628 | 0.0003 |
| edu2012Post College | 75262.615 | 11120.841 | 6.768 | 0.0000 |
| numjobs2012 | -867.370 | 86.919 | -9.979 | 0.0000 |
| ispolicestopYes | 654.078 | 1715.731 | 0.381 | 0.7030 |
| ischargedYes | -10319.521 | 2229.750 | -4.628 | 0.0000 |
This next set of plots is for the regression model where we take the topcoded values out.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 32995.234 | 6912.757 | 4.773 | 0.0000 |
| genderFemale | -17931.479 | 806.231 | -22.241 | 0.0000 |
| raceBlack | -12160.662 | 877.459 | -13.859 | 0.0000 |
| raceHispanic | -4689.720 | 1040.826 | -4.506 | 0.0000 |
| edu2012Middle | -907.444 | 7289.107 | -0.124 | 0.9009 |
| edu2012High | 13879.138 | 6870.517 | 2.020 | 0.0434 |
| edu2012College | 32212.658 | 6881.175 | 4.681 | 0.0000 |
| edu2012Post College | 52070.978 | 6964.158 | 7.477 | 0.0000 |
| numjobs2012 | -583.077 | 54.718 | -10.656 | 0.0000 |
| ispolicestopYes | -257.511 | 1086.940 | -0.237 | 0.8127 |
| ischargedYes | -6810.899 | 1404.005 | -4.851 | 0.0000 |
Neither set of plots looks great. The residual plots show a discernible trend when they should look like plain noise, which suggests that a linear model is not the best fit, and the Q-Q plots show a strong departure from normality. Still, the plots for the regression where we take the topcoded values out definitely look better than the plots for the regression where we left them in, so we’ll continue with the analysis without the topcoded values.
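The comparison between the two fits can be sketched as follows. Everything here is simulated: the data frame `toy`, the cap at the 98th percentile, and the reduced formula are assumptions chosen to mimic topcoding, not the report's data or full model.

```r
set.seed(2)

# Simulated stand-in for the cleaned data; all values are made up.
n <- 300
toy <- data.frame(
  gender      = factor(sample(c("Male", "Female"), n, replace = TRUE),
                       levels = c("Male", "Female")),
  numjobs2012 = rpois(n, 10)
)
raw_income <- 60000 - 15000 * (toy$gender == "Female") -
  500 * toy$numjobs2012 + rnorm(n, sd = 20000)

# Mimic topcoding: floor at zero and cap everything above the 98th percentile.
cap <- unname(quantile(raw_income, 0.98))
toy$totinc2012 <- pmin(pmax(raw_income, 0), cap)

# Same formula, fitted with and without the topcoded rows.
fit_all   <- lm(totinc2012 ~ gender + numjobs2012, data = toy)
fit_notop <- lm(totinc2012 ~ gender + numjobs2012,
                data = subset(toy, totinc2012 < cap))

# The four standard diagnostic plots, drawn only in an interactive session.
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(fit_notop)
  par(op)
}
```

Calling `plot()` on each fit in turn gives the residuals vs fitted and Q-Q plots discussed above, so the two sets can be compared side by side.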
We next do a pairs plot to see if there’s any discernible collinearity between any of the variables we’ve chosen. It’s hard to say for sure with a coefficient because most of these are categorical variables, so this is really just for exploratory analysis.
The one pair of variables whose relationship I’m especially interested in is ispolicestop and ischarged. Intuitively, they seem like they would be related. ispolicestop is not significant in the initial regression (with a p-value of 0.812731) while ischarged is (with a p-value of 1.2559918 × 10^{-6}), so I’m curious whether that would change when I take one of them out.
We test both these variables by seeing what the regression looks like when we take out one and then the other.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 32687.412 | 6924.145 | 4.721 | 0.0000 |
| genderFemale | -17451.072 | 801.478 | -21.774 | 0.0000 |
| raceBlack | -12076.948 | 878.772 | -13.743 | 0.0000 |
| raceHispanic | -4730.921 | 1042.550 | -4.538 | 0.0000 |
| edu2012Middle | -1425.498 | 7300.640 | -0.195 | 0.8452 |
| edu2012High | 13719.560 | 6882.047 | 1.994 | 0.0462 |
| edu2012College | 32295.974 | 6892.781 | 4.685 | 0.0000 |
| edu2012Post College | 52366.176 | 6975.659 | 7.507 | 0.0000 |
| numjobs2012 | -610.073 | 54.526 | -11.189 | 0.0000 |
| ispolicestopYes | -1614.922 | 1052.079 | -1.535 | 0.1248 |
When we take ischarged out of the model, the p-value for ispolicestop goes down, but only to 0.1248359, which is still not significant.
We then try to take out ispolicestop.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 32971.290 | 6911.528 | 4.770 | 0.0000 |
| genderFemale | -17892.290 | 789.024 | -22.676 | 0.0000 |
| raceBlack | -12159.074 | 877.371 | -13.859 | 0.0000 |
| raceHispanic | -4690.934 | 1040.739 | -4.507 | 0.0000 |
| edu2012Middle | -943.901 | 7286.966 | -0.130 | 0.8969 |
| edu2012High | 13851.677 | 6869.053 | 2.017 | 0.0438 |
| edu2012College | 32195.397 | 6880.302 | 4.679 | 0.0000 |
| edu2012Post College | 52055.305 | 6963.350 | 7.476 | 0.0000 |
| numjobs2012 | -584.012 | 54.571 | -10.702 | 0.0000 |
| ischargedYes | -6896.530 | 1356.587 | -5.084 | 0.0000 |
ischarged stays significant, now with a p-value of 3.8017447 × 10^{-7}, so this variable seems to be more important to the model than ispolicestop. I’d still expect them to be related, so we try adding an interaction term and run an ANOVA to see if it’s significant.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 33011.914 | 6913.184 | 4.775 | 0.0000 |
| genderFemale | -17913.780 | 806.912 | -22.200 | 0.0000 |
| raceBlack | -12157.048 | 877.529 | -13.854 | 0.0000 |
| raceHispanic | -4687.994 | 1040.885 | -4.504 | 0.0000 |
| edu2012Middle | -969.068 | 7290.344 | -0.133 | 0.8943 |
| edu2012High | 13813.938 | 6871.893 | 2.010 | 0.0444 |
| edu2012College | 32157.320 | 6882.266 | 4.672 | 0.0000 |
| edu2012Post College | 52022.977 | 6965.065 | 7.469 | 0.0000 |
| numjobs2012 | -583.669 | 54.731 | -10.664 | 0.0000 |
| ispolicestopYes | 21.790 | 1199.155 | 0.018 | 0.9855 |
| ischargedYes | -6081.995 | 1928.158 | -3.154 | 0.0016 |
| ispolicestopYes:ischargedYes | -1525.535 | 2765.786 | -0.552 | 0.5813 |
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 6660 | 6.487017e+12 | NA | NA | NA |
| 6661 | 6.487314e+12 | -1 | -296331764 | 0.5812403 |
With a coefficient p-value of 0.5812588 for the interaction term and an ANOVA p-value of 0.5812403, we see that my hypothesis was wrong and that the interaction term doesn’t actually add anything to the regression.
We then run quick ANOVA tests on each of these variables to see which one to keep in our model moving forward. With a p-value of 0.8127237 for ispolicestop and one of 1.2280827 × 10^{-6} for ischarged, we remove ispolicestop from our regression and move forward to test other interaction terms with gender.
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 6661 | 6.487314e+12 | NA | NA | NA |
| 6662 | 6.510233e+12 | -1 | -22919088918 | 1.2e-06 |
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 6661 | 6.487314e+12 | NA | NA | NA |
| 6662 | 6.487368e+12 | -1 | -54664589 | 0.8127237 |
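Tests like the two above compare a full model against the same model with one term dropped. A minimal sketch on simulated data (the data frame `toy` and its effect sizes are made up) looks like this:

```r
set.seed(3)

# Simulated data: only ischarged truly affects income here.
n <- 400
toy <- data.frame(
  ispolicestop = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  ischarged    = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$totinc2012 <- 40000 - 7000 * (toy$ischarged == "Yes") + rnorm(n, sd = 15000)

# Full model, plus two reduced models that each drop one term.
full          <- lm(totinc2012 ~ ispolicestop + ischarged, data = toy)
no_charged    <- update(full, . ~ . - ischarged)
no_policestop <- update(full, . ~ . - ispolicestop)

# Nested-model comparisons; test = "Chisq" yields the Pr(>Chi) column.
cmp_charged <- anova(full, no_charged, test = "Chisq")
cmp_stop    <- anova(full, no_policestop, test = "Chisq")

p_charged <- cmp_charged[2, "Pr(>Chi)"]
p_stop    <- cmp_stop[2, "Pr(>Chi)"]
```

Listing the full model first is what produces a `Df` of -1 and a negative `Sum of Sq` in the second row, as in the tables above.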
We saw earlier that the residual plots for the regression implied that our model wasn’t a great fit for these variables, due to a lack of normality and signs of nonlinearity, among other issues. In this section we test whether adding an interaction term between gender and one other variable fixes these issues. To choose which interaction term to add, we run the anova function on each candidate interaction term and choose the one with the smallest p-value.
| Term | Coefficient | Std Error | T-Value | P-Value |
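The search over candidate interaction terms can be sketched as a loop that adds one term at a time to a base model and records the ANOVA p-value. The data below is simulated and the `candidates` vector is a shortened, hypothetical list.

```r
set.seed(4)

# Simulated data; none of these values come from the survey.
n <- 500
toy <- data.frame(
  gender      = factor(sample(c("Male", "Female"), n, replace = TRUE),
                       levels = c("Male", "Female")),
  race        = factor(sample(c("Other", "Black", "Hispanic"), n, replace = TRUE),
                       levels = c("Other", "Black", "Hispanic")),
  numjobs2012 = rpois(n, 10)
)
toy$totinc2012 <- 50000 - 15000 * (toy$gender == "Female") -
  500 * toy$numjobs2012 + rnorm(n, sd = 15000)

base_fit <- lm(totinc2012 ~ gender + race + numjobs2012, data = toy)

# Add each candidate term in turn and test it against the base model.
candidates <- c("gender:race", "gender:numjobs2012")
p_values <- sapply(candidates, function(term) {
  bigger <- update(base_fit, as.formula(paste(". ~ . +", term)))
  anova(bigger, base_fit, test = "Chisq")[2, "Pr(>Chi)"]
})

# Smallest p-value first.
sort(p_values)
```

Picking the smallest of several p-values is a form of multiple testing, so the winning term's p-value is best read as a ranking aid rather than a strict significance level.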
|---|---|---|---|---|
| (Intercept) | 34990.596 | 6892.825 | 5.076 | 0.0000 |
| genderFemale | -23049.252 | 1097.254 | -21.006 | 0.0000 |
| raceBlack | -18462.211 | 1261.681 | -14.633 | 0.0000 |
| raceHispanic | -8608.291 | 1502.720 | -5.728 | 0.0000 |
| edu2012Middle | -664.454 | 7261.593 | -0.092 | 0.9271 |
| edu2012High | 14345.351 | 6846.031 | 2.095 | 0.0362 |
| edu2012College | 32427.331 | 6857.058 | 4.729 | 0.0000 |
| edu2012Post College | 52538.025 | 6940.122 | 7.570 | 0.0000 |
| numjobs2012 | -558.828 | 54.511 | -10.252 | 0.0000 |
| ischargedYes | -6812.844 | 1351.886 | -5.040 | 0.0000 |
| genderFemale:raceBlack | 12080.251 | 1744.569 | 6.924 | 0.0000 |
| genderFemale:raceHispanic | 7414.436 | 2055.975 | 3.606 | 0.0003 |
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 6660 | 6.439172e+12 | NA | NA | NA |
| 6662 | 6.487368e+12 | -2 | -48196770651 | 0 |
We try gender*race first. As we saw earlier, gender and race are interrelated, so it makes sense that their interaction would have a significant impact on income gaps between men and women. With small p-values on the interaction coefficients and an ANOVA p-value of 1.4972168 × 10^{-11}, we see that it does.
| Term | Coefficient | Std Error | T-Value | P-Value |
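To read the gender*race coefficients as income gaps, the interaction coefficient for each race is added to the genderFemale coefficient. The short calculation below copies the three estimates from the gender*race table; the variable names are only for illustration.

```r
# Coefficients copied from the gender*race model table (topcoded values removed).
gender_female   <- -23049.252
female_black    <- 12080.251
female_hispanic <- 7414.436

# "Other" is the reference race, so its gap is the genderFemale coefficient alone.
gap_by_race <- c(
  Other    = gender_female,
  Black    = gender_female + female_black,
  Hispanic = gender_female + female_hispanic
)
round(gap_by_race, 2)
```

Holding the other covariates fixed, the model's estimated female-minus-male difference is about -23,049 for the Other group, -10,969 for Black respondents, and -15,635 for Hispanic respondents, which keeps the same ordering as the raw gaps in the exploratory table.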
|---|---|---|---|---|
| (Intercept) | 29216.714 | 11068.656 | 2.640 | 0.0083 |
| genderFemale | -12216.717 | 14004.535 | -0.872 | 0.3831 |
| raceBlack | -11984.648 | 876.963 | -13.666 | 0.0000 |
| raceHispanic | -4618.993 | 1039.714 | -4.443 | 0.0000 |
| edu2012Middle | 1847.875 | 11583.253 | 0.160 | 0.8733 |
| edu2012High | 15609.371 | 11063.511 | 1.411 | 0.1583 |
| edu2012College | 37979.968 | 11082.589 | 3.427 | 0.0006 |
| edu2012Post College | 57635.192 | 11215.277 | 5.139 | 0.0000 |
| numjobs2012 | -564.304 | 54.688 | -10.319 | 0.0000 |
| ischargedYes | -6685.017 | 1357.170 | -4.926 | 0.0000 |
| genderFemale:edu2012Middle | -4343.277 | 14895.237 | -0.292 | 0.7706 |
| genderFemale:edu2012High | -2166.110 | 14043.919 | -0.154 | 0.8774 |
| genderFemale:edu2012College | -9777.701 | 14062.016 | -0.695 | 0.4869 |
| genderFemale:edu2012Post College | -9105.121 | 14219.580 | -0.640 | 0.5220 |
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 6658 | 6.465093e+12 | NA | NA | NA |
| 6662 | 6.487368e+12 | -4 | -22275818471 | 0.0001301 |
We next try gender*edu2012. Education levels can also be related to gender, so we expect this to have an impact as well. With an ANOVA p-value of 1.3013995 × 10^{-4}, we see that it does. It is interesting to note that the p-values for the individual interaction coefficients are not significant, but the term as a whole is a significant predictor of totinc2012.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 36734.008 | 6933.813 | 5.298 | 0.0000 |
| genderFemale | -24895.526 | 1536.565 | -16.202 | 0.0000 |
| raceBlack | -11897.184 | 876.977 | -13.566 | 0.0000 |
| raceHispanic | -4412.893 | 1039.944 | -4.243 | 0.0000 |
| edu2012Middle | -822.799 | 7272.187 | -0.113 | 0.9099 |
| edu2012High | 13577.110 | 6855.283 | 1.981 | 0.0477 |
| edu2012College | 31622.436 | 6867.163 | 4.605 | 0.0000 |
| edu2012Post College | 51366.328 | 6950.406 | 7.390 | 0.0000 |
| numjobs2012 | -856.237 | 74.810 | -11.445 | 0.0000 |
| ischargedYes | -6549.988 | 1355.402 | -4.833 | 0.0000 |
| genderFemale:numjobs2012 | 576.306 | 108.581 | 5.308 | 0.0000 |
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 6661 | 6.460048e+12 | NA | NA | NA |
| 6662 | 6.487368e+12 | -1 | -27320873100 | 1e-07 |
We noticed earlier that there didn’t seem to be a discernible trend between gender and number of previous jobs, but it’s worth checking that interaction as well, so we add gender*numjobs2012. It’s important to note here that numjobs2012 is coded as a numeric variable. Even though we didn’t see any trends in the data, with an ANOVA p-value of 1.1107091 × 10^{-7}, this interaction also seems to be a significant predictor in this model.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 33296.221 | 6911.186 | 4.818 | 0.0000 |
| genderFemale | -18338.434 | 814.892 | -22.504 | 0.0000 |
| raceBlack | -12101.930 | 877.515 | -13.791 | 0.0000 |
| raceHispanic | -4665.694 | 1040.510 | -4.484 | 0.0000 |
| edu2012Middle | -1030.132 | 7285.019 | -0.141 | 0.8876 |
| edu2012High | 13739.600 | 6867.309 | 2.001 | 0.0455 |
| edu2012College | 32123.262 | 6878.442 | 4.670 | 0.0000 |
| edu2012Post College | 51920.351 | 6961.662 | 7.458 | 0.0000 |
| numjobs2012 | -583.938 | 54.556 | -10.703 | 0.0000 |
| ischargedYes | -8469.294 | 1535.989 | -5.514 | 0.0000 |
| genderFemale:ischargedYes | 6993.280 | 3206.319 | 2.181 | 0.0292 |
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 6661 | 6.482739e+12 | NA | NA | NA |
| 6662 | 6.487368e+12 | -1 | -4629854986 | 0.0291765 |
We then check gender*ischarged, and see that it is less significant than expected (but still significant) at a p-value of 0.0291765.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 32883.343 | 6913.749 | 4.756 | 0.0000 |
| genderFemale | -17934.837 | 789.247 | -22.724 | 0.0000 |
| raceBlack | -11639.677 | 918.568 | -12.672 | 0.0000 |
| raceHispanic | -4595.488 | 1100.017 | -4.178 | 0.0000 |
| edu2012Middle | -1004.953 | 7286.090 | -0.138 | 0.8903 |
| edu2012High | 13835.341 | 6869.166 | 2.014 | 0.0440 |
| edu2012College | 32166.567 | 6880.464 | 4.675 | 0.0000 |
| edu2012Post College | 52083.013 | 6963.831 | 7.479 | 0.0000 |
| numjobs2012 | -588.512 | 54.614 | -10.776 | 0.0000 |
| ischargedYes | -4959.766 | 1894.660 | -2.618 | 0.0089 |
| raceBlack:ischargedYes | -5899.867 | 3075.965 | -1.918 | 0.0551 |
| raceHispanic:ischargedYes | -1113.637 | 3374.626 | -0.330 | 0.7414 |
| Res.Df | RSS | Df | Sum of Sq | Pr(>Chi) |
|---|---|---|---|---|
| 6660 | 6.483681e+12 | NA | NA | NA |
| 6662 | 6.487368e+12 | -2 | -3687593707 | 0.1504781 |
We explored earlier the relationship between race and ischarged, so I’m curious if an interaction term between those two variables would be significant - we try adding race*ischarged to the model, and see that it is in fact not significant, with an anova p-value of 0.1504781.
With gender*race being the most significant interaction term, we add that to our model and plot it to see if that helps our plots look better at all.
| Term | Coefficient | Std Error | T-Value | P-Value |
|---|---|---|---|---|
| (Intercept) | 34990.596 | 6892.825 | 5.076 | 0.0000 |
| genderFemale | -23049.252 | 1097.254 | -21.006 | 0.0000 |
| raceBlack | -18462.211 | 1261.681 | -14.633 | 0.0000 |
| raceHispanic | -8608.291 | 1502.720 | -5.728 | 0.0000 |
| edu2012Middle | -664.454 | 7261.593 | -0.092 | 0.9271 |
| edu2012High | 14345.351 | 6846.031 | 2.095 | 0.0362 |
| edu2012College | 32427.331 | 6857.058 | 4.729 | 0.0000 |
| edu2012Post College | 52538.025 | 6940.122 | 7.570 | 0.0000 |
| numjobs2012 | -558.828 | 54.511 | -10.252 | 0.0000 |
| ischargedYes | -6812.844 | 1351.886 | -5.040 | 0.0000 |
| genderFemale:raceBlack | 12080.251 | 1744.569 | 6.924 | 0.0000 |
| genderFemale:raceHispanic | 7414.436 | 2055.975 | 3.606 | 0.0003 |
It doesn’t - the diagnostic plots still show some trend when they should be just scatter or noise (though the trend line is straight for the residuals vs fitted graph), and the qqplot shows a distinct lack of normality, especially on the tails. The plots don’t look like they’ve changed much.
Our findings show that the variables race, edu2012, numjobs2012, ischarged and their individual interactions with gender are significant in predicting totinc2012, but these findings should all be taken with a huge grain of salt. We are trying to predict an outcome as broad as income with very few predictors, and the regression diagnostic plots (even after we added the most significant interaction term) show a lack of normality and the need for a nonlinear model, which ours isn’t.
I have very little confidence in this model or my analysis. A lot of our findings could also be affected by the fact that we took out the topcoded values. We saw in the exploratory analysis that our error bars and confidence intervals got larger when we took the topcoded values out, so there may be some variables that would have been more significant if we’d kept them in. Some variables may have been affected by low response numbers as well, both because we took all of the missing values out and because some variables, like ischarged, had very few “Yes” responses relative to “No.” That may well have skewed both our model and our findings.