Introduction

The question we are interested in answering is “Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?”

Income gaps between men and women have a long reported history, and we are interested in parsing out which other factors play into them, and by how much. To answer this question, we first do some exploratory analysis of the data to see which variables might be interesting in a regression model. Then we fit the regression model, plot its diagnostics, and run tests on the model to see which variables seem especially significant. Finally, we run ANOVA tests to see which interaction terms are most significant to the model, and plot diagnostics again on the regression model with the most significant interaction term.

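We find that race, education level, number of previous jobs, and crime history are significant predictors of income, and that their interactions with gender are significant as well, which suggests that the income gap between men and women varies with these factors.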

Importing and Cleaning the Data

First we import the data, change the column names to something more understandable, and print out the first 6 rows to check the data.

## # A tibble: 6 x 67
##      pk caseid birthisus birthissouth isforlang forlang issouth14
##   <dbl>  <dbl>     <dbl>        <dbl>     <dbl>   <dbl>     <dbl>
## 1   445      1         1            0         1       4         0
## 2   445      2         2           -4         1       4        -4
## 3   445      3         1            0         0      -4         0
## 4   445      4         1            0         0      -4        -3
## 5   445      5         1            0         0      -4         0
## 6   445      6         1            0         0      -4         0
## # ... with 60 more variables: urbanrural14 <dbl>, religion <dbl>,
## #   expedu <dbl>, ismilitary <dbl>, faplace <dbl>, fatime <dbl>,
## #   fauseful <dbl>, fajuvie <dbl>, faroles <dbl>, fashare <dbl>,
## #   fahappier <dbl>, expoccup1979 <dbl>, isedu5yrs <dbl>, race <dbl>,
## #   gender <dbl>, maritalstatus1979 <dbl>, famsize1979 <dbl>,
## #   ispoverty1979 <dbl>, ispolicestop <dbl>, agepolicestop <dbl>,
## #   ischarged <dbl>, agealc <dbl>, numweed <dbl>, ageweed <dbl>,
## #   totinc1990 <dbl>, ispoverty1990 <dbl>, edu1990 <dbl>,
## #   numjobs1990 <dbl>, numchild1990 <dbl>, youngchild1990 <dbl>,
## #   numcoke <dbl>, agecoke <dbl>, typeemp2000 <dbl>, typeoccup2000 <dbl>,
## #   spouseoccup2000 <dbl>, spousehrsweek2000 <dbl>, numchildren2000 <dbl>,
## #   totinc2000 <dbl>, famsize2000 <dbl>, totfaminc2000 <dbl>,
## #   ispoverty2000 <dbl>, marstatcollap2000 <dbl>, maritalstatus2000 <dbl>,
## #   mnthsmarraigebirth <dbl>, dobpartner2012 <dbl>, typeemp2012 <dbl>,
## #   typeoccup2012 <dbl>, spouseoccup2012 <dbl>, numwksspouse2012 <dbl>,
## #   numdrinks2012 <dbl>, totinc2012 <dbl>, estinc20121 <dbl>,
## #   estinc20122 <dbl>, totincspouse2012 <dbl>, estincspouse2012 <dbl>,
## #   famsize2012 <dbl>, region2012 <dbl>, edu2012 <dbl>,
## #   urbanrural2012 <dbl>, numjobs2012 <dbl>

Looking at this data, it’s clear that we need to recode some of the variables into a usable format. We recode the variables we need for a good exploratory analysis, skipping the ones we are not interested in exploring. In particular, we recode as factors the family attitudes questions (faplace, fatime, fauseful, fajuvie, faroles, fashare, fahappier), race (race), gender (gender), the crime questions (ispolicestop, ischarged), and education level in 2012 (edu2012), and recode as numeric the total incomes in 1990 (totinc1990), 2000 (totinc2000), and 2012 (totinc2012) and the number of previous jobs in 2012 (numjobs2012).

totinc2012 will be our outcome variable, which we model against the rest of these covariates. For these variables we also recoded the nonrespondents to NA. We will look at how we handle those missing values in the analysis section.

totinc2012, totinc1990, and totinc2000 are also all topcoded, which means that the top 2% of responses all share the same value, which is the maximum of the responses. We will look at how to deal with these values later in the analysis.
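As a rough illustration of what flagging topcoded values can look like, here is a minimal sketch on made-up incomes. It assumes, as described above, that the topcode is simply the largest value in the column; the data frame `inc`, the column `is_topcoded`, and the numbers themselves are all hypothetical.

```r
# Made-up incomes; the repeated maximum plays the role of the topcode.
inc <- data.frame(
  gender     = c("Male", "Female", "Male", "Male", "Female", "Male"),
  totinc2012 = c(40000, 35000, 300000, 90000, 300000, 300000)
)

# Assumption: the topcode is the largest observed value in the column.
topcode <- max(inc$totinc2012, na.rm = TRUE)
inc$is_topcoded <- !is.na(inc$totinc2012) & inc$totinc2012 == topcode

# Count of topcoded respondents by gender.
table(inc$gender[inc$is_topcoded])

# The same data with the topcoded rows removed.
no_top <- inc[!inc$is_topcoded, ]
```

Keeping the flag as its own column makes it easy to run each later step both with and without the topcoded rows.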

Note that we recode the family attitudes questions fauseful and fashare so that, as with the other attitudes questions, a higher factor level implies a higher level of sexism in the response.

Variable Name	Question Asked on Survey	Recoded Responses
totinc2012	TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2012)	numeric, topcoded
totinc1990	TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (1990)	numeric, topcoded
totinc2000	TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2000)	numeric, topcoded
numjobs2012	NUMBER OF DIFFERENT JOBS EVER REPORTED AS OF INTERVIEW DATE	numeric
edu2012	HIGHEST GRADE COMPLETED AS OF MAY 1 SURVEY YEAR	`0`=“Elementary”, `93`=“Elementary”, `94`=“Elementary”, `1`=“Elementary”, `2`=“Elementary”, `3`=“Elementary”, `4`=“Elementary”, `5`=“Elementary”, `6`=“Middle”, `7`=“Middle”, `8`=“Middle”, `9`=“High”, `10`=“High”, `11`=“High”, `12`=“High”, `13`=“College”, `14`=“College”, `15`=“College”, `16`=“College”, `17`=“Post College”, `18`=“Post College”, `19`=“Post College”, `20`=“Post College”, `95`=“Unknown”
race	R’S RACIAL/ETHNIC COHORT FROM SCREENER	`3`=“Other”, `2`=“Black”, `1`=“Hispanic”
gender	SEX OF R	`1`=“Male”, `2`=“Female”
ispolicestop	EVER “STOPPED” BY POLICE FOR OTHER THAN MINOR TRAFFIC OFFENSE?	`0`=“No”, `1`=“Yes”
ischarged	EVER CHARGED WITH ILLEGAL ACTIVITY? 80 INT (EXC MINOR TRAFFIC OFFENSE)	`0`=“No”, `1`=“Yes”
faplace	FAMILY ATTITUDES - WOMAN’S PLACE IS IN THE HOME?	`1`= “Strongly Disagree”, `2` = “Disagree”, `3`=“Agree”, `4`=“Strongly Agree”
fatime	FAMILY ATTITUDES - WIFE WITH FAMILY HAS NO TIME FOR OTHER EMPLOYMENT?	`1`= “Strongly Disagree”, `2` = “Disagree”, `3`=“Agree”, `4`=“Strongly Agree”
fauseful	FAMILY ATTITUDES - WORKING WIFE FEELS MORE USEFUL?	`4`=“Strongly Agree”, `3`=“Agree”, `2` = “Disagree”, `1`= “Strongly Disagree”
fajuvie	FAMILY ATTITUDES - EMPLOYMENT OF WIVES LEADS TO JUVENILE DELINQUENCY?	`1`= “Strongly Disagree”, `2` = “Disagree”, `3`=“Agree”, `4`=“Strongly Agree”
faroles	FAMILY ATTITUDES - TRADITIONAL HUSBAND/WIFE ROLES BEST?	`1`= “Strongly Disagree”, `2` = “Disagree”, `3`=“Agree”, `4`=“Strongly Agree”
fashare	FAMILY ATTITUDES - MEN SHOULD SHARE HOUSEWORK?	`4`=“Strongly Agree”, `3`=“Agree”, `2` = “Disagree”, `1`= “Strongly Disagree”
fahappier	FAMILY ATTITUDES - WOMEN ARE HAPPIER IN TRADITIONAL ROLES?	`1`= “Strongly Disagree”, `2` = “Disagree”, `3`=“Agree”, `4`=“Strongly Agree”
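The recoding in the table above can be sketched in base R roughly as follows. This is only an illustration on a toy data frame, not the code used for this report: the data frame `raw` and its values are made up, and treating every negative code as nonresponse is an assumption.

```r
# Toy data standing in for a few raw survey columns (values are made up).
raw <- data.frame(
  gender     = c(1, 2, 2, 1),
  race       = c(3, 2, 1, 3),
  fauseful   = c(1, 4, -4, 2),
  totinc2012 = c(52000, -5, 31000, -4)
)

# Assumption: negative codes mark nonresponse, so they become NA first.
raw[raw < 0] <- NA

recoded <- transform(
  raw,
  gender = factor(gender, levels = c(1, 2), labels = c("Male", "Female")),
  race   = factor(race, levels = c(3, 2, 1),
                  labels = c("Other", "Black", "Hispanic")),
  # Levels listed from 4 down to 1, so that a higher factor level
  # corresponds to a more sexist response for this question.
  fauseful = factor(fauseful, levels = c(4, 3, 2, 1),
                    labels = c("Strongly Agree", "Agree",
                               "Disagree", "Strongly Disagree"))
)

str(recoded)
```

Listing the levels explicitly also fixes the reference level of each factor, which matters later when reading the regression coefficients (for example, `genderFemale` is measured against Male).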

## # A tibble: 6 x 67
##      pk caseid birthisus birthissouth isforlang forlang issouth14
##   <dbl>  <dbl> <fct>     <fct>        <fct>     <fct>   <fct>    
## 1   445      1 US        Not South    Yes       Other   No       
## 2   445      2 Not US    -4           Yes       Other   -4       
## 3   445      3 US        Not South    No        -4      No       
## 4   445      4 US        Not South    No        -4      -3       
## 5   445      5 US        Not South    No        -4      No       
## 6   445      6 US        Not South    No        -4      No       
## # ... with 60 more variables: urbanrural14 <fct>, religion <fct>,
## #   expedu <fct>, ismilitary <fct>, faplace <fct>, fatime <fct>,
## #   fauseful <fct>, fajuvie <fct>, faroles <fct>, fashare <fct>,
## #   fahappier <fct>, expoccup1979 <dbl>, isedu5yrs <dbl>, race <fct>,
## #   gender <fct>, maritalstatus1979 <dbl>, famsize1979 <dbl>,
## #   ispoverty1979 <dbl>, ispolicestop <fct>, agepolicestop <dbl>,
## #   ischarged <fct>, agealc <dbl>, numweed <dbl>, ageweed <dbl>,
## #   totinc1990 <dbl>, ispoverty1990 <dbl>, edu1990 <dbl>,
## #   numjobs1990 <dbl>, numchild1990 <dbl>, youngchild1990 <dbl>,
## #   numcoke <dbl>, agecoke <dbl>, typeemp2000 <dbl>, typeoccup2000 <dbl>,
## #   spouseoccup2000 <dbl>, spousehrsweek2000 <dbl>, numchildren2000 <dbl>,
## #   totinc2000 <dbl>, famsize2000 <dbl>, totfaminc2000 <dbl>,
## #   ispoverty2000 <dbl>, marstatcollap2000 <dbl>, maritalstatus2000 <dbl>,
## #   mnthsmarraigebirth <dbl>, dobpartner2012 <dbl>, typeemp2012 <dbl>,
## #   typeoccup2012 <dbl>, spouseoccup2012 <dbl>, numwksspouse2012 <dbl>,
## #   numdrinks2012 <dbl>, totinc2012 <dbl>, estinc20121 <dbl>,
## #   estinc20122 <dbl>, totincspouse2012 <dbl>, estincspouse2012 <dbl>,
## #   famsize2012 <dbl>, region2012 <fct>, edu2012 <fct>,
## #   urbanrural2012 <fct>, numjobs2012 <dbl>

Data Summary & Methodology

We’ll start with some exploratory analysis of the variables we’re interested in. As we work through the data, we’ll compare what the plots look like with and without the topcoded values, and at each stage we’ll remove missing values only for the variables being explored. This keeps the plots readable and ensures that what we see in them makes sense, but it also means that our findings generalize only to the people who answered that specific question - survey nonresponse error is a major threat to generalizability, and we need to take it into account when interpreting our findings later.

Introduction

We first want to just get a sense of where our data is on the variables we’re interested in exploring - gender and totinc2012.

We first create simple count tables to see how many people of each gender we have in our responses, and then partition those tables by both race and education to see what that distribution looks like.

Count by Gender
Gender	Count
Male	6403
Female	6283

Count by Race & Gender
Race	Gender	Count
Other	Male	3790
Other	Female	3720
Black	Male	1613
Black	Female	1561
Hispanic	Male	1000
Hispanic	Female	1002

Count by Education Level & Gender
Education Level	Gender	Count
Elementary	Male	9
Elementary	Female	13
Middle	Male	90
Middle	Female	83
High	Male	1936
High	Female	1753
College	Male	1153
College	Female	1486
Post College	Male	336
Post College	Female	442
NA	Male	2879
NA	Female	2506

We see that there are roughly the same number of men and women in our data, which is important for further statistical analysis, but that some groups in our education subsets have fewer than 30 people, which is important to note for further analysis as well. T-tests are unreliable on groups of fewer than 30 people unless the data are close to normal, so we need to do further data cleaning before using edu2012 in our analysis.

Tables are sometimes hard to read so we can also display those as plots:

Another important component of this data is the topcoded values - this table shows how many men and women are in the topcoded group, which will be important to know when deciding whether to remove them. Neither group is large relative to the sample, so we may be able to remove them safely. We’ll continue exploring this later, but it is interesting to note that there are far more men than women among the topcoded values.

Number of Topcoded Values by Gender
Gender	Num Topcoded
Male	131
Female	12

The outcome variable we’re interested in is totinc2012, so let’s look at its distribution across gender, with both a boxplot and a histogram. Here, we’ve plotted the graphs without the topcoded values on the left, and the graphs with the topcoded values on the right. We can see that men seem to be making more than women on average, that there seem to be more women than men at the bottom of the income spectrum, and that the topcoded values seem to be outliers for both genders. When we take them out, the data looks more spread out and the gap between men and women looks less clear-cut. It’s also easier to see trends in the data with them taken out.

Now that we have a basic idea of what our data looks like across the variables we are interested in, we turn to a more nuanced analysis of each variable and its relationship with income gaps between men and women. Unless specified otherwise, in each of these analyses, I will be taking the topcoded values out in order to better showcase the trends in the data.

Race & Gender

Race and gender are incredibly interrelated with one another. Here we first explore in a table and then a plot what income gaps for men and women look like across race.

Income gaps seem smallest for Black respondents, which is interesting, and largest for the “Other” category, which makes sense since “Other” covers every race except Black and Hispanic, and that’s a lot of races and therefore a lot of variation. The gaps are also statistically significant for each racial group.

Income Gaps by Race
Race	Income Gap	Lower Bound	Upper Bound	Significant?
Other	19558.80	17065.65	22051.95	TRUE
Black	5402.63	2841.95	7963.31	TRUE
Hispanic	11124.96	7596.31	14653.60	TRUE
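A table of this shape can be produced by running a two-sample t-test within each group and keeping the estimated gap and its confidence interval. The sketch below is one way to do that on simulated data; the helper `gap_by_group`, the data frame `sim`, and the simulated incomes are all hypothetical, and it assumes that “Significant?” means the t-test p-value is below 0.05.

```r
set.seed(1)
n <- 600
sim <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = n / 2),
                  levels = c("Male", "Female")),
  race   = factor(sample(c("Other", "Black", "Hispanic"), n, replace = TRUE),
                  levels = c("Other", "Black", "Hispanic"))
)
# Simulated incomes with a built-in gap; these are not survey values.
sim$totinc2012 <- rnorm(n, mean = ifelse(sim$gender == "Male", 50000, 38000),
                        sd = 15000)

gap_by_group <- function(df) {
  # Welch two-sample t-test; the interval is for mean(Male) - mean(Female)
  # because Male is the first factor level.
  tt <- t.test(totinc2012 ~ gender, data = df)
  data.frame(
    income_gap  = unname(tt$estimate[1] - tt$estimate[2]),
    lower       = tt$conf.int[1],
    upper       = tt$conf.int[2],
    significant = tt$p.value < 0.05
  )
}

# One row per race, named after the race.
gaps <- do.call(rbind, lapply(split(sim, sim$race), gap_by_group))
gaps
```

The same helper can be applied to any other grouping (education level, police stops, and so on) by changing the variable passed to `split()`.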

Education & Gender

We next do the same with education and gender. Except for the Elementary category (which, as we saw, had fewer than 30 people per gender), income gaps are significant across education categories, and seem to be largest at the College and Post College levels.

Income Gaps by Education Level
Education Level	Income Gap	Lower Bound	Upper Bound	Significant?
Elementary	12614.53	-1737.77	26966.83	FALSE
Middle	11108.67	6094.88	16122.45	TRUE
High	11774.34	10031.51	13517.18	TRUE
College	21577.29	18627.23	24527.36	TRUE
Post College	21588.06	14887.29	28288.83	TRUE

Number of Previous Jobs & Gender

This variable is another one where we test to see what leaving in or taking out the topcoded values will do for our analysis.

We see the difference that taking them out makes: the data looks much more spread out once those outliers are gone. The topcoded values, for both men and women, are clearly outliers, and the confidence intervals understandably get larger when we take them out. Fewer previous jobs also seems to go with higher income for both men and women, and there seems to be a pretty even distribution of genders along the spectrum. The trend lines seem to show that women are more likely than men to have a higher income with more jobs, but the confidence intervals are so large that the difference doesn’t seem significant.

Family Attitudes & Gender

I’m curious about the income gaps and their relationship with the family attitudes questions. When we plot them, we see that the general trend is that more sexist responses go with larger income gaps until the last factor level (either “Strongly Agree” or “Strongly Disagree”, depending on the question) - this could be because those women had more incomes of 0, or because there were fewer responses. Interestingly, all the income gaps are significant.

These graphs are interesting, but I don’t think I’ll use them in my analysis - it would add too many variables, and I would assume that the correlation between these variables would be quite strong and potentially impact the regression.

Crime & Gender

We now turn to crime and its relationship with income gaps. We have two crime-related variables: ispolicestop and ischarged. I’d like to use them in my regression later but am curious about their relationship with each other, which I would assume is quite strong. I’m also particularly curious about their relationship with race and income gaps together.

Income Gaps by Police Stops
Stopped by Police?	Income Gap	Lower Bound	Upper Bound	Significant?
No	26401.07	23266.48	29535.65	TRUE
Yes	19949.03	13903.55	25994.50	TRUE

Income Gaps by Crime Charges
Charged with Crime?	Income Gap	Lower Bound	Upper Bound	Significant?
No	27389.23	24442.37	30336.10	TRUE
Yes	11811.40	4973.07	18649.72	TRUE

Interestingly, for both variables, the income gap seems smaller for those who answered “Yes” (that is, those who have been stopped by the police or charged with a crime at some point in their lives), and the gaps are statistically significant in every group.

We know that race and being charged with a crime or being stopped by the police have a relationship with each other in our society, so we check the significance of crime on income gaps between the races.

Income Gaps by Race & Police Stops
Stopped by Police?	Race	Income Gap	Lower Bound	Upper Bound	Significant?
No	Other	38060.26	33006.23	43114.29	TRUE
No	Black	10840.25	6884.22	14796.28	TRUE
No	Hispanic	19923.56	13634.53	26212.59	TRUE
Yes	Other	30767.58	20886.90	40648.25	TRUE
Yes	Black	8540.79	2035.51	15046.07	TRUE
Yes	Hispanic	16926.88	5092.75	28761.00	TRUE

Income Gaps by Race and Crime Charges
Charged with Crime?	Race	Income Gap	Lower Bound	Upper Bound	Significant?
No	Other	40000.08	35218.30	44781.87	TRUE
No	Black	11077.04	7483.62	14670.46	TRUE
No	Hispanic	19822.41	13955.04	25689.78	TRUE
Yes	Other	18714.17	8721.68	28706.66	TRUE
Yes	Black	349.16	-10834.64	11532.95	FALSE
Yes	Hispanic	15933.04	323.96	31542.11	TRUE

We see that again, income gaps are smaller among those who have been stopped by the police or charged with a crime across all races, with only one grouping of the data (Black respondents who have been charged with a crime) being not significant. This may have to do with the counts of men and women in that category. We also see that the error bars (and therefore the confidence intervals) have gotten larger, which could have to do with the smaller number of people in those categories as well.

Previous Income & Gender

I’m curious about the relationship between previous income in 1990 and 2000 with the respondents’ income in 2012. We first plot each of the previous incomes with totinc2012 (we take out the topcoded values to better see any trends within the data) and then do a pairs plot to see collinearity.

We can see from the graphs that there is at least some positive correlation between both previous incomes and current (2012) income, which makes sense. We can also see from the pairs plots that the correlation coefficients are relatively high across the board, so to avoid collinearity we won’t be using either of the previous incomes in our model.
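For reference, a correlation check of this kind might look like the sketch below. The three income columns are simulated so that they share a common component; the data frame `inc` and every number in it are made up for illustration.

```r
set.seed(2)
base <- rnorm(200, 40000, 12000)
inc <- data.frame(
  totinc1990 = base + rnorm(200, 0, 6000),
  totinc2000 = base * 1.2 + rnorm(200, 0, 6000),
  totinc2012 = base * 1.5 + rnorm(200, 0, 6000)
)
inc$totinc1990[c(3, 50)] <- NA   # a couple of simulated nonresponses

# Pairwise correlations, using only rows with no missing income.
cm <- cor(inc, use = "complete.obs")
round(cm, 2)

# Scatterplot matrix of the same three columns.
pairs(inc)
```

High off-diagonal values in `cm` are the signal that including several of these columns as predictors at once would make their coefficients hard to separate.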

Regression

We now turn to building a model on variables that we’ve explored above. Based on the exploratory analysis, this is the regression model I’ve created:

gender.lm <- lm(totinc2012 ~ gender + race + edu2012 + numjobs2012 + ispolicestop + ischarged, data = nlsy)

As a reminder, this is what the variables mean and how they are coded:

Variable Name	Question Asked on Survey	Recoded Responses
totinc2012	TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2012)	numeric, topcoded
totinc1990	TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (1990)	numeric, topcoded
totinc2000	TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (2000)	numeric, topcoded
numjobs2012	NUMBER OF DIFFERENT JOBS EVER REPORTED AS OF INTERVIEW DATE	numeric
edu2012	HIGHEST GRADE COMPLETED AS OF MAY 1 SURVEY YEAR	`0`=“Elementary”, `93`=“Elementary”, `94`=“Elementary”, `1`=“Elementary”, `2`=“Elementary”,`3`=“Elementary”, `4`=“Elementary”, `5`=“Elementary”, `6`=“Middle”, `7`=“Middle”, `8`=“Middle”, `9`=“High”, `10`=“High”, `11`=“High”, `12`=“High”, `13`=“College”, `14`=“College”, `15`=“College”, `16`=“College”, `17`=“Post College”, `18`=“Post College”, `19`=“Post College”, `20`=“Post College”, `95`=“Unknown”
race	R’S RACIAL/ETHNIC COHORT FROM SCREENER	`3`=“Other”, `2`=“Black”, `1`=“Hispanic”
gender	SEX OF R	`1`=“Male”, `2`=“Female”
ispolicestop	EVER “STOPPED” BY POLICE FOR OTHER THAN MINOR TRAFFIC OFFENSE?	`0`=“No”, `1`=“Yes”
ischarged	EVER CHARGED WITH ILLEGAL ACTIVITY? 80 INT (EXC MINOR TRAFFIC OFFENSE)	`0`=“No”, `1`=“Yes”

Introduction

We first need to decide how we’re handling missing and topcoded values.

We don’t want coefficients for factor levels that are NA, because they’re difficult to interpret, and we don’t want them distorting the estimates for the levels where values do exist. For those reasons, I’ve decided to remove the NA values.
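One way to drop incomplete rows for just the model variables is `complete.cases()`. This is a sketch on a five-row toy data frame (the names `nlsy_toy`, `model_vars`, and `nlsy_model` are hypothetical), not necessarily how the rows were removed for this report.

```r
nlsy_toy <- data.frame(
  totinc2012 = c(30000, NA, 45000, 52000, 28000),
  gender     = factor(c("Male", "Female", NA, "Male", "Female")),
  ischarged  = factor(c("No", "No", "Yes", NA, "No"))
)

# Keep only rows where every variable used in the model is present.
model_vars <- c("totinc2012", "gender", "ischarged")
keep       <- complete.cases(nlsy_toy[, model_vars])

# droplevels() removes factor levels that no longer occur after filtering.
nlsy_model <- droplevels(nlsy_toy[keep, ])
nrow(nlsy_model)
```

`lm()` drops rows with missing values on its own by default, but filtering once up front means every model in a comparison is fit to the same rows, which `anova()` requires when comparing nested models.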

The next question is how to handle the top coded values. We can plot the lm object to see what the diagnostic plots look like for the regression where we keep the top coded values in.
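As a small illustration of the diagnostic step, the sketch below fits a linear model to simulated, deliberately right-skewed incomes and draws the four standard diagnostic plots. The data frame `d` and its coefficients are made up; only the variable names follow this report.

```r
set.seed(3)
n <- 300
d <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  numjobs2012 = rpois(n, 10)
)
# Right-skewed simulated income, so the residuals are non-normal by design.
d$totinc2012 <- exp(10.5 - 0.3 * (d$gender == "Female") -
                      0.02 * d$numjobs2012 + rnorm(n, 0, 0.6))

fit <- lm(totinc2012 ~ gender + numjobs2012, data = d)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Coefficient table: estimate, standard error, t value, p-value.
round(summary(fit)$coefficients, 4)
```

With skewed data like this, the Q-Q plot bends away from the reference line in the upper tail, which is the kind of pattern discussed in this section.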

Regression Output with TopCoded
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	45555.288	11050.282	4.123	0.0000
genderFemale	-29773.056	1270.210	-23.439	0.0000
raceBlack	-18109.599	1393.790	-12.993	0.0000
raceHispanic	-7506.218	1649.109	-4.552	0.0000
edu2012Middle	-1636.406	11656.020	-0.140	0.8884
edu2012High	14197.543	10986.015	1.292	0.1963
edu2012College	39919.807	11002.000	3.628	0.0003
edu2012Post College	75262.615	11120.841	6.768	0.0000
numjobs2012	-867.370	86.919	-9.979	0.0000
ispolicestopYes	654.078	1715.731	0.381	0.7030
ischargedYes	-10319.521	2229.750	-4.628	0.0000

This next set of plots is for the regression model where we take the topcoded values out.

Regression Output without TopCoded
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	32995.234	6912.757	4.773	0.0000
genderFemale	-17931.479	806.231	-22.241	0.0000
raceBlack	-12160.662	877.459	-13.859	0.0000
raceHispanic	-4689.720	1040.826	-4.506	0.0000
edu2012Middle	-907.444	7289.107	-0.124	0.9009
edu2012High	13879.138	6870.517	2.020	0.0434
edu2012College	32212.658	6881.175	4.681	0.0000
edu2012Post College	52070.978	6964.158	7.477	0.0000
numjobs2012	-583.077	54.718	-10.656	0.0000
ispolicestopYes	-257.511	1086.940	-0.237	0.8127
ischargedYes	-6810.899	1404.005	-4.851	0.0000

Neither set of plots looks great: the residual plots show a discernible trend when they should just be noise, which suggests that a linear model is not the best fit, and the Q-Q plots show a strong departure from normality. Still, the plots for the regression without the topcoded values definitely look better than the plots for the regression with them, so we’ll continue with the analysis without the topcoded values.

We next do a pairs plot to see if there’s any discernible collinearity between any of the variables we’ve chosen. It’s hard to say for sure with a correlation coefficient because most of these are categorical variables, so this is really just for exploratory analysis.

The one pair of variables whose relationship I’m especially interested in is ispolicestop and ischarged. Intuitively, they seem like they would be related. ispolicestop is not significant in the initial regression (with a p-value of 0.812731) while ischarged is (with a p-value of 1.2559918 × 10^{-6}), so I’m curious whether that would change when I take one out.

We test both these variables by seeing what the regression looks like when we take out one and then the other.
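Dropping one term at a time can be done with `update()`. The sketch below uses simulated data in which the two crime variables are related by construction; the data frame `d`, the model names, and all numbers are hypothetical.

```r
set.seed(4)
n <- 500
stopped <- rbinom(n, 1, 0.3)
# Being charged is made more likely after a stop, so the two are related.
charged <- rbinom(n, 1, ifelse(stopped == 1, 0.5, 0.05))
d <- data.frame(
  ispolicestop = factor(stopped, levels = 0:1, labels = c("No", "Yes")),
  ischarged    = factor(charged, levels = 0:1, labels = c("No", "Yes")),
  totinc2012   = 40000 - 7000 * charged + rnorm(n, 0, 15000)
)

full          <- lm(totinc2012 ~ ispolicestop + ischarged, data = d)
no_charged    <- update(full, . ~ . - ischarged)
no_policestop <- update(full, . ~ . - ispolicestop)

# p-value of whichever crime variable remains in each reduced model.
summary(no_charged)$coefficients["ispolicestopYes", "Pr(>|t|)"]
summary(no_policestop)$coefficients["ischargedYes", "Pr(>|t|)"]
```

Using `update()` keeps the rest of the formula and the data argument identical, so the only difference between the fits is the term that was removed.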

Regression Output without ischarged
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	32687.412	6924.145	4.721	0.0000
genderFemale	-17451.072	801.478	-21.774	0.0000
raceBlack	-12076.948	878.772	-13.743	0.0000
raceHispanic	-4730.921	1042.550	-4.538	0.0000
edu2012Middle	-1425.498	7300.640	-0.195	0.8452
edu2012High	13719.560	6882.047	1.994	0.0462
edu2012College	32295.974	6892.781	4.685	0.0000
edu2012Post College	52366.176	6975.659	7.507	0.0000
numjobs2012	-610.073	54.526	-11.189	0.0000
ispolicestopYes	-1614.922	1052.079	-1.535	0.1248

When we take ischarged out of the model, the p-value for ispolicestop goes down, but only to 0.1248359, which is still not significant.

We then try to take out ispolicestop.

Regression Output without ispolicestop
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	32971.290	6911.528	4.770	0.0000
genderFemale	-17892.290	789.024	-22.676	0.0000
raceBlack	-12159.074	877.371	-13.859	0.0000
raceHispanic	-4690.934	1040.739	-4.507	0.0000
edu2012Middle	-943.901	7286.966	-0.130	0.8969
edu2012High	13851.677	6869.053	2.017	0.0438
edu2012College	32195.397	6880.302	4.679	0.0000
edu2012Post College	52055.305	6963.350	7.476	0.0000
numjobs2012	-584.012	54.571	-10.702	0.0000
ischargedYes	-6896.530	1356.587	-5.084	0.0000

ischarged stays significant, now with a p-value of 3.8017447 × 10^{-7}, so this variable seems to be more important to the model than ispolicestop. I’d still expect them to be related, so we add an interaction term and run an ANOVA to see if it’s significant.

Regression Output with ispolicestop*ischarged
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	33011.914	6913.184	4.775	0.0000
genderFemale	-17913.780	806.912	-22.200	0.0000
raceBlack	-12157.048	877.529	-13.854	0.0000
raceHispanic	-4687.994	1040.885	-4.504	0.0000
edu2012Middle	-969.068	7290.344	-0.133	0.8943
edu2012High	13813.938	6871.893	2.010	0.0444
edu2012College	32157.320	6882.266	4.672	0.0000
edu2012Post College	52022.977	6965.065	7.469	0.0000
numjobs2012	-583.669	54.731	-10.664	0.0000
ispolicestopYes	21.790	1199.155	0.018	0.9855
ischargedYes	-6081.995	1928.158	-3.154	0.0016
ispolicestopYes:ischargedYes	-1525.535	2765.786	-0.552	0.5813

ANOVA Output for ispolicestop*ischarged
Res.Df	RSS	Df	Sum of Sq	Pr(>Chi)
6660	6.487017e+12	NA	NA	NA
6661	6.487314e+12	-1	-296331764	0.5812403

With a coefficient p-value of 0.5812588 and an ANOVA p-value of 0.5812403, we see that my hypothesis was wrong and that the interaction term doesn’t actually add anything to the regression.
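The comparison behind an ANOVA table like the one above can be sketched as follows, on simulated data with no true interaction. The `Pr(>Chi)` column in the tables suggests `test = "Chisq"` was used, so the sketch does the same; the data frame `d` and the model names are hypothetical.

```r
set.seed(5)
n <- 500
d <- data.frame(
  ispolicestop = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  ischarged    = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Only ischarged affects the simulated income; there is no interaction.
d$totinc2012 <- 40000 - 7000 * (d$ischarged == "Yes") + rnorm(n, 0, 15000)

additive <- lm(totinc2012 ~ ispolicestop + ischarged, data = d)
with_int <- lm(totinc2012 ~ ispolicestop * ischarged, data = d)

# Larger model first, which gives the negative Df and Sum of Sq
# seen in the tables in this section.
cmp <- anova(with_int, additive, test = "Chisq")
cmp
cmp[2, "Pr(>Chi)"]   # p-value for the interaction term
```

When the added term has a single degree of freedom, the F-test version of this comparison gives exactly the same p-value as the t-test on that coefficient, and the chi-square version is very close in large samples. That is why the two p-values quoted above agree to four decimal places.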

We then run quick ANOVA tests on each of these variables to see which one to keep in our model moving forward. With a p-value of 0.8127237 for ispolicestop and one of 1.2280827 × 10^{-6} for ischarged, we remove ispolicestop from our regression and move forward to test other interaction terms with gender.

ANOVA Output for ischarged
Res.Df	RSS	Df	Sum of Sq	Pr(>Chi)
6661	6.487314e+12	NA	NA	NA
6662	6.510233e+12	-1	-22919088918	1.2e-06

ANOVA Output for ispolicestop
Res.Df	RSS	Df	Sum of Sq	Pr(>Chi)
6661	6.487314e+12	NA	NA	NA
6662	6.487368e+12	-1	-54664589	0.8127237

Testing Interaction Terms

We saw earlier that the residual plots for the regression implied that our model wasn’t a great fit for these variables, due to a lack of normality and signs of nonlinearity, among other issues. In this section we test whether adding an interaction term between gender and one other variable fixes these issues. To choose which interaction term to add, we run the anova function on each interaction term and choose the one with the smallest p-value.
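The selection procedure described here can be written as a short loop over candidate terms. The sketch below uses simulated data with a built-in gender-by-race interaction; the data frame `d`, the list `candidates`, and the effect sizes are made up for illustration.

```r
set.seed(6)
n <- 800
d <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  race   = factor(sample(c("Other", "Black", "Hispanic"), n, replace = TRUE),
                  levels = c("Other", "Black", "Hispanic")),
  numjobs2012 = rpois(n, 10)
)
female <- d$gender == "Female"
# Simulated income with a real gender:race interaction and no gender:numjobs one.
d$totinc2012 <- 50000 - 20000 * female - 15000 * (d$race == "Black") +
  12000 * (female & d$race == "Black") - 500 * d$numjobs2012 +
  rnorm(n, 0, 12000)

base_fit   <- lm(totinc2012 ~ gender + race + numjobs2012, data = d)
candidates <- c("gender:race", "gender:numjobs2012")

# For each candidate, add it to the base model and test it against the base.
p_values <- sapply(candidates, function(term) {
  bigger <- update(base_fit, as.formula(paste(". ~ . +", term)))
  anova(bigger, base_fit, test = "Chisq")[2, "Pr(>Chi)"]
})

sort(p_values)               # smallest p-value first
names(which.min(p_values))   # the term that would be added to the model
```

One caveat with this approach is that picking the smallest of several p-values is a form of multiple testing, so the selected term’s p-value is somewhat optimistic.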

Regression Output for gender*race
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	34990.596	6892.825	5.076	0.0000
genderFemale	-23049.252	1097.254	-21.006	0.0000
raceBlack	-18462.211	1261.681	-14.633	0.0000
raceHispanic	-8608.291	1502.720	-5.728	0.0000
edu2012Middle	-664.454	7261.593	-0.092	0.9271
edu2012High	14345.351	6846.031	2.095	0.0362
edu2012College	32427.331	6857.058	4.729	0.0000
edu2012Post College	52538.025	6940.122	7.570	0.0000
numjobs2012	-558.828	54.511	-10.252	0.0000
ischargedYes	-6812.844	1351.886	-5.040	0.0000
genderFemale:raceBlack	12080.251	1744.569	6.924	0.0000
genderFemale:raceHispanic	7414.436	2055.975	3.606	0.0003

ANOVA Output for gender*race
Res.Df	RSS	Df	Sum of Sq	Pr(>Chi)
6660	6.439172e+12	NA	NA	NA
6662	6.487368e+12	-2	-48196770651	0

We try gender*race first. Gender and race, as we saw earlier, are interrelated, and it makes sense that their interaction would have a significant impact on income gaps between men and women. With small p-values for the interaction coefficients and an ANOVA p-value of 1.4972168 × 10^{-11}, we see that it does.

Regression Output for gender*edu2012
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	29216.714	11068.656	2.640	0.0083
genderFemale	-12216.717	14004.535	-0.872	0.3831
raceBlack	-11984.648	876.963	-13.666	0.0000
raceHispanic	-4618.993	1039.714	-4.443	0.0000
edu2012Middle	1847.875	11583.253	0.160	0.8733
edu2012High	15609.371	11063.511	1.411	0.1583
edu2012College	37979.968	11082.589	3.427	0.0006
edu2012Post College	57635.192	11215.277	5.139	0.0000
numjobs2012	-564.304	54.688	-10.319	0.0000
ischargedYes	-6685.017	1357.170	-4.926	0.0000
genderFemale:edu2012Middle	-4343.277	14895.237	-0.292	0.7706
genderFemale:edu2012High	-2166.110	14043.919	-0.154	0.8774
genderFemale:edu2012College	-9777.701	14062.016	-0.695	0.4869
genderFemale:edu2012Post College	-9105.121	14219.580	-0.640	0.5220

ANOVA Output for gender*edu2012
Res.Df	RSS	Df	Sum of Sq	Pr(>Chi)
6658	6.465093e+12	NA	NA	NA
6662	6.487368e+12	-4	-22275818471	0.0001301

We next try gender*edu2012. Education levels can also be related to gender, so we expect this to have an impact as well. With an ANOVA p-value of 1.3013995 × 10^{-4}, we see that it does. It is interesting to note that the p-values for the individual interaction coefficients are not significant, but the term as a whole is a significant predictor of totinc2012.

Regression Output for gender*numjobs2012
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	36734.008	6933.813	5.298	0.0000
genderFemale	-24895.526	1536.565	-16.202	0.0000
raceBlack	-11897.184	876.977	-13.566	0.0000
raceHispanic	-4412.893	1039.944	-4.243	0.0000
edu2012Middle	-822.799	7272.187	-0.113	0.9099
edu2012High	13577.110	6855.283	1.981	0.0477
edu2012College	31622.436	6867.163	4.605	0.0000
edu2012Post College	51366.328	6950.406	7.390	0.0000
numjobs2012	-856.237	74.810	-11.445	0.0000
ischargedYes	-6549.988	1355.402	-4.833	0.0000
genderFemale:numjobs2012	576.306	108.581	5.308	0.0000

ANOVA Output for gender*numjobs2012
Res.Df	RSS	Df	Sum of Sq	Pr(>Chi)
6661	6.460048e+12	NA	NA	NA
6662	6.487368e+12	-1	-27320873100	1e-07

We noticed earlier that there didn’t seem to be a discernible trend between gender and number of previous jobs, but it’s worth checking that interaction as well, so we add gender*numjobs2012. It’s important to note here that numjobs2012 is coded as a numeric variable. Even though we didn’t see any trends in the data, with an ANOVA p-value of 1.1107091 × 10^{-7}, this interaction also seems to be a significant predictor in this model.

Regression Output for gender*ischarged
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	33296.221	6911.186	4.818	0.0000
genderFemale	-18338.434	814.892	-22.504	0.0000
raceBlack	-12101.930	877.515	-13.791	0.0000
raceHispanic	-4665.694	1040.510	-4.484	0.0000
edu2012Middle	-1030.132	7285.019	-0.141	0.8876
edu2012High	13739.600	6867.309	2.001	0.0455
edu2012College	32123.262	6878.442	4.670	0.0000
edu2012Post College	51920.351	6961.662	7.458	0.0000
numjobs2012	-583.938	54.556	-10.703	0.0000
ischargedYes	-8469.294	1535.989	-5.514	0.0000
genderFemale:ischargedYes	6993.280	3206.319	2.181	0.0292

ANOVA Output for gender*ischarged
Res.Df	RSS	Df	Sum of Sq	Pr(>Chi)
6661	6.482739e+12	NA	NA	NA
6662	6.487368e+12	-1	-4629854986	0.0291765

We then check gender*ischarged, and see that it is less significant than expected (but still significant) at a p-value of 0.0291765.

Regression Output for race*ischarged
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	32883.343	6913.749	4.756	0.0000
genderFemale	-17934.837	789.247	-22.724	0.0000
raceBlack	-11639.677	918.568	-12.672	0.0000
raceHispanic	-4595.488	1100.017	-4.178	0.0000
edu2012Middle	-1004.953	7286.090	-0.138	0.8903
edu2012High	13835.341	6869.166	2.014	0.0440
edu2012College	32166.567	6880.464	4.675	0.0000
edu2012Post College	52083.013	6963.831	7.479	0.0000
numjobs2012	-588.512	54.614	-10.776	0.0000
ischargedYes	-4959.766	1894.660	-2.618	0.0089
raceBlack:ischargedYes	-5899.867	3075.965	-1.918	0.0551
raceHispanic:ischargedYes	-1113.637	3374.626	-0.330	0.7414

ANOVA Output for race*ischarged
Res.Df	RSS	Df	Sum of Sq	Pr(>Chi)
6660	6.483681e+12	NA	NA	NA
6662	6.487368e+12	-2	-3687593707	0.1504781

We explored earlier the relationship between race and ischarged, so I’m curious if an interaction term between those two variables would be significant - we try adding race*ischarged to the model, and see that it is in fact not significant, with an anova p-value of 0.1504781.

With gender*race being the most significant interaction term, we add that to our model and plot it to see if that helps our plots look better at all.

Regression Output for regression with gender*race
Term	Coefficient	Std Error	T-Value	P-Value
(Intercept)	34990.596	6892.825	5.076	0.0000
genderFemale	-23049.252	1097.254	-21.006	0.0000
raceBlack	-18462.211	1261.681	-14.633	0.0000
raceHispanic	-8608.291	1502.720	-5.728	0.0000
edu2012Middle	-664.454	7261.593	-0.092	0.9271
edu2012High	14345.351	6846.031	2.095	0.0362
edu2012College	32427.331	6857.058	4.729	0.0000
edu2012Post College	52538.025	6940.122	7.570	0.0000
numjobs2012	-558.828	54.511	-10.252	0.0000
ischargedYes	-6812.844	1351.886	-5.040	0.0000
genderFemale:raceBlack	12080.251	1744.569	6.924	0.0000
genderFemale:raceHispanic	7414.436	2055.975	3.606	0.0003

It doesn’t - the diagnostic plots still show some trend when they should be just scatter or noise (though the trend line is straight for the residuals vs fitted graph), and the qqplot shows a distinct lack of normality, especially on the tails. The plots don’t look like they’ve changed much.

Discussion

Our findings show that race, edu2012, numjobs2012, ischarged, and their individual interactions with gender are significant in predicting totinc2012, but these findings should all be taken with a huge grain of salt - we are trying to predict a complex outcome with very few variables, and the regression diagnostic plots (even after we added the most significant interaction term) show a lack of normality and the need for a nonlinear model, which ours isn’t.

I have very little confidence in this model or my analysis. A lot of our findings could also be impacted by the fact that we took out the topcoded values - we saw in the exploratory analysis that our error bars and confidence intervals got larger when we took the topcoded values out, so there may have been some variables that would be more significant if we’d kept them in. Some variables may have been impacted by low response numbers as well, since we took all of the missing values out, but also because some variables like ischarged had very few responses for “Yes” relative to responses of “No.” That definitely may have skewed both our model and our findings.

Final Project

Satvika Neti

December 8, 2019
