Introduction

This report seeks to answer the following three questions:

First, how do various socio-economic variables affect pay rate? Second, was there gender-based pay inequity when this data was collected? Finally, was there race-based pay inequity as well?

We will be using a data set called wage1, a real-life data set from the wooldridge library. It contains wage data from the 1976 Current Population Survey, along with variables describing the people surveyed. There are 24 variables in the data set, but the most important ones for this analysis are: wage (average hourly earnings), educ (years of education), exper (years of potential experience), tenure (years with current employer), nonwhite (=1 if nonwhite), and female (=1 if female). The full data set can be viewed below:

Throughout, we will need the functionality of the tidyverse package, mainly to create box plot and histogram visualizations. The modelr package provides helper functions for the regression modeling I will be doing. The DT package helps me create nice-looking data tables, like the one above. Finally, I already explained the wooldridge library in the introduction.

library(tidyverse)
library(modelr)
library(DT)
library(wooldridge)
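
The code in the rest of this report refers to a tibble named wage1_tibble1, but the step that creates it is not shown. The following is a minimal sketch of how it could be built, under the assumption that female and nonwhite were converted to factors (this would be consistent with the coefficient names female1 and nonwhite1 in the regression output). The construction actually used may differ.

```r
library(tidyverse)
library(wooldridge)

# Load the wage1 data frame shipped with the wooldridge package
data("wage1")

# Assumption: the 0/1 indicators are stored as factors so that ggplot2
# treats them as discrete groups and lm() labels the coefficient "female1"
wage1_tibble1 <- as_tibble(wage1) %>%
  mutate(
    female = as.factor(female),
    nonwhite = as.factor(nonwhite)
  )

# Quick structural check: column types and the first few values
glimpse(wage1_tibble1)
```

With the indicators stored as factors, the box plots and models that follow treat gender and race as two discrete groups rather than as numeric values.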

Part 1: Was there Gender-Based Pay Inequity?

Quick Visualization of the Question

The problem we are investigating deals with the relationship between a person’s gender and their average hourly earnings. We suspect that a person’s gender can affect how much money they make. Here we want to see whether a quick visualization can show the difference in earnings between the two genders, and how well it answers the question. I believe that this visualization will show the difference in wage between the two genders, but will not fully answer the question, because this is an intricate question that takes more than a visualization to settle. We can test this with a box plot:

ggplot(data = wage1_tibble1) +
  geom_boxplot(mapping = aes(x = reorder(female, wage), y = wage, color = female)) +
  labs(x = "gender (1 = female and 0 = male)",
       y = "average hourly earnings",
       color = "gender",
       title = "Wage as a Function of Gender",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")

I was right. The visualization does not give me enough information to satisfactorily answer the overall question of whether there is gender-based pay inequity. The main reason this box plot is not sufficient is that it is too simple a visualization to convey all the intricate factors that go into whether females and males are paid differently. All this box plot shows is that males have a higher median wage than females; it does not take into account, for example, the type of job, assertiveness, education level, etc. To fully explain whether there is gender-based pay inequity, we need to look at all the factors that contribute to it.

Building a Simple Regression Model

wage1_tibble1_model <- lm(wage ~ female, data = wage1_tibble1)

Discussing the Regression Coefficient and P-Value

The question we are trying to answer here is: what do the regression coefficient and the p-value of female tell me about the pay differences between males and females? My hypothesis is that the regression coefficient will show that males make more money than females; most likely not a lot more, but possibly a couple of dollars or so. I also think that the p-value will be less than the 0.05 threshold, which would say that gender has a significant effect on one’s wage. I believe this because this has been a hot topic for a while, and I think this data set could show that it was an issue in the past. We can test this hypothesis by looking at the summary statistics for the simple regression model:

summary(wage1_tibble1_model)

## 
## Call:
## lm(formula = wage ~ female, data = wage1_tibble1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5995 -1.8495 -0.9877  1.4260 17.8805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.0995     0.2100  33.806  < 2e-16 ***
## female1      -2.5118     0.3034  -8.279 1.04e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.476 on 524 degrees of freedom
## Multiple R-squared:  0.1157, Adjusted R-squared:  0.114 
## F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15

My hypothesis was correct. The regression coefficient of female tells me that when the predictor variable changes from 0 to 1, the response variable, average hourly earnings, drops by about $2.51; that is, females earn about $2.51 less per hour than males on average. The null hypothesis is that gender has no effect on wages. The p-value for female is 1.04e-15 (0.00000000000000104), which is less than the 0.05 threshold. This means that we can reject the null hypothesis. In other words, gender has a statistically significant effect on wage, and the estimated difference is -2.5118 dollars per hour.
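
In a regression with a single 0/1 predictor, the intercept is the mean of the 0 group and the coefficient is the difference between the two group means. The sketch below checks this on the wage1 data; wage_df is a hypothetical stand-in for wage1_tibble1, and it assumes female is stored as a factor.

```r
library(wooldridge)
data("wage1")

# Hypothetical stand-in for wage1_tibble1; female is assumed to be a factor
wage_df <- transform(wage1, female = factor(female))

simple_fit <- lm(wage ~ female, data = wage_df)

# Mean hourly wage within each group (names are "0" and "1")
group_means <- tapply(wage_df$wage, wage_df$female, mean)

# The intercept should match the mean wage for males (female = 0)
male_mean <- unname(group_means["0"])
intercept <- unname(coef(simple_fit)["(Intercept)"])

# The female1 coefficient should match the female mean minus the male mean
mean_gap <- unname(group_means["1"] - group_means["0"])
female_coef <- unname(coef(simple_fit)["female1"])

c(intercept = intercept, male_mean = male_mean)
c(female_coef = female_coef, mean_gap = mean_gap)
```

This is why the simple model cannot say anything more than the raw difference in average pay: it has no other information to adjust for.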

Should the Previous Section be Believed?

The question here is whether, based on the goodness-of-fit statistics, the conclusion from the previous section should be believed. My theory is that we should not believe the conclusion, because this is a very simple regression model. Models like this do not tell the full picture because they omit essential variables. We can test this theory by looking at the RSE and the R^2 values in the summary statistics of the simple regression model.

No, I should not fully believe the conclusion from the previous section. The main reason is the two R^2 values. The multiple R^2 value is 0.1157 and the adjusted R^2 value is 0.114. These values are far from 1: gender alone explains only about 11.6% of the variation in wage, so this model leaves most of the variation unexplained. On top of that, the RSE is on the high side at 3.476, meaning the model’s predictions are typically off by about $3.48 per hour. A lower RSE would be better.
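
To make these two statistics less of a black box, the sketch below recomputes them from the residuals: R^2 is one minus the ratio of the residual sum of squares to the total sum of squares, and the RSE is the square root of the residual sum of squares divided by the residual degrees of freedom. As before, wage_df is a hypothetical stand-in for wage1_tibble1, with female assumed to be a factor.

```r
library(wooldridge)
data("wage1")

# Hypothetical stand-in for wage1_tibble1; female is assumed to be a factor
wage_df <- transform(wage1, female = factor(female))
simple_fit <- lm(wage ~ female, data = wage_df)

res <- residuals(simple_fit)
ssr <- sum(res^2)                                  # residual sum of squares
sst <- sum((wage_df$wage - mean(wage_df$wage))^2)  # total sum of squares

# Share of the variation in wage that the model accounts for
r_squared <- 1 - ssr / sst

# Typical size of a residual, adjusted for the number of estimated coefficients
rse <- sqrt(ssr / df.residual(simple_fit))

# Compare the manual values with the ones reported by summary()
c(manual = r_squared, reported = summary(simple_fit)$r.squared)
c(manual = rse, reported = summary(simple_fit)$sigma)
```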

Visualization of the Residual Distribution

The task here is to create a visualization of the distribution of the model’s residuals and to discuss what it means. My assumption is that the residuals will not look very normally distributed. My rationale is that this is a simple regression model, and we said earlier that we cannot fully believe its conclusion, so I expect its residual distribution to look poor as well. We can find out by creating a histogram:

wage1_tibble1_w_resid <- wage1_tibble1 %>%
  add_residuals(wage1_tibble1_model)

ggplot(wage1_tibble1_w_resid) +
  geom_histogram(aes(resid), bins = 30) +
  labs(x = "residual",
       y = "number of people",
       title = "Simple Regression Model's Residual Distribution",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")

My assumption was correct. The residual distribution is only somewhat normal: it does have a peak near 0, but it is right-skewed, with a long tail trailing to the right, which is something we do not want. What this distribution means is that the model under-predicts the wages of a relatively small number of high earners by a large amount. This suggests that there could be a pattern in the data set that the model is not detecting.
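
The skew can also be checked numerically. A least squares model with an intercept always has residuals that average to zero, so the sketch below compares the mean with the median and computes a simple moment-based skewness measure. The names wage_df and skewness are illustrative, and female is assumed to be a factor.

```r
library(wooldridge)
data("wage1")

# Hypothetical stand-in for wage1_tibble1; female is assumed to be a factor
wage_df <- transform(wage1, female = factor(female))
simple_fit <- lm(wage ~ female, data = wage_df)

res <- residuals(simple_fit)

# Mean is zero up to floating-point error; a median below it points to right skew
res_mean <- mean(res)
res_median <- median(res)

# Moment-based skewness: third central moment over the 1.5 power of the second
centered <- res - res_mean
skewness <- mean(centered^3) / mean(centered^2)^1.5

c(mean = res_mean, median = res_median, skewness = skewness)

# How many people are under-predicted (positive residual) vs over-predicted
table(sign(res))
```

A positive skewness value together with a negative median would agree with what the histogram shows: many small over-predictions and fewer, larger under-predictions.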

Conclusion to my Simple Regression Model

The last question for this section is: what can we conclude from the simple regression model I made? The way to answer this is to look at what the model says at face value about the gender-based pay inequity question. This means looking at the regression coefficient and p-value of female, as well as the RSE and the two R^2 values.

What I can conclude from the simple regression model is that there is in fact a correlation between how much a person gets paid and their gender. The simple regression model tells me that the female variable has a significant effect on the wage variable: males get paid about $2.51 per hour more than females.

Adding the Control Variables

What I Think are the Confounding Variables

The first problem for this section is: what are the three most likely confounding variables, and why? To answer this, I need to look at the documentation for the wage1 data set and see which ones I believe are the most likely.

The first most likely confounding variable is educ. I think education is a confounding variable because someone with more years of education would most likely be paid more for the knowledge they have gained. The second most likely confounding variable is exper. I believe experience is a confounding variable because someone with more work experience would be paid more as a result of their accumulated knowledge of the field and their job. The third most likely confounding variable is tenure. I suspect tenure is a confounding variable because someone who stays at a company longer would likely get paid more: they know the company better and will receive more promotions, bonuses, and/or raises over the years.
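
For a variable to confound the gender comparison, it should be related to both gender and wage. A rough way to check this is to compare the group averages of each candidate control and look at its correlation with wage. This is only a sketch: wage_df is a hypothetical stand-in for wage1_tibble1, and female is assumed to be a factor.

```r
library(wooldridge)
data("wage1")

# Hypothetical stand-in for wage1_tibble1; female is assumed to be a factor
wage_df <- transform(wage1, female = factor(female))

# Average education, experience, and tenure for males (0) and females (1)
control_means <- aggregate(cbind(educ, exper, tenure) ~ female,
                           data = wage_df, FUN = mean)
control_means

# Correlation of wage with each candidate control
control_cors <- cor(wage_df$wage, wage_df[, c("educ", "exper", "tenure")])
control_cors
```

A control whose average differs noticeably between the two groups and that is also correlated with wage is a plausible confounder; one that fails either check is less likely to change the female coefficient much.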

Rebuilding the Regression Model

This question is a continuation of the previous one: I just need to add the three confounding variables I believe are important, as controls, to the simple regression model. This turns it into a multiple regression model:

wage1_tibble1_model_cf <- lm(wage ~ female + educ + exper + tenure, data = wage1_tibble1)

summary(wage1_tibble1_model_cf)

## 
## Call:
## lm(formula = wage ~ female + educ + exper + tenure, data = wage1_tibble1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7675 -1.8080 -0.4229  1.0467 14.0075 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.56794    0.72455  -2.164   0.0309 *  
## female1     -1.81085    0.26483  -6.838 2.26e-11 ***
## educ         0.57150    0.04934  11.584  < 2e-16 ***
## exper        0.02540    0.01157   2.195   0.0286 *  
## tenure       0.14101    0.02116   6.663 6.83e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.958 on 521 degrees of freedom
## Multiple R-squared:  0.3635, Adjusted R-squared:  0.3587 
## F-statistic:  74.4 on 4 and 521 DF,  p-value: < 2.2e-16

Assessing the Multiple Regression Model

Comparing the Residual Distribution Models

For this section, we first compare the multiple regression model’s residual distribution to the simple one’s. We are specifically looking at whether the multiple model’s residual distribution shows any improvement. My hypothesis is that there will be improvements going from the simple residual distribution to the multiple one, because the multiple model takes into account variables that could alter how gender affects pay rate, while the simple one does not. To see if this is true, we need to create the multiple residual distribution and compare it to the simple one:

wage1_tibble1_cf_w_resid <- wage1_tibble1 %>%
  add_residuals(wage1_tibble1_model_cf)

ggplot(wage1_tibble1_cf_w_resid) +
  geom_histogram(aes(resid), bins = 30) +
  labs(x = "residual",
       y = "number of people",
       title = "Multiple Regression Model's Residual Distribution",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")

Yes, I do see improvements in the multiple regression model’s residual distribution compared to the simple one. The peak for the new one is closer to 0, and the graph is not as right-skewed as the simple regression model’s residual distribution. On the right side, this distribution stops before the 15 mark on the x-axis, while the simple one goes past it. It looks more evenly distributed on both sides, but is still somewhat right-skewed.

How Good is the Multiple Regression Model?

For the next question, we look at how the multiple regression model’s goodness-of-fit statistics compare to the simple model’s, and at what these statistics mean. My theory is that this model will have much better goodness-of-fit statistics than the simple model, because it takes into account other variables that can influence how gender affects pay rate, which the simple model did not do. This should make the multiple model more accurate. To test my theory, we need to compare the two models’ summary statistics.

I was correct with my theory. The goodness-of-fit statistics of the multiple regression model are pretty good. The multiple R^2 is 0.3635 and the adjusted R^2 is 0.3587, which is reasonable for our model, although still well short of 1. These tell us that the model explains about 36% of the variation in wage; the rest comes from random noise and from variables not included in the model. The RSE is also decently low at 2.958. If the residuals were roughly normally distributed, about 68% of them would fall between -2.958 and 2.958. So yes, this model’s goodness-of-fit statistics are considerably better than the simple regression model’s: the multiple regression model has a lower RSE and higher multiple and adjusted R^2.
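
The "about 68% within one RSE" reading only holds if the residuals are close to normal, so it is worth checking the actual share. The sketch below does that for both models, using wage_df as a hypothetical stand-in for wage1_tibble1 and assuming female is a factor.

```r
library(wooldridge)
data("wage1")

# Hypothetical stand-in for wage1_tibble1; female is assumed to be a factor
wage_df <- transform(wage1, female = factor(female))

simple_fit <- lm(wage ~ female, data = wage_df)
multi_fit <- lm(wage ~ female + educ + exper + tenure, data = wage_df)

# Share of residuals that fall within one RSE of zero for a given model
share_within_rse <- function(fit) {
  mean(abs(residuals(fit)) <= summary(fit)$sigma)
}

fit_comparison <- data.frame(
  model = c("simple", "multiple"),
  rse = c(summary(simple_fit)$sigma, summary(multi_fit)$sigma),
  r_squared = c(summary(simple_fit)$r.squared, summary(multi_fit)$r.squared),
  share_within_rse = c(share_within_rse(simple_fit), share_within_rse(multi_fit))
)
fit_comparison
```

If the observed share is well above 68%, that is another sign of the skew seen in the histogram: most residuals are small, and a few large ones inflate the RSE.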

Is the Model Reliable?

The final question for this section is whether the multiple regression model I created is a reliable one. The best way to answer this is to look at the residual distribution from the multiple model, along with the goodness-of-fit statistics and the confounding variables I chose. These are the factors that can help me decide whether my model is truly reliable.

Yes, I do believe that I have a reliable model. The residual distribution already looks about as evenly distributed as I could get it. I tried other combinations of confounding variables, and none of them looked as good or as evenly distributed as the one above. The RSE is low enough to be usable, and the multiple and adjusted R^2 are as close to 1 as I could get with the combinations I tried. There could be models with a better R^2, RSE, or residual distribution, but this model gave the best overall balance of the three. Finally, the three confounding variables I chose are, I believe, the most important factors in why someone would be paid more than someone else; in this case, why males would be paid more than females.

Interpreting the Multiple Regression Model Results

What the Regression Coefficient Tells Me

The first problem for this section is: what is the regression coefficient of female in the multiple model, and how does it compare to the simple one? We also need to say what the regression coefficient means in the context in which the data was collected. My guess is that the regression coefficient in the multiple model will be bigger, but not by much; it will still be negative. My reasoning is that the simple model is not very reliable, so its lower coefficient is probably not accurate, and a more accurate one should, in theory, be bigger. To test this, we can look at the summary statistics for the multiple regression model and compare them to the simple one:

summary(wage1_tibble1_model_cf)

## 
## Call:
## lm(formula = wage ~ female + educ + exper + tenure, data = wage1_tibble1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7675 -1.8080 -0.4229  1.0467 14.0075 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.56794    0.72455  -2.164   0.0309 *  
## female1     -1.81085    0.26483  -6.838 2.26e-11 ***
## educ         0.57150    0.04934  11.584  < 2e-16 ***
## exper        0.02540    0.01157   2.195   0.0286 *  
## tenure       0.14101    0.02116   6.663 6.83e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.958 on 521 degrees of freedom
## Multiple R-squared:  0.3635, Adjusted R-squared:  0.3587 
## F-statistic:  74.4 on 4 and 521 DF,  p-value: < 2.2e-16

I would say that I was correct. The regression coefficient of female in the multiple regression model is -1.81085. This tells me that, holding education, experience, and tenure constant, when the predictor variable changes from 0 to 1, average hourly earnings drop by about $1.81; that is, females earn about $1.81 less per hour than comparable males. Compared to the simple regression model’s coefficient, it is bigger (closer to zero): the multiple one is -1.81085 and the simple one is -2.5118, a difference of 0.70095 (-1.81085 - (-2.5118)).
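
The same comparison can be done directly from the fitted models instead of by hand. The sketch below refits both models and places the two female1 coefficients side by side; wage_df and gap are illustrative names, and female is assumed to be a factor.

```r
library(wooldridge)
data("wage1")

# Hypothetical stand-in for wage1_tibble1; female is assumed to be a factor
wage_df <- transform(wage1, female = factor(female))

simple_fit <- lm(wage ~ female, data = wage_df)
multi_fit <- lm(wage ~ female + educ + exper + tenure, data = wage_df)

# Estimated hourly wage gap for females in each model
gap <- c(
  simple = coef(simple_fit)[["female1"]],
  multiple = coef(multi_fit)[["female1"]]
)
gap

# Part of the raw gap that is absorbed once the controls are added
gap_change <- gap[["multiple"]] - gap[["simple"]]
gap_change
```

The change in the coefficient is the part of the raw gap that lines up with differences in education, experience, and tenure; what remains is the gap between otherwise comparable males and females.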

What the P-Value Tells Me

The second question is: what is the p-value of the female coefficient from the previous problem, and what does it tell me about the model?

The null hypothesis is that gender has no effect on wages. The p-value for female is 2.26e-11 (0.0000000000226), which is less than the 0.05 threshold. This means that we can reject the null hypothesis: even after controlling for education, experience, and tenure, gender has a statistically significant effect on wage, and the estimated difference is -1.81085 dollars per hour.

My Answer to the Main Question

The final question for this section goes back to the main question I have been trying to answer since the beginning of Part 1: was there gender-based pay inequity when this data was collected? The best way to answer this is to focus mainly on the multiple regression model, because it is the more reliable of the two models for answering this intricate question.

Yes, I believe that there was gender-based pay inequity when this data was collected, because the multiple regression model points to that conclusion. The model says that gender has a significant effect on wage, with an estimated gap of about -1.81085 dollars per hour. This conclusion is strengthened by the fact that this model has a higher R^2 and a lower RSE than the simple model. Overall, the multiple regression model indicates that there was gender-based pay inequity at this time.

Should this Model be Used Today?

The final question of this part is: would it be appropriate to use my multiple regression model and its conclusion to make any claim about gender-based pay inequity today? To answer this, I need to think about the context in which this data was collected and what would be useful to know about this issue today.

Overall, I want to say no. This data is from 1976, and a lot could have changed since then in how much males and females are paid, so the data is likely outdated and would not really explain the current state of gender-based pay inequity. We could use parts of it, like the regression coefficient of female, to show how the gap has changed between when this data was collected and today. We could also use the confounding variables as possible reasons why males and females are paid differently, because I believe these reasons may have stayed somewhat consistent over the years. Finally, we could compare the model’s conclusion with today’s situation, but we should take it with a grain of salt. In short, we should not use this model or its conclusion to make any claim about gender-based pay inequity today; at most, we can use it as a point of comparison to see how things have changed.

Part 2: Was there Race-Based Pay Inequity?

Quick Visualization of the Question

The problem we are investigating deals with the relationship between a person’s race and their average hourly earnings. We suspect that a person’s race can affect how much money they make. Here we want to see whether a quick visualization can show the difference in earnings between whites and nonwhites, and how well it answers the question. I think that a visualization will show the difference in wage, but will not fully answer the question, mainly because this is a multifaceted question that a simple visualization cannot answer. We can test this with a box plot:

ggplot(data = wage1_tibble1) +
  geom_boxplot(mapping = aes(x = reorder(nonwhite, wage), y = wage, color = nonwhite)) +
  labs(x = "race (1 = nonwhite and 0 = white)",
       y = "average hourly earnings",
       color = "race",
       title = "Wage as a Function of Race",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")

My reasoning was correct. This visualization does not satisfactorily answer the overall question of whether there is race-based pay inequity, because it does not tell me enough. A box plot is too simple a visualization for a question like this: it cannot convey all of the variables and factors that play into whether nonwhites and whites are paid differently. What this box plot shows is that there is only a minuscule difference in wage between nonwhites and whites. It falls short because all it does is compare the median wages of nonwhites and whites, without taking into account the variables that might contribute to any difference. To answer a question like this, we need to look at all the components that go into it.

Building a Simple Regression Model

wage1_tibble1_model2 <- lm(wage ~ nonwhite, data = wage1_tibble1)

Discussing the Regression Coefficient and P-Value

The question we are trying to answer here is: what do the regression coefficient and the p-value of nonwhite tell me about the pay differences between whites and nonwhites? My hypothesis is that the regression coefficient will show that whites and nonwhites make roughly the same amount of money; most likely the difference will be minuscule. I also believe that the p-value will be more than the 0.05 threshold, which would say that race does not have a significant effect on one’s wage. We can test this hypothesis by looking at the summary statistics for the simple regression model:

summary(wage1_tibble1_model2)

## 
## Call:
## lm(formula = wage ~ nonwhite, data = wage1_tibble1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.414 -2.526 -1.259  1.026 19.036 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.9442     0.1700  34.961   <2e-16 ***
## nonwhite1    -0.4682     0.5306  -0.882    0.378    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.694 on 524 degrees of freedom
## Multiple R-squared:  0.001484,   Adjusted R-squared:  -0.0004218 
## F-statistic: 0.7786 on 1 and 524 DF,  p-value: 0.378

My hypothesis was correct. The regression coefficient of nonwhite tells me that when the predictor variable changes from 0 to 1, average hourly earnings drop by about $0.47; that is, nonwhites earn about $0.47 less per hour than whites on average. The null hypothesis is that race has no effect on wages. The p-value for nonwhite is 0.378, which is more than the 0.05 threshold. This means that we do not have enough evidence to reject the null hypothesis: in this model, race does not have a statistically significant effect on wage.
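
Another way to read a p-value above 0.05 is that the 95% confidence interval for the coefficient includes zero. The sketch below computes that interval with confint(); wage_df is a hypothetical stand-in for wage1_tibble1, and nonwhite is assumed to be a factor.

```r
library(wooldridge)
data("wage1")

# Hypothetical stand-in for wage1_tibble1; nonwhite is assumed to be a factor
wage_df <- transform(wage1, nonwhite = factor(nonwhite))

race_fit <- lm(wage ~ nonwhite, data = wage_df)

# 95% confidence interval for the nonwhite1 coefficient
race_ci <- confint(race_fit, "nonwhite1", level = 0.95)
race_ci

# TRUE when zero lies inside the interval, i.e. "no difference" is not ruled out
includes_zero <- race_ci[1, 1] < 0 && race_ci[1, 2] > 0
includes_zero
```

The width of the interval is also informative: a wide interval means the data cannot pin the difference down precisely, which is not the same as showing that the difference is zero.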

Should the Previous Section be Believed?

The question here is whether the conclusion from the previous section should be believed, based on the goodness-of-fit statistics. My theory is that the conclusion should not be believed at all, because this is a simple regression model. Such models do not tell the full story of the data because they omit essential variables that contribute to the outcome. We can test this theory by looking at the R^2 values and the RSE in the summary statistics of the simple regression model.

No, I should definitely not believe the conclusion from the previous section, because of the goodness-of-fit statistics. The first reason is that the RSE is 3.694, which is pretty high, and that is something we do not want to see in our model. The main reason is that the multiple R^2 value is 0.001484 and the adjusted R^2 value is -0.0004218. These values are drastically far from 1; the adjusted one is even negative, which means that race alone explains essentially none of the variation in wage. All of this means that this regression model, on its own, is not a good basis for any conclusion.
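
A negative adjusted R^2 can look like an error, but it follows from the usual adjustment formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors: when R^2 is very close to zero, the penalty for the extra predictor pushes the value below zero. The sketch below reproduces it, with wage_df as a hypothetical stand-in for wage1_tibble1 and nonwhite assumed to be a factor.

```r
library(wooldridge)
data("wage1")

# Hypothetical stand-in for wage1_tibble1; nonwhite is assumed to be a factor
wage_df <- transform(wage1, nonwhite = factor(nonwhite))

race_fit <- lm(wage ~ nonwhite, data = wage_df)

n <- nobs(race_fit)              # number of observations
p <- length(coef(race_fit)) - 1  # number of predictors, excluding the intercept
r_squared <- summary(race_fit)$r.squared

# Adjusted R^2 penalizes R^2 for the number of predictors used
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

c(manual = adj_r_squared, reported = summary(race_fit)$adj.r.squared)
```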

Visualization of the Residual Distribution

The task here is to make a visualization that displays the distribution of the model’s residuals and to discuss what it means. I believe that the residual distribution will not be very normally distributed. My reasoning is that this is a simple regression model whose conclusion we cannot fully believe, so I expect its residual distribution to look poor as well. We can check this with a histogram:

wage1_tibble1_w_resid2 <- wage1_tibble1 %>%
  add_residuals(wage1_tibble1_model2)

ggplot(wage1_tibble1_w_resid2) +
  geom_histogram(aes(resid), bins = 30) +
  labs(x = "residual",
       y = "number of people",
       title = "Simple Regression Model's Residual Distribution",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")

I was correct. This model’s residual distribution is right-skewed: it does not have a peak at 0, and it has a lot of points trailing to the right. These are things that we do not want to see in a residual distribution. It means that the model under-predicts the wages of a relatively small number of high earners by a large amount, and that there could be a pattern in the data set that is not being detected.

Conclusion to my Simple Regression Model

The last question for this section is: what can we conclude from the simple regression model I made? To answer this, we can look at what the model says at face value about the race-based pay inequity question, by looking back at the regression coefficient and p-value of nonwhite, as well as the RSE and the two R^2 values.

From the simple regression model, I can conclude that there is no statistically significant correlation between how much a person gets paid and their race. The simple regression model tells me that the nonwhite variable does not have a significant effect on the wage variable, so based on this model alone we cannot say that whites get paid more than nonwhites.

Adding the Control Variables

What I Think are the Confounding Variables

The first problem for this section is: what are the three most likely confounding variables, and why? The documentation for the wage1 data set is helpful here, because in it I can see which variables I believe are the most likely ones.

I believe that the three most likely confounding variables are the same ones I used for the female variable in the last part: education, experience, and tenure. I believe these are confounding variables because they are, in my opinion, the most important factors that determine whether someone will be paid more than someone else. These are all things that employers look for in an employee, so they will pay more for them.

Rebuilding the Regression Model

This question is a continuation of the previous one: I reuse the three confounding variables from the last part, about the female variable, for this section, about the nonwhite variable. To do this, I add them to the simple regression model as controls to create a multiple regression model:

wage1_tibble1_model2_cf <- lm(wage ~ nonwhite + educ + exper + tenure, data = wage1_tibble1)

summary(wage1_tibble1_model2_cf)

## 
## Call:
## lm(formula = wage ~ nonwhite + educ + exper + tenure, data = wage1_tibble1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6154 -1.7815 -0.6287  1.1882 14.6463 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.85711    0.73687  -3.877 0.000119 ***
## nonwhite1   -0.06758    0.44520  -0.152 0.879403    
## educ         0.59830    0.05152  11.613  < 2e-16 ***
## exper        0.02231    0.01207   1.848 0.065142 .  
## tenure       0.16932    0.02167   7.814 3.07e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.087 on 521 degrees of freedom
## Multiple R-squared:  0.3065, Adjusted R-squared:  0.3011 
## F-statistic: 57.55 on 4 and 521 DF,  p-value: < 2.2e-16

Assessing the Multiple Regression Model

Comparing the Residual Distribution Models

To start this section, we compare the multiple regression model’s residual distribution to the simple one’s. We want to see whether there are any improvements from the simple to the multiple residual distribution. My hypothesis is that we will see improvements, because the multiple model has additional confounding variables that could alter how race affects pay rate, while the simple one does not. To see if this is true, we need to create the multiple residual distribution and compare it to the simple one:

wage1_tibble1_cf2_w_resid <- wage1_tibble1 %>%
  add_residuals(wage1_tibble1_model2_cf)

ggplot(wage1_tibble1_cf2_w_resid) +
  geom_histogram(aes(resid), bins = 30) +
  labs(x = "residual",
       y = "number of people",
       title = "Multiple Regression Model's Residual Distribution",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")

Yes, I do see improvements in the multiple regression model’s residual distribution compared to the simple one. The first improvement is that the peak looks closer to 0, which is good. This distribution also looks more normally distributed; it is still skewed a bit to the right, although not as much as the simple residual distribution. On the right side, this distribution stops before the 15 mark on the x-axis, while the simple one stops before the 20 mark.

How Good is the Multiple Regression Model?

Next, we look at how the multiple regression model’s goodness-of-fit statistics compare to the simple model’s, and at what these statistics mean. I think that the multiple model will have much better goodness-of-fit statistics, because it has additional variables that can affect how race relates to pay rate, which the simple model does not have. This should make the multiple model more accurate. To test this, we need to compare the two models’ summary statistics.

I was correct. This multiple regression model has fairly good goodness-of-fit statistics. First, the RSE is 3.087, which is decently low. If the residuals were roughly normally distributed, about 68% of them would fall between -3.087 and 3.087. More importantly, the multiple R^2 is 0.3065 and the adjusted R^2 is 0.3011, which is reasonable for our model, although still well short of 1. These tell me that the model explains about 30% of the variation in wage; the rest comes from random noise and from variables not included in the model. Overall, this model’s goodness-of-fit statistics are far better than the simple regression model’s: the multiple regression model has a lower RSE and much higher multiple and adjusted R^2.

Is the Model Reliable?

To wrap up this section, we need to answer whether the multiple regression model I created is a reliable one. To answer this, it is best to look at the residual distribution from the model, together with the goodness-of-fit statistics and the confounding variables I chose. These components help show whether my model is truly reliable.

Yes, I do believe that I have a reliable model here. The model's residual distribution is about as close to normal as I could get it. I tried different combinations of confounding variables, and none of the resulting residual distributions looked as good or as evenly distributed as the one I chose. The multiple and adjusted R^2 values are as close to 1 as I could get them, and the RSE is low enough to be usable. There could be other models with better R^2 values, RSEs, and residual distributions, but this one does well enough on all three. Finally, I believe educ, exper, and tenure are the best confounding variables for explaining why whites would be paid more than nonwhites.

Interpreting the Multiple Regression Model Results

What the Regression Coefficient tells me

For this section, the first question is: what is the regression coefficient of nonwhite in the multiple model, and how does it compare to the simple one? We also need to say what the coefficient means in the context in which the data was collected. I speculate that the coefficient in the multiple model will be smaller in magnitude than in the simple one. My reasoning is that I found the simple model unreliable, so its larger coefficient is probably overstated; a more accurate estimate should be smaller. We can look at the summary of the multiple regression model and compare it to the simple model to see if I was right:

summary(wage1_tibble1_model2_cf)

## 
## Call:
## lm(formula = wage ~ nonwhite + educ + exper + tenure, data = wage1_tibble1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6154 -1.7815 -0.6287  1.1882 14.6463 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.85711    0.73687  -3.877 0.000119 ***
## nonwhite1   -0.06758    0.44520  -0.152 0.879403    
## educ         0.59830    0.05152  11.613  < 2e-16 ***
## exper        0.02231    0.01207   1.848 0.065142 .  
## tenure       0.16932    0.02167   7.814 3.07e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.087 on 521 degrees of freedom
## Multiple R-squared:  0.3065, Adjusted R-squared:  0.3011 
## F-statistic: 57.55 on 4 and 521 DF,  p-value: < 2.2e-16

The regression coefficient of nonwhite is -0.06758 in the multiple regression model. This means that when the predictor changes from 0 to 1, with educ, exper, and tenure held constant, the predicted average hourly earnings for nonwhites are about $0.07 lower than for whites. Compared to the simple regression model's coefficient, it is smaller in magnitude: the multiple one is -0.06758 and the simple one is -0.4682, a difference of 0.40062 (-0.06758 - (-0.4682)).
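The comparison between the two coefficients can be reproduced with a few lines of arithmetic. This is only a sketch: both coefficients are copied from the text above rather than extracted from the fitted models, and the variable names are illustrative.

```r
# Coefficients of nonwhite, copied from the two model summaries.
coef_simple   <- -0.4682   # simple regression: wage ~ nonwhite
coef_multiple <- -0.06758  # multiple regression with educ, exper, tenure

# How much the estimated gap moved toward zero after adding the controls.
change <- coef_multiple - coef_simple

# The same change expressed as a share of the simple model's gap.
share_removed <- change / abs(coef_simple)

print(c(change = change, share_removed = round(share_removed, 3)))
```

Expressing the change as a share of the simple coefficient shows how much of the raw gap is absorbed once education, experience, and tenure enter the model.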

What the P-Value tells Me

The second question is about the p-value of the nonwhite coefficient in the multiple regression model from the previous question. More specifically, what does it tell me about the model?

The null hypothesis is that race has no effect on wages, that is, that the nonwhite coefficient is zero. The p-value for nonwhite is 0.879403, which is well above the 0.05 threshold. This means that we do not have enough evidence to reject the null hypothesis. In other words, once educ, exper, and tenure are controlled for, race does not have a statistically significant effect on wage in this model.
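To make the link between the coefficient table and this conclusion explicit, the sketch below recomputes the t value and two-sided p-value for nonwhite from its estimate and standard error, and adds a 95% confidence interval. The inputs are the rounded values printed in the summary, so the results agree with the table only approximately; the confidence interval is my addition and does not appear in the original output.

```r
# Values copied from the nonwhite1 row of the coefficient table.
estimate  <- -0.06758
std_error <- 0.44520
resid_df  <- 521

# t value: how many standard errors the estimate lies from zero.
t_value <- estimate / std_error

# Two-sided p-value under the null hypothesis that the coefficient is zero.
p_value <- 2 * pt(-abs(t_value), df = resid_df)

# 95% confidence interval for the coefficient.
t_crit <- qt(0.975, df = resid_df)
ci_95  <- estimate + c(-1, 1) * t_crit * std_error

print(round(c(t_value = t_value, p_value = p_value,
              ci_lower = ci_95[1], ci_upper = ci_95[2]), 4))
```

The interval runs from roughly -0.94 to 0.81 and therefore includes zero, which is another way of seeing why the null hypothesis cannot be rejected at the 0.05 level.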

My Answer to the Main Question

This last question goes back to the main question from the start of this part: was there race-based pay inequity when this data was collected? To answer it, I rely solely on the multiple regression model, because it is the most reliable model I have.

No, I believe that there was no race-based pay inequity when this data was collected. The multiple regression model points to this conclusion. It has a better R^2 and RSE than the simple model, which makes its results more trustworthy, and it indicates that race does not have a statistically significant impact on wage. Overall, the multiple regression model supports the conclusion that there was no race-based pay inequity at this time.

Should this Model be Used Today?

Would it be appropriate to use my multiple regression model and its conclusion to make any claim about race-based pay inequity today? This is the last question for this part that needs to be answered. To be able to answer this question in its entirety, I need to talk about what could be useful to know about this issue today and the context from which this data was collected.

I think it could be appropriate to use parts of this model and its conclusion, but overall I do not think it would be appropriate to use them to claim anything about race-based pay inequity today. We can potentially use the confounding variables I chose as explanations for pay differences between whites and nonwhites, because I believe these factors have stayed somewhat consistent. We can also take the model's conclusion and see how it has changed today. Finally, we could use the regression coefficient of nonwhite to show how it has changed between the time this data was collected and today. The main reason I believe we cannot use this model for today is that the data is from 1976. Since then, a lot could have changed in how much whites and nonwhites are paid, so this data is likely outdated and would not really explain the current situation with race-based pay inequity. In general, we should only use it as a point of comparison to see how things have changed. We should not use this model or its conclusion to make any claim about race-based pay inequity today.

Conclusion

In summary, we can conclude that various socio-economic variables do have an effect on pay rate. From what I have found, education, tenure, and experience are the variables that most affect wage and that most confound its relationship with gender and race. When these three confounding variables are taken into account, we can see that, at the time this data was recorded, there was gender-based pay inequity but no race-based pay inequity. The multiple regression models I created support these conclusions.

An Analysis of Various Socio-Economic Variables Affecting Wage

Haris Sendijarevic

12/5/2023
