This report seeks to answer the following three questions:
First, how do various socio-economic variables affect pay rate? Second, was there gender-based pay inequity when this data was collected? Finally, was there race-based pay inequity as well?
We will be using a data set called wage1, which is a
real-life data set from the wooldridge library. It
contains a sample of 526 workers from the 1976 Current Population Survey, with
their wages and other variables describing the people surveyed. There are 24 total
variables in the data set, but the most important ones for this analysis
are: wage (average hourly earnings), educ
(years of education), exper (years of potential experience),
tenure (years with current employer), nonwhite
(=1 if nonwhite), and female (=1 if female). The full data
set can be viewed below:
Throughout, we will need the functionality of the tidyverse package, mainly to create box plot and histogram visualizations. The modelr package provides helper functions for the regression modeling, such as adding residuals to a data set. The DT package creates nice-looking data tables, like the one above. Finally, the wooldridge library, as explained in the introduction, supplies the data.
library(tidyverse)
library(modelr)
library(DT)
library(wooldridge)
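The code that follows refers to a tibble named wage1_tibble1 and reports coefficients named female1 and nonwhite1, which suggests that the two indicator columns were converted to factors. The construction is not shown in this report, so the following is only a minimal sketch of one way such a tibble could be built; the exact steps are an assumption.

```r
library(tidyverse)
library(wooldridge)

# Load the wage1 data frame from the wooldridge package
data("wage1", package = "wooldridge")

# Assumed construction (not shown in the report): convert to a tibble and
# turn the 0/1 indicators into factors, so that lm() labels their
# coefficients female1 and nonwhite1.
wage1_tibble1 <- as_tibble(wage1) %>%
  mutate(female = factor(female),
         nonwhite = factor(nonwhite))

# Quick check of the size and of the factor levels
dim(wage1_tibble1)
levels(wage1_tibble1$female)
```

Keeping the indicators as factors also makes ggplot2 treat them as discrete groups in the box plots.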
The problem we are investigating deals with the relationship between a person’s gender and their average hourly earnings. We could suggest that a person’s gender can affect how much money they make. What we are trying to do here is see whether a quick visualization can show the difference in earnings between the two genders, and how well it answers the question. I believe that this visualization will show the difference in wage between the two genders, but it will not fully answer the question. My logic is that this is an intricate question that takes more than a single visualization to settle. We can test this with a box plot:
# Box plot of wage by gender (wage is the response, so it goes on the y-axis)
ggplot(data = wage1_tibble1) +
  geom_boxplot(mapping = aes(x = reorder(female, wage), y = wage, color = female)) +
  labs(x = "gender (1 = female and 0 = male)",
       y = "average hourly earnings",
       color = "gender",
       title = "Wage as a Function of Gender",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")
I was right. The visualization does not give me enough information to satisfactorily answer whether there is gender-based pay inequity. The main reason is that a box plot is too simple to convey all the intricate factors that determine whether females and males are paid differently. All this box plot shows is that the median wage for males is higher than the median wage for females; it does not take into account, for example, the type of job, assertiveness, education level, etc. To fully explain whether there is gender-based pay inequity, we need to look at all the factors that contribute to it.
wage1_tibble1_model <- lm(wage ~ female, data = wage1_tibble1)
The question that we are trying to answer here is: what do the
regression coefficient and the p-value of female tell me
about the pay differences between males and females? My hypothesis is
that the regression coefficient is going to show that males make more money
than females. Most likely not a lot, but possibly a couple of dollars or so
more. I also think that the p-value is going to be less than the 0.05
threshold, which would indicate that gender has a statistically significant
relationship with one’s wage. I believe this because pay inequity is a
hot topic that has been discussed for a long time, and I believe that this
data set could show that it was an issue in the past. We can test
this hypothesis by looking at the summary statistics for the simple
regression model:
summary(wage1_tibble1_model)
##
## Call:
## lm(formula = wage ~ female, data = wage1_tibble1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5995 -1.8495 -0.9877 1.4260 17.8805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0995 0.2100 33.806 < 2e-16 ***
## female1 -2.5118 0.3034 -8.279 1.04e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.476 on 524 degrees of freedom
## Multiple R-squared: 0.1157, Adjusted R-squared: 0.114
## F-statistic: 68.54 on 1 and 524 DF, p-value: 1.042e-15
My hypothesis was correct. The regression coefficient of
female tells me that when the predictor variable changes
from 0 (male) to 1 (female), the predicted average hourly earnings, the response
variable, fall by about $2.51. In other words, females earn about $2.51 per hour
less than males on average. The null
hypothesis is that gender has no effect on
wages. The p-value for female is
1.04e-15, which is far below the 0.05 threshold. This means
that we can reject the null hypothesis. In this model, gender has a
statistically significant relationship with wage, and the estimated difference is
-2.5118 dollars per hour.
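As a check on how the t value and p-value in the table relate to the estimate, the following sketch recomputes them by hand. It uses only the numbers printed in the summary output above; the variable names are illustrative.

```r
# Values copied from the summary() output of the simple model
est <- -2.5118   # coefficient of female1
se  <- 0.3034    # its standard error
df  <- 524       # residual degrees of freedom

# t value: how many standard errors the estimate is away from zero
t_val <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_val), df)

t_val   # close to the -8.279 printed by summary()
p_val   # on the order of 1e-15, far below 0.05
```

Because the inputs are rounded, the recomputed values agree with the table only up to rounding.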
The problem we are trying to answer here is whether, based on the goodness-of-fit statistics, the conclusion from the previous problem should be believed. My theory is that we should not fully believe the conclusion, because this is a very simple regression model. Models like this do not tell the full picture because they omit essential variables. We can test this theory by looking at the RSE and the R^2 values in the summary statistics of the simple regression model.
No, I should not fully believe the conclusion from part (a). The main reason is the two R^2 values. The multiple R^2 is 0.1157 and the adjusted R^2 is 0.114, which means that gender alone explains only about 11.6% of the variation in wage. These values are far from 1, so this model leaves most of the variation unexplained and is not a good basis for a final answer. On top of that, the RSE is on the high side at 3.476: a typical prediction misses the actual wage by about $3.48 per hour, which is large when the average male wage in the model is about $7.10. A lower RSE would be better.
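The adjusted R^2 can be recovered from the multiple R^2, the sample size, and the number of predictors. The sketch below applies the standard formula to the printed values; the helper name adj_r2 is made up for illustration, and n = 526 is inferred from the 524 residual degrees of freedom plus the 2 estimated coefficients.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors k
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

n <- 526  # 524 residual degrees of freedom + 2 estimated coefficients

# Simple model: one predictor (female)
adj_simple <- adj_r2(0.1157, n, k = 1)
round(adj_simple, 3)  # agrees with the 0.114 printed by summary()
```

With only one predictor the penalty is tiny, which is why the two R^2 values are almost identical here.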
The challenge here is to create a visualization that shows the distribution of the model’s residuals and to discuss what it means. My assumption is that the residuals will not look very normally distributed. My rationale is that this is a simple regression model, and we said earlier that we cannot fully believe its conclusion, so I expect the residual distribution to show problems as well. We can find this out by creating a histogram:
# Add a resid column holding the simple model's residuals
wage1_tibble1_w_resid <- wage1_tibble1 %>%
  add_residuals(wage1_tibble1_model)
# Histogram of the residuals (30 bins is the geom_histogram default)
ggplot(wage1_tibble1_w_resid) +
  geom_histogram(aes(resid), bins = 30) +
  labs(x = "residual",
       y = "number of people",
       title = "Simple Regression Model's Residual Distribution",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")
My assumption was correct. The residual distribution is roughly bell-shaped, with a peak near 0, but it is right-skewed: it has a long tail trailing to the right, which is something that we do not want. The long right tail means that for some people the model under predicts wage by a large amount, so the highest earners are not captured well. This suggests that there is a pattern in the data set that the model is not detecting.
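Skewness can also be checked numerically instead of only by eye. The sketch below uses simulated right-skewed wages, not the wage1 data, and computes a simple moment-based sample skewness of the residuals; all names and simulation settings are assumptions chosen for illustration.

```r
set.seed(1)

# Simulated right-skewed "wages" for two groups (not the wage1 data)
group <- rep(c(0, 1), each = 250)
wage_sim <- exp(rnorm(500, mean = 1.8 - 0.3 * group, sd = 0.5))

fit <- lm(wage_sim ~ group)
res <- resid(fit)

# Sample skewness: near 0 for a symmetric distribution,
# positive when there is a long right tail
skewness <- mean((res - mean(res))^3) / sd(res)^3
skewness

# The residual mean is 0 by construction, so a negative median
# is another sign of a right tail
median(res)
```

The same two lines could be applied to the resid column created by add_residuals() above.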
The last question for this section is: what can we conclude from the
simple regression model I made? The way to answer this is to
look at what the model says at face value about the gender-based pay
inequity question. This means looking at the regression coefficient and
p-value of female, and the RSE and the two R^2 values.
What I can conclude from the simple regression model is that there is in fact a relationship between how much a person gets paid and their gender. The simple regression model tells me that the female variable has a statistically significant relationship with the wage variable: on average, males are paid about $2.51 per hour more than females. The weak goodness-of-fit statistics, however, mean that this estimate does not yet account for other factors.
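With a single 0/1 predictor, the regression coefficient is exactly the difference between the two group means, and the intercept is the mean of the group coded 0. In the printed output this means 7.0995 is the average wage for males and 7.0995 - 2.5118 = 4.5877 is the average for females. The sketch below demonstrates the identity on simulated data, not wage1; the group sizes and means are arbitrary assumptions.

```r
set.seed(42)

# Simulated data (not wage1): a 0/1 group factor and an outcome
female <- factor(rep(c(0, 1), times = c(263, 263)))
wage_sim <- rnorm(526, mean = ifelse(female == "1", 4.6, 7.1), sd = 3)

fit <- lm(wage_sim ~ female)

# Difference in group means, computed directly
mean_diff <- mean(wage_sim[female == "1"]) - mean(wage_sim[female == "0"])

# The slope on female1 is the same number
coef(fit)["female1"]
mean_diff
```

This is why the simple model cannot say more than the box plot does about group averages: it only compares them.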
The first problem to answer for this section is: what are the three
most likely confounding variables, and why? To answer this, I need to
look at the documentation for the wage1 data set and pick
the ones I believe are most likely.
The first most likely confounding variable is educ. I think education is a confounding variable because someone with more years of education would most likely be paid more for the knowledge they have gained. The second most likely confounding variable is exper. I believe experience is a confounding variable because someone with more experience would be paid more as a result of their accumulated knowledge of the field and the job. The third most likely confounding variable is tenure. I suspect tenure is a confounding variable because someone who stays at a company longer knows the company better and will tend to receive more promotions, bonuses, and raises over the years. For these variables to confound the gender comparison, they must also differ between males and females in the data, which is why they are added as controls.
This question is a continuation of the previous one. I need to add the three confounding variables I believe are important, as controls, to the simple regression model. This turns it into a multiple regression model:
wage1_tibble1_model_cf <- lm(wage ~ female + educ + exper + tenure, data = wage1_tibble1)
summary(wage1_tibble1_model_cf)
##
## Call:
## lm(formula = wage ~ female + educ + exper + tenure, data = wage1_tibble1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7675 -1.8080 -0.4229 1.0467 14.0075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.56794 0.72455 -2.164 0.0309 *
## female1 -1.81085 0.26483 -6.838 2.26e-11 ***
## educ 0.57150 0.04934 11.584 < 2e-16 ***
## exper 0.02540 0.01157 2.195 0.0286 *
## tenure 0.14101 0.02116 6.663 6.83e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.958 on 521 degrees of freedom
## Multiple R-squared: 0.3635, Adjusted R-squared: 0.3587
## F-statistic: 74.4 on 4 and 521 DF, p-value: < 2.2e-16
For this section we first compare the multiple regression model’s residual distribution to the simple one. We are specifically looking at whether the multiple model’s residual distribution improved. My hypothesis is that there will be improvements going from the simple residual distribution to the multiple one, because the multiple model takes into account variables that could alter how gender affects pay rate, while the simple one does not. To see if this is true we need to create the multiple residual distribution and compare it to the simple one:
# Add a resid column holding the multiple model's residuals
wage1_tibble1_cf_w_resid <- wage1_tibble1 %>%
  add_residuals(wage1_tibble1_model_cf)
# Histogram of the residuals (30 bins is the geom_histogram default)
ggplot(wage1_tibble1_cf_w_resid) +
  geom_histogram(aes(resid), bins = 30) +
  labs(x = "residual",
       y = "number of people",
       title = "Multiple Regression Model's Residual Distribution",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")
Yes, I do see improvements between the multiple regression model’s residual distribution and the simple one. The peak of the new one is closer to 0, and the graph is not as right-skewed as it was for the simple regression model. This distribution stops before the 15 mark on the right side of the x-axis, while the simple one extends past 15. It looks more evenly distributed on both sides but is still somewhat right-skewed.
For the next question we look at how the multiple regression model’s goodness-of-fit statistics compare to the simple model’s, and at what these statistics mean. My theory is that this model will have much better goodness-of-fit statistics than the simple model. The main reason is that this model takes into account other variables that can influence how gender affects pay rate, which the simple model did not. This should make the multiple model more accurate. To test my theory we need to compare the two models’ summary statistics.
I was correct with my theory. The goodness-of-fit statistics of the multiple regression model are noticeably better. The multiple R^2 is 0.3635 and the adjusted R^2 is 0.3587, which are still well below 1 but about three times the simple model’s values. The multiple R^2 tells us that the model explains about 36% of the variation in wage; the adjusted R^2, which penalizes the extra predictors, puts this at about 36% as well. The remaining variation comes from factors that are not in the model and from random noise. The RSE is also lower at 2.958. This tells us that a typical residual is about $2.96 per hour; if the residuals were roughly normal, about 68% of them would fall between -2.958 and 2.958. So yes, this model’s goodness-of-fit statistics are better than the simple regression model’s: it has a lower RSE and higher multiple and adjusted R^2.
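To make the meaning of the RSE concrete, the sketch below computes it by hand and counts how many residuals fall within one RSE of zero. It uses simulated data with normal errors, not wage1, so the variable names and coefficients are assumptions for illustration only.

```r
set.seed(7)

# Simulated data (not wage1): one outcome and two predictors
n  <- 526
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 3)

fit <- lm(y ~ x1 + x2)
res <- resid(fit)

# RSE = sqrt(residual sum of squares / residual degrees of freedom)
rse_manual <- sqrt(sum(res^2) / df.residual(fit))
rse_manual
sigma(fit)  # the same value, as reported by summary()

# Share of residuals within one RSE of zero
# (about 0.68 when the residuals are roughly normal)
mean(abs(res) <= rse_manual)
```

For the skewed residuals seen in the histograms above, the share within one RSE may differ from 68%, so that figure is only a rough guide.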
The final question for this section is whether the multiple regression model I created is a reliable one. The best way to answer this is to look at the residual distribution of the multiple model, along with the goodness-of-fit statistics and the confounding variables I chose. These are the factors that can help me judge whether my model is truly reliable.
Yes, I believe that I have a reasonably reliable model. Its residual distribution looks about as symmetric as I could get it. I tried other combinations of confounding variables, and their residual distributions did not look as good or as evenly distributed as the one above. The RSE is low enough to be usable, and the multiple and adjusted R^2 values are as close to 1 as I could get them with these variables. There could be models out there with better R^2 values, RSEs, and residual distributions, but this model has the best balance of all three. Finally, the three confounding variables I chose are what I believe to be the most important factors that contribute to why someone would be paid more than someone else, in this case why males would be paid more than females.
The first problem for this section is: what is the regression
coefficient of female in the multiple model, and how does it
compare to the simple one? We also need to answer what the
regression coefficient means in the context in which the data was collected. My guess
is that the regression coefficient in the multiple model will be bigger, that is, closer to zero,
but it will still be negative. My reasoning is
that the simple model is not very reliable, so its lower
coefficient is probably overstated, and a more accurate one should,
theoretically, be closer to zero. To test this we can look at the summary
statistics for the multiple regression model and compare them to the simple
one:
summary(wage1_tibble1_model_cf)
##
## Call:
## lm(formula = wage ~ female + educ + exper + tenure, data = wage1_tibble1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7675 -1.8080 -0.4229 1.0467 14.0075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.56794 0.72455 -2.164 0.0309 *
## female1 -1.81085 0.26483 -6.838 2.26e-11 ***
## educ 0.57150 0.04934 11.584 < 2e-16 ***
## exper 0.02540 0.01157 2.195 0.0286 *
## tenure 0.14101 0.02116 6.663 6.83e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.958 on 521 degrees of freedom
## Multiple R-squared: 0.3635, Adjusted R-squared: 0.3587
## F-statistic: 74.4 on 4 and 521 DF, p-value: < 2.2e-16
I would say that I was correct. The regression coefficient of
female in the multiple regression model is -1.81085. This
tells me that when the predictor variable changes from 0 to 1, with education,
experience, and tenure held constant, the predicted average hourly earnings fall by
about $1.81, so females earn about $1.81 per hour less than comparable
males. Compared to the simple regression
model’s coefficient, it is bigger in value but smaller in magnitude. The multiple one is -1.81085
and the simple one is -2.5118. This is a difference of 0.70095 (-1.81085
- (-2.5118)).
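The comparison of the two coefficients can be written out as a short calculation. The sketch below uses only the two printed estimates; expressing the change as a share of the raw gap is an added illustration, not something computed elsewhere in this report.

```r
b_simple   <- -2.5118    # female1 in the simple model
b_multiple <- -1.81085   # female1 in the multiple model

# Change in the coefficient after adding the three controls
change <- b_multiple - b_simple
change                      # 0.70095

# Share of the raw gap that is accounted for by the controls
share <- change / abs(b_simple)
round(100 * share, 1)       # about 27.9 percent
```

So roughly a quarter of the raw gap is associated with differences in education, experience, and tenure, and the rest remains after controlling for them.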
The second question is: what is the p-value of the female
coefficient from the previous problem, and what does it tell me about
the model?
The null hypothesis is that gender has no effect on
wages once education, experience, and tenure are controlled for. The p-value for female is
2.26e-11, which is far below the 0.05 threshold. This means that we can reject the
null hypothesis: gender has a statistically significant relationship with wage
even after the controls are added, and the estimated difference is
-1.81085 dollars per hour.
The final question for this section goes back to the main question I have been trying to answer from the beginning of part 1: was there gender-based pay inequity when this data was collected? The best way to answer this is to focus mainly on the multiple regression model, because it is the more reliable of the two models for this intricate question.
Yes, I believe that when this data was collected there was gender-based pay inequity, because the multiple regression model points to that conclusion. The model says that gender has a statistically significant relationship with wage, with females earning about $1.81 per hour less than males who have the same education, experience, and tenure. The conclusion is strengthened because this model has higher R^2 values and a lower RSE than the simple model. Overall, the multiple regression model indicates that there was gender-based pay inequity at this time.
The final question of this part is: would it be appropriate to use my multiple regression model and its conclusion to make any claim about gender-based pay inequity today? To answer this I need to think about the context in which this data was collected and what would be useful to know about this issue today.
Overall, I want to say no. This data is from
1976, and since then a lot could have changed in how
much males and females are paid, so this data is likely outdated
and would not explain the current situation with
gender-based pay inequity. We could use parts of it, like the regression
coefficient of female, as a baseline to show how the gap has changed since
this data was collected. We could also use the confounding
variables as possible reasons why males and females are paid differently,
because I believe that these reasons might have stayed somewhat
consistent over the years. Finally, we could compare the model’s conclusion
with results from current data, but we should take it with a grain of
salt. In summary, we should not use this model or its conclusion to
make any claim about gender-based pay inequity today. We can only
use it as a point of comparison to see how things have changed.
The problem we are investigating deals with the relationship between a person’s race and their average hourly earnings. We could suggest that a person’s race can affect how much money they make. What we are trying to do here is see whether a quick visualization can show the difference in earnings between whites and nonwhites, and how well it answers the question. I think that a visualization will show the difference in wage, but it will not fully answer the problem, mainly because this is a multifaceted question that a simple visualization cannot answer. We can test this with a box plot:
# Box plot of wage by race (wage is the response, so it goes on the y-axis)
ggplot(data = wage1_tibble1) +
  geom_boxplot(mapping = aes(x = reorder(nonwhite, wage), y = wage, color = nonwhite)) +
  labs(x = "race (1 = nonwhite and 0 = white)",
       y = "average hourly earnings",
       color = "race",
       title = "Wage as a Function of Race",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")
My reasoning was correct. This visualization does not satisfactorily answer whether there is race-based pay inequity because it does not tell me enough. A box plot is too simple a visualization for a question like this. It cannot convey all of the variables and factors that determine whether nonwhites and whites are paid differently. What this box plot shows is that there is only a minuscule difference in median wage between nonwhites and whites. It does not take into account the variables that might contribute to this. To answer a question like this, we need to look at all the components that go into it.
wage1_tibble1_model2 <- lm(wage ~ nonwhite, data = wage1_tibble1)
The question that we are trying to answer here is: what do the
regression coefficient and the p-value of nonwhite tell me
about the pay differences between whites and nonwhites? My hypothesis is
that the regression coefficient is going to show that whites and nonwhites
make roughly the same amount of money. Most likely the difference will
be minuscule. I also believe that the p-value is going to be more than
the 0.05 threshold, which would indicate that race does not have a
statistically significant relationship with one’s wage. We can test this hypothesis by looking
at the summary statistics for the simple regression model:
summary(wage1_tibble1_model2)
##
## Call:
## lm(formula = wage ~ nonwhite, data = wage1_tibble1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.414 -2.526 -1.259 1.026 19.036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9442 0.1700 34.961 <2e-16 ***
## nonwhite1 -0.4682 0.5306 -0.882 0.378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.694 on 524 degrees of freedom
## Multiple R-squared: 0.001484, Adjusted R-squared: -0.0004218
## F-statistic: 0.7786 on 1 and 524 DF, p-value: 0.378
My hypothesis was correct. The regression coefficient of
nonwhite tells me that when the predictor variable
changes from 0 (white) to 1 (nonwhite), the predicted average hourly earnings
fall by about $0.47, so nonwhites earn about $0.47 per hour less than whites on
average. The null hypothesis is
that race has no effect on wages. The p-value
for nonwhite is 0.378, which is more than the 0.05
threshold. This means that we do not have enough evidence to
reject the null hypothesis. In this model, race does not have
a statistically significant relationship with wage.
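A confidence interval shows what failing to reject the null hypothesis does and does not say. The sketch below builds a 95% interval from the printed estimate and standard error; the variable names are illustrative.

```r
# Values copied from the summary() output of the simple race model
est <- -0.4682   # coefficient of nonwhite1
se  <- 0.5306    # its standard error
df  <- 524       # residual degrees of freedom

# 95% confidence interval: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
ci

# The interval contains 0, which is why the p-value is above 0.05
ci[1] < 0 && ci[2] > 0
```

The interval runs from roughly -1.51 to 0.57 dollars per hour, so the data are consistent with no gap but also with a gap of more than a dollar; the test shows a lack of evidence, not proof of equal pay.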
The problem we are trying to answer here is whether the conclusion from the previous problem should be believed, based on the goodness-of-fit statistics. My theory is that the conclusion should not be fully believed, because this is a simple regression model. Such models do not tell the full story of the data because they omit essential variables that contribute to the conclusion. We can test this theory by looking at the R^2 values and the RSE in the summary statistics of the simple regression model.
No, I should not fully believe the conclusion from part (a), because of the goodness-of-fit statistics. The first reason is that the RSE is 3.694, which is high relative to an average wage of about $5.94 for whites in the model, and that is something we do not want to see. The main reason is that the multiple R^2 is 0.001484 and the adjusted R^2 is -0.0004218. These values are drastically far from 1. The adjusted R^2 is even slightly negative, which happens when a predictor explains less variation than the penalty for including it. All of this means that race alone explains essentially none of the variation in wage, so this model is not a good basis for a final answer.
The challenge here is to make a visualization that displays the distribution of the model’s residuals and to discuss what it means. I believe that the residual distribution is not going to be very normally distributed. My thought process is that this is a simple regression model, and we cannot fully believe its conclusion, so I expect the residual distribution to show problems as well. We can figure this out with a histogram:
# Add a resid column holding the simple race model's residuals
wage1_tibble1_w_resid2 <- wage1_tibble1 %>%
  add_residuals(wage1_tibble1_model2)
# Histogram of the residuals (30 bins is the geom_histogram default)
ggplot(wage1_tibble1_w_resid2) +
  geom_histogram(aes(resid), bins = 30) +
  labs(x = "residual",
       y = "number of people",
       title = "Simple Regression Model's Residual Distribution",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")
I was correct. This model’s residual distribution is right-skewed. It does not have a peak at 0 and it has many points trailing to the right. These are things that we do not want to see in a residual distribution. The long right tail means that for some people the model under predicts wage by a large amount. This suggests that there is a pattern in the data set that the model is not detecting.
The last question for this section is: what can we conclude from the
simple regression model I made? To answer this, we can look at what the
model says at face value about the race-based pay inequity question. We
can accomplish this by looking back at the regression coefficient and
p-value of nonwhite, and the RSE and the two R^2
values.
From the simple regression model, I can conclude that there is no statistically significant evidence of a relationship between how much a person gets paid and their race. The simple regression model tells me that the nonwhite variable does not have a significant relationship with the wage variable. The estimated gap of about $0.47 per hour could be due to chance, so this model does not show that whites are paid more than nonwhites.
The first problem for this section is: what are the three most likely confounding
variables, and why? The documentation for the wage1 data set is
helpful here. In it, I can look for
the ones I believe are most likely.
I believe that the three most likely confounding variables are the same ones I used for the female variable in the last part. I believe these three, education, experience, and tenure, are confounding variables because they are, in my opinion, the most important factors that dictate whether someone will be paid more than someone else. These are all things that employers look for and want in an employee, so they will pay more for them.
This question is a continuation of the previous one. I
need to take the three confounding variables I found in the last
part, about the female variable, and use the same ones for this
section, about the nonwhite variable. To do this I will add
the three confounding variables to the simple regression model as
controls to create a multiple regression model:
wage1_tibble1_model2_cf <- lm(wage ~ nonwhite + educ + exper + tenure, data = wage1_tibble1)
summary(wage1_tibble1_model2_cf)
##
## Call:
## lm(formula = wage ~ nonwhite + educ + exper + tenure, data = wage1_tibble1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6154 -1.7815 -0.6287 1.1882 14.6463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.85711 0.73687 -3.877 0.000119 ***
## nonwhite1 -0.06758 0.44520 -0.152 0.879403
## educ 0.59830 0.05152 11.613 < 2e-16 ***
## exper 0.02231 0.01207 1.848 0.065142 .
## tenure 0.16932 0.02167 7.814 3.07e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.087 on 521 degrees of freedom
## Multiple R-squared: 0.3065, Adjusted R-squared: 0.3011
## F-statistic: 57.55 on 4 and 521 DF, p-value: < 2.2e-16
To start this section we compare the multiple regression model’s residual distribution to the simple one. We want to see if there were any improvements going from the simple to the multiple residual distribution. My hypothesis is that we are going to see improvements between the two. I believe this because the multiple model has additional confounding variables that could alter how race affects pay rate, while the simple one does not. To see if this is true we need to create the multiple residual distribution and compare it to the simple one:
# Add a resid column holding the multiple race model's residuals
wage1_tibble1_cf2_w_resid <- wage1_tibble1 %>%
  add_residuals(wage1_tibble1_model2_cf)
# Histogram of the residuals (30 bins is the geom_histogram default)
ggplot(wage1_tibble1_cf2_w_resid) +
  geom_histogram(aes(resid), bins = 30) +
  labs(x = "residual",
       y = "number of people",
       title = "Multiple Regression Model's Residual Distribution",
       caption = "Data obtained from https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041")
Yes, I do see improvements between the multiple regression model’s residual distribution and the simple one. The first improvement is that the peak looks closer to 0, which is good. This distribution also looks more normally distributed; it is still skewed a bit to the right, although not as much as the simple residual distribution. It also stops before the 15 mark on the right side of the x-axis, while the simple one extends almost to 20.
Next, we look at how the multiple regression model’s goodness-of-fit statistics compare to the simple model’s, and at what these statistics mean. I think that the multiple model will have much better goodness-of-fit statistics. The reason is that this model has additional variables that can impact how race affects pay rate, which the simple model does not have. This should allow the multiple model to be more accurate. To test this, we need to compare the two models’ summary statistics.
I was correct. This multiple regression model has much better goodness-of-fit statistics. First, the RSE is 3.087, which is lower than the simple model’s 3.694. This tells me that a typical residual is about $3.09 per hour; if the residuals were roughly normal, about 68% of them would fall between -3.087 and 3.087. More importantly, the multiple R^2 is 0.3065 and the adjusted R^2 is 0.3011. These are still well below 1, but far above the simple model’s values. They tell me that the model explains about 31% and 30%, respectively, of the variation in wage; the rest comes from factors that are not in the model and from random noise. Overall, this model’s goodness-of-fit statistics are much better than the simple regression model’s: it has a lower RSE and much higher multiple and adjusted R^2.
To wrap up this section, we need to answer whether the multiple regression model I created is a reliable one. To answer this, it is best to look at the residual distribution of the model, along with the goodness-of-fit statistics and the confounding variables I chose. These components can help me judge whether my model is truly reliable.
Yes, I believe that I have a reasonably reliable model here. The model’s residual distribution looks about as normally distributed as I could get it. I tried different combinations of confounding variables, and none of their residual distributions looked as good or as evenly distributed as the one I chose. The multiple and adjusted R^2 values are as close to 1 as I could get them with these variables, and the RSE is low enough to be usable. There could be other models out there with better R^2 values, RSEs, and residual distributions, but this one has a good balance of all three. Finally, educ, exper, and tenure are, I believe, the best confounding variables for explaining differences in pay between whites and nonwhites.
For this section, the first problem is: what is the regression
coefficient of nonwhite in the multiple model, and how does
it compare to the simple one? We also need to answer what the
regression coefficient means in the context in which the data was collected. I
speculate that the regression coefficient in the multiple model will be
smaller in magnitude than the simple one. My thinking is that because I found the simple
model unreliable, its larger coefficient is probably
overstated, so a more accurate one should be closer to zero. We
can look at the summary statistics for the multiple regression model and
compare them to the simple model to see if I was right:
summary(wage1_tibble1_model2_cf)
##
## Call:
## lm(formula = wage ~ nonwhite + educ + exper + tenure, data = wage1_tibble1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.6154 -1.7815 -0.6287  1.1882 14.6463
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.85711    0.73687  -3.877 0.000119 ***
## nonwhite1   -0.06758    0.44520  -0.152 0.879403
## educ         0.59830    0.05152  11.613  < 2e-16 ***
## exper        0.02231    0.01207   1.848 0.065142 .
## tenure       0.16932    0.02167   7.814 3.07e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.087 on 521 degrees of freedom
## Multiple R-squared: 0.3065, Adjusted R-squared: 0.3011
## F-statistic: 57.55 on 4 and 521 DF, p-value: < 2.2e-16
The regression coefficient of nonwhite is -0.06758 in
the multiple regression model. What this means is that when the
predictor variable changes from 0 to 1, with educ, exper, and tenure
held constant, the predicted average hourly wage for nonwhites is about
$0.07 lower than for whites. Compared to the simple regression model,
this coefficient is smaller in magnitude. The multiple one is -0.06758
and the simple one is -0.4682. This is a difference of 0.40062
(-0.06758 - (-0.4682)).
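The comparison can be reproduced with a few lines of arithmetic. This is only a sketch using the two coefficients quoted in this report; the variable names are illustrative.

```r
# Coefficients of nonwhite as quoted in this report.
coef_simple   <- -0.4682   # simple regression model
coef_multiple <- -0.06758  # multiple regression model

# Change in the estimated wage gap after adding educ, exper, and tenure.
gap_change <- coef_multiple - coef_simple

# Share of the simple-model gap that disappears once the controls are added.
share_removed <- 1 - abs(coef_multiple) / abs(coef_simple)

round(c(gap_change = gap_change, share_removed = share_removed), 4)
```

Read this way, most of the raw gap in the simple model is accounted for by differences in education, experience, and tenure rather than by the nonwhite indicator itself.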
The second question is about the p-value of the nonwhite
coefficient in the multiple regression model from the previous question.
More specifically, what does it tell me about the model?
The null hypothesis is that race has no effect on wages once educ,
exper, and tenure are held constant. The p-value for nonwhite is
0.879403, which is far above the 0.05 threshold. This means that we do
not have enough evidence to reject the null hypothesis. In other words,
the model does not show a statistically significant effect of race on
wage, although failing to reject the null hypothesis is not proof that
there is no effect at all.
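To see where this p-value comes from, the sketch below recomputes the t-statistic and p-value for nonwhite from the estimate and standard error printed in the summary, and adds an approximate 95% confidence interval. The variable names are illustrative, and the inputs are the rounded values from the output, so the results may differ slightly in the last digits.

```r
# Values copied from the nonwhite1 row of the model summary.
estimate  <- -0.06758
std_error <- 0.44520
df_res    <- 521      # residual degrees of freedom

# t-statistic: how many standard errors the estimate is from zero.
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_res)

# Approximate 95% confidence interval for the coefficient.
t_crit <- qt(0.975, df = df_res)
ci_95  <- estimate + c(-1, 1) * t_crit * std_error

round(c(t = t_stat, p = p_value, lower = ci_95[1], upper = ci_95[2]), 4)
```

The interval contains zero, which is another way of saying the data cannot distinguish the estimated gap from no gap at all. It is also fairly wide, so a sizeable gap in either direction is not ruled out.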
This last question goes back to the main question from the start of this part: was there race-based pay inequity when this data was collected? To answer it, I will rely solely on the multiple regression model, because it is the most reliable model I have.
No, I believe that there was no race-based pay inequity when this data was collected, at least none that this data can detect. The multiple regression model supports this conclusion. It has a better R^2 and RSE than the simple model, and once educ, exper, and tenure are held constant, the coefficient of nonwhite is small and not statistically significant. Overall, the multiple regression model gives no evidence of race-based pay inequity at this time.
Would it be appropriate to use my multiple regression model and its conclusion to make any claim about race-based pay inequity today? This is the last question for this part that needs to be answered. To be able to answer this question in its entirety, I need to talk about what could be useful to know about this issue today and the context from which this data was collected.
I think it could be appropriate to use parts of this model and its
conclusion, but overall I do not think it would be appropriate to use
them to claim anything about race-based pay inequity today. We could
potentially reuse the confounding variables I chose to help explain
differences in pay between whites and nonwhites, because I believe
these factors have stayed somewhat consistent. We could also take the
model's conclusion and see how it has changed today. Finally, we could
use the regression coefficient of nonwhite as a baseline to show how
the gap has changed since this data was collected. The main reason why
I believe we cannot use this model for today is that the data is from
1976. Since then, a lot could have changed in how much whites and
nonwhites are paid, so this data is likely outdated and would not
explain the current situation with race-based pay inequity. In general,
we should only use it to compare and see how things have changed. We
should not use this model or its conclusion to make any claim about
race-based pay inequity today.
In summary, we can conclude that various socio-economic variables do have an effect on pay rate. From what I have found, the relationships between gender and wage and between race and wage are both affected the most by education, tenure, and experience. When these three confounding variables are taken into account, we can see that, at the time this data was recorded, there was gender-based pay inequity but no statistically significant race-based pay inequity. The multiple regression models I have created support these conclusions.