I am interested in learning something about Nevada wages for 2017. I have gathered some variables from the 2017 Census Bureau’s Current Population Survey Public Use Microdata Sample (PUMS) for Nevada. I have included:
NB: You may need to load libraries in the chunk above
Describe the data you will use in your regression including variables of interest and controls. Include the reason for using your data. Describe any hypotheses you can form (this variable will be positive etc.). Do any work you need in the r chunk.
# Total number of observations and the wage column for everyone
totalct <- NV2017CPS %>% count()
totalct
totalv <- NV2017CPS %>% select(wage)
# Women: count and wages
womanct <- NV2017CPS %>% filter(female == 1) %>% count()
womanct
womanv <- NV2017CPS %>% filter(female == 1) %>% select(wage)
# Men: count and wages
manct <- NV2017CPS %>% filter(female == 0) %>% count()
manct
manv <- NV2017CPS %>% filter(female == 0) %>% select(wage)
# Counts for the native and rural subgroups
nativect <- NV2017CPS %>% filter(native == 1) %>% count()
nativect
ruralct <- NV2017CPS %>% filter(rural == 1) %>% count()
ruralct
# Wages for the native and rural subgroups
nativev <- NV2017CPS %>% filter(native == 1) %>% select(wage)
ruralv <- NV2017CPS %>% filter(rural == 1) %>% select(wage)
mean(totalv[[1]])
## [1] 45919.93
mean(womanv[[1]])
## [1] 38058.58
mean(manv[[1]])
## [1] 52821.11
mean(nativev[[1]])
## [1] 31174.68
mean(ruralv[[1]])
## [1] 42035.14
I will examine how several factors affect wage: education level, age, whether the person is Native American, biological sex, and whether the person lives in a rural area. The data set has 13,776 observations, of which 6,440 are female and 7,336 are male. Of the total, 280 are Native American and 777 live in rural areas.
Comparing mean wages across groups, the overall average wage is 45,919.93. The average for men is higher at 52,821.11, while the average for women is 38,058.58. Out of curiosity, I also looked at Native Americans and people living in rural areas: the average wage is 31,174.68 for Native Americans and 42,035.14 for rural residents.
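The separate filter-and-mean steps above can also be collapsed into one grouped summary. The sketch below uses base R on a tiny made-up data frame (`toy`, with invented wages) purely to show the pattern; it assumes that the same calls would work on `NV2017CPS` because that data has `wage` and `female` columns.

```r
# Hypothetical toy data standing in for NV2017CPS (all values are invented).
toy <- data.frame(
  wage   = c(30000, 50000, 42000, 61000, 28000, 55000),
  female = c(1, 0, 1, 0, 1, 0)
)

# Count and mean wage per group, computed in one pass each.
group_n    <- tapply(toy$wage, toy$female, length)
group_mean <- tapply(toy$wage, toy$female, mean)

# Collect the results in one table, one row per value of female.
by_sex <- data.frame(
  female    = as.integer(names(group_mean)),
  n         = as.vector(group_n),
  mean_wage = as.vector(group_mean)
)
print(by_sex)
```

The same pattern applies to `native` or `rural` by swapping the grouping column.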
Historically, women have been paid less than men. I have always been skeptical of this claim. The argument I often hear against it is that women generally take less demanding jobs. I would not necessarily deny that, but I have yet to hear whether women in roles similar to men's are paid less. I am also curious whether Native Americans experience any deviation in wages, since I often hear that members of minority groups tend to have lower wages.
I expect the coefficients on Less than HS Diploma, native, and female to be negative, lowering the predicted wage relative to the baseline. I think living in a rural area might lower wages as well.
Describe the regression and perform it.
I am looking at the regression:
\(\hat{wage} = \beta_0 + \beta_1 x_{educ} + \beta_2 x_{age} + \beta_3 x_{native} + \beta_4 x_{female} + \beta_5 x_{rural}\)
To judge whether each variable is significant in the regression, I will use a t-test on each coefficient: the null hypothesis is that the coefficient is zero, and a small p-value means I can reject it.
model1 <- lm(wage ~ educ + age + native + female + rural, data = NV2017CPS)
summary(model1)
##
## Call:
## lm(formula = wage ~ educ + age + native + female + rural, data = NV2017CPS)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -111503  -22323   -6399   10676  391684
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)               30679.55    1949.02  15.741  < 2e-16 ***
## educBachelors Degree      15581.20    1754.90   8.879  < 2e-16 ***
## educGraduate Degree       43516.01    1998.80  21.771  < 2e-16 ***
## educHS Diploma            -9980.04    1654.06  -6.034 1.64e-09 ***
## educLess than HS Diploma -18211.28    1883.68  -9.668  < 2e-16 ***
## educSome College          -3979.81    1653.79  -2.406   0.0161 *
## age                         520.09      28.49  18.252  < 2e-16 ***
## native                    -7195.15    3034.51  -2.371   0.0177 *
## female                   -16386.07     844.47 -19.404  < 2e-16 ***
## rural                        61.00    1857.71   0.033   0.9738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49320 on 13766 degrees of freedom
## Multiple R-squared: 0.1528, Adjusted R-squared: 0.1522
## F-statistic: 275.8 on 9 and 13766 DF, p-value: < 2.2e-16
plot(model1)
In the summary, the t-test for rural gives a p-value of 0.9738, so the variable is insignificant and I want to compare this model to a reduced model without it. To make sure, I used a partial F-test (the anova() call in the next section) to see whether the reduced model is acceptable. Since its p-value is also 0.9738, I fail to reject the null hypothesis that the coefficient on rural is zero, and I think it is safe to use the reduced model without rural.
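For reference, the t value in the summary is the estimate divided by its standard error, and when only one regressor is dropped the partial F statistic equals the square of that t value. Using the rounded figures for rural from the output above:

\(t_{rural} = \frac{61.00}{1857.71} \approx 0.033, \qquad F = t_{rural}^2 \approx 0.0011\)

This is why the t-test and the partial F-test report the same p-value for rural.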
Looking at the normal Q-Q plot, most of the residuals follow the reference line approximately up to a theoretical quantile of about +2 on the x-axis. Beyond that point the residuals pull away from the line, which suggests they are not normally distributed in the upper tail.
Describe and interpret the results of your regression. Describe any tests of your hypotheses.
model2 <- lm(wage ~ educ + age + native + female, data = NV2017CPS)
anova(model2, model1)
summary(model2)
##
## Call:
## lm(formula = wage ~ educ + age + native + female, data = NV2017CPS)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -111505  -22320   -6400   10677  391681
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)               30681.68    1947.87  15.751  < 2e-16 ***
## educBachelors Degree      15580.02    1754.47   8.880  < 2e-16 ***
## educGraduate Degree       43515.06    1998.52  21.774  < 2e-16 ***
## educHS Diploma            -9978.84    1653.59  -6.035 1.63e-09 ***
## educLess than HS Diploma -18210.49    1883.46  -9.669  < 2e-16 ***
## educSome College          -3979.59    1653.72  -2.406   0.0161 *
## age                         520.12      28.49  18.259  < 2e-16 ***
## native                    -7176.93    2983.25  -2.406   0.0162 *
## female                   -16386.42     844.37 -19.407  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49320 on 13767 degrees of freedom
## Multiple R-squared: 0.1528, Adjusted R-squared: 0.1523
## F-statistic: 310.3 on 8 and 13767 DF, p-value: < 2.2e-16
# bptest() comes from the lmtest package, which must be loaded first
bptest(model2)
##
## studentized Breusch-Pagan test
##
## data: model2
## BP = 513.99, df = 8, p-value < 2.2e-16
plot(wage ~ age, data = NV2017CPS)
New regression model:
\(\hat{wage} = 30,681.68 + 15,580.02x_{Bachelors} + 43,515.06x_{Graduate} - 9,978.84x_{HSDiploma} - 18,210.49x_{LessthanHS} - 3,979.59x_{SomeCol} + 520.12x_{age} - 7,176.93x_{native} - 16,386.42x_{female}\)
The variables Bachelors Degree, Graduate Degree, HS Diploma, Less than HS Diploma, and Some College are dummy variables that shift the intercept, since each person falls into exactly one education category. The omitted reference category appears to be an Associate’s degree, so the intercept of 30,681.68 is the baseline for a male who is not Native American and holds an Associate’s degree, before the age term is added.
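As a worked example of how the dummies shift the intercept, consider a hypothetical 40-year-old woman who is not Native American and holds a Bachelor’s degree. The person is invented for illustration; the coefficients are those of the fitted model above.

\(\hat{wage} = 30,681.68 + 15,580.02 + 520.12(40) - 16,386.42 = 50,680.08\)

For comparison, a man of the same age in the reference education group would be predicted at:

\(\hat{wage} = 30,681.68 + 520.12(40) = 51,486.48\)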
Even after dropping rural, the share of variance in wage explained by the model is still low, with an adjusted R-squared of 0.1523. There may be other factors influencing wages that are not available in this data, such as access to the internet or whether the person has a car.
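The adjusted figure can be checked against the multiple R-squared in the output, using the standard adjustment formula with n = 13,776 observations and k = 8 estimated slopes (values taken from the rounded summary output, so the result is approximate):

\(\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1} = 1 - (1 - 0.1528)\frac{13775}{13767} \approx 0.1523\)

With this many observations the adjustment is tiny, so the low value reflects the model itself rather than a penalty for the number of regressors.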
To test for heteroskedasticity, I used bptest and got a p-value close to zero (BP = 513.99, df = 8). Therefore, I reject the null hypothesis that the errors are homoskedastic. This means the error variance is not constant across observations and could differ widely from one observation to another. One consequence is that the standard errors reported in the summary may not be reliable.
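To make the test less of a black box, here is a minimal base R sketch of the studentized Breusch-Pagan statistic: n times the R-squared of an auxiliary regression of the squared residuals on the regressors, compared with a chi-squared distribution. The data are simulated and invented for illustration (they are not the Nevada data), and the names `fit`, `aux`, and `bp_stat` are my own.

```r
# Simulated data (invented for illustration): the error spread grows with x,
# so the model is heteroskedastic by construction.
set.seed(1)
n <- 500
x <- runif(n, 20, 60)
y <- 1000 + 500 * x + rnorm(n, sd = 50 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same regressor.
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1   # number of regressors, excluding intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_p))
```

A small p-value here has the same reading as in the bptest output above: the squared residuals vary systematically with the regressors, so the constant-variance assumption is rejected.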
So both being female and being Native American have a negative effect on a person’s wage, with coefficients of -16,386.42 and -7,176.93 respectively. However, living in a rural area did not affect wages significantly. That might have something to do with the small number of observations from rural areas (777 of 13,776). One thing to keep in mind is that this model can predict a negative wage for some people, which is unrealistic since no one would pay to work.
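The negative-prediction concern can be made concrete with a hypothetical case (invented for illustration, not an observation from the data): an 18-year-old Native American woman with less than a high school diploma. Plugging into the reduced model gives:

\(\hat{wage} = 30,681.68 - 18,210.49 + 520.12(18) - 7,176.93 - 16,386.42 = -1,730.00\)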