Introduction

In this assignment we will be looking more at gender disparities in wages in the United States. We will be using the Current Population Survey just as we did in Assignment 3. But in Assignment 3, we only knew how to do a bivariate linear regression. To control for other factors, we had to subset the population to single persons of about the same age. Know we know how to use linear regression to control for other factors.

The process for analyzing these data is the same as in the tutorial. I will give you the code for a few data cleaning steps

Load and clean the data

These data are a bit different than last week, but you can use the same codebook.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cps <- read_csv('cps2017.csv')
## Rows: 135750 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): UID, GESTFIPS, HEFAMINC, PARTNER_STA, RACE_R, Y_AGE, A_MJOCC, PEIOOCC
## dbl (8): GTMETSTA, A_SEX, EDUC, HISPANIC, A_AGE, A_WKSTAT, NCHIL, WSAL
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I’ve provided code to clean up the data here: (fct_na_value_to_level

cps_clean <- cps %>% 
  mutate(young_child = ifelse(!is.na(Y_AGE) & Y_AGE<=5, TRUE, FALSE),
         NCHIL = ifelse(!is.na(NCHIL), NCHIL, 0),
         SEX = factor(A_SEX, levels=c(1,2), labels = c('Male','Female')),
         HISP = factor(HISPANIC, levels=c(2,1), labels=c('Not Hispanic','Hispanic')),
         RACE = factor(RACE_R, levels = c('White','Black','Other')),
         PARTNER_STA = forcats::fct_explicit_na(PARTNER_STA, na_level='none') %>%
           fct_relevel('none','married','cohabit'),
         METSTA = factor(GTMETSTA, levels = c(1,2), labels=c('Metropolitan', 'Non-Metropolitan')),
         OCC1 = factor(A_MJOCC),
         OCC2 = factor(PEIOOCC))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `PARTNER_STA = `%>%`(...)`.
## Caused by warning:
## ! `fct_explicit_na()` was deprecated in forcats 1.0.0.
## ℹ Please use `fct_na_value_to_level()` instead.

The variables are

We won’t be using the occupational codes, but you can find them in the codebook. They might be helpful for looking at outliers.

Run a regression

We’ll run a regression of log WSAL on SEX, RACE, PARTNER_STA, EDUC, HISPANIC, A_AGE, NCHIL and young_child.

For each variable, give some thought to whether you think the coefficients will be positive or negative, and give the reason:

SEX: I think the coefficient will be negative. We already know a wage gap exists, so the female coefficient will probably be negative.

RACE: I think this coefficient will also be negative. Any race besides white and Eastern Asian experiences a negative wage gap, especially if the person is female.

PARTNER_STA: I think this will be a positive coefficient. Married men are considered to be more stable and trustworthy, so their pay tends to go up. For married women however, their pay tends to decrease since employers think maternity costs should be considered

EDUC: I think this will be a positive coefficient. Workers with higher education and additional degrees tend to make more money than workers with entry-level education, so this is sensible.

HISPANIC: I think this coefficient will be negative. It is already proven that Hispanic women earn the least out of all demographics, so this coefficient will likley have a negative downtrend.

A_AGE: I think this will be a positive coefficient. The more experience someone has in a job, the higher their pay likely will be. The older someone is, the more experience they probably will have.

NCHIL: I think this will be a negative coefficient. Minority households, especially Hispanic families, tend to have larger numbers of young children. These households also typically earn less money due to wage bias, so I think it will be negative.

young_child: I thyink this will be a negative coefficient for the same reasons I believe NCHIL will be negative. Minority households often have more young children, resulting in a wage gap.

Last data cleaning:

Here, we’ll filter to select people who worked full time and made more than a $5000. There’s people here who say they work full-time, but make far less than minimum wage. Maybe they’re employed but on leave? Or maybe they are employed now, but they didn’t work all of last year? I don’t know. But I think that they are different enough that we need to remove them.

Using filter, filter the cps_clean data only include people who

  1. Earned at least 5000 in wages and salary last year, and
  2. Are full time employed now.

Call this filtered dataset reg_data.

reg_data <- filter(cps_clean, A_WKSTAT== "2", WSAL >= "5000")

Then, with reg_data, use the ggpairs function to show the plot of ‘SEX’,‘RACE’,‘PARTNER_STA’, ‘A_AGE’, ‘NCHIL’ and log wages. (Hint, you might use mutate to create a column of log wages first)

library(GGally) 
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
log_WSAL <- log(reg_data$WSAL)   
reg_data <- mutate(reg_data, log_WSAL)
ggpairs(reg_data %>%  select("SEX", "RACE", "PARTNER_STA", "A_AGE", "NCHIL",   "log_WSAL")) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Then, with reg_data, use the lm function to run a regression of log WSAL on SEX, RACE, PARTNER_STA, EDUC, HISPANIC, A_AGE, NCHIL and young_child.

Print the summary of the regression model. For each variable, provide an interpretation of the coefficient.

log_reg <- lm(log_WSAL~SEX + RACE + PARTNER_STA + EDUC + HISPANIC + A_AGE + NCHIL+ young_child, data=reg_data)

SEXfemale = -0.182064

This coefficient implies that females make 0.182064 less than males for every 1 unit move up.

RACEBlack = -0.072605

This coefficient implies that black workers make 0.072605 less for every 1 unit move up.

RACEOther = -0.016924

This coefficient implies that other races besides white and black make 0.016924 less for every 1 unit move up.

PARTNER_STAmarried = 0.144372

This coefficient implies that married workers make 0.144372 more than single workers for every 1 unit move up.

PARTNER_STAcohabit = 0.105819

This coefficient implies that partners who cohabitate make 0.105819 more than single workers for every 1 unit move up.

EDUC = 0.059953

This coefficient implies that workers make 0.059953 more with every 1 unit increase of education.

HISPANIC = 0.070821

This coefficient implies that Hispanic individuals make 0.070821 more with every 1 unit increase than non Hispanic individuals.

A_AGE = 0.009681

This coefficient implies that with each 1 unit of age increase, a worker earns 0.4622 more.

NCHIL = 0.004093

This coefficient implies that with each 1 unit of increase, a worker earns 0.004093 more than workers with less children.

young_child = 0.086565

This coefficient implies that with each 1 unit of increase, a worker earns 0.086565 more than workers with no young children.

Residual diagnostics

Produce: 1. a histogram of the residuals, and 2. a scatterplot of the residuals vs the fitted values.

Interpret the diagnostics and determine the suitability of the model.

residual <- broom::augment(log_reg)
reg_data <- mutate(reg_data, residual) 
ggplot(data = reg_data, aes(x=.resid)) + geom_histogram(bins=149, binwidth = 0.07)

This histogram, I believe, is a better way to see how the data is skewed in terms of its residuals. The data is generally skewed left, but normal enough for our purposes, and this is easily depicted in the model. Though we cannot make giant conclusions based off of just a histogram, we can make better conclusions with this model than with the scatterplot.

ggplot(data=residual, aes(x=.fitted, y=.std.resid)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

This is not a great model to use to make conclusions off of. While the data points have the same appearing slope with different intercepts (which is what should happen), the line itself isn’t a great description of what is going on. It turns halfway through and cuts through the datapoints for one of the variables. It is just hard to interpret and not great. I believe that this is a nonlinear model.

Write up your results.

Provide a brief write up of your results. Be sure to address: 1) What is the relation between wages and gender? 2) Are there any caveats or things to note with the data that might cause us to question the results a little or a lot? 3) Are there any confounding factors that we do not have data for?

  1. There is a definite correlation between wages and gender. When looking at females, this is a negative correlation. It is proved, with the correlation coefficient of -0.182064, that women make less than men. This is unfortunate to have confirmed evidence for, as a woman who will likely make less than her equally qualified male counterparts, but this is the reality.

  2. It is very important to note that definite conclusions cannot be made at any point just with quantitative results. Yes, there is a negative correlation between female gender and wages, but we cannot definitely determine that there is a gender-defined wage gap. While yes, I fully believe there is a wage gap determined by sexism, we cannot declare this based just off of the data. We also cleaned up a lot of the data and took out certain aspects to make it more manageable. Though this is great for our purposes in the lab with beginner level R-Studio experience, this takes away from potential results. Though the difference is probably very minor– we cannot know for sure! The people making less than $5,000 a year might skew our results– we don’t know for sure!

  3. Something to consider when calculating income is tips in the service industry. There could be a disproportionate distribution of waiters or servers among different variables– especially age, race, or education. Marginalized groups are less likely to have access to education, therefore unfortunately pushing them towards the service sector. While some establishments record their tips onto employee salaries for taxation purposes, some do not! This under-the-table tipping system could greatly alter the reported income or salaries for different demographics, which would actually produce a different correlation coefficient. There is no way that we could know this for sure, so this could be a confounding variable without any data.