Question 1

Name two issues specifically pertaining to the sample of the SuperProfessionalJobs.com data that might give Joe an inaccurate estimate about the population.

Answer: #1. There may be selection bias, since the job search website used a non-random sampling method to create its sample. Of its 13,467 users, only 1,557 completed the survey, and those 1,557 people may not be representative of the whole user population. Some groups may be over- or underrepresented in the sample, which would cause inaccurate estimates of the population. For example, superprofessionaljobs.com may be used mainly by older people, since most younger people use LinkedIn. Older people would then be overrepresented, the sample would not be fully representative of the population, and the resulting estimate would be inaccurate. #2. 38% of respondents did not enter a salary amount, which is a very high share of people to have no income. The data is most likely not missing at random. The missing data could arise because many of the respondents are college students or unemployed people looking for a job. This suggests that lower-income groups could be overrepresented in the sample, which would cause an inaccurate estimate about the population.

Question 2

Name two issues that may arise pertaining to the variables available in the SuperProfessionalJobs.com data which might give Joe an inaccurate estimate about the population? (i.e., are the variables sufficient? What other variables should one want?)

Answer: #1. There is missing data in the income column. If the data were missing at random, this would not be an issue, but truly random missingness is rare. For example, many more women might leave this question blank since more women are stay-at-home parents, which may create an overrepresentation of men reporting their salaries compared to women. #2. The income variable also uses the value 999999 for N/A and 999998 for missing. We have to recode these values, since leaving such large numbers in the dataset makes us greatly overestimate income.

Question 3

Describe one specific situation how these issues might exacerbate the gender earnings gap and a second situation that might make it seem smaller than reality. Be specific about it.

Answer: #1. One situation where these issues might exacerbate the gender earnings gap is if there are biases in the data that are not controlled for. For example, there could be discrimination bias: women's true income could be underestimated if many women take parental leave and this takes a toll on their income. Discrimination bias could also cause women's true income to be underestimated, since women can get lower salaries and fewer hours than men in the same roles just because they are women. #2. A second situation that might make the gap seem smaller than reality is if there is selection bias in the data. For example, if the survey overrepresents older women relative to a balanced age representation of men, the average income for women appears higher than it really is.

Question 4

If you were to impute the missing salary data with the mean, think of one circumstance where doing so may bias your estimate. Also, think of a better way to impute the missing salary data than imputing with the mean? Also see footnotes in case.

Answer: # Imputing the missing salary data with the mean can bias your estimate because the mean may not be an accurate estimate of a given person's income. You are taking the mean of all the other incomes rather than predicting the missing income from the other variables you have. For example, if you use the mean of the data but the person with missing data is unemployed or in college, then you would be overestimating their income by a lot. A better way to impute the missing salary is with a regression model that uses the ACS variables to fill in the missing data.
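
The contrast between the two imputation approaches can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame `toy`, its columns, and all numbers are made up, and salary is deliberately made missing more often for people who are not employed.

```r
# Toy data only: simulated values, not the ACS or SuperProfessionalJobs.com data.
set.seed(1)
n <- 500
college  <- rbinom(n, 1, 0.4)
employed <- rbinom(n, 1, 0.8)
salary   <- 20000 + 25000 * college + 30000 * employed + rnorm(n, sd = 5000)
toy <- data.frame(salary, college, employed)

# Make salary missing more often for people who are not employed (not missing at random).
miss <- runif(n) < ifelse(toy$employed == 1, 0.1, 0.7)
toy$salary_obs <- ifelse(miss, NA, toy$salary)

# Mean imputation: every missing value gets the same number.
mean_imp <- ifelse(miss, mean(toy$salary_obs, na.rm = TRUE), toy$salary_obs)

# Regression imputation: predict missing salaries from the observed covariates.
fit <- lm(salary_obs ~ college + employed, data = toy)
reg_imp <- ifelse(miss, predict(fit, newdata = toy), toy$salary_obs)

# Average absolute error on the rows that were imputed.
err_mean <- mean(abs(mean_imp[miss] - toy$salary[miss]))
err_reg  <- mean(abs(reg_imp[miss] - toy$salary[miss]))
c(mean_imputation = err_mean, regression_imputation = err_reg)
```

In this setup the mean-imputed values overshoot for the non-employed, while the regression predictions track the covariates, so the regression error is much smaller.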

Question 5

Import ACS.csv into RStudio. Name this data frame ACS. There should be 1,000,000 observations. There is no need to do any data cleaning at this point, but you can feel free to get a feel for the data through looking at their structure and familiarize yourself with the variables.

#insert code
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = 'C:/Users/rbrody2/Desktop')
ACS <- read.csv("ACS.csv")

Question 6

The raw data contains many variables that are categorical and can be converted into dummy variables. In R, create each of the following dummy variables:

a. Sex (sex): Create one dummy variable, called female, for anyone who is identified as a “female”.

#insert code
ACS$female <- ifelse(ACS$sex=='Female', 1, 0)

b. Marital Status (marst): Create one dummy variable, called married, for anyone who is “Married, spouse present” or “Married, spouse absent”.

#insert code
ACS$married <- ifelse(ACS$marst == 'Married, spouse absent' | ACS$marst== 'Married, spouse present', 1, 0)

c. Having a child last year (fertyr): Create one dummy, called child, for those who answered “Yes”.

#insert code
ACS$child <- ifelse(ACS$fertyr=='Yes', 1, 0)

d. Race(race): Create separate dummy variables for each of the following categories:

  • A dummy called white for those who are “White”
#insert code
ACS$white <- ifelse(ACS$race=='White', 1, 0)
  • A dummy called black for those who are “Black/African American/Negro”
#insert code
ACS$black <- ifelse(ACS$race=='Black/African American/Negro', 1, 0)

e. Hispanic (hispan): Create a single dummy variable called hispanic for anyone who is “Cuban”, “Mexican”, “Puerto Rican”, or “Other”

#insert code
ACS$hispanic <- ifelse(ACS$hispan == 'Mexican' | ACS$hispan == 'Puerto Rican'
                             | ACS$hispan == 'Cuban' | ACS$hispan == 'Other', 1, 0)

f. Education (educ): Create a single dummy variable called college for anyone who has “1 year of college”, “2 years of college”, “4 years of college”, or “5+ years of college”.

#insert code
ACS$college <- ifelse(ACS$educ == '1 year of college' | ACS$educ == '2 years of college'
                             | ACS$educ == '4 years of college' | ACS$educ == '5+ years of college', 1, 0)

g. Employment Status (empstat): Create a single dummy variable called notemployed for anyone who is “Unemployed” or “Not in labor force.”

#insert code
ACS$notemployed <- ifelse(ACS$empstat == 'Unemployed' | ACS$empstat== 'Not in labor force', 1, 0)
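
One design choice in the dummies above is worth a small sketch. Chained `==` comparisons pass missing values through as NA, whereas `%in%` codes them as 0. The vector `educ` below is a made-up stand-in for the ACS column, and which behavior is preferable depends on how missing categories should be treated.

```r
# Toy values only; the category labels mirror those used above.
educ <- c("Grade 12", "1 year of college", "4 years of college", NA, "5+ years of college")

college_levels <- c("1 year of college", "2 years of college",
                    "4 years of college", "5+ years of college")

# Chained == comparisons propagate NA: a missing educ gives an NA dummy.
college_eq <- ifelse(educ == "1 year of college" | educ == "2 years of college" |
                       educ == "4 years of college" | educ == "5+ years of college", 1, 0)

# %in% never returns NA, so a missing educ is coded 0.
college_in <- as.integer(educ %in% college_levels)

college_eq  # 0 1 1 NA 1
college_in  # 0 1 1 0 1
```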

Question 7

Change the Hours Usually Worked variable (uhrswork) from a character to a numeric variable. Use the as.numeric() function to do this. Just note that in order to have R replace the uhrswork variable, you need to store it as an object (with the same variable name) in the ACS data frame.

#insert code
ACS$uhrswork <- as.numeric(ACS$uhrswork)
## Warning: NAs introduced by coercion
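
The coercion warning means some entries could not be read as numbers. A minimal sketch of how to see which ones, using a made-up vector in which the label "N/A" is an assumption rather than the actual ACS coding:

```r
# Toy vector only; "N/A" is a hypothetical non-numeric label.
uhrswork <- c("40", "35", "N/A", "60")

# as.numeric() turns anything it cannot parse into NA (and warns about it).
hours <- suppressWarnings(as.numeric(uhrswork))

# Inspect the original strings that became NA.
uhrswork[is.na(hours)]
mean(is.na(hours))  # share of values lost to coercion
```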

Question 8

Your laptop is struggling to run some estimates with a large data set, so you want a representative sample instead. Set a seed of 78 and randomly subset the data so that you only have 200,000 observations instead of 1 million observations. Name this data frame ACS_subset.

#insert code
set.seed(78) 
library(dplyr) #need to load dplyr package for sampling
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ACS_Subset <- sample_n(ACS, 200000) # ACS_Subset is a random sample of 200,000 observations from ACS
ACS_Subset$sample <- 1 # flag these rows as belonging to the sample

Question 9

Based on existing numeric variables (including dummies), determine whether ACS_subset is a representative subset of ACS. How did you determine representativeness (as we learned in class)? What is one major weakness of this approach? Note: Make sure any regressions you run is formatted in a table using modelsummary().

Answer:

#insert code
ACS$sample <- 0
ACS_compare <- rbind(ACS, ACS_Subset)
models_compare <- list()

# Let's use several variables to test for representativeness
models_compare[['female']] <- lm(female ~ sample, data=ACS_compare)
models_compare[['married']] <- lm(married ~ sample, data=ACS_compare)
models_compare[['child']] <- lm(child ~ sample, data=ACS_compare)
models_compare[['white']] <- lm(white ~ sample, data=ACS_compare)
models_compare[['black']] <- lm(black ~ sample, data=ACS_compare)
models_compare[['hispanic']] <- lm(hispanic ~ sample, data=ACS_compare)
models_compare[['incwage']] <- lm(incwage ~ sample, data=ACS_compare)
models_compare[['notemployed']] <- lm(notemployed ~ sample, data=ACS_compare)
models_compare[['college']] <- lm(college ~ sample, data=ACS_compare)
models_compare[['uhrswork']] <- lm(uhrswork ~ sample, data=ACS_compare)
library(modelsummary)
modelsummary(models_compare, stars=TRUE, title="Statistical Differences w/Representative Sample")
## Warning: In version 0.8.0 of the `modelsummary` package, the default significance markers produced by the `stars=TRUE` argument were changed to be consistent with R's defaults.
## This warning is displayed once per session.
Statistical Differences w/Representative Sample
             female         married        child          white          black          hispanic       incwage        notemployed    college        uhrswork
(Intercept)  0.500***       0.413***       0.013***       0.749***       0.103***       0.160***       235251.561***  0.162***       0.423***       38.659***
             (0.001)        (0.000)        (0.000)        (0.000)        (0.000)        (0.000)        (394.903)      (0.000)        (0.000)        (0.017)
sample       0.001          0.000          0.000          0.001          0.000          -0.001         824.770                       0.001          -0.020
             (0.001)        (0.001)        (0.000)        (0.001)        (0.001)        (0.001)        (967.310)                     (0.001)        (0.041)
notemployed                                                                                                         -0.009***
                                                                                                                      (0.001)
Num.Obs.     1200000        1200000        1200000        1200000        1200000        1200000        1200000        1200000        1200000        683721
R2           0.000          0.000          0.000          0.000          0.000          0.000          0.000          0.000          0.000          0.000
R2 Adj.      0.000          0.000          0.000          0.000          0.000          0.000          0.000          0.000          0.000          0.000
AIC          1741903.9      1705127.4      -1808101.5     1400346.4      549550.8       997080.3       34332804.4     996940.4       1713242.9      5408110.6
BIC          1741939.9      1705163.4      -1808065.5     1400382.4      549586.8       997116.3       34332840.4     996976.4       1713278.8      5408144.9
Log.Lik.     -870948.929    -852560.724    904053.770     -700170.195    -274772.386    -498537.128    -17166399.208  -498467.181    -856618.428    -2704052.311
F            1.310          0.000          0.088          1.007          0.001          0.745          0.727          140.647        0.186          0.245
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
# No coefficient in the sample row has a star next to it, so there is no statistically significant difference between the sample and population means for the variables regressed on sample. This suggests that ACS_Subset is a representative subset of ACS. The notemployed column reports a coefficient on notemployed rather than on sample, so it does not test representativeness for that variable. One major weakness of this approach is that it can only compare variables we observe: characteristics that are not measured, such as a person's ability or intelligence, could still differ between the sample and the population.
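
The logic of the test can be checked on simulated data. In a regression of a variable on a 0/1 sample indicator, the slope equals the difference in means between the two groups, so an insignificant slope means the means are statistically indistinguishable. The data frames `pop`, `sub`, and `both` below are made up for illustration.

```r
# Toy data only: a simulated "population" and a random subset of it.
set.seed(78)
pop <- data.frame(female = rbinom(10000, 1, 0.5), sample = 0)
sub <- pop[sample(nrow(pop), 2000), ]
sub$sample <- 1

# Stack the two, as done above with rbind(ACS, ACS_Subset).
both <- rbind(pop, sub)
fit <- lm(female ~ sample, data = both)

# The slope on sample is the difference in group means.
diff_means <- mean(sub$female) - mean(pop$female)
coef(fit)["sample"]
diff_means

# p-value for the null hypothesis that the two means are equal.
summary(fit)$coefficients["sample", "Pr(>|t|)"]
```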

Question 10

Using ACS_subset, run a regression using incwage as the dependent variable and female as the independent variable. Write a sentence to properly interpret the constant (y-intercept). Then interpret the slope for the female coefficient. Based on what you know about US incomes, do you think these interpretations make sense in real life? Why or why not?

Answer:

#insert code
models=list()
models[['incwage']] <- lm(incwage ~ female, data=ACS_Subset)
modelsummary(models, stars=TRUE, title="Statistical Differences w/Representative Sample")
Statistical Differences w/Representative Sample
incwage
(Intercept) 248460.209***
(1251.395)
female -24703.774***
(1767.453)
Num.Obs. 200000
R2 0.001
R2 Adj. 0.001
AIC 5722451.6
BIC 5722482.2
Log.Lik. -2861222.789
F 195.358
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
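# The intercept, 248,460, is the average incwage for men (female = 0). The slope for female is -24,703, which means that on average women make $24,703 less than men. These interpretations do not make sense in real life. An average male income near $248,000 is far too high for the US, which likely reflects the 999999 and 999998 codes still in the data. The gap also seems too large, since most people will get similar offers for similar jobs regardless of their gender. I would expect the gap to be around $2,000-5,000, considering discrimination in the workplace and women having different priorities, such as taking care of kids.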

Question 11

Produce a histogram from ACS_subset [using either the hist() or ggplot2 functions if you know it] to visualize the distribution of the incwage variable. Notice the fact that there are an unusually large number of people making either $999,999 or $999,998 in the data.

#insert code
library(ggplot2)
hist(ACS_Subset$incwage) # the question asks for the subset, not the full ACS data

Question 12

Recode the incwage variable in ACS_subset so that anyone making incwage of 999999 (which means N/A) or 999998 (which means missing) are now considered to be NA in the data frame.

#insert code
ACS_Subset$incwage[ACS_Subset$incwage>=999998] <-NA
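
A small made-up vector shows why this recode matters: even two placeholder codes among six values inflate the mean roughly tenfold. The numbers below are illustrative only.

```r
# Toy incomes only; two entries carry the 999999 / 999998 placeholder codes.
incwage <- c(0, 18000, 52000, 999999, 75000, 999998)

mean(incwage)  # inflated by the placeholder codes

# Same recode as above, applied to the toy vector.
incwage_clean <- incwage
incwage_clean[incwage_clean >= 999998] <- NA

mean(incwage_clean, na.rm = TRUE)  # mean of the four real incomes
```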

Question 13

Use the package “modelsummary” and the function datasummary_skim() in R to take a look at the descriptive statistics. See our code in class for example. Take note that some of the variables are considered categorical (“factor” variables) in R and won’t show up in the summary, but your dummy variables should show up in the table. As long as the descriptive statistics look okay to you, feel free to use this as the analytical data set. If there is anything that sticks out to you that you feel needs pre-processing, feel free to do so but do describe both the issue and what you did. (Note: You don’t actually HAVE to change anything based on the data, but I want you to have the freedom to do anything you may see that you believe needs pre-processing, since in your own analyses these are generally up to you.)

Answer (if any):

#insert code
datasummary_skim(ACS_Subset, fmt = "%.3f")
Unique (#) Missing (%) Mean SD Min Median Max
year 3 0 2017.007 0.816 2016.000 2017.000 2018.000
perwt 827 0 106.381 86.010 1.000 83.000 1804.000
uhrswork 95 43 38.639 12.633 1.000 40.000 98.000
incwage 942 21 34785.982 57030.189 0.000 18000.000 736000.000
female 2 0 0.501 0.500 0.000 1.000 1.000
married 2 0 0.413 0.492 0.000 0.000 1.000
child 2 0 0.013 0.114 0.000 0.000 1.000
white 2 0 0.750 0.433 0.000 1.000 1.000
black 2 0 0.103 0.304 0.000 0.000 1.000
hispanic 2 0 0.159 0.366 0.000 0.000 1.000
college 2 0 0.424 0.494 0.000 0.000 1.000
notemployed 2 0 0.273 0.446 0.000 0.000 1.000
sample 1 0 1.000 0.000 1.000 1.000 1.000

Question 14

After pre-processing, Joe wants to estimate the difference in earnings between male and female workers in general. This could give him some idea into what the raw differences in earnings would be. Run a regression using ACS_subset with incwage as the dependent variable and female as the independent variable.

#insert code
models <- list()
models[['income without weight']] <- lm(incwage ~ female, ACS_Subset)
modelsummary(models, stars=TRUE, title="Statistical Differences w/Representative Sample")
Statistical Differences w/Representative Sample
income without weight
(Intercept) 43154.087***
(201.597)
female -16567.593***
(283.662)
Num.Obs. 158291
R2 0.021
R2 Adj. 0.021
AIC 3912836.2
BIC 3912866.1
Log.Lik. -1956415.089
F 3411.283
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Question 15

For the regression you ran in 14, Interpret the female coefficient in its context.

Answer:

# On average, females make $16,567.59 less than men.

Question 16

For the regression you ran in 14, Interpret the y-intercept in its context.

Answer: # When female = 0, the intercept gives the average income of men, which is $43,154.09.

Question 17

Thinking about what to do next, Joe phones his buddy, Foster. Upon hearing what Joe was up to, Foster says, “As I understand it, I believe the ACS data is at the individual-level, but people from different geographic areas do not respond equally to the ACS survey. As a result, you have to weigh certain people more or less to get an accurate representation of the country. There is an ACS variable for that – PERWT.” Joe repeats his estimates using the weight variable. Also see footnote for this.

#insert code
models[['income with weight']] <- lm(incwage ~ female, data = ACS_Subset, weights = perwt)
modelsummary(models, stars=TRUE, title="Statistical Differences w/Representative Sample")
Statistical Differences w/Representative Sample
income without weight income with weight
(Intercept) 43154.087*** 41918.404***
(201.597) (191.025)
female -16567.593*** -16180.978***
(283.662) (268.781)
Num.Obs. 158291 158291
R2 0.021 0.022
R2 Adj. 0.021 0.022
AIC 3912836.2 3937424.7
BIC 3912866.1 3937454.6
Log.Lik. -1956415.089 -1968709.339
F 3411.283 3624.194
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
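
What the weights do can be seen on simulated data. With a single dummy regressor, the weighted intercept is the weighted mean for men and the weighted slope is the difference in weighted means. The data frame `toy` and its values are made up; only the column names follow the ACS variables.

```r
# Toy data only: simulated incomes and person weights.
set.seed(2)
toy <- data.frame(female = rbinom(200, 1, 0.5),
                  perwt  = sample(1:300, 200, replace = TRUE))
toy$incwage <- 45000 - 15000 * toy$female + rnorm(200, sd = 10000)

# Weighted regression, as in the model above.
fit_w <- lm(incwage ~ female, data = toy, weights = perwt)

# Weighted group means computed directly.
wm_m <- with(toy[toy$female == 0, ], weighted.mean(incwage, perwt))
wm_f <- with(toy[toy$female == 1, ], weighted.mean(incwage, perwt))

coef(fit_w)            # intercept and female coefficient
c(wm_m, wm_f - wm_m)   # the same two numbers from weighted means
```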

Question 18

Foster comments, “There should also be differences between people with and without children and whether they are employed, the quantity of time they work, etc.” Ideas are flowing now. Joe continues his analysis on top of the last regression he ran by also controlling for having children in the past year (child), their employment status (notemployed), their race (white, black), whether they are Hispanic (hispanic), their education (college), and whether they are married (married).

#insert code
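# Controls in this run: child, notemployed, hispanic, college, married (white and black are not in the model)
models[['income+others weighted']] <- lm(incwage ~ female + child + notemployed + hispanic + college + married, data = ACS_Subset, weights = perwt)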
modelsummary(models, stars=TRUE, title="Statistical Differences w/Representative Sample")
Statistical Differences w/Representative Sample
             income without weight   income with weight      income+others weighted
(Intercept)  43154.087***            41918.404***            38049.825***
             (201.597)               (191.025)               (267.478)
female       -16567.593***           -16180.978***           -13722.643***
             (283.662)               (268.781)               (241.228)
child                                                        -3470.310***
                                                             (898.989)
notemployed                                                  -39362.402***
                                                             (260.113)
hispanic                                                     -7361.794***
                                                             (313.964)
college                                                      18913.345***
                                                             (248.976)
married                                                      14750.205***
                                                             (241.156)
Num.Obs.     158291                  158291                  158291
R2           0.021                   0.022                   0.237
R2 Adj.      0.021                   0.022                   0.237
AIC          3912836.2               3937424.7               3898251.0
BIC          3912866.1               3937454.6               3898330.8
Log.Lik.     -1956415.089            -1968709.339            -1949117.497
F            3411.283                3624.194                8183.288
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Question 19

Joe begins to think more about who exactly should be included in the analysis. Many people below 25 could either still be in school, or very early in their career in a way that is not representative of their skills. Those above 65 may be retired or are more likely to either stop working or be disabled. He re-estimates the calculations to condition on only those who are between ages 25-65 (i.e., subset ACS_subset so that it is for people between those ages.) Also see footnote for this.

#insert code
ACS_Subset <-subset(ACS_Subset,age>=25 & age <=65)

Question 20

After running the regression, we can estimate female earnings as a percentage of male earnings using the following equation:

[See equation in Case]

In the above equation, the denominator is just the mean of incwage for males who have incwage less than 999998 and are between ages 25 through 65. In your results, what is the percentage that women make relative to men in the US? Also see footnote for this.

Answer:

#insert code
models[['income 25-65']] <- lm(incwage ~ female, data = ACS_Subset, weights = perwt)
modelsummary(models, stars=TRUE, title="Statistical Differences w/Representative Sample")
Statistical Differences w/Representative Sample
             income without weight   income with weight      income+others weighted  income 25-65
(Intercept)  43154.087***            41918.404***            38049.825***            50998.990***
             (201.597)               (191.025)               (267.478)               (232.314)
female       -16567.593***           -16180.978***           -13722.643***           -20107.681***
             (283.662)               (268.781)               (241.228)               (325.802)
child                                                        -3470.310***
                                                             (898.989)
notemployed                                                  -39362.402***
                                                             (260.113)
hispanic                                                     -7361.794***
                                                             (313.964)
college                                                      18913.345***
                                                             (248.976)
married                                                      14750.205***
                                                             (241.156)
Num.Obs.     158291                  158291                  158291                  123644
R2           0.021                   0.022                   0.237                   0.030
R2 Adj.      0.021                   0.022                   0.237                   0.030
AIC          3912836.2               3937424.7               3898251.0               3092266.3
BIC          3912866.1               3937454.6               3898330.8               3092295.5
Log.Lik.     -1956415.089            -1968709.339            -1949117.497            -1546130.171
F            3411.283                3624.194                8183.288                3809.048
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
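# Predicted mean income for women: 50,998.99 - 20,107.68 = 30,891.31
# Share of male mean income: 30,891.31 / 50,998.99 = 0.606, or about 61%
# In this weighted sample of people aged 25-65, women make about 61% of what men make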
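
The arithmetic above can be reproduced in R from the two coefficients in the "income 25-65" column. The values are typed in by hand from the table, and the variable names are illustrative.

```r
# Coefficients copied from the "income 25-65" column of the table above.
intercept   <- 50998.990   # weighted mean incwage for men aged 25-65
female_coef <- -20107.681  # difference for women

# Women's mean income as a share of men's mean income.
ratio <- (intercept + female_coef) / intercept
round(100 * ratio, 1)  # about 60.6 percent
```

With a fitted model object, the same two numbers could be pulled with `coef()` instead of being typed in.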

Question 21

Further, make a list of all the analytical decisions Joe (and you) made to arrive at this estimate. Thinking back, do you believe the estimate? If someone other than Joe dealt with the data, do you believe that they would arrive at the same results? What are some specific reasons why another analyst may or may not arrive at the same estimate?

Answer: # The analytical decisions were: creating the dummy variables; converting hours worked to numeric; drawing a random 200,000-observation subset with a seed of 78; recoding incwage values of 999999 and 999998 to NA; weighting observations by perwt; adding controls for child, notemployed, hispanic, college, and married; and restricting the data to ages 25-65. This estimate is only somewhat believable. The variables that had to be changed were changed, we took care of the invalid income codes and other inaccuracies in the data, and the perwt weights should adjust for people responding to the survey unequally. Even so, 61% does not sound like a believable number to me. I believe that there is a lot more bias in the data that we are not controlling for. Another analyst may not arrive at the same results. We took pre-processing steps that they may not have taken, like recoding the 999,999 incomes and converting hours worked to numeric. Another analyst may also use different variables to control for differences in income. For example, we used whether a person had a child as a control variable, and another analyst may choose other control variables based on their own analysis, which would cause a difference in our estimates. We also chose to look at a specific subset of the data, the 25-65 age group, and another analyst may not use this same subset.

Question 22

Finally, how much does your analysis tell you about the causes of gender differences in labor market earnings?

Answer: # We can see that having a child in the past year is associated with average income that is $3,470 lower. Since only women can have children, this puts them at a disadvantage in terms of their income if they are planning to have kids. We also saw that the gender gap is larger in the prime working years. When we subsetted the data to ages 25-65, the weighted gap rose from $16,181 to $20,108 compared with including all ages. This suggests that the gap is higher during prime working ages, when many women go on parental leave and have to deal with more discrimination in the workplace.

Question 23

Based on the findings, in what ways might Isabella be right to be angry about the findings? In what ways can one argue that the issue is exaggerated? Provide support using evidence from your linear regression model(s), especially the covariates.

Answer: # Isabella may be angry that having a baby in the last year is associated with a salary that is $3,470 lower. This applies to her since she recently had a baby. She may also be angry that even after we added multiple control variables and weights to predict income more accurately, men still make about $13,700 more than women on average. One can argue the issue is exaggerated without weights and controls for variables such as marriage, education, employment, and having a child: the raw gap of $16,568 falls to $13,723 once they are added. Overall, women have different priorities in life compared to men, such as raising kids. Men focus more on making money for status and providing for their families. It is very hard to compare the incomes of working men and women, since some women may have lower income due to issues such as having a child, while men do not face these same issues.

Question 24

Put on your problem-solving hat. Describe one or two specific ways the government, institutions, and/or companies can help close the gender wage gap. Describe one or two specific ways that families and/or individuals can help close the gap.

Answer: #Institutions #1. Institutions can help close the gender wage gap by not discriminating against female workers, whether by paying them less because they are women or by offering them fewer working hours. #2. Institutions can also give paid medical leave to women who were recently pregnant and make sure their role is not taken. #Families/Friends #1. Families and friends can be there for women who were recently pregnant by helping them take care of their children, so that they are not away from work for too long and their income is affected less.