The Current Population Survey

In this assignment you will use linear regression to conduct some t-tests using data from the 2017 Current Population Survey (CPS). The CPS is conducted by the US Bureau of Labor Statistics and the US Census Bureau. There is a monthly component, which is one source for our “monthly labor figures” that make the news each month. To collect these data, respondents are followed for four months, then have eight months off, and are then followed for four months again. In this way, we can see the stability of work from month-to-month and from year-to-year for each individual.

The data we will use are from something called the “March Supplement” to the CPS. In March, respondents receive an extensive survey to collect a lot of their demographic information that does not change from month-to-month.

I’ve provided a file that has every individual between the ages of 25 and 29, plus a few of the columns. Some of the columns are originals from the dataset, so you’ll have to look at the codebook for them. For a few columns I’ve done the processing myself because it would be too much for this lab.

Load the packages and data

This will load tidyverse, and then load in the data and create an object called cps_raw

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cps_raw <- read_csv('cps2017.csv')
## Rows: 185914 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): H_SEQ, PF_SEQ, A_LINENO, A_SPOUSE, PE_COHAB, RRP, PERRP, RACE_R
## dbl (6): A_SEX, EDUC, HISPANIC, A_AGE, A_WKSTAT, WSAL
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A few of the columns are:

  • H_SEQ: A household sequence number. A sort of id. People in the same household have the same H_SEQ.
  • PF_SEQ: A Family sequence number. Households are made up of one of more families. Roommates will be separate families, but the same household.
  • A_LINENO: The “Line Number” for each person. People in a household are numbered 1,2,3. etc. You can find unique individuals by combining the Household Sequence number. Other variable here use the line number to reference spouses and cohabiting partners.
  • A_SPOUSE: The line number to the person’s spouse. ‘00’ if unmarried. You’ll have to look this up in the codebook.
  • PE_COHAB: You’ll have to look this up in the codebook.
  • PERRP: The Person’s relation to the householder. This contains information about non-married partners. You’ll have to look this up in the codebook.
  • A_SEX: The person’s sex. You’ll have to look this up in the codebook.
  • EDUC: The person’s number of years of education. I created this (from A_HGA), so you won’t find it in the codebook.
  • HISPANIC: 1 is Hispanic. 2 if not.
  • RACE_R: The Person’s race. I create this, so you won’t find it in the codebook.
  • A_AGE: The person’s age.
  • A_WKSTAST: Work Status. You’ll have to look this up in the codebook.
  • WSAL: Income from wages and salary last year.

You can find the complete codebook on Canvas. You will have to download it for this lab.

The hypothesis

You will test whether women ages 25-29 have the same income as men the same ages and create confidence intervals for the difference.

We need to think carefully to make sure that we are comparing the same types of people. In particular, it is common in American households for women to work less outside of the home than for men to do so. So we will look at just men and women who are not married. But marriage in the US is changing too, and more and more couples are choosing to remain unmarried, so we will also try to look at men and women who are not only not married, but are also not living with a partner. Additionally, some people don’t work full time, and so we should not compare the annual wages and salary of people who work full time with the annual wages and salary of people who do not. So there is a bit tidyverse cleaning ahead of us.

Data Cleaning

Here are some data cleaning steps. In particular, we need to

  1. Use mutate to create a factor (categorical) variable out of Sex.
  2. filter out people that are single.
  3. filter out people who are not working full time.

Create a new factor variable for Sex.

The A_SEX variable has 1’s and 2’s in it. Not categories of male and female. Look in the codebook on page 8-15. You’ll see that Males are ‘1’ and Females are ‘2’. You’ll have to do this on your own, so go ahead and look now, rather than just taking my word for it. I will just add this to the cps_raw data frame.

 cps_raw <- cps_raw %>%
     mutate(SEX = factor(A_SEX, levels = c('1','2'), labels = c('Male','Female')))

The factor function has arguments (col, levels, labels), the col is the column that has the data we want to convert, the levels has what the original levels are, and the labels has what we want to call those levels. 1 will become ‘Male’ and 2 will become ‘Female’.

Filter out married people

Create a new data.frame cps_single that has filtered on people that are single. This requires looking at both the A_SPOUSE variable and the PECOHAB variable. I’ve done the first for you, but you’ll have to look up the second. Uncomment the code in this block and add the correct filter for PECOHAB. The codebook will tell what value means that there is no partner present.

cps_single <- cps_raw %>%
  filter(A_SPOUSE == '00') %>%
  filter(PE_COHAB == '-1')

Filter on people in their 20s

Now create a new dataset cps_single_20s that filters away everyone not ages 25-29 (filter on people >= 25 and ‘<30’)

cps_single_20s <- cps_single %>% filter(cps_single$A_AGE >= 25 & cps_single$A_AGE < 30)

Filter on people employed full time.

Finally, create the final dataset cps_final that just has people employed full time. This uses the WKSTAT variable, but you’ll have to look it up in the codebook. Only include those who are full time now.

cps_final <- cps_single_20s %>% filter(cps_single_20s$A_WKSTAT == 2)

You’re dataset should contain 3,475 individuals. If not, correct before continuing.

 dim(cps_final) 
## [1] 3475   15
# Calculate the dimension of the data. Should be 3475 x 15

Data Visualization and Non-Normality

Before doing a t-test, you should check the distribution of the data with tables and charts. We want either (1) a dataset that is approximately normal, or (2) a dataset that is large enough that we don’t care what distribution it has.

Create a table that summarizes the data by sex, calculating the min, median, mean, sd and max of the variable WSAL:

cps_final %>%
        group_by(SEX) %>%
        summarize(min = min(WSAL),
                  median = median(WSAL),
                  mean = mean(WSAL),
                  sd = sd(WSAL),
                  max=max(WSAL))
## # A tibble: 2 × 6
##   SEX      min median   mean     sd     max
##   <fct>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1 Male       0  35000 42505. 44413. 1099999
## 2 Female     0  30000 36656. 33162.  800000

Draw a histogram of salary.

I’ve decided to do this one for you.

ggplot(data=cps_final, aes(x=WSAL/1000)) + geom_histogram(binwidth = 25, center=12.5) +
  labs(x='Wages and Salary last year (Thousands of dollars)')

Draw a boxplot of salary, by SEX

WSAL <- cps_final$WSAL
WSAL <- as.factor(WSAL)
SEX <- cps_final$SEX

p <- ggplot(cps_final, aes(x= WSAL/1000, y= SEX))+ geom_boxplot(outlier.colour="red", outlier.shape=16,outlier.size=1, notch=TRUE) + labs(y= 'wages and salary last year (thousands of dollars')

p

The distribution is really right skewed.

We’ll talk more about transforms later this semester, but when data are right skewed like this, it is common to try a log-transform of the data.

Plot a log-transformed wage variable

So let’s plot a log histogram. First, you should notice from the summary tables that there are people with zero income last year - even though they are full time now. When log transforming data with zeros, it common to do log(x + small number). Let’s create a new variable that is Wages plus 1.

cps_final <- cps_final %>%
  mutate(LOG_WSAL = log(WSAL+1))

Log_WSAL <- cps_final$LOG_WSAL

And plot that histogram using ggplot

ggplot(data=cps_final, aes(x=Log_WSAL)) + geom_histogram(binwidth = 8, center=1.8) + labs(x='Wages and Salary last year (Ten Thousands of dollars)')

And create side-by-side boxplots for Males and Females

ggplot(cps_final, aes(x= Log_WSAL, y= SEX))+ geom_boxplot(outlier.colour="red", outlier.shape=16,outlier.size=1, notch=TRUE) + labs(y= 'log of wages and salary last year')

That’s a little more normal. Now there’s a long left tail instead of right tail.

I see at least two options for going forward:

  1. Use the original data. They’re not normal, but maybe we can appeal to the Central Limit Theorem because the sample size is large.
  2. Go with the log-transformed data plus one. This is more normal, but it may be hard to explain what a logged variable is.

A third option is to filter out everyone who didn’t earn money last year. This is okay, but you have to recognize that we are now changing our population to peole who are employed this year and who earned some money last year. This might be justified because it creates a population of people with relatively steady employment. We can even create a threshold, like people who made at leat $5,000 last year. It’s a little arbitrary, but it could be justified.

Even if you did that, you still might want to log transform. I will finish out this section by filtering out people making less than 5,000 and produce the boxplots:

cps_final2 <- cps_final %>% filter(WSAL>=5000)
ggplot(data=cps_final2, aes(x=SEX,y=LOG_WSAL)) + geom_boxplot() + 
  labs(y='log of wages and salary last year (for people who earned > 5000)')

I am comfortable with with this: the population subset can be justifiable, and their aren’t any extreme outliers to throw off the mean and standard deviation. The histogram is a little left skewed, but I think that the sample is large enough that we can assume the CLT is at work.

Conducting t-tests

Question 1.

Using linear regression, conduct a t-test for the difference in means between Males and Females using the (unlogged) wages and salary:

model_WSAL <- lm(WSAL~SEX, data=cps_final2)
summary(model_WSAL)
## 
## Call:
## lm(formula = WSAL ~ SEX, data = cps_final2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -40179  -18672   -8179    8328 1054820 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45179.4      948.1   47.65  < 2e-16 ***
## SEXFemale    -6507.2     1381.7   -4.71 2.59e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39480 on 3275 degrees of freedom
## Multiple R-squared:  0.006727,   Adjusted R-squared:  0.006424 
## F-statistic: 22.18 on 1 and 3275 DF,  p-value: 2.585e-06

- Explain what the estimate for the Intercept and slope mean in this model.

In this model, starting at the intercept of \(45,179.4\), each point decreases by \(6,507.2\). This means that female wages decrease by \(6,507.2\) at each 1 level increase compared to a male salary.

- Interpret the p-value for the intercept coefficient. Is it scientifically interesting? Why?

The p-value here, 2e-16, which is incredibly small, and lower than 0.05. This means there is a relationship between the variables, and we can reject the null hypothesis.

- Interpret the p-value of the slope coefficient.

The p-value here, 2.585e-06, which is very low, and lower than 0.05. This means there is a relationship between the variables, and we can reject the null hypothesis.

- Calculate a 95 percent confidence interval for the difference and explain what it means.

confint(model_WSAL, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) 43320.504 47038.368
## SEXFemale   -9216.257 -3798.136

This confidence interval can be written as (-9216.257, -3798.136), and does not contain 0 in its values. This means that there IS a statistically significant correlation between our variables!

Question 2

Use linear regression to conduct a t-test for the difference in means between Males and Females using the (logged) wages and salary (I’ve done this one for you, but you’ll have to uncomment it):

 model_WSAL2 <-  lm(LOG_WSAL~SEX, data=cps_final2)
 summary(model_WSAL2)
## 
## Call:
## lm(formula = LOG_WSAL ~ SEX, data = cps_final2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9728 -0.3635  0.0011  0.3912  3.4207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.49016    0.01533 684.360  < 2e-16 ***
## SEXFemale   -0.12339    0.02234  -5.524 3.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6383 on 3275 degrees of freedom
## Multiple R-squared:  0.00923,    Adjusted R-squared:  0.008927 
## F-statistic: 30.51 on 1 and 3275 DF,  p-value: 3.582e-08

- Interpret the p-value of the coefficient.

This p-value is incredibly small at 3.58e-08. This is lower than 0.05, which means we can reject the null hypothesis.

- Compare the t-statistic with the logged data to that of the unlogged data. Which is larger? What about the logged and unlogged data cause the t-statistic to be larger?

The t-statistic with the logged data is smaller than the data of the unlogged data. Because taking the log of a number produces a smaller product, it is sensible that the overall t-statistic would be smaller in turn. The log values will have a smaller mean, standard deviation, and standard error, ultimately producing a smaller t-statistic.

Question 3

Rather than thinking about differences in means of logged wages, it’s possible to exponentiate the coefficients to “undo” the log. This slightly changed the interpretation, however. This number then represents the proportion of the average female’s income to that of males. For example, a 0.9 would mean that the average woman earns 90% of the average man (10 % less than), and 1.05 would mean the average woman makes 105% (5 % more than) the salary of the average man.

Transform the estimate into a multiplicative factor and explain the answer. I did this one for you, but you’ll have to uncomment it.

exp(-.1182)
## [1] 0.8885183

The result for this function is 0.8885183, or 88.9%. This means that the average woman earns 88.9% of the average man, or 11.1% less than the average man.

Calculate a 95 percent confidence interval for the multiplicative factor: First, calculate the intervals on the log scale

exp(confint(model_WSAL2))
##                    2.5 %       97.5 %
## (Intercept) 3.489528e+04 3.705710e+04
## SEXFemale   8.460422e-01 9.234963e-01

T Test for differences in education.

Women make about 85 percent to 93 percent of their male peers. Income is related to education. That’s why you’re all in school.
Perhaps women are less educated on average?

Question 4

Perform a t-test to compare the education levels of men and women in the cps_final2 dataset.

 EDUC <- cps_final2$EDUC
 model_EDUC <- lm(EDUC~SEX, data=cps_final2)
 summary(model_EDUC)
## 
## Call:
## lm(formula = EDUC ~ SEX, data = cps_final2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4666  -1.8904  -0.4666   1.5334   7.1096 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.89043    0.05796 239.662  < 2e-16 ***
## SEXFemale    0.57620    0.08446   6.822 1.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.413 on 3275 degrees of freedom
## Multiple R-squared:  0.01401,    Adjusted R-squared:  0.01371 
## F-statistic: 46.54 on 1 and 3275 DF,  p-value: 1.066e-11

- Interpret the coefficients.

This t-statistic test demonstrates that as we increase by one level, a woman’s education level actually increases by 0.57620. This slope suggests that women are actually more educated than men, but we cannot determine this for sure yet.

- Interpret the p-value and the coefficient.

This p-value is truly tiny, calculating to 1.07e-11. This strongly suggests that we can reject the null hypothesis since it is less than 0.05, suggesting that there could be, with further testing, a relationship between the levels of education between sexes. This could be confirmed with a confidence interval.

- Evaluate whether the data support a hypothesis that women and mean have the same average education.

This data does NOT support a hypothesis that women and men have the same average education. I can make this claim for several reasons. The p-value is far far lower than 0.05, which rejects the null hypothesis. Additionally, upon calculating a confidence interval, the range does NOT include zero, implying there is a correlation between education and sex. This proves that men and women do NOT have the same average education.

confint(model_EDUC)
##                  2.5 %     97.5 %
## (Intercept) 13.7767885 14.0040650
## SEXFemale    0.4105893  0.7418041
# This is the code I used to calculate the confidence interval.

T Test for college-educated persons

We can try to control for the effect of education by only looking at people with 16 years of education. Create a new dataset cps_final_college that takes cps_final2 but keeps only those people with exactly 16 years of education.

cps_final_college <- cps_final2 %>%
  filter(EDUC == '16')

Question 5

Now repeat the t-tests for the unlogged and logged salary from this new dataset.

- How big is this sample?

This sample is 984. I determined this by observing the listed degrees of freedom– 983.

- Interpret your p-values and calculate 95 percent confidence intervals for both cases.

For the unlogged WSAL, the p-value was 0.695. This means that we fail to reject the null hypothesis. Its confidence interval produces the values (-1.092001e-19, 1.63847e-19), which DOES include zero. When this occurs, we officially fail to reject the null hypothesis, and these variables do not have a statistically significant correlation.

For the logged WSAL, the p-value was 0.726, which is much higher than 0.05. This means that we fail to reject the null hypothesis. Its confidence interval produces the values (-7.96262e-15, 1.143002e-14), which DOES include zero. When this occurs, we officially fail to reject the null hypothesis, and these variables do not have a statistically significant correlation.

- Be sure to convert your results for logged income into a multiplicative factor.

When converting the confidence interval into a multiplicative factor, the interval is much different. Though it still does not include 0, the number produced when converting into a multiplicative factor via exp(1.734e-15) = 1. The new confidence interval is (1, 1). This means there is not a statistically significant correlation still between the variables.

- Interpret your results to determine if they support the hypothesis that 20-something men and women with a college education and with stable jobs have the same average income.

These results indicate there is no correlation between college education and having the same average income. Because we fail to reject the null hypothesis, we cannot support the alternative hypothesis suggesting that there IS a correlation. Additionally, because each confidence interval does include 0, these results are not statistically significant, cementing the lack of relationship between college education and average income between sexes.

model_WSAL_college <- lm(EDUC~WSAL, data=cps_final_college)
summary(model_WSAL_college)
## 
## Call:
## lm(formula = EDUC ~ WSAL, data = cps_final_college)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.896e-12  2.710e-15  3.130e-15  3.450e-15  4.220e-15 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 1.600e+01  4.644e-15 3.445e+15   <2e-16 ***
## WSAL        2.732e-20  6.957e-20 3.930e-01    0.695    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.241e-14 on 983 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:  0.4995 
## F-statistic:   983 on 1 and 983 DF,  p-value: < 2.2e-16
confint(model_WSAL_college)
##                     2.5 %      97.5 %
## (Intercept)  1.600000e+01 1.60000e+01
## WSAL        -1.092001e-19 1.63847e-19
Log_WSAL <- cps_final_college$LOG_WSAL
model_WSAL_college2 <- lm(EDUC~Log_WSAL, data=cps_final_college)
summary(model_WSAL_college2)
## 
## Call:
## lm(formula = EDUC ~ Log_WSAL, data = cps_final_college)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.896e-12  2.370e-15  2.880e-15  3.410e-15  6.680e-15 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 1.600e+01  5.282e-14 3.029e+14   <2e-16 ***
## Log_WSAL    1.734e-15  4.941e-15 3.510e-01    0.726    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.241e-14 on 983 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:  0.4995 
## F-statistic: 982.9 on 1 and 983 DF,  p-value: < 2.2e-16
confint(model_WSAL_college2)
##                    2.5 %       97.5 %
## (Intercept)  1.60000e+01 1.600000e+01
## Log_WSAL    -7.96262e-15 1.143002e-14
exp(confint(model_WSAL_college2))
##               2.5 %  97.5 %
## (Intercept) 8886111 8886111
## Log_WSAL          1       1

Question 6

Summarize your results. Assess whether you trust the findings: Do you think the effect is correct? If so, explain why the sample gives you confidence that you’ve found a true difference? If you think the results are too high or too low, Why? What else do you think might explain the difference in wages and salary between the men and women in our sample that is not accounted for?

I believe that my findings are generally correct. While we cannot absolutely determine which hypothesis is correct or not, we can definitely assume that we cannot reject the null hypothesis. This alone makes me confident that there is a lack of correlation. This is further cemented through the inclusion of 0 in both confidence intervals.

I fully believe the original test comparing sex and average income has accurate reports. As a woman raised by a working mother, I know the wage gap all too well. The proposed slope is very logical, and only further reinforces my own personal experiences with the gender wage gap. Education was an interesting variable to look at, as my mother has 5 degrees and still makes less than the men working the same position as her with less degrees. I believe it to be incredibly possible that women have higher education than men. This could very well be due to the wage gap itself– in order to stand the same alongside one’s counterparts you have to be OVERLY qualified.

I believe there could be additional factors besides just education. It could be important to take into account the years of experience one has in a position or how their salary is measured– is it a sales position with additional income? Or is this job a set monthly payment? This could be important to consider when making these conclusions. However, I am generally content with these results.

Parting thoughts

  1. Working with data is iterative. I started this by thinking that I just needed to mutate and filter a few things. But as the analysis proceeded, problems cropped up that required me created new variables and new filters. This is not unusual. It’s impossible to foresee every problem. But at each step, you need to think about how you are changing the data or subsetting the data, and decide how that changes the population you are studying and how you interpret your results.

  2. Controlling for factors is tedious with simple t-tests between means. Here, we have tried to control for marital status, age and education by picking subsets of data. But we’ve narrowed ourselves to such a sliver of the population that we can’t say anything about general trends anymore. After the mid-term we’ll do “multiple linear regression” which will us to control for many things all at once.