Setup

Load packages

library(tidyverse)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

Our data come from the General Social Survey (GSS). The GSS has gathered data on core demographic, behavioral, and attitudinal trends in the United States regularly since 1972. Every year or two, the GSS conducts approximately 1500 interviews. For 1972-1976 the GSS used a block quota methodology; since then it has selected interview subjects using a full probability sample design. Results are generalizable to the non-institutionalized adult US population. One caveat: the GSS only interviews people who live in households. Approximately 10% of young adults and 10% of elderly adults do not live in households, but instead live in dorms, military barracks, assisted living, or similar facilities.

The GSS is an observational study and so we cannot infer causality from these data.


Part 2: Research questions

We are interested in the following research questions: What are the most important factors that inform job preferences? Have job preferences changed over the last 40 years? How does education affect job preferences? These questions are critical to forming effective jobs and employment policies in the United States.

Since its inception, GSS has tracked five factors of job preference. The survey asks each respondent the following:
(a) Please look at this card and tell me which one thing on this list you would most prefer in a job? (b) Which comes next? (c) Which is third most important? (d) Which is fourth most important?
  • High income
  • No danger of being fired
  • Working hours are short, lots of free time
  • Chances for advancement
  • Work important and gives a feeling of accomplishment

Each respondent ranks these five traits, and the rankings are recorded in the variables jobinc, jobsec, jobhour, jobpromo, and jobmeans.

Our educational analysis uses the variable degree, which records the highest degree attained by each respondent.
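
For orientation, here is one way to take a quick peek at these variables and their types (a sketch, not part of the original output; glimpse() ships with the tidyverse):

# Sketch: a quick look at the research variables
gss %>%
  select(year, jobinc, jobsec, jobhour, jobpromo, jobmeans, degree) %>%
  glimpse()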


Part 3: Exploratory data analysis

Each of our five job-related research variables is a factor with five possible values. For example, here is a summary of jobinc:

gss %>% count(jobinc)
## # A tibble: 6 x 2
##      jobinc     n
##      <fctr> <int>
## 1 Most Impt  4429
## 2    Second  4974
## 3     Third  6034
## 4    Fourth  3607
## 5     Fifth  1528
## 6        NA 36489

The large number of NAs reflects the fact that this survey question lies outside the “core” GSS, so it is not asked in every interview. The other four variables – jobsec, jobhour, jobpromo, and jobmeans – each have roughly 36490 NAs as well. We have confirmed that these NA values overlap almost entirely: they come from essentially the same 36490 respondents. One way to check this overlap is sketched below.
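
This check (a sketch, not part of the original output) counts how many of the five job factors are missing for each respondent; if the NAs overlap, nearly every row should show either 0 or 5 missing:

# Sketch: how many of the five job factors are NA for each respondent?
gss %>%
  mutate(n_missing = is.na(jobinc) + is.na(jobsec) + is.na(jobhour) +
           is.na(jobpromo) + is.na(jobmeans)) %>%
  count(n_missing)

Removing all rows with an NA in any of the five job factors loses a total of 36493 responses. We filter these out and review the general distribution: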

gss_jobpref <- gss %>% 
  filter(!is.na(jobinc),!is.na(jobsec),!is.na(jobhour),!is.na(jobpromo),!is.na(jobmeans))

gss_jobpref %>%
  gather(jobinc,jobsec,jobhour,jobpromo,jobmeans,key=job_factor,value=job_factor_rank) %>%
  mutate(
    job_factor=factor(job_factor, levels=c("jobinc","jobsec","jobhour","jobpromo","jobmeans")),
    job_factor_rank=factor(job_factor_rank, levels=c("Fifth","Fourth","Third","Second","Most Impt"))) %>%
  ggplot(aes(x=job_factor_rank)) + geom_bar() + 
  facet_wrap(~job_factor) + 
  coord_flip() +
  ggtitle("Overall ranking of five job factors:\nIncome, security, hours, promotions, meaning") + 
  xlab("Job factor rank") +
  ylab("Number of respondents")

We see that jobmeans received the most first choices, and jobhour received the most fifth (last) choices. Beyond that, the results become harder to interpret. For example, jobinc is the second most popular first choice, while jobpromo is the most popular second choice. Which of these is “really” the second most popular factor after jobmeans? The answer is not clear because of the dependencies that intertwine our five variables.

We simplify matters by reducing each 1-5 ranking to a simple TRUE or FALSE, which indicates if that factor received a top-two ranking or not:

top2 <- function(x) {
  return(x=="Most Impt" | x=="Second")
}

gss_jobtop2 <- gss_jobpref %>%
  select(year,jobinc,jobsec,jobhour,jobpromo,jobmeans,degree) %>% 
  mutate(jobinc=top2(jobinc),
         jobsec=top2(jobsec),
         jobhour=top2(jobhour),
         jobpromo=top2(jobpromo),
         jobmeans=top2(jobmeans))

Let’s double-check the correctness of this transformation. Every person should have exactly two “top 2s” out of five. We see that this is in fact the case:

# Check for correctness: Every person should have exactly two out of five TRUE.
gss_jobtop2 %>% 
  summarise(avg_no_top2s=mean(jobinc+jobsec+jobhour+jobpromo+jobmeans),
            sd_no_top2s=sd(jobinc+jobsec+jobhour+jobpromo+jobmeans))
##   avg_no_top2s sd_no_top2s
## 1            2           0

Our transformation gives us (at least) three benefits. First, it eliminates information that we don’t really care about – namely the distinction between each person’s 3rd, 4th, and 5th choices. Second, its simplified structure allows us to reconstruct our summary as a single chart, instead of five charts. And third, with that single chart it suggests an unambiguous overall ranking of the factors, which we will investigate next.

# Proportion is number of TRUE divided by n. Sum of all proportions should be 2, which it is
(prop_jobpref <- gss_jobtop2 %>% 
  summarise(jobinc=sum(jobinc)/n(),
            jobsec=sum(jobsec)/n(),
            jobhour=sum(jobhour)/n(),
            jobpromo=sum(jobpromo)/n(),
            jobmeans=sum(jobmeans)/n(),
            n=n()) %>%
  gather(jobinc,jobsec,jobhour,jobpromo,jobmeans,key=job_factor,value=job_top2) %>%
  mutate(job_factor=factor(job_factor,levels=c("jobinc","jobsec","jobhour","jobpromo","jobmeans"))))
##       n job_factor  job_top2
## 1 20568     jobinc 0.4571179
## 2 20568     jobsec 0.2196130
## 3 20568    jobhour 0.1354045
## 4 20568   jobpromo 0.5295605
## 5 20568   jobmeans 0.6583042
prop_jobpref %>%
  ggplot(aes(x=job_factor,y=job_top2)) + geom_bar(stat = "identity") +
  ggtitle("Preference for each job factor among top 2") +
  xlab("Job factor") +
  ylab("Prop. of resp. choosing job factor 1st or 2nd")


Part 4: Inference

The previous chart suggests the following ranked order of job factors: (1) meaning, (2) promotions, (3) income, and far behind the first three, (4) security, and finally (5) hours. Let’s confirm this.

We will conduct pairwise comparisons. In each case the null hypothesis is that the pair of factors are chosen top-2 by the same proportion of Americans, and the alternative (two-sided) hypothesis is that they are chosen top-2 by different proportions. We will use a Bonferroni correction of K = 10 because there are 10 ways to pair up our 5 factors; equivalently, each comparison is judged at a significance level of 0.05/10 = 0.005. We could also use a chi-square test here, but we opt for pairwise comparisons so that we get statistical confirmation of the specific rankings that appear in the chart above. ANOVA is another tool one might consider for comparing means across groups, but ANOVA does not apply here because we are evaluating proportions, not means.

Our pairwise comparisons depend on two conditions. Independence: respondents are chosen randomly, so this condition is met. Success-failure: there are 20568 respondents, and every factor clearly has far more than 10 “successes” (TRUE) and 10 “failures” (FALSE), as the quick check below confirms.
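
This check (a sketch, not part of the original output) reports the smallest success and failure counts across the five factors, using the proportions computed earlier:

# Sketch: smallest success and failure counts across the five job factors
prop_jobpref %>%
  summarise(min_successes = min(job_top2 * n),
            min_failures  = min((1 - job_top2) * n))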

Before doing pairwise comparisons, we calculate 95% confidence intervals. (These rely on the same conditions we just confirmed.) We use an estimate of p=0.5 to calculate our confidence intervals, because we do not know the real proportion and 0.5 gives the most conservative (widest) possible intervals. In addition, we widen our confidence intervals with a Bonferroni correction, so that comparing them is less prone to Type I error.

# Two-sided 95% confidence interval with Bonferroni correction K = 5*(5-1)/2 = 10
K = 10
Z_bonferroni = qnorm(0.025/K,lower.tail = FALSE)

prop_jobpref %>% 
  mutate(upper=job_top2+Z_bonferroni*sqrt(0.5*(1-0.5)/n),
         lower=job_top2-Z_bonferroni*sqrt(0.5*(1-0.5)/n)) %>%
  ggplot(aes(x=job_factor)) + 
  geom_point(aes(y=job_top2)) + 
  geom_errorbar(aes(ymin=lower,ymax=upper)) +
  ggtitle("Comparing job factor preferences:\n95% confidence intervals with Bonferroni correction") +
  xlab("Job factor") + 
  ylab("Prop. of resp. choosing factor 1st or 2nd")

Even with the Bonferroni correction and p=0.5, our 95% confidence intervals are narrow compared to the differences between the factors’ top-2 proportions.
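
For a sense of scale, the half-width (margin of error) of each Bonferroni-adjusted interval works out to roughly 0.01, about one percentage point (a sketch using the Z_bonferroni and n values above):

# Sketch: margin of error with p = 0.5, Bonferroni-adjusted Z, and n = 20568
Z_bonferroni * sqrt(0.5 * (1 - 0.5) / 20568)   # roughly 0.01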

Let’s do a pairwise comparison of jobinc and jobpromo, because these two are the closest. We’ll use a pooled proportion. Note that n is the same for every group. Below we start with our table of proportions, then calculate n=20568, p_pooled=0.493, SE=0.00493, Z=-14.694, and p-value=7.03e-49.

prop_jobpref
##       n job_factor  job_top2
## 1 20568     jobinc 0.4571179
## 2 20568     jobsec 0.2196130
## 3 20568    jobhour 0.1354045
## 4 20568   jobpromo 0.5295605
## 5 20568   jobmeans 0.6583042
(n <- as.numeric(prop_jobpref[["n"]][1]))
## [1] 20568
(p_pooled <- (prop_jobpref[["job_top2"]][1]+prop_jobpref[["job_top2"]][4])/2)
## [1] 0.4933392
(SE <- sqrt((p_pooled*(1-p_pooled)/n) + (p_pooled*(1-p_pooled)/n)))
## [1] 0.00493004
(Z = (prop_jobpref[["job_top2"]][1] - prop_jobpref[["job_top2"]][4])/SE)
## [1] -14.69413
(p_value <- pnorm(Z,lower.tail = TRUE) * 2)
## [1] 7.029915e-49

Even for our closest pair of job factors, we have an astronomically small p-value. This result agrees with the confidence intervals, which are narrow compared to the gaps between the five job-factor top-2 proportions. Calculating p-values for the other 9 pairs would give even smaller values. We have statistically significant evidence that the job factors are preferred in the following order: (1) meaning, (2) promotions, (3) income, (4) security, and finally (5) hours.
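
For completeness, one way to run all 10 comparisons at once is base R’s pairwise.prop.test(), which adjusts the p-values (here with Bonferroni). This is a sketch rather than part of the analysis above; pairwise.prop.test() wraps prop.test(), which applies a continuity correction by default, so its p-values will differ slightly from the hand calculation:

# Sketch: all 10 pairwise comparisons with Bonferroni-adjusted p-values
successes <- round(prop_jobpref$job_top2 * prop_jobpref$n)
names(successes) <- as.character(prop_jobpref$job_factor)
pairwise.prop.test(successes, prop_jobpref$n, p.adjust.method = "bonferroni")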

Historical Analysis

With the historical richness of GSS, we next investigate how these proportions change over time:

# Calculate proportion of top-2s for each factor, grouped by year
(gss_yearly_jobtop2 <- gss_jobtop2 %>% group_by(year) %>%
  summarise(income=sum(jobinc)/n(),
            security=sum(jobsec)/n(),
            hours=sum(jobhour)/n(),
            promotions=sum(jobpromo)/n(),
            meaning=sum(jobmeans)/n(),
            n=n(),
            pre2010=as.logical(mean(year<2010))))
## # A tibble: 17 x 8
##     year    income  security      hours promotions   meaning     n pre2010
##    <int>     <dbl>     <dbl>      <dbl>      <dbl>     <dbl> <int>   <lgl>
##  1  1973 0.4268126 0.2120383 0.14842681  0.5355677 0.6771546  1462    TRUE
##  2  1974 0.3920220 0.1856946 0.16299862  0.5570839 0.7022008  1454    TRUE
##  3  1976 0.4285714 0.2242926 0.13181504  0.5424431 0.6728778  1449    TRUE
##  4  1977 0.4500000 0.2209459 0.12297297  0.5425676 0.6635135  1480    TRUE
##  5  1980 0.4672702 0.1789694 0.11768802  0.5389972 0.6970752  1436    TRUE
##  6  1982 0.5510204 0.3038549 0.11337868  0.4671202 0.5646259   882    TRUE
##  7  1984 0.4392783 0.2088827 0.09646079  0.5503123 0.7050659  1441    TRUE
##  8  1985 0.4513333 0.1846667 0.10066667  0.5866667 0.6766667  1500    TRUE
##  9  1987 0.5090293 0.1924379 0.11399549  0.5355530 0.6489842  1772    TRUE
## 10  1988 0.4524313 0.2040169 0.10782241  0.5422833 0.6934461   946    TRUE
## 11  1989 0.4642857 0.1683673 0.10306122  0.5418367 0.7224490   980    TRUE
## 12  1990 0.4693182 0.1829545 0.09659091  0.5318182 0.7193182   880    TRUE
## 13  1991 0.4878307 0.1830688 0.10370370  0.5111111 0.7142857   945    TRUE
## 14  1993 0.4811958 0.2285439 0.09836066  0.5159113 0.6759884  1037    TRUE
## 15  1994 0.4859611 0.2375810 0.10583153  0.4989201 0.6717063   463    TRUE
## 16  2006 0.4906542 0.2636849 0.15487316  0.4819760 0.6088117  1498    TRUE
## 17  2012 0.3563097 0.4316013 0.45387063  0.4443266 0.3138918   943   FALSE
yearly_jobpref <- gss_yearly_jobtop2 %>% 
  gather(income,security,hours,promotions,meaning,key=job_factor,value=job_top2)

yearly_jobpref %>% 
  ggplot(aes(x=year,y=job_top2,color=job_factor)) + 
  geom_point(size=2) + 
  geom_line(alpha=0.3) + 
  geom_smooth(aes(linetype=pre2010),se=FALSE) +
  ggtitle("Comparing job factor preferences over time") +
  xlab("Year") +
  ylab("Prop. of resp. choosing job factor 1st or 2nd")
## `geom_smooth()` using method = 'loess'

The popularity of each factor over time is plotted above with dots that are color coded by factor. Thin straight lines connect each set of dots to highlight each historical series. Curved solid lines show estimated trends for each set of dots, excluding the final year, 2012. Notice there is a striking change in every factor in 2012. We are curious why this is the case, but investigating it is beyond the scope of this project. We therefore omit 2012 data from the rest of our analysis for the sake of simplicity, and investigate history from 1973 to 2006.

We will also focus the rest of our analysis on the single most popular factor in job preference: meaning. It appears that this factor changes over time, perhaps decreasing overall from 1972 to 2006, with a possible upward trend around 1990. Is the dependence of jobmeans on year statistically significant?

Our null hypothesis is that “nothing is going on” with the popularity of meaning over time: the proportion of respondents ranking meaning in their top two is the same in every year. Our (two-sided) alternative hypothesis is that this proportion differs across years.

We will use a chi-square test to evaluate these hypotheses. The chi-square test is the appropriate tool for evaluating how well a set of observed counts of categorical data fits a null hypothesis. Our categorical data are the jobmeans responses, and we have 16 yearly observed counts from surveys conducted from 1973 through 2006. (Note: ANOVA is another tool available to us, but our categorical jobmeans data give proportions, not the group means required by ANOVA.)

The chi-square test depends on two conditions, both of which are met. Independence: GSS respondents from 1977 on are selected by a random sample. For 1972-1976 the block quota method is, strictly speaking, not a random sample, but it still meets our needs given the careful methods employed by the GSS. Sample size / distribution: there are roughly 800-1500 respondents per year, split fairly evenly between top-2 and not-top-2 responses, which puts every expected cell count well above the minimum requirement of 5.

Let’s start by calculating some preliminary 95% confidence intervals for a first take on this. The validity of our confidence intervals depends on the same two conditions we just confirmed. We use a pooled proportion p_hat_pooled based on all respondents from all years pre-2010. This is our best guess for the actual proportion p. Our calculation uses a weighted average of each year’s value to arrive at the pooled proportion:

pre2010_jobpref <- yearly_jobpref %>% filter(job_factor=="meaning",year<2010)

# Job_top2*n gives us the total number of people who chose meaning in top 2 that year
(p_hat_pooled <- as.double(pre2010_jobpref %>% summarise(p_hat=sum(job_top2*n)/sum(n))))
## [1] 0.6748535

With this pooled proportion, we calculate our 95% confidence intervals, shown below as vertical error bars. We are not using a Bonferroni correction in these calculations because we are not planning any pairwise comparisons. The red horizontal line indicates p_hat_pooled:

# Two-sided 95% confidence interval
Z = qnorm(0.025,lower.tail = FALSE)

# SE = sqrt(p_hat_pooled*(1-p_hat_pooled)/n) where n is # resp for each year
pre2010_jobpref <- pre2010_jobpref %>%
    mutate(upper=job_top2+Z*sqrt(p_hat_pooled*(1-p_hat_pooled)/n),
           lower=job_top2-Z*sqrt(p_hat_pooled*(1-p_hat_pooled)/n))

pre2010_jobpref %>% 
  ggplot(aes(x=year)) + 
  geom_point(aes(y=job_top2)) + 
  geom_errorbar(aes(ymin=lower,ymax=upper)) + 
  geom_hline(yintercept=p_hat_pooled,color="red") +
  ggtitle("Job meaning chosen top-2 over time: 95% confidence intervals") +
  xlab("Year") + 
  ylab("Prop. of resp. choosing jobmeans 1st or 2nd")

This chart reproduces the green points from the previous graph, with the addition of 95% confidence intervals and p_hat_pooled. Based on the confidence intervals we see here and their overlap with the constant red line of p_hat_pooled, we have reason to be cautious about accepting the trend curve from the previous graph. It also appears that something unusual happened with the 1982 results. Investigating 1982 is beyond our scope, and we will simply leave 1982 in our data as-is.

We use a chi-square analysis to investigate more carefully whether jobmeans depends on the year. To perform this analysis we put our data in a table that counts TRUEs and FALSEs for each year (TRUE = success = jobmeans in the top 2; FALSE = failure = jobmeans not in the top 2).

(pre2010_matrix <- pre2010_jobpref %>% 
  group_by(year) %>%
  summarise(meaning_top2=sum(n*job_top2),meaning_not_top2=sum(n*(1-job_top2))) %>%
  select(meaning_top2,meaning_not_top2))
## # A tibble: 16 x 2
##    meaning_top2 meaning_not_top2
##           <dbl>            <dbl>
##  1          990              472
##  2         1021              433
##  3          975              474
##  4          982              498
##  5         1001              435
##  6          498              384
##  7         1016              425
##  8         1015              485
##  9         1150              622
## 10          656              290
## 11          708              272
## 12          633              247
## 13          675              270
## 14          701              336
## 15          311              152
## 16          912              586
chisq.test(pre2010_matrix)
## 
##  Pearson's Chi-squared test
## 
## data:  pre2010_matrix
## X-squared = 125.41, df = 15, p-value < 2.2e-16

Our chi-square test statistic is 125.41 with a vanishingly small p-value. This allows us to reject our null hypothesis. We have statistically significant evidence that jobmeans does depend on year (from 1973 to 2006). Note that our test does not confirm a specific trend (such as the downward trend we have suggested), nor does it say anything about causality. The test merely confirms statistically that there is a dependence between jobmeans and year.
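
As a sanity check (a sketch, not part of the original output), the same statistic can be reproduced by hand from the expected counts under the null, where each expected count is row total times column total divided by the grand total:

# Sketch: reproducing the chi-square statistic by hand
observed <- as.matrix(pre2010_matrix)
expected <- rowSums(observed) %o% colSums(observed) / sum(observed)
sum((observed - expected)^2 / expected)   # should match X-squared = 125.41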

Educational Deep Dive

Let’s dive even deeper and see how education affects our findings. We will continue to omit 2012 from consideration. We will use the variable degree as a starting point. This splits respondents into five buckets. The three most educated buckets are significantly smaller than the two least educated buckets, so we merge those three into “College-Plus”:

gss_jobtop2 %>% filter(year<2010) %>% count(degree)
## # A tibble: 6 x 2
##           degree     n
##           <fctr> <int>
## 1 Lt High School  4936
## 2    High School 10080
## 3 Junior College   754
## 4       Bachelor  2299
## 5       Graduate  1098
## 6             NA   458
# Convert degree variable into 3 buckets, with college-plus holding jr college and beyond
edu_3 <- function(x) {
  return(ifelse(x=="Junior College" | 
                  x=="Bachelor" | 
                  x=="Graduate", "College-Plus",
                as.character(x)))
}

# Summarise job priorities by education and by year
edu3_yearly_jobpref <- gss_jobtop2 %>% filter(year<2010) %>%
  mutate(education=edu_3(degree)) %>% 
  group_by(education,year) %>% 
  summarise(income=sum(jobinc)/n(),
            security=sum(jobsec)/n(),
            hours=sum(jobhour)/n(),
            promotions=sum(jobpromo)/n(),
            meaning=sum(jobmeans)/n(),
            n=n())
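
Before charting, a quick sanity check (a sketch) of the merged bucket sizes; the College-Plus count should equal the sum of the three merged degree levels above (754 + 2299 + 1098 = 4151):

# Sketch: sizes of the merged education buckets (plus NA)
gss_jobtop2 %>% filter(year < 2010) %>% count(education = edu_3(degree))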

Let’s look at overall job factor preferences for each of our three levels of education. We are still excluding 2012 but we are briefly bringing back all five factors for this chart:

# Peel off year and calculate job overall priorities by education
edu3_jobpref <- edu3_yearly_jobpref %>%
  summarise(income=sum(income*n)/sum(n),
            security=sum(security*n)/sum(n),
            hours=sum(hours*n)/sum(n),
            promotions=sum(promotions*n)/sum(n),
            meaning=sum(meaning*n)/sum(n),
            n=sum(n))

edu3_jobpref %>% 
  gather(income,security,hours,promotions,meaning,key=job_factor,value=job_top2) %>%
  mutate(job_factor=factor(job_factor,levels=c("income","security","hours","promotions","meaning"))) %>%
  ggplot(aes(x=job_factor,y=job_top2)) + 
  geom_bar(stat="identity") + 
  facet_wrap(~education) +
  ggtitle("Preferred job factors by education") +
  xlab("Job factor") + 
  ylab("Prop. of resp. choosing factor in top 2")

It appears that meaning is significantly more popular with more educated respondents than it is with those less educated. We can confirm this statistically with a chi-square test on the counts of how many respondents chose jobmeans in the top 2 or not, for each educational level (1=college-plus, 2=high school, 3=less than high school, 4=NA). At the same time we can confirm by inspection that the conditions are met for a valid chi-square test.

(pre2010_edu3_matrix <- edu3_jobpref %>% transmute(meaning_top2=meaning*n,meaning_not_top2=(1-meaning)*n))
## # A tibble: 4 x 2
##   meaning_top2 meaning_not_top2
##          <dbl>            <dbl>
## 1         3476              675
## 2         6902             3178
## 3         2629             2307
## 4          237              221
chisq.test(pre2010_edu3_matrix)
## 
##  Pearson's Chi-squared test
## 
## data:  pre2010_edu3_matrix
## X-squared = 1011, df = 3, p-value < 2.2e-16

Our chi-square test statistic of 1011 again gives a vanishingly small p-value and confirms the statistical significance of the differences we have seen in preference for job meaning across education levels.
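
As a quick check on the sample-size condition (a sketch using the expected counts that chisq.test() stores in its result), every expected cell count is far above 5:

# Sketch: expected counts under independence for the education table
chisq.test(pre2010_edu3_matrix)$expected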

Continuing our deeper dive, how has jobmeans changed over time for each level of education? We already have these data and simply need to chart them:

edu3_yearly_jobpref %>%
  ggplot(aes(x=year,y=meaning,color=education)) + 
  geom_point(size=2) + 
  geom_line(alpha=0.3) + 
  geom_smooth(se=FALSE) +
  ggtitle("Preference for job meaning over time, by education level") +
  xlab("Year") +
  ylab("Prop. of resp. choosing jobmeans 1st or 2nd")
## `geom_smooth()` using method = 'loess'

It appears that every educational level follows some form of a slight overall downward trend, with a possible brief uptick along the way. Meanwhile, those with no known education level (NA) are radically different: they appear to oscillate up and down between the lower bound of the less-than-high-school curve and the upper bound of the college-plus curve. We suspect this oscillation could be related to the actual (unknown) education levels within this NA group. More importantly, we hypothesize that each education level does in fact follow the same downward trend in its preference for job meaning over time. We leave that analysis for future work.

Summary

We investigated what Americans consider most important about their jobs. Meaning (a sense of importance and accomplishment) is the most preferred of the five factors considered. There is strong evidence that people with more education have a stronger preference for meaning in their jobs than do people with less education. Across Americans of all education levels, however, the relative importance of meaning appears to be changing over time, perhaps declining slightly.

Our research confirms a dependence on time but does not confirm a directional trend. More research is needed to confirm the directional trends we have suggested here. We also suggest that the 2012 GSS data be double-checked, since they appear drastically different from earlier years of the survey.