library(tidyverse)
library(statsr)load("gss.Rdata")Our data come from the General Social Survey (GSS). The GSS has gathered data on core demographic, behavioral, and attitudinal trends in the United States regularly since 1972. Every year (or sometimes every two years) the GSS conducts approximately 1500 interviews. GSS used a block quota methodology for 1972-1976. GSS has since then selected interview subjects based on a full probability sample design. Results are generalizable to the non-institutionalized adult US population. One caveat: The GSS only interviews people who live in households. Approximately 10% of young adults and 10% of elderly adults do not live in households, but instead live in dorms, military barracks, assisted living, or similar facilities.
The GSS is an observational study and so we cannot infer causality from these data.
We are interested in the following research questions: What are the most important factors that inform job preferences? Have job preferences changed over the last 40 years? How does education affect job preferences? These questions are critical to forming effective jobs and employment policies for America.
Since its inception, GSS has tracked five factors of job preference. The survey asks each respondent the following:Please look at this card and tell me which one thing on this list you would most prefer in a job? b. Which comes next? c. Which is third most important? d. Which is fourth most important?
- High income
- No danger of being fired
- Working hours are short, lots of free time
- Chances for advancement
- Work important and gives a feeling of accomplishment
Each respondent prioritizes these traits accordingly, and the results are in jobinc, jobsec, jobhour, jobpromo, and jobmeans.
Our educational analysis uses the variable degree which records the highest degree attained by each respondent.
Each of our five job-related research variables is a factor with five possible values. For example here is a summary of jobinc:
gss %>% count(jobinc)## # A tibble: 6 x 2
## jobinc n
## <fctr> <int>
## 1 Most Impt 4429
## 2 Second 4974
## 3 Third 6034
## 4 Fourth 3607
## 5 Fifth 1528
## 6 NA 36489
The large number of NAs is due to this survey question being outside the “core” GSS, so that it does not get asked in every interview. The other four variables – jobsec, jobhour, jobpromo, and jobmeans – also have roughly 36490 NAs. We have confirmed that these NA values are almost entirely overlapping. They come from essentially the same 36490 respondents. We remove all NAs for all five job factors, losing a total of 36493 responses, and review the general distribution:
gss_jobpref <- gss %>%
filter(!is.na(jobinc),!is.na(jobsec),!is.na(jobhour),!is.na(jobpromo),!is.na(jobmeans))
gss_jobpref %>%
gather(jobinc,jobsec,jobhour,jobpromo,jobmeans,key=job_factor,value=job_factor_rank) %>%
mutate(
job_factor=factor(job_factor, levels=c("jobinc","jobsec","jobhour","jobpromo","jobmeans")),
job_factor_rank=factor(job_factor_rank, levels=c("Fifth","Fourth","Third","Second","Most Impt"))) %>%
ggplot(aes(x=job_factor_rank)) + geom_bar() +
facet_wrap(~job_factor) +
coord_flip() +
ggtitle("Overall ranking of five job factors:\nIncome, security, hours, promotions, meaning") +
xlab("Job factor rank") +
ylab("Number of respondents")We see that jobmeans has received the most first choices, and jobhours has received the most fifth (last) choices. It gets more complicated after that to understand the results. For example, jobinc is the second most popular first choice, while jobpromo is the most popular second choice. Which of these is “really” the second most popular factor after jobmeans? The answer is not clear because of the dependencies that intertwine our five variables.
We simplify matters by reducing each 1-5 ranking to a simple TRUE or FALSE, which indicates if that factor received a top-two ranking or not:
top2 <- function(x) {
return(x=="Most Impt" | x=="Second")
}
gss_jobtop2 <- gss_jobpref %>%
select(year,jobinc,jobsec,jobhour,jobpromo,jobmeans,degree) %>%
mutate(jobinc=top2(jobinc),
jobsec=top2(jobsec),
jobhour=top2(jobhour),
jobpromo=top2(jobpromo),
jobmeans=top2(jobmeans))Let’s double-check the correctness of this transformation. Every person should have exactly two “top 2s” out of five. We see that this is in fact the case:
# Check for correctness: Every person should have exactly two out of five TRUE.
gss_jobtop2 %>%
summarise(avg_no_top2s=mean(jobinc+jobsec+jobhour+jobpromo+jobmeans),
sd_no_top2s=sd(jobinc+jobsec+jobhour+jobpromo+jobmeans))## avg_no_top2s sd_no_top2s
## 1 2 0
Our transformation gives us (at least) three benefits. First, it eliminates information that we don’t really care about – namely the distinction between each person’s 3rd, 4th, and 5th choices. Second, its simplified structure allows us to reconstruct our summary as a single chart, instead of five charts. And third, with that single chart it suggests an unambiguous overall ranking of the factors, which we will investigate next.
# Proportion is number of TRUE divided by n. Sum of all proportions should be 2, which it is
(prop_jobpref <- gss_jobtop2 %>%
summarise(jobinc=sum(jobinc)/n(),
jobsec=sum(jobsec)/n(),
jobhour=sum(jobhour)/n(),
jobpromo=sum(jobpromo)/n(),
jobmeans=sum(jobmeans)/n(),
n=n()) %>%
gather(jobinc,jobsec,jobhour,jobpromo,jobmeans,key=job_factor,value=job_top2) %>%
mutate(job_factor=factor(job_factor,levels=c("jobinc","jobsec","jobhour","jobpromo","jobmeans"))))## n job_factor job_top2
## 1 20568 jobinc 0.4571179
## 2 20568 jobsec 0.2196130
## 3 20568 jobhour 0.1354045
## 4 20568 jobpromo 0.5295605
## 5 20568 jobmeans 0.6583042
prop_jobpref %>%
ggplot(aes(x=job_factor,y=job_top2)) + geom_bar(stat = "identity") +
ggtitle("Preference for each job factor among top 2") +
xlab("Job factor") +
ylab("Prop. of resp. choosing job factor 1st or 2nd")The previous chart suggests the following ranked order of job factors: (1) meaning, (2) promotions, (3) income, and far behind the first three, (4) security, and finally (5) hours. Let’s confirm this.
We will conduct pairwise comparisons. In each case the null hypothesis is that the pair of factors are chosen top-2 by the same proportion of Americans, and the alternative (two-sided) hypothesis is that they are chosen top-2 by different proportions. We’ll use a Bonferroni correction of 10 because there are 10 ways to pair up our 5 factors. We could also use a chi-square test here, but we opt for pairwise comparisons so that we get statistical confirmation of the specific rankings that appear in our above chart. ANOVA is another tool one might consider to evaluate means across groups, but ANOVA does not appply here because we are evaluating proportions, not means, across groups.
Our pairwise comparison depends on two conditions: Independence: respondents are chosen randomly so this is OK. Success-failure: There are 20568 respondents, and clearly many more than 10 respondents of “success” (TRUE) and “failure” (FALSE) in every case.
Before doing pairwise comparisons, we calculate 95% confidence intervals. (These rely on the same conditions we just confirmed.) We use an estimate of p=0.5 to calculate our confidence intervals, because we don’t know what the real proportion is and 0.5 gives us the most conservative possible intervals. In addition we will expand our confidence intervals with a Bonferroni correction, so that comparing confidence intervals is less prone to Type 1 error.
# Two-sided 95% confidence interval with Bonferroni correction K = 5*(5-1)/2 = 10
K = 10
Z_bonferroni = qnorm(0.025/K,lower.tail = FALSE)
prop_jobpref %>%
mutate(upper=job_top2+Z_bonferroni*sqrt(0.5*(1-0.5)/n),
lower=job_top2-Z_bonferroni*sqrt(0.5*(1-0.5)/n)) %>%
ggplot(aes(x=job_factor)) +
geom_point(aes(y=job_top2)) +
geom_errorbar(aes(ymin=lower,ymax=upper)) +
ggtitle("Comparing job factor preferences:\n95% confidence intervals with Bonferroni correction") +
xlab("Job factor") +
ylab("Prop. of resp. choosing factor 1st or 2nd")Even with the Bonferroni correction and p=0.5, our 95% confidence intervals are small compared to the differences separating the proportional results of each factor.
Let’s do a pairwise comparison jobinc and jobpromo, because these two are the closest. We’ll use a pooled proportion. Note that n is the same for every group. Below we start with our table of proportions, then calculate n=20568, p_pooled=0.493, SE=0.00493, Z=-14.694, and p-value=7.03e-49.
prop_jobpref## n job_factor job_top2
## 1 20568 jobinc 0.4571179
## 2 20568 jobsec 0.2196130
## 3 20568 jobhour 0.1354045
## 4 20568 jobpromo 0.5295605
## 5 20568 jobmeans 0.6583042
(n <- as.numeric(prop_jobpref[["n"]][1]))## [1] 20568
(p_pooled <- (prop_jobpref[["job_top2"]][1]+prop_jobpref[["job_top2"]][4])/2)## [1] 0.4933392
(SE <- sqrt((p_pooled*(1-p_pooled)/n) + (p_pooled*(1-p_pooled)/n)))## [1] 0.00493004
(Z = (prop_jobpref[["job_top2"]][1] - prop_jobpref[["job_top2"]][4])/SE)## [1] -14.69413
(p_value <- pnorm(Z,lower.tail = TRUE) * 2)## [1] 7.029915e-49
Even for our closest pair of job factors, we have an astronomically small p-value. This result agrees with the confidence intervals, which are small compared to the gaps between each of the five job-factor top-2 proportions. When we calculate p-values for the other 9 pairs we will get even smaller p-values. We have statistically significant evidence that the job factors are prefered in the following order: (1) meaning, (2) promotions, (3) income, (4) security, and finally (5) hours.
With the historical richness of GSS, we next investigate how these proportions change over time:
# Calculate proportion of top-2s for each factor, grouped by year
(gss_yearly_jobtop2 <- gss_jobtop2 %>% group_by(year) %>%
summarise(income=sum(jobinc)/n(),
security=sum(jobsec)/n(),
hours=sum(jobhour)/n(),
promotions=sum(jobpromo)/n(),
meaning=sum(jobmeans)/n(),
n=n(),
pre2010=as.logical(mean(year<2010))))## # A tibble: 17 x 8
## year income security hours promotions meaning n pre2010
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <lgl>
## 1 1973 0.4268126 0.2120383 0.14842681 0.5355677 0.6771546 1462 TRUE
## 2 1974 0.3920220 0.1856946 0.16299862 0.5570839 0.7022008 1454 TRUE
## 3 1976 0.4285714 0.2242926 0.13181504 0.5424431 0.6728778 1449 TRUE
## 4 1977 0.4500000 0.2209459 0.12297297 0.5425676 0.6635135 1480 TRUE
## 5 1980 0.4672702 0.1789694 0.11768802 0.5389972 0.6970752 1436 TRUE
## 6 1982 0.5510204 0.3038549 0.11337868 0.4671202 0.5646259 882 TRUE
## 7 1984 0.4392783 0.2088827 0.09646079 0.5503123 0.7050659 1441 TRUE
## 8 1985 0.4513333 0.1846667 0.10066667 0.5866667 0.6766667 1500 TRUE
## 9 1987 0.5090293 0.1924379 0.11399549 0.5355530 0.6489842 1772 TRUE
## 10 1988 0.4524313 0.2040169 0.10782241 0.5422833 0.6934461 946 TRUE
## 11 1989 0.4642857 0.1683673 0.10306122 0.5418367 0.7224490 980 TRUE
## 12 1990 0.4693182 0.1829545 0.09659091 0.5318182 0.7193182 880 TRUE
## 13 1991 0.4878307 0.1830688 0.10370370 0.5111111 0.7142857 945 TRUE
## 14 1993 0.4811958 0.2285439 0.09836066 0.5159113 0.6759884 1037 TRUE
## 15 1994 0.4859611 0.2375810 0.10583153 0.4989201 0.6717063 463 TRUE
## 16 2006 0.4906542 0.2636849 0.15487316 0.4819760 0.6088117 1498 TRUE
## 17 2012 0.3563097 0.4316013 0.45387063 0.4443266 0.3138918 943 FALSE
yearly_jobpref <- gss_yearly_jobtop2 %>%
gather(income,security,hours,promotions,meaning,key=job_factor,value=job_top2)
yearly_jobpref %>%
ggplot(aes(x=year,y=job_top2,color=job_factor)) +
geom_point(size=2) +
geom_line(alpha=0.3) +
geom_smooth(aes(linetype=pre2010),se=FALSE) +
ggtitle("Comparing job factor preferences over time") +
xlab("Year") +
ylab("Prop. of resp. choosing job factor 1st or 2nd")## `geom_smooth()` using method = 'loess'
The popularity of each factor over time is plotted above with dots that are color coded by factor. Thin straight lines connect each set of dots to highlight each historical series. Curved solid lines show estimated trends following each set of dots, with the exclusion of the final year of 2012. Notice there is a striking change in every factor in 2012. We are curious why this is the case, but investigating this is beyond the scope of this project. We therefore omit 2012 data from the rest of our analysis for the sake of simplicity, and investigate history from 1972 to 2006.
We will also focus the rest of our analysis on the single most popular factor in job preference: meaning. It appears that this factor changes over time, perhaps decreasing overall from 1972 to 2006, with a possible upward trend around 1990. Is the dependence of jobmeans on year statistically significant?
We will use a chi-square test to evaluate these hypotheses. The chi-square test is our best (only) available tool to evaluate a set of observed counts of categorical data for its fit to a null hypothesis. Our categorical data are the jobmeans responses and we have 16 observed counts from surveys conducted from 1972 through 2006. (Note: ANOVA is another tool available to us, but our categorical jobmeans data provide proportions, not the groups of means required by ANOVA.)
The chi-square test depends on two conditions, both of which are met. Independence: GSS respondents from 1977 on are selected from a random sample. For 1972-1976 the block quota method is strictly, technically, not a random sample, but still meets our needs based on the careful methods employed by GSS. Sample size / distribution: There are roughly 800-1500 respondents per year, split fairly evenly between top-2 and not-top-2 responses, which is well above the minimum requirement of 5.
Let’s start by calculating some preliminary 95% confidence intervals for a first take on this. The validity of our confidence intervals depends on the same two conditions we just confirmed. We use a pooled proportion phat_pooled based on all respondents from all years pre-2010. This is our best guess for the actual proportion p. Our calculation uses a weighted average of each year’s value to arrive at the pooled proportion:
pre2010_jobpref <- yearly_jobpref %>% filter(job_factor=="meaning",year<2010)
# Job_top2*n gives us the total number of people who chose meaning in top 2 that year
(p_hat_pooled <- as.double(pre2010_jobpref %>% summarise(p_hat=sum(job_top2*n)/sum(n))))## [1] 0.6748535
With this pooled proportion, we calculate our 95% confidence intervals, shown below as vertical error bars. We are not using a Bonferroni correction in these calculations because we are not presuming to do any pairwise comparisons. The red horizontal line indicates phat_pooled:
# Two-sided 95% confidence interval
Z = qnorm(0.025,lower.tail = FALSE)
# SE = sqrt(p_hat_pooled*(1-p_hat_pooled)/n) where n is # resp for each year
pre2010_jobpref <- pre2010_jobpref %>%
mutate(upper=job_top2+Z*sqrt(p_hat_pooled*(1-p_hat_pooled)/n),
lower=job_top2-Z*sqrt(p_hat_pooled*(1-p_hat_pooled)/n))
pre2010_jobpref %>%
ggplot(aes(x=year)) +
geom_point(aes(y=job_top2)) +
geom_errorbar(aes(ymin=lower,ymax=upper)) +
geom_hline(yintercept=p_hat_pooled,color="red") +
ggtitle("Job meaning chosen top-2 over time: 95% confidence intervals") +
xlab("Year") +
ylab("Prop. of resp. choosing jobmeans 1st or 2nd")This chart reproduces the green points from the previous graph, with the addition of 95% confidence intervals and phat_pooled. Based on the confidence intervals we see here and their overlap with the steady-state red line of phat_pooled, we have reason to be cautious in accepting the trend curve from the previous graph. Also, it certainly appears that something unusual happened with the 1982 results. Investigating 1982 is beyond our scope and we will simply leave 1982 in our data as-is.
We use a chi-square analysis to investigate more carefully our hypothesis that jobmeans depends on the year. To perform this analysis we put our data in a table that counts TRUEs and FALSEs for each year (TRUE = success = jobmeans is in top 2. FALSE = failure = jobmeans is not in top 2.)
(pre2010_matrix <- pre2010_jobpref %>%
group_by(year) %>%
summarise(meaning_top2=sum(n*job_top2),meaning_not_top2=sum(n*(1-job_top2))) %>%
select(meaning_top2,meaning_not_top2))## # A tibble: 16 x 2
## meaning_top2 meaning_not_top2
## <dbl> <dbl>
## 1 990 472
## 2 1021 433
## 3 975 474
## 4 982 498
## 5 1001 435
## 6 498 384
## 7 1016 425
## 8 1015 485
## 9 1150 622
## 10 656 290
## 11 708 272
## 12 633 247
## 13 675 270
## 14 701 336
## 15 311 152
## 16 912 586
chisq.test(pre2010_matrix)##
## Pearson's Chi-squared test
##
## data: pre2010_matrix
## X-squared = 125.41, df = 15, p-value < 2.2e-16
Our chi-square test score is 125.41 with an exponentially miniscule p-value. This allows us to reject our null hypothesis. We have statistically significant evidence that jobmeans does depend on year (from 1972 to 2006). Note that our test does not confirm a specific trend (such as the downward trend we have suggested) nor does it say anything about causality. The test merely confirms statistically that there is a dependence between jobmeans and year.
Let’s dive even deeper and see how education impacts our findings. We will continue to omit 2012 from consideration. We will use the variable degree as a starting point. This splits respondents into five buckets. The three most educated buckets are significantly smaller than the least two educated buckets, so we merge those three into “College-Plus”:
gss_jobtop2 %>% filter(year<2010) %>% count(degree)## # A tibble: 6 x 2
## degree n
## <fctr> <int>
## 1 Lt High School 4936
## 2 High School 10080
## 3 Junior College 754
## 4 Bachelor 2299
## 5 Graduate 1098
## 6 NA 458
# Convert degree variable into 3 buckets, with college-plus holding jr college and beyond
edu_3 <- function(x) {
return(ifelse(x=="Junior College" |
x=="Bachelor" |
x=="Graduate", "College-Plus",
as.character(x)))
}
# Summarise job priorities by education and by year
edu3_yearly_jobpref <- gss_jobtop2 %>% filter(year<2010) %>%
mutate(education=edu_3(degree)) %>%
group_by(education,year) %>%
summarise(income=sum(jobinc)/n(),
security=sum(jobsec)/n(),
hours=sum(jobhour)/n(),
promotions=sum(jobpromo)/n(),
meaning=sum(jobmeans)/n(),
n=n())Let’s look at overall job factor preferences for each of our three levels of education. We are still excluding 2012 but we are briefly bringing back all five factors for this chart:
# Peel off year and calculate job overall priorities by education
edu3_jobpref <- edu3_yearly_jobpref %>%
summarise(income=sum(income*n)/sum(n),
security=sum(security*n)/sum(n),
hours=sum(hours*n)/sum(n),
promotions=sum(promotions*n)/sum(n),
meaning=sum(meaning*n)/sum(n),
n=sum(n))
edu3_jobpref %>%
gather(income,security,hours,promotions,meaning,key=job_factor,value=job_top2) %>%
mutate(job_factor=factor(job_factor,levels=c("income","security","hours","promotions","meaning"))) %>%
ggplot(aes(x=job_factor,y=job_top2)) +
geom_bar(stat="identity") +
facet_wrap(~education) +
ggtitle("Preferred job factors by education") +
xlab("Job factor") +
ylab("Prop. of resp. choosing factor in top 2")It appears that meaning is significantly more popular with more educated respondents than it is with those less educated. We can confirm this statistically with a chi-square test on the counts of how many respondents chose jobmeans in the top 2 or not, for each educational level (1=college-plus, 2=high school, 3=less than high school, 4=NA). At the same time we can confirm by inspection that the conditions are met for a valid chi-square test.
(pre2010_edu3_matrix <- edu3_jobpref %>% transmute(meaning_top2=meaning*n,meaning_not_top2=(1-meaning)*n))## # A tibble: 4 x 2
## meaning_top2 meaning_not_top2
## <dbl> <dbl>
## 1 3476 675
## 2 6902 3178
## 3 2629 2307
## 4 237 221
chisq.test(pre2010_edu3_matrix)##
## Pearson's Chi-squared test
##
## data: pre2010_edu3_matrix
## X-squared = 1011, df = 3, p-value < 2.2e-16
Our chi-square test score of 1011 again gives us an exponentially small p-value and confirms the statistical significance of differences we have seen in preference for job meaning across education levels.
Continuing our deeper dive, how has jobmeans changed over time for each level of education? We already have this data and simply need to chart it:
edu3_yearly_jobpref %>%
ggplot(aes(x=year,y=meaning,color=education)) +
geom_point(size=2) +
geom_line(alpha=0.3) +
geom_smooth(se=FALSE) +
ggtitle("Preference for job meaning over time, by education level") +
xlab("Year") +
ylab("Prop. of resp. choosing jobmeans 1st or 2nd")## `geom_smooth()` using method = 'loess'
It appears that every educational level has some variety of a slight general downward trend with possibly a brief up along the way. Meanwhile those with no known education level (NA) are radically different. They appear to oscillate up and down between the lower bound of the less-than-high-school curve and the upper bound of the college-plus curve. We suspect this oscillation could be related to the actual education levels that are unknown in this NA group. More importantly, we hypothesize that each education level does in fact follow the same downward trend on their preference for job meaning over time. We leave that analysis for future work.
We investigated what Americans consider most important about their jobs. Meaning is the most preferred factor of the five considered (i.e., a sense of importance and accomplishment). There is strong evidence suggesting that people with more education have a stronger preference for meaning in their jobs than do people with less education. Across Americans of all education levels, however, the relative importance of meaning in their jobs is changing over time. It appears it may be declining slightly over time.
Our research confirms dependencies with time but does not confirm a directional trend. More research is needed to confirm the directional trends that we have suggested here. We also suggest research be done to double-check the GSS data of 2012, which appears drastically different than earlier years of the survey.