Introduction:

Typically, the argument for further study, aside from personal interest, is based on employability and earning potential advantages. However, many people express dissatisfaction with their jobs and count the years before they can retire. This is probably not the healthiest situation, whether for society and the workforce, or for the individuals concerned. Therefore, it may be of general benefit, as well as a matter of curiosity, to determine whether the level of academic achievement has any impact on one’s job satisfaction.

Data:

The data used for this study come from the General Social Survey (GSS). The data frame was downloaded in a version specially prepared for the Data Analysis and Statistical Inference course on Coursera, in which missing values were recoded as NA and variables were converted to R’s factor data type as appropriate.

The GSS is an observational survey, funded by the National Science Foundation in the United States, and the data set used here covers the years 1972-2012. The surveys collected data through computer-assisted personal interviews, face-to-face interviews, or telephone interviews. Each case in the survey corresponds to an English or Spanish speaking non-institutionalized person over the age of 18 living in the US. There is some sampling bias against individuals who speak no English (1972-2006) or neither English nor Spanish (after 2006), although since non-English speakers represent a small fraction \((<2\%)\) of the population, this should not appreciably affect our conclusions. The sampling method and weighting of the survey have changed over the years, for example in the blocking and stratification variables used. We will nevertheless proceed under the assumption that the GSS data is sufficiently random to generalize the findings to the US population. Since this is not an experiment with random assignment, we cannot make statements regarding causality in our findings.

We will be looking at whether there is a relationship between respondents’ highest academic degree, indicated by the ordinal categorical variable degree with five levels Lt High School, High School, Junior College, Bachelor, Graduate, and the respondents’ job satisfaction, given by the ordinal categorical variable satjob with the four levels Very Satisfied, Mod. Satisfied, A Little Dissat, Very Dissatisfied.

Exploratory data analysis:

The GSS data, gss, is a data frame of 57061 observations of 114 variables, which is both unwieldy and unnecessary for our purposes, so we can subset it to include only the variables degree and satjob as well as year as it may be interesting to use the same data later to investigate any changes over time. Only complete cases, or rows/observations with no missing values, are retained.

subgss <- na.omit(gss[, c("degree", "satjob", "year")])
summary(subgss)
##             degree                    satjob           year     
##  Lt High School: 7341   Very Satisfied   :19414   Min.   :1972  
##  High School   :21744   Mod. Satisfied   :15513   1st Qu.:1982  
##  Junior College: 2367   A Little Dissat  : 4057   Median :1991  
##  Bachelor      : 6246   Very Dissatisfied: 1688   Mean   :1991  
##  Graduate      : 2974                             3rd Qu.:2000  
##                                                   Max.   :2012

The summary shows that every level of both variables has well over a thousand cases. More specifically, the contingency table shows that we should have no issues in performing a Chi-squared test of independence, as the minimum of 5 expected cases per cell will likely be met. Note that this is the pooled data for all 29 survey years available from 1972-2012. If we wish to analyze individual years, we will need to check the minimum cell count again.

subgss_table <- table(subgss[, c("degree", "satjob")])
addmargins(subgss_table)
##                 satjob
## degree           Very Satisfied Mod. Satisfied A Little Dissat Very Dissatisfied   Sum
##   Lt High School           3349           2793             821               378  7341
##   High School             10005           8497            2281               961 21744
##   Junior College           1201            883             214                69  2367
##   Bachelor                 3106           2386             546               208  6246
##   Graduate                 1753            954             195                72  2974
##   Sum                     19414          15513            4057              1688 40672

A cursory glance at the mosaic plot suggests that those with Junior College, Bachelor, or Graduate degrees are more likely to answer Very Satisfied. However, we will need to perform our analysis to determine whether this result is statistically significant.
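The same pattern can be read from row proportions, which give the share of each satisfaction level within each degree. The following is a minimal sketch, assuming the counts are re-entered by hand from the contingency table; the names `counts` and `row_props` are illustrative and not part of the original analysis.

```r
# Observed counts copied by hand from the contingency table in this report.
counts <- matrix(
  c( 3349, 2793,  821, 378,
    10005, 8497, 2281, 961,
     1201,  883,  214,  69,
     3106, 2386,  546, 208,
     1753,  954,  195,  72),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    satjob = c("Very Satisfied", "Mod. Satisfied", "A Little Dissat",
               "Very Dissatisfied")
  )
)

# Conditional distribution of satjob within each degree (each row sums to 1).
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

With the full data frame available, `prop.table(subgss_table, margin = 1)` should give the same proportions without retyping the counts.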

Inference:

Formally, we would like to determine whether there is a relationship between the highest academic degree achieved and respondents’ job satisfaction. We can formulate our null and alternative hypotheses as:

\(H_0:\) Degree and job satisfaction are independent. Job satisfaction does not vary with the degree of the respondent.

\(H_A:\) Degree and job satisfaction are dependent. Job satisfaction does vary with the degree of the respondent.

Since we are dealing with two categorical variables, each with more than two levels, we will use a Chi-squared test of independence. The conditions for the test are:

1. Independence: We take the GSS data to be sufficiently randomly sampled. Also, the sample is definitely \(<10\%\) of the US population, and each case can only contribute to a single cell in the contingency table. Thus, the independence condition is satisfied.

2. Sample size: Each scenario, or cell, has at least 5 expected cases as seen in the table below, therefore the sample size condition is met.

chisq.test(subgss_table)$expected
##                 satjob
## degree           Very Satisfied Mod. Satisfied A Little Dissat Very Dissatisfied
##   Lt High School       3504.086      2799.9836        732.2590         304.67172
##   High School         10379.082      8293.5354       2168.9469         902.43588
##   Junior College       1129.842       902.8145        236.1064          98.23702
##   Bachelor             2981.408      2382.3318        623.0336         259.22620
##   Graduate             1419.582      1134.3347        296.6542         123.42919
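Under independence, the expected count in each cell is \(E_{ij} = (\text{row total}_i \times \text{column total}_j) / n\). As a rough check on the table above, this can be reproduced from the margins alone. The sketch below assumes the row and column totals are copied by hand from the contingency table; `row_totals`, `col_totals`, and `expected` are illustrative names.

```r
# Margins copied by hand from the contingency table (the "Sum" row and column).
row_totals <- c("Lt High School" = 7341, "High School" = 21744,
                "Junior College" = 2367, "Bachelor" = 6246,
                "Graduate" = 2974)
col_totals <- c("Very Satisfied" = 19414, "Mod. Satisfied" = 15513,
                "A Little Dissat" = 4057, "Very Dissatisfied" = 1688)
n <- sum(row_totals)

# Expected count per cell under independence: row total * column total / n.
expected <- outer(row_totals, col_totals) / n
print(round(expected, 3))

# Sample size condition: every expected count should be at least 5.
print(all(expected >= 5))
```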

Finally, we can perform the Chi-squared test of independence.

chisq.test(subgss_table)
## 
##  Pearson's Chi-squared test
## 
## data:  subgss_table
## X-squared = 267.14, df = 12, p-value < 2.2e-16

The p-value is \(3.610742\times 10^{-50}\) \((<2.2 \times 10^{-16})\), which is far below any conventional significance level. In other words, if degree and job satisfaction were truly independent, the probability of observing results at least as extreme as those in this survey would be so small that we can reject the null hypothesis.
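The printed output only reports the p-value as below \(2.2 \times 10^{-16}\), so it may help to show how a more precise figure can be obtained. The sketch below assumes the rounded statistic 267.14 from the output and computes the upper tail of the Chi-squared distribution with \((5-1)\times(4-1) = 12\) degrees of freedom. Because the statistic is rounded, the result should agree with the value quoted above only to about two significant figures.

```r
# Rounded test statistic taken from the chisq.test output in this report.
x_squared <- 267.14

# Degrees of freedom: (number of degree levels - 1) * (number of satjob levels - 1).
df <- (5 - 1) * (4 - 1)

# Upper-tail probability of a Chi-squared distribution with df degrees of freedom.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
print(p_value)
```

With the table object available, `chisq.test(subgss_table)$p.value` returns the unrounded p-value directly.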

Conclusion:

We can conclude that there is indeed a relationship between degree and job satisfaction. Unfortunately, we cannot specify much more than this with only a Chi-squared test. It would be interesting to bin the degree variable into two groups, college and noncollege, and to bin the satjob variable into two groups as well, satisfied and dissatisfied. This would allow hypothesis testing and confidence intervals (provided conditions are met) for the difference between two proportions, which could be used to determine whether college degrees are associated with greater job satisfaction.

It may also be informative to perform the same hypothesis tests over individual years to see whether there has been a change over the past four decades in the relationship between these two variables.

References:

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut/Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. http://doi.org/10.3886/ICPSR34802.v1

Appendix:

head(subgss, 40)
##            degree          satjob year
## 1        Bachelor A Little Dissat 1972
## 3     High School  Mod. Satisfied 1972
## 4        Bachelor  Very Satisfied 1972
## 6     High School  Mod. Satisfied 1972
## 7     High School  Very Satisfied 1972
## 8        Bachelor A Little Dissat 1972
## 9     High School  Mod. Satisfied 1972
## 10    High School  Mod. Satisfied 1972
## 12 Lt High School  Very Satisfied 1972
## 13 Lt High School  Very Satisfied 1972
## 14 Lt High School  Mod. Satisfied 1972
## 15 Lt High School  Very Satisfied 1972
## 16    High School  Mod. Satisfied 1972
## 19       Bachelor  Very Satisfied 1972
## 21    High School  Very Satisfied 1972
## 22    High School  Very Satisfied 1972
## 23    High School  Mod. Satisfied 1972
## 26    High School  Mod. Satisfied 1972
## 27    High School  Mod. Satisfied 1972
## 28    High School A Little Dissat 1972
## 29    High School  Mod. Satisfied 1972
## 30 Lt High School  Very Satisfied 1972
## 31 Lt High School  Mod. Satisfied 1972
## 32    High School  Very Satisfied 1972
## 35    High School  Mod. Satisfied 1972
## 36    High School  Mod. Satisfied 1972
## 39 Lt High School  Mod. Satisfied 1972
## 40    High School  Mod. Satisfied 1972
## 42    High School  Very Satisfied 1972
## 43 Lt High School  Mod. Satisfied 1972
## 47    High School  Very Satisfied 1972
## 48    High School  Very Satisfied 1972
## 49 Lt High School  Mod. Satisfied 1972
## 50 Lt High School  Mod. Satisfied 1972
## 51 Lt High School  Very Satisfied 1972
## 52    High School  Mod. Satisfied 1972
## 53 Lt High School  Very Satisfied 1972
## 56    High School  Mod. Satisfied 1972
## 59    High School  Very Satisfied 1972
## 60 Lt High School  Very Satisfied 1972

Post-Appendix Extension:

This section will address some of the ideas for future work pointed out earlier in the Conclusion.

First, we merge the levels of degree and satjob so that each variable has only two levels:

mersubgss <- subgss

levels(mersubgss$degree) <- list(noncollege = c("Lt High School", "High School"), college = c("Junior College", "Bachelor", "Graduate"))

levels(mersubgss$satjob) <- list(dissatisfied = c("A Little Dissat", "Very Dissatisfied"), satisfied = c("Very Satisfied", "Mod. Satisfied"))

str(mersubgss)
## 'data.frame':    40672 obs. of  3 variables:
##  $ degree: Factor w/ 2 levels "noncollege","college": 2 1 2 1 1 2 1 1 1 1 ...
##  $ satjob: Factor w/ 2 levels "dissatisfied",..: 1 2 2 2 2 1 2 2 2 2 ...
##  $ year  : int  1972 1972 1972 1972 1972 1972 1972 1972 1972 1972 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:16389] 2 5 11 17 18 20 24 25 33 34 ...
##   .. ..- attr(*, "names")= chr [1:16389] "2" "5" "11" "17" ...
mertable <- table(mersubgss[, c("degree", "satjob")])
addmargins(mertable)
##             satjob
## degree       dissatisfied satisfied   Sum
##   noncollege         4441     24644 29085
##   college            1304     10283 11587
##   Sum                5745     34927 40672
plot(mertable,
     las = 1, 
     col = "skyblue",
     main = "",
     xlab = "Degree",
     ylab = "Job satisfaction"
     )

And now for the hypothesis test:

inference(mersubgss$satjob, mersubgss$degree, est = "proportion", success = "satisfied", type = "ht", alternative = "greater", method = "theoretical", null = 0, order = c("college", "noncollege"), eda_plot = FALSE, inf_plot = FALSE)
## Response variable: categorical, Explanatory variable: categorical
## Difference between two proportions -- success: satisfied
## Summary statistics:
##               x
## y              college noncollege   Sum
##   dissatisfied    1304       4441  5745
##   satisfied      10283      24644 34927
##   Sum            11587      29085 40672
## Observed difference between proportions (college-noncollege) = 0.0402
## H0: p_college - p_noncollege = 0 
## HA: p_college - p_noncollege > 0 
## Pooled proportion = 0.8587 
## Check conditions:
##    college : number of expected successes = 9950 ; number of expected failures = 1637 
##    noncollege : number of expected successes = 24977 ; number of expected failures = 4108 
## Standard error = 0.004 
## Test statistic: Z =  10.494 
## p-value =  0
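The reported p-value of 0 is a rounded figure: the true value is positive but extremely small. As a cross-check that does not rely on the course's `inference` function, the same one-sided test can be sketched with base R. The counts are copied by hand from the merged table above, and the names `x`, `n`, `p_hat`, `p_pool`, `se_pool`, `se_unpooled`, and `ci` are illustrative. The confidence interval at the end uses the usual unpooled standard error and is an addition to what the output above shows.

```r
# Satisfied counts and group sizes copied by hand from the merged table.
x <- c(college = 10283, noncollege = 24644)
n <- c(college = 11587, noncollege = 29085)

p_hat  <- x / n              # sample proportion satisfied in each group
p_pool <- sum(x) / sum(n)    # pooled proportion under H0

# Hypothesis test: pooled standard error, one-sided alternative (college > noncollege).
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z       <- unname((p_hat["college"] - p_hat["noncollege"]) / se_pool)
p_value <- pnorm(z, lower.tail = FALSE)
print(c(z = z, p_value = p_value))

# Base R equivalent; without continuity correction its X-squared equals z^2.
check <- prop.test(x, n, alternative = "greater", correct = FALSE)
print(check$statistic)

# 95% confidence interval for the difference, using the unpooled standard error.
se_unpooled <- sqrt(sum(p_hat * (1 - p_hat) / n))
diff_hat    <- unname(p_hat["college"] - p_hat["noncollege"])
ci          <- diff_hat + c(-1, 1) * qnorm(0.975) * se_unpooled
print(round(ci, 4))
```

Since the test statistic is large and the p-value is far below any conventional significance level, the data support the alternative hypothesis that the proportion of satisfied respondents is higher in the college group. As with the Chi-squared test, this is an association and not a causal claim.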