Typically, the argument for further study, aside from personal interest, is based on employability and earning potential advantages. However, many people express dissatisfaction with their jobs and count the years before they can retire. This is probably not the healthiest situation, whether for society and the workforce, or for the individuals concerned. Therefore, it may be of general benefit, as well as a matter of curiosity, to determine whether the level of academic achievement has any impact on one’s job satisfaction.
The data used for this study come from the General Social Survey (GSS). The data frame containing them was downloaded in a version specially prepared for the Data Analysis and Statistical Inference course on Coursera, in which missing values were recoded as NA and variables were converted to R’s factor data type as appropriate.
The GSS is an observational survey funded by the National Science Foundation in the United States, and the data set used here covers the years 1972-2012. The surveys collected data through computer-assisted personal interviews, face-to-face interviews, or telephone interviews. Each case in the survey corresponds to an English- or Spanish-speaking non-institutionalized person over the age of 18 living in the US. There is some sampling bias against individuals who speak no English (1972-2006) or neither English nor Spanish (after 2006), although since non-English speakers represent a small fraction \((<2\%)\) of the population, it should not appreciably affect our conclusions. The sampling method and weighting of the survey have changed over the years, for example in the blocking and stratification variables used. We will nonetheless proceed under the assumption that the GSS data are sufficiently random to generalize the findings to the US population. Since this is an observational study rather than an experiment with random assignment, we cannot make statements regarding causality in our findings.
We will be looking at whether there is a relationship between respondents’ highest academic degree, indicated by the ordinal categorical variable degree with five levels Lt High School, High School, Junior College, Bachelor, Graduate, and the respondents’ job satisfaction, given by the ordinal categorical variable satjob with the four levels Very Satisfied, Mod. Satisfied, A Little Dissat, Very Dissatisfied.
The GSS data, gss, is a data frame of 57061 observations of 114 variables, which is both unwieldy and unnecessary for our purposes, so we can subset it to include only the variables degree and satjob as well as year as it may be interesting to use the same data later to investigate any changes over time. Only complete cases, or rows/observations with no missing values, are retained.
subgss <- na.omit(gss[, c("degree", "satjob", "year")])
summary(subgss)
##             degree                    satjob           year
##  Lt High School: 7341   Very Satisfied   :19414   Min.   :1972
##  High School   :21744   Mod. Satisfied   :15513   1st Qu.:1982
##  Junior College: 2367   A Little Dissat  : 4057   Median :1991
##  Bachelor      : 6246   Very Dissatisfied: 1688   Mean   :1991
##  Graduate      : 2974                             3rd Qu.:2000
##                                                   Max.   :2012
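To make the complete-case step concrete, here is a minimal sketch on a toy data frame. The values are hypothetical and are not GSS data, and the names `toy`, `toy_complete`, and `toy_filtered` are illustrative only. It shows that `na.omit()` drops any row with a missing value in any column and records which rows were removed.

```r
# Toy data frame (hypothetical values, not GSS data) with some missing entries.
toy <- data.frame(
  degree = factor(c("Bachelor", NA, "High School", "Graduate")),
  satjob = factor(c("Very Satisfied", "Mod. Satisfied", NA, "Very Satisfied")),
  year   = c(1972L, 1972L, 1973L, 1973L)
)

# na.omit() drops every row that has an NA in any column.
toy_complete <- na.omit(toy)

# complete.cases() gives the equivalent logical row filter.
toy_filtered <- toy[complete.cases(toy), ]

nrow(toy_complete)               # number of rows kept
attr(toy_complete, "na.action")  # indices of the rows that were dropped
```

The same mechanism accounts for the difference between the 57061 rows of gss and the rows retained in subgss.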
The summary shows that every level of both variables contains well over a thousand cases. More directly, the contingency table below shows that we should have no issues performing a Chi-squared test of independence: the smallest observed cell count is 69, so the requirement of more than 5 expected cases per cell will very likely be met. Note that these are the pooled data for all 29 survey years available from 1972-2012. If we wish to analyze individual years, we will need to check the minimum cell count again.
subgss_table <- table(subgss[, c("degree", "satjob")])
addmargins(subgss_table)
##                 satjob
## degree           Very Satisfied Mod. Satisfied A Little Dissat Very Dissatisfied   Sum
##   Lt High School           3349           2793             821               378  7341
##   High School             10005           8497            2281               961 21744
##   Junior College           1201            883             214                69  2367
##   Bachelor                 3106           2386             546               208  6246
##   Graduate                 1753            954             195                72  2974
##   Sum                     19414          15513            4057              1688 40672
A cursory glance at the mosaic plot suggests that those with Junior College, Bachelor, or Graduate degrees are more likely to answer Very Satisfied. However, we need to carry out the formal analysis to determine whether this pattern is statistically significant.
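The visual impression can be checked numerically with row proportions. The following is a minimal sketch that rebuilds the contingency table from the counts printed above so that it runs on its own. The object names `counts`, `tab`, and `row_props` are illustrative, and the plot styling is an arbitrary choice rather than the original figure.

```r
# Observed counts copied from the contingency table above.
counts <- matrix(
  c( 3349, 2793,  821, 378,
    10005, 8497, 2281, 961,
     1201,  883,  214,  69,
     3106, 2386,  546, 208,
     1753,  954,  195,  72),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    satjob = c("Very Satisfied", "Mod. Satisfied", "A Little Dissat",
               "Very Dissatisfied")
  )
)
tab <- as.table(counts)

# Share of each satisfaction level within each degree (each row sums to 1).
row_props <- prop.table(tab, margin = 1)
round(row_props, 3)

# Mosaic plot of the same table; the styling here is an arbitrary choice.
mosaicplot(tab, las = 1, color = "skyblue", main = "",
           xlab = "Degree", ylab = "Job satisfaction")
```

In these proportions, roughly 59% of Graduate respondents answered Very Satisfied, against roughly 46% of those with Lt High School or High School.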
Formally, we would like to determine whether there is a relationship between the highest academic degree achieved and respondents’ job satisfaction. We can formulate our null and alternative hypotheses as:
\(H_0:\) Degree and job satisfaction are independent. Job satisfaction does not vary with the degree of the respondent.
\(H_A:\) Degree and job satisfaction are dependent. Job satisfaction does vary with the degree of the respondent.
Since we are dealing with two categorical variables, each with more than two levels, we will use a Chi-squared test of independence. The conditions for the test are:
1. Independence: We take the GSS data to be sufficiently randomly sampled. Also, the sample is definitely \(<10\%\) of the US population, and each case can only contribute to a single cell in the contingency table. Thus, the independence condition is satisfied.
2. Sample size: Each scenario, or cell, has \(>5\) expected cases as seen in the table below, therefore the sample size condition is met.
chisq.test(subgss_table)$expected
##                 satjob
## degree           Very Satisfied Mod. Satisfied A Little Dissat Very Dissatisfied
##   Lt High School       3504.086      2799.9836        732.2590         304.67172
##   High School         10379.082      8293.5354       2168.9469         902.43588
##   Junior College       1129.842       902.8145        236.1064          98.23702
##   Bachelor             2981.408      2382.3318        623.0336         259.22620
##   Graduate             1419.582      1134.3347        296.6542         123.42919
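The expected counts above are what independence would predict: each cell is its row total times its column total, divided by the overall sample size. A minimal sketch of that calculation, using only the marginal totals from the contingency table (the names `row_totals`, `col_totals`, and `expected` are illustrative):

```r
# Marginal totals copied from the Sum row and column of the contingency table.
row_totals <- c("Lt High School" = 7341, "High School" = 21744,
                "Junior College" = 2367, "Bachelor" = 6246, "Graduate" = 2974)
col_totals <- c("Very Satisfied" = 19414, "Mod. Satisfied" = 15513,
                "A Little Dissat" = 4057, "Very Dissatisfied" = 1688)
n <- sum(row_totals)  # total number of respondents in the table

# Expected count under independence: (row total * column total) / n.
expected <- outer(row_totals, col_totals) / n
round(expected, 3)

# Sample size condition: every expected count should exceed 5.
min(expected)
all(expected > 5)
```

The smallest expected count is in the Junior College / Very Dissatisfied cell, which is still far above the threshold of 5.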
Finally, we can perform the Chi-squared test of independence.
chisq.test(subgss_table)
##
## Pearson's Chi-squared test
##
## data: subgss_table
## X-squared = 267.14, df = 12, p-value < 2.2e-16
The p-value is \(3.610742\times 10^{-50}\) \((<2.2 \times 10^{-16})\), which is far below any conventional significance level. In other words, if degree and job satisfaction were independent, the probability of observing results at least as extreme as those in this survey would be vanishingly small, so we reject the null hypothesis.
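The printed test output only reports the p-value as being below 2.2e-16, because R's default p-value formatting stops at machine precision. A more specific value can be recovered from the test statistic and its degrees of freedom. The sketch below recomputes both by hand from the observed counts. The object names are illustrative, and this is one way to obtain such a figure, not necessarily how the value quoted above was produced.

```r
# Observed counts copied from the contingency table (rows: degree, cols: satjob).
counts <- matrix(
  c( 3349, 2793,  821, 378,
    10005, 8497, 2281, 961,
     1201,  883,  214,  69,
     3106, 2386,  546, 208,
     1753,  954,  195,  72),
  nrow = 5, byrow = TRUE
)

# Expected counts under independence.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
x2 <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail probability of the Chi-squared distribution.
p_exact <- pchisq(x2, df = df, lower.tail = FALSE)

c(x_squared = x2, df = df, p_value = p_exact)

# The same numbers are stored, unrounded, in the object returned by chisq.test().
fit <- chisq.test(counts)
fit$p.value
```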
We can conclude that there is a statistically significant association between degree and job satisfaction. Unfortunately, a Chi-squared test alone cannot tell us much more than this, such as the direction or size of the association. It would be interesting to bin the degree variable into two groups, college and noncollege, and the satjob variable into two groups, satisfied and dissatisfied. This would allow hypothesis tests and confidence intervals (provided conditions are met) comparing two proportions, which could be used to determine whether college degrees are associated with greater job satisfaction.
It may also be informative to perform the same hypothesis tests over individual years to see whether there has been a change over the past four decades in the relationship between these two variables.
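A per-year analysis could be organized along the following lines. This is only a sketch: `chisq_by_year()` is a hypothetical helper, not part of the original analysis, and it is demonstrated on simulated stand-in data rather than on the GSS. It assumes a data frame with the columns degree, satjob, and year, as in subgss, and it reports the smallest expected count for each year so that the sample size condition can be re-checked.

```r
# Hypothetical helper: run the test separately for each survey year and record
# whether the expected-count condition (> 5 in every cell) holds for that year.
chisq_by_year <- function(df) {
  per_year <- lapply(split(df, df$year), function(d) {
    tab <- table(d$degree, d$satjob)
    # Warnings about small expected counts are silenced here because the
    # condition is reported explicitly in the result instead.
    fit <- suppressWarnings(chisq.test(tab))
    data.frame(
      year         = d$year[1],
      n            = nrow(d),
      min_expected = min(fit$expected),
      condition_ok = all(fit$expected > 5),
      x_squared    = unname(fit$statistic),
      p_value      = fit$p.value
    )
  })
  do.call(rbind, per_year)
}

# Simulated stand-in data (NOT GSS data), only to show the call.
set.seed(1)
years         <- c(1972L, 1992L, 2012L)
n_per_year    <- 1500
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")
satjob_levels <- c("Very Satisfied", "Mod. Satisfied", "A Little Dissat",
                   "Very Dissatisfied")
toy <- data.frame(
  degree = factor(sample(degree_levels, n_per_year * length(years),
                         replace = TRUE), levels = degree_levels),
  satjob = factor(sample(satjob_levels, n_per_year * length(years),
                         replace = TRUE, prob = c(0.48, 0.38, 0.10, 0.04)),
                  levels = satjob_levels),
  year   = rep(years, each = n_per_year)
)

chisq_by_year(toy)
```

With the real data, the call would be `chisq_by_year(subgss)`, and any year where `condition_ok` is FALSE would need its levels merged or a different test.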
Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut/Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. http://doi.org/10.3886/ICPSR34802.v1
head(subgss, 40)
## degree satjob year
## 1 Bachelor A Little Dissat 1972
## 3 High School Mod. Satisfied 1972
## 4 Bachelor Very Satisfied 1972
## 6 High School Mod. Satisfied 1972
## 7 High School Very Satisfied 1972
## 8 Bachelor A Little Dissat 1972
## 9 High School Mod. Satisfied 1972
## 10 High School Mod. Satisfied 1972
## 12 Lt High School Very Satisfied 1972
## 13 Lt High School Very Satisfied 1972
## 14 Lt High School Mod. Satisfied 1972
## 15 Lt High School Very Satisfied 1972
## 16 High School Mod. Satisfied 1972
## 19 Bachelor Very Satisfied 1972
## 21 High School Very Satisfied 1972
## 22 High School Very Satisfied 1972
## 23 High School Mod. Satisfied 1972
## 26 High School Mod. Satisfied 1972
## 27 High School Mod. Satisfied 1972
## 28 High School A Little Dissat 1972
## 29 High School Mod. Satisfied 1972
## 30 Lt High School Very Satisfied 1972
## 31 Lt High School Mod. Satisfied 1972
## 32 High School Very Satisfied 1972
## 35 High School Mod. Satisfied 1972
## 36 High School Mod. Satisfied 1972
## 39 Lt High School Mod. Satisfied 1972
## 40 High School Mod. Satisfied 1972
## 42 High School Very Satisfied 1972
## 43 Lt High School Mod. Satisfied 1972
## 47 High School Very Satisfied 1972
## 48 High School Very Satisfied 1972
## 49 Lt High School Mod. Satisfied 1972
## 50 Lt High School Mod. Satisfied 1972
## 51 Lt High School Very Satisfied 1972
## 52 High School Mod. Satisfied 1972
## 53 Lt High School Very Satisfied 1972
## 56 High School Mod. Satisfied 1972
## 59 High School Very Satisfied 1972
## 60 Lt High School Very Satisfied 1972
This section will address some of the ideas for future work pointed out earlier in the Conclusion.
First we merge the levels of degree and satjob so that each becomes a binary variable:
mersubgss <- subgss
levels(mersubgss$degree) <- list(noncollege = c("Lt High School", "High School"), college = c("Junior College", "Bachelor", "Graduate"))
levels(mersubgss$satjob) <- list(dissatisfied = c("A Little Dissat", "Very Dissatisfied"), satisfied = c("Very Satisfied", "Mod. Satisfied"))
str(mersubgss)
## 'data.frame': 40672 obs. of 3 variables:
## $ degree: Factor w/ 2 levels "noncollege","college": 2 1 2 1 1 2 1 1 1 1 ...
## $ satjob: Factor w/ 2 levels "dissatisfied",..: 1 2 2 2 2 1 2 2 2 2 ...
## $ year : int 1972 1972 1972 1972 1972 1972 1972 1972 1972 1972 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:16389] 2 5 11 17 18 20 24 25 33 34 ...
## .. ..- attr(*, "names")= chr [1:16389] "2" "5" "11" "17" ...
mertable <- table(mersubgss[, c("degree", "satjob")])
addmargins(mertable)
##             satjob
## degree       dissatisfied satisfied   Sum
##   noncollege         4441     24644 29085
##   college            1304     10283 11587
##   Sum                5745     34927 40672
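The level merging relies on assigning a named list to `levels()`: each list name becomes a new level, and the old levels listed under it are collapsed into it. The sketch below shows this on a small toy factor (hypothetical values) and then computes the within-group satisfaction rates from the merged counts printed above. The names `f`, `merged_counts`, and `merged_props` are illustrative.

```r
# Toy factor (hypothetical values) to show how a named list merges levels.
f <- factor(c("Bachelor", "High School", "Graduate", "Lt High School"))
levels(f) <- list(
  noncollege = c("Lt High School", "High School"),
  college    = c("Junior College", "Bachelor", "Graduate")
)
f          # each value is now either "college" or "noncollege"
levels(f)  # new levels appear in the order given in the list

# Merged counts copied from the table above.
merged_counts <- matrix(
  c(4441, 24644,
    1304, 10283),
  nrow = 2, byrow = TRUE,
  dimnames = list(degree = c("noncollege", "college"),
                  satjob = c("dissatisfied", "satisfied"))
)

# Proportion satisfied / dissatisfied within each degree group.
merged_props <- prop.table(merged_counts, margin = 1)
round(merged_props, 4)
```

Note that this grouping counts Junior College as college, which is a choice of the analysis rather than something the data dictate.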
plot(mertable,
     las = 1,
     col = "skyblue",
     main = "",
     xlab = "Degree",
     ylab = "Job satisfaction"
)
And now for the hypothesis test:
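# Note: inference() is not part of base R; it is assumed here to be the helper function supplied with the course materials.
inference(mersubgss$satjob, mersubgss$degree,
          est = "proportion", success = "satisfied",
          type = "ht", alternative = "greater", method = "theoretical",
          null = 0, order = c("college", "noncollege"),
          eda_plot = FALSE, inf_plot = FALSE)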
## Response variable: categorical, Explanatory variable: categorical
## Difference between two proportions -- success: satisfied
## Summary statistics:
##               x
## y              college noncollege   Sum
##   dissatisfied    1304       4441  5745
##   satisfied      10283      24644 34927
##   Sum            11587      29085 40672
## Observed difference between proportions (college-noncollege) = 0.0402
## H0: p_college - p_noncollege = 0
## HA: p_college - p_noncollege > 0
## Pooled proportion = 0.8587
## Check conditions:
## college : number of expected successes = 9950 ; number of expected failures = 1637
## noncollege : number of expected successes = 24977 ; number of expected failures = 4108
## Standard error = 0.004
## Test statistic: Z = 10.494
## p-value = 0
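The reported p-value of 0 is a rounding artifact: the true value is a very small positive number. The figures in the output can be reproduced in base R from the four counts in the summary table, as in the sketch below. The variable names are illustrative, and the 95% confidence interval at the end uses the usual unpooled standard error, which is an addition here and was not part of the output above.

```r
# Counts copied from the summary table above.
x_college    <- 10283; n_college    <- 11587
x_noncollege <- 24644; n_noncollege <- 29085

# Sample proportions of satisfied respondents and their difference.
p_college    <- x_college / n_college
p_noncollege <- x_noncollege / n_noncollege
diff_hat     <- p_college - p_noncollege

# Hypothesis test: under H0 the two proportions are equal, so the standard
# error is based on the pooled proportion.
p_pool  <- (x_college + x_noncollege) / (n_college + n_noncollege)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_college + 1 / n_noncollege))
z       <- diff_hat / se_pool

# One-sided p-value for HA: p_college - p_noncollege > 0.
p_value <- pnorm(z, lower.tail = FALSE)

# 95% confidence interval for the difference, using the unpooled standard error.
se_unpooled <- sqrt(p_college * (1 - p_college) / n_college +
                    p_noncollege * (1 - p_noncollege) / n_noncollege)
ci <- diff_hat + c(-1, 1) * qnorm(0.975) * se_unpooled

round(c(difference = diff_hat, pooled = p_pool, se = se_pool, z = z), 4)
p_value
ci
```

Since the interval lies entirely above zero, it is consistent with the one-sided test: in this sample, respondents in the college group report being satisfied at a higher rate than those in the noncollege group. As before, this is an association and not evidence of causation.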