Setup

Background

This extract of the General Social Survey (GSS) Cumulative File 1972-2012 provides a sample of selected indicators in the GSS with the goal of providing a convenient data resource for students learning statistical reasoning using the R language. Unlike the full General Social Survey Cumulative File, we have removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R. Our hope is that this will allow students to focus on statistical concepts without having to (initially) be concerned about some of the data management and interpretation issues associated with missing data and factor variables in R. Other than the two modifications mentioned above, all data and coding come from the original dataset. Students and researchers seeking to conduct research or explore the full codebook behind the full General Social Survey Cumulative File are urged to consult the original dataset at the citation that follows:

These data treat the observations from each survey year as independent samples.

Data Citation

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11.

Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

Load packages

library(statsr)
library(skimr)
library(mosaic)
library(ggpubr)

Load data

load("gss.Rdata")

Part 1: Data

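The data used in this project come from an observational study: respondents were surveyed in repeated waves from 1972 to 2012, and none were randomly assigned to any condition. The results can therefore show association but not causation.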

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.


Part 2: Research question

Come up with a research question that you want to answer using these data. You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. You are welcome to create new variables based on existing ones. Along with your research question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.

I would like to determine whether there is an association between a respondent's highest degree and their employment status.

This question is of interest to me because I would like to know whether the relationship between these two variables is statistically significant.


Part 3: Exploratory data analysis

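Conditions for the chi-square test of independence:
1). Independence: the data come from a reputable source that reports random selection of respondents, and each respondent falls into exactly one work status and one degree category. ✓
2). Sample size: every cell has an expected count ≥ 5; the smallest expected count (shown in Part 4) is about 59.7. ✓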
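
The sample size condition can be checked from the table margins alone. As a sketch, using the row and column totals of the work status by degree table that appears later in this report (56,037 respondents with both variables recorded), the expected count under independence and its smallest value (the Other, Junior College cell) are:

```latex
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
E_{\text{Other},\,\text{Junior College}} = \frac{1090 \times 3070}{56037} \approx 59.72 \geq 5
```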

The following tables use the tally() function to count how often each work status and degree category occurs in its column, first as raw counts and then as proportions.

tally(~wrkstat, data = gss)
## wrkstat
## Working Fulltime Working Parttime Temp Not Working Unempl, Laid Off 
##            28207             5842             1213             1873 
##          Retired           School    Keeping House            Other 
##             7642             1751             9387             1132 
##             <NA> 
##               14
tally(~degree, data = gss)
## degree
## Lt High School    High School Junior College       Bachelor       Graduate 
##          11822          29287           3070           8002           3870 
##           <NA> 
##           1010
tally(~wrkstat, data = gss, format = "proportion")
## wrkstat
## Working Fulltime Working Parttime Temp Not Working Unempl, Laid Off 
##     0.4943306286     0.1023816617     0.0212579520     0.0328245211 
##          Retired           School    Keeping House            Other 
##     0.1339268502     0.0306864584     0.1645081579     0.0198384185 
##             <NA> 
##     0.0002453515
tally(~degree, data = gss, format = "proportion")
## degree
## Lt High School    High School Junior College       Bachelor       Graduate 
##     0.20718179     0.51325774     0.05380207     0.14023589     0.06782216 
##           <NA> 
##     0.01770036

Next, I plotted bar charts of the work status counts, with one panel per degree category.

ggplot(gss, aes(y = wrkstat, fill = wrkstat)) +
  geom_bar() +
  facet_wrap(~ degree) 

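# Cross-tabulate work status (rows) by degree (columns); table() drops NA by default
tab <- table(gss$wrkstat, gss$degree)
# Long format: one row per (work status, degree) pair with its frequency
new_tab <- data.frame(tab)

# Wide format: keeps the 8 x 5 layout of the contingency table
df_wide <- as.data.frame.matrix(tab)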
df_wide
##                  Lt High School High School Junior College Bachelor Graduate
## Working Fulltime           3438       14829           1911     5106     2642
## Working Parttime            997        3292            336      799      338
## Temp Not Working            227         602             85      199       87
## Unempl, Laid Off            510        1023             80      164       58
## Retired                    2596        3277            244      762      488
## School                      323        1063            104      192       59
## Keeping House              3322        4657            274      698      165
## Other                       407         539             36       75       33
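
Raw counts are hard to compare across degree groups because the groups differ greatly in size. One way to put them on a common scale is to convert each column to proportions. The following is a minimal base-R sketch, not part of the original analysis; it assumes the counts are typed in by hand from the table above, and the names observed and col_props are illustrative:

```r
# Observed counts copied from the df_wide table above
# (rows: work status, columns: degree).
observed <- matrix(
  c(3438, 14829, 1911, 5106, 2642,
     997,  3292,  336,  799,  338,
     227,   602,   85,  199,   87,
     510,  1023,   80,  164,   58,
    2596,  3277,  244,  762,  488,
     323,  1063,  104,  192,   59,
    3322,  4657,  274,  698,  165,
     407,   539,   36,   75,   33),
  nrow = 8, byrow = TRUE,
  dimnames = list(
    wrkstat = c("Working Fulltime", "Working Parttime", "Temp Not Working",
                "Unempl, Laid Off", "Retired", "School", "Keeping House", "Other"),
    degree  = c("Lt High School", "High School", "Junior College",
                "Bachelor", "Graduate")
  )
)

# Share of each work status within each degree group (every column sums to 1).
col_props <- prop.table(observed, margin = 2)
round(col_props, 3)
```

For instance, full-time workers make up about 29% of the Lt High School column (3438 of 11820) but about 68% of the Graduate column (2642 of 3870).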

Part 4: Inference

Research question:
Is there an association between a respondent's highest degree and their employment status?

1).
H0: The working and degree status variables are independent.
Ha: The working and degree status variables are dependent.
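
Under H0, the test compares the observed count O_ij in each cell with the count E_ij expected under independence. For the 8 work status categories and 5 degree categories in this table, the Pearson statistic and its degrees of freedom are:

```latex
\chi^2 = \sum_{i=1}^{8} \sum_{j=1}^{5} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (8 - 1)(5 - 1) = 28
```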

2).

dep <- chisq.test(df_wide)
dep$expected
##                  Lt High School High School Junior College  Bachelor   Graduate
## Working Fulltime      5890.4888  14592.6643     1529.93237 3984.3027 1928.61181
## Working Parttime      1215.3905   3010.9193      315.67250  822.0852  397.93244
## Temp Not Working       253.1185    627.0571       65.74228  171.2083   82.87382
## Unempl, Laid Off       387.0603    958.8749      100.53090  261.8060  126.72788
## Retired               1553.9365   3849.6082      403.60280 1051.0763  508.77617
## School                 367.2327    909.7554       95.38109  248.3947  120.23609
## Keeping House         1922.8567   4763.5439      499.42217 1300.6125  629.56475
## Other                  229.9159    569.5769       59.71590  155.5142   75.27705
dep$residuals
##                  Lt High School High School Junior College    Bachelor
## Working Fulltime     -31.954451    1.956419      9.7423998  17.7704946
## Working Parttime      -6.264348    5.122494      1.1441041  -0.8051481
## Temp Not Working      -1.641670   -1.000640      2.3751036   2.1239904
## Unempl, Laid Off       6.248887    2.070844     -2.0476616  -6.0447152
## Retired               26.434892   -9.228886     -7.9444423  -8.9165207
## School                -2.308198    5.080693      0.8825135  -3.5782219
## Keeping House         31.907204   -1.543703    -10.0870161 -16.7095267
## Other                 11.678711   -1.281200     -3.0689842  -6.4563566
##                     Graduate
## Working Fulltime  16.2443936
## Working Parttime  -3.0043967
## Temp Not Working   0.4532523
## Unempl, Laid Off  -6.1051571
## Retired           -0.9210899
## School            -5.5845740
## Keeping House    -18.5150996
## Other             -4.8727415
dep
## 
##  Pearson's Chi-squared test
## 
## data:  df_wide
## X-squared = 4871.8, df = 28, p-value < 2.2e-16
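
As a cross-check, the expected counts, residuals, and test statistic can be rebuilt from their definitions. The following is a minimal base-R sketch, assuming the observed counts are typed in by hand from the df_wide table; the names observed, expected, pearson_resid, chi_sq, and dof are illustrative and not from the original analysis:

```r
# Observed counts copied from df_wide (8 work statuses x 5 degree levels).
observed <- matrix(
  c(3438, 14829, 1911, 5106, 2642,
     997,  3292,  336,  799,  338,
     227,   602,   85,  199,   87,
     510,  1023,   80,  164,   58,
    2596,  3277,  244,  762,  488,
     323,  1063,  104,  192,   59,
    3322,  4657,  274,  698,  165,
     407,   539,   36,   75,   33),
  nrow = 8, byrow = TRUE
)

n <- sum(observed)

# Expected count under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residuals: how far each cell is from its expected count,
# in units of the square root of the expected count.
pearson_resid <- (observed - expected) / sqrt(expected)

# Test statistic, degrees of freedom, and upper-tail p-value.
chi_sq  <- sum(pearson_resid^2)
dof     <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Compare the hand computation with the built-in test.
builtin <- chisq.test(observed)
all.equal(chi_sq, unname(builtin$statistic))
```

If the counts are transcribed correctly, chi_sq and dof should agree with the X-squared and df values in the chisq.test() output above.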

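Because the p-value (< 2.2e-16) is far below the 5% significance level, we reject H0. The data provide strong evidence of an association between degree and work status. The largest residuals are in the Lt High School column, which has far fewer full-time workers than expected (-31.95) and far more respondents keeping house (31.91) or retired (26.43). Because the GSS is observational, this result shows association, not causation.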