Statistical inference with the GSS data

Setup

Load main packages

library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).

About

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972.
The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, inter-group tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

Sampling Design

General Social Survey Cumulative File, 1972-2012
The GSS is an area-probability sample that uses the NORC National Sampling Frame for an equal- probability multi-stage cluster sample of housing units for the entire United States.
Cluster Sampling: The population is divided into clusters (e.g., housing units), and a random selection of clusters is chosen. All individuals within the selected clusters are included in the sample.
Since the sample for the GSS is a cluster sample, results from this analysis can be generalized to the population
The GSS is a non-experimental study that aims to capture data from a representative sample of the population without manipulating variables or assigning individuals to specific conditions. Therefore, no random assignment was used and causality cannot be inferred
Also noted, until 2006 the GSS only sampled the English speaking population. This excludes data of non-English speakers and thus introduces cultural and linguistic bias. The experiences, attitudes, and behaviors of non-English speakers could differ significantly from those of English speakers. As a result, the conclusions based on GSS data may not fully capture the perspectives and characteristics of the entire population.

Part 2: Research question

Does the belief in the importance of work (jobmeans) vary depending on socioeconomic status (sei)?

Note that ‘jobmeans’ variable is part of the question below:

Would you please look at this list and tell me which one thing on this list you would most prefer in a job? b. Which comes next? c. Which is third most important? d. Which is fourth most important? e. Fifth most important?

jobinc: High income

jobsec: No danger of being fired

jobhour: Short working hours

jobpromo: chance of advancement

jobmeans: Work important and feel accomplishment

Relevance: It can provide insights into how societal and economic contexts influence perceptions of work.

Part 3: Exploratory data analysis

Overview

# str(gss)

summary(gss$sei)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   17.10   32.40   39.00   48.42   63.50   97.20   25784

summary(gss$jobmeans)

## Most Impt    Second     Third    Fourth     Fifth      NA's 
##      9641      3899      3245      2450      1333     36493

Checking missing pattern

library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(VIM)

## Loading required package: colorspace

## Loading required package: grid

## VIM is ready to use.

## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

## 
## Attaching package: 'VIM'

## The following object is masked from 'package:datasets':
## 
##     sleep

gss %>%
  select(jobmeans, sei) %>%
  md.pattern()

##         sei jobmeans      
## 6312      1        1     0
## 24965     1        0     1
## 14256     0        1     1
## 11528     0        0     2
##       25784    36493 62277

gss %>%
  select(jobmeans, sei) %>%
  aggr(col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), 
                  cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable     Count
##  jobmeans 0.6395436
##       sei 0.4518673

We see that a majority of data is missing in both variables ‘sei’ and ‘jobmeans’
Only 11% is complete, the rest is either missing in one or both variables.

Imputing missing values

To impute the missing values, mice package use an algorithm in a such a way that use information from other variables in the dataset to predict and impute the missing values. Therefore, you may not want to use a certain variable as predictors. For example, the ID variable does not have any predictive value.

One commonly used imputation method for continuous variables (eg. ‘sei’) is predictive mean matching (PMM). PMM works by first predicting the missing values based on the observed values and then matching the predicted values to the nearest observed values.

For ‘jobmeans’, which is an ordinal categorical variable with >2 levels, we use proportional odds model “polr”. The algorithm fits a proportional odds model to the observed data, treating the ordinal variable as the dependent variable and using other variables as predictors. It estimates the parameters of the model using maximum likelihood estimation.

# pick variables that are conceptually or theoretically related to 
# "jobmeans" (belief in the importance of work) and "sei" (socioeconomic index) as predictors
set1 <- gss %>%
  select(caseid, jobmeans, sei, age, rank, degree, income06, wrkstat, wrkslf)

# set up for imputation
init = mice(set1, maxit=0) 
meth = init$method
predM = init$predictorMatrix

# remove a variable as a predictor
predM[, c("caseid")] = 0
# skip a variable from imputation
meth[c("caseid")] = ""
# specify the methods for imputing the missing values
meth[c("sei")] = "pmm" 
meth[c("jobmeans")] = "polr" 
# run the multiple (m=5) imputation
# m=5 indicates we will create 5 imputed data sets
# the results from these multiple data sets can then be pooled to create one final imputed data set
set.seed(103)
imputed = mice(set1, method=meth, predictorMatrix=predM, m=5)

## 
##  iter imp variable
##   1   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   1   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   1   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   1   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   1   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf

# Create a dataset after imputation
imputed_set <- complete(imputed)

# Check for missings in the imputed dataset
sapply(imputed_set, function(x) sum(is.na(x)))

##   caseid jobmeans      sei      age     rank   degree income06  wrkstat 
##        0        0        0        0        0        0        0        0 
##   wrkslf 
##        0

stripplot(imputed, sei, pch = 19, xlab = "Imputation number")

stripplot(imputed, jobmeans, pch = 19, xlab = "Imputation number")

# blue is observed, red is imputed
densityplot(imputed, ~ sei)

densityplot(imputed, ~ jobmeans)

These graphs show that the distributions of our imputed data sets are similar to the distribution of the original data set
This means our imputation process is valid and introduces no bias to our new and imputed data set.

Data Manipulation

# turn 'sei' into a categorical variable
ready_data <- imputed_set %>%
  select(jobmeans, sei) %>%
  mutate(sei_cat = as.factor(case_when(
    sei <= 32.4 ~ "low SES",
    sei <= 63.5 ~ "medium SES",
    sei > 63.5 ~ "high SES"
  )), sei = NULL)

# change sei_cat to be ordinal
ready_data$sei_cat <- factor(ready_data$sei_cat, levels = c("low SES", "medium SES", "high SES"))

# levels(data$sei_cat)
summary(ready_data$sei_cat)

##    low SES medium SES   high SES 
##      15610      30080      11371

summary(ready_data$jobmeans)

## Most Impt    Second     Third    Fourth     Fifth 
##     26879     10869      8959      6693      3661

library(knitr)
library(summarytools)
kable(ctable(ready_data$sei_cat, ready_data$jobmeans, style = "rmarkdown", prop = "r"), format = "simple")

	Most Impt	Second	Third	Fourth	Fifth	Total
low SES	6870	2843	2454	2253	1190	15610
medium SES	14159	5807	4774	3414	1926	30080
high SES	5850	2219	1731	1026	545	11371
Total	26879	10869	8959	6693	3661	57061

	Most Impt	Second	Third	Fourth	Fifth	Total
low SES	0.4401025	0.1821268	0.1572069	0.1443306	0.0762332	1
medium SES	0.4707114	0.1930519	0.1587101	0.1134973	0.0640293	1
high SES	0.5144666	0.1951455	0.1522294	0.0902295	0.0479289	1
Total	0.4710573	0.1904804	0.1570074	0.1172955	0.0641594	1

Across the three socioeconomic groups that report “Work important and feel accomplishment” as the most important factor, we can see an uptrend in terms of proportion.
Across the ranking (left to right) for “Work important and feel accomplishment”, there is a consistent downtrend for all three socioeconomic groups.

library(vcd)
# Create the mosaic plot
mosaicplot(table(ready_data$sei_cat, ready_data$jobmeans),
           main = "Mosaic Plot: jobmeans vs. sei_cat",
           xlab = "sei_cat", ylab = "jobmeans")

The observed difference is that while “Work important and feel accomplishment” tend to be ranked as most important across all socioeconomic groups, the likelihood of a respondent ranking it like so seems to depend on their socioeconomic group. ie. one in ‘high SES’ might be more likely to rank it as most important than one in ‘low SES’.

Part 4: Inference

Does the belief in the importance of work (jobmeans) vary depending on socioeconomic status (sei)?

We assume:

‘sei’ is explanatory

‘jobmeans’ is response

H0: ‘sei’ and ‘jobmeans’ are independent. Belief in the importance of work does not vary by socioeconomic status.

Ha: ‘sei’ and ‘jobmeans’ are dependent. Belief in the importance of work does vary by socioeconomic status.

Conditions for the chi-square test:

Independence: Each observation in the data set represents a single, unique adult, and GSS is an equal- probability multi-stage cluster sample of housing units for the entire United States. Thus, sampled observations are independent

Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases. This is true.

Perform chi-square test of independence using the theoretical method.

chisq_result <- chisq.test(x = ready_data$sei_cat, y = ready_data$jobmeans)
chisq_result

## 
##  Pearson's Chi-squared test
## 
## data:  ready_data$sei_cat and ready_data$jobmeans
## X-squared = 341.67, df = 8, p-value < 2.2e-16

Since p-value is very small (p-value < 2.2e-16), we reject H0 and conclude there is convincing evidence that belief in the importance of work does vary by socioeconomic status.

The effect size and the power are calculated for the completed hypothesis test:

# report effect size and power of test


# A contingency table with observed frequencies
observed <- table(ready_data$jobmeans, ready_data$sei_cat)

# # Calculate the effect size using Cramer's V formula
library(DescTools)
effect_size <- CramerV(observed)
# can also possibly use ES.w2 to calculate effect size

# Compute power of test or determine parameters to obtain target power (same as power.anova.test).
library(pwr)
pwr.chisq.test(w = effect_size, df = 8, sig.level = .05, N = nrow(ready_data))

## 
##      Chi squared power calculation 
## 
##               w = 0.05471671
##               N = 57061
##              df = 8
##       sig.level = 0.05
##           power = 1
## 
## NOTE: N is the number of observations

Note:

Effect size is typically used to quantify the size of an effect or the degree of association between variables.
Cramer’s V of approximately .055 indicates a small effect size (0.1 small, 0.3 medium, 0.5 large), suggesting a weak association between the ‘sei’ and ‘jobmeans’ variables.
Power is the probability of correctly rejecting a false null hypothesis in a statistical hypothesis test. It is the ability of a statistical test to detect a true effect or relationship if it exists.
Power of 1 indicates that this test has no problem correctly detecting an effect or difference.