Setup

Load main packages

library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).

About

Sampling Design


Part 2: Research question

Does the belief in the importance of work (jobmeans) vary depending on socioeconomic status (sei)?

Note that ‘jobmeans’ variable is part of the question below:

Would you please look at this list and tell me which one thing on this list you would most prefer in a job? b. Which comes next? c. Which is third most important? d. Which is fourth most important? e. Fifth most important?

jobinc: High income

jobsec: No danger of being fired

jobhour: Short working hours

jobpromo: chance of advancement

jobmeans: Work important and feel accomplishment

Relevance: It can provide insights into how societal and economic contexts influence perceptions of work.


Part 3: Exploratory data analysis

Overview

# str(gss)

summary(gss$sei)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   17.10   32.40   39.00   48.42   63.50   97.20   25784
summary(gss$jobmeans)
## Most Impt    Second     Third    Fourth     Fifth      NA's 
##      9641      3899      3245      2450      1333     36493

Checking missing pattern

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
gss %>%
  select(jobmeans, sei) %>%
  md.pattern()

##         sei jobmeans      
## 6312      1        1     0
## 24965     1        0     1
## 14256     0        1     1
## 11528     0        0     2
##       25784    36493 62277
gss %>%
  select(jobmeans, sei) %>%
  aggr(col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), 
                  cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable     Count
##  jobmeans 0.6395436
##       sei 0.4518673

Imputing missing values

To impute the missing values, mice package use an algorithm in a such a way that use information from other variables in the dataset to predict and impute the missing values. Therefore, you may not want to use a certain variable as predictors. For example, the ID variable does not have any predictive value.

One commonly used imputation method for continuous variables (eg. ‘sei’) is predictive mean matching (PMM). PMM works by first predicting the missing values based on the observed values and then matching the predicted values to the nearest observed values.

For ‘jobmeans’, which is an ordinal categorical variable with >2 levels, we use proportional odds model “polr”. The algorithm fits a proportional odds model to the observed data, treating the ordinal variable as the dependent variable and using other variables as predictors. It estimates the parameters of the model using maximum likelihood estimation.

# pick variables that are conceptually or theoretically related to 
# "jobmeans" (belief in the importance of work) and "sei" (socioeconomic index) as predictors
set1 <- gss %>%
  select(caseid, jobmeans, sei, age, rank, degree, income06, wrkstat, wrkslf)

# set up for imputation
init = mice(set1, maxit=0) 
meth = init$method
predM = init$predictorMatrix

# remove a variable as a predictor
predM[, c("caseid")] = 0
# skip a variable from imputation
meth[c("caseid")] = ""
# specify the methods for imputing the missing values
meth[c("sei")] = "pmm" 
meth[c("jobmeans")] = "polr" 
# run the multiple (m=5) imputation
# m=5 indicates we will create 5 imputed data sets
# the results from these multiple data sets can then be pooled to create one final imputed data set
set.seed(103)
imputed = mice(set1, method=meth, predictorMatrix=predM, m=5)
## 
##  iter imp variable
##   1   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   1   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   1   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   1   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   1   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   2   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   3   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   4   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   1  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   2  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   3  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   4  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
##   5   5  jobmeans  sei  age  rank  degree  income06  wrkstat  wrkslf
# Create a dataset after imputation
imputed_set <- complete(imputed)

# Check for missings in the imputed dataset
sapply(imputed_set, function(x) sum(is.na(x)))
##   caseid jobmeans      sei      age     rank   degree income06  wrkstat 
##        0        0        0        0        0        0        0        0 
##   wrkslf 
##        0
stripplot(imputed, sei, pch = 19, xlab = "Imputation number")

stripplot(imputed, jobmeans, pch = 19, xlab = "Imputation number")

# blue is observed, red is imputed
densityplot(imputed, ~ sei)

densityplot(imputed, ~ jobmeans)

Data Manipulation

# turn 'sei' into a categorical variable
ready_data <- imputed_set %>%
  select(jobmeans, sei) %>%
  mutate(sei_cat = as.factor(case_when(
    sei <= 32.4 ~ "low SES",
    sei <= 63.5 ~ "medium SES",
    sei > 63.5 ~ "high SES"
  )), sei = NULL)

# change sei_cat to be ordinal
ready_data$sei_cat <- factor(ready_data$sei_cat, levels = c("low SES", "medium SES", "high SES"))

# levels(data$sei_cat)
summary(ready_data$sei_cat)
##    low SES medium SES   high SES 
##      15610      30080      11371
summary(ready_data$jobmeans)
## Most Impt    Second     Third    Fourth     Fifth 
##     26879     10869      8959      6693      3661
library(knitr)
library(summarytools)
kable(ctable(ready_data$sei_cat, ready_data$jobmeans, style = "rmarkdown", prop = "r"), format = "simple")
Most Impt Second Third Fourth Fifth Total
low SES 6870 2843 2454 2253 1190 15610
medium SES 14159 5807 4774 3414 1926 30080
high SES 5850 2219 1731 1026 545 11371
Total 26879 10869 8959 6693 3661 57061
Most Impt Second Third Fourth Fifth Total
low SES 0.4401025 0.1821268 0.1572069 0.1443306 0.0762332 1
medium SES 0.4707114 0.1930519 0.1587101 0.1134973 0.0640293 1
high SES 0.5144666 0.1951455 0.1522294 0.0902295 0.0479289 1
Total 0.4710573 0.1904804 0.1570074 0.1172955 0.0641594 1
library(vcd)
# Create the mosaic plot
mosaicplot(table(ready_data$sei_cat, ready_data$jobmeans),
           main = "Mosaic Plot: jobmeans vs. sei_cat",
           xlab = "sei_cat", ylab = "jobmeans")


Part 4: Inference

Does the belief in the importance of work (jobmeans) vary depending on socioeconomic status (sei)?

We assume:

‘sei’ is explanatory

‘jobmeans’ is response

H0: ‘sei’ and ‘jobmeans’ are independent. Belief in the importance of work does not vary by socioeconomic status.

Ha: ‘sei’ and ‘jobmeans’ are dependent. Belief in the importance of work does vary by socioeconomic status.

Conditions for the chi-square test:

Independence: Each observation in the data set represents a single, unique adult, and GSS is an equal- probability multi-stage cluster sample of housing units for the entire United States. Thus, sampled observations are independent

Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases. This is true.

Perform chi-square test of independence using the theoretical method.

chisq_result <- chisq.test(x = ready_data$sei_cat, y = ready_data$jobmeans)
chisq_result
## 
##  Pearson's Chi-squared test
## 
## data:  ready_data$sei_cat and ready_data$jobmeans
## X-squared = 341.67, df = 8, p-value < 2.2e-16

Since p-value is very small (p-value < 2.2e-16), we reject H0 and conclude there is convincing evidence that belief in the importance of work does vary by socioeconomic status.

The effect size and the power are calculated for the completed hypothesis test:

# report effect size and power of test


# A contingency table with observed frequencies
observed <- table(ready_data$jobmeans, ready_data$sei_cat)

# # Calculate the effect size using Cramer's V formula
library(DescTools)
effect_size <- CramerV(observed)
# can also possibly use ES.w2 to calculate effect size

# Compute power of test or determine parameters to obtain target power (same as power.anova.test).
library(pwr)
pwr.chisq.test(w = effect_size, df = 8, sig.level = .05, N = nrow(ready_data))
## 
##      Chi squared power calculation 
## 
##               w = 0.05471671
##               N = 57061
##              df = 8
##       sig.level = 0.05
##           power = 1
## 
## NOTE: N is the number of observations

Note: