library(ggplot2)
library(dplyr)
library(tidyr)
library(statsr)load("gss.Rdata")Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).
About
The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972.
The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, inter-group tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
Sampling Design
General Social Survey Cumulative File, 1972-2012
The GSS is an area-probability sample that uses the NORC National Sampling Frame for an equal- probability multi-stage cluster sample of housing units for the entire United States.
Cluster Sampling: The population is divided into clusters (e.g., housing units), and a random selection of clusters is chosen. All individuals within the selected clusters are included in the sample.
Since the sample for the GSS is a cluster sample, results from this analysis can be generalized to the population
The GSS is a non-experimental study that aims to capture data from a representative sample of the population without manipulating variables or assigning individuals to specific conditions. Therefore, no random assignment was used and causality cannot be inferred
Also noted, until 2006 the GSS only sampled the English speaking population. This excludes data of non-English speakers and thus introduces cultural and linguistic bias. The experiences, attitudes, and behaviors of non-English speakers could differ significantly from those of English speakers. As a result, the conclusions based on GSS data may not fully capture the perspectives and characteristics of the entire population.
Does the belief in the importance of work (jobmeans) vary depending on socioeconomic status (sei)?
Note that ‘jobmeans’ variable is part of the question below:
Would you please look at this list and tell me which one thing on this list you would most prefer in a job? b. Which comes next? c. Which is third most important? d. Which is fourth most important? e. Fifth most important?
jobinc: High income
jobsec: No danger of being fired
jobhour: Short working hours
jobpromo: chance of advancement
jobmeans: Work important and feel accomplishment
Relevance: It can provide insights into how societal and economic contexts influence perceptions of work.
Overview
# str(gss)
summary(gss$sei)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 17.10 32.40 39.00 48.42 63.50 97.20 25784
summary(gss$jobmeans)## Most Impt Second Third Fourth Fifth NA's
## 9641 3899 3245 2450 1333 36493
Checking missing pattern
library(mice)##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(VIM)## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
gss %>%
select(jobmeans, sei) %>%
md.pattern()## sei jobmeans
## 6312 1 1 0
## 24965 1 0 1
## 14256 0 1 1
## 11528 0 0 2
## 25784 36493 62277
gss %>%
select(jobmeans, sei) %>%
aggr(col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data),
cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))##
## Variables sorted by number of missings:
## Variable Count
## jobmeans 0.6395436
## sei 0.4518673
We see that a majority of data is missing in both variables ‘sei’ and ‘jobmeans’
Only 11% is complete, the rest is either missing in one or both variables.
Imputing missing values
To impute the missing values, mice package use an algorithm in a such a way that use information from other variables in the dataset to predict and impute the missing values. Therefore, you may not want to use a certain variable as predictors. For example, the ID variable does not have any predictive value.
One commonly used imputation method for continuous variables (eg. ‘sei’) is predictive mean matching (PMM). PMM works by first predicting the missing values based on the observed values and then matching the predicted values to the nearest observed values.
For ‘jobmeans’, which is an ordinal categorical variable with >2 levels, we use proportional odds model “polr”. The algorithm fits a proportional odds model to the observed data, treating the ordinal variable as the dependent variable and using other variables as predictors. It estimates the parameters of the model using maximum likelihood estimation.
# pick variables that are conceptually or theoretically related to
# "jobmeans" (belief in the importance of work) and "sei" (socioeconomic index) as predictors
set1 <- gss %>%
select(caseid, jobmeans, sei, age, rank, degree, income06, wrkstat, wrkslf)
# set up for imputation
init = mice(set1, maxit=0)
meth = init$method
predM = init$predictorMatrix
# remove a variable as a predictor
predM[, c("caseid")] = 0
# skip a variable from imputation
meth[c("caseid")] = ""
# specify the methods for imputing the missing values
meth[c("sei")] = "pmm"
meth[c("jobmeans")] = "polr"
# run the multiple (m=5) imputation
# m=5 indicates we will create 5 imputed data sets
# the results from these multiple data sets can then be pooled to create one final imputed data set
set.seed(103)
imputed = mice(set1, method=meth, predictorMatrix=predM, m=5)##
## iter imp variable
## 1 1 jobmeans sei age rank degree income06 wrkstat wrkslf
## 1 2 jobmeans sei age rank degree income06 wrkstat wrkslf
## 1 3 jobmeans sei age rank degree income06 wrkstat wrkslf
## 1 4 jobmeans sei age rank degree income06 wrkstat wrkslf
## 1 5 jobmeans sei age rank degree income06 wrkstat wrkslf
## 2 1 jobmeans sei age rank degree income06 wrkstat wrkslf
## 2 2 jobmeans sei age rank degree income06 wrkstat wrkslf
## 2 3 jobmeans sei age rank degree income06 wrkstat wrkslf
## 2 4 jobmeans sei age rank degree income06 wrkstat wrkslf
## 2 5 jobmeans sei age rank degree income06 wrkstat wrkslf
## 3 1 jobmeans sei age rank degree income06 wrkstat wrkslf
## 3 2 jobmeans sei age rank degree income06 wrkstat wrkslf
## 3 3 jobmeans sei age rank degree income06 wrkstat wrkslf
## 3 4 jobmeans sei age rank degree income06 wrkstat wrkslf
## 3 5 jobmeans sei age rank degree income06 wrkstat wrkslf
## 4 1 jobmeans sei age rank degree income06 wrkstat wrkslf
## 4 2 jobmeans sei age rank degree income06 wrkstat wrkslf
## 4 3 jobmeans sei age rank degree income06 wrkstat wrkslf
## 4 4 jobmeans sei age rank degree income06 wrkstat wrkslf
## 4 5 jobmeans sei age rank degree income06 wrkstat wrkslf
## 5 1 jobmeans sei age rank degree income06 wrkstat wrkslf
## 5 2 jobmeans sei age rank degree income06 wrkstat wrkslf
## 5 3 jobmeans sei age rank degree income06 wrkstat wrkslf
## 5 4 jobmeans sei age rank degree income06 wrkstat wrkslf
## 5 5 jobmeans sei age rank degree income06 wrkstat wrkslf
# Create a dataset after imputation
imputed_set <- complete(imputed)
# Check for missings in the imputed dataset
sapply(imputed_set, function(x) sum(is.na(x)))## caseid jobmeans sei age rank degree income06 wrkstat
## 0 0 0 0 0 0 0 0
## wrkslf
## 0
stripplot(imputed, sei, pch = 19, xlab = "Imputation number")stripplot(imputed, jobmeans, pch = 19, xlab = "Imputation number")# blue is observed, red is imputed
densityplot(imputed, ~ sei)densityplot(imputed, ~ jobmeans)Data Manipulation
# turn 'sei' into a categorical variable
ready_data <- imputed_set %>%
select(jobmeans, sei) %>%
mutate(sei_cat = as.factor(case_when(
sei <= 32.4 ~ "low SES",
sei <= 63.5 ~ "medium SES",
sei > 63.5 ~ "high SES"
)), sei = NULL)
# change sei_cat to be ordinal
ready_data$sei_cat <- factor(ready_data$sei_cat, levels = c("low SES", "medium SES", "high SES"))
# levels(data$sei_cat)
summary(ready_data$sei_cat)## low SES medium SES high SES
## 15610 30080 11371
summary(ready_data$jobmeans)## Most Impt Second Third Fourth Fifth
## 26879 10869 8959 6693 3661
library(knitr)
library(summarytools)
kable(ctable(ready_data$sei_cat, ready_data$jobmeans, style = "rmarkdown", prop = "r"), format = "simple")| Most Impt | Second | Third | Fourth | Fifth | Total | |
|---|---|---|---|---|---|---|
| low SES | 6870 | 2843 | 2454 | 2253 | 1190 | 15610 |
| medium SES | 14159 | 5807 | 4774 | 3414 | 1926 | 30080 |
| high SES | 5850 | 2219 | 1731 | 1026 | 545 | 11371 |
| Total | 26879 | 10869 | 8959 | 6693 | 3661 | 57061 |
| Most Impt | Second | Third | Fourth | Fifth | Total | |
|---|---|---|---|---|---|---|
| low SES | 0.4401025 | 0.1821268 | 0.1572069 | 0.1443306 | 0.0762332 | 1 |
| medium SES | 0.4707114 | 0.1930519 | 0.1587101 | 0.1134973 | 0.0640293 | 1 |
| high SES | 0.5144666 | 0.1951455 | 0.1522294 | 0.0902295 | 0.0479289 | 1 |
| Total | 0.4710573 | 0.1904804 | 0.1570074 | 0.1172955 | 0.0641594 | 1 |
Across the three socioeconomic groups that report “Work important and feel accomplishment” as the most important factor, we can see an uptrend in terms of proportion.
Across the ranking (left to right) for “Work important and feel accomplishment”, there is a consistent downtrend for all three socioeconomic groups.
library(vcd)
# Create the mosaic plot
mosaicplot(table(ready_data$sei_cat, ready_data$jobmeans),
main = "Mosaic Plot: jobmeans vs. sei_cat",
xlab = "sei_cat", ylab = "jobmeans")Does the belief in the importance of work (jobmeans) vary depending on socioeconomic status (sei)?
We assume:
‘sei’ is explanatory
‘jobmeans’ is response
H0: ‘sei’ and ‘jobmeans’ are independent. Belief in the importance of work does not vary by socioeconomic status.
Ha: ‘sei’ and ‘jobmeans’ are dependent. Belief in the importance of work does vary by socioeconomic status.
Conditions for the chi-square test:
Independence: Each observation in the data set represents a single, unique adult, and GSS is an equal- probability multi-stage cluster sample of housing units for the entire United States. Thus, sampled observations are independent
Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases. This is true.
Perform chi-square test of independence using the theoretical method.
chisq_result <- chisq.test(x = ready_data$sei_cat, y = ready_data$jobmeans)
chisq_result##
## Pearson's Chi-squared test
##
## data: ready_data$sei_cat and ready_data$jobmeans
## X-squared = 341.67, df = 8, p-value < 2.2e-16
Since p-value is very small (p-value < 2.2e-16), we reject H0 and conclude there is convincing evidence that belief in the importance of work does vary by socioeconomic status.
The effect size and the power are calculated for the completed hypothesis test:
# report effect size and power of test
# A contingency table with observed frequencies
observed <- table(ready_data$jobmeans, ready_data$sei_cat)
# # Calculate the effect size using Cramer's V formula
library(DescTools)
effect_size <- CramerV(observed)
# can also possibly use ES.w2 to calculate effect size
# Compute power of test or determine parameters to obtain target power (same as power.anova.test).
library(pwr)
pwr.chisq.test(w = effect_size, df = 8, sig.level = .05, N = nrow(ready_data))##
## Chi squared power calculation
##
## w = 0.05471671
## N = 57061
## df = 8
## sig.level = 0.05
## power = 1
##
## NOTE: N is the number of observations
Note:
Effect size is typically used to quantify the size of an effect or the degree of association between variables.
Cramer’s V of approximately .055 indicates a small effect size (0.1 small, 0.3 medium, 0.5 large), suggesting a weak association between the ‘sei’ and ‘jobmeans’ variables.
Power is the probability of correctly rejecting a false null hypothesis in a statistical hypothesis test. It is the ability of a statistical test to detect a true effect or relationship if it exists.
Power of 1 indicates that this test has no problem correctly detecting an effect or difference.