Statistical inference with the GSS data (by David Harper)

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(gmodels) # for CrossTable()

Load data

We load the supplied data frame, an extract of the General Social Survey (GSS) Cumulative File 1972-2012 that (emphasis mine) “provides a sample of selected indicators in the GSS with the goal of providing a convenient data resource for students learning statistical reasoning using the R language. Unlike the full General Social Survey Cumulative File, we have removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R.”

load("gss.Rdata")

Part 1: Data

We will be making an inference for categorical data where each variable has two levels: marital status (i.e., either married or not married) and “confidence in major companies” (i.e., either “a great deal” or not). This calls for a test of the difference between two proportions, for which we want to assume that the sampling distribution of the difference between the sample proportions (p1-p2) is nearly normal. To make this assumption, we need to confirm that two conditions are met; see 6.2.1 of OpenIntro Statistics, 3rd Edition:

Each proportion separately follows a normal model, and
The two samples are independent of each other

In order to assume each proportion follows a normal model, in turn we need to meet these two conditions (see 6.1.1.):

The sample observations are independent and
Success-failure condition: We expect to see at least 10 successes and 10 failures in the sample; i.e., np >= 10 and n(1-p) >= 10.

Based on my interpretation of the GSS codebook, in particular Appendix A, it is reasonable to assume independence both within each sample and between samples. Further, the success-failure condition is easily met. See the cross-table below; among the four success and failure counts (two per group), the lowest is 104 = p1*n = 0.176 * 590, the number of married respondents with a great deal of confidence. Therefore, our assumption of near normality appears to be justified.
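
The success-failure check can be reproduced from the counts in the cross-table. The sketch below is illustrative only: the variable names are my own, and the counts are typed in by hand from the 2012 cross-table rather than computed from gss.

```r
# Counts copied by hand from the 2012 cross-table (names are illustrative).
n_married     <- 590
n_not_married <- 105 + 215 + 46 + 347   # widowed + divorced + separated + never married
x_married     <- 104                    # married, "A Great Deal"
x_not_married <- 16 + 31 + 6 + 59       # not married, "A Great Deal"

# Successes and failures in each group.
sf_counts <- c(
  married_success     = x_married,
  married_failure     = n_married - x_married,
  not_married_success = x_not_married,
  not_married_failure = n_not_married - x_not_married
)
sf_counts

# The condition holds if every count is at least 10.
sf_ok <- all(sf_counts >= 10)
sf_ok
```

The smallest of the four counts is the married "success" count, so the condition holds with a wide margin.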

Part 2: Research question

My research question is: are married people more likely to have a great deal of confidence in major companies? I am interested because marriage itself is a social institution. I wonder whether being married reflects an affirmative belief in institutions that might translate into an affirmative belief (i.e., confidence) in corporate institutions. My instinct is that, yes, married people might be more likely, even if only slightly, to have confidence in companies.

Also, after looking at the dataset, I decided to ask the question only for the most recent year collected, which is 2012. It occurred to me that time is an important dimension; e.g., quick EDA shows that several of the confidence-based variables (“Confidence in Institutions”) have declined over time. I didn’t want to commingle the effect of time with the inference.

Part 3: Exploratory data analysis

Because I am interested in comparing two proportions, I will first partition both variables, marital and conbus, into two categories. I will filter “not married” observations into dataframes nm_c and nm_nc, where “not married” includes widowed, divorced, separated, or never married. Similarly, conbus is a factor with three levels, so “less than great confidence” observations will filter into dataframes m_nc and nm_nc, which include either “only some” confidence or “hardly any” confidence.

gss_12 <- gss %>%
  select(caseid, year, age, sex, race, marital, conbus) %>% filter(year == 2012)
m_c <- gss_12 %>% filter(marital == "Married", conbus == "A Great Deal")
m_nc <- gss_12 %>% filter(marital == "Married", conbus != "A Great Deal", !is.na(conbus))
nm_c <- gss_12 %>% filter(marital != "Married", !is.na(marital), conbus == "A Great Deal")
nm_nc <- gss_12 %>% filter(marital != "Married", !is.na(marital), conbus != "A Great Deal", !is.na(conbus))

First we will look at the raw counts of business confidence by marital status (please note the marital groupings are overlaid, not stacked). While the sample contains more married people than any other category, it is not obvious from the plot, at least to me, that the confidence proportions vary by marital status.

gss_12 %>% ggplot(aes(x = conbus, fill = marital)) +
  geom_bar(position = "identity", alpha = 0.4) + # geom_bar counts rows by default, so no stat = "count" is needed
  labs(title = "Confidence by Marital Status", x = "Confidence in Business (conbus)", y = "Count (overlapping)")

Below we view the same data but more precisely with a cross-table. Here we can observe there is a slight difference:

17.6% of Married people have a Great Deal of confidence in business; i.e., 104 out of 590
By comparison, the other groups exhibit lower proportions: 15.2% among Widowed, 14.4% among Divorced, 13.0% among Separated, and 17.0% among Never Married. We will see below that the aggregated statistic (of these non-married groups) is about 15.7%.

In summary, 17.6% of married people have (had) a Great Deal of confidence, compared to 15.7% among those not married. The question is: the married proportion is higher, but is it a (statistically) significant difference?

CrossTable(gss_12$marital, gss_12$conbus)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1303 
## 
##  
##                | gss_12$conbus 
## gss_12$marital | A Great Deal |    Only Some |   Hardly Any |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|
##        Married |          104 |          375 |          111 |          590 | 
##                |        0.392 |        0.112 |        1.287 |              | 
##                |        0.176 |        0.636 |        0.188 |        0.453 | 
##                |        0.481 |        0.461 |        0.407 |              | 
##                |        0.080 |        0.288 |        0.085 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##        Widowed |           16 |           69 |           20 |          105 | 
##                |        0.114 |        0.177 |        0.182 |              | 
##                |        0.152 |        0.657 |        0.190 |        0.081 | 
##                |        0.074 |        0.085 |        0.073 |              | 
##                |        0.012 |        0.053 |        0.015 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##       Divorced |           31 |          135 |           49 |          215 | 
##                |        0.604 |        0.004 |        0.347 |              | 
##                |        0.144 |        0.628 |        0.228 |        0.165 | 
##                |        0.144 |        0.166 |        0.179 |              | 
##                |        0.024 |        0.104 |        0.038 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##      Separated |            6 |           24 |           16 |           46 | 
##                |        0.346 |        0.781 |        4.200 |              | 
##                |        0.130 |        0.522 |        0.348 |        0.035 | 
##                |        0.028 |        0.029 |        0.059 |              | 
##                |        0.005 |        0.018 |        0.012 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##  Never Married |           59 |          211 |           77 |          347 | 
##                |        0.038 |        0.154 |        0.254 |              | 
##                |        0.170 |        0.608 |        0.222 |        0.266 | 
##                |        0.273 |        0.259 |        0.282 |              | 
##                |        0.045 |        0.162 |        0.059 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##   Column Total |          216 |          814 |          273 |         1303 | 
##                |        0.166 |        0.625 |        0.210 |              | 
## ---------------|--------------|--------------|--------------|--------------|
## 
##
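
To make the aggregation of the non-married groups explicit, the five-by-three table can be collapsed into a two-by-two table. This is a minimal sketch that assumes the cell counts are typed in by hand from the CrossTable output above; the object names (counts, two_by_two) are my own and not part of the original analysis.

```r
# Cell counts copied by hand from the CrossTable output above.
counts <- matrix(
  c(104, 375, 111,
     16,  69,  20,
     31, 135,  49,
      6,  24,  16,
     59, 211,  77),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    marital = c("Married", "Widowed", "Divorced", "Separated", "Never Married"),
    conbus  = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Collapse rows to married / not married and columns to great deal / less.
two_by_two <- rbind(
  married     = c(great_deal = counts["Married", "A Great Deal"],
                  less       = sum(counts["Married", -1])),
  not_married = c(great_deal = sum(counts[-1, "A Great Deal"]),
                  less       = sum(counts[-1, -1]))
)
two_by_two

# Row proportions: share of each group with a great deal of confidence.
round(prop.table(two_by_two, margin = 1), 3)
```

The not-married row pools 112 "great deal" responses out of 713, which is where the aggregated figure of about 15.7% comes from.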

Part 4: Inference

First we will calculate the 95% confidence interval. Please note that the standard error, denoted below by the variable se_ci, uses the proportions observed in the respective samples (see 6.1.1 of the OpenIntro text).

m_c_count <- nrow(m_c)
m_nc_count <- nrow(m_nc)
nm_c_count <- nrow(nm_c)
nm_nc_count <- nrow(nm_nc)
married_count <- m_c_count + m_nc_count
not_married_count <- nm_c_count + nm_nc_count
p1 <- m_c_count/ married_count
p2 <- nm_c_count / not_married_count
p1

## [1] 0.1762712

p2

## [1] 0.1570827

se_ci <- sqrt(p1*(1-p1)/married_count + p2*(1-p2)/not_married_count)
z_vect <- c(-1.96, 1.96)
ci <- (p1 - p2) + z_vect * se_ci
ci

## [1] -0.02154026  0.05991714
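
The same interval can be wrapped in a small function so that other confidence levels are easy to try. This is only a sketch: diff_prop_ci is a hypothetical helper of my own, the counts are taken from the cross-table, and it uses qnorm() for the critical value instead of the rounded 1.96, so the 95% result can differ from the output above in the later decimal places.

```r
# Hypothetical helper: Wald interval for a difference of two proportions.
# x1, x2 are success counts; n1, n2 are group sizes; conf is the confidence level.
diff_prop_ci <- function(x1, n1, x2, n2, conf = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  # Unpooled standard error, as in se_ci.
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  # Two-sided critical value; about 1.96 when conf = 0.95.
  z  <- qnorm(1 - (1 - conf) / 2)
  (p1 - p2) + c(-1, 1) * z * se
}

# Married: 104 of 590; not married: 112 of 713.
ci_95 <- diff_prop_ci(104, 590, 112, 713)
ci_90 <- diff_prop_ci(104, 590, 112, 713, conf = 0.90)
ci_95
ci_90
```

Both the 95% and the narrower 90% interval contain zero.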

Next we will conduct a hypothesis test for the difference between two proportions. The null value is zero; i.e., the null hypothesis is that the population proportions do not differ. Please note that the standard error in this case, denoted by se_hypo, uses the so-called pooled proportion (see 6.2.3 of the OpenIntro text).

p_pooled <- (m_c_count + nm_c_count)/(married_count + not_married_count)
se_hypo <- sqrt(p_pooled*(1 - p_pooled)/married_count + p_pooled*(1 - p_pooled)/not_married_count)
z_stat <- ((p1 - p2) - 0)/se_hypo
z_stat

## [1] 0.9271307

p_value <- pnorm(abs(z_stat), lower.tail = FALSE)*2
p_value

## [1] 0.3538587
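
As a cross-check on the manual calculation, base R's prop.test() should give the same two-sided p-value when the continuity correction is turned off. This is a sketch under the assumption that the counts are entered by hand from the cross-table; the object names are my own.

```r
# Counts taken from the cross-table: "A Great Deal" responses and group sizes.
successes <- c(married = 104, not_married = 112)
totals    <- c(married = 590, not_married = 713)

# correct = FALSE turns off the continuity correction so that the chi-square
# statistic equals the square of the pooled z statistic.
check <- prop.test(successes, totals, alternative = "two.sided", correct = FALSE)

# The square root recovers the magnitude of z (the sign is lost).
z_from_chisq <- sqrt(unname(check$statistic))
z_from_chisq
check$p.value
check$conf.int
```

With correct = FALSE, the reported confidence interval is also the unpooled (Wald) interval, so it should agree with the manual interval up to the rounding of 1.96.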

Part 4: Inference (Summary)

In conclusion:

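State hypotheses: The null hypothesis is that married and not-married people (in the 2012 sample) are equally likely to have a “great deal” of confidence in business (major companies); i.e., p1 - p2 = 0. The alternative, tested here as two-sided, is that the proportions differ; the research question anticipates that the married proportion is higher.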
Check conditions: Independence and success-failure conditions are met.
State the method(s) to be used and why and how: Confidence interval and hypothesis test with a normal distribution because we meet the conditions for assuming near normality.
Perform inference: The 95% confidence interval for the difference in proportions is (-0.02154026, 0.05991714), which contains zero. The hypothesis test fails to reject the null with a z-score of only 0.9271307. The associated two-tailed p-value (aka, exact significance level) is fully 35.38587%.
Interpret results: Although the married group's proportion with great confidence is about 1.92 percentage points higher than the not-married group's (i.e., 0.1762712 versus 0.1570827), the difference is not statistically significant at any conventional significance level.