Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(gmodels) # for CrossTable()

Load data

We load the supplied dataframe, which is described as an “extract of the General Social Survey (GSS) Cumulative File 1972-2012 [that] provides a sample of selected indicators in the GSS with the goal of providing a convenient data resource for students learning statistical reasoning using the R language. Unlike the full General Social Survey Cumulative File, we have removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R.”

load("gss.Rdata")

Part 1: Data

We will be making an inference for categorical data where each variable has two levels: marital status (i.e., either married or not married) and “confidence in major companies” (i.e., either “a great deal” or not). This implies a test of the difference between two proportions, for which we would like to assume that the sampling distribution of the difference between the sample proportions (p1 - p2) is nearly normal. To make this assumption, we need to confirm that two conditions are met (see 6.2.1 of OpenIntro Statistics, 3rd Edition): each sample proportion separately follows a normal model, and the two samples are independent of each other.

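For each proportion to follow a normal model, two further conditions must hold (see 6.1.1): the observations within the sample are independent, and the success-failure condition is met, i.e., the sample contains at least 10 successes and at least 10 failures.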

Based on my interpretation of the GSS codebook, in particular Appendix A, it is reasonable to assume independence both within each sample and between samples. Further, the success-failure condition is easily met. See the cross-table below: among the four success and failure counts (two per group), the lowest is 104 = p1 * n1 = 0.176 * 590, the number of married respondents with a great deal of confidence. Therefore, our assumption of near normality appears to be justified.
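The success-failure check can be sketched directly from the counts. This is a minimal illustration, not part of the original analysis: the numbers are transcribed by hand from the 2012 cross-table in Part 3, and the variable names (sf_counts, condition_met) are my own.

```r
# Counts transcribed from the 2012 cross-table in Part 3 (assumed copied correctly).
married_n         <- 590
married_great     <- 104
not_married_n     <- 105 + 215 + 46 + 347  # widowed, divorced, separated, never married
not_married_great <- 16 + 31 + 6 + 59

# Successes and failures in each group; all four should be at least 10.
sf_counts <- c(
  married_success     = married_great,
  married_failure     = married_n - married_great,
  not_married_success = not_married_great,
  not_married_failure = not_married_n - not_married_great
)
print(sf_counts)

condition_met <- all(sf_counts >= 10)
print(condition_met)
```

The smallest of the four counts is the one quoted above, so the condition holds with a wide margin.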


Part 2: Research question

My research question is: are married people more likely to have a great deal of confidence in major companies? I am interested because marriage itself is a social institution. Viewed that way, I wonder whether being married reflects an affirmative belief in institutions that might tend to translate into an affirmative belief (i.e., confidence) in corporate institutions. My instinct is that, yes, married people might be more likely, even if only slightly, to have confidence in companies.

Also, after looking at the dataset, I decided to ask the question only for the most recent year collected, which is 2012. It occurred to me that time is an important dimension; e.g., quick EDA shows that several of the confidence-based variables (“Confidence in Institutions”) have declined over time. I didn’t want to commingle time into the inference.


Part 3: Exploratory data analysis

Because I am interested in comparing two proportions, I will first partition both variables, marital and conbus, into two categories. I will filter “not married” observations into dataframes nm_c and nm_nc, where “not married” includes widowed, divorced, separated, or never married. Similarly, the conbus variable is a factor with three levels, so “less than great confidence” observations will filter into dataframes m_nc and nm_nc, which include either “only some” confidence or “hardly any” confidence.

gss_12 <- gss %>%
  select(caseid, year, age, sex, race, marital, conbus) %>% filter(year == 2012)
m_c <- gss_12 %>% filter(marital == "Married", conbus == "A Great Deal")
m_nc <- gss_12 %>% filter(marital == "Married", conbus != "A Great Deal", !is.na(conbus))
nm_c <- gss_12 %>% filter(marital != "Married", !is.na(marital), conbus == "A Great Deal")
nm_nc <- gss_12 %>% filter(marital != "Married", !is.na(marital), conbus != "A Great Deal", !is.na(conbus))

First we will take a look at the raw counts (business confidence) by marital status (please note the marital groupings are overlapping, not stacked). Visually, while the sample contains more married people (compared to the other categories), it is not visually obvious that confidence proportionally varies by marital status, at least to me.

# geom_bar() counts the rows in each category by default, so it avoids the
# "Ignoring unknown parameters" warning raised by geom_histogram(stat = "count").
gss_12 %>% ggplot(aes(x = conbus, fill = marital)) +
  geom_bar(position = "identity", alpha = 0.4) +
  labs(title = "Confidence by Marital Status", x = "Confidence in Business (conbus)", y = "Count (overlapping)")

Below we view the same data but more precisely with a cross-table. Here we can observe there is a slight difference:

In summary, 17.6% of married people had a great deal of confidence, compared to 15.7% of those not married. The married proportion is higher, but is the difference statistically significant?

CrossTable(gss_12$marital, gss_12$conbus)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1303 
## 
##  
##                | gss_12$conbus 
## gss_12$marital | A Great Deal |    Only Some |   Hardly Any |    Row Total | 
## ---------------|--------------|--------------|--------------|--------------|
##        Married |          104 |          375 |          111 |          590 | 
##                |        0.392 |        0.112 |        1.287 |              | 
##                |        0.176 |        0.636 |        0.188 |        0.453 | 
##                |        0.481 |        0.461 |        0.407 |              | 
##                |        0.080 |        0.288 |        0.085 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##        Widowed |           16 |           69 |           20 |          105 | 
##                |        0.114 |        0.177 |        0.182 |              | 
##                |        0.152 |        0.657 |        0.190 |        0.081 | 
##                |        0.074 |        0.085 |        0.073 |              | 
##                |        0.012 |        0.053 |        0.015 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##       Divorced |           31 |          135 |           49 |          215 | 
##                |        0.604 |        0.004 |        0.347 |              | 
##                |        0.144 |        0.628 |        0.228 |        0.165 | 
##                |        0.144 |        0.166 |        0.179 |              | 
##                |        0.024 |        0.104 |        0.038 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##      Separated |            6 |           24 |           16 |           46 | 
##                |        0.346 |        0.781 |        4.200 |              | 
##                |        0.130 |        0.522 |        0.348 |        0.035 | 
##                |        0.028 |        0.029 |        0.059 |              | 
##                |        0.005 |        0.018 |        0.012 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##  Never Married |           59 |          211 |           77 |          347 | 
##                |        0.038 |        0.154 |        0.254 |              | 
##                |        0.170 |        0.608 |        0.222 |        0.266 | 
##                |        0.273 |        0.259 |        0.282 |              | 
##                |        0.045 |        0.162 |        0.059 |              | 
## ---------------|--------------|--------------|--------------|--------------|
##   Column Total |          216 |          814 |          273 |         1303 | 
##                |        0.166 |        0.625 |        0.210 |              | 
## ---------------|--------------|--------------|--------------|--------------|
## 
## 
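The two proportions quoted in the summary can be recovered by collapsing the 5 x 3 cross-table into a 2 x 2 table. The sketch below is a hedged illustration: it re-enters the printed counts by hand rather than reading gss_12, and the object names (counts, by_marital, two_by_two) and the label “Less than Great” are assumptions of mine.

```r
# Cell counts copied from the CrossTable output above (2012 respondents).
counts <- matrix(
  c(104, 375, 111,
     16,  69,  20,
     31, 135,  49,
      6,  24,  16,
     59, 211,  77),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    marital = c("Married", "Widowed", "Divorced", "Separated", "Never Married"),
    conbus  = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Collapse the four non-married rows into a single "Not Married" row.
is_married <- rownames(counts) == "Married"
by_marital <- rbind(
  Married       = counts[is_married, ],
  `Not Married` = colSums(counts[!is_married, ])
)

# Collapse "Only Some" and "Hardly Any" into one "Less than Great" column.
two_by_two <- cbind(
  `A Great Deal`    = by_marital[, "A Great Deal"],
  `Less than Great` = rowSums(by_marital[, c("Only Some", "Hardly Any")])
)
print(two_by_two)

# Row proportions: share of each marital group with a great deal of confidence.
print(round(prop.table(two_by_two, margin = 1), 3))
```

The first column of the row proportions corresponds to the 17.6% and 15.7% figures discussed above.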

Part 4: Inference

First we will calculate the 95% confidence interval for the difference between the two proportions (p1 - p2). Please note that the standard error, denoted below by the variable se_ci, uses the proportions observed in the respective samples (see 6.1.1 of the OpenIntro text).

m_c_count <- nrow(m_c)
m_nc_count <- nrow(m_nc)
nm_c_count <- nrow(nm_c)
nm_nc_count <- nrow(nm_nc)
married_count <- m_c_count + m_nc_count
not_married_count <- nm_c_count + nm_nc_count
p1 <- m_c_count/ married_count
p2 <- nm_c_count / not_married_count
p1
## [1] 0.1762712
p2
## [1] 0.1570827
se_ci <- sqrt(p1*(1-p1)/married_count + p2*(1-p2)/not_married_count)
z_vect <- c(-1.96, 1.96)
ci <- (p1 - p2) + z_vect * se_ci
ci
## [1] -0.02154026  0.05991714
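As a cross-check, the same interval can be reproduced from the raw counts and compared with base R's prop.test(). This is a sketch under stated assumptions: the counts are re-entered by hand from the cross-table, the names x, n, ci_manual, and ci_builtin are mine, and correct = FALSE is set so that prop.test() skips the continuity correction and matches the hand calculation. It uses qnorm(0.975) in place of the rounded 1.96, so the bounds may differ from those above in the later decimal places.

```r
# Successes (great deal of confidence) and group sizes, from the cross-table.
x <- c(married = 104, not_married = 112)
n <- c(married = 590, not_married = 713)

# Manual interval using the unpooled standard error.
p_hat       <- x / n
se_unpooled <- sqrt(sum(p_hat * (1 - p_hat) / n))
z_star      <- qnorm(0.975)  # about 1.96
diff_hat    <- unname(p_hat["married"] - p_hat["not_married"])
ci_manual   <- diff_hat + c(-1, 1) * z_star * se_unpooled

# Built-in interval without continuity correction.
ci_builtin <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

print(ci_manual)
print(ci_builtin)
```

Because the interval contains zero, it is consistent with no difference between the two population proportions at the 95% level.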

Next we will conduct a hypothesis test for the difference between two proportions. The null value is zero; i.e., the null hypothesis is that the proportions in the population are not different. Please note that the standard error in this case, denoted by se_hypo, uses the so-called pooled proportion (see 6.2.3 of the OpenIntro text).

p_pooled <- (m_c_count + nm_c_count)/(married_count + not_married_count)
se_hypo <- sqrt(p_pooled*(1 - p_pooled)/married_count + p_pooled*(1 - p_pooled)/not_married_count)
z_stat <- ((p1 - p2) - 0)/se_hypo
z_stat
## [1] 0.9271307
p_value <- pnorm(abs(z_stat), lower.tail = FALSE)*2
p_value
## [1] 0.3538587
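The pooled z-test above can also be checked against prop.test(). This is a minimal sketch rather than the method used in this report: the counts are re-entered by hand, the names test and z_from_chisq are mine, and correct = FALSE is again assumed so that no continuity correction is applied. Without the correction, the chi-squared statistic reported for two groups is the square of the pooled z statistic, so its square root should agree with z_stat in magnitude.

```r
# Successes and group sizes for married and not-married respondents.
x <- c(104, 112)
n <- c(590, 713)

# Two-sided test of equal proportions, no continuity correction.
test <- prop.test(x, n, alternative = "two.sided", correct = FALSE)

# The square root of the chi-squared statistic gives |z| for the pooled test.
z_from_chisq <- sqrt(unname(test$statistic))

print(z_from_chisq)
print(test$p.value)
```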

Part 4: Inference (Summary)

In conclusion: