Statistical inference with the GSS data

Part 1: Data

Since 1972, the General Social Survey (GSS) has collected demographic data on American adult respondents along with data on their attitudes, behaviors, and attributes. Because the survey was run almost every year until 1994 and every other year after that, the dataset allows for analysis of trends and constants in these answers over time. Each row in the dataset represents a “case”, or a single respondent’s answers for a single year.

Generalizability

Since 1977, GSS respondents have been predesignated using a “full probability sample”, meaning they were selected randomly. Therefore, the results from 1977 onward are generalizable to the population that was covered by the survey (English-speaking American adults). The survey is observational rather than experimental, so it cannot be used to infer causality.

Caveats:

Before 1977, the GSS used a modified probability sample with a block quota, which introduces sampling bias because people who were not at home at the time of the survey could not be included. Later surveys predesignated respondents, so they did not have this bias. Therefore, in this analysis, I will focus on years 1977 and after.

Until 2006, the survey only covered the English-speaking US population, so the data should not be generalized to non-English speakers.

There is potential non-response bias because the survey is voluntary and limited to one hour. NA values will be omitted from the analysis. The behavioral and attitudinal answers are not validated, so there is also potential bias from respondents not wanting to admit to certain attitudes or behaviors.

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(gridExtra)
library(grid)

Load data

load("gss.Rdata")

Part 2: Research question

Has confidence in religious institutions changed from the pre-Internet era (before the mid-1990s) to the era following consumer adoption of the Internet (1996 and later, for the purposes of this analysis)? Specifically, has it changed among young adults, who were more likely to adopt the Internet early?

Today, it seems as though people are moving away from religion and towards science and technology. It will be interesting to see whether people’s confidence in religious institutions waned during the rise of the Internet.

Part 3: Exploratory data analysis

dim(gss)

## [1] 57061   114

length(unique(gss$year))

## [1] 29

min(gss$year)

## [1] 1972

max(gss$year)

## [1] 2012

There are about 57K rows in the GSS data, each representing one respondent from one year of the survey. The data cover 29 survey years, beginning in 1972 and ending in 2012.

# keep only the variables of interest and drop rows with NA values
data <- gss %>% select(year, age, conclerg) %>% na.omit()

# categorize respondents by age into groups
data$age_group <- ifelse(data$age < 25, "young_adult",
                   ifelse(data$age >= 25 & data$age< 65, "adult", "senior"))
prop.table(table(data$conclerg, data$age_group), 2)

##               
##                    adult    senior young_adult
##   A Great Deal 0.2601238 0.3843156   0.2927681
##   Only Some    0.5265866 0.4352978   0.4945137
##   Hardly Any   0.2132896 0.1803867   0.2127182
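
The nested ifelse() grouping above could also be expressed with cut(), which makes the age boundaries explicit. The following is only a sketch on a made-up vector of ages (the `ages` values are an assumption for illustration, not taken from the GSS data); it uses the same cut points as the analysis, with under 25 as young_adult, 25 to 64 as adult, and 65 and over as senior.

```r
# Hypothetical ages, chosen to sit on either side of the group boundaries
ages <- c(18, 24, 25, 64, 65, 89)

# right = FALSE makes each interval closed on the left: [-Inf, 25), [25, 65), [65, Inf)
age_group <- cut(
  ages,
  breaks = c(-Inf, 25, 65, Inf),
  labels = c("young_adult", "adult", "senior"),
  right  = FALSE
)

data.frame(ages, age_group)
```

Unlike ifelse(), cut() returns a factor whose levels follow the order of the labels, which would also change the order of the groups in tables and plots.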

plot1 <- ggplot(data) + aes(x=age_group, fill=conclerg) + geom_bar(position="fill")
plot1 <- plot1 + xlab("Age Group") + ylab("Proportion") + scale_fill_discrete(name="Confidence in Organized Religion")
plot1

For all years of the survey in aggregate, adults and young adults appear to have about the same levels of confidence in organized religion, while seniors are more likely to have “A Great Deal” of confidence and less likely to have “Hardly Any” confidence. But do these levels look different in the pre-Internet era vs. the post-Internet era?

# subset to just years of interest
recent <- data %>% filter(age_group == "young_adult")
recent <- recent %>% select(year, conclerg) %>% filter(year >= 1980 & year < 2001)
# label each response by era: before 1996 is pre-Internet, 1996 and later is post-Internet
recent$era <- ifelse(recent$year < 1996, "pre_internet", "post_internet")

plot2 <- ggplot(recent) + aes(x=era, fill=conclerg) + geom_bar(position="fill") + ggtitle("Pre vs. Post Internet Era Attitudes of Young Adults Towards Organized Religion")
plot2

Visually, the proportion of young adults reporting “A Great Deal” of confidence in organized religion appears to differ between the pre-Internet and post-Internet eras. Because the bars show proportions rather than counts, the comparison should be read as a share of each era’s respondents: the counts in Part 4 put that share at about 28% before 1996 and about 32% from 1996 onward, so it is somewhat higher, not lower, in the post-Internet era.
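
As a numerical check on the plot, the share of “A Great Deal” responses in each era can be computed directly from the counts reported in Part 4 (450 of 1630 pre-Internet, 161 of 510 post-Internet). This is a minimal sketch that rebuilds a 2x2 table from those counts rather than from the `recent` data frame; the "Other" column is an assumption made here to collapse the remaining two response levels.

```r
# Counts taken from the nrow() output in Part 4; "Other" = total minus "A Great Deal"
counts <- matrix(
  c(450, 1630 - 450,
    161,  510 - 161),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    era      = c("pre_internet", "post_internet"),
    conclerg = c("A Great Deal", "Other")
  )
)

# margin = 1 converts each row (era) to proportions that sum to 1
props <- prop.table(counts, margin = 1)
round(props, 3)
```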

Part 4: Inference

pre <- recent %>% filter(era == "pre_internet")
nrow(pre)

## [1] 1630

nrow(pre %>% filter(conclerg== "A Great Deal"))

## [1] 450

post <- recent %>% filter(era == "post_internet")
nrow(post)

## [1] 510

nrow(post %>% filter(conclerg== "A Great Deal"))

## [1] 161

res <- prop.test(x=c(450, 161), n=c(1630, 510))

res

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(450, 161) out of c(1630, 510)
## X-squared = 2.797, df = 1, p-value = 0.09444
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.086705829  0.007480519
## sample estimates:
##    prop 1    prop 2 
## 0.2760736 0.3156863

\(H_{0}\): \(p_{\text{post-internet}} = p_{\text{pre-internet}}\)

\(H_{A}\): \(p_{\text{post-internet}} \neq p_{\text{pre-internet}}\)

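The sampling distribution of the difference in proportions can be treated as approximately normal because the respondents were randomly sampled, each sample makes up less than 10% of the population, and the success-failure condition is met (where success is responding “A Great Deal” to the survey question “How much confidence do you have in organized religion?”). Therefore, a two-proportion z-test is appropriate for these hypotheses. The prop.test() call above is the chi-squared form of that test, with a continuity correction applied.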
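
The success-failure condition can be checked explicitly. For a hypothesis test it is usually evaluated with the pooled proportion, requiring at least 10 expected successes and 10 expected failures in each group. The sketch below assumes that convention and uses the counts reported above; the variable names are introduced here for illustration.

```r
# "A Great Deal" counts and sample sizes per era, from the output above
x <- c(pre = 450, post = 161)
n <- c(pre = 1630, post = 510)

# Pooled proportion of successes under the null hypothesis of equal proportions
p_pool <- sum(x) / sum(n)

# Expected successes and failures in each era under the null
expected <- rbind(
  successes = n * p_pool,
  failures  = n * (1 - p_pool)
)
round(expected, 1)

# TRUE when every expected count is at least 10
all(expected >= 10)
```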

The p-value is 0.0944, so at the 5% significance level we fail to reject the null hypothesis that the proportion of young adults who have “A Great Deal” of confidence in organized religion is the same in the pre-Internet and post-Internet eras. The 95% confidence interval for the difference in proportions (-0.087 to 0.007) contains 0, which is consistent with this result.
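
To make the link between the z-test and prop.test() concrete, the statistic can also be computed by hand from the same counts. This is a sketch rather than a reproduction of the result above: it omits the continuity correction that prop.test() applies by default, so its p-value comes out slightly smaller than 0.0944, although the conclusion at the 5% level is the same.

```r
# "A Great Deal" counts and sample sizes per era, from the output above
x <- c(pre = 450, post = 161)
n <- c(pre = 1630, post = 510)

p_hat  <- x / n                 # sample proportions
p_pool <- sum(x) / sum(n)       # pooled proportion under H0

# Standard error of the difference in proportions under H0
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Two-sided two-proportion z-test, no continuity correction
z       <- unname((p_hat["pre"] - p_hat["post"]) / se)
p_value <- 2 * pnorm(-abs(z))

# prop.test() with correct = FALSE reports z^2 as its X-squared statistic
res_nocc <- prop.test(x = x, n = n, correct = FALSE)

c(z = z, z_squared = z^2, p_value = p_value)
res_nocc$statistic
```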

Part 5: Reference

http://gss.norc.org/Get-Documentation