Statistical inference with the GSS data

Setup

Load packages

library(ggplot2); library(kableExtra)
library(dplyr); library(tidyr)
library(statsr); library(RColorBrewer)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called gss. Delete this note before you submit your work.

load("gss.Rdata")

Part 1: Data

Background

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.

Methods

According to the GSS Panel Codebook, the GSS conducted 32 surveys from 1972 to 2018, totaling 64,814 completed interviews. Each survey from 1972 to 2004 was an independently drawn sample of English-speaking persons 18 years of age or over. Starting in 2006, Spanish speakers were added to the target population. Block quota sampling was used for 5 of the survey years, with full probability sampling employed in the remaining surveys. Additionally, from 2004 onward, the GSS has used a three-wave, rolling panel sample design, which also uses sub-sampling of non-respondents to keep the design unbiased.

Scope of Inference (Generalizability and Causality)

Although the sampling methodology, questions, responses, and forms have been modified over the years, the GSS samples respondents at random, so overall it provides a useful sample from which to generalize to the US adult population.

However, because the GSS is an observational study conducted through surveys and interviews, with no random assignment of respondents to treatments, statistical inference on these data cannot establish causal relationships.

Part 2: Research question

At this time in history, with a pandemic affecting the entire globe and climate change becoming an ever more present risk to be addressed, trust in the scientific community may be as important as ever. We will examine whether confidence in our scientific institutions changed over a thirty-year period (1980 to 2010) and whether level of education is associated with confidence in these institutions.

Methods

For these questions, we will perform tests for the difference of two proportions, with the alpha level set to 0.05. The hypotheses for the two tests are defined below.

Test 1:

H_0: Trust in our scientific institutions is the same in 2010 as in 1980. (p_1980 = p_2010)
H_a: Trust in our scientific institutions is different in 2010 than in 1980. (p_1980 != p_2010)

Test 2:

H_0: Trust in our scientific institutions does not differ based on education level. (p_college = p_no_college)
H_a: Trust in our scientific institutions differs based on education level. (p_college != p_no_college)

Results

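From the survey data, we estimate that the proportion of respondents expressing trust in scientific institutions declined by 4.7 percentage points between 1980 and 2010, with a 95% confidence interval of 0.8 to 8.5 percentage points. Additionally, we estimate that in 2010 trust among college graduates was 19.0 percentage points higher than among non-graduates, with a 95% confidence interval of 13.2 to 24.9 percentage points.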

Conclusion

Trust in our scientific institutions has decreased even though we may be relying on them even more. Further research will be needed to see whether education can be a partial solution to gaining trust in these needed institutions.

Part 3: Exploratory data analysis

First, we will subset the data to include only the variables we are analyzing. Additionally, we will simplify each variable to two levels. Confidence in scientific institutions becomes Trust or No_trust, where Trust means the respondent answered "A Great Deal". Education level becomes College (a bachelor's or graduate degree) or No_College.

# Keep the two survey years of interest and recode both variables to two levels
data <- gss %>%
        select(year, consci, degree) %>%
        drop_na() %>%
        filter(year %in% c(1980, 2010)) %>%
        mutate(conf_sci = ifelse(consci == "A Great Deal", "Trust", "No_trust"),
               coll_grad = ifelse(degree %in% c("Bachelor", "Graduate"),
                                  "College", "No_College"))

head(data, 10) %>%
        kable() %>%
        kable_styling(bootstrap_options = "striped", full_width = FALSE)

year	consci	degree	conf_sci	coll_grad
1980	A Great Deal	Lt High School	Trust	No_College
1980	A Great Deal	Lt High School	Trust	No_College
1980	A Great Deal	Lt High School	Trust	No_College
1980	A Great Deal	High School	Trust	No_College
1980	Only Some	High School	No_trust	No_College
1980	A Great Deal	High School	Trust	No_College
1980	Only Some	High School	No_trust	No_College
1980	A Great Deal	Bachelor	Trust	College
1980	A Great Deal	Bachelor	Trust	College
1980	A Great Deal	Lt High School	Trust	No_College

# Proportion of respondents expressing trust, by year and education group
data_prop <- data %>%
  group_by(year, coll_grad) %>%
  summarize(Trust = mean(conf_sci == "Trust"), .groups = "drop")

From this subsetted data, we can create tables and visualizations to analyze our variables.

data_prop %>% 
  pivot_wider(names_from = coll_grad, values_from = Trust) %>%
  kable() %>%
  kable_styling(full_width = FALSE)

year	College	No_College
1980	0.6063348	0.4356808
2010	0.5507614	0.3606195

ggplot(data, aes(factor(conf_sci), fill = factor(year))) +
        geom_bar(position = "dodge") +
        facet_wrap(~ coll_grad) +
        ggtitle("Confidence in Science Institutions by College Education") +
        labs(x = "Confidence in Science", 
             y = "Respondent Count") +
        scale_fill_manual("Year", 
                          values = c("1980" = "lightblue", "2010" = "blue"))

Next, we will subset the data further so that we have the information needed for each of our tests.

# The education comparison uses the 2010 respondents only
trust_by_education <- data %>%
  filter(year == 2010) %>%
  group_by(coll_grad) %>%
  summarize(Trust = mean(conf_sci == "Trust"),
            count = n())

trust_by_education %>%
  kable() %>%
  kable_styling(full_width = FALSE)

coll_grad	Trust	count
College	0.5507614	394
No_College	0.3606195	904

trust_by_year <- data %>%
  group_by(year) %>%
  summarize(Trust = mean(conf_sci == "Trust"),
            count = n())

trust_by_year  %>%
  kable() %>%
  kable_styling(full_width = FALSE)

year	Trust	count
1980	0.4650078	1286
2010	0.4183359	1298

Part 4: Inference

Before we perform any statistical inference, we need to check the conditions under which the sampling distribution of the difference of two proportions is approximately normal.

Independence

The respondents in each survey were drawn independently from the population, and each sample represents far less than 10% of the US adult population, so we meet the condition for independence both within and between groups.

Normality

The numbers of successes and failures in each group of both tests are well in excess of 10, so we can safely apply the normal model.
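The success-failure condition for the first test can be checked with a few lines of R. This is a minimal sketch, assuming the group sizes and sample proportions reported in the Part 3 summary table; the names `n`, `p_hat`, `expected`, and `sf_ok` are illustrative and do not appear in the analysis itself. For a hypothesis test, the expected counts are usually computed from the pooled proportion, since that is the best estimate of the common proportion under the null hypothesis.

```r
# Group sizes and sample proportions copied from the trust_by_year table.
n <- c(y1980 = 1286, y2010 = 1298)
p_hat <- c(y1980 = 0.4650078, y2010 = 0.4183359)

# Pooled proportion under H_0: p_1980 = p_2010.
p_pool <- sum(p_hat * n) / sum(n)

# Expected successes and failures in each group under H_0.
expected <- rbind(success = n * p_pool, failure = n * (1 - p_pool))
print(round(expected, 1))

# TRUE when every expected count is at least 10.
sf_ok <- all(expected >= 10)
print(sf_ok)
```

The same check applies to the education comparison by swapping in the counts and proportions from the trust_by_education table.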

Hypothesis Tests

Because we are comparing two proportions, we need to calculate both a point estimate and a pooled proportion for each test. From the pooled proportion we can calculate a standard error and a z-score. From the z-score, we can derive a p-value to compare to our alpha level of 0.05.
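The steps just described can be collected into one small function. The sketch below is illustrative only: `two_prop_z_test` is a hypothetical helper, not a function from statsr or any other package, and the example call uses the proportions and counts from the Part 3 summary table. It returns a two-sided p-value (twice the upper-tail area), on the assumption that the alternative hypothesis is two-sided, as it is for both tests here.

```r
# Hypothetical helper: pooled two-proportion z-test for H_0: p1 = p2.
two_prop_z_test <- function(p1, n1, p2, n2) {
  # Point estimate of the difference (group 2 minus group 1).
  pnt_est <- p2 - p1
  # Pooled proportion: overall success rate across both groups.
  p_pool <- (p1 * n1 + p2 * n2) / (n1 + n2)
  # Standard error under H_0, using the pooled proportion for both groups.
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z <- pnt_est / se
  # Two-sided p-value: area in both tails beyond |z|.
  p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
  list(pnt_est = pnt_est, p_pool = p_pool, se = se, z = z, p_value = p_value)
}

# Example with the 1980 and 2010 values from the trust_by_year table.
res <- two_prop_z_test(p1 = 0.4650078, n1 = 1286, p2 = 0.4183359, n2 = 1298)
str(res)
```

Writing the standard error as `p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)` is algebraically the same as summing the two per-group variance terms.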

Let’s perform our first test:

H_0: Trust in our scientific institutions is the same in 2010 as in 1980. (p_1980 = p_2010)
H_a: Trust in our scientific institutions is different in 2010 than in 1980. (p_1980 != p_2010)

p_1980 <- trust_by_year[1, 2]
p_2010 <- trust_by_year[2, 2]

n_1980 <- trust_by_year[1, 3]
n_2010 <- trust_by_year[2, 3]

pnt_est_1 <- p_2010 - p_1980

p_pool_1 <- (p_1980 * n_1980 + p_2010 * n_2010) / (n_1980 + n_2010)

SE_1 <- sqrt((p_pool_1 * (1 - p_pool_1)) / n_1980 + (p_pool_1 * (1 - p_pool_1)) / n_2010)

z_score_1 <- as.numeric(pnt_est_1 / SE_1)

# Two-sided p-value, matching the two-sided alternative hypothesis
p_value_1 <- 2 * pnorm(abs(z_score_1), lower.tail = FALSE)

df_table_1 <- data.frame(c(p_1980, p_2010, n_1980, n_2010, 
                pnt_est_1, p_pool_1, SE_1, z_score_1, p_value_1))

colnames(df_table_1) <- c("p_1980", "p_2010", "n_1980", "n_2010", 
                "pnt_est", "p_pool", "SE", "z_score", "p_value")

kable(df_table_1) %>%
  kable_styling()

p_1980	p_2010	n_1980	n_2010	pnt_est	p_pool	SE	z_score	p_value
0.4650078	0.4183359	1286	1298	-0.0466719	0.4415635	0.0195376	-2.388819	0.0169026

From this test, we calculate a two-sided p-value of 0.017, which is less than our alpha level of 0.05, so we reject the null hypothesis and conclude that trust in scientific institutions changed over this 30-year period. Unfortunately, trust declined over the period by 4.67 percentage points.

Let’s look at our second test:

H_0: Trust in our scientific institutions does not differ based on education level. (p_college = p_no_college)
H_a: Trust in our scientific institutions differs based on education level. (p_college != p_no_college)

p_coll <- trust_by_education[1, 2]
p_no_coll <- trust_by_education[2, 2]

n_coll <- trust_by_education[1, 3]
n_no_coll <- trust_by_education[2, 3]

pnt_est_2 <- p_coll - p_no_coll

p_pool_2 <- (p_coll * n_coll + p_no_coll * n_no_coll) / (n_coll + n_no_coll)

SE_2 <- sqrt((p_pool_2 * (1 - p_pool_2)) / n_coll + (p_pool_2 * (1 - p_pool_2)) / n_no_coll)

z_score_2 <- as.numeric(pnt_est_2 / SE_2)

# Two-sided p-value, matching the two-sided alternative hypothesis
p_value_2 <- 2 * pnorm(abs(z_score_2), lower.tail = FALSE)

df_table_2 <- data.frame(c(p_coll, p_no_coll, n_coll, n_no_coll, 
                pnt_est_2, p_pool_2, SE_2, z_score_2, p_value_2))

colnames(df_table_2) <- c("p_coll", "p_no_coll", "n_coll", "n_no_coll", 
                "pnt_est", "p_pool", "SE", "z_score", "p_value")

kable(df_table_2) %>%
  kable_styling()

p_coll	p_no_coll	n_coll	n_no_coll	pnt_est	p_pool	SE	z_score	p_value
0.5507614	0.3606195	394	904	0.190142	0.4183359	0.0297786	6.385196	0

From this test, we calculate a p-value of approximately 0, so we reject the null hypothesis and conclude that, in 2010, trust in scientific institutions differed by education level.
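Both hand calculations can be cross-checked against base R's `prop.test()`. This is a sketch under one assumption: the success counts below (598, 543, 217, 326) are not printed anywhere in the report and were reconstructed by multiplying each reported proportion by its group size. With `correct = FALSE` (no continuity correction), the X-squared statistic from `prop.test()` is the square of the pooled z-score, and the reported p-value is two-sided, which matches the two-sided alternatives stated for both tests.

```r
# Success counts reconstructed as proportion * n from the summary tables
# (an assumption; the report itself only prints proportions and group sizes).
test_year <- prop.test(x = c(598, 543), n = c(1286, 1298), correct = FALSE)
test_education <- prop.test(x = c(217, 326), n = c(394, 904), correct = FALSE)

# The square root of X-squared equals |z| from the manual calculation.
z_year <- sqrt(unname(test_year$statistic))
z_education <- sqrt(unname(test_education$statistic))

print(c(z = z_year, p_value = test_year$p.value))
print(c(z = z_education, p_value = test_education$p.value))
```

If the absolute z-scores agree with the tables above, the pooled proportion and standard error were computed correctly.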

Confidence Intervals

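Lastly, let’s calculate a 95% confidence interval for the difference in proportions from each test. The intervals below reuse the pooled standard errors computed for the hypothesis tests.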

conf_int_year <- as.numeric(pnt_est_1) + c(-1,1) * qnorm(.975) * as.numeric(SE_1)

conf_int_education <- as.numeric(pnt_est_2) + c(-1,1) * qnorm(.975) * as.numeric(SE_2)

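df_table_3 <- rbind(conf_int_year, conf_int_education)

# Label the interval bounds so the table is self-explanatory
colnames(df_table_3) <- c("lower", "upper")

kable(df_table_3) %>%
  kable_styling(full_width = FALSE)

	lower	upper
conf_int_year	-0.0849649	-0.0083788
conf_int_education	0.1317770	0.2485069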
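The intervals above are built from the pooled standard errors SE_1 and SE_2. The pooled form assumes the two proportions are equal, which holds only under the null hypothesis, so confidence intervals for a difference of proportions are more commonly built from the unpooled standard error. The sketch below shows that variant; `diff_prop_ci` is a hypothetical helper, and its inputs are the proportions and counts from the Part 3 summary tables. With samples this large the two versions should differ only slightly.

```r
# Hypothetical helper: Wald interval for p2 - p1 with the unpooled standard error.
diff_prop_ci <- function(p1, n1, p2, n2, conf_level = 0.95) {
  # Each group contributes its own variance term (no pooling).
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  # Critical value for the requested confidence level.
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  (p2 - p1) + c(lower = -1, upper = 1) * z_star * se
}

# 2010 minus 1980, matching pnt_est_1.
ci_year <- diff_prop_ci(p1 = 0.4650078, n1 = 1286, p2 = 0.4183359, n2 = 1298)

# College minus no college (2010 only), matching pnt_est_2.
ci_education <- diff_prop_ci(p1 = 0.3606195, n1 = 904, p2 = 0.5507614, n2 = 394)

print(round(rbind(ci_year, ci_education), 4))
```

Neither interval should contain zero, which is consistent with rejecting both null hypotheses at the 0.05 level.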

Conclusion

Even though we are relying more on scientific institutions to deal with catastrophic events, our trust in these institutions has eroded over time. Because trust is higher among college graduates, further research is needed to see whether additional education can increase trust in these necessary institutions.