Statistical inference with the GSS data

Setup

Load packages

library(tidyverse)

library(statsr)

Load data

setwd(dir = "C:/Users/Marc/Downloads")
load("gss.Rdata")

Part 1: Data

About: The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
Datasource: The data used for this exercise is an extract of the GSS Cumulative File 1972-2012 and provides a sample of selected indicators in the GSS with the goal of providing a convenient data resource for students learning statistical reasoning using the R language. Unlike the full General Social Survey Cumulative File, we have removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R.
Scope: GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
Type: The GSS (General Social Survey) is an interview survey of U.S. households chosen by using a proportional sampling technique. Proportional sampling is a method of sampling in which the investigator divides a finite population into subpopulations and then applies random sampling techniques to each subpopulation. The survey is conducted by the National Opinion Research Center (NORC).
Implication: The GSS uses random sampling, which helps make the sample representative of the population. As it is an observational study and not an experiment, there is no random assignment. As a result, findings from the GSS can be generalized to the population but cannot be used to establish causality.

Part 2: Research question

“The financial crisis of 2007-2008, also known as the global financial crisis and the 2008 financial crisis, is considered by many economists to have been the worst financial crisis since the Great Depression of the 1930s” (Wikipedia). I have read several times that, following the 2008 financial crisis, people lost their “trust” in financial institutions. We will use the GSS to compare pre- and post-2008 confidence in banks and financial institutions in order to determine whether the confidence level in financial institutions did indeed change.

Part 3: Exploratory data analysis

Summary statistics

# overall sample size
length(gss$confinan)

## [1] 57061

# Number of NA
sum(is.na(gss$confinan))

## [1] 22008

# effective sample (na excluded)
length(gss$confinan) - sum(is.na(gss$confinan))

## [1] 35053

# Confidence per category
table(gss$confinan)

## 
## A Great Deal    Only Some   Hardly Any 
##         9015        19659         6379

# Proportion per category in percent
table(gss$confinan) / (length(gss$confinan) - sum(is.na(gss$confinan))) * 100

## 
## A Great Deal    Only Some   Hardly Any 
##     25.71820     56.08364     18.19816
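The same counts and percentages can also be obtained with `prop.table()`, which avoids repeating the denominator by hand. The snippet below is only a sketch on a hypothetical toy vector (`answers` is not the GSS data), to show how `table()` drops `NA` values by default:

```r
# Hypothetical toy answers (NOT the GSS data): a factor with missing values
answers <- factor(
  c("A Great Deal", "Only Some", NA, "Hardly Any", "Only Some", NA, "Only Some"),
  levels = c("A Great Deal", "Only Some", "Hardly Any")
)

n_total     <- length(answers)       # all rows, including NA
n_missing   <- sum(is.na(answers))   # rows without an answer
n_effective <- sum(!is.na(answers))  # rows with an answer

# table() ignores NA by default, so prop.table() divides by the effective sample
pct <- prop.table(table(answers)) * 100
round(pct, 1)
```

Applied to the report's data, the equivalent call would be `prop.table(table(gss$confinan)) * 100`.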

Plots

my.palette <- c("#f7b758","#b94665", "#a7a6a6", "#2ecee5", "#009E73", "#f37735")

# number of respondents per year and per category
plot1 <- as.data.frame(table(gss$confinan, gss$year))
colnames(plot1) <- c("Category", "Year", "Freq")

plot1$Year[plot1$Freq == 0] # checking the years where we have 0 as an answer

##  [1] 1972 1972 1972 1973 1973 1973 1974 1974 1974 1985 1985 1985
## 29 Levels: 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 ... 2012

# it looks like this question wasn't asked in the interview in 1972, 1973, 1974 and 1985
head(plot1, 10)

##        Category Year Freq
## 1  A Great Deal 1972    0
## 2     Only Some 1972    0
## 3    Hardly Any 1972    0
## 4  A Great Deal 1973    0
## 5     Only Some 1973    0
## 6    Hardly Any 1973    0
## 7  A Great Deal 1974    0
## 8     Only Some 1974    0
## 9    Hardly Any 1974    0
## 10 A Great Deal 1975  475

plot1 <- filter(plot1, Freq != 0) # removing those cases

ggplot(plot1, aes(x= Category, y= Freq, fill= Category))+
  geom_boxplot()+
  theme_bw()+
  scale_fill_manual(values = my.palette)+
  theme(legend.position="top")+
  ylab("Number of respondents")+
  ggtitle("Box plot: number of respondents per year per category")

The “Hardly any” category seems to have a smaller variance across the years than the other categories. Interestingly, one point is flagged as an outlier: 564 respondents in 2010.
“Only some” is the category with the largest variance and the highest median.

# proportion per year
plot2 <- as.data.frame(table(gss$year, gss$confinan))
colnames(plot2) <- c("Year", "Category", "Freq")
plot2 <- filter(plot2, Freq != 0) 
plot2 <- spread(plot2, Category, Freq)
plot2 <- mutate(plot2, Total = `A Great Deal` + `Only Some` + `Hardly Any`) # create total per year
plot2$`A Great Deal` <- plot2$`A Great Deal` / plot2$Total * 100 # convert in percent
plot2$`Only Some` <- plot2$`Only Some` / plot2$Total * 100
plot2$`Hardly Any` <- plot2$`Hardly Any` / plot2$Total * 100
plot2 <- plot2[,c(1:4)] # drop the total
plot2 <- gather(plot2, Category, Freq, - Year) # make it readable by ggplot2

ggplot(plot2, aes(x= Year, y= Freq, color= Category, group= Category))+
  geom_point(size=5)+
  geom_line(size= 1.2)+
  scale_color_manual(values = my.palette)+
  theme_bw()+
  geom_point(aes(x= Year,y= Freq),colour="white",size=3)+
  xlab("Year")+
  ylab("Percentage")+
  theme(legend.position="top")+
  ggtitle("Confidence in banks and financial institutions per year and in percentage")

We can see that in 2010 and 2012, the share of “hardly any” is well above the share of “a great deal”. The share of “only some” also decreased slightly over this period.
“Only some” remains the most common answer every year.
“A great deal” was clearly more frequent than “hardly any” in the 70s and the beginning of the 80s, before starting to decrease.
There are 2 peaks for the “hardly any” category: in 1991 and in 2010.

#3. stacked bar plot before and after 2008
gss$y2008 <- ifelse(gss$year >= 2008, "Post 2008", "Pre 2008")


plot3 <- as.data.frame(table(gss$y2008, gss$confinan))
colnames(plot3) <- c("Post_2008", "Category", "Freq")

ggplot(plot3, aes(x= Post_2008, y= Freq, fill= Category))+
      geom_bar(stat="identity", position="fill")+
      theme_bw()+
      scale_fill_manual(values = my.palette)+
      theme(legend.position="top")+
      ylab("Share of respondents")+
      xlab("Periods when the interviews were conducted")+
      ggtitle("Share of trust in financial institutions before and after 2008")

We can see that the share of “Hardly any” increased strongly for the post 2008 interviews, while the share of “a great deal” decreased.
Surprisingly, the share of “only some” stays more or less the same.
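To put numbers on what the stacked bars show, the row shares can be computed directly. This is a minimal sketch: the counts are copied by hand from the pre/post 2008 cross-table printed in Part 4, and the object names `counts` and `shares` are only illustrative.

```r
# Counts copied from the pre/post 2008 cross-table in this report
counts <- matrix(
  c( 556,  2140, 1343,
    8459, 17519, 5036),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Period   = c("Post 2008", "Pre 2008"),
    Category = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# margin = 1 makes each row (period) sum to 100%
shares <- prop.table(counts, margin = 1) * 100
round(shares, 1)
```

With these counts, “a great deal” goes from about 27.3% before 2008 to about 13.8% after, “hardly any” from about 16.2% to about 33.3%, and “only some” from about 56.5% to about 53.0%.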

Part 4: Inference

Hypotheses

Our objective is to use the GSS to compare pre- and post-2008 confidence in banks and financial institutions in order to determine whether the confidence level in financial institutions did indeed change.

In order to do this, we will compare the proportion of individuals who answered “a great deal” among all interviewees, post and pre 2008.

Null hypothesis: there is no difference between the proportion of interviewees who said that they have “a great deal” of confidence in banks and in financial institutions for interviews conducted before 2008 and for the ones conducted in 2008 and after
Alternative hypothesis: there is a difference between the proportion of interviewees who said that they have “a great deal” of confidence in banks and in financial institutions for interviews conducted before 2008 and for the ones conducted in 2008 and after
H0: P_post2008 - P_pre2008 = 0
HA: P_post2008 - P_pre2008 != 0

Method to be used

This is a case of inference for categorical data, so we work with proportions.
We will compute a confidence interval and run a hypothesis test for P_post2008 - P_pre2008 = 0. The confidence interval gives the “direction” of the change (if there is any) and confirms the result of the hypothesis test.
For the hypothesis test, the pooled proportion will be used to check the success - failure condition and to calculate the standard error.
We will then compute the Z test statistic and obtain the p-value from the standard normal distribution.
We will use a 95% confidence level (significance level of 0.05).
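For reference, the quantities used in the rest of this part can be written as follows. The notation is an assumption of this sketch and not taken from the original analysis: \hat{p} is a sample proportion of “a great deal” answers, x the number of such answers, n a group size, and z^* = 1.96 for a 95% confidence level.

```latex
% Confidence interval: unpooled standard error
SE_{CI} = \sqrt{\frac{\hat{p}_{post}(1-\hat{p}_{post})}{n_{post}}
              + \frac{\hat{p}_{pre}(1-\hat{p}_{pre})}{n_{pre}}}
\qquad
(\hat{p}_{post} - \hat{p}_{pre}) \pm z^{*} \, SE_{CI}

% Hypothesis test: pooled proportion and pooled standard error
\hat{p}_{pool} = \frac{x_{post} + x_{pre}}{n_{post} + n_{pre}}
\qquad
SE_{H_0} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})
           \left(\frac{1}{n_{post}} + \frac{1}{n_{pre}}\right)}
\qquad
Z = \frac{\hat{p}_{post} - \hat{p}_{pre}}{SE_{H_0}}
```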

Confidence Interval: Checking conditions

Each group meets the success - failure condition, with more than 10 “successes” and more than 10 “failures” (calculation below).
Observations within each group are independent because each group is a random sample of less than 10% of the US population, and the 2 groups are independent of each other because they come from different survey years.
As a result, the difference between the 2 sample proportions approximately follows a normal model.

# success failure

table(gss$y2008, gss$confinan)

##            
##             A Great Deal Only Some Hardly Any
##   Post 2008          556      2140       1343
##   Pre 2008          8459     17519       5036

# the proportion for P_post2008 is 556 / (556 + 2140 + 1343)
P_post2008 <- 556 / (556 + 2140 + 1343)

# the proportion for P_pre2008 is 8459 / (8459 + 17519 + 5036)
P_pre2008 <- 8459 / (8459 + 17519 + 5036)

# success failure conditions for Post 2008: n * p and n * (1 - p)
n_post2008 <- 556 + 2140 + 1343
n_post2008 * P_post2008 # success

## [1] 556

n_post2008 * (1 - P_post2008) # failure

## [1] 3483

# success failure conditions for Pre 2008: n * p and n * (1 - p)
n_pre2008 <- 8459 + 17519 + 5036
n_pre2008 * P_pre2008 # success

## [1] 8459

n_pre2008 * (1 - P_pre2008) # failure

## [1] 22555

# All well above 10

Calculating the Confidence Interval

# calculating the SE
a <- (P_post2008 * (1 - P_post2008))/ (556 + 2140 + 1343)
b <- (P_pre2008 * (1 - P_pre2008))/ (8459 + 17519 + 5036)

se <- sqrt(a + b)

# difference of proportions
P <- P_post2008 - P_pre2008

# Point estimate +- 1.96 * SE
P + 1.96 * se

## [1] -0.1233649

P - 1.96 * se

## [1] -0.146815

We are 95% confident that the proportion of interviewees with “a great deal” of confidence in banks and financial institutions was 12.3 to 14.7 percentage points lower in 2008 and after than before 2008. This indicates a decrease in confidence post 2008.
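As a cross-check of the manual calculation, base R's `prop.test()` returns an interval for the difference of two proportions. The sketch below is not part of the original analysis; it assumes the counts from the cross-table above and turns off the continuity correction (`correct = FALSE`) so that the result is comparable with the point estimate +- 1.96 * SE approach.

```r
# Counts of "A Great Deal" answers and group sizes, copied from the cross-table
x <- c(post = 556, pre = 8459)
n <- c(post = 556 + 2140 + 1343, pre = 8459 + 17519 + 5036)

# Two-sample test of proportions without continuity correction;
# conf.int is the interval for P_post2008 - P_pre2008
ci_check <- prop.test(x = x, n = n, conf.level = 0.95, correct = FALSE)
ci_check$conf.int
```

The bounds should agree with the manual ones to several decimal places; tiny differences come from `prop.test()` using the exact normal quantile and not the rounded value 1.96.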

Hypothesis test: Checking conditions

Pooled proportion = # of interviewees who trust the banks and financial institutions “a great deal” in the entire study / # of interviewees in the entire study
The pooled proportion represents our best estimate of the proportions P_post2008 and P_pre2008 if the null hypothesis is true (P_post2008 = P_pre2008).
The pooled proportion is used to check the success - failure condition and to calculate the SE.

table(gss$confinan)

## 
## A Great Deal    Only Some   Hardly Any 
##         9015        19659         6379

Pooled_P <- 9015 / (9015 + 19659 + 6379)

# checking the success failure condition: n * pooled p and n * (1 - pooled p)
# success failure conditions for Post 2008
n_post2008 <- 556 + 2140 + 1343
n_post2008 * Pooled_P # success

## [1] 1038.758

n_post2008 * (1 - Pooled_P) # failure

## [1] 3000.242

# success failure conditions for Pre 2008
n_pre2008 <- 8459 + 17519 + 5036
n_pre2008 * Pooled_P # success

## [1] 7976.242

n_pre2008 * (1 - Pooled_P) # failure

## [1] 23037.76

# All greater than 10

The success - failure condition is met for each group.
Observations within each group are independent because each group is a random sample of less than 10% of the US population, and the 2 groups are independent of each other because they come from different survey years.
As a result, we can safely apply the normal model.

Hypothesis test: calculation

# calculating the SE with the pooled proportion
a <- (Pooled_P * (1 - Pooled_P))/ (556 + 2140 + 1343)
b <- (Pooled_P * (1 - Pooled_P))/ (8459 + 17519 + 5036)

se <- sqrt(a + b)

# computing the test statistic with Z 
Z <- (P_post2008 - P_pre2008) / se
Z

## [1] -18.47629

Looking up the p-value in a Z table, we can see that for Z <= -3.50, the one-tail probability is less than or equal to 0.0002. Multiplying it by 2 to cover both tails gives an upper bound of 4e-04. With Z = -18.48, the actual p-value is far smaller than this bound.
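The table lookup only gives an upper bound. The two-sided p-value can also be computed directly with `pnorm()`. This is a minimal sketch that reuses the Z value printed above; the `prop.test()` call is an optional cross-check that is not part of the original analysis (without continuity correction, its chi-squared statistic is the square of the pooled Z).

```r
Z <- -18.47629                  # test statistic reported above

# two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(Z))
p_value < 0.05                  # TRUE means H0 is rejected at the 5% level

# optional cross-check: sqrt of the chi-squared statistic should match |Z|
ht <- prop.test(x = c(556, 8459), n = c(4039, 31014), correct = FALSE)
sqrt(unname(ht$statistic))
```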

Interpret results

Because the p-value is smaller than 0.05, we can reject the null hypothesis: the difference between post 2008 and pre 2008 confidence in banks and financial institutions cannot (reasonably) be explained by chance.

The confidence interval also gives the direction and size of the change in the population (12.3 to 14.7 percentage points lower), indicating that there was a decrease in confidence post 2008.