Setup

Load packages

library(tidyverse)
library(statsr)

Load data

setwd(dir = "C:/Users/Marc/Downloads")
load("gss.Rdata")

Part 1: Data


Part 2: Research question

“The financial crisis of 2007-2008, also known as the global financial crisis and the 2008 financial crisis, is considered by many economists to have been the worst financial crisis since the Great Depression of the 1930s” (Wikipedia). I have read several times that, following the 2008 financial crisis, people lost their “trust” in financial institutions. We will use the GSS to compare confidence in banks and financial institutions before and after 2008, in order to determine whether that confidence did indeed change.


Part 3: Exploratory data analysis

Summary statistics

# overall sample size
length(gss$confinan)
## [1] 57061
# Number of NA
sum(is.na(gss$confinan))
## [1] 22008
# effective sample (na excluded)
length(gss$confinan) - sum(is.na(gss$confinan))
## [1] 35053
# Confidence per category
table(gss$confinan)
## 
## A Great Deal    Only Some   Hardly Any 
##         9015        19659         6379
# Proportion per category in percent
table(gss$confinan) / (length(gss$confinan) - sum(is.na(gss$confinan))) * 100
## 
## A Great Deal    Only Some   Hardly Any 
##     25.71820     56.08364     18.19816
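
The same percentages can be obtained without writing out the denominator. A minimal sketch, assuming only the category counts printed above: the `counts` vector is re-typed by hand for illustration, and on the real data `prop.table(table(gss$confinan))` should play the same role, since `table()` drops NA values by default.

```r
# Category counts copied from the table above (NA already excluded)
counts <- c("A Great Deal" = 9015, "Only Some" = 19659, "Hardly Any" = 6379)

# prop.table() divides each count by the sum of all counts
pct <- prop.table(counts) * 100
round(pct, 2)
```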

Plots

my.palette <- c("#f7b758","#b94665", "#a7a6a6", "#2ecee5", "#009E73", "#f37735")

# number of respondents per year and per category
plot1 <- as.data.frame(table(gss$confinan, gss$year))
colnames(plot1) <- c("Category", "Year", "Freq")

plot1$Year[plot1$Freq == 0] # checking the years where we have 0 as an answer
##  [1] 1972 1972 1972 1973 1973 1973 1974 1974 1974 1985 1985 1985
## 29 Levels: 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 ... 2012
# it looks like this question wasn't asked in the interview in 1972, 1973, 1974 and 1985
head(plot1, 10)
##        Category Year Freq
## 1  A Great Deal 1972    0
## 2     Only Some 1972    0
## 3    Hardly Any 1972    0
## 4  A Great Deal 1973    0
## 5     Only Some 1973    0
## 6    Hardly Any 1973    0
## 7  A Great Deal 1974    0
## 8     Only Some 1974    0
## 9    Hardly Any 1974    0
## 10 A Great Deal 1975  475
plot1 <- filter(plot1, Freq != 0) # removing those cases
ggplot(plot1, aes(x= Category, y= Freq, fill= Category))+
  geom_boxplot()+
  theme_bw()+
  scale_fill_manual(values = my.palette)+
  theme(legend.position="top")+
  ylab("Number of respondents")+
  ggtitle("Box plot: number of respondents per year per category")

  • The “Hardly any” category seems to have the smallest variance across the years. Interestingly, one point is flagged as an outlier: 564 respondents in 2010

  • “Only some” is the category with the most variance and the highest median

# proportion per year
plot2 <- as.data.frame(table(gss$year, gss$confinan))
colnames(plot2) <- c("Year", "Category", "Freq")
plot2 <- filter(plot2, Freq != 0) 
plot2 <- spread(plot2, Category, Freq)
plot2 <- mutate(plot2, Total = `A Great Deal` + `Only Some` + `Hardly Any`) # create total per year
plot2$`A Great Deal` <- plot2$`A Great Deal` / plot2$Total * 100 # convert in percent
plot2$`Only Some` <- plot2$`Only Some` / plot2$Total * 100
plot2$`Hardly Any` <- plot2$`Hardly Any` / plot2$Total * 100
plot2 <- plot2[,c(1:4)] # drop the total
plot2 <- gather(plot2, Category, Freq, - Year) # make it readable by ggplot2

ggplot(plot2, aes(x= Year, y= Freq, color= Category, group= Category))+
  geom_point(size=5)+
  geom_line(size= 1.2)+
  scale_color_manual(values = my.palette)+
  theme_bw()+
  geom_point(aes(x= Year,y= Freq),colour="white",size=3)+
  xlab("Year")+
  ylab("Percentage")+
  theme(legend.position="top")+
  ggtitle("Confidence in banks and financial institutions per year and in percentage")

  • In 2010 and 2012, the share of “hardly any” is well above the share of “a great deal”. The “only some” category also decreased slightly over this period

  • “Only some” remains the most common answer every year

  • The “a great deal” category was clearly higher than the “hardly any” category in the 70s and the beginning of the 80s, before starting to decrease

  • There are 2 peaks for the “hardly any” category: in 1991 and 2010
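
The spread / mutate / gather sequence used to build plot2 can also be written as a grouped computation. A minimal sketch, assuming a long data frame with the same columns as plot2 (Year, Category, Freq); the `toy` counts below are made up purely for illustration and are not GSS values.

```r
library(dplyr)

# Hypothetical counts in the same long format as plot2
toy <- data.frame(
  Year     = rep(c("2006", "2010"), each = 3),
  Category = rep(c("A Great Deal", "Only Some", "Hardly Any"), times = 2),
  Freq     = c(60, 110, 30, 20, 100, 80)
)

# Percentage of each category within its own year
toy_pct <- toy %>%
  group_by(Year) %>%
  mutate(Pct = Freq / sum(Freq) * 100) %>%
  ungroup()

toy_pct
```

This keeps the data in long format throughout, so the result can be passed to ggplot directly.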

# 3. stacked bar plot before and after 2008
gss$y2008 <- ifelse(gss$year >= 2008, "Post 2008", "Pre 2008") # label each interview by period


plot3 <- as.data.frame(table(gss$y2008, gss$confinan))
colnames(plot3) <- c("Post_2008", "Category", "Freq")

ggplot(plot3, aes(x= Post_2008, y= Freq, fill= Category))+
      geom_bar(stat="identity", position="fill")+
      theme_bw()+
      scale_fill_manual(values = my.palette)+
      theme(legend.position="top")+
      ylab("Share of respondents")+
      xlab("Periods when the interviews were conducted")+
      ggtitle("Share of trust in financial institutions before and after 2008")

  • We can see that the share of the “Hardly any” category strongly increased for the post 2008 interviews, while the share of “a great deal” decreased.

  • Surprisingly, the share of the “only some” category stays more or less the same
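
The shares behind this plot can also be read as numbers. A minimal sketch, assuming the counts that `table(gss$y2008, gss$confinan)` prints in the inference section; the matrix is re-typed by hand here so the example runs without the data set.

```r
# Counts per period and category, copied from table(gss$y2008, gss$confinan)
counts <- matrix(
  c(556, 2140, 1343,
    8459, 17519, 5036),
  nrow = 2, byrow = TRUE,
  dimnames = list(Period = c("Post 2008", "Pre 2008"),
                  Category = c("A Great Deal", "Only Some", "Hardly Any"))
)

# Row-wise shares: each row sums to 1
shares <- prop.table(counts, margin = 1)
round(shares * 100, 1)
```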


Part 4: Inference

Hypotheses

Our objective is to use the GSS to compare pre and post 2008 confidence in banks and financial institutions, in order to determine whether confidence in financial institutions changed.

In order to do this, we will compare the proportion of individuals who answered “a great deal” among all interviewees for post and pre 2008.

  • Null hypothesis: there is no difference between the proportion of interviewees who said that they have “a great deal” of confidence in banks and in financial institutions for interviews conducted before 2008 and for the ones conducted in 2008 and after

  • Alternative hypothesis: there is a difference between the proportion of interviewees who said that they have “a great deal” of confidence in banks and in financial institutions for interviews conducted before 2008 and for the ones conducted in 2008 and after

  • H0: P_post2008 - P_pre2008 = 0

  • HA: P_post2008 - P_pre2008 != 0

Method to be used

  • This is an inference for categorical data, so we compare proportions

  • We will calculate a confidence interval and run a hypothesis test for P_post2008 - P_pre2008 = 0. The confidence interval is useful to give the “direction” of the change (if there is any) and to confirm the result of the hypothesis test

  • The pooled proportion will be used for the hypothesis test, both to check the success - failure condition and to calculate the standard error

  • We will then compute the Z test statistic and obtain the p-value from the standard normal distribution

  • We will use a 95% confidence level (a 5% significance level for the test)
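
The steps listed above can be sketched as a single function. This is only an illustration under stated assumptions: `two_prop_z` and its argument names are hypothetical, the counts in the example call are made up, and the function uses the exact normal quantile from `qnorm()` where the hand calculation in this report uses 1.96.

```r
# Hypothetical helper: x = number of successes, n = sample size, for groups 1 and 2
two_prop_z <- function(x1, n1, x2, n2, conf_level = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  diff <- p1 - p2

  # Confidence interval: unpooled standard error
  se_ci <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  ci <- diff + c(-1, 1) * z_star * se_ci

  # Hypothesis test of p1 - p2 = 0: pooled standard error
  p_pool <- (x1 + x2) / (n1 + n2)
  se_test <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z <- diff / se_test
  p_value <- 2 * pnorm(-abs(z)) # two-tailed

  list(diff = diff, ci = ci, z = z, p_value = p_value)
}

# Made-up counts, only to show the call
two_prop_z(x1 = 45, n1 = 100, x2 = 60, n2 = 120)
```

Keeping the two standard errors separate makes the difference between the interval (unpooled) and the test (pooled) explicit.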

Confidence Interval: Checking conditions

  • Each group meets the success - failure condition, with at least 10 “successes” and 10 “failures” (calculation below)

  • The 2 samples are independent of each other because each group is a random sample of less than 10% of the US population

  • As a result, the sampling distribution of the difference between the 2 proportions is approximately normal

# success failure

table(gss$y2008, gss$confinan)
##            
##             A Great Deal Only Some Hardly Any
##   Post 2008          556      2140       1343
##   Pre 2008          8459     17519       5036
# the proportion for P_post2008 is 556 / (556 + 2140 + 1343)
P_post2008 <- 556 / (556 + 2140 + 1343)

# the proportion for P_pre2008 is 8459 / (8459 + 17519 + 5036)
P_pre2008 <- 8459 / (8459 + 17519 + 5036)

# success failure conditions for Post 2008: n * p and n * (1 - p)
n_post2008 <- 556 + 2140 + 1343
n_post2008 * P_post2008 # successes
## [1] 556
n_post2008 * (1 - P_post2008) # failures
## [1] 3483
# success failure conditions for Pre 2008
n_pre2008 <- 8459 + 17519 + 5036
n_pre2008 * P_pre2008 # successes
## [1] 8459
n_pre2008 * (1 - P_pre2008) # failures
## [1] 22555
# All well above 10

Calculating the Confidence Interval

# calculating the SE
a <- (P_post2008 * (1 - P_post2008))/ (556 + 2140 + 1343)
b <- (P_pre2008 * (1 - P_pre2008))/ (8459 + 17519 + 5036)

se <- sqrt(a + b)

# difference of proportions
P <- P_post2008 - P_pre2008

# Point estimate +- 1.96 * SE
P + 1.96 * se
## [1] -0.1233649
P - 1.96 * se
## [1] -0.146815
  • We are 95% confident that the proportion of respondents with “a great deal” of confidence in banks and financial institutions was between 12.3 and 14.7 percentage points lower post 2008 than pre 2008. This indicates a decrease in confidence post 2008
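
As a cross-check, base R's `prop.test()` reports an interval for the difference of two proportions. A minimal sketch, assuming the counts from the table above; with `correct = FALSE` the continuity correction is turned off, and because the function uses the exact normal quantile rather than 1.96, the last digits may differ slightly from the hand calculation.

```r
# Successes ("A Great Deal") and group sizes: post 2008 first, then pre 2008
x <- c(556, 8459)
n <- c(556 + 2140 + 1343, 8459 + 17519 + 5036)

# Two-sample test of equal proportions without continuity correction
res <- prop.test(x = x, n = n, conf.level = 0.95, correct = FALSE)

# Interval for P_post2008 - P_pre2008
res$conf.int
```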

Hypothesis test: Checking conditions

  • pooled proportion = # of interviewees who trust the banks and financial institutions “a great deal” in the entire study / # of interviewees in the entire study

  • The pooled proportion represents our best estimate of the proportions P_post2008 and P_pre2008 if the null hypothesis is true (P_post2008 = P_pre2008)

  • The pooled proportion is used to check the success - failure condition and to calculate the SE

table(gss$confinan)
## 
## A Great Deal    Only Some   Hardly Any 
##         9015        19659         6379
Pooled_P <- 9015 / (9015 + 19659 + 6379)

# checking the success failure condition: n * pooled p and n * (1 - pooled p)
# success failure conditions for Post 2008
(556 + 2140 + 1343) * Pooled_P # successes
## [1] 1038.758
(556 + 2140 + 1343) * (1 - Pooled_P) # failures
## [1] 3000.242
# success failure conditions for Pre 2008
(8459 + 17519 + 5036) * Pooled_P # successes
## [1] 7976.242
(8459 + 17519 + 5036) * (1 - Pooled_P) # failures
## [1] 23037.76
# All greater than 10
  • The success - failure condition is met for each group

  • The 2 samples are independent of each other because each group is a random sample of less than 10% of the US population

  • As a result, we can safely apply the normal model

Hypothesis test: calculation

# calculating the SE with the pooled proportion
a <- (Pooled_P * (1 - Pooled_P))/ (556 + 2140 + 1343)
b <- (Pooled_P * (1 - Pooled_P))/ (8459 + 17519 + 5036)

se <- sqrt(a + b)

# computing the test statistic with Z 
Z <- (P_post2008 - P_pre2008) / se
Z
## [1] -18.47629

Looking up the Z statistic in a Z table, we can see that for Z <= -3.50 the one-tailed probability is less than or equal to 0.0002. Multiplying it by 2 gives a two-tailed p-value of at most 4e-04; with Z = -18.48, the actual p-value is far smaller than this bound.
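
Rather than reading a bound from a table, the p-value can be computed directly from the standard normal distribution. A minimal sketch, assuming only the Z statistic reported above (re-typed here so the example runs on its own):

```r
# Z statistic reported above
Z <- -18.47629

# Two-tailed p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(Z))
p_value
```

Using `-abs(Z)` takes the lower tail regardless of the sign of Z, which avoids the loss of precision that `1 - pnorm(abs(Z))` would suffer for a value this extreme.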

Interpret results

Because the p-value is smaller than 0.05, we reject the null hypothesis: the difference between post 2008 and pre 2008 confidence in banks and financial institutions cannot (reasonably) be explained by chance.

The confidence interval also gives the direction of the change (from -14.7 to -12.3 percentage points), indicating that the share of respondents with “a great deal” of confidence decreased post 2008.