SLICC - Inferential Statistics Research Using GSS data

This study asks how much people's views on suicide in the event of bankruptcy, relative to their financial situation, changed after the 2008 Financial Crisis. Fortunately, the General Social Survey (GSS) contains exactly the data I need. The GSS has surveyed Americans since 1972 to track trends in behaviour, attitudes and demographic traits; the extract used here covers 1972 to 2012. The survey was carried out in most years up to 1994 and every other year thereafter.

Regarding generalisability and causation: the GSS selects respondents by random sampling and interviews them in person for about 90 minutes, so the results can be generalised to the US adult population. However, because the survey is an observational study with no random assignment, we cannot infer causation. We can only draw conclusions about associations between variables.

Before beginning with research, I want to address some potential biases that we have to bear in mind:

Narrow target population. The survey covered adults only and was conducted in English until 2004, so only English speakers could answer it accurately.
Changes in methodology over time. Spanish speakers were included in later waves of the survey, so those waves cover a slightly different population than the earlier ones.
False information. Respondents might report their good traits while hiding their bad ones.

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Research question

My research question is as follows:

Does there appear to be a relationship between thoughts on suicide if bankrupt and the change in an individual's financial status after the 2008 Financial Crisis?

The most recent catastrophic global crisis is the Financial Crisis of 2008. I have long wondered whether thoughts on suicide if bankrupt are significantly associated with changes in an individual's financial status. Within the GSS dataset, suicide2 and finalter will be used, since these record views on suicide if bankrupt and the change in respondents' financial status. For comparison purposes, two subsets will be created: historical, covering the whole period 1972 - 2012, and after_crisis, covering 2008 - 2012.

As you might already know, these two variables are categorical, not numeric. A Student's t-test, which compares means of numeric data, is therefore not appropriate. The method of inference used instead is the Chi-square test of independence, \(\chi^{2}\).

Let us begin with subsetting data.

# after_crisis and historical represent after the global crisis and original dataset of all time period, 1972 - 2012
historical <- gss %>% 
  select(year, suicide2, finalter) %>% 
  na.omit() 
  
after_crisis <- gss %>%
  filter(year >= 2008) %>% 
  select(year, suicide2, finalter) %>%
  na.omit()
#'na.omit()' function omits Not Available responses

Part 2: Exploratory data analysis

In this part, Exploratory Data Analysis (EDA) is performed by summarising and visualising the two subsets, historical and after_crisis. First, the summary statistics for the historical dataset:

summary(historical)

##       year      suicide2           finalter    
##  Min.   :1977   Yes: 2461   Better     :11004  
##  1st Qu.:1985   No :26969   Worse      : 6921  
##  Median :1994               Stayed Same:11505  
##  Mean   :1994                                  
##  3rd Qu.:2002                                  
##  Max.   :2012

As displayed, historical starts in 1977 rather than in 1972, the first year of the survey. Either the earlier rows had NA values and were removed, or suicide2 and finalter were not asked before 1977.

historical_table <- table(historical$suicide2, historical$finalter)
historical_table

##      
##       Better Worse Stayed Same
##   Yes   1029   582         850
##   No    9975  6339       10655

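The Chi-square test of independence compares proportions across groups, so a table of proportions is needed as well. Called without a margin argument, prop.table() returns each cell as a share of all respondents.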

historical_prop <- prop.table(historical_table)
historical_prop

##      
##           Better      Worse Stayed Same
##   Yes 0.03496432 0.01977574  0.02888209
##   No  0.33893986 0.21539246  0.36204553

Rows represent the answers, Yes and No, to the question on suicide if bankrupt. Each column stands for the respondent's assessment of whether their financial situation has got better, got worse or stayed the same. These are joint proportions (each cell divided by the grand total), so comparing groups requires dividing by the column totals instead. Doing so, those in the historical dataset who feel worse off are slightly less likely to answer Yes (582/6921, about 8.4%) than those who feel better off (1029/11004, about 9.4%).

To see this clearly,

historical_bar <- ggplot(historical, aes(x = finalter, fill = suicide2)) +
  geom_bar(position = "fill") +
  labs(x = "Changes in Respondents' Financial Status Compared to the Past", y = "Proportion", title = "Impact of Changes in Individuals Financial Status on Thought on Suicide If Bankrupt") +
  scale_fill_discrete(name = "Option", labels = c("Yes", "No"))

historical_bar

The table and bar plot above show no striking pattern. The bar plot, drawn with position = "fill", displays the share of Yes within each group: about 9.4% among the better-off, 8.4% among the worse-off and 7.4% among those who stayed the same. The joint proportions in the table (3.5%, 2% and 2.9% of all respondents) rank the groups differently only because the groups differ in size.

Now turn to the after_crisis dataset to see whether attitudes changed after the 2008 Financial Crisis.
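The within-group shares that the bar plot displays can also be computed directly. The following is a minimal sketch that re-enters the counts printed in historical_table by hand, so it runs without gss.Rdata; the object names historical_counts and historical_col_prop are illustrative and not part of the original analysis.

```r
# Counts copied from the printed historical_table
# (rows: suicide2, columns: finalter)
historical_counts <- matrix(
  c(1029,  582,   850,
    9975, 6339, 10655),
  nrow = 2, byrow = TRUE,
  dimnames = list(suicide2 = c("Yes", "No"),
                  finalter = c("Better", "Worse", "Stayed Same"))
)

# margin = 2 divides each cell by its column total,
# so every finalter group sums to 1
historical_col_prop <- prop.table(historical_counts, margin = 2)
round(historical_col_prop, 4)
```

With the real data, prop.table(historical_table, margin = 2) should give the same result.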

summary(after_crisis)

##       year      suicide2          finalter   
##  Min.   :2008   Yes: 408   Better     :1115  
##  1st Qu.:2008   No :3561   Worse      :1319  
##  Median :2010              Stayed Same:1535  
##  Mean   :2010                                
##  3rd Qu.:2012                                
##  Max.   :2012

after_crisis_table <- table(after_crisis$suicide2, after_crisis$finalter)
after_crisis_table

##      
##       Better Worse Stayed Same
##   Yes    117   142         149
##   No     998  1177        1386

prop.table(after_crisis_table)

##      
##           Better      Worse Stayed Same
##   Yes 0.02947846 0.03577727  0.03754094
##   No  0.25144873 0.29654825  0.34920635

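In summary, the joint proportion of Yes respondents among the better-off has decreased (from about 3.5% to 2.9% of all respondents), but so has that of No respondents (from 33.9% to 25.1%), because the better-off group itself is a smaller part of the sample after the crisis. Within the better-off group the share of Yes is in fact slightly higher, 117/1115 (about 10.5%) against 1029/11004 (about 9.4%).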
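Because joint proportions depend on group size, it helps to look at how the groups themselves shifted. The sketch below re-enters the finalter totals from the two summary() outputs by hand; finalter_totals and finalter_share are hypothetical names introduced only for this illustration.

```r
# finalter totals copied from summary(historical) and summary(after_crisis)
finalter_totals <- rbind(
  historical   = c("Better" = 11004, "Worse" = 6921, "Stayed Same" = 11505),
  after_crisis = c("Better" = 1115,  "Worse" = 1319, "Stayed Same" = 1535)
)

# margin = 1 divides each cell by its row total,
# giving the share of each financial group within a period
finalter_share <- prop.table(finalter_totals, margin = 1)
round(finalter_share, 3)
```

A smaller Better share and a larger Worse share in the after_crisis row explain why both joint proportions in the Better column fall at once.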

Again to visualise it, use the code below

after_crisis_bar <- ggplot(after_crisis, aes(x = finalter, fill = suicide2)) + 
  geom_bar(position = "fill") +
  labs(title = "Impact of Changes in Individuals Financial Status on Thought on Suicide If Bankrupt AFTER crisis", x = "Changes in Respondents' Financial Status Compared to the Past", y = "Proportion") +
  scale_fill_discrete(name = "Option", labels = c("Yes", "No"))

after_crisis_bar

Last in the EDA, the periods before and after the crisis can be compared visually using facet_grid() from ggplot2. This requires a slight adjustment to the historical dataset: a new factor, acbc, marks each observation as before the crisis (bc, up to 2007) or after it (ac, 2008 onwards).

historical2 <- historical %>% 
  mutate(acbc = as.factor(ifelse(year <= 2007, "bc", "ac")))

comp <- ggplot(historical2, aes(x = acbc, fill = suicide2)) + 
  geom_bar(position = "fill") +
  facet_grid(.~finalter) +
  labs(title = "Impact of Changes in Individuals Financial Status on Thought on Suicide If Bankrupt", x = "After Crisis (ac) & Before Crisis (bc)", y = "Proportion") +
  scale_fill_discrete(name = "Option", labels = c("Yes" , "No"))

comp

Perhaps unsurprisingly, attitudes appear to have shifted after the 2008 crisis: in each financial group the share of Yes is somewhat higher from 2008 onwards than in the earlier years.

As mentioned, this study uses the Chi-square test of independence because it evaluates the relationship between two categorical variables. The test quantifies how far the observed counts are from the counts expected under independence. Deviations larger than sampling variation alone would produce provide strong evidence for the alternative hypothesis.
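The comparison in the faceted plot can be put into numbers. Since after_crisis is a subset of historical, subtracting the two printed tables gives the counts for the years up to 2007. The sketch below does this with hand-entered counts; the names all_years, ac, bc and yes_rate are hypothetical.

```r
lab <- list(suicide2 = c("Yes", "No"),
            finalter = c("Better", "Worse", "Stayed Same"))

# Counts copied from historical_table and after_crisis_table
all_years <- matrix(c(1029, 582, 850, 9975, 6339, 10655),
                    nrow = 2, byrow = TRUE, dimnames = lab)
ac <- matrix(c(117, 142, 149, 998, 1177, 1386),
             nrow = 2, byrow = TRUE, dimnames = lab)

# after_crisis is contained in historical, so the difference
# is the table for the years up to 2007
bc <- all_years - ac

# Share of Yes within each finalter group, per period
yes_rate <- rbind(bc = prop.table(bc, margin = 2)["Yes", ],
                  ac = prop.table(ac, margin = 2)["Yes", ])
round(yes_rate, 3)
```

This only describes the sample. Whether the difference between the two periods is statistically significant is a separate question that this sketch does not test.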

Part 3: Inference

Because this study handles two categorical variables, thoughts on the right to suicide if bankrupt and change in financial status, the Chi-square test of independence is the suitable method.

Hypothesis Testing

First, set up the hypotheses to test whether the variables are independent or dependent.

\(H_{0}\): “Thought on the right of suicide if bankrupt” and “respondent’s current financial status” are independent.
\(H_{1}\): The two variables are dependent, that is, thoughts on suicide differ by financial status.

Second, check whether the conditions are met.

Since the chi-square test of independence requires exactly the same conditions as the chi-square goodness of fit test, check that

Independence. Observations are independent of each other (random sampling or assignment, and n < 10% of the population).
Sample size. Each cell has an expected count of at least five.

The first condition is assumed to be true, because the GSS collected its data using random sampling and the total number of observations (29,430) is far less than 10% of the US population.

For the second condition, see the code and its outcome below.

historical_con <- table(historical$finalter, historical$suicide2)
historical_con

##              
##                 Yes    No
##   Better       1029  9975
##   Worse         582  6339
##   Stayed Same   850 10655

# count the cells whose expected count (not observed count) is below five
sum(chisq.test(historical_con)$expected < 5)

## [1] 0

after_crisis_con <- table(after_crisis$finalter, after_crisis$suicide2)
after_crisis_con

##              
##                Yes   No
##   Better       117  998
##   Worse        142 1177
##   Stayed Same  149 1386

# count the cells whose expected count (not observed count) is below five
sum(chisq.test(after_crisis_con)$expected < 5)

## [1] 0

As indicated above, every cell in both tables holds far more than five cases, and with row and column totals this large the expected counts are likewise well above five, so all conditions are met.

Third, the method of inference used is the chi-square test of independence.

\(\text{expected} = \frac{(\text{row total}) \times (\text{column total})}{\text{table total}}\)

\(\chi^{2} = \sum_{i = 1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}\), where \(k\) is the number of cells

\(df = (R - 1)\times(C - 1)\), where \(R\) is the number of rows and \(C\) the number of columns

Using the formulas above, the expected count is calculated for each cell of the contingency table, and the resulting terms are summed over all cells to give the chi-square statistic. Here we skip doing this tedious process by hand.

Last, inference is performed using chisq.test().
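For readers who want to see the skipped arithmetic, the formulas can be applied in a few lines of base R. This is a minimal sketch using the historical counts entered by hand (the names observed, expected, chi_sq, df and p_value are chosen for illustration), not a replacement for chisq.test().

```r
# Observed counts copied from historical_con
# (rows: finalter, columns: suicide2)
observed <- matrix(
  c(1029,  9975,
     582,  6339,
     850, 10655),
  nrow = 3, byrow = TRUE,
  dimnames = list(finalter = c("Better", "Worse", "Stayed Same"),
                  suicide2 = c("Yes", "No"))
)

# expected = (row total x column total) / table total, for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# degrees of freedom: (number of rows - 1) x (number of columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The statistic computed this way should agree with the X-squared value that chisq.test() reports for the same table.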

chisq_historical <- chisq.test(historical$finalter, historical$suicide2)
chisq_historical

## 
##  Pearson's Chi-squared test
## 
## data:  historical$finalter and historical$suicide2
## X-squared = 28.311, df = 2, p-value = 7.119e-07

The test statistic for the historical data (1977 - 2012 once NA values are removed) is statistically significant. The extremely low p-value tells us that something is going on and the variables are dependent. Hence we reject \(H_0\) at any conventional significance level.

For the after_crisis data,
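A significant result does not say which cells depart from independence. One way to look, sketched here under the assumption that the hand-entered counts below match historical_con, is the Pearson residuals that chisq.test() returns in its residuals component, defined as (observed - expected) / sqrt(expected). The names observed, fit and contribution are illustrative.

```r
# Observed counts copied from historical_con
observed <- matrix(
  c(1029,  9975,
     582,  6339,
     850, 10655),
  nrow = 3, byrow = TRUE,
  dimnames = list(finalter = c("Better", "Worse", "Stayed Same"),
                  suicide2 = c("Yes", "No"))
)

fit <- chisq.test(observed)

# Pearson residuals: positive means more cases than independence predicts
round(fit$residuals, 2)

# Squared residuals are the per-cell contributions to the statistic
contribution <- fit$residuals^2
round(contribution, 2)
```

The squared residuals add up to the chi-square statistic, so the cells with the largest values are the ones driving the result.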

chisq_after_crisis <- chisq.test(after_crisis$finalter, after_crisis$suicide2)
chisq_after_crisis

## 
##  Pearson's Chi-squared test
## 
## data:  after_crisis$finalter and after_crisis$suicide2
## X-squared = 0.93916, df = 2, p-value = 0.6253

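Strikingly, the p-value for the after-crisis data, 0.6253, is much higher than I expected and is not significant at all. Therefore, we fail to reject \(H_0\). Note that this sample (3,969 observations) is far smaller than the historical one (29,430), so the test has less power to detect a weak association.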
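The two p-values can be recovered from the reported statistics alone, which makes the contrast between the tests easier to see. The sketch below plugs the printed X-squared values into pchisq() and compares them with the 5% critical value for 2 degrees of freedom; the variable names are illustrative.

```r
# Statistics copied from the two chisq.test() outputs
stat_historical   <- 28.311
stat_after_crisis <- 0.93916
df <- 2

# Upper-tail probabilities of the chi-square distribution
p_historical   <- pchisq(stat_historical,   df = df, lower.tail = FALSE)
p_after_crisis <- pchisq(stat_after_crisis, df = df, lower.tail = FALSE)

# Value the statistic must exceed to reject H0 at the 5% level
critical_5pct <- qchisq(0.95, df = df)

c(p_historical = p_historical,
  p_after_crisis = p_after_crisis,
  critical_5pct = critical_5pct)
```

The historical statistic lies far above the critical value, while the after-crisis statistic lies below it.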

Conclusion

At the very beginning of this study, I thought the two variables, suicide2 and finalter, would produce a statistically significant test statistic for the after-crisis period, because I reckoned so many people were struggling with financial hardship during and after the 2008 crisis. The mass media were reporting bankruptcies, suicides and similar disasters on an hourly basis. However, the statistical results differ from what I expected. This might be because of how I filtered by year and because I did not account for age. The result might, in fact, be different if I kept the generation the same in the after-crisis analysis.