This study examines how people's views on suicide in the event of bankruptcy relate to changes in their financial status, and whether that relationship changed after the 2008 Financial Crisis. Fortunately, the General Social Survey (GSS) includes exactly the data I need. The GSS conducted a series of surveys from 1972 to 2012 to look into trends in Americans' behaviour, attitudes, and demographic traits. The survey was carried out in most years up to 1994 and every other year thereafter until 2012.
Regarding the generalisability and causation of the data: the GSS used 90-minute in-person interviews with randomly sampled respondents, so the results can be generalised to the US population. However, we cannot infer causation, because the survey is an observational study without random assignment. We can only infer correlations and associations.
Before beginning the research, I want to address some potential biases that we have to bear in mind:
library(ggplot2)
library(dplyr)
library(statsr)
load("gss.Rdata")
My research question is as follows:
Does there appear to be a relationship between views on suicide if bankrupt and the change in an individual's financial status after the 2008 Financial Crisis?
The most recent catastrophic global crisis is the Financial Crisis of 2008. I have always wondered whether views on suicide if bankrupt are significantly associated with changes in an individual's financial status. Within the GSS dataset, suicide2 and finalter will be used, since these stand for suicide if bankrupt and the change in respondents' financial status. For comparison purposes, two subsets will be created: the whole time period, 1972 - 2012 (historical), and the years after the crisis, 2008 - 2012 (after_crisis).
As you might already know, these two variables are categorical, not numeric. It is therefore not appropriate to use Student's t-test, which compares means of a numerical variable. The method of inference used here is the chi-square test of independence, \(\chi^{2}\).
Let us begin with subsetting data.
# historical is the original dataset over the whole time period, 1972 - 2012;
# after_crisis keeps only the years after the global crisis, 2008 - 2012
historical <- gss %>%
  select(year, suicide2, finalter) %>%
  na.omit()

after_crisis <- gss %>%
  filter(year >= 2008) %>%
  select(year, suicide2, finalter) %>%
  na.omit()
# na.omit() drops rows containing Not Available (NA) responses
In this part, Exploratory Data Analysis (EDA) is performed by summarising and visualising the subsets historical and after_crisis. First, summary statistics for the historical dataset:
summary(historical)
## year suicide2 finalter
## Min. :1977 Yes: 2461 Better :11004
## 1st Qu.:1985 No :26969 Worse : 6921
## Median :1994 Stayed Same:11505
## Mean :1994
## 3rd Qu.:2002
## Max. :2012
As displayed, historical starts in 1977 rather than in 1972, the first survey year. This is either because rows with NA values were removed or because suicide2 and finalter were not both included in the survey before 1977.
historical_table <- table(historical$suicide2, historical$finalter)
historical_table
##
## Better Worse Stayed Same
## Yes 1029 582 850
## No 9975 6339 10655
The chi-square test of independence itself works on counts, but proportions make the cells easier to compare, so they are computed as well.
historical_prop <- prop.table(historical_table)
historical_prop
##
## Better Worse Stayed Same
## Yes 0.03496432 0.01977574 0.02888209
## No 0.33893986 0.21539246 0.36204553
Rows represent the answers, Yes and No, to the question on suicide if bankrupt. Likewise, each column stands for respondents' assessment of whether their financial situation has got better, got worse, or stayed the same compared to the past. Each cell is a share of all respondents, so it also reflects how large each finalter group is. In the historical dataset, those who think they are worse-off and answer Yes make up a smaller share (about 2.0%) than those who are better-off and answer Yes (about 3.5%).
To see this clearly,
historical_bar <- ggplot(historical, aes(x = finalter, fill = suicide2)) +
  geom_bar(position = "fill") +
  labs(x = "Changes in Respondents' Financial Status Compared to the Past",
       y = "Proportion",
       title = "Impact of Changes in Individuals Financial Status on Thought on Suicide If Bankrupt") +
  scale_fill_discrete(name = "Option", labels = c("Yes", "No"))
historical_bar
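The table and bar plot above show nothing prominent that we can easily spot. As a share of all respondents, Yes responses are highest among those who are better-off, about 3.5%, compared with 2.0% for those worse-off and 2.9% for those who stayed the same. Within each group, which is what the bar plot displays, the Yes rates are closer together: 1029/11004, about 9.4%, for the better-off, 582/6921, about 8.4%, for the worse-off, and 850/11505, about 7.4%, for those who stayed the same.
Now, let us turn to the after_crisis dataset to see whether these views changed after the 2008 Financial Crisis.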
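The distinction between shares of all respondents and shares within each finalter group can be sketched in base R. This is a minimal illustration, assuming the counts are re-entered by hand from historical_table above; the names hist_counts and hist_within are introduced here only for the example.

```r
# Counts copied from historical_table above
# (rows: suicide2 answer, columns: finalter group).
# matrix() fills column by column, so each pair is one finalter group.
hist_counts <- matrix(
  c(1029,  9975,   # Better:      Yes, No
     582,  6339,   # Worse:       Yes, No
     850, 10655),  # Stayed Same: Yes, No
  nrow = 2,
  dimnames = list(suicide2 = c("Yes", "No"),
                  finalter = c("Better", "Worse", "Stayed Same"))
)

# margin = 2 divides every cell by its column total, giving the
# Yes/No split within each finalter group (each column sums to 1)
hist_within <- prop.table(hist_counts, margin = 2)
round(hist_within, 3)
```

With the data frame loaded, prop.table(historical_table, margin = 2) should give the same proportions, and these are the proportions that geom_bar(position = "fill") draws for each bar.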
summary(after_crisis)
## year suicide2 finalter
## Min. :2008 Yes: 408 Better :1115
## 1st Qu.:2008 No :3561 Worse :1319
## Median :2010 Stayed Same:1535
## Mean :2010
## 3rd Qu.:2012
## Max. :2012
after_crisis_table <- table(after_crisis$suicide2, after_crisis$finalter)
after_crisis_table
##
## Better Worse Stayed Same
## Yes 117 142 149
## No 998 1177 1386
prop.table(after_crisis_table)
##
## Better Worse Stayed Same
## Yes 0.02947846 0.03577727 0.03754094
## No 0.25144873 0.29654825 0.34920635
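In summary, the share of all respondents who are better-off and answer Yes has decreased from about 3.5% to about 2.9%. This mostly reflects fewer respondents reporting improved finances after the crisis (1115 of 3969, about 28%, against 11004 of 29430, about 37%, in the historical dataset). Within the better-off group, the Yes rate actually rose slightly, from about 9.4% to about 10.5%.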
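The comparison between the two periods can be separated into two parts: how large each finalter group is, and how often each group answers Yes. A minimal sketch in base R, assuming the counts are re-entered by hand from the two contingency tables above (hist_counts, ac_counts, group_share and yes_rate are illustrative names, not objects from the analysis):

```r
# Counts copied from historical_table and after_crisis_table above
groups <- c("Better", "Worse", "Stayed Same")
hist_counts <- matrix(c(1029, 9975, 582, 6339, 850, 10655), nrow = 2,
                      dimnames = list(c("Yes", "No"), groups))
ac_counts <- matrix(c(117, 998, 142, 1177, 149, 1386), nrow = 2,
                    dimnames = list(c("Yes", "No"), groups))

# Share of respondents falling in each finalter group
group_share <- rbind(
  historical   = colSums(hist_counts) / sum(hist_counts),
  after_crisis = colSums(ac_counts) / sum(ac_counts)
)

# Yes rate within each finalter group
yes_rate <- rbind(
  historical   = hist_counts["Yes", ] / colSums(hist_counts),
  after_crisis = ac_counts["Yes", ] / colSums(ac_counts)
)

round(group_share, 3)
round(yes_rate, 3)
```

Keeping the two quantities apart avoids reading a change in group size as a change in attitudes.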
Again, to visualise it, use the code below:
after_crisis_bar <- ggplot(after_crisis, aes(x = finalter, fill = suicide2)) +
  geom_bar(position = "fill") +
  labs(title = "Impact of Changes in Individuals Financial Status on Thought on Suicide If Bankrupt AFTER crisis",
       x = "Changes in Respondents' Financial Status Compared to the Past",
       y = "Proportion") +
  scale_fill_discrete(name = "Option", labels = c("Yes", "No"))
after_crisis_bar
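As the last step of the EDA, the periods before and after the crisis can be compared visually using the facet_grid() function of ggplot2. To do this, a slight adjustment to the historical dataset is needed: a new factor, acbc, splits the years at 2008. Note that the level labelled "alltime" covers only the years up to 2007, so the plot compares the periods before and after the crisis.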
historical2 <- historical %>%
  mutate(acbc = as.factor(ifelse(year <= 2007, "alltime", "ac")))

comp <- ggplot(historical2, aes(x = acbc, fill = suicide2)) +
  geom_bar(position = "fill") +
  facet_grid(. ~ finalter) +
  labs(title = "Impact of Changes in Individuals Financial Status on Thought on Suicide If Bankrupt",
       x = "All Time & After Crisis",
       y = "Proportion") +
  scale_fill_discrete(name = "Option", labels = c("Yes", "No"))
comp
Not surprisingly, the plot suggests some change in views on suicide after the 2008 Crisis.
As mentioned, this study uses the chi-square test of independence as its method of inference, since we are evaluating the relationship between two categorical variables: views on the right to suicide if bankrupt and change in financial status. The test quantifies how different the observed counts are from the expected counts. Large deviations from what would be expected based on sampling variation alone provide strong evidence for the alternative hypothesis.
First, set up the hypotheses to test whether the variables are independent or dependent.
\(H_0\): suicide2 and finalter are independent.
\(H_A\): suicide2 and finalter are dependent.
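Second, check whether the conditions are met.
Since the chi-square test of independence requires exactly the same conditions as the chi-square goodness of fit test, see if
1. Independence: the observations are sampled randomly, the sample is less than 10% of the population, and each case contributes to only one cell.
2. Sample size: each cell has at least five expected counts.
The first condition is assumed to be true, because the GSS collected data using random sampling and the total number of observations (29,430) is clearly less than 10% of the US population.
For the second condition, see the code and its output below.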
historical_con <- table(historical$finalter, historical$suicide2)
historical_con
##
## Yes No
## Better 1029 9975
## Worse 582 6339
## Stayed Same 850 10655
sum(historical_con <= 5)
## [1] 0
after_crisis_con <- table(after_crisis$finalter, after_crisis$suicide2)
after_crisis_con
##
## Yes No
## Better 117 998
## Worse 142 1177
## Stayed Same 149 1386
sum(after_crisis_con <= 5)
## [1] 0
As indicated above, every cell in both tables has more than five observed counts. With observed counts this large, the expected counts are also well above five, so all conditions are met.
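Strictly speaking, the sample-size condition refers to expected counts rather than observed counts. A minimal sketch of checking them directly, assuming the observed counts are re-entered by hand from the two tables above; the expected element of the object returned by chisq.test() holds the counts expected under independence (hist_obs, ac_obs and the *_expected names are illustrative):

```r
dims <- list(finalter = c("Better", "Worse", "Stayed Same"),
             suicide2 = c("Yes", "No"))

# Observed counts copied from historical_con and after_crisis_con above
hist_obs <- matrix(c(1029, 9975, 582, 6339, 850, 10655),
                   nrow = 3, byrow = TRUE, dimnames = dims)
ac_obs <- matrix(c(117, 998, 142, 1177, 149, 1386),
                 nrow = 3, byrow = TRUE, dimnames = dims)

# Expected counts under independence, as computed by chisq.test()
hist_expected <- chisq.test(hist_obs)$expected
ac_expected <- chisq.test(ac_obs)$expected

round(ac_expected, 1)

# The condition holds when no expected count falls below five
all(hist_expected >= 5)
all(ac_expected >= 5)
```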
Third, the method of inference used is the chi-square test of independence.
\(\text{expected} = \frac{(\text{row total}) \times (\text{column total})}{\text{table total}}\)
\(\chi^{2} = \sum_{i = 1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}\), where \(k\) is the number of cells
\(df = (R - 1)\times(C - 1)\), where \(R\) is the number of rows and \(C\) is the number of columns
Using the notation above, the expected count is calculated for each cell of the contingency table, and the resulting terms are summed to give the chi-square statistic. Here we skip over this tedious process.
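For reference, the calculation can be sketched in a few lines of base R. This is a minimal illustration, assuming the observed counts are re-entered by hand from the historical contingency table above; obs, expected, chi_sq and df are illustrative names.

```r
# Observed counts copied from the historical contingency table above
obs <- matrix(c(1029, 9975, 582, 6339, 850, 10655),
              nrow = 3, byrow = TRUE,
              dimnames = list(finalter = c("Better", "Worse", "Stayed Same"),
                              suicide2 = c("Yes", "No")))

# expected = (row total) x (column total) / (table total), for every cell
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_sq <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (number of rows - 1) x (number of columns - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

chi_sq
df
```

The resulting statistic and degrees of freedom should agree with what chisq.test() reports for the same table.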
Last, inference is performed using chisq.test().
chisq_historical <- chisq.test(historical$finalter, historical$suicide2)
chisq_historical
##
## Pearson's Chi-squared test
##
## data: historical$finalter and historical$suicide2
## X-squared = 28.311, df = 2, p-value = 7.119e-07
The test statistic for the historical data is statistically significant. The extremely low p-value tells us that there is something going on and that the variables are dependent. Hence, we reject \(H_0\) at any conventional significance level.
In the case of after_crisis,
chisq_after_crisis <- chisq.test(after_crisis$finalter, after_crisis$suicide2)
chisq_after_crisis
##
## Pearson's Chi-squared test
##
## data: after_crisis$finalter and after_crisis$suicide2
## X-squared = 0.93916, df = 2, p-value = 0.6253
Strikingly, the p-value obtained for the after_crisis data, 0.6253, is much higher than I was expecting and not significant at all. Therefore, we fail to reject \(H_0\).
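Both p-values are upper-tail probabilities of a chi-square distribution with 2 degrees of freedom. A minimal sketch of recovering them with pchisq(), assuming the rounded X-squared values printed in the two outputs above (the variable names are illustrative):

```r
# Test statistics as printed by chisq.test() above (rounded)
chi_sq_historical <- 28.311
chi_sq_after_crisis <- 0.93916
df <- 2  # (3 rows - 1) x (2 columns - 1)

# p-value = P(chi-square with df degrees of freedom >= observed statistic)
p_historical <- pchisq(chi_sq_historical, df = df, lower.tail = FALSE)
p_after_crisis <- pchisq(chi_sq_after_crisis, df = df, lower.tail = FALSE)

p_historical
p_after_crisis
```

Because the inputs are rounded, the results should match the printed p-values only up to rounding.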
At the very beginning of this study, I thought the two variables, suicide2 and finalter, would produce a statistically significant test statistic, because I reckoned that many people were struggling with financial hardship during and after the 2008 Crisis. The mass media were reporting bankruptcies, suicides, and similar disasters on an hourly basis. However, the statistical results differ from what I expected. This might be because of poor filtering on the year and age variables. The result might, in fact, be different if I had kept the generation the same in the after-crisis analysis.