This is a project for the Inferential Statistics course, which is part of Coursera’s Statistics with R Specialization.
The report aims to perform statistical inference via hypothesis testing and/or confidence intervals on the GSS (General Social Survey) data set for the years \(1972-2012\).
Since 1972, the General Social Survey has been monitoring societal change and studying the growing complexity of American society. GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
Load packages & data
library(ggplot2); library(dplyr); library(statsr)
library(gridExtra); library(plotly)
if(!file.exists("./data/1126_SR-IS-w5_ Suicide/gss.RData")) {
download.file("https://d3c33hcgiwev3.cloudfront.net/_5db435f06000e694f6050a2d43fc7be3_gss.Rdata?Expires=1609459200&Signature=QV37aZJ-NFeTr7XXolAEcHxQUXjYxwMupwPrplRWjviud3Q4Ho6pVQi9egwtw1hUmZZ9vYXiivNUWEpUd2zgSbnI3r5A9mJoU77O698Hb3~crmYJ3k-qXQBiTKRaBVBemoE4YKi5AEyyMDjh1P0xq01pyYwxlU3AKSRvUTbUROY_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
destfile = "./data/1126_SR-IS-w5_ Suicide/gss.RData",
method = "curl")
}
load("./data/1126_SR-IS-w5_ Suicide/gss.RData")
The data come from the General Social Survey (GSS) for the years 1972-2012 and include \(57\ 061\) observations on \(114\) variables. Samples were drawn using an area probability design: randomly selected US adults (18+) answered the survey questions in a face-to-face, in-person interview of about 90 minutes. More information can be found in the GSS FAQ; the GSS-2012 codebook modified for the assignment can be found here.
Millions of people struggle with economic hardships during the coronavirus pandemic. This April, the US National Suicide Prevention Hotline reported a nine-fold increase in calls compared to the same time last year.
Suicide rates spiked during the Great Depression of the 1930s, and also rose in the U.S. and many other countries during the Great Recession of 2008.
According to the most recent report by the Federal Reserve Bank of New York’s Center for Microeconomic Data, outstanding student loan debt in the US stands at nearly $1.5 trillion, and total credit card balances rose by $20 billion to $870 billion as of the second quarter of 2019. Student loan debt has been a growing concern for young people and can affect their current circumstances.
The question is:
Is there a relationship between a person’s financial satisfaction satfin and their opinion on having the right to commit suicide because of being tired of living suicide4?
Create a data set qdata with the two variables of interest (satfin, suicide4), and look at its structure and summary:
qdata<- gss %>%
select(satfin, suicide4)
str(qdata)
'data.frame': 57061 obs. of 2 variables:
 $ satfin  : Factor w/ 3 levels "Satisfied","More Or Less",..: 3 2 1 3 1 2 2 3 2 3 ...
 $ suicide4: Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA NA NA NA NA ...
summary(qdata)
            satfin       suicide4
 Satisfied     :15344   Yes : 4579
 More Or Less  :23176   No  :24629
 Not At All Sat:13934   NA's:27853
 NA's          : 4607
Both variables are categorical and recorded as factors: financial satisfaction satfin has three levels (“Satisfied”/ “More Or Less”/ “Not At All Sat”), and right to suicide suicide4 has two levels (“Yes”/ “No”).
Both also have many NA values (4607 in satfin and 27853 in suicide4), so remove the incomplete rows. Then look at the dimensions of qdata and plot the group frequencies:
# Keep only respondents who answered both questions
qdata <- na.omit(qdata)
dim <- dim(qdata)
# Bar chart of financial satisfaction groups
p1 <- ggplot(qdata) + aes(satfin, fill = satfin) +
  geom_bar() +
  scale_fill_manual(name = "",
                    values = c("cornflowerblue", "violet", "purple")) +
  xlab("Financial satisfaction") + guides(fill = "none")
# Bar chart of answers to the right-to-suicide question
p2 <- ggplot(qdata) + aes(suicide4, fill = suicide4) +
  geom_bar() +
  scale_fill_manual(name = "", values =
                      c("firebrick3", "springgreen3")) +
  xlab("Right to suicide") + guides(fill = "none")
grid.arrange(p1, p2, ncol = 2, top = "Groups frequencies")
“More Or Less” is the most common answer, and a clear majority of respondents think people do not have the right to suicide. Now construct the contingency table and a proportion bar chart:
table(qdata)
                suicide4
satfin             Yes    No
  Satisfied       1407  7148
  More Or Less    1843 10719
  Not At All Sat  1317  6676
g <- ggplot(qdata) + aes (satfin, fill=suicide4) + geom_bar(position="fill")+
labs(x="Financial satisfaction",
y="Proportion",
title="Financial satisfaction vs Opinion on suicide (interactive)")+
scale_fill_manual(name="Right to suicide", values =
c("firebrick3","springgreen3"))+
theme(legend.title=element_text(size=7.3), title=element_text(size=8))
ggplotly(g)
Both the table and the plot show only a very small difference between the groups, and it is hard to judge by eye whether that difference is meaningful.
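One way to put numbers on the difference seen in the table and the plot is to compute row proportions. The sketch below is only an illustration: it re-enters the counts from the contingency table by hand, and the names `observed`, `row_props` and `yes_gap` are not part of the original analysis.

```r
# Observed counts copied from the contingency table above
# (object names here are illustrative, not from the original report)
observed <- matrix(
  c(1407,  7148,
    1843, 10719,
    1317,  6676),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin   = c("Satisfied", "More Or Less", "Not At All Sat"),
    suicide4 = c("Yes", "No")
  )
)

# Share of "Yes" and "No" answers within each financial satisfaction group
row_props <- prop.table(observed, margin = 1)
print(round(row_props, 3))

# Largest gap between groups in the share of "Yes" answers
yes_gap <- max(row_props[, "Yes"]) - min(row_props[, "Yes"])
print(round(yes_gap, 3))
```

With these counts the share of “Yes” answers is roughly 16.4%, 14.7% and 16.5% across the three groups, a gap of about 1.8 percentage points.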
First, it is reasonable to check whether the proportions of “Yes” and “No” answers to the suicide question differ across the financial satisfaction groups. Apply prop.test to find out:
prop <- prop.test(table(qdata$satfin, qdata$suicide4), correct = FALSE,
                  alternative = "two.sided", conf.level = 0.95)
Since the p-value \(=\) 0.00018 \(< 0.05\), reject the null hypothesis and conclude that the proportion of “Yes” answers is not the same in every group.
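For more than two groups, prop.test without continuity correction computes the Pearson chi-square statistic, so it should agree with chisq.test on the same table. The sketch below checks this on the counts from the contingency table, re-entered by hand; `observed`, `pt` and `ct` are illustrative names.

```r
# Counts from the contingency table: column 1 = "Yes", column 2 = "No"
observed <- matrix(
  c(1407,  7148,
    1843, 10719,
    1317,  6676),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin   = c("Satisfied", "More Or Less", "Not At All Sat"),
    suicide4 = c("Yes", "No")
  )
)

# Test of equal proportions of "Yes" across the three groups
pt <- prop.test(observed, correct = FALSE)

# Pearson chi-square test of independence on the same table
ct <- chisq.test(observed, correct = FALSE)

# Both approaches should report the same statistic, df and p-value
print(c(prop_test = unname(pt$statistic), chisq_test = unname(ct$statistic)))
print(c(prop_test = unname(pt$parameter), chisq_test = unname(ct$parameter)))
print(c(prop_test = pt$p.value, chisq_test = ct$p.value))
```

This is why the p-value reported here matches the one from the chi-square independence test later in the report.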
There are two categorical variables, and at least one of them (satfin) has more than two levels. So the Chi-square independence test can be used to see whether the variables satfin and suicide4 are independent. The test does not require normally distributed data; its conditions are checked below.
\(H_0\): In the population, people’s opinion on the right to suicide and their financial satisfaction are independent
\(H_A\): In the population, people’s opinion on the right to suicide and their financial satisfaction are dependent
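In other words, under \(H_0\) the proportion of “Yes” answers is the same in every financial satisfaction group.
There are two conditions for this test: independence of observations (random sampling, with each respondent contributing to exactly one cell of the table) and sample size (each cell has at least five expected cases). The expected counts are: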
chisq.test(qdata$satfin, qdata$suicide4)$expected
                qdata$suicide4
qdata$satfin          Yes        No
  Satisfied      1342.174  7212.826
  More Or Less   1970.823 10591.177
  Not At All Sat 1254.003  6738.997
All expected cell counts are greater than five, so the sample size condition is met.
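The expected counts can also be reproduced by hand: under independence, each expected count is the row total times the column total divided by the grand total. A minimal sketch, assuming the observed counts are re-entered from the contingency table (`observed`, `n` and `expected` are illustrative names):

```r
# Observed counts copied from the contingency table
observed <- matrix(
  c(1407,  7148,
    1843, 10719,
    1317,  6676),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin   = c("Satisfied", "More Or Less", "Not At All Sat"),
    suicide4 = c("Yes", "No")
  )
)

# Expected count for each cell = row total * column total / grand total
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n
print(round(expected, 3))

# Sample size condition: every expected count should be at least 5
print(all(expected >= 5))
```

For example, the “Satisfied”/“Yes” cell is 8555 × 4567 / 29110 ≈ 1342.174, matching the output of chisq.test above.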
inf <- inference(y = suicide4, x = satfin, data = qdata, statistic = "proportion",
                 type = "ht", null = NULL, alternative = "greater", sig_level = 0.05,
                 method = "theoretical")
Response variable: categorical (2 levels)
Explanatory variable: categorical (3 levels)
Observed:
                y
x                 Yes    No
  Satisfied      1407  7148
  More Or Less   1843 10719
  Not At All Sat 1317  6676
Expected:
                y
x                     Yes        No
  Satisfied      1342.174  7212.826
  More Or Less   1970.823 10591.177
  Not At All Sat 1254.003  6738.997
H0: satfin and suicide4 are independent
HA: satfin and suicide4 are dependent
chi_sq = 17.3003, df = 2, p_value = 0.0002
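The reported statistic can be checked by hand. The sketch below sums (observed − expected)² / expected over all six cells and converts the result to a p-value with pchisq; the counts are re-entered from the output above, and the object names are illustrative.

```r
# Observed counts copied from the inference() output
observed <- matrix(
  c(1407,  7148,
    1843, 10719,
    1317,  6676),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin   = c("Satisfied", "More Or Less", "Not At All Sat"),
    suicide4 = c("Yes", "No")
  )
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(c(chi_sq = chi_sq, df = df), 4))
print(signif(p_value, 2))
```

The upper tail is used because only large values of the statistic count as evidence against independence, which is also why the inference call takes alternative = "greater".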
The test returns a large chi-square value of 17.3 (df = 2) and a very small p-value of 0.00018 (much lower than the significance level of \(5\%\)), therefore reject the null hypothesis; the sample provides enough evidence that:
people’s opinion on having the right to suicide and their financial satisfaction are dependent, i.e. opinion on the right to suicide differs significantly across financial satisfaction groups.
Note 1: CAUSATION. The only conclusion is that the variables are related (there is an association); the relationship is not necessarily causal, in the sense that one variable “causes” the other.
Note 2: CONFIDENCE INTERVAL. The Chi-square test doesn’t define confidence intervals (it is a non-parametric test), so the confidence interval (CI) method is not applicable and only the hypothesis test (HT) method is used.
The study establishes a dependence between opinion on having the right to suicide and financial satisfaction.
The original data set includes GSS observations for the years 1972-2012, so the results can only be generalised to the US adult resident population in this period.
At first, visual analysis gave no clear indication of whether the variables are dependent. The hypothesis of dependence was then tested with the Chi-square independence test, which found that opinion on having the right to suicide differs significantly across financial satisfaction groups.
FUTURE RESEARCH could address the impact of consumer debt burden and other forms of financial distress on mental health.