Overview

This is a project for the Inferential Statistics course, which is part of Coursera’s Statistics with R Specialization.

The report performs statistical inference, via hypothesis testing and/or confidence intervals, on the GSS (General Social Survey) data set for the years 1972-2012.

Since 1972, the General Social Survey has been monitoring societal change and studying the growing complexity of American society. GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.



Setup

Load packages & data

library(ggplot2); library(dplyr); library(statsr)
library(gridExtra); library(plotly)
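if(!file.exists("./data/1126_SR-IS-w5_ Suicide/gss.RData")) {
  # create the target folder first; download.file() fails if it does not exist
  dir.create("./data/1126_SR-IS-w5_ Suicide", recursive = TRUE, showWarnings = FALSE)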
download.file("https://d3c33hcgiwev3.cloudfront.net/_5db435f06000e694f6050a2d43fc7be3_gss.Rdata?Expires=1609459200&Signature=QV37aZJ-NFeTr7XXolAEcHxQUXjYxwMupwPrplRWjviud3Q4Ho6pVQi9egwtw1hUmZZ9vYXiivNUWEpUd2zgSbnI3r5A9mJoU77O698Hb3~crmYJ3k-qXQBiTKRaBVBemoE4YKi5AEyyMDjh1P0xq01pyYwxlU3AKSRvUTbUROY_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
              destfile = "./data/1126_SR-IS-w5_ Suicide/gss.RData",
              method = "curl")
  }
load("./data/1126_SR-IS-w5_ Suicide/gss.RData")

Part 1: Data

The data come from the General Social Survey (GSS) for the years 1972-2012 and include \(57\ 061\) observations on \(114\) variables. Samples were drawn using an area probability design: randomly selected US adults (18+) answered the survey questions in a face-to-face interview lasting about 90 minutes. More information can be found in the GSS FAQ; a GSS-2012 codebook modified for the assignment is also available.


  • Generalization. The data were collected by random sampling with multi-level stratification by region, race, age, income and sex. So, the results can be generalized to US adults (18+) living in households in the US, with some reservations:
    • all findings can only refer to the respective survey years, from 1972 to 2012;
    • there could be some biases due to
      • voluntary participation;
      • the roughly 90 minutes required for the interview (people without enough time might refuse or not finish answering);
  • Causation/ Correlation. The data come from a survey and not from an experiment with random assignment. So, it’s an observational study, i.e. causality cannot be established, only correlation/ association between the variables examined.

Part 2: Research question

Millions of people struggle with economic hardships during the coronavirus pandemic. This April, the US National Suicide Prevention Hotline reported a nine-fold increase in calls compared to the same time last year.

Suicide rates spiked during the Great Depression of the 1930s, and also rose in the U.S. and many other countries during the Great Recession of 2008.

According to the most recent report by the Federal Reserve Bank of New York’s Center for Microeconomic Data, outstanding student loan debt in the US stands at nearly $1.5 trillion, while total credit card balances rose by $20 billion as of the second quarter of 2019 and stand at $870 billion. Student loan debt has been a growing concern for young people and can affect their current reality.


The question is:

  • Does the GSS data set confirm an association between financial satisfaction (satfin) and a person’s opinion on whether people have the right to commit suicide because they are tired of living (suicide4)?

Part 3: Exploratory data analysis

Create a data set qdata with the two variables of interest (satfin, suicide4), and look at its structure/ summary:

qdata<- gss %>% 
  select(satfin, suicide4)
str(qdata)
'data.frame':   57061 obs. of  2 variables:
 $ satfin  : Factor w/ 3 levels "Satisfied","More Or Less",..: 3 2 1 3 1 2 2 3 2 3 ...
 $ suicide4: Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA NA NA NA NA ...
summary(qdata)
            satfin      suicide4    
 Satisfied     :15344   Yes : 4579  
 More Or Less  :23176   No  :24629  
 Not At All Sat:13934   NA's:27853  
 NA's          : 4607               

Both variables are categorical and recorded as factors, with three levels for financial satisfaction satfin (“Satisfied”/ “More Or Less”/ “Not At All Sat”) and two levels for right to suicide suicide4 (“Yes”/ “No”).

Both also have quite a few NA values, so remove them. Then look at the dimensions of qdata and, visually, at its groups:

qdata<- na.omit(qdata)
dim<-dim(qdata)
p1<- ggplot(qdata) + aes (satfin, fill=satfin)+
  geom_bar()+
  scale_fill_manual(name="",
                          values = c("cornflowerblue","violet","purple")) +
  xlab("Financial satisfaction")+ guides(fill="none")
p2<-ggplot(qdata) + aes (suicide4, fill=suicide4)+
  geom_bar()+
  scale_fill_manual(name="", values =
                      c("firebrick3","springgreen3"))+
  xlab("Right to suicide")+ guides(fill="none")
grid.arrange(p1,p2,ncol=2, top = "Groups frequencies")

“More Or Less” is the most common answer, and a clear majority of respondents think people do not have the right to suicide. Now construct a contingency table and a proportion bar chart:

table(qdata)
                suicide4
satfin             Yes    No
  Satisfied       1407  7148
  More Or Less    1843 10719
  Not At All Sat  1317  6676
g <- ggplot(qdata) + aes (satfin, fill=suicide4) + geom_bar(position="fill")+
  labs(x="Financial satisfaction",
       y="Proportion",
       title="Financial satisfaction vs Opinion on suicide (interactive)")+
  scale_fill_manual(name="Right to suicide", values =
                      c("firebrick3","springgreen3"))+
  theme(legend.title=element_text(size=7.3), title=element_text(size=8))
ggplotly(g)

There is very little difference between the groups (both in the table and in the plot), and it is hard to quantify by eye.
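One way to put numbers on the bar chart is to compute row-wise proportions of the contingency table. The sketch below rebuilds the table from the counts printed above and applies `prop.table()` by row; the object names `counts` and `row_props` are introduced here for illustration and are not part of the original code.

```r
# Rebuild the contingency table from the counts printed above.
# `counts` is an illustrative name, not an object from the original analysis.
counts <- matrix(
  c(1407,  7148,
    1843, 10719,
    1317,  6676),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin   = c("Satisfied", "More Or Less", "Not At All Sat"),
    suicide4 = c("Yes", "No")
  )
)

# Share of "Yes"/"No" answers within each financial satisfaction group
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

With the full data loaded, `prop.table(table(qdata), margin = 1)` should give the same shares. The “Yes” share is about 16.4% for “Satisfied”, 14.7% for “More Or Less” and 16.5% for “Not At All Sat”, so the groups differ by less than two percentage points.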


Part 4: Inference

First, it’s reasonable to check whether the proportion of people answering “Yes” (versus “No”) to the “suicide question” differs across the financial satisfaction groups. Apply prop.test to find out:

  • \(H_0\): The proportion of cases is the same in each “satisfaction group” (Nothing happens)
  • \(H_A\): The proportion of cases is not the same in each “satisfaction group”, i.e. at least one \(p_i\) is different from the others (Something happens)
prop<-prop.test(table(qdata$satfin, qdata$suicide4),correct = FALSE,
          alternative ="two.sided", conf.level = 0.95)

Since the p-value \(=\) 0.00018 \(< 0.05\), reject the null hypothesis and conclude the proportion of cases is not the same in each group.
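For a table with more than two groups, `prop.test()` reports a chi-square statistic, so its result can be cross-checked against `chisq.test()` on the same counts. A minimal sketch using the counts printed in Part 3 (the names `counts`, `prop_res` and `chi_res` are illustrative and do not appear in the original code):

```r
# Counts copied from the contingency table in Part 3 (illustrative object)
counts <- matrix(
  c(1407,  7148,
    1843, 10719,
    1317,  6676),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin   = c("Satisfied", "More Or Less", "Not At All Sat"),
    suicide4 = c("Yes", "No")
  )
)

# Test of equal "Yes" proportions across the three groups
prop_res <- prop.test(counts, correct = FALSE)

# Chi-square test of independence on the same table
chi_res <- chisq.test(counts)

# Both statistics, side by side
c(prop_test  = unname(prop_res$statistic),
  chisq_test = unname(chi_res$statistic))

# Estimated "Yes" proportion per group and the p-value
prop_res$estimate
prop_res$p.value
```

Both calls should report the same statistic and p-value, which is why the test of equal proportions and the independence test reach the same decision here.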

Chi-square Independence test

  • WHY Chi-square Independence test

There are two categorical variables, and at least one of them (satfin) has more than two levels. So, the Chi-square independence test can be used to see whether the variables satfin and suicide4 are independent. The test does not assume normally distributed data; it requires independent observations and sufficiently large expected counts, which are checked below.
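For reference, the test compares the observed cell counts \(O_{ij}\) with the counts \(E_{ij}\) expected under independence. The symbols \(R_i\) (row totals), \(C_j\) (column totals) and \(n\) (total number of respondents) are notation introduced here, not taken from the original report:

\[
\chi^2 = \sum_{i=1}^{3}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{R_i \, C_j}{n},
\qquad df = (3-1)(2-1) = 2
\]

Here \(i\) runs over the three satfin levels and \(j\) over the two suicide4 levels.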

  • Chi-square HYPOTHESIS STATEMENT

\(H_0\): In the population, people’s opinion on suicide and their financial satisfaction are independent

\(H_A\): In the population, people’s opinion on suicide and their financial satisfaction are dependent

In other words,


  • “Is a person’s opinion on the right to suicide independent of their satisfaction with their financial situation”,
  • or “does the opinion vary with a person’s financial satisfaction”?

  • Checking CONDITIONS

There are two conditions for this test:

  1. independence: this condition is met since the GSS uses random sampling, the sample size is less than 10% of the US population, and each respondent is counted in only one cell.
  2. at least 5 expected cases in each cell. Check it:
chisq.test(qdata$satfin, qdata$suicide4)$expected
                qdata$suicide4
qdata$satfin          Yes        No
  Satisfied      1342.174  7212.826
  More Or Less   1970.823 10591.177
  Not At All Sat 1254.003  6738.997

The expected counts in all cells are well above five.
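The expected counts above can also be reproduced by hand: under independence, each cell’s expected count is its row total times its column total divided by the sample size. A minimal sketch from the observed counts (the object names `counts` and `expected` are illustrative):

```r
# Observed counts copied from the contingency table in Part 3
counts <- matrix(
  c(1407,  7148,
    1843, 10719,
    1317,  6676),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin   = c("Satisfied", "More Or Less", "Not At All Sat"),
    suicide4 = c("Yes", "No")
  )
)

# Expected count per cell under independence:
# (row total * column total) / total sample size
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 3)

# Condition check: every expected count is at least 5
all(expected >= 5)
```

The result should match the `$expected` component returned by `chisq.test()`.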

  • Performing INFERENCE
inf<-inference(y = suicide4, x = satfin, data = qdata, statistic = "proportion",
          type = "ht", null = NULL, alternative = "greater", sig_level = 0.05,
          method = "theoretical")
Response variable: categorical (2 levels) 
Explanatory variable: categorical (3 levels) 
Observed:
                y
x                  Yes    No
  Satisfied       1407  7148
  More Or Less    1843 10719
  Not At All Sat  1317  6676

Expected:
                y
x                     Yes        No
  Satisfied      1342.174  7212.826
  More Or Less   1970.823 10591.177
  Not At All Sat 1254.003  6738.997

H0: satfin and suicide4 are independent
HA: satfin and suicide4 are dependent
chi_sq = 17.3003, df = 2, p_value = 0.0002

The test returns a chi-square statistic of 17.3 (with 2 degrees of freedom) and a very small p-value of 0.00018 (much lower than the significance level of \(5\%\)); therefore, reject the null hypothesis. The sample provides enough evidence that:

people’s opinion on having the right to suicide and their financial satisfaction are dependent, i.e.

  • there is strong evidence that people with different financial satisfaction tend to have a different opinion on having the right to suicide

  • Note 1: CAUSATION. The only conclusion is that the variables are related (there is an association); the relationship is not necessarily causal, in the sense that one variable “causes” the other.

  • Note 2: CONFIDENCE INTERVAL. The Chi-square independence test does not estimate a single population parameter, so there is no associated confidence interval; the confidence interval (CI) method is not applicable, and only the hypothesis test (HT) method is used.
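The chi-square statistic, degrees of freedom and p-value reported by `inference()` above can be reproduced by hand with base R. This is a sketch from the printed counts, not the code that statsr runs internally; the object names are illustrative.

```r
# Observed counts copied from the "Observed" table above
counts <- matrix(
  c(1407,  7148,
    1843, 10719,
    1317,  6676),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin   = c("Satisfied", "More Or Less", "Not At All Sat"),
    suicide4 = c("Yes", "No")
  )
)

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_sq <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The p-value is the upper tail only, which is consistent with `alternative = "greater"` in the `inference()` call: large values of the statistic count as evidence against independence.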

Conclusion

The study finds a statistically significant association between opinion on having the right to suicide and financial satisfaction.

The original data set includes GSS observations for the years 1972-2012, so the results can only be generalised to the US adult resident population in this period.

At first, visual analysis gave no clear indication of whether the variables are dependent. The hypothesis of dependence was then tested with the Chi-square independence test, which found that opinion on having the right to suicide differs significantly across financial satisfaction groups.

FUTURE RESEARCH could address the impact of consumer debt burden and other forms of financial distress on mental health.