Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

1a: Introduction

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.

This project uses an extract of the General Social Survey (GSS) Cumulative File 1972-2012 that was provided by Coursera.

The data extract contains 57061 observations of 114 variables.

Unlike the full General Social Survey Cumulative File, the extract has been sanitized by removing missing values from the responses and by creating factor variables where appropriate to facilitate analysis in R.

The data are collected through surveys conducted in several modes (e.g. face-to-face, mail, telephone, Internet), and all respondents are adults aged 18 or older.

As the survey is conducted by random sampling, the results from this project can be generalized to the US adult population, but not to the entire US population. Because the study is observational rather than experimental, the statistical tests performed cannot establish causal relationships between the variables of interest.

Part 2: Research question

I would like to explore whether the frequency of reading the newspaper is associated with the confidence one has in the executive branch of the federal government. Specifically: in 2012, did confidence in the executive branch differ between frequent and infrequent news readers?

The variables used for this analysis are:

news - How often do you read the newspaper - every day, a few times a week, once a week, less than once a week, or never?
confed - Do you have a great deal of confidence, only some confidence, or hardly any confidence at all in the executive branch of the federal government?

# Keep the 2012 respondents who answered both questions
gss_12 <- gss %>%
  filter(year == 2012, !is.na(news), !is.na(confed)) %>%
  select(news, confed)

summary(gss_12$news)

##          Everyday  Few Times A Week       Once A Week Less Than Once Wk 
##               199               102                83               108 
##             Never 
##               154

summary(gss_12$confed)

## A Great Deal    Only Some   Hardly Any 
##           83          307          256

Part 3: Exploratory data analysis

tab <- table(gss_12$news, gss_12$confed)

tab

##                    
##                     A Great Deal Only Some Hardly Any
##   Everyday                    27        90         82
##   Few Times A Week            18        53         31
##   Once A Week                 13        39         31
##   Less Than Once Wk           14        56         38
##   Never                       11        69         74

ggplot(data = gss_12, aes(x = news)) + geom_bar(aes(fill = confed), position = "dodge") + theme(axis.text.x = element_text(angle = 60, hjust = 1))

plot(table(gss_12$news, gss_12$confed))

Observations:

1: Respondents with a great deal of confidence are consistently the smallest group across all news-reading frequencies.

2: Respondents with only some confidence are the largest group at every news-reading frequency except among respondents who never read the newspaper. Interestingly, among those who never read, the largest group has hardly any confidence in the executive branch of government.

3: In the mosaic plot, the share of A Great Deal tends to shrink as news reading becomes less frequent, from about 18% among those who read a few times a week to about 7% among those who never read, although everyday readers (about 14%) break the pattern.
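The comparisons above are easier to check as within-group shares than as raw counts. The following is a minimal sketch in base R; it assumes the counts are retyped by hand from the printed table, and the names `observed` and `row_share` are introduced here for illustration only.

```r
# Observed counts retyped from the contingency table above
# (rows: news-reading frequency, columns: confidence level).
observed <- matrix(
  c(27, 90, 82,
    18, 53, 31,
    13, 39, 31,
    14, 56, 38,
    11, 69, 74),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news   = c("Everyday", "Few Times A Week", "Once A Week",
               "Less Than Once Wk", "Never"),
    confed = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Share of each confidence level within each news-reading group
# (margin = 1 makes every row sum to 1).
row_share <- prop.table(observed, margin = 1)
round(row_share, 3)

# Most common confidence level in each group
largest_group <- colnames(observed)[apply(row_share, 1, which.max)]
largest_group
```

With the data loaded, `prop.table(tab, margin = 1)` should give the same shares without retyping the counts.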

Part 4: Inference

Hypothesis:

Null Hypothesis: Frequency of reading news and confidence in executive branch of government are independent.

Alternative Hypothesis: Frequency of reading news and confidence in executive branch of government are dependent.

Method:

We will use the chi-square test of independence, as we are testing the dependence of two categorical variables, each of which has more than two levels.

Conditions:

Independence:
1: The GSS is collected by random sampling, so we can assume the observations are independent.
2: Sampling is done without replacement, and the 57061 observations in the extract (646 of which are used here) are definitely less than 10% of the US population.
3: Each case contributes to only one cell in the table.

Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases. To check this, let's extract the expected counts from the chi-square test.

chisq.test(gss_12$news, gss_12$confed)$expected

##                    gss_12$confed
## gss_12$news         A Great Deal Only Some Hardly Any
##   Everyday              25.56811  94.57121   78.86068
##   Few Times A Week      13.10526  48.47368   40.42105
##   Once A Week           10.66409  39.44427   32.89164
##   Less Than Once Wk     13.87616  51.32508   42.79876
##   Never                 19.78638  73.18576   61.02786

We can see that each cell has an expected count of at least 5, so the sample size condition is met.
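The expected counts come from the usual formula (row total × column total) / grand total. As a rough sketch, again assuming the observed counts are retyped from the contingency table in Part 3 (the names `observed` and `expected` are illustrative), they can be reproduced without calling `chisq.test`:

```r
# Observed counts retyped from the contingency table in Part 3.
observed <- matrix(
  c(27, 90, 82,
    18, 53, 31,
    13, 39, 31,
    14, 56, 38,
    11, 69, 74),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news   = c("Everyday", "Few Times A Week", "Once A Week",
               "Less Than Once Wk", "Never"),
    confed = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

n <- sum(observed)  # total number of respondents in the table

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n
round(expected, 5)

# Sample size condition: every expected count is at least 5
all(expected >= 5)
```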

Degrees of freedom:

df = (r-1)*(c-1), where r is the number of rows (levels of news) and c is the number of columns (levels of confed)

= (5-1)*(3-1)

= 8

Chi-square test of independence:

We will run this test to find the test statistic (X^2) and the p-value.

chisq.test(gss_12$news, gss_12$confed)

## 
##  Pearson's Chi-squared test
## 
## data:  gss_12$news and gss_12$confed
## X-squared = 13.362, df = 8, p-value = 0.1
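To make the reported numbers less of a black box, the statistic can be rebuilt by hand as the sum of (observed - expected)^2 / expected over all cells, with the p-value taken from the upper tail of the chi-square distribution. This is a sketch under the assumption that the counts are retyped from the contingency table in Part 3; the variable names are illustrative.

```r
# Observed counts retyped from the contingency table in Part 3
# (rows: news-reading frequency, columns: confidence level).
observed <- matrix(
  c(27, 90, 82,
    18, 53, 31,
    13, 39, 31,
    14, 56, 38,
    11, 69, 74),
  nrow = 5, byrow = TRUE
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum over all cells
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: probability of a statistic at least this large under the null
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

c(X2 = x2, df = df, p = p_value)
```

The values should agree with the `chisq.test` output above up to rounding.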

Findings:

X^2 = 13.362

df = 8

p-value = 0.1

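As we can see, our p-value is high: at 0.1, it is above the conventional 5% significance level. With a high p-value, we fail to reject the null hypothesis. This means the data do not provide convincing evidence that frequency of reading news and confidence in the executive branch of government are dependent. Note that failing to reject the null hypothesis is not the same as showing that the two variables are independent.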

As the study is observational, it could at most establish an association, not a causal relationship.

Statistical inference with the GSS data