library(ggplot2)
library(dplyr)
library(statsr)
load("gss.Rdata")
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
This project uses an extract of the General Social Survey (GSS) Cumulative File 1972-2012 that was provided by Coursera.
The data extract contains 57061 observations of 114 variables.
Unlike the full General Social Survey Cumulative File, the extract has been sanitized by removing missing values from the responses, and factor variables were created where appropriate to facilitate analysis in R.
The data are collected through surveys conducted in several modes (e.g. face-to-face, mail, telephone, Internet). Because respondents are selected by random sampling, the results from this project can be generalized to the population that was sampled, namely US adults aged 18 and over, but not to the entire US population including minors.
This is an observational study with no random assignment, so the statistical tests performed cannot establish causal relationships between the variables of interest.
I would like to explore the relationship between the frequency of reading the news and the confidence one has in the executive branch of the federal government. In 2012, was how often respondents read the newspaper associated with their confidence in the executive branch?
The variables used for this analysis are:
news - How often do you read the newspaper - every day, a few times a week, once a week, less than once a week, or never?
confed - Do you have a great deal of confidence, only some confidence, or hardly any confidence at all in the executive branch of the federal government?
gss_12 <- gss %>% filter(year == 2012 & !is.na(news) & !is.na(confed)) %>% select(news, confed)
summary(gss_12$news)
##          Everyday  Few Times A Week       Once A Week Less Than Once Wk
##               199               102                83               108
##             Never
##               154
summary(gss_12$confed)
## A Great Deal    Only Some   Hardly Any
##           83          307          256
tab <- table(gss_12$news, gss_12$confed)
tab
##
##                     A Great Deal Only Some Hardly Any
##   Everyday                    27        90         82
##   Few Times A Week            18        53         31
##   Once A Week                 13        39         31
##   Less Than Once Wk           14        56         38
##   Never                       11        69         74
ggplot(data = gss_12, aes(x = news)) +
  geom_bar(aes(fill = confed), position = "dodge") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
plot(table(gss_12$news, gss_12$confed))
Observations:
1: Respondents with a great deal of confidence are consistently the smallest group across all categories of news readers.
2: Respondents with only some confidence form the largest group in every category except those who never read the newspaper. Interestingly, among respondents who never read, the largest group has hardly any confidence in the executive branch of government.
3: In the mosaic plot, the area of the A Great Deal category shrinks steadily as reading frequency falls from a few times a week to never. Respondents who read every day are the exception, with a smaller share than those who read a few times a week.
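The observations above are easier to check against within-group shares than against raw counts. The following is a minimal sketch, assuming the counts are copied by hand from the contingency table printed earlier. The names `tab_counts` and `row_props` are illustrative and not part of the original analysis.

```r
# Counts copied from the contingency table above
# (rows: news frequency, columns: confidence in the executive branch)
tab_counts <- matrix(
  c(27, 90, 82,
    18, 53, 31,
    13, 39, 31,
    14, 56, 38,
    11, 69, 74),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news = c("Everyday", "Few Times A Week", "Once A Week",
             "Less Than Once Wk", "Never"),
    confed = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Share of each confidence level within each news group (each row sums to 1)
row_props <- prop.table(tab_counts, margin = 1)
round(row_props, 3)

# Most common confidence level in each news group
colnames(row_props)[apply(row_props, 1, which.max)]
```

With the original data loaded, `prop.table(tab, margin = 1)` should give the same shares directly from the `tab` object.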
Null Hypothesis: Frequency of reading news and confidence in executive branch of government are independent.
Alternative Hypothesis: Frequency of reading news and confidence in executive branch of government are dependent.
We will use the chi-squared test of independence because we are testing for an association between two categorical variables, and each variable has more than two levels.
Independence: As the GSS is collected by a random sample survey, we can assume the observations are independent. Sampling is done without replacement, but the 646 respondents analyzed here (and even the 57061 observations in the full extract) are well under 10% of the US population. Each respondent contributes to only one cell in the table.
Sample Size: Each particular scenario (i.e. cell) must have at least 5 expected cases. To check this, let's run the chi-squared test and inspect the expected counts.
chisq.test(gss_12$news, gss_12$confed)$expected
##                    gss_12$confed
## gss_12$news         A Great Deal Only Some Hardly Any
##   Everyday              25.56811  94.57121   78.86068
##   Few Times A Week      13.10526  48.47368   40.42105
##   Once A Week           10.66409  39.44427   32.89164
##   Less Than Once Wk     13.87616  51.32508   42.79876
##   Never                 19.78638  73.18576   61.02786
We can see that each cell has an expected count of at least 5.
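The expected counts come from the row and column totals alone: under independence, each cell's expected count is (row total × column total) / grand total. The sketch below recomputes them by hand, assuming the observed counts are copied from the contingency table printed earlier. The names `observed` and `expected` are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(27, 90, 82,
    18, 53, 31,
    13, 39, 31,
    14, 56, 38,
    11, 69, 74),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news = c("Everyday", "Few Times A Week", "Once A Week",
             "Less Than Once Wk", "Never"),
    confed = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 5)

# Sample size condition: every expected count should be at least 5
all(expected >= 5)
```

For example, the Everyday / A Great Deal cell is 199 × 83 / 646, which matches the first entry of the table returned by `chisq.test(...)$expected`.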
df = (r-1)*(c-1)
= (5-1)*(3-1)
= 8
We will run the test to find the test statistic (X^2) and the p-value.
chisq.test(gss_12$news, gss_12$confed)
##
##  Pearson's Chi-squared test
##
## data:  gss_12$news and gss_12$confed
## X-squared = 13.362, df = 8, p-value = 0.1
X^2 = 13.362
df = 8
p-value = 0.1
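To see where these numbers come from, the statistic can be rebuilt from its definition: the sum over all cells of (observed - expected)^2 / expected, compared against a chi-squared distribution with (r-1)*(c-1) degrees of freedom. This is a minimal sketch, assuming the observed counts are copied from the contingency table printed earlier. The names `obs`, `expd`, `x_squared`, `deg_free` and `p_value` are illustrative.

```r
# Observed counts copied from the contingency table above
obs <- matrix(
  c(27, 90, 82,
    18, 53, 31,
    13, 39, 31,
    14, 56, 38,
    11, 69, 74),
  nrow = 5, byrow = TRUE
)

# Expected counts under independence
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells
x_squared <- sum((obs - expd)^2 / expd)

# Degrees of freedom: (rows - 1) * (columns - 1)
deg_free <- (nrow(obs) - 1) * (ncol(obs) - 1)

# p-value: upper-tail area of the chi-squared distribution
p_value <- pchisq(x_squared, df = deg_free, lower.tail = FALSE)

round(c(X_squared = x_squared, df = deg_free, p_value = p_value), 3)
```

Calling `chisq.test(obs)` on the same matrix should agree with these values, since no continuity correction is applied to tables larger than 2 x 2.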
The p-value of 0.1 is larger than the conventional 5% significance level (assumed here, since no level was fixed beforehand), so we fail to reject the null hypothesis. This means the data do not provide convincing evidence that frequency of reading news and confidence in the executive branch of government are dependent. Failing to reject the null hypothesis does not prove that the two variables are independent.
As the study is observational, any association found could not be interpreted as a causal relationship.