This is the course project for the Inferential Statistics course on Coursera by Duke University. The goal is to investigate the impact of news reading on confidence in the military, using data from the General Social Survey (GSS).
library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.2
library(statsr)
library(lattice)
The data file must be in the same directory as the R Markdown file. Once loaded, the data frame is called gss.
load("gss.Rdata")
Source: General Social Survey (GSS)
The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
GSS questions include such items as national spending priorities, marijuana use, crime and punishment, race relations, quality of life, and confidence in institutions. Since 1988, the GSS has also collected data on sexual behavior including number of sex partners, frequency of intercourse, extramarital relationships, and sex with prostitutes.
The GSS conducts surveys by random sampling and thus the data can be generalized to the general population of the USA.
This is an observational study: the data are collected in a way that does not interfere with how they arise, so they can provide evidence only of a naturally occurring association between variables. Since no random assignment is involved, no causal relationship can be established from GSS data.
The USA allocates a significant portion of its budget to the military and, according to a report by Statista, is a global leader in military spending.
In recent years, the United States military has been in the news a lot. The purpose of this research is therefore to investigate whether there is convincing evidence that the frequency of reading the news is associated with the confidence a person has in the military.
Based on this question, the analysis uses the following variables.
news: A categorical variable indicating how often the respondent reads the newspaper.
conarmy: A categorical variable indicating the confidence level the respondent has in the military.
To capture more recent times, we limit the study to the years from 2001 (the year of the 9/11 attacks) to the latest year in the data (2012). The filtered table is stored in a new data frame, gss_2001.
# Select years from 2001 onward and drop missing values in the variables of interest
gss_2001 <- gss %>%
  filter(year >= 2001 &
           !is.na(news) &
           !is.na(conarmy)) %>%
  select(news, year, conarmy)
head(gss_2001)
##          news year      conarmy
## 1 Once A Week 2002 A Great Deal
## 2    Everyday 2002    Only Some
## 3    Everyday 2002    Only Some
## 4       Never 2002   Hardly Any
## 5    Everyday 2002 A Great Deal
## 6       Never 2002 A Great Deal
Let’s get the number of observations in the new data frame.
dim(gss_2001)
## [1] 3920 3
We perform some exploratory data analysis before inference.
Let’s view the observations in the news column.
table(gss_2001$news)
##
##          Everyday  Few Times A Week       Once A Week Less Than Once Wk
##              1339               795               544               611
##             Never
##               631
The largest group in the table above reads the newspaper every day, followed by the group that reads it a few times a week.
Next, we view the observations in the conarmy (confidence in the military) column.
table(gss_2001$conarmy)
##
## A Great Deal    Only Some   Hardly Any
##         1981         1516          423
From the table above, we observe that the largest subset (1981 of 3920 respondents, about half) has a great deal of confidence in the military.
table(gss_2001$news, gss_2001$conarmy)
##
##                     A Great Deal Only Some Hardly Any
##   Everyday                   683       511        145
##   Few Times A Week           411       289         95
##   Once A Week                278       210         56
##   Less Than Once Wk          280       268         63
##   Never                      329       238         64
First, we make a mosaic plot, which illustrates the relationship between the two categorical variables of interest.
plot(table(gss_2001$news,gss_2001$conarmy), color = TRUE, main = "Newspaper readers and Confidence in the Military")
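To interpret the plot above, we note that:
- The area of each tile is proportional to the number of respondents in that combination of news-reading frequency and confidence level.
- When the tiles for each confidence level take up the same proportion of every news-reading group, so that the splits line up across groups, it indicates independence between the variables.
Observation: From the plot above, the share of respondents with A Great Deal of confidence is roughly constant across the news-reading groups, except among those who read a newspaper Less Than Once Wk, where it is noticeably smaller.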
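The observation from the mosaic plot can be checked numerically by converting the counts to row proportions. The sketch below rebuilds the contingency table by hand from the counts printed earlier, so that it runs without gss.Rdata; the names observed and row_props are illustrative and do not appear in the original analysis.

```r
# Observed counts copied from the contingency table above
# (rows: news-reading frequency, columns: confidence in the military).
observed <- matrix(
  c(683, 511, 145,
    411, 289,  95,
    278, 210,  56,
    280, 268,  63,
    329, 238,  64),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news = c("Everyday", "Few Times A Week", "Once A Week",
             "Less Than Once Wk", "Never"),
    conarmy = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# Row proportions: the distribution of confidence within each news group.
row_props <- prop.table(observed, margin = 1)
round(row_props, 3)
```

With the data loaded, prop.table(table(gss_2001$news, gss_2001$conarmy), margin = 1) should give the same proportions. The A Great Deal share is about 0.46 for Less Than Once Wk and between roughly 0.51 and 0.52 for the other groups.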
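Next, we make a bar plot to give us a better picture of each group of newspaper readers.
# One group of bars per news-reading frequency, filled by confidence level
g <- ggplot(data = gss_2001, aes(x = news))
g <- g + geom_bar(aes(fill = conarmy), position = "dodge")
# Rotate the x-axis labels so the long category names do not overlap
g + theme(axis.text.x = element_text(angle = 45, hjust = 1))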
Observations:
Respondents with a great deal of confidence form the largest category in every group of news readers. For respondents who read newspapers less than once a week, the counts for A Great Deal (280) and Only Some (268) are very similar.
Respondents with hardly any confidence are the smallest category in every group, including respondents who never read the newspaper (64 of 631). No group has a majority with hardly any confidence in the military.
Null hypothesis (H0): The frequency of reading newspapers and the confidence in the military are independent.
Alternative hypothesis (HA): The frequency of reading newspapers and the confidence in the military are dependent.
4.2 Check Conditions
1. Independence: The GSS data come from a random sample survey, so we can assume that the observations (respondents) are independent of one another.
2. Sample Size: The 3920 observations used for this study are far fewer than 10% of the US population, so this condition is satisfied.
3. Degrees of Freedom: We have 3 confidence levels and 5 news-reading frequency levels, giving (5 - 1) × (3 - 1) = 8 degrees of freedom. As we have two categorical variables, each with more than 2 levels, we use the chi-squared test of independence to test the hypotheses.
4. Expected Counts: To perform a chi-square test (goodness of fit or independence), the expected count for each cell should be at least 5.
chisq.test(gss_2001$news,gss_2001$conarmy)$expected
##                    gss_2001$conarmy
## gss_2001$news       A Great Deal Only Some Hardly Any
##   Everyday              676.6732  517.8378  144.48903
##   Few Times A Week      401.7589  307.4541   85.78699
##   Once A Week           274.9143  210.3837   58.70204
##   Less Than Once Wk     308.7732  236.2949   65.93189
##   Never                 318.8804  244.0296   68.09005
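The expected counts above follow the usual formula for a test of independence: each cell's expected count is its row total times its column total, divided by the overall sample size. As a worked check using the totals from the earlier tables (the symbols E, R, C, and n are notation introduced here, not taken from the original report):

```latex
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
E_{\text{Everyday},\,\text{A Great Deal}} = \frac{1339 \times 1981}{3920} \approx 676.67
```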
From the table above, every cell has an expected count of at least 5 (the smallest is about 58.7). Therefore, we can proceed with the chi-squared test of independence, first with chisq.test and then with the inference function from statsr.
chisq.test(gss_2001$news,gss_2001$conarmy)
##
## Pearson's Chi-squared test
##
## data: gss_2001$news and gss_2001$conarmy
## X-squared = 10.402, df = 8, p-value = 0.2379
The chi-squared statistic is 10.402 and the corresponding p-value for 8 degrees of freedom is 0.2379. The p-value of 0.2379 is much greater than the significance level of 0.05.
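As a cross-check on the reported statistic, the test can be reproduced by hand from the observed counts. This is a minimal sketch, assuming the counts printed in the contingency table earlier; the variable names (observed, expected, chi_sq, df, p_value) are illustrative and are not part of the original analysis.

```r
# Observed counts copied from the contingency table
# (rows: news-reading frequency, columns: confidence in the military).
observed <- matrix(
  c(683, 511, 145,
    411, 289,  95,
    278, 210,  56,
    280, 268,  63,
    329, 238,  64),
  nrow = 5, byrow = TRUE
)

n <- sum(observed)

# Expected count per cell under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(c(chi_sq = chi_sq, df = df, p_value = p_value), 4)
```

The rounded values should agree with the chisq.test output above (X-squared = 10.402, df = 8, p-value = 0.2379).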
inference(y = conarmy, x = news, data = gss_2001, statistic = "proportion", type = "ht", null = 0, alternative = "greater", method = "theoretical",success = "High")
## Warning: Ignoring null value since it's undefined for chi-square test of
## independence
## Response variable: categorical (3 levels)
## Explanatory variable: categorical (5 levels)
## Observed:
##                    y
## x                   A Great Deal Only Some Hardly Any
##   Everyday                   683       511        145
##   Few Times A Week           411       289         95
##   Once A Week                278       210         56
##   Less Than Once Wk          280       268         63
##   Never                      329       238         64
##
## Expected:
##                    y
## x                   A Great Deal Only Some Hardly Any
##   Everyday              676.6732  517.8378  144.48903
##   Few Times A Week      401.7589  307.4541   85.78699
##   Once A Week           274.9143  210.3837   58.70204
##   Less Than Once Wk     308.7732  236.2949   65.93189
##   Never                 318.8804  244.0296   68.09005
##
## H0: news and conarmy are independent
## HA: news and conarmy are dependent
## chi_sq = 10.4021, df = 8, p_value = 0.2379
Since the p-value is above the significance level of 5% (p > 0.05), we fail to reject the null hypothesis (H0). In the context of the research question, the data do not provide convincing evidence that the frequency of reading the newspaper and a person's confidence in the military are associated.
Failing to reject H0 does not prove that the two variables are independent. In addition, the study is observational, so even a significant result would establish only an association, not a causal link, between these two variables.