Statistical inference with the GSS data

1. Introduction

This is the course project for the Inferential Statistics course on Coursera by Duke University. The goal is to investigate, using the General Social Survey (GSS) dataset, whether the frequency of reading the news is associated with confidence in the military.

Setup

Load packages

library(ggplot2)
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.2

library(statsr)
library(lattice)

Load data

The data file gss.Rdata must be in the same directory as the R Markdown file. Once loaded, the data frame is called gss.

load("gss.Rdata")

Part 1: Data

Source: General Social Survey (GSS)

The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions include such items as national spending priorities, marijuana use, crime and punishment, race relations, quality of life, and confidence in institutions. Since 1988, the GSS has also collected data on sexual behavior including number of sex partners, frequency of intercourse, extramarital relationships, and sex with prostitutes.

The GSS conducts surveys by random sampling and thus the data can be generalized to the general population of the USA.

This is an observational study: the data are collected in a way that does not interfere with how they arise, so the study can only provide evidence of a naturally occurring association between variables. Since no random assignment is involved, no causal relationship can be established from GSS data.

Part 2: Research questions

Newspapers and the Military

The USA allocates a significant portion of its budget to the military and, according to a report by Statista, is a global leader in military spending.

In recent years, the United States military has been in the news a lot. The purpose of this research is therefore to investigate whether there is convincing evidence that the frequency of reading the news is associated with the confidence a person has in the military.

Based on this question, the analysis uses the following variables.

news: A categorical variable indicating how often the respondent reads the newspaper.
conarmy: A categorical variable indicating the confidence level the respondent has in the military.

To capture more recent times, we limit the study to the period from 2001 (the year of the 9/11 attacks) to the latest year in the data (2012). The filtered table is stored in a new object, gss_2001.

# Select years from 2001 onward and drop rows with NA in the variables of interest
gss %>%
  filter(year >= 2001 & 
           !is.na(news) &
           !is.na(conarmy))  %>%
  select(news, year, conarmy) -> gss_2001

head(gss_2001)

##          news year      conarmy
## 1 Once A Week 2002 A Great Deal
## 2    Everyday 2002    Only Some
## 3    Everyday 2002    Only Some
## 4       Never 2002   Hardly Any
## 5    Everyday 2002 A Great Deal
## 6       Never 2002 A Great Deal

Let’s check the number of observations in the new data.

dim(gss_2001)

## [1] 3920    3

Part 3: Exploratory data analysis

Performing some exploratory data analysis before inference.

Let’s view the observations in the news column.

table(gss_2001$news)

## 
##          Everyday  Few Times A Week       Once A Week Less Than Once Wk 
##              1339               795               544               611 
##             Never 
##               631

The largest group in the table above reads the newspaper every day, followed by those who read it a few times a week.

Next we view the observations in the conarmy (confidence in military) column

table(gss_2001$conarmy)

## 
## A Great Deal    Only Some   Hardly Any 
##         1981         1516          423

Again, from the table above we observe that the largest group has a great deal of confidence in the military.

table(gss_2001$news, gss_2001$conarmy)

##                    
##                     A Great Deal Only Some Hardly Any
##   Everyday                   683       511        145
##   Few Times A Week           411       289         95
##   Once A Week                278       210         56
##   Less Than Once Wk          280       268         63
##   Never                      329       238         64
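
Raw counts are hard to compare because the reader groups differ in size. As a minimal sketch, assuming the counts from the table above are entered by hand into a matrix (the names observed and row_props are introduced here only for illustration), each row can be converted to proportions so that the distribution of confidence is comparable across groups:

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(683, 511, 145,
    411, 289,  95,
    278, 210,  56,
    280, 268,  63,
    329, 238,  64),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news = c("Everyday", "Few Times A Week", "Once A Week",
             "Less Than Once Wk", "Never"),
    conarmy = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_props <- prop.table(observed, margin = 1)
round(row_props, 3)
```

With the data frame loaded, the same proportions should be obtainable directly with prop.table(table(gss_2001$news, gss_2001$conarmy), margin = 1).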

First, we make a mosaic plot, which illustrates the relationship between the two categorical variables of interest.

plot(table(gss_2001$news,gss_2001$conarmy), color = TRUE, main = "Newspaper readers and Confidence in the Military")

To interpret the plot above, we note that:

The area of each tile is proportional to the count in that cell.
When the tiles split in the same proportions across groups, it indicates independence between the variables.

Observation: From the plot above, the proportion of respondents with A Great Deal of confidence is roughly constant across all groups except those who read newspapers less than once a week.

Next, we make a bar-plot to give us a better picture of each group of newspaper readers.

# Grouped bar chart: one bar per confidence level within each news group
g <- ggplot(data = gss_2001, aes(x = news))
g <- g + geom_bar(aes(fill = conarmy), position = "dodge")
g + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observations:

Respondents with a great deal of confidence form the largest group in every category of news readers. For respondents who read newspapers less than once a week, the counts for A Great Deal (280) and Only Some (268) are very similar.
Respondents with hardly any confidence are the smallest group in every category of news readers, including those who never read the newspaper (64 of 631).

Part 4: Inference

4.1 State Hypothesis

Null hypothesis (H0): The frequency of reading newspapers and the confidence in the military are independent.
Alternative hypothesis (HA): The frequency of reading newspapers and the confidence in the military are dependent.

4.2 Check Conditions

Independence: The GSS data come from a random sample survey, so we can assume that the observations are independent of one another. Each respondent also contributes to exactly one cell of the table.
Sample Size: The 3920 observations used for this study are far less than 10% of the US population, so this condition is satisfied.
Degrees of Freedom: We have 3 confidence levels and 5 news reading frequency levels, giving (5 - 1) × (3 - 1) = 8 degrees of freedom. As both categorical variables have more than 2 levels, we use the chi-squared test of independence to test the hypothesis.
Expected Counts: To perform a chi-squared test (goodness of fit or independence), the expected count for each cell should be at least 5.

chisq.test(gss_2001$news,gss_2001$conarmy)$expected

##                    gss_2001$conarmy
## gss_2001$news       A Great Deal Only Some Hardly Any
##   Everyday              676.6732  517.8378  144.48903
##   Few Times A Week      401.7589  307.4541   85.78699
##   Once A Week           274.9143  210.3837   58.70204
##   Less Than Once Wk     308.7732  236.2949   65.93189
##   Never                 318.8804  244.0296   68.09005

From the above table, all cells have an expected count well above 5. Therefore, we can proceed to perform the chi-squared test of independence, first with chisq.test and then with the inference function from statsr.
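
The expected counts reported by chisq.test can be reproduced by hand. Under independence, the expected count of a cell is its row total times its column total, divided by the overall sample size. The sketch below assumes the observed counts are typed in from the contingency table in Part 3; the names observed, n and expected are illustrative and not part of the original analysis.

```r
# Observed counts copied from the contingency table in Part 3
observed <- matrix(
  c(683, 511, 145,
    411, 289,  95,
    278, 210,  56,
    280, 268,  63,
    329, 238,  64),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news = c("Everyday", "Few Times A Week", "Once A Week",
             "Less Than Once Wk", "Never"),
    conarmy = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

n <- sum(observed)

# Expected count under independence: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n
round(expected, 4)

# Condition check: every expected count should be at least 5
all(expected >= 5)
```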

Chi-squared test of Independence

chisq.test(gss_2001$news,gss_2001$conarmy)

## 
##  Pearson's Chi-squared test
## 
## data:  gss_2001$news and gss_2001$conarmy
## X-squared = 10.402, df = 8, p-value = 0.2379

The chi-squared statistic is 10.402 and the corresponding p-value for 8 degrees of freedom is 0.2379, which is much greater than the significance level of 0.05.
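
As a cross-check on the reported statistic, the test can be computed step by step. This is a sketch under the assumption that the observed counts are entered manually from the contingency table in Part 3; the variable names are illustrative. The statistic sums (observed - expected)^2 / expected over all cells, and the p-value is the upper tail of the chi-squared distribution with (rows - 1) × (columns - 1) degrees of freedom.

```r
# Observed counts copied from the contingency table in Part 3
# (rows: news reading frequency, columns: confidence in the military)
observed <- matrix(
  c(683, 511, 145,
    411, 289,  95,
    278, 210,  56,
    280, 268,  63,
    329, 238,  64),
  nrow = 5, byrow = TRUE
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic: sum of squared deviations scaled by expected counts
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```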

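# Same test via statsr; null is ignored for a chi-squared test of independence (see the warning below)
inference(y = conarmy, x = news, data = gss_2001, statistic = "proportion",
          type = "ht", null = 0, alternative = "greater", method = "theoretical",
          success = "High")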

## Warning: Ignoring null value since it's undefined for chi-square test of
## independence

## Response variable: categorical (3 levels) 
## Explanatory variable: categorical (5 levels) 
## Observed:
##                    y
## x                   A Great Deal Only Some Hardly Any
##   Everyday                   683       511        145
##   Few Times A Week           411       289         95
##   Once A Week                278       210         56
##   Less Than Once Wk          280       268         63
##   Never                      329       238         64
## 
## Expected:
##                    y
## x                   A Great Deal Only Some Hardly Any
##   Everyday              676.6732  517.8378  144.48903
##   Few Times A Week      401.7589  307.4541   85.78699
##   Once A Week           274.9143  210.3837   58.70204
##   Less Than Once Wk     308.7732  236.2949   65.93189
##   Never                 318.8804  244.0296   68.09005
## 
## H0: news and conarmy are independent
## HA: news and conarmy are dependent
## chi_sq = 10.4021, df = 8, p_value = 0.2379

Result Interpretation

Since the p-value is above the significance level of 5% (p > 0.05), we fail to reject the null hypothesis (H0). In the context of the research question, the data do not provide convincing evidence that the frequency of reading the newspaper and a person's confidence in the military are associated.

Failing to reject H0 does not prove that the two variables are independent; it only means that this sample shows no convincing evidence of dependence. The study is also observational, so even a significant result would have established an association rather than a causal link between the two variables.