Statistical Inference with the GSS Data

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.1

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.1

library(statsr)
library(lattice)

Load data

load("gss.Rdata")

* * *

Part 1: Data (3 points)

The data for this analysis come from the General Social Survey (GSS), which the National Opinion Research Center (NORC) has conducted since 1972 (every other year since 1994). The project is ongoing, with the purpose of “monitoring societal change and studying the growing complexity of the American society” (NORC). Ultimately, the survey gathers information on the attitudes, behaviors, and attributes of contemporary American society and allows these responses to be compared with those of other nations. More information is available on the NORC website: http://www.norc.org/Research/Projects/Pages/general-social-survey.aspx.

Methodology: According to the GSS Codebook (http://gss.norc.org/get-documentation), the sample is not a simple random sample of the population. Instead, respondents were drawn in stages from metropolitan and non-metropolitan areas: Block Groups were selected at random, and interviewers then canvassed blocks within them. This is a form of multistage cluster sampling rather than blocking, which is an experimental-design technique. Although a simple random sample was not used, the study uses weighting techniques to adjust for this. The codebook further states that “the GSS samples closely resemble distributions reported in the Census and other authoritative sources.” Thus, while not a simple random sample, the sample is randomly chosen and representative enough for the results to be generalized to the U.S. adult population.

Generalizability: Given this methodology, the results are generalizable, but some limitations are worth noting. Because the data were collected through in-person interviews, the responses may not be fully accurate: respondents may answer certain questions differently depending on how comfortable they feel with the interviewer. In addition, the interview format itself may exclude individuals who do not have enough time to participate in a long in-person interview.

Correlation/Causation: Since this data is observational with no random assignment of the subjects to specific groups, the data CANNOT be used to evaluate causation. Rather, it can only be used as observational data to measure relationships and correlations between variables.

* * *

Part 2: Research question

Research Question: Is there a relationship between the amount of media consumption (news) and the confidence in the executive branch of the federal government? What about confidence in the press?

“You’re fake news” (a phrase Donald Trump has used on multiple occasions). The current President of the United States has greatly antagonized the media, even calling it the “greatest enemy” of the U.S., and those who support Trump seem to share his distrust of the press. This question is interesting because it asks whether there is any relationship between the amount of media consumption (news) and confidence in the executive branch of the federal government. It would also be interesting to see whether news consumption is associated with confidence in the press. While the most recent data (from 2016 to 2018) are unavailable, this analysis uses data from 2008 onwards (the election year of the previous administration) to look for such associations over roughly the past decade.

* * *

Part 3: Exploratory data analysis

Plots:

dim(gss)

## [1] 57061   114

Initially, the GSS data set contains over 57,000 responses and 114 variables. For this analysis, we keep only the responses from 2008 onwards and three variables: news (news consumption), confed (confidence in the executive branch of the federal government), and conpress (confidence in the press).

First, I will filter the data to see whether there’s an association between news and confed.

# Keep 2008 onwards, drop missing values, and keep the two variables of interest
gss_newsfed <- gss %>%
  filter(year >= 2008,
         !is.na(news),
         !is.na(confed)) %>%
  select(news, confed)

dim(gss_newsfed)

## [1] 2047    2

After filtering, the data has been reduced significantly, to 2,047 observations.

Next, I will filter the data to see whether there is an association between news and confidence in the press.
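To make the effect of each filter condition concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are invented for illustration and are not GSS data; only the column names mirror the ones used above.

```r
library(dplyr)

# Hypothetical toy data (NOT from the GSS): five respondents with some missing answers
toy <- data.frame(
  year   = c(2006, 2008, 2010, 2012, 2012),
  news   = factor(c("Everyday", "Never", NA, "Once A Week", "Everyday")),
  confed = factor(c("Only Some", NA, "Hardly Any", "A Great Deal", "Only Some"))
)

# Same pattern as above: keep 2008 onwards and rows where both answers are present
toy_complete <- toy %>%
  filter(year >= 2008, !is.na(news), !is.na(confed)) %>%
  select(news, confed)

nrow(toy)           # rows before filtering
nrow(toy_complete)  # rows after filtering
```

Row 1 is dropped by the year condition, rows 2 and 3 by the missing-value conditions, so only the last two rows survive. The same logic explains why the GSS subset is so much smaller than the full data set.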

# Same filter as before, this time for confidence in the press
gss_newspress <- gss %>%
  filter(year >= 2008,
         !is.na(news),
         !is.na(conpress)) %>%
  select(news, conpress)

dim(gss_newspress)

## [1] 2063    2

Again, the data has been reduced significantly, to 2,063 observations.

Taking this filtered data, we’ll look at the data in table form.

First: for news consumption

table(gss_newsfed$news, useNA = "ifany")

## 
##          Everyday  Few Times A Week       Once A Week Less Than Once Wk 
##               650               386               293               321 
##             Never 
##               397

Based on the results, “Everyday” is the most common response (650 of 2,047 respondents, about 32%), although it does not account for a majority of respondents.
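Raw counts are easier to compare as percentages. A minimal sketch, re-entering the counts from the table above by hand (the vector name `news_counts` is my own, not part of the GSS data):

```r
# Counts copied from the printed table of gss_newsfed$news
news_counts <- c(
  "Everyday"          = 650,
  "Few Times A Week"  = 386,
  "Once A Week"       = 293,
  "Less Than Once Wk" = 321,
  "Never"             = 397
)

# Share of each response, in percent, rounded to one decimal place
news_pct <- round(100 * prop.table(news_counts), 1)
news_pct
```

With the actual data loaded, the same percentages should come from `prop.table(table(gss_newsfed$news))`.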

Then we’ll see how confident these respondents feel about the federal government.

table(gss_newsfed$confed, useNA = "ifany")

## 
## A Great Deal    Only Some   Hardly Any 
##          255          966          826

Just looking at the numbers, we can see that very few respondents have “a great deal” of confidence in the executive branch of the federal government and most have only some or hardly any.

Next, we’ll look at the same two tables for the subset used to compare news consumption and confidence in the press:

table(gss_newspress$news, useNA = "ifany")

## 
##          Everyday  Few Times A Week       Once A Week Less Than Once Wk 
##               654               390               297               324 
##             Never 
##               398

Again, “Everyday” is the most common response (654 of 2,063 respondents, about 32%).

table(gss_newspress$conpress, useNA = "ifany")

## 
## A Great Deal    Only Some   Hardly Any 
##          183          894          986

Yet, very few have “a great deal” of confidence in the press, with a majority of the respondents expressing only some or hardly any confidence.

+++++++++++++

Now we tabulate and plot confed against news:

table(gss_newsfed$news,gss_newsfed$confed)

##                    
##                     A Great Deal Only Some Hardly Any
##   Everyday                    92       276        282
##   Few Times A Week            56       194        136
##   Once A Week                 37       151        105
##   Less Than Once Wk           32       170        119
##   Never                       38       175        184

g <- ggplot(data = gss_newsfed, aes(x=news))
g <- g + geom_bar(aes(fill=confed), position = "dodge")
g + theme(axis.text.x = element_text(angle=60,hjust = 1))

Observations on news and confed:

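In absolute counts, respondents who read the news every day are the largest group with “Hardly Any” confidence in the executive branch of the federal government (282), but they are also the largest group overall, so proportions are more informative than counts.
Interestingly, “Hardly Any” is the most common answer only among respondents who read the news every day (282 of 650, about 43%) or never read the news (184 of 397, about 46%); in the other three groups, “Only Some” is the most common answer.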
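Because the news groups differ in size, it helps to compare them by row percentages rather than counts. The sketch below re-enters the two-way table printed above by hand (the object names are my own) and computes, for each level of news, the share of each confidence level.

```r
# Observed counts copied from the printed news x confed table
news_confed <- matrix(
  c(92, 276, 282,
    56, 194, 136,
    37, 151, 105,
    32, 170, 119,
    38, 175, 184),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news   = c("Everyday", "Few Times A Week", "Once A Week",
               "Less Than Once Wk", "Never"),
    confed = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

# margin = 1 makes each row sum to 100%
row_pct <- round(100 * prop.table(news_confed, margin = 1), 1)
row_pct
```

With the actual data loaded, `prop.table(table(gss_newsfed$news, gss_newsfed$confed), margin = 1)` should give the same proportions.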

Another plot to use to compare the results of news-confed:

plot(table(gss_newsfed$news,gss_newsfed$confed))

The mosaic plot above shows that, proportionally, the share of respondents who have “A Great Deal” of confidence in the executive branch of the federal government generally decreases as news consumption decreases (from about 14% among those who read the news every day or a few times a week to about 10% among those who never read it).

+++++++++++++++++

Now let’s look at the plots for news consumption and confidence in the press:

g2 <- ggplot(data = gss_newspress, aes(x=news))
g2 <- g2 + geom_bar(aes(fill=conpress), position = "dodge")
g2 + theme(axis.text.x = element_text(angle=60,hjust = 1))

plot(table(gss_newspress$news,gss_newspress$conpress))

Observations on news and conpress:

1. As in the plots for confed and news, the proportion of respondents with “A Great Deal” of confidence in the press seems to decrease progressively as news consumption decreases.
2. Interestingly, the proportion of respondents with the least confidence (“Hardly Any”) seems to increase as news consumption decreases.

* * *

Part 4: Inference

For confidence in the executive branch of the federal government and news consumption:

Null hypothesis: News consumption and confidence in the executive branch of the federal government are independent.

Alternative hypothesis: News consumption and confidence in the executive branch of the federal government are associated.

+++++++++++++++

Independence: The GSS respondents are selected through random sampling, so we can assume the observations are independent of one another.

Sample Size: Sampling is done without replacement, and the 2,047 respondents analyzed here are well under 10% of the entire U.S. adult population.

Degrees of Freedom: There are 3 levels of confidence (hardly any, only some, a great deal) and 5 levels of news consumption (every day, a few times a week, once a week, less than once a week, never), which gives (5 - 1) × (3 - 1) = 8 degrees of freedom. Since there are two categorical variables, each with more than 2 levels, the chi-squared test of independence is the appropriate test for this hypothesis.

Expected Counts: To conduct a chi-square test (GOF or independence), the expected counts for each cell should be at least 5.

We can check this below:

chisq.test(gss_newsfed$news,gss_newsfed$confed)$expected

##                    gss_newsfed$confed
## gss_newsfed$news    A Great Deal Only Some Hardly Any
##   Everyday              80.97215  306.7416   262.2863
##   Few Times A Week      48.08500  182.1573   155.7577
##   Once A Week           36.49976  138.2697   118.2306
##   Less Than Once Wk     39.98779  151.4831   129.5291
##   Never                 49.45530  187.3483   160.1964

Every cell in the table above has an expected count well above 5.
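For reference, under independence the expected count of a cell is (row total × column total) / n. The sketch below rebuilds the expected counts from the observed table printed in Part 3, re-entered by hand (the object names are my own), and should match what `chisq.test(...)$expected` reports.

```r
# Observed counts copied from the printed news x confed table
observed <- matrix(
  c(92, 276, 282,
    56, 194, 136,
    37, 151, 105,
    32, 170, 119,
    38, 175, 184),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    news   = c("Everyday", "Few Times A Week", "Once A Week",
               "Less Than Once Wk", "Never"),
    confed = c("A Great Deal", "Only Some", "Hardly Any")
  )
)

n <- sum(observed)

# Expected count for each cell: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n
round(expected, 3)

# Condition for the chi-square test: every expected count is at least 5
all(expected >= 5)
```

For example, the “Everyday” / “A Great Deal” cell is 650 × 255 / 2047, which is about 80.97, the same value as in the table above.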

We can now conduct the chi-square test:

chisq.test(gss_newsfed$news,gss_newsfed$confed)

## 
##  Pearson's Chi-squared test
## 
## data:  gss_newsfed$news and gss_newsfed$confed
## X-squared = 25.022, df = 8, p-value = 0.001541

The results are as follows: the X-squared statistic is 25.022 with 8 degrees of freedom, and the corresponding p-value is 0.001541, which is much lower than the significance level of 0.05.

Conclusion for confed - news: Since the p-value is below 0.05, we reject the null hypothesis in favor of the alternative hypothesis. The data provide convincing evidence that news consumption and confidence in the executive branch of the federal government are associated (not independent). Because the study is observational, only an association between the two variables can be inferred (NOT causation!).
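To show where these numbers come from, here is a minimal sketch that recomputes the statistic by hand as the sum of (observed - expected)^2 / expected over all cells. It uses the observed counts printed in Part 3, re-entered manually; the object names are my own.

```r
# Observed counts copied from the printed news x confed table
# (rows: five levels of news; columns: A Great Deal, Only Some, Hardly Any)
observed <- matrix(
  c(92, 276, 282,
    56, 194, 136,
    37, 151, 105,
    32, 170, 119,
    38, 175, 184),
  nrow = 5, byrow = TRUE
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(chi_sq, 3)      # 25.022
df                    # 8
signif(p_value, 4)    # 0.001541
```

These values agree with the `chisq.test()` output above.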

+++++++++++++++++++++++++++++

For confidence in the press and news consumption:

Null hypothesis: News consumption and confidence in the press are independent.

Alternative hypothesis: News consumption and confidence in the press are associated.

+++++++++++++++

Independence: The GSS respondents are selected through random sampling, so we can assume the observations are independent of one another.

Sample Size: Sampling is done without replacement, and the 2,063 respondents analyzed here are well under 10% of the entire U.S. adult population.

Expected Counts: To conduct a chi-square test (GOF or independence), the expected counts for each cell should be at least 5.

We can check this below:

chisq.test(gss_newspress$news,gss_newspress$conpress)$expected

##                    gss_newspress$conpress
## gss_newspress$news  A Great Deal Only Some Hardly Any
##   Everyday              58.01357  283.4106   312.5759
##   Few Times A Week      34.59525  169.0063   186.3984
##   Once A Week           26.34561  128.7048   141.9496
##   Less Than Once Wk     28.74067  140.4052   154.8541
##   Never                 35.30490  172.4731   190.2220

Every cell in the table above has an expected count well above 5.

We can now conduct the chi-square test:

chisq.test(gss_newspress$news,gss_newspress$conpress)

## 
##  Pearson's Chi-squared test
## 
## data:  gss_newspress$news and gss_newspress$conpress
## X-squared = 10.329, df = 8, p-value = 0.2427

The results are as follows: the X-squared statistic is 10.329 with 8 degrees of freedom, and the corresponding p-value is 0.2427, which is much higher than the significance level of 0.05.

Conclusion for conpress - news: Since the p-value is above 0.05, we fail to reject the null hypothesis. The data do not provide convincing evidence that news consumption and confidence in the press are associated. Note that this does not prove the two variables are independent; it only means that this sample gives no convincing evidence of an association.
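An equivalent way to read both test results is to compare each statistic with the critical value of the chi-square distribution at the 0.05 level. This is a sketch using the two statistics reported by `chisq.test()` above; the variable names are my own.

```r
alpha <- 0.05
df    <- (5 - 1) * (3 - 1)   # 5 levels of news, 3 levels of confidence

# Value the statistic must exceed to reject the null hypothesis at alpha
critical_value <- qchisq(1 - alpha, df = df)
round(critical_value, 3)     # 15.507

# Statistics copied from the two chisq.test() outputs above
stat_confed   <- 25.022
stat_conpress <- 10.329

stat_confed   > critical_value   # TRUE: reject the null hypothesis
stat_conpress > critical_value   # FALSE: fail to reject the null hypothesis
```

This gives the same decisions as comparing the p-values with 0.05.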

Yousie Kim

July 24, 2018
