Setup

Load packages and data

suppressWarnings(library(ggplot2))
suppressWarnings(library(dplyr))
library(statsr)
library(reshape2)
load("gss.Rdata")

Part 1: About the Data

The General Social Survey (GSS) has been conducting surveys since 1972 in the USA. The survey asks a broad range of questions, from social to political to religious and economic topics. As a result, the survey has been widely used in thousands of research publications by social scientists over the years and findings have been covered by most major media publications. [Recent example: https://www.washingtonpost.com/news/monkey-cage/wp/2017/03/09/the-40-year-decline-in-the-tolerance-of-college-students-graphed/?utm_term]

Adult participants (18+) are randomly sampled from households across the USA, so adults living in households are the target population. People living in institutions are not represented. The sampling covers different geographical locations as well as urban, suburban and rural areas.

The survey is strictly voluntary, hence sections of the population unwilling to respond (for example, because they are busy) are likely to be under-represented. Aside from this small source of bias, the random sample can be assumed to be representative, and inferences from this data can be generalized to the target population described above.

Random assignment (e.g. of income) is not possible in this kind of survey, so the data are observational. Hence no causal inferences can be made.


Part 2: Research question

The credibility of the news media and the Press has come under heavy fire recently. It seems that people’s confidence in the Press is declining. It is worth investigating whether this is really the case.

The GSS survey asks this question measuring the participant’s confidence in press - “I am going to name some institutions in this country. As far as the people running these institutions are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them? g. Press.”

Thus the possible responses are “A Great Deal”, “Only Some” and “Hardly Any”. The same question has been asked in almost every survey year; in the data used here, only 1972 and 1985 have no responses at all.

It would be interesting to see if the proportions of the above three responses change over the years and if the changes are statistically significant.

Let’s phrase the question in more specific terms: “Is public confidence in the institution of the Press dependent on the year considered? Or is it independent of year, with the observed differences due only to sampling variability?” We will compare the years 1990 and 2010 to answer this question.

First, we will explore the data (Part 3) and then carry out statistical inference (Part 4).


Part 3: Exploratory data analysis

The variable in the data measuring people’s confidence in press is “conpress”.

## summary of "conpress"
summary(gss$conpress)
## A Great Deal    Only Some   Hardly Any         NA's 
##         6128        20346        11465        19122

About a third of the observations (19122 of 57061) are NA, so we have to handle them carefully. A natural question is whether to include the NA’s when calculating the proportion of each response. Let’s include them in our analysis for now.
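As a quick check on the share of missing values, the counts printed by summary() above can be plugged in directly. This is only a minimal sketch: the vector name `counts` is introduced here for illustration and is not part of the original analysis.

```r
# Counts copied from the summary(gss$conpress) output above
counts <- c("A Great Deal" = 6128, "Only Some" = 20346,
            "Hardly Any" = 11465, "NA" = 19122)

# Share of each response among all observations, NA included
round(counts / sum(counts), 3)
```

The last entry is the NA share, which comes out at about 0.335.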

## Select only year and conpress variables. We don't need remaining variables.
## Filter out 1972 and 1985 years, as all responses are NA  for these years.
sub.data <- gss %>%
  filter(year != 1972 & year !=1985) %>%
  select(year, conpress)

## Tabulate count by year with 'table' command. Display first 5 rows.
year.count = table(sub.data$year,sub.data$conpress, useNA = 'ifany')
year.count[1:5,]
##       
##        A Great Deal Only Some Hardly Any <NA>
##   1973          346       911        220   27
##   1974          383       821        259   21
##   1975          354       823        265   48
##   1976          424       776        263   36
##   1977          383       874        236   37
## Get proportions by year in the table - "year.props" . Display first 5 rows.
year.props = prop.table(year.count,1)
year.props[1:5,]
##       
##        A Great Deal  Only Some Hardly Any       <NA>
##   1973   0.23005319 0.60571809 0.14627660 0.01795213
##   1974   0.25808625 0.55323450 0.17452830 0.01415094
##   1975   0.23758389 0.55234899 0.17785235 0.03221477
##   1976   0.28285524 0.51767845 0.17545030 0.02401601
##   1977   0.25032680 0.57124183 0.15424837 0.02418301

We tabulate the counts and the proportions for the different levels of confidence in the Press (“conpress”) by year in two separate tables, as seen above. Only the first 5 years, 1973-1977, are displayed. The proportion of NA is relatively small for these years.

Let’s plot the proportions vs year in one plot.

## Convert "year.props" to long format using the "melt" command from the reshape2 package.
## This makes it easier to plot: we don't have to repeat the geom_line command 4 times.
props_long <- melt(year.props, id='year')
colnames(props_long) <- c('Year','Confidence_in_Press','Proportion')
ggplot(data=props_long,aes(x=Year,y=Proportion,colour=Confidence_in_Press))+geom_line()

We see a general trend of decreasing proportions for ‘A Great Deal’ (red) and ‘Only Some’ (green) over the years, while we see an overall increase in ‘Hardly Any’. This agrees with the hypothesis that public confidence in the institution of Press is decreasing in general.

However, we see erratic changes in the proportion of ‘NA’ over the years (black line). One reason is that in some years the survey question was asked only to a subset of participants, so these years show a high proportion of NA’s. A second reason is that the breakdown of NA was not the same for all years. NA’s can be divided into ‘Don’t Know’, ‘Inapplicable’ and ‘Non-response’, but in some years ‘Inapplicable’ was not given as an option.

These erratic changes in the proportion of NA distort the proportions of the other levels. For example, the dip in the red, green and blue lines between 2000 and 2008 in the figure above is probably not a real dip, but just an effect of the jump in NA (black line) in those years. Thus we cannot reasonably compare these proportions between years.
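The effect of NA on the other proportions can be seen with the 1973 row from the first table above. The following is a small illustrative sketch; the names `n_1973`, `with_na` and `without_na` are hypothetical and do not appear in the original analysis.

```r
# 1973 counts copied from the first table above
n_1973 <- c("A Great Deal" = 346, "Only Some" = 911,
            "Hardly Any" = 220, "NA" = 27)

# Proportion of 'A Great Deal' with NA counted in the denominator
with_na <- unname(n_1973["A Great Deal"]) / sum(n_1973)

# Same proportion among respondents who actually answered
answered <- n_1973[names(n_1973) != "NA"]
without_na <- unname(n_1973["A Great Deal"]) / sum(answered)

round(c(with_na = with_na, without_na = without_na), 4)
```

Including NA in the denominator scales every other proportion by (1 - proportion of NA). The gap is small in 1973, when under 2% of the responses are NA, but it grows with the NA share, which is why years with many NA’s appear to dip.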

The proper way to deal with NA’s is to use weights. Four different weights have to be applied. These weights are available in the cumulative GSS file. However, they are not present in the subset data file provided, and applying weights is outside the scope of this report. [For info on using weights: https://gssdataexplorer.norc.org/documents/441/display]

We will instead discard all NA values and calculate proportions without them. Also, we will compare the proportions between the years 1990 and 2010, which show a similar proportion of NA in the above graph. This is not ideal, but the graph obtained in this way agrees very well with the one obtained using weights, which can be seen here: [https://gssdataexplorer.norc.org/trends/Politics?measure=conpress].

year.count = table(sub.data$year,sub.data$conpress) ##don't include NA like before
year.props = prop.table(year.count,1) 
year.props[1:3,]  ##display first 3 years
##       
##        A Great Deal Only Some Hardly Any
##   1973    0.2342586 0.6167908  0.1489506
##   1974    0.2617908 0.5611757  0.1770335
##   1975    0.2454924 0.5707351  0.1837725
props_long <- melt(year.props, id='year')  ##convert to long
colnames(props_long) <- c('Year','Confidence_in_Press','Proportion')
ggplot(data=props_long,aes(x=Year,y=Proportion,colour=Confidence_in_Press))+geom_line()

This graph doesn’t show the erratic behavior of proportions over the years that the previous one did. The lines are smoother, and the trend of decreasing public confidence in the Press is displayed more clearly.


Part 4: Inference

We consider two levels of year (1990 and 2010) and three levels of Confidence in Press (‘A Great Deal’, ‘Only Some’ and ‘Hardly Any’) for our hypothesis test. Thus we have two categorical variables, one with 2 levels and the other with 3.

We will use a chi-square test of independence to test the independence of these variables. Note that we cannot use the usual Z-test for a difference of two proportions here: one of the categorical variables has more than two levels, so there is no single parameter to estimate.

H0: Confidence in Press and year are independent variables or Confidence in Press does not vary with year.

HA: Confidence in Press and year are dependent variables or Confidence in Press varies with year.

We need a contingency table to carry out the chi-square test.

our.table <- year.count[c("1990","2010"),] ## include only 1990 and 2010 (rows selected by name, not position)
addmargins(our.table)              ##Add margins with summations
##       
##        A Great Deal Only Some Hardly Any  Sum
##   1990          132       516        219  867
##   2010          140       621        594 1355
##   Sum           272      1137        813 2222

Conditions to check before doing a chi-square test:

  1. Independence: The observations are independent because:

1) random sampling is used.

2) the sample size, 2222, is less than 10% of the population (non-institutionalized adults in the US).

3) it is highly improbable that the same person participated in both the 1990 and the 2010 survey.

  2. Sample size: The expected count must be at least 5 in each of the six cells of the above contingency table. The expected counts are obtained by assuming that the null hypothesis is true. They can be calculated directly from the contingency table as (row sum x column sum)/(table sum).

Let’s find the expected counts using a built-in method.

expected.table <- chisq.test(our.table)$expected
expected.table
##       
##        A Great Deal Only Some Hardly Any
##   1990     106.1314  443.6449   317.2237
##   2010     165.8686  693.3551   495.7763

The expected counts are indeed more than 5 in all six cases.
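As a cross-check, the formula (row sum x column sum)/(table sum) can also be applied by hand. The sketch below rebuilds the contingency table from the counts printed above so that it runs on its own; the names `observed` and `expected_by_hand` are assumptions made for this illustration.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(132, 516, 219,
                     140, 621, 594),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("1990", "2010"),
                                   c("A Great Deal", "Only Some", "Hardly Any")))

# Expected count for each cell: (row sum x column sum) / table sum
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected_by_hand, 4)

# Sample-size condition: every expected count is at least 5
all(expected_by_hand >= 5)
```

The rounded values should agree with the table returned by chisq.test(our.table)$expected.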

Both conditions, independence and sample size, are satisfied. So we can proceed with the chi-square test.

Chi-square test:

The test is simple to carry out. We calculate (O-E)^2/E for each cell (O is the observed count, E is the expected count) and then add up all six values. The sum is the test statistic.

##Calculate test-statistic for chi-square test
sum(((our.table-expected.table)^2)/(expected.table))
## [1] 79.56453

The degrees of freedom are (number of rows - 1) x (number of columns - 1), which is (2-1) x (3-1) = 2 here. The p-value is the area under a chi-square distribution with 2 degrees of freedom to the right of the test statistic we just found.

##Calculate p-value
pchisq(q = 79.56453, df = 2, lower.tail = FALSE)
## [1] 5.281799e-18
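With exactly 2 degrees of freedom, the chi-square distribution coincides with an exponential distribution with mean 2, so the upper-tail area has the closed form exp(-x/2). The short sketch below uses this as an independent check of the p-value; the variable names `x` and `p_closed_form` are introduced here only for illustration.

```r
# Test statistic copied from the output above
x <- 79.56453

# For 2 degrees of freedom the upper-tail area is exp(-x / 2)
p_closed_form <- exp(-x / 2)
p_closed_form

# Compare with the value returned by pchisq()
all.equal(p_closed_form, pchisq(x, df = 2, lower.tail = FALSE))
```

This shortcut applies only when the degrees of freedom equal 2; for other tables pchisq() is the general tool.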

We can also do the chi-square test using a built-in command.

## chi-square test using built in command
chisq.test(our.table)
## 
##  Pearson's Chi-squared test
## 
## data:  our.table
## X-squared = 79.565, df = 2, p-value < 2.2e-16

The test statistic and degrees of freedom match what we calculated before, and the reported bound (p-value < 2.2e-16) is consistent with our value of 5.28e-18.

Interpretation and conclusion:

The test statistic (79.56) is very large for 2 degrees of freedom, so the p-value is extremely small.

This p-value is the conditional probability of obtaining sample proportions at least as different as those we observed for 1990 and 2010, given that confidence in the Press is independent of year. This probability is almost zero here.

The differences in proportions between the years 1990 and 2010 are too large to be explained by sampling variability alone. Thus we reject the null hypothesis and conclude that Confidence in Press does vary with year.
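To make the size of these differences concrete, the row proportions for the two years can be computed from the contingency table. This is a minimal sketch that re-enters the observed counts so it runs on its own; `observed` and `row_props` are hypothetical names used only here.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(132, 516, 219,
                     140, 621, 594),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("1990", "2010"),
                                   c("A Great Deal", "Only Some", "Hardly Any")))

# Proportion of each response within each year (each row sums to 1)
row_props <- prop.table(observed, margin = 1)
round(row_props, 3)

# Change from 1990 to 2010 in percentage points
round(100 * (row_props["2010", ] - row_props["1990", ]), 1)
```

In these counts, the share of ‘Hardly Any’ rises by roughly 19 percentage points between 1990 and 2010, while the shares of ‘A Great Deal’ and ‘Only Some’ both fall.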

Although we already saw a clear trend in our plot, the hypothesis test gives strong statistical evidence that public confidence in the Press differed between 1990 and 2010.