Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(reshape2)
library(tidyr)
library(statsr)

Load data

The data file and the R Markdown file must be in the same directory. Once loaded, the data frame is called gss.

load("gss.Rdata")
dim(gss)

## [1] 57061   114

Part 1: Data

The GSS repeats questionnaire items and wording from year to year in order to facilitate time-trend studies. The results can be generalized because the GSS data were sampled from all noninstitutionalized, English and Spanish speaking persons 18 years of age or older living in the United States. However, since the study does not use random assignment, it is an observational study: we can only establish associations, not causation, because a controlled experiment was not part of the study design. The sample may also be biased even though respondents are selected at random, because participation is voluntary and people who do not wish to be surveyed are missed.

The data set includes several kinds of information:

Respondent Background Variables
Personal and Family Information
Attitudinal Measures

We study only a subset of this information:

dat = select(gss,year, age, sex, race, educ, degree, marital, partyid, polviews)
summary(dat)
dat = na.exclude(dat)

Part 2: Research question

Trump vs. Hillary, the 2016 presidential election, is one of the hottest shows of the year! So I want to investigate the political views of Americans to get some insight into this election. My research question is: what is the relationship between political views and some other variables? The sub-questions are listed below:

What is the distribution of political views in the population?
Is there an association between a respondent’s age and political views?
How does the proportion of people holding each view evolve over the years?

Part 3: Exploratory data analysis

First, let’s look at the political views variable.

table(gss$polviews)

## 
##     Extremely Liberal               Liberal      Slightly Liberal 
##                  1330                  5582                  6181 
##              Moderate Slightly Conservative          Conservative 
##                 18494                  7691                  7092 
##  Extrmly Conservative 
##                  1506

ggplot(filter(gss, !is.na(polviews)), aes(x = polviews)) + geom_bar(position = "stack", fill = "green") + ylab("number of respondents") + xlab("political views")

So the sample distribution of political views is roughly symmetric and unimodal, with Moderate as the most common category.

Next, let’s look at how a respondent’s age relates to political views.

ggplot(dat, aes(x = polviews, y = age)) + geom_boxplot() + ylab("age of respondent") + xlab("political views")

The boxplots suggest that conservative respondents tend to be older.

Now let’s see whether the proportions holding each view change from year to year.

dat1 = select(gss, year, polviews) %>% 
  na.exclude() %>%
  group_by(year) %>%
  summarise(
    extremely_liberal = table(polviews)[1] / length(polviews),
    liberal           = table(polviews)[2] / length(polviews),
    slightly_liberal  = table(polviews)[3] / length(polviews),
    moderate          = table(polviews)[4] / length(polviews),
    slightly_conser   = table(polviews)[5] / length(polviews),
    conser            = table(polviews)[6] / length(polviews),
    extremely_conser  = table(polviews)[7] / length(polviews)
  )

dat1 = melt(dat1, id = "year", measure.vars = names(dat1)[2:8])

ggplot(dat1, aes(x = year, y = value, group = variable, colour = variable)) + geom_line() + ylab("proportions holding different political views")

So the proportion of people holding each political view stays fairly stable over these years.
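The per-year proportions can also be computed without indexing into table() by position. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical stand-ins, not GSS data); it uses count() and a grouped mutate() from dplyr and produces long-format output directly, so no melt() step is needed.

```r
library(dplyr)

# Toy stand-in for the gss data frame (hypothetical values, not GSS data)
toy <- data.frame(
  year = c(2000, 2000, 2000, 2000, 2002, 2002, 2002, 2002),
  polviews = factor(
    c("Liberal", "Moderate", "Moderate", "Conservative",
      "Liberal", "Liberal", "Moderate", NA),
    levels = c("Liberal", "Moderate", "Conservative")
  )
)

props <- toy %>%
  filter(!is.na(polviews)) %>%    # drop respondents with no recorded view
  count(year, polviews) %>%       # one row per observed year/view combination
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%   # proportions sum to 1 within each year
  ungroup()

print(props)
```

Because the categories are referred to by name rather than by position, this version does not depend on the order of the factor levels.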

Part 4: Inference

The boxplots above suggest that age differs between the groups of people holding different political views. We now test this formally with a hypothesis test:

\[H_0: \mbox{There is no difference in mean age between the groups of people holding different political views}\]

\[H_1: \mbox{There is a difference in mean age between at least two of the groups}\]

Letting \(\mu_1, \ldots, \mu_7\) denote the mean ages of the seven groups, the hypotheses can be written as: \[H_0: \mu_1 = \cdots = \mu_7 \leftrightarrow H_1: \mbox{the means are not all equal}\]

We are testing whether mean age differs across more than two groups, so we can use analysis of variance. Under the usual assumptions (independent observations, approximately normal ages within each group, and roughly equal variances across groups), this is a one-way (single-factor) ANOVA.

ml = aov(age~polviews,data=gss)
summary(ml)

##                Df   Sum Sq Mean Sq F value Pr(>F)    
## polviews        6   234618   39103   131.4 <2e-16 ***
## Residuals   47722 14204374     298                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 9332 observations deleted due to missingness
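As a check on the table, the F statistic is the ratio of the between-group mean square to the residual mean square, where each mean square is a sum of squares divided by its degrees of freedom. Using the rounded values printed in the output:

\[F = \frac{234618 / 6}{14204374 / 47722} \approx \frac{39103}{297.65} \approx 131.4\]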

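Since the p-value is very small (below 2e-16), we reject the null hypothesis. Provided the assumptions are met, the data provide evidence that mean age differs between at least two of the political-view groups, that is, evidence of an association between age and political views.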
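A significant F test does not say which groups differ. One common follow-up, not part of the original analysis, is Tukey's honest significant difference procedure, available in base R as TukeyHSD(). The sketch below uses simulated data (the data frame sim, its three groups, and the chosen means are all hypothetical) purely to show the shape of the call and of the result.

```r
set.seed(1)

# Simulated stand-in: three groups with made-up mean ages (not GSS results)
sim <- data.frame(
  view = factor(rep(c("Liberal", "Moderate", "Conservative"), each = 50),
                levels = c("Liberal", "Moderate", "Conservative")),
  age  = c(rnorm(50, 42, 15), rnorm(50, 45, 15), rnorm(50, 50, 15))
)

fit <- aov(age ~ view, data = sim)

# All pairwise differences in group means, with family-wise 95% intervals
tukey <- TukeyHSD(fit, conf.level = 0.95)
print(tukey)
```

With seven groups, as in the GSS variable, the same call would report 21 pairwise comparisons.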

One of the key assumptions of the test is that age is approximately normally distributed within each group. We can check this with a normal QQ plot of age for each political-view group, shown here for the Liberal group.

dat = select(gss,age, polviews) %>%
  filter(polviews =="Liberal") %>%
  select(age)

qqnorm(dat$age)

But the normality assumption does not appear to be satisfied. Together with other doubts about the ANOVA assumptions, this means we need more information and analysis before concluding that the data are evidence of an association between age and political views.
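One possible piece of further analysis is a rank-based test such as Kruskal-Wallis, which compares groups without assuming normally distributed data. This is only a sketch of how such a check could look, run on simulated right-skewed ages (the data frame sim and its parameters are hypothetical, not GSS results); kruskal.test() is part of base R's stats package.

```r
set.seed(42)

# Simulated right-skewed ages for three made-up groups (not GSS data)
sim <- data.frame(
  view = factor(rep(c("Liberal", "Moderate", "Conservative"), each = 40)),
  age  = c(18 + rexp(40, 1 / 22), 18 + rexp(40, 1 / 25), 18 + rexp(40, 1 / 30))
)

# Kruskal-Wallis rank sum test: the null hypothesis is that all groups
# come from the same distribution
kw <- kruskal.test(age ~ view, data = sim)
print(kw)
```

On the real data the equivalent call would be kruskal.test(age ~ polviews, data = gss), and its result would need to be interpreted alongside the ANOVA rather than in place of it.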