Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("./gss.Rdata")

Part 1: Data

As mentioned in the codebook, the data are observational, collected by computer-assisted personal interview (CAPI), face-to-face interview, and telephone interview. No random assignment occurs in the sampling. Hence, findings for the research question below can be generalized to the population but cannot establish causality.

Part 2: Research question

Does there appear to be a relationship between political party affiliation and region of interview?

Part 3: Exploratory data analysis

# subset political party affiliation as 'partyid' and region of interview as 'region'
data <- gss[, c('partyid', 'region')]
table(data)

##                     region
## partyid              New England Middle Atlantic E. Nor. Central
##   Strong Democrat            361            1292            1704
##   Not Str Democrat           463            1975            2011
##   Ind,Near Dem               479             976            1342
##   Independent                457            1272            1673
##   Ind,Near Rep               345             586             964
##   Not Str Republican         308            1458            1670
##   Strong Republican          208             668             983
##   Other Party                 41             143             162
##                     region
## partyid              W. Nor. Central South Atlantic E. Sou. Central
##   Strong Democrat                600           2026             709
##   Not Str Democrat               796           2306             813
##   Ind,Near Dem                   543           1100             365
##   Independent                    601           1588             513
##   Ind,Near Rep                   406            895             326
##   Not Str Republican             763           1678             603
##   Strong Republican              442           1206             365
##   Other Party                     53            116              42
##                     region
## partyid              W. Sou. Central Mountain Pacific
##   Strong Democrat                936      401    1088
##   Not Str Democrat              1238      664    1774
##   Ind,Near Dem                   579      437     922
##   Independent                    810      487    1098
##   Ind,Near Rep                   489      338     572
##   Not Str Republican             708      598    1219
##   Strong Republican              496      419     761
##   Other Party                     83       63     158

# each cell has at least 5 cases

# make a mosaic plot to show the distribution of party affiliation by region
mosaicplot(~ region + partyid, data = data, main = 'Distribution of Political Party Affiliation and Region of Interview', color = TRUE)

# note: the axis labels overlap in this plot; passing las = 2 (rotated labels) or a smaller cex.axis to mosaicplot() may help. The overlap does not affect comparing the proportions between cells.

As shown in the plot, the proportions differ across the levels of each variable. Before concluding that the two variables are dependent, however, an inference test is needed. Since both variables are categorical, a Chi-Square test of independence is used to test whether party affiliation and region of interview are independent.
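The differences visible in the mosaic plot can also be read off as column proportions. The sketch below is only an illustration: it re-enters the counts for the first three regions from the table above by hand (the name `obs` and the restriction to three regions are choices made here, not part of the original analysis) and uses `prop.table()` to show the share of each party affiliation within a region.

```r
# Observed counts for three of the nine regions, copied from the table above
obs <- matrix(
  c(361, 1292, 1704,
    463, 1975, 2011,
    479,  976, 1342,
    457, 1272, 1673,
    345,  586,  964,
    308, 1458, 1670,
    208,  668,  983,
     41,  143,  162),
  nrow = 8, byrow = TRUE,
  dimnames = list(
    partyid = c("Strong Democrat", "Not Str Democrat", "Ind,Near Dem",
                "Independent", "Ind,Near Rep", "Not Str Republican",
                "Strong Republican", "Other Party"),
    region = c("New England", "Middle Atlantic", "E. Nor. Central")
  )
)

# margin = 2 divides each column by its total,
# giving the share of each party affiliation within a region
col.prop <- prop.table(obs, margin = 2)
round(col.prop, 3)
```

If the two variables were independent, the columns of `col.prop` would be roughly equal; with the full data, `prop.table(table(data), margin = 2)` should give the same view for all nine regions.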

Part 4: Inference

Step 1. Check the condition

Independence:

Random sample/assignment: respondents are sampled and interviewed independently of one another.
If sampling without replacement, n < 10% of population. The data set contains 57,061 observations, far fewer than 10% of the US population.
Each case only contributes to one cell in the table.

Sample size:

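Each cell must have at least 5 expected cases. The smallest observed count is 41 (Other Party, New England), and the expected counts under independence are similarly large, so this condition is met.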

So the conditions for the Chi-Square test are met.
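Strictly speaking, the sample-size condition concerns expected counts rather than observed ones. A minimal sketch of that check follows, assuming the counts printed in Part 3 are re-entered by hand; the names `obs` and `expected` are introduced here for illustration only.

```r
# Observed counts copied from the table in Part 3
# (rows: party affiliation, columns: the nine regions in printed order)
obs <- matrix(
  c(361, 1292, 1704, 600, 2026, 709,  936, 401, 1088,
    463, 1975, 2011, 796, 2306, 813, 1238, 664, 1774,
    479,  976, 1342, 543, 1100, 365,  579, 437,  922,
    457, 1272, 1673, 601, 1588, 513,  810, 487, 1098,
    345,  586,  964, 406,  895, 326,  489, 338,  572,
    308, 1458, 1670, 763, 1678, 603,  708, 598, 1219,
    208,  668,  983, 442, 1206, 365,  496, 419,  761,
     41,  143,  162,  53,  116,  42,   83,  63,  158),
  nrow = 8, byrow = TRUE
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

min(expected)        # smallest expected cell count
all(expected >= 5)   # TRUE if the sample-size condition holds
```

With the `gss` data loaded, `chisq.test(table(data))$expected` should return the same matrix without retyping the counts.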

Step 2. Build the Null and Alternative Hypothesis

H0: Political affiliation and region of interview are independent. Political affiliation does not vary by region of interview.

HA: Political affiliation and region of interview are dependent. Political affiliation does vary by region of interview.

Step 3. Implement the Chi-Square Independent Test
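For reference, the manual calculation in this step follows the standard formulas for the Chi-Square test of independence. The notation is introduced here for illustration: O_ij and E_ij are the observed and expected counts in row i and column j, R_i and C_j are the row and column totals, n is the grand total, and r = 8 and c = 9 are the numbers of party categories and regions.

```latex
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (r - 1)(c - 1) = (8 - 1)(9 - 1) = 56
```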

# convert the contingency table to a data frame (table() drops NA values by default)
matrix.obs <- as.data.frame.matrix(table(data))
# calculate rowsum and colsum
rowtotal <- rowSums(matrix.obs)
coltotal <- colSums(matrix.obs)
# record the total value
total <- sum(rowtotal)
# calculate expected matrix
matrix.exp <- (rowtotal / total) %*% t(coltotal)
# calculate X2 and the degrees of freedom
X2 <- sum((matrix.obs - matrix.exp) ^ 2 / matrix.exp)
df <- (dim(matrix.obs)[1] - 1) * (dim(matrix.obs)[2] - 1)
# calculate the p-value
pchisq(X2, df, lower.tail = FALSE)

## [1] 4.778456e-121

# or
chisq.test(matrix.obs)

## 
##  Pearson's Chi-squared test
## 
## data:  matrix.obs
## X-squared = 744.81, df = 56, p-value < 2.2e-16
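The two p-values above look different only because of how they are printed. As far as I know, R's printed test output reports any p-value below machine epsilon (`.Machine$double.eps`, about 2.2e-16) as `< 2.2e-16`, while `pchisq()` returns the numeric value itself. A small sketch using the rounded statistic from the output above (so the result will differ slightly from the unrounded manual calculation):

```r
# Values as reported in the chisq.test() output above (X-squared is rounded)
X2 <- 744.81
df <- 56

# Upper-tail probability of the Chi-Square distribution
p <- pchisq(X2, df, lower.tail = FALSE)
p

# TRUE here means the value falls below the threshold used when printing
p < .Machine$double.eps
```

Both approaches therefore agree: the p-value is far below any conventional significance level.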

Conclusion: the p-value is extremely small compared with the 5% significance level, so we reject H0 in favor of HA. That is, the data provide strong evidence of an association between region of interview and political party affiliation. Because the data are observational, this association does not imply a causal relationship.