library(ggplot2)
library(dplyr)
library(statsr)
load("./gss.Rdata")

As mentioned in the Codebook, the data are observational, collected by computer-assisted personal interview (CAPI), face-to-face interview, and telephone interview. No random assignment to experimental conditions takes place in the sampling. Hence, the research question below can support generalization to the population but not causal conclusions.
Does there appear to be a relationship between political party affiliation and region of interview?
# subset political party affiliation as 'partyid' and region of interview as 'region'
data <- gss[, c('partyid', 'region')]
table(data)
##                     region
## partyid              New England Middle Atlantic E. Nor. Central
##   Strong Democrat            361            1292            1704
##   Not Str Democrat           463            1975            2011
##   Ind,Near Dem               479             976            1342
##   Independent                457            1272            1673
##   Ind,Near Rep               345             586             964
##   Not Str Republican         308            1458            1670
##   Strong Republican          208             668             983
##   Other Party                 41             143             162
##                     region
## partyid              W. Nor. Central South Atlantic E. Sou. Central
##   Strong Democrat                600           2026             709
##   Not Str Democrat               796           2306             813
##   Ind,Near Dem                   543           1100             365
##   Independent                    601           1588             513
##   Ind,Near Rep                   406            895             326
##   Not Str Republican             763           1678             603
##   Strong Republican              442           1206             365
##   Other Party                     53            116              42
##                     region
## partyid              W. Sou. Central Mountain Pacific
##   Strong Democrat                936      401    1088
##   Not Str Democrat              1238      664    1774
##   Ind,Near Dem                   579      437     922
##   Independent                    810      487    1098
##   Ind,Near Rep                   489      338     572
##   Not Str Republican             708      598    1219
##   Strong Republican              496      419     761
##   Other Party                     83       63     158
# each cell has at least 5 cases
# make a mosaic plot to show the distribution of party affiliation by region
mosaicplot(~ region + partyid, data = data, main = 'Distribution of Political Party Affiliation and Region of Interview', color = TRUE)
# note: I couldn't find a way to stop the labels from overlapping, but this doesn't prevent comparing the proportions across cells.

As shown in the plot, the proportions differ across the levels of each variable. But before we conclude that the two variables are dependent, an inference test is needed. Since we have two categorical variables, a Chi-Square test of independence is used to test whether party affiliation and region of interview are independent.
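The proportion differences that the mosaic plot shows visually can also be read off as numbers. The following is a minimal sketch, assuming a small illustrative subset of the table above (two parties, three regions) typed in by hand; the name `party_by_region` is made up for this example, and with the full data `table(data)` would take the place of `tab`.

```r
# Illustrative subset of the observed table (counts copied from the output above).
# With the full data, use `tab <- table(data)` instead.
tab <- matrix(c(361, 401, 1088,
                208, 419,  761),
              nrow = 2, byrow = TRUE,
              dimnames = list(partyid = c("Strong Democrat", "Strong Republican"),
                              region  = c("New England", "Mountain", "Pacific")))

# Share of each party within a region: margin = 2 makes every column sum to 1
party_by_region <- prop.table(tab, margin = 2)
round(party_by_region, 3)
```

If the two variables were independent, the columns of `party_by_region` would be roughly equal; visibly different columns are what motivates the formal test.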
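Step 1. Check the conditions
Independence: GSS respondents are randomly sampled, the sample is far less than 10% of the population, and each respondent falls into exactly one cell of the table.
Sample size: each cell has at least 5 cases, as the table above shows.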
So the data meet the conditions for the Chi-Square test.
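Strictly speaking, the sample size condition is usually stated in terms of expected counts rather than observed counts. A hedged sketch of how that could be checked in code, again on a small hand-typed subset of the table rather than the full `gss` data (`expected` is a name chosen for this example):

```r
# Illustrative subset of the observed table (counts copied from the output above).
tab <- matrix(c(361, 401, 1088,
                208, 419,  761),
              nrow = 2, byrow = TRUE,
              dimnames = list(partyid = c("Strong Democrat", "Strong Republican"),
                              region  = c("New England", "Mountain", "Pacific")))

# chisq.test() returns the counts expected under independence in $expected
expected <- chisq.test(tab)$expected

min(expected)        # smallest expected cell count
all(expected >= 5)   # TRUE when the sample size condition holds
```

The same two lines applied to the full 8 x 9 table would confirm the condition for the analysis itself.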
Step 2. Build the Null and Alternative Hypotheses
H0: Political affiliation and region of interview are independent. Political affiliation does not vary by region of interview.
HA: Political affiliation and region of interview are dependent. Political affiliation does vary by region of interview.
Step 3. Implement the Chi-Square Independence Test
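The quantities computed in this step follow the standard formulas for the test of independence. The notation is introduced here for illustration: $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ is the count expected under H0, and $R$ and $C$ are the numbers of rows and columns.

```latex
\[
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{table total}},
\qquad
\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (R - 1)(C - 1)
\]

% Worked example for one cell (Strong Democrat, New England),
% using totals summed from the table above:
\[
E_{11} = \frac{9117 \times 2662}{56734} \approx 427.8,
\qquad
\frac{(361 - 427.8)^2}{427.8} \approx 10.4
\]
```

With $R = 8$ party categories and $C = 9$ regions, $df = 7 \times 8 = 56$.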
# transform table into dataframe with NA removed
matrix.obs <- as.data.frame.matrix(table(data))
# calculate rowsum and colsum
rowtotal <- rowSums(matrix.obs)
coltotal <- colSums(matrix.obs)
# record the total value
total <- sum(rowtotal)
# calculate expected matrix
matrix.exp <- (rowtotal / total) %*% t(coltotal)
# calculate X2 and degrees of freedom
X2 <- sum((matrix.obs - matrix.exp) ^ 2 / matrix.exp)
df <- (dim(matrix.obs)[1] - 1) * (dim(matrix.obs)[2] - 1)
# calculate the p-value
pchisq(X2, df, lower.tail = FALSE)
## [1] 4.778456e-121
# or
chisq.test(matrix.obs)
##
## Pearson's Chi-squared test
##
## data: matrix.obs
## X-squared = 744.81, df = 56, p-value < 2.2e-16
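To see that the manual computation and `chisq.test()` are the same calculation, the two can be compared directly. This is a minimal sketch on a small hand-typed subset of the table, not the full analysis; it uses `outer()` to build the expected counts as an alternative to the matrix product above, and the names `expected`, `p_manual`, and `builtin` are chosen for this example.

```r
# Illustrative subset of the observed table (counts copied from the output above).
tab <- matrix(c(361, 401, 1088,
                208, 419,  761),
              nrow = 2, byrow = TRUE,
              dimnames = list(partyid = c("Strong Democrat", "Strong Republican"),
                              region  = c("New England", "Mountain", "Pacific")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Test statistic, degrees of freedom, and upper-tail p-value
X2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_manual <- pchisq(X2, df, lower.tail = FALSE)

# Built-in test for comparison
builtin <- chisq.test(tab)
all.equal(X2, unname(builtin$statistic))
all.equal(p_manual, builtin$p.value)
```

Both comparisons should return `TRUE`, since `chisq.test()` applies no continuity correction to tables larger than 2 x 2.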
Conclusion: The p-value is extremely small compared with the 5% significance level, so we reject H0 in favor of HA. That is, the data provide strong evidence of an association between region of interview and political party affiliation.