If you have two factor variables with multiple levels each, you can use the chi-square test of independence to assess whether they are related (there is some association) or independent (there is no association).
setwd("~/Dropbox/Data General/GSS") #Set your working directory to whatever folder holds GSS.csv
options(scipen = 999) #Turn off scientific notation
options(warn = -1) #Turn off warnings
options(digits = 8) #Print up to 8 significant digits
x <- read.csv("GSS.csv", stringsAsFactors = TRUE) #Read text columns as factors (no longer the default as of R 4.0.0)
I'm interested in whether knowledge of the North American Free Trade Agreement (NAFTA) is associated with support for it, or whether the two are independent of each other (no relationship). The GSS dataset has two variables, both from 1996, that record people's answers to these two questions:
summary(x$nafta1) #What do you know about NAFTA?
##          a lot       not much nothing at all    quite a bit           NA's
##            324           1115            420            695          52533
summary(x$nafta2) #Will NAFTA benefit or not benefit the United States?
##                  benefits          does not benefit
##                       377                       270
##                 dont know have never heard of nafta
##                       581                       120
##                      NA's
##                     53739
The chi-square test of independence is executed with the function chisq.test(). We run the function on a simple frequency table of the two variables. The null hypothesis is that the two variables are independent, so a small p-value is evidence of a systematic relationship, while a large p-value means the data are consistent with independence (the observed variation could plausibly be random chance). A large p-value does not prove that the variables are independent.
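To make the null hypothesis concrete, here is a minimal sketch of the arithmetic behind the test. It assumes the cell counts are typed in by hand from the NAFTA cross-tabulation printed in this walkthrough, so it runs without GSS.csv. The names `observed`, `expected`, `chi_sq`, `deg_free`, and `p_value` are illustrative only. Under independence, the expected count in each cell is the row total times the column total divided by the grand total, and the statistic sums the squared gaps between observed and expected counts, each scaled by the expected count.

```r
# Observed counts, hard-coded from the contingency table in this walkthrough
observed <- matrix(
  c( 88,  68,  25,  0,
    111,  75, 359, 16,
      4,   3,  92, 99,
    173, 124, 104,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("a lot", "not much", "nothing at all", "quite a bit"),
    c("benefits", "does not benefit", "dont know",
      "have never heard of nafta")
  )
)

n <- sum(observed)

# Expected count per cell if the two variables were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic and its degrees of freedom
chi_sq   <- sum((observed - expected)^2 / expected)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of a statistic at least this large under the null
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

print(chi_sq)
print(deg_free)
print(p_value)
```

The hand-computed statistic and degrees of freedom should agree with what chisq.test() reports for the same table, which is a useful check that the function is doing what you think it is.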
table <- table(x$nafta1, x$nafta2) #make a simple contingency table of frequencies
table #print the table
##
##                  benefits does not benefit dont know
##   a lot                88               68        25
##   not much            111               75       359
##   nothing at all        4                3        92
##   quite a bit         173              124       104
##
##                  have never heard of nafta
##   a lot                                  0
##   not much                              16
##   nothing at all                        99
##   quite a bit                            5
chisq.test(table) #test research hypothesis that there is an association (null = independence)
##
## Pearson's Chi-squared test
##
## data: table
## X-squared = 779.5446, df = 9, p-value < 0.00000000000000022
Here we find a very small p-value, so we reject the null hypothesis of independence and conclude that knowledge of NAFTA is associated with evaluations of NAFTA. Notice, however, that this test does not tell us which categories drive the association. It only summarizes whether or not there is an overall association.
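One common way to look inside an overall association is to inspect the standardized residuals that chisq.test() returns in its `stdres` component. The sketch below is a suggestion rather than part of the original analysis. It assumes the counts are hard-coded from the table above, and `observed`, `result`, and `std_res` are illustrative names. The cutoff of 2 is only a rough rule of thumb for flagging cells, not a formal test.

```r
# Observed counts, hard-coded from the contingency table above
observed <- matrix(
  c( 88,  68,  25,  0,
    111,  75, 359, 16,
      4,   3,  92, 99,
    173, 124, 104,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("a lot", "not much", "nothing at all", "quite a bit"),
    c("benefits", "does not benefit", "dont know",
      "have never heard of nafta")
  )
)

result <- chisq.test(observed)

# Standardized residuals: (observed - expected) divided by its standard error.
# Positive values mean more cases than independence predicts, negative fewer.
std_res <- result$stdres
print(round(std_res, 2))

# Row and column positions of cells with |residual| > 2 (rough rule of thumb)
print(which(abs(std_res) > 2, arr.ind = TRUE))
```

Large positive residuals mark combinations that occur more often than independence would predict, and large negative residuals mark combinations that occur less often.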