Chi-Squared Test and Fisher’s Exact Test in R


Hypothesis test of independence between two categorical variables

Method One: build the table from the raw data

# cross-tabulate gender by nicotine gum use at the 1st annual visit (0 = no use, 1 = used gum)
mytable <- table(lhs$AGENDER, lhs$AV1GUM)
mytable
##    
##        0    1
##   F 1708  387
##   M 2942  551

Method Two: enter the cell counts by hand

mytable2<-matrix(
  c(387,2095-387,551,3493-551), #specifying the cell values 
  nrow=2, #specifying the number of rows
  ncol=2, #specifying the number of columns 
  byrow=TRUE, #create the matrix by rows 
  dimnames=list(c("Female", "Male"),
                c("Used nicotine gum", "Did not use nicotine gum")))

mytable2
##        Used nicotine gum Did not use nicotine gum
## Female               387                     1708
## Male                 551                     2942

Chi-squared test of independence

# correct=FALSE turns off Yates' continuity correction for 2x2 tables
mychi.test <- chisq.test(mytable, correct=FALSE)
mychi.test
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 6.8252, df = 1, p-value = 0.008988
# the expected counts are stored in the saved test object
mychi.test$expected
##    
##            0        1
##   F 1743.334 351.6661
##   M 2906.666 586.3339
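
The expected counts reported by `chisq.test()` can be reproduced by hand: under independence, each expected count is (row total × column total) / grand total. The sketch below does that with the counts from the table above, recomputes the test statistic and p-value, and then runs `fisher.test()` on the same table. The names `obs`, `expected_by_hand`, `x2_by_hand`, and `fisher_result` are introduced here for illustration, and the "all expected counts at least 5" check is a common rule of thumb, assumed here rather than taken from this lab; use the condition stated in your course notes.

```r
# Observed counts, laid out like mytable (rows = gender, columns = gum use 0/1)
obs <- matrix(c(1708, 387,
                2942, 551),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("F", "M"), c("0", "1")))

# Expected count in each cell = row total * column total / grand total
expected_by_hand <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected_by_hand

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2_by_hand <- sum((obs - expected_by_hand)^2 / expected_by_hand)
df_by_hand <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_by_hand  <- pchisq(x2_by_hand, df = df_by_hand, lower.tail = FALSE)
c(statistic = x2_by_hand, df = df_by_hand, p.value = p_by_hand)

# Rule-of-thumb condition (an assumption here): every expected count >= 5
all(expected_by_hand >= 5)

# Fisher's exact test on the same table; it does not rely on the
# large-sample chi-squared approximation
fisher_result <- fisher.test(obs)
fisher_result$p.value
```

The hand-computed statistic and p-value should agree with the `chisq.test(..., correct=FALSE)` output shown above.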

QUESTIONS (Make sure to answer ALL parts of the questions!)


  1. Provide a table AND a visualization for this data. What is the observed count for females who were using nicotine gum at the 1st annual visit?
lhs$AV1GUM_1<-factor(lhs$AV1GUM, levels=c(0,1),
                     labels=c("No use","Used nicotine gum"))
mytable3 <- table(lhs$AGENDER,lhs$AV1GUM_1)
mytable3
##    
##     No use Used nicotine gum
##   F   1708               387
##   M   2942               551
mychi.test1<-chisq.test(mytable3, correct=FALSE)
mychi.test1
## 
##  Pearson's Chi-squared test
## 
## data:  mytable3
## X-squared = 6.8252, df = 1, p-value = 0.008988
mychi.test1$observed
##    
##     No use Used nicotine gum
##   F   1708               387
##   M   2942               551
mychi.test1$expected
##    
##       No use Used nicotine gum
##   F 1743.334          351.6661
##   M 2906.666          586.3339
library(ggplot2)
# bar chart of counts by gender, one panel per gum-use category
ggplot(data = lhs, aes(x = AGENDER)) + 
    geom_bar() + 
    facet_grid(~AV1GUM_1)

***

  1. Write what the null and alternative hypotheses are in the context of the question.

  1. Choose a significance level and justify why you chose this significance level. What is the test statistic and the degrees of freedom from the Chi-squared test of independence? What is the resulting p-value from that test? State your conclusion in the context of this question. If an association was found, consider whether you can make a causal statement about the association and state your conclusions accordingly.

DO NOT simply write “we reject the null hypothesis because p<.05”.

One example: The probability of observing our χ² statistic, or one more extreme, if the null hypothesis (state it here) is true is (below/above) our significance level of ___. Thus, we (have/do not have) sufficient evidence to conclude _____ (context).


  1. What is the expected count for females who were using nicotine gum at the 1st annual visit? Why are we interested in the expected counts (think about how this step relates to the null hypothesis and the process of testing theories)?

  1. Does your data meet the conditions to use the Chi-squared test? Explain why or why not. What is the p-value from Fisher’s exact test?

  1. What does the sampling distribution show us (the spread of our data or the spread of possible sample statistics)?

  1. Can we observe the true sampling distribution? Why or why not?

  1. What sampling distribution are we interested in when we conduct a hypothesis test? Why is this?

  1. If the central limit theorem conditions are met, are we saying that our data is normal or that the sampling distribution is normal?

  1. Why do we check the CLT conditions and compute the standard error by plugging in p̂ when constructing confidence intervals, but by plugging in p₀ (the null value) when doing hypothesis testing?

Yes, these are the same questions as in your group problem set. Let’s make sure you can do them on your own too!