Chi-Squared Test and Fisher’s Exact Test in R
The goal of this assignment is to learn how to
• obtain observed counts and expected counts in a 2x2 table, and
• obtain inferential results for comparing two categorical variables (e.g., Chi-squared test of independence).
Question 1
Provide a table AND a visualization for this data. What is the observed count for females who were using nicotine gum at the 1st annual visit?
Sex      Used nicotine gum   Did not use nicotine gum
Female   387                 1708
Male     551                 2942
The observed count of females who were using nicotine gum at the 1st annual visit is 387.
##
####Step 1a: Create and save the table of the two variables & view it
mytable<-table(lhs$AGENDER,lhs$AV1GUM)
####View the table that you just created
mytable
##
## 0 1
## F 1708 387
## M 2942 551
####Step 1b: Alternatively, build the same table by hand from the counts & view it
mytable2<-matrix(
c(387,2095-387,551,3493-551), #specifying the cell values
nrow=2, #specifying the number of rows
ncol=2, #specifying the number of columns
byrow=TRUE, #create the matrix by rows
dimnames=list(c("Female", "Male"),
c("Used nicotine gum", "Did not use nicotine gum")))
####View the table that you just created
mytable2
## Used nicotine gum Did not use nicotine gum
## Female 387 1708
## Male 551 2942
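Question 1 also asks for a visualization. The following is a minimal sketch of one option, a side-by-side bar chart of the row proportions drawn with base R. The object names `gum_counts` and `row_props` are illustrative and not part of the original assignment; the counts are copied from the table above.

```r
# Counts copied from the Question 1 table
gum_counts <- matrix(
  c(387, 1708,
    551, 2942),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Female", "Male"),
                  Gum = c("Used nicotine gum", "Did not use nicotine gum")))

# Row proportions: share of each sex in each gum-use category
row_props <- prop.table(gum_counts, margin = 1)

# Side-by-side bars; t() puts the sexes on the x-axis
barplot(t(row_props), beside = TRUE, legend.text = TRUE,
        ylim = c(0, 1),
        xlab = "Sex", ylab = "Proportion",
        main = "Nicotine gum use at the 1st annual visit by sex")
```

Plotting proportions rather than raw counts makes the two sexes comparable even though the sample contains more males than females.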
Question 2
Write what the null and alternative hypotheses are in the context of the question.
The question is: Is there a relationship between nicotine gum use and sex?
\(H_{0}\): There is no difference between the proportions of males and females who use nicotine gum at the first annual visit; nicotine gum use and sex are independent. Any difference seen in the sample is due to random chance.
\(H_{A}\): There is a difference between the proportions of males and females who use nicotine gum at the first annual visit; nicotine gum use and sex are associated.
\(H_{0}\): \(p_{\text{Males}} - p_{\text{Females}} = 0\)
\(H_{A}\): \(p_{\text{Males}} - p_{\text{Females}} \neq 0\)
Question 3
Choose a significance level and justify why you chose this significance level.
What is the test statistic and the degrees of freedom from the Chi-squared test of independence?
The test statistic is the chi-squared statistic \(X^2\), a measure of how far the observed counts deviate from the counts expected under the null hypothesis. Under \(H_{0}\), \(X^2\) approximately follows a chi-squared distribution.
General formula: \(X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}\), where the sum runs over all cells of the table.
The chi-squared distribution has a single parameter, the degrees of freedom (df), and is written \(\chi^2_{df}\). Here df = (r - 1) * (c - 1), where r is the number of levels of one categorical variable and c is the number of levels of the other.
For this 2x2 table, df = (2 - 1) * (2 - 1) = 1.
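As a check on the formula, the statistic can be computed by hand from the counts in Question 1. This is a sketch under the assumption that the table is entered as a plain matrix; the names `observed`, `expected`, `x2`, `df` and `p_value` are illustrative.

```r
# Observed counts from the Question 1 table
observed <- matrix(
  c(387, 1708,
    551, 2942),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Female", "Male"),
                  c("Used nicotine gum", "Did not use nicotine gum")))

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(c(X2 = x2, df = df, p.value = p_value), 4)
```

The result should agree with `chisq.test()` called with `correct=FALSE`, since that call applies the same formula without the continuity correction.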
What is the resulting p-value from that test?
State your conclusion in the context of this question. If an association was found, consider whether you can make a causal statement about the association and state your conclusions accordingly.
## Steps for Chi-Squared Test
##Step 1: Prepare: Create your two-way table. Choose your significance level. Define your hypotheses.
##significance level is 0.01
## table is mytable
## hypotheses are
##H0: pMales - pFemales = 0
##HA: pMales - pFemales != 0
##Step 2: Check: Check the assumptions. You will need to compute the expected counts here.
##Assumptions: independence, and expected counts all ≥5
##Sus1<-mychi.test$expected [1]>=5
##Sus2<-mychi.test$expected [2]>=5
##Sus3<-mychi.test$expected [3]>=5
##Sus4<-mychi.test$expected [4]>=5
##Step 3: Calculate the chi-squared test statistic. Compute the associated p-value. Compare the p-value to the significance level.
mychi.test<-chisq.test(mytable, correct=FALSE)
####View the test results
mychi.test
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 6.8252, df = 1, p-value = 0.008988
##To see which values are saved in the chisq.test() result
names(mychi.test)
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
##To obtain expected counts for the table.
mychi.test$expected
#p-value
mychi.test$p.value
## [1] 0.008988105
##Step 4: Make a conclusion based on the p-value and significance level. State your conclusion in the context of the data.
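Step 4 can be sketched as a direct comparison of the p-value with the significance level. The variable names below are illustrative; the two numbers are the significance level stated in Step 1 and the p-value printed by `chisq.test()` above.

```r
alpha   <- 0.01          # significance level chosen in Step 1
p_value <- 0.008988105   # p-value reported by chisq.test() above

# Reject H0 when the p-value falls below the significance level
reject_null <- p_value < alpha
reject_null
```

Because the p-value is below 0.01, the null hypothesis is rejected: the data provide evidence of an association between sex and nicotine gum use at the 1st annual visit. Sex cannot be randomly assigned, so this result describes an association and does not support a causal statement.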
Question 4:
What is the expected count for females who were using nicotine gum at the 1st annual visit?
Why are we interested in the expected counts (think about how this step relates to the null hypothesis and the process of testing theories)?
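The expected count for a cell is the count we would anticipate if the null hypothesis of independence were true: row total times column total, divided by the grand total. A minimal sketch for the female/gum cell, using the counts from Question 1 (the variable names are illustrative):

```r
# Margins taken from the Question 1 table
female_total <- 387 + 1708   # all females
gum_total    <- 387 + 551    # all participants who used nicotine gum
grand_total  <- 387 + 1708 + 551 + 2942

# Expected count under H0 = row total * column total / grand total
expected_female_gum <- female_total * gum_total / grand_total
round(expected_female_gum, 2)
```

This gives about 351.67, compared with an observed count of 387. The test asks whether gaps like this one, taken over all four cells, are larger than random chance alone would plausibly produce.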
Question 5
Does your data meet the conditions to use the Chi-square test? Explain why or why not.
What is the p-value from Fisher’s exact test?
##Step 2: Check: Check the assumptions. You will need to compute the expected counts here.
##Assumptions: independence, and expected counts all ≥5
Sus1<-mychi.test$expected [1]>=5
Sus2<-mychi.test$expected [2]>=5
Sus3<-mychi.test$expected [3]>=5
Sus4<-mychi.test$expected [4]>=5
##Fisher’s Exact test - alternative to the Chi-square test when the conditions are not met.
##To carry out Fisher’s Exact test, use the fisher.test() function, specifying the name of the object
##that contains the 2x2 table (e.g., mytable, mytable2):
fisher.test(mytable)
##
## Fisher's Exact Test for Count Data
##
## data: mytable
## p-value = 0.009644
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.7148097 0.9565290
## sample estimates:
## odds ratio
## 0.8266191
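To see where the reported odds ratio comes from, the sketch below rebuilds the table in the same layout as `mytable` and compares the simple cross-product odds ratio with the estimate returned by `fisher.test()`. The names `tab`, `sample_or` and `fisher_or` are illustrative.

```r
# Same layout as mytable: rows F/M, columns 0 (no gum) / 1 (gum)
tab <- matrix(c(1708, 387,
                2942, 551),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("F", "M"), c("0", "1")))

# Sample (cross-product) odds ratio: (a * d) / (b * c)
sample_or <- (tab["F", "0"] * tab["M", "1"]) / (tab["F", "1"] * tab["M", "0"])

# fisher.test() reports a conditional maximum likelihood estimate instead,
# so the two values are close but not identical
fisher_or <- unname(fisher.test(tab)$estimate)

c(sample = sample_or, fisher = fisher_or)
```

Both estimates are below 1, and the 95 percent confidence interval in the output above excludes 1, which is consistent with the small p-value.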
CONCEPTUAL QUESTIONS (confidence intervals, sampling distributions and inference for a single proportion)
Assume we are discussing the sampling distribution for a sample proportion.
Question 6:
What does the sampling distribution show us (the spread of our data or the spread of possible sample statistics)?
Question 7:
Can we observe the true sampling distribution? Why or why not?
Question 8:
What sampling distribution are we interested in when we conduct a hypothesis test? Why is this?
Question 9:
If the central limit theorem conditions are met, are we saying that our data is normal or that the sampling distribution is normal?
The sampling distribution of the sample proportion is approximately normal; the data themselves are categorical and not normal.
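The idea behind Questions 6 to 9 can be illustrated with a small simulation. We cannot observe the true sampling distribution, because we only ever draw one sample, but we can mimic it by drawing many samples from a population whose proportion we set ourselves. The values of `p`, `n` and `reps` below are hypothetical and chosen only for illustration.

```r
set.seed(1)        # for reproducibility
p    <- 0.17       # hypothetical true population proportion
n    <- 500        # hypothetical sample size
reps <- 10000      # number of simulated samples

# Each draw is the sample proportion from one simulated sample of size n
p_hats <- rbinom(reps, size = n, prob = p) / n

# The simulated sampling distribution is centred near p, with spread
# close to the theoretical standard error sqrt(p * (1 - p) / n)
c(mean = mean(p_hats), sd = sd(p_hats), theory_se = sqrt(p * (1 - p) / n))

hist(p_hats, main = "Simulated sampling distribution of the sample proportion",
     xlab = "Sample proportion")
```

The histogram shows the spread of possible sample statistics, not the spread of the data: each individual observation is still just a yes or a no.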
Question 10:
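Why do we check the CLT conditions and compute the standard error by plugging in \(\hat{p}\) when constructing confidence intervals, but by plugging in \(p_0\) (the null value) when doing hypothesis testing?
\(\hat{p}\) is used for confidence intervals because no value of the population proportion is assumed, so the sample proportion is the best available estimate of it.
\(p_0\) is used for hypothesis testing because the test is carried out under the assumption that the null hypothesis is true, so the sampling distribution is centered at \(p_0\).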
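The two standard errors can be compared side by side. In this sketch the sample values come from the Question 1 table (938 gum users out of 5588 participants), while the null value `p0 = 0.15` is hypothetical and used only for illustration.

```r
n     <- 5588          # total participants in the Question 1 table
p_hat <- 938 / n       # observed proportion who used nicotine gum
p0    <- 0.15          # hypothetical null value, for illustration only

# Confidence interval: no null value is assumed, so plug in p_hat
se_ci <- sqrt(p_hat * (1 - p_hat) / n)
ci_95 <- p_hat + c(-1, 1) * qnorm(0.975) * se_ci

# Hypothesis test: computed as if H0 were true, so plug in p0
se_test <- sqrt(p0 * (1 - p0) / n)
z       <- (p_hat - p0) / se_test

list(ci_95 = ci_95, z = z)
```

The same logic applies to the CLT success-failure check: it uses n * p_hat and n * (1 - p_hat) for a confidence interval, and n * p0 and n * (1 - p0) for a hypothesis test.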