R_Coding_Assignment

Chi-Squared Test and Fisher’s Exact Test in R

Hypothesis test of independence between two categorical variables

Method One:

mytable <- table(lhs$AGENDER,lhs$AV1GUM)
mytable

##    
##        0    1
##   F 1708  387
##   M 2942  551

Method Two:

mytable2<-matrix(
  c(387,2095-387,551,3493-551), #specifying the cell values 
  nrow=2, #specifying the number of rows
  ncol=2, #specifying the number of columns 
  byrow=TRUE, #create the matrix by rows 
  dimnames=list(c("Female", "Male"),
                c("Used nicotine gum", "Did not use nicotine gum")))

mytable2

##        Used nicotine gum Did not use nicotine gum
## Female               387                     1708
## Male                 551                     2942

Chi-squared test of independence

mychi.test<-chisq.test(mytable, correct=FALSE)
mychi.test

## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 6.8252, df = 1, p-value = 0.008988

#expected as a saved object
mychi.test$expected

##    
##            0        1
##   F 1743.334 351.6661
##   M 2906.666 586.3339
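The numbers reported by chisq.test() can be reproduced by hand, which makes the link between observed counts, expected counts, and the p-value explicit. The following is a minimal sketch, assuming the counts are typed in from the printed table rather than read from the lhs data frame; the object names (observed, expected, x_squared, dof, p_value) are illustrative and not part of the assignment.

```r
# Observed counts copied from the printed table (rows: F, M; columns: 0, 1)
observed <- matrix(c(1708, 387,
                     2942, 551),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("F", "M"), c("0", "1")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over all cells of (O - E)^2 / E
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail area of the chi-squared distribution beyond the statistic
p_value <- pchisq(x_squared, df = dof, lower.tail = FALSE)

round(c(x_squared = x_squared, df = dof, p_value = p_value), 4)
```

The rounded values should agree with the chisq.test() output above, since correct=FALSE switches off the continuity correction.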

QUESTIONS (Make sure to answer ALL parts of the questions!)

Provide a table AND a visualization for this data. What is the observed count for females who were using nicotine gum at the 1st annual visit?

387 is the observed count for females who were using nicotine gum at the 1st annual visit.

lhs$AV1GUM_1<-factor(lhs$AV1GUM, levels=c(0,1),
                     labels=c("No use","Used nicotine gum"))

mytable3 <- table(lhs$AGENDER,lhs$AV1GUM_1)
mytable3

##    
##     No use Used nicotine gum
##   F   1708               387
##   M   2942               551

mychi.test1<-chisq.test(mytable3, correct=FALSE)
mychi.test1

## 
##  Pearson's Chi-squared test
## 
## data:  mytable3
## X-squared = 6.8252, df = 1, p-value = 0.008988

mychi.test1$observed

##    
##     No use Used nicotine gum
##   F   1708               387
##   M   2942               551

mychi.test1$expected

##    
##       No use Used nicotine gum
##   F 1743.334          351.6661
##   M 2906.666          586.3339

library(ggplot2)

# Bar chart of participant counts by gender, one panel per gum-use category
ggplot(data = lhs, aes(x = AGENDER)) + 
    geom_bar() + 
    facet_grid(~AV1GUM_1)

***

Write what the null and alternative hypotheses are in the context of the question.

H\(_{0}\): The use of nicotine gum is independent of the participant’s gender.
H\(_{A}\): The use of nicotine gum is associated with the participant’s gender.

Choose a significance level and justify why you chose this significance level. What is the test statistic and the degrees of freedom from the Chi-squared test of independence? What is the resulting p-value from that test? State your conclusion in the context of this question. If an association was found, consider whether you can make a causal statement about the association and state your conclusions accordingly.

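We use a significance level of 0.05 because a Type 1 error (concluding that gum use and gender are associated when they are not) does not appear to carry high risk in this setting.
The test statistic is 6.8252 with 1 degree of freedom.
The p-value is 0.008988.
If the use of nicotine gum is independent of gender, the probability of observing a chi-squared statistic of 6.8252 or one more extreme is 0.008988, which is below our significance level of 0.05. We therefore reject the null hypothesis: the data provide convincing evidence that the use of nicotine gum at the 1st annual visit is associated with the gender of the participant. Because gender cannot be randomly assigned, this is an observational comparison, so we cannot make a causal statement about the association.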

DO NOT simply write “we reject the null hypothesis because p<.05”.

One example: The probability of observing our \(\chi^{2}\) statistic or one more extreme if the (state null hypothesis here) is true, is (below/above) our significance level of ___ . Thus, we have sufficient evidence to conclude _____ (context).

What is the expected count for females who were using nicotine gum at the 1st annual visit? Why are we interested in the expected counts (think about how this step relates to the null hypothesis and the process of testing theories)?

Expected count = 351.67. The expected count is what we would get if the variables were independent and the null hypothesis were true.
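As a worked check, the expected count can be computed from the margins of the observed table: the row total for females is 1708 + 387 = 2095, the column total for gum users is 387 + 551 = 938, and the grand total is 5588. The symbol \(E_{F,\text{gum}}\) is notation introduced here for illustration.

\[
E_{F,\text{gum}} = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}} = \frac{2095 \times 938}{5588} \approx 351.67
\]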

Does your data meet the conditions to use the Chi-squared test? Explain why or why not. What is the p-value from Fisher’s exact test?

Yes, it’s a random sample.
Yes, the variables are categorical.
Yes, every expected cell count is at least 5 (the smallest is 351.67).
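The expected-count condition and Fisher’s exact test can both be run from the table of counts. This is a minimal sketch, assuming the counts are typed in from the printed table; the object names are illustrative. No p-value is quoted here because the original output does not include one.

```r
# Observed counts copied from the printed table (rows: F, M; columns: no use, used gum)
observed <- matrix(c(1708, 387,
                     2942, 551),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("F", "M"), c("No use", "Used nicotine gum")))

# Condition check: the smallest expected cell count should be at least 5
min_expected <- min(chisq.test(observed, correct = FALSE)$expected)
min_expected >= 5

# Fisher's exact test on the same 2 x 2 table
fisher_result <- fisher.test(observed)
fisher_result$p.value
```

With expected counts this large, the chi-squared approximation is usually adequate, so Fisher’s test serves mainly as a cross-check here.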

What does the sampling distribution show us (the spread of our data or the spread of possible sample statistics)?

It shows us the spread of possible sample statistics.

Can we observe the true sampling distribution? Why or why not?

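We cannot observe the true sampling distribution, because we only have one sample. Observing it would require drawing every possible sample of this size from the population and computing the statistic for each one.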

What sampling distribution are we interested in when we conduct a hypothesis test? Why is this?

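We are interested in the sampling distribution of the test statistic under the assumption that the null hypothesis is true (the null distribution). The p-value is computed under that assumption: it is the probability of a statistic at least as extreme as the one we observed if the null hypothesis holds.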
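One way to look at a sampling distribution without collecting new samples is to simulate it under the null hypothesis. The sketch below is illustrative only and is not the method used in the assignment: it assumes the counts from the printed table, keeps both the row and column totals fixed (the same conditioning Fisher’s exact test uses), and uses an arbitrary seed and 2000 simulated tables. The helper chi_sq_stat() is a name made up for this example.

```r
set.seed(1)  # arbitrary seed, used only so the run is reproducible

# Observed counts copied from the printed table (rows: F, M; columns: 0, 1)
observed <- matrix(c(1708, 387,
                     2942, 551),
                   nrow = 2, byrow = TRUE)

# Pearson chi-squared statistic for one table of counts (hypothetical helper)
chi_sq_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

obs_stat <- chi_sq_stat(observed)

# r2dtable() draws random tables with the given row and column totals,
# which is how the data would behave if gender and gum use were independent
null_tables <- r2dtable(2000, rowSums(observed), colSums(observed))
null_stats <- vapply(null_tables, chi_sq_stat, numeric(1))

# Spread of the simulated statistics under the null hypothesis
quantile(null_stats, c(0.50, 0.95, 0.99))

# Share of simulated statistics at least as large as the observed one
sim_p_value <- mean(null_stats >= obs_stat)
sim_p_value
```

The simulated share is an approximation of the p-value and will vary slightly with the seed and the number of simulated tables.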

If the central limit theorem conditions are met, are we saying that our data is normal or that the sampling distribution is normal?

The sampling distribution is approximately normal. The data themselves are categorical and do not need to be normal.

Why do we check the CLT conditions and compute the standard error by plugging in \(\hat{p}\) when constructing confidence intervals, but by plugging in \(p_{0}\) (the null value) when doing hypothesis testing?

When calculating a confidence interval we do not assume any value for the true proportion, so we use \(\hat{p}\), our best estimate from the data, and build the interval around that point estimate. In a hypothesis test, every calculation is carried out under the assumption that the null hypothesis is true, so the sampling distribution is centered at \(p_{0}\) and its standard error is computed from \(p_{0}\).
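The difference between the two standard errors can be seen in a short calculation. This is a sketch under stated assumptions: the sample proportion uses the female counts from the table above (387 gum users out of 2095), while the null value p0 = 0.17 is hypothetical and chosen only for illustration. The threshold of 10 in the success-failure check is a common rule of thumb, not something stated in the assignment.

```r
# Sample proportion of females using nicotine gum, from the table counts
n <- 2095
p_hat <- 387 / n

# Hypothetical null value, chosen only for illustration (not from the assignment)
p0 <- 0.17

# Confidence interval: the true p is unknown, so the standard error uses p_hat
se_ci <- sqrt(p_hat * (1 - p_hat) / n)
ci_95 <- p_hat + c(-1, 1) * qnorm(0.975) * se_ci

# Hypothesis test: calculations assume H0 is true, so the standard error uses p0
se_test <- sqrt(p0 * (1 - p0) / n)
z <- (p_hat - p0) / se_test

# Success-failure checks use the same proportion as the matching standard error
all(c(n * p_hat, n * (1 - p_hat)) >= 10)  # condition for the interval
all(c(n * p0, n * (1 - p0)) >= 10)        # condition for the test

ci_95
z
```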

Yes, these are the same questions as in your group problem set. Let’s make sure you can do them on your own too!

R_Coding_Assignment_7

Tully O’Leary

3/23/2021
