Provide a table AND a visualization for this data. What is the observed count for females who were using nicotine gum at the 1st annual visit?
attach(lhs)
library(knitr)
library(kableExtra)
tab <- table(AGENDER, AV1GUM)
rownames(tab) = c("Females", "Males")
colnames(tab) = c("No", "Yes")
kable(tab)
| | No | Yes |
|---|---|---|
| Females | 1708 | 387 |
| Males | 2942 | 551 |
library(ggplot2)
dat<-data.frame(tab)
names(dat)<-c("AGENDER", "AV1GUM", "Count")
ggplot(data=dat, aes(x=AGENDER, y=Count, fill=AV1GUM))+geom_bar(stat="identity")
Write what the null and alternative hypotheses are in the context of the question
Answer: H0 (null): There is no relationship between nicotine gum use and gender at the one-year visit. HA (alternative): There is a relationship between nicotine gum use and gender at the one-year visit.
Choose a significance level and justify why you chose this significance level. Answer: I am choosing a significance level of 0.05. I chose this level because the sample is large (5,588 males and females in total), and with a sample this large even a small difference is likely to be statistically significant.
What is the test statistic and the degrees of freedom from the Chi-squared test of independence? Answer: The test statistic is X-squared = 6.8252 with df = 1.
What is the resulting p-value from that test? Answer: p-value = 0.008988
State your conclusion in the context of this question. Answer: If there were no relationship between nicotine gum use and gender at the one-year visit, the probability of observing a chi-squared statistic of 6.8252 or one more extreme would be 0.008988, which is below our significance level of 0.05. Thus, we have sufficient evidence to reject the null hypothesis and conclude that there is a relationship between gender and nicotine gum use at the one-year visit.
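If an association was found, consider whether you can make a causal statement about the association and state your conclusions accordingly. Answer: An association was found between gender and nicotine gum use at the one-year visit. However, we cannot make a causal statement: gender cannot be randomly assigned, so the association may reflect other differences between the males and females in the study rather than an effect of gender itself.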
#Significance level - .05
#Build the unlabeled 2x2 table used by the tests below
mytable<-table(AGENDER, AV1GUM)
#Run chi-square test
mychi.test<-chisq.test(mytable, correct=FALSE)
mychi.test
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 6.8252, df = 1, p-value = 0.008988
#Check conditions - all are greater than 5 and we assume we have a random sample
mychi.test$expected
##
## 0 1
## F 1743.334 351.6661
## M 2906.666 586.3339
#p-value
mychi.test$p.value
## [1] 0.008988105
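#Verify the p-value by hand; the test has df = 1
pchisq(6.8252, df = 1, lower.tail = FALSE)
## approximately 0.008988 (matches mychi.test$p.value up to rounding of the statistic)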
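The statistic and p-value reported by chisq.test can be reproduced by hand from the observed counts. The sketch below is only an illustration: it rebuilds the 2x2 table from the counts printed earlier, and the names `obs`, `expd`, and `x2` are introduced here and are not part of the original analysis.

```r
# Observed counts copied from the table above (rows: gender, columns: gum use)
obs <- matrix(c(1708, 387,
                2942, 551),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Females", "Males"), c("No", "Yes")))

# Expected counts under independence: (row total * column total) / n
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((obs - expd)^2 / expd)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

x2                                       # close to 6.8252
df                                       # 1
pchisq(x2, df = df, lower.tail = FALSE)  # close to 0.008988
```

Because the table is 2x2, every cell differs from its expected count by the same absolute amount (about 35.3), so the cells with the smallest expected counts contribute most to the statistic.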
What is the expected count for females who were using nicotine gum at the 1st annual visit? Answer: The expected count of females who were using nicotine gum at the one-year visit is 351.67, or about 352 females.
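Why are we interested in the expected counts (think about how this step relates to the null hypothesis and the process of testing theories)? Answer: The expected counts are the counts we would see in each cell if the null hypothesis were true, that is, if gender and nicotine gum use were independent. The chi-squared statistic measures how far the observed counts fall from these expected counts, so a large gap between them is evidence against independence.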
mychi.test$expected
##
## 0 1
## F 1743.334 351.6661
## M 2906.666 586.3339
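As a quick check on where a single expected count comes from, the Females/Yes cell can be computed from the marginal totals alone. This is a minimal sketch using the observed counts from the table; the variable names are chosen here for illustration.

```r
# Marginal totals taken from the observed table
females_total <- 1708 + 387                # all females
gum_total     <- 387 + 551                 # all gum users
n             <- 1708 + 387 + 2942 + 551   # everyone in the table

# Expected count under independence for the Females/Yes cell:
# (row total * column total) / grand total
expected_females_gum <- females_total * gum_total / n
expected_females_gum   # about 351.67, versus 387 observed
```

Under independence, the share of gum users among females would equal the overall share (938 of 5588), which is what this product of margins encodes.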
Does your data meet the conditions to use the Chi-square test? Explain why or why not. Answer: My data meet the conditions to use the chi-square test because the subjects are independent and all expected counts are greater than 5.
What is the p-value from Fisher’s exact test? Answer: 0.009644
fisher.test(mytable)
##
## Fisher's Exact Test for Count Data
##
## data: mytable
## p-value = 0.009644
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.7148097 0.9565290
## sample estimates:
## odds ratio
## 0.8266191
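To help read the odds ratio in this output, the sketch below computes the ordinary cross-product (sample) odds ratio from the observed counts. It should land close to, but not exactly on, the value above, because fisher.test reports a conditional maximum likelihood estimate rather than the sample odds ratio. The names `obs` and `or_sample` are introduced here for illustration.

```r
# Counts from the observed table (rows: Females, Males; columns: No, Yes)
obs <- matrix(c(1708, 387,
                2942, 551),
              nrow = 2, byrow = TRUE)

# Cross-product odds ratio in the same orientation as the table:
# (Females/No * Males/Yes) / (Females/Yes * Males/No)
or_sample <- (obs[1, 1] * obs[2, 2]) / (obs[1, 2] * obs[2, 1])
or_sample        # about 0.827

# Inverting gives the odds of gum use for females relative to males
1 / or_sample    # about 1.21
```

Read this way, the odds of gum use at the one-year visit were roughly 21% higher for females than for males, which is the same association the test detects.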
What does the sampling distribution show us (the spread of our data or the spread of possible sample statistics)? Answer: The sampling distribution shows the spread of possible sample statistics, not the spread of our data: it describes how often each value of a statistic would occur across repeated samples from the population. Knowing the sampling distribution lets us make inferences about the population. In our chi-square example, the chi-square distribution with 1 degree of freedom approximates the sampling distribution of the test statistic when the null hypothesis is true.
Can we observe the true sampling distribution? Why or why not? Answer: No, we cannot observe the true sampling distribution, because we do not have access to the entire population and cannot draw every possible sample from it; we have only one sample.
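What sampling distribution are we interested in when we conduct a hypothesis test? Why is this? Answer: We are interested in the sampling distribution of the test statistic under the assumption that the null hypothesis is true, because the p-value is the probability, under that distribution, of a statistic at least as extreme as the one we observed.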
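Although the true sampling distribution cannot be observed, the null sampling distribution can be approximated by simulation. The sketch below is one possible approach, not part of the original analysis: it assumes the group sizes are fixed at their observed values and that, under the null, both groups share the pooled gum-use proportion. The seed, the number of replications, and all variable names are arbitrary choices made here.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Observed margins from the table
n_f   <- 2095          # females
n_m   <- 3493          # males
p_gum <- 938 / 5588    # pooled gum-use proportion, shared by both groups under the null

# Simulate one study under the null and return its chi-squared statistic
sim_stat <- function() {
  f_yes <- rbinom(1, n_f, p_gum)
  m_yes <- rbinom(1, n_m, p_gum)
  tab <- matrix(c(n_f - f_yes, f_yes,
                  n_m - m_yes, m_yes),
                nrow = 2, byrow = TRUE)
  unname(chisq.test(tab, correct = FALSE)$statistic)
}

# Approximate null sampling distribution of the statistic
null_stats <- replicate(2000, sim_stat())

# Share of simulated statistics at least as extreme as the observed 6.8252
mean(null_stats >= 6.8252)

# Chi-squared(1) approximation to the same tail probability
pchisq(6.8252, df = 1, lower.tail = FALSE)
```

The two tail probabilities should be of similar size, which illustrates why the chi-squared distribution with 1 degree of freedom can stand in for the null sampling distribution here.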
If the central limit theorem conditions are met, are we saying that our data is normal or that the sampling distribution is normal? Answer: We are saying that the sampling distribution is approximately normal, not that the data are normal.
Why do we check the CLT conditions and compute the standard error by plugging in p̂ when constructing confidence intervals, but by plugging in p0 (the null value) when doing hypothesis testing? Answer: In a hypothesis test we carry out the calculations as if the null hypothesis were true, so we use the null value p0. A confidence interval has no hypothesized value, so we use p̂, our best estimate of the true proportion.
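The difference between the two standard errors can be seen in a small one-proportion example. The sketch below uses the proportion of females using gum from the table; the null value p0 = 0.20 is hypothetical and chosen only for illustration, as are the variable names.

```r
# Sample proportion of females using gum (counts from the table)
n     <- 2095
p_hat <- 387 / n

# Hypothetical null value, chosen only for illustration
p0 <- 0.20

# Confidence interval: no hypothesized value, so the SE plugs in p_hat
se_ci <- sqrt(p_hat * (1 - p_hat) / n)
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * se_ci

# Hypothesis test: computed as if the null were true, so the SE plugs in p0
se_test <- sqrt(p0 * (1 - p0) / n)
z       <- (p_hat - p0) / se_test

ci   # 95% interval centered on p_hat
z    # standardized distance of p_hat from p0
```

The same logic applies to the CLT checks: the success-failure counts use n * p_hat for the interval and n * p0 for the test.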