OBJECTIVE: In our lectures this unit, we spent a good chunk of time reviewing hypothesis tests that let us compare observed means against the values expected under a null hypothesis and decide whether the data fit those expectations. For this R lab, you will perform several types of hypothesis tests. For the first few, I will tell you WHAT kind of test to run; for the last few, YOU will have to decide based on the type of data you’re working with and the question I am asking.
For reference, I have created a markdown file that contains example code for running all of these hypothesis tests. Feel free to use it or to use internet resources.
From the 3rd R lab… Please bring in the “rice.csv” file; you may call it whatever you want. The folks who collected the data would like you to analyze it to see what patterns may or may not exist. The first thing they want you to look at is whether mean RootDryMass differs between wild-type (wt) and gmo rice. To compare means between these two groups, you will need to run a two-sample t-test. Use the code chunk below to bring in the data and run the test.
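# Read the comma-separated file; the first row holds the column names
rice <- read.table("rice.csv", sep = ",", header = TRUE)
rice
# Two-sample t-test with pooled variance (var.equal = TRUE assumes equal group variances)
t.test(RootDryMass ~ variety, data = rice, conf.level = 0.95, alternative = "two.sided", var.equal = TRUE)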
##
## Two Sample t-test
##
## data: RootDryMass by variety
## t = -5.2857, df = 70, p-value = 1.353e-06
## alternative hypothesis: true difference in means between group gmo and group wt is not equal to 0
## 95 percent confidence interval:
## -23.14671 -10.46440
## sample estimates:
## mean in group gmo  mean in group wt
##          9.666667         26.472222
What are your null and alternative hypotheses for this test?
What is the p-value for the two-sample t-test? What does this mean?
#### Answer:
The null hypothesis for this test is that mean RootDryMass is the same for both rice varieties (wt and gmo), i.e., variety has no effect on RootDryMass. The alternative hypothesis is that mean RootDryMass differs between the two varieties.
The p-value is 1.353e-06 (or 0.000001353), which is far smaller than 0.05 (the significance level typically used in biology). This means that, if the null hypothesis were true, the probability of obtaining a difference in means at least as large as the one observed would be very low, so we reject the null hypothesis. The gmo group has the lower mean (9.67 vs. 26.47 for wt).
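The comparison of the p-value against the significance level can also be done in code rather than by reading the printout. The sketch below is illustrative only: it uses simulated stand-in data (the `demo` data frame, its group sizes, and its means are assumptions, not the rice measurements) and shows how the object returned by `t.test()` exposes the p-value and confidence interval.

```r
# Simulated stand-in data (NOT the rice measurements); sizes and means are arbitrary.
set.seed(1)
demo <- data.frame(
  variety = rep(c("gmo", "wt"), each = 36),
  RootDryMass = c(rnorm(36, mean = 10, sd = 12), rnorm(36, mean = 26, sd = 14))
)

# Same call shape as in the lab: pooled-variance two-sample t-test.
res <- t.test(RootDryMass ~ variety, data = demo, var.equal = TRUE)

alpha <- 0.05                       # significance level used in the answer above
res$p.value                         # the p-value as a plain number
res$conf.int                        # 95% CI for mean(gmo) - mean(wt)
reject_null <- res$p.value < alpha  # TRUE means we reject the null hypothesis
reject_null
```

Pulling `p.value` out of the result avoids misreading scientific notation such as `1.353e-06` in the printed output.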
Let’s create a boxplot comparing the RootDryMass for each variety:
boxplot(RootDryMass ~ variety, data = rice, col = c("green", "blue"))
Next, the researchers want to check whether other factors are influencing the growth of the rice. Because they used different fertilizers, they are curious whether fertilizer affected RootDryMass. Since there are more than 2 levels of fertilizer, the appropriate test is an ANOVA. Use the code chunk below to run an ANOVA of how mean RootDryMass differs across fertilizer treatments (‘fert’).
Fert_RDM <- aov(RootDryMass ~ fert,
data = rice)
summary(Fert_RDM)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## fert         2   3640  1819.8   8.855 0.000378 ***
## Residuals   69  14181   205.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(Fert_RDM)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = RootDryMass ~ fert, data = rice)
##
## $fert
##                    diff        lwr       upr     p adj
## NH4Cl-F10    -16.875000 -26.787877 -6.962123 0.0003499
## NH4NO3-F10   -12.166667 -22.079543 -2.253790 0.0122604
## NH4NO3-NH4Cl   4.708333  -5.204543 14.621210 0.4943812
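In the output above, the overall F test is significant (p = 0.000378), which is why a Tukey post-hoc test follows it. Both comparisons against F10 have adjusted p-values below 0.05, while NH4NO3 vs. NH4Cl does not. The decision to run the post-hoc test can be sketched in code as below. This is a minimal illustration on simulated data: the `demo` data frame and its group means are assumptions, and the "significant F test plus more than two levels" rule is the common convention, not a requirement of R.

```r
# Simulated stand-in data (NOT the rice measurements); means are arbitrary.
set.seed(2)
demo <- data.frame(
  fert = factor(rep(c("F10", "NH4Cl", "NH4NO3"), each = 24)),
  RootDryMass = c(rnorm(24, 28, 14), rnorm(24, 11, 14), rnorm(24, 16, 14))
)

fit <- aov(RootDryMass ~ fert, data = demo)

# summary() returns a list; its first element is the ANOVA table.
# The first row of the "Pr(>F)" column is the p-value for fert.
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]
anova_p

# Conventional rule: follow up with Tukey's HSD only if the overall F test
# is significant and the factor has more than two levels.
if (anova_p < 0.05 && nlevels(demo$fert) > 2) {
  print(TukeyHSD(fit)$fert)
} else {
  message("Overall F test not significant; no post-hoc comparison run.")
}
```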
Let’s create a boxplot comparing the RootDryMass for each fertilizer treatment:
boxplot(RootDryMass ~ fert, data = rice, col = c("yellow", "green", "blue"))
Preventative care is incredibly important for finding and treating many diseases before they become a problem. Researchers have been collecting data on one of the most common types of cancer, breast cancer, and how it is associated with preventative screening, in this case a mammogram. They have collected data on over 80,000 women over the last few decades, tracking whether each woman died from breast cancer and whether she had a mammogram. These researchers have asked you to look into possible links between the two. Here, since both variables are categorical, you will need to test for an association using a chi-squared contingency test. In the code chunk below, please read in the “mammogram.csv” dataset and run the chi-squared analysis.
mammogram <- read.table("mammogram.csv", sep = ",", header = TRUE)
mammogram
chisq.test(mammogram$treatment, mammogram$breast_cancer_death)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mammogram$treatment and mammogram$breast_cancer_death
## X-squared = 0.01748, df = 1, p-value = 0.8948
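With p = 0.8948, the test above gives no evidence of an association between having a mammogram and dying of breast cancer. To see what `chisq.test()` is working from, it can help to build the contingency table explicitly and look at the expected counts. The sketch below uses small made-up counts: the `demo` data frame and the level names "control", "mammogram", "no", and "yes" are assumptions, not the study data.

```r
# Hypothetical toy counts (NOT the mammogram study data).
demo <- data.frame(
  treatment = rep(c("control", "mammogram"), times = c(200, 200)),
  breast_cancer_death = c(rep(c("no", "yes"), times = c(190, 10)),
                          rep(c("no", "yes"), times = c(192, 8)))
)

# Cross-tabulate the two categorical variables into a 2 x 2 table.
tab <- table(demo$treatment, demo$breast_cancer_death)
tab

# chisq.test() accepts the table directly; for 2 x 2 tables it applies
# Yates' continuity correction by default, as in the output above.
res <- chisq.test(tab)
res$expected   # counts expected if the variables were independent
res$p.value
```

A common rule of thumb is that the expected counts should all be at least 5 for the chi-squared approximation to be reasonable, so `res$expected` is worth checking before interpreting the p-value.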
Let’s create a mosaic plot to show the relationship between cancer survival and whether or not a patient had a mammogram.
mosaicplot(~treatment + breast_cancer_death, data = mammogram, col=c("yellow", "green"))
NOTE: For whatever test you end up choosing, you will need to make sure you address the following:
1. Tell me WHY you are choosing the test you’re choosing.
2. Tell me your null and alternative hypotheses.
3. Interpret the p-value and what it means for the potential relationship.
4. Create an appropriate graph for the relationship.
5. If you choose an ANOVA, you need to determine whether or not you need to run a post-hoc test! Run it if you believe it is necessary.
Background: In the code chunk below, please bring in “poison.csv”; you may call the data frame whatever you want. The data come from a study of how different types of poison and various treatments (levels) of the poison affect how long an animal survives (days). Given the data from the experiment, the researchers want you to look at how the mean time until death differs by ‘treat’.
HINT: you can run some of the initial code we looked at in the first few R labs to determine the types of data you have as well as the levels within the variables.
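The kind of inspection the hint refers to might look like the sketch below. It runs on a small made-up data frame: the `time` and `treat` column names follow the lab text, while the `poison` column and all values are assumptions for illustration.

```r
# Hypothetical stand-in for the poison data; values are made up.
demo <- data.frame(
  time = c(0.31, 0.45, 0.82, 0.43, 0.36, 0.92, 0.23, 0.56),
  poison = rep(c(1, 2), each = 4),
  treat = rep(c("A", "B", "C", "D"), times = 2)
)

str(demo)            # column types: numeric response, character grouping variable
unique(demo$treat)   # the distinct treatment labels
table(demo$treat)    # number of observations per treatment

# Treating the grouping variable as a factor makes its levels explicit.
demo$treat <- factor(demo$treat)
levels(demo$treat)
nlevels(demo$treat)  # more than 2 levels with a numeric response points to ANOVA
```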
poison <- read.table("poisons (1).csv", sep = ",", header = TRUE)
poison
time_treat <- aov(time ~ treat,
data = poison)
summary(time_treat)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## treat        3 0.9212 0.30707   6.484 0.000992 ***
## Residuals   44 2.0839 0.04736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(time ~ treat, data = poison, col = c("yellow", "green", "blue", "purple"))
TukeyHSD(time_treat)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = time ~ treat, data = poison)
##
## $treat
##            diff         lwr         upr     p adj
## B-A  0.36250000  0.12528287  0.59971713 0.0010358
## C-A  0.07833333 -0.15888380  0.31555047 0.8143113
## D-A  0.22000000 -0.01721713  0.45721713 0.0778376
## C-B -0.28416667 -0.52138380 -0.04694953 0.0131752
## D-B -0.14250000 -0.37971713  0.09471713 0.3869986
## D-C  0.14166667 -0.09555047  0.37888380 0.3921830
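Reading the table above at the 0.05 level, only B-A (p adj = 0.0010) and C-B (p adj = 0.0132) differ significantly; D-A (p adj = 0.0778) falls just short. With six pairwise comparisons it is easy to misread a row, so the pairs can be flagged programmatically. The sketch below is one way to do that on simulated data: the `demo` data frame, its group means, and the `tukey_pairs` and `significant` names are assumptions for illustration.

```r
# Simulated stand-in data (NOT the poison measurements); means are arbitrary.
set.seed(3)
demo <- data.frame(
  treat = factor(rep(c("A", "B", "C", "D"), each = 12)),
  time = c(rnorm(12, 0.3, 0.2), rnorm(12, 0.7, 0.2),
           rnorm(12, 0.4, 0.2), rnorm(12, 0.5, 0.2))
)

fit <- aov(time ~ treat, data = demo)

# TukeyHSD() stores one matrix per factor; convert it to a data frame
# so the adjusted p-values can be filtered and sorted.
tukey_pairs <- as.data.frame(TukeyHSD(fit)$treat)
tukey_pairs$significant <- tukey_pairs[["p adj"]] < 0.05

# Most significant comparisons first.
tukey_pairs[order(tukey_pairs[["p adj"]]), ]
```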