OBJECTIVE: In our lectures this unit, we spent a good chunk of time reviewing hypothesis tests that let us compare sample means to the values expected under a null hypothesis and decide whether the data fit that expectation. For this R lab, you will perform several types of hypothesis tests. For the first few, I will tell you WHAT kind of test to run; for the last few, YOU will have to decide based on the type of data you’re working with and the question I am asking.

For reference, I have created a markdown file that contains example code for running all types of hypothesis tests. Feel free to use it or to use internet resources.

Example 1

From the 3rd R lab… Please bring in the “rice.csv” file; you may call it whatever you want. The folks who collected the data would like you to analyze it to see what patterns may or may not exist. The first thing they want you to look at is whether mean RootDryMass differs between wild-type (wt) and gmo rice. To compare the means of these two groups, you will need to run a two-sample t-test. Use the code chunk below to bring in the data and run the test.

rice <- read.table('rice.csv', sep = ',', header = TRUE)
rice
t.test(RootDryMass ~ variety, data = rice, conf.level = 0.95, alternative = "two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  RootDryMass by variety
## t = -5.2857, df = 70, p-value = 1.353e-06
## alternative hypothesis: true difference in means between group gmo and group wt is not equal to 0
## 95 percent confidence interval:
##  -23.14671 -10.46440
## sample estimates:
## mean in group gmo  mean in group wt 
##          9.666667         26.472222
  1. What are your null and alternative hypotheses for this test?

  2. What is the p-value for the two-sample t-test? What does this mean?

Answers:

  1. The null hypothesis is that mean RootDryMass is the same for wt and gmo rice (variety has no effect on RootDryMass). The alternative hypothesis is that mean RootDryMass differs between the two varieties.

  2. The p-value is 1.353e-06 (or 0.000001353), which is smaller than 0.05 (the typical significance level in biology). If the null hypothesis were true, the probability of obtaining a difference in means at least as large as the one observed would be very low, so we reject the null hypothesis.
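
To make the decision rule in the answers concrete, here is a minimal sketch of reading the p-value from the test object rather than from the printed output. It assumes a simulated stand-in for the rice data: the `toy` data frame and its values are made up, and only the column names and group sizes mirror the lab, so the numbers will not match the output above.

```r
# Simulated stand-in for rice.csv (values are made up, NOT the lab data)
set.seed(1)
toy <- data.frame(
  variety     = rep(c("gmo", "wt"), each = 36),
  RootDryMass = c(rnorm(36, mean = 10, sd = 8), rnorm(36, mean = 26, sd = 8))
)

# Same call pattern as the lab: pooled-variance two-sample t-test
res <- t.test(RootDryMass ~ variety, data = toy, var.equal = TRUE)

alpha <- 0.05
res$p.value           # the p-value as a plain number
res$p.value < alpha   # TRUE means we reject the null at the 5% level
res$parameter         # degrees of freedom: n1 + n2 - 2 = 70 for the pooled test
res$conf.int          # 95% confidence interval for the difference in means
```

Storing the result in an object (here `res`, an arbitrary name) avoids retyping values from the console when writing up the answer.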


Let’s create a boxplot comparing RootDryMass for each variety.

boxplot(RootDryMass ~ variety, data = rice, col = c("green", "blue"))

Example 1: rice continued…

Next, the researchers want to check whether other factors are influencing the growth of the rice. Because they used different fertilizers, they are curious whether fertilizer affected RootDryMass. Since there are more than 2 levels of fertilizer, the appropriate test is an ANOVA. Use the code chunk below to run an ANOVA of how mean RootDryMass differs across fertilizer treatments (‘fert’).

Fert_RDM <- aov(RootDryMass ~ fert,
          data = rice)
summary(Fert_RDM)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## fert         2   3640  1819.8   8.855 0.000378 ***
## Residuals   69  14181   205.5                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. What are your null and alternative hypotheses for this test?
  2. What is the p-value of this test? What does this mean?
  3. Do we need to run a post-hoc test? Why or why not?
     a. If you DO need to run a post-hoc test, run it in the code chunk below.
     b. What does your post-hoc analysis tell you?

Answers:

  1. The null hypothesis is that mean RootDryMass is the same for all fertilizer types. The alternative hypothesis is that mean RootDryMass differs for at least one fertilizer type.
  2. The p-value of this test is 0.000378, which is less than 0.05. If the null hypothesis were true, the probability of obtaining an F value at least as large as the one observed would be very low, so we reject the null hypothesis.
  3. Yes, we need to run a post-hoc test. The p-value is less than 0.05, meaning at least one fertilizer mean differs from the others, but the ANOVA alone does not say which. The post-hoc test tells us which pairs of fertilizers differ.
     a. See the TukeyHSD code below.
     b. The post-hoc test compares the means of each pair of fertilizers and reports adjusted p-values. There is a significant difference between NH4Cl and F10 (p adj = 0.0003499) and between NH4NO3 and F10 (p adj = 0.0122604), as both p-values are less than 0.05. NH4NO3 and NH4Cl do not differ significantly (p adj = 0.4943812).

TukeyHSD(Fert_RDM)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = RootDryMass ~ fert, data = rice)
## 
## $fert
##                    diff        lwr       upr     p adj
## NH4Cl-F10    -16.875000 -26.787877 -6.962123 0.0003499
## NH4NO3-F10   -12.166667 -22.079543 -2.253790 0.0122604
## NH4NO3-NH4Cl   4.708333  -5.204543 14.621210 0.4943812
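
Rather than reading the significant pairs off the printed table, the Tukey result can also be filtered in code. The following is a minimal sketch on simulated data: the `toy` data frame is made up, and only the column names, the three fertilizer labels and the group sizes mirror the lab, so which rows it returns will differ from the output above.

```r
# Simulated stand-in for the rice data (values are made up, NOT the lab data)
set.seed(2)
toy <- data.frame(
  fert        = factor(rep(c("F10", "NH4Cl", "NH4NO3"), each = 24)),
  RootDryMass = c(rnorm(24, mean = 30, sd = 14),
                  rnorm(24, mean = 13, sd = 14),
                  rnorm(24, mean = 18, sd = 14))
)

fit <- aov(RootDryMass ~ fert, data = toy)

# Overall ANOVA p-value, taken from the summary table
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]
anova_p

# TukeyHSD returns one matrix per factor; convert it to a data frame to filter
tuk <- as.data.frame(TukeyHSD(fit)$fert)
tuk[tuk[["p adj"]] < 0.05, ]   # keep only the pairs that differ at the 5% level
```

Each row name has the form `level2-level1`, so a negative `diff` means the first-named level has the lower mean.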

Let’s create a boxplot comparing RootDryMass for each fertilizer (‘fert’).

boxplot(RootDryMass ~ fert, data = rice, col = c("yellow", "green", "blue"))


Example 2

Preventative care is incredibly important for finding and treating many diseases before they become a problem. Researchers have been collecting data on one of the most common types of cancer, breast cancer, and how it is associated with preventative screening, in this case a mammogram. They have collected data on over 80,000 women over the last few decades, tracking whether each woman died from breast cancer and whether she had a mammogram. These researchers have asked you to look into possible links between the two. Since both variables are categorical, you will need to test for an association using a chi-squared contingency test. In the code chunk below, please read in the “mammogram.csv” dataset and run the chi-squared analysis.

Answer:

mammogram <- read.table('mammogram.csv', sep = ',', header = TRUE)
mammogram
chisq.test(mammogram$treatment, mammogram$breast_cancer_death)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mammogram$treatment and mammogram$breast_cancer_death
## X-squared = 0.01748, df = 1, p-value = 0.8948

  1. What are your null and alternative hypotheses for this test?
  2. How do you interpret the p-value for this relationship?

Answers:

  1. The null hypothesis is that treatment (having a mammogram) and breast cancer death are independent, i.e. having a mammogram is not associated with breast cancer survival. The alternative hypothesis is that the two are associated.
  2. The p-value from the chi-squared analysis is 0.8948, which is greater than 0.05. Data like these would be quite likely if the null hypothesis were true, so we fail to reject the null hypothesis: there is no evidence of an association between having a mammogram and dying from breast cancer. (Failing to reject is not the same as accepting the null hypothesis.)
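
To show what the chi-squared test compares, here is a small sketch using made-up counts (they are NOT the mammogram data, and the row and column labels are only illustrative). It prints the counts expected under independence next to the test result.

```r
# Made-up 2 x 2 counts, just to show the mechanics of the test
tab <- matrix(c(30, 70,
                45, 55),
              nrow = 2, byrow = TRUE,
              dimnames = list(treatment = c("control", "mammogram"),
                              breast_cancer_death = c("yes", "no")))

# For 2 x 2 tables chisq.test applies Yates' continuity correction by default
res <- chisq.test(tab)

res$expected    # counts expected if the two variables were independent
res$parameter   # degrees of freedom: (2 - 1) * (2 - 1) = 1
res$p.value     # compare with 0.05 to decide whether to reject the null
```

The expected count in each cell is (row total × column total) / grand total, so the top-left cell here is 100 × 75 / 200 = 37.5. The further the observed counts sit from these expected counts, the larger X-squared becomes and the smaller the p-value.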

Let’s create a mosaic plot to show the relationship between cancer survival and whether or not a patient had a mammogram.

mosaicplot(~treatment + breast_cancer_death, data = mammogram, col=c("yellow", "green"))

****

Example 3: YOU CHOOSE THE RIGHT TEST!

NOTE: for whatever test you end up choosing, you will need to make sure you address the following:

  1. Tell me WHY you are choosing the test you’re choosing.
  2. Tell me your null and alternative hypotheses.
  3. Interpret the p-value and what it means for the potential relationship.
  4. Create an appropriate graph for the relationship.
  5. If you choose an ANOVA, you need to determine whether or not you need to run a post-hoc test! Run it if you believe it is necessary.
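
The reasoning used in the earlier examples (categorical versus categorical gives a chi-squared test, a numeric response across two groups gives a t-test, and across more than two groups gives an ANOVA) can be sketched as a small helper. `suggest_test` is a hypothetical function written for illustration, not part of the lab materials, and it covers only the three tests used in this lab.

```r
# Hypothetical helper: maps variable types to the three tests used in this lab
suggest_test <- function(response, group) {
  n_groups <- length(unique(group))
  if (!is.numeric(response)) {
    "chi-squared contingency test"   # both variables categorical
  } else if (n_groups == 2) {
    "two-sample t-test"              # numeric response, exactly two groups
  } else if (n_groups > 2) {
    "ANOVA"                          # numeric response, more than two groups
  } else {
    "need at least two groups"
  }
}

# Illustrative calls on made-up vectors
suggest_test(c(9.1, 12.4, 25.0, 27.3), c("gmo", "gmo", "wt", "wt"))
suggest_test(c(1.2, 3.4, 2.2, 5.1, 4.0, 3.3), c("A", "A", "B", "B", "C", "C"))
suggest_test(c("yes", "no", "no", "yes"), c("ctl", "ctl", "trt", "trt"))
```

A helper like this does not replace the WHY in your answer; it only makes explicit that the choice depends on the type of the response and the number of groups.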

Background: in the code chunk below, please bring in “poison.csv”; you may call the dataframe whatever you want. The data come from a study looking at how different types of poison and various treatments (levels) of the poison impact how long an animal can survive (days). Given the data from the experiment, the researchers want you to look at how the mean time until death differs by ‘treat’.

HINT: you can run some of the initial code we looked at in the first few R labs to determine the types of data you have as well as the levels within the variables.
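
As a reminder of what that initial inspection might look like, here is a minimal sketch on a made-up data frame. The `toy` object and its values are invented for illustration; only the column names `time` and `treat` are taken from this example.

```r
# Made-up stand-in for the poison data (values are NOT from the study)
toy <- data.frame(
  time  = c(0.2, 0.3, 0.6, 0.7, 0.4, 0.5, 0.8, 0.9),
  treat = rep(c("A", "B", "C", "D"), each = 2)
)

str(toy)            # type of each column (numeric, character, ...)
table(toy$treat)    # number of observations in each group

# Converting to a factor makes the grouping levels explicit
toy$treat <- factor(toy$treat)
levels(toy$treat)   # names of the groups
nlevels(toy$treat)  # number of groups: 2 suggests a t-test, more suggests an ANOVA
```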

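  1. I’m running an ANOVA because the response (time) is numeric and the treatment variable has more than two groups (A, B, C and D).
  2. The null hypothesis is that treatment does not affect time, i.e. mean time is the same for all treatments. The alternative hypothesis is that mean time differs for at least one treatment.
  3. The p-value is 0.000992, which is <0.05. If the null hypothesis were true, the probability of obtaining an F value at least as large as the one observed would be very low, so we reject the null hypothesis.
  4. See code.
  5. A post-hoc analysis had to be run because the p-value was <0.05: at least one treatment mean differs, and the post-hoc test identifies which. In the Tukey output, B differs significantly from A (p adj = 0.0010358) and from C (p adj = 0.0131752); no other pair differs significantly at the 0.05 level.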
poison <- read.table('poisons (1).csv', sep = ',', header = TRUE)
poison
time_treat <- aov(time ~ treat,
          data = poison)
summary(time_treat)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## treat        3 0.9212 0.30707   6.484 0.000992 ***
## Residuals   44 2.0839 0.04736                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(time~ treat, data = poison, col=c("yellow", "green", "blue", "purple"))

TukeyHSD(time_treat)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = time ~ treat, data = poison)
## 
## $treat
##            diff         lwr         upr     p adj
## B-A  0.36250000  0.12528287  0.59971713 0.0010358
## C-A  0.07833333 -0.15888380  0.31555047 0.8143113
## D-A  0.22000000 -0.01721713  0.45721713 0.0778376
## C-B -0.28416667 -0.52138380 -0.04694953 0.0131752
## D-B -0.14250000 -0.37971713  0.09471713 0.3869986
## D-C  0.14166667 -0.09555047  0.37888380 0.3921830

```