Ok, so here are the things you should know and be able to do in R.
There are times you might just want to enter some variables into a vector:
x <- c(1,2,3,4,5) # this will save the numbers 1,2,3,4,5 inside the OBJECT, x
x <- c(1:5) # does a similar thing, but the numbers are saved as integers rather than numbers (don't worry about this now)
x <- seq(from = 1,to = 5) # a sequence of numbers from 1 to 5 (integers)
All these methods do essentially the same thing, but if we want to use a dataset in R, we probably don’t want to type it in here (although you could). A better way to go about getting our data into R is using the read.csv() function.
#file.choose() # using this to find the file of interest
onetail <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\in class workthroughs\\Bio260\\Ch11_OneTail.csv") # reading in the file and saving it as the object "onetail"
read.csv() assumes that your data has a header (header = TRUE). If that’s not the case, you need to remember to include header = FALSE in your code:
onetail <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\in class workthroughs\\Bio260\\Ch11_OneTail.csv", header = FALSE)
In this case, our data does have a header, so we won’t do that.
Once our data is in, we can check it by printing it to make sure it imported correctly:
print(onetail) # looks good
## molarity
## 1 0.97
## 2 0.95
## 3 1.01
## 4 0.96
## 5 1.03
## 6 0.98
## 7 1.04
## 8 0.95
## 9 0.95
## 10 0.97
## 11 0.97
## 12 1.01
## 13 0.99
## 14 1.02
## 15 1.03
## 16 0.97
Now that we have our data saved as an object we can use, we can perform tests on it.
t.test(onetail)
##
## One Sample t-test
##
## data: onetail
## t = 126.18, df = 15, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.9708188 1.0041812
## sample estimates:
## mean of x
## 0.9875
R has given us the ts (tcalc), the degrees of freedom (df), the p-value, the 95% CI (remember R defailts to 95%), and the sample mean.
This looks a little off though… Remember that this molarity problem is one-tailed and \(\mu\)= 1.0. R defaults to a two-tailed t-test and a mu = 0. Using the arguments or subcommands alternative = and mu =, we can tell R to change those.
t.test(onetail, alternative = "less", mu = 1.0)
##
## One Sample t-test
##
## data: onetail
## t = -1.5972, df = 15, p-value = 0.06554
## alternative hypothesis: true mean is less than 1
## 95 percent confidence interval:
## -Inf 1.00122
## sample estimates:
## mean of x
## 0.9875
Now, we get the correct values and see that our p-value is 0.06554, which is greater than our \(\alpha\) of 0.05, therefore, we fail to reject the null hypothesis, H0.
How did I know to use alternative = and mu =? If you’re ever lost or confused about how to use a function, you can place a ? before a function name to look up help information about it:
?t.test # prints help information about Student's t-Test and the arguments used within it
Super important note about using “less” and “greater”: “less” means that the HA is that the value you entered first is less than the value you entered second. Similarly, “greater” means that the HA is that the value you entered first is “greater” than the value you entered second.
Now we’ll do that same thing with the pH data from class.
#file.choose()
twotail <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\in class workthroughs\\Bio260\\Ch11_TwoTail.csv")
print(twotail)
## pH
## 1 6.89
## 2 7.19
## 3 7.64
## 4 9.48
## 5 7.16
## 6 8.00
## 7 7.15
## 8 6.25
## 9 7.38
## 10 8.57
## 11 6.69
## 12 6.10
## 13 6.21
## 14 7.99
## 15 8.97
## 16 6.03
## 17 6.79
## 18 6.34
## 19 4.52
## 20 6.85
t.test(twotail, mu = 6.3)
##
## One Sample t-test
##
## data: twotail
## t = 3.1854, df = 19, p-value = 0.004871
## alternative hypothesis: true mean is not equal to 6.3
## 95 percent confidence interval:
## 6.577779 7.642221
## sample estimates:
## mean of x
## 7.11
Maybe this time, I want a 99% confidence interval instead though. Using the conf.level = argument, I can specify this:
t.test(twotail, conf.level = 0.99, mu = 6.3)
##
## One Sample t-test
##
## data: twotail
## t = 3.1854, df = 19, p-value = 0.004871
## alternative hypothesis: true mean is not equal to 6.3
## 99 percent confidence interval:
## 6.382512 7.837488
## sample estimates:
## mean of x
## 7.11
I now get the same values except my confidence interval has changed.
When you need to conduct two-sample t-tests, the first step is determining if the variances are equal. You do this by using the var.equal() command.
# first I'll load in the data and save it as an object
#file.choose()
ftest <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\in class workthroughs\\Bio260\\Ch12_FTest.csv") # saving data as an object
head(ftest) # checkin the data. The head() function prints the first 6 rows.
## Pop.1 Pop.2
## 1 19 25
## 2 21 29
## 3 20 20
## 4 24 23
## 5 22 18
## 6 23 19
var.test(ftest$Pop.1, ftest$Pop.2) # doing the variance test
##
## F test to compare two variances
##
## data: ftest$Pop.1 and ftest$Pop.2
## F = 0.40172, num df = 17, denom df = 22, p-value = 0.05946
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1641191 1.0386504
## sample estimates:
## ratio of variances
## 0.4017245
The results of the var.test() tells me that my F-ratio (Fcalc) is 0.40172. This should already raise a red-flag…but we’ll come back to this. The test also gives the degrees of freedom for the numerator (the first data array entered) and the df for the denominator (the second array entered). Next, a p-value is reported (p = 0.05946). We get a 95% CI too (not super relevant).
Ok, so that ratio is <1 which means the smaller variance was in the numerator, which is not what we want - that gives us the left-tail Fcalc, we want the right-tail Fcalc. We could do 1 of 2 things to find the right-tail value: 1. simply find the reciprocal of the value given (\(1/F_{left.tail}\)), or 2. do the F-test again with the variables switched
Here’s an example of both:
# first option
1/0.40172
## [1] 2.489296
# second option
var.test(ftest$Pop.2, ftest$Pop.1)
##
## F test to compare two variances
##
## data: ftest$Pop.2 and ftest$Pop.1
## F = 2.4893, num df = 22, denom df = 17, p-value = 0.05946
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9627878 6.0931361
## sample estimates:
## ratio of variances
## 2.489268
Now we get the right-tail Fcalc. Looking at Fcrit and Fcalc we could determine that we do not reject the null hypothesis. We can come to the same conclusion looking at the p-value since p > \(\alpha\).
We call a column from a dataframe (a data set with a mix of characters and numbers) in this format: function.name(object.name$column.name). R doesn’t like spaces. When naming things, use _ or . instead (some people prefer to nameThingsLikeThisToo). Use names that make sense. “Ch12” isn’t super helpful - “FTest” is.
Two sample t-tests are used when we have data from two samples (duh) instead of one sample and one population (a one sample t-test). Remember both two-sample and one-sample t-test can be either one- or two-sided. The “sided-ness” of a test depends on the question and the a priori (before-doing-it) hypotheses about the data.
When we do a two sample t-test in R, we do it practically the same way we do one sample t-tests, but we don’t set a \(\mu\) value.
#file.choose()
wrens <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\in class workthroughs\\Bio260\\Wrens.csv")
head(wrens)
## Males Females Juveniles
## 1 46.0 40.0 44.0
## 2 38.0 38.0 38.0
## 3 42.5 43.0 35.0
## 4 41.0 44.0 37.0
## 5 44.0 42.5 45.0
## 6 45.5 45.5 35.5
var.test(wrens$Males, wrens$Juveniles) # p = 0.3658 <- accept the null hypothesis (var are equal)
##
## F test to compare two variances
##
## data: wrens$Males and wrens$Juveniles
## F = 1.5477, num df = 15, denom df = 19, p-value = 0.3658
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5913759 4.2918307
## sample estimates:
## ratio of variances
## 1.5477
The question for this data is whether or not the juveniles have reached adult male weight (do juveniles weigh less than males?)
t.test(wrens$Juveniles, wrens$Males, # the "Juveniles" and "Males" colums of the "wrens" data
alternative = "less", # do the Juveniles weight LESS than the males?
var.equal = TRUE) # the variances are equal (could also use T or 1 instead of TRUE)
##
## Two Sample t-test
##
## data: wrens$Juveniles and wrens$Males
## t = -1.6318, df = 34, p-value = 0.05598
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.07521618
## sample estimates:
## mean of x mean of y
## 39.550 41.625
Doing this test, we find that no, the juveniles do not weigh less than the males (p = 0.056). If the variances were unequal, we could have either put var.equal = FALSE or left that argument blank since R defaults to a Welch’s t-test (unequal variances) since a Welch’s t-test can be used with heteroscedastic or homoscedastic data.
Using the same dataset, I’m curious, do males and females weigh the same? Seems like the perfect question to use a two sample t-test to answer.
var.test(wrens$Males, wrens$Females)
##
## F test to compare two variances
##
## data: wrens$Males and wrens$Females
## F = 1.3103, num df = 15, denom df = 14, p-value = 0.6186
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4442686 3.7886779
## sample estimates:
## ratio of variances
## 1.310291
Looks like variances are equal. Cool.
t.test(wrens$Males, wrens$Females,
var.equal = TRUE) # don't need the alternative argument because default is "two.sided" (i.e. two-tailed)
##
## Two Sample t-test
##
## data: wrens$Males and wrens$Females
## t = -0.21537, df = 29, p-value = 0.831
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.236419 2.619752
## sample estimates:
## mean of x mean of y
## 41.62500 41.93333
And there we have it. Females and males do not have significantly different weights (p = 0.831).
We use paired t-tests when data is paired, meaning data are NOT independent and have some sort of paired arrangement (e.g. before/after, right/left, etc.)
Do cholesterol levels change with meat diet? Let’s find out.
#file.choose()
meat <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\in class workthroughs\\Bio260\\MeatDiets.csv")
head(meat)
## Subject Beef Pork
## 1 1 206 185
## 2 2 230 212
## 3 3 190 182
## 4 4 231 232
## 5 5 265 221
## 6 6 204 179
# we don't do F-tests on paired data, so we'll move right into the t.test
t.test(meat$Beef, meat$Pork,
paired = TRUE) # this argument tells R that the data is paired
##
## Paired t-test
##
## data: meat$Beef and meat$Pork
## t = 4.7078, df = 9, p-value = 0.001108
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 10.64958 30.35042
## sample estimates:
## mean of the differences
## 20.5
# no extra argument for sided-ness since this is two-tailed
It looks like there is an impact of diet on cholesterol levels (p = 0.001). That’s all I can say right now. If I looked at the means of the two groups, I could say which is likely higher than the other…
mean(meat$Beef)
## [1] 222.9
mean(meat$Pork)
## [1] 202.4
Now, I can see that beef has a mean cholesterol level of 222.9 and pork has a mean of 202.4. Therefore, with some confidence, I can say that the difference in means we see in the paired t-test is likely because beef diets lead to higher cholesterol levels than pork (p = 0.001). These results are pretty sound, but they could have been stronger if we had made an a priori hypothesis about the direction (which meat would cause higher cholesterol).
t.test(meat$Beef, meat$Pork,
alternative = "greater",
paired = TRUE)
##
## Paired t-test
##
## data: meat$Beef and meat$Pork
## t = 4.7078, df = 9, p-value = 0.0005539
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 12.51783 Inf
## sample estimates:
## mean of the differences
## 20.5
The p-value reported now is (not surprisingly) half the p-value from above since we’re looking at only the right-tail. This lower p-value means our results are even more significant (and stronger) when we make a priori hypotheses about direction.
There are couple different times when we want to use a chi-square test, but it’s basically used to find if a data set fits a predicted/expected distribution (in some way or another).
The three types of chi-squared tests are: 1. Goodness of fit (GoF): testing how well that data fits an expected distribution 2. Test of Independence: testing if two traits in one population are independent of each other (does the occurence of one impact the occurence of the other) 3. Test of Homogeneity: testing if the occruence of one trait is homogenous (occuring at equal freqencies) in two populations
The function we use in R for the \(\chi^{2}\) test is chisq.test(). This command needs two arguments: the observed values and the expected probabilities/distribution (p =).
We can also use the pchisq() function with our \(\chi^{2}_{calc}\), df (\(g-1-r\)), and lower-tail = F (since R automatically gives us the left/lower-tail, which is not what we want. Remeber that \(\chi^{2}\) is right-tailed).
For goodness of fit tests, we are asking, “How well does our data fit this distribution?” Note that for a GoF test, the df is equal to \(g-1-r\), where r is number of estimated parameters. For a binomial distribution, \(r=1\) because we estimate the probability of success using the equation \(\overline{x} = kp\), so that \(p=\overline{x}/k\). Since we used known/given* values to estimate our probability of success (p), we lose a degree of freedom.
obs1 <- c(5, 6, 8, 6, 3, 2) # making list of observed values
exp1 <- c(0.06950, 0.24481, 0.34497, 0.24304, 0.08562, 0.01206) # making list of expected values
chisq.test(obs1, p = exp1) # doing a chisq test
## Warning in chisq.test(obs1, p = exp1): Chi-squared approximation may be
## incorrect
##
## Chi-squared test for given probabilities
##
## data: obs1
## X-squared = 12.573, df = 5, p-value = 0.02772
The df for this test are wrong though. R used \(n-1\) instead of \(g-1-r\) for the df, so the results of this test are not correct (this is something you have to remember to look out for).
The output from this gives us the \(\chi~{2}\) value though, which means we can now plug that in to the pchisq() function:
pchisq(12.573, 4, lower.tail = F)
## [1] 0.01356209
Now, we get an accurate p-value of 0.01356, meaning that we conclude our data does not fit a binomial distribution.
Mendelian segregation probabilities are another example of when this version of a \(\chi^{2}\) can be used.
We’ll begin the same way as with the binomial example:
obs2 <- c(180, 143, 126, 51)
exp2 <- c(9/16, 3/16, 3/16, 1/16) # phenotypes based on Mendelian genetics for a dihybrid cross with complete dominance
chisq.test(obs2, p = exp2)
##
## Chi-squared test for given probabilities
##
## data: obs2
## X-squared = 85.899, df = 3, p-value < 2.2e-16
This time, our values are correct because for Mendelian genetics questions, \(r=0\) since we don’t have to estimate any parameters - all values are given to us based on genetic theory.
Both Test of Independence and Homogeneity are calculated in the same way with the same degrees of freedom, \((r-1)*(c-1)\), where r is the number of rows and c is the number of columns. The difference between these two tests is the hypotheses being asked. For test of indendence, we are asking, “Are two traits in one population independent from each other?”, whereas in test of homogeneity, we are asking, “Does the single trait occur at equal frequencies in two populations?”
For both of these tests, we need to create matrices using the matrix() command. This command uses the combined (c()) values and either nrow = or ncol = to specify the structure of the matrix.
eyes <- matrix(c(40, 27, 22, 20, 12, 16, 22, 24, 27, 28, 49, 32),
nrow = 3) # making a matrix with the data in 3 rows
*Note that the values are printed in a “column fashion” with 40, 27, and 22 as the first value in each row rather than filling the rows in a “row fasion” with 40, 27, 22 all being in the first row. I personally find it more intuitive to use thencol = command for this reason.
Next, we can name the rows and columns. This isn’t a completely necessary step, but does make reading the matrix much easier. We do this with the rownames() and colnames() commands. Note that the format of this is a bit different from what we’ve seen before. We’re essentially forcing the combined names into the matrix we created in a row-wise or column-wise fashion.
rownames(eyes) <- c("good", "moderate", "poor") # naming the rows made in the previous line
colnames(eyes) <- c("brown", "black", "green", "blue") # naming the columns made previously
eyes # printing the data to make sure it looks right
## brown black green blue
## good 40 20 22 28
## moderate 27 12 24 49
## poor 22 16 27 32
chisq.test(eyes) # running the test
##
## Pearson's Chi-squared test
##
## data: eyes
## X-squared = 13.635, df = 6, p-value = 0.03398
The results of this test are corrent since R knows to use \((r-1)*(c-1)\) for the df, but again, it’s ALWAYS a good idea to make sure that R (or you) didn’t make a mistake.
The way this analysis is done is identical to the the test for independence, but it’s up to you to correctly interpret the results.
homogen <- matrix(c(54, 62, 16, 16, 30, 22), ncol = 3) # using ncol this time
rownames(homogen) <- c("male", "female")
colnames(homogen) <- c("good", "moderate", "poor")
homogen
## good moderate poor
## male 54 16 30
## female 62 16 22
chisq.test(homogen)
##
## Pearson's Chi-squared test
##
## data: homogen
## X-squared = 1.7825, df = 2, p-value = 0.4101
If we save the chisq.test() as an object, we can pull more information from it:
# sex and visual acuity data:
homogeneity <- chisq.test(homogen)
summary(homogeneity) # tells us what info we can get
## Length Class Mode
## statistic 1 -none- numeric
## parameter 1 -none- numeric
## p.value 1 -none- numeric
## method 1 -none- character
## data.name 1 -none- character
## observed 6 -none- numeric
## expected 6 -none- numeric
## residuals 6 -none- numeric
## stdres 6 -none- numeric
homogeneity$expected # prints the expected #s
## good moderate poor
## male 58 16 26
## female 58 16 26
homogeneity$statistic # prints the chisq statistic
## X-squared
## 1.782493
homogeneity$parameter # prints df
## df
## 2
When we want to conduct tests on the means of more than two groups we often use an analysis of variance (ANOVA) (next chapter), but to do an ANOVA, our data must be homoscedastic. To test if this assumption is true, we can’t use regular F-tests because the risk of making a type 1 error (rejecting a true null hypothesis) increases exponentially (\(1-(1-\alpha)^{n}\), where n is the number of F or t-tests done). Because of this, we use other methods that keep our \(\alpha\) levels low.
In R, we use the bartlett.test() function with arguments for the values and the factors by which the values are categorized (response variable, explanatory variable). We also use the leveneTest() function which is in the car package. The input for leveneTest() is the same as for bartlett.test().
Also, the null hypothesis of the variance tests is that the variances are equal (we want this). The alternative is that at least one variance is different from the others (we do not want this). When variances are unequal, we need to tranform our data.
# setting up the data
set1 <- c(3,4,5,6,7) # entering the data
set2 <- c(6,6,7,9,12)
set3 <- c(7,10,12,13,13)
all <- c(set1, set2, set3) # combining the data into a single group
The data could be entered using just the last line with all values, but entering them separately slightly reduces the risk of making a typo.
groups <- factor(rep(letters[1:3], each = 5)) # making an object "groups" that's a factor with 3 letters: a, b, and c
This line says, “make factors using a”repition" of “letters” 1-3 (aka letters a, b, and c), and do each letter 5 times"
If you print “groups”, you’ll see this:
groups
## [1] a a a a a b b b b b c c c c c
## Levels: a b c
Where the groups are just a series of a’s, b’s, and c’s that will align with the values in all.
bartlett.test(all, groups)
##
## Bartlett test of homogeneity of variances
##
## data: all and groups
## Bartlett's K-squared = 0.96044, df = 2, p-value = 0.6186
# running the test on "all" with factor set as "groups" (the groups we just made - this will re-add the sets)
Doing this test, we get a p value of 0.6186, which means we accept the null hypothesis that the groups have equal variances.
If you don’t already have the package installed, you’ll have to install the car package using install.packages("car") (some people have been having issues with this).
library(car) # loading the package so we can use the functions
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
leveneTest(all, groups) # testing "all" based on the "groups" we made
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.2222 0.804
## 12
We get a p values of 0.804, which confirms the Bartlett’s test results that the variances are equal between the groups.
If our data does not have equal variances, we need to tranform our data (use the same function on each variate) so that the variances are equal. We do this so that we can analyze our data using methods that require homoscedastic data (like ANOVAs).
# the data
data1 <- c(3,4,5,6,7)
data2 <- c(2,3,3,4,8)
data3 <- c(5,10,15,25,30)
all.2 <- c(data1, data2, data3)
groups.2 <- factor(rep(letters[1:3], each = 5)) # the same as above
Doing the Bartlett test:
# Bartlett
bartlett.test(all.2, groups.2)
##
## Bartlett test of homogeneity of variances
##
## data: all.2 and groups.2
## Bartlett's K-squared = 13.152, df = 2, p-value = 0.001393
p = 0.001393 -> reject null (variances unequal)
Doing the Levene’s test:
# Levene
leveneTest(all.2, groups.2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 5.9893 0.01571 *
## 12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p = 0.01571 -> reject null (variances unequal)
Because the variances are unequal, we need to transform the data for use in futher analyses:
transall <- log(all.2) # log() is natural log!
bartlett.test(transall, groups.2)
##
## Bartlett test of homogeneity of variances
##
## data: transall and groups.2
## Bartlett's K-squared = 1.9965, df = 2, p-value = 0.3685
p = 0.3685 -> accept null
leveneTest(transall, groups.2)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.9358 0.4191
## 12
p = 0.4191 -> accept null
Now that the data has equal varianes, we can analyzed the means using an ANOVA (next chapter).
The lab manual has you type the data into R for use. As you know, I am not a fan of this method. Using $ you can “call” columns from objects to run analyses on:
cancer <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\in class workthroughs\\Bio260\\Ch17_LSA.csv")
head(cancer)
## condition LSA
## 1 normal 11.5
## 2 normal 14.0
## 3 normal 13.3
## 4 normal 16.5
## 5 normal 15.9
## 6 normal 16.8
I use the head() command just about every time I load data into R, so that I can (1) check that the data was imported correctly and (2) so that I can see what the column headers are called. In this case, the data columns are “condition” and “LSA”.
Using the column names, I can do the variance tests.
bartlett.test(cancer$LSA, cancer$condition)
##
## Bartlett test of homogeneity of variances
##
## data: cancer$LSA and cancer$condition
## Bartlett's K-squared = 1.1739, df = 3, p-value = 0.7593
leveneTest(cancer$LSA, cancer$condition)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.6563 0.5853
## 30
Note that the format of the commands is: VarianceTest(Data$ResponseVariable, Data$ExplanatoryVariable), where response is our y and explanatory is our x.
Similar to how we did F-tests before doing two-sample t-tests, we do variance tests before we do ANOVAs. Now that we’ve done variane tests (and learned how to tranform our data when variances are unequal), we can do our main test, the ANOVA. ANOVAs are used to assess if there are differences between the means of more than two groups. The null hypothesis of an ANOVA is that at least one group mean is different from the others. Unfortunately, ANOVAs do not tell us which mean is different or in which direction. We need to do post-hoc testing and visualize our data for that.
The functions that we use for doing an ANOVA are aov() and summary(). We’ve used the summary() function before to get more information from an object. We’ll use it the same way here. For the aov() function, the input is very similar to the input for the variance tests in the previous chapter, except for one difference, we use a ~ instead of a comma when we input the arguments.
As for looking at the data, the lab manual says to visualize your data after you’ve done the tests. I disagree with this. You should look at your data before you begin. Take at look at the data and see what patterns you think there are. The tests are to test whether the differences you see are actually significant. That’s more like how analyses are done in the real world. So, that’s how I’m going to start.
The plot() function is suuuuuuuper easy to use and makes understanding your data and analyses soooo much easier. In many (but not all) cases, the only thing you have to input into plot() is the name of your data object. It’s that easy.
plot(cancer) # not doing head() because I did it a few lines ago
And just like that, we get a beautiful and suuuuuuper helpful boxplot of the data. Just looking at this we can see that the primary and recurrent levels of LSA are pretty high compared to normal and maybe compared to benign. The difference between normal and benign look like they could also be significant, but it’s hard to say without some stats.
Ok, so now, we do the stats.
res.LSA <- aov(cancer$LSA~cancer$condition) # aov() does the ANOVA
Here, I’m saving the results of the ANOVA (res.anova) in an object. Again, the format for the input is aov(DataName$ResponseVariable~DataName$ExplanatoryVariable). However, unlike the leveneTest() or bartlett.test(), the data for the aov() can also be input as aov(Response~Explanatory, data = DataName). The second version might be more intuitive:
res.LSA <- aov(LSA~condition, data = cancer)
If you print the object just created, you get some of the information, but not all of it:
res.LSA
## Call:
## aov(formula = LSA ~ condition, data = cancer)
##
## Terms:
## condition Residuals
## Sum of Squares 263.7343 249.0907
## Deg. of Freedom 3 30
##
## Residual standard error: 2.881497
## Estimated effects may be unbalanced
To get the full ANOVA table, we need to use the summary() function:
summary(res.LSA)
## Df Sum Sq Mean Sq F value Pr(>F)
## condition 3 263.7 87.91 10.59 6.53e-05 ***
## Residuals 30 249.1 8.30
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we can see the df, SS, MS, F, and most importantly the p-value which is 6.53e-05, meaning that there is most definitely a difference in the means between at least two samples.
Cool. Now what? The results of this test tell us that something is different but not what or how (these results aren’t super interesting or helpful). To really get to the bottom of things are figure out how our samples differ we need to do post-hoc tests.
We do post-hoc testing to see what pairwise groups are driving the difference between the means. There are a couple different ways to do this including Bonferroni correction and Tukey’s test. The input for Tukey’s test is a lot simpler which is great because Tukey is a more powerful test.
The function we use for this is pairwise.t.test(). The input for the function starts like the input for the leveneTest() and bartlett.test(), but there are few additional arguments needed. We need to specify how we want the p-values to be adjusted. To do this we include p.adj = "bonf" to tell R that we want the p-values adjusted using the Bonferroni correction which essentially means splitting the total \(\alpha\) between all pairwise t-tests (\(\alpha_{for each test}=\alpha_{total}/NumberOfTtests\)). Another thing that we can do in R is we can pool the variances (since we’re assuming that the variances are equal). This makes the test a little more powerful. We do this with the pool.sd = TRUE argument.
pairwise.t.test(cancer$LSA, cancer$condition, p.adj = "bonf", pool.sd = TRUE)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: cancer$LSA and cancer$condition
##
## benign normal primary
## normal 0.05498 - -
## primary 0.39406 0.00068 -
## recurrent 0.12099 0.00015 1.00000
##
## P value adjustment method: bonferroni
The output of the Bonferroni test is a series of p-values from pairwise t-tests. Looking at this output, we see that the results are significant when we compare normal to primary and normal to recurrent. This would mean that LSA levels are useful for determining if women have dangerous forms of breast cancer.
The function TukeyHSD() is used to do a Tukey-Kramer test. The input for this function is very simple since all you have to do is enter the model (the object you saved your ANOVA to).
TukeyHSD(res.LSA)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = LSA ~ condition, data = cancer)
##
## $condition
## diff lwr upr p adj
## normal-benign -3.5900000 -7.0939655 -0.08603454 0.0430409
## primary-benign 2.7128571 -1.1483247 6.57403902 0.2452200
## recurrent-benign 3.4842857 -0.3768962 7.34546759 0.0884754
## primary-normal 6.3028571 2.4416753 10.16403902 0.0006231
## recurrent-normal 7.0742857 3.2131038 10.93546759 0.0001380
## recurrent-primary 0.7714286 -3.4166112 4.95946833 0.9582376
The output of this line is a series of pairwise comparisons, but the format is a bit different. On the left side we have a list of all possible combinations, the second column is the difference in means, the 2nd and 3rd columns are the lower and upper limits of the Tukey test (don’t worry about this), and the 4th column has the p-values for the comparisons. This is what we care about. Looking at this column, we see that the normal-benign, primary-normal, recurrent-normal groups are all significant. This means that normal LSA levels are significantly different from all forms of breast cancer, but that all forms of breast cancer are indistinguishable from each other using LSA levels alone (bummer).
This is pretty similar to a one-factor (one-way) ANOVA, except there are two categorical factors that might influence the continuous response variable.
When we do a two-factor ANOVA, we have 3 null and 3 alternative hypotheses:
1: \(H_{0}:\) There is no difference in the means between groups in factor A (\(\mu_{i}=\mu_{j}=...=\mu_{n}\) or \(\sigma^{2}_{among}<\sigma^{2}_{within}\)); \(H_{A}:\) There is a difference between at least two group means in factor A (\(\mu_{i}\neq\mu_{j}\) or \(\sigma^{2}_{among}>\sigma^{2}_{within}\))
2: \(H_{0}:\) There is no difference in the means between groups in factor B; \(H_{A}:\) There is a difference between at least two group means in factor B
3: \(H_{0}:\) There is no interaction between factor A and factor B; \(H_{A}:\) There is an interaction between factor A and B
We go about assessing these problems in a similar way to one-factor ANOVAs - we start by plotting the data to see what patterns there are (what patterns are we looking to confirm or refute with our formal test), then we do a pretest to check if the assumption of equal variances is met.
corn <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\homework\\Project\\Take Home Quiz\\Q4_v1.csv")
head(corn)
## Density Rotation.Variety Corn.Yield
## 1 05 k/ha Pea 7.8
## 2 05 k/ha Pea 9.1
## 3 05 k/ha Pea 10.6
## 4 05 k/ha Soy 7.0
## 5 05 k/ha Soy 6.7
## 6 05 k/ha Soy 8.1
interaction.plot(corn$Density, corn$Rotation.Variety, corn$Corn.Yield)
When doing an interaction plot, remember that the order of arguments is interaction.plot(x-axis variable, line-type, y-axis variable) and if you have the option, try to make the x-axis a continuous variable with numbers. For example, doing the following plot, doesn’t make as much intuitive sense as the previous:
interaction.plot(corn$Rotation.Variety, corn$Density, corn$Corn.Yield)
It’s not incorrect, but interpretting th results is much more difficult because it doesn’t really make sense to draw a line from Pea to Soy to Wheat.
When we do ANOVAs (both one- and two-factor), we need to first check the assumption of equal variation using one of the methods learned in Chapter 16. Here, I’ll do Levene’s test and Bartlett’s test. When we do pretests for a two-factor ANOVA, we need to do two tests - one for each factor.
library(car) # remember that you need to first load the "car" package before doing the leveneTest()
leveneTest(corn$Corn.Yield~corn$Density) # Are variances equal in Corn Yield when grouped by Density?
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 1.6446 0.1986
## 32
leveneTest(corn$Corn.Yield~corn$Rotation.Variety) # Are variances equal in Corn Yield when grouped by Rotation Variety?
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 3.952 0.02893 *
## 33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at this output, we see that variances are not equal when Corn Yield is grouped by Rotation Variety
bartlett.test(corn$Corn.Yield, corn$Density)
##
## Bartlett test of homogeneity of variances
##
## data: corn$Corn.Yield and corn$Density
## Bartlett's K-squared = 9.2349, df = 3, p-value = 0.02633
bartlett.test(corn$Corn.Yield, corn$Rotation.Variety)
##
## Bartlett test of homogeneity of variances
##
## data: corn$Corn.Yield and corn$Rotation.Variety
## Bartlett's K-squared = 7.389, df = 2, p-value = 0.02486
Looking at this output, we see that the variances are not equal by both Density and Rotation Variety (remember that Bartlett’s test is more senstitive to deviations from normality, which is probably why the p-values for this test are lower than for Levene’s test).
Regardless, this means that we need to transform the data to meet the asusmption of equal variances.
Since this is measurement data, we should use a log transformation (at least first, if that doesn’t work, we can try something else).
corn.log <- log(corn$Corn.Yield) # transforming the data - the actual numbers
Now, we’ll run the pretests again using the transformed data. We do that using the format of: leveneTest(transformedDataName, dataName$columnName).
leveneTest(corn.log, corn$Density)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.7455 0.5329
## 32
leveneTest(corn.log, corn$Rotation.Variety)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.1426 0.8676
## 33
Now, our p-values are greater than 0.05 meaning we can accept the null hypothesis of equal variances by both groups and we can move on to the main test.
The format we use for doing the ANOVA is aov(dataName$responseColumn~dataName$factorA*dataName$factorB) where A and B can be interchangeable.
Don’t forget that we want to use the transformed data (the data that actually meets the assumption of the test)!!!
corn.aov <- aov(corn$Corn.Yield~corn$Density*corn$Rotation.Variety)
summary(corn.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## corn$Density 3 99.5 33.18 16.328 5.38e-06 ***
## corn$Rotation.Variety 2 442.4 221.22 108.871 9.17e-13 ***
## corn$Density:corn$Rotation.Variety 6 82.2 13.71 6.745 0.000278 ***
## Residuals 24 48.8 2.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output tells us that factor A, factor B, and the interaction are all significant, but the only one we’re really interested in is the interaction. This means that the impacts of Density and Rotation Variety on Corn Yield depend on each other.
Looking at the interaction plot from before…
interaction.plot(corn$Density, corn$Rotation.Variety, corn$Corn.Yield)
…it looks like corn yield increases from rotation densities of 5 k/ha to 15 k/ha for Pea and Soy crop rotations then declines from there. When Wheat is the crop used, there is a decrease in yield from 5 k/ha to 15 k/ha then a slight increase, although the strength of this relationship is weaker than the one for Pea or Soy.
We do regressions and correlations to assess the linear relationship between two continuous variables. The difference between regression and correlation is that for regressions, we are interested in the slope (and sometimes the intercept) of the best-fit line between the controlled, x-variable and the y-variable. For correlations, we do not report a best-fit line, we’re interested in how well the two variables correlate to each other and whether that correlation is negative or positive. For regressions, we report our results with \(R^{2}\), the Coefficient of Determination, the slope of the line (\(b\)), and the associated p-value. For correlations, we report our results with \(r\), the Correlation Coefficient, and the associated p-value.
The key difference when deciding whether a problem is a regression or a correlation lies in whether or not the x-variable has been set by the experimenter. If it has, the question is an experiment and can imply causality. If it hasn’t, the question is purely observational and it cannot imply causality, only correlation.
Reminder that the equation for a line is given by: \(y=a+bx\)
Let’s say, that for this problem, the experimenter chose the weights of each sample taken. If that were the case, then each x-value is set and we could predict the y-values (which is the whole point of calculating a best-fit line - to make predictions about how a change in one variable impacts the other). This can’t be said for correlations.
Before we do the regression, let’s look at the data and make some predictions.
alc <- read.csv("C:\\Users\\Kyra\\OneDrive\\Documents\\Biology MS\\260TA\\Fall 2019\\homework\\Project\\Take Home Quiz\\Q7_v1.csv")
head(alc)
## Weight.of.Grains Alcohol.Percentage
## 1 9.9 4.3
## 2 10.3 4.5
## 3 9.8 6.1
## 4 9.1 4.5
## 5 10.3 5.1
## 6 11.1 6.3
plot(alc$Alcohol.Percentage~alc$Weight.of.Grains)
Looking at this, there doesn’t appear to be a strong relationship between the variables. Maybe a slight negative relationship. Let’s do the formal analysis and see.
The format used for the input is the same as for a one-factor ANOVA, except we use the lm() function.
alc.lm <- lm(alc$Alcohol.Percentage~alc$Weight.of.Grains)
summary(alc.lm)
##
## Call:
## lm(formula = alc$Alcohol.Percentage ~ alc$Weight.of.Grains)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2915 -1.6936 -0.8255 2.1340 4.0255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.9870 7.4595 1.875 0.0762 .
## alc$Weight.of.Grains -0.6808 0.7078 -0.962 0.3482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.34 on 19 degrees of freedom
## Multiple R-squared: 0.04644, Adjusted R-squared: -0.003752
## F-statistic: 0.9252 on 1 and 19 DF, p-value: 0.3482
From this output, we can see (looking in the “Coefficients” section) that the best estimate for our intercept (given by “(Intercept)”) is 13.9870, and the best estimate of a slope (given by “alc$Weight.of.Grains”) is -0.6808, but the p-values for both of these values are not significant (> 0.05). If this were the case, we’d still report the values, but include the p-value and state the there is not a significant linear relationship between the weight of grains and the alcohol percentage since the slope is not significantly different from 0 (we accept the null hypothesis of the regression). We’d also report the R2 value for this data. R prints two versions, the Multiple-R-squared and the Adjusted R-squared. Typically, you’d report the adjusted value (but for the exam, I’ll accept either).
We can also plot the data again with the best-fit line.
plot(alc$Alcohol.Percentage~alc$Weight.of.Grains)
abline(alc.lm)
Yeah…that line doesn’t explain much of the spread of the data at all, which is expressed in a low R2 value (since R2 is essentially the proportion of variation explained by the best-fit line).
If you were interested in the confidence interavals for the regression, you could use the confint() command with your regression name to get them:
confint(alc.lm)
## 2.5 % 97.5 %
## (Intercept) -1.625930 29.5999888
## alc$Weight.of.Grains -2.162278 0.8006147
For correlation, the two variables we’re interested in reporting are r and it’s p-value. You can get both of these variables using the cor.test() function.
Using the same data set from before as if the weights of the grains were not set by the experimenter, we would get…
cor.test(alc$Weight.of.Grains, alc$Alcohol.Percentage) # Note that you use a , instead of a ~
##
## Pearson's product-moment correlation
##
## data: alc$Weight.of.Grains and alc$Alcohol.Percentage
## t = -0.9619, df = 19, p-value = 0.3482
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5920971 0.2383715
## sample estimates:
## cor
## -0.2154895
…that \(r=-0.2155\) (from “cor” at the bottom of the output) and the p-value is 0.3482 (which you’ll notice is the same as the p-value from the regression). That’s it. Looking at those values we would say that there is not a significant correlation between the weight of the grains and the alcohol percentage of the beer.
Note that we use a slightly different format here. Rather than using cor.test(data$y~data$x), we use cor.test(data$x, data$y). You must remember that the order of x and y change depending on whether you’re using the comma or the tilde. There are times when this matters more than others. When doing correlations, x and y don’t really mean much since neither variable are technically independent or dependent. I’ve only chosen to keep Weight of Grains on the x to be consistent with the previous problem (and because it makes more logical sense), but switching the variables would give us the same results.