Data has already been collected!
library(ggplot2)
fish <- read.csv("fish.csv", fileEncoding="UTF-8-BOM")
fish
##    location pool riffle
## 1         1    6      3
## 2         2    6      3
## 3         3    3      3
## 4         4    8      4
## 5         5    5      2
## 6         6    2      2
## 7         7    6      2
## 8         8    7      2
## 9         9    1      2
## 10       10    3      2
## 11       11    4      3
## 12       12    5      1
## 13       13    4      3
## 14       14    6      2
## 15       15    4      3
Researchers were interested in comparing pools and riffles in terms of the number of fish each holds. They investigated which environment the fish preferred by visiting each location and counting the fish in its pool and in its riffle.
The purpose is to determine whether a riffle supports fish as well as a pool does.
There is no reason to plot location: it is just a label, and there is no reason to think these labels are anything but arbitrary. Nothing about location answers our question, which is “in each location, are there more fish in the pool or in the riffle?” We could plot a histogram for each variable, pool and riffle, but the trouble is that we are interested in how they relate to each other. What does it mean that there are 4 fish in the pool? It means one thing if there are 2 fish in the riffle at the same location, and an entirely different thing if there are 8 fish in the riffle.
ggplot(data=fish, mapping=aes(x=riffle, y=pool)) +
  geom_point() +
  geom_abline(slope=1, intercept=0) +
  annotate("text", x=1.25, y=4, label="More in Pool") +
  annotate("text", x=3.5, y=2, label="More in Riffle")
This graph suggests that there are a good deal more fish in the pools than in the riffles: most points lie above the line where the two counts are equal. However, a good and repeatable process requires a test of significance to determine whether this difference is worth reporting. If the test indicates that a difference this large could plausibly arise by chance, then we probably need to collect more data before drawing a conclusion.
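The picture can also be summarized numerically. The sketch below rebuilds the data frame from the printed values above, so it runs without fish.csv, and counts how many locations fall on each side of the line. The column name `diff` and the object `tally` are introduced here only for illustration.

```r
# Rebuild the data from the printed table so the example is self-contained.
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Per-location difference: positive means more fish in the pool.
fish$diff <- fish$pool - fish$riffle

# Count locations above, on, and below the line where the counts are equal.
tally <- c(
  more_in_pool   = sum(fish$diff > 0),
  equal          = sum(fish$diff == 0),
  more_in_riffle = sum(fish$diff < 0)
)
print(tally)
```

With these data, 12 of the 15 locations have more fish in the pool, 2 are tied, and 1 has more fish in the riffle.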
Our null hypothesis is that there is no difference between pools and riffles in the number of fish: the population mean is the same for both.
The alternative hypothesis is that there is a difference one way or the other: the population mean is not the same for pools and riffles.
For this question, we will use a t-test because it compares the means of a quantitative variable between two groups, which is what we have.
We have two sets of counts, one for pools and one for riffles, so we will use a t-test that compares two means.
The main assumption of the t-test is that the data are close to the bell curve, i.e. approximately normal. We will check this with the following:
ggplot(data=fish) + geom_qq(mapping=aes(sample=pool))
ggplot(data=fish) + geom_qq(mapping=aes(sample=riffle))
The data should fall approximately on a line if they are normal. Both graphs seem to be close enough to a line, and so we will continue.
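Because the two counts at each location belong together, it can also be worth checking the per-location differences, since those are what a test on matched measurements actually uses. As a rough numeric supplement to the plots, the sketch below runs base R's `shapiro.test` on the differences. It assumes the data match the printed table; the names `d` and `sw` are introduced here for illustration and are not part of the original analysis.

```r
# Counts copied from the printed table, so the example runs without fish.csv.
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# Per-location differences (pool minus riffle).
d <- pool - riffle

# Shapiro-Wilk test: the null hypothesis is that d comes from a normal distribution.
sw <- shapiro.test(d)
print(sw)
```

A large p-value here would be consistent with approximate normality. With only 15 observations the test has limited power, so it complements the plots rather than replacing them.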
A significance level of 0.05 is the usual convention, so we will use 0.05.
This is a paired problem: each pool count is paired with the riffle count from the same location. This was not the case with the sleep data set, where there was no pairing between the Unrest subjects and the Deprived subjects. Because of this, we do the t.test a little differently. Specifically, we do a paired t.test.
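To see what the pairing changes, one could run both forms of `t.test` on the same counts and compare them. This is only an illustrative sketch with the data typed in from the printed table; the object names `unpaired` and `paired` are introduced here. Both calls estimate the same mean difference, but they use different standard errors and degrees of freedom.

```r
# Counts copied from the printed table, so the example runs without fish.csv.
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# Default call: Welch two-sample test, which treats the columns as independent samples.
unpaired <- t.test(pool, riffle)

# Paired call: works on the 15 per-location differences, so df = 15 - 1 = 14.
paired <- t.test(pool, riffle, paired = TRUE)

# Compare degrees of freedom and p-values side by side.
print(c(unpaired_df = unname(unpaired$parameter),
        paired_df   = unname(paired$parameter)))
print(c(unpaired_p = unpaired$p.value, paired_p = paired$p.value))
```

The paired call is the one used for this analysis: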
t.test(fish$pool, fish$riffle, paired=TRUE)
##
## Paired t-test
##
## data: fish$pool and fish$riffle
## t = 4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.170332 3.229668
## sample estimates:
## mean of the differences
## 2.2
The p-value, 0.0004264, is less than our significance level of 0.05. Therefore, we reject the null hypothesis that the means are equal.
The 95 percent confidence interval for the mean difference is 1.170332 to 3.229668. The interval does not include 0, the value that would correspond to no difference in the means; in other words, the confidence interval agrees with the test that there is a difference in the means.
The sample estimate, i.e. the mean of the differences, is 2.2. This tells us that, on average, a pool held 2.2 more fish than the riffle at the same location.
We can conclude, through our check of the normality of our data and our subsequent test, that we can reject the null hypothesis in favor of the alternative, which says that there is a difference between pools and riffles in the number of fish. In this study, the pools held about 2.2 more fish than the riffles on average, which suggests that the fish in question generally preferred the pool to the riffle.
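The numbers reported by the paired test can be reproduced by hand from the per-location differences, which shows where each one comes from. The sketch below is one way to do that, with the data typed in from the printed table; all object names (`d`, `se`, `t_stat`, `p_val`, `ci`) are introduced here for illustration.

```r
# Counts copied from the printed table, so the example runs without fish.csv.
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle          # per-location differences
n  <- length(d)              # number of pairs (15)
se <- sd(d) / sqrt(n)        # standard error of the mean difference

t_stat <- mean(d) / se                              # test statistic for H0: mean difference = 0
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)          # two-sided p-value
ci     <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se  # 95% confidence interval

print(round(c(mean_diff = mean(d), t = t_stat, lower = ci[1], upper = ci[2]), 4))
print(p_val)
```

These values should match the t.test output above: a mean difference of 2.2, t = 4.5826 on 14 degrees of freedom, and an interval from about 1.17 to 3.23.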