This experiment uses statistics to compare fish counts in two types of stream habitat. Our goal is to determine whether, on average, more fish species are found in the deep pool or the shallow riffle across the 15 locations.
fish <- read.csv("fish.csv")
fish
##    location pool riffle
## 1         1    6      3
## 2         2    6      3
## 3         3    3      3
## 4         4    8      4
## 5         5    5      2
## 6         6    2      2
## 7         7    6      2
## 8         8    7      2
## 9         9    1      2
## 10       10    3      2
## 11       11    4      3
## 12       12    5      1
## 13       13    4      3
## 14       14    6      2
## 15       15    4      3
library(ggplot2)
We have a data set with 15 rows and 3 columns. Each row represents one sampling location along the river. The first column, location, identifies the sampling location. The pool and riffle columns give the number of fish species captured in the pool and in the adjacent riffle at that location.
Some stream fishes are most often found in pools, the deep, slow-moving parts of a stream. Others prefer riffles, the shallow, fast-moving regions. To investigate whether these two habitats support equal numbers of species (a measure of species diversity), researchers captured fish at 15 locations along a river. At each location, they recorded the number of species captured in a riffle and the number captured in an adjacent pool. We will determine whether, on average, more fish species are found in the deep pools or the shallow riffles across the 15 locations.
ggplot(data=fish, mapping=aes(x=riffle, y=pool)) + geom_point() + geom_abline(slope=1, intercept=0) + annotate("text", x=1.25, y=4, label="More in Pool") + annotate("text", x=3.5, y=2, label="More in Riffle")
In most locations, more fish were found in the pool than in the riffle (these are the points above the line). This suggests that the fish prefer the pool to the riffle. We will back this observation up with a p-value, which quantifies the strength of the evidence.
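The "most locations" claim can be checked directly by counting. The following is a minimal sketch that retypes the counts from the printed data above, so it does not depend on fish.csv; the variable names are only illustrative.

```r
# Counts copied from the printed data above (locations 1 to 15).
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# Tally the locations by which habitat had the larger count.
more_in_pool   <- sum(pool > riffle)
more_in_riffle <- sum(pool < riffle)
ties           <- sum(pool == riffle)

c(pool = more_in_pool, riffle = more_in_riffle, tie = ties)
```

With these data, the pool has the larger count at 12 of the 15 locations, the riffle at 1, and 2 locations are tied.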
The null hypothesis states that, on average, the number of fish found in the pool is equal to the number of fish found in the riffle. In other words, the mean for the pool equals the mean for the riffle.
The alternative hypothesis states that, on average, the number of fish found in the pool differs from the number of fish found in the riffle. In other words, there will be a difference in means between the fish found in pools and riffles. The scatterplot above suggests that this difference favors the pool.
I am going to run a t-test on this data, which means I will be testing my hypotheses (null and alternative) about population means of a quantitative variable.
There are two sets of measurements, one for the pools and one for the riffles. However, both sets come from the same 15 locations, so the two samples are not independent of each other.
According to the QQ plots below, the pool and riffle counts are approximately Normal because the points fall roughly along a straight line.
# QQ plot for each habitat, with a reference line for judging Normality
gg <- ggplot(data=fish)
gg + geom_qq(mapping=aes(sample=pool)) + geom_qq_line(mapping=aes(sample=pool))
gg + geom_qq(mapping=aes(sample=riffle)) + geom_qq_line(mapping=aes(sample=riffle))
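Because the pool and riffle counts come from the same locations, the Normality condition that matters most for a t-test on this kind of data is the one on the pool-minus-riffle differences. The sketch below is one way to check it. It retypes the counts from the printed data and uses base R's qqnorm() and qqline() rather than ggplot2, only to keep the example free of package dependencies.

```r
# Counts copied from the printed data above (locations 1 to 15).
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# One difference per location: positive means more species in the pool.
d <- pool - riffle

# Normal QQ plot of the differences with a reference line.
qqnorm(d, main = "Normal QQ plot of pool - riffle differences")
qqline(d)
```

With only 15 whole-number differences the points will look stepped, so the plot is best read as a rough check rather than proof of Normality.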
I will use the traditional level of significance, 0.05.
This is a paired problem. Each pool count is paired with the riffle count from the same location.
t.test(fish$pool, fish$riffle, paired=TRUE)
##
## Paired t-test
##
## data: fish$pool and fish$riffle
## t = 4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.170332 3.229668
## sample estimates:
## mean of the differences
## 2.2
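A paired t-test is the same as a one-sample t-test on the differences, so the t statistic and p-value above can be reproduced by hand. This is a minimal sketch that retypes the counts from the printed data; the names d, se, t_stat and p_value are only illustrative.

```r
# Counts copied from the printed data above (locations 1 to 15).
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle          # paired differences, one per location
n  <- length(d)              # 15 locations
se <- sd(d) / sqrt(n)        # standard error of the mean difference

# t statistic for H0: mean difference = 0, with n - 1 degrees of freedom
t_stat <- mean(d) / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

c(t = t_stat, df = n - 1, p = p_value)
```

The values should agree with the t.test() output: t is about 4.5826 on 14 degrees of freedom and the p-value is about 0.0004264.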
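The p-value is 0.0004264, which is less than the 0.05 level of significance, so we reject the null hypothesis.
The 95 percent confidence interval (1.17 to 3.23) does not contain 0 and lies entirely above it, so we are confident that more fish are found in the pool. This agrees with rejecting the null hypothesis.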
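The confidence interval reported by t.test() is the mean difference plus or minus a t critical value times the standard error. The sketch below recomputes it from the counts in the printed data; the names t_crit and ci are only illustrative.

```r
# Counts copied from the printed data above (locations 1 to 15).
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle          # paired differences
n  <- length(d)
se <- sd(d) / sqrt(n)        # standard error of the mean difference

# Critical value for a 95% interval with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Lower and upper limits of the interval for the mean difference
ci <- mean(d) + c(-1, 1) * t_crit * se
ci
```

The limits should match the interval in the t.test() output, about 1.170332 and 3.229668.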
On average, there are 2.2 more fish species in the pool than in the riffle at a location (this is based on a sample, so it is an estimate). Note that 2.2 is a difference in counts, not a percentage.
I conclude that fish tend to prefer pools rather than riffles. Furthermore, pools hold an estimated 2.2 more fish species than the adjacent riffles on average.