The first step is to decide how to collect your data. Questions to ask include how many subjects you want and what groups there will be. In this case, however, the experiment has already been designed for us, so this step does not apply.
library(ggplot2)
fish <- read.csv("fish.csv")
fish
## location pool riffle
## 1 1 6 3
## 2 2 6 3
## 3 3 3 3
## 4 4 8 4
## 5 5 5 2
## 6 6 2 2
## 7 7 6 2
## 8 8 7 2
## 9 9 1 2
## 10 10 3 2
## 11 11 4 3
## 12 12 5 1
## 13 13 4 3
## 14 14 6 2
## 15 15 4 3
We have a data set with three columns and 15 rows. Each row is one location; the columns give the location label, the number of fish in the pool, and the number of fish in the riffle at that location.
The purpose of the study is to compare the number of fish in the pool at a given location with the number of fish in the riffle at that same location.
There is no reason to plot location: location is just a label, and there is no reason to think that these labels are anything but arbitrary. Nothing about location answers our question, which is “in each location are there more fish in the pools or in the riffles”. We could plot a histogram for each variable, pool and riffle, but the trouble is that we are interested in how they relate to each other. What does it mean that there are 4 fish in the pool? It means one thing if there are 2 fish in the riffle at the same location, and an entirely different thing if there are 8 fish in the riffle.
I suggest a scatterplot of pool against riffle. To interpret the plot, superimpose the line where pool=riffle, that is y=x (slope 1, intercept 0). On one side of the line there are more fish in the pool and on the other there are more fish in the riffle.
Here is how you do it:
ggplot(data=fish, mapping=aes(x=riffle, y=pool)) +
  geom_point() +
  geom_abline(slope=1, intercept=0) +
  annotate("text", x=1.25, y=4, label="More in Pool") +
  annotate("text", x=3.5, y=2, label="More in Riffle")
The plot suggests that there are more fish in the pool than in the riffle, since most points fall on the "More in Pool" side of the line. It also suggests that there is no strong linear correlation between the two counts: the points are widely scattered rather than clustered along a line.
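As a rough numerical check on the visual impression of weak correlation, the sample correlation can be computed with `cor()`. The sketch below rebuilds the data frame from the values printed above so that it runs without `fish.csv`; the variable name `r` is introduced here only for illustration.

```r
# Counts copied from the printed fish data above
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Pearson correlation between pool and riffle counts
# (r is an illustrative name, not from the original analysis)
r <- cor(fish$pool, fish$riffle)
r
```

A value near 0 would agree with the scattered appearance of the plot; a value near 1 or -1 would not.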
The populations for this question are the fish in the pools and the fish in the riffles. The null hypothesis is that the mean of the riffle population is equal to the mean of the pool population.
The alternative hypothesis is that the mean of the riffle population is not equal to that of the pool population.
For this data, we will use a t-test.
The two samples are the pool counts and the riffle counts.
# Normal Q-Q plots to check each sample for approximate normality
gg <- ggplot(data=fish)
gg + geom_qq(mapping=aes(sample=pool))
gg + geom_qq(mapping=aes(sample=riffle))
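Because the counts are paired by location, it may also be worth looking at the per-location differences, since a paired t-test works on those differences. The following is a minimal sketch under that assumption; the column name `difference` and the object `qq_diff` are introduced here for illustration, and the data frame is rebuilt from the printed values so the block runs on its own.

```r
library(ggplot2)

# Counts copied from the printed fish data above
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Per-location difference, pool minus riffle (illustrative column name)
fish$difference <- fish$pool - fish$riffle

# Normal Q-Q plot of the differences with a reference line
qq_diff <- ggplot(data = fish, mapping = aes(sample = difference)) +
  geom_qq() +
  geom_qq_line()
qq_diff
```

Points that stay reasonably close to the reference line are consistent with roughly normal differences.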
Use the traditional level of significance: 0.05.
This is a paired problem: each pool count is paired with the riffle count from the same location. This was not the case with the sleep data set, where there was no pairing between the Unrest subjects and the Deprived subjects. Because of this we run the test a little differently. Specifically, we run a paired t-test by passing paired=TRUE to t.test.
t.test(fish$pool, fish$riffle, paired=TRUE)
##
## Paired t-test
##
## data: fish$pool and fish$riffle
## t = 4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.170332 3.229668
## sample estimates:
## mean of the differences
## 2.2
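To see what the paired test is doing, the same statistic can be worked out by hand from the per-location differences. This is a sketch rather than a replacement for `t.test`; the names `d`, `se`, `t_stat`, and `p_value` are introduced here for illustration, and the counts are copied from the printed data.

```r
# Counts copied from the printed fish data above
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# A paired t-test is a one-sample t-test on the differences
d <- pool - riffle
n <- length(d)

# Standard error of the mean difference
se <- sd(d) / sqrt(n)

# t statistic for the null hypothesis "mean difference is 0"
t_stat <- mean(d) / se

# Two-sided p-value with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

c(t = t_stat, df = n - 1, p = p_value)

# The same test expressed as a one-sample t-test on d
t.test(d, mu = 0)
```

The values of `t_stat`, `n - 1`, and `p_value` should agree with the `t`, `df`, and `p-value` lines in the output above.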
In this case, the p-value (0.0004264) is less than the level of significance (0.05). This means that we reject the null hypothesis that the means are equal.
The 95% confidence interval for the mean difference (pool minus riffle) is 1.170332 to 3.229668. Since it does not include zero, we have evidence at the 5% level that the population means differ.
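The interval itself can be reproduced from the differences as the mean difference plus or minus a t critical value times the standard error. The block below is a minimal sketch of that calculation; `t_crit` and `ci` are illustrative names, and the counts are again copied from the printed data.

```r
# Counts copied from the printed fish data above
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle
n  <- length(d)
se <- sd(d) / sqrt(n)

# Critical value for a 95% two-sided interval with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Lower and upper limits of the interval for the mean difference
ci <- mean(d) + c(-1, 1) * t_crit * se
ci
```

The two limits should match the "95 percent confidence interval" line reported by `t.test`.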
Based on the samples, there are more fish in the pool than in the riffle: on average, 2.2 more fish per location.
Since the test rejects equal population means and the sample mean difference is positive, we have evidence that there are more fish in the pool than in the riffle.