Step 1 Design the experiment

The experiment and data come from Example 6, a test about fish living in two different habitats: the pool and the riffle.

Step 2 Collect and Load Data

library(ggplot2)
fish2 <- read.csv("fish2.csv")
fish2
##    location pool riffle
## 1         1    6      3
## 2         2    6      3
## 3         3    3      3
## 4         4    8      4
## 5         5    5      2
## 6         6    2      2
## 7         7    6      2
## 8         8    7      2
## 9         9    1      2
## 10       10    3      2
## 11       11    4      3
## 12       12    5      1
## 13       13    4      3
## 14       14    6      2
## 15       15    4      3

Step 3 Describe Data

The data set has fifteen rows and three columns. Each row is one of fifteen locations, identified by the location variable. The other two variables are counts: the number of fish in the pool and the number of fish in the riffle at that location.
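A quick structural check can back up this description. The sketch below rebuilds the data frame inline from the values printed in Step 2, an assumption made only so the example runs without fish2.csv on hand.

```r
# Rebuild the printed data inline (assumption: fish2.csv holds exactly these values)
fish2 <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

dim(fish2)                             # number of rows, then number of columns
names(fish2)                           # column names
summary(fish2[, c("pool", "riffle")])  # quartiles and mean for each count
```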

Step 4 Purpose of the Study

The purpose of this study is to find which location better supports fish. The researchers want to figure out if pools or riffles are better environments for fish.

Step 5 Visualize Data

There is no reason to plot location: location is just a label, and there is no reason to think that these labels are anything but arbitrary. Nothing about location answers our question, which is “in each location, are there more fish in the pools or in the riffles?” We could plot histograms for each variable, pool and riffle, but the trouble is that we are interested in how they relate to each other. What does it mean that there are 4 fish in the pool? It means one thing if there are 2 fish in the riffle in the same location, and an entirely different thing if there are 8 fish in the riffle.

I suggest a scatterplot with pool and riffle. To interpret the plot, superimpose the line where pool=riffle or y=x (slope 1, intercept 0). On one side of the line there are more fish in the pool and on the other there are more fish in the riffle.

Here is how you do it:

library(ggplot2)
ggplot(data=fish2, mapping=aes(x=riffle, y=pool)) +
  geom_point() +
  geom_abline(slope=1, intercept=0) +
  annotate("text", x=1.25, y=4, label="More in Pool") +
  annotate("text", x=3.5, y=2, label="More in Riffle")

Step 6 Interpret the Plot

The plot shows that most points lie above the line: in 12 of the 15 locations, there are more fish in the pool than in the riffle.
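Several points in the scatterplot sit on top of each other, so it can help to tally the comparison directly. This is a small sketch that types the two columns in by hand from the printout in Step 2 (an assumption, so that it runs without the csv file).

```r
# Counts copied from the printed data in Step 2
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

more_in_pool   <- sum(pool > riffle)   # points above the line y = x
ties           <- sum(pool == riffle)  # points on the line
more_in_riffle <- sum(pool < riffle)   # points below the line

c(pool = more_in_pool, tie = ties, riffle = more_in_riffle)
```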

Step 7 Formulate the Null Hypothesis

The Null Hypothesis: The fish live in the pool and the riffle equally. The mean difference between the pool count and the riffle count is zero.

Step 8 Formulate the Alternative Hypothesis

The Alternative Hypothesis: The mean difference between the pool count and the riffle count is not zero. On average, there are more fish in one habitat than in the other.

Step 9 Decide on Type of Test

The t test is for testing hypotheses about population means of a quantitative variable. Because each location contributes both a pool count and a riffle count, we want to use a paired t test.

Step 10 One Sample or Two?

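We have two sets of measurements, one for the pools and one for the riffles, but they are paired by location rather than independent. A paired test works with the fifteen pool-minus-riffle differences, so it amounts to a one sample test on those differences.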
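One way to see how the paired test relates to a one sample test is to run both and compare. The sketch below assumes the data are typed in from the Step 2 printout; the variable names paired_fit and one_sample_fit are only for illustration.

```r
# Counts copied from the printed data in Step 2
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# Paired t test on the two columns
paired_fit <- t.test(pool, riffle, paired = TRUE)

# One sample t test on the differences, testing a mean of zero
one_sample_fit <- t.test(pool - riffle, mu = 0)

# Both should report the same t statistic and p-value
all.equal(unname(paired_fit$statistic), unname(one_sample_fit$statistic))
all.equal(paired_fit$p.value, one_sample_fit$p.value)
```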

Step 11 Check the assumptions of the test

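The Normal Q-Q plots for pool and riffle show points close enough to the reference line for each variable to be considered approximately Normal. Strictly speaking, a paired t test assumes that the differences (pool minus riffle) are approximately Normal.

# Q-Q plot for pool, with a reference line to judge Normality against
ggplot(data=fish2, mapping=aes(sample=pool)) + geom_qq() + geom_qq_line()

# Q-Q plot for riffle
ggplot(data=fish2, mapping=aes(sample=riffle)) + geom_qq() + geom_qq_line()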
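Since the paired test is built on the differences, it may be worth checking those as well. This is only a sketch: the data are typed in from the Step 2 printout, the names diffs and qq_diff are made up for the example, and the Shapiro-Wilk test is an extra check that was not part of the original analysis.

```r
library(ggplot2)

# Counts copied from the printed data in Step 2
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# One difference per location
diffs <- data.frame(d = pool - riffle)

# Q-Q plot of the differences with a reference line
qq_diff <- ggplot(data = diffs, mapping = aes(sample = d)) +
  geom_qq() +
  geom_qq_line()
print(qq_diff)

# Shapiro-Wilk test of Normality (a small p-value would suggest non-Normal differences)
shapiro.test(diffs$d)
```

With only fifteen whole-number differences, both the plot and the test give a rough impression at best.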

Step 12 Decide on a level of significance for the test

The level of significance will be 0.05.

Step 13 Perform the Test

This is a paired problem. Each x is paired with a single y. This was not the case with the sleep data set where there was no pairing between the Unrest subjects and the Deprived subjects. Because of this we do the t.test a little differently. Specifically we do a paired t.test.

t.test(fish2$pool, fish2$riffle, paired=TRUE)
## 
##  Paired t-test
## 
## data:  fish2$pool and fish2$riffle
## t = 4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.170332 3.229668
## sample estimates:
## mean of the differences 
##                     2.2
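To see where the numbers in this output come from, the t statistic and p-value can be worked out by hand from the differences. This is a sketch with the data typed in from the Step 2 printout; d, se, t_stat and p_value are names chosen for the example.

```r
# Counts copied from the printed data in Step 2
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle        # one difference per location
n  <- length(d)            # number of pairs
se <- sd(d) / sqrt(n)      # standard error of the mean difference

t_stat  <- mean(d) / se                       # test statistic under a null mean of 0
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value with n - 1 degrees of freedom

mean(d)
t_stat
p_value
```

The values should agree with the mean of the differences, t, and p-value reported by t.test above.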

Step 14 Interpret the p-value

The p-value of 0.0004264 is less than the level of significance, which was 0.05. We therefore reject the null hypothesis.

Step 15 Interpret the Confidence Interval

The confidence interval gives a range of plausible values for the true mean difference (pool minus riffle). The 95 percent confidence interval runs from 1.170332 to 3.229668, centered on the sample mean difference of 2.2. Because the interval does not contain zero, a mean difference of zero is not plausible, which agrees with rejecting the null hypothesis. The whole interval is positive, so on average the pools hold more fish than the riffles.
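The interval can also be rebuilt by hand as the mean difference plus or minus a critical t value times the standard error. The sketch below assumes the data are typed in from the Step 2 printout; t_crit and ci are names chosen for the example.

```r
# Counts copied from the printed data in Step 2
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle
n  <- length(d)
se <- sd(d) / sqrt(n)              # standard error of the mean difference

t_crit <- qt(0.975, df = n - 1)    # critical value for a 95 percent interval
ci     <- mean(d) + c(-1, 1) * t_crit * se

ci                                 # lower and upper bounds
ci[1] > 0                          # TRUE when the interval excludes zero
```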

Step 16 Interpret the Sample Estimates

The sample estimate is that, on average, there are 2.2 more fish in the pool than in the riffle at the same location.

Step 17 Conclusion

In conclusion, the data suggest that fish are more abundant in pools than in riffles. We reject the null hypothesis of no mean difference in favor of the alternative hypothesis that the mean counts are unequal. On average, pools hold an estimated 2.2 more fish than riffles at the same location.