STEP 1: Design the experiment

STEP 2: Collect (or load) the data

fish <- read.csv("fish.csv")
fish
##    location pool riffle
## 1         1    6      3
## 2         2    6      3
## 3         3    3      3
## 4         4    8      4
## 5         5    5      2
## 6         6    2      2
## 7         7    6      2
## 8         8    7      2
## 9         9    1      2
## 10       10    3      2
## 11       11    4      3
## 12       12    5      1
## 13       13    4      3
## 14       14    6      2
## 15       15    4      3
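
In case fish.csv is not at hand, the table printed above can be rebuilt directly in R. This is a minimal sketch that simply retypes the 15 rows shown in the output, keeping the same column names.

```r
# Rebuild the data frame from the values printed above (retyped by hand,
# so the columns are numeric rather than the integers read.csv would give).
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Quick structural check: 15 locations, three columns.
dim(fish)
```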

STEP 3: Describe the data

The data set consists of 15 rows and three columns. Each row is one location along a river, and each column is a variable. The first column identifies the location, the second gives the number of different fish species captured in the section of that location that can be described as a pool, and the third gives the number captured in the section that can be described as a riffle. Pools are the deep, slow-moving parts of a stream, and riffles are the shallow, fast-moving parts.

STEP 4: Identify the purpose of the study

This study asks whether the two habitats (pools and riffles) support equal numbers of fish species, using the species count at each location as the measure of diversity.

STEP 5: Visualize the data

library(ggplot2)
ggplot(data=fish, mapping=aes(x=riffle, y=pool)) +
  geom_point() +
  geom_abline(slope=1, intercept=0) +  # reference line where pool == riffle
  annotate("text", x=1.25, y=4, label="More in Pool") +
  annotate("text", x=3.5, y=2, label="More in Riffle")

STEP 6: Interpret the plot

The plot suggests that more species are found in pools than in riffles, as there are more dots above the line than below or on it. That said, some locations show more species in the riffle than in the pool, or equal numbers in both. We need a p-value and a confidence interval to judge whether this pattern could plausibly be due to chance: a small p-value indicates that a difference this large would be unlikely if the population means for the two habitats were equal.
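
Because several locations share the same pair of counts, some dots in the scatterplot sit on top of each other, so counting visible dots can mislead. The sketch below tallies the locations directly. The data frame is retyped from the printout in STEP 2 so the block runs on its own, and the name `side` is introduced here only for illustration.

```r
fish <- data.frame(
  pool   = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Classify each location by which side of the line pool == riffle it falls on.
side <- ifelse(fish$pool > fish$riffle, "above (more in pool)",
        ifelse(fish$pool < fish$riffle, "below (more in riffle)",
               "on the line (equal)"))

# Count the locations in each group.
table(side)
```

With the data shown in STEP 2, this gives 12 locations above the line, 1 below it, and 2 on it.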

STEP 7: Formulate the null hypothesis

The population for this project is all pool and riffle habitats along the river. The null hypothesis is that the population mean number of species found in pools equals the population mean number found in riffles, that is, the difference between the two means is 0.

STEP 8: Identify the alternative hypothesis

Although the plot above suggests that there are more species in pools than in riffles, it is best to go with the more conservative, two-sided choice for the alternative hypothesis. Therefore, the alternative hypothesis is that the two population means are unequal.

STEP 9: Decide on type of test

The correct choice for this case is a t-test because the hypotheses concern population means. The variable, the number of species found in a pool or riffle, is quantitative, which is why a t-test is chosen rather than a proportions test.

STEP 10: Choose one sample or two

A two-sample test is needed because there are two samples: one of counts from pools and one of counts from riffles. Both samples come from the same 15 locations, so the two samples are paired.

STEP 11: Check Assumptions

Check Normality

gg <- ggplot(data=fish)
# Q-Q plot of the pool counts (columns are referenced by name inside aes)
gg + geom_qq(mapping=aes(sample=pool))

# Q-Q plot of the riffle counts
gg + geom_qq(mapping=aes(sample=riffle))

Based on the Q-Q plots, the pool counts look fairly normal, and the riffle counts, though less so, are also close enough to normal to proceed.
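
Since the test performed in STEP 13 is paired, the quantity the t-test actually works with is the per-location difference, so it can also be worth looking at a Q-Q plot of those differences. This is a minimal sketch, assuming ggplot2 is installed. The data frame is retyped from STEP 2, and the column name `difference` and object name `qq_diff` are introduced here for illustration.

```r
library(ggplot2)

fish <- data.frame(
  pool   = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Per-location difference (pool minus riffle); this column is not in fish.csv.
fish$difference <- fish$pool - fish$riffle

# Q-Q plot of the differences, with a reference line for comparison.
qq_diff <- ggplot(data = fish, mapping = aes(sample = difference)) +
  geom_qq() +
  geom_qq_line()
qq_diff
```

Points that stay close to the reference line suggest the differences are roughly normal. With only 15 small whole-number values, some stair-stepping in the plot is to be expected.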

STEP 12: Decide on a level of significance of the test

The level of significance for this case will be the fairly traditional value of 0.05.

STEP 13: Perform the test

This is a paired problem: each pool count is paired with the riffle count from the same location. This was not the case with the sleep data set, where there was no pairing between the Unrest subjects and the Deprived subjects. Because of this, we run the t-test a little differently. Specifically, we run a paired t-test by setting paired=TRUE in t.test.

t.test(fish$pool, fish$riffle, paired=TRUE)
## 
##  Paired t-test
## 
## data:  fish$pool and fish$riffle
## t = 4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.170332 3.229668
## sample estimates:
## mean of the differences 
##                     2.2
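
A paired t-test is equivalent to a one-sample t-test on the per-location differences, which makes the numbers in the output above easy to reproduce by hand. The following is a sketch of that calculation with the data retyped from STEP 2. The variable names (`d`, `se`, `t_stat`, and so on) are chosen here for illustration.

```r
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle      # per-location differences
n  <- length(d)          # number of locations
se <- sd(d) / sqrt(n)    # standard error of the mean difference

# Test statistic under the null hypothesis that the mean difference is 0.
t_stat <- mean(d) / se

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# 95 percent confidence interval for the mean difference.
ci <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se

c(t = t_stat, df = n - 1, p = p_value)
ci

# The built-in paired test should agree with the hand calculation.
fit <- t.test(pool, riffle, paired = TRUE)
all.equal(unname(fit$statistic), t_stat)
```

These values should match the t.test output above: t of about 4.58 on 14 degrees of freedom, and an interval of roughly 1.17 to 3.23 around the mean difference of 2.2.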

STEP 14: Interpret the p-value

The p-value (0.0004264) is much less than the level of significance (0.05), so we reject the null hypothesis.

STEP 15: Interpret the confidence interval

Because 0 is not inside the 95 percent confidence interval (about 1.17 to 3.23), this result is consistent with the result of STEP 14. It also means that it is not plausible that the means of the two groups are the same.

STEP 16: Interpret the sample estimates

The sample estimates show that the mean of the differences is 2.2. In other words, on average 2.2 more species were found in the pool than in the riffle at a location.

STEP 17: State your conclusion

STEPs 14 and 15 indicate that the mean numbers of species found in pools and riffles are unequal, and STEP 16 then leads us to reasonably conclude that pools support more fish species than riffles.