STEP 1: DESIGN THE EXPERIMENT

The first step is to decide how to collect the data. Typical questions include how many subjects to include and which groups to compare. In this case the experiment has already been designed for us, so this step does not apply.

STEP 2: LOAD THE DATA

library(ggplot2)
fish <- read.csv("fish.csv")
fish
##    location pool riffle
## 1         1    6      3
## 2         2    6      3
## 3         3    3      3
## 4         4    8      4
## 5         5    5      2
## 6         6    2      2
## 7         7    6      2
## 8         8    7      2
## 9         9    1      2
## 10       10    3      2
## 11       11    4      3
## 12       12    5      1
## 13       13    4      3
## 14       14    6      2
## 15       15    4      3

STEP 3: DESCRIBE DATA

We have a data set with 15 rows and three columns. Each row is one location; the columns give the location label, the number of fish in the pool, and the number of fish in the riffle at that location. This data set is designed to compare the number of fish in the pool at a given location with the number of fish in the riffle at that same location.
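The description above can be checked directly in R. The following is a minimal sketch that rebuilds the data frame inline from the printed table (an assumption made only so the example runs without fish.csv) and then inspects its shape and column summaries.

```r
# Rebuild the fish data from the printed table so the sketch runs without fish.csv
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

dim(fish)                             # rows and columns: 15 and 3
str(fish)                             # type of each column
summary(fish[, c("pool", "riffle")])  # summary statistics for each habitat
```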

STEP 4: IDENTIFY THE PURPOSE OF THE STUDY

The purpose of the study is to compare the number of fish in the pool at a given location with the number of fish in the riffle at that same location.

STEP 5: VISUALIZE DATA

There is no reason to plot location. Location is just a label, and there is no reason to think that these labels are anything but arbitrary. Nothing about location answers our question, which is "in each location, are there more fish in the pools or in the riffles?" We could plot a histogram for each variable, pool and riffle, but the trouble is that we are interested in how they relate to each other. What does it mean that there are 4 fish in the pool? It means one thing if there are 2 fish in the riffle at the same location, and an entirely different thing if there are 8 fish in the riffle.

I suggest a scatterplot with pool and riffle. To interpret the plot, superimpose the line where pool=riffle or y=x (slope 1, intercept 0). On one side of the line there are more fish in the pool and on the other there are more fish in the riffle.

Here is how you do it:

ggplot(data=fish, mapping=aes(x=riffle, y=pool)) +
  geom_point() +
  geom_abline(slope=1, intercept=0) +
  annotate("text", x=1.25, y=4, label="More in Pool") +
  annotate("text", x=3.5, y=2, label="More in Riffle")

STEP 6: INTERPRET THE PLOT

The plot suggests that there are more fish in the pool than in the riffle: most of the points lie on the "More in Pool" side of the line. It also suggests that there is no strong linear correlation between the two counts, because the points are widely scattered rather than clustered along a line.
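The visual reading can be backed up with a few counts. This is a sketch, not part of the original analysis: it rebuilds the data inline from the printed table, counts how many locations fall on each side of the pool = riffle line, and computes the sample correlation. The variable names are chosen here for illustration.

```r
# Rebuild the fish data from the printed table so the sketch runs without fish.csv
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

more_in_pool   <- sum(fish$pool > fish$riffle)   # 12 locations
on_the_line    <- sum(fish$pool == fish$riffle)  # 2 locations
more_in_riffle <- sum(fish$pool < fish$riffle)   # 1 location

# Pearson correlation between the two counts (about 0.27, a weak relationship)
r <- cor(fish$pool, fish$riffle)
r
```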

STEP 7: FORMULATE THE NULL HYPOTHESIS

The populations for this question are the fish in the pools and the fish in the riffles. The null hypothesis is that the mean of the riffle population is equal to the mean of the pool population.

STEP 8: IDENTIFY THE ALTERNATIVE HYPOTHESIS

The mean of the riffle population is not equal to the mean of the pool population. This two-sided alternative is the one tested by the t.test call in Step 13.

STEP 9: DECIDE ON A TYPE OF TEST

For this data, we will use a t-test.

STEP 10: CHOOSE ONE SAMPLE OR TWO

For this data, we will choose two samples: the pool and the riffle.

STEP 11: CHECK ASSUMPTIONS OF THE TEST

# Normal Q-Q plot for each variable; aes() looks up columns in fish directly
gg <- ggplot(data=fish)
gg + geom_qq(mapping=aes(sample=pool))

gg + geom_qq(mapping=aes(sample=riffle))
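Because the test performed in Step 13 is paired, the normality assumption applies to the within-location differences rather than to each column separately. One possible additional check, which is not part of the original analysis, is a Q-Q plot of those differences together with a Shapiro-Wilk test. The column name `difference` and the inline data frame are assumptions made for this sketch.

```r
library(ggplot2)

# Rebuild the fish data from the printed table so the sketch runs without fish.csv
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Paired differences: pool count minus riffle count at each location
fish$difference <- fish$pool - fish$riffle

# Q-Q plot of the differences with a reference line
p <- ggplot(data = fish, mapping = aes(sample = difference)) +
  geom_qq() +
  geom_qq_line()
p

# Numeric complement to the plot; a small p-value would cast doubt on normality
shapiro.test(fish$difference)
```

With only 15 locations, neither the plot nor the test is decisive, so they are best read as a rough sanity check.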

STEP 12: DECIDE ON A LEVEL OF SIGNIFICANCE OF THE TEST

Use the traditional level of significance: 0.05.

STEP 13: PERFORM THE TEST

This is a paired problem: each x is paired with a single y, because the pool count and the riffle count come from the same location. This was not the case with the sleep data set, where there was no pairing between the Unrest subjects and the Deprived subjects. Because of this, we do the t-test a little differently. Specifically, we do a paired t-test.

t.test(fish$pool, fish$riffle, paired=TRUE)
## 
##  Paired t-test
## 
## data:  fish$pool and fish$riffle
## t = 4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.170332 3.229668
## sample estimates:
## mean of the differences 
##                     2.2
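To see where these numbers come from, the paired t statistic can be reproduced by hand: it is the mean of the differences divided by its standard error, compared against a t distribution with n - 1 degrees of freedom. The sketch below rebuilds the data inline from the printed table; the names `d`, `se`, `t_stat`, and `p_value` are illustrative.

```r
# Rebuild the fish data from the printed table so the sketch runs without fish.csv
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

d  <- fish$pool - fish$riffle   # paired differences, one per location
n  <- length(d)                 # 15 locations
se <- sd(d) / sqrt(n)           # standard error of the mean difference

t_stat  <- mean(d) / se                         # matches t = 4.5826
p_value <- 2 * pt(-abs(t_stat), df = n - 1)     # two-sided, matches 0.0004264

c(t = t_stat, df = n - 1, p = p_value)
```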

STEP 14: INTERPRET THE P-VALUE

In this case, the p-value (0.0004264) is less than the level of significance (0.05). This means that we reject the null hypothesis that the means are equal.

STEP 15: INTERPRET THE CONFIDENCE INTERVAL

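The 95% confidence interval for the mean difference (pool minus riffle) runs from 1.170332 to 3.229668. Since it does not include zero, a mean difference of zero is not plausible at this confidence level, which agrees with rejecting the null hypothesis.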
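The interval itself can be rebuilt from the same ingredients as the t statistic: the mean difference plus or minus a t quantile times the standard error. This is a sketch with the data rebuilt inline from the printed table; `margin` and `ci` are names chosen here.

```r
# Rebuild the fish data from the printed table so the sketch runs without fish.csv
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

d  <- fish$pool - fish$riffle   # paired differences
n  <- length(d)
se <- sd(d) / sqrt(n)           # standard error of the mean difference

# 95% interval: the 0.975 quantile leaves 2.5% in each tail
margin <- qt(0.975, df = n - 1) * se
ci <- mean(d) + c(-1, 1) * margin
ci                              # matches 1.170332 and 3.229668
```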

STEP 16: INTERPRET THE SAMPLE ESTIMATES

Based on the samples, we can tell that there are more fish in the pool than in the riffle: 2.2 more fish per location on average.
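The 2.2 reported by t.test is the mean of the paired differences, which equals the difference between the two sample means; it is a difference, not a ratio. A short sketch, with the data rebuilt inline from the printed table and illustrative variable names, makes the distinction concrete.

```r
# Rebuild the fish data from the printed table so the sketch runs without fish.csv
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

mean_pool   <- mean(fish$pool)     # 70 / 15, about 4.67 fish per location
mean_riffle <- mean(fish$riffle)   # 37 / 15, about 2.47 fish per location

mean_diff <- mean(fish$pool - fish$riffle)   # 2.2, the value printed by t.test
ratio     <- mean_pool / mean_riffle         # about 1.9, a different quantity

c(pool = mean_pool, riffle = mean_riffle, difference = mean_diff, ratio = ratio)
```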

STEP 17: STATE YOUR CONCLUSION

The population means are not the same: we have evidence that there are more fish in the pool than in the riffle.