STEP 1: Design Experiment

STEP 2: Load Data

library(ggplot2)
fish <- read.csv("fish.csv")
fish
##    location pool riffle
## 1         1    6      3
## 2         2    6      3
## 3         3    3      3
## 4         4    8      4
## 5         5    5      2
## 6         6    2      2
## 7         7    6      2
## 8         8    7      2
## 9         9    1      2
## 10       10    3      2
## 11       11    4      3
## 12       12    5      1
## 13       13    4      3
## 14       14    6      2
## 15       15    4      3

STEP 3: Describe Data

The data has three columns (location, pool, and riffle) and fifteen rows, one for each location. The rows represent the subjects of the study and the columns represent the variables. For each location, the data records the number of fish counted in the pool and the number of fish counted in the riffle at that same location.
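As a quick check on this description, the sketch below rebuilds the table from the values printed in Step 2 and reports its shape and column means. It assumes the printed values match the contents of fish.csv; the name fish_check is hypothetical and is used only so that fish is not overwritten.

```r
# Rebuild the data from the values printed in Step 2
# (assumption: these match the contents of fish.csv).
fish_check <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

dim(fish_check)                              # number of rows and columns
str(fish_check)                              # type of each column
colMeans(fish_check[, c("pool", "riffle")])  # average count per habitat
```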

STEP 4: Purpose of Experiment

The purpose of this study is to determine how the fish counts in the two environments relate to each other and whether the two environments can support the same number of fish.

STEP 5: Visually Represent Data

There is no reason to plot location. Location is just a label, and there is no reason to think that these labels are anything but arbitrary. Nothing about location answers our question, which is “in each location, are there more fish in the pools or in the riffles?” We could plot a histogram for each variable, pool and riffle, but the trouble is that we are interested in how they relate to each other. What does it mean that there are 4 fish in the pool? It means one thing if there are 2 fish in the riffle at the same location, and an entirely different thing if there are 8 fish in the riffle.

I suggest a scatterplot with pool and riffle. To interpret the plot, superimpose the line where pool=riffle or y=x (slope 1, intercept 0). On one side of the line there are more fish in the pool and on the other there are more fish in the riffle.

Here is how you do it:

ggplot(data=fish, mapping=aes(x=riffle, y=pool)) +
  geom_point() +
  geom_abline(slope=1, intercept=0) +
  annotate("text", x=1.25, y=4, label="More in Pool") +
  annotate("text", x=3.5, y=2, label="More in Riffle")

Another way to visualize this data would be:

library(ggplot2)
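# Inside aes(), color is treated as a group label, not a literal color,
# so label each layer by habitat and let the legend distinguish them.
ggplot(data=fish) +
  geom_point(mapping=aes(x=location, y=riffle, color="riffle")) +
  geom_point(mapping=aes(x=location, y=pool, color="pool")) +
  ylab("number of fish")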

Or:

library(ggplot2)
ggplot(data=fish,mapping=aes(x=location,y=pool))+geom_point()

STEP 6: Interpret the Graph

The scatterplot shows that most points lie above the line pool = riffle, so most locations have more fish in the pool than in the riffle. It also shows that there is not a strong linear correlation between the two counts. The second graph, which plots both counts against location, leads to a similar conclusion: the pool count is higher than the riffle count at most locations. The last graph shows only the pool counts, so by itself it cannot be used to compare the two habitats. Although these conclusions can be drawn from the graphs, it is necessary to perform a test of significance to determine the p-value and have more concrete evidence to support them.
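The visual impression can be backed up with a simple tally. The sketch below, which assumes the values printed in Step 2, counts how many locations fall on each side of the pool = riffle line and computes the correlation between the two counts. The variable names are chosen here for illustration.

```r
# Counts as printed in Step 2 (assumption: same as fish.csv).
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# Classify each location by which habitat held more fish.
comparison <- ifelse(pool > riffle, "more in pool",
              ifelse(pool < riffle, "more in riffle", "equal"))
counts <- table(comparison)
counts

# Strength of the linear relationship seen in the scatterplot.
r <- cor(pool, riffle)
r
```

With these values the tally is 12 locations with more fish in the pool, 2 ties, and 1 location with more fish in the riffle.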

STEP 7: Null Hypothesis

Null Hypothesis: The mean number of fish living in the pool is equal to the mean number of fish living in the riffle.

STEP 8: Alternative Hypothesis

Alternative Hypothesis: The mean number of fish living in the pool is different from the mean number of fish living in the riffle. That is, either (1) the number of fish living in the pool is greater than the number of fish living in the riffle, or (2) the number of fish living in the pool is less than the number of fish living in the riffle.

STEP 9: Test

Since t-tests are used to test hypotheses about the population means of a quantitative variable, and the fish counts are quantitative, a t-test is the appropriate test for this data.

STEP 10: Choose Samples

We will test two samples: the pool counts and the riffle counts.

STEP 11: Check Assumptions

Check Normality

gg <- ggplot(data=fish)
# Q-Q plot of the pool counts (refer to columns by name inside aes)
gg + geom_qq(mapping=aes(sample=pool))

# Q-Q plot of the riffle counts
gg + geom_qq(mapping=aes(sample=riffle))
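Because Step 13 uses a paired test, the normality assumption applies to the within-location differences (pool minus riffle) rather than to each column on its own. The sketch below computes those differences from the values printed in Step 2 and runs a Shapiro-Wilk test on them. The Shapiro-Wilk test is an addition made here as a numeric companion to the Q-Q plots, not part of the original analysis.

```r
# Counts as printed in Step 2 (assumption: same as fish.csv).
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# The paired t-test works on one difference per location.
d <- pool - riffle

# Shapiro-Wilk test: a small p-value would suggest the
# differences are not well described by a normal distribution.
sw <- shapiro.test(d)
sw
```

The same differences could also be inspected with a Q-Q plot by adding them as a column, for example fish$diff <- fish$pool - fish$riffle, and then using aes(sample=diff) with geom_qq.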

STEP 12: Level of Significance of Test

For this test, we are going to use the traditional level of significance for a t-test, which is 0.05, or 5%.

STEP 13: Perform Test

This is a paired problem: each pool count is paired with the riffle count from the same location. This was not the case with the sleep data set, where there was no pairing between the Unrest subjects and the Deprived subjects. Because of this, we do the t-test a little differently. Specifically, we do a paired t-test by passing paired=TRUE to t.test.

t.test(fish$pool, fish$riffle, paired=TRUE)
## 
##  Paired t-test
## 
## data:  fish$pool and fish$riffle
## t = 4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.170332 3.229668
## sample estimates:
## mean of the differences 
##                     2.2
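
To see where these numbers come from, the paired t statistic can be computed by hand from the differences. This is a minimal sketch that assumes the values printed in Step 2; the variable names are chosen for illustration.

```r
# Counts as printed in Step 2 (assumption: same as fish.csv).
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle        # one difference per location
n  <- length(d)            # 15 pairs, so df = n - 1 = 14
se <- sd(d) / sqrt(n)      # standard error of the mean difference

t_stat <- mean(d) / se                     # test statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1) # two-sided p-value

c(t = t_stat, df = n - 1, p = p_val)
```

The result should agree with the t, df, and p-value reported by t.test above, since a paired t-test is a one-sample t-test on the differences.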

STEP 14: Interpretation of Results

When the p-value is less than the level of significance, we reject the null hypothesis; when the p-value is greater than the level of significance, we fail to reject the null hypothesis. In this case, the p-value (0.0004264) is less than the level of significance (0.05), so we reject the null hypothesis that the means are the same.

STEP 15: Confidence Interval

The confidence interval is the range of plausible values for the difference in the means (pool minus riffle). The 95% confidence interval runs from about 1.17 to 3.23, so it lies entirely above zero and zero is not a plausible value for the difference in means. This suggests that it is not plausible for the means to be equal.
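The interval can be reproduced from the mean difference, its standard error, and the t critical value for 14 degrees of freedom. A minimal sketch, assuming the values printed in Step 2:

```r
# Counts as printed in Step 2 (assumption: same as fish.csv).
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

d  <- pool - riffle
n  <- length(d)
se <- sd(d) / sqrt(n)

# For a 95% interval, put 2.5% in each tail of the t distribution.
t_crit <- qt(0.975, df = n - 1)

# mean difference plus or minus the margin of error
ci <- mean(d) + c(-1, 1) * t_crit * se
ci
```

Both endpoints are positive, which matches the interval reported by t.test in Step 13.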

STEP 16: Sample Estimates

Now that we have evidence that the means are unequal, we can look at the direction of the difference. The estimated mean difference is 2.2, so on average there are 2.2 more fish in the pool than in the riffle at the same location.

STEP 17: Conclusion

The data gives strong evidence that more fish live in the pool than in the riffle. This is consistent with the idea that the fish prefer the pool’s environment to the riffle’s environment, or that the environment for fish is better in the pool than it is in the riffle, although the test by itself does not show why the fish are there.