Step 1 Because we were given the data in advance, we will not formulate a hypothesis until after we have looked at it. This is not standard practice in statistics, where hypotheses are normally stated before the data are collected, but here we skip the design stage and go straight to the data.

Step 2

fish <- read.csv("Copy of fish.csv")
fish
##    location pool riffle
## 1         1    6      3
## 2         2    6      3
## 3         3    3      3
## 4         4    8      4
## 5         5    5      2
## 6         6    2      2
## 7         7    6      2
## 8         8    7      2
## 9         9    1      2
## 10       10    3      2
## 11       11    4      3
## 12       12    5      1
## 13       13    4      3
## 14       14    6      2
## 15       15    4      3

Step 3 Describe Data We have a data set with 15 rows and 3 columns. Each row is a location, and for each location we have the number of fish present in the pool and the number present in the riffle at a given time. We want to see how the number of fish in the riffle at a location relates to the number in the pool at that same location. We will use a scatterplot to visualize whether there is a correlation.
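
The shape of the data can be confirmed directly in R. The sketch below rebuilds the data frame from the values printed in Step 2 (an assumption made so that it runs without the CSV file) and then checks its dimensions and column summaries.

```r
# Rebuild the data from the printed table so the example does not need the CSV.
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

dims <- dim(fish)   # c(number of rows, number of columns)
print(dims)

str(fish)                                 # column names and types
summary(fish[, c("pool", "riffle")])      # five-number summary and mean per column
```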

Step 4 Identify the purpose of the study The purpose of the study is to assess whether there is a correlation between the number of fish in the pool at the given location, and the number of fish in the riffle.

Step 5 Visualize the Data There is no reason to plot location: it is just a label, and there is no reason to think that these labels are anything but arbitrary. Nothing about location answers our question, which is “in each location, are there more fish in the pools or in the riffles?” We could plot a histogram for each variable (pool and riffle), but the trouble is that we are interested in how they relate to each other. What does it mean that there are 4 fish in the pool? It means one thing if there are 2 fish in the riffle at the same location, and an entirely different thing if there are 8 fish in the riffle.

I suggest a scatterplot with pool and riffle. To interpret the plot, superimpose the line where pool=riffle or y=x (slope 1, intercept 0). On one side of the line there are more fish in the pool and on the other there are more fish in the riffle.

Here is how you do it:

library(ggplot2)
ggplot(data=fish, mapping=aes(x=riffle, y=pool)) +
  geom_point() +
  geom_abline(slope=1, intercept=0) +  # the line pool = riffle
  annotate("text", x=1.25, y=4, label="More in Pool") +
  annotate("text", x=3.5, y=2, label="More in Riffle")

Step 6 Interpret the plot There is no strong linear correlation between the number of fish in the pool and the number of fish in the riffle. The points are widely scattered rather than following a line. The plot also shows that most locations have more fish in the pool than in the riffle.
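
What the plot suggests can be backed up with numbers. The sketch below is not part of the original analysis: it counts how many locations fall on each side of the pool = riffle line and computes the Pearson correlation. The data frame is rebuilt from the values printed in Step 2.

```r
# Rebuild the data from the printed table so the example does not need the CSV.
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Which side of the pool = riffle line does each location fall on?
side <- ifelse(fish$pool > fish$riffle, "more in pool",
        ifelse(fish$pool < fish$riffle, "more in riffle", "tie"))
side_counts <- table(side)
print(side_counts)

# Pearson correlation between the two counts, a numeric summary of the scatter.
r <- cor(fish$pool, fish$riffle)
print(r)
```

For these data, 12 of the 15 locations have more fish in the pool, 2 are ties, and 1 has more fish in the riffle.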

Step 7 Formulate the Null Hypothesis Riffle population: All the fish found in the riffles. Pool population: All the fish found in the pool.

Null hypothesis: The mean of the riffle population is equal to the mean of the pool population (a mean difference of zero).

Step 8 Formulate the Alternative Hypothesis Alternative hypothesis: The means are different, OR the mean of the pool group is greater than the mean of the riffle group. These statements are usually written in terms of differences (a difference greater than 0, or a difference not equal to 0). We use the two-sided version: the difference between the pool population mean and the riffle population mean is not equal to 0.

Step 9 Decide on type of test T-test.

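Step 10 Choose one sample or two sample Two sample (riffle and pool). Because both counts come from the same location, the two samples are paired, which Step 13 takes into account.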

Step 11 Check Assumptions

Check Normality

gg <- ggplot(data=fish)
gg + geom_qq(mapping=aes(sample=pool))

gg + geom_qq(mapping=aes(sample=riffle))
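
For a paired test, the quantity usually assumed to be approximately normal is the set of within-location differences, not each column on its own. The sketch below is a possible extra check, not something the original analysis ran. It uses base R graphics (so no extra packages are needed) and adds a Shapiro-Wilk test. The data frame is rebuilt from the values printed in Step 2.

```r
# Rebuild the data from the printed table so the example does not need the CSV.
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Within-location differences: the values a paired t-test actually analyzes.
d <- fish$pool - fish$riffle

# Normal Q-Q plot of the differences, with a reference line.
qqnorm(d, main = "Normal Q-Q plot of pool - riffle")
qqline(d)

# Shapiro-Wilk test; a small p-value would suggest departure from normality.
sw <- shapiro.test(d)
print(sw)
```

With only 15 locations, both the plot and the test have limited power, so they are best read as a rough screen and not as proof of normality.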

Step 12 Decide on a level of significance 0.05

Step 13 Perform the test This is a paired problem: each pool count is paired with the riffle count from the same location. This was not the case with the sleep data set, where there was no pairing between the Unrest subjects and the Deprived subjects. Because of this, we run t.test a little differently: we ask for a paired t-test.

t.test(fish$pool, fish$riffle, paired=TRUE)
## 
##  Paired t-test
## 
## data:  fish$pool and fish$riffle
## t = 4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.170332 3.229668
## sample estimates:
## mean of the differences 
##                     2.2
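
A paired t-test is equivalent to a one-sample t-test on the within-location differences. The sketch below recomputes the statistic by hand to show where the numbers in the output come from. The variable names are illustrative, and the data frame is rebuilt from the values printed in Step 2.

```r
# Rebuild the data from the printed table so the example does not need the CSV.
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

d  <- fish$pool - fish$riffle     # one difference per location
n  <- length(d)
se <- sd(d) / sqrt(n)             # standard error of the mean difference

t_stat <- mean(d) / se                       # test statistic under H0: mean difference = 0
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value

print(c(t = t_stat, df = n - 1, p = p_val))

# The same result from a one-sample t-test on the differences.
one_sample <- t.test(d, mu = 0)
print(one_sample)
```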

Step 14 Interpret the P-Value Because our p-value of 0.0004264 is less than our level of significance (0.05), we reject the null hypothesis that the means are equal. We conclude that the means of the pool and riffle populations are unequal.
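
Step 8 also mentioned a one-sided alternative (pool mean greater than riffle mean). As a sketch of how that choice would change the p-value, the `alternative` argument of `t.test` can be set to "greater". This is an illustration only and does not replace the two-sided test used above. The data frame is rebuilt from the values printed in Step 2.

```r
# Rebuild the data from the printed table so the example does not need the CSV.
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Two-sided alternative: mean difference is not equal to 0 (the test used above).
two_sided <- t.test(fish$pool, fish$riffle, paired = TRUE)

# One-sided alternative: mean of (pool - riffle) is greater than 0.
one_sided <- t.test(fish$pool, fish$riffle, paired = TRUE,
                    alternative = "greater")

print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
```

Because the observed difference is in the direction of the one-sided alternative, the one-sided p-value is half the two-sided one. The direction should be chosen before looking at the data, which is one reason to prefer the two-sided test here.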

Step 15 Interpret the confidence interval We have a 95% confidence interval of (1.170332, 3.229668). Zero is not in this interval. Therefore 0 is not a plausible value for the difference in means, so it is not plausible that the means are the same.
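
The interval reported by `t.test` can be reproduced by hand as the mean difference plus or minus a critical t value times the standard error. The sketch below uses illustrative variable names and rebuilds the data frame from the values printed in Step 2.

```r
# Rebuild the data from the printed table so the example does not need the CSV.
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

d  <- fish$pool - fish$riffle     # one difference per location
n  <- length(d)
se <- sd(d) / sqrt(n)             # standard error of the mean difference

# Critical value for a 95% interval with n - 1 degrees of freedom.
t_crit <- qt(0.975, df = n - 1)

# Lower and upper limits of the 95% confidence interval.
ci <- mean(d) + c(-1, 1) * t_crit * se
print(ci)
```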

Step 16 Interpret the sample estimates Having concluded that the population means are unequal, we can use the sample to estimate the size of the difference: on average, there are 2.2 more fish in the pool than in the riffle at a given location.

Step 17 State your conclusion The population means are not equal, and our sample gives evidence that, on average, there are more fish in the pool than in the riffle.