Design the experiment

The first thing to decide is how you are going to collect the data. How many subjects are you going to have? What groups are there going to be, and what are you going to do differently for each group (these are called different treatments)?

Load data

First, we load and print the data set.

fish<-read.csv("fish.csv")
fish
##    location pool riffle
## 1         1    6      3
## 2         2    6      3
## 3         3    3      3
## 4         4    8      4
## 5         5    5      2
## 6         6    2      2
## 7         7    6      2
## 8         8    7      2
## 9         9    1      2
## 10       10    3      2
## 11       11    4      3
## 12       12    5      1
## 13       13    4      3
## 14       14    6      2
## 15       15    4      3

Describe the data

Now, look at the data set. There are 15 rows and 3 columns. The rows correspond to subjects (here, sampling locations) and the columns correspond to variables. The “location” column, numbered 1-15, identifies the location. The “pool” column tells us how many fish were counted in the pool. The “riffle” column tells us how many fish were counted in the riffle.
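As a quick check on the description above, the dimensions and column types can be confirmed directly in R. This is a minimal sketch: it rebuilds the data frame from the printed values instead of reading fish.csv, so that it runs on its own, and the name fish_check is only illustrative.

```r
# Rebuild the data frame from the values printed above (assumption: these
# match the contents of fish.csv exactly).
fish_check <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

dim(fish_check)   # number of rows, then number of columns
str(fish_check)   # column names and types
```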

The question is: is there evidence to conclude that there are more fish in the “pool” than in the “riffle”?

Usually we would design our study before we collect data. But the data have been given to us, so we will just graph the data before we design the analysis.

Identify the purpose

We want to decide which habitat, the pool or the riffle, has the greater number of fish, in order to judge which one provides the better environment for fish to live in.

Visualize the data

library(ggplot2)
ggplot(data=fish,mapping=aes(x=location,y=pool))+geom_point()

library(ggplot2)
ggplot(data=fish,mapping=aes(x=location,y=riffle))+geom_point()

Interpret the plot

The number of fish in the riffle ranges from 1 to 4, and most of the counts are 2 or 3. The number of fish in the pool is more variable, ranging from 1 to 8, but most of the counts lie between 4 and 8. This suggests that the mean of the number of fish in the pool is greater than the mean of the number of fish in the riffle.
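The visual impression can be backed up with a few summary numbers. The following is a small sketch that types the two columns in from the printed data set (the vector names pool and riffle are chosen here for convenience) and reports their ranges and means.

```r
# Counts copied from the printed data set.
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

range(pool)      # smallest and largest pool count
range(riffle)    # smallest and largest riffle count
mean(pool)       # sample mean for the pool
mean(riffle)     # sample mean for the riffle
table(riffle)    # how often each riffle count occurs
```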

Next step: Identify null hypothesis

Null hypothesis: The population mean of the number of fish in the pool is the same as the population mean of the number of fish in the riffle. Usually this is stated as a difference in means (equal to zero).

Each group is a sample from a larger population: specifically, the population of all pool (or riffle) locations that might conceivably have been sampled.

Alternative hypothesis: There are two justifiable one-sided alternative hypotheses: (1) that the mean of the number of fish in the pool is larger than the mean of the number of fish in the riffle (or the mean of pool minus the mean of riffle is greater than 0); (2) that the mean of the number of fish in the pool is smaller than the mean of the number of fish in the riffle (or the mean of pool minus the mean of riffle is less than 0). I would recommend the two-sided alternative (that the means are not equal) because we have already looked at the data, and it is always more conservative.
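In symbols, the hypotheses can be written as follows. The notation $\mu_{\text{pool}}$ and $\mu_{\text{riffle}}$ for the two population means is introduced here for illustration and does not appear in the original write-up.

```latex
\begin{align*}
H_0 &: \mu_{\text{pool}} - \mu_{\text{riffle}} = 0 \\
H_A &: \mu_{\text{pool}} - \mu_{\text{riffle}} \neq 0 && \text{(two-sided, recommended)} \\
H_A^{(1)} &: \mu_{\text{pool}} - \mu_{\text{riffle}} > 0 && \text{(one-sided, pool larger)} \\
H_A^{(2)} &: \mu_{\text{pool}} - \mu_{\text{riffle}} < 0 && \text{(one-sided, pool smaller)}
\end{align*}
```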

Decide one sample or two sample

Two sample.

Decide on type of test

The choices here are the t-test and the proportion test. T-tests are for testing hypotheses about the population means of quantitative variables. Proportion tests are for testing hypotheses about population proportions of categorical variables. The fish counts are quantitative, so the correct choice is the t-test.

Check assumptions of the test

For the t-test, the main assumption is that the data lie close enough to a Normal (bell shaped) distribution. How close does it have to be? It depends on the sample size: the greater the sample size, the more robust the t-test is to non-Normality. Actually even for small sample sizes (10 or 11) it is fairly robust, so unless there is strong skewness or substantial outliers we will be OK.

The best way of judging this is to use a Normal quantile (QQ) plot.

ggplot(data=fish)+geom_qq(mapping=aes(sample=pool,color="pool"))+geom_qq(mapping=aes(sample=riffle,color="riffle"))

If the data are Normal, the points will lie close to a straight line. This graph shows that the data are probably not close enough to Normal. The number of fish in the riffle looks less Normal than the number of fish in the pool.
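Judging closeness to a line is easier when a reference line is drawn. The sketch below adds one with geom_qq_line() from ggplot2. It rebuilds the data from the printed values so that it runs on its own; the names fish_check and qq_plot are illustrative, and the legend labels "pool" and "riffle" are a choice made here.

```r
library(ggplot2)

# Data copied from the printed data set (assumed to match fish.csv).
fish_check <- data.frame(
  pool   = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# One set of QQ points and one reference line per column.
# Putting a label inside aes() creates a legend entry for that column.
qq_plot <- ggplot(data = fish_check) +
  geom_qq(mapping = aes(sample = pool, color = "pool")) +
  geom_qq_line(mapping = aes(sample = pool, color = "pool")) +
  geom_qq(mapping = aes(sample = riffle, color = "riffle")) +
  geom_qq_line(mapping = aes(sample = riffle, color = "riffle"))

qq_plot
```

Because the counts are small whole numbers, the points fall in horizontal steps, which is worth keeping in mind when reading the plot.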

Decide on a level of significance of the test

It is always a safe bet to use the traditional level of significance, 0.05.

Perform the test

t.test(fish$riffle,fish$pool)
## 
##  Welch Two Sample t-test
## 
## data:  fish$riffle and fish$pool
## t = -4.1482, df = 18.125, p-value = 0.0005961
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.313673 -1.086327
## sample estimates:
## mean of x mean of y 
##  2.466667  4.666667

Since the p-value is less than the level of significance, we REJECT the null hypothesis that the means are equal.
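To see where the numbers in the output come from, the Welch statistic, its degrees of freedom, and the two-sided p-value can be recomputed by hand. This is a sketch for illustration only; the variable names are chosen here, and the data are typed in from the printed data set.

```r
# Counts copied from the printed data set.
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

# Variance of each sample mean (sample variance divided by sample size).
v_riffle <- var(riffle) / length(riffle)
v_pool   <- var(pool) / length(pool)

# Standard error of the difference in means (variances are not pooled).
se <- sqrt(v_riffle + v_pool)

# t statistic, in the same order as t.test(fish$riffle, fish$pool).
t_stat <- (mean(riffle) - mean(pool)) / se

# Welch-Satterthwaite approximation to the degrees of freedom.
df <- (v_riffle + v_pool)^2 /
  (v_riffle^2 / (length(riffle) - 1) + v_pool^2 / (length(pool) - 1))

# Two-sided p-value.
p_value <- 2 * pt(-abs(t_stat), df)

c(t = t_stat, df = df, p = p_value)
```

These values should agree with the t, df, and p-value lines of the t.test output.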

Confidence interval

The confidence interval is the range of plausible values for the difference in means. Zero is not in this interval. Therefore 0 is not a plausible value for the difference in means, so it is not plausible that the means are the same.
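The interval can also be pulled out of the test result and checked against zero in code. The sketch below stores the result of t.test() in an object (called res here, a name chosen for illustration) and rebuilds the interval as the difference in sample means plus or minus a t quantile times the standard error.

```r
# Counts copied from the printed data set.
pool   <- c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4)
riffle <- c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)

res <- t.test(riffle, pool)
ci  <- res$conf.int            # 95 percent interval for riffle minus pool

# Rebuild the interval by hand from the pieces of the Welch test.
se     <- sqrt(var(riffle) / length(riffle) + var(pool) / length(pool))
margin <- qt(0.975, df = res$parameter) * se
manual <- (mean(riffle) - mean(pool)) + c(-1, 1) * margin

ci
manual

# Is zero a plausible value for the difference in means?
zero_inside <- ci[1] <= 0 && 0 <= ci[2]
zero_inside
```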

Sample Estimates

We have concluded that the means are not equal, but we really want to know: is the mean of the number of fish in the pool more than the mean of the number of fish in the riffle? Knowing that the means are unequal, we can answer this question by looking at the sample estimates: the mean for “pool” (4.67) is more than the mean for “riffle” (2.47).

Conclusion

We have evidence that the mean of the pool is more than the mean of the riffle. This suggests that the pool may be a better environment than the riffle, one in which more fish can live, although the test itself only tells us that the mean counts differ.