One of the things you have to do here is decide how you are going to collect the data. How many subjects are you going to have? What groups are there going to be, and what are you going to do differently for each group (these are called different treatments)?
First, we load and print the data set.
fish<-read.csv("fish.csv")
fish
## location pool riffle
## 1 1 6 3
## 2 2 6 3
## 3 3 3 3
## 4 4 8 4
## 5 5 5 2
## 6 6 2 2
## 7 7 6 2
## 8 8 7 2
## 9 9 1 2
## 10 10 3 2
## 11 11 4 3
## 12 12 5 1
## 13 13 4 3
## 14 14 6 2
## 15 15 4 3
Now, look at the data set. There are 15 rows and 3 columns. The rows correspond to subjects (here, sampling locations) and the columns correspond to variables. The “location” column (1 to 15) gives the number of the location. The “pool” column tells us how many fish were counted in the pool. The “riffle” column tells us how many fish were counted in the riffle.
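The row and column counts can be checked directly. The sketch below rebuilds the data frame from the values printed above (an assumption made so the example runs without `fish.csv`) and inspects its shape.

```r
# Rebuild the data from the printed output above (stand-in for read.csv("fish.csv"))
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

dim(fish)    # 15 rows, 3 columns
names(fish)  # "location" "pool" "riffle"
str(fish)    # type of each variable
```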
The question is: is there evidence to conclude that the number of fish in the “pool” is greater than the number of fish in the “riffle”?
Usually we would design our study before we collect data. But the data have been given to us, so we will just graph the data before we plan the analysis.
We want to determine which location, the pool or the riffle, has the greater number of fish, in order to judge which provides the more suitable environment for fish.
library(ggplot2)
# Fish counted in the pool at each location
ggplot(data=fish,mapping=aes(x=location,y=pool))+geom_point()
# Fish counted in the riffle at each location
ggplot(data=fish,mapping=aes(x=location,y=riffle))+geom_point()
# Pool versus riffle counts; points above the line y = x have more fish in the pool
ggplot(data=fish,mapping=aes(x=riffle,y=pool))+geom_point()+
  geom_abline(slope=1,intercept=0)+
  annotate("text",x=1.25,y=4,label="More in Pool")+
  annotate("text",x=3.5,y=2,label="More in Riffle")
Most of the riffle counts are 2 or 3, while the pool counts range from 1 to 8, with most of them between 4 and 8. This suggests that the mean number of fish in the pool is greater than the mean number of fish in the riffle.
Null hypothesis: The population mean of the number of fish in the pool is the same as the population mean of the number of fish in the riffle. Usually this is stated as a difference in means (equal to zero).
Each group is a sample from a larger population. Specifically, the population of all locations that might conceivably be sampled.
Alternative hypothesis: There are two justifiable one-sided alternative hypotheses: (1) that the mean number of fish in the pool is larger than the mean number of fish in the riffle (the mean of pool minus the mean of riffle is greater than 0); (2) that the mean number of fish in the pool is smaller than the mean number of fish in the riffle (the mean of pool minus the mean of riffle is less than 0). I would recommend the two-sided alternative (that the difference in means is not 0) because we have already looked at the data, and it is always more conservative.
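The impression from the graphs can be backed up with a few summary numbers. This is a minimal sketch, assuming the data frame is rebuilt from the printed values; the names `pool_mean`, `riffle_mean` and `n_more_in_pool` are illustrative only.

```r
# Rebuild the data from the printed output (stand-in for read.csv("fish.csv"))
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Five-number summary and mean for each habitat
summary(fish[, c("pool", "riffle")])

pool_mean   <- mean(fish$pool)    # 70 / 15, about 4.67
riffle_mean <- mean(fish$riffle)  # 37 / 15, about 2.47

# Number of locations where the pool held more fish than the riffle
n_more_in_pool <- sum(fish$pool > fish$riffle)  # 12 of 15
```

These are sample summaries only; whether the difference holds in the population is what the hypothesis test addresses.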
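In R, the choice between these alternatives is made with the `alternative` argument of `t.test`. The sketch below assumes the data frame is rebuilt from the printed values and uses `paired=TRUE`, as the test run later in this report does. It only illustrates how the two choices relate and is not a suggestion to pick the alternative after seeing the data.

```r
# Rebuild the data from the printed output (stand-in for read.csv("fish.csv"))
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Two-sided alternative: mean(pool - riffle) is not 0
two_sided <- t.test(fish$pool, fish$riffle, paired = TRUE,
                    alternative = "two.sided")

# One-sided alternative (1): mean(pool - riffle) is greater than 0
one_sided <- t.test(fish$pool, fish$riffle, paired = TRUE,
                    alternative = "greater")

# When the observed difference lies in the direction of the one-sided
# alternative, the two-sided p-value is twice the one-sided p-value
c(two_sided = two_sided$p.value, one_sided = one_sided$p.value)
```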
Two sample. Because both counts come from the same location, the two samples are paired.
The choices here are the t-test and the proportion test. The t-test is for testing hypotheses about the population means of quantitative variables. Proportion tests are for testing hypotheses about the population proportions of categorical variables, so the correct choice is the t-test.
For the t-test, the main assumption is that the data lie close enough to a Normal (bell-shaped) distribution. How close do they have to be? It depends on the sample size: the greater the sample size, the more robust the t-test is to non-Normality. Even for small sample sizes (10 or 11) it is fairly robust, so unless there is strong skewness or substantial outliers we will be OK.
The best way of judging this is to use a Normal quantile-quantile (QQ) plot.
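# Inside aes(), color is a legend label rather than a literal colour, so label each series by name
ggplot(data=fish)+geom_qq(mapping=aes(sample=pool,color="pool"))+geom_qq(mapping=aes(sample=riffle,color="riffle"))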
gg <- ggplot(data=fish)
# QQ plot for the pool counts; refer to the column by name inside aes()
gg + geom_qq(mapping=aes(sample=pool))
As the graph shows, the distribution of the number of fish in the pool is approximately Normal, so the t-test assumption is reasonable for the “pool” group.
gg+geom_qq(mapping=aes(sample=riffle))
If the data are Normal, the points will lie close to a straight line. This graph shows that the riffle counts are probably not close enough to a line: the number of fish in the riffle is noticeably less Normal than the number of fish in the pool.
It is always a safe bet to use the traditional significance level of 0.05.
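Because the test used in this report is paired, the quantity the t-test actually works with is the difference at each location, so it may also be worth checking the differences. The sketch below is an addition and not part of the original analysis. It assumes the data frame is rebuilt from the printed values, and it uses base R's `qqnorm`, `qqline` and `shapiro.test` instead of ggplot2 so that it runs without extra packages.

```r
# Rebuild the data from the printed output (stand-in for read.csv("fish.csv"))
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Within-location differences: one value per location
d <- fish$pool - fish$riffle

# Normal QQ plot of the differences with a reference line
qqnorm(d, main = "Normal QQ plot of pool - riffle")
qqline(d)

# Shapiro-Wilk test as a rough numerical companion to the plot;
# with only 15 integer-valued differences it has limited power
sw <- shapiro.test(d)
sw$p.value
```

With small counts the points form steps because many values are tied, so the plot should be read for strong skewness or outliers, not for an exact fit to the line.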
t.test(fish$riffle,fish$pool,paired=TRUE)
##
## Paired t-test
##
## data: fish$riffle and fish$pool
## t = -4.5826, df = 14, p-value = 0.0004264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.229668 -1.170332
## sample estimates:
## mean of the differences
## -2.2
Since the p-value is less than the level of significance, we REJECT the null hypothesis that the means are equal.
The confidence interval is the range of plausible values for the difference in means. Zero is not in this interval. Therefore 0 is not a plausible value for the difference in means, so it is not plausible that the means are the same.
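To see where the numbers in the output come from, the paired t statistic, p-value and confidence interval can be reproduced by hand from the differences. This is a minimal sketch, assuming the data frame is rebuilt from the printed values; the variable names are illustrative.

```r
# Rebuild the data from the printed output (stand-in for read.csv("fish.csv"))
fish <- data.frame(
  location = 1:15,
  pool     = c(6, 6, 3, 8, 5, 2, 6, 7, 1, 3, 4, 5, 4, 6, 4),
  riffle   = c(3, 3, 3, 4, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3)
)

# Differences in the same order as t.test(fish$riffle, fish$pool, paired=TRUE)
d  <- fish$riffle - fish$pool
n  <- length(d)
se <- sd(d) / sqrt(n)              # standard error of the mean difference

t_stat  <- mean(d) / se            # about -4.5826
p_value <- 2 * pt(-abs(t_stat), df = n - 1)                   # two-sided p-value
ci      <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se    # 95% interval

t_stat
p_value
ci
```

The interval is the mean difference of -2.2 plus or minus the critical t value (14 degrees of freedom) times the standard error, which is why it excludes zero whenever the two-sided p-value is below 0.05.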
We have concluded that the means are not equal, but what we really want to know is whether the mean number of fish in the pool is greater than the mean number of fish in the riffle. Knowing that the means are unequal, we can answer this question by looking at the sample estimate: the mean of the differences (riffle minus pool) is -2.2, so “pool” is greater than “riffle”.
From the graph and the sample estimate, we can see that the sample mean number of fish in the pool is greater than the sample mean number of fish in the riffle (by about 2.2 fish per location). Therefore, it is plausible to conclude that the pool provides a better environment than the riffle and that fish prefer the pool to the riffle.