Set working directory
setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lab/Lab5_intro_to_t_test_and_hypothesis_testing")
These data are from Example 12.4 “So long; thanks to all the fish” in Whitlock and Shulter 2nd ed.
salmon <- read.csv("Lab5_data_brook_trout.csv")
Each line of data is a stream that either has introduced brook trout present or not present.
# size of dataframe
dim(salmon)
## [1] 12 3
#top of data
head(salmon)
## brook.trout.PRES.ABS salmon.released salmon.surv
## 1 present 820 166
## 2 present 960 136
## 3 present 700 153
## 4 present 545 103
## 5 present 769 173
## 6 present 1001 188
#bottom of data
tail(salmon)
## brook.trout.PRES.ABS salmon.released salmon.surv
## 7 absent 467 180
## 8 absent 959 178
## 9 absent 1029 326
## 10 absent 27 7
## 11 absent 998 120
## 12 absent 936 135
#summary of data
summary(salmon)
## brook.trout.PRES.ABS salmon.released salmon.surv
## absent :6 Min. : 27.0 Min. : 7.0
## present:6 1st Qu.: 661.2 1st Qu.:131.2
## Median : 878.0 Median :159.5
## Mean : 767.6 Mean :155.4
## 3rd Qu.: 969.5 3rd Qu.:178.5
## Max. :1029.0 Max. :326.0
The response variable in this study is the percent (%) of chinook salmon that survived over the course of the study. We therefore need to calcualte this value using R’s basic math function for divison.
percent.surv <-salmon$salmon.surv/salmon$salmon.released
The data are a percentage, so each datum should be between 0 and 1. We can look at the data using summary()
summary(percent.surv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1202 0.1753 0.1957 0.2147 0.2335 0.3854
We could also use range(), min(), and max(). to get this info. We could also make a histogram using hist().
We’ll call the new column “percent.surv” also.
salmon$percent.surv <- percent.surv
Note: this isn’t actually a percent, but a fraction. To get a percent we’d multiple it by 100.
#The whole datafame
salmon
## brook.trout.PRES.ABS salmon.released salmon.surv percent.surv
## 1 present 820 166 0.2024390
## 2 present 960 136 0.1416667
## 3 present 700 153 0.2185714
## 4 present 545 103 0.1889908
## 5 present 769 173 0.2249675
## 6 present 1001 188 0.1878122
## 7 absent 467 180 0.3854390
## 8 absent 959 178 0.1856100
## 9 absent 1029 326 0.3168124
## 10 absent 27 7 0.2592593
## 11 absent 998 120 0.1202405
## 12 absent 936 135 0.1442308
#summary()
summary(salmon)
## brook.trout.PRES.ABS salmon.released salmon.surv percent.surv
## absent :6 Min. : 27.0 Min. : 7.0 Min. :0.1202
## present:6 1st Qu.: 661.2 1st Qu.:131.2 1st Qu.:0.1753
## Median : 878.0 Median :159.5 Median :0.1957
## Mean : 767.6 Mean :155.4 Mean :0.2147
## 3rd Qu.: 969.5 3rd Qu.:178.5 3rd Qu.:0.2335
## Max. :1029.0 Max. :326.0 Max. :0.3854
We should explore the data using boxplots and histograms. The response variable is “percent.surv” and the **predictor variable* is whether introduced brook trout are present or absent.
boxplot(percent.surv~brook.trout.PRES.ABS,
data = salmon)
Median survival is a bit lower but there is not much difference.
Boxplots are adequate for looking at these data, but let’s practie make a 2-panel histrogram for practic. This requires using the subset() command to split the data up by group
Subset (split) from just the streams where brookies are present.
present <- subset(salmon,
select = c("brook.trout.PRES.ABS",
"percent.surv"),
brook.trout.PRES.ABS == "present")
Subset from just the streams where brookies are absetn
absent <- subset(salmon,
select = c("brook.trout.PRES.ABS",
"percent.surv"),
brook.trout.PRES.ABS == "absent")
First, we need to set up the plot window using the par() command. Then we can make the histograms. We specify the subset dataframe to use (present or absent) and the column within the dataframe $percent.surv
# the par() command
par(mfrow = c(1,2))
#left histogram
hist(present$percent.surv)
#right histogram
hist(absent$percent.surv)
par(mfrow = c(1,2))
#left histogram
hist(present$percent.surv,
main = "Brk Trout Present",
xlim = c(0,0.5),
xlab = "Percent surv")
#right histogram
hist(absent$percent.surv,
main = "Brk Trout Absent" ,
xlim = c(0,0.5),
xlab = "Percent surv")
We’ll use the tapply command. See https://rpubs.com/brouwern/groupeddata for more deetails
Calculate means and standard deviations for the brook trout present and brook trout absent rows of data.
NOTE: we’ll using the original salmon dataframe, NOT the subsets
#use tapply() on salmon dataframe
my.means <- tapply(salmon$percent.surv,
salmon$brook.trout.PRES.ABS,
FUN = mean,
na.rm = T)
#the result: both means calcualted
my.means
## absent present
## 0.2352653 0.1940746
#use tapply() to calcualte the sd
my.sd <- tapply(salmon$percent.surv,
salmon$brook.trout.PRES.ABS,
FUN = sd,
na.rm = T)
#the result
my.sd
## absent present
## 0.10369322 0.02978618
#sample size
n<- 6
#calculate se
my.se <- sqrt(my.sd)/n
Boxplot and histograms are good for exploring the oveall distribution of the data. We typically want to plot the mans with error bars when we are doing hypothesis testing.
From https://rpubs.com/brouwern/plot2means
#### The function STARTS here ####
plot2means <- function(means,
se,
groups = 2,
categories = c("Group 1","Group 2"),
x.axis.label = "x axis",
y.axis.label = "y axis",
axis.adjust = 0){
# reset plot window
par(mfrow = c(1,1),
mar = c(3.5,3,2,1))
# calculate values for plotting limits
y.max <- max(means+2*se) +
max(means+2*se)*axis.adjust
y.min <- min(means-2*se) -
max(means+2*se)*axis.adjust
if(groups == 2){ x.values <- c(0.25, 0.5)}
#Plot means
plot(means ~ x.values,
xlim = c(0.2,0.55),
ylim = c(y.min,y.max),
xaxt = "n",
xlab = "",
ylab = "",
cex = 1.25,
pch = 16)
axis(side = 1,
at = x.values,
labels = categories,
)
#Plot upper error bar
lwd. <- 2
arrows(y0 = means,
x0 = x.values,
y1 = means+2*se,
x1 = x.values,
length = 0,
lwd = lwd.)
#Plot lower error bar
arrows(y0 = means,
x0 = x.values,
y1 = means-2*se,
x1 = x.values,
length = 0,
lwd = lwd.)
mtext(text = x.axis.label,side = 1,line = 1.75)
mtext(text = y.axis.label,side = 2,line = 1.95)
mtext(text = "Error bars = 95% CI",side = 3,line = 0,adj = 0)
}
#### The function ENDS here ####
plot2means(means = my.means,
se = my.se,
categories = c("absent",
"present"),
x.axis.label = "Brook trout status",
y.axis.label = "Salmon survival"
)
We’ll use the main salmon dataframe
t.test(percent.surv ~ brook.trout.PRES.ABS,
data = salmon)
##
## Welch Two Sample t-test
##
## data: percent.surv by brook.trout.PRES.ABS
## t = 0.93521, df = 5.8196, p-value = 0.3868
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06739746 0.14977890
## sample estimates:
## mean in group absent mean in group present
## 0.2352653 0.1940746
Note that the output reports an interesting thing, the 95% confidence interval. Specifically, this is the 95% for the difference between the two mean values. The means are 0.235 and 0.194, and their difference is 0.041.
The *null hypothesis (Ho) is that this value is essentially equal to zero. 0.041 is close to zero, and we use the t-test and confidence intervals to determine if a difference of 0.041 could just result from random sampling error.
If we plot this difference and the confidence interval we get the graph below. The red line indicates where zero is on the y axis.
par(mfrow = c(1,2), mar = c(4,4,4,4))
errbar(x = 1:2,
y = my.means,
yplus = my.means + 2*my.se,
yminus = my.means -2*my.se,
xaxt = "n",ylab = "Survival",
xlim = c(0.75,2.25),
main = "Original means")
mtext("NOTE: y axis is from 0.15 to 0.35",side = 1, line = 2)
axis(side = 1,at = c(1,2),labels = c("absent","present"))
mtext("Plot of means of each group")
errbar(x = 1, y = mean1-mean2,
yplus = t.out$conf.int[[2]],
yminus = t.out$conf.int[[1]],
xaxt = "n",ylab = "Diff. btwn survival rates",
main = "Difference between means",
xlab = "")
abline(h=0, col = 2, lwd = 2,lty =2)
mtext("NOTE: y axis is from -0.05 to 0.15",side = 1, line = 2)
mtext("Plot difference between means")