Assignment 2

Part I: Further Reading

One determines whether two datasets come from the same underlying population by comparing their measures of central tendency and the shapes of their distributions. If both are similar, one may assume that the samples were drawn from the same population. Sample size and “noise” can have substantial effects on significance tests. A small sample size generates a wide confidence interval, whereas a larger sample size gives a more precise estimate of central tendency. Similarly, a large sample size reduces the influence of “noise”, or unexplained variation, on that estimate. It is often the investigator's decision what counts as “noise” in a population and what counts as statistically relevant data. If statistical analysis reveals that a null hypothesis cannot be rejected when comparing two groups, it means that the data are consistent with having been sampled from populations with similar measures of central tendency; it does not prove that the populations are identical. When statistical analyses yield a result that implies statistical significance, the next task is to determine the biological significance, as the two are neither synonymous nor analogous. One crucial measure for determining biological significance is the effect size, the calculated difference in central tendency between two groups. It focuses on the actual magnitude of the difference between the groups, whereas the outcome of a null hypothesis significance test also depends heavily on sample size.
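
The link between sample size and confidence interval width can be illustrated with a small sketch. The data below are simulated, not the kelp bass measurements, and the helper name `boot_median_ci` is made up for this example.

```r
# Illustrative only: simulated lengths, not the kelp bass data.
set.seed(1)

# Bootstrap a 95% percentile interval for the median of x.
boot_median_ci <- function(x, deals = 2000) {
  medians <- replicate(deals, median(sample(x, length(x), replace = TRUE)))
  quantile(medians, c(0.025, 0.975), names = FALSE)
}

small_sample <- rnorm(10, mean = 12, sd = 3)    # hypothetical n = 10
large_sample <- rnorm(1000, mean = 12, sd = 3)  # hypothetical n = 1000

ci_small <- boot_median_ci(small_sample)
ci_large <- boot_median_ci(large_sample)

# Width of each interval; the larger sample should give the narrower one.
width_small <- diff(ci_small)
width_large <- diff(ci_large)
print(c(n10 = width_small, n1000 = width_large))
```

Both samples are drawn from the same simulated population, so any difference in interval width comes from sample size alone.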

Part II: Data Analysis

setwd("/Users/jenniferpolson/Documents/School/2016S/Physci M200/Lab Projects")
#import the data
mydata = read.csv("kelpbass_data.csv")
#define your dataset
bassdata <- mydata$Total.Length

2.2.1 Figures Detailing Differences Between Male and Female Fish

bassmale <- subset(mydata, Sex == "M")
BS1M <- bassmale$Total.Length
bassfemale <- subset(mydata, Sex == "F")
BS2F <- bassfemale$Total.Length

options(digits=4)
par(mfrow=c(2,3),oma = c(0, 0, 2, 0))
#overlapping histogram
hist(BS1M, col=rgb(0,0,1,0.5), breaks=31, main="Fig. 1A Histogram", xlab="Length (inches)")
hist(BS2F, col=rgb(1,0,0,0.5), breaks=31, add=TRUE)
box()
legend("topright", c("Male", "Female"), col=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)), lwd=10)

#qqplot
#the reference line is fitted to the same subset that is plotted
qqnorm(BS1M, main= "Fig. 1B Male Normal Q-Q Plot", col=rgb(0,0,1,0.5));qqline(BS1M, lwd=2)
qqnorm(BS2F, main= "Fig. 1C Female Normal Q-Q Plot", col=rgb(1,0,0,0.5));qqline(BS2F, lwd=2)

#boxplot
boxplot(BS1M, BS2F, col=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)), names = c("Male", "Female"), 
        main="Fig. 1D Box Plots", ylab="Length (inches)")
sex <- list("Male"=BS1M, "Female"=BS2F)
stripchart(sex, method = "stack",
           vertical=TRUE, add=TRUE,
           col=rgb(0.5,0.5,0.5,0.3), pch=16)

#violin plot
library(vioplot)

# Set up plot 
plot(1, 1, main = "Fig. 1E Violin Plots",
     xlim = c(0, 2), ylim = range(c(BS1M,BS2F)), type = "n",
     xlab = "", ylab = "Length (inches)", xaxt = "n") 

# Specify axis labels 
axis(side = 1, at = c(0.5, 1.5), labels=c("Male", "Female"))
     #labels for the x axis (side = 1)

# Manually add each violin 
vioplot(BS1M, at = 0.5, col=rgb(0,0,1,0.5), add = TRUE) 
     #at = 0.5 > x axis position of the male violin
vioplot(BS2F, at = 1.5, col=rgb(1,0,0,0.5), add = TRUE)
     #at = 1.5 > x axis position of the female violin

#stripchart
sex <- list("Male"=BS1M, "Female"=BS2F)
stripchart(sex, method = "stack", vertical=TRUE,
           col=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)), pch=16, 
           main="Fig. 1F Stripcharts", 
           ylab="Length (inches)")
mtext("Figure 1: Comparison of Male and Female Kelp Bass Lengths", outer = TRUE, cex = 1, font = 2)

Figure 1 compares the lengths of male and female fish. The male and female subsets are shown in blue and red, respectively, in each panel. Figure 1A shows the data from male and female fish in an overlapping histogram. Figures 1B and 1C are the male and female normal quantile-quantile plots, respectively, which compare each subset to a normal distribution. The line indicates the normal distribution, and the dots indicate the quantile values of the actual datasets. Figure 1D represents the data using box plots. The box spans the 25th to 75th percentiles, with the black line indicating the median. The dashed whiskers extend to the most extreme data point that lies within 1.5 times the interquartile range of the quartiles, and the open dots mark data points that lie outside that range. Figure 1E shows the same data in violin plots. The white dot indicates the median, and the black box spans the lower and upper quartiles. Figure 1F is a stacked strip chart of the male and female data.
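
The whisker and outlier rule described for the box plots can be checked numerically with `boxplot.stats()`, which is what `boxplot()` uses internally. The vector below is a made-up example chosen to contain one outlier; it is not taken from the kelp bass data. Note that R uses hinges, which are close to but not always identical to the quartiles.

```r
# Hypothetical lengths (inches), chosen only to illustrate the 1.5 x IQR rule.
example_lengths <- c(9, 10, 10, 11, 11, 12, 12, 13, 14, 15, 24)

stats <- boxplot.stats(example_lengths, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)

# Points beyond the whiskers, drawn as individual dots on the box plot
print(stats$out)
```

Here the hinges are 10.5 and 13.5, so any point more than 4.5 inches beyond them is drawn individually; only the value 24 qualifies, and the whiskers stop at 9 and 15.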

Answer 1:

The null hypothesis is that there are no differences in length between male and female kelp bass.

To determine if this null hypothesis can be rejected, one must first visually examine the data. Both the male and female subsets are skewed such that the data below the median are much more concentrated than the data above it. Both datasets have one mode, between 10 and 15 inches in length. Both the male and female datasets have a few outliers, as shown by the box plots: the male dataset has one outlier, while the female dataset has two.

If a single figure had to represent the data, the violin plot would suffice to show the distribution of both subsets, because it captures nearly every attribute of the data that the individual graphs show. The histogram on its own does not capture the median or the first and third quartiles, and the normal Q-Q plots, while helpful statistically, do not show the data in an intuitive format. The box plots give the median and quartiles of each subset, but do not readily show the density. Finally, the strip chart shows the density of the data distribution, but does not display the quartiles as intuitively as the violin plot does. Therefore, the violin plot is the graph that most adequately represents the data.

A visual inspection of these data leads me to believe that these two samples could come from the same population. This is because the distributions overlap heavily and the medians are too close to conclude that the samples come from vastly different populations.

2.2.2 Two-group Comparison of Male and Female Fish

datasetA <- BS1M
datasetB <- BS2F

#"big box": pool both groups, then resample each group from the pool
deals = 10000
datajoined = c(datasetA, datasetB)
difference = rep(NA, deals)
for (i in 1:deals) {
  bootdataA = sample(datajoined, length(datasetA), replace = TRUE)
  bootdataB = sample(datajoined, length(datasetB), replace = TRUE)
  bootmedianA = median(bootdataA)
  bootmedianB = median(bootdataB)
  difference[i] = bootmedianA - bootmedianB
}

CI <- sort(difference)[c(.025*deals, .975*deals)]
differencemeasured <- median(datasetA)-median(datasetB)

#generate the p-value thresholds
HP <- sum(difference > abs(differencemeasured))
LP <- sum(difference < -abs(differencemeasured))

#calculate the pvalue
pvalue = (HP+LP)/deals

hist(difference,
     main="Figure 2: Histogram of Resampled Median Differences",
     xlab = "Difference")
abline(v = differencemeasured,
       col = "red",
       lwd = 2)
# confidence plots?
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Effect Size", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))

Figure 2 displays the histogram of the resampled differences of medians of male and female fish lengths from the “big box” comparison test. The heavy red line indicates the effect size (0.45), and the dashed lines indicate the 95% confidence interval (-1.1, 1.15).

Answer 2:

In order to compare the data, a measure of central tendency must be used. Because both distributions are skewed, the median is the most appropriate. In addition, based on visual inspection alone, one cannot conclude that these two samples come from different populations. Therefore, a “big-box” comparison will be used.

According to the bootstrap histogram in Figure 2, the effect size of this comparison was 0.45, with a 95% confidence interval of (-1.1, 1.15). The p-value is 0.4404. The bootstrapped confidence interval contains both 0 and the measured effect size, meaning that the data are consistent with there being no effect. Further, because the p-value is greater than 0.05, the null hypothesis cannot be rejected. Therefore, it seems that male and female kelp bass do not differ in length in a statistically significant way. While it is not prudent to make biologically significant determinations from statistical analyses, a median difference of 0.45 inches does not seem to be biologically significant, either.
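
The “big box” procedure used above can also be wrapped in a reusable function. The following is only a sketch: the function name `big_box_test`, its arguments, and the simulated input are assumptions made for illustration, and the p-value uses the same strict inequality as the code above.

```r
# Pooled ("big box") bootstrap test for a difference in medians.
# Function and argument names are illustrative, not from the assignment.
big_box_test <- function(a, b, deals = 10000) {
  pooled <- c(a, b)
  observed <- median(a) - median(b)
  # Resample both groups from the same pooled box, as the null assumes
  null_diffs <- replicate(deals, {
    median(sample(pooled, length(a), replace = TRUE)) -
      median(sample(pooled, length(b), replace = TRUE))
  })
  list(
    observed = observed,
    null_interval = quantile(null_diffs, c(0.025, 0.975), names = FALSE),
    # Two-sided: share of resampled differences more extreme than observed
    p_value = mean(abs(null_diffs) > abs(observed))
  )
}

set.seed(42)
group_a <- rnorm(60, mean = 12, sd = 3)  # simulated, not the fish data
group_b <- rnorm(50, mean = 12, sd = 3)  # simulated, same population
result <- big_box_test(group_a, group_b, deals = 2000)
print(result)
```

With the real data, the call would be `big_box_test(BS1M, BS2F)`; the result will vary slightly from run to run unless a seed is set.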

2.2.3A Figures Detailing Differences in Fish from Malibu and Catalina

bassmalibu <- subset(mydata, Location == "Malibu")
BL1 <- bassmalibu$Total.Length
basscatal <- subset(mydata, Location == "Catalina")
BL2 <- basscatal$Total.Length

par(mfrow=c(2,3),oma = c(0, 0, 2, 0))
#overlapping histogram
#colors match the legend and the other panels: Malibu green, Catalina yellow
hist(BL2, col=rgb(1,0.85,0,0.5), breaks=15, main="Fig. 3A Histogram", xlab="Length (inches)")
hist(BL1, col=rgb(0.07,0.59,0.42,0.5), breaks=10, add=TRUE)
box()
legend("topright", c("Malibu", "Catalina"), col=c(rgb(0.07,0.59,0.42,0.5), rgb(1,0.85,0,0.5)), lwd=10)

#qqplot
#the reference line is fitted to the same subset that is plotted
qqnorm(BL1, main= "Fig. 3B Malibu Normal Q-Q Plot", col=rgb(0.07,0.59,0.42,0.5));qqline(BL1, lwd=2)
qqnorm(BL2, main= "Fig. 3C Catalina Normal Q-Q Plot", col=rgb(1,0.85,0,0.5));qqline(BL2, lwd=2)

#boxplot
boxplot(BL1, BL2, col=c(rgb(0.07,0.59,0.42,0.5), rgb(1,0.85,0,0.5)), 
        names = c("Malibu", "Catalina"), main="Fig. 3D Box Plot",
        ylab="Length (inches)")
location <- list("Malibu"=BL1, "Catalina"=BL2)
stripchart(location, method = "stack",
           vertical=TRUE, add=TRUE,
           col=rgb(0.5,0.5,0.5,0.3), pch=16)

#violin plot
library(vioplot)

# Set up plot 
plot(1, 1, main = "Fig. 3E Violin Plots",
     xlim = c(0, 2), ylim = range(c(BL1,BL2)), type = "n",
     xlab = "", ylab = "Length (inches)", xaxt = "n") 

# Specify axis labels 
axis(side = 1, at = c(0.5, 1.5), labels=c("Malibu", "Catalina"))
     #labels for the x axis (side = 1)

# Manually add each violin 
vioplot(BL1, at = 0.5, col=rgb(0.07,0.59,0.42,0.5), add = TRUE) 
     #at = 0.5 > x axis position of the Malibu violin
vioplot(BL2, at = 1.5, col=rgb(1,0.85,0,0.5), add = TRUE)
     #at = 1.5 > x axis position of the Catalina violin

#stripchart
location <- list("Malibu"=BL1, "Catalina"=BL2)
stripchart(location, method = "stack", vertical=TRUE,
           col=c(rgb(0.07,0.59,0.42,0.5), rgb(1,0.85,0,0.5)), pch=16, 
           main="Fig. 3F Stripchart", 
           ylab="Length (inches)")
mtext("Figure 3: Comparison of Kelp Bass Lengths in Malibu and Catalina", outer = TRUE, cex = 1, font = 2)

Figure 3 compares the lengths of fish that were found in Malibu and Catalina. The Malibu and Catalina subsets are shown in green and yellow, respectively, in each panel. Figure 3A shows the data of fish from Malibu and Catalina in an overlapping histogram. Figures 3B and 3C are normal quantile-quantile plots of fish from Malibu and Catalina, respectively, which compare each subset to a normal distribution. The line indicates the normal distribution, and the dots indicate the quantile values of the actual datasets. Figure 3D represents the data using box plots. The box spans the 25th to 75th percentiles, with the black line indicating the median. The dashed whiskers extend to the most extreme data point that lies within 1.5 times the interquartile range of the quartiles, and the open dots mark data points that lie outside that range. Figure 3E shows the same data in violin plots. The white dot indicates the median, and the black box spans the lower and upper quartiles. Figure 3F is a stacked strip chart of the Malibu and Catalina data.

Both the Malibu and Catalina subsets are positively skewed, meaning that the tail above the median is longer than the tail below it. Both datasets also appear to be unimodal. Both have a few outliers, as evidenced by the box plots. Based on the distributions and the large difference in medians, it should be assumed, for comparison’s sake, that these samples do not come from the same population. Figure 3E makes it apparent that the medians are substantially different, and that the distributions are not similar enough to treat the samples as having originated from the same population.

2.2.3B Two-group Comparison of Fish from Malibu and Catalina

datasetA <- BL1
datasetB <- BL2

boxA = datasetA - median(datasetA) 
boxB = datasetB - median(datasetB) 
deals = 10000
difference = rep(NA,deals)
for (i in 1:deals) {
  bootdataA = sample(boxA,length(datasetA),replace=TRUE) 
  bootdataB = sample(boxB,length(datasetB),replace=TRUE) 
  bootmedianA = median(bootdataA)
  bootmedianB = median(bootdataB)
  difference[i] = bootmedianA-bootmedianB
}

CI <- sort(difference)[c(.025*deals, .975*deals)]
differencemeasured <- median(datasetA)-median(datasetB)

#generate the p-value thresholds
HP <- sum(difference > abs(differencemeasured))
LP <- sum(difference < -abs(differencemeasured))

#calculate the pvalue
pvalue = (HP+LP)/deals

hist(difference, xlim=c(-6,6),
     main="Figure 4: Histogram of Resampled Median Differences",
     xlab = "Difference")
abline(v = differencemeasured,
       col = "red",
       lwd = 2)
# confidence plots?
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Effect Size", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))

Figure 4 displays the histogram of the resampled differences of medians of Malibu and Catalina fish lengths from the “two box” comparison test. The heavy red line indicates the effect size (-4), and the dashed lines indicate the 95% confidence interval (-0.85, 1.1).

Based on the assumption that these two samples come from different populations, a “two box” comparison seems to be most appropriate.

According to the bootstrap histogram in Figure 4, the effect size of this comparison was -4, with a 95% confidence interval of (-0.85, 1.1). The p-value is 0, meaning that none of the 10,000 resampled differences was as extreme as the measured one (p < 0.0001). The null hypothesis is that there is no difference in median fish length between areas where there are MPAs and areas where there are not. Given that the effect size lies completely outside the confidence interval and that the p-value is very small, the null hypothesis can be rejected. Therefore, it appears that the presence of MPAs at a location can affect the median length of the fish measured there.
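
The “two box” procedure differs from the “big box” one only in how the null boxes are built, which a short sketch may make clearer. The function name `two_box_test`, the `p_floor` field, and the simulated site data are assumptions for illustration only.

```r
# Separate-box ("two box") bootstrap test for a difference in medians.
# Function and argument names are illustrative, not from the assignment.
two_box_test <- function(a, b, deals = 10000) {
  # Centre each group on its own median so both boxes satisfy the null
  # while keeping their own shapes and spreads.
  box_a <- a - median(a)
  box_b <- b - median(b)
  observed <- median(a) - median(b)
  null_diffs <- replicate(deals, {
    median(sample(box_a, length(a), replace = TRUE)) -
      median(sample(box_b, length(b), replace = TRUE))
  })
  extreme <- sum(abs(null_diffs) > abs(observed))
  list(
    observed = observed,
    p_value = extreme / deals,
    # Smallest nonzero p-value that this many deals can resolve
    p_floor = 1 / deals
  )
}

set.seed(7)
site_a <- rnorm(40, mean = 10, sd = 2)  # simulated, not the fish data
site_b <- rnorm(40, mean = 14, sd = 2)  # simulated, shifted by 4
result <- two_box_test(site_a, site_b, deals = 2000)
print(result)
```

When `p_value` comes back as exactly 0, it is safer to report it as being below `p_floor` than as zero, since a finite number of deals cannot distinguish a very small p-value from none at all.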