1. Further Reading: Presenting Your Data

Answer 1:
There are several crucial practices that many investigators overlook when describing the statistical methods of a study in a manuscript. The first is to detail how the data were obtained and analyzed. This is particularly important where units are concerned, and when the methods are unconventional. Additionally, the hypothesis must be explicitly stated, along with the variables needed to test it. In line with stating the methods, it is important to make clear the assumptions made in order to analyze the data, as many statistical tests rest on assumptions that are not immediately obvious. Similarly, when reporting statistical significance, the authors suggest listing the actual P-value or justifying the chosen threshold, which demonstrates an understanding of statistical significance from a theoretical standpoint. Because a small dataset may lack the power to reject a false null hypothesis, it cannot be assumed that two groups are equivalent simply because the null hypothesis is not rejected. Therefore, justification for sample size is also necessary to prevent statistical errors. If no effect is seen in one test, the authors strongly advise against running many additional correlational tests, as each additional test increases the chance of a false positive. If multiple tests are used, a correction for multiple comparisons helps guard against this. When choosing methods to analyze data, consider whether the observations are correlated in time or clustered, and pick a method that accounts for this. Finally, the authors highlight the importance of listing the name, version, and source of the software used for analysis; this would presumably encourage reproducibility.
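To illustrate the point about correcting for multiple tests, the following is a minimal sketch using base R's `p.adjust`. The five raw p-values are hypothetical and are not taken from the kelp bass data; the variable names are only for illustration.

```r
# Hypothetical raw p-values from five separate tests (illustrative only)
p_raw <- c(0.004, 0.020, 0.030, 0.045, 0.300)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Holm: a step-down procedure that is never more conservative than Bonferroni
p_holm <- p.adjust(p_raw, method = "holm")

# Compare how many tests stay below 0.05 before and after correction
print(data.frame(p_raw, p_bonf, p_holm))
n_sig_raw  <- sum(p_raw  < 0.05)
n_sig_bonf <- sum(p_bonf < 0.05)
```

In this made-up example, several uncorrected p-values fall below 0.05, but fewer remain below it after correction, which is the false-positive protection the authors describe.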

In addition to reporting central tendency and variability, there are other considerations when datasets are small or skewed. If the data is skewed, one option is to use the median as the measure of central tendency and quartiles to describe the spread of the data. Alternatively, the authors suggest transforming the data to a logarithmic scale, which can make right-skewed data more symmetrical; symmetry is an important assumption for many statistical analyses.
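As a rough illustration of both options, the sketch below summarizes a simulated right-skewed sample with its median and quartiles, and then checks how a log transform changes its skewness. The data are simulated (log-normal), not the kelp bass measurements, and `skewness` is a hypothetical helper defined here, not a base R function.

```r
set.seed(1)

# Simulated right-skewed measurements; NOT the kelp bass data
x <- rlnorm(200, meanlog = 2.7, sdlog = 0.4)

# Hypothetical helper: simple moment-based sample skewness
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Option 1: describe the skewed data with the median and quartiles
summary_raw <- c(median = median(x), quantile(x, c(0.25, 0.75)))
print(summary_raw)

# Option 2: log-transform, then compare skewness before and after
skew_raw <- skewness(x)
skew_log <- skewness(log(x))
print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero after the transform would suggest that methods assuming symmetry are more defensible on the log scale than on the raw scale.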

2.1 Analysis

2.1.1 Histogram and Normal Q-Q plot

#import the data
mydata = read.csv("kelpbass_data.csv")
#define your dataset
bassdata <- mydata$Total.Length

#make the histogram (2.1.1a)
par(mfrow=c(1,2))
hist(bassdata, breaks=10, 
     xlab= "Length (inches)", ylab= "Number of Bass", main= "Fig. 1A Length of Kelp Bass", 
     col="white")

#make the qqnorm plot with a line to show the data (2.1.1b)
qqnorm(bassdata, main= "Fig. 1B Normal Q-Q Plot");qqline(bassdata, col="blue", lwd=2)

Figure 1 depicts the lengths of the fish in the entire kelp bass dataset. Figure 1A shows the data as a histogram, with no measure of central tendency depicted. Figure 1B is a quantile-quantile plot that compares these data to a normal distribution. The blue line is the reference line that the points would follow if the data were normally distributed, and the dots indicate the quantile values of the actual dataset.

Answer 2.1:
According to Figures 1A and 1B, the data is not normally distributed. A curve that lies above the quantile line in Fig. 1B suggests that the distribution is skewed to the right; that is, a larger number of the measured fish had lengths toward the lower end. This is also seen in Fig. 1A, where the highest frequency of bass falls in the third bin of the histogram. That being said, there are few outliers in this data, and the data has only one mode. Given that kelp bass cannot be kept unless they are over 12 inches, it is reasonable that there is a larger number of kelp bass that are 12 inches or less. There is also a large number of fish whose recorded length was slightly above 12 inches; this also seems reasonable, as fishermen are likely to put a kelp bass back if it is close to the length limit, so as to avoid citation.
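A visual reading of a Q-Q plot can be complemented by a formal check. The sketch below applies base R's `shapiro.test` to simulated right-skewed data; it is only an illustration of the idea, and the simulated vector `sim_lengths` is an assumption, not the kelp bass data.

```r
set.seed(2)

# Simulated right-skewed lengths; NOT the kelp bass data
sim_lengths <- rlnorm(200, meanlog = 2.7, sdlog = 0.5)

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw <- shapiro.test(sim_lengths)
print(sw$p.value)

# A small p-value is evidence against normality
not_normal <- sw$p.value < 0.05
```

With large samples this test can flag even minor departures from normality, so it is best read alongside the histogram and Q-Q plot rather than instead of them.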

2.2 Central Tendency and Confidence Intervals

deals = 10000
len = length(bassdata)
bootmedian = rep (NA, deals)
for (i in 1 : deals) {
  bootdata = sample (bassdata, len, replace = TRUE)
  bootmedian[i] = median (bootdata)
} 

est.median <- median(bootmedian)
CI <- sort(bootmedian)[c(.025*deals, .975*deals)]

layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
hist(bassdata, breaks= 31, 
     xlab= "Length (inches)", main= "Fig. 2A Length of Kelp Bass", 
     col="white", 
     ylim= c(0,0.1),
     prob=TRUE)
#density plot 
lines(density(bassdata),
      col="blue")
#median plot
abline(v = median(bassdata),
       col = "red",
       lwd = 2)
# confidence plots?
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))

hist(bootmedian, breaks= 10, main= "Fig. 2B Resampled Bootstrap Data", xlab= "Length(inches)", ylab="Frequency of Median")
#median plot
abline(v = median(bootmedian),
       col = "red",
       lwd = 2)
# confidence plots?
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))
qqnorm(bassdata, main= "Fig. 2C Normal Q-Q Plot");qqline(bassdata, col="blue", lwd= 2)

Figure 2 depicts the lengths of the fish in the entire kelp bass dataset. Figure 2A shows the data as a histogram, with the median drawn as a solid red line and the 95% confidence interval limits as dashed red lines. Figure 2B shows the distribution of medians obtained when the data were resampled with replacement 10,000 times (a Monte Carlo bootstrap). The 2.5th and 97.5th percentiles of this distribution were then used as the confidence interval. Figure 2C is a quantile-quantile plot. The same methods were used as in Fig. 1B; the blue line indicates the normal distribution, and the dots indicate the actual values from the dataset.

Answer 2.2:

As the data is not symmetrically distributed, the median will be used to describe the central tendency of the data. As detailed above, the median length is 15.3 inches (95% CI: 14.9, 15.9).
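The resampling loop used above is repeated for every subset later in this report, so it could be wrapped in a small function. The following is a sketch only: `boot_median_ci` is a hypothetical helper, it uses `quantile` rather than indexing the sorted medians (so its limits may differ slightly from those reported here), and the demonstration data are simulated.

```r
# Hypothetical helper: percentile bootstrap confidence interval for the median
boot_median_ci <- function(x, n_boot = 10000, conf = 0.95) {
  x <- x[!is.na(x)]                       # drop missing lengths, if any
  boot_medians <- replicate(
    n_boot,
    median(sample(x, length(x), replace = TRUE))
  )
  alpha <- 1 - conf
  ci <- quantile(boot_medians, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  list(median = median(x), lower = ci[1], upper = ci[2])
}

# Demonstration on simulated data; NOT the kelp bass data
set.seed(42)                              # fixed seed so the result is repeatable
demo_lengths <- rlnorm(150, meanlog = 2.7, sdlog = 0.3)
res <- boot_median_ci(demo_lengths, n_boot = 2000)
print(unlist(res))
```

Setting a seed before resampling, as in the demonstration, is one way to make the reported interval reproducible from run to run.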

2.3 Comparison of Kelp Bass Lengths at Malibu Pier and Two Harbors, Catalina

#Comparing the data
bassmalibu <- subset(mydata, Location == "Malibu")
bassm <-bassmalibu$Total.Length
basscatal <- subset(mydata, Location == "Catalina")
bassc <-basscatal$Total.Length

len = length(bassc)
bootmedian = rep (NA, deals)
for (i in 1 : deals) {
  bootdata = sample (bassc, len, replace = TRUE)
  bootmedian[i] = median (bootdata)
} 

est.median <- median(bootmedian)
CI <- sort(bootmedian)[c(.025*deals, .975*deals)]

layout(matrix(c(1,2,3,4,5,6), 2, 3, byrow = TRUE))
hist(bassc, breaks= 10, 
     xlab= "Length (inches)", main= "Fig. 3A Kelp Bass in Catalina", 
     col="white",
     ylim= c(0,0.1),
     prob=TRUE)
#density plot 
lines(density(bassc),
      col="blue")
#median plot
abline(v = median(bassc),
       col = "red",
       lwd = 2)
# confidence plots?
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))

hist(bootmedian, breaks= 10, 
     main= "Fig. 3B Resampled Catalina Data", 
     xlab= "Length(inches)", 
     ylab="Frequency of Median",
     ylim= c(0,5000))
#median plot
abline(v = median(bootmedian),
       col = "red",
       lwd = 2)
# confidence plots
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))
qqnorm(bassc, main= "Fig. 3C Catalina Normal Q-Q Plot");qqline(bassc, col="blue", lwd= 2)

#### Plotting Malibu
len = length(bassm)
bootmedian = rep (NA, deals)
for (i in 1 : deals) {
  bootdata = sample (bassm, len, replace = TRUE)
  bootmedian[i] = median (bootdata)
} 

est.median <- median(bootmedian)
CI <- sort(bootmedian)[c(.025*deals, .975*deals)]

hist(bassm, breaks= 10, 
     xlab= "Length (inches)", main= "Fig. 3D Kelp Bass in Malibu", 
     col="white",
     xlim= c(5,35),
     prob=TRUE)
#density plot 
lines(density(bassm),
      col="blue")
#median plot
abline(v = median(bassm),
       col = "red",
       lwd = 2)
# confidence plots?
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))

hist(bootmedian, breaks= 20, 
     main= "Fig. 3E Resampled Malibu Data", 
     xlab= "Length(inches)", 
     ylab="Frequency of Median"
      )
#median plot
abline(v = median(bootmedian),
       col = "red",
       lwd = 2)
# confidence plots
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))
qqnorm(bassm, main= "Fig. 3F Malibu Normal Q-Q Plot");qqline(bassm, col="blue", lwd= 2)

Figure 3 depicts the lengths of fish that were measured in Two Harbors, Catalina, and at the Malibu Pier. Figures 3A and 3D show histograms of the Catalina and Malibu data, respectively, with the median drawn as a solid red line and the 95% confidence interval limits as dashed red lines. Figures 3B and 3E show the distribution of medians when the Monte Carlo method was applied as described in Fig. 2B. Figures 3C and 3F are quantile-quantile plots. The same methods were used as in Fig. 1B; the blue line indicates the normal distribution, and the dots indicate the actual values from the dataset.

Answer 3:

The same measure of central tendency will be used as before, as these are subsets of the data. As Figures 3A and 3D indicate, the distributions of these subsets are quite similar to that of the entire dataset. There are a few differences, however. The data in Figure 3A has a less obvious skew; in fact, it may appear bimodal in some visualizations. The distribution in Figure 3D is similar to that of the entire dataset in that it has one mode and is skewed to the right. The median of the sample data from Two Harbors, Catalina (Figure 3A) is 17.2 inches (95% CI: 16.5, 17.5), and from Malibu Pier (Figure 3D) is 13.2 inches (95% CI: 12.7, 13.7). These two medians are quite different from each other, in that the median length of a kelp bass from Two Harbors, Catalina is around 4 inches longer. The confidence intervals of these medians do not overlap at all, which suggests that there is a real difference between these two populations of fish. This difference can be rationalized by differences between the two locations. The most important of these is the high number of Marine Protected Areas (MPAs) located near Two Harbors, and the paucity of them near the Malibu Pier. As fishing is banned in MPAs, it would make sense that the fish living in or near these locales are able to grow longer without being caught by fishermen. Thus, the difference between the two populations is reasonable.
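Comparing two separate confidence intervals for overlap is an indirect way to compare medians. A more direct alternative, sketched here under stated assumptions, is to bootstrap the difference in medians itself and check whether its interval excludes zero. `boot_median_diff` is a hypothetical helper, and `site_a` and `site_b` are simulated stand-ins, not the Catalina and Malibu measurements.

```r
# Hypothetical helper: percentile bootstrap CI for median(a) - median(b)
boot_median_diff <- function(a, b, n_boot = 10000, conf = 0.95) {
  diffs <- replicate(
    n_boot,
    median(sample(a, length(a), replace = TRUE)) -
      median(sample(b, length(b), replace = TRUE))
  )
  alpha <- 1 - conf
  ci <- quantile(diffs, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  c(estimate = median(a) - median(b), lower = ci[1], upper = ci[2])
}

# Simulated stand-ins for two sites; NOT the kelp bass data
set.seed(7)
site_a <- rnorm(80, mean = 17, sd = 3)
site_b <- rnorm(80, mean = 13, sd = 3)

diff_ci <- boot_median_diff(site_a, site_b, n_boot = 2000)
print(diff_ci)
```

If the interval for the difference lies entirely above zero, that supports the conclusion that the first group has the larger median; each group is resampled separately, which keeps the two sample sizes intact.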

2.4 Comparison of Male and Female Kelp Bass Lengths

#Comparing the data between sexes
bassmale <- subset(mydata, Sex == "M")
bassmalelength <-bassmale$Total.Length
bassfemale <- subset(mydata, Sex == "F")
bassfemalelength <-bassfemale$Total.Length

## PLOTTING MALE
len = length(bassmalelength)
bootmedian = rep (NA, deals)
for (i in 1 : deals) {
  bootdata = sample (bassmalelength, len, replace = TRUE)
  bootmedian[i] = median (bootdata)
} 

est.median <- median(bootmedian)
CI <- sort(bootmedian)[c(.025*deals, .975*deals)]

par(mfrow=c(2,3))
hist(bassmalelength, breaks= 20, 
     xlab= "Length (inches)", main= "Fig. 4A Male Kelp Bass", 
     col="white",
     ylim = c(0,0.1),
     prob=TRUE)
#density plot 
lines(density(bassmalelength),
      col="blue")
#median plot
abline(v = median(bassmalelength),
       col = "red",
       lwd = 2)
# confidence plots?
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))

hist(bootmedian, breaks= 20, 
     main= "Fig. 4B Resampled Male Data", 
     xlab= "Length(inches)", 
     ylab="Frequency of Median",
     ylim= c(0,3000),
     xlim= c(13.5,17.5))
#median plot
abline(v = median(bootmedian),
       col = "red",
       lwd = 2)
# confidence plots
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))
qqnorm(bassmalelength, main="Fig. 4C Male Normal Q-Q Plot");qqline(bassmalelength, col="blue", lwd= 2)
#### Plotting Females
len = length(bassfemalelength)
bootmedian = rep (NA, deals)
for (i in 1 : deals) {
  bootdata = sample (bassfemalelength, len, replace = TRUE)
  bootmedian[i] = median (bootdata)
} 

est.median <- median(bootmedian)
CI <- sort(bootmedian)[c(.025*deals, .975*deals)]

hist(bassfemalelength, breaks= 10, 
     xlab= "Length (inches)", main= "Fig. 4D Female Kelp Bass", 
     col="white", 
     xlim= c(5,35),
     prob=TRUE)
#density plot 
lines(density(bassfemalelength),
      col="blue")
#median plot
abline(v = median(bassfemalelength),
       col = "red",
       lwd = 2)
# confidence plots?
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))

hist(bootmedian, breaks= 10, 
     main= "Fig. 4E Resampled Female Data", 
     xlab= "Length(inches)", 
     ylab="Frequency of Median",
     ylim= c(0,3000),
     xlim= c(13.5,17.5))
#median plot
abline(v = median(bootmedian),
       col = "red",
       lwd = 2)
# confidence plots
abline(v = CI[1],
       col = "red",
       lwd = 1,
       lty = 2)
abline(v = CI[2],
       col = "red",
       lwd = 1,
       lty = 2)
legend("topright", legend = c("Median", "95%CIs"), col = "Red", lty = c(1,2), lwd = c(2,1))
qqnorm(bassfemalelength, main="Fig. 4F Female Normal Q-Q Plot");qqline(bassfemalelength, col="blue", lwd= 2)

Figure 4 compares the lengths of male and female fish. Figures 4A and 4D show histograms of the male and female data, respectively, with the median drawn as a solid red line and the 95% confidence interval limits as dashed red lines. Figures 4B and 4E show the distribution of medians when the Monte Carlo method was applied as described in Fig. 2B. Figures 4C and 4F are quantile-quantile plots. The same methods were used as in Fig. 1B; the blue line indicates the normal distribution, and the dots indicate the actual values from the dataset.

Answer 4:

The same measure of central tendency will be used as before, as these are subsets of the data. Both the male and female data follow trends found in the overall dataset: they each appear to have one mode, and they are skewed to the right, as indicated by Figures 4C and 4F. The median of the sample data from male kelp bass is 15.55 inches (95% CI: 15, 16.7), and the sample median for the data from female kelp bass is 15.1 inches (95% CI: 14.5, 15.7). The overlap of these confidence intervals is substantial. Due to this overlap, it cannot be concluded that these two groups are significantly different; other tests may be necessary to determine whether there is in fact a statistical difference. While one might expect the lengths of these two groups to differ, because there are many examples in nature in which the male is larger than the female, this is not always true; in fact, there are many fish species in which there is no sexual dimorphism with regard to length or size.
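One candidate for such a follow-up test is the Wilcoxon rank-sum test, which does not assume normality. The sketch below uses base R's `wilcox.test` on simulated data; `male_len` and `female_len` are hypothetical stand-ins, not the kelp bass measurements, and this is only one possible choice of test.

```r
# Simulated stand-ins for male and female lengths; NOT the kelp bass data
set.seed(3)
male_len   <- rnorm(60, mean = 15.5, sd = 3)
female_len <- rnorm(60, mean = 15.1, sd = 3)

# Two-sided Wilcoxon rank-sum (Mann-Whitney) test
wt <- wilcox.test(male_len, female_len)
print(wt$p.value)

# Decision at a pre-specified 0.05 threshold
reject_null <- wt$p.value < 0.05
```

Strictly speaking, this test compares the two distributions for a shift in location rather than the medians themselves, so its result should be interpreted alongside the bootstrap intervals rather than as a replacement for them.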