Excerpt

Restrict your answers to a test of correlation and a test of regression and the data and information presented in Table 2 and Fig 2b, as well as text information on correlation on pg 205 (left column, last paragraph) highlighted in yellow. You will need the data provided in Table 2. You will concern yourself with copepod data, not flatworm data, and concern yourself with the traits genome size and body size.

Started off by inputting the data from Table 2:

library(tidyverse)  # tibble(), rename(), %>% and ggplot()
library(knitr)      # kable()
library(broom)      # augment()

species <- c("Diaptomus forbesii", "Diaptomus insularis", "Diaptomus leptopus", "Diaptomus nudus", "Diaptomus sicilis", "Eurytemora composita", "Hersperodiaptomus n. sp.", "Hesperodiaptomus arcticus", "Hesperodiaptomus nevadensis", "Hesperodiaptomus shoshone", "Hesperodiaptomus vicoriaensis", "Hesperoscope septentrionalis", "Leptodiaptomus tyrrelli", "Leptodiaptomus wilsonae", "Limnocalanus macrurus", "Osphranticum labronectum")
hw_x <- c(7.62, 3.82, 5.54, 6.66, 3.54, 1.58, 11.08, 9.34, 11.42, 6.22, 8.74, 10.90, 2.66, 6.50, 3.26, 4.90)
hw_y <- c(1.54, 0.90, 2.08, 1.60, 1.20, 1.16, 3.23, 3.23, 3.15, NA, 2.70, 3.35, 1.60, 1.26, 2.00, 1.25)

copepod_table <- tibble(species, hw_x, hw_y) %>% rename("Species" = species, "Genome Size (pg)" = hw_x, "Body Size (mm)" = hw_y)

kable(copepod_table, align = "c")

Species	Genome Size (pg)	Body Size (mm)
Diaptomus forbesii	7.62	1.54
Diaptomus insularis	3.82	0.90
Diaptomus leptopus	5.54	2.08
Diaptomus nudus	6.66	1.60
Diaptomus sicilis	3.54	1.20
Eurytemora composita	1.58	1.16
Hersperodiaptomus n. sp.	11.08	3.23
Hesperodiaptomus arcticus	9.34	3.23
Hesperodiaptomus nevadensis	11.42	3.15
Hesperodiaptomus shoshone	6.22	NA
Hesperodiaptomus vicoriaensis	8.74	2.70
Hesperoscope septentrionalis	10.90	3.35
Leptodiaptomus tyrrelli	2.66	1.60
Leptodiaptomus wilsonae	6.50	1.26
Limnocalanus macrurus	3.26	2.00
Osphranticum labronectum	4.90	1.25

Question 1

Verbally state the null and alternative hypotheses tested for both correlation and regression. State the hypotheses in statistical terms also. Justify your choice of the alternative hypothesis, based on the authors’ information in the text.

Correlation:

Null Hypothesis: There is no correlation between genome size and body size.

\[H_0: \rho = 0\]

Alternate Hypothesis: Genome size is correlated with body size.

\[H_a: \rho \neq 0\]

Personal question: from reading the intro, I got the sense that the authors' alternate hypothesis was that body size is positively correlated with genome size. Would this be a better way to state the hypothesis in statistical terms?

This should be a one-sided test! Thus, the one-sided alternative hypothesis below is correct, not the two-sided one in the original answer.

\[H_a: \rho > 0\]
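
For reference, the usual test statistic for this hypothesis (a standard result, not something stated by the authors) is computed from the sample correlation r and the number of complete pairs n:

\[t = r\sqrt{\frac{n-2}{1-r^2}}\]

Under the null hypothesis, t follows a t-distribution with n-2 degrees of freedom. With the one-sided alternative, the null hypothesis is rejected only when t is large and positive.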

Regression:

Null Hypothesis: There is no linear relationship between genome size and body size.

\[H_0: \beta = 0\]

Alternate Hypothesis: Genome size has a linear relationship with body size.

\[H_a: \beta \neq 0\]

Note that you can specify the slope ‘DIRECTION’ in the null and alternate hypotheses (for example, a one-sided alternative for a positive slope). The slope is tested using the t-distribution with n-2 degrees of freedom. If you have multiple independent variables, you always want to test for BOTH correlation and regression! Correlation will tell you whether the relationship exists, while regression will allow you to predict values based on that relationship.

Question 2

Define the populations, and sample studied. Same answer for regression & correlation

population: The populations studied were the invertebrate populations of turbellarian flatworms (not of interest for this assignment) and copepods (Order: Calanoida).

sample: The researchers sampled 16 species of copepods from “natural populations” in a range of different locations (Texas, Ontario, Wyoming, etc.). The species corresponding to each sample location are listed in Table 2.

Question 3

Briefly comment on the methods and experimental design as they relate to correlation & regression analyses (same answer for correlation & regression analyses):

Collection methods seemed vague. Nothing was said about random sampling or methods to avoid bias in sampling. However, for the purposes of this assignment, we will assume sampling was random (see assumptions later). The head-to-tail measurements of adult copepods seemed simple and reliable, which is good in case other researchers repeat the experiment. It was also interesting that genome size was quantified from the appearance of nuclei under a microspectrophotometer. Personally, I think a real-time quantitative PCR analysis would have been a more precise measurement of relative genome size, but if the authors cite previous work showing that microspectrophotometer analysis of nuclei tracks genome size, then the method seems appropriate.

Question 4

List the assumptions or necessary conditions for appropriate application of the correlation and regression analyses. How did the author’s study or data satisfy or fail to satisfy these assumptions?

Correlation:

1. Data comes from random samples - As stated in the previous question, the researchers gave no indication of a sampling strategy that would ensure random sampling. However, for the purpose of this assignment, we will proceed assuming randomness in the sampling.

2. Measurements have a bivariate normal distribution in the population. This includes the following features:

a. The relationship between genome size and body size is linear - I wasn’t entirely sure how to test this without including the test of linear regression…

b. The “cloud” of points in a scatter plot of genome size and body size has a circular or elliptical shape - The best way I saw to test this was to plot the variables:

ggplot(copepod_table, aes(x = `Genome Size (pg)`, y = `Body Size (mm)`)) + geom_point() + theme_classic() + xlab("Genome Size (pg)") + ylab("Body Size (mm)")

## Warning: Removed 1 rows containing missing values (geom_point).

This scatter plot may indicate the presence of an “elliptical shape”.

c. The frequency distributions of genome size and body size separately are normal - I tested this using a Shapiro-Wilk test for each of the variables:

shapiro_genome <- shapiro.test(copepod_table$`Genome Size (pg)`)$p.value
shapiro_body <- shapiro.test(copepod_table$`Body Size (mm)`)$p.value

kable(tibble(shapiro_genome, shapiro_body), col.names = c("Genome Size p-value", "Body Size p-value"))

Genome Size p-value	Body Size p-value
0.4928575	0.0308535

As you can see here, the Shapiro-Wilk test rejects normality for the body size variable (p = 0.03). I can check this again after applying a log10 transformation:

perm_genome <- log10(copepod_table$`Genome Size (pg)`)
perm_body <- log10(copepod_table$`Body Size (mm)`)

kable(tibble(shapiro.test(perm_genome)$p, shapiro.test(perm_body)$p))

shapiro.test(perm_genome)$p	shapiro.test(perm_body)$p
0.3668055	0.1457316

log10_copepod <- tibble(species, perm_genome, perm_body)

The log10 transformation was successful; the Shapiro-Wilk test no longer rejects normality for either variable (both p > 0.05). The transformed data are used for the rest of the analysis.
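
The log10 transformation above was picked by trial. As a rough cross-check, a Box-Cox profile can suggest a power transformation for the response. The sketch below is only an illustration: it assumes the MASS package is available, re-enters the Table 2 values under hypothetical names (`genome_pg`, `body_mm`, `raw`, `raw_model`), and note that Box-Cox transforms only the response (body size), whereas the analysis here transforms both variables.

```r
library(MASS)  # provides boxcox(); MASS ships with standard R installations

# Table 2 values as entered earlier (genome size in pg, body size in mm)
genome_pg <- c(7.62, 3.82, 5.54, 6.66, 3.54, 1.58, 11.08, 9.34, 11.42, 6.22,
               8.74, 10.90, 2.66, 6.50, 3.26, 4.90)
body_mm <- c(1.54, 0.90, 2.08, 1.60, 1.20, 1.16, 3.23, 3.23, 3.15, NA,
             2.70, 3.35, 1.60, 1.26, 2.00, 1.25)

# Keep complete pairs only (one species has no body size)
raw <- na.omit(data.frame(genome_pg, body_mm))

# Fit the untransformed model; y = TRUE and qr = TRUE store what boxcox() needs
raw_model <- lm(body_mm ~ genome_pg, data = raw, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(raw_model, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# Lambda with the highest profile log-likelihood on the grid
best_lambda <- bc$x[which.max(bc$y)]
best_lambda
```

A value of lambda near 0 corresponds to a log transformation, near 0.5 to a square root, and near 1 to no transformation. With only 15 observations the profile is usually flat, so treat the result as a hint rather than a decision rule.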

Regression:

At each value of X, there is a population of possible Y-values whose mean lies on the true regression line (i.e., the relationship must be linear) - A good way to check this is by visually inspecting the scatterplot of the data:

ggplot(log10_copepod, aes(x = perm_genome, y = perm_body)) +
  geom_point() + labs(x = "Log10 of Genome Size (pg)", y = "Log10 of Body size (mm)") + geom_smooth(method = "lm", se = FALSE, colour = "black") + theme_classic()

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

The data seem to fit the linear regression model fairly well (except for one outlier).

At each value of X, the distribution of possible Y-values is normal
The variance of Y-values is the same at all values of X

The two assumptions above can be confirmed using visual interpretation of a residual plot. This is the residual plot I generated from this data:

copepod_Lmodel <- lm(perm_body ~ perm_genome)
df <- augment(copepod_Lmodel)

ggplot(df, aes(x = .fitted, y = .resid)) + geom_point() + theme_classic() +
  geom_hline(yintercept = 0, linetype = 2) + labs(x = "Fitted values", y = "Residuals")

NOTE: You can use the boxcox() function (from the MASS package) to help determine which transformation is best.

The residual plot reveals an asymmetric “cloud” of points above and below the horizontal line, which may indicate a violation of normality and/or equal variance. However, for the purpose of this assignment, we will proceed with the regression analysis.

At each value of X, the Y-measurements represent a random sample from the population of possible Y-values- This was not explicitly stated in the methodology of the authors. For the purpose of the homework assignment, we will presume that this assumption is correct.

Question 5

State the degrees of freedom (df) in the correlation analysis and in the testing of the slope in the regression analysis.

Correlation:

\[d.f. = n-2\]

With n = 15 complete pairs (16 species, one of which is missing a body size), d.f. = 15 - 2 = 13 for our test.

Regression:

\[d.f. = n-2\]

So once again, d.f. = 13

NOTE: Authors didn’t explicitly state a sample size for each experiment. Also note that there is a sample missing in the table.
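
Because of the missing body size, n is the number of complete pairs, not the number of rows in the table. A minimal base-R sketch of that count, with the Table 2 values re-entered under hypothetical names (`genome_pg`, `body_mm`):

```r
# Table 2 values as entered earlier (genome size in pg, body size in mm)
genome_pg <- c(7.62, 3.82, 5.54, 6.66, 3.54, 1.58, 11.08, 9.34, 11.42, 6.22,
               8.74, 10.90, 2.66, 6.50, 3.26, 4.90)
body_mm <- c(1.54, 0.90, 2.08, 1.60, 1.20, 1.16, 3.23, 3.23, 3.15, NA,
             2.70, 3.35, 1.60, 1.26, 2.00, 1.25)

# Count the species for which both traits were measured
n <- sum(complete.cases(genome_pg, body_mm))

# Both the correlation test and the slope test use n - 2 degrees of freedom
df_resid <- n - 2
c(n = n, df = df_resid)
```

This gives n = 15 and 13 degrees of freedom, matching the `df = 13` reported by `cor.test()` and `summary()` in Question 7.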

Question 6

Determine the critical t statistic for the regression analysis.

qt(p = 0.95, df = 13)

## [1] 1.770933
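
The call above uses p = 0.95, which puts all of alpha = 0.05 in the upper tail and so matches a one-sided alternative. For comparison, a small sketch of both critical values, assuming alpha = 0.05 (the variable names are illustrative):

```r
alpha <- 0.05
df_resid <- 13

# One-sided test (H_a: slope > 0): all of alpha in the upper tail
t_one_sided <- qt(1 - alpha, df = df_resid)

# Two-sided test (H_a: slope != 0): alpha split between both tails
t_two_sided <- qt(1 - alpha / 2, df = df_resid)

c(one_sided = t_one_sided, two_sided = t_two_sided)
```

The two-sided critical value is larger (roughly 2.16), so an observed t statistic has to be more extreme to reject the null hypothesis under a two-sided test.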

Question 7

Perform the correlation and simple linear regression analyses using data provided in Table 2. As part of performing these formal regression analyses, plot the residuals versus y; residuals versus X; and a Q-Q plot. Consider whether the authors treated their raw data prior to statistical analysis. Do you agree with their statistical conclusions?

Correlation Analysis:

cor.test(perm_body, perm_genome)

## 
##  Pearson's product-moment correlation
## 
## data:  perm_body and perm_genome
## t = 3.854, df = 13, p-value = 0.001992
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3482789 0.9042533
## sample estimates:
##       cor 
## 0.7302564

The two-sided p-value for this correlation analysis was 0.002, so we can reject the null hypothesis; because r is positive, the one-sided p-value for the alternative of a positive correlation is half of this (about 0.001). This supports the assertion that genome size is positively correlated with body size. Since r = 0.730, we can also say that this was a moderately strong correlation (on the log10 scale).
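
By default `cor.test()` runs a two-sided test, while the alternative hypothesis in Question 1 is one-sided. A sketch of how the one-sided version could be requested with the `alternative` argument, with the Table 2 values re-entered under hypothetical names (`genome_pg`, `body_mm`, `log_genome`, `log_body`):

```r
# Table 2 values as entered earlier (genome size in pg, body size in mm)
genome_pg <- c(7.62, 3.82, 5.54, 6.66, 3.54, 1.58, 11.08, 9.34, 11.42, 6.22,
               8.74, 10.90, 2.66, 6.50, 3.26, 4.90)
body_mm <- c(1.54, 0.90, 2.08, 1.60, 1.20, 1.16, 3.23, 3.23, 3.15, NA,
             2.70, 3.35, 1.60, 1.26, 2.00, 1.25)

# Same log10 transformation as in Question 4
log_genome <- log10(genome_pg)
log_body <- log10(body_mm)

# Default two-sided test (H_a: rho != 0); incomplete pairs are dropped
two_sided <- cor.test(log_body, log_genome)

# One-sided test matching H_a: rho > 0
one_sided <- cor.test(log_body, log_genome, alternative = "greater")

c(two_sided = two_sided$p.value, one_sided = one_sided$p.value)
```

The estimate of r and the t statistic are identical in both calls; only the p-value and the confidence interval change.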

Regression Analysis:

summary(copepod_Lmodel)

## 
## Call:
## lm(formula = perm_body ~ perm_genome)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2186 -0.1180  0.0618  0.1022  0.1658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.1456     0.1123  -1.297  0.21720   
## perm_genome   0.5472     0.1420   3.854  0.00199 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1357 on 13 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5333, Adjusted R-squared:  0.4974 
## F-statistic: 14.85 on 1 and 13 DF,  p-value: 0.001992

The summary output indicates that we can’t reject the null hypothesis that the y-intercept is 0 (p = 0.217). The estimated slope for log10(genome size) was 0.5472, and we can reject the null hypothesis of a zero slope (t = 3.854, d.f. = 13, p = 0.002): log10 genome size has a linear relationship with log10 body size. Note that R-squared = 0.5333.
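
Question 7 also asks for residual plots and a Q-Q plot. One possible way to draw them in base R is sketched below. It refits the same log10 model from the Table 2 values under hypothetical names (`d`, `fit`, `res`), and it reads "residuals versus y" as residuals versus fitted values, which is an assumption about what the question intends.

```r
# Table 2 values as entered earlier (genome size in pg, body size in mm)
genome_pg <- c(7.62, 3.82, 5.54, 6.66, 3.54, 1.58, 11.08, 9.34, 11.42, 6.22,
               8.74, 10.90, 2.66, 6.50, 3.26, 4.90)
body_mm <- c(1.54, 0.90, 2.08, 1.60, 1.20, 1.16, 3.23, 3.23, 3.15, NA,
             2.70, 3.35, 1.60, 1.26, 2.00, 1.25)

# log10-transform both traits and keep complete pairs only
d <- na.omit(data.frame(log_genome = log10(genome_pg),
                        log_body = log10(body_mm)))

fit <- lm(log_body ~ log_genome, data = d)
res <- resid(fit)

# Three diagnostic panels side by side
op <- par(mfrow = c(1, 3))

# 1. Residuals versus fitted values: look for curvature or a funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# 2. Residuals versus X: same check against the predictor
plot(d$log_genome, res, xlab = "Log10 genome size (pg)", ylab = "Residuals")
abline(h = 0, lty = 2)

# 3. Normal Q-Q plot: points near the line suggest roughly normal residuals
qqnorm(res)
qqline(res)

par(op)
```

In a simple linear regression the first two panels show the same pattern, because the fitted values are a linear function of X.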

Question 8

Verbally state the results of the authors’ correlation analyses. State the statistical model that corresponds to the authors’ regression results. Did the statistical analyses support the conclusion and why or why not?

Correlation:

The correlation test used the t-distribution to reveal that there was a statistically significant correlation between genome size (pg) and body size (mm). This correlation was moderately strong, supporting the conclusions that the authors of the publication made.

Regression:

The test of regression also used the t-distribution to indicate that genome size has a statistically significant linear relationship with body size (both log10-transformed). Approximately 53.33% of the variation in log10 body size can be explained by a linear regression model using log10 genome size as the independent variable.

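Add the regression model to this conclusion:

\[\hat{Y} = 0.5472\,X - 0.1456\]

where Y is log10 body size (mm) and X is log10 genome size (pg).

For every 1-unit increase in log10 genome size (a tenfold increase in genome size), the predicted log10 body size goes up by 0.5472, which corresponds to body size being multiplied by about 3.5.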
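
Because both variables were log10-transformed before fitting, the coefficients from the summary output in Question 7 describe a relationship on the log scale. The sketch below back-transforms them to the original units; `predict_body_mm()` is a hypothetical helper written for illustration, not part of the original analysis.

```r
# Coefficients copied from the summary() output in Question 7
b0 <- -0.1456  # intercept on the log10 scale
b1 <- 0.5472   # slope on the log10 scale

# Multiplicative change in predicted body size per tenfold increase in genome size
fold_change <- 10^b1
fold_change

# Hypothetical helper: predicted body size (mm) for a genome size given in pg
predict_body_mm <- function(genome_pg) {
  10^(b0 + b1 * log10(genome_pg))
}

# Predictions for two genome sizes that differ by a factor of ten
predict_body_mm(c(2, 20))
```

The ratio of the two predictions equals `fold_change`, which is why the slope is better read as a multiplicative effect than as an increase in millimetres.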

Question 9

Regarding the linear regression analyses, were statistics appropriate and necessary to derive the conclusion stated by the authors? Why or why not?

Yes, statistics were necessary. The t-test on the slope tells us how likely a linear relationship as strong as the one seen in the sample would be if there were no linear relationship in the population of interest, and this is what allows a conclusion about the sampled species to be extended to that population.

You must also say that you used the F-distribution
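
In a simple linear regression with one predictor, the F-test for the model and the t-test for the slope are equivalent: F equals t squared and the p-values match. A quick check using the rounded values from the summary output in Question 7 (variable names are illustrative):

```r
# Values copied from the summary() output in Question 7
t_slope <- 3.854
df_resid <- 13

# With a single predictor, the model F statistic is the squared slope t statistic
f_stat <- t_slope^2
f_stat

# Two-sided p-value from the t-distribution
p_from_t <- 2 * pt(abs(t_slope), df = df_resid, lower.tail = FALSE)

# p-value from the F-distribution with 1 and 13 degrees of freedom
p_from_f <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

c(t_test = p_from_t, f_test = p_from_f)
```

The squared t statistic agrees with the reported F-statistic of 14.85 up to rounding, and both p-values agree with the reported 0.001992.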

Regression and Correlation Homework Guide

Jeremy Bravo

April 17, 2019
