Markdown Author: Jessie Bell, 2023
Libraries Used:
library(ggplot2)
library(gt)
library(ggpubr) #for ggarrange
library(GGally) #yahboiiii
library(reshape2)
library(knitr)
library(ggcorrplot) #for assessing colinearity
library(plotly) #for 3d plot
library(webshot) #for printing 3d plot into pdf
Answers: plum
Using the Ocean Indexes data, evaluate the correlation between sea surface temperature (SST) and the Pacific Decadal Oscillation (PDO). (You will have to select these variables from the larger data set.)
Plot the data and make a guess for the r-value. Explain why you chose this.
Answers: I think the size of the correlation between sea surface temperature and the pacific decadal oscillation is about 0.6 and the direction is positive. I think this because there does seem to be some precise points following an positive sloping line along the axes.
# code
oceansData <- read.csv("Ocean Indexes.csv")
ggplot(oceansData, aes(PDO, SST))+
geom_point(color="#578493")+
labs(title="Strength and Direction between PDO and SST",
x="Pacific Decadal Oscillation",
y=expression("Sea Surface Temp. " ( degree~C)))
Figure 1: Checking out the data to see if there is a relationship between the Pacific Decadal Oscillation (PDO) and sea surface temperatures (°C) (SST). It looks as if there is a positive direction to this graph with a strength of about 0.6.
Are these two variables correlated? Why or why not? Provide statistical support and explain in plain language what it means.
H0: \(\rho\) \(=\) 0
HA: \(\rho\) \(\neq\) 0
Answers: According to the correlation test output (step 3 below), we can reject the null hypothesis in favor of the alternative hypothesis. There is a weak and significant positive correlation between PDO and SST because the p-value < 0.05 and r is 0.478.I think the size of the correlation between sea surface temperature and the pacific decadal oscillation is about 0.6 and the direction is positive. I think this because there does seem to be some precise points following an positive sloping line along the axes.
#code
#testing assumption of normality for PDO data
one <- ggplot(oceansData, aes(PDO))+
geom_histogram(fill="pink", bins=20)+
labs(title="PDO Distribution", x="Pacific Decadal Oscillation")
two <- ggplot(oceansData, aes(SST))+
geom_histogram(fill="plum", bins=15)+
labs(title="SST Distribution", x=expression("Sea Surface Temp. " ( degree~C)))
ggarrange(one, two, labels = c("A", "B"))
Figure 2: Checking the assumption of correlation, that the data are normally distributed. (A) PDO data is pretty normal looking and I assume if we increased our sample size, this distribution would begin to have the perfect bell-shape. (B) SST data also is pretty normal looking and I assume if we increased our sample size, this distribution would also begin to have the perfect bell-shape.
attach(oceansData)
#code to run the Pearson product-moment correlation
correlation1 <- cor.test(PDO, SST, method = "pearson")
df <- correlation1$parameter
CI1 <- correlation1$conf.int[1]
CI2<- correlation1$conf.int[2]
r <- correlation1$estimate
pvalue <- correlation1$p.value
stats <- data.frame(cbind(r, pvalue, df, CI1, CI2))
gt(stats)
| r | pvalue | df | CI1 | CI2 |
|---|---|---|---|---|
| 0.4777927 | 0.0007850816 | 44 | 0.2176885 | 0.6745316 |
ggplot(oceansData, aes(PDO, SST))+
geom_point(color="#ff61cc")+
labs(title="Correlation between PDO and SST", x="Pacific Decadal Oscillation",
y=expression("Sea Surface Temp. " ( degree~C)))+
geom_smooth(formula = y ~ x, method = "lm", color="black")+
geom_label(x=-0.7, y=10.3, label="r = 0.478, p < 0.05", size=3)
Figure 3: There is a weak and significant positive correlation between PDO and SST with a p-value < 0.05 and the correlation coefficient, r = 0.478.
Clams, as largely sessile organisms living within the benthos, have been shown to be good indicators of climate. Like trees, bivalves deposit annual growth rings which can be used to make inference about their living conditions (this is called sclerochronology). So, we have gone to a local clam bed and collected some clams. We brought them back to the lab and measured the growth increments. We only used clams that had a record from at least 1970-2015.
The upwelling index used here is from 0 to -50, where negative numbers are indicative of stronger upwelling (more water moving offshore). We know upwelling brings nutrient-rich deep waters to our coast, bringing the raw materials for the phytoplankton clams eat. Given this biological link, we might expect years with strong upwelling to result in higher growth. Can upwelling (UPW) describe clam growth (cm/yr)?
Plot your data
Answers: Looking at these data it seems that there might be slight differences in Y given X (Fig. 4). It also shows two large y values where upwelling is equal to: -10 and -30. These could be outliers and they could influence our statistical test. We can check for outliers by examining if the assumptions of the model are met (Fig. 5).
ggplot(oceansData, aes((UPW), Clam))+
geom_point(color="#ac6a6c")+
labs(title="Strength and Direction UPW and Clam Growth", x="Upwelling",
y=expression("Clam Growth " ( cm/yr)))+
geom_smooth(formula = y ~ x, method = "lm", color="black")
Figure 4: Checking out the data to see if there is a relationship between upwelling and clam growth. It seems that there is a slight negative relationship of about -0.25.
Assessing assumptions (here we will proceed assuming the data are normal enough, but do check)
Answers: The density plot of residuals does look bell-shaped, with no weird bumps or skewing (Fig. 5a). The Q-Q plot looks normal, with a few points further from the line around x = 2 (Fig. 5b). The plot of fitted values vs. predicted values shows equal amounts of points above and below the line with similar vertical variation (Fig. 5c).
#reorder UPW (x) data from small to large for QQplot
oceansData <- oceansData[order(oceansData$UPW),]
#plut it into the lm function
regression2 <- lm(Clam ~ UPW, data=oceansData)
yintercept <- (regression2$coefficients[1])
UPWslope <- (regression2$coefficients[2])
rsquared <- 0.1382 #did not use adjusted because this is not multi regress with more than
#1 explanatory (which is when you would use the adjusted)
pvaluettest <- 0.011
pvalueANOVA <- 0.01097
fstat <- 7.055 #tells the variation in meansquaresregression/meansquaresresiduals
table2 <- data.frame(cbind(yintercept, UPWslope, rsquared, pvaluettest,
pvalueANOVA, fstat))
gt(table2)
| yintercept | UPWslope | rsquared | pvaluettest | pvalueANOVA | fstat |
|---|---|---|---|---|---|
| 0.07024296 | -0.001652452 | 0.1382 | 0.011 | 0.01097 | 7.055 |
#ANOVA of Regression: tells you if Fstat for lm() we don't need to do this because
#there is only one predictor in this example. We will do this in problem 3.
#Diagnostic plots: residuals and QQ plots to test the two final assumptions of the
#regression [1] the residuals are normally distributed and [2] the residuals have equal variance
dp <- ggplot(regression2, aes(x=regression2$residuals)) +
geom_density(fill="blue", colour=NA, alpha=.2) +
geom_line(stat = "density") +
expand_limits(y = 0) +
labs(title="Density plot of residuals") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Residual") +
ylab("Density")
qqp <- ggplot(regression2, aes(sample = regression2$residuals))+
stat_qq() + stat_qq_line(col="steelblue")+
labs(title="QQ Plot of Residuals",
x="Theoretical Qualtiles",
y="Sample Quantiles")
#QQ plot plots our residuals vs. normal distribution, a line means data are normal
#Now plot fitted vs. residuals to determine if equal variance above and below line
fitres <- ggplot(regression2,
aes(regression2$fitted.values,
regression2$residuals))+
geom_point()+
geom_hline(yintercept = 0, linetype="dashed", col="steelblue")+
labs(title="Residuals vs. Fitted",
x="Fitted Values", y="Residuals")
ggarrange(dp, qqp, fitres, labels = c("A", "B", "C"))
Figure 5: Evaluating Model Assumptions. (A) Density Plot of residuals does look bell-shaped, with no weird bumps or skewing. (B) The Q-Q plot looks normal, with a few points further from the line around x = 2. (C) The plot of fitted values vs. predicted values. There are equal amounts of points above and below the line with similar vertical variation.
Fit your model and plot it
For the t-test part of the regression: H0: \(\beta\) \(=\) 0; HA: \(\beta\) \(\neq\) 0
For the ANOVA part of the regression: H0: \(Y\) \(=\) \(\alpha\); HA: \(Y\) \(=\) \(\alpha\) + \(\beta\)
clam.summary <- (summary(regression2))
clam.summary
##
## Call:
## lm(formula = Clam ~ UPW, data = oceansData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.116198 -0.029658 -0.000305 0.036298 0.125350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0702430 0.0151222 4.645 3.08e-05 ***
## UPW -0.0016525 0.0006221 -2.656 0.011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05473 on 44 degrees of freedom
## Multiple R-squared: 0.1382, Adjusted R-squared: 0.1186
## F-statistic: 7.055 on 1 and 44 DF, p-value: 0.01097
ggplot(oceansData, aes(UPW, Clam))+
geom_point(color="#ac6a6c", shape=1)+
labs(title="Strength and Direction UPW and Clam Growth",
x="Upwelling",
y=expression("Clam Growth " ( cm/yr)))+
geom_smooth(formula = y ~ x, method = "lm", color="black", se=F)+
geom_label(x=-8, y=0.24,
label="y = 3.08e-05 + 0.011x + e, R^2 = 0.1382, p < 0.05",
size=3)
Figure 6: Fitting the model.
Describing the output, including any statistical tests that will aid in your interpretation.
Answers: Our slope is not equal to zero with a p < 0.05, so we reject H\(_0\) in favor of H\(_A\) for the t-test hypothesis within the regression output (Table 2, pvaluettest). Our p-value < 0.05 for our ANOVA test too, suggesting we can reject our second H\(_0\) in favor of H\(_A\), and our model is better than the null model (Table 2, pvalueANOVA). Clam growth increases about 0.0016 cm/yr for each unit of upwelling (t\(_{44}\) = 7.055, p < 0.05). Upwelling is an okay predictor of change in clam growth (R\(^{2}\) = 0.1382, F\(_{1,44}\) = 63.1, p < 0.05).
Stating in plain language what you conclude.
Answer: Upwelling causes an increase in nutrients which seems to be a weak predictor of clam shell growth (cm/yr). Manilla clams sexually mature in about 2 years with a size of about 2 cm. If the clams you found out and about (Birch Bay maybe?) happened to be manilla, ~ 0.083 percent of thier growth can be explained by upwelling, with an increase of about 0.0016 cm/yr for each unit of upwelling (Table 1). It seems we could add more variables to the model to increase our explanatory power.
Our simple linear model did not have much explanatory power. But we do have additional data and can possibly improve our model fit by adding a variable. Using the same Ocean Indexes data, conduct an analysis (here, a linear regression, potentially with multiple explanatory variables) to see if the indicators of ocean condition: PDO, North Pacific Gyre Oscillation (NPGO), and Upwelling (UPW) can describe clam growth. For simplicity, model the main effects only. Pay attention to what variables you are including–you will need to select the appropriate ones.
Your steps should include:
Plot your data and check for colinearity.
Answer: It seems that there is a very strong colinearity between PDO and clam growth, other relationships show weaker colinearity (Fig. 7). Here you can also see that there is statistical significance in the size and affect of PDO with NPGO, I am still deciding on what to do. PDO is a positive direction, while UPW and NPGO are negative.
multi <- oceansData[,c("Year","PDO","UPW", "NPGO", "Clam")]
corr <- ggpairs(multi, columns = 2:5)
corr
Figure 7: Looking at the data before running multiple linear regression through correlation. There is a strong in the correlation between PDO and clam growth. There is also a strong statistical significance in the correlation between NPGO and PDO, which means one of these should probably be thrown out of our model.
Assessing assumptions
Answer: The density plot of residuals has a bit of a right skew (Fig. 8A), but the Q-Q plot looks normal (Fig. 8B), with a few points further from the line around x = 2. The plot of fitted values vs. predicted values seems to show more points below the line than above it (Fig. 8C). The scale and location plot shows that there is no clear trend in the residual values (Fig. 8D). Finally, the residual vs. leverage plot tells us the amount of influence each observation has on our model. Points with very high residual values are poorly described by the model, and points with high leverage are influencing the model fit quite a bit, and are outliers skewing the model away from best fit. The point labelled 39 seems to be a large standardized residual, but it falls within Cook’s 0.5, meaning it might influence our model but is not definitely influencing our model.
model1.allvars <- lm(Clam~UPW+NPGO+PDO, data=multi)
dp <- ggplot(model1.allvars, aes(x=model1.allvars$residuals)) +
geom_density(fill="blue", colour=NA, alpha=.2) +
geom_line(stat = "density") +
expand_limits(y = 0) +
labs(title="Density plot residuals") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Residual") +
ylab("Density")
qqp <- ggplot(model1.allvars, aes(sample = model1.allvars$residuals))+
stat_qq() + stat_qq_line(col="steelblue")+
labs(title="QQ residuals",
x="Theoretical Qualtiles",
y="Sample Quantiles")
#QQ plot plots our residuals vs. normal distribution, a line means data are normal
#Now plot fitted vs. residuals to determine if equal variance above and below line
fitres.p <- ggplot(model1.allvars,
aes(model1.allvars$fitted.values,
model1.allvars$residuals))+
geom_point()+
geom_hline(yintercept = 0, linetype="dashed", col="steelblue")+
labs(title="Residuals vs. Fitted",
x="Fitted Values", y="Residuals")
std.resid.p<-ggplot(model1.allvars, aes(.fitted, sqrt(abs(.stdresid))))+
geom_point(na.rm=TRUE)+
stat_smooth(method="loess", na.rm = TRUE)+
xlab("Fitted Value")+
ylab(expression(sqrt("|Std. residuals|")))+
ggtitle("Scale-Location")
ggarrange(dp, qqp, fitres.p, std.resid.p, labels = c("A", "B", "C", "D"))
## `geom_smooth()` using formula = 'y ~ x'
Figure 8: Evaluating Model Assumptions. (A) Density Plot of residuals has a bit of a right skew. (B) The Q-Q plot looks normal, with a few points further from the line around x = 2, indicating some heaviness in the tails. (C) Residuals vs. fitted show no indication of non-linearity seen. (D) Scale-location plot, another type of residual vs. fitted that shows no clear trend in residual variance throughout the plot.
model1.allvars
##
## Call:
## lm(formula = Clam ~ UPW + NPGO + PDO, data = multi)
##
## Coefficients:
## (Intercept) UPW NPGO PDO
## 9.685e-02 4.315e-05 -3.508e-04 6.667e-02
options(scipen=0) #set sci pen back to 0 for default
#this takes off scientific notation
summary(model1.allvars)
##
## Call:
## lm(formula = Clam ~ UPW + NPGO + PDO, data = multi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.013459 -0.008554 -0.002433 0.003618 0.057509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.685e-02 4.083e-03 23.722 <2e-16 ***
## UPW 4.315e-05 1.833e-04 0.235 0.815
## NPGO -3.508e-04 2.533e-03 -0.138 0.891
## PDO 6.667e-02 3.231e-03 20.633 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01416 on 42 degrees of freedom
## Multiple R-squared: 0.9449, Adjusted R-squared: 0.941
## F-statistic: 240.3 on 3 and 42 DF, p-value: < 2.2e-16
plot(model1.allvars, which=5)
Figure 9: The residual vs. leverage plot tells us the amount of influence each observation has on our model. Points with very high residual values are poorly described by the model, and points with high leverage are influencing the model fit quite a bit, and are outliers skewing the model away from best fit. The point labelled 39 seems to be a large standardized residual, but it falls within Cook’s 0.5, meaning it might influence our model but is not definitely influencing our model.
Justification for variables within your model
Answer: Model 1 contained all explanatories: UPW, NPGO, and PDO. Model 2 contained all but NPGO, Model 3 contained all but PDO, and Model 4 containing all but UPW. After referencing our lecture slides, I read that “The parsimony principle for a statistical model states that a simpler model with fewer parameters is favored over more complex models with more parameters, provided the models fit the data similarly well. I chose Model 3 because both t-test and ANOVA were statistically significant. The slope was not equal to zero for either variable in the model and the model with both explanatories was better than the model without them.
#to justify which model is better, run an ANOVA on the models in question
# model1.allvars
model2.NPGO.removed <- lm(Clam~UPW+PDO, data=multi)
model3.PDO.removed <- lm(Clam~UPW+NPGO, data=multi)
model4.UPW.removed <- lm(Clam~PDO+NPGO, data=multi)
model5.justPDO <- lm(Clam~PDO, data=multi)
#model 3 has the best t-test and ANOVA values, but the lowest R^2 **shrug**
summary(model3.PDO.removed)
##
## Call:
## lm(formula = Clam ~ UPW + NPGO, data = multi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.092400 -0.028232 -0.005093 0.030923 0.131824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0731693 0.0129218 5.662 1.13e-06 ***
## UPW -0.0017575 0.0005314 -3.307 0.001910 **
## NPGO -0.0291252 0.0069745 -4.176 0.000142 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0467 on 43 degrees of freedom
## Multiple R-squared: 0.3868, Adjusted R-squared: 0.3583
## F-statistic: 13.56 on 2 and 43 DF, p-value: 2.709e-05
summary(model5.justPDO)
##
## Call:
## lm(formula = Clam ~ PDO, data = multi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.013748 -0.009301 -0.002787 0.003997 0.057779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.095904 0.002065 46.45 <2e-16 ***
## PDO 0.066607 0.002427 27.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01385 on 44 degrees of freedom
## Multiple R-squared: 0.9448, Adjusted R-squared: 0.9436
## F-statistic: 753.2 on 1 and 44 DF, p-value: < 2.2e-16
Fitting the model
UPWp <- ggplot(oceansData, aes(UPW, Clam))+
geom_point(color="#00bfc4")+
labs(title="Upwelling vs. Clam Growth",
x="Upwelling",
y=expression("Clam Growth " ( cm/yr)))+
geom_smooth(method = "lm", color="#00bfc4")+
geom_text(x=-20, y=0.23,
label="y = 0.073 - 0.0018x1 - 0.029x2 + e",
size=3)
NPGOp <- ggplot(oceansData, aes(NPGO, Clam))+
geom_point(color="#c77cff")+
labs(title="NPGO vs. Clam Growth",
x="NPGO")+
geom_smooth(method = "lm", color="#c77cff")+
geom_text(x=-.5, y= 0.228,
label="R^2 = 0.36, p < 0.05", size=3)
ggarrange(UPWp, NPGOp, labels = c("A", "B"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Figure 10: Upwelling vs. Clam Growth (A) and NPGO vs. Clam growth (B). The best fit model for clam growth is y = 0.073 - 0.0018x1 - 0.029x2 + e, where x1 is upwelling and x2 is NPGO.
# Reinstall latest development version which includes the bugfix
#devtools::install_github("quinten-goens/plotly.R@fix/kaleido-export-bug")
library(plotly)
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.3.2
plot_ly(oceansData,
x = ~NPGO,
y = ~UPW,
z = ~Clam,
color = Clam) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'NPGO'),
yaxis = list(title = 'UPW'),
zaxis = list(title = 'Clam'))) %>%
layout(show.legend = T)
## Warning: 'layout' objects don't have these attributes: 'show.legend'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
Figure 11: A 3-dimensional model that was created using lattice package for static map adapted from interactive plotly code (thank you Annie!).
#trying to save interactive to static
#library(htmlwidgets)
#saveWidget(p, file=paste0( getwd(), "/Fig11.html"))
#getting plotly to convert to png is not working on my windows with the python codes. switching to lattice for this static map (gg3d depreciated, I am sure ggplot has another static 3d version of this) so using lattice instead.
See interactive version of Figure 11 here.
# Load
library("lattice")
cloud(Clam ~ UPW*NPGO,
data = oceansData,
group = Clam,
auto.key = F)
Figure 12: A static version of the 3-dimensional map that can be embedded into pdf documents.
Describing the output. For this exercise, rely on adjusted R\(^2\) as the primary means of assessing model fit in addition to checking for significant slopes and variance explained through the ANOVA. We will cover additional methods in future weeks, but here, rely on R\(^2\) to chose your best model.
Answer: There were two models that I think are okay in helping predict clam shell growth. Model2 showed statistical significance in t-test, p < 0.05, and the slope was not equal to 0. Model2 also showed statistical significance in ANOVA, p < 0.05, meaning this model with covariates is better than the model without covariates. The R\(^2\) adjusted is equal to 0.36, so it isn’t very high – but I think that is expected with environmental data. Model 5 is interesting. I really don’t know what is up with PDO data… Something seems a little over-fit to be environmental data, which is why I did not go with it.
Stating in plain language what you conclude.
Answer: Simple linear regression was used to test if both upwelling and NPGO significantly impacted clam growth. The fitted regression model was: Clam Growth = 0.073 - 0.0018(UPW) - 0.029(NPGO) + e. The overall regression was statistically significant (R\(^2\) = 0.36, F(2, 43) = 13.56, p < .05). It was found that both upwelling and NPGO significantly predicted clam growth (\(\beta_1\) = 0.0018, \(\beta_2\) = 0.029, p < .05).
[1] Dr. Sobocinski’s Regression Tutorial
[2] Bookdown
[4] Summary Stats
[5] Therimalaya
[6] Annie’s plotly code :)! See interactive version of Figure 11 here.