Your goals for today are to fit and interpret a linear regression model, check its assumptions, and run a Spearman's rank correlation when those assumptions aren't met.
First, remember to load the necessary packages.
library(ggplot2)
library(lterdatasampler)
We'll be using the pie_crab dataset. Take a quick look at it! View(), str(), and ? are your friends. Note that the site variable is a character, not a factor, which might be relevant for future visualizations and analyses.
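For example, here's a minimal sketch of that check, plus an optional conversion of site to a factor in case you want it treated as a grouping variable later:
# Inspect the structure of the dataset; str() shows each column's type
str(pie_crab)
# Optional: convert site from character to factor
pie_crab$site = as.factor(pie_crab$site)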
Another (more correct) way to think about “linear regression” is that you’re “fitting” a linear model to your data. You’re positing that you can predict your dependent variable with some “y = mx + b” equation, where x is your independent variable. Let’s see an example.
crabs.lm = lm(size~latitude,data=pie_crab)
The code for fitting a linear model is very similar to the code for fitting an ANOVA model (ANOVA and linear regression both fall under a broader umbrella called "general linear models").
Before we check the summary() of the model, though,
let’s visualize the data.
Exercise 1: Make a scatterplot of crab size plotted against latitude. Add a trendline.
ggplot(data=pie_crab,mapping=aes(x=latitude,y=size)) +
geom_point() +
geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
You can see there’s a positive relationship between the two variables - as latitude goes up (further north, in this case), crab carapace size goes up as well. Obviously, there’s still a lot of variation - points don’t fall perfectly along the line. Let’s see if the correlation is significant.
summary(crabs.lm)
##
## Call:
## lm(formula = size ~ latitude, data = pie_crab)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.8376 -1.8797  0.1144  1.9484  6.9280
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.62442    1.27405  -2.845  0.00468 **
## latitude     0.48512    0.03359  14.441  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.832 on 390 degrees of freedom
## Multiple R-squared:  0.3484, Adjusted R-squared:  0.3467
## F-statistic: 208.5 on 1 and 390 DF,  p-value: < 2.2e-16
Just like the ANOVA table, you get a lot of stuff here. Let’s go through the most important terms.
(Intercept): The estimate (and standard error of the estimate) of the model's intercept value (the "b" in y = mx + b), as well as the p-value (Pr(>|t|)), which, when below 0.05, tells you that the intercept is significantly different from zero.

latitude: The slope (and slope standard error) of the model (the "m"), along with its p-value, which indicates whether the slope is significantly different from zero.
The “y = mx + b” equation for this model would thus be size = 0.49*latitude - 3.6. Also, it looks like fiddler crabs follow Bergmann’s rule, so that’s interesting!
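You can pull those numbers straight from the fitted model; here's a minimal sketch using coef() and predict() (the latitude of 42 degrees is just an arbitrary illustration):
# The intercept ("b") and slope ("m") of the fitted line
coef(crabs.lm)
# Predicted carapace size at a latitude of 42 degrees
predict(crabs.lm, newdata = data.frame(latitude = 42))
# equivalent to 0.48512 * 42 - 3.62442, about 16.75 mm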
The R-squared value is given in the second line from the bottom. An R-squared value is the proportion of the variation in the dependent variable that is predicted or "explained" by the independent variable (or variables, if there were more than one). In this case, approximately 35% of the variation in crab size is explained by variation in latitude, with the rest attributable to all other, unmeasured factors. R-squared is the "effect size" of linear regression, and you don't even have to install another package - you just get it for free!
You’ll notice there are two types of R-squared value shown here, multiple and adjusted. The multiple R-squared value is the R-squared value as defined above. The adjusted R-squared simply adjusts that value based on the number of independent variables - this is the one that you should report when writing up your results. You don’t need to worry about the difference when your model only has one independent variable (which this one does) - as you can see, the difference between the two for this model is extremely small. If you were doing multiple regression and were testing multiple, more complex models with more independent variables, then the adjusted R-squared would help you compare them.
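If you ever need the R-squared values programmatically (say, to build a table comparing several models), you can extract them from the summary object; a minimal sketch:
# Multiple and adjusted R-squared, pulled from the model summary
summary(crabs.lm)$r.squared      # ~0.348
summary(crabs.lm)$adj.r.squared  # ~0.347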
The R-squared value has no direct relationship to the model’s overall p-value. The F-statistic and associated p-value indicate whether or not your model fits the data better than a null model (with no independent variables). Usually, a low p-value and a medium-to-high R-squared value go hand in hand.
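You can also run that comparison against a null model explicitly, which makes it clearer what the F-statistic is testing; a minimal sketch:
# An intercept-only ("null") model just predicts the mean size for every crab
crabs.null = lm(size ~ 1, data = pie_crab)
# Compare the two fits; the F and p-value match the summary() output above
anova(crabs.null, crabs.lm)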
So, with all that said, how do you report the results of a linear regression? Here are two examples.
Fiddler crab size increased with latitude (linear regression, R-squared = 0.35, p < 0.0001), approximately 0.49 mm per degree.
Fiddler crab size was positively correlated with latitude (linear regression, R-squared = 0.35, p < 0.0001).
That’s all there is to it. Report the R-squared and p-value and make sure that the direction of the relationship (positive or negative) is communicated in clear terms (and ideally, biological terms). Some indication of the “slope”, how much the dependent variable increases or decreases with each unit increase of the independent variable, is nice to include but not necessary.
Exercise 2: Fit linear models of crab size vs. air temperature, water temperature, and the standard deviations of air and water temp (4 separate tests). Which predictor is most closely associated with crab size?
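To get you started, here's a sketch of the first of those four models (this assumes the temperature columns are named air_temp, water_temp, air_temp_sd, and water_temp_sd - check str(pie_crab) to confirm):
# Fit one of the four models; the other three work the same way
crabs.air.lm = lm(size ~ air_temp, data = pie_crab)
summary(crabs.air.lm)
# Compare the (adjusted) R-squared values across the four models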
Linear regression does make two key assumptions. First, just like ANOVA, it assumes that the model residuals are normally distributed.
shapiro.test(resid(crabs.lm))
##
## Shapiro-Wilk normality test
##
## data: resid(crabs.lm)
## W = 0.99517, p-value = 0.2628
Since the pie_crab dataset is smaller (shapiro.test() only handles sample sizes up to 5,000), the Shapiro-Wilk test works here! But you can also make a qqplot.
qqnorm(resid(crabs.lm))
qqline(resid(crabs.lm))
That is what a normal qqplot looks like! So, this assumption is upheld.
Second, it assumes equal variance of the residuals - basically, that the spread of the residuals doesn't vary between the low end and high end of the independent variable.
plot(resid(crabs.lm)~pie_crab$latitude)
abline(h=0,lwd=3,col=2)
Here’s an example of a plot that would violate the assumption of equal variance, just so you see what that looks like.
Here, there's a clear pattern to the residuals - their magnitude varies depending on the value of the independent variable (length, in that example).
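If you'd like to see how a plot like that arises, here's a minimal simulated sketch (not the crab data) where the residual spread grows with the independent variable:
# Simulate data whose noise increases with x, violating equal variance
set.seed(1)
x = runif(200, 0, 10)
y = 2 * x + rnorm(200, sd = x)
het.lm = lm(y ~ x)
plot(resid(het.lm) ~ x)
abline(h = 0, lwd = 3, col = 2)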
If the assumptions of linear regression are not met, a Spearman's rank-order correlation test is the non-parametric equivalent. The code is a little different - it uses the cor.test() function, and instead of the y~x format you're used to, you have to separately define the independent variable (x) and dependent variable (y).
cor.test(x=pie_crab$latitude, y=pie_crab$size,method="spearman")
## Warning in cor.test.default(x = pie_crab$latitude, y = pie_crab$size, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: pie_crab$latitude and pie_crab$size
## S = 4218002, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5798516
In the output, you'll notice a p-value and a value called rho. Spearman's rho looks, on its face, similar to an R-squared value in that it indicates the strength of the correlation, but it measures something different: the consistency with which the dependent variable increases (or decreases) as the independent variable increases. It doesn't measure "fit" to any kind of model.
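Here's a tiny illustration of that distinction: a perfectly monotonic but curved relationship gets a rho of exactly 1, even though no straight line fits it perfectly.
# y always increases as x increases, but not linearly
x = 1:10
y = x^3
cor(x, y, method = "spearman")  # rho = 1
cor(x, y, method = "pearson")   # about 0.93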
Values of rho range from -1 to 1, with values farther from zero indicating a stronger (more consistent) relationship.
Note that rho can be positive or negative. This indicates the direction of the relationship - positive rho means a positive correlation (y increases as x increases) while negative rho means a negative correlation (y decreases as x increases).