Linear regression is a parametric test that examines whether a linear model between a continuous response variable and a continuous explanatory variable explains more variation than random chance. Linear regression has the following sampling variability assumptions that have to be met:

  1.    The residuals have to be normally distributed;
  2.    The residuals have equal variance across the range of fitted values (homogeneity of variance);
  3.    The error terms need to be independent of each other.

The residuals are the leftover variation that is not accounted for by your explanatory variable. You can use analyses of the residuals to check all of these assumptions. Before we can interpret our statistical output, we need to make sure the assumptions above are met. For more information about why you are not looking for the normality of your dependent variable but the residuals, see this great post by Karen Grace-Martin https://www.theanalysisfactor.com/the-distribution-of-dependent-variables/.
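
To make this concrete, here is a minimal sketch (using made-up numbers, not the example dataset on this page) showing that the residuals are simply the observed values minus the values predicted by the fitted line:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
toy <- lm(y ~ x)
y - fitted(toy)  # observed minus predicted values...
residuals(toy)   # ...is exactly what residuals() returns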

Background Theory

Linear regression depends on the F-distribution, which measures the variance in the response variable that is explained by the explanatory variable divided by the variance in the response variable that remains unexplained (the error term or residuals). The model is a line of best fit, which minimizes the sum of the squared differences between each data point and the line. It tests the null hypothesis that the slope of the line is 0. If the slope is significantly different from 0, then the model is explaining more variance than error (random chance) and the model is significant.

Here are some statistical terms you should understand to interpret your linear regression.

If the F statistic generated by your test is larger than the F-critical value, the area under the curve to the right of your F statistic is smaller than 5%; therefore, the p-value is < 0.05 and there is a significant relationship between the explanatory and the response variables. You have explained significantly more variance than random chance. And if you claim that there is a significant effect when there really isn't one, you will be wrong less than 5% of the time.

Figure 1. F distribution. Image: http://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/
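
If you want to see this relationship in R, qf() gives the F-critical value for a chosen significance level and pf() gives the area under the curve to the right of an F statistic (the p-value). The numbers below are taken from the salamander example later on this page (F = 14.37 on 1 and 13 DF); this is just an illustrative sketch.

qf(0.95, df1=1, df2=13)                     # F-critical value at alpha = 0.05
pf(14.37, df1=1, df2=13, lower.tail=FALSE)  # p-value: area to the right of F = 14.37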


In the figure below, the best-fit line (red line) through the data indicates the variance explained by the model. In the figure on the left, if you sum up the differences between the best-fit line and the overall mean of the response (blue arrows), you get the variance explained by the model. If the best-fit line has a slope that is close to 0 (light blue line), it sits near the mean and the model doesn't explain much of the variation. If the slope is large, the best-fit line is far from the mean at most points and the model explains more variance. The figure on the right shows the variance not explained by the model (the residuals). If you add up the differences between each dot and the best-fit line (blue arrows), you get the variance not explained by the model. If the dots fall close to the line, most of the variation is explained by the explanatory variable. The F statistic can be conceptualized as the ratio of the variance you can explain (blue arrows in the left figure) over the variance you can't explain (blue arrows in the right figure).

Figure 2. The number of salamanders increases with stream pH. Data are used for illustrative purposes.
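
To see the ratio idea in code, here is a minimal sketch (again with made-up data, not the salamander dataset) that builds the F statistic by hand from the explained and unexplained sums of squares and checks it against R's own ANOVA table:

x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.2, 3.1, 4.8, 5.9, 7.2, 8.4)
m <- lm(y ~ x)
ss_model <- sum((fitted(m) - mean(y))^2)      # variation explained by the line
ss_resid <- sum(residuals(m)^2)               # leftover variation (the residuals)
(ss_model / 1) / (ss_resid / (length(y) - 2)) # F = explained / unexplained variance
anova(m)                                      # R reports the same F statistic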


A linear regression test will provide you with a data analysis table of output that looks like the table below.

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  182.8204    25.6037   7.140 7.58e-06 ***
Stream_pH     -2.1724     0.5731  -3.791  0.00225 **

In this output, the p-value associated with the explanatory variable (Stream_pH) tells you if the slope is significantly different from 0. Recall your null hypothesis for the slope and this will make sense to you.

You also get other information in your output that looks like the information below.

Residual standard error: 33.28 on 13 degrees of freedom
Multiple R-squared: 0.525, Adjusted R-squared: 0.4885
F-statistic: 14.37 on 1 and 13 DF, p-value: 0.002247

This part of the output comes below the table above and gives you the R2 value (Multiple R-squared), the F-statistic, DF, and the p-value. You should also note that the p-value for the F-statistic in this output is the same as the p-value for the t value from the table above. They are testing the same thing (whether the estimate is significantly different from 0). The adjusted R2 is only used if you are doing multiple regression and have more than one explanatory variable; it controls for the number of explanatory variables you have. You can also generate the F-statistic model output like what you get in an ANOVA (see the ANOVA webpage for this example, and the sketch below).
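
For example, assuming your fitted model is stored in an object called fit (as it will be later on this page), the call below prints the ANOVA-style table with the same F-statistic, DF, and p-value:

anova(fit)  # ANOVA-style table for a model fit with lm()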

The DF is the degrees of freedom for the explanatory variable (1) and the residuals (13). The residual DF is the number of samples minus the number of parameters you are estimating, including the intercept. In the above example, there were 15 plots where salamanders were counted and pH was measured. One DF is used to estimate the intercept and one DF is used to estimate the slope for the explanatory variable, so there are 13 DF left (15 - 2).
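
As a quick sketch, you can recover these degrees of freedom directly from a fitted model object (again assuming your model is stored in fit):

nobs(fit)                      # number of samples used in the model
length(coef(fit))              # parameters estimated (intercept + slope = 2)
nobs(fit) - length(coef(fit))  # residual DF, the same value as df.residual(fit)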

Information about the Example Dataset

For this test, we will be using a dataset collected by a summer research student at Puget Sound. They measured the amount of mercury in red-tailed hawk feathers from museum specimens spanning 1980 to present. Asia has increased its coal production since 1980, and mercury is a by-product of coal burning. They wanted to examine whether there was an increase in mercury in the Pacific Northwest since Asia increased coal burning by measuring it in red-tailed hawk feathers.

Figure 3. Red-tailed hawks are abundant in the Pacific Northwest. Photo source: https://www.allaboutbirds.org/guide/Red-tailed_Hawk/id


Loading the Data into RStudio

In order to run a linear regression, you need to have your response variable in one column and your explanatory variable in another column. To bring data into RStudio, you can use the line of code below that will allow you to select any file directly from your computer.

DATA <- read.csv(file.choose())

An alternative way to open your file is to use the path directly to the folder on your computer where you set your working directory. This will allow you to choose a named file that you saved in your working directory (to learn more about what your working directory is, go to the main webpage and click on Introduction to RStudio). In the DATA code line, TO.BE.EDITED is the name of your file.

setwd('C:/Users/YourUserName/Documents/Rfiles')
DATA <- read.csv(file="TO.BE.EDITED.csv", header=TRUE, sep=",")

For this example, download the Mercury.csv file, save it in your working directory and then use the code below to upload the Mercury.csv file into RStudio.

DATA <- data.frame(read.table(file='Mercury.csv', sep=',', header=TRUE, fill=TRUE))

To make sure it loaded correctly, you can copy and paste the following code to see the names of all of the columns in your file.

names(DATA)
## [1] "ID"      "Year"    "Mercury"

If the file was successfully loaded into RStudio, you should have the names of the columns above. If you got an error, go back to the main stats webpage and click on Troubleshooting. Once you are sure the data are loaded into R, you can proceed.
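
If you want a closer look at the data before proceeding, two optional commands are handy:

head(DATA)  # shows the first six rows of the data
str(DATA)   # shows each column's type (Year and Mercury should be numeric)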

Exploring the Data

Before we run the test, it is a good idea to explore what the data look like. This gives you a chance to see if there are any outliers and to determine if your prediction based on your hypothesis seems supported by the data (presumably you would have some information about mercury levels in bird feathers before running this analysis on what you predict to find). We will examine how mercury levels (Mercury) vary with year (Year). For this test, the amount of mercury in the bird feathers (Mercury) is the continuous response variable and year is the continuous explanatory variable.

Use the following code to create a scatterplot of your data. This is the most suitable graph form for a linear regression.

attach(DATA)
plot(Year, Mercury, xlab="Year", ylab="Mercury concentration (ppm)", pch=19)
abline(lm(Mercury~Year), col="red")

Looking at these data, it seems that mercury concentration might increase with year. The plot also shows some points located far in the top right. These might be outliers, which can strongly influence the statistical test. We will check for outliers while examining whether the assumptions of the model are met.

Running the Linear Regression

Now you can run the linear regression. Below is the code you will use to run the model. We need to run a linear model (lm). Remember that to actually run the test, you need to replace ResponseVariable and ExplanatoryVariable in the first line below with your own variable names, as in the second line. Use the names(DATA) code above to make sure you write in the variables exactly as they are in the data file. If you write in mercury instead of Mercury, for example, the test will not work because R won't be able to find mercury in your datafile.

fit <- lm(ResponseVariable ~ ExplanatoryVariable, data=DATA)
fit <- lm(Mercury ~ Year, data=DATA)

You will notice that not much happened. If you didn't get any error messages, that means that the test was successful. R runs the code and stores the information. You have to do more work to get at the results of the test. But first, we need to make sure we can even look at the results.

You never look at the results of the test until AFTER you check that the assumptions of the model were met. So next we will examine if our data meet the assumptions of 1) residuals following a normal distribution, 2) equal variance of the residuals across the range of fitted values (homogeneity of variance), and 3) independent error terms.

Evaluating Model Assumptions

The difference between the observed and predicted values, called the residuals, is a measure of the error associated with each observation. It is the variation that is not explained by the explanatory variable. We plot the residuals in various ways to examine normality and homogeneity of variances.

To test whether the residuals are normally distributed (model assumption 1 above), you will use 2 pieces of information:

  1.    a density plot of residuals;
  2.    a Q-Q plot of residuals.

To create a density plot of residuals, use the following code:

plot(density(fit$residuals))

Examine the density plot. It should follow a more or less bell-shaped curve. It can be hard to determine whether or not your density plot is “normal”; it takes practice. If there are any weird bumps, or the curve is clearly skewed to one side or bimodal, chances are your residuals are not normal. This plot looks pretty normal save for a small bump on the right. See the ANOVA webpage for examples of non-normal and normal curves.
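
If you would like a numeric check to go alongside the visual one, one optional companion (not required for this guide) is the Shapiro-Wilk test, which tests the null hypothesis that the residuals are normally distributed:

shapiro.test(fit$residuals)  # a p-value above 0.05 is consistent with normality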

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential. Here’s an example of a Normal Q-Q plot when both sets of quantiles truly come from Normal distributions.

Figure 4. A quantile-quantile plot of residuals showing that they exhibit a close relationship with the theoretical quantiles predicted under a normal distribution.


Q-Q plots take your sample data, sort it in ascending order, and then plot them versus quantiles calculated from a theoretical distribution, in this case the normal distribution. If the points fall pretty closely along the line, the data are normal.
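
If you are curious, here is a minimal sketch of what a Q-Q plot does under the hood: it sorts the residuals and plots them against evenly spaced quantiles of the normal distribution.

r <- sort(fit$residuals)        # sample quantiles: the sorted residuals
q <- qnorm(ppoints(length(r)))  # matching theoretical quantiles from the normal distribution
plot(q, r, xlab="Theoretical quantiles", ylab="Sample quantiles")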

The functions that create a Q-Q plot, qqnorm() and qqline(), come from the stats package. This package is installed and loaded automatically with base R, so you do not need to install or load anything extra; just run the code below.

qqnorm(fit$residuals)
qqline(fit$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

This plot looks pretty good: most points fall pretty close to the line, but it does show two points pretty far from the line.

To test whether the error terms have similar variance across the data (homogeneity of variance, assumption 2 above), we will use a plot of the residuals vs. the fitted values. Use the code below to create your plot:

plot(fit$residuals~fit$fitted.values)
lines(lowess(fit$fitted.values,fit$residuals), col="blue")

For this plot, you are looking for no patterns. There should be equal variation in points above and below the blue line. If you see a cone, where the vertical variation on one side of the figure is smaller than the vertical variation on the other side, the variation is not homogeneous. The plot in this example suggests such a problem: the points on the left side cover a much smaller range of values than the points on the right side. If your plot looks like this, your data violate the assumption of homogeneity of variance and you need to transform your data. Note that when transforming your data, you ONLY transform the response variable.

Transform and Repeat

For this dataset, you will try a square root transformation of your response variable (note that when doing transformations, you transform the response variable, not the explanatory variable). Use the code below to create a new variable called sqrtMercury, which is the square root of mercury concentration.

sqrtMercury <- sqrt(DATA$Mercury)
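
If the square root had not been strong enough, other transformations follow the same pattern. The line below is a hypothetical alternative (not needed for this dataset):

logMercury <- log(DATA$Mercury + 1)  # log transform, stronger than square root; +1 guards against zeros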

Now you can use the code below to re-run the linear regression and model assumption tests:

fit<-lm(sqrtMercury~Year, data=DATA)
plot(density(fit$residuals))

qqnorm(fit$residuals)
qqline(fit$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))

plot(fit$residuals~fit$fitted.values)
lines(lowess(fit$fitted.values,fit$residuals), col="blue")

After transforming the data, they look better! The density plot of residuals looks more normal, the points on the right side of the Q-Q plot look less like outliers, and the variation on the left side has increased (note the change in scale on the y-axis in the fitted vs. residuals plot - it is much smaller, which means that even though the points don't extend equally on the left and right, the difference between the spreads is much smaller). There is one more round of plots we can do to check our assumptions and look for outliers.

Another quick way to examine these diagnostic plots is to use the code below. It provides four plots (you need to press Enter in the R console to step through them). Note that you don't have a choice of which plots you get with this code - you get all of them and have to go through them:
1. A residuals vs. fitted plot. In this plot, look for a non-linear pattern in the red line or uneven spread around the line (narrow on one side and wide on the other), just as you did earlier for the fitted vs. residuals plot.
2. A Q-Q plot. This is a repeat of the Q-Q plot from above.
3. A scale-location plot. In this plot, look for an even distribution of dots above and below the line (even spread across the line).
4. A residuals vs. leverage plot. In this plot, you are looking for influential points - those that fall past the dashed red line.

For more detail on interpreting these plots for model assumptions, check out this great post by Bommae Kim at the University of Virginia library: https://data.library.virginia.edu/diagnostic-plots/

plot(fit)

The last plot shows that no point is an outlier (not even #63 and #85) because no points go beyond the dashed red line. So overall, by transforming the data, we met the assumptions and can proceed to interpret the output of the model.
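
If you would rather see all four diagnostic plots on one screen instead of pressing Enter between them, a common trick is to change the plotting layout first:

par(mfrow=c(2, 2))  # split the plotting window into a 2 x 2 grid
plot(fit)           # all four diagnostic plots appear at once
par(mfrow=c(1, 1))  # reset the layout to a single plot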

Interpreting Linear Regression Output

Use the code below to get the model summary.

summary(fit)
## 
## Call:
## lm(formula = sqrtMercury ~ Year, data = DATA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2592 -0.4445 -0.1294  0.3419  2.3711 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -49.644589  15.610534  -3.180  0.00197 **
## Year          0.025583   0.007807   3.277  0.00145 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7045 on 99 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.09785,    Adjusted R-squared:  0.08874 
## F-statistic: 10.74 on 1 and 99 DF,  p-value: 0.001448

According to the linear regression output, there is a significant effect of year on mercury concentration: the p-value for Year in the coefficient table (p = 0.00145) and the p-value associated with the F-statistic (at the bottom: p = 0.001448) are both less than 0.05. Recall from the Background Theory above that these both test the same thing: whether the slope is significantly different from 0. You should copy and paste the results (summary) into a Word doc or Excel file so you have them recorded somewhere. In particular, you want to go through the Background Theory above to really understand all aspects of this output table. For your results section, you should note the DF for your explanatory variable (in this case, Year) and the residuals, the F-value, the R2 value, and the p-value.
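
If you prefer to pull these numbers out of R directly rather than copying them by hand, the summary object stores them. A quick sketch (assuming your model is stored in fit):

s <- summary(fit)
s$r.squared   # Multiple R-squared
s$fstatistic  # the F value and its two DF
coef(s)       # the coefficient table (estimates, t values, p-values)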

Presenting your Results

Results Statements

The results section of your paper should begin with a narrative of your results statements. These statements should be quantitative in nature and include:

  1.    the statistical significance and
  2.    the biological significance.

Statistical Significance: Your first sentence should state the results of the statistical test (in this case, whether mercury significantly varied with year). You include the statistical data in parentheses at the end of the sentence only. You should never write “The p-value was…” or “The F-stat was…, which means…”. You don't write about your statistics; you just write the biological results and include your statistics in parentheses. This addresses whether your results were statistically significant.

Biological Significance: For the quantitative statements, you can include how much variation in the response variable was explained by the explanatory variable (basically your R2 value).

The first results statement that includes the statistical results in parentheses should also reference the figure you are referring to. Remember to always:

  - refer to figures in the order in which they appear, and
  - include your results narrative before the figure.

Results Figure

A scatterplot with a trend line is typically used to present the results of a linear regression. A scatterplot is used when both your response variable and explanatory variable are continuous. You already generated a quick scatterplot with a trendline above, but below is code for a more polished version. Note that even though we transformed the data to meet the assumptions of the model, the untransformed data are presented.

To create a presentable figure for your paper, you need to install and load the package ggplot2. Use the code below to install (first line of code) and then load (second line of code) the package.

install.packages("ggplot2")
library(ggplot2)

Now use the code below to create a scatterplot with a trendline and appropriate axis labels.

ggplot(DATA, aes(x = Year, y = Mercury)) +
  geom_point(shape=1) +
  ylab("Mercury [ppm]") +
  theme_classic() +
  geom_smooth(method=lm, se= FALSE)
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
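
If you want to save the figure to a file for your paper, ggsave() writes the most recently created ggplot (the file name below is just an example):

ggsave("Figure5.png", width=6, height=4)  # width and height are in inches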

Alternatively, you can use Excel to create a scatterplot. Watch this video on how to make a scatterplot in Excel: https://www.screencast.com/t/MwVeP49jLmPe

Figure legend

A caption must be included below each figure.

The caption for a linear regression must include:

1.  A short descriptive title following the figure number  
2.  A description of what you plotted (including the type of error bars, if appropriate)  
3.  Your sample size (e.g., # transects/site)  
4.  The p-value and the R2.

Example Results Statement and Figure (Excel was used to generate this figure)

Mercury concentration increased significantly with year (linear regression, F = 10.74, df = 1, 99, p = 0.001, R2 = 0.10, Figure 5). Year explained only 10% of the variation in mercury concentration in bird feathers (Figure 5).

Figure 5. Mercury concentration (ppm) in red-tailed hawk feathers from museum specimens spanning 1980 to 2016. Mercury concentrations increased significantly with year (p = 0.001, R2 = 0.10).

My Data Are Not Normal - What Do I Do?

We will not go into nonparametric tests for linear regression on this statistics guide. If you want to know more about nonparametric options for linear regression, see this R companion handbook here: http://rcompanion.org/handbook/F_12.html

Quick Linear Regression

#bring data into RStudio
DATA <- data.frame(read.table(file='Mercury.csv', sep=',', header=TRUE, fill=TRUE))

#check that data was loaded properly
names(DATA)

#explore data (look for outliers)
attach(DATA)
plot(Year, Mercury, xlab="Year", ylab="Mercury concentration (ppm)", pch=19)
abline(lm(Mercury~Year), col="red")

#run the linear regression
fit<-lm(Mercury~Year, data=DATA)

#check model assumptions
plot(density(fit$residuals))
qqnorm(fit$residuals)
qqline(fit$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
plot(fit$residuals~fit$fitted.values)
lines(lowess(fit$fitted.values,fit$residuals), col="blue")
plot(fit)

#transform response variable
sqrtMercury <- sqrt(DATA$Mercury)

#plot data to make sure the transformation worked
plot(sqrtMercury~DATA$Year)
abline(lm(sqrtMercury~Year), col="red")

#run the linear regression with the transformed response variable
fit<-lm(sqrtMercury~Year, data=DATA)

#check model assumptions again
plot(density(fit$residuals))
qqnorm(fit$residuals)
qqline(fit$residuals, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75))
plot(fit$residuals~fit$fitted.values)
lines(lowess(fit$fitted.values,fit$residuals), col="blue")
plot(fit)

#examine linear regression output
summary(fit)

#create a presentable scatterplot (ggplot2 must already be installed)
library(ggplot2)
ggplot(DATA, aes(x=Year, y=Mercury)) +
  geom_point(shape=1) +
  ylab("Mercury [ppm]") +
  theme_classic() +
  geom_smooth(method=lm, se=FALSE)

*Written by Carrie L. Woods, July 2018. Modified from http://stats.pugetsound.edu/ecology/