March 12, 2019

Linear regression

We use linear regression to analyse linear patterns between a continuous response variable and one or more continuous predictor variable(s). In contrast to correlation analysis, which focuses on the strength of a linear relationship, linear regression assumes a causal relationship between the predictor(s) and the response variable.

The model predictions can be plotted as a regression line which is determined by the intercept (y-value at x = 0) and the slope (change in y per unit increase in x). These parameters are often referred to as \(\beta_{0}\) and \(\beta_{1}\).

\[ y_i = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} \quad \quad \quad \epsilon \sim N(0,\sigma^2) \]

\(y\) = response variable
\(x\) = predictor variable (continuous explanatory variable)
\(\beta_{0}\) = intercept
\(\beta_{1}\) = slope
\(\epsilon\) = model errors, estimated by the residuals
\(\epsilon \sim N(0,\sigma^2)\) reads: the residuals are assumed to approximately follow a normal distribution with a mean of zero and a constant variance of \(\sigma^2\)
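
As a minimal sketch in R (assuming a hypothetical data frame dat with columns y and x), such a model can be fitted and inspected with lm:

# Fit a simple linear regression on a hypothetical data frame 'dat'
m1 <- lm(y ~ x, data = dat)
summary(m1)   # estimates of beta_0 and beta_1, their standard errors, and the R^2
confint(m1)   # 95% confidence intervals for the intercept and the slope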

Total, residual and explained sum of squares

The total variation in a data set is estimated by the total sum of squares (TSS), i.e. the sum of the squared deviations of the observations from the overall mean.
A residual is the difference between an observation and its predicted (fitted) value. The residual sum of squares (RSS) thus represents the variation that was not captured by the model, i.e. the unexplained variation in our data. It is an estimate of the model error.
The explained sum of squares (ESS) is simply the sum of the squared deviations of the model predictions from the overall mean.
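
In symbols, with \(\bar{y}\) the overall mean and \(\hat{y}_{i}\) the fitted value for observation \(i\), the three quantities are

\[ TSS = \sum_{i=1}^{n}(y_{i} - \bar{y})^2 \quad \quad RSS = \sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^2 \quad \quad ESS = \sum_{i=1}^{n}(\hat{y}_{i} - \bar{y})^2 \]

and for a least-squares fit with an intercept they satisfy \(TSS = ESS + RSS\).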



Linear regression and the \(R^{2}\)

We can use the relationship among the sums of squares to estimate the proportion of variation explained by our model.

\[R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}\]


  • TSS – Total sum of squares
  • RSS – Residual sum of squares
  • ESS – Explained sum of squares

However, we usually adjust the \(R^{2}\) value for the number of parameters in the model because the more parameters there are, the more flexibility we allow our model, which may lead to overfitting. You can therefore think of the adjusted \(R^{2}\) (\(R_{adj}^{2}\)) as a penalised version of the \(R^{2}\) (see PDF booklet).
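
A commonly used form of this penalty (with \(n\) observations and \(p\) predictors, excluding the intercept) is

\[ R_{adj}^{2} = 1 - (1 - R^{2})\,\frac{n - 1}{n - p - 1} \]

so that \(R_{adj}^{2}\) only increases when an added predictor improves the fit by more than would be expected by chance.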

Linear regression

The important thing to grasp with linear models is that it is the residuals that are assumed to follow a normal distribution and show homogeneous variance, NOT the raw response values! Why is that?
Patterns present in the response variable could be largely explained by the model. If our model has done a good job and there is no naturally occurring variance pattern in the data, we expect the remaining variation, estimated by the residuals, to be randomly scattered around zero, i.e. to follow a normal distribution with mean zero, and to show constant variance (homoscedasticity, variance homogeneity). We can evaluate these assumptions visually using

  • a quantile-quantile plot (normality assumption)
  • a plot of the residuals vs. the fitted values (model predictions) to check the homoscedasticity assumption. This plot should look like a shotgun blast, i.e. the residuals should show equal spread around the zero line across the range of the fitted values.
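
As a sketch, assuming the hypothetical lm fit m1 from the earlier example, both checks can be produced with base R graphics:

# Quantile-quantile plot of the residuals (normality assumption)
qqnorm(resid(m1)); qqline(resid(m1))
# Residuals vs. fitted values (homoscedasticity assumption)
plot(fitted(m1), resid(m1), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # residuals should scatter evenly around this zero line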

Model diagnostics: variance homogeneity (homoscedasticity)

Plots of residuals vs. fitted values. Left: Good residual pattern indicating variance homogeneity. Centre: Bad fan-shaped residual pattern suggesting increasing variance at high response values. Right: Bad hump-shaped residual pattern suggesting a nonlinear trend in the data.

Dealing with variance heterogeneity

Linear models assume that each observation provides equally precise information about the underlying model parameters. If this holds, then the variation seen in the model errors – the unexplained variation that remains in the raw data after model fitting, which is captured by the residuals – should be constant across the predictor range. This idea underlies the variance homogeneity (homoscedasticity) assumption, which is at once the most important and the most frequently violated assumption in (non)linear modelling.

The question is: Why should we care?

While variance heterogeneity (heteroscedasticity) normally does not have a huge effect on the actual parameter estimates, it strongly affects their standard errors (and thus the confidence intervals and P-values), which are often inflated when heteroscedasticity is present.


Dealing with variance heterogeneity

The old-school approach is to apply a variance-stabilising transformation to the response variable, commonly a log or square-root transformation. These days we can model the variance directly, so a transformation should only be used as a last resort and, if considered, it should be applied to both the response and the predictor variable to maintain the original relationship. Applying the transformation only to the response variable changes the relationship with the predictor. The boxcox function (package MASS) can be used to determine the optimal transformation; see Crawley M (2012) The R Book, ch. 9.11.
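
As a sketch, assuming the hypothetical model m1 and data frame dat from the earlier example (and a strictly positive response), the Box-Cox profile and a log-log refit would look like:

library(MASS)   # provides boxcox()
boxcox(m1)      # profile likelihood for the Box-Cox parameter lambda
# lambda near 0 suggests a log transformation, near 0.5 a square root, near 1 no transformation
m1_log <- lm(log(y) ~ log(x), data = dat)   # transform response AND predictor (see text)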

Dealing with variance heterogeneity

The modern approach is to use a generalised least-squares model (gls, package nlme) that allows the incorporation of a variance structure, which you can think of as a sub-model that models the pattern seen in the residuals.
The aim of this approach is to assign weights that determine how much each observation in the data set influences the final parameter estimates. The overarching model thus becomes a weighted fit with less weight given to less precise observations and more weight to more precise observations.
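
A minimal sketch, assuming the hypothetical data frame dat from above and a residual spread that increases with x:

library(nlme)   # provides gls() and the variance functions
m_gls <- gls(y ~ x, data = dat, weights = varPower(form = ~ x))   # variance modelled as a power of x
summary(m_gls)  # reports the estimated variance (power) parameter alongside the coefficients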

Dealing with variance heterogeneity

There is a variety of variance structures available in the nlme package.
Variance structure   Description
varFixed             Fixed variance (using a continuous variance covariate)
varIdent             Different constant variances per stratum
varExp               Exponential of the variance covariate
varPower             Power of the variance covariate
varConstPower        Constant plus power of the variance covariate
varComb              A combination of variance functions
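
As a sketch of how candidate structures might be compared (assuming the hypothetical objects from the previous sketch plus a grouping factor group in dat), gls fits that share the same fixed effects can be compared via their AIC:

m_ident <- gls(y ~ x, data = dat, weights = varIdent(form = ~ 1 | group))   # one variance per stratum
m_exp   <- gls(y ~ x, data = dat, weights = varExp(form = ~ x))             # variance changes exponentially with x
anova(m_gls, m_ident, m_exp)   # AIC/BIC comparison of the variance structures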

Linear regression

Which of the two data sets do you expect to result in heteroscedastic residuals?


Linear regression with and without variance modelling

Using the data from the right-hand figure on the previous slide, we obtain the following model outputs from lm and gls. Note the decrease in the standard errors when we account for heteroscedasticity.
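
A sketch of how such a comparison can be made directly in R, assuming the hypothetical fits m1 (lm) and m_gls (gls with a variance structure) from the earlier sketches:

coef(summary(m1))       # lm coefficient table: estimates, standard errors, t- and P-values
summary(m_gls)$tTable   # gls coefficient table after accounting for the variance pattern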

Model diagnostic plots without variance modelling


Model diagnostic plots with variance modelling


Linear regression, class data set

Multiple regression

When we have a single continuous predictor, we talk about simple linear regression.
Models with multiple continuous predictor variables are called multiple regression models.

Issue with multiple regression:

Multicollinearity – occurs when two or more of the continuous predictor variables are highly correlated, i.e. one variable can be linearly predicted from the others. This is often the case when two or more predictors carry similar/overlapping information (e.g. temperature and humidity).

Consequences: unstable parameter estimates, inflated variances of the coefficients (and thus inflated standard errors) and, consequently, unreliable P-values.


Multiple regression

How to detect multicollinearity?

  • graphically – scatterplot matrices (pairs plot) of the predictors
  • using variance inflation factors (VIF) (vif, package car), which measure how much the variance of the model coefficients is inflated and allow identification of the 'culprits'.

Remedies:
Remove one of the 'culprits' or, if there are numerous predictor variables, reduce the dimensionality, i.e. use a multivariate approach to boil down the number of predictor variables.
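
A sketch, assuming a hypothetical multiple regression of y on predictors x1, x2 and x3 in dat:

library(car)   # provides vif()
pairs(dat[, c("x1", "x2", "x3")])            # scatterplot matrix of the predictors
m_mult <- lm(y ~ x1 + x2 + x3, data = dat)
vif(m_mult)   # rule of thumb: values above roughly 5-10 flag problematic collinearity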


Analysis of covariance (ANCOVA)

If we have a continuous response variable and a mix of continuous and categorical predictors (at least one of each kind), we use a so-called analysis of covariance model (ANCOVA). Consider the growth rate (mm day\(^{-1}\)) of three different fungal species (growth) as a function of fungicide concentration (conc in mg L\(^{-1}\)).

ANCOVA scenarios


ANCOVA without interaction

Summary interpretation  

When there is no interaction term present, the model assumes a common slope, i.e. significant differences between factor levels indicate significantly different intercepts, while the regression lines all share the same slope.
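
In R, the common-slope model might look like the following sketch (assuming a hypothetical data frame fungi with growth, conc and the factor species):

m_add <- lm(growth ~ conc + species, data = fungi)   # additive model: one common slope, species-specific intercepts
summary(m_add)   # the species coefficients test for differences between the intercepts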

 


ANCOVA with interaction

Summary interpretation  

In an ANCOVA with interaction, the model allows different intercepts as well as different slopes. A significant interaction translates into different slopes among the factor levels. If the interaction term is not significant, the model can be simplified (removal of the interaction term) and the interpretation boils down to an ANCOVA without interaction.
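
A sketch of the interaction model and its simplification, using the same hypothetical fungi data and the additive fit m_add from the previous sketch (the object is named m2 to match the post-hoc code further below):

m2 <- lm(growth ~ conc * species, data = fungi)   # species-specific intercepts AND slopes
anova(m_add, m2)   # F-test of the conc:species interaction term
# if the interaction is not significant, drop it and interpret the additive model m_add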

 


Post-hoc test for ANCOVA with interaction

A significant interaction in an ANCOVA manifests itself in the form of different slopes among the factor levels, and the associated post-hoc analysis tests for differences between these slopes.

As before, we want to ascertain whether all slopes are different from each other or if only a particular pairwise comparison is statistically significant.

Post-hoc test for ANCOVA with interaction

library(emmeans); library(multcomp)   # emtrends()/pairs(); as.glht()/adjusted()
# estimate each species' slope over conc, then compare the slopes pairwise (BH-adjusted P-values)
slopes <- emtrends(m2, specs = "species", var = "conc")
summary(as.glht(pairs(slopes)), test = adjusted(type = "BH"))