Linear Regression

What sets regression analysis apart from correlation is that we go one step beyond assessing a relationship, and actually hypothesize a direction of that relationship.

We assume one variable to be the independent (predictor) variable and one variable to be the dependent (outcome or criterion) variable.

Goal = prediction. In this section, we will go beyond simple linear regression and discuss the more likely case of multiple regression.

Regression Model

Multiple Regression is defined by the following:

\(Y_{i} = \beta_{0} + \beta_{1}x_{i1} + \dots + \beta_{k}x_{ik} + e_{i}\)

  • There is always error that we can't explain, or random error, which is included in the equation above as "e".

  • Both Y and e as well as x have a little "i" attached because they are unique to the individual. The other terms are fixed for the entire population.

Assumptions and Considerations

There are several assumptions we should consider before attempting to model an outcome using this method:

  • homogeneity of variance
  • linearity
  • independence of errors

and a few other less formal assumptions/considerations:

  • Normality of the errors (important for accurate standard errors, especially in small samples)
  • Model specification is correct (will never be perfect, but should be useful)
  • Predictors may be related, but should not be highly correlated with one another (this is called collinearity). This can be evaluated with a VIF (Variance Inflation Factor) coefficient (function vif available in car package)
  • The data is good - may want to assess outliers
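
A few of these checks can be sketched in base R. This is only an illustration: the built-in mtcars data and the predictors wt, hp, and disp are assumptions made for the example, and the helper vif_manual is a hypothetical stand-in that computes the same quantity as vif in the car package (1 / (1 - R-squared), where R-squared comes from regressing one predictor on the others).

```r
# Illustrative model on the built-in mtcars data (an assumption for this sketch)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Hypothetical helper: VIF for one predictor = 1 / (1 - R^2),
# where R^2 comes from regressing that predictor on the other predictors
vif_manual <- function(predictor, others, data) {
  r2 <- summary(lm(reformulate(others, response = predictor), data = data))$r.squared
  1 / (1 - r2)
}

predictors <- c("wt", "hp", "disp")
vifs <- sapply(predictors, function(p) vif_manual(p, setdiff(predictors, p), mtcars))
print(round(vifs, 2))

# Residuals vs. fitted values: curvature suggests non-linearity,
# a funnel shape suggests unequal error variance
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Larger VIF values mean a predictor is more redundant with the others; what counts as "too large" is a judgment call.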

Parameter Estimation

  • Remember that the goal is to explain variance in the outcome.
  • The model that explains the most variance in the outcome is the model that leads to the smallest amount of prediction error.
  • R will estimate the parameters in the model using Ordinary Least Squares (OLS), which selects the parameter values that minimize the sum of squared errors between the actual scores and the estimated scores.
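
To make the "smallest squared error" idea concrete, here is a minimal sketch that reproduces the lm estimates by solving the normal equations by hand, then shows that nudging a coefficient away from the OLS solution increases the sum of squared errors. The mtcars data and the chosen predictors are assumptions for illustration only.

```r
# Illustrative data: built-in mtcars (an assumption for this sketch)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Design matrix (including the intercept column) and outcome vector
X <- model.matrix(fit)
y <- mtcars$mpg

# Normal equations: solve (X'X) b = X'y for b
b_manual <- solve(crossprod(X), crossprod(X, y))

# Sum of squared errors at the OLS solution
sse_ols <- sum((y - X %*% b_manual)^2)

# Shift the intercept away from the OLS value and recompute the SSE
b_other <- b_manual + c(0.5, 0, 0)
sse_other <- sum((y - X %*% b_other)^2)

print(cbind(lm = coef(fit), manual = drop(b_manual)))
print(c(ols = sse_ols, perturbed = sse_other))
```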

Parameter Estimation, cont

The OLS estimates are unbiased when the model is correctly specified, but they can still be distorted by a poorly specified model or made unstable by multicollinearity. More prone to bias are the standard errors of the estimates, which are sensitive to many of the assumptions and concerns listed above. Why do we care about the standard errors? The SE represents the precision of an estimate, and recall from earlier statistics courses that the SE is the denominator of the test statistic used for hypothesis testing. Thus, inflated standard errors are problematic because they lead to underestimated test statistics. Likewise, underestimated standard errors lead to inflated test statistics.
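
The link between the SE and the test statistic can be checked directly. In this sketch (mtcars and the predictors are assumptions for illustration), each t value reported by summary is just the estimate divided by its standard error, so doubling the SE would cut the test statistic in half.

```r
# Illustrative model on the built-in mtcars data (an assumption for this sketch)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# Test statistic = estimate / standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
print(cbind(t_reported = coefs[, "t value"], t_manual))

# What an inflated SE would do: doubling the SE halves the test statistic
t_if_se_doubled <- coefs[, "Estimate"] / (2 * coefs[, "Std. Error"])
print(t_if_se_doubled)
```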

Linear Regression in R

Use the lm function in R to fit a Linear Model

  • first argument is regression model in the format outcome ~ predictor.
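  • add more predictors by simply using the + sign
  • add interaction terms by using * - such as var1*var2. Note that var1*var2 expands to both main effects plus their interaction; use var1:var2 for the interaction term alone.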
  • you do not have to use the $ operator, because you can directly specify your R dataset using the data argument.

I recommend that you assign your regression to an object, as shown below:

model1 <- lm(y~x, data=datasetname)
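
As a hedged illustration of the formula syntax, the sketch below fits a few models to the built-in mtcars data. The dataset, the variable names, and the object names (m_simple, m_multi, and so on) are assumptions for the example, not part of the original material.

```r
# One predictor
m_simple <- lm(mpg ~ wt, data = mtcars)

# Two predictors, added with +
m_multi <- lm(mpg ~ wt + hp, data = mtcars)

# Interaction with *: expands to wt + hp + wt:hp
m_inter <- lm(mpg ~ wt * hp, data = mtcars)

# The same model written out term by term
m_inter_long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Both interaction models contain the same four terms
print(names(coef(m_inter)))
print(coef(m_inter))
```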

Linear Regression in R, cont

Use other functions to extract information from your model object

  • summary - returns a summary of model output
  • coef - returns coefficient estimates
  • plot - returns diagnostic plots
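  • predict - returns predicted values (add the interval argument to get confidence or prediction intervals)
  • confint - returns confidence intervals for the coefficient estimates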
  • residuals - returns model residuals
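
Here is a short sketch of these extractor functions in use. The model, the mtcars data, and the new_cars data frame are assumptions made up for the example.

```r
# Illustrative model on the built-in mtcars data (an assumption for this sketch)
model1 <- lm(mpg ~ wt + hp, data = mtcars)

print(summary(model1))            # full model output
est <- coef(model1)               # coefficient estimates
ci <- confint(model1)             # 95% confidence intervals for the coefficients
res <- residuals(model1)          # one residual per observation

# Predicted values for new, hypothetical cars
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
preds <- predict(model1, newdata = new_cars)

# Predicted values with confidence intervals for the mean response
preds_ci <- predict(model1, newdata = new_cars, interval = "confidence")
print(preds_ci)

# Diagnostic plots, arranged in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model1)
```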

Plotting Linear Regression in R

After installing the ggplot2 package, you should be able to try something like this:

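library(ggplot2)

mydata$predicted <- predict(model1)  # fitted values from the model

ggplot(mydata, aes(x = xvar, y = yvar)) +
  geom_segment(aes(xend = xvar, yend = predicted)) +  # residual lines
  geom_point() +                                      # observed values
  geom_point(aes(y = predicted), shape = 1)           # predicted values (open circles)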

Interpreting regression coefficients

If you left all variables on their original scale (no standardizing):

  • intercept is interpreted as the expected value of the outcome when all predictors are at a value of 0.
  • slope coefficient is interpreted as the expected change in Y for a single unit change in that predictor

Note: predictors are often on different scales, so just because one variable's slope is larger than another variable's slope, we cannot infer from this that the former is a stronger predictor.

Interpreting regression coefficients, cont

If you want to directly compare the strength of the slopes across different predictors, you should standardize all of the variables. In this case:

  • intercept is the expected value of Y, in SD units, when all predictors are at their mean. With every variable standardized, this will be 0. Can you guess why this is the case?
  • The slope parameters are the expected SD change in Y for a single SD increase in that predictor.

Note: Because all of the predictors are in the same unit, you may directly compare slope estimates in this case.
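
One way to standardize is with the scale function, sketched below. The mtcars data and the choice of variables are assumptions for illustration. The example also checks a useful identity: each standardized slope equals the raw slope multiplied by SD(x)/SD(y).

```r
# Standardize the outcome and the predictors (built-in mtcars, assumed for this sketch)
vars <- c("mpg", "wt", "hp")
mtcars_z <- as.data.frame(scale(mtcars[, vars]))

fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_z <- lm(mpg ~ wt + hp, data = mtcars_z)

# Standardized slopes are on a common SD scale and can be compared directly;
# the intercept is 0 up to rounding error
print(round(coef(fit_z), 3))

# Same slopes obtained by rescaling the raw coefficients
slopes_rescaled <- coef(fit_raw)[c("wt", "hp")] *
  sapply(mtcars[, c("wt", "hp")], sd) / sd(mtcars$mpg)
print(round(slopes_rescaled, 3))
```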

Interpreting regression coefficients, cont

In some cases you might be interested in centering predictor variables, i.e. centering "about the mean".

Result: a score that represents someone's distance from the mean on the original variable scale

Example: Someone with a centered score of 0 on an Age variable would have the mean age. Alternatively, someone who is 70 y.o. (M = 59) has a mean-centered age of 70 - 59 = 11. They are 11 years older than the mean age.

Centering predictors is particularly useful (and some would argue almost necessary) if your model includes two-way interaction terms.
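
A minimal sketch of mean-centering with an interaction model follows. The mtcars data and the variable names wt_c and hp_c are assumptions for the example. Centering leaves the interaction coefficient and the fitted values unchanged; what changes is the meaning of the lower-order terms, which now describe effects at the mean of the other predictor rather than at 0.

```r
# Copy of the built-in mtcars data with centered predictors (names are hypothetical)
dat <- mtcars
dat$wt_c <- dat$wt - mean(dat$wt)
dat$hp_c <- dat$hp - mean(dat$hp)

fit_raw <- lm(mpg ~ wt * hp, data = dat)
fit_c <- lm(mpg ~ wt_c * hp_c, data = dat)

# Lower-order terms differ between the two models; the interaction term does not
print(round(coef(fit_raw), 4))
print(round(coef(fit_c), 4))
```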

Logistic Regression

What if you have an outcome that is not numeric? Example: school-readiness

  • You have data from teachers indicating which children are "ready" (Y=1), and which are "not ready" (Y=0).
  • This outcome is a binary variable with response categories 0 and 1.
  • Unfortunately, we can't just slap a linear regression onto this research question. Instead, we will use logistic regression, a cousin to linear regression, and part of a family of models called generalized linear models (not to be confused with the general linear model).

Logistic Regression, cont.

In logistic regression, rather than predicting the expected value of Y given a set of X values, we predict the log-odds that Y=1 given a set of X values.

Otherwise, things are pretty familiar here:

  • The slope is the expected change in the log-odds of Y=1 given a single unit change in X.
  • The intercept represents the log-odds that Y=1 when all predictors are at a value of 0.

Logistic Regression in R

You can fit the logistic regression model in R using the glm function. See the code below:

model2 <- glm(binary.outcome ~ pred1 + pred2, data=mydata, family=binomial)

You use the family argument to specify which specific generalized model you are fitting, or more technically, to specify the error distribution for the model.
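
As a runnable illustration, the sketch below uses the built-in mtcars data, treating the binary variable am (0 = automatic, 1 = manual) as the outcome. The dataset and predictors are assumptions for the example. It shows that the model's linear predictor is on the log-odds scale, while type = "response" returns probabilities.

```r
# Binary outcome: am (0/1) from the built-in mtcars data (assumed for this sketch)
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Coefficients are on the log-odds scale
print(coef(fit_logit))

# Predictions on the log-odds scale and on the probability scale
log_odds <- predict(fit_logit, type = "link")
prob <- predict(fit_logit, type = "response")

# The two are linked by the inverse logit: p = exp(log_odds) / (1 + exp(log_odds))
print(head(cbind(log_odds, prob)))
```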

Logistic Regression in R, cont

While the model itself might make sense, interpreting the estimates in a meaningful way is the biggest challenge of logistic regression. Let's translate!

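  • To convert log-odds to odds, you can exponentiate the former value. An exponentiated intercept is the odds that Y=1 when all predictors are 0; an exponentiated slope is an odds ratio, the multiplicative change in the odds for a single unit change in X. This can be done in R using the exp function, such as:
exp(coef(model2)) #exponentiate the log-odds estimates to get odds (intercept) and odds ratios (slopes)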
  • You may also be interested in translating the estimates into probabilities. Probability is related to odds such that Odds/(1+Odds) = P. You try it!
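
The translation from log-odds to odds to probability can be sketched as follows. The mtcars data, the single predictor wt, and the example value wt = 3 are assumptions for illustration. The hand-computed probability is compared against what predict returns.

```r
# Illustrative logistic model on the built-in mtcars data (assumed for this sketch)
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)
b <- coef(fit_logit)

# Exponentiated intercept = odds at wt = 0; exponentiated slope = odds ratio
print(exp(b))

# Predicted log-odds, odds, and probability for a hypothetical car with wt = 3
log_odds_3 <- b[["(Intercept)"]] + b[["wt"]] * 3
odds_3 <- exp(log_odds_3)
p_3 <- odds_3 / (1 + odds_3)

# Same probability from predict
p_check <- predict(fit_logit, newdata = data.frame(wt = 3), type = "response")
print(c(by_hand = p_3, from_predict = unname(p_check)))
```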

Model Selection

To evaluate the usefulness of a given model, you might look at R output such as:

  • Individual predictor estimates and associated statistical tests (H0: B=0)
  • R-squared and adjusted R-squared values for the entire model
  • F-test for full model

Model Selection, cont.

Often it is the case that you don't just have a single model to evaluate. Instead, you will likely be in the position of having to choose between multiple models.

  • R-squared: proportion of variance (in outcome) that is explained by a given model
  • The adjusted R-squared value is adjusted for the number of predictors

Note: Best practice is to decide a priori what the selection criteria will be for such a decision.

Model Selection, cont.

How do you know when a change in R-squared is enough to justify favoring one model over another? This can all be pretty subjective.

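If your models are nested (all the predictors of one model are entirely included in the other), compare them with the anova function. For lm models this runs an F-test on the change in residual sum of squares; for glm models, add test = "Chisq" to get a deviance test.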

anova(model1, model2)

If the p-value for this test is < .05, you may conclude that the more complex model explains significantly more variability in the outcome, over and above the simpler model.
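
A hedged sketch of such a comparison, again using the built-in mtcars data as an assumed example: the simpler model is nested in the more complex one, and the p-value is pulled out of the anova table alongside the R-squared values.

```r
# Nested models on the built-in mtcars data (assumed for this sketch)
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)

# F-test comparing the simpler model to the more complex model
cmp <- anova(model1, model2)
print(cmp)

# p-value for the comparison (second row of the table)
p_value <- cmp[["Pr(>F)"]][2]

# R-squared and adjusted R-squared for each model
r2 <- c(model1 = summary(model1)$r.squared, model2 = summary(model2)$r.squared)
adj_r2 <- c(model1 = summary(model1)$adj.r.squared, model2 = summary(model2)$adj.r.squared)
print(rbind(r2, adj_r2))
```

R-squared can never decrease when a predictor is added, which is why the adjusted value and the formal test are the more useful guides here.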