1 Study Guide For Final Test: Linear Regression

NOTE: Any changes to this study guide will be announced via the class Facebook page (CalU EcoStats) and via email.

  • This study guide is organized around the book and is meant to help you use the book as a reference for studying for the test.
  • Most - BUT NOT ALL - concepts covered in lectures are noted in this document
  • It is your responsibility to review the lecture notes to ensure that you have studied all materials that are covered by this test

NOTE: Things in brackets [like this] indicate sections/concepts in the book that were not emphasized in lecture and/or will not be on the final test. I’ve tried to italicize text in these cases also.

1.1 Chapters covered

  • Chapter 13.1 Detecting deviations from normality
    • Just this section on diagnostics/residuals
  • Chapter 16: Correlation
    • We covered very little of this chapter; see notes below
  • Chapter 17: Regression
    • This is the focus of this test
  • Chapter 18.13 Analyzing factorial designs
    • Just this section on 2x2 ANOVA

1.2 Chapter 16: Correlation

I used very little material from the book on correlation. In lecture I mostly contrasted correlation with what we study in regression.

1.2.1 Correlation analysis

Key aspects of correlation analysis

  • addresses if there is a relationship
  • and if it is positive or negative
  • and how strong it is
  • does NOT assume directionality
  • Only works w/ 2 variables, x1 & x2
  • only works w/numeric data

1.2.2 Regression (Ch 17)

  • 1-3) same as correlation
  • 4) assumes / tests directionality or causality
  • Simplest case: y ~ x
  • can be generalized to all forms of data
  • Powerful for prediction

1.2.3 “Directionality” & “Causality”

1.2.3.1 Directionality

1-way direction of impact or interaction

  • e.g., the virus HIV causes the disease AIDS
  • The disease AIDS does not cause the virus HIV to occur

1.2.3.2 Causality

Reason, cause, mechanism

  • HIV is the causal agent of AIDS
  • Nothing else causes AIDS
  • Without HIV, there is no AIDS
  • Having HIV is a strong predictor of having AIDS (though less so now)

Correlation analysis does not address these issues

1.2.3.3 Correlation

  • When there is a high density of HIV virions in a patient’s blood, AIDS symptoms are severe
  • When AIDS symptoms are severe, there is a high density of HIV virions

1.2.3.4 Prediction

We can predict the severity of AIDS symptoms based on abundance of HIV

1.2.4 Uses of regression

1.2.4.1 Prediction

  • Given x, predict y, while accounting for uncertainty / error

  • Understand causality/ test hypotheses
  • Does x cause y?

1.2.5 3 Major steps in regression analysis

1.2.5.1 Step 1) Regression model fitting

  • What line fits the model best?
  • This class: “least squares”
  • Advanced regression: “maximum likelihood”

Calculate 2 things:

  • Intercept of line
  • Slope of line

There are equations that provide the exact solution. We did it by hand w/ a ruler just for illustration.
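If you want to see the exact solution in action, here is a minimal sketch in R using made-up x and y values (not class data): the least-squares slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means.

    # Hypothetical data, just to illustrate the exact least-squares solution
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

    b <- cov(x, y) / var(x)      # slope
    a <- mean(y) - b * mean(x)   # intercept
    c(intercept = a, slope = b)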

1.2.5.2 Step 2) Significance testing

  • Is the line any different from a flat line?
  • The slope of flat line = 0
  • Slope of 0 = no change in y as x changes
  • Calculate: standard errors (SE), confidence intervals (CI), t-statistics, F-statistic, p-values
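In R these quantities come from summary() applied to a fitted model. A minimal sketch, using R’s built-in cars data set purely as a stand-in (not the lion or lab data):

    model <- lm(dist ~ speed, data = cars)   # built-in example data, stand-in only
    summary(model)   # estimates, SEs, t-statistics, p-values, and the F-statistic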

1.2.5.3 Step 3) Model Checking

aka “residual analysis”, aka “model diagnostics”

Asks: Do the data meet the assumptions of the model (random & independent sampling, normality, constant variance)? Requires plotting the residuals (errors).

1.3 CHAPTER 17: REGRESSION

  • AKA “linear regression”
  • Regression is used to investigate the relationship between 2 continuous variables
  • Regression, ANOVA, and t-tests are all closely related and can be thought of as different types of linear models


1.3.1 17.1 Linear Regression

1.3.1.1 Ex 17.1 The lion’s nose

  • This is the background for the example I used frequently in class and lab

1.3.1.2 The method of least squares (pg 542)

  • How a regression model is “fit” to the data

1.3.1.3 Figures 17.1-2

  • Very important figure for understanding what residuals are and how the regression line is fit.
  • Related to Lab 10 where we drew regression line by hand.

1.3.1.4 Formula for the line (pg 543)

  • The book gives the form of the regression line as:
    • Y = a + b(X)
  • This is mathematically equivalent to the form we typically learn in geometry class (Y = m(X) + b), but with the symbols changed and the order in which the terms appear flipped.
    • Y = b + m(X)
  • As a word equation, this would be
    • response = intercept + slope(predictor)
  • In R code, to model how Y (the response) varies with X (the predictor), we would write:
    • lm(Y ~ X, …)
  • R would then estimate the intercept (b) and the slope (m)
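As a minimal sketch (using R’s built-in cars data purely as a stand-in for Y and X):

    # cars is a built-in R data set; dist plays the role of Y, speed the role of X
    model <- lm(dist ~ speed, data = cars)
    coef(model)   # first value is the intercept (b), second is the slope (m)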

1.3.1.5 [Calculating the slope and intercept (pg 544)]

  • [You do not need to know the equations on page 544 for the test.]


1.3.1.6 Final lion equation on page 545

You should understand the relationship between the lion study research question, the data, and the equations for the regression line they give

  • Y = 0.88 + 10.65*X
  • Age = 0.88 + 10.65*(proportion.black)

1.3.1.7 [Populations and samples (pg 545)]

[good stuff in this section but I did not emphasize it in class]

1.3.1.8 Predicted values (aka “Y.hat”, pg 546)

  • You should understand how regression equations can be used for prediction
  • For example, how you can use the lion regression example to estimate the unknown age of a lion with a proportion of 0.55 of its nose pigmented black.
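A quick sketch of that calculation:

    # Plug proportion.black = 0.55 into the lion equation
    0.88 + 10.65 * 0.55   # predicted age, roughly 6.7 years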

[There is a nice technical definition of what a regression prediction means in the orange box on page 546, but I forget to emphasize this in class, so it will not appear on the test]

1.3.1.9 Residuals (pg 546)

  • You should know how residuals are calculated
  • For example, residual = observation - prediction # r = Y.observed - Y.predicted
  • You should know why we square residuals when we are fitting a regression line
  • You do not need to know the equations related to MS.residual on page 547.

We covered calculating residuals in lab 10 and the sum of squared residuals. We used R functions to show that the best-fit line had the smallest sum of squared residuals.
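A minimal sketch of those calculations in R (using the built-in cars data as a stand-in, not the lab data):

    model <- lm(dist ~ speed, data = cars)

    # residual = observation - prediction (same as resid(model))
    resids <- cars$dist - predict(model)

    # sum of squared residuals; the least-squares line makes this as small as possible
    sum(resids^2)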

1.3.1.10 Standard error of the slope (pg 547)

You should know that when we do regression we are estimating what the slope is. Since it is estimated, there is a standard error of the slope that represents our uncertainty about the true value of the slope in the real world. Other data sets collected from the same study system would give slightly different results.

  • [You do not need to know the equations related to SE.b on page 548 for the test.]

1.3.1.11 Confidence interval for the slope (pg 548)

  • You should know that you can estimate a confidence interval for the slope of a regression line.
  • You do not need to know the CI equations on page 548 for the final test.
  • You should know that the “multiply by 1.96” rule can be applied to the standard errors for regression slopes.
  • Confidence intervals for the slope and intercept can be used to construct “confidence bands” around regression lines.
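A sketch of the “multiply by 1.96” rule for the slope, again with R’s built-in cars data as a stand-in (confint() gives the exact t-based interval):

    model <- lm(dist ~ speed, data = cars)

    b  <- coef(summary(model))["speed", "Estimate"]     # estimated slope
    se <- coef(summary(model))["speed", "Std. Error"]   # standard error of the slope

    c(lower = b - 1.96 * se, upper = b + 1.96 * se)   # approximate 95% CI
    confint(model)                                    # exact CI from R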

1.3.2 17.2 Confidence in predictions

  • You should know that since there is uncertainty in our estimate of the slope (and intercept), there will be uncertainty in any prediction from a regression model.

1.3.2.1 [Confidence intervals for predictions (pg 549)]

This information is very important but I did not cover it in class, so I will not put it on the final test

1.3.2.2 [Figure 17.2.1]

This will not be on the final test

1.3.2.3 [Extrapolation (pg 550)]

This information is very important but I did not cover it in class, so I will not put it on the final test

1.3.3 17.3 Testing hypotheses about a slope (pg 551)

1.3.4 [The t-test of regression slope (pg 552)]

You do not need to know the math covered on page 552 for the final test.

1.3.5 The ANOVA approach (pg 554)

  • You need to know that we use an ANOVA technique to test hypotheses with our regression line.
  • We do this by fitting a model for the null hypothesis and comparing it to one for the alternative hypothesis
  • You should be able to identify important parts of ANOVA output from a regression in R.
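A minimal sketch of this null-vs-alternative comparison in R (built-in cars data as a stand-in):

    null.model <- lm(dist ~ 1, data = cars)       # flat line: slope = 0 (null hypothesis)
    alt.model  <- lm(dist ~ speed, data = cars)   # fitted regression line

    anova(null.model, alt.model)   # F-statistic and p-value for the comparison
    anova(alt.model)               # ANOVA table for the regression itself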

1.3.5.1 Using R2 to measure the fit of the line to data (pg 555)

  • You need to know that R^2 measures how well the model fits the data.
  • A low R^2 means that the model does not fit the data well and does not explain much of what is going on in the data
  • A high R^2 value means that the model does a good job of explaining the variation in the data
  • If all the data fell on a perfectly straight line, R^2 = 1.
  • R^2 is frequently fairly low in ecological and environmental studies
  • You can have a very low p value (significant difference) AND a very low R^2 (poor fit)
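In R, R^2 appears in the summary() output and can also be pulled out directly; a quick sketch with the built-in cars data as a stand-in:

    model <- lm(dist ~ speed, data = cars)
    summary(model)$r.squared   # proportion of the variation in the data explained by the model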

1.3.6 [17.4 Regression toward the mean ]

  • skipped, but very interesting topic

1.3.7 17.5 Assumptions of regression

  • Very important section.
  • I did not emphasize the precise statement of the assumptions the book uses, so you don’t need to know these exact definitions

What you should know is that key assumptions of the regression model are

  • Data were collected using random sampling
    • This is the 4th bullet point in the book
    • There is no diagnostic plot that can tell you if this assumption is violated
    • This assumption applies to t-tests and ANOVA also
  • The relationship between y and x is fundamentally linear
    • This means that a straight line is an appropriate model to fit to the data
    • This is the 1st bullet point in the book
  • Variance in the response variable (y) does NOT change as x changes
    • This is also called the assumption of “constancy of variance”.
    • We can assess this assumption with a plot of residuals vs. predictions from the model (same as residuals vs. fitted values from the model, which is how R labels it)
    • This assumption applies to t-tests and ANOVA also. In those cases, we say that “variance in the response variable is the same within all groups.”
    • This is the 3rd bullet point in the book
  • The residuals of the model are normally distributed.
    • Key idea: the residuals are normal, NOT the raw data!
    • This is the 2nd bullet point in the book

1.3.7.1 Diagnostic plots

We used the following 4 plots to assess whether our regression fit the assumptions of the model

  • Histogram of residuals to assess normality
    • R: hist(resids)
  • QQ Plot to assess normality
    • R: plot(model, which = 2)
  • Residual vs. Fitted plot to assess constancy of variance
  • Influence plot (resid. vs. “leverage”) to look at influential points and outliers

Note that points 32 and 25 show evidence of being outliers
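A sketch of how these four plots can be produced for any fitted model in R (which = 5 is the residuals vs. leverage plot; built-in cars data as a stand-in):

    model <- lm(dist ~ speed, data = cars)

    hist(resid(model))       # histogram of residuals (normality)
    plot(model, which = 2)   # QQ plot (normality)
    plot(model, which = 1)   # residuals vs. fitted (constancy of variance)
    plot(model, which = 5)   # residuals vs. leverage (influence / outliers)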

1.4 Assess assumption of Constant Variance

Plot the residuals of the model against the fitted values. We can get fitted values with the fitted() function. The plot() function can also do this automatically if we tell it “which = 1”.
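A sketch of both versions (built-in cars data as a stand-in):

    model <- lm(dist ~ speed, data = cars)

    plot(fitted(model), resid(model))   # residuals vs. fitted values, done "by hand"
    abline(h = 0)                       # reference line at zero
    plot(model, which = 1)              # the same plot, produced automatically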

1.4.1 Assessing constancy of variance: Plot Residuals ~ Fitted

  • Non-constant variance occurs when the spread of the residuals (on the y axis) changes as the fitted values (x axis) increase.
  • The variance is therefore “not constant” as the x-axis changes
  • This is a major problem and should cause more concern than non-normality

The red arrows in this plot highlight the non-constant variance in these data



1.4.1.1 Log transformation

Log transformation of the response (y) variable can often accomplish the following goals

  • Improve normality of the residuals
  • Make variance more constant
  • Reduce the impact of outliers

The following plots use the same data as before, except the y variable has been log transformed
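A sketch of refitting a model after log-transforming the response (built-in cars data as a stand-in; this only works when all y values are positive):

    model.raw <- lm(dist ~ speed, data = cars)        # original model
    model.log <- lm(log(dist) ~ speed, data = cars)   # response log-transformed

    plot(model.log, which = 1)   # re-check constancy of variance
    plot(model.log, which = 2)   # re-check normality of residuals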

For the Influence plot:

  • Points between the red lines labeled 0.5 are generally considered ok
  • Points between the 0.5 and 1 lines might be problematic
  • Points outside the red lines deserve careful consideration
  • The point marked “32”, which is our point w/ the largest residual, is near the red 1 line.



1.4.1.2 [Figure 17.5-1]

You do not need to know this

1.4.1.3 Outliers

  • The authors discuss how outliers can be “influential observations”
  • I emphasize that outliers might occur due to two reasons
  • Errors in data collection, recording, or entry
  • Real biological variation
  • If you see an outlying observation, it’s good to check your raw data to rule out the first possibility
  • If an outlier is a real data point and not a mistake, you should then consider whether it might be impacting your regression model and biasing your results
  • We discussed influence during the final lab and lecture.
  • If you want to remove an outlier, you should carefully explain this in your publication and should include information about what the results are both with and without the outlier
  • We used the following plot to examine outliers


This plot shows a potential outlier/influential observation in the raw data

1.4.2 Detecting nonlinearity (pg 559)

We did not talk about this issue. However, it is related to:

  • Use of a smoother plotted through a scatterplot
  • Use of an x^2 term in a model

1.4.3 Detecting non-normality and unequal variance (pg 559)

We used plots similar to these to look at unequal variance.

1.4.4 17.6 Transformations

We worked with the log transformation, though here they are mostly concerned with how it relates to non-linearity.

1.4.5 [Figure 17.6-1]

1.4.6 [Figure 17.6-2]

1.4.7 [Figure 17.6-3]

1.4.8 [17.7 The effects of measurement error on regression]

[important topic, but skipped]

1.4.9 17.8 Nonlinear regression

1.4.10 [A curve with an asymptote]

[skip]

1.4.11 Quadratic curves (pg 565)

We did this in the last lab, fitting an x^2 term to our model.
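A sketch of fitting an x^2 term in R (built-in cars data as a stand-in; I() is needed so the squaring happens inside the formula):

    model.linear    <- lm(dist ~ speed, data = cars)
    model.quadratic <- lm(dist ~ speed + I(speed^2), data = cars)   # adds the x^2 term

    anova(model.linear, model.quadratic)   # does the squared term improve the fit?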

NLB note: quadratic curves are AKA “x^2 terms”, “squared terms”, “quadratic terms”, “squared effect”, “quadratic effects”; I will try to be consistent but will probably fail..


1.4.12 Figure 17.8-2

  • This figure shows a model with an x^2 term fitted.

1.4.13 Formula-free curve fitting (pg 566)

Here is an example of the concept of smoothing.
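A sketch of what a smoother through a scatterplot looks like in code (base R’s scatter.smooth() adds a loess smoother; built-in cars data as a stand-in):

    # points plus a formula-free smooth curve through them
    scatter.smooth(cars$speed, cars$dist,
                   xlab = "speed", ylab = "dist")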

1.4.14 Ex 17.8 The incredible shrinking seal

1.4.15 17.9 Logistic regression: fitting a binary response variable

1.4.16 Figure 17.9-1

I used this example in class
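A minimal sketch of fitting a binary (0/1) response with glm(); the data here are entirely hypothetical:

    # Hypothetical data: survival (0/1) modeled as a function of body size
    size     <- c(2, 3, 4, 5, 6, 7, 8, 9)
    survived <- c(0, 0, 0, 1, 0, 1, 1, 1)

    logit.model <- glm(survived ~ size, family = binomial)
    summary(logit.model)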

[Equations on page 568, 569, 570: I did not discuss this]


1.4.17 17.10 Summary

Key words/concepts that appear in the summary:

  • prediction
  • Y = a + b(X)
  • least squares
  • sum of squares / sum of squared differences
  • residuals
  • slope
  • intercept
  • assumptions
  • normality
  • confidence interval for the slope
  • ANOVA test on regression
  • R^2
  • assumption: linearity, random sampling, normality, constant variance
  • “residual plot” used for model diagnostics
  • transformations
  • log transformation
  • smoothing
  • logistic regression