NOTE: Any changes to this study guide will be announced via the class Facebook page (CalU EcoStats) and via email.
NOTE: Things in brackets [like this] indicate sections/concepts in the book that were not emphasized in lecture and/or will not be on the final test. I’ve tried to italicize the text in these cases also.
I used very little material from the book on correlation. In lecture I mostly contrasted correlation with what we study using regression.
Key aspects of correlation analysis
1-way direction of impact or interaction
* e.g., the virus HIV causes the disease AIDS
* the disease AIDS does not cause the virus HIV to occur
Reason, cause, mechanism
* HIV is the causal agent of AIDS
* Nothing else causes AIDS
* Without HIV, there is no AIDS
* Having HIV is a strong predictor of having AIDS (though less so now)
Correlation analysis does not address these issues
We can predict the severity of AIDS symptoms based on the abundance of HIV
Given x, predict y, while accounting for uncertainty / error (see the sketch below)
Does x cause y?
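To make the prediction idea concrete, here is a minimal R sketch (the data here are simulated, not from any of our labs):

```r
# Hypothetical data: x = predictor, y = response
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)   # a true line plus random noise

mod <- lm(y ~ x)               # fit the regression of y on x

# Given new x values, predict y; interval = "prediction" also
# reports the uncertainty (error) around each prediction
predict(mod, newdata = data.frame(x = c(5, 15)),
        interval = "prediction")
```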
Calculate 2 things:
* Intercept of line
* Slope of line
There are equations that provide the exact solution (sketched below); we did it by hand w/ a ruler just for illustration.
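A sketch of those exact-solution (least squares) equations in R, using simulated data (the variable names are mine, not from our labs); the hand formulas should match what lm() reports:

```r
set.seed(2)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30)

# Closed-form least-squares estimates
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept

c(intercept = a, slope = b)
coef(lm(y ~ x))   # should match the hand calculation
```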
aka “residual analysis”, aka “model diagnostics”
Asks: do the data meet the assumptions of the model (random & independent sampling, normality, constant variance)? Requires plotting the residuals (errors).
You should understand the relationship between the lion study research question, the data, and the equations the book gives for the regression line
[good stuff in this section but I did not emphasize it in class]
[There is a nice technical definition of what a regression prediction means in the orange box on page 546, but I forgot to emphasize this in class, so it will not appear on the test]
We covered calculating residuals and the sum of squared residuals in lab 10. We used R functions to show that the best-fit line had the smallest sum of squared residuals (a sketch of the idea is below).
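A minimal sketch of that lab 10 idea, with simulated data standing in for the lab data:

```r
set.seed(3)
x <- runif(25, 0, 10)
y <- 3 + 1.5 * x + rnorm(25)

mod <- lm(y ~ x)
sum(resid(mod)^2)          # sum of squared residuals of the best-fit line

# Any other line does worse; e.g., nudge the best-fit slope up by 0.5
a <- coef(mod)[1]
b <- coef(mod)[2] + 0.5
sum((y - (a + b * x))^2)   # larger than the SSR above
```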
You should know that when we do regression we are estimating the slope. Since it is an estimate, there is a standard error of the slope that represents our uncertainty about the true value of the slope in the real world. Other data sets collected from the same study system would give slightly different results.
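In R, the standard error of the slope shows up in the coefficient table from summary(). A sketch with simulated data:

```r
set.seed(4)
x <- runif(40, 0, 10)
y <- 2 + 0.8 * x + rnorm(40)

mod <- lm(y ~ x)
summary(mod)$coefficients
# The "Std. Error" entry in the x row is the standard error of the
# slope: our uncertainty about the true slope. A new data set from
# the same study system would give a slightly different estimate.
```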
This information is very important but I did not cover it in class, so I will not put it on the final test
This will not be on the final test
This information is very important but I did not cover it in class, so I will not put it on the final test
You do not need to know the math covered on page 552 for the final test
What you should know is that the key assumptions of the regression model are: linearity, random sampling, normality, and constant variance
We used the following 4 plots to assess whether our regression fit the assumptions of the model (see the R sketch below)
Note that points 32 and 25 show evidence of being outliers
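Assuming the four plots were R's default lm() diagnostics (my assumption here), they can be produced all at once like this:

```r
set.seed(5)
x <- runif(30, 0, 10)
y <- 2 + x + rnorm(30, sd = x)   # non-constant variance on purpose

mod <- lm(y ~ x)
par(mfrow = c(2, 2))             # 2 x 2 grid of panels
plot(mod)                        # the four default diagnostic plots
par(mfrow = c(1, 1))             # reset the plotting layout
```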
Plot the residuals of the model against the fitted values. We can get the fitted values with the fitted() function. The plot() function can also do this automatically if we tell it “which = 1” (sketched below).
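Both approaches, sketched with R's built-in cars data (our lab data would work the same way):

```r
mod <- lm(dist ~ speed, data = cars)   # example model

# By hand: residuals against fitted values
plot(fitted(mod), resid(mod))
abline(h = 0, lty = 2)                 # reference line at zero

# Automatically: the first of R's standard diagnostic plots
plot(mod, which = 1)
```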
The red arrows in this plot highlight the non-constant variance in these data
Log transformation of the response (y) variable can often accomplish the following goals: reduce non-constant variance, and straighten out (linearize) a curved relationship
The following plots use the same data as before, except the y variable has been log transformed
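Refitting with a logged response is just a change to the model formula. A sketch with simulated data where the variance grows with x:

```r
set.seed(6)
x <- runif(40, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(40, sd = 0.3))  # spread grows with x

mod_raw <- lm(y ~ x)        # residuals fan out (non-constant variance)
mod_log <- lm(log(y) ~ x)   # logging y often evens out the variance

plot(mod_log, which = 1)    # re-check the residuals vs fitted plot
```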
For the Influence plot
* Points between the red lines labeled 0.5 are generally considered ok
* Points between the 0.5 and 1 lines might be problematic
* Points outside the red lines deserve careful consideration
* The point marked “32”, which is our point w/ the largest residual, is near the red 1 line
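If the influence plot was R's residuals-vs-leverage diagnostic (an assumption on my part; a similar plot also comes from packages like car), the red lines are Cook's distance contours at 0.5 and 1:

```r
mod <- lm(dist ~ speed, data = cars)   # example model

# Residuals vs leverage, with Cook's distance contours drawn
# at 0.5 and 1 (these are the default cook.levels)
plot(mod, which = 5, cook.levels = c(0.5, 1))
```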
You do not need to know this
This plot shows a potential outlier / influential observation in the raw data
We did not talk about this issue. However, it is related to
* use of a smoother plotted through a scatterplot
* use of an x^2 term in a model
We used plots similar to these to look at unequal variance.
We worked with the log transformation, though here the book is mostly concerned with how it relates to non-linearity.
[important topic, but skipped]
[skip]
We did this in the last lab, fitting an x^2 term to our model (see the sketch below).
NLB note: quadratic curves are aka “x^2 terms”, “squared terms”, “quadratic terms”, “squared effects”, “quadratic effects”; I will try to be consistent but will probably fail.
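In R the quadratic term is added with I() inside the model formula. A sketch with simulated curved data (not the actual lab data):

```r
set.seed(7)
x <- runif(40, 0, 10)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(40)   # a truly curved relationship

mod_line <- lm(y ~ x)            # straight-line model
mod_quad <- lm(y ~ x + I(x^2))   # adds the quadratic (x^2) term

summary(mod_quad)$coefficients   # the I(x^2) row is the squared effect
```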
Here is an example of the concept of smoothing.
I used this example in class
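I can't reproduce the class example here, but this is one common way to draw a smoother through a scatterplot in R, using the built-in cars data:

```r
# Scatterplot with a lowess smoother drawn through it
plot(dist ~ speed, data = cars)
lines(lowess(cars$speed, cars$dist), col = "red")

# scatter.smooth() does the scatterplot and smoother in one step
scatter.smooth(cars$speed, cars$dist)
```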
[Equations on page 568, 569, 570: I did not discuss this]
Key words/concepts that appear in the summary
* prediction
* Y = a + b(X)
* least squares
* sum of squares / sum of squared differences
* residuals
* slope
* intercept
* assumptions
* normality
* confidence interval for the slope
* ANOVA test on regression
* R^2
* assumptions: linearity, random sampling, normality, constant variance
* “residual plot” used for model diagnostics
* transformations
* log transformation
* smoothing
* logistic regression