KITADA
Lesson #17
Motivation:
Simple linear regression refers to an analysis procedure used when there is a quantitative response variable and one explanatory variable. To begin with, the explanatory variable will also be quantitative. Often, we want to make an inference (i.e., a “conclusion”) about a larger population of interest based on sample data using a simple linear regression analysis. To do so, certain conditions must exist for the conclusions to be valid for the population of interest. We have addressed some of these conditions in previous lessons (“linearity” and “outliers”). In this lesson, we will address these and other conditions.
What you need to know from this lesson:
After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
A. The Simple Linear Regression model
When modeling data, we model each observation in the population (notation: \( y_i \)). That is, we try to come up with a general way to describe each observation.
For simple linear regression, it makes sense to use the least-squares regression equation to describe the relationship between the explanatory and response variables (as long as the relationship is linear!). But not all observations fall on the regression line. So, how can we use the equation to model each observation? We know that the difference between what we observe and what we predict we will observe is called the residual. Therefore, an observed point can be described by its predicted value (using the least-squares regression equation) and its residual. That is, an observed value of the response variable is just the sum of its predicted value and its residual. This is the basic idea of the Simple Linear Regression model:
Simple Linear Regression Model:
Observed = Predicted + Residual
1. Writing the simple linear regression model in notation. Keep in mind that the model is about what’s happening in the population of interest. Therefore, Greek letters will be used for the parameters.
a. What is the notation for an observed value of the response variable?
\( y_i \)
b. How do we determine a predicted value of the response variable? Write this in notation.
\( \hat{y}_i=\beta_0+\beta_1 x_i \)
c. What is the notation for a residual?
\( \epsilon_i=y_i-\hat{y}_i \)
d. Putting it all together, write the simple linear regression model in notation.
\( y_i=\beta_0+\beta_1 x_i+\epsilon_i \)
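To make the model concrete, here is a minimal R sketch that simulates observations from the model; the intercept, slope, and residual standard deviation below are made-up values chosen only for illustration.
### SIMULATING FROM THE SLR MODEL (made-up parameter values) ###
set.seed(17)
beta0 <- 30                                  # hypothetical population intercept
beta1 <- 10                                  # hypothetical population slope
sigma <- 5                                   # hypothetical residual standard deviation
x <- runif(50, 1, 5)                         # values of the explanatory variable
epsilon <- rnorm(50, mean = 0, sd = sigma)   # residuals
y <- beta0 + beta1*x + epsilon               # observed = predicted + residual
plot(x, y, pch=16, main="Simulated data from the SLR model")
abline(beta0, beta1, lwd=3, col="red", lty=2)   # the population regression line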
2. Because we typically don’t have data on an entire population, we typically don’t have values for the parameters in the regression equation.
However, to understand how this model works, let’s use the Old Faithful example from Lessons 12 and 14.
Suppose we know the equation of the least-squares regression line for the population of all eruptions of Old Faithful is
\( \hat{y}=33.97+10.36x \)
Use this equation to model an observation that has an observed duration of 2 minutes and an observed interval between eruptions of 51 minutes.
\( 51=33.97+10.36(2)+\epsilon_i \)
So \( \epsilon_i \) must be \( 51-54.69=-3.69 \) in this example.
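As a quick check of this arithmetic in R (using only the numbers given above):
## CHECKING THE RESIDUAL FOR THIS OBSERVATION ##
predicted <- 33.97 + 10.36*2   # predicted interval for a 2-minute eruption (54.69)
residual <- 51 - predicted     # observed minus predicted
residual                       # -3.69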
B. The conditions that must exist to use the results of a Simple Linear Regression analysis to make valid conclusions about a population of interest
One reason for doing a Simple Linear Regression analysis is to make conclusions about a population of interest. When doing so, the above model is used. In order for conclusions to be valid to the population using the simple linear regression model, certain conditions must exist.
To think about these conditions, let's start with one value of the explanatory variable, such as 2 minutes. Over the history of eruptions of Old Faithful, many have lasted exactly 2 minutes. However, the “wait time” from that eruption to the next eruption (i.e. the interval between eruptions, which is the response variable) probably has varied somewhat from eruption to eruption. That is, for all eruptions that have lasted two minutes, the interval between eruptions may not be the same for each eruption. In other words, there is some variation in the values of the response variable for a particular value of the explanatory variable.
There are two things we need to consider regarding the values of intervals between eruptions for all durations that last two minutes:
Now consider another value of the explanatory variable, such as 4 minutes for the duration of the current eruption. Once again, there have been many eruptions of Old Faithful in the history of all eruptions that have lasted exactly 4 minutes. These eruptions will also have variation in their “wait times” until the next eruption. Once again, there are a couple of things we need to consider regarding the values of the intervals between eruptions for all durations that last four minutes:
Let's do one more: durations that lasted three minutes. Again, there will be some variation in the “wait times” until the next eruption for all eruptions that lasted exactly 3 minutes. This time, though, we'll consider three important items:
the values of the response variable must be approximately normally distributed
the standard deviation of the values of the response variable for durations that lasted three minutes is the same as the standard deviation of the values of the response variable for durations that lasted two minutes and for durations that lasted four minutes.
This is the new item: the mean of the values of the interval between eruptions for each duration (2 minutes, 3 minutes, and 4 minutes) may not be the same. But all three means can be connected with a straight line.
Of course, there are an infinite number of possible duration times. But for each duration time, the above list regarding the values of the response variable would hold. These are the conditions that must exist for conclusions from the simple linear regression analysis to be valid to a population of interest!
These are the main conditions that must be met for conclusions from a simple linear regression analysis to be valid to a population of interest. There are a few others that may be familiar. Let’s summarize the conditions in an order of importance.
Side note: many textbooks will write some of the “conditions of the simple linear regression model” in terms of residuals. In particular, the conditions of the simple linear regression model are that the residuals are normally distributed with a mean of 0 and constant standard deviation (notation: \( \sigma \) ), and the residuals are independent of each other. These reduce to the last three conditions mentioned above. Therefore, we won’t think about the conditions in terms of residuals although we may have to check some of these conditions based on the residuals.
C. Assessing the conditions
Here’s how to assess each of the conditions:
Read the explanation of the problem for clues about the sampling scheme. Also check the sample size.
Look at scatterplots and residual plots.
Look for clues in the explanation of the problem.
Look at the residual plot.
Look at a QQ-Plot
Examples of plots to assess the constant variation condition:
DRAW PLOT IN CLASS
DRAW PLOT IN CLASS
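For reference outside of class, here is a small simulated illustration (made-up data, not one of our data sets) of what a residual plot looks like when the constant variation condition is met versus when it is not:
### SIMULATED RESIDUAL PLOTS: CONSTANT VS. NON-CONSTANT VARIATION ###
set.seed(17)
x <- runif(100, 1, 5)
y_constant <- 30 + 10*x + rnorm(100, sd=5)   # spread stays the same across x
y_fan <- 30 + 10*x + rnorm(100, sd=2*x)      # spread grows with x
par(mfrow=c(1,2))
plot(x, resid(lm(y_constant~x)), pch=16,
     ylab="Residual", main="Constant variation: OK")
abline(h=0, lwd=2, lty=2, col="blue")
plot(x, resid(lm(y_fan~x)), pch=16,
     ylab="Residual", main="Fan shape: not OK")
abline(h=0, lwd=2, lty=2, col="blue")
par(mfrow=c(1,1))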
Comment on “normality” condition:
The normality condition says that the y-values are normally distributed around the least-squares regression line FOR EACH X-VALUE. If the observations are independent AND the variation of the y-values is the same for all values of the explanatory variable, all residuals can be put together to assess this condition with a normal probability plot of the residuals.
D. Examples
Based on the following plots, determine if
If any of the above is not satisfied, describe what you would do to alleviate the problem.
1. Old Faithful: regression of interval between eruptions on duration of previous eruption.
### OLD FAITHFUL ... AGAIN ###
of_mod <- with(OLDFAITHFUL, lm(interval~duration))
## SCATTER PLOT ##
with(OLDFAITHFUL,
plot(duration, interval, pch=16, main = "Old Faithful",
ylab="interval between eruptions (minutes)",
xlab="duration of current eruption (minutes)")
)
mtext("Scatterplot of interval between eruptions versus
duration of current eruption", cex = 0.8)
abline(coefficients(of_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT ##
plot(OLDFAITHFUL$duration, resid(of_mod), pch=16,
xlab="Duration of Eruption",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT ##
qqnorm(resid(of_mod))
qqline(resid(of_mod))
a. linearity condition: OK
b. any outliers? NO
c. constant variation condition: Looks pretty good
d. normality condition: Very very close to normal
e. any transformation necessary?: Doesn't need transformation
2. Animal Gestation: regression of average gestation (in days) for a sample of animals on longevity (average life expectancy in years) for the animals. (Note: only the residual plot and normal probability plot of the residuals are given!)
### A NEW EXAMPLE: GESTATION ###
gest_mod <- with(GESTATION, lm(gestation~longevity))
## SCATTER PLOT ##
with(GESTATION,
plot(longevity, gestation, pch=16, main = "Animal Gestation",
ylab="Average Gestation (days)",
xlab="Average Life Expectancy (years)")
)
abline(coefficients(gest_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT ##
plot(GESTATION$longevity, resid(gest_mod), pch=16,
xlab="Average Life Expectancy (years)",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT ##
qqnorm(resid(gest_mod))
qqline(resid(gest_mod))
a. linearity condition: Yes
b. any outliers? Yes, but not influential
c. constant variation condition: No, the variation increases
d. normality condition: The upper tail deviates from normality
e. any transformation necessary?: No
3. Baseball example: regression of # of wins on team batting average (Major League baseball 2011).
### BASEBALL ... AGAIN ###
base_mod <- with(BASEBALL, lm(WINS ~ AVG))
## SCATTERPLOT
plot(BASEBALL$AVG, BASEBALL$WINS, pch=16,
xlab="Team Batting Average",
ylab="Team Wins",
main="Batting Average vs Wins")
abline(coefficients(base_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT
plot(BASEBALL$AVG, resid(base_mod), pch=16,
xlab="Team Batting Average",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT
qqnorm(resid(base_mod))
qqline(resid(base_mod))
a. linearity condition: Yes
b. any outliers? No
c. constant variation condition: Yes
d. normality condition: Very close!
e. any transformation necessary?: No
Summary
We can try to transform the data so the transformed values have a linear relationship (a brief sketch appears after this list).
Note: transformations should only be done if it makes scientific sense to do so.
We can proceed, but we should be very cautious about making inferences. In a more advanced statistics course, we could use generalized regression models.
We can assess whether or not an outlier was due to an error. We can also perform the analysis with and without the outliers.
If the sample is not representative, we can only make very narrow inferences. If the observations are not independent, we would need more complex statistical tools.
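Finally, as referenced in the summary above, here is a minimal sketch of a transformation using made-up data (a log transformation of the response is only one possibility, and should only be used when it makes scientific sense):
### TRANSFORMATION SKETCH (made-up data) ###
set.seed(17)
x <- runif(60, 1, 5)
y <- exp(1 + 0.5*x + rnorm(60, sd=0.2))   # y grows exponentially with x, so the relationship is curved
raw_mod <- lm(y~x)        # linearity condition fails for the raw data
log_mod <- lm(log(y)~x)   # re-fit after log-transforming the response
par(mfrow=c(1,2))
plot(x, resid(raw_mod), pch=16, ylab="Residual", main="Raw y: curved")
abline(h=0, lwd=2, lty=2, col="blue")
plot(x, resid(log_mod), pch=16, ylab="Residual", main="log(y): roughly linear")
abline(h=0, lwd=2, lty=2, col="blue")
par(mfrow=c(1,1))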