KITADA
Lesson #17
Motivation:
Simple linear regression refers to an analysis procedure used when there is a quantitative response variable and one explanatory variable. To begin with, the explanatory variable will also be quantitative. Often, we want to make an inference (i.e., a “conclusion”) about a larger population of interest based on sample data using a simple linear regression analysis. To do so, certain conditions must exist for the conclusions to be valid for the population of interest. We have addressed some of these conditions in previous lessons (“linearity” and “outliers”). In this lesson, we will address these and other conditions.
What you need to know from this lesson:
After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
A. The Simple Linear Regression model
When modeling data, we model each observation in the population (notation: \( y_i \)). That is, we try to come up with a general way to describe each observation.
For simple linear regression, it makes sense to use the least-squares regression equation to describe the relationship between the explanatory and response variables (as long as the relationship is linear!). But not all observations fall on the regression line. So, how can we use the equation to model each observation? We know that the difference between what we observe and what we predict we will observe is called the residual. Therefore, an observed point can be described by its predicted value (using the least-squares regression equation) and its residual. That is, an observed value of the response variable is just the sum of its predicted value and its residual. This is the basic idea of the Simple Linear Regression model:
Simple Linear Regression Model:
Observed = Predicted + Residual
1. Writing the simple linear regression model in notation. Keep in mind that the model is about what’s happening in the population of interest. Therefore, Greek letters will be used for the parameters.
a. What is the notation for an observed value of the response variable?
\( y_i \)
b. How do we determine a predicted value of the response variable? Write this in notation.
\( \hat{y}_i=\beta_0+\beta_1 x_i \)
c. What is the notation for a residual?
\( \epsilon_i=y_i-\hat{y}_i \)
d. Putting it all together, write the simple linear regression model in notation.
\( y_i=\beta_0+\beta_1 x_i+\epsilon_i \)
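To make the model concrete, here is a minimal R sketch that simulates observations from the model; the intercept, slope, and residual standard deviation below are made-up values chosen only for illustration.
### SIMULATING FROM THE SLR MODEL (made-up parameter values) ###
set.seed(17)
beta0 <- 30                                  # hypothetical population intercept
beta1 <- 10                                  # hypothetical population slope
sigma <- 5                                   # hypothetical residual standard deviation
x <- runif(50, 1, 5)                         # values of the explanatory variable
epsilon <- rnorm(50, mean = 0, sd = sigma)   # residuals
y <- beta0 + beta1*x + epsilon               # observed = predicted + residual
plot(x, y, pch=16, main="Simulated data from the SLR model")
abline(beta0, beta1, lwd=3, col="red", lty=2)   # the population regression line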
2. Because we typically don’t have data on an entire population, we typically don’t have values for the parameters in the regression equation.
However, to understand how this model works, let’s use the Old Faithful example from Lessons 12 and 14.
Suppose we know the equation of the least-squares regression line for the population of all eruptions of Old Faithful is
\( \hat{y}=33.97+10.36x \)
Use this equation to model an observation that has an observed duration of 2 minutes and an observed interval between eruptions of 51 minutes.
\( 51=33.97+10.36(2)+\epsilon_i \)
So \( \epsilon_i \) must be \( 51-54.69=-3.69 \) in this example.
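As a quick check of this arithmetic in R (using only the numbers given above):
## CHECKING THE RESIDUAL FOR THIS OBSERVATION ##
predicted <- 33.97 + 10.36*2   # predicted interval for a 2-minute eruption (54.69)
residual <- 51 - predicted     # observed minus predicted
residual                       # -3.69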
B. The conditions that must exist to use the results of a Simple Linear Regression analysis to make valid conclusions about a population of interest
One reason for doing a Simple Linear Regression analysis is to make conclusions about a population of interest. When doing so, the above model is used. In order for conclusions to be valid to the population using the simple linear regression model, certain conditions must exist.
To think about these conditions, let's start with one value of the explanatory variable, such as 2 minutes. Over the history of eruptions of Old Faithful, many have lasted exactly 2 minutes. However, the “wait time” from that eruption to the next eruption (i.e. the interval between eruptions, which is the response variable) probably has varied somewhat from eruption to eruption. That is, for all eruptions that have lasted two minutes, the interval between eruptions may not be the same for each eruption. In other words, there is some variation in the values of the response variable for a particular value of the explanatory variable.
There are two things we need to consider regarding the values of intervals between eruptions for all durations that last two minutes:
Now consider another value of the explanatory variable, such as 4 minutes for the duration of the current eruption. Once again, there have been many eruptions of Old Faithful in the history of all eruptions that have lasted exactly 4 minutes. These eruptions will also have variation in their “wait times” until the next eruption. Once again, there are a couple of things we need to consider regarding the values of the intervals between eruptions for all durations that last four minutes:
Let's do one more: durations that lasted three minutes. Again, there will be some variation in the “wait times” until the next eruption for all eruptions that lasted exactly 3 minutes. This time, though, we'll consider three important items:
the values of the response variable must be approximately normally distributed
the standard deviation of the values of the response variable for durations that lasted three minutes is the same as the standard deviation of the values of the response variable for durations that lasted two minutes and for durations that lasted four minutes.
This is the new item: the mean of the values of the interval between eruptions for each duration (2 minutes, 3 minutes, and 4 minutes) may not be the same. But all three means can be connected with a straight line.
Of course, there are an infinite number of possible duration times. But for each duration time, the above list regarding the values of the response variable would hold. These are the conditions that must exist for conclusions from the simple linear regression analysis to be valid to a population of interest!
These are the main conditions that must be met for conclusions from a simple linear regression analysis to be valid to a population of interest. There are a few others that may be familiar. Let’s summarize the conditions in an order of importance.
Side note: many textbooks will write some of the “conditions of the simple linear regression model” in terms of residuals. In particular, the conditions of the simple linear regression model are that the residuals are normally distributed with a mean of 0 and constant standard deviation (notation: \( \sigma \) ), and the residuals are independent of each other. These reduce to the last three conditions mentioned above. Therefore, we won’t think about the conditions in terms of residuals although we may have to check some of these conditions based on the residuals.
C. Assessing the conditions
Here’s how to assess each of the conditions:
Read the explanation of the problem for clues about the sampling scheme. Also check the sample size.
Look at scatterplots and residual plots.
Look for clues in the explanation of the problem.
Look at the residual plot.
Look at a QQ-Plot
Examples of plots to assess the constant variation condition:
DRAW PLOT IN CLASS
DRAW PLOT IN CLASS
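For reference outside of class, here is a small simulated illustration (made-up data, not one of our data sets) of what a residual plot looks like when the constant variation condition is met versus when it is not:
### SIMULATED RESIDUAL PLOTS: CONSTANT VS. NON-CONSTANT VARIATION ###
set.seed(17)
x <- runif(100, 1, 5)
y_constant <- 30 + 10*x + rnorm(100, sd=5)   # spread stays the same across x
y_fan <- 30 + 10*x + rnorm(100, sd=2*x)      # spread grows with x
par(mfrow=c(1,2))
plot(x, resid(lm(y_constant~x)), pch=16,
     ylab="Residual", main="Constant variation: OK")
abline(h=0, lwd=2, lty=2, col="blue")
plot(x, resid(lm(y_fan~x)), pch=16,
     ylab="Residual", main="Fan shape: not OK")
abline(h=0, lwd=2, lty=2, col="blue")
par(mfrow=c(1,1))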
Comment on “normality” condition:
The normality condition says that the y-values are normally distributed around the least-squares regression line FOR EACH X-VALUE. If the observations are independent AND the variation of the y-values is the same for all values of the explanatory variable, all residuals can be put together to assess this condition with a normal probability plot of the residuals.
D. Examples
Based on the following plots, determine if
If any of the above is not satisfied, describe what you would do to alleviate the problem.
1. Old Faithful: regression of interval between eruptions on duration of previous eruption.
### OLD FAITHFUL ... AGAIN ###
of_mod <- with(OLDFAITHFUL, lm(interval~duration))
## SCATTER PLOT ##
with(OLDFAITHFUL,
plot(duration, interval, pch=16, main = "Old Faithful",
ylab="interval between eruptions (minutes)",
xlab="duration of current eruption (minutes)")
)
mtext("Scatterplot of interval between eruptions versus
duration of current eruption", cex = 0.8)
abline(coefficients(of_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT ##
plot(OLDFAITHFUL$duration, resid(of_mod), pch=16,
xlab="Duration of Eruption",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT ##
qqnorm(resid(of_mod))
qqline(resid(of_mod))
a. linearity condition: OK
b. any outliers? NO
c. constant variation condition: Looks pretty good
d. normality condition: Very very close to normal
e. any transformation necessary?: Doesn't need transformation
2. Animal Gestation: regression of average gestation (in days) for a sample of animals on longevity (average life expectancy in years) for the animals. (Note: only the residual plot and normal probability plot of the residuals are given!)
### A NEW EXAMPLE: GESTATION ###
gest_mod <- with(GESTATION, lm(gestation~longevity))
## SCATTER PLOT ##
with(GESTATION,
plot(longevity, gestation, pch=16, main = "Animal Gestation",
ylab="Average Gestation (days)",
xlab="Average Life Expectancy (years)")
)
abline(coefficients(gest_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT ##
plot(GESTATION$longevity, resid(gest_mod), pch=16,
xlab="Average Life Expectancy (years)",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT ##
qqnorm(resid(gest_mod))
qqline(resid(gest_mod))
a. linearity condition: Yes
b. any outliers? Yes, but not influential
c. constant variation condition: No, the variation increases
d. normality condition: The upper tail deviates from normality
e. any transformation necessary?: No
3. Baseball example: regression of # of wins on team batting average (Major League baseball 2011).
### BASEBALL ... AGAIN ###
base_mod <- with(BASEBALL, lm(WINS ~ AVG))
## SCATTERPLOT
plot(BASEBALL$AVG, BASEBALL$WINS, pch=16,
xlab="Team Batting Average",
ylab="Team Wins",
main="Batting Average vs Wins")
abline(coefficients(base_mod),
lwd=3, col="red", lty=2)
## RESIDUAL PLOT
plot(BASEBALL$AVG, resid(base_mod), pch=16,
xlab="Team Batting Average",
ylab="Residual",
main="Residual Plot")
abline(h=0, lwd=2, lty=2,
col="blue")
## QQ PLOT
qqnorm(resid(base_mod))
qqline(resid(base_mod))
a. linearity condition: Yes
b. any outliers? No
c. constant variation condition: Yes
d. normality condition: Very close!
e. any transformation necessary?: No
Summary
We can try to transform the data so the transformed values have a linear relationship (a brief sketch appears after this list).
Note: transformations should only be done if it makes scientific sense to do so.
We can proceed, but we should be very cautious about making inferences. In a more advanced statistics course, we could use generalized regression models.
We can assess whether or not an outlier was due to an error. We can also perform the analysis with and without the outliers.
If the sample is not representative, we can only make very narrow inferences. If the observations are not independent, we would need more complex statistical tools.
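Finally, as referenced in the summary above, here is a minimal sketch of a transformation using made-up data (a log transformation of the response is only one possibility, and should only be used when it makes scientific sense):
### TRANSFORMATION SKETCH (made-up data) ###
set.seed(17)
x <- runif(60, 1, 5)
y <- exp(1 + 0.5*x + rnorm(60, sd=0.2))   # y grows exponentially with x, so the relationship is curved
raw_mod <- lm(y~x)        # linearity condition fails for the raw data
log_mod <- lm(log(y)~x)   # re-fit after log-transforming the response
par(mfrow=c(1,2))
plot(x, resid(raw_mod), pch=16, ylab="Residual", main="Raw y: curved")
abline(h=0, lwd=2, lty=2, col="blue")
plot(x, resid(log_mod), pch=16, ylab="Residual", main="log(y): roughly linear")
abline(h=0, lwd=2, lty=2, col="blue")
par(mfrow=c(1,1))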