KITADA

Lesson #17

Inference using Simple Linear Regression: the SLR model and assessing the conditions of the model

Motivation:

Simple linear regression is an analysis procedure used when there is a quantitative response variable and a single explanatory variable. To begin with, the explanatory variable will also be quantitative. Often, we want to make an inference (i.e., draw a “conclusion”) about a larger population of interest based on sample data, using a simple linear regression analysis. For those conclusions to be valid for the population of interest, certain conditions must hold. We have addressed some of these conditions in previous lessons (“linearity” and “outliers”). In this lesson, we will address these and other conditions.

What you need to know from this lesson:

After completing this lesson, you should be able to

To accomplish the above “What You Need to Know”, do the following:

The Lesson

A. The Simple Linear Regression model

When modeling data, we model each observation in the population (notation: \( y_i \)). That is, we try to come up with a general way to describe each observation.

For simple linear regression, it makes sense to use the least-squares regression equation to describe the relationship between the explanatory and response variables (as long as the relationship is linear!). But not all observations fall on the regression line, so how can we use the equation to model each observation? We know that the difference between an observed value and its predicted value is called the residual. Therefore, an observed point can be described by its predicted value (from the least-squares regression equation) and its residual. That is, an observed value of the response variable is just the sum of its predicted value and its residual – this is the basic idea of the Simple Linear Regression model:

**Simple Linear Regression Model:**

Observed = Predicted + Residual

1. Writing the simple linear regression model in notation. Keep in mind that the model is about what’s happening in the population of interest. Therefore, Greek letters will be used for the parameters.

a. What is the notation for an observed value of the response variable?

\( y_i \)

b. How do we determine a predicted value of the response variable? Write this in notation.

\( \hat{y}_i=\beta_0+\beta_1 x_i \)

c. What is the notation for a residual?

\( \epsilon_i=y_i-\hat{y}_i \)

d. Putting it all together, write the simple linear regression model in notation.

\( y_i=\beta_0+\beta_1 x_i+\epsilon_i \)
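To see the model in action, here is a minimal simulation sketch (not part of the lesson data; the parameter values are made up, loosely echoing the Old Faithful example below). Each observed y is generated as its predicted value plus a normal residual:

## A hypothetical simulation: generate observations from the SLR model
## y_i = beta0 + beta1*x_i + eps_i, with made-up values beta0 = 34, beta1 = 10.
set.seed(1)                          # reproducible results
x   <- runif(50, 1.5, 5)             # 50 hypothetical x-values
eps <- rnorm(50, mean = 0, sd = 5)   # residuals: normal, mean 0, constant sd
y   <- 34 + 10*x + eps               # observed = predicted + residual
plot(x, y, pch = 16, main = "Simulated SLR data")
abline(a = 34, b = 10, lwd = 2)      # the population line y = 34 + 10x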

2. Because we typically don’t have data on an entire population, we usually don’t know the values of the parameters in the regression equation.

However, to understand how this model works, let’s use the Old Faithful example from Lessons 12 and 14.

Suppose we know the equation of the least-squares regression line for the population of all eruptions of Old Faithful is

\( \hat{y}=33.97+10.36x \)

Use this equation to model an observation with an observed duration of 2 minutes and an observed interval between eruptions of 51 minutes:

\( 51=33.97+10.36\times 2+\epsilon_i \)

Since \( 33.97+10.36\times 2=54.69 \), the residual must be \( \epsilon_i=51-54.69=-3.69 \) in this example.
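As a quick check, here is the same arithmetic in R:

## Verify the residual computed above.
y_obs  <- 51                 # observed interval (minutes)
y_pred <- 33.97 + 10.36*2    # predicted interval for a 2-minute eruption: 54.69
y_obs - y_pred               # residual: -3.69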

B. The conditions that must exist to use the results of a Simple Linear Regression analysis to make valid conclusions about a population of interest

The conditions are:

1. The sample is representative of the population of interest.

2. The relationship between the explanatory and response variables is linear, with no influential outliers.

3. The observations are independent of each other.

4. The variation of the y-values is the same for every value of the explanatory variable (“constant variation”).

5. For each value of the explanatory variable, the y-values are normally distributed around the regression line (“normality”).

Side note: many textbooks will write some of the “conditions of the simple linear regression model” in terms of the residuals. In particular, those conditions are that the residuals are normally distributed with a mean of 0 and a constant standard deviation (notation: \( \sigma \)), and that the residuals are independent of each other. These reduce to the last three conditions above. Therefore, we won’t think about the conditions in terms of residuals, although we may have to check some of the conditions based on the residuals.

C. Assessing the conditions

Here’s how to assess each of the conditions:

1. Representative sample: read the explanation of the problem for clues about the sampling scheme. Also check the sample size.

2. Linearity (and outliers): look at scatterplots and residual plots.

3. Independence: look for clues in the explanation of the problem.

4. Constant variation: look at the residual plot.

5. Normality: look at a QQ plot (normal probability plot) of the residuals.
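As a side note, R’s built-in plot() method for fitted lm objects produces several of these diagnostic plots at once. A minimal sketch, using the Old Faithful model of_mod that is fit in part D below:

## Four standard diagnostics for a fitted lm object: residuals vs. fitted
## values, normal QQ plot, scale-location, and residuals vs. leverage.
par(mfrow = c(2, 2))   # arrange the four plots in a 2-by-2 grid
plot(of_mod)
par(mfrow = c(1, 1))   # restore the default single-plot layout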

Examples of plots to assess the constant variation condition:

DRAW PLOT IN CLASS

DRAW PLOT IN CLASS
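Since the plots above are drawn in class, here is a simulated sketch (hypothetical data, not from the lesson) of what residual plots look like when the constant variation condition holds and when it fails:

## Hypothetical residuals: constant spread (left) vs. spread that
## increases with x, producing a "fan" shape (right).
set.seed(2)
x_sim     <- runif(100, 0, 10)
res_const <- rnorm(100, mean = 0, sd = 2)          # same sd at every x
res_fan   <- rnorm(100, mean = 0, sd = 0.5*x_sim)  # sd grows with x
par(mfrow = c(1, 2))
plot(x_sim, res_const, pch = 16, xlab = "x", ylab = "Residual",
     main = "Constant variation")
abline(h = 0, lty = 2, col = "blue")
plot(x_sim, res_fan, pch = 16, xlab = "x", ylab = "Residual",
     main = "Increasing variation")
abline(h = 0, lty = 2, col = "blue")
par(mfrow = c(1, 1))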

Comment on “normality” condition:

The normality condition says that the y-values are normally distributed around the least-squares regression line FOR EACH X-VALUE. If the observations are independent AND the variation of the y-values is the same for all values of the explanatory variable, all residuals can be put together to assess this condition with a normal probability plot of the residuals.

D. Examples

Based on the following plots, determine whether

a. the linearity condition is satisfied,

b. there are any outliers,

c. the constant variation condition is satisfied,

d. the normality condition is satisfied, and

e. any transformation is necessary.

If any of the above is not satisfied, describe what you would do to alleviate the problem.

1. Old Faithful: regression of interval between eruptions on duration of previous eruption.

### OLD FAITHFUL ... AGAIN ###
of_mod <- with(OLDFAITHFUL, lm(interval~duration))

## SCATTER PLOT ##
with(OLDFAITHFUL, 
     plot(duration, interval, pch=16, main = "Old Faithful",
          ylab="interval between eruptions (minutes)", 
          xlab="duration of current eruption (minutes)")
)
mtext("Scatterplot of interval between eruptions versus
      duration of current eruption", cex = 0.8)
abline(coefficients(of_mod), 
       lwd=3, col="red", lty=2) 

[Figure: scatterplot of interval between eruptions vs. duration, with the least-squares line]

## RESIDUAL PLOT ##
plot(OLDFAITHFUL$duration, resid(of_mod), pch=16, 
     xlab="Duration of Eruption",
     ylab="Residual",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

[Figure: residual plot for the Old Faithful regression]

## QQ PLOT ##
qqnorm(resid(of_mod))
qqline(resid(of_mod))

[Figure: normal QQ plot of the Old Faithful residuals]

a. linearity condition: OK

b. any outliers? NO

c. constant variation condition: Looks pretty good

d. normality condition: Very close to normal

e. any transformation necessary?: Doesn't need transformation

2. Animal Gestation: regression of average gestation (in days) for a sample of animals on longevity (average life expectancy, in years) for the animals.

### A NEW EXAMPLE: GESTATION ###
gest_mod <- with(GESTATION, lm(gestation~longevity))

## SCATTER PLOT ##
with(GESTATION, 
     plot(longevity, gestation, pch=16, main = "Animal Gestation",
          ylab="Average Gestation (days)", 
          xlab="Average Life Expectancy (years)")
)
abline(coefficients(gest_mod), 
       lwd=3, col="red", lty=2) 

[Figure: scatterplot of average gestation vs. average life expectancy, with the least-squares line]

## RESIDUAL PLOT ##
plot(GESTATION$longevity, resid(gest_mod), pch=16, 
     xlab="Average Life Expectancy (years)",
     ylab="Residual",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

[Figure: residual plot for the gestation regression]

## QQ PLOT ##
qqnorm(resid(gest_mod))
qqline(resid(gest_mod))

[Figure: normal QQ plot of the gestation residuals]

a. linearity condition: Yes

b. any outliers? Yes, but not influential

c. constant variation condition: No, the variation increases

d. normality condition: The upper tail deviates from normality

e. any transformation necessary?: No

3. Baseball example: regression of number of wins on team batting average (Major League Baseball, 2011).

### BASEBALL ... AGAIN ###
base_mod <- with(BASEBALL, lm(WINS ~ AVG))

## SCATTER PLOT ##
plot(BASEBALL$AVG, BASEBALL$WINS, pch=16, 
     xlab="Team Batting Average",
     ylab="Team Wins",
     main="Team Wins vs. Batting Average")
abline(coefficients(base_mod), 
       lwd=3, col="red", lty=2) 

[Figure: scatterplot of team wins vs. team batting average, with the least-squares line]

## RESIDUAL PLOT
plot(BASEBALL$AVG, resid(base_mod), pch=16, 
     xlab="Team Batting Average",
     ylab="Residual",
     main="Residual Plot")
abline(h=0, lwd=2, lty=2, 
       col="blue")

[Figure: residual plot for the baseball regression]

## QQ PLOT
qqnorm(resid(base_mod))
qqline(resid(base_mod))

[Figure: normal QQ plot of the baseball residuals]

a. linearity condition: Yes

b. any outliers? No

c. constant variation condition: Yes

d. normality condition: Very close!

e. any transformation necessary?: No

Summary

What can we do if a condition is not satisfied?

If the relationship is not linear: we can try to transform the data so that the transformed values have a linear relationship (see the sketch below). Note: transformations should only be done if it makes scientific sense to do so.

If the variation is not constant: we can proceed, but we should be very cautious about making inferences. In a more advanced statistics course, we could use generalized regression models.

If there are outliers: we can assess whether or not an outlier was due to an error, and we can perform the analysis both with and without the outliers.

If the sample is not representative: we can only make a very narrow inference. If the observations are not independent, we would need more complex statistical tools.
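As an illustration of the transformation remedy, here is a hedged sketch applied to the gestation data from Example 2, where the variation increased with longevity. A log transform of the response is one common choice, assuming it makes scientific sense for these data:

## Refit Example 2 with a log-transformed response, then re-check the
## residual plot to see whether the fan shape has been reduced.
gest_log <- with(GESTATION, lm(log(gestation) ~ longevity))
plot(GESTATION$longevity, resid(gest_log), pch = 16,
     xlab = "Average Life Expectancy (years)", ylab = "Residual",
     main = "Residual Plot (log gestation)")
abline(h = 0, lwd = 2, lty = 2, col = "blue")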