KITADA

Lesson #18

Inference using Simple Linear Regression

Motivation:

Assessing conditions is an important step in a simple linear regression analysis. If the conditions aren’t met, conclusions about a population based on the sample data may not be meaningful. Once we are satisfied that the conditions are met, we can proceed with the analysis, which involves inference methods (hypothesis tests and confidence intervals). In this lesson, we will learn a new hypothesis test for determining whether the explanatory variable is helpful in predicting the response variable. In addition, we’ll learn several types of confidence intervals that are used in different situations.

What you need to know from this lesson:

After completing this lesson, you should be able to

To accomplish the above “What You Need to Know”, do the following:

The Lesson

Example: Old Faithful… AGAIN

Data were collected on the duration of a current eruption (DURATION) and the interval between that eruption and the next eruption (INTERVAL) on the Old Faithful geyser in Yellowstone National Park to determine if park rangers can accurately predict when the next eruption of Old Faithful will occur. In other words, does the duration of a current eruption of Old Faithful help to explain the interval between that eruption and the next eruption?

A complete simple linear regression analysis will be performed to answer that question (and some others).

A. What is the response variable and what is the explanatory variable? Are they quantitative or categorical?

Response: Interval (quantitative)

Explanatory: Duration (quantitative)

B. Performing the analysis:

Refer to the tan-colored sheet, Steps for performing a simple linear regression analysis.

Step 1: Assessing the conditions

Here are the scatterplot, residual plot, and normal probability plot of the residuals. Use the appropriate graph(s) to assess the conditions.

### OLD FAITHFUL ... AGAIN ###
of_mod <- with(OLDFAITHFUL, lm(interval~duration))

## SCATTER PLOT ##
with(OLDFAITHFUL, 
     plot(duration, interval, pch=16, main = "Old Faithful",
          ylab="interval between eruptions (minutes)", 
          xlab="duration of current eruption (minutes)")
)
mtext("Scatterplot of interval between eruptions versus
      duration of current eruption", cex = 0.8)
abline(coefficients(of_mod), 
       lwd=3, col="red", lty=2) 

[Figure: scatterplot of interval between eruptions versus duration of current eruption, with the fitted regression line]

## RESIDUAL PLOT ##
plot(OLDFAITHFUL$duration, resid(of_mod), pch=16, 
     xlab="Duration of Eruption",
     ylab="Residual",
     main="Residual Plot")

abline(h=0, lwd=2, lty=2, 
       col="blue")

[Figure: residual plot, residuals versus duration of eruption]

## QQ PLOT ##
qqnorm(resid(of_mod))
qqline(resid(of_mod))

[Figure: normal probability (Q-Q) plot of the residuals]

1. Is the sample representative of the population?

Comment on whether or not the sample taken is representative of the population of all eruptions of Old Faithful.

We do not know how the rangers decided to collect data.

2. Linearity condition:

What plots are used to assess the linearity condition? Based on these plots, is the relationship between INTERVAL and DURATION linear?

The scatterplot and the residual plot. The relationship appears approximately linear.

3. Outlier condition (or other deviations):

What plots are used to assess the outlier condition? Based on these plots, are there any potentially influential outliers and/or other deviations from the pattern?

The scatterplot and the residual plot. There don't appear to be any influential outliers.

4. Independent observations:

Do you feel that the interval between eruptions for one eruption of Old Faithful is being influenced by the interval between eruptions for any other eruptions of Old Faithful?

This could be true. The data were collected sequentially through time, so there may be some temporal correlation between consecutive eruptions.
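One rough way to probe the independence condition is to look at the residuals in the order in which the observations were recorded. The sketch below uses simulated data (the numbers and the name sim_mod are made up for illustration and are not the Old Faithful measurements), and it assumes the rows are stored in time order.

```r
# Simulated stand-in data; NOT the Old Faithful measurements
set.seed(18)
n <- 100
duration <- runif(n, 1.5, 5)
interval <- 34 + 10 * duration + rnorm(n, sd = 6)

sim_mod <- lm(interval ~ duration)
res <- resid(sim_mod)

# Correlation between each residual and the one before it (lag 1).
# Values near 0 are consistent with independent observations.
lag1_cor <- cor(res[-1], res[-n])
lag1_cor

# Residuals in collection order: look for runs or drifting patterns
plot(seq_len(n), res, type = "b", pch = 16,
     xlab = "Order of observation", ylab = "Residual")
abline(h = 0, lty = 2)
```

A lag-1 correlation far from zero, or long runs of residuals on one side of the dashed line, would suggest that consecutive observations are related.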

5. Constant variation condition:

What plots can be used to assess the constant variation condition? Based on these plots, do you feel that the spread of INTERVAL is the same for all DURATIONS?

The residual plot. It looks like the variation is constant throughout the range of x.

6. Normality condition:

What plot can be used to assess the normality condition? Based on this plot, do you feel that the residuals are normally distributed (and, therefore, the values of INTERVAL are normally distributed for each value of DURATION)?

The normal probability (Q-Q) plot. There appear to be deviations from normality in the tails.

Step 2: If a transformation is needed, do so here and repeat parts 2, 5, and 6 of Step 1 with the transformed variable(s).

Is a transformation necessary?

No, it doesn't look like a transformation is necessary.

Step 3: Once a model is chosen that best satisfies the conditions, obtain the output from the simple linear regression analysis.

Below is the R output from the simple linear regression of INTERVAL on DURATION. Use it to answer the questions that follow.

### OLD FAITHFUL ... AGAIN ###
of_mod <- with(OLDFAITHFUL, lm(interval~duration))

summary(of_mod)
## 
## Call:
## lm(formula = interval ~ duration)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1130  -4.4802  -0.4712   4.0370  16.8153 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9668     1.4279   23.79   <2e-16 ***
## duration     10.3582     0.3822   27.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.159 on 220 degrees of freedom
## Multiple R-squared:  0.7695, Adjusted R-squared:  0.7685 
## F-statistic: 734.6 on 1 and 220 DF,  p-value: < 2.2e-16

1. Write the equation of the least-squares regression line in the context of the problem, explaining the meaning of the terms in the equation.

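Predicted INTERVAL = 33.9668 + 10.3582*DURATION, where INTERVAL is the time (in minutes) between the current eruption and the next one, and DURATION is the length (in minutes) of the current eruption.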

2. Interpret the slope and y-intercept in the context of the problem.

3. Can park rangers accurately predict the next eruption of Old Faithful? That is, can the duration of the last eruption of Old Faithful help explain the interval between eruptions?

a. What are the null and alternative hypotheses?

\( H_0: \beta_1 = 0 \), (the slope of the population regression line is equal to zero)

\( H_A: \beta_1 \neq 0 \), (the slope of the population regression line is not equal to zero)

b. Why are the hypotheses in part a what we use to answer the general question: “Does the explanatory variable help to predict (or explain) the response variable?”

Because if the slope is equal to zero, there isn't a linear relationship between the explanatory and response variables.

c. Calculate and interpret the test-statistic.

The test statistic is the estimated slope divided by its standard error.

Thus, we have

10.3582/0.3822
## [1] 27.10152

d. How many degrees of freedom does the test-statistic have?

## n-2

222 - 2
## [1] 220

e. Determine the p-value.

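2 * pt(27.10152, 220, lower.tail=FALSE)

The test is two-sided and uses the 220 degrees of freedom from part d. The resulting p-value is far smaller than 2e-16, which agrees with the Pr(>|t|) column of the summary output.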

f. Use the p-value to answer the question of interest.

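Because the p-value is less than 0.0001, we reject the null hypothesis. There is convincing evidence that the slope of the population regression line is not equal to zero, so the duration of the current eruption does help to predict the interval until the next eruption.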

4. Construct and interpret a 95% confidence interval for \( \beta_1 \) , the slope of the population regression line.

point_est<- 10.3582
point_est
## [1] 10.3582
critical_val<-qt(0.975, 220)
critical_val
## [1] 1.970806
std_err<-0.3822
std_err
## [1] 0.3822
point_est+c(-1,1)*critical_val*std_err
## [1]  9.604958 11.111442
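R's built-in confint() function produces the same kind of interval directly from a fitted model. The sketch below uses simulated data (not the Old Faithful measurements; the name sim_mod and the simulated values are assumptions for illustration) to check that the by-hand formula, estimate plus or minus critical value times standard error, agrees with confint().

```r
# Simulated stand-in data; NOT the Old Faithful measurements
set.seed(18)
duration <- runif(50, 1.5, 5)
interval <- 34 + 10 * duration + rnorm(50, sd = 6)
sim_mod <- lm(interval ~ duration)

# Pull the slope estimate and its standard error from the coefficient table
coefs <- summary(sim_mod)$coefficients
point_est <- coefs["duration", "Estimate"]
std_err <- coefs["duration", "Std. Error"]

# The critical value uses the residual degrees of freedom, n - 2
critical_val <- qt(0.975, df.residual(sim_mod))
by_hand <- point_est + c(-1, 1) * critical_val * std_err

# Built-in equivalent
built_in <- confint(sim_mod, "duration", level = 0.95)

by_hand
built_in
```

With the real data, confint(of_mod, "duration") should give the interval for the Old Faithful slope without any rounding of the estimate or standard error.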

5. What percent of the variation in the interval between eruptions is explained by the regression model (i.e. by duration of current eruption)? Can you think of other factors that may help explain the rest of the variation?

## Multiple R-squared:  0.7695
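As a consistency check, in simple linear regression R-squared can be recovered from the F-statistic and its residual degrees of freedom, both of which are printed in the summary output. A minimal sketch using those printed (rounded) values; the variable names are chosen for illustration:

```r
f_stat <- 734.6   # F-statistic from the summary output
df_resid <- 220   # residual degrees of freedom, n - 2

# With one explanatory variable, F = (R^2 / 1) / ((1 - R^2) / df_resid),
# which rearranges to R^2 = F / (F + df_resid)
r_squared <- f_stat / (f_stat + df_resid)

# Percent of the variation in interval explained by duration
round(100 * r_squared, 2)
```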

6. One of the conditions is that the y-values (interval between eruptions) have the same spread for all x-values (duration of current eruption).

Another way to think of this condition is that the residuals have the same standard deviation (\( \sigma \) ) for all the x-values. We don’t know what the standard deviation of the residuals is in the population of all eruptions of Old Faithful, but we can estimate it based on the sample data.

What is the estimate of \( \sigma \)?

## Residual standard error: 6.159
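The residual standard error is the square root of the sum of squared residuals divided by n - 2. The sketch below checks this on simulated data (not the Old Faithful measurements; the name sim_mod and the simulated values are assumptions for illustration).

```r
# Simulated stand-in data; NOT the Old Faithful measurements
set.seed(18)
duration <- runif(50, 1.5, 5)
interval <- 34 + 10 * duration + rnorm(50, sd = 6)
sim_mod <- lm(interval ~ duration)

# Sum of squared residuals divided by n - 2, then square-rooted
n <- length(interval)
s_by_hand <- sqrt(sum(resid(sim_mod)^2) / (n - 2))

# The same number that summary() prints as "Residual standard error"
s_from_summary <- summary(sim_mod)$sigma

c(by_hand = s_by_hand, from_summary = s_from_summary)
```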

7. Now that we know that there is evidence to say that the duration of a current eruption can help to predict the interval between eruptions (i.e. how long we have to wait until the next eruption), let’s help the park ranger out. Suppose we just watched an eruption of Old Faithful that lasted 3.5 minutes. What is the predicted time we’d have to wait until the next eruption of Old Faithful?

33.9668+10.3582*3.5
## [1] 70.2205

8. The value in #7 is just a predicted value and may not be the exact time we’d have to wait until the next eruption.

It would be better if the park ranger provided a range of possible “wait times” until the next eruption (with a certain level of confidence). These possible values comprise a prediction interval, which is used when we want to predict the response variable for one future observation at a given value of the explanatory variable. You will not need to know how to construct a prediction interval by hand, but you do need to know when to use a prediction interval, how to obtain it in R, and how to interpret it.

Below is R output for a duration of 3.5 minutes. Based on this output, write and interpret the 95% prediction interval.

new.data<-data.frame(duration=3.5)
predict.lm(of_mod, new.data, interval="prediction")
##        fit      lwr      upr
## 1 70.22048 58.05577 82.38519

9. A visitor to Old Faithful asked a park ranger, “For all eruptions of Old Faithful that have lasted 3.5 minutes, what has been the average wait time until the next eruption?” Notice that this question is about all past eruptions of Old Faithful and not about one future eruption like in #8. Let’s help the park ranger answer this question.

a. What is the value of the best estimate of the “wait time” (i.e. interval between eruptions) for all eruptions of Old Faithful that have lasted 3.5 minutes? Where is this value in the output above?

## 70.22048

b. The value in part a is just an estimate based on the sample data. Let’s provide the park ranger with a range of possible values for the average “wait time”. These “possible values” comprise a confidence interval for the mean of the response variable. Again, you will not need to know how to construct this particular confidence interval by hand, but you do need to know when to use it, how to obtain it in R, and how to interpret it. Using the output below, write and interpret the 95% confidence interval for the mean “wait time” for all eruptions that have lasted 3.5 minutes.

new.data<-data.frame(duration=3.5)
predict.lm(of_mod, new.data, interval="confidence")
##        fit      lwr      upr
## 1 70.22048 69.40386 71.03709
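The prediction interval in #8 is much wider than this confidence interval, even though both are centered at the same fitted value. A rough sketch of why, using only rounded numbers printed in this lesson: a prediction for one eruption has to allow for eruption-to-eruption variation (estimated by the residual standard error) in addition to the uncertainty in the estimated mean. The variable names are chosen for illustration, and the result only approximates the predict output because the inputs are rounded.

```r
t_crit <- qt(0.975, 220)   # critical value on n - 2 = 220 degrees of freedom
s <- 6.159                 # residual standard error from the summary output

# Half-width of the confidence interval for the mean response at 3.5 minutes
ci_half <- (71.03709 - 69.40386) / 2

# Standard error of the fitted mean, backed out of the confidence interval
se_fit <- ci_half / t_crit

# The prediction interval half-width combines both sources of uncertainty
pi_half <- t_crit * sqrt(s^2 + se_fit^2)

c(ci_half = ci_half, pi_half = pi_half)
70.22048 + c(-1, 1) * pi_half   # close to the prediction interval in #8
```

Because the residual standard error does not shrink as more data are collected, the prediction interval stays wide even when the mean is estimated very precisely.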