Estimation of Model Parameters

Sum of squared error \(SSE=\sum{(y-\hat{y})^2}\)

\(b_1 = \frac{\sum xy- n\bar{x}\bar{y}}{\sum x^2 - n(\bar{x}^2)}\)

\(b_0 = \bar{y}-b_1\bar{x}\)

Example 14.1 The number of disk I/O’s and processor times of seven programs were measured as {(14, 2), (16, 5), (27, 7), (42, 9), (39, 10), (50, 13), (83, 20)}.

# Measured data: disk I/O's (x) and CPU times (y)
x <- c(14,16,27,42,39,50,83)
y <- c(2,5,7,9,10,13,20)

# Sample means and sample size
xb <- mean(x)
yb <- mean(y)
n <- length(x)

# Regression coefficients and sum of squared errors
b1 <- (sum(x*y)-n*xb*yb)/(sum(x^2)-n*xb^2)
b0 <- yb - b1*xb
SSE <- sum((y-b0-b1*x)^2)

# Plot the data with the fitted line
plot(x,y)
abline(b0,b1)

So the desired linear model is \(CPU\ time=-0.00828+0.24376(number\ of\ disk\ I/O's)\), and the SSE is 5.8689.
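As a quick sanity check (not part of the original example), R's built-in least-squares fit should reproduce the same coefficients and SSE; the object name fit below is just our choice:

# Cross-check with R's built-in least-squares fit
fit <- lm(y ~ x)
coef(fit)          # should match b0 and b1
sum(resid(fit)^2)  # should match SSE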

Allocation of Variation

Total Variability is the total sum of squares (SST)
\(SST=\sum(y-\bar{y})^2\)

Sum of squares explained by regression (SSR)
\(SSR=SST-SSE\)

coefficient of determination, \(R^2\)
\(R^2= \frac{SSR}{SST}=\frac{SST-SSE}{SST}\)

\(R^2\) is a convenient measure because it always lies between 0 and 1: a value near 0 means the regression explains almost none of the variation, and a value near 1 means it explains nearly all of it.

Example 14.2 For the disk I/O-CPU time data of Example 14.1 the coefficient of determination can be computed as follows:

# Total sum of squares and coefficient of determination
SST <- sum((y-yb)^2)
R2 <- (SST-SSE)/SST

Thus the \(R^2\) for the model is 0.9715, which means the regression explains 97.15% of the variation in CPU time.
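Assuming the lm fit from the earlier cross-check, the built-in summary should report the same value:

# Cross-check R^2 against the built-in summary
summary(fit)$r.squared   # should be about 0.9715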

Standard Deviation of Errors

\(s_e=\sqrt{\frac{SSE}{n-2}}\)

Note that this is divided by \(n-2\) because the errors were computed after estimating 2 regression parameters, which costs 2 degrees of freedom.

\(s_e^2\) is called the mean squared error (MSE)
\(MSE = \frac{SSE}{n-2}\)
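In R, the residual standard error reported for a fitted model is exactly this \(s_e\); a minimal sketch, again assuming the lm fit from the earlier cross-check:

# The residual standard error equals se = sqrt(SSE/(n-2))
summary(fit)$sigma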

Confidence Intervals for Regression Parameters

First we have to find the standard deviations of \(b_0\) and \(b_1\) using the following formulas:
\(s_{b_0}=s_e\left[ \frac{1}{n} + \frac{\bar{x}^2}{\Sigma{x^2}-n\bar{x}^2}\right]^{1/2}\)
\(s_{b_1}=\frac{s_e}{\left[\Sigma x^2 -n\bar{x}^2 \right]^{1/2}}\)

So the confidence intervals are \(b_0\pm ts_{b_0}\) and \(b_1\pm ts_{b_1}\), where \(t\) is the appropriate quantile of the t distribution with \(n-2\) degrees of freedom.

Example 14.4 For the disk I/O and CPU data of Example 14.1, calculate the 90% confidence intervals for the regression parameters.

Since we have 7 measurements and estimate 2 parameters, we have 5 degrees of freedom, so we use the 0.95 quantile of the t distribution with 5 degrees of freedom.

# First we need se
se <- sqrt(SSE/(n-2))

# Now calculate the standard deviations of b0 and b1
sb0 <- se * sqrt(1/n + xb^2/(sum(x^2)-n*xb^2))
sb1 <- se / sqrt(sum(x^2)-n*xb^2)

# Get the 0.95 quantile of the t distribution with 5 degrees of freedom
t <- qt(0.95,5)

# Calculate the upper and lower bounds for b0 and b1
lowerb0 <- b0-t*sb0
upperb0 <- b0+t*sb0
lowerb1 <- b1-t*sb1
upperb1 <- b1+t*sb1

So this means that the 90% confidence interval for \(b_0\) is (-1.683, 1.6664) and for \(b_1\) is (0.2061, 0.2814). Since the interval for \(b_0\) includes zero, the intercept is not significantly different from zero at this confidence level.
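These intervals can be cross-checked against R's confint, assuming the lm fit from the earlier sketch:

# 90% confidence intervals for the intercept and slope
confint(fit, level = 0.90)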

Confidence Intervals for Predictions

To create a confidence interval for a predicted value, we first need the standard deviation of the prediction. The formula is as follows:
\(s_{\hat{y}_{mp}} = s_e\left[\frac{1}{m}+\frac{1}{n}+\frac{(x_p-\bar{x})^2}{\Sigma x^2-n\bar{x}^2}\right]^{1/2}\) where \(m\) is the number of future observations at \(x_p\) whose mean we are predicting.

From this we have 2 special cases, \(m=1\) and \(m=\infty\). For the first, the \(\frac{1}{m}\) term will simply be 1 and for the other it will go to 0.
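Writing out the two limiting forms explicitly:

\(m=\infty:\quad s_{\hat{y}_p} = s_e\left[\frac{1}{n}+\frac{(x_p-\bar{x})^2}{\Sigma x^2-n\bar{x}^2}\right]^{1/2}\)

\(m=1:\quad s_{\hat{y}_{1p}} = s_e\left[1+\frac{1}{n}+\frac{(x_p-\bar{x})^2}{\Sigma x^2-n\bar{x}^2}\right]^{1/2}\)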

Example 14.5 Using the disk I/O and CPU time data of Example 14.1, let us estimate the CPU time for a program with 100 disk I/O’s.

Assuming a large number of observations:

We can just plug directly into the equations since we already have all the variables

# Get the standard deviation for the mean of a large number of future observations
xp <- 100 
syp <- se*sqrt(1/n + (xp-xb)^2/(sum(x^2)-n*xb^2))

# Make our prediction
yhp <- b0 + b1*xp

# We use the same t value as before and compute our bounds

lower <- yhp - t*syp
upper <- yhp + t*syp

So the 90% confidence interval for the mean of a large number of future observations is (21.9172, 26.8175).

If we want a confidence interval for a single future observation, we modify the formula slightly:

# Get the standard deviation for a single future observation
syp <- se*sqrt(1+1/n + (xp-xb)^2/(sum(x^2)-n*xb^2))

# Make our prediction
yhp <- b0 + b1*xp

# We use the same t value as before and compute our bounds

lower <- yhp - t*syp
upper <- yhp + t*syp

So the 90% confidence interval for a single future observation is (21.0857, 27.649).
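Both intervals can also be obtained with R's predict, assuming the lm fit from the earlier sketch; interval = "confidence" corresponds to the mean of a large number of future observations, and interval = "prediction" to a single future observation. The name newdat is just our choice:

# 90% intervals at x = 100 from the fitted model
newdat <- data.frame(x = 100)
predict(fit, newdat, interval = "confidence", level = 0.90)
predict(fit, newdat, interval = "prediction", level = 0.90)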

Visual Tests for Verifying the Regression Assumptions

In deriving the expressions for the regression parameters, we made the following assumptions:

  1. The true relationship between the response variable y and the predictor variable x is linear.
  2. The predictor variable x is nonstochastic and it is measured without any error.
  3. The model errors are statistically independent.
  4. The errors are normally distributed with zero mean
  5. The error terms have a constant variance.

To test assumption 3, plot the residual errors \(e_i = y_i - \hat{y}_i\) against the predicted values \(\hat{y}_i\), with the errors on the y axis and \(\hat{y}\) on the x axis; any visible trend in the residuals suggests the errors are not independent. A minimal sketch of this plot, using the values computed in Example 14.1, follows.
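# Residuals versus predicted values; no visible trend suggests independent errors
yhat <- b0 + b1*x
e <- y - yhat
plot(yhat, e, xlab = "Predicted y", ylab = "Residual")
abline(h = 0)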