Estimation of Model Parameters

Sum of squared error \(SSE=\sum{(y-\hat{y})^2}\)

\(b_1 = \frac{\sum xy- n\bar{x}\bar{y}}{\sum x^2 - n(\bar{x}^2)}\)

\(b_0 = \bar{y}-b_1\bar{x}\)

Example 14.1 The number of disk I/O’s and processor times of seven programs were measured as {(14, 2), (16, 5), (27, 7), (42, 9), (39, 10), (50, 13), (83, 20)}.

# Measured data: disk I/O's (x) and CPU times (y)
x <- c(14,16,27,42,39,50,83)
y <- c(2,5,7,9,10,13,20)

# Sample means and sample size
xb <- mean(x)
yb <- mean(y)
n <- length(x)

# Regression coefficients and sum of squared errors
b1 <- (sum(x*y)-n*xb*yb)/(sum(x^2)-n*xb^2)
b0 <- yb - b1*xb
SSE <- sum((y-b0-b1*x)^2)

# Plot the data with the fitted line
plot(x,y)
abline(b0,b1)

So the desired linear model is \(CPU\ time=-0.00828+0.24376(number\ of\ disk\ I/O's)\), and the SSE is 5.8689.
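As a quick sanity check (not part of the original example), R's built-in least-squares fit should reproduce the same coefficients and SSE; the object name fit below is just our choice:

# Cross-check with R's built-in least-squares fit
fit <- lm(y ~ x)
coef(fit)          # should match b0 and b1
sum(resid(fit)^2)  # should match SSE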

Allocation of Variation

Total Variability is the total sum of squares (SST)
\(SST=\sum(y-\bar{y})^2\)

Sum of squares explained by regression (SSR)
\(SSR=SST-SSE\)

coefficient of determination, \(R^2\)
\(R^2= \frac{SSR}{SST}=\frac{SST-SSE}{SST}\)

\(R^2\) is a convenient measure because it always lies between 0 and 1: a value near 0 means the regression explains almost none of the variation, and a value near 1 means it explains nearly all of it.

Example 14.2 For the disk I/O-CPU time data of Example 14.1 the coefficient of determination can be computed as follows:

# Total sum of squares and coefficient of determination
SST <- sum((y-yb)^2)
R2 <- (SST-SSE)/SST

Thus the \(R^2\) for the model is 0.9715, which means the regression explains 97.15% of the variation in CPU time.
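Assuming the lm fit from the earlier cross-check, the built-in summary should report the same value:

# Cross-check R^2 against the built-in summary
summary(fit)$r.squared   # should be about 0.9715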

Standard Deviation of Errors

\(s_e=\sqrt{\frac{SSE}{n-2}}\)

Note that this is divided by \(n-2\) because the errors were computed after estimating 2 regression parameters, which costs 2 degrees of freedom.

\(s_e^2\) is called the mean squared error (MSE)
\(MSE = \frac{SSE}{n-2}\)
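In R, the residual standard error reported for a fitted model is exactly this \(s_e\); a minimal sketch, again assuming the lm fit from the earlier cross-check:

# The residual standard error equals se = sqrt(SSE/(n-2))
summary(fit)$sigma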

Confidence Intervals for Regression Parameters

First we have to find the standard deviations of \(b_0\) and \(b_1\) using the following formulas:
\(s_{b_0}=s_e\left[ \frac{1}{n} + \frac{\bar{x}^2}{\Sigma{x^2}-n\bar{x}^2}\right]^{1/2}\)
\(s_{b_1}=\frac{s_e}{\left[\Sigma x^2 -n\bar{x}^2 \right]^{1/2}}\)

So the confidence intervals are \(b_0\pm ts_{b_0}\) and \(b_1\pm ts_{b_1}\), where \(t\) is the appropriate quantile of the t distribution with \(n-2\) degrees of freedom.

Example 14.4 For the disk I/O and CPU data of Example 14.1, calculate the 90% confidence intervals for the regression parameters.

Since we have 7 measurements and estimate 2 parameters, we have 5 degrees of freedom, so we use the 0.95 quantile of the t distribution with 5 degrees of freedom.

# First we need se
se <- sqrt(SSE/(n-2))

# Now calculate the standard deviations of b0 and b1
sb0 <- se * sqrt(1/n + xb^2/(sum(x^2)-n*xb^2))
sb1 <- se / sqrt(sum(x^2)-n*xb^2)

# Get the 0.95 quantile of the t distribution with 5 degrees of freedom
t <- qt(0.95,5)

# Calculate the upper and lower bounds for b0 and b1
lowerb0 <- b0-t*sb0
upperb0 <- b0+t*sb0
lowerb1 <- b1-t*sb1
upperb1 <- b1+t*sb1

So this means that the 90% confidence interval for \(b_0\) is (-1.683, 1.6664) and for \(b_1\) is (0.2061, 0.2814). Since the interval for \(b_0\) includes zero, the intercept is not significantly different from zero at this confidence level.
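These intervals can be cross-checked against R's confint, assuming the lm fit from the earlier sketch:

# 90% confidence intervals for the intercept and slope
confint(fit, level = 0.90)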

Confidence Intervals for Predictions

To create a confidence interval for a predicted value, we first need the standard deviation of the prediction. The formula is as follows:
\(s_{\hat{y}_{mp}} = s_e\left[\frac{1}{m}+\frac{1}{n}+\frac{(x_p-\bar{x})^2}{\Sigma x^2-n\bar{x}^2}\right]^{1/2}\) where \(m\) is the number of future observations at \(x_p\) whose mean we are predicting.

From this we have 2 special cases, \(m=1\) and \(m=\infty\). For the first, the \(\frac{1}{m}\) term will simply be 1 and for the other it will go to 0.
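Writing out the two limiting forms explicitly:

\(m=\infty:\quad s_{\hat{y}_p} = s_e\left[\frac{1}{n}+\frac{(x_p-\bar{x})^2}{\Sigma x^2-n\bar{x}^2}\right]^{1/2}\)

\(m=1:\quad s_{\hat{y}_{1p}} = s_e\left[1+\frac{1}{n}+\frac{(x_p-\bar{x})^2}{\Sigma x^2-n\bar{x}^2}\right]^{1/2}\)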

Example 14.5 Using the disk I/O and CPU time data of Example 14.1, let us estimate the CPU time for a program with 100 disk I/O’s.

Assuming a large number of observations:

We can just plug directly into the equations since we already have all the variables

# Get the standard deviation for the mean of a large number of future observations
xp <- 100 
syp <- se*sqrt(1/n + (xp-xb)^2/(sum(x^2)-n*xb^2))

# Make our prediction
yhp <- b0 + b1*xp

# We use the same t value as before and compute our bounds

lower <- yhp - t*syp
upper <- yhp + t*syp

So the 90% confidence interval for the mean of a large number of future observations is (21.9172, 26.8175).

If we want a confidence interval for a single future observation, we modify the formula slightly:

# Get the standard deviation for a single future observation
syp <- se*sqrt(1+1/n + (xp-xb)^2/(sum(x^2)-n*xb^2))

# Make our prediction
yhp <- b0 + b1*xp

# We use the same t value as before and compute our bounds

lower <- yhp - t*syp
upper <- yhp + t*syp

So the 90% confidence interval for a single future observation is (21.0857, 27.649).
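Both intervals can also be obtained with R's predict, assuming the lm fit from the earlier sketch; interval = "confidence" corresponds to the mean of a large number of future observations, and interval = "prediction" to a single future observation. The name newdat is just our choice:

# 90% intervals at x = 100 from the fitted model
newdat <- data.frame(x = 100)
predict(fit, newdat, interval = "confidence", level = 0.90)
predict(fit, newdat, interval = "prediction", level = 0.90)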

Visual Tests for Verifying the Regression Assumptions

In deriving the expressions for the regression parameters, we made the following assumptions:

  1. The true relationship between the response variable y and the predictor variable x is linear.
  2. The predictor variable x is nonstochastic and it is measured without any error.
  3. The model errors are statistically independent.
  4. The errors are normally distributed with zero mean
  5. The error terms have a constant variance.

To test assumption 3, plot the residual errors \(e_i = y_i - \hat{y}_i\) against the predicted values \(\hat{y}_i\), with the errors on the y axis and \(\hat{y}\) on the x axis; any visible trend in the residuals suggests the errors are not independent. A minimal sketch of this plot, using the values computed in Example 14.1, follows.
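# Residuals versus predicted values; no visible trend suggests independent errors
yhat <- b0 + b1*x
e <- y - yhat
plot(yhat, e, xlab = "Predicted y", ylab = "Residual")
abline(h = 0)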