env <- new.env()
load("../data.rda", envir = env)
df <- get("CH01TA01", envir = env)
names(df) <- c("x", "y")
fit <- lm(y ~ x, df)
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.88 -34.09 -5.98 38.83 103.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.366 26.177 2.38 0.026 *
## x 3.570 0.347 10.29 0.00000000044 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.8 on 23 degrees of freedom
## Multiple R-squared: 0.822, Adjusted R-squared: 0.814
## F-statistic: 106 on 1 and 23 DF, p-value: 0.000000000445
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 252378 252378 106 0.00000000044 ***
## Residuals 23 54825 2384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tab <- c(
'n' = nrow(fit$model), # 25
'xbar' = mean(fit$model[[2]]), # 70
'b0' = coef(fit)[[1]], # 62.366
'b1' = coef(fit)[[2]], # 3.570
'SSE' = anova(fit)["Residuals", 2], # 54,825.460
'MSE' = anova(fit)["Residuals", 3], # 2,383.716
'SSx' = sum(scale(df$x, T, F)^2), # 19,800
'SSTO' = sum(anova(fit)[,2])) # 307,203
round(tab, 2)
## n xbar b0 b1 SSE MSE SSx SSTO
## 25.00 70.00 62.37 3.57 54825.46 2383.72 19800.00 307203.04
ciPredicts values based on a linear model object
ci(model, Xh = 0, alpha = 0.1, m = 1,
interval = c("confidence", "prediction", "band"))
model – Object of class “lm”. Xh – An optional vector of prediction values. Default results in evaluating the interval about the intercept coefficient. alpha – Significance level for prediction interval. m – Number of new observations on response for a given Xh. interval – Type of interval calculation to use.
This function mimics the predict.lm function in the stats library. The aim is to follow the calculations provided for the examples in Chapter 2. Since the various equations (2.23, 2.25, 2.30, 2.33, 2.36, 2.38a, and 2.39) share the same general form, this function could easily generalize to each of the provided examples. In particular, the estimated standard deviation equation follows the general form in this function where the constant c defaults to 0 under confidence intervals, 1 under predictions. The constant m defaults to 1 unless otherwise specified to be looking for numerous mean responses. Additionally, an interval of “band” is provided. This option uses the Working-Hotelling 1-alpha confidence measure from equations (2.40) and (2.40a). Everything else is the same as a confidence interval.
The following examples will make use of this pedagogical function alongside the use of the R function that provides the same results. It is left to the reader to understand the R function arguments provided.
ci <- function(model, Xh = 0, alpha = 0.1, m = 1, interval = c("confidence", "prediction", "band"))
{
int <- match.arg(interval)
c <- ifelse(int == "prediction", 1, 0)
x <- model$model[[2]]
n <- length(x)
df <- summary(model)$df[2]
SSx <- sum( scale(x, T, F)^2 )
MSE <- anova(model)["Residuals", 3]
dev <- Xh - mean(x)
s <- sqrt( MSE * (c/m + 1/n + dev^2 / SSx) )
if (int == "band") {
val <- sqrt( 2 * qf(1 - alpha, 2, df) )
} else
val <- qt(1 - alpha/2, df)
pred <- coef(fit)[[1]] + coef(fit)[[2]] * Xh
lwr <- pred - val * s
upr <- pred + val * s
return(cbind(pred, lwr, upr))
}
ci(fit)
## pred lwr upr
## [1,] 62.37 17.5 107.2
confint(fit, parm = 1, level = 0.9)
## 5 % 95 %
## (Intercept) 17.5 107.2
newdata = c(65, 100)
ci(fit, Xh = newdata)
## pred lwr upr
## [1,] 294.4 277.4 311.4
## [2,] 419.4 394.9 443.8
predict(fit, data.frame(x = newdata), interval = "confidence", level = 0.9)
## fit lwr upr
## 1 294.4 277.4 311.4
## 2 419.4 394.9 443.8
newdata = 100
ci(fit, Xh = newdata, interval = "prediction")
## pred lwr upr
## [1,] 419.4 332.2 506.6
predict(fit, data.frame(x = newdata), interval = "prediction", level = 0.9)
## fit lwr upr
## 1 419.4 332.2 506.6
ci(fit, newdata, interval = "prediction", m = 3)
## pred lwr upr
## [1,] 419.4 365.2 473.5
predict(fit, data.frame(x = newdata), interval = "prediction",
level = 0.9, weights = 3)
## fit lwr upr
## 1 419.4 365.2 473.5
Since this user-defined function ci behaves similar to predict, it should come as no surprise that this approach to regression bands can be performed using predict.
The book here desires a Working-Hotelling confidence interval. It results in a slightly less confident (i.e., wider) band about the regression line. The reason for this has to do with the family confidence interval that is discussed in Chapter 4 of the text. The Chapter 4 walk-through will include the manual calculations using predict that are required to calculate the W statistic. Until that time, the basic approach will be used here.
Note that one could replace the below curve statements with something like
curve(predict(fit, data.frame(x = x)), add = TRUE)
ci(fit, 100, interval = "band") # Example case when Xh = 100
## pred lwr upr
## [1,] 419.4 387.2 451.6
plot(y ~ x, df, xlab = "Lot Size X", ylab = "Hours Y")
curve(ci(fit, x, interval = "band")[, "pred"], add = TRUE)
curve(ci(fit, x, interval = "band")[, "lwr"], add = TRUE, col = "red")
curve(ci(fit, x, interval = "band")[, "upr"], add = TRUE, col = "red")
curve(predict(fit, data.frame(x = x), int = "c")[, 2], add = TRUE, col = "blue")
curve(predict(fit, data.frame(x = x), int = "c")[, 3], add = TRUE, col = "blue")
MSR <- anova(fit)["x", "Mean Sq"] # 252,377.600
MSE <- anova(fit)["Residuals", "Mean Sq"] # 2,383.716
Fstat <- MSR / MSE # 105.876
fval <- qf(0.95, 1, 23) # 4.279
round(c('MSR' = MSR, 'MSE' = MSE, 'F' = Fstat, 'fval' = fval), 4)
## MSR MSE F fval
## 252377.581 2383.716 105.876 4.279
if (abs(Fstat) <= fval) {
print("Conclude H0") # Fail to reject H0
} else
print("Conclude Ha") # Reject H0
## [1] "Conclude Ha"
tab <- c(
'SSR' = anova(fit)["x", "Mean Sq"], # 252,377
'SSTO' = sum( anova(fit)["Sum Sq"] ), # 307,203
'R2' = summary(fit)$r.squared, # 0.8215 (= SSR / SSTO)
'r' = with(df, cor(x, y))) # 0.9064 (= sqrt(R2) )
round(tab, 4)
## SSR SSTO R2 r
## 252377.5808 307203.0400 0.8215 0.9064
df <- get("CH02TA04", envir = env)
y1 <- df[, 1]
y2 <- df[, 2]
cbind(
'Population' = y1,
'Expenditure' = y2,
'R1' = rank(y1),
'R2' = rank(y2))
## Population Expenditure R1 R2
## [1,] 29 127 1 2
## [2,] 435 214 8 11
## [3,] 86 133 3 4
## [4,] 1090 208 11 10
## [5,] 219 153 7 6
## [6,] 503 184 9 8
## [7,] 47 130 2 3
## [8,] 3524 217 12 12
## [9,] 185 141 6 5
## [10,] 98 154 5 7
## [11,] 952 194 10 9
## [12,] 89 103 4 1
# Calculate Spearman Correlation Information
c('rho' = sum(scale(rank(y1), T, F) * scale(rank(y2), T, F)) /
sqrt(sum(scale(rank(y1), T, F)^2) * sum(scale(rank(y2), T, F)^2)))
## rho
## 0.8951
cor.test(y1, y2, method = "spearman")$estimate # Spearman Rho = 0.895
## rho
## 0.8951
c('rho' = cor(rank(y1), rank(y2))) # Alternative R method
## rho
## 0.8951
# t* calculation
c('tstar' = (cor(rank(y1), rank(y2)) * sqrt(length(y1) - 2)) /
sqrt(1 - cor(rank(y1), rank(y2))^2))
## tstar
## 6.349