Chapter 2 – Inferences in Regression and Correlation Analysis


Load the data sets

env <- new.env()
load("../data.rda", envir = env)

Input the Toluca Company Data

df        <- get("CH01TA01", envir = env)
names(df) <- c("x", "y")
fit       <- lm(y ~ x, df)

TABLE 2.1 and FIGURE 2.2 (p 46)

Results for Toluca Company Example Obtained in Chapter 1 (TABLE 2.1); Regression Output for Toluca Company Example (FIGURE 2.2)

summary(fit)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83.88 -34.09  -5.98  38.83 103.53 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)   62.366     26.177    2.38         0.026 *  
## x              3.570      0.347   10.29 0.00000000044 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 48.8 on 23 degrees of freedom
## Multiple R-squared: 0.822,   Adjusted R-squared: 0.814 
## F-statistic:  106 on 1 and 23 DF,  p-value: 0.000000000445
anova(fit)   
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value        Pr(>F)    
## x          1 252378  252378     106 0.00000000044 ***
## Residuals 23  54825    2384                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

tab <- c(
  'n'    = nrow(fit$model),              # 25
  'xbar' = mean(fit$model[[2]]),         # 70
  'b0'   = coef(fit)[[1]],               # 62.366
  'b1'   = coef(fit)[[2]],               # 3.570
  'SSE'  = anova(fit)["Residuals", 2],   # 54,825.460
  'MSE'  = anova(fit)["Residuals", 3],   # 2,383.716
  'SSx'  = sum(scale(df$x, T, F)^2),     # 19,800
  'SSTO' = sum(anova(fit)[,2]))          # 307,203

round(tab, 2) 
##         n      xbar        b0        b1       SSE       MSE       SSx      SSTO 
##     25.00     70.00     62.37      3.57  54825.46   2383.72  19800.00 307203.04

FUNCTION DEFINITION ci

Description

Computes point estimates and confidence, prediction, or confidence-band intervals from a linear model object

Usage

ci(model, Xh = 0, alpha = 0.1, m = 1,
   interval = c("confidence", "prediction", "band"))

Arguments

model – Object of class “lm”.
Xh – An optional vector of predictor values at which the interval is evaluated. The default (0) results in evaluating the interval about the intercept coefficient.
alpha – Significance level; a 100(1 - alpha)% interval is produced.
m – Number of new observations on the response for a given Xh.
interval – Type of interval calculation to use.

Notes

This function mimics the predict.lm function in the stats package. The aim is to follow the calculations provided for the examples in Chapter 2. Since the various equations (2.23, 2.25, 2.30, 2.33, 2.36, 2.38a, and 2.39) share the same general form, this one function generalizes to each of the provided examples. In particular, the estimated standard deviation follows the general form used in this function, with the constant c equal to 0 for confidence intervals and 1 for predictions. The constant m defaults to 1 unless the mean of several new observations is of interest. Additionally, an interval type of “band” is provided; this option uses the Working-Hotelling 1 - alpha confidence coefficient from equations (2.40) and (2.40a), and everything else is the same as a confidence interval.
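
Concretely, the shared form of the estimated standard deviation, as coded in the ci function below, is

$$ s = \sqrt{MSE \left( \frac{c}{m} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right)} $$

where c = 0 for confidence intervals and bands, c = 1 for predictions, and m is the number of new observations whose mean response is of interest.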

The following examples make use of this pedagogical function alongside the built-in R function that provides the same results. It is left to the reader to work through the arguments passed to the R function.

ci <- function(model, Xh = 0, alpha = 0.1, m = 1,
               interval = c("confidence", "prediction", "band"))
{
  int   <- match.arg(interval)
  c     <- ifelse(int == "prediction", 1, 0)       # extra variance component for new observations
  x     <- model$model[[2]]                        # predictor values from the fitted model
  n     <- length(x)
  df    <- summary(model)$df[2]                    # residual degrees of freedom (n - 2)
  SSx   <- sum( scale(x, T, F)^2 )                 # sum of squared deviations of x
  MSE   <- anova(model)["Residuals", 3]
  dev   <- Xh - mean(x)
  s     <- sqrt( MSE * (c/m + 1/n + dev^2 / SSx) ) # estimated standard deviation

  if (int == "band") {
    val <- sqrt( 2 * qf(1 - alpha, 2, df) )        # Working-Hotelling W multiplier
  } else {
    val <- qt(1 - alpha/2, df)                     # t multiplier
  }

  pred  <- coef(model)[[1]] + coef(model)[[2]] * Xh  # point estimate(s) at Xh
  lwr   <- pred - val * s
  upr   <- pred + val * s

  return(cbind(pred, lwr, upr))
}

EXAMPLE (p 49)

Confidence Interval for beta0

ci(fit)
##       pred  lwr   upr
## [1,] 62.37 17.5 107.2
confint(fit, parm = 1, level = 0.9)
##              5 %  95 %
## (Intercept) 17.5 107.2
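
Because the default Xh = 0 evaluates the fitted line at zero, the standard error used above reduces to s{b0} = sqrt(MSE (1/n + xbar^2/SSx)). A quick manual check, sketched here using the quantities stored in tab earlier:

b0   <- coef(fit)[[1]]
s_b0 <- sqrt(tab[["MSE"]] * (1/tab[["n"]] + tab[["xbar"]]^2 / tab[["SSx"]]))   # 26.177
cbind(pred = b0, lwr = b0 - qt(0.95, 23) * s_b0, upr = b0 + qt(0.95, 23) * s_b0)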

EXAMPLE (pp 54-5)

Confidence Interval for E{Yh}

newdata = c(65, 100)
ci(fit, Xh = newdata)
##       pred   lwr   upr
## [1,] 294.4 277.4 311.4
## [2,] 419.4 394.9 443.8
predict(fit, data.frame(x = newdata), interval = "confidence", level = 0.9)
##     fit   lwr   upr
## 1 294.4 277.4 311.4
## 2 419.4 394.9 443.8

EXAMPLE (p 59)

Prediction Interval for Yh(new) when Parameters Unknown

newdata = 100
ci(fit, Xh = newdata, interval = "prediction")
##       pred   lwr   upr
## [1,] 419.4 332.2 506.6
predict(fit, data.frame(x = newdata), interval = "prediction", level = 0.9)
##     fit   lwr   upr
## 1 419.4 332.2 506.6
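
The same limits can be rebuilt from the standard errors that predict reports, since the prediction variance is the residual variance (MSE) plus the variance of the fitted value. A minimal sketch using se.fit:

p      <- predict(fit, data.frame(x = newdata), se.fit = TRUE)
s_pred <- sqrt(p$residual.scale^2 + p$se.fit^2)    # sqrt(MSE + s^2{Yhat})
cbind(pred = p$fit, lwr = p$fit - qt(0.95, p$df) * s_pred,
      upr  = p$fit + qt(0.95, p$df) * s_pred)      # agrees with the limits above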

EXAMPLE (p 61)

Prediction of Mean of m New Observations Given Xh

ci(fit, newdata, interval = "prediction", m = 3)
##       pred   lwr   upr
## [1,] 419.4 365.2 473.5
predict(fit, data.frame(x = newdata), interval = "prediction", 
        level = 0.9, weights = 3)
##     fit   lwr   upr
## 1 419.4 365.2 473.5
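
The weights argument works here because predict.lm defines the per-observation prediction variance as pred.var = res.var / weights. An equivalent call that makes the MSE/m term explicit (a sketch of the same computation):

predict(fit, data.frame(x = newdata), interval = "prediction",
        level = 0.9, pred.var = anova(fit)["Residuals", 3] / 3)   # pred.var = MSE / m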

FIGURE 2.6 (p 62)

Confidence Bands for Regression Line–Toluca Company Example

Since this user-defined function ci behaves similarly to predict, it should come as no surprise that these confidence bands for the regression line can also be produced using predict.

The book here calls for a Working-Hotelling confidence band. It is slightly wider than the pointwise confidence limits about the regression line because it carries a family confidence coefficient, a concept discussed in Chapter 4 of the text. The Chapter 4 walk-through will include the manual calculations using predict that are required to calculate the W statistic; a brief preview appears after the band example below, but otherwise the basic approach will be used here.

Note that one could replace the ci-based curve statements below with predict-based equivalents, for example

curve(predict(fit, data.frame(x = x)), add = TRUE)

ci(fit, 100, interval = "band")  # Example case when Xh = 100
##       pred   lwr   upr
## [1,] 419.4 387.2 451.6
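
As a brief preview of the Chapter 4 calculation, the band limits just shown can be reproduced by pairing the se.fit values from predict with the Working-Hotelling multiplier W of equations (2.40) and (2.40a):

p <- predict(fit, data.frame(x = 100), se.fit = TRUE)
W <- sqrt(2 * qf(0.90, 2, p$df))                   # Working-Hotelling multiplier
cbind(pred = p$fit, lwr = p$fit - W * p$se.fit,
      upr  = p$fit + W * p$se.fit)                 # matches the band limits above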

plot(y ~ x, df, xlab = "Lot Size X", ylab = "Hours Y")

curve(ci(fit, x, interval = "band")[, "pred"], add = TRUE)
curve(ci(fit, x, interval = "band")[,  "lwr"], add = TRUE, col = "red")
curve(ci(fit, x, interval = "band")[,  "upr"], add = TRUE, col = "red")
curve(predict(fit, data.frame(x = x), int = "c")[, 2], add = TRUE, col = "blue")
curve(predict(fit, data.frame(x = x), int = "c")[, 3], add = TRUE, col = "blue")

(Figure 2.6: Toluca data with the fitted regression line, the Working-Hotelling confidence band in red, and pointwise confidence limits in blue)

EXAMPLE (p 71)

F-test of beta1

MSR   <- anova(fit)["x",         "Mean Sq"]   # 252,377.600
MSE   <- anova(fit)["Residuals", "Mean Sq"]   #   2,383.716
Fstat <- MSR / MSE                            #     105.876
fval  <- qf(0.95, 1, 23)                      #       4.279
round(c('MSR' = MSR, 'MSE' = MSE, 'F' = Fstat, 'fval' = fval), 4)
##        MSR        MSE          F       fval 
## 252377.581   2383.716    105.876      4.279

if (abs(Fstat) <= fval) {
  print("Conclude H0")                        # Fail to reject H0
} else
  print("Conclude Ha")                        # Reject H0
## [1] "Conclude Ha"

EXAMPLE (pp 75-6)

Coefficient of Determination and Coefficient of Correlation

tab <- c(
  'SSR'  = anova(fit)["x", "Sum Sq"],          # 252,377
  'SSTO' = sum( anova(fit)["Sum Sq"] ),        # 307,203
  'R2'   = summary(fit)$r.squared,             # 0.8215 (= SSR / SSTO)
  'r'    = with(df, cor(x, y)))                # 0.9064 (= sqrt(R2) )

round(tab, 4)
##         SSR        SSTO          R2           r 
## 252377.5808 307203.0400      0.8215      0.9064
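
Because b1 is positive in this example, r is the positive square root of R2; in general r takes the sign of b1, as this quick check illustrates:

c('sqrt.R2' = sqrt(summary(fit)$r.squared),
  'signed'  = sign(coef(fit)[[2]]) * sqrt(summary(fit)$r.squared))   # both 0.9064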

Input the Sales Marketing Data

df <- get("CH02TA04", envir = env)
y1 <- df[, 1]
y2 <- df[, 2]

TABLE 2.4 (pp 87-9)

Data on Population, Expenditures and Their Ranks–Sales Marketing Example

cbind(
  'Population'  = y1,
  'Expenditure' = y2,
  'R1'          = rank(y1),
  'R2'          = rank(y2))
##       Population Expenditure R1 R2
##  [1,]         29         127  1  2
##  [2,]        435         214  8 11
##  [3,]         86         133  3  4
##  [4,]       1090         208 11 10
##  [5,]        219         153  7  6
##  [6,]        503         184  9  8
##  [7,]         47         130  2  3
##  [8,]       3524         217 12 12
##  [9,]        185         141  6  5
## [10,]         98         154  5  7
## [11,]        952         194 10  9
## [12,]         89         103  4  1

# Calculate Spearman Correlation Information
c('rho' =  sum(scale(rank(y1), T, F) * scale(rank(y2), T, F)) /
           sqrt(sum(scale(rank(y1), T, F)^2) * sum(scale(rank(y2), T, F)^2)))
##    rho 
## 0.8951

cor.test(y1, y2, method = "spearman")$estimate    # Spearman Rho = 0.895
##    rho 
## 0.8951

c('rho' = cor(rank(y1), rank(y2)))                # Alternative R method
##    rho 
## 0.8951

# t* calculation
c('tstar' = (cor(rank(y1), rank(y2)) * sqrt(length(y1) - 2)) / 
             sqrt(1 - cor(rank(y1), rank(y2))^2))
## tstar 
## 6.349
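
To complete the test of association, t* is compared with the two-sided critical value t(1 - alpha/2; n - 2). With alpha = 0.05 and n - 2 = 10 degrees of freedom that critical value is about 2.228, so H0 (no association) is rejected:

qt(0.975, length(y1) - 2)   # critical value; t* = 6.349 exceeds it, so conclude Ha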