Module 4: Statistical Linear Regression Models

Under the model, the response has constant variance around the regression line: \[ \mathbb{V}\left[ Y_i \mid X_i = x_i \right] = \sigma^2. \]

Interpreting regression coefficients

The intercept

  • \(\beta_0\) is the expected value of the response when the predictor is \(0\). \[ \mathbb{E}\left[ Y_i \mid X_i = 0 \right] = \beta_0 + \beta_{1} \times 0 = \beta_0 \]

  • Note that this is not always of interest, for example when \(X = 0\) is impossible or far outside the range of the data (e.g., when \(X\) is blood pressure or height).

  • Consider that \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \beta_1 \alpha + \beta_1 (X_i - \alpha) + \epsilon_i = \bar{\beta}_0 + \beta_1 (X_i - \alpha) + \epsilon_i, \] so shifting the \(X_i\) values by \(\alpha\) changes the intercept but not the slope.

  • Often \(\alpha\) is set to \(\bar{X}\) so that the intercept is interpreted as the expected response at the average \(X\) value.

The slope

  • \(\beta_1\) is the expected change in response for a 1 unit change in the predictor

\[ \mathbb{E}\left[ Y_i \mid X_i = x + 1 \right] - \mathbb{E}\left[ Y_i \mid X_i = x \right] = \left( \beta_0 + \beta_1 (x + 1) \right) - \left( \beta_0 + \beta_1 x \right) = \beta_1 \]

  • Consider the impact of changing the units of \(X\): \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \frac{\beta_1}{\alpha}\left(X_i\alpha\right) + \epsilon_i = \beta_0 + \bar{\beta}_{1}\left(X_i\alpha\right) + \epsilon_i , \] so multiplying \(X\) by a factor \(\alpha\) divides the coefficient by a factor of \(\alpha\).

  • Example: \(X\) is height in meters and \(Y\) is weight in kilograms. Then \(\beta_1\) has units of kilograms/meter. Converting \(X\) to centimeters means multiplying \(X\) by \(100\) centimeters/meter, so \(\beta_1\) must be divided by \(100\) centimeters/meter to keep the units consistent: \[ X \left[ \text{m} \right] \times 100 \left[\frac{\text{cm}}{\text{m}} \right] = 100 X \left[\text{cm}\right], \quad \beta_1 \left[ \frac{\text{kg}}{\text{m}} \right] \times \frac{1}{100} \left[ \frac{\text{m}}{\text{cm}} \right] = \frac{\beta_1}{100} \left[ \frac{\text{kg}}{\text{cm}} \right] \] Both effects are illustrated in the short simulated sketch below.
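
A minimal sketch with simulated data (hypothetical, not the diamond data used below) illustrating both points: shifting the predictor changes only the intercept, while rescaling it rescales only the slope.

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)
coef(lm(y ~ x))                 # original intercept and slope
coef(lm(y ~ I(x - mean(x))))    # shifted x: intercept changes, slope unchanged
coef(lm(y ~ I(x * 100)))        # rescaled x: slope divided by 100, intercept unchanged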

Example

The data are diamond prices (Singapore dollars) and diamond weights in carats (a standard measure of diamond mass; 1 carat = 0.2 g).

library(UsingR)
data(diamond)
library(ggplot2)

g <- ggplot(diamond, aes(x = carat, y = price))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Price (SIN $)')
g <- g + geom_point(size = 6, 
                    color = 'black',
                    alpha = 0.2)
g <- g + geom_point(size = 5, 
                    color = 'blue',
                    alpha = 0.2)
g <- g + geom_smooth(method = 'lm',
                     color = 'black')
g

Fitting the linear regression model

fit <- lm(price ~ carat, data = diamond)
coef(fit)
## (Intercept)       carat 
##   -259.6259   3721.0249
summary(fit)
## 
## Call:
## lm(formula = price ~ carat, data = diamond)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.159 -21.448  -0.869  18.972  79.370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -259.63      17.32  -14.99   <2e-16 ***
## carat        3721.02      81.79   45.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared:  0.9783, Adjusted R-squared:  0.9778 
## F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16

The interpretation of the intercept and the slope in this context is the following:

  • We estimate an expected 3721.02 SIN $ increase in price for every one-carat increase in the mass of a diamond.
  • The intercept, -259.63 SIN $, is the expected price of a 0 carat diamond.

A more interpretable intercept is obtained by centering the predictor:

fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
##            (Intercept) I(carat - mean(carat)) 
##               500.0833              3721.0249
  • Thus, 500.1 SIN $ is the expected price of a diamond of average mass for these data (0.2042 carats).

Changing the units of the predictor to 0.1 carats, so that the slope is the expected price change per tenth of a carat:

fit3 <- lm(price ~ I(carat*10), data = diamond)
coef(fit3)
##   (Intercept) I(carat * 10) 
##     -259.6259      372.1025

Predicting the price of a diamond

newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
## [1]  335.7381  745.0508 1005.5225
predict(fit, newdata = data.frame(carat = newx))
##         1         2         3 
##  335.7381  745.0508 1005.5225

Module 5: Residuals

Residuals represent variation left unexplained by the assumed model. There is a difference between residuals and errors:

  • The errors are the unobservable true deviations from the line with the known (true) coefficients.
  • The residuals are the observable deviations from the line with the estimated coefficients.

In a sense, the residuals are estimates of the errors.
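
A small simulation (hypothetical, not part of the diamond example) illustrating this: when data are generated from a known linear model, the residuals from the fitted line closely track the true, normally unobservable, errors.

set.seed(1)
n <- 100
x <- runif(n)
eps <- rnorm(n, sd = 0.3)      # the true errors, unobservable in practice
y <- 1 + 2 * x + eps           # data generated from a known linear model
e <- resid(lm(y ~ x))          # the observable residuals
cor(e, eps)                    # close to 1: the residuals estimate the errors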

Motivating example

library(UsingR)
data(diamond)
library(ggplot2)

g <- ggplot(diamond, aes(x = carat, y = price))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Price (SIN $)')
g <- g + geom_point(size = 6, 
                    color = 'black',
                    alpha = 0.2)
g <- g + geom_point(size = 5, 
                    color = 'blue',
                    alpha = 0.2)
g <- g + geom_smooth(method = 'lm',
                     color = 'black')
g

The variation around the regression line is the residual variation, i.e. the variation left unexplained after accounting for mass.

  • The model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), where \(\epsilon_i \sim \text{Normal}\left(\epsilon \mid 0, \sigma^2 \right)\)

  • Observed outcome \(i\) is \(Y_i\) at predictor value \(X_i\).

  • Predicted outcome \(i\) is \(\widehat{Y}_i\) at predictor value \(X_i\), \[ \widehat{Y}_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} X_i. \]

  • The residual is the difference between the observed outcome \(Y_i\) and the predicted outcome \(\widehat{Y}_i\): \[ e_i = Y_i - \widehat{Y}_i \]

  • Least squares minimizes \[ \sum_{i=1}^{n} e_{i}^{2} \]

  • The residuals \(e_i\) can be thought of as estimates of the errors \(\epsilon_i\)

Properties of the residuals

  • \(\mathbb{E} \left[ e_i \right] = 0\).
  • If an intercept is included, \(\sum_{i=1}^{n} e_i = 0.\)
  • If a regressor variable \(X_i\) is included in the model, \(\sum_{i=1}^{n} e_i X_i = 0.\)
  • Residuals are useful for investigating poor model fit.
  • Positive residuals are above the line, negative residuals are below the line.
  • Residuals can be thought of as the outcome \(Y\) with the linear association of the predictor \(X\) removed.
  • One distinguishes residual variation (variation left after removing the predictor) from systematic variation (variation explained by the regression model).
data(diamond)
x <- diamond$carat
y <- diamond$price
n <- length(x)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e - (y - yhat)))
## [1] 6.51923e-13
max(abs(e - (y - (coef(fit)[1] + coef(fit)[2] * x))))
## [1] 5.950795e-13
sum(e)
## [1] 4.618528e-14
sum(e*x)
## [1] 7.549517e-15
plot(x, y, 
     xlab = 'Mass (carats)', 
     ylab = 'Price (SIN $)',
     bg = 'lightblue',
     col = 'black',
     cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
for (i in 1:n){
    lines(c(x[i], x[i]), 
          c(y[i], yhat[i]), 
          col = 'red', lwd = 2)
}

plot(x, e, 
     xlab = 'Mass (carats)', 
     ylab = 'Residuals (SIN $)',
     bg = 'lightblue',
     col = 'black',
     cex = 1.1, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1:n){
    lines(c(x[i], x[i]), 
          c(e[i], 0), 
          col = 'red', lwd = 2)
}

Non-linear example

x <- runif(100,-3,3)
y <- x + sin(x) + rnorm(100, sd = .2)
library(ggplot2)
g <- ggplot(data.frame(x = x, y = y), 
            aes(x = x, y = y))
g <- g + geom_smooth(method = 'lm', color = 'black')
g <- g + geom_point(size = 7, 
                    color = 'black', 
                    alpha = 0.4)
g <- g + geom_point(size = 5, 
                    color = 'red', 
                    alpha = 0.4)
g

library(ggplot2)
g <- ggplot(data.frame(x = x, y = resid(lm(y~x))), 
            aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 7, 
                    color = 'black', 
                    alpha = 0.4)
g <- g + geom_point(size = 5, 
                    color = 'red', 
                    alpha = 0.4)
g <- g + xlab('X') + ylab('Residual')
g

Heteroskedasticity example

x <- runif(100,0,6)
y <- x + rnorm(100, sd = .001 * x)
library(ggplot2)
g <- ggplot(data.frame(x = x, y = y), 
            aes(x = x, y = y))
g <- g + geom_smooth(method = 'lm', color = 'black')
g <- g + geom_point(size = 7, 
                    color = 'black', 
                    alpha = 0.4)
g <- g + geom_point(size = 5, 
                    color = 'red', 
                    alpha = 0.4)
g

library(ggplot2)
g <- ggplot(data.frame(x = x, y = resid(lm(y~x))), 
            aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 7, 
                    color = 'black', 
                    alpha = 0.4)
g <- g + geom_point(size = 5, 
                    color = 'red', 
                    alpha = 0.4)
g <- g + xlab('X') + ylab('Residual')
g

Residuals for diamond data

diamond$e <- resid(lm(price ~ carat, data = diamond))
g <- ggplot(diamond, aes(x = carat, y = e))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Residual Price (SIN $)')
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 6, 
                    color = 'black',
                    alpha = 0.2)
g <- g + geom_point(size = 5, 
                    color = 'blue',
                    alpha = 0.2)
g

e <- c(resid(lm(price ~ 1, data = diamond)),
       resid(lm(price ~ carat, data = diamond)))
fit <- factor(c(rep('Itc', nrow(diamond)),
                rep('Itc, slope', nrow(diamond))))
g <- ggplot(data.frame(e = e, fit = fit), 
            aes(y = e, x = fit, fill = fit))
g <- g + geom_dotplot(binaxis = 'y', 
                      stackdir = 'center')
g <- g + xlab('Fitting approach')
g <- g + ylab('Residual Price (SIN $)')
g

Estimating residual variation

  • Model \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), where \(\epsilon_i \sim \text{Normal}\left( \epsilon \mid 0, \sigma^2 \right)\).

  • The ML estimate of \(\sigma^2\) is \[ \widehat{\sigma}_{\text{ML}}^2 = \frac{1}{n} \sum_{i=1}^{n} e_i^2, \] the average squared residual.

  • Typically, one uses \[ \widehat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2, \] with \(n-2\) instead of \(n\) accounting for the two degrees of freedom lost due to the two constraints:

    • including the intercept means the residuals have to sum to zero
    • including the covariate means the residuals multiplied by the covariate have to sum to zero

More regression variables lead to more constraints and hence to the loss of more degrees of freedom.
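
A quick check of this bookkeeping with the diamond data (the quadratic term below is added purely for illustration):

fit1 <- lm(price ~ carat, data = diamond)
fit2 <- lm(price ~ carat + I(carat^2), data = diamond)
c(n = nrow(diamond), df1 = fit1$df.residual, df2 = fit2$df.residual)   # n, n - 2, n - 3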

Example

x <- diamond$carat
n <- length(x)
y <- diamond$price
fit <- lm(y~x)
summary(fit)$sigma
## [1] 31.84052
sqrt(sum(resid(fit)**2)/(n-2))
## [1] 31.84052

Summarizing variation

The total variability in the response is \[ \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2. \] The regression variability is the variability explained by adding the predictor \[ \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2. \] The error variability is what is left over around the regression line \[ \sum_{i=1}^{n} \left( Y_i - \widehat{Y}_i \right)^2. \]

Note that \[ \text{total variability} = \text{regression variability} + \text{error variability}, \] \[ \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 = \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2 + \sum_{i=1}^{n} \left( Y_i - \widehat{Y}_i \right)^2 \] because the cross term vanishes: \[ \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right) \left( Y_i - \widehat{Y}_i \right) = \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right) e_i = \sum_{i=1}^{n} \widehat{Y}_i e_i - \bar{Y} \sum_{i=1}^{n} e_i = 0, \] where \(\sum_{i=1}^{n} \widehat{Y}_i e_i = \widehat{\beta}_0 \sum_{i=1}^{n} e_i + \widehat{\beta}_1 \sum_{i=1}^{n} X_i e_i = 0\) by the residual properties listed above.
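
A quick numerical check of this decomposition on the diamond data (a sketch, assuming the data set is loaded as above):

fit <- lm(price ~ carat, data = diamond)
ybar <- mean(diamond$price)
total <- sum((diamond$price - ybar)^2)
regression <- sum((fitted(fit) - ybar)^2)
error <- sum(resid(fit)^2)
c(total = total, regressionPlusError = regression + error)   # the two agree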

\(R\) squared (\(R^2\))

Since the residual variation and the regression variation add up to the total variation, we can define the percentage of the total variation that is explained by the model. This quantity is called \(R^2\) and is defined as the ratio of the regression variation to the total variation (checked numerically after the list below): \[ R^2 = \frac{\sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2}{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2} \]

  • \(R^2\) is the percentage of variation explained by the regression model.
  • \(0 \leq R^2 \leq 1\).
  • \(R^2\) is the sample correlation squared.
  • \(R^2\) can be a misleading summary of model fit:
    • Deleting data can inflate \(R^2\).
    • Adding terms to a regression model always increases \(R^2\).
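
A quick numerical check of the equivalent ways of computing \(R^2\) on the diamond data (a sketch, assuming the data set is loaded as above):

fit <- lm(price ~ carat, data = diamond)
total <- sum((diamond$price - mean(diamond$price))^2)
regression <- sum((fitted(fit) - mean(diamond$price))^2)
c(definition = regression / total,
  corSquared = cor(diamond$carat, diamond$price)^2,
  fromSummary = summary(fit)$r.squared)   # all three agree
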
example(anscombe)
## 
## anscmb> require(stats); require(graphics)
## 
## anscmb> summary(anscombe)
##        x1             x2             x3             x4           y1        
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
##        y2              y3              y4        
##  Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :9.260   Max.   :12.74   Max.   :12.500  
## 
## anscmb> ##-- now some "magic" to do the 4 regressions in a loop:
## anscmb> ff <- y ~ x
## 
## anscmb> mods <- setNames(as.list(1:4), paste0("lm", 1:4))
## 
## anscmb> for(i in 1:4) {
## anscmb+   ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
## anscmb+   ## or   ff[[2]] <- as.name(paste0("y", i))
## anscmb+   ##      ff[[3]] <- as.name(paste0("x", i))
## anscmb+   mods[[i]] <- lmi <- lm(ff, data = anscombe)
## anscmb+   print(anova(lmi))
## anscmb+ }
## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## anscmb> ## See how close they are (numerically!)
## anscmb> sapply(mods, coef)
##                   lm1      lm2       lm3       lm4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1          0.5000909 0.500000 0.4997273 0.4999091
## 
## anscmb> lapply(mods, function(fm) coef(summary(fm)))
## $lm1
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x1          0.5000909  0.1179055 4.241455 0.002169629
## 
## $lm2
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x2          0.500000  0.1179637 4.238590 0.002178816
## 
## $lm3
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x3          0.4997273  0.1178777 4.239372 0.002176305
## 
## $lm4
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x4          0.4999091  0.1178189 4.243028 0.002164602
## 
## 
## anscmb> ## Now, do what you should have done in the first place: PLOTS
## anscmb> op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))
## 
## anscmb> for(i in 1:4) {
## anscmb+   ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
## anscmb+   plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
## anscmb+        xlim = c(3, 19), ylim = c(3, 13))
## anscmb+   abline(mods[[i]], col = "blue")
## anscmb+ }

## 
## anscmb> mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
## 
## anscmb> par(op)

This example shows that the four data sets have:

  • the same mean and variance of \(X\) and \(Y\),
  • identical correlations, hence identical \(R^2\),
  • the same fitted linear regression relationship,

yet their scatterplots look completely different.
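
A short sketch confirming the identical summaries directly from the anscombe data frame used above:

sapply(1:4, function(i) {
    x <- anscombe[[paste0('x', i)]]
    y <- anscombe[[paste0('y', i)]]
    c(meanX = mean(x), varX = var(x), meanY = mean(y), varY = var(y),
      R2 = cor(x, y)^2)
})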

Module 6: Inference in Regression

Recalling the model

The linear model is \[ Y_i = \beta_{0} + \beta_{1} X_i + \epsilon_i, \] where the errors are \[ \epsilon_i \sim \text{Normal} \left( \epsilon \mid 0, \sigma^2 \right). \] We assume that this model is the true model.

The estimates of the two model parameters are \[ \widehat{\beta}_1 = \text{Cor} \left[ X, Y \right] \frac{ \text{sd} \left( Y \right)}{\text{sd} \left( X \right)} , \quad \widehat{\beta}_0 = \bar{Y} - \widehat{\beta}_{1} \bar{X}. \]

Review

Statistics of the form \[ \frac{\widehat{\theta} - \theta}{\widehat{\sigma}_{\widehat{\theta}}} \] often have the following properties:

  1. Is normally distributed, and follows a finite sample \(t\) distribution when the standard error in the denominator is estimated rather than known (under normality assumptions).

  2. Can be used to test the null hypothesis \(H_0: \theta = \theta_0\) versus the alternative hypothesis \(H_{a}: \theta >, <, \neq \theta_0\).

  3. Can be used to create a confidence interval for \(\theta\) via \(\widehat{\theta} \pm Q_{1-\frac{\alpha}{2}}\widehat{\sigma}_{\widehat{\theta}}\), where \(Q_{1-\frac{\alpha}{2}}\) is the relevant quantile from either a normal or a \(t\) distribution.

The variance of the regression slope estimate \(\widehat{\beta}_1\) has a highly informative form: \[ \sigma_{\widehat{\beta}_1}^2 = \mathbb{V}\left[ \widehat{\beta}_1 \right] = \frac{\sigma^2}{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \] It involves two things:

  • \(\sigma^2\), i.e. how variable the points are around the true regression line,
  • how variable the data points \(X_i\) are.

A better estimate of the regression slope \(\beta_1\) corresponds to a smaller slope variance \(\sigma_{\widehat{\beta}_1}^2\).

A smaller slope variance corresponds to a small numerator, i.e. little variation around the regression line (\(\sigma^2\)), and a large denominator, i.e. a large spread of the predictor values \(\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2\), as the simulation below illustrates.
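
A small simulation (hypothetical data, not from the notes) illustrating the second point: spreading out the predictor values shrinks the standard error of the slope estimate.

set.seed(1)
slopeSE <- function(x) {
    y <- 1 + 2 * x + rnorm(length(x), sd = 1)
    summary(lm(y ~ x))$coefficients['x', 'Std. Error']
}
xNarrow <- runif(100, 0, 1)     # predictor confined to a narrow range
xWide   <- runif(100, 0, 10)    # same sample size, ten times the spread
c(narrow = slopeSE(xNarrow), wide = slopeSE(xWide))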

The variance of the regression intercept \(\widehat{\beta}_0\): \[ \sigma_{\widehat{\beta}_0}^2 = \mathbb{V}\left[ \widehat{\beta}_0 \right] = \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right) \sigma^2 \]

  • In practice, \(\sigma\) is replaced by its estimate \(s\).

  • Under the Gaussian errors assumption \[ \frac{\widehat{\beta}_j - \beta_j}{\widehat{\sigma}_{\widehat{\beta}_j}} \sim t_{n-2}, \] which is used to create confidence intervals and perform hypothesis tests.

Example

Coefficients table

data(diamond)
x <- diamond$carat; y <- diamond$price ; n <- length(x)

# estimates of the slope and intercept
beta1 <- cor(x,y) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)

# fitted values and residuals
yh <- beta0 + beta1 * x
epsilon <- y - yh

# residual standard deviation, with n - 2 degrees of freedom
sigma <- sqrt(sum(epsilon**2) / (n-2))

# sum of squares of the centered predictor
sx <- sum((x-mean(x))**2)

# standard errors of the slope and the intercept
sdBeta1 <- sigma / sqrt(sx)
sdBeta0 <- sqrt((1/n + mean(x)**2 / sx)) * sigma

# t statistics for testing H0: beta_j = 0
tBeta1 <- beta1 / sdBeta1
tBeta0 <- beta0 / sdBeta0

# two-sided p-values from the t distribution with n - 2 df
pBeta1 <- 2 * pt(abs(tBeta1), df = n-2, lower.tail = FALSE)
pBeta0 <- 2 * pt(abs(tBeta0), df = n-2, lower.tail = FALSE)

# assemble the coefficient table by hand
coefTable <- rbind(c(beta0, sdBeta0, tBeta0, pBeta0),
                   c(beta1, sdBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c('Estimate', 'Std. Error', 't value', 'P(>|t|)')
rownames(coefTable) <- c('Intercept', 'x')
coefTable
##            Estimate Std. Error   t value      P(>|t|)
## Intercept -259.6259   17.31886 -14.99094 2.523271e-19
## x         3721.0249   81.78588  45.49715 6.751260e-40
fit <- lm(y~x)
summary(fit)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
## x           3721.0249   81.78588  45.49715 6.751260e-40

Confidence interval

sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[1,2]
## [1] -294.4870 -224.7649
sumCoef[2,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[2,2]
## [1] 3556.398 3885.651
(sumCoef[2,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[2,2])/10
## [1] 355.6398 388.5651

Prediction of outcomes

  • Consider predicting \(Y\) at a value of \(X\)

    • Predicting the price of a diamond given the carat
    • Predicting the height of a child given the height of the parents
  • The obvious estimate for prediction at point \(x_0\) is \[ \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0 \]

  • A standard error is needed to create a prediction interval.

  • There’s a distinction between intervals for the regression line and the prediction of what a \(y_0\) would be at a point \(x_0\)

  • The standard error of the regression line at \(x_0\): \[ \widehat{\sigma} \left( \frac{1}{n} + \frac{\left( x_0 - \bar{X} \right)^2} {\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right)^{1/2} \]

  • The standard error for predicting a new outcome \(y_0\) at \(x_0\): \[ \widehat{\sigma} \left( 1 + \frac{1}{n} + \frac{\left( x_0 - \bar{X} \right)^2} {\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right)^{1/2} \] Both are checked numerically in the sketch below.
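
These two standard errors can be computed by hand at a single point and checked against predict() (a short sketch; x0 = 0.2 carats is an arbitrary illustrative value, and the diamond data are assumed to be loaded):

x <- diamond$carat; y <- diamond$price; n <- length(x)
fit <- lm(y ~ x)
x0 <- 0.2
s <- summary(fit)$sigma
ssx <- sum((x - mean(x))^2)
seLine <- s * sqrt(1/n + (x0 - mean(x))^2 / ssx)       # SE of the regression line at x0
sePred <- s * sqrt(1 + 1/n + (x0 - mean(x))^2 / ssx)   # SE for predicting a new y at x0
yhat0 <- coef(fit)[1] + coef(fit)[2] * x0
yhat0 + c(-1, 1) * qt(0.975, df = n - 2) * seLine      # interval for the line
yhat0 + c(-1, 1) * qt(0.975, df = n - 2) * sePred      # interval for a new observation
predict(fit, data.frame(x = x0), interval = 'confidence')
predict(fit, data.frame(x = x0), interval = 'prediction')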

data(diamond)
x <- diamond$carat; y <- diamond$price ; n <- length(x)
fit <- lm(y ~ x)
newx <- data.frame(x = seq(min(x), max(x), length = 100))

# interval = 'confidence': interval for the regression line itself
p1 <- data.frame(predict(fit, 
                         newdata = newx, 
                         interval = 'confidence'))
p1$interval <- 'confidence'
p1$x <- newx$x

# interval = 'prediction': interval for a new observation at each x
p2 <- data.frame(predict(fit, 
                         newdata = newx, 
                         interval = 'prediction'))
p2$interval <- 'prediction'
p2$x <- newx$x

dat <- rbind(p1, p2)
names(dat)[1] <- 'y'
g <- ggplot(dat, aes(x = x, y = y))
g <- g + geom_ribbon(aes(ymin = lwr,
                         ymax = upr,
                         fill = interval), alpha = 0.2)
g <- g + geom_line()
g <- g + geom_point(data = data.frame(x=x, y = y),
                    aes(x = x, y = y), size = 2)
g