Least squares is an estimation tool. How do we do inference?
Consider developing a probabilistic model for linear regression \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
Here the errors \(\epsilon_i\) are assumed i.i.d. \(\text{Normal} \left( \epsilon \mid 0, \sigma^2 \right)\).
Note that \[ \mathbb{E}\left[ Y_i \mid X_i = x_i \right] = \mu_i = \beta_0 + \beta_{1} x_i \]
Note that
\[ \mathbb{V}\left[ Y_i \mid X_i = x_i \right] = \sigma^2. \]
\(\beta_0\) is the expected value of the response when the predictor is \(0\). \[ \mathbb{E}\left[ Y_i \mid X_i = 0 \right] = \beta_0 + \beta_{1} \times 0 = \beta_0 \]
Note that this isn’t always of interest, for example when \(X = 0\) is impossible or far outside the range of the data (e.g. when \(X\) is blood pressure or height).
Consider that \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \beta_1 \alpha + \beta_1 (X_i - \alpha) + \epsilon_i = \bar{\beta}_0 + \beta_1 (X_i - \alpha) + \epsilon_i, \] where \(\bar{\beta}_0 = \beta_0 + \beta_1 \alpha\). Shifting the \(X_i\) values by \(\alpha\) changes the intercept but not the slope.
Often \(\alpha\) is set to \(\bar{X}\) so that the intercept is interpreted as the expected response at the average \(X\) value.
\[ \mathbb{E}\left[ Y_i \mid X_i = x + 1 \right] - \mathbb{E}\left[ Y_i \mid X_i = x \right] = \beta_0 + \beta_1 (x + 1) - \left( \beta_0 + \beta_1 x \right) = \beta_1, \] so the slope \(\beta_1\) is the expected change in the response for a one-unit change in the predictor.
Consider the impact of changing the units of \(X\) \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \frac{\beta_1}{\alpha}\left( X_i\alpha \right) + \epsilon_i = \beta_0 + \bar{\beta}_{1}\left( X_i\alpha \right) + \epsilon_i, \] where \(\bar{\beta}_1 = \beta_1 / \alpha\). Multiplying \(X\) by a factor \(\alpha\) divides the coefficient by the same factor \(\alpha\).
Example: \(X\) is height in meters and \(Y\) is weight in kilograms. Then \(\beta_1\) is in kilograms/meter. Converting \(X\) to centimeters implies multiplying \(X\) by \(100\) centimeters/meter. To keep the units consistent, we have to divide \(\beta_1\) by \(100\) centimeters/meter: \[ X \left[ \text{m} \right] \times 100 \left[\frac{\text{cm}}{\text{m}} \right] = 100 X \left[\text{cm}\right], \quad \beta_1 \left[ \frac{\text{kg}}{\text{m}} \right] \times \frac{1}{100} \left[ \frac{\text{m}}{\text{cm}} \right] = \frac{\beta_1}{100} \left[ \frac{\text{kg}}{\text{cm}} \right] \]
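A minimal sketch with simulated data (variable names and parameter values are made up for illustration) confirming both effects: centering \(X\) changes only the intercept, and multiplying \(X\) by \(10\) divides the slope by \(10\).
# Simulated data: shifting x moves the intercept, rescaling x rescales the slope
set.seed(1)
x_sim <- runif(50, 1, 3)
y_sim <- 2 + 5 * x_sim + rnorm(50, sd = 0.5)
coef(lm(y_sim ~ x_sim))                     # original intercept and slope
coef(lm(y_sim ~ I(x_sim - mean(x_sim))))    # centered x: new intercept, same slope
coef(lm(y_sim ~ I(x_sim * 10)))             # x in new units: same intercept, slope / 10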
Data is diamond prices (Singapore dollars) and diamond weight in carats (a standard measure of diamond mass, 1 carat = 0.2 g).
library(UsingR)
data(diamond)
library(ggplot2)
g <- ggplot(diamond, aes(x = carat, y = price))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Price (SIN $)')
g <- g + geom_point(size = 6,
color = 'black',
alpha = 0.2)
g <- g + geom_point(size = 5,
color = 'blue',
alpha = 0.2)
g <- g + geom_smooth(method = 'lm',
color = 'black')
g
## `geom_smooth()` using formula = 'y ~ x'
Fitting the linear regression model
fit <- lm(price ~ carat, data = diamond)
coef(fit)
## (Intercept) carat
## -259.6259 3721.0249
summary(fit)
##
## Call:
## lm(formula = price ~ carat, data = diamond)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.159 -21.448 -0.869 18.972 79.370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -259.63 17.32 -14.99 <2e-16 ***
## carat 3721.02 81.79 45.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared: 0.9783, Adjusted R-squared: 0.9778
## F-statistic: 2070 on 1 and 46 DF, p-value: < 2.2e-16
The interpretation of the intercept and the slope in this context is the following: the intercept, \(-259.63\), is the expected price (SIN $) of a \(0\)-carat diamond, which is not meaningful on its own; the slope, \(3721.02\), is the expected increase in price (SIN $) per additional carat of mass.
Getting a more interpretable intercept by centering the data.
fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
## (Intercept) I(carat - mean(carat))
## 500.0833 3721.0249
The new intercept, \(500.08\), is the expected price (SIN $) of a diamond of average mass. Changing the units of the predictor to 0.1 carats:
fit3 <- lm(price ~ I(carat*10), data = diamond)
coef(fit3)
## (Intercept) I(carat * 10)
## -259.6259 372.1025
Predicting the price of a diamond
newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
## [1] 335.7381 745.0508 1005.5225
predict(fit, newdata = data.frame(carat = newx))
## 1 2 3
## 335.7381 745.0508 1005.5225
Residuals represent variation left unexplained by the assumed model. There is a difference between residuals and errors:
* The errors are the unobservable deviations from the true regression line (with the known coefficients).
* The residuals are the observable deviations from the fitted regression line (with the estimated coefficients).
In a sense, the residuals are estimates of the errors.
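As a small illustration (a sketch with simulated data, not part of the diamond example), when the true errors are known by construction, the residuals track them closely:
# Simulated data: residuals are observable estimates of the unobservable errors
set.seed(2)
x_sim <- runif(100)
eps_sim <- rnorm(100, sd = 0.3)      # true (unobservable) errors
y_sim <- 1 + 2 * x_sim + eps_sim
e_sim <- resid(lm(y_sim ~ x_sim))    # observable residuals
cor(e_sim, eps_sim)                  # close to 1: residuals approximate the errors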
library(UsingR)
data(diamond)
library(ggplot2)
g <- ggplot(diamond, aes(x = carat, y = price))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Price (SIN $)')
g <- g + geom_point(size = 6,
color = 'black',
alpha = 0.2)
g <- g + geom_point(size = 5,
color = 'blue',
alpha = 0.2)
g <- g + geom_smooth(method = 'lm',
color = 'black')
g
## `geom_smooth()` using formula = 'y ~ x'
The variation around the regression line is residual variation, i.e. the variation left unexplained after accounting for mass.
The model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), where \(\epsilon_i \sim \text{Normal}\left(\epsilon \mid 0, \sigma^2 \right)\)
Observed outcome \(i\) is \(Y_i\) at predictor value \(X_i\).
Predicted outcome \(i\) is \(\widehat{Y}_i\) at predictor value \(X_i\), \[ \widehat{Y}_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} X_i. \]
The residual is the difference between the observed outcome \(Y_i\) and the predicted outcome \(\widehat{Y}_i\): \[ e_i = Y_i - \widehat{Y}_i \]
Least squares minimizes \[ \sum_{i=1}^{n} e_{i}^{2}. \]
The residuals \(e_i\) can be thought of as estimates of the errors \(\epsilon_i\).
data(diamond)
x <- diamond$carat
y <- diamond$price
n <- length(x)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e - (y - yhat)))
## [1] 6.51923e-13
max(abs(e - (y - (coef(fit)[1] + coef(fit)[2] * x))))
## [1] 5.950795e-13
sum(e)
## [1] 4.618528e-14
sum(e*x)
## [1] 7.549517e-15
plot(x, y,
xlab = 'Mass (carats)',
ylab = 'Price (SIN $)',
bg = 'lightblue',
col = 'black',
cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
for (i in 1:n){
lines(c(x[i], x[i]),
c(y[i], yhat[i]),
col = 'red', lwd = 2)
}
plot(x, e,
xlab = 'Mass (carats)',
ylab = 'Residuals (SIN $)',
bg = 'lightblue',
col = 'black',
cex = 1.1, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1:n){
lines(c(x[i], x[i]),
c(e[i], 0),
col = 'red', lwd = 2)
}
# Simulated example: a sinusoidal (nonlinear) signal that the residual plot reveals
x <- runif(100, -3, 3)
y <- x + sin(x) + rnorm(100, sd = .2)
library(ggplot2)
g <- ggplot(data.frame(x = x, y = y),
aes(x = x, y = y))
g <- g + geom_smooth(method = 'lm', color = 'black')
g <- g + geom_point(size = 7,
color = 'black',
alpha = 0.4)
g <- g + geom_point(size = 5,
color = 'red',
alpha = 0.4)
g
## `geom_smooth()` using formula = 'y ~ x'
library(ggplot2)
g <- ggplot(data.frame(x = x, y = resid(lm(y~x))),
aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 7,
color = 'black',
alpha = 0.4)
g <- g + geom_point(size = 5,
color = 'red',
alpha = 0.4)
g <- g + xlab('X') + ylab('Residual')
g
# Simulated example: heteroskedasticity (error standard deviation grows with x)
x <- runif(100, 0, 6)
y <- x + rnorm(100, sd = .001 * x)
library(ggplot2)
g <- ggplot(data.frame(x = x, y = y),
aes(x = x, y = y))
g <- g + geom_smooth(method = 'lm', color = 'black')
g <- g + geom_point(size = 7,
color = 'black',
alpha = 0.4)
g <- g + geom_point(size = 5,
color = 'red',
alpha = 0.4)
g
## `geom_smooth()` using formula = 'y ~ x'
library(ggplot2)
g <- ggplot(data.frame(x = x, y = resid(lm(y~x))),
aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 7,
color = 'black',
alpha = 0.4)
g <- g + geom_point(size = 5,
color = 'red',
alpha = 0.4)
g <- g + xlab('X') + ylab('Residual')
g
diamond$e <- resid(lm(price ~ carat, data = diamond))
g <- ggplot(diamond, aes(x = carat, y = e))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Residual Price (SIN $)')
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 6,
color = 'black',
alpha = 0.2)
g <- g + geom_point(size = 5,
color = 'blue',
alpha = 0.2)
g
e <- c(resid(lm(price ~ 1, data = diamond)),
resid(lm(price ~ carat, data = diamond)))
fit <- factor(c(rep('Itc', nrow(diamond)),
rep('Itc, slope', nrow(diamond))))
g <- ggplot(data.frame(e = e, fit = fit),
aes(y = e, x = fit, fill = fit))
g <- g + geom_dotplot(binaxis = 'y',
stackdir = 'center')
g <- g + xlab('Fitting approach')
g <- g + ylab('Residual Price (SIN $)')
g
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.
Model \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), where \(\epsilon_i \sim \text{Normal}\left( \epsilon \mid 0, \sigma^2 \right)\).
The ML estimate of \(\sigma^2\) is \[ \widehat{\sigma}_{\text{ML}}^2 = \frac{1}{n} \sum_{i=1}^{n} e_i^2, \] the average squared residual.
Typically, one uses \[ \widehat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2, \] with \(n-2\) instead of \(n\) accounting for the two degrees of freedom lost to the two constraints on the residuals, \(\sum_{i=1}^{n} e_i = 0\) and \(\sum_{i=1}^{n} e_i X_i = 0\) (one for each estimated coefficient).
More regression variables lead to more constraints and hence to the loss of more degrees of freedom.
x <- diamond$carat
n <- length(x)
y <- diamond$price
fit <- lm(y~x)
summary(fit)$sigma
## [1] 31.84052
sqrt(sum(resid(fit)**2)/(n-2))
## [1] 31.84052
The total variability in the response is \[ \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2. \] The regression variability is the variability that is explained by adding the predictor \[ \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2. \] The error variability is what’s left over around the regression line \[ \sum_{i=1}^{n} \left( Y_i - \widehat{Y}_i \right)^2. \]
Note that \[ \text{total variability} = \text{regression variability} + \text{error variability}, \] \[ \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 = \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2 + \sum_{i=1}^{n} \left( Y_i - \widehat{Y}_i \right)^2, \] because the cross term vanishes: \[ \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right) \left( Y_i - \widehat{Y}_i \right) = \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right) e_i = \sum_{i=1}^{n} \widehat{Y}_i e_i - \bar{Y} \sum_{i=1}^{n} e_i = 0, \] since the residuals are orthogonal to the fitted values and sum to zero.
Since the residual variation and the regression model variation add up to the total variation, we can define a quantity representing the proportion of the total variation that is explained by the model. This quantity is called \(R^2\) and is defined as the ratio of the regression variation to the total variation: \[ R^2 = \frac{\sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2}{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2} \]
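As a quick numerical check (a sketch reusing the x, y and fit objects from the diamond example above), both the decomposition and the definition of \(R^2\) can be verified directly:
# Verify total = regression + error variability, and compute R^2 three ways
yhat <- predict(fit)
sst <- sum((y - mean(y))^2)       # total variability
ssr <- sum((yhat - mean(y))^2)    # regression variability
sse <- sum((y - yhat)^2)          # error variability
c(sst, ssr + sse)                 # the two numbers agree
c(ssr / sst, summary(fit)$r.squared, cor(x, y)^2)  # all equal R^2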
example(anscombe)
##
## anscmb> require(stats); require(graphics)
##
## anscmb> summary(anscombe)
## x1 x2 x3 x4 y1
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840
## y2 y3 y4
## Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median :8.140 Median : 7.11 Median : 7.040
## Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :9.260 Max. :12.74 Max. :12.500
##
## anscmb> ##-- now some "magic" to do the 4 regressions in a loop:
## anscmb> ff <- y ~ x
##
## anscmb> mods <- setNames(as.list(1:4), paste0("lm", 1:4))
##
## anscmb> for(i in 1:4) {
## anscmb+ ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
## anscmb+ ## or ff[[2]] <- as.name(paste0("y", i))
## anscmb+ ## ff[[3]] <- as.name(paste0("x", i))
## anscmb+ mods[[i]] <- lmi <- lm(ff, data = anscombe)
## anscmb+ print(anova(lmi))
## anscmb+ }
## Analysis of Variance Table
##
## Response: y1
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 27.510 27.5100 17.99 0.00217 **
## Residuals 9 13.763 1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y2
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 27.500 27.5000 17.966 0.002179 **
## Residuals 9 13.776 1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y3
## Df Sum Sq Mean Sq F value Pr(>F)
## x3 1 27.470 27.4700 17.972 0.002176 **
## Residuals 9 13.756 1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y4
## Df Sum Sq Mean Sq F value Pr(>F)
## x4 1 27.490 27.4900 18.003 0.002165 **
## Residuals 9 13.742 1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## anscmb> ## See how close they are (numerically!)
## anscmb> sapply(mods, coef)
## lm1 lm2 lm3 lm4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1 0.5000909 0.500000 0.4997273 0.4999091
##
## anscmb> lapply(mods, function(fm) coef(summary(fm)))
## $lm1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0000909 1.1247468 2.667348 0.025734051
## x1 0.5000909 0.1179055 4.241455 0.002169629
##
## $lm2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.000909 1.1253024 2.666758 0.025758941
## x2 0.500000 0.1179637 4.238590 0.002178816
##
## $lm3
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0024545 1.1244812 2.670080 0.025619109
## x3 0.4997273 0.1178777 4.239372 0.002176305
##
## $lm4
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017273 1.1239211 2.670763 0.025590425
## x4 0.4999091 0.1178189 4.243028 0.002164602
##
##
## anscmb> ## Now, do what you should have done in the first place: PLOTS
## anscmb> op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0))
##
## anscmb> for(i in 1:4) {
## anscmb+ ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
## anscmb+ plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
## anscmb+ xlim = c(3, 19), ylim = c(3, 13))
## anscmb+ abline(mods[[i]], col = "blue")
## anscmb+ }
##
## anscmb> mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
##
## anscmb> par(op)
This example shows that the four data sets have:
* the same mean and variance of \(X\) and \(Y\),
* identical correlations, hence identical \(R^2\),
* the same fitted linear regression relationship,
yet the plots reveal that the data sets are completely different.
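A short sketch backing up these claims numerically from the built-in anscombe data frame:
# Means, variances and correlations are essentially identical across the four sets
sapply(1:4, function(i) {
xi <- anscombe[[paste0('x', i)]]
yi <- anscombe[[paste0('y', i)]]
c(mean_x = mean(xi), var_x = var(xi),
mean_y = mean(yi), var_y = var(yi),
cor_xy = cor(xi, yi))
})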
The linear model is \[ Y_i = \beta_{0} + \beta_{1} X_i + \epsilon_i, \] where the errors are \[ \epsilon_i \sim \text{Normal} \left( \epsilon \mid 0, \sigma^2 \right). \] We assume that the model is known.
The estimates of the two model parameters are \[ \widehat{\beta}_1 = \text{Cor} \left[ X, Y \right] \frac{ \text{sd} \left( Y \right)}{\text{sd} \left( X \right)} ;\quad \widehat{\beta}_0 = \bar{Y} - \widehat{\beta}_{1} \bar{X}. \]
## Review
Statistics like \[ \frac{\widehat{\theta} - \theta}{\widehat{\sigma}_{\widehat{\theta}}} \] often have the following properties:
They are normally distributed and have a finite-sample \(t\) distribution if the estimated variance is replaced with a sample estimate (under normality assumptions).
They can be used to test the null hypothesis \(H_0: \theta = \theta_0\) versus the alternative hypothesis \(H_{a}: \theta >, <, \neq \theta_0\).
They can be used to create a confidence interval for \(\theta\) via \(\widehat{\theta} \pm Q_{1-\frac{\alpha}{2}}\widehat{\sigma}_{\widehat{\theta}}\), where \(Q_{1-\frac{\alpha}{2}}\) is the relevant quantile from either a normal or a \(t\) distribution.
The variance of the regression slope \(\widehat{\beta}_1\) is given by a highly informative formula: \[ \sigma_{\widehat{\beta}_1}^2 = \mathbb{V}\left[ \widehat{\beta}_1 \right] = \frac{\sigma^2}{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \] It involves two things:
* \(\sigma^2\), i.e. how variable the points are around the true regression line,
* \(\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2\), i.e. how variable the data points \(X_i\) are.
A better estimate of the regression slope \(\beta_1\) corresponds to a smaller slope variance \(\sigma_{\widehat{\beta}_1}^2\).
A smaller slope variance corresponds to a small numerator, i.e. a small variance \(\sigma^2\) around the regression line, and a large denominator, i.e. widely spread predictor values as measured by \(\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2\).
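A small simulation sketch (made-up parameters, purely illustrative) shows the effect of the spread of the \(X_i\): with the same error variance, the slope's standard error is much smaller when the predictor values are more spread out.
# Same error variance, different spread in x: the wider x gives a smaller slope SE
set.seed(3)
n_sim <- 100
x_narrow <- runif(n_sim, 0, 1)     # predictor with little spread
x_wide <- runif(n_sim, 0, 10)      # predictor with large spread
y_narrow <- 1 + 2 * x_narrow + rnorm(n_sim)
y_wide <- 1 + 2 * x_wide + rnorm(n_sim)
summary(lm(y_narrow ~ x_narrow))$coefficients['x_narrow', 'Std. Error']
summary(lm(y_wide ~ x_wide))$coefficients['x_wide', 'Std. Error']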
The variance of the regression intercept \(\widehat{\beta}_0\): \[ \sigma_{\widehat{\beta}_0}^2 = \mathbb{V}\left[ \widehat{\beta}_0 \right] = \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right) \sigma^2 \]
In practice, \(\sigma\) is replaced by its estimate \(s\).
Under the Gaussian errors assumption \[ \frac{\widehat{\beta}_j - \beta_j}{\widehat{\sigma}_{\widehat{\beta}_j}} \sim t_{n-2}, \] i.e. a \(t\) distribution with \(n-2\) degrees of freedom, which is used to create confidence intervals and perform hypothesis tests.
data(diamond)
x <- diamond$carat; y <- diamond$price ; n <- length(x)
beta1 <- cor(x,y) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
yh <- beta0 + beta1 * x
epsilon <- y - yh
sigma <- sqrt(sum(epsilon**2) / (n-2))
sx <- sum((x-mean(x))**2)
sdBeta1 <- sigma / sqrt(sx)
sdBeta0 <- sqrt((1/n + mean(x)**2 / sx)) * sigma
tBeta1 <- beta1 / sdBeta1
tBeta0 <- beta0 / sdBeta0
pBeta1 <- 2 * pt(abs(tBeta1), df = n-2, lower.tail = FALSE)
pBeta0 <- 2 * pt(abs(tBeta0), df = n-2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, sdBeta0, tBeta0, pBeta0),
c(beta1, sdBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c('Estimate', 'Std. Error', 't value', 'P(>|t|)')
rownames(coefTable) <- c('Intercept', 'x')
coefTable
## Estimate Std. Error t value P(>|t|)
## Intercept -259.6259 17.31886 -14.99094 2.523271e-19
## x 3721.0249 81.78588 45.49715 6.751260e-40
fit <- lm(y~x)
summary(fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -259.6259 17.31886 -14.99094 2.523271e-19
## x 3721.0249 81.78588 45.49715 6.751260e-40
sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[1,2]
## [1] -294.4870 -224.7649
sumCoef[2,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[2,2]
## [1] 3556.398 3885.651
(sumCoef[2,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[2,2])/10
## [1] 355.6398 388.5651
Consider predicting \(Y\) at a given value of \(X\).
The obvious estimate for prediction at point \(x_0\) is \[ \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0 \]
A standard error is needed to create a prediction interval.
There's a distinction between an interval for the regression line at \(x_0\) and a prediction interval for a new observation \(y_0\) at \(x_0\).
The standard error for the regression line at \(x_0\): \[ \widehat{\sigma} \left( \frac{1}{n} + \frac{\left( x_0 - \bar{X} \right)^2} {\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right)^{1/2} \]
The standard error for a prediction at \(x_0\): \[ \widehat{\sigma} \left( 1 + \frac{1}{n} + \frac{\left( x_0 - \bar{X} \right)^2} {\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right)^{1/2} \]
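As a sanity check (a sketch reusing x, y, n and fit = lm(y ~ x) from the diamond example above, with an arbitrary point \(x_0 = 0.3\)), the two formulas can be computed by hand and compared with predict():
# Standard error of the regression line vs. of a new prediction at x0
x0 <- 0.3
sigmaHat <- summary(fit)$sigma
ssx <- sum((x - mean(x))^2)
seLine <- sigmaHat * sqrt(1/n + (x0 - mean(x))^2 / ssx)      # line at x0
sePred <- sigmaHat * sqrt(1 + 1/n + (x0 - mean(x))^2 / ssx)  # new y at x0
seLine
predict(fit, newdata = data.frame(x = x0), se.fit = TRUE)$se.fit  # matches seLine
sePred  # wider: includes the variability of a single new observation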
data(diamond)
x <- diamond$carat; y <- diamond$price ; n <- length(x)
fit <- lm(y ~ x)
newx = data.frame(x = seq(min(x), max(x), length = 100))
p1 <- data.frame(predict(fit,
newdata = newx,
interval = 'confidence'
))
p1$interval = 'confidence'
p1$x = newx$x
p2 <- data.frame(predict(fit,
newdata = newx,
interval = 'prediction'
))
p2$interval = 'prediction'
p2$x = newx$x
dat = rbind(p1, p2)
names(dat)[1] = 'y'
g <- ggplot(dat, aes(x = x, y = y))
g <- g + geom_ribbon(aes(ymin = lwr,
ymax = upr,
fill = interval), alpha = 0.2)
g <- g + geom_line()
g <- g + geom_point(data = data.frame(x=x, y = y),
aes(x = x, y = y), size = 2)
g