Module 4: Statistical Linear Regression Models

Under the model, the response has constant variance around the regression line: \[ \mathbb{V}\left[ Y_i \mid X_i = x_i \right] = \sigma^2. \]

Interpreting regression coefficients

The intercept

  • \(\beta_0\) is the expected value of the response when the predictor is \(0\). \[ \mathbb{E}\left[ Y_i \mid X_i = 0 \right] = \beta_0 + \beta_{1} \times 0 = \beta_0 \]

  • Note that this is not always of interest, for example when \(X = 0\) is impossible or far outside the range of the data (e.g., when \(X\) is blood pressure or height).

  • Consider that \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \beta_1 \alpha + \beta_1 (X_i - \alpha) + \epsilon_i = \bar{\beta}_0 + \beta_1 (X_i - \alpha) + \epsilon_i, \] so shifting the \(X_i\) values by \(\alpha\) changes the intercept but not the slope.

  • Often \(\alpha\) is set to \(\bar{X}\) so that the intercept is interpreted as the expected response at the average \(X\) value.

The slope

  • \(\beta_1\) is the expected change in response for a 1 unit change in the predictor

\[ \mathbb{E}\left[ Y_i \mid X_i = x + 1 \right] - \mathbb{E}\left[ Y_i \mid X_i = x \right] = \left( \beta_0 + \beta_1 (x + 1) \right) - \left( \beta_0 + \beta_1 x \right) = \beta_1 \]

  • Consider the impact of changing the units of \(X\): \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i = \beta_0 + \frac{\beta_1}{\alpha}\left(X_i\alpha\right) + \epsilon_i = \beta_0 + \bar{\beta}_{1}\left(X_i\alpha\right) + \epsilon_i , \] so multiplying \(X\) by a factor \(\alpha\) divides the coefficient by a factor of \(\alpha\).

  • Example: \(X\) is height in meters and \(Y\) is weight in kilograms. Then \(\beta_1\) has units of kilograms/meter. Converting \(X\) to centimeters means multiplying \(X\) by \(100\) centimeters/meter, so \(\beta_1\) must be divided by \(100\) centimeters/meter to keep the units consistent: \[ X \left[ \text{m} \right] \times 100 \left[\frac{\text{cm}}{\text{m}} \right] = 100 X \left[\text{cm}\right], \quad \beta_1 \left[ \frac{\text{kg}}{\text{m}} \right] \times \frac{1}{100} \left[ \frac{\text{m}}{\text{cm}} \right] = \frac{\beta_1}{100} \left[ \frac{\text{kg}}{\text{cm}} \right] \] Both effects are illustrated in the short simulated sketch below.
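
A minimal sketch with simulated data (hypothetical, not the diamond data used below) illustrating both points: shifting the predictor changes only the intercept, while rescaling it rescales only the slope.

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)
coef(lm(y ~ x))                 # original intercept and slope
coef(lm(y ~ I(x - mean(x))))    # shifted x: intercept changes, slope unchanged
coef(lm(y ~ I(x * 100)))        # rescaled x: slope divided by 100, intercept unchanged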

Example

The data are diamond prices (Singapore dollars) and diamond weights in carats (a standard measure of diamond mass; 1 carat = 0.2 g).

library(UsingR)
data(diamond)
library(ggplot2)

g <- ggplot(diamond, aes(x = carat, y = price))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Price (SIN $)')
g <- g + geom_point(size = 6, 
                    color = 'black',
                    alpha = 0.2)
g <- g + geom_point(size = 5, 
                    color = 'blue',
                    alpha = 0.2)
g <- g + geom_smooth(method = 'lm',
                     color = 'black')
g

Fitting the linear regression model

fit <- lm(price ~ carat, data = diamond)
coef(fit)
## (Intercept)       carat 
##   -259.6259   3721.0249
summary(fit)
## 
## Call:
## lm(formula = price ~ carat, data = diamond)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.159 -21.448  -0.869  18.972  79.370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -259.63      17.32  -14.99   <2e-16 ***
## carat        3721.02      81.79   45.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.84 on 46 degrees of freedom
## Multiple R-squared:  0.9783, Adjusted R-squared:  0.9778 
## F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16

The interpretation of the intercept and the slope in this context is the following:

  • We estimate an expected 3721.02 SIN $ increase in price for every one-carat increase in the mass of a diamond.
  • The intercept, -259.63 SIN $, is the expected price of a 0 carat diamond.

A more interpretable intercept is obtained by centering the predictor:

fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
##            (Intercept) I(carat - mean(carat)) 
##               500.0833              3721.0249
  • Thus, 500.1 SIN $ is the expected price of a diamond of average mass for these data (0.2042 carats).

Changing the units of the predictor to 0.1 carats, so that the slope is the expected price change per tenth of a carat:

fit3 <- lm(price ~ I(carat*10), data = diamond)
coef(fit3)
##   (Intercept) I(carat * 10) 
##     -259.6259      372.1025

Predicting the price of a diamond

newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
## [1]  335.7381  745.0508 1005.5225
predict(fit, newdata = data.frame(carat = newx))
##         1         2         3 
##  335.7381  745.0508 1005.5225

Module 5: Residuals

Residuals represent variation left unexplained by the assumed model. There is a difference between residuals and errors:

  • The errors are the unobservable true deviations from the line with the known (true) coefficients.
  • The residuals are the observable deviations from the line with the estimated coefficients.

In a sense, the residuals are estimates of the errors.
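
A small simulation (hypothetical, not part of the diamond example) illustrating this: when data are generated from a known linear model, the residuals from the fitted line closely track the true, normally unobservable, errors.

set.seed(1)
n <- 100
x <- runif(n)
eps <- rnorm(n, sd = 0.3)      # the true errors, unobservable in practice
y <- 1 + 2 * x + eps           # data generated from a known linear model
e <- resid(lm(y ~ x))          # the observable residuals
cor(e, eps)                    # close to 1: the residuals estimate the errors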

Motivating example

library(UsingR)
data(diamond)
library(ggplot2)

g <- ggplot(diamond, aes(x = carat, y = price))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Price (SIN $)')
g <- g + geom_point(size = 6, 
                    color = 'black',
                    alpha = 0.2)
g <- g + geom_point(size = 5, 
                    color = 'blue',
                    alpha = 0.2)
g <- g + geom_smooth(method = 'lm',
                     color = 'black')
g

The variation around the regression line is the residual variation, i.e. the variation left unexplained after accounting for mass.

  • The model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), where \(\epsilon_i \sim \text{Normal}\left(\epsilon \mid 0, \sigma^2 \right)\)

  • Observed outcome \(i\) is \(Y_i\) at predictor value \(X_i\).

  • Predicted outcome \(i\) is \(\widehat{Y}_i\) at predictor value \(X_i\), \[ \widehat{Y}_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} X_i. \]

  • The residual is the difference between the observed outcome \(Y_i\) and the predicted outcome \(\widehat{Y}_i\): \[ e_i = Y_i - \widehat{Y}_i \]

  • Least squares minimizes \[ \sum_{i=1}^{n} e_{i}^{2} \]

  • The residuals \(e_i\) can be thought of as estimates of the errors \(\epsilon_i\)

Properties of the residuals

  • \(\mathbb{E} \left[ e_i \right] = 0\).
  • If an intercept is included, \(\sum_{i=1}^{n} e_i = 0.\)
  • If a regressor variable \(X_i\) is included in the model, \(\sum_{i=1}^{n} e_i X_i = 0.\)
  • Residuals are useful for investigating poor model fit.
  • Positive residuals are above the line, negative residuals are below the line.
  • Residuals can be thought of as the outcome \(Y\) with the linear association of the predictor \(X\) removed.
  • One distinguishes residual variation (variation left after removing the predictor) from systematic variation (variation explained by the regression model).
data(diamond)
x <- diamond$carat
y <- diamond$price
n <- length(x)
fit <- lm(y ~ x)
e <- resid(fit)
yhat <- predict(fit)
max(abs(e - (y - yhat)))
## [1] 6.51923e-13
max(abs(e - (y - (coef(fit)[1] + coef(fit)[2] * x))))
## [1] 5.950795e-13
sum(e)
## [1] 4.618528e-14
sum(e*x)
## [1] 7.549517e-15
plot(x, y, 
     xlab = 'Mass (carats)', 
     ylab = 'Price (SIN $)',
     bg = 'lightblue',
     col = 'black',
     cex = 1.1, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
for (i in 1:n){
    lines(c(x[i], x[i]), 
          c(y[i], yhat[i]), 
          col = 'red', lwd = 2)
}

plot(x, e, 
     xlab = 'Mass (carats)', 
     ylab = 'Residuals (SIN $)',
     bg = 'lightblue',
     col = 'black',
     cex = 1.1, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1:n){
    lines(c(x[i], x[i]), 
          c(e[i], 0), 
          col = 'red', lwd = 2)
}

Non-linear example

x <- runif(100,-3,3)
y <- x + sin(x) + rnorm(100, sd = .2)
library(ggplot2)
g <- ggplot(data.frame(x = x, y = y), 
            aes(x = x, y = y))
g <- g + geom_smooth(method = 'lm', color = 'black')
g <- g + geom_point(size = 7, 
                    color = 'black', 
                    alpha = 0.4)
g <- g + geom_point(size = 5, 
                    color = 'red', 
                    alpha = 0.4)
g

library(ggplot2)
g <- ggplot(data.frame(x = x, y = resid(lm(y~x))), 
            aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 7, 
                    color = 'black', 
                    alpha = 0.4)
g <- g + geom_point(size = 5, 
                    color = 'red', 
                    alpha = 0.4)
g <- g + xlab('X') + ylab('Residual')
g

Heteroskedasticity example

x <- runif(100,0,6)
y <- x + rnorm(100, sd = .001 * x)
library(ggplot2)
g <- ggplot(data.frame(x = x, y = y), 
            aes(x = x, y = y))
g <- g + geom_smooth(method = 'lm', color = 'black')
g <- g + geom_point(size = 7, 
                    color = 'black', 
                    alpha = 0.4)
g <- g + geom_point(size = 5, 
                    color = 'red', 
                    alpha = 0.4)
g

library(ggplot2)
g <- ggplot(data.frame(x = x, y = resid(lm(y~x))), 
            aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 7, 
                    color = 'black', 
                    alpha = 0.4)
g <- g + geom_point(size = 5, 
                    color = 'red', 
                    alpha = 0.4)
g <- g + xlab('X') + ylab('Residual')
g

Residuals for diamond data

diamond$e <- resid(lm(price ~ carat, data = diamond))
g <- ggplot(diamond, aes(x = carat, y = e))
g <- g + xlab('Mass (carats)')
g <- g + ylab('Residual Price (SIN $)')
g <- g + geom_hline(yintercept = 0, linewidth = 2)
g <- g + geom_point(size = 6, 
                    color = 'black',
                    alpha = 0.2)
g <- g + geom_point(size = 5, 
                    color = 'blue',
                    alpha = 0.2)
g

e <- c(resid(lm(price ~ 1, data = diamond)),
       resid(lm(price ~ carat, data = diamond)))
fit <- factor(c(rep('Itc', nrow(diamond)),
                rep('Itc, slope', nrow(diamond))))
g <- ggplot(data.frame(e = e, fit = fit), 
            aes(y = e, x = fit, fill = fit))
g <- g + geom_dotplot(binaxis = 'y', 
                      stackdir = 'center')
g <- g + xlab('Fitting approach')
g <- g + ylab('Residual Price (SIN $)')
g

Estimating residual variation

  • Model \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), where \(\epsilon_i \sim \text{Normal}\left( \epsilon \mid 0, \sigma^2 \right)\).

  • The ML estimate of \(\sigma^2\) is \[ \widehat{\sigma}_{\text{ML}}^2 = \frac{1}{n} \sum_{i=1}^{n} e_i^2, \] the average squared residual.

  • Typically, one uses \[ \widehat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2, \] with \(n-2\) instead of \(n\) accounting for the two degrees of freedom lost due to the two constraints:

    • including the intercept means the residuals have to sum to zero
    • including the covariate means the residuals multiplied by the covariate have to sum to zero

More regression variables lead to more constraints and hence to the loss of more degrees of freedom.
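
A quick check of this bookkeeping with the diamond data (the quadratic term below is added purely for illustration):

fit1 <- lm(price ~ carat, data = diamond)
fit2 <- lm(price ~ carat + I(carat^2), data = diamond)
c(n = nrow(diamond), df1 = fit1$df.residual, df2 = fit2$df.residual)   # n, n - 2, n - 3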

Example

x <- diamond$carat
n <- length(x)
y <- diamond$price
fit <- lm(y~x)
summary(fit)$sigma
## [1] 31.84052
sqrt(sum(resid(fit)**2)/(n-2))
## [1] 31.84052

Summarizing variation

The total variability in the response is \[ \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2. \] The regression variability is the variability explained by adding the predictor \[ \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2. \] The error variability is what is left over around the regression line \[ \sum_{i=1}^{n} \left( Y_i - \widehat{Y}_i \right)^2. \]

Note that \[ \text{total variability} = \text{regression variability} + \text{error variability}, \] \[ \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 = \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2 + \sum_{i=1}^{n} \left( Y_i - \widehat{Y}_i \right)^2 \] because the cross term vanishes: \[ \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right) \left( Y_i - \widehat{Y}_i \right) = \sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right) e_i = \sum_{i=1}^{n} \widehat{Y}_i e_i - \bar{Y} \sum_{i=1}^{n} e_i = 0, \] where \(\sum_{i=1}^{n} \widehat{Y}_i e_i = \widehat{\beta}_0 \sum_{i=1}^{n} e_i + \widehat{\beta}_1 \sum_{i=1}^{n} X_i e_i = 0\) by the residual properties listed above.
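
A quick numerical check of this decomposition on the diamond data (a sketch, assuming the data set is loaded as above):

fit <- lm(price ~ carat, data = diamond)
ybar <- mean(diamond$price)
total <- sum((diamond$price - ybar)^2)
regression <- sum((fitted(fit) - ybar)^2)
error <- sum(resid(fit)^2)
c(total = total, regressionPlusError = regression + error)   # the two agree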

\(R\) squared (\(R^2\))

Since the residual variation and the regression variation add up to the total variation, we can define the percentage of the total variation that is explained by the model. This quantity is called \(R^2\) and is defined as the ratio of the regression variation to the total variation (checked numerically after the list below): \[ R^2 = \frac{\sum_{i=1}^{n} \left( \widehat{Y}_i - \bar{Y} \right)^2}{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2} \]

  • \(R^2\) is the percentage of variation explained by the regression model.
  • \(0 \leq R^2 \leq 1\).
  • \(R^2\) is the sample correlation squared.
  • \(R^2\) can be a misleading summary of model fit:
    • Deleting data can inflate \(R^2\).
    • Adding terms to a regression model always increases \(R^2\).
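
A quick numerical check of the equivalent ways of computing \(R^2\) on the diamond data (a sketch, assuming the data set is loaded as above):

fit <- lm(price ~ carat, data = diamond)
total <- sum((diamond$price - mean(diamond$price))^2)
regression <- sum((fitted(fit) - mean(diamond$price))^2)
c(definition = regression / total,
  corSquared = cor(diamond$carat, diamond$price)^2,
  fromSummary = summary(fit)$r.squared)   # all three agree
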
example(anscombe)
## 
## anscmb> require(stats); require(graphics)
## 
## anscmb> summary(anscombe)
##        x1             x2             x3             x4           y1        
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
##        y2              y3              y4        
##  Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :9.260   Max.   :12.74   Max.   :12.500  
## 
## anscmb> ##-- now some "magic" to do the 4 regressions in a loop:
## anscmb> ff <- y ~ x
## 
## anscmb> mods <- setNames(as.list(1:4), paste0("lm", 1:4))
## 
## anscmb> for(i in 1:4) {
## anscmb+   ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
## anscmb+   ## or   ff[[2]] <- as.name(paste0("y", i))
## anscmb+   ##      ff[[3]] <- as.name(paste0("x", i))
## anscmb+   mods[[i]] <- lmi <- lm(ff, data = anscombe)
## anscmb+   print(anova(lmi))
## anscmb+ }
## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## anscmb> ## See how close they are (numerically!)
## anscmb> sapply(mods, coef)
##                   lm1      lm2       lm3       lm4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1          0.5000909 0.500000 0.4997273 0.4999091
## 
## anscmb> lapply(mods, function(fm) coef(summary(fm)))
## $lm1
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x1          0.5000909  0.1179055 4.241455 0.002169629
## 
## $lm2
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x2          0.500000  0.1179637 4.238590 0.002178816
## 
## $lm3
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x3          0.4997273  0.1178777 4.239372 0.002176305
## 
## $lm4
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x4          0.4999091  0.1178189 4.243028 0.002164602
## 
## 
## anscmb> ## Now, do what you should have done in the first place: PLOTS
## anscmb> op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))
## 
## anscmb> for(i in 1:4) {
## anscmb+   ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
## anscmb+   plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
## anscmb+        xlim = c(3, 19), ylim = c(3, 13))
## anscmb+   abline(mods[[i]], col = "blue")
## anscmb+ }

## 
## anscmb> mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
## 
## anscmb> par(op)

This example shows that the four data sets have:

  • the same mean and variance of \(X\) and \(Y\),
  • identical correlations, hence identical \(R^2\),
  • the same fitted linear regression relationship,

yet their scatterplots look completely different.
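
A short sketch confirming the identical summaries directly from the anscombe data frame used above:

sapply(1:4, function(i) {
    x <- anscombe[[paste0('x', i)]]
    y <- anscombe[[paste0('y', i)]]
    c(meanX = mean(x), varX = var(x), meanY = mean(y), varY = var(y),
      R2 = cor(x, y)^2)
})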

Module 6: Inference in Regression

Recalling the model

The linear model is \[ Y_i = \beta_{0} + \beta_{1} X_i + \epsilon_i, \] where the errors are \[ \epsilon_i \sim \text{Normal} \left( \epsilon \mid 0, \sigma^2 \right). \] We assume that this model is the true model.

The estimates of the two model parameters are \[ \widehat{\beta}_1 = \text{Cor} \left[ X, Y \right] \frac{ \text{sd} \left( Y \right)}{\text{sd} \left( X \right)} , \quad \widehat{\beta}_0 = \bar{Y} - \widehat{\beta}_{1} \bar{X}. \]

Review

Statistics of the form \[ \frac{\widehat{\theta} - \theta}{\widehat{\sigma}_{\widehat{\theta}}} \] often have the following properties:

  1. Is normally distributed, and follows a finite sample \(t\) distribution when the standard error in the denominator is estimated rather than known (under normality assumptions).

  2. Can be used to test the null hypothesis \(H_0: \theta = \theta_0\) versus the alternative hypothesis \(H_{a}: \theta >, <, \neq \theta_0\).

  3. Can be used to create a confidence interval for \(\theta\) via \(\widehat{\theta} \pm Q_{1-\frac{\alpha}{2}}\widehat{\sigma}_{\widehat{\theta}}\), where \(Q_{1-\frac{\alpha}{2}}\) is the relevant quantile from either a normal or a \(t\) distribution.

The variance of the regression slope estimate \(\widehat{\beta}_1\) has a highly informative form: \[ \sigma_{\widehat{\beta}_1}^2 = \mathbb{V}\left[ \widehat{\beta}_1 \right] = \frac{\sigma^2}{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \] It involves two things:

  • \(\sigma^2\), i.e. how variable the points are around the true regression line,
  • how variable the data points \(X_i\) are.

A better estimate of the regression slope \(\beta_1\) corresponds to a smaller slope variance \(\sigma_{\widehat{\beta}_1}^2\).

A smaller slope variance corresponds to a small numerator, i.e. little variation around the regression line (\(\sigma^2\)), and a large denominator, i.e. a large spread of the predictor values \(\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2\), as the simulation below illustrates.
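
A small simulation (hypothetical data, not from the notes) illustrating the second point: spreading out the predictor values shrinks the standard error of the slope estimate.

set.seed(1)
slopeSE <- function(x) {
    y <- 1 + 2 * x + rnorm(length(x), sd = 1)
    summary(lm(y ~ x))$coefficients['x', 'Std. Error']
}
xNarrow <- runif(100, 0, 1)     # predictor confined to a narrow range
xWide   <- runif(100, 0, 10)    # same sample size, ten times the spread
c(narrow = slopeSE(xNarrow), wide = slopeSE(xWide))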

The variance of the regression intercept \(\widehat{\beta}_0\): \[ \sigma_{\widehat{\beta}_0}^2 = \mathbb{V}\left[ \widehat{\beta}_0 \right] = \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right) \sigma^2 \]

  • In practice, \(\sigma\) is replaced by its estimate \(s\).

  • Under the Gaussian errors assumption \[ \frac{\widehat{\beta}_j - \beta_j}{\widehat{\sigma}_{\widehat{\beta}_j}} \sim t_{n-2}, \] which is used to create confidence intervals and perform hypothesis tests.

Example

Coefficients table

data(diamond)
x <- diamond$carat; y <- diamond$price ; n <- length(x)

# estimates of the slope and intercept
beta1 <- cor(x,y) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)

# fitted values and residuals
yh <- beta0 + beta1 * x
epsilon <- y - yh

# residual standard deviation, with n - 2 degrees of freedom
sigma <- sqrt(sum(epsilon**2) / (n-2))

# sum of squares of the centered predictor
sx <- sum((x-mean(x))**2)

# standard errors of the slope and the intercept
sdBeta1 <- sigma / sqrt(sx)
sdBeta0 <- sqrt((1/n + mean(x)**2 / sx)) * sigma

# t statistics for testing H0: beta_j = 0
tBeta1 <- beta1 / sdBeta1
tBeta0 <- beta0 / sdBeta0

# two-sided p-values from the t distribution with n - 2 df
pBeta1 <- 2 * pt(abs(tBeta1), df = n-2, lower.tail = FALSE)
pBeta0 <- 2 * pt(abs(tBeta0), df = n-2, lower.tail = FALSE)

# assemble the coefficient table by hand
coefTable <- rbind(c(beta0, sdBeta0, tBeta0, pBeta0),
                   c(beta1, sdBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c('Estimate', 'Std. Error', 't value', 'P(>|t|)')
rownames(coefTable) <- c('Intercept', 'x')
coefTable
##            Estimate Std. Error   t value      P(>|t|)
## Intercept -259.6259   17.31886 -14.99094 2.523271e-19
## x         3721.0249   81.78588  45.49715 6.751260e-40
fit <- lm(y~x)
summary(fit)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
## x           3721.0249   81.78588  45.49715 6.751260e-40

Confidence interval

sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[1,2]
## [1] -294.4870 -224.7649
sumCoef[2,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[2,2]
## [1] 3556.398 3885.651
(sumCoef[2,1] + c(-1,1) * qt(0.975, df = fit$df) * sumCoef[2,2])/10
## [1] 355.6398 388.5651

Prediction of outcomes

  • Consider predicting \(Y\) at a value of \(X\)

    • Predicting the price of a diamond given the carat
    • Predicting the height of a child given the height of the parents
  • The obvious estimate for prediction at point \(x_0\) is \[ \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0 \]

  • A standard error is needed to create a prediction interval.

  • There’s a distinction between intervals for the regression line and the prediction of what a \(y_0\) would be at a point \(x_0\)

  • The standard error of the regression line at \(x_0\): \[ \widehat{\sigma} \left( \frac{1}{n} + \frac{\left( x_0 - \bar{X} \right)^2} {\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right)^{1/2} \]

  • The standard error for predicting a new outcome \(y_0\) at \(x_0\): \[ \widehat{\sigma} \left( 1 + \frac{1}{n} + \frac{\left( x_0 - \bar{X} \right)^2} {\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \right)^{1/2} \] Both are checked numerically in the sketch below.
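
These two standard errors can be computed by hand at a single point and checked against predict() (a short sketch; x0 = 0.2 carats is an arbitrary illustrative value, and the diamond data are assumed to be loaded):

x <- diamond$carat; y <- diamond$price; n <- length(x)
fit <- lm(y ~ x)
x0 <- 0.2
s <- summary(fit)$sigma
ssx <- sum((x - mean(x))^2)
seLine <- s * sqrt(1/n + (x0 - mean(x))^2 / ssx)       # SE of the regression line at x0
sePred <- s * sqrt(1 + 1/n + (x0 - mean(x))^2 / ssx)   # SE for predicting a new y at x0
yhat0 <- coef(fit)[1] + coef(fit)[2] * x0
yhat0 + c(-1, 1) * qt(0.975, df = n - 2) * seLine      # interval for the line
yhat0 + c(-1, 1) * qt(0.975, df = n - 2) * sePred      # interval for a new observation
predict(fit, data.frame(x = x0), interval = 'confidence')
predict(fit, data.frame(x = x0), interval = 'prediction')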

data(diamond)
x <- diamond$carat; y <- diamond$price ; n <- length(x)
fit <- lm(y ~ x)
newx <- data.frame(x = seq(min(x), max(x), length = 100))

# interval = 'confidence': interval for the regression line itself
p1 <- data.frame(predict(fit, 
                         newdata = newx, 
                         interval = 'confidence'))
p1$interval <- 'confidence'
p1$x <- newx$x

# interval = 'prediction': interval for a new observation at each x
p2 <- data.frame(predict(fit, 
                         newdata = newx, 
                         interval = 'prediction'))
p2$interval <- 'prediction'
p2$x <- newx$x

dat <- rbind(p1, p2)
names(dat)[1] <- 'y'
g <- ggplot(dat, aes(x = x, y = y))
g <- g + geom_ribbon(aes(ymin = lwr,
                         ymax = upr,
                         fill = interval), alpha = 0.2)
g <- g + geom_line()
g <- g + geom_point(data = data.frame(x=x, y = y),
                    aes(x = x, y = y), size = 2)
g