Module 1: Introduction to Regression & Least Squares

Regression models are the workhorse of data science. They are the most well described, practical and theoretically understood models in statistics.

One of the key insight for regression models is that they produce highly interpretable model fits, unlike machine learning algorithms, which often sacrifice interpretability for improved prediction performance or automation, which are valuable attributes in their own rights.

The benefit of simplicity, parsimony and intrepretability offered by regression models (and their close generalizations) typically makes them a first tool of choice for any practical problem.

Regression history starts with the work of Francis Galton, who used the parent’s heights to predict the children’s heights and invented the term and the concept regression, and the term and the concept correlation, intimately tied to linear regression.

The type of questions regression analysis can answer, using the parents and children heights dataset used by Francis Galton.

  1. Use a parent’s height to predict their children’s heights.

  2. Find a parsimonious and easily described mean relationships between the parent’s and child’s height.

  3. Find the variation that’s unexplained by the regression model, the so called residual variation.

  4. Quantify the impact the genotype information has beyond parental height explaining child’s height.

  5. Why do children of very tall parents tend to be tall, but a little bit shorter than their parents (regression to the mean).

Galton’s Data

install.packages(c('UsingR', 'reshape'),
                 repos = "http://cran.us.r-project.org")

library(UsingR)
library(reshape)
data(galton)
galtonM <- melt(galton)
g <- ggplot(galtonM, 
            aes(x = value, fill = variable))
g <- g + geom_histogram(colour = 'black',
                        binwidth = 1)
g <- g + facet_grid(.~variable)
g

Finding the middle via least squares

Consider only the children’s heights.

  • How could one describe the middle?
  • One definition, let \(Y_i\) be the height of child \(i\) for \(i = 1,\ldots, n = 928\), then define the middle as the value of \(\mu\) that minimizes: \[ \sum_{i=1}^{n} \left( Y_i - \mu \right)^2 \]
  • This is the physical center of mass of the histogram.
  • The answer is \(\mu = \bar{Y}\).

\[ \begin{align} \sum_{i=1}^{n} \left( Y_i - \mu \right)^2 & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} + \bar{Y}- \mu \right)^2 \\ & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 + 2 \left(\bar{Y} - \mu \right) \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right) + \sum_{i=1}^{n} \left(\bar{Y} - \mu \right)^2 \\ & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 + n \left(\bar{Y} - \mu \right)^2 \\ & \geq \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 \end{align} \] with the equality for

\[ \begin{align} \bar{Y} - \mu = 0 \end{align} \] i.e. the unique minimizer is \(\bar{Y} = \mu\).

Comparing childeren’s height and their parents heights

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Create a dataframe corresponding to frequencies of each pair (child, parent) and set the names accordingly
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")

## Transform both the child and parents hight variables to numeric class.
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))

g <- ggplot(filter(freqData, freq > 0), 
            aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 10), 
                     guide = "none" )
g <- g + geom_point(aes(colour=freq, 
                        size = freq))
g <- g + scale_colour_gradient(low = "salmon",
                               high="white")  

g <- g + ggtitle("Parents-Children Height Frequency") + xlab("Parent's high [inches]") + ylab("Childern's high [inches]")

g

Regression to the origin

library(dplyr)
## Create a dataframe corresponding to frequencies of each pair (child, parent) and set the names accordingly
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")

## Transform both the child and parents hight variables to numeric class.
freqData$child <- as.numeric(as.character(freqData$child))
#freqData$child <- freqData$child - mean(freqData$child)

freqData$parent <- as.numeric(as.character(freqData$parent))
#freqData$parent <- freqData$parent - mean(freqData$parent)

g <- ggplot(filter(freqData, freq > 0), 
            aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 10), 
                     guide = "none" )
g <- g + geom_point(aes(colour=freq, 
                        size = freq))
g <- g + scale_colour_gradient(low = "salmon",
                               high="white")  
g <- g + geom_smooth(method='lm', 
                     formula = y ~ x,
                     se = TRUE)

g <- g + ggtitle("Parents-Children Height Frequency")

g

lm(I(child - mean(child)) ~ I(parent - mean(parent)), data = galton)
## 
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)), 
##     data = galton)
## 
## Coefficients:
##              (Intercept)  I(parent - mean(parent))  
##                3.062e-14                 6.463e-01
lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)
## 
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) - 
##     1, data = galton)
## 
## Coefficients:
## I(parent - mean(parent))  
##                   0.6463

Module 2: Linear Least Squares

Notice that if we substract the mean from data points we get data set that is centered, i.e. a data set that has a corresponding (empirical) mean 0. That is, if

\[ \bar{X}_i = X_i - \bar{X}, \] the empirical mean of the new dataset \(\bar{X}_1, \bar{X}_2, \ldots \bar{X}_n\) is \(0\). This process is called centering the random variable.

Least squares estimation of regression lines

We consider the parents-childern height data set again.

Let \(Y_i\) be the \(i\)th child’s height and \(X\) be the \(i\)th parent’s height. Consider finde the best line \[ \text{Child's height} = \beta_0 + \text{Parent's height} \times \beta_1 \] Using least squares means minimizing the residuals errors, i.e. minimizing \[ \sum_{i=1}^{n} \left( Y_i - \left(\beta_0 + \beta_1 X_i \right)\right)^2 \]

The least squares midel fit to the line \(Y = \beta_{0} + \beta_{1} X\) through the data pairs \((X_i, Y_i)\) with \(Y_{i}\) as the outcome obtains the line \[ Y = \widehat{\beta}_{0}+\widehat{\beta}_{1}X \] where \[ \widehat{\beta}_{1} = \text{Cor}\left[ X, Y\right] \frac{\text{sd}(Y)}{\text{sd}(X)} ,\quad \widehat{\beta}_{0} = \bar{Y} - \widehat{\beta}_{1} \bar{X} \]

  • \(\widehat{\beta}_{1}\) has the units of \(Y/X\)

  • \(\widehat{\beta}_{0}\) has the units of \(Y\)

  • The line passes throug the point \(\left(\bar{X}, \bar{Y} \right)\)

  • The slope of the regression line with \(X\) as the outcome and \(Y\) as the the predictior is \[ \widehat{\beta}_{1} = \text{Cor}\left[ X, Y\right] \frac{\text{sd}(X)}{\text{sd}(Y)} \]

  • The slope is the same one if data is centered.

  • The slope corresponding to normalized data is \[ \text{Cor}\left[ X, Y\right] \]

## Create a dataframe corresponding to frequencies of each pair (child, parent) and set the names accordingly
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")

## Transform both the child and parents hight variables to numeric class.
freqData$child <- as.numeric(as.character(freqData$child))
#freqData$child <- freqData$child - mean(freqData$child)

freqData$parent <- as.numeric(as.character(freqData$parent))
#freqData$parent <- freqData$parent - mean(freqData$parent)

g <- ggplot(filter(freqData, freq > 0), 
            aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 10), 
                     guide = "none" )
g <- g + geom_point(aes(colour=freq, 
                        size = freq))
g <- g + scale_colour_gradient(low = "salmon",
                               high="white")  
g <- g + ggtitle("Parents-Children Height Frequency")

g

x <- galton$parent
y <- galton$child

beta1 <- cor(x,y) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0,beta1), coef(lm(y~x)))
##      (Intercept)         x
## [1,]    23.94153 0.6462906
## [2,]    23.94153 0.6462906
x <- galton$parent
y <- galton$child

beta1 <- cor(x,y) * sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0,beta1), coef(lm(x~y)))
##      (Intercept)         y
## [1,]    46.13535 0.3256475
## [2,]    46.13535 0.3256475
xc <- x - mean(x)
yc <- y - mean(y)

beta1 <- cor(xc, yc)* sd(yc) / sd(xc)
c(beta1, coef(lm(yc~xc))[2])
##                  xc 
## 0.6462906 0.6462906
xn <- (x - mean(x))/sd(x)
yn <- (y - mean(y))/sd(y)

c(cor(x, y), cor(xn, yn), coef(lm(yn~xn))[2])
##                            xn 
## 0.4587624 0.4587624 0.4587624
## Create a dataframe corresponding to frequencies of each pair (child, parent) and set the names accordingly
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")

## Transform both the child and parents hight variables to numeric class.
freqData$child <- as.numeric(as.character(freqData$child))
#freqData$child <- freqData$child - mean(freqData$child)

freqData$parent <- as.numeric(as.character(freqData$parent))
#freqData$parent <- freqData$parent - mean(freqData$parent)

g <- ggplot(filter(freqData, freq > 0), 
            aes(x = parent, y = child))
g <- g  + scale_size(range = c(2, 10), 
                     guide = "none" )
g <- g + geom_point(aes(colour=freq, 
                        size = freq))
g <- g + scale_colour_gradient(low = "salmon",
                               high="white")  
g <- g + ggtitle("Parents-Children Height Frequency")

g <- g + geom_smooth(method = 'lm', formula = y~x)
g

Module 3: Regression to the Mean

x <- (father.son$sheight - mean(father.son$sheight))/sd(father.son$sheight)
y <- (father.son$fheight - mean(father.son$fheight))/sd(father.son$fheight)

rho <- cor(x, y)

g <- ggplot(data.frame(x = x, y = y), 
            aes(x = x, y = y))
g <- g + geom_point(size = 6, color = 'black', alpha = 0.2)
g <- g + geom_point(size = 4, color = 'salmon', alpha = 0.2)
#g <- g + xlim(-4, 4) + ylim(-4, 4)
g <- g + geom_abline(intercept = 0, slope = 1)
g <- g + geom_vline(xintercept = 0)
g <- g + geom_hline(yintercept = 0)
g <- g + geom_abline(yintercept = 0, slope = rho, size = 2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## Warning in geom_abline(yintercept = 0, slope = rho, size = 2): Ignoring unknown
## parameters: `yintercept`
g <- g + geom_abline(yintercept = 0, slope = 1/rho, size = 2)
## Warning in geom_abline(yintercept = 0, slope = 1/rho, size = 2): Ignoring
## unknown parameters: `yintercept`
g