Regression models are the workhorse of data science. They are among the best described, most practical, and most theoretically well understood models in statistics.
One key insight for regression models is that they produce highly interpretable model fits, unlike machine learning algorithms, which often sacrifice interpretability for improved prediction performance or automation (valuable attributes in their own right).
The simplicity, parsimony, and interpretability offered by regression models (and their close generalizations) typically make them a first tool of choice for any practical problem.
The history of regression starts with the work of Francis Galton, who used parents' heights to predict their children's heights, and in doing so invented both the term and the concept of regression, as well as the term and the concept of correlation, which is intimately tied to linear regression.
Here are some of the types of questions regression analysis can answer, illustrated with the parent and child heights data collected by Francis Galton.
Use a parent’s height to predict their children’s heights.
Find a parsimonious, easily described mean relationship between parents' and children's heights.
Find the variation that is unexplained by the regression model, the so-called residual variation.
Quantify the impact that genotype information has on child height beyond parental height.
Why do children of very tall parents tend to be tall, but a little bit shorter than their parents (regression to the mean)?
install.packages(c('UsingR', 'reshape'),
                 repos = "http://cran.us.r-project.org")
library(UsingR)    ## provides the galton and father.son data sets
library(reshape)   ## provides melt()
library(ggplot2)
data(galton)
## Reshape to long format so the child and parent heights can be faceted
galtonM <- melt(galton)
g <- ggplot(galtonM, aes(x = value, fill = variable))
g <- g + geom_histogram(colour = 'black', binwidth = 1)
g <- g + facet_grid(. ~ variable)
g
Consider only the children's heights. What single number \(\mu\) is the best summary of these heights in the least squares sense? For any candidate \(\mu\),
\[ \begin{align} \sum_{i=1}^{n} \left( Y_i - \mu \right)^2 & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} + \bar{Y}- \mu \right)^2 \\ & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 + 2 \left(\bar{Y} - \mu \right) \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right) + \sum_{i=1}^{n} \left(\bar{Y} - \mu \right)^2 \\ & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 + n \left(\bar{Y} - \mu \right)^2 \\ & \geq \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 \end{align} \] with the equality for
with equality exactly when \(\bar{Y} - \mu = 0\), i.e. the unique minimizer is \(\mu = \bar{Y}\): the empirical mean is the least squares summary.
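As a quick numerical check of this derivation, we can evaluate the sum of squares over a grid of candidate values of \(\mu\) for the children's heights; this is a minimal sketch assuming the galton data loaded above, with an arbitrary grid:
## Evaluate the least squares criterion over a grid of candidate mu values
y <- galton$child
mus <- seq(60, 75, by = 0.01)
sse <- sapply(mus, function(mu) sum((y - mu)^2))
## The grid minimizer agrees with the empirical mean (up to grid resolution)
c(grid = mus[which.min(sse)], mean = mean(y))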
library(dplyr)
## Build a data frame of frequencies for each (child, parent) height pair
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
## Convert the child and parent height variables from factor to numeric
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 10), guide = "none")
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "salmon", high = "white")
g <- g + ggtitle("Parent-Child Height Frequency") +
  xlab("Parent's height [inches]") + ylab("Children's height [inches]")
g
## Re-draw the frequency scatter plot (freqData as built above) and add the
## least squares fit with its standard error band
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 10), guide = "none")
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "salmon", high = "white")
g <- g + geom_smooth(method = 'lm', formula = y ~ x, se = TRUE)
g <- g + ggtitle("Parent-Child Height Frequency")
g
Fitting the regression with both the child and parent heights centered:
lm(I(child - mean(child)) ~ I(parent - mean(parent)), data = galton)
##
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)),
## data = galton)
##
## Coefficients:
## (Intercept) I(parent - mean(parent))
## 3.062e-14 6.463e-01
The intercept is zero up to numerical error, so with centered data we can drop it (the `-1` term) and fit a regression through the origin, recovering the same slope:
lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)
##
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) -
## 1, data = galton)
##
## Coefficients:
## I(parent - mean(parent))
## 0.6463
We write \(X_1, X_2, \ldots, X_n\) to describe \(n\) data points, but there is nothing special about the letter \(X\); a different letter is often used, such as \(Y_1, Y_2, \ldots, Y_n\).
We typically use Greek letters for things we don't know, such as \(\mu\) for a mean that we'd like to estimate.
The empirical mean is defined as \[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
Notice that if we subtract the mean from the data points, we get a centered data set, i.e. a data set whose (empirical) mean is 0. That is, if
\[ \tilde{X}_i = X_i - \bar{X}, \] the empirical mean of the new data set \(\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n\) is \(0\). This process is called centering the random variable.
The mean is the least squares solution for minimizing \[ \sum_{i=1}^{n} \left( X_i - \mu \right)^2 \]
The empirical variance is defined as \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_{i}^{2} - n \bar{X}^{2} \right) \]
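A minimal numerical check of the shortcut form of the variance, using the children's heights (this snippet is illustrative, not part of the original analysis):
x <- galton$child
n <- length(x)
## var() uses the n - 1 denominator, matching both expressions above
c(var(x), (sum(x^2) - n * mean(x)^2) / (n - 1))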
The empirical standard deviation is defined as \(s = \sqrt{s^2}\). The standard deviation has the same units as the data.
The data defined by \(X_i/s\) have empirical standard deviation 1. This is called scaling the data.
Centering and then scaling the data yields data with empirical mean zero and empirical standard deviation one. This is called normalization: the data defined by \[ Z_i = \frac{X_i - \bar{X}}{s} \] have empirical mean zero and empirical standard deviation 1. Normalized data are centered at 0 and have units equal to standard deviations of the original data.
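A short sketch of centering, scaling, and normalizing the children's heights; note that base R's scale() performs the same normalization in one call:
x <- galton$child
xc <- x - mean(x)             ## centered: empirical mean 0
xs <- x / sd(x)               ## scaled: empirical standard deviation 1
z <- (x - mean(x)) / sd(x)    ## normalized: mean 0, standard deviation 1
round(c(mean(xc), sd(xs), mean(z), sd(z)), 10)
all.equal(z, as.vector(scale(x)))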
The empirical covariance is defined for pairs of data, \(\left( X_i, Y_i \right)\): \[ \text{Cov}\left[ X, Y \right] = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right) = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y} \right). \]
The correlation is a measure of the strength of the linear relationship between two random variables and is defined as \[ \text{Cor}\left[ X, Y \right] = \frac{\text{Cov}\left[ X, Y \right]}{s_x s_y} \] where \(s_x\) and \(s_y\) are the empirical standard deviations of \(X\) and \(Y\).
Some properties of the correlation:
\(\text{Cor}\left[ X, Y \right] = \text{Cor}\left[ Y, X \right]\)
\(-1 \leq \text{Cor}\left[ X, Y \right] \leq 1\)
\(\text{Cor}\left[ X, Y \right] = 1\) or \(-1\) only when the \(\left( X_i, Y_i \right)\) observations fall perfectly on a positively or negatively sloped line, respectively
\(\text{Cor}\left[ X, Y \right]\) is unitless
\(\text{Cor}\left[ X, Y \right] = 0\) implies no linear relationship
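The shortcut form of the covariance and the definition of the correlation can be checked on the parent-child heights (an illustrative sketch):
x <- galton$parent
y <- galton$child
n <- length(x)
## Covariance: definition vs. shortcut form
c(cov(x, y), (sum(x * y) - n * mean(x) * mean(y)) / (n - 1))
## Correlation as covariance divided by the product of standard deviations
c(cor(x, y), cov(x, y) / (sd(x) * sd(y)))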
We consider the parent-child height data set again.
Let \(Y_i\) be the \(i\)th child's height and \(X_i\) be the \(i\)th parent's height. Consider finding the best line \[ \text{Child's height} = \beta_0 + \beta_1 \times \text{Parent's height} \] Using least squares means minimizing the residual errors, i.e. minimizing \[ \sum_{i=1}^{n} \left( Y_i - \left(\beta_0 + \beta_1 X_i \right)\right)^2 \]
The least squares model fit to the line \(Y = \beta_{0} + \beta_{1} X\) through the data pairs \((X_i, Y_i)\), with \(Y_{i}\) as the outcome, obtains the line \[ Y = \widehat{\beta}_{0}+\widehat{\beta}_{1}X \] where \[ \widehat{\beta}_{1} = \text{Cor}\left[ X, Y\right] \frac{\text{sd}(Y)}{\text{sd}(X)} ,\quad \widehat{\beta}_{0} = \bar{Y} - \widehat{\beta}_{1} \bar{X} \]
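As a sanity check, we can minimize the least squares criterion directly with a general-purpose optimizer and compare against the closed-form solution; this sketch uses optim() with its default Nelder-Mead method and arbitrary starting values c(0, 0):
## Minimize the sum of squared residuals numerically
rss <- function(beta, x, y) sum((y - beta[1] - beta[2] * x)^2)
fit <- optim(c(0, 0), rss, x = galton$parent, y = galton$child)
## The numeric minimizer approximately recovers the lm() coefficients
rbind(numeric = fit$par, closed_form = coef(lm(child ~ parent, data = galton)))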
\(\widehat{\beta}_{1}\) has the units of \(Y/X\)
\(\widehat{\beta}_{0}\) has the units of \(Y\)
The line passes through the point \(\left(\bar{X}, \bar{Y} \right)\) (checked numerically in the sketch after this list)
The slope of the regression line with \(X\) as the outcome and \(Y\) as the predictor is \[ \widehat{\beta}_{1} = \text{Cor}\left[ X, Y\right] \frac{\text{sd}(X)}{\text{sd}(Y)} \]
The slope is the same if the data are centered first.
The slope corresponding to normalized data is \[ \text{Cor}\left[ X, Y\right] \]
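A quick check of the claim that the fitted line passes through \(\left(\bar{X}, \bar{Y}\right)\), using the galton data:
fit <- lm(child ~ parent, data = galton)
## The fitted value at the mean parent height equals the mean child height
c(predict(fit, data.frame(parent = mean(galton$parent))), mean(galton$child))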
## Re-draw the frequency scatter plot (freqData as built above)
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 10), guide = "none")
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "salmon", high = "white")
g <- g + ggtitle("Parent-Child Height Frequency")
g
## Compute the least squares slope and intercept by hand and compare with lm()
x <- galton$parent
y <- galton$child
beta1 <- cor(x, y) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x)))
## (Intercept) x
## [1,] 23.94153 0.6462906
## [2,] 23.94153 0.6462906
## Reverse the roles: regress the parent's height on the child's height
x <- galton$parent
y <- galton$child
beta1 <- cor(x, y) * sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y)))
## (Intercept) y
## [1,] 46.13535 0.3256475
## [2,] 46.13535 0.3256475
## Centering the data leaves the slope unchanged
xc <- x - mean(x)
yc <- y - mean(y)
beta1 <- cor(xc, yc) * sd(yc) / sd(xc)
c(beta1, coef(lm(yc ~ xc))[2])
## xc
## 0.6462906 0.6462906
## The slope for normalized data is the correlation
xn <- (x - mean(x)) / sd(x)
yn <- (y - mean(y)) / sd(y)
c(cor(x, y), cor(xn, yn), coef(lm(yn ~ xn))[2])
## xn
## 0.4587624 0.4587624 0.4587624
## Re-draw the frequency scatter plot (freqData as built above) with the
## fitted least squares line
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 10), guide = "none")
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "salmon", high = "white")
g <- g + geom_smooth(method = 'lm', formula = y ~ x)
g <- g + ggtitle("Parent-Child Height Frequency")
g
Regression to the mean was introduced by Francis Galton in the 1886 paper "Regression towards mediocrity in hereditary stature".
Imagine simulating pairs of independent random normals. The largest values of the first draw are large partly by chance, so their paired second draws tend to fall closer to the mean (see the short simulation at the end of this section).
Think of the regression line as the intrinsic part of the relationship, and regression to the mean as the result of the noise around it.
Suppose that we normalize \(X\) (child’s height) and \(Y\) (parent’s height) so that they both have mean 0 and variance 1.
Then, recall, our regression line passes through \((0, 0)\), the means of \(X\) and \(Y\) respectively.
The slope of the regression line is \(\text{Cor} \left[ X, Y \right]\), regardless of which variable is the outcome.
If \(X\) is the outcome and you create a plot where \(X\) is the horizontal axis, the slope of the least squares line that you plot is \(1/\text{Cor}\left[X,Y\right]\)
## Normalize the son's (x) and father's (y) heights from the father.son data
x <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
y <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
rho <- cor(x, y)
g <- ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g <- g + geom_point(size = 6, color = 'black', alpha = 0.2)
g <- g + geom_point(size = 4, color = 'salmon', alpha = 0.2)
#g <- g + xlim(-4, 4) + ylim(-4, 4)
g <- g + geom_abline(intercept = 0, slope = 1)  ## identity line
g <- g + geom_vline(xintercept = 0)
g <- g + geom_hline(yintercept = 0)
## Regression of y on x has slope rho; with x as the outcome, the plotted
## line has slope 1 / rho
g <- g + geom_abline(intercept = 0, slope = rho, linewidth = 2)
g <- g + geom_abline(intercept = 0, slope = 1 / rho, linewidth = 2)
g
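Finally, here is the short simulation of the "pairs of random normals" intuition mentioned above; the threshold 2 and the correlation 0.5 are arbitrary illustrative choices:
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)   ## independent of x, so Cor[X, Y] is approximately 0
## Among pairs whose first draw is extreme, the second draw sits much
## closer to the mean on average
c(mean(x[x > 2]), mean(y[x > 2]))
## With correlation rho, the second draw regresses only partway to the mean
rho <- 0.5
y2 <- rho * x + sqrt(1 - rho^2) * rnorm(n)
c(mean(x[x > 2]), mean(y2[x > 2]))   ## second entry is roughly rho times the first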