Regression models are the workhorse of data science. They are among the best described, most practical, and most theoretically well understood models in statistics.
One key insight for regression models is that they produce highly interpretable model fits, unlike machine learning algorithms, which often sacrifice interpretability for improved prediction performance or automation (valuable attributes in their own right).
The simplicity, parsimony, and interpretability offered by regression models (and their close generalizations) typically make them a first tool of choice for any practical problem.
The history of regression starts with the work of Francis Galton, who used parents' heights to predict their children's heights, and in doing so invented both the term and the concept of regression, as well as the term and the concept of correlation, which is intimately tied to linear regression.
Here are some of the types of questions regression analysis can answer, illustrated with the parent and child heights data collected by Francis Galton.
Use a parent’s height to predict their children’s heights.
Find a parsimonious, easily described mean relationship between parents' and children's heights.
Find the variation that is unexplained by the regression model, the so-called residual variation.
Quantify the impact that genotype information has on child height beyond parental height.
Why do children of very tall parents tend to be tall, but a little bit shorter than their parents (regression to the mean)?
install.packages(c('UsingR', 'reshape'),
                 repos = "http://cran.us.r-project.org")
library(UsingR)    ## provides the galton and father.son data sets
library(reshape)   ## provides melt()
library(ggplot2)
data(galton)
## Reshape to long format so the child and parent heights can be faceted
galtonM <- melt(galton)
g <- ggplot(galtonM, aes(x = value, fill = variable))
g <- g + geom_histogram(colour = 'black', binwidth = 1)
g <- g + facet_grid(. ~ variable)
g
Consider only the children's heights. What single number \(\mu\) is the best summary of these heights in the least squares sense? For any candidate \(\mu\),
\[ \begin{align} \sum_{i=1}^{n} \left( Y_i - \mu \right)^2 & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} + \bar{Y}- \mu \right)^2 \\ & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 + 2 \left(\bar{Y} - \mu \right) \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right) + \sum_{i=1}^{n} \left(\bar{Y} - \mu \right)^2 \\ & = \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 + n \left(\bar{Y} - \mu \right)^2 \\ & \geq \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2 \end{align} \] with the equality for
with equality exactly when \(\bar{Y} - \mu = 0\), i.e. the unique minimizer is \(\mu = \bar{Y}\): the empirical mean is the least squares summary.
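As a quick numerical check of this derivation, we can evaluate the sum of squares over a grid of candidate values of \(\mu\) for the children's heights; this is a minimal sketch assuming the galton data loaded above, with an arbitrary grid:
## Evaluate the least squares criterion over a grid of candidate mu values
y <- galton$child
mus <- seq(60, 75, by = 0.01)
sse <- sapply(mus, function(mu) sum((y - mu)^2))
## The grid minimizer agrees with the empirical mean (up to grid resolution)
c(grid = mus[which.min(sse)], mean = mean(y))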
library(dplyr)
## Build a data frame of frequencies for each (child, parent) height pair
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
## Convert the child and parent height variables from factor to numeric
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 10), guide = "none")
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "salmon", high = "white")
g <- g + ggtitle("Parent-Child Height Frequency") +
  xlab("Parent's height [inches]") + ylab("Children's height [inches]")
g
## Re-draw the frequency scatter plot (freqData as built above) and add the
## least squares fit with its standard error band
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 10), guide = "none")
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "salmon", high = "white")
g <- g + geom_smooth(method = 'lm', formula = y ~ x, se = TRUE)
g <- g + ggtitle("Parent-Child Height Frequency")
g
Fitting the regression with both the child and parent heights centered:
lm(I(child - mean(child)) ~ I(parent - mean(parent)), data = galton)
##
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)),
## data = galton)
##
## Coefficients:
## (Intercept) I(parent - mean(parent))
## 3.062e-14 6.463e-01
The intercept is zero up to numerical error, so with centered data we can drop it (the `-1` term) and fit a regression through the origin, recovering the same slope:
lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton)
##
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) -
## 1, data = galton)
##
## Coefficients:
## I(parent - mean(parent))
## 0.6463
We write \(X_1, X_2, \ldots, X_n\) to describe \(n\) data points, but there is nothing special about the letter \(X\); a different letter is often used, such as \(Y_1, Y_2, \ldots, Y_n\).
We typically use Greek letters for things we don't know, such as \(\mu\) for a mean that we'd like to estimate.
The empirical mean is defined as \[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
Notice that if we subtract the mean from the data points, we get a centered data set, i.e. a data set whose (empirical) mean is 0. That is, if
\[ \tilde{X}_i = X_i - \bar{X}, \] the empirical mean of the new data set \(\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n\) is \(0\). This process is called centering the random variable.
The mean is the least squares solution for minimizing \[ \sum_{i=1}^{n} \left( X_i - \mu \right)^2 \]
The empirical variance is defined as \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_{i}^{2} - n \bar{X}^{2} \right) \]
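A minimal numerical check of the shortcut form of the variance, using the children's heights (this snippet is illustrative, not part of the original analysis):
x <- galton$child
n <- length(x)
## var() uses the n - 1 denominator, matching both expressions above
c(var(x), (sum(x^2) - n * mean(x)^2) / (n - 1))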
The empirical standard deviation is defined as \(s = \sqrt{s^2}\). The standard deviation has the same units as the data.
The data defined by \(X_i/s\) have empirical standard deviation 1. This is called scaling the data.
Centering and then scaling the data yields data with empirical mean zero and empirical standard deviation one. This is called normalization: the data defined by \[ Z_i = \frac{X_i - \bar{X}}{s} \] have empirical mean zero and empirical standard deviation 1. Normalized data are centered at 0 and have units equal to standard deviations of the original data.
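A short sketch of centering, scaling, and normalizing the children's heights; note that base R's scale() performs the same normalization in one call:
x <- galton$child
xc <- x - mean(x)             ## centered: empirical mean 0
xs <- x / sd(x)               ## scaled: empirical standard deviation 1
z <- (x - mean(x)) / sd(x)    ## normalized: mean 0, standard deviation 1
round(c(mean(xc), sd(xs), mean(z), sd(z)), 10)
all.equal(z, as.vector(scale(x)))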
The empirical covariance is defined for pairs of data, \(\left( X_i, Y_i \right)\): \[ \text{Cov}\left[ X, Y \right] = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right) = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y} \right). \]
The correlation is a measure of the strength of the linear relationship between two random variables and is defined as \[ \text{Cor}\left[ X, Y \right] = \frac{\text{Cov}\left[ X, Y \right]}{s_x s_y} \] where \(s_x\) and \(s_y\) are the empirical standard deviations of \(X\) and \(Y\).
Some properties of the correlation:
\(\text{Cor}\left[ X, Y \right] = \text{Cor}\left[ Y, X \right]\)
\(-1 \leq \text{Cor}\left[ X, Y \right] \leq 1\)
\(\text{Cor}\left[ X, Y \right] = 1\) or \(-1\) only when the \(\left( X_i, Y_i \right)\) observations fall perfectly on a positively or negatively sloped line, respectively
\(\text{Cor}\left[ X, Y \right]\) is unitless
\(\text{Cor}\left[ X, Y \right] = 0\) implies no linear relationship
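The shortcut form of the covariance and the definition of the correlation can be checked on the parent-child heights (an illustrative sketch):
x <- galton$parent
y <- galton$child
n <- length(x)
## Covariance: definition vs. shortcut form
c(cov(x, y), (sum(x * y) - n * mean(x) * mean(y)) / (n - 1))
## Correlation as covariance divided by the product of standard deviations
c(cor(x, y), cov(x, y) / (sd(x) * sd(y)))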
We consider the parent-child height data set again.
Let \(Y_i\) be the \(i\)th child's height and \(X_i\) be the \(i\)th parent's height. Consider finding the best line \[ \text{Child's height} = \beta_0 + \beta_1 \times \text{Parent's height} \] Using least squares means minimizing the residual errors, i.e. minimizing \[ \sum_{i=1}^{n} \left( Y_i - \left(\beta_0 + \beta_1 X_i \right)\right)^2 \]
The least squares model fit to the line \(Y = \beta_{0} + \beta_{1} X\) through the data pairs \((X_i, Y_i)\), with \(Y_{i}\) as the outcome, obtains the line \[ Y = \widehat{\beta}_{0}+\widehat{\beta}_{1}X \] where \[ \widehat{\beta}_{1} = \text{Cor}\left[ X, Y\right] \frac{\text{sd}(Y)}{\text{sd}(X)} ,\quad \widehat{\beta}_{0} = \bar{Y} - \widehat{\beta}_{1} \bar{X} \]
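As a sanity check, we can minimize the least squares criterion directly with a general-purpose optimizer and compare against the closed-form solution; this sketch uses optim() with its default Nelder-Mead method and arbitrary starting values c(0, 0):
## Minimize the sum of squared residuals numerically
rss <- function(beta, x, y) sum((y - beta[1] - beta[2] * x)^2)
fit <- optim(c(0, 0), rss, x = galton$parent, y = galton$child)
## The numeric minimizer approximately recovers the lm() coefficients
rbind(numeric = fit$par, closed_form = coef(lm(child ~ parent, data = galton)))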
\(\widehat{\beta}_{1}\) has the units of \(Y/X\)
\(\widehat{\beta}_{0}\) has the units of \(Y\)
The line passes through the point \(\left(\bar{X}, \bar{Y} \right)\) (checked numerically in the sketch after this list)
The slope of the regression line with \(X\) as the outcome and \(Y\) as the predictor is \[ \widehat{\beta}_{1} = \text{Cor}\left[ X, Y\right] \frac{\text{sd}(X)}{\text{sd}(Y)} \]
The slope is the same if the data are centered first.
The slope corresponding to normalized data is \[ \text{Cor}\left[ X, Y\right] \]
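A quick check of the claim that the fitted line passes through \(\left(\bar{X}, \bar{Y}\right)\), using the galton data:
fit <- lm(child ~ parent, data = galton)
## The fitted value at the mean parent height equals the mean child height
c(predict(fit, data.frame(parent = mean(galton$parent))), mean(galton$child))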
## Re-draw the frequency scatter plot (freqData as built above)
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 10), guide = "none")
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "salmon", high = "white")
g <- g + ggtitle("Parent-Child Height Frequency")
g
## Compute the least squares slope and intercept by hand and compare with lm()
x <- galton$parent
y <- galton$child
beta1 <- cor(x, y) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
rbind(c(beta0, beta1), coef(lm(y ~ x)))
## (Intercept) x
## [1,] 23.94153 0.6462906
## [2,] 23.94153 0.6462906
## Reverse the roles: regress the parent's height on the child's height
x <- galton$parent
y <- galton$child
beta1 <- cor(x, y) * sd(x) / sd(y)
beta0 <- mean(x) - beta1 * mean(y)
rbind(c(beta0, beta1), coef(lm(x ~ y)))
## (Intercept) y
## [1,] 46.13535 0.3256475
## [2,] 46.13535 0.3256475
## Centering the data leaves the slope unchanged
xc <- x - mean(x)
yc <- y - mean(y)
beta1 <- cor(xc, yc) * sd(yc) / sd(xc)
c(beta1, coef(lm(yc ~ xc))[2])
## xc
## 0.6462906 0.6462906
## The slope for normalized data is the correlation
xn <- (x - mean(x)) / sd(x)
yn <- (y - mean(y)) / sd(y)
c(cor(x, y), cor(xn, yn), coef(lm(yn ~ xn))[2])
## xn
## 0.4587624 0.4587624 0.4587624
## Re-draw the frequency scatter plot (freqData as built above) with the
## fitted least squares line
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(2, 10), guide = "none")
g <- g + geom_point(aes(colour = freq, size = freq))
g <- g + scale_colour_gradient(low = "salmon", high = "white")
g <- g + geom_smooth(method = 'lm', formula = y ~ x)
g <- g + ggtitle("Parent-Child Height Frequency")
g
Regression to the mean was introduced by Francis Galton in the 1886 paper "Regression towards mediocrity in hereditary stature".
Imagine simulating pairs of independent random normals. The largest values of the first draw are large partly by chance, so their paired second draws tend to fall closer to the mean (see the short simulation at the end of this section).
Think of the regression line as the intrinsic part of the relationship, and regression to the mean as the result of the noise around it.
Suppose that we normalize \(X\) (child’s height) and \(Y\) (parent’s height) so that they both have mean 0 and variance 1.
Then, recall, our regression line passes through \((0, 0)\), the means of \(X\) and \(Y\) respectively.
The slope of the regression line is \(\text{Cor} \left[ X, Y \right]\), regardless of which variable is the outcome.
If \(X\) is the outcome and you create a plot where \(X\) is the horizontal axis, the slope of the least squares line that you plot is \(1/\text{Cor}\left[X,Y\right]\)
## Normalize the son's (x) and father's (y) heights from the father.son data
x <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
y <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
rho <- cor(x, y)
g <- ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g <- g + geom_point(size = 6, color = 'black', alpha = 0.2)
g <- g + geom_point(size = 4, color = 'salmon', alpha = 0.2)
#g <- g + xlim(-4, 4) + ylim(-4, 4)
g <- g + geom_abline(intercept = 0, slope = 1)  ## identity line
g <- g + geom_vline(xintercept = 0)
g <- g + geom_hline(yintercept = 0)
## Regression of y on x has slope rho; with x as the outcome, the plotted
## line has slope 1 / rho
g <- g + geom_abline(intercept = 0, slope = rho, linewidth = 2)
g <- g + geom_abline(intercept = 0, slope = 1 / rho, linewidth = 2)
g
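Finally, here is the short simulation of the "pairs of random normals" intuition mentioned above; the threshold 2 and the correlation 0.5 are arbitrary illustrative choices:
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)   ## independent of x, so Cor[X, Y] is approximately 0
## Among pairs whose first draw is extreme, the second draw sits much
## closer to the mean on average
c(mean(x[x > 2]), mean(y[x > 2]))
## With correlation rho, the second draw regresses only partway to the mean
rho <- 0.5
y2 <- rho * x + sqrt(1 - rho^2) * rnorm(n)
c(mean(x[x > 2]), mean(y2[x > 2]))   ## second entry is roughly rho times the first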