Galton

Francis Galton, who lived and worked during the 19th century was a problematic figure in many ways. Galton is generally credited with the discovery of regression in his paper “Regression Toward Mediocrity in Hereditary Stature.”

Galton collected data on 928 families. He computed the average height of the parents after compensating for the differences due to sex. And measured the height of their child after reaching adulthood. (Think about trying to do this without the internet!)

Anyway, in doing this study Galton developed a simple tool which has since developed into one of the most powerful tools in the Data Scientist’s toolbox.

Galton’s Data

summary(galton)

##      parent          child      
##  Min.   :64.00   Min.   :61.70  
##  1st Qu.:67.50   1st Qu.:66.20  
##  Median :68.50   Median :68.20  
##  Mean   :68.31   Mean   :68.09  
##  3rd Qu.:69.50   3rd Qu.:70.20  
##  Max.   :73.00   Max.   :73.70

galton

## # A tibble: 928 x 2
##    parent child
##     <dbl> <dbl>
##  1   70.5  61.7
##  2   68.5  61.7
##  3   65.5  61.7
##  4   64.5  61.7
##  5   64    61.7
##  6   67.5  62.2
##  7   67.5  62.2
##  8   67.5  62.2
##  9   66.5  62.2
## 10   66.5  62.2
## # … with 918 more rows

Plotting Galton’s Data

galton %>%
  ggplot(aes(x = parent, y = child)) +
  geom_point(alpha = 1/10)

EDA of Galton’s Data Set

Let’s look at the distribution of the Parent’s and Child’s height. First we’ll tidy a bit to make it easier to separate parent and child

galton %>%
  pivot_longer(cols = c(parent, child), names_to = "relation", values_to = "height" ) %>%
  ggplot(aes(x = height, fill = relation)) +
  geom_histogram(color = "gray", binwidth = 1) +
  facet_grid(.~relation)

Short Digresion into Probability and Statistics.

Recall the mean, \(\mu\), is a measure of the center of the data set. You may also recall, that \(\mu\) is a parameter which can only be estimated due to the impossibility of collecting measurements on the entire population. Thus we use the sample mean (an arithmetic average of the sample data) to approximate the population mean.

If we represent the parent’s height by the random variable \(X\) and the child’s height by the random variable \(Y\). Then we can rephrase Galton’s goal as: find a mathematical expression that describes the relationship between \(X\) and \(Y\), more ambitiously find a function, \(F\) such that:

\[ F(X) = Y \] In a little more detail, given data \(X_1,X_2,\dots,X_n\) and \(Y_1,Y_2,\ldots,Y_n\), find \(F\) such that \(F(X_i) = Y_i\) for \(i = 1,\ldots,n\)

This is clearly impossible for the data we have. Why?

Expected Value and the arithmetic mean

In your statistics class you will have learned, I hope, that collecting a lot of means of a given population gives us a way to estimate the mean of the population. And, moreover, the variance of the sample means is inversely proportional to the sample size.

This idea is the basis of statistical inference.

In Probability we have the idea of the Expected Value For a random variable \(X\), with some probability distribution, the Expected Value of \(X\) is denoted by \(E(X)\) and we say the the mean of the distribution of \(X\) is equal to \(E(X)\) which we also denote by \(\mu_X\).

Modeling Galton’s data

What do we mean by a model?
A model is a function that captures the most important component of variation in the data.
In the Galton data we might hypothesize the relationship between a parent’s height and the height of their child might be linear, i.e., \[Y = mX + b\] where \(X\) is the average heights of the parents and \(Y\) is the height of the parent.

We can look at this idea using ggplot and geom_smooth()

ggplot(galton,aes(x = parent, y = child)) +
  geom_point(alpha = 1/10) +
  geom_smooth(method = "lm", se = FALSE, color = "red")

## `geom_smooth()` using formula 'y ~ x'

Looking a little deeper.

mean_child = mean(galton$child)
mean_parent = mean(galton$parent)

ggplot(galton,aes(x = parent, y = child)) +
  geom_point(alpha = 1/10) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_vline(aes(xintercept = mean(parent)), color = "blue") + 
  geom_hline(aes(yintercept = mean(child)), color ="purple")

## `geom_smooth()` using formula 'y ~ x'

Notice something interesting, the point where the lines corresponding to the mean heights for each variable intesect on the red line.

A little more mathematical probability

For those of you who have studied calculus, expected value is a moment, i.e., \[E(X) = \int_{-\infty}^{\infty} x \rho(x)dx\] where, \(\rho\) is a density function giving the probability that \(X = x\). \(\rho\) also has the properties: \(\rho(x) >= 0\) and \[\int_{-\infty}^{\infty}\rho(x)dx = 1\]
Practically, this means that: if \(X\), the parents height is a good predictor of the child’s height \(Y\), then we should see: \[E(Y) = mE(X) + b\] (ignoring some details for the moment)

So what about the imposibility of finding the \(F\)?

Is a model then impossible?
No!
The idea is to add an additional random term representing an error. LThis term is called the residual, denotec by \(\epsilon\) where the Greek letter \(\epsilon\) stands for error.
So in this case the model we would use is: \[Y = mX + b + \epsilon\]
Moreover, we would like \(E(\epsilon) = 0\)
So the problem becomes finding \(m\) and \(b\) so that \(Y = mX + b\) “fits” the data closely, i.e., with a small error whose expected value is zero.

A toy problem

Let’s make a simple “toy” example, where we know the answer ahead of time.

n = 1000
x = 1:n
m = 2
b = 1
y = m*x + b + rnorm(n,mean=0,sd = 2.5)
dat <- tibble(x, y)
ggplot(dat,aes(x = x, y = y)) +
  geom_point()

The R function `lm`

The R function lm is used to find the best fit of a linear model in terms of least square, which we will discuss in a minute.
Using lm

fit <- lm(y ~ x, data = dat)
coef(fit)

## (Intercept)           x 
##    1.035026    1.999974

Fitted values

fitted <-function(x,m,b) {
  m*x + b
}
m = coef(fit)[[2]] # slope
b = coef(fit)[[1]] # intercept

regfit <- fitted(dat$x, m, b)
dat <- dat %>% mutate(fit = regfit, res = y - regfit )

dat %>%
  ggplot() +
    geom_point(aes(x = x, y = y)) +
    geom_abline(aes(slope = m, intercept = b), color = "red") +
  ggtitle("Points and fitted linear model")

Fitted values

Sometimes called predicted values, these values are the values we would obtain from the \(X\) data if the model was exact and are denoted by \(\hat{Y}\). In other words, \[ \hat{Y} = m X + b\]

dat %>%
  ggplot() +
    geom_point(aes(x = x, y = y)) +
    geom_abline(aes(slope = m, intercept = b), color = "red") +
    geom_point(aes(x = x,y = fit), color = 'red') +
  ggtitle("Fitted Values")

Residuals

The residual is the difference between the observed value of \(Y\) and the value \(\hat{Y}\),that is \[Y - \hat{Y}\]

dat %>%
  ggplot() +
    geom_point(aes(x = x, y = y)) +
    geom_abline(aes(slope = m, intercept = b), color = "red") +
    geom_point(aes(x = x,y = fit), color = 'red') +
    geom_segment(aes(x = x, y = regfit,xend = x, yend = y),color = "black") +
  ggtitle("Fitted Values and Residuals")

The residuals are the lengths of the vertical black lines in the plot.

dat %>%
  ggplot(aes(x = x, y = res)) +
    geom_point() +
    geom_hline(aes(yintercept = 0), color = "red")

galton

## # A tibble: 928 x 2
##    parent child
##     <dbl> <dbl>
##  1   70.5  61.7
##  2   68.5  61.7
##  3   65.5  61.7
##  4   64.5  61.7
##  5   64    61.7
##  6   67.5  62.2
##  7   67.5  62.2
##  8   67.5  62.2
##  9   66.5  62.2
## 10   66.5  62.2
## # … with 918 more rows

fit <- lm(child ~ parent, data = galton)
fit

## 
## Call:
## lm(formula = child ~ parent, data = galton)
## 
## Coefficients:
## (Intercept)       parent  
##     23.9415       0.6463

Child = 0.6463 * parent + 23.9415 + error

ggplot(data = galton, aes(x = parent, y = child)) +
  geom_point(alpha = 1/10) +
  geom_abline(aes(slope = 0.6463, intercept = 23.9415),color = "purple")

ggplot(data = galton) +
  geom_point(aes(x = parent, y = fit$residuals)) +
  geom_hline(aes(yintercept = 0), color = 'purple')

ggplot() +
  geom_histogram(aes(x = fit$residuals),binwidth = 1)

sd(fit$residuals)

## [1] 2.237339

Introduction to Regression

Galton

Galton’s Data

Plotting Galton’s Data

EDA of Galton’s Data Set

Short Digresion into Probability and Statistics.

Expected Value and the arithmetic mean

Modeling Galton’s data

Looking a little deeper.

A little more mathematical probability

So what about the imposibility of finding the \(F\)?

A toy problem

The R function lm

Fitted values

Fitted values

Residuals

The R function `lm`