Celia Evans
Francis Galton, who lived and worked during the 19th century was a problematic figure in many ways. Galton is generally credited with the discovery of regression in his paper “Regression Toward Mediocrity in Hereditary Stature.”
Galton collected data on 928 families. He computed the average height of the parents after compensating for the differences due to sex. And measured the height of their child after reaching adulthood. (Think about trying to do this without the internet!)
Anyway, in doing this study Galton developed a simple tool which has since developed into one of the most powerful tools in the Data Scientist’s toolbox.
## parent child
## Min. :64.00 Min. :61.70
## 1st Qu.:67.50 1st Qu.:66.20
## Median :68.50 Median :68.20
## Mean :68.31 Mean :68.09
## 3rd Qu.:69.50 3rd Qu.:70.20
## Max. :73.00 Max. :73.70
## # A tibble: 928 x 2
## parent child
## <dbl> <dbl>
## 1 70.5 61.7
## 2 68.5 61.7
## 3 65.5 61.7
## 4 64.5 61.7
## 5 64 61.7
## 6 67.5 62.2
## 7 67.5 62.2
## 8 67.5 62.2
## 9 66.5 62.2
## 10 66.5 62.2
## # … with 918 more rows
Let’s look at the distribution of the Parent’s and Child’s height. First we’ll tidy a bit to make it easier to separate parent and child
galton %>%
pivot_longer(cols = c(parent, child), names_to = "relation", values_to = "height" ) %>%
ggplot(aes(x = height, fill = relation)) +
geom_histogram(color = "gray", binwidth = 1) +
facet_grid(.~relation)Recall the mean, \(\mu\), is a measure of the center of the data set. You may also recall, that \(\mu\) is a parameter which can only be estimated due to the impossibility of collecting measurements on the entire population. Thus we use the sample mean (an arithmetic average of the sample data) to approximate the population mean.
If we represent the parent’s height by the random variable \(X\) and the child’s height by the random variable \(Y\). Then we can rephrase Galton’s goal as: find a mathematical expression that describes the relationship between \(X\) and \(Y\), more ambitiously find a function, \(F\) such that:
\[ F(X) = Y \] In a little more detail, given data \(X_1,X_2,\dots,X_n\) and \(Y_1,Y_2,\ldots,Y_n\), find \(F\) such that \(F(X_i) = Y_i\) for \(i = 1,\ldots,n\)
This is clearly impossible for the data we have. Why?
In your statistics class you will have learned, I hope, that collecting a lot of means of a given population gives us a way to estimate the mean of the population. And, moreover, the variance of the sample means is inversely proportional to the sample size.
This idea is the basis of statistical inference.
In Probability we have the idea of the Expected Value For a random variable \(X\), with some probability distribution, the Expected Value of \(X\) is denoted by \(E(X)\) and we say the the mean of the distribution of \(X\) is equal to \(E(X)\) which we also denote by \(\mu_X\).
What do we mean by a model?
A model is a function that captures the most important component of variation in the data.
In the Galton data we might hypothesize the relationship between a parent’s height and the height of their child might be linear, i.e., \[Y = mX + b\] where \(X\) is the average heights of the parents and \(Y\) is the height of the parent.
ggplot and geom_smooth()ggplot(galton,aes(x = parent, y = child)) +
geom_point(alpha = 1/10) +
geom_smooth(method = "lm", se = FALSE, color = "red")## `geom_smooth()` using formula 'y ~ x'
mean_child = mean(galton$child)
mean_parent = mean(galton$parent)
ggplot(galton,aes(x = parent, y = child)) +
geom_point(alpha = 1/10) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
geom_vline(aes(xintercept = mean(parent)), color = "blue") +
geom_hline(aes(yintercept = mean(child)), color ="purple")## `geom_smooth()` using formula 'y ~ x'
For those of you who have studied calculus, expected value is a moment, i.e., \[E(X) = \int_{-\infty}^{\infty} x \rho(x)dx\] where, \(\rho\) is a density function giving the probability that \(X = x\). \(\rho\) also has the properties: \(\rho(x) >= 0\) and \[\int_{-\infty}^{\infty}\rho(x)dx = 1\]
Practically, this means that: if \(X\), the parents height is a good predictor of the child’s height \(Y\), then we should see: \[E(Y) = mE(X) + b\] (ignoring some details for the moment)
Is a model then impossible?
No!
The idea is to add an additional random term representing an error. LThis term is called the residual, denotec by \(\epsilon\) where the Greek letter \(\epsilon\) stands for error.
So in this case the model we would use is: \[Y = mX + b + \epsilon\]
Moreover, we would like \(E(\epsilon) = 0\)
So the problem becomes finding \(m\) and \(b\) so that \(Y = mX + b\) “fits” the data closely, i.e., with a small error whose expected value is zero.
n = 1000
x = 1:n
m = 2
b = 1
y = m*x + b + rnorm(n,mean=0,sd = 2.5)
dat <- tibble(x, y)
ggplot(dat,aes(x = x, y = y)) +
geom_point()lmThe R function lm is used to find the best fit of a linear model in terms of least square, which we will discuss in a minute.
Using lm
## (Intercept) x
## 1.035026 1.999974
fitted <-function(x,m,b) {
m*x + b
}
m = coef(fit)[[2]] # slope
b = coef(fit)[[1]] # intercept
regfit <- fitted(dat$x, m, b)
dat <- dat %>% mutate(fit = regfit, res = y - regfit )
dat %>%
ggplot() +
geom_point(aes(x = x, y = y)) +
geom_abline(aes(slope = m, intercept = b), color = "red") +
ggtitle("Points and fitted linear model")Sometimes called predicted values, these values are the values we would obtain from the \(X\) data if the model was exact and are denoted by \(\hat{Y}\). In other words, \[ \hat{Y} = m X + b\]
dat %>%
ggplot() +
geom_point(aes(x = x, y = y)) +
geom_abline(aes(slope = m, intercept = b), color = "red") +
geom_point(aes(x = x,y = fit), color = 'red') +
ggtitle("Fitted Values")The residual is the difference between the observed value of \(Y\) and the value \(\hat{Y}\),that is \[Y - \hat{Y}\]
dat %>%
ggplot() +
geom_point(aes(x = x, y = y)) +
geom_abline(aes(slope = m, intercept = b), color = "red") +
geom_point(aes(x = x,y = fit), color = 'red') +
geom_segment(aes(x = x, y = regfit,xend = x, yend = y),color = "black") +
ggtitle("Fitted Values and Residuals")The residuals are the lengths of the vertical black lines in the plot.
## # A tibble: 928 x 2
## parent child
## <dbl> <dbl>
## 1 70.5 61.7
## 2 68.5 61.7
## 3 65.5 61.7
## 4 64.5 61.7
## 5 64 61.7
## 6 67.5 62.2
## 7 67.5 62.2
## 8 67.5 62.2
## 9 66.5 62.2
## 10 66.5 62.2
## # … with 918 more rows
##
## Call:
## lm(formula = child ~ parent, data = galton)
##
## Coefficients:
## (Intercept) parent
## 23.9415 0.6463
Child = 0.6463 * parent + 23.9415 + error
ggplot(data = galton, aes(x = parent, y = child)) +
geom_point(alpha = 1/10) +
geom_abline(aes(slope = 0.6463, intercept = 23.9415),color = "purple")ggplot(data = galton) +
geom_point(aes(x = parent, y = fit$residuals)) +
geom_hline(aes(yintercept = 0), color = 'purple')## [1] 2.237339