Introduction to Regression

Celia Evans

Galton

Francis Galton, who lived and worked during the 19th century was a problematic figure in many ways. Galton is generally credited with the discovery of regression in his paper “Regression Toward Mediocrity in Hereditary Stature.”

Galton collected data on 928 families. He computed the average height of the parents after compensating for the differences due to sex. And measured the height of their child after reaching adulthood. (Think about trying to do this without the internet!)

Anyway, in doing this study Galton developed a simple tool which has since developed into one of the most powerful tools in the Data Scientist’s toolbox.

Galton’s Data

summary(galton)
##      parent          child      
##  Min.   :64.00   Min.   :61.70  
##  1st Qu.:67.50   1st Qu.:66.20  
##  Median :68.50   Median :68.20  
##  Mean   :68.31   Mean   :68.09  
##  3rd Qu.:69.50   3rd Qu.:70.20  
##  Max.   :73.00   Max.   :73.70
galton
## # A tibble: 928 x 2
##    parent child
##     <dbl> <dbl>
##  1   70.5  61.7
##  2   68.5  61.7
##  3   65.5  61.7
##  4   64.5  61.7
##  5   64    61.7
##  6   67.5  62.2
##  7   67.5  62.2
##  8   67.5  62.2
##  9   66.5  62.2
## 10   66.5  62.2
## # … with 918 more rows

Plotting Galton’s Data

galton %>%
  ggplot(aes(x = parent, y = child)) +
  geom_point(alpha = 1/10)

EDA of Galton’s Data Set

Let’s look at the distribution of the Parent’s and Child’s height. First we’ll tidy a bit to make it easier to separate parent and child

galton %>%
  pivot_longer(cols = c(parent, child), names_to = "relation", values_to = "height" ) %>%
  ggplot(aes(x = height, fill = relation)) +
  geom_histogram(color = "gray", binwidth = 1) +
  facet_grid(.~relation)

Short Digresion into Probability and Statistics.

Recall the mean, \(\mu\), is a measure of the center of the data set. You may also recall, that \(\mu\) is a parameter which can only be estimated due to the impossibility of collecting measurements on the entire population. Thus we use the sample mean (an arithmetic average of the sample data) to approximate the population mean.

If we represent the parent’s height by the random variable \(X\) and the child’s height by the random variable \(Y\). Then we can rephrase Galton’s goal as: find a mathematical expression that describes the relationship between \(X\) and \(Y\), more ambitiously find a function, \(F\) such that:

\[ F(X) = Y \] In a little more detail, given data \(X_1,X_2,\dots,X_n\) and \(Y_1,Y_2,\ldots,Y_n\), find \(F\) such that \(F(X_i) = Y_i\) for \(i = 1,\ldots,n\)

This is clearly impossible for the data we have. Why?

Expected Value and the arithmetic mean

In your statistics class you will have learned, I hope, that collecting a lot of means of a given population gives us a way to estimate the mean of the population. And, moreover, the variance of the sample means is inversely proportional to the sample size.

This idea is the basis of statistical inference.

In Probability we have the idea of the Expected Value For a random variable \(X\), with some probability distribution, the Expected Value of \(X\) is denoted by \(E(X)\) and we say the the mean of the distribution of \(X\) is equal to \(E(X)\) which we also denote by \(\mu_X\).

Modeling Galton’s data

ggplot(galton,aes(x = parent, y = child)) +
  geom_point(alpha = 1/10) +
  geom_smooth(method = "lm", se = FALSE, color = "red")
## `geom_smooth()` using formula 'y ~ x'

Looking a little deeper.

mean_child = mean(galton$child)
mean_parent = mean(galton$parent)

ggplot(galton,aes(x = parent, y = child)) +
  geom_point(alpha = 1/10) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_vline(aes(xintercept = mean(parent)), color = "blue") + 
  geom_hline(aes(yintercept = mean(child)), color ="purple")
## `geom_smooth()` using formula 'y ~ x'

A little more mathematical probability

So what about the imposibility of finding the \(F\)?

A toy problem

n = 1000
x = 1:n
m = 2
b = 1
y = m*x + b + rnorm(n,mean=0,sd = 2.5)
dat <- tibble(x, y)
ggplot(dat,aes(x = x, y = y)) +
  geom_point()

The R function lm

fit <- lm(y ~ x, data = dat)
coef(fit)
## (Intercept)           x 
##    1.035026    1.999974

Fitted values

fitted <-function(x,m,b) {
  m*x + b
}
m = coef(fit)[[2]] # slope
b = coef(fit)[[1]] # intercept

regfit <- fitted(dat$x, m, b)
dat <- dat %>% mutate(fit = regfit, res = y - regfit )

dat %>%
  ggplot() +
    geom_point(aes(x = x, y = y)) +
    geom_abline(aes(slope = m, intercept = b), color = "red") +
  ggtitle("Points and fitted linear model")

Fitted values

Sometimes called predicted values, these values are the values we would obtain from the \(X\) data if the model was exact and are denoted by \(\hat{Y}\). In other words, \[ \hat{Y} = m X + b\]

dat %>%
  ggplot() +
    geom_point(aes(x = x, y = y)) +
    geom_abline(aes(slope = m, intercept = b), color = "red") +
    geom_point(aes(x = x,y = fit), color = 'red') +
  ggtitle("Fitted Values")

Residuals

The residual is the difference between the observed value of \(Y\) and the value \(\hat{Y}\),that is \[Y - \hat{Y}\]

dat %>%
  ggplot() +
    geom_point(aes(x = x, y = y)) +
    geom_abline(aes(slope = m, intercept = b), color = "red") +
    geom_point(aes(x = x,y = fit), color = 'red') +
    geom_segment(aes(x = x, y = regfit,xend = x, yend = y),color = "black") +
  ggtitle("Fitted Values and Residuals")

The residuals are the lengths of the vertical black lines in the plot.

dat %>%
  ggplot(aes(x = x, y = res)) +
    geom_point() +
    geom_hline(aes(yintercept = 0), color = "red")

galton
## # A tibble: 928 x 2
##    parent child
##     <dbl> <dbl>
##  1   70.5  61.7
##  2   68.5  61.7
##  3   65.5  61.7
##  4   64.5  61.7
##  5   64    61.7
##  6   67.5  62.2
##  7   67.5  62.2
##  8   67.5  62.2
##  9   66.5  62.2
## 10   66.5  62.2
## # … with 918 more rows
fit <- lm(child ~ parent, data = galton)
fit
## 
## Call:
## lm(formula = child ~ parent, data = galton)
## 
## Coefficients:
## (Intercept)       parent  
##     23.9415       0.6463

Child = 0.6463 * parent + 23.9415 + error

ggplot(data = galton, aes(x = parent, y = child)) +
  geom_point(alpha = 1/10) +
  geom_abline(aes(slope = 0.6463, intercept = 23.9415),color = "purple")

ggplot(data = galton) +
  geom_point(aes(x = parent, y = fit$residuals)) +
  geom_hline(aes(yintercept = 0), color = 'purple')

ggplot() +
  geom_histogram(aes(x = fit$residuals),binwidth = 1)

sd(fit$residuals)
## [1] 2.237339