Regression and linear models are the go-to tools before doing any machine learning; regression is good at generating results that are parsimonious.
library(UsingR); data(galton); library(reshape2); long <- melt(galton)
## No id variables; using all as measure variables
g <- ggplot(long, aes(x=value, fill=variable))
g <- g + geom_histogram(colour = "black", binwidth=1)
g <- g + facet_grid(. ~ variable)
g
The best predictor of the child’s height is the ‘middle’ height: the value mu that minimizes the squared distance between the Yi (the children’s heights) and that value. That minimizer is the physical center of mass of the histogram, and it equals the mean (y_bar).
library(manipulate)
myHist <- function(mu){
mse <- mean((galton$child-mu)^2)
g <- ggplot(galton, aes(x=child)) + geom_histogram(colour = "black", fill = "salmon", binwidth = 1)
g <- g + geom_vline(xintercept = mu, size = 3)
g <- g + ggtitle(paste("mu = ", mu, ", MSE = ", round(mse,2), sep = ""))
g
}
# manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
So as we move mu toward the center of mass of the histogram, the mean squared error gets smaller, reaching its minimum at the sample mean.
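As a non-interactive check, we can let optimize() search for the MSE-minimizing mu over the same range as the slider and compare it with the sample mean (a minimal sketch; the helper name mseFun is my own):
mseFun <- function(mu) mean((galton$child - mu)^2)  # MSE for a candidate mu
opt <- optimize(mseFun, interval = c(62, 74))       # numerical minimization over the slider range
c(optimal_mu = opt$minimum, sample_mean = mean(galton$child))  # the two should agree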
The graph below shows where the actual mean is:
g <- ggplot(galton, aes(x=child)) + geom_histogram(colour = "black", fill = "salmon", binwidth = 1)
g <- g + geom_vline(xintercept=mean(galton$child), size = 3)
g
Comparing the child’s height with the parent’s height:
g <- ggplot(galton, aes(x=child, y=parent)) + geom_point()
g
The problem with the plot above is that we don’t know how many observations underlie each point; many (parent, child) pairs occur more than once and are plotted on top of each other.
If we center the data around 0, all we need to find the best line is the slope beta that minimizes the sum of the squared differences between the Yi and beta * Xi. This is regression through the origin (after centering, the origin is the point of the two means).
myPlot <- function(beta){
  y <- galton$child - mean(galton$child)    # centered child heights
  x <- galton$parent - mean(galton$parent)  # centered parent heights
  freqData <- as.data.frame(table(x, y))
  names(freqData) <- c("parent", "child", "freq")  # table(x, y) lists x (parent) first, then y (child)
  plot(
    as.numeric(as.vector(freqData$parent)),
    as.numeric(as.vector(freqData$child)),
    pch = 21, col = "black", bg = "lightblue",
    cex = .15 * freqData$freq,  # point size reflects how many observations sit at that pair
    xlab = "parent",
    ylab = "child"
  )
  abline(0, beta, lwd = 3)      # candidate line through the origin with slope beta
  points(0, 0, cex = 2, pch = 19)
  mse <- mean((y - beta * x)^2)
  title(paste("beta = ", beta, ", mse = ", round(mse, 3), sep = ""))
}
# manipulate(myPlot(beta), beta = slider(0.6, 1.2, step = 0.02))
A slope of 1 in the plot above would mean using the parent’s height (as a deviation from its mean) exactly as the prediction of the child’s height.
The quick and easy way to get this slope is via the lm function (the -1 removes the intercept, forcing the line through the origin):
lm(I(child-mean(child)) ~ I(parent - mean(parent))-1, data = galton)
##
## Call:
## lm(formula = I(child - mean(child)) ~ I(parent - mean(parent)) -
## 1, data = galton)
##
## Coefficients:
## I(parent - mean(parent))
## 0.6463
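For regression through the origin, the minimizing slope also has a closed form, sum(x * y) / sum(x^2); computing it directly on the centered data should reproduce the lm coefficient (a quick sketch; the variable names cx and cy are my own):
cy <- galton$child - mean(galton$child)    # centered child heights
cx <- galton$parent - mean(galton$parent)  # centered parent heights
sum(cx * cy) / sum(cx^2)                   # closed-form slope; matches the 0.6463 above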
The following lectures will cover how we use these ideas.
Regression to the mean was invented by Francis Galton.
Imagine that you simulate two sets of independent random normals:
x <- rnorm(100)
y <- rnorm(100)
odr <- order(x)
x[odr[100]] # the maximum value in the x rnorm
## [1] 2.794426
y[odr[100]] # the y paired with that value
## [1] -0.06152614
The probability that Y is less than x, given X = x, gets bigger as x heads into very large values: P(Y < x | X = x) grows. The reverse is also true: P(Y > x | X = x) grows as x heads into very small values. This is a case of 100% regression to the mean, because x and y are completely unrelated. In reality, however, there is usually some correlation between X and Y, so the regression to the mean is only partial.
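A quick simulation makes this concrete (a sketch; the sample size, seed, and 99th-percentile cutoff are arbitrary choices of mine): when X lands in its upper tail, the independent Y is almost always smaller.
set.seed(1)                        # for reproducibility
xs <- rnorm(10000)
ys <- rnorm(10000)                 # independent of xs
big <- xs > quantile(xs, 0.99)     # condition on xs being in its top 1%
mean(ys[big] < xs[big])            # estimates P(Y < x | X = x large); close to 1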
Suppose that we normalize X (the child’s height) and Y (the parent’s height) so that they both have mean 0 and variance 1. Our regression line then passes through (0, 0), the mean of both X and Y, and the slope of the regression is Cor(Y, X), regardless of which variable is the outcome.
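We can check this on the Galton data itself (a sketch using scale() to normalize; either variable can be treated as the outcome):
yc <- as.numeric(scale(galton$child))   # child heights: mean 0, variance 1
xc <- as.numeric(scale(galton$parent))  # parent heights: mean 0, variance 1
coef(lm(yc ~ xc))[2]                    # slope with the child as the outcome
coef(lm(xc ~ yc))[2]                    # slope with the parent as the outcome
cor(galton$child, galton$parent)        # both slopes equal this correlation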
A plot of the normalized data shows the following: if we had a father’s height of 2 (deviations above the mean) and there were no noise, we’d predict a son’s height of 2 as well. But there is a lot of noise in the data, so the prediction is not at Y = 2 but on the regression line, which splits the observations at that X roughly in half above and half below. The corresponding predicted Y is exactly 2 * rho (2 multiplied by the correlation); that shrinkage toward the mean is how we measure regression to the mean.
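A minimal ggplot sketch of such a plot, drawn from the normalized Galton data (my own reconstruction, overlaying the no-noise identity line and the regression line with slope rho):
yc <- as.numeric(scale(galton$child))
xc <- as.numeric(scale(galton$parent))
rho <- cor(xc, yc)
g <- ggplot(data.frame(parent = xc, child = yc), aes(x = parent, y = child)) + geom_point(alpha = 0.3)
g <- g + geom_abline(intercept = 0, slope = 1, linetype = "dashed")  # prediction if there were no noise
g <- g + geom_abline(intercept = 0, slope = rho, size = 1)           # regression line, slope = rho
g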