In this document, I play around with the basics of regression lines in the context of John Galton’s father and son height data. Basic questions:

  1. Is there a correlation between father and son heights?
  2. What is the regression line for the data set?
  3. What is the [M]ean [S]quare [E]rror and [R]oot [M]ean [S]quare error?
# The father son height data can be loaded from the UsingR package. There is another package called HistData that contains many interesting historical datasets including Snow's work on cholera which I find particularly interesting.
suppressWarnings(suppressMessages(library(UsingR)))
data(father.son)

# Check out the structure of the data set
head(father.son)
##    fheight  sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
names(father.son)
## [1] "fheight" "sheight"
# Load ggplot to plot data
library(ggplot2)
ggplot(father.son, aes(x=fheight, y=sheight)) + geom_point(size=2, alpha=0.7) + xlab("Height of father") + ylab("Height of son") + ggtitle("Father-son Height Data")

Now let us use R commands to compute the means, standard deviations and the correlation in the data sets.

# mean and standard deviations of the father and son heights
shmean <- mean(father.son$sheight)
fhmean <- mean(father.son$fheight)

shsd <- sd(father.son$sheight)
fhsd <- sd(father.son$fheight)

# correlation of father.son data - the off-diagonal terms are of interest
fscor <- cor(father.son)[1, 2] 

The regression line is of the for \(Y_i = a + b x_i + \epsilon_i\), where, \(b = rs_y/s_x\) (slope), \(a = \bar{y} - b \bar{x}\) (y-intercept) and \(\epsilon_i\) is the residual. From the data we can then find the estimates for the parameters \(a, b\) denoted \(\hat{a}, \hat{b}\).

# slope = r*s_y/s_x - father data is on the x axis.
bhat <- fscor * shsd / fhsd

# y-intercept = mean(y) - slope*mean(x)
ahat <- shmean - bhat*fhmean

# print regression line parameters
ahat
## [1] 33.8866
bhat
## [1] 0.514093

Let us compute the MSE and RMS error.

# computation of mean squared data average of residula error squared
MSE <- sum((father.son$sheight - (ahat + bhat * father.son$fheight))^2) / dim(father.son)[1]
MSE
## [1] 5.92579
# root mean squared error
RMSE <- sqrt(MSE)
RMSE
## [1] 2.434294
# minimum and maximum father height
fhmin <- min(father.son$fheight)
fhmax <- max(father.son$fheight)

# equally space points between from the min-max height interval
xdat <- (fhmax - fhmin) * seq(0, 1, 0.01) + fhmin
ydat <- ahat + bhat*xdat

# regression line data frame
regressionLine <- data.frame(xdat, ydat)
names(regressionLine) <- c("sheight", "fheight")

# plot of data set with regression line
ggplot(father.son, aes(x=fheight, y=sheight)) + 
        geom_point(size=2, alpha=0.7) + 
        geom_line(data=regressionLine, aes(x=sheight, y=fheight), lwd=1.5, colour="red") + 
        xlab("Height of father") + 
        ylab("Height of son") + 
        ggtitle("Father-son Height Data")

# using the built in linear regression model in R to fit the data
summary(lm(sheight ~ fheight, data=father.son))
## 
## Call:
## lm(formula = sheight ~ fheight, data = father.son)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8772 -1.5144 -0.0079  1.6285  8.9685 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.88660    1.83235   18.49   <2e-16 ***
## fheight      0.51409    0.02705   19.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.437 on 1076 degrees of freedom
## Multiple R-squared:  0.2513, Adjusted R-squared:  0.2506 
## F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16

The regression line is the line of best fit of the means and therefore separates the data points approximately equally above and below the lines.

Question: [True or False] Because the sons average an inch taller than the fathers, if the father is 72 inches tall, it is 50-50 whether the son is 73 inches tall.

False, the regression line value is 70.9013031, which is the 50-50 point. Hence, the chance of the son being shorter is higher.