In this document, I play around with the basics of regression lines in the context of John Galton’s father and son height data. Basic questions:
# The father son height data can be loaded from the UsingR package. There is another package called HistData that contains many interesting historical datasets including Snow's work on cholera which I find particularly interesting.
suppressWarnings(suppressMessages(library(UsingR)))
data(father.son)
# Check out the structure of the data set
head(father.son)
## fheight sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
names(father.son)
## [1] "fheight" "sheight"
# Load ggplot to plot data
library(ggplot2)
ggplot(father.son, aes(x=fheight, y=sheight)) + geom_point(size=2, alpha=0.7) + xlab("Height of father") + ylab("Height of son") + ggtitle("Father-son Height Data")
Now let us use R commands to compute the means, standard deviations and the correlation in the data sets.
# mean and standard deviations of the father and son heights
shmean <- mean(father.son$sheight)
fhmean <- mean(father.son$fheight)
shsd <- sd(father.son$sheight)
fhsd <- sd(father.son$fheight)
# correlation of father.son data - the off-diagonal terms are of interest
fscor <- cor(father.son)[1, 2]
The regression line is of the for \(Y_i = a + b x_i + \epsilon_i\), where, \(b = rs_y/s_x\) (slope), \(a = \bar{y} - b \bar{x}\) (y-intercept) and \(\epsilon_i\) is the residual. From the data we can then find the estimates for the parameters \(a, b\) denoted \(\hat{a}, \hat{b}\).
# slope = r*s_y/s_x - father data is on the x axis.
bhat <- fscor * shsd / fhsd
# y-intercept = mean(y) - slope*mean(x)
ahat <- shmean - bhat*fhmean
# print regression line parameters
ahat
## [1] 33.8866
bhat
## [1] 0.514093
Let us compute the MSE and RMS error.
# computation of mean squared data average of residula error squared
MSE <- sum((father.son$sheight - (ahat + bhat * father.son$fheight))^2) / dim(father.son)[1]
MSE
## [1] 5.92579
# root mean squared error
RMSE <- sqrt(MSE)
RMSE
## [1] 2.434294
# minimum and maximum father height
fhmin <- min(father.son$fheight)
fhmax <- max(father.son$fheight)
# equally space points between from the min-max height interval
xdat <- (fhmax - fhmin) * seq(0, 1, 0.01) + fhmin
ydat <- ahat + bhat*xdat
# regression line data frame
regressionLine <- data.frame(xdat, ydat)
names(regressionLine) <- c("sheight", "fheight")
# plot of data set with regression line
ggplot(father.son, aes(x=fheight, y=sheight)) +
geom_point(size=2, alpha=0.7) +
geom_line(data=regressionLine, aes(x=sheight, y=fheight), lwd=1.5, colour="red") +
xlab("Height of father") +
ylab("Height of son") +
ggtitle("Father-son Height Data")
# using the built in linear regression model in R to fit the data
summary(lm(sheight ~ fheight, data=father.son))
##
## Call:
## lm(formula = sheight ~ fheight, data = father.son)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8772 -1.5144 -0.0079 1.6285 8.9685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.88660 1.83235 18.49 <2e-16 ***
## fheight 0.51409 0.02705 19.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.437 on 1076 degrees of freedom
## Multiple R-squared: 0.2513, Adjusted R-squared: 0.2506
## F-statistic: 361.2 on 1 and 1076 DF, p-value: < 2.2e-16
The regression line is the line of best fit of the means and therefore separates the data points approximately equally above and below the lines.
Question: [True or False] Because the sons average an inch taller than the fathers, if the father is 72 inches tall, it is 50-50 whether the son is 73 inches tall.
False, the regression line value is 70.9013031
, which is the 50-50 point. Hence, the chance of the son being shorter is higher.