In this document, I play around with the basics of regression lines in the context of John Galtonâ€™s father and son height data. Basic questions:

- Is there a correlation between father and son heights?

- What is the regression line for the data set?

- What is the [M]ean [S]quare [E]rror and [R]oot [M]ean [S]quare error?

```
# The father son height data can be loaded from the UsingR package. There is another package called HistData that contains many interesting historical datasets including Snow's work on cholera which I find particularly interesting.
suppressWarnings(suppressMessages(library(UsingR)))
data(father.son)
# Check out the structure of the data set
head(father.son)
```

```
## fheight sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
```

`names(father.son)`

`## [1] "fheight" "sheight"`

```
# Load ggplot to plot data
library(ggplot2)
ggplot(father.son, aes(x=fheight, y=sheight)) + geom_point(size=2, alpha=0.7) + xlab("Height of father") + ylab("Height of son") + ggtitle("Father-son Height Data")
```

Now let us use R commands to compute the means, standard deviations and the correlation in the data sets.

```
# mean and standard deviations of the father and son heights
shmean <- mean(father.son$sheight)
fhmean <- mean(father.son$fheight)
shsd <- sd(father.son$sheight)
fhsd <- sd(father.son$fheight)
# correlation of father.son data - the off-diagonal terms are of interest
fscor <- cor(father.son)[1, 2]
```

The regression line is of the for \(Y_i = a + b x_i + \epsilon_i\), where, \(b = rs_y/s_x\) (slope), \(a = \bar{y} - b \bar{x}\) (y-intercept) and \(\epsilon_i\) is the residual. From the data we can then find the estimates for the parameters \(a, b\) denoted \(\hat{a}, \hat{b}\).

```
# slope = r*s_y/s_x - father data is on the x axis.
bhat <- fscor * shsd / fhsd
# y-intercept = mean(y) - slope*mean(x)
ahat <- shmean - bhat*fhmean
# print regression line parameters
ahat
```

`## [1] 33.8866`

`bhat`

`## [1] 0.514093`

Let us compute the MSE and RMS error.

```
# computation of mean squared data average of residula error squared
MSE <- sum((father.son$sheight - (ahat + bhat * father.son$fheight))^2) / dim(father.son)[1]
MSE
```

`## [1] 5.92579`

```
# root mean squared error
RMSE <- sqrt(MSE)
RMSE
```

`## [1] 2.434294`

```
# minimum and maximum father height
fhmin <- min(father.son$fheight)
fhmax <- max(father.son$fheight)
# equally space points between from the min-max height interval
xdat <- (fhmax - fhmin) * seq(0, 1, 0.01) + fhmin
ydat <- ahat + bhat*xdat
# regression line data frame
regressionLine <- data.frame(xdat, ydat)
names(regressionLine) <- c("sheight", "fheight")
# plot of data set with regression line
ggplot(father.son, aes(x=fheight, y=sheight)) +
geom_point(size=2, alpha=0.7) +
geom_line(data=regressionLine, aes(x=sheight, y=fheight), lwd=1.5, colour="red") +
xlab("Height of father") +
ylab("Height of son") +
ggtitle("Father-son Height Data")
```

```
# using the built in linear regression model in R to fit the data
summary(lm(sheight ~ fheight, data=father.son))
```

```
##
## Call:
## lm(formula = sheight ~ fheight, data = father.son)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8772 -1.5144 -0.0079 1.6285 8.9685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.88660 1.83235 18.49 <2e-16 ***
## fheight 0.51409 0.02705 19.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.437 on 1076 degrees of freedom
## Multiple R-squared: 0.2513, Adjusted R-squared: 0.2506
## F-statistic: 361.2 on 1 and 1076 DF, p-value: < 2.2e-16
```

The regression line is the line of best fit of the means and therefore separates the data points approximately equally above and below the lines.

**Question: [True or False] Because the sons average an inch taller than the fathers, if the father is 72 inches tall, it is 50-50 whether the son is 73 inches tall.**

False, the regression line value is `70.9013031`

, which is the 50-50 point. Hence, the chance of the son being shorter is higher.