In 1886, Francis Galton presented a data set on a sample of adult British children and their parents. For each child, he had recorded their adult height and the average of their parents’ heights. His analysis of the data set the stage for correlation, regression and the bivariate normal distribution.
We are going to keep the analysis simple: we consider one child randomly selected from each set of parents and ask only the following question: “Is the height of the child determined by the height of the parents?” In particular, we wish to see if the relationship between the two is approximately one to one, i.e., children grow up to have the same height as the average of their parents’ heights, allowing for some error.
The data is stored in the file Galton.csv and contains the variables:
| Variable | Description |
|---|---|
| child | The height (converted to cm) of the child when adult. |
| parent | The average height (converted to cm) of the parents. |
We wish to investigate the relationship between children and their parents’ height and determine if this relationship is one-to-one.
library(s20x)  # provides modelcheck(); assumed to be installed
Galton.df=read.csv("Galton.csv", header=TRUE)
plot(child~parent, main="Child's height versus parents' average height",data=Galton.df)
Galton.lm=lm(child~parent, data=Galton.df)
modelcheck(Galton.lm)
summary(Galton.lm)
##
## Call:
## lm(formula = child ~ parent, data = Galton.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.9305 -5.8912 -0.6636 6.6326 17.9038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.6189 20.7342 3.213 0.00154 **
## parent 0.6115 0.1214 5.036 1.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.049 on 195 degrees of freedom
## Multiple R-squared: 0.1151, Adjusted R-squared: 0.1105
## F-statistic: 25.36 on 1 and 195 DF, p-value: 1.081e-06
confint(Galton.lm)
## 2.5 % 97.5 %
## (Intercept) 25.72695 107.5109105
## parent 0.37202 0.8510386
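plot(child~parent, main="Child's height versus parents' average height", sub="Solid line = fitted model, dashed line = slope 1",data=Galton.df)
abline(Galton.lm$coef[1],Galton.lm$coef[2])  # solid line: fitted model
abline(0,1,lty=2)  # dashed line: one-to-one relationship (intercept 0, slope 1), as described in the subtitle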
Since we have a linear relationship in the data, we have fitted a simple linear regression model to our data. We have a sample of families, but no information on how these were obtained, so we have to assume they were randomly selected. However, as this study was conducted long ago, before good statistical practice was well understood, this is unlikely to be the case, so there could be doubts about independence. (The problem of multiple children from the same family was, however, solved by randomly choosing one child from each family.) The residuals show patternless scatter with fairly constant variability, so there are no problems there. The normality checks don’t show any major problems (slightly short tails, if anything) and the Cook’s distance plot doesn’t reveal any unduly influential points. Overall, the model assumptions appear to be satisfied, subject to the caveat about how the families were sampled.
Our model is:
\(child_i=\beta_0 +\beta_1\times parent_i+\epsilon_i\) where \(\epsilon_i \sim iid ~ N(0,\sigma^2)\)
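To make the fitted line concrete, the sketch below plugs the estimates from the summary output (intercept 66.6189, slope 0.6115) into the model equation. The parent heights of 160, 170 and 180 cm are hypothetical values chosen purely for illustration; they are not taken from Galton’s data, and the variable names are illustrative too.

```r
# Coefficient estimates copied from the summary(Galton.lm) output
b0 <- 66.6189   # intercept
b1 <- 0.6115    # slope for parent

# Hypothetical average parent heights in cm (illustrative values, not from the data)
parent <- c(160, 170, 180)

# Fitted (expected) child height from the estimated line
fitted_child <- b0 + b1 * parent
data.frame(parent, fitted_child)
```

With these estimates, a 20 cm difference in average parent height corresponds to a difference of about 12.2 cm in fitted child height (20 × 0.6115).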
Our model explains only about 11.5% of the variation in the response variable (Multiple R-squared = 0.1151).
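As a quick consistency check on this figure, in a regression with a single predictor the R-squared can be recovered from the F-statistic and its residual degrees of freedom as F / (F + df). The sketch below uses only values printed in the summary output; the variable names are illustrative.

```r
# Values copied from the summary(Galton.lm) output
f_stat   <- 25.36   # F-statistic on 1 and 195 DF
df_resid <- 195     # residual degrees of freedom

# With one predictor, R-squared = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)
round(r_squared, 3)

# The correlation between child and parent has magnitude sqrt(R-squared)
round(sqrt(r_squared), 2)
```

This gives roughly 0.115, in line with the reported Multiple R-squared, and a correlation of about 0.34 between child and average parent height.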
We are interested in evaluating whether the height of the child is determined by the height of the parents. In particular, we wish to see if the relationship between the two is approximately one to one, i.e., children grow up to have the same height as the average of their parents’ heights.
We have strong evidence of an increasing relationship between the height of the child and the average of the parents’ heights (p-value < 0.0001).
We estimate that for every 1 cm increase in average parent height, the child’s height increases, on average, by somewhere between 0.37 and 0.85 cm. For example, for a 5 cm increase in average parent height, the child’s expected height increases by only between 1.85 and 4.25 cm.
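The interval quoted here can be reproduced, to rounding, from the slope estimate and its standard error in the summary output, using estimate ± t multiplier × standard error with 195 residual degrees of freedom. The sketch below is a hand calculation along those lines, not a replacement for `confint()`; the variable names are illustrative, and the 5 cm scaling uses the interval rounded to two decimal places, as in the text.

```r
# Values copied from the summary(Galton.lm) output
b1       <- 0.6115   # estimated slope for parent
se_b1    <- 0.1214   # its standard error
df_resid <- 195      # residual degrees of freedom

# t multiplier for a 95% confidence interval
t_mult <- qt(0.975, df = df_resid)

# 95% confidence interval for the slope (cm of child height per cm of parent height)
ci_slope <- b1 + c(-1, 1) * t_mult * se_b1
round(ci_slope, 2)

# Same interval scaled to a 5 cm difference in average parent height
5 * round(ci_slope, 2)
```

Before rounding, the hand-calculated limits differ very slightly from the `confint(Galton.lm)` output because the estimate and standard error are themselves rounded in the printed summary.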
This is not consistent with a one-to-one relationship between average parent height and child’s height, since the confidence interval for the slope lies entirely below 1.
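One way to check this conclusion more formally is to test the null hypothesis that the slope equals 1, rather than the default null hypothesis of 0 reported by `summary()`. The sketch below does this by hand from the printed estimate and standard error; it is an illustrative calculation under the same model assumptions, with variable names chosen here for clarity.

```r
# Values copied from the summary(Galton.lm) output
b1       <- 0.6115   # estimated slope for parent
se_b1    <- 0.1214   # its standard error
df_resid <- 195      # residual degrees of freedom

# t-statistic for H0: slope = 1 (instead of the default H0: slope = 0)
t_stat <- (b1 - 1) / se_b1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

c(t_stat = t_stat, p_value = p_value)
```

The t-statistic is about -3.2, so a slope of 1 lies more than three standard errors above the estimate, which agrees with the confidence interval for the slope.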
1.3 Comment on the plots