Using simple linear regression, we want to answer the question "Is there a relationship between a child's height and the height of their parents?"
For this investigation we will use a dataset compiled by Francis Galton that contains data about the heights of children and their parents.
First we will import the dataset into R (the dataset can be accessed via the HistData R package) and take a brief look at the data.
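# Load the packages used below (HistData for the data, knitr for kable()).
library(HistData)
library(knitr)
# Import Galton's datasets from the HistData package.
data('GaltonFamilies')
data('Galton')
galton_height_data <- GaltonFamilies
kable(head(galton_height_data), format = 'markdown')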
| family | father | mother | midparentHeight | children | childNum | gender | childHeight |
|---|---|---|---|---|---|---|---|
| 001 | 78.5 | 67.0 | 75.43 | 4 | 1 | male | 73.2 |
| 001 | 78.5 | 67.0 | 75.43 | 4 | 2 | female | 69.2 |
| 001 | 78.5 | 67.0 | 75.43 | 4 | 3 | female | 69.0 |
| 001 | 78.5 | 67.0 | 75.43 | 4 | 4 | female | 69.0 |
| 002 | 75.5 | 66.5 | 73.66 | 4 | 1 | male | 73.5 |
| 002 | 75.5 | 66.5 | 73.66 | 4 | 2 | male | 72.5 |
# Count the number of observations in the dataset.
nrow(galton_height_data)
## [1] 934
As we can see from the above, the dataset contains 934 observations, and 8 variables: family, father, mother, midparentHeight, children, childNum, gender, and childHeight.
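To assess whether there is a relationship between a child's height and their parents' height, we can plot the data as a scatter plot. The plots and models below use the companion Galton dataset from the same package, which has 928 observations and just two columns: parent (the mid-parent height) and child (the child's height).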
# Plot a scatter plot of the Galton dataset (requires the ggplot2 package).
library(ggplot2)
galton_plot <- ggplot(Galton, aes(x = parent, y = child)) +
  geom_point(color = "blue") +
  xlab("Parents' Height in Inches") +
  ylab("Child's Height in Inches")
galton_plot
Looking at the scatter plot above, there does appear to be a relationship between the parents' height and the child's height: as the parents' height increases, so does the child's height on average. To quantify this, we can compute the correlation between the two variables (child's height vs parents' height).
# Calculate the correlation between the 2 variables.
round(cor(Galton$parent, Galton$child), 2)
## [1] 0.46
The correlation between the two variables is 0.46, which indicates a moderate positive relationship between them.
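A correlation coefficient on its own does not say whether the relationship could plausibly be due to chance. As a minimal sketch, assuming the HistData package is installed, base R's `cor.test()` reports the same estimate together with a confidence interval and a p-value. The name `height_cor_test` is only illustrative.

```r
library(HistData)
data("Galton")

# Pearson correlation test between mid-parent height and child height.
height_cor_test <- cor.test(Galton$parent, Galton$child)

# Point estimate (the same value that cor() returns).
round(height_cor_test$estimate, 2)

# 95% confidence interval for the correlation.
height_cor_test$conf.int

# p-value for the null hypothesis of zero correlation.
height_cor_test$p.value
```

A confidence interval that excludes 0 would support the presence of a linear association, but it says nothing about how strong that association is.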
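The next thing we want to check is the normality of the dependent variable (child's height). We can check this by plotting a histogram and seeing whether the distribution is bell-shaped. Strictly speaking, linear regression assumes normally distributed residuals rather than a normally distributed response, so this is only a quick first look.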
hist(Galton$child)
The distribution is close to normal (bell-shaped), so we can proceed to check the linearity of the relationship between the variables.
# Response (child) on the y-axis, predictor (parent) on the x-axis.
plot(child ~ parent, data = Galton)
The points are widely scattered, so the relationship between a child's height and their parents' height is not especially tight, but the overall trend looks roughly linear and we can move forward with building the regression model.
The goal of our regression model is to predict the height of a child (response/dependent variable) based on their parents' height (predictor/independent variable).
The first step is to build the regression model using R's lm() function.
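# Create the regression model.
# Note: the response goes on the left of '~', so this call models parent height
# from child height. Predicting a child's height would use child ~ parent.
galton_height_model <- lm(parent ~ child, data = Galton)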
# Create a summary of the model.
summary(galton_height_model)
##
## Call:
## lm(formula = parent ~ child, data = Galton)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.6702 -1.1702 -0.1471  1.1324  4.2722
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.13535    1.41225   32.67   <2e-16 ***
## child        0.32565    0.02073   15.71   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.589 on 926 degrees of freedom
## Multiple R-squared: 0.2105, Adjusted R-squared: 0.2096
## F-statistic: 246.8 on 1 and 926 DF, p-value: < 2.2e-16
The Multiple R-squared value above is a measurement of how well the model describes the data. For this model, the R-squared value is 0.2105, which means that the model explains about 21% of the variation in the response.
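In a simple linear regression with a single predictor, R-squared equals the square of the Pearson correlation, so it can be cross-checked against the correlation of 0.46 computed earlier (0.46 squared is roughly 0.21). The sketch below assumes the HistData package is installed, and the object names are illustrative.

```r
library(HistData)
data("Galton")

# Pearson correlation between mid-parent height and child height.
height_cor <- cor(Galton$parent, Galton$child)

# R-squared from the simple linear regression with the formula used in the text.
height_model <- lm(parent ~ child, data = Galton)
r_squared <- summary(height_model)$r.squared

# With a single predictor, R-squared is the squared correlation.
all.equal(height_cor^2, r_squared)

# Swapping response and predictor changes the coefficients but not R-squared.
swapped_r_squared <- summary(lm(child ~ parent, data = Galton))$r.squared
all.equal(swapped_r_squared, r_squared)
```

Both comparisons should return TRUE, which also shows that R-squared alone cannot tell us which variable is the response.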
For a well-fitted linear regression model, the residuals should be approximately normally distributed and centered on 0. The median residual for this model is -0.147, slightly below 0, which means the model over-predicted for at least half of the observations and hints at a slight skew in the residuals.
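For a least-squares fit that includes an intercept, the residuals average to zero by construction, so the median and the share of negative residuals say more about skew than the mean does. A minimal sketch of that check, assuming the HistData package is installed and using illustrative object names:

```r
library(HistData)
data("Galton")

# Same formula as the model in the text.
height_model <- lm(parent ~ child, data = Galton)
model_residuals <- residuals(height_model)

# Mean of the residuals: zero up to floating-point error.
mean(model_residuals)

# Median of the residuals: matches the "Median" entry in summary().
median(model_residuals)

# Share of observations the model over-predicted (negative residuals).
mean(model_residuals < 0)
```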
par(mfrow = c(2, 2))
plot(galton_height_model)
par(mfrow = c(1, 1))
In the Q-Q plot, the residuals follow the straight line fairly well, which supports the assumption of normally distributed errors. The stragglers at either end of the line may indicate that the residuals are slightly skewed, or that we have some outliers.
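Given the highly significant slope (p-value < 2e-16), the data support our original hypothesis that a child's height is related to their parents' height, although with an R-squared of about 0.21 the parents' height explains only part of the variation.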
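The stated goal is to predict a child's height from the parents' height, which corresponds to the formula `child ~ parent`. The sketch below fits that model and uses `predict()` on a few new values. It assumes the HistData package is installed. The mid-parent heights in `new_parents` are hypothetical values chosen only for illustration, and the object names are not from the original analysis.

```r
library(HistData)
data("Galton")

# Response on the left of '~': child height modelled from mid-parent height.
child_height_model <- lm(child ~ parent, data = Galton)
coef(child_height_model)

# Hypothetical mid-parent heights in inches (illustration only).
new_parents <- data.frame(parent = c(66, 68.5, 71))

# Point predictions with 95% prediction intervals for individual children.
predicted_heights <- predict(child_height_model, newdata = new_parents,
                             interval = "prediction", level = 0.95)
predicted_heights
```

The result is a matrix with columns `fit`, `lwr`, and `upr`. Prediction intervals are used rather than confidence intervals because the question concerns the height of an individual child, not the average height of all children of such parents.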