DATA 605 Discussion 12

Using simple linear regression, we want to answer the question "Is there a relationship between a child's height and the height of their parents?"

The Dataset

For this investigation we will use a dataset compiled by Francis Galton that contains data about the heights of children and their parents.

First we will import the dataset into R (the dataset can be accessed via the HistData R package) and take a brief look at the data.

# Load the required packages and import Galton's data from HistData.
library(HistData)
library(knitr)
library(ggplot2)
data('Galton', 'GaltonFamilies')
galton_height_data <- GaltonFamilies
kable(head(galton_height_data), format = 'markdown')

family	father	mother	midparentHeight	children	childNum	gender	childHeight
001	78.5	67.0	75.43	4	1	male	73.2
001	78.5	67.0	75.43	4	2	female	69.2
001	78.5	67.0	75.43	4	3	female	69.0
001	78.5	67.0	75.43	4	4	female	69.0
002	75.5	66.5	73.66	4	1	male	73.5
002	75.5	66.5	73.66	4	2	male	72.5

# Count the number of observations in the dataset.
nrow(galton_height_data)

## [1] 934

As we can see from the above, the dataset contains 934 observations, and 8 variables: family, father, mother, midparentHeight, children, childNum, gender, and childHeight.
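The midparentHeight column is not described above. In the HistData documentation it is defined as (father + 1.08 × mother) / 2, and the rows printed above are consistent with that definition. A minimal check, re-typing the two distinct families from the table rather than loading the package (the `families` and `recomputed` names are illustrative):

```r
# Values copied from the first rows of the table above (families 001 and 002).
families <- data.frame(
  family = c("001", "002"),
  father = c(78.5, 75.5),
  mother = c(67.0, 66.5),
  midparentHeight = c(75.43, 73.66)
)

# Galton's convention: scale the mother's height by 1.08 before averaging.
families$recomputed <- round((families$father + 1.08 * families$mother) / 2, 2)

families
```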

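In order to assess if there is a relationship between a child's height and their parents' height, we can plot the data in a scatter plot. For the plots and the model we use the companion Galton data frame from the same package, which pairs each child's height (child) with the mid-parent height (parent) in 928 observations.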

# Plot a scatter plot of the Galton dataset.
galton_plot <- ggplot(Galton, aes(x = parent, y = child)) + geom_point(color = "blue") +
    ylab("Child's Height in Inches") + xlab("Parents' Height in Inches")

galton_plot

Data Exploration

Looking at the above scatter plot, there does appear to be a relationship between the parents' height and the child's height: as the parents' height increases, so does the child's height. To quantify this, we can look at the correlation between the two variables (child's height vs parents' height).

Correlation

# Calculate the correlation between the 2 variables.
round(cor(Galton$parent, Galton$child), 2)

## [1] 0.46

The correlation between the two variables is 0.46, which indicates a moderate positive relationship: taller parents tend to have taller children, but there is considerable scatter around that trend.
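The correlation coefficient alone does not say whether the association could be due to chance. One way to check, sketched here with base R's cor.test() on the same Galton data frame (assuming the HistData package is installed; `height_cor_test` is an illustrative name), is to test the null hypothesis of zero correlation and look at the confidence interval:

```r
data("Galton", package = "HistData")

# Pearson correlation test between mid-parent height and child height.
height_cor_test <- cor.test(Galton$parent, Galton$child)

round(unname(height_cor_test$estimate), 2)  # point estimate of the correlation
height_cor_test$conf.int                    # 95% confidence interval
height_cor_test$p.value                     # p-value for H0: correlation = 0
```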

Distribution normality of the dependent variable (child's height)

The next thing we want to check is whether the dependent variable (child's height) is approximately normally distributed. We can check this by plotting a histogram and seeing whether the distribution is bell shaped.

hist(Galton$child)

The distribution is close to normal (bell shaped) and therefore we can proceed to check the linearity of the variables.
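Judging a bell shape by eye is easier with a reference curve. The following is a rough sketch, assuming the HistData package is installed, that redraws the histogram on a density scale and overlays a normal curve with the sample mean and standard deviation (`child_mean` and `child_sd` are illustrative names):

```r
data("Galton", package = "HistData")

child_mean <- mean(Galton$child)
child_sd <- sd(Galton$child)

# Density-scaled histogram so the normal curve shares the same y-axis.
hist(Galton$child, freq = FALSE,
     main = "Child's height with fitted normal curve",
     xlab = "Child's Height in Inches")

# Overlay a normal density with the sample mean and standard deviation.
curve(dnorm(x, mean = child_mean, sd = child_sd), add = TRUE, lwd = 2)
```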

Linearity of the relationship between parents' height and child's height

plot(child ~ parent, data = Galton)

The points are widely scattered, so the relationship between a child's height and their parents' height is not especially tight, but the overall trend is roughly linear with no obvious curvature, and thus we can move forward with building the regression model.
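One way to make the linearity judgment less subjective is to compare a straight-line fit with a flexible smoother: if the two roughly coincide, a linear model is reasonable. A minimal sketch in base R, assuming the HistData package is installed (the jitter amount, the seed, and the object names are arbitrary choices for illustration):

```r
data("Galton", package = "HistData")

# Heights are recorded in coarse bins, so many points overlap;
# jitter spreads them out for display only.
set.seed(605)
plot(jitter(Galton$parent, factor = 2), jitter(Galton$child, factor = 2),
     col = "grey50",
     xlab = "Parents' Height in Inches", ylab = "Child's Height in Inches")

# Straight-line fit (solid) versus a lowess smoother (dashed).
linear_fit <- lm(child ~ parent, data = Galton)
abline(linear_fit, lwd = 2)
smooth_fit <- lowess(Galton$parent, Galton$child)
lines(smooth_fit, lwd = 2, lty = 2)
```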

Regression Model

The goal of our regression model is to predict the height of a child (response/dependent variable) based on their parents' height (predictor/independent variable).

Build the regression model

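The first step is to build the regression model using R's lm() function. Note that the formula parent ~ child used here places the parents' height on the left-hand side, so the output that follows describes a regression of parents' height on child's height; the formula that matches the stated goal is child ~ parent.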

# Create the regression model.
galton_height_model <- lm(parent ~ child, Galton)

# Create a summary of the model.
summary(galton_height_model)

## 
## Call:
## lm(formula = parent ~ child, data = Galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6702 -1.1702 -0.1471  1.1324  4.2722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.13535    1.41225   32.67   <2e-16 ***
## child        0.32565    0.02073   15.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.589 on 926 degrees of freedom
## Multiple R-squared:  0.2105, Adjusted R-squared:  0.2096 
## F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16
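The call above has parent on the left-hand side of the formula. A model that predicts a child's height from the parents' height, as the stated goal describes, would swap the two. A minimal sketch, assuming the HistData package is installed, with `child_on_parent` and the 70-inch example value as illustrative assumptions:

```r
data("Galton", package = "HistData")

# Response on the left, predictor on the right: child's height from parents' height.
child_on_parent <- lm(child ~ parent, data = Galton)
coef(child_on_parent)

# Predicted child's height for a (hypothetical) mid-parent height of 70 inches.
predicted_height <- predict(child_on_parent, newdata = data.frame(parent = 70))
predicted_height
```

The R-squared and the p-value of the slope are the same in both directions for a single-predictor model; only the coefficients and the residual scale change.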

Linear model summary analysis

The Multiple R-squared value above is a measurement of how well the model describes the data. For this model, the R-squared value is 0.2105, which means that the model explains about 21% of the variation in the response variable.
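In a simple linear regression with one predictor, R-squared equals the square of the correlation coefficient, which ties this figure back to the 0.46 reported earlier. A short check, refitting the model with the same formula as the call above (assuming the HistData package is installed; `fit`, `r`, and `r_squared` are illustrative names):

```r
data("Galton", package = "HistData")

fit <- lm(parent ~ child, data = Galton)  # same formula as the summary above
r_squared <- summary(fit)$r.squared
r <- cor(Galton$parent, Galton$child)

# The two quantities agree up to floating-point error.
c(r_squared = r_squared, r_times_r = r^2)
```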

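For well-fitted linear regression models, the residuals should be approximately normally distributed and centered on 0. The median residual for this model is -0.147, which is close to 0, and the first and third quartiles (-1.17 and 1.13) are roughly symmetric around it. The slightly negative median means the model over-predicts a little more often than it under-predicts.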
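For a least-squares fit that includes an intercept, the residuals always average to zero, so the median and quartiles are the more informative part of the summary. A quick check on the same model, assuming the HistData package is installed (`model_residuals` is an illustrative name):

```r
data("Galton", package = "HistData")

fit <- lm(parent ~ child, data = Galton)
model_residuals <- residuals(fit)

mean(model_residuals)      # zero up to floating-point error
median(model_residuals)    # compare with the Median in the summary above
quantile(model_residuals)  # min, quartiles, and max
```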

Residual analysis

par(mfrow = c(2, 2))
plot(galton_height_model)

par(mfrow = c(1, 1))

The Q-Q plot shows that the residuals follow the straight line fairly well, which supports the assumption that the errors are approximately normally distributed. The stragglers at either end of the line may indicate that the data is slightly skewed or that we have some outliers.
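To look at the tails more closely, the Q-Q plot can be drawn on its own from the standardized residuals, and the most extreme observations listed. This is only a sketch, assuming the HistData package is installed; the cutoff of 2.5 standard deviations is an arbitrary choice for illustration, not a rule from the analysis above:

```r
data("Galton", package = "HistData")

fit <- lm(parent ~ child, data = Galton)
std_residuals <- rstandard(fit)

# Normal Q-Q plot of the standardized residuals with a reference line.
qqnorm(std_residuals)
qqline(std_residuals)

# Row indices of observations beyond the (arbitrary) 2.5 cutoff.
extreme_rows <- which(abs(std_residuals) > 2.5)
extreme_rows
```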

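Taken together, the highly significant slope coefficient (p < 2e-16) and the reasonably well-behaved residuals support our original hypothesis that a child's height is related to their parents' height. With an R-squared of about 0.21, however, parents' height accounts for only a modest share of the variation.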