Father and Son Analysis

The father and son dataset comes from Sir Francis Galton's study of the heights of fathers and their sons. It contains the heights, in inches, of 1078 fathers and their corresponding sons. The analysis establishes a linear relationship between the heights of fathers and sons, and then uses a linear regression model to predict a son's height from his father's height.

Exploratory Data Analysis

We obtain the data from the UsingR package in R.

library(UsingR)   # provides the father.son dataset
library(ggplot2)  # used for the plots below
# Exploring the dataset
head(father.son)
##    fheight  sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
dim(father.son)
## [1] 1078    2
str(father.son)
## 'data.frame':    1078 obs. of  2 variables:
##  $ fheight: num  65 63.3 65 65.8 61.1 ...
##  $ sheight: num  59.8 63.2 63.3 62.8 64.3 ...
# Rounding the values to 2 decimal places
father.son <- round(father.son, digits = 2)
library(gridExtra)

# Histogram of heights of fathers
fplot <- ggplot(father.son, aes(x = fheight)) +
  geom_histogram(bins = 30, col = "grey", fill = "darkslategray") +
  geom_vline(xintercept = mean(father.son$fheight), col = "darkgreen") +
  xlab("Height of Fathers") + ylab("Number of Observations") +
  ggtitle("Height of Fathers")


# Histogram of heights of sons
splot <- ggplot(father.son, aes(x = sheight)) +
  geom_histogram(bins = 30, fill = "darkgreen", col = "grey") +
  geom_vline(xintercept = mean(father.son$sheight), col = "darkslategray") +
  xlab("Height of Sons") + ylab("Number of Observations") +
  ggtitle("Height of Sons")


# Scatterplot of the dataset
scatplot <- ggplot(father.son, aes(x = fheight, y = sheight)) +
  geom_point(col = "darkslategray") +
  xlab("Height of Fathers") + ylab("Height of Sons") +
  ggtitle("Height of Fathers and Sons")

grid.arrange(fplot, splot, ncol = 2, nrow = 1)

scatplot

# Summary of dataset
summary(father.son)

Correlation

There seems to be an association between the heights of fathers and those of their sons. At first glance, as the height of the father increases, the height of the son also tends to increase. The strength of this association is measured by the correlation coefficient.

cor(father.son$fheight, father.son$sheight)
## [1] 0.501362

The heights of fathers and sons are positively associated, with a correlation of about 0.50, a moderate positive relationship: as the height of the father increases, the height of the son tends to increase as well.

Regression Analysis

Using simple linear regression, the relationship between the two variables is expressed through a linear equation.

\[ Y_i\,(\text{sheight}) = \beta_0 + \beta_1\,(\text{fheight}) + \epsilon_i \]

\(\beta_0 = \text{intercept: the value of the response variable when the explanatory variable equals } 0\)

\(\beta_1 = \text{slope of the line}\)

\(\epsilon_i = \text{error term}\)

Using the lm function, the regression coefficients are estimated:

# Specifying the response and explanatory variables
Y <- father.son$sheight
X <- father.son$fheight

# Building the linear model
Model <- lm(Y ~ X)

Model
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##     33.8839       0.5141

The estimated regression coefficients indicate that a son's height can be predicted through the following formula:

\[ \hat{Y}_i = 33.88 + 0.51\,(\text{fheight}) \]
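As a quick sanity check, the fitted coefficients can be applied by hand. The sketch below assumes a hypothetical father who is 70 inches tall and uses the Model object fitted above.

# Applying the fitted coefficients to a hypothetical 70-inch father
b0 <- coef(Model)[1]  # intercept, about 33.88
b1 <- coef(Model)[2]  # slope, about 0.51
b0 + b1 * 70          # predicted son's height, about 69.87 inches

# Equivalently, via predict() on the fitted model
predict(Model, data.frame(X = 70))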

Statistical Significance

It is important to verify that the model is statistically significant before a relationship between the two variables can be established. We examine the statistical significance of the model through hypothesis testing.

The null hypothesis assumes no relationship between the heights of fathers and sons, while the alternative hypothesis states that there is a relationship.

\(H_0:\) There is no relationship between the heights of fathers and sons; the slope \(\beta_1 = 0\).

\(H_a:\) There is a relationship between the heights of fathers and sons; the slope \(\beta_1 \neq 0\).

A summary of the model will give us enough information to decide for or against the null hypothesis.

summary(Model)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8759 -1.5149 -0.0048  1.6322  8.9648 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.88389    1.83239   18.49   <2e-16 ***
## X            0.51413    0.02705   19.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.437 on 1076 degrees of freedom
## Multiple R-squared:  0.2514, Adjusted R-squared:  0.2507 
## F-statistic: 361.3 on 1 and 1076 DF,  p-value: < 2.2e-16

The model estimates the slope at 0.51, along with its t-value and p-value. The significance level is set at 0.05. We obtain a t-value of 19.01, and the p-value is far below the significance level: if the null hypothesis were true (slope = 0), the probability of obtaining a t-value as extreme as 19.01 would be less than 2e-16. The null hypothesis is therefore provisionally rejected in favor of the alternative hypothesis. In other words, the results establish a statistically significant relationship between the heights of fathers and sons.
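For completeness, the t-value and p-value of the slope can be recovered by hand from the coefficient table of the model summary; a minimal sketch:

# The t-value is the estimate divided by its standard error
coefs <- summary(Model)$coefficients
coefs["X", "Estimate"] / coefs["X", "Std. Error"]  # about 19.01
coefs["X", "Pr(>|t|)"]                             # p-value of the slope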

Prediction

We have established a linear relationship between the heights of fathers and sons through linear regression. Using the equation, a son's height can be predicted from his father's height, with some error. To examine how precise and accurate the model's predictions are, we split the dataset into training and test sets in an 80:20 ratio. The model is built and trained on the training set and then given new data values to predict the corresponding heights.

## 80% of the sample size
smp_size <- floor(0.80 * nrow(father.son))


set.seed(123)
train_ind <- sample(seq_len(nrow(father.son)), size = smp_size)

train <- father.son[train_ind, ]
test <- father.son[-train_ind, ]

## For convenience, the training variables are given short names
a <- train$fheight
b <- train$sheight

## Regression Model
y <- lm(b ~ a)

## Predicting heights of sons for the new data points in the test set
Pred_inter <- predict(y, data.frame(a = test$fheight), interval = "prediction")
Pred_confidence <- predict(y, data.frame(a = test$fheight), interval = "confidence")

Using the predict function, the confidence interval and the prediction interval of the estimates are calculated.

dat <- cbind(test, Pred_inter)
ggplot(dat, aes(x = fheight, y = sheight)) +
  geom_point(color = "darkslategray", size = 1) +
  geom_smooth(method = lm, col = "ivory4") +
  geom_line(aes(y = lwr), col = "red", linetype = "dashed") +
  geom_line(aes(y = upr), col = "red", linetype = "dashed") +
  xlab("Height of Father") + ylab("Height of Sons") +
  ggtitle("Confidence vs Prediction Interval")

The grey shaded area shows the 95% confidence interval: under repeated sampling, the estimated mean height of sons for a given father's height would fall within this band about 95% of the time, given the regression model. The red dashed lines depict the prediction interval: if another observation were added to the dataset, the corresponding son's height would be expected to fall within the dashed band with 95% probability. The prediction interval is wider because it accounts for the variability of individual observations around the line, not just the uncertainty in the estimated line itself.
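To make the distinction concrete, the sketch below compares the two intervals for a single hypothetical new father who is 68 inches tall, using the training model y fitted above; the prediction interval comes out noticeably wider.

# Confidence vs prediction interval for a hypothetical 68-inch father
new_father <- data.frame(a = 68)
predict(y, new_father, interval = "confidence")
predict(y, new_father, interval = "prediction")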

Goodness of fit

Now that the relationship between the two variables has been established by the linear model, it is important to examine how accurately the model predicts the values. R-squared is one measure of a model's goodness of fit. Our model has an R-squared of 0.25, which means that around 25% of the variation in the heights of sons is explained by the model.
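As a check on this figure, R-squared can be reproduced by hand from its definition, \(R^2 = 1 - SSE/SST\); the sketch below uses the full-data model Model and the response Y defined earlier.

# R-squared as 1 - SSE/SST
sse <- sum(residuals(Model)^2)  # sum of squared residuals
sst <- sum((Y - mean(Y))^2)     # total sum of squares
1 - sse / sst                   # about 0.2514, matching summary(Model)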

Furthermore, we look into the squared differences between the actual and the predicted values for each data point. The Mean Squared Error and its square root are defined as follows:

\[ MSE = \dfrac{1}{n}\sum_{i = 1}^{n} (y_i - \hat{y}_i)^2 \] \[ RMSE = \sqrt{\dfrac{1}{n}\sum_{i = 1}^{n} (y_i - \hat{y}_i)^2} \]

ggplot(dat, aes(x = fheight, y = sheight)) +
  geom_segment(aes(xend = fheight, yend = fit)) +
  geom_point(col = "darkslategrey") +
  geom_point(aes(y = fit), col = "cornflowerblue") +
  xlab("Height of Father") + ylab("Height of Sons") +
  ggtitle("Residual Distance from the Model")

summa <- summary(y)

# MSE as the mean of the squared training residuals
mse <- function(summa){
  mean(summa$residuals^2)
}

## Mean of the Squared Errors (MSE)
mse(summa)
## [1] 6.030449
## Square Root of the Mean Squared Error (RMSE)
sqrt(mse(summa))
## [1] 2.455697

\(MSE=6.03\) \(RMSE=2.45\)

The Root Mean Squared Error (RMSE) is 2.45: on average, the observed heights of sons are about 2.45 inches away from the fitted line.
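The MSE above is computed from the model's training residuals. As a complementary out-of-sample check (a small sketch using the dat data frame built earlier, which holds the test observations together with their fitted values), the same error can be measured on the held-out test set:

# Out-of-sample MSE and RMSE on the test set
test_mse  <- mean((dat$sheight - dat$fit)^2)
test_rmse <- sqrt(test_mse)
test_rmse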

Conclusion

  1. The heights of fathers and sons are positively correlated: as fathers' heights increase, sons' heights also tend to increase.
  2. For a one-inch increase in a father's height, the model predicts a 0.51-inch increase in the son's height.
  3. According to the model, 25 percent of the variation in the heights of sons can be explained by the heights of fathers.
  4. On average, the heights of sons are 2.45 inches away from the values predicted by the model.