Learning from Data: Father-Son Height Dataset

Setup:

Here I am installing the necessary packages and loading the required libraries.

# Load standard libraries
library(tidyverse)
library(Sleuth3) # Contains data for problemset
library(UsingR) # Contains data for problemset
library(MASS) # Modern applied statistics functions
library(ggplot2)

Problem 1:

Davis et al. (1998) collected data on the proportion of births that were male in Denmark, the Netherlands, Canada, and the United States for selected years. Davis et al. argue that the proportion of male births is declining in these countries. We will explore this hypothesis.

# Import male births data
malebirths <- Sleuth3::ex0724

1a.

Using the function in to fit four (one per country) simple linear regression models of the yearly proportion of males births as a function of the year and obtain the least squares fits.

plotRegression <- function (Country,n) {

  fit <- lm(Country~Year, malebirths)
  slope <- signif(summary(fit)$coefficients[2],5)
  yIntercept <- signif(summary(fit)$coefficients[1],5)
  
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
    ylab(paste("Proportion of male births in", n))+
    labs(title=paste("\nModel: y = ",
                    yIntercept,
                    ifelse(slope<0, slope, paste("+", abs(slope))), 
                    "* x"))
  
  #labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
   #                  "Intercept =",signif(fit$coef[[1]],5 ),
    #                 " Slope =",signif(fit$coef[[2]], 5),
     #                " P =",signif(summary(fit)$coef[2,4], 5)),
      #sub = paste("For country",n))
}

col <- colnames(malebirths[])

for(i in 2:5){
  print(plotRegression(malebirths[,i], col[i]))
}

1b.

Obtaining the \(t\)-statistic for the test that the slopes of the regression lines are zero, for each of the four countries.

for(i in 2:5){
  cat("\nT-statistic for", col[i], "when fitting a linear model = ",
      summary(lm(malebirths[,i]~Year, malebirths))$coefficients[2,3])
  cat("\nCorresponding P-value =",
      summary(lm(malebirths[,i]~Year, malebirths))$coefficients[2,4])
  cat("\n")
}

## 
## T-statistic for Denmark when fitting a linear model =  -2.072598
## Corresponding P-value = 0.04423828
## 
## T-statistic for Netherlands when fitting a linear model =  -5.710196
## Corresponding P-value = 9.636921e-07
## 
## T-statistic for Canada when fitting a linear model =  -4.016653
## Corresponding P-value = 0.0007375947
## 
## T-statistic for USA when fitting a linear model =  -5.779212
## Corresponding P-value = 1.439109e-05

Considering the fact the the p-value of each of the above models (for all the 4 countries given) is less than 0.05, therefore we reject the null hypothesis in favor of the alternative hypothesis and infer that the proportion of male-births is indeed declining over the given period.

Problem 2:

Regression was originally used by Francis Galton to study the relationship between parents and children. One relationship he considered was height. Predicting if a man’s height based on the height of his father.

# Import and look at the height data
heightData <- tbl_df(get("father.son"))

2a.

Performing an exploratory analysis of the dataset.

str(heightData) #getting the structure of data

## Classes 'tbl_df', 'tbl' and 'data.frame':    1078 obs. of  2 variables:
##  $ fheight: num  65 63.3 65 65.8 61.1 ...
##  $ sheight: num  59.8 63.2 63.3 62.8 64.3 ...

cat("\n\nA statistical summary of the fathers' height and the sons' heights\n")

## 
## 
## A statistical summary of the fathers' height and the sons' heights

summary(heightData) #getting the summary of each of the two variables

##     fheight         sheight     
##  Min.   :59.01   Min.   :58.51  
##  1st Qu.:65.79   1st Qu.:66.93  
##  Median :67.77   Median :68.62  
##  Mean   :67.69   Mean   :68.68  
##  3rd Qu.:69.60   3rd Qu.:70.47  
##  Max.   :75.43   Max.   :78.36

ggplot(heightData, aes(x = fheight, y = sheight)) + 
  geom_point() +
  #stat_smooth(method = "lm", col = "red")+
  xlab("Father's Height")+
  ylab("Son's Height")+
  labs(title="Father's Height vs Son's Height")

* From the above visualization, it is clear that the son’s heights are linearly correlated witht the father’s heights and, in general, the higher the height of the father, the higher is the height of the son.

2b.

Using the function in R to fit a simple linear regression model to predict son’s height as a function of father’s height.

modelHeight <- lm(sheight~fheight, heightData)
cat("\nModel: \n y = ",summary(modelHeight)$coefficients[1], "+", 
    summary(modelHeight)$coefficients[2], "* x")

## 
## Model: 
##  y =  33.8866 + 0.514093 * x

Y-Intercept of the linear model= 33.89
Slope of the linear model = 0.51

2c.

Finding the 95% confidence intervals for the estimates.

confint(modelHeight, level = 0.95)

##                  2.5 %     97.5 %
## (Intercept) 30.2912126 37.4819961
## fheight      0.4610188  0.5671673

The above command shows the values for the slope and the y-intercept corresponding to the 95% confidence interval.

2d.

Producing a visualization of the data and the least squares regression line.

ggplot(heightData, aes(x = fheight, y = sheight)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")+
  xlab("Father's Height")+
  ylab("Son's Height")+
  labs(title="Father's Height vs Son's Height")

2e.

Producing a visualization of the residuals versus the fitted values.

#par(mfrow=c(2,2))
plot(modelHeight, which = c(1,1))

As per the above residual plot, the data points are scattered randomly around the mean value of 0. This suggests that the assumption that the relationship is linear is reasonable.
The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.

2f.

Using the model fitted in part (b) to predict the height was 5 males whose father are 50, 55, 70, 75, and 90 inches respectively.

newdata <- data.frame(fheight = c(50, 55, 70, 75, 90))
prediction <- predict(modelHeight, newdata)

cat("\nPrediction on son's height when the father's height is given based on the linear model developed previously:\n")

## 
## Prediction on son's height when the father's height is given based on the linear model developed previously:

cbind(newdata, `Corresponding Son Height` = prediction)

The mean of the father’s height is 67.69 and that of the sons’ heights is 68.68.
From the above predicted values, we can infer that the sons of tall fathers tend to be tall but not as tall as their fathers. Conversely, sons of short fathers tend to be short but not as short as their fathers.
This is because of a concept of regression to the mean.

EC(a).

Assumptions are made about the distribution of the explanatory variable in the normal simple linear regression model.

The main assumptions are: * Linear relationship * Multivariate normality * No or little multicollinearity * No auto-correlation * Homoscedasticity

EC(b).

Discussing if \(R^2\) close to one not be used as evidence that the simple linear regression model is appropriate.

No. An r-squared value that is too high does not necessarily means that the model is appropriate.
It is because, the model can be overfitted and may not account for actual errors/outliers that may creep in when taking another sample.

EC(c).

Consider a regression of weight on height for a sample of adult males. Suppose the intercept is 5 kg. Does this imply that males of height 0 weigh 5 kg, on average? Would this imply that the simple linear regression model is meaningless?

No. That would be an absurd assumption to make. It is so because the regression line takes into account only the range of the data (explanatory variable). In this case, 0 height is not achievable and lies outside the range of the explanatory variable. Hence, we simply do not predict any height at 0 kg. Thus, even though the y-intercept is non-interpretable when explanatory variable is 0, the regression still makes sense for the values lying within the range of the data.

EC(d).

Suppose we had data on pairs \((X,Y)\) which gave the scatterplot been below. How would we approach the analysis?

Scatterplot

From the above scatter plot we see that the explanatory variable, after a certain value can have two corresponding responses, hence we can re-group the responses into two sub-groups.
In this case a single linear model would not be an appropriate model.