Setup:

Here I am installing the necessary packages and loading the required libraries.

# Load standard libraries
library(tidyverse)
library(Sleuth3) # Contains data for problemset
library(UsingR) # Contains data for problemset
library(MASS) # Modern applied statistics functions
library(ggplot2)

Problem 1:

Davis et al. (1998) collected data on the proportion of births that were male in Denmark, the Netherlands, Canada, and the United States for selected years. Davis et al. argue that the proportion of male births is declining in these countries. We will explore this hypothesis.

# Import male births data
malebirths <- Sleuth3::ex0724

1a.

Using the function in to fit four (one per country) simple linear regression models of the yearly proportion of males births as a function of the year and obtain the least squares fits.

plotRegression <- function (Country,n) {

  fit <- lm(Country~Year, malebirths)
  slope <- signif(summary(fit)$coefficients[2],5)
  yIntercept <- signif(summary(fit)$coefficients[1],5)
  
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
    ylab(paste("Proportion of male births in", n))+
    labs(title=paste("\nModel: y = ",
                    yIntercept,
                    ifelse(slope<0, slope, paste("+", abs(slope))), 
                    "* x"))
  
  #labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
   #                  "Intercept =",signif(fit$coef[[1]],5 ),
    #                 " Slope =",signif(fit$coef[[2]], 5),
     #                " P =",signif(summary(fit)$coef[2,4], 5)),
      #sub = paste("For country",n))
}

col <- colnames(malebirths[])

for(i in 2:5){
  print(plotRegression(malebirths[,i], col[i]))
}

1b.

Obtaining the \(t\)-statistic for the test that the slopes of the regression lines are zero, for each of the four countries.

for(i in 2:5){
  cat("\nT-statistic for", col[i], "when fitting a linear model = ",
      summary(lm(malebirths[,i]~Year, malebirths))$coefficients[2,3])
  cat("\nCorresponding P-value =",
      summary(lm(malebirths[,i]~Year, malebirths))$coefficients[2,4])
  cat("\n")
}
## 
## T-statistic for Denmark when fitting a linear model =  -2.072598
## Corresponding P-value = 0.04423828
## 
## T-statistic for Netherlands when fitting a linear model =  -5.710196
## Corresponding P-value = 9.636921e-07
## 
## T-statistic for Canada when fitting a linear model =  -4.016653
## Corresponding P-value = 0.0007375947
## 
## T-statistic for USA when fitting a linear model =  -5.779212
## Corresponding P-value = 1.439109e-05

Problem 2:

Regression was originally used by Francis Galton to study the relationship between parents and children. One relationship he considered was height. Predicting if a man’s height based on the height of his father.

# Import and look at the height data
heightData <- tbl_df(get("father.son"))

2a.

Performing an exploratory analysis of the dataset.

str(heightData) #getting the structure of data
## Classes 'tbl_df', 'tbl' and 'data.frame':    1078 obs. of  2 variables:
##  $ fheight: num  65 63.3 65 65.8 61.1 ...
##  $ sheight: num  59.8 63.2 63.3 62.8 64.3 ...
cat("\n\nA statistical summary of the fathers' height and the sons' heights\n")
## 
## 
## A statistical summary of the fathers' height and the sons' heights
summary(heightData) #getting the summary of each of the two variables
##     fheight         sheight     
##  Min.   :59.01   Min.   :58.51  
##  1st Qu.:65.79   1st Qu.:66.93  
##  Median :67.77   Median :68.62  
##  Mean   :67.69   Mean   :68.68  
##  3rd Qu.:69.60   3rd Qu.:70.47  
##  Max.   :75.43   Max.   :78.36
ggplot(heightData, aes(x = fheight, y = sheight)) + 
  geom_point() +
  #stat_smooth(method = "lm", col = "red")+
  xlab("Father's Height")+
  ylab("Son's Height")+
  labs(title="Father's Height vs Son's Height")

* From the above visualization, it is clear that the son’s heights are linearly correlated witht the father’s heights and, in general, the higher the height of the father, the higher is the height of the son.

2b.

Using the function in R to fit a simple linear regression model to predict son’s height as a function of father’s height.

modelHeight <- lm(sheight~fheight, heightData)
cat("\nModel: \n y = ",summary(modelHeight)$coefficients[1], "+", 
    summary(modelHeight)$coefficients[2], "* x")
## 
## Model: 
##  y =  33.8866 + 0.514093 * x

2c.

Finding the 95% confidence intervals for the estimates.

confint(modelHeight, level = 0.95)
##                  2.5 %     97.5 %
## (Intercept) 30.2912126 37.4819961
## fheight      0.4610188  0.5671673

2d.

Producing a visualization of the data and the least squares regression line.

ggplot(heightData, aes(x = fheight, y = sheight)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")+
  xlab("Father's Height")+
  ylab("Son's Height")+
  labs(title="Father's Height vs Son's Height")

2e.

Producing a visualization of the residuals versus the fitted values.

#par(mfrow=c(2,2))
plot(modelHeight, which = c(1,1))

2f.

Using the model fitted in part (b) to predict the height was 5 males whose father are 50, 55, 70, 75, and 90 inches respectively.

newdata <- data.frame(fheight = c(50, 55, 70, 75, 90))
prediction <- predict(modelHeight, newdata)

cat("\nPrediction on son's height when the father's height is given based on the linear model developed previously:\n")
## 
## Prediction on son's height when the father's height is given based on the linear model developed previously:
cbind(newdata, `Corresponding Son Height` = prediction)

EC(a).

Assumptions are made about the distribution of the explanatory variable in the normal simple linear regression model.

The main assumptions are: * Linear relationship * Multivariate normality * No or little multicollinearity * No auto-correlation * Homoscedasticity

EC(b).

Discussing if \(R^2\) close to one not be used as evidence that the simple linear regression model is appropriate.

EC(c).

Consider a regression of weight on height for a sample of adult males. Suppose the intercept is 5 kg. Does this imply that males of height 0 weigh 5 kg, on average? Would this imply that the simple linear regression model is meaningless?

EC(d).

Suppose we had data on pairs \((X,Y)\) which gave the scatterplot been below. How would we approach the analysis?

Scatterplot

Scatterplot