2024-01-30

Exercise 4.1.

  • Examine the two variables in your dataset, perhaps using plots or summary statistics.

  • Start a new program. Import the data and create data objects Height and Handspan.

  • Plot the bivariate data in a scatter plot.

  • Does there appear to be any relationship between the two variables?

I plot them:

Look at the correlation coefficient

## The pearson correlation coefficient is: 0.2273731
## The Spearman correlation coefficient is: 0.3390771

The residual plot:

  • There appear to be not a strong relationship between Handspan and Height.

Exercise 4.2.

  • Create a function which takes two vectors of equal length and outputs \(\hat{\alpha}\), \(\hat{\beta}\) and \(\hat{\sigma^2}\) using the formulae from the STAT0002 notes.

  • Test the function using x = (1, 2, 3) and Y = (3, 5, 7) - you should be able to predict its outputs for that data.

  • Run the function against the Handspan (Y) Height (x) data you have downloaded. Update the function so that it also outputs \(R^2\), the coefficient of determination, which describes how much of the variation in Y is explained by x.

Define the Function

para <- function (X, Y) {
  
  Ybar <- (1 / length(Y)) * sum(Y)
  xbar <- (1 / length(X)) * sum(X)
  
  bhat <- (sum( (X - xbar)* (Y - Ybar)) 
           / 
             sum( (X - xbar)^2)
           )
  ahat <- Ybar - bhat * xbar
  
  RSS <- sum((Y - ahat - bhat * X)^2)
  sigma_hat_2 = RSS/( length(X)-2)
  Rsquared <- 1 - RSS / sum((Y - Ybar)^2)
    
  return (list (ahat = ahat,
                bhat = bhat,
                sigma_hat_2 = sigma_hat_2,
                Rsquared = Rsquared))
}

try with x = 123, y = 357

try1 <- para(X = c (1, 2, 3),
             Y = c (3, 5, 7))
try1
## $ahat
## [1] 1
## 
## $bhat
## [1] 2
## 
## $sigma_hat_2
## [1] 0
## 
## $Rsquared
## [1] 1

try with x = 123, y = 357

try1_check <- lm(c (3, 5, 7) ~ c (1, 2, 3))
try1_check
## 
## Call:
## lm(formula = c(3, 5, 7) ~ c(1, 2, 3))
## 
## Coefficients:
## (Intercept)   c(1, 2, 3)  
##           1            2

try with x = height, y = handspan

try2 <- para (X = qanda$Height,
              Y = qanda$Handspan)
try2
## $ahat
## [1] -0.03742753
## 
## $bhat
## [1] 0.1034013
## 
## $sigma_hat_2
## [1] 16.85727
## 
## $Rsquared
## [1] 0.05169853

try with x = height, y = handspan

try2_check <- lm (qanda$Handspan ~
                    qanda$Height)
try2_check
## 
## Call:
## lm(formula = qanda$Handspan ~ qanda$Height)
## 
## Coefficients:
##  (Intercept)  qanda$Height  
##     -0.03743       0.10340

Comments

  • Comment: \(R^2\) is the coefficient of determination, which compare the variance of the residuals \(r_i\) with the variance of the original observation \(Y_i\) .
  • \(R^2 = 1\) indicates a perfect fit
  • \(R^2 = 0\) indicates that none of the variability in \(Y_1, ..., Y_n\) is explained by \(x_1, ..., x_n\), producing a horizontal regression line .
  • the value of 0.05 here indicates 5% of the variance in the dependent variable is explained by the independent variable(s) included in the model.

Exercise 4.3.

  • Find \(\hat{\alpha}\), \(\hat{\beta}\) and \(\hat{\sigma^2}\) and \(R^2\) in the above output from R (ignore everything else for now: all will become clear as your degree progresses) and

  • compare the values with those from your function in Exercise 4.2.

  • Write a paragraph which explains what the values of these estimates means for our data (you could write it in RStudio at the bottom of your program).

Find \(\hat{\alpha}\) & \(\hat{\beta}\)

HH <- lm(formula = qanda$Handspan ~ qanda$Height ,
   data = qanda)

coef(summary(HH))[, "Estimate"]
##  (Intercept) qanda$Height 
##  -0.03742753   0.10340131
 # this gives me the alphahat = (intercept) and betahat = qanda$Height

Find \(\hat{\sigma^2}\)

  • I can get sigma^2 from my function.

Find \(R^2\)

summary(HH)$r.squared 
## [1] 0.05169853
 # this is R^2

Comments:

  • \(\hat{\alpha}\): the expected value of Y when X = 0. But in this case, it is meaningless for people to have 0 Height.
  • So we can infer that: the handspan of people with height 150 is about 15.5 cm, which is 4 cm smaller than those with 190cm in height.
  • \(\hat{\beta}\): the amount by which the mean of Y given X = x increases when x is increased by 1 unit:
  • here, it means as the height of people increase by 1cm, the handspan will be larger by 0.1034cm.
  • \(\hat{\sigma^2}\): error variance - the variability of the response about the linear regression line (in the vertical direction).

Exercise 4.4.

  • Use the x values in the data (ie Height) and the parameters estimated in hand_height_model to work outa vector of \(\hat{Y}_i\)s.

  • Note that you can get the parameter estimates using coef(HH) (you can also use HH$coefficients but it is not preferred) so you don’t have to type the numbers in.

  • Now use the Y values in the data (ie Handspan) to derive the residuals. Check your answer against residuals(HH).

Get the estimated data:

alphahat <- coef(summary(HH)) ["(Intercept)", "Estimate"]
betahat <- coef(summary(HH)) ["qanda$Height", "Estimate"]
xi <- qanda$Height

Yhati <- alphahat + betahat * xi

Find the residuals

epsiloni <- qanda$Handspan - Yhati

summary(epsiloni)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -16.768  -1.925   1.395   0.000   2.690  10.666
summary( residuals(HH))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -16.768  -1.925   1.395   0.000   2.690  10.666
  • they are the same

Exercise 4.5.

  • Comment on the model assumptions, as set out in STAT0002, from the plots in Figure 4.1.

  • What conclusion can you draw about our simple linear model for linking handspan to height?

which = 1: residual vs fitted

plot (HH, which = c (1))

which = 2 normal Q-Q plot

plot (HH, which = c (2))

which = 3 scale-location plot

plot (HH, which = c (3))

which = 4 Cook’s distance plot

plot (HH, which = c (4))

Assumptions:

  • Linearity (weak)
  • Constant error variance (from the Scale-Location plot YES)
  • Uncorrelatedness of errors (from the residual & fitted plot YES)
  • Normality of errors (from the QQ plot YES)

Exercise 4.6.

In Country_data.csv,

  • build a linear model for gdp_head based upon an appropriate linear model, using all relevant covariates, from Exercise 3.5.

  • Check the model assumptions using residuals plots.

  • Comment on your model.

Plot it!

Model checking 1: Linearity

  • The relationship between Y against X is linear.

Model checking 2: Constant error variance

Model checking 2: Constant error variance

The variability of residuals are a bit higher between 3.0 and 3.2.

Model checking 3: Uncorrelatedness of errors

Model checking 3: Uncorrelatedness of errors

The dashed line (patterns of residuals) is very close to the solid line (expected value of residuals when the model is accurate) so yes.

Model checking 4: Normality of errors

Model checking 4: Normality of errors

The standardized residuals fits the qq plot so yes.

General Comment:

The log of median of age and gdp per head fits the linear regression model.