Exercise 4

Exercise 4.1.

Examine the two variables in your dataset, perhaps using plots or summary statistics.
Start a new program. Import the data and create data objects Height and Handspan.
Plot the bivariate data in a scatter plot.
Does there appear to be any relationship between the two variables?

I plot them

plot (qanda$Handspan ~ qanda$Height, 
      col = "blue", cex = 0.5, pch = 16)

fit4.1 <- lm (qanda$Handspan ~ qanda$Height)
fit4.1

## 
## Call:
## lm(formula = qanda$Handspan ~ qanda$Height)
## 
## Coefficients:
##  (Intercept)  qanda$Height  
##     -0.03743       0.10340

abline (fit4.1, col = "red")

cor (qanda$Handspan, qanda$Height, method = "pearson")

## [1] 0.2273731

residual plot

model4.1 <- lm(qanda$Handspan ~ qanda$Height,
                          data = qanda)

scatterplot(fitted(model4.1), resid(model4.1))

There appear to be not a strong relationship between Handspan and Height.

Exercise 4.2.

Create a function which takes two vectors of equal length and outputs $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\sigma^2}$ using the formulae from the STAT0002 notes.
Test the function using x = (1, 2, 3) and Y = (3, 5, 7) - you should be able to predict its outputs for that data.
Run the function against the Handspan (Y) Height (x) data you have downloaded. Update the function so that it also outputs $R^2$, the coefficient of determination, which describes how much of the variation in Y is explained by x.

Define the Function

para <- function (X, Y) {
  
  Ybar <- (1 / length(Y)) * sum(Y)
  xbar <- (1 / length(X)) * sum(X)
  
  bhat <- (sum( (X - xbar)* (Y - Ybar)) 
           / 
             sum( (X - xbar)^2)
           )
  
  ahat <- Ybar - bhat * xbar
  
  RSS <- sum((Y - ahat - bhat * X)^2)
  sigma_hat_2 = RSS/( length(X)-2)
  Rsquared <- 1 - RSS / sum((Y - Ybar)^2)
    
  return (list (ahat = ahat,
                bhat = bhat,
                sigma_hat_2 = sigma_hat_2,
                Rsquared = Rsquared))
}

try with x = 123, y = 357

try1 <- para(X = c (1, 2, 3),
             Y = c (3, 5, 7))
try1

## $ahat
## [1] 1
## 
## $bhat
## [1] 2
## 
## $sigma_hat_2
## [1] 0
## 
## $Rsquared
## [1] 1

try1_check <- lm(c (3, 5, 7) ~ c (1, 2, 3))
try1_check

## 
## Call:
## lm(formula = c(3, 5, 7) ~ c(1, 2, 3))
## 
## Coefficients:
## (Intercept)   c(1, 2, 3)  
##           1            2

try with x = height, y = handspan

try2 <- para (X = qanda$Height,
              Y = qanda$Handspan)
try2

## $ahat
## [1] -0.03742753
## 
## $bhat
## [1] 0.1034013
## 
## $sigma_hat_2
## [1] 16.85727
## 
## $Rsquared
## [1] 0.05169853

try2_check <- lm (qanda$Handspan ~
                    qanda$Height)
try2_check

## 
## Call:
## lm(formula = qanda$Handspan ~ qanda$Height)
## 
## Coefficients:
##  (Intercept)  qanda$Height  
##     -0.03743       0.10340

Comment: $R^2$ is the coefficient of determination, which compare the variance of the residuals $r_i$ with the variance of the original observation $Y_i$ .
$R^2 = 1$ indicates a perfect fit
$R^2 = 0$ indicates that none of the variability in $Y_1, ..., Y_n$ is explained by $x_1, ..., x_n$, producing a horizontal regression line .
the value of 0.05 here indicates 5% of the variance in the dependent variable is explained by the independent variable(s) included in the model.

Exercise 4.3.

Find $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\sigma^2}$ and $R^2$ in the above output from R (ignore everything else for now: all will become clear as your degree progresses) and
compare the values with those from your function in Exercise 4.2.
Write a paragraph which explains what the values of these estimates means for our data (you could write it in RStudio at the bottom of your program).

HH <- lm(formula = qanda$Handspan ~ qanda$Height ,
   data = qanda)

coef(summary(HH))[, "Estimate"]

##  (Intercept) qanda$Height 
##  -0.03742753   0.10340131

 # this gives me the alphahat = (intercept) and betahat = qanda$Height

# I can get sigma^2 from my function.

summary(HH)$r.squared

## [1] 0.05169853

 # this is R^2

$\hat{\alpha}$: the expected value of Y when X = 0. But in this case, it is meaningless for people to have 0 Height.

So we can infer that: the handspan of people with height 150 is about 15.5 cm, which is 4 cm smaller than those with 190cm in height.

$\hat{\beta}$: the amount by which the mean of Y given X = x increases when x is increased by 1 unit:

here, it means as the height of people increase by 1cm, the handspan will be larger by 0.1034cm.

$\hat{\sigma^2}$: error variance - the variability of the response about the linear regression line (in the vertical direction).

Exercise 4.4.

Use the x values in the data (ie Height) and the parameters estimated in hand_height_model to work outa vector of $\hat{Y}_i$s.
Note that you can get the parameter estimates using coef(HH) (you can also use HH$coefficients but it is not preferred) so you don’t have to type the numbers in.
Now use the Y values in the data (ie Handspan) to derive the residuals. Check your answer against residuals(HH).

Get the estimated data:

alphahat <- coef(summary(HH)) ["(Intercept)", "Estimate"]
betahat <- coef(summary(HH)) ["qanda$Height", "Estimate"]
xi <- qanda$Height

Yhati <- alphahat + betahat * xi

Find the residuals

epsiloni <- qanda$Handspan - Yhati

summary(epsiloni)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -16.768  -1.925   1.395   0.000   2.690  10.666

summary( residuals(HH))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -16.768  -1.925   1.395   0.000   2.690  10.666

they are the same

Exercise 4.5.

Comment on the model assumptions, as set out in STAT0002, from the plots in Figure 4.1.
What conclusion can you draw about our simple linear model for linking handspan to height?

plot (HH, which = c (1))

plot (HH, which = c (2))

plot (HH, which = c (3))

plot (HH, which = c (4))

Assumptions:

Linearity (weak)
Constant error variance (from the Scale-Location plot YES)
Uncorrelatedness of errors (from the residual & fitted plot YES)
Normality of errors (from the QQ plot YES)

Exercise 4.6.

In Country_data.csv,

build a linear model for gdp_head based upon an appropriate linear model, using all relevant covariates, from Exercise 3.5.
Check the model assumptions using residuals plots.
Comment on your model.

cd <- read.csv("Country_data.csv")

pea3.5.1 <- cor(log(cd$age_median), log(cd$gdp_head),
                method = "pearson")
pea3.5.1 <- round (pea3.5.1 ,3)

spe3.5.1 <- cor(log(cd$age_median), log(cd$gdp_head),
                method = "spearman")
spe3.5.1 <- round (spe3.5.1 , 3)

plot (log(cd$age_median) ~ log(cd$gdp_head),
      xlab = paste("log gdp per head: pearson c.c.=",pea3.5.1,
                   "; spearman c.c. =",spe3.5.1),
      ylab = "log median of age",
      main = "log Medium of age and GDP per head",
      col = "blue", cex = 0.5, pch = 16)

fit3.5.1 <- lm (log(cd$age_median) ~ log(cd$gdp_head))

abline (fit3.5.1, col = "red")

Model checking 1: Linearity

The relationship between Y against X is linear.

Model checking 2: Constant error variance

plot (fit3.5.1, which = c (3), col = "blue", cex = 0.7, pch = 16)

The variability of residuals are a bit higher between 3.0 and 3.2.

Model checking 3: Uncorrelatedness of errors

scatterplot(fitted(fit3.5.1), resid(fit3.5.1))

The dashed line (patterns of residuals) is very close to the solid line (expected value of residuals when the model is accurate) so yes.

Model checking 4: Normality of errors

plot (fit3.5.1, which = c (2), col = "blue", cex = 0.7, pch = 16)

The standardized residuals fits the qq plot so yes.

General Comment:

The log of median of age and gdp per head fits the linear regression model.