Exercise 4.1.

I plot them

plot (qanda$Handspan ~ qanda$Height, 
      col = "blue", cex = 0.5, pch = 16)

fit4.1 <- lm (qanda$Handspan ~ qanda$Height)
fit4.1
## 
## Call:
## lm(formula = qanda$Handspan ~ qanda$Height)
## 
## Coefficients:
##  (Intercept)  qanda$Height  
##     -0.03743       0.10340
abline (fit4.1, col = "red")

cor (qanda$Handspan, qanda$Height, method = "pearson")
## [1] 0.2273731

residual plot

model4.1 <- lm(qanda$Handspan ~ qanda$Height,
                          data = qanda)

scatterplot(fitted(model4.1), resid(model4.1))

Exercise 4.2.

Define the Function

para <- function (X, Y) {
  
  Ybar <- (1 / length(Y)) * sum(Y)
  xbar <- (1 / length(X)) * sum(X)
  
  bhat <- (sum( (X - xbar)* (Y - Ybar)) 
           / 
             sum( (X - xbar)^2)
           )
  
  ahat <- Ybar - bhat * xbar
  
  RSS <- sum((Y - ahat - bhat * X)^2)
  sigma_hat_2 = RSS/( length(X)-2)
  Rsquared <- 1 - RSS / sum((Y - Ybar)^2)
    
  return (list (ahat = ahat,
                bhat = bhat,
                sigma_hat_2 = sigma_hat_2,
                Rsquared = Rsquared))
}

try with x = 123, y = 357

try1 <- para(X = c (1, 2, 3),
             Y = c (3, 5, 7))
try1
## $ahat
## [1] 1
## 
## $bhat
## [1] 2
## 
## $sigma_hat_2
## [1] 0
## 
## $Rsquared
## [1] 1
try1_check <- lm(c (3, 5, 7) ~ c (1, 2, 3))
try1_check
## 
## Call:
## lm(formula = c(3, 5, 7) ~ c(1, 2, 3))
## 
## Coefficients:
## (Intercept)   c(1, 2, 3)  
##           1            2

try with x = height, y = handspan

try2 <- para (X = qanda$Height,
              Y = qanda$Handspan)
try2
## $ahat
## [1] -0.03742753
## 
## $bhat
## [1] 0.1034013
## 
## $sigma_hat_2
## [1] 16.85727
## 
## $Rsquared
## [1] 0.05169853
try2_check <- lm (qanda$Handspan ~
                    qanda$Height)
try2_check
## 
## Call:
## lm(formula = qanda$Handspan ~ qanda$Height)
## 
## Coefficients:
##  (Intercept)  qanda$Height  
##     -0.03743       0.10340
  • Comment: \(R^2\) is the coefficient of determination, which compare the variance of the residuals \(r_i\) with the variance of the original observation \(Y_i\) .

  • \(R^2 = 1\) indicates a perfect fit

  • \(R^2 = 0\) indicates that none of the variability in \(Y_1, ..., Y_n\) is explained by \(x_1, ..., x_n\), producing a horizontal regression line .

  • the value of 0.05 here indicates 5% of the variance in the dependent variable is explained by the independent variable(s) included in the model.

Exercise 4.3.

HH <- lm(formula = qanda$Handspan ~ qanda$Height ,
   data = qanda)

coef(summary(HH))[, "Estimate"]
##  (Intercept) qanda$Height 
##  -0.03742753   0.10340131
 # this gives me the alphahat = (intercept) and betahat = qanda$Height
# I can get sigma^2 from my function.
summary(HH)$r.squared 
## [1] 0.05169853
 # this is R^2

So we can infer that: the handspan of people with height 150 is about 15.5 cm, which is 4 cm smaller than those with 190cm in height.

here, it means as the height of people increase by 1cm, the handspan will be larger by 0.1034cm.

Exercise 4.4.

Get the estimated data:

alphahat <- coef(summary(HH)) ["(Intercept)", "Estimate"]
betahat <- coef(summary(HH)) ["qanda$Height", "Estimate"]
xi <- qanda$Height

Yhati <- alphahat + betahat * xi

Find the residuals

epsiloni <- qanda$Handspan - Yhati

summary(epsiloni)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -16.768  -1.925   1.395   0.000   2.690  10.666
summary( residuals(HH))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -16.768  -1.925   1.395   0.000   2.690  10.666
  • they are the same

Exercise 4.5.

plot (HH, which = c (1))

plot (HH, which = c (2))

plot (HH, which = c (3))

plot (HH, which = c (4))

Assumptions:

  1. Linearity (weak)

  2. Constant error variance (from the Scale-Location plot YES)

  3. Uncorrelatedness of errors (from the residual & fitted plot YES)

  4. Normality of errors (from the QQ plot YES)

Exercise 4.6.

In Country_data.csv,

cd <- read.csv("Country_data.csv")

pea3.5.1 <- cor(log(cd$age_median), log(cd$gdp_head),
                method = "pearson")
pea3.5.1 <- round (pea3.5.1 ,3)

spe3.5.1 <- cor(log(cd$age_median), log(cd$gdp_head),
                method = "spearman")
spe3.5.1 <- round (spe3.5.1 , 3)

plot (log(cd$age_median) ~ log(cd$gdp_head),
      xlab = paste("log gdp per head: pearson c.c.=",pea3.5.1,
                   "; spearman c.c. =",spe3.5.1),
      ylab = "log median of age",
      main = "log Medium of age and GDP per head",
      col = "blue", cex = 0.5, pch = 16)

fit3.5.1 <- lm (log(cd$age_median) ~ log(cd$gdp_head))

abline (fit3.5.1, col = "red")

Model checking 1: Linearity

  • The relationship between Y against X is linear.

Model checking 2: Constant error variance

plot (fit3.5.1, which = c (3), col = "blue", cex = 0.7, pch = 16)

The variability of residuals are a bit higher between 3.0 and 3.2.

Model checking 3: Uncorrelatedness of errors

scatterplot(fitted(fit3.5.1), resid(fit3.5.1))

The dashed line (patterns of residuals) is very close to the solid line (expected value of residuals when the model is accurate) so yes.

Model checking 4: Normality of errors

plot (fit3.5.1, which = c (2), col = "blue", cex = 0.7, pch = 16)

The standardized residuals fits the qq plot so yes.

General Comment:

The log of median of age and gdp per head fits the linear regression model.