Examine the two variables in your dataset, perhaps using plots or summary statistics.
Start a new program. Import the data and create data objects Height and Handspan.
Plot the bivariate data in a scatter plot.
Does there appear to be any relationship between the two variables?
I plot them
plot (qanda$Handspan ~ qanda$Height,
col = "blue", cex = 0.5, pch = 16)
fit4.1 <- lm (qanda$Handspan ~ qanda$Height)
fit4.1
##
## Call:
## lm(formula = qanda$Handspan ~ qanda$Height)
##
## Coefficients:
## (Intercept) qanda$Height
## -0.03743 0.10340
abline (fit4.1, col = "red")
cor (qanda$Handspan, qanda$Height, method = "pearson")
## [1] 0.2273731
residual plot
model4.1 <- lm(qanda$Handspan ~ qanda$Height,
data = qanda)
scatterplot(fitted(model4.1), resid(model4.1))
Create a function which takes two vectors of equal length and outputs \(\hat{\alpha}\), \(\hat{\beta}\) and \(\hat{\sigma^2}\) using the formulae from the STAT0002 notes.
Test the function using x = (1, 2, 3) and Y = (3, 5, 7) - you should be able to predict its outputs for that data.
Run the function against the Handspan (Y) Height (x) data you have downloaded. Update the function so that it also outputs \(R^2\), the coefficient of determination, which describes how much of the variation in Y is explained by x.
para <- function (X, Y) {
Ybar <- (1 / length(Y)) * sum(Y)
xbar <- (1 / length(X)) * sum(X)
bhat <- (sum( (X - xbar)* (Y - Ybar))
/
sum( (X - xbar)^2)
)
ahat <- Ybar - bhat * xbar
RSS <- sum((Y - ahat - bhat * X)^2)
sigma_hat_2 = RSS/( length(X)-2)
Rsquared <- 1 - RSS / sum((Y - Ybar)^2)
return (list (ahat = ahat,
bhat = bhat,
sigma_hat_2 = sigma_hat_2,
Rsquared = Rsquared))
}
try1 <- para(X = c (1, 2, 3),
Y = c (3, 5, 7))
try1
## $ahat
## [1] 1
##
## $bhat
## [1] 2
##
## $sigma_hat_2
## [1] 0
##
## $Rsquared
## [1] 1
try1_check <- lm(c (3, 5, 7) ~ c (1, 2, 3))
try1_check
##
## Call:
## lm(formula = c(3, 5, 7) ~ c(1, 2, 3))
##
## Coefficients:
## (Intercept) c(1, 2, 3)
## 1 2
try2 <- para (X = qanda$Height,
Y = qanda$Handspan)
try2
## $ahat
## [1] -0.03742753
##
## $bhat
## [1] 0.1034013
##
## $sigma_hat_2
## [1] 16.85727
##
## $Rsquared
## [1] 0.05169853
try2_check <- lm (qanda$Handspan ~
qanda$Height)
try2_check
##
## Call:
## lm(formula = qanda$Handspan ~ qanda$Height)
##
## Coefficients:
## (Intercept) qanda$Height
## -0.03743 0.10340
Comment: \(R^2\) is the coefficient of determination, which compare the variance of the residuals \(r_i\) with the variance of the original observation \(Y_i\) .
\(R^2 = 1\) indicates a perfect fit
\(R^2 = 0\) indicates that none of the variability in \(Y_1, ..., Y_n\) is explained by \(x_1, ..., x_n\), producing a horizontal regression line .
the value of 0.05 here indicates 5% of the variance in the dependent variable is explained by the independent variable(s) included in the model.
Find \(\hat{\alpha}\), \(\hat{\beta}\) and \(\hat{\sigma^2}\) and \(R^2\) in the above output from R (ignore everything else for now: all will become clear as your degree progresses) and
compare the values with those from your function in Exercise 4.2.
Write a paragraph which explains what the values of these estimates means for our data (you could write it in RStudio at the bottom of your program).
HH <- lm(formula = qanda$Handspan ~ qanda$Height ,
data = qanda)
coef(summary(HH))[, "Estimate"]
## (Intercept) qanda$Height
## -0.03742753 0.10340131
# this gives me the alphahat = (intercept) and betahat = qanda$Height
# I can get sigma^2 from my function.
summary(HH)$r.squared
## [1] 0.05169853
# this is R^2
So we can infer that: the handspan of people with height 150 is about 15.5 cm, which is 4 cm smaller than those with 190cm in height.
here, it means as the height of people increase by 1cm, the handspan will be larger by 0.1034cm.
Use the x values in the data (ie Height) and the parameters estimated in hand_height_model to work outa vector of \(\hat{Y}_i\)s.
Note that you can get the parameter estimates using coef(HH) (you can also use HH$coefficients but it is not preferred) so you don’t have to type the numbers in.
Now use the Y values in the data (ie Handspan) to derive the residuals. Check your answer against residuals(HH).
alphahat <- coef(summary(HH)) ["(Intercept)", "Estimate"]
betahat <- coef(summary(HH)) ["qanda$Height", "Estimate"]
xi <- qanda$Height
Yhati <- alphahat + betahat * xi
epsiloni <- qanda$Handspan - Yhati
summary(epsiloni)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16.768 -1.925 1.395 0.000 2.690 10.666
summary( residuals(HH))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16.768 -1.925 1.395 0.000 2.690 10.666
Comment on the model assumptions, as set out in STAT0002, from the plots in Figure 4.1.
What conclusion can you draw about our simple linear model for linking handspan to height?
plot (HH, which = c (1))
plot (HH, which = c (2))
plot (HH, which = c (3))
plot (HH, which = c (4))
Linearity (weak)
Constant error variance (from the Scale-Location plot YES)
Uncorrelatedness of errors (from the residual & fitted plot YES)
Normality of errors (from the QQ plot YES)
In Country_data.csv,
build a linear model for gdp_head based upon an appropriate linear model, using all relevant covariates, from Exercise 3.5.
Check the model assumptions using residuals plots.
Comment on your model.
cd <- read.csv("Country_data.csv")
pea3.5.1 <- cor(log(cd$age_median), log(cd$gdp_head),
method = "pearson")
pea3.5.1 <- round (pea3.5.1 ,3)
spe3.5.1 <- cor(log(cd$age_median), log(cd$gdp_head),
method = "spearman")
spe3.5.1 <- round (spe3.5.1 , 3)
plot (log(cd$age_median) ~ log(cd$gdp_head),
xlab = paste("log gdp per head: pearson c.c.=",pea3.5.1,
"; spearman c.c. =",spe3.5.1),
ylab = "log median of age",
main = "log Medium of age and GDP per head",
col = "blue", cex = 0.5, pch = 16)
fit3.5.1 <- lm (log(cd$age_median) ~ log(cd$gdp_head))
abline (fit3.5.1, col = "red")
plot (fit3.5.1, which = c (3), col = "blue", cex = 0.7, pch = 16)
The variability of residuals are a bit higher between 3.0 and 3.2.
plot (fit3.5.1, which = c (2), col = "blue", cex = 0.7, pch = 16)
The standardized residuals fits the qq plot so yes.
The log of median of age and gdp per head fits the linear regression model.