Look at the whiteboard for more notes on the expectation and variance of random matrices
We looked a the Boston dataset last class. You will need to load it into this lab session from library MASS.
#library(MASS)
#library(ISLR)
#fix(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
Lets look at the variables we wish to model:Age and Lstat vs Median Home Value
# First I created a new dataframe (so that we aren't carrying all the other variables around)
Bostondf<-data.frame(lstat=Boston$lstat,
age=Boston$age,
medv=Boston$medv)
## MLR
# medv : median value of owner-occupied homes in $1000s
# lstat : lower status of the population (percent)
# age : proportion of owner-occupied units built prior to 1940's
pairs(Bostondf)
p<-2
You might want to try to make a 3D scatterplot. You might consider using the following packages (however, I’ve encountered some errors in the package downloads)
## 3D Scatterplot
#install.packages("scatterplot3d") # Install
library("scatterplot3d")
## Warning: package 'scatterplot3d' was built under R version 3.4.4
scatterplot3d(Bostondf, angle=75)
#install.packages("rgl")
#library(rgl)
## ERRORS
Use the lm() function that we used in SLR to fit a MLR, where median value is a function of lstat and age.
mlr_mod<-lm(medv ~ lstat+age, data=Boston)
summary(mlr_mod)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
Lets try to verify the model summary by learning about matrix algebra in R.
First, create your response vector, Y, and look at the dimensions. We will also want to keep track of the sample size.
Y<-as.matrix(Boston$medv)
head(Y)
## [,1]
## [1,] 24.0
## [2,] 21.6
## [3,] 34.7
## [4,] 33.4
## [5,] 36.2
## [6,] 28.7
dim(Y)
## [1] 506 1
n<-dim(Y)[1]
Now create the design matrix, X.
Hint: Dont forget to include the first column of one’s.
X<-matrix(c(rep(1, n),
Boston$lstat,
Boston$age),
ncol=3,
byrow=FALSE)
head(X)
## [,1] [,2] [,3]
## [1,] 1 4.98 65.2
## [2,] 1 9.14 78.9
## [3,] 1 4.03 61.1
## [4,] 1 2.94 45.8
## [5,] 1 5.33 54.2
## [6,] 1 5.21 58.7
dim(X)
## [1] 506 3
Before, we start this next exercise lets familiarize ourselves with matrix operations in R here
Now, calculate the vector of least squares estimate for the parameters of interest. Verify with the model output.
## Least squares estimates
betaHat<-solve(t(X)%*%X)%*%t(X)%*%Y
betaHat
## [,1]
## [1,] 33.22276053
## [2,] -1.03206856
## [3,] 0.03454434
Calculate the projection matrix (aka the “hat” matrix) and use it to find the fitted values. Compare this with the fitted model output.
## Projection matrix
P<-X%*%solve(t(X)%*%X)%*%t(X)
dim(P)
## [1] 506 506
## Verify Fitted Values
fit<-P%*%Y
head(fit)
## [,1]
## [1,] 30.33535
## [2,] 26.51520
## [3,] 31.17418
## [4,] 31.77061
## [5,] 29.59414
## [6,] 29.87344
head(mlr_mod$fitted)
## 1 2 3 4 5 6
## 30.33535 26.51520 31.17418 31.77061 29.59414 29.87344
Create an n by n identity matrix and use it to obtain the residuals (also using the components previously computed). Compare with the model residuals.
## Identity matrix
I<-diag(n)
dim(I)
## [1] 506 506
## Verify Residuals Values
res<-(I-P)%*%Y
head(res)
## [,1]
## [1,] -6.335350
## [2,] -4.915202
## [3,] 3.525817
## [4,] 1.629390
## [5,] 6.605862
## [6,] -1.173436
head(mlr_mod$residuals)
## 1 2 3 4 5 6
## -6.335350 -4.915202 3.525817 1.629390 6.605862 -1.173436
Use the residuals to obtain an estimate of s^2. First calculate the sum of squares residuals then calculate the mean squares residuals. Compare with this with the anova output.
## Sum of Square Residual
ss_res<-sum(t(res)%*%res)
ss_res
## [1] 19168.13
## ANOVA
anova(mlr_mod)
## Analysis of Variance Table
##
## Response: medv
## Df Sum Sq Mean Sq F value Pr(>F)
## lstat 1 23243.9 23243.9 609.955 < 2.2e-16 ***
## age 1 304.3 304.3 7.984 0.004907 **
## Residuals 503 19168.1 38.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## MS(Res) = s^2 (estimates sigma^2)
ms_res<-ss_res/(n-p-1)
ms_res
## [1] 38.10761
Find the variance-covariance matrix for the beta hat matrix. The square roots of the diagonal elements are the standard errors of their respective beta estimates.
## SE(betaHat)
var_betaHat<-ms_res*solve(t(X)%*%X)
var_betaHat
## [,1] [,2] [,3]
## [1,] 0.534137493 -0.0050496041 -0.0057591486
## [2,] -0.005049604 0.0023223465 -0.0003548703
## [3,] -0.005759149 -0.0003548703 0.0001494620
se_betaHat<-sqrt(diag(var_betaHat))
se_betaHat
## [1] 0.73084711 0.04819073 0.01222547
Now we have all the tools to do inference!