First a little more theory …

Look at the whiteboard for more notes on the expectation and variance of random matrices

Housing in Boston

We looked a the Boston dataset last class. You will need to load it into this lab session from library MASS.

#library(MASS)
#library(ISLR)
#fix(Boston)

names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

Exercise 1 : Looking at the data

Lets look at the variables we wish to model:Age and Lstat vs Median Home Value

# First I created a new dataframe (so that we aren't carrying all the other variables around)
Bostondf<-data.frame(lstat=Boston$lstat, 
                     age=Boston$age,
                     medv=Boston$medv)

## MLR
# medv : median value of owner-occupied homes in $1000s
# lstat : lower status of the population (percent)
# age : proportion of owner-occupied units built prior to 1940's
pairs(Bostondf)

p<-2

You might want to try to make a 3D scatterplot. You might consider using the following packages (however, I’ve encountered some errors in the package downloads)

## 3D Scatterplot
#install.packages("scatterplot3d") # Install
library("scatterplot3d") 
## Warning: package 'scatterplot3d' was built under R version 3.4.4
scatterplot3d(Bostondf, angle=75)

#install.packages("rgl")
#library(rgl)
## ERRORS

Exercise 2 : Using the lm()

Use the lm() function that we used in SLR to fit a MLR, where median value is a function of lstat and age.

mlr_mod<-lm(medv ~ lstat+age, data=Boston)
summary(mlr_mod)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

Exercise 3 : (Using Matrices) The Response Vector

Lets try to verify the model summary by learning about matrix algebra in R.

First, create your response vector, Y, and look at the dimensions. We will also want to keep track of the sample size.

Y<-as.matrix(Boston$medv)
head(Y)
##      [,1]
## [1,] 24.0
## [2,] 21.6
## [3,] 34.7
## [4,] 33.4
## [5,] 36.2
## [6,] 28.7
dim(Y)
## [1] 506   1
n<-dim(Y)[1]

Exercise 4: (Using Matrices) Design Matrix

Now create the design matrix, X.

Hint: Dont forget to include the first column of one’s.

X<-matrix(c(rep(1, n),
            Boston$lstat, 
            Boston$age), 
          ncol=3, 
          byrow=FALSE)
head(X)
##      [,1] [,2] [,3]
## [1,]    1 4.98 65.2
## [2,]    1 9.14 78.9
## [3,]    1 4.03 61.1
## [4,]    1 2.94 45.8
## [5,]    1 5.33 54.2
## [6,]    1 5.21 58.7
dim(X)
## [1] 506   3

Exercise 5: MLR Estimates

Before, we start this next exercise lets familiarize ourselves with matrix operations in R here

Now, calculate the vector of least squares estimate for the parameters of interest. Verify with the model output.

## Least squares estimates
betaHat<-solve(t(X)%*%X)%*%t(X)%*%Y
betaHat
##             [,1]
## [1,] 33.22276053
## [2,] -1.03206856
## [3,]  0.03454434

Exercise 6: The Projection Matrix

Calculate the projection matrix (aka the “hat” matrix) and use it to find the fitted values. Compare this with the fitted model output.

## Projection matrix
P<-X%*%solve(t(X)%*%X)%*%t(X)
dim(P)
## [1] 506 506
## Verify Fitted Values
fit<-P%*%Y
head(fit)
##          [,1]
## [1,] 30.33535
## [2,] 26.51520
## [3,] 31.17418
## [4,] 31.77061
## [5,] 29.59414
## [6,] 29.87344
head(mlr_mod$fitted)
##        1        2        3        4        5        6 
## 30.33535 26.51520 31.17418 31.77061 29.59414 29.87344

Exercise 7: Calculating Residuals

Create an n by n identity matrix and use it to obtain the residuals (also using the components previously computed). Compare with the model residuals.

## Identity matrix 
I<-diag(n)
dim(I)
## [1] 506 506
## Verify Residuals Values
res<-(I-P)%*%Y
head(res)
##           [,1]
## [1,] -6.335350
## [2,] -4.915202
## [3,]  3.525817
## [4,]  1.629390
## [5,]  6.605862
## [6,] -1.173436
head(mlr_mod$residuals)
##         1         2         3         4         5         6 
## -6.335350 -4.915202  3.525817  1.629390  6.605862 -1.173436

Exercise 8: Sum of Squared Residuals

Use the residuals to obtain an estimate of s^2. First calculate the sum of squares residuals then calculate the mean squares residuals. Compare with this with the anova output.

## Sum of Square Residual
ss_res<-sum(t(res)%*%res)
ss_res
## [1] 19168.13
## ANOVA
anova(mlr_mod)
## Analysis of Variance Table
## 
## Response: medv
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## lstat       1 23243.9 23243.9 609.955 < 2.2e-16 ***
## age         1   304.3   304.3   7.984  0.004907 ** 
## Residuals 503 19168.1    38.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## MS(Res) = s^2 (estimates sigma^2)
ms_res<-ss_res/(n-p-1)
ms_res
## [1] 38.10761

Exercise 9: Variance-Covariance Matrix for Beta

Find the variance-covariance matrix for the beta hat matrix. The square roots of the diagonal elements are the standard errors of their respective beta estimates.

## SE(betaHat)
var_betaHat<-ms_res*solve(t(X)%*%X)
var_betaHat
##              [,1]          [,2]          [,3]
## [1,]  0.534137493 -0.0050496041 -0.0057591486
## [2,] -0.005049604  0.0023223465 -0.0003548703
## [3,] -0.005759149 -0.0003548703  0.0001494620
se_betaHat<-sqrt(diag(var_betaHat))
se_betaHat
## [1] 0.73084711 0.04819073 0.01222547

Now we have all the tools to do inference!