This Rmd is partly based on an Rmd from the book “An Introduction to Statistical Learning with Applications in R.”
We will use the Boston data set, which is available in the package MASS. (If this package is not available on your computer, you can install it by running install.packages("MASS").)
rm(list = ls()) #remove everything in the environment
library(MASS)
Now the Boston dataset is available as part of the MASS package.
The Boston data set records medv (median house value) for \(506\) census tracts in Boston. We will seek to predict medv using predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status). [To find out more about the data set, one can type ?Boston.]
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Suppose we want to run the regression of medv on the lstat and age variables. First, we create the vector of responses and the matrix of explanatory variables.
We will use the tidyverse package for selecting variables. (Install it if it is not yet available.) We also need to convert the data frames to matrices, which is done using the data.matrix() function. (This loses some information about the types of the variables.)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
y = select(Boston, medv)
X = select(Boston, age, lstat)
y = data.matrix(y)
X = data.matrix(X)
#adding a column of ones
X = cbind(matrix(data = 1, nrow = nrow(X), ncol = 1), X)
Now the matrices \(X^t X\) and \(X^t y\) must be calculated. This can be done in two ways: either by using the transpose function t() together with the matrix multiplication operator %*%, or by using the crossprod() function.
t(X) %*% X #or, alternatively
## age lstat
## 506.00 34698.9 6402.45
## age 34698.90 2779614.6 500191.63
## lstat 6402.45 500191.6 106762.96
crossprod(X, X) #or, even simpler
## age lstat
## 506.00 34698.9 6402.45
## age 34698.90 2779614.6 500191.63
## lstat 6402.45 500191.6 106762.96
crossprod(X)
## age lstat
## 506.00 34698.9 6402.45
## age 34698.90 2779614.6 500191.63
## lstat 6402.45 500191.6 106762.96
Now it is easy to find the estimators.
beta = solve(crossprod(X), crossprod(X,y))
beta
## medv
## 33.22276053
## age 0.03454434
## lstat -1.03206856
What about standard errors? These are the square roots of the diagonal terms of the matrix \(\sigma^2 (X^t X)^{-1}\).
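Here \(\sigma\) is estimated from the residuals: \(\hat{\sigma}^2 = (y - X\hat{\beta})^t (y - X\hat{\beta})/(n - p)\), where \(n = 506\) is the number of observations and \(p = 3\) is the number of estimated coefficients. This is exactly what the division by nrow(X) - 3 in the code below computes.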
sigma = sqrt(crossprod(y - X %*% beta)/(nrow(X) - 3))
sigma
## medv
## medv 6.173136
#We need the inv() function from the matlib package
library(matlib)
inv(crossprod(X))
##
## [1,] 0.01401656 -0.00015113 -0.00013251
## [2,] -0.00015113 0.00000392 -0.00000931
## [3,] -0.00013251 -0.00000931 0.00006094
#Alternatively, the inversion can be done with the solve() function as well
solve(t(X) %*% X)
## age lstat
## 0.0140165566 -1.511286e-04 -1.325091e-04
## age -0.0001511286 3.922105e-06 -9.312320e-06
## lstat -0.0001325091 -9.312320e-06 6.094180e-05
sigma[1] * inv(crossprod(X))**(1/2)
##
## [1,] 0.7308472 NaN NaN
## [2,] NaN 0.01222219 NaN
## [3,] NaN NaN 0.04819002
The diagonal terms here are the standard errors of the estimators.
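The NaN entries above arise because **(1/2) is applied element-wise and the off-diagonal entries of the inverse are negative; only the diagonal is meaningful here. A slightly cleaner sketch is to take square roots of the diagonal entries only:
sigma[1] * sqrt(diag(solve(crossprod(X)))) #standard errors of the estimators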
The previous calculations have shown that, in principle, linear regression can easily be computed in R using the matrix functions. However, it is much more convenient to use the built-in function for multiple linear regression, lm(). Here are some details.
In order to fit a multiple linear regression model using least squares, we use the lm() function. The syntax lm(y ~ x1 + x2 + x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors. We use the option data = Boston to indicate that the variables medv, lstat, and age come from the Boston dataset, not from the global namespace.
lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
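As a quick sanity check (a sketch; output omitted), the coefficients and residual standard error reported by summary() should agree with the quantities computed earlier from the normal equations:
coef(lm.fit) #should match beta from solve(crossprod(X), crossprod(X, y))
sigma(lm.fit) #should match the sigma computed above (about 6.173)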
The Boston data set contains many variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:
lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age and indus have high \(p\)-values. So we may wish to run a regression excluding these predictors. The following syntax results in a regression using all predictors except age and indus.
lm.fit1 <- lm(medv ~ . - age - indus, data = Boston)
summary(lm.fit1)
##
## Call:
## lm(formula = medv ~ . - age - indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
Alternatively, the update() function can be used.
lm.fit1 <- update(lm.fit, ~ . - age - indus)
One other useful feature of the lm() function is that we can extract the design matrix from the fitted model. This is useful if one wants to do some additional calculations besides those available in the standard interface. (Note the option x = T in the call to lm().)
lm.fit <- lm(medv ~ lstat + age, data = Boston, x = T)
X <- lm.fit$x
t(X) %*% X #this is the matrix of cross-products
## (Intercept) lstat age
## (Intercept) 506.00 6402.45 34698.9
## lstat 6402.45 106762.96 500191.6
## age 34698.90 500191.63 2779614.6
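If the model was fitted without x = T, an alternative sketch is to rebuild the design matrix from the fitted model with model.matrix(); the cross-product matrix should be the same (X2 is just a temporary name):
X2 <- model.matrix(lm.fit) #design matrix reconstructed from the model formula
crossprod(X2) #should reproduce the matrix of cross-products above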
It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat * age simultaneously includes lstat, age, and the interaction term lstat \(\times\) age as predictors; it is a shorthand for lstat + age + lstat:age.
summary(lm(medv ~ lstat * age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\), we can create a predictor \(X^2\) using I(X^2). The function I() is needed since ^ has a special meaning in a formula object; wrapping it as we do allows the standard usage in R, which is to raise X to the power 2. We now perform a regression of medv onto lstat and lstat^2.
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The near-zero \(p\)-value associated with the quadratic term suggests that it leads to an improved model.
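One way to make this comparison explicit (a sketch; output omitted, and lm.fit0 is just a temporary name for the purely linear fit) is the anova() function, which performs an F-test of the quadratic model against the linear one:
lm.fit0 <- lm(medv ~ lstat, data = Boston) #linear fit, for comparison
anova(lm.fit0, lm.fit2) #F-test comparing the two nested models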
In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:
lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant \(p\)-values in a regression fit.
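One way to check this claim (a sketch; output omitted) is to add a sixth-order term and inspect its \(p\)-value in the summary:
summary(lm(medv ~ poly(lstat, 6), data = Boston)) #the sixth-order coefficient should not be significant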
By default, the poly() function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly() function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and p-values will differ). In order to obtain the raw polynomials from the poly() function, the argument raw = TRUE must be used.
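A brief sketch of this point (lm.fit5raw is just a temporary name): the raw-polynomial fit has different coefficients but the same fitted values as lm.fit5.
lm.fit5raw <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = Boston) #raw (non-orthogonal) polynomials
all.equal(fitted(lm.fit5), fitted(lm.fit5raw)) #fitted values should agree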
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.
summary(lm(medv ~ log(rm), data = Boston))
##
## Call:
## lm(formula = medv ~ log(rm), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.487 -2.875 -0.104 2.837 39.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.488 5.028 -15.21 <2e-16 ***
## log(rm) 54.055 2.739 19.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.915 on 504 degrees of freedom
## Multiple R-squared: 0.4358, Adjusted R-squared: 0.4347
## F-statistic: 389.3 on 1 and 504 DF, p-value: < 2.2e-16