The library() function is used to load
libraries, or groups of functions and data sets that are not
included in the base R distribution. Basic functions that
perform least squares linear regression and other simple analyses come
standard with the base distribution, but more exotic functions require
additional libraries. Here we load the MASS package, which
is a very large collection of data sets and functions. We also load the
ISLR2 package, which includes the data sets associated with
this book.
```{r chunk1}
library(MASS)
# install.packages("ISLR2")  # one-time installation; see below
library(ISLR2)
```
If you receive an error message when loading any of these libraries,
it likely indicates that the corresponding library has not yet been
installed on your system. Some libraries, such as MASS,
come with R and do not need to be separately installed on
your computer. However, other packages, such as ISLR2, must
be downloaded the first time they are used. This can be done directly
from within R. For example, on a Windows system, select the
Install package option under the Packages tab.
After you select any mirror site, a list of available packages will
appear. Simply select the package you wish to install and R
will automatically download the package. Alternatively, this can be done
at the R command line via
install.packages("ISLR2"). This installation only needs to
be done the first time you use a package. However, the
library() function must be called within each
R session.
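If you prefer to automate this one-time step, a common guard (a sketch; not part of the book's workflow) installs the package only when it is missing:

```{r, eval=FALSE}
# install ISLR2 only if it is not already present (one-time step)
if (!requireNamespace("ISLR2", quietly = TRUE)) {
  install.packages("ISLR2")
}
```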
The ISLR2 library contains the Boston data
set, which records medv (median house value) for \(506\) census tracts in Boston. We will seek
to predict medv using \(12\) predictors such as rm
(average number of rooms per house), age (average age of
houses), and lstat (percent of households with low
socioeconomic status).
```{r chunk2}
head(Boston)  # shows the first few rows of the Boston data set
```
To find out more about the data set, we can type
?Boston.
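Beyond the help page, a couple of quick commands report the data set's dimensions and variable names (a minimal sketch):

```{r}
dim(Boston)    # 506 rows (census tracts) and 13 columns (medv plus 12 predictors)
names(Boston)  # variable names, including rm, age, and lstat
```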
We will start by using the lm() function to fit a simple
linear regression model, with medv as the response and
lstat as the predictor. The basic syntax is lm(y ~ x, data), where
y is the response, x is the predictor, and
data is the data set in which these two variables are
kept.
```{r chunk3, error=TRUE}
lm.fit <- lm(medv ~ lstat)
```
The command causes an error because R does not know
where to find the variables medv and lstat.
The next line tells R that the variables are in
Boston. If we attach Boston, the first line
works fine because R now recognizes the variables.
```{r chunk4}
lm.fit <- lm(medv ~ lstat, data = Boston)
attach(Boston)
lm.fit <- lm(medv ~ lstat)  # works after attach(), since the variables are now recognized
```
If we type lm.fit, some basic information about the
model is output. For more detailed information, we use
summary(lm.fit). This gives us \(p\)-values and standard errors for the
coefficients, as well as the \(R^2\)
statistic and \(F\)-statistic for the
model.
```{r chunk5}
lm.fit
summary(lm.fit)  # gives more detailed output than printing lm.fit alone
```
We can use the names() function in order to find out
what other pieces of information are stored in lm.fit.
Although we can extract these quantities by
name—e.g. lm.fit$coefficients—it is safer to use the
extractor functions like coef() to access them.
```{r chunk6}
names(lm.fit)
coef(lm.fit)  # extractor function; safer than lm.fit$coefficients
```
In order to obtain a confidence interval for the coefficient
estimates, we can use the confint() command.
```{r chunk7}
confint(lm.fit)  # confidence intervals for the coefficient estimates
```
The predict() function can be used to produce confidence
intervals and prediction intervals for the prediction of
medv for a given value of lstat.
```{r chunk8}
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
```
For instance, the 95% confidence interval associated with a
lstat value of 10 is \((24.47, 25.63)\), and the 95% prediction
interval is \((12.828, 37.28)\). As expected, the
confidence and prediction intervals are centered around the same point
(a predicted value of \(25.05\) for
medv when lstat equals 10), but the latter are
substantially wider.
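The shared center of \(25.05\) is just the fitted value at this point, which can be reproduced by asking predict() for the point estimate alone (a quick check):

```{r}
predict(lm.fit, data.frame(lstat = 10))  # fitted value, approximately 25.05
```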
We will now plot medv and lstat along with
the least squares regression line using the plot() and
abline() functions.
```{r chunk9}
plot(lstat, medv)
abline(lm.fit)  # adds the least squares regression line to the plot
```
There is some evidence for non-linearity in the relationship between
lstat and medv. We will explore this issue
later in this lab.
The abline() function can be used to draw any line, not
just the least squares regression line. To draw a line with intercept
a and slope b, we type
abline(a, b). Below we experiment with some additional
settings for plotting lines and points. The lwd = 3 command
causes the width of the regression line to be increased by a factor of
3; this works for the plot() and lines()
functions also. We can also use the pch option to create
different plotting symbols.
```{r chunk10}
plot(lstat, medv)
abline(lm.fit, lwd = 3)
abline(lm.fit, lwd = 3, col = "red")
plot(lstat, medv, col = "red")
plot(lstat, medv, pch = 20)
plot(lstat, medv, pch = "+")
plot(1:20, 1:20, pch = 1:20)  # displays the first 20 plotting symbols
```
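For instance, passing the fitted coefficients to abline(a, b) reproduces the least squares line by hand (a sketch; the name b is ours):

```{r}
b <- coef(lm.fit)  # intercept and slope from the fit
plot(lstat, medv)
abline(b[1], b[2], lwd = 2, lty = 2)  # same line as abline(lm.fit), drawn manually
```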
Next we examine some diagnostic plots, several of which were
discussed in Section 3.3.3. Four diagnostic plots are automatically
produced by applying the plot() function directly to the
output from lm(). In general, this command will produce one
plot at a time, and hitting Enter will generate the next plot.
However, it is often convenient to view all four plots together. We can
achieve this by using the par() function with the mfrow
argument, which tells R to split the display screen into
separate panels so that multiple plots can be viewed simultaneously. For
example, par(mfrow = c(2, 2)) divides the plotting region
into a \(2 \times 2\) grid of
panels.
```{r chunk11}
par(mfrow = c(2, 2))
plot(lm.fit)  # the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(3, 3))
plot(lm.fit)  # the same four plots in a 3 x 3 grid, leaving five panels empty
```
Alternatively, we can compute the residuals from a linear regression
fit using the residuals() function. The function
rstudent() will return the studentized residuals, and we
can use this function to plot the residuals against the fitted
values.
```{r chunk12}
plot(predict(lm.fit), residuals(lm.fit))  # residuals against fitted values
plot(predict(lm.fit), rstudent(lm.fit))   # studentized residuals against fitted values
```
On the basis of the residual plots, there is some evidence of
non-linearity. Leverage statistics can be computed for any number of
predictors using the hatvalues() function.
```{r chunk13}
plot(hatvalues(lm.fit))  # leverage statistics for any number of predictors
which.max(hatvalues(lm.fit))
```
The which.max() function identifies the index of the
largest element of a vector. In this case, it tells us which observation
has the largest leverage statistic.
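Recall from Section 3.3.3 that the average leverage across all observations is always \((p+1)/n\). A sketch comparing the largest leverage to this average:

```{r}
# average leverage is (p + 1)/n; here p = 1 predictor and n = 506 tracts
mean(hatvalues(lm.fit))
max(hatvalues(lm.fit))  # the high-leverage point flagged by which.max()
```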
In order to fit a multiple linear regression model using least
squares, we again use the lm() function. The syntax lm(y ~ x1 + x2 + x3) is
used to fit a model with three predictors, x1,
x2, and x3. The summary()
function now outputs the regression coefficients for all the
predictors.
```{r chunk14}
lm.fit <- lm(medv ~ lstat + age, data = Boston)  # fits a model with multiple predictors
summary(lm.fit)
```
The Boston data set contains 12 variables, and so it
would be cumbersome to have to type all of these in order to perform a
regression using all of the predictors. Instead, we can use the
following short-hand:
```{r chunk15}
lm.fit <- lm(medv ~ ., data = Boston)  # regression on all predictors at once
summary(lm.fit)
```
We can access the individual components of a summary object by name
(type ?summary.lm to see what is available). Hence
summary(lm.fit)$r.sq gives us the \(R^2\), and
summary(lm.fit)$sigma gives us the RSE. The
vif() function, part of the car package, can
be used to compute variance inflation factors. Most VIFs are low to
moderate for these data. The car package is not part of the
base R installation so it must be downloaded the first time
you use it via the install.packages() function in
R.
```{r chunk16}
library(car)
vif(lm.fit)  # variance inflation factors for the predictors
```
What if we would like to perform a regression using all of the
variables but one? For example, in the above regression output,
age has a high \(p\)-value. So we may wish to run a
regression excluding this predictor. The following syntax results in a
regression using all predictors except age.
```{r chunk17}
lm.fit1 <- lm(medv ~ . - age, data = Boston)  # all predictors except age
summary(lm.fit1)
```
Alternatively, the update() function can be used.
```{r chunk18}
lm.fit1 <- update(lm.fit, ~ . - age)  # refits the model, dropping age
```

## Interaction Terms
It is easy to include interaction terms in a linear model using the
lm() function. The syntax lstat:age tells
R to include an interaction term between lstat
and age. The syntax lstat * age
simultaneously includes lstat, age, and the
interaction term lstat\(\times\)age as predictors; it
is a shorthand for lstat + age + lstat:age.
```{r chunk19}
summary(lm(medv ~ lstat * age, data = Boston))  # includes the interaction term
```
The lm() function can also accommodate non-linear
transformations of the predictors. For instance, given a predictor \(X\), we can create a predictor \(X^2\) using I(X^2). The
function I() is needed since the ^ has a
special meaning in a formula object; wrapping as we do allows the
standard usage in R, which is to raise X to
the power 2. We now perform a regression of
medv onto lstat and lstat^2.
```{r chunk20}
lm.fit2 <- lm(medv ~ lstat + I(lstat^2))  # quadratic fit via a non-linear transformation
summary(lm.fit2)
```
The near-zero \(p\)-value associated
with the quadratic term suggests that it leads to an improved model. We
use the anova() function to further quantify the extent to
which the quadratic fit is superior to the linear fit.
```{r chunk21}
lm.fit <- lm(medv ~ lstat)
anova(lm.fit, lm.fit2)  # quantifies how much better the quadratic fit is
```
Here Model 1 represents the linear submodel containing only one
predictor, lstat, while Model 2 corresponds to the larger
quadratic model that has two predictors, lstat and
lstat^2. The anova() function performs a
hypothesis test comparing the two models. The null hypothesis is that
the two models fit the data equally well, and the alternative hypothesis
is that the full model is superior. Here the \(F\)-statistic is \(135\) and the associated \(p\)-value is virtually zero. This provides
very clear evidence that the model containing the predictors
lstat and lstat^2 is far superior to the model
that only contains the predictor lstat. This is not
surprising, since earlier we saw evidence for non-linearity in the
relationship between medv and lstat. If we
type
```{r chunk22}
par(mfrow = c(2, 2))
plot(lm.fit2)
```
then we see that when the lstat^2 term is included in
the model, there is little discernible pattern in the residuals.
In order to create a cubic fit, we can include a predictor of the
form I(X^3). However, this approach can start to get
cumbersome for higher-order polynomials. A better approach involves
using the poly() function to create the polynomial within
lm(). For example, the following command produces a
fifth-order polynomial fit:
```{r chunk23}
lm.fit5 <- lm(medv ~ poly(lstat, 5))  # fifth-order polynomial fit
summary(lm.fit5)
```
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant \(p\)-values in a regression fit.
By default, the poly() function orthogonalizes the
predictors: this means that the features output by this function are not
simply a sequence of powers of the argument. However, a linear model
applied to the output of the poly() function will have the
same fitted values as a linear model applied to the raw polynomials
(although the coefficient estimates, standard errors, and p-values will
differ). In order to obtain the raw polynomials from the
poly() function, the argument raw = TRUE must
be used.
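As a quick check of this equivalence, we can fit the raw-polynomial version and compare fitted values (a sketch; the name lm.fit5.raw is ours):

```{r}
lm.fit5.raw <- lm(medv ~ poly(lstat, 5, raw = TRUE))
# the coefficient estimates differ from lm.fit5, but the fitted values agree
all.equal(fitted(lm.fit5), fitted(lm.fit5.raw))
```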
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.
```{r chunk24}
summary(lm(medv ~ log(rm), data = Boston))  # log transformation of a predictor
```
We will now examine the Carseats data, which is part of
the ISLR2 library. We will attempt to predict
Sales (child car seat sales) in \(400\) locations based on a number of
predictors.
```{r chunk25}
head(Carseats)
```
The Carseats data includes qualitative predictors such
as ShelveLoc, an indicator of the quality of the shelving
location—that is, the space within a store in which the car seat is
displayed—at each location. The predictor ShelveLoc takes
on three possible values: Bad, Medium, and
Good. Given a qualitative variable such as
ShelveLoc, R generates dummy variables
automatically. Below we fit a multiple regression model that includes
some interaction terms.
```{r chunk26}
lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit)  # multiple regression with two interaction terms
```
The contrasts() function returns the coding that
R uses for the dummy variables.
```{r chunk27}
attach(Carseats)
contrasts(ShelveLoc)  # the dummy-variable coding R uses for ShelveLoc
```
Use ?contrasts to learn about other contrasts, and how
to set them.

R has created a
ShelveLocGood dummy variable that takes on a value of 1 if
the shelving location is good, and 0 otherwise. It has also created a
ShelveLocMedium dummy variable that equals 1 if the
shelving location is medium, and 0 otherwise. A bad shelving location
corresponds to a zero for each of the two dummy variables. The fact that
the coefficient for ShelveLocGood in the regression output
is positive indicates that a good shelving location is associated with
high sales (relative to a bad location). And
ShelveLocMedium has a smaller positive coefficient,
indicating that a medium shelving location is associated with higher
sales than a bad shelving location but lower sales than a good shelving
location.
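To pull out just these two coefficients from the fit above, we can index the coefficient vector by name (a minimal sketch):

```{r}
coef(lm.fit)[c("ShelveLocGood", "ShelveLocMedium")]
```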
As we have seen, R comes with many useful functions, and
still more functions are available by way of R libraries.
However, we will often be interested in performing an operation for
which no function is available. In this setting, we may want to write
our own function. For instance, below we provide a simple function that
reads in the ISLR2 and MASS libraries, called
LoadLibraries(). Before we have created the function,
R returns an error if we try to call it.
```{r chunk28, error=TRUE}
LoadLibraries
LoadLibraries()  # errors: the function has not been defined yet
```
We now create the function. Note that the + symbols are
printed by R and should not be typed in. The {
symbol informs R that multiple commands are about to be
input. Hitting Enter after typing { will cause
R to print the + symbol. We can then input as
many commands as we wish, hitting Enter after each one.
Finally the } symbol informs R that no further
commands will be entered.
```{r chunk29}
LoadLibraries <- function() {
  library(ISLR2)
  library(MASS)
  print("The libraries have been loaded.")
}
```
Now if we type in LoadLibraries, R will
tell us what is in the function.
```{r chunk30}
LoadLibraries  # typing the name shows what is defined in the function
```
If we call the function, the libraries are loaded in and the print statement is output.
```{r chunk31}
LoadLibraries()  # loads the libraries and prints the message
```