Linear Regression in R

Libraries
Simple Linear Regression
Multiple Linear Regression
Interaction Terms
Non-Linear Transformations of the Predictors
Qualitative Predictors
Writing Functions
Conceptual Exercises

Libraries

The library() function is used to load groups of functions/datasets.
Here we load the MASS package, a large collection of datasets/functions.
Also load the datasets associated with this book with ISLR.

library(MASS)
library(ISLR)
attach(Boston)

Simple Linear Regression

Use two variables from the Boston dataset: medv (median house value), the response, and lstat (percent of households with low socioeconomic status), the predictor.

names(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Fit a simple linear regression with medv as response, lstat as predictor.
- Syntax: lm(y~x, data)
- For this example:
```
lm.fit = lm(medv~lstat, data=Boston) # OR...
lm.fit = lm(medv~lstat)
```
For more detailed output, summary(lm.fit).
- Use names(lm.fit) to list the pieces of information stored in the lm.fit object.
- Use extractor functions like coef(lm.fit) or confint(lm.fit) to access these quantities.
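The extractor functions listed above can be tried out in one short session. This is a minimal sketch; the object name `fit` is my own choice, used so that lm.fit is not overwritten.

```r
library(MASS)  # provides the Boston data set

fit = lm(medv ~ lstat, data = Boston)

names(fit)                 # components stored in the fitted-model object
coef(fit)                  # estimated intercept and slope
confint(fit)               # confidence intervals for the coefficients (95% by default)
summary(fit)$coefficients  # estimates, standard errors, t-values, p-values
```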
The predict() function evaluates the fitted response at given predictor values, passed as a data frame. It can also return confidence intervals or prediction intervals; choose which with the interval argument.
- Example: Predict medv at lstat = 5, 10, 15 with a confidence interval for each (use interval="prediction" for prediction intervals instead):
```
predict(lm.fit, data.frame(lstat=c(5, 10, 15)), interval="confidence")
```
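To see how the two interval types differ, the sketch below computes both for the same inputs and compares their widths. The names `fit`, `new.x`, `conf.int`, and `pred.int` are illustrative, not from the original notes.

```r
library(MASS)

fit = lm(medv ~ lstat, data = Boston)
new.x = data.frame(lstat = c(5, 10, 15))

conf.int = predict(fit, new.x, interval = "confidence")  # interval for the mean response
pred.int = predict(fit, new.x, interval = "prediction")  # interval for one new observation

# Both share the same point estimate (the "fit" column); only the widths differ.
conf.width = conf.int[, "upr"] - conf.int[, "lwr"]
pred.width = pred.int[, "upr"] - pred.int[, "lwr"]
cbind(conf.width, pred.width)
```

Prediction intervals come out wider because they also account for the irreducible error in a single new observation.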

Plot medv and lstat along with least squares regression line using plot() and abline().

plot(lstat, medv)   # Plots data points only.
abline(lm.fit)      # Plots regression line.

abline() can be used to draw any line with intercept $a$ and slope $b$ via abline(a, b).
- lwd=3 (accepted by both plot() and abline()) makes the line 3 times wider than the default.
- pch="+" (in plot()) selects the plotting symbol; here, each point is drawn as a ‘+’.
```
abline(lm.fit, lwd=3, col="red")
plot(lstat, medv, pch=20) # 20 == small circles
plot(lstat, medv, pch=1:20) # Uses 20 different symbols.
```
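A small sketch of the abline(a, b) form: it draws the fitted line from its intercept and slope, then overlays the same line taken straight from the model object. The names `fit`, `b0`, and `b1` are my own.

```r
library(MASS)

fit = lm(medv ~ lstat, data = Boston)
b0 = coef(fit)[["(Intercept)"]]
b1 = coef(fit)[["lstat"]]

plot(Boston$lstat, Boston$medv, pch = 20)
abline(a = b0, b = b1, lwd = 3, col = "red")  # a is the intercept, b is the slope
abline(fit, lty = 2)                          # same line, read directly from the model
```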
Diagnostic plots from Section 3.3.3:
- Tip: use par(mfrow=...) to divide the plotting region into a 2x2 grid so that all four outputs of plot(lm.fit) can be viewed at once.
```
par(mfrow=c(2,2))
plot(lm.fit) 
```
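Some of the quantities behind those diagnostic panels can also be pulled out directly with residuals() and hatvalues(). The sketch below is one way to do that; the object name `fit` is illustrative.

```r
library(MASS)

fit = lm(medv ~ lstat, data = Boston)

par(mfrow = c(1, 2))
plot(predict(fit), residuals(fit))  # residuals against fitted values
plot(hatvalues(fit))                # leverage of each observation

which.max(hatvalues(fit))           # index of the highest-leverage observation
```

A curved pattern in the residual plot suggests the relationship is not linear, and points with unusually high leverage are worth a closer look.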

Multiple Linear Regression

Similarly, use lm(y~x1 + x2 + x3) to fit a model with three predictors $x1$, $x2$, $x3$. The formula syntax offers several ways to choose the predictors:

lm.fit = lm(medv~lstat+age, data=Boston)    # Two predictors.
lm.fit = lm(medv~., data=Boston)            # All predictors.
lm.fit1 = lm(medv~.-age, data=Boston)       # All EXCEPT age, method 1.
lm.fit1 = update(lm.fit, ~.-age)            # All EXCEPT age, method 2 (starts from the full model).
summary(lm.fit1)
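The two ways of dropping age can be checked against each other. In this sketch the names `full`, `no.age.1`, and `no.age.2` are my own; the point is that update() must start from the model that still contains age.

```r
library(MASS)

full = lm(medv ~ ., data = Boston)             # all 13 predictors
no.age.1 = lm(medv ~ . - age, data = Boston)   # method 1: exclude age in the formula
no.age.2 = update(full, ~ . - age)             # method 2: remove age from the full model

all.equal(coef(no.age.1), coef(no.age.2))      # the two fits have the same coefficients
```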

Can access the individual components of a summary by name. Type ?summary.lm to see what is available.
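- For example, to get the $R^2$ and RSE: summary(lm.fit)$r.squared and summary(lm.fit)$sigma (the abbreviation $r.sq also works through partial matching).
The vif() function, which computes variance inflation factors, is not part of base R; load the car package to use it.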
```
library(car)
vif(lm.fit)
```
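For intuition about what vif() reports, the variance inflation factor of predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below computes this by hand in base R. The name `manual.vif` is hypothetical, and I am assuming (not showing) that it agrees with car's vif() when every term is a single numeric column.

```r
library(MASS)

predictors = setdiff(names(Boston), "medv")

# For each predictor, regress it on the remaining predictors and
# convert the resulting R^2 into a variance inflation factor.
manual.vif = sapply(predictors, function(p) {
  others = setdiff(predictors, p)
  r2 = summary(lm(reformulate(others, response = p), data = Boston))$r.squared
  1 / (1 - r2)
})

round(manual.vif, 2)
```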

Interaction Terms

Can easily include interaction terms as well (i.e. multiplying predictors together).
```
lm(medv~lstat*age, data=Boston)
```
- Syntax: x1:x2 tells R to include the interaction term $x_1 x_2$. The shorthand x1*x2 includes x1, x2, and x1:x2 together as predictors; it is short for x1 + x2 + x1:x2.
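A quick check that the shorthand and the expanded formula describe the same model (the names `short` and `long` are illustrative):

```r
library(MASS)

short = lm(medv ~ lstat * age, data = Boston)
long = lm(medv ~ lstat + age + lstat:age, data = Boston)

names(coef(short))                   # intercept, two main effects, one interaction
all.equal(coef(short), coef(long))   # same coefficients either way
```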

Non-Linear Transformations of the Predictors

Given a predictor $X$, create a predictor $X^2$ using I(X^2).
Use the anova() function to quantify extent to which the quadratic fit is superior to the linear fit.

lm.fit = lm(medv ~ lstat)
lm.fit2 = lm(medv ~ lstat + I(lstat^2))
anova(lm.fit, lm.fit2)

## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    504 19472                                 
## 2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
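The F-statistic in this table can be reproduced from the two residual sums of squares. This is a sketch of the standard nested-model F-test; the names `fit1`, `fit2`, `rss1`, `rss2`, `f.stat`, and `p.value` are my own.

```r
library(MASS)

fit1 = lm(medv ~ lstat, data = Boston)
fit2 = lm(medv ~ lstat + I(lstat^2), data = Boston)

rss1 = sum(residuals(fit1)^2)
rss2 = sum(residuals(fit2)^2)
df.extra = df.residual(fit1) - df.residual(fit2)  # one extra parameter in fit2

# F = (drop in RSS per extra parameter) / (residual variance of the larger model)
f.stat = ((rss1 - rss2) / df.extra) / (rss2 / df.residual(fit2))
p.value = pf(f.stat, df.extra, df.residual(fit2), lower.tail = FALSE)

c(F = f.stat, p = p.value)
```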

For fitting higher-order polynomials, it is easier to use the poly() function inside lm(), e.g. lm(y~poly(x, 5)) for a 5th-order polynomial.
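One detail worth knowing: by default poly() builds orthogonal polynomials, so its coefficients differ from those of the I(X^2) form even though the fitted values are the same. Passing raw = TRUE gives the plain powers. A short sketch (object names are illustrative):

```r
library(MASS)

fit.I = lm(medv ~ lstat + I(lstat^2), data = Boston)
fit.orth = lm(medv ~ poly(lstat, 2), data = Boston)              # orthogonal basis (default)
fit.raw = lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston)   # raw powers of lstat

all.equal(fitted(fit.I), fitted(fit.orth))             # same fitted values
all.equal(unname(coef(fit.I)), unname(coef(fit.raw)))  # same coefficients with raw = TRUE
```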

Qualitative Predictors

Writing Functions

Conceptual Exercises

Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
- Table 3.4 shows p-values corresponding to the least-squares coefficient estimates of the multiple linear regression of number of units sold on radio, TV, and newspaper advertising budgets.
- Intercept: The null hypothesis is $H_0 : \beta_0 = 0$, i.e. that sales are zero in the absence of any advertising via TV, radio, and newspaper. The p-value says that, under $H_0$, $Pr(|t| \geq 9.42) < 0.0001$. Therefore, we can reject $H_0$: expected sales are nonzero even with no advertising.
- TV/Radio: Similarly, both TV and radio have $p < 0.0001$. Therefore, we have strong evidence that sales is related to the TV budget and to the radio budget, each with the other two budgets held fixed.
- Newspaper: Here, $p = 0.8599$. In other words, there is a high probability of obtaining a t-statistic at least as extreme as our observed $t = -0.18$, assuming that there is no relationship between newspaper and sales. Therefore, we don’t have strong enough evidence to conclude there is any relationship between sales and newspaper. Correction: Should have emphasized that the null hypothesis here is that there is no relationship between newspaper and sales in the presence of TV and radio, and that we fail to reject it.
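The link between the t-statistics and the p-values quoted above can be sketched numerically. This assumes the Advertising data has 200 observations, so that the model with three predictors and an intercept leaves 196 residual degrees of freedom. The helper `p.from.t` is hypothetical, and because the t-statistics are rounded the results only approximate the tabulated p-values.

```r
n = 200                  # assumed number of observations in the Advertising data
resid.df = n - 3 - 1     # three predictors plus an intercept

# Two-sided p-value for a t-statistic under the null hypothesis.
p.from.t = function(t.stat, df) 2 * pt(-abs(t.stat), df)

p.newspaper = p.from.t(-0.18, resid.df)  # large: fail to reject the null
p.intercept = p.from.t(9.42, resid.df)   # tiny: reject the null

c(newspaper = p.newspaper, intercept = p.intercept)
```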
Carefully explain the differences between the KNN classifier and KNN regression methods.

Linear Regression in R

Brandon McKinzie

May 25, 2016
