Linear Regression in R

1 Loading Dataset

1.1 Boston dataset

The Boston dataset, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average of houses), and lstat (percent of households with low socioeconomic status)

More information: https://archive.ics.uci.edu/ml/datasets/Housing

boston.dataset = read.csv("Boston.csv")
names(boston.dataset) # Get all column names of the data set

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

1.2 Split into train/test dataset

set.seed(123) # for reproducible results
sample.size <- floor(0.75 * nrow(boston.dataset))
train.index <- sample(seq_len(nrow(boston.dataset)), size = sample.size)
train <- boston.dataset[train.index, ]
test <- boston.dataset[- train.index, ]

2 Simple Linear Regression

2.1 Fit and interpret the model

lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance

Basic syntax is lm(y ~ x, data)
Use summary() function to find more detailed information about the model

lm.fit = lm(medv ~ lstat, data = train) # fit a simple linear regression model
summary(lm.fit)

## 
## Call:
## lm(formula = medv ~ lstat, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.126  -3.895  -1.343   1.720  24.525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.43371    0.65112   52.88   <2e-16 ***
## lstat       -0.94009    0.04536  -20.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.222 on 377 degrees of freedom
## Multiple R-squared:  0.5325, Adjusted R-squared:  0.5313 
## F-statistic: 429.5 on 1 and 377 DF,  p-value: < 2.2e-16

2.2 Prediction

Use predict() to produce confidence intervals and prediction intervals for the prediction of medv for a given value lsat

head(predict(lm.fit, test, interval = "confidence"), 5)

##          fit       lwr       upr
## 6  29.535867 28.631219 30.440516
## 8  16.431071 15.567267 17.294874
## 9   6.296945  4.620566  7.973325
## 10 18.358247 17.607999 19.108494
## 18 20.642655 19.985224 21.300087

head(predict(lm.fit, test, interval = "prediction"), 5)

##          fit       lwr      upr
## 6  29.535867 17.267822 41.80391
## 8  16.431071  4.165969 28.69617
## 9   6.296945 -6.052014 18.64591
## 10 18.358247  6.100619 30.61587
## 18 20.642655  8.390359 32.89495

2.3 Diagnostic Plot

Four diagnostic plots are automatically produced by applying the plot() function directly to the output of lm()

par(mfrow = c(2,2))
plot(lm.fit)

3 Multiple Linear Regression

3.1 Fit and interperet the model

We will again use lm() to fit model and summary() to output the regression coefficients.

Basic syntax is lm(y ~ x1 + x2 + x3, data)

lm.fit = lm(medv ~ lstat + age,data = train) # fit a multiple linear regression model
#summary(lm.fit)

3.2 Regression with more predictors

The Boston dataset contains 13 variables. In order to perform a regression using all of the predictors. We can use the following short-hand:

lm.fit = lm(medv ~ ., data = train)
# lm.fit = lm(mdev ~ . - age, data = train) # This will perform a regression using all variables but age
# summary(lm.fit)

3.3 Interaction Terms

The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat, age, and the interaction term lstat:age as predictors; it is shorthand for lstat+age+lstat:age.

lm.fit = lm(medv ~ lstat * age, data = boston.dataset)
# summary(lm.fit)

3.4 Non-linear Transformations of the Predictiors

Given a predictor X, we can create predictor X² using I(X^2).

lm.fit = lm(medv ~ crim + chas + nox + rm + dis + ptratio + black + lstat + I(lstat ^ 2), data=train)
summary(lm.fit)

## 
## Call:
## lm(formula = medv ~ crim + chas + nox + rm + dis + ptratio + 
##     black + lstat + I(lstat^2), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.2337  -2.6950  -0.4378   1.7613  25.5685 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.095399   5.080930   7.301 1.77e-12 ***
## crim         -0.080073   0.034204  -2.341 0.019760 *  
## chas          3.392759   0.980433   3.460 0.000602 ***
## nox         -11.033161   3.556692  -3.102 0.002070 ** 
## rm            3.322814   0.433383   7.667 1.57e-13 ***
## dis          -1.269777   0.182673  -6.951 1.65e-11 ***
## ptratio      -0.632826   0.125327  -5.049 6.98e-07 ***
## black         0.006355   0.002970   2.139 0.033054 *  
## lstat        -1.770121   0.138387 -12.791  < 2e-16 ***
## I(lstat^2)    0.035154   0.003717   9.458  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.51 on 369 degrees of freedom
## Multiple R-squared:  0.7596, Adjusted R-squared:  0.7537 
## F-statistic: 129.6 on 9 and 369 DF,  p-value: < 2.2e-16

3.5 Prediction