The Boston dataset, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average of houses), and lstat (percent of households with low socioeconomic status)
More information: https://archive.ics.uci.edu/ml/datasets/Housing
boston.dataset = read.csv("Boston.csv")
names(boston.dataset) # Get all column names of the data set
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
set.seed(123) # for reproducible results
sample.size <- floor(0.75 * nrow(boston.dataset))
train.index <- sample(seq_len(nrow(boston.dataset)), size = sample.size)
train <- boston.dataset[train.index, ]
test <- boston.dataset[- train.index, ]
lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance
Basic syntax is lm(y ~ x, data)
Use summary() function to find more detailed information about the model
lm.fit = lm(medv ~ lstat, data = train) # fit a simple linear regression model
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.126 -3.895 -1.343 1.720 24.525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.43371 0.65112 52.88 <2e-16 ***
## lstat -0.94009 0.04536 -20.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.222 on 377 degrees of freedom
## Multiple R-squared: 0.5325, Adjusted R-squared: 0.5313
## F-statistic: 429.5 on 1 and 377 DF, p-value: < 2.2e-16
Use predict() to produce confidence intervals and prediction intervals for the prediction of medv for a given value lsat
head(predict(lm.fit, test, interval = "confidence"), 5)
## fit lwr upr
## 6 29.535867 28.631219 30.440516
## 8 16.431071 15.567267 17.294874
## 9 6.296945 4.620566 7.973325
## 10 18.358247 17.607999 19.108494
## 18 20.642655 19.985224 21.300087
head(predict(lm.fit, test, interval = "prediction"), 5)
## fit lwr upr
## 6 29.535867 17.267822 41.80391
## 8 16.431071 4.165969 28.69617
## 9 6.296945 -6.052014 18.64591
## 10 18.358247 6.100619 30.61587
## 18 20.642655 8.390359 32.89495
Four diagnostic plots are automatically produced by applying the plot() function directly to the output of lm()
par(mfrow = c(2,2))
plot(lm.fit)
We will again use lm() to fit model and summary() to output the regression coefficients.
Basic syntax is lm(y ~ x1 + x2 + x3, data)
lm.fit = lm(medv ~ lstat + age,data = train) # fit a multiple linear regression model
#summary(lm.fit)
The Boston dataset contains 13 variables. In order to perform a regression using all of the predictors. We can use the following short-hand:
lm.fit = lm(medv ~ ., data = train)
# lm.fit = lm(mdev ~ . - age, data = train) # This will perform a regression using all variables but age
# summary(lm.fit)
The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat, age, and the interaction term lstat:age as predictors; it is shorthand for lstat+age+lstat:age.
lm.fit = lm(medv ~ lstat * age, data = boston.dataset)
# summary(lm.fit)
Given a predictor X, we can create predictor X2 using I(X^2).
lm.fit = lm(medv ~ crim + chas + nox + rm + dis + ptratio + black + lstat + I(lstat ^ 2), data=train)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ crim + chas + nox + rm + dis + ptratio +
## black + lstat + I(lstat^2), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.2337 -2.6950 -0.4378 1.7613 25.5685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.095399 5.080930 7.301 1.77e-12 ***
## crim -0.080073 0.034204 -2.341 0.019760 *
## chas 3.392759 0.980433 3.460 0.000602 ***
## nox -11.033161 3.556692 -3.102 0.002070 **
## rm 3.322814 0.433383 7.667 1.57e-13 ***
## dis -1.269777 0.182673 -6.951 1.65e-11 ***
## ptratio -0.632826 0.125327 -5.049 6.98e-07 ***
## black 0.006355 0.002970 2.139 0.033054 *
## lstat -1.770121 0.138387 -12.791 < 2e-16 ***
## I(lstat^2) 0.035154 0.003717 9.458 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.51 on 369 degrees of freedom
## Multiple R-squared: 0.7596, Adjusted R-squared: 0.7537
## F-statistic: 129.6 on 9 and 369 DF, p-value: < 2.2e-16
Use predict() to produce confidence intervals and prediction intervals for the prediction of medv for a given value lsat
head(predict(lm.fit, test, interval = "confidence"), 5)
## fit lwr upr
## 6 28.11043 27.16997 29.05088
## 8 16.15223 14.57756 17.72690
## 9 13.63438 11.32964 15.93913
## 10 15.72888 14.18574 17.27202
## 18 16.35996 15.49165 17.22826
head(predict(lm.fit, test, interval = "prediction"), 5)
## fit lwr upr
## 6 28.11043 19.191835 37.02901
## 8 16.15223 7.144661 25.15981
## 9 13.63438 4.470941 22.79782
## 10 15.72888 6.726764 24.73099
## 18 16.35996 7.448686 25.27123
Four diagnostic plots are automatically produced by applying the plot() function directly to the output of lm()
par(mfrow = c(2,2))
plot(lm.fit)