Introduction

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.

Extract the data and create the training and testing sample

For the current model, let’s use the Boston dataset, which is part of the MASS package in R. The problem statement is to predict ‘medv’ (the median home value) based on the set of input features. The following are the features available in the Boston dataset.
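A minimal sketch of how the data can be loaded and the feature names listed (assuming the MASS package is installed; the output shown below comes from names(Boston)):

library(MASS)     # provides the Boston housing data

data(Boston)      # load the Boston data frame
names(Boston)     # list the available feature names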

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Split the sample data and make the model

Split the input data into training and evaluation sets and build the model on the training dataset. Based on an 80-20 split, the training dataset has 404 observations and the testing dataset has 102 observations.
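One way to produce such a split is sketched below; the seed value and the train/test object names are illustrative assumptions, not the exact code behind the results shown:

set.seed(100)                                          # illustrative seed for reproducibility
train_idx <- sample(seq_len(nrow(Boston)),
                    size = floor(0.8 * nrow(Boston)))  # 80% of 506 rows = 404
train <- Boston[train_idx, ]                           # training set
test  <- Boston[-train_idx, ]                          # testing set

dim(train)
dim(test)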

## [1] 404  14
## [1] 102  14

Explore the response variable

Let’s check the distribution of the response variable medv. The following figure shows three distributions of medv: the original values, a log transformation, and a square root transformation.

We can see that both the log and sqrt transformations do a decent job of bringing the ‘medv’ distribution closer to normal. In the following model, I have selected the ‘log’ transformation, but it is also possible to try the ‘sqrt’ transformation.
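A sketch of how the three histograms can be drawn with base R graphics (panel titles are illustrative):

par(mfrow = c(1, 3))                                   # three panels side by side
hist(train$medv,       main = "medv",       xlab = "medv")
hist(log(train$medv),  main = "log(medv)",  xlab = "log(medv)")
hist(sqrt(train$medv), main = "sqrt(medv)", xlab = "sqrt(medv)")
par(mfrow = c(1, 1))                                   # reset the plotting layout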

Model Building – Model 1

Now, as a first step, we will fit a multiple regression model, starting with all input variables as predictors.
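The call shown in the summary below confirms the formula used; a sketch of the fitting code (the object name model1 is an assumption here):

# Regress log(medv) on all remaining predictors in the training data
model1 <- lm(log(medv) ~ ., data = train)
summary(model1)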

## 
## Call:
## lm(formula = log(medv) ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71392 -0.10435 -0.00913  0.10259  0.83290 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.5387176  0.2331496  19.467  < 2e-16 ***
## crim        -0.0110546  0.0015143  -7.300 1.63e-12 ***
## zn           0.0014176  0.0006281   2.257 0.024574 *  
## indus        0.0020512  0.0028308   0.725 0.469120    
## chas         0.0853159  0.0402646   2.119 0.034732 *  
## nox         -0.9285807  0.1693534  -5.483 7.52e-08 ***
## rm           0.0589055  0.0189839   3.103 0.002056 ** 
## age          0.0002373  0.0006075   0.391 0.696247    
## dis         -0.0598220  0.0091621  -6.529 2.06e-10 ***
## rad          0.0152004  0.0030216   5.031 7.47e-07 ***
## tax         -0.0005681  0.0001709  -3.325 0.000968 ***
## ptratio     -0.0427382  0.0059698  -7.159 4.07e-12 ***
## black        0.0003423  0.0001209   2.831 0.004885 ** 
## lstat       -0.0319466  0.0022703 -14.071  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.193 on 390 degrees of freedom
## Multiple R-squared:  0.7933, Adjusted R-squared:  0.7864 
## F-statistic: 115.1 on 13 and 390 DF,  p-value: < 2.2e-16

Observation

In this model, F = 115.1 is far greater than 1, so it can be concluded that there is a relationship between the predictors and the response variable.

Based on the p-values, we can identify the significant variables. We can see that zn, indus and age are the least significant features, as their p-values are the largest (indus and age are not significant even at the 5% level).

The R2 (Multiple R-squared) value indicates how much of the variation in the response is captured by the model. An R2 closer to 1 means the model explains a larger share of the variance and is therefore a better fit. In this case, the value is 0.7933, so the model is a good fit.
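These quantities can also be pulled directly out of the fitted model; a small sketch, assuming the first model was stored as model1 as above:

s1 <- summary(model1)
s1$r.squared       # Multiple R-squared (0.7933 here)
s1$adj.r.squared   # Adjusted R-squared (0.7864 here)
s1$fstatistic      # F-statistic and its degrees of freedom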

Model Building – Model 2

As the next step, we can remove the three less significant features (zn, indus, and age) and check the model again.
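A sketch of the refit, matching the call shown in the summary below (the object name model2 is an assumption):

# Refit after dropping zn, indus and age from the predictor set
model2 <- lm(log(medv) ~ crim + chas + nox + rm + dis + rad +
               tax + ptratio + black + lstat, data = train)
summary(model2)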

## 
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71008 -0.10962 -0.01296  0.10330  0.84030 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.5489757  0.2327067  19.548  < 2e-16 ***
## crim        -0.0107965  0.0015120  -7.140 4.54e-12 ***
## chas         0.0854786  0.0400433   2.135 0.033407 *  
## nox         -0.9117698  0.1566753  -5.819 1.23e-08 ***
## rm           0.0644686  0.0184026   3.503 0.000513 ***
## dis         -0.0523327  0.0072616  -7.207 2.95e-12 ***
## rad          0.0143808  0.0028974   4.963 1.03e-06 ***
## tax         -0.0004624  0.0001505  -3.073 0.002263 ** 
## ptratio     -0.0464620  0.0055623  -8.353 1.16e-15 ***
## black        0.0003416  0.0001211   2.821 0.005026 ** 
## lstat       -0.0314967  0.0021598 -14.583  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1935 on 393 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7851 
## F-statistic: 148.3 on 10 and 393 DF,  p-value: < 2.2e-16

F = 148.3 is far greater than 1, and this value is higher than the F value of the previous model. It can again be concluded that there is a relationship between the predictors and the response variable.

R2 = 0.7905 is close to 1, so this model is also a good fit. Note that this value has decreased slightly from the first model, but that is acceptable: removing the three predictors only dropped R2 from 0.7933 to 0.7905. In other words, the contribution of those three predictors towards explaining the variance is only about 0.0028, so it is reasonable to drop them.
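The drop in explained variance can be checked directly from the two fits (assuming they were stored as model1 and model2 as in the sketches above):

# Difference in Multiple R-squared between the full and reduced models
summary(model1)$r.squared - summary(model2)$r.squared   # about 0.0028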

Conclusion

The example shows how to approach linear regression modeling.

Reference

  1. An Introduction to Statistical Learning, with Applications in R. By G. James, D. Witten, T. Hastie, and R. Tibshirani.