Machine Learning and Regression

Extract Data

We use the Boston data set, which is part of the MASS library, to create the training and testing samples. The goal is to predict \(medv\) (median home value) from the set of input features.

library(MASS)
## Warning: package 'MASS' was built under R version 3.6.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
attach(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Split the data into training and test sets

Split the data 80/20 into a training set and a test set, and build the model on the training set.

set.seed(1)
row.number <- sample(1:nrow(Boston), 0.8*nrow(Boston))
train = Boston[row.number,]
test = Boston[-row.number,]
dim(train)
## [1] 404  14
dim(test)
## [1] 102  14

Explore the response variable

Let’s check the distribution of the response variable \(medv\). The following graphs show three distributions of \(medv\): the original values, a log transformation, and a square root transformation.

ggplot(Boston, aes(medv)) + geom_density(fill="yellow")

ggplot(Boston, aes(log(medv))) + geom_density(fill="yellow")

ggplot(Boston, aes(sqrt(medv))) + geom_density(fill="yellow")
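To back the visual comparison with a number, here is a minimal sketch comparing a simple moment-based skewness of the three versions; the skew helper below is ad hoc, written for illustration rather than taken from any package, and the transformation whose skewness is closest to zero is the most symmetric.

# ad hoc moment-based skewness helper (illustrative, not from a package)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(original = skew(Boston$medv),
  log      = skew(log(Boston$medv)),
  sqrt     = skew(sqrt(Boston$medv)))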

Model Building

Model 1

As a first step, let’s fit a multiple regression model. I will start by including all of the input variables, with the log of \(medv\) as the response.

model1 = lm(log(medv)~., data=train)
summary(model1)
## 
## Call:
## lm(formula = log(medv) ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73932 -0.09713 -0.01923  0.08883  0.86529 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.133e+00  2.370e-01  17.438  < 2e-16 ***
## crim        -1.166e-02  1.636e-03  -7.123 5.14e-12 ***
## zn           1.116e-03  6.129e-04   1.821  0.06941 .  
## indus        2.134e-03  2.718e-03   0.785  0.43286    
## chas         1.084e-01  3.797e-02   2.854  0.00454 ** 
## nox         -7.142e-01  1.727e-01  -4.135 4.35e-05 ***
## rm           8.303e-02  1.907e-02   4.353 1.72e-05 ***
## age         -9.102e-05  5.898e-04  -0.154  0.87743    
## dis         -5.104e-02  9.132e-03  -5.589 4.29e-08 ***
## rad          1.645e-02  2.885e-03   5.700 2.36e-08 ***
## tax         -7.018e-04  1.624e-04  -4.322 1.96e-05 ***
## ptratio     -3.593e-02  6.048e-03  -5.941 6.29e-09 ***
## black        4.138e-04  1.201e-04   3.447  0.00063 ***
## lstat       -2.957e-02  2.238e-03 -13.213  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1921 on 390 degrees of freedom
## Multiple R-squared:  0.7914, Adjusted R-squared:  0.7844 
## F-statistic: 113.8 on 13 and 390 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model1)

Model 1 observation

Based on the summary, the F-statistic is 113.8, which is far greater than 1, so it is safe to say that there is a relationship between the predictor variables and the response variable. Now let’s look at the p-values; they tell us which predictors are significant. \(zn\), \(indus\), and \(age\) are not significant, so I will remove these variables in the next model. The multiple \(R^2\) of 0.7914 is reasonably close to 1, which indicates that the model explains a large share of the variance and is a good fit.
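Rather than reading the significance stars by eye, the same conclusion can be pulled out of the fitted model object programmatically. A small sketch using the model1 object fitted above:

# extract the p-values from model 1 and list the three weakest predictors
pvals <- summary(model1)$coefficients[, "Pr(>|t|)"]
sort(pvals, decreasing = TRUE)[1:3]   # expect age, indus and zn at the top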

Model 2

For the next model, I remove the less significant predictors (zn, indus, age) and refit.

model2 <- update(model1, ~.-zn-indus-age)
summary(model2)
## 
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73727 -0.10583 -0.02177  0.09436  0.86896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.1511772  0.2361239  17.581  < 2e-16 ***
## crim        -0.0114749  0.0016304  -7.038 8.74e-12 ***
## chas         0.1098383  0.0377278   2.911 0.003804 ** 
## nox         -0.7160222  0.1599660  -4.476 9.97e-06 ***
## rm           0.0854763  0.0184393   4.636 4.85e-06 ***
## dis         -0.0450161  0.0073599  -6.116 2.31e-09 ***
## rad          0.0156919  0.0027803   5.644 3.19e-08 ***
## tax         -0.0006071  0.0001455  -4.171 3.74e-05 ***
## ptratio     -0.0390424  0.0056372  -6.926 1.78e-11 ***
## black        0.0004127  0.0001198   3.445 0.000632 ***
## lstat       -0.0294784  0.0021172 -13.923  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1923 on 393 degrees of freedom
## Multiple R-squared:  0.7894, Adjusted R-squared:  0.784 
## F-statistic: 147.3 on 10 and 393 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2)

Model 2 observation

The F-statistic is 147.3, which is far greater than 1 and even greater than the F-statistic of the previous model, so I can again conclude that there is a relationship between the predictors and the response. Because the less significant features were removed, all of the predictors in this model are significant. Its \(R^2\) of 0.7894 is also close to 1, so this is a good fit as well. Removing the three predictors only caused \(R^2\) to drop from 0.7914 to 0.7894, a change of 0.0020, which is very small. Dropping those three less significant predictors therefore did not hurt the model, and it was a good decision to remove them; a nested-model test confirming this is sketched below.
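The same conclusion can be checked formally with a partial F-test comparing the two nested models. This is a small sketch, not part of the original analysis; a large p-value means the dropped terms add little.

# partial F-test: does dropping zn, indus and age significantly worsen the fit?
anova(model2, model1)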

Check the Predictor vs. Residual Plots

In this section we check the residual plots for all of the significant features from model 2. Ideally a residual plot should not show any pattern. In the plots below we can see some non-linear patterns for features such as \(crim\), \(nox\), and \(rm\). (Note that \(chas\) is a 0/1 indicator, so the loess smoother may fail for that panel, which is what produces the warnings below.)

attach(train)
## The following objects are masked from Boston:
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
require(gridExtra)
## Loading required package: gridExtra
## Warning: package 'gridExtra' was built under R version 3.6.3
plot1 = ggplot(train, aes(crim, residuals(model2))) + geom_point() + geom_smooth()
plot2=ggplot(train, aes(chas, residuals(model2))) + geom_point() + geom_smooth()
plot3=ggplot(train, aes(nox, residuals(model2))) + geom_point() + geom_smooth()
plot4=ggplot(train, aes(rm, residuals(model2))) + geom_point() + geom_smooth()
plot5=ggplot(train, aes(dis, residuals(model2))) + geom_point() + geom_smooth()
plot6=ggplot(train, aes(rad, residuals(model2))) + geom_point() + geom_smooth()
plot7=ggplot(train, aes(tax, residuals(model2))) + geom_point() + geom_smooth()
plot8=ggplot(train, aes(ptratio, residuals(model2))) + geom_point() + geom_smooth()
plot9=ggplot(train, aes(black, residuals(model2))) + geom_point() + geom_smooth()
plot10=ggplot(train, aes(lstat, residuals(model2))) + geom_point() + geom_smooth()
grid.arrange(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10,ncol=5,nrow=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
## zero-width neighborhood. make span bigger
## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 5)
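The curvature in these panels suggests the linear terms alone may be missing some non-linear structure. One way to test that formally is Ramsey's RESET test; the sketch below assumes the lmtest package is installed (it is not used anywhere else in this post).

# Ramsey RESET test on model 2: a small p-value suggests omitted non-linear
# terms, which motivates the squared terms added in model 3 below
library(lmtest)
resettest(model2, power = 2:3)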

Model 3

Model 3 extends model 2 by adding a squared term for every feature. Note that \(chas\) is a 0/1 indicator, so \(chas^2\) is identical to \(chas\) and R drops it as a redundant term; this is why the summary below reports one coefficient as not defined because of singularities.

# add squared terms to the model
model3 = lm(log(medv) ~ crim + chas + nox + rm + dis + rad + tax + ptratio +
            black + lstat + I(crim^2) + I(chas^2) + I(nox^2) + I(rm^2) + I(dis^2) +
            I(rad^2) + I(tax^2) + I(ptratio^2) + I(black^2) + I(lstat^2), data=train)
summary(model3)
## 
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat + I(crim^2) + I(chas^2) + I(nox^2) + 
##     I(rm^2) + I(dis^2) + I(rad^2) + I(tax^2) + I(ptratio^2) + 
##     I(black^2) + I(lstat^2), data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58373 -0.08611 -0.01228  0.08528  0.77344 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.273e+00  8.749e-01   9.456  < 2e-16 ***
## crim         -3.291e-02  4.505e-03  -7.306 1.61e-12 ***
## chas          1.124e-01  3.223e-02   3.487 0.000546 ***
## nox          -6.286e-01  1.074e+00  -0.585 0.558693    
## rm           -8.026e-01  1.324e-01  -6.063 3.20e-09 ***
## dis          -1.202e-01  2.452e-02  -4.900 1.41e-06 ***
## rad           1.628e-02  9.436e-03   1.726 0.085217 .  
## tax          -3.393e-04  5.300e-04  -0.640 0.522477    
## ptratio      -1.592e-01  7.163e-02  -2.222 0.026843 *  
## black         1.314e-03  5.115e-04   2.568 0.010594 *  
## lstat        -5.419e-02  5.487e-03  -9.876  < 2e-16 ***
## I(crim^2)     2.961e-04  6.690e-05   4.426 1.25e-05 ***
## I(chas^2)            NA         NA      NA       NA    
## I(nox^2)     -2.450e-01  8.002e-01  -0.306 0.759664    
## I(rm^2)       6.752e-02  1.036e-02   6.520 2.22e-10 ***
## I(dis^2)      6.899e-03  1.936e-03   3.564 0.000411 ***
## I(rad^2)      2.739e-04  3.730e-04   0.734 0.463258    
## I(tax^2)     -4.613e-07  6.474e-07  -0.712 0.476601    
## I(ptratio^2)  3.751e-03  2.040e-03   1.839 0.066742 .  
## I(black^2)   -2.355e-06  1.129e-06  -2.085 0.037695 *  
## I(lstat^2)    7.380e-04  1.520e-04   4.854 1.77e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1627 on 384 degrees of freedom
## Multiple R-squared:  0.8526, Adjusted R-squared:  0.8453 
## F-statistic: 116.9 on 19 and 384 DF,  p-value: < 2.2e-16

Model 4

Model 4 is built by removing the insignificant terms from model 3.

# Remove insignificant variables
model4=update(model3, ~.-nox-tax-I(crim^2)-I(chas^2)-I(nox^2)-
I(rad^2)-I(tax^2)-I(ptratio^2))
summary(model4)
## 
## Call:
## lm(formula = log(medv) ~ crim + chas + rm + dis + rad + ptratio + 
##     black + lstat + I(rm^2) + I(dis^2) + I(black^2) + I(lstat^2), 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68939 -0.09035 -0.01192  0.10008  0.84934 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.330e+00  4.564e-01  13.868  < 2e-16 ***
## crim        -1.265e-02  1.556e-03  -8.131 5.67e-15 ***
## chas         1.226e-01  3.483e-02   3.521 0.000481 ***
## rm          -8.777e-01  1.428e-01  -6.144 1.98e-09 ***
## dis         -4.142e-03  2.016e-02  -0.205 0.837350    
## rad          3.152e-03  1.575e-03   2.002 0.045977 *  
## ptratio     -1.887e-02  5.228e-03  -3.609 0.000348 ***
## black        9.415e-04  5.578e-04   1.688 0.092221 .  
## lstat       -5.754e-02  5.733e-03 -10.037  < 2e-16 ***
## I(rm^2)      7.554e-02  1.110e-02   6.804 3.84e-11 ***
## I(dis^2)    -1.066e-03  1.838e-03  -0.580 0.562248    
## I(black^2)  -1.231e-06  1.228e-06  -1.002 0.316724    
## I(lstat^2)   7.374e-04  1.585e-04   4.653 4.49e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1785 on 391 degrees of freedom
## Multiple R-squared:  0.8193, Adjusted R-squared:  0.8138 
## F-statistic: 147.7 on 12 and 391 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model4)
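Manually pruning terms is one option; an automated alternative is backward selection by AIC with stepAIC from the MASS package, which is already loaded. This is only a sketch, and model3b and model4_step are names introduced here rather than part of the original analysis; the aliased \(chas^2\) term is dropped first so the starting model has full rank.

# drop the aliased chas^2 term, then run backward selection by AIC
model3b <- update(model3, ~ . - I(chas^2))
model4_step <- stepAIC(model3b, direction = "backward", trace = FALSE)
formula(model4_step)   # compare the selected terms with model 4 above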

Finally, compare all four models with the compare_performance function from the performance package.

library(performance)
compare_performance(model1, model2, model3, model4, rank = TRUE)
## # Comparison of Model Performance Indices
## 
## Model  | Type |     AIC |     BIC |   R2 | R2_adjusted | RMSE | Sigma |        BF | Performance_Score
## -----------------------------------------------------------------------------------------------------
## model3 |   lm | -299.16 | -215.13 | 0.85 |        0.84 | 0.16 |  0.16 | BF > 1000 |           100.00%
## model4 |   lm | -230.95 | -174.93 | 0.82 |        0.81 | 0.18 |  0.18 | BF > 1000 |            49.31%
## model2 |   lm | -172.99 | -124.97 | 0.79 |        0.78 | 0.19 |  0.19 | BF > 1000 |             2.52%
## model1 |   lm | -170.90 | -110.88 | 0.79 |        0.78 | 0.19 |  0.19 | BF = 1.00 |             1.26%
## 
## Model model3 (of class lm) performed best with an overall performance score of 100.00%.

Prediction

According to compare_performance, model 3 is the best model. The real goal, however, is to minimize the error on unseen data. Since we have already split the data into training and test sets, we will use the test set to evaluate model 3: we generate predictions from model 3 and measure the prediction error.
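For reference, the error metric computed below is the root mean squared error on the original \(medv\) scale, \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}\), where the predictions \(\hat{y}_i\) are obtained by applying \(\exp\) to the model output because the model was fit on the log of \(medv\).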

pred1 <- predict(model3, newdata = test)
## Warning in predict.lm(model3, newdata = test): prediction from a rank-deficient
## fit may be misleading
rmse <- sqrt(sum((exp(pred1) - test$medv)^2)/length(test$medv))
c(RMSE = rmse)
##     RMSE 
## 4.847852

Let’s also take a look at model 4, now that we know predictions from model 3 may be misleading because of its rank-deficient fit, even though the compare_performance function ranked model 3 first.

pred2 <- predict(model4, newdata = test)
rmse <- sqrt(sum((exp(pred2) - test$medv)^2)/length(test$medv))
c(RMSE = rmse, R2=summary(model4)$r.squared)
##     RMSE       R2 
## 5.337152 0.819311
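The \(R^2\) printed above is the training-set \(R^2\) of model 4, not an out-of-sample measure. As an optional extra check, not part of the original write-up, an out-of-sample \(R^2\) for model 4 on the test set can be computed directly from the predictions:

# out-of-sample R^2 for model 4: 1 - SSE/SST on the held-out test data
sse <- sum((exp(pred2) - test$medv)^2)
sst <- sum((test$medv - mean(test$medv))^2)
1 - sse/sst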