Practical 3

Exercice 1 (6.1 of the book)

Preliminaries

# Cleaning environment and console
rm(list = ls())
cat("\014")

# Calling packages
library(rpart)
library(rpart.plot)
library(forecast)
library(pander)
library(dummies)
library(caret)
library(neuralnet)
library(ellipse)
library(corrplot)


# Importing data

house <- read.csv("BostonHousing.csv", sep = ",", header = TRUE)

attach(house)

str(house)

## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM     : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN       : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS    : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX      : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM       : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE      : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS      : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD      : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX      : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO  : num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ LSTAT    : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ MEDV     : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##  $ CAT..MEDV: int  0 0 1 1 1 0 0 0 0 0 ...

pander(head(house))

Table continues below
CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX
0.00632	18	2.31	0.538	6.575	65.2	4.09	1	296
0.02731	0	7.07	0.469	6.421	78.9	4.967	2	242
0.02729	0	7.07	0.469	7.185	61.1	4.967	2	242
0.03237	0	2.18	0.458	6.998	45.8	6.062	3	222
0.06905	0	2.18	0.458	7.147	54.2	6.062	3	222
0.02985	0	2.18	0.458	6.43	58.7	6.062	3	222

PTRATIO	LSTAT	MEDV	CAT..MEDV
15.3	4.98	24	0
17.8	9.14	21.6	0
17.8	4.03	34.7	1
18.7	2.94	33.4	1
18.7	5.33	36.2	1
18.7	5.21	28.7	0

# Partitioning (60%-40%)

house_train.obs <- sample(rownames(house), dim(house)[1]*0.6)
house_train.set <- house[house_train.obs,]

house_valid.obs <- setdiff(rownames(house), house_train.obs)
house_valid.set <- house[house_valid.obs,]

a. Why partitioning into training and validation sets?

This is done because we want to do two things :

Build our model : we need data for this, so we use only a part of the total dataset, which we call “training set”. During this phase, we are “tuning” the parameters and adjusting things so that we can explain the training set correctly.
Verify that the model is good at predicting “new” data : if we use, to test the model, the same data which was used to build it, we will find that the predictions are almost perfect. However, if we apply the model to new and unseen data, then most probably the predictions won’t be as good. This is why we must keep one partition of the entire dataset to be used for validation (even sometimes we keep a third part called “test set”).

So this illustrates the common trade-off between overfitting the training data and predicting “new” data correctly. We do not want to overfit the training data too much, because we want to be able to predict the validation set as good as possible.

b. Multiple Linear Regression

# Running the regression :

MLR <- lm(MEDV ~ CRIM + CHAS + RM, data = house_train.set)

summary(MLR)

## 
## Call:
## lm(formula = MEDV ~ CRIM + CHAS + RM, data = house_train.set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0751  -2.8530  -0.1695   2.9024  25.3980 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -31.16106    3.30130  -9.439  < 2e-16 ***
## CRIM         -0.31484    0.04307  -7.310 2.46e-12 ***
## CHAS          9.99984    1.53024   6.535 2.74e-10 ***
## RM            8.57376    0.51690  16.587  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.741 on 299 degrees of freedom
## Multiple R-squared:  0.617,  Adjusted R-squared:  0.6132 
## F-statistic: 160.6 on 3 and 299 DF,  p-value: < 2.2e-16

# Equation for predicting MEDV (using the estimated coefficients) :

equation <- -34.10504 + -0.20280*CRIM + 6.84439*CHAS + 9.08959*RM



# Adding regression outcome into the data frame :

house$Prediction <- round(equation, 4)

head(house$Prediction)

## [1] 25.6577 24.2537 31.1981 29.4973 30.8443 24.3350

These numbers above are some of the predictions for the prices (in $1000s).

From the results above, we can see that all the coefficients are statistically significant (at 0.1% level, with 3 “stars” *).

Now, before we make any further inferences, it is imperative that we check whether the Linear Regression assumptions are satisfied!

The three most important assumptions that any Linear Regression Model should satisfy are :

1- The relationship between outcome and predictors is linear.
2- The residuals are normally distributed.
3- The residuals have mean = zero and constant variance.

Now, let’s verify each assumption, one by one.

1- Linear relationship (using plot matrix)

pairs(house_train.set[, c(1, 4, 6, 13)])

The relationship between MEDV and RM is quite linear. However, the relationship between MEDV and CRIM does not seem so linear!

Maybe we could transform it by taking the logarithm…

# Log transformation :

house_train.set$log_CRIM <- log(house_train.set$CRIM)

# Ploting to see the effect :

plot(house_train.set$MEDV, house_train.set$log_CRIM)

Ok, now the (negative) relationship seems quite linear, although not as strong as the one between MEDV and RM.

Therefore, for our predictions, it would be better to include the variable CRIM by taking the logarithm. Let’s build that model now and use it from now on :

# New model with variable tranformation :

MLR_transf <- lm(MEDV ~ log_CRIM + CHAS + RM, data = house_train.set)

summary(MLR_transf)

## 
## Call:
## lm(formula = MEDV ~ log_CRIM + CHAS + RM, data = house_train.set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0701  -3.4913  -0.7244   2.7162  25.6250 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -30.6627     3.2019  -9.576  < 2e-16 ***
## log_CRIM     -1.2940     0.1517  -8.530 7.42e-16 ***
## CHAS         11.4437     1.4970   7.644 2.88e-13 ***
## RM            8.1563     0.5124  15.917  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.589 on 299 degrees of freedom
## Multiple R-squared:  0.6369, Adjusted R-squared:  0.6333 
## F-statistic: 174.8 on 3 and 299 DF,  p-value: < 2.2e-16

Note that for the variable CHAS, checking linear relationship is nonesense, since it is a binary variable!

2 and 3 : Normality, mean and variance

plot(MLR_transf)

Although the Q-Q plot does not show good fit on the tails, we could assume that the residuals follow a normal distribution.

In the first plot, we can see that although the fitting line is somewhat curved, the mean is more or less around 0.

Concerning the variance, we can see that the residuals are spread more or less homogenously along the x-axis. So we can state that we have a constant variance and no heteroskedasticity.

Just by the way, there does not seem to be any influencial outlier, since no observation exceeds the Cook’s distance on the last graph.

c. Predicting using the regression model

We just have to use the equation we wrote down before (this time substituting CRIM by log_CRIM, AND replacing the old coefficients by the new ones) and replace the values that are indicated in the prompt :

predicted_price = -34.3656  + -0.8609*log(0.1) + 7.3472*0 + 8.8831*6

predicted_price

## [1] 20.9153

The predicted price for this house is $20915.3

d. Reducing the number of predictors

i. Which predictors are likely to be measuring the same thing?

From our intuition, it seems that, for instance, AGE and RM may be highly correlated, because surely the older the house is, the more rooms it has. This is because nowadays houses tend to be smaller, since families are not so big and there are many more single people.

Also, CRIM and ZN are positively correlated, because in a more populated area, usually the possibilities to find a criminal are higher.

However, instead of only trusting our intuition, the best thing to do would be a correlation plot!

ii. Correlation table and removing pairs

corrplot(cor(house))

From this correlation plot, we can identify the following highly correlated pairs (excluding the variables Prediction, CAT..MEDV and MEDV, since we are interested only in explanatory variables) :

RAD and TAX
NOX and INDUS
DIS and AGE
DIS and NOX
DIS and INDUS

The other pais do not seem to be highly correlated (whether negatively or positively). So, we may want to eliminate the variable “DIS”, for instance, since it appears in 3 pairs out of the 5 we exposed.

If we do not eliminate “DIS”, we may suffer from multicollinearity.

But before reaching these conclusions, one should proceed to some kind of variable selection (exhaustive search, or stepwise regressions). This is what we do in the next section.

iii. Using stepwise regressions

First, let’s do a forward selection :

# Creating a model with no predictors :

MLR_nopred <- lm(MEDV ~ 1, data = house_train.set[, -c(1, 14)])

# We are taking away column 1 (because it corresponds to the untransformed CRIM) and column 14 (because it contains information about the OUTCOME variable MEDV)!


# Creating a FULL model (all predictors)

MLR_full <- lm(MEDV ~., data = house_train.set[, -c(1, 14)])


# Running the FORWARD selection :

MLR_forward <- step(MLR_nopred, scope = list(lower = MLR_nopred, upper = MLR_full), direction = "forward")

## Start:  AIC=1347.8
## MEDV ~ 1
## 
##            Df Sum of Sq   RSS    AIC
## + LSTAT     1   14831.2 10897 1089.5
## + RM        1   12654.3 13074 1144.7
## + PTRATIO   1    6224.9 19503 1265.9
## + NOX       1    5659.2 20069 1274.5
## + INDUS     1    5603.2 20125 1275.4
## + log_CRIM  1    5410.6 20317 1278.3
## + TAX       1    5355.9 20372 1279.1
## + AGE       1    4068.4 21659 1297.7
## + RAD       1    3975.7 21752 1298.9
## + ZN        1    3601.6 22126 1304.1
## + CHAS      1    2562.7 23165 1318.0
## + DIS       1    2214.5 23513 1322.5
## <none>                  25728 1347.8
## 
## Step:  AIC=1089.49
## MEDV ~ LSTAT
## 
##            Df Sum of Sq     RSS    AIC
## + RM        1   2592.21  8304.4 1009.2
## + CHAS      1   1308.73  9587.9 1052.7
## + PTRATIO   1   1156.73  9739.9 1057.5
## + DIS       1    348.22 10548.4 1081.7
## + AGE       1    247.91 10648.7 1084.5
## + ZN        1     96.89 10799.7 1088.8
## + TAX       1     79.26 10817.4 1089.3
## <none>                  10896.6 1089.5
## + log_CRIM  1     21.44 10875.2 1090.9
## + RAD       1     18.93 10877.7 1091.0
## + NOX       1      2.57 10894.1 1091.4
## + INDUS     1      0.19 10896.4 1091.5
## 
## Step:  AIC=1009.18
## MEDV ~ LSTAT + RM
## 
##            Df Sum of Sq    RSS     AIC
## + CHAS      1   1126.97 7177.5  966.99
## + PTRATIO   1    831.53 7472.9  979.21
## + TAX       1    225.72 8078.7 1002.83
## + DIS       1    218.09 8086.3 1003.11
## + RAD       1    195.55 8108.9 1003.96
## + AGE       1     63.50 8240.9 1008.85
## <none>                  8304.4 1009.18
## + ZN        1     11.96 8292.5 1010.74
## + INDUS     1     11.19 8293.2 1010.77
## + log_CRIM  1      6.77 8297.7 1010.93
## + NOX       1      0.14 8304.3 1011.17
## 
## Step:  AIC=966.99
## MEDV ~ LSTAT + RM + CHAS
## 
##            Df Sum of Sq    RSS    AIC
## + PTRATIO   1    697.22 6480.2 938.02
## + TAX       1    338.63 6838.8 954.34
## + RAD       1    302.20 6875.3 955.95
## + log_CRIM  1     74.49 7103.0 965.82
## + DIS       1     69.39 7108.1 966.04
## + NOX       1     49.78 7127.7 966.88
## + ZN        1     47.94 7129.5 966.95
## <none>                  7177.5 966.99
## + INDUS     1      8.55 7168.9 968.62
## + AGE       1      0.00 7177.5 968.99
## 
## Step:  AIC=938.02
## MEDV ~ LSTAT + RM + CHAS + PTRATIO
## 
##            Df Sum of Sq    RSS    AIC
## + DIS       1   173.128 6307.1 931.82
## + TAX       1    94.323 6385.9 935.58
## + RAD       1    48.263 6432.0 937.76
## <none>                  6480.2 938.02
## + NOX       1    38.403 6441.8 938.22
## + AGE       1    10.981 6469.3 939.51
## + INDUS     1     4.731 6475.5 939.80
## + ZN        1     3.429 6476.8 939.86
## + log_CRIM  1     1.984 6478.3 939.93
## 
## Step:  AIC=931.82
## MEDV ~ LSTAT + RM + CHAS + PTRATIO + DIS
## 
##            Df Sum of Sq    RSS    AIC
## + NOX       1    411.69 5895.4 913.36
## + TAX       1    233.83 6073.3 922.37
## + RAD       1    123.92 6183.2 927.80
## + log_CRIM  1     99.68 6207.4 928.99
## + INDUS     1     53.48 6253.6 931.24
## + AGE       1     49.35 6257.8 931.44
## + ZN        1     43.80 6263.3 931.71
## <none>                  6307.1 931.82
## 
## Step:  AIC=913.36
## MEDV ~ LSTAT + RM + CHAS + PTRATIO + DIS + NOX
## 
##            Df Sum of Sq    RSS    AIC
## + TAX       1    48.708 5846.7 912.85
## + ZN        1    39.015 5856.4 913.35
## <none>                  5895.4 913.36
## + AGE       1    12.092 5883.3 914.74
## + RAD       1    10.249 5885.2 914.84
## + INDUS     1     0.402 5895.0 915.34
## + log_CRIM  1     0.110 5895.3 915.36
## 
## Step:  AIC=912.85
## MEDV ~ LSTAT + RM + CHAS + PTRATIO + DIS + NOX + TAX
## 
##            Df Sum of Sq    RSS    AIC
## + ZN        1    70.099 5776.6 911.20
## <none>                  5846.7 912.85
## + log_CRIM  1    24.753 5822.0 913.56
## + RAD       1    20.564 5826.1 913.78
## + AGE       1    18.029 5828.7 913.91
## + INDUS     1     7.822 5838.9 914.44
## 
## Step:  AIC=911.2
## MEDV ~ LSTAT + RM + CHAS + PTRATIO + DIS + NOX + TAX + ZN
## 
##            Df Sum of Sq    RSS    AIC
## + log_CRIM  1    56.250 5720.4 910.23
## <none>                  5776.6 911.20
## + RAD       1    22.932 5753.7 911.99
## + INDUS     1    11.241 5765.4 912.61
## + AGE       1     9.117 5767.5 912.72
## 
## Step:  AIC=910.23
## MEDV ~ LSTAT + RM + CHAS + PTRATIO + DIS + NOX + TAX + ZN + log_CRIM
## 
##         Df Sum of Sq    RSS    AIC
## <none>               5720.4 910.23
## + AGE    1   13.4720 5706.9 911.52
## + INDUS  1   11.6571 5708.7 911.61
## + RAD    1    1.2841 5719.1 912.16

summary(MLR_forward)

## 
## Call:
## lm(formula = MEDV ~ LSTAT + RM + CHAS + PTRATIO + DIS + NOX + 
##     TAX + ZN + log_CRIM, data = house_train.set[, -c(1, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2491  -2.5433  -0.4525   2.2283  20.7290 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.254689   5.952793   4.914 1.48e-06 ***
## LSTAT        -0.574477   0.057239 -10.036  < 2e-16 ***
## RM            4.671773   0.496481   9.410  < 2e-16 ***
## CHAS          8.045255   1.218528   6.602 1.90e-10 ***
## PTRATIO      -0.709177   0.154686  -4.585 6.74e-06 ***
## DIS          -1.238169   0.230663  -5.368 1.62e-07 ***
## NOX         -15.350337   4.481796  -3.425 0.000703 ***
## TAX          -0.008104   0.003070  -2.640 0.008742 ** 
## ZN            0.037956   0.016639   2.281 0.023255 *  
## log_CRIM      0.473780   0.279122   1.697 0.090685 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.419 on 293 degrees of freedom
## Multiple R-squared:  0.7777, Adjusted R-squared:  0.7708 
## F-statistic: 113.9 on 9 and 293 DF,  p-value: < 2.2e-16

We see here that, starting from a completely “empty” model, 9 variables were added :

LSTAT
RM
PTRATIO
CHAS
DIS
NOX
ZN
log_CRIM
TAX

This means that, if we wanted to do variable selection, just using these 9 variables will be fine.

Now, it’s the backward selection’s turn!

# Running the BACKWARD selection :

MLR_backward <- step(MLR_full, direction = "backward")

## Start:  AIC=914.67
## MEDV ~ ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + 
##     PTRATIO + LSTAT + log_CRIM
## 
##            Df Sum of Sq    RSS    AIC
## - RAD       1      2.56 5693.6 912.81
## - AGE       1     12.33 5703.4 913.33
## - INDUS     1     15.65 5706.7 913.51
## - log_CRIM  1     32.71 5723.8 914.41
## <none>                  5691.1 914.67
## - ZN        1     91.49 5782.6 917.51
## - TAX       1    102.54 5793.6 918.09
## - NOX       1    219.25 5910.3 924.13
## - PTRATIO   1    390.47 6081.5 932.78
## - DIS       1    474.73 6165.8 936.95
## - CHAS      1    839.65 6530.7 954.37
## - LSTAT     1   1639.58 7330.6 989.38
## - RM        1   1665.07 7356.1 990.44
## 
## Step:  AIC=912.81
## MEDV ~ ZN + INDUS + CHAS + NOX + RM + AGE + DIS + TAX + PTRATIO + 
##     LSTAT + log_CRIM
## 
##            Df Sum of Sq    RSS    AIC
## - INDUS     1     13.26 5706.9 911.52
## - AGE       1     15.07 5708.7 911.61
## <none>                  5693.6 912.81
## - log_CRIM  1     61.35 5755.0 914.06
## - ZN        1     94.93 5788.6 915.82
## - TAX       1    156.44 5850.1 919.02
## - NOX       1    218.57 5912.2 922.23
## - PTRATIO   1    409.44 6103.1 931.85
## - DIS       1    476.41 6170.0 935.16
## - CHAS      1    850.73 6544.4 953.01
## - LSTAT     1   1637.35 7331.0 987.40
## - RM        1   1731.36 7425.0 991.26
## 
## Step:  AIC=911.52
## MEDV ~ ZN + CHAS + NOX + RM + AGE + DIS + TAX + PTRATIO + LSTAT + 
##     log_CRIM
## 
##            Df Sum of Sq    RSS    AIC
## - AGE       1     13.47 5720.4 910.23
## <none>                  5706.9 911.52
## - log_CRIM  1     60.60 5767.5 912.72
## - ZN        1     91.19 5798.1 914.32
## - TAX       1    143.58 5850.5 917.04
## - NOX       1    205.42 5912.3 920.23
## - PTRATIO   1    407.76 6114.7 930.43
## - DIS       1    545.49 6252.4 937.18
## - CHAS      1    864.51 6571.4 952.26
## - LSTAT     1   1624.88 7331.8 985.43
## - RM        1   1730.60 7437.5 989.77
## 
## Step:  AIC=910.23
## MEDV ~ ZN + CHAS + NOX + RM + DIS + TAX + PTRATIO + LSTAT + log_CRIM
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  5720.4 910.23
## - log_CRIM  1     56.25 5776.6 911.20
## - ZN        1    101.60 5822.0 913.56
## - TAX       1    136.04 5856.4 915.35
## - NOX       1    229.03 5949.4 920.13
## - PTRATIO   1    410.36 6130.7 929.22
## - DIS       1    562.55 6282.9 936.65
## - CHAS      1    851.07 6571.4 950.26
## - RM        1   1728.68 7449.0 988.24
## - LSTAT     1   1966.60 7687.0 997.77

summary(MLR_backward)

## 
## Call:
## lm(formula = MEDV ~ ZN + CHAS + NOX + RM + DIS + TAX + PTRATIO + 
##     LSTAT + log_CRIM, data = house_train.set[, -c(1, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2491  -2.5433  -0.4525   2.2283  20.7290 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.254689   5.952793   4.914 1.48e-06 ***
## ZN            0.037956   0.016639   2.281 0.023255 *  
## CHAS          8.045255   1.218528   6.602 1.90e-10 ***
## NOX         -15.350337   4.481796  -3.425 0.000703 ***
## RM            4.671773   0.496481   9.410  < 2e-16 ***
## DIS          -1.238169   0.230663  -5.368 1.62e-07 ***
## TAX          -0.008104   0.003070  -2.640 0.008742 ** 
## PTRATIO      -0.709177   0.154686  -4.585 6.74e-06 ***
## LSTAT        -0.574477   0.057239 -10.036  < 2e-16 ***
## log_CRIM      0.473780   0.279122   1.697 0.090685 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.419 on 293 degrees of freedom
## Multiple R-squared:  0.7777, Adjusted R-squared:  0.7708 
## F-statistic: 113.9 on 9 and 293 DF,  p-value: < 2.2e-16

If I am not mistaken, it seems that the backward selection selected the same predictors as the forward one!

And finally, we are doing both selections :

# Running BOTH selections :

MLR_both <- step(MLR_full, direction = "both")

## Start:  AIC=914.67
## MEDV ~ ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + 
##     PTRATIO + LSTAT + log_CRIM
## 
##            Df Sum of Sq    RSS    AIC
## - RAD       1      2.56 5693.6 912.81
## - AGE       1     12.33 5703.4 913.33
## - INDUS     1     15.65 5706.7 913.51
## - log_CRIM  1     32.71 5723.8 914.41
## <none>                  5691.1 914.67
## - ZN        1     91.49 5782.6 917.51
## - TAX       1    102.54 5793.6 918.09
## - NOX       1    219.25 5910.3 924.13
## - PTRATIO   1    390.47 6081.5 932.78
## - DIS       1    474.73 6165.8 936.95
## - CHAS      1    839.65 6530.7 954.37
## - LSTAT     1   1639.58 7330.6 989.38
## - RM        1   1665.07 7356.1 990.44
## 
## Step:  AIC=912.81
## MEDV ~ ZN + INDUS + CHAS + NOX + RM + AGE + DIS + TAX + PTRATIO + 
##     LSTAT + log_CRIM
## 
##            Df Sum of Sq    RSS    AIC
## - INDUS     1     13.26 5706.9 911.52
## - AGE       1     15.07 5708.7 911.61
## <none>                  5693.6 912.81
## - log_CRIM  1     61.35 5755.0 914.06
## + RAD       1      2.56 5691.1 914.67
## - ZN        1     94.93 5788.6 915.82
## - TAX       1    156.44 5850.1 919.02
## - NOX       1    218.57 5912.2 922.23
## - PTRATIO   1    409.44 6103.1 931.85
## - DIS       1    476.41 6170.0 935.16
## - CHAS      1    850.73 6544.4 953.01
## - LSTAT     1   1637.35 7331.0 987.40
## - RM        1   1731.36 7425.0 991.26
## 
## Step:  AIC=911.52
## MEDV ~ ZN + CHAS + NOX + RM + AGE + DIS + TAX + PTRATIO + LSTAT + 
##     log_CRIM
## 
##            Df Sum of Sq    RSS    AIC
## - AGE       1     13.47 5720.4 910.23
## <none>                  5706.9 911.52
## - log_CRIM  1     60.60 5767.5 912.72
## + INDUS     1     13.26 5693.6 912.81
## + RAD       1      0.17 5706.7 913.51
## - ZN        1     91.19 5798.1 914.32
## - TAX       1    143.58 5850.5 917.04
## - NOX       1    205.42 5912.3 920.23
## - PTRATIO   1    407.76 6114.7 930.43
## - DIS       1    545.49 6252.4 937.18
## - CHAS      1    864.51 6571.4 952.26
## - LSTAT     1   1624.88 7331.8 985.43
## - RM        1   1730.60 7437.5 989.77
## 
## Step:  AIC=910.23
## MEDV ~ ZN + CHAS + NOX + RM + DIS + TAX + PTRATIO + LSTAT + log_CRIM
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  5720.4 910.23
## - log_CRIM  1     56.25 5776.6 911.20
## + AGE       1     13.47 5706.9 911.52
## + INDUS     1     11.66 5708.7 911.61
## + RAD       1      1.28 5719.1 912.16
## - ZN        1    101.60 5822.0 913.56
## - TAX       1    136.04 5856.4 915.35
## - NOX       1    229.03 5949.4 920.13
## - PTRATIO   1    410.36 6130.7 929.22
## - DIS       1    562.55 6282.9 936.65
## - CHAS      1    851.07 6571.4 950.26
## - RM        1   1728.68 7449.0 988.24
## - LSTAT     1   1966.60 7687.0 997.77

summary(MLR_both)

## 
## Call:
## lm(formula = MEDV ~ ZN + CHAS + NOX + RM + DIS + TAX + PTRATIO + 
##     LSTAT + log_CRIM, data = house_train.set[, -c(1, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2491  -2.5433  -0.4525   2.2283  20.7290 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.254689   5.952793   4.914 1.48e-06 ***
## ZN            0.037956   0.016639   2.281 0.023255 *  
## CHAS          8.045255   1.218528   6.602 1.90e-10 ***
## NOX         -15.350337   4.481796  -3.425 0.000703 ***
## RM            4.671773   0.496481   9.410  < 2e-16 ***
## DIS          -1.238169   0.230663  -5.368 1.62e-07 ***
## TAX          -0.008104   0.003070  -2.640 0.008742 ** 
## PTRATIO      -0.709177   0.154686  -4.585 6.74e-06 ***
## LSTAT        -0.574477   0.057239 -10.036  < 2e-16 ***
## log_CRIM      0.473780   0.279122   1.697 0.090685 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.419 on 293 degrees of freedom
## Multiple R-squared:  0.7777, Adjusted R-squared:  0.7708 
## F-statistic: 113.9 on 9 and 293 DF,  p-value: < 2.2e-16

And again, exactly the same predictors are chosen by the algorithm.

This fact is indeed reassuring! 😄

Now, the only thing we have left is to predict the validation set with the best model from each of the stepwise selections. However, because each selection yielded the SAME predictors, there is ONLY one model to be used to predict the validation set…

So, now let’s run a Multiple Linear Regression using the validation set. We will only be using the 9 predictors we found previously.

# Running the model with the selected predictors

MLR_selected <- lm(MEDV ~ log_CRIM + CHAS + RM + ZN + NOX + DIS + PTRATIO + LSTAT + TAX, data = house_train.set[, -c(1, 14)])

summary(MLR_selected)

## 
## Call:
## lm(formula = MEDV ~ log_CRIM + CHAS + RM + ZN + NOX + DIS + PTRATIO + 
##     LSTAT + TAX, data = house_train.set[, -c(1, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2491  -2.5433  -0.4525   2.2283  20.7290 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.254689   5.952793   4.914 1.48e-06 ***
## log_CRIM      0.473780   0.279122   1.697 0.090685 .  
## CHAS          8.045255   1.218528   6.602 1.90e-10 ***
## RM            4.671773   0.496481   9.410  < 2e-16 ***
## ZN            0.037956   0.016639   2.281 0.023255 *  
## NOX         -15.350337   4.481796  -3.425 0.000703 ***
## DIS          -1.238169   0.230663  -5.368 1.62e-07 ***
## PTRATIO      -0.709177   0.154686  -4.585 6.74e-06 ***
## LSTAT        -0.574477   0.057239 -10.036  < 2e-16 ***
## TAX          -0.008104   0.003070  -2.640 0.008742 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.419 on 293 degrees of freedom
## Multiple R-squared:  0.7777, Adjusted R-squared:  0.7708 
## F-statistic: 113.9 on 9 and 293 DF,  p-value: < 2.2e-16

These results tell us that all the predictors seem to be statistically significant. However, some are more significant than others, but already a 5% significance level is great, so we keep all the predictors.

Now, to have an idea of the performance of this new model on predicting the validation data compared with the model proposed in the prompt (only CRIM, CHAS and RM), let’s compute some useful metrics :

# Creating the log transformation in the validations set :

house_valid.set$log_CRIM = log(house_valid.set$CRIM)

# Predicting the validation data using the initial proposed model :

pred_initial = predict(MLR_transf, house_valid.set)
accuracy(pred_initial, house_valid.set$MEDV)

##                 ME    RMSE      MAE       MPE     MAPE
## Test set 0.3355111 7.32952 4.763903 -2.926922 22.66992

# Predicting the validation data using the new model :

pred_selected = predict(MLR_selected, house_valid.set)
accuracy(pred_selected, house_valid.set$MEDV)

##                 ME    RMSE      MAE        MPE    MAPE
## Test set 0.4453469 5.90066 3.918759 -0.1941773 18.9326

We can see that the RMSE is smaller when selecting only the 9 variables found in our selection process. So, we may think that this reduced model is the best model.

However, it would be interesting to compare this reduced (and supposedly better) model with the FULL model (all variables included).

# Comparing with FULL model :

pred_full = predict(MLR_full, house_valid.set)
accuracy(pred_full, house_valid.set$MEDV)

##                 ME    RMSE      MAE        MPE     MAPE
## Test set 0.4771218 5.93313 3.933339 0.03933146 18.98019

Actually, much to our surprise, we see that the error metrics (we are focusing on ME, RMSE, MAPE and MAE) are almost the same between the reduced and the full model! However, it is true that they are smaller for the reduced model (only 9 predictors).

This means that using the full model (with all predictors) and the reduced model (selected by running selection algorithms) won’t make a huge difference, actually… And this is quite counterintuitive! 😱

Conclusion

Because the stepwise selections do NOT guarantee to find the “best” model, it is absolutely important that we verify the predictions on the validation data.

This is what we did, and we found that actually using the full model or using the reduced one (after applying stepwise selection algorithms) would yield more or less the same amount of accuracy, and thus almost the same predictive performance. However, one should prefer using the reduced model, since the error metrics are somewhat smaller (as we have seen).

Exercice 2 (9.3 of the book)

Preliminaries (importing the dataset & exploratory analysis)

cars <- read.csv("ToyotaCorolla.csv", sep = ",", header = TRUE)

str(cars)

## 'data.frame':    1436 obs. of  39 variables:
##  $ Id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Model            : chr  "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" " TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" "TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors" ...
##  $ Price            : int  13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
##  $ Age_08_04        : int  23 23 24 26 30 32 27 30 27 23 ...
##  $ Mfg_Month        : int  10 10 9 7 3 1 6 3 6 10 ...
##  $ Mfg_Year         : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
##  $ KM               : int  46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
##  $ Fuel_Type        : chr  "Diesel" "Diesel" "Diesel" "Diesel" ...
##  $ HP               : int  90 90 90 90 90 90 90 90 192 69 ...
##  $ Met_Color        : int  1 1 1 0 0 0 1 1 0 0 ...
##  $ Color            : chr  "Blue" "Silver" "Blue" "Black" ...
##  $ Automatic        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CC               : int  2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
##  $ Doors            : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Cylinders        : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Gears            : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Quarterly_Tax    : int  210 210 210 210 210 210 210 210 100 185 ...
##  $ Weight           : int  1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...
##  $ Mfr_Guarantee    : int  0 0 1 1 1 0 0 1 0 0 ...
##  $ BOVAG_Guarantee  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Guarantee_Period : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ ABS              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airbag_1         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airbag_2         : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Airco            : int  0 1 0 0 1 1 1 1 1 1 ...
##  $ Automatic_airco  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Boardcomputer    : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ CD_Player        : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ Central_Lock     : int  1 1 0 0 1 1 1 1 1 0 ...
##  $ Powered_Windows  : int  1 0 0 0 1 1 1 1 1 0 ...
##  $ Power_Steering   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Radio            : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Mistlamps        : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ Sport_Model      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Backseat_Divider : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Metallic_Rim     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Radio_cassette   : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Parking_Assistant: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Tow_Bar          : int  0 0 0 0 0 0 0 0 0 0 ...

pander(head(cars))

Table continues below
Id	Model	Price	Age_08_04	Mfg_Month	Mfg_Year
1	TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors	13500	23	10	2002
2	TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors	13750	23	10	2002
3	TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors	13950	24	9	2002
4	TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors	14950	26	7	2002
5	TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors	13750	30	3	2002
6	TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors	12950	32	1	2002

Table continues below
KM	Fuel_Type	HP	Met_Color	Color	CC	Doors
46986	Diesel	90	1	Blue	2000	3
72937	Diesel	90	1	Silver	2000	3
41711	Diesel	90	1	Blue	2000	3
48000	Diesel	90	0	Black	2000	3
38500	Diesel	90	0	Black	2000	3
61000	Diesel	90	0	White	2000	3

Table continues below
Cylinders	Gears	Quarterly_Tax	Weight	Mfr_Guarantee	BOVAG_Guarantee
4	5	210	1165	0	1
4	5	210	1165	0	1
4	5	210	1165	1	1
4	5	210	1165	1	1
4	5	210	1170	1	1
4	5	210	1170	0	1

Table continues below
Guarantee_Period	ABS	Airbag_1	Airbag_2	Airco
3	1	1	1	0
3	1	1	1	1
3	1	1	1	0
3	1	1	1	0
3	1	1	1	1
3	1	1	1	1

Table continues below
Boardcomputer	CD_Player	Central_Lock	Powered_Windows	Power_Steering
1	0	1	1	1
1	1	1	0	1
1	0	0	0	1
1	0	0	0	1
1	0	1	1	1
1	0	1	1	1

Table continues below
Radio	Mistlamps	Sport_Model	Backseat_Divider	Metallic_Rim
0	0	0	1	0
0	0	0	1	0
0	0	0	1	0
0	0	0	1	0
0	1	0	1	0
0	1	0	1	0

Radio_cassette	Parking_Assistant	Tow_Bar
0	0	0
0	0	0
0	0	0
0	0	0
0	0	0
0	0	0

Preprocessing (dividing into training and validation)

set.seed(1)

# Partitioning the data (60% - 40%)

train.obs <- sample(rownames(cars), dim(cars)[1]*0.6)
train.set <- cars[train.obs, c(3,4,7,8,9,12,14,17,19,21,25,26,28,30,34,39)]

valid.obs <- setdiff(rownames(cars), train.obs)
valid.set <- cars[valid.obs, c(3,4,7,8,9,12,14,17,19,21,25,26,28,30,34,39)]

# We only considered the variables which were determined in the prompt!

a. Running the regression tree

Let’s now run the tree!

tree <- rpart(
   Price ~ .,
   data = train.set,
   method = "anova",
   control = rpart.control(
      cp = 0.001,
      minbucket = 1,
      maxdepth = 30
   )
)


plot_tree = prp(
   tree,
   type = 1,
   extra = 1,
   under = TRUE,
   split.font = 1,
   varlen = -10
)

i. Which are the three or four most important variables?

# Checking the variable importance
sort(tree$variable.importance, decreasing = TRUE)

##        Age_08_04  Automatic_airco               KM    Quarterly_Tax 
##       9885745349       2987651311       2886703313       1885088412 
##               HP Guarantee_Period        CD_Player        Fuel_Type 
##       1665897142        430364046        357332146        206914123 
##            Airco  Powered_Windows            Doors    Mfr_Guarantee 
##        122474786        109250988         63966015         21300178 
##        Automatic      Sport_Model 
##         19776457          3395399

In the previous table, the higher the values are, the more important are the variables in contributing to the model.

Therefore, we can see that (after having sorted the vector), the four most important car specifications are : “Age”, “Automatic Air Conditioning”, “KM”, and “Quarterly Tax”.

ii. Compare the RMSE for training and validation sets

First, let’s concentrate on the training set.

# First, let's create two vectors, one for the predicted values, and another for the actual values :

predicted_train <- predict(tree, train.set)
actual_train <- train.set$Price

# And lastly, we make use of the RSME formula to calculate it :

RMSE_train = sqrt(mean((predicted_train-actual_train)^2))

RMSE_train

## [1] 986.5121

And now, let’s turn to the validation set.

predicted_valid <- predict(tree, valid.set)
actual_valid <- valid.set$Price


RMSE_valid = sqrt(mean((predicted_valid-actual_valid)^2))

RMSE_valid

## [1] 1192.877

The RMSE of the training set is smaller than the one of the validation set. This is normal, because we are overfitting the training data. When we use new data not used in building the model, it is normal to have poorer predictive performance. This is because the tree was built with a single purpose : predict the training data as well as possible. However, since the validation data is “new” and “unseen”, then the tree won’t predict it as well, because it was NOT built with that purpose in mind.

Since the RMSE measures the standard deviation of the residuals, it shows how well the predictions were. The smaller the RMSE, the better for the model in terms of predictive performance.

It turns out that the RMSE is an ABSOLUTE measure, so we can actually interpret it in the same unit as the response variable and say that (for example) a RMSE of 986.5121 means that the model’s predictions are far away to the regression line (on average) by a measure of $986.5121 (if the price was in $). And since we always would like that our residuals are small (and consequently that our observations are as close to the regression line as possible), we would also like to have a small RMSE!

Now, let’s plot the two boxplots!

par(mfrow = c(1, 2))

boxplot(
   predicted_train,
   actual_train,
   names = c("Predicted", "Actual"),
   ylab = "Price",
   xlab = "Training Data"
)

boxplot(
   predicted_valid,
   actual_valid,
   names = c("Predicted", "Actual"),
   ylab = "Price",
   xlab = "Validation Data"
)

We saw that the RMSE of the training set was smaller than the RMSE of the validation set. This is corroborated by these two boxplots. We can see that for the training data, actual and predicted boxplots are almost the same. However, for the validation data, this is not the case since the actual data is more “spread out” than the predicted data (the boxplot is more “stretched” for the actual data). Therefore, once again, we can say that the model has less predictive power on the validation set than on the training set.

iii. How to achieve better VALIDATION predictive performance?

This question can be restated as follows : how to address overfitting?

Because growing a tree to its entirety (or with a very small complexity parameter, as was the case in section a.) causes the training data to be very well predicted, thus overfitting occurs!

There is one very effective way to avoid overfitting, and thus getting better at predicting the validation data at the expense of the training data : PRUNING.

This consists of (using ONLY validation data) trimming the entirely-grown tree until error starts rising.

iv. Creating a less deep tree, and comparing RMSE

smaller_tree <- rpart(Price ~ ., data = train.set, method = "anova")


plot_smaller_tree = prp(
   smaller_tree,
   type = 1,
   extra = 1,
   under = TRUE,
   split.font = 1,
   varlen = -10
)

Now, let’s again compute the RMSE for the validation data, BUT this time, on the smaller tree.

predicted_valid_small <- predict(smaller_tree, valid.set)

RMSE_valid_small = sqrt(mean((predicted_valid_small-actual_valid)^2))

RMSE_valid_small

## [1] 1341.264

At this point we may be discouraged… because actually we found a RMSE that is quite bigger that the one found previously. 😞

However, we should not despair, because there is one step missing : we must ALSO compute the RMSE for the TRAINING set! If this RMSE for the training set is smaller than the one of the validation set, then we have a problem, because it means that our pruning was not successful.

predicted_train_small <- predict(smaller_tree, train.set)

RMSE_train_small = sqrt(mean((predicted_train_small-actual_train)^2))

RMSE_train_small

## [1] 1396.864

So, as we can see, `RMSE_train_small` is bigger than `RMSE_valid_small`, which is logical since we pruned the tree. Now the tree is smaller and it does not overfit the training data as much as before.

However, it may be that the default complexity parameter and maximum depth are NOT the adecuate ones, since now both RMSE are way higher than with the full-grown tree.

In order to choose the best complexity parameter, one can proceed with a cross-validation approach. Then, the standard error of the estimated cross-validation error can be compared with different values of the complexity parameter, in order to choose the complexity parameter which corresponds to the lowest estimated CV error (`xerror`).

b. Same thing, but binning the price variable

options(scipen = 999)

# Categorizing the price variable into 20 bins
train.set$Price_Bins <- cut(train.set$Price, breaks = 20,
                             labels=c("1" , "2", "3", "4", "5", "6" , "7", "8", "9","10", "11", "12" ,"13", "14", "15", "16", "17", "18", "19", "20"))


# Taking "continuous" price away
new_train.set = train.set[, -1]


# And now, running the classification tree :
class_tree <- rpart(Price_Bins ~ ., data = new_train.set, method = "class")


plot_class_tree = prp(
   class_tree,
   type = 1,
   extra = 1,
   under = TRUE,
   split.font = 1,
   varlen = -10
)

i. Comparing the classification tree with the regression tree

The first thing that can be seen, is that the structure is exactly the same. This means that both trees have the same number of nodes, the same number of branches and leaves. The reason is, basically, because nothing has really changed apart from categorizing the outcome variable. Indeed, we used the SAME training set and the same procedure.

However, we can see that two main things have changed :

The number of records in each terminal node. Although the numbers are roughly the same in both trees, sometimes they are NOT the same. Nevertheless, in some instances there is no difference at all.
The values for the prices. This is quite expected, since now inside each node we do not have a big number but a small one, indicating not the price per se, but the category in which it is located.

ii. Predicting a given new observation with each type of tree

First of all, let’s create the new observation.

# Creating the new car 

new_car <- data.frame(Age_08_04 = 77, KM = 117,000, Fuel_Type = "Petrol", HP = 110, Automatic = 0, Doors = 5, Quarterly_Tax = 100, Mfr_Guarantee = 0, Guarantee_Period = 3, Airco = 1, Automatic_airco = 0, CD_Player = 0, Powered_Windows = 0, Sport_Model = 0, Tow_Bar = 1)

Now, let’s start with the regression tree!

# Predicting the new car's price with the regression tree :

predict(smaller_tree, new_car)

##        1 
## 7875.686

So, the predicted price is 7875.686 (dollars?).

And finally, let’s do the same but with the classification tree.

# Predicting the new car's price with the regression tree :

predict(class_tree, new_car)

##            1         2         3         4           5 6 7 8 9 10 11 12 13 14
## 1 0.02654867 0.2256637 0.4867257 0.2522124 0.008849558 0 0 0 0  0  0  0  0  0
##   15 16 17 18 19 20
## 1  0  0  0  0  0  0

The highest probability is found at bin nº3, so the price belongs to this category. But this does not give information about the actual price range inside this category!

So, let’s draw a plot to really know what this category contains.

plot(train.set$Price, train.set$Price_Bins, xlab = "Price", ylab = "Bin number")

Category nº3 contains prices inside the range 5000 and 10000, but this does not help us… Let’s do something different and pretty helpful! 😎

price_vs_bins = table(train.set$Price, train.set$Price_Bins)
which(price_vs_bins[, 3]!=0)

## 7200 7250 7350 7400 7450 7490 7495 7500 7600 7750 7795 7800 7900 7950 7995 8000 
##   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42   43 
## 8100 8200 8250 8400 8450 8495 8500 
##   44   45   46   47   48   49   50

From this vector output, we can see the range of prices which are inside the category nº3 (from 7200 to 8500).

And this makes sense, since the predicted price with the regression tree (7875.686) would also have fallen into this category. Thus, both predictions seem to have been more or less the same.

iii. Comparing both predictions

Both predictions used the same predictors, hence the fact that they are actually very close to each other.

Actually, the only difference was the fact of categorizing the continuous variable “Price”, but apart from that, the methodology remained the same.

However, it is quite difficult to compare both methods in terms of predictive power, since computing RMSE for categorical outcomes is not possible :

# Just a measure of error (RMSE) :

predicted_train_class <- predict(class_tree, new_train.set)
actual_class <- new_train.set$Price_Bins

RMSE_train_class = sqrt(mean((predicted_train_class-actual_class)^2))

RMSE_train_class

## [1] NA

# The result is `NA`...

Exercice 3 (11.3 of the book)

a. Fit a neural net to the data

For this exercice, let’s use the same data partition as in the previous one. This way, we can better compare the results of the predictions.

We are asked to use the same predictors as in the previous exercice.

But before we move forward, we must normalize the data (predictors and response) into a [0, 1] range, because neural nets work better with this kind of normalization.

Start with the training set :

set.seed(1)

# First, remove the "Price_Bins" variable :

train.set_nobins <- train.set[, -17]

# Now, we dummify the categorical predictors ("Doors", "Fuel_Type" and "Guarantee_Period") :

train.set.dumm <-
  cbind(train.set[1:3], dummy(train.set$Fuel_Type, sep = "_"), train.set[5:6], dummy(train.set$Doors, sep = "_"), train.set[8:9], dummy(train.set$Guarantee_Period, sep = "_"), train.set[11:16])

names(train.set.dumm)[4:6] <- c("Fuel_Type_CNG", "Fuel_Type_Diesel", "Fuel_Type_Petrol")

names(train.set.dumm)[9:12] <- c("Doors_2", "Doors_3", "Doors_4", "Doors_5")

names(train.set.dumm)[15:21] <- c("Guarantee_Period_3", "Guarantee_Period_6", "Guarantee_Period_12", "Guarantee_Period_13", "Guarantee_Period_24", "Guarantee_Period_28", "Guarantee_Period_36")


# "Min-max" normalization :

norm.values.train <- caret::preProcess(train.set_nobins[, c(1:3, 5, 8)], method = "range")

train.set_norm <- predict(norm.values.train, train.set.dumm)

And now, the same but with the validation set :

set.seed(1)

# We dummify the categorical predictors ("Doors" and "Guarantee") :

valid.set.dumm <-
  cbind(valid.set[1:3], dummy(valid.set$Fuel_Type, sep = "_"), valid.set[5:6], dummy(valid.set$Doors, sep = "_"), valid.set[8:9], dummy(valid.set$Guarantee_Period, sep = "_"), valid.set[11:16])

names(valid.set.dumm)[4:6] <- c("Fuel_Type_CNG", "Fuel_Type_Diesel", "Fuel_Type_Petrol")

names(valid.set.dumm)[9:12] <- c("Doors_2", "Doors_3", "Doors_4", "Doors_5")

names(valid.set.dumm)[15:21] <- c("Guarantee_Period_3", "Guarantee_Period_6", "Guarantee_Period_12", "Guarantee_Period_13", "Guarantee_Period_24", "Guarantee_Period_28", "Guarantee_Period_36")


# "Min-max" normalization :

norm.values.valid <- caret::preProcess(valid.set[, c(1:3, 5, 8)], method = "range")

valid.set_norm <- predict(norm.values.valid, valid.set.dumm)

Ok, now we are ready to go! 🚀

Training data

We begin with the TRAINING DATA (we will do the validation data afterwards).

Let’s fit the neural network!

set.seed(1)

net.train <- neuralnet(Price ~., data = train.set_norm, hidden = 2)

# IMPORTANT NOTE : inside hidden, "2" alone means "ONE layer + 2 nodes"

# Similarly, "c(2,3)" means "TWO layers, with respectively "2" and "3" nodes


# Here we can see the prediction results :

train.norm.pred <- predict(object = net.train, newdata = train.set_norm[,-1])

head(train.norm.pred)

##           [,1]
## 1017 0.1628378
## 679  0.1643410
## 129  0.4921746
## 930  0.1816020
## 471  0.2282467
## 299  0.3578104

# Let's plot the neural net :
plot(net.train, rep = "best")

Now that we have our model built on the training data, we need to DENORMALIZE in order to find meaningful numbers!

# Creating a "denormalization" function :

denormalize <- function(x, y) {
  return (x*(max(y) - min(y)) + min(y))
}


# Choosing only the ORIGINAL prices :
only_price_train = train.set.dumm[, 1]


# Applying the function :
denorm.train = denormalize(train.norm.pred, only_price_train)

head(denorm.train)

##           [,1]
## 1017  8933.883
## 679   8976.199
## 129  18204.716
## 930   9462.095
## 471  10775.145
## 299  14422.362

Ok, and now the final step is to compute the RMSE.

Let’s do it with the function `postResample`:

# First, create a (nx2) data frame with predictions and actual numbers :

df.train = data.frame("Prediction" = denorm.train, "Actual" = only_price_train)


# Now, compute the metrics :

RMSE_train_1 = postResample(denorm.train, only_price_train)

RMSE_train_1

##         RMSE     Rsquared          MAE 
## 1056.6133443    0.9198639  790.2279073

Now we need to create the FINAL data frame where we will compare the RMSE for each of the different networks :

# Creating the data frame for comparisons, and already adding the first RMSE :

comparing_df = data.frame("Network_Type" = c("1 layer, 2 nodes", "1 layer, 5 nodes", "2 layers, 5 nodes in each"), "RMSE_train" = as.numeric(c(round(RMSE_train_1[1], 3), "NA", "NA")), "RMSE_valid" = as.numeric(c("NA", "NA", "NA")))

## Warning in data.frame(Network_Type = c("1 layer, 2 nodes", "1 layer, 5 nodes", :
## NAs introducidos por coerción

## Warning in data.frame(Network_Type = c("1 layer, 2 nodes", "1 layer, 5 nodes", :
## NAs introducidos por coerción

pander(comparing_df)

Network_Type	RMSE_train	RMSE_valid
1 layer, 2 nodes	1057	NA
1 layer, 5 nodes	NA	NA
2 layers, 5 nodes in each	NA	NA

Ok, and now let’s do the SAME thing, but with the validation data!

Validation data

Again the same as we did with the training data…

set.seed(1)

net.valid <- neuralnet(Price ~., data = valid.set_norm, hidden = 2)


valid.norm.pred <- predict(object = net.valid, newdata = valid.set_norm[,-1])

head(valid.norm.pred)

##         [,1]
## 1  0.5912632
## 2  0.5211129
## 4  0.5382568
## 9  0.7391759
## 10 0.4411255
## 12 0.9224447

# Let's plot the neural net :
plot(net.valid, rep = "best")

# Applying the denormalization function :

only_price_valid = valid.set.dumm[, 1]

denorm.valid = denormalize(valid.norm.pred, only_price_valid)

head(denorm.valid)

##        [,1]
## 1  16306.62
## 2  14994.81
## 4  15315.40
## 9  19072.59
## 10 13499.05
## 12 22499.71

# Creating a (nx2) data frame with predictions and actual numbers :

df.valid = data.frame("Prediction" = denorm.valid, "Actual" = only_price_valid)


# Now, compute the metrics :

RMSE_valid_1 = postResample(denorm.valid, only_price_valid)

RMSE_valid_1

##        RMSE    Rsquared         MAE 
## 927.6498088   0.9279509 726.5968885

comparing_df[1, 3] <- RMSE_valid_1[1]

pander(comparing_df)

Network_Type	RMSE_train	RMSE_valid
1 layer, 2 nodes	1057	927.6
1 layer, 5 nodes	NA	NA
2 layers, 5 nodes in each	NA	NA

Ok, now we are finally done!

However, we still have to REDO the SAME procedure but changing the number of layers and the number of nodes.

Creating a function + filling-in the data frame

In order to do a cleaner output and avoid a too long code, let’s create TWO functions which will compute the RMSE for each “train” and “valid”.

The functions will have as inputs the number of layers and of nodes, and as output the RMSE.

# Creating the function for the training data :

RMSE.for.train <- function(x) {
   
   set.seed(1)
   
   net.train <-
      neuralnet(Price ~ ., data = train.set_norm, hidden = x)
   
   train.norm.pred <-
      predict(object = net.train, newdata = train.set_norm[, -1])
   
   
   only_price_train = train.set.dumm[, 1]
   
   
   
   denorm.train = denormalize(train.norm.pred, only_price_train)
   
   
   df.train = data.frame("Prediction" = denorm.train, "Actual" = only_price_train)
   
   
   RMSE_train = postResample(denorm.train, only_price_train)
   

   return (RMSE_train[1])
}


# And same for the validation data :

RMSE.for.valid <- function(x) {
   
   set.seed(1)
   
   net.valid <-
      neuralnet(Price ~ ., data = valid.set_norm, hidden = x)
   
   valid.norm.pred <-
      predict(object = net.valid, newdata = valid.set_norm[, -1])
   
   
   only_price_train = valid.set.dumm[, 1]
   
   
   
   denorm.valid = denormalize(valid.norm.pred, only_price_train)
   
   
   df.valid = data.frame("Prediction" = denorm.valid, "Actual" = only_price_valid)
   
   
   RMSE_valid = postResample(denorm.valid, only_price_valid)
   

   return (RMSE_valid[1])
}

And now, let’s fill the `comparing_df` data frame using our function!

# One layer, 5 nodes :

comparing_df[2, 2] <- RMSE.for.train(5)
comparing_df[2, 3] <- RMSE.for.valid(5)

# Two layers, 5 nodes each :

comparing_df[3, 2] <- RMSE.for.train(c(5, 5))
comparing_df[3, 3] <- RMSE.for.valid(c(5, 5))


# The FINAL table!

pander(comparing_df)

Network_Type	RMSE_train	RMSE_valid
1 layer, 2 nodes	1057	927.6
1 layer, 5 nodes	905.9	783.2
2 layers, 5 nodes in each	948.8	815.1

Now we are ready to answer the questions. 👍

i. Training data : what happens to the RMSE as number of nodes and layers increase?

pander(comparing_df)

Network_Type	RMSE_train	RMSE_valid
1 layer, 2 nodes	1057	927.6
1 layer, 5 nodes	905.9	783.2
2 layers, 5 nodes in each	948.8	815.1

We can see that if only the number of nodes increases, the RMSE goes down. However, if the number of layers is the only one to increase, actually the RMSE goes up.

More nodes means that we are better fitting the training data, and thus there is a danger of overfitting! For example, just by increasing the nodes by 3, we achieve a smaller RMSE for the training data (905.9305) than with the (nodes unchanged) validation data (927.6498). This is clearly overfitting!

ii. Validation data : what happens to the RMSE as number of nodes and layers increase?

It happens the same thing as with the training data.

However, we see that in all instances, the validation data was better predicted, because the RMSE is always smaller.

iii. Appropriate number of layers and nodes

pander(comparing_df)

Network_Type	RMSE_train	RMSE_valid
1 layer, 2 nodes	1057	927.6
1 layer, 5 nodes	905.9	783.2
2 layers, 5 nodes in each	948.8	815.1

By sticking to ONLY the different combinations {layers, nodes} that were asked in the prompt, we can see from the data frame that the best combination is {1 hidden layer, 5 nodes}. The reason is that we achieve the smallest RMSE on both validation and training datasets.

Machine Learning with R - Homework 3

Ángel Tomás-Ripoll

17/10/2020

Practical 3

Exercice 1 (6.1 of the book)

Preliminaries

a. Why partitioning into training and validation sets?

This is done because we want to do two things :

So this illustrates the common trade-off between overfitting the training data and predicting “new” data correctly. We do not want to overfit the training data too much, because we want to be able to predict the validation set as good as possible.

b. Multiple Linear Regression

These numbers above are some of the predictions for the prices (in $1000s).

From the results above, we can see that all the coefficients are statistically significant (at 0.1% level, with 3 “stars” *).

Now, before we make any further inferences, it is imperative that we check whether the Linear Regression assumptions are satisfied!

The three most important assumptions that any Linear Regression Model should satisfy are :

Now, let’s verify each assumption, one by one.

1- Linear relationship (using plot matrix)

The relationship between MEDV and RM is quite linear. However, the relationship between MEDV and CRIM does not seem so linear!

Maybe we could transform it by taking the logarithm…

Ok, now the (negative) relationship seems quite linear, although not as strong as the one between MEDV and RM.

Therefore, for our predictions, it would be better to include the variable CRIM by taking the logarithm. Let’s build that model now and use it from now on :

Note that for the variable CHAS, checking linear relationship is nonesense, since it is a binary variable!

2 and 3 : Normality, mean and variance

Although the Q-Q plot does not show good fit on the tails, we could assume that the residuals follow a normal distribution.

In the first plot, we can see that although the fitting line is somewhat curved, the mean is more or less around 0.

Concerning the variance, we can see that the residuals are spread more or less homogenously along the x-axis. So we can state that we have a constant variance and no heteroskedasticity.

Just by the way, there does not seem to be any influencial outlier, since no observation exceeds the Cook’s distance on the last graph.

c. Predicting using the regression model

We just have to use the equation we wrote down before (this time substituting CRIM by log_CRIM, AND replacing the old coefficients by the new ones) and replace the values that are indicated in the prompt :

The predicted price for this house is $20915.3

d. Reducing the number of predictors

i. Which predictors are likely to be measuring the same thing?

From our intuition, it seems that, for instance, AGE and RM may be highly correlated, because surely the older the house is, the more rooms it has. This is because nowadays houses tend to be smaller, since families are not so big and there are many more single people.

Also, CRIM and ZN are positively correlated, because in a more populated area, usually the possibilities to find a criminal are higher.

However, instead of only trusting our intuition, the best thing to do would be a correlation plot!

ii. Correlation table and removing pairs

From this correlation plot, we can identify the following highly correlated pairs (excluding the variables Prediction, CAT..MEDV and MEDV, since we are interested only in explanatory variables) :

The other pais do not seem to be highly correlated (whether negatively or positively). So, we may want to eliminate the variable “DIS”, for instance, since it appears in 3 pairs out of the 5 we exposed.

If we do not eliminate “DIS”, we may suffer from multicollinearity.

But before reaching these conclusions, one should proceed to some kind of variable selection (exhaustive search, or stepwise regressions). This is what we do in the next section.

iii. Using stepwise regressions

First, let’s do a forward selection :

We see here that, starting from a completely “empty” model, 9 variables were added :

This means that, if we wanted to do variable selection, just using these 9 variables will be fine.

Now, it’s the backward selection’s turn!

If I am not mistaken, it seems that the backward selection selected the same predictors as the forward one!

And finally, we are doing both selections :

And again, exactly the same predictors are chosen by the algorithm.

This fact is indeed reassuring! 😄

Now, the only thing we have left is to predict the validation set with the best model from each of the stepwise selections. However, because each selection yielded the SAME predictors, there is ONLY one model to be used to predict the validation set…

So, now let’s run a Multiple Linear Regression using the validation set. We will only be using the 9 predictors we found previously.

These results tell us that all the predictors seem to be statistically significant. However, some are more significant than others, but already a 5% significance level is great, so we keep all the predictors.

Now, to have an idea of the performance of this new model on predicting the validation data compared with the model proposed in the prompt (only CRIM, CHAS and RM), let’s compute some useful metrics :

We can see that the RMSE is smaller when selecting only the 9 variables found in our selection process. So, we may think that this reduced model is the best model.

However, it would be interesting to compare this reduced (and supposedly better) model with the FULL model (all variables included).

Actually, much to our surprise, we see that the error metrics (we are focusing on ME, RMSE, MAPE and MAE) are almost the same between the reduced and the full model! However, it is true that they are smaller for the reduced model (only 9 predictors).

This means that using the full model (with all predictors) and the reduced model (selected by running selection algorithms) won’t make a huge difference, actually… And this is quite counterintuitive! 😱

Conclusion

Because the stepwise selections do NOT guarantee to find the “best” model, it is absolutely important that we verify the predictions on the validation data.

Exercice 2 (9.3 of the book)

Preliminaries (importing the dataset & exploratory analysis)

Preprocessing (dividing into training and validation)

a. Running the regression tree

Let’s now run the tree!

i. Which are the three or four most important variables?

In the previous table, the higher the values are, the more important are the variables in contributing to the model.

Therefore, we can see that (after having sorted the vector), the four most important car specifications are : “Age”, “Automatic Air Conditioning”, “KM”, and “Quarterly Tax”.

ii. Compare the RMSE for training and validation sets

First, let’s concentrate on the training set.

And now, let’s turn to the validation set.

Since the RMSE measures the standard deviation of the residuals, it shows how well the predictions were. The smaller the RMSE, the better for the model in terms of predictive performance.

Now, let’s plot the two boxplots!

iii. How to achieve better VALIDATION predictive performance?

This question can be restated as follows : how to address overfitting?

Because growing a tree to its entirety (or with a very small complexity parameter, as was the case in section a.) causes the training data to be very well predicted, thus overfitting occurs!

There is one very effective way to avoid overfitting, and thus getting better at predicting the validation data at the expense of the training data : PRUNING.

This consists of (using ONLY validation data) trimming the entirely-grown tree until error starts rising.

iv. Creating a less deep tree, and comparing RMSE

Now, let’s again compute the RMSE for the validation data, BUT this time, on the smaller tree.

At this point we may be discouraged… because actually we found a RMSE that is quite bigger that the one found previously. 😞

However, we should not despair, because there is one step missing : we must ALSO compute the RMSE for the TRAINING set! If this RMSE for the training set is smaller than the one of the validation set, then we have a problem, because it means that our pruning was not successful.

So, as we can see, `RMSE_train_small` is bigger than `RMSE_valid_small`, which is logical since we pruned the tree. Now the tree is smaller and it does not overfit the training data as much as before.

Let’s do it with the function `postResample`:

And now, let’s fill the `comparing_df` data frame using our function!