Introduction

The feature selection is the process that choose a reduced number of explanatory variable to describe a response variable. The main reasons why feature selection is used are:

  • Make the model easier to interpret, removing variables that are redundant and do not add any information;
  • Reduce the size of the problem to enable algorithms to work faster, making it possible to handle with high-dimensional data
  • Reduce overfitting.

The dataset used in this case study contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. The 506 observations in this study was originally published by Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978.The price of house in Boston at 1978 is predicted using regression models based on different advanced techniques and the accuracy of different models are compared. The predictions are made based on thirteen environmental and socio-economic variables.

Variable Nature.of.variable Type Comments
medv Housing value - Response variable/ variable of interest Numerical Median value of owner-occupied homes
rm Housing structure related Numerical Average number of rooms in owner units
age Housing structure related Numerical Proportion of units built prior to 1940
black Neighborhood related Numerical 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat Neighborhood related Numerical Proportion of neighboring population that is of lower status
crim Neighborhood related Numerical Crime rate by town
zn Neighborhood related Numerical Proportion of land zoned for commercial construction
indus Neighborhood related Numerical Proportion of industrial area acres
tax Neighborhood related Numerical Full value property tax rate
ptratio Neighborhood related Numerical Pupil-teacher ratio by town district
chas Neighborhood related Integer (Binary) 1 if land tract bounds Charles River; 0 otherwise
dis Accessibility related Numerical Weighted distances to employment centres
rad Accessibility related Integer Index of highway accessibility
nox Air pollution related Numerical Nitrogen oxide concentrations in pphm

Packages

  • pacman: To load packages and install missing ones
  • MASS: To load Boston data set and stepAIC function
  • dplyr: Package for data manipulation
  • leaps: For best subset variable selection
  • glmnet: Fits linear models or generalized linear model penalizing the maximum likelihood with both the LASSO method and the Ridge Regression
  • corrplot: To plot the correlation of variables.
if (!require("pacman")) install.packages("pacman")
# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("MASS", "leaps", "dplyr", "glmnet", "corrplot")

EDA

The values in corrplot function denotes the extent of linear correlation between different variables. If the value is faded it denotes that there is minimal linear relationship between the variables.

correlation = cor(Boston)
corrplot(correlation, type = "upper", method = "number")

The moderate positive correlation between median house price and Average number of rooms in owner units, moderarte negative linear correlation between Proportion of neighboring population that is of lower status and median house price can be seen clearly in the scatter plots

par(mfrow=c(2,3))
plot(Boston$rm, Boston$medv )
plot(Boston$age, Boston$medv)
plot(Boston$black, Boston$medv)
plot(Boston$lstat, Boston$medv )
plot(Boston$crim, Boston$medv)
plot(Boston$zn, Boston$medv)

plot(Boston$indus, Boston$medv )
plot(Boston$tax, Boston$medv)
plot(Boston$ptratio, Boston$medv)
plot(Boston$dis, Boston$medv )
plot(Boston$rad, Boston$medv)
plot(Boston$nox, Boston$medv)

Best subset method

The regsubset function from leaps package provides sets of candidate models of different sizes. Using the parameter nbest , we can specify how many models of each size are to be kept in the result object. If we need one or more variables to be included/excluded in all models, we can specify this condition using the force.in / force.out parameter. Variables are specified not by name but by order in the formula.

In the summary, models are listed in the order of size which is denoted in the first column of the matrix. Within size, the models are listed in order of fit (best model first). TRUE or FALSE indicates if the variable was selected in the model. If matrix.logical parameter is specified as TRUE in summary function, then the included variables are indicated by asterisks in quotes and those not included are represented using empty quotes.

model_subset <- regsubsets(medv ~ . ,data = Boston, nbest = 2 , nvmax = 13)
summary(model_subset, matrix.logical = TRUE)$which
   (Intercept)  crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio black lstat
1         TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE  TRUE
1         TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE   FALSE FALSE FALSE
2         TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE   FALSE FALSE  TRUE
2         TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    TRUE FALSE  TRUE
3         TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE    TRUE FALSE  TRUE
3         TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE   FALSE FALSE  TRUE
4         TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE    TRUE FALSE  TRUE
4         TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE    TRUE  TRUE  TRUE
5         TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE FALSE  TRUE
5         TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE  TRUE
6         TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE FALSE  TRUE
6         TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE  TRUE
7         TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE  TRUE
7         TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE FALSE  TRUE
8         TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE    TRUE  TRUE  TRUE
8         TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE    TRUE  TRUE  TRUE
9         TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE    TRUE  TRUE  TRUE
9         TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE    TRUE  TRUE  TRUE
10        TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE    TRUE  TRUE  TRUE
10        TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE    TRUE  TRUE  TRUE
11        TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE    TRUE  TRUE  TRUE
11        TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE    TRUE  TRUE  TRUE
12        TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE    TRUE  TRUE  TRUE
12        TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE  TRUE
13        TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    TRUE  TRUE  TRUE

By plotting the models, we can find the best model. For simplicity of plotting only one model of each size is created. The numbers on the left margin are the values of Schwartz’ Bayesian Information Criterion. Each row is a model and the shaded rectangle indicate that the variable is included in the model. The darkness of the shading simply represents the ordering of the BIC values.

model_subset <- regsubsets(medv ~ . ,data = Boston, nvmax = 13)
plot(model_subset)

For a model to have good predictive power, we prefer high adjusted r square value and low bic values.

# bic value of each model
summary(model_subset)$bic
 [1] -385.0521 -496.2582 -549.4767 -561.9884 -585.6823 -592.9553 -598.2295 -600.1663 -600.5767 -603.9917
[11] -608.0353 -601.9237 -595.7000
# Finding the model with minimum bic
which(summary(model_subset)$bic == min(summary(model_subset)$bic))
[1] 11
# adjusted r square value of each model
summary(model_subset)$adjr2
 [1] 0.5432418 0.6371245 0.6767036 0.6878351 0.7051702 0.7123567 0.7182560 0.7222072 0.7252743 0.7299149
[11] 0.7348058 0.7343282 0.7337897
# Finding the model with maximum adjusted r square
which(summary(model_subset)$adjr2 == max(summary(model_subset)$adjr2))
[1] 11

We can see that model number 11 (indus and age are excluded) is found to be the best model. As per this selection method, we consider Proportion of units built prior to 1940 and Proportion of industrial area acres to have no predictive influence on the median house price.

Step AIC method

AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. k is the multiple of the number of degrees of freedom used for the penalty. Only k = 2 gives the genuine AIC method of variable selection.

The step function can be used to perform variable selection. The mode of stepwise search, can be one of “both”, “backward”, or “forward”, with a default of “both”. If the scope argument is missing the default for direction is “backward”.

  • forward : This tells R to start with the null model and search through models lying in the range between the null and full model using the forward selection algorithm. In this approach, one adds variables to the model one at a time. At each step, each variable that is not already in the model is tested for inclusion in the model. The most significant of these variables is added to the model, so long as it’s P-value is below some pre-set level.

  • backward : This tells R to start with the full model and search through models lying in the range between the null and full model using the backward selection algorithm. Then the least significant variable is dropped, so long as it is not significant at our chosen critical level. We continue by successively re-fitting reduced models and applying the same rule until all remaining variables are statistically significant.

  • both : This method works by dropping or adding the one variable that leads to the best AIC improvement (smallest AIC)

The variables age and indus are eliminated in this method. This means that in the final model , we assume that Proportion of units built prior to 1940 and Proportion of industrial area acres to have no predictive influence on the median house price.

null_model <- lm(medv ~ 1 , data = Boston)
full_model <- lm(medv ~ . , data = Boston)
AIC_model  <- step( null_model, scope = list( lower = null_model, upper = full_model), k = 2, direction = "forward")
Start:  AIC=2246.51
medv ~ 1

          Df Sum of Sq   RSS    AIC
+ lstat    1   23243.9 19472 1851.0
+ rm       1   20654.4 22062 1914.2
+ ptratio  1   11014.3 31702 2097.6
+ indus    1    9995.2 32721 2113.6
+ tax      1    9377.3 33339 2123.1
+ nox      1    7800.1 34916 2146.5
+ crim     1    6440.8 36276 2165.8
+ rad      1    6221.1 36495 2168.9
+ age      1    6069.8 36647 2171.0
+ zn       1    5549.7 37167 2178.1
+ black    1    4749.9 37966 2188.9
+ dis      1    2668.2 40048 2215.9
+ chas     1    1312.1 41404 2232.7
<none>                 42716 2246.5

Step:  AIC=1851.01
medv ~ lstat

          Df Sum of Sq   RSS    AIC
+ rm       1    4033.1 15439 1735.6
+ ptratio  1    2670.1 16802 1778.4
+ chas     1     786.3 18686 1832.2
+ dis      1     772.4 18700 1832.5
+ age      1     304.3 19168 1845.0
+ tax      1     274.4 19198 1845.8
+ black    1     198.3 19274 1847.8
+ zn       1     160.3 19312 1848.8
+ crim     1     146.9 19325 1849.2
+ indus    1      98.7 19374 1850.4
<none>                 19472 1851.0
+ rad      1      25.1 19447 1852.4
+ nox      1       4.8 19468 1852.9

Step:  AIC=1735.58
medv ~ lstat + rm

          Df Sum of Sq   RSS    AIC
+ ptratio  1   1711.32 13728 1678.1
+ chas     1    548.53 14891 1719.3
+ black    1    512.31 14927 1720.5
+ tax      1    425.16 15014 1723.5
+ dis      1    351.15 15088 1725.9
+ crim     1    311.42 15128 1727.3
+ rad      1    180.45 15259 1731.6
+ indus    1     61.09 15378 1735.6
<none>                 15439 1735.6
+ zn       1     56.56 15383 1735.7
+ age      1     20.18 15419 1736.9
+ nox      1     14.90 15424 1737.1

Step:  AIC=1678.13
medv ~ lstat + rm + ptratio

        Df Sum of Sq   RSS    AIC
+ dis    1    499.08 13229 1661.4
+ black  1    389.68 13338 1665.6
+ chas   1    377.96 13350 1666.0
+ crim   1    122.52 13606 1675.6
+ age    1     66.24 13662 1677.7
<none>               13728 1678.1
+ tax    1     44.36 13684 1678.5
+ nox    1     24.81 13703 1679.2
+ zn     1     14.96 13713 1679.6
+ rad    1      6.07 13722 1679.9
+ indus  1      0.83 13727 1680.1

Step:  AIC=1661.39
medv ~ lstat + rm + ptratio + dis

        Df Sum of Sq   RSS    AIC
+ nox    1    759.56 12469 1633.5
+ black  1    502.64 12726 1643.8
+ chas   1    267.43 12962 1653.1
+ indus  1    242.65 12986 1654.0
+ tax    1    240.34 12989 1654.1
+ crim   1    233.54 12995 1654.4
+ zn     1    144.81 13084 1657.8
+ age    1     61.36 13168 1661.0
<none>               13229 1661.4
+ rad    1     22.40 13206 1662.5

Step:  AIC=1633.47
medv ~ lstat + rm + ptratio + dis + nox

        Df Sum of Sq   RSS    AIC
+ chas   1    328.27 12141 1622.0
+ black  1    311.83 12158 1622.7
+ zn     1    151.71 12318 1629.3
+ crim   1    141.43 12328 1629.7
+ rad    1     53.48 12416 1633.3
<none>               12469 1633.5
+ indus  1     17.10 12452 1634.8
+ tax    1     10.50 12459 1635.0
+ age    1      0.25 12469 1635.5

Step:  AIC=1621.97
medv ~ lstat + rm + ptratio + dis + nox + chas

        Df Sum of Sq   RSS    AIC
+ black  1   272.837 11868 1612.5
+ zn     1   164.406 11977 1617.1
+ crim   1   116.330 12025 1619.1
+ rad    1    58.556 12082 1621.5
<none>               12141 1622.0
+ indus  1    26.274 12115 1622.9
+ tax    1     4.187 12137 1623.8
+ age    1     2.331 12139 1623.9

Step:  AIC=1612.47
medv ~ lstat + rm + ptratio + dis + nox + chas + black

        Df Sum of Sq   RSS    AIC
+ zn     1   189.936 11678 1606.3
+ rad    1   144.320 11724 1608.3
+ crim   1    55.633 11813 1612.1
<none>               11868 1612.5
+ indus  1    15.584 11853 1613.8
+ age    1     9.446 11859 1614.1
+ tax    1     2.703 11866 1614.4

Step:  AIC=1606.31
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn

        Df Sum of Sq   RSS    AIC
+ crim   1    94.712 11584 1604.2
+ rad    1    93.614 11585 1604.2
<none>               11678 1606.3
+ indus  1    16.048 11662 1607.6
+ tax    1     3.952 11674 1608.1
+ age    1     1.491 11677 1608.2

Step:  AIC=1604.19
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim

        Df Sum of Sq   RSS    AIC
+ rad    1   228.604 11355 1596.1
<none>               11584 1604.2
+ indus  1    15.773 11568 1605.5
+ age    1     2.470 11581 1606.1
+ tax    1     1.305 11582 1606.1

Step:  AIC=1596.1
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim + rad

        Df Sum of Sq   RSS    AIC
+ tax    1   273.619 11081 1585.8
<none>               11355 1596.1
+ indus  1    33.894 11321 1596.6
+ age    1     0.096 11355 1598.1

Step:  AIC=1585.76
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
    crim + rad + tax

        Df Sum of Sq   RSS    AIC
<none>               11081 1585.8
+ indus  1   2.51754 11079 1587.7
+ age    1   0.06271 11081 1587.8
summary(AIC_model)

Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
    black + zn + crim + rad + tax, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
black         0.009291   0.002674   3.475 0.000557 ***
zn            0.045845   0.013523   3.390 0.000754 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
AIC_model$anova

Step BIC method

BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. k is the multiple of the number of degrees of freedom used for the penalty. k = log(n) gives the BIC method of variable selection.

The step function can be used to perform variable selection. The mode of stepwise search, can be one of “both”, “backward”, or “forward”, with a default of “both”. If the scope argument is missing the default for direction is “backward”.

  • forward : This tells R to start with the null model and search through models lying in the range between the null and full model using the forward selection algorithm. In this approach, one adds variables to the model one at a time. At each step, each variable that is not already in the model is tested for inclusion in the model. The most significant of these variables is added to the model, so long as it’s P-value is below some pre-set level.

  • backward : This tells R to start with the full model and search through models lying in the range between the null and full model using the backward selection algorithm. Then the least significant variable is dropped, so long as it is not significant at our chosen critical level. We continue by successively re-fitting reduced models and applying the same rule until all remaining variables are statistically significant.

  • both : This method works by dropping or adding the one variable that leads to the best AIC improvement (smallest BIC)

The variables crim, rad, tax, age and indus are eliminated in this method. This means that in the final model , we assume that Crime rate by town, Index of highway accessibility, Full value property tax rate, Proportion of units built prior to 1940 and Proportion of industrial area acres to have no predictive influence on the median house price.

The AIC value of final model obtained in step AIC method is lower than step BIC method. This is because AIC chooses complex models than BIC.

null_model <- lm(medv ~ 1 , data = Boston)
full_model <- lm(medv ~ . , data = Boston)
BIC_model  <- step( null_model, scope = list( lower = null_model, upper = full_model), k = log(nrow(Boston)), direction = "forward")
Start:  AIC=2250.74
medv ~ 1

          Df Sum of Sq   RSS    AIC
+ lstat    1   23243.9 19472 1859.5
+ rm       1   20654.4 22062 1922.6
+ ptratio  1   11014.3 31702 2106.1
+ indus    1    9995.2 32721 2122.1
+ tax      1    9377.3 33339 2131.6
+ nox      1    7800.1 34916 2154.9
+ crim     1    6440.8 36276 2174.3
+ rad      1    6221.1 36495 2177.3
+ age      1    6069.8 36647 2179.4
+ zn       1    5549.7 37167 2186.6
+ black    1    4749.9 37966 2197.3
+ dis      1    2668.2 40048 2224.3
+ chas     1    1312.1 41404 2241.2
<none>                 42716 2250.7

Step:  AIC=1859.46
medv ~ lstat

          Df Sum of Sq   RSS    AIC
+ rm       1    4033.1 15439 1748.3
+ ptratio  1    2670.1 16802 1791.1
+ chas     1     786.3 18686 1844.8
+ dis      1     772.4 18700 1845.2
+ age      1     304.3 19168 1857.7
+ tax      1     274.4 19198 1858.5
<none>                 19472 1859.5
+ black    1     198.3 19274 1860.5
+ zn       1     160.3 19312 1861.5
+ crim     1     146.9 19325 1861.9
+ indus    1      98.7 19374 1863.1
+ rad      1      25.1 19447 1865.0
+ nox      1       4.8 19468 1865.6

Step:  AIC=1748.26
medv ~ lstat + rm

          Df Sum of Sq   RSS    AIC
+ ptratio  1   1711.32 13728 1695.0
+ chas     1    548.53 14891 1736.2
+ black    1    512.31 14927 1737.4
+ tax      1    425.16 15014 1740.3
+ dis      1    351.15 15088 1742.8
+ crim     1    311.42 15128 1744.2
<none>                 15439 1748.3
+ rad      1    180.45 15259 1748.5
+ indus    1     61.09 15378 1752.5
+ zn       1     56.56 15383 1752.6
+ age      1     20.18 15419 1753.8
+ nox      1     14.90 15424 1754.0

Step:  AIC=1695.04
medv ~ lstat + rm + ptratio

        Df Sum of Sq   RSS    AIC
+ dis    1    499.08 13229 1682.5
+ black  1    389.68 13338 1686.7
+ chas   1    377.96 13350 1687.1
<none>               13728 1695.0
+ crim   1    122.52 13606 1696.7
+ age    1     66.24 13662 1698.8
+ tax    1     44.36 13684 1699.6
+ nox    1     24.81 13703 1700.3
+ zn     1     14.96 13713 1700.7
+ rad    1      6.07 13722 1701.0
+ indus  1      0.83 13727 1701.2

Step:  AIC=1682.53
medv ~ lstat + rm + ptratio + dis

        Df Sum of Sq   RSS    AIC
+ nox    1    759.56 12469 1658.8
+ black  1    502.64 12726 1669.2
+ chas   1    267.43 12962 1678.4
+ indus  1    242.65 12986 1679.4
+ tax    1    240.34 12989 1679.5
+ crim   1    233.54 12995 1679.7
<none>               13229 1682.5
+ zn     1    144.81 13084 1683.2
+ age    1     61.36 13168 1686.4
+ rad    1     22.40 13206 1687.9

Step:  AIC=1658.83
medv ~ lstat + rm + ptratio + dis + nox

        Df Sum of Sq   RSS    AIC
+ chas   1    328.27 12141 1651.6
+ black  1    311.83 12158 1652.2
<none>               12469 1658.8
+ zn     1    151.71 12318 1658.9
+ crim   1    141.43 12328 1659.3
+ rad    1     53.48 12416 1662.9
+ indus  1     17.10 12452 1664.4
+ tax    1     10.50 12459 1664.6
+ age    1      0.25 12469 1665.0

Step:  AIC=1651.56
medv ~ lstat + rm + ptratio + dis + nox + chas

        Df Sum of Sq   RSS    AIC
+ black  1   272.837 11868 1646.3
+ zn     1   164.406 11977 1650.9
<none>               12141 1651.6
+ crim   1   116.330 12025 1652.9
+ rad    1    58.556 12082 1655.3
+ indus  1    26.274 12115 1656.7
+ tax    1     4.187 12137 1657.6
+ age    1     2.331 12139 1657.7

Step:  AIC=1646.28
medv ~ lstat + rm + ptratio + dis + nox + chas + black

        Df Sum of Sq   RSS    AIC
+ zn     1   189.936 11678 1644.3
<none>               11868 1646.3
+ rad    1   144.320 11724 1646.3
+ crim   1    55.633 11813 1650.1
+ indus  1    15.584 11853 1651.8
+ age    1     9.446 11859 1652.1
+ tax    1     2.703 11866 1652.4

Step:  AIC=1644.35
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn

        Df Sum of Sq   RSS    AIC
<none>               11678 1644.3
+ crim   1    94.712 11584 1646.5
+ rad    1    93.614 11585 1646.5
+ indus  1    16.048 11662 1649.9
+ tax    1     3.952 11674 1650.4
+ age    1     1.491 11677 1650.5
summary(BIC_model)

Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
    black + zn, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6996  -2.7925  -0.5477   1.7005  27.6510 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.316950   4.870856   6.224 1.03e-09 ***
lstat        -0.543125   0.047652 -11.398  < 2e-16 ***
rm            4.116082   0.408594  10.074  < 2e-16 ***
ptratio      -0.881851   0.115718  -7.621 1.29e-13 ***
dis          -1.382714   0.187604  -7.370 7.15e-13 ***
nox         -16.687428   3.228873  -5.168 3.43e-07 ***
chas          3.111062   0.870076   3.576 0.000384 ***
black         0.009404   0.002639   3.563 0.000401 ***
zn            0.037808   0.013298   2.843 0.004652 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.847 on 497 degrees of freedom
Multiple R-squared:  0.7266,    Adjusted R-squared:  0.7222 
F-statistic: 165.1 on 8 and 497 DF,  p-value: < 2.2e-16
AIC(BIC_model)
[1] 3044.275
AIC(AIC_model)
[1] 3023.726
BIC_model$anova

LASSO method

Formulated by Robert Tibshirani, LASSO - Least Absolute Shrinkage and Selection Operator is a powerful method that perform two main tasks: regularization and feature selection. The LASSO method tries to reduce the sum of the absolute values of the model parameters below a fixed value (upper bound). In order to do so the method apply a shrinking (regularization) process where it penalizes the coefficients of the regression variables shrinking some of them to zero. Variables that still have a non-zero coefficient after the shrinking process are selected to be part of the model. The goal of this process is to minimize the prediction error.

Tuning parameter λ, controls the strength of the penalty. The larger is the parameter λ the more number of coefficients are shrinked to zero. On the other hand if λ = 0 we have an OLS (Ordinary Least Sqaure) regression. The aplha value chooses between LASSO and ridge methods (α = 1 LASSO, α = 0 Ridge Regularization).

The x-axis shows different values of the λ parameter. Each line represent one of the explanatory variable and its role in the model. In the plots we can see when each variable entered in the model and to which extent they influenced the response variable. We can see that the variables 13 and 6 has the highest influence, because it enters the model first, and steadily affect the response variable

Boston_X_std <- scale(select(Boston,-medv))
Boston_X     <- as.matrix(Boston_X_std)
Boston_Y     <- Boston$medv
lasso_fit    <- glmnet( x = Boston_X, y = Boston_Y, family = "gaussian", alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE)

The R function cv.glmnet allows the user to select the most appropriate value for λ, using nfold cross validation. Smallest value allowed for nfolds is 3. The first vertical line gives λmin, minimum mean cross-validated error and second vertical dotted line gives a model such that error is within one standard error of the minimum( λ1se ). Below shows the mean square error of predictions using λmin and λ1se.

cv_lasso  <- cv.glmnet( x = Boston_X, y = Boston_Y, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv_lasso)

#lambda for minimum mean cross-validated error
cv_lasso$lambda.min
[1] 0.02118502
#lambda for error that is within one standard error of the minimum
cv_lasso$lambda.1se
[1] 0.3145908
pred_lambda_min <- predict(lasso_fit, newx = Boston_X, s=cv.lasso$lambda.min)
mean((Boston_Y-pred_lambda_min)^2)
[1] 21.89822
pred_lambda_1se <- predict(lasso_fit, newx = Boston_X, s=cv.lasso$lambda.1se)
mean((Boston_Y-pred_lambda_1se)^2)
[1] 21.95768
---
title: "Study of variable selection methods"
output: html_notebook
---
```{r setup, include=FALSE}

knitr::opts_chunk$set(cache = TRUE, 
                      #suppress any R warnings from being included in the final document
                      warning = FALSE, 
                      #suppress any R messages from being included in the final document
                      message = FALSE, 
                      cache.lazy = FALSE,
                      fig.width=3, fig.height=3 )
```

#{.tabset .tabset-fade}

## Introduction

The feature selection is the process that choose a reduced number of explanatory variable to describe a response variable. The main reasons why feature selection is used are:

* Make the model easier to interpret, removing variables that are redundant and do not add any information;
* Reduce the size of the problem to enable algorithms to work faster, making it possible to handle with high-dimensional data
* Reduce overfitting.

The [dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) used in this case study contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. The 506 observations in this study was originally published by Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.The price of house in Boston at 1978 is predicted using regression models based on different advanced techniques and the accuracy of different models are compared. The predictions are made based on thirteen environmental and socio-economic variables.


```{r data_dictionary, echo = FALSE }

library(knitr)
dataDic <- read.table("Boston.txt", sep ="\t", header = TRUE)
kable(dataDic)

```


## Packages

* [pacman](https://cran.r-project.org/web/packages/pacman/vignettes/Introduction_to_pacman.html): To load packages and install missing ones
* [MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf): To load Boston data set and stepAIC function
* [dplyr](https://cran.r-project.org/web/packages/DT/index.html): Package for data manipulation
* [leaps](https://cran.r-project.org/web/packages/leaps/leaps.pdf): For best subset variable selection
* [glmnet](https://cran.r-project.org/web/packages/glmnet/index.html): Fits linear models or generalized linear model penalizing the maximum likelihood with both the LASSO method and the Ridge Regression
* [corrplot](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html): To plot the correlation of variables.

```{r libraries }

if (!require("pacman")) install.packages("pacman")

# p_load function installs missing packages and loads all the packages given as input
pacman::p_load("MASS", "leaps", "dplyr", "glmnet", "corrplot")


```

## EDA

The values in [corrplot function](https://www.rdocumentation.org/packages/arm/versions/1.9-3/topics/corrplot) denotes the extent of linear correlation between different variables. If the value is faded it denotes that there is minimal linear relationship between the variables.

```{r corplot }

correlation = cor(Boston)
corrplot(correlation, type = "upper", method = "number")

```

The moderate positive correlation between median house price and Average number of rooms in owner units, moderarte negative linear correlation between Proportion of neighboring population that is of lower status and median house price can be seen clearly in the scatter plots 

```{r scatterplot}
par(mfrow=c(2,3))
plot(Boston$rm, Boston$medv )
plot(Boston$age, Boston$medv)
plot(Boston$black, Boston$medv)
plot(Boston$lstat, Boston$medv )
plot(Boston$crim, Boston$medv)
plot(Boston$zn, Boston$medv)
plot(Boston$indus, Boston$medv )
plot(Boston$tax, Boston$medv)
plot(Boston$ptratio, Boston$medv)
plot(Boston$dis, Boston$medv )
plot(Boston$rad, Boston$medv)
plot(Boston$nox, Boston$medv)
```
## Best subset method

The [regsubset](https://www.rdocumentation.org/packages/leaps/versions/2.1-1/topics/regsubsets) function from [leaps package](https://cran.r-project.org/web/packages/leaps/leaps.pdf) provides sets of candidate models of different sizes. Using the parameter *nbest* , we can specify how many models of each size are to be kept in the
result object. If we need one or more variables to be included/excluded in all models, we can specify this condition using the *force.in / force.out* parameter. Variables are specified not by name but by order in the formula.

In the summary, models are listed in the order of size which is denoted in the first column of the matrix. Within size, the models are listed in order of fit (best model first). TRUE or FALSE indicates if the variable was selected in the model. If *matrix.logical* parameter is specified as TRUE in summary function, then the included variables are indicated by asterisks in quotes and those not included are represented using empty quotes.

```{r best_subset }

model_subset <- regsubsets(medv ~ . ,data = Boston, nbest = 2 , nvmax = 13)
summary(model_subset, matrix.logical = TRUE)$which

```
By plotting the models, we can find the best model. For simplicity of plotting only one model of each size is created. The numbers on the left margin are the values of [Schwartz’ Bayesian Information Criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion). Each row is a model and the shaded rectangle indicate that the variable is included in the model. The darkness of the shading simply represents the ordering of the BIC values.

```{r plot_best_subset }

model_subset <- regsubsets(medv ~ . ,data = Boston, nvmax = 13)
plot(model_subset)

```

For a model to have good predictive power, we prefer high adjusted r square value and low bic values. 

```{r best_subset_values }

# bic value of each model
summary(model_subset)$bic

# Finding the model with minimum bic
which(summary(model_subset)$bic == min(summary(model_subset)$bic))

# adjusted r square value of each model
summary(model_subset)$adjr2

# Finding the model with maximum adjusted r square
which(summary(model_subset)$adjr2 == max(summary(model_subset)$adjr2))
```

We can see that model number 11 (indus and age are excluded) is found to be the best model. As per this selection method, we consider Proportion of units built prior to 1940 and Proportion of industrial area acres to have no predictive influence on the median house price.

## Step AIC method

[AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion) is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. k is the multiple of the number of degrees of freedom used for the penalty. Only **k = 2** gives the genuine AIC method of variable selection. 

The [step function](https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/step) can be used to perform variable selection. The mode of stepwise search, can be one of "both", "backward", or "forward", with a default of "both". If the scope argument is missing the default for direction is "backward".


* forward : This tells R to start with the null model and search through models lying in the range between the null and full model using the [forward selection algorithm](https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node41.html). In this approach, one adds variables to the model one at a time. At each step, each variable that is not already in the model is tested for inclusion in the model. The most significant of these variables is added to the model, so long as it's P-value is below some pre-set level.

* backward : This tells R to start with the full model and search through models lying in the range between the null and full model using the [backward selection algorithm](https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node42.html). Then the least significant variable is dropped, so long as it is not significant at our chosen critical level. We continue by successively re-fitting reduced models and applying the same rule until all remaining variables are statistically significant.

* both : This method works by dropping or adding the one variable that leads to the best AIC improvement (smallest AIC)


The variables age and indus are eliminated in this method. This means that in the final model , we assume that Proportion of units built prior to 1940 and Proportion of industrial area acres to have no predictive influence on the median house price. 

```{r step_AIC }

null_model <- lm(medv ~ 1 , data = Boston)
full_model <- lm(medv ~ . , data = Boston)
AIC_model  <- step( null_model, scope = list( lower = null_model, upper = full_model), k = 2, direction = "forward")
summary(AIC_model)
AIC_model$anova
```

## Step BIC method

[BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion) is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. k is the multiple of the number of degrees of freedom used for the penalty. **k = log(n)** gives the BIC method of variable selection. 


The [step function](https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/step) can be used to perform variable selection. The mode of stepwise search, can be one of "both", "backward", or "forward", with a default of "both". If the scope argument is missing the default for direction is "backward".

* forward : This tells R to start with the null model and search through models lying in the range between the null and full model using the [forward selection algorithm](https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node41.html). In this approach, one adds variables to the model one at a time. At each step, each variable that is not already in the model is tested for inclusion in the model. The most significant of these variables is added to the model, so long as it's P-value is below some pre-set level.

* backward : This tells R to start with the full model and search through models lying in the range between the null and full model using the [backward selection algorithm](https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node42.html). Then the least significant variable is dropped, so long as it is not significant at our chosen critical level. We continue by successively re-fitting reduced models and applying the same rule until all remaining variables are statistically significant.

* both : This method works by dropping or adding the one variable that leads to the best AIC improvement (smallest BIC)

The variables crim, rad, tax, age and indus are eliminated in this method. This means that in the final model , we assume that Crime rate by town, Index of highway accessibility, Full value property tax rate, Proportion of units built prior to 1940 and Proportion of industrial area acres to have no predictive influence on the median house price. 

The AIC value of final model obtained in step AIC method is lower than step BIC method. This is because AIC chooses complex models than BIC.

```{r step_BIC }

null_model <- lm(medv ~ 1 , data = Boston)
full_model <- lm(medv ~ . , data = Boston)
BIC_model  <- step( null_model, scope = list( lower = null_model, upper = full_model), k = log(nrow(Boston)), direction = "forward")

summary(BIC_model)

AIC(BIC_model)
AIC(AIC_model)

BIC_model$anova
```

## LASSO method


Formulated by [Robert Tibshirani](https://statweb.stanford.edu/~tibs/), LASSO - [*Least Absolute Shrinkage and Selection Operator*](https://en.wikipedia.org/wiki/Lasso_(statistics))  is a powerful method that perform two main tasks: regularization and feature selection. The LASSO method tries to reduce the sum of the absolute values of the model parameters below a fixed value (upper bound). In order to do so the method apply a shrinking (regularization) process where it penalizes the coefficients of the regression variables shrinking some of them to zero. Variables that still have a non-zero coefficient after the shrinking process are selected to be part of the model. The goal of this process is to minimize the prediction error.

Tuning parameter λ, controls the strength of the penalty. The larger is the parameter λ the more number of coefficients are shrinked to zero. On the other hand if λ = 0 we have an OLS (Ordinary Least Sqaure) regression. The aplha value chooses between LASSO and ridge methods (α = 1 LASSO, α = 0 Ridge Regularization).


The x-axis shows different values of the λ parameter. Each line represent one of the explanatory variable and its role in the model. In the plots we can see when each variable entered in the model and to which extent they influenced the response variable. We can see that the variables 13 and 6 has the highest influence,  because it enters the model first, and steadily affect the response variable

```{r LASSO }

Boston_X_std <- scale(select(Boston,-medv))
Boston_X     <- as.matrix(Boston_X_std)
Boston_Y     <- Boston$medv

lasso_fit    <- glmnet( x = Boston_X, y = Boston_Y, family = "gaussian", alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE)

```

The R function [cv.glmnet](https://www.rdocumentation.org/packages/glmnet/versions/2.0-12/topics/cv.glmnet) allows the user to select the most appropriate value for λ, using nfold cross validation. Smallest value allowed for nfolds is 3. The first vertical line gives λmin, minimum mean cross-validated error and second vertical dotted line gives a model such that error is within one standard error of the minimum( λ1se ). Below shows the mean square error of predictions using λmin and λ1se.

```{r LASSO_cv }

cv_lasso  <- cv.glmnet( x = Boston_X, y = Boston_Y, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv_lasso)

#lambda for minimum mean cross-validated error
cv_lasso$lambda.min

#lambda for error that is within one standard error of the minimum
cv_lasso$lambda.1se


```


```{r LASSO_mse }

pred_lambda_min <- predict(lasso_fit, newx = Boston_X, s=cv.lasso$lambda.min)
mean((Boston_Y-pred_lambda_min)^2)

pred_lambda_1se <- predict(lasso_fit, newx = Boston_X, s=cv.lasso$lambda.1se)
mean((Boston_Y-pred_lambda_1se)^2)

```