Project#4-Regularization Methods for Multiple Regression

6.1 Predicting Boston Housing Prices. The file BostonHousing.csv contains information collected by the US Bureau of the Census concerning housing in the area of Boston, Massachusetts. The dataset includes information on 506 census housing tracts in the Boston area. The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms. The dataset contains 13 predictors, and the response is the median house price (MEDV). Table 6.9 describes each of the predictors and the response.

Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for?

We want to partition the data into a training and validation set because in data mining we are more concerned with predicting and for predicting we need to train our model and test it to see how well it performs. The training set will be used to build our model while the validation set will be used to test how well our model performs.

Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model.

b.boston <- boston.df[,-14] #we don't need CAT.MEDV
reduced.boston <- boston.df %>% select(c("MEDV","CRIM","CHAS","RM"))
b.lm <- lm(MEDV~.,data = b.boston)
lm <- lm(MEDV ~ ., data = reduced.boston)

summary(lm)

## 
## Call:
## lm(formula = MEDV ~ ., data = reduced.boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.829  -2.968  -0.415   2.433  38.945 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -28.81068    2.56331 -11.240  < 2e-16 ***
## CRIM         -0.26072    0.03274  -7.964 1.12e-14 ***
## CHAS          3.76304    1.08620   3.464 0.000577 ***
## RM            8.27818    0.40182  20.602  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.17 on 502 degrees of freedom
## Multiple R-squared:  0.5527, Adjusted R-squared:   0.55 
## F-statistic: 206.7 on 3 and 502 DF,  p-value: < 2.2e-16

Thus the equation for median house price from the predictors in the model is \[\widehat{MEDV}= -28.81-0.26*CRIM+3.76*CHAS+8.28*RM\]

Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6? What is the prediction error?

c.lm.pred <- predict(lm,data.frame("CRIM" = 0.1,"RM" = 6, "CHAS" = 0))
c.lm.pred

##        1 
## 20.83232

Thus the predicted MEDV is 20.8 for this house. Since their is no data that is exactly the same as the example given above we are not able to calculate the prediction error (residual).

Reduce the number of predictors:
1. Which predictors are likely to be measuring the same thing among the 13 predictors? Discuss the relationships among INDUS, NOX, and TAX.

plot(boston.df) #this matrix shows how each variable relates to another

The matrix above can give us hints to which variables are related to each other and which are not. For example, MEDV and LSTAT are negatively correlated which is obvious becuase LSTAT is measuring the percentage lower status of the population which should be closely tide to MEDV. Then form the graph matrix we can see that INDUS, NOX, and TAX seem to be positively, at on the left of the graphs in the bottom triangle. That would indicate that these three things are measuring the same thing.

Compute the correlation table for the 12 numerical predictors and search for highly correlated pairs. These have potential redundancy and can cause multicollinearity. Choose which ones to remove based on this table.

boston.df[,-c(13,14)] %>% cor() %>% heatmap(,Rowv = NA,Colv = NA)

Note worthy variables to look out for would be ZN to DIS, RAD to TAX, and PTRATIO to TAX and RAD. That is, we would want to remove PTRATIO, ZN, and TAX for example.

Use stepwise regression with the three options (backward, forward, both) to reduce the remaining predictors as follows: Run stepwise on the training set. Choose the top model from each stepwise run. Then use each of these models separately to predict the validation set. Compare RMSE, MAPE, and mean error, as well as lift charts. Finally, describe the best model.

#valid and train
n <- nrow(b.boston)
b.shuffle.idx <- sample(c(1:n),round(0.6*n))
b.train.df <- b.boston[b.shuffle.idx,]
b.valid.df <- b.boston[-b.shuffle.idx,]

#backward
lm.back <- step(b.lm,direction = "backward")

## Start:  AIC=1599.85
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + 
##     TAX + PTRATIO + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## - INDUS    1      1.08 11350 1597.9
## - AGE      1      1.69 11351 1597.9
## <none>                 11349 1599.8
## - CHAS     1    245.31 11595 1608.7
## - TAX      1    256.28 11606 1609.2
## - ZN       1    263.59 11613 1609.5
## - CRIM     1    311.49 11661 1611.6
## - RAD      1    430.71 11780 1616.7
## - NOX      1    546.10 11896 1621.6
## - PTRATIO  1   1157.70 12507 1647.0
## - DIS      1   1258.52 12608 1651.1
## - RM       1   1744.36 13094 1670.2
## - LSTAT    1   2733.54 14083 1707.0
## 
## Step:  AIC=1597.9
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + AGE + DIS + RAD + TAX + 
##     PTRATIO + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## - AGE      1      1.69 11352 1596.0
## <none>                 11350 1597.9
## - CHAS     1    251.21 11602 1607.0
## - ZN       1    262.99 11614 1607.5
## - TAX      1    299.68 11650 1609.1
## - CRIM     1    313.07 11664 1609.7
## - RAD      1    453.61 11804 1615.7
## - NOX      1    574.23 11925 1620.9
## - PTRATIO  1   1168.01 12518 1645.5
## - DIS      1   1333.19 12684 1652.1
## - RM       1   1750.50 13101 1668.5
## - LSTAT    1   2743.21 14094 1705.4
## 
## Step:  AIC=1595.98
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + 
##     LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 11352 1596.0
## - CHAS     1    254.21 11606 1605.2
## - ZN       1    261.75 11614 1605.5
## - TAX      1    298.57 11651 1607.1
## - CRIM     1    313.27 11666 1607.8
## - RAD      1    452.16 11804 1613.7
## - NOX      1    601.74 11954 1620.1
## - PTRATIO  1   1168.51 12521 1643.5
## - DIS      1   1496.35 12848 1656.6
## - RM       1   1848.38 13201 1670.3
## - LSTAT    1   3043.23 14395 1714.2

summary(lm.back)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + 
##     TAX + PTRATIO + LSTAT, data = b.boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1814  -2.7625  -0.6243   1.8448  26.3920 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.451747   4.903283   8.454 3.18e-16 ***
## CRIM         -0.121665   0.032919  -3.696 0.000244 ***
## ZN            0.046191   0.013673   3.378 0.000787 ***
## CHAS          2.871873   0.862591   3.329 0.000935 ***
## NOX         -18.262427   3.565247  -5.122 4.33e-07 ***
## RM            3.672957   0.409127   8.978  < 2e-16 ***
## DIS          -1.515951   0.187675  -8.078 5.08e-15 ***
## RAD           0.283932   0.063945   4.440 1.11e-05 ***
## TAX          -0.012292   0.003407  -3.608 0.000340 ***
## PTRATIO      -0.930961   0.130423  -7.138 3.39e-12 ***
## LSTAT        -0.546509   0.047442 -11.519  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.789 on 495 degrees of freedom
## Multiple R-squared:  0.7342, Adjusted R-squared:  0.7289 
## F-statistic: 136.8 on 10 and 495 DF,  p-value: < 2.2e-16

lm.back.pred <- predict(lm.back, b.valid.df)
accuracy(lm.back.pred, b.valid.df$MEDV)

##                 ME     RMSE      MAE      MPE     MAPE
## Test set 0.2966321 5.340989 3.704112 -2.22121 16.56623

lm.null <- lm(MEDV~1,data = boston.df)
lm.forward <- step(lm.null, 
                   scope = list(lower=lm.null, upper= b.lm),
                   direction = "forward")

## Start:  AIC=2246.51
## MEDV ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + LSTAT    1   23243.9 19472 1851.0
## + RM       1   20654.4 22062 1914.2
## + PTRATIO  1   11014.3 31702 2097.6
## + INDUS    1    9995.2 32721 2113.6
## + TAX      1    9377.3 33339 2123.1
## + NOX      1    7800.1 34916 2146.5
## + CRIM     1    6440.8 36276 2165.8
## + RAD      1    6221.1 36495 2168.9
## + AGE      1    6069.8 36647 2171.0
## + ZN       1    5549.7 37167 2178.1
## + DIS      1    2668.2 40048 2215.9
## + CHAS     1    1312.1 41404 2232.7
## <none>                 42716 2246.5
## 
## Step:  AIC=1851.01
## MEDV ~ LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## + RM       1    4033.1 15439 1735.6
## + PTRATIO  1    2670.1 16802 1778.4
## + CHAS     1     786.3 18686 1832.2
## + DIS      1     772.4 18700 1832.5
## + AGE      1     304.3 19168 1845.0
## + TAX      1     274.4 19198 1845.8
## + ZN       1     160.3 19312 1848.8
## + CRIM     1     146.9 19325 1849.2
## + INDUS    1      98.7 19374 1850.4
## <none>                 19472 1851.0
## + RAD      1      25.1 19447 1852.4
## + NOX      1       4.8 19468 1852.9
## 
## Step:  AIC=1735.58
## MEDV ~ LSTAT + RM
## 
##           Df Sum of Sq   RSS    AIC
## + PTRATIO  1   1711.32 13728 1678.1
## + CHAS     1    548.53 14891 1719.3
## + TAX      1    425.16 15014 1723.5
## + DIS      1    351.15 15088 1725.9
## + CRIM     1    311.42 15128 1727.3
## + RAD      1    180.45 15259 1731.6
## + INDUS    1     61.09 15378 1735.6
## <none>                 15439 1735.6
## + ZN       1     56.56 15383 1735.7
## + AGE      1     20.18 15419 1736.9
## + NOX      1     14.90 15424 1737.1
## 
## Step:  AIC=1678.13
## MEDV ~ LSTAT + RM + PTRATIO
## 
##         Df Sum of Sq   RSS    AIC
## + DIS    1    499.08 13229 1661.4
## + CHAS   1    377.96 13350 1666.0
## + CRIM   1    122.52 13606 1675.6
## + AGE    1     66.24 13662 1677.7
## <none>               13728 1678.1
## + TAX    1     44.36 13684 1678.5
## + NOX    1     24.81 13703 1679.2
## + ZN     1     14.96 13713 1679.6
## + RAD    1      6.07 13722 1679.9
## + INDUS  1      0.83 13727 1680.1
## 
## Step:  AIC=1661.39
## MEDV ~ LSTAT + RM + PTRATIO + DIS
## 
##         Df Sum of Sq   RSS    AIC
## + NOX    1    759.56 12469 1633.5
## + CHAS   1    267.43 12962 1653.1
## + INDUS  1    242.65 12986 1654.0
## + TAX    1    240.34 12989 1654.1
## + CRIM   1    233.54 12995 1654.4
## + ZN     1    144.81 13084 1657.8
## + AGE    1     61.36 13168 1661.0
## <none>               13229 1661.4
## + RAD    1     22.40 13206 1662.5
## 
## Step:  AIC=1633.47
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX
## 
##         Df Sum of Sq   RSS    AIC
## + CHAS   1    328.27 12141 1622.0
## + ZN     1    151.71 12318 1629.3
## + CRIM   1    141.43 12328 1629.7
## + RAD    1     53.48 12416 1633.3
## <none>               12469 1633.5
## + INDUS  1     17.10 12452 1634.8
## + TAX    1     10.50 12459 1635.0
## + AGE    1      0.25 12469 1635.5
## 
## Step:  AIC=1621.97
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS
## 
##         Df Sum of Sq   RSS    AIC
## + ZN     1   164.406 11977 1617.1
## + CRIM   1   116.330 12025 1619.1
## + RAD    1    58.556 12082 1621.5
## <none>               12141 1622.0
## + INDUS  1    26.274 12115 1622.9
## + TAX    1     4.187 12137 1623.8
## + AGE    1     2.331 12139 1623.9
## 
## Step:  AIC=1617.07
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + ZN
## 
##         Df Sum of Sq   RSS    AIC
## + CRIM   1   170.902 11806 1611.8
## <none>               11977 1617.1
## + TAX    1    31.773 11945 1617.7
## + RAD    1    28.311 11948 1617.9
## + INDUS  1    27.377 11949 1617.9
## + AGE    1     0.071 11977 1619.1
## 
## Step:  AIC=1611.8
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + ZN + CRIM
## 
##         Df Sum of Sq   RSS    AIC
## + RAD    1   155.006 11651 1607.1
## <none>               11806 1611.8
## + INDUS  1    24.957 11781 1612.7
## + TAX    1     1.418 11804 1613.7
## + AGE    1     0.178 11806 1613.8
## 
## Step:  AIC=1607.11
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + ZN + CRIM + 
##     RAD
## 
##         Df Sum of Sq   RSS    AIC
## + TAX    1   298.573 11352 1596.0
## <none>               11651 1607.1
## + INDUS  1    44.346 11606 1607.2
## + AGE    1     0.581 11650 1609.1
## 
## Step:  AIC=1595.98
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + ZN + CRIM + 
##     RAD + TAX
## 
##         Df Sum of Sq   RSS    AIC
## <none>               11352 1596.0
## + AGE    1    1.6865 11350 1597.9
## + INDUS  1    1.0784 11351 1597.9

summary(lm.forward)

## 
## Call:
## lm(formula = MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + 
##     ZN + CRIM + RAD + TAX, data = boston.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1814  -2.7625  -0.6243   1.8448  26.3920 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.451747   4.903283   8.454 3.18e-16 ***
## LSTAT        -0.546509   0.047442 -11.519  < 2e-16 ***
## RM            3.672957   0.409127   8.978  < 2e-16 ***
## PTRATIO      -0.930961   0.130423  -7.138 3.39e-12 ***
## DIS          -1.515951   0.187675  -8.078 5.08e-15 ***
## NOX         -18.262427   3.565247  -5.122 4.33e-07 ***
## CHAS          2.871873   0.862591   3.329 0.000935 ***
## ZN            0.046191   0.013673   3.378 0.000787 ***
## CRIM         -0.121665   0.032919  -3.696 0.000244 ***
## RAD           0.283932   0.063945   4.440 1.11e-05 ***
## TAX          -0.012292   0.003407  -3.608 0.000340 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.789 on 495 degrees of freedom
## Multiple R-squared:  0.7342, Adjusted R-squared:  0.7289 
## F-statistic: 136.8 on 10 and 495 DF,  p-value: < 2.2e-16

lm.forward.pred <- predict(lm.forward, b.valid.df)
accuracy(lm.forward.pred, b.valid.df$MEDV)

##                 ME     RMSE      MAE      MPE     MAPE
## Test set 0.2966321 5.340989 3.704112 -2.22121 16.56623

lm.step <- step(b.lm, direction = "both")

## Start:  AIC=1599.85
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + 
##     TAX + PTRATIO + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## - INDUS    1      1.08 11350 1597.9
## - AGE      1      1.69 11351 1597.9
## <none>                 11349 1599.8
## - CHAS     1    245.31 11595 1608.7
## - TAX      1    256.28 11606 1609.2
## - ZN       1    263.59 11613 1609.5
## - CRIM     1    311.49 11661 1611.6
## - RAD      1    430.71 11780 1616.7
## - NOX      1    546.10 11896 1621.6
## - PTRATIO  1   1157.70 12507 1647.0
## - DIS      1   1258.52 12608 1651.1
## - RM       1   1744.36 13094 1670.2
## - LSTAT    1   2733.54 14083 1707.0
## 
## Step:  AIC=1597.9
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + AGE + DIS + RAD + TAX + 
##     PTRATIO + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## - AGE      1      1.69 11352 1596.0
## <none>                 11350 1597.9
## + INDUS    1      1.08 11349 1599.8
## - CHAS     1    251.21 11602 1607.0
## - ZN       1    262.99 11614 1607.5
## - TAX      1    299.68 11650 1609.1
## - CRIM     1    313.07 11664 1609.7
## - RAD      1    453.61 11804 1615.7
## - NOX      1    574.23 11925 1620.9
## - PTRATIO  1   1168.01 12518 1645.5
## - DIS      1   1333.19 12684 1652.1
## - RM       1   1750.50 13101 1668.5
## - LSTAT    1   2743.21 14094 1705.4
## 
## Step:  AIC=1595.98
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + 
##     LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 11352 1596.0
## + AGE      1      1.69 11350 1597.9
## + INDUS    1      1.08 11351 1597.9
## - CHAS     1    254.21 11606 1605.2
## - ZN       1    261.75 11614 1605.5
## - TAX      1    298.57 11651 1607.1
## - CRIM     1    313.27 11666 1607.8
## - RAD      1    452.16 11804 1613.7
## - NOX      1    601.74 11954 1620.1
## - PTRATIO  1   1168.51 12521 1643.5
## - DIS      1   1496.35 12848 1656.6
## - RM       1   1848.38 13201 1670.3
## - LSTAT    1   3043.23 14395 1714.2

summary(lm.step)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + 
##     TAX + PTRATIO + LSTAT, data = b.boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1814  -2.7625  -0.6243   1.8448  26.3920 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.451747   4.903283   8.454 3.18e-16 ***
## CRIM         -0.121665   0.032919  -3.696 0.000244 ***
## ZN            0.046191   0.013673   3.378 0.000787 ***
## CHAS          2.871873   0.862591   3.329 0.000935 ***
## NOX         -18.262427   3.565247  -5.122 4.33e-07 ***
## RM            3.672957   0.409127   8.978  < 2e-16 ***
## DIS          -1.515951   0.187675  -8.078 5.08e-15 ***
## RAD           0.283932   0.063945   4.440 1.11e-05 ***
## TAX          -0.012292   0.003407  -3.608 0.000340 ***
## PTRATIO      -0.930961   0.130423  -7.138 3.39e-12 ***
## LSTAT        -0.546509   0.047442 -11.519  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.789 on 495 degrees of freedom
## Multiple R-squared:  0.7342, Adjusted R-squared:  0.7289 
## F-statistic: 136.8 on 10 and 495 DF,  p-value: < 2.2e-16

lm.step.pred <- predict(lm.step,b.valid.df)
accuracy(lm.step.pred,b.valid.df$MEDV)

##                 ME     RMSE      MAE      MPE     MAPE
## Test set 0.2966321 5.340989 3.704112 -2.22121 16.56623

#lets collect all those RMSE, MAPE, and mean error in one place
comparetest <- data.frame( 
    backwards = c(accuracy(lm.back.pred,b.valid.df$MEDV)),
    forward =c(accuracy(lm.forward.pred, b.valid.df$MEDV)),
    step = c(accuracy(lm.step.pred,b.valid.df$MEDV))
)
rownames(comparetest) <- c("ME","RMSE","MAE","MPE","MAPE")
comparetest

##       backwards    forward       step
## ME    0.2966321  0.2966321  0.2966321
## RMSE  5.3409886  5.3409886  5.3409886
## MAE   3.7041124  3.7041124  3.7041124
## MPE  -2.2212097 -2.2212097 -2.2212097
## MAPE 16.5662271 16.5662271 16.5662271

actual = b.valid.df$MEDV
#lift for backwards
gain1 = gains(actual, 
             lm.back.pred,
             group = 10)

plot(c(0, gain1$cume.pct.of.total*sum(actual))~c(0, gain1$cume.obs), type = "l", xlab = "#Cases", ylab = "Cumulative MEDV", main = "Lift Chart for backwards")
segments(0, 0, nrow(b.valid.df), sum(actual), lty = "dashed", col = "red", lwd = 2)

#lift for forward
gain2 = gains(actual, 
             lm.forward.pred,
             group = 10)

plot(c(0, gain2$cume.pct.of.total*sum(actual))~c(0, gain2$cume.obs), type = "l", xlab = "#Cases", ylab = "Cumulative MEDV", main = "Lift Chart for forward")
segments(0, 0, nrow(b.valid.df), sum(actual), lty = "dashed", col = "red", lwd = 2)

#lift for step
gain3 = gains(actual, 
             lm.step.pred,
             group = 10)

plot(c(0, gain3$cume.pct.of.total*sum(actual))~c(0, gain3$cume.obs), type = "l", xlab = "#Cases", ylab = "Cumulative MEDV", main = "Lift Chart for step")
segments(0, 0, nrow(b.valid.df), sum(actual), lty = "dashed", col = "red", lwd = 2)

Note that all the models have the same ME, MAPE, and RMSE. Also the lift charts all look very similar. And taking a closer look these methods we find that they are arriving at the same model.

Additional Regression Analysis:

lambda <- 10^seq(-3, 3, length = 100)

Ridge <- train(
  MEDV ~., 
  data = b.train.df, 
  method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda)
  )
# Model coefficients
coef(Ridge$finalModel, Ridge$bestTune$lambda)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  29.766058545
## CRIM         -0.066044899
## ZN            0.039289466
## INDUS        -0.014507381
## CHAS          1.587810747
## NOX         -16.738786119
## RM            4.472760108
## AGE           0.012429980
## DIS          -1.000188639
## RAD           0.097062820
## TAX          -0.008643112
## PTRATIO      -0.872827370
## LSTAT        -0.380100883

# Make predictions
predictions_r <- Ridge %>% predict(b.valid.df)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions_r, b.valid.df$MEDV),
  Rsquare = R2(predictions_r, b.valid.df$MEDV)
)

##       RMSE  Rsquare
## 1 5.807662 0.629935

Lasso <- train(
  MEDV ~., 
  data = b.train.df, 
  method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda)
  )

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.

# Model coefficients
coef(Lasso$finalModel, Lasso$bestTune$lambda)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  40.12088159
## CRIM         -0.09095348
## ZN            0.05461482
## INDUS         0.05597609
## CHAS          1.14560737
## NOX         -25.13019749
## RM            4.19538568
## AGE           0.02273288
## DIS          -1.38656919
## RAD           0.27748624
## TAX          -0.01663009
## PTRATIO      -0.97258343
## LSTAT        -0.43040852

# Make predictions
predictions_l <- Lasso %>% predict(b.valid.df)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions_l, b.valid.df$MEDV),
  Rsquare = R2(predictions_l, b.valid.df$MEDV)
)

##       RMSE   Rsquare
## 1 5.755752 0.6361706

Elastic <- train(
  MEDV ~., 
  data = b.train.df, 
  method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
  )
# Model coefficients
coef(Elastic$finalModel, Elastic$bestTune$lambda)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  37.65927294
## CRIM         -0.08308149
## ZN            0.05057609
## INDUS         0.03589453
## CHAS          1.23430736
## NOX         -23.28564596
## RM            4.28689221
## AGE           0.02021279
## DIS          -1.30972932
## RAD           0.22768139
## TAX          -0.01429361
## PTRATIO      -0.95223800
## LSTAT        -0.42080670

# Make predictions
predictions_e <- Elastic %>% predict(b.valid.df)
# Model prediction performance
data.frame(
  RMSE = RMSE(predictions_e, b.valid.df$MEDV),
  Rsquare = R2(predictions_e, b.valid.df$MEDV)
)

##       RMSE   Rsquare
## 1 5.752484 0.6364384

# comparing models
models <- list(ridge = Ridge, lasso = Lasso, elastic = Elastic)
resamples(models) %>% summary( metric = c("RMSE","Rsquare"))

## 
## Call:
## summary.resamples(object = ., metric = c("RMSE", "Rsquare"))
## 
## Models: ridge, lasso, elastic 
## Number of resamples: 10 
## 
## RMSE 
##             Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## ridge   3.181587 3.924235 4.165574 4.341620 4.466548 6.839733    0
## lasso   3.334449 3.804264 3.999592 4.313565 4.560470 6.183690    0
## elastic 3.375455 3.615131 3.935958 4.251272 4.562399 6.647375    0
## 
## Rsquare 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## ridged   0.3469412 0.7654247 0.8240114 0.7601320 0.8353505 0.8622493    0
## lassod   0.4656746 0.7699098 0.7926356 0.7588528 0.8258618 0.8797937    0
## elasticd 0.5637989 0.7298357 0.8040176 0.7747996 0.8344246 0.8767010    0

It can be seen that elastic has the lowest median RMSE and ridge has the lowest mean RSME. And lasso has the highest mean and median \(R^2\). It is also noteworthy that the model from the elastic net method is the only model that had a coefficient of 0 for one of the variables (AGE).

Since we are just trying to make the best prediction, I would use the elastic model since it has the lowest RMSE and has the highest max \(R^2\).

Project#4-Regularization Methods for Multiple Regression

Sean Cranston

2/24/2021

lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD +

TAX + PTRATIO + LSTAT, data = b.boston)