Exercise 6.6: Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

a. The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

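  1. Answer to (a): iii. The lasso shrinks the coefficient estimates and can set some of them exactly to zero, which makes it less flexible than least squares. That shrinkage increases bias, so prediction accuracy improves only when the increase in bias is smaller than the resulting decrease in variance.
  1. Repeat (a) for ridge regression relative to least squares.
  1. Answer to (b): iii. Ridge regression also shrinks the coefficients toward zero, although it does not set any of them exactly to zero, so it is less flexible than least squares and involves the same trade-off as the lasso.
  1. Repeat (a) for non-linear methods relative to least squares.
  1. Answer to (c): ii. Non-linear methods are more flexible than least squares. They predict better when the increase in variance is smaller than the decrease in bias.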
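
The reasoning in all three parts rests on the standard bias-variance decomposition of the expected test error at a point \(x_0\). It is written out here as a reminder, with \(\hat{f}\) denoting the fitted model and \(\varepsilon\) the irreducible noise:

\[
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)
\]

A less flexible method (lasso, ridge) raises the squared-bias term and lowers the variance term, so it wins only when the variance drop is the larger of the two. A more flexible method does the reverse. The \(\mathrm{Var}(\varepsilon)\) term is unaffected by the choice of method.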

Exercise 6.6: Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

  1. Split the data set into a training set and a test set.
  1. Fit a linear model using least squares on the training set, and report the test error obtained.
## 
## Call:
## lm(formula = Apps ~ ., data = College[train, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4963.4  -447.7   -42.8   338.3  7790.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -685.74538  470.15497  -1.459 0.145178    
## PrivateYes  -531.17790  159.50713  -3.330 0.000918 ***
## Accept         1.58651    0.04501  35.246  < 2e-16 ***
## Enroll        -0.89994    0.19981  -4.504 7.93e-06 ***
## Top10perc     54.46747    6.25983   8.701  < 2e-16 ***
## Top25perc    -17.40449    5.09522  -3.416 0.000676 ***
## F.Undergrad    0.06185    0.03609   1.714 0.087071 .  
## P.Undergrad    0.04396    0.03491   1.259 0.208365    
## Outstate      -0.07995    0.02191  -3.649 0.000284 ***
## Room.Board     0.16431    0.05679   2.893 0.003941 ** 
## Books          0.16979    0.28981   0.586 0.558185    
## Personal       0.01522    0.07075   0.215 0.829784    
## PhD          -10.58598    5.49279  -1.927 0.054390 .  
## Terminal      -1.95091    6.00164  -0.325 0.745239    
## S.F.Ratio     22.79666   14.82745   1.537 0.124673    
## perc.alumni   -0.44579    4.77616  -0.093 0.925666    
## Expend         0.07988    0.01372   5.821 9.22e-09 ***
## Grad.Rate      9.62036    3.40940   2.822 0.004924 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1092 on 642 degrees of freedom
## Multiple R-squared:  0.9297, Adjusted R-squared:  0.9278 
## F-statistic: 499.5 on 17 and 642 DF,  p-value: < 2.2e-16
## [1] 521452.1
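
The number above is a test mean squared error: the model is fit on the training rows only and its squared prediction errors are averaged over the held-out rows. The following is a minimal Python sketch of that validation-set procedure, not the code used for this report. The helper names (`train_test_split`, `fit_simple_ols`, `holdout_mse`), the single-predictor model, the toy data, and the 85% training fraction (chosen only to roughly mirror the 660 training rows in the output above) are all assumptions made for illustration.

```python
import random


def train_test_split(n, train_frac, seed):
    """Return shuffled, disjoint train and test index lists for n rows."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]


def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def holdout_mse(b0, b1, x, y):
    """Mean squared prediction error on rows that were not used for fitting."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical toy data (NOT the College data): y = 3 + 2x + unit-variance noise.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [3.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

train, test = train_test_split(len(x), 0.85, seed=1)
b0, b1 = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
mse = holdout_mse(b0, b1, [x[i] for i in test], [y[i] for i in test])
print(f"slope = {b1:.2f}, test MSE = {mse:.2f}")
```

Because the test rows never enter the fit, the resulting MSE estimates out-of-sample error, which is why it can be compared across the methods in the remaining parts.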
  1. Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
##   (Intercept)   PrivateYes         Accept            Enroll      
##  Min.   :1    Min.   :0.0000   Min.   :   90.0   Min.   :  35.0  
##  1st Qu.:1    1st Qu.:0.0000   1st Qu.:  618.2   1st Qu.: 244.0  
##  Median :1    Median :1.0000   Median : 1127.0   Median : 438.0  
##  Mean   :1    Mean   :0.7242   Mean   : 2073.3   Mean   : 797.8  
##  3rd Qu.:1    3rd Qu.:1.0000   3rd Qu.: 2447.8   3rd Qu.: 910.2  
##  Max.   :1    Max.   :1.0000   Max.   :26330.0   Max.   :6392.0  
##    Top10perc       Top25perc       F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   :  9.00   Min.   :  199   Min.   :    1.0  
##  1st Qu.:15.75   1st Qu.: 42.00   1st Qu.: 1000   1st Qu.:   94.0  
##  Median :24.00   Median : 55.00   Median : 1720   Median :  351.0  
##  Mean   :28.01   Mean   : 56.46   Mean   : 3795   Mean   :  881.1  
##  3rd Qu.:36.00   3rd Qu.: 70.00   3rd Qu.: 4270   3rd Qu.: 1017.2  
##  Max.   :96.00   Max.   :100.00   Max.   :31643   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal     
##  Min.   : 2580   Min.   :1880   Min.   :  96.0   Min.   : 250.0  
##  1st Qu.: 7338   1st Qu.:3590   1st Qu.: 450.0   1st Qu.: 873.8  
##  Median :10147   Median :4205   Median : 500.0   Median :1200.0  
##  Mean   :10519   Mean   :4357   Mean   : 546.4   Mean   :1348.2  
##  3rd Qu.:13126   3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700.0  
##  Max.   :21700   Max.   :7425   Max.   :2340.0   Max.   :6800.0  
##       PhD           Terminal        S.F.Ratio      perc.alumni   
##  Min.   :  8.0   Min.   : 25.00   Min.   : 2.90   Min.   : 0.00  
##  1st Qu.: 62.0   1st Qu.: 71.00   1st Qu.:11.50   1st Qu.:13.00  
##  Median : 75.0   Median : 83.00   Median :13.60   Median :21.00  
##  Mean   : 72.9   Mean   : 79.95   Mean   :14.12   Mean   :23.04  
##  3rd Qu.: 86.0   3rd Qu.: 92.00   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.0   Max.   :100.00   Max.   :39.80   Max.   :64.00  
##      Expend        Grad.Rate  
##  Min.   : 3186   Min.   : 10  
##  1st Qu.: 6806   1st Qu.: 54  
##  Median : 8520   Median : 66  
##  Mean   : 9737   Mean   : 66  
##  3rd Qu.:10895   3rd Qu.: 78  
##  Max.   :56233   Max.   :118
## [1] 513191.6
  1. Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
## [1] 4
## [1] 513191.6
  1. Fit a PCR model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
## Data:    X dimension: 660 17 
##  Y dimension: 660 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4068     4007     2136     2139     1809     1644     1641
## adjCV         4068     4009     2134     2139     1811     1637     1636
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1630     1590     1571      1562      1571      1571      1571
## adjCV     1629     1580     1567      1557      1566      1567      1567
##        14 comps  15 comps  16 comps  17 comps
## CV         1574      1549      1242      1197
## adjCV      1570      1536      1233      1190
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.034    57.78    64.92    70.57     75.8    80.75    84.30    87.66
## Apps    3.259    73.13    73.14    80.92     84.7    84.83    85.02    85.86
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.70     93.09     95.16     96.94     97.98     98.78     99.36
## Apps    86.13     86.45     86.45     86.52     86.59     86.62     89.92
##       16 comps  17 comps
## X        99.83    100.00
## Apps     92.56     92.97

## [1] 521452.1
  1. Fit a PLS model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.

## [1] 521452.1
  1. Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

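Ridge and lasso report the lowest test error (513191.6 for both), about 1.6% below the 521452.1 reported for least squares, PCR, and PLS. For PCR the match with least squares is expected: the cross-validated RMSEP is lowest at 17 components, which uses every component and so reproduces the least squares fit. A test MSE of roughly 520,000 corresponds to an RMSE of about 720 applications, and the least squares fit explains about 93% of the variance in Apps on the training set, so the predictions are reasonably accurate and the five approaches differ very little. The ranking does seem to vary considerably with the train/test split selected.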
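
To put the five test errors on a comparable scale, the short Python sketch below converts each reported test MSE to an RMSE (in units of applications) and to a relative change against least squares. The MSE values are copied from the outputs above. The dictionary and helper names are illustrative assumptions, not part of the original analysis.

```python
import math

# Test MSEs copied from the outputs reported above.
test_mse = {
    "least squares": 521452.1,
    "ridge": 513191.6,
    "lasso": 513191.6,
    "PCR": 521452.1,
    "PLS": 521452.1,
}


def rmse(mse):
    """Root mean squared error, in the units of the response (applications)."""
    return math.sqrt(mse)


def relative_change(mse, baseline):
    """Fractional change of an MSE relative to a baseline MSE."""
    return (mse - baseline) / baseline


baseline = test_mse["least squares"]
summary = {
    name: (rmse(mse), relative_change(mse, baseline))
    for name, mse in test_mse.items()
}

for name, (root, change) in summary.items():
    print(f"{name:>13}: RMSE = {root:7.1f}, change vs. least squares = {change:+.2%}")
```

On these figures the gap between the best and worst method is under 2% of the least squares MSE, which is small relative to the variation one would expect from a different random split.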

Exercise 6.6: Question 11

We will now try to predict per capita crime rate in the Boston data set.

  1. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

[Figures: seaborn pairplot and heatmap of the Boston variables]

  1. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Of the variables inspected, the crime rate appears closest to a normal distribution once it is log-transformed, so the models below are fit with log(crim) as the response.

## [1] 1.630519
## [1] 1.677635
## [1] 1.648161

## [1] 1.630519

## [1] 1.630519

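Lasso seems to perform very slightly better than the other approaches tried. It shrinks the coefficient on rm exactly to zero. The second (Intercept) row is most likely the constant column of the model matrix, which is also zeroed because the fit already includes its own intercept: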

## 7 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -12.157450324
## (Intercept)   .          
## rm            .          
## age           0.007925442
## medv         -0.001304823
## nox          11.909560317
## ptratio       0.229580912
## [1] 13
  1. Does your chosen model involve all of the features in the data set? Why or why not?

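No. The lasso sets the coefficient on “rm” (the average number of rooms per dwelling) to zero, and in the least squares fit of log(crim) below, rm is not statistically significant (p = 0.70). medv is not significant in that fit either (p = 0.62), although the lasso keeps it with a very small coefficient.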

## 
## Call:
## lm(formula = log(crim) ~ ., data = boston_data[train, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2827 -0.8619  0.0630  0.7705  3.5294 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.653847   0.967427 -13.080  < 2e-16 ***
## rm            0.042559   0.110288   0.386  0.69976    
## age           0.008314   0.002980   2.790  0.00549 ** 
## medv         -0.004776   0.009562  -0.499  0.61768    
## nox          12.061847   0.735765  16.394  < 2e-16 ***
## ptratio       0.240261   0.030233   7.947 1.56e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.207 on 449 degrees of freedom
## Multiple R-squared:  0.683,  Adjusted R-squared:  0.6795 
## F-statistic: 193.5 on 5 and 449 DF,  p-value: < 2.2e-16
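
The lasso fit above drops rm entirely, whereas ridge regression would only shrink it. A minimal Python sketch of why this happens is given below, under a strong simplifying assumption: the special case where the design matrix is the identity (n = p, no intercept), so each least squares coefficient is shrunk independently. In that case ridge divides each coefficient by \(1 + \lambda\), while the lasso soft-thresholds it at \(\lambda / 2\). The function names, the coefficient values, and the choice \(\lambda = 0.2\) are hypothetical and are not taken from the Boston fit.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge estimate in the identity-design special case: proportional shrinkage."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Lasso estimate in the identity-design special case: soft-thresholding at lam / 2."""
    half = lam / 2.0
    if beta_ols > half:
        return beta_ols - half
    if beta_ols < -half:
        return beta_ols + half
    return 0.0  # small coefficients are set exactly to zero


# Hypothetical least squares coefficients (illustrative only).
coefs = [0.04, -0.5, 3.0]
lam = 0.2

for b in coefs:
    print(f"ols = {b:+.2f}  ridge = {ridge_shrink(b, lam):+.4f}  lasso = {lasso_shrink(b, lam):+.4f}")
```

In this sketch the smallest coefficient survives ridge shrinkage as a small non-zero value but is removed by the lasso, which is the same qualitative behaviour seen for rm.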