R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(a) The lasso, relative to least squares, is:

Answer: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The lasso solution trades a small increase in bias for a potentially large decrease in variance, which can improve prediction accuracy. The lasso's advantage over least squares is rooted in this bias-variance trade-off; it also performs variable selection, which makes the fitted model easier to interpret.

(b) Repeat (a) for ridge regression relative to least squares.

Answer: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Like the lasso, ridge regression's advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge fit decreases, which lowers the variance but raises the bias. Least squares coefficients can change substantially (i.e., have high variance) in response to a small change in the training data; by trading a small increase in bias for a large decrease in variance, ridge regression can outperform least squares.

(c) Repeat (a) for non-linear methods relative to least squares.

Answer: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
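The bias-variance argument in (a) can be illustrated with a small simulation. The sketch below (not part of the original answer) assumes a sparse true model with many irrelevant predictors, a setting where the lasso's small increase in bias is more than offset by its decrease in variance:

library(glmnet)
set.seed(1)
n = 100; p = 50
x = matrix(rnorm(n * p), n, p)
beta = c(rep(2, 5), rep(0, p - 5))    # only the first 5 predictors matter
y = drop(x %*% beta + rnorm(n))
x.new = matrix(rnorm(n * p), n, p)    # an independent test set
y.new = drop(x.new %*% beta + rnorm(n))

# Least squares uses all 50 predictors: low bias, high variance
ols.fit = lm(y ~ x)
ols.mse = mean((y.new - cbind(1, x.new) %*% coef(ols.fit))^2)

# The lasso shrinks the 45 irrelevant coefficients to zero:
# a small increase in bias buys a large decrease in variance
lasso.cv = cv.glmnet(x, y, alpha = 1)
lasso.mse = mean((y.new - predict(lasso.cv, s = "lambda.min", newx = x.new))^2)
c(ols = ols.mse, lasso = lasso.mse)   # the lasso's test MSE is typically far lower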

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR2)
data(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
pairs(College)

(a) Split the data set into a training set and a test set. Answer:

set.seed(1) 
trainingindex<-sample(nrow(College), 0.75*nrow(College))
head(trainingindex)
## [1] 679 129 509 471 299 270
train<-College[trainingindex, ]
test<-College[-trainingindex, ]
dim(College)
## [1] 777  18
dim(train)
## [1] 582  18
dim(test)
## [1] 195  18
(b) Fit a linear model using least squares on the training set, and report the test error obtained. Answer: The linear model fit on all variables gives a test MSE of 1384604, with a training R² of 93.36%. With the methods below we can try to reduce the error and increase the R².
lm.fit=lm(Apps~., data=train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5773.1  -425.2     4.5   327.9  7496.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.784e+02  4.707e+02  -1.229  0.21962    
## PrivateYes  -4.673e+02  1.571e+02  -2.975  0.00305 ** 
## Accept       1.712e+00  4.567e-02  37.497  < 2e-16 ***
## Enroll      -1.197e+00  2.151e-01  -5.564 4.08e-08 ***
## Top10perc    5.298e+01  6.158e+00   8.603  < 2e-16 ***
## Top25perc   -1.528e+01  4.866e+00  -3.141  0.00177 ** 
## F.Undergrad  7.085e-02  3.760e-02   1.884  0.06002 .  
## P.Undergrad  5.771e-02  3.530e-02   1.635  0.10266    
## Outstate    -8.143e-02  2.077e-02  -3.920 9.95e-05 ***
## Room.Board   1.609e-01  5.361e-02   3.002  0.00280 ** 
## Books        2.338e-01  2.634e-01   0.887  0.37521    
## Personal     6.611e-03  6.850e-02   0.097  0.92315    
## PhD         -1.114e+01  5.149e+00  -2.163  0.03093 *  
## Terminal     9.186e-01  5.709e+00   0.161  0.87223    
## S.F.Ratio    1.689e+01  1.542e+01   1.096  0.27368    
## perc.alumni  2.256e+00  4.635e+00   0.487  0.62667    
## Expend       5.567e-02  1.300e-02   4.284 2.16e-05 ***
## Grad.Rate    6.427e+00  3.307e+00   1.944  0.05243 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9316 
## F-statistic: 466.7 on 17 and 564 DF,  p-value: < 2.2e-16
lm.pred=predict(lm.fit, newdata=test)
lm.err=mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1384604
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained. Answer: Ridge regression gives better results than OLS (test MSE 1260111 vs. 1384604). One drawback is that, unlike the lasso, ridge regression does not shrink any coefficient exactly to zero, so all of the variables remain in the model; this can help prediction accuracy but makes it harder to interpret each variable's relationship with the response. A quick check of the coefficients at the chosen λ follows the ridge code below.
# Note: train[, -1] drops the first column (Private), so the penalized fits
# below exclude Private. The more common idiom is
# model.matrix(Apps ~ ., data = train)[, -1], which drops the intercept
# column and keeps Private; as written, model.matrix() retains its intercept
# column, which is why a second "(Intercept)" row appears in the lasso
# coefficients further down.
xtrain=model.matrix(Apps~., data=train[,-1])
ytrain=train$Apps
xtest=model.matrix(Apps~., data=test[,-1])
ytest=test$Apps
set.seed(1)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
ridge.fit=cv.glmnet(xtrain, ytrain, alpha=0)
plot(ridge.fit)

ridge.lambda=ridge.fit$lambda.min
ridge.lambda
## [1] 364.6228
ridge.pred=predict(ridge.fit, s=ridge.lambda, newx = xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err
## [1] 1260111
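As a check on the claim above (output omitted here), we can inspect the ridge coefficients at the chosen λ; apart from the redundant intercept column produced by the model.matrix() call, none of them should be exactly zero:

ridge.coeff=predict(ridge.fit, type="coefficients", s=ridge.lambda)
sum(ridge.coeff != 0)   # ridge shrinks toward zero but keeps every predictor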
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates. Answer: The lasso performs variable selection, keeping only the predictors that help with prediction on the test data (the non-zero estimates are counted after the coefficient listing below). Although it produces a simpler model, it did not predict quite as well as ridge regression here (test MSE 1394834 vs. 1260111); because the variance of the ridge fit was lower than that of the lasso, ridge produced the lower error. Ridge's outperformance may reflect the fact that most of the predictors are related to the response.
set.seed(1)
lasso.fit=cv.glmnet(xtrain, ytrain, alpha=1)
plot(lasso.fit)

lasso.lambda=lasso.fit$lambda.min
lasso.lambda
## [1] 1.945882
lasso.pred=predict(lasso.fit, s=lasso.lambda, newx = xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1394834
lasso.coeff=predict(lasso.fit, type="coefficients", s=lasso.lambda)[1:18,]
lasso.coeff
##   (Intercept)   (Intercept)        Accept        Enroll     Top10perc 
## -1.030320e+03  0.000000e+00  1.693638e+00 -1.100616e+00  5.104012e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
## -1.369969e+01  7.851307e-02  6.036558e-02 -9.861002e-02  1.475536e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##  2.185146e-01  0.000000e+00 -8.390364e+00  1.349648e+00  2.457249e+01 
##   perc.alumni        Expend     Grad.Rate 
##  7.684156e-01  5.858007e-02  5.324356e+00
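The exercise also asks for the number of non-zero coefficient estimates. From the listing above, only Personal (besides the redundant intercept column) is shrunk exactly to zero, which leaves 15 non-zero predictor estimates:

sum(lasso.coeff[-c(1, 2)] != 0)   # drop the two intercept rows before counting
## [1] 15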
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation. Answer: With M = 10 components, the PCR model (test MSE 1952693) did not outperform any of the models used previously. The cross-validation errors decrease essentially all the way out to M = 17, which is the full least-squares fit, so there is no clear U-shape here; reducing the number of principal components further would not have produced a better outcome, because it would give a higher CV error. A sketch for reading M off the CV output follows the results below.
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcr.fit=pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type = "MSEP")

summary(pcr.fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     3773     2100     2106     1812     1678     1674
## adjCV         3862     3772     2097     2104     1800     1666     1668
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1674     1625     1580      1579      1582      1586      1590
## adjCV     1669     1613     1574      1573      1576      1580      1584
##        14 comps  15 comps  16 comps  17 comps
## CV         1585      1480      1189      1123
## adjCV      1580      1463      1179      1115
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.159    57.17    64.41    70.20    75.53    80.48    84.24    87.56
## Apps    5.226    71.83    71.84    80.02    83.01    83.07    83.21    84.46
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.54     92.81     94.92     96.73     97.81     98.69     99.35
## Apps    85.00     85.22     85.22     85.23     85.36     85.45     89.93
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.84     93.36
pcr.pred=predict(pcr.fit, test, ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1952693
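For reference, the value of M with the lowest cross-validated error can be read off programmatically. A short sketch using the pls package's RMSEP accessor (from the summary above, the adjusted CV error is minimized at M = 17):

cv.errs = RMSEP(pcr.fit)$val["adjCV", 1, ]
which.min(cv.errs) - 1   # subtract 1 because the first entry is the intercept-only model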
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation. Answer: With M = 7, PLS (test MSE 1333314) does slightly worse than ridge regression but better than least squares, the lasso, and PCR. PLS reduces the dimension of the problem while using the response to construct the components; decreasing M trades a reduction in variance for an increase in bias.
pls.fit=plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type = "MSEP")

summary(pls.fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     1916     1708     1496     1415     1261     1181
## adjCV         3862     1912     1706     1489     1396     1237     1170
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1166     1152     1146      1144      1144      1140      1138
## adjCV     1157     1144     1137      1135      1135      1131      1129
##        14 comps  15 comps  16 comps  17 comps
## CV         1137      1137      1137      1137
## adjCV      1129      1129      1129      1129
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.09    62.54     65.0    67.54    72.28    76.80    80.63
## Apps    76.80    82.71    87.20     90.8    92.79    93.05    93.14    93.22
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.71     85.53     88.01     90.95     93.07     95.18     96.86
## Apps    93.30     93.32     93.34     93.35     93.36     93.36     93.36
##       16 comps  17 comps
## X        98.00    100.00
## Apps     93.36     93.36
pls.pred=predict(pls.fit, test, ncomp = 7)
pls.err=mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1333314
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches? Answer: Based on the test R² values computed below, all of the approaches predict Apps fairly accurately (R² around 0.91), except for PCR, which lags at roughly 0.87. Apart from PCR, there is not much difference among the test errors.
avg.apps=mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lm.r2
## [1] 0.9086432
ridge.r2=1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2
## [1] 0.9168573
lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2
## [1] 0.9079682
pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2
## [1] 0.8711605
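The comparison above omits PLS. For completeness, the same calculation (from the test MSEs already reported, this works out to roughly 0.912, between ridge regression and least squares):

pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2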

11. We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider. Answer: A PCR model with 12 components explains 99.52% of the variance in the predictors but only 44.13% of the variance in the crime rate. Aside from best subset selection, which achieves the lowest cross-validated error below, PCR is a reasonable predictive model.

set.seed(1)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
data(Boston)
attach(Boston)
library(leaps)
# Prediction method for regsubsets objects (the leaps package does not
# supply one): rebuild the model matrix from the original formula and
# multiply by the coefficients of the model with `id` predictors
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")

boston.bsm.err=(rmse.cv[which.min(rmse.cv)])^2
boston.bsm.err
## [1] 42.81453
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)

coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                   s1
## (Intercept) 2.176491
## zn          .       
## indus       .       
## chas        .       
## nox         .       
## rm          .       
## age         .       
## dis         .       
## rad         0.150484
## tax         .       
## ptratio     .       
## black       .       
## lstat       .       
## medv        .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.921353
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)

coef(cv.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  1.523899542
## zn          -0.002949852
## indus        0.029276741
## chas        -0.166526007
## nox          1.874769665
## rm          -0.142852604
## age          0.006207995
## dis         -0.094547258
## rad          0.045932737
## tax          0.002086668
## ptratio      0.071258052
## black       -0.002605281
## lstat        0.035745604
## medv        -0.023480540
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
## [1] 7.669133
pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.175    7.180    6.724    6.731    6.727    6.727
## adjCV         8.61    7.174    7.179    6.721    6.725    6.724    6.724
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.722    6.614    6.618     6.607     6.598     6.553     6.488
## adjCV    6.718    6.609    6.613     6.602     6.592     6.546     6.481
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error. Answer: The computations above answer this question: the model with the lowest cross-validation error is the one chosen by best subset selection, with a CV MSE of 42.81.
(c) Does your chosen model involve all of the features in the data set? Why or why not? Answer: No. The model chosen by best subset selection includes only a subset of the predictors (zn, indus, nox, dis, rad, ptratio, lstat, and medv) rather than all 13. The excluded variables add more variance than predictive signal, so ruling them out yields better prediction accuracy and a lower cross-validated MSE, which is what we are after. The sketch below refits the selected model on the full data and lists its coefficients.
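As a check on parts (b) and (c), the sketch below (assuming the CV-chosen size is the one minimizing rmse.cv above) refits best subset selection on the full data set and extracts the coefficients of the selected model:

best.size = which.min(rmse.cv)   # subset size with the lowest CV RMSE
reg.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(reg.full, id = best.size)   # the predictors referred to in part (c)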