i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer is iii. The lasso's advantage over least squares is rooted in the bias-variance trade-off: the lasso is less flexible, so when the least squares fit has high variance, the lasso can reduce that variance at the cost of a small increase in bias, producing more accurate predictions.
The correct answer is also iii. Like the lasso, ridge regression's advantage over least squares comes from the bias-variance trade-off. As the penalty grows, the flexibility of the ridge fit decreases, lowering the variance while increasing the bias. This helps when small changes in the training data produce large changes in the least squares coefficients (high variance); ridge regression can then be more accurate by trading a small increase in bias for a large decrease in variance.
The correct answer is ii. In contrast to the previous two cases, a non-linear method is more flexible than least squares, so it will give more accurate predictions when its increase in variance is less than its decrease in bias.
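To make the lasso answer concrete, here is a minimal simulation sketch (my own illustration, not part of the assignment), assuming a sparse true model with few observations relative to predictors; in that setting least squares has high variance, so the lasso's small increase in bias should be outweighed by its large decrease in variance.
library(glmnet)
set.seed(1)
n <- 50; p <- 40
beta <- c(rep(2, 5), rep(0, p - 5))            # only 5 predictors truly matter (assumed setup)
x.tr <- matrix(rnorm(n * p), n, p)
y.tr <- drop(x.tr %*% beta) + rnorm(n, sd = 3)
x.te <- matrix(rnorm(1000 * p), 1000, p)
y.te <- drop(x.te %*% beta) + rnorm(1000, sd = 3)
ols.fit <- lm(y.tr ~ x.tr)                     # least squares: low bias, high variance here
ols.mse <- mean((y.te - cbind(1, x.te) %*% coef(ols.fit))^2)
las.fit <- cv.glmnet(x.tr, y.tr, alpha = 1)    # lasso: a little bias, much less variance
las.mse <- mean((y.te - predict(las.fit, newx = x.te, s = "lambda.min"))^2)
c(OLS = ols.mse, Lasso = las.mse)              # the lasso's test MSE is typically much lower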
library(ISLR)
data(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
pairs(College)
set.seed(1)
trainingindex<-sample(nrow(College), 0.75*nrow(College))
head(trainingindex)
## [1] 679 129 509 471 299 270
train<-College[trainingindex, ]
test<-College[-trainingindex, ]
dim(College)
## [1] 777 18
dim(train)
## [1] 582 18
dim(test)
## [1] 195 18
lm.fit=lm(Apps~., data=train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5773.1 -425.2 4.5 327.9 7496.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.784e+02 4.707e+02 -1.229 0.21962
## PrivateYes -4.673e+02 1.571e+02 -2.975 0.00305 **
## Accept 1.712e+00 4.567e-02 37.497 < 2e-16 ***
## Enroll -1.197e+00 2.151e-01 -5.564 4.08e-08 ***
## Top10perc 5.298e+01 6.158e+00 8.603 < 2e-16 ***
## Top25perc -1.528e+01 4.866e+00 -3.141 0.00177 **
## F.Undergrad 7.085e-02 3.760e-02 1.884 0.06002 .
## P.Undergrad 5.771e-02 3.530e-02 1.635 0.10266
## Outstate -8.143e-02 2.077e-02 -3.920 9.95e-05 ***
## Room.Board 1.609e-01 5.361e-02 3.002 0.00280 **
## Books 2.338e-01 2.634e-01 0.887 0.37521
## Personal 6.611e-03 6.850e-02 0.097 0.92315
## PhD -1.114e+01 5.149e+00 -2.163 0.03093 *
## Terminal 9.186e-01 5.709e+00 0.161 0.87223
## S.F.Ratio 1.689e+01 1.542e+01 1.096 0.27368
## perc.alumni 2.256e+00 4.635e+00 0.487 0.62667
## Expend 5.567e-02 1.300e-02 4.284 2.16e-05 ***
## Grad.Rate 6.427e+00 3.307e+00 1.944 0.05243 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared: 0.9336, Adjusted R-squared: 0.9316
## F-statistic: 466.7 on 17 and 564 DF, p-value: < 2.2e-16
lm.pred=predict(lm.fit, newdata=test)
lm.err=mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1384604
The test MSE from the linear model fit on all of the variables is 1384604. The training R-squared is 93.36%, which means that about 93.36% of the variance in Apps is explained by this model.
xtrain=model.matrix(Apps~., data=train[,-1])
ytrain=train$Apps
xtest=model.matrix(Apps~., data=test[,-1])
ytest=test$Apps
set.seed(1)
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Loaded glmnet 4.1-3
ridge.fit=cv.glmnet(xtrain, ytrain, alpha=0)
plot(ridge.fit)
ridge.lambda=ridge.fit$lambda.min
ridge.lambda
## [1] 364.6228
ridge.pred=predict(ridge.fit, s=ridge.lambda, newx = xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err
## [1] 1260111
From the plot of log(λ) against cross-validated mean squared error for the ridge regression, we can see that the chosen λ reduces the error. The ridge regression gave a better test error than least squares on this data; the previous question explains why this happens. The downside is that ridge keeps all of the variables, shrinking their coefficients toward zero but never exactly to zero. This does not hurt accuracy, but it makes the model more difficult to interpret.
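As a quick illustrative check (not part of the original output), the ridge coefficients at the chosen λ can be printed; unlike the lasso fit below, ridge shrinks the coefficients but does not zero out the real predictors.
ridge.coef <- predict(ridge.fit, type = "coefficients", s = ridge.lambda)
ridge.coef   # coefficients are shrunk toward zero, but no predictor is removed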
set.seed(1)
lasso.fit=cv.glmnet(xtrain, ytrain, alpha=1)
plot(lasso.fit)
lasso.lambda=lasso.fit$lambda.min
lasso.lambda
## [1] 1.945882
lasso.pred=predict(lasso.fit, s=lasso.lambda, newx = xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1394834
lasso.coeff=predict(lasso.fit, type="coefficients", s=lasso.lambda)[1:18,]
lasso.coeff
## (Intercept) (Intercept) Accept Enroll Top10perc
## -1.030320e+03 0.000000e+00 1.693638e+00 -1.100616e+00 5.104012e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -1.369969e+01 7.851307e-02 6.036558e-02 -9.861002e-02 1.475536e-01
## Books Personal PhD Terminal S.F.Ratio
## 2.185146e-01 0.000000e+00 -8.390364e+00 1.349648e+00 2.457249e+01
## perc.alumni Expend Grad.Rate
## 7.684156e-01 5.858007e-02 5.324356e+00
The lasso does not predict as well here as the ridge regression (test MSE of 1394834 versus 1260111, and slightly worse than least squares). Its upside is interpretability: it set some coefficients exactly to zero (here only Personal, apart from the redundant intercept column in the model matrix), which makes the model a little simpler to interpret, but that did not translate into better prediction accuracy.
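As a quick check on how much the lasso simplified the model (an illustrative one-liner, not in the original output), the nonzero coefficients can be counted:
sum(lasso.coeff != 0)   # number of coefficients the lasso kept at lambda.min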
library(pls)
## Warning: package 'pls' was built under R version 4.1.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit=pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type = "MSEP")
summary(pcr.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 3773 2100 2106 1812 1678 1674
## adjCV 3862 3772 2097 2104 1800 1666 1668
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1674 1625 1580 1579 1582 1586 1590
## adjCV 1669 1613 1574 1573 1576 1580 1584
## 14 comps 15 comps 16 comps 17 comps
## CV 1585 1480 1189 1123
## adjCV 1580 1463 1179 1115
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.159 57.17 64.41 70.20 75.53 80.48 84.24 87.56
## Apps 5.226 71.83 71.84 80.02 83.01 83.07 83.21 84.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.54 92.81 94.92 96.73 97.81 98.69 99.35
## Apps 85.00 85.22 85.22 85.23 85.36 85.45 89.93
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.84 93.36
pcr.pred=predict(pcr.fit, test, ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1952693
From the validation plot, the CV error levels off at around 10 components, so 10 components were used for prediction. At 10 components PCR explains 92.81% of the variance in the predictors and 85.22% of the variance in the response Apps. PCR does not outperform the previous models even though a fairly large number of components was used; as more components are added, the bias decreases but the variance increases.
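Rather than reading the component count off the plot, the minimum of the CV curve can also be located programmatically; a small sketch using the pls package's RMSEP() (my addition, not part of the original analysis):
cv.rmsep <- RMSEP(pcr.fit, estimate = "CV")$val
which.min(cv.rmsep) - 1   # minus 1 because the first entry is the intercept-only model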
pls.fit=plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 1916 1708 1496 1415 1261 1181
## adjCV 3862 1912 1706 1489 1396 1237 1170
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1166 1152 1146 1144 1144 1140 1138
## adjCV 1157 1144 1137 1135 1135 1131 1129
## 14 comps 15 comps 16 comps 17 comps
## CV 1137 1137 1137 1137
## adjCV 1129 1129 1129 1129
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.67 47.09 62.54 65.0 67.54 72.28 76.80 80.63
## Apps 76.80 82.71 87.20 90.8 92.79 93.05 93.14 93.22
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.71 85.53 88.01 90.95 93.07 95.18 96.86
## Apps 93.30 93.32 93.34 93.35 93.36 93.36 93.36
## 16 comps 17 comps
## X 98.00 100.00
## Apps 93.36 93.36
pls.pred=predict(pls.fit, test, ncomp = 7)
pls.err=mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1333314
From this validation plot the lowest CV errors fall roughly between 5 and 10 components, so 7 components were used. At 7 components PLS explains 76.80% of the variance in the predictors and 93.14% of the variance in Apps. This model comes close to the ridge regression (test MSE of 1333314 versus 1260111), but it does not outperform it; adding more components reduces the bias while increasing the variance, so the model does not end up doing better.
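An alternative, more automatic way to pick the number of PLS components is the pls package's selectNcomp(), which applies a one-standard-error rule to the CV curve (an illustrative addition, not part of the original analysis):
selectNcomp(pls.fit, method = "onesigma", plot = TRUE)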
barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err), col="darkslategray1", xlab="Regression Methods", ylab="Test Error", main="Test Errors for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
avg.apps=mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lm.r2
## [1] 0.9086432
ridge.r2=1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2
## [1] 0.9168573
lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2
## [1] 0.9079682
pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2
## [1] 0.8711605
pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2
## [1] 0.9120273
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), col="darkslategray2", xlab="Regression Methods", ylab="Test R-Squared", main="Test R-Squared for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
Based on the test R-squared values, all of the models predict with reasonable accuracy (around 0.91), except for PCR, which is noticeably less accurate at about 0.87.
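For convenience, the test errors and test R-squared values can also be gathered into one table (an illustrative addition, not part of the original output):
data.frame(Method   = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
           Test.MSE = c(lm.err, ridge.err, lasso.err, pcr.err, pls.err),
           Test.R2  = c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2))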
set.seed(1)
library(MASS)
data(Boston)
attach(Boston)
library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
predict.regsubsets = function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
for (j in 1:p) {
pred = predict(best.fit, Boston[folds == i, ], id = j)
cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
}
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")
summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*" "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
which.min(rmse.cv)
## [1] 9
boston.bsm.err=(rmse.cv[which.min(rmse.cv)])^2
boston.bsm.err
## [1] 42.81453
From the plot we can see that cross-validation selects the 9-variable model. Its CV estimate of the test MSE is 42.81453, which turns out to be the lowest of the methods tried in this problem.
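To see which variables that model actually uses, here is a hedged sketch that refits best subset selection on the full Boston data and extracts the coefficients of the 9-variable model chosen by cross-validation (not part of the original output):
best.full <- regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, id = which.min(rmse.cv))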
Lasso
boston.x=model.matrix(crim~., data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(boston.lasso)
coef(boston.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.176491
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.150484
## tax .
## ptratio .
## black .
## lstat .
## medv .
boston.lasso.err=(boston.lasso$cvm[boston.lasso$lambda==boston.lasso$lambda.1se])
boston.lasso.err
## [1] 62.74783
The lasso performs variable selection: with the default one-standard-error λ it keeps only one variable, rad, and gives a CV estimate of the MSE of 62.74783.
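For comparison (an illustrative addition), the coefficients and CV error can also be taken at lambda.min, which penalizes less heavily than the lambda.1se used above:
coef(boston.lasso, s = "lambda.min")
boston.lasso$cvm[boston.lasso$lambda == boston.lasso$lambda.min]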
Ridge regression
boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)
coef(boston.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.523899542
## zn -0.002949852
## indus 0.029276741
## chas -0.166526007
## nox 1.874769665
## rm -0.142852604
## age 0.006207995
## dis -0.094547258
## rad 0.045932737
## tax 0.002086668
## ptratio 0.071258052
## black -0.002605281
## lstat 0.035745604
## medv -0.023480540
boston.ridge.err=boston.ridge$cvm[boston.ridge$lambda==boston.ridge$lambda.1se]
boston.ridge.err
## [1] 58.8156
Compared with best subset selection, ridge regression does not perform well, with a CV MSE of 58.8156, although it does edge out the lasso at the one-standard-error λ.
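The same comparison for ridge (illustrative, not part of the original output): the CV error at lambda.min rather than lambda.1se:
boston.ridge$cvm[boston.ridge$lambda == boston.ridge$lambda.min]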
PCR
library(pls)
boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.175 7.180 6.724 6.731 6.727 6.727
## adjCV 8.61 7.174 7.179 6.721 6.725 6.724 6.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.722 6.614 6.618 6.607 6.598 6.553 6.488
## adjCV 6.718 6.609 6.613 6.602 6.592 6.546 6.481
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
Based on the CV error as well as the variance explained, the most appropriate PCR model has 8 components, which explain 93.45% of the variance in the predictors and 42.47% of the variance in the response crim.
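To compare the 8-component PCR model with the errors above on the same scale, the CV RMSEP can be squared to give an MSE (an illustrative sketch; the 9th entry of the RMSEP array corresponds to 8 components because the first entry is the intercept-only model):
pcr.cv.rmsep <- RMSEP(boston.pcr, estimate = "CV")$val
pcr.cv.rmsep[1, 1, 9]^2   # CV estimate of the MSE for the 8-component model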
The model I would propose is the first one computed, best subset selection, as it had the lowest cross-validated MSE.
That model includes only 9 of the 13 variables. The variables it keeps produce the best fit to the data; if the discarded features were added back, the predictions would only pick up more variance without gaining accuracy. Since the goal in this problem is the most accurate model with the lowest MSE, best subset selection accomplishes that.