Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
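This is the usual bias-variance trade-off. Written out for a prediction $\hat{f}(x_0)$ at a test point $x_0$ (a general identity, not tied to any particular model in this assignment):

$$E\big[(y_0-\hat{f}(x_0))^2\big]=\operatorname{Var}\big(\hat{f}(x_0)\big)+\big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2+\operatorname{Var}(\varepsilon)$$

A less flexible method typically increases the squared-bias term while decreasing the variance term, so the expected test error falls only when the variance reduction outweighs the added bias.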
library(ISLR)
data(College)
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
# Load the ISLR package to access the College dataset
library(ISLR)
# Load the College dataset
data(College)
# Set the seed for reproducibility
set.seed(123)
# Calculate the number of rows to sample for the training set (70% of the dataset)
training_size <- floor(0.7 * nrow(College))
# Sample indices for the training data
training_indices <- sample(seq_len(nrow(College)), size = training_size)
# Create the training dataset
training_set <- College[training_indices, ]
# Create the test dataset
test_set <- College[-training_indices, ]
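As a quick sanity check on the 70/30 split (777 colleges in total, so floor(0.7 * 777) = 543 observations go to training and the remaining 234 to testing):

# Confirm the sizes of the two partitions
nrow(training_set)  # expected: 543
nrow(test_set)      # expected: 234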
# Fit a least squares linear model on the training set, with Outstate as the
# response and all remaining variables as predictors
lm_model <- lm(Outstate ~ ., data = training_set)
# Make predictions on the test set
predictions <- predict(lm_model, newdata = test_set)
# Calculate the Mean Squared Error (MSE) as the test error
mse_test_error <- mean((test_set$Outstate - predictions)^2)
# Print the test error
print(mse_test_error)
## [1] 3465996
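To express this error on the scale of the response, take the square root of the test MSE; a minimal sketch, assuming mse_test_error from the chunk above is still in the workspace:

# Root mean squared error, in the same units as Outstate (dollars of out-of-state tuition)
rmse_test_error <- sqrt(mse_test_error)
print(rmse_test_error)

With the MSE reported above (about 3,466,000), this corresponds to a typical prediction error of roughly $1,860 in out-of-state tuition.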
# Load necessary libraries
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Build model matrices from the training and test sets created above.
# The -1 drops model.matrix()'s intercept column; glmnet adds its own intercept.
x_train <- model.matrix(Outstate ~ . - 1, data = training_set)
y_train <- training_set$Outstate
x_test <- model.matrix(Outstate ~ . - 1, data = test_set)
y_test <- test_set$Outstate
# Fit the ridge regression model with lambda chosen by cross-validation
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
# Determine the best lambda
best_lambda <- cv_ridge$lambda.min
# Make predictions on the test set
predictions <- predict(cv_ridge, newx = x_test, s = best_lambda)
# Calculate the test error (Mean Squared Error, MSE)
mse_test_error <- mean((y_test - predictions)^2)
# Print the test error
print(mse_test_error)
## [1] 3525209
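Besides lambda.min, cv.glmnet() also stores lambda.1se, the largest penalty whose cross-validated error is within one standard error of the minimum; it usually gives a more heavily regularized fit. A brief sketch of checking its test error, assuming cv_ridge, x_test, and y_test from the chunk above:

# Predict with the more conservative lambda.1se and recompute the test MSE
predictions_1se <- predict(cv_ridge, newx = x_test, s = "lambda.1se")
mse_test_error_1se <- mean((y_test - predictions_1se)^2)
print(mse_test_error_1se)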
# Load the glmnet package
library(glmnet)
# Prepare the data
x_train <- model.matrix(Outstate ~ . - 1, data = training_set)
y_train <- training_set$Outstate
x_test <- model.matrix(Outstate ~ . - 1, data = test_set)
y_test <- test_set$Outstate
# Fit the Lasso model with lambda chosen by cross-validation
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
# Determine the best lambda
best_lambda <- cv_lasso$lambda.min
# Make predictions on the test set
predictions <- predict(cv_lasso, newx = x_test, s = best_lambda)
# Calculate the test error (Mean Squared Error, MSE)
mse_test_error <- mean((y_test - predictions)^2)
# Count non-zero coefficients
coefficients <- coef(cv_lasso, s = best_lambda)
non_zero_coeffs <- sum(coefficients != 0) - 1 # Subtract 1 for the intercept
# Print the test error and the number of non-zero coefficients
print(paste("Test MSE:", mse_test_error))
## [1] "Test MSE: 3470770.44069077"
print(paste("Number of non-zero coefficients:", non_zero_coeffs))
## [1] "Number of non-zero coefficients: 18"
# Load the pls package
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# The predictors are passed to plsr() through the formula interface, so no
# separate model matrix is needed; keep only the test-set response for
# computing the test error later.
y_test <- test_set$Outstate
# Fit the PLS model with cross-validation to select the number of components
set.seed(123) # For reproducibility
pls_model <- plsr(Outstate ~ ., data = training_set, scale = TRUE, validation = "CV")
# Summary to find the optimal number of components
summary(pls_model)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4087 2240 2170 2125 2116 2112 2101
## adjCV 4087 2239 2171 2121 2110 2106 2095
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2104 2112 2110 2105 2103 2106 2105
## adjCV 2097 2105 2103 2098 2097 2099 2099
## 14 comps 15 comps 16 comps 17 comps
## CV 2106 2106 2106 2106
## adjCV 2099 2099 2099 2099
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 27.43 47.34 62.74 65.55 70.25 72.20 75.49
## Outstate 70.53 72.95 74.83 75.60 75.76 76.08 76.21
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 79.36 83.86 87.09 89.52 91.40 93.43 95.59
## Outstate 76.26 76.28 76.30 76.31 76.31 76.31 76.31
## 15 comps 16 comps 17 comps
## X 96.18 98.92 100.00
## Outstate 76.31 76.31 76.31
# Determine the number of components (M) to pass to predict().
# Note: which.min() here scans the full RMSEP array (CV and adjCV estimates plus
# the intercept-only fit), so the result is an array position; see the
# cross-check after this chunk's output.
optimal_M <- which.min(RMSEP(pls_model)$val)
predictions <- predict(pls_model, newdata = test_set, ncomp = optimal_M)
# Calculate the test error (Mean Squared Error, MSE)
mse_test_error <- mean((y_test - predictions)^2)
# Print the test error and the optimal number of components (M)
cat("Test MSE:", mse_test_error, "\n")
## Test MSE: 3465210
cat("Optimal number of components (M):", optimal_M, "\n")
## Optimal number of components (M): 14
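The which.min() call above indexes RMSEP(pls_model)$val, which stacks the CV and adjCV estimates (plus the intercept-only fit) into a single array, so the value 14 is a position in that array rather than the component count that minimizes the CV error. A sketch of a selection restricted to the CV estimate, assuming the pls_model object fitted above:

# Cross-validated RMSEP only; dropping the intercept-only entry makes the index
# of the minimum equal to the number of components
cv_rmsep <- RMSEP(pls_model, estimate = "CV")
m_cv <- which.min(cv_rmsep$val["CV", 1, -1])
m_cv

Read against the CV row of the summary above, this rule would pick 6 components (CV RMSEP 2101), far fewer than the 14 used for the predictions, at essentially the same cross-validated error.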
boston_data <- read.csv("C:/Users/ngaku/Downloads/Boston(3).csv")
head(boston_data)
## X crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
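The first column, X, is just a row index carried over by read.csv() (presumably row names written out when the file was saved), not a genuine predictor. One option, sketched here with a hypothetical boston_no_id data frame, is to drop it before modeling; the chunks below keep the data as loaded, which is why X appears in the coefficient tables with an essentially zero coefficient:

# Drop the row-index column so it cannot enter the models as a spurious predictor
boston_no_id <- subset(boston_data, select = -X)
head(boston_no_id)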
# Install and load the 'glmnet' package for lasso
# install.packages("glmnet")
library(glmnet)
x <- model.matrix(crim ~ .-1, data=boston_data)
y <- boston_data$crim
cv.lasso <- cv.glmnet(x, y, alpha=1)
plot(cv.lasso)
best_lambda <- cv.lasso$lambda.min
lasso_model <- glmnet(x, y, alpha=1, lambda=best_lambda)
coef(lasso_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 13.7114239727
## X -0.0009668943
## zn 0.0387561910
## indus -0.0720603542
## chas -0.6077353369
## nox -7.7711707044
## rm 0.2859536637
## age .
## dis -0.8409327634
## rad 0.5334607845
## tax -0.0001813521
## ptratio -0.2085933882
## black -0.0075193765
## lstat 0.1230771487
## medv -0.1681809654
# Refit the lasso. No random seed is set before cv.glmnet(), so the cross-validation
# folds differ from the run above and the selected lambda, and hence the coefficients,
# change slightly (in this run tax is dropped along with age).
cv.lasso <- cv.glmnet(x, y, alpha=1)
plot(cv.lasso)
best_lambda <- cv.lasso$lambda.min
lasso_model <- glmnet(x, y, alpha=1, lambda=best_lambda)
coef(lasso_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 12.4029421458
## X -0.0005431722
## zn 0.0364234327
## indus -0.0675545272
## chas -0.5773945069
## nox -6.6103691969
## rm 0.2097601083
## age .
## dis -0.7716199384
## rad 0.5186715881
## tax .
## ptratio -0.1789952229
## black -0.0075451239
## lstat 0.1222821512
## medv -0.1554263833
cv.ridge <- cv.glmnet(x, y, alpha=0)
best_lambda_ridge <- cv.ridge$lambda.min
ridge_model <- glmnet(x, y, alpha=0, lambda=best_lambda_ridge)
coef(ridge_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 9.0856096436
## X 0.0001524894
## zn 0.0329998243
## indus -0.0817470701
## chas -0.7417454620
## nox -5.3951955228
## rm 0.3291405167
## age 0.0022128418
## dis -0.7007821333
## rad 0.4226389060
## tax 0.0033170561
## ptratio -0.1362881171
## black -0.0084819938
## lstat 0.1422306362
## medv -0.1394098146
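Unlike the lasso, ridge regression shrinks coefficients toward zero but never sets them exactly to zero, which the output above illustrates. A one-line check of this, assuming the lasso_model and ridge_model objects from the chunks above:

# Count coefficients (excluding the intercept) that are exactly zero in each fit
sum(coef(lasso_model)[-1, 1] == 0)  # the lasso zeroes age (and, in the refit above, tax)
sum(coef(ridge_model)[-1, 1] == 0)  # expected to be 0: ridge keeps every predictor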
# Install and load the 'pls' package for PCR
# install.packages("pls")
library(pls)
pcr_model <- pcr(crim ~ ., data=boston_data, scale=TRUE, validation="CV")
summary(pcr_model)
## Data: X dimension: 506 14
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 14
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.101 7.092 6.784 6.798 6.800 6.868
## adjCV 8.61 7.100 7.102 6.782 6.813 6.796 6.860
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.846 6.834 6.703 6.736 6.736 6.723 6.661
## adjCV 6.838 6.826 6.684 6.725 6.725 6.712 6.648
## 14 comps
## CV 6.591
## adjCV 6.578
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 46.09 57.96 68.91 75.23 81.49 86.37 89.47 91.99
## crim 32.17 32.24 38.28 38.28 39.00 39.00 39.65 39.82
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 93.97 95.76 97.29 98.58 99.56 100.00
## crim 42.21 42.21 42.46 42.85 44.14 45.44
validationplot(pcr_model, val.type="MSEP")
These results suggest that both the Lasso and PCR models can reduce the complexity of the predictive model, the Lasso by dropping predictors and PCR by compressing the predictors into a smaller set of components, which helps mitigate overfitting and can improve out-of-sample prediction accuracy. Of the approaches considered, the Lasso is preferable here: its cross-validated error is competitive and the resulting model is simpler and easier to interpret.
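One way to back this up numerically is to compare the two methods' cross-validated mean squared errors on a common scale; a minimal sketch, assuming the cv.lasso and pcr_model objects fitted above are still available:

# Minimum cross-validated MSE for the lasso (cvm holds the mean CV error per lambda)
min(cv.lasso$cvm)
# Cross-validated MSEP for PCR, dropping the intercept-only entry, then its minimum
pcr_cv_msep <- MSEP(pcr_model, estimate = "CV")$val["CV", 1, -1]
min(pcr_cv_msep)

Both quantities are mean squared errors, so they are directly comparable; the RMSEP values in the PCR summary above are their square roots.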
The Lasso model does not retain every feature in the dataset. The reasons for excluding some features include the following (the penalty term that drives this behavior is written out after the list):
Sparsity: Lasso helps in achieving a sparse model that is easier to interpret and can be beneficial when dealing with high-dimensional data.
Regularization: The process of regularization helps in reducing overfitting by penalizing large coefficients, leading to improved model generalization on unseen data.
Feature Selection: By excluding non-informative predictors, the Lasso method improves model simplicity and focuses on the most significant predictors.
Multicollinearity: If there is multicollinearity in the data, Lasso can help by selecting among correlated predictors, which may mitigate the variance inflation caused by multicollinearity.
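The mechanism behind this feature exclusion is the $\ell_1$ penalty in the lasso objective (the standard formulation, with the tuning parameter $\lambda$ chosen by cross-validation above):

$$\min_{\beta_0,\,\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

For a sufficiently large $\lambda$, the absolute-value penalty forces some coefficients exactly to zero; that is why age (and, in the second lasso run, tax) drops out of the fitted model, whereas ridge regression, whose penalty is $\lambda\sum_j\beta_j^2$, shrinks every coefficient but keeps all of them.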