Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
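This is the usual bias-variance trade-off. Written out for a prediction $\hat{f}(x_0)$ at a test point $x_0$ (a general identity, not tied to any particular model in this assignment):

$$E\big[(y_0-\hat{f}(x_0))^2\big]=\operatorname{Var}\big(\hat{f}(x_0)\big)+\big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2+\operatorname{Var}(\varepsilon)$$

A less flexible method typically increases the squared-bias term while decreasing the variance term, so the expected test error falls only when the variance reduction outweighs the added bias.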
library(ISLR)
data(College)
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
# Load the ISLR package to access the College dataset
library(ISLR)
# Load the College dataset
data(College)
# Set the seed for reproducibility
set.seed(123)
# Calculate the number of rows to sample for the training set (70% of the dataset)
training_size <- floor(0.7 * nrow(College))
# Sample indices for the training data
training_indices <- sample(seq_len(nrow(College)), size = training_size)
# Create the training dataset
training_set <- College[training_indices, ]
# Create the test dataset
test_set <- College[-training_indices, ]
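As a quick sanity check on the 70/30 split (777 colleges in total, so floor(0.7 * 777) = 543 observations go to training and the remaining 234 to testing):

# Confirm the sizes of the two partitions
nrow(training_set)  # expected: 543
nrow(test_set)      # expected: 234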
# Fit a least squares linear model on the training set, with Outstate as the
# response and all remaining variables as predictors
lm_model <- lm(Outstate ~ ., data = training_set)
# Make predictions on the test set
predictions <- predict(lm_model, newdata = test_set)
# Calculate the Mean Squared Error (MSE) as the test error
mse_test_error <- mean((test_set$Outstate - predictions)^2)
# Print the test error
print(mse_test_error)
## [1] 3465996
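To express this error on the scale of the response, take the square root of the test MSE; a minimal sketch, assuming mse_test_error from the chunk above is still in the workspace:

# Root mean squared error, in the same units as Outstate (dollars of out-of-state tuition)
rmse_test_error <- sqrt(mse_test_error)
print(rmse_test_error)

With the MSE reported above (about 3,466,000), this corresponds to a typical prediction error of roughly $1,860 in out-of-state tuition.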
# Load necessary libraries
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Build model matrices from the training and test sets created above.
# The -1 drops model.matrix()'s intercept column; glmnet adds its own intercept.
x_train <- model.matrix(Outstate ~ . - 1, data = training_set)
y_train <- training_set$Outstate
x_test <- model.matrix(Outstate ~ . - 1, data = test_set)
y_test <- test_set$Outstate
# Fit the ridge regression model with lambda chosen by cross-validation
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
# Determine the best lambda
best_lambda <- cv_ridge$lambda.min
# Make predictions on the test set
predictions <- predict(cv_ridge, newx = x_test, s = best_lambda)
# Calculate the test error (Mean Squared Error, MSE)
mse_test_error <- mean((y_test - predictions)^2)
# Print the test error
print(mse_test_error)
## [1] 3525209
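Besides lambda.min, cv.glmnet() also stores lambda.1se, the largest penalty whose cross-validated error is within one standard error of the minimum; it usually gives a more heavily regularized fit. A brief sketch of checking its test error, assuming cv_ridge, x_test, and y_test from the chunk above:

# Predict with the more conservative lambda.1se and recompute the test MSE
predictions_1se <- predict(cv_ridge, newx = x_test, s = "lambda.1se")
mse_test_error_1se <- mean((y_test - predictions_1se)^2)
print(mse_test_error_1se)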
# Load the glmnet package
library(glmnet)
# Prepare the data
x_train <- model.matrix(Outstate ~ . - 1, data = training_set)
y_train <- training_set$Outstate
x_test <- model.matrix(Outstate ~ . - 1, data = test_set)
y_test <- test_set$Outstate
# Fit the Lasso model with lambda chosen by cross-validation
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
# Determine the best lambda
best_lambda <- cv_lasso$lambda.min
# Make predictions on the test set
predictions <- predict(cv_lasso, newx = x_test, s = best_lambda)
# Calculate the test error (Mean Squared Error, MSE)
mse_test_error <- mean((y_test - predictions)^2)
# Count non-zero coefficients
coefficients <- coef(cv_lasso, s = best_lambda)
non_zero_coeffs <- sum(coefficients != 0) - 1 # Subtract 1 for the intercept
# Print the test error and the number of non-zero coefficients
print(paste("Test MSE:", mse_test_error))
## [1] "Test MSE: 3470770.44069077"
print(paste("Number of non-zero coefficients:", non_zero_coeffs))
## [1] "Number of non-zero coefficients: 18"
# Load the pls package
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# The predictors are passed to plsr() through the formula interface, so no
# separate model matrix is needed; keep only the test-set response for
# computing the test error later.
y_test <- test_set$Outstate
# Fit the PLS model with cross-validation to select the number of components
set.seed(123) # For reproducibility
pls_model <- plsr(Outstate ~ ., data = training_set, scale = TRUE, validation = "CV")
# Summary to find the optimal number of components
summary(pls_model)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4087 2240 2170 2125 2116 2112 2101
## adjCV 4087 2239 2171 2121 2110 2106 2095
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 2104 2112 2110 2105 2103 2106 2105
## adjCV 2097 2105 2103 2098 2097 2099 2099
## 14 comps 15 comps 16 comps 17 comps
## CV 2106 2106 2106 2106
## adjCV 2099 2099 2099 2099
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 27.43 47.34 62.74 65.55 70.25 72.20 75.49
## Outstate 70.53 72.95 74.83 75.60 75.76 76.08 76.21
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 79.36 83.86 87.09 89.52 91.40 93.43 95.59
## Outstate 76.26 76.28 76.30 76.31 76.31 76.31 76.31
## 15 comps 16 comps 17 comps
## X 96.18 98.92 100.00
## Outstate 76.31 76.31 76.31
# Determine the number of components (M) to pass to predict().
# Note: which.min() here scans the full RMSEP array (CV and adjCV estimates plus
# the intercept-only fit), so the result is an array position; see the
# cross-check after this chunk's output.
optimal_M <- which.min(RMSEP(pls_model)$val)
predictions <- predict(pls_model, newdata = test_set, ncomp = optimal_M)
# Calculate the test error (Mean Squared Error, MSE)
mse_test_error <- mean((y_test - predictions)^2)
# Print the test error and the optimal number of components (M)
cat("Test MSE:", mse_test_error, "\n")
## Test MSE: 3465210
cat("Optimal number of components (M):", optimal_M, "\n")
## Optimal number of components (M): 14
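The which.min() call above indexes RMSEP(pls_model)$val, which stacks the CV and adjCV estimates (plus the intercept-only fit) into a single array, so the value 14 is a position in that array rather than the component count that minimizes the CV error. A sketch of a selection restricted to the CV estimate, assuming the pls_model object fitted above:

# Cross-validated RMSEP only; dropping the intercept-only entry makes the index
# of the minimum equal to the number of components
cv_rmsep <- RMSEP(pls_model, estimate = "CV")
m_cv <- which.min(cv_rmsep$val["CV", 1, -1])
m_cv

Read against the CV row of the summary above, this rule would pick 6 components (CV RMSEP 2101), far fewer than the 14 used for the predictions, at essentially the same cross-validated error.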
boston_data <- read.csv("C:/Users/ngaku/Downloads/Boston(3).csv")
head(boston_data)
## X crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
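The first column, X, is just a row index carried over by read.csv() (presumably row names written out when the file was saved), not a genuine predictor. One option, sketched here with a hypothetical boston_no_id data frame, is to drop it before modeling; the chunks below keep the data as loaded, which is why X appears in the coefficient tables with an essentially zero coefficient:

# Drop the row-index column so it cannot enter the models as a spurious predictor
boston_no_id <- subset(boston_data, select = -X)
head(boston_no_id)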
# Install and load the 'glmnet' package for lasso
# install.packages("glmnet")
library(glmnet)
x <- model.matrix(crim ~ .-1, data=boston_data)
y <- boston_data$crim
cv.lasso <- cv.glmnet(x, y, alpha=1)
plot(cv.lasso)
best_lambda <- cv.lasso$lambda.min
lasso_model <- glmnet(x, y, alpha=1, lambda=best_lambda)
coef(lasso_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 13.7114239727
## X -0.0009668943
## zn 0.0387561910
## indus -0.0720603542
## chas -0.6077353369
## nox -7.7711707044
## rm 0.2859536637
## age .
## dis -0.8409327634
## rad 0.5334607845
## tax -0.0001813521
## ptratio -0.2085933882
## black -0.0075193765
## lstat 0.1230771487
## medv -0.1681809654
# Refit the lasso. No random seed is set before cv.glmnet(), so the cross-validation
# folds differ from the run above and the selected lambda, and hence the coefficients,
# change slightly (in this run tax is dropped along with age).
cv.lasso <- cv.glmnet(x, y, alpha=1)
plot(cv.lasso)
best_lambda <- cv.lasso$lambda.min
lasso_model <- glmnet(x, y, alpha=1, lambda=best_lambda)
coef(lasso_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 12.4029421458
## X -0.0005431722
## zn 0.0364234327
## indus -0.0675545272
## chas -0.5773945069
## nox -6.6103691969
## rm 0.2097601083
## age .
## dis -0.7716199384
## rad 0.5186715881
## tax .
## ptratio -0.1789952229
## black -0.0075451239
## lstat 0.1222821512
## medv -0.1554263833
cv.ridge <- cv.glmnet(x, y, alpha=0)
best_lambda_ridge <- cv.ridge$lambda.min
ridge_model <- glmnet(x, y, alpha=0, lambda=best_lambda_ridge)
coef(ridge_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 9.0856096436
## X 0.0001524894
## zn 0.0329998243
## indus -0.0817470701
## chas -0.7417454620
## nox -5.3951955228
## rm 0.3291405167
## age 0.0022128418
## dis -0.7007821333
## rad 0.4226389060
## tax 0.0033170561
## ptratio -0.1362881171
## black -0.0084819938
## lstat 0.1422306362
## medv -0.1394098146
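Unlike the lasso, ridge regression shrinks coefficients toward zero but never sets them exactly to zero, which the output above illustrates. A one-line check of this, assuming the lasso_model and ridge_model objects from the chunks above:

# Count coefficients (excluding the intercept) that are exactly zero in each fit
sum(coef(lasso_model)[-1, 1] == 0)  # the lasso zeroes age (and, in the refit above, tax)
sum(coef(ridge_model)[-1, 1] == 0)  # expected to be 0: ridge keeps every predictor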
# Install and load the 'pls' package for PCR
# install.packages("pls")
library(pls)
pcr_model <- pcr(crim ~ ., data=boston_data, scale=TRUE, validation="CV")
summary(pcr_model)
## Data: X dimension: 506 14
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 14
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.101 7.092 6.784 6.798 6.800 6.868
## adjCV 8.61 7.100 7.102 6.782 6.813 6.796 6.860
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.846 6.834 6.703 6.736 6.736 6.723 6.661
## adjCV 6.838 6.826 6.684 6.725 6.725 6.712 6.648
## 14 comps
## CV 6.591
## adjCV 6.578
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 46.09 57.96 68.91 75.23 81.49 86.37 89.47 91.99
## crim 32.17 32.24 38.28 38.28 39.00 39.00 39.65 39.82
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 93.97 95.76 97.29 98.58 99.56 100.00
## crim 42.21 42.21 42.46 42.85 44.14 45.44
validationplot(pcr_model, val.type="MSEP")
These results suggest that both the Lasso and PCR models can reduce the complexity of the predictive model, the Lasso by dropping predictors and PCR by compressing the predictors into a smaller set of components, which helps mitigate overfitting and can improve out-of-sample prediction accuracy. Of the approaches considered, the Lasso is preferable here: its cross-validated error is competitive and the resulting model is simpler and easier to interpret.
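One way to back this up numerically is to compare the two methods' cross-validated mean squared errors on a common scale; a minimal sketch, assuming the cv.lasso and pcr_model objects fitted above are still available:

# Minimum cross-validated MSE for the lasso (cvm holds the mean CV error per lambda)
min(cv.lasso$cvm)
# Cross-validated MSEP for PCR, dropping the intercept-only entry, then its minimum
pcr_cv_msep <- MSEP(pcr_model, estimate = "CV")$val["CV", 1, -1]
min(pcr_cv_msep)

Both quantities are mean squared errors, so they are directly comparable; the RMSEP values in the PCR summary above are their square roots.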
The Lasso model does not retain every feature in the dataset. The reasons for excluding some features include the following (the penalty term that drives this behavior is written out after the list):
Sparsity: Lasso helps in achieving a sparse model that is easier to interpret and can be beneficial when dealing with high-dimensional data.
Regularization: The process of regularization helps in reducing overfitting by penalizing large coefficients, leading to improved model generalization on unseen data.
Feature Selection: By excluding non-informative predictors, the Lasso method improves model simplicity and focuses on the most significant predictors.
Multicollinearity: If there is multicollinearity in the data, Lasso can help by selecting among correlated predictors, which may mitigate the variance inflation caused by multicollinearity.
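The mechanism behind this feature exclusion is the $\ell_1$ penalty in the lasso objective (the standard formulation, with the tuning parameter $\lambda$ chosen by cross-validation above):

$$\min_{\beta_0,\,\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

For a sufficiently large $\lambda$, the absolute-value penalty forces some coefficients exactly to zero; that is why age (and, in the second lasso run, tax) drops out of the fitted model, whereas ridge regression, whose penalty is $\lambda\sum_j\beta_j^2$, shrinks every coefficient but keeps all of them.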