Problem 1

This problem involves the Advertising data set, which is a CSV file on eLC. Create a training set containing 75% of the observations and a test set containing the remaining observations. Set the random seed to 10. Before analyzing the data, remove the column “Observation”. Assume all models will use all features.

Part A

Using 5-fold cross validation, determine the optimal shrinkage parameter for the LASSO model. Report the value of \(\lambda\), and include a plot of the error.

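library(dplyr)   # select() and %>%
library(caret)   # createDataPartition(), trainControl(), train()
library(glmnet)  # glmnet(), also used by train(method = "glmnet")

adData = read.csv("Advertising.csv")
adData = adData %>% select(-Observation)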

names(adData)
## [1] "TV"        "Radio"     "Newspaper" "Sales"
set.seed(10)
idx = createDataPartition(adData$Sales, p = 0.75, list = FALSE)
training = adData[idx,]
testing = adData[-idx,]


grid = expand.grid(alpha = 1, lambda = 10^seq(3,-3, length = 20))

control = trainControl(method = "cv", number = 5)

cv_lasso = train(Sales ~. ,
                 data = training,
                 method = "glmnet",
                 tuneGrid = grid,
                 trControl = control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(cv_lasso)
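
The warning above most likely comes from the largest values of lambda in the grid. When the penalty is strong enough that the lasso sets every coefficient to zero, the model predicts the same value for every observation in a fold, and the R-squared that caret reports (a squared correlation) is undefined. The sketch below reproduces that effect with made-up numbers; `obs` and `constant_pred` are illustrative names, not objects from the analysis.

```r
# Made-up observed values and an intercept-only (constant) prediction
obs <- c(10.4, 12.1, 9.3, 15.0)
constant_pred <- rep(mean(obs), length(obs))

# The correlation with a constant vector is undefined, so R returns NA
r2_constant <- suppressWarnings(cor(constant_pred, obs)^2)
r2_constant

# RMSE is still well defined for the same predictions
rmse_constant <- sqrt(mean((obs - constant_pred)^2))
rmse_constant
```

Because caret selects the regression tuning parameter by RMSE by default, the missing R-squared values should not affect which lambda is chosen.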

best_lambdaL = cv_lasso$bestTune$lambda
best_lambdaL
## [1] 0.0379269

*Answer: 0.0379269
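
The candidate values come from a log-spaced grid, so the selected \(\lambda\) is always one of 20 fixed points, not a continuous optimum. As a quick check, the snippet below rebuilds the grid from the `expand.grid` call and locates the value printed by `best_lambdaL`; `lambda_grid` and `ratio` are names introduced here for illustration.

```r
# Same sequence as in the tuning grid: 20 values from 10^3 down to 10^-3
lambda_grid <- 10^seq(3, -3, length = 20)

# Adjacent values differ by a constant factor of 10^(6/19), roughly 2.07
ratio <- lambda_grid[1] / lambda_grid[2]
ratio

# Position of the grid value closest to the printed best lambda
which.min(abs(lambda_grid - 0.0379269))
```

A finer grid around the selected position would give a more precise \(\lambda\), at the cost of more model fits.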

Part B

Using 5-fold cross validation, determine the optimal shrinkage parameter for the Ridge model. Report the value of \(\lambda\), and include a plot of the error.

grid = expand.grid(alpha = 0, lambda = 10^seq(3,-3, length = 20))

control = trainControl(method = "cv", number = 5)

cv_ridge = train(Sales ~. ,
                 data = training,
                 method = "glmnet",
                 tuneGrid = grid,
                 trControl = control)

plot(cv_ridge)

best_lambdaR = cv_ridge$bestTune$lambda
best_lambdaR
## [1] 0.3359818

*Answer: 0.3359818
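
Parts A and B differ only in the `alpha` value in the tuning grid. For a Gaussian response, glmnet minimizes an elastic-net objective of the following form, where \(N\) is the number of training observations and \(\beta_0\), \(\beta\) are the intercept and coefficients (notation introduced here, not in the assignment):

\[
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
\]

Setting \(\alpha = 1\) leaves only the \(\ell_1\) penalty (LASSO), and \(\alpha = 0\) leaves only the squared \(\ell_2\) penalty (Ridge).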

Part C

Apply both the LASSO and Ridge models to the test data. Which has a lower test MSE?

# Build the predictor matrices once; glmnet expects a numeric matrix, not a data frame
x_train <- as.matrix(training[, names(training) != "Sales"])
x_test <- as.matrix(testing[, names(testing) != "Sales"])

lasso_model <- glmnet(x_train, training$Sales, alpha = 1, lambda = best_lambdaL)
lasso_pred <- predict(lasso_model, newx = x_test)

ridge_model <- glmnet(x_train, training$Sales, alpha = 0, lambda = best_lambdaR)
ridge_pred <- predict(ridge_model, newx = x_test)

# Test MSE: mean squared difference between observed and predicted Sales
lasso_mse <- mean((testing$Sales - lasso_pred)^2)
ridge_mse <- mean((testing$Sales - ridge_pred)^2)

lasso_mse
## [1] 4.878805
ridge_mse
## [1] 5.187494

*Answer: The lasso model’s test MSE is 4.878805, while the ridge model’s test MSE is 5.187494. The lasso model has the lower test MSE, so it predicts the test data better.
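
The test MSE is the average squared gap between observed and predicted Sales. A minimal sketch with made-up numbers, using a hypothetical helper `mse()` that is not part of the analysis above:

```r
# Hypothetical helper: mean squared error between observed and predicted values
mse <- function(actual, predicted) {
  mean((actual - as.numeric(predicted))^2)
}

# Made-up values for illustration only
actual <- c(10, 12, 9)
predicted <- c(11, 11.5, 9.5)

# Errors are -1, 0.5 and -0.5, so the squared errors are 1, 0.25 and 0.25
mse(actual, predicted)
```

The `as.numeric()` call flattens the one-column matrix that `predict()` returns for a glmnet fit, so the helper accepts either a vector or that matrix.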

Part D

Given the results in part C, what if anything can you infer about the features in the training data?

*Answer: Since the lasso has the lower test MSE, this suggests that at least one feature contributes little to predicting Sales. The lasso can shrink such coefficients to exactly zero (feature selection), whereas ridge keeps every feature in the model with a small coefficient.
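
To see why the lasso can remove features while ridge cannot, it helps to look at a simplified case. The sketch below assumes an orthonormal design and a loss scaled as one half of the residual sum of squares, under which both penalties have closed-form solutions in terms of the least-squares coefficients. The functions `soft_threshold()` and `ridge_shrink()` and all numbers are illustrative; glmnet itself uses coordinate descent on standardized features.

```r
# Lasso under an orthonormal design: shrink toward zero, then clip at zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Ridge under the same assumption: rescale, never exactly zero
ridge_shrink <- function(z, lambda) z / (1 + lambda)

# Made-up least-squares coefficients: one strong feature, two weak ones
ols_coef <- c(strong = 3, weak1 = 0.5, weak2 = -0.2)

lasso_coef_toy <- soft_threshold(ols_coef, lambda = 1)
ridge_coef_toy <- ridge_shrink(ols_coef, lambda = 1)

lasso_coef_toy  # weak features are set exactly to zero
ridge_coef_toy  # all features shrink but stay non-zero
```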

Part E

For the preferred model, report the coefficients corresponding to the optimal tuning parameter. What do you observe in terms of the coefficient values, and what does this tell you?

lasso_coef <- coef(lasso_model, s = best_lambdaL)

lasso_coef
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) 3.27359998
## TV          0.04420654
## Radio       0.18349895
## Newspaper   .

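*Answer: The lasso shrinks the Newspaper coefficient to exactly zero (shown as “.” in the output), so Newspaper is dropped from the model, while TV and Radio keep non-zero coefficients. This tells us that TV and Radio are the relevant predictors of Sales, and that Newspaper adds little once they are included. A non-zero lasso coefficient indicates that a feature was selected, not that it is statistically significant.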
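
In the printed coefficient matrix, a `.` entry denotes an exact zero. The sketch below copies the printed values into a plain named vector (so it runs without the fitted model), lists the features with non-zero slopes, and checks how a prediction responds to Newspaper spend. The helper `predict_sales()` and the budget values are hypothetical.

```r
# Coefficients copied from the printed output; "." in a sparse matrix means zero
lasso_coefs <- c("(Intercept)" = 3.27359998, TV = 0.04420654,
                 Radio = 0.18349895, Newspaper = 0)

# Features the lasso kept (non-zero slope)
slopes <- lasso_coefs[names(lasso_coefs) != "(Intercept)"]
selected <- names(slopes)[slopes != 0]
selected

# Hypothetical helper: predicted Sales for a given set of budgets
predict_sales <- function(tv, radio, newspaper) {
  unname(lasso_coefs["(Intercept)"] + lasso_coefs["TV"] * tv +
    lasso_coefs["Radio"] * radio + lasso_coefs["Newspaper"] * newspaper)
}

# Made-up budgets: changing Newspaper spend leaves the prediction unchanged
predict_sales(tv = 100, radio = 20, newspaper = 0)
predict_sales(tv = 100, radio = 20, newspaper = 50)
```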