This problem involves the Advertising data set, which is a CSV file on eLC. Create a training set containing 75% of the observations and a test set containing the remaining observations. Set the random seed to 10. Before analyzing the data, remove the column “Observation”. Assume all models will use all features.
Using 5-fold cross validation, determine the optimal shrinkage parameter for the LASSO model. Report the value of \(\lambda\), and include a plot of the error.
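library(dplyr)   # %>% and select()
library(caret)   # createDataPartition(), trainControl(), train()
library(glmnet)  # glmnet() and its predict()/coef() methods
adData = read.csv("Advertising.csv")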
adData = adData %>% select(-Observation)
names(adData)
## [1] "TV" "Radio" "Newspaper" "Sales"
set.seed(10)
idx = createDataPartition(adData$Sales, p = 0.75, list = FALSE)
training = adData[idx,]
testing = adData[-idx,]
grid = expand.grid(alpha = 1, lambda = 10^seq(3,-3, length = 20))
control = trainControl(method = "cv", number = 5)
cv_lasso = train(Sales ~ .,
                 data = training,
                 method = "glmnet",
                 tuneGrid = grid,
                 trControl = control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
plot(cv_lasso)
best_lambdaL = cv_lasso$bestTune$lambda
best_lambdaL
## [1] 0.0379269
*Answer: 0.0379269
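The warning printed during the LASSO tuning is most likely harmless. A plausible explanation, stated here as an assumption rather than something the output confirms: at the largest \(\lambda\) values on the grid the LASSO shrinks every coefficient to zero, so the model predicts the same constant for every row, and the R-squared that caret reports alongside RMSE (a squared correlation) is undefined for constant predictions. The base-R sketch below uses made-up numbers to show that effect. RMSE, the default metric for choosing \(\lambda\) in a regression, is still well defined.

```r
# Hypothetical observed values for one resample (not from Advertising.csv)
obs <- c(10.4, 12.9, 9.3, 18.5, 11.8)

# An intercept-only fit predicts the same value (the mean) for every observation
pred_constant <- rep(mean(obs), length(obs))

# Squared correlation is NA because the predictions have zero variance
r_squared <- suppressWarnings(cor(pred_constant, obs)^2)
is.na(r_squared)

# RMSE is still a finite number, so lambda can still be chosen by RMSE
rmse <- sqrt(mean((obs - pred_constant)^2))
rmse
```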
Using 5-fold cross validation, determine the optimal shrinkage parameter for the Ridge model. Report the value of \(\lambda\), and include a plot of the error.
grid = expand.grid(alpha = 0, lambda = 10^seq(3,-3, length = 20))
control = trainControl(method = "cv", number = 5)
cv_ridge = train(Sales ~ .,
                 data = training,
                 method = "glmnet",
                 tuneGrid = grid,
                 trControl = control)
plot(cv_ridge)
best_lambdaR = cv_ridge$bestTune$lambda
best_lambdaR
## [1] 0.3359818
*Answer: 0.3359818
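The tuned values can only be points of the 20-value grid, because train() evaluates exactly the \(\lambda\) values supplied in tuneGrid. As a quick check, this base-R sketch rebuilds the grid and locates the two values printed by the code (0.0379269 for the LASSO, 0.3359818 for Ridge). The helper name nearest_grid_index is made up for illustration.

```r
# Same log-spaced grid as in the tuning code: 10^3 down to 10^-3
lambda_grid <- 10^seq(3, -3, length = 20)

# Hypothetical helper: index of the grid point closest to a value (on the log scale)
nearest_grid_index <- function(value, grid) {
  which.min(abs(log10(grid) - log10(value)))
}

lasso_idx <- nearest_grid_index(0.0379269, lambda_grid)
ridge_idx <- nearest_grid_index(0.3359818, lambda_grid)

# Grid values at those positions, rounded to 7 significant digits
signif(lambda_grid[c(lasso_idx, ridge_idx)], 7)
```

Neither position is at an end of the grid, which suggests the range from 10^-3 to 10^3 is wide enough for both models.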
Apply both the LASSO and Ridge models to the test data. Which has a lower test MSE?
# Build the predictor matrices once; glmnet needs numeric matrices, not data frames
x_train = as.matrix(training[, names(training) != "Sales"])
x_test = as.matrix(testing[, names(testing) != "Sales"])
# Refit each model at its cross-validated lambda and predict on the held-out data
lasso_model <- glmnet(x_train, training$Sales, alpha = 1, lambda = best_lambdaL)
lasso_pred <- predict(lasso_model, newx = x_test)
ridge_model <- glmnet(x_train, training$Sales, alpha = 0, lambda = best_lambdaR)
ridge_pred <- predict(ridge_model, newx = x_test)
# Test MSE: mean squared difference between observed and predicted Sales
lasso_mse <- mean((testing$Sales - lasso_pred)^2)
ridge_mse <- mean((testing$Sales - ridge_pred)^2)
lasso_mse
## [1] 4.878805
ridge_mse
## [1] 5.187494
*Answer: The lasso model’s test MSE is 4.878805, while the ridge model’s test MSE is 5.187494. The lasso model has the lower test MSE, so it predicts the held-out data slightly better.
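Test MSE is the average squared gap between observed and predicted values. The sketch below wraps that in a small helper (test_mse is a name chosen here, not part of glmnet or caret) and converts predictions with as.numeric(), since predict() on a glmnet fit at a single \(\lambda\) returns a one-column matrix. The numbers are toy values, not the Advertising data.

```r
# Hypothetical helper: mean squared error between observed and predicted values
test_mse <- function(actual, predicted) {
  mean((actual - as.numeric(predicted))^2)
}

# Toy data; predictions are stored as a one-column matrix, as glmnet returns them
actual <- c(10, 12, 15)
predicted <- matrix(c(11, 12, 13), ncol = 1)

toy_mse <- test_mse(actual, predicted)  # (1 + 0 + 4) / 3
toy_mse

# RMSE is on the scale of the response, which is often easier to interpret
sqrt(toy_mse)
```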
Given the results in part C, what if anything can you infer about the features in the training data?
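*Answer: Because the lasso has the lower test MSE, at least one feature probably contributes little to predicting Sales. The lasso can shrink such a coefficient to exactly zero (feature selection), whereas ridge only shrinks coefficients toward zero and keeps every feature in the model.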
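Why the lasso can drop a feature while ridge cannot is easiest to see in the textbook special case of orthonormal predictors, where both estimators have closed forms in terms of the least-squares coefficient: the lasso soft-thresholds it and ridge rescales it. This is a sketch under that simplifying assumption with invented coefficients. The exact scaling of \(\lambda\) depends on how the objective is written, so the numbers will not match glmnet's parameterization.

```r
# Invented least-squares coefficients for three hypothetical features
ols_coef <- c(strong = 3.0, weak = 0.5, tiny = -0.2)
lambda <- 0.6

# Lasso (orthonormal case): move each coefficient toward zero by lambda, stopping at zero
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

# Ridge (orthonormal case): scale every coefficient by the same factor
ridge_shrink <- function(b, lambda) b / (1 + lambda)

lasso_coef_toy <- soft_threshold(ols_coef, lambda)
ridge_coef_toy <- ridge_shrink(ols_coef, lambda)

# Side-by-side comparison of the three sets of coefficients
rbind(ols = ols_coef, lasso = lasso_coef_toy, ridge = ridge_coef_toy)
```

In this toy case the two small coefficients become exactly zero under the lasso, while ridge keeps all three at reduced size.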
For the preferred model, report the coefficients corresponding to the optimal tuning parameter. What do you observe in terms of the coefficient values, and what does this tell you?
lasso_coef <- coef(lasso_model, s = best_lambdaL)
lasso_coef
## 4 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.27359998
## TV 0.04420654
## Radio 0.18349895
## Newspaper .
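*Answer: The lasso kept TV (0.0442) and Radio (0.1835) and shrank the Newspaper coefficient to exactly zero (printed as "." in the sparse matrix). So not every variable is retained: at the optimal \(\lambda\), Newspaper adds no predictive value beyond TV and Radio, and the model drops it. A non-zero lasso coefficient means the feature is kept in the model; it is not a test of statistical significance.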
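To read off which features the lasso kept, compare each coefficient with zero. The sketch below copies the printed coefficients into a plain named vector (the "." entry means exactly zero), so it runs without the fitted model. With the real object, as.matrix(lasso_coef) should give the same numbers as a dense matrix.

```r
# Coefficients copied from the printed output; "." is an exact zero
reported_coef <- c("(Intercept)" = 3.27359998,
                   TV = 0.04420654,
                   Radio = 0.18349895,
                   Newspaper = 0)

# The intercept is not a feature, so set it aside before checking for zeros
slopes <- reported_coef[names(reported_coef) != "(Intercept)"]

kept <- names(slopes)[slopes != 0]     # features the lasso retained
dropped <- names(slopes)[slopes == 0]  # features shrunk to exactly zero

kept
dropped
```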