Question 1: Set the random seed at 470.
Develop a linear regression model to predict heart_disease_mortality_per_100k using the LASSO method based on the training data set, using 10-fold cross validation to select lambda.
How many variables are selected if we use lambda.1se?
glmnet
install.packages("glmnet")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/grace/AppData/Local/R/win-library/4.5’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.5/glmnet_4.1-8.zip'
Content type 'application/zip' length 2474797 bytes (2.4 MB)
downloaded 2.4 MB
package ‘glmnet’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\grace\AppData\Local\Temp\RtmpkNTMn8\downloaded_packages
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
set seed
set.seed(470)
Data
library(janitor)
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
train_data <- Heart_disease_mortality_train
Error: object 'Heart_disease_mortality_train' not found
train_data <- na.omit(train_data)
Prep
y <- train_data$heart_disease_mortality_per_100k
X <- model.matrix(heart_disease_mortality_per_100k ~ ., train_data)[, -1]
Lasso
lasso_cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
Lambda
coef_lasso <- coef(lasso_cv, s = "lambda.1se")
num_selected <- sum(coef_lasso != 0) - 1
Output
cat("Number of variables selected at lambda.1se:", num_selected, "\n")
Number of variables selected at lambda.1se: 13
Question 2:
coef_paths <- coef(lasso_cv)
plot(lasso_cv)
``
lambda_values <- lasso_cv$lambda
coef_paths <- coef(lasso_cv)
variable_names <- c(
"econ__economic_typologyFederal/State government-dependent",
"econ__economic_typologyManufacturing-dependent",
"econ__economic_typologyMining-dependent",
"econ__economic_typologyNonspecialized",
"econ__economic_typologyRecreation",
"econ__pct_civilian_labor",
"demo__pct_aged_65_years_and_older",
"demo__pct_hispanic",
"demo__pct_american_indian_or_alaskan_native",
"demo__pct_adults_bachelors_or_higher",
"demo__birth_rate_per_1k",
"demo__death_rate_per_1k",
"health__pct_adult_obesity",
"health__pct_adult_smoking",
"health__pct_diabetes",
"health__pct_excessive_drinking",
"health__pct_physical_inacticity",
"health__air_pollution_particulate_matter",
"health__motor_vehicle_crash_deaths_per_100k",
"health__pop_per_primary_care_physician"
)
coef_matrix <- coef_paths
elimination_order <- apply(coef_matrix, 1, function(x) which(x == 0)[1])
last_variable_index <- which.max(elimination_order)
last_variable <- variable_names[last_variable_index]
print(last_variable)
[1] "econ__economic_typologyManufacturing-dependent"
Question 3: Use the LASSO model (lambda.1se) to predict the observations in the testing data set. What is the MSE?
Predict
predictions <- predict(lasso_cv, s = "lambda.1se", newx = X)
mse <- mean((predictions - y)^2)
cat
cat("Mean Squared Error (MSE) on the test set:", mse, "\n")
Mean Squared Error (MSE) on the test set: 685.3974
Question 4: Use the LASSO model (lambda.min) to predict the observations in the testing data set. What is the MSE?
predictions_lambda_min <- predict(lasso_cv, s = "lambda.min", newx = X)
mse_lambda_min <- mean((predictions_lambda_min - y)^2)
cat("Mean Squared Error (MSE) on the test set using lambda.min:", mse_lambda_min, "\n")
Mean Squared Error (MSE) on the test set using lambda.min: 644.2104
Question 5: Apply the LASSO model (lambda.min) to the testing data set. On average, the predicted county-level heart disease mortality per 100K is off by _________ per 100k. (Round your answer to the nearest integer if needed)
New data prep
library(janitor)
test_data <-Heart_disease_mortality_test_1_
y_new <- test_data$heart_disease_mortality_per_100k
X_new <- model.matrix(heart_disease_mortality_per_100k ~ ., test_data)[, -1]
Predict
predictions_lambda_min_new <- predict(lasso_cv, s = "lambda.min", newx = X_new)
absolute_errors_new <- abs(predictions_lambda_min_new - y_new)
mae_lambda_min_new <- mean(absolute_errors_new)
mae_lambda_min_new_rounded <- round(mae_lambda_min_new)
cat("On average, the predicted county-level heart disease mortality per 100K is off by", mae_lambda_min_new_rounded, "per 100k.\n")
On average, the predicted county-level heart disease mortality per 100K is off by 22 per 100k.
Question 6: Recall the regression model you developed in Quiz 4, where you used all variables in the training data set as the candidates to predict heart_disease_mortality_per_100k. Let’s call this model a classic model. Use the classic model to predict the observations in the testing data set. What is the MSE?
y_test <- test_data$heart_disease_mortality_per_100k
classic_model <- lm(heart_disease_mortality_per_100k ~ ., data = train_data)
predictions_classic <- predict(classic_model, newdata = test_data)
if(length(predictions_classic) == length(y_test)) {
mse_classic <- mean((predictions_classic - y_test)^2)
cat("Mean Squared Error (MSE) on the test set using the classic model:", mse_classic, "\n")
} else {
cat("The lengths of predictions and actual values do not match. Check the data.\n")
}
Mean Squared Error (MSE) on the test set using the classic model: 785.5377
Question 7: Which model you develop above (i.e., classic model, LASSO using lambda.min, LASSO using lambda.1se) has the best prediction performance (i.e., MSE) on the testing data set?
mse_classic <- 785.5377
mse_lambda_min <- 644.2104
predictions_lambda_1se <- predict(lasso_cv, s = "lambda.1se", newx = X)
mse_lambda_1se <- mean((predictions_lambda_1se - y)^2)
cat("MSE of Classic Model:", mse_classic, "\n")
MSE of Classic Model: 785.5377
cat("MSE of LASSO using lambda.min:", mse_lambda_min, "\n")
MSE of LASSO using lambda.min: 644.2104
cat("MSE of LASSO using lambda.1se:", mse_lambda_1se, "\n")
MSE of LASSO using lambda.1se: 685.3974
mse_values <- c(classic = mse_classic, lambda_min = mse_lambda_min, lambda_1se = mse_lambda_1se)
best_model <- names(mse_values)[which.min(mse_values)]
cat("The best model based on MSE is:", best_model, "\n")
The best model based on MSE is: lambda_min
Question 8: According to the model you identified in the previous question, if the health_air_pollution_particulate_matter increases, would heart_disease_mortality_per_100k increase or decrease, assuming all other variables stay constant?
lasso_lambda_min_coefficients <- coef(lasso_cv, s = "lambda.min")
air_pollution_coeff <- lasso_lambda_min_coefficients["health__air_pollution_particulate_matter", ]
if (air_pollution_coeff > 0) {
cat("If health_air_pollution_particulate_matter increases, heart_disease_mortality_per_100k would increase.\n")
} else {
cat("If health_air_pollution_particulate_matter increases, heart_disease_mortality_per_100k would decrease.\n")
}
If health_air_pollution_particulate_matter increases, heart_disease_mortality_per_100k would decrease.