Question 1: Set the random seed at 470.

Develop a linear regression model to predict heart_disease_mortality_per_100k using the LASSO method based on the training data set, using 10-fold cross validation to select lambda.

How many variables are selected if we use lambda.1se?

glmnet

install.packages("glmnet")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/grace/AppData/Local/R/win-library/4.5’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.5/glmnet_4.1-8.zip'
Content type 'application/zip' length 2474797 bytes (2.4 MB)
downloaded 2.4 MB
package ‘glmnet’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\grace\AppData\Local\Temp\RtmpkNTMn8\downloaded_packages
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8

set seed

set.seed(470)

Data

library(janitor)

Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test
train_data <- Heart_disease_mortality_train
Error: object 'Heart_disease_mortality_train' not found
train_data <- na.omit(train_data)

Prep

y <- train_data$heart_disease_mortality_per_100k
X <- model.matrix(heart_disease_mortality_per_100k ~ ., train_data)[, -1]

Lasso

lasso_cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

Lambda

coef_lasso <- coef(lasso_cv, s = "lambda.1se")
num_selected <- sum(coef_lasso != 0) - 1

Output

cat("Number of variables selected at lambda.1se:", num_selected, "\n")
Number of variables selected at lambda.1se: 13 

Question 2:

coef_paths <- coef(lasso_cv)
plot(lasso_cv)
``
lambda_values <- lasso_cv$lambda
coef_paths <- coef(lasso_cv)
variable_names <- c(
  "econ__economic_typologyFederal/State government-dependent",
  "econ__economic_typologyManufacturing-dependent",
  "econ__economic_typologyMining-dependent",
  "econ__economic_typologyNonspecialized",
  "econ__economic_typologyRecreation",
  "econ__pct_civilian_labor",
  "demo__pct_aged_65_years_and_older",
  "demo__pct_hispanic",
  "demo__pct_american_indian_or_alaskan_native",
  "demo__pct_adults_bachelors_or_higher",
  "demo__birth_rate_per_1k",
  "demo__death_rate_per_1k",
  "health__pct_adult_obesity",
  "health__pct_adult_smoking",
  "health__pct_diabetes",
  "health__pct_excessive_drinking",
  "health__pct_physical_inacticity",
  "health__air_pollution_particulate_matter",
  "health__motor_vehicle_crash_deaths_per_100k",
  "health__pop_per_primary_care_physician"
)
coef_matrix <- coef_paths
elimination_order <- apply(coef_matrix, 1, function(x) which(x == 0)[1])
last_variable_index <- which.max(elimination_order)
last_variable <- variable_names[last_variable_index]
print(last_variable)
[1] "econ__economic_typologyManufacturing-dependent"

Question 3: Use the LASSO model (lambda.1se) to predict the observations in the testing data set. What is the MSE?

Predict

predictions <- predict(lasso_cv, s = "lambda.1se", newx = X)
mse <- mean((predictions - y)^2)

cat

cat("Mean Squared Error (MSE) on the test set:", mse, "\n")
Mean Squared Error (MSE) on the test set: 685.3974 

Question 4: Use the LASSO model (lambda.min) to predict the observations in the testing data set. What is the MSE?

predictions_lambda_min <- predict(lasso_cv, s = "lambda.min", newx = X)
mse_lambda_min <- mean((predictions_lambda_min - y)^2)
cat("Mean Squared Error (MSE) on the test set using lambda.min:", mse_lambda_min, "\n")
Mean Squared Error (MSE) on the test set using lambda.min: 644.2104 

Question 5: Apply the LASSO model (lambda.min) to the testing data set. On average, the predicted county-level heart disease mortality per 100K is off by _________ per 100k. (Round your answer to the nearest integer if needed)

New data prep

library(janitor)
test_data <-Heart_disease_mortality_test_1_
y_new <- test_data$heart_disease_mortality_per_100k
X_new <- model.matrix(heart_disease_mortality_per_100k ~ ., test_data)[, -1]

Predict

predictions_lambda_min_new <- predict(lasso_cv, s = "lambda.min", newx = X_new)
absolute_errors_new <- abs(predictions_lambda_min_new - y_new)
mae_lambda_min_new <- mean(absolute_errors_new)
mae_lambda_min_new_rounded <- round(mae_lambda_min_new)
cat("On average, the predicted county-level heart disease mortality per 100K is off by", mae_lambda_min_new_rounded, "per 100k.\n")
On average, the predicted county-level heart disease mortality per 100K is off by 22 per 100k.

Question 6: Recall the regression model you developed in Quiz 4, where you used all variables in the training data set as the candidates to predict heart_disease_mortality_per_100k. Let’s call this model a classic model. Use the classic model to predict the observations in the testing data set. What is the MSE?

y_test <- test_data$heart_disease_mortality_per_100k
classic_model <- lm(heart_disease_mortality_per_100k ~ ., data = train_data)
predictions_classic <- predict(classic_model, newdata = test_data)
if(length(predictions_classic) == length(y_test)) {
  mse_classic <- mean((predictions_classic - y_test)^2)

  cat("Mean Squared Error (MSE) on the test set using the classic model:", mse_classic, "\n")
} else {
  cat("The lengths of predictions and actual values do not match. Check the data.\n")
}
Mean Squared Error (MSE) on the test set using the classic model: 785.5377 

Question 7: Which model you develop above (i.e., classic model, LASSO using lambda.min, LASSO using lambda.1se) has the best prediction performance (i.e., MSE) on the testing data set?

mse_classic <- 785.5377
mse_lambda_min <- 644.2104 
predictions_lambda_1se <- predict(lasso_cv, s = "lambda.1se", newx = X)
mse_lambda_1se <- mean((predictions_lambda_1se - y)^2)
cat("MSE of Classic Model:", mse_classic, "\n")
MSE of Classic Model: 785.5377 
cat("MSE of LASSO using lambda.min:", mse_lambda_min, "\n")
MSE of LASSO using lambda.min: 644.2104 
cat("MSE of LASSO using lambda.1se:", mse_lambda_1se, "\n")
MSE of LASSO using lambda.1se: 685.3974 
mse_values <- c(classic = mse_classic, lambda_min = mse_lambda_min, lambda_1se = mse_lambda_1se)
best_model <- names(mse_values)[which.min(mse_values)]
cat("The best model based on MSE is:", best_model, "\n")
The best model based on MSE is: lambda_min 

Question 8: According to the model you identified in the previous question, if the health_air_pollution_particulate_matter increases, would heart_disease_mortality_per_100k increase or decrease, assuming all other variables stay constant?

lasso_lambda_min_coefficients <- coef(lasso_cv, s = "lambda.min")
air_pollution_coeff <- lasso_lambda_min_coefficients["health__air_pollution_particulate_matter", ]
if (air_pollution_coeff > 0) {
  cat("If health_air_pollution_particulate_matter increases, heart_disease_mortality_per_100k would increase.\n")
} else {
  cat("If health_air_pollution_particulate_matter increases, heart_disease_mortality_per_100k would decrease.\n")
}
If health_air_pollution_particulate_matter increases, heart_disease_mortality_per_100k would decrease.
---
title: Quiz 5
output: html_notebook
---


Question 1: Set the random seed at 470. 

Develop a linear regression model to predict heart_disease_mortality_per_100k using the LASSO method based on the training data set, using 10-fold cross validation to select lambda.

How many variables are selected if we use lambda.1se?

glmnet
```{r}
install.packages("glmnet")

```

```{r}
library(glmnet)
```

set seed
```{r}
set.seed(470)
```

Data
```{r}
library(janitor)
train_data <- Heart_disease_mortality_train
```

```{r}
train_data <- na.omit(train_data)
```

Prep
```{r}
y <- train_data$heart_disease_mortality_per_100k
X <- model.matrix(heart_disease_mortality_per_100k ~ ., train_data)[, -1]
```

Lasso
```{r}
lasso_cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
```

Lambda
```{r}
coef_lasso <- coef(lasso_cv, s = "lambda.1se")
num_selected <- sum(coef_lasso != 0) - 1
```

Output
```{r}
cat("Number of variables selected at lambda.1se:", num_selected, "\n")
```


Question 2:
```{r}
coef_paths <- coef(lasso_cv)
plot(lasso_cv)
``
```

```{r}
lambda_values <- lasso_cv$lambda
```
```{r}
coef_paths <- coef(lasso_cv)
```
```{r}
variable_names <- c(
  "econ__economic_typologyFederal/State government-dependent",
  "econ__economic_typologyManufacturing-dependent",
  "econ__economic_typologyMining-dependent",
  "econ__economic_typologyNonspecialized",
  "econ__economic_typologyRecreation",
  "econ__pct_civilian_labor",
  "demo__pct_aged_65_years_and_older",
  "demo__pct_hispanic",
  "demo__pct_american_indian_or_alaskan_native",
  "demo__pct_adults_bachelors_or_higher",
  "demo__birth_rate_per_1k",
  "demo__death_rate_per_1k",
  "health__pct_adult_obesity",
  "health__pct_adult_smoking",
  "health__pct_diabetes",
  "health__pct_excessive_drinking",
  "health__pct_physical_inacticity",
  "health__air_pollution_particulate_matter",
  "health__motor_vehicle_crash_deaths_per_100k",
  "health__pop_per_primary_care_physician"
)
```
```{r}
coef_matrix <- coef_paths
elimination_order <- apply(coef_matrix, 1, function(x) which(x == 0)[1])
```

```{r}
last_variable_index <- which.max(elimination_order)
```

```{r}
last_variable <- variable_names[last_variable_index]
print(last_variable)
```

Question 3: Use the LASSO model (lambda.1se) to predict the observations in the testing data set. What is the MSE?

Predict
```{r}
predictions <- predict(lasso_cv, s = "lambda.1se", newx = X)
```
```{r}
mse <- mean((predictions - y)^2)
```

cat
```{r}
cat("Mean Squared Error (MSE) on the test set:", mse, "\n")
```

Question 4: Use the LASSO model (lambda.min) to predict the observations in the testing data set. What is the MSE?

```{r}
predictions_lambda_min <- predict(lasso_cv, s = "lambda.min", newx = X)
```

```{r}
mse_lambda_min <- mean((predictions_lambda_min - y)^2)
```
```{r}
cat("Mean Squared Error (MSE) on the test set using lambda.min:", mse_lambda_min, "\n")
```

Question 5: Apply the LASSO model (lambda.min) to the testing data set. On average, the predicted county-level heart disease mortality per 100K is off by _________ per 100k. (Round your answer to the nearest integer if needed)

New data prep
```{r}
library(janitor)
test_data <-Heart_disease_mortality_test_1_
```

```{r}
y_new <- test_data$heart_disease_mortality_per_100k
X_new <- model.matrix(heart_disease_mortality_per_100k ~ ., test_data)[, -1]
```
Predict
```{r}
predictions_lambda_min_new <- predict(lasso_cv, s = "lambda.min", newx = X_new)
```

```{r}
absolute_errors_new <- abs(predictions_lambda_min_new - y_new)
```

```{r}
mae_lambda_min_new <- mean(absolute_errors_new)
```

```{r}
mae_lambda_min_new_rounded <- round(mae_lambda_min_new)
```

```{r}
cat("On average, the predicted county-level heart disease mortality per 100K is off by", mae_lambda_min_new_rounded, "per 100k.\n")
```

Question 6: Recall the regression model you developed in Quiz 4, where you used all variables in the training data set as the candidates to predict heart_disease_mortality_per_100k.  Let's call this model a classic model.
Use the classic model to predict the observations in the testing data set. What is the MSE?

```{r}
y_test <- test_data$heart_disease_mortality_per_100k
```


```{r}
classic_model <- lm(heart_disease_mortality_per_100k ~ ., data = train_data)
```

```{r}
predictions_classic <- predict(classic_model, newdata = test_data)
```

```{r}
if(length(predictions_classic) == length(y_test)) {
  mse_classic <- mean((predictions_classic - y_test)^2)

  cat("Mean Squared Error (MSE) on the test set using the classic model:", mse_classic, "\n")
} else {
  cat("The lengths of predictions and actual values do not match. Check the data.\n")
}
```

Question 7: Which model you develop above (i.e., classic model, LASSO using lambda.min, LASSO using lambda.1se) has the best prediction performance (i.e., MSE) on the testing data set?

```{r}
mse_classic <- 785.5377
mse_lambda_min <- 644.2104 
predictions_lambda_1se <- predict(lasso_cv, s = "lambda.1se", newx = X)
mse_lambda_1se <- mean((predictions_lambda_1se - y)^2)

```

```{r}
cat("MSE of Classic Model:", mse_classic, "\n")
cat("MSE of LASSO using lambda.min:", mse_lambda_min, "\n")
cat("MSE of LASSO using lambda.1se:", mse_lambda_1se, "\n")
```
```{r}
mse_values <- c(classic = mse_classic, lambda_min = mse_lambda_min, lambda_1se = mse_lambda_1se)
best_model <- names(mse_values)[which.min(mse_values)]

```

```{r}
cat("The best model based on MSE is:", best_model, "\n")
```

Question 8: According to the model you identified in the previous question, if the health_air_pollution_particulate_matter increases, would heart_disease_mortality_per_100k increase or decrease, assuming all other variables stay constant?

```{r}
lasso_lambda_min_coefficients <- coef(lasso_cv, s = "lambda.min")
```

```{r}
air_pollution_coeff <- lasso_lambda_min_coefficients["health__air_pollution_particulate_matter", ]

```

```{r}
if (air_pollution_coeff > 0) {
  cat("If health_air_pollution_particulate_matter increases, heart_disease_mortality_per_100k would increase.\n")
} else {
  cat("If health_air_pollution_particulate_matter increases, heart_disease_mortality_per_100k would decrease.\n")
}
```
























