library(dplyr)
library(ggplot2)

1. Reading the dataset

library(readr)
dataset <- read_csv("/home/harpo/Dropbox/ongoing-work/git-repos/complex-systems-2023-final/datasets/results-1511/dataset.csv")
Rows: 532 Columns: 648
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr   (1): name
dbl (647): psd_1, psd_2, psd_3, psd_4, psd_5, psd_6, psd_7, psd_8, psd_9, psd_10, psd_11, psd_12, psd_13, psd_14, psd_15, psd_16, p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)

1.1 remove predictor with zero variance

library(caret)
# Assuming 'dataset' is your dataset
nzv <- nearZeroVar(dataset, saveMetrics = FALSE)
dataset <- dataset[, -nzv]

2. Splitting the dataset

We will split the dataset into a training set (70%) and a testing set (30%).

set.seed(123) # for reproducibility
sample_index <- sample(1:nrow(dataset), 0.7*nrow(dataset))
train_data <- dataset[sample_index, ]
test_data <- dataset[-sample_index, ]

3. Creating a regression model

Using the glmnet package, we’ll predict the tmg feature

library(glmnet)
# Assuming 'dataset' is scaled, if not, you must scale it before running glmnet
y <- train_data$tmg
x <- train_data %>% select(-name,-tmg) %>% as.matrix()

Normalize the data

Standardization (Z-score normalization): This method transforms each feature to have a mean of 0 and a standard deviation of 1. It’s particularly useful when your features have different units or very different scales.

library(caret)
# Assuming 'dataset' is your original dataset and 'tmg' is the response variable
scaled_x <- preProcess(x, method = c("center", "scale"))
x <- predict(scaled_x, x)

In the context of Elastic Net regression, the alpha parameter is a crucial component that balances the mix between Lasso (L1) and Ridge (L2) regularization methods. Elastic Net is a regularization technique that combines both L1 and L2 penalties, which are used to prevent overfitting by adding a penalty to the model’s loss function.

Here’s a breakdown of the alpha parameter:

  1. Range: alpha can take on any value between 0 and 1 (inclusive).

    • alpha = 1: The penalty is entirely Lasso (L1 regularization).
    • alpha = 0: The penalty is entirely Ridge (L2 regularization).
    • alpha between 0 and 1: A combination of Lasso and Ridge.
  2. Effect of L1 (Lasso) Regularization: L1 regularization adds a penalty equal to the absolute value of the magnitude of coefficients. This can lead to some coefficients being exactly zero, which is useful for feature selection if you have a large number of features.

  3. Effect of L2 (Ridge) Regularization: L2 regularization adds a penalty equal to the square of the magnitude of coefficients. This tends to shrink the coefficients but does not set them to zero, which is useful when you have correlated features.

  4. Choosing alpha:

    • If you have a lot of features that you suspect are not all useful, a value closer to 1 (more Lasso) might be more appropriate as it will perform feature selection.
    • If all your features are believed to be important, or you have a small number of features, a value closer to 0 (more Ridge) might work better.
    • Often, the best way to choose an alpha value is through cross-validation, trying different values and selecting the one that minimizes prediction error.
  5. Interaction with lambda: The lambda parameter in Elastic Net controls the overall strength of the penalty. So, the effect of alpha is in conjunction with lambda. A grid search over both alpha and lambda is a common practice to find the best combination that minimizes cross-validation error.

In summary, the alpha parameter in Elastic Net allows you to balance the type of regularization applied to your model, providing the flexibility to choose between Lasso, Ridge, or a mix of both based on your data and the specific requirements of your problem.

# Elastic Net model
set.seed(123)
cv_model <- cv.glmnet(x, y, alpha = 0.5) # alpha=0.5 indicates Elastic Net
best_lambda <- cv_model$lambda.min
model_en <- glmnet(x, y, alpha = 0.5, lambda = best_lambda)
print(model_en)

Call:  glmnet(x = x, y = y, alpha = 0.5, lambda = best_lambda) 

  Df  %Dev   Lambda
1 52 87.56 0.001445

4. Calculate importance using elasticnet coeficients

# The variable importance is inferred from the coefficients
c <- coef(model_en)
predictor <- c  %>% rownames()
importance <- c %>% as.matrix()
importance <- data.frame(predictor,importance)
rownames(importance) <- NULL
names(importance)[2] <- "importance"
importance<-importance[-1,] # Exclude intercept
importance <- importance[order(-importance$importance), ]
importance <- head(importance, 20)
library(dplyr)
library(ggplot2)
plot_perm <-importance  %>%  mutate(predictor = factor(predictor, levels = rev(unique(predictor)))) %>%
  ggplot()+
  geom_col(aes(y=predictor,x=importance),fill='darkblue', color='gray')+
  ggtitle("Top 20 predictor importance using glmnet")+
  theme_minimal()
plot_perm

5. Evaluate results on test dataset

x_test <- test_data %>% select(-name,-tmg) %>% as.matrix()
y_test <- test_data$tmg

x_test <-predict(scaled_x, x_test)

predictions <- predict(model_en, s = best_lambda, newx = x_test)
# Compute the RMSE (Root Mean Square Error)
RMSE <- sqrt(mean((predictions - y_test)^2))
print(RMSE)
[1] 0.04124405

6. Plot: Predicted vs Reference values


results <- data.frame(Reference = y_test, Predicted = as.vector(predictions))
ggplot(results, aes(x = Reference, y = Predicted)) +
  geom_point(color='blue') +
  geom_abline(intercept = 0, slope = 1, color='red') +
  ggtitle("Predicted vs Reference values") +
  xlab("Reference Values") +
  ylab("Predicted Values") +
  theme_bw()

LS0tCnRpdGxlOiAiUmVncmVzc2lvbiBNb2RlbCB1c2luZyBFbGFzdGljIE5ldCBmb3IgRmVOaSIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKYGBge3Igc2V0dXAsIGluY2x1ZGU9RkFMU0V9CmtuaXRyOjpvcHRzX2NodW5rJHNldChlY2hvID0gVFJVRSkKYGBgCgpgYGB7cn0KbGlicmFyeShkcGx5cikKbGlicmFyeShnZ3Bsb3QyKQpgYGAKCiMjIDEuIFJlYWRpbmcgdGhlIGRhdGFzZXQKCmBgYHtyfQpsaWJyYXJ5KHJlYWRyKQpkYXRhc2V0IDwtIHJlYWRfY3N2KCIvaG9tZS9oYXJwby9Ecm9wYm94L29uZ29pbmctd29yay9naXQtcmVwb3MvY29tcGxleC1zeXN0ZW1zLTIwMjMtZmluYWwvZGF0YXNldHMvcmVzdWx0cy0xNTExL2RhdGFzZXQuY3N2IikKaGVhZChkYXRhc2V0KQpgYGAKCiMjIDEuMSByZW1vdmUgcHJlZGljdG9yIHdpdGggemVybyB2YXJpYW5jZQoKYGBge3J9CmxpYnJhcnkoY2FyZXQpCm56diA8LSBuZWFyWmVyb1ZhcihkYXRhc2V0LCBzYXZlTWV0cmljcyA9IEZBTFNFKQpkYXRhc2V0IDwtIGRhdGFzZXRbLCAtbnp2XQpgYGAKCgojIyAyLiBTcGxpdHRpbmcgdGhlIGRhdGFzZXQKV2Ugd2lsbCBzcGxpdCB0aGUgZGF0YXNldCBpbnRvIGEgdHJhaW5pbmcgc2V0ICg3MCUpIGFuZCBhIHRlc3Rpbmcgc2V0ICgzMCUpLgoKYGBge3J9CnNldC5zZWVkKDEyMykgIyBmb3IgcmVwcm9kdWNpYmlsaXR5CnNhbXBsZV9pbmRleCA8LSBzYW1wbGUoMTpucm93KGRhdGFzZXQpLCAwLjcqbnJvdyhkYXRhc2V0KSkKdHJhaW5fZGF0YSA8LSBkYXRhc2V0W3NhbXBsZV9pbmRleCwgXQp0ZXN0X2RhdGEgPC0gZGF0YXNldFstc2FtcGxlX2luZGV4LCBdCmBgYAoKIyMgMy4gQ3JlYXRpbmcgYSByZWdyZXNzaW9uIG1vZGVsClVzaW5nIHRoZSBnbG1uZXQgcGFja2FnZSwgd2UnbGwgcHJlZGljdCB0aGUgYHRtZ2AgZmVhdHVyZQoKYGBge3J9CmxpYnJhcnkoZ2xtbmV0KQp5IDwtIHRyYWluX2RhdGEkdG1nCnggPC0gdHJhaW5fZGF0YSAlPiUgc2VsZWN0KC1uYW1lLC10bWcpICU+JSBhcy5tYXRyaXgoKQpgYGAKCiMjIyBOb3JtYWxpemUgdGhlIGRhdGEKClN0YW5kYXJkaXphdGlvbiAoWi1zY29yZSBub3JtYWxpemF0aW9uKTogVGhpcyBtZXRob2QgdHJhbnNmb3JtcyBlYWNoIGZlYXR1cmUgdG8gaGF2ZSBhIG1lYW4gb2YgMCBhbmQgYSBzdGFuZGFyZCBkZXZpYXRpb24gb2YgMS4gSXQncyBwYXJ0aWN1bGFybHkgdXNlZnVsIHdoZW4geW91ciBmZWF0dXJlcyBoYXZlIGRpZmZlcmVudCB1bml0cyBvciB2ZXJ5IGRpZmZlcmVudCBzY2FsZXMuCmBgYHtyfQpsaWJyYXJ5KGNhcmV0KQpzY2FsZWRfeCA8LSBwcmVQcm9jZXNzKHgsIG1ldGhvZCA9IGMoImNlbnRlciIsICJzY2FsZSIpKQp4IDwtIHByZWRpY3Qoc2NhbGVkX3gsIHgpCmBgYAoKSW4gdGhlIGNvbnRleHQgb2YgRWxhc3RpYyBOZXQgcmVncmVzc2lvbiwgdGhlIGBhbHBoYWAgcGFyYW1ldGVyIGlzIGEgY3J1Y2lhbCBjb21wb25lbnQgdGhhdCBiYWxhbmNlcyB0aGUgbWl4IGJldHdlZW4gTGFzc28gKEwxKSBhbmQgUmlkZ2UgKEwyKSByZWd1bGFyaXphdGlvbiBtZXRob2RzLiBFbGFzdGljIE5ldCBpcyBhIHJlZ3VsYXJpemF0aW9uIHRlY2huaXF1ZSB0aGF0IGNvbWJpbmVzIGJvdGggTDEgYW5kIEwyIHBlbmFsdGllcywgd2hpY2ggYXJlIHVzZWQgdG8gcHJldmVudCBvdmVyZml0dGluZyBieSBhZGRpbmcgYSBwZW5hbHR5IHRvIHRoZSBtb2RlbCdzIGxvc3MgZnVuY3Rpb24uCgpIZXJlJ3MgYSBicmVha2Rvd24gb2YgdGhlIGBhbHBoYWAgcGFyYW1ldGVyOgoKMS4gKipSYW5nZSoqOiBgYWxwaGFgIGNhbiB0YWtlIG9uIGFueSB2YWx1ZSBiZXR3ZWVuIDAgYW5kIDEgKGluY2x1c2l2ZSkuIAogICAtIGBhbHBoYSA9IDFgOiBUaGUgcGVuYWx0eSBpcyBlbnRpcmVseSBMYXNzbyAoTDEgcmVndWxhcml6YXRpb24pLgogICAtIGBhbHBoYSA9IDBgOiBUaGUgcGVuYWx0eSBpcyBlbnRpcmVseSBSaWRnZSAoTDIgcmVndWxhcml6YXRpb24pLgogICAtIGBhbHBoYWAgYmV0d2VlbiAwIGFuZCAxOiBBIGNvbWJpbmF0aW9uIG9mIExhc3NvIGFuZCBSaWRnZS4KCjIuICoqRWZmZWN0IG9mIEwxIChMYXNzbykgUmVndWxhcml6YXRpb24qKjogTDEgcmVndWxhcml6YXRpb24gYWRkcyBhIHBlbmFsdHkgZXF1YWwgdG8gdGhlIGFic29sdXRlIHZhbHVlIG9mIHRoZSBtYWduaXR1ZGUgb2YgY29lZmZpY2llbnRzLiBUaGlzIGNhbiBsZWFkIHRvIHNvbWUgY29lZmZpY2llbnRzIGJlaW5nIGV4YWN0bHkgemVybywgd2hpY2ggaXMgdXNlZnVsIGZvciBmZWF0dXJlIHNlbGVjdGlvbiBpZiB5b3UgaGF2ZSBhIGxhcmdlIG51bWJlciBvZiBmZWF0dXJlcy4KCjMuICoqRWZmZWN0IG9mIEwyIChSaWRnZSkgUmVndWxhcml6YXRpb24qKjogTDIgcmVndWxhcml6YXRpb24gYWRkcyBhIHBlbmFsdHkgZXF1YWwgdG8gdGhlIHNxdWFyZSBvZiB0aGUgbWFnbml0dWRlIG9mIGNvZWZmaWNpZW50cy4gVGhpcyB0ZW5kcyB0byBzaHJpbmsgdGhlIGNvZWZmaWNpZW50cyBidXQgZG9lcyBub3Qgc2V0IHRoZW0gdG8gemVybywgd2hpY2ggaXMgdXNlZnVsIHdoZW4geW91IGhhdmUgY29ycmVsYXRlZCBmZWF0dXJlcy4KCjQuICoqQ2hvb3NpbmcgYGFscGhhYCoqOiAKICAgLSBJZiB5b3UgaGF2ZSBhIGxvdCBvZiBmZWF0dXJlcyB0aGF0IHlvdSBzdXNwZWN0IGFyZSBub3QgYWxsIHVzZWZ1bCwgYSB2YWx1ZSBjbG9zZXIgdG8gMSAobW9yZSBMYXNzbykgbWlnaHQgYmUgbW9yZSBhcHByb3ByaWF0ZSBhcyBpdCB3aWxsIHBlcmZvcm0gZmVhdHVyZSBzZWxlY3Rpb24uCiAgIC0gSWYgYWxsIHlvdXIgZmVhdHVyZXMgYXJlIGJlbGlldmVkIHRvIGJlIGltcG9ydGFudCwgb3IgeW91IGhhdmUgYSBzbWFsbCBudW1iZXIgb2YgZmVhdHVyZXMsIGEgdmFsdWUgY2xvc2VyIHRvIDAgKG1vcmUgUmlkZ2UpIG1pZ2h0IHdvcmsgYmV0dGVyLgogICAtIE9mdGVuLCB0aGUgYmVzdCB3YXkgdG8gY2hvb3NlIGFuIGBhbHBoYWAgdmFsdWUgaXMgdGhyb3VnaCBjcm9zcy12YWxpZGF0aW9uLCB0cnlpbmcgZGlmZmVyZW50IHZhbHVlcyBhbmQgc2VsZWN0aW5nIHRoZSBvbmUgdGhhdCBtaW5pbWl6ZXMgcHJlZGljdGlvbiBlcnJvci4KCjUuICoqSW50ZXJhY3Rpb24gd2l0aCBgbGFtYmRhYCoqOiBUaGUgYGxhbWJkYWAgcGFyYW1ldGVyIGluIEVsYXN0aWMgTmV0IGNvbnRyb2xzIHRoZSBvdmVyYWxsIHN0cmVuZ3RoIG9mIHRoZSBwZW5hbHR5LiBTbywgdGhlIGVmZmVjdCBvZiBgYWxwaGFgIGlzIGluIGNvbmp1bmN0aW9uIHdpdGggYGxhbWJkYWAuIEEgZ3JpZCBzZWFyY2ggb3ZlciBib3RoIGBhbHBoYWAgYW5kIGBsYW1iZGFgIGlzIGEgY29tbW9uIHByYWN0aWNlIHRvIGZpbmQgdGhlIGJlc3QgY29tYmluYXRpb24gdGhhdCBtaW5pbWl6ZXMgY3Jvc3MtdmFsaWRhdGlvbiBlcnJvci4KCkluIHN1bW1hcnksIHRoZSBgYWxwaGFgIHBhcmFtZXRlciBpbiBFbGFzdGljIE5ldCBhbGxvd3MgeW91IHRvIGJhbGFuY2UgdGhlIHR5cGUgb2YgcmVndWxhcml6YXRpb24gYXBwbGllZCB0byB5b3VyIG1vZGVsLCBwcm92aWRpbmcgdGhlIGZsZXhpYmlsaXR5IHRvIGNob29zZSBiZXR3ZWVuIExhc3NvLCBSaWRnZSwgb3IgYSBtaXggb2YgYm90aCBiYXNlZCBvbiB5b3VyIGRhdGEgYW5kIHRoZSBzcGVjaWZpYyByZXF1aXJlbWVudHMgb2YgeW91ciBwcm9ibGVtLgoKYGBge3J9CiMgRWxhc3RpYyBOZXQgbW9kZWwKc2V0LnNlZWQoMTIzKQpjdl9tb2RlbCA8LSBjdi5nbG1uZXQoeCwgeSwgYWxwaGEgPSAwLjUpICMgYWxwaGE9MC41IGluZGljYXRlcyBFbGFzdGljIE5ldApiZXN0X2xhbWJkYSA8LSBjdl9tb2RlbCRsYW1iZGEubWluCm1vZGVsX2VuIDwtIGdsbW5ldCh4LCB5LCBhbHBoYSA9IDAuNSwgbGFtYmRhID0gYmVzdF9sYW1iZGEpCnByaW50KG1vZGVsX2VuKQpgYGAKIyMgNC4gQ2FsY3VsYXRlIGltcG9ydGFuY2UgdXNpbmcgZWxhc3RpY25ldCBjb2VmaWNpZW50cwoKYGBge3J9CiMgVGhlIHZhcmlhYmxlIGltcG9ydGFuY2UgaXMgaW5mZXJyZWQgZnJvbSB0aGUgY29lZmZpY2llbnRzCmMgPC0gY29lZihtb2RlbF9lbikKcHJlZGljdG9yIDwtIGMgICU+JSByb3duYW1lcygpCmltcG9ydGFuY2UgPC0gYyAlPiUgYXMubWF0cml4KCkKaW1wb3J0YW5jZSA8LSBkYXRhLmZyYW1lKHByZWRpY3RvcixpbXBvcnRhbmNlKQpyb3duYW1lcyhpbXBvcnRhbmNlKSA8LSBOVUxMCm5hbWVzKGltcG9ydGFuY2UpWzJdIDwtICJpbXBvcnRhbmNlIgppbXBvcnRhbmNlPC1pbXBvcnRhbmNlWy0xLF0gIyBFeGNsdWRlIGludGVyY2VwdAppbXBvcnRhbmNlIDwtIGltcG9ydGFuY2Vbb3JkZXIoLWltcG9ydGFuY2UkaW1wb3J0YW5jZSksIF0KaW1wb3J0YW5jZSA8LSBoZWFkKGltcG9ydGFuY2UsIDIwKQpgYGAKCgpgYGB7cn0KbGlicmFyeShkcGx5cikKbGlicmFyeShnZ3Bsb3QyKQpwbG90X3Blcm0gPC1pbXBvcnRhbmNlICAlPiUgIG11dGF0ZShwcmVkaWN0b3IgPSBmYWN0b3IocHJlZGljdG9yLCBsZXZlbHMgPSByZXYodW5pcXVlKHByZWRpY3RvcikpKSkgJT4lCiAgZ2dwbG90KCkrCiAgZ2VvbV9jb2woYWVzKHk9cHJlZGljdG9yLHg9aW1wb3J0YW5jZSksZmlsbD0nZGFya2JsdWUnLCBjb2xvcj0nZ3JheScpKwogIGdndGl0bGUoIlRvcCAyMCBwcmVkaWN0b3IgaW1wb3J0YW5jZSB1c2luZyBnbG1uZXQiKSsKICB0aGVtZV9taW5pbWFsKCkKcGxvdF9wZXJtCmBgYAojIyA1LiBFdmFsdWF0ZSByZXN1bHRzIG9uIHRlc3QgZGF0YXNldAoKYGBge3J9CnhfdGVzdCA8LSB0ZXN0X2RhdGEgJT4lIHNlbGVjdCgtbmFtZSwtdG1nKSAlPiUgYXMubWF0cml4KCkKeV90ZXN0IDwtIHRlc3RfZGF0YSR0bWcKCnhfdGVzdCA8LXByZWRpY3Qoc2NhbGVkX3gsIHhfdGVzdCkKCnByZWRpY3Rpb25zIDwtIHByZWRpY3QobW9kZWxfZW4sIHMgPSBiZXN0X2xhbWJkYSwgbmV3eCA9IHhfdGVzdCkKIyBDb21wdXRlIHRoZSBSTVNFIChSb290IE1lYW4gU3F1YXJlIEVycm9yKQpSTVNFIDwtIHNxcnQobWVhbigocHJlZGljdGlvbnMgLSB5X3Rlc3QpXjIpKQpwcmludChSTVNFKQpgYGAKCiMjIDYuIFBsb3Q6IFByZWRpY3RlZCB2cyBSZWZlcmVuY2UgdmFsdWVzCgpgYGB7cn0KCnJlc3VsdHMgPC0gZGF0YS5mcmFtZShSZWZlcmVuY2UgPSB5X3Rlc3QsIFByZWRpY3RlZCA9IGFzLnZlY3RvcihwcmVkaWN0aW9ucykpCmdncGxvdChyZXN1bHRzLCBhZXMoeCA9IFJlZmVyZW5jZSwgeSA9IFByZWRpY3RlZCkpICsKICBnZW9tX3BvaW50KGNvbG9yPSdibHVlJykgKwogIGdlb21fYWJsaW5lKGludGVyY2VwdCA9IDAsIHNsb3BlID0gMSwgY29sb3I9J3JlZCcpICsKICBnZ3RpdGxlKCJQcmVkaWN0ZWQgdnMgUmVmZXJlbmNlIHZhbHVlcyIpICsKICB4bGFiKCJSZWZlcmVuY2UgVmFsdWVzIikgKwogIHlsYWIoIlByZWRpY3RlZCBWYWx1ZXMiKSArCiAgdGhlbWVfYncoKQoKYGBgCgo=