```{r}
library(dplyr)
library(ggplot2)
```
## 1. Reading the dataset
```{r}
library(readr)
dataset <- read_csv("/home/harpo/Dropbox/ongoing-work/git-repos/complex-systems-2023-final/datasets/results-1511/dataset.csv")
```

```
Rows: 532 Columns: 648
── Column specification ──────────────────────────────────────────────
Delimiter: ","
chr   (1): name
dbl (647): psd_1, psd_2, psd_3, psd_4, psd_5, psd_6, psd_7, psd_8, psd_9, psd_10, psd_11, psd_12, psd_13, psd_14, psd_15, psd_16, p...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```{r}
head(dataset)
```
## 1.1 Remove predictors with zero variance
```{r}
library(caret)
# Identify near-zero-variance predictors and drop them.
# Guard against an empty index: dataset[, -integer(0)] would drop every column.
nzv <- nearZeroVar(dataset, saveMetrics = FALSE)
if (length(nzv) > 0) dataset <- dataset[, -nzv]
```
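Before dropping columns, it can be useful to see which predictors were flagged and why. A minimal sketch using `nearZeroVar` with `saveMetrics = TRUE` (the `freqRatio`, `percentUnique`, and `nzv` columns are caret's standard diagnostics; this inspection step is not part of the original analysis):

```{r}
# Inspect the near-zero-variance diagnostics instead of dropping blindly
nzv_metrics <- nearZeroVar(dataset, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])  # predictors flagged as near-zero variance
```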
## 2. Splitting the dataset
We will split the dataset into a training set (70%) and a testing set (30%).
```{r}
set.seed(123) # for reproducibility
sample_index <- sample(1:nrow(dataset), 0.7 * nrow(dataset))
train_data <- dataset[sample_index, ]
test_data <- dataset[-sample_index, ]
```
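A plain `sample()` split works well when the response is roughly uniform; if `tmg` is skewed, a stratified split keeps its distribution similar across the two sets. A minimal alternative sketch using caret's `createDataPartition` (not used in the original analysis; the `_strat` names are hypothetical):

```{r}
# Hypothetical alternative: stratify the 70/30 split on quantiles of the response
idx <- createDataPartition(dataset$tmg, p = 0.7, list = FALSE)
train_data_strat <- dataset[idx, ]
test_data_strat  <- dataset[-idx, ]
```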
## 3. Creating a regression model
Using the glmnet package, we’ll predict the `tmg` feature.
```{r}
library(glmnet)
# glmnet expects a numeric predictor matrix; the predictors are scaled
# explicitly in the next step
y <- train_data$tmg
x <- train_data %>% select(-name, -tmg) %>% as.matrix()
```
### Normalize the data
Standardization (Z-score normalization): this method transforms each feature to have a mean of 0 and a standard deviation of 1. It’s particularly useful when your features have different units or very different scales.
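Concretely, standardization applies the usual z-score transform to each feature (the standard formula, stated here for reference):

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}
$$

where $\bar{x}_j$ and $s_j$ are the mean and standard deviation of feature $j$ computed on the training data.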
```{r}
library(caret)
# Learn centering/scaling parameters on the training predictors only;
# the same transform is reused on the test set in section 5
scaled_x <- preProcess(x, method = c("center", "scale"))
x <- predict(scaled_x, x)
```
In the context of Elastic Net regression, the `alpha` parameter is a crucial component that balances the mix between Lasso (L1) and Ridge (L2) regularization methods. Elastic Net is a regularization technique that combines both L1 and L2 penalties, which are used to prevent overfitting by adding a penalty to the model’s loss function.
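Formally, for the Gaussian family glmnet minimizes the elastic-net objective

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right],
$$

so `alpha` weights the L1 term against the L2 term, while `lambda` scales the whole penalty.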
Here’s a breakdown of the `alpha` parameter:

1. **Range**: `alpha` can take on any value between 0 and 1 (inclusive).
   - `alpha = 1`: The penalty is entirely Lasso (L1 regularization).
   - `alpha = 0`: The penalty is entirely Ridge (L2 regularization).
   - `alpha` between 0 and 1: A combination of Lasso and Ridge.

2. **Effect of L1 (Lasso) Regularization**: L1 regularization adds a penalty equal to the absolute value of the magnitude of coefficients. This can lead to some coefficients being exactly zero, which is useful for feature selection if you have a large number of features.

3. **Effect of L2 (Ridge) Regularization**: L2 regularization adds a penalty equal to the square of the magnitude of coefficients. This tends to shrink the coefficients but does not set them to zero, which is useful when you have correlated features.

4. **Choosing `alpha`**:
   - If you have a lot of features that you suspect are not all useful, a value closer to 1 (more Lasso) might be more appropriate as it will perform feature selection.
   - If all your features are believed to be important, or you have a small number of features, a value closer to 0 (more Ridge) might work better.
   - Often, the best way to choose an `alpha` value is through cross-validation, trying different values and selecting the one that minimizes prediction error.

5. **Interaction with `lambda`**: The `lambda` parameter in Elastic Net controls the overall strength of the penalty, so the effect of `alpha` works in conjunction with `lambda`. A grid search over both `alpha` and `lambda` is a common practice to find the best combination that minimizes cross-validation error (see the sketch after this list).
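As a concrete illustration of that grid search, here is a minimal sketch (not part of the original analysis) that cross-validates over a small grid of `alpha` values, letting `cv.glmnet` handle the `lambda` path for each:

```{r}
# Hypothetical grid search over alpha; cv.glmnet searches lambda internally.
# Note: fixing foldid across fits would make the alpha comparison exact.
set.seed(123)
alphas <- seq(0, 1, by = 0.25)
cv_fits <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a))
cv_errors <- sapply(cv_fits, function(m) min(m$cvm))  # best CV error per alpha
best <- which.min(cv_errors)
cat("best alpha:", alphas[best],
    " lambda.min:", cv_fits[[best]]$lambda.min, "\n")
```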
In summary, the `alpha` parameter in Elastic Net allows you to balance the type of regularization applied to your model, providing the flexibility to choose between Lasso, Ridge, or a mix of both based on your data and the specific requirements of your problem.
```{r}
# Elastic Net model
set.seed(123)
cv_model <- cv.glmnet(x, y, alpha = 0.5) # alpha = 0.5 indicates Elastic Net
best_lambda <- cv_model$lambda.min
model_en <- glmnet(x, y, alpha = 0.5, lambda = best_lambda)
print(model_en)
```

```
Call:  glmnet(x = x, y = y, alpha = 0.5, lambda = best_lambda)

  Df  %Dev   Lambda
1 52 87.56 0.001445
```
## 4. Calculate importance using elastic net coefficients
```{r}
# Variable importance is inferred from the coefficient magnitudes
# (valid here because the predictors were standardized).
# Avoid `c` as a name: it shadows base::c().
coefs <- coef(model_en)
predictor <- coefs %>% rownames()
importance <- coefs %>% as.matrix()
importance <- data.frame(predictor, importance)
rownames(importance) <- NULL
names(importance)[2] <- "importance"
importance <- importance[-1, ] # exclude the intercept
# Order by absolute value so large negative coefficients also rank highly
importance <- importance[order(-abs(importance$importance)), ]
importance <- head(importance, 20)
```
```{r}
library(dplyr)
library(ggplot2)
plot_perm <- importance %>%
  mutate(predictor = factor(predictor, levels = rev(unique(predictor)))) %>%
  ggplot() +
  geom_col(aes(y = predictor, x = importance), fill = 'darkblue', color = 'gray') +
  ggtitle("Top 20 predictor importance using glmnet") +
  theme_minimal()
plot_perm
```

## 5. Evaluate results on test dataset
```{r}
x_test <- test_data %>% select(-name, -tmg) %>% as.matrix()
y_test <- test_data$tmg

# Apply the scaling learned on the training set to the test predictors
x_test <- predict(scaled_x, x_test)

predictions <- predict(model_en, s = best_lambda, newx = x_test)
# Compute the RMSE (Root Mean Square Error)
RMSE <- sqrt(mean((predictions - y_test)^2))
print(RMSE)
```

```
[1] 0.04124405
```
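RMSE alone is hard to interpret without knowing the scale of `tmg`. A minimal sketch (not part of the original analysis) computing MAE and R² on the same held-out predictions:

```{r}
# Hypothetical complementary metrics on the test set
MAE <- mean(abs(predictions - y_test))
R2  <- 1 - sum((y_test - predictions)^2) / sum((y_test - mean(y_test))^2)
cat("MAE:", MAE, " R2:", R2, "\n")
```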
## 6. Plot: Predicted vs Reference values
```{r}
results <- data.frame(Reference = y_test, Predicted = as.vector(predictions))
ggplot(results, aes(x = Reference, y = Predicted)) +
  geom_point(color = 'blue') +
  geom_abline(intercept = 0, slope = 1, color = 'red') +
  ggtitle("Predicted vs Reference values") +
  xlab("Reference Values") +
  ylab("Predicted Values") +
  theme_bw()
```
