We will use a supervised machine learning linear regression model to predict metabolite concentration.
Load Libraries
library(caret)
library(ggplot2)
library(car)
library(lmtest)
library(corrplot)
library(GGally)
library(reshape2)
library(glmnet)
library(DALEX)
Step 1: Load the data
met_data <- read.csv("metabolite_data.csv")
head(met_data)
## mass logP num_rings num_H_bonds enzyme_activity pathway_count
## 1 165.9881 4.259048 2 4 23.98300 1
## 2 174.2456 3.549930 0 1 45.46802 5
## 3 218.9677 2.287884 3 2 43.24518 4
## 4 181.7627 2.934555 3 5 37.77074 3
## 5 183.2322 2.168528 2 6 65.46609 4
## 6 222.8766 2.119002 0 0 35.84718 4
## metabolite_concentration
## 1 122.6491
## 2 128.3079
## 3 155.9756
## 4 145.3023
## 5 161.8741
## 6 149.0430
dim(met_data)
## [1] 200 7
str(met_data)
## 'data.frame': 200 obs. of 7 variables:
## $ mass : num 166 174 219 182 183 ...
## $ logP : num 4.26 3.55 2.29 2.93 2.17 ...
## $ num_rings : int 2 0 3 3 2 0 2 2 0 2 ...
## $ num_H_bonds : int 4 1 2 5 6 0 5 6 4 5 ...
## $ enzyme_activity : num 24 45.5 43.2 37.8 65.5 ...
## $ pathway_count : int 1 5 4 3 4 4 4 3 3 3 ...
## $ metabolite_concentration: num 123 128 156 145 162 ...
The dataset contains 200 metabolites with 7 variables describing their chemical and biological properties. Each column (e.g., mass, logP, enzyme activity) represents a measurable feature that could help predict metabolite concentration. The data structure confirms proper formatting for analysis.
Step 2: Exploratory Data Analysis (EDA)
# Step 2a: check summary statistics (mean, median, etc.) for each variable
summary(met_data)
## mass logP num_rings num_H_bonds enzyme_activity
## Min. :122.3 Min. :0.5273 Min. :0.0 Min. :0.00 Min. :23.98
## 1st Qu.:164.4 1st Qu.:2.0274 1st Qu.:1.0 1st Qu.:1.00 1st Qu.:43.24
## Median :178.5 Median :2.5183 Median :2.0 Median :3.00 Median :51.20
## Mean :179.8 Mean :2.5337 Mean :1.6 Mean :3.08 Mean :49.91
## 3rd Qu.:194.2 3rd Qu.:3.0719 3rd Qu.:3.0 3rd Qu.:5.00 3rd Qu.:56.38
## Max. :261.0 Max. :4.5572 Max. :3.0 Max. :6.00 Max. :76.92
## pathway_count metabolite_concentration
## Min. :1.000 Min. :102.8
## 1st Qu.:2.000 1st Qu.:134.4
## Median :3.000 Median :144.0
## Mean :2.955 Mean :145.3
## 3rd Qu.:4.000 3rd Qu.:155.8
## Max. :5.000 Max. :191.0
The summary shows that most metabolites have a moderate mass (~180), enzyme activity (~50), and concentration (~145). The variables span wide ranges, suggesting sufficient diversity for modeling, and no values look extreme enough to disrupt the analysis.
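Before plotting, it is also worth confirming the data are complete; a minimal sketch using base R (an optional check, not part of the original workflow):
# Quick data-quality check: missing values and spread per column
colSums(is.na(met_data))
sapply(met_data, sd)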
# Step 2b: index plots for each variable (x-axis = row order in the data set, y-axis = measurement)
par(mfrow=c(3,3))
plot(met_data$mass)
plot(met_data$logP)
plot(met_data$num_rings)
plot(met_data$num_H_bonds)
plot(met_data$enzyme_activity)
plot(met_data$pathway_count)
plot(met_data$metabolite_concentration)
par(mfrow = c(1, 1))
These index plots show each variable against its position in the data set.
Mass and enzyme activity vary without strong trends, while discrete
variables such as num_rings and pathway_count take a limited set of
integer values, as expected. No obvious data-entry errors are visible.
# Step 2c: Boxplots with names and 45° labels
boxplot(
met_data$mass,
met_data$enzyme_activity,
met_data$metabolite_concentration,
names = FALSE,
main = "Boxplots of Key Metabolite Variables",
col = c("skyblue", "lightgreen", "salmon"),
xaxt = "n"
)
labels1 <- c("mass", "enzyme_activity", "metabolite_concentration")
text(x = 1:length(labels1), y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]),
labels = labels1, srt = 45, adj = 1, xpd = TRUE, cex = 0.9)
boxplot(
met_data$logP,
met_data$num_rings,
met_data$num_H_bonds,
met_data$pathway_count,
names = FALSE,
main = "Boxplots of Other Predictors",
col = c("orange", "purple", "lightblue", "lightpink"),
xaxt = "n"
)
labels2 <- c("logP", "num_rings", "num_H_bonds", "pathway_count")
text(x = 1:length(labels2), y = par("usr")[3] - 0.05 * diff(par("usr")[3:4]),
labels = labels2, srt = 45, adj = 1, xpd = TRUE, cex = 0.9)
The boxplots display the spread and central tendency for each variable.
Mass and metabolite concentration have slightly wider ranges, while
enzyme activity is more compact. There are no severe outliers,
indicating that the data is relatively clean and well-distributed across
variables.
# Step 2d: Correlation Heatmap ----
corrplot(cor(met_data), method = "color", type = "upper", tl.col = "black")
The heatmap reveals strong positive correlations between metabolite
concentration and features like mass, enzyme activity, and
pathway_count. This suggests these variables likely contribute
significantly to predicting metabolite levels. Low correlations among
predictors indicate minimal multicollinearity.
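To quantify what the heatmap shows, the correlations with the response can be ranked directly; a minimal sketch using base R:
# Rank predictors by their correlation with the response
cors <- cor(met_data)[, "metabolite_concentration"]
sort(cors[names(cors) != "metabolite_concentration"], decreasing = TRUE)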
# Step 2e: Pairwise relationships
ggpairs(met_data)
The scatterplot matrix confirms approximately linear relationships between
metabolite concentration and several predictors, especially mass and
enzyme activity. Correlations among the predictors themselves are weak,
supporting the suitability of linear regression.
# Step 2f: Distribution Check ----
met_data_long <- melt(met_data)
ggplot(met_data_long, aes(x = value, fill = variable)) +
geom_histogram(bins = 20) +
facet_wrap(~variable, scales = "free") +
theme_minimal() +
labs(title = "Distributions of Metabolite Features")
Most variables show approximately normal or slightly right-skewed
distributions. Continuous variables like mass and enzyme_activity appear
smooth, while categorical-like variables (num_rings, pathway_count) show
discrete groupings, consistent with their integer nature.
Step 3: Train/Test Split
set.seed(42)
train_index <- createDataPartition(met_data$metabolite_concentration, p = 0.7, list = FALSE)
train_data <- met_data[train_index, ]
test_data <- met_data[-train_index, ]
The dataset was split into 70% training and 30% testing sets. This ensures the model can be trained on sufficient data while reserving unseen samples for unbiased performance evaluation.
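A quick sanity check on the split, confirming the expected sizes and comparable response distributions (a minimal sketch):
# Verify split sizes and that the response is similarly distributed
nrow(train_data)  # expect 140 (70% of 200)
nrow(test_data)   # expect 60
summary(train_data$metabolite_concentration)
summary(test_data$metabolite_concentration)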
Step 4: Linear Regression Model
lm_model <- lm(metabolite_concentration ~ ., data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = metabolite_concentration ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1504 -3.7074 -0.1107 2.7126 15.4738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.24431 4.66993 -0.909 0.365
## mass 0.50097 0.02003 25.015 < 2e-16 ***
## logP 2.66187 0.56185 4.738 5.47e-06 ***
## num_rings 3.17428 0.39278 8.082 3.44e-13 ***
## num_H_bonds 1.57205 0.23086 6.809 3.06e-10 ***
## enzyme_activity 0.74271 0.04244 17.499 < 2e-16 ***
## pathway_count 1.82975 0.31593 5.792 4.78e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.205 on 133 degrees of freedom
## Multiple R-squared: 0.8904, Adjusted R-squared: 0.8855
## F-statistic: 180.1 on 6 and 133 DF, p-value: < 2.2e-16
The regression results show a strong fit on the training data (R² = 0.89). All predictors are statistically significant (p < 0.001), indicating each contributes meaningfully to predicting metabolite concentration. Mass and enzyme activity show the strongest positive effects, aligning with biological expectations.
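Because the raw coefficients are on different scales, standardized coefficients make effect sizes directly comparable; a minimal sketch refitting on z-scored data (an optional check, not part of the original workflow):
# Refit on standardized data so coefficients are comparable effect sizes
train_scaled <- as.data.frame(scale(train_data))
lm_scaled <- lm(metabolite_concentration ~ ., data = train_scaled)
round(coef(lm_scaled), 3)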
Step 5: Diagnostics
# Step 5a: Check residual normality
shapiro.test(lm_model$residuals)
##
## Shapiro-Wilk normality test
##
## data: lm_model$residuals
## W = 0.98715, p-value = 0.2191
par(mfrow = c(2, 3))
hist(lm_model$residuals, main = "Histogram of Residuals")
plot(lm_model)
par(mfrow = c(1, 1))
The residual histogram and plots indicate a roughly normal distribution
(Shapiro p = 0.2191). Residuals are evenly spread without patterns,
suggesting the model assumptions of normality and linearity are well
met.
# Step 5b: Check multicollinearity (variance inflation factors)
vif(lm_model)
## mass logP num_rings num_H_bonds enzyme_activity
## 1.080017 1.061047 1.031570 1.095657 1.031244
## pathway_count
## 1.084573
All VIF values are close to 1, confirming that predictors are not highly correlated with one another. This strengthens the validity of the regression coefficients and model stability.
# Step 5c: Check homoscedasticity (Breusch-Pagan test)
bptest(lm_model)
##
## studentized Breusch-Pagan test
##
## data: lm_model
## BP = 14.878, df = 6, p-value = 0.02122
The p-value (0.021) indicates mild heteroscedasticity: residual variance is not perfectly constant across fitted values. The effect is modest, but a variance-stabilizing transformation or heteroscedasticity-robust inference could address it.
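One standard remedy that leaves the fitted coefficients unchanged is heteroscedasticity-consistent (robust) standard errors; a minimal sketch using coeftest() from lmtest (already loaded) with vcovHC() from the sandwich package (assumed installed):
# Robust (HC3) standard errors; coefficients unchanged, SEs adjusted
library(sandwich)
coeftest(lm_model, vcov. = vcovHC(lm_model, type = "HC3"))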
Step 6: Model performance
lm_pred <- predict(lm_model, newdata = test_data)
# Step 6a: Calculate performance metrics and print
rmse <- sqrt(mean((test_data$metabolite_concentration - lm_pred)^2))
mae <- mean(abs(test_data$metabolite_concentration - lm_pred))
r2 <- cor(test_data$metabolite_concentration, lm_pred)^2
# Print results
cat("RMSE:", rmse, "\n")
## RMSE: 4.970806
cat("MAE:", mae, "\n")
## MAE: 3.885294
cat("R²:", r2, "\n")
## R²: 0.9003138
The model performs well on test data (RMSE ≈ 4.97, R² ≈ 0.90). These results confirm high predictive accuracy, as predicted concentrations closely match observed values with minimal average error.
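caret (already loaded) bundles the same three metrics into a single call, which is a convenient cross-check on the manual calculations above:
# postResample() returns RMSE, Rsquared, and MAE in one call
postResample(pred = lm_pred, obs = test_data$metabolite_concentration)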
# Step 6b: plot predicted vs observed
ggplot(data.frame(Observed = test_data$metabolite_concentration, Predicted = lm_pred),
aes(x = Observed, y = Predicted)) +
geom_point(color = "blue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(title = "Predicted vs Observed Metabolite Concentration")
The scatterplot shows that most points fall close to the 1:1 red dashed
line, indicating strong agreement between predicted and actual
metabolite concentrations. Only a few points deviate, confirming good
generalization performance.
Step 7: Variable Importance
# Which variables/predictors are important in the model?
explainer <- explain(lm_model, data = train_data[, -7],
y = train_data$metabolite_concentration)
## Preparation of a new explainer is initiated
## -> model label : lm ( default )
## -> data : 140 rows 6 cols
## -> target variable : 140 values
## -> predict function : yhat.lm will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package stats , ver. 4.5.1 , task regression ( default )
## -> predicted values : numerical, min = 107.7701 , mean = 145.5337 , max = 189.6574
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -12.1504 , mean = 6.501466e-14 , max = 15.47382
## A new explainer has been created!
vip <- model_parts(explainer)
plot(vip) + ggtitle("Variable Importance (DALEX)")
The DALEX variable importance plot highlights mass and enzyme_activity
as top predictors. This agrees with the regression summary, indicating
that physical and enzymatic properties play key roles in determining
metabolite concentration.
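DALEX can also show how predictions respond to individual features via partial-dependence profiles; a minimal sketch reusing the explainer (the choice of variables is illustrative):
# Partial-dependence profiles for the two top predictors
pdp <- model_profile(explainer, variables = c("mass", "enzyme_activity"))
plot(pdp) + ggtitle("Partial Dependence (DALEX)")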
Step 8: Predict a New Observation
# Testing the model on a single new observation
new_met <- data.frame(
mass=200,
logP=3,
num_rings=2,
num_H_bonds=3,
enzyme_activity=60,
pathway_count=2
)
predict(lm_model, newdata=new_met)
## 1
## 163.2212
The model predicts a concentration of approximately 163.2 for the new metabolite input, consistent with expected values given its moderate mass and enzyme activity. This shows the model can generalize to unseen observations effectively.
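A point prediction alone hides uncertainty; stats::predict() can also return a 95% prediction interval for the new observation:
# Prediction with a 95% prediction interval (fit, lwr, upr)
predict(lm_model, newdata = new_met, interval = "prediction", level = 0.95)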
Step 9: Cross-Validation
# Step 9a: Cross-validation: another way to train and predict ----
set.seed(42)
# Define 10-fold CV
train_control <- trainControl(method = "cv", number = 10)
# Train linear regression model with CV
cv_model <- train(
metabolite_concentration ~ .,
data = met_data,
method = "lm",
trControl = train_control,
metric = "RMSE"
)
# Show CV results
print(cv_model)
## Linear Regression
##
## 200 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.12678 0.8853752 4.151905
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Extract performance metrics
cv_results <- cv_model$resample
summary(cv_results)
## RMSE Rsquared MAE Resample
## Min. :3.108 Min. :0.8112 Min. :2.494 Length:10
## 1st Qu.:4.262 1st Qu.:0.8359 1st Qu.:3.543 Class :character
## Median :5.319 Median :0.8906 Median :4.329 Mode :character
## Mean :5.127 Mean :0.8854 Mean :4.152
## 3rd Qu.:5.940 3rd Qu.:0.9355 3rd Qu.:4.704
## Max. :6.677 Max. :0.9599 Max. :5.506
# Mean CV RMSE, MAE and R²
cat("Mean CV RMSE:", mean(cv_results$RMSE), "\n")
## Mean CV RMSE: 5.12678
cat("Mean CV MAE:", mean(cv_results$MAE), "\n")
## Mean CV MAE: 4.151905
cat("Mean CV R²:", mean(cv_results$Rsquared), "\n")
## Mean CV R²: 0.8853752
Cross-validation results are consistent with the initial model (Mean RMSE = 5.13, R² = 0.885), confirming the model’s reliability and low overfitting. Performance is stable across different subsets of data.
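For an even more stable estimate, the fold assignment itself can be repeated; a minimal sketch using caret's repeated CV (the choice of 5 repeats is illustrative):
# Repeated 10-fold CV averages over several random fold assignments
set.seed(42)
rcv_control <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
rcv_model <- train(metabolite_concentration ~ ., data = met_data,
                   method = "lm", trControl = rcv_control, metric = "RMSE")
rcv_model$results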
# Step 9b: Visualize cross-validation performance: RMSE, MAE, Rsquared
# (note: par(mfrow) has no effect on ggplot2 output, so each histogram prints separately)
# RMSE
ggplot(cv_results, aes(x = RMSE)) +
geom_histogram(bins = 10, fill = "steelblue", color = "white", alpha = 0.8) +
theme_minimal() +
labs(title = "Cross-Validation RMSE Distribution (10-Fold)",
x = "RMSE", y = "Count")
# MAE
ggplot(cv_results, aes(x = MAE)) +
geom_histogram(bins = 10, fill = "steelblue", color = "white", alpha = 0.8) +
theme_minimal() +
labs(title = "Cross-Validation MAE Distribution (10-Fold)",
x = "MAE", y = "Count")
# RSquared
ggplot(cv_results, aes(x = Rsquared)) +
geom_histogram(bins = 10, fill = "steelblue", color = "white", alpha = 0.8) +
theme_minimal() +
labs(title = "Cross-Validation Rsquared Distribution (10-Fold)",
x = "Rsquared", y = "Count")
The histograms show that RMSE, MAE, and R² values are tightly clustered, indicating consistent performance across folds. This uniformity suggests strong model robustness and reproducibility.
# Step 9c: Predict New Metabolite with Cross Validation ----
new_met <- data.frame(
mass = 200,
logP = 3,
num_rings = 2,
num_H_bonds = 3,
enzyme_activity = 60,
pathway_count = 2
)
predict(cv_model, newdata = new_met)
## 1
## 163.1173
The cross-validated model predicts a nearly identical value (163.1) to the initial model (163.2), confirming high model stability and agreement between training approaches.
# Step 9d: Compare predictions from the lm and CV models
predict(lm_model, newdata = new_met)
## 1
## 163.2212
predict(cv_model, newdata = new_met)
## 1
## 163.1173
Both the simple linear model and the cross-validated model yield nearly identical predictions, reinforcing that the model's accuracy is not sensitive to how the training data were sampled.
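Finally, glmnet was loaded at the top but never used. As an optional extension, a regularized (elastic-net) model can be fit through the same caret interface to check whether shrinkage changes the conclusions; a minimal sketch (the tuning grid is illustrative):
# Elastic-net regression via caret; alpha = 0 is ridge, alpha = 1 is lasso
set.seed(42)
glmnet_model <- train(
  metabolite_concentration ~ .,
  data = met_data,
  method = "glmnet",
  trControl = train_control,  # reuse the 10-fold CV from Step 9
  tuneGrid = expand.grid(alpha = c(0, 0.5, 1),
                         lambda = 10^seq(-3, 1, length.out = 20)),
  metric = "RMSE"
)
glmnet_model$bestTune
With only six weakly correlated predictors and 200 observations, regularization is unlikely to beat plain least squares here, but it is a useful robustness check.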