Aquatic Biota
Our water environments, such as rivers and lakes, are increasingly exposed to various chemicals that can harm aquatic life, especially fish. Traditionally, assessing the toxicity of these chemicals requires complex laboratory tests that are both time-consuming and costly. With a vast array of chemicals in use, there is a pressing need for a more streamlined and efficient way to predict their impact on aquatic ecosystems.
The urgency to address this issue is twofold. Firstly, quick and accurate identification of harmful chemicals is crucial to protect aquatic species and maintain the health of our water ecosystems. Secondly, a simplified and efficient approach saves valuable time and resources, allowing for the rapid screening of numerous chemicals, which is vital given the ever-increasing number of chemicals introduced into the environment.
This project aims to create regression models to predict the toxic effects of chemicals on aquatic life. Our approach uses regression techniques to establish relationships between chemical properties and their toxicity levels. The models will be trained and validated to ensure their accuracy and applicability. In doing so, we aspire to provide a practical tool for swiftly assessing chemical toxicity, reducing the need for extensive lab testing and ultimately aiding the preservation of aquatic ecosystems.
In this project, we will use the QSAR fish toxicity dataset, which consists of 908 observations and 7 columns: 6 molecular descriptors and 1 toxicity response.

- CIC0: indicates the variety of atom types in a molecule.
- SM1_Dz(Z): shows the presence and variety of non-carbon atoms in the molecule.
- GATS1i: measures the similarity of bonded atoms' ionization potential.
- NdsCH & NdssC: count specific types of carbon atoms (=CH- and =C< carbons, respectively).
- MLOGP: measures a compound's tendency to dissolve in fats over water.
- LC50: the concentration of a substance that causes death in 50% of a test population, such as fish, within a set period. Lower values indicate higher toxicity.

Given the provided background and dataset, the problem statement is well articulated: building regression models to predict LC50 toxicity levels from the molecular descriptors.

Below are the packages that we will use during the analysis.
# for data loading - data preprocessing
library(readxl)
library(dplyr)
library(tidyr)
# for exploratory data analysis
library(psych)
library(ggplot2)
library(GGally)
library(gridExtra)
# for modeling - evaluation
library(caret)
library(partykit)
library(randomForest)
library(xgboost)
library(tidymodels)
library(MLmetrics)
# for model interpretation
library(lime)
# tidy visualization
library(DT)
# for reproducible analysis
set.seed(123)

In the initial phase of our analysis, we will focus on data cleansing to guarantee precision in our analysis and modeling process. Data cleansing involves identifying and rectifying any inaccuracies, corruptions, formatting errors, duplications, or incomplete information within our dataset. To begin, it's essential to familiarize ourselves with the data structure.
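The dataset can be loaded and inspected as follows. This is a minimal sketch: the file name qsar_fish_toxicity.xlsx and the object name data_raw are assumptions, not taken from the original analysis.
# load the QSAR fish toxicity dataset (file name assumed)
data_raw <- read_excel("qsar_fish_toxicity.xlsx")
# inspect rows, columns, and data types
glimpse(data_raw)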
#> Rows: 908
#> Columns: 7
#> $ CIC0 <dbl> 3.260, 2.189, 2.125, 3.027, 2.094, 3.222, 3.179, 3.000, 2.62…
#> $ SM1_Dz.Z. <dbl> 0.829, 0.580, 0.638, 0.331, 0.827, 0.331, 0.000, 0.000, 0.49…
#> $ GATS1i <dbl> 1.676, 0.863, 0.831, 1.472, 0.860, 2.177, 1.063, 0.938, 0.99…
#> $ NdsCH <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
#> $ NdssC <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ MLOGP <dbl> 1.453, 1.348, 1.348, 1.807, 1.886, 0.706, 2.942, 2.851, 2.94…
#> $ LC50 <dbl> 3.770, 3.115, 3.531, 3.510, 5.390, 1.819, 3.947, 3.513, 4.40…
Below are the first six rows of our observations.
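They can be previewed with head(), reusing the data_raw name assumed in the loading sketch above.
# preview the first six observations
head(data_raw)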
The dataset consists of seven numeric columns, each representing
molecular descriptors and associated LC50 toxicity values.
With a total of 908 rows, it encompasses a wide range of chemical
compounds. The data is already in a suitable numeric format, requiring
no further alteration in data types.
Now, let’s check if our data is complete by looking for any missing or blank spots, called null values. These missing parts can cause problems in our analysis. It’s important to find out if there are any and then figure out what to do with them to keep our data good and useful.
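A simple way to count missing values per column (a sketch, again assuming the raw data frame is named data_raw):
# count NA values in each column
colSums(is.na(data_raw))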
#> CIC0 SM1_Dz.Z. GATS1i NdsCH NdssC MLOGP LC50
#> 0 0 0 0 0 0 0
After checking, we found no missing parts in our data.
To clean our data, it’s essential to remove duplicate entries. Duplicates can skew the modeling process by giving undue weight to repeated information. We’ll identify and discard any such repeats for a more accurate analysis.
We have identified one duplicated row. We will remove it to maintain the uniqueness of our data.
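A sketch of how the duplicate can be found and dropped; distinct() from dplyr keeps only unique rows, and data_clean is the object used in the rest of the analysis.
# count duplicated rows
sum(duplicated(data_raw))
# keep unique rows only
data_clean <- distinct(data_raw)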
Our data now is clean and ready for analysis.
Checking for outliers, which are data points significantly differing from others in a dataset, is a crucial step in data analysis. They can skew statistical measures and mislead interpretations. Identifying outliers helps in deciding whether to retain, modify, or remove them, ensuring more accurate analysis. This step also aids in better understanding the data, spotting anomalies, and upholding the integrity of subsequent analysis stages. Hence, outlier detection is essential for a robust data analysis process.
To spot these outliers, we use the base boxplot() function.
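A minimal call that draws one boxplot per column (sketch):
# one boxplot per column to spot extreme values
boxplot(data_clean, main = "Distribution of molecular descriptors and LC50")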
Looking at the boxplots, we see outliers in all the columns. There are different ways to handle these unusual points, like taking them out or keeping them. We’ve decided to keep them in our data. This is because sometimes, these outliers can tell us something important about what we’re studying. By keeping them, we make sure our analysis covers everything in the data, even the unusual parts.
In this step, we’ll examine the relationships between different variables in our dataset through correlation analysis. This process helps us understand how variables are connected. If we find that two variables often increase or decrease together, they have a positive correlation. Conversely, if one goes up when the other goes down, that’s a negative correlation.
Understanding these relationships is vital for interpreting our data
accurately. We’ll use the ggcorr() function from the
GGally package to conduct this analysis.
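A sketch of the call, with the correlation coefficients printed on each tile:
# correlation heatmap with coefficients displayed
ggcorr(data_clean, label = TRUE, label_round = 2)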
The correlation plot shows that most variable pairs have weak correlations. Notably:

- MLOGP and LC50 have a strong positive correlation of 0.7, indicating they tend to increase together.
- CIC0 and MLOGP have a moderate positive correlation of 0.5.
- GATS1i and MLOGP have a moderate negative correlation of -0.5, meaning they tend to move in opposite directions.

Further insights can be gained by examining scatterplots of these variable pairs.
# scatterplot for MLOGP and LC50
ggplot(data_clean, aes(x = MLOGP, y = LC50)) +
  geom_point() +
  labs(title = "MLOGP vs LC50",
       x = "MLOGP",
       y = "LC50")

The scatterplot above displays a spread of data points indicating that as the MLOGP value increases, the LC50 value also tends to increase. This suggests that substances with a higher capability to dissolve in fats (as indicated by MLOGP) may also have a higher lethal concentration value, implying a lower toxicity.
# scatterplot for CIC0 and MLOGP
ggplot(data_clean, aes(x = CIC0, y = MLOGP)) +
geom_point() +
labs(title = "CIC0 vs MLOGP",
x = "CIC0",
y = "MLOGP")
Here, the points form a slight upward trend, suggesting that as the topological structure complexity (CIC0) increases, there might be a slight increase in lipophilicity (MLOGP), but the relationship is not strongly pronounced.
# scatterplot for GATS1i and MLOGP
ggplot(data_clean, aes(x = GATS1i, y = MLOGP)) +
geom_point() +
labs(title = "GATS1i vs MLOGP",
x = "GATS1i",
y = "MLOGP")
The scatterplot displays the relationship between
GATS1i
and MLOGP. As GATS1i increases,
MLOGP tends to decrease, which is indicated by a
correlation of -0.5. The spread of the points suggests a moderate
inverse relationship between the two variables.
Having explored our data and gained key insights, we are now poised to take the next step: building a predictive regression model. This model will aim to predict the lethal concentration of substances based on their molecular properties. For our project, we will explore three robust regression models: the decision tree regressor, the random forest regressor, and the extreme gradient boosting (XGBoost) regressor.
For the modeling purpose, we’ll be splitting the data into a training set and a testing set. We’ll be giving 80% of the data to the training set, which is where our model will learn its stuff, and the other 20% will go to the testing set, where we’ll see how well our model has learned. This way, we get to train our model well and also test it fairly.
This splitting utilizes the sample() function, which returns a vector of row indexes for the training data.
# for reproducibility
RNGkind(sample.kind = "Rounding")
set.seed(100)
# define train and test proportion
index <- sample(nrow(data_clean), nrow(data_clean) * 0.8)
# subset
train_data <- data_clean[index,]
test_data <- data_clean[-index,]
#> [1] 725 7
#> [1] 182 7
Now, we will train the data with 725 observations and test the result with 182 observations.
In this project, we will focus on tree-based regression models: the decision tree regressor, the random forest regressor, and the extreme gradient boosting (XGBoost) regressor.
A decision tree regressor operates by dividing a dataset into
distinct regions based on specific criteria. Suppose we have a dataset
where we want to predict LC50, a measure of toxicity, using
molecular descriptors like CIC0, SM1_Dz(Z),
GATS1i, NdssC, NdsCH, and
MLOGP. Each of these descriptors gives us different
insights into the molecules, such as the variety of atom types, the
presence of non-carbon atoms, and the molecule’s solubility.
We begin by examining all the data. The decision tree first looks for
the best way to split the molecules into groups based on one of these
descriptors. For instance, it might divide them based on
CIC0, grouping molecules with similar types of atoms. The
goal is to create groups that are as uniform as possible in terms of
their LC50 values.
At each step, we evaluate each molecular descriptor to determine the
best way to further divide the groups. This is done using a mathematical
formula, like mean squared error (MSE), which helps us gauge the
effectiveness of the split. We continue this process of splitting the
dataset, creating smaller and more specific groups, until we reach a
certain condition, such as a maximum tree depth or groups that are
homogeneous enough in their LC50 values.
The final model is a tree comprised of leaf nodes. Each leaf
represents a group of molecules with a similar predicted
LC50 value, usually the average LC50 of the
molecules in that leaf. So, when we introduce a new molecule with its
descriptors into the model, the tree navigates to the appropriate leaf
based on these features. The predicted LC50 is then based
on the average value of that leaf’s group. This method allows us to
effectively predict LC50 based on the provided molecular
characteristics.
Now that we understand how a decision tree regressor is constructed from our dataset, our next step is to implement it in R. We'll create
our model using the ctree() function from the
partykit package. This function is versatile, handling both
classification and regression tasks. To use it, we need to specify two
main parameters: formula and data. The
formula parameter defines our target variable (in this
case, LC50) and the predictors we want to use (such as
CIC0, SM1_Dz(Z), GATS1i,
NdssC, NdsCH, and MLOGP). The
data parameter is where we input our training dataset. By
setting these parameters, we can efficiently build a Decision Tree
Regressor tailored to our specific requirements in R.
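A minimal sketch of that call; dtr_base is the object name evaluated later in this analysis.
# decision tree regressor: LC50 explained by all molecular descriptors
dtr_base <- ctree(formula = LC50 ~ ., data = train_data)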
To visualize the structure of our decision tree regressor, we can
utilize the plot() function in R. This will enable us to
see a graphical representation of the tree, showcasing how the data is
split at each node and providing insight into the decision-making
process of the model.
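Printing the fitted object lists the split rules shown below, and plot() draws the tree diagram (a sketch reusing dtr_base).
# print the split rules and draw the fitted tree
dtr_base
plot(dtr_base)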
#>
#> Model formula:
#> LC50 ~ CIC0 + SM1_Dz.Z. + GATS1i + NdsCH + NdssC + MLOGP
#>
#> Fitted party:
#> [1] root
#> | [2] MLOGP <= 2.621
#> | | [3] MLOGP <= 1.595
#> | | | [4] SM1_Dz.Z. <= 0.629
#> | | | | [5] NdsCH <= 0
#> | | | | | [6] GATS1i <= 1.59: 2.849 (n = 60, err = 54.6)
#> | | | | | [7] GATS1i > 1.59
#> | | | | | | [8] MLOGP <= 0.8: 1.772 (n = 50, err = 34.1)
#> | | | | | | [9] MLOGP > 0.8: 2.801 (n = 20, err = 14.2)
#> | | | | [10] NdsCH > 0: 3.741 (n = 27, err = 20.8)
#> | | | [11] SM1_Dz.Z. > 0.629: 3.577 (n = 95, err = 181.4)
#> | | [12] MLOGP > 1.595
#> | | | [13] SM1_Dz.Z. <= 0.83
#> | | | | [14] NdsCH <= 0
#> | | | | | [15] MLOGP <= 2.056: 3.446 (n = 55, err = 21.2)
#> | | | | | [16] MLOGP > 2.056: 3.842 (n = 82, err = 24.5)
#> | | | | [17] NdsCH > 0
#> | | | | | [18] SM1_Dz.Z. <= 0.331: 3.775 (n = 14, err = 3.6)
#> | | | | | [19] SM1_Dz.Z. > 0.331: 4.735 (n = 18, err = 10.6)
#> | | | [20] SM1_Dz.Z. > 0.83: 4.579 (n = 62, err = 55.6)
#> | [21] MLOGP > 2.621
#> | | [22] MLOGP <= 3.818
#> | | | [23] SM1_Dz.Z. <= 1.192: 4.615 (n = 135, err = 79.3)
#> | | | [24] SM1_Dz.Z. > 1.192: 5.939 (n = 22, err = 35.5)
#> | | [25] MLOGP > 3.818: 5.944 (n = 85, err = 120.7)
#>
#> Number of inner nodes: 12
#> Number of terminal nodes: 13
In the tree diagram we made, MLOGP is at the very top.
This means it’s the most important thing the tree looks at when
predicting LC50, which is about how toxic something is.
Basically, MLOGP tells us about how a molecule dissolves,
and that seems to be a big deal for deciding its toxicity. So, when we
use this tree to guess toxicity, it starts by checking
MLOGP first, and then makes more decisions based on that.
This shows us that how a molecule dissolves is really key in
understanding how toxic it might be.
Next, we will evaluate the performance of our dtr_base
model on both the training and testing datasets, focusing on two key
metrics: adjusted r-squared and MAPE (Mean Absolute Percentage Error).
It’s important to assess the model’s effectiveness on both sets because
this helps us understand how well it can apply what it’s learned to new,
unseen data. By doing this, we ensure that our model isn’t just
memorizing the training data but is actually learning patterns that are
generally true and can be applied broadly.
# custom function for calculating adj. r-squared
adj_r2 <- function(preds, actual, n, p){
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
return (1 - (((1-rsq)*(n-1))/(n-p-1)))
}

# prediction on training data and testing data
y_train_pred <- predict(object = dtr_base, newdata = train_data)
y_test_pred <- predict(object = dtr_base, newdata = test_data)

dtr_results <- data.frame(
  adj_r2_train = adj_r2(y_train_pred, train_data$LC50, nrow(train_data), ncol(train_data)),
  MAPE_train = MAPE(y_train_pred, train_data$LC50),
  adj_r2_test = adj_r2(y_test_pred, test_data$LC50, nrow(test_data), ncol(test_data)),
  MAPE_test = MAPE(y_test_pred, test_data$LC50)
)
dtr_results

💡 Insight: the model explains only 54.85% of the variation in LC50 on the testing data (adjusted r-squared).

These results aren't very good. It looks like the decision tree regressor might be too simple for what we're trying to do, and it's not predicting LC50 as accurately as we hoped. Next, we will experiment with a random forest regressor.
A random forest regressor is an advanced model that combines many decision trees to improve prediction accuracy. It creates these trees using different random samples of data, a process known as ‘bootstrapping’. Each tree also uses a randomly selected set of features, making each one unique and helping the overall model avoid errors that a single tree might make. When it comes to making predictions, the random forest runs the input through all its trees. Each tree gives its own prediction, and the final output is the average of these. This method not only makes the model more accurate but also helps it perform well on new, unseen data by reducing overfitting.
The random forest is versatile, good for both regression and classification, and can handle a mix of data types, making it a popular choice in many predictive modeling tasks.
Previously, we utilized the ctree() function from the
partykit package, specifically designed for decision trees.
However, for our next step in model building, we’ll use the
train() function from the caret package, a
high-level abstraction function. This function allows us to experiment
with different models simply by modifying the method parameter. In this
case, as we aim to build a random forest regressor, we will set
method = "rf".
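A sketch of that call; rfr_base is the object name evaluated below, and caret dispatches method = "rf" to the randomForest package.
# random forest regressor through caret's high-level interface
rfr_base <- train(LC50 ~ ., data = train_data, method = "rf")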
Since a random forest has many trees and it’s hard to show them all
at once, we use something called ‘feature importance’ to understand the
model better. Feature importance tells us how important each part of our
data is for making good predictions. We can find this out by using the
varImp() function.
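For example (a sketch reusing rfr_base from above):
# rank predictors by their importance in the random forest
rf_importance <- varImp(rfr_base)
plot(rf_importance)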
Based on what we found, MLOGP is still the most
important thing for predicting LC50, with
SM1_Dz(Z) and GATS1i also being really
important. This means that the model really pays attention to
MLOGP, showing that how a molecule dissolves is key in
figuring out its toxicity. SM1_Dz(Z) and
GATS1i are also big factors, telling us that the types of
atoms in a molecule and their properties are crucial for making good
predictions about LC50.
Similarly to the decision tree regressor, we will evaluate the model’s performance on both the training and testing data.
# prediction on training data and testing data
y_train_pred <- predict(object = rfr_base, newdata = train_data)
y_test_pred <- predict(object = rfr_base, newdata = test_data)

rfr_results <- data.frame(
  adj_r2_train = adj_r2(y_train_pred, train_data$LC50, nrow(train_data), ncol(train_data)),
  MAPE_train = MAPE(y_train_pred, train_data$LC50),
  adj_r2_test = adj_r2(y_test_pred, test_data$LC50, nrow(test_data), ncol(test_data)),
  MAPE_test = MAPE(y_test_pred, test_data$LC50)
)
rfr_results

💡 Insight: the model explains 69.13% of the variation in LC50 on the testing data (adjusted r-squared).

Based on the adjusted r-squared and MAPE, it is evident that the random forest regressor performs better than the decision tree regressor. However, the significant gap between training and testing performance leads us to conclude that this model overfits the training data and has limited capability to generalize to new, unseen data. Next, we will attempt to model the dataset using an XGBoost regressor.
XGBoost works by taking the concept of decision trees to the next level. It begins by creating a basic decision tree, but instead of stopping there, XGBoost continuously builds more trees, each one designed to correct the errors made by the previous ones. Imagine it like a team where each new member learns from the mistakes of the earlier members, gradually improving the overall performance. XGBoost does this through a technique called ‘boosting’, where each new tree focuses on the tricky parts that previous trees struggled with. It’s like solving a puzzle by focusing on the hardest parts one step at a time.
What’s special about XGBoost is its efficiency and precision. It’s not just randomly adding trees; it’s doing it in a smart, calculated way. The algorithm also includes ways to prevent overfitting, which means it avoids getting too caught up in the specific details of the training data, making it more reliable for predicting new, unseen data. In essence, XGBoost creates a highly refined and collaborative system of trees that work together to make more accurate predictions, especially useful in handling complex datasets.
In implementing the XGBoost regressor, we can use the same caret approach as with the random forest regressor, with the only change being setting method = "xgbTree". Alternatively, we can construct an XGBoost regressor using the boost_tree() function from the parsnip package, which is loaded as part of tidymodels.
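A sketch of the tidymodels fit; boost_tree() defaults to 15 boosting rounds, and xgbr_base_tm is the object name used below.
# baseline XGBoost regressor built with parsnip (default trees = 15)
xgbr_base_tm <- boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost") %>%
  fit(LC50 ~ ., data = train_data)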
# prediction on training data and testing data with model from tidymodels
y_train_pred <- predict(object = xgbr_base_tm, new_data = train_data)
y_test_pred <- predict(object = xgbr_base_tm, new_data = test_data)

xgbr_tm_results <- data.frame(
  adj_r2_train = adj_r2(y_train_pred$.pred, train_data$LC50, nrow(train_data), ncol(train_data)),
  MAPE_train = MAPE(y_train_pred$.pred, train_data$LC50),
  adj_r2_test = adj_r2(y_test_pred$.pred, test_data$LC50, nrow(test_data), ncol(test_data)),
  MAPE_test = MAPE(y_test_pred$.pred, test_data$LC50)
)
xgbr_tm_results

💡 Insight: the model explains 89.46% of the variation in LC50 on the training data and 68.96% on the testing data (adjusted r-squared).

Based on the results from the three regressors, XGBoost achieved a relatively lower error compared to the random forest and decision tree regressors. However, like the random forest, XGBoost demonstrated a tendency to perform well on training data but underperform on testing data. This pattern suggests a potential issue of overfitting. Therefore, we plan to fine-tune this model to better control its complexity and improve its generalization to unseen data.
In this section, we will fine-tune the XGBoost regressor using the
tidymodels package. Among the adjustable parameters for the
XGBoost regressor, we will focus on trees, which controls
the number of trees used in the model.
Initially, we define the model and identify the specific argument to
be tuned. For tuning solely the trees parameter, we assign
the tune() function to it.
# 1. Mark trees as parameter to be tuned
xgbr_tune <- boost_tree(trees = tune()) %>%
set_mode("regression") %>%
set_engine("xgboost") Next, we define the recipe, which outlines the steps for preparing our data for modeling. This process can involve various preprocessing tasks. In the the following code, we specify what to predict, the predictors to use, and the dataset.
Subsequently, we integrate the model and recipe using the
workflow() function. The following code demonstrates
creating a tuning workflow that combines the XGBoost regressor with our
LC50 dataset.
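A sketch of that workflow; xgb_wflow is the name referenced in the tuning call below, and the recipe object is the one assumed above. Printing trees() from dials then shows the default tuning range for the number of trees.
# 3. Combine the tunable model and the recipe into one workflow
xgb_wflow <- workflow() %>%
  add_model(xgbr_tune) %>%
  add_recipe(xgb_recipe)
# inspect the default tuning range for trees
trees()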
#> # Trees (quantitative)
#> Range: [1, 2000]
Executing the above code shows that the default tuning range spans
from 1 to 2000. However, we want to specifically tune the
trees value between 7 and 15. To do this, we can manually
create a tibble with our desired range.
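For instance (a sketch; trees_values is the grid object passed to tune_grid() below):
# 4. Candidate values for the trees parameter
trees_values <- tibble(trees = 7:15)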
Next, we will define the benchmark for determining the best trees value, using MAPE (Mean Absolute Percentage Error) as the standard of measurement.
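A sketch using yardstick's metric_set(); mape_res is the name supplied to the metrics argument below.
# 5. Evaluate candidates with MAPE
mape_res <- metric_set(mape)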
Finally, we are ready to tune our model using the
tune_grid() function.
# 6. Tune the model using the specified range for trees
tuning_results <- tune_grid(
xgb_wflow,
grid = trees_values,
resamples = vfold_cv(train_data, v = 5),
metrics = mape_res
)

Let's inspect the best parameter for our XGBoost regressor.
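One way to do this is with tune's show_best() helper (sketch):
# rank the candidate trees values by cross-validated MAPE
show_best(tuning_results, metric = "mape")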
Based on the results, trees = 8 yields the lowest MAPE in the 5-fold cross-validation. We will remodel our LC50 data using this outcome.
xgbr_tune8 <- boost_tree(trees = 8) %>%
set_mode("regression") %>%
set_engine("xgboost") %>%
  fit(LC50 ~ ., data = train_data)

# prediction on training data and testing data with tuned model from tidymodels
y_train_pred <- predict(object = xgbr_tune8, new_data = train_data)
y_test_pred <- predict(object = xgbr_tune8, new_data = test_data)

xgbr_tm_tuned_results <- data.frame(
  adj_r2_train = adj_r2(y_train_pred$.pred, train_data$LC50, nrow(train_data), ncol(train_data)),
  MAPE_train = MAPE(y_train_pred$.pred, train_data$LC50),
  adj_r2_test = adj_r2(y_test_pred$.pred, test_data$LC50, nrow(test_data), ncol(test_data)),
  MAPE_test = MAPE(y_test_pred$.pred, test_data$LC50)
)
xgbr_tm_tuned_results

Compared to our untuned model, xgbr_base_tm, the
xgbr_tune8 model yields a higher error rate in both the
training and testing data. This suggests that a greater number of trees
(as xgbr_base_tm has 15 trees by default) may lead to
better predictions than a model with fewer trees. However, both
xgbr_base_tm and xgbr_tune8 still suffer from
overfitting, as indicated by their strong performance on the training
data but weaker performance on the testing data.
For the next part, we will use xgbr_base_tm and explain its predictions for individual observations using the lime package.
So far, our best model is xgbr_base_tm. It has an
adjusted r-squared of 68.96% and a MAPE of 19.62% on the test data. The
XGBoost regressor is good at finding complicated patterns but it’s hard
to understand how it makes its predictions.
Luckily, R has a tool to help with this. It’s called the
lime package, short for Local Interpretable Model-Agnostic
Explanations. This tool can explain any kind of predictive model, even
complex ones like ours.
Now, let’s use the lime function to better understand
our model. Here’s the code we’ll use.
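Before explaining individual predictions, an explainer object has to be built from the training predictors and the fitted model. The sketch below assumes lime's built-in support for parsnip model fits; if the model class were not recognized, methods for lime's model_type() and predict_model() generics would be needed. The name explainer matches the one used in the explain() call.
# build the lime explainer from the training predictors and the fitted XGBoost model
explainer <- lime(x = train_data %>% select(-LC50),
                  model = xgbr_base_tm)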
# select only the first 4 observations of test_data
selected_data <- test_data %>%
  select(-LC50) %>%
  dplyr::slice(1:4)

explanation <- explain(x = selected_data,
                       explainer = explainer,
                       feature_select = "auto", # method of feature selection for lime
                       n_features = 6           # number of features to explain the model
)

Now, we will visualize the explanation using the plot_features() function.
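A one-line sketch:
# feature-contribution bar plots for the explained cases
plot_features(explanation)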
As we can see from the figure, our plot contains 4 observations and the details for each.
1. Case

Case indicates the index of our observation.

- Case = 1 means the first row of our test data.
- Case = 2 means the second row of our test data.

2. Prediction

Prediction refers to the predicted lethal concentration LC50 given the predictors. Now, we will compare the predicted values with our actual values.
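The actual values can be pulled straight from the test set (sketch):
# actual LC50 for the first four test observations
test_data$LC50[1:4]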
#> [1] 3.510 5.390 1.819 3.513
The prediction for the second observation is further off than those for the first, third, and fourth ones. This suggests that our model may not be as accurate in certain cases, indicating a need for further investigation or adjustment.
3. Explanation Fit

Explanation Fit reveals how well lime interprets our model, expressed as the r-squared of its local linear regression. Given the figure, lime is only able to explain <20% of the variation in our data. From this result, we can conclude that lime captures only a small part of our model's behaviour.
4. Feature Barplot
For each observation, lime generates bar plots in red
and blue. Red reveals a negative contribution to our target and blue
otherwise. Each observation shows a different pattern regarding which
variables strengthen the lethal concentration value.
#> [1] "CIC0" "SM1_Dz.Z." "GATS1i" "NdsCH" "NdssC" "MLOGP"
#> [7] "LC50"
Interestingly, for all 4 cases, CIC0, SM1_Dz(Z), NdsCH, and NdssC contribute in the negative direction towards the LC50 value. Specifically, this occurs when:

- CIC0 <= 3.43,
- SM1_Dz(Z) <= 0.889,
- NdsCH and NdssC <= 1.

GATS1i contributes in the positive direction in cases 2 and 4 (GATS1i <= 0.953). MLOGP contributes in the positive direction in case 4 (2.13 < MLOGP <= 3.05). Both variables work in the negative direction when their values fall outside the aforementioned ranges.
dtr_results$model <- "Base Decision Tree Regressor"
rfr_results$model <- "Base Random Forest Regressor"
xgbr_tm_results$model <- "Base XGBoost Regressor"
xgbr_tm_tuned_results$model <- "Fine-Tuned XGBoost Regressor"
summary <- rbind(dtr_results,
rfr_results,
xgbr_tm_results,
xgbr_tm_tuned_results)
datatable(summary[, c(ncol(summary), 1:(ncol(summary) - 1))])

To predict the lethal concentration LC50, we explored three regression model variants: the decision tree regressor, the random forest regressor, and the XGBoost regressor. The results table clearly indicates that the XGBoost regressor outperforms the other two models in predicting the lethal concentration of chemicals affecting aquatic life.

Our best model is xgbr_base_tm, set by default to trees = 15. It demonstrates an impressive performance with 89.46% adjusted r-squared on training data and 68.96% on testing data. In terms of error rates, it registers 13.67% MAPE on training data and 19.62% on testing data. These results still show indications of overfitting, since the model performs markedly better on the training set than on the testing set.