1 Introduction

Aquatic Biota

1.1 Problem Background

Our water environments, like rivers and lakes, are increasingly exposed to various chemicals, which can harm aquatic life, especially fish. Traditionally, assessing the toxicity of these chemicals involves lengthy and complex laboratory tests, which are not only time-consuming but also costly. With a vast array of chemicals in use, there’s a pressing need for a more streamlined and efficient method to predict their impact on aquatic ecosystems.

The urgency to address this issue is twofold. Firstly, quick and accurate identification of harmful chemicals is crucial to protect aquatic species and maintain the health of our water ecosystems. Secondly, a simplified and efficient approach saves valuable time and resources, allowing for the rapid screening of numerous chemicals, which is vital given the ever-increasing number of chemicals introduced into the environment.

This project aims to create regression models to predict the toxic effects of chemicals on aquatic life. Our approach involves utilizing regression techniques to establish relationships between chemical properties and their toxicity levels. The models will be trained and validated to ensure their accuracy and applicability. By doing so, we aspire to provide a practical tool for swiftly assessing chemical toxicity, reducing the need for extensive lab testing, and ultimately aiding in the preservation of aquatic ecosystems.

1.2 Dataset

In this project, we will use the QSAR fish toxicity dataset. This dataset consists of 908 observations and 7 columns: six columns are molecular descriptors and one column is the toxicity response.

  • CIC0: indicates the variety of atom types in a molecule.
  • SM1_Dz(Z): shows the presence and variety of non-carbon atoms in the molecule.
  • GATS1i: measures the similarity of bonded atoms’ ionization potential.
  • NdsCH & NdssC: counts of specific carbon atom types - NdsCH counts =CH- carbons (one double and one single bond, plus a hydrogen) and NdssC counts =C< carbons (one double and two single bonds).
  • MLOGP: measures a compound’s tendency to dissolve in fats over water.
  • LC50: the concentration of a substance that causes death in 50% of a test population, like fish, within a set period. Lower values indicate higher toxicity.

1.3 Problem Statement, Predictor and Target Variable Identification

Given the provided background and dataset, the problem statement can be articulated as follows:

Building regression models to predict toxicity levels based on several molecular descriptors.

  • Target variable: LC50.
  • Predictor variables: molecular descriptors.

2 Load Packages

Below are the packages that we will use during the analysis.

# for data loading - data preprocessing
library(readxl)
library(dplyr)
library(tidyr)

# for exploratory data analysis
library(psych)
library(ggplot2)
library(GGally)
library(gridExtra)

# for modeling - evaluation
library(caret)
library(partykit)
library(randomForest)
library(xgboost)
library(tidymodels)
library(MLmetrics)

# for model interpretation
library(lime)

# for interactive table display
library(DT)

# for reproducible analysis
set.seed(123)

3 Load Dataset

data <- read.csv("data_input/qsar_fish_toxicity.csv", sep=';')

4 Data Cleansing

In the initial phase of our analysis, we will focus on data cleansing to guarantee precision in our analysis and modeling process. Data cleansing involves identifying and rectifying any inaccuracies, corruptions, formatting errors, duplications, or incomplete information within our dataset. To begin, it’s essential to familiarize ourselves with the data structure.

glimpse(data)
#> Rows: 908
#> Columns: 7
#> $ CIC0      <dbl> 3.260, 2.189, 2.125, 3.027, 2.094, 3.222, 3.179, 3.000, 2.62…
#> $ SM1_Dz.Z. <dbl> 0.829, 0.580, 0.638, 0.331, 0.827, 0.331, 0.000, 0.000, 0.49…
#> $ GATS1i    <dbl> 1.676, 0.863, 0.831, 1.472, 0.860, 2.177, 1.063, 0.938, 0.99…
#> $ NdsCH     <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, …
#> $ NdssC     <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
#> $ MLOGP     <dbl> 1.453, 1.348, 1.348, 1.807, 1.886, 0.706, 2.942, 2.851, 2.94…
#> $ LC50      <dbl> 3.770, 3.115, 3.531, 3.510, 5.390, 1.819, 3.947, 3.513, 4.40…

Below are the first six rows of our observations.

head(data)

The dataset consists of seven numeric columns: six molecular descriptors and the associated LC50 toxicity values. With a total of 908 rows, it covers a wide range of chemical compounds. The data is already in a suitable numeric format, requiring no further changes to the data types.

Now, let’s check whether our data is complete by looking for missing (NA) values. Missing entries can cause problems in our analysis, so it is important to find out whether there are any and decide how to handle them to keep our data reliable and useful.

colSums(is.na(data))
#>      CIC0 SM1_Dz.Z.    GATS1i     NdsCH     NdssC     MLOGP      LC50 
#>         0         0         0         0         0         0         0

After checking, we found no missing values in our data.

To clean our data, it’s essential to remove duplicate entries. Duplicates can skew the modeling process by giving undue weight to repeated information. We’ll identify and discard any such repeats for a more accurate analysis.

data[duplicated(data),]

We have identified one duplicated row. We will remove it to maintain the uniqueness of our data.

# drop duplicated data

duplicates <- duplicated(data)
data_clean <- data[!duplicates, ]
# rechecking
data_clean[duplicated(data_clean),]

Our data now is clean and ready for analysis.

5 Data Analysis

5.1 Univariate: Outliers

Checking for outliers, which are data points significantly differing from others in a dataset, is a crucial step in data analysis. They can skew statistical measures and mislead interpretations. Identifying outliers helps in deciding whether to retain, modify, or remove them, ensuring more accurate analysis. This step also aids in better understanding the data, spotting anomalies, and upholding the integrity of subsequent analysis stages. Hence, outlier detection is essential for a robust data analysis process.

To spot these outliers, we utilize the base R function boxplot().

boxplot(data_clean$CIC0, horizontal = T)

boxplot(data_clean$SM1_Dz.Z., horizontal = T)

boxplot(data_clean$GATS1i, horizontal = T)

boxplot(data_clean$NdsCH, horizontal = T)

boxplot(data_clean$NdssC, horizontal = T)

boxplot(data_clean$MLOGP, horizontal = T)

Looking at the boxplots, we see outliers in all the columns. There are different ways to handle these unusual points, like taking them out or keeping them. We’ve decided to keep them in our data. This is because sometimes, these outliers can tell us something important about what we’re studying. By keeping them, we make sure our analysis covers everything in the data, even the unusual parts.
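
As a quick cross-check of the boxplots, we can also count how many points fall outside the usual 1.5 × IQR whiskers that boxplot() draws. Below is a minimal sketch; the count_outliers() helper and the 1.5 multiplier are illustrative choices, not part of the original workflow.

# count values outside the 1.5 * IQR whiskers for each descriptor (illustrative helper)
count_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75))           # first and third quartiles
  iqr <- q[2] - q[1]                                # interquartile range
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)  # points beyond the whiskers
}

sapply(data_clean[, c("CIC0", "SM1_Dz.Z.", "GATS1i", "NdsCH", "NdssC", "MLOGP")],
       count_outliers)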

5.2 Multivariate: Correlation

In this step, we’ll examine the relationships between different variables in our dataset through correlation analysis. This process helps us understand how variables are connected. If we find that two variables often increase or decrease together, they have a positive correlation. Conversely, if one goes up when the other goes down, that’s a negative correlation.

Understanding these relationships is vital for interpreting our data accurately. We’ll use the ggcorr() function from the GGally package to conduct this analysis.

ggcorr(data_clean, label = T, hjust = 0.8)

The correlation plot shows that most variable pairs have weak correlations. Notably:

  • MLOGP and LC50 have a strong positive correlation of 0.7, indicating they tend to increase together.
  • CIC0 and MLOGP have a moderate positive correlation of 0.5.
  • GATS1i and MLOGP have a moderate negative correlation of -0.5, meaning they tend to move in opposite directions.

Further insights can be gained by examining scatterplots of these variable pairs.

# scatterplot for MLOGP and LC50

ggplot(data_clean, aes(x = MLOGP, y = LC50)) +
  geom_point() +
  labs(title = "MLOGP vs LC50", 
       x = "MLOGP", 
       y = "LC50")

The scatterplot above displays a spread of data points indicating that as the MLOGP value increases, the LC50 value also tends to increase. This suggests that substances with a higher capability to dissolve in fats (as indicated by MLOGP) may also have a higher lethal concentration value, implying a lower toxicity.

# scatterplot for CIC0 and MLOGP

ggplot(data_clean, aes(x = CIC0, y = MLOGP)) +
  geom_point() +
  labs(title = "CIC0 vs MLOGP", 
       x = "CIC0", 
       y = "MLOGP")

Here, the points form a slight upward trend, suggesting that as the topological structure complexity (CIC0) increases, there might be a slight increase in lipophilicity (MLOGP), but the relationship is not strongly pronounced.

# scatterplot for GATS1i and MLOGP

ggplot(data_clean, aes(x = GATS1i, y = MLOGP)) +
  geom_point() +
  labs(title = "GATS1i vs MLOGP", 
       x = "GATS1i", 
       y = "MLOGP")

The scatterplot displays the relationship between GATS1i and MLOGP. As GATS1i increases, MLOGP tends to decrease, which is indicated by a correlation of -0.5. The spread of the points suggests a moderate inverse relationship between the two variables.

6 Modeling

Having explored our data and gained key insights, we are now poised to take the next step: building a predictive regression model. This model will aim to predict the lethal concentration of substances based on their molecular properties. For our project, we will explore three robust regression models: the decision tree regressor, the random forest regressor, and the extreme gradient boosting (XGBoost) regressor.

6.1 Train-Test Splitting

For modeling purposes, we’ll split the data into a training set and a testing set. We’ll assign 80% of the data to the training set, where our model will learn, and the remaining 20% to the testing set, where we’ll evaluate how well the model has learned. This way, we can train our model well and also test it fairly.

This splitting utilizes the sample() function, which returns a vector of row indices for the training data.

# for reproducibility
RNGkind(sample.kind = "Rounding")
set.seed(100)

# define train and test proportion
index <- sample(nrow(data_clean), nrow(data_clean) * 0.8)

# subset
train_data <- data_clean[index,]
test_data <- data_clean[-index,]
# check the result

dim(train_data)
#> [1] 725   7
dim(test_data)
#> [1] 182   7

Now, we will train the data with 725 observations and test the result with 182 observations.

6.2 Model Building

In this project, we will focus on three tree-based regression models: the decision tree regressor, the random forest regressor, and the extreme gradient boosting (XGBoost) regressor.

6.2.1 Decision Tree Regressor

6.2.1.1 How Decision Tree Regressor Works

A decision tree regressor operates by dividing a dataset into distinct regions based on specific criteria. Suppose we have a dataset where we want to predict LC50, a measure of toxicity, using molecular descriptors like CIC0, SM1_Dz(Z), GATS1i, NdssC, NdsCH, and MLOGP. Each of these descriptors gives us different insights into the molecules, such as the variety of atom types, the presence of non-carbon atoms, and the molecule’s solubility.

We begin by examining all the data. The decision tree first looks for the best way to split the molecules into groups based on one of these descriptors. For instance, it might divide them based on CIC0, grouping molecules with similar types of atoms. The goal is to create groups that are as uniform as possible in terms of their LC50 values.

At each step, we evaluate each molecular descriptor to determine the best way to further divide the groups. This is done using a mathematical formula, like mean squared error (MSE), which helps us gauge the effectiveness of the split. We continue this process of splitting the dataset, creating smaller and more specific groups, until we reach a certain condition, such as a maximum tree depth or groups that are homogeneous enough in their LC50 values.

The final model is a tree whose leaf nodes each represent a group of molecules with a similar predicted LC50 value, usually the average LC50 of the molecules in that leaf. When we introduce a new molecule with its descriptors into the model, the tree navigates to the appropriate leaf based on these features, and the predicted LC50 is the average value of that leaf’s group. This method allows us to effectively predict LC50 from the provided molecular characteristics.
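
To make the splitting criterion concrete, the short sketch below evaluates one hypothetical split on MLOGP by comparing the sum of squared errors (SSE) of LC50 before and after the split. The cut-off of 2.6 and the sse() helper are purely illustrative; ctree(), which we use below, chooses its splits via statistical tests rather than this exact search.

# SSE of LC50 around the group mean
sse <- function(y) sum((y - mean(y))^2)

# hypothetical split: MLOGP <= 2.6 vs MLOGP > 2.6 (illustrative cut-off)
left  <- train_data$LC50[train_data$MLOGP <= 2.6]
right <- train_data$LC50[train_data$MLOGP > 2.6]

# reduction in SSE achieved by this split: the larger, the better the split
sse(train_data$LC50) - (sse(left) + sse(right))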

6.2.1.2 Implementation

Now that we understand how a decision tree regressor is constructed from our dataset, our next step is to implement it in R. We’ll create our model using the ctree() function from the partykit package. This function is versatile, handling both classification and regression tasks. To use it, we need to specify two main parameters: formula and data. The formula parameter defines our target variable (in this case, LC50) and the predictors we want to use (such as CIC0, SM1_Dz(Z), GATS1i, NdssC, NdsCH, and MLOGP). The data parameter is where we input our training dataset. By setting these parameters, we can efficiently build a decision tree regressor tailored to our specific requirements in R.

# predict LC50 with all predictors
dtr_base <- ctree(formula = LC50 ~ .,
                 data = train_data)

To visualize the structure of our decision tree regressor, we can utilize the plot() function in R. This will enable us to see a graphical representation of the tree, showcasing how the data is split at each node and providing insight into the decision-making process of the model.

# visualize the tree
plot(dtr_base, type="simple")

dtr_base
#> 
#> Model formula:
#> LC50 ~ CIC0 + SM1_Dz.Z. + GATS1i + NdsCH + NdssC + MLOGP
#> 
#> Fitted party:
#> [1] root
#> |   [2] MLOGP <= 2.621
#> |   |   [3] MLOGP <= 1.595
#> |   |   |   [4] SM1_Dz.Z. <= 0.629
#> |   |   |   |   [5] NdsCH <= 0
#> |   |   |   |   |   [6] GATS1i <= 1.59: 2.849 (n = 60, err = 54.6)
#> |   |   |   |   |   [7] GATS1i > 1.59
#> |   |   |   |   |   |   [8] MLOGP <= 0.8: 1.772 (n = 50, err = 34.1)
#> |   |   |   |   |   |   [9] MLOGP > 0.8: 2.801 (n = 20, err = 14.2)
#> |   |   |   |   [10] NdsCH > 0: 3.741 (n = 27, err = 20.8)
#> |   |   |   [11] SM1_Dz.Z. > 0.629: 3.577 (n = 95, err = 181.4)
#> |   |   [12] MLOGP > 1.595
#> |   |   |   [13] SM1_Dz.Z. <= 0.83
#> |   |   |   |   [14] NdsCH <= 0
#> |   |   |   |   |   [15] MLOGP <= 2.056: 3.446 (n = 55, err = 21.2)
#> |   |   |   |   |   [16] MLOGP > 2.056: 3.842 (n = 82, err = 24.5)
#> |   |   |   |   [17] NdsCH > 0
#> |   |   |   |   |   [18] SM1_Dz.Z. <= 0.331: 3.775 (n = 14, err = 3.6)
#> |   |   |   |   |   [19] SM1_Dz.Z. > 0.331: 4.735 (n = 18, err = 10.6)
#> |   |   |   [20] SM1_Dz.Z. > 0.83: 4.579 (n = 62, err = 55.6)
#> |   [21] MLOGP > 2.621
#> |   |   [22] MLOGP <= 3.818
#> |   |   |   [23] SM1_Dz.Z. <= 1.192: 4.615 (n = 135, err = 79.3)
#> |   |   |   [24] SM1_Dz.Z. > 1.192: 5.939 (n = 22, err = 35.5)
#> |   |   [25] MLOGP > 3.818: 5.944 (n = 85, err = 120.7)
#> 
#> Number of inner nodes:    12
#> Number of terminal nodes: 13

In the tree diagram, MLOGP sits at the root node. This means it is the most informative variable for predicting LC50: the model splits on MLOGP first and then makes further decisions based on the other descriptors. Since MLOGP describes how readily a molecule dissolves in fats, this suggests that lipophilicity is a key factor in understanding how toxic a compound might be.

6.2.1.3 Model Performance

Next, we will evaluate the performance of our dtr_base model on both the training and testing datasets, focusing on two key metrics: adjusted r-squared and MAPE (Mean Absolute Percentage Error). It’s important to assess the model’s effectiveness on both sets because this helps us understand how well it can apply what it’s learned to new, unseen data. By doing this, we ensure that our model isn’t just memorizing the training data but is actually learning patterns that are generally true and can be applied broadly.
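
For reference, MAPE is simply the average of the absolute percentage differences between predictions and actual values. The sketch below mirrors what we expect MLmetrics::MAPE() to compute (note its argument order: predictions first, then actuals); it is included only to make the metric explicit and is not used in the analysis.

# mean absolute percentage error: mean(|actual - predicted| / |actual|)
mape_manual <- function(y_pred, y_true) {
  mean(abs((y_true - y_pred) / y_true))
}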

# custom function for calculating adj. r-squared

adj_r2 <- function(preds, actual, n, p){
  # n = number of observations, p = number of columns used in the penalty term
  
  rss <- sum((preds - actual) ^ 2)         # residual sum of squares
  tss <- sum((actual - mean(actual)) ^ 2)  # total sum of squares
  rsq <- 1 - rss/tss                       # ordinary r-squared

  return (1 - (((1-rsq)*(n-1))/(n-p-1)))   # penalize for the number of predictors
}
# prediction on training data and testing data

y_train_pred <- predict(object = dtr_base, newdata = train_data)
y_test_pred <- predict(object = dtr_base, newdata = test_data)
dtr_results = data.frame(
  adj_r2_train = adj_r2(y_train_pred, train_data$LC50, nrow(train_data), ncol(train_data)),
  MAPE_train = MAPE(y_train_pred, train_data$LC50),
  adj_r2_test = adj_r2(y_test_pred, test_data$LC50, nrow(test_data), ncol(test_data)),
  MAPE_test = MAPE(y_test_pred, test_data$LC50)
)

dtr_results

💡 Insight:

  • The adjusted r-squared shows that our model explains about 56.81% of the variations in LC50 for the training data and 54.85% for the testing data.
  • With the MAPE, we found that, on average, our predictions are off by about 29.36% for the training data and about 23.75% for the testing data.

These results aren’t very good. It looks like the decision tree regressor might be too simple for what we’re trying to do, and it’s not predicting LC50 as accurately as we hoped. Next, we will be experimenting with random forest regressor.

6.2.2 Random Forest Regressor

6.2.2.1 How Random Forest Regressor Works

A random forest regressor is an advanced model that combines many decision trees to improve prediction accuracy. It creates these trees using different random samples of data, a process known as ‘bootstrapping’. Each tree also uses a randomly selected set of features, making each one unique and helping the overall model avoid errors that a single tree might make. When it comes to making predictions, the random forest runs the input through all its trees. Each tree gives its own prediction, and the final output is the average of these. This method not only makes the model more accurate but also helps it perform well on new, unseen data by reducing overfitting.

The random forest is versatile, good for both regression and classification, and can handle a mix of data types, making it a popular choice in many predictive modeling tasks.
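
The bagging idea behind the random forest can be sketched by hand: draw bootstrap samples, fit one tree per sample, and average the predictions. The illustration below uses ctree() and is only a sketch of the mechanism; the actual random forest trained next also samples a random subset of predictors at each split, which this sketch omits.

# a hand-rolled mini ensemble of bagged trees (illustration only)
n_trees <- 25
bagged_trees <- lapply(1:n_trees, function(i) {
  boot_idx <- sample(nrow(train_data), replace = TRUE)  # bootstrap sample of rows
  ctree(LC50 ~ ., data = train_data[boot_idx, ])
})

# final prediction = average of the individual tree predictions
bagged_preds <- rowMeans(sapply(bagged_trees, predict, newdata = test_data))
head(bagged_preds)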

6.2.2.2 Implementation

Previously, we utilized the ctree() function from the partykit package, specifically designed for decision trees. However, for our next step in model building, we’ll use the train() function from the caret package, a high-level abstraction function. This function allows us to experiment with different models simply by modifying the method parameter. In this case, as we aim to build a random forest regressor, we will set method = "rf".

rfr_base <- train(LC50 ~ .,
                  data = train_data,
                  method = "rf")

Since a random forest has many trees and it’s hard to show them all at once, we use something called ‘feature importance’ to understand the model better. Feature importance tells us how important each part of our data is for making good predictions. We can find this out by using the varImp() function.

varImp(rfr_base$finalModel)

Based on what we found, MLOGP is still the most important thing for predicting LC50, with SM1_Dz(Z) and GATS1i also being really important. This means that the model really pays attention to MLOGP, showing that how a molecule dissolves is key in figuring out its toxicity. SM1_Dz(Z) and GATS1i are also big factors, telling us that the types of atoms in a molecule and their properties are crucial for making good predictions about LC50.

6.2.2.3 Model Performance

Similarly to the decision tree regressor, we will evaluate the model’s performance on both the training and testing data.

# prediction on training data and testing data

y_train_pred <- predict(object = rfr_base, newdata = train_data)
y_test_pred <- predict(object = rfr_base, newdata = test_data)
rfr_results = data.frame(
  adj_r2_train = adj_r2(y_train_pred, train_data$LC50, nrow(train_data), ncol(train_data)),
  MAPE_train = MAPE(y_train_pred, train_data$LC50),
  adj_r2_test = adj_r2(y_test_pred, test_data$LC50, nrow(test_data), ncol(test_data)),
  MAPE_test = MAPE(y_test_pred, test_data$LC50)
)

rfr_results

💡 Insight:

  • The adjusted r-squared shows that our model explains about 90.72% of the variations in LC50 for the training data and 69.13% for the testing data.
  • With the MAPE, we found that, on average, our predictions are off by about 13.82% for the training data and about 19.90% for the testing data.

Based on the adjusted r-squared and MAPE, it is evident that the random forest regressor performs better than the decision tree regressor. However, the significant gap between training and testing performance leads us to conclude that this model overfits the training data and has limited capability in generalizing to new, unseen data. Next, we will attempt to model the dataset using an XGBoost regressor.

6.2.3 XGBoost Regressor

6.2.3.1 How XGBoost Regressor Works

XGBoost works by taking the concept of decision trees to the next level. It begins by creating a basic decision tree, but instead of stopping there, XGBoost continuously builds more trees, each one designed to correct the errors made by the previous ones. Imagine it like a team where each new member learns from the mistakes of the earlier members, gradually improving the overall performance. XGBoost does this through a technique called ‘boosting’, where each new tree focuses on the tricky parts that previous trees struggled with. It’s like solving a puzzle by focusing on the hardest parts one step at a time.

What’s special about XGBoost is its efficiency and precision. It’s not just randomly adding trees; it’s doing it in a smart, calculated way. The algorithm also includes ways to prevent overfitting, which means it avoids getting too caught up in the specific details of the training data, making it more reliable for predicting new, unseen data. In essence, XGBoost creates a highly refined and collaborative system of trees that work together to make more accurate predictions, especially useful in handling complex datasets.
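
The boosting mechanism itself can be illustrated in a few lines of R: each new shallow tree is fit to the residuals of the current prediction, and its shrunken output is added to the running prediction. The sketch below uses ctree() with a small depth purely to show this idea; it is not XGBoost, which adds regularization, second-order gradient information, and many efficiency tricks on top of it.

# hand-rolled boosting for squared error (illustration only)
n_rounds <- 20
learning_rate <- 0.3
pred <- rep(mean(train_data$LC50), nrow(train_data))   # start from the mean

for (m in 1:n_rounds) {
  boost_df <- train_data
  boost_df$resid <- train_data$LC50 - pred             # what is still unexplained
  boost_df$LC50 <- NULL                                # drop the original target
  tree_m <- ctree(resid ~ ., data = boost_df,
                  control = ctree_control(maxdepth = 2))
  pred <- pred + learning_rate * predict(tree_m)       # add a shrunken correction
}

# training MAPE of this boosted sketch
MAPE(pred, train_data$LC50)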

6.2.3.2 Implementation

In implementing the XGBoost regressor, we will use the same approach as with the random forest regressor, with the only change being setting method = "xgbTree".

xgbr_base <- train(LC50 ~ .,
                  data = train_data,
                  method = "xgbTree")

Alternatively, we can construct an XGBoost regressor using the boost_tree() function from the tidymodels package.

xgbr_base_tm <- boost_tree() %>% 
                set_mode("regression") %>% 
                set_engine("xgboost") %>% 
                fit(LC50 ~ ., data = train_data)

6.2.3.3 Model Performance

# prediction on training data and testing data with model from tidymodels

y_train_pred <- predict(object = xgbr_base_tm, new_data = train_data)
y_test_pred <- predict(object = xgbr_base_tm, new_data = test_data)
xgbr_tm_results = data.frame(
  adj_r2_train = adj_r2(y_train_pred, train_data$LC50, nrow(train_data), ncol(train_data)),
  MAPE_train = MAPE(y_train_pred$.pred, train_data$LC50),
  adj_r2_test = adj_r2(y_test_pred, test_data$LC50, nrow(test_data), ncol(test_data)),
  MAPE_test = MAPE(y_test_pred$.pred, test_data$LC50)
)

xgbr_tm_results

💡 Insight:

  • The adjusted r-squared shows that our model explains about 89.46% of the variations in LC50 for the training data and 68.96% for the testing data.
  • With the MAPE, we found that, on average, our predictions are off by about 13.67% for the training data and about 19.62% for the testing data.

Based on the results from the three regressors, XGBoost achieved a relatively lower error compared to the random forest and decision tree regressors. However, like the random forest, XGBoost demonstrated a tendency to perform well on training data but underperform on testing data. This pattern suggests a potential issue of overfitting. Therefore, we plan to fine-tune this model to better control its complexity and improve its generalization on unseen data.

6.3 Model Improvement: Fine Tuning

In this section, we will fine-tune the XGBoost regressor using the tidymodels package. Among the adjustable parameters for the XGBoost regressor, we will focus on trees, which controls the number of trees used in the model.

Initially, we define the model and identify the specific argument to be tuned. For tuning solely the trees parameter, we assign the tune() function to it.

# 1. Mark trees as parameter to be tuned
xgbr_tune <- boost_tree(trees = tune()) %>% 
                set_mode("regression") %>% 
                set_engine("xgboost") 

Next, we define the recipe, which outlines the steps for preparing our data for modeling. This process can involve various preprocessing tasks. In the following code, we specify what to predict, the predictors to use, and the dataset.

# 2. Define recipe
rcp <- recipe(LC50 ~ ., data = train_data)

Subsequently, we integrate the model and recipe using the workflow() function. The following code demonstrates creating a tuning workflow that combines the XGBoost regressor with our LC50 dataset.

# 3. Make workflow 
xgb_wflow <- 
  workflow() %>% 
  add_model(xgbr_tune) %>% 
  add_recipe(rcp)
xgb_wflow %>% extract_parameter_dials("trees")
#> # Trees (quantitative)
#> Range: [1, 2000]

Executing the above code shows that the default tuning range spans from 1 to 2000. However, we want to specifically tune the trees value between 7 and 15. To do this, we can manually create a tibble with our desired range.

# 4. Create a tibble with the specific range for trees
trees_values <- tibble(trees = 7:15)

Next, we will define the benchmark for determining the best trees value, using MAPE (Mean Absolute Percentage Error) as the standard of measurement.

# 5. Define specific metrics
mape_res <- metric_set(mape)

Finally, we are ready to tune our model using the tune_grid() function.

# 6. Tune the model using the specified range for trees
tuning_results <- tune_grid(
  xgb_wflow,
  grid = trees_values,
  resamples = vfold_cv(train_data, v = 5),
  metrics = mape_res
)

Let’s inspect the best parameter for our XGBoost regressor.

show_best(tuning_results, n = 1)

Based on the results, trees = 8 yields the lowest MAPE in the 5-fold cross-validation. We will remodel our LC50 data using this outcome.

xgbr_tune8 <- boost_tree(trees = 8) %>% 
                set_mode("regression") %>% 
                set_engine("xgboost") %>% 
                fit(LC50 ~ ., data = train_data)
# prediction on training data and testing data with tuned model from tidymodels

y_train_pred <- predict(object = xgbr_tune8, new_data = train_data)
y_test_pred <- predict(object = xgbr_tune8, new_data = test_data)
xgbr_tm_tuned_results = data.frame(
  adj_r2_train = adj_r2(y_train_pred, train_data$LC50, nrow(train_data), ncol(train_data)),
  MAPE_train = MAPE(y_train_pred$.pred, train_data$LC50),
  adj_r2_test = adj_r2(y_test_pred, test_data$LC50, nrow(test_data), ncol(test_data)),
  MAPE_test = MAPE(y_test_pred$.pred, test_data$LC50)
)

xgbr_tm_tuned_results
xgbr_tm_results

Compared to our untuned model, xgbr_base_tm, the xgbr_tune8 model yields a higher error rate on both the training and testing data. This suggests that a larger number of trees (xgbr_base_tm uses 15 trees by default) may lead to better predictions than a model with fewer trees. However, both xgbr_base_tm and xgbr_tune8 still suffer from overfitting, as indicated by their strong performance on the training data but weaker performance on the testing data.
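
Our statement that xgbr_base_tm uses 15 trees relies on parsnip’s default for the xgboost engine. As a sanity check (assuming the fitted xgb.Booster stored inside the parsnip object exposes its iteration count as niter), we can inspect the fitted engine object directly:

# number of boosting rounds used by the underlying xgb.Booster (assumed to be stored in niter)
xgbr_base_tm$fit$niter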

Next, we will use the lime package to explain the predictions of xgbr_base_tm for individual observations.

6.4 Model Interpretation

So far, our best model is xgbr_base_tm. It has an adjusted r-squared of 68.96% and a MAPE of 19.62% on the test data. The XGBoost regressor is good at finding complicated patterns but it’s hard to understand how it makes its predictions.

Luckily, R has a tool to help with this. It’s called the lime package, short for Local Interpretable Model-Agnostic Explanations. This tool can explain any kind of predictive model, even complex ones like ours.

Now, let’s use the lime function to better understand our model. Here’s the code we’ll use.

# create the explainer
explainer <- lime(x = train_data %>% select(-LC50), 
                  model = xgbr_base_tm)
# select only the first 4 observations of test_data
selected_data <- test_data %>% 
  select(-LC50) %>% 
  dplyr::slice(1:4)
selected_data
explanation <- explain(x = selected_data, 
                       explainer = explainer, 
                       feature_select = "auto", # method of feature selection for lime
                       n_features = 6 # number of features to explain the model
                       )

Now, we will visualize the explanation using the plot_features() function.

plot_features(explanation)

As we can see from the figure, our plot contains 4 observations and the details for each.

1. Case

Case indicates the index of our observation.

  • Case = 1 means the first row of our test data.
  • Case = 2 means the second row of our test data.
  • and so on.

2. Prediction

Prediction refers to the predicted lethal concentration (LC50) given the predictors. Now, we will compare the predicted values with the actual values.

test_data$LC50[1:4]
#> [1] 3.510 5.390 1.819 3.513

The prediction for the second observation deviates more from the actual value than the predictions for the first, third, and fourth observations. This suggests that our model may not be as accurate in certain cases, indicating a need for further investigation or adjustment.

3. Explanation Fit

Explanation Fit indicates how well lime’s local linear approximation explains our model, expressed as an r-squared value. Given the figure, lime is only able to explain less than 20% of the variation in our data, so it can be concluded that lime captures only a small part of our model’s behavior.

4. Feature Barplot

For each observation, lime generates bar plots in red and blue. Red indicates a negative contribution to our target and blue a positive one. Each observation shows a different pattern regarding which variables strengthen the lethal concentration value.

colnames(train_data)
#> [1] "CIC0"      "SM1_Dz.Z." "GATS1i"    "NdsCH"     "NdssC"     "MLOGP"    
#> [7] "LC50"

Interestingly, in all 4 cases, CIC0, SM1_Dz(Z), NdsCH, and NdssC contribute in the negative direction towards the LC50 value. Specifically, this occurs when:

  • CIC0 <= 3.43,
  • SM1_Dz(Z) <= 0.889,
  • NdsCH and NdssC <= 1.

GATS1i contributes in the positive direction in cases 2 and 4 (GATS1i <= 0.953). MLOGP contributes in the positive direction in case 4 (2.13 < MLOGP <= 3.05). Both variables work in the negative direction when their values fall outside the ranges mentioned above.

7 Conclusion

dtr_results$model <- "Base Decision Tree Regressor"
rfr_results$model <- "Base Random Forest Regressor"
xgbr_tm_results$model <- "Base XGBoost Regressor"
xgbr_tm_tuned_results$model <- "Fine-Tuned XGBoost Regressor"

summary <- rbind(dtr_results, 
      rfr_results, 
      xgbr_tm_results, 
      xgbr_tm_tuned_results)

datatable(summary[, c(ncol(summary), 1:(ncol(summary) - 1))])

  • In our quest to build a predictive model for estimating the lethal concentration LC50, we explored three regression models: the decision tree regressor, the random forest regressor, and the XGBoost regressor. The results table clearly indicates that the XGBoost regressor outperforms the other two models in predicting the lethal concentration of chemicals affecting aquatic life.
  • The standout model is xgbr_base_tm, which uses the default trees = 15. It demonstrates an impressive performance with 89.46% adjusted r-squared on training data and 68.96% on testing data. In terms of error rates, it registers 13.67% MAPE on training data and 19.62% on testing data. These results indicate overfitting, since the model performs markedly better on the training set than on the testing set.
  • To enhance our models and tackle overfitting:
    • Expand Parameter Tuning: We’ll explore more parameter tuning options for the XGBoost model to achieve a balanced performance.
    • Adjust Parameters: We plan to adjust existing model parameters, like learning rate and tree depth, for better generalization.
    • Data Preprocessing: Advanced data preprocessing steps, including feature selection and transformation, will be implemented for enhanced accuracy.
    • Fine Tuning Random Forest: As shown in the table, the random forest regressor performs slightly better than the XGBoost regressor in terms of adjusted r-squared. In the future, we could also consider fine-tuning this model.