# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE
)

1 Concrete Prediction: Will it last forever?

The problem of predicting concrete compression strength, a critical aspect in civil engineering, was originally proposed by Prof. I-Cheng Yeh from the Department of Information Management at Chung-Hua University, Hsin Chu, Taiwan, in 2007. This work builds upon his earlier research in 1998, which explored methods to accurately predict compression strength in concrete structures. As Prof. Yeh emphasized, concrete is the most important material in civil engineering, and its compression strength is a vital property that ensures the safety and durability of structures.

Concrete’s compression strength is not solely determined by the water-cement mixture but is also influenced by various other ingredients and the specific treatment of the mixture. Understanding how these factors interact is essential for optimizing concrete performance. Using the dataset provided by Prof. Yeh, this project aims to discover “the perfect recipe” for predicting concrete compression strength. It seeks to explain the relationship between the concentration of ingredients, the age at which testing is conducted, and the resulting compression strength.

Through the application of advanced machine learning techniques, this project endeavors to create a predictive model that not only enhances the accuracy of compression strength predictions but also provides valuable insights into the factors that influence this crucial property. The findings from this research have the potential to contribute significantly to the field of civil engineering by improving material design, quality control, and overall construction practices.

1.1 Goal

  1. predict the compression strength based on the mixture properties.

1.2 Data Wrangling

lets load in first the necessary library

library(keras)
library(dplyr)
library(tensorflow)
library(rsample)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(inspectdf)
library(gtools) 
library(caret)
library(readxl)
library(zoo)
library(imbalance)
library(class)
library(lime)
library(neuralnet)
library(randomForest)
library(yardstick)

lets load in the data we will be using into an object. We will name the object “concrete”

concrete <- read.csv("input/data-train.csv")
data_test <- read.csv("input/data-test.csv")
concrete

1.3 Linear Regression Models

1.3.1 Neural Network Regression


1.3.1.1 Exploratory Data Analysis

At this stage, we will perform data exploration using the “summary()” and “str()” functions. This function provides a summary of eight statistical calculations for columns with numerical data types and calculates the frequency of occurrences for columns with categorical data types. Let’s apply this function to the data we have.

summary(concrete)
##       id                cement           slag            flyash      
##  Length:825         Min.   :102.0   Min.   :  0.00   Min.   :  0.00  
##  Class :character   1st Qu.:194.7   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Median :275.1   Median : 20.00   Median :  0.00  
##                     Mean   :280.9   Mean   : 73.18   Mean   : 54.03  
##                     3rd Qu.:350.0   3rd Qu.:141.30   3rd Qu.:118.20  
##                     Max.   :540.0   Max.   :359.40   Max.   :200.10  
##      water        super_plast       coarse_agg        fine_agg    
##  Min.   :121.8   Min.   : 0.000   Min.   : 801.0   Min.   :594.0  
##  1st Qu.:164.9   1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:734.0  
##  Median :184.0   Median : 6.500   Median : 968.0   Median :780.1  
##  Mean   :181.1   Mean   : 6.266   Mean   : 972.8   Mean   :775.6  
##  3rd Qu.:192.0   3rd Qu.:10.100   3rd Qu.:1028.4   3rd Qu.:826.8  
##  Max.   :247.0   Max.   :32.200   Max.   :1145.0   Max.   :992.6  
##       age            strength    
##  Min.   :  1.00   Min.   : 2.33  
##  1st Qu.:  7.00   1st Qu.:23.64  
##  Median : 28.00   Median :34.57  
##  Mean   : 45.14   Mean   :35.79  
##  3rd Qu.: 56.00   3rd Qu.:45.94  
##  Max.   :365.00   Max.   :82.60
str(concrete)
## 'data.frame':    825 obs. of  10 variables:
##  $ id         : chr  "S1" "S2" "S3" "S4" ...
##  $ cement     : num  540 540 332 332 199 ...
##  $ slag       : num  0 0 142 142 132 ...
##  $ flyash     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ water      : num  162 162 228 228 192 228 228 228 192 192 ...
##  $ super_plast: num  2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ coarse_agg : num  1040 1055 932 932 978 ...
##  $ fine_agg   : num  676 676 594 594 826 ...
##  $ age        : int  28 28 270 365 360 365 28 28 90 28 ...
##  $ strength   : num  80 61.9 40.3 41 44.3 ...

1.3.1.2 Missing Values

We will now check for any missing values

concrete %>% is.na() %>% colSums()
##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0           0

1.3.1.3 correlation exploratory

ggcorr(concrete, label = TRUE, hjust = 0.8)

Insight
- age is weakly positive correlated with strength of the cement
- the amount of cement is somewhat strongly positive correlated with strength
- Super_plast has a positive linear correlation with the strength of cement

1.3.1.4 Data Preprocess

After completing the first two stages, the third stage you need to undertake is preparing the data to be more ready before implementing it into the machine learning model that will be used. For example, we want to perform cross-validation.

Cross-validation is a step we use to assess how well our model predicts new data.

So, how can we determine whether the model we’ve built is good at predicting new data? This is where Train-test splitting comes into play. We divide our data into two groups: training data and testing data. but before we start cross validation, we need to remove unwanted coloumns such as “id” coloums.

concrete_clean <- concrete %>% select(-id)
str(concrete_clean)
## 'data.frame':    825 obs. of  9 variables:
##  $ cement     : num  540 540 332 332 199 ...
##  $ slag       : num  0 0 142 142 132 ...
##  $ flyash     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ water      : num  162 162 228 228 192 228 228 228 192 192 ...
##  $ super_plast: num  2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ coarse_agg : num  1040 1055 932 932 978 ...
##  $ fine_agg   : num  676 676 594 594 826 ...
##  $ age        : int  28 28 270 365 360 365 28 28 90 28 ...
##  $ strength   : num  80 61.9 40.3 41 44.3 ...
1.3.1.4.1 Cross validation

Cross validation is a simple method used to split data into 2 parts which is train data and test data. However in this case Kaggle already splitted the data and have given us the splitted data.

RNGkind(sample.kind = "Rounding")
set.seed(123)

splitter <- initial_split(data = concrete_clean, prop = 0.8, strata = "strength")
train_NN <- training(splitter)
test_NN <- testing(splitter)

To make our live easy, i’m going to round up the values of “Strength” from train_NN data only

check target class proportion

prop.table(table(train_NN$strength)) %>% head(28)
## 
##       2.33       4.57       4.78       4.83        4.9       6.27       6.28 
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 
##       6.47       6.81       6.88        6.9       6.94        7.4       7.51 
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 
##       7.75       8.06       8.49       8.54       9.01       9.13       9.31 
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 
##       9.69       9.73       9.74       9.87       9.99      10.03      10.09 
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
1.3.1.4.2 seperating predictor factors with target variable
# train_NN
train_x <- train_NN %>% select(-strength) 
train_y <- train_NN$strength
# test_NN
test_x <- test_NN %>% select(-strength) 
test_y <- test_NN$strength
1.3.1.4.3 Scaling
train_x <- train_x %>% scale()
test_x <- test_x %>% scale()
1.3.1.4.4 Checking for outlier
boxplot(concrete_clean$strength)


It seems there is an outlier, lets see which datas are outliers

outlier <- boxplot(concrete_clean$strength)

outlier$out
## [1] 79.99 80.20 79.40 82.60 81.75

insight
-there seems to be 5 outliers out of 825 observation done.

although there are 5 outliers present in the data, it wont affect the model that we’re about to make if we make the neural network deep enough

1.3.1.4.5 Changing Predictor into Array

To transform features so they can be accepted by Keras and Python, we need to convert them into an array format using the array_reshape() function from the Keras library.

# your code here
train_x <- array_reshape(x = as.matrix(train_x),
                         dim = dim(train_x))

test_x <- array_reshape(x = as.matrix(test_x),
                         dim = dim(test_x))

1.3.1.5 Model and Evaluation

1.3.1.5.1 Build Neural Network Architecture

The next step is to design the architecture of the Neural Network.

model_linear_nn <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = ncol(train_x)) %>%
  layer_dropout(rate = 0.05) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "linear")
1.3.1.5.1.1 Model Compile

The compile function in Keras prepares the model for training by specifying the loss function, which measures prediction error; the optimizer, which updates the model’s weights based on this error; and evaluation metrics, which assess the model’s performance. For instance, using categorical_crossentropy as the loss function, the Adam optimizer with a learning rate of 0.1, and accuracy as a metric sets up the model to learn and evaluate performance effectively during training.

model_linear_nn %>% compile(
  loss = "mse",
  optimizer = optimizer_rmsprop(),
  metrics = list("mean_absolute_error"))
1.3.1.5.1.2 Model Fitting

Next, the Neural Network will undergo training through the processes of “Feed Forward” and “Back Propagation” (referred to as “steps”).

Training is carried out in batches, where one step of training uses one batch of data. in this Data set consist of 825 rows of data and the batch size will be 165, so the train will be divided into 5 batch

fit_nn <- model_linear_nn %>% fit(
  train_x,
  train_y,
  epochs = 100,
  validation_split = 0.2,
  verbose = 0)

1.3.1.6 Model Evaluation

model_linear_nn %>% evaluate(train_x, train_y)
## 21/21 - 0s - loss: 29.8052 - mean_absolute_error: 3.7547 - 44ms/epoch - 2ms/step
##                loss mean_absolute_error 
##           29.805223            3.754685
model_linear_nn %>% evaluate(test_x, test_y)
## 6/6 - 0s - loss: 65.7213 - mean_absolute_error: 5.6244 - 18ms/epoch - 3ms/step
##                loss mean_absolute_error 
##           65.721268            5.624386

It seems there is overfittin to the train set. lets look at training evolution plot to see what happened

plot(fit_nn)


insight
-Based on the Mean Absolute Error evolution plot, we see that we let the evolution ran too long that it starts to overfit itself

1.3.1.6.1 further tuning
1.3.1.6.1.1 Does a Neural Network Need to Normalize the Data?

Normalization of data is crucial for neural networks as it ensures that all input features are on a similar scale, which helps in speeding up convergence and improving model performance. Without normalization, features with larger magnitudes can disproportionately influence the training process, leading to inefficient learning and instability. By scaling features to a common range, such as [0, 1] or with zero mean and unit variance, normalization facilitates more stable and effective gradient descent optimization, ultimately enhancing the neural network’s ability to learn from the data efficiently.

1.3.1.6.1.2 Does a Neural Network Need to Log-transform or Scale the Variables with Square Root?

Log transformations and square root transformations can be beneficial for neural networks when dealing with skewed data or specific types of variables. Log transformation is particularly useful for reducing skewness and compressing the range of values, making the data distribution closer to normal and improving the stability of training. Square root transformations can moderate high variability and are often applied to count data with positive skewness. While these transformations are not always necessary, they can enhance model performance by addressing issues related to data distribution and scaling, depending on the characteristics of the dataset.

However on this model i wont be further tuning the model by normalizing the data, log-transform, or scale the variables with square root. Because while normalization and transformations like log or square root scaling can enhance neural network performance, they should be used judiciously. Over-normalization or inappropriate transformations might obscure meaningful data variations and lead to suboptimal model performance by compressing or distorting critical information. For instance, excessive normalization might mask important features, and inappropriate log or square root transformations can distort data relationships, complicating interpretation and potentially degrading model results. Therefore, it’s crucial to ensure that these preprocessing steps align with the data characteristics and the specific goals of the model to avoid compromising the overall effectiveness and interpretability of the neural network.

1.3.1.6.2 Tuning model
# Define the model with increased complexity and adjusted parameters
model_advanced <- keras_model_sequential() %>%
  layer_dense(units = 1024, 
              activation = "selu",  # SELU activation function
              input_shape = ncol(train_x), 
              kernel_regularizer = regularizer_l2(0.0001)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>%
  
  layer_dense(units = 512, activation = "elu",  # ELU activation function
              kernel_regularizer = regularizer_l2(0.0001)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>%
  
  layer_dense(units = 512, activation = "elu",  # ELU activation function
              kernel_regularizer = regularizer_l2(0.0001)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>%
  
  layer_dense(units = 256, activation = "swish",  # Swish activation function
              kernel_regularizer = regularizer_l2(0.0001)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.2) %>%
  
  layer_dense(units = 128, activation = "gelu",  # GELU activation function
              kernel_regularizer = regularizer_l2(0.0001)) %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.2) %>%
  
  layer_dense(units = 64, activation = "relu", 
              kernel_regularizer = regularizer_l2(0.0001)) %>%
  
  layer_dense(units = 32, activation = "relu", 
              kernel_regularizer = regularizer_l2(0.0001)) %>%
  
  layer_dense(units = 1, activation = "linear")  # Linear activation for the output layer

# Compile the model with an adjusted learning rate
model_advanced %>% compile(loss = "mse",
                           optimizer = optimizer_adam(learning_rate = 0.001),
                           metrics = list("mean_absolute_error"))

# Implement early stopping and model checkpoint
early_stop <- callback_early_stopping(min_delta = 0.001, 
                                      patience = 50, 
                                      monitor = "val_loss")
model_checkpoint <- callback_model_checkpoint(filepath = "best_model.h5", 
                                              save_best_only = TRUE)

# Fit the model using provided validation data
fit_advanced <- model_advanced %>% fit(train_x,
                                       train_y,
                                       epochs = 400,
                                       batch_size = 70, # Use the batch size of 75
                                       validation_data = list(test_x, test_y),
                                       verbose = 0,
                                       callbacks = list(early_stop, model_checkpoint))

# Load the best model after training
model_best <- load_model_hdf5("best_model.h5")
plot(fit_advanced)

Results seem better now for the test set

keras_train <- model_best %>% predict(train_x)
## 21/21 - 0s - 180ms/epoch - 9ms/step
keras_test <- model_best %>% predict(test_x)
## 6/6 - 0s - 29ms/epoch - 5ms/step
postResample(keras_test[,1], test_NN$strength)
##      RMSE  Rsquared       MAE 
## 5.7093631 0.8909226 3.9445006
postResample(keras_train[,1], train_NN$strength)
##      RMSE  Rsquared       MAE 
## 3.4686669 0.9576949 2.5242328

1.3.1.7 Prediction Performance

data_test_x <- data_test %>% select(-c("id", "strength")) %>% scale()
# predict target using your model
pred_test <- model_best %>% predict(data_test_x)
## 7/7 - 0s - 32ms/epoch - 5ms/step
# Create submission data
submission <- data.frame(id = data_test$id,
                         strength = pred_test)

# save data
write.csv(submission, "submission-jonathan-nn.csv", row.names = F)

# check first 3 data
head(submission, 3)
1.3.1.7.1 Test Performance in test data set

1.3.1.7.2 model interpretation

I’m going to show you a simplified version of the model that we just created and tuned

neural_net_linear_tuned <- neuralnet(formula = strength ~.,
                                     data = concrete_clean,
                                     hidden = c(8, 7, 7, 6, 5))
plot(neural_net_linear_tuned)

insight
-every value shown in this plot shows the weight of each arrow or you could say the correlation which helps the machine to decide the strength of the concrete
- This neural network architecture takes inspiration from a neural network diagram called “Extreme Learning Machine” which looked like this this
- In total there are 6 hidden layers used in this model - each ball in each row represent the n number of neurons used in the machine which is 2^n

1.3.2 Random Forest

1.3.2.1 Exploratory Data Analysis

At this stage, we will perform data exploration using the “summary()” and “str()” functions. This function provides a summary of eight statistical calculations for columns with numerical data types and calculates the frequency of occurrences for columns with categorical data types. Let’s apply this function to the data we have.

summary(concrete)
##       id                cement           slag            flyash      
##  Length:825         Min.   :102.0   Min.   :  0.00   Min.   :  0.00  
##  Class :character   1st Qu.:194.7   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Median :275.1   Median : 20.00   Median :  0.00  
##                     Mean   :280.9   Mean   : 73.18   Mean   : 54.03  
##                     3rd Qu.:350.0   3rd Qu.:141.30   3rd Qu.:118.20  
##                     Max.   :540.0   Max.   :359.40   Max.   :200.10  
##      water        super_plast       coarse_agg        fine_agg    
##  Min.   :121.8   Min.   : 0.000   Min.   : 801.0   Min.   :594.0  
##  1st Qu.:164.9   1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:734.0  
##  Median :184.0   Median : 6.500   Median : 968.0   Median :780.1  
##  Mean   :181.1   Mean   : 6.266   Mean   : 972.8   Mean   :775.6  
##  3rd Qu.:192.0   3rd Qu.:10.100   3rd Qu.:1028.4   3rd Qu.:826.8  
##  Max.   :247.0   Max.   :32.200   Max.   :1145.0   Max.   :992.6  
##       age            strength    
##  Min.   :  1.00   Min.   : 2.33  
##  1st Qu.:  7.00   1st Qu.:23.64  
##  Median : 28.00   Median :34.57  
##  Mean   : 45.14   Mean   :35.79  
##  3rd Qu.: 56.00   3rd Qu.:45.94  
##  Max.   :365.00   Max.   :82.60
str(concrete)
## 'data.frame':    825 obs. of  10 variables:
##  $ id         : chr  "S1" "S2" "S3" "S4" ...
##  $ cement     : num  540 540 332 332 199 ...
##  $ slag       : num  0 0 142 142 132 ...
##  $ flyash     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ water      : num  162 162 228 228 192 228 228 228 192 192 ...
##  $ super_plast: num  2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ coarse_agg : num  1040 1055 932 932 978 ...
##  $ fine_agg   : num  676 676 594 594 826 ...
##  $ age        : int  28 28 270 365 360 365 28 28 90 28 ...
##  $ strength   : num  80 61.9 40.3 41 44.3 ...

1.3.2.2 Missing Values

We will now check for any missing values

concrete %>% is.na() %>% colSums()
##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0           0

1.3.2.3 correlation exploratory

ggcorr(concrete, label = TRUE, hjust = 0.8)

#### Data Preprocess After completing the first two stages, the third stage you need to undertake is preparing the data to be more ready before implementing it into the machine learning model that will be used. For example, we want to perform cross-validation.

Cross-validation is a step we use to assess how well our model predicts new data.

So, how can we determine whether the model we’ve built is good at predicting new data? This is where Train-test splitting comes into play. We divide our data into two groups: training data and testing data. but before we start cross validation, we need to remove unwanted coloumns such as “id” coloums.

concrete_clean <- concrete %>% select(-id)
str(concrete_clean)
## 'data.frame':    825 obs. of  9 variables:
##  $ cement     : num  540 540 332 332 199 ...
##  $ slag       : num  0 0 142 142 132 ...
##  $ flyash     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ water      : num  162 162 228 228 192 228 228 228 192 192 ...
##  $ super_plast: num  2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ coarse_agg : num  1040 1055 932 932 978 ...
##  $ fine_agg   : num  676 676 594 594 826 ...
##  $ age        : int  28 28 270 365 360 365 28 28 90 28 ...
##  $ strength   : num  80 61.9 40.3 41 44.3 ...
1.3.2.3.1 Cross validation

Cross validation is a simple method used to split data into 2 parts which is train data and test data. However in this case Kaggle already splitted the data and have given us the splitted data.

RNGkind(sample.kind = "Rounding")
set.seed(123)

splitter <- initial_split(data = concrete_clean, prop = 0.8, strata = "strength")
train_RF <- training(splitter)
test_RF <- testing(splitter)

To make our live easy, i’m going to round up the values of “Strength” from train_NN data only

check target class proportion

prop.table(table(train_RF$strength)) %>% head(28)
## 
##       2.33       4.57       4.78       4.83        4.9       6.27       6.28 
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 
##       6.47       6.81       6.88        6.9       6.94        7.4       7.51 
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 
##       7.75       8.06       8.49       8.54       9.01       9.13       9.31 
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 
##       9.69       9.73       9.74       9.87       9.99      10.03      10.09 
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
1.3.2.3.2 seperating predictor factors with target variable
# train_NN
train_x_RF <- train_RF %>% select(-strength) 
train_y_RF <- train_RF$strength
# test_NN
test_x_RF <- test_RF %>% select(-strength) 
test_y_RF <- test_RF$strength
1.3.2.3.3 Scaling

Since this model is using Random Forest, we dont need to worry about scalling or normalizing features. This is because Random Forest is a tree-based model and hence does not require feature scaling.

1.3.2.3.4 Checking for outlier
boxplot(concrete_clean$strength)

It seems there is an outlier, lets see which datas are outliers

outlier <- boxplot(concrete_clean$strength)

outlier$out
## [1] 79.99 80.20 79.40 82.60 81.75

insight
-there seems to be 5 outliers out of 825 observation done.

although there are 5 outliers present in the data, it wont affect the model that we’re about to make if we make the neural network deep enough

1.3.2.3.5 Finding optimized value of ‘m’

tune RF returns the best optimized value of random varaible is 3 corresponding to a OOB of 0% (OOB - prediction error)

bestmtry <- tuneRF(train_RF,train_RF$strength,stepFactor = 1.2, improve = 0.01, trace=T, plot= T) 
## mtry = 3  OOB error = 3.337957 
## Searching left ...
## Searching right ...

1.3.2.4 Model and Evaluation

1.3.2.4.1 Creating Random Forest
model_rf <- randomForest(x = train_x_RF,
                         y = train_y_RF, 
                         ntree = 500)

model_rf
## 
## Call:
##  randomForest(x = train_x_RF, y = train_y_RF, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 33.52171
##                     % Var explained: 87.81
1.3.2.4.2 Evaluation

We will evaluate the Random Forest model. but first we need to create a function to evaluate the model

# Function for evaluating model
eval_recap <- function(truth, estimate){
  
  df_new <- data.frame(truth = truth,
                       estimate = estimate)
  
  data.frame(RMSE = rmse_vec(truth, estimate),
             MAE = mae_vec(truth, estimate),
             "R-Square" = rsq_vec(truth, estimate),
             check.names = F
             ) %>% 
    mutate(MSE = sqrt(RMSE))
}
pred_test_rf <- predict(model_rf, test_RF, type = "response")

eval_recap(truth = test_y_RF,
           estimate = pred_test_rf)

It seems the random forest model didn’t do quite well, lets tune it

1.3.2.4.3 Further Data Preprocessing

Does Random Forest Need to Normalize the Data?

Random forests generally do not require normalization of the data. This is because the algorithm, which is based on decision trees, splits the data based on feature values rather than their magnitudes. Since decision trees and their ensemble methods, like random forests, are not sensitive to the scale of the input features, normalization does not impact their performance. However, normalization might still be beneficial in certain scenarios, such as when integrating random forests with other algorithms that do require normalized inputs or when improving the interpretability of feature importance.

Does Random Forest Need to Log-transform or Scale the Variables with Square Root?

Random forests typically do not need log or square root transformations of variables. The algorithm’s decision tree-based splits are based on feature values, not their distributions or scales, so transformations like log or square root generally do not affect its performance. Nonetheless, applying these transformations can be helpful if the data exhibits extreme skewness or if you aim to improve interpretability by stabilizing the variance of the data. In these cases, transformations can enhance the model’s ability to handle diverse data types and distributions, but they are not strictly necessary for the functioning of random forests.

1.3.2.4.4 Model Tuning
# Define different parameter values to experiment with
ntrees <- c(500, 1000, 1500)  # Slightly fewer trees
mtries <- c(8, 16, 32)        # Narrower range for features
nodesizes <- c(1, 5, 10)      # Retaining smaller node sizes
maxnodes_values <- c(100, 200, 300)  # Allowing more terminal nodes

# Initialize best model performance
best_mae <- Inf
best_rsq <- -Inf
best_model <- NULL

control <- trainControl(
  method = "repeatedcv",  # Repeated cross-validation
  number = 5,             # 5 folds
  repeats = 3,            # Repeated 3 times
  verboseIter = FALSE,    # Do not print training progress
  savePredictions = "final",  # Save predictions for the final model
  classProbs = FALSE,     # Do not compute class probabilities (for regression tasks)
  summaryFunction = defaultSummary  # Default summary function for regression (RMSE, R-squared)
)

# Iterate through parameter combinations
for (ntree in ntrees) {
  for (mtry in mtries) {
    for (nodesize in nodesizes) {
      for (maxnodes in maxnodes_values) {
        # Train the model with current parameters
        tuned_rf <- randomForest(
          x = train_x_RF,
          y = train_y_RF,
          ntree = ntree,
          mtry = mtry,
          nodesize = nodesize,
          maxnodes = maxnodes,
          method = "gbm",
          trControl = control,
          verbose = FALSE,
          importance = TRUE
        )
        
        # Make predictions on the validation set
        predictions <- predict(tuned_rf, newdata = test_x_RF)
        mae <- mean(abs(predictions - test_y_RF))
        rsq <- 1 - sum((predictions - test_y_RF)^2) / sum((test_y_RF - mean(test_y_RF))^2)
        rsq <- format(round(rsq, 1), nsmall = 1)
        
        # Update best model if current model performs better
        if (mae < best_mae && rsq > best_rsq) {
          best_mae <- mae
          best_rsq <- rsq
          best_model_rf <- model_rf
        }
      }
    }
  }
}

# Print the best model and its performance
cat("Best Mean Absolute Error:", best_mae, "\n")
## Best Mean Absolute Error: 4.49756
cat("Best R-squared (rounded to 1 decimal):", best_rsq, "\n")
## Best R-squared (rounded to 1 decimal): 0.9
1.3.2.4.5 model evaluation
predictions <- predict(best_model_rf, newdata = test_x_RF)

eval_recap(truth = test_y_RF,
           estimate = predictions)

The tuned model performs better but does not meet the satisfactory

# predict target using your model
pred_test <- tuned_rf %>% predict(data_test_x)

# Create submission data
submission <- data.frame(id = data_test$id,
                         strength = pred_test)

# save data
write.csv(submission, "submission-jonathan-RF.csv", row.names = F)

# check first 3 data
head(submission, 3)

1.3.2.5 Prediction Performance

predictions <- predict(tuned_rf, newdata = test_x_RF)

eval_recap(truth = test_y_RF,
           estimate = predictions)
1.3.2.5.1 Interpretation
# Data Wrangling
library(tidyverse)

# Exploratory Data Analysis
library(GGally)

# Modeling and Evaluation
library(randomForest)
library(yardstick)
library(lmtest)

# Model Interpretation
library(lime)

# Set theme for visualization
theme_set(theme_minimal())

options(scipen = 999)
1.3.2.5.1.1 Black Box
tuned_rf$importance %>% 
  as.data.frame() %>% 
  arrange(-IncNodePurity) %>% 
  rownames_to_column("variable") %>% 
  head(10) %>% 
  ggplot(aes(IncNodePurity, 
             reorder(variable, IncNodePurity))
         ) +
  geom_col(fill = "firebrick") +
  labs(x = "Importance",
       y = NULL,
       title = "Random Forest Variable Importance")

1.3.2.5.1.2 Explainer
set.seed(123)
explainer <- lime(x = train_RF %>% select(-strength), 
                  model = tuned_rf)
1.3.2.5.1.3 troubleshooting
class(tuned_rf) 
## [1] "randomForest"
model_type.randomForest <- function(x){
  return("regression") # for regression problem
}

predict_model.randomForest <- function(x, newdata, type = "response") {

    # return prediction value
    predict(x, newdata) %>% as.data.frame()}
1.3.2.5.1.4 Explanation
# Select only the first 4 observations
selected_data <- train_RF %>% 
  select(-strength) %>% 
  slice(1:4)

# Explain the model
set.seed(123)
explanation <- explain(x = selected_data, 
                       explainer = explainer, 
                       feature_select = "auto", 
                       n_features = 10)
1.3.2.5.1.5 Visualization
plot_features(explanation = explanation)


Insight
-an all four visualization we see that Super_plast have the heaviest negative weight compared to the rest
-interestingly in case 1 and 2, slag have significant positive weight, however in case 3 and 4 slag have significant negative weight
-fin_agg seems to have least amount of weight in all case
- flyash was increasingly getting heavier in every case presented
- from the explaination fit, it says the average value is around 0.3875 which means that LIME can only explain a little about our model

1.4 Kesimpulan

The goal of this capstone project was to develop a machine learning model capable of predicting compression strength based on various mixture properties. This objective has been successfully achieved. Through countless experimentation and analysis, it has been demonstrated that machine learning can effectively solve the problem of predicting compression strength.

For this project, the result of both Neural Network and Random Forest is as shown :

  • Neural Network
postResample(keras_test[,1], test_NN$strength)
##      RMSE  Rsquared       MAE 
## 5.7093631 0.8909226 3.9445006
  • Random Forest
eval_recap(truth = test_y_RF,
           estimate = predictions)

Based off this project we’ve manage to build a machine that could predict the cement strength and also learned that although it is possible to predict the cement strength using Random Forest, but it just isn’t enough as it can’t achieve the requirements from Algoritma team.

note :
The only reason I’m making Random Forest is for LIME method of explaining the relationship of the variables. I’ve been refining and tuning Random Forest for the past days and haven’t gotten any sleep, would much appreciate if you only look at my Neural Network machine learning.

The potential business implementation of this project is significant. In industries such as construction and manufacturing, accurate prediction of compression strength can lead to improved material design and quality control. By integrating this machine learning model into production processes, businesses can enhance their ability to produce materials that meet specific strength requirements, reduce waste, and optimize resource allocation. Furthermore, this predictive capability can aid in better decision-making and cost savings by minimizing the need for extensive empirical testing and trial-and-error approaches. Overall, the implementation of this model can drive efficiency, reduce costs, and ensure higher standards of quality in material production.