# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE
)
The problem of predicting concrete compression strength, a critical aspect in civil engineering, was originally proposed by Prof. I-Cheng Yeh from the Department of Information Management at Chung-Hua University, Hsin Chu, Taiwan, in 2007. This work builds upon his earlier research in 1998, which explored methods to accurately predict compression strength in concrete structures. As Prof. Yeh emphasized, concrete is the most important material in civil engineering, and its compression strength is a vital property that ensures the safety and durability of structures.
Concrete’s compression strength is not solely determined by the water-cement mixture but is also influenced by various other ingredients and the specific treatment of the mixture. Understanding how these factors interact is essential for optimizing concrete performance. Using the dataset provided by Prof. Yeh, this project aims to discover “the perfect recipe” for predicting concrete compression strength. It seeks to explain the relationship between the concentration of ingredients, the age at which testing is conducted, and the resulting compression strength.
Through the application of advanced machine learning techniques, this project endeavors to create a predictive model that not only enhances the accuracy of compression strength predictions but also provides valuable insights into the factors that influence this crucial property. The findings from this research have the potential to contribute significantly to the field of civil engineering by improving material design, quality control, and overall construction practices.
lets load in first the necessary library
library(keras)
library(dplyr)
library(tensorflow)
library(rsample)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(inspectdf)
library(gtools)
library(caret)
library(readxl)
library(zoo)
library(imbalance)
library(class)
library(lime)
library(neuralnet)
library(randomForest)
library(yardstick)
lets load in the data we will be using into an object. We will name the object “concrete”
concrete <- read.csv("input/data-train.csv")
data_test <- read.csv("input/data-test.csv")
concrete
At this stage, we will perform data exploration using the “summary()” and “str()” functions. This function provides a summary of eight statistical calculations for columns with numerical data types and calculates the frequency of occurrences for columns with categorical data types. Let’s apply this function to the data we have.
summary(concrete)
## id cement slag flyash
## Length:825 Min. :102.0 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Median :275.1 Median : 20.00 Median : 0.00
## Mean :280.9 Mean : 73.18 Mean : 54.03
## 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20
## Max. :540.0 Max. :359.40 Max. :200.10
## water super_plast coarse_agg fine_agg
## Min. :121.8 Min. : 0.000 Min. : 801.0 Min. :594.0
## 1st Qu.:164.9 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0
## Median :184.0 Median : 6.500 Median : 968.0 Median :780.1
## Mean :181.1 Mean : 6.266 Mean : 972.8 Mean :775.6
## 3rd Qu.:192.0 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8
## Max. :247.0 Max. :32.200 Max. :1145.0 Max. :992.6
## age strength
## Min. : 1.00 Min. : 2.33
## 1st Qu.: 7.00 1st Qu.:23.64
## Median : 28.00 Median :34.57
## Mean : 45.14 Mean :35.79
## 3rd Qu.: 56.00 3rd Qu.:45.94
## Max. :365.00 Max. :82.60
str(concrete)
## 'data.frame': 825 obs. of 10 variables:
## $ id : chr "S1" "S2" "S3" "S4" ...
## $ cement : num 540 540 332 332 199 ...
## $ slag : num 0 0 142 142 132 ...
## $ flyash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ water : num 162 162 228 228 192 228 228 228 192 192 ...
## $ super_plast: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ coarse_agg : num 1040 1055 932 932 978 ...
## $ fine_agg : num 676 676 594 594 826 ...
## $ age : int 28 28 270 365 360 365 28 28 90 28 ...
## $ strength : num 80 61.9 40.3 41 44.3 ...
We will now check for any missing values
concrete %>% is.na() %>% colSums()
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 0
ggcorr(concrete, label = TRUE, hjust = 0.8)
Insight
- age is weakly positive correlated with
strength of the cement
- the amount of cement is somewhat strongly
positive correlated with strength
- Super_plast has a positive
linear correlation with the strength of cement
After completing the first two stages, the third stage you need to undertake is preparing the data to be more ready before implementing it into the machine learning model that will be used. For example, we want to perform cross-validation.
Cross-validation is a step we use to assess how well our model predicts new data.
So, how can we determine whether the model we’ve built is good at predicting new data? This is where Train-test splitting comes into play. We divide our data into two groups: training data and testing data. but before we start cross validation, we need to remove unwanted coloumns such as “id” coloums.
concrete_clean <- concrete %>% select(-id)
str(concrete_clean)
## 'data.frame': 825 obs. of 9 variables:
## $ cement : num 540 540 332 332 199 ...
## $ slag : num 0 0 142 142 132 ...
## $ flyash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ water : num 162 162 228 228 192 228 228 228 192 192 ...
## $ super_plast: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ coarse_agg : num 1040 1055 932 932 978 ...
## $ fine_agg : num 676 676 594 594 826 ...
## $ age : int 28 28 270 365 360 365 28 28 90 28 ...
## $ strength : num 80 61.9 40.3 41 44.3 ...
Cross validation is a simple method used to split data into 2 parts which is train data and test data. However in this case Kaggle already splitted the data and have given us the splitted data.
RNGkind(sample.kind = "Rounding")
set.seed(123)
splitter <- initial_split(data = concrete_clean, prop = 0.8, strata = "strength")
train_NN <- training(splitter)
test_NN <- testing(splitter)
To make our live easy, i’m going to round up the values of “Strength” from train_NN data only
check target class proportion
prop.table(table(train_NN$strength)) %>% head(28)
##
## 2.33 4.57 4.78 4.83 4.9 6.27 6.28
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
## 6.47 6.81 6.88 6.9 6.94 7.4 7.51
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
## 7.75 8.06 8.49 8.54 9.01 9.13 9.31
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
## 9.69 9.73 9.74 9.87 9.99 10.03 10.09
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
# train_NN
train_x <- train_NN %>% select(-strength)
train_y <- train_NN$strength
# test_NN
test_x <- test_NN %>% select(-strength)
test_y <- test_NN$strength
train_x <- train_x %>% scale()
test_x <- test_x %>% scale()
boxplot(concrete_clean$strength)
It seems there is an outlier, lets see which datas are outliers
outlier <- boxplot(concrete_clean$strength)
outlier$out
## [1] 79.99 80.20 79.40 82.60 81.75
insight
-there seems to be 5 outliers out of
825 observation done.
although there are 5 outliers present in the data, it wont affect the model that we’re about to make if we make the neural network deep enough
To transform features so they can be accepted by Keras and Python, we need to convert them into an array format using the array_reshape() function from the Keras library.
# your code here
train_x <- array_reshape(x = as.matrix(train_x),
dim = dim(train_x))
test_x <- array_reshape(x = as.matrix(test_x),
dim = dim(test_x))
The next step is to design the architecture of the Neural Network.
model_linear_nn <- keras_model_sequential() %>%
layer_dense(units = 128, activation = "relu", input_shape = ncol(train_x)) %>%
layer_dropout(rate = 0.05) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_batch_normalization() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 1, activation = "linear")
The compile function in Keras prepares the model for training by specifying the loss function, which measures prediction error; the optimizer, which updates the model’s weights based on this error; and evaluation metrics, which assess the model’s performance. For instance, using categorical_crossentropy as the loss function, the Adam optimizer with a learning rate of 0.1, and accuracy as a metric sets up the model to learn and evaluate performance effectively during training.
model_linear_nn %>% compile(
loss = "mse",
optimizer = optimizer_rmsprop(),
metrics = list("mean_absolute_error"))
Next, the Neural Network will undergo training through the processes of “Feed Forward” and “Back Propagation” (referred to as “steps”).
Training is carried out in batches, where one step of training uses one batch of data. in this Data set consist of 825 rows of data and the batch size will be 165, so the train will be divided into 5 batch
fit_nn <- model_linear_nn %>% fit(
train_x,
train_y,
epochs = 100,
validation_split = 0.2,
verbose = 0)
model_linear_nn %>% evaluate(train_x, train_y)
## 21/21 - 0s - loss: 29.8052 - mean_absolute_error: 3.7547 - 44ms/epoch - 2ms/step
## loss mean_absolute_error
## 29.805223 3.754685
model_linear_nn %>% evaluate(test_x, test_y)
## 6/6 - 0s - loss: 65.7213 - mean_absolute_error: 5.6244 - 18ms/epoch - 3ms/step
## loss mean_absolute_error
## 65.721268 5.624386
It seems there is overfittin to the train set. lets look at training evolution plot to see what happened
plot(fit_nn)
insight
-Based on the Mean Absolute Error
evolution plot, we see that we let the evolution ran too long that it
starts to overfit itself
Normalization of data is crucial for neural networks as it ensures that all input features are on a similar scale, which helps in speeding up convergence and improving model performance. Without normalization, features with larger magnitudes can disproportionately influence the training process, leading to inefficient learning and instability. By scaling features to a common range, such as [0, 1] or with zero mean and unit variance, normalization facilitates more stable and effective gradient descent optimization, ultimately enhancing the neural network’s ability to learn from the data efficiently.
Log transformations and square root transformations can be beneficial for neural networks when dealing with skewed data or specific types of variables. Log transformation is particularly useful for reducing skewness and compressing the range of values, making the data distribution closer to normal and improving the stability of training. Square root transformations can moderate high variability and are often applied to count data with positive skewness. While these transformations are not always necessary, they can enhance model performance by addressing issues related to data distribution and scaling, depending on the characteristics of the dataset.
However on this model i wont be further tuning the model by normalizing the data, log-transform, or scale the variables with square root. Because while normalization and transformations like log or square root scaling can enhance neural network performance, they should be used judiciously. Over-normalization or inappropriate transformations might obscure meaningful data variations and lead to suboptimal model performance by compressing or distorting critical information. For instance, excessive normalization might mask important features, and inappropriate log or square root transformations can distort data relationships, complicating interpretation and potentially degrading model results. Therefore, it’s crucial to ensure that these preprocessing steps align with the data characteristics and the specific goals of the model to avoid compromising the overall effectiveness and interpretability of the neural network.
# Define the model with increased complexity and adjusted parameters
model_advanced <- keras_model_sequential() %>%
layer_dense(units = 1024,
activation = "selu", # SELU activation function
input_shape = ncol(train_x),
kernel_regularizer = regularizer_l2(0.0001)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 512, activation = "elu", # ELU activation function
kernel_regularizer = regularizer_l2(0.0001)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 512, activation = "elu", # ELU activation function
kernel_regularizer = regularizer_l2(0.0001)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 256, activation = "swish", # Swish activation function
kernel_regularizer = regularizer_l2(0.0001)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 128, activation = "gelu", # GELU activation function
kernel_regularizer = regularizer_l2(0.0001)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 64, activation = "relu",
kernel_regularizer = regularizer_l2(0.0001)) %>%
layer_dense(units = 32, activation = "relu",
kernel_regularizer = regularizer_l2(0.0001)) %>%
layer_dense(units = 1, activation = "linear") # Linear activation for the output layer
# Compile the model with an adjusted learning rate
model_advanced %>% compile(loss = "mse",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = list("mean_absolute_error"))
# Implement early stopping and model checkpoint
early_stop <- callback_early_stopping(min_delta = 0.001,
patience = 50,
monitor = "val_loss")
model_checkpoint <- callback_model_checkpoint(filepath = "best_model.h5",
save_best_only = TRUE)
# Fit the model using provided validation data
fit_advanced <- model_advanced %>% fit(train_x,
train_y,
epochs = 400,
batch_size = 70, # Use the batch size of 75
validation_data = list(test_x, test_y),
verbose = 0,
callbacks = list(early_stop, model_checkpoint))
# Load the best model after training
model_best <- load_model_hdf5("best_model.h5")
plot(fit_advanced)
Results seem better now for the test set
keras_train <- model_best %>% predict(train_x)
## 21/21 - 0s - 180ms/epoch - 9ms/step
keras_test <- model_best %>% predict(test_x)
## 6/6 - 0s - 29ms/epoch - 5ms/step
postResample(keras_test[,1], test_NN$strength)
## RMSE Rsquared MAE
## 5.7093631 0.8909226 3.9445006
postResample(keras_train[,1], train_NN$strength)
## RMSE Rsquared MAE
## 3.4686669 0.9576949 2.5242328
data_test_x <- data_test %>% select(-c("id", "strength")) %>% scale()
# predict target using your model
pred_test <- model_best %>% predict(data_test_x)
## 7/7 - 0s - 32ms/epoch - 5ms/step
# Create submission data
submission <- data.frame(id = data_test$id,
strength = pred_test)
# save data
write.csv(submission, "submission-jonathan-nn.csv", row.names = F)
# check first 3 data
head(submission, 3)
I’m going to show you a simplified version of the model that we just created and tuned
neural_net_linear_tuned <- neuralnet(formula = strength ~.,
data = concrete_clean,
hidden = c(8, 7, 7, 6, 5))
plot(neural_net_linear_tuned)
insight
-every value shown in this plot shows
the weight of each arrow or you could say the correlation which helps
the machine to decide the strength of the concrete
- This neural
network architecture takes inspiration from a neural network diagram
called “Extreme Learning Machine” which looked like this
- In total there are 6 hidden layers used in this
model - each ball in each row represent the n number of neurons used in
the machine which is 2^n
At this stage, we will perform data exploration using the “summary()” and “str()” functions. This function provides a summary of eight statistical calculations for columns with numerical data types and calculates the frequency of occurrences for columns with categorical data types. Let’s apply this function to the data we have.
summary(concrete)
## id cement slag flyash
## Length:825 Min. :102.0 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Median :275.1 Median : 20.00 Median : 0.00
## Mean :280.9 Mean : 73.18 Mean : 54.03
## 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20
## Max. :540.0 Max. :359.40 Max. :200.10
## water super_plast coarse_agg fine_agg
## Min. :121.8 Min. : 0.000 Min. : 801.0 Min. :594.0
## 1st Qu.:164.9 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0
## Median :184.0 Median : 6.500 Median : 968.0 Median :780.1
## Mean :181.1 Mean : 6.266 Mean : 972.8 Mean :775.6
## 3rd Qu.:192.0 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8
## Max. :247.0 Max. :32.200 Max. :1145.0 Max. :992.6
## age strength
## Min. : 1.00 Min. : 2.33
## 1st Qu.: 7.00 1st Qu.:23.64
## Median : 28.00 Median :34.57
## Mean : 45.14 Mean :35.79
## 3rd Qu.: 56.00 3rd Qu.:45.94
## Max. :365.00 Max. :82.60
str(concrete)
## 'data.frame': 825 obs. of 10 variables:
## $ id : chr "S1" "S2" "S3" "S4" ...
## $ cement : num 540 540 332 332 199 ...
## $ slag : num 0 0 142 142 132 ...
## $ flyash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ water : num 162 162 228 228 192 228 228 228 192 192 ...
## $ super_plast: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ coarse_agg : num 1040 1055 932 932 978 ...
## $ fine_agg : num 676 676 594 594 826 ...
## $ age : int 28 28 270 365 360 365 28 28 90 28 ...
## $ strength : num 80 61.9 40.3 41 44.3 ...
We will now check for any missing values
concrete %>% is.na() %>% colSums()
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 0
ggcorr(concrete, label = TRUE, hjust = 0.8)
#### Data Preprocess After completing the first two
stages, the third stage you need to undertake is preparing the data to
be more ready before implementing it into the machine learning model
that will be used. For example, we want to perform cross-validation.
Cross-validation is a step we use to assess how well our model predicts new data.
So, how can we determine whether the model we’ve built is good at predicting new data? This is where Train-test splitting comes into play. We divide our data into two groups: training data and testing data. but before we start cross validation, we need to remove unwanted coloumns such as “id” coloums.
concrete_clean <- concrete %>% select(-id)
str(concrete_clean)
## 'data.frame': 825 obs. of 9 variables:
## $ cement : num 540 540 332 332 199 ...
## $ slag : num 0 0 142 142 132 ...
## $ flyash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ water : num 162 162 228 228 192 228 228 228 192 192 ...
## $ super_plast: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ coarse_agg : num 1040 1055 932 932 978 ...
## $ fine_agg : num 676 676 594 594 826 ...
## $ age : int 28 28 270 365 360 365 28 28 90 28 ...
## $ strength : num 80 61.9 40.3 41 44.3 ...
Cross validation is a simple method used to split data into 2 parts which is train data and test data. However in this case Kaggle already splitted the data and have given us the splitted data.
RNGkind(sample.kind = "Rounding")
set.seed(123)
splitter <- initial_split(data = concrete_clean, prop = 0.8, strata = "strength")
train_RF <- training(splitter)
test_RF <- testing(splitter)
To make our live easy, i’m going to round up the values of “Strength” from train_NN data only
check target class proportion
prop.table(table(train_RF$strength)) %>% head(28)
##
## 2.33 4.57 4.78 4.83 4.9 6.27 6.28
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
## 6.47 6.81 6.88 6.9 6.94 7.4 7.51
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
## 7.75 8.06 8.49 8.54 9.01 9.13 9.31
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
## 9.69 9.73 9.74 9.87 9.99 10.03 10.09
## 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207 0.00152207
# train_NN
train_x_RF <- train_RF %>% select(-strength)
train_y_RF <- train_RF$strength
# test_NN
test_x_RF <- test_RF %>% select(-strength)
test_y_RF <- test_RF$strength
Since this model is using Random Forest, we dont need to worry about scalling or normalizing features. This is because Random Forest is a tree-based model and hence does not require feature scaling.
boxplot(concrete_clean$strength)
It seems there is an outlier, lets see which datas are outliers
outlier <- boxplot(concrete_clean$strength)
outlier$out
## [1] 79.99 80.20 79.40 82.60 81.75
insight
-there seems to be 5 outliers out of
825 observation done.
although there are 5 outliers present in the data, it wont affect the model that we’re about to make if we make the neural network deep enough
tune RF returns the best optimized value of random varaible is 3 corresponding to a OOB of 0% (OOB - prediction error)
bestmtry <- tuneRF(train_RF,train_RF$strength,stepFactor = 1.2, improve = 0.01, trace=T, plot= T)
## mtry = 3 OOB error = 3.337957
## Searching left ...
## Searching right ...
model_rf <- randomForest(x = train_x_RF,
y = train_y_RF,
ntree = 500)
model_rf
##
## Call:
## randomForest(x = train_x_RF, y = train_y_RF, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 33.52171
## % Var explained: 87.81
We will evaluate the Random Forest model. but first we need to create a function to evaluate the model
# Function for evaluating model
eval_recap <- function(truth, estimate){
df_new <- data.frame(truth = truth,
estimate = estimate)
data.frame(RMSE = rmse_vec(truth, estimate),
MAE = mae_vec(truth, estimate),
"R-Square" = rsq_vec(truth, estimate),
check.names = F
) %>%
mutate(MSE = sqrt(RMSE))
}
pred_test_rf <- predict(model_rf, test_RF, type = "response")
eval_recap(truth = test_y_RF,
estimate = pred_test_rf)
It seems the random forest model didn’t do quite well, lets tune it
Does Random Forest Need to Normalize the Data?
Random forests generally do not require normalization of the data. This is because the algorithm, which is based on decision trees, splits the data based on feature values rather than their magnitudes. Since decision trees and their ensemble methods, like random forests, are not sensitive to the scale of the input features, normalization does not impact their performance. However, normalization might still be beneficial in certain scenarios, such as when integrating random forests with other algorithms that do require normalized inputs or when improving the interpretability of feature importance.
Does Random Forest Need to Log-transform or Scale the Variables with Square Root?
Random forests typically do not need log or square root transformations of variables. The algorithm’s decision tree-based splits are based on feature values, not their distributions or scales, so transformations like log or square root generally do not affect its performance. Nonetheless, applying these transformations can be helpful if the data exhibits extreme skewness or if you aim to improve interpretability by stabilizing the variance of the data. In these cases, transformations can enhance the model’s ability to handle diverse data types and distributions, but they are not strictly necessary for the functioning of random forests.
# Define different parameter values to experiment with
ntrees <- c(500, 1000, 1500) # Slightly fewer trees
mtries <- c(8, 16, 32) # Narrower range for features
nodesizes <- c(1, 5, 10) # Retaining smaller node sizes
maxnodes_values <- c(100, 200, 300) # Allowing more terminal nodes
# Initialize best model performance
best_mae <- Inf
best_rsq <- -Inf
best_model <- NULL
control <- trainControl(
method = "repeatedcv", # Repeated cross-validation
number = 5, # 5 folds
repeats = 3, # Repeated 3 times
verboseIter = FALSE, # Do not print training progress
savePredictions = "final", # Save predictions for the final model
classProbs = FALSE, # Do not compute class probabilities (for regression tasks)
summaryFunction = defaultSummary # Default summary function for regression (RMSE, R-squared)
)
# Iterate through parameter combinations
for (ntree in ntrees) {
for (mtry in mtries) {
for (nodesize in nodesizes) {
for (maxnodes in maxnodes_values) {
# Train the model with current parameters
tuned_rf <- randomForest(
x = train_x_RF,
y = train_y_RF,
ntree = ntree,
mtry = mtry,
nodesize = nodesize,
maxnodes = maxnodes,
method = "gbm",
trControl = control,
verbose = FALSE,
importance = TRUE
)
# Make predictions on the validation set
predictions <- predict(tuned_rf, newdata = test_x_RF)
mae <- mean(abs(predictions - test_y_RF))
rsq <- 1 - sum((predictions - test_y_RF)^2) / sum((test_y_RF - mean(test_y_RF))^2)
rsq <- format(round(rsq, 1), nsmall = 1)
# Update best model if current model performs better
if (mae < best_mae && rsq > best_rsq) {
best_mae <- mae
best_rsq <- rsq
best_model_rf <- model_rf
}
}
}
}
}
# Print the best model and its performance
cat("Best Mean Absolute Error:", best_mae, "\n")
## Best Mean Absolute Error: 4.49756
cat("Best R-squared (rounded to 1 decimal):", best_rsq, "\n")
## Best R-squared (rounded to 1 decimal): 0.9
predictions <- predict(best_model_rf, newdata = test_x_RF)
eval_recap(truth = test_y_RF,
estimate = predictions)
The tuned model performs better but does not meet the satisfactory
# predict target using your model
pred_test <- tuned_rf %>% predict(data_test_x)
# Create submission data
submission <- data.frame(id = data_test$id,
strength = pred_test)
# save data
write.csv(submission, "submission-jonathan-RF.csv", row.names = F)
# check first 3 data
head(submission, 3)
predictions <- predict(tuned_rf, newdata = test_x_RF)
eval_recap(truth = test_y_RF,
estimate = predictions)
# Data Wrangling
library(tidyverse)
# Exploratory Data Analysis
library(GGally)
# Modeling and Evaluation
library(randomForest)
library(yardstick)
library(lmtest)
# Model Interpretation
library(lime)
# Set theme for visualization
theme_set(theme_minimal())
options(scipen = 999)
tuned_rf$importance %>%
as.data.frame() %>%
arrange(-IncNodePurity) %>%
rownames_to_column("variable") %>%
head(10) %>%
ggplot(aes(IncNodePurity,
reorder(variable, IncNodePurity))
) +
geom_col(fill = "firebrick") +
labs(x = "Importance",
y = NULL,
title = "Random Forest Variable Importance")
set.seed(123)
explainer <- lime(x = train_RF %>% select(-strength),
model = tuned_rf)
class(tuned_rf)
## [1] "randomForest"
model_type.randomForest <- function(x){
return("regression") # for regression problem
}
predict_model.randomForest <- function(x, newdata, type = "response") {
# return prediction value
predict(x, newdata) %>% as.data.frame()}
# Select only the first 4 observations
selected_data <- train_RF %>%
select(-strength) %>%
slice(1:4)
# Explain the model
set.seed(123)
explanation <- explain(x = selected_data,
explainer = explainer,
feature_select = "auto",
n_features = 10)
plot_features(explanation = explanation)
Insight
-an all four visualization we see
that Super_plast have the heaviest negative weight compared to the
rest
-interestingly in case 1 and 2, slag have significant positive
weight, however in case 3 and 4 slag have significant negative weight
-fin_agg seems to have least amount of weight in all case
-
flyash was increasingly getting heavier in every case presented
-
from the explaination fit, it says the average value is around 0.3875
which means that LIME can only explain a little about our model
The goal of this capstone project was to develop a machine learning model capable of predicting compression strength based on various mixture properties. This objective has been successfully achieved. Through countless experimentation and analysis, it has been demonstrated that machine learning can effectively solve the problem of predicting compression strength.
For this project, the result of both Neural Network and Random Forest
is as shown :
postResample(keras_test[,1], test_NN$strength)
## RMSE Rsquared MAE
## 5.7093631 0.8909226 3.9445006
eval_recap(truth = test_y_RF,
estimate = predictions)
Based off this project we’ve manage to build a machine that could predict the cement strength and also learned that although it is possible to predict the cement strength using Random Forest, but it just isn’t enough as it can’t achieve the requirements from Algoritma team.
note :
The only reason I’m making Random Forest
is for LIME method of explaining the relationship of the variables. I’ve
been refining and tuning Random Forest for the past days and haven’t
gotten any sleep, would much appreciate if you only look at my Neural
Network machine learning.
The potential business implementation of this project is significant. In industries such as construction and manufacturing, accurate prediction of compression strength can lead to improved material design and quality control. By integrating this machine learning model into production processes, businesses can enhance their ability to produce materials that meet specific strength requirements, reduce waste, and optimize resource allocation. Furthermore, this predictive capability can aid in better decision-making and cost savings by minimizing the need for extensive empirical testing and trial-and-error approaches. Overall, the implementation of this model can drive efficiency, reduce costs, and ensure higher standards of quality in material production.