This subject was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It is based on his 1998 research on predicting the compressive strength of concrete.
The objective is to predict the compressive strength of concrete as accurately as possible, with the lowest error, given various mixtures of materials as input. Concrete cubes differ in compressive strength depending on whether they are cured or not. Curing is the process of maintaining moisture to ensure uninterrupted hydration of the concrete.
The concrete strength increases if the concrete cubes are cured periodically. The rate of increase in strength is described here.
Time      % of total strength achieved
1 day     16%
3 days    40%
7 days    65%
14 days   90%
28 days   99%
At 28 days, concrete achieves 99% of its strength. Thus, strength is usually measured at 28 days.
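As a rough illustration of the table above, the tabulated values can be stored in a data frame and interpolated for ages in between. This is only a sketch: the helper `pct_at` is hypothetical, and linear interpolation is an assumption made for illustration, not a model of real strength gain.

```r
# Strength-gain table from the text (age in days, share of total strength).
gain <- data.frame(
  age_days = c(1, 3, 7, 14, 28),
  pct_strength = c(16, 40, 65, 90, 99)
)

# Hypothetical helper: linear interpolation between the tabulated ages.
# Real strength gain is not piecewise linear; this is illustration only.
pct_at <- function(age) {
  approx(x = gain$age_days, y = gain$pct_strength, xout = age)$y
}

pct_at(c(7, 10, 28))
```

Ages outside the 1 to 28 day range return `NA` with this helper, which is a deliberate choice here so that the table is not extrapolated.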
Again, the goal is to predict the compressive strength from the mixture of materials.
Before moving on, a few data preprocessing steps are needed: set up the libraries, read the data, take a glimpse of the data (checking the first or last rows if needed), inspect the data structure and summaries, remove unnecessary columns, and check for N/A values.
After that, we check whether feature scaling is necessary. Finally, we detect outliers and decide what to do with them.
Set Up Libraries
First we will set up some libraries which are needed to process the data.
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:MLmetrics':
##
## MAE, RMSE
## The following object is masked from 'package:purrr':
##
## lift
## -- Attaching packages ------------------------------------------------------------------------------- tidymodels 0.1.0 --
## v broom 0.5.4 v rsample 0.0.5
## v dials 0.0.4 v tune 0.0.1
## v infer 0.5.1 v workflows 0.1.0
## v parsnip 0.0.5 v yardstick 0.0.5
## v recipes 0.1.9
## -- Conflicts ---------------------------------------------------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x caret::lift() masks purrr::lift()
## x dials::margin() masks ggplot2::margin()
## x yardstick::precision() masks caret::precision()
## x yardstick::recall() masks caret::recall()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## x recipes::yj_trans() masks scales::yj_trans()
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:dials':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
##
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
##
## explain
Read Data
## Observations: 825
## Variables: 10
## $ id <fct> S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13,...
## $ cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0,...
## $ slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, ...
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, ...
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ coarse_agg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932....
## $ fine_agg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0,...
## $ age <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90,...
## $ strength <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29,...
The data consists of the following variables:
id : ID of each cement mixture
cement : the amount of cement (kg) in a m³ mixture
slag : the amount of blast furnace slag (kg) in a m³ mixture
flyash : the amount of fly ash (kg) in a m³ mixture
water : the amount of water (kg) in a m³ mixture
super_plast : the amount of superplasticizer (kg) in a m³ mixture
coarse_agg : the amount of coarse aggregate (kg) in a m³ mixture
fine_agg : the amount of fine aggregate (kg) in a m³ mixture
age : the number of resting days before the compressive strength measurement
strength : the concrete compressive strength measurement, in MPa
Removing unnecessary column
Remove the id column and take a short glance at the data
Data Summary and Structure
## cement slag flyash water
## Min. :102.0 Min. : 0.00 Min. : 0.00 Min. :121.8
## 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:164.9
## Median :275.1 Median : 20.00 Median : 0.00 Median :184.0
## Mean :280.9 Mean : 73.18 Mean : 54.03 Mean :181.1
## 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20 3rd Qu.:192.0
## Max. :540.0 Max. :359.40 Max. :200.10 Max. :247.0
## super_plast coarse_agg fine_agg age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0 1st Qu.: 7.00
## Median : 6.500 Median : 968.0 Median :780.1 Median : 28.00
## Mean : 6.266 Mean : 972.8 Mean :775.6 Mean : 45.14
## 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## strength
## Min. : 2.33
## 1st Qu.:23.64
## Median :34.57
## Mean :35.79
## 3rd Qu.:45.94
## Max. :82.60
## Observations: 825
## Variables: 9
## $ cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0,...
## $ slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, ...
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, ...
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ coarse_agg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932....
## $ fine_agg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0,...
## $ age <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90,...
## $ strength <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29,...
All of the predictor variables are already numeric. If necessary, we could apply feature scaling in the cross-validation section to put all variables on the same scale.
Check missing values
## [1] FALSE
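The code behind this check is not shown in the report. A minimal sketch of how such a check is commonly written in base R follows, using a made-up toy data frame (`toy` and its values are assumptions, not the real data).

```r
# Toy data frame standing in for the concrete data (values are made up).
toy <- data.frame(
  cement = c(540, 332.5, 198.6),
  water = c(162, 228, NA)
)

# TRUE if any cell is missing.
anyNA(toy)

# Count of missing values per column, useful when the answer is TRUE.
colSums(is.na(toy))
```

For the real data, a result of FALSE as printed above means no column needs imputation.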
Check Outliers
We have one outlier value for slag.
We have no outliers for flyash.
We have three outliers for water, two on top and one at the bottom.
We have two outliers for super_plast, both on top.
We have no outliers for coarse_agg.
We have two outliers for fine_agg, one on top and one at the bottom.
We have three outliers for age, all on top.
We have some outliers for strength.
Based on the boxplots above, we find outliers in several columns: slag, water, super_plast, fine_agg, age, and strength. We can treat them either by removing or by imputing the affected observations. Before deciding, we also have to check for outliers in the submission test data, so that both datasets are treated consistently.
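For readers unfamiliar with how a boxplot flags outliers, here is a small sketch on a made-up vector. It uses `boxplot.stats()`, the base R function that `boxplot()` relies on; the vector `x` is an assumption for illustration.

```r
# Toy vector standing in for one column (values are made up).
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

# boxplot.stats() flags points lying more than 1.5 times the box length
# beyond the hinges (coef = 1.5 is the default).
stats <- boxplot.stats(x)
stats$out

# Positions of the flagged points. Note that %in% matches by value, so
# every row sharing an outlying value is flagged as well.
outlier_idx <- which(x %in% stats$out)
outlier_idx
```

Because matching is by value, a column with many repeated values can report more flagged rows than there are distinct points drawn on the boxplot.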
## Observations: 205
## Variables: 9
## $ cement <dbl> 266.0, 266.0, 427.5, 190.0, 380.0, 427.5, 198.6, 332.5,...
## $ slag <dbl> 114.0, 114.0, 47.5, 190.0, 0.0, 47.5, 132.4, 142.5, 237...
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ water <dbl> 228.0, 228.0, 228.0, 228.0, 228.0, 228.0, 192.0, 228.0,...
## $ super_plast <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ coarse_agg <dbl> 932.0, 932.0, 932.0, 932.0, 932.0, 932.0, 978.4, 932.0,...
## $ fine_agg <dbl> 670.0, 670.0, 594.0, 670.0, 670.0, 594.0, 825.5, 594.0,...
## $ age <int> 90, 28, 270, 90, 270, 28, 180, 90, 180, 365, 365, 180, ...
## $ strength <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
Specifically for the super_plast and age columns, we will not remove or impute anything, because the submission test data shows the same pattern. Keeping these values lets the model predict under the same conditions in the new test dataset.
slag <- which(data_$slag %in% boxplot(data_$slag, plot=FALSE)$out) #2 obs
water <- which(data_$water %in% boxplot(data_$water, plot=FALSE)$out) #9 obs
# which(data$fine_agg %in% boxplot(data$fine_agg, plot=FALSE)$out) #27 obs
fine_agg <- which(data_$fine_agg %in% boxplot(data_$fine_agg, plot=FALSE)$out)[23:27] #5
strength <- which(data_$strength %in% boxplot(data_$strength, plot=FALSE)$out) #5 obs
Next, we remove the outliers from slag, water, fine_agg (only the top outliers, because the value 594 for fine_agg is flagged as an outlier here but also appears in our submission data) and the strength column. In total, this removes only 2.55% of the observations in our dataset.
## [1] 804
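The removal step itself is not shown. The counts noted in the code comments (2 + 9 + 5 + 5 = 21 rows) are consistent with the drop from 825 to 804 observations, about 2.55%. A minimal sketch of index-based removal follows, on made-up data; `toy`, `slag_idx`, and `strength_idx` are hypothetical names and not the author's code.

```r
# Toy data standing in for data_ (values are made up).
toy <- data.frame(
  slag = c(0, 20, 95, 142.5, 359.4, 0),
  strength = c(30, 35, 40, 45, 50, 120)
)

# Hypothetical index vectors, in the same spirit as the slag, water,
# fine_agg and strength vectors above.
slag_idx <- 5L
strength_idx <- 6L

# Combine the indices, drop duplicates, and remove those rows.
drop_idx <- unique(c(slag_idx, strength_idx))
toy_clean <- toy[-drop_idx, ]

nrow(toy_clean)

# Share of rows removed, in percent.
round(length(drop_idx) / nrow(toy) * 100, 2)
```

One caveat with negative indexing: if `drop_idx` is empty, `toy[-drop_idx, ]` returns zero rows, so a guard on `length(drop_idx) > 0` is advisable in real code.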
Correlation Matrix
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
There are a few questions we try to answer here by exploring the relation between the target and the features:
Is strength positively correlated with age? The correlation between strength and age is 0.347. It is positive, but not strong. As mentioned in the introduction, the number of resting days before the measurement affects the concrete’s compressive strength.
Do strength and cement have a strong correlation? The correlation between strength and cement is 0.49 (ggpairs) or 0.5 (ggcorr). In theory, the compressive strength of concrete is strongly influenced by the ratio of cement and water. So it makes sense that cement and water both influence the strength of concrete, but in combination rather than cement alone.
Does super_plast have a linear correlation with strength? Superplasticizer is also known as a water reducer. The amount of superplasticizer has a positive correlation with strength and a negative correlation with water. The correlation with strength, as seen in the chart above, is 0.35 (ggpairs) or 0.4 (ggcorr).
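The correlations discussed above come from ggpairs and ggcorr charts. As a plain base R sketch, the same kind of number can be obtained with `cor()`. The data below are made up, and the `wc_ratio` column is a hypothetical derived feature that is not part of the original analysis.

```r
# Toy data (made up) with the same column names as in the text.
toy <- data.frame(
  cement = c(200, 250, 300, 350, 400, 450),
  water = c(200, 190, 195, 180, 175, 170),
  strength = c(20, 28, 30, 41, 45, 52)
)

# Pearson correlation of every column with the target.
cor(toy)[, "strength"]

# Hypothetical derived feature: water-to-cement ratio.
toy$wc_ratio <- toy$water / toy$cement
cor(toy$wc_ratio, toy$strength)
```

In this toy example the ratio is negatively correlated with strength, which matches the idea that cement and water matter jointly rather than individually.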
Cross Validation
For the next step, we divide the data into a training dataset (to develop models) and a testing dataset (for model validation). There is no fixed rule for the split proportion. An 80:20 split of training vs testing data is the most commonly used, and is sometimes loosely associated with the [Pareto principle][1]. Since 5-fold cross-validation is traditionally used to verify an algorithm, and each fold holds out 20% of the data, 80:20 is a reasonable choice. In practice, the right proportion depends on the situation and the size of the dataset.
[1] : https://en.wikipedia.org/wiki/Pareto_principle
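To make the link between 5-fold cross-validation and an 80:20 split concrete, here is a minimal base R sketch on made-up data. Everything in it (the toy data, the linear model, the fold assignment) is an assumption for illustration and is not the procedure used later with caret.

```r
set.seed(123)

# Toy regression data (made up): y depends linearly on x plus noise.
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.1)

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
cv_mae <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ x, data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  mean(abs(toy$y[fold == i] - pred))
})

cv_mae        # one MAE per fold
mean(cv_mae)  # cross-validated MAE
```

With k = 5, every model is trained on 80% of the rows and scored on the remaining 20%.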
Specifically for this subject, we already have the training data (804 observations after outlier removal) and the submission test data (205 observations). We therefore split the training data at a ratio of roughly 75:25 (around 600:200 observations; the code below uses prop = 0.76) into a training dataset and a testing/validation dataset, so that the validation set is about the same size as the submission data we will predict later.
set.seed(123)
idx <- initial_split(data = data_, prop = 0.76, strata = "strength")
train <- training(idx)
test <- testing(idx)
Model 1 | Linear Regression
Model Development
##
## Call:
## lm(formula = strength ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.5366 -6.5395 0.9119 7.3042 27.9477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.657271 36.091891 0.462 0.644589
## cement 0.105495 0.011465 9.201 < 0.0000000000000002 ***
## slag 0.090794 0.013244 6.856 0.0000000000176 ***
## flyash 0.075124 0.016319 4.603 0.0000050712854 ***
## water -0.217980 0.057817 -3.770 0.000179 ***
## super_plast 0.249187 0.132828 1.876 0.061135 .
## coarse_agg 0.008304 0.012317 0.674 0.500451
## fine_agg 0.004161 0.014343 0.290 0.771831
## age 0.121650 0.007089 17.161 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.34 on 603 degrees of freedom
## Multiple R-squared: 0.612, Adjusted R-squared: 0.6068
## F-statistic: 118.9 on 8 and 603 DF, p-value: < 0.00000000000000022
LS <- lm(formula = strength~cement+slag+flyash+water+super_plast+age, data = train)
summary(LS)$r.squared
## [1] 0.6115207
Predict
LS.prediction1 <- predict(object = LS, newdata = train)
LS.prediction2 <- predict(object = LS, newdata = test)
Evaluate
## [1] 8.301161
## [1] 7.701927
Based on the R-squared of about 0.61 and the MAE printed above (8.30 in-sample on the training data and 7.70 out-of-sample on the testing data), we conclude that the linear regression model does not perform well enough. For comparison, we will try other approaches: Random Forest, and trials with tidymodels using the randomForest and ranger engines.
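The evaluation code is not shown in the report. For reference, MAE and RMSE can be computed by hand as in the sketch below. The helpers `mae_manual` and `rmse_manual` are hypothetical names; MLmetrics and caret provide similar functions.

```r
# Hypothetical helpers; MLmetrics and caret provide similar functions.
mae_manual <- function(actual, predicted) mean(abs(actual - predicted))
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy values (made up), in MPa.
actual <- c(30, 40, 50, 60)
predicted <- c(28, 43, 50, 55)

mae_manual(actual, predicted)
rmse_manual(actual, predicted)
```

MAE is in the same unit as the target (MPa), which makes it easy to read: an MAE of 8 means predictions are off by 8 MPa on average. RMSE penalizes large errors more heavily.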
Model 2 | Random Forest
#ctrl1<- trainControl(method = "repeatedcv", number = 5, repeats = 5)
#ctrl2<- trainControl(method = "repeatedcv" , number = 10, repeats = 10)
#RF5 <- train(strength~., data = train, method = "rf", trControl = ctrl1, importance=T)
#RF10 <- train(strength~., data = train, method = "rf", trControl = ctrl2, importance=T)
#saveRDS(RF5, "model/RF5_7624.Rds")
RF5 <- readRDS("model/RF5_7624.Rds")
#saveRDS(RF10, "model/RF10_7624.Rds")
RF10 <- readRDS("model/RF10_7624.Rds")
RF5
## Random Forest
##
## 612 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 488, 489, 490, 490, 491, 491, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 6.094318 0.8879792 4.613207
## 5 5.453978 0.8957533 3.977163
## 8 5.525872 0.8899515 4.007702
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
## Random Forest
##
## 612 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 551, 549, 552, 550, 551, 550, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.805998 0.8983617 4.391536
## 5 5.207266 0.9050365 3.798351
## 8 5.280290 0.9001829 3.832484
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
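At mtry = 5, the cross-validated R-squared is 0.896 with an MAE of 3.98 for the 5-fold model, and 0.905 with an MAE of 3.80 for the 10-fold model.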
Out-of-Bag (OOB) Error
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = ..1)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 25.45268
## % Var explained: 90.62
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = ..1)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 25.69966
## % Var explained: 90.53
Predict
RF5.prediction1 <- predict(RF5, newdata = train)
RF5.prediction2 <- predict(RF5, newdata = test)
head(RF5.prediction2)
## 3 7 25 28 29 31
## 40.88507 39.68581 52.51470 44.97232 41.45121 42.59622
RF10.prediction1 <- predict(RF10, newdata = train)
RF10.prediction2 <- predict(RF10, newdata = test)
head(RF10.prediction2)
## 3 7 25 28 29 31
## 40.74696 39.97782 52.76630 43.86445 41.59919 42.67330
Evaluate
## [1] 1.668849
## [1] 1.666181
## [1] 4.252441
## [1] 4.275926
To get the R-squared, we can read the % Var explained from the summary of the final model, or calculate it manually on the test data as below.
actual <- test$strength
predicted5 <- RF5.prediction2
R2test_5 <- 1 - (sum((actual-predicted5)^2)/sum((actual-mean(actual))^2))
R2test_5
## [1] 0.8746272
actual <- test$strength
predicted10 <- RF10.prediction2
R2test_10 <- 1 - (sum((actual-predicted10)^2)/sum((actual-mean(actual))^2))
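R2test_10
## [1] 0.8731011
This model is slightly overfit: the errors printed above are much lower on the training data (about 1.67) than on the test data (about 4.25), and the R-squared on the test data (about 0.87) is slightly below expectation.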
Model 3 | Tidy Models
Data preprocessing using Recipes
data_recipe <- recipe(strength~., train) %>%
step_corr(all_predictors()) %>%
step_sqrt(all_numeric()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
prep()
data_train <- juice(data_recipe)
data_test <- bake(data_recipe, test)
3a. Tidy model using Random Forest Engine
Model Fitting using PARSNIP
# set-up model specification
model_spec <- rand_forest(
mode = "regression",
mtry = 2,
trees = 1123,
min_n = 3
)
model_spec
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 2
## trees = 1123
## min_n = 3
# set-up model engine
model_engine1 <- set_engine(
object = model_spec,
engine = "randomForest"
)
model_engine1
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 2
## trees = 1123
## min_n = 3
##
## Computational engine: randomForest
To fit the model, we have two options: the formula interface or the X-Y interface.
## parsnip model object
##
## Fit time: 2.4s
##
## Call:
## randomForest(x = as.data.frame(x), y = y, ntree = ~1123, mtry = ~2, nodesize = ~3)
## Type of random forest: regression
## Number of trees: 1123
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1114947
## % Var explained: 88.83
# fit the model
TidyRF1 <- fit_xy(
object = model_engine1,
x = select(data_train, -strength),
y = select(data_train, strength)
)
Predict
scaled_prediction <- data_test %>%
select(strength) %>%
bind_cols((predict(TidyRF1, data_test)))
# quick check
scaled_prediction
Evaluate
Back Transform
# data_recipe$steps
# Undo the recipe steps for strength: unscale, uncenter, then square.
# steps[[3]] is step_center and steps[[4]] is step_scale.
recipe_bt <- function(x, data_recipe){
  means <- data_recipe$steps[[3]]$means[["strength"]]
  sds <- data_recipe$steps[[4]]$sds[["strength"]]
  (x*sds+means)^2
}
revert_prediction1 <- apply(scaled_prediction, MARGIN = 2, FUN = recipe_bt, data_recipe = data_recipe) %>%
  as.data.frame()
head(revert_prediction1)
Evaluation Metrics
## [1] "data.frame"
revert_prediction1 %>%
summarise(
R_SQUARED = rsq_vec(strength, .pred),
RMSE = rmse_vec(strength, .pred),
MAE = mae_vec(strength, .pred),
MAPE = mape_vec(strength, .pred),
MASE = mase_vec(strength, .pred)
)
With this model, we do not get satisfactory results.
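The back transform used in this section (`recipe_bt`) reverses the square root, centering, and scaling applied to strength by the recipe. The round trip can be checked in base R on made-up values. This is a sketch of the arithmetic only and does not call the recipes package.

```r
# Toy target values (made up), in MPa.
strength <- c(10, 25, 40, 55, 80)

# Forward transform, mirroring step_sqrt, step_center and step_scale.
root <- sqrt(strength)
m <- mean(root)
s <- sd(root)
scaled <- (root - m) / s

# Back transform: unscale, uncenter, then square.
reverted <- (scaled * s + m)^2

all.equal(reverted, strength)
```

The order matters: the inverse steps are applied in the reverse order of the recipe, and the mean and standard deviation must be those of the square-rooted training values, not of the raw ones.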
Tidymodels using ranger
Parsnip
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 5
## trees = 555
## min_n = 5
# set-up model engine
model_engine2 <- set_engine(
object = model_spec2,
engine = "ranger",
seed = 555
)
model_engine2
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 5
## trees = 555
## min_n = 5
##
## Engine-Specific Arguments:
## seed = 555
##
## Computational engine: ranger
# fit the model
TidyRanger <- fit(
object = model_engine2,
formula = strength ~ .,
data = data_train
)
TidyRanger
## parsnip model object
##
## Fit time: 801ms
## Ranger result
##
## Call:
## ranger::ranger(formula = formula, data = data, mtry = ~5, num.trees = ~555, min.node.size = ~5, seed = ~555, num.threads = 1, verbose = FALSE)
##
## Type: Regression
## Number of trees: 555
## Sample size: 612
## Number of independent variables: 8
## Mtry: 5
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 0.09082269
## R squared (OOB): 0.9091773
using fit_xy()
# fit the model
TidyRanger <- fit_xy(
object = model_engine2,
x = select(data_train, -strength),
y = select(data_train, strength)
)
# quick check
TidyRanger
## parsnip model object
##
## Fit time: 750ms
## Ranger result
##
## Call:
## ranger::ranger(formula = formula, data = data, mtry = ~5, num.trees = ~555, min.node.size = ~5, seed = ~555, num.threads = 1, verbose = FALSE)
##
## Type: Regression
## Number of trees: 555
## Sample size: 612
## Number of independent variables: 8
## Mtry: 5
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 0.09082269
## R squared (OOB): 0.9091773
Predict
scaled_prediction2 <- data_test %>%
select(strength) %>%
bind_cols((predict(TidyRanger, data_test)))
# quick check
scaled_prediction2
Evaluate
revert_prediction2 <- apply(scaled_prediction2, MARGIN = 2, FUN = recipe_bt, data_recipe = data_recipe) %>%
as.data.frame()
head(revert_prediction2)
Evaluation Metrics
## [1] "data.frame"
revert_prediction2 %>%
summarise(
R_SQUARED = rsq_vec(strength, .pred),
RMSE = rmse_vec(strength, .pred),
MAE = mae_vec(strength, .pred),
MAPE = mape_vec(strength, .pred),
MASE = mase_vec(strength, .pred)
)
Best Model
Random Forest
With Random Forest (10r10n), the cross-validated R-squared is 0.905 (about 90%) and the MAE is 3.80; on the test data, the R-squared is 0.873 and the MAE is 4.28.
RF5_Rsquared <- round(R2test_5*100,2)
RF5_MAE <- round(MAE(RF5.prediction2, test$strength),2)
RF5_Model <- cbind(Model = "Random Forest (5r5n)", `R-squared`=RF5_Rsquared,MAE=RF5_MAE ) %>%
as.data.frame()
RF5_Model
# Use the R-squared of the 10-fold model here, not R2test_5
RF10_Rsquared <- round(R2test_10*100,2)
RF10_MAE <- round(MAE(RF10.prediction2, test$strength),2)
RF10_Model <- cbind(Model = "Random Forest (10r10n)", `R-squared`=RF10_Rsquared,MAE=RF10_MAE ) %>%
as.data.frame()
RF10_Model
## [1] 4.275926
We now apply this model to predict the strength variable in the submission data.
datax_ <- datax_ %>%
rename(age = data_age)
RF10.predictionsub <- predict(RF10, newdata = datax_)
view(RF10.predictionsub)
#write.csv(RF10.predictionsub,'datasub.csv')
Data joining with Id
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## x = col_double()
## )
colnames(datasub)[colnames(datasub) =="x"] <- "strength"
colnames(datasub)[colnames(datasub) =="X1"] <- "no"
datasub_ <- datasub %>%
mutate(ids = row_number())
head(datasub_)
Notes:
Goal achieved, with results only slightly below expectation for R-squared in data validation.
Machine learning can solve the problem of predicting concrete strength, with random forest as the best model.
Performance is measured by an MAE (Mean Absolute Error) below 4.0, an R-squared of 0.90 in prediction, and almost 0.90 in validation.
For business implementation, the potentially best mixtures for concrete strength are found in IDs S858 and S859.
Building explainer with train data
Explaining selected samples
LIME Conclusion
With LIME, I do not need to scale back, because the explainer already works with the original, unscaled values. I use 2 features to show the 2 materials that contribute most to strength. With LIME, we can break down a prediction to see which components contribute the most and which mixtures may be best.
The difference in interpretation:
Black box: we obtain accuracy and optimize the models by minimizing errors and fitting the models.
LIME: we can break down which variables contribute to, correlate with, and combine to affect the target in the models.
From the 4 observations in the plot, the explanation fit tells us that cement and super_plast are the most influential materials contributing to strength, with age also having an effect.