1. Introduction

Concrete is a building material made of a mixture of cement, water, and aggregates. Concrete mixes are designed to achieve an expected compressive strength, and that strength is determined by the proportions of the components: each component can improve or degrade the quality of the concrete. To design and plan with a given mix, we therefore need a model that can predict the compressive strength of concrete from its ingredients.

This case study was completed to fulfill the capstone project requirement at the Algorithm Team Institute.

2. Library and Setup

library(dplyr)
library(GGally)
library(randomForest)
library(caret)
library(car)
library(ggplot2)
library(MLmetrics)
library(lime)
library(rsample)
library(lmtest)

3. Import Data

train.concrete<-read.csv("data/data-train.csv")
head(train.concrete)
test.concrete<-read.csv("data/data-test.csv")
head(test.concrete)

4. Data Preprocessing

Check Data Types

glimpse(train.concrete)
## Rows: 825
## Columns: 10
## $ id          <chr> "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10~
## $ cement      <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0, 19~
## $ slag        <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, 132~
## $ flyash      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ water       <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, 228~
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0~
## $ coarse_agg  <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, ~
## $ fine_agg    <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0, 82~
## $ age         <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90, 36~
## $ strength    <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29, 38~

All data types are correct.

Check Missing Value

train.concrete %>%  is.na() %>% colSums()
##          id      cement        slag      flyash       water super_plast 
##           0           0           0           0           0           0 
##  coarse_agg    fine_agg         age    strength 
##           0           0           0           0
There are no missing values in train.concrete.

Check Outliers

Outliers in train.concrete are checked graphically, by looking for points beyond the whiskers of a boxplot.

boxplot(train.concrete %>% select(-id))

Based on the boxplots above and the per-variable outlier lists below, there are outliers in the variables slag, water, super_plast, fine_agg, age, and strength; cement and coarse_agg have none.

summary(train.concrete)
##       id                cement           slag            flyash      
##  Length:825         Min.   :102.0   Min.   :  0.00   Min.   :  0.00  
##  Class :character   1st Qu.:194.7   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Median :275.1   Median : 20.00   Median :  0.00  
##                     Mean   :280.9   Mean   : 73.18   Mean   : 54.03  
##                     3rd Qu.:350.0   3rd Qu.:141.30   3rd Qu.:118.20  
##                     Max.   :540.0   Max.   :359.40   Max.   :200.10  
##      water        super_plast       coarse_agg        fine_agg    
##  Min.   :121.8   Min.   : 0.000   Min.   : 801.0   Min.   :594.0  
##  1st Qu.:164.9   1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:734.0  
##  Median :184.0   Median : 6.500   Median : 968.0   Median :780.1  
##  Mean   :181.1   Mean   : 6.266   Mean   : 972.8   Mean   :775.6  
##  3rd Qu.:192.0   3rd Qu.:10.100   3rd Qu.:1028.4   3rd Qu.:826.8  
##  Max.   :247.0   Max.   :32.200   Max.   :1145.0   Max.   :992.6  
##       age            strength    
##  Min.   :  1.00   Min.   : 2.33  
##  1st Qu.:  7.00   1st Qu.:23.64  
##  Median : 28.00   Median :34.57  
##  Mean   : 45.14   Mean   :35.79  
##  3rd Qu.: 56.00   3rd Qu.:45.94  
##  Max.   :365.00   Max.   :82.60

The following are the outlier values for each variable:

cat("\ncement:",boxplot(train.concrete$cement, plot = F)$out)
## 
## cement:
cat("\nslag:",boxplot(train.concrete$slag, plot = F)$out)
## 
## slag: 359.4 359.4
cat("\nwater:",boxplot(train.concrete$water, plot = F)$out)
## 
## water: 121.8 121.8 121.8 121.8 121.8 237 247 246.9 236.7
cat("\nsuper_plast:",boxplot(train.concrete$super_plast, plot = F)$out)
## 
## super_plast: 32.2 28.2 28.2 32.2 28.2 32.2 28.2
cat("\ncoarse_agg",boxplot(train.concrete$coarse_agg, plot = F)$out)
## 
## coarse_agg
cat("\nfine_agg:",boxplot(train.concrete$fine_agg, plot = F)$out)
## 
## fine_agg: 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 992.6 992.6 992.6 992.6 992.6
cat("\nage:",boxplot(train.concrete$age, plot = F)$out)
## 
## age: 270 365 360 365 365 180 180 180 365 270 180 365 365 270 365 270 180 180 180 270 270 270 180 270 360 180 180 180 360 180 365 180 365 180 270 180 180 360 180 360 180 180 180 180 360 270
cat("\nstrength:",boxplot(train.concrete$strength, plot = F)$out)
## 
## strength: 79.99 80.2 79.4 82.6 81.75

Check Data Correlation

ggcorr(train.concrete, label = T)
## Warning in ggcorr(train.concrete, label = T): data in column(s) 'id' are not
## numeric and were ignored

Insights obtained from the graph above:
1. strength has its strongest correlation with cement.
2. strength is positively correlated with age, super_plast, and slag.
3. strength is negatively correlated with fine_agg, coarse_agg, water, and flyash.

5. Data Cleansing

# delete the id column
concrete.pre <- train.concrete %>% 
  select(-id)

# drop the strength outliers identified above (>= 79.40), then standardize
# the train data so all variables are on a comparable scale and their
# distributions move closer to a normal distribution
concrete.scale <- concrete.pre %>%
  filter(strength < 79.40) %>%
  as.matrix() %>% 
  scale() %>% 
  as.data.frame()

head(concrete.scale)
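Note that strength is standardized along with the predictors, so predictions from any model fit on concrete.scale come out in standardized units. Below is a minimal sketch of converting such predictions back to the original units, assuming we keep the attributes returned by scale(); scaled_mat, centers, scales, and unscale_strength are hypothetical names, not part of the original workflow.

# keep the centering/scaling parameters so predictions can be back-transformed
scaled_mat <- concrete.pre %>%
  filter(strength < 79.40) %>%
  as.matrix() %>%
  scale()
centers <- attr(scaled_mat, "scaled:center")
scales  <- attr(scaled_mat, "scaled:scale")

# hypothetical helper: map a standardized strength value back to raw units
unscale_strength <- function(z) z * scales[["strength"]] + centers[["strength"]]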

6. Cross Validation

set.seed(100)
idx <- initial_split(concrete.scale,prop = 0.85,strata = "strength")
concrete.train <- training(idx)
concrete.test <- testing(idx)

7. Modelling

a. Model Tuning 1: Handling Outliers with Scaling

mod.linear <- lm(formula = strength ~., concrete.train)
mod.linear.step<- step(object = mod.linear, direction = "both")
## Start:  AIC=-618.76
## strength ~ cement + slag + flyash + water + super_plast + coarse_agg + 
##     fine_agg + age
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     281.86 -618.76
## - fine_agg     1     1.002 282.87 -618.28
## - water        1     1.719 283.58 -616.50
## - coarse_agg   1     1.916 283.78 -616.02
## - super_plast  1     5.304 287.17 -607.71
## - flyash       1    13.237 295.10 -588.63
## - slag         1    27.120 308.98 -556.45
## - cement       1    55.453 337.32 -495.04
## - age          1   128.934 410.80 -357.09
summary(mod.linear.step)
## 
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
##     coarse_agg + fine_agg + age, data = concrete.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72036 -0.41622  0.03637  0.44167  1.98144 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006087   0.024184  -0.252 0.801350    
## cement       0.749160   0.064253  11.660  < 2e-16 ***
## slag         0.508613   0.062377   8.154 1.66e-15 ***
## flyash       0.336567   0.059082   5.697 1.81e-08 ***
## water       -0.124906   0.060838  -2.053 0.040440 *  
## super_plast  0.152715   0.042351   3.606 0.000333 ***
## coarse_agg   0.112005   0.051675   2.167 0.030538 *  
## fine_agg     0.097174   0.061990   1.568 0.117440    
## age          0.442254   0.024875  17.779  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6387 on 691 degrees of freedom
## Multiple R-squared:  0.5955, Adjusted R-squared:  0.5908 
## F-statistic: 127.2 on 8 and 691 DF,  p-value: < 2.2e-16

Model Evaluation

prediction_test <- predict(object = mod.linear.step,newdata = concrete.test, interval = "confidence", level = 0.95)

Computing MAE on the Test Data

MAE<-mean(abs(prediction_test - concrete.test$strength))
MAE
## [1] 0.4540487
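A caveat: because predict() was called with interval = "confidence", it returns a three-column matrix (fit, lwr, upr), so the MAE above averages absolute errors over all three columns. A sketch of the MAE on the point predictions alone, using MAE() from the already-loaded MLmetrics package (the value would differ slightly from the one reported here):

# MAE on the point predictions only (column "fit")
MAE(y_pred = prediction_test[, "fit"], y_true = concrete.test$strength)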
prediction_train <- predict(object = mod.linear.step,newdata = concrete.train, interval = "confidence", level = 0.95)

Computing MAE on the Train Data

MAE<-mean(abs(prediction_train - concrete.train$strength))
MAE
## [1] 0.5169102

The adjusted R-squared is 59.08% on the train data, and the test MAE (0.45) is slightly lower than the train MAE (0.52), so the model does not appear to be overfitting.

Assumption Checking

#Normality Test
shapiro.test(mod.linear$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mod.linear$residuals
## W = 0.99561, p-value = 0.04511

The residuals are not normally distributed (p-value < 0.05), so the normality assumption is violated.

#Heteroskedasticity assumption
bptest(formula = mod.linear)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod.linear
## BP = 100.01, df = 8, p-value < 2.2e-16

Heteroscedasticity is present (p-value < 0.05), so the constant-variance assumption is violated.

#Variance Inflation Factor assumption
vif(mod.linear)
##      cement        slag      flyash       water super_plast  coarse_agg 
##    7.181576    6.609436    6.004855    6.313402    2.886758    4.466061 
##    fine_agg         age 
##    6.703119    1.124223

There is no multicollinearity in the model because all variables have a VIF value of less than 10.

b. Model Tuning 2: Min-Max Normalization

# min-max normalization to the [0, 1] range
normalize <- function(x){
  return ( 
    (x - min(x))/(max(x) - min(x)) 
           )
}
# drop the id column
mm_concrete <- train.concrete[2:10]

Cross Validation

set.seed(100)
idx <- initial_split(mm_concrete, prop = 0.85,strata = "strength")
mm_train <- training(idx)
mm_test <- testing(idx)

Normalization

# note: applied to a whole data frame, min() and max() run over all columns
# at once, so this is a single global rescaling (this is what produced the
# results reported below); the test set is also rescaled with its own range
mm_train <- normalize(mm_train)
mm_test <- normalize(mm_test)
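As noted in the comments above, normalize() on a whole data frame uses one global minimum and maximum across all columns, and the test set is rescaled with its own range rather than the training set's. The sketch below shows a hypothetical per-column, train-based alternative (raw_train, raw_test, norm_with, mm_train_pc, and mm_test_pc are illustrative names; the results reported in this section come from the global version above):

# hypothetical per-column normalization using the training set's ranges;
# the numbers in this section would change if this version were used
raw_train <- training(idx)
raw_test  <- testing(idx)
norm_with <- function(x, ref) (x - min(ref)) / (max(ref) - min(ref))
mm_train_pc <- as.data.frame(Map(norm_with, raw_train, raw_train))
mm_test_pc  <- as.data.frame(Map(norm_with, raw_test,  raw_train))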

Forming Model

model.linear2<-lm(formula = strength ~ ., data = mm_train)
model.linear2<- step(object = model.linear2, direction = "both")
## Start:  AIC=-6586.44
## strength ~ cement + slag + flyash + water + super_plast + coarse_agg + 
##     fine_agg + age
## 
##               Df Sum of Sq      RSS     AIC
## - fine_agg     1 0.0001303 0.059467 -6586.9
## <none>                     0.059337 -6586.4
## - coarse_agg   1 0.0003066 0.059643 -6584.8
## - super_plast  1 0.0006383 0.059975 -6580.9
## - water        1 0.0007429 0.060080 -6579.7
## - flyash       1 0.0027534 0.062090 -6556.5
## - slag         1 0.0058901 0.065227 -6521.8
## - cement       1 0.0114328 0.070770 -6464.4
## - age          1 0.0271020 0.086439 -6323.6
## 
## Step:  AIC=-6586.89
## strength ~ cement + slag + flyash + water + super_plast + coarse_agg + 
##     age
## 
##               Df Sum of Sq      RSS     AIC
## <none>                     0.059467 -6586.9
## - coarse_agg   1  0.000203 0.059670 -6586.5
## + fine_agg     1  0.000130 0.059337 -6586.4
## - super_plast  1  0.000574 0.060041 -6582.1
## - water        1  0.003687 0.063154 -6546.5
## - flyash       1  0.004850 0.064318 -6533.7
## - slag         1  0.016394 0.075861 -6417.5
## - age          1  0.026972 0.086439 -6325.6
## - cement       1  0.036423 0.095891 -6252.5
summary(model.linear2)
## 
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast + 
##     coarse_agg + age, data = mm_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0262791 -0.0060797  0.0006573  0.0062449  0.0297398 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.010927   0.009393   1.163  0.24510    
## cement       0.108736   0.005266  20.647  < 2e-16 ***
## slag         0.088410   0.006383  13.852  < 2e-16 ***
## flyash       0.072454   0.009616   7.535 1.53e-13 ***
## water       -0.190606   0.029016  -6.569 9.92e-11 ***
## super_plast  0.300779   0.116094   2.591  0.00978 ** 
## coarse_agg   0.010008   0.006499   1.540  0.12399    
## age          0.120561   0.006786  17.767  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009243 on 696 degrees of freedom
## Multiple R-squared:  0.6053, Adjusted R-squared:  0.6013 
## F-statistic: 152.5 on 7 and 696 DF,  p-value: < 2.2e-16

Predict

pred_test<-predict(object = model.linear2,newdata = mm_test, interval = "confidence", level = 0.95)

Computing MAE on the Test Data

MAE<-mean(abs(pred_test - mm_test$strength))
MAE
## [1] 0.006891868
pred_train<-predict(object = model.linear2,newdata = mm_train, interval = "confidence", level = 0.95)

Computing MAE on the Train Data

MAE<-mean(abs(pred_train- mm_train$strength))
MAE
## [1] 0.007428877

c. Model Tuning 3: Log-Transforming the Variables

Log Transforming

train.log <- log1p(concrete.pre)
# note: test.log is built from the same data as train.log, which is why the
# train and test MAE values reported below are identical
test.log <- log1p(concrete.pre)
concrete.log3<- lm(data = train.log, formula = strength ~ .)
summary(concrete.log3)
## 
## Call:
## lm(formula = strength ~ ., data = train.log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9458 -0.1311  0.0201  0.1552  0.5682 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.482578   2.471153   1.409  0.15913    
## cement       0.771994   0.034998  22.058  < 2e-16 ***
## slag         0.065494   0.005653  11.585  < 2e-16 ***
## flyash       0.024491   0.005864   4.176 3.28e-05 ***
## water       -0.846822   0.131538  -6.438 2.07e-10 ***
## super_plast  0.082210   0.013247   6.206 8.63e-10 ***
## coarse_agg   0.168115   0.162380   1.035  0.30083    
## fine_agg    -0.356007   0.128062  -2.780  0.00556 ** 
## age          0.301523   0.007392  40.793  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2335 on 816 degrees of freedom
## Multiple R-squared:  0.8066, Adjusted R-squared:  0.8047 
## F-statistic: 425.3 on 8 and 816 DF,  p-value: < 2.2e-16

Predict

pred_test3<-predict(object = concrete.log3,newdata = test.log, interval = "confidence", level = 0.95)

Computing MAE on the Test Data

MAE<-mean(abs(pred_test3 - test.log$strength))
MAE
## [1] 0.1841183
pred_train3<-predict(object = concrete.log3,newdata = train.log, interval = "confidence", level = 0.95)

Computing MAE on the Train Data

MAE<-mean(abs(pred_train3 - train.log$strength))
MAE
## [1] 0.1841183
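Because test.log and train.log were built from the same data frame, the train and test MAE above are identical, so this tuning has no real hold-out evaluation. A sketch of an actual split for this step, reusing rsample as in the cross-validation section (idx_log, train.log2, and test.log2 are hypothetical names; the numbers reported above come from the unsplit version):

# hypothetical train/test split on the log-transformed data
set.seed(100)
idx_log    <- initial_split(log1p(concrete.pre), prop = 0.85, strata = "strength")
train.log2 <- training(idx_log)
test.log2  <- testing(idx_log)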

d. Random Forest Model

# remove id and drop the strength outliers, as in the data cleansing step
clean.data <- train.concrete %>% select(-id)
clean.data <- clean.data %>% filter(strength < 79.40)

Cross Validation

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

index <- sample(nrow(clean.data), nrow(clean.data) * 0.85)
data_train_rf <- clean.data[index,]
data_test_rf <- clean.data[-index,]

Setting Up the Random Forest Model

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
ctrl <- trainControl(method="repeatedcv", number = 6, repeats = 3)
forest <- train(strength ~ ., data = clean.data, method = "rf", trControl = ctrl)
saveRDS(forest, "model_rf.RDS") 

Model Evaluation

forest_read <- readRDS("model_rf.RDS")
forest_read
## Random Forest 
## 
## 820 samples
##   8 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (6 fold, repeated 3 times) 
## Summary of sample sizes: 682, 684, 684, 684, 683, 683, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   2     5.673441  0.9009270  4.231798
##   5     5.056894  0.9089051  3.653251
##   8     5.104844  0.9055707  3.661927
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
forest_read$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 23.62833
##                     % Var explained: 91.19
varImp(forest_read)
## rf variable importance
## 
##             Overall
## age         100.000
## cement       81.804
## water        27.917
## super_plast  12.971
## slag          7.833
## fine_agg      5.664
## coarse_agg    1.081
## flyash        0.000
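Note that forest was trained on all of clean.data rather than on data_train_rf, so the repeated cross-validation metrics above are the appropriate performance estimates and data_test_rf is not a truly unseen hold-out. With that caveat, a quick sanity check on the split could look like the sketch below, using MAE() and RMSE() from MLmetrics (the resulting numbers would be optimistic):

# optimistic hold-out check: data_test_rf was part of the model's training data
pred_holdout <- predict(forest_read, newdata = data_test_rf)
MAE(y_pred = pred_holdout, y_true = data_test_rf$strength)
RMSE(y_pred = pred_holdout, y_true = data_test_rf$strength)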

8. Prediction Performance

pred_rf <- predict(object = forest_read, newdata = test.concrete)
test.concrete$strength <- pred_rf
submission <- data.frame(id = test.concrete$id,
                         strength = pred_rf
                         )
write.csv(submission, "submission-cyndy.csv", row.names = F)
head(submission, 3)


9. LIME Method for Interpretation

LIME can be used to interpret complex models by approximating them locally with simpler, interpretable explanations. This is helpful for a random forest, which tends to be a black box and otherwise only reports a summary of the modeling results.

explainerrf <- lime(data_train_rf, model = forest)
set.seed(100)

testlimerf <- data_test_rf[2:5,]
explanationrf <- explain(testlimerf,
                       explainer = explainerrf, 
                       n_features = 8)

Visualization

plot_features(explanationrf)

plot_explanations(explanationrf)
## Warning: Unknown or uninitialised column: `label`.

10. Conclusion

The conclusions of this project are:
1. Is the goal achieved? Yes: the random forest model meets the targets of R-squared > 90% and MAE < 4.
2. Can the problem be solved by machine learning? So far, yes.
3. Which models were used, and how did they perform? Two models were used, multiple linear regression and random forest; of the two, the random forest performed best.

Linear Regression
a. Handling Outliers with Scaling
- Adjusted R-squared: 59.08%
- MAE train: 0.52
- MAE test: 0.45

b. Min-Max Normalization
- Adjusted R-squared: 60.13%
- MAE train: 0.0074
- MAE test: 0.0069

c. Log Transformation
- Adjusted R-squared: 80.47%
- MAE train: 0.1841
- MAE test: 0.1841 (train and test here were the same data)

Random Forest
- R-squared (% variance explained): 91.19%
- Cross-validated MAE: 3.65

Note that the MAE values are not directly comparable across these models, because each tuning uses a different target scale (standardized, [0, 1], and log), while the random forest predicts on the original strength scale.