10 Jun 2021
Concrete is a building material made from a mixture of cement, water, and aggregates. Concrete is manufactured to meet an expected compressive strength, and that strength is determined by the proportions of its components: each component can improve the quality of the concrete or degrade it. Predicting the expected quality of a concrete mix therefore requires a model, and compressive-strength data for the constituent materials are essential input for design and planning.
This case study was completed to fulfill the capstone project at the Algorithm Team Institute.
library(dplyr)
library(GGally)
library(randomForest)
library(caret)
library(car)
library(ggplot2)
library(MLmetrics)
library(lime)
library(rsample)
library(lmtest)
train.concrete <- read.csv("data/data-train.csv")
head(train.concrete)
test.concrete <- read.csv("data/data-test.csv")
head(test.concrete)
glimpse(train.concrete)
## Rows: 825
## Columns: 10
## $ id <chr> "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10~
## $ cement <dbl> 540.0, 540.0, 332.5, 332.5, 198.6, 380.0, 380.0, 475.0, 19~
## $ slag <dbl> 0.0, 0.0, 142.5, 142.5, 132.4, 95.0, 95.0, 0.0, 132.4, 132~
## $ flyash <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ water <dbl> 162, 162, 228, 228, 192, 228, 228, 228, 192, 192, 228, 228~
## $ super_plast <dbl> 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0~
## $ coarse_agg <dbl> 1040.0, 1055.0, 932.0, 932.0, 978.4, 932.0, 932.0, 932.0, ~
## $ fine_agg <dbl> 676.0, 676.0, 594.0, 594.0, 825.5, 594.0, 594.0, 594.0, 82~
## $ age <int> 28, 28, 270, 365, 360, 365, 28, 28, 90, 28, 28, 90, 90, 36~
## $ strength <dbl> 79.99, 61.89, 40.27, 41.05, 44.30, 43.70, 36.45, 39.29, 38~
The data types are correct.
train.concrete %>% is.na() %>% colSums()
## id cement slag flyash water super_plast
## 0 0 0 0 0 0
## coarse_agg fine_agg age strength
## 0 0 0 0
There are no missing values in train.concrete.
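As a one-line cross-check, base R's anyNA() reports whether any value in the data frame is missing; given the zero counts above it should return FALSE:
anyNA(train.concrete)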
Outliers in train.concrete are checked with a graphical approach, namely points that fall outside the whiskers of a boxplot.
boxplot(train.concrete %>% select(-id))
Based on the boxplots above (and confirmed by the per-variable listings below), there are outliers in the variables slag, water, super_plast, fine_agg, age, and strength.
summary(train.concrete)
## id cement slag flyash
## Length:825 Min. :102.0 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.:194.7 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Median :275.1 Median : 20.00 Median : 0.00
## Mean :280.9 Mean : 73.18 Mean : 54.03
## 3rd Qu.:350.0 3rd Qu.:141.30 3rd Qu.:118.20
## Max. :540.0 Max. :359.40 Max. :200.10
## water super_plast coarse_agg fine_agg
## Min. :121.8 Min. : 0.000 Min. : 801.0 Min. :594.0
## 1st Qu.:164.9 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:734.0
## Median :184.0 Median : 6.500 Median : 968.0 Median :780.1
## Mean :181.1 Mean : 6.266 Mean : 972.8 Mean :775.6
## 3rd Qu.:192.0 3rd Qu.:10.100 3rd Qu.:1028.4 3rd Qu.:826.8
## Max. :247.0 Max. :32.200 Max. :1145.0 Max. :992.6
## age strength
## Min. : 1.00 Min. : 2.33
## 1st Qu.: 7.00 1st Qu.:23.64
## Median : 28.00 Median :34.57
## Mean : 45.14 Mean :35.79
## 3rd Qu.: 56.00 3rd Qu.:45.94
## Max. :365.00 Max. :82.60
The following are the outlier values for each variable:
cat("\ncement:",boxplot(train.concrete$cement, plot = F)$out)##
## cement:
cat("\nslag:",boxplot(train.concrete$slag, plot = F)$out)##
## slag: 359.4 359.4
cat("\nwater:",boxplot(train.concrete$water, plot = F)$out)##
## water: 121.8 121.8 121.8 121.8 121.8 237 247 246.9 236.7
cat("\nsuper_plast:",boxplot(train.concrete$super_plast, plot = F)$out)##
## super_plast: 32.2 28.2 28.2 32.2 28.2 32.2 28.2
cat("\ncoarse_agg",boxplot(train.concrete$coarse_agg, plot = F)$out)##
## coarse_agg
cat("\nfine_agg:",boxplot(train.concrete$fine_agg, plot = F)$out)##
## fine_agg: 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 594 992.6 992.6 992.6 992.6 992.6
cat("\nage:",boxplot(train.concrete$age, plot = F)$out)##
## age: 270 365 360 365 365 180 180 180 365 270 180 365 365 270 365 270 180 180 180 270 270 270 180 270 360 180 180 180 360 180 365 180 365 180 270 180 180 360 180 360 180 180 180 180 360 270
cat("\nstrength:",boxplot(train.concrete$strength, plot = F)$out)##
## strength: 79.99 80.2 79.4 82.6 81.75
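For reference, the per-variable listings above can be produced in a single pass. A minimal sketch using the same boxplot rule (points beyond 1.5 x IQR from the quartiles), where boxplot.stats() is the base R helper behind boxplot():
# collect the boxplot outliers for every numeric column at once
sapply(train.concrete %>% select(-id), function(x) boxplot.stats(x)$out)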
ggcorr(train.concrete, label = T)
## Warning in ggcorr(train.concrete, label = T): data in column(s) 'id' are not
## numeric and were ignored
Insights obtained from the graph above:
1. strength has its strongest correlation with cement.
2. strength is positively correlated with age, super_plast, and slag.
3. strength is negatively correlated with fine_agg, coarse_agg, water, and flyash.
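The same ordering can be checked numerically. A minimal sketch that ranks the Pearson correlation of each predictor with strength (the object name cors is introduced here for illustration):
cors <- cor(train.concrete %>% select(-id))
sort(cors[, "strength"], decreasing = TRUE)  # strength itself will rank first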
# delete column id
concrete.pre <- train.concrete %>%
select(-id)
# Standardize the training data so that all features are on a comparable scale.
concrete.scale <- concrete.pre %>%
filter(strength<79.40) %>%
as.matrix() %>%
scale() %>%
as.data.frame()
head(concrete.scale)
set.seed(100)
idx <- initial_split(concrete.scale,prop = 0.85,strata = "strength")
concrete.train <- training(idx)
concrete.test <- testing(idx)
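Note that scale() above uses the mean and standard deviation of the whole training file before the split, so information from the future test fold leaks into the scaling. A leakage-free variant (a sketch; the names idx0, tr, and te are introduced here) would split first and reuse the training parameters on the test rows:
set.seed(100)
idx0 <- initial_split(concrete.pre %>% filter(strength < 79.40), prop = 0.85, strata = "strength")
tr <- training(idx0) %>% as.matrix() %>% scale()   # mean/sd estimated on train only
te <- testing(idx0) %>% as.matrix() %>%
  scale(center = attr(tr, "scaled:center"),        # reuse the training parameters
        scale  = attr(tr, "scaled:scale"))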
mod.linear <- lm(formula = strength ~ ., concrete.train)
mod.linear.step <- step(object = mod.linear, direction = "both")
## Start: AIC=-618.76
## strength ~ cement + slag + flyash + water + super_plast + coarse_agg +
## fine_agg + age
##
## Df Sum of Sq RSS AIC
## <none> 281.86 -618.76
## - fine_agg 1 1.002 282.87 -618.28
## - water 1 1.719 283.58 -616.50
## - coarse_agg 1 1.916 283.78 -616.02
## - super_plast 1 5.304 287.17 -607.71
## - flyash 1 13.237 295.10 -588.63
## - slag 1 27.120 308.98 -556.45
## - cement 1 55.453 337.32 -495.04
## - age 1 128.934 410.80 -357.09
summary(mod.linear.step)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + fine_agg + age, data = concrete.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72036 -0.41622 0.03637 0.44167 1.98144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.006087 0.024184 -0.252 0.801350
## cement 0.749160 0.064253 11.660 < 2e-16 ***
## slag 0.508613 0.062377 8.154 1.66e-15 ***
## flyash 0.336567 0.059082 5.697 1.81e-08 ***
## water -0.124906 0.060838 -2.053 0.040440 *
## super_plast 0.152715 0.042351 3.606 0.000333 ***
## coarse_agg 0.112005 0.051675 2.167 0.030538 *
## fine_agg 0.097174 0.061990 1.568 0.117440
## age 0.442254 0.024875 17.779 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6387 on 691 degrees of freedom
## Multiple R-squared: 0.5955, Adjusted R-squared: 0.5908
## F-statistic: 127.2 on 8 and 691 DF, p-value: < 2.2e-16
prediction_test <- predict(object = mod.linear.step, newdata = concrete.test, interval = "confidence", level = 0.95)
Computing the MAE on the test data:
MAE <- mean(abs(prediction_test - concrete.test$strength))
MAE
## [1] 0.4540487
prediction_train <- predict(object = mod.linear.step, newdata = concrete.train, interval = "confidence", level = 0.95)
Computing the MAE on the train data:
MAE <- mean(abs(prediction_train - concrete.train$strength))
MAE
## [1] 0.5169102
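Because strength was standardized, both MAE values above are in standard-deviation units. A sketch for converting them back to MPa (sd_strength is a name introduced here): multiply by the standard deviation of the filtered strength column.
sd_strength <- sd(concrete.pre$strength[concrete.pre$strength < 79.40])
MAE * sd_strength  # approximate MAE in the original MPa units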
The adjusted R-squared is 59.08% on the training data, and the MAE is similar on the training (0.52) and test (0.45) sets, so it can be concluded that the model is not overfitting.
# Normality Test
shapiro.test(mod.linear$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod.linear$residuals
## W = 0.99561, p-value = 0.04511
The residuals are not normally distributed (p-value < 0.05).
# Heteroskedasticity assumption
bptest(formula = mod.linear)
##
## studentized Breusch-Pagan test
##
## data: mod.linear
## BP = 100.01, df = 8, p-value < 2.2e-16
Heteroscedasticity is present (p-value < 0.05).
# Variance Inflation Factor assumption
vif(mod.linear)
## cement slag flyash water super_plast coarse_agg
## 7.181576 6.609436 6.004855 6.313402 2.886758 4.466061
## fine_agg age
## 6.703119 1.124223
There is no multicollinearity in the model because all variables have a VIF value of less than 10.
normalize <- function(x){
return (
(x - min(x))/(max(x) - min(x))
)
}

mm_concrete <- train.concrete[2:10]
set.seed(100)
idx <- initial_split(mm_concrete, prop = 0.85,strata = "strength")
mm_train <- training(idx)
mm_test <- testing(idx)
mm_train <- normalize(mm_train)
mm_test <- normalize(mm_test)
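Note that normalize() applied to a whole data frame rescales every value with a single global min and max taken across all columns. A per-column variant that also reuses the training ranges on the test set might look like this sketch (norm_cols, mm_train2, and mm_test2 are names introduced here for illustration):
norm_cols <- function(df, ref = df) {
  # rescale each column of df using the min/max of the matching column in ref
  as.data.frame(Map(function(x, r) (x - min(r)) / (max(r) - min(r)), df, ref))
}
mm_train2 <- norm_cols(training(idx))                      # column-wise min-max on train
mm_test2  <- norm_cols(testing(idx), ref = training(idx))  # reuse the training ranges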
model.linear2 <- lm(formula = strength ~ ., data = mm_train)
model.linear2 <- step(object = model.linear2, direction = "both")
## Start: AIC=-6586.44
## strength ~ cement + slag + flyash + water + super_plast + coarse_agg +
## fine_agg + age
##
## Df Sum of Sq RSS AIC
## - fine_agg 1 0.0001303 0.059467 -6586.9
## <none> 0.059337 -6586.4
## - coarse_agg 1 0.0003066 0.059643 -6584.8
## - super_plast 1 0.0006383 0.059975 -6580.9
## - water 1 0.0007429 0.060080 -6579.7
## - flyash 1 0.0027534 0.062090 -6556.5
## - slag 1 0.0058901 0.065227 -6521.8
## - cement 1 0.0114328 0.070770 -6464.4
## - age 1 0.0271020 0.086439 -6323.6
##
## Step: AIC=-6586.89
## strength ~ cement + slag + flyash + water + super_plast + coarse_agg +
## age
##
## Df Sum of Sq RSS AIC
## <none> 0.059467 -6586.9
## - coarse_agg 1 0.000203 0.059670 -6586.5
## + fine_agg 1 0.000130 0.059337 -6586.4
## - super_plast 1 0.000574 0.060041 -6582.1
## - water 1 0.003687 0.063154 -6546.5
## - flyash 1 0.004850 0.064318 -6533.7
## - slag 1 0.016394 0.075861 -6417.5
## - age 1 0.026972 0.086439 -6325.6
## - cement 1 0.036423 0.095891 -6252.5
summary(model.linear2)
##
## Call:
## lm(formula = strength ~ cement + slag + flyash + water + super_plast +
## coarse_agg + age, data = mm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0262791 -0.0060797 0.0006573 0.0062449 0.0297398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.010927 0.009393 1.163 0.24510
## cement 0.108736 0.005266 20.647 < 2e-16 ***
## slag 0.088410 0.006383 13.852 < 2e-16 ***
## flyash 0.072454 0.009616 7.535 1.53e-13 ***
## water -0.190606 0.029016 -6.569 9.92e-11 ***
## super_plast 0.300779 0.116094 2.591 0.00978 **
## coarse_agg 0.010008 0.006499 1.540 0.12399
## age 0.120561 0.006786 17.767 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009243 on 696 degrees of freedom
## Multiple R-squared: 0.6053, Adjusted R-squared: 0.6013
## F-statistic: 152.5 on 7 and 696 DF, p-value: < 2.2e-16
pred_test <- predict(object = model.linear2, newdata = mm_test, interval = "confidence", level = 0.95)
Computing the MAE on the test data:
MAE <- mean(abs(pred_test - mm_test$strength))
MAE
## [1] 0.006891868
pred_train <- predict(object = model.linear2, newdata = mm_train, interval = "confidence", level = 0.95)
Computing the MAE on the train data:
MAE <- mean(abs(pred_train - mm_train$strength))
MAE
## [1] 0.007428877
train.log <- log1p(concrete.pre)
test.log <- log1p(concrete.pre)
concrete.log3 <- lm(data = train.log, formula = strength ~ .)
summary(concrete.log3)
##
## Call:
## lm(formula = strength ~ ., data = train.log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9458 -0.1311 0.0201 0.1552 0.5682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.482578 2.471153 1.409 0.15913
## cement 0.771994 0.034998 22.058 < 2e-16 ***
## slag 0.065494 0.005653 11.585 < 2e-16 ***
## flyash 0.024491 0.005864 4.176 3.28e-05 ***
## water -0.846822 0.131538 -6.438 2.07e-10 ***
## super_plast 0.082210 0.013247 6.206 8.63e-10 ***
## coarse_agg 0.168115 0.162380 1.035 0.30083
## fine_agg -0.356007 0.128062 -2.780 0.00556 **
## age 0.301523 0.007392 40.793 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2335 on 816 degrees of freedom
## Multiple R-squared: 0.8066, Adjusted R-squared: 0.8047
## F-statistic: 425.3 on 8 and 816 DF, p-value: < 2.2e-16
pred_test3 <- predict(object = concrete.log3, newdata = test.log, interval = "confidence", level = 0.95)
Computing the MAE on the test data:
MAE <- mean(abs(pred_test3 - test.log$strength))
MAE
## [1] 0.1841183
pred_train3 <- predict(object = concrete.log3, newdata = train.log, interval = "confidence", level = 0.95)
Computing the MAE on the train data:
MAE <- mean(abs(pred_train3 - train.log$strength))
MAE
## [1] 0.1841183
Note that train.log and test.log are both built from the same concrete.pre data, which is why the two MAE values are identical; this comparison does not measure out-of-sample error.
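Because the model was fit on log1p-transformed data, the MAE above is on the log scale. A sketch for expressing the error in MPa (mae_mpa is a name introduced here): back-transform the fitted values with expm1() before comparing against the original strength column.
mae_mpa <- mean(abs(expm1(pred_train3[, "fit"]) - concrete.pre$strength))
mae_mpa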
clean.data <- train.concrete %>% select(-id)
clean.data <- clean.data %>% filter(strength < 79.40)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(clean.data), nrow(clean.data) * 0.85)
data_train_rf <- clean.data[index,]
data_test_rf <- clean.data[-index,]
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
ctrl <- trainControl(method="repeatedcv", number = 6, repeats = 3)
forest <- train(strength ~ ., data = clean.data, method = "rf", trControl = ctrl)
saveRDS(forest, "model_rf.RDS") forest_read <- readRDS("model_rf.RDS")
forest_read## Random Forest
##
## 820 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (6 fold, repeated 3 times)
## Summary of sample sizes: 682, 684, 684, 684, 683, 683, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 5.673441 0.9009270 4.231798
## 5 5.056894 0.9089051 3.653251
## 8 5.104844 0.9055707 3.661927
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
forest_read$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 23.62833
## % Var explained: 91.19
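Note that forest was trained on all of clean.data, so data_test_rf is not a true hold-out for it; the cross-validated metrics above are the honest performance estimate. A sketch of a split-based check, fitting on data_train_rf only (forest_split and pred_split are names introduced here):
set.seed(100)
forest_split <- train(strength ~ ., data = data_train_rf, method = "rf", trControl = ctrl)
pred_split <- predict(forest_split, newdata = data_test_rf)
MLmetrics::MAE(y_pred = pred_split, y_true = data_test_rf$strength)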
varImp(forest_read)
## rf variable importance
##
## Overall
## age 100.000
## cement 81.804
## water 27.917
## super_plast 12.971
## slag 7.833
## fine_agg 5.664
## coarse_agg 1.081
## flyash 0.000
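The importance ranking can also be visualized; caret provides a plot method for varImp objects:
plot(varImp(forest_read))  # dotplot of the scaled importances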
pred_rf <- predict(object = forest_read, newdata = test.concrete)
test.concrete$strength <- pred_rf
submission <- data.frame(id = test.concrete$id,
                         strength = pred_rf)
write.csv(submission, "submission-cyndy.csv", row.names = F)
head(submission, 3)
LIME can be used to interpret complex models locally, making individual predictions easier to understand. This is more informative than a raw random forest, which tends to be a black box and only exposes a summary of the modeling results.
explainerrf <- lime(data_train_rf, model = forest)
set.seed(100)
testlimerf <- data_test_rf[2:5,]
explanationrf <- explain(testlimerf,
explainer = explainerrf,
n_features = 8)
plot_features(explanationrf)
plot_explanations(explanationrf)
## Warning: Unknown or uninitialised column: `label`.
The conclusions of this project:
1. Is the goal achieved? Yes: the random forest model meets the targets of R-squared > 90% and an MAE score < 4.
2. Can the problem be solved by machine learning? So far, yes.
3. Which models were used, and how did they perform? Two models were used: multiple linear regression and random forest. Of the two, the random forest performed best.
Linear Regression (outliers handled by scaling)
- Adjusted R-squared: 59.08%
- MAE train: 0.52
- MAE test: 0.45
Random Forest
- % variance explained: 91.19%
- MAE: 3.65