For the fully functional html version, please visit http://www.rpubs.com/jasonchanhku/concrete

Libraries Used

library(ggvis)     # Data visualization
library(psych)     # Scatterplot matrix
library(knitr)     # HTML table
library(neuralnet) # Artificial neural network


Objective

In engineering, accurate estimates of concrete strength are crucial for developing safety guidelines in construction. Concrete performance varies greatly because it depends on a wide variety of ingredients that interact in complex ways. As a result, it is difficult to accurately predict the strength of the final product.

This project aims to develop a reliable model using Artificial Neural Networks (ANN) to predict concrete strength given a list of composition inputs.


Step 1: Data Exploration

The dataset on the compressive strength of concrete is obtained from the UCI Machine Learning Data Repository http://archive.ics.uci.edu/ml. The concrete dataset contains 1,030 examples of concrete with eight features describing the components used in the mixture.

Data Preview

concrete <- read.csv(file = "Machine-Learning-with-R-datasets-master/concrete.csv")

knitr::kable(head(concrete), caption = "Partial Table Preview")
Partial Table Preview
cement   slag  ash  water  superplastic  coarseagg  fineagg  age  strength
 540.0    0.0    0    162           2.5     1040.0    676.0   28     79.99
 540.0    0.0    0    162           2.5     1055.0    676.0   28     61.89
 332.5  142.5    0    228           0.0      932.0    594.0  270     40.27
 332.5  142.5    0    228           0.0      932.0    594.0  365     41.05
 198.6  132.4    0    192           0.0      978.4    825.5  360     44.30
 266.0  114.0    0    228           0.0      932.0    670.0   90     47.03

Data Structure

str(concrete)
## 'data.frame':    1030 obs. of  9 variables:
##  $ cement      : num  540 540 332 332 199 ...
##  $ slag        : num  0 0 142 142 132 ...
##  $ ash         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ water       : num  162 162 228 228 192 228 228 228 228 228 ...
##  $ superplastic: num  2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ coarseagg   : num  1040 1055 932 932 978 ...
##  $ fineagg     : num  676 676 594 594 826 ...
##  $ age         : int  28 28 270 365 360 90 365 28 28 28 ...
##  $ strength    : num  80 61.9 40.3 41 44.3 ...

Features

Target Variable:

  • strength

Features Used:

  • cement
  • slag
  • ash
  • water
  • superplasticizer
  • coarse aggregate
  • fine aggregate
  • age

Data Visualization

Visualizing the data enables us to adjust the model if needed and to spot outliers at an early stage. It is also important to visualize the distribution of the target variable, strength.

Strength Histogram

# Histogram of strength, with a red line marking the mean (35.82)
concrete %>%
  ggvis(x = ~strength, fill := "#27bc9c") %>%
  layer_histograms() %>%
  layer_paths(y = ~strength, 35.82, stroke := "red")

From the histogram, the distribution is slightly positively skewed. Even so, most samples have a strength close to the mean of 35.82, and relatively few are very strong or very weak.
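
The direction of the skew can also be checked numerically. The snippet below is a minimal sketch that uses a made-up vector rather than the concrete data. The helper sample_skewness() is a hypothetical addition that computes the moment-based skewness coefficient; a positive value indicates a longer right tail.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  m2 <- mean(centered^2)   # second central moment
  m3 <- mean(centered^3)   # third central moment
  m3 / m2^1.5
}

# Made-up values with a long right tail (not the concrete data)
toy_strength <- c(20, 25, 30, 32, 35, 36, 40, 55, 80)

skew_value <- sample_skewness(toy_strength)   # positive for a right-skewed sample
mean_value <- mean(toy_strength)
```

On the real data, the same helper could be called as sample_skewness(concrete$strength).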

Scatterplot Matrices

For a general overview of correlation and scatterplots, a scatterplot matrix is plotted for all features.

Scatterplot Matrix 1

pairs.panels(concrete[c("cement", "slag", "ash", "strength")])

Scatterplot Matrix 2

pairs.panels(concrete[c("superplastic", "coarseagg", "fineagg", "age", "strength")])


Step 2: Data Preprocessing & Preparation

Because the activation functions in an ANN respond to changes in their input only over a small range of x values, the data need to be rescaled before training. Neural networks work best when the input data are scaled to a narrow range around zero.

Traditional Normalization

As most of the features and the target variable are not normally distributed, as seen from the scatterplot matrices, traditional (min-max) normalization is more appropriate than Z-score standardization. Normalization is done using a custom normalize() function.

normalize <- function(x){
  return ((x - min(x))/(max(x) - min(x) ))
}

concrete_norm <- as.data.frame(lapply(concrete, normalize))
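
To make the formula concrete, the sketch below applies the same normalize() definition to a small made-up vector and then reverses it. The unnormalize() helper is not part of the original analysis; it is a hypothetical addition showing how a value on the 0 to 1 scale could be mapped back to the original units.

```r
# Same min-max formula as above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical inverse: map scaled values back using the original vector's range
unnormalize <- function(x_scaled, original) {
  x_scaled * (max(original) - min(original)) + min(original)
}

toy <- c(10, 20, 40, 50)                   # made-up values, not the concrete data
toy_scaled <- normalize(toy)               # 0.00 0.25 0.75 1.00
toy_back   <- unnormalize(toy_scaled, toy) # recovers 10 20 40 50
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses the rest of the range.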

A preview of the normalized concrete dataset:

kable(round(head(concrete_norm), digits = 3), caption = "Normalized Data Preview")
Normalized Data Preview
cement   slag  ash  water  superplastic  coarseagg  fineagg    age  strength
 1.000  0.000    0  0.321         0.078      0.695    0.206  0.074     0.967
 1.000  0.000    0  0.321         0.078      0.738    0.206  0.074     0.742
 0.526  0.396    0  0.848         0.000      0.381    0.000  0.739     0.473
 0.526  0.396    0  0.848         0.000      0.381    0.000  1.000     0.482
 0.221  0.368    0  0.561         0.000      0.516    0.581  0.986     0.523
 0.374  0.317    0  0.848         0.000      0.381    0.191  0.245     0.557


Data Preparation

After normalization, the dataset is ready to be split into its training set and test set. The proportion used here will be 75% training and 25% test. Note that the dataset is already randomly sorted. Therefore, there is no need for random sampling before preparation.

#training set
concrete_train <- concrete_norm[1:773, ]

#test set
concrete_test <- concrete_norm[774:1030, ]
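
If the rows were not already in random order, taking the first 773 rows could bias the split. A minimal sketch of a shuffled 75/25 split follows. It uses a made-up data frame, and the seed value and the names toy, train_idx, toy_train and toy_test are illustrative assumptions rather than part of the original analysis.

```r
set.seed(123)  # arbitrary seed so the shuffle is repeatable

# Made-up stand-in for a normalized data frame
toy <- data.frame(x = runif(100), strength = runif(100))

n_train   <- floor(0.75 * nrow(toy))       # 75 rows for training
train_idx <- sample(nrow(toy), n_train)    # random row indices, without replacement

toy_train <- toy[train_idx, ]              # rows selected for training
toy_test  <- toy[-train_idx, ]             # all remaining rows
```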

The training set will be used to build the neural network, and the test set will be used to evaluate how well the model generalizes to future data.


Step 3: Model Training

The model will be trained using the neuralnet package, which also offers visualization of the network architecture. A multilayer feedforward network with one hidden node is constructed. The training is implemented as follows:

# Build a neural network with a single hidden layer containing one node
concrete_model <- neuralnet(strength ~ cement + slag + ash + water + superplastic +
                              coarseagg + fineagg + age,
                            data = concrete_train, hidden = 1)
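
To illustrate what such a network computes, here is a minimal base-R sketch of a forward pass through a network with a single hidden node. It assumes a logistic activation in the hidden node and a linear output node, which is understood to match the neuralnet defaults (act.fct = "logistic", linear.output = TRUE). The function forward_one_hidden() and all weights are made up for illustration; they are not the fitted values.

```r
# Logistic (sigmoid) activation
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical forward pass: inputs -> one hidden node -> linear output
forward_one_hidden <- function(x, w_hidden, b_hidden, w_out, b_out) {
  hidden <- logistic(sum(w_hidden * x) + b_hidden)  # hidden node activation
  w_out * hidden + b_out                            # linear output node
}

# First row of the normalized preview (eight features) with made-up weights
x <- c(1.000, 0.000, 0, 0.321, 0.078, 0.695, 0.206, 0.074)
w_hidden <- rep(0.5, 8)
prediction <- forward_one_hidden(x, w_hidden, b_hidden = -1,
                                 w_out = 0.8, b_out = 0.1)
```

With only one hidden node, every prediction is a rescaled version of a single logistic curve, which limits how complex a relationship the network can represent.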

Visualizing Neural Network

The constructed neural network can be visualized simply by calling plot():

plot(concrete_model)

The error shown on the plot is the Sum of Squared Errors (SSE). The lower the SSE the better.
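
As a reminder of what this quantity measures, the sketch below computes a plain sum of squared errors for made-up actual and predicted values. The exact scaling that neuralnet applies to the number it reports is not covered here and should be checked against the package documentation; this example only illustrates the idea.

```r
# Made-up normalized values, not output from the fitted model
actual    <- c(0.20, 0.50, 0.90)
predicted <- c(0.25, 0.40, 0.80)

# Sum of squared differences between predicted and actual values
sse <- sum((predicted - actual)^2)
```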


Step 4: Model Evaluation

Now that the neural network has been trained on the training data, it can be evaluated on the test data. Note that compute() is used here rather than predict().

# Compute predictions, excluding the target variable column
model_results <- compute(concrete_model, concrete_test[1:8])

# Store the net.result component
predicted_strength <- model_results$net.result

Model Accuracy

As this is a numeric prediction problem, correlation rather than a confusion matrix is used to measure the linear association between the predicted and actual strength values.

cor(predicted_strength, concrete_test$strength)
##             [,1]
## [1,] 0.716963052
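
Correlation measures how closely predictions track the actual values, but not how far off they are in absolute terms. As a complementary check, the sketch below computes a mean absolute error alongside the correlation. The values are made up and the mae() helper is a hypothetical addition, not part of the original analysis.

```r
# Hypothetical helper: mean absolute error
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Made-up normalized values, not output from the fitted model
actual    <- c(0.10, 0.40, 0.60, 0.90)
predicted <- c(0.20, 0.50, 0.70, 1.00)   # always 0.10 too high

toy_cor <- cor(predicted, actual)   # perfectly correlated despite the offset
toy_mae <- mae(actual, predicted)   # average size of the error
```

On the real results, the same helper could be called as mae(concrete_test$strength, predicted_strength), keeping in mind that the values are on the normalized scale.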


Step 5: Improving the Model

Neural networks with more complex topologies are capable of learning more complex relationships. Hence, the hidden layer is given 5 nodes in the hope of improving the model.

#building the new model
concrete_model2 <- neuralnet(strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age, data = concrete_train, hidden = 5 )

Visualizing the Improved Neural Network

The SSE has fallen significantly, from 5.67 to only 1.64, at the cost of a few thousand additional training steps.

Implementing the Improved Neural Network

#building the new predictor
model_results2 <- compute(concrete_model2, concrete_test[1:8])

#storing the results
predicted_strength2 <- model_results2$net.result

Evaluating New Model

cor(predicted_strength2, concrete_test$strength)
##              [,1]
## [1,] 0.7917119854