For the fully functional HTML version, please visit http://www.rpubs.com/jasonchanhku/concrete
library(ggvis)     #Data visualization
library(psych)     #Scatterplot matrix
library(knitr)     #HTML tables
library(neuralnet) #Artificial neural network
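If any of these packages are not yet installed, they can be obtained from CRAN first (a one-off setup step, not shown in the original analysis):
#One-off setup: install the required packages if they are missing
#install.packages(c("ggvis", "psych", "knitr", "neuralnet"))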
In engineering, accurate estimates of concrete strength are crucial for developing safety guidelines in construction. Concrete performance varies greatly because a wide variety of ingredients interact in complex ways, which makes the strength of the final product difficult to predict.
This project aims to develop a reliable model using Artificial Neural Networks (ANN) to predict concrete strength given a list of composition inputs.
The dataset on the compressive strength of concrete is obtained from the UCI Machine Learning Data Repository http://archive.ics.uci.edu/ml. The concrete dataset contains 1,030 examples of concrete with eight features describing the components used in the mixture.
concrete <- read.csv(file = "Machine-Learning-with-R-datasets-master/concrete.csv")
knitr::kable(head(concrete), caption = "Partial Table Preview")
cement | slag | ash | water | superplastic | coarseagg | fineagg | age | strength |
---|---|---|---|---|---|---|---|---|
540.0 | 0.0 | 0 | 162 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
540.0 | 0.0 | 0 | 162 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
332.5 | 142.5 | 0 | 228 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
332.5 | 142.5 | 0 | 228 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
198.6 | 132.4 | 0 | 192 | 0.0 | 978.4 | 825.5 | 360 | 44.30 |
266.0 | 114.0 | 0 | 228 | 0.0 | 932.0 | 670.0 | 90 | 47.03 |
str(concrete)
## 'data.frame': 1030 obs. of 9 variables:
## $ cement : num 540 540 332 332 199 ...
## $ slag : num 0 0 142 142 132 ...
## $ ash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ water : num 162 162 228 228 192 228 228 228 228 228 ...
## $ superplastic: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ coarseagg : num 1040 1055 932 932 978 ...
## $ fineagg : num 676 676 594 594 826 ...
## $ age : int 28 28 270 365 360 90 365 28 28 28 ...
## $ strength : num 80 61.9 40.3 41 44.3 ...
Target Variable: strength
Features Used: cement, slag, ash, water, superplastic, coarseagg, fineagg, and age
Visualizing the data enables us to adjust the model if needed and to spot outliers at an early stage. It is also important to visualize the distribution of the target variable, strength.
#Histogram of strength with a vertical red line at the mean (~35.82); the
#line's height (0 to 150) is an assumed value chosen to roughly match the bars
concrete %>% ggvis(x = ~strength, fill := "#27bc9c") %>% layer_histograms() %>%
  layer_paths(data = data.frame(x = c(35.82, 35.82), y = c(0, 150)), x = ~x, y = ~y, stroke := "red")
From the histogram, it is clear that the distribution is slightly positively skewed. Even so, the majority of concrete samples have a strength close to the mean of 35.82; not many are much stronger or weaker than that.
For a general overview of correlation and scatterplots, a scatterplot matrix is plotted for all features.
pairs.panels(concrete[c("cement", "slag", "ash", "strength")])
pairs.panels(concrete[c("superplastic", "coarseagg", "fineagg", "age", "strength")])
As the activation functions used in ANNs are only sensitive to input values within a narrow range, the data need to be normalized. Neural networks work best when the input data are scaled to a small interval around zero.
Since the scatterplot matrices show that most of the features and the target variable are not normally distributed, min-max normalization is more appropriate here than Z-score standardization. Normalization is done using a custom normalize() function.
#Min-max normalization: rescales a numeric vector onto the [0, 1] interval
normalize <- function(x){
  return((x - min(x))/(max(x) - min(x)))
}
concrete_norm <- as.data.frame(lapply(concrete, normalize))
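As a quick sanity check (an added step, not part of the original write-up), every normalized column should now span exactly 0 to 1:
#Sanity check: min-max normalization maps each column onto [0, 1]
summary(concrete_norm$strength)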
A preview of the normalized concrete dataset:
kable(round(head(concrete_norm), digits = 3), caption = "Normalized Data Preview")
cement | slag | ash | water | superplastic | coarseagg | fineagg | age | strength |
---|---|---|---|---|---|---|---|---|
1.000 | 0.000 | 0 | 0.321 | 0.078 | 0.695 | 0.206 | 0.074 | 0.967 |
1.000 | 0.000 | 0 | 0.321 | 0.078 | 0.738 | 0.206 | 0.074 | 0.742 |
0.526 | 0.396 | 0 | 0.848 | 0.000 | 0.381 | 0.000 | 0.739 | 0.473 |
0.526 | 0.396 | 0 | 0.848 | 0.000 | 0.381 | 0.000 | 1.000 | 0.482 |
0.221 | 0.368 | 0 | 0.561 | 0.000 | 0.516 | 0.581 | 0.986 | 0.523 |
0.374 | 0.317 | 0 | 0.848 | 0.000 | 0.381 | 0.191 | 0.245 | 0.557 |
After normalization, the dataset is ready to be split into a training set and a test set, with 75% of the rows used for training and 25% for testing. Note that the dataset is already randomly ordered, so there is no need for random sampling before the split.
#training set
concrete_train <- concrete_norm[1:773, ]
#test set
concrete_test <- concrete_norm[774:1030, ]
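If the rows were not already in random order, a shuffle before splitting would be needed to avoid a biased split. A minimal sketch, not required for this dataset:
#Hypothetical: shuffle the rows before splitting (not needed here)
#set.seed(123)
#concrete_norm <- concrete_norm[sample(nrow(concrete_norm)), ]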
The training set will be used to build the neural network, and the test set will be used to evaluate how well the model generalizes to future data.
The model will be trained using the neuralnet package, which also offers visualization of the network architecture. A multilayer feedforward network with a single hidden node is constructed. The training is implemented as follows:
#Build a neural network with a single hidden node
concrete_model <- neuralnet(strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age , data = concrete_train, hidden = 1)
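Note that neuralnet initializes the weights randomly, so the fitted model and its error will differ slightly from run to run. Fixing the seed makes a run reproducible (an optional step, not in the original run):
#Optional: run before neuralnet() to make the random weight initialization reproducible
#set.seed(12345)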
The constructed neural network can be visualized simply by using plot():
plot(concrete_model)
The error shown on the plot is the Sum of Squared Errors (SSE). The lower the SSE the better.
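The training error can also be read directly from the fitted object; neuralnet stores it, along with the step count, in the result matrix:
#Training error (SSE) and number of training steps stored by neuralnet
concrete_model$result.matrix[c("error", "steps"), ]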
Now that the neural network has been trained on the training data, it is time to put the test data to the test. Note that compute() is used here rather than predict().
#Build the predictor, excluding the target variable column
model_results <- compute(concrete_model, concrete_test[1:8])
#Store the net.result component (the predicted values)
predicted_strength <- model_results$net.result
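compute() returns a list with two components: $neurons, holding the activations of each layer, and $net.result, holding the predicted values. Recent versions of the neuralnet package also provide a predict() method, in which case the following should be equivalent:
#With newer versions of neuralnet, predict() should work as well:
#predicted_strength <- predict(concrete_model, concrete_test[1:8])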
As this is a numeric prediction problem, correlation rather than a confusion matrix is used to gauge the linear association between the predicted and actual strength values.
cor(predicted_strength, concrete_test$strength)
## [,1]
## [1,] 0.716963052
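Correlation only captures linear association. As an added check (computed here on the normalized scale, not in the original analysis), the mean absolute error gives a sense of the typical prediction error:
#Mean absolute error between predicted and actual strength (normalized scale)
mean(abs(predicted_strength - concrete_test$strength))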
Neural networks with more complex topologies are capable of learning more difficult relationships. Hence, the hidden layer is expanded to 5 nodes in the hope of improving the model.
#Build the new model with 5 hidden nodes
concrete_model2 <- neuralnet(strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age, data = concrete_train, hidden = 5)
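As before, the larger network can be inspected with plot() to see its topology and training error:
plot(concrete_model2)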
The SSE has decreased significantly, from 5.67 to only 1.64, at the cost of a few thousand additional training steps.
#Build the new predictor
model_results2 <- compute(concrete_model2, concrete_test[1:8])
#Store the results
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$strength)
## [,1]
## [1,] 0.7917119854
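The correlation has improved from about 0.72 to 0.79. As a final practical step (an addition to the original analysis; unnormalize() is a custom helper, not part of neuralnet), the normalized predictions can be mapped back onto the original strength scale by inverting the min-max transformation:
#Invert the min-max normalization to recover strength in its original units
unnormalize <- function(x, original) {
  x * (max(original) - min(original)) + min(original)
}
strength_pred <- unnormalize(predicted_strength2, concrete$strength)
head(strength_pred)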