In the field of engineering, it is crucial to have accurate estimates of the performance of building materials. Estimating the strength of concrete is a challenge of particular interest. Although concrete is used in nearly every construction project, its performance varies greatly because a wide variety of ingredients interact in complex ways, making the strength of the final product difficult to predict. A model that could reliably predict concrete strength from the composition of the input materials could lead to safer construction practices.
library(neuralnet) # multilayer feedforward neural networks
library(lattice)   # trellis graphics
library(car)       # qqPlot()
library(GGally)    # ggpairs() scatterplot matrix
For this analysis, I will use data on the compressive strength of concrete donated to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) by I-Cheng Yeh.
concrete <- read.csv("concrete.csv")
str(concrete)
## 'data.frame': 1030 obs. of 9 variables:
## $ cement : num 540 540 332 332 199 ...
## $ slag : num 0 0 142 142 132 ...
## $ ash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ water : num 162 162 228 228 192 228 228 228 228 228 ...
## $ superplastic: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ coarseagg : num 1040 1055 932 932 978 ...
## $ fineagg : num 676 676 594 594 826 ...
## $ age : int 28 28 270 365 360 90 365 28 28 28 ...
## $ strength : num 80 61.9 40.3 41 44.3 ...
The first seven variables record ingredient amounts in kilograms per cubic metre of mixture, age is the curing time in days, and strength is the compressive strength in megapascals (MPa). A summary of each variable:
summary(concrete)
## cement slag ash water
## Min. :102.0 Min. : 0.0 Min. : 0.00 Min. :121.8
## 1st Qu.:192.4 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:164.9
## Median :272.9 Median : 22.0 Median : 0.00 Median :185.0
## Mean :281.2 Mean : 73.9 Mean : 54.19 Mean :181.6
## 3rd Qu.:350.0 3rd Qu.:142.9 3rd Qu.:118.30 3rd Qu.:192.0
## Max. :540.0 Max. :359.4 Max. :200.10 Max. :247.0
## superplastic coarseagg fineagg age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:731.0 1st Qu.: 7.00
## Median : 6.400 Median : 968.0 Median :779.5 Median : 28.00
## Mean : 6.205 Mean : 972.9 Mean :773.6 Mean : 45.66
## 3rd Qu.:10.200 3rd Qu.:1029.4 3rd Qu.:824.0 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## strength
## Min. : 2.33
## 1st Qu.:23.71
## Median :34.45
## Mean :35.82
## 3rd Qu.:46.13
## Max. :82.60
Neural networks work best when the input data are scaled to a narrow range around zero, and here I see values ranging anywhere from zero up to over a thousand.
To solve this problem, I will normalize the data to a 0-1 range.
NB: any transformation applied to the data prior to training the model will have to be applied in reverse later on in order to convert predictions back to the original units of measurement. To facilitate the rescaling, I will keep the original data and create a new data frame for the normalized values.
normalize <- function(x) {
  # min-max normalization: rescale a numeric vector to the [0, 1] interval
  return((x - min(x)) / (max(x) - min(x)))
}
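A quick sanity check on a toy vector confirms the expected 0-1 scaling:
normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00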
concrete_norm <- as.data.frame(lapply(concrete, normalize))
summary(concrete_norm)
## cement slag ash water
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2063 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.3442
## Median :0.3902 Median :0.06121 Median :0.0000 Median :0.5048
## Mean :0.4091 Mean :0.20561 Mean :0.2708 Mean :0.4774
## 3rd Qu.:0.5662 3rd Qu.:0.39775 3rd Qu.:0.5912 3rd Qu.:0.5607
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## superplastic coarseagg fineagg age
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.3808 1st Qu.:0.3436 1st Qu.:0.01648
## Median :0.1988 Median :0.4855 Median :0.4654 Median :0.07418
## Mean :0.1927 Mean :0.4998 Mean :0.4505 Mean :0.12270
## 3rd Qu.:0.3168 3rd Qu.:0.6640 3rd Qu.:0.5770 3rd Qu.:0.15110
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## strength
## Min. :0.0000
## 1st Qu.:0.2664
## Median :0.4001
## Mean :0.4172
## 3rd Qu.:0.5457
## Max. :1.0000
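Because the model will be trained on normalized values, its predictions will also fall on the 0-1 scale. A small helper can invert the min-max transformation later on (a sketch; the denormalize name is my own):
denormalize <- function(x, orig) {
  # invert min-max normalization using the range of the original values
  return(x * (max(orig) - min(orig)) + min(orig))
}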
Before modelling, it is worth inspecting the distribution of the target variable and the pairwise relationships between the features.
par(mfrow=c(1,2)) # get two graphs side by side
hist(concrete_norm$strength, prob=T, xlab='',
main='Histogram of strength value')
lines(density(concrete_norm$strength,na.rm=T))
rug(jitter(concrete_norm$strength))
qqPlot(concrete_norm$strength,main='Normal QQ plot of strength')
par(mfrow=c(1,1))
ggpairs(concrete) # pairwise scatterplots and correlations, on the original scale
Now, I will partition the data (which is already randomly sorted) into a training set with 75 percent of the examples (773 rows) and a testing set with the remaining 25 percent (257 rows).
concrete_train <- concrete_norm[1:773, ]
concrete_test <- concrete_norm[774:1030, ]
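This sequential split is safe only because the rows arrive in random order; if they did not, a random partition along these lines would be needed (a sketch; the _rnd names are illustrative):
set.seed(12345) # arbitrary seed, for reproducibility
train_idx <- sample(nrow(concrete_norm), floor(0.75 * nrow(concrete_norm)))
concrete_train_rnd <- concrete_norm[train_idx, ]
concrete_test_rnd <- concrete_norm[-train_idx, ]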
I begin by training the simplest multilayer feedforward network, with only a single hidden node, and visualize the network topology by calling plot() on the concrete_model object.
In this simple model, there is one input node for each of the eight features, followed by a single hidden node and a single output node that predicts the concrete strength. The weights for each of the connections are also depicted, as are the bias terms (indicated by the nodes with a 1).
set.seed(1)
concrete_model <- neuralnet(strength ~ cement + slag +
ash + water + superplastic +
coarseagg + fineagg + age,
data = concrete_train)
plot(concrete_model)
The plot reports 2227 training steps and a sum of squared errors (SSE) of 5.67.
The network topology diagram does not provide much information about how well the model fits our data. To estimate the model’s performance, I use the compute() function to generate predictions on the testing dataset:
model_results <- compute(concrete_model, concrete_test[1:8]) # predictor columns only
predicted_strength <- model_results$net.result
head(predicted_strength)
## [,1]
## 774 0.3897719790
## 775 0.2453570993
## 776 0.2502719641
## 777 0.2273656208
## 778 0.3319452009
## 779 0.1821083505
Because this is a numeric prediction problem rather than a classification problem, a confusion matrix cannot be used to examine model accuracy. Instead, I measure the correlation between the predicted concrete strength and the true value, which indicates the strength of the linear association between the two variables.
cor(predicted_strength, concrete_test$strength)[,1]
## [1] 0.7225217887
The correlation here of about 0.72 indicates a fairly strong relationship, which implies that the model is doing a fairly good job, even with only a single hidden node.
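Correlation is unaffected by the 0-1 rescaling, but, as noted earlier, reporting errors in the original units requires reversing the transformation. A sketch using the denormalize helper from above (the object names are illustrative):
predicted_mpa <- denormalize(predicted_strength, concrete$strength)
actual_mpa <- concrete$strength[774:1030] # true strengths for the test rows
mean(abs(predicted_mpa - actual_mpa)) # mean absolute error in MPa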
Given that only one hidden node was used, the model's performance can likely be improved; networks with more complex topologies are capable of learning more difficult concepts. Let's see what happens when the number of hidden nodes is increased to five:
set.seed(3)
concrete_model2 <- neuralnet(strength ~ cement + slag +
ash + water + superplastic +
coarseagg + fineagg + age,
data = concrete_train,
hidden = 5)
plot(concrete_model2)
The error, as measured by SSE, has fallen from 5.67 in the previous model to 1.59 here. However, the number of training steps rose from 2227 to 5560, which is no surprise given how much more complex the model has become.
Now, I will apply the same steps as before to evaluate the correlation between the predicted values and true values of the test dataset.
model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$strength)[,1]
## [1] 0.8132541967
Now, the obtained correlation is 0.81, which is a considerable improvement over the previous result.
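Further gains might come from an even more complex topology. The neuralnet package accepts a vector for the hidden argument to build multiple hidden layers; a sketch, with an arbitrary choice of layer sizes and an illustrative model name:
set.seed(3) # arbitrary seed, for reproducibility
concrete_model3 <- neuralnet(strength ~ cement + slag +
                             ash + water + superplastic +
                             coarseagg + fineagg + age,
                             data = concrete_train,
                             hidden = c(5, 3)) # two hidden layers: 5 and 3 nodes
Evaluating this model with compute() and cor() as above would show whether the extra layer actually helps or merely overfits the training data.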