In the field of engineering, it is crucial to have accurate estimates of the performance of building materials. Estimating the strength of concrete is a challenge of particular interest. Although concrete is used in nearly every construction project, its performance varies greatly due to a wide variety of ingredients that interact in complex ways. As a result, it is difficult to accurately predict the strength of the final product. A model that could reliably predict concrete strength from the composition of its ingredients could lead to safer construction practices.

Libraries Used

library(neuralnet)
library(lattice)
library(car)
library(GGally)

Load Data

For this analysis, I will use data on the compressive strength of concrete donated to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) by I-Cheng Yeh.

concrete <- read.csv("concrete.csv")
str(concrete)
## 'data.frame':    1030 obs. of  9 variables:
##  $ cement      : num  540 540 332 332 199 ...
##  $ slag        : num  0 0 142 142 132 ...
##  $ ash         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ water       : num  162 162 228 228 192 228 228 228 228 228 ...
##  $ superplastic: num  2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ coarseagg   : num  1040 1055 932 932 978 ...
##  $ fineagg     : num  676 676 594 594 826 ...
##  $ age         : int  28 28 270 365 360 90 365 28 28 28 ...
##  $ strength    : num  80 61.9 40.3 41 44.3 ...

A summary of each variable in the dataset follows:

summary(concrete)
##      cement           slag            ash             water      
##  Min.   :102.0   Min.   :  0.0   Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0   Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9   Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9   3rd Qu.:118.30   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4   Max.   :200.10   Max.   :247.0  
##   superplastic      coarseagg         fineagg           age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:731.0   1st Qu.:  7.00  
##  Median : 6.400   Median : 968.0   Median :779.5   Median : 28.00  
##  Mean   : 6.205   Mean   : 972.9   Mean   :773.6   Mean   : 45.66  
##  3rd Qu.:10.200   3rd Qu.:1029.4   3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##     strength    
##  Min.   : 2.33  
##  1st Qu.:23.71  
##  Median :34.45  
##  Mean   :35.82  
##  3rd Qu.:46.13  
##  Max.   :82.60

Data Analysis

Neural networks work best when the input data are scaled to a narrow range around zero, yet here the values range from zero to over a thousand.
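This spread can be confirmed directly by inspecting the range of each column (a quick base-R check):

# Minimum (row 1) and maximum (row 2) of every column
sapply(concrete, range)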

Normalisation

To solve this problem, I will normalise the data to a 0-1 range.

NB: any transformation applied to the data prior to training the model will have to be applied in reverse later on in order to convert back to the original units of measurement. To facilitate the rescaling, I will save the original data and create a new dataset for the normalised data.

# Min-max normalisation: rescales a numeric vector to the 0-1 range
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
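
Since the predictions will later need to be converted back to the original units, the inverse transform can be sketched alongside it (the denormalize() helper below is my own illustrative addition, not part of the original analysis):

# Inverse of normalize(): maps a 0-1 value back to the original scale,
# given the untransformed column it came from
denormalize <- function(x, orig) {
  return(x * (max(orig) - min(orig)) + min(orig))
}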

concrete_norm <- as.data.frame(lapply(concrete, normalize))
summary(concrete_norm)
##      cement            slag              ash             water       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2063   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.3442  
##  Median :0.3902   Median :0.06121   Median :0.0000   Median :0.5048  
##  Mean   :0.4091   Mean   :0.20561   Mean   :0.2708   Mean   :0.4774  
##  3rd Qu.:0.5662   3rd Qu.:0.39775   3rd Qu.:0.5912   3rd Qu.:0.5607  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##   superplastic      coarseagg         fineagg            age         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.3808   1st Qu.:0.3436   1st Qu.:0.01648  
##  Median :0.1988   Median :0.4855   Median :0.4654   Median :0.07418  
##  Mean   :0.1927   Mean   :0.4998   Mean   :0.4505   Mean   :0.12270  
##  3rd Qu.:0.3168   3rd Qu.:0.6640   3rd Qu.:0.5770   3rd Qu.:0.15110  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##     strength     
##  Min.   :0.0000  
##  1st Qu.:0.2664  
##  Median :0.4001  
##  Mean   :0.4172  
##  3rd Qu.:0.5457  
##  Max.   :1.0000

Visualising Data

par(mfrow=c(1,2)) # show two plots side by side
hist(concrete_norm$strength, prob=TRUE, xlab='',
     main='Histogram of strength value')
lines(density(concrete_norm$strength, na.rm=TRUE)) # overlay a density estimate
rug(jitter(concrete_norm$strength))                # mark individual observations
qqPlot(concrete_norm$strength, main='Normal QQ plot of strength')

par(mfrow=c(1,1)) # reset the plotting layout
ggpairs(concrete) # pairwise scatterplot matrix of all variables

Splitting Dataset

Now, I will partition the data (which is already randomly sorted) into a training set with 75 percent of the examples and a testing set with the remaining 25 percent.

concrete_train <- concrete_norm[1:773, ]    # first 773 rows (~75%)
concrete_test <- concrete_norm[774:1030, ]  # remaining 257 rows (~25%)
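
Note that this simple slicing works only because the rows are already shuffled; for data in a non-random order, a random split would be needed instead. A minimal sketch using base R's sample() (the *_r names are illustrative):

# Alternative: draw a random 75% of row indices for training
set.seed(2)
train_idx <- sample(nrow(concrete_norm), floor(0.75 * nrow(concrete_norm)))
concrete_train_r <- concrete_norm[train_idx, ]
concrete_test_r  <- concrete_norm[-train_idx, ]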

Training Model

I begin by training the simplest multilayer feedforward network, with only a single hidden node, and visualise the network topology using the plot() function on the concrete_model object.

In this simple model, there is one input node for each of the eight features, followed by a single hidden node and a single output node that predicts the concrete strength. The weights for each of the connections are also depicted, as are the bias terms (indicated by the nodes with a 1).

set.seed(1)
concrete_model <- neuralnet(strength ~ cement + slag + 
                              ash + water + superplastic + 
                              coarseagg + fineagg + age,
                            data = concrete_train)

plot(concrete_model)

The plot reports 2227 training steps and a sum of squared errors (SSE) of 5.67.
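
These figures are also stored in the fitted object itself; neuralnet records them in its result.matrix:

# The SSE ("error") and step count, as reported by the plot
concrete_model$result.matrix[c("error", "steps"), ]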

Evaluating Model Performance

The network topology diagram does not provide much information about how well the model fits the data. To estimate the model’s performance, I use the compute() function to generate predictions on the testing dataset:

model_results <- compute(concrete_model, concrete_test[1:8])
predicted_strength <- model_results$net.result
head(predicted_strength)
##             [,1]
## 774 0.3897719790
## 775 0.2453570993
## 776 0.2502719641
## 777 0.2273656208
## 778 0.3319452009
## 779 0.1821083505

Because this is a numeric prediction problem rather than a classification problem, a confusion matrix cannot be used to examine model accuracy. Instead, I measure the correlation between the predicted concrete strength and the true value, which indicates the strength of the linear association between the two variables.

cor(predicted_strength, concrete_test$strength)[,1]
## [1] 0.7225217887

The correlation here of about 0.72 indicates a fairly strong linear relationship, implying that the model does a reasonably good job even with only a single hidden node.
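
Correlation captures only the linear association, not the size of the errors. To express the error in the original units of measurement, the predictions can be mapped back with the denormalize() helper sketched earlier and summarised, for instance, by the mean absolute error (a sketch under those assumptions):

# Rescale predictions back to the original strength units, then compute
# the mean absolute error against the true (unnormalised) values
pred_orig <- denormalize(predicted_strength, concrete$strength)
actual_orig <- concrete$strength[774:1030]
mean(abs(pred_orig - actual_orig))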

Improving Model Performance

Given that only a single hidden node was used, it is likely that the model’s performance can be improved: networks with more complex topologies are capable of learning more difficult concepts. Let’s see what happens when the number of hidden nodes is increased to five:

set.seed(3)
concrete_model2 <- neuralnet(strength ~ cement + slag + 
                              ash + water + superplastic + 
                              coarseagg + fineagg + age,
                            data = concrete_train,
                            hidden = 5)

plot(concrete_model2)

The error, as measured by SSE, has been reduced from 5.67 in the previous model to 1.59 here. However, the number of training steps rose from 2227 to 5560, which is no surprise given how much more complex the model has become.

Now, I will apply the same steps as before to evaluate the correlation between the predicted values and true values of the test dataset.

model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$strength)[,1]
## [1] 0.8132541967

The correlation is now about 0.81, a considerable improvement over the previous result.
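
If further gains were desired, neuralnet also accepts a vector for its hidden argument, so a deeper topology could be tried in exactly the same way (a sketch only; I make no claim about its resulting performance):

# A possible further experiment: two hidden layers of five nodes each
set.seed(5)
concrete_model3 <- neuralnet(strength ~ cement + slag + ash + water +
                               superplastic + coarseagg + fineagg + age,
                             data = concrete_train,
                             hidden = c(5, 5))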