Principal Component Analysis (PCA) of Concrete Strength Data

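The aim of applying Principal Component Analysis (PCA) here is to reduce the dimensionality of the dataset while retaining as much of its variance as possible.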

Load necessary libraries

library(ggplot2)
library(factoextra)
library(corrplot)
library(gridExtra)

Load the dataset

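The dataset describes the composition of concrete mixes (cement, slag, fly ash, water, superplasticizer, coarse and fine aggregate), the age of each sample, and the final strength of the concrete. The relationships between these characteristics are analyzed below.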

concrete <- read.csv("Concrete strength prediction.csv")

# rename columns for better readability
colnames(concrete) <- c("Cement", "Slag", "FlyAsh", "Water", "Superplasticizer", "CoarseAgg", "FineAgg", "Age", "Strength")

# "Strength" is excluded to perform PCA only on independent variables.
independent_vars <- concrete[, 1:8]

# inspect the structure of the dataset
str(independent_vars)
## 'data.frame':    1030 obs. of  8 variables:
##  $ Cement          : num  540 540 332 332 199 ...
##  $ Slag            : num  0 0 142 142 132 ...
##  $ FlyAsh          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Water           : num  162 162 228 228 192 228 228 228 228 228 ...
##  $ Superplasticizer: num  2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ CoarseAgg       : num  1040 1055 932 932 978 ...
##  $ FineAgg         : num  676 676 594 594 826 ...
##  $ Age             : num  28 28 270 365 360 90 365 28 28 28 ...

Correlation matrix

Visualization of correlations between the variables helps to understand their relationships before applying PCA.

correlation_matrix <- cor(independent_vars)
corrplot(correlation_matrix, method = "color", tl.cex = 0.8, number.cex = 0.7)

Narrowing the data scope

The Superplasticizer variable is excluded because of its strong correlation with Water.

independent_vars <- subset(independent_vars, select = -Superplasticizer)
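Strongly correlated pairs can also be listed programmatically instead of being read off the plot. The following is a minimal sketch on synthetic stand-in data, not the concrete dataset: the `toy` data frame, its built-in relationship between `Water` and `Superplasticizer`, and the 0.6 cut-off are all assumptions chosen for illustration.

```r
# Synthetic stand-in data (NOT the real dataset): Superplasticizer is
# constructed to depend on Water so that one pair is strongly correlated.
set.seed(1)
n <- 200
water <- rnorm(n, mean = 180, sd = 20)
toy <- data.frame(
  Cement = rnorm(n, mean = 280, sd = 100),
  Water = water,
  Superplasticizer = 20 - 0.08 * water + rnorm(n)
)

threshold <- 0.6  # illustrative cut-off, an assumption
cm <- cor(toy)

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx]
)
print(high_pairs)
```

On the real data, `toy` would be replaced by `independent_vars` as it stands before the exclusion.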

Performing PCA

PCA was applied to reduce the dataset’s dimensionality while preserving the maximum variance and information. The primary objective of this process is to simplify analysis and visualization by reducing the dataset’s size while retaining as much of the original data structure as possible.

# standardization to ensure that all variables contribute equally to the analysis
vars_scaled <- scale(independent_vars)

# PCA applied; the data are already standardized above, so prcomp() does no further centering or scaling
pca_result <- prcomp(vars_scaled, center = FALSE, scale. = FALSE)
summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     1.3806 1.1730 1.0978 1.0067 0.9138 0.79602 0.1756
## Proportion of Variance 0.2723 0.1966 0.1722 0.1448 0.1193 0.09052 0.0044
## Cumulative Proportion  0.2723 0.4688 0.6410 0.7858 0.9051 0.99560 1.0000
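To make the link between the "Standard deviation" row and eigenvalues explicit: for standardized data, the squared standard deviations returned by `prcomp()` are the eigenvalues of the correlation matrix. The sketch below checks this on synthetic data; the `toy` data frame is an assumption used only for illustration.

```r
# Synthetic stand-in data (NOT the real dataset)
set.seed(42)
toy <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
toy$d <- toy$a + 0.5 * rnorm(100)  # make one variable depend on another

toy_scaled <- scale(toy)
pca_toy <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)

# Eigen-decomposition of the correlation matrix of the raw data
eig <- eigen(cor(toy))

# Squared component standard deviations match the eigenvalues
same_eigenvalues <- isTRUE(all.equal(pca_toy$sdev^2, eig$values))
print(same_eigenvalues)

# With standardized variables, the eigenvalues sum to the number of variables
print(sum(pca_toy$sdev^2))
```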

Kaiser’s stopping rule - Selecting significant components

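Following Kaiser’s criterion, we retain only components with eigenvalues greater than 1. Because the variables are standardized, each has unit variance, so such a component explains more variance than any single original variable.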

## Number of components with eigenvalue > 1: 4
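The code that produced this count is not shown, so the following is only a sketch of how it could be reproduced. It uses the standard deviations copied from the summary table above; in the actual workflow `pca_result$sdev` would take the place of the hard-coded vector.

```r
# Standard deviations of the components, copied from summary(pca_result)
sdev <- c(1.3806, 1.1730, 1.0978, 1.0067, 0.9138, 0.79602, 0.1756)

# Eigenvalues are the squared standard deviations
eigenvalues <- sdev^2

# Kaiser's rule: count components whose eigenvalue exceeds 1
n_kaiser <- sum(eigenvalues > 1)
cat("Number of components with eigenvalue > 1:", n_kaiser, "\n")
```

Note that PC4 passes only narrowly (its eigenvalue is about 1.01), so the rule is best read together with the scree plot.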

A scree plot helps visualize the eigenvalues and determine the optimal number of components to retain. According to Kaiser’s rule, 4 components are eligible to be retained.

This plot illustrates the percentage of variance explained by each principal component.

The 4 retained components together explain almost 79% of the total variance (cumulative proportion 0.7858), which is satisfactory.
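The explained-variance shares and a basic scree plot can be rebuilt from the reported standard deviations. This is a sketch in base R under that assumption, not the plotting code used for the original figure, which is not shown.

```r
# Standard deviations of the components, copied from summary(pca_result)
sdev <- c(1.3806, 1.1730, 1.0978, 1.0067, 0.9138, 0.79602, 0.1756)
names(sdev) <- paste0("PC", seq_along(sdev))

eigenvalues <- sdev^2
prop_var <- eigenvalues / sum(eigenvalues)  # share of variance per component
cum_var <- cumsum(prop_var)                 # cumulative share
print(round(cum_var, 3))

# Scree plot of the explained variance in percent
barplot(100 * prop_var, ylab = "Explained variance (%)", main = "Scree plot")

# With standardized variables, an eigenvalue of 1 corresponds to a share of 1/p
abline(h = 100 / length(sdev), lty = 2)
```

Bars above the dashed line correspond to components that satisfy Kaiser's rule.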

Variable correlation in PCA

This plot shows how well each variable is represented in the PCA space. Variables pointing in the same direction are positively correlated, while those in opposite directions are negatively correlated.

FineAgg and FlyAsh are positively correlated, and both are negatively correlated with Age. A strong negative correlation can also be observed between CoarseAgg and Slag.
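The arrow coordinates in such a plot are the correlations between the original variables and the components. For standardized data they can be computed as the loadings multiplied by the component standard deviations. The sketch below verifies this on synthetic data; the `toy` data frame is an assumption used only for illustration.

```r
# Synthetic stand-in data (NOT the real dataset)
set.seed(7)
toy <- data.frame(a = rnorm(150), b = rnorm(150), c = rnorm(150))
toy$d <- -toy$a + 0.5 * rnorm(150)  # d is built to oppose a

toy_scaled <- scale(toy)
pca_toy <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)

# Variable coordinates: each loading column scaled by its component's sdev
var_coord <- sweep(pca_toy$rotation, 2, pca_toy$sdev, `*`)

# They coincide with the variable-component correlations
var_cor <- cor(toy_scaled, pca_toy$x)
coords_match <- isTRUE(all.equal(var_coord, var_cor, check.attributes = FALSE))
print(coords_match)

# cos2 measures how well a variable is represented on each component;
# over all components it sums to 1 for every variable
cos2 <- var_coord^2
print(round(rowSums(cos2), 6))
```

Variables whose coordinates have the same sign on the plotted components point in similar directions, while opposite signs indicate opposing arrows.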

Projection of observations onto the new PCs

This visualization projects individual observations onto the principal component axes, showing how they spread along them.

Most observations appear to cluster just above and to the right of the origin.
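The coordinates of the projected observations (the scores) are the standardized data multiplied by the loading matrix. A minimal sketch on synthetic data follows; the `toy` data frame is an assumption used only for illustration.

```r
# Synthetic stand-in data (NOT the real dataset)
set.seed(3)
toy <- data.frame(a = rnorm(120), b = rnorm(120), c = rnorm(120))
toy$d <- toy$b + 0.5 * rnorm(120)

toy_scaled <- scale(toy)
pca_toy <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)

# Project the observations onto the components by hand
scores <- toy_scaled %*% pca_toy$rotation

# The result matches the scores stored by prcomp() in $x
scores_match <- isTRUE(all.equal(scores, pca_toy$x, check.attributes = FALSE))
print(scores_match)

# Scores are centered at the origin because the input was centered
print(round(colMeans(scores), 6))
```

Calling `plot(scores[, 1:2])` would draw the observations on the first two components.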

Contribution of variables to each principal component

Here, we analyze which variables contribute the most to each principal component.

  • Water and Fine Aggregate are dominant in PC1, likely representing mix composition.
  • Cement and Slag play a critical role in PC2, highlighting their importance in structural strength.
  • Coarse Aggregate drives PC3, focusing on structural composition.
  • Age is most influential in PC4, emphasizing strength development over time.
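Contributions of this kind can be computed directly from the loadings: because each loading vector has unit length, the squared loadings of a component sum to 1 and can be read as percentage contributions. The sketch below uses synthetic data, and the `toy` data frame is an assumption; it is not the code behind the findings listed above.

```r
# Synthetic stand-in data (NOT the real dataset)
set.seed(11)
toy <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
toy$d <- toy$c + 0.5 * rnorm(100)

pca_toy <- prcomp(scale(toy), center = FALSE, scale. = FALSE)

# Percentage contribution of each variable to each component
contrib <- 100 * pca_toy$rotation^2
print(round(contrib, 1))

# Each component's contributions add up to 100%
print(colSums(contrib))

# Variable with the largest contribution for every component
top_var <- apply(contrib, 2, function(x) names(which.max(x)))
print(top_var)
```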

The principal components capture distinct aspects of concrete formulation, including mix balance, strength, aggregate structure, and aging effects. Reducing the dimensionality makes the dataset easier to understand while retaining most of its key information.