The aims of applying Principal Component Analysis (PCA):
library(ggplot2)
library(factoextra)
library(corrplot)
library(gridExtra)
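The dataset describes concrete mixtures: the amount of each ingredient (cement, slag, fly ash, water, superplasticizer, coarse and fine aggregate), the age of the sample, and the final strength of the concrete. The relationships between the mix variables are analyzed below.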
concrete <- read.csv("Concrete strength prediction.csv")
# rename columns for better readability
colnames(concrete) <- c("Cement", "Slag", "FlyAsh", "Water", "Superplasticizer", "CoarseAgg", "FineAgg", "Age", "Strength")
# "Strength" is excluded to perform PCA only on independent variables.
independent_vars <- concrete[, 1:8]
# inspect the structure of the dataset
str(independent_vars)
## 'data.frame': 1030 obs. of 8 variables:
## $ Cement : num 540 540 332 332 199 ...
## $ Slag : num 0 0 142 142 132 ...
## $ FlyAsh : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Water : num 162 162 228 228 192 228 228 228 228 228 ...
## $ Superplasticizer: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ CoarseAgg : num 1040 1055 932 932 978 ...
## $ FineAgg : num 676 676 594 594 826 ...
## $ Age : num 28 28 270 365 360 90 365 28 28 28 ...
Visualizing the correlations between the variables helps clarify their relationships before PCA is applied.
correlation_matrix <- cor(independent_vars)
# addCoef.col prints the coefficients; number.cex has no effect without it
corrplot(correlation_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8, number.cex = 0.7)
The Superplasticizer variable is excluded because of its strong correlation with Water.
independent_vars <- subset(independent_vars, select = -Superplasticizer)
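The exclusion above can be checked numerically instead of being read off the plot. The following is a minimal sketch, assuming a correlation matrix with row and column names such as `correlation_matrix`; the helper name `strong_pairs` and the 0.6 cut-off are illustrative assumptions, not values taken from the analysis.

```r
# List variable pairs whose absolute correlation meets a threshold.
# `threshold = 0.6` is an illustrative default, not a value from the analysis.
strong_pairs <- function(cor_mat, threshold = 0.6) {
  # Look only at the upper triangle so each pair is reported once.
  idx <- which(upper.tri(cor_mat) & abs(cor_mat) >= threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
}
```

Calling `strong_pairs(correlation_matrix)` on the matrix computed earlier would list the pairs to review before deciding which variable to drop.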
PCA was applied to reduce the dataset’s dimensionality while preserving as much of its variance as possible. The primary objective is to simplify analysis and visualization by replacing the original variables with a smaller number of components that retain most of the original data structure.
# standardization to ensure that all variables contribute equally to the analysis
vars_scaled <- scale(independent_vars)
# PCA on the already standardized data, so prcomp() does not center or scale again
pca_result <- prcomp(vars_scaled, center = FALSE, scale. = FALSE)
summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     1.3806 1.1730 1.0978 1.0067 0.9138 0.79602 0.1756
## Proportion of Variance 0.2723 0.1966 0.1722 0.1448 0.1193 0.09052 0.0044
## Cumulative Proportion  0.2723 0.4688 0.6410 0.7858 0.9051 0.99560 1.0000
We retain only components with eigenvalues greater than 1, as per Kaiser’s criterion.
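The code behind this count is not shown in the report. One way it could be computed is sketched here from the standard deviations printed in the summary; with the fitted object available, `pca_result$sdev` would take the place of the hard-coded vector. The variable names `sdev`, `eigenvalues`, and `n_retained` are illustrative.

```r
# Standard deviations of the seven components, copied from summary(pca_result);
# with the fitted object available, use pca_result$sdev instead.
sdev <- c(1.3806, 1.1730, 1.0978, 1.0067, 0.9138, 0.79602, 0.1756)

# For standardized data, the eigenvalues of the correlation matrix are the
# squared standard deviations of the components.
eigenvalues <- sdev^2

# Kaiser's criterion: keep components whose eigenvalue exceeds 1.
n_retained <- sum(eigenvalues > 1)
cat("Number of components with eigenvalue > 1:", n_retained, "\n")
```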
## Number of components with eigenvalue > 1: 4
A scree plot helps visualize the eigenvalues and determine how many components to retain. According to Kaiser’s rule, four components qualify.
This plot illustrates the percentage of variance explained by each
principal component.
The four retained components together explain almost 79% of the total variance (cumulative proportion 0.7858), which is satisfactory.
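The percentages in such a plot follow directly from the component standard deviations: each eigenvalue is divided by the sum of all eigenvalues. A minimal sketch, using the values printed in the summary above; `explained_variance` is a hypothetical helper name, not part of any package used in the report.

```r
# Summarise variance explained from a vector of component standard deviations.
explained_variance <- function(sdev) {
  eigenvalue <- sdev^2
  proportion <- eigenvalue / sum(eigenvalue)
  data.frame(
    component  = paste0("PC", seq_along(sdev)),
    eigenvalue = eigenvalue,
    proportion = proportion,
    cumulative = cumsum(proportion)
  )
}

# Standard deviations as printed by summary(pca_result).
variance_table <- explained_variance(
  c(1.3806, 1.1730, 1.0978, 1.0067, 0.9138, 0.79602, 0.1756)
)
print(round(variance_table[, -1], 4))
```

With the fitted object at hand, `explained_variance(pca_result$sdev)` gives the same table without rounding in the inputs.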
This plot shows how well each variable is represented in the PCA
space. Variables pointing in the same direction are positively
correlated, while those in opposite directions are negatively
correlated.
FineAgg and FlyAsh are positively correlated, and both are negatively correlated with Age. A strong negative correlation can also be observed between CoarseAgg and Slag.
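Directions in a variable plot reflect the underlying correlations only for variables that are well represented on the plotted components, so it helps to inspect the numbers behind the arrows. For a PCA on standardized data, the usual convention is that a variable's coordinate on a component is its loading multiplied by the component's standard deviation, which equals the correlation between the variable and that component; the squared coordinate (cos2) measures the quality of representation. The sketch below assumes a `prcomp` fit on standardized variables; `variable_coordinates` is an illustrative helper name.

```r
# Coordinates and quality of representation (cos2) of the variables for a
# prcomp fit on standardized data. `variable_coordinates` is a hypothetical
# helper written for illustration.
variable_coordinates <- function(pca) {
  # Scale each column of loadings by that component's standard deviation.
  coord <- sweep(pca$rotation, 2, pca$sdev, "*")
  list(coord = coord, cos2 = coord^2)
}
```

For example, `rowSums(variable_coordinates(pca_result)$cos2[, 1:2])` would give the share of each variable's variance that is visible on the first two components; variables with a low value should be interpreted with caution.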
This visualization projects individual observations onto the PCA
dimensions, highlighting their variance.
Most observations appear to cluster just above and to the right of the center.
Here, we analyze which variables contribute the most to each
principal component.
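Contributions can also be computed by hand. Because each column of the rotation matrix has unit length, the squared loadings in a column sum to 1, so multiplying them by 100 gives each variable's percentage contribution to that component. The following is a sketch under that assumption for a `prcomp` object; `variable_contributions` and `top_contributors` are hypothetical helper names.

```r
# Percentage contribution of each variable to each principal component.
variable_contributions <- function(pca) {
  # Columns of the rotation matrix have unit length, so the squared loadings
  # in each column sum to 1; multiply by 100 to express them as percentages.
  100 * pca$rotation^2
}

# Names of the `n` variables contributing most to component `k`.
top_contributors <- function(pca, k = 1, n = 3) {
  contrib <- variable_contributions(pca)[, k]
  names(sort(contrib, decreasing = TRUE))[seq_len(min(n, length(contrib)))]
}
```

With the objects from this analysis, `top_contributors(pca_result, k = 1)` would name the three variables that dominate the first component, and `k = 2` to `k = 4` would do the same for the other retained components.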
The principal components capture distinct aspects of the concrete formulation, including mix balance, strength, aggregate structure, and aging effects. Reducing the dimensionality makes the dataset easier to interpret while retaining most of its key information.