Introduction

This project focuses on data analysis using dimensionality reduction through Principal Component Analysis (PCA). The goal of the project is to reduce the number of variables in the dataset while preserving as much information about the data’s distribution as possible. Based on a dataset containing various properties of wheat kernels, PCA identifies key variables that contribute the most to the variance in the data.

The analysis methods applied include:

Data Standardization – All variables were normalized to have a mean of 0 and a standard deviation of 1, ensuring that each variable has an equal influence on the PCA results.
Correlation Analysis – The relationships between variables were examined to assess which ones are strongly correlated and which may be redundant.
PCA (Principal Component Analysis) – This technique was used to reduce the dimensions of the data while retaining as much variance as possible. Based on eigenvalues and the proportion of explained variance, the optimal number of components to retain was determined.
PCA Results Visualization – Plots such as biplots and bar charts were used to interpret which variables contribute most to each principal component.

The dataset used in this project contains measurements of wheat kernels, widely studied for classification and pattern recognition tasks. The dataset includes the following features:

Area: The surface area of the wheat kernel.
Perimeter: The total length around the kernel.
Compactness: A derived shape metric, computed as C = 4πA/P², where A is the area and P is the perimeter.
Length of Kernel: The kernel’s longest dimension.
Width of Kernel: The kernel’s widest dimension.
Asymmetry Coefficient: A measure of the kernel’s asymmetry.
Length of Kernel Groove: The length of the central groove in the kernel.

These variables describe the kernels’ physical characteristics in detail, making the dataset well suited to PCA. An eighth column, Class, identifies the wheat variety (coded 1 to 3) and is excluded from the PCA.

The data is sourced from: https://archive.ics.uci.edu/dataset/236/seeds

Step 1: Loading necessary libraries

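library(readr)       # File reading utilities (read.table() below is base R)
library(ggplot2)     # Plotting
library(gridExtra)   # Arranging several plots on one page
library(caret)       # preProcess() for centering and scaling
library(corrplot)    # Correlation matrix visualization
library(GGally)      # ggpairs() pairwise plots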

Step 2: Dataset

data <- read.table("seeds_dataset.txt", header = FALSE)


colnames(data) <- c("Area", "Perimeter", "Compactness", 
                    "Length", "Width", "Asymmetry", "Groove", "Class")

str(data)

## 'data.frame':    210 obs. of  8 variables:
##  $ Area       : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ Perimeter  : num  14.8 14.6 14.1 13.9 15 ...
##  $ Compactness: num  0.871 0.881 0.905 0.895 0.903 ...
##  $ Length     : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ Width      : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ Asymmetry  : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ Groove     : num  5.22 4.96 4.83 4.8 5.17 ...
##  $ Class      : int  1 1 1 1 1 1 1 1 1 1 ...

summary(data)

##       Area         Perimeter      Compactness         Length     
##  Min.   :10.59   Min.   :12.41   Min.   :0.8081   Min.   :4.899  
##  1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569   1st Qu.:5.262  
##  Median :14.36   Median :14.32   Median :0.8734   Median :5.524  
##  Mean   :14.85   Mean   :14.56   Mean   :0.8710   Mean   :5.629  
##  3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878   3rd Qu.:5.980  
##  Max.   :21.18   Max.   :17.25   Max.   :0.9183   Max.   :6.675  
##      Width         Asymmetry          Groove          Class  
##  Min.   :2.630   Min.   :0.7651   Min.   :4.519   Min.   :1  
##  1st Qu.:2.944   1st Qu.:2.5615   1st Qu.:5.045   1st Qu.:1  
##  Median :3.237   Median :3.5990   Median :5.223   Median :2  
##  Mean   :3.259   Mean   :3.7002   Mean   :5.408   Mean   :2  
##  3rd Qu.:3.562   3rd Qu.:4.7687   3rd Qu.:5.877   3rd Qu.:3  
##  Max.   :4.033   Max.   :8.4560   Max.   :6.550   Max.   :3

head(data)

##    Area Perimeter Compactness Length Width Asymmetry Groove Class
## 1 15.26     14.84      0.8710  5.763 3.312     2.221  5.220     1
## 2 14.88     14.57      0.8811  5.554 3.333     1.018  4.956     1
## 3 14.29     14.09      0.9050  5.291 3.337     2.699  4.825     1
## 4 13.84     13.94      0.8955  5.324 3.379     2.259  4.805     1
## 5 16.14     14.99      0.9034  5.658 3.562     1.355  5.175     1
## 6 14.38     14.21      0.8951  5.386 3.312     2.462  4.956     1

Step 3: Histograms of each variable

plots <- list()
for (col_name in colnames(data)[-8]) {  # Excluding the 'Class' column
  # .data[[col_name]] replaces the deprecated aes_string();
  # bins = 30 adapts to each variable's range, unlike a fixed binwidth of 0.1
  p <- ggplot(data, aes(x = .data[[col_name]])) +
    geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
    theme_minimal() +
    labs(title = paste("Histogram of", col_name),
         x = col_name, y = "Frequency") +
    theme(plot.title = element_text(hjust = 0.5))
  plots[[col_name]] <- p
}

do.call(grid.arrange, c(plots, ncol = 3))

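Histograms reveal the distribution of each variable. Because PCA is built on variances and covariances, strong skewness or outliers in a variable can influence the components, so these distributions should be kept in mind when interpreting the PCA results.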

Step 4: Data Standardization

Data standardization is a crucial preprocessing step in dimensionality reduction, especially when applying techniques like Principal Component Analysis (PCA). Since PCA is sensitive to the scale of variables, standardizing the data ensures that all features contribute equally to the analysis.

After standardization, the dataset was checked to ensure that:

The mean of each variable was approximately 0
The standard deviation of each variable was 1

preproc <- preProcess(data[, -8], method = c("center", "scale"))  # Excluding the 'Class' column
data_standardized <- predict(preproc, data[, -8])  
apply(data_standardized, 2, function(x) c(mean = mean(x), sd = sd(x)))

##              Area    Perimeter Compactness        Length         Width
## mean 2.851041e-16 1.142919e-16 1.23281e-15 -9.485633e-17 -3.082893e-16
## sd   1.000000e+00 1.000000e+00 1.00000e+00  1.000000e+00  1.000000e+00
##          Asymmetry        Groove
## mean -7.657359e-17 -1.130477e-16
## sd    1.000000e+00  1.000000e+00
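
The same centering and scaling can be cross-checked in base R with scale(). The sketch below is illustrative only: it uses a small synthetic data frame (demo, a hypothetical stand-in for the kernel measurements) rather than the seeds file, so the numbers are not results from this analysis.

```r
# Synthetic stand-in for the measurement columns (not the real seeds data)
set.seed(42)
demo <- data.frame(
  Area      = rnorm(50, mean = 15, sd = 3),
  Perimeter = rnorm(50, mean = 14.5, sd = 1.3),
  Width     = rnorm(50, mean = 3.3, sd = 0.4)
)

# scale() subtracts each column mean and divides by the column's sample sd
demo_std <- as.data.frame(scale(demo, center = TRUE, scale = TRUE))

# Check: means are zero up to floating-point error, standard deviations are 1
col_means <- colMeans(demo_std)
col_sds   <- apply(demo_std, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Means on the order of 1e-16, as in the output above, are floating-point rounding error rather than a real departure from zero.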

Step 5: Correlation Matrix

The correlation matrix was visualized with corrplot, with the Pearson coefficient printed in each cell, to provide a clear representation of variable relationships.

A correlation matrix provides insights into relationships between variables. It helps identify pairs of variables that are strongly correlated, which can lead to redundancy in the data and influence the results of dimensionality reduction techniques like PCA.

cor_matrix <- cor(data_standardized, method = "pearson")

corrplot(cor_matrix, method = "circle", type = "upper", 
         tl.col = "black", tl.cex = 0.8, addCoef.col = "black",
         title = "Correlation Matrix for Seeds Dataset", mar = c(0, 0, 1, 0))

Upon examining the correlation matrix for the wheat kernel dataset, several strong correlations between variables were observed. For instance:

Variables such as Length of Kernel and Perimeter showed a high degree of correlation, which is expected given the physical properties of wheat kernels.
Other pairs, like Area and Width of Kernel, also exhibited substantial correlation.
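
Observations like these can also be extracted programmatically instead of being read off the plot. The following is a minimal sketch, assuming an arbitrary cutoff of |r| > 0.9; the cutoff and the helper name high_cor_pairs are illustrative choices rather than part of the original analysis, and the demonstration uses synthetic data.

```r
# Hypothetical helper: list variable pairs whose absolute correlation exceeds a cutoff
high_cor_pairs <- function(cor_mat, cutoff = 0.9) {
  # Keep only the upper triangle so each pair is reported once
  idx <- which(upper.tri(cor_mat) & abs(cor_mat) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
}

# Synthetic example: x2 is almost a copy of x1, x3 is unrelated
set.seed(1)
x1  <- rnorm(500)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(500, sd = 0.1), x3 = rnorm(500))
pairs_found <- high_cor_pairs(cor(toy))
print(pairs_found)
```

Applied to cor_matrix from this step, the same helper would return the strongly correlated kernel measurements as a table.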

Rationale for Retaining All Variables

Despite the presence of high correlations, all variables were retained in the analysis. The decision was based on the following considerations:

Small Number of Variables:

The dataset includes a limited number of features (seven in total). Removing variables based on correlation thresholds would further reduce dimensionality, potentially omitting valuable information.

Avoiding Arbitrary Thresholds:

While it is common to remove variables exceeding a certain correlation coefficient to reduce redundancy, setting such thresholds can be arbitrary and context-dependent. Given the exploratory nature of this project, retaining all variables allows the PCA to determine which features contribute the most to variance without manual intervention.

Not a Methodological Error:

Keeping variables with high correlation is not inherently a mistake, especially in PCA. Highly correlated variables may contribute to the same principal components, which can still reveal meaningful patterns in the data.
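
The point that correlated variables load on the same component can be illustrated with a small simulation. This is a sketch on synthetic data (not the seeds measurements), in which x2 is constructed as a noisy copy of x1 and x3 is independent; all names are hypothetical.

```r
# Synthetic illustration (not the seeds data): x2 is nearly a copy of x1
set.seed(7)
n   <- 2000
x1  <- rnorm(n)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.1), x3 = rnorm(n))

# PCA on standardized variables
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings on PC1: x1 and x2 receive almost identical weights, x3 is near zero
pc1_loadings <- toy_pca$rotation[, "PC1"]
print(round(pc1_loadings, 3))

# Share of variance per component: PC1 absorbs the two redundant variables
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(var_share, 3))
```

The redundant pair collapses into a single component without either variable having to be removed by hand.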

Best Practices in Handling High Correlation

In practice, removing variables with high correlation is typically done to:

Reduce Redundancy:

By removing redundant variables, the dataset becomes simpler, and the results of PCA or other analyses are often easier to interpret.

Highlight Unique Contributions:

Eliminating highly correlated variables can help emphasize features that contribute independent information, potentially leading to more insightful results.

While this is a common practice, the choice to retain all variables in this analysis ensures that the PCA process itself determines the importance of features, avoiding the risk of prematurely discarding potentially informative variables.

Step 6: Pairwise Relationships

ggpairs(data_standardized)

Diagonal: Density plots showing the distribution of each variable (the ggpairs default for continuous variables).
Upper triangle: Correlation coefficients indicating the strength and direction of linear relationships.
Lower triangle: Scatterplots illustrating potential linear or nonlinear relationships.

Step 7: Principal Component Analysis (PCA)

# Computing PCA; centering and scaling are disabled because the data were already standardized in Step 4
pca <- prcomp(data_standardized, center = FALSE, scale. = FALSE)

# PCA rotation
pca$rotation

##                    PC1         PC2         PC3         PC4         PC5
## Area        -0.4444735 -0.02656355  0.02587094  0.19363997 -0.20441167
## Perimeter   -0.4415715 -0.08400282 -0.05983912  0.29545659 -0.17427591
## Compactness -0.2770174  0.52915125  0.62969178 -0.33281640  0.33265481
## Length      -0.4235633 -0.20597518 -0.21187966  0.26340659  0.76609839
## Width       -0.4328187  0.11668963  0.21648338  0.19963039 -0.46536555
## Asymmetry    0.1186925 -0.71688203  0.67950584  0.09246481  0.03625822
## Groove      -0.3871608 -0.37719327 -0.21389720 -0.80414995 -0.11134657
##                     PC6          PC7
## Area        -0.42643686 -0.734805689
## Perimeter   -0.47623853  0.670751532
## Compactness -0.14162884  0.072552703
## Length       0.27357647 -0.046276051
## Width        0.70301171  0.039289079
## Asymmetry   -0.01964186  0.003723456
## Groove       0.04282974  0.034498098

# Eigenvalues
eigenvalues <- pca$sdev^2
eigenvalues

## [1] 5.0312011860 1.1975728470 0.6780034386 0.0683644770 0.0187136090
## [6] 0.0053320457 0.0008123968

explained_variance <- eigenvalues / sum(eigenvalues)

cumulative_variance <- cumsum(explained_variance)

pca_summary <- data.frame(
  Component = seq_along(eigenvalues),
  Eigenvalue = eigenvalues,
  "Explained Variance (%)" = explained_variance * 100,
  "Cumulative Variance (%)" = cumulative_variance * 100,
  check.names = FALSE  # keep the column names exactly as written
)

#  Summary of PCA results
print(pca_summary)

##   Component   Eigenvalue Explained Variance (%) Cumulative Variance (%)
## 1         1 5.0312011860            71.87430266                71.87430
## 2         2 1.1975728470            17.10818353                88.98249
## 3         3 0.6780034386             9.68576341                98.66825
## 4         4 0.0683644770             0.97663539                99.64488
## 5         5 0.0187136090             0.26733727                99.91222
## 6         6 0.0053320457             0.07617208                99.98839
## 7         7 0.0008123968             0.01160567               100.00000

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while preserving as much variability as possible. By transforming the original variables into a new set of orthogonal (uncorrelated) variables called principal components, PCA helps identify patterns and structures that may not be obvious in the original feature space.

Purpose of PCA

Dimensionality Reduction:

Reduce the number of variables in the dataset while retaining most of the information.

Simplify the dataset for visualization and analysis.

Variance Explanation:

Identify the key dimensions (principal components) that explain the most variance in the data. By focusing on these components, PCA also reduces the influence of noise and less relevant features.

Feature Interpretation:

Understand which features contribute most to the variability in the data.

What Are Eigenvalues, and Why Are They Used in PCA?

Eigenvalues are fundamental to the mathematics of Principal Component Analysis (PCA). They quantify the amount of variance captured by each principal component and help determine the importance of each component in explaining the dataset’s structure.
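
The link between eigenvalues and component variances can be checked directly. The sketch below uses synthetic data (a stand-in, not the seeds measurements; all variable names are hypothetical) to show that the squared standard deviations returned by prcomp() on standardized data match the eigenvalues of the correlation matrix, and that those eigenvalues sum to the number of variables.

```r
# Synthetic correlated data (a stand-in, not the seeds measurements)
set.seed(123)
n      <- 300
shared <- rnorm(n)
toy <- data.frame(
  v1 = shared + rnorm(n, sd = 0.3),
  v2 = shared + rnorm(n, sd = 0.5),
  v3 = rnorm(n),
  v4 = rnorm(n)
)

# Eigen-decomposition of the correlation matrix
eig <- eigen(cor(toy))

# PCA on standardized data; squared standard deviations are the component variances
toy_pca <- prcomp(scale(toy), center = FALSE, scale. = FALSE)
pca_var <- toy_pca$sdev^2

print(eig$values)
print(pca_var)
print(sum(eig$values))  # equals the number of standardized variables
```

This is why the seven eigenvalues reported above add up to 7: each standardized variable contributes one unit of variance to the total.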

Interpretation of PCA results:

The first principal component (PC1) explains 71.87% of the variance.
The second principal component (PC2) explains an additional 17.11% of the variance.

Together, PC1 and PC2 explain 88.98% of the variance, which supports reducing the data to two dimensions. PC3 adds a further 9.69%, while the remaining four components together account for only about 1.3%, as shown by their low eigenvalues.

Step 8: Kaiser Rule

What Is the Kaiser Rule?

The Kaiser Rule is a commonly used criterion in Principal Component Analysis (PCA) for deciding the number of principal components (PCs) to retain. It is based on the eigenvalues associated with each principal component.

The Kaiser Rule states:

Retain all principal components with eigenvalues greater than 1.

The rationale behind this rule is that:

An eigenvalue of 1 represents the variance of a single standardized variable. Thus, any component with an eigenvalue greater than 1 explains more variance than an individual variable and is considered significant.
Components with eigenvalues less than 1 contribute less variance than a single variable and are typically deemed less meaningful.

Limitations of the Kaiser Rule

Over-Retention:

In some datasets, the Kaiser Rule may retain too many components, especially when the eigenvalues drop slowly after the first few components.

Under-Retention:

For datasets with subtle patterns, the Kaiser Rule might discard components that, while having eigenvalues slightly below 1, capture meaningful variance.

components_kaiser <- sum(eigenvalues > 1)
components_kaiser

## [1] 2

According to the Kaiser rule, two components should be retained, as they have eigenvalues > 1.
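
Given the limitations noted above, the Kaiser result can be compared with a cumulative-variance criterion. This is a minimal sketch using the eigenvalues reported in Step 7; the 80% and 95% targets and the helper name components_for are illustrative assumptions, not a fixed standard.

```r
# Eigenvalues copied from the PCA output reported in Step 7
eigenvalues <- c(5.0312011860, 1.1975728470, 0.6780034386, 0.0683644770,
                 0.0187136090, 0.0053320457, 0.0008123968)

cumulative_variance <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of components whose cumulative share reaches a target
# (hypothetical helper; the targets used below are illustrative assumptions)
components_for <- function(target) which(cumulative_variance >= target)[1]

k_80 <- components_for(0.80)
k_95 <- components_for(0.95)
print(c(k_80 = k_80, k_95 = k_95))
```

With these eigenvalues an 80% target keeps two components, in agreement with the Kaiser rule, while a stricter 95% target would also keep PC3.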

Step 9: Scree Plot

if (!requireNamespace("factoextra", quietly = TRUE)) {
  install.packages("factoextra")
}
library(factoextra)

fviz_eig(pca, addlabels = TRUE, ylim = c(0, 100)) +
  ggtitle("Scree Plot: Percentage of Variance Explained by Each Component")

Summary of PCA results: The PCA results indicate that the first two components capture the majority of the variance in the dataset. This dimensionality reduction is effective and allows for visualization and further analysis in a 2D space.
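
The 2D coordinates referred to here are the scores on the first two components. The sketch below shows, on synthetic data with a hypothetical three-level grouping (not the seeds data), how those scores are obtained from a prcomp object and that they equal the standardized data projected onto the loadings.

```r
# Synthetic stand-in with a hypothetical three-level grouping (not the seeds data)
set.seed(2025)
n     <- 90
group <- rep(1:3, each = n / 3)
toy <- data.frame(
  size1 = group + rnorm(n, sd = 0.5),
  size2 = group + rnorm(n, sd = 0.5),
  shape = rnorm(n)
)

toy_std <- scale(toy)
toy_pca <- prcomp(toy_std, center = FALSE, scale. = FALSE)

# Scores on the first two components: the coordinates used for a 2D plot
scores_2d <- toy_pca$x[, 1:2]

# The same coordinates obtained by projecting the data onto the loadings
scores_manual <- toy_std %*% toy_pca$rotation[, 1:2]

print(head(scores_2d))
print(tapply(scores_2d[, "PC1"], group, mean))  # group means along PC1
```

In this project the equivalent object would be pca$x[, 1:2], which could be plotted and colored by the Class column to see how the wheat varieties separate.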

Step 10: Components Analysis

# Visualizing the quality of representation (cos²) of individual observations in the PCA space
fviz_pca_ind(pca, col.ind = "cos2", geom = "point", 
             gradient.cols = c("blue", "purple", "red"), 
             title = "Representation Quality of Observations") +
  theme_minimal()

Points with higher cos² values (closer to red) are better represented in the two-dimensional space. Observations with low cos² values (closer to blue) are less well represented, meaning that much of their variation lies along components other than the first two.
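
For reference, cos² can be computed by hand from the PCA scores. The following sketch uses synthetic data (not the seeds measurements) and is meant to illustrate the definition, the squared coordinate on a component divided by the squared distance from the origin in the full space, rather than to reproduce factoextra's internals.

```r
# Synthetic stand-in data (not the seeds measurements)
set.seed(99)
toy <- as.data.frame(matrix(rnorm(100 * 4), ncol = 4))
toy$V2 <- toy$V1 + rnorm(100, sd = 0.4)  # introduce some correlation

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)
scores  <- toy_pca$x

# cos2 per observation and component:
# squared coordinate / squared distance from the origin in the full space
cos2 <- scores^2 / rowSums(scores^2)

# Quality of representation on the PC1-PC2 plane
cos2_plane <- rowSums(cos2[, 1:2])

print(summary(cos2_plane))
print(range(rowSums(cos2)))  # across all components each row sums to 1
```

A value near 1 means the observation lies almost entirely in the plotted plane, while a value near 0 means its position in the plot says little about where it sits in the full space.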

# Visualizing the contribution of variables to the principal components
fviz_pca_var(pca, col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, 
             title = "Contribution of Variables to PCA Components") +
  theme_minimal()

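Variables whose arrows reach closer to the edge of the correlation circle are better represented by the first two components and contribute more to them. Variables whose arrows point in similar directions are positively correlated and contribute to the components in similar ways.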

# Visualizing the contributions of variables specifically to the first two dimensions
Dim1 <- fviz_contrib(pca, choice = "var", axes = 1, 
                     title = "Contribution of Variables to Dim1")
Dim2 <- fviz_contrib(pca, choice = "var", axes = 2, 
                     title = "Contribution of Variables to Dim2")


gridExtra::grid.arrange(Dim1, Dim2, ncol = 2)

For Dim1, the variables with the highest contributions are Area, Perimeter, Width, Length and Groove.
For Dim2, the variables with the highest contributions are Asymmetry and Compactness.

These variables are the key drivers of variance in their respective components and should be analyzed further for their relationships.
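
These contributions can be recovered from the loadings reported in Step 7, since a variable's contribution to a component is its squared loading expressed as a percentage. The sketch below copies the PC1 and PC2 columns of pca$rotation and compares each contribution with the level expected if all seven variables contributed equally; the variable names introduced here are illustrative.

```r
# PC1 and PC2 loadings copied from the pca$rotation output reported in Step 7
vars <- c("Area", "Perimeter", "Compactness", "Length", "Width", "Asymmetry", "Groove")
pc1 <- c(-0.4444735, -0.4415715, -0.2770174, -0.4235633, -0.4328187,
          0.1186925, -0.3871608)
pc2 <- c(-0.02656355, -0.08400282, 0.52915125, -0.20597518, 0.11668963,
         -0.71688203, -0.37719327)

# Squared loadings, rescaled to percentages that sum to 100 per component
contrib_pc1 <- setNames(100 * pc1^2 / sum(pc1^2), vars)
contrib_pc2 <- setNames(100 * pc2^2 / sum(pc2^2), vars)

# Reference level: the contribution expected if all variables contributed equally
expected <- 100 / length(vars)

print(round(sort(contrib_pc1, decreasing = TRUE), 2))
print(round(sort(contrib_pc2, decreasing = TRUE), 2))

above_pc1 <- names(contrib_pc1)[contrib_pc1 > expected]
above_pc2 <- names(contrib_pc2)[contrib_pc2 > expected]
print(above_pc1)
print(above_pc2)
```

The variables exceeding the equal-share level of about 14.3% match the lists above; Groove falls just below that level on Dim2.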

Step 11: Conclusion and Summary

Dimensionality reduction is a crucial technique for simplifying datasets, uncovering patterns, and improving the interpretability of complex data. This project focused on the application of Principal Component Analysis (PCA) to reduce the dimensionality of a dataset containing measurements of wheat kernels. The analysis provided valuable insights into the structure of the data and demonstrated the utility of PCA in data exploration and visualization.

PCA effectively reduced the dataset to two meaningful dimensions while retaining most of the information. This dimensionality reduction allows for simplified visualization, analysis, and potential clustering of observations based on key characteristics. The study underscores the power of PCA in identifying and summarizing patterns within high-dimensional datasets.

Key Findings

Reduction in Dimensionality:

PCA successfully reduced the original dataset from seven variables to a smaller number of principal components while retaining a significant proportion of the variance.
Based on the Kaiser Rule and the scree plot, two components were retained, together explaining 88.98% of the variance.

Patterns Identified:

The first principal component (PC1) was dominated by size-related variables such as Area, Perimeter, Length of Kernel and Width of Kernel, indicating that these features are strongly correlated and explain much of the variance.
The second principal component (PC2) highlighted shape-related traits, mainly Asymmetry Coefficient and Compactness, providing a complementary perspective.

Final Thoughts:

Dimensionality reduction is particularly valuable when a dataset contains highly correlated variables, as was the case here.
PCA not only reduces complexity but also enhances the interpretability of the data, making it an essential tool for exploratory data analysis.

Challenges and Limitations

Retention of Meaningful Variance:

While PCA preserved a significant portion of the variance, some minor patterns might have been discarded along with the less significant components.

Interpretability:

Although PCA simplifies the data, interpreting the principal components requires careful consideration of variable loadings and domain expertise.

Assumptions of PCA:

PCA assumes linear relationships among variables and may not perform well if non-linear patterns are dominant.

Dimension Reduction

Filip Bancerz

2025-01-15