Introduction

This project focuses on data analysis using dimensionality reduction through Principal Component Analysis (PCA). The goal of the project is to reduce the number of variables in the dataset while preserving as much information about the data’s distribution as possible. Based on a dataset containing various properties of wheat kernels, PCA identifies key variables that contribute the most to the variance in the data.

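The analysis methods applied include histograms of each variable, data standardization, a correlation matrix, pairwise plots, PCA with eigenvalue analysis, the Kaiser rule, a scree plot, and an analysis of how variables and observations relate to the components.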

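The dataset used in this project contains measurements of wheat kernels, widely studied for classification and pattern recognition tasks. The dataset includes the following features: Area, Perimeter, Compactness, Length (of the kernel), Width (of the kernel), Asymmetry (asymmetry coefficient), and Groove (length of the kernel groove), plus a Class label with three levels (1 to 3).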

These variables describe the kernels’ physical characteristics in detail, making the dataset well suited to PCA.

The data is sourced from: https://archive.ics.uci.edu/dataset/236/seeds

Step 1: Loading necessary libraries

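library(readr)      # file readers (read.table() below is base R and does not need it)
library(ggplot2)    # plotting
library(gridExtra)  # arranging several plots in one grid
library(caret)      # preProcess() for centering and scaling
library(corrplot)   # correlation matrix visualization
library(GGally)     # ggpairs() pairwise plots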

Step 2: Dataset

data <- read.table("seeds_dataset.txt", header = FALSE)


colnames(data) <- c("Area", "Perimeter", "Compactness", 
                    "Length", "Width", "Asymmetry", "Groove", "Class")

str(data)       
## 'data.frame':    210 obs. of  8 variables:
##  $ Area       : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ Perimeter  : num  14.8 14.6 14.1 13.9 15 ...
##  $ Compactness: num  0.871 0.881 0.905 0.895 0.903 ...
##  $ Length     : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ Width      : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ Asymmetry  : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ Groove     : num  5.22 4.96 4.83 4.8 5.17 ...
##  $ Class      : int  1 1 1 1 1 1 1 1 1 1 ...
summary(data)  
##       Area         Perimeter      Compactness         Length     
##  Min.   :10.59   Min.   :12.41   Min.   :0.8081   Min.   :4.899  
##  1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569   1st Qu.:5.262  
##  Median :14.36   Median :14.32   Median :0.8734   Median :5.524  
##  Mean   :14.85   Mean   :14.56   Mean   :0.8710   Mean   :5.629  
##  3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878   3rd Qu.:5.980  
##  Max.   :21.18   Max.   :17.25   Max.   :0.9183   Max.   :6.675  
##      Width         Asymmetry          Groove          Class  
##  Min.   :2.630   Min.   :0.7651   Min.   :4.519   Min.   :1  
##  1st Qu.:2.944   1st Qu.:2.5615   1st Qu.:5.045   1st Qu.:1  
##  Median :3.237   Median :3.5990   Median :5.223   Median :2  
##  Mean   :3.259   Mean   :3.7002   Mean   :5.408   Mean   :2  
##  3rd Qu.:3.562   3rd Qu.:4.7687   3rd Qu.:5.877   3rd Qu.:3  
##  Max.   :4.033   Max.   :8.4560   Max.   :6.550   Max.   :3
head(data)      
##    Area Perimeter Compactness Length Width Asymmetry Groove Class
## 1 15.26     14.84      0.8710  5.763 3.312     2.221  5.220     1
## 2 14.88     14.57      0.8811  5.554 3.333     1.018  4.956     1
## 3 14.29     14.09      0.9050  5.291 3.337     2.699  4.825     1
## 4 13.84     13.94      0.8955  5.324 3.379     2.259  4.805     1
## 5 16.14     14.99      0.9034  5.658 3.562     1.355  5.175     1
## 6 14.38     14.21      0.8951  5.386 3.312     2.462  4.956     1

Step 3: Histograms of each variable

# Build one histogram per variable, excluding the 'Class' column.
# bins = 30 is used instead of a fixed binwidth because the variables span
# very different ranges (Compactness covers only about 0.81 to 0.92).
plots <- lapply(colnames(data)[-8], function(col_name) {
  ggplot(data, aes(x = .data[[col_name]])) +
    geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
    theme_minimal() +
    labs(title = paste("Histogram of", col_name),
         x = col_name, y = "Frequency") +
    theme(plot.title = element_text(hjust = 0.5))
})

do.call(grid.arrange, c(plots, ncol = 3))

Histograms reveal the distribution of each variable. Distributions should be taken into account when interpreting PCA results.

Step 4: Data Standardization

Data standardization is a crucial preprocessing step in dimensionality reduction, especially when applying techniques like Principal Component Analysis (PCA). Since PCA is sensitive to the scale of variables, standardizing the data ensures that all features contribute equally to the analysis.

After standardization, the dataset was checked to ensure that each variable has a mean of approximately 0 and a standard deviation of 1:

preproc <- preProcess(data[, -8], method = c("center", "scale"))  # Excluding the 'Class' column
data_standardized <- predict(preproc, data[, -8])  
apply(data_standardized, 2, function(x) c(mean = mean(x), sd = sd(x)))
##              Area    Perimeter Compactness        Length         Width
## mean 2.851041e-16 1.142919e-16 1.23281e-15 -9.485633e-17 -3.082893e-16
## sd   1.000000e+00 1.000000e+00 1.00000e+00  1.000000e+00  1.000000e+00
##          Asymmetry        Groove
## mean -7.657359e-17 -1.130477e-16
## sd    1.000000e+00  1.000000e+00
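
The same check can be reproduced in base R without caret. The following is a minimal sketch on a small made-up data frame (`toy` and its columns are hypothetical and are not the seeds data); it compares a manual z-score with `scale()`, which subtracts each column's mean and divides by its sample standard deviation.

```r
# Hypothetical toy data standing in for the kernel measurements
toy <- data.frame(
  a = c(10.2, 12.5, 14.1, 15.8, 18.3),
  b = c(0.81, 0.85, 0.87, 0.88, 0.91)
)

# Manual z-score, column by column: (x - mean) / sd
z_manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Base R equivalent
z_scale <- scale(toy, center = TRUE, scale = TRUE)

# The two versions agree; column means are ~0 and standard deviations are 1
print(all.equal(z_manual, z_scale, check.attributes = FALSE))
print(round(colMeans(z_scale), 10))
print(apply(z_scale, 2, sd))
```

Means such as `2.85e-16` in the output above are zero up to floating-point error, which is why the check is best read as "approximately 0".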

Step 5: Correlation Matrix

The correlation matrix was visualized with corrplot, using circles and printed coefficients in the upper triangle, to provide a clear representation of variable relationships.

A correlation matrix provides insights into relationships between variables. It helps identify pairs of variables that are strongly correlated, which can lead to redundancy in the data and influence the results of dimensionality reduction techniques like PCA.

cor_matrix <- cor(data_standardized, method = "pearson")

corrplot(cor_matrix, method = "circle", type = "upper", 
         tl.col = "black", tl.cex = 0.8, addCoef.col = "black",
         title = "Correlation Matrix for Seeds Dataset", mar = c(0, 0, 1, 0))

Upon examining the correlation matrix for the wheat kernel dataset, several strong correlations between variables were observed. For instance:
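
Pairs above a chosen cutoff can be listed directly from a correlation matrix. The sketch below is illustrative only: it uses R's built-in `iris` measurements as a stand-in for the seeds data, and the cutoff of 0.8 is an assumption rather than a value taken from this analysis. With the objects defined above, `cor_matrix` would take the place of `cm`.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset
cm <- cor(iris[, 1:4], method = "pearson")

threshold <- 0.8  # assumed cutoff for a "strong" correlation

# Look only at the upper triangle so each pair is reported once
idx <- which(upper.tri(cm) & abs(cm) > threshold, arr.ind = TRUE)

strong <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest pairs first
strong <- strong[order(-abs(strong$r)), ]
print(strong)
```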

Rationale for Retaining All Variables

Despite the presence of high correlations, all variables were retained in the analysis. The decision was based on the following considerations:

The dataset includes a limited number of features (seven in total). Removing variables based on correlation thresholds would further reduce dimensionality, potentially omitting valuable information.

While it is common to remove variables exceeding a certain correlation coefficient to reduce redundancy, setting such thresholds can be arbitrary and context-dependent. Given the exploratory nature of this project, retaining all variables allows the PCA to determine which features contribute the most to variance without manual intervention.

Keeping variables with high correlation is not inherently a mistake, especially in PCA. Highly correlated variables may contribute to the same principal components, which can still reveal meaningful patterns in the data.

Best Practices in Handling High Correlation

In practice, variables with high correlation are typically removed for two reasons:

By removing redundant variables, the dataset becomes simpler, and the results of PCA or other analyses are often easier to interpret.

Eliminating highly correlated variables can help emphasize features that contribute independent information, potentially leading to more insightful results.

While this is a common practice, the choice to retain all variables in this analysis ensures that the PCA process itself determines the importance of features, avoiding the risk of prematurely discarding potentially informative variables.

Step 6: Pairwise Relationships

ggpairs(data_standardized)

Diagonal: Density plots (the ggpairs default for continuous variables) showing the distribution of each variable. Upper triangle: Correlation coefficients indicating the strength and direction of linear relationships. Lower triangle: Scatterplots illustrating potential linear or nonlinear relationships.

Step 7: Principal Component Analysis (PCA)

# Computing PCA without re-centering or re-scaling, because the data were already standardized in Step 4
pca <- prcomp(data_standardized, center = FALSE, scale. = FALSE)

# PCA rotation
pca$rotation
##                    PC1         PC2         PC3         PC4         PC5
## Area        -0.4444735 -0.02656355  0.02587094  0.19363997 -0.20441167
## Perimeter   -0.4415715 -0.08400282 -0.05983912  0.29545659 -0.17427591
## Compactness -0.2770174  0.52915125  0.62969178 -0.33281640  0.33265481
## Length      -0.4235633 -0.20597518 -0.21187966  0.26340659  0.76609839
## Width       -0.4328187  0.11668963  0.21648338  0.19963039 -0.46536555
## Asymmetry    0.1186925 -0.71688203  0.67950584  0.09246481  0.03625822
## Groove      -0.3871608 -0.37719327 -0.21389720 -0.80414995 -0.11134657
##                     PC6          PC7
## Area        -0.42643686 -0.734805689
## Perimeter   -0.47623853  0.670751532
## Compactness -0.14162884  0.072552703
## Length       0.27357647 -0.046276051
## Width        0.70301171  0.039289079
## Asymmetry   -0.01964186  0.003723456
## Groove       0.04282974  0.034498098
# Eigenvalues
eigenvalues <- pca$sdev^2
eigenvalues
## [1] 5.0312011860 1.1975728470 0.6780034386 0.0683644770 0.0187136090
## [6] 0.0053320457 0.0008123968
explained_variance <- eigenvalues / sum(eigenvalues)

cumulative_variance <- cumsum(explained_variance)

pca_summary <- data.frame(
  Component = 1:length(eigenvalues),
  Eigenvalue = eigenvalues,
  "Explained Variance (%)" = explained_variance * 100,
  "Cumulative Variance (%)" = cumulative_variance * 100
)

#  Summary of PCA results
print(pca_summary)
##   Component   Eigenvalue Explained.Variance.... Cumulative.Variance....
## 1         1 5.0312011860            71.87430266                71.87430
## 2         2 1.1975728470            17.10818353                88.98249
## 3         3 0.6780034386             9.68576341                98.66825
## 4         4 0.0683644770             0.97663539                99.64488
## 5         5 0.0187136090             0.26733727                99.91222
## 6         6 0.0053320457             0.07617208                99.98839
## 7         7 0.0008123968             0.01160567               100.00000

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while preserving as much variability as possible. By transforming the original variables into a new set of orthogonal (uncorrelated) variables called principal components, PCA helps identify patterns and structures that may not be obvious in the original feature space.

Purpose of PCA

Reduce the number of variables in the dataset while retaining most of the information.

Simplify the dataset for visualization and analysis.

Identify the key dimensions (principal components) that explain the most variance in the data. By focusing on these components, PCA also reduces the influence of noise and less relevant features.

Understand which features contribute most to the variability in the data.

What Are Eigenvalues, and Why Are They Used in PCA?

Eigenvalues are fundamental to the mathematics of Principal Component Analysis (PCA). They quantify the amount of variance captured by each principal component and help determine the importance of each component in explaining the dataset’s structure.
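
For standardized data, these eigenvalues are the eigenvalues of the correlation matrix, and they equal the squared standard deviations that `prcomp()` reports. A minimal sketch of that relationship follows, using the built-in `iris` measurements as stand-in data (an assumption; the names `X`, `eig`, and `pca_toy` are illustrative).

```r
# Stand-in data, standardized to mean 0 and standard deviation 1
X <- scale(iris[, 1:4])

# Eigen-decomposition of the correlation matrix
eig <- eigen(cor(X))

# PCA on the already standardized data
pca_toy <- prcomp(X, center = FALSE, scale. = FALSE)

# The eigenvalues match the component variances (sdev^2)
print(all.equal(eig$values, pca_toy$sdev^2))

# They sum to the number of variables, since each standardized variable has variance 1
print(sum(eig$values))
```

This is also why the seven eigenvalues reported above sum to 7.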

Interpretation of PCA results:

Together, PC1 and PC2 explain 88.98% of the variance, which is enough for a useful two-dimensional summary of the data. PC3 adds a further 9.69%, so it is not negligible, while components 4 to 7 together account for only about 1.3% of the variance, as shown by their low eigenvalues.

Step 8: Kaiser Rule

What Is the Kaiser Rule?

The Kaiser Rule is a commonly used criterion in Principal Component Analysis (PCA) for deciding the number of principal components (PCs) to retain. It is based on the eigenvalues associated with each principal component.

The Kaiser Rule states that only principal components with eigenvalues greater than 1 should be retained.

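The rationale behind this rule is that, for standardized data, each original variable has a variance of 1, so a component with an eigenvalue below 1 explains less variance than a single original variable.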

Limitations of the Kaiser Rule

In some datasets, the Kaiser Rule may retain too many components, especially when the eigenvalues drop slowly after the first few components.

For datasets with subtle patterns, the Kaiser Rule might discard components that, while having eigenvalues slightly below 1, capture meaningful variance.

components_kaiser <- sum(eigenvalues > 1)
components_kaiser
## [1] 2

According to the Kaiser rule, two components should be retained, as they have eigenvalues > 1.
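
The Kaiser rule can be compared with a cumulative-variance criterion. The sketch below re-enters the eigenvalues printed in Step 7 by hand; the 95% target is an assumed threshold chosen for illustration, not one used elsewhere in this analysis.

```r
# Eigenvalues copied from the PCA output in Step 7
eig_vals <- c(5.0312011860, 1.1975728470, 0.6780034386, 0.0683644770,
              0.0187136090, 0.0053320457, 0.0008123968)

# Cumulative proportion of variance explained
cum_var <- cumsum(eig_vals) / sum(eig_vals)

# Kaiser rule: count eigenvalues greater than 1
n_kaiser <- sum(eig_vals > 1)

# Alternative: smallest number of components reaching an assumed 95% target
target <- 0.95
n_target <- which(cum_var >= target)[1]

print(c(kaiser = n_kaiser, cumulative = n_target))
```

The two criteria disagree here: the third component has an eigenvalue of about 0.68, so the Kaiser rule drops it, yet it is needed to pass 95% of the variance. This is an instance of the second limitation noted above.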

Step 9: Scree Plot

if (!requireNamespace("factoextra", quietly = TRUE)) {
  install.packages("factoextra")
}
library(factoextra)

fviz_eig(pca, addlabels = TRUE, ylim = c(0, 100)) +
  ggtitle("Scree Plot: Percentage of Variance Explained by Each Component")

Summary of PCA results: The PCA results indicate that the first two components capture the majority of the variance in the dataset. This dimensionality reduction is effective and allows for visualization and further analysis in a 2D space.
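
Working in the 2D space means keeping only the first two columns of the component scores. A minimal sketch of that projection is shown here on the built-in `iris` measurements (stand-in data; `X`, `pca_toy`, and `scores_2d` are illustrative names). For standardized data, the scores are the data matrix multiplied by the rotation matrix.

```r
# Stand-in data, standardized
X <- scale(iris[, 1:4])
pca_toy <- prcomp(X, center = FALSE, scale. = FALSE)

# Project the observations onto the first two principal components
scores_2d <- X %*% pca_toy$rotation[, 1:2]

# The result matches the scores stored by prcomp()
print(all.equal(scores_2d, pca_toy$x[, 1:2], check.attributes = FALSE))
print(dim(scores_2d))
```

In the analysis above, the equivalent object would be `pca$x[, 1:2]`, which could then be plotted and coloured by `data$Class`.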

Step 10: Components Analysis

# Visualizing the quality of representation (cos²) of individual observations in the PCA space
fviz_pca_ind(pca, col.ind = "cos2", geom = "point", 
             gradient.cols = c("blue", "purple", "red"), 
             title = "Representation Quality of Observations") +
  theme_minimal()

Points with higher cos² values (closer to red) are better represented in the two-dimensional space. Observations with low cos² values (closer to blue) are poorly represented by the first two components: much of their distance from the origin lies along the remaining components, so their position in this plot should be interpreted with caution.
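
The cos² of an observation on the first two components can be computed by hand as the share of its squared distance from the origin that falls in that plane. The sketch below illustrates the calculation on the built-in `iris` measurements (stand-in data with illustrative names); it is intended to mirror the quantity plotted above, not to reproduce factoextra's internal code.

```r
# Stand-in data, standardized
X <- scale(iris[, 1:4])
pca_toy <- prcomp(X, center = FALSE, scale. = FALSE)
scores <- pca_toy$x

# Squared distance captured by PC1 and PC2, divided by the total squared distance
cos2_2d <- rowSums(scores[, 1:2]^2) / rowSums(scores^2)

# Values near 1 indicate observations lying almost entirely in the PC1-PC2 plane
print(summary(cos2_2d))
```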

# Visualizing the contribution of variables to the principal components
fviz_pca_var(pca, col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, 
             title = "Contribution of Variables to PCA Components") +
  theme_minimal()

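Variables whose arrows reach closer to the circumference of the correlation circle are better represented by the first two components, and the color scale shows how much each contributes to them. Variables whose arrows point in similar directions are positively correlated and contribute to the components in a similar way.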

# Visualizing the contributions of variables specifically to the first two dimensions
Dim1 <- fviz_contrib(pca, choice = "var", axes = 1, 
                     title = "Contribution of Variables to Dim1")
Dim2 <- fviz_contrib(pca, choice = "var", axes = 2, 
                     title = "Contribution of Variables to Dim2")


gridExtra::grid.arrange(Dim1, Dim2, ncol = 2)

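Variables whose bars rise above the red dashed reference line, which marks the contribution expected if all seven variables contributed equally (100/7 ≈ 14.3%), are the key drivers of variance in their respective components and should be analyzed further for their relationships.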
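
The contributions to a component can also be derived from its loadings: because each column of the rotation matrix has unit length, a variable's percentage contribution is 100 times its squared loading. The sketch below re-enters the PC1 loadings printed in Step 7 by hand and compares them with the equal-contribution baseline.

```r
# PC1 loadings copied from pca$rotation in Step 7
pc1 <- c(Area = -0.4444735, Perimeter = -0.4415715, Compactness = -0.2770174,
         Length = -0.4235633, Width = -0.4328187, Asymmetry = 0.1186925,
         Groove = -0.3871608)

# Percentage contribution of each variable to PC1
contrib_pc1 <- 100 * pc1^2 / sum(pc1^2)

# Contribution expected if all variables contributed equally
expected <- 100 / length(pc1)

print(sort(round(contrib_pc1, 2), decreasing = TRUE))

# Variables contributing more than the equal-share baseline
above_avg <- names(contrib_pc1)[contrib_pc1 > expected]
print(above_avg)
```

For PC1 this selects Area, Perimeter, Length, Width, and Groove, while Compactness and Asymmetry fall below the baseline.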

Step 11: Conclusion and Summary

Dimensionality reduction is a crucial technique for simplifying datasets, uncovering patterns, and improving the interpretability of complex data. This project focused on the application of Principal Component Analysis (PCA) to reduce the dimensionality of a dataset containing measurements of wheat kernels. The analysis provided valuable insights into the structure of the data and demonstrated the utility of PCA in data exploration and visualization.

PCA effectively reduced the dataset to two meaningful dimensions while retaining most of the information. This dimensionality reduction allows for simplified visualization, analysis, and potential clustering of observations based on key characteristics. The study underscores the power of PCA in identifying and summarizing patterns within high-dimensional datasets.

Key Findings

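Reduction in Dimensionality: The seven original variables were reduced to two principal components, which together explain 88.98% of the total variance.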

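Patterns Identified: PC1 loads with similar weight (about 0.39 to 0.44 in absolute value) on Area, Perimeter, Length, Width, and Groove, so it can be read as a measure of overall kernel size. PC2 is dominated by Asymmetry (loading -0.72) and Compactness (loading 0.53).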

Final Thoughts:

Challenges and Limitations

While PCA preserved a significant portion of the variance, some minor patterns might have been discarded along with the less significant components.

Although PCA simplifies the data, interpreting the principal components requires careful consideration of variable loadings and domain expertise.

PCA assumes linear relationships among variables and may not perform well if non-linear patterns are dominant.