This project focuses on data analysis using dimensionality reduction through Principal Component Analysis (PCA). The goal of the project is to reduce the number of variables in the dataset while preserving as much information about the data’s distribution as possible. Based on a dataset containing various properties of wheat kernels, PCA identifies key variables that contribute the most to the variance in the data.
The analysis methods applied include:
Data Standardization – All variables were standardized to have a mean of 0 and a standard deviation of 1, so that each variable has an equal influence on the PCA results.
Correlation Analysis – The relationships between variables were examined to assess which ones are strongly correlated and which may be redundant.
PCA (Principal Component Analysis) – This technique was used to reduce the dimensions of the data while retaining as much variance as possible. Based on eigenvalues and the proportion of explained variance, the optimal number of components to retain was determined.
PCA Results Visualization – Plots such as biplots and bar charts were used to interpret which variables contribute most to each principal component.
The dataset used in this project contains measurements of wheat kernels, widely studied for classification and pattern recognition tasks. The dataset includes the following features:
Area: The surface area of the wheat kernel.
Perimeter: The total length around the kernel.
Compactness: A derived shape metric, defined in the UCI dataset description as C = 4πA/P², where A is the area and P the perimeter.
Length of Kernel: The kernel’s longest dimension.
Width of Kernel: The kernel’s widest dimension.
Asymmetry Coefficient: A measure of the kernel’s symmetry.
Length of Kernel Groove: The length of the central groove in the kernel.
These variables provide comprehensive information about the kernels’ physical characteristics, making the dataset ideal for PCA analysis.
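Because Compactness is derived from two other columns, it can be recomputed as a sanity check. The sketch below assumes the UCI definition C = 4πA/P² and uses only the six rows printed by head(data) in this report; the data frame name `kernels` and the added columns are illustrative.

```r
# First six rows as printed by head(data) in this report
kernels <- data.frame(
  Area        = c(15.26, 14.88, 14.29, 13.84, 16.14, 14.38),
  Perimeter   = c(14.84, 14.57, 14.09, 13.94, 14.99, 14.21),
  Compactness = c(0.8710, 0.8811, 0.9050, 0.8955, 0.9034, 0.8951)
)

# Assumed definition (UCI dataset description): C = 4 * pi * A / P^2
kernels$Recomputed <- 4 * pi * kernels$Area / kernels$Perimeter^2

# Differences should be small; Perimeter is only printed to two decimals
kernels$Difference <- kernels$Recomputed - kernels$Compactness
round(kernels, 4)
```

The small residual differences are consistent with rounding in the printed values, which suggests that Compactness carries little information beyond what Area and Perimeter already provide.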
The data is sourced from: https://archive.ics.uci.edu/dataset/236/seeds
library(readr)
library(ggplot2)
library(gridExtra)
library(caret)
library(corrplot)
library(GGally)
data <- read.table("seeds_dataset.txt", header = FALSE)
colnames(data) <- c("Area", "Perimeter", "Compactness",
"Length", "Width", "Asymmetry", "Groove", "Class")
str(data)
## 'data.frame': 210 obs. of 8 variables:
## $ Area : num 15.3 14.9 14.3 13.8 16.1 ...
## $ Perimeter : num 14.8 14.6 14.1 13.9 15 ...
## $ Compactness: num 0.871 0.881 0.905 0.895 0.903 ...
## $ Length : num 5.76 5.55 5.29 5.32 5.66 ...
## $ Width : num 3.31 3.33 3.34 3.38 3.56 ...
## $ Asymmetry : num 2.22 1.02 2.7 2.26 1.35 ...
## $ Groove : num 5.22 4.96 4.83 4.8 5.17 ...
## $ Class : int 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
## Area Perimeter Compactness Length
## Min. :10.59 Min. :12.41 Min. :0.8081 Min. :4.899
## 1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569 1st Qu.:5.262
## Median :14.36 Median :14.32 Median :0.8734 Median :5.524
## Mean :14.85 Mean :14.56 Mean :0.8710 Mean :5.629
## 3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878 3rd Qu.:5.980
## Max. :21.18 Max. :17.25 Max. :0.9183 Max. :6.675
## Width Asymmetry Groove Class
## Min. :2.630 Min. :0.7651 Min. :4.519 Min. :1
## 1st Qu.:2.944 1st Qu.:2.5615 1st Qu.:5.045 1st Qu.:1
## Median :3.237 Median :3.5990 Median :5.223 Median :2
## Mean :3.259 Mean :3.7002 Mean :5.408 Mean :2
## 3rd Qu.:3.562 3rd Qu.:4.7687 3rd Qu.:5.877 3rd Qu.:3
## Max. :4.033 Max. :8.4560 Max. :6.550 Max. :3
head(data)
## Area Perimeter Compactness Length Width Asymmetry Groove Class
## 1 15.26 14.84 0.8710 5.763 3.312 2.221 5.220 1
## 2 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1
## 3 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 1
## 4 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 1
## 5 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 1
## 6 14.38 14.21 0.8951 5.386 3.312 2.462 4.956 1
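# Feature columns only; the 'Class' label is excluded
feature_names <- setdiff(colnames(data), "Class")
# lapply gives each plot its own col_name, which aes() looks up when the plot is drawn
plots <- lapply(feature_names, function(col_name) {
  # bins = 30 adapts to each variable's range; a fixed binwidth of 0.1 would
  # squeeze Compactness (range 0.81 to 0.92) into one or two bars
  ggplot(data, aes(x = .data[[col_name]])) +
    geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
    theme_minimal() +
    labs(title = paste("Histogram of", col_name),
         x = col_name, y = "Frequency") +
    theme(plot.title = element_text(hjust = 0.5))
})
do.call(grid.arrange, c(plots, ncol = 3))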
Histograms reveal the distribution of each variable. Distributions should be taken into account when interpreting PCA results.
Data standardization is a crucial preprocessing step in dimensionality reduction, especially when applying techniques like Principal Component Analysis (PCA). Since PCA is sensitive to the scale of variables, standardizing the data ensures that all features contribute equally to the analysis.
After standardization, the dataset was checked to ensure that:
The mean of each variable was approximately 0
The standard deviation of each variable was 1
preproc <- preProcess(data[, -8], method = c("center", "scale")) # Excluding the 'Class' column
data_standardized <- predict(preproc, data[, -8])
apply(data_standardized, 2, function(x) c(mean = mean(x), sd = sd(x)))
## Area Perimeter Compactness Length Width
## mean 2.851041e-16 1.142919e-16 1.23281e-15 -9.485633e-17 -3.082893e-16
## sd 1.000000e+00 1.000000e+00 1.00000e+00 1.000000e+00 1.000000e+00
## Asymmetry Groove
## mean -7.657359e-17 -1.130477e-16
## sd 1.000000e+00 1.000000e+00
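The standardization step can also be reproduced without caret. The following is a minimal sketch, using two columns from the first five rows printed by head(data), that compares a manual z-score with base R's scale(); the names `toy`, `z_manual`, and `z_scale` are illustrative, and the result should match what preProcess with "center" and "scale" produces.

```r
# Two features on very different scales (first five rows from head(data))
toy <- data.frame(
  Area        = c(15.26, 14.88, 14.29, 13.84, 16.14),
  Compactness = c(0.8710, 0.8811, 0.9050, 0.8955, 0.9034)
)

# Manual z-score: subtract the column mean, divide by the sample standard deviation
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Base R's scale() performs the same centering and scaling and returns a matrix
z_scale <- scale(toy)

# Both approaches agree, and each column now has mean 0 and sd 1
all.equal(as.matrix(z_manual), z_scale, check.attributes = FALSE)
colMeans(z_manual)
sapply(z_manual, sd)
```

Means on the order of 1e-16, as in the output above, are floating-point rounding rather than a real deviation from zero.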
A correlation matrix provides insight into the relationships between variables. It helps identify pairs of variables that are strongly correlated, which signals redundancy in the data and shapes the results of dimensionality reduction techniques like PCA.
The correlation matrix was visualized with corrplot, which draws a circle for each pair of variables (sized and colored by the strength of the correlation) and overlays the coefficient.
cor_matrix <- cor(data_standardized, method = "pearson")
corrplot(cor_matrix, method = "circle", type = "upper",
tl.col = "black", tl.cex = 0.8, addCoef.col = "black",
title = "Correlation Matrix for Seeds Dataset", mar = c(0, 0, 1, 0))
Upon examining the correlation matrix for the wheat kernel dataset, several strong correlations between variables were observed. For instance:
Variables such as Length of Kernel and Perimeter showed a high degree of correlation, which is expected given the physical properties of wheat kernels.
Other pairs, like Area and Width of Kernel, also exhibited substantial correlation.
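Strongly correlated pairs can also be listed programmatically rather than read off the plot. The sketch below uses simulated toy data, not the seeds measurements, and a cutoff of 0.8 that is purely an assumption for illustration; applied to the report's data, cor_matrix would take the place of `cor_toy`.

```r
# Simulated stand-in for the standardized features (hypothetical values)
set.seed(1)
x <- rnorm(100)
toy <- data.frame(
  a = x,
  b = x + rnorm(100, sd = 0.1),  # nearly a copy of a
  c = rnorm(100)                 # independent of both
)

cor_toy <- cor(toy)

# Look only at the upper triangle so that each pair is reported once
idx <- which(upper.tri(cor_toy) & abs(cor_toy) > 0.8, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_toy)[idx[, "row"]],
  var2 = colnames(cor_toy)[idx[, "col"]],
  r    = cor_toy[idx]
)
strong_pairs
```

Only the pair (a, b) is expected to pass the cutoff here, since c is generated independently.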
Rationale for Retaining All Variables
Despite the presence of high correlations, all variables were retained in the analysis. The decision was based on the following considerations:
The dataset includes a limited number of features (seven in total). Removing variables based on correlation thresholds would further reduce dimensionality, potentially omitting valuable information.
While it is common to remove variables exceeding a certain correlation coefficient to reduce redundancy, setting such thresholds can be arbitrary and context-dependent. Given the exploratory nature of this project, retaining all variables allows the PCA to determine which features contribute the most to variance without manual intervention.
Keeping variables with high correlation is not inherently a mistake, especially in PCA. Highly correlated variables may contribute to the same principal components, which can still reveal meaningful patterns in the data.
Best Practices in Handling High Correlation
In practice, variables with high correlation are typically removed for two reasons:
To simplify interpretation: once redundant variables are removed, the dataset becomes simpler, and the results of PCA or other analyses are often easier to interpret.
To focus on independent information: eliminating highly correlated variables can help emphasize features that contribute independent information, potentially leading to more insightful results.
While this is a common practice, the choice to retain all variables in this analysis ensures that the PCA process itself determines the importance of features, avoiding the risk of prematurely discarding potentially informative variables.
ggpairs(data_standardized)
Diagonal: Density plots showing the distribution of each variable (the ggpairs default for continuous data). Upper triangle: Correlation coefficients indicating the strength and direction of linear relationships. Lower triangle: Scatterplots illustrating potential linear or nonlinear relationships.
# Computing PCA; the data are already standardized, so no further centering or scaling is applied
pca <- prcomp(data_standardized, center = FALSE, scale. = FALSE)
# PCA rotation
pca$rotation
## PC1 PC2 PC3 PC4 PC5
## Area -0.4444735 -0.02656355 0.02587094 0.19363997 -0.20441167
## Perimeter -0.4415715 -0.08400282 -0.05983912 0.29545659 -0.17427591
## Compactness -0.2770174 0.52915125 0.62969178 -0.33281640 0.33265481
## Length -0.4235633 -0.20597518 -0.21187966 0.26340659 0.76609839
## Width -0.4328187 0.11668963 0.21648338 0.19963039 -0.46536555
## Asymmetry 0.1186925 -0.71688203 0.67950584 0.09246481 0.03625822
## Groove -0.3871608 -0.37719327 -0.21389720 -0.80414995 -0.11134657
## PC6 PC7
## Area -0.42643686 -0.734805689
## Perimeter -0.47623853 0.670751532
## Compactness -0.14162884 0.072552703
## Length 0.27357647 -0.046276051
## Width 0.70301171 0.039289079
## Asymmetry -0.01964186 0.003723456
## Groove 0.04282974 0.034498098
# Eigenvalues
eigenvalues <- pca$sdev^2
eigenvalues
## [1] 5.0312011860 1.1975728470 0.6780034386 0.0683644770 0.0187136090
## [6] 0.0053320457 0.0008123968
explained_variance <- eigenvalues / sum(eigenvalues)
cumulative_variance <- cumsum(explained_variance)
pca_summary <- data.frame(
Component = 1:length(eigenvalues),
Eigenvalue = eigenvalues,
"Explained Variance (%)" = explained_variance * 100,
"Cumulative Variance (%)" = cumulative_variance * 100
)
# Summary of PCA results
print(pca_summary)
## Component Eigenvalue Explained.Variance.... Cumulative.Variance....
## 1 1 5.0312011860 71.87430266 71.87430
## 2 2 1.1975728470 17.10818353 88.98249
## 3 3 0.6780034386 9.68576341 98.66825
## 4 4 0.0683644770 0.97663539 99.64488
## 5 5 0.0187136090 0.26733727 99.91222
## 6 6 0.0053320457 0.07617208 99.98839
## 7 7 0.0008123968 0.01160567 100.00000
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while preserving as much variability as possible. By transforming the original variables into a new set of orthogonal (uncorrelated) variables called principal components, PCA helps identify patterns and structures that may not be obvious in the original feature space.
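The orthogonality described above can be checked numerically. This is a minimal sketch on R's built-in iris measurements, used here only as a stand-in for the seeds data; `pca_demo` is an illustrative name.

```r
# Built-in iris measurements as a stand-in (not the seeds data)
X <- scale(iris[, 1:4])
pca_demo <- prcomp(X, center = FALSE, scale. = FALSE)

# The loading vectors form an orthonormal basis: t(R) %*% R is the identity matrix
round(crossprod(pca_demo$rotation), 10)

# Scores on different components are uncorrelated: off-diagonal entries are 0
round(cor(pca_demo$x), 10)
```

Both printed matrices should show ones on the diagonal and zeros elsewhere, up to rounding.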
Purpose of PCA
Reduce the number of variables in the dataset while retaining most of the information.
Simplify the dataset for visualization and analysis.
Identify the key dimensions (principal components) that explain the most variance in the data. By focusing on the components that explain the most variance, PCA also reduces the influence of noise and less relevant features.
Understand which features contribute most to the variability in the data.
What Are Eigenvalues, and Why Are They Used in PCA?
Eigenvalues are fundamental to the mathematics of Principal Component Analysis (PCA). They quantify the amount of variance captured by each principal component and help determine the importance of each component in explaining the dataset’s structure.
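To make the link between eigenvalues and component variance concrete, the sketch below compares the squared standard deviations returned by prcomp with the eigenvalues of the correlation matrix. It again uses the built-in iris data as a stand-in, and the variable names are illustrative.

```r
# Built-in iris measurements as a stand-in (not the seeds data)
X <- scale(iris[, 1:4])
pca_demo <- prcomp(X, center = FALSE, scale. = FALSE)

# Variance of each principal component
eig_from_pca <- pca_demo$sdev^2

# Eigenvalues of the correlation matrix of the original variables
eig_from_cor <- eigen(cor(iris[, 1:4]))$values

# The two agree up to floating-point error
all.equal(eig_from_pca, eig_from_cor)

# With standardized data the eigenvalues sum to the number of variables
sum(eig_from_pca)
```

The same property holds in this report: the seven eigenvalues printed above sum to 7, so each eigenvalue divided by 7 gives the share of variance explained by its component.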
Interpretation of PCA results:
The first principal component (PC1) explains 71.87% of the variance.
The second principal component (PC2) explains an additional 17.11% of the variance.
Together, PC1 and PC2 explain 88.98% of the variance, which is sufficient for a two-dimensional summary of the data. PC3 adds a further 9.69% (98.67% cumulative), while components 4 through 7 together account for only about 1.3%, as their small eigenvalues indicate.
What Is the Kaiser Rule?
The Kaiser Rule is a commonly used criterion in Principal Component Analysis (PCA) for deciding the number of principal components (PCs) to retain. It is based on the eigenvalues associated with each principal component.
The Kaiser Rule states that only principal components with eigenvalues greater than 1 should be retained.
The rationale behind this rule is that:
An eigenvalue of 1 represents the variance of a single standardized variable. Thus, any component with an eigenvalue greater than 1 explains more variance than an individual variable and is considered significant.
Components with eigenvalues less than 1 contribute less variance than a single variable and are typically deemed less meaningful.
Limitations of the Kaiser Rule
In some datasets, the Kaiser Rule may retain too many components, especially when the eigenvalues drop slowly after the first few components.
For datasets with subtle patterns, the Kaiser Rule might discard components that, while having eigenvalues slightly below 1, capture meaningful variance.
components_kaiser <- sum(eigenvalues > 1)
components_kaiser
## [1] 2
According to the Kaiser rule, two components should be retained, as they have eigenvalues > 1.
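Because the Kaiser Rule is only one heuristic, it can be useful to compare it with a cumulative-variance criterion. The sketch below reuses the eigenvalues printed earlier in this report; the 95% threshold and the names `ev_reported`, `n_kaiser`, and `n_cumvar` are assumptions chosen for illustration, not part of the original analysis.

```r
# Eigenvalues as printed earlier in this report
ev_reported <- c(5.0312011860, 1.1975728470, 0.6780034386, 0.0683644770,
                 0.0187136090, 0.0053320457, 0.0008123968)

# Kaiser Rule: count components with eigenvalue greater than 1
n_kaiser <- sum(ev_reported > 1)

# Alternative: smallest number of components reaching 95% cumulative variance
# (the 95% threshold is an illustrative assumption)
cumulative <- cumsum(ev_reported) / sum(ev_reported)
n_cumvar <- which(cumulative >= 0.95)[1]

c(kaiser = n_kaiser, cumulative_95 = n_cumvar)
```

Here the Kaiser Rule keeps two components while the 95% criterion keeps three, because PC3 has an eigenvalue below 1 (0.678) yet still explains close to 10% of the variance. This is an instance of the second limitation listed above.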
if (!requireNamespace("factoextra", quietly = TRUE)) {
  install.packages("factoextra")
}
library(factoextra)
# fviz_eig plots the percentage of explained variance by default
fviz_eig(pca, addlabels = TRUE, ylim = c(0, 100)) +
  ggtitle("Scree Plot: Explained Variance of Principal Components")
Summary of PCA results: The PCA results indicate that the first two components capture the majority of the variance in the dataset. This dimensionality reduction is effective and allows for visualization and further analysis in a 2D space.
# Visualizing the quality of representation (cos²) of individual observations in the PCA space
fviz_pca_ind(pca, col.ind = "cos2", geom = "point",
gradient.cols = c("blue", "purple", "red"),
title = "Representation Quality of Observations") +
theme_minimal()
Points with higher cos² values (closer to red) are well represented in the plane of the first two components. Observations with low cos² values (closer to blue) are poorly represented there: much of their variation lies along components that the plot does not show, so their positions should be interpreted with caution.
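For readers who want to see where cos² comes from, here is a minimal base R sketch using the built-in iris data as a stand-in. For each observation, cos² on a component is its squared score on that component divided by its squared distance from the origin. The names below are illustrative; factoextra's get_pca_ind() should expose the same quantities for the report's pca object.

```r
# Built-in iris measurements as a stand-in (not the seeds data)
X <- scale(iris[, 1:4])
pca_demo <- prcomp(X, center = FALSE, scale. = FALSE)
scores <- pca_demo$x

# cos2 per observation and component: squared score / squared distance to origin
cos2_all <- scores^2 / rowSums(scores^2)

# Quality of representation in the PC1-PC2 plane
cos2_plane <- rowSums(cos2_all[, 1:2])

# Across all components the cos2 values of an observation sum to 1
range(rowSums(cos2_all))
summary(cos2_plane)
```

A value of `cos2_plane` near 1 means the observation lies almost entirely in the plotted plane; a value near 0 means the plot hides most of its variation.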
# Visualizing the contribution of variables to the principal components
fviz_pca_var(pca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE,
title = "Contribution of Variables to PCA Components") +
theme_minimal()
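Arrows that reach closer to the circumference of the correlation circle belong to variables that are well represented by the first two components, and the color scale shows how much each variable contributes to them. Arrows pointing in similar directions indicate positively correlated variables with similar contributions.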
# Visualizing the contributions of variables specifically to the first two dimensions
Dim1 <- fviz_contrib(pca, choice = "var", axes = 1,
title = "Contribution of Variables to Dim1")
Dim2 <- fviz_contrib(pca, choice = "var", axes = 2,
title = "Contribution of Variables to Dim2")
gridExtra::grid.arrange(Dim1, Dim2, ncol = 2)
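Variables whose bars rise above the dashed reference line, which marks the expected average contribution (100/7 ≈ 14.3%), are the key drivers of variance in the respective component. Based on the loadings above, these are Area, Perimeter, Width, Length, and Groove for Dim1, and Asymmetry and Compactness for Dim2.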
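The contributions can be recomputed by hand from the loadings, since a variable's contribution to a component is its squared loading expressed as a percentage. The sketch below copies the PC1 and PC2 columns from the pca$rotation output printed earlier; the names `loadings`, `contrib`, and `expected` are illustrative, and the comparison against the average 100/7 mirrors the reference line that fviz_contrib draws.

```r
# PC1 and PC2 loadings as printed by pca$rotation in this report
loadings <- data.frame(
  row.names = c("Area", "Perimeter", "Compactness", "Length",
                "Width", "Asymmetry", "Groove"),
  PC1 = c(-0.4444735, -0.4415715, -0.2770174, -0.4235633,
          -0.4328187, 0.1186925, -0.3871608),
  PC2 = c(-0.02656355, -0.08400282, 0.52915125, -0.20597518,
          0.11668963, -0.71688203, -0.37719327)
)

# Contribution (%) = 100 * squared loading; each column sums to about 100
contrib <- 100 * loadings^2
round(contrib, 2)

# Expected contribution if all seven variables contributed equally
expected <- 100 / nrow(loadings)

# Variables above the average for each component
rownames(contrib)[contrib$PC1 > expected]
rownames(contrib)[contrib$PC2 > expected]
```

Groove sits very close to the average on PC2, so whether it counts as a key driver of that component is a borderline call.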
Dimensionality reduction is a crucial technique for simplifying datasets, uncovering patterns, and improving the interpretability of complex data. This project focused on the application of Principal Component Analysis (PCA) to reduce the dimensionality of a dataset containing measurements of wheat kernels. The analysis provided valuable insights into the structure of the data and demonstrated the utility of PCA in data exploration and visualization.
PCA effectively reduced the dataset to two meaningful dimensions while retaining most of the information. This dimensionality reduction allows for simplified visualization, analysis, and potential clustering of observations based on key characteristics. The study underscores the power of PCA in identifying and summarizing patterns within high-dimensional datasets.
Key Findings
Reduction in Dimensionality:
PCA successfully reduced the original dataset from seven variables to a smaller number of principal components while retaining a significant proportion of the variance.
Using the Kaiser Rule, two components were retained; together they explain 88.98% of the variance.
Patterns Identified:
The first principal component (PC1) was dominated by size-related variables: Area, Perimeter, Width of Kernel, and Length of Kernel all have loadings of similar magnitude (about 0.42 to 0.44), indicating that these features are strongly correlated and explain much of the variance.
The second principal component (PC2) highlighted shape-related traits, chiefly Asymmetry Coefficient (loading -0.72) and Compactness (loading 0.53), providing a complementary perspective.
Final Thoughts:
Dimensionality reduction is a crucial step in data preprocessing, especially when dealing with datasets that contain highly correlated variables.
PCA not only reduces complexity but also enhances the interpretability of the data, making it an essential tool for exploratory data analysis.
Challenges and Limitations
While PCA preserved a significant portion of the variance, some minor patterns might have been discarded along with the less significant components.
Although PCA simplifies the data, interpreting the principal components requires careful consideration of variable loadings and domain expertise.
PCA assumes linear relationships among variables and may not perform well if non-linear patterns are dominant.
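The first limitation, information lost with the discarded components, can be quantified as a reconstruction error. The following sketch uses the built-in iris data as a stand-in and keeps k = 2 components, matching the number retained in this report; all names are illustrative.

```r
# Built-in iris measurements as a stand-in (not the seeds data)
X <- scale(iris[, 1:4])
pca_demo <- prcomp(X, center = FALSE, scale. = FALSE)

# Rebuild the standardized data from the first k components only
k <- 2
X_hat <- pca_demo$x[, 1:k] %*% t(pca_demo$rotation[, 1:k])

# Share of the total sum of squares lost in the reconstruction
lost <- sum((X - X_hat)^2) / sum(X^2)

# Share of variance retained by the first k components
retained <- sum(pca_demo$sdev[1:k]^2) / sum(pca_demo$sdev^2)

# The two numbers printed here are equal
c(lost = lost, one_minus_retained = 1 - retained)
```

For the seeds data, the same reasoning implies that a two-component representation discards about 11% of the total variance (100% minus the 88.98% reported above).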