Data consulting & education
Pills_R.
Training Material
[website] https://agustincastro.es
[RPubs] https://rpubs.com/acastro
[GitHub-R] https://github.com/acastromartinez/GITHUB---R
In this practice, we will work with the Principal Component Analysis (PCA) technique to reduce dimensionality in datasets with a large number of variables. PCA simplifies the analysis and enhances the visualization and interpretation of results by transforming the original variables into a new set of principal components. These components retain the maximum amount of information possible using fewer dimensions. The first principal component explains the most variance in the data, while each subsequent component captures the maximum remaining variance without overlapping the information already explained by the previous components.
PCA is particularly valuable for large datasets, as it allows for significant simplification of the analysis and effective visualization without sacrificing relevant information.
We will use the dataset biopsy from the MASS package, one of the most popular and widely used in statistical analysis. It was developed by the group of Venables and Ripley and accompanies the book Modern Applied Statistics with S.
The dataset biopsy contains information for the classification of breast tumors as benign or malignant based on characteristics obtained from fine needle aspiration biopsy images of breast masses. The data includes both quantitative features of the cells and the final tumor classification. It is a small, manageable dataset that offers a variety of features for applying multiple modeling techniques and is ideal for teaching and experimenting with data analysis and machine learning. Additionally, it is well-documented and supported in the MASS package for R.
The data were obtained from the University of Wisconsin Hospitals, Madison (Dr. Wolberg), and are based on the evaluation of 699 breast tumor biopsies. Each of the nine attributes V1-V9 is scored on a scale from 1 to 10. The two-level classification, benign or malignant, is also known.
## To cite the MASS package in publications use:
##
## Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with
## S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
##
## A BibTeX entry for LaTeX users is
##
## @Book{,
## title = {Modern Applied Statistics with S},
## author = {W. N. Venables and B. D. Ripley},
## publisher = {Springer},
## edition = {Fourth},
## address = {New York},
## year = {2002},
## note = {ISBN 0-387-95457-0},
## url = {https://www.stats.ox.ac.uk/pub/MASS4/},
## }
## [1] 699 11
## 'data.frame': 699 obs. of 11 variables:
## $ ID : chr "1000025" "1002945" "1015425" "1016277" ...
## $ V1 : int 5 5 3 6 4 8 1 2 2 4 ...
## $ V2 : int 1 4 1 8 1 10 1 1 1 2 ...
## $ V3 : int 1 4 1 8 1 10 1 2 1 1 ...
## $ V4 : int 1 5 1 1 3 8 1 1 1 1 ...
## $ V5 : int 2 7 2 3 2 7 2 2 2 2 ...
## $ V6 : int 1 10 2 4 1 10 10 1 1 1 ...
## $ V7 : int 3 3 3 3 3 9 3 3 1 2 ...
## $ V8 : int 1 2 1 7 1 7 1 1 1 1 ...
## $ V9 : int 1 1 1 1 1 1 1 1 5 1 ...
## $ class: Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
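The output above can be reproduced with a few lines of base R. This is a minimal sketch that assumes only that the MASS package is installed (it ships with standard R distributions); the exact calls used by the author are not shown in the text.

```r
# Load MASS and attach the biopsy dataset to the workspace
library(MASS)
data(biopsy)

citation("MASS")  # prints how to cite the package
dim(biopsy)       # number of rows and columns
str(biopsy)       # type and first values of each column
```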
We check for missing data in the dataset and find that there are 16 NAs in the variable V6. The grid.table function is not included by default in R. It requires the installation of the gridExtra library.
## ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class
## 0 0 0 0 0 0 16 0 0 0 0
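One simple way to obtain this count of missing values per column is sketched below. The name na_per_column is our own choice for illustration and does not come from the original code.

```r
library(MASS)
data(biopsy)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(biopsy))
na_per_column
```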
We remove the rows that contain missing data in variable V6. The final dimension of biopsy is 683 rows x 11 columns (vars).
## [1] 0
## [1] 683 11
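The removal step can be sketched as follows, assuming we simply keep the rows where V6 is not missing (na.omit(biopsy) would be an equivalent alternative here, since V6 is the only column with NAs).

```r
library(MASS)
data(biopsy)

# Keep only the rows with a recorded value for V6
biopsy <- biopsy[!is.na(biopsy$V6), ]

sum(is.na(biopsy))  # remaining missing values
dim(biopsy)         # dimensions after removal
```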
There are samples with duplicate IDs but with different measured values for the various variables. Therefore, we do not remove any of these duplicate IDs, as they contain different information.
# we check if there are duplicate ID samples in the dataset
sum(duplicated(biopsy$ID)) # there are 53 duplicate IDs
## [1] 53
## [1] "1033078" "1070935" "1143978" "1171710" "1173347" "1174057" "1212422"
## [8] "1218860" "1017023" "1100524" "1116116" "1168736" "1182404" "1182404"
## [15] "1198641" "1182404" "1198641" "320675" "704097" "493452" "560680"
## [22] "1114570" "1158247" "1276091" "1276091" "1276091" "1293439" "734111"
## [29] "1182404" "1276091" "1105524" "1115293" "1182404" "1320077" "769612"
## [36] "798429" "1116192" "1240603" "1299924" "1321942" "385103" "411453"
## [43] "822829" "1061990" "1238777" "1277792" "1299596" "1339781" "1354840"
## [50] "466906" "654546" "695091" "897471"
duplicated <- biopsy$ID[duplicated(biopsy$ID)]
table(duplicated) # number of times each ID is repeated
## duplicated
## 1017023 1033078 1061990 1070935 1100524 1105524 1114570 1115293 1116116 1116192
## 1 1 1 1 1 1 1 1 1 1
## 1143978 1158247 1168736 1171710 1173347 1174057 1182404 1198641 1212422 1218860
## 1 1 1 1 1 1 5 2 1 1
## 1238777 1240603 1276091 1277792 1293439 1299596 1299924 1320077 1321942 1339781
## 1 1 4 1 1 1 1 1 1 1
## 1354840 320675 385103 411453 466906 493452 560680 654546 695091 704097
## 1 1 1 1 1 1 1 1 1 1
## 734111 769612 798429 822829 897471
## 1 1 1 1 1
# we examine the values of the variables for the different samples with duplicate IDs
# they are different, so we do not remove duplicates.
library(dplyr) # provides %>%, filter() and arrange()
biopsy_duplicates <- biopsy %>%
  filter(ID %in% ID[duplicated(ID)]) %>%
  arrange(ID)
head(biopsy_duplicates)
To study the correlation between the variables used for tumor classification, we need to create a correlation matrix. To do this, we create a subset of biopsy that contains only the variables of interest (V1 to V9), removing the ID and class columns. We will name this new dataframe biopsy_vars.
## 'data.frame': 683 obs. of 9 variables:
## $ V1: int 5 5 3 6 4 8 1 2 2 4 ...
## $ V2: int 1 4 1 8 1 10 1 1 1 2 ...
## $ V3: int 1 4 1 8 1 10 1 2 1 1 ...
## $ V4: int 1 5 1 1 3 8 1 1 1 1 ...
## $ V5: int 2 7 2 3 2 7 2 2 2 2 ...
## $ V6: int 1 10 2 4 1 10 10 1 1 1 ...
## $ V7: int 3 3 3 3 3 9 3 3 1 2 ...
## $ V8: int 1 2 1 7 1 7 1 1 1 1 ...
## $ V9: int 1 1 1 1 1 1 1 1 5 1 ...
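A possible way to build this subset is shown below. Selecting the columns by name (rather than by position) is our own choice for readability.

```r
library(MASS)
data(biopsy)
biopsy <- biopsy[!is.na(biopsy$V6), ]  # same NA removal as above

# Keep only V1..V9, which drops both ID and class
biopsy_vars <- biopsy[, paste0("V", 1:9)]
str(biopsy_vars)
```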
Now we create the correlation matrix using R’s cor function. We will store it in biopsy_vars_cor.
## V1 V2 V3 V4 V5 V6 V7
## V1 1.0000000 0.6424815 0.6534700 0.4878287 0.5235960 0.5930914 0.5537424
## V2 0.6424815 1.0000000 0.9072282 0.7069770 0.7535440 0.6917088 0.7555592
## V3 0.6534700 0.9072282 1.0000000 0.6859481 0.7224624 0.7138775 0.7353435
## V4 0.4878287 0.7069770 0.6859481 1.0000000 0.5945478 0.6706483 0.6685671
## V5 0.5235960 0.7535440 0.7224624 0.5945478 1.0000000 0.5857161 0.6181279
## V6 0.5930914 0.6917088 0.7138775 0.6706483 0.5857161 1.0000000 0.6806149
## V7 0.5537424 0.7555592 0.7353435 0.6685671 0.6181279 0.6806149 1.0000000
## V8 0.5340659 0.7193460 0.7179634 0.6031211 0.6289264 0.5842802 0.6656015
## V9 0.3509572 0.4607547 0.4412576 0.4188983 0.4805833 0.3392104 0.3460109
## V8 V9
## V1 0.5340659 0.3509572
## V2 0.7193460 0.4607547
## V3 0.7179634 0.4412576
## V4 0.6031211 0.4188983
## V5 0.6289264 0.4805833
## V6 0.5842802 0.3392104
## V7 0.6656015 0.3460109
## V8 1.0000000 0.4337573
## V9 0.4337573 1.0000000
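The matrix above corresponds to a call like the following sketch. The last lines, which locate the most correlated pair, are an illustrative addition; top_pair is a hypothetical name.

```r
library(MASS)
data(biopsy)
biopsy_vars <- biopsy[!is.na(biopsy$V6), paste0("V", 1:9)]

# Pearson correlation between every pair of variables
biopsy_vars_cor <- cor(biopsy_vars)
round(biopsy_vars_cor, 2)

# Find the strongest off-diagonal correlation
off_diag <- biopsy_vars_cor
diag(off_diag) <- NA
top_pair <- which(off_diag == max(off_diag, na.rm = TRUE), arr.ind = TRUE)[1, ]
rownames(off_diag)[top_pair]
```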
We create the significance matrix with the p-values of the pairwise correlation tests. To do this, we use the cor.mtest function from the corrplot library (in this case, with a 0.95 confidence level) and store the result in biopsy_vars_sig.
## V1 V2 V3 V4 V5
## V1 0.000000e+00 8.964173e-81 2.064616e-84 4.027956e-42 2.411759e-49
## V2 8.964173e-81 0.000000e+00 2.567742e-258 1.544338e-104 3.535863e-126
## V3 2.064616e-84 2.567742e-258 0.000000e+00 4.146228e-96 3.061101e-111
## V4 4.027956e-42 1.544338e-104 4.146228e-96 0.000000e+00 1.627087e-66
## V5 2.411759e-49 3.535863e-126 3.061101e-111 1.627087e-66 0.000000e+00
## V6 4.050902e-66 2.402961e-98 1.807287e-107 2.058229e-90 3.828707e-64
## V7 3.880813e-56 3.184894e-127 3.567289e-117 1.153511e-89 3.199163e-73
## V8 1.260114e-51 7.433029e-110 3.018221e-109 6.883291e-69 1.734754e-76
## V9 3.148289e-21 3.398097e-37 6.531371e-34 2.125287e-30 9.266128e-41
## V6 V7 V8 V9
## V1 4.050902e-66 3.880813e-56 1.260114e-51 3.148289e-21
## V2 2.402961e-98 3.184894e-127 7.433029e-110 3.398097e-37
## V3 1.807287e-107 3.567289e-117 3.018221e-109 6.531371e-34
## V4 2.058229e-90 1.153511e-89 6.883291e-69 2.125287e-30
## V5 3.828707e-64 3.199163e-73 1.734754e-76 9.266128e-41
## V6 0.000000e+00 4.391890e-94 9.158074e-64 7.473326e-20
## V7 4.391890e-94 0.000000e+00 1.312645e-88 1.214400e-20
## V8 9.158074e-64 1.312645e-88 0.000000e+00 1.053441e-32
## V9 7.473326e-20 1.214400e-20 1.053441e-32 0.000000e+00
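With corrplot installed, the matrix above is what cor.mtest(biopsy_vars, conf.level = 0.95)$p returns. As a sketch that avoids the extra dependency, the same pairwise p-values can be computed in base R with cor.test. The name p_matrix is hypothetical, and the diagonal is set to 0 to mirror the output shown.

```r
library(MASS)
data(biopsy)
biopsy_vars <- biopsy[!is.na(biopsy$V6), paste0("V", 1:9)]

vars <- names(biopsy_vars)
p_matrix <- matrix(0, nrow = length(vars), ncol = length(vars),
                   dimnames = list(vars, vars))

# Run one Pearson correlation test per pair and fill both triangles
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i < j) {
      test <- cor.test(biopsy_vars[[i]], biopsy_vars[[j]], conf.level = 0.95)
      p_matrix[i, j] <- p_matrix[j, i] <- test$p.value
    }
  }
}

signif(p_matrix, 3)
all(p_matrix < 0.05)  # are all correlations significant at the 0.05 level?
```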
Using the corrplot function, we create a correlation plot. Here, we can see that the most correlated variables (r = 0.91) are V2 and V3 (uniformity of cell size and of cell shape, respectively). Several other pairs have r values above 0.7 (for example, V2 with V7 and V2 with V5, both about 0.75). All r values in the matrix are positive and significant at the 0.05 level (95% confidence). The corrplot function requires the installation of the corrplot library, as it is not included by default in R.
[Note] PCA helps simplify the analysis by reducing the number of variables while preserving most of the variance or relevant information in the data. This is particularly useful when the variables are highly correlated, as PCA can identify principal components that capture the underlying relationships between variables and reduce redundancy. Additionally, it makes interpretation easier by reducing the number of individual variables to consider.
library(corrplot) # provides corrplot() and cor.mtest()
corrplot(biopsy_vars_cor,
         p.mat = biopsy_vars_sig$p, sig.level = 0.05,
         method = "color",
         order = "hclust",
         type = "upper",
         diag = FALSE,
         addCoef.col = "black",
         number.cex = 0.9)
The prcomp function in R is used to perform Principal Component Analysis (PCA). In this case, center = TRUE subtracts the mean of each variable so that all variables are centered at zero, and scale. = TRUE divides each variable by its standard deviation so that all have unit variance and contribute equally to the analysis, regardless of their original scales. The argument is named scale. (with a trailing dot); writing scale = TRUE also works through R's partial argument matching.
We can use prcomp on the original dataframe biopsy (where we did not remove the ID variable of type char). In this case, it is necessary to specify the columns/variables that will be considered in the analysis:
biopsy_PCA <- prcomp(biopsy[,-c(1,11)], center = TRUE, scale. = TRUE)
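To make the effect of centering and scaling concrete, the sketch below runs the PCA and checks that, with standardized variables, the component variances match the eigenvalues of the correlation matrix. The names X and eig are hypothetical helpers introduced for illustration.

```r
library(MASS)
data(biopsy)
X <- biopsy[!is.na(biopsy$V6), paste0("V", 1:9)]

# PCA on centered, unit-variance variables
biopsy_PCA <- prcomp(X, center = TRUE, scale. = TRUE)
summary(biopsy_PCA)

# With standardized variables, PCA is the eigen-decomposition of cor(X)
eig <- eigen(cor(X))
cbind(prcomp_variance = biopsy_PCA$sdev^2, eigenvalue = eig$values)
```

The agreement of the two columns is the reason scaling matters: without it, the analysis would decompose the covariance matrix instead, and variables with larger variance would dominate.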
A Scree Plot is a graph that shows the variance explained by each principal component in PCA. The x-axis represents the principal components (sorted according to the amount of variance they explain), and the y-axis shows the variance explained by each component.
In the plot, the ‘elbow’ or inflection point is of particular importance. It is where the slope of the graph changes significantly. By observing this, we can determine the number of principal components that explain a significant amount of variance.
library(factoextra) # provides the fviz_* plotting functions
scree_plot <- fviz_screeplot(biopsy_PCA,
                             title = "Scree plot",
                             xlab = "Dimensions",
                             ylab = "% Explained variance",
                             barfill = "lightblue",
                             ylim = c(0, 70),
                             addlabels = TRUE)
scree_plot
In PCA, the first components capture the majority of the variability in the data. In this specific case, the first component explains the largest portion of the variance (65.5%), while the second explains a much smaller proportion (8.6%). Together, these two components explain 74.1% of the total variance. The large gap between the first and second components indicates that the first dominates the structure of the data we are analyzing. The second component may still be informative, which is something we will explore further as we examine the PCA results.
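The percentages in the scree plot can be derived directly from the prcomp result: the variance of each component is the square of its standard deviation, divided by the total. A minimal sketch follows; variance_table is a hypothetical name.

```r
library(MASS)
data(biopsy)
X <- biopsy[!is.na(biopsy$V6), paste0("V", 1:9)]
biopsy_PCA <- prcomp(X, center = TRUE, scale. = TRUE)

# Variance of each component and its share of the total
eigenvalues <- biopsy_PCA$sdev^2
explained <- eigenvalues / sum(eigenvalues)

variance_table <- data.frame(
  component = paste0("PC", seq_along(explained)),
  explained_pct = round(100 * explained, 1),
  cumulative_pct = round(100 * cumsum(explained), 1)
)
variance_table
```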
In the context of PCA, the contribution of the variables refers to how much each original variable contributes to a specific principal component. This concept is key to understanding how the original variables combine to form the principal components, as well as interpreting the importance of each variable in the reduced data space representation.
By taking a closer look, we might draw some conclusions.
In order, the variables that contribute the most are:
V2 (uniformity of cell size): Uniformity of cell size.
V3 (uniformity of cell shape): Uniformity of cell shape.
V7 (bland chromatin): Bland chromatin.
V5 (single epithelial cell size): Size of individual epithelial cells.
V8 (normal nucleoli): Normal nucleoli.
V6 (bare nuclei): Presence of bare nuclei.
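The red dashed line marks the expected average contribution, that is, the value each variable would have if all of them contributed equally (100/9, about 11.1%, with nine variables). Variables above this line are considered important contributors to the principal component we are analyzing.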
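The contributions shown in the plot can also be computed by hand. Because each column of the rotation matrix has unit length, the squared loadings of a component sum to 1, so multiplying them by 100 gives percentages. This is a sketch under that reasoning; contrib and threshold are hypothetical names.

```r
library(MASS)
data(biopsy)
X <- biopsy[!is.na(biopsy$V6), paste0("V", 1:9)]
biopsy_PCA <- prcomp(X, center = TRUE, scale. = TRUE)

# Contribution (%) of each variable to each component
contrib <- 100 * biopsy_PCA$rotation^2

# Reference value if all variables contributed equally
threshold <- 100 / nrow(contrib)

round(sort(contrib[, "PC1"], decreasing = TRUE), 1)
names(which(contrib[, "PC1"] > threshold))  # variables above the reference line
round(sort(contrib[, "PC2"], decreasing = TRUE), 1)
```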
The first principal component would be closely associated with characteristics related to the morphology and size of the breast tissue cells.
Furthermore, the relationship between bland chromatin (V7) and the morphological characteristics of cells, including their size and shape, is supported by scientific research in cell biology and cancer. There is evidence linking bland chromatin with breast cancer due to its influence on DNA accessibility and gene expression regulation (in this case, increased transcriptional activity). Alterations in the structure and epigenetic modifications of chromatin can facilitate the activation of genes associated with malignancy and contribute to tumor progression.
contrib_plot_dim1 <- fviz_contrib(biopsy_PCA,
                                  choice = "var",
                                  axes = 1,
                                  fill = "lightblue")
contrib_plot_dim1 +
  ggtitle("Contribution of the variables to DIM-1") +
  ylab("Contributions (%)")
In the case of the second principal component, we have a clear dominance of a single variable:
V9 (mitoses): Number of mitoses.
This second component would reflect cellular activity in terms of proliferation. Tissue samples with non-cancerous cells would show normal mitosis rates, whereas those from malignant tumors would be altered, with significantly higher proliferation values.
contrib_plot_dim2 <- fviz_contrib(biopsy_PCA,
                                  choice = "var",
                                  axes = 2,
                                  fill = "lightblue")
contrib_plot_dim2 +
  ggtitle("Contribution of the variables to DIM-2") +
  ylab("Contributions (%)")
The variable loadings indicate how each original variable influences the principal components. They show how important each variable is for each component, helping us understand how the principal components are formed and which variables are most relevant in explaining the variability of the data.
We can see how the variables V2 and V3 have a very close and strong relationship with PC1 (principal component 1), while V9 is clearly more associated with PC2.
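The numbers behind this kind of plot can be inspected directly. In the sketch below, the loadings are the columns of the rotation matrix, and multiplying each column by the component's standard deviation gives the correlation between each variable and each component (for standardized data). The name var_coord is hypothetical.

```r
library(MASS)
data(biopsy)
X <- biopsy[!is.na(biopsy$V6), paste0("V", 1:9)]
biopsy_PCA <- prcomp(X, center = TRUE, scale. = TRUE)

# Loadings of the original variables on the first two components
round(biopsy_PCA$rotation[, 1:2], 3)

# Variable coordinates: loading times component standard deviation
var_coord <- sweep(biopsy_PCA$rotation, 2, biopsy_PCA$sdev, "*")
round(var_coord[, 1:2], 3)
```

Note that the sign of a component is arbitrary, so the loadings on PC1 may all appear negative; only their relative size and direction matter.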
fviz_pca_var(biopsy_PCA, col.var = "contrib", repel = TRUE, axes = c(1, 2)) +
  labs(title="Variables loading for PC1 and PC2", x="PC1", y="PC2")
A biplot is a graph that combines the representation of observations and variables in the same space, typically in the context of Principal Component Analysis (PCA). Observations are represented as points on the plot, showing how the data are distributed based on the principal components. Variables are represented as arrows or vectors, indicating the direction and magnitude of their influence on the principal components.
In this case, we color the observations by the ‘class’ variable, which in the biopsy dataframe contains the tumor classification (‘benign’ or ‘malignant’). Note that biopsy_PCA is the prcomp result and does not store the class, so we take it from biopsy. Using a different color for each category makes the two groups easier to distinguish in the biplot.
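Before plotting, it can help to look at the observation scores themselves. The sketch below extracts the coordinates of each sample on PC1 and PC2 and compares their means by class. The names biopsy_complete, scores and class_means are hypothetical, and the sign of each component is arbitrary.

```r
library(MASS)
data(biopsy)
biopsy_complete <- biopsy[!is.na(biopsy$V6), ]
biopsy_PCA <- prcomp(biopsy_complete[, paste0("V", 1:9)],
                     center = TRUE, scale. = TRUE)

# Coordinates of each observation on the first two components
scores <- as.data.frame(biopsy_PCA$x[, 1:2])
scores$class <- biopsy_complete$class

# Average position of each class in the PC1-PC2 plane
class_means <- aggregate(cbind(PC1, PC2) ~ class, data = scores, FUN = mean)
class_means
```

A clear difference between the two class means on PC1 would be consistent with the separation of benign and malignant samples along the horizontal axis of the biplot.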
fviz_pca_biplot(biopsy_PCA,
                label = "var",
                habillage = biopsy$class,
                axes = c(1,2),
                addEllipse = FALSE,
                col.var = "black") +
  scale_color_manual(values = c("lightblue", "red"))
We started with a dataframe containing data from 9 variables (V1 to V9) corresponding to samples obtained from breast tumor tissue biopsies.
The dataset includes the following variables:
ID: Sample identifier.
V1 (clump thickness): Thickness of the cell clump.
V2 (uniformity of cell size): Uniformity of cell size.
V3 (uniformity of cell shape): Uniformity of cell shape.
V4 (marginal adhesion): Marginal adhesion.
V5 (single epithelial cell size): Size of individual epithelial cells.
V6 (bare nuclei): Presence of bare nuclei.
V7 (bland chromatin): Bland chromatin.
V8 (normal nucleoli): Normal nucleoli.
V9 (mitoses): Number of mitoses.
class: Tumor classification, which can be “benign” or “malignant”.
The corrplot library is a specialized tool for intuitively and attractively visualizing correlation matrices. It facilitates the interpretation of relationships between variables by providing a variety of correlation plots, such as: Correlation Diagrams (Displays the correlation matrix with colors and labels representing the strength and direction of correlations between pairs of variables), Heatmaps (Uses colors to visualize the magnitude of correlations, making it easier to identify patterns), among others.
The GGally library in R is an extension of the ggplot2 package, designed to simplify the visualization and exploratory analysis of data. It provides a series of functions that facilitate the creation of complex and customized plots. In particular, its ggcorr function plots correlation matrices to help identify patterns and relationships between variables. In this practice, we have used the corrplot library for that purpose.
The factoextra library in R is designed to facilitate the visualization and interpretation of multivariate analysis results, such as Principal Component Analysis (PCA) and Factor Analysis. It includes several useful functions for these tasks: fviz_pca_biplot (generates biplots to visualize both observations and variables in the principal component space), fviz_pca_var (shows how variables contribute to the principal components), fviz_pca_ind (visualizes the distribution of observations in the reduced principal component space), fviz_eig (displays a plot of eigenvalues to represent the proportion of variance explained by each component), among others.
gridExtra is an R library that extends the visualization capabilities of the grid graphics system, allowing for flexible combination of multiple plots and tables into a single figure. It provides functions such as grid.table() to include tables within plots. It is particularly useful for creating complex compositions of graphs and tables, facilitating the presentation of results in a single visual space.