Sm🄰rt PCA biplots with 𝖇𝖎𝖕𝖑𝖔🆃𝖆𝖇𝖑𝖊 basic Shiny app

Author

James Silva Garcia

Published

April 20, 2023

Welcome

I am glad you are reading this post. Today I want to offer you a practical intro to the world of Principal Component Analysis (PCA) biplots (a.k.a. GGE-biplots in the context of plant breeding). The goal is to obtain meaningful two-way (bi) visualizations (plots) of a numerical data table that originally contained three or more numerical columns. In typical applications, rows in the original data table correspond to samples or subjects, with columns used to capture the average response profile for each subject. Classical applications include analyzing the average yield performance of a collection of varieties tested across experimental sites (useful for interpreting genotype-by-environment interaction) or describing the average performance profile of several agronomic characteristics for a collection of varieties (i.e., a genotype-by-characteristic analysis).

I designed the 𝖇𝖎𝖕𝖑𝖔🆃𝖆𝖇𝖑𝖊 basic Shiny app to provide a friendly user interface for carrying out PCA computations and delivering neat biplot visualizations along the way. When putting together your input data table, all you need to do is remember what each row and each column represents for your analysis and make sure the header of the first column is entered as the case-sensitive text RowName. Because the generated visualizations may include many overlapping label annotations, it is also recommended that you abbreviate your sample or subject ID labels and column headers in a meaningful way (using no more than 12 characters for each row ID label or column header).
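If you would like to check a table before uploading it, the layout rules above can be sketched in a few lines of Python. This is only an illustration and not part of the app: the function name `check_input_table` and its messages are my own assumptions, and the 12-character limit is treated as a recommendation rather than a hard requirement.

```python
def check_input_table(header, row_labels, max_len=12):
    """Return a list of layout problems found in an input table.

    header     : list of column headers, first one expected to be 'RowName'
    row_labels : list of sample or subject ID labels (first column values)
    max_len    : recommended maximum label length (12 in this post)
    """
    problems = []
    # The first header must match the case-sensitive text RowName.
    if not header or header[0] != "RowName":
        problems.append("first column header must be exactly 'RowName'")
    # PCA biplots are meant for tables with three or more numerical columns.
    if len(header) < 4:
        problems.append("need at least three numerical columns")
    # Long labels tend to overlap in the biplot, so flag them.
    for label in list(header[1:]) + list(row_labels):
        if len(label) > max_len:
            problems.append(f"label '{label}' is longer than {max_len} characters")
    return problems


header = ["RowName", "LocA", "LocB", "LocC", "LocD", "LocE", "LocF"]
issues = check_input_table(header, ["GrA01", "GrA02", "GrB06"])
```

An empty list means the layout follows the recommendations described above.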

The input data table used in this post corresponds to a fictitious genotype-by-environment table of average yield performance for 40 varieties across six sites, as depicted below (the full table is reproduced as Example 1A in the Appendices).

All input data tables used in this post are shared as Appendices at the end (you can copy/paste each table into its own Excel file and save it accordingly). You are welcome to visit the 𝖇𝖎𝖕𝖑𝖔🆃𝖆𝖇𝖑𝖊 basic Shiny app at any time by clicking here. The following sections provide basic guidelines on how to use this app.

Input data upload

To feed the app with data, go to the [Upload data] menu item, and then use the [Browse…] button to navigate your file system and choose an Excel input data file (e.g., “biploTable-Example1A.xlsx”).

Perform PCA

Once your input data table has been uploaded, go to the [Perform PCA] menu item and click the [Run PCA] button to trigger computations. The [PCA summary] results table is displayed by default. For this data, the first two principal components (PCs) account for about 72% of the total variability, with nearly 59% accounted for by the first PC; in other words, reducing the six original dimensions to just two preserves ~72% of the total original variability. The information ratio statistic shown in the last column of the PCA summary can be used to determine how many PCs to use for dimensionality reduction: keep the PCs with an information ratio of 1 or greater. For this sample data, however, we will use the first two PCs regardless, because a biplot needs two axes.
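To make the summary columns more concrete, here is a minimal sketch of how they could be derived from PCA eigenvalues. Two things are assumptions on my part: the information ratio is taken to be the proportion of variance multiplied by the number of PCs (equivalently, each eigenvalue divided by the average eigenvalue), which may differ from what the app computes, and the eigenvalues below are made-up numbers chosen only to roughly echo the 59% and 72% figures quoted above.

```python
def pca_summary(eigenvalues):
    """Build one summary row per PC from a list of eigenvalues."""
    total = sum(eigenvalues)
    k = len(eigenvalues)
    rows = []
    cumulative = 0.0
    for i, ev in enumerate(eigenvalues, start=1):
        proportion = ev / total
        cumulative += proportion
        rows.append({
            "PC": i,
            "Proportion": proportion,
            "Cumulative": cumulative,
            # Assumed definition: proportion of variance times number of PCs.
            "InfoRatio": proportion * k,
        })
    return rows


# Illustrative eigenvalues only (NOT the output of the app).
eigenvalues = [3.54, 0.78, 0.60, 0.50, 0.38, 0.20]
summary = pca_summary(eigenvalues)

# PCs that pass the "information ratio of 1 or greater" rule.
keep = [row["PC"] for row in summary if row["InfoRatio"] >= 1]
```

With these illustrative values only the first PC passes the rule, which mirrors the situation described above: the second PC is still used because a biplot needs two axes.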

Annotation design options

You may use the [Annotation design] menu item to customize the look of the different biplot visualization options. As displayed in the following image, several customizations can be made; the general features are described next.

Row annotation

The first PC is often associated with a weighted average performance. To capitalize on this fact, the app calculates an average performance value across the response columns for each entry and uses quantiles estimated from average performance to group subjects into five color-coded average performance categories (80-100%, 60-80%, 40-60%, 20-40%, and 0-20%). Likewise, a Unicode “PointShape” character is assigned to each performance category (solid circled numbers, by default); however, a [Custom file] option is provided so that users can customize the way colors and plotting shapes are assigned to build a proper row annotation design. Default row annotation options were kept to generate our first biplot view (shown later).
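The grouping idea can be sketched as follows. This is an approximation under stated assumptions: I split the entries into five equally sized groups by rank of their row average, whereas the app's exact quantile rule (for example, how ties or boundary values are handled) is not described in this post. The five rows come from the Example 1A table in the Appendices and were picked so that each one lands in a different category.

```python
def performance_categories(table, n_groups=5):
    """Map each row ID to a category from 1 (lowest averages) to n_groups."""
    # Average performance across the response columns for each entry.
    means = {name: sum(values) / len(values) for name, values in table.items()}
    # Rank entries from the lowest to the highest average.
    ranked = sorted(means, key=means.get)
    n = len(ranked)
    # Equal-sized rank groups: category 5 corresponds to the 80-100% band.
    return {name: rank * n_groups // n + 1 for rank, name in enumerate(ranked)}


yields = {
    "GrA01": [10.80, 9.95, 8.10, 8.70, 10.20, 6.15],
    "GrC10": [10.95, 7.70, 9.75, 7.00, 8.30, 7.40],
    "GrC12": [13.75, 10.70, 11.10, 9.30, 9.30, 7.75],
    "GrC18": [10.95, 8.05, 9.85, 6.05, 7.10, 5.10],
    "GrC24": [9.70, 6.20, 7.35, 5.20, 4.70, 3.75],
}
categories = performance_categories(yields)
```

Keep in mind that categories are relative to the rows supplied, so a subset of the table will generally not reproduce the grouping obtained from all 40 varieties.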

Column annotation

The app also generates a default annotation for columns as illustrated below.

Annotation cheatsheet

The annotation cheatsheet (reproduced below) has been provided to help users design proper annotations.

Here is a brief explanation of how to use it. By default, the variety GrC12 would be represented with a green (i.e., the p4 colorCode) solid circled number 5 (passed on to the graphical interface using the PointShape Unicode value -10126). In other words, the variety GrC12 would be plotted with the point shape ➎ drawn in green.
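If you want to preview which character a given PointShape value stands for, a small sketch like the one below may help. It assumes that a negative PointShape is simply the negated decimal Unicode code point, which is consistent with how R graphics treat negative `pch` values; the helper name `point_shape_char` is hypothetical.

```python
import unicodedata


def point_shape_char(point_shape):
    """Convert a negative PointShape value into the character it denotes."""
    if point_shape >= 0:
        raise ValueError("expected a negative Unicode PointShape value")
    # Assumption: the code point is the absolute value of PointShape.
    return chr(-point_shape)


# Look up the official Unicode names for a few values used in the Appendices.
shape_names = {
    value: unicodedata.name(point_shape_char(value))
    for value in (-10126, -10112, -9679)
}
```

Here -10126 resolves to the solid circled digit five used for GrC12, -10112 to an empty circled digit one, and -9679 to a plain black circle.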

Column-focused biplots

The first set of biplot views is called [Column-focused biplots] and, as indicated by its name, is designed for preserving metrics across columns from the original input data (sites-metric preserving for our current analysis). The first view is called the Column-focused Which-Won-Where (CF-WWW) view biplot and it can be used to study relationships across sites.

Using the default column annotation, sites are represented by empty circled numbers (1 to 6) and connected with purple segments to the biplot origin. The narrower the angle between two segments, the higher the correlation between the corresponding sites; according to this, sites LocA, LocD, and LocF are strongly correlated. Site segments at an angle close to 90° are uncorrelated (like LocB and LocC), while site segments at an angle approaching 180° would have a high negative correlation.
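The angle rule can be made tangible with a short sketch: the cosine of the angle between two site segments approximates the correlation between the two sites, and the approximation is only as good as the share of variability captured by the two PCs. The coordinates used below are hypothetical and are not read from the app.

```python
import math


def cosine_between(u, v):
    """Cosine of the angle between two 2-D biplot segments from the origin."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))


# Hypothetical (PC1, PC2) coordinates for three site markers.
site_a = (2.0, 0.5)
site_b = (1.8, 0.6)    # narrow angle with site_a: high positive cosine
site_c = (-0.5, 2.0)   # right angle with site_a: cosine of zero

similar = cosine_between(site_a, site_b)
unrelated = cosine_between(site_a, site_c)
```

A value near 1 suggests strongly correlated sites, a value near 0 suggests uncorrelated sites, and a value near -1 suggests a strong negative correlation.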

Likewise, the default row annotation indicates that varieties are represented by color-coded solid circled numbers to show average performance categories. The outermost varieties are connected with an irregular polygon that encloses all tested varieties; perpendicular dashed lines are drawn from the biplot origin to each of the sides of the polygon and are used to partition the biplot area into several sections. Sites falling within the same section (like sites LocA, LocD, and LocF) share similar characteristics and form a mega-environment. Although this is not an appropriate view for ranking varieties, it also indicates that, for example, varieties GrC30 and GrC31 were the top performers in LocB (a fact revealed by markers far from the origin, at a narrow angle with the LocB segment); additionally, GrC24 was a poor performer in LocB (its marker lies far away in the opposite direction of the LocB segment).

As the six sites fell into three different biplot sections, ranking of varieties should be performed independently for each mega-environment (using row-focused biplots, for mega-environments with 3 or more sites). To generate row-focused biplots for the mega-environment containing sites LocA, LocD, and LocF, you are encouraged to go back to the [Upload data] menu item, click the [Table modifications] tab and use the [Select columns to exclude] user input to eliminate sites LocB, LocC, and LocE from the analysis, and then take a look at the [Row-focused biplots] menu item (results not discussed here). However, an interesting data-driven approach to enhance our analysis will be proposed later.

The second view is called the Column-focused Column Evaluation View (CF-CEV) biplot and it can be used to rank sites.

The default annotation design is the same as the one used for the CF-WWW view. One additional solid diamond marker is drawn and represents the average column coordinates (ACC, or AEC = average environment coordinates for our current case). A solid black segment that crosses the biplot area and connects the ACC and the origin is also drawn; this is called the average column axis (ACA, or AEA = average environment axis for our current case). Another black segment, perpendicular to the ACA, is drawn as well. Projections from site segments onto the ACA are useful for ranking sites by their variety discriminating ability (outcome summarized using the table below).
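The geometry behind the ACC and the ACA can be sketched with plain coordinate arithmetic. The site coordinates below are hypothetical, the function names are my own, and the sketch only shows how a marker is projected onto the axis; it does not claim to reproduce the exact scaling used by the app.

```python
import math


def average_column_axis(col_coords):
    """Return the ACC point and the unit vector of the ACA through the origin."""
    n = len(col_coords)
    acc = (
        sum(x for x, _ in col_coords.values()) / n,
        sum(y for _, y in col_coords.values()) / n,
    )
    length = math.hypot(*acc)
    unit = (acc[0] / length, acc[1] / length)
    return acc, unit


def project_onto_axis(point, unit):
    """Split a marker into its position along the ACA and across it."""
    along = point[0] * unit[0] + point[1] * unit[1]
    across = -point[0] * unit[1] + point[1] * unit[0]
    return along, across


# Hypothetical (PC1, PC2) coordinates for two site markers.
sites = {"S1": (4.0, 2.0), "S2": (2.0, -2.0)}
acc, aca_unit = average_column_axis(sites)
projections = {name: project_onto_axis(xy, aca_unit) for name, xy in sites.items()}

# Rank sites by their projection along the ACA, largest first.
ranking = sorted(projections, key=lambda name: projections[name][0], reverse=True)
```

The same projection step applies to row markers when entries are ranked in a row-focused view.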

Data-driven augmentation approach

PCA biplots are used to try to identify samples with extreme performance. For example, in our multi-environmental variety trial data we would be interested in detecting varieties that outperform their competitors. As our intention is to maximize yield performance, we are interested in detecting varieties that perform well in all locations. One clever way to guesstimate what the profile of such a variety would be is to augment our input data with a dummy variety (referred to as the GOOD dummy variety) and define its yield performance profile using the estimated maximum yield within each site. Additionally, we can augment our yield data with a POOR dummy variety defined by the minimum yield within each site. The proposed data-driven row-augmentation process is illustrated below.
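For readers who prefer to script this step instead of building it in a spreadsheet, row augmentation can be sketched as follows. The function name `augment_rows` is my own. To keep the example short, only nine rows of the Example 1A table are included; they happen to contain every per-site maximum and minimum, so the resulting GOOD and POOR profiles match the ones listed in Example 1B.

```python
sites = ["LocA", "LocB", "LocC", "LocD", "LocE", "LocF"]

# Subset of the Example 1A raw data (values copied from the Appendices).
yields = {
    "GrA02": [13.30, 9.05, 11.05, 8.00, 8.90, 7.95],
    "GrC08": [10.60, 9.45, 9.30, 8.85, 10.40, 7.00],
    "GrC11": [14.10, 11.70, 10.75, 10.10, 7.55, 5.80],
    "GrC13": [10.85, 9.45, 12.05, 9.75, 8.65, 6.95],
    "GrC24": [9.70, 6.20, 7.35, 5.20, 4.70, 3.75],
    "GrC25": [7.80, 9.20, 7.55, 6.35, 6.45, 4.35],
    "GrC31": [14.45, 11.20, 10.65, 8.55, 10.35, 6.50],
    "GrD38": [8.85, 8.25, 9.40, 5.15, 7.65, 5.30],
    "GrD39": [8.25, 8.95, 6.25, 7.95, 7.70, 4.95],
}


def augment_rows(table):
    """Return a copy of the table with GOOD and POOR dummy rows appended."""
    for reserved in ("GOOD", "POOR"):
        if reserved in table:
            raise ValueError(f"'{reserved}' is reserved for a dummy row")
    # Transpose the rows so that each tuple holds the values of one site.
    columns = list(zip(*table.values()))
    augmented = dict(table)
    augmented["GOOD"] = [max(col) for col in columns]  # best yield per site
    augmented["POOR"] = [min(col) for col in columns]  # worst yield per site
    return augmented


augmented = augment_rows(yields)
```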

Once dummy entries are obtained, we can proceed to perform data-driven column augmentation. One possible way to do this is by adding a DIST dummy site column to hold the Euclidean distance of each entry from the POOR dummy variety (we are looking for entries performing far away from the POOR dummy variety, and in the direction of the GOOD dummy variety).
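The DIST column can be reproduced with a one-line distance computation. The sketch below uses the POOR and GOOD profiles and two variety rows from the Example 1B table in the Appendices; rounding to two decimals is my assumption, based on how the values are displayed there.

```python
import math

# Profiles copied from the Example 1B table (LocA to LocF).
poor = [7.80, 6.20, 6.25, 5.15, 4.70, 3.75]
rows = {
    "GrA01": [10.80, 9.95, 8.10, 8.70, 10.20, 6.15],
    "GrC24": [9.70, 6.20, 7.35, 5.20, 4.70, 3.75],
    "GOOD": [14.45, 11.70, 12.05, 10.10, 10.40, 7.95],
    "POOR": poor,
}

# Euclidean distance of every entry from the POOR dummy variety.
dist = {name: round(math.dist(values, poor), 2) for name, values in rows.items()}
```

The resulting values (8.67 for GrA01, 2.20 for GrC24, 13.52 for GOOD, and 0.00 for POOR) agree with the DIST column listed in Example 1B.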

Note that the GOOD and POOR ID labels are forbidden labels and cannot be used as RowName for other samples. It is also good to emphasize that in other analytical situations, defining the profile for extreme performers may involve the use of a combination of MIN(), MAX(), AVERAGE(), or other statistics. For example, we may be interested in maximizing some columns (like yield and yield components or quality indicators), minimizing others (like disease incidence or any undesired characteristic), while keeping a few at their average level (like average plant height). In other words, your creativity plays a huge role when performing meaningful data-driven augmentation.
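As a rough illustration of mixing statistics across columns, the sketch below builds a GOOD profile from a per-column rule. Everything in it is hypothetical: the trait names, the three entries, and their values are invented for the example and do not come from the data used in this post.

```python
from statistics import mean

# One aggregation function per rule keyword.
RULES = {"max": max, "min": min, "mean": mean}


def ideal_profile(table, column_rules):
    """Build a dummy profile by applying one rule to each column."""
    columns = list(zip(*table.values()))
    return [RULES[rule](col) for rule, col in zip(column_rules, columns)]


# Hypothetical traits: yield (maximize), disease incidence (minimize),
# and plant height (keep at its average level).
traits = ["Yield", "Disease", "Height"]
data = {
    "V1": [10.0, 3.0, 90.0],
    "V2": [12.0, 5.0, 110.0],
    "V3": [8.0, 1.0, 100.0],
}
good = ideal_profile(data, ["max", "min", "mean"])
```

The matching POOR profile would be defined with whatever opposite rules make sense for your own analysis.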

Once data-driven augmentation has been completed, we should save the results as a new Excel file. Just remember to close your new data file before attempting to upload it into our app.

Column-focused biplots for data-driven augmented tables

After uploading the augmented data and performing PCA, we are ready to go back to interpreting some of the biplots that can be generated. Let’s start with the CF-WWW biplot view reproduced below.

We can see that the amount of variability accounted for by the first two PCs jumped up to ~80%, with PC1 accounting for ~71%. Another important change is that all sites now belong to the same mega-environment. Note also the inclusion of star symbols to represent the POOR and GOOD dummy varieties, located in opposite directions as expected.

Next, let me illustrate how to use the [Annotation design] menu item to change the default colors so that they highlight the variety type (GrA, GrB, GrC, or GrD) and the dummy DIST site segment instead.

Upon loading custom annotation files, the CF-WWW biplot is updated.

Row-focused biplots for data-driven augmented tables

The last thing I want to share is how one of the row-focused biplots is constructed. The most typical visualization for ranking row entries is called the Row-focused Row Evaluation View (or RF-REV) biplot.

The construction of this biplot view is very similar to the one used for building the CF-CEV biplot, but this time solid projection lines are drawn from each entry marker point onto the ACA. Depending on the number of row entries being visualized, it might become challenging to use this biplot for visually ranking your entries; in those situations it is better to go back to the [Perform PCA] menu item and take a look at the [Ranking of rows] results table, as illustrated below.

Further learning

To learn more about PCA biplot analysis, I recommend taking a look at the presentation (file: “myIntroduction to Augmented PCA Biplots.pdf”) shared in the [Welcome] menu item. It is also a very good idea to read the publications referenced in my introductory presentation.

I encourage you to share and take advantage of these learnings.

Enjoy the 𝖇𝖎𝖕𝖑𝖔🆃𝖆𝖇𝖑𝖊 basic Shiny app!

Appendices

Example 1A Files

Raw data

RowName LocA LocB LocC LocD LocE LocF
GrA01 10.80 9.95 8.10 8.70 10.20 6.15
GrA02 13.30 9.05 11.05 8.00 8.90 7.95
GrA03 12.50 8.60 8.45 6.80 8.90 7.45
GrA04 12.90 8.75 10.75 7.55 9.05 6.75
GrA05 12.50 8.95 8.35 7.10 8.95 7.05
GrB06 9.70 8.55 8.65 8.55 8.30 6.70
GrC07 12.75 9.15 10.75 7.35 9.30 6.75
GrC08 10.60 9.45 9.30 8.85 10.40 7.00
GrC09 11.25 8.90 11.40 9.95 8.30 6.95
GrC10 10.95 7.70 9.75 7.00 8.30 7.40
GrC11 14.10 11.70 10.75 10.10 7.55 5.80
GrC12 13.75 10.70 11.10 9.30 9.30 7.75
GrC13 10.85 9.45 12.05 9.75 8.65 6.95
GrC14 13.05 8.50 11.00 8.60 9.15 7.65
GrC15 12.05 8.35 9.95 9.00 8.40 7.50
GrC16 9.10 9.55 8.30 6.90 9.10 6.90
GrC17 10.20 9.60 11.05 9.75 6.65 5.60
GrC18 10.95 8.05 9.85 6.05 7.10 5.10
GrC19 12.15 9.30 8.35 6.70 6.80 5.20
GrC20 9.75 9.15 7.85 5.55 5.90 4.70
GrC21 12.40 9.65 9.90 6.70 9.30 5.05
GrC22 9.65 8.45 7.65 6.10 8.10 6.25
GrC23 8.75 7.25 6.75 6.25 5.70 4.20
GrC24 9.70 6.20 7.35 5.20 4.70 3.75
GrC25 7.80 9.20 7.55 6.35 6.45 4.35
GrC26 9.55 8.35 7.35 5.95 7.75 5.50
GrC27 13.30 8.35 9.10 8.85 7.40 5.25
GrC28 12.85 9.75 10.40 9.60 9.95 6.65
GrC29 9.90 10.35 10.10 6.85 8.35 5.85
GrC30 12.90 11.25 9.95 9.95 10.00 6.85
GrC31 14.45 11.20 10.65 8.55 10.35 6.50
GrC32 10.95 9.25 9.90 6.55 9.15 6.25
GrC33 11.35 9.35 9.85 7.35 7.70 5.90
GrD34 10.45 7.45 11.20 8.95 8.10 6.20
GrD35 12.25 9.25 10.45 9.70 8.85 6.85
GrD36 11.25 8.65 10.30 9.10 8.40 6.65
GrD37 10.20 8.30 8.95 6.75 6.80 5.35
GrD38 8.85 8.25 9.40 5.15 7.65 5.30
GrD39 8.25 8.95 6.25 7.95 7.70 4.95
GrD40 10.35 9.25 8.70 5.45 9.75 6.30

Row annotation

RowName colorCode PointShape
GrA01 d6 -10125
GrA02 d6 -10126
GrA03 d6 -10124
GrA04 d6 -10125
GrA05 d6 -10124
GrB06 d7 -10123
GrC07 s8 -10125
GrC08 s8 -10125
GrC09 s8 -10125
GrC10 s8 -10124
GrC11 s8 -10126
GrC12 s8 -10126
GrC13 s8 -10126
GrC14 s8 -10126
GrC15 s8 -10125
GrC16 s8 -10123
GrC17 s8 -10124
GrC18 s8 -10123
GrC19 s8 -10123
GrC20 s8 -10122
GrC21 s8 -10124
GrC22 s8 -10123
GrC23 s8 -10122
GrC24 s8 -10122
GrC25 s8 -10122
GrC26 s8 -10122
GrC27 s8 -10123
GrC28 s8 -10126
GrC29 s8 -10124
GrC30 s8 -10126
GrC31 s8 -10126
GrC32 s8 -10124
GrC33 s8 -10123
GrD34 p2 -10124
GrD35 p2 -10125
GrD36 p2 -10125
GrD37 p2 -10122
GrD38 p2 -10122
GrD39 p2 -10122
GrD40 p2 -10123

Column annotation

ColName colorCode PointShape
LocF d3 -10112
LocE d4 -10113
LocA d3 -10114
LocC p12 -10115
LocD d3 -10116
LocB d4 -10117

Example 1B Files

Raw data

RowName LocA LocB LocC LocD LocE LocF DIST
GrA01 10.80 9.95 8.10 8.70 10.20 6.15 8.67
GrA02 13.30 9.05 11.05 8.00 8.90 7.95 10.24
GrA03 12.50 8.60 8.45 6.80 8.90 7.45 8.17
GrA04 12.90 8.75 10.75 7.55 9.05 6.75 9.30
GrA05 12.50 8.95 8.35 7.10 8.95 7.05 8.17
GrB06 9.70 8.55 8.65 8.55 8.30 6.70 6.94
GrC07 12.75 9.15 10.75 7.35 9.30 6.75 9.41
GrC08 10.60 9.45 9.30 8.85 10.40 7.00 9.19
GrC09 11.25 8.90 11.40 9.95 8.30 6.95 9.59
GrC10 10.95 7.70 9.75 7.00 8.30 7.40 7.36
GrC11 14.10 11.70 10.75 10.10 7.55 5.80 11.27
GrC12 13.75 10.70 11.10 9.30 9.30 7.75 11.56
GrC13 10.85 9.45 12.05 9.75 8.65 6.95 10.03
GrC14 13.05 8.50 11.00 8.60 9.15 7.65 10.12
GrC15 12.05 8.35 9.95 9.00 8.40 7.50 8.89
GrC16 9.10 9.55 8.30 6.90 9.10 6.90 7.03
GrC17 10.20 9.60 11.05 9.75 6.65 5.60 8.29
GrC18 10.95 8.05 9.85 6.05 7.10 5.10 5.89
GrC19 12.15 9.30 8.35 6.70 6.80 5.20 6.47
GrC20 9.75 9.15 7.85 5.55 5.90 4.70 4.19
GrC21 12.40 9.65 9.90 6.70 9.30 5.05 8.46
GrC22 9.65 8.45 7.65 6.10 8.10 6.25 5.40
GrC23 8.75 7.25 6.75 6.25 5.70 4.20 2.16
GrC24 9.70 6.20 7.35 5.20 4.70 3.75 2.20
GrC25 7.80 9.20 7.55 6.35 6.45 4.35 3.94
GrC26 9.55 8.35 7.35 5.95 7.75 5.50 4.68
GrC27 13.30 8.35 9.10 8.85 7.40 5.25 8.14
GrC28 12.85 9.75 10.40 9.60 9.95 6.65 10.54
GrC29 9.90 10.35 10.10 6.85 8.35 5.85 7.55
GrC30 12.90 11.25 9.95 9.95 10.00 6.85 11.22
GrC31 14.45 11.20 10.65 8.55 10.35 6.50 11.82
GrC32 10.95 9.25 9.90 6.55 9.15 6.25 7.78
GrC33 11.35 9.35 9.85 7.35 7.70 5.90 7.34
GrD34 10.45 7.45 11.20 8.95 8.10 6.20 8.07
GrD35 12.25 9.25 10.45 9.70 8.85 6.85 9.71
GrD36 11.25 8.65 10.30 9.10 8.40 6.65 8.49
GrD37 10.20 8.30 8.95 6.75 6.80 5.35 5.20
GrD38 8.85 8.25 9.40 5.15 7.65 5.30 5.13
GrD39 8.25 8.95 6.25 7.95 7.70 4.95 5.10
GrD40 10.35 9.25 8.70 5.45 9.75 6.30 7.34
GOOD 14.45 11.70 12.05 10.10 10.40 7.95 13.52
POOR 7.80 6.20 6.25 5.15 4.70 3.75 0.00

Row annotation

RowName colorCode PointShape
GrA01 d6 -10125
GrA02 d6 -10126
GrA03 d6 -10124
GrA04 d6 -10125
GrA05 d6 -10124
GrB06 d7 -10123
GrC07 s8 -10125
GrC08 s8 -10125
GrC09 s8 -10125
GrC10 s8 -10124
GrC11 s8 -10126
GrC12 s8 -10126
GrC13 s8 -10126
GrC14 s8 -10126
GrC15 s8 -10125
GrC16 s8 -10123
GrC17 s8 -10124
GrC18 s8 -10123
GrC19 s8 -10123
GrC20 s8 -10122
GrC21 s8 -10124
GrC22 s8 -10123
GrC23 s8 -10122
GrC24 s8 -10122
GrC25 s8 -10122
GrC26 s8 -10122
GrC27 s8 -10123
GrC28 s8 -10126
GrC29 s8 -10124
GrC30 s8 -10126
GrC31 s8 -10126
GrC32 s8 -10124
GrC33 s8 -10123
GrD34 p2 -10124
GrD35 p2 -10125
GrD36 p2 -10125
GrD37 p2 -10122
GrD38 p2 -10122
GrD39 p2 -10122
GrD40 p2 -10123

Column annotation

ColName colorCode PointShape
LocA d3 -10114
LocB d3 -10117
LocC d3 -10115
LocD d3 -10116
LocE d3 -10113
LocF d3 -10112
DIST p10 -9679