Introduction

We have discussed the HPC race in earlier posts. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777

TOP500 data

Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/

Filtering Data

Note: the original variable names were shortened (truncated, concatenated and simplified) as far as possible to make data exploration and visualization easier.

library(readxl)
library(tidyverse)
library(tidyquant)
library(broom)
library(DT)
library(knitr)



TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)

##  [1] "Rank"                        "PreviousRank"               
##  [3] "FirstAppearance"             "FirstRank"                  
##  [5] "Name"                        "Computer"                   
##  [7] "Site"                        "Manufacturer"               
##  [9] "Country"                     "Year"                       
## [11] "Segment"                     "TotalCores"                 
## [13] "AcceleratorCoProcessorCores" "Rmax"                       
## [15] "Rpeak"                       "Nmax"                       
## [17] "Nhalf"                       "HPCG"                       
## [19] "Power"                       "PowerSource"                
## [21] "PowerEfficiency"             "Architecture"               
## [23] "Processor"                   "ProcessorTechnology"        
## [25] "ProcessorSpeed"              "OperatingSystem"            
## [27] "OSFamily"                    "AcceleratorCoProcessor"     
## [29] "CoresperSocket"              "ProcessorGeneration"        
## [31] "SystemModel"                 "SystemFamily"               
## [33] "InterconnectFamily"          "Interconnect"               
## [35] "Continent"                   "SiteID"                     
## [37] "SystemID"
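The renaming mentioned in the note above is not shown in the source. As a minimal sketch of how such names could be produced, the helper below strips bracketed units and every non-alphanumeric character. The function name `simplify_names` and the raw headers in the example are assumptions made for illustration; they are not taken from the original spreadsheet or analysis.

```r
# Hypothetical helper, not part of the original analysis.
simplify_names <- function(x) {
  # Drop a unit given in square or round brackets, e.g. " [TFlop/s]" or " (kW)".
  x <- gsub("\\s*[\\[(].*?[\\])]", "", x, perl = TRUE)
  # Remove every remaining character that is not a letter or a digit.
  gsub("[^A-Za-z0-9]", "", x)
}

# Illustrative raw headers (assumed, not copied from the spreadsheet).
raw <- c("Total Cores", "Accelerator/Co-Processor Cores",
         "Cores per Socket", "Power (kW)")
simplify_names(raw)
```

Applied to a freshly imported table, this would be used as `names(TOP500_202011) <- simplify_names(names(TOP500_202011))`.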

Problem

We want to discriminate between the HPC mainframes of China and the USA by applying exploratory multivariate methods such as principal component analysis, correspondence analysis or clustering. To this end we use the FactoMineR package by Francois Husson, Julie Josse, Sebastien Le and Jeremy Mazet. For more, see: http://factominer.free.fr/index.html

TOP500.USCH <- as_tibble(TOP500_202011) %>%
  filter(Country %in% c("United States", "China")) %>%
  select(Country, InterconnectFamily, Architecture, Rank, Year, TotalCores,
         Rmax, Rpeak, Nmax, Power, PowerEfficiency, ProcessorSpeed,
         CoresperSocket) %>%
  drop_na()

TOP500.USCH %>% datatable() # kable(align = "r", caption = "TOP500: China - USA")

This dataset contains 98 individuals (mainframes) and 13 variables. The 3 qualitative variables (Country, InterconnectFamily, Architecture) are treated as illustrative (supplementary): they do not take part in the construction of the axes.
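The call that produces the PCA is not shown in the source. A minimal sketch with FactoMineR, assuming standardised variables and the three qualitative columns declared as supplementary, could look like this. The wrapper `run_pca` and the values of `scale.unit` and `ncp` are assumptions, not the author's confirmed settings.

```r
library(FactoMineR)

# Hypothetical wrapper; argument values are assumptions inferred from the text.
run_pca <- function(df,
                    quali = c("Country", "InterconnectFamily", "Architecture")) {
  df <- as.data.frame(df)
  # Qualitative columns must be factors to be used as supplementary variables.
  df[quali] <- lapply(df[quali], factor)
  PCA(df,
      scale.unit = TRUE,                   # standardise the quantitative variables
      ncp = 5,                             # number of dimensions kept in the result
      quali.sup = match(quali, names(df)), # column indices of illustrative variables
      graph = FALSE)
}

# Usage with the filtered table built above:
# res.pca <- run_pca(TOP500.USCH)
# res.pca$eig    # eigenvalues and percentages of inertia per dimension
```

The wording of the sections that follow resembles an automatically generated PCA report such as the one produced by `FactoInvestigate::Investigate(res.pca)`, but that is a guess and is not stated in the source.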

1. Study of the outliers

Inspection of the graphs does not reveal any outliers.

2. Inertia distribution

The inertia of the first dimensions shows whether there are strong relationships between the variables and suggests the number of dimensions that should be studied.

The first two dimensions of the analysis express 60.16% of the total dataset inertia; that is, 60.16% of the total variability of the cloud of individuals (or variables) is explained by this plane. This percentage is relatively high, so the first plane represents the data variability well. It is also far greater than the reference value of 31.13%, so the variability explained by this plane is highly significant (the reference value is the 0.95-quantile of the distribution of inertia percentages obtained by simulating 2020 data tables of equivalent size on the basis of a normal distribution).

From these observations, it would be better to also interpret the dimensions from the third one onwards.
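The comparison with a reference value can be sketched in base R. The code below computes the percentage of inertia carried by the first `k` axes of a standardised PCA and the 0.95-quantile of the same percentage over simulated tables of independent normal noise with the same dimensions. The function names, the number of simulations and the seed are assumptions; this is a sketch of the idea and is not expected to reproduce the reported figures exactly.

```r
# Percentage of inertia carried by the first k axes of a standardised PCA.
inertia_pct <- function(X, k = 2) {
  eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  100 * sum(eig[seq_len(k)]) / sum(eig)
}

# Quantile of that percentage over n_sim tables of independent normal noise
# with n rows and p columns (the "reference value" under no structure).
reference_inertia <- function(n, p, k = 2, n_sim = 1000, prob = 0.95) {
  sims <- replicate(n_sim, inertia_pct(matrix(rnorm(n * p), n, p), k))
  unname(quantile(sims, prob))
}

# Usage with the quantitative columns of the filtered table:
# set.seed(2020)
# X <- TOP500.USCH[, sapply(TOP500.USCH, is.numeric)]
# inertia_pct(X, k = 2)                        # observed percentage
# reference_inertia(nrow(X), ncol(X), k = 2)   # simulated reference value
```

An observed percentage well above the simulated quantile indicates that the plane captures structure beyond what random data of the same size would show. Setting `k = 3` gives the corresponding comparison for the first three axes.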

Figure 2 - Decomposition of the total inertia

An estimation of the right number of axes to interpret suggests restricting the analysis to the first 3 axes. Together these axes carry more inertia than the 0.95-quantile of random distributions (73.68% against 43.69%). This suggests that only these axes carry real information. Consequently, the description is limited to these axes.

3. Description of the plane 1:2

Figure 3.1 - Individuals factor map (PCA). The labeled individuals are those with the highest contribution to the construction of the plane.

The Wilks test p-value indicates which qualitative variables have their categories best separated on the plane (i.e. which ones best explain the distance between individuals).

## InterconnectFamily       Architecture            Country 
##       8.344702e-08       4.352946e-06       1.313089e-01

The best qualitative variable to illustrate the distance between individuals on this plane is InterconnectFamily.
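One way to obtain p-values of this kind is a one-way MANOVA of the plane coordinates on each qualitative variable, using Wilks' lambda as the test statistic. The base R sketch below assumes that approach; the helper `wilks_p` is hypothetical, and the source does not show how its values were actually computed.

```r
# Wilks' lambda p-value for one grouping factor on a set of PCA coordinates.
wilks_p <- function(coords, group) {
  fit <- manova(as.matrix(coords) ~ factor(group))
  summary(fit, test = "Wilks")$stats[1, "Pr(>F)"]
}

# Usage, assuming res.pca holds a FactoMineR PCA result for TOP500.USCH:
# coords <- res.pca$ind$coord[, 1:2]
# quali  <- TOP500.USCH[c("Country", "InterconnectFamily", "Architecture")]
# sort(sapply(quali, wilks_p, coords = coords))
```

The smaller the p-value, the more clearly the categories of that variable are separated on the plane.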

Figure 3.2 - Individuals factor map (PCA). The labeled individuals are those with the highest contribution to the construction of the plane. The individuals are coloured by their category for the variable InterconnectFamily.

Figure 3.3 - Variables factor map (PCA). The labeled variables are those best represented on the plane.

Figure 3.4 - Qualitative factor map (PCA). The labeled factors are those best represented on the plane.

Dimension 1 opposes individuals such as 2, 4, 3, 1 and 5 (to the right of the graph, with a strongly positive coordinate on the axis) to individuals such as 66, 91 and 83 (to the left of the graph, with a strongly negative coordinate on the axis).

The group containing individuals 2 and 4 (positive coordinate on the axis) shares:

high values for the variables Rmax, PowerEfficiency and Rpeak (variables are sorted from the strongest).
low values for the variable Rank.

The group containing individual 5 (positive coordinate on the axis) shares:

high values for the variables Power and TotalCores (variables are sorted from the strongest).

The group containing individual 1 (positive coordinate on the axis) shares:

high values for the variables Rmax, Rpeak and Nmax (variables are sorted from the strongest).

The group containing individual 3 (positive coordinate on the axis) shares:

high values for the variables CoresperSocket, TotalCores, Power, Rmax, Rpeak and Nmax (variables are sorted from the strongest).
low values for the variable ProcessorSpeed.

The group containing individuals 66, 91 and 83 (negative coordinate on the axis) shares:

high values for the variables Rank and Year (variables are sorted from the strongest).
low values for the variables Nmax, Power, Rpeak, Rmax, TotalCores and CoresperSocket (variables are sorted from the weakest).

Dimension 2 opposes individuals such as 2, 4, 1, 66, 8, 91, 83 and 35 (at the top of the graph, with a strongly positive coordinate on the axis) to individuals such as 3, 6, 7, 27 and 26 (at the bottom of the graph, with a strongly negative coordinate on the axis).

The group containing individuals 66, 91 and 83 (positive coordinate on the axis) shares:

high values for the variables Rank and Year (variables are sorted from the strongest).
low values for the variables Nmax, Power, Rpeak, Rmax, TotalCores and CoresperSocket (variables are sorted from the weakest).

The group containing individuals 8 and 35 (positive coordinate on the axis) shares:

high values for the variable PowerEfficiency.

The group containing individuals 2 and 4 (positive coordinate on the axis) shares:

high values for the variables Rmax, PowerEfficiency and Rpeak (variables are sorted from the strongest).
low values for the variable Rank.

The group containing individual 1 (positive coordinate on the axis) shares:

high values for the variables Rmax, Rpeak and Nmax (variables are sorted from the strongest).

Group 5 (negative coordinate on the axis) shares:

low values for the variables Year and PowerEfficiency (variables are sorted from the weakest).

The group containing individuals 6, 7, 27 and 26 (negative coordinate on the axis) shares:

high values for the variable Power.
low values for the variables ProcessorSpeed, Rank and Year (variables are sorted from the weakest).

The group containing individual 3 (negative coordinate on the axis) shares:

high values for the variables CoresperSocket, TotalCores, Power, Rmax, Rpeak and Nmax (variables are sorted from the strongest).
low values for the variable ProcessorSpeed.
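The variables that characterise an axis, as in the descriptions above, can be listed with `dimdesc()` from FactoMineR, which reports the variables whose correlation with a dimension is significant. The helper `axis_variables` below is a hypothetical convenience wrapper, and the 0.05 threshold is an assumption.

```r
library(FactoMineR)

# Hypothetical helper: quantitative variables significantly correlated with
# one axis of a FactoMineR PCA result, with their correlations and p-values.
axis_variables <- function(res, axis = 1, proba = 0.05) {
  desc <- dimdesc(res, axes = axis, proba = proba)
  # With a single axis requested, the first element describes that axis.
  desc[[1]]$quanti
}

# Usage, assuming res.pca holds the PCA result for TOP500.USCH:
# axis_variables(res.pca, axis = 1)
# axis_variables(res.pca, axis = 2)
```

Positive correlations correspond to variables taking high values for individuals on the positive side of the axis, and negative correlations to variables taking low values there.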

4. Description of dimension 3

Figure 4.1 - Individuals factor map (PCA). The labeled individuals are those with the highest contribution to the construction of the plane.

The Wilks test p-value indicates which qualitative variables have their categories best separated on the plane (i.e. which ones best explain the distance between individuals).

##            Country InterconnectFamily       Architecture 
##       9.368191e-08       4.173656e-06       2.989679e-02

The best qualitative variable to illustrate the distance between individuals on this plane is Country.

Figure 4.2 - Individuals factor map (PCA). The labeled individuals are those with the highest contribution to the construction of the plane. The individuals are coloured by their category for the variable Country.

Figure 4.3 - Variables factor map (PCA). The labeled variables are those best represented on the plane.

Figure 4.4 - Qualitative factor map (PCA). The labeled factors are those best represented on the plane.

Dimension 3 opposes individuals such as 78, 21, 85, 20, 14 and 35 (to the right of the graph, with a strongly positive coordinate on the axis) to individuals such as 28, 30, 9, 23 and 17 (to the left of the graph, with a strongly negative coordinate on the axis).

The group containing individuals 21, 20, 14 and 35 (positive coordinate on the axis) shares:

high values for the variables PowerEfficiency and CoresperSocket (variables are sorted from the strongest).
low values for the variable Nmax.

The group containing individuals 78 and 85 (positive coordinate on the axis) shares:

high values for the variables Rank, Year and Nmax (variables are sorted from the strongest).
low values for the variables Rmax, ProcessorSpeed, TotalCores, Power and PowerEfficiency (variables are sorted from the weakest).

The group containing individuals 28, 30, 9, 23 and 17 (negative coordinate on the axis) shares:

high values for the variables Power, ProcessorSpeed, Rmax and TotalCores (variables are sorted from the strongest).
low values for the variables Year and Rank (variables are sorted from the weakest).

Conclusion

Principal component analysis of the TOP500 data for China and the USA describes the differences between the mainframes of the two leading countries through a multidimensional decomposition of their published specifications. Country is not well separated on the plane 1:2 (Wilks p-value 0.13), but it is the best-separated qualitative variable in the description of dimension 3 (p-value 9.4e-08).
Further analysis will include the following classification procedures: logistic regression, classification trees, support vector machines, naive Bayes and discriminant analysis.

TOP500 HPC Race by PCA: China - United States

Alexander Levakov, Senior Research Fellow, Ph.D

December, 2020