IBM-Summit
We have discussed the HPC race previously. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777
Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/
Note: the original variable names were truncated, concatenated and simplified as far as possible to ease data exploration and visualization.
library(readxl)
library(tidyverse)
library(tidyquant)
library(broom)
library(DT)
library(knitr)
TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)
## [1] "Rank" "PreviousRank"
## [3] "FirstAppearance" "FirstRank"
## [5] "Name" "Computer"
## [7] "Site" "Manufacturer"
## [9] "Country" "Year"
## [11] "Segment" "TotalCores"
## [13] "AcceleratorCoProcessorCores" "Rmax"
## [15] "Rpeak" "Nmax"
## [17] "Nhalf" "HPCG"
## [19] "Power" "PowerSource"
## [21] "PowerEfficiency" "Architecture"
## [23] "Processor" "ProcessorTechnology"
## [25] "ProcessorSpeed" "OperatingSystem"
## [27] "OSFamily" "AcceleratorCoProcessor"
## [29] "CoresperSocket" "ProcessorGeneration"
## [31] "SystemModel" "SystemFamily"
## [33] "InterconnectFamily" "Interconnect"
## [35] "Continent" "SiteID"
## [37] "SystemID"
We want to discriminate between the HPC mainframes of China and the USA by applying exploratory multivariate methods such as principal component analysis, correspondence analysis or clustering. To this end we use the FactoMineR package by Francois Husson, Julie Josse, Sebastien Le and Jeremy Mazet. For more, see: http://factominer.free.fr/index.html
TOP500.USCH <- as_tibble(TOP500_202011) %>%
  filter(Country %in% c("United States", "China")) %>%
  select(Country, InterconnectFamily, Architecture, Rank, Year, TotalCores,
         Rmax, Rpeak, Nmax, Power, PowerEfficiency, ProcessorSpeed,
         CoresperSocket) %>%
  drop_na()
TOP500.USCH %>% datatable() # kable(align = "r", caption = "TOP500: China - USA")
This dataset contains 98 individuals (mainframes) and 13 variables; the 3 qualitative variables (Country, InterconnectFamily, Architecture) are treated as illustrative, i.e. supplementary.
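The original PCA call is not shown in the text, so the following is only a minimal sketch of how qualitative columns can be declared as illustrative in FactoMineR. The data frame `toy` and its values are made up for illustration and merely mimic the layout of `TOP500.USCH` (qualitative columns first, quantitative columns after). The call is guarded so that the sketch still runs when FactoMineR is not installed.

```r
# Toy stand-in for TOP500.USCH (hypothetical values, not the real data):
# three qualitative columns followed by quantitative ones.
set.seed(1)
n <- 30
toy <- data.frame(
  Country            = factor(sample(c("China", "United States"), n, replace = TRUE)),
  InterconnectFamily = factor(sample(c("Infiniband", "Gigabit Ethernet"), n, replace = TRUE)),
  Architecture       = factor(sample(c("Cluster", "MPP"), n, replace = TRUE)),
  Rmax       = rlnorm(n),
  Rpeak      = rlnorm(n),
  Power      = rlnorm(n),
  TotalCores = rlnorm(n)
)

# Positions of the non-numeric columns, to be passed as supplementary.
quali_cols <- which(!sapply(toy, is.numeric))

if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # Standardized PCA; supplementary qualitative variables do not take part
  # in building the axes and are only projected afterwards.
  res <- FactoMineR::PCA(toy, scale.unit = TRUE,
                         quali.sup = quali_cols, graph = FALSE)
  print(round(res$eig, 2))
}
```

On the real data the analogous call would presumably use `TOP500.USCH` with `quali.sup = 1:3`, since the three qualitative variables are the first three selected columns.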
The analysis of the graphs does not reveal any outliers.
The inertia of the first dimensions shows if there are strong relationships between variables and suggests the number of dimensions that should be studied.
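To make the notion of inertia concrete, here is a small base R sketch on the built-in `mtcars` data (not the TOP500 data). In a standardized PCA the inertia of each dimension is the corresponding eigenvalue of the correlation matrix, and the total inertia equals the number of active variables. The choice of columns is arbitrary.

```r
# Illustration on a built-in dataset (an assumption, not the TOP500 data).
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# Eigenvalues of the correlation matrix = inertia carried by each dimension.
eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

inertia <- data.frame(
  dimension      = seq_along(eig),
  eigenvalue     = eig,
  percent        = 100 * eig / sum(eig),
  cumulative_pct = 100 * cumsum(eig) / sum(eig)
)
print(round(inertia, 2))
```

The `cumulative_pct` column is the quantity quoted when a plane or a set of axes is said to express a given share of the total inertia.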
The first two dimensions of the analysis express 60.16% of the total dataset inertia; that is, 60.16% of the total variability of the cloud of individuals (or variables) is explained by this plane. This percentage is relatively high, so the first plane represents the data variability well. It is far greater than the reference value of 31.13%, so the variability explained by this plane is highly significant (the reference value is the 0.95-quantile of the distribution of inertia percentages obtained by simulating 2020 data tables of equivalent size on the basis of a normal distribution).
From these observations, it would be better to also interpret the dimensions from the third one onward.
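The reference value can be approximated with a short simulation. The sketch below is not the exact procedure used to produce the figure quoted in the text; it assumes independent standard normal variables, a standardized PCA, and a smaller number of simulated tables (`n_sim = 500`, chosen only to keep the run short). The helper names `pct_first_axes` and `reference_inertia` are hypothetical.

```r
# Percentage of inertia carried by the first k axes of a standardized PCA.
pct_first_axes <- function(X, k = 2) {
  eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  100 * sum(eig[seq_len(k)]) / sum(eig)
}

# Quantile of that percentage over random normal tables of size n x p.
reference_inertia <- function(n, p, k = 2, n_sim = 500, prob = 0.95) {
  sims <- replicate(n_sim, pct_first_axes(matrix(rnorm(n * p), n, p), k))
  unname(quantile(sims, prob))
}

set.seed(2020)
# 98 individuals and 10 active quantitative variables, as in the text.
ref <- reference_inertia(n = 98, p = 10, k = 2)
observed <- 60.16  # percentage reported in the text for the first plane

ref
observed > ref
```

Calling `reference_inertia()` with `k = 3` gives the analogous reference for the first three axes.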
Figure 2 - Decomposition of the total inertia
An estimate of the right number of axes to interpret suggests restricting the analysis to the first 3 axes. Together these axes carry more inertia than the 0.95-quantile of random distributions (73.68% against 43.69%), which suggests that only these axes carry real information. Consequently, the description is limited to these axes.
Figure 3.1 - Individuals factor map (PCA). The labeled individuals are those with the highest contribution to the construction of the plane.
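As an illustration of what "contribution to the plane construction" can mean, the base R sketch below computes the contribution of each individual to each axis and then combines the two axes of a plane, weighting each axis by its eigenvalue. This weighting is one common convention and is assumed here; the data (`mtcars`) and the selected columns are only for illustration.

```r
# Illustration on a built-in dataset (an assumption, not the TOP500 data).
X <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
pca <- prcomp(X, center = FALSE, scale. = FALSE)

scores <- pca$x  # coordinates of the individuals on the axes

# Contribution (%) of each individual to each axis: its squared coordinate
# divided by the sum of squared coordinates on that axis.
contrib <- 100 * sweep(scores^2, 2, colSums(scores^2), "/")

# Contribution to the plane (1, 2): axis contributions weighted by eigenvalue.
eig <- pca$sdev^2
plane <- (contrib[, 1] * eig[1] + contrib[, 2] * eig[2]) / (eig[1] + eig[2])

# Individuals that would be labeled if the top five were kept.
head(sort(plane, decreasing = TRUE), 5)
```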
The Wilks test p-value indicates which qualitative variable has its categories best separated on the plane (i.e. which one best explains the distance between individuals).
## InterconnectFamily Architecture Country
## 8.344702e-08 4.352946e-06 1.313089e-01
The best qualitative variable to illustrate the distance between individuals on this plane is InterconnectFamily.
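A comparable p-value can be obtained in base R by running a MANOVA of the two plane coordinates on each qualitative variable and reading off Wilks' lambda test. This is a sketch under the assumption that the test is applied to the coordinates on the plane; it uses `mtcars` with three columns recoded as factors, purely for illustration.

```r
# Illustration on a built-in dataset (an assumption, not the TOP500 data).
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
plane <- prcomp(X, scale. = TRUE)$x[, 1:2]  # coordinates on dimensions 1 and 2

# Qualitative variables kept out of the PCA and used only for illustration.
quali <- data.frame(
  cyl  = factor(mtcars$cyl),
  am   = factor(mtcars$am),
  gear = factor(mtcars$gear)
)

# Wilks test p-value of a one-way MANOVA of the plane coordinates on each factor.
wilks_p <- sapply(quali, function(f) {
  summary(manova(plane ~ f), test = "Wilks")$stats[1, "Pr(>F)"]
})

# The smallest p-value points to the factor whose categories are best separated.
sort(wilks_p)
```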
Figure 3.2 - Individuals factor map (PCA). The labeled individuals are those with the highest contribution to the construction of the plane. The individuals are coloured according to their category for the variable InterconnectFamily.
Figure 3.3 - Variables factor map (PCA). The labeled variables are those best represented on the plane.
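How well a variable is "shown" on a plane is usually measured by its squared cosine. In a standardized PCA the coordinate of a variable on an axis is its correlation with that axis, and the squared correlations of a variable sum to 1 over all axes. A minimal base R sketch, again on `mtcars` as a stand-in:

```r
# Illustration on a built-in dataset (an assumption, not the TOP500 data).
X <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Coordinates of the variables = correlations between variables and components.
var_coord <- cor(X, pca$x)

# Squared cosines: share of each variable's variance captured by each axis.
cos2 <- var_coord^2

# Quality of representation on the plane (1, 2); values close to 1 mean the
# variable's arrow nearly reaches the correlation circle.
plane_quality <- rowSums(cos2[, 1:2])
sort(plane_quality, decreasing = TRUE)
```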
Figure 3.4 - Qualitative factor map (PCA). The labeled categories are those best represented on the plane.
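In a qualitative factor map, each category of a supplementary variable is placed at the barycentre of the individuals that belong to it. The following base R sketch shows the computation on the built-in `iris` data, with `Species` playing the role of the illustrative variable; it is an illustration, not the TOP500 analysis.

```r
# Illustration on a built-in dataset (an assumption, not the TOP500 data).
pca <- prcomp(iris[, 1:4], scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])  # coordinates on dimensions 1 and 2

# Each category sits at the mean coordinates of its individuals.
centroids <- aggregate(scores, by = list(Species = iris$Species), FUN = mean)
centroids
```

Categories whose barycentres lie far from the origin and far from each other are the ones the plane separates well.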
Dimension 1 opposes individuals such as 2, 4, 3, 1 and 5 (to the right of the graph, characterized by a strongly positive coordinate on the axis) to individuals such as 66, 91 and 83 (to the left of the graph, characterized by a strongly negative coordinate on the axis).
The group containing individuals 2 and 4 (characterized by a positive coordinate on the axis) shares:
The group containing individual 5 (characterized by a positive coordinate on the axis) shares:
The group containing individual 1 (characterized by a positive coordinate on the axis) shares:
The group containing individual 3 (characterized by a positive coordinate on the axis) shares:
The group containing individuals 66, 91 and 83 (characterized by a negative coordinate on the axis) shares:
Dimension 2 opposes individuals such as 2, 4, 1, 66, 8, 91, 83 and 35 (to the top of the graph, characterized by a strongly positive coordinate on the axis) to individuals such as 3, 6, 7, 27 and 26 (to the bottom of the graph, characterized by a strongly negative coordinate on the axis).
The group containing individuals 66, 91 and 83 (characterized by a positive coordinate on the axis) shares:
The group containing individuals 8 and 35 (characterized by a positive coordinate on the axis) shares:
The group containing individuals 2 and 4 (characterized by a positive coordinate on the axis) shares:
The group containing individual 1 (characterized by a positive coordinate on the axis) shares:
Group 5 (characterized by a negative coordinate on the axis) shares:
The group containing individuals 6, 7, 27 and 26 (characterized by a negative coordinate on the axis) shares:
The group containing individual 3 (characterized by a negative coordinate on the axis) shares:
Figure 4.1 - Individuals factor map (PCA). The labeled individuals are those with the highest contribution to the construction of the plane.
The Wilks test p-value indicates which qualitative variable has its categories best separated on the plane (i.e. which one best explains the distance between individuals).
## Country InterconnectFamily Architecture
## 9.368191e-08 4.173656e-06 2.989679e-02
The best qualitative variable to illustrate the distance between individuals on this plane is Country.
Figure 4.2 - Individuals factor map (PCA). The labeled individuals are those with the highest contribution to the construction of the plane. The individuals are coloured according to their category for the variable Country.
Figure 4.3 - Variables factor map (PCA). The labeled variables are those best represented on the plane.
Figure 4.4 - Qualitative factor map (PCA). The labeled categories are those best represented on the plane.
Dimension 3 opposes individuals such as 78, 21, 85, 20, 14 and 35 (to the right of the graph, characterized by a strongly positive coordinate on the axis) to individuals such as 28, 30, 9, 23 and 17 (to the left of the graph, characterized by a strongly negative coordinate on the axis).
The group containing individuals 21, 20, 14 and 35 (characterized by a positive coordinate on the axis) shares:
The group containing individuals 78 and 85 (characterized by a positive coordinate on the axis) shares:
The group containing individuals 28, 30, 9, 23 and 17 (characterized by a negative coordinate on the axis) shares: