Preface

Previously we had made data mining tour for HPC (High Performance Computing) or supercomputers, applying bunch of statistical methods and models for the Top500 superpowerfull mainframes in the world - http://www.top500.org.

For the previous topics on this item see http://rpubs.com/alex-lev/216789, https://rpubs.com/alex-lev/71014 or for PDF versions - https://independent.academia.edu/AlexLev2.

Now we want to compare to date positions of two leaders of the ongoing HPC race: United States and China.

Data

To this end we use the same source of the most fresh data (11.2017) - http://www.top500.org/statistics/sublist/.

We omit some obvious manipulations with variable names.

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(broom)
library(tidyr)
library(corrplot)

## corrplot 0.84 loaded

library(ggpmisc)
library(scales)
load(file = "top500_17.dat")
names(top.500.17)

##  [1] "Rank"                          "Previous.Rank"                
##  [3] "First.Appearance"              "First.Rank"                   
##  [5] "Name"                          "Computer"                     
##  [7] "Site"                          "Manufacturer"                 
##  [9] "Country"                       "Year"                         
## [11] "Segment"                       "Total.Cores"                  
## [13] "Accelerator.CoProcessor.Cores" "Rmax"                         
## [15] "Rpeak"                         "Nmax"                         
## [17] "Nhalf"                         "HPCG"                         
## [19] "Power"                         "Power.Source"                 
## [21] "Power.Effeciency"              "Architecture"                 
## [23] "Processor"                     "Processor.Technology"         
## [25] "Processor.Speed"               "Operating.System"             
## [27] "OS.Family"                     "Accelerator.CoProcessor"      
## [29] "Cores.per.Socket"              "Processor.Generation"         
## [31] "System.Model"                  "System.Family"                
## [33] "Interconnect.Family"           "Interconnect"                 
## [35] "Region"                        "Continent"                    
## [37] "Site.ID"                       "System.ID"

top.500.17.tbl <- tibble::as_tibble(top.500.17)

First, we must prove the leadership of two players in this race. Let’s do it!

top.500.17.tbl %>% select(Rank, Name, Country, Rpeak, Total.Cores)%>%
  group_by(Rank)

top.500.17.tbl %>% 
  count(Country) %>% 
  mutate(Mainframes=n, Percent=(n/500)*100) %>% 
  select(Country,Mainframes,Percent)%>%
  arrange(desc(Mainframes),Percent) %>% top_n(.,5)

## Selecting by Percent

TOP5 <- top.500.17.tbl %>% 
  count(Country) %>% top_n(.,5) %>% arrange(n)

## Selecting by n

  ggplot(TOP5,aes(x="",y=n,fill=Country))+
  geom_bar(stat = "identity") +
  coord_polar(theta = "y",start = 0) +
  labs(x="",y="")

top.500.17.tbl %>% group_by(Country) %>% 
  summarise(Rpeak.Sum=sum(Rpeak),Total.Cores.Sum=sum(Total.Cores))%>%
  arrange(desc(Rpeak.Sum))

As we can see inspite of the fact that supercomputers of Japan (7%), Germany (4.2%), France(3.6%) and United Kingdom (3%) are included in the top HPC mainframes of the world, only China (40.4%) and US (28.6%) have got in total 69% of all mainframes in the top500 list i.e. roughly 2/3.

Meanwhile HPC mainframe of Switzerland (Piz Daint) is the third in the first dozen of top500 list - very curious fact! The other interesting fact produced by this table is that Japan has got 26331160 total cores of its mainframes in the top500 list i.e. more than China and US, being the third in terms of Total.Rpeak (136440) after China (524584) and US (391614). There is one more interesting fact - China has got Total.Rpeak roughly equal to the sum of the US and Japan Total.Rpeak values. So what? Does it mean China is the leader of the HPC race?

China as HPC rival for the United States

Now we reduce our data to the pair of two rivals (China and US) to compare their current positions in this outstanding HPC race.

topUSCH <- top.500.17.tbl %>% filter(Country==c("China","United States")) %>% select(Country,Manufacturer,Segment,Total.Cores,Rpeak,Rmax,Processor.Speed,Power,Architecture,Accelerator.CoProcessor.Cores)

topUSCH%>%group_by(Country,Segment)%>%count()

topUSCH%>%ggplot(.,aes(Segment)) + geom_bar(aes(fill=Country))

topUSCH%>%group_by(Country,Architecture)%>%count()

topUSCH%>%ggplot(.,aes(Architecture)) + geom_bar(aes(fill=Country))

topUSCH%>%filter(Country=="China")%>%group_by(Country,Segment,Manufacturer)%>%
  count()%>%arrange(desc(n))

topUSCH%>%filter(Country=="China")%>%ggplot(.,aes(Segment)) + 
  geom_bar(aes(fill=Manufacturer))

topUSCH%>%filter(Country=="United States")%>%group_by(Country,Segment,Manufacturer)%>%count()%>%arrange(desc(n))

topUSCH%>%filter(Country=="United States")%>%ggplot(.,aes(Segment)) + 
  geom_bar(aes(fill=Manufacturer))

So what we see? First, China has got 93 HPC mainframes in Industry while US has got only 37 HPC mainframes (roughly one third -1/3) in this segment. At the same time US has got 22 HPC mainframes in Research while China has got only 4 pieces in this segment (five times less)! In Academic segment US dominates on China too (7:2 i.e more than three times). Both China and US have got 6 HPC mainframes in Government segment - very curious fact for two countries so unfamular in all directions. On the whole China dominates on US in total number of HPC mainframes out of top500 list (202:143). Second, both countries have got Cluster Architecture as the most usable in HPC. At the same time US has got 11 MPP HPC pieces out of 73 total (15%), while China has got only 2 MPP HPC pieces out of 202 (1%). Third, Lenovo is the leading manufacturor in Industry HPC pieces of China, while in the US Industry we can see HPE as the leading of manufacturor.

Mean and sum values

Now we a ready to compare both countries in terms of mean and sum values of Rpeak, Rmax and Total.Cores.

topUSCH%>%group_by(Country)%>%
  summarise("M[Rmax]"=mean(Rmax),"S[Rmax]"=sum(Rmax),
            "M[Rpeak]"=mean(Rpeak),"S[Rpeak]"=sum(Rmax),
            "M[Total.Cores]"=mean(Total.Cores),"S[Total.Cores]"=sum(Total.Cores))

As we can see China dominates totally except M[Rmax] - mean value of Rmax. The most significant dominance of China to the US is in the sum of Total.Cores of top500 list - S[Total.Cores] i.e. 15318764/6136416 = 2.5 times! Is China the true leader?

Linear regression model

Now we are ready for comparison of two countries Rpeak/Total.Cores values.

topUSCH%>%group_by(Country)%>%do(tidy(lm(log(Rpeak)~log(Total.Cores),data = .)))

topUSCH%>%group_by(Country)%>%filter(Segment=="Industry")%>%
  do(tidy(lm(log(Rpeak)~log(Total.Cores),data = .)))

So our linear model looks like \(Rpeak=f(Total.Cores)\) or after some transformation as \(log(Rpeak)=f(log(Total.Cores))\).

The results of linear regression models are significant. Good news for both countries i.e China and US have produced Rpeak out of Total.Cores with some positive effect or correlation. The bad news is that the US dominates in terms of \(b\) coefficient in the equation \(y=a+bx\) i.e. US HPC mainframes are more effective than that of China on average by 0.846/0.578=1.46 times especially in Industry segment (1.07/0.37=2.89 or roughly three times). It means that the US is the leader in technology of HPC while China is drastically increasing number of mainframes and Total.Cores with less effectiveness except Research segment where we can see a real breakthrough of China. We can prove it once more but later on.

ggplot(topUSCH,aes(x=log(Total.Cores),y=log(Rpeak),col=Country))+
  geom_smooth(method="lm") + geom_point() + facet_wrap(~Country) +
  ggtitle("Rpeak ~ Total.Cores linear regression")

topUSCH%>%group_by(Country)%>%filter(Segment!="Vendor")%>%
  ggplot(.,aes(log(Total.Cores),log(Rpeak),col=Country)) + 
  geom_point(na.rm = T) + geom_smooth(method = lm,se = F,na.rm = T) + facet_wrap(~Segment)+ggtitle("Rpeak ~ Total.Cores linear regression by Segment ")

topUSCH%>%group_by(Manufacturer)%>%
  filter(Manufacturer=="Lenovo"|Manufacturer=="HPE")%>%
  ggplot(.,aes(log(Total.Cores),log(Rpeak),col=Manufacturer)) + 
  geom_point(na.rm = T) + geom_smooth(method = lm,se = T,na.rm = T) +   facet_wrap(~Country)+
  ggtitle("Rpeak ~ Total.Cores linear regression by Manufacturer")

ggplot(topUSCH,aes(log(Total.Cores),fill=Country)) +
  geom_density(alpha=0.7) + ggtitle("Total.Cores density")

ggplot(topUSCH,aes(x=Country,y=log(Total.Cores),fill=Country)) +
  geom_violin() + geom_jitter() + ggtitle("Total.Cores boxplot")

topUSCH%>%group_by(Country)%>%
  summarise("Mean[Rmax/Rpeak]"=mean(Rmax/Rpeak, na.rm = T))

ggplot(topUSCH,aes(x=Country,y=Rmax/Rpeak,fill=Country)) +
  geom_violin() + geom_jitter() + ggtitle("Rmax/Rpeak boxplot")

ggplot(topUSCH,aes(Rmax/Rpeak,fill=Country)) +
  geom_density(alpha=0.5) + ggtitle("Rmax/Rpeak density")

US versus China is leading in Rmax/Rpeak too (0.69/0.52=1.32)!

Comparison of power effectiveness

What is the impact of HPC Power (kWt) on Rpeak? Are Rpeak and Power correlated? Let’s see!

topUSCH%>%group_by(Country)%>%
  summarise("S[Power]"=sum(Power,na.rm = T),"M[Power]"=mean(Power,na.rm = T),"M[Rpeak/Power]"=mean(Rpeak/Power,na.rm = T),"M[Rpeak]"=mean(Rpeak,na.rm = T),"Rpeak/Rmax~Power"=cor(Rpeak/Rmax,Power,use = "pairwise"),
"Rpeak~Power"=cor(Rpeak,Power,use = "pairwise"))

topUSCH%>%group_by(Country)%>%
  ggplot(.,aes(log(Power),log(Rpeak),col=Country)) + geom_point(na.rm = T) + geom_smooth(method = lm)+ facet_wrap(~Country) + 
  ggtitle("Rpeak ~ Power linear regression")

topUSCH%>%group_by(Country)%>%filter(Segment!="Vendor")%>%
  ggplot(.,aes(log(Power),log(Rpeak),col=Country)) + geom_point(na.rm = T) + geom_smooth(method = lm,se = F,na.rm = T) +facet_wrap(~Segment) + 
  ggtitle("Rpeak ~ Power linear regression by Segment")

topUSCH%>%group_by(Manufacturer)%>%
  filter(Manufacturer=="Lenovo"|Manufacturer=="HPE")%>%
  ggplot(.,aes(log(Power),log(Rpeak),col=Manufacturer)) + 
  geom_point(na.rm = T) + geom_smooth(method = lm,se = T,na.rm = T) + facet_wrap(~Country) + ggtitle("Rpeak ~ Power linear regression by Manufacturer")

topUSCH%>%group_by(Country)%>%
  ggplot(.,aes(log(Total.Cores),log(Accelerator.CoProcessor.Cores),col=Country))+
  geom_point(na.rm = T)+geom_smooth(method = lm,se = F) + ggtitle("Accelerator.CoProcessor.Cores ~ Total.Cores linear regression ")

topUSCH%>%group_by(Country)%>%
  ggplot(.,aes(log(Accelerator.CoProcessor.Cores),log(Rpeak),col=Country))+
  geom_point(na.rm = T)+geom_smooth(method = lm,se = F)+ 
  ggtitle("Rpeak ~ Accelerator.CoProcessor.Cores linear regression")

topUSCH%>%select(Country,Power)%>%ggplot(.,aes(x=Country,y=Power,fill=Country)) +
  geom_violin() + geom_jitter(na.rm = T) + ggtitle("Power boxplot")

ggplot(topUSCH,aes(Power,fill=Country)) +
  geom_density(alpha=0.5) + ggtitle("Power density")

topUSCH%>%select(Country,Power,Rpeak)%>%
  ggplot(.,aes(x=Country,y=Rpeak/Power,fill=Country)) +
  geom_violin() + geom_jitter(na.rm = T) + ggtitle("Rpeak/Power boxplot")

ggplot(topUSCH,aes(Rpeak/Power,fill=Country)) +
  geom_density(alpha=0.5) + ggtitle("Rpeak/Power density")

topUSCH%>%filter(Segment!="Vendor")%>%
  group_by(Country,Segment)%>%
  summarise("M[Rpeak/Power]"=mean(Rpeak/Power, na.rm = T),"M[Processor.Speed]"=mean(Processor.Speed,na.rm = T))

topUSCH%>%group_by(Country)%>%
  summarise("Rpeak~Power"=cor(Rpeak,Power, use = "pairwise"),
            "Rpeak~Total.Cores"=cor(Rpeak,Total.Cores, use = "pairwise"),
            "Rpeak~Accelerator.CoProcessor.Cores"=cor(Rpeak,Accelerator.CoProcessor.Cores, use = "pairwise"))

topUSCH%>%group_by(Country, Segment)%>%filter(Segment!="Vendor")%>%
  summarise("Rpeak~Power"=cor(Rpeak,Power, use = "pairwise"),
            "Rpeak~Total.Cores"=cor(Rpeak,Total.Cores, use = "pairwise"),
            "Rpeak~Accelerator.CoProcessor.Cores"=cor(Rpeak,Accelerator.CoProcessor.Cores, use = "pairwise"))

topUSCH%>%group_by(Country,Manufacturer)%>%
  filter(Manufacturer=="Lenovo", Country=="China")%>%
  summarise("Rpeak~Power"=cor(Rpeak,Power, use = "pairwise"),
            "Rpeak~Total.Cores"=cor(Rpeak,Total.Cores, use = "pairwise"),
            "Rpeak~Accelerator.CoProcessor.Cores"=cor(Rpeak,Accelerator.CoProcessor.Cores, use = "pairwise"))

topUSCH%>%group_by(Country,Manufacturer)%>%
  filter(Manufacturer=="HPE", Country=="United States")%>%
  summarise("Rpeak~Power"=cor(Rpeak,Power, use = "pairwise"),
            "Rpeak~Total.Cores"=cor(Rpeak,Total.Cores, use = "pairwise"),
            "Rpeak~Accelerator.CoProcessor.Cores"=cor(Rpeak,Accelerator.CoProcessor.Cores, use = "pairwise"))

We have got paired correlation values (Rpeak~Power) for China and US as 0.912 and 0.377. What does it mean?

First, both countries produce high values of Rpeak in their HPC mainframes applying adequate electric Power. Second, mainframes of China increase Rpeak by increasing input Power in 91% cases while US do the same increase of Rpeak in 38% of cases. Anyway both countries on average have got almost the same values of ratio Rpeak/Power. On the whole US mainframes consume more power than that of China (75659/73045) by 3% and in the average (1991/1460) by 36%. Very interesting result!

China having got 202 HPC mainframes in top500 list consumes less Power than US with 142 pieces. At the same time the increase of Rpeak for China mainframes is 91% correlated with increasing Power from one model to another in contrast to the US 38% correlation. That is the point and true paradox of the HPC race between two superpowers of the world!

At one glance

And now we can explore our data by means of correlation matrix to feel the difference. Let’s do it!

MCCH <- topUSCH%>%filter(Country=="China")%>%
  select(-Manufacturer,-Architecture,-Segment,-Country)%>%
  rename(TC=Total.Cores,PS=Processor.Speed,ACPC=Accelerator.CoProcessor.Cores)%>%
  cor(.,use = "pairwise")


MCUS <- topUSCH%>%filter(Country=="United States")%>%
  select(-Manufacturer,-Architecture,-Segment,-Country)%>%
  rename(TC=Total.Cores,PS=Processor.Speed,ACPC=Accelerator.CoProcessor.Cores)%>%
  cor(.,use = "pairwise")

#China correlation matrix
MCCH

##               TC      Rpeak       Rmax         PS      Power       ACPC
## TC     1.0000000  0.9990461  0.9994902 -0.4816373  0.9072462  0.9923557
## Rpeak  0.9990461  1.0000000  0.9992121 -0.4707862  0.9122296  0.2731432
## Rmax   0.9994902  0.9992121  1.0000000 -0.4741523  0.9028667  0.1479453
## PS    -0.4816373 -0.4707862 -0.4741523  1.0000000 -0.5419273 -0.2374040
## Power  0.9072462  0.9122296  0.9028667 -0.5419273  1.0000000  0.6164714
## ACPC   0.9923557  0.2731432  0.1479453 -0.2374040  0.6164714  1.0000000

#US correlation matrix
MCUS

##               TC      Rpeak       Rmax         PS      Power       ACPC
## TC     1.0000000  0.7964077  0.9186063 -0.4266470  0.4673717  0.2519757
## Rpeak  0.7964077  1.0000000  0.9559716 -0.5823495  0.3776678  0.4063090
## Rmax   0.9186063  0.9559716  1.0000000 -0.5370755  0.4398240  0.4675032
## PS    -0.4266470 -0.5823495 -0.5370755  1.0000000 -0.2187824 -0.2437717
## Power  0.4673717  0.3776678  0.4398240 -0.2187824  1.0000000 -0.3116844
## ACPC   0.2519757  0.4063090  0.4675032 -0.2437717 -0.3116844  1.0000000

par(mfrow=c(1,2))
corrplot(MCCH,method = "ellipse",title = "China")
corrplot(MCUS,method = "ellipse",title = "US")

topUSCH%>%ggplot(.,aes(log(Power),log(Rpeak),col=Country)) +
  geom_point() + geom_smooth(method = lm,se = F,na.rm = T) + 
  ggtitle("Linear regression Rpeak~Power")

## Warning: Removed 90 rows containing missing values (geom_point).

topUSCH%>%ggplot(.,aes(log(Rmax),log(Rpeak),col=Country)) +
  geom_point() +geom_smooth(method = lm,se = F) + 
  ggtitle("Linear regression Rpeak~Rmax")

topUSCH%>%ggplot(.,aes(log(Total.Cores),log(Rpeak),col=Country)) +
  geom_point() +geom_smooth(method = lm,se = F) + 
  ggtitle("Linear regression Rpeak~Total.Cores")

topUSCH%>%
  ggplot(.,aes(log(Total.Cores),log(Accelerator.CoProcessor.Cores),col=Country))+  geom_point() +geom_smooth(method = lm,se = F,na.rm = T)+ 
  ggtitle("Linear regression Accelerator.CoProcessor.Cores~Total.Cores")

## Warning: Removed 145 rows containing missing values (geom_point).

topUSCH%>%ggplot(.,aes(x=log(Total.Cores),y=log(Rpeak),col=Country))+
  geom_density2d(binwidth = 0.01, na.rm = T)+
  ggtitle("Density contour plot for Rpeak~Total.Cores")

topUSCH%>%ggplot(.,aes(x=log(Rmax),y=log(Rpeak),col=Country))+
  geom_density2d(binwidth = 0.01, na.rm = T)+
  ggtitle("Density contour plot for Rpeak~Rmax")

topUSCH%>%ggplot(.,aes(x=log(Power),y=log(Rpeak),col=Country))+
  geom_density2d(binwidth = 0.01,na.rm = T)+
  ggtitle("Density contour plot for Rpeak~Power")

topUSCH%>%
  ggplot(.,aes(x=log(Total.Cores),y=log(Accelerator.CoProcessor.Cores),col=Country))+
  geom_density2d(binwidth = 0.01,na.rm = T)+
  ggtitle("Density contour plot for Accelerator.CoProcessor.Cores~Total.Cores")

topUSCH%>%ggplot(., aes(log(Total.Cores), log(Rpeak))) +
  geom_polygon(aes(fill = Country)) + facet_wrap(~Country)

US-China HPC race