IBM-Summit

1 Introduction

We have discussed the HPC race previously. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777, https://rpubs.com/alex-lev/708382

2 TOP500 data

Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/

3 Filtering Data

Note: the original variable names were changed (truncated, concatenated and simplified) as far as possible to make data exploration and visualization easier.

library(readxl)
library(tidyverse)
library(tidyquant)
library(broom)
library(DT)
library(knitr)
library(plot3D)



TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)
##  [1] "Rank"                        "PreviousRank"               
##  [3] "FirstAppearance"             "FirstRank"                  
##  [5] "Name"                        "Computer"                   
##  [7] "Site"                        "Manufacturer"               
##  [9] "Country"                     "Year"                       
## [11] "Segment"                     "TotalCores"                 
## [13] "AcceleratorCoProcessorCores" "Rmax"                       
## [15] "Rpeak"                       "Nmax"                       
## [17] "Nhalf"                       "HPCG"                       
## [19] "Power"                       "PowerSource"                
## [21] "PowerEfficiency"             "Architecture"               
## [23] "Processor"                   "ProcessorTechnology"        
## [25] "ProcessorSpeed"              "OperatingSystem"            
## [27] "OSFamily"                    "AcceleratorCoProcessor"     
## [29] "CoresperSocket"              "ProcessorGeneration"        
## [31] "SystemModel"                 "SystemFamily"               
## [33] "InterconnectFamily"          "Interconnect"               
## [35] "Continent"                   "SiteID"                     
## [37] "SystemID"
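
Section 4 removes rows with missing values through na.omit() before fitting each model, so the regressions may use fewer systems than a country has in the list. The following is a minimal sketch of how the number of dropped rows could be checked per country. It runs on a made-up toy data frame (`toy`, `kept` and the country labels "A" and "B" are illustrative names, not part of the TOP500 file).

```r
# Toy data frame standing in for TOP500_202011 (all values are made up).
toy <- data.frame(
  Country    = c("A", "A", "A", "B", "B"),
  Rmax       = c(10, 20, 30, 40, 50),
  Nmax       = c(1e6, NA, 3e6, NA, NA),
  TotalCores = c(1000, 2000, NA, 4000, 5000)
)

# TRUE for rows where all three regression variables are present.
complete <- complete.cases(toy[, c("Rmax", "Nmax", "TotalCores")])

# A factor keeps countries with zero complete rows in the counts.
country <- factor(toy$Country)

# Systems per country before and after dropping incomplete rows.
kept <- data.frame(
  listed    = as.vector(table(country)),
  used      = as.vector(table(country[complete])),
  row.names = levels(country)
)
print(kept)
```

Applied to the real data, the same comparison would show how representative each country's regression sample is of its full TOP500 entry.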

4 Multiple Linear Regression

Now we can compare the five countries (USA, China, Japan, France, Germany) that lead the TOP500 race by total number of HPC systems (see https://rpubs.com/alex-lev/694840). Here we apply multiple linear regression to compare the pace of the race in terms of the coefficients \[ln(Z_i)=B_0+B_1ln(X_i) + B_2ln(Y_i)+E_i \] that is, \[ln(Rmax_i) = B_0+B_1ln(Nmax_i) + B_2ln(TotalCores_i) +E_i\]
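
Because both sides of the model are in logarithms, the slopes can be read as elasticities. As a short sketch using the same symbols, exponentiating both sides gives a power law: \[Rmax_i = e^{B_0}\, Nmax_i^{B_1}\, TotalCores_i^{B_2}\, e^{E_i}\] so that \[B_1 = \frac{\partial\, ln(Rmax)}{\partial\, ln(Nmax)}, \qquad B_2 = \frac{\partial\, ln(Rmax)}{\partial\, ln(TotalCores)}\] Under this reading, a 1% increase in \(TotalCores\) with \(Nmax\) held fixed is associated with a change of roughly \(B_2\)% in \(Rmax\), and likewise for \(B_1\) and \(Nmax\).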

4.1 USA

TOP500.US <- TOP500_202011 %>% filter(Country=="United States") %>% select(Country,Name,Rmax,Nmax,TotalCores) %>% na.omit()
TOP500.US %>% datatable()
fit.us <- lm(log(Rmax)~log(Nmax)+log(TotalCores),TOP500.US)
summary(fit.us)
## 
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.US)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19758 -0.16161 -0.05088  0.29068  1.15044 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.63614    1.32271   1.237 0.219354    
## log(Nmax)       -0.34927    0.09287  -3.761 0.000303 ***
## log(TotalCores)  1.03993    0.05428  19.157  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4003 on 89 degrees of freedom
## Multiple R-squared:  0.8123, Adjusted R-squared:  0.8081 
## F-statistic: 192.6 on 2 and 89 DF,  p-value: < 2.2e-16
 ggplot(TOP500.US,aes(x=log(Nmax),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="darkblue") + 
  theme_bw() + ggtitle("Rmax ~ Nmax linear regression USA")+
  theme(plot.title = element_text(hjust = .5))

 ggplot(TOP500.US,aes(x=log(TotalCores),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="darkblue") + 
  theme_bw() + ggtitle("Rmax ~ TotalCores linear regression USA")+
  theme(plot.title = element_text(hjust = .5))

 df.us <- TOP500.US %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>% 
   select(LCORES,LNMAX,LRMAX)
 
 hpc_3dplot <- function(df){
 
 with (df, {

  # linear regression
  fit <- lm(LRMAX ~ LNMAX + LCORES)

  NN <- 10
  # predict values on regular xy grid
  cores.pred <- seq(min(df$LCORES), max(df$LCORES), length.out = NN)
  nmax.pred <- seq(min(df$LNMAX), max(df$LNMAX), length.out = NN)
  xy <- expand.grid(LCORES = cores.pred,
                    LNMAX = nmax.pred)

  # point predictions only: the surface needs the fitted values, not interval bounds
  rmax.pred <- matrix (nrow = NN, ncol = NN,
                      data = predict(fit, newdata = data.frame(xy)))

  # fitted points for droplines to surface
  fitpoints <- predict(fit)

  scatter3D(z = LRMAX, x = LCORES, y = LNMAX, pch = 18, cex = 2,
            theta = 20, phi = 50, ticktype = "detailed",
            xlab = "Cores", ylab = "Nmax", zlab = "Rmax",
            surf = list(x = cores.pred, y = nmax.pred, z = rmax.pred,
                        facets = NA, fit = fitpoints),
            main = "log(Rmax)~log(Nmax)+log(TotalCores)")

})
 }
 
 hpc_3dplot(df.us)
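
The regression surface in hpc_3dplot() depends on the prediction grid being reshaped in the right orientation. The sketch below illustrates that step in isolation on synthetic data (the data frame and its coefficients are invented for illustration; only the column names LCORES, LNMAX and LRMAX mirror the function above). It checks that one cell of the matrix equals the prediction at the matching grid point.

```r
# Synthetic data only; names mirror the columns used by hpc_3dplot().
set.seed(1)
df <- data.frame(LCORES = runif(30, 8, 14), LNMAX = runif(30, 13, 16))
df$LRMAX <- 1 + 0.9 * df$LCORES - 0.2 * df$LNMAX + rnorm(30, sd = 0.1)

fit <- lm(LRMAX ~ LNMAX + LCORES, data = df)

NN <- 4
cores.pred <- seq(min(df$LCORES), max(df$LCORES), length.out = NN)
nmax.pred  <- seq(min(df$LNMAX),  max(df$LNMAX),  length.out = NN)

# expand.grid() varies its first argument fastest, so filling a matrix
# column by column puts LCORES along rows and LNMAX along columns.
xy <- expand.grid(LCORES = cores.pred, LNMAX = nmax.pred)
rmax.pred <- matrix(predict(fit, newdata = xy), nrow = NN, ncol = NN)

# Spot check: cell [2, 3] is the prediction at (cores.pred[2], nmax.pred[3]).
single <- predict(fit, newdata = data.frame(LCORES = cores.pred[2],
                                            LNMAX  = nmax.pred[3]))
print(all.equal(rmax.pred[2, 3], unname(single)))
```

This orientation matches the surf argument of scatter3D(), where x is the cores grid and y is the Nmax grid.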

4.2 CHINA

TOP500.CH <- TOP500_202011 %>% filter(Country=="China") %>% select(Country,Name,Rmax,TotalCores,Nmax) %>% na.omit()
TOP500.CH  %>% datatable()
fit.china <- lm(log(Rmax)~log(Nmax)+log(TotalCores),TOP500.CH)
summary(fit.china)
## 
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.CH)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75592 -0.32587  0.05056  0.27277  0.70907 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.48867    0.95656   3.647 0.000498 ***
## log(Nmax)       -0.21237    0.07320  -2.901 0.004926 ** 
## log(TotalCores)  0.66443    0.04635  14.335  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.319 on 72 degrees of freedom
## Multiple R-squared:  0.7646, Adjusted R-squared:  0.7581 
## F-statistic:   117 on 2 and 72 DF,  p-value: < 2.2e-16
 ggplot(TOP500.CH,aes(x=log(Nmax),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="red") + 
  theme_bw() + ggtitle("Rmax ~ Nmax linear regression CHINA")+
  theme(plot.title = element_text(hjust = .5))

 ggplot(TOP500.CH,aes(x=log(TotalCores),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="red") + 
  theme_bw() + ggtitle("Rmax ~ TotalCores linear regression CHINA")+
  theme(plot.title = element_text(hjust = .5))

 df.ch <- TOP500.CH %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>% 
   select(LCORES,LNMAX,LRMAX)
 
 hpc_3dplot(df.ch)

4.3 JAPAN

TOP500.JP <- TOP500_202011 %>% filter(Country=="Japan") %>% select(Country,Name,Rmax,TotalCores,Nmax) %>% na.omit()
TOP500.JP %>% datatable()
fit.jap <- lm(log(Rmax)~log(Nmax)+log(TotalCores),TOP500.JP)
summary(fit.jap)
## 
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.JP)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9429 -0.3503 -0.1971  0.1920  1.8128 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.69250    3.19272   0.217    0.830    
## log(Nmax)       -0.03671    0.26858  -0.137    0.893    
## log(TotalCores)  0.72808    0.13536   5.379 2.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6984 on 22 degrees of freedom
## Multiple R-squared:  0.7092, Adjusted R-squared:  0.6827 
## F-statistic: 26.82 on 2 and 22 DF,  p-value: 1.259e-06
ggplot(TOP500.JP,aes(x=log(Nmax),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="yellow")  +
  theme_bw() + ggtitle("Rmax ~ Nmax linear regression JAPAN")+
  theme(plot.title = element_text(hjust = .5))

 ggplot(TOP500.JP,aes(x=log(TotalCores),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="yellow")  +
  theme_bw() + ggtitle("Rmax ~ TotalCores linear regression JAPAN")+
  theme(plot.title = element_text(hjust = .5))

 df.jp <- TOP500.JP %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>% 
   select(LCORES,LNMAX,LRMAX)
 
 hpc_3dplot(df.jp)

4.4 GERMANY

TOP500.GR <- TOP500_202011 %>% filter(Country=="Germany") %>% select(Country,Name,Rmax,TotalCores,Nmax) %>% na.omit()
TOP500.GR %>% datatable()
fit.gr <- lm(log(Rmax)~log(Nmax) + log(TotalCores),TOP500.GR)
summary(fit.gr)
## 
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.GR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16097 -0.26248  0.06464  0.35123  0.79748 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.3965     7.4404  -0.053  0.95831    
## log(Nmax)        -0.2448     0.5847  -0.419  0.68231    
## log(TotalCores)   1.0782     0.2377   4.536  0.00056 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.533 on 13 degrees of freedom
## Multiple R-squared:  0.7209, Adjusted R-squared:  0.678 
## F-statistic: 16.79 on 2 and 13 DF,  p-value: 0.0002495
 ggplot(TOP500.GR,aes(x=log(Nmax),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="green")  +
  theme_bw() + ggtitle("Rmax ~ Nmax linear regression GERMANY")+
  theme(plot.title = element_text(hjust = .5))

 ggplot(TOP500.GR,aes(x=log(TotalCores),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="green")  +
  theme_bw() + ggtitle("Rmax ~ TotalCores linear regression GERMANY")+
  theme(plot.title = element_text(hjust = .5))

 df.gr <- TOP500.GR %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>% 
   select(LCORES,LNMAX,LRMAX)
 
 hpc_3dplot(df.gr)

4.5 FRANCE

TOP500.FR <- TOP500_202011 %>% filter(Country=="France") %>% select(Country,Name,Rmax,TotalCores,Nmax) %>% na.omit()
TOP500.FR %>% datatable()
fit.fr <- lm(log(Rmax)~log(Nmax)+log(TotalCores),TOP500.FR)
summary(fit.fr)
## 
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.FR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66358 -0.29596 -0.03478  0.30936  0.71756 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.4778     2.9369   0.844 0.412104    
## log(Nmax)        -0.3039     0.2671  -1.138 0.273074    
## log(TotalCores)   0.9002     0.1782   5.053 0.000143 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4134 on 15 degrees of freedom
## Multiple R-squared:  0.7264, Adjusted R-squared:   0.69 
## F-statistic: 19.92 on 2 and 15 DF,  p-value: 5.997e-05
 ggplot(TOP500.FR,aes(x=log(Nmax),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="blue")  +
  theme_bw() + ggtitle("Rmax ~ Nmax linear regression FRANCE")+
  theme(plot.title = element_text(hjust = .5))

ggplot(TOP500.FR,aes(x=log(TotalCores),y=log(Rmax)))+
  geom_smooth(method="lm") + geom_point(col="blue")  +
  theme_bw() + ggtitle("Rmax ~ TotalCores linear regression FRANCE")+
  theme(plot.title = element_text(hjust = .5))

df.fr <- TOP500.FR %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>% 
   select(LCORES,LNMAX,LRMAX)
 
 hpc_3dplot(df.fr)

4.6 Comparison

TOP500.5 <- rbind(TOP500.US,TOP500.CH,TOP500.JP,TOP500.FR,TOP500.GR)

ggplot(TOP500.5,aes(x=Country,y=log(Rmax),fill=Country)) +
  geom_violin() + geom_jitter() + theme_bw() + ggtitle("Rmax by Country")+
  theme(plot.title = element_text(hjust = .5))

TOP500.5%>%group_by(Country)%>%
  do(tidy(lm(log(Rmax)~log(Nmax)+log(TotalCores),data = .))) %>% 
  kable(caption = "Linear model parameters by country",digits = 4)
Linear model parameters by country

| Country | term | estimate | std.error | statistic | p.value |
|:--------------|:----------------|--------:|--------:|--------:|------:|
| China | (Intercept) | 3.4887 | 0.9566 | 3.6471 | 0.0005 |
| China | log(Nmax) | -0.2124 | 0.0732 | -2.9014 | 0.0049 |
| China | log(TotalCores) | 0.6644 | 0.0463 | 14.3352 | 0.0000 |
| France | (Intercept) | 2.4778 | 2.9369 | 0.8437 | 0.4121 |
| France | log(Nmax) | -0.3039 | 0.2671 | -1.1377 | 0.2731 |
| France | log(TotalCores) | 0.9002 | 0.1782 | 5.0530 | 0.0001 |
| Germany | (Intercept) | -0.3965 | 7.4404 | -0.0533 | 0.9583 |
| Germany | log(Nmax) | -0.2448 | 0.5847 | -0.4186 | 0.6823 |
| Germany | log(TotalCores) | 1.0782 | 0.2377 | 4.5357 | 0.0006 |
| Japan | (Intercept) | 0.6925 | 3.1927 | 0.2169 | 0.8303 |
| Japan | log(Nmax) | -0.0367 | 0.2686 | -0.1367 | 0.8925 |
| Japan | log(TotalCores) | 0.7281 | 0.1354 | 5.3789 | 0.0000 |
| United States | (Intercept) | 1.6361 | 1.3227 | 1.2370 | 0.2194 |
| United States | log(Nmax) | -0.3493 | 0.0929 | -3.7610 | 0.0003 |
| United States | log(TotalCores) | 1.0399 | 0.0543 | 19.1574 | 0.0000 |

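
To give the slopes a more tangible reading, the sketch below converts the estimates from the table above into multiplicative effects on Rmax. It uses only the reported coefficients. The scenarios (a 10% larger Nmax, a doubling of TotalCores) and the variable names are chosen here for illustration, and the Nmax slopes are shown only for the two countries where they are significant at the 0.05 level.

```r
# Coefficients copied from the table above.
b1 <- c(USA = -0.3493, China = -0.2124)
b2 <- c(Germany = 1.0782, USA = 1.0399, France = 0.9002,
        Japan = 0.7281, China = 0.6644)

# Multiplicative change in Rmax for a 10% larger Nmax, TotalCores fixed.
nmax_effect <- 1.10^b1

# Multiplicative change in Rmax when TotalCores doubles, Nmax fixed.
cores_effect <- 2^b2

print(round(100 * (nmax_effect - 1), 1))  # percent change in Rmax
print(round(cores_effect, 2))             # factor by which Rmax scales
```

Under the fitted models, doubling the core count scales Rmax by slightly more than 2 for Germany and the USA and by roughly 1.6 to 1.9 for France, Japan and China, while a 10% larger Nmax corresponds to a drop in Rmax of about 2% to 3% for China and the USA.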
 ggplot(TOP500.5,aes(x=log(Nmax),y=log(Rmax),col=Country))+
  geom_smooth(method="lm",se=F) + geom_point()  +
  theme_bw() + ggtitle("Rmax ~ Nmax linear regression ")+
  theme(plot.title = element_text(hjust = .5))

 ggplot(TOP500.5,aes(x=log(TotalCores),y=log(Rmax),col=Country))+
  geom_smooth(method="lm",se=F) + geom_point()  +
  theme_bw() + ggtitle("Rmax ~ TotalCores linear regression ")+
  theme(plot.title = element_text(hjust = .5))

 TOP500.5%>%ggplot(.,aes(x=log(Nmax),y=log(Rmax),col=Country))+
  geom_density2d(binwidth = 0.01, na.rm = T)+
  theme_bw() + ggtitle("Density contour plot for Rmax~Nmax")+
  theme(plot.title = element_text(hjust = .5))

 TOP500.5%>%ggplot(.,aes(x=log(TotalCores),y=log(Rmax),col=Country))+
  geom_density2d(binwidth = 0.01, na.rm = T)+
  theme_bw() + ggtitle("Density contour plot for Rmax~TotalCores")+
  theme(plot.title = element_text(hjust = .5))
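
The comparison relies on p-values to judge which slopes are reliable. A complementary view is the confidence interval of each coefficient. The sketch below shows the idea with base R confint() on synthetic data (the data frame `toy` and its true slopes are invented for illustration and are not TOP500 values). The same call could be applied to fit.us, fit.china and the other fitted models.

```r
# Synthetic data only; true slopes are 0 for Nmax and 0.9 for TotalCores.
set.seed(42)
n <- 20
toy <- data.frame(Nmax       = exp(runif(n, 13, 16)),
                  TotalCores = exp(runif(n, 8, 14)))
toy$Rmax <- exp(1 + 0.9 * log(toy$TotalCores) + rnorm(n, sd = 0.5))

fit <- lm(log(Rmax) ~ log(Nmax) + log(TotalCores), data = toy)

# 95% confidence intervals for the intercept and both slopes.
ci <- confint(fit, level = 0.95)
print(ci)

# A coefficient is significant at the 5% level exactly when its
# 95% interval excludes zero, so the two screens agree.
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
p_below_05    <- summary(fit)$coefficients[, "Pr(>|t|)"] < 0.05
print(cbind(excludes_zero, p_below_05))
```

Intervals also show how wide the uncertainty is for the smaller samples, which a bare p-value does not convey.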

5 Conclusion

I. Applying the multiple linear regression \(ln(Rmax_i) = B_0+B_1ln(Nmax_i) + B_2ln(TotalCores_i) +E_i\), we can rank the leaders in the HPC race by the regression slopes \(B_1\) and \(B_2\).

II. Slope \(B_1\) describes how \(Rmax\) varies with \(Nmax\) when \(TotalCores\) is held constant. This gives the following order: 1. USA (-0.3493), 2. China (-0.2124). We excluded Japan, Germany and France because their \(B_1\) slopes are not statistically significant (p-values of about 0.89, 0.68 and 0.27 respectively). Both the USA and China have a negative \(B_1\), which means that, for a fixed number of cores, a smaller problem size \(Nmax\) is associated with a greater \(Rmax\). Nothing curious at all.

III. Slope \(B_2\) describes how \(Rmax\) varies with \(TotalCores\) when \(Nmax\) is held constant. This gives the following order: 1. Germany (1.0782), 2. USA (1.0399), 3. France (0.9002), 4. Japan (0.7281), 5. China (0.6644). The slope \(B_2\) shows how much performance \(Rmax\) is gained from additional \(TotalCores\) on the log scale. It can be read as a proxy for a country's HPC construction technology in a broad sense, i.e. an index of advanced HPC construction ability as well as usability.

IV. However, China accounts for about 43% of all HPC systems in the TOP500 list, while Japan has produced Fugaku, the number one HPC system in the world today. The race is not over!