IBM-Summit
We discussed the HPC race in earlier posts. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777, https://rpubs.com/alex-lev/708382
Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/
Note: the original variable names were shortened (truncated, concatenated and simplified) to make data exploration and visualization easier.
library(readxl)
library(tidyverse)
library(tidyquant)
library(broom)
library(DT)
library(knitr)
library(plot3D)
TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)
## [1] "Rank" "PreviousRank"
## [3] "FirstAppearance" "FirstRank"
## [5] "Name" "Computer"
## [7] "Site" "Manufacturer"
## [9] "Country" "Year"
## [11] "Segment" "TotalCores"
## [13] "AcceleratorCoProcessorCores" "Rmax"
## [15] "Rpeak" "Nmax"
## [17] "Nhalf" "HPCG"
## [19] "Power" "PowerSource"
## [21] "PowerEfficiency" "Architecture"
## [23] "Processor" "ProcessorTechnology"
## [25] "ProcessorSpeed" "OperatingSystem"
## [27] "OSFamily" "AcceleratorCoProcessor"
## [29] "CoresperSocket" "ProcessorGeneration"
## [31] "SystemModel" "SystemFamily"
## [33] "InterconnectFamily" "Interconnect"
## [35] "Continent" "SiteID"
## [37] "SystemID"
Now we can compare the five countries (USA, China, Japan, France, Germany) that lead the TOP500 race by total number of HPC mainframes (see https://rpubs.com/alex-lev/694840). Here we apply multiple linear regression and compare the pace of the race in terms of the fitted coefficients \[\ln(Z_i)=B_0+B_1\ln(X_i)+B_2\ln(Y_i)+E_i\] that is \[\ln(Rmax_i)=B_0+B_1\ln(Nmax_i)+B_2\ln(TotalCores_i)+E_i\]
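The log-log form above is equivalent to a power law, Rmax = exp(B_0) * Nmax^B_1 * TotalCores^B_2 * noise. As a minimal sketch of why `lm()` on logged variables recovers those exponents, the block below simulates data from an assumed power law and fits the same formula. All numbers here (the seed, ranges, and the "true" exponents) are made up for illustration and are not TOP500 data.

```r
# Synthetic illustration only: these values are invented, not TOP500 data.
set.seed(42)
n <- 200
nmax  <- exp(runif(n, 13, 16))   # hypothetical problem sizes
cores <- exp(runif(n, 9, 14))    # hypothetical core counts

# Assumed "true" power law: Rmax = exp(b0) * Nmax^b1 * Cores^b2 * noise
b0 <- 1.5; b1 <- -0.3; b2 <- 1.0
rmax <- exp(b0) * nmax^b1 * cores^b2 * exp(rnorm(n, sd = 0.1))

sim <- data.frame(Rmax = rmax, Nmax = nmax, TotalCores = cores)

# Taking logs turns the power law into the linear model used in this post
fit.sim <- lm(log(Rmax) ~ log(Nmax) + log(TotalCores), data = sim)
est <- coef(fit.sim)
print(round(est, 3))
```

The fitted slopes should land close to the assumed exponents, which is the sense in which B_1 and B_2 are read as elasticities in the rest of the post.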
TOP500.US <- TOP500_202011 %>% filter(Country=="United States") %>% select(Country,Name,Rmax,Nmax,TotalCores) %>% na.omit()
TOP500.US %>% datatable()
fit.us <- lm(log(Rmax)~log(Nmax)+log(TotalCores),TOP500.US)
summary(fit.us)
##
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19758 -0.16161 -0.05088 0.29068 1.15044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.63614 1.32271 1.237 0.219354
## log(Nmax) -0.34927 0.09287 -3.761 0.000303 ***
## log(TotalCores) 1.03993 0.05428 19.157 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4003 on 89 degrees of freedom
## Multiple R-squared: 0.8123, Adjusted R-squared: 0.8081
## F-statistic: 192.6 on 2 and 89 DF, p-value: < 2.2e-16
ggplot(TOP500.US,aes(x=log(Nmax),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="darkblue") +
theme_bw() + ggtitle("Rmax ~ Nmax linear regression USA")+
theme(plot.title = element_text(hjust = .5))
ggplot(TOP500.US,aes(x=log(TotalCores),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="darkblue") +
theme_bw() + ggtitle("Rmax ~ TotalCores linear regression USA")+
theme(plot.title = element_text(hjust = .5))
df.us <- TOP500.US %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>%
select(LCORES,LNMAX,LRMAX)
hpc_3dplot <- function(df){
with (df, {
# linear regression
fit <- lm(LRMAX ~ LNMAX + LCORES)
NN <- 10
# predict values on regular xy grid
cores.pred <- seq(min(df$LCORES), max(df$LCORES), length.out = NN)
nmax.pred <- seq(min(df$LNMAX), max(df$LNMAX), length.out = NN)
xy <- expand.grid(LCORES = cores.pred,
LNMAX = nmax.pred)
# point predictions only; interval = "prediction" would add lwr/upr columns
rmax.pred <- matrix(nrow = NN, ncol = NN,
data = predict(fit, newdata = xy))
# fitted points for droplines to surface
fitpoints <- predict(fit)
scatter3D(z = LRMAX, x = LCORES, y = LNMAX, pch = 18, cex = 2,
theta = 20, phi = 50, ticktype = "detailed",
xlab = "Cores", ylab = "Nmax", zlab = "Rmax",
surf = list(x = cores.pred, y = nmax.pred, z = rmax.pred,
facets = NA, fit = fitpoints),
main = "log(Rmax)~log(Nmax)+log(TotalCores)")
})
}
hpc_3dplot(df.us)
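A note on the grid step inside a function like `hpc_3dplot`: the regression surface needs an NN x NN matrix of point predictions. The sketch below, on synthetic data (the `toy` data frame and its coefficients are assumptions for illustration), shows what `predict()` returns with and without `interval = "prediction"` and how the values map onto the grid.

```r
# Synthetic data standing in for a data frame like df.us (values are made up)
set.seed(1)
toy <- data.frame(LCORES = runif(30, 9, 14), LNMAX = runif(30, 13, 16))
toy$LRMAX <- 1.5 + 1.0 * toy$LCORES - 0.3 * toy$LNMAX + rnorm(30, sd = 0.2)

fit <- lm(LRMAX ~ LNMAX + LCORES, data = toy)

NN <- 10
grid <- expand.grid(
  LCORES = seq(min(toy$LCORES), max(toy$LCORES), length.out = NN),
  LNMAX  = seq(min(toy$LNMAX),  max(toy$LNMAX),  length.out = NN)
)

# Without `interval`, predict() returns one fitted value per grid row
p.fit <- predict(fit, newdata = grid)
# With interval = "prediction" it returns a matrix with columns fit, lwr, upr
p.int <- predict(fit, newdata = grid, interval = "prediction")

# Only the point predictions belong in the NN x NN surface matrix.
# Rows follow LCORES and columns follow LNMAX, because expand.grid varies
# its first argument fastest and matrix() fills column by column.
surface <- matrix(p.fit, nrow = NN, ncol = NN)
```

Reshaping the three-column result directly would pass three times as many values as the matrix can hold, so taking the `fit` column (or omitting `interval`) keeps the surface unambiguous.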
TOP500.CH <- TOP500_202011 %>% filter(Country=="China") %>% select(Country,Name,Rmax,TotalCores,Nmax) %>% na.omit()
TOP500.CH %>% datatable()
fit.china <- lm(log(Rmax)~log(Nmax)+log(TotalCores),TOP500.CH)
summary(fit.china)
##
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.CH)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75592 -0.32587 0.05056 0.27277 0.70907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.48867 0.95656 3.647 0.000498 ***
## log(Nmax) -0.21237 0.07320 -2.901 0.004926 **
## log(TotalCores) 0.66443 0.04635 14.335 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.319 on 72 degrees of freedom
## Multiple R-squared: 0.7646, Adjusted R-squared: 0.7581
## F-statistic: 117 on 2 and 72 DF, p-value: < 2.2e-16
ggplot(TOP500.CH,aes(x=log(Nmax),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="red") +
theme_bw() + ggtitle("Rmax ~ Nmax linear regression CHINA")+
theme(plot.title = element_text(hjust = .5))
ggplot(TOP500.CH,aes(x=log(TotalCores),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="red") +
theme_bw() + ggtitle("Rmax ~ TotalCores linear regression CHINA")+
theme(plot.title = element_text(hjust = .5))
df.ch <- TOP500.CH %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>%
select(LCORES,LNMAX,LRMAX)
hpc_3dplot(df.ch)
TOP500.JP <- TOP500_202011 %>% filter(Country=="Japan") %>% select(Country,Name,Rmax,TotalCores,Nmax) %>% na.omit()
TOP500.JP %>% datatable()
fit.jap <- lm(log(Rmax)~log(Nmax)+log(TotalCores),TOP500.JP)
summary(fit.jap)
##
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.JP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9429 -0.3503 -0.1971 0.1920 1.8128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.69250 3.19272 0.217 0.830
## log(Nmax) -0.03671 0.26858 -0.137 0.893
## log(TotalCores) 0.72808 0.13536 5.379 2.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6984 on 22 degrees of freedom
## Multiple R-squared: 0.7092, Adjusted R-squared: 0.6827
## F-statistic: 26.82 on 2 and 22 DF, p-value: 1.259e-06
ggplot(TOP500.JP,aes(x=log(Nmax),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="yellow") +
theme_bw() + ggtitle("Rmax ~ Nmax linear regression JAPAN")+
theme(plot.title = element_text(hjust = .5))
ggplot(TOP500.JP,aes(x=log(TotalCores),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="yellow") +
theme_bw() + ggtitle("Rmax ~ TotalCores linear regression JAPAN")+
theme(plot.title = element_text(hjust = .5))
df.jp <- TOP500.JP %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>%
select(LCORES,LNMAX,LRMAX)
hpc_3dplot(df.jp)
TOP500.GR <- TOP500_202011 %>% filter(Country=="Germany") %>% select(Country,Name,Rmax,TotalCores,Nmax) %>% na.omit()
TOP500.GR %>% datatable()
fit.gr <- lm(log(Rmax)~log(Nmax) + log(TotalCores),TOP500.GR)
summary(fit.gr)
##
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.GR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16097 -0.26248 0.06464 0.35123 0.79748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3965 7.4404 -0.053 0.95831
## log(Nmax) -0.2448 0.5847 -0.419 0.68231
## log(TotalCores) 1.0782 0.2377 4.536 0.00056 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.533 on 13 degrees of freedom
## Multiple R-squared: 0.7209, Adjusted R-squared: 0.678
## F-statistic: 16.79 on 2 and 13 DF, p-value: 0.0002495
ggplot(TOP500.GR,aes(x=log(Nmax),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="green") +
theme_bw() + ggtitle("Rmax ~ Nmax linear regression GERMANY")+
theme(plot.title = element_text(hjust = .5))
ggplot(TOP500.GR,aes(x=log(TotalCores),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="green") +
theme_bw() + ggtitle("Rmax ~ TotalCores linear regression GERMANY")+
theme(plot.title = element_text(hjust = .5))
df.gr <- TOP500.GR %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>%
select(LCORES,LNMAX,LRMAX)
hpc_3dplot(df.gr)
TOP500.FR <- TOP500_202011 %>% filter(Country=="France") %>% select(Country,Name,Rmax,TotalCores,Nmax) %>% na.omit()
TOP500.FR %>% datatable()
fit.fr <- lm(log(Rmax)~log(Nmax)+log(TotalCores),TOP500.FR)
summary(fit.fr)
##
## Call:
## lm(formula = log(Rmax) ~ log(Nmax) + log(TotalCores), data = TOP500.FR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66358 -0.29596 -0.03478 0.30936 0.71756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4778 2.9369 0.844 0.412104
## log(Nmax) -0.3039 0.2671 -1.138 0.273074
## log(TotalCores) 0.9002 0.1782 5.053 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4134 on 15 degrees of freedom
## Multiple R-squared: 0.7264, Adjusted R-squared: 0.69
## F-statistic: 19.92 on 2 and 15 DF, p-value: 5.997e-05
ggplot(TOP500.FR,aes(x=log(Nmax),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="blue") +
theme_bw() + ggtitle("Rmax ~ Nmax linear regression FRANCE")+
theme(plot.title = element_text(hjust = .5))
ggplot(TOP500.FR,aes(x=log(TotalCores),y=log(Rmax)))+
geom_smooth(method="lm") + geom_point(col="blue") +
theme_bw() + ggtitle("Rmax ~ TotalCores linear regression FRANCE")+
theme(plot.title = element_text(hjust = .5))
df.fr <- TOP500.FR %>% mutate(LCORES=log(TotalCores),LNMAX=log(Nmax),LRMAX=log(Rmax)) %>%
select(LCORES,LNMAX,LRMAX)
hpc_3dplot(df.fr)
TOP500.5 <- rbind(TOP500.US,TOP500.CH,TOP500.JP,TOP500.FR,TOP500.GR)
ggplot(TOP500.5,aes(x=Country,y=log(Rmax),fill=Country)) +
geom_violin() + geom_jitter() + theme_bw() + ggtitle("Rmax by Country")+
theme(plot.title = element_text(hjust = .5))
TOP500.5%>%group_by(Country)%>%
do(tidy(lm(log(Rmax)~log(Nmax)+log(TotalCores),data = .))) %>%
kable(caption = "Linear model parameters by country",digits = 4)
| Country | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| China | (Intercept) | 3.4887 | 0.9566 | 3.6471 | 0.0005 |
| China | log(Nmax) | -0.2124 | 0.0732 | -2.9014 | 0.0049 |
| China | log(TotalCores) | 0.6644 | 0.0463 | 14.3352 | 0.0000 |
| France | (Intercept) | 2.4778 | 2.9369 | 0.8437 | 0.4121 |
| France | log(Nmax) | -0.3039 | 0.2671 | -1.1377 | 0.2731 |
| France | log(TotalCores) | 0.9002 | 0.1782 | 5.0530 | 0.0001 |
| Germany | (Intercept) | -0.3965 | 7.4404 | -0.0533 | 0.9583 |
| Germany | log(Nmax) | -0.2448 | 0.5847 | -0.4186 | 0.6823 |
| Germany | log(TotalCores) | 1.0782 | 0.2377 | 4.5357 | 0.0006 |
| Japan | (Intercept) | 0.6925 | 3.1927 | 0.2169 | 0.8303 |
| Japan | log(Nmax) | -0.0367 | 0.2686 | -0.1367 | 0.8925 |
| Japan | log(TotalCores) | 0.7281 | 0.1354 | 5.3789 | 0.0000 |
| United States | (Intercept) | 1.6361 | 1.3227 | 1.2370 | 0.2194 |
| United States | log(Nmax) | -0.3493 | 0.0929 | -3.7610 | 0.0003 |
| United States | log(TotalCores) | 1.0399 | 0.0543 | 19.1574 | 0.0000 |
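The table reports point estimates with standard errors, so it can help to see how wide the implied intervals are before comparing countries. The sketch below computes approximate 95% confidence intervals for the `log(TotalCores)` slope, using the estimates and standard errors copied from the table above and the residual degrees of freedom from the `summary()` outputs earlier in the post. The data frame name `b2` and its layout are choices made for this illustration.

```r
# Estimates and standard errors for log(TotalCores), copied from the table;
# residual degrees of freedom copied from the per-country summary() outputs
b2 <- data.frame(
  Country   = c("China", "France", "Germany", "Japan", "United States"),
  estimate  = c(0.6644, 0.9002, 1.0782, 0.7281, 1.0399),
  std.error = c(0.0463, 0.1782, 0.2377, 0.1354, 0.0543),
  df        = c(72, 15, 13, 22, 89)
)

# Two-sided 95% interval: estimate +/- t quantile * standard error
tcrit <- qt(0.975, b2$df)
b2$lower <- b2$estimate - tcrit * b2$std.error
b2$upper <- b2$estimate + tcrit * b2$std.error

print(b2[order(-b2$estimate), ], digits = 3)
```

On these numbers the United States and China intervals do not overlap, while the intervals for Germany, France and Japan are wide, which reflects the much smaller number of systems behind those fits.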
ggplot(TOP500.5,aes(x=log(Nmax),y=log(Rmax),col=Country))+
geom_smooth(method="lm",se=F) + geom_point() +
theme_bw() + ggtitle("Rmax ~ Nmax linear regression ")+
theme(plot.title = element_text(hjust = .5))
ggplot(TOP500.5,aes(x=log(TotalCores),y=log(Rmax),col=Country))+
geom_smooth(method="lm",se=F) + geom_point() +
theme_bw() + ggtitle("Rmax ~ TotalCores linear regression ")+
theme(plot.title = element_text(hjust = .5))
TOP500.5%>%ggplot(.,aes(x=log(Nmax),y=log(Rmax),col=Country))+
geom_density2d(binwidth = 0.01, na.rm = T)+
theme_bw() + ggtitle("Density contour plot for Rmax~Nmax")+
theme(plot.title = element_text(hjust = .5))
TOP500.5%>%ggplot(.,aes(x=log(TotalCores),y=log(Rmax),col=Country))+
geom_density2d(binwidth = 0.01, na.rm = T)+
theme_bw() + ggtitle("Density contour plot for Rmax~TotalCores")+
theme(plot.title = element_text(hjust = .5))
I. Applying the multiple linear regression \(ln(Rmax_i) = B_0+B_1ln(Nmax_i) + B_2ln(TotalCores_i) +E_i\), we can rank the leaders in the HPC race by the regression slopes \(B_1\) and \(B_2\).
II. Slope \(B_1\) measures how \(Rmax\) varies with \(Nmax\) when \(TotalCores\) is held constant. Ranked by magnitude, we get the following order: 1. USA (-0.3493), 2. China (-0.2124). We excluded Japan, Germany and France because their \(B_1\) slopes are not statistically significant (p-values of 0.89, 0.68 and 0.27). Both the USA and China have a negative \(B_1\), which means that, for a fixed number of cores, a smaller problem size \(Nmax\) is associated with a greater \(Rmax\). Nothing curious at all.
III. Slope \(B_2\) measures how \(Rmax\) varies with \(TotalCores\) when \(Nmax\) is held constant. We get the following order: 1. Germany (1.0782), 2. USA (1.0399), 3. France (0.9002), 4. Japan (0.7281), 5. China (0.6644). The slope \(B_2\) shows how much performance \(Rmax\) is obtained from additional \(TotalCores\). In a broad sense it reflects a country's HPC construction technology, that is, it serves as an index of the ability to build and use advanced HPC systems.
IV. However, China holds about 43% of all HPC mainframes in the TOP500 list, while Japan has produced Fugaku, the number one HPC system in the world today. The race is not over!
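The slopes quoted in II and III can be read more concretely as multiplicative effects. In a log-log model, multiplying a predictor by k multiplies the predicted response by k^B with the other predictor held fixed. The short sketch below applies this to a doubling (k = 2), using the coefficients reported above; the variable names are chosen for this illustration only.

```r
# Slopes quoted in conclusions II and III (log-log regression coefficients)
b1 <- c(USA = -0.3493, China = -0.2124)
b2 <- c(Germany = 1.0782, USA = 1.0399, France = 0.9002,
        Japan = 0.7281, China = 0.6644)

# Multiplying a predictor by k multiplies predicted Rmax by k^B,
# with the other predictor held fixed
k <- 2
factor.nmax  <- k^b1   # effect of doubling Nmax at fixed TotalCores
factor.cores <- k^b2   # effect of doubling TotalCores at fixed Nmax

print(round(factor.nmax, 3))
print(round(factor.cores, 3))
```

A factor above 2 for doubled cores corresponds to a slope above 1, and a factor below 2 to a slope below 1, which is the distinction the ranking in III turns on.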