Introduction

In my previous post, Supercomputers: data mining tour with R, we applied data mining techniques to the Top500 data. See https://rpubs.com/alex-lev/71014.

library(car)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(broom)
library(scatterplot3d)
load(file = "top500.dat")
attach(top500)

Linear model

Previously we fitted this linear regression of log(Rpeak) on log(Total.Cores) and Year.

fit.lm.year=lm(log(Rpeak)~log(Total.Cores)+Year,data = top500)
summary(fit.lm.year)

## 
## Call:
## lm(formula = log(Rpeak) ~ log(Total.Cores) + Year, data = top500)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18183 -0.17802 -0.04129  0.07572  1.55894 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -215.72915   27.84930  -7.746 5.36e-14 ***
## log(Total.Cores)    0.84975    0.01852  45.891  < 2e-16 ***
## Year                0.10941    0.01381   7.920 1.57e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3481 on 497 degrees of freedom
## Multiple R-squared:  0.8098, Adjusted R-squared:  0.8091 
## F-statistic:  1058 on 2 and 497 DF,  p-value: < 2.2e-16

s3d <- scatterplot3d(log(Total.Cores), Year, log(Rpeak),
                     main = "3D plot for top500 regression",
                     pch = 16, highlight.3d = TRUE, type = "p", grid = TRUE,
                     xlab = "log(Total.Cores)", ylab = "Year", zlab = "log(Rpeak)")
s3d$plane3d(fit.lm.year, draw_lines = TRUE, draw_polygon = TRUE)
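
Because the response is log-transformed, the coefficients act multiplicatively on Rpeak. A minimal sketch of how to read them, using the estimates copied from the summary above (the variable names here are only illustrative):

```r
# Coefficients copied from summary(fit.lm.year)
b_cores <- 0.84975   # log(Total.Cores)
b_year  <- 0.10941   # Year

# Multiplicative change in Rpeak for one extra year, cores held fixed
year_factor <- exp(b_year)

# Multiplicative change in Rpeak when the core count doubles, year held fixed
double_cores_factor <- 2^b_cores

round(c(per_year = year_factor, doubling_cores = double_cores_factor), 3)
```

Under this model, Rpeak grows by roughly 11.6% per year at a fixed core count, and doubling the cores multiplies Rpeak by about 1.8 rather than 2.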

Modifying model

Now we add the Power variable to the model, log-transform all the variables, and run some diagnostics to check the new model.

par(mfrow=c(2,2))
fit.lm.2<-lm(data = top500,log(Rpeak)~log(Power)+log(Total.Cores)+log(Year))
summary(fit.lm.2)

## 
## Call:
## lm(formula = log(Rpeak) ~ log(Power) + log(Total.Cores) + log(Year), 
##     data = top500)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19835 -0.15566 -0.06235  0.08173  1.40477 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.629e+03  2.996e+02  -5.437 1.23e-07 ***
## log(Power)        1.195e-01  4.198e-02   2.847  0.00476 ** 
## log(Total.Cores)  7.844e-01  3.943e-02  19.894  < 2e-16 ***
## log(Year)         2.147e+02  3.938e+01   5.453 1.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3763 on 264 degrees of freedom
##   (232 observations deleted due to missingness)
## Multiple R-squared:  0.8446, Adjusted R-squared:  0.8428 
## F-statistic: 478.2 on 3 and 264 DF,  p-value: < 2.2e-16

glance(fit.lm.year)

##   r.squared adj.r.squared     sigma statistic       p.value df    logLik
## 1 0.8098396     0.8090744 0.3480606  1058.292 7.266586e-180  3 -180.2754
##        AIC      BIC deviance df.residual
## 1 368.5507 385.4092 60.20964         497

glance(fit.lm.2)

##   r.squared adj.r.squared     sigma statistic      p.value df    logLik
## 1  0.844567     0.8428007 0.3762566  478.1602 2.29778e-106  4 -116.2948
##        AIC      BIC deviance df.residual
## 1 242.5896 260.5445 37.37423         264
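
The two glance() summaries rest on different numbers of rows (497 + 3 = 500 for the first model, 264 + 4 = 268 for the second), and likelihood-based measures such as AIC and BIC are only comparable between fits to the same observations. One possible way to put the models on an equal footing is to refit both on the complete cases. The sketch below uses a simulated data frame `sim` as a hypothetical stand-in for top500; the simulated coefficients are arbitrary and not taken from the real data.

```r
set.seed(1)
n <- 200

# Hypothetical stand-in for top500: simulated values, not the real list
sim <- data.frame(
  Total.Cores = round(exp(rnorm(n, 10, 1))),
  Year        = sample(2008:2015, n, replace = TRUE)
)
sim$Power <- exp(0.8 * log(sim$Total.Cores) + rnorm(n, 0, 0.5))
sim$Rpeak <- exp(-200 + 0.8 * log(sim$Total.Cores) + 0.1 * log(sim$Power) +
                 0.1 * sim$Year + rnorm(n, 0, 0.3))
sim$Power[sample(n, 80)] <- NA   # mimic the missing Power values

# Keep only rows where every model variable is observed
vars <- c("Rpeak", "Power", "Total.Cores", "Year")
cc   <- sim[complete.cases(sim[, vars]), ]

fit.old <- lm(log(Rpeak) ~ log(Total.Cores) + Year, data = cc)
fit.new <- lm(log(Rpeak) ~ log(Power) + log(Total.Cores) + log(Year), data = cc)

# Both fits now use the same rows, so AIC and BIC can be compared directly
AIC(fit.old, fit.new)
BIC(fit.old, fit.new)
```

With the real data, the same pattern would apply with `top500` in place of `sim`.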

tidy(anova(fit.lm.2))

##               term  df      sumsq      meansq  statistic      p.value
## 1       log(Power)   1 144.262605 144.2626046 1019.02641 1.273355e-92
## 2 log(Total.Cores)   1  54.605690  54.6056897  385.71770 1.495899e-53
## 3        log(Year)   1   4.209775   4.2097749   29.73655 1.137286e-07
## 4        Residuals 264  37.374230   0.1415691         NA           NA

plot(fit.lm.2)

leveragePlots(fit.lm.2)

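\[Rpeak = e^{-1629 + 0.1195 \cdot \log(Power) + 0.7844 \cdot \log(Total.Cores) + 214.7 \cdot \log(Year)}\] The new model compares well with the old one: adjusted R-squared rises from 0.809 to 0.843. AIC, BIC and deviance are also lower, although the new model is fitted to only 268 of the 500 rows (232 have missing Power), so these three measures are not directly comparable between the two fits. All coefficients are still significant.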
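
Exponentiating term by term turns the fitted equation into a power law. This is only an algebraic rearrangement of the formula above, with the rounded coefficients from the summary:

\[Rpeak = e^{-1629} \cdot Power^{0.1195} \cdot Total.Cores^{0.7844} \cdot Year^{214.7}\]

The large exponent on Year is less dramatic than it looks. Since \[\frac{\partial \log(Rpeak)}{\partial Year} = \frac{214.7}{Year},\] the implied growth rate near Year = 2016 is 214.7 / 2016 ≈ 0.1065 per year, close to the Year coefficient of 0.109 in the first model.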

Predicting Rpeak for new supercomputer

We used data from the Top500 list as of 2015. The list has since been updated with a new champion, Sunway TaihuLight; see https://www.top500.org/. Let's use our new linear model to predict Rpeak for this HPC giant from China.

exp(predict(fit.lm.2,data.frame(Power=15371,Total.Cores=10649600,Year=2016),interval = "confidence"))/1000

##        fit      lwr      upr
## 1 127186.4 89512.19 180717.1
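
The call above uses interval = "confidence", which describes the uncertainty of the mean response at the given predictor values. For a single new machine, a prediction interval (interval = "prediction") is usually the more relevant quantity, and it is always wider. A small sketch on simulated data, with hypothetical variables `x` and `y` rather than the Top500 set:

```r
set.seed(42)
n <- 100

# Hypothetical simulated data, only to illustrate the two interval types
d   <- data.frame(x = runif(n, 1, 10))
d$y <- 2 + 0.8 * d$x + rnorm(n, 0, 0.4)
fit <- lm(y ~ x, data = d)

new <- data.frame(x = 12)   # a point slightly outside the observed range

ci  <- predict(fit, new, interval = "confidence")  # uncertainty of the mean response
pri <- predict(fit, new, interval = "prediction")  # uncertainty for one new observation

# Same fitted value, different interval widths
rbind(confidence = ci[1, ], prediction = pri[1, ])
```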

Excellent result! Compare the predicted 127,186.4 TFlop/s with the Top500 value Rpeak = 125,435.9 TFlop/s.
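
To quantify the agreement, the relative error can be computed directly from the numbers quoted above. A short sketch (the variable names are illustrative):

```r
pred   <- 127186.4   # fitted value from predict(), TFlop/s
lwr    <- 89512.19   # lower confidence bound, TFlop/s
upr    <- 180717.1   # upper confidence bound, TFlop/s
actual <- 125435.9   # Rpeak published by Top500, TFlop/s

# Relative error of the prediction, in percent
rel_err_pct <- 100 * (pred - actual) / actual

# Does the published value fall inside the reported interval?
inside <- actual >= lwr && actual <= upr

round(rel_err_pct, 2)
inside
```

The prediction overshoots the published value by about 1.4%, and the published value lies inside the reported interval.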

Note. We divided the predicted Rpeak by 1000 because Rpeak in our Top500 data set is recorded in GFlop/s, not TFlop/s.

head(top500[,"Rpeak"],10)

##  [1] 54902400 27112550 20132659 11280384 10066330  7788853  8520112
##  [8]  5872026  5033165  4881254
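
As a quick check of the units, the first few values printed above can be converted from GFlop/s to TFlop/s by hand. A minimal sketch, with illustrative variable names:

```r
# First three Rpeak values from the output above, in GFlop/s
rpeak_gflops <- c(54902400, 27112550, 20132659)

# 1 TFlop/s = 1000 GFlop/s
rpeak_tflops <- rpeak_gflops / 1000
rpeak_tflops
```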

Conclusions

We modified the linear regression model by adding Power and log-transforming all the variables.
The diagnostics for the new model compare well with those for the old model.
We predicted Rpeak for Sunway TaihuLight, the Chinese supercomputer that currently tops the Top500 list.
The predicted Rpeak is close to the published value for the HPC champion.

Supercomputers: linear regression

Alexander Levakov, Senior Research Fellow, PhD

October, 2016
