In my previous post, "Supercomputers: data mining tour with R" (https://rpubs.com/alex-lev/71014), we applied data mining techniques to the Top500 data.
library(car)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(scatterplot3d)
load(file = "top500.dat")
attach(top500)
There we fitted the following linear model, regressing log(Rpeak) on log(Total.Cores) and Year.
fit.lm.year=lm(log(Rpeak)~log(Total.Cores)+Year,data = top500)
summary(fit.lm.year)
##
## Call:
## lm(formula = log(Rpeak) ~ log(Total.Cores) + Year, data = top500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18183 -0.17802 -0.04129 0.07572 1.55894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -215.72915 27.84930 -7.746 5.36e-14 ***
## log(Total.Cores) 0.84975 0.01852 45.891 < 2e-16 ***
## Year 0.10941 0.01381 7.920 1.57e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3481 on 497 degrees of freedom
## Multiple R-squared: 0.8098, Adjusted R-squared: 0.8091
## F-statistic: 1058 on 2 and 497 DF, p-value: < 2.2e-16
s3d <- scatterplot3d(log(Total.Cores), Year, log(Rpeak),
                     main = "3D plot for top500 regression",
                     pch = 16, highlight.3d = TRUE, type = "p", grid = TRUE,
                     xlab = "log(Total.Cores)", ylab = "Year", zlab = "log(Rpeak)")
s3d$plane3d(fit.lm.year, draw_lines = TRUE, draw_polygon = TRUE)
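Because the response is log(Rpeak), the coefficients in the summary above act multiplicatively on Rpeak itself. The short sketch below works through that arithmetic using the printed estimates; the variable names are illustrative and do not come from the original analysis.

```r
# Estimates copied from summary(fit.lm.year) above
b_cores <- 0.84975   # coefficient of log(Total.Cores)
b_year  <- 0.10941   # coefficient of Year

# Holding the core count fixed, each extra year multiplies Rpeak by exp(b_year)
year_factor <- exp(b_year)

# Holding the year fixed, doubling the core count multiplies Rpeak by 2^b_cores
double_cores_factor <- 2^b_cores

round(c(per_year = year_factor, double_cores = double_cores_factor), 3)
```

Under this model, Rpeak grows by roughly 12% per year at a fixed core count, and doubling the cores raises it by a factor of about 1.8 rather than 2.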
Now we add the Power variable to the model, log-transform all the variables (including Year), and run some diagnostics to check the new model.
par(mfrow=c(2,2))
fit.lm.2<-lm(data = top500,log(Rpeak)~log(Power)+log(Total.Cores)+log(Year))
summary(fit.lm.2)
##
## Call:
## lm(formula = log(Rpeak) ~ log(Power) + log(Total.Cores) + log(Year),
## data = top500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19835 -0.15566 -0.06235 0.08173 1.40477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.629e+03 2.996e+02 -5.437 1.23e-07 ***
## log(Power) 1.195e-01 4.198e-02 2.847 0.00476 **
## log(Total.Cores) 7.844e-01 3.943e-02 19.894 < 2e-16 ***
## log(Year) 2.147e+02 3.938e+01 5.453 1.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3763 on 264 degrees of freedom
## (232 observations deleted due to missingness)
## Multiple R-squared: 0.8446, Adjusted R-squared: 0.8428
## F-statistic: 478.2 on 3 and 264 DF, p-value: < 2.2e-16
glance(fit.lm.year)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.8098396 0.8090744 0.3480606 1058.292 7.266586e-180 3 -180.2754
## AIC BIC deviance df.residual
## 1 368.5507 385.4092 60.20964 497
glance(fit.lm.2)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.844567 0.8428007 0.3762566 478.1602 2.29778e-106 4 -116.2948
## AIC BIC deviance df.residual
## 1 242.5896 260.5445 37.37423 264
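The two glance() outputs above come from fits on different numbers of rows: the first model uses all 500 systems, while the second drops 232 rows with missing values (presumably in Power). AIC, BIC and deviance are only comparable between models fitted to the same observations. The sketch below illustrates the point on synthetic data; the data frame `d` and the model names are hypothetical, and the same idea could be applied to top500 by refitting the smaller model on the rows where Power is available.

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(n)
d$x2[sample(n, 80)] <- NA                 # x2 is missing for 80 rows

small_all <- lm(y ~ x1, data = d)         # uses all 200 rows
big       <- lm(y ~ x1 + x2, data = d)    # lm() silently drops the 80 incomplete rows
cc        <- d[complete.cases(d), ]       # keep only rows without missing values
small_cc  <- lm(y ~ x1, data = cc)        # same 120 rows as `big`

# Number of observations each fit actually used
c(small_all = nobs(small_all), big = nobs(big), small_cc = nobs(small_cc))

# A like-for-like comparison: both models see the same rows
AIC(small_cc, big)
```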
tidy(anova(fit.lm.2))
## term df sumsq meansq statistic p.value
## 1 log(Power) 1 144.262605 144.2626046 1019.02641 1.273355e-92
## 2 log(Total.Cores) 1 54.605690 54.6056897 385.71770 1.495899e-53
## 3 log(Year) 1 4.209775 4.2097749 29.73655 1.137286e-07
## 4 Residuals 264 37.374230 0.1415691 NA NA
plot(fit.lm.2)
leveragePlots(fit.lm.2)
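\[Rpeak = e^{-1629 + 0.1195\,\log(Power) + 0.7844\,\log(Total.Cores) + 214.7\,\log(Year)}\]
The new model looks better than the old one: compare adj.r.squared (0.8428 vs 0.8091), AIC, BIC and deviance. Keep in mind that the new model was fitted to the 268 observations with non-missing Power, while the old one used all 500, so AIC, BIC and deviance are not strictly comparable between the two fits. All coefficients are still significant.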
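The fitted equation can also be evaluated by hand. The sketch below does so with the rounded coefficients printed in summary(fit.lm.2); the function name `rpeak_from_equation` is made up for illustration. Because the intercept and the log(Year) coefficient are large and shown to only four significant digits, the result may differ noticeably from what predict() returns with the full-precision coefficients.

```r
# Rounded coefficients as printed in summary(fit.lm.2); Rpeak is in GFlop/s
rpeak_from_equation <- function(power, cores, year) {
  log_rpeak <- -1629 +
    0.1195 * log(power) +
    0.7844 * log(cores) +
    214.7  * log(year)
  exp(log_rpeak)
}

# Sunway TaihuLight inputs used later in the post, converted to TFlop/s
rpeak_from_equation(power = 15371, cores = 10649600, year = 2016) / 1000
```

For real predictions it is safer to call predict() on the fitted object, which keeps the unrounded coefficients.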
We used the Top500 list as of 2015. The list has since been updated with a new champion, Sunway TaihuLight (see https://www.top500.org/). Let's use our new linear model to predict Rpeak for this HPC giant from China.
exp(predict(fit.lm.2,data.frame(Power=15371,Total.Cores=10649600,Year=2016),interval = "confidence"))/1000
## fit lwr upr
## 1 127186.4 89512.19 180717.1
Excellent result! Compare the predicted 127,186.4 TFlop/s with the Top500 value of Rpeak = 125,435.9 TFlop/s.
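To put a number on how close the prediction is, the relative error can be computed directly from the figures above. This is plain arithmetic on the values already shown; the variable names are illustrative.

```r
# Values in TFlop/s, copied from the predict() output and the Top500 list
predicted <- 127186.4
lower     <- 89512.19
upper     <- 180717.1
actual    <- 125435.9

# Relative error of the point prediction, as a percentage
rel_error <- (predicted - actual) / actual
round(100 * rel_error, 1)

# Does the published value fall inside the reported interval?
inside <- actual >= lower && actual <= upper
inside
```

The point prediction overshoots the published Rpeak by about 1.4%, and the published value lies inside the reported interval.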
Note. We divided the predicted Rpeak by 1000 because Rpeak in our Top500 data set is measured in GFlop/s, not TFlop/s.
head(top500[,"Rpeak"],10)
## [1] 54902400 27112550 20132659 11280384 10066330 7788853 8520112
## [8] 5872026 5033165 4881254
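A closing remark on the interval requested in the predict() call: interval = "confidence" describes the uncertainty of the mean response at the given inputs, whereas interval = "prediction" describes where a single new observation (such as one particular machine) is likely to fall, and is always wider. The sketch below shows the difference on synthetic data; the data and object names are hypothetical and unrelated to top500.

```r
set.seed(42)
x <- runif(100, min = 1, max = 10)
y <- 2 + 0.8 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

new_point <- data.frame(x = 5)

# Interval for the mean response at x = 5
conf_int <- predict(fit, new_point, interval = "confidence")

# Interval for a single new observation at x = 5
pred_int <- predict(fit, new_point, interval = "prediction")

# Compare the widths of the two intervals
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
c(confidence = conf_width, prediction = pred_width)
```

For a model fitted on the log scale, as in this post, both kinds of interval would be back-transformed with exp() in the same way as the point prediction.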