IBM-Summit

Introduction

We have discussed the HPC race previously. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777

TOP500 data

Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/

Filtering Data

Note: the original variable names were changed (truncated, concatenated and simplified) to make data exploration and visualization easier.

library(e1071)
library(vip)
library(tidyverse)
library(readxl)
library(DT)
library(knitr)
TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)

##  [1] "Rank"                        "PreviousRank"               
##  [3] "FirstAppearance"             "FirstRank"                  
##  [5] "Name"                        "Computer"                   
##  [7] "Site"                        "Manufacturer"               
##  [9] "Country"                     "Year"                       
## [11] "Segment"                     "TotalCores"                 
## [13] "AcceleratorCoProcessorCores" "Rmax"                       
## [15] "Rpeak"                       "Nmax"                       
## [17] "Nhalf"                       "HPCG"                       
## [19] "Power"                       "PowerSource"                
## [21] "PowerEfficiency"             "Architecture"               
## [23] "Processor"                   "ProcessorTechnology"        
## [25] "ProcessorSpeed"              "OperatingSystem"            
## [27] "OSFamily"                    "AcceleratorCoProcessor"     
## [29] "CoresperSocket"              "ProcessorGeneration"        
## [31] "SystemModel"                 "SystemFamily"               
## [33] "InterconnectFamily"          "Interconnect"               
## [35] "Continent"                   "SiteID"                     
## [37] "SystemID"

Problem

We want to classify the HPC mainframes of China and the USA by country, using a Support Vector Machine (SVM).

TOP500_USCH <- as_tibble(TOP500_202011) %>%
  filter(Country %in% c("United States", "China")) %>%
  select(Country, Year, TotalCores, Rmax, Rpeak, Nmax, Power,
         PowerEfficiency, ProcessorSpeed, CoresperSocket) %>%
  mutate(Country = as.factor(Country)) %>%
  drop_na()

TOP500_USCH  %>% datatable() #kable(align = "r",caption = "TOP500: China - USA")

The dataset above contains 98 observations (mainframes) and 10 variables.

Support Vector Machine

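set.seed(12345)
n_splits <- 3000
MisClassError <- numeric(n_splits)  # misclassification rate per random split
for (i in seq_len(n_splits)) {
  # Random 50/50 split; setdiff() matches rows by value and drops duplicates,
  # so the test set can be slightly smaller than the training set
  train <- sample_frac(TOP500_USCH, 0.5, replace = FALSE)
  test <- TOP500_USCH %>% setdiff(train)
  # Linear-kernel SVM on scaled predictors
  model.svm <- svm(Country ~ ., data = train, kernel = "linear",
                   type = "C-classification", scale = TRUE, probability = TRUE)
  pred_svm <- predict(model.svm, test, probability = TRUE)
  # Share of test systems assigned to the wrong country
  MisClassError[i] <- mean(pred_svm[seq_along(test$Country)] != test$Country)
}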
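The split above builds the test set with setdiff(), which compares rows by value. As a minimal sketch of an alternative, assuming an index-based split is acceptable, the example below draws row positions instead of rows. The `toy` tibble is hypothetical illustration data with one duplicated row, not the TOP500 data, and this is not the method used to produce the results in this report.

```r
library(dplyr)

# Hypothetical toy data with two identical rows (NOT the TOP500 data)
toy <- tibble(
  Country = factor(c("China", "China", "United States",
                     "United States", "China", "United States")),
  TotalCores = c(100, 100, 200, 300, 400, 500)
)

set.seed(1)
# Draw row indices, so the split is made by position rather than by value
train_idx <- sample(nrow(toy), size = nrow(toy) / 2)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# The index-based test set always holds every remaining row
nrow(train) + nrow(test) == nrow(toy)

# A value-based setdiff() de-duplicates, so it never returns more rows
# than the index-based test set and may return fewer
nrow(setdiff(toy, train)) <= nrow(test)
```

With an index-based split every run is scored on the same number of test rows, which makes the per-split accuracies directly comparable.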

summary(1-MisClassError)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.6304  0.7234  0.7109  0.7872  0.9348

quantile(1-MisClassError,c(0.1,0.5,0.9))

##       10%       50%       90% 
## 0.5744681 0.7234043 0.8297872

print(paste('Mean Classification Accuracy =',round(mean(1-MisClassError),3)))

## [1] "Mean Classification Accuracy = 0.711"

par(mfrow=c(1,1))
hist(1-MisClassError,breaks = 7,col="blue",main = "True Classification Proportion",
     xlab = "Proportion")

Conclusion

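A linear-kernel Support Vector Machine classifies HPC mainframes by country from their specifications with a mean accuracy of about 71% over 3000 random train/test splits (median 72%, with 80% of the splits falling between roughly 57% and 83%).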
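An accuracy figure is easier to judge next to the accuracy of always predicting the most frequent class. The sketch below shows how such a baseline could be computed. The `labels` vector is hypothetical and does not reflect the actual China/USA counts in the TOP500 data; with the real data, `TOP500_USCH$Country` would presumably take its place.

```r
# Hypothetical labels for illustration only (NOT the TOP500 class counts)
labels <- factor(c(rep("China", 6), rep("United States", 4)))

# Accuracy obtained by always predicting the most frequent class
majority_baseline <- max(table(labels)) / length(labels)
majority_baseline
```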

TOP500 HPC Race by SVM: China - United States