IBM-Summit

Introduction

We have discussed the HPC race previously. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777

TOP500 data

Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/

Filtering Data

Note: the original variable names were shortened (truncated, concatenated and simplified) to make data exploration and visualization easier.

library(e1071)
library(vip)
library(tidyverse)
library(readxl)
library(DT)
library(knitr)
TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)
##  [1] "Rank"                        "PreviousRank"               
##  [3] "FirstAppearance"             "FirstRank"                  
##  [5] "Name"                        "Computer"                   
##  [7] "Site"                        "Manufacturer"               
##  [9] "Country"                     "Year"                       
## [11] "Segment"                     "TotalCores"                 
## [13] "AcceleratorCoProcessorCores" "Rmax"                       
## [15] "Rpeak"                       "Nmax"                       
## [17] "Nhalf"                       "HPCG"                       
## [19] "Power"                       "PowerSource"                
## [21] "PowerEfficiency"             "Architecture"               
## [23] "Processor"                   "ProcessorTechnology"        
## [25] "ProcessorSpeed"              "OperatingSystem"            
## [27] "OSFamily"                    "AcceleratorCoProcessor"     
## [29] "CoresperSocket"              "ProcessorGeneration"        
## [31] "SystemModel"                 "SystemFamily"               
## [33] "InterconnectFamily"          "Interconnect"               
## [35] "Continent"                   "SiteID"                     
## [37] "SystemID"

Problem

We want to classify HPC mainframes in China and the USA by country, using Naive Bayes (NB) on their technical specifications.
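As a reminder of what the classifier does: Naive Bayes multiplies the class prior by per-feature likelihoods, treated as independent given the class, and then normalizes. For numeric predictors, `naiveBayes()` from e1071 assumes a Gaussian distribution within each class. Below is a minimal base-R sketch of that rule for a single feature. The function name `gaussian_nb_posterior` and the toy values are illustrative assumptions, not part of the analysis.

```r
# Hypothetical toy data: one numeric feature and two classes (not TOP500 data)
x <- c(1.0, 1.2, 0.8, 3.0, 3.3, 2.9)
y <- factor(c("A", "A", "A", "B", "B", "B"))

# Posterior class probabilities for a new value under a Gaussian Naive Bayes rule
gaussian_nb_posterior <- function(x_new, x, y) {
  classes <- levels(y)
  # Class priors estimated from class frequencies
  prior <- as.numeric(table(y)) / length(y)
  # Gaussian likelihood of x_new within each class
  lik <- sapply(classes, function(k) {
    dnorm(x_new, mean = mean(x[y == k]), sd = sd(x[y == k]))
  })
  # Bayes rule: prior times likelihood, normalized to sum to 1
  unnorm <- prior * lik
  setNames(unnorm / sum(unnorm), classes)
}

post <- gaussian_nb_posterior(1.1, x, y)
print(post)
```

With several predictors, the per-feature likelihoods would be multiplied together before normalizing; the predicted class is the one with the largest posterior.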

TOP500.USCH <- as_tibble(TOP500_202011) %>%
  filter(Country %in% c("United States", "China")) %>%
  select(Country, Year, TotalCores, Rmax, Rpeak, Nmax, Power,
         PowerEfficiency, ProcessorSpeed, CoresperSocket) %>%
  mutate(Country = as.factor(Country)) %>%
  drop_na()

TOP500.USCH  %>% datatable() #kable(align = "r",caption = "TOP500: China - USA")

The dataset above contains 98 observations (mainframes) and 10 variables.

Naive Bayes

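set.seed(12345)
MisClassError <- c()
for (i in 1:3000) {
  # Random 50/50 split: half of the rows go to training
  train <- sample_frac(TOP500.USCH, 0.5, replace = FALSE)
  # Remaining rows form the test set (setdiff() also drops duplicate rows)
  test <- TOP500.USCH %>% setdiff(train)
  model <- naiveBayes(Country ~ ., data = train)
  pred_country <- predict(model, test)
  # Share of misclassified test rows in this split
  MisClassError[i] <- mean(pred_country != test$Country)
}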

summary(1-MisClassError)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2128  0.6596  0.7660  0.7046  0.8261  0.9375
quantile(1-MisClassError,c(0.1,0.5,0.9))
##       10%       50%       90% 
## 0.3617021 0.7659574 0.8541667
print(paste('Mean Classification Accuracy =',round(mean(1-MisClassError),3)))
## [1] "Mean Classification Accuracy = 0.705"
par(mfrow=c(1,1))
hist(1-MisClassError,breaks = 20,col="blue",main = "True Classification Proportion",
     xlab = "Proportion",xlim = c(0.5,1.0))
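One caveat about the split used above: with tidyverse loaded, `setdiff()` on data frames follows set semantics and drops duplicate rows, so systems with identical specifications can vanish from the test set. The reported accuracies are consistent with this (for example, 0.9375 = 45/48 and 0.8261 = 38/46, rather than fractions over 49). Below is a minimal sketch of the effect, using a hypothetical four-row `toy` table (not TOP500 data) and an index-based split as one possible alternative.

```r
library(dplyr)

# Hypothetical toy table: two pairs of identical rows (not TOP500 data)
toy <- tibble(
  Country = factor(c("China", "China", "United States", "United States")),
  TotalCores = c(1000, 1000, 2000, 2000)
)

# One row of each identical pair goes to training
train_idx <- c(1, 3)
train <- toy[train_idx, ]

# Set-based split: the held-out rows equal training rows, so they are removed
test_setdiff <- dplyr::setdiff(toy, train)

# Index-based split: every row not used for training is kept
test_index <- toy[-train_idx, ]

print(nrow(test_setdiff))
print(nrow(test_index))
```

In the resampling loop, the analogous change would be to sample row indices with `sample(nrow(TOP500.USCH), ...)` and subset by them. That would change the numbers reported here, so it is offered only as a possible refinement.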

Conclusion

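Across 3000 random 50/50 train/test splits, Naive Bayes classifies HPC mainframes by country from their specifications with a median accuracy of about 77% (mean about 70%). The spread is wide: accuracy ranges from 21% to 94%, and the lowest 10% of splits score about 36% or less.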