IBM-Summit
We discussed the HPC race in earlier posts. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777
Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/
Note: the original variable names were changed (truncated, concatenated and simplified) to make data exploration and visualization easier.
library(e1071)
library(vip)
library(tidyverse)
library(readxl)
library(DT)
library(knitr)
TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)
## [1] "Rank" "PreviousRank"
## [3] "FirstAppearance" "FirstRank"
## [5] "Name" "Computer"
## [7] "Site" "Manufacturer"
## [9] "Country" "Year"
## [11] "Segment" "TotalCores"
## [13] "AcceleratorCoProcessorCores" "Rmax"
## [15] "Rpeak" "Nmax"
## [17] "Nhalf" "HPCG"
## [19] "Power" "PowerSource"
## [21] "PowerEfficiency" "Architecture"
## [23] "Processor" "ProcessorTechnology"
## [25] "ProcessorSpeed" "OperatingSystem"
## [27] "OSFamily" "AcceleratorCoProcessor"
## [29] "CoresperSocket" "ProcessorGeneration"
## [31] "SystemModel" "SystemFamily"
## [33] "InterconnectFamily" "Interconnect"
## [35] "Continent" "SiteID"
## [37] "SystemID"
We want to classify HPC mainframes in China and the USA by country, applying Naive Bayes (NB) to their specifications.
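For numeric predictors, naiveBayes() from e1071 assumes each feature is Gaussian within a class and independent of the other features. The idea can be sketched by hand in base R as below. This is a minimal illustration on two species of the built-in iris data, used here only as a stand-in for the TOP500 table; the helper names fit_gaussian_nb and predict_gaussian_nb are hypothetical and not part of any package.

```r
# Toy stand-in data: two iris species, NOT the TOP500 table.
toy <- droplevels(subset(iris, Species != "setosa"))
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Fit: per-class prior plus per-class mean and sd of every feature.
# (Hypothetical helper, for illustration only.)
fit_gaussian_nb <- function(x, y) {
  lapply(split(x, y), function(block) {
    list(prior = nrow(block) / nrow(x),
         mean  = sapply(block, mean),
         sd    = sapply(block, sd))
  })
}

# Predict: choose the class with the largest
# log prior + sum of per-feature Gaussian log densities.
predict_gaussian_nb <- function(fit, x) {
  scores <- sapply(fit, function(cls) {
    log_dens <- mapply(function(col, m, s) dnorm(col, m, s, log = TRUE),
                       x, cls$mean, cls$sd)
    log(cls$prior) + rowSums(log_dens)
  })
  factor(colnames(scores)[max.col(scores, ties.method = "first")],
         levels = names(fit))
}

fit  <- fit_gaussian_nb(toy[features], toy$Species)
pred <- predict_gaussian_nb(fit, toy[features])

# In-sample accuracy on the toy data
accuracy <- mean(pred == toy$Species)
print(accuracy)
```

Summing log densities rather than multiplying raw densities avoids numerical underflow when there are many features.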
TOP500.USCH <- as_tibble(TOP500_202011) %>%
  filter(Country %in% c("United States", "China")) %>%
  select(Country, Year, TotalCores, Rmax, Rpeak, Nmax, Power,
         PowerEfficiency, ProcessorSpeed, CoresperSocket) %>%
  mutate(Country = as.factor(Country)) %>%
  drop_na()
TOP500.USCH %>% datatable() #kable(align = "r",caption = "TOP500: China - USA")
The dataset above contains 98 observations (mainframes) and 10 variables.
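set.seed(12345)
MisClassError <- numeric(3000)
for (i in 1:3000) {
  # Random 50% of the rows for training, drawn without replacement
  train <- sample_frac(TOP500.USCH, 0.5, replace = FALSE)
  # Test set: rows of the full table that do not appear in train
  test <- TOP500.USCH %>% setdiff(train)
  model <- naiveBayes(Country ~ ., data = train)
  pred_country <- predict(model, test)
  # Share of misclassified test rows in this run
  MisClassError[i] <- mean(pred_country != test$Country)
}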
summary(1-MisClassError)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2128 0.6596 0.7660 0.7046 0.8261 0.9375
quantile(1-MisClassError,c(0.1,0.5,0.9))
## 10% 50% 90%
## 0.3617021 0.7659574 0.8541667
print(paste('Mean Classification Accuracy =',round(mean(1-MisClassError),3)))
## [1] "Mean Classification Accuracy = 0.705"
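A note on the split used in the loop: dplyr's setdiff() treats tables as sets and drops duplicate rows, so if the data contain identical rows the test set may end up smaller than the exact complement of train. The accuracy values above look like fractions with different denominators (for example 36/47 and 38/46), which is consistent with a test set whose size varies between runs. A minimal alternative sketch, assuming a split by row index is acceptable, is shown on a hypothetical toy table (not the TOP500 data):

```r
set.seed(12345)

# Hypothetical toy table; rows 1 and 2 are deliberately identical.
toy <- data.frame(
  cores = c(100, 100, 200, 300, 400, 500),
  label = factor(c("A", "A", "B", "B", "A", "B"))
)

# Draw half of the row indices for training; the remaining rows form the test set.
train_idx <- sample(nrow(toy), size = floor(0.5 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# The two parts always partition the table, duplicated rows included.
print(c(train = nrow(train), test = nrow(test)))
```

With this approach every row lands in exactly one of the two parts, so the test size is the same in every run.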
par(mfrow=c(1,1))
hist(1-MisClassError,breaks = 20,col="blue",main = "True Classification Proportion",
xlab = "Proportion",xlim = c(0.5,1.0))
Over 3000 random splits, Naive Bayes correctly classifies HPC mainframes by country from their specifications with a median accuracy of about 77% (mean accuracy about 70%).
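To put an accuracy figure in context, it can help to compare it with the share of the largest class, which is what a classifier that always predicts the most frequent country would score. A minimal sketch follows; the labels are hypothetical and only illustrate the calculation, they are not the counts from the TOP500 data.

```r
# Hypothetical class labels, for illustration only (not the TOP500 counts).
labels <- factor(c(rep("China", 6), rep("United States", 4)))

# Share of each class in the data
class_share <- prop.table(table(labels))

# Accuracy of always predicting the most frequent class
baseline_accuracy <- max(class_share)
print(class_share)
print(baseline_accuracy)
```

On the real table the same calculation could be applied to TOP500.USCH$Country.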