IBM-Summit
We have discussed the HPC race previously. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777
Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/
Note: the original variable names were changed (truncated, concatenated, and simplified) as much as possible to ease data exploration and visualization.
library(randomForest)
library(tidyverse)
library(readxl)
library(DT)
library(knitr)
TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)
## [1] "Rank" "PreviousRank"
## [3] "FirstAppearance" "FirstRank"
## [5] "Name" "Computer"
## [7] "Site" "Manufacturer"
## [9] "Country" "Year"
## [11] "Segment" "TotalCores"
## [13] "AcceleratorCoProcessorCores" "Rmax"
## [15] "Rpeak" "Nmax"
## [17] "Nhalf" "HPCG"
## [19] "Power" "PowerSource"
## [21] "PowerEfficiency" "Architecture"
## [23] "Processor" "ProcessorTechnology"
## [25] "ProcessorSpeed" "OperatingSystem"
## [27] "OSFamily" "AcceleratorCoProcessor"
## [29] "CoresperSocket" "ProcessorGeneration"
## [31] "SystemModel" "SystemFamily"
## [33] "InterconnectFamily" "Interconnect"
## [35] "Continent" "SiteID"
## [37] "SystemID"
We want to classify the HPC mainframes of China and the USA by applying Random Forest (RF).
TOP500.USCH <- as_tibble(TOP500_202011) %>%
  filter(Country %in% c("United States", "China")) %>%
  select(Country, Year, TotalCores, Rmax, Rpeak, Nmax, Power,
         PowerEfficiency, ProcessorSpeed, CoresperSocket) %>%
  mutate(Country = as.factor(Country)) %>%
  drop_na()
TOP500.USCH %>% datatable() #kable(align = "r",caption = "TOP500: China - USA")
The resulting dataset contains 98 individuals (mainframes) and 10 variables.
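As a quick sanity check, we can print the dimensions (98 rows and 10 columns are expected):
dim(TOP500.USCH)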
Random Forest is a powerful ensemble machine learning algorithm: it builds many decision trees and then combines their outputs. A decision tree is a classification model built on the concept of information gain: at every node it evaluates the candidate splits, performs the split with the maximum information gain, and repeats the process until the nodes are exhausted or no further gain is possible. Decision trees are simple and easy to interpret, but on their own they have low predictive power; they are known as weak learners, which is exactly why averaging many of them into a forest helps. For more about RF see https://r-posts.com/how-to-implement-random-forests-in-r/
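To make the "weak learner" point concrete, here is a minimal sketch of a single classification tree fit to the same data (it assumes the rpart package is installed; rpart is not used elsewhere in this analysis):
library(rpart)
# One tree on the full USA/China subset; each split is chosen greedily
# to maximize the purity gain at that node
usch.tree <- rpart(Country ~ ., data = TOP500.USCH, method = "class")
# Resubstitution confusion matrix of the single tree
table(predict(usch.tree, type = "class"), TOP500.USCH$Country)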
set.seed(12345)
train <- sample_frac(TOP500.USCH,0.5,replace = F)
test <-TOP500.USCH %>% setdiff(train)
usch.rf <- randomForest(Country ~ ., data=train, importance=TRUE,
proximity=TRUE,replace=T,ntree=500)
print(usch.rf)
##
## Call:
## randomForest(formula = Country ~ ., data = train, importance = TRUE, proximity = TRUE, replace = T, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.33%
## Confusion matrix:
## China United States class.error
## China 23 4 0.1481481
## United States 4 18 0.1818182
The OOB (out-of-bag) error above is estimated from the observations each tree did not see during bagging, so it acts as a built-in cross-validation; the 3 variables tried at each split correspond to the default mtry = floor(sqrt(p)) for our p = 9 predictors.
plot(usch.rf)
varImpPlot(usch.rf)
# Predicting on train set
predTrain <- predict(usch.rf, train, type = "class")
# Checking classification accuracy
table(predTrain, train$Country)
##
## predTrain China United States
## China 27 0
## United States 0 22
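The training-set table is perfectly separated, which is expected and optimistic: the forest has effectively memorized these rows. The held-out test tibble created earlier gives a fairer check; a minimal sketch:
# Predicting on the held-out test set
predTest <- predict(usch.rf, test, type = "class")
table(predTest, test$Country)   # test-set confusion matrix
mean(predTest == test$Country)  # test-set accuracy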
## Look at variable importance:
round(importance(usch.rf), 2)
## China United States MeanDecreaseAccuracy MeanDecreaseGini
## Year 15.34 12.36 16.11 4.47
## TotalCores 5.30 0.31 3.93 1.92
## Rmax 16.14 12.56 17.95 4.70
## Rpeak 3.17 3.29 4.78 2.02
## Nmax -0.36 -0.62 -0.91 1.15
## Power 9.85 2.09 9.25 3.22
## PowerEfficiency 3.60 -2.41 0.62 1.27
## ProcessorSpeed 7.96 7.53 10.18 2.34
## CoresperSocket 6.41 6.35 8.21 2.68
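The importance table is easier to scan when sorted by mean decrease in accuracy; a small convenience sketch:
imp <- importance(usch.rf)
# Most important predictors first
round(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 2)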
## Do MDS on 1 - proximity:
usch.mds <- cmdscale(1 - usch.rf$proximity, eig=TRUE)
op <- par(pty="s")
pairs(cbind(train[,2:9], usch.mds$points), cex=0.6, gap=0,
col=c("red", "green", "blue")[as.numeric(train$Country)],
main="USCH Data: Predictors and MDS of Proximity Based on RandomForest")
par(op)
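The pairs panel is dense, so a simpler sketch plots only the two MDS coordinates, colored by country (the colors are arbitrary):
plot(usch.mds$points, pch = 19,
     col = c("red", "blue")[as.numeric(train$Country)],
     xlab = "MDS dimension 1", ylab = "MDS dimension 2",
     main = "MDS of RF Proximity (training set)")
legend("topright", legend = levels(train$Country),
       col = c("red", "blue"), pch = 19)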
set.seed(12345)
MisClassError <- numeric(3000)
for (i in 1:3000) {
  train <- sample_frac(TOP500.USCH, 0.5, replace = F)
  usch.rf <- randomForest(Country ~ ., data = train, importance = TRUE,
                          proximity = TRUE, replace = F, ntree = 500)
  # err.rate is an ntree x 3 matrix; keep the final OOB error rate
  # (assigning the whole matrix would silently store only its first element)
  MisClassError[i] <- usch.rf$err.rate[usch.rf$ntree, "OOB"]
}
summary(1-MisClassError)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2778 0.7222 0.7778 0.7751 0.8333 1.0000
quantile(1-MisClassError,c(0.1,0.5,0.9))
## 10% 50% 90%
## 0.6111111 0.7777778 0.8888889
print(paste('Mean Classification Accuracy =', round(mean(1-MisClassError), 3)))
## [1] "Mean Classification Accuracy = 0.775"
par(mfrow=c(1,1))
hist(1-MisClassError,breaks = 20,col="blue",main = "True Classification Proportion",
xlab = "Proportion")
Based on these hardware specifications alone, Random Forest correctly classifies about 78% of HPC mainframes by country.