IBM-Summit
We have discussed the HPC race previously. See: https://rpubs.com/alex-lev/696179, https://rpubs.com/alex-lev/694840, https://rpubs.com/alex-lev/693131, https://rpubs.com/alex-lev/553777
Data for November 2020 can be downloaded here: https://www.top500.org/lists/top500/2020/11/
Note: the original variable names were changed (truncated, concatenated, and simplified) as much as possible to ease data exploration and visualization.
library(randomForest)
library(tidyverse)
library(readxl)
library(DT)
library(knitr)
TOP500_202011 <- read_excel("top500/TOP500_202011.xlsx")
names(TOP500_202011)
## [1] "Rank" "PreviousRank"
## [3] "FirstAppearance" "FirstRank"
## [5] "Name" "Computer"
## [7] "Site" "Manufacturer"
## [9] "Country" "Year"
## [11] "Segment" "TotalCores"
## [13] "AcceleratorCoProcessorCores" "Rmax"
## [15] "Rpeak" "Nmax"
## [17] "Nhalf" "HPCG"
## [19] "Power" "PowerSource"
## [21] "PowerEfficiency" "Architecture"
## [23] "Processor" "ProcessorTechnology"
## [25] "ProcessorSpeed" "OperatingSystem"
## [27] "OSFamily" "AcceleratorCoProcessor"
## [29] "CoresperSocket" "ProcessorGeneration"
## [31] "SystemModel" "SystemFamily"
## [33] "InterconnectFamily" "Interconnect"
## [35] "Continent" "SiteID"
## [37] "SystemID"
We want to classify the HPC mainframes of China and the USA by applying Random Forest (RF).
TOP500.USCH <- as_tibble(TOP500_202011) %>%
  filter(Country %in% c("United States", "China")) %>%
  select(Country, Year, TotalCores, Rmax, Rpeak, Nmax, Power,
         PowerEfficiency, ProcessorSpeed, CoresperSocket) %>%
  mutate(Country = as.factor(Country)) %>%
  drop_na()
TOP500.USCH %>% datatable() #kable(align = "r",caption = "TOP500: China - USA")
The resulting dataset contains 98 individuals (mainframes) and 10 variables.
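As a quick sanity check, we can print the dimensions (98 rows and 10 columns are expected):
dim(TOP500.USCH)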
Random Forest is a powerful ensemble machine learning algorithm: it builds many decision trees and then combines their outputs. A decision tree is a classification model built on the concept of information gain: at every node it evaluates the candidate splits, performs the split with the maximum information gain, and repeats the process until the nodes are exhausted or no further gain is possible. Decision trees are simple and easy to interpret, but on their own they have low predictive power; they are known as weak learners, which is exactly why averaging many of them into a forest helps. For more about RF see https://r-posts.com/how-to-implement-random-forests-in-r/
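To make the "weak learner" point concrete, here is a minimal sketch of a single classification tree fit to the same data (it assumes the rpart package is installed; rpart is not used elsewhere in this analysis):
library(rpart)
# One tree on the full USA/China subset; each split is chosen greedily
# to maximize the purity gain at that node
usch.tree <- rpart(Country ~ ., data = TOP500.USCH, method = "class")
# Resubstitution confusion matrix of the single tree
table(predict(usch.tree, type = "class"), TOP500.USCH$Country)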
set.seed(12345)
train <- sample_frac(TOP500.USCH,0.5,replace = F)
test <-TOP500.USCH %>% setdiff(train)
usch.rf <- randomForest(Country ~ ., data=train, importance=TRUE,
proximity=TRUE,replace=T,ntree=500)
print(usch.rf)
##
## Call:
## randomForest(formula = Country ~ ., data = train, importance = TRUE, proximity = TRUE, replace = T, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.33%
## Confusion matrix:
## China United States class.error
## China 23 4 0.1481481
## United States 4 18 0.1818182
The OOB (out-of-bag) error above is estimated from the observations each tree did not see during bagging, so it acts as a built-in cross-validation; the 3 variables tried at each split correspond to the default mtry = floor(sqrt(p)) for our p = 9 predictors.
plot(usch.rf)
varImpPlot(usch.rf)
# Predicting on train set
predTrain <- predict(usch.rf, train, type = "class")
# Checking classification accuracy
table(predTrain, train$Country)
##
## predTrain China United States
## China 27 0
## United States 0 22
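The training-set table is perfectly separated, which is expected and optimistic: the forest has effectively memorized these rows. The held-out test tibble created earlier gives a fairer check; a minimal sketch:
# Predicting on the held-out test set
predTest <- predict(usch.rf, test, type = "class")
table(predTest, test$Country)   # test-set confusion matrix
mean(predTest == test$Country)  # test-set accuracy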
## Look at variable importance:
round(importance(usch.rf), 2)
## China United States MeanDecreaseAccuracy MeanDecreaseGini
## Year 15.34 12.36 16.11 4.47
## TotalCores 5.30 0.31 3.93 1.92
## Rmax 16.14 12.56 17.95 4.70
## Rpeak 3.17 3.29 4.78 2.02
## Nmax -0.36 -0.62 -0.91 1.15
## Power 9.85 2.09 9.25 3.22
## PowerEfficiency 3.60 -2.41 0.62 1.27
## ProcessorSpeed 7.96 7.53 10.18 2.34
## CoresperSocket 6.41 6.35 8.21 2.68
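The importance table is easier to scan when sorted by mean decrease in accuracy; a small convenience sketch:
imp <- importance(usch.rf)
# Most important predictors first
round(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 2)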
## Do MDS on 1 - proximity:
usch.mds <- cmdscale(1 - usch.rf$proximity, eig=TRUE)
op <- par(pty="s")
pairs(cbind(train[,2:9], usch.mds$points), cex=0.6, gap=0,
col=c("red", "green", "blue")[as.numeric(train$Country)],
main="USCH Data: Predictors and MDS of Proximity Based on RandomForest")
par(op)
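The pairs panel is dense, so a simpler sketch plots only the two MDS coordinates, colored by country (the colors are arbitrary):
plot(usch.mds$points, pch = 19,
     col = c("red", "blue")[as.numeric(train$Country)],
     xlab = "MDS dimension 1", ylab = "MDS dimension 2",
     main = "MDS of RF Proximity (training set)")
legend("topright", legend = levels(train$Country),
       col = c("red", "blue"), pch = 19)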
set.seed(12345)
MisClassError <- numeric(3000)
for (i in 1:3000) {
  train <- sample_frac(TOP500.USCH, 0.5, replace = F)
  usch.rf <- randomForest(Country ~ ., data = train, importance = TRUE,
                          proximity = TRUE, replace = F, ntree = 500)
  # err.rate is an ntree x 3 matrix; keep the final OOB error rate
  # (assigning the whole matrix would silently store only its first element)
  MisClassError[i] <- usch.rf$err.rate[usch.rf$ntree, "OOB"]
}
summary(1-MisClassError)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2778 0.7222 0.7778 0.7751 0.8333 1.0000
quantile(1-MisClassError,c(0.1,0.5,0.9))
## 10% 50% 90%
## 0.6111111 0.7777778 0.8888889
print(paste('Mean Classification Accuracy =', round(mean(1-MisClassError), 3)))
## [1] "Mean Classification Accuracy = 0.775"
par(mfrow=c(1,1))
hist(1-MisClassError,breaks = 20,col="blue",main = "True Classification Proportion",
xlab = "Proportion")
Based on these hardware specifications alone, Random Forest correctly classifies about 78% of HPC mainframes by country.