Introduction
This document explores the Computer Hardware Data Set, commissioned by Amdahl Corporation to examine the competitive landscape of computer CPUs (Central Processing Units), a high-growth market within the computer industry of 1982.
In our data set you will find the Published Relative Performance (PRP) benchmarks, from the influential BYTE magazine, for 209 CPUs active on the market today. In addition to conclusions and recommendations on the data, I have created a model to automatically predict BYTE’s benchmark based on key test metrics developed in Tel Aviv University’s Department of Electrical Engineering.
The data set of 209 CPUs includes brand, model and six key test metrics, as well as BYTE magazine’s Published Relative Performance benchmark and the Estimated Relative Performance calculated by Tel Aviv University.
The data can be downloaded from this link. Full technical details, such as the data’s origin, along with the Amdahl and competitor CPU variables, can be found linked here.
Please review this document in conjunction with the executive presentation (which I hope is more fun…).
Report
First I shall load our data set and the necessary packages.
# First I load our data set and the necessary packages to analyse
set.seed(100)
library(devtools)
#install_github('fawda123/ggord')
library(ggord)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(Matrix)
library(caret)
library(RColorBrewer)
library(xgboost)
library(randomForest)
# set the working directory and load the file
setwd("C:/Users/Think/Dropbox (Go!)/Aon/Data/CPU/")
cpu <- read.csv("machine.data", header=FALSE, stringsAsFactors = TRUE)
There are 209 offerings on the market today, and for each we have ten attributes of information.
dim(cpu)
## [1] 209 10
Our data set is reported to have no missing values; however, I will be diligent and verify this.
length(cpu[is.na(cpu)])
## [1] 0
I shall add headers to each of our variables in our data set and compactly display the structure of the data.
# Assign names to each of the attributes.
colnames(cpu) = c("Vendor", # CPU Vendor name
"Model", # CPU Model Name
"MYCT", # Machine cycle time in nanoseconds
"MMIN", # Minimum main memory in kilobytes
"MMAX", # Maximum main memory in kilobytes
"CACH", # Cache memory in kilobytes
"CHMIN", # Minimum channels in units
"CHMAX", # Maximum channels in units
"PRP", # Independently published relative performance (Overall CPU Measurement)
"ERP") # Estimated relative performance using Linear Regression
str(cpu)
## 'data.frame': 209 obs. of 10 variables:
## $ Vendor: Factor w/ 30 levels "adviser","amdahl",..: 1 2 2 2 2 2 2 2 2 2 ...
## $ Model : Factor w/ 209 levels "100","1100/61-h1",..: 30 63 64 65 66 67 75 76 77 78 ...
## $ MYCT : int 125 29 29 29 29 26 23 23 23 23 ...
## $ MMIN : int 256 8000 8000 8000 8000 8000 16000 16000 16000 32000 ...
## $ MMAX : int 6000 32000 32000 32000 16000 32000 32000 32000 64000 64000 ...
## $ CACH : int 256 32 32 32 32 64 64 64 64 128 ...
## $ CHMIN : int 16 8 8 8 8 8 16 16 16 32 ...
## $ CHMAX : int 128 32 32 32 16 32 32 32 32 64 ...
## $ PRP : int 198 269 220 172 132 318 367 489 636 1144 ...
## $ ERP : int 199 253 253 253 132 290 381 381 749 1238 ...
Let’s check how many rows are unique in terms of the test metrics.
# Number of unique CPUs versus total number of CPUs from Amdahl competitors
c("Total CPUs (non-amdahl) : ", nrow(cpu[!cpu$Vendor== "amdahl",3:8]),
"Uniquely specified CPUs (non-amdahl) : ", nrow(unique(cpu[!cpu$Vendor== "amdahl",3:8])))
## [1] "Total CPUs (non-amdahl) : "
## [2] "200"
## [3] "Uniquely specified CPUs (non-amdahl) : "
## [4] "184"
# Now lets check Amdahl specific rows for uniqueness
c("Total CPUs (amdahl) : ", nrow(cpu[cpu$Vendor== "amdahl",3:8]),
"Uniquely specified CPUs (amdahl) : ", nrow(unique(cpu[cpu$Vendor== "amdahl",3:8])))
## [1] "Total CPUs (amdahl) : "
## [2] "9"
## [3] "Uniquely specified CPUs (amdahl) : "
## [4] "6"
Interestingly, many of the Amdahl CPUs share the same test metrics, far more than in the overall market. In terms of the test metrics, Amdahl offers only 6 unique specifications across its 9 CPUs, whereas the rest of the market offers 184 unique specifications across 200 CPUs. This is a non-uniqueness rate of 33% for Amdahl versus 8% for competitors.
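For reference, the rates above can be recomputed directly from the counts (a quick sketch using the same cpu data frame).
# Recompute the non-uniqueness rates quoted above from the spec counts
amdahl.specs <- cpu[cpu$Vendor == "amdahl", 3:8]
comp.specs <- cpu[cpu$Vendor != "amdahl", 3:8]
round(100 * (1 - nrow(unique(amdahl.specs)) / nrow(amdahl.specs)))   # 33% for Amdahl (6 unique of 9)
round(100 * (1 - nrow(unique(comp.specs)) / nrow(comp.specs)))       # 8% for competitors (184 unique of 200)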
Let’s calculate the non-uniqueness of CPUs within competitor brand line-ups.
# Here I include Vendor name to check for non-uniqueness within brands
nrow(cpu[!cpu$Vendor=="amdahl",]) - nrow(unique(cpu[!cpu$Vendor=="amdahl",c(1,3:8)]))
## [1] 16
We see that all 16 competitor occurrences of non-unique specifications occur within brands. This means no two brands offer the exact same CPU specification.
Now that we have a feel for the data, let’s take a look at Amdahl’s competitors in the market and how many product offerings they have.
# First lets assign the colors of the bars - highlight Amdahl in red
# Assign grey to all
colors = c(rep("darkgrey", length(unique(cpu$Vendor))))
# Then sort the vendors according to frequency, as on the chart, and assign red to Amdahl
colors[match(c("amdahl"),names(table(reorder(cpu$Vendor, rep(-1,length(cpu$Vendor)), sum))))] = "red"
# Now lets look at the count of products per Vendor
# to make it more clear visually, I sort vendors on frequency of CPUs on offer
ggplot(data=cpu, aes(x=reorder(Vendor, Vendor, function(x)-length(x)))) +
geom_bar(fill = colors) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=16)) +
xlab(" ") + ylab("FDA Approved CPUs on the Market")
Next we look at the individual technical metrics in the sourced data set, to see how Amdahl products compare with those of the competition.
# First I will format the individual CPU metrics for plotting.
# To make it clearer visually, I log some features and not others. ggplot2 has constraints in doing this, so I split the data in two.
# I set the seed to ensure the factor levels are in the same sequence each time we execute.
set.seed(100)
# Exclude the unlogged features and transform the data for ggplot
cpu.log = melt(cpu[,!colnames(cpu) %in% c("CHMIN", "CHMAX", "CACH", "PRP", "ERP")])
levels(cpu.log$variable) = c("Cycle time (nanoseconds)", "Min. Main Memory (kB)", "Max. Main Memory (kB)")
# Exclude the logged features and transform the data for ggplot
cpu.raw = melt(cpu[,!colnames(cpu) %in% c("MYCT", "MMAX", "MMIN", "PRP", "ERP")])
levels(cpu.raw$variable) = c("Cache Memory (kB)", "Min. Channels (units)", "Max. Channels (units)")
# Next I separate amdahl and the competitors using a Label for both logged and raw data set
vendor.log = as.factor(ifelse(cpu.log$Vendor=="amdahl", "Amdahl", "Competitors"))
vendor.raw = as.factor(ifelse(cpu.raw$Vendor=="amdahl", "Amdahl", "Competitors"))
# Finally, as I log some scales in the plot, I add a small value to avoid log10(0) = -Inf
cpu.log$value[cpu.log$value==0] = .01
# I create two plots, one for logged metrics and one for unlogged.
# For the common elements of the plot I create a function.
expand.gplot = function(p){
p = p + scale_fill_manual(values=c("red","darkgrey"))
p = p + facet_wrap( ~ variable, scales="free")
p = p + theme(legend.position = "none")
p = p + theme(axis.title.x=element_blank(), axis.title.y=element_blank(), axis.text.x = element_blank(), axis.ticks.x=element_blank())
p = p + theme(strip.text.x = element_text(size = 12))
}
# Plot the logged CPU metric values
p1 <- ggplot(data = cpu.log, aes(x=variable, y=value)) +
geom_boxplot(aes(fill=vendor.log))
p1 <- p1 + scale_y_log10()
p1 <- expand.gplot(p1)
p1 <- p1 + ggtitle("Amdahl versus Competitors in the market")
# Plot the unlogged CPU metric values
p2 <- ggplot(data = cpu.raw, aes(x=variable, y=value)) +
geom_boxplot(aes(fill=vendor.raw))
p2 <- expand.gplot(p2)
# Join both plots and display
grid.arrange(p1, p2)
We can see above that Amdahl’s performance on the individual metrics far exceeds that of the majority of competitor offerings on the market. In fact, the top three metrics need to be log scaled because the gap between the premium performance products and the majority, which are budget CPUs, is so wide.
Next I validate the accuracy of BYTE magazine’s benchmark in recognizing top-performing CPUs against our independent tests of the CPUs. I use a method called k-means clustering to group the CPUs into three main clusters, which I call Budget, Midrange and Premium. I shall compare these clusters, generated on the individual metrics alone, against the independently published, industry-standard CPU performance benchmark from computer magazine BYTE. I use only the test metrics to cluster, so the cluster assignment per CPU is independent of brand rating.
# I shall only use the tested metrics to cluster CPUs.
# I exclude CHMIN, channel min, as this is later found to be insignificant.
m=as.matrix(cpu[,colnames(cpu) %in% c("MYCT", "MMAX", "CACH", "CHMAX", "MMIN")])
# I scale the matrix before clustering as variables are in different dimensions.
# Next, I run the kmeans clustering algorithm, setting the seed to get the same results each time.
set.seed(104)
cl=(kmeans(scale(m),3))
# I assign levels to the clusters based on visual inspection of the resulting three.
cluster = as.factor(cl$cluster)
levels(cluster) = c("Budget", "Midrange", "Premium")
# I also highlight Amdahl, and its main competitors among the list of vendors.
competitors = which(sort(unique(cpu$Vendor)) %in% c("amdahl", "gould", "nas", "sperry", "ibm", "siemens", "cdc"))
# Lets highlight the interesting competitors among the 30 brands
label_face = rep("plain", length(unique(cpu$Vendor)))
label_face[competitors] = "bold"
label_color = rep("gray40", length(unique(cpu$Vendor)))
label_color[competitors] = "black"
# Now I plot each CPU per brand against its benchmark and its associated cluster.
# As I repeat some of these commands later in the document, I create a reusable function.
expandbm.plot = function(p){
p <- p + scale_colour_manual(name="Algorithm \nClusters",
values = c("Premium"="maroon", "Midrange"="gray", "Budget"="black"))
p <- p + xlab(" ") + ylab("Relative Performance Benchmark")
p <- p + theme(axis.text.y=element_text(size = rel(1.8)))
p <- p + scale_y_continuous(breaks = seq(0, 1200, by = 200))
}
# Now I use this function to make a plot. Jitter is added for clarity of overlapping points.
p <- ggplot(cpu, aes(Vendor, PRP))
p <- p + geom_point(size=4, aes(colour = cluster), position = position_jitter(width = .2))
p <- p + theme(axis.text.x = element_text(angle = 60, hjust = 1, size=16))
p <- p + theme(axis.text.x=element_text(face=label_face, color=label_color))
p <- expandbm.plot(p)
p
To generate the clusters, only the test metrics of the CPUs were used. I excluded the minimum channels metric as it is later seen to be insignificant in influencing BYTE’s performance benchmark. We can see a clear correlation between BYTE’s benchmark and the clusters; however, there are still many differences between BYTE’s benchmark and the algorithmic clustering. Amdahl’s close competitors, clustering in the same group, are Sperry, NAS, IBM, Siemens, CDC and, to a lesser extent, Gould. A potential issue identified is that nearly all Amdahl CPUs are clustered Premium, yet many are assigned midrange BYTE benchmarks. Conversely, CDC is clustered Midrange but receives high BYTE benchmarks.
It should also be noted that the Premium and Budget clusters are very small compared to the Midrange cluster. With k-means clustering we cannot control the size of the clusters. An alternative approach to try for any improvement may be hierarchical clustering.
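As a quick illustration of that alternative (a minimal sketch only, not part of the main analysis; the Ward linkage choice is my own assumption), hierarchical clustering can be run on the same scaled test metrics and compared against the k-means assignment.
# A minimal sketch of hierarchical clustering on the same scaled metrics
d <- dist(scale(m))                      # Euclidean distances between CPUs
hc <- hclust(d, method = "ward.D2")      # Ward linkage tends to give compact groups
cluster.hc <- cutree(hc, k = 3)          # cut the dendrogram into three groups
table(cluster.hc, cluster)               # cross-tabulate against the k-means clusters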
I now explore which models are competing with Amdahl in the high end CPU space based on the BYTE’s benchmark.
# I shall filter on the CPUs receiving high benchmarks from BYTE magazine.
cpu.hi = cpu[cpu$PRP>300,]
# convert to matrix, including only the relevant test metrics and the high-rated CPUs
m=as.matrix(cpu.hi[,colnames(cpu) %in% c("MYCT", "MMAX", "CACH", "CHMAX", "MMIN")])
# I display the clusters; the cluster vector must also be filtered.
cluster.hi = cluster[cpu$PRP>300]
# set label height to remove label overlap of close products.
# The overlapping products were found on manual inspection and manually entered below.
vertical=rep(1.5,17)
vertical[which(cpu.hi$Model %in% c("3081", "as/9060", "cyber:170/760"))] = rep(-0.5,3)
# Now I plot each CPU per brand against its benchmark and its associated cluster.
p <- ggplot(cpu.hi, aes(Vendor, PRP))
p <- p + geom_point(size=4, aes(colour = cluster.hi), position = position_jitter(width = .1))
p <- p + theme(axis.text.x = element_text(hjust = 1, size=16))
p <- p + geom_text(aes(label=cpu.hi$Model),hjust=0.5, vjust=vertical)
p <- expandbm.plot(p)
p
Above, I show the CPU models rated highly (above 300) in BYTE’s benchmark. As an early-stage corporation, Amdahl is standing up against the established brands IBM, Siemens and CDC. To be watched are the similar early-stage competitors Sperry and NAS.
There is an obvious competitive advantage in being able to predict the key attributes that go into BYTE’s benchmark. Next I investigate and prioritize the most important metrics from our tests influencing the benchmark, using linear regression modelling.
In order to calculate the Estimated Relative Performance (ERP) you find in our CPU data set, we used a linear regression method at the Tel Aviv University labs.
# Create the predictor and response variables.
y <- cpu$PRP
X <- subset(cpu, select = -c(Model, Vendor, PRP, ERP))
X <- model.matrix(~., X)
# Extract the coefficients of the model into a matrix
lm.sum = summary(lm(y ~ X))$coefficients
# Extract from these coefficients the name of each metric and its significance
# Sort these in order before displaying
signif <- data.frame(metrics = rownames(lm.sum[-1,]), significance = lm.sum[-1,4])
signif <- data.frame(signif[with(signif, order(significance)), ])
rownames(signif) = 1:6
signif
## metrics significance
## 1 XMMAX 1.317802e-15
## 2 XMMIN 9.419325e-15
## 3 XCHMAX 1.646289e-10
## 4 XCACH 7.585893e-06
## 5 XMYCT 5.798460e-03
## 6 XCHMIN 7.523592e-01
Above we see the p-value, or statistical significance, of each individual variable’s effect on the BYTE performance index. In layman’s terms, this is the chance of seeing such a result if the variable really had no effect on the benchmark. It is not an indicator of how large an effect the variable has on the benchmark, only whether it has one. It also does not consider non-linearities or variable interactions. All variables show as statistically significant except Minimum Channels.
Below we see the average deviation of Prof Ein-Dor’s linear regression predictions (ERP) from BYTE’s Relative Performance metric (PRP).
deviation = round(mean(abs(cpu$PRP-cpu$ERP)/cpu$PRP),3)*100
rmse = round(sqrt(mean((cpu$PRP-cpu$ERP)^2)),3)
c("Deviation % = ", deviation, " Root Mean Squared Error : ", rmse)
## [1] "Deviation % = " "33.9"
## [3] " Root Mean Squared Error : " "41.681"
Next I shall plot the estimated performance benchmark, using Linear Regression, against the actual performance benchmark received from BYTE.
# The originally regressed numbers are provided in the data set. Therefore, I shall examine the original numbers produced by the Tel Aviv University team instead of remodelling.
label_color = rep("gray40", length(cpu$Vendor))
label_color[cpu$Vendor == "amdahl"] = "red"
# Plot the raw machine.data features PRP vs ERP (Published Relative Performance vs Estimated Relative Performance)
p <- ggplot(cpu, aes(ERP, PRP))
p <- p + geom_abline(intercept=0, slope=1, colour = "white", size=1)
p <- p + geom_point(size=4, colour = label_color)
p <- p + scale_x_log10(limits=c(10,1200)) + scale_y_log10(limits=c(10,1200))
p <- p + xlab("BYTE : Relative Performance benchmark \n(PRP from raw data)") + ylab("Linear Regression Estimated Benchmark \n(ERP from raw data)")
p
Above I plot the results of the Tel Aviv team’s Estimated Relative Performance, a linear regression model which was run on an Amdahl 470v/7, predicting the benchmark for all 209 CPUs.
This linear regression approach has its drawbacks in that it cannot capture non-linear dependencies or metric interactions affecting the benchmark unless these are specifically called out in the model. Workarounds to these issues exist within linear modelling; however, they would vastly increase the size of the model matrix, which would be unfeasible even with the powerful compute of 1982.
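To illustrate one such workaround (a minimal sketch with illustrative terms of my own choosing, not the Tel Aviv model), interaction and log terms can be written directly into the linear model formula.
# A minimal sketch of adding log and interaction terms to a linear model (illustrative terms only)
lm.inter <- lm(PRP ~ log(MMAX) * CACH + log(MMIN) + CHMAX + MYCT, data = cpu)
summary(lm.inter)$adj.r.squared          # in-sample fit only; not cross-validated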
Conclusions
To highlight the conclusions of the last three months’ investigation at Tel Aviv University:
The CPU market has not consolidated. Customers can choose from over 200 CPUs and 30 brands.
Tests have proven Amdahl have a premium product.
Amdahl have approximately six direct competitors offering premium CPUs, including both established brands and start-ups.
BYTE’s published performance benchmark influences customer choice. We recommend using our metrics and linear regression model to predict the BYTE benchmark during product development. This will provide Amdahl with a competitive advantage.
We recommend considering market differentiators other than CPU performance, such as,
Beyond the conclusions above, I shall start by comparing the results of Principal Component Analysis (PCA) against the k-means clustering model, to see if the grouping of the CPUs can be improved or more information can be interpreted. Principal component analysis attempts to project all the variance seen in the test metrics used earlier for clustering onto a lower number of dimensions. The first dimension represents the axis of highest variance seen in the metrics, the second dimension the second-highest axis, and so on.
On our plots I shall display the first two dimensions only, and I shall report the variance explained by these first two dimensions. In order to compare how the dimensions correlate with BYTE’s benchmark, I shall bucket the CPUs into three equally sized buckets based on the benchmark, each represented by a different point color.
# I shall now take the same matrix as was used for the clustering
m=as.matrix(cpu[,colnames(cpu) %in% c("MYCT", "MMAX", "CACH", "CHMAX", "MMIN")])
# While running PCA I will set all variables to the same scale.
pr.out = prcomp(m, scale=TRUE)
# Now I look at the variance explained by each component
print(paste("Proportion of variance explained by component ", 1:5, ":", round(pr.out$sdev^2/sum(pr.out$sdev^2), 2)))
## [1] "Proportion of variance explained by component 1 : 0.56"
## [2] "Proportion of variance explained by component 2 : 0.16"
## [3] "Proportion of variance explained by component 3 : 0.15"
## [4] "Proportion of variance explained by component 4 : 0.1"
## [5] "Proportion of variance explained by component 5 : 0.04"
We see above that the first two components explain 72% of the variance. Now let’s plot these first two dimensions and color-code the bucketed BYTE benchmark.
# First I will bucket the Benchmark
buckets = factor(cut_number(cpu$PRP, 3))
levels(buckets) = c("6 to 33", "33 to 72", "72 to 1200")
# I convert the components to a data frame for ggplot compatibility
pr.out.df = data.frame(pr.out$x)
# Now I plot
p <- ggplot(pr.out.df, aes(x=PC1, y=PC2, color = buckets))
p <- p + geom_point(data=pr.out.df[cpu$Vendor=="amdahl", ], aes(x=PC1, y=PC2), colour="red", size=6)
p <- p + geom_point(size = 4)
p <- p + scale_colour_brewer(drop=TRUE, limits = levels(buckets), palette="GnBu", name = "Published \nBenchmark \nScore")
p <- p + xlab("Principal Component 1") + ylab("Principal Component 2")
p <- p + theme_bw()
p
We see above the first two components. Amdahl’s CPUs are marked in red, with only six visible occurrences: since some Amdahl CPUs have identical test metrics, they overlap on the chart. Compared to the clustering model displayed earlier, we still see overlap in the benchmark intervals. However, Amdahl’s CPUs are well separated in terms of performance and sit among CPUs with similarly high benchmark scores. This could be more useful than the original clustering method for finding similar models and identifying outliers in BYTE’s benchmark.
Below we can see the same PCA plot with the loading of each variable. For example, notice the lower benchmark bracket (6 to 33) lies in the direction of MYCT, meaning a high machine cycle time (a poor performer), and almost in the opposite direction of maximum memory (MMAX), meaning low main memory. So the right side of the chart contains the poorer performers. The first principal component runs parallel to MMAX, indicating the biggest variation in the data is low versus high memory. The second component differentiates in the direction of machine cycle time and maximum channels (CHMAX).
Given this, Amdahl, in red, sits with the highest performers: high memory and low machine cycle time.
# Print the PCA plot with the variable loading arrows
p <- ggord(pr.out, buckets)
p <- p + scale_colour_brewer(drop=TRUE, limits = levels(buckets), palette="GnBu", name = "Published \nBenchmark \nScore")
p <- p+ geom_point(data=pr.out.df[cpu$Vendor=="amdahl", ], aes(x=PC1, y=PC2), colour="red", size=6, shape=79)
p <- p + ggtitle("Principal Component Loading Arrows")
p
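For reference, the numeric loadings behind the arrows above can be inspected directly from the prcomp output.
# Variable loadings on the first two principal components
round(pr.out$rotation[, 1:2], 2)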
I will now attempt to improve on Prof Ein-Dor’s linear regression using a number of modern techniques, both parametric (e.g. Generalised Linear Model) and non-parametric (e.g. Random Forest). Before running the models, I will prepare the data to be modelled.
Prof Ein-Dor’s error measurement, reported on UCI, was mean deviation percentage. This is not standard today and was likely used in 1982 because regression models were often computed manually and it is a simpler metric. Therefore I will switch our metric to root mean squared error, which penalises large errors more heavily and is regularly used today for regression models. I first need to calculate the RMSE for Prof Ein-Dor’s estimates.
RMSE = round(sqrt(mean((cpu$ERP-cpu$PRP)^2)),2)
# [1] "Original Linear Regression Estimate RMSE : 41.68"
print(paste("Original Linear Regression Estimate RMSE : ", RMSE))
## [1] "Original Linear Regression Estimate RMSE : 41.68"
Brand may influence BYTE’s benchmark; however, I suspect it may lead to overfitting, as some vendors have only one or two CPUs on the market, so I will not model it. I will, however, first scale the predictor variables, which helps models that apply variable penalties.
# Next I create a matrix of predictor variables with only the test metrics included
cpu_scale = data.frame(scale(cpu[,3:8], center = FALSE, scale = TRUE))
cpu_scale = cbind(Vendor = cpu$Vendor, Model = cpu$Model, cpu_scale, PRP = cpu$PRP)
I shall use the caret library to quickly tune and cross-validate models. While I am not aware how or whether Prof Ein-Dor originally cross-validated, I shall use 5-fold cross-validation for our tests.
fitControl <- trainControl(method = "cv",
number = 5)
Before using other models, I will verify the original linear regression fit by Prof Ein-Dor using cross-validation.
Linear Regression
# For linear regression there are no major tuning parameters to adjust, therefore I will not use a parameter grid
set.seed(101)
lmFit <- train(PRP ~ ., data = cpu[3:9],
method = "lm",
verbose = FALSE,
trControl = fitControl)
# [1] "Best Linear Regression RMSE : 68.35"
print(paste("Best Linear Regression RMSE : ", round(min(lmFit$results$RMSE),2)))
## [1] "Best Linear Regression RMSE : 68.35"
Above we see that the RMSE of today’s linear regression does not come close to the accuracy of the original linear regression found in the data set field ERP (Estimated Relative Performance). Following an online search, I have seen that others received approximately the same RMSE as me (reference: Data Mining and Business Analytics with R, Johannes Ledolter, RMSE of 69.35 with leave-one-out cross-validation, link).
After validating the data, I have seen that records within the data set with non-unique test metrics have the same ERP (Estimated Relative Performance) value but a different PRP value, or BYTE benchmark. Therefore I am confident that Prof Ein-Dor did not use features outside the data set, but rather used a feature transformation of the data set.
I attempted some feature transformations (logging, scaling, ratios, feature selection, etc.) but could not get close to the original results with linear regression. I also attempted training and testing on the same data, which achieved an RMSE of approximately 58.
This leaves an open question of how Prof. Ein-Dor achieved such an accurate result.
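For illustration, one attempt along those lines looked like the sketch below (the exact terms are illustrative, and the RMSE reported by caret here is on the log scale, so it is not directly comparable to the figures above).
# A minimal sketch of one attempted transformation: regress the log benchmark on logged features
set.seed(101)
lmLogFit <- train(log(PRP) ~ log(MYCT) + log(MMIN) + log(MMAX) + log(CACH + 1) + CHMAX,
data = cpu, method = "lm", trControl = fitControl)
min(lmLogFit$results$RMSE)               # RMSE on the log scale only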
I shall now run through a number of models, excluding brand name as a predictor. I shall manually tune each one (without showing all tuning attempts) until I reach close to a minimum. For each model the respective RMSE is reported, along with the best parameters. It should be noted that, as I run through many parameters, there is a slight risk of overtuning a model to a particular seed; however, this should not cause a large difference.
Gradient Boosting
# First I try gradient boosting machine, a non parametric tree based model.
# I set the parameters I would like to test within the model and set the seed.
gbmGrid <- expand.grid(interaction.depth = c(1,2,4,6),
n.trees = (1:5)*20,
shrinkage = c(.01,.05,.1, .3,.5),
n.minobsinnode = c(1,2,4))
set.seed(825)
# Now I fit the data each time using different parameters from the parameter grid specified above.
gbmFit <- train(PRP ~ ., data = cpu[,3:9],
method = "gbm",
trControl = fitControl,
verbose = FALSE,
tuneGrid = gbmGrid)
# Now I output the RMSE of the best parameters.
# "Best GBM RMSE : 45.91"
# Best tune parameters :
# n.trees interaction.depth shrinkage n.minobsinnode
# 100 4 0.1 1
print(paste("Best GBM RMSE : ", round(min(gbmFit$results$RMSE),2)))
## [1] "Best GBM RMSE : 45.91"
Random Forest
# Unfortunately, in the caret library only the mtry parameter can be tuned with a grid.
# I attempted manually tuning other parameters but it did not improve the results.
rfGrid <- expand.grid( mtry = c(1:10))
set.seed(825)
rfFit <- train(PRP ~ ., data = cpu[,c(3:9)],
method = "rf",
trControl = fitControl,
verbose = FALSE,
tuneGrid = rfGrid)
# [1] "Best Random Forest RMSE : 49.92"
# rfFit$bestTune
# mtry
# 5
print(paste("Best Random Forest RMSE : ", round(min(rfFit$results$RMSE),2)))
## [1] "Best Random Forest RMSE : 49.92"
Support Vector Machine
# As Support Vector Machines use a penalty on attributes I shall scale the variables.
# I shall only use the Radial basis as this produces the best results.
svmGrid <- expand.grid(sigma=10^seq(-4,-2,.5), C=1:5*50)
set.seed(825)
svmFit <- train(PRP ~ ., data = cpu_scale[,3:9],
method = "svmRadial",
trControl = fitControl,
verbose = FALSE,
tuneGrid = svmGrid)
# [1] "Best SVM model : 46.81"
# svmFit$bestTune
# sigma C
# 0.003162278 100
print(paste("Best SVM model : ", round(min(svmFit$results$RMSE),2)))
## [1] "Best SVM model : 46.81"
Regularised Generalised Linear Model
# Again, I have a penalty with glmnet so I will use the scaled data.
glmnetGrid <- expand.grid(.alpha = 10^-5:1, .lambda = 10^-5:1)
set.seed(825)
glmnetFit <- train(PRP ~ ., data = cpu_scale[,3:9],
method = "glmnet",
trControl = fitControl,
tuneGrid = glmnetGrid)
# [1] "Best glm model : 66.07"
# glmnetFit$bestTune
# alpha lambda
# 1e-05 1e-05
print(paste("Best glm model : ", round(min(glmnetFit$results$RMSE),2)))
## [1] "Best glm model : 66.07"
Extreme Gradient Boosting
As GBM was the best performer so far, I shall try a more modern implementation of this model. In practice it often performs better than GBM, and many other machine learning algorithms, on large data sets. In addition, it runs very fast over large data sets, although this is less important with our small data set.
# set our seed for reproducible results
set.seed(825)
# First find the best number of trees to use.
# This is done by creating a validation partition (10% of the data) to measure the best number of trees and perform cross validation on the rest of the data
val = createDataPartition(cpu$PRP, times = 1, p = 0.1, list = FALSE)
# Set the max possible number of rounds. The model will create trees to the min of num_rounds or the best score on the validation partition.
num_rounds = 5000
# Create our training and validation sets in sparse matrices, and a watchlist to find the best number of iterations.
xgtrain <- xgb.DMatrix(data = as.matrix(cpu_scale[-val,3:8]), label= cpu$PRP[-val])
xgval <- xgb.DMatrix(data = as.matrix(cpu_scale[val,3:8]), label= cpu$PRP[val])
watchlist <- list(val=xgval, train=xgtrain)
# Create our best parameter list after some light tuning.
param <- list("objective" = "reg:linear", "eta" = 0.05, "max_depth" = 2, "subsample" = 1)
# Run a model which will stop 20 rounds after the best number of trees is found in the validation set
bst.iter <- xgb.cv(params = param, data = xgtrain, nround=num_rounds, verbose=0,
watchlist=watchlist, early.stop.round = 20, nfold=5, prediction=FALSE, maximize = FALSE)
## Stopping. Best iteration: 403
# Use the above result of the best iteration of trees and refit the model on the full data set
bst.cv = xgb.cv(param=param, data=as.matrix(cpu_scale[,3:8]), label = y, nfold = 5, print.every.n = 50,
nrounds=dim(bst.iter)[1], prediction=TRUE)
## [0] train-rmse:184.204397+11.940349 test-rmse:178.742840+51.047679
## [50] train-rmse:43.663303+1.024490 test-rmse:59.104935+21.442685
## [100] train-rmse:26.391844+0.902783 test-rmse:47.396313+11.314469
## [150] train-rmse:23.361406+1.018245 test-rmse:45.593499+9.509731
## [200] train-rmse:21.641441+1.161725 test-rmse:45.150965+9.354368
## [250] train-rmse:20.302945+1.280106 test-rmse:44.879855+9.067232
## [300] train-rmse:19.169482+1.290023 test-rmse:44.882710+8.712800
## [350] train-rmse:18.263880+1.180070 test-rmse:44.789888+8.508430
## [400] train-rmse:17.557737+1.102042 test-rmse:44.919006+8.541803
# [1] "Best Extreme Gradient Boosting model : 45.63"
print(paste("Best Extreme Gradient Boosting model : ", round(sqrt(mean((bst.cv$pred-cpu$PRP)^2)),2)))
## [1] "Best Extreme Gradient Boosting model : 45.63"
With all six models, the relative performance prediction from Prof Ein-Dor in 1982 is not bettered. With more time I believe I could find Prof Ein-Dor’s secret sauce, perhaps an Amdahl 470v/7, but most likely a feature transformation. The best of the modern models is gradient boosting, with the Extreme Gradient Boosting implementation scoring marginally better than gbm. However, as the gbm results seem more stable on small data sets, I recommend using the R package gbm.
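As an example of how such a model would be used in practice, the tuned gbmFit object can score a prototype specification during development (the specification values below are purely illustrative).
# A minimal sketch: estimate BYTE's benchmark for a hypothetical prototype specification
prototype <- data.frame(MYCT = 30, MMIN = 8000, MMAX = 32000,
CACH = 64, CHMIN = 8, CHMAX = 32)    # illustrative values only
predict(gbmFit, newdata = prototype)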
Coupling Prof Ein-Dor’s feature transformation technique with SVM or gradient boosting would most likely give a much lower RMSE.
Finally, I shall revisit the importance of the metrics in driving BYTE’s benchmark, using the random forest model’s feature importance output.
# Set seed for reproducible results.
set.seed(825)
# Create a random forest model using the optimal parameters.
# Notice here Brand or Vendor is included to see if this influences the BYTE benchmark.
rf.mod = randomForest(PRP ~ ., data = cpu[,c(1,3:9)], mtry=5, importance=TRUE)
# Extract the feature importance from the model.
imp <- importance(rf.mod, type=1)
# Set the names of the variables and load to a data frame.
row.names(imp) = c("Vendor", "Cycle time", "Min. Main Memory",
"Max. Main Memory", "Cache Memory",
"Min. Channels", "Max. Channels")
featureImportance <- data.frame(Feature=row.names(imp), Importance=imp[,1])
# Output to a plot, reordering the features by importance in predicting the BYTE benchmark, PRP.
p <- ggplot(featureImportance, aes(x=reorder(Feature, Importance), y=Importance)) +
geom_bar(stat="identity", fill="#53cfff") +
coord_flip() +
theme_light(base_size=14) +
xlab("") +
ylab("Importance") +
ggtitle("BYTE Benchmark Feature Importance\n") +
theme(plot.title=element_text(size=18))
p
Wrap Up
While disappointed that I could not better the results of Prof Ein-Dor this August Bank Holiday weekend, I am confident that, with the modern prediction techniques shown and Prof Ein-Dor’s feature transformation methods from 1982, a much lower RMSE could be realised. I am also confident Prof Ein-Dor used a feature transformation unknown to us, as records with identical test metrics received identical ERP values yet different PRP values.
We have also shown some other methods, such as k-means clustering, PCA and feature importance, which could be very useful for Amdahl’s business decisions, as well as giving a good insight into the CPUs on offer in 1982. I hope you found the report thought-provoking and the findings worth your time.
You can find a visualisation of the data here.
If you have any questions please contact me at : darragh.hanley [at] gmail [dot] com