Introduction

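This report extracts features from the logs produced while creating cloud instances: the target log entries are pulled offline, the time spent in each user-defined stage is derived from them, and the result is visualised.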

data

data<-read.csv("G:/creatingInstance.csv")
data
##                 X              t.200              t.400              t.600
## 1             api  92.86785216697686 199.95401671932598 310.55878177685167
## 2        schedule 370.19923743685189 593.32759702474993 740.13826529645769
## 3 compute message   385.425981122042 607.29682174828849 769.10284561469985
## 4         compute   732.582448816804 1079.2554514218009 1222.0473705247207
## 5            succ          4000/4000          7596/8000        10558/12000

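t.200, t.400, t.600: runs with 200, 400, and 600 compute nodes respectively. Instances are created at a ratio of 1:20, that is 20 instances per compute node (4000, 8000, and 12000 instances, as the succ row shows).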

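All times are measured from the moment the create-instance request is issued:

api: average cumulative time until the nova-api stage has finished processing the create-instance requests;

scheduler (the `schedule` row in the data): average cumulative time until the nova-scheduler stage has finished processing the create-instance requests;

compute message: average cumulative time until nova-compute starts processing the create-instance requests;

compute: average cumulative time until the nova-compute stage has finished processing the create-instance requests;

succ: success rate of instance creation, recorded as instances created / instances requested.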
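The succ row is stored as text such as 4000/4000, so it has to be converted before it can be compared across node counts. The following is a minimal sketch in base R, with the three strings copied from the table above; the helper name `succ_rate` is made up for illustration and is not part of the original analysis.

```r
# succ values copied from the table above (created / requested)
succ <- c(t.200 = "4000/4000", t.400 = "7596/8000", t.600 = "10558/12000")

# Hypothetical helper: split "a/b" and return a / b as a number
succ_rate <- function(s) {
  parts <- as.numeric(strsplit(s, "/", fixed = TRUE)[[1]])
  parts[1] / parts[2]
}

rates <- vapply(succ, succ_rate, numeric(1))
round(100 * rates, 2)  # success rate in percent per node count
```

With these inputs the rate falls from 100% at 200 nodes to roughly 95% at 400 nodes and 88% at 600 nodes.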

Timing analysis

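First, look at how the time spent in each module during instance creation relates to the number of nodes. The success rate is set aside for now, and the data is rearranged: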

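library(reshape2)
# Drop the succ row and the label column, then transpose: one row per node count
newdata<-as.data.frame(cbind(c(200,400,600),t(data[-5,-1])))
# The transposed values are text (or factors); convert every column to numeric
newdata[]<-lapply(newdata,function(col) as.numeric(as.vector(col)))
colnames(newdata)<-c('Node','api','scheduler','computeMessage','compute')
rownames(newdata)<-NULL
# Turn cumulative times into per-module times by differencing adjacent columns
new2data<-newdata
new2data[,3:ncol(new2data)]<-new2data[,3:ncol(new2data)]-new2data[,2:(ncol(new2data)-1)]
# Long format for ggplot2: one row per (Node, module) pair
meltdata<-melt(new2data,id.vars = 'Node',variable.name = "module",value.name = "timeElapsed")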
newdata
##   Node       api scheduler computeMessage   compute
## 1  200  92.86785  370.1992       385.4260  732.5824
## 2  400 199.95402  593.3276       607.2968 1079.2555
## 3  600 310.55878  740.1383       769.1028 1222.0474
meltdata
##    Node         module timeElapsed
## 1   200            api    92.86785
## 2   400            api   199.95402
## 3   600            api   310.55878
## 4   200      scheduler   277.33139
## 5   400      scheduler   393.37358
## 6   600      scheduler   429.57948
## 7   200 computeMessage    15.22674
## 8   400 computeMessage    13.96922
## 9   600 computeMessage    28.96458
## 10  200        compute   347.15647
## 11  400        compute   471.95863
## 12  600        compute   452.94452

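newdata gives, for each node count (Node), the average cumulative time from the start of instance creation until each module has finished its part.

meltdata gives, for each node count, the average time spent inside each module. It has been reshaped into long format to make the plots below easier to build.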
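The two tables are related by a first difference along the stage order. The sketch below illustrates this in base R for the 200-node row, using values copied from the printed newdata; the variable names are chosen here for illustration only.

```r
# Cumulative averages for 200 nodes, copied from newdata above
cumulative <- c(api = 92.86785, scheduler = 370.1992,
                computeMessage = 385.4260, compute = 732.5824)

# Per-stage time: the first stage is kept, later stages are differences
per_stage <- c(cumulative[1], diff(cumulative))
per_stage

# Summing the per-stage times recovers the cumulative values
all.equal(cumsum(per_stage), cumulative)
```

Because the printed table is rounded, the differences agree with meltdata only up to the displayed precision.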

library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.1
meltdata$module<-factor(meltdata$module,levels = rev(levels(meltdata$module)))
ggplot(data = meltdata,aes(x=Node,y=timeElapsed,fill=module))+
  geom_area(show.legend = T)+
  scale_fill_manual(values = alpha(c('black','green','blue','red'),.3))+
  labs(title='Elapsed time stack of creating instances',x='compute nodes',y='time(second)')+
  theme(legend.position = c(.06,.93),panel.background = element_rect(fill = 'white',colour = 'grey'))+
  geom_text(data = aggregate(timeElapsed~Node,data = meltdata,FUN = sum),
            aes(x=Node,y=timeElapsed,label=timeElapsed),inherit.aes = FALSE)

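The figure above is a stacked area chart of the time spent in each module. Overall, the time spent grows with the number of nodes: api and scheduler increase at every step, while computeMessage and compute do not increase monotonically. Because there are only three data points, the conclusions carry limited weight.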
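The trend can also be checked module by module from the numbers rather than from the picture. This is a small sketch in base R with the per-module averages copied from the printed meltdata; the matrix name `stage_time` is an illustrative choice.

```r
# Per-module averages copied from meltdata above (one row per node count)
stage_time <- rbind(
  "200" = c(api = 92.86785,  scheduler = 277.33139, computeMessage = 15.22674, compute = 347.15647),
  "400" = c(api = 199.95402, scheduler = 393.37358, computeMessage = 13.96922, compute = 471.95863),
  "600" = c(api = 310.55878, scheduler = 429.57948, computeMessage = 28.96458, compute = 452.94452)
)

# TRUE where a module's time strictly increases at every step in node count
monotonic <- apply(stage_time, 2, function(v) all(diff(v) > 0))
monotonic

# Share of the total time spent in each module, in percent per node count
share <- round(100 * prop.table(stage_time, margin = 1), 1)
share
```

The share table shows which modules dominate the total at each scale, which the stacked chart conveys only visually.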

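Taking the relationship between total elapsed time and the number of nodes as an example, fit a model to it and obtain the fitted equation and the related statistics: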

library(ggpmisc)
## Warning: package 'ggpmisc' was built under R version 3.4.1
myFormula<-y~I(log(x))
ggplot(newdata,aes(Node,compute))+
  geom_smooth(method = 'lm',formula =myFormula,se=F)+
  geom_point()+
  stat_poly_eq(formula = myFormula,
               eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = 'italic(log(x))',
               aes(label=paste('nls: ',..eq.label..,..rr.label..,..BIC.label..,sep='~~~')),
               label.x = 500,
               label.y = 1210,
               color='blue',
               size=3,
               parse=T)+
  geom_smooth(method = 'lm',se=F,color='red')+
  stat_poly_eq(formula = y~x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label=paste('lm: ',..eq.label..,..rr.label..,..BIC.label..,sep = '~~~')),
               label.x = 550,
               label.y = 1130,
               color='red',
               size=3,
               parse=T
               )

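The red line is the linear fit. The blue line is the fit that is linear in log(x); it is labelled "nls" in the plot, although it is fitted with lm rather than by nonlinear least squares. Judging by R-squared and BIC, the logarithmic model fits somewhat better than the linear one.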
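The same comparison can be reproduced without the plotting layer. The sketch below refits both models with `lm()` on the total-time values copied from the compute column of newdata, and collects R-squared and BIC in one table; the object names are illustrative, and the printed values may differ slightly from the plot labels because the inputs are rounded.

```r
# Total elapsed time (compute column of newdata) per node count
fit_data <- data.frame(Node = c(200, 400, 600),
                       compute = c(732.5824, 1079.2555, 1222.0474))

fit_linear <- lm(compute ~ Node, data = fit_data)       # red line in the plot
fit_log    <- lm(compute ~ log(Node), data = fit_data)  # blue line in the plot

# Side-by-side comparison of the two candidate models
comparison <- data.frame(
  model     = c("linear", "log"),
  r_squared = c(summary(fit_linear)$r.squared, summary(fit_log)$r.squared),
  BIC       = c(BIC(fit_linear), BIC(fit_log))
)
comparison
```

With three observations and two coefficients, each fit has a single residual degree of freedom, so the ranking should be read as indicative only.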

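Notes:

R-squared lies between 0 and 1; the larger it is, the better the fit.

BIC has no fixed range of values.

\[ BIC = k\ln(n) - 2\ln(L) \]

Here k is the number of model parameters, n is the sample size, and L is the value of the likelihood function. The first term depends on model complexity: the more parameters used to build the model, the more complex it is. The second term depends on the likelihood: a larger likelihood means the estimated parameter values are better. The first term therefore describes the model structure, and the second describes how good the parameter values are once the structure is fixed. Taken together, the criterion favours simple models with well-fitting parameters, so a smaller BIC is better.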
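For a Gaussian linear model the two terms can be evaluated by hand and checked against R's built-in `BIC()`. The sketch below does this for the logarithmic fit, under the assumption (matching what `logLik()` does for `lm` objects) that k counts the residual variance as a parameter in addition to the coefficients. The data values are copied from the compute column of newdata.

```r
x <- c(200, 400, 600)
y <- c(732.5824, 1079.2555, 1222.0474)
fit <- lm(y ~ log(x))

n    <- length(y)
rss  <- sum(residuals(fit)^2)
# Maximised Gaussian log-likelihood ln(L), using the ML variance rss / n
ln_L <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
# k: intercept, slope, and the residual variance
k    <- length(coef(fit)) + 1

bic_manual <- k * log(n) - 2 * ln_L
c(manual = bic_manual, builtin = BIC(fit))
```

Since both candidate models here have the same k and n, the one with the smaller residual sum of squares necessarily has the smaller BIC.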