Build features from the logs produced while creating cloud instances, extract the target log entries offline, compute the time consumed in each custom-defined stage, and visualize the results.
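The extraction step itself is not shown in this post; purely as a hypothetical illustration (the log format, stage markers, and regular expression below are assumptions, not the actual rules used to produce the CSV), matching the timestamps on the lines that open and close a stage gives that stage's elapsed time:
# hypothetical nova log lines; real formats and markers will differ
logLines<-c("2017-08-01 10:00:00.000 INFO nova.api req-1 create started",
            "2017-08-01 10:00:01.532 INFO nova.api req-1 create finished")
# extract the leading timestamp and parse it (%OS keeps fractional seconds)
ts<-as.POSIXct(sub("^(\\S+ \\S+).*","\\1",logLines),format="%Y-%m-%d %H:%M:%OS")
as.numeric(diff(ts),units="secs")    # seconds spent in this stage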
data<-read.csv("G:/creatingInstance.csv")
data
## X t.200 t.400 t.600
## 1 api 92.86785216697686 199.95401671932598 310.55878177685167
## 2 schedule 370.19923743685189 593.32759702474993 740.13826529645769
## 3 compute message 385.425981122042 607.29682174828849 769.10284561469985
## 4 compute 732.582448816804 1079.2554514218009 1222.0473705247207
## 5 succ 4000/4000 7596/8000 10558/12000
t.200, t.400, t.600: compute-node counts of 200, 400, and 600 respectively, with the number of instances created per batch running over 1:20.
Measured from the moment the instance-creation request is issued:
api: average total elapsed time until the nova-api stage finishes processing the creation request;
scheduler: average total elapsed time until the nova-scheduler stage finishes processing the creation request;
compute message: average total elapsed time until nova-compute begins processing the creation request;
compute: average total elapsed time until the nova-compute stage finishes processing the creation request;
succ: success rate of instance creation (a quick sketch for turning these fractions into numeric rates follows below).
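As referenced above, a minimal sketch (assuming `data` as read in at the top) that parses the succ fractions into numeric rates:
succ<-vapply(data[5,-1],as.character,character(1))  # the succ row as strings
succRate<-vapply(strsplit(succ,"/"),
                 function(p)as.numeric(p[1])/as.numeric(p[2]),numeric(1))
succRate
##     t.200     t.400     t.600
## 1.0000000 0.9495000 0.8798333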
Let us first look at how the time consumed in each module during instance creation relates to the node count, leaving the success-rate metric aside for now, and rearrange the data:
library(reshape2)
# transpose: one row per node count, one column per stage (drop the succ row)
newdata<-as.data.frame(cbind(c(200,400,600),t(data[-5,-1])))
# coerce the factor/character columns to numeric
newdata[]<-lapply(newdata,function(col)as.numeric(as.vector(col)))
colnames(newdata)<-c('Node','api','scheduler','computeMessage','compute')
rownames(newdata)<-NULL
# difference adjacent columns: cumulative totals -> per-module elapsed time
new2data<-newdata
new2data[,3:ncol(new2data)]<-new2data[,3:ncol(new2data)]-new2data[,2:(ncol(new2data)-1)]
# long format for ggplot2
meltdata<-melt(new2data,id.vars = 'Node',variable.name = "module",value.name = "timeElapsed")
newdata
## Node api scheduler computeMessage compute
## 1 200 92.86785 370.1992 385.4260 732.5824
## 2 400 199.95402 593.3276 607.2968 1079.2555
## 3 600 310.55878 740.1383 769.1028 1222.0474
meltdata
## Node module timeElapsed
## 1 200 api 92.86785
## 2 400 api 199.95402
## 3 600 api 310.55878
## 4 200 scheduler 277.33139
## 5 400 scheduler 393.37358
## 6 600 scheduler 429.57948
## 7 200 computeMessage 15.22674
## 8 400 computeMessage 13.96922
## 9 600 computeMessage 28.96458
## 10 200 compute 347.15647
## 11 400 compute 471.95863
## 12 600 compute 452.94452
newdata describes, for each node count (Node), the average total elapsed time from the start of instance creation until each module finishes processing.
meltdata describes, for each node count, the average time spent within each module; its structure was reshaped to make the plots below easier to draw.
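A quick sanity check on the reshape: since the per-module times were obtained by differencing the cumulative columns, they should sum back to the cumulative compute totals in newdata:
aggregate(timeElapsed~Node,data = meltdata,FUN = sum)
##   Node timeElapsed
## 1  200    732.5824
## 2  400   1079.2555
## 3  600   1222.0474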
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.1
# reverse the factor levels so the stacking order follows the pipeline order
meltdata$module<-factor(meltdata$module,levels = rev(levels(meltdata$module)))
totalTime<-tapply(meltdata$timeElapsed,meltdata$Node,sum)  # stack totals per node count
ggplot(data = meltdata,aes(x=Node,y=timeElapsed,fill=module))+
  geom_area(show.legend = T)+
  scale_fill_manual(values = alpha(c('black','green','blue','red'),.3))+
  labs(title='Elapsed time stack of creating instances',x='compute nodes',y='time(second)')+
  theme(legend.position = c(.06,.93),panel.background = element_rect(fill = 'white',colour = 'grey'))+
  # annotate() draws each total once; geom_text() inside aes() would redraw
  # the same label once per data row
  annotate('text',x=c(200,400,600),y=totalTime,label=totalTime)
The figure above is a stacked-area chart of per-module time consumption: each module's elapsed time grows as the node count increases. With only three data points, however, the analysis carries limited weight.
Taking the relationship between total elapsed time and node count as an example, fit it and obtain the fitted formula together with the related statistics:
library(ggpmisc)
## Warning: package 'ggpmisc' was built under R version 3.4.1
myFormula<-y~I(log(x))
ggplot(newdata,aes(Node,compute))+
  geom_smooth(method = 'lm',formula = myFormula,se=F)+
  geom_point()+
  stat_poly_eq(formula = myFormula,
               eq.with.lhs = "italic(hat(y))~`=`~",
               eq.x.rhs = 'italic(log(x))',
               aes(label=paste('loglm: ',..eq.label..,..rr.label..,..BIC.label..,sep='~~~')),
               label.x = 500,
               label.y = 1210,
               color='blue',
               size=3,
               parse=T)+
  geom_smooth(method = 'lm',se=F,color='red')+
  stat_poly_eq(formula = y~x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label=paste('lm: ',..eq.label..,..rr.label..,..BIC.label..,sep = '~~~')),
               label.x = 550,
               label.y = 1130,
               color='red',
               size=3,
               parse=T)
The red curve is the linear fit and the blue one the log-linear fit; judging by R-squared and BIC, the log-linear model fits somewhat better than the linear one.
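The same comparison can be reproduced outside of ggpmisc with plain lm() fits; a minimal sketch, assuming newdata as built above:
fitLm<-lm(compute~Node,data = newdata)           # linear fit (red)
fitLog<-lm(compute~I(log(Node)),data = newdata)  # log-linear fit (blue)
c(lm=summary(fitLm)$r.squared,loglm=summary(fitLog)$r.squared)
c(lm=BIC(fitLm),loglm=BIC(fitLog))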
Note:
R-squared lies between 0 and 1; the larger it is, the better the fit;
BIC has no fixed range.
\[ BIC = k\ln(n) - 2\ln(L) \]
Here k is the number of model parameters, n the sample size, and L the maximized value of the likelihood function. The first term depends on model complexity: the more parameters used to build the model, the more complex it becomes. The second term depends on the likelihood: a larger likelihood means the estimated parameter values fit better. The first term judges the model structure; the second judges, once the structure is fixed, how good the parameter values are. Taken together they say: keep the model simple and the parameters optimal; the smaller the BIC, the better.
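A minimal sketch checking this formula against R's BIC() on the log-linear fit above; note that for lm models, k counts the coefficients plus the error variance:
fit<-lm(compute~I(log(Node)),data = newdata)
k<-length(coef(fit))+1               # intercept, slope, and sigma^2
n<-nrow(newdata)
k*log(n)-2*as.numeric(logLik(fit))   # BIC from the formula
BIC(fit)                             # matches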