OpenStack日志分析笔记——合并日志+提取时间

由于本人水平有限，错漏和不妥之处在所难免，殷望批评指正！

OpenStack日志结构

读取日志文件:nova-api.log

library(readr)
novaApi<-read_lines("G:/nova-api.log")
length(novaApi)

## [1] 3234903

日志文件普遍比较大，readr::read_lines是我目前找到读取日志文件最快的函数了，如果有谁知道更快的，告诉我一下

head(novaApi)

## [1] "2017-06-22 03:07:14.154 23317 DEBUG nova.osapi_compute.wsgi.server [-] (23317) accepted ('192.168.17.2', 56980) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:826"                                                                                       
## [2] "2017-06-22 03:07:14.155 23317 DEBUG nova.api.openstack.wsgi [-] Calling method '<bound method Versions.index of <nova.api.openstack.compute.versions.Versions object at 0x8437f50>>' _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:792"
## [3] "2017-06-22 03:07:14.156 23317 INFO nova.osapi_compute.wsgi.server [-] 192.168.17.2 \"OPTIONS / HTTP/1.0\" status: 200 len: 501 time: 0.0012259"                                                                                                                     
## [4] "2017-06-22 03:07:14.608 23351 DEBUG nova.metadata.wsgi.server [-] (23351) accepted ('192.168.17.2', 40228) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:826"                                                                                            
## [5] "2017-06-22 03:07:14.613 23351 INFO nova.api.ec2 [-] 0.311s 192.168.17.2 OPTIONS / None:None 200 [None] text/plain text/plain"                                                                                                                                       
## [6] "2017-06-22 03:07:14.613 23351 INFO nova.metadata.wsgi.server [-] 192.168.17.2 \"OPTIONS / HTTP/1.0\" status: 200 len: 234 time: 0.0014119"

基本结构：日期+时间+日志级别+模块+日志内容+方法+源码路径。

也有些日志没有方法和源码路径：

## [1] "2017-06-22 03:07:14.156 23317 INFO nova.osapi_compute.wsgi.server [-] 192.168.17.2 \"OPTIONS / HTTP/1.0\" status: 200 len: 501 time: 0.0012259"

还有一些日志以多行的形式打出：

## [1] "2017-06-22 11:32:17.634 8748 DEBUG keystoneclient.session [-] RESP: [300] date: Thu, 22 Jun 2017 03:32:17 GMT vary: X-Auth-Token content-length: 599 content-type: application/json connection: close "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [2] "RESP BODY: {\"versions\": {\"values\": [{\"status\": \"stable\", \"updated\": \"2015-03-30T00:00:00Z\", \"media-types\": [{\"base\": \"application/json\", \"type\": \"application/vnd.openstack.identity-v3+json\"}], \"id\": \"v3.4\", \"links\": [{\"href\": \"http://192.168.19.252:35357/v3/\", \"rel\": \"self\"}]}, {\"status\": \"stable\", \"updated\": \"2014-04-17T00:00:00Z\", \"media-types\": [{\"base\": \"application/json\", \"type\": \"application/vnd.openstack.identity-v2.0+json\"}], \"id\": \"v2.0\", \"links\": [{\"href\": \"http://192.168.19.252:35357/v2.0/\", \"rel\": \"self\"}, {\"href\": \"http://docs.openstack.org/\", \"type\": \"text/html\", \"rel\": \"describedby\"}]}]}}"
## [3] " _http_log_response /usr/lib/python2.7/site-packages/keystoneclient/session.py:215"

日志的结构并不十分统一，而且开发人员修改源码的时候也会打一些自己定义的日志，所以有些方案是从源码出发，从源码中解析出日志的模版结构，如：

Large-Scale System Problems Detection by Mining Console Log

http://nma.berkeley.edu/ark:/28722/bk0005k9b6k

目前，日志结构统一的部分是时间+日志级别+模块。为了方便处理，把以多行形式打出的日志重新整理成一行，作为一条日志对象。这样处理的假定条件是：

1、每一条日志都以时间戳开始；

2、忽略日志之间的干扰，认为日志的抛出是以条为单位且有序的

实际上，有时候会因为网络、日志收集方式等原因会导致我们看到的日志是残缺且无序的，这种情况的日志相对来讲量不多，所以一般被passed掉，如果量很多，说明这个系统问题挺大了

unifyLog<-function(logSet,pattern='[0-9]+[-]+[0-9]+[0-9]+[ ]+[0-9]+[:]+[0-9]+[:]+[0-9]',outPutTime=T){
  timeofLog<-substr(logSet,1,23)
  normalLogSerial<-grep(pattern,timeofLog)
  diffSerial<-diff(normalLogSerial)
  breakPoint<-which(diffSerial>1)
  lapply(breakPoint,
         function(i){
           logSet[normalLogSerial[i]]<<-paste0(logSet[normalLogSerial[i]:(normalLogSerial[i]+diffSerial[i]-1)],collapse = '')
         }
  )
  if(outPutTime){
    return(list(unifiedLog=logSet[normalLogSerial],timeofLog=timeofLog[normalLogSerial]))
  }else{
    return(logSet[normalLogSerial])
  }
}
timeConsuming<-system.time(unifiedLog<-unifyLog(novaApi))
str(unifiedLog)

## List of 2
##  $ unifiedLog: chr [1:2775321] "2017-06-22 03:07:14.154 23317 DEBUG nova.osapi_compute.wsgi.server [-] (23317) accepted ('192.168.17.2', 56980)"| __truncated__ "2017-06-22 03:07:14.155 23317 DEBUG nova.api.openstack.wsgi [-] Calling method '<bound method Versions.index of"| __truncated__ "2017-06-22 03:07:14.156 23317 INFO nova.osapi_compute.wsgi.server [-] 192.168.17.2 \"OPTIONS / HTTP/1.0\" statu"| __truncated__ "2017-06-22 03:07:14.608 23351 DEBUG nova.metadata.wsgi.server [-] (23351) accepted ('192.168.17.2', 40228) serv"| __truncated__ ...
##  $ timeofLog : chr [1:2775321] "2017-06-22 03:07:14.154" "2017-06-22 03:07:14.155" "2017-06-22 03:07:14.156" "2017-06-22 03:07:14.608" ...

3234903行日志处理完后，实际日志条数为2775321条，提取出日志的时间，整个处理过程耗时：

timeConsuming

##    user  system elapsed 
##   11.75    0.25   12.16

有更快处理方案的，告诉我一下

OpenStack日志分析笔记——合并日志+提取时间

Yuanchun Zeng

2017年6月27日

OpenStack日志结构