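To analyze the Syslog data, the existing event extraction rules were applied to the raw data, and a table of counts by event type and hour was built for the full data set (using Hive).
In the same way, an hourly statistics table was built on the Syslog severity field (using Hive).
The data below are counts by date, hour, and severity.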
## Source: local data frame [4,615 x 5]
##
## log_date log_hour severity count log_date_hour
## (time) (int) (fctr) (dbl) (time)
## 1 2015-11-01 0 alert 60710 2015-11-01 00:00:00
## 2 2015-11-01 0 crit 19705 2015-11-01 00:00:00
## 3 2015-11-01 0 emerg 60 2015-11-01 00:00:00
## 4 2015-11-01 0 err 2311092 2015-11-01 00:00:00
## 5 2015-11-01 0 info 5808214 2015-11-01 00:00:00
## 6 2015-11-01 0 notice 19250828 2015-11-01 00:00:00
## 7 2015-11-01 0 warning 938941 2015-11-01 00:00:00
## 8 2015-11-01 1 alert 57796 2015-11-01 01:00:00
## 9 2015-11-01 1 crit 17956 2015-11-01 01:00:00
## 10 2015-11-01 1 emerg 60 2015-11-01 01:00:00
## .. ... ... ... ... ...
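The log_date_hour column shown above looks like the date and hour combined into one timestamp. A minimal sketch of one way to build it in base R, assuming log_date is a date and log_hour an integer hour. The UTC time zone and the toy rows are assumptions, not taken from the original data:

```r
# Toy rows standing in for the Hive export (values are made up).
d <- data.frame(
  log_date = as.Date(c("2015-11-01", "2015-11-01")),
  log_hour = c(0L, 1L)
)

# Paste date and zero-padded hour together, then parse as a timestamp.
# tz = "UTC" is an assumption; use the time zone the logs were written in.
d$log_date_hour <- as.POSIXct(
  paste(d$log_date, sprintf("%02d:00:00", d$log_hour)),
  tz = "UTC"
)
print(d)
```

Having a single timestamp column makes it straightforward to put the counts on a continuous time axis when plotting.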
## Source: local data frame [7 x 3]
##
## severity total prop
## (fctr) (dbl) (dbl)
## 1 emerg 39380 0.000002
## 2 crit 12883235 0.000536
## 3 alert 39630043 0.001648
## 4 warning 623511091 0.025931
## 5 err 3752663597 0.156069
## 6 info 3791553949 0.157687
## 7 notice 15824596164 0.658128
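The total and prop columns above can be reproduced by summing counts per severity and dividing by the grand total. The printed header suggests the original used dplyr; the sketch below uses base R on made-up toy values, so it illustrates the calculation rather than the author's exact code:

```r
# Toy stand-in for the date/hour/severity count table (values are made up).
sev <- data.frame(
  log_hour = c(0, 0, 0, 1, 1, 1),
  severity = c("alert", "err", "notice", "alert", "err", "notice"),
  count    = c(10, 30, 60, 20, 20, 60)
)

# Sum counts per severity, then divide by the grand total to get each share.
sev_total <- aggregate(count ~ severity, data = sev, FUN = sum)
names(sev_total)[2] <- "total"
sev_total$prop <- sev_total$total / sum(sev_total$total)

# Sort ascending by total, as in the printed table.
sev_total <- sev_total[order(sev_total$total), ]
print(sev_total)
```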
Some of the important severities have counts too small to see, so let's plot them again separately.
Let's also look at the data by date.
## Source: local data frame [198 x 3]
## Groups: log_date [?]
##
## log_date severity total
## (time) (fctr) (dbl)
## 1 2015-11-01 alert 1453968
## 2 2015-11-01 crit 475546
## 3 2015-11-01 emerg 1393
## 4 2015-11-01 err 38027911
## 5 2015-11-01 info 132673502
## 6 2015-11-01 notice 458145549
## 7 2015-11-01 warning 22154095
## 8 2015-11-02 alert 1452437
## 9 2015-11-02 crit 475753
## 10 2015-11-02 emerg 1451
## .. ... ... ...
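Comparing the probability of each type occurring in a given hour with its average over all hours shows that some types deviate a lot, that is, they do not occur on a regular schedule: emerg and err. alert, by contrast, is flat.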
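One way such a comparison could be computed is sketched below. The column names hour_sum, share, share_mean and gap are hypothetical, the values are made up, and the original analysis may define its deviation differently:

```r
# Toy hourly counts per severity (values are made up).
cnt <- data.frame(
  log_hour = rep(0:2, each = 2),
  severity = rep(c("alert", "err"), times = 3),
  count    = c(10, 90, 10, 90, 10, 290)
)

# Share of each severity within its hour.
cnt$hour_sum <- ave(cnt$count, cnt$log_hour, FUN = sum)
cnt$share <- cnt$count / cnt$hour_sum

# Average share of each severity across all hours, and the gap from it.
cnt$share_mean <- ave(cnt$share, cnt$severity, FUN = mean)
cnt$gap <- cnt$share - cnt$share_mean
print(cnt)
```

A severity whose gap stays near zero in every hour behaves regularly; large positive or negative gaps mark hours where its share departs from the norm.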
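At certain hours, far more events than usual occur in a concentrated burst. Those hours need to be re-checked.
The places where the points do not cluster but stretch upward also need checking.
Let's see how the counts of a specific type are distributed within a specific hour.
There appears to be no particular relationship between the hourly count total and the deviation from the average occurrence.
The regression confirms this: the slope is statistically significant, but R-squared is very low, roughly 0.3 to 0.4% (Multiple R-squared 0.00357).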
##
## Call:
## lm(formula = dist ~ date_hour_type_sum, data = devi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0002731 -0.0002306 -0.0001632 -0.0000554 0.0301193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.733e-04 1.907e-05 14.333 < 2e-16 ***
## date_hour_type_sum -6.314e-12 1.683e-12 -3.752 0.000178 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001006 on 3930 degrees of freedom
## Multiple R-squared: 0.00357, Adjusted R-squared: 0.003316
## F-statistic: 14.08 on 1 and 3930 DF, p-value: 0.0001778
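The "Multiple R-squared" line in the summary above is the share of the variance in dist explained by date_hour_type_sum, which is a separate question from whether the slope is significant. A minimal sketch of reading both from a fitted model, using simulated data (devi_toy and its values are made up; only the variable names mirror the call above):

```r
set.seed(1)
# Simulated stand-in for devi: dist is almost unrelated to the count total.
devi_toy <- data.frame(date_hour_type_sum = runif(500, 1e2, 1e6))
devi_toy$dist <- 3e-4 + rnorm(500, sd = 1e-3)

fit <- lm(dist ~ date_hour_type_sum, data = devi_toy)
s <- summary(fit)

# R-squared: share of variance explained; slope and its p-value are separate.
r2    <- s$r.squared
slope <- coef(s)["date_hour_type_sum", "Estimate"]
pval  <- coef(s)["date_hour_type_sum", "Pr(>|t|)"]
cat(sprintf("R-squared = %.5f, slope = %.3e, p = %.4f\n", r2, slope, pval))
```

With several thousand rows, even a very weak trend can produce a small p-value, so R-squared is the more useful number for judging how much the count total tells us about dist.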
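Even so, to find observations that behave abnormally, we plot the deviation from the average.
# Residuals from a robust regression of log(dist) on the log of the hourly count total
library(MASS)  # provides rlm()
devi$resid <- resid(rlm(log(dist) ~ log(date_hour_type_sum), data = devi))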
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -19.0800 -1.2360 0.1642 -0.2388 1.2760 7.0630
## 0% 25% 50% 75% 100%
## -19.0819362 -1.2361900 0.1641739 1.2755707 7.0634974
quantile(devi$resid, seq(0, 1, by=0.1))
## 0% 10% 20% 30% 40% 50%
## -19.0819362 -3.3445930 -1.7825863 -0.8772631 -0.2797260 0.1641739
## 60% 70% 80% 90% 100%
## 0.5829940 1.0247470 1.5269985 2.1581461 7.0634974
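The decile table above suggests a simple way to shortlist unusual rows: keep those whose residual exceeds an upper quantile. A minimal sketch on simulated residuals; the 90% cut-off and the toy data frame are assumptions, not the threshold used in the original analysis:

```r
set.seed(42)
# Simulated residuals standing in for devi$resid (values are made up).
toy <- data.frame(id = 1:200, resid = rnorm(200))

# Cut-off at the 90th percentile; rows above it are candidates for review.
cutoff  <- quantile(toy$resid, 0.9)
flagged <- toy[toy$resid > cutoff, ]

# Largest residuals first.
flagged <- flagged[order(-flagged$resid), ]
head(flagged)
```

A quantile cut-off always flags a fixed fraction of rows, so it ranks candidates for inspection rather than proving that any of them is anomalous.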
## Source: local data frame [6 x 4]
##
## count date_hour_type_sum dist resid
## (dbl) (dbl) (dbl) (dbl)
## 1 155 155 0.002004839 2.217574
## 2 157 157 0.003148308 2.671329
## 3 174 174 0.001824589 2.145487
## 4 178 178 0.004984874 3.154888
## 5 232 232 0.007912481 3.667611
## 6 255 255 0.009511350 3.869739
## Source: local data frame [6 x 12]
## Groups: log_hour, severity [6]
##
## log_date log_hour severity count log_date_hour hour_type_total
## (time) (int) (fctr) (dbl) (time) (dbl)
## 1 2015-11-01 0 alert 60710 2015-11-01 1629747
## 2 2015-11-01 0 crit 19705 2015-11-01 529459
## 3 2015-11-01 0 emerg 60 2015-11-01 1506
## 4 2015-11-01 0 err 2311092 2015-11-01 351674898
## 5 2015-11-01 0 info 5808214 2015-11-01 155023878
## 6 2015-11-01 0 notice 19250828 2015-11-01 573596936
## Variables not shown: prop (dbl), hour_total (dbl), prop_all (dbl),
## date_hour_type_sum (dbl), dist (dbl), resid (dbl)
For each severity, let's check how irregularly events occur in each hour relative to the average (red line).
## Source: local data frame [4,091 x 4]
##
## log_date log_hour event_type count
## (fctr) (int) (fctr) (int)
## 1 2015-11-01 0 disk_error 134110
## 2 2015-11-01 0 memory_error 3595
## 3 2015-11-01 0 nic_bandwidth 121
## 4 2015-11-01 0 nic_down 163
## 5 2015-11-01 0 oom 18
## 6 2015-11-01 0 temperature_issue 26
## 7 2015-11-01 1 disk_error 133736
## 8 2015-11-01 1 memory_error 3574
## 9 2015-11-01 1 nic_bandwidth 126
## 10 2015-11-01 1 nic_down 162
## .. ... ... ... ...
## log_date log_hour event_type
## 2015-11-05: 159 Min. : 0.00 disk_error :648
## 2015-11-23: 157 1st Qu.: 6.00 memory_error :648
## 2015-11-03: 156 Median :12.00 nic_bandwidth :648
## 2015-11-04: 156 Mean :11.65 nic_down :648
## 2015-11-11: 156 3rd Qu.:18.00 temperature_issue:648
## 2015-11-20: 156 Max. :23.00 oom :622
## (Other) :3151 (Other) :229
## count
## Min. : 1
## 1st Qu.: 48
## Median : 160
## Mean : 19096
## 3rd Qu.: 3594
## Max. :181927
##
## Source: local data frame [4,091 x 5]
##
## log_date log_hour event_type count log_date_hour
## (time) (int) (fctr) (dbl) (time)
## 1 2015-11-01 0 disk_error 134110 2015-11-01 00:00:00
## 2 2015-11-01 0 memory_error 3595 2015-11-01 00:00:00
## 3 2015-11-01 0 nic_bandwidth 121 2015-11-01 00:00:00
## 4 2015-11-01 0 nic_down 163 2015-11-01 00:00:00
## 5 2015-11-01 0 oom 18 2015-11-01 00:00:00
## 6 2015-11-01 0 temperature_issue 26 2015-11-01 00:00:00
## 7 2015-11-01 1 disk_error 133736 2015-11-01 01:00:00
## 8 2015-11-01 1 memory_error 3574 2015-11-01 01:00:00
## 9 2015-11-01 1 nic_bandwidth 126 2015-11-01 01:00:00
## 10 2015-11-01 1 nic_down 162 2015-11-01 01:00:00
## .. ... ... ... ... ...
## Source: local data frame [8 x 3]
##
## event_type total prop
## (fctr) (dbl) (dbl)
## 1 disk_error 75518599 0.966671
## 2 memory_error 2332185 0.029853
## 3 nic_down 108410 0.001388
## 4 nic_bandwidth 84060 0.001076
## 5 temperature_issue 49384 0.000632
## 6 oom 23750 0.000304
## 7 passwd_security 5850 0.000075
## 8 dhcp_error 90 0.000001
## Source: local data frame [4,091 x 7]
## Groups: log_hour, event_type [173]
##
## log_date log_hour event_type count log_date_hour
## (time) (int) (fctr) (dbl) (time)
## 1 2015-11-01 0 disk_error 134110 2015-11-01 00:00:00
## 2 2015-11-01 0 memory_error 3595 2015-11-01 00:00:00
## 3 2015-11-01 0 nic_bandwidth 121 2015-11-01 00:00:00
## 4 2015-11-01 0 nic_down 163 2015-11-01 00:00:00
## 5 2015-11-01 0 oom 18 2015-11-01 00:00:00
## 6 2015-11-01 0 temperature_issue 26 2015-11-01 00:00:00
## 7 2015-11-01 1 disk_error 133736 2015-11-01 01:00:00
## 8 2015-11-01 1 memory_error 3574 2015-11-01 01:00:00
## 9 2015-11-01 1 nic_bandwidth 126 2015-11-01 01:00:00
## 10 2015-11-01 1 nic_down 162 2015-11-01 01:00:00
## .. ... ... ... ... ...
## Variables not shown: hour_type_total (dbl), prop (dbl)
## Source: local data frame [24 x 3]
##
## log_hour hour_total prop_all
## (int) (dbl) (dbl)
## 1 0 3217616 0.04118689
## 2 1 3245845 0.04154824
## 3 2 3237314 0.04143904
## 4 3 3263487 0.04177406
## 5 4 3269484 0.04185083
## 6 5 3272011 0.04188317
## 7 6 3303174 0.04228207
## 8 7 3297715 0.04221220
## 9 8 3366237 0.04308931
## 10 9 3364305 0.04306458
## .. ... ... ...
## Source: local data frame [1,055 x 3]
##
## count date_hour_type_sum dist
## (dbl) (dbl) (dbl)
## 1 1 105 0.009982222
## 2 2 178 0.015342633
## 3 3 180 0.001437419
## 4 4 280 0.008746202
## 5 5 270 0.006327795
## 6 6 210 0.013122278
## 7 7 287 0.042041489
## 8 8 256 0.001754248
## 9 9 270 0.001011145
## 10 10 330 0.004937443
## .. ... ... ...
## count date_hour_type_sum dist
## Min. : 1.0 Min. : 58 Min. :0.0000000
## 1st Qu.: 462.2 1st Qu.: 1837 1st Qu.:0.0000065
## Median :101775.0 Median :102238 Median :0.0000225
## Mean : 69599.8 Mean : 74120 Mean :0.0064066
## 3rd Qu.:133986.2 3rd Qu.:134048 3rd Qu.:0.0001375
## Max. :181927.0 Max. :277858 Max. :0.8813636
##
## Call:
## lm(formula = dist ~ date_hour_type_sum, data = devi1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.01388 -0.01301 -0.00217 -0.00034 0.86766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.388e-02 2.174e-03 6.386 2.54e-10 ***
## date_hour_type_sum -1.009e-07 2.162e-08 -4.666 3.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04769 on 1052 degrees of freedom
## Multiple R-squared: 0.02027, Adjusted R-squared: 0.01934
## F-statistic: 21.77 on 1 and 1052 DF, p-value: 3.471e-06
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -12.78000 -1.21200 -0.13120 0.04477 1.30300 8.77900
## 0% 25% 50% 75% 100%
## -12.7785365 -1.2115856 -0.1312496 1.3033363 8.7791118
## 0% 10% 20% 30% 40% 50%
## -12.7785365 -2.4579279 -1.5000382 -0.9241052 -0.5609020 -0.1312496
## 60% 70% 80% 90% 100%
## 0.2925703 0.8303203 1.7771549 3.1351225 8.7791118
## Source: local data frame [6 x 12]
## Groups: log_hour, event_type [6]
##
## log_date log_hour event_type count log_date_hour
## (time) (int) (fctr) (dbl) (time)
## 1 2015-11-01 0 disk_error 134110 2015-11-01
## 2 2015-11-01 0 memory_error 3595 2015-11-01
## 3 2015-11-01 0 nic_bandwidth 121 2015-11-01
## 4 2015-11-01 0 nic_down 163 2015-11-01
## 5 2015-11-01 0 oom 18 2015-11-01
## 6 2015-11-01 0 temperature_issue 26 2015-11-01
## Variables not shown: hour_type_total (dbl), prop (dbl), hour_total (dbl),
## prop_all (dbl), date_hour_type_sum (dbl), dist (dbl), resid (dbl)