Introduction

Our websites use an external monitoring system that performs a synthetic check every 5 minutes. Each check follows an eleven (11) step process that tests most of the functionality of the environment. When any step fails, the entire check fails and generates the errors used in this report. The total time the system took to get from step 1 to the failure is recorded in milliseconds.

General Overview

The data set required some minor modifications before the presentation below could be assembled.

  • Modifications include:
    • Extracting the Hour in which the error happens.
    • Extracting the Day of the Month.
    • Identifying the Day of the week the error happens on.
    • Adjusting the column names to user-friendly versions.
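
The date and time modifications listed above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the report: it assumes timestamps in the M/D/YYYY H:MM form seen in the Date_Time column, and the function name derive_time_fields is hypothetical. The hour is read from the time portion of the timestamp and the day of the month from the date portion, so the two fields should only match by coincidence.

```python
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def derive_time_fields(date_time: str) -> dict:
    """Split an 'M/D/YYYY H:MM' timestamp into the derived columns.

    The timestamp format is an assumption based on the Date_Time column.
    """
    parsed = datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    return {
        "Date": parsed.date().isoformat(),
        "Hour": parsed.hour,                         # 0-23, from the time part
        "Day_of_Month": parsed.day,                  # 1-31, from the date part
        "Day_of_Week": DAY_NAMES[parsed.weekday()],  # Monday is index 0
    }


row = derive_time_fields("4/30/2019 19:02")
print(row)
```

For the timestamp 4/30/2019 19:02 this yields hour 19, day of month 30, and Tuesday.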

When those modifications are complete, the dataset looks like this:

##         Status Monitor Total_Time Status2
## 1  Unconfirmed  Vedder      51121    7004
## 2  Unconfirmed  Vedder      51121    7004
## 3  Unconfirmed WhiteUS      42644    7004
## 4  Unconfirmed WhiteUS      42644    7004
## 5  Unconfirmed  Vedder      58819    7004
## 6  Unconfirmed  Vedder      58819    7004
## 7  Unconfirmed  IRIC01      99160    7005
## 8  Unconfirmed  Vedder      43968    7004
## 9  Unconfirmed  Vedder      43968    7004
## 10   Confirmed  Vedder      58109    7004
##                                                   Description Step.Number
## 1                Step 11 (Save PDF): 'JE_Test_002' not found.          11
## 2                Step 11 (Save PDF): 'JE_Test_002' not found.          11
## 3                Step 11 (Save PDF): 'JE_Test_002' not found.          11
## 4                Step 11 (Save PDF): 'JE_Test_002' not found.          11
## 5                Step 11 (Save PDF): 'JE_Test_002' not found.          11
## 6                Step 11 (Save PDF): 'JE_Test_002' not found.          11
## 7  Step 9 (Open JE_Test_001): Element 'Test Fios,' not found.           9
## 8                Step 11 (Save PDF): 'JE_Test_002' not found.          11
## 9                Step 11 (Save PDF): 'JE_Test_002' not found.          11
## 10               Step 11 (Save PDF): 'JE_Test_002' not found.          11
##          Date     Time       Date_Time Day_of_Month Hour Day_of_Week
## 1  2019-04-30 19:02:00 4/30/2019 19:02           30   30     Tuesday
## 2  2019-04-30 19:02:00 4/30/2019 19:02           30   30     Tuesday
## 3  2019-04-30 19:07:00 4/30/2019 19:07           30   30     Tuesday
## 4  2019-04-30 19:07:00 4/30/2019 19:07           30   30     Tuesday
## 5  2019-04-30 19:11:00 4/30/2019 19:11           30   30     Tuesday
## 6  2019-04-30 19:11:00 4/30/2019 19:11           30   30     Tuesday
## 7  2019-04-30 19:17:00 4/30/2019 19:17           30   30     Tuesday
## 8  2019-04-30 19:26:00 4/30/2019 19:26           30   30     Tuesday
## 9  2019-04-30 19:26:00 4/30/2019 19:26           30   30     Tuesday
## 10 2019-04-30 19:28:00 4/30/2019 19:28           30   30     Tuesday
##     Status            Monitor            Total_Time        Status2    
##  Length:30459       Length:30459       Min.   :    16   Min.   :7001  
##  Class :character   Class :character   1st Qu.:  8836   1st Qu.:7004  
##  Mode  :character   Mode  :character   Median : 44442   Median :7004  
##                                        Mean   : 48389   Mean   :7007  
##                                        3rd Qu.: 83096   3rd Qu.:7009  
##                                        Max.   :235047   Max.   :7020  
##                                                                       
##  Description         Step.Number          Date                 Time      
##  Length:30459       Min.   : 1.000   Min.   :2019-04-30   3:16:00:   44  
##  Class :character   1st Qu.: 5.000   1st Qu.:2019-05-07   4:08:00:   43  
##  Mode  :character   Median : 9.000   Median :2019-05-17   1:00:00:   40  
##                     Mean   : 7.564   Mean   :2019-05-15   3:14:00:   39  
##                     3rd Qu.:10.000   3rd Qu.:2019-05-22   3:08:00:   38  
##                     Max.   :11.000   Max.   :2019-05-30   1:25:00:   37  
##                                                           (Other):30218  
##            Date_Time     Day_of_Month            Hour      
##  5/12/2019 17:47:   15   Length:30459       Min.   : 1.00  
##  5/18/2019 12:05:   14   Class :character   1st Qu.: 8.00  
##  5/18/2019 12:39:   14   Mode  :character   Median :17.00  
##  5/18/2019 12:12:   13                      Mean   :15.19  
##  5/18/2019 12:55:   13                      3rd Qu.:22.00  
##  5/5/2019 2:00  :   13                      Max.   :30.00  
##  (Other)        :30377                                     
##  Day_of_Week       
##  Length:30459      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
## [1] "The data in this report is from 2019-04-30 to 2019-05-30"
## [1] "This date range gives us 30459 of data and individual errors to analyze"
## [1] "Within this dataset, we have 37 unique environments"

Further exploration will happen in specific sections below in an effort to add context.

The graphs below display the total errors by instance across the enterprise, as well as the five (5) instances with the most errors and the five (5) instances with the fewest errors.
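
As a rough sketch of how such a ranking could be produced, the snippet below counts errors per instance and returns the top and bottom entries. The function name rank_instances is hypothetical, and the sample list only mirrors the Monitor values of the ten rows shown earlier, not the full data set.

```python
from collections import Counter


def rank_instances(monitors, n=5):
    """Return (top_n, bottom_n) lists of (instance, error_count) pairs.

    top_n is ordered from most to fewest errors,
    bottom_n from fewest to most.
    """
    ranked = Counter(monitors).most_common()  # sorted by count, descending
    top = ranked[:n]
    bottom = ranked[-n:][::-1]                # last n entries, fewest first
    return top, bottom


# Monitor column of the ten sample rows shown earlier in this report.
sample = ["Vedder", "Vedder", "WhiteUS", "WhiteUS", "Vedder",
          "Vedder", "IRIC01", "Vedder", "Vedder", "Vedder"]

top, bottom = rank_instances(sample, n=2)
print("Most errors:  ", top)
print("Fewest errors:", bottom)
```

With only three instances in the sample, the two lists overlap. On the full data set of 37 environments, n=5 would give two distinct groups.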

Analysis

We should take a deeper look into WhiteUS and potentially Vedder, as their error counts are far above those of the rest of the enterprise.

Further investigation into Stoelle and WDBA vs. WhiteUS and Vedder is warranted in order to understand the differences between the instances (such as user counts, hardware, or configurations) and whether those differences correlate with the number of errors.


The next section covers time-based data.

  • We should be able to answer:
    • At what hour of the day do the errors happen?
    • What day of the week do our errors happen on?
    • What day of the month do they happen on?

Please note, the time of day is recorded in 24-hour format and references Greenwich Mean Time (GMT).
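
The three questions above each amount to a tally over a different part of the timestamp. The sketch below is one way to compute them, assuming the M/D/YYYY H:MM timestamp format seen in the Date_Time column. The function name bucket_counts is hypothetical, and the sample timestamps are a handful of values that appear in this report, not the full data set.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def bucket_counts(timestamps):
    """Tally errors by hour of day, day of week, and day of month."""
    by_hour, by_weekday, by_day = Counter(), Counter(), Counter()
    for text in timestamps:
        parsed = datetime.strptime(text, "%m/%d/%Y %H:%M")
        by_hour[parsed.hour] += 1                      # 0-23, GMT as recorded
        by_weekday[DAY_NAMES[parsed.weekday()]] += 1
        by_day[parsed.day] += 1
    return by_hour, by_weekday, by_day


# A few Date_Time values taken from the output earlier in this report.
sample = ["4/30/2019 19:02", "4/30/2019 19:07", "5/12/2019 17:47",
          "5/18/2019 12:05", "5/18/2019 12:39", "5/5/2019 2:00"]

by_hour, by_weekday, by_day = bucket_counts(sample)
print("By hour:        ", sorted(by_hour.items()))
print("By day of week: ", dict(by_weekday))
print("By day of month:", sorted(by_day.items()))
```

Because the timestamps are in GMT, any comparison against local maintenance windows would need a time zone conversion first.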

Analysis

We have a spike in errors between 3:00 and 4:00 AM, which tapers off until 9:00 AM.
A second lull starts at 5:00 PM, with the lowest number of errors at 8:00 PM, before errors rise again at 9:00 PM.
Saturday appears to be the day with the highest number of errors. There is a potential correlation between these errors and maintenance events.
There is a clear spike in errors on the 18th of the month and a slight ramp-up in errors at the beginning of the month.


The next section of our general overview shows the Failures by Step. This is a total of all errors by step in all environments.
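
A per-step total like this can be sketched as below. The example assumes the step number can be read from the start of the Description text, as in the sample rows shown earlier. The report already carries a Step.Number column, so the parsing here is purely illustrative, and the names STEP_PATTERN and failures_by_step are hypothetical.

```python
import re
from collections import Counter

# Matches descriptions that begin with "Step <number> (".
STEP_PATTERN = re.compile(r"^Step (\d+) \(")


def failures_by_step(descriptions):
    """Count failures per step number parsed from each description."""
    counts = Counter()
    for text in descriptions:
        match = STEP_PATTERN.match(text)
        if match:  # skip descriptions that do not follow the pattern
            counts[int(match.group(1))] += 1
    return counts


# Description values of the ten sample rows shown earlier in this report.
sample = ["Step 11 (Save PDF): 'JE_Test_002' not found."] * 6
sample.append("Step 9 (Open JE_Test_001): Element 'Test Fios,' not found.")
sample += ["Step 11 (Save PDF): 'JE_Test_002' not found."] * 3

for step, count in sorted(failures_by_step(sample).items()):
    print(f"Step {step}: {count} failures")
```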

Analysis

The highest numbers of errors occur during Steps 5, 9, and 11.


The last section of the overview looks at the total time before failure, both overall and for each instance.

The histogram provides context for the length of time before a site errors. The box plot allows us to identify environments whose range falls outside the norm, both above and below.
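
For readers who want to reproduce the outlier points of a box plot numerically, the sketch below applies the common 1.5 x IQR rule (Tukey's fences). It is an approximation under stated assumptions: the quartile method may differ slightly from the one the plotting software uses, the function name tukey_outliers is hypothetical, and the 150000 value is an illustrative stand-in added to the Vedder sample rows shown earlier.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Total_Time (ms) of the Vedder sample rows, plus one hypothetical
# 150000 ms value standing in for the outliers described in the text.
vedder_times = [51121, 51121, 58819, 58819, 43968, 43968, 58109, 150000]

print(tukey_outliers(vedder_times))  # -> [150000]
```

Running the same function per instance and comparing the results would give a numeric counterpart to the visual comparison of the box plots.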

## [1] "To understand the data below, the mean of the Total time in milliseconds is 48388.851176992 and the median of the same is 44442"

Analysis

1. Vedder has a high concentration of outliers right at the 150,000 ms mark, with a very low box. Combined with the histogram, this suggests a strong correlation between our highest time-to-error values and this site.
2. The concentration of outliers on RelateC is troubling. Further investigation follows below.
3. WhiteUS has a small box with outliers at the extreme upper and lower ends of the range. This could suggest the site is failing consistently at around the same time. Further investigation could be required.
4. Cohan is one of our lower error-generating sites; however, its median and box range are relatively high when compared to other sites. This could suggest the site is slow to respond but does not error.


Breakdown of highest instance - WhiteUS

WhiteUS is magnitudes higher than the rest of the enterprise in terms of error counts. Breaking down the data gives us interesting insight into potential causes.

Analysis

Unlike the global analysis of Step Number errors, WhiteUS has far more issues with Step 11 than with Steps 5 or 9. Additionally, the ranking of the highest-error steps (5, 9, and 11) is reversed from the global ranking.
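
One way to make this kind of comparison concrete is to express each step's failures as a share of the total, once for the whole enterprise and once for a single instance. The sketch below does that. The function name step_shares is hypothetical, and the sample rows mirror only the ten rows shown at the start of this report, so the resulting shares are not the report's actual figures.

```python
from collections import Counter


def step_shares(rows, monitor=None):
    """Return {step: share of failures}, optionally for one instance.

    rows is an iterable of (monitor_name, step_number) pairs.
    """
    steps = [step for name, step in rows
             if monitor is None or name == monitor]
    total = len(steps)
    if total == 0:
        return {}
    return {step: count / total for step, count in Counter(steps).items()}


# (Monitor, Step.Number) pairs mirroring the ten sample rows.
sample = [("Vedder", 11)] * 7 + [("WhiteUS", 11)] * 2 + [("IRIC01", 9)]

print("Global: ", step_shares(sample))
print("WhiteUS:", step_shares(sample, monitor="WhiteUS"))
```

Comparing shares rather than raw counts keeps a high-volume instance such as WhiteUS from dominating the comparison simply because it has more errors overall.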

Analysis

The highest spikes of errors occur between May 20th and 22nd. This is not in line with the global errors by day of the month, which show a spike on May 18th. The data does not give us any further details; however, the graph above does give context to the high errors in Step 11.


Breakdown of outliers on RelateC

Analysis

The outliers on RelateC seem to come from the same day, May 18th, as the high spike in errors for the month.

We have a lull in errors around 8:00 - 9:00 PM; this could potentially be a good time to push updates to the sites.


Conclusion

The analysis has shown that while the majority of sites fall within an acceptable range of each other, there are two instances with extremely high error counts, Vedder and WhiteUS. Further analysis of the two instances with low error counts, Stoelle and WDBA, as well as a comparison of Stoelle and WDBA vs. WhiteUS and Vedder, could provide needed insight into potential reasons for their respective error counts and how to improve performance.
There is a lull in errors around 8:00 PM, which may present a good time to push emergency patches if they are required. Additionally, if those patches could wait until the 5th through the 11th of the month, and preferably a Monday or Tuesday, they will potentially cause fewer disruptions.
The extreme concentration of errors in Steps 5, 9, and 11 is troubling and requires investigation into the possible causes. A full RCA should go into the cause of the spike in errors on the 18th. Additionally, an RCA could be requested for WhiteUS on the spikes in errors between May 20th and 22nd.

Actions for internal teams
  • Site Team:
    • Root Cause for WhiteUS having spikes in errors on Step 11
    • Root Cause for WhiteUS having spikes between May 20th - 22nd
    • Potential causes of Vedder's high outliers
    • Identify causes for Steps 5, 9, and 11 being the highest failures
  • Infrastructure Teams:
    • Potential causes of the global spike on May 18th
    • Potential causes for spikes in errors around 3:00 - 4:00 AM
  • Change Teams:
    • Will need to confirm high count of errors on Saturday in preparation for patching efforts
    • Keep watch on further analysis reports to ensure 8:00 PM is the best time to push emergency packages