---
title: "Process Mining - Primer"
author: "Rohit Pruthi - R2DL"
output:
  html_notebook: default
  html_document:
    df_print: paged
  word_document: default
  pdf_document: default
always_allow_html: yes
---

### Motivation for studying process mining

<style>
body {
text-align: justify}
</style>

#### Process Data is ubiquitous
Every business runs processes and collects data about its activities, whether in sales and procurement, recruitment, or service delivery. Workflows, whether manual or IT-enabled, are essential for tracking these processes.

#### There are unresolved concerns in process management
In business process management, two key questions have troubled stakeholders for as long as business processes have existed.

* The first is how to establish the "current state" of a process. Most process exercises in large corporations focus on the "to be" process, that is, how something should be done rather than how it is actually being done. Moreover, executives are justifiably skeptical of "as is" documentation, given the juggling of interviews and sticky notes it is built upon. Traditionally there has been no substitute for first-hand experience, but with the robustness of modern data collection there may be a better option.

* This brings us to the second question: the lack of a two-way connection between business processes and enterprise information systems. While ERP systems do capture process logs, they rarely provide interfaces for working with these logs.

#### Encouraging ecosystem activity in the field
In 2011, a Munich-based company called [Celonis](https://www.celonis.com/) was founded, with Wil van der Aalst as scientific advisor. Van der Aalst has been the academic force behind research on process mining and is the instructor of arguably the best-structured MOOC on the subject. There are also resources like [ProM](http://www.processmining.org/prom/start) and [Fluxicon](https://www.fluxicon.com/), which have been active in the field since 2014.

Since 2016, IEEE has maintained the standard data format [XES](http://xes-standard.org/) for event logs to enable streamlined process analysis. In 2018, Gartner published a market guide for process mining and hailed it as an enabler of digital transformation. In April 2019, Harvard Business Review published an [article](https://hbr.org/2019/04/what-process-mining-is-and-why-companies-should-do-it) describing the need for mining processes and deriving useful insights from process data.

This makes process mining an interesting exploration for data scientists. 


### What is Process Mining?

Process mining is the analytical discipline of understanding real processes (as opposed to assumed processes) from event logs, thereby enabling comparison of the extracted process with the ideal process. It can provide a detailed, objective, data-driven view of how processes are performing and answer both performance-related and compliance-related questions.

Process mining can be used as a precursor to operational efficiency enhancement programs, resource allocation and process automation guidelines. It is closely related to business process management.


This notebook is an introduction to process mining: it covers the basics and lays the groundwork for getting started with process mining.

### Available tools

Among commercial software offerings, Celonis, Fluxicon and QPR provide solutions for process discovery. In the open source data science world, the R and Python communities have the libraries bupaR and pm4py, respectively, for working with process data. While this notebook is built on R and works with bupaR, the fundamental principles are common to other tools.

### Understanding process data

The essential data capture schema for process mining is an *event log*. Three aspects must be captured:

1. Activity - a well understood step in the process. For instance, in an IT process, *calling help desk* is an activity.
2. Case identifier - the unique identifier to which multiple activities can be tagged and tracked. In the same example of an IT process, the case identifier would be the serial number assigned to the case.
3. Activity instance identifier (or sequence) - the link between 1) and 2) above. For instance, the start and the end of a phone call would be tagged with the same instance identifier.

Timestamp, resource, location and other details are sometimes also available in event logs and can be used for selective filtering.
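
To make the three aspects concrete, here is a minimal sketch of a hand-made event log in base R. The table `toy_log`, its column names and its help-desk data are all hypothetical and are not part of the production dataset used later. Each activity instance has a start and a complete event, and the instance identifier is what ties the two together.

```{r}
# Hypothetical toy event log: two help-desk cases, one row per logged event.
toy_log <- data.frame(
  case_id           = c("T1", "T1", "T1", "T1", "T2", "T2"),
  activity          = c("Call help desk", "Call help desk", "Resolve", "Resolve",
                        "Call help desk", "Call help desk"),
  activity_instance = c(1, 1, 2, 2, 3, 3),
  lifecycle         = c("start", "complete", "start", "complete", "start", "complete"),
  timestamp         = as.POSIXct(c("2020-01-06 09:00:00", "2020-01-06 09:10:00",
                                   "2020-01-06 09:30:00", "2020-01-06 10:00:00",
                                   "2020-01-07 14:00:00", "2020-01-07 14:05:00"),
                                 tz = "UTC"),
  stringsAsFactors = FALSE
)

# Duration of each activity instance in minutes: last event minus first event.
instance_minutes <- tapply(
  as.numeric(toy_log$timestamp), toy_log$activity_instance,
  function(t) (max(t) - min(t)) / 60
)
instance_minutes
```

Without the instance identifier there would be no reliable way to tell which "complete" row belongs to which "start" row when an activity repeats within a case.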

### Business Process Analytics with R (bupaR)

```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(plyr)
library(tidyverse)
library(bupaR)

```

#### Introducing dataset
This exercise uses production data from the 4TU data repository. The data is available [here](https://data.4tu.nl/repository/uuid:68726926-5ac5-4fab-b873-ee76ea412399).

```{r, echo=FALSE}

# Local path to the downloaded CSV; adjust it to your own machine
prod_data<- read.csv("C:/Users/PruthiR/Documents/2020/mLogic/processmining/Dataset files (2.1 MB)/data/Production_Data.csv")

print(colnames(prod_data))

```

The column names of the above data need to be mapped to the standard event log nomenclature. This is done with the `eventlog()` command. The standard nomenclature includes the following column inputs:

* case_id - the unique identifier of the *whole* process sequence (the case),
* activity_id - the name of an individual activity within the process,
* activity_instance_id - an identifier that distinguishes one execution of an activity from another,
* lifecycle_id - the lifecycle status of an activity instance, such as start or complete (this notebook maps the `Rework` column to it),
* timestamp - the time at which the event was logged,
* resource_id - the machine or individual responsible for the activity instance.

Before that, however, the timestamp columns need to be converted to a proper date-time format. The lubridate library is used for this.

```{r, echo=FALSE, message=FALSE}
library(lubridate)

prod_data$Complete.Timestamp<- ymd_hms(prod_data$Complete.Timestamp)
prod_data$Start.Timestamp<- ymd_hms(prod_data$Start.Timestamp)
```
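
A quick sanity check after a conversion like this is to count the values that failed to parse, since `ymd_hms()` returns `NA` for them. The sketch below uses two made-up strings (`raw_ts` is hypothetical) rather than the production data, and the `quiet = TRUE` argument only suppresses the parse warning.

```{r}
library(lubridate)

# Hypothetical input: one well-formed timestamp and one malformed value.
raw_ts <- c("2012-01-29 23:24:00", "not a timestamp")

# Values that cannot be parsed become NA instead of raising an error.
parsed_ts <- ymd_hms(raw_ts, quiet = TRUE)
n_failed <- sum(is.na(parsed_ts))
n_failed
```

A non-zero count on real data would suggest that the column uses a different format than the parser expects.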

At this point, an activity instance identifier is added to the data and the event log is created.

```{r}
prod_data_with_instance<- prod_data %>% 
  mutate(activity_instance = as.character(row_number()))
```


```{r, echo=TRUE, message=FALSE}
prod_event = prod_data_with_instance %>%
    eventlog(
        case_id = "Case.ID",
        activity_id = "Activity",
        activity_instance_id = "activity_instance",
        lifecycle_id = "Rework",
        timestamp = "Complete.Timestamp",
        resource_id = "Worker.ID"
    )

prod_event %>% n_activities

prod_event %>% n_cases

prod_event %>% n_traces
prod_event %>% n_resources

```

There are a total of 55 activities, carried out for 225 cases, in 221 unique ways (traces) by 49 resources. 
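
The difference between cases and traces can be illustrated in base R. This is only a sketch on a hypothetical three-case table (`toy_events`), assuming the rows are already in time order within each case. It is not how bupaR computes traces internally.

```{r}
# Hypothetical log: cases A and B follow the same path, case C skips Q.C.
toy_events <- data.frame(
  case_id  = c("A", "A", "A", "B", "B", "B", "C", "C"),
  activity = c("Turning", "Q.C.", "Packing",
               "Turning", "Q.C.", "Packing",
               "Turning", "Packing"),
  stringsAsFactors = FALSE
)

# A trace is the ordered sequence of activities of one case.
toy_traces <- vapply(split(toy_events$activity, toy_events$case_id),
                     paste, character(1), collapse = " > ")

n_toy_cases  <- length(toy_traces)           # 3 cases
n_toy_traces <- length(unique(toy_traces))   # 2 distinct traces
```

When the number of traces is close to the number of cases, as with 221 traces for 225 cases here, almost every case follows its own path.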

#### Activity Understanding & Analysis
Let us start by looking at which activities are present and their relative frequency.

```{r}

activity_data<- prod_event %>% 
   activity_frequency(level = "activity")

activity_data_reduced<- activity_data[activity_data$relative>0.02,]

(plot(activity_data_reduced))
```

This shows that final inspection is the most frequent activity, which is to be expected. The turning and milling quality checks come next, followed by lapping and packing.
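
The relative frequency and the threshold filter used above can be sketched in base R as follows. The vector `toy_activities` and the 0.25 cut-off are made up for illustration; the notebook itself filters at 0.02.

```{r}
# Hypothetical activity column of a small log with ten events.
toy_activities <- c(rep("Final Inspection", 5), rep("Lapping", 3), rep("Packing", 2))

# Share of all events that belong to each activity.
toy_rel_freq <- prop.table(table(toy_activities))

# Keep only activities above the (arbitrary) threshold.
toy_frequent <- toy_rel_freq[toy_rel_freq > 0.25]
names(toy_frequent)
```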

#### Visualizing process maps
Let us start by viewing one of the process maps. 

```{r}
event_reduced<- prod_event[prod_event$Case.ID %in% c("Case 1"),]

event_reduced %>% process_map(type = frequency("relative"))
```

This shows how the steps of the process play out for a single case. As we add more data, we can start to see some of the possible variations in the process. The steps for this case appear to be:

1. Turning and milling
2. Turning and milling Q.C
3. Laser Marking
4. Lapping
5. Round grinding
6. Final Inspection
7. Packing

Self-loops indicate that the same activity was logged more than once in a row, which might be due to error messages or logging discrepancies.
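
One way to spot such repeats in a single case is run-length encoding. The sketch below uses a hypothetical activity sequence (`loop_case`) and base R's `rle()`; it is an illustration of the idea, not what `process_map()` does internally.

```{r}
# Hypothetical ordered activities of one case, with two repeated loggings.
loop_case <- c("Turning & Milling", "Turning & Milling", "Turning & Milling Q.C.",
               "Laser Marking", "Laser Marking", "Lapping")

# rle() collapses runs of identical consecutive values.
activity_runs <- rle(loop_case)

# Activities logged more than once in a row would show up as self-loops.
self_loop_activities <- activity_runs$values[activity_runs$lengths > 1]
self_loop_activities
```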

If we add two more cases, more insights start to appear; see the plot below for three cases.

```{r}
event_reduced<- prod_event[prod_event$Case.ID %in% c("Case 1","Case 111","Case 104"),]

event_reduced %>% process_map(type = frequency("relative"))
```

One of the three cases went through turning rather than turning and milling, which also added a turning Q.C. step.

The plot above also shows that about 20% of the components go directly from laser marking to the end and 8% go from turning to the end, while about 20% go through the lapping and grinding operations before going to packing and then ending.

If we add a few more cases to the plot above, it becomes cumbersome to read.

Like process maps, resource maps are another way of understanding the flow.

```{r}
event_reduced<- prod_event[prod_event$Case.ID %in% c("Case 1","Case 111","Case 104"),]

event_reduced %>% filter_trace_frequency(percentage = 0.5) %>% resource_map(type = frequency("absolute"))
```


There are further ways available for exploring processes. 


#### Precedence Matrix
Another way to visualize the process is a precedence matrix, which shows how often each activity is directly followed by each other activity. In this case, since the logging seems to contain duplicated activities, the plot is not very insightful.

```{r}
# Store the plot under its own name instead of reusing the name of precedence_matrix()
precedence_plot <- prod_event %>%
  filter_activity_frequency(percentage = 0.9) %>% 
  filter_trace_frequency(percentage = .80) %>%    
  precedence_matrix() %>% 
  plot()

precedence_plot

```
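
The counts behind such a matrix can be sketched in base R by pairing every event with the next event of the same case. The table `toy_seq` is hypothetical and assumed to be in time order within each case, and the sketch leaves out any artificial start and end markers.

```{r}
# Hypothetical log: cases A and B pass through Q.C., case C skips it.
toy_seq <- data.frame(
  case_id  = c("A", "A", "A", "B", "B", "B", "C", "C"),
  activity = c("Turning", "Q.C.", "Packing",
               "Turning", "Q.C.", "Packing",
               "Turning", "Packing"),
  stringsAsFactors = FALSE
)

# For each case, pair every activity with the one that directly follows it.
follow_pairs <- do.call(rbind, lapply(
  split(toy_seq$activity, toy_seq$case_id),
  function(a) data.frame(antecedent = head(a, -1), consequent = tail(a, -1),
                         stringsAsFactors = FALSE)
))

# Rows are antecedents, columns are consequents, cells are counts.
toy_precedence <- table(follow_pairs$antecedent, follow_pairs$consequent)
toy_precedence
```

In this toy log, "Turning" is followed by "Q.C." twice and by "Packing" once.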

#### Resource & activity Analysis
Analyzing resource specialization and utilization is another key task that process analytics can support.

```{r, fig.height=8}
prod_event %>%
    resource_specialisation("resource") %>% plot()
```

The plot above shows, for instance, that ID4932 and ID0937 are generalists, performing up to 15 activity types, while ID3641 and ID3719 are specialists.

```{r, fig.height=10}
prod_event %>%
    resource_specialisation("activity") %>% plot()
```

Final inspection, packing, lapping, round grinding and turning Q.C. are specialized activities, each performed by only one resource.
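
Both views of specialization boil down to counting distinct values in one column per value of another. A minimal base R sketch on a hypothetical table (`toy_work`, with invented resources R1 to R3) could look like this:

```{r}
# Hypothetical log of who performed which activity.
toy_work <- data.frame(
  resource = c("R1", "R1", "R1", "R2", "R2", "R3"),
  activity = c("Turning", "Lapping", "Packing", "Turning", "Turning", "Packing"),
  stringsAsFactors = FALSE
)

# Resource view: how many distinct activities does each resource perform?
activities_per_resource <- tapply(toy_work$activity, toy_work$resource,
                                  function(a) length(unique(a)))

# Activity view: how many distinct resources perform each activity?
resources_per_activity <- tapply(toy_work$resource, toy_work$activity,
                                 function(r) length(unique(r)))
```

Here R1 is the generalist with three activity types, and "Lapping" is the specialized activity performed by a single resource.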

In terms of activities, it would also be useful to understand which activities are always performed, and which are rare. 

```{r, fig.height=8}
prod_event %>% activity_presence() %>% plot()
```
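
Activity presence can be read as the share of cases in which an activity occurs at least once. The following base R sketch works on a hypothetical table (`toy_presence`) and is meant as an illustration of that reading, not as bupaR's implementation.

```{r}
# Hypothetical log: "Turning" occurs in every case, twice in case B.
toy_presence <- data.frame(
  case_id  = c("A", "A", "B", "B", "B", "C"),
  activity = c("Turning", "Packing", "Turning", "Turning", "Lapping", "Turning"),
  stringsAsFactors = FALSE
)

# Keep one row per (case, activity) so repeats within a case count once.
case_activity <- unique(toy_presence)

# Share of cases in which each activity appears.
toy_presence_share <- table(case_activity$activity) /
  length(unique(toy_presence$case_id))
toy_presence_share
```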

#### Trace analysis
Trace length statistics and plots show how much variation there is across cases.

```{r}
prod_event %>% trace_length() %>% plot()
# Summary statistics of trace length (min, quartiles, mean, max)
prod_event %>% trace_length()
```

The plot shows that the middle 50% of cases have between 8 and 23 steps (the first and third quartiles), although a maximum of 175 steps has been observed as well. The median number of steps is 14, and the mean is about 20.
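
How the quartiles relate to the share of cases can be checked on a small made-up vector. The values in `toy_lengths` are hypothetical and chosen so that the quartiles match the ones reported above.

```{r}
# Hypothetical trace lengths for nine cases, including one extreme outlier.
toy_lengths <- c(1, 5, 8, 11, 14, 18, 23, 40, 175)

# First and third quartile.
len_quartiles <- quantile(toy_lengths, probs = c(0.25, 0.75))

# Share of cases whose length lies between the two quartiles (inclusive).
share_between <- mean(toy_lengths >= len_quartiles[["25%"]] &
                      toy_lengths <= len_quartiles[["75%"]])
share_between
```

The share comes out at roughly one half (5 of 9 cases here), which is why the range between the quartiles describes the middle half of the cases. The single long case pulls the mean well above the median.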

Do all cases start and end with the same activities? This can be visualized using bar plots as well.

```{r}
start_activities(prod_event, level = "activity") %>% plot()

```

While turning and milling seems to be the first operation, Machine 6 accounts for the highest share of starting points.

```{r}
end_activities(prod_event, level = "activity") %>% plot()
```

As expected, most cases end with final inspection or packing.
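
Start and end activities are simply the first and last activity of each case. A minimal base R sketch, assuming a hypothetical table (`toy_ordered`) that is already sorted by time within each case:

```{r}
# Hypothetical log with three cases.
toy_ordered <- data.frame(
  case_id  = c("A", "A", "A", "B", "B", "C", "C"),
  activity = c("Turning", "Lapping", "Packing",
               "Turning", "Final Inspection",
               "Milling", "Packing"),
  stringsAsFactors = FALSE
)

by_case <- split(toy_ordered$activity, toy_ordered$case_id)

# Count how often each activity is the first or the last step of a case.
toy_start_counts <- table(vapply(by_case, function(a) a[1], character(1)))
toy_end_counts   <- table(vapply(by_case, function(a) a[length(a)], character(1)))
```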

```{r}
prod_event %>%
  trace_coverage(level = "trace") %>%
  plot()
```

The plot above shows what share of cases is covered by the most frequent traces; a process is consistent when a small number of traces describes most cases. Here, 225 cases produce as many as 221 distinct traces, so there is very little consistency. However, this is due to the machine number being part of the activity description, which means that similar steps are treated as different activities.
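
The idea of coverage can be sketched as a cumulative share over traces sorted by frequency. The vector `toy_trace_per_case` is hypothetical, with one abbreviated trace label per case.

```{r}
# Hypothetical trace of each of six cases.
toy_trace_per_case <- c("T > Q > P", "T > Q > P", "T > Q > P",
                        "T > P", "T > P", "T > L > P")

# Number of cases per trace, most frequent first.
toy_trace_counts <- sort(table(toy_trace_per_case), decreasing = TRUE)

# Cumulative share of cases covered by the top 1, 2, 3 ... traces.
toy_coverage <- cumsum(as.numeric(toy_trace_counts)) / length(toy_trace_per_case)
toy_coverage
```

In this toy example the two most frequent traces already cover five of the six cases. With 221 traces for 225 cases, the curve rises almost one case at a time.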



#### Consolidation of activities
Consolidating activities relabels them and gives a higher-level view of the process.

```{r, echo=FALSE}
# Merge machine-specific activity labels into one label per operation
prod_event_united <- prod_event %>%
  act_unite(Turning = c("Turning - Machine 21", "Turning - Machine 4", "Turning - Machine 5",
                        "Turning - Machine 8", "Turning - Machine 9")) %>%
  act_unite("Turning & Milling" = c("Turning & Milling - Machine 10", "Turning & Milling - Machine 4",
                                    "Turning & Milling - Machine 5", "Turning & Milling - Machine 6",
                                    "Turning & Milling - Machine 8", "Turning & Milling - Machine 9")) %>%
  act_unite("Round Grinding" = c("Round Grinding - Machine 12", "Round Grinding - Machine 19",
                                 "Round Grinding - Machine 2", "Round Grinding - Machine 23",
                                 "Round Grinding - Machine 3")) %>%
  act_unite("Milling" = c("Milling - Machine 10", "Milling - Machine 14",
                          "Milling - Machine 16", "Milling - Machine 8")) %>%
  act_unite("Grinding Rework" = c("Grinding Rework", "Grinding Rework - Machine 12",
                                  "Grinding Rework - Machine 2", "Grinding Rework - Machine 27")) %>%
  act_unite("Setup" = c("Setup - Machine 8", "Setup - Machine 4")) %>%
  act_unite("Fix" = c("Fix - Machine 3", "Fix - Machine 19", "Fix - Machine 15", "Fix - Machine 15M")) %>%
  act_unite("Fix" = c("Fix", "Fix EDM")) %>%
  act_unite("Turn & Mill and Screw" = c("Turn & Mill. & Screw Assem - Machine 9",
                                        "Turn & Mill. & Screw Assem - Machine 10")) %>%
  act_unite("Wire Cut" = c("Wire Cut - Machine 18", "Wire Cut - Machine 13")) %>%
  act_unite("Flat Grinding" = c("Flat Grinding - Machine 11", "Flat Grinding - Machine 26"))

```
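
Listing every machine-specific label by hand is explicit but verbose. As a possible alternative, and only as a sketch, the machine suffix could be stripped with a regular expression before the event log is built. The vector `machine_labels` and the pattern below are assumptions based on the labels shown above, not something verified against the full dataset.

```{r}
# A few labels modeled on the activity names used above.
machine_labels <- c("Turning - Machine 4", "Turning - Machine 21",
                    "Round Grinding - Machine 12", "Fix - Machine 15M", "Packing")

# Drop a trailing " - Machine <id>" suffix; labels without it stay unchanged.
consolidated_labels <- sub(" - Machine [0-9]+[A-Z]?$", "", machine_labels)
unique(consolidated_labels)
```

Unlike the manual approach, this would not merge labels that differ in more than the suffix (for example "Fix" and "Fix EDM"), and it keeps the original operation names rather than the shortened ones chosen by hand.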


```{r}
event_reduced<- prod_event_united[prod_event_united$Case.ID %in% c("Case 1","Case 111","Case 104"),]

event_reduced %>% filter_trace_frequency(percentage = 0.8) %>% process_map(type = frequency("absolute"))
```

The 10% most infrequent traces are plotted below with united data. 

```{r, fig.width=25}
prod_event_united %>% trace_explorer(coverage = 0.1, type = "infrequent")
```

