## Warning: package 'knitr' was built under R version 3.1.3
## Warning: package 'dplyr' was built under R version 3.1.3
## Warning: package 'lubridate' was built under R version 3.1.3
## Warning: package 'tidyr' was built under R version 3.1.3
Lubridate is an R package which makes it easier to work with date and timestamps. In this tutorial, we will show some applications of the lubridate package. For more detailed information on lubridate, please refer to :
If your dataset contains dates or timestamps, lubridate is the go-to package to transform these columns into a proper format. Let’s take an example at the loading script for the interaction data set.
library(dplyr)
interaction_data <- read.csv("data/raw/detail_interaction.csv", sep = ";")
names(interaction_data) <- c("ci_name", "ci_type", "ci_subtype", "service_component",
"interaction_id", "status", "impact", "urgency", "priority",
"category", "km_number", "open_time", "close_time", "closure_code",
"first_call_resolution", "handle_time", "related_incident")
levels(interaction_data$urgency) <- c("1", "2", "3", "4", "5", "5")
interaction_data$impact <- as.ordered(interaction_data$impact)
interaction_data$urgency <- as.ordered(interaction_data$urgency)
interaction_data$priority <- as.ordered(interaction_data$priority)
interaction_data <- tbl_df(interaction_data)
Now, let’s take a look at the data
glimpse(interaction_data)
## Variables:
## $ ci_name (fctr) SBA000243, SUB000443, LAP000110, DTA000...
## $ ci_type (fctr) application, subapplication, computer, ...
## $ ci_subtype (fctr) Server Based Application, Web Based App...
## $ service_component (fctr) WBS000125, WBS000125, WBS000187, WBS000...
## $ interaction_id (fctr) SD0000001, SD0000002, SD0000003, SD0000...
## $ status (fctr) Closed, Closed, Closed, Closed, Closed,...
## $ impact (fctr) 5, 4, 4, 4, 4, 4, 4, 3, 2, 3, 3, 4, 4, ...
## $ urgency (fctr) 4, 4, 4, 4, 4, 4, 4, 3, 2, 3, 3, 4, 4, ...
## $ priority (fctr) 4, 4, 4, 4, 4, 4, 4, 3, 2, 3, 3, 4, 4, ...
## $ category (fctr) incident, request for information, inci...
## $ km_number (fctr) KM0000987, KM0000989, KM0000317, KM0000...
## $ open_time (fctr) 9-9-2011 9:23, 29-9-2011 14:59, 13-10-2...
## $ close_time (fctr) 14-2-2014 9:05, 13-12-2013 16:27, 21-10...
## $ closure_code (fctr) Other, Software, Software, Unknown, Sof...
## $ first_call_resolution (fctr) N, N, N, N, N, N, N, N, N, N, N, N, N, ...
## $ handle_time (int) 239, 406, 738, 787, 459, 412, 363, 374, ...
## $ related_incident (fctr) IM0000001, IM0000001, , , IM0000003, , ...
This output shows that most columns are stored as factors, except for handle_time which is stored as an integer. This seems right for most columns, except for open_time and close_time which are clearly dates (with a timestamp). These should not be stored as a factor, but as a more appropriate data type such as lubridate’s Instant type which represents a specific moment in time. [Technically, this is not completely correct. Lubridate stores dates as a POSIXct data type, but it recognizes any date-time object that refers to a moment of time as an instant. In the remainder of the tutorial, we will always use the term instant instead of the true underlying date type.]
To transform open_time and close_time into instants, lubridate offers various methods each corresponding to a different time format. Upon close inspection of the output above, it shows that a date format is used which represents days first, months second, years third followed by hour and minute. This corresponds to the dmy_hm() function. If months whould have been represented before day, you should use the mdy_hm() function, and so on. Let’s adapt the loading script.
library(dplyr)
interaction_data <- read.csv("data/raw/detail_interaction.csv", sep = ";")
names(interaction_data) <- c("ci_name", "ci_type", "ci_subtype", "service_component",
"interaction_id", "status", "impact", "urgency", "priority",
"category", "km_number", "open_time", "close_time", "closure_code",
"first_call_resolution", "handle_time", "related_incident")
levels(interaction_data$urgency) <- c("1", "2", "3", "4", "5", "5")
interaction_data$impact <- as.ordered(interaction_data$impact)
interaction_data$urgency <- as.ordered(interaction_data$urgency)
interaction_data$priority <- as.ordered(interaction_data$priority)
interaction_data$open_time <- dmy_hm(interaction_data$open_time, tz="Europe/Amsterdam")
interaction_data$close_time <- dmy_hm(interaction_data$close_time, tz= "Europe/Amsterdam")
interaction_data <- tbl_df(interaction_data)
As you can see, the dmy_hm() function takes an additional argument, i.e. the time zone. Time zones are important in two specific scenario’s, i.e. when comparing timestamps taken in different time zones (e.g. when dealing with flight data) or when daylight saving time needs to be taken into account. Let’s illustrate the latter.
By default, lubridate assumes that instants are expressed according in UTC, which is a standard time that does not correspond to a specific timezone and which does not have daylight saving time.
start <- dmy_hm("29-03-2015 01:59")
end <- dmy_hm("29-03-2015 03:01")
end-start
## Time difference of 1.033333 hours
If we however define the exact instants but with the timezone of Belgium, we get a different result because of daylight saving time. In Belgium, at 29-03-2015 time skipped from 01:59:59 to 03:00:00. So in reality, only 2 minutes passed from start to end.
start <- dmy_hm("29-03-2015 01:59", tz="Europe/Brussels")
end <- dmy_hm("29-03-2015 03:01", tz="Europe/Brussels")
end-start
## Time difference of 2 mins
Now that the data is stored correctly in a date format, we can verify the impact on descriptive statistics calculated for these variables. Let’s get a summary of the data to begin with.
summary(interaction_data)
## ci_name ci_type
## SAN000182: 10668 application :99483
## DTA000616: 7911 subapplication:20745
## WBA000133: 6897 storage :11829
## SUB000456: 6476 computer : 8410
## DTA000057: 5708 software : 1981
## SBA000659: 4008 displaydevice : 1662
## (Other) :105336 (Other) : 2894
## ci_subtype service_component interaction_id
## Server Based Application:48469 WBS000073:33528 SD0000001: 1
## Web Based Application :39225 WBS000128:14130 SD0000002: 1
## Desktop Application :24239 WBS000092: 7219 SD0000003: 1
## SAN :11675 WBS000094: 6825 SD0000004: 1
## Citrix : 3836 WBS000091: 6411 SD0000005: 1
## Laptop : 3604 WBS000089: 6011 SD0000006: 1
## (Other) :15956 (Other) :72880 (Other) :146998
## status impact urgency priority
## Closed :146998 1: 2 1: 32 1: 2
## Open - Linked: 6 2: 904 2: 950 2: 947
## 3:16357 3:16074 3:16340
## 4:75552 4:76645 4:77534
## 5:54189 5:53303 5:52181
##
##
## category km_number
## complaint : 63 KM0002125: 5727
## incident :115704 KM0001935: 1768
## problem : 5 KM0002126: 1698
## request for change : 1 KM0001968: 1606
## request for information: 31183 KM0001625: 1415
## service request : 48 KM0000075: 1389
## (Other) :133401
## open_time close_time
## Min. :2011-09-09 09:23:00 Min. :2013-10-01 06:46:00
## 1st Qu.:2013-11-12 15:00:45 1st Qu.:2013-11-13 18:24:30
## Median :2013-12-31 08:46:00 Median :2014-01-02 11:33:30
## Mean :2013-12-28 16:24:14 Mean :2013-12-30 16:11:25
## 3rd Qu.:2014-02-12 12:33:15 3rd Qu.:2014-02-13 11:59:00
## Max. :2014-03-31 19:21:00 Max. :2014-03-31 22:47:00
##
## closure_code first_call_resolution
## Other :54487 N:53008
## Software :45573 Y:93996
## Referred : 9793
## User error : 9052
## No error - works as designed: 5975
## Hardware : 4224
## (Other) :17900
## handle_time related_incident
## Min. : 0.0 :94250
## 1st Qu.: 176.0 #MULTIVALUE: 873
## Median : 324.0 IM0000220 : 239
## Mean : 444.7 IM0031184 : 85
## 3rd Qu.: 562.0 IM0016639 : 69
## Max. :22530.0 IM0014444 : 54
## (Other) :51434
Looking at the summaries of open_time and close_time, we get a remarkable result. It appears that the earliest open_time was at 2011-09-09 09:23:00 while, the earliest close_time was at 2013-10-01 06:46:00. We can not be certain that both timestamps relate to the same interaction, but if they do, the first recorded interaction in the data lasted over two years. A quick glance at the other information gives the impression that some kind of coding error must have occured.
To get a better look at what is really happening, let’s create a new variable that represents the timespan of each interaction between open_time and close_time. Such a variable can be created with lubridate and is called an interval variable
interaction_data <- interaction_data %>% mutate(period=interval(open_time, close_time))
select(interaction_data, period)
## Source: local data frame [147,004 x 1]
##
## period
## 1 2011-09-09 09:23:00 CEST--2014-02-14 09:05:00 CET
## 2 2011-09-29 14:59:00 CEST--2013-12-13 16:27:00 CET
## 3 2011-10-13 15:47:00 CEST--2013-10-21 05:01:00 CEST
## 4 2011-12-01 15:39:00 CET--2013-10-21 05:02:00 CEST
## 5 2011-12-23 16:23:00 CET--2013-10-21 05:02:00 CEST
## 6 2012-01-16 14:09:00 CET--2013-10-21 05:03:00 CEST
## 7 2012-02-05 13:26:00 CET--2013-11-04 13:51:00 CET
## 8 2012-02-09 12:38:00 CET--2013-10-21 05:03:00 CEST
## 9 2012-02-15 15:46:00 CET--2013-10-21 05:04:00 CEST
## 10 2012-02-27 07:52:00 CET--2014-01-17 10:41:00 CET
## .. ...
Storing periods as a lubridate interval variable has several advantages. Lubridate makes it for example easy to verify if one interval falls with the boundaries of a second interval. For example, the following shows us how many interactions both started and ended in March 2014.
march_2014 <- interval(dmy("1-3-2014"), dmy("31-3-2014"))
sum(interaction_data$period %within% march_2014)
Unfortunately, lubridate and dplyr don’t work well together yet. With the current versions (dplyr_0.4.1 and lubridate_1.3.3), a dplyr data frame containing an interval column cannot be filtered nor sorted. For example, if we wanted to make a subset of the original data frame with only the interactions that started and ended in March 2014, we would normally do the following:
interaction_data %>% filter(period %within% march_2014)
However, running this gives the following error: column ‘period’ has unsupported type : Interval’. Therefore, we need to pull some tricks to make it work.
Since the problems only occur when we want to filter or arrange, we can still create a boolean column which shows which interactions occurred in March.
interaction_data <- interaction_data %>% mutate(in_march2014= period %within% march_2014)
Next, if we want to keep the interval information, we recode it as a character type
interaction_data <- interaction_data %>% mutate(period = as.character(period))
Now, we can use the boolean column in_march2014 to filter
interaction_data %>% filter(in_march2014 == TRUE) %>% select(period)
## Source: local data frame [21,674 x 1]
##
## period
## 1 2014-03-02 08:30:00 CET--2014-03-02 08:37:00 CET
## 2 2014-03-02 08:45:00 CET--2014-03-02 08:59:00 CET
## 3 2014-03-01 09:58:00 CET--2014-03-01 10:02:00 CET
## 4 2014-03-01 10:02:00 CET--2014-03-01 10:09:00 CET
## 5 2014-03-01 10:09:00 CET--2014-03-01 10:11:00 CET
## 6 2014-03-01 10:39:00 CET--2014-03-03 15:12:00 CET
## 7 2014-03-01 10:55:00 CET--2014-03-03 06:17:00 CET
## 8 2014-03-01 11:31:00 CET--2014-03-03 08:18:00 CET
## 9 2014-03-01 11:37:00 CET--2014-03-01 12:06:00 CET
## 10 2014-03-01 12:23:00 CET--2014-03-01 12:35:00 CET
## .. ...
Now, let’s take a look again at the periods of the earliest interactions in the data frame
interaction_data %>% arrange(open_time) %>% select(period)
## Source: local data frame [147,004 x 1]
##
## period
## 1 2011-09-09 09:23:00 CEST--2014-02-14 09:05:00 CET
## 2 2011-09-29 14:59:00 CEST--2013-12-13 16:27:00 CET
## 3 2011-10-13 15:47:00 CEST--2013-10-21 05:01:00 CEST
## 4 2011-12-01 15:39:00 CET--2013-10-21 05:02:00 CEST
## 5 2011-12-23 16:23:00 CET--2013-10-21 05:02:00 CEST
## 6 2012-01-16 14:09:00 CET--2013-10-21 05:03:00 CEST
## 7 2012-02-05 13:26:00 CET--2013-11-04 13:51:00 CET
## 8 2012-02-09 12:38:00 CET--2013-10-21 05:03:00 CEST
## 9 2012-02-15 15:46:00 CET--2013-10-21 05:04:00 CEST
## 10 2012-02-27 07:52:00 CET--2014-01-17 10:41:00 CET
## .. ...
This output shows us that the earliest interactions go back to 2011, but all of them are not closed until 2013 or 2014. I get the feeling that whoever made this data set, took a sample of all interactions in 2013 and 2014. However, the selection they made included any interaction which ended somewhere in the selected period. As a result, very old interactions which took a long time are still in the data set. If this hypothesis is true, we should see the average duration for each month go down until the actual selection period is chosen.
Therefore, we will create a column representing the duration of an interval, which is typically stored as a lubridate duration object. To determine the duration of an interval, we apply as.duration() to the interval. Note that since we transformed the period column to a character data type, we need to recreate the interval.
interaction_data <- interaction_data %>% mutate(duration = as.duration(interval(open_time, close_time)))
interaction_data %>% select(period, duration)
## Source: local data frame [147,004 x 2]
##
## period
## 1 2011-09-09 09:23:00 CEST--2014-02-14 09:05:00 CET
## 2 2011-09-29 14:59:00 CEST--2013-12-13 16:27:00 CET
## 3 2011-10-13 15:47:00 CEST--2013-10-21 05:01:00 CEST
## 4 2011-12-01 15:39:00 CET--2013-10-21 05:02:00 CEST
## 5 2011-12-23 16:23:00 CET--2013-10-21 05:02:00 CEST
## 6 2012-01-16 14:09:00 CET--2013-10-21 05:03:00 CEST
## 7 2012-02-05 13:26:00 CET--2013-11-04 13:51:00 CET
## 8 2012-02-09 12:38:00 CET--2013-10-21 05:03:00 CEST
## 9 2012-02-15 15:46:00 CET--2013-10-21 05:04:00 CEST
## 10 2012-02-27 07:52:00 CET--2014-01-17 10:41:00 CET
## .. ...
## Variables not shown: duration (dbl)
Let’s use ggplot to have a first quick look whether our assumption could hold. Let’s plot the duration against the open_time as a points.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
ggplot(interaction_data, aes(x=open_time, y=duration)) + geom_point()
This plot does not contradict our hypothesis, so let’s take a closer look. Let’s plot the average duration per month against each month. To do so, we need to transform our data frame. Unfortunately, dplyr is not yet able to deal with lubridate’s duration class (except for the select and mutate verbs). To solve this, we transform duration into a numeric data type
interaction_data <- interaction_data %>% mutate(duration = as.numeric(duration))
The following code first groups the interaction by year and month, using the lubridate’s year and month methods. These methods extract the year and month from a lubridate instant variable. Next, we summarize the data for each defined group (year-month) by calculating the average duration and the number of interactions. Because duration is expressed in seconds, we decided to transform it to days. To do so, we first transform the duration into a lubridate duration object and divide it by the duration of a single day (dday(1)). Note that the result of this is not a duration object, but a numeric object, which poses no problems for dplyr. Finally, we create a new variable which represents the month-year combination as an Instant and both print and plot the results.
monthly <- interaction_data %>%
group_by(year = year(open_time), month = month(open_time)) %>%
summarise(avg_duration = mean(duration), n_interactions = n()) %>%
mutate(avg_duration = as.duration(avg_duration) / ddays(1)) %>%
mutate(date = ymd(1+ month*100 + year*10000))
monthly %>% select(date, n_interactions, avg_duration) %>% print(n=nrow(monthly))
## Source: local data frame [30 x 4]
## Groups: year
##
## year date n_interactions avg_duration
## 1 2011 2011-09-01 2 847.5659722
## 2 2011 2011-10-01 1 738.5513889
## 3 2011 2011-12-01 2 678.5006944
## 4 2012 2012-01-01 1 643.5791667
## 5 2012 2012-02-01 4 640.3223958
## 6 2012 2012-03-01 6 588.0273148
## 7 2012 2012-04-01 1 555.7118056
## 8 2012 2012-05-01 2 550.1218750
## 9 2012 2012-06-01 4 543.5564236
## 10 2012 2012-07-01 4 545.8097222
## 11 2012 2012-08-01 8 474.5415799
## 12 2012 2012-09-01 5 464.3123611
## 13 2012 2012-10-01 7 412.6067460
## 14 2012 2012-11-01 6 418.5134259
## 15 2012 2012-12-01 4 345.2718750
## 16 2013 2013-01-01 15 335.4605093
## 17 2013 2013-02-01 13 302.6753739
## 18 2013 2013-03-01 15 274.9146759
## 19 2013 2013-04-01 29 233.0033046
## 20 2013 2013-05-01 30 213.1684259
## 21 2013 2013-06-01 42 166.0620205
## 22 2013 2013-07-01 105 111.7106548
## 23 2013 2013-08-01 242 65.9795483
## 24 2013 2013-09-01 978 22.3574891
## 25 2013 2013-10-01 25730 1.5912801
## 26 2013 2013-11-01 25013 1.6274383
## 27 2013 2013-12-01 21895 1.5195094
## 28 2014 2014-01-01 26531 1.1108634
## 29 2014 2014-02-01 23316 1.0254645
## 30 2014 2014-03-01 22993 0.5298269
ggplot(monthly) + geom_line(aes(x=date, y=avg_duration))
These results strongly support our hypothesis and seem to suggest that the original data contain all interactions which ended in the period between October 2013 and March 2014. Let’s verify this by taking a look at the earliest close_time.
min(interaction_data$close_time)
## [1] "2013-10-01 06:46:00 CEST"
This supports the hypothesis again. Based on these results it seems wise to filter a subset which contains only the interactions that both started and completed in the period from 1 October 2013 until 31 March 2014. For this we create a new load script by the name load_interaction_data_oct2013_mar2014.R.
library(dplyr)
library(lubridate)
source("data/load_interaction_data.R")
interaction_data <- interaction_data %>%
mutate(period = interval(open_time, close_time)) %>%
mutate(in_range = period %within% interval(dmy(01102013),dmy(01042014))) %>%
select(-period) %>%
filter(in_range == TRUE) %>%
select(-in_range)
The final package we will be discussing in this tutorial is the TidyR package. For detailed information about the package, we refer the interestered reader to the following two nice sources:
In this tutorial we will focus on how we can use spread function to reformat the incident_activity_data. But first, let’s take a look at the loading script for the incident_activity_data (Note that we transformed timestamp into a lubridate Instant type compared with the previous version of this script).
library(dplyr)
incident_activity_data <- tbl_df(read.csv("data/raw/detail_incident_activity.csv", sep=";"))
incident_activity_data <- select(incident_activity_data,
incident_id = Incident.ID,
timestamp = DateStamp,
incident_activity_id = IncidentActivity_Number,
incident_activity_type = IncidentActivity_Type,
assignment_group = Assignment.Group,
km_number = KM.number,
interaction_id = Interaction.ID)
incident_activity_data$timestamp <- dmy_hms(incident_activity_data$timestamp, tz = "Europe/Amsterdam")
incident_activity_data$interaction_id[incident_activity_data$interaction_id == "#N/B"] = NA
incident_activity_data <- droplevels(incident_activity_data)
Let’s take a glimpse at the data
glimpse(incident_activity_data)
## Observations: 466737
## Variables:
## $ incident_id (fctr) IM0000004, IM0000004, IM0000004, IM000...
## $ timestamp (time) 2013-01-07 08:17:17, 2013-11-04 13:41:...
## $ incident_activity_id (fctr) 001A3689763, 001A5852941, 001A5852943,...
## $ incident_activity_type (fctr) Reassignment, Reassignment, Update fro...
## $ assignment_group (fctr) TEAM0001, TEAM0002, TEAM0002, TEAM0003...
## $ km_number (fctr) KM0000553, KM0000553, KM0000553, KM000...
## $ interaction_id (fctr) SD0000007, SD0000007, SD0000007, SD000...
As we can see, each record in this table represents one particular activity performed for one specific incident. This representation has both advantages and disadvantages. For example, if we want to know the number of activities for each incident, together with the timestamps of the earliest and latest activities, we can use this representation together with dplyr.
activities_per_incident <- incident_activity_data %>%
group_by(incident_id) %>%
summarize(n_activities = n(), ts_earliest = min(timestamp), ts_latest = max(timestamp)) %>%
print()
## Source: local data frame [46,616 x 4]
##
## incident_id n_activities ts_earliest ts_latest
## 1 IM0000004 10 2013-01-07 08:17:17 2013-11-04 13:51:18
## 2 IM0000005 28 2013-01-07 08:17:54 2013-12-02 12:36:26
## 3 IM0000006 2 2014-01-13 15:13:13 2014-01-13 15:13:13
## 4 IM0000011 20 2013-01-07 08:19:20 2013-11-14 09:31:26
## 5 IM0000012 6 2013-02-13 11:57:55 2013-11-08 13:55:53
## 6 IM0000013 6 2013-02-13 12:03:35 2013-11-08 13:54:13
## 7 IM0000014 8 2013-02-13 12:21:01 2013-12-27 10:59:33
## 8 IM0000015 18 2013-02-13 12:31:12 2013-11-08 14:09:10
## 9 IM0000017 6 2013-02-13 12:38:26 2013-11-08 14:02:21
## 10 IM0000018 6 2013-02-27 13:16:20 2013-11-08 14:35:43
## .. ... ... ... ...
Now we can also calculate the actual duration between the first and last activity.
activities_per_incident <- activities_per_incident %>%
mutate(timespan = interval(ts_earliest, ts_latest) / dhours(1))
activities_per_incident %>% select(incident_id, n_activities, ts_earliest, timespan) %>%print()
## Source: local data frame [46,616 x 4]
##
## incident_id n_activities ts_earliest timespan
## 1 IM0000004 10 2013-01-07 08:17:17 7229.567
## 2 IM0000005 28 2013-01-07 08:17:54 7900.309
## 3 IM0000006 2 2014-01-13 15:13:13 0.000
## 4 IM0000011 20 2013-01-07 08:19:20 7465.202
## 5 IM0000012 6 2013-02-13 11:57:55 6433.966
## 6 IM0000013 6 2013-02-13 12:03:35 6433.844
## 7 IM0000014 8 2013-02-13 12:21:01 7606.642
## 8 IM0000015 18 2013-02-13 12:31:12 6433.633
## 9 IM0000017 6 2013-02-13 12:38:26 6433.399
## 10 IM0000018 6 2013-02-27 13:16:20 6097.323
## .. ... ... ... ...
However, we could also represent the incident_activity_data such that each row represents all activities of a single incident. First, let’s have a look at the distribution of activity types.
table(incident_activity_data$incident_activity_type)
##
## Affected CI Change alert stage 1
## 185 113
## Analysis/Research Assignment
## 981 88502
## Callback Request Caused By CI
## 53 34382
## Closed Communication with customer
## 50145 6148
## Communication with vendor Contact Change
## 1777 32
## Description Update Dial-in
## 4501 2
## External update External Vendor Assignment
## 1099 4354
## External Vendor Reassignment Impact Change
## 43 1283
## Incident reproduction Mail to Customer
## 23 3788
## Notify By Change OO Response
## 281 33
## Open Operator Update
## 46607 56292
## Pending vendor Problem Closure
## 4338 40
## Problem Workaround Quality Indicator
## 140 2465
## Quality Indicator Fixed Quality Indicator Set
## 7791 1956
## Reassignment Referred
## 51961 29
## Reopen Resolved
## 2428 1626
## Service Change Status Change
## 159 50914
## Update Update from customer
## 35969 3906
## Urgency Change Vendor Reference
## 1317 941
## Vendor Reference Change
## 133
With the spread function of tidyR, we can transform the data set such that each activity type gets its own column with the corresponding timestamp as its value. However, since one incident can contain multiple activities of the same type, spread will run into problems determining the correct timestamp for the activity types. Therefore, we must distinguish between the first occurrence of an activity type and the second occurence. We can solve this with dplyr.
The following code will first group the data by incident_id and incident_activity_type. Next, the dplyr function row_number is used to determine the row number based on the timestamp (please note that this is repeated for each group). Finally, we ungroup the data and order by incident_id and timestamp
incident_activity_data <- incident_activity_data %>%
group_by(incident_id, incident_activity_type) %>%
mutate(n_rep = row_number(timestamp)) %>%
ungroup() %>%
arrange(incident_id, timestamp)
Let’s have a look at the data
incident_activity_data %>% select(incident_id, incident_activity_type, n_rep, timestamp)
## Source: local data frame [466,737 x 4]
##
## incident_id incident_activity_type n_rep timestamp
## 1 IM0000004 Reassignment 1 2013-01-07 08:17:17
## 2 IM0000004 Operator Update 1 2013-09-25 08:27:40
## 3 IM0000004 Operator Update 2 2013-11-04 12:09:37
## 4 IM0000004 Assignment 1 2013-11-04 12:09:37
## 5 IM0000004 Reassignment 2 2013-11-04 12:09:37
## 6 IM0000004 Reassignment 3 2013-11-04 13:41:30
## 7 IM0000004 Update from customer 1 2013-11-04 13:41:30
## 8 IM0000004 Assignment 2 2013-11-04 13:41:30
## 9 IM0000004 Closed 1 2013-11-04 13:51:18
## 10 IM0000004 Caused By CI 1 2013-11-04 13:51:18
## .. ... ... ... ...
Next, we will need to combine the columns incident_activity_type and n_rep into a single column. This can be done with the unite function of the tidyr package.
incident_activity_data <- incident_activity_data %>%
unite(act_type, incident_activity_type, n_rep, sep="_", remove = FALSE)
Let’s take a look at the output
incident_activity_data %>% select(incident_id, act_type, timestamp)
## Source: local data frame [466,737 x 3]
##
## incident_id act_type timestamp
## 1 IM0000004 Reassignment_1 2013-01-07 08:17:17
## 2 IM0000004 Operator Update_1 2013-09-25 08:27:40
## 3 IM0000004 Operator Update_2 2013-11-04 12:09:37
## 4 IM0000004 Assignment_1 2013-11-04 12:09:37
## 5 IM0000004 Reassignment_2 2013-11-04 12:09:37
## 6 IM0000004 Reassignment_3 2013-11-04 13:41:30
## 7 IM0000004 Update from customer_1 2013-11-04 13:41:30
## 8 IM0000004 Assignment_2 2013-11-04 13:41:30
## 9 IM0000004 Closed_1 2013-11-04 13:51:18
## 10 IM0000004 Caused By CI_1 2013-11-04 13:51:18
## .. ... ... ...
Now, we can use spread to create a separate column for each activity type with the corresponding timestamp as its value. But to avoid an explosion of new columns, we are only interested in the first three occurences of each activity. Also, it should be noted that the act of spreading the data, results in R storing the timestamps as doubles instead of dates. We created a helper function to transform this back.
transform_back_to_date <- function(x){
as.POSIXct(x, origin = '1970-01-01 00:00:00', tz = "Europe/Amsterdam")
}
incident_data_details <- incident_activity_data %>%
filter(n_rep <=3) %>%
select(incident_id, act_type, timestamp) %>%
spread(act_type, timestamp) %>%
mutate_each(funs(transform_back_to_date), -incident_id) %>%
print()
## Source: local data frame [46,616 x 103]
##
## incident_id Affected CI Change_1 alert stage 1_1 alert stage 1_2
## 1 IM0000004 <NA> <NA> <NA>
## 2 IM0000005 <NA> <NA> <NA>
## 3 IM0000006 <NA> <NA> <NA>
## 4 IM0000011 <NA> <NA> <NA>
## 5 IM0000012 <NA> <NA> <NA>
## 6 IM0000013 <NA> <NA> <NA>
## 7 IM0000014 <NA> <NA> <NA>
## 8 IM0000015 <NA> <NA> <NA>
## 9 IM0000017 <NA> <NA> <NA>
## 10 IM0000018 <NA> <NA> <NA>
## .. ... ... ... ...
## Variables not shown: alert stage 1_3 (time), Analysis/Research_1 (time),
## Analysis/Research_2 (time), Analysis/Research_3 (time), Assignment_1
## (time), Assignment_2 (time), Assignment_3 (time), Callback Request_1
## (time), Caused By CI_1 (time), Caused By CI_2 (time), Caused By CI_3
## (time), Closed_1 (time), Closed_2 (time), Closed_3 (time), Communication
## with customer_1 (time), Communication with customer_2 (time),
## Communication with customer_3 (time), Communication with vendor_1
## (time), Communication with vendor_2 (time), Communication with vendor_3
## (time), Contact Change_1 (time), Contact Change_2 (time), Description
## Update_1 (time), Description Update_2 (time), Description Update_3
## (time), Dial-in_1 (time), External update_1 (time), External update_2
## (time), External update_3 (time), External Vendor Assignment_1 (time),
## External Vendor Assignment_2 (time), External Vendor Assignment_3
## (time), External Vendor Reassignment_1 (time), External Vendor
## Reassignment_2 (time), Impact Change_1 (time), Impact Change_2 (time),
## Impact Change_3 (time), Incident reproduction_1 (time), Incident
## reproduction_2 (time), Mail to Customer_1 (time), Mail to Customer_2
## (time), Mail to Customer_3 (time), Notify By Change_1 (time), Notify By
## Change_2 (time), Notify By Change_3 (time), OO Response_1 (time), OO
## Response_2 (time), OO Response_3 (time), Open_1 (time), Open_2 (time),
## Open_3 (time), Operator Update_1 (time), Operator Update_2 (time),
## Operator Update_3 (time), Pending vendor_1 (time), Pending vendor_2
## (time), Pending vendor_3 (time), Problem Closure_1 (time), Problem
## Closure_2 (time), Problem Workaround_1 (time), Problem Workaround_2
## (time), Problem Workaround_3 (time), Quality Indicator Fixed_1 (time),
## Quality Indicator Fixed_2 (time), Quality Indicator Fixed_3 (time),
## Quality Indicator Set_1 (time), Quality Indicator Set_2 (time), Quality
## Indicator Set_3 (time), Quality Indicator_1 (time), Quality Indicator_2
## (time), Quality Indicator_3 (time), Reassignment_1 (time),
## Reassignment_2 (time), Reassignment_3 (time), Referred_1 (time),
## Referred_2 (time), Reopen_1 (time), Reopen_2 (time), Reopen_3 (time),
## Resolved_1 (time), Resolved_2 (time), Service Change_1 (time), Status
## Change_1 (time), Status Change_2 (time), Status Change_3 (time), Update
## from customer_1 (time), Update from customer_2 (time), Update from
## customer_3 (time), Update_1 (time), Update_2 (time), Update_3 (time),
## Urgency Change_1 (time), Urgency Change_2 (time), Urgency Change_3
## (time), Vendor Reference Change_1 (time), Vendor Reference Change_2
## (time), Vendor Reference Change_3 (time), Vendor Reference_1 (time),
## Vendor Reference_2 (time)