Lubridate and TidyR

## Warning: package 'knitr' was built under R version 3.1.3

## Warning: package 'dplyr' was built under R version 3.1.3

## Warning: package 'lubridate' was built under R version 3.1.3

## Warning: package 'tidyr' was built under R version 3.1.3

Lubridate

Lubridate is an R package which makes it easier to work with date and timestamps. In this tutorial, we will show some applications of the lubridate package. For more detailed information on lubridate, please refer to :

Using lubridate in data loading scripts

If your dataset contains dates or timestamps, lubridate is the go-to package to transform these columns into a proper format. Let’s take an example at the loading script for the interaction data set.

library(dplyr)
interaction_data <- read.csv("data/raw/detail_interaction.csv", sep = ";")
names(interaction_data) <- c("ci_name", "ci_type", "ci_subtype", "service_component", 
                 "interaction_id", "status", "impact", "urgency", "priority",
                 "category", "km_number", "open_time", "close_time", "closure_code",
                 "first_call_resolution", "handle_time", "related_incident")
levels(interaction_data$urgency) <- c("1", "2", "3", "4", "5", "5")
interaction_data$impact <- as.ordered(interaction_data$impact)
interaction_data$urgency <- as.ordered(interaction_data$urgency)
interaction_data$priority <- as.ordered(interaction_data$priority)
interaction_data <- tbl_df(interaction_data)

Now, let’s take a look at the data

glimpse(interaction_data)

## Variables:
## $ ci_name               (fctr) SBA000243, SUB000443, LAP000110, DTA000...
## $ ci_type               (fctr) application, subapplication, computer, ...
## $ ci_subtype            (fctr) Server Based Application, Web Based App...
## $ service_component     (fctr) WBS000125, WBS000125, WBS000187, WBS000...
## $ interaction_id        (fctr) SD0000001, SD0000002, SD0000003, SD0000...
## $ status                (fctr) Closed, Closed, Closed, Closed, Closed,...
## $ impact                (fctr) 5, 4, 4, 4, 4, 4, 4, 3, 2, 3, 3, 4, 4, ...
## $ urgency               (fctr) 4, 4, 4, 4, 4, 4, 4, 3, 2, 3, 3, 4, 4, ...
## $ priority              (fctr) 4, 4, 4, 4, 4, 4, 4, 3, 2, 3, 3, 4, 4, ...
## $ category              (fctr) incident, request for information, inci...
## $ km_number             (fctr) KM0000987, KM0000989, KM0000317, KM0000...
## $ open_time             (fctr) 9-9-2011 9:23, 29-9-2011 14:59, 13-10-2...
## $ close_time            (fctr) 14-2-2014 9:05, 13-12-2013 16:27, 21-10...
## $ closure_code          (fctr) Other, Software, Software, Unknown, Sof...
## $ first_call_resolution (fctr) N, N, N, N, N, N, N, N, N, N, N, N, N, ...
## $ handle_time           (int) 239, 406, 738, 787, 459, 412, 363, 374, ...
## $ related_incident      (fctr) IM0000001, IM0000001, , , IM0000003, , ...

This output shows that most columns are stored as factors, except for handle_time which is stored as an integer. This seems right for most columns, except for open_time and close_time which are clearly dates (with a timestamp). These should not be stored as a factor, but as a more appropriate data type such as lubridate’s Instant type which represents a specific moment in time. [Technically, this is not completely correct. Lubridate stores dates as a POSIXct data type, but it recognizes any date-time object that refers to a moment of time as an instant. In the remainder of the tutorial, we will always use the term instant instead of the true underlying date type.]

To transform open_time and close_time into instants, lubridate offers various methods each corresponding to a different time format. Upon close inspection of the output above, it shows that a date format is used which represents days first, months second, years third followed by hour and minute. This corresponds to the dmy_hm() function. If months whould have been represented before day, you should use the mdy_hm() function, and so on. Let’s adapt the loading script.

library(dplyr)
interaction_data <- read.csv("data/raw/detail_interaction.csv", sep = ";")
names(interaction_data) <- c("ci_name", "ci_type", "ci_subtype", "service_component", 
                 "interaction_id", "status", "impact", "urgency", "priority",
                 "category", "km_number", "open_time", "close_time", "closure_code",
                 "first_call_resolution", "handle_time", "related_incident")
levels(interaction_data$urgency) <- c("1", "2", "3", "4", "5", "5")
interaction_data$impact <- as.ordered(interaction_data$impact)
interaction_data$urgency <- as.ordered(interaction_data$urgency)
interaction_data$priority <- as.ordered(interaction_data$priority)
interaction_data$open_time <- dmy_hm(interaction_data$open_time, tz="Europe/Amsterdam")
interaction_data$close_time <- dmy_hm(interaction_data$close_time, tz= "Europe/Amsterdam")
interaction_data <- tbl_df(interaction_data)

As you can see, the dmy_hm() function takes an additional argument, i.e. the time zone. Time zones are important in two specific scenario’s, i.e. when comparing timestamps taken in different time zones (e.g. when dealing with flight data) or when daylight saving time needs to be taken into account. Let’s illustrate the latter.

By default, lubridate assumes that instants are expressed according in UTC, which is a standard time that does not correspond to a specific timezone and which does not have daylight saving time.

start <- dmy_hm("29-03-2015 01:59")
end <- dmy_hm("29-03-2015 03:01")
end-start

## Time difference of 1.033333 hours

If we however define the exact instants but with the timezone of Belgium, we get a different result because of daylight saving time. In Belgium, at 29-03-2015 time skipped from 01:59:59 to 03:00:00. So in reality, only 2 minutes passed from start to end.

start <- dmy_hm("29-03-2015 01:59", tz="Europe/Brussels")
end <- dmy_hm("29-03-2015 03:01", tz="Europe/Brussels")
end-start

## Time difference of 2 mins

Lubridate and Descriptive statistics

Now that the data is stored correctly in a date format, we can verify the impact on descriptive statistics calculated for these variables. Let’s get a summary of the data to begin with.

summary(interaction_data)

##       ci_name                 ci_type     
##  SAN000182: 10668   application   :99483  
##  DTA000616:  7911   subapplication:20745  
##  WBA000133:  6897   storage       :11829  
##  SUB000456:  6476   computer      : 8410  
##  DTA000057:  5708   software      : 1981  
##  SBA000659:  4008   displaydevice : 1662  
##  (Other)  :105336   (Other)       : 2894  
##                     ci_subtype    service_component   interaction_id  
##  Server Based Application:48469   WBS000073:33528   SD0000001:     1  
##  Web Based Application   :39225   WBS000128:14130   SD0000002:     1  
##  Desktop Application     :24239   WBS000092: 7219   SD0000003:     1  
##  SAN                     :11675   WBS000094: 6825   SD0000004:     1  
##  Citrix                  : 3836   WBS000091: 6411   SD0000005:     1  
##  Laptop                  : 3604   WBS000089: 6011   SD0000006:     1  
##  (Other)                 :15956   (Other)  :72880   (Other)  :146998  
##            status       impact    urgency   priority 
##  Closed       :146998   1:    2   1:   32   1:    2  
##  Open - Linked:     6   2:  904   2:  950   2:  947  
##                         3:16357   3:16074   3:16340  
##                         4:75552   4:76645   4:77534  
##                         5:54189   5:53303   5:52181  
##                                                      
##                                                      
##                     category          km_number     
##  complaint              :    63   KM0002125:  5727  
##  incident               :115704   KM0001935:  1768  
##  problem                :     5   KM0002126:  1698  
##  request for change     :     1   KM0001968:  1606  
##  request for information: 31183   KM0001625:  1415  
##  service request        :    48   KM0000075:  1389  
##                                   (Other)  :133401  
##    open_time                     close_time                 
##  Min.   :2011-09-09 09:23:00   Min.   :2013-10-01 06:46:00  
##  1st Qu.:2013-11-12 15:00:45   1st Qu.:2013-11-13 18:24:30  
##  Median :2013-12-31 08:46:00   Median :2014-01-02 11:33:30  
##  Mean   :2013-12-28 16:24:14   Mean   :2013-12-30 16:11:25  
##  3rd Qu.:2014-02-12 12:33:15   3rd Qu.:2014-02-13 11:59:00  
##  Max.   :2014-03-31 19:21:00   Max.   :2014-03-31 22:47:00  
##                                                             
##                        closure_code   first_call_resolution
##  Other                       :54487   N:53008              
##  Software                    :45573   Y:93996              
##  Referred                    : 9793                        
##  User error                  : 9052                        
##  No error - works as designed: 5975                        
##  Hardware                    : 4224                        
##  (Other)                     :17900                        
##   handle_time         related_incident
##  Min.   :    0.0              :94250  
##  1st Qu.:  176.0   #MULTIVALUE:  873  
##  Median :  324.0   IM0000220  :  239  
##  Mean   :  444.7   IM0031184  :   85  
##  3rd Qu.:  562.0   IM0016639  :   69  
##  Max.   :22530.0   IM0014444  :   54  
##                    (Other)    :51434

Looking at the summaries of open_time and close_time, we get a remarkable result. It appears that the earliest open_time was at 2011-09-09 09:23:00 while, the earliest close_time was at 2013-10-01 06:46:00. We can not be certain that both timestamps relate to the same interaction, but if they do, the first recorded interaction in the data lasted over two years. A quick glance at the other information gives the impression that some kind of coding error must have occured.

To get a better look at what is really happening, let’s create a new variable that represents the timespan of each interaction between open_time and close_time. Such a variable can be created with lubridate and is called an interval variable

interaction_data <- interaction_data %>% mutate(period=interval(open_time, close_time)) 
select(interaction_data, period)

## Source: local data frame [147,004 x 1]
## 
##                                                period
## 1   2011-09-09 09:23:00 CEST--2014-02-14 09:05:00 CET
## 2   2011-09-29 14:59:00 CEST--2013-12-13 16:27:00 CET
## 3  2011-10-13 15:47:00 CEST--2013-10-21 05:01:00 CEST
## 4   2011-12-01 15:39:00 CET--2013-10-21 05:02:00 CEST
## 5   2011-12-23 16:23:00 CET--2013-10-21 05:02:00 CEST
## 6   2012-01-16 14:09:00 CET--2013-10-21 05:03:00 CEST
## 7    2012-02-05 13:26:00 CET--2013-11-04 13:51:00 CET
## 8   2012-02-09 12:38:00 CET--2013-10-21 05:03:00 CEST
## 9   2012-02-15 15:46:00 CET--2013-10-21 05:04:00 CEST
## 10   2012-02-27 07:52:00 CET--2014-01-17 10:41:00 CET
## ..                                                ...

Storing periods as a lubridate interval variable has several advantages. Lubridate makes it for example easy to verify if one interval falls with the boundaries of a second interval. For example, the following shows us how many interactions both started and ended in March 2014.

march_2014 <- interval(dmy("1-3-2014"), dmy("31-3-2014"))
sum(interaction_data$period %within% march_2014)

Unfortunately, lubridate and dplyr don’t work well together yet. With the current versions (dplyr_0.4.1 and lubridate_1.3.3), a dplyr data frame containing an interval column cannot be filtered nor sorted. For example, if we wanted to make a subset of the original data frame with only the interactions that started and ended in March 2014, we would normally do the following:

interaction_data %>% filter(period %within% march_2014)

However, running this gives the following error: column ‘period’ has unsupported type : Interval’. Therefore, we need to pull some tricks to make it work.

Since the problems only occur when we want to filter or arrange, we can still create a boolean column which shows which interactions occurred in March.

interaction_data <- interaction_data %>% mutate(in_march2014= period %within% march_2014)

Next, if we want to keep the interval information, we recode it as a character type

interaction_data <- interaction_data %>% mutate(period = as.character(period))

Now, we can use the boolean column in_march2014 to filter

interaction_data %>% filter(in_march2014 == TRUE) %>% select(period)

## Source: local data frame [21,674 x 1]
## 
##                                              period
## 1  2014-03-02 08:30:00 CET--2014-03-02 08:37:00 CET
## 2  2014-03-02 08:45:00 CET--2014-03-02 08:59:00 CET
## 3  2014-03-01 09:58:00 CET--2014-03-01 10:02:00 CET
## 4  2014-03-01 10:02:00 CET--2014-03-01 10:09:00 CET
## 5  2014-03-01 10:09:00 CET--2014-03-01 10:11:00 CET
## 6  2014-03-01 10:39:00 CET--2014-03-03 15:12:00 CET
## 7  2014-03-01 10:55:00 CET--2014-03-03 06:17:00 CET
## 8  2014-03-01 11:31:00 CET--2014-03-03 08:18:00 CET
## 9  2014-03-01 11:37:00 CET--2014-03-01 12:06:00 CET
## 10 2014-03-01 12:23:00 CET--2014-03-01 12:35:00 CET
## ..                                              ...

Now, let’s take a look again at the periods of the earliest interactions in the data frame

interaction_data %>% arrange(open_time) %>% select(period)

## Source: local data frame [147,004 x 1]
## 
##                                                period
## 1   2011-09-09 09:23:00 CEST--2014-02-14 09:05:00 CET
## 2   2011-09-29 14:59:00 CEST--2013-12-13 16:27:00 CET
## 3  2011-10-13 15:47:00 CEST--2013-10-21 05:01:00 CEST
## 4   2011-12-01 15:39:00 CET--2013-10-21 05:02:00 CEST
## 5   2011-12-23 16:23:00 CET--2013-10-21 05:02:00 CEST
## 6   2012-01-16 14:09:00 CET--2013-10-21 05:03:00 CEST
## 7    2012-02-05 13:26:00 CET--2013-11-04 13:51:00 CET
## 8   2012-02-09 12:38:00 CET--2013-10-21 05:03:00 CEST
## 9   2012-02-15 15:46:00 CET--2013-10-21 05:04:00 CEST
## 10   2012-02-27 07:52:00 CET--2014-01-17 10:41:00 CET
## ..                                                ...

This output shows us that the earliest interactions go back to 2011, but all of them are not closed until 2013 or 2014. I get the feeling that whoever made this data set, took a sample of all interactions in 2013 and 2014. However, the selection they made included any interaction which ended somewhere in the selected period. As a result, very old interactions which took a long time are still in the data set. If this hypothesis is true, we should see the average duration for each month go down until the actual selection period is chosen.

Therefore, we will create a column representing the duration of an interval, which is typically stored as a lubridate duration object. To determine the duration of an interval, we apply as.duration() to the interval. Note that since we transformed the period column to a character data type, we need to recreate the interval.

interaction_data <- interaction_data %>% mutate(duration = as.duration(interval(open_time, close_time)))
interaction_data %>% select(period, duration)

## Source: local data frame [147,004 x 2]
## 
##                                                period
## 1   2011-09-09 09:23:00 CEST--2014-02-14 09:05:00 CET
## 2   2011-09-29 14:59:00 CEST--2013-12-13 16:27:00 CET
## 3  2011-10-13 15:47:00 CEST--2013-10-21 05:01:00 CEST
## 4   2011-12-01 15:39:00 CET--2013-10-21 05:02:00 CEST
## 5   2011-12-23 16:23:00 CET--2013-10-21 05:02:00 CEST
## 6   2012-01-16 14:09:00 CET--2013-10-21 05:03:00 CEST
## 7    2012-02-05 13:26:00 CET--2013-11-04 13:51:00 CET
## 8   2012-02-09 12:38:00 CET--2013-10-21 05:03:00 CEST
## 9   2012-02-15 15:46:00 CET--2013-10-21 05:04:00 CEST
## 10   2012-02-27 07:52:00 CET--2014-01-17 10:41:00 CET
## ..                                                ...
## Variables not shown: duration (dbl)

Let’s use ggplot to have a first quick look whether our assumption could hold. Let’s plot the duration against the open_time as a points.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.1.3

ggplot(interaction_data, aes(x=open_time, y=duration)) + geom_point()

This plot does not contradict our hypothesis, so let’s take a closer look. Let’s plot the average duration per month against each month. To do so, we need to transform our data frame. Unfortunately, dplyr is not yet able to deal with lubridate’s duration class (except for the select and mutate verbs). To solve this, we transform duration into a numeric data type

interaction_data <- interaction_data %>% mutate(duration = as.numeric(duration))

The following code first groups the interaction by year and month, using the lubridate’s year and month methods. These methods extract the year and month from a lubridate instant variable. Next, we summarize the data for each defined group (year-month) by calculating the average duration and the number of interactions. Because duration is expressed in seconds, we decided to transform it to days. To do so, we first transform the duration into a lubridate duration object and divide it by the duration of a single day (dday(1)). Note that the result of this is not a duration object, but a numeric object, which poses no problems for dplyr. Finally, we create a new variable which represents the month-year combination as an Instant and both print and plot the results.

monthly <- interaction_data %>% 
  group_by(year = year(open_time), month = month(open_time))  %>%
  summarise(avg_duration = mean(duration), n_interactions = n()) %>%
  mutate(avg_duration = as.duration(avg_duration) / ddays(1)) %>% 
  mutate(date = ymd(1+ month*100 + year*10000))
monthly %>% select(date, n_interactions, avg_duration) %>% print(n=nrow(monthly))

## Source: local data frame [30 x 4]
## Groups: year
## 
##    year       date n_interactions avg_duration
## 1  2011 2011-09-01              2  847.5659722
## 2  2011 2011-10-01              1  738.5513889
## 3  2011 2011-12-01              2  678.5006944
## 4  2012 2012-01-01              1  643.5791667
## 5  2012 2012-02-01              4  640.3223958
## 6  2012 2012-03-01              6  588.0273148
## 7  2012 2012-04-01              1  555.7118056
## 8  2012 2012-05-01              2  550.1218750
## 9  2012 2012-06-01              4  543.5564236
## 10 2012 2012-07-01              4  545.8097222
## 11 2012 2012-08-01              8  474.5415799
## 12 2012 2012-09-01              5  464.3123611
## 13 2012 2012-10-01              7  412.6067460
## 14 2012 2012-11-01              6  418.5134259
## 15 2012 2012-12-01              4  345.2718750
## 16 2013 2013-01-01             15  335.4605093
## 17 2013 2013-02-01             13  302.6753739
## 18 2013 2013-03-01             15  274.9146759
## 19 2013 2013-04-01             29  233.0033046
## 20 2013 2013-05-01             30  213.1684259
## 21 2013 2013-06-01             42  166.0620205
## 22 2013 2013-07-01            105  111.7106548
## 23 2013 2013-08-01            242   65.9795483
## 24 2013 2013-09-01            978   22.3574891
## 25 2013 2013-10-01          25730    1.5912801
## 26 2013 2013-11-01          25013    1.6274383
## 27 2013 2013-12-01          21895    1.5195094
## 28 2014 2014-01-01          26531    1.1108634
## 29 2014 2014-02-01          23316    1.0254645
## 30 2014 2014-03-01          22993    0.5298269

ggplot(monthly) + geom_line(aes(x=date, y=avg_duration))

These results strongly support our hypothesis and seem to suggest that the original data contain all interactions which ended in the period between October 2013 and March 2014. Let’s verify this by taking a look at the earliest close_time.

min(interaction_data$close_time)

## [1] "2013-10-01 06:46:00 CEST"

This supports the hypothesis again. Based on these results it seems wise to filter a subset which contains only the interactions that both started and completed in the period from 1 October 2013 until 31 March 2014. For this we create a new load script by the name load_interaction_data_oct2013_mar2014.R.

library(dplyr)
library(lubridate)
source("data/load_interaction_data.R")
interaction_data <- interaction_data %>% 
  mutate(period = interval(open_time, close_time)) %>%
  mutate(in_range = period %within% interval(dmy(01102013),dmy(01042014))) %>%
  select(-period) %>%
  filter(in_range == TRUE) %>%
  select(-in_range)

Challenge: Calculate for each interaction the number of concurrent interactions (proxy for load on system)

TidyR

The final package we will be discussing in this tutorial is the TidyR package. For detailed information about the package, we refer the interestered reader to the following two nice sources:

In this tutorial we will focus on how we can use spread function to reformat the incident_activity_data. But first, let’s take a look at the loading script for the incident_activity_data (Note that we transformed timestamp into a lubridate Instant type compared with the previous version of this script).

library(dplyr)
incident_activity_data <- tbl_df(read.csv("data/raw/detail_incident_activity.csv", sep=";"))
incident_activity_data <- select(incident_activity_data, 
                                 incident_id = Incident.ID,
                                 timestamp = DateStamp, 
                                 incident_activity_id = IncidentActivity_Number,
                                 incident_activity_type = IncidentActivity_Type, 
                                 assignment_group = Assignment.Group, 
                                 km_number = KM.number, 
                                 interaction_id = Interaction.ID)

incident_activity_data$timestamp <- dmy_hms(incident_activity_data$timestamp, tz = "Europe/Amsterdam")
incident_activity_data$interaction_id[incident_activity_data$interaction_id == "#N/B"] = NA

incident_activity_data <- droplevels(incident_activity_data)

Let’s take a glimpse at the data

glimpse(incident_activity_data)

## Observations: 466737
## Variables:
## $ incident_id            (fctr) IM0000004, IM0000004, IM0000004, IM000...
## $ timestamp              (time) 2013-01-07 08:17:17, 2013-11-04 13:41:...
## $ incident_activity_id   (fctr) 001A3689763, 001A5852941, 001A5852943,...
## $ incident_activity_type (fctr) Reassignment, Reassignment, Update fro...
## $ assignment_group       (fctr) TEAM0001, TEAM0002, TEAM0002, TEAM0003...
## $ km_number              (fctr) KM0000553, KM0000553, KM0000553, KM000...
## $ interaction_id         (fctr) SD0000007, SD0000007, SD0000007, SD000...

As we can see, each record in this table represents one particular activity performed for one specific incident. This representation has both advantages and disadvantages. For example, if we want to know the number of activities for each incident, together with the timestamps of the earliest and latest activities, we can use this representation together with dplyr.

activities_per_incident <- incident_activity_data %>% 
  group_by(incident_id) %>%
  summarize(n_activities = n(), ts_earliest = min(timestamp), ts_latest = max(timestamp)) %>%
  print()

## Source: local data frame [46,616 x 4]
## 
##    incident_id n_activities         ts_earliest           ts_latest
## 1    IM0000004           10 2013-01-07 08:17:17 2013-11-04 13:51:18
## 2    IM0000005           28 2013-01-07 08:17:54 2013-12-02 12:36:26
## 3    IM0000006            2 2014-01-13 15:13:13 2014-01-13 15:13:13
## 4    IM0000011           20 2013-01-07 08:19:20 2013-11-14 09:31:26
## 5    IM0000012            6 2013-02-13 11:57:55 2013-11-08 13:55:53
## 6    IM0000013            6 2013-02-13 12:03:35 2013-11-08 13:54:13
## 7    IM0000014            8 2013-02-13 12:21:01 2013-12-27 10:59:33
## 8    IM0000015           18 2013-02-13 12:31:12 2013-11-08 14:09:10
## 9    IM0000017            6 2013-02-13 12:38:26 2013-11-08 14:02:21
## 10   IM0000018            6 2013-02-27 13:16:20 2013-11-08 14:35:43
## ..         ...          ...                 ...                 ...

Now we can also calculate the actual duration between the first and last activity.

activities_per_incident <- activities_per_incident %>% 
  mutate(timespan = interval(ts_earliest, ts_latest) / dhours(1)) 
activities_per_incident %>% select(incident_id, n_activities, ts_earliest, timespan) %>%print()

## Source: local data frame [46,616 x 4]
## 
##    incident_id n_activities         ts_earliest timespan
## 1    IM0000004           10 2013-01-07 08:17:17 7229.567
## 2    IM0000005           28 2013-01-07 08:17:54 7900.309
## 3    IM0000006            2 2014-01-13 15:13:13    0.000
## 4    IM0000011           20 2013-01-07 08:19:20 7465.202
## 5    IM0000012            6 2013-02-13 11:57:55 6433.966
## 6    IM0000013            6 2013-02-13 12:03:35 6433.844
## 7    IM0000014            8 2013-02-13 12:21:01 7606.642
## 8    IM0000015           18 2013-02-13 12:31:12 6433.633
## 9    IM0000017            6 2013-02-13 12:38:26 6433.399
## 10   IM0000018            6 2013-02-27 13:16:20 6097.323
## ..         ...          ...                 ...      ...

However, we could also represent the incident_activity_data such that each row represents all activities of a single incident. First, let’s have a look at the distribution of activity types.

table(incident_activity_data$incident_activity_type)

## 
##           Affected CI Change                alert stage 1 
##                          185                          113 
##            Analysis/Research                   Assignment 
##                          981                        88502 
##             Callback Request                 Caused By CI 
##                           53                        34382 
##                       Closed  Communication with customer 
##                        50145                         6148 
##    Communication with vendor               Contact Change 
##                         1777                           32 
##           Description Update                      Dial-in 
##                         4501                            2 
##              External update   External Vendor Assignment 
##                         1099                         4354 
## External Vendor Reassignment                Impact Change 
##                           43                         1283 
##        Incident reproduction             Mail to Customer 
##                           23                         3788 
##             Notify By Change                  OO Response 
##                          281                           33 
##                         Open              Operator Update 
##                        46607                        56292 
##               Pending vendor              Problem Closure 
##                         4338                           40 
##           Problem Workaround            Quality Indicator 
##                          140                         2465 
##      Quality Indicator Fixed        Quality Indicator Set 
##                         7791                         1956 
##                 Reassignment                     Referred 
##                        51961                           29 
##                       Reopen                     Resolved 
##                         2428                         1626 
##               Service Change                Status Change 
##                          159                        50914 
##                       Update         Update from customer 
##                        35969                         3906 
##               Urgency Change             Vendor Reference 
##                         1317                          941 
##      Vendor Reference Change 
##                          133

With the spread function of tidyR, we can transform the data set such that each activity type gets its own column with the corresponding timestamp as its value. However, since one incident can contain multiple activities of the same type, spread will run into problems determining the correct timestamp for the activity types. Therefore, we must distinguish between the first occurrence of an activity type and the second occurence. We can solve this with dplyr.

The following code will first group the data by incident_id and incident_activity_type. Next, the dplyr function row_number is used to determine the row number based on the timestamp (please note that this is repeated for each group). Finally, we ungroup the data and order by incident_id and timestamp

incident_activity_data <- incident_activity_data %>%
  group_by(incident_id, incident_activity_type) %>%
  mutate(n_rep = row_number(timestamp)) %>%
  ungroup() %>%
  arrange(incident_id, timestamp)

Let’s have a look at the data

incident_activity_data %>% select(incident_id, incident_activity_type, n_rep, timestamp)

## Source: local data frame [466,737 x 4]
## 
##    incident_id incident_activity_type n_rep           timestamp
## 1    IM0000004           Reassignment     1 2013-01-07 08:17:17
## 2    IM0000004        Operator Update     1 2013-09-25 08:27:40
## 3    IM0000004        Operator Update     2 2013-11-04 12:09:37
## 4    IM0000004             Assignment     1 2013-11-04 12:09:37
## 5    IM0000004           Reassignment     2 2013-11-04 12:09:37
## 6    IM0000004           Reassignment     3 2013-11-04 13:41:30
## 7    IM0000004   Update from customer     1 2013-11-04 13:41:30
## 8    IM0000004             Assignment     2 2013-11-04 13:41:30
## 9    IM0000004                 Closed     1 2013-11-04 13:51:18
## 10   IM0000004           Caused By CI     1 2013-11-04 13:51:18
## ..         ...                    ...   ...                 ...

Next, we will need to combine the columns incident_activity_type and n_rep into a single column. This can be done with the unite function of the tidyr package.

incident_activity_data <- incident_activity_data %>%
  unite(act_type, incident_activity_type, n_rep, sep="_", remove = FALSE)

Let’s take a look at the output

incident_activity_data %>% select(incident_id, act_type, timestamp)

## Source: local data frame [466,737 x 3]
## 
##    incident_id               act_type           timestamp
## 1    IM0000004         Reassignment_1 2013-01-07 08:17:17
## 2    IM0000004      Operator Update_1 2013-09-25 08:27:40
## 3    IM0000004      Operator Update_2 2013-11-04 12:09:37
## 4    IM0000004           Assignment_1 2013-11-04 12:09:37
## 5    IM0000004         Reassignment_2 2013-11-04 12:09:37
## 6    IM0000004         Reassignment_3 2013-11-04 13:41:30
## 7    IM0000004 Update from customer_1 2013-11-04 13:41:30
## 8    IM0000004           Assignment_2 2013-11-04 13:41:30
## 9    IM0000004               Closed_1 2013-11-04 13:51:18
## 10   IM0000004         Caused By CI_1 2013-11-04 13:51:18
## ..         ...                    ...                 ...

Now, we can use spread to create a separate column for each activity type with the corresponding timestamp as its value. But to avoid an explosion of new columns, we are only interested in the first three occurences of each activity. Also, it should be noted that the act of spreading the data, results in R storing the timestamps as doubles instead of dates. We created a helper function to transform this back.

transform_back_to_date <- function(x){
  as.POSIXct(x, origin = '1970-01-01 00:00:00', tz = "Europe/Amsterdam")
  
}

incident_data_details <- incident_activity_data %>%
  filter(n_rep <=3) %>%
  select(incident_id, act_type, timestamp) %>%
  spread(act_type, timestamp) %>%
  mutate_each(funs(transform_back_to_date), -incident_id) %>%
  print()

## Source: local data frame [46,616 x 103]
## 
##    incident_id Affected CI Change_1 alert stage 1_1 alert stage 1_2
## 1    IM0000004                 <NA>            <NA>            <NA>
## 2    IM0000005                 <NA>            <NA>            <NA>
## 3    IM0000006                 <NA>            <NA>            <NA>
## 4    IM0000011                 <NA>            <NA>            <NA>
## 5    IM0000012                 <NA>            <NA>            <NA>
## 6    IM0000013                 <NA>            <NA>            <NA>
## 7    IM0000014                 <NA>            <NA>            <NA>
## 8    IM0000015                 <NA>            <NA>            <NA>
## 9    IM0000017                 <NA>            <NA>            <NA>
## 10   IM0000018                 <NA>            <NA>            <NA>
## ..         ...                  ...             ...             ...
## Variables not shown: alert stage 1_3 (time), Analysis/Research_1 (time),
##   Analysis/Research_2 (time), Analysis/Research_3 (time), Assignment_1
##   (time), Assignment_2 (time), Assignment_3 (time), Callback Request_1
##   (time), Caused By CI_1 (time), Caused By CI_2 (time), Caused By CI_3
##   (time), Closed_1 (time), Closed_2 (time), Closed_3 (time), Communication
##   with customer_1 (time), Communication with customer_2 (time),
##   Communication with customer_3 (time), Communication with vendor_1
##   (time), Communication with vendor_2 (time), Communication with vendor_3
##   (time), Contact Change_1 (time), Contact Change_2 (time), Description
##   Update_1 (time), Description Update_2 (time), Description Update_3
##   (time), Dial-in_1 (time), External update_1 (time), External update_2
##   (time), External update_3 (time), External Vendor Assignment_1 (time),
##   External Vendor Assignment_2 (time), External Vendor Assignment_3
##   (time), External Vendor Reassignment_1 (time), External Vendor
##   Reassignment_2 (time), Impact Change_1 (time), Impact Change_2 (time),
##   Impact Change_3 (time), Incident reproduction_1 (time), Incident
##   reproduction_2 (time), Mail to Customer_1 (time), Mail to Customer_2
##   (time), Mail to Customer_3 (time), Notify By Change_1 (time), Notify By
##   Change_2 (time), Notify By Change_3 (time), OO Response_1 (time), OO
##   Response_2 (time), OO Response_3 (time), Open_1 (time), Open_2 (time),
##   Open_3 (time), Operator Update_1 (time), Operator Update_2 (time),
##   Operator Update_3 (time), Pending vendor_1 (time), Pending vendor_2
##   (time), Pending vendor_3 (time), Problem Closure_1 (time), Problem
##   Closure_2 (time), Problem Workaround_1 (time), Problem Workaround_2
##   (time), Problem Workaround_3 (time), Quality Indicator Fixed_1 (time),
##   Quality Indicator Fixed_2 (time), Quality Indicator Fixed_3 (time),
##   Quality Indicator Set_1 (time), Quality Indicator Set_2 (time), Quality
##   Indicator Set_3 (time), Quality Indicator_1 (time), Quality Indicator_2
##   (time), Quality Indicator_3 (time), Reassignment_1 (time),
##   Reassignment_2 (time), Reassignment_3 (time), Referred_1 (time),
##   Referred_2 (time), Reopen_1 (time), Reopen_2 (time), Reopen_3 (time),
##   Resolved_1 (time), Resolved_2 (time), Service Change_1 (time), Status
##   Change_1 (time), Status Change_2 (time), Status Change_3 (time), Update
##   from customer_1 (time), Update from customer_2 (time), Update from
##   customer_3 (time), Update_1 (time), Update_2 (time), Update_3 (time),
##   Urgency Change_1 (time), Urgency Change_2 (time), Urgency Change_3
##   (time), Vendor Reference Change_1 (time), Vendor Reference Change_2
##   (time), Vendor Reference Change_3 (time), Vendor Reference_1 (time),
##   Vendor Reference_2 (time)

Lubridate and TidyR

B. Depaire

Thursday, April 09, 2015

Lubridate

Using lubridate in data loading scripts

Lubridate and Descriptive statistics

TidyR