A project should contain at least two folders, i.e. ‘data’ and ‘analysis’.
For simplicity and clarity, we suggest that folder and file names use only lowercase letters and that spaces are replaced by underscores (_).
The data folder should contain a subfolder ‘raw’ which will contain the original data files. The data folder itself contains no data files, but only R-scripts that load the appropriate data. The idea is that in our analysis we don’t load the data directly, but call the appropriate loading-script which preprocesses the raw data such that it suits the needs of our analysis. The folder structure of our project should look as follows.
Let’s illustrate this setup with a specific example. We will perform an exploratory analysis on the data of the BPI Challenge 2014.
Create a new project directory, together with the directories ‘data’, ‘raw’ and ‘analysis’.
[In order to follow this tutorial literally, you must make sure that your working directory is set to the project directory. You can use the ‘setwd’ function to do so.]
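The directory layout described above can also be created from R itself. The following is a minimal sketch, assuming a hypothetical project location inside the session's temporary directory (`project_dir` is an illustrative name, not part of the tutorial); replace it with your own path.

```r
# Hypothetical project location; replace with your own path.
project_dir <- file.path(tempdir(), "bpi_challenge_2014")

# 'raw' lives inside 'data'; recursive = TRUE also creates missing parents.
dir.create(file.path(project_dir, "data", "raw"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "analysis"), showWarnings = FALSE)

# Make the project directory the working directory and list its folders.
old_wd <- setwd(project_dir)
project_dirs <- list.dirs(".", full.names = FALSE, recursive = TRUE)
print(project_dirs)

# Restore the previous working directory (only needed for this demo).
setwd(old_wd)
```

In a real project you would stay in the project directory instead of restoring the old one.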
Download the four data sets of the BPI Challenge 2014, together with the Quick reference guide, and store them in the ‘raw’ folder. (Extract the zip files and store the csv files!). Your project layout should look as follows:
We first want to load the interaction data (detail_interaction.csv). To do so, we will probably have to experiment a bit to find the proper preprocessing steps. Therefore, we suggest that you use the console to experiment; once you have figured out the required lines of code, we can create a script.
We notice that the file ends with a ‘.csv’ extension, so we will try to use the ‘read.csv’ function to load the data. Try the following code in the console.
interaction_data <- read.csv("data/raw/detail_interaction.csv")
dim(interaction_data)
## [1] 147004 1
We now see that the data only contains a single column, which is typically an indication that the csv-separator was set incorrectly. By default, read.csv assumes that the comma is the separator. Let’s try to find out what the true separator was by inspecting the first few rows. [Console]
head(interaction_data, 2)
## CI.Name..aff..CI.Type..aff..CI.Subtype..aff..Service.Comp.WBS..aff..Interaction.ID.Status.Impact.Urgency.Priority.Category.KM.number.Open.Time..First.Touch..Close.Time.Closure.Code.First.Call.Resolution.Handle.Time..secs..Related.Incident
## 1 SBA000243;application;Server Based Application;WBS000125;SD0000001;Closed;5;4;4;incident;KM0000987;9-9-2011 9:23;14-2-2014 9:05;Other;N;239;IM0000001
## 2 SUB000443;subapplication;Web Based Application;WBS000125;SD0000002;Closed;4;4;4;request for information;KM0000989;29-9-2011 14:59;13-12-2013 16:27;Software;N;406;IM0000001
This output reveals that the csv file actually uses the semicolon to separate values. We thus have to set the appropriate parameters. [Console]
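Instead of loading the whole file with the wrong separator first, you can also peek at the header line as plain text. The sketch below is only an illustration under assumptions: it uses a small made-up file rather than the real data, and the list of candidate separators is our own choice.

```r
# Toy semicolon-separated file standing in for a raw data file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;status;impact",
             "SD0000001;Closed;5",
             "SD0000002;Closed;4"), tmp)

# Read only the header line as plain text.
header_line <- readLines(tmp, n = 1)

# Count how often each candidate separator occurs in that line.
candidates <- c(comma = ",", semicolon = ";", tab = "\t")
counts <- sapply(candidates, function(sep) {
  lengths(regmatches(header_line, gregexpr(sep, header_line, fixed = TRUE)))
})
print(counts)

# Use the most frequent candidate as the separator.
best_sep <- candidates[[which.max(counts)]]
toy <- read.csv(tmp, sep = best_sep)
dim(toy)
```

Base R also offers ‘read.csv2’, which uses the semicolon as its default separator, but it additionally assumes a decimal comma, so setting ‘sep’ explicitly is the more transparent choice here.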
interaction_data <- read.csv("data/raw/detail_interaction.csv", sep = ";")
dim(interaction_data)
## [1] 147004 17
Ok, that looks more like it. We appear to have a data set of 147004 records and 17 columns. We can now create our loading script and add the first lines of code to actually load the data as a data.frame.
As a naming convention, we will only use lowercase letters and replace spaces with underscores. Furthermore, all loading scripts start with the prefix ‘load’ followed by the name of the data.frame that is returned. In our case, we will name the loading script ‘load_interaction_data.r’. Create this file in the data directory of our project, which is the place where data loading scripts should reside. The current project structure should now look as follows:
This loading script should now contain the following line:
interaction_data <- read.csv("data/raw/detail_interaction.csv", sep = ";")
Next, let’s take a look at the column names. Again, we start by experimenting in our console.
names(interaction_data)
## [1] "CI.Name..aff." "CI.Type..aff."
## [3] "CI.Subtype..aff." "Service.Comp.WBS..aff."
## [5] "Interaction.ID" "Status"
## [7] "Impact" "Urgency"
## [9] "Priority" "Category"
## [11] "KM.number" "Open.Time..First.Touch."
## [13] "Close.Time" "Closure.Code"
## [15] "First.Call.Resolution" "Handle.Time..secs."
## [17] "Related.Incident"
These values appear to be actual column names, so that is good. Please note that sometimes the values in the names attribute of the data will appear to be data values rather than column names. This is often an indication that the original data did not contain column header information, in which case you should set the ‘header’ parameter of the ‘read.csv’ command to FALSE.
The column names do, however, appear to be a bit complex. Therefore, we decide to rename them, again following our naming convention of only lowercase letters and underscores instead of spaces. We first try these steps out in our console.
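The effect of a missing header row can be reproduced on a toy file. The sketch below is only an illustration; the file contents are made up and are not part of the BPI Challenge data.

```r
# Toy file WITHOUT a header row (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(c("SD0000001;Closed;5",
             "SD0000002;Closed;4"), tmp)

# Default header = TRUE: the first record is consumed as column names.
wrong <- read.csv(tmp, sep = ";")
names(wrong)   # data values show up as column names
nrow(wrong)    # one record appears to be missing

# header = FALSE: every line is treated as data.
right <- read.csv(tmp, sep = ";", header = FALSE)
names(right)   # generic names V1, V2, V3
nrow(right)
```

With ‘header = FALSE’, R assigns generic column names, which you can then replace with meaningful ones.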
names(interaction_data) <- c("ci_name", "ci_type", "ci_subtype", "service_component",
"interaction_id", "status", "impact", "urgency", "priority",
"category", "km_number", "open_time", "close_time", "closure_code",
"first_call_resolution", "handle_time", "related_incident")
names(interaction_data)
## [1] "ci_name" "ci_type"
## [3] "ci_subtype" "service_component"
## [5] "interaction_id" "status"
## [7] "impact" "urgency"
## [9] "priority" "category"
## [11] "km_number" "open_time"
## [13] "close_time" "closure_code"
## [15] "first_call_resolution" "handle_time"
## [17] "related_incident"
The column names appear to be changed correctly, so we add this step to the loading script ‘load_interaction_data.r’, which should now hold the following code:
interaction_data <- read.csv("data/raw/detail_interaction.csv", sep = ";")
names(interaction_data) <- c("ci_name", "ci_type", "ci_subtype", "service_component",
"interaction_id", "status", "impact", "urgency", "priority",
"category", "km_number", "open_time", "close_time", "closure_code",
"first_call_resolution", "handle_time", "related_incident")
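Next, we need to see if the columns are of the appropriate data type. Again, use the console to experiment! (Note that the output below shows text columns as factors, which is what ‘read.csv’ produced by default before R 4.0.0. In later versions, text columns are read as character vectors unless you pass ‘stringsAsFactors = TRUE’ to ‘read.csv’.)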
str(interaction_data)
## 'data.frame': 147004 obs. of 17 variables:
## $ ci_name : Factor w/ 4153 levels "#N/B","ACS000001",..: 3385 3803 1905 1730 3647 3791 3824 1789 107 3372 ...
## $ ci_type : Factor w/ 14 levels "#N/B","application",..: 2 14 4 2 2 14 14 2 4 2 ...
## $ ci_subtype : Factor w/ 67 levels "#N/B","Application Server",..: 48 61 22 12 48 61 61 12 4 48 ...
## $ service_component : Factor w/ 289 levels "WBS000001","WBS000002",..: 113 113 169 224 50 67 147 83 134 112 ...
## $ interaction_id : Factor w/ 147004 levels "SD0000001","SD0000002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ status : Factor w/ 2 levels "Closed","Open - Linked": 1 1 1 1 1 1 1 1 1 1 ...
## $ impact : int 5 4 4 4 4 4 4 3 2 3 ...
## $ urgency : Factor w/ 6 levels "1","2","3","4",..: 4 4 4 4 4 4 4 3 2 3 ...
## $ priority : int 4 4 4 4 4 4 4 3 2 3 ...
## $ category : Factor w/ 6 levels "complaint","incident",..: 2 5 2 2 2 2 2 2 2 2 ...
## $ km_number : Factor w/ 2360 levels "KM0000001","KM0000002",..: 983 985 317 57 649 699 550 984 132 488 ...
## $ open_time : Factor w/ 65848 levels "1-10-2012 10:44",..: 65830 46724 8950 1017 33146 15042 56577 65819 14951 41516 ...
## $ close_time : Factor w/ 64727 levels "1-1-2014 15:35",..: 12700 9595 28509 28510 28510 28511 52545 28511 28512 16351 ...
## $ closure_code : Factor w/ 25 levels "","Auto Closed",..: 14 19 19 23 19 14 14 19 5 14 ...
## $ first_call_resolution: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ handle_time : int 239 406 738 787 459 412 363 374 272 295 ...
## $ related_incident : Factor w/ 46089 levels "","#MULTIVALUE",..: 3 3 1 1 5 1 6 1 1 1 ...
Most of them appear to be correct, but not all. For example, open_time and close_time are stored as factors, while they should be datetime values. However, as we will deal with datetime objects in a later tutorial, we will leave them as factors for the time being.
The columns impact, urgency and priority also appear to be coded inconsistently. While impact and priority are coded as integers, urgency is coded as a factor. This is particularly remarkable since the levels revealed by the ‘str’ function all appear to be integers. Let’s have a closer look at the levels of the urgency column. [Console]
levels(interaction_data$urgency)
## [1] "1" "2" "3" "4"
## [5] "5" "5 - Very Low"
It appears that level 5 occurs twice, as ‘5’ and ‘5 - Very Low’. This might be a coding error in the original data, but to be certain let’s take a look at the frequency of each level. [Console]
table(interaction_data$urgency)
##
## 1 2 3 4 5
## 32 950 16074 76645 53302
## 5 - Very Low
## 1
Indeed, the level ‘5 - Very Low’ only appears once, indicating that this probably is a coding error in the original data. Let’s try to correct this in our console first.
levels(interaction_data$urgency)
## [1] "1" "2" "3" "4"
## [5] "5" "5 - Very Low"
levels(interaction_data$urgency) <- c("1", "2", "3", "4", "5", "5")
str(interaction_data$urgency)
## Factor w/ 5 levels "1","2","3","4",..: 4 4 4 4 4 4 4 3 2 3 ...
We now have a factor with 5 levels, so our code appears to work. The next question, however, is whether impact, urgency and priority should be coded the same way and, if so, as what. There obviously is some kind of order in the levels, so an ordered factor would be more appropriate than a plain factor. One could argue that all three should be coded as integers, but this assumes that, e.g., the difference in priority between a priority 1 and a priority 2 interaction is the same as the difference between a priority 4 and a priority 5 interaction. We prefer not to make this assumption and code the data as ordered factors. [Console]
interaction_data$impact <- as.ordered(interaction_data$impact)
interaction_data$urgency <- as.ordered(interaction_data$urgency)
interaction_data$priority <- as.ordered(interaction_data$priority)
str(interaction_data)
## 'data.frame': 147004 obs. of 17 variables:
## $ ci_name : Factor w/ 4153 levels "#N/B","ACS000001",..: 3385 3803 1905 1730 3647 3791 3824 1789 107 3372 ...
## $ ci_type : Factor w/ 14 levels "#N/B","application",..: 2 14 4 2 2 14 14 2 4 2 ...
## $ ci_subtype : Factor w/ 67 levels "#N/B","Application Server",..: 48 61 22 12 48 61 61 12 4 48 ...
## $ service_component : Factor w/ 289 levels "WBS000001","WBS000002",..: 113 113 169 224 50 67 147 83 134 112 ...
## $ interaction_id : Factor w/ 147004 levels "SD0000001","SD0000002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ status : Factor w/ 2 levels "Closed","Open - Linked": 1 1 1 1 1 1 1 1 1 1 ...
## $ impact : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 4 4 4 4 4 4 3 2 3 ...
## $ urgency : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 4 4 4 4 4 4 3 2 3 ...
## $ priority : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 4 4 4 4 4 4 3 2 3 ...
## $ category : Factor w/ 6 levels "complaint","incident",..: 2 5 2 2 2 2 2 2 2 2 ...
## $ km_number : Factor w/ 2360 levels "KM0000001","KM0000002",..: 983 985 317 57 649 699 550 984 132 488 ...
## $ open_time : Factor w/ 65848 levels "1-10-2012 10:44",..: 65830 46724 8950 1017 33146 15042 56577 65819 14951 41516 ...
## $ close_time : Factor w/ 64727 levels "1-1-2014 15:35",..: 12700 9595 28509 28510 28510 28511 52545 28511 28512 16351 ...
## $ closure_code : Factor w/ 25 levels "","Auto Closed",..: 14 19 19 23 19 14 14 19 5 14 ...
## $ first_call_resolution: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ handle_time : int 239 406 738 787 459 412 363 374 272 295 ...
## $ related_incident : Factor w/ 46089 levels "","#MULTIVALUE",..: 3 3 1 1 5 1 6 1 1 1 ...
With these steps we can finalize our loading script ‘load_interaction_data.r’ for now. The content of the script should now read as follows:
interaction_data <- read.csv("data/raw/detail_interaction.csv", sep = ";")
names(interaction_data) <- c("ci_name", "ci_type", "ci_subtype", "service_component",
"interaction_id", "status", "impact", "urgency", "priority",
"category", "km_number", "open_time", "close_time", "closure_code",
"first_call_resolution", "handle_time", "related_incident")
levels(interaction_data$urgency) <- c("1", "2", "3", "4", "5", "5")
interaction_data$impact <- as.ordered(interaction_data$impact)
interaction_data$urgency <- as.ordered(interaction_data$urgency)
interaction_data$priority <- as.ordered(interaction_data$priority)
Now, if we want to perform analysis we only have to call the appropriate data loading script which takes care of all the required preprocessing.
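The preprocessing steps can be tried end to end on a toy file. The following is a sketch under assumptions, not the tutorial's script: the file contents are made up, ‘stringsAsFactors = TRUE’ is passed explicitly so that text columns become factors on any R version, and the stray level is merged by name instead of by position (a variant of the positional assignment used above).

```r
# Toy stand-in for the raw file (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Interaction.ID;Impact;Urgency",
             "SD0000001;5;4",
             "SD0000002;4;5",
             "SD0000003;3;5 - Very Low"), tmp)

# stringsAsFactors = TRUE makes text columns factors on any R version.
toy <- read.csv(tmp, sep = ";", stringsAsFactors = TRUE)
names(toy) <- c("interaction_id", "impact", "urgency")

# Merge the stray level into "5" by name rather than by position.
levels(toy$urgency)[levels(toy$urgency) == "5 - Very Low"] <- "5"

# Code both columns as ordered factors.
toy$impact  <- as.ordered(toy$impact)
toy$urgency <- as.ordered(toy$urgency)

str(toy)

# Ordered factors support order comparisons against a level.
toy$urgency >= "5"
```

Merging by name keeps working if the order or number of levels changes, whereas the positional version relies on the levels appearing exactly as inspected in the console.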
Once we have set up the data for the project, we can start doing the analysis. For each analysis of the data (e.g. a univariate analysis of a specific data set, a visual analysis of a particular subset of the data, an analysis of a specific peculiarity in the data, …), a new RMarkdown file is created.
RMarkdown is an authoring format that combines the core syntax of markdown (an easy-to-write markup language) with the power of R. It allows you to integrate R-code in your document, which is evaluated when the document is knitted (compiled). For more information, please take a look at the website on rmarkdown. For a tutorial on the markdown language, take a look at this site.
Obviously, you should give your analysis document a meaningful name such that you do not always need to open the analysis file to know what it is about. Try to stick to the naming convention of all lowercase letters and underscores (_) instead of spaces. Let’s create a first analysis document which performs a univariate analysis of our interaction data. Create a markdown file named ‘univariate_analysis_interaction_data.Rmd’ in the ‘analysis’ directory. Your project structure should now look as follows:
When you create a new RMarkdown file, you should start by adding a metadata section at the top, containing information about the final output format, the author, the creation date, the document’s title, … (RStudio creates this section automatically for you). Your analysis document should contain the following information (or something similar):
---
title: "Univariate Analysis of Interaction Data"
author: "B. Depaire"
date: "Tuesday, March 10, 2015"
output: html_document
---
An R Markdown file consists of two types of input: regular markdown code and R code. Regular markdown code is regular text with some additional markup to influence the layout of the document. R code is typically added in an R chunk. Such an R chunk is a block of R code enclosed between the following two lines (Tip: You can use the key-combo ctrl+alt+i or AltGr+i in RStudio to insert an R chunk):
```{r}
R Code
```
In a data project, we use the convention to make all references to other files (data files or scripts) by means of relative paths starting from the project root directory. This allows us to copy data projects to other locations or other computers without breaking the code. For this to work, it is essential that a markdown file sets the project root directory as its working directory (or root directory).
By default, markdown files use the directory they are in as their working directory. Therefore, you should start any markdown file in your data project with an R chunk to set the appropriate root directory. Note that you should set this in a relative way (starting from where the markdown file actually resides). In this case, the markdown file resides in the ‘analysis’ directory, which is a direct child of the project root directory. Therefore, we use the relative path ‘..’ (i.e. the parent directory of the current directory) to identify the project root directory. Your markdown file should now look as follows:
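To see what the relative path ‘..’ resolves to, you can experiment with a throwaway directory. The sketch below is only an illustration; the directory names are hypothetical and created inside the session's temporary directory.

```r
# Hypothetical project with an 'analysis' subdirectory.
project_dir  <- file.path(tempdir(), "path_demo")
analysis_dir <- file.path(project_dir, "analysis")
dir.create(analysis_dir, recursive = TRUE, showWarnings = FALSE)

# '..' relative to the analysis directory points at its parent.
resolved_root <- normalizePath(file.path(analysis_dir, ".."))

# The resolved path is the project root itself.
identical(resolved_root, normalizePath(project_dir))
```

Because the path is relative to the location of the markdown file, the project can be moved to another location without changing the analysis documents.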
---
title: "Univariate Analysis of Interaction Data"
author: "B. Depaire"
date: "Tuesday, March 10, 2015"
output: html_document
---
```{r}
require(knitr)
opts_knit$set(root.dir = "..")
```
The main purpose of using RMarkdown files instead of plain R scripts is that you can discuss your analysis alongside the code. So, let’s start by adding a title and some information about the data set.
---
title: "Univariate Analysis of Interaction Data"
author: "B. Depaire"
date: "Tuesday, March 10, 2015"
output: html_document
---
```{r}
require(knitr)
opts_knit$set(root.dir = "..")
```
## Interaction Data
This data set tracks information on calls made to IT support about experienced problems.
Next, let’s add an R Chunk to load the appropriate load script for interaction data. Note in the code below how we use the ‘source’ function to call the appropriate loading script! This function actually imports the code of the script into our analysis document. We also added a line of R code which calls the ‘summary’ function on the data.
Now you can compile your markdown file (called knitting) and you will notice that R Markdown inserted the R output (of the summary function) automatically.
---
title: "Univariate Analysis of Interaction Data"
author: "B. Depaire"
date: "Tuesday, March 10, 2015"
output: html_document
---
```{r}
require(knitr)
opts_knit$set(root.dir = "..")
```
## Interaction Data
This data set tracks information on calls made to IT support about experienced problems.
```{r}
source("data/load_interaction_data.r")
summary(interaction_data)
```
(Tip: If you want to execute the code of an R chunk in your console, you can use the key-combo ‘ctrl+alt+c’ or ‘AltGr+c’ to execute the current R chunk.)
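If you want to see what ‘source’ does without knitting a document, you can try it on a throwaway script. The following is a minimal sketch under assumptions: the script and the ‘toy_data’ object are made up for illustration and stand in for a real loading script.

```r
# Write a miniature loading script to a temporary file (hypothetical content).
script_path <- tempfile(fileext = ".r")
writeLines(c(
  'toy_data <- data.frame(urgency = c("4", "5", "3"),',
  '                       handle_time = c(239L, 406L, 738L))',
  'toy_data$urgency <- as.ordered(toy_data$urgency)'
), script_path)

# source() runs the script; the objects it creates become available afterwards.
source(script_path)

exists("toy_data")
summary(toy_data)
```

The analysis document only needs to know the path of the loading script and the name of the data.frame it creates; all preprocessing details stay in one place.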