MATH2405 TP3, 2020

Setup

Loading the necessary packages to reproduce the report here:

library(readr) # Useful for importing data
library(plyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

Data Location

This report uses the Incident management process enriched event log data set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/).

URL: https://archive.ics.uci.edu/ml/datasets/Incident+management+process+enriched+event+log

Abstract
This event log was extracted from data gathered from the audit system of an instance of the ServiceNow platform used by an IT company and enriched with data loaded from a relational database.

Source
Claudio Aparecido Lira do Amaral, claudio.amaral at usp.br, University of São Paulo, Brazil
Marcelo Fantinato, m.fantinato at usp.br, University of São Paulo, Brazil
Sarajane Marques Peres, sarajane at usp.br, University of São Paulo, Brazil

Read/Import Data

Read/Import the data into R, then save it as a data frame incidents.

# Import the data from the current working directory.
dir <- "."
setwd(dir)
# stringsAsFactors = TRUE reproduces the factor columns shown by str() on R >= 4.0.
incidents <- read.csv("incident_event_log.csv", stringsAsFactors = TRUE)
head(incidents)

Assign the current directory path to the variable dir.
Set the working directory to the current directory.
Read the CSV file into the data frame incidents and display its first few rows.

Data description

URL : https://archive.ics.uci.edu/ml/datasets/Incident+management+process+enriched+event+log

This is an event log of an incident management process extracted from data gathered from the audit system of an instance of the ServiceNow™ platform used by an IT company. The event log is enriched with data loaded from a relational database underlying a corresponding process-aware information system. Information was anonymized for privacy.

Number of instances: 141,712 events (24,918 incidents)
Number of attributes: 36 (1 case identifier, 1 state identifier, 32 descriptive attributes, 2 dependent variables)

The attribute closed_at is used to determine the dependent variable for the time completion prediction task. The attribute resolved_at is highly correlated with closed_at. In this event log, some rows may have identical values, since not all attributes involved in the real-world process are present in the log.

Attributes used to record textual information are not placed in this log.

The missing values should be considered unknown information.
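The str() output later in this report lists "?" as a level in many columns, which suggests that missing values are written as "?" in the CSV file. A minimal sketch, assuming that encoding, of importing them as NA. The two-row CSV text and its values are made up for illustration and are not taken from the data set.

```r
# Hypothetical two-row extract; only the "?" convention mirrors the real file.
csv_text <- "number,caller_id,resolved_at
INC0000045,Caller 2403,29/2/2016 11:29
INC0000047,?,?"

# na.strings tells read.csv which strings to treat as missing values.
toy <- read.csv(text = csv_text, na.strings = "?", stringsAsFactors = FALSE)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Passing the same na.strings argument when reading incident_event_log.csv should have the same effect, if the file does use "?" throughout.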

Attribute Information:

Attribute	Description
1. number	incident identifier (24,918 different values)
2. incident_state	eight levels controlling the incident management process transitions from opening until closing the case
3. active	boolean attribute that shows whether the record is active or closed/canceled
4. reassignment_count	number of times the incident has the group or the support analysts changed
5. reopen_count	number of times the incident resolution was rejected by the caller
6. sys_mod_count	number of incident updates until that moment
7. made_sla	boolean attribute that shows whether the incident exceeded the target SLA
8. caller_id	identifier of the user affected
9. opened_by	identifier of the user who reported the incident
10. opened_at	incident user opening date and time
11. sys_created_by	identifier of the user who registered the incident
12. sys_created_at	incident system creation date and time
13. sys_updated_by	identifier of the user who updated the incident and generated the current log record
14. sys_updated_at	incident system update date and time
15. contact_type	categorical attribute that shows by what means the incident was reported
16. location	identifier of the location of the place affected
17. category	first-level description of the affected service
18. subcategory	second-level description of the affected service (related to the first level description, i.e., to category)
19. u_symptom	description of the user perception about service availability
20. cmdb_ci	(configuration item) identifier used to report the affected item (not mandatory)
21. impact	description of the impact caused by the incident (values: 1.High; 2.Medium; 3.Low)
22. urgency	description of the urgency informed by the user for the incident resolution (values: 1.High; 2.Medium; 3.Low)
23. priority	calculated by the system based on ‘impact’ and ‘urgency’
24. assignment_group	identifier of the support group in charge of the incident
25. assigned_to	identifier of the user in charge of the incident
26. knowledge	boolean attribute that shows whether a knowledge base document was used to resolve the incident
27. u_priority_confirmation	boolean attribute that shows whether the priority field has been double-checked
28. notify	categorical attribute that shows whether notifications were generated for the incident
29. problem_id	identifier of the problem associated with the incident
30. rfc	(request for change) identifier of the change request associated with the incident
31. vendor	identifier of the vendor in charge of the incident
32. caused_by	identifier of the RFC responsible for the incident
33. closed_code	identifier of the resolution of the incident
34. resolved_by	identifier of the user who resolved the incident
35. resolved_at	incident user resolution date and time (dependent variable)
36. closed_at	incident user close date and time (dependent variable).

Inspect dataset and variables

Check the dimensions of the data frame.

dim(incidents)

## [1] 141712     36

Check the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.

str(incidents)

## 'data.frame':    141712 obs. of  36 variables:
##  $ number                 : Factor w/ 24918 levels "INC0000045","INC0000047",..: 1 1 1 1 2 2 2 2 2 2 ...
##  $ incident_state         : Factor w/ 9 levels "-100","Active",..: 8 9 9 7 8 2 2 2 2 2 ...
##  $ active                 : Factor w/ 2 levels "false","true": 2 2 2 1 2 2 2 2 2 2 ...
##  $ reassignment_count     : int  0 0 0 0 0 1 1 1 1 1 ...
##  $ reopen_count           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sys_mod_count          : int  0 2 3 4 0 1 2 3 4 5 ...
##  $ made_sla               : Factor w/ 2 levels "false","true": 2 2 2 2 2 2 2 2 2 2 ...
##  $ caller_id              : Factor w/ 5245 levels "?","Caller 10",..: 1464 1464 1464 1464 1464 1464 1464 1464 1464 1464 ...
##  $ opened_by              : Factor w/ 208 levels "?","Opened by  10",..: 202 202 202 202 122 122 122 122 122 122 ...
##  $ opened_at              : Factor w/ 19849 levels "1/1/2017 01:14",..: 12991 12991 12991 12991 12992 12992 12992 12992 12992 12992 ...
##  $ sys_created_by         : Factor w/ 186 levels "?","Created by 1",..: 153 153 153 153 60 60 60 60 60 60 ...
##  $ sys_created_at         : Factor w/ 11553 levels "?","1/1/2017 02:15",..: 7418 7418 7418 7418 7419 7419 7419 7419 7419 7419 ...
##  $ sys_updated_by         : Factor w/ 846 levels "Updated by 1",..: 105 510 659 763 606 105 105 659 565 222 ...
##  $ sys_updated_at         : Factor w/ 50664 levels "1/1/2017 01:14",..: 33113 33151 33242 41839 33114 33115 33116 33243 33244 176 ...
##  $ contact_type           : Factor w/ 5 levels "Direct opening",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ location               : Factor w/ 225 levels "?","Location 10",..: 45 45 45 45 64 64 64 64 64 64 ...
##  $ category               : Factor w/ 59 levels "?","Category 10",..: 48 48 48 48 32 32 32 32 32 32 ...
##  $ subcategory            : Factor w/ 255 levels "?","Subcategory 10",..: 71 71 71 71 114 114 114 114 114 114 ...
##  $ u_symptom              : Factor w/ 526 levels "?","Symptom 10",..: 503 503 503 503 354 354 354 354 354 354 ...
##  $ cmdb_ci                : Factor w/ 51 levels "?","cmdb_ci 10",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ impact                 : Factor w/ 3 levels "1 - High","2 - Medium",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ urgency                : Factor w/ 3 levels "1 - High","2 - Medium",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ priority               : Factor w/ 4 levels "1 - Critical",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ assignment_group       : Factor w/ 79 levels "?","Group 10",..: 49 49 49 49 65 17 17 17 17 17 ...
##  $ assigned_to            : Factor w/ 235 levels "?","Resolver 10",..: 1 1 1 1 225 169 169 169 169 169 ...
##  $ knowledge              : Factor w/ 2 levels "false","true": 2 2 2 2 2 2 2 2 2 2 ...
##  $ u_priority_confirmation: Factor w/ 2 levels "false","true": 1 1 1 1 1 1 1 1 1 1 ...
##  $ notify                 : Factor w/ 2 levels "Do Not Notify",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ problem_id             : Factor w/ 253 levels "?","Problem ID  10",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ rfc                    : Factor w/ 182 levels "?","CHG0000047",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ vendor                 : Factor w/ 5 levels "?","code 8s",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ caused_by              : Factor w/ 4 levels "?","CHG0000097",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ closed_code            : Factor w/ 18 levels "?","code 1","code 10",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ resolved_by            : Factor w/ 217 levels "?","Resolved by 10",..: 53 53 53 53 198 198 198 198 198 198 ...
##  $ resolved_at            : Factor w/ 18506 levels "?","1/1/2017 01:17",..: 12246 12246 12246 12246 42 42 42 42 42 42 ...
##  $ closed_at              : Factor w/ 2707 levels "1/11/2016 15:00",..: 2210 2210 2210 2210 2318 2318 2318 2318 2318 2318 ...

# active, made_sla, knowledge and u_priority_confirmation hold true/false values, so convert them from factor to logical
levels(incidents$active) <- c(FALSE,TRUE)
levels(incidents$made_sla) <- c(FALSE,TRUE)
levels(incidents$knowledge) <- c(FALSE,TRUE)
levels(incidents$u_priority_confirmation) <- c(FALSE,TRUE)

incidents$active <- as.logical(incidents$active)
incidents$made_sla <- as.logical(incidents$made_sla)
incidents$knowledge <- as.logical(incidents$knowledge)
incidents$u_priority_confirmation <- as.logical(incidents$u_priority_confirmation)



str(incidents$active)

##  logi [1:141712] TRUE TRUE TRUE FALSE TRUE TRUE ...

str(incidents$made_sla)

##  logi [1:141712] TRUE TRUE TRUE TRUE TRUE TRUE ...

str(incidents$knowledge)

##  logi [1:141712] TRUE TRUE TRUE TRUE TRUE TRUE ...

str(incidents$u_priority_confirmation)

##  logi [1:141712] FALSE FALSE FALSE FALSE FALSE FALSE ...
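The two-step conversion above (relabel the levels, then call as.logical) can also be written as a single comparison. A minimal sketch on a made-up factor, not the real column:

```r
# Hypothetical factor with the same two levels as the real flag columns.
active <- factor(c("true", "true", "false"))

# Comparing a factor with a string compares its labels and returns a logical vector.
active_lgl <- active == "true"
str(active_lgl)
```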

# number and the other identifier and timestamp columns hold text, so convert them from factor to character
incidents$number <- as.character(incidents$number)
incidents$caller_id <- as.character(incidents$caller_id)
incidents$opened_by <- as.character(incidents$opened_by)
incidents$opened_at <- as.character(incidents$opened_at)
incidents$sys_created_by <- as.character(incidents$sys_created_by)
incidents$sys_created_at <- as.character(incidents$sys_created_at)
incidents$sys_updated_by <- as.character(incidents$sys_updated_by)
incidents$sys_updated_at <- as.character(incidents$sys_updated_at)
incidents$location <- as.character(incidents$location)
incidents$u_symptom <- as.character(incidents$u_symptom)
incidents$cmdb_ci <- as.character(incidents$cmdb_ci)
incidents$problem_id <- as.character(incidents$problem_id)
incidents$rfc <- as.character(incidents$rfc)
incidents$vendor <- as.character(incidents$vendor)
incidents$resolved_by <- as.character(incidents$resolved_by)
incidents$resolved_at <- as.character(incidents$resolved_at)
incidents$closed_at <- as.character(incidents$closed_at)
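The repeated as.character() calls can be written more compactly by converting a vector of column names in one step. A minimal sketch on a made-up data frame; the column values are hypothetical and id_cols is an illustrative name:

```r
# Hypothetical toy data frame with factor columns, standing in for `incidents`.
toy <- data.frame(
  number    = factor(c("INC1", "INC1", "INC2")),
  caller_id = factor(c("Caller 1", "Caller 1", "Caller 2")),
  priority  = factor(c("3 - Moderate", "3 - Moderate", "2 - High"))
)

# Columns that are identifiers rather than categories.
id_cols <- c("number", "caller_id")

# Convert all of them at once; the other columns are left untouched.
toy[id_cols] <- lapply(toy[id_cols], as.character)

str(toy)
```

Listing the columns once also makes it easier to spot a column that was converted twice or left out.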

Rename the factor level "-100" in incident_state.

levels(incidents$incident_state)

## [1] "-100"               "Active"             "Awaiting Evidence" 
## [4] "Awaiting Problem"   "Awaiting User Info" "Awaiting Vendor"   
## [7] "Closed"             "New"                "Resolved"

#invisible(revalue(incidents$incident_state,c("-100" = "Undefine")))

levels(incidents$incident_state)[levels(incidents$incident_state)=="-100"] <- "Undefine"

levels(incidents$incident_state)

## [1] "Undefine"           "Active"             "Awaiting Evidence" 
## [4] "Awaiting Problem"   "Awaiting User Info" "Awaiting Vendor"   
## [7] "Closed"             "New"                "Resolved"

Check the column names in the data frame and rename them if required.

names(incidents)[names(incidents)=="number"] <- "incident_id"
names(incidents)

##  [1] "incident_id"             "incident_state"         
##  [3] "active"                  "reassignment_count"     
##  [5] "reopen_count"            "sys_mod_count"          
##  [7] "made_sla"                "caller_id"              
##  [9] "opened_by"               "opened_at"              
## [11] "sys_created_by"          "sys_created_at"         
## [13] "sys_updated_by"          "sys_updated_at"         
## [15] "contact_type"            "location"               
## [17] "category"                "subcategory"            
## [19] "u_symptom"               "cmdb_ci"                
## [21] "impact"                  "urgency"                
## [23] "priority"                "assignment_group"       
## [25] "assigned_to"             "knowledge"              
## [27] "u_priority_confirmation" "notify"                 
## [29] "problem_id"              "rfc"                    
## [31] "vendor"                  "caused_by"              
## [33] "closed_code"             "resolved_by"            
## [35] "resolved_at"             "closed_at"

Provide your R codes with outputs and explain everything that you do in this step.

Tidy data

Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format. If the data is in a tidy format, you will be expected to explain why the data is originally ‘tidy’.

The data set is already tidy. Each variable has its own column, and no variable is spread across several columns. Each observation (one logged event) has its own row. Each value has its own cell.

# This is a chunk where you check if the data conforms the tidy data principles and reshape your data into a tidy format.

Summary statistics

Provide summary statistics (mean, median, minimum, maximum, standard deviation) of numeric variables grouped by one of the qualitative (categorical) variable. For example, if your categorical variable is age groups and quantitative variable is income, provide summary statistics of income grouped by the age groups.

# This is a chunk where you provide summary statistics

incidents %>%
  group_by(incident_state) %>%   # refer to the column by name inside the pipe
  summarize(Mean = mean(sys_mod_count),
            Median = median(sys_mod_count),
            Minimum = min(sys_mod_count),
            Maximum = max(sys_mod_count),
            SD = sd(sys_mod_count))

Create a list

Create a list that contains a numeric value for each response to the categorical variable. Typically, they are numbered from 1-n.

# This is a chunk where you create a list
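One possible way to build such a lookup, sketched on a made-up factor rather than the real incident_state column. It is stored as a data frame (which is itself a list of columns) so that it can be joined later; state_codes and state_code are illustrative names.

```r
# Hypothetical subset of incident_state values, standing in for the real column.
incident_state <- factor(c("New", "Active", "Resolved", "Closed", "Active"))

# One numeric code (1..n) per distinct level, in level order.
state_codes <- data.frame(
  incident_state = levels(incident_state),
  state_code     = seq_along(levels(incident_state)),
  stringsAsFactors = FALSE
)
print(state_codes)
```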

Join the list

Join this list on using a join of your choice. Remember that this has to keep the numeric variable, as well as matching to your categorical variable.

# This is a chunk where you join the list
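A minimal sketch of such a join using dplyr's left_join(), which keeps every row of the left table. Both data frames are made up for illustration; with the real data the left table would be incidents.

```r
library(dplyr)

# Hypothetical event rows standing in for `incidents`.
events <- data.frame(
  incident_state = c("New", "Active", "Active", "Closed"),
  sys_mod_count  = c(0L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one numeric code per state.
state_codes <- data.frame(
  incident_state = c("Active", "Closed", "New"),
  state_code     = 1:3,
  stringsAsFactors = FALSE
)

# left_join keeps every event row and adds the matching code.
joined <- left_join(events, state_codes, by = "incident_state")
print(joined)
```

A left join is a reasonable choice here because no event row is dropped, even if a state is missing from the lookup table; such rows would simply get NA as their code.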

Subsetting I

Subset the data frame using first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure.

# This is a chunk to subset your data and convert it to a matrix
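A minimal sketch of this step on a made-up mixed-type data frame. A matrix can hold only one type, so when any column is non-numeric as.matrix() coerces everything to character, which is the expected result for incidents as well.

```r
# Hypothetical mixed-type data frame standing in for `incidents`.
toy <- data.frame(
  incident_id   = paste0("INC", 1:12),
  sys_mod_count = 0:11,
  active        = rep(c(TRUE, FALSE), 6),
  stringsAsFactors = FALSE
)

# First 10 observations, all variables.
first10 <- toy[1:10, ]

# Mixed column types are coerced to a single type: character.
m <- as.matrix(first10)
str(m)
```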

Subsetting II

Subset the data frame including only first and the last variable in the data set, save it as an R object file (.RData). Provide the R codes with outputs and explain everything that you do in this step.

# This is a chunk to subset your data and convert it to an R object file
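A minimal sketch of this step on a made-up data frame. The file name and the use of a temporary directory are illustrative choices, not part of the assignment.

```r
# Hypothetical data frame standing in for `incidents`.
toy <- data.frame(
  incident_id = c("INC1", "INC2"),
  active      = c(TRUE, FALSE),
  closed_at   = c("5/3/2016 12:00", "6/3/2016 10:00"),
  stringsAsFactors = FALSE
)

# Keep only the first and the last column.
first_last <- toy[, c(1, ncol(toy))]

# Save the object to an .RData file in the session's temporary directory.
out_file <- file.path(tempdir(), "first_last.RData")
save(first_last, file = out_file)

# Reload into a separate environment to confirm the round trip.
check_env <- new.env()
load(out_file, envir = check_env)
print(names(check_env$first_last))
```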

Create a new Data Frame

Create a data frame with 2 variables. Your data frame has to contain one integer variable and one ordinal variable.

The ordinal variable has to be a factor and ordered properly. Make sure you name your variables.
Show the structure of your variables and the levels of the ordinal variable.
Create another numeric vector and use cbind() to add this vector to your data frame.
After this step you should have 3 variables in the data frame.
Check the attributes and the dimension of your new data frame.
Provide the R codes with outputs and explain everything that you do in this step.

# This is a chunk to create a new data frame with the given specifications
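A minimal sketch that follows the steps listed above. All names and values (tickets, ticket_count, priority, mean_hours) are made up for illustration.

```r
# Data frame with one integer variable and one ordered factor.
tickets <- data.frame(
  ticket_count = c(12L, 5L, 30L),
  priority     = factor(c("Low", "High", "Medium"),
                        levels = c("Low", "Medium", "High"),
                        ordered = TRUE)
)

# Structure of the variables and the level order of the ordinal variable.
str(tickets)
levels(tickets$priority)

# Add a third, numeric variable with cbind(); the column takes the vector's name.
mean_hours <- c(48.5, 2.0, 12.25)
tickets <- cbind(tickets, mean_hours)

# Attributes (names, class, row names) and dimensions of the result.
attributes(tickets)
dim(tickets)
```

Setting levels explicitly matters here: without it, the factor levels would be sorted alphabetically (High, Low, Medium) rather than by rank.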

Create another Data Frame

Create another data frame with a common variable to the dataset created in step 11.

Join the data frame to the dataset above, and ensure that the dataset is joined properly.
Ensuring the new categorical variable is carried to the larger dataset. Eg. A dataset to join could be State, Abbreviation, Municipality, Prevailing Religion.
Provide the R codes with outputs and explain everything that you do in this step.

# This is a chunk to create another data frame with the given specifications
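A minimal sketch of this step, assuming priority as the common variable. Both data frames and the sla_target values are made up for illustration.

```r
library(dplyr)

# Hypothetical first data frame.
tickets <- data.frame(
  priority     = c("Low", "High", "Medium"),
  ticket_count = c(12L, 5L, 30L),
  stringsAsFactors = FALSE
)

# Second data frame sharing the `priority` key and adding a categorical variable.
targets <- data.frame(
  priority   = c("Low", "Medium", "High"),
  sla_target = c("5 days", "2 days", "4 hours"),
  stringsAsFactors = FALSE
)

# Join on the common variable; every row of `tickets` is kept.
combined <- left_join(tickets, targets, by = "priority")
print(combined)
```

Checking that the row count is unchanged and that the new column has no NA values is a quick way to confirm the join matched every key.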

IMPORTANT NOTE:

The report must be uploaded to the Assignment 1 section in Canvas as a PDF document with R code and outputs showing. The easiest way to achieve this is to run all R chunks first, then preview your notebook in HTML (by clicking Preview), then open it in a browser (Chrome), right-click on the report, click Print, and select Save as PDF as the destination. Upload this PDF report as one single file via the Assignment 1 page in Canvas.

DELETE the instructional text provided in the template. Failure to do this will INCREASE the SIMILARITY INDEX reported in TURNITIN. If you have any questions regarding the assignment instructions and the R template, please post them on the Canvas discussion board.