Loading the necessary packages to reproduce the report here:
library(readr) # Useful for importing data
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
This report uses the "Incident management process enriched event log" data set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/).
Abstract
This event log was extracted from data gathered from the audit system of an instance of the ServiceNow platform used by an IT company and enriched with data loaded from a relational database.
Source
Claudio Aparecido Lira do Amaral, claudio.amaral at usp.br, University of São Paulo, Brazil
Marcelo Fantinato, m.fantinato at usp.br, University of São Paulo, Brazil
Sarajane Marques Peres, sarajane at usp.br, University of São Paulo, Brazil
Read/Import the data into R, then save it as a data frame incidents.
# This is an R chunk for importing the data. Provide your R codes here:
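# The CSV file is expected in the current working directory.
# stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0, matching the str() output below.
incidents <- read.csv("incident_event_log.csv", stringsAsFactors = TRUE)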
head(incidents)
URL : https://archive.ics.uci.edu/ml/datasets/Incident+management+process+enriched+event+log
This is an event log of an incident management process extracted from data gathered from the audit system of an instance of the ServiceNow™ platform used by an IT company. The event log is enriched with data loaded from a relational database underlying a corresponding process-aware information system. Information was anonymized for privacy.
Number of instances: 141,712 events (24,918 incidents)
Number of attributes: 36 attributes (1 case identifier, 1 state identifier, 32 descriptive attributes, 2 dependent variables)
The attribute closed_at is used to determine the dependent variable for the time completion prediction task. The attribute resolved_at is highly correlated with closed_at. In this event log, some rows may have the same values (they are equal) since not all attributes involved in the real-world process are present in the log.
Attributes used to record textual information are not placed in this log.
The missing values should be considered unknown information.
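In the raw file the unknown values show up as "?" (it appears as a factor level in the str() output of this report). The following is a minimal sketch, assuming "?" is the only missing-value marker, of how read.csv() could map such values to NA at import time. The two rows in csv_text are made up for illustration and are not taken from the data set.

```r
# Two made-up rows that mimic the layout of incident_event_log.csv (illustrative only)
csv_text <- "number,caller_id,resolved_at
INC0000001,Caller 1,29/2/2016 11:29
INC0000002,?,?"

# na.strings tells read.csv() which text values to treat as missing
toy <- read.csv(text = csv_text, na.strings = "?", stringsAsFactors = FALSE)

# Count the missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

The same na.strings argument could be passed when reading the full file, after which is.na() can be used to count unknown values per attribute.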
Attribute Information:
| Attribute | Description |
|---|---|
| 1. number | incident identifier (24,918 different values) |
| 2. incident_state | eight levels controlling the incident management process transitions from opening until closing the case |
| 3. active | boolean attribute that shows whether the record is active or closed/canceled |
| 4. reassignment_count | number of times the incident has the group or the support analysts changed |
| 5. reopen_count | number of times the incident resolution was rejected by the caller |
| 6. sys_mod_count | number of incident updates until that moment |
| 7. made_sla | boolean attribute that shows whether the incident exceeded the target SLA |
| 8. caller_id | identifier of the user affected |
| 9. opened_by | identifier of the user who reported the incident |
| 10. opened_at | incident user opening date and time |
| 11. sys_created_by | identifier of the user who registered the incident |
| 12. sys_created_at | incident system creation date and time |
| 13. sys_updated_by | identifier of the user who updated the incident and generated the current log record |
| 14. sys_updated_at | incident system update date and time |
| 15. contact_type | categorical attribute that shows by what means the incident was reported |
| 16. location | identifier of the location of the place affected |
| 17. category | first-level description of the affected service |
| 18. subcategory | second-level description of the affected service (related to the first level description, i.e., to category) |
| 19. u_symptom | description of the user perception about service availability |
| 20. cmdb_ci | (configuration item) identifier used to report the affected item (not mandatory) |
| 21. impact | description of the impact caused by the incident (values: 1.High; 2.Medium; 3.Low) |
| 22. urgency | description of the urgency informed by the user for the incident resolution (values: 1.High; 2.Medium; 3.Low) |
| 23. priority | calculated by the system based on ‘impact’ and ‘urgency’ |
| 24. assignment_group | identifier of the support group in charge of the incident |
| 25. assigned_to | identifier of the user in charge of the incident |
| 26. knowledge | boolean attribute that shows whether a knowledge base document was used to resolve the incident |
| 27. u_priority_confirmation | boolean attribute that shows whether the priority field has been double-checked |
| 28. notify | categorical attribute that shows whether notifications were generated for the incident |
| 29. problem_id | identifier of the problem associated with the incident |
| 30. rfc | (request for change) identifier of the change request associated with the incident |
| 31. vendor | identifier of the vendor in charge of the incident |
| 32. caused_by | identifier of the RFC responsible for the incident |
| 33. closed_code | identifier of the resolution of the incident |
| 34. resolved_by | identifier of the user who resolved the incident |
| 35. resolved_at | incident user resolution date and time (dependent variable) |
| 36. closed_at | incident user close date and time (dependent variable) |
dim(incidents)
## [1] 141712 36
str(incidents)
## 'data.frame': 141712 obs. of 36 variables:
## $ number : Factor w/ 24918 levels "INC0000045","INC0000047",..: 1 1 1 1 2 2 2 2 2 2 ...
## $ incident_state : Factor w/ 9 levels "-100","Active",..: 8 9 9 7 8 2 2 2 2 2 ...
## $ active : Factor w/ 2 levels "false","true": 2 2 2 1 2 2 2 2 2 2 ...
## $ reassignment_count : int 0 0 0 0 0 1 1 1 1 1 ...
## $ reopen_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sys_mod_count : int 0 2 3 4 0 1 2 3 4 5 ...
## $ made_sla : Factor w/ 2 levels "false","true": 2 2 2 2 2 2 2 2 2 2 ...
## $ caller_id : Factor w/ 5245 levels "?","Caller 10",..: 1464 1464 1464 1464 1464 1464 1464 1464 1464 1464 ...
## $ opened_by : Factor w/ 208 levels "?","Opened by 10",..: 202 202 202 202 122 122 122 122 122 122 ...
## $ opened_at : Factor w/ 19849 levels "1/1/2017 01:14",..: 12991 12991 12991 12991 12992 12992 12992 12992 12992 12992 ...
## $ sys_created_by : Factor w/ 186 levels "?","Created by 1",..: 153 153 153 153 60 60 60 60 60 60 ...
## $ sys_created_at : Factor w/ 11553 levels "?","1/1/2017 02:15",..: 7418 7418 7418 7418 7419 7419 7419 7419 7419 7419 ...
## $ sys_updated_by : Factor w/ 846 levels "Updated by 1",..: 105 510 659 763 606 105 105 659 565 222 ...
## $ sys_updated_at : Factor w/ 50664 levels "1/1/2017 01:14",..: 33113 33151 33242 41839 33114 33115 33116 33243 33244 176 ...
## $ contact_type : Factor w/ 5 levels "Direct opening",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ location : Factor w/ 225 levels "?","Location 10",..: 45 45 45 45 64 64 64 64 64 64 ...
## $ category : Factor w/ 59 levels "?","Category 10",..: 48 48 48 48 32 32 32 32 32 32 ...
## $ subcategory : Factor w/ 255 levels "?","Subcategory 10",..: 71 71 71 71 114 114 114 114 114 114 ...
## $ u_symptom : Factor w/ 526 levels "?","Symptom 10",..: 503 503 503 503 354 354 354 354 354 354 ...
## $ cmdb_ci : Factor w/ 51 levels "?","cmdb_ci 10",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ impact : Factor w/ 3 levels "1 - High","2 - Medium",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ urgency : Factor w/ 3 levels "1 - High","2 - Medium",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ priority : Factor w/ 4 levels "1 - Critical",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ assignment_group : Factor w/ 79 levels "?","Group 10",..: 49 49 49 49 65 17 17 17 17 17 ...
## $ assigned_to : Factor w/ 235 levels "?","Resolver 10",..: 1 1 1 1 225 169 169 169 169 169 ...
## $ knowledge : Factor w/ 2 levels "false","true": 2 2 2 2 2 2 2 2 2 2 ...
## $ u_priority_confirmation: Factor w/ 2 levels "false","true": 1 1 1 1 1 1 1 1 1 1 ...
## $ notify : Factor w/ 2 levels "Do Not Notify",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ problem_id : Factor w/ 253 levels "?","Problem ID 10",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ rfc : Factor w/ 182 levels "?","CHG0000047",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ vendor : Factor w/ 5 levels "?","code 8s",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ caused_by : Factor w/ 4 levels "?","CHG0000097",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ closed_code : Factor w/ 18 levels "?","code 1","code 10",..: 14 14 14 14 14 14 14 14 14 14 ...
## $ resolved_by : Factor w/ 217 levels "?","Resolved by 10",..: 53 53 53 53 198 198 198 198 198 198 ...
## $ resolved_at : Factor w/ 18506 levels "?","1/1/2017 01:17",..: 12246 12246 12246 12246 42 42 42 42 42 42 ...
## $ closed_at : Factor w/ 2707 levels "1/11/2016 15:00",..: 2210 2210 2210 2210 2318 2318 2318 2318 2318 2318 ...
# active, made_sla, knowledge and u_priority_confirmation hold true/false values, so convert them from factor to logical
levels(incidents$active) <- c(FALSE,TRUE)
levels(incidents$made_sla) <- c(FALSE,TRUE)
levels(incidents$knowledge) <- c(FALSE,TRUE)
levels(incidents$u_priority_confirmation) <- c(FALSE,TRUE)
incidents$active <- as.logical(incidents$active)
incidents$made_sla <- as.logical(incidents$made_sla)
incidents$knowledge <- as.logical(incidents$knowledge)
incidents$u_priority_confirmation <- as.logical(incidents$u_priority_confirmation)
str(incidents$active)
## logi [1:141712] TRUE TRUE TRUE FALSE TRUE TRUE ...
str(incidents$made_sla)
## logi [1:141712] TRUE TRUE TRUE TRUE TRUE TRUE ...
str(incidents$knowledge)
## logi [1:141712] TRUE TRUE TRUE TRUE TRUE TRUE ...
str(incidents$u_priority_confirmation)
## logi [1:141712] FALSE FALSE FALSE FALSE FALSE FALSE ...
# number and the other identifier and timestamp columns hold text, so convert them from factor to character
incidents$number <- as.character(incidents$number)
incidents$caller_id <- as.character(incidents$caller_id)
incidents$opened_by <- as.character(incidents$opened_by)
incidents$opened_at <- as.character(incidents$opened_at)
incidents$sys_created_by <- as.character(incidents$sys_created_by)
incidents$sys_created_at <- as.character(incidents$sys_created_at)
incidents$sys_updated_by <- as.character(incidents$sys_updated_by)
incidents$sys_updated_at <- as.character(incidents$sys_updated_at)
incidents$location <- as.character(incidents$location)
incidents$u_symptom <- as.character(incidents$u_symptom)
incidents$cmdb_ci <- as.character(incidents$cmdb_ci)
incidents$problem_id <- as.character(incidents$problem_id)
incidents$rfc <- as.character(incidents$rfc)
incidents$vendor <- as.character(incidents$vendor)
incidents$caused_by <- as.character(incidents$caused_by)
incidents$resolved_by <- as.character(incidents$resolved_by)
incidents$resolved_at <- as.character(incidents$resolved_at)
incidents$closed_at <- as.character(incidents$closed_at)
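The date and time columns (opened_at, resolved_at, closed_at and so on) are still plain text after this conversion. The following is a hedged sketch of how such text could be parsed into POSIXct values. The day/month/year order and the UTC time zone are assumptions that should be checked against the data, and the two timestamps are illustrative.

```r
# Illustrative timestamps in the same text layout as the log
opened_at   <- "29/2/2016 01:16"
resolved_at <- "29/2/2016 11:29"

# Assumed layout: day/month/year hour:minute; tz = "UTC" is an arbitrary choice
fmt <- "%d/%m/%Y %H:%M"
opened   <- as.POSIXct(opened_at,   format = fmt, tz = "UTC")
resolved <- as.POSIXct(resolved_at, format = fmt, tz = "UTC")

# Elapsed time between opening and resolution, in minutes
minutes_to_resolve <- as.numeric(difftime(resolved, opened, units = "mins"))
print(minutes_to_resolve)
```

If the format string does not match the text, as.POSIXct() returns NA, so counting NA values after parsing is a quick way to test the assumed layout.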
levels(incidents$incident_state)
## [1] "-100" "Active" "Awaiting Evidence"
## [4] "Awaiting Problem" "Awaiting User Info" "Awaiting Vendor"
## [7] "Closed" "New" "Resolved"
#invisible(revalue(incidents$incident_state,c("-100" = "Undefined")))
levels(incidents$incident_state)[levels(incidents$incident_state)=="-100"] <- "Undefined"
levels(incidents$incident_state)
## [1] "Undefined" "Active" "Awaiting Evidence"
## [4] "Awaiting Problem" "Awaiting User Info" "Awaiting Vendor"
## [7] "Closed" "New" "Resolved"
names(incidents)[names(incidents)=="number"] <- "incident_id"
names(incidents)
## [1] "incident_id" "incident_state"
## [3] "active" "reassignment_count"
## [5] "reopen_count" "sys_mod_count"
## [7] "made_sla" "caller_id"
## [9] "opened_by" "opened_at"
## [11] "sys_created_by" "sys_created_at"
## [13] "sys_updated_by" "sys_updated_at"
## [15] "contact_type" "location"
## [17] "category" "subcategory"
## [19] "u_symptom" "cmdb_ci"
## [21] "impact" "urgency"
## [23] "priority" "assignment_group"
## [25] "assigned_to" "knowledge"
## [27] "u_priority_confirmation" "notify"
## [29] "problem_id" "rfc"
## [31] "vendor" "caused_by"
## [33] "closed_code" "resolved_by"
## [35] "resolved_at" "closed_at"
Provide your R codes with outputs and explain everything that you do in this step.
Check whether the data conforms to the tidy data principles. If your data is untidy, reshape your data into a tidy format. If the data is in a tidy format, you will be expected to explain why the data is originally ‘tidy’.
The data set is already tidy, so no reshaping is needed. Each variable has its own column, and no variable is spread across several columns. Each observation (one logged event of an incident) has its own row. Each value has its own cell.
# This is a chunk where you check whether the data conforms to the tidy data principles and reshape your data into a tidy format.
Provide summary statistics (mean, median, minimum, maximum, standard deviation) of numeric variables grouped by one of the qualitative (categorical) variable. For example, if your categorical variable is age groups and quantitative variable is income, provide summary statistics of income grouped by the age groups.
# This is a chunk where you provide summary statistics
# Summarise the number of updates (sys_mod_count) for each incident state
incidents %>%
  group_by(incident_state) %>%
  summarize(Mean = mean(sys_mod_count),
            Median = median(sys_mod_count),
            Minimum = min(sys_mod_count),
            Maximum = max(sys_mod_count),
            SD = sd(sys_mod_count))
Create a list that contains a numeric value for each response to the categorical variable. Typically, they are numbered from 1-n.
# This is a chunk where you create a list
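One possible way to build such a list is sketched below. It uses a small made-up incident_state vector rather than the full data, and the name state_codes is hypothetical. The codes simply follow the alphabetical order of the factor levels.

```r
# Made-up categorical variable standing in for incidents$incident_state
incident_state <- factor(c("New", "Active", "Active", "Resolved", "Closed"))

# Factor levels are sorted alphabetically by default
state_levels <- levels(incident_state)

# Named list: one numeric code (1..n) per level
state_codes <- as.list(setNames(seq_along(state_levels), state_levels))
str(state_codes)
```

Individual codes can then be looked up by name, for example state_codes[["Active"]].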
Join this list on using a join of your choice. Remember that this has to keep the numeric variable, as well as matching to your categorical variable.
# This is a chunk where you join the list
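A minimal sketch of one such join follows, using base R merge() as a left join on a made-up data frame. The names toy, state_codes and joined are hypothetical. The lookup is stored as a two-column data frame so that it can be joined directly.

```r
# Made-up data: one categorical and one numeric variable
toy <- data.frame(
  incident_state = c("New", "Active", "Active", "Resolved", "Closed"),
  sys_mod_count  = c(0L, 1L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)

# Lookup table: one numeric code (1..n) per distinct state
state_levels <- sort(unique(toy$incident_state))
state_codes <- data.frame(
  incident_state = state_levels,
  state_code     = seq_along(state_levels),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of toy (a left join), so sys_mod_count is preserved
joined <- merge(toy, state_codes, by = "incident_state", all.x = TRUE)
print(joined)
```

Note that merge() sorts the result by the key column by default, so the row order may differ from the input. dplyr's left_join() would be an alternative that keeps the original order.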
Subset the data frame using first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure.
# This is a chunk to subset your data and convert it to a matrix
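The step could look roughly like the sketch below, shown on a made-up data frame with mixed column types (the names toy, first_ten and m are hypothetical). A matrix can hold only one data type, so when at least one column is character, as.matrix() coerces every column to character.

```r
# Made-up data frame with character, logical and integer columns
toy <- data.frame(
  incident_id   = sprintf("INC%07d", 1:12),
  active        = rep(c(TRUE, FALSE), 6),
  sys_mod_count = 0:11,
  stringsAsFactors = FALSE
)

# First 10 observations, all variables
first_ten <- head(toy, 10)

# Convert to a matrix and inspect its structure
m <- as.matrix(first_ten)
str(m)
print(typeof(m))
```

Applied to the incidents data frame, the same reasoning suggests a character matrix, because the data contains text columns alongside the integer and logical ones.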
Subset the data frame including only first and the last variable in the data set, save it as an R object file (.RData). Provide the R codes with outputs and explain everything that you do in this step.
# This is a chunk to subset your data and convert it to an R object file
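A hedged sketch of this step is given below on a made-up data frame. The object name first_last and the file name are hypothetical, and the file is written to a temporary directory so the example does not touch the working directory.

```r
# Made-up data frame standing in for incidents
toy <- data.frame(
  incident_id   = c("INC0000001", "INC0000002"),
  sys_mod_count = c(0L, 3L),
  closed_at     = c("5/3/2016 12:00", "6/3/2016 10:00"),
  stringsAsFactors = FALSE
)

# Keep only the first and the last variable
first_last <- toy[, c(1, ncol(toy))]

# Save the object to an .RData file
rdata_path <- file.path(tempdir(), "incidents_first_last.RData")
save(first_last, file = rdata_path)

# Load it back into a separate environment to confirm the round trip
restored <- new.env()
load(rdata_path, envir = restored)
print(identical(restored$first_last, first_last))
```

Indexing with c(1, ncol(toy)) avoids hard-coding the position of the last column.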
Create a data frame with 2 variables. Your data frame has to contain one integer variable and one ordinal variable.
The ordinal variable has to be a factor and ordered properly. Make sure you name your variables.
Show the structure of your variables and the levels of the ordinal variable.
Create another numeric vector and use cbind() to add this vector to your data frame.
After this step you should have 3 variables in the data frame.
Check the attributes and the dimension of your new data frame.
Provide the R codes with outputs and explain everything that you do in this step.
# This is a chunk to create a new data frame with the given specifications
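The sketch below shows one way these steps might be carried out. All names and values (tickets, reopen_count, priority, hours_open) are invented for illustration.

```r
# Data frame with one integer variable and one ordinal variable
tickets <- data.frame(
  reopen_count = c(0L, 2L, 1L, 0L),
  priority = factor(c("Low", "High", "Medium", "Low"),
                    levels = c("Low", "Medium", "High"),
                    ordered = TRUE)
)

# Structure of the variables and levels of the ordinal variable
str(tickets)
print(levels(tickets$priority))

# Another numeric vector, added as a third column with cbind()
hours_open <- c(3.5, 12.0, 7.25, 1.0)
tickets <- cbind(tickets, hours_open)

# Attributes and dimension of the new data frame
print(attributes(tickets))
print(dim(tickets))
```

Passing levels in the intended order together with ordered = TRUE is what makes the factor ordinal. Without it, the levels would be sorted alphabetically and treated as unordered.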
Create another data frame with a common variable to the dataset created in step 11.
Join the data frame to the dataset above, and ensure that the dataset is joined properly.
Ensure that the new categorical variable is carried over to the larger dataset. E.g., a dataset to join could be State, Abbreviation, Municipality, Prevailing Religion.
Provide the R codes with outputs and explain everything that you do in this step.
# This is a chunk to create another data frame with the given specifications
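As a rough illustration, the sketch below builds a small hypothetical data frame (tickets), creates a second one (priority_info) that shares the priority variable, and joins them with base R merge(). All names and values are invented.

```r
# Small data frame with an integer and an ordinal variable
tickets <- data.frame(
  reopen_count = c(0L, 2L, 1L, 0L),
  priority = factor(c("Low", "High", "Medium", "Low"),
                    levels = c("Low", "Medium", "High"),
                    ordered = TRUE)
)

# Second data frame sharing the priority variable and adding a new categorical variable
priority_info <- data.frame(
  priority = factor(c("Low", "Medium", "High"),
                    levels = c("Low", "Medium", "High"),
                    ordered = TRUE),
  response_team = c("Service desk", "Second line", "On-call engineer"),
  stringsAsFactors = FALSE
)

# Left join: keep every ticket and attach its response_team
tickets_joined <- merge(tickets, priority_info, by = "priority", all.x = TRUE)
print(tickets_joined)

# A proper join leaves the row count unchanged and no response_team missing
print(nrow(tickets_joined) == nrow(tickets))
print(anyNA(tickets_joined$response_team))
```

Comparing the row counts before and after, and checking for NA values in the new column, is a simple way to confirm that every key found a match.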